[MUSIC PLAYING]

On behalf of the MIT summer session, let me indicate how happy we are to be participating in this edition of the University Video Communications Distinguished Lecture Series. Our speaker tonight is Danny Hillis, a founder and chief guru of Thinking Machines Corporation. He received his PhD from MIT in 1988, but in 1981, he was already becoming interested in the physical limitations of computation and the possibility of constructing highly parallel computers. This work culminated in 1985 with the design of the Connection Machine, which became the foundation of his very successful company. His topic tonight is Connection Machine architectures. Danny--

I'm interested in designing computers for the very largest problems. And I got interested in this kind of computer design originally as a user, because I had some problems that I was studying in artificial intelligence that required much, much faster computers than I could buy -- like 100,000 times faster. And it didn't look like anybody else was going to build them, and so I got interested in the problem of building them. And as it turned out, fortunately, a lot of other people were interested in such computers too. A typical application is -- for example, in designing an airplane, people want to simulate the shape of the airplane wing and the air flowing over it so that they can figure out how the airplane is going to fly before they build it and before they have to test it in a wind tunnel. Or another example is a scientist might use it to model a system that they have no hope of really physically modeling -- for example, the formation of a galaxy or the folding of a protein. And all of these problems require a tremendous amount of computation -- much, much more than we can really do today to do the problems accurately. So it's great for a computer designer, because as soon as you build a faster machine, the users of this kind of problem want a faster one.
And there's really no end in sight for that. They seem to want computers that are certainly thousands of times faster than what we can give them today. Now, another great thing about this set of big users is, because they're solving problems that people haven't solved before, they're generally writing their programs from scratch. And they're very often willing to write their programs in a different way to run them on your machine, which is very important, because it frees up what's one of the biggest constraints for a normal computer architect, which is to run existing programs. So the computers that I'll be talking about today, which are called data parallel computers, use a somewhat different formalism in programming them, some very different primitives, different programming languages, and things like that. But the users are willing to adopt that different formalism, because that's the only way that they can solve their problem. And in fact, as it turns out, it's a very natural formalism for a wide class of problems. So what I'm going to be talking about today is this kind of architecture, which is called a data parallel architecture, of which the Connection Machine is an example. And I will, I hope, convince you that this is the most natural kind of architecture for very large problems. I'm certainly convinced of that. I'm going to start by explaining why conventional Von Neumann machines are a poor fit to very large problems, and why data parallel machines are a much more natural approach. I'm then going to talk about some of the design issues in designing data parallel machines, and specifically, how do you achieve a balance between processing, communications, and I/O, which turns out to be some of the hardest engineering issues. And then finally, I'm going to talk just a little bit about where we stand today and where this technology is likely to lead in the future.
So let me start by talking about Von Neumann machines, which are the kind of machines that we're all used to, and represent almost all of the computers in the world today. The basic idea of a Von Neumann machine is that you have two pieces. One of them is a processor, which operates on data -- does arithmetic, for example -- and the other one is the memory, which holds data and doesn't operate on it. So to operate on data, which is stored in the memory, you move it across, you add two things together, you put it back in the memory, and so on. And typically, you move one word or a few words across at a time. This is the single operation model that's sort of implicit in all of the programming languages that most of you are used to programming in. Now, from an engineering standpoint, the Von Neumann machine was a solution to a technological opportunity -- or a technological problem, depending on how you look at it -- which is that, when those machines were designed, the switching components were very, very fast and very expensive, and the memory components were relatively slow and relatively cheap. So it made sense to build a machine that, on one side of the room, had these big slow things like mercury [INAUDIBLE] and on the other side of the room had these very fast expensive things, like vacuum tubes, or relays, or something like that. And the vacuum tubes switched lots and lots of times for every memory reference, and it worked out very well. In fact, it worked out so well that, as we improved the components in that kind of machine, we kept that same basic design. And so if you look at a modern one-chip computer, you'll see on one side of the chip is the memory, and on the other side of the chip, the switching elements, which are the processors -- only today, it's all made of the same kind of stuff, so it doesn't quite make sense to separate them.
And in fact, there's a tremendous inefficiency in separating them out, which comes from the use of silicon and the activity of the silicon. So if I take a very, very large machine -- remember, those are the ones I'm interested in -- like a Cray or a big [INAUDIBLE], or something like that, or a Connection Machine -- and look at what's inside of it, it's all made of silicon chips. And if I lay out the silicon chips, it probably has something on the order of a square meter of silicon in it. Actually, these days, it's going to be a little bit more -- but something like that. So imagine that I laid out all the silicon and looked at it like that. Now, I would notice that these days, about this much of it is in the processor itself and about this much of it is in the memory system -- mostly in the bits of the memory, and the memory decoding, and things like that. So if I were able to look at this under some kind of imaginary microscope that showed the activity of the chips and showed where useful things were happening, I would see that, when this machine was operating, this part of it was very, very busy most of the time and this part of it would have one memory location being referenced at once. Or maybe, if it was a [INAUDIBLE] pipeline machine, it might have four, or five, or something like that. Well, that's a terribly inefficient design, because, in effect, we're optimizing the activity of this and leaving all this silicon sitting there doing nothing, just holding bits. And in fact, the time that the machine takes is mostly information traveling from here to there and back again, which is what Backus called the Von Neumann bottleneck. The terrible thing about this inefficiency is the bigger that you build a computer, the worse it gets, because the easiest way to build it bigger is for the memory system to become larger, and this proportion of processing to memory gets even more and more out of whack.
And that waiting for data to move back and forth to and from the memory becomes more and more the bottleneck of the machine. So it's tolerable when you build a little machine, which has a closer balance of memory and processing, but it becomes kind of intolerable when you build a big machine for these kinds of problems. Now, obviously, the answer is you want to put in more processing somehow. And the general answer is, somehow, you want to distribute that processing through the memory, and you want to mix up in here processors and memory. And the trick is, how do you do that efficiently? How does it change the programming model, and how do you get it to work? So the Von Neumann machine is basically no good for the big problems, and the potential answer is parallelism -- getting a lot of things to happen at once. And that will get the efficiency of your machine up. Now, just because parallelism is the only answer doesn't mean it's an answer. And it could be that the problems, in fact, don't allow for parallelism. So that's the real question -- is there something fundamental about our problems that requires things to be done one at a time, or can they, in fact, be done in parallel all at once? And if they can be done in parallel, where is the big source of the parallelism? Now, if I take a physical simulation example, it's sort of obvious where the parallelism is -- because, for example, imagine that airplane -- you're flowing air over the wing. Well, when the airplane's flying, the real air is computing that in parallel. The air over here is deciding what to do completely independently of the air over here deciding what to do. And so in some sense, the physical situation is fundamentally parallel, but the question is, is that reflected in the algorithms? And it turns out that people solve that same problem by a variety of different algorithms.
For example, one kind of algorithm -- they do it just like the physics does it -- they model a whole bunch of particles and model the interaction of the particles as they bump into the wing and bump into each other. Those are called either particle algorithms or cellular automata algorithms. Another completely different kind of algorithm is you model the space as a grid of cells, and you model the behavior of each of the cells by a set of equations that says how the air flows through the cell. Another kind of algorithm is you transform that physical problem into a different problem, where you have to solve a system of equations. And you solve those equations, and then you transform the solutions back to the physical problem. So there are a whole lot of different algorithms for solving that same physical problem. It turns out every single one of them has a tremendous amount of parallelism. And the good news is [INAUDIBLE] parallelism that grows with the size of the problem. So the bigger the problem, the more parallelism there is. So that looks good for the side of parallelism. It looks like maybe it's fundamentally there, when you look at the algorithms. But there used to be this thing that would bother people about that argument, which was put most crisply by Gene Amdahl in Amdahl's law. And he made the argument that parallel computers will never really help that much. They'll never get you more than about a factor of 10. And here was the argument. Imagine that this line represents the total amount of computation to be done. Now, imagine that some of it is fundamentally sequential and has to be done a bit at a time, and the rest of it is parallel and can be done all at once. So let's be generous and assume 90% of it is parallel and only 10% of it is sequential.
This would represent how much time it would take on a sequential machine, because we'll do all of it in sequence. But imagine I put 10 processors to work on that same problem. Well, I get to divide this part of the problem by 10, so this becomes a tenth the size. And now the parallel and the sequential parts take about the same time, because the 10 processors don't help me on the sequential part. So I've made the problem a lot faster -- almost 10 times as fast. But now let's say I put 100 processors on the same problem. So I now divide this by 100. Well, if I divide that by 100, then I get something which is almost entirely dominated by the sequential part, so I'm getting diminishing returns now as I put more processors on it. So this is a very compelling argument, [INAUDIBLE] very intuitive, that parallelism -- even if it works on a lot of the problem, as long as there's a sequential part of the problem there, the machines are going to become increasingly inefficient. It used to bother me a lot, but then I started noticing, in my problem, it didn't really seem to come up. In my application, it didn't really seem to come up, and I started noticing, in a lot of other people's applications, it didn't come up. I think a lot of people have noticed this these days -- that, in fact, Amdahl's law doesn't seem to be hurting us. And why not? I believe that the best answer is -- I'm going to give a version of an answer that was published by [INAUDIBLE], and it has to do with scaling. The answer is people are not trying to solve the same problem on the parallel machine that they solved on the serial machine. They don't want to solve the same problem faster. They want to solve a much bigger problem in the same amount of time.
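For concreteness, here is the standard Amdahl's-law arithmetic behind the 90%/10% picture drawn above (an illustrative sketch; the split is the one assumed in the talk, the code is not from it):

```python
# Amdahl's law for a fixed-size problem: a fraction p of the work can be
# done in parallel; the remaining (1 - p) stays sequential.
def amdahl_speedup(p, processors):
    return 1.0 / ((1.0 - p) + p / processors)

p = 0.90  # the generous 90% parallel / 10% sequential split assumed above
for n in (10, 100, 1000):
    print(f"{n:4d} processors -> speedup {amdahl_speedup(p, n):.1f}x")
# 10 -> 5.3x, 100 -> 9.2x, 1000 -> 9.9x: with a fixed 10% sequential part,
# no number of processors ever gets past 10x -- the "factor of 10" ceiling.
```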
If we break up the parallel processing part of the problem -- take something like that airflow -- if we look at an algorithm for airflow, we'll find that for some parts of the airflow problem, the amount of computation we have to do is independent of the size of the problem. So let me call that, using order notation, order 1. It's some constant part of the problem. Then there's a part of the computation that, if I double the linear dimension -- if I go to twice the resolution -- doubles. So if I parameterize the problem in terms of the resolution -- the number of grid points along a dimension, for example -- and I call that n, then there's a part of the problem that grows as order n. Then there's another part of the problem that depends on the boundary conditions and so on that grows as order n squared, and so on. And in fact, there's a big chunk of the problem, which is the relaxation of the interior of the problem, where not only does the number of points grow as n cubed, but the amount of time that you have to relax it grows as another factor of n. So this keeps on going up, and there's a part of the problem that grows as n to the fourth. So we've got a lot of different parts of the problem like this. Well, what happens if somebody comes along and says, I want a problem that's twice as big, I want twice the resolution? Well, this part of the problem doesn't change at all. This part doubles, this quadruples, and so on, and this multiplies by 16. I end up with a much bigger problem, which goes right off the top of the board. But if I renormalize that and draw it again, it looks like almost all of it is the n-to-the-fourth part of the problem. And this other stuff is tiny. And if I double it yet again, I can make that even more extreme.
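A toy version of that scaling argument (the constants below are invented purely for illustration; only the growth rates come from the talk) shows how the high-order, parallelizable work comes to dominate:

```python
# Work splits into terms of different order in the resolution n.
def work_terms(n, c0=1e6, c1=1e4, c2=1e2, c4=1.0):
    return {"O(1)": c0, "O(n)": c1 * n, "O(n^2)": c2 * n**2, "O(n^4)": c4 * n**4}

for n in (32, 64, 128, 256):
    terms = work_terms(n)
    share = terms["O(n^4)"] / sum(terms.values())
    print(f"n = {n:3d}: O(n^4) relaxation is {share:.0%} of the total work")
# Doubling the resolution multiplies the O(n^4) term by 16 while the
# low-order terms barely move, so the parallel fraction keeps climbing.
```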
So it turns out, in practice, when you look at this -- and I'm not certain why this should be true, and I'm certainly not certain it's always true -- but in practice, these higher order terms are always things that can be done in parallel. So in fact, it's not like there's a fixed percentage of the problem that can be done in parallel; in fact, the percentage grows as the problem gets bigger. Since I'm interested in the biggest problems, I can always pick a problem that's big enough where, in fact, that's not a factor. In fact, the customers pick that problem, because the customers will pick a problem where it's big enough that that's not a factor. So as it turns out, Amdahl's law doesn't end up biting us -- I think much to everybody's surprise. I was even surprised, and I'm an optimist. So I guess the lesson to learn from this is that the bigger the problem is, the bigger the opportunity for parallelism. This is one of the differences between designing a machine for the center of the market -- the workstation on your desk -- versus designing a machine for these super applications, because they're very skewed in the types of opportunities for parallelism that you have. The problems that I'm designing for typically have the opportunity for parallelism on the order of hundreds of thousands, or millions, of things happening at once, and very, very little of the problem is in this fundamentally sequential part. OK, so the obvious answer then is I can use lots of processors. VLSI technology lets us stamp out processors relatively inexpensively, and we can interleave them with memory, and we can build a machine that's all mixed up processors and memory. So what are the design issues for designing a machine like that so that it works efficiently? Well, first of all, notice that the big source of the parallelism is in the data. It's because the data is getting bigger.
The size of the problem I keep talking about is really a data size. And so the big opportunity is somehow operating on all of that data at once. And that's also very different from a lot of other kinds of parallel machines. A much more normal way of getting parallelism in a machine is to split up the program into several programs which run at once, each of which operates on its own data, or some shared data, or something like that. But the normal effort involved in making a parallel program is an effort of splitting up the program and looking for control parallelism. And basically, I think that that's a perfectly good kind of effort, and I think it makes sense, and it probably gets you factors of eight or something like that, but it won't get you factors of 100,000, because you can't break up a 10,000-line program into 100,000 things running all at once that are all different. In fact, the data is growing much, much faster than the program. The programs are pretty much fixed in length. So the opportunity for the parallelism is, in fact, in the loops of that program. And so, for example, take the airplane wing example, the airflow example. A sequential computer has a big loop that calculates what the air does here, then calculates what the air does here, then calculates what the air does here, and so on, and so on. And after it finishes that whole line, it starts the next line, and the next line, and the next line, and so on -- whereas the physical fluid calculates all of that simultaneously. So the big opportunity is in calculating all of that simultaneously. So I'm not so much interested in programs that can be split so that you can run several programs at once. I'm interested in really just running one program, but what I want to do in that one program is take out the loop and do all of the iterations of the loop at once.
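Schematically, the shift looks like this (a NumPy sketch on an ordinary serial machine, just to show the shape of the idea; it is not the Connection Machine's own programming language):

```python
import numpy as np

grid = np.random.rand(512, 512)        # one field value per grid point

# Sequential style: a big loop that visits each interior point in turn.
new = grid.copy()
for i in range(1, 511):
    for j in range(1, 511):
        new[i, j] = 0.25 * (grid[i - 1, j] + grid[i + 1, j] +
                            grid[i, j - 1] + grid[i, j + 1])

# Data-parallel style: the loop is gone -- the same update is written as
# one operation over the whole array, conceptually every point at once.
new2 = grid.copy()
new2[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                           grid[1:-1, :-2] + grid[1:-1, 2:])

assert np.allclose(new, new2)
```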
That turns out to have some great advantages for programming, but I'm going to concentrate mostly on the hardware right now. It also has some nice advantages for hardware. So when I do that, that's what I call a data parallel machine. It's a little bit of a different issue than the one you hear people talk about -- single instruction, multiple data, versus multiple instruction, multiple data. It's a related issue, but it's a little bit different, because really, SIMD versus MIMD is kind of an implementation issue: once you've picked a programming model, you figure out what's the cheapest way to implement that model. This is really an issue of what programming model you want to support. And in fact, then you have a choice of whether you want to put some of the control down in the processors or keep it centralized. And that's a kind of engineering issue that just depends on the relative cost of wires and silicon, and it probably changes from year to year. I think a lot of people expect me to defend SIMD machines just because I designed machines that way. I think that was the right way to design machines in 1990, and it maybe won't be in 1995, and maybe it will be again in 2000. It's a small issue. That's not really what I believe in. What I believe in is presenting this data parallel model to the user, and doing that in the most efficient way. So how do you do that? Well, the most obvious issue is arithmetic and processing. And that also turns out to be about the easiest problem to solve. So, for example, the CM2 Connection Machine has two kinds of arithmetic. It has 64,000 bit-serial arithmetic units that are distributed through the memory, and it has 2,000 floating point units, which operate on 64-bit words and are also distributed through the memory. And it has some 32-bit data paths for doing indirect addressing and referencing data.
So it's a mixture of different kinds of arithmetic units that are sprinkled through the memory, evenly distributed to prevent moving data around any more than you have to. And of course, the big issue is not so much building all these [INAUDIBLE], but building a memory system to support them that has sufficient memory bandwidth to keep them busy. Of course, that works out very naturally in a parallel machine, because you put each one of these ALUs right next to its own little piece of memory, and so you can read out 64,000 memory bits in parallel and get a tremendous amount of memory bandwidth to keep all these ALUs busy. So it turns out that the peak floating point performance -- it's a good way of measuring it -- of the machine is on the order of something like 20 gigaflops, which is 20,000 megaflops -- millions of floating point operations per second. So that's fast, but that's not really a very relevant number, because who cares what the megaflop number is? It turns out that, if I look at this problem, only about a third of the problem is, in fact, doing that kind of operation. What is the rest of the time spent doing? Well, the rest of the time is spent doing two other major things. One of them is moving data around in communications, and the other one is input/output -- things like writing to disks. So I guess one of the points that I'd like to make is that, if you're going to achieve the kind of speedup that the revised Amdahl's law promises, you have to do all of the things that you can do in parallel in parallel. So if I only did the floating point operations in parallel, then I'm going to get killed by Amdahl's law again, because I can only speed it up by 30%, no matter how fast I can do those.
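The same Amdahl arithmetic as before makes that concrete (a sketch taking "about a third" literally, not figures from the talk):

```python
# If only the floating-point third of the run time is parallelized, the
# total speedup is capped no matter how fast the arithmetic becomes.
fp_fraction = 1.0 / 3.0
bound = 1.0 / (1.0 - fp_fraction)   # arithmetic time driven to zero
print(f"best possible overall speedup: {bound:.1f}x")   # about 1.5x
# Hence communication and I/O -- the other two thirds -- have to be done
# in parallel as well.
```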
And in fact, one of the critical things that you have to do, which is much harder than the floating point, is the data motion -- because it's all very nice for the hardware designer to say, I'm going to divide up the memory into lots of little memories, each of which talks to its own processor, but unfortunately, the algorithm doesn't work that way. In the algorithm, data has to move all over -- data here has to be added to data here, and so on. So it goes in arbitrary patterns. Now, sometimes those patterns have some locality and some regularity. So, for example, if I solve the fluid flow problem over the airplane by dividing up the space into a nice, regular three-dimensional grid, then that would work very nicely and neatly if I connected up my processors into a nice three-dimensional grid -- as long as it was the same shape as whatever I wanted to solve for that particular airplane. In fact, there's a lot of parallel machines that work that way, and they work very well on that class of problems. The problem is that most of the algorithms are not that neat. And in fact, if I take the implicit algorithms, where basically I'm working with systems of equations, those have communications patterns that are kind of all their own and don't map directly into this three-dimensional space. Or, for example, I might use an algorithm called the free Lagrange method, where I model each chunk of stuff as it flows around. In effect, I'm re-drawing that grid at every time cycle, depending on where the stuff that's flowing is. And there, I really can't even decide beforehand, until I run the problem, what my communications pattern is going to be. So I need a way of allowing the processors to communicate arbitrarily. And that turns out to be one of the most difficult technical problems in the design of the machine. And there are a lot of different ways of doing it, but typically, what you do is you build a physical interconnection network.
On the CM2, the physical interconnection network is that every 16 processors are connected by a crossbar switch so they can all talk to each other. So that's a reasonable number. And then every one of those clusters of 16 processors is on the corner of a hypercube, and they all talk to each other. So that means they each have neighbors in hyperspace. It's a particular way of wiring them. I won't stop and explain what a hypercube is. It doesn't matter. The point is the programmer never really knows what a hypercube is. And I might design it differently next year, and you shouldn't care -- because basically, the hardware engineer comes up with some neat wiring pattern that they've convinced themselves is the best in the world, and then the programmer wants to see a model of totally arbitrary communications. So the programmer just wants to say, this processor, get that memory location. How do you do that? Well, in the Connection Machine, the way that you do that is, on top of that physical pattern, you have a bunch of autonomous switches that route messages through that network. And there are different ways of doing that. One way is circuit switching, kind of like the telephone system works, where you call up and you establish a connection between two things that want to talk to each other. Another one is packet switching, like local area networks work, where you put together a bunch of bits with an address on them and steer them to the right place. Well, it turns out that the Connection Machine is sort of a combination of those two methods. It actually sends out little packets with addresses on them, but it starts sending the beginning of them before it's got the end of them. So they sort of thread through the network like little snakes. And in fact, sometimes the beginning of one is actually being delivered before the end of it has even been sent yet, so they're sort of snaking through. That method of routing, which is used in the Connection Machine and several other machines these days, is called wormhole routing.
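The talk doesn't give formulas, but the standard first-order latency model for the two disciplines (with made-up numbers) shows why letting the head of a message advance before its tail has arrived pays off once messages cross many switches:

```python
# Rough textbook latency models -- illustrative only, not CM-2 figures.
def store_and_forward(hops, msg_bits, bits_per_sec):
    # each switch waits for the whole message before passing it on
    return hops * msg_bits / bits_per_sec

def wormhole(hops, msg_bits, header_bits, bits_per_sec):
    # only the small header pays the per-hop cost; the body streams behind it
    return hops * header_bits / bits_per_sec + msg_bits / bits_per_sec

hops, msg, header, rate = 12, 4096, 32, 1e6
print(store_and_forward(hops, msg, rate))      # ~0.049 s
print(wormhole(hops, msg, header, rate))       # ~0.0045 s
```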
It's a very neat solution for hardware designers today. Again, it's one of these things that probably won't matter a few years from now, as you get better ways of building switching networks. It may be that a few years from now, for instance, we'll build things with optical technology, and the messages will just jump from one part of the machine to another. But the trick is to be able to put in things like that and provide the user with a level of architectural abstraction so that they don't need to know about the details. So the programmer of the Connection Machine doesn't know anything about wormhole routing, or packet switching, or single-bit ALUs and 64-bit ALUs, or even really the number of processors that are on the machine. There are levels of abstraction built into the machine, the equivalent of things like virtual memory, which provide the programmer with a much clearer model. And part of the trick of being a computer architect is to be able to put in all these kinds of messy things that take advantage of what technology you can buy in a given year without exposing that messiness at a level that the programmer sees -- to be able to take advantage of the quirks of technology without messing up the programmer's model. So I think that the data parallel programming model and the data parallel abstraction in the machine is something that transcends the various implementations of the machine. So I don't think you should get too hung up on the particular details of how the CM2 is implemented. OK, I've talked about arithmetic, memory bandwidth, I/O -- no, I haven't talked about I/O. I/O is probably the most neglected part of this equation, because fundamentally, most computer designers kind of think of it as not their problem. It's somebody else's.
The problem is that the users are not content just to sit there and compute. They want to write down the answer someplace. They want to read in things to compute on. And that means that the machine has to stop and talk to some external device. And typically, computer designers treat that as kind of an afterthought, because they don't know exactly what the device is going to be, and it's very application-dependent, and so on. The problem is, in this kind of case, where the I/O is actually increasing with the size of the problem, that attitude would kill you. So you really have to think of I/O as an integral part of the machine. So just as you distribute processing through there, and just as you distribute communications -- that wormhole routing scheme is a way of effectively distributing the communications function through the machine -- you have to distribute I/O. So I/O has to come out of that machine -- out of that square meter of silicon, all over it -- in order to get it out fast enough. The problem isn't just getting out of the memory fast enough or into the memory fast enough. That's relatively easy, because you can just use a lot of wires. And in fact, the Connection Machine -- for example, the CM2 -- has eight I/O channels of 100 megabytes each, which sounds like a lot of I/O. But what do you plug it into? And I would love to just say, well, that's somebody else's problem, but in fact, when you start selling these machines, it becomes your problem, because the customer needs to get it done fast. So for that reason, we started solving the problem of building our own peripheral devices -- specifically, disk arrays. And so in the Connection Machine, for example, on every one of these I/O channels, you use a parallel disk array, which achieves speed by doing a lot of slow things in parallel, just like the machine does.
So you put 40 disk drives in parallel, and all of that goes on one I/O channel. And when you write a 32-bit word, you write the 32 data bits each to its own disk drive, and you write eight check bits -- you compute an error-correcting code and write eight check bits to eight other drives. Actually, you only need seven check bits; the other one's a spare. The advantage of those check bits is that, when a drive crashes -- which it inevitably does -- you still have all of the data, because, having written those check bits, you can reconstruct the entire data even though you're missing any one of the data bits. And the advantage of it is you get a tremendously fast transfer [INAUDIBLE]. When we designed that, I thought, well, that'll be plenty to keep everybody happy -- 40 disk drives at once -- but in fact, very quickly, the customers said, we want faster. So in fact, the CM2 today supports, in addition, an idea of disk striping. So in fact, you can run eight of these data vaults all at once. And so you're using 320 disk drives that you're writing out to at once. And it's very clear, in another year or two, they're not going to be happy with that either. So the I/O problem becomes part of the problem that you have to solve, whether you want to or not. And it becomes one of the biggest bottleneck problems in designing a very large-scale computer. And I'm just touching on the bandwidth issues of it. There are a lot of other issues as to how it fits into the operating system, and protection, and error tolerance, and things like that.
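A simplified sketch of why that check information saves you (this uses a single parity drive rather than the seven-bit error-correcting code described above, but the reconstruction idea is the same once you know which drive has died):

```python
import random

# One bit per data drive, plus one parity drive (a simplification of the
# 32-data-drive, 7-check-drive arrangement in the talk).
data_bits = [random.randint(0, 1) for _ in range(32)]
parity = 0
for b in data_bits:
    parity ^= b            # parity drive holds the XOR of all data bits

failed = 13                # say drive 13 crashes
survivors = data_bits[:failed] + data_bits[failed + 1:]

recovered = parity
for b in survivors:
    recovered ^= b         # XOR of parity and the surviving drives
assert recovered == data_bits[failed]
```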
758 00:31:57,690 --> 00:32:01,250 So that means that the control portions of the program 759 00:32:01,250 --> 00:32:03,080 are really down in the sequential part. 760 00:32:03,080 --> 00:32:05,470 They're not in this parallel part, 761 00:32:05,470 --> 00:32:07,392 which means that there's really no need 762 00:32:07,392 --> 00:32:09,850 to be able to handle those in parallel-- for the most part. 763 00:32:09,850 --> 00:32:13,200 It helps a little bit [INAUDIBLE] in places, 764 00:32:13,200 --> 00:32:15,770 but for the most part, it's not like the control explodes, 765 00:32:15,770 --> 00:32:17,812 because the control is basically dependent on how 766 00:32:17,812 --> 00:32:18,920 long your program is. 767 00:32:18,920 --> 00:32:21,045 And the program doesn't get longer just because you 768 00:32:21,045 --> 00:32:22,890 had more data to operate on. 769 00:32:22,890 --> 00:32:26,660 So that allows the hardware designer 770 00:32:26,660 --> 00:32:29,210 to get some efficiencies by taking advantage of that shared 771 00:32:29,210 --> 00:32:30,930 control. 772 00:32:30,930 --> 00:32:35,270 So the reason that I think that SIMD machines are 773 00:32:35,270 --> 00:32:38,760 the most efficient machines in 1989, 774 00:32:38,760 --> 00:32:42,170 1990 time frame is because the cost 775 00:32:42,170 --> 00:32:48,110 of distributing that control and putting it out is too high. 776 00:32:48,110 --> 00:32:50,090 It becomes too high a cost. 777 00:32:50,090 --> 00:32:52,700 And it's much cheaper to do that control in one place 778 00:32:52,700 --> 00:32:56,730 and distribute out the information to the places 779 00:32:56,730 --> 00:32:58,590 where it's needed, because you really 780 00:32:58,590 --> 00:33:01,252 are only running one program. 781 00:33:01,252 --> 00:33:03,210 Now, that doesn't mean that that's the only way 782 00:33:03,210 --> 00:33:05,972 to support the data parallel programming model, though. 783 00:33:05,972 --> 00:33:08,430 And if, in the future, [? at ?] [? some ?] [? times, ?] you 784 00:33:08,430 --> 00:33:13,620 begin to get more and more of the costs of the machine is 785 00:33:13,620 --> 00:33:16,440 in the communications, rather than in the silicon-- 786 00:33:16,440 --> 00:33:18,955 because silicon's getting cheaper every year, 787 00:33:18,955 --> 00:33:21,330 so it may be after a while that the piece of wire to send 788 00:33:21,330 --> 00:33:23,955 the data is much more expensive than just putting a whole other 789 00:33:23,955 --> 00:33:25,710 computer at the other end of the wire. 790 00:33:25,710 --> 00:33:27,770 At that point, I think it'll make sense 791 00:33:27,770 --> 00:33:29,520 to implement the data parallel programming 792 00:33:29,520 --> 00:33:31,770 model by putting more control out at the [INAUDIBLE].. 793 00:33:31,770 --> 00:33:34,467 And in fact, the CM2 adopts a kind of compromised approach. 794 00:33:34,467 --> 00:33:36,300 It actually has four streams in instructions 795 00:33:36,300 --> 00:33:38,030 to go through that. 796 00:33:38,030 --> 00:33:40,342 So I think that's really the SIMD/MIMD issue. 797 00:33:40,342 --> 00:33:42,050 I don't think it's a philosophical issue. 798 00:33:42,050 --> 00:33:44,260 And I think, in any case, even if you had all this control 799 00:33:44,260 --> 00:33:46,643 out there, you really still want to present to the user 800 00:33:46,643 --> 00:33:48,310 the model of a single thread of control, 801 00:33:48,310 --> 00:33:51,070 because their parallelism is, in fact, not in the control. 
802 00:33:51,070 --> 00:33:54,600 Their parallelism is in the data. 803 00:33:54,600 --> 00:33:56,190 So that's the control issue. 804 00:33:56,190 --> 00:33:58,550 So my conclusion from all of that 805 00:33:58,550 --> 00:34:02,600 is, to achieve a balanced design, what you need to do 806 00:34:02,600 --> 00:34:06,020 is make sure that all of the parts that 807 00:34:06,020 --> 00:34:09,540 can be done in parallel are done in parallel in the hardware. 808 00:34:09,540 --> 00:34:13,909 That means, specifically, you have to do processing, 809 00:34:13,909 --> 00:34:15,800 you have to do communication, and you 810 00:34:15,800 --> 00:34:18,030 have to do I/O in parallel. 811 00:34:18,030 --> 00:34:20,030 And control you get to do whatever way you think 812 00:34:20,030 --> 00:34:22,530 is cheapest. 813 00:34:22,530 --> 00:34:26,080 It's just a matter of efficiency. 814 00:34:26,080 --> 00:34:29,560 So all of that sort of theories about designing 815 00:34:29,560 --> 00:34:32,560 this kind of machine-- how well does it actually 816 00:34:32,560 --> 00:34:35,100 work out in practice? 817 00:34:35,100 --> 00:34:39,600 Well, for many problems today, the Connection Machine 818 00:34:39,600 --> 00:34:42,088 is the fastest machine in the world. 819 00:34:42,088 --> 00:34:44,130 And it depends on if the problem is of this sort. 820 00:34:44,130 --> 00:34:45,300 There are also many problems for which 821 00:34:45,300 --> 00:34:46,758 it's not the fastest in the world-- 822 00:34:46,758 --> 00:34:49,679 probably many problems for which a [? Cray ?] is the fastest. 823 00:34:49,679 --> 00:34:52,317 But an interesting question is, on the fastest problems 824 00:34:52,317 --> 00:34:54,150 and the biggest problems, what's the fastest 825 00:34:54,150 --> 00:34:55,030 machine in the world? 826 00:34:55,030 --> 00:34:56,969 And the answer is the Connection Machine. 827 00:34:56,969 --> 00:35:01,260 Gordon Bell was kind enough to hold a contest where 828 00:35:01,260 --> 00:35:03,990 what you did is you clocked machines 829 00:35:03,990 --> 00:35:07,110 on doing any real problem, a real application, 830 00:35:07,110 --> 00:35:10,350 to see how fast they were-- how many floating point operations 831 00:35:10,350 --> 00:35:11,640 per second. 832 00:35:11,640 --> 00:35:14,880 And the Connection Machine won that-- 833 00:35:14,880 --> 00:35:17,850 the prize this year by a large margin. 834 00:35:17,850 --> 00:35:21,480 And in fact, another Connection Machine program came in second 835 00:35:21,480 --> 00:35:23,440 and won the cost performance prize. 836 00:35:23,440 --> 00:35:26,265 So it's clear that, for the biggest problems and the most 837 00:35:26,265 --> 00:35:28,140 data-intensive problems, this kind of machine 838 00:35:28,140 --> 00:35:30,770 is the fastest right now. 839 00:35:30,770 --> 00:35:32,540 So now, what about for the future? 840 00:35:32,540 --> 00:35:35,050 How is it going to compare to the competition, which 841 00:35:35,050 --> 00:35:38,970 is, by the way, vector machines, like Crays? 842 00:35:38,970 --> 00:35:43,440 Well, if you look at the basic technologies that 843 00:35:43,440 --> 00:35:45,330 are in the Connection Machine, they're 844 00:35:45,330 --> 00:35:47,670 basically technologies that are improving 845 00:35:47,670 --> 00:35:50,910 in cost performance by a factor of two or three 846 00:35:50,910 --> 00:35:52,890 every two years. 
847 00:35:52,890 --> 00:35:54,390 If you look at the technologies that 848 00:35:54,390 --> 00:35:59,130 are in the vector machines, which are things like bipolar 849 00:35:59,130 --> 00:36:02,130 integrated circuits, high-performance cooling 850 00:36:02,130 --> 00:36:05,850 technologies, their improvement in cost performance, 851 00:36:05,850 --> 00:36:08,490 while substantial, is more like a factor of two 852 00:36:08,490 --> 00:36:09,890 every two years. 853 00:36:09,890 --> 00:36:11,460 So it's not as fast. 854 00:36:11,460 --> 00:36:14,970 So I'm convinced that not only are these data parallel 855 00:36:14,970 --> 00:36:17,910 machines fastest today, but as we look out in the future, 856 00:36:17,910 --> 00:36:21,540 they become more and more the fastest machines. 857 00:36:21,540 --> 00:36:25,620 So somebody who has a pile of money, like $30 million or $100 858 00:36:25,620 --> 00:36:28,740 million-- make it up-- will always do much better 859 00:36:28,740 --> 00:36:32,250 if they're solving a big problem to spend that on a data 860 00:36:32,250 --> 00:36:36,810 parallel machine because of this increasing spread in the cost 861 00:36:36,810 --> 00:36:38,530 performance of the components. 862 00:36:38,530 --> 00:36:40,170 So I think the future of these machines 863 00:36:40,170 --> 00:36:43,290 looks extremely bright. 864 00:36:43,290 --> 00:36:45,840 And in fact, I believe that, over the next few years, 865 00:36:45,840 --> 00:36:47,480 we'll see machines of this sort-- 866 00:36:47,480 --> 00:36:51,660 commercial machines of the sort that are performing regularly 867 00:36:51,660 --> 00:36:53,730 in the 100 gigaflop range. 868 00:36:53,730 --> 00:36:56,760 Today they perform in the 5 gigaflop range, 869 00:36:56,760 --> 00:36:59,470 maybe getting up to the 10 gigaflop range. 870 00:36:59,470 --> 00:37:02,280 But I think that we'll see them, on good problems, performing at 100 871 00:37:02,280 --> 00:37:04,080 gigaflops in a few years-- 872 00:37:04,080 --> 00:37:06,810 and by the mid-1990s, I think that we'll 873 00:37:06,810 --> 00:37:09,320 be talking teraflops. 874 00:37:09,320 --> 00:37:14,030 And that's trillions of floating point operations per second. 875 00:37:14,030 --> 00:37:16,660 876 00:37:16,660 --> 00:37:18,910 That's a factor of 1,000 over what 877 00:37:18,910 --> 00:37:22,610 you do on a conventional supercomputer today. 878 00:37:22,610 --> 00:37:26,930 And so I don't think we yet comprehend how to use it, 879 00:37:26,930 --> 00:37:29,780 but a factor of 1,000-- on one level it sounds like an awful 880 00:37:29,780 --> 00:37:33,830 lot, but then, if I've got one of these problems that grows 881 00:37:33,830 --> 00:37:37,160 like n cubed, that's really only less than-- 882 00:37:37,160 --> 00:37:40,050 that's only a factor of 10 for a problem like that. 883 00:37:40,050 --> 00:37:43,310 And if it [INAUDIBLE] only a factor of about five 884 00:37:43,310 --> 00:37:44,610 or something like that. 885 00:37:44,610 --> 00:37:47,720 So if I want five times the resolution-- 886 00:37:47,720 --> 00:37:49,250 I really want to look at turbulence-- 887 00:37:49,250 --> 00:37:51,980 I can use up that factor of 1,000 very fast.
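The arithmetic behind that remark, written out: if the work in a simulation grows like the cube of the linear resolution, a 1,000-fold faster machine only buys a factor of ten in resolution. The fourth-power case below is an assumption on my part -- a three-dimensional grid whose time step also has to shrink with the resolution -- and seems to be what the inaudible aside about "a factor of about five" refers to; the talk itself only states the n-cubed case.

speedup = 1000.0

cost_models = [
    (3, "work grows like n**3 (3-D grid)"),
    (4, "work grows like n**4 (3-D grid plus a finer time step)"),   # assumed model
]

for exponent, description in cost_models:
    resolution_gain = speedup ** (1.0 / exponent)
    print(f"{description}: about {resolution_gain:.1f}x more resolution")

# Prints roughly:
#   work grows like n**3 (3-D grid): about 10.0x more resolution
#   work grows like n**4 (3-D grid plus a finer time step): about 5.6x more resolution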
888 00:37:51,980 --> 00:37:55,160 Or if I want a more accurate model of how a molecule folds, 889 00:37:55,160 --> 00:37:56,630 or if I want to really understand 890 00:37:56,630 --> 00:37:59,150 how that spiral arm of the galaxy forms, 891 00:37:59,150 --> 00:38:00,530 there are lots and lots of problems 892 00:38:00,530 --> 00:38:02,170 that use that up very quickly. 893 00:38:02,170 --> 00:38:04,220 And in fact, we're beginning to get hints 894 00:38:04,220 --> 00:38:07,460 that we'll be able to begin approaching some 895 00:38:07,460 --> 00:38:09,800 of the problems in artificial intelligence 896 00:38:09,800 --> 00:38:13,975 that I originally was interested in quite a few years ago, 897 00:38:13,975 --> 00:38:15,350 when I started getting interested 898 00:38:15,350 --> 00:38:16,820 in this type of machine. 899 00:38:16,820 --> 00:38:19,580 So we're beginning to see the light at the end of the tunnel, 900 00:38:19,580 --> 00:38:22,580 that we can get machines that are really fast enough to approach 901 00:38:22,580 --> 00:38:24,300 these problems. 902 00:38:24,300 --> 00:38:27,290 So it looks like it's going to be an interesting decade. 903 00:38:27,290 --> 00:38:30,460 So let me summarize. 904 00:38:30,460 --> 00:38:32,745 The conventional Von Neumann machine, 905 00:38:32,745 --> 00:38:34,870 the word-at-a-time machine, 906 00:38:34,870 --> 00:38:37,370 is fundamentally inefficient for very, 907 00:38:37,370 --> 00:38:40,600 very large-scale problems, and it becomes more inefficient 908 00:38:40,600 --> 00:38:45,350 as the problem grows larger because of this Von Neumann 909 00:38:45,350 --> 00:38:47,850 bottleneck. 910 00:38:47,850 --> 00:38:53,410 The natural solution is to distribute processing 911 00:38:53,410 --> 00:38:54,870 through the machine. 912 00:38:54,870 --> 00:38:57,810 In order to maintain a balanced design, 913 00:38:57,810 --> 00:38:59,730 we have to distribute not only processing, 914 00:38:59,730 --> 00:39:05,160 but we have to distribute I/O and communications so that we 915 00:39:05,160 --> 00:39:07,800 can do all of it in parallel. 916 00:39:07,800 --> 00:39:13,230 If we do that, then, for the biggest problems, 917 00:39:13,230 --> 00:39:16,450 we'll be able to do arbitrary amounts of it in parallel. 918 00:39:16,450 --> 00:39:19,920 So we'll be able to get very, very good speedups. 919 00:39:19,920 --> 00:39:22,150 In fact, we can always pick a version of the problem 920 00:39:22,150 --> 00:39:23,870 big enough so that we have 921 00:39:23,870 --> 00:39:27,170 enough parallelism to keep the machine busy. 922 00:39:27,170 --> 00:39:29,720 923 00:39:29,720 --> 00:39:33,350 And finally, the result of all of this 924 00:39:33,350 --> 00:39:35,660 is that data parallel machines today 925 00:39:35,660 --> 00:39:38,090 are, in fact, the fastest computers in the world 926 00:39:38,090 --> 00:39:39,860 for a class of problems. 927 00:39:39,860 --> 00:39:42,740 And it looks like that's becoming more and more 928 00:39:42,740 --> 00:39:44,880 true in the future. 929 00:39:44,880 --> 00:39:48,660 So in conclusion, I'd like to say 930 00:39:48,660 --> 00:39:52,290 that I think this is the natural solution to very 931 00:39:52,290 --> 00:39:57,000 large problems, and I hope I've convinced you all that it is, too. 932 00:39:57,000 --> 00:39:58,740 Thank you very much. 933 00:39:58,740 --> 00:40:02,250 I'm Frank Morris from Pennie and Edmonds, New York City.
934 00:40:02,250 --> 00:40:04,560 Can you explain to us what advances 935 00:40:04,560 --> 00:40:07,350 you foresee in the next few years that will enable you 936 00:40:07,350 --> 00:40:10,290 to increase the performance of the Connection Machine 937 00:40:10,290 --> 00:40:14,910 from about 10 gigaflops to a teraflop? 938 00:40:14,910 --> 00:40:15,780 Yeah. 939 00:40:15,780 --> 00:40:18,300 I'll speak in very general terms, 940 00:40:18,300 --> 00:40:20,460 because obviously, we're getting into what 941 00:40:20,460 --> 00:40:23,400 we're developing in some of our next products here. 942 00:40:23,400 --> 00:40:26,250 But fundamentally, there are three sources 943 00:40:26,250 --> 00:40:28,980 of speed improvement in the machine. 944 00:40:28,980 --> 00:40:33,510 One of them is that the basic technology in the machine 945 00:40:33,510 --> 00:40:34,270 is improving. 946 00:40:34,270 --> 00:40:35,970 So that's that factor of two or three 947 00:40:35,970 --> 00:40:37,230 every two years I was talking about. 948 00:40:37,230 --> 00:40:40,643 As we fit more and more onto a VLSI chip, 949 00:40:40,643 --> 00:40:42,060 we can get things closer together, 950 00:40:42,060 --> 00:40:43,380 actually operate them faster. 951 00:40:43,380 --> 00:40:44,880 Clock speeds become faster. 952 00:40:44,880 --> 00:40:47,080 Data paths become wider for the same cost. 953 00:40:47,080 --> 00:40:48,990 So there are some things that, in a sense, 954 00:40:48,990 --> 00:40:51,900 come to you for free just from the technology. 955 00:40:51,900 --> 00:40:54,090 Another big source of improvement 956 00:40:54,090 --> 00:40:58,140 is that we understand much more of what we're doing now. 957 00:40:58,140 --> 00:41:01,440 So we now have a lot of machines out in the field-- 958 00:41:01,440 --> 00:41:04,620 probably about 10% or 20% of the supercomputers 959 00:41:04,620 --> 00:41:06,960 that are out there right now are Connection Machines, 960 00:41:06,960 --> 00:41:09,270 and so we're beginning to get a lot of experience 961 00:41:09,270 --> 00:41:11,700 in how users actually use the machine 962 00:41:11,700 --> 00:41:15,450 and how the machine works in certain cases, where 963 00:41:15,450 --> 00:41:18,340 it doesn't work, where the time is being spent, and so on. 964 00:41:18,340 --> 00:41:21,030 So another big source is optimizing the machine, 965 00:41:21,030 --> 00:41:22,740 because we now know what to optimize for. 966 00:41:22,740 --> 00:41:25,198 And we were kind of shooting in the dark for the first time 967 00:41:25,198 --> 00:41:26,730 around. 968 00:41:26,730 --> 00:41:29,310 There's probably almost an order of magnitude 969 00:41:29,310 --> 00:41:31,270 to be gained by that. 970 00:41:31,270 --> 00:41:34,770 And then the other source of improvement in speed 971 00:41:34,770 --> 00:41:36,630 is just plain building the machines bigger. 972 00:41:36,630 --> 00:41:42,600 And when we first built the Connection Machine, $5 million 973 00:41:42,600 --> 00:41:45,893 seemed like a lot to solve one of these problems, 974 00:41:45,893 --> 00:41:47,310 but in fact, these days people are 975 00:41:47,310 --> 00:41:49,680 used to spending as much as $20 or $30 million 976 00:41:49,680 --> 00:41:52,230 for a very, very large computer. 977 00:41:52,230 --> 00:41:54,730 And so you can just plain old build bigger machines, 978 00:41:54,730 --> 00:41:56,590 and that's another source of the power.
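Those three sources compound multiplicatively, which is what makes the teraflop projection plausible. The specific numbers below are rough stand-ins I have filled in from the figures mentioned in the answer, not figures given in the talk itself.

baseline_gflops = 10.0        # roughly where the talk places the CM2 today

technology    = 2.5 ** 3      # two to three times every two years, over about six years
optimization  = 8.0           # "almost an order of magnitude" from knowing what to tune
bigger_budget = 5.0           # $5 million machines giving way to $20-30 million machines

projected_gflops = baseline_gflops * technology * optimization * bigger_budget
print(f"projected: about {projected_gflops / 1000:.0f} teraflops")   # comfortably past 1 teraflop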
979 00:41:56,590 --> 00:41:58,810 So if you multiply all of those together, 980 00:41:58,810 --> 00:42:01,530 it really is very plausible to get to that teraflops number 981 00:42:01,530 --> 00:42:03,100 in the mid-'90s. 982 00:42:03,100 --> 00:42:07,480 How much bigger do you see the machines becoming? 983 00:42:07,480 --> 00:42:11,740 I believe that, with the way that we understand designing 984 00:42:11,740 --> 00:42:16,570 machines now, we can see the next factor of 100 985 00:42:16,570 --> 00:42:18,070 after the teraflop. 986 00:42:18,070 --> 00:42:19,810 So the designs change. 987 00:42:19,810 --> 00:42:23,050 The way that you build the communication networks change. 988 00:42:23,050 --> 00:42:25,420 The way that you build the processing elements change. 989 00:42:25,420 --> 00:42:27,753 But fundamentally, I think the same kind of architecture 990 00:42:27,753 --> 00:42:30,070 and the same kind of model presented to the user 991 00:42:30,070 --> 00:42:34,270 will follow through for that kind of speed improvement. 992 00:42:34,270 --> 00:42:37,160 We're talking over the next decade, really, for that. 993 00:42:37,160 --> 00:42:40,030 After that, I think even that model begins to fall apart, 994 00:42:40,030 --> 00:42:42,580 and it needs something else to come along-- some other way 995 00:42:42,580 --> 00:42:44,050 of thinking about the problem. 996 00:42:44,050 --> 00:42:46,210 Users will have to change all of their programs 997 00:42:46,210 --> 00:42:48,850 to get beyond the-- 998 00:42:48,850 --> 00:42:51,760 I don't even know what's after a teraflop, 999 00:42:51,760 --> 00:42:53,590 but I'm not going to worry about it. 1000 00:42:53,590 --> 00:42:57,296 I'm worrying about up to about 100. 1001 00:42:57,296 --> 00:43:01,830 I'm [INAUDIBLE] Howard University in Washington DC. 1002 00:43:01,830 --> 00:43:04,690 Thanks for your presentation, especially as 1003 00:43:04,690 --> 00:43:08,170 you addressed the problem of time saving 1004 00:43:08,170 --> 00:43:11,200 for large computation problems. 1005 00:43:11,200 --> 00:43:16,330 For some of the activities in numerical analysis, 1006 00:43:16,330 --> 00:43:19,210 we spend a lot of time in I/Os. 1007 00:43:19,210 --> 00:43:22,960 And I'm looking forward to hear from you what your thoughts are 1008 00:43:22,960 --> 00:43:26,700 in terms of making the I/O devices more intelligent. 1009 00:43:26,700 --> 00:43:28,450 Certainly, people have been talking about, 1010 00:43:28,450 --> 00:43:32,020 for years, putting processors down into the I/O devices, 1011 00:43:32,020 --> 00:43:34,840 in the same sense that, in a data parallel machine, 1012 00:43:34,840 --> 00:43:37,030 you put the processors into the memory. 1013 00:43:37,030 --> 00:43:39,580 I question in a parallel machine how much is really 1014 00:43:39,580 --> 00:43:42,280 to be gained from that, because the limitation is really 1015 00:43:42,280 --> 00:43:45,700 not the data motion between the I/O device 1016 00:43:45,700 --> 00:43:48,190 and the memory of the machine in a parallel machine. 1017 00:43:48,190 --> 00:43:51,140 It's easy to build I/O channels to keep up with that capacity. 1018 00:43:51,140 --> 00:43:54,520 The limitation really is much more in the physical storage 1019 00:43:54,520 --> 00:43:56,570 medium itself of the I/O device. 
1020 00:43:56,570 --> 00:44:00,600 So today, even the very, very fastest, 1021 00:44:00,600 --> 00:44:03,330 very most sophisticated computer still use the Stone Age 1022 00:44:03,330 --> 00:44:05,790 technology of disks spinning around, 1023 00:44:05,790 --> 00:44:08,610 and that's the thing that's really going to have to change, 1024 00:44:08,610 --> 00:44:10,260 and what's the technology that's going 1025 00:44:10,260 --> 00:44:15,180 to replace that is kind of the $64,000 question. 1026 00:44:15,180 --> 00:44:18,660 Optical technology is mostly-- 1027 00:44:18,660 --> 00:44:21,150 what's out there now is still disks spinning around. 1028 00:44:21,150 --> 00:44:24,000 You're writing things on it with optics, instead of 1029 00:44:24,000 --> 00:44:27,420 with magnetics, but it's still that kind of mechanical motion 1030 00:44:27,420 --> 00:44:29,020 determines the speed of the system. 1031 00:44:29,020 --> 00:44:31,830 I think we're eventually going to have to get away from that. 1032 00:44:31,830 --> 00:44:34,410 Some of the possibilities are things 1033 00:44:34,410 --> 00:44:38,130 like storing three-dimensionally in crystal lattices. 1034 00:44:38,130 --> 00:44:41,130 Magnetic bubbles is another thing people have talked about. 1035 00:44:41,130 --> 00:44:43,470 None of them seem really quite there enough 1036 00:44:43,470 --> 00:44:45,750 to be able to predict them, but it's clear 1037 00:44:45,750 --> 00:44:48,810 that we're going to need something like that. 1038 00:44:48,810 --> 00:44:51,090 I know that, when you do error analysis 1039 00:44:51,090 --> 00:44:56,740 on serial machine in terms of different outputs, you have-- 1040 00:44:56,740 --> 00:44:59,100 [INAUDIBLE] to compare the advantage 1041 00:44:59,100 --> 00:45:01,500 in terms of troubleshooting program [INAUDIBLE] 1042 00:45:01,500 --> 00:45:05,040 in parallel processing, more or less [? architecture. ?] 1043 00:45:05,040 --> 00:45:06,240 OK. 1044 00:45:06,240 --> 00:45:10,380 Fundamentally, writing a program on a data parallel machine 1045 00:45:10,380 --> 00:45:12,782 is no more difficult than writing a program 1046 00:45:12,782 --> 00:45:14,490 on a sequential machine, because you only 1047 00:45:14,490 --> 00:45:16,080 have one thread of control. 1048 00:45:16,080 --> 00:45:19,080 So you don't have to worry about issues like synchronization 1049 00:45:19,080 --> 00:45:22,470 or data consistency between processes and things like that. 1050 00:45:22,470 --> 00:45:24,060 It's things like that, where you have 1051 00:45:24,060 --> 00:45:25,950 multiple threads of control, that 1052 00:45:25,950 --> 00:45:29,490 makes some kinds of parallel programming very difficult. 1053 00:45:29,490 --> 00:45:31,800 But debugging a data parallel program 1054 00:45:31,800 --> 00:45:35,130 is really very much like debugging a sequential program, 1055 00:45:35,130 --> 00:45:37,050 and uses essentially the same tools. 1056 00:45:37,050 --> 00:45:39,367 Now, of course, if the program's doing the wrong thing, 1057 00:45:39,367 --> 00:45:41,200 then it can do the wrong thing a lot faster, 1058 00:45:41,200 --> 00:45:42,658 so it can make a lot more mistakes, 1059 00:45:42,658 --> 00:45:45,285 but the truth of the matter is that it can already 1060 00:45:45,285 --> 00:45:47,910 make mistakes a lot faster than you can keep up with it anyway. 1061 00:45:47,910 --> 00:45:51,810 So that difference really doesn't make a difference. 1062 00:45:51,810 --> 00:45:55,115 Now, that's to do with software errors. 
1063 00:45:55,115 --> 00:45:57,240 Another question, which you might have been asking, 1064 00:45:57,240 --> 00:45:58,770 is about hardware errors. 1065 00:45:58,770 --> 00:46:01,710 And one topic that I neglected in the talk, which 1066 00:46:01,710 --> 00:46:04,530 is a very big issue for designing these machines, 1067 00:46:04,530 --> 00:46:07,410 is the problem of fault tolerance. 1068 00:46:07,410 --> 00:46:09,000 And in fact, the machines that we're 1069 00:46:09,000 --> 00:46:11,070 designing in the future all have very high degree 1070 00:46:11,070 --> 00:46:14,400 of fault tolerance, so that one part of the hardware can break 1071 00:46:14,400 --> 00:46:18,540 and the program continues running, and runs-- 1072 00:46:18,540 --> 00:46:20,855 the user sees no error, even though the hardware-- 1073 00:46:20,855 --> 00:46:22,980 the individual hardware elements are making errors. 1074 00:46:22,980 --> 00:46:25,650 And that becomes a necessity when you design very, very 1075 00:46:25,650 --> 00:46:27,510 large-scale machines. 1076 00:46:27,510 --> 00:46:30,330 And that will become increasingly a problem 1077 00:46:30,330 --> 00:46:32,280 in designing machines, and increasingly 1078 00:46:32,280 --> 00:46:35,331 a place for architectural creativity. 1079 00:46:35,331 --> 00:46:38,165 I'm Mark Tuttle from the [INAUDIBLE] Lab. 1080 00:46:38,165 --> 00:46:39,540 I felt, during the talk, you were 1081 00:46:39,540 --> 00:46:42,900 trying to convince me that the data parallel model was 1082 00:46:42,900 --> 00:46:46,170 the best and most natural model for parallel computation, 1083 00:46:46,170 --> 00:46:48,660 and I'd like to press you on that a little more. 1084 00:46:48,660 --> 00:46:52,380 It seems like, in discrediting the other models, 1085 00:46:52,380 --> 00:46:53,940 you're taking a lot of-- 1086 00:46:53,940 --> 00:46:57,450 you're taking advantage of the parallelism that's inherent-- 1087 00:46:57,450 --> 00:46:59,610 the data parallelism that's inherent in the data 1088 00:46:59,610 --> 00:47:01,620 that you're using, like the airflow 1089 00:47:01,620 --> 00:47:03,820 over the wing and stuff like that. 1090 00:47:03,820 --> 00:47:06,990 But it seems like there are a lot of problems where that's 1091 00:47:06,990 --> 00:47:09,600 not really the case, like chasing links 1092 00:47:09,600 --> 00:47:13,740 through a semantic net, where you can, of course, formulate 1093 00:47:13,740 --> 00:47:17,520 that is taking the transitive closure of a graph. 1094 00:47:17,520 --> 00:47:21,120 And I just wanted to ask you about that. 1095 00:47:21,120 --> 00:47:29,330 Well, I certainly believe that other kinds of parallelism 1096 00:47:29,330 --> 00:47:32,000 are legitimate and useful. 1097 00:47:32,000 --> 00:47:34,880 The problem with most kinds of parallelism 1098 00:47:34,880 --> 00:47:37,970 based on control flow is that they 1099 00:47:37,970 --> 00:47:39,985 don't get you the factors of tens of thousands 1100 00:47:39,985 --> 00:47:41,360 or hundreds of thousands that you 1101 00:47:41,360 --> 00:47:45,440 need to really do these very, very large-scale problems. 1102 00:47:45,440 --> 00:47:49,310 So I guess I feel that data parallelism is 1103 00:47:49,310 --> 00:47:51,680 the only way that you'll get the speed in very, very 1104 00:47:51,680 --> 00:47:52,500 large problems. 
1105 00:47:52,500 --> 00:47:54,510 And in fact, the problems that you mentioned, 1106 00:47:54,510 --> 00:47:57,140 which were database query, computing 1107 00:47:57,140 --> 00:47:59,130 the transitive closure of a graph, 1108 00:47:59,130 --> 00:48:01,040 are very, very natural expressions 1109 00:48:01,040 --> 00:48:03,330 in data parallel algorithms, and in fact, work very, 1110 00:48:03,330 --> 00:48:06,560 very well on a machine like this when the problem is large. 1111 00:48:06,560 --> 00:48:09,050 So it's not that I think that the other kinds of problems 1112 00:48:09,050 --> 00:48:10,202 are no good. 1113 00:48:10,202 --> 00:48:11,660 In fact, I think they're very good. 1114 00:48:11,660 --> 00:48:15,260 I think multitasking, forking parallelism 1115 00:48:15,260 --> 00:48:16,830 is very useful in many cases. 1116 00:48:16,830 --> 00:48:19,125 I think pipelining is useful in many cases. 1117 00:48:19,125 --> 00:48:20,750 I think that those make a lot of sense, 1118 00:48:20,750 --> 00:48:22,610 but they're not going to get you the factors of hundreds 1119 00:48:22,610 --> 00:48:24,440 of thousands, so they're really not what 1120 00:48:24,440 --> 00:48:27,868 you have to focus on when designing a machine that's 1121 00:48:27,868 --> 00:48:29,660 going to operate in that performance range. 1122 00:48:29,660 --> 00:48:31,910 It sounds like you're talking about-- 1123 00:48:31,910 --> 00:48:34,580 that's an argument that the architecture that you're 1124 00:48:34,580 --> 00:48:36,440 interested in is a superior architecture, 1125 00:48:36,440 --> 00:48:40,280 because you can make these chips small enough 1126 00:48:40,280 --> 00:48:44,750 that you can make these machines much, much bigger. 1127 00:48:44,750 --> 00:48:48,360 It's more cost effective to make these kinds of machines bigger, 1128 00:48:48,360 --> 00:48:50,660 but that doesn't seem like it's a convincing argument 1129 00:48:50,660 --> 00:48:52,520 that the model-- 1130 00:48:52,520 --> 00:48:55,370 the programming model itself is the most natural model. 1131 00:48:55,370 --> 00:48:57,927 Well, maybe a more-- 1132 00:48:57,927 --> 00:49:00,260 better way to convince you that the programming model is 1133 00:49:00,260 --> 00:49:03,260 the right model is to say that even somebody who 1134 00:49:03,260 --> 00:49:05,460 builds hardware for a completely different model 1135 00:49:05,460 --> 00:49:07,190 almost always ends up programming it 1136 00:49:07,190 --> 00:49:08,703 in the data parallel model. 1137 00:49:08,703 --> 00:49:10,370 So typically, if you have a machine that 1138 00:49:10,370 --> 00:49:13,520 has more than a few dozen processors, even if it wasn't 1139 00:49:13,520 --> 00:49:16,100 designed to support the data parallel programming model, 1140 00:49:16,100 --> 00:49:18,440 you'll see that the way that people program that machine 1141 00:49:18,440 --> 00:49:20,570 is to sort of make that machine pretend like it's 1142 00:49:20,570 --> 00:49:21,772 a data parallel machine. 1143 00:49:21,772 --> 00:49:23,480 And in fact, they only write one program, 1144 00:49:23,480 --> 00:49:26,550 and they run it all on all the processors at once. 1145 00:49:26,550 --> 00:49:29,330 So they hide all of that flexibility 1146 00:49:29,330 --> 00:49:30,950 that they put into the architecture, 1147 00:49:30,950 --> 00:49:32,880 because they don't know how to deal with [INAUDIBLE] software. 
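As an illustration of the earlier point that transitive closure has a natural data parallel expression, here is a small sketch using NumPy arrays as a stand-in for the whole-array operations a data parallel machine provides. It is not Connection Machine code (that would have been written in something like *Lisp or C*), and the function name and example graph are made up.

import numpy as np

def transitive_closure(adjacency):
    """Boolean n-by-n adjacency matrix -> boolean reachability matrix."""
    n = adjacency.shape[0]
    reach = adjacency | np.eye(n, dtype=bool)      # every vertex reaches itself
    # Repeatedly square the relation.  After k rounds, all paths of length
    # up to 2**k are accounted for, so about log2(n) rounds suffice.  Each
    # round is a single elementwise operation over all n*n*n combinations --
    # one "instruction" applied across all the data at once.
    for _ in range(max(1, int(np.ceil(np.log2(n))))):
        reach = reach | (reach[:, :, None] & reach[None, :, :]).any(axis=1)
    return reach

# Tiny example: a four-vertex chain 0 -> 1 -> 2 -> 3.
edges = np.zeros((4, 4), dtype=bool)
edges[0, 1] = edges[1, 2] = edges[2, 3] = True
print(transitive_closure(edges).astype(int))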
1148 00:49:32,880 --> 00:49:34,713 So I think that's a very convincing argument 1149 00:49:34,713 --> 00:49:36,770 that this is the most natural way of expressing 1150 00:49:36,770 --> 00:49:38,690 those algorithms. 1151 00:49:38,690 --> 00:49:41,240 I'm Karen Matthews from UVC, but Danny, 1152 00:49:41,240 --> 00:49:43,850 this question is from Gordon Bell. To 1153 00:49:43,850 --> 00:49:46,310 what extent can we expect a language for expressing 1154 00:49:46,310 --> 00:49:49,440 parallel computation to span a range of types of machines 1155 00:49:49,440 --> 00:49:52,340 so that programmers don't have to be retrained? 1156 00:49:52,340 --> 00:49:57,140 I think that it's realistic for machines 1157 00:49:57,140 --> 00:49:59,810 with roughly the same numbers 1158 00:49:59,810 --> 00:50:04,260 of processors to have the same kind of programming language. 1159 00:50:04,260 --> 00:50:06,500 So I believe that all machines with tens of thousands 1160 00:50:06,500 --> 00:50:08,570 of processors will be programmed in the same way 1161 00:50:08,570 --> 00:50:10,790 by exactly the same programming languages. 1162 00:50:10,790 --> 00:50:13,187 I believe that machines that have dozens of processors 1163 00:50:13,187 --> 00:50:15,770 will be programmed in a pretty different way, because there it 1164 00:50:15,770 --> 00:50:18,502 makes sense to use this fork-join kind of parallelism. 1165 00:50:18,502 --> 00:50:20,960 And it may be that there are some intermediate stages where 1166 00:50:20,960 --> 00:50:23,900 machines with a few processors can be programmed in a message- 1167 00:50:23,900 --> 00:50:27,470 passing way, but I think that all machines that are massively 1168 00:50:27,470 --> 00:50:30,950 parallel will run exactly the same programming languages. 1169 00:50:30,950 --> 00:50:32,720 And interestingly enough, they're 1170 00:50:32,720 --> 00:50:34,910 probably the same programming languages 1171 00:50:34,910 --> 00:50:36,733 that the sequential machines will run, 1172 00:50:36,733 --> 00:50:38,150 and that was a big surprise to me. 1173 00:50:38,150 --> 00:50:41,800 [MUSIC PLAYING] 1174 00:50:41,800 --> 00:52:25,110