[MUSIC PLAYING]

To perform a task in parallel is to have multiple entities working together toward a single goal. Likewise, parallel computing is the use of multiple processors together in a single machine to solve a single problem. Vital application areas that would immediately benefit from the availability of parallel computing devices include superconductor physics, atmospheric research, weather modeling, defense, astronomy, genetics, oil exploration, and aerospace. Such machines could improve current computational capabilities by several orders of magnitude. However, today's parallel processors remain in the domain of sophisticated specialists and are extremely difficult to use by scientists and engineers who want to solve routine yet time-consuming problems. The ultimate challenge is to bring parallel computing into the mainstream.

The MONSOON Project is rising to meet this challenge. MONSOON is a joint research effort by the Motorola Computer Group in Cambridge, Massachusetts and Tempe, Arizona and the Massachusetts Institute of Technology Laboratory for Computer Science. The project is funded by DARPA and Motorola. The goal of the project is to demonstrate the feasibility of dataflow multiprocessors using implicitly parallel programming languages.

Professor Arvind, of MIT's Laboratory for Computer Science Computation Structures Group, is a leader in the dataflow research effort:

Why is it that we don't have more parallel machines? The economics of it is not in the way. I think there is fairly strong agreement that software is the obstacle to making parallel systems more readily available. Small multiprocessors available today have had economic success, but only in limited application areas, applications where each processor can be dedicated easily to a specific task. However, such machines can't exploit the volumes of software available on uniprocessors. This says something about the state of the languages in use and the compilers in use.
So users have to resort to some sort of explicit parallel programming to get the job done. It turns out that explicit parallel programming is quite difficult. People are reluctant to do it not only because it's difficult, but because there is no guarantee of portability. It's not clear if a program you have put some effort into making parallel will port to a machine with more processors, or somebody else's machine, or, for that matter, even to the next generation of the same machine. And all of this puts a big damper on trying to parallelize a program.

Dr. Ken Traub of the Motorola Cambridge Research Center:

Another kind of problem with explicit parallel programming, which is even more serious, is that these kinds of programs don't have very good compositionality properties. We're used to writing subroutines, then calling subroutines from all over the place, having subroutine libraries, and so on. But any experienced programmer will tell you that when you're working with parallel subroutine libraries, you can't count on how well they'll work under different circumstances. And this is especially true of the data parallel model.

If you look at the other extreme, which is massively parallel machines, it's obvious that there is some success. We read about these machines in newspapers all the time. People are actually using these machines to solve some very important problems. This is in spite of the extreme difficulty of programming massively parallel machines. Right now, these machines are being used to solve problems which cannot be solved any other way. Working on parallel machines has its own sex appeal; the difficulty becomes the challenge. But the point is, if parallel machines remain so difficult to program, then it's obvious that they will remain on the fringes. They can never become mainline computing engines that we use every day in our offices, in our homes. So the key issue here is the ease of programming and the languages we use to program these machines.
Computer languages play a pivotal role in the degree to which the parallelism of a problem is exposed or obscured. To illustrate, consider the following simple problem, known as "wavefront." The problem is to construct an n by n matrix such that the top row and leftmost column contain 1s, and each of the other elements in the matrix is defined in terms of its neighbors to the north and west, as shown. It is not difficult to see that after the elements in one diagonal have been computed, the elements in the following diagonal can be computed in parallel, and so on, until the matrix is full. The number of operations that can be performed simultaneously at each time step is displayed by a graph called a parallelism profile. Clearly, the number of operations that can be performed in parallel increases until the middle diagonal is computed and then decreases.

Now, consider this Fortran program to solve wavefront. It consists of an initialization step, doubly nested do-loops, and the computation for each element X(I,J). The semantics of Fortran dictates the precise order in which the elements of the matrix are computed, namely, top row to bottom row, and left to right within each row. The difficulty of performing the required analysis to determine what can be done in parallel, even for this simple program, is beyond the capability of almost all commercial compilers.

Let's examine the steps we have taken to execute the Fortran wavefront program in parallel. We have started out with a specification of a problem in which the parallelism is evident, encoded it in a sequential language, Fortran in this case, and then are forced to rely on a sophisticated compiler to produce parallel code from the sequential program. It seems to me to be far more effective to use a language in which we can encode our algorithms such that the parallelism in the original problem is not obscured, but is actually preserved by the code.
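The Fortran source appears only on screen in the video, so it is not reproduced here. As a rough sketch of the sequential computation the narration describes, here is a Python rendering of the same doubly nested loop; the combining function for the interior elements is not stated aloud, so a simple sum of the north and west neighbors is assumed.

```python
# A rough sketch, in Python rather than Fortran, of the sequential wavefront
# computation the narration describes. The combining function is not given in
# the video; a simple sum of the north and west neighbors is assumed here.

def wavefront_sequential(n):
    # Initialization: top row and leftmost column contain 1s.
    x = [[1] * n] + [[1] + [0] * (n - 1) for _ in range(n - 1)]
    # The doubly nested loops fix the order: top row to bottom row,
    # left to right within each row, exactly as Fortran semantics dictate.
    for i in range(1, n):
        for j in range(1, n):
            x[i][j] = x[i - 1][j] + x[i][j - 1]   # north + west (assumed)
    return x

print(wavefront_sequential(4))
# [[1, 1, 1, 1], [1, 2, 3, 4], [1, 3, 6, 10], [1, 4, 10, 20]]
```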
The language Id, developed by Professor Arvind and his colleagues, is such a language. As an example, here is an Id program for wavefront. First, the dimensions of the matrix are defined. Then the boundary elements of the matrix are defined, as well as the remaining elements. Note that this is a declarative specification of the problem. Nothing is implied about the order of computation, so whatever can be computed in parallel, subject to data dependencies, will be computed in parallel.

To illustrate the value of compositionality, consider an extension of wavefront in which the value of element X(I,J) is a function of its neighbors to the north and west and its own previous value. This is to be iterated multiple times. At first glance, the parallelism profile of this program would appear to be that of a series of wavefronts laid end to end. This would be the parallelism obtained in Fortran by simply composing the earlier program in a loop. However, the parallelism that is actually present in this problem is far greater than what might be expected, leading to a much shorter execution time. This parallelism profile is what is obtained in Id when the original Id program is enclosed in a loop.

Thus, the power of implicit parallel programming leads to programs which are both high level and exhibit a degree of parallelism which would be almost unthinkable to express in languages like Fortran. I believe that implicit parallel programming will be able to jump over the final hurdle to make parallel machines really commonplace. It can bring down the programming difficulty to a level where users would as happily program parallel machines as they would program sequential machines.

Compilers for implicitly parallel languages, such as Id, are likely to generate a multitude of threads of computation, reflecting the large amounts of parallelism which can be exposed in programs.
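Likewise, the Id source is shown only on screen. To make the diagonal structure of the dependencies concrete, here is a small Python sketch (not Id) that evaluates the same recurrence one anti-diagonal at a time and records how many elements become computable at each step; the resulting counts form exactly the kind of parallelism profile described above, rising to the middle diagonal and then falling. The sum combining function is again an assumption.

```python
# A sketch (in Python, not Id) of wavefront evaluated one anti-diagonal at a
# time. All elements on a diagonal depend only on the previous diagonal, so
# they could be computed in parallel; the length of each diagonal gives the
# parallelism profile described in the narration.

def wavefront_by_diagonals(n):
    x = [[1 if i == 0 or j == 0 else None for j in range(n)] for i in range(n)]
    profile = []
    for d in range(2, 2 * n - 1):                 # anti-diagonal index i + j
        cells = [(i, d - i) for i in range(1, n) if 1 <= d - i < n]
        for i, j in cells:                        # independent of one another
            x[i][j] = x[i - 1][j] + x[i][j - 1]   # north + west (assumed)
        profile.append(len(cells))
    return x, profile

x, profile = wavefront_by_diagonals(5)
print(profile)   # [1, 2, 3, 4, 3, 2, 1] -- rises to the middle diagonal, then falls
```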
Therefore, it is essential that the hardware architecture for the system be designed with this in mind.

[MUSIC PLAYING]

In the MONSOON architecture, the state of the computation is represented as a tree of activation frames, which are similar to stack frames in conventional machines. However, instead of only having the most recent frame active at any time, in the MONSOON architecture multiple procedures are active simultaneously. In addition, the control state is virtualized, so that within activation frames there are multiple threads of control which may be simultaneously active.

In any machine architecture which supports a parallel model of computation, there are certain fundamental issues which must be addressed in order to achieve efficient processor utilization. The first of these issues is memory latency. Let's look at a computing node in any multinode system. There is a processor component with an on-chip memory cache, typically some local memory, and an interface to the communication network. A reference by a processor to local memory may be five times slower than a reference which hits the processor's local cache. And even in a well-designed machine, the latency incurred by a reference to a different node in the machine may be many times slower yet.

Professor Greg Papadopoulos of MIT's Computation Structures Group:

When you're looking at how you're going to run programs on this machine, you have to ask yourself the question, how often am I going to require communication with another node? And what will the penalty of such communication be? For example, will I just sit there, and wait, and idle while I wait for the response to the request that I've issued? Or can I be more creative and do something else, try to amortize or minimize the latency? One natural way to tolerate communication latency is to issue many remote requests into the network.
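As a back-of-the-envelope illustration of that last point (this is not MONSOON code), the following Python sketch issues eight simulated remote fetches either one at a time or with all of them outstanding at once; the 100-millisecond latency figure is invented for the example.

```python
# A toy illustration of hiding latency by keeping many remote requests
# outstanding. Each simulated remote fetch takes 100 ms; issuing eight of them
# one at a time costs ~800 ms of wall-clock time, while issuing all eight
# before waiting on any response costs only ~100 ms.

import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.1          # assumed round-trip network latency, in seconds

def remote_fetch(addr):
    time.sleep(LATENCY)            # stand-in for a request crossing the network
    return addr * 2                # stand-in for the value stored at 'addr'

addrs = list(range(8))

t0 = time.time()
sequential = [remote_fetch(a) for a in addrs]          # wait after each request
t1 = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    overlapped = list(pool.map(remote_fetch, addrs))   # all requests in flight
t2 = time.time()

assert sequential == overlapped
print(f"one at a time:   {t1 - t0:.2f} s")   # ~0.80 s
print(f"all outstanding: {t2 - t1:.2f} s")   # ~0.10 s
```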
Although each request experiences the latency of the network, the processor which originates the requests is kept busy by the multiple responses which ultimately return. In order to successfully generate sufficient remote memory requests to keep the communication network full and the processor busy, there must be sufficient parallelism available, and the processor must also be able to perform rapid context switching across multiple threads of control, each of which can generate requests into the network.

Remote requests made by a processor will produce responses which return in a different order than they were issued. To impose an ordering requirement on the network is practically unmanageable.

Well, as they say, talk is cheap; it's listening that's an art. It's easy if you assume that you have lots of threads to generate as many outgoing messages as you want. The really hard part is, when you think about a message coming back into the processor, how are you going to relate it to the waiting threads of control? And is this really the only datum that's required to continue the computation?

In order to relate a response to a waiting activity within a processor, the processor must have the ability to perform efficient synchronization. In general, multiple responses may be required to activate a thread.

What's interesting is that if you just tackle this latency problem, you end up with a machine that looks like this. You would imagine a few contexts that you could rapidly switch among in each processor, and then some synchronization capabilities in the memory. But that hardware in and of itself doesn't quite get you to the point that's expected by the compiler. And the key difference is, while the machine needs only enough contexts to keep the communication network filled, the compiler must be able to name enough activities in the machine to support the fully parallel model.
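To make the synchronization problem concrete, here is a toy Python model (not the MONSOON hardware) in which every response carries a tag naming the waiting activity's operand, and the activity is enabled only after all of the responses it needs have arrived, in whatever order they return.

```python
# A toy model of matching out-of-order responses to waiting activities.
# Every request carries a tag naming the operand that is waiting for it;
# an activity becomes runnable only once all of its responses have arrived.

from dataclasses import dataclass, field

@dataclass
class WaitingActivity:
    name: str
    needed: int                       # how many responses must arrive
    values: dict = field(default_factory=dict)

    def deliver(self, tag, value):
        """Accept one response; return True when the activity becomes runnable."""
        self.values[tag] = value
        return len(self.values) == self.needed

# Two responses are required before 'add' can run; arrival order is irrelevant.
act = WaitingActivity("add", needed=2)
print(act.deliver("west", 3))    # False -- still waiting on the other operand
print(act.deliver("north", 4))   # True  -- both operands present, thread enabled
print(sum(act.values.values()))  # 7
```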
So the dataflow approach is, rather than binding the number of threads that we can name at a time to physical resources, like the number of registers in a processor, we instead virtualize this and come up with a very lightweight multithreaded model. So threads in MONSOON look like lightweight computation descriptors, which we can switch amongst very rapidly. In fact, on any processor there can be thousands of thread descriptors awaiting execution at any given time.

The thread descriptors consist of a pointer to an instruction in memory, a pointer to the activation frame the thread is associated with, and a few registers. Two independent thread descriptors may synchronize by using the presence bits found on every word of frame memory. So the hardware design of MONSOON emphasizes very rapid switching across threads. In fact, it is possible to join two threads of computation with one cycle of overhead, and a new thread can be formed from a currently executing thread with zero cycles of overhead.

To summarize, the MONSOON approach to parallel processing brings together an implicitly parallel programming language, Id, which preserves the parallelism present in algorithms, and an architecture which supports the compiling model for the language, as well as addressing the fundamental problems of latency and synchronization.

Certainly, we oftentimes write programs where we don't know where the parallelism is, but the gut feeling is there must be some. The original algorithms for these codes, even though no attention may have been paid to parallelism, perhaps do have sufficient parallelism. If you just took the same algorithm and encoded it in Id, there is a good chance that you would get significant speedups of your program, regardless of how unstructured the code is. The role of Id, or implicit parallel programming, is to make sure that the good properties, the parallelism of your algorithm, are not obscured, that the language is not in the way.
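The presence-bit join can be sketched in the same toy style. The following Python fragment is an analogy, not MONSOON's actual pipeline: the first thread to arrive at a frame slot parks its value and suspends, and the second finds the slot full and continues as the joined computation.

```python
# A toy sketch of a two-way join on a frame slot guarded by a presence bit.
# The first thread to arrive stores its value and stops; the second finds the
# slot full, so the joined computation continues immediately -- the analogue of
# the one-cycle join described above.

EMPTY, FULL = 0, 1

class FrameSlot:
    def __init__(self):
        self.presence = EMPTY
        self.value = None

    def join(self, value, combine):
        """First arrival parks its value; second arrival fires 'combine'."""
        if self.presence == EMPTY:
            self.presence, self.value = FULL, value
            return None                       # this thread has nothing more to do
        self.presence, parked = EMPTY, self.value
        return combine(parked, value)         # continue as the joined thread

slot = FrameSlot()
print(slot.join(3, lambda a, b: a + b))   # None -- first token parked
print(slot.join(4, lambda a, b: a + b))   # 7    -- join completes
```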
So the role of the machine has got to be to efficiently exploit all sorts of parallelism which are exposed by these wonderful languages. It seems to me that the sky is the limit.

[MUSIC PLAYING]