[MUSIC PLAYING]

On behalf of the MIT Summer Session, let me indicate how happy we are to be participating in this edition of the University Video Communications Distinguished Lecture Series. Our speaker tonight, Guy Steele, received his PhD in computer science here at MIT in 1980. He's probably best known to those of us here as the co-developer of Scheme and of Common Lisp, or perhaps as the editor and chief organizer behind The Hacker's Dictionary. He, however, has many talents outside computer science. He is, for example, a composer and singer with the distinction, probably the unique distinction, of having his sheet music published in the Communications of the ACM. He also has demonstrated his programming skills by becoming an outstanding Chinese chef. He is a senior scientist at Thinking Machines Corporation, where he is responsible for the design and implementation of parallel programming languages and other system software for the Connection Machine computer system. He will be talking to us this evening about data parallel algorithms. Guy.

The data parallel programming style is an approach to organizing programs suitable for execution on massively parallel computer systems. In this talk, we're going to characterize the data parallel programming style, we're going to examine the building blocks used to construct data parallel programs, and we will see how to fit these building blocks together to make useful algorithms.

All programs consist of code and data working together, and there's usually some processing agency, a processor, which, using the code as instructions, operates upon the data to compute some useful result. Now, there are different ways of building this processor. If you have a single large processor, or perhaps even a single small processor, operating upon the code and operating on a single data item at a time, you have a typical serial, sequential von Neumann computer.
On the other hand, you might have many individual processing elements interpreting the code and acting upon the data, in which case we have a parallel computing system, and there are different ways of organizing this parallelism.

In control parallelism, the emphasis is on organizing the program so that the processing elements can take advantage of parallelism in the code. Typically, in this style, different processing elements might be at different places in the program text, so that you have different pieces of the program being executed all at the same time. Now, of course, you don't simply operate by executing instructions without regard to the data. Each of the processing elements must have some access to the data, and this may be in a regular or irregular pattern. But the emphasis is on extracting parallelism by orienting the program's organization around the parallelism in the code.

In the data parallel programming style, by contrast, the emphasis is on organizing the programs to extract parallelism from the organization of the data. And the underlying metaphor is to think of there being enough processing elements that for every data item of interest, there's a processor that can be attached to it. And again, it's not that the processing elements are operating on the data without regard to the code; there still must be some means of having the processors interpret the code. Now, typically, but not necessarily, all of the processors will be at roughly the same point in the program. This is characteristic of the data parallel programming style, but it's not a necessary feature. It is possible to organize a program in a data parallel way and yet have the processors be at different points in the code. But having a single locus of control most of the time is a typical characteristic of this style.

Now, I want to distinguish this contrast between control parallelism and data parallelism, on the one hand,
from a slightly different distinction, about architectural styles, about the design of hardware, on the other: the organizations usually labeled SIMD, for single instruction, multiple data, versus MIMD, for multiple instruction, multiple data. I think it's important to draw the distinction between the style in which a program is written and the way in which the underlying hardware is organized, because the two don't necessarily have to be the same. It is perfectly reasonable to write a data parallel program which is then executed on a MIMD computer, or, for that matter, on a sequential computer. And similarly, a control parallel program can be executed on a sequential or a SIMD kind of computer. And so my emphasis in this talk is going to be on styles of organizing programs rather than styles of organizing hardware. It then becomes an engineering issue whether it's worthwhile to match the hardware style to the program style or not.

Now, the sequential programming style, typified by programming languages such as C and Pascal, has certain standard themes and certain standard building blocks which you see all the time. Examples of these are scalar arithmetic operators, control structures such as loops and if-then-else statements, and subscripted array references. And the experienced sequential programmer knows a great many standard and useful ways of fitting these building blocks together to make useful algorithms. Furthermore, he has a good idea of the costs of these individual building blocks. While it's nowhere cast in stone, most of us as sequential programmers know that an addition, a subtraction, and perhaps a comparison all cost roughly the same, whereas a multiplication or division might cost a little bit more, depending on the underlying hardware. Similarly, a procedure call might be a little bit more costly than an if-then-else statement.
And the precise costs depend upon the implementation of the programming language and on the implementation of the hardware. But still, it's important to have a good idea, a reasonably good model, a set of expectations about the underlying cost of the primitives before you can expect to write effective and efficient algorithms in a sequential programming language. Similarly, in order to write good data parallel programs, we need to become familiar with the appropriate building blocks for building data parallel programs, and we need to have a good idea of their relative costs. So now, let us examine certain standard themes that arise naturally in data parallel programs and consider specific building blocks that are used to implement those themes.

The most common themes that I see popping up over and over again in data parallel programs are elementwise operations. These are operations which can be carried out by the individual processors independently, each on its own piece of data, without communication among the processors; typically, these are arithmetic operations and tests of various kinds. There are also conditional operations, which are likewise elementwise, but in which some of the processors perhaps do not participate, or perhaps act in slightly different ways, depending on the content of the data. Then, there are various forms of communication among the processors, which include replication, reduction, permutations, both regular and irregular, and parallel prefix operations. These are by no means the only themes, but these are the ones that I want to focus particular attention on tonight.

Let's take a look at an example of an elementwise operation. Here, we have what looks like an ordinary Fortran or C assignment statement, C gets A plus B. The intended interpretation here is that A and B represent arrays. And here, we see eight processors.
Each processor has its own value for A and its own value for B, and the direction is for each processor to execute an addition and store the result in C. And so you can see that the first processor has added 3 and 6 to get 9. The second processor has different data; it has added 1 and 2 to get 3, and so on.

Similarly, conditional operations can be carried out elementwise. Here, we are comparing the contents of the array A and the array B element by element and setting a flag, shown at the top of the graphic, to indicate whether or not the test succeeded. And so we can see that in the first column, the test failed because 3 is not greater than 6, whereas in the third column, the test succeeded because 4 is greater than 1. And the result is a series of Boolean results, a series of bits, one per processor, which can then be used to conditionalize further operations.

Here, we have an example of using the results of that test to perform a conditional addition. Only processors whose test succeeded in the previous step will execute the addition step. And so we can see that only the third, fourth, fifth, and seventh processors actually performed the addition and stored the result in C. The set of bits which is used to conditionalize the operation is frequently called a condition mask, or a context, for the execution of further operations. And in this way, each processor can perform different computations based on the particular data that it happens to contain.
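As a concrete aside, here is a minimal NumPy sketch of these two themes (not from the talk): one array element stands in for each processor's local value, a Boolean array plays the role of the context mask, and the data values are simply made up.

```python
import numpy as np

# One array element per simulated processor: each processor holds
# its own value of A and of B.  (The values here are made up.)
A = np.array([3, 1, 4, 1, 5, 9, 2, 6])
B = np.array([6, 2, 1, 8, 2, 8, 1, 9])

# Elementwise operation: every processor adds its A to its B at once,
# with no communication between processors.
C = A + B

# Elementwise test: one Boolean flag per processor, which serves as
# the condition mask, or context, for further operations.
mask = A > B

# Conditional operation: only processors whose flag is set store
# A + B into D; the others leave their D untouched.
D = np.zeros_like(A)
D[mask] = A[mask] + B[mask]
```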
Now, let's look at some communications operations. One that happens very frequently is to have a single quantity and want to get a copy of it out to all the processors. This we call broadcasting. Here, we have a single number, 43, and we would like to get it out to all the other processors. Now, there are a variety of ways of doing this, depending on the particular hardware you have to support it and the particular algorithms you want to use. This operation occurs so frequently in data parallel programming that it's worthwhile and easy to support it directly in hardware, and so it's not unusual to see a hardware bus of some kind to carry a single value out to all the multiple processors. On the other hand, there are other ways of doing it as well.

If there are many things to be copied rather than a single thing, a more complex algorithm is called for. Here, we have a typical example in which we have a vector which occupies a row of a matrix, and we would like to copy that vector to all the rows of the matrix. So we don't have a single quantity to be copied; rather, we have many quantities, but they are to be copied in a reasonably regular pattern. Now, one very simple way to do this, if each processor associated with a piece of data in this matrix is able to communicate with its nearest neighbors, is to copy the row step by step down the matrix, each row copying to its nearest neighbor, until the entire matrix is filled. This operation is called spreading. And here, we have managed to copy the first row down to all of the others in seven steps, because this is an 8 by 8 matrix and 7 is 8 minus 1.

But there are other ways of doing it which are faster, if you have additional communications links among the processors that you can exploit. Here, we will assume that a row of a matrix can be communicated not only to the next row down, but, in fact, to a row that is a power of two away. And we will see that we can do the operation much faster. First, we take the first row and copy it to the second row. But now, each of those rows can send copies to their neighbors that are two rows down. And then, each of those rows can send copies to neighbors that are four rows down. And by jumping in increasing steps like this, we can fill the array in only three steps, which is logarithmic in the size of the array, 3 being the base 2 logarithm of 8.
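A minimal sketch of that doubling spread, again simulated in NumPy rather than on real parallel hardware: each pass through the loop stands for one parallel time step in which every already-filled row sends a copy to a row a power of two below it, and the row values are made up.

```python
import numpy as np

n = 8
M = np.zeros((n, n), dtype=int)
M[0] = np.arange(10, 18)        # the row to be spread (made-up values)

# Doubling spread: each loop iteration is one parallel time step in
# which all currently filled rows copy themselves to the rows a fixed
# power-of-two distance below.  The filled region doubles each step,
# so an 8 x 8 matrix is filled in log2(8) = 3 steps.
filled = 1
while filled < n:
    copy = min(filled, n - filled)
    M[filled:filled + copy] = M[:copy]
    filled += copy
```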
And so we have another way of executing the same spread operation, one that is logarithmic in the number of processors rather than linear. And this is typical of a lot of interesting parallel operations that can be carried out in the data parallel style: you see logarithmic execution times popping up all over the place.

Reduction is another very common operation; it is essentially the inverse of broadcasting. In broadcasting, you take a single value and make copies for all of the processing elements. In the case of reduction, each processing element has a value, and you're trying to combine them all in some interesting way to produce a single result. Again, I've shown here how you might want to have some kind of hard-wired facility for doing such a reduction. In this picture, the eight processors shown at the bottom of the screen each have different numbers, and their sum is 27, so the number 27 shows up centrally. There are other ways of doing that as well.

Let us take a look at an algorithm that can be used to sum a vector in logarithmic time. We start out with a vector that has eight elements, x sub 0 through x sub 7. In the first step, we add them pairwise, so that x sub 0 and x sub 1 are added by the second processor, x sub 2 and x sub 3 by the fourth processor, x sub 4 and x sub 5 by the sixth processor, and x sub 6 and x sub 7 by the last processor. And so four of the processors, by doing these additions, have produced partial sums toward the work of computing the total sum. In the second step, we add these partial results pairwise again. And then, in the last step, we again add another pair, and we see that the last processor has received the sum of elements 0 through 7. Notice in this diagram that we've used the notation sigma with a lower and an upper bound to indicate exactly which elements have been summed so far.
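Here is one way to sketch that logarithmic summation in NumPy. The loop body is a single simulated parallel step, the element values are made up (chosen so that their total is 27, like the reduction picture above), and the total lands in the last position, just as it lands in the last processor in the diagram.

```python
import numpy as np

x = np.array([5, 1, 3, 4, 3, 6, 2, 3])   # made-up values whose sum is 27
n = len(x)

# Tree reduction in log2(n) parallel steps: at each step, the receiving
# processors add in the partial sum held `step` positions to their left,
# and the stride doubles, so after the last step the final processor
# holds the sum of the whole vector.
step = 1
while step < n:
    receivers = np.arange(2 * step - 1, n, 2 * step)   # active processors
    x[receivers] += x[receivers - step]                # one parallel addition
    step *= 2

print(x[-1])   # 27, after log2(8) = 3 steps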
Please also note that during most of the course of this algorithm, most of the processors have not been particularly busy. In the first set of additions, only half of the processors have been occupied doing additions; in the next step, only a quarter of them; and in the last step, only one eighth of them. And so we can see that while this is fast, we haven't made use of all of the processors. An interesting question arises: what would happen if you didn't turn off the processors and let some of the other processors also do additions? You would actually get another very interesting building block for data parallel algorithms, which is the parallel prefix operation.

We're going to carry out the same sort of computation again, but leave more processors active. We start out with x sub 0 through x sub 7 laid out. In the first step, we sum pairs, but notice that they're overlapping: we sum x sub 0 and x sub 1, we also sum x sub 1 and x sub 2, and so forth. And so each of the processors except the first has computed a sum. Then, in the next step, we compute more sums; all but the first two processors participate. And then, in the last step, all but the first four processors participate. The result has a very interesting property: each processor has received the sum of what it contained plus the contents of all the processors preceding it. In other words, if we regard the original input as an array, then we've computed the sums of all prefixes of the array, that is, of all initial segments of the array.

I sometimes call this the checkbook operation, because if the numbers that were in the array elements were a series of credits and debits against your checkbook, the checks you'd written and the deposits you'd made, then the result of the sum prefix operation is the series of running balances that should appear in your checkbook. And we will see examples of how this operation can be useful in other ways later, as a building block of other algorithms.
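A minimal NumPy sketch of that sum prefix, or scan, using the same doubling idea (again my own illustration): each pass through the loop is one simulated parallel step, and the zero padding stands for the processors near the front that have nothing to their left to add in.

```python
import numpy as np

x = np.array([5, 1, 3, 4, 3, 6, 2, 3])   # made-up credits and debits
n = len(x)

# Sum prefix (scan) by doubling: at step k, every processor except the
# first 2**k adds in the value held 2**k positions to its left.  After
# log2(n) steps, each processor holds the sum of its own value and all
# the values preceding it, the running balance of the checkbook.
sums = x.copy()
step = 1
while step < n:
    shifted = np.concatenate([np.zeros(step, dtype=sums.dtype), sums[:-step]])
    sums = sums + shifted        # one parallel step: all additions at once
    step *= 2

print(sums)                      # the same result as np.cumsum(x)
```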
A very simple example is an operation known as enumeration. In this operation, we wish to assign a different number to each processor, and this can be accomplished easily in two steps using building blocks we've already seen, the first being broadcasting. We take the number 1 and broadcast it to all the processors. We then apply a sum prefix operation, so that each processor receives the sum of the 1 that it contains plus all the 1s preceding it in the line. And the result is that each processor gets a different number, from 1 through 8. Sometimes this is called the bakery operation, because it's as if all these processors have walked into the bakery, each one has gone up to the little machine and taken a ticket, and each one has gotten a different number. But it's happened very fast, rather than sequentially, one at a time.

Another very important operation is various kinds of motion of data without performing any kind of arithmetic operation, and most of the interesting ones are permutations. Here's an example of shifting a linear array of data. This is an end-around shift by one place; one could imagine shifts of other distances. But in this case, the entire array has been shifted one place to the right, and the rightmost element has been brought back around to the front. And of course, one can do shifting not only on one-dimensional arrays, but also on two-dimensional arrays. You might have a two-dimensional matrix and want to shift it one position to the north, or three positions to the east, or 4,096 positions to the south.

Another kind of permutation is an odd-even swap, shown here for eight elements, in which A and B have exchanged places and C and D have exchanged places. One might also imagine more general forms of swap, where one swaps with a neighbor that is a power of two distance away.
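These regular data motions are easy to sketch in NumPy. The sketch below shows an end-around shift, a two-dimensional shift, and a distance 2-to-the-k swap built by pairing each index with the index that differs from it in bit k; the helper name `swap` and the data values are just made up for illustration.

```python
import numpy as np

v = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# End-around shift by one place: every element moves one position to
# the right, and the rightmost element wraps around to the front.
shifted = np.roll(v, 1)

# The same idea in two dimensions: shift a matrix one position "north"
# (toward lower row indices), with wraparound.
M = np.arange(16).reshape(4, 4)
north = np.roll(M, -1, axis=0)

def swap(a, k):
    # Distance 2**k swap: each element exchanges places with the
    # partner whose index differs from its own in bit k.
    partners = np.arange(len(a)) ^ (1 << k)
    return a[partners]

odd_even  = swap(v, 0)   # A and B exchange, C and D exchange, ...
distance2 = swap(v, 1)   # swap with the neighbor two positions away
```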
Permutations of this form are very often seen as steps in algorithms such as the fast Fourier transform and in various sorting algorithms. The hardware architecture that is capable of performing all distance 2-to-the-k swaps turns out to be equivalent to a hypercube, which is one of the reasons why the hypercube is very popular as an underlying hardware architecture for algorithms of this style: it can easily achieve many of the interesting permutations on data that are necessary building blocks for these algorithms.

Some algorithms call for performing irregular permutations on the data. Most often this comes up when the permutation to be performed is dependent upon the content of the data. In this graphic, we've illustrated a sorting algorithm. And in fact, here it's as if the sort has been computed magically all at once, and every processor has figured out exactly where to send its data. The data is sent from the top row to the bottom row, and, poof, it's ended up sorted. In practice, real sorting algorithms undergo a number of intermediate stages where they calculate intermediate places to send the data, and after a number of steps, the data ends up sorted. But the point is that the particular permutation involved is dependent upon the content of the data. Therefore, it cannot be predicted in advance, and therefore, it cannot be pre-wired into the hardware.

We can see this by considering an example from image processing. Let's suppose that we have an image of, say, 1,000 by 1,000 pixels, a picture that we've taken of a spaceship, and we want to do processing on this image, so we've assigned a processor to each of the pixels of the image. And the question is: where is the rocket ship? Well, some of the operations we need to perform on this image are local. And so we might focus in on a particular region of the image and have each processor look not only at its own pixel value, but also at the pixel values of its neighbors,
and from this determine, for example, whether a particular pixel is on the boundary of an object or somewhere in the middle. This is a local piece of processing. It involves a fixed pattern of communication: regular permutations, shifting the array back and forth so that each processor can look at its neighbor in a particular fixed direction. On the other hand, when it comes time to assemble this boundary information and put it together into a single global object, we can see that the particular patterns of communication are going to be dependent upon the content of the image data. That is, which processors have to communicate with which other ones, so that they can all agree that they are part of the image of the same object, depends upon where the rocket actually happened to be in the image.

Most of the building blocks that we've looked at so far concern themselves with operations on arrays, which are very rigidly organized kinds of data. It is also possible to have very irregularly organized kinds of data. For example, one might have a graph structure in which nodes are organized not by being neighbors within some fixed geometry, but by being connected through pointers instead. In this graphic, we have shown the nodes as being lined up side by side just for ease of seeing what's going on. But in fact, you should imagine them as being eight processors in completely different parts of the machine, known to each other only by an address that uniquely identifies each processor, and we've indicated these addresses by pointers. So we can see that the first processor knows the address of the second processor, and the second processor knows the address of the third.

And I had originally thought that nothing could be more essentially sequential than the need to follow a linked list of pointers, because you just can't find the third one without going through the second one. But I was wrong.
I forgot that there is processing power at each node of the linked list, and you can take advantage of that to do things faster. The most important technique is known as pointer doubling, and this is the pointer analog of the spreading operation that we looked at earlier, in which we made a copy of a vector into every row of a matrix in a logarithmic number of steps.

In the pointer doubling method, each processor first makes a copy of the pointer it has to its neighbor. Then, in a repeated series of steps, each processor looks at the processor that it's pointing to with this extra pointer and gets a copy of its pointer. So at the first step, each processor has a pointer to its neighbor one away in the linked chain. But as the first processor looks into the second processor and gets the pointer to the third processor, and each of the other processors does the same thing, we see that at the next step, each processor will have a pointer to a processor two steps away in the linked chain. And as this operation is repeated, each processor can then have a pointer to a processor four steps away in the chain, except that if you fall off the end of the chain, then you don't update your pointer. And if you keep doing this, then after a number of steps logarithmic in the length of the chain, every processor will have gotten a pointer to the end of the chain. And this happens in only a logarithmic number of steps rather than a linear number of steps. I think this is sort of an interesting and surprising result.
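Here is a minimal way to simulate pointer doubling in NumPy (my own sketch). The successor pointers are just indices into an array, the last node points to itself to mark the end of the chain, which makes the "don't update past the end" rule automatic, and each pass through the loop is one parallel step.

```python
import numpy as np

n = 8
# A linked list simulated with an array of successor addresses:
# processor i points to processor i + 1, and the last processor points
# to itself to mark the end of the chain.  (In general the list could
# be scattered arbitrarily; the index is simply a processor's address.)
ptr = np.arange(1, n + 1)
ptr[-1] = n - 1

# Pointer doubling: every processor replaces its pointer by a copy of
# its neighbor's pointer, so the distance covered doubles each step.
# The self-pointer at the end means falling off the chain is harmless.
for _ in range(int(np.ceil(np.log2(n)))):
    ptr = ptr[ptr]              # one parallel step for all processors

print(ptr)                      # every processor now points at the end
```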
Now, let's see how that can be used by considering nodes that have not only pointers to their successors, but also numbers. We're going to compute a parallel prefix, a set of partial sums, on this linked list, and the algorithm is very similar to the one we saw before on a vector, except that we're taking advantage of the pointer doubling technique.

At the first step, each processor simply takes a pointer to its neighbor. At the next step, each processor takes the value that it holds and adds it into the place pointed to. So, for example, in this diagram, the first processor is going to add its x sub 0 into the x sub 1 in the next processor, while the x sub 1 is going to be added into the x sub 2, and the x sub 2 into the x sub 3, all in parallel. And then, we take a pointer-doubling step, and the result looks like this. The first processor still contains x sub 0, which can also be written as sigma 0 to 0, the sum of x sub 0 through x sub 0, which is simply x sub 0. The second processor has the sum of x sub 0 through x sub 1, the third processor has x sub 1 through x sub 2, and so forth. And we've doubled the pointers.

Now, we do this again, and we see that, for example, the fourth processor now has the sum of x sub 0 through x sub 3, and the seventh processor has x sub 3 through x sub 6. Notice in this last step that the third processor will add its sigma 0 to 2 into the seventh processor's sigma 3 to 6; this will cause processor 7 to have the sum of 0 through 6. And after the third step, in fact, if you carefully trace through it all, you will find that each of the processors has gotten the sum of its own number plus all the preceding ones in the list, and again in a logarithmic number of time steps. So we see that we can do interesting things not only with arrays, but also with linked lists, in this data parallel programming style.
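Putting the two ideas together, here is a sketch, in the same simulated NumPy style, of the sum prefix over a linked list: at each step, every processor that has not yet fallen off the end adds its value into the node it points to and then doubles its pointer. A value of -1 marks the end of the chain, and the data values are the same made-up numbers as before.

```python
import numpy as np

n = 8
vals = np.array([5, 1, 3, 4, 3, 6, 2, 3])   # made-up x0 .. x7
nxt = np.arange(1, n + 1)
nxt[-1] = -1                                 # -1 marks the end of the list

# Sum prefix over a linked list: log2(n) parallel steps.  At each step,
# every still-active processor adds its value into the node it points
# to and then doubles its pointer.
for _ in range(int(np.ceil(np.log2(n)))):
    src = np.flatnonzero(nxt >= 0)           # processors not yet off the end
    dst = nxt[src]
    vals[dst] += vals[src]                   # all additions at once
    nxt[src] = nxt[dst]                      # pointer-doubling step

print(vals)     # each node: its own value plus all values preceding it
```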
Now, let's talk a little bit about the distinction between speed and efficiency. In sequential programming, usually these terms are considered synonymous: if a program is fast, then it must be efficient, and if it's efficient, it must be fast. But this coincidence of terms comes about only because you have a single processor to work with, you're trying to get the most horsepower you can out of it, and it's the only one doing things. In the parallel case, it's not so, because sometimes you can make a program go faster by, in some sense, doing extra work, more than you had to, taking advantage of the fact that you've got extra processors to work with.

Let's consider, for example, the algorithm that we saw earlier for summing a vector, and let's compare that with a serial algorithm. With a serial algorithm, you would write a loop, and within the loop, you would have indexed, subscripted accesses to the array. You set up an accumulator, initialize it to 0, or perhaps initialize it to the first element of the vector to save an addition, and then add in successive elements of the array. So you would do this using one processor, you would take N minus 1 time steps, and you would have to do N minus 1 pairwise additions, because that's the minimum number you can get away with. So we say that the work involved was N minus 1, that being the number of additions performed. The cost, which is the number of processors times the number of time steps, is N minus 1. And the efficiency, which is the amount of work you got done, the number of additions, divided by the cost, is 1.

Now, let's compare that with the parallel algorithm. In the parallel vector sum algorithm, we used N processors to do it, one for each data element. We did it in log N time steps, and we still accomplished N minus 1 additions. The cost of this, the number of processors times the number of time steps, instead of being N minus 1, was N log N. We used a lot of processors, and as we noted earlier, most of the processors weren't busy most of the time, and so we wasted a lot of cycles. And so the efficiency, which is the number of additions divided by the cost, was approximately 1 over log N. So as the size of the problem gets bigger, the efficiency goes down. But still, the speed compares quite favorably with the serial algorithm: log N versus N minus 1.

Now, let's compare this to the vector sum prefix algorithm.
Again, consider a serial version. Just as the data parallel vector sum prefix looked like the data parallel sum reduction, so the serial prefix algorithm looks like the serial sum algorithm, except that you save out all the partial results. You start with the first element of the array and you save it. Then, you add in the next element and you save that sum; you add in the next element and you save that sum. And so after N minus 1 time steps and N minus 1 additions, you've computed the sum prefix, with the additional cost of simply storing the results out to memory. And so, again, you've achieved N minus 1 additions at a cost of N minus 1, so the efficiency is 1.

Now, let's compare that to the parallel version. We've used N processors and done it in log N time steps. But the number of additions is much greater: if you work out the numbers, it turns out to be roughly N times (log N minus 1). Because you used N processors, used them for log N time steps, and kept nearly all of them busy, you did about N times (log N minus 1) additions at a cost of N log N. Dividing those, the efficiency was (log N minus 1) over log N, which as N gets very large is very close to 1. So we can claim that this is a very efficient algorithm, because as N gets large, its efficiency approaches 1.

And yet, this efficiency is somehow bogus. We achieved it by keeping the processors busy doing more than they really had to do, in some sense, because only N minus 1 additions are really required to compute a sum prefix. On the other hand, it appears that more than that are required in order to do it fast. And so we have this curious trade-off: as the speed goes up, the efficiency, in some sense, goes down, but that's masked in this particular efficiency measure by the fact that we actually did more additions than we really had to. So the business of measuring the speed of a parallel algorithm and the efficiency of a parallel algorithm is a very tricky business. And in fact, I think the particular measures that I used in these graphics are a bit naive, and we need to develop a better theory of how to measure the goodness, the efficiency, and the speed of parallel algorithms.
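To make the arithmetic concrete, here is a tiny sketch that plugs the operation counts quoted above into the definitions of cost and efficiency, for a made-up problem size of N = 1024 (so log N = 10). The exact addition count used for the doubling scan, N log N minus (N minus 1), is my own tally of the steps described earlier, consistent with the rough N times (log N minus 1) figure.

```python
from math import log2

def efficiency(work, processors, steps):
    # cost = processors * time steps; efficiency = useful work / cost
    return work / (processors * steps)

N = 1024
lgN = int(log2(N))

# Serial sum (and serial sum prefix): 1 processor, N-1 steps, N-1 adds.
print(efficiency(N - 1, 1, N - 1))              # 1.0

# Parallel sum (tree reduction): N processors, log N steps, N-1 adds.
print(efficiency(N - 1, N, lgN))                # about 1 / log N  (~0.0999)

# Parallel sum prefix (doubling scan): N processors, log N steps,
# and N*log N - (N - 1) adds, roughly N*(log N - 1).
print(efficiency(N * lgN - (N - 1), N, lgN))    # about (log N - 1)/log N (~0.900)
```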
We've seen a useful set of building blocks for constructing data parallel algorithms. Now, let's build some algorithms with those building blocks.

For our first example, let's consider matrix multiplication. This can be done in a large variety of ways, but abstractly, there's a single operation to be performed: given two matrices, we regard one as being composed of rows and the other as being composed of columns, and each row of the first must meet every column of the second. Each interaction of a row and a column produces one element of the result array, by performing an inner product on the two vectors, the row and the column.

One way of doing this very quickly, by a sort of brute-force approach, is to use on the order of n cubed processors. We'll assume that each of these three matrices, the first source, the second source, and the result, is an n by n matrix. We will use n cubed processors, and we will organize the matrices by putting the first source on one face of the cube, the second source on a different face, and having the result come out a third face. And we will see that we can do this very quickly, in a fairly small number of steps.

The first thing we will do is make copies of the first source array, using a spread operation to replicate that matrix through the cube. Then, we will do the same thing with the second source, making copies and spreading those down the cube. Each of these spreading operations takes a logarithmic number of steps, so, so far, we've used O of log n time. Next, we do a lot of elementwise operations. You can see 16 of the multiplications here in this graphic, shown on the outer face of the cube.
719 00:30:21,620 --> 00:30:23,780 But in fact, these elementwise operations 720 00:30:23,780 --> 00:30:25,190 are happening throughout the cube 721 00:30:25,190 --> 00:30:29,000 and so we have n cubed multiplications going on 722 00:30:29,000 --> 00:30:30,750 in a constant amount of time because we're 723 00:30:30,750 --> 00:30:34,670 using O of n cubed processors. 724 00:30:34,670 --> 00:30:37,880 Then, in the last step, we perform a parallel sum 725 00:30:37,880 --> 00:30:41,060 operation using the doubling reduction method 726 00:30:41,060 --> 00:30:44,210 that we saw earlier on each of the rows 727 00:30:44,210 --> 00:30:46,760 coming up through the array toward the result matrix. 728 00:30:46,760 --> 00:30:49,140 And that takes a logarithm amount of time. 729 00:30:49,140 --> 00:30:51,350 And so, in four simple steps, we've 730 00:30:51,350 --> 00:30:54,510 managed to multiply two n by n matrices in log n time, 731 00:30:54,510 --> 00:30:59,060 but at the cost of using O of n cubed processors. 732 00:30:59,060 --> 00:31:01,958 It should also be pointed out that if the very next thing we 733 00:31:01,958 --> 00:31:03,500 wanted to do was to take this product 734 00:31:03,500 --> 00:31:05,930 and add it to one of the source of arrays, 735 00:31:05,930 --> 00:31:07,730 the result is in the wrong place and we're 736 00:31:07,730 --> 00:31:09,573 going to incur yet an additional cost 737 00:31:09,573 --> 00:31:11,240 to move it to the right place so that we 738 00:31:11,240 --> 00:31:13,022 could perform that addition. 739 00:31:13,022 --> 00:31:14,480 And so that should also be factored 740 00:31:14,480 --> 00:31:15,772 into the cost of the operation. 741 00:31:15,772 --> 00:31:18,700 742 00:31:18,700 --> 00:31:21,820 Here is another method for doing matrix multiplication that only 743 00:31:21,820 --> 00:31:24,760 requires n squared processors. 744 00:31:24,760 --> 00:31:28,720 In this arrangement, we take the two source arrays 745 00:31:28,720 --> 00:31:31,513 and put them in the same set of n squared processors. 746 00:31:31,513 --> 00:31:33,430 And the result will also show up in the same n 747 00:31:33,430 --> 00:31:35,593 by n matrix of processors. 748 00:31:35,593 --> 00:31:37,510 And we're going to use a very clever technique 749 00:31:37,510 --> 00:31:41,870 due to Cannon, which involves pre-skewing the two source 750 00:31:41,870 --> 00:31:42,370 arrays. 751 00:31:42,370 --> 00:31:44,658 The first source array has its rows skewed 752 00:31:44,658 --> 00:31:47,200 and you'll see that each row is skewed by a different amount, 753 00:31:47,200 --> 00:31:49,520 depending on its row number. 754 00:31:49,520 --> 00:31:52,540 And similarly, the columns of the second array 755 00:31:52,540 --> 00:31:54,550 are skewed depending on their column number. 756 00:31:54,550 --> 00:31:57,315 757 00:31:57,315 --> 00:31:59,690 Now, we will take these to skewed arrays and overlay them 758 00:31:59,690 --> 00:32:02,810 and the result looks like this. 759 00:32:02,810 --> 00:32:06,560 The essence of the algorithm-- this is a systolic algorithm-- 760 00:32:06,560 --> 00:32:10,190 is to rotate both of the source matrices at the same time. 761 00:32:10,190 --> 00:32:13,400 We're going to rotate the first source matrix horizontally 762 00:32:13,400 --> 00:32:17,250 and the second one vertically like this. 
763 00:32:17,250 --> 00:32:20,060 And so at each time step, each of the two source arrays 764 00:32:20,060 --> 00:32:22,460 will be shifted by one position and the result 765 00:32:22,460 --> 00:32:23,660 looks something like this. 766 00:32:23,660 --> 00:32:27,840 767 00:32:27,840 --> 00:32:31,382 And so we proceed for four time steps, where at each step, 768 00:32:31,382 --> 00:32:33,590 the two source elements that arrive at the processor 769 00:32:33,590 --> 00:32:36,403 are multiplied together and added into a running sum. 770 00:32:36,403 --> 00:32:37,820 And magically, the result has been 771 00:32:37,820 --> 00:32:39,893 computed after n time steps. 772 00:32:39,893 --> 00:32:41,810 Now, let's look at that a little more closely. 773 00:32:41,810 --> 00:32:44,690 Pay attention to the intersection point 774 00:32:44,690 --> 00:32:46,820 at the upper left. 775 00:32:46,820 --> 00:32:50,660 You will see that at the first time step, 776 00:32:50,660 --> 00:32:54,040 the second element of the first row and the second element 777 00:32:54,040 --> 00:32:56,800 of the first column have met in the upper left corner. 778 00:32:56,800 --> 00:32:59,560 They are then multiplied and accumulated. 779 00:32:59,560 --> 00:33:02,260 Then, at the next time step, the third elements 780 00:33:02,260 --> 00:33:04,720 of the row and column meet, are multiplied together, 781 00:33:04,720 --> 00:33:06,190 and added in. 782 00:33:06,190 --> 00:33:09,250 At the next time step, the fourth elements 783 00:33:09,250 --> 00:33:11,350 of the first row and the first column meet. 784 00:33:11,350 --> 00:33:14,230 And at the last step, finally, the first elements 785 00:33:14,230 --> 00:33:15,910 meet and are accumulated. 786 00:33:15,910 --> 00:33:18,370 And the same thing is going on at all the other points 787 00:33:18,370 --> 00:33:20,020 of the matrix. 788 00:33:20,020 --> 00:33:22,120 And the purpose of the pre-skewing 789 00:33:22,120 --> 00:33:24,970 is to ensure that each element of each row 790 00:33:24,970 --> 00:33:27,250 meets the corresponding element of each column at just 791 00:33:27,250 --> 00:33:28,700 the right time. 792 00:33:28,700 --> 00:33:30,310 And if you don't believe that, you'll just have to trust me. 793 00:33:30,310 --> 00:33:30,820 It works. 794 00:33:30,820 --> 00:33:33,040 It's really amazing. 795 00:33:33,040 --> 00:33:35,100 And so the net effect is that using only n 796 00:33:35,100 --> 00:33:39,923 squared processors, we've managed to do it in time n. 797 00:33:39,923 --> 00:33:42,090 And an additional benefit is that the result matrix ends up 798 00:33:42,090 --> 00:33:43,860 in the right place if you want to do yet other things 799 00:33:43,860 --> 00:33:44,860 with the other matrices. 800 00:33:44,860 --> 00:33:47,380 801 00:33:47,380 --> 00:33:49,660 Finally, let us consider a really big example. 802 00:33:49,660 --> 00:33:52,780 We're going to go back to the question of labeling regions 803 00:33:52,780 --> 00:33:53,910 in an image. 804 00:33:53,910 --> 00:33:55,660 Here instead of showing you a rocket ship, 805 00:33:55,660 --> 00:33:57,580 I'm showing you a somewhat more abstract image 806 00:33:57,580 --> 00:33:59,245 in order to keep the example small. 807 00:33:59,245 --> 00:34:01,120 This is one of the difficulties of describing 808 00:34:01,120 --> 00:34:02,890 data parallel algorithms: they 809 00:34:02,890 --> 00:34:05,650 are inherently oriented 810 00:34:05,650 --> 00:34:06,920 toward large amounts of data.
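As a rough illustration of the two matrix multiplication schemes just described, here is a minimal sketch in Python with NumPy that simulates the parallel steps with whole-array operations: broadcasting stands in for the spread steps of the O of n cubed processor method, and np.roll stands in for Cannon's pre-skewing and systolic rotations. The function names and the NumPy framing are illustrative assumptions, not part of the lecture itself.

import numpy as np

def brute_force_matmul(A, B):
    # O(n^3)-"processor" method: spread both sources through an n x n x n cube,
    # multiply elementwise, then sum-reduce along the shared dimension k.
    n = A.shape[0]
    A3 = np.broadcast_to(A[:, :, None], (n, n, n))  # A[i, k] spread over j
    B3 = np.broadcast_to(B[None, :, :], (n, n, n))  # B[k, j] spread over i
    P = A3 * B3                      # n^3 elementwise multiplications at once
    return P.sum(axis=1)             # parallel-sum reduction along k

def cannon_matmul(A, B):
    # Cannon's systolic method on an n x n grid: pre-skew both sources, then
    # perform n rotate-multiply-accumulate steps.  np.roll plays the role of
    # the cyclic shifts performed by the processor grid.
    n = A.shape[0]
    A = np.stack([np.roll(A[i], -i) for i in range(n)])             # skew rows
    B = np.stack([np.roll(B[:, j], -j) for j in range(n)], axis=1)  # skew columns
    C = np.zeros_like(A)
    for _ in range(n):               # n systolic time steps
        C += A * B                   # each grid cell multiplies and accumulates
        A = np.roll(A, -1, axis=1)   # rotate the first source horizontally
        B = np.roll(B, -1, axis=0)   # rotate the second source vertically
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.integers(0, 5, (4, 4))
    B = rng.integers(0, 5, (4, 4))
    assert np.array_equal(brute_force_matmul(A, B), A @ B)
    assert np.array_equal(cannon_matmul(A, B), A @ B)

Note that the sketch serializes what the lecture assumes happens in parallel; it is meant only to make the data movement concrete.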
811 00:34:06,920 --> 00:34:08,199 But if I try to show you a large amount of data, 812 00:34:08,199 --> 00:34:09,324 it won't fit on the screen. 813 00:34:09,324 --> 00:34:10,900 So I'm going to do my best here. 814 00:34:10,900 --> 00:34:14,830 We have a number of regions in this image shown here 815 00:34:14,830 --> 00:34:15,699 by different colors. 816 00:34:15,699 --> 00:34:17,199 We see a large central green region. 817 00:34:17,199 --> 00:34:19,960 This little squiggly reddish orange region up 818 00:34:19,960 --> 00:34:21,219 in the left corner. 819 00:34:21,219 --> 00:34:23,469 And notice that not every region is a different color. 820 00:34:23,469 --> 00:34:25,870 There may be several regions that are the same color. 821 00:34:25,870 --> 00:34:27,370 But as long as they're disconnected, 822 00:34:27,370 --> 00:34:30,239 we want to consider them to be separate regions. 823 00:34:30,239 --> 00:34:32,489 We would like to compute a result that looks something 824 00:34:32,489 --> 00:34:35,489 like this in which each region has been 825 00:34:35,489 --> 00:34:36,989 assigned a distinct number. 826 00:34:36,989 --> 00:34:39,239 We don't particularly care which number gets assigned 827 00:34:39,239 --> 00:34:41,580 to it as long as each region gets a different number 828 00:34:41,580 --> 00:34:44,820 and all contiguous elements of a single region 829 00:34:44,820 --> 00:34:46,270 have received the same number. 830 00:34:46,270 --> 00:34:48,780 So we can see here that in the sample result, 831 00:34:48,780 --> 00:34:52,170 the central green region has had all its pixels assigned the value 832 00:34:52,170 --> 00:34:53,130 19. 833 00:34:53,130 --> 00:34:55,230 And the squiggly region in the upper left corner 834 00:34:55,230 --> 00:34:58,230 has received the value 0 in all its pixels. 835 00:34:58,230 --> 00:35:01,180 The region in the upper right corner, 836 00:35:01,180 --> 00:35:03,750 although the same initial color as the central region, 837 00:35:03,750 --> 00:35:05,083 has received a different number. 838 00:35:05,083 --> 00:35:09,760 It's gotten all fives instead of 19s because it's disjoint. 839 00:35:09,760 --> 00:35:11,240 And we proceed in a number of steps. 840 00:35:11,240 --> 00:35:14,080 And at each step, we'll see how the particular themes 841 00:35:14,080 --> 00:35:17,170 and building blocks we've discussed fit together to make 842 00:35:17,170 --> 00:35:18,645 an interesting algorithm. 843 00:35:18,645 --> 00:35:20,020 The first thing we're going to do 844 00:35:20,020 --> 00:35:22,230 is assign a different number to each processor. 845 00:35:22,230 --> 00:35:24,280 And in fact, here, I've shown the numbers 846 00:35:24,280 --> 00:35:27,220 as progressing across the rows and then down the matrix. 847 00:35:27,220 --> 00:35:29,500 But for the purposes of this algorithm, 848 00:35:29,500 --> 00:35:31,630 any organization would do, provided 849 00:35:31,630 --> 00:35:34,510 we can reasonably quickly assign a different number 850 00:35:34,510 --> 00:35:35,617 to each processor. 851 00:35:35,617 --> 00:35:37,450 And we've seen how the enumeration technique 852 00:35:37,450 --> 00:35:39,800 can do this in a logarithmic number of time steps. 853 00:35:39,800 --> 00:35:43,910 So we'll assume this particular numbering of the processors. 854 00:35:43,910 --> 00:35:48,910 The next thing we do is to have each of the pixels in the image 855 00:35:48,910 --> 00:35:51,850 examine the pixel values of its neighbors.
856 00:35:51,850 --> 00:35:54,460 And this is easily accomplished by using regular permutations, 857 00:35:54,460 --> 00:35:56,650 namely shifts of the matrix. 858 00:35:56,650 --> 00:35:58,270 We take the matrix and we shift it up, 859 00:35:58,270 --> 00:36:00,640 we shift it down, left and right, and also 860 00:36:00,640 --> 00:36:04,180 to the northeast, northwest, southeast, and southwest. 861 00:36:04,180 --> 00:36:06,670 And as a result of these shifting operations, 862 00:36:06,670 --> 00:36:08,710 each processor now knows its own pixel value 863 00:36:08,710 --> 00:36:11,480 plus the pixel values of its eight neighbors. 864 00:36:11,480 --> 00:36:13,910 This is then enough information for each processor 865 00:36:13,910 --> 00:36:16,730 to do elementwise computations and decide-- 866 00:36:16,730 --> 00:36:19,730 each processor can decide whether its pixel 867 00:36:19,730 --> 00:36:22,400 is on the border of its region or not. 868 00:36:22,400 --> 00:36:23,780 There are lots of special cases. 869 00:36:23,780 --> 00:36:25,460 And the details get rather messy and I'm not 870 00:36:25,460 --> 00:36:27,877 going to talk about them too much here because they're not 871 00:36:27,877 --> 00:36:30,290 central to the ideas involved. 872 00:36:30,290 --> 00:36:32,165 But be aware that there are messy details. 873 00:36:32,165 --> 00:36:34,830 874 00:36:34,830 --> 00:36:38,430 In this example, we have darkened the border squares 875 00:36:38,430 --> 00:36:39,510 so you can see them. 876 00:36:39,510 --> 00:36:41,730 And the next several computations to be carried out 877 00:36:41,730 --> 00:36:44,310 will be carried out only by pixel processors that 878 00:36:44,310 --> 00:36:45,263 are on the border. 879 00:36:45,263 --> 00:36:47,430 So this will be an example of conditional operation: 880 00:36:47,430 --> 00:36:49,950 only some of the processors will be participating. 881 00:36:49,950 --> 00:36:53,550 882 00:36:53,550 --> 00:36:56,610 We have each processor, again, consider the pixel values 883 00:36:56,610 --> 00:36:58,650 that came from its neighbors. 884 00:36:58,650 --> 00:37:01,730 And also inquire, again using shifting, 885 00:37:01,730 --> 00:37:03,480 whether its neighbors are border elements. 886 00:37:03,480 --> 00:37:06,220 This is, then, enough information to figure out 887 00:37:06,220 --> 00:37:08,230 which of your neighbors are border elements 888 00:37:08,230 --> 00:37:11,720 in the same region so that you can construct pointers to them. 889 00:37:11,720 --> 00:37:14,890 And so what we've done here is to stitch together 890 00:37:14,890 --> 00:37:16,600 the border of each region into a linked 891 00:37:16,600 --> 00:37:20,030 list that runs around the perimeter of the region. 892 00:37:20,030 --> 00:37:22,600 So you can see a linked list running around the perimeter 893 00:37:22,600 --> 00:37:23,998 of the central green region. 894 00:37:23,998 --> 00:37:25,540 And you can see a linked list running 895 00:37:25,540 --> 00:37:27,370 through the squiggly region in the upper left. 896 00:37:27,370 --> 00:37:29,412 And each region has had its boundary connected up 897 00:37:29,412 --> 00:37:30,520 with linked pointers. 898 00:37:30,520 --> 00:37:35,010 We are now going to use pointer doubling algorithms. 899 00:37:35,010 --> 00:37:37,020 Each pixel processor considers the number 900 00:37:37,020 --> 00:37:40,730 that it was assigned in the enumeration step.
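To make the enumeration and neighbor-examination steps concrete, here is a small sketch in Python with NumPy; np.roll stands in for the eight regular-permutation shifts, and the border test is deliberately simplified (edges wrap around, and the messy special cases mentioned above are ignored). The names and the NumPy framing are assumptions made for illustration only.

import numpy as np

def border_mask(pixels):
    # Mark a pixel as a border pixel if any of the eight shifted copies of the
    # image brings a different pixel value into its position.
    border = np.zeros(pixels.shape, dtype=bool)
    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1),
                   (-1, -1), (-1, 1), (1, -1), (1, 1)]:
        neighbor = np.roll(pixels, (di, dj), axis=(0, 1))  # one regular shift
        border |= neighbor != pixels                       # elementwise compare
    return border

if __name__ == "__main__":
    image = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [3, 3, 3, 2],
                      [3, 3, 3, 2]])
    ids = np.arange(image.size).reshape(image.shape)  # enumeration: one distinct number per processor
    print(ids)
    print(border_mask(image).astype(int))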
901 00:37:40,730 --> 00:37:43,040 And the pixels that are not on the borders 902 00:37:43,040 --> 00:37:47,340 do not participate in this consideration. 903 00:37:47,340 --> 00:37:51,060 We then use a pointer doubling algorithm, first of all, 904 00:37:51,060 --> 00:37:54,540 to do a reduction step using not summation, 905 00:37:54,540 --> 00:37:56,257 but the minimum operation. 906 00:37:56,257 --> 00:37:58,840 Min is as good an operation for combining numbers as summation 907 00:37:58,840 --> 00:38:00,090 is for this purpose. 908 00:38:00,090 --> 00:38:02,770 And pointer doubling works just fine for that. 909 00:38:02,770 --> 00:38:07,480 So each linked border list will perform pointer doubling 910 00:38:07,480 --> 00:38:10,330 around that list and determine what is the smallest 911 00:38:10,330 --> 00:38:11,890 number in that list. 912 00:38:11,890 --> 00:38:14,530 And then, by using another pointer doubling step, 913 00:38:14,530 --> 00:38:16,900 send that smallest number and make copies of it 914 00:38:16,900 --> 00:38:18,190 all around the perimeter. 915 00:38:18,190 --> 00:38:21,448 So now, every border pixel knows the smallest number 916 00:38:21,448 --> 00:38:23,240 that was on a border element in its region. 917 00:38:23,240 --> 00:38:24,795 So for example, the number 19 has 918 00:38:24,795 --> 00:38:26,170 been propagated around the border 919 00:38:26,170 --> 00:38:27,460 of the central green region. 920 00:38:27,460 --> 00:38:29,477 And the number 0 has been propagated down 921 00:38:29,477 --> 00:38:31,060 the squiggly region in the upper left. 922 00:38:31,060 --> 00:38:33,570 923 00:38:33,570 --> 00:38:35,670 Finally, we can use scan operations, not 924 00:38:35,670 --> 00:38:38,580 on the linked lists anymore, we can abandon those now, 925 00:38:38,580 --> 00:38:40,513 but on the columns or, what 926 00:38:40,513 --> 00:38:42,930 amounts to the same thing, the rows-- either one will do. 927 00:38:42,930 --> 00:38:45,660 By doing a scan operation in each direction, 928 00:38:45,660 --> 00:38:51,350 we can copy the processor labels from the borders 929 00:38:51,350 --> 00:38:53,490 to the interior points of the region. 930 00:38:53,490 --> 00:38:55,400 So for example, in the central green region, 931 00:38:55,400 --> 00:38:57,740 it actually suffices to copy from the upper edge 932 00:38:57,740 --> 00:39:00,380 of the region downward, thereby propagating the label 933 00:39:00,380 --> 00:39:01,285 into the interior. 934 00:39:01,285 --> 00:39:02,660 Other regions, particularly those 935 00:39:02,660 --> 00:39:03,650 that are on the edge of the image, 936 00:39:03,650 --> 00:39:05,775 may need the numbers propagated up instead of down. 937 00:39:05,775 --> 00:39:08,390 So you do a scan in both directions doing copying. 938 00:39:08,390 --> 00:39:10,850 And this is done by essentially a variant 939 00:39:10,850 --> 00:39:12,590 of the parallel prefix operation. 940 00:39:12,590 --> 00:39:15,110 But instead of using an operation such as summing 941 00:39:15,110 --> 00:39:18,390 or min, it uses a more complicated operation that 942 00:39:18,390 --> 00:39:20,390 is, in fact, not commutative, but fairly simple, 943 00:39:20,390 --> 00:39:23,497 which is essentially: if you've come across a new number, 944 00:39:23,497 --> 00:39:25,080 take a copy of that and otherwise copy 945 00:39:25,080 --> 00:39:28,120 the old number that came in from your neighbor. 946 00:39:28,120 --> 00:39:29,370 But it's a legitimate variant.
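Here, as a rough sketch under simplifying assumptions, is what the pointer doubling step with min and the copying scan might look like in Python with NumPy. On a cyclic border list the minimum reaches every element after about log2 of the list length doubling steps, so this simulation folds the reduce and copy-around phases described above into one loop; the function names are invented for illustration.

import numpy as np

def pointer_double_min(nxt, val):
    # Pointer doubling (pointer jumping) with min as the combining operation.
    # nxt[i] is the successor of element i on its cyclic border list, and
    # val[i] is the number assigned to it in the enumeration step.
    nxt = np.array(nxt)
    val = np.array(val)
    for _ in range(max(1, int(np.ceil(np.log2(len(val)))))):
        val = np.minimum(val, val[nxt])  # combine with the value 2^k links away
        nxt = nxt[nxt]                   # double the pointer distance
    return val

def copy_scan_down(labels):
    # Column-wise "copy" scan: a cell keeps its own label if it has one
    # (>= 0), otherwise it copies the most recent label from above.  This is
    # a sequential stand-in for the non-commutative parallel prefix variant.
    out = labels.copy()
    for i in range(1, out.shape[0]):
        out[i] = np.where(out[i] >= 0, out[i], out[i - 1])
    return out

if __name__ == "__main__":
    nxt = [1, 2, 0, 4, 5, 3]       # two disjoint cyclic border lists
    ids = [19, 23, 27, 5, 9, 7]    # hypothetical enumeration numbers
    print(pointer_double_min(nxt, ids))  # prints [19 19 19  5  5  5]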
947 00:39:29,370 --> 00:39:32,070 948 00:39:32,070 --> 00:39:33,900 This method of labeling regions in an image 949 00:39:33,900 --> 00:39:36,980 is known as Lim's method, after Willie Lim. 950 00:39:36,980 --> 00:39:38,750 And if you use n squared processors 951 00:39:38,750 --> 00:39:41,180 to represent the image as an n by n array, 952 00:39:41,180 --> 00:39:45,070 it accomplishes the labeling in log n time. 953 00:39:45,070 --> 00:39:47,440 That's because each of the steps involved 954 00:39:47,440 --> 00:39:50,293 took either constant time or time logarithmic 955 00:39:50,293 --> 00:39:51,460 in the number of processors, 956 00:39:51,460 --> 00:39:53,410 given that you had n squared processors 957 00:39:53,410 --> 00:39:56,990 to process n squared pixels. 958 00:39:56,990 --> 00:39:58,550 Data parallel programming makes it 959 00:39:58,550 --> 00:40:03,440 easy to organize computations on large quantities of data 960 00:40:03,440 --> 00:40:06,170 for massively parallel computer systems. 961 00:40:06,170 --> 00:40:07,850 It differs from sequential programming 962 00:40:07,850 --> 00:40:09,740 in that its emphasis is on operations 963 00:40:09,740 --> 00:40:13,010 on entire collections of data rather than single elements 964 00:40:13,010 --> 00:40:14,840 of the data one at a time. 965 00:40:14,840 --> 00:40:16,550 In a data parallel program, you typically 966 00:40:16,550 --> 00:40:19,790 find fewer loops and fewer subscripted references 967 00:40:19,790 --> 00:40:22,490 to arrays than you do in a sequential program. 968 00:40:22,490 --> 00:40:24,530 On the other hand, data parallel programs 969 00:40:24,530 --> 00:40:27,740 are typically very similar to sequential programs 970 00:40:27,740 --> 00:40:31,230 in containing a single thread of control. 971 00:40:31,230 --> 00:40:32,760 At any given point in time, there's 972 00:40:32,760 --> 00:40:35,820 conceptually execution control at only one 973 00:40:35,820 --> 00:40:37,133 point in the program text. 974 00:40:37,133 --> 00:40:39,300 Sometimes this restriction is loosened a little bit, 975 00:40:39,300 --> 00:40:42,240 but it's characteristic of the data parallel programming style. 976 00:40:42,240 --> 00:40:44,610 This can make data parallel programs easier 977 00:40:44,610 --> 00:40:48,570 to understand than control parallel programs. 978 00:40:48,570 --> 00:40:51,120 In order to write good data parallel programs, 979 00:40:51,120 --> 00:40:53,640 we must become familiar with the necessary building 980 00:40:53,640 --> 00:40:56,360 blocks for the construction of data parallel algorithms. 981 00:40:56,360 --> 00:40:58,110 And we need to have a good idea as to what 982 00:40:58,110 --> 00:41:00,360 the relative costs are. 983 00:41:00,360 --> 00:41:02,868 Given one processor per data element, 984 00:41:02,868 --> 00:41:04,410 there are many interesting operations 985 00:41:04,410 --> 00:41:06,300 that can be performed in constant time 986 00:41:06,300 --> 00:41:08,935 on an entire array of data and other operations 987 00:41:08,935 --> 00:41:10,560 which take a logarithmic amount of time 988 00:41:10,560 --> 00:41:13,800 or perhaps a linear amount of time in the amount of data. 989 00:41:13,800 --> 00:41:16,500 And this depends not only on the inherent complexity 990 00:41:16,500 --> 00:41:19,410 of the operation, but also sometimes upon the underlying 991 00:41:19,410 --> 00:41:20,550 implementation.
992 00:41:20,550 --> 00:41:22,440 If a particular piece of hardware 993 00:41:22,440 --> 00:41:24,150 doesn't support sufficient connectivity 994 00:41:24,150 --> 00:41:26,190 among the processors, for example, 995 00:41:26,190 --> 00:41:27,990 a communications-bound operation 996 00:41:27,990 --> 00:41:31,177 may take more time than would otherwise be required. 997 00:41:31,177 --> 00:41:33,010 Once you become familiar with these building 998 00:41:33,010 --> 00:41:34,810 blocks and learn how to fit them together 999 00:41:34,810 --> 00:41:37,360 in standard and conventional ways, 1000 00:41:37,360 --> 00:41:39,910 writing a data parallel program is 1001 00:41:39,910 --> 00:41:42,790 just as easy as and just as hard as writing 1002 00:41:42,790 --> 00:41:44,560 a sequential program. 1003 00:41:44,560 --> 00:41:46,300 And given suitable underlying hardware, 1004 00:41:46,300 --> 00:41:48,145 your programs may run much faster. 1005 00:41:48,145 --> 00:41:51,080 1006 00:41:51,080 --> 00:41:53,660 I'm Bert Halstead, DEC Cambridge Research Lab. 1007 00:41:53,660 --> 00:41:55,580 I thought your examples were very interesting, 1008 00:41:55,580 --> 00:41:58,280 but I was curious whether you ever get into problems when 1009 00:41:58,280 --> 00:42:01,340 you have highly data dependent computations 1010 00:42:01,340 --> 00:42:04,880 and it's difficult to keep more than a very small fraction 1011 00:42:04,880 --> 00:42:07,640 of the processors actually doing the same operation 1012 00:42:07,640 --> 00:42:09,890 at the same time. 1013 00:42:09,890 --> 00:42:10,403 Yes, you do. 1014 00:42:10,403 --> 00:42:11,820 That's a very interesting problem. 1015 00:42:11,820 --> 00:42:14,540 And that's one reason for making the distinction 1016 00:42:14,540 --> 00:42:17,750 between the data parallel style and the question of whether you've 1017 00:42:17,750 --> 00:42:21,350 got SIMD or MIMD style hardware supporting it underneath. 1018 00:42:21,350 --> 00:42:24,020 It's perfectly possible to have hardware 1019 00:42:24,020 --> 00:42:26,060 that is capable of giving you some more 1020 00:42:26,060 --> 00:42:29,090 choice than doing the same thing at the same time. 1021 00:42:29,090 --> 00:42:33,530 Now, precisely what the best method is for designing a total system 1022 00:42:33,530 --> 00:42:35,960 that gives you flexibility without making it extremely 1023 00:42:35,960 --> 00:42:38,210 difficult to control is, I think, still an open research 1024 00:42:38,210 --> 00:42:39,920 question. 1025 00:42:39,920 --> 00:42:41,850 I'm Franklyn Turbak from MIT. 1026 00:42:41,850 --> 00:42:44,780 You indicated that a number of your algorithms 1027 00:42:44,780 --> 00:42:48,860 took time that was logarithmic in the size of the problem. 1028 00:42:48,860 --> 00:42:50,990 And this seems to be based on the assumption 1029 00:42:50,990 --> 00:42:54,860 that you actually have a large enough number of processors 1030 00:42:54,860 --> 00:42:56,960 to match the size of your problem. 1031 00:42:56,960 --> 00:42:58,880 Yet, for any real machine that you're 1032 00:42:58,880 --> 00:43:00,890 going to be running these algorithms on, 1033 00:43:00,890 --> 00:43:04,670 there's going to be, in fact, a limited number of processors. 1034 00:43:04,670 --> 00:43:06,950 And if the size of your problem exceeds 1035 00:43:06,950 --> 00:43:10,820 the number of processors, it seems that the logarithmic time 1036 00:43:10,820 --> 00:43:12,320 growth is no longer justified. 1037 00:43:12,320 --> 00:43:13,912 Can you explain that, please?
1038 00:43:13,912 --> 00:43:16,370 You're absolutely correct and I'm glad you brought that up. 1039 00:43:16,370 --> 00:43:18,030 There is no free lunch. 1040 00:43:18,030 --> 00:43:19,683 And for a fixed amount of hardware, 1041 00:43:19,683 --> 00:43:21,350 making a problem bigger is going to make 1042 00:43:21,350 --> 00:43:22,517 the whole thing run slower. 1043 00:43:22,517 --> 00:43:25,100 It's a question of whether you regard the size of the hardware 1044 00:43:25,100 --> 00:43:25,625 to be fixed. 1045 00:43:25,625 --> 00:43:28,196 1046 00:43:28,196 --> 00:43:31,447 If you have a much larger problem so big that it's not 1047 00:43:31,447 --> 00:43:33,030 going to fit in the computer you have, 1048 00:43:33,030 --> 00:43:34,740 you're simply going to have to buy a bigger computer. 1049 00:43:34,740 --> 00:43:36,150 And that's all there is to it. 1050 00:43:36,150 --> 00:43:40,028 Within the range of problems that your computer can handle, 1051 00:43:40,028 --> 00:43:42,570 then you will, indeed, tend to get linear kinds of trade-offs 1052 00:43:42,570 --> 00:43:43,778 rather than logarithmic ones. 1053 00:43:43,778 --> 00:43:45,195 It's a question of what you choose 1054 00:43:45,195 --> 00:43:46,860 to hold fixed in your analysis and what 1055 00:43:46,860 --> 00:43:49,120 you choose to let vary. 1056 00:43:49,120 --> 00:43:53,100 I'm [INAUDIBLE] from Holland. 1057 00:43:53,100 --> 00:43:55,470 I want to ask you the following question. 1058 00:43:55,470 --> 00:43:59,580 Is there something to say about the portability of the programs 1059 00:43:59,580 --> 00:44:00,570 to different machines? 1060 00:44:00,570 --> 00:44:03,300 1061 00:44:03,300 --> 00:44:04,492 Yes, there is. 1062 00:44:04,492 --> 00:44:06,450 There is a great deal to say about portability. 1063 00:44:06,450 --> 00:44:09,240 And one is that right now it's very difficult because we have 1064 00:44:09,240 --> 00:44:11,460 not come to terms on standards. 1065 00:44:11,460 --> 00:44:14,460 We haven't agreed on what are the right kinds of building 1066 00:44:14,460 --> 00:44:15,760 blocks to support. 1067 00:44:15,760 --> 00:44:20,430 And as a result, you end up with different architectures 1068 00:44:20,430 --> 00:44:22,740 that are being built in hardware lending 1069 00:44:22,740 --> 00:44:24,657 support to different ones of these operations, 1070 00:44:24,657 --> 00:44:26,698 but not supporting them all in a uniform fashion. 1071 00:44:26,698 --> 00:44:29,140 We're still trying to discover which ones are important. 1072 00:44:29,140 --> 00:44:32,535 And this is why you end up with non-portability of efficiencies 1073 00:44:32,535 --> 00:44:33,285 and running times. 1074 00:44:33,285 --> 00:44:35,615 This is completely aside from the question of, 1075 00:44:35,615 --> 00:44:36,990 will the code run at all, which I 1076 00:44:36,990 --> 00:44:39,152 think is merely a matter of agreeing 1077 00:44:39,152 --> 00:44:40,110 to implement compilers. 1078 00:44:40,110 --> 00:44:42,680 1079 00:44:42,680 --> 00:44:44,160 It's a difficult problem. 
1080 00:44:44,160 --> 00:44:48,080 And I think that's part of the point of this talk, 1081 00:44:48,080 --> 00:44:50,720 is to bring forward one particular set of building 1082 00:44:50,720 --> 00:44:53,420 blocks that I think can be treated as universal 1083 00:44:53,420 --> 00:44:56,030 and to try to get hardware architects 1084 00:44:56,030 --> 00:44:59,000 as well as language designers to pay attention 1085 00:44:59,000 --> 00:45:00,500 to these particular building blocks 1086 00:45:00,500 --> 00:45:03,810 as exemplars of important things that need to be supported well 1087 00:45:03,810 --> 00:45:05,060 in both software and hardware. 1088 00:45:05,060 --> 00:45:07,393 And there are probably others that I haven't touched on, 1089 00:45:07,393 --> 00:45:08,875 such as sorting. 1090 00:45:08,875 --> 00:45:12,440 I'm Jim [INAUDIBLE] from Howard University in Washington, DC. 1091 00:45:12,440 --> 00:45:14,000 I have a number of questions that 1092 00:45:14,000 --> 00:45:17,720 came out of the lecture and some other personal professional 1093 00:45:17,720 --> 00:45:21,030 interest with respect to this important topic. 1094 00:45:21,030 --> 00:45:23,450 For dealing with large matrices that 1095 00:45:23,450 --> 00:45:27,830 are highly dense and sparse, there are [INAUDIBLE] methods 1096 00:45:27,830 --> 00:45:29,690 that we normally will use to reduce 1097 00:45:29,690 --> 00:45:32,810 the so-called computational [INAUDIBLE] 1098 00:45:32,810 --> 00:45:38,060 And thus far, as you gave the very good presentation, 1099 00:45:38,060 --> 00:45:40,670 it looks to me that there will be opportunities 1100 00:45:40,670 --> 00:45:45,950 for evolving or solving problems [INAUDIBLE] that 1101 00:45:45,950 --> 00:45:47,780 might involve sparsity. 1102 00:45:47,780 --> 00:45:51,170 If this is true, how do you justify 1103 00:45:51,170 --> 00:45:56,010 the overhead cost in terms of the cost of parallel processing 1104 00:45:56,010 --> 00:45:58,730 as well as the cost of operations? 1105 00:45:58,730 --> 00:46:00,920 I know that there are advantages in terms 1106 00:46:00,920 --> 00:46:03,260 of the operational cost. 1107 00:46:03,260 --> 00:46:06,590 But the overhead cost in terms of implementing such machines, 1108 00:46:06,590 --> 00:46:08,420 especially where budgetary constraints 1109 00:46:08,420 --> 00:46:10,580 might be significant. 1110 00:46:10,580 --> 00:46:12,980 Is the gist of your question that the kinds of matrix 1111 00:46:12,980 --> 00:46:14,980 algorithms that I showed in this lecture are not 1112 00:46:14,980 --> 00:46:16,790 suitable for sparse matrices because so 1113 00:46:16,790 --> 00:46:18,140 much of the processing power will be wasted? 1114 00:46:18,140 --> 00:46:18,710 Yes. 1115 00:46:18,710 --> 00:46:20,168 Yes, that is a very good criticism. 1116 00:46:20,168 --> 00:46:22,760 And it would not be appropriate to use that kind of algorithm 1117 00:46:22,760 --> 00:46:24,440 on a sparse matrix, just as you wouldn't 1118 00:46:24,440 --> 00:46:27,260 use the usual sequential triply nested loop you 1119 00:46:27,260 --> 00:46:29,443 use for a dense matrix on a sparse one. 1120 00:46:29,443 --> 00:46:31,610 Sparse matrix processing on a data parallel computer 1121 00:46:31,610 --> 00:46:33,350 calls for very different approaches 1122 00:46:33,350 --> 00:46:35,330 that I did not illustrate in this talk.
1123 00:46:35,330 --> 00:46:37,310 And they are more difficult and typically 1124 00:46:37,310 --> 00:46:41,450 call for the kinds of irregular permutations and communications 1125 00:46:41,450 --> 00:46:46,820 techniques that I illustrated before. 1126 00:46:46,820 --> 00:46:50,150 Because the particular patterns of communication 1127 00:46:50,150 --> 00:46:51,650 in the processing of sparse matrices 1128 00:46:51,650 --> 00:46:53,960 depend so much on the particular content of the matrix. 1129 00:46:53,960 --> 00:46:55,700 I'd just like you to make a comment 1130 00:46:55,700 --> 00:46:58,940 on the combination of hierarchical structures 1131 00:46:58,940 --> 00:47:03,830 in problem solving rather than just plain parallel 1132 00:47:03,830 --> 00:47:05,570 processing or sequential. 1133 00:47:05,570 --> 00:47:09,963 Using hierarchical structures to enhance some of the approaches 1134 00:47:09,963 --> 00:47:10,880 that you've discussed. 1135 00:47:10,880 --> 00:47:13,228 What will that give you? 1136 00:47:13,228 --> 00:47:14,645 Could you be more specific, please? 1137 00:47:14,645 --> 00:47:16,760 To have a hierarchical structure is 1138 00:47:16,760 --> 00:47:19,760 where you have a main taskmaster that 1139 00:47:19,760 --> 00:47:22,130 controls the sequence of operations that 1140 00:47:22,130 --> 00:47:26,630 is done at the lower level-- the sequence of activities that 1141 00:47:26,630 --> 00:47:28,850 is being done in parallel. 1142 00:47:28,850 --> 00:47:31,580 Where the local members or local elements 1143 00:47:31,580 --> 00:47:34,070 have a chance to communicate among themselves 1144 00:47:34,070 --> 00:47:38,900 as well as direct information to the higher level taskmaster. 1145 00:47:38,900 --> 00:47:40,460 Oh, I think I see. 1146 00:47:40,460 --> 00:47:42,110 I've shown simply a two-level hierarchy. 1147 00:47:42,110 --> 00:47:45,200 At most, I've shown a single global source for, for example, 1148 00:47:45,200 --> 00:47:46,098 broadcasting. 1149 00:47:46,098 --> 00:47:48,140 And then, a sort of homogeneous array of processors. 1150 00:47:48,140 --> 00:47:50,810 One might imagine a more complex hierarchical structuring 1151 00:47:50,810 --> 00:47:52,290 of the processing elements. 1152 00:47:52,290 --> 00:47:52,790 Yes. 1153 00:47:52,790 --> 00:47:53,790 Is that your suggestion? 1154 00:47:53,790 --> 00:47:54,850 Yes. 1155 00:47:54,850 --> 00:47:56,600 I think that is a largely unexplored area. 1156 00:47:56,600 --> 00:47:57,975 That's another thing that I would 1157 00:47:57,975 --> 00:48:00,470 regard as an open research topic as to the best way 1158 00:48:00,470 --> 00:48:01,700 to organize that. 1159 00:48:01,700 --> 00:48:05,710 And the difficulty is providing the flexibility 1160 00:48:05,710 --> 00:48:08,660 at the programming level to be able to organize 1161 00:48:08,660 --> 00:48:11,412 the hierarchy in a way demanded by the problem. 1162 00:48:11,412 --> 00:48:13,370 We have seen a number of hardware architectures 1163 00:48:13,370 --> 00:48:15,200 that organize hardware processing elements 1164 00:48:15,200 --> 00:48:19,910 hierarchically, but with more or less ability to reorganize 1165 00:48:19,910 --> 00:48:21,357 those processors.
1166 00:48:21,357 --> 00:48:22,940 Because I don't think it's well enough 1167 00:48:22,940 --> 00:48:24,857 understood how to make a sufficiently flexible 1168 00:48:24,857 --> 00:48:28,040 communication system that allows you to restructure 1169 00:48:28,040 --> 00:48:30,620 the hierarchy unless, of course, you provide a completely 1170 00:48:30,620 --> 00:48:32,960 general communication system. 1171 00:48:32,960 --> 00:48:34,460 In which case, the hierarchy, again, 1172 00:48:34,460 --> 00:48:35,835 disappears at the hardware level. 1173 00:48:35,835 --> 00:48:38,330 And it's up to the language implementer to provide 1174 00:48:38,330 --> 00:48:41,210 that hierarchical organization at the language level. 1175 00:48:41,210 --> 00:48:43,250 And we have not yet seen that emerging either. 1176 00:48:43,250 --> 00:48:45,120 And that's a very good point. 1177 00:48:45,120 --> 00:48:49,280 So thus far, the algorithms for solving mixed integer-- 1178 00:48:49,280 --> 00:48:52,590 say, maybe linear programming. 1179 00:48:52,590 --> 00:48:54,740 But when you have nonlinear [INAUDIBLE] 1180 00:48:54,740 --> 00:48:58,580 integer programming, one of the topics you covered 1181 00:48:58,580 --> 00:49:00,270 includes enumeration. 1182 00:49:00,270 --> 00:49:02,960 Which suggests that if we really 1183 00:49:02,960 --> 00:49:06,260 want to solve the class of branch and bound problems 1184 00:49:06,260 --> 00:49:10,940 that require either deep search or [INAUDIBLE] search, 1185 00:49:10,940 --> 00:49:15,080 this parallel processing might be a very important tool 1186 00:49:15,080 --> 00:49:20,270 to solve some-- to solve the problem of mixing 1187 00:49:20,270 --> 00:49:24,680 nonlinear type programming with branch and bound. 1188 00:49:24,680 --> 00:49:26,360 Yeah, it may well be. 1189 00:49:26,360 --> 00:49:30,470 It is sometimes possible to use data parallel techniques to do 1190 00:49:30,470 --> 00:49:34,050 what look like unstructured searches in, for example, 1191 00:49:34,050 --> 00:49:37,440 a tree-- for example, a game tree-- using branch 1192 00:49:37,440 --> 00:49:39,440 and bound techniques or other techniques related 1193 00:49:39,440 --> 00:49:42,140 to it, by maintaining a work queue, 1194 00:49:42,140 --> 00:49:45,950 very much as you might in a more control parallel 1195 00:49:45,950 --> 00:49:47,810 oriented style. 1196 00:49:47,810 --> 00:49:52,100 And at every step, taking a large number of task items 1197 00:49:52,100 --> 00:49:54,410 off that work queue simultaneously 1198 00:49:54,410 --> 00:49:56,033 by using an enumeration step and using 1199 00:49:56,033 --> 00:49:57,700 the result of that enumeration to assign 1200 00:49:57,700 --> 00:49:59,468 [INAUDIBLE] processors. 1201 00:49:59,468 --> 00:50:01,760 Sometimes it's effective, provided the rest of the work 1202 00:50:01,760 --> 00:50:04,058 to be done in each of these tasks is sufficiently similar. 1203 00:50:04,058 --> 00:50:05,600 If it's not, then control parallelism 1204 00:50:05,600 --> 00:50:08,190 may be more suited to a task like that. 1205 00:50:08,190 --> 00:50:11,690 [INAUDIBLE] the Netherlands. 1206 00:50:11,690 --> 00:50:15,830 With the current expertise-- software expertise in 4GL, 1207 00:50:15,830 --> 00:50:19,130 fourth generation languages for sequential machines, 1208 00:50:19,130 --> 00:50:23,150 I would like to know your view on whether the development of data parallel 1209 00:50:23,150 --> 00:50:27,800 programming languages will end up at least at the 4GL 1210 00:50:27,800 --> 00:50:29,060 level or above.
1211 00:50:29,060 --> 00:50:32,160 1212 00:50:32,160 --> 00:50:35,370 I have seen the development of software 1213 00:50:35,370 --> 00:50:37,770 based on the data parallel paradigm-- 1214 00:50:37,770 --> 00:50:42,090 1215 00:50:42,090 --> 00:50:43,320 I was about to say "parallel"-- 1216 00:50:43,320 --> 00:50:45,510 let me say "reenact" the history of the development 1217 00:50:45,510 --> 00:50:49,722 of sequential programming languages and software systems. 1218 00:50:49,722 --> 00:50:51,180 And I think we are now at the point 1219 00:50:51,180 --> 00:50:53,055 where we understand how to make data parallel 1220 00:50:53,055 --> 00:50:55,290 languages at about the level of expressiveness 1221 00:50:55,290 --> 00:50:58,710 of C, Fortran, possibly Lisp. 1222 00:50:58,710 --> 00:51:01,200 And I think it's going to take a while before we can rework 1223 00:51:01,200 --> 00:51:03,217 our understanding of this to raise 1224 00:51:03,217 --> 00:51:05,550 the level of expressiveness to that of fourth generation 1225 00:51:05,550 --> 00:51:06,540 languages. 1226 00:51:06,540 --> 00:51:08,540 I think it's just a matter of time and learning. 1227 00:51:08,540 --> 00:51:11,190 We still don't understand how to fit these building 1228 00:51:11,190 --> 00:51:14,050 blocks together to make abstractions at that level. 1229 00:51:14,050 --> 00:51:15,895 And I think that's great because it 1230 00:51:15,895 --> 00:51:18,270 would be no fun if there weren't interesting new problems 1231 00:51:18,270 --> 00:51:20,910 to work on. 1232 00:51:20,910 --> 00:51:24,860 [MUSIC PLAYING] 1233 00:51:24,860 --> 00:53:10,326