1
00:00:00,000 --> 00:00:02,170


2
00:00:02,170 --> 00:00:05,440
University video communication
is pleased to present this

3
00:00:05,440 --> 00:00:08,260
video tape edition of the
Distinguished Lecture Series ,

4
00:00:08,260 --> 00:00:10,750
industry leaders in computer
science and electrical

5
00:00:10,750 --> 00:00:11,740
engineering.

6
00:00:11,740 --> 00:00:13,540
Speakers and topics
in this series

7
00:00:13,540 --> 00:00:16,420
have been nominated by
University faculty who

8
00:00:16,420 --> 00:00:19,900
want industry material
for use in their courses.

9
00:00:19,900 --> 00:00:22,720
Today's presentation,
parallel processing

10
00:00:22,720 --> 00:00:24,970
is by Tilak K. Agerwala.

11
00:00:24,970 --> 00:00:27,520
An IEEE believe fellow,
he received his Bachelor

12
00:00:27,520 --> 00:00:29,560
of technology in
electrical engineering

13
00:00:29,560 --> 00:00:32,020
from the Indian
Institute of Technology,

14
00:00:32,020 --> 00:00:34,150
and his PhD in
electrical engineering

15
00:00:34,150 --> 00:00:36,490
from Johns Hopkins University.

16
00:00:36,490 --> 00:00:38,650
As a member of the corporate
technical committee

17
00:00:38,650 --> 00:00:41,110
in the science and
technology department at IBM

18
00:00:41,110 --> 00:00:44,400
corporate headquarters, one of
his current responsibilities

19
00:00:44,400 --> 00:00:48,910
is for corporate wide assessment
of parallel processing at IBM.

20
00:00:48,910 --> 00:00:53,110
UVC is happy to introduce
to you Dr. Tilak Agerwala.

21
00:00:53,110 --> 00:00:57,040
It's a pleasure to have this
opportunity to speak to you

22
00:00:57,040 --> 00:01:01,840
on one of my favorite topics
namely, parallel processing.

23
00:01:01,840 --> 00:01:07,360
The presentation today will
be a tutorial presentation

24
00:01:07,360 --> 00:01:11,620
where I will try to give
you an overview of some

25
00:01:11,620 --> 00:01:16,870
of the important trends in this
area of parallel processing.

26
00:01:16,870 --> 00:01:18,670
Now many of you
are quite familiar

27
00:01:18,670 --> 00:01:22,540
with the various motivations
for parallel processing,

28
00:01:22,540 --> 00:01:28,480
I would on this particular
slide emphasize two.

29
00:01:28,480 --> 00:01:31,690
The performance potential
of parallel processing

30
00:01:31,690 --> 00:01:34,570
is one of the key motivations.

31
00:01:34,570 --> 00:01:40,360
And the performance aspects
are really common to flavors.

32
00:01:40,360 --> 00:01:44,830
One can talk about the
maximum performance possible ,

33
00:01:44,830 --> 00:01:46,510
the maximum absolute
performance,

34
00:01:46,510 --> 00:01:51,280
one can also be motivated
by the ability to back a lot

35
00:01:51,280 --> 00:01:53,650
of performance in
the small package,

36
00:01:53,650 --> 00:01:59,270
and one typically may want
to do that in workstations.

37
00:01:59,270 --> 00:02:04,960
There is a potential for
improved cost performance,

38
00:02:04,960 --> 00:02:09,250
to provide scalable
performance, and the other one

39
00:02:09,250 --> 00:02:12,460
on this list that I really
do want to emphasize

40
00:02:12,460 --> 00:02:14,740
is reduce development cost.

41
00:02:14,740 --> 00:02:18,910
And reduce development costs
also comes in two flavors.

42
00:02:18,910 --> 00:02:22,660
Sometimes it's possible to
get significant performance

43
00:02:22,660 --> 00:02:26,950
without developing very
sophisticated technologies

44
00:02:26,950 --> 00:02:31,570
and very sophisticated packages
that are developed for very

45
00:02:31,570 --> 00:02:34,030
high performance machines.

46
00:02:34,030 --> 00:02:37,510
Secondly if one is interested
in really providing

47
00:02:37,510 --> 00:02:41,740
a very wide range of
performance for example in IBM,

48
00:02:41,740 --> 00:02:45,670
we cover the range of systems
from personal computers

49
00:02:45,670 --> 00:02:48,070
to supercomputers,
then one is faced

50
00:02:48,070 --> 00:02:52,090
with the question of how many
unique uni-processor designs

51
00:02:52,090 --> 00:02:55,210
should one develop to cover
that range of performance.

52
00:02:55,210 --> 00:02:57,130
And clearly with
parallel processing

53
00:02:57,130 --> 00:02:59,710
then one has an
added dimension so

54
00:02:59,710 --> 00:03:02,410
that the number of
unique processor designs

55
00:03:02,410 --> 00:03:04,780
can be reduced and
the range can be

56
00:03:04,780 --> 00:03:10,530
covered with the use
of parallel processing.

57
00:03:10,530 --> 00:03:15,990
Increased availability is also
an important consideration

58
00:03:15,990 --> 00:03:19,860
and is very important
in today's systems.

59
00:03:19,860 --> 00:03:23,700
There has been a huge amount
of progress in computers.

60
00:03:23,700 --> 00:03:26,760
Generally, I'd
just like to state

61
00:03:26,760 --> 00:03:30,300
that the driving force
behind much of this progress

62
00:03:30,300 --> 00:03:33,510
is really progress
in micro-electronics.

63
00:03:33,510 --> 00:03:39,030
And this the slide illustrates
what some of the trends

64
00:03:39,030 --> 00:03:44,470
are in the area of memory,
CMOS, and bipolar technologies.

65
00:03:44,470 --> 00:03:49,590
I should emphasize that
this slide is not just

66
00:03:49,590 --> 00:03:52,860
obtained by sitting
down with log paper

67
00:03:52,860 --> 00:03:57,400
and projecting into the future
by drawing straight lines.

68
00:03:57,400 --> 00:04:01,200
This graph is obtained by
a very thorough evaluation

69
00:04:01,200 --> 00:04:07,530
of the problems that are likely
to be faced in the future

70
00:04:07,530 --> 00:04:11,520
and by considering the solutions
that one may use to address

71
00:04:11,520 --> 00:04:12,930
some of these problems.

72
00:04:12,930 --> 00:04:16,110
And the basic message that I
want to give you on this slide

73
00:04:16,110 --> 00:04:20,339
is that technology at this
level is going straight ahead

74
00:04:20,339 --> 00:04:23,200
and there's no end in sight.

75
00:04:23,200 --> 00:04:29,550
Now, the basic technology
capability for example

76
00:04:29,550 --> 00:04:35,700
then translates into the ability
to build processing systems.

77
00:04:35,700 --> 00:04:38,730
And what I am
illustrating on this slide

78
00:04:38,730 --> 00:04:43,050
is the evolution of
microprocessor systems,

79
00:04:43,050 --> 00:04:46,530
that the systems that can
be built on a single chip

80
00:04:46,530 --> 00:04:48,960
or on a small
collection of chips.

81
00:04:48,960 --> 00:04:51,810
And with 1 micron
CMOS technology today,

82
00:04:51,810 --> 00:04:54,900
we are essentially sitting
in the range of 20 to 30

83
00:04:54,900 --> 00:04:58,230
million instructions per
second and this will steadily

84
00:04:58,230 --> 00:05:01,530
move into the 50 to 70 range.

85
00:05:01,530 --> 00:05:06,310
I should emphasize that
this performance comes--

86
00:05:06,310 --> 00:05:09,060
these numbers are
based on the assumption

87
00:05:09,060 --> 00:05:11,070
of having a machine
organization that

88
00:05:11,070 --> 00:05:15,930
can deliver approximately
1.2 cycles per instruction

89
00:05:15,930 --> 00:05:17,580
for average workloads.

90
00:05:17,580 --> 00:05:20,580
If one uses more sophisticated
machine organizations

91
00:05:20,580 --> 00:05:22,620
and we certainly
know how to do that,

92
00:05:22,620 --> 00:05:25,770
these numbers can
increase significantly.

93
00:05:25,770 --> 00:05:28,950
Furthermore another
factor of two

94
00:05:28,950 --> 00:05:34,870
can be obtained by using
liquid nitrogen cooling

95
00:05:34,870 --> 00:05:37,120
and by using some
of the developments

96
00:05:37,120 --> 00:05:43,960
in high density, low power,
bipolar technologies.

97
00:05:43,960 --> 00:05:47,110
It's certainly feasible
in the early 1990s

98
00:05:47,110 --> 00:05:52,210
to get into the 100 million
instruction per second range.

99
00:05:52,210 --> 00:05:54,160
I really want to make
two points over here.

100
00:05:54,160 --> 00:05:58,510
Firstly that the building blocks
that we have available to us

101
00:05:58,510 --> 00:06:01,630
now are becoming
powerful enough so

102
00:06:01,630 --> 00:06:07,480
that the small MIPS effects that
one would encounter with lower

103
00:06:07,480 --> 00:06:10,780
performance microprocessors
are no longer

104
00:06:10,780 --> 00:06:12,730
coming into the picture.

105
00:06:12,730 --> 00:06:17,650
And secondly that with a 40,
50, 100 million instruction

106
00:06:17,650 --> 00:06:19,750
per second building
block, one does not

107
00:06:19,750 --> 00:06:23,050
have to go to very high
levels of parallel processing

108
00:06:23,050 --> 00:06:25,210
to get fairly
significant systems.

109
00:06:25,210 --> 00:06:30,730
So 16 way and 30 way parallelism
with 100 microprocessor

110
00:06:30,730 --> 00:06:33,370
puts you into a very
interesting range.

111
00:06:33,370 --> 00:06:35,890
Of course, to get the
supercomputing kinds

112
00:06:35,890 --> 00:06:39,520
of capabilities as
I'll point out one

113
00:06:39,520 --> 00:06:41,380
can use higher levels
of parallelism.

114
00:06:41,380 --> 00:06:44,730


115
00:06:44,730 --> 00:06:49,440
Now let me then talk about
some of the important trends

116
00:06:49,440 --> 00:06:50,820
in parallel processing.

117
00:06:50,820 --> 00:06:53,010
And this slide will
indicate to you

118
00:06:53,010 --> 00:06:56,400
that I am in fact viewing
parallel processing

119
00:06:56,400 --> 00:07:01,470
in a very broad sense to cover
clustering, multiprocessing,

120
00:07:01,470 --> 00:07:04,270
and very highly
parallel processing.

121
00:07:04,270 --> 00:07:07,950
And I will spend the next few
minutes defining these terms

122
00:07:07,950 --> 00:07:11,820
and giving you some examples of
systems in each of these areas.

123
00:07:11,820 --> 00:07:14,640


124
00:07:14,640 --> 00:07:20,850
In clustering, one
typically connects a number

125
00:07:20,850 --> 00:07:25,770
of complete computers that
is processor and memory

126
00:07:25,770 --> 00:07:30,750
pairs together to disks.

127
00:07:30,750 --> 00:07:34,500
The key motivation
for clustering

128
00:07:34,500 --> 00:07:40,710
is to obtain improved capacity
and improved availability yet

129
00:07:40,710 --> 00:07:44,070
at the same time to provide
a single system image.

130
00:07:44,070 --> 00:07:45,900
That is the key challenge.

131
00:07:45,900 --> 00:07:49,500
One is trying to
develop a system that

132
00:07:49,500 --> 00:07:53,280
consists of multiple
processors but looks

133
00:07:53,280 --> 00:07:57,540
like a single computer
system to the operator

134
00:07:57,540 --> 00:07:59,140
and to the end user.

135
00:07:59,140 --> 00:08:01,730
Now what do I mean by
a single system image?

136
00:08:01,730 --> 00:08:04,290
I basically mean the
following, that the system

137
00:08:04,290 --> 00:08:07,290
should look like
one computer system

138
00:08:07,290 --> 00:08:09,420
from an administrative
point of view.

139
00:08:09,420 --> 00:08:11,730
It should have a
single file system.

140
00:08:11,730 --> 00:08:14,340
Users who log on to
the system should not

141
00:08:14,340 --> 00:08:17,010
be aware of the
particular computer

142
00:08:17,010 --> 00:08:19,410
that they end up executing on.

143
00:08:19,410 --> 00:08:23,190
And the system should provide
a high degree of availability

144
00:08:23,190 --> 00:08:25,360
and graceful degradation.

145
00:08:25,360 --> 00:08:27,810
Now, in clustering,
this is attained

146
00:08:27,810 --> 00:08:32,460
not by having one single
operating system that executes

147
00:08:32,460 --> 00:08:35,039
on all the machines,
but rather by having

148
00:08:35,039 --> 00:08:37,650
multiple copies of
the operating systems

149
00:08:37,650 --> 00:08:41,220
that exchange information
typically at the page level,

150
00:08:41,220 --> 00:08:44,130
and thereby software
synchronization techniques

151
00:08:44,130 --> 00:08:45,600
are adequate.

152
00:08:45,600 --> 00:08:48,630
Let me just point out that
examples of clustering

153
00:08:48,630 --> 00:08:52,380
are the transparent
computing facility, which

154
00:08:52,380 --> 00:08:55,320
is part of IBM's UNIX systems.

155
00:08:55,320 --> 00:08:57,660
And I will talk
about that later.

156
00:08:57,660 --> 00:09:00,570
Systems like tandem
and vax cluster

157
00:09:00,570 --> 00:09:02,985
are also examples of clusters.

158
00:09:02,985 --> 00:09:05,550
Now, the situation
in multiprocessing

159
00:09:05,550 --> 00:09:06,900
is a little different.

160
00:09:06,900 --> 00:09:11,100
Here you have
processors that share

161
00:09:11,100 --> 00:09:14,130
information at very
high speeds, typically

162
00:09:14,130 --> 00:09:20,100
by executing loads and stores
against a single memory system.

163
00:09:20,100 --> 00:09:22,530
The synchronization
between the processors

164
00:09:22,530 --> 00:09:24,930
is provided at the
hardware level,

165
00:09:24,930 --> 00:09:28,800
because the exchange information
at these very high speeds.

166
00:09:28,800 --> 00:09:31,270
As a result of this
hardware synchronization,

167
00:09:31,270 --> 00:09:33,390
the extensibility is
typically limited.

168
00:09:33,390 --> 00:09:38,760
So these systems are restricted
to the 8, 16, 32 processor

169
00:09:38,760 --> 00:09:42,780
range and typically will not
go significantly beyond that.

170
00:09:42,780 --> 00:09:46,050
Now the goal here really is
to get the multiprocessor

171
00:09:46,050 --> 00:09:50,350
to look and behave exactly
like a uniprocessor system.

172
00:09:50,350 --> 00:09:53,760
So all the subsystem interfaces,
all the application interfaces

173
00:09:53,760 --> 00:09:56,340
to the operating
system should be

174
00:09:56,340 --> 00:09:59,250
identical to what they
are in a unique processor.

175
00:09:59,250 --> 00:10:01,950
And this really means taking
a unique processor operating

176
00:10:01,950 --> 00:10:06,660
system and having the single
operating system execute

177
00:10:06,660 --> 00:10:09,680
across the multiple processors.

178
00:10:09,680 --> 00:10:12,990
And this involves implementing
mutual exclusion that

179
00:10:12,990 --> 00:10:15,900
involves analyzing
the data structures,

180
00:10:15,900 --> 00:10:18,570
understanding how they're
used, and providing

181
00:10:18,570 --> 00:10:23,530
the appropriate levels of
process synchronization.

182
00:10:23,530 --> 00:10:25,620
The key, of course,
is to do all of that

183
00:10:25,620 --> 00:10:28,750
and yet obtain very
high efficiency.

184
00:10:28,750 --> 00:10:31,020
Now multiprocessing
is being applied

185
00:10:31,020 --> 00:10:33,910
across a very wide
range of systems,

186
00:10:33,910 --> 00:10:38,010
all the way from workstations
to high end machines.

187
00:10:38,010 --> 00:10:41,190
The motivations in each case are
slightly different, of course,

188
00:10:41,190 --> 00:10:42,720
they are all related.

189
00:10:42,720 --> 00:10:45,960
In high end systems,
the overall goal

190
00:10:45,960 --> 00:10:48,480
is to provide levels
of performance

191
00:10:48,480 --> 00:10:51,840
beyond that achievable by
the highest performance

192
00:10:51,840 --> 00:10:54,910
uniprocessor at any
given point in time.

193
00:10:54,910 --> 00:10:59,430
So the key high end issue
is really performance.

194
00:10:59,430 --> 00:11:01,230
All the other issues
and motivations

195
00:11:01,230 --> 00:11:04,920
that I talked about
earlier, do apply as well

196
00:11:04,920 --> 00:11:07,860
but the performance issue
I believe dominates.

197
00:11:07,860 --> 00:11:11,370
In the mid-range
again performance

198
00:11:11,370 --> 00:11:13,950
comes into the picture
but the goal here

199
00:11:13,950 --> 00:11:17,250
is really to provide high
levels of performance

200
00:11:17,250 --> 00:11:20,220
without necessarily getting
into very high performance

201
00:11:20,220 --> 00:11:22,110
uniprocessor development,
without getting

202
00:11:22,110 --> 00:11:25,320
into the high performance
technologies and packages that

203
00:11:25,320 --> 00:11:28,390
characterize very high
performance uniprocessors.

204
00:11:28,390 --> 00:11:30,300
And at the same time
horizontal growth

205
00:11:30,300 --> 00:11:35,500
is a key issue in the mid-range.

206
00:11:35,500 --> 00:11:38,170
Now the workstations
the motivation

207
00:11:38,170 --> 00:11:41,320
is to provide the highest
levels of performance

208
00:11:41,320 --> 00:11:43,330
in a very small package.

209
00:11:43,330 --> 00:11:46,360
And there's really a
fairly interesting trend

210
00:11:46,360 --> 00:11:48,910
that is taking place in
the workstations, which

211
00:11:48,910 --> 00:11:52,630
to a large extent is
driven by this technology

212
00:11:52,630 --> 00:11:54,340
of parallel processing.

213
00:11:54,340 --> 00:11:58,210
And the situation is
as follows, by taking

214
00:11:58,210 --> 00:12:00,910
some of these high performance
microprocessor systems that

215
00:12:00,910 --> 00:12:04,150
are under 20, 30 million
instructions per second range

216
00:12:04,150 --> 00:12:08,770
and implementing them, a
system consisting of 4 or 8

217
00:12:08,770 --> 00:12:12,950
microprocessors, one can get
into performance ranges that

218
00:12:12,950 --> 00:12:16,900
were typically characteristic
of supercomputers of a few years

219
00:12:16,900 --> 00:12:17,980
ago.

220
00:12:17,980 --> 00:12:22,990
Now, the workstation however,
has an additional capability.

221
00:12:22,990 --> 00:12:27,790
These MIPS can also be
used to provide graphics

222
00:12:27,790 --> 00:12:30,520
and to provide user
interface processing.

223
00:12:30,520 --> 00:12:33,070
And since the
computing engine is now

224
00:12:33,070 --> 00:12:34,840
very close to the
user interface,

225
00:12:34,840 --> 00:12:38,530
they can be connected together
by very, very high bandwidth

226
00:12:38,530 --> 00:12:40,180
interconnects.

227
00:12:40,180 --> 00:12:43,900
And you can typically now
get levels of interactivity

228
00:12:43,900 --> 00:12:47,200
between the compute engine,
which is in the workstation,

229
00:12:47,200 --> 00:12:50,290
and the graphics processor,
which is in the workstation

230
00:12:50,290 --> 00:12:53,500
significantly beyond what
is possible by having

231
00:12:53,500 --> 00:12:57,430
a thin pipe that connects your
workstation to a supercomputer

232
00:12:57,430 --> 00:12:59,340
in the background.

233
00:12:59,340 --> 00:13:02,330
So this is a key trend that is
happening in the workstation

234
00:13:02,330 --> 00:13:05,510
arena, which I think is
being driven significantly

235
00:13:05,510 --> 00:13:09,200
by this technology of
parallel processing.

236
00:13:09,200 --> 00:13:11,120
Now let me just give
you a few examples

237
00:13:11,120 --> 00:13:14,180
of systems in these areas.

238
00:13:14,180 --> 00:13:18,560
In the high end IBM
3,090 mainframes

239
00:13:18,560 --> 00:13:21,620
are multi processors
as other carriers

240
00:13:21,620 --> 00:13:28,640
and as will be the SS1 machine
from Steve Chen's company.

241
00:13:28,640 --> 00:13:31,190
In the mid-range
the machine systems

242
00:13:31,190 --> 00:13:36,470
like the 4,381
alliant and convex

243
00:13:36,470 --> 00:13:40,700
are examples of systems that
use parallel processing.

244
00:13:40,700 --> 00:13:46,160
And at the workstation we
see a numerous systems today

245
00:13:46,160 --> 00:13:50,660
examples addent, stellar,
and silicon graphics.

246
00:13:50,660 --> 00:13:54,470
Now we move on to the area of
highly parallel processing.

247
00:13:54,470 --> 00:13:57,470
As many of you
are aware, there's

248
00:13:57,470 --> 00:14:02,180
an incredibly large variety
of architectures and systems

249
00:14:02,180 --> 00:14:06,530
that are being developed and
studied both in universities

250
00:14:06,530 --> 00:14:08,060
and in the industry.

251
00:14:08,060 --> 00:14:10,370
And it is not my
intention here to go

252
00:14:10,370 --> 00:14:13,880
into a very complicated
taxonomy that

253
00:14:13,880 --> 00:14:17,750
describes these various
parallel processors.

254
00:14:17,750 --> 00:14:19,700
What I would like
to do, though is

255
00:14:19,700 --> 00:14:24,080
take these highly parallel
processing systems

256
00:14:24,080 --> 00:14:28,070
and put them into two
very general categories

257
00:14:28,070 --> 00:14:31,250
that I call specialized
and multipurpose.

258
00:14:31,250 --> 00:14:35,600
And let me just make a few
comments about each of these.

259
00:14:35,600 --> 00:14:39,240
In the case of specialized
parallel processing,

260
00:14:39,240 --> 00:14:42,170
the technology is very
well understood today

261
00:14:42,170 --> 00:14:45,260
and highly parallel
specialized processes today

262
00:14:45,260 --> 00:14:48,710
are being used to solve
important problems.

263
00:14:48,710 --> 00:14:52,730
Problems that are either
economically important

264
00:14:52,730 --> 00:14:54,330
or of scientific importance.

265
00:14:54,330 --> 00:14:58,010
And I will talk about two
examples that we have developed

266
00:14:58,010 --> 00:15:01,160
and are working on in
IBM, the logic simulation

267
00:15:01,160 --> 00:15:02,900
machine and the GF11.

268
00:15:02,900 --> 00:15:06,230


269
00:15:06,230 --> 00:15:08,180
Characteristic of
these machines is

270
00:15:08,180 --> 00:15:14,570
that they are intended to solve
just one single problem or one

271
00:15:14,570 --> 00:15:17,570
single well
understood algorithm.

272
00:15:17,570 --> 00:15:19,820
And as a result,
you don't require

273
00:15:19,820 --> 00:15:24,680
very sophisticated programming
environments and compilers.

274
00:15:24,680 --> 00:15:29,000
This allows you to
develop these machines

275
00:15:29,000 --> 00:15:32,420
without great development costs.

276
00:15:32,420 --> 00:15:35,200


277
00:15:35,200 --> 00:15:38,530
It's my estimate
that machines that

278
00:15:38,530 --> 00:15:41,740
can deliver hundreds of
gigaflops or even a teraflops

279
00:15:41,740 --> 00:15:49,090
are quite feasible and will be
used over the next few years

280
00:15:49,090 --> 00:15:52,180
and such machines can have
a very significant impact

281
00:15:52,180 --> 00:15:58,780
on science and on engineering.

282
00:15:58,780 --> 00:16:01,210
But I would like to make
another point, which

283
00:16:01,210 --> 00:16:04,360
is that there are other
specialized machines

284
00:16:04,360 --> 00:16:07,270
and I've given you one example
on the slide, the database

285
00:16:07,270 --> 00:16:11,290
machine, which is a little
different from a logic

286
00:16:11,290 --> 00:16:14,620
simulation machine or a machine
for a physics application.

287
00:16:14,620 --> 00:16:17,590
These may be used
by very few people.

288
00:16:17,590 --> 00:16:21,970
But a machine that can do a
complex query very rapidly

289
00:16:21,970 --> 00:16:23,770
is performing a
specialized function

290
00:16:23,770 --> 00:16:25,630
but it's performing
a function that

291
00:16:25,630 --> 00:16:30,560
will be important to a
very large class of users.

292
00:16:30,560 --> 00:16:34,300
In fact, I believe this is an
example of a more general trend

293
00:16:34,300 --> 00:16:39,790
where clearly subsystems
are being identified

294
00:16:39,790 --> 00:16:45,220
and database is one of them,
which are often accessed

295
00:16:45,220 --> 00:16:49,400
by through very well
defined interfaces,

296
00:16:49,400 --> 00:16:55,840
for example a SQL query, which
are amenable to being addressed

297
00:16:55,840 --> 00:17:01,330
by a special hardware in
the case of SQL machine

298
00:17:01,330 --> 00:17:03,040
by a highly parallel processor.

299
00:17:03,040 --> 00:17:05,020
And this is an important
trend because it's

300
00:17:05,020 --> 00:17:07,660
a specialized function
but it's a function

301
00:17:07,660 --> 00:17:12,910
that will be of use to a
large number of people.

302
00:17:12,910 --> 00:17:15,940
I think the whole area of
specialized highly parallel

303
00:17:15,940 --> 00:17:18,220
processors is an
important one and we

304
00:17:18,220 --> 00:17:19,780
have to start
understanding the role

305
00:17:19,780 --> 00:17:23,930
that these machines will play in
general computing environments.

306
00:17:23,930 --> 00:17:27,730
Now on the other
hand, the area of

307
00:17:27,730 --> 00:17:30,220
multipurpose parallel
processing though it

308
00:17:30,220 --> 00:17:34,570
has significant
potential, I believe

309
00:17:34,570 --> 00:17:37,960
is also an area where a
significant amount of research

310
00:17:37,960 --> 00:17:39,640
still remains to be done.

311
00:17:39,640 --> 00:17:43,120
I will talk about this
a little more later on

312
00:17:43,120 --> 00:17:47,670
when I describe the RP3 machine
that we are working on at IBM.

313
00:17:47,670 --> 00:17:49,920
I would like to make two
points at this stage, though.

314
00:17:49,920 --> 00:17:55,030
One is that the
availability of hypercubes

315
00:17:55,030 --> 00:17:58,250
has really been very important.

316
00:17:58,250 --> 00:18:00,910
The second point I would like
to make is that in my estimate

317
00:18:00,910 --> 00:18:04,780
the area of data flow
computing is very important

318
00:18:04,780 --> 00:18:07,180
because it provides
the potential

319
00:18:07,180 --> 00:18:11,440
to exploit massively parallel
unstructured parallelism.

320
00:18:11,440 --> 00:18:13,840
I believe more work needs
to be done in this area

321
00:18:13,840 --> 00:18:16,177
but the potential
is very significant.

322
00:18:16,177 --> 00:18:20,080


323
00:18:20,080 --> 00:18:25,090
And to end this
introductory section,

324
00:18:25,090 --> 00:18:28,390
let me just say that in
addition to the various classes

325
00:18:28,390 --> 00:18:32,110
of machines the two
uses of parallelism

326
00:18:32,110 --> 00:18:34,510
to high level uses
of parallelism

327
00:18:34,510 --> 00:18:37,960
that we should distinguish
between, one of them

328
00:18:37,960 --> 00:18:40,180
I call system parallelism.

329
00:18:40,180 --> 00:18:44,470
And this is the ability to take
different users' applications

330
00:18:44,470 --> 00:18:46,960
or different
database transactions

331
00:18:46,960 --> 00:18:48,940
and run them in parallel.

332
00:18:48,940 --> 00:18:51,100
This does increase throughput.

333
00:18:51,100 --> 00:18:54,770
It doesn't make any single
application run faster.

334
00:18:54,770 --> 00:18:57,280
A big advantage is that
it requires no change

335
00:18:57,280 --> 00:18:58,930
to user programs.

336
00:18:58,930 --> 00:19:02,860
And typically, this
class of parallelism

337
00:19:02,860 --> 00:19:04,630
is obtained through
multiprocessing

338
00:19:04,630 --> 00:19:06,400
and through clustering.

339
00:19:06,400 --> 00:19:10,000
The other kind of parallelism
is application parallelism.

340
00:19:10,000 --> 00:19:12,310
And here the goal
is really to take

341
00:19:12,310 --> 00:19:16,180
a single job or a single
transaction or a unit of work

342
00:19:16,180 --> 00:19:18,310
and break it up
into little pieces

343
00:19:18,310 --> 00:19:21,250
and run all these
pieces in parallel.

344
00:19:21,250 --> 00:19:24,130
This doesn't make any
individual application run--

345
00:19:24,130 --> 00:19:27,700
this does make individual
applications run faster.

346
00:19:27,700 --> 00:19:30,340
It may not increased
system throughput

347
00:19:30,340 --> 00:19:33,490
and it may require changes
to the user's program.

348
00:19:33,490 --> 00:19:38,350
Typically, multiprocessing
and highly parallel processing

349
00:19:38,350 --> 00:19:43,360
have the potential to provide
this kind of parallelism.

350
00:19:43,360 --> 00:19:47,020
And this indicates
also multiprocessing

351
00:19:47,020 --> 00:19:50,620
is a very important
systems trend because it

352
00:19:50,620 --> 00:19:54,520
does address both systems
parallelism and application

353
00:19:54,520 --> 00:19:55,810
parallelism.

354
00:19:55,810 --> 00:19:59,140
What I would like to do
next is to illustrate

355
00:19:59,140 --> 00:20:00,760
some of the concepts
I have talked

356
00:20:00,760 --> 00:20:03,550
about with specific examples.

357
00:20:03,550 --> 00:20:07,570
I've chosen examples from IBM
because those are the ones

358
00:20:07,570 --> 00:20:08,860
that I am more familiar with.

359
00:20:08,860 --> 00:20:12,700


360
00:20:12,700 --> 00:20:17,040
So we start off with clustering.

361
00:20:17,040 --> 00:20:23,850
And that's part of
IBM's UNIX offerings.

362
00:20:23,850 --> 00:20:27,180
An IBM UNIX is called AIX.

363
00:20:27,180 --> 00:20:32,640
We have a system called the
transparent computing facility.

364
00:20:32,640 --> 00:20:36,480
This system provides a single
system image of the type

365
00:20:36,480 --> 00:20:39,570
that I had discussed
earlier, across

366
00:20:39,570 --> 00:20:41,610
heterogeneous processors.

367
00:20:41,610 --> 00:20:46,170
The processors can be either
system 370 machines or PS2

368
00:20:46,170 --> 00:20:47,320
machines.

369
00:20:47,320 --> 00:20:50,220
Now a single system
image in this case

370
00:20:50,220 --> 00:20:53,640
means what I had said earlier
that is you get a single file

371
00:20:53,640 --> 00:20:56,700
system, you get log
on transparency,

372
00:20:56,700 --> 00:21:01,920
you also get a certain amount
of local remote execution

373
00:21:01,920 --> 00:21:03,760
in a transparent manner.

374
00:21:03,760 --> 00:21:07,470
For example, you could
be developing and editing

375
00:21:07,470 --> 00:21:10,980
your program on a workstation.

376
00:21:10,980 --> 00:21:16,680
The program could be compiled
on a mid-range 4,381 or 9,370

377
00:21:16,680 --> 00:21:21,130
and could be executed on
a high end 3,090 system.

378
00:21:21,130 --> 00:21:24,700
And all of this would happen in
a manner transparent to the end

379
00:21:24,700 --> 00:21:25,200
user.

380
00:21:25,200 --> 00:21:28,470


381
00:21:28,470 --> 00:21:31,790
The second important point
is that these systems

382
00:21:31,790 --> 00:21:35,870
can be connected in a variety
of ways, either through channel

383
00:21:35,870 --> 00:21:40,910
to channel attach or
using local area networks.

384
00:21:40,910 --> 00:21:46,260
Let me move on now to
mainframe parallelism.

385
00:21:46,260 --> 00:21:51,610
And let me talk about
the 3,090 model 600,

386
00:21:51,610 --> 00:21:56,580
which is a six-way tightly
coupled multiprocessor.

387
00:21:56,580 --> 00:22:00,570
These processes have gaseous
and there's very extensive

388
00:22:00,570 --> 00:22:05,410
hardware gas coherence that is
provided in a system like this.

389
00:22:05,410 --> 00:22:09,540
The boxes labeled SCEs are
storage control elements

390
00:22:09,540 --> 00:22:12,960
and are very complex
switches that

391
00:22:12,960 --> 00:22:15,300
bring the processes
and the channels

392
00:22:15,300 --> 00:22:19,950
together and provide a coherent
view of the shared memory

393
00:22:19,950 --> 00:22:21,330
system.

394
00:22:21,330 --> 00:22:25,650
Now, in many ways in a
system like this also

395
00:22:25,650 --> 00:22:28,440
some of the key challenges
and some of the key issues

396
00:22:28,440 --> 00:22:30,720
are really software issues.

397
00:22:30,720 --> 00:22:34,770
And we have made a
tremendous amount of progress

398
00:22:34,770 --> 00:22:39,300
in obtaining true
symmetric multiprocessor

399
00:22:39,300 --> 00:22:44,330
capability in our MVS
and VM operating systems.

400
00:22:44,330 --> 00:22:47,190
Furthermore this
multiprocessor capability

401
00:22:47,190 --> 00:22:53,040
is exploited by subsystems
like IMS and GICS

402
00:22:53,040 --> 00:22:56,070
to get system parallelism.

403
00:22:56,070 --> 00:22:58,680
We have made extensions
to the Fortran language.

404
00:22:58,680 --> 00:23:03,150
And together with the automatic
detection of parallelism

405
00:23:03,150 --> 00:23:05,910
we can now exploit
application parallelism

406
00:23:05,910 --> 00:23:10,350
across this six-way tightly
coupled multiprocessor.

407
00:23:10,350 --> 00:23:12,150
There's been a
fair amount of work

408
00:23:12,150 --> 00:23:16,960
on subroutine libraries
for our vector machines.

409
00:23:16,960 --> 00:23:18,900
And these subroutine
libraries are now

410
00:23:18,900 --> 00:23:22,800
being extended to provide
support for parallel processing

411
00:23:22,800 --> 00:23:24,930
as well.

412
00:23:24,930 --> 00:23:28,080
The point I want to make
is that the six-way,

413
00:23:28,080 --> 00:23:30,420
for example in this case
six-way tightly coupled

414
00:23:30,420 --> 00:23:34,440
MP is a very well
understood technology that

415
00:23:34,440 --> 00:23:37,470
is being exploited both
for system parallelism

416
00:23:37,470 --> 00:23:41,340
and for application
parallelism side IBM.

417
00:23:41,340 --> 00:23:48,150
Just to give you a brief
example of the kind of speedups

418
00:23:48,150 --> 00:23:53,800
that we are capable of getting
by the use of parallel Fortran,

419
00:23:53,800 --> 00:23:58,530
here's a slide that shows three
applications of varying degrees

420
00:23:58,530 --> 00:24:02,580
of utilization but you
can essentially get 4 to 5

421
00:24:02,580 --> 00:24:06,050
and 1/2 way speed up on
a six processor system.

422
00:24:06,050 --> 00:24:10,370


423
00:24:10,370 --> 00:24:14,610
Another system that we
have we are working on,

424
00:24:14,610 --> 00:24:18,410
which is an
experimental prototype

425
00:24:18,410 --> 00:24:24,020
is a combination of concepts and
we are calling it form coupling

426
00:24:24,020 --> 00:24:25,910
for the lack of a better name.

427
00:24:25,910 --> 00:24:30,950
This is really taking
two 3090 600 complexes

428
00:24:30,950 --> 00:24:34,700
and connecting them at the
shared electronic level

429
00:24:34,700 --> 00:24:40,160
with very high bandwidth
fiber optic interconnections.

430
00:24:40,160 --> 00:24:43,280
This as I said should be
viewed as an exploratory system

431
00:24:43,280 --> 00:24:46,970
and we are trying to understand
the various ways in which it

432
00:24:46,970 --> 00:24:47,880
can be used.

433
00:24:47,880 --> 00:24:50,250
Some of which I have
listed on the slide.

434
00:24:50,250 --> 00:24:54,800
So you could use one side of
the system for application

435
00:24:54,800 --> 00:24:59,240
parallelism, the other
side for system parallelism

436
00:24:59,240 --> 00:25:02,000
or you could use both sides
for application parallelism

437
00:25:02,000 --> 00:25:05,630
or you could try to take one
job and split it across all 12

438
00:25:05,630 --> 00:25:08,720
processes and get application
parallelism across 12

439
00:25:08,720 --> 00:25:10,670
processes.

440
00:25:10,670 --> 00:25:13,190
Using the system to address
some of the operating

441
00:25:13,190 --> 00:25:16,610
system and compiler and
language issues that would

442
00:25:16,610 --> 00:25:19,062
be involved in form coupling.

443
00:25:19,062 --> 00:25:21,780


444
00:25:21,780 --> 00:25:25,260
Let me move on to
workstation parallelism.

445
00:25:25,260 --> 00:25:29,940
We have developed in
researched a vehicle,

446
00:25:29,940 --> 00:25:33,270
which is again an
experimental vehicle

447
00:25:33,270 --> 00:25:37,170
to understand some of the
important issues in this area.

448
00:25:37,170 --> 00:25:40,800
In this particular case,
what we would like to do

449
00:25:40,800 --> 00:25:44,700
is we would like to study how
best parallel workstation can

450
00:25:44,700 --> 00:25:46,560
be used--

451
00:25:46,560 --> 00:25:50,200
how parallelism can be used
and exploited in a workstation.

452
00:25:50,200 --> 00:25:53,280
Again the various
forms of parallelism

453
00:25:53,280 --> 00:25:55,890
that one could exploit,
either the system parallelism

454
00:25:55,890 --> 00:25:57,930
or the application parallelism.

455
00:25:57,930 --> 00:26:01,020
And this is a vehicle to start
studying some of these issues

456
00:26:01,020 --> 00:26:02,970
from a software point of view.

457
00:26:02,970 --> 00:26:09,420
The vehicle consists of eight
IBM risk microprocessors,

458
00:26:09,420 --> 00:26:11,880
each with floating point
capability and each

459
00:26:11,880 --> 00:26:16,750
with about 8
megabytes of memory.

460
00:26:16,750 --> 00:26:19,950
There's 32 megabytes of
shared memory capability.

461
00:26:19,950 --> 00:26:22,410
This machine does have a shared
address, a global address

462
00:26:22,410 --> 00:26:23,880
space.

463
00:26:23,880 --> 00:26:26,040
The memories and processors
are connected together

464
00:26:26,040 --> 00:26:28,980
by a high speed 80
megabytes per second bars.

465
00:26:28,980 --> 00:26:31,560
Again since this is
a research vehicle,

466
00:26:31,560 --> 00:26:34,020
we have provided a
real-time hardware monitor

467
00:26:34,020 --> 00:26:36,840
that will allow us
to make measurements

468
00:26:36,840 --> 00:26:42,180
on various parts of the system
as the software as executing.

469
00:26:42,180 --> 00:26:45,870
Let me now move on to the area
of highly parallel processing.

470
00:26:45,870 --> 00:26:49,140
And I will start with
specialized highly parallel

471
00:26:49,140 --> 00:26:53,400
processing and I will go through
very briefly two examples,

472
00:26:53,400 --> 00:26:55,290
the engineering
verification engine

473
00:26:55,290 --> 00:26:58,167
and the GF11 parallel processor.

474
00:26:58,167 --> 00:27:01,030


475
00:27:01,030 --> 00:27:04,090
Now what was the
problem that we were

476
00:27:04,090 --> 00:27:07,390
trying to address with EVE,
the Engineering Verification

477
00:27:07,390 --> 00:27:07,930
Engine.

478
00:27:07,930 --> 00:27:11,170
The problem is really
that of logic simulation.

479
00:27:11,170 --> 00:27:14,020
And in this particular
case, the goal

480
00:27:14,020 --> 00:27:17,000
is to simulate the
logic of a high pros--

481
00:27:17,000 --> 00:27:20,380
high performance processor,
one of our high end machines.

482
00:27:20,380 --> 00:27:24,580
And logic simulation is
a very important part.

483
00:27:24,580 --> 00:27:27,907
And it's the front end
part of the design process.

484
00:27:27,907 --> 00:27:28,990
The goal is the following.

485
00:27:28,990 --> 00:27:33,310
The goal is to try to simulate
at the logic level processor

486
00:27:33,310 --> 00:27:35,710
yet to be designed,
which is in this case

487
00:27:35,710 --> 00:27:40,420
approximately 300,000
gates and to simulate

488
00:27:40,420 --> 00:27:43,780
5 minutes of execution
of that process that

489
00:27:43,780 --> 00:27:45,160
remains to be designed.

490
00:27:45,160 --> 00:27:48,290
Now to do this.

491
00:27:48,290 --> 00:27:55,010
This was the problem we faced
in the late '70s and early '80s.

492
00:27:55,010 --> 00:27:58,900
And basically to simulate
the next generation

493
00:27:58,900 --> 00:28:01,930
high-end machine
for five minutes

494
00:28:01,930 --> 00:28:05,200
would take a 1980 mainframe
approximately 100 years,

495
00:28:05,200 --> 00:28:09,250
and this is
obviously impossible.

496
00:28:09,250 --> 00:28:14,560
The EVE machine will
do this in six weeks.

497
00:28:14,560 --> 00:28:18,520
This is an important
example of how

498
00:28:18,520 --> 00:28:22,330
you can use parallel processing
to take a problem that

499
00:28:22,330 --> 00:28:27,580
is an important design problem,
which is essentially impossible

500
00:28:27,580 --> 00:28:30,370
and to make it very practical.

501
00:28:30,370 --> 00:28:33,070
And that's what EVE achieved.

502
00:28:33,070 --> 00:28:36,230


503
00:28:36,230 --> 00:28:39,890
It achieved this by using
a highly powerful processor

504
00:28:39,890 --> 00:28:43,370
but a very
specialized processor.

505
00:28:43,370 --> 00:28:48,920
That consists of 256 elementary
processing elements connected

506
00:28:48,920 --> 00:28:51,540
together by a crossbar switch.

507
00:28:51,540 --> 00:28:55,610
Now each of these processors,
these elementary processors

508
00:28:55,610 --> 00:28:58,580
is really a very simple
computing element.

509
00:28:58,580 --> 00:29:02,660
And what it does is to read
from an instruction memory

510
00:29:02,660 --> 00:29:08,810
and then do very simple logical
operations on a data memory.

511
00:29:08,810 --> 00:29:11,300
And it does these
operations one at a time.

512
00:29:11,300 --> 00:29:15,140
That is, it'll take four digit
items do a logical operation

513
00:29:15,140 --> 00:29:18,890
on them and produce a result,
which either goes back

514
00:29:18,890 --> 00:29:23,780
into a local data memory element
or goes across the crossbar

515
00:29:23,780 --> 00:29:25,880
to one of the remote processors.

516
00:29:25,880 --> 00:29:29,512
So it's a very simple
architecture conceptually.

517
00:29:29,512 --> 00:29:31,220
And the key, of course,
is the following.

518
00:29:31,220 --> 00:29:34,100
The key is to avoid the
bottleneck of event driven

519
00:29:34,100 --> 00:29:37,310
simulation and to
take a logic network

520
00:29:37,310 --> 00:29:43,160
and actually compile onto
this simple little structure.

521
00:29:43,160 --> 00:29:45,260
Again, some of the key
problems are really

522
00:29:45,260 --> 00:29:46,760
in the software compilation.

523
00:29:46,760 --> 00:29:50,240
These problems were
addressed and solved

524
00:29:50,240 --> 00:29:54,170
and we were able to meet
the overall goals of logic

525
00:29:54,170 --> 00:29:57,260
simulation that we started with.

526
00:29:57,260 --> 00:30:01,790
The engineering verification
engine as the slide says

527
00:30:01,790 --> 00:30:05,900
has a peak rate of about 2
billion gates simulations

528
00:30:05,900 --> 00:30:06,530
per second.

529
00:30:06,530 --> 00:30:09,170


530
00:30:09,170 --> 00:30:14,590
It is operational at
several IBM locations.

531
00:30:14,590 --> 00:30:20,080
Some of these machines have
256 processors, some have less,

532
00:30:20,080 --> 00:30:22,900
the smallest one
has 64 processors.

533
00:30:22,900 --> 00:30:25,870
And this is possibly one
of the earliest production

534
00:30:25,870 --> 00:30:30,680
uses of a highly parallel
processing element.

535
00:30:30,680 --> 00:30:34,490
We have been using these
EVEs for a few years

536
00:30:34,490 --> 00:30:37,070
now in our development
laboratories

537
00:30:37,070 --> 00:30:41,240
to do logic simulation.

538
00:30:41,240 --> 00:30:45,690
The second problem is different.

539
00:30:45,690 --> 00:30:48,080
The second problem is
a scientific problem

540
00:30:48,080 --> 00:30:53,120
and it's the problem of
quantum chromodynamics.

541
00:30:53,120 --> 00:30:55,940
The theory of quantum
chromodynamics deals

542
00:30:55,940 --> 00:30:59,630
with the forces that
exist inside the nucleus

543
00:30:59,630 --> 00:31:01,340
at the quark, antiquark level.

544
00:31:01,340 --> 00:31:03,830
So there's a
chromoelectric field

545
00:31:03,830 --> 00:31:05,570
that is hypothesized
to exist, which

546
00:31:05,570 --> 00:31:08,150
is very similar to
the dynamic field that

547
00:31:08,150 --> 00:31:11,090
exists between electrons
and the nucleus.

548
00:31:11,090 --> 00:31:14,930
The chromoelectric field
exists inside the nucleus.

549
00:31:14,930 --> 00:31:17,660
QCD gives you the
transition probabilities

550
00:31:17,660 --> 00:31:20,570
between quark and
antiquark configurations

551
00:31:20,570 --> 00:31:23,690
at various points
in space and time.

552
00:31:23,690 --> 00:31:26,090
From these transition
probabilities,

553
00:31:26,090 --> 00:31:29,730
from first principles one can
compute the mass of the proton.

554
00:31:29,730 --> 00:31:32,630
So the goal is to compute
the mass of the proton, which

555
00:31:32,630 --> 00:31:36,080
is a known quantity and to see
whether the computation matches

556
00:31:36,080 --> 00:31:38,570
what we know the mass
of the product to be.

557
00:31:38,570 --> 00:31:41,570
If it does, then we have a
certain amount of confidence

558
00:31:41,570 --> 00:31:43,650
that QCD is correct.

559
00:31:43,650 --> 00:31:47,400
If it doesn't match, then
we know that QCD is wrong.

560
00:31:47,400 --> 00:31:50,780
Now, this is a structured
and regular computation

561
00:31:50,780 --> 00:31:58,370
but the physicists involved with
this particular theory at IBM

562
00:31:58,370 --> 00:32:02,090
estimates that 3 times
10 to the 17 operations

563
00:32:02,090 --> 00:32:05,120
would be required to compute
the mass of the proton.

564
00:32:05,120 --> 00:32:07,730
On a machine that
could sustain 100 flops

565
00:32:07,730 --> 00:32:09,170
this would take 100 years.

566
00:32:09,170 --> 00:32:12,090
And again, is quite impossible.

567
00:32:12,090 --> 00:32:16,680
On the GF11 this problem can
be solved in approximately one

568
00:32:16,680 --> 00:32:17,180
year.

569
00:32:17,180 --> 00:32:20,420
This is another somewhat
dramatic example

570
00:32:20,420 --> 00:32:22,130
of the use of
parallel processing

571
00:32:22,130 --> 00:32:24,770
to take an important
problem and this is

572
00:32:24,770 --> 00:32:29,990
an important scientific
problem and bring it

573
00:32:29,990 --> 00:32:31,610
into the realm of possibility.

574
00:32:31,610 --> 00:32:37,680
A problem that for all practical
purposes today is intractable.

575
00:32:37,680 --> 00:32:41,610
The GF11 architecture
is an ASMD architecture,

576
00:32:41,610 --> 00:32:45,450
that is there's central control
that broadcasts instructions

577
00:32:45,450 --> 00:32:50,560
to 576 processors.

578
00:32:50,560 --> 00:32:53,240
Now each of these processors
is a fairly powerful machine

579
00:32:53,240 --> 00:32:56,140
it's not a little
microprocessor.

580
00:32:56,140 --> 00:32:58,720
Each processor
has the capability

581
00:32:58,720 --> 00:33:03,550
of executing 20 mega flops, that
is 20 million floating point

582
00:33:03,550 --> 00:33:04,870
operations per second.

583
00:33:04,870 --> 00:33:08,650
It can do I/O at the rate
of 20 megabytes per second

584
00:33:08,650 --> 00:33:11,980
and it has the capability also
of doing 20 million fixed point

585
00:33:11,980 --> 00:33:13,210
operations per second.

586
00:33:13,210 --> 00:33:16,660
If you multiply all
that by 576, you

587
00:33:16,660 --> 00:33:19,630
end up with over 11 gigaflops--

588
00:33:19,630 --> 00:33:22,570
over 11 billion fixed
point operations

589
00:33:22,570 --> 00:33:27,260
per second and a total bandwidth
of 11 gigabytes per second.

590
00:33:27,260 --> 00:33:31,720
Now it is not our
claim that you can

591
00:33:31,720 --> 00:33:35,920
get that level of performance
on any arbitrary problem,

592
00:33:35,920 --> 00:33:41,380
but we do believe that on QCD
we can get up to 90% utilization

593
00:33:41,380 --> 00:33:43,840
and therefore can
sustain a computation

594
00:33:43,840 --> 00:33:47,170
rate of 10 gigaflops per
second, which would be required

595
00:33:47,170 --> 00:33:49,840
to solve the QCD
problem and compute

596
00:33:49,840 --> 00:33:52,420
the mass of the proton in
approximately one year.

597
00:33:52,420 --> 00:33:55,090


598
00:33:55,090 --> 00:33:57,130
The other important
feature of this machine

599
00:33:57,130 --> 00:33:59,290
is the interconnection network.

600
00:33:59,290 --> 00:34:03,070
The interconnection
network works as follows.

601
00:34:03,070 --> 00:34:04,570
Before you start
the problem, you

602
00:34:04,570 --> 00:34:09,310
can select up to 1,000
of the factorial 576

603
00:34:09,310 --> 00:34:11,389
possible permutations.

604
00:34:11,389 --> 00:34:12,820
And then you can
actually provide

605
00:34:12,820 --> 00:34:16,150
each one of these permutations
every 200 nanoseconds

606
00:34:16,150 --> 00:34:18,719
at execution time.

607
00:34:18,719 --> 00:34:22,870
Now QCD itself does not
require that much generality

608
00:34:22,870 --> 00:34:27,070
for the computations, yet
this computation network

609
00:34:27,070 --> 00:34:29,920
and the ability to have
these various permutations

610
00:34:29,920 --> 00:34:33,699
is important because you
want to be able to separate

611
00:34:33,699 --> 00:34:38,080
the physical interconnection
of the processors from the way

612
00:34:38,080 --> 00:34:39,610
they are used logically.

613
00:34:39,610 --> 00:34:41,440
And this is
important if you want

614
00:34:41,440 --> 00:34:46,550
to be able to operate in the
presence of processor failures.

615
00:34:46,550 --> 00:34:49,060
And that is an
important consideration

616
00:34:49,060 --> 00:34:50,650
in this architecture.

617
00:34:50,650 --> 00:34:53,920
I'm also using the word
permutation somewhat loosely,

618
00:34:53,920 --> 00:34:58,780
you can also provide broadcast
every 200 nanoseconds.

619
00:34:58,780 --> 00:35:00,940
And this is one of
the features that

620
00:35:00,940 --> 00:35:04,990
will make the GF11 actually
suitable for a wider

621
00:35:04,990 --> 00:35:07,870
variety of problems than
just the QCD problem.

622
00:35:07,870 --> 00:35:12,650


623
00:35:12,650 --> 00:35:16,310
Let me now move on to the
area of multipurpose parallel

624
00:35:16,310 --> 00:35:17,090
processing.

625
00:35:17,090 --> 00:35:20,270
As I said in my
introductory section,

626
00:35:20,270 --> 00:35:22,620
I believe this is a
very important area,

627
00:35:22,620 --> 00:35:26,510
but I also believe that there
are many open questions that

628
00:35:26,510 --> 00:35:27,810
remain to be answered.

629
00:35:27,810 --> 00:35:31,670
We are slowly answering these
questions but a lot of work

630
00:35:31,670 --> 00:35:33,440
remains to be done.

631
00:35:33,440 --> 00:35:36,330
How much parallelism is
there in applications,

632
00:35:36,330 --> 00:35:39,440
what is the granularity
of this parallelism,

633
00:35:39,440 --> 00:35:41,420
how should the
parallelism be specified?

634
00:35:41,420 --> 00:35:43,910
Should it be
specified explicitly,

635
00:35:43,910 --> 00:35:45,620
should it be implicit
in the language,

636
00:35:45,620 --> 00:35:48,230
should it be derived
by compilers?

637
00:35:48,230 --> 00:35:51,350
What are the appropriate
resource management

638
00:35:51,350 --> 00:35:54,230
and scheduling approaches?

639
00:35:54,230 --> 00:35:57,700


640
00:35:57,700 --> 00:36:00,640
From what class of problems do
you need a global address space

641
00:36:00,640 --> 00:36:03,790
for what kinds of problems will
explicit message passing be

642
00:36:03,790 --> 00:36:04,510
good?

643
00:36:04,510 --> 00:36:07,600
What is the impact of
communication latency

644
00:36:07,600 --> 00:36:08,980
and bandwidth on performance.

645
00:36:08,980 --> 00:36:12,910
And the bottom line really
is what kind of performance

646
00:36:12,910 --> 00:36:16,300
can you sustain across a
wide range of applications

647
00:36:16,300 --> 00:36:18,790
on a given architecture.

648
00:36:18,790 --> 00:36:21,670
As a community we are making
progress on these questions

649
00:36:21,670 --> 00:36:24,550
but I believe a lot more
work remains to be done.

650
00:36:24,550 --> 00:36:27,430
Now two or three years ago
the situation I believe

651
00:36:27,430 --> 00:36:29,770
was as follows.

652
00:36:29,770 --> 00:36:33,220
We as a technical
community had done

653
00:36:33,220 --> 00:36:36,940
a lot of analysis and
simulation to start

654
00:36:36,940 --> 00:36:40,420
understanding the answers
to some of these questions.

655
00:36:40,420 --> 00:36:45,220
But what was really required
was a non-trivial prototypes,

656
00:36:45,220 --> 00:36:47,830
on which we could
run real applications

657
00:36:47,830 --> 00:36:49,790
and perform real measurements.

658
00:36:49,790 --> 00:36:52,990
Approximately four years
ago, we started down

659
00:36:52,990 --> 00:36:58,930
the path of building a
set of prototypes in IBM

660
00:36:58,930 --> 00:37:02,740
to start answering some of
these important questions.

661
00:37:02,740 --> 00:37:06,950
I will come to the RP3
program in a minute,

662
00:37:06,950 --> 00:37:11,230
but before I get to that,
let me mention that we also

663
00:37:11,230 --> 00:37:14,560
have a message passing
parallel processor at IBM

664
00:37:14,560 --> 00:37:16,960
research called Victor.

665
00:37:16,960 --> 00:37:20,230
A 32 node MIMD system
has been operational

666
00:37:20,230 --> 00:37:22,600
since the middle of 1987.

667
00:37:22,600 --> 00:37:25,330
This is based on
the mass transmuter

668
00:37:25,330 --> 00:37:29,600
and is programmed using
C Pascal Fortran outcome.

669
00:37:29,600 --> 00:37:33,210


670
00:37:33,210 --> 00:37:36,780
Victor was and it is a very
quick and inexpensive approach

671
00:37:36,780 --> 00:37:41,790
to putting together a
parallel process very rapidly.

672
00:37:41,790 --> 00:37:43,620
It has been shown
to be exceedingly

673
00:37:43,620 --> 00:37:46,110
good for a variety
of applications

674
00:37:46,110 --> 00:37:49,440
where there's a huge
amount of parallelism.

675
00:37:49,440 --> 00:37:51,960
Some of these
applications are those

676
00:37:51,960 --> 00:37:56,280
that use Monte Carlo methods,
ray tracing and fractals.

677
00:37:56,280 --> 00:38:00,330
We are currently implementing
circuit simulation on Victor

678
00:38:00,330 --> 00:38:02,760
and we'll use Victor as
an experimental vehicle

679
00:38:02,760 --> 00:38:05,610
to study the area
of message passing.

680
00:38:05,610 --> 00:38:07,680
A much bigger version
of Victor will also

681
00:38:07,680 --> 00:38:09,970
be operational fairly soon.

682
00:38:09,970 --> 00:38:13,180


683
00:38:13,180 --> 00:38:21,220
Moving on then to the RP3
program, our goal in 1984

684
00:38:21,220 --> 00:38:22,440
was really two-fold.

685
00:38:22,440 --> 00:38:25,660
One was to establish the
feasibility of building

686
00:38:25,660 --> 00:38:27,850
a multipurpose
parallel processor out

687
00:38:27,850 --> 00:38:31,000
of conventional
microprocessors and the second

688
00:38:31,000 --> 00:38:34,450
as I said was to provide a
set of tools for research

689
00:38:34,450 --> 00:38:42,370
on applications algorithms
languages and architectures.

690
00:38:42,370 --> 00:38:46,810
Here's a block diagram
of the RP3 machine.

691
00:38:46,810 --> 00:38:52,180
In its initial formulation
RP3 was a 512 way parallel

692
00:38:52,180 --> 00:38:57,760
processor and we have actually
implemented 1/8 of this 64

693
00:38:57,760 --> 00:39:00,400
processor version of RP3.

694
00:39:00,400 --> 00:39:02,140
Let me just very
quickly go through some

695
00:39:02,140 --> 00:39:04,780
of the key attributes
of this machine.

696
00:39:04,780 --> 00:39:07,300
First of all, we were really
after a very, very powerful

697
00:39:07,300 --> 00:39:10,780
network and the network is built
out of the highest performance

698
00:39:10,780 --> 00:39:13,780
bipolar technology and package
that was available to us

699
00:39:13,780 --> 00:39:15,800
in that time frame.

700
00:39:15,800 --> 00:39:19,660
The network was also
initially the 512 way

701
00:39:19,660 --> 00:39:25,330
and was supposed to
have combining basically

702
00:39:25,330 --> 00:39:29,740
lets you access a global
memory location that

703
00:39:29,740 --> 00:39:31,930
is end processors can
access a global memory

704
00:39:31,930 --> 00:39:35,650
location essentially in constant
time for a given machine

705
00:39:35,650 --> 00:39:38,080
and not an order of
end-time as would

706
00:39:38,080 --> 00:39:40,000
be the case without combining.

707
00:39:40,000 --> 00:39:42,070
We did not feel that
combining was necessary

708
00:39:42,070 --> 00:39:45,070
as part of the 64-way
machine and so combining

709
00:39:45,070 --> 00:39:47,930
was not implemented
in this version.

710
00:39:47,930 --> 00:39:50,950
So the first important
characteristic of RP3

711
00:39:50,950 --> 00:39:52,360
was the switch.

712
00:39:52,360 --> 00:39:54,700
The second important
characteristic

713
00:39:54,700 --> 00:40:00,160
was the overall approach to
managing caches and managing

714
00:40:00,160 --> 00:40:01,330
memories.

715
00:40:01,330 --> 00:40:04,390
So, cache management was
done under software control

716
00:40:04,390 --> 00:40:07,390
hardware mechanisms
were not provided,

717
00:40:07,390 --> 00:40:09,370
but essentially data
could be tagged as being

718
00:40:09,370 --> 00:40:11,790
cachable or non-cachable.

719
00:40:11,790 --> 00:40:15,550
The memory system was unique
in the following sense

720
00:40:15,550 --> 00:40:19,750
even though, the whole machine
has a global address space,

721
00:40:19,750 --> 00:40:22,210
this address space
can be broken up

722
00:40:22,210 --> 00:40:25,750
into a local address space
and a global address space.

723
00:40:25,750 --> 00:40:27,790
And the amount of
local and global

724
00:40:27,790 --> 00:40:29,980
can be changed
dynamically at runtime

725
00:40:29,980 --> 00:40:31,120
on the program controlled.

726
00:40:31,120 --> 00:40:33,280
So the two extremes
of this would

727
00:40:33,280 --> 00:40:35,410
be to have all the
memory configured

728
00:40:35,410 --> 00:40:38,410
as part of the global
address space in which case,

729
00:40:38,410 --> 00:40:41,770
it becomes a true
shared memory machine.

730
00:40:41,770 --> 00:40:44,770
The other extreme would be
to configure the address

731
00:40:44,770 --> 00:40:47,230
space as pieces of
local address space

732
00:40:47,230 --> 00:40:50,030
only without any shared
address space and in that case,

733
00:40:50,030 --> 00:40:53,060
the machine becomes similar
to a message passing machine.

734
00:40:53,060 --> 00:40:55,330
So part of the motivation
for doing repeat three

735
00:40:55,330 --> 00:40:59,950
was to build a machine that was
a flexible tool for studying

736
00:40:59,950 --> 00:41:04,700
various classes
of architectures.

737
00:41:04,700 --> 00:41:10,670
A third important feature
of RP3 was very extensive

738
00:41:10,670 --> 00:41:13,580
built in monitoring
capabilities.

739
00:41:13,580 --> 00:41:17,390
These monitoring capabilities
allow us to measure things

740
00:41:17,390 --> 00:41:25,280
like the ratio of local to
global memory access, the cache

741
00:41:25,280 --> 00:41:28,310
ratios, instruction
frequencies, delays

742
00:41:28,310 --> 00:41:30,560
caused due to network
access and so on.

743
00:41:30,560 --> 00:41:33,980
There's a processor as part
of every one of these--

744
00:41:33,980 --> 00:41:37,310
there's a chip which is part
of each processing element

745
00:41:37,310 --> 00:41:43,200
that gathers this data while
the machine is executing.

746
00:41:43,200 --> 00:41:51,110
So these were some of the
important features of the RP3

747
00:41:51,110 --> 00:41:52,420
program.

748
00:41:52,420 --> 00:41:53,990
We also felt
initially that it was

749
00:41:53,990 --> 00:41:58,100
important to get a level of
performance high enough to make

750
00:41:58,100 --> 00:41:59,870
the machine interesting.

751
00:41:59,870 --> 00:42:02,750
And the initial
levels of performance

752
00:42:02,750 --> 00:42:06,500
were to get roughly a billion
instructions per second

753
00:42:06,500 --> 00:42:08,720
by the use of 512 processors.

754
00:42:08,720 --> 00:42:11,460


755
00:42:11,460 --> 00:42:13,800
In addition to
the hardware there

756
00:42:13,800 --> 00:42:16,740
was a fairly
extensive undertaking

757
00:42:16,740 --> 00:42:23,130
in terms of developing
software and to develop

758
00:42:23,130 --> 00:42:24,840
software tools
that would allow us

759
00:42:24,840 --> 00:42:26,670
to experiment with parallelism.

760
00:42:26,670 --> 00:42:30,390
One of these tools
as we VM/EPEX, which

761
00:42:30,390 --> 00:42:36,720
allows experiments with
parallelism on a 370

762
00:42:36,720 --> 00:42:39,550
multiprocessor as
opposed to the RP3.

763
00:42:39,550 --> 00:42:42,780
It was developed
to allow us to do

764
00:42:42,780 --> 00:42:45,990
the necessary work and
languages and applications

765
00:42:45,990 --> 00:42:48,000
while RP3 was being developed.

766
00:42:48,000 --> 00:42:50,490
As a side effect this
piece of software

767
00:42:50,490 --> 00:42:55,000
also provides real speed
up on a 370 multiprocessor.

768
00:42:55,000 --> 00:42:59,170
And this is been operational
at several sites.

769
00:42:59,170 --> 00:43:03,910
Together with VM/EPEX we've
also developed several tools

770
00:43:03,910 --> 00:43:07,840
which allow us to do post-mortem
analysis of applications

771
00:43:07,840 --> 00:43:12,370
that are run under
VM/EPEX in a parallel way.

772
00:43:12,370 --> 00:43:14,620
So what you can do is you
can take an application,

773
00:43:14,620 --> 00:43:16,510
you can run it
under VM/EPEX, you

774
00:43:16,510 --> 00:43:20,200
can gather trace information
about the execution,

775
00:43:20,200 --> 00:43:23,110
and then you can run
that trace information

776
00:43:23,110 --> 00:43:25,630
through various
simulators and schedulers

777
00:43:25,630 --> 00:43:29,240
to try to understand what the
effect of the switch would be,

778
00:43:29,240 --> 00:43:32,500
and to try to understand what
efficiency you would get out

779
00:43:32,500 --> 00:43:36,130
of an RP3 like machine.

780
00:43:36,130 --> 00:43:38,950
This particular slide
is not important

781
00:43:38,950 --> 00:43:40,930
but it shows the
kind of information

782
00:43:40,930 --> 00:43:43,210
or one example of the
kind of information one

783
00:43:43,210 --> 00:43:48,100
can get out of VM/EPEX and these
various tools that I described.

784
00:43:48,100 --> 00:43:51,520
To summarize then,
I said earlier

785
00:43:51,520 --> 00:43:55,690
that we had two overriding
goals for the RP3 project.

786
00:43:55,690 --> 00:43:58,000
One was to construct a
multipurpose parallel

787
00:43:58,000 --> 00:43:59,350
processor.

788
00:43:59,350 --> 00:44:01,840
And the features that
make RP3 multi-purpose

789
00:44:01,840 --> 00:44:03,610
are listed on the slide.

790
00:44:03,610 --> 00:44:06,370
It uses a general-purpose
microprocessor,

791
00:44:06,370 --> 00:44:09,320
it's an MIMD structure,
it has a very low latency,

792
00:44:09,320 --> 00:44:12,040
high bandwidth,
self-loading network,

793
00:44:12,040 --> 00:44:15,880
and it provides for flexible
local and global addressing

794
00:44:15,880 --> 00:44:19,750
mechanisms together with a
fairly general I/O subsystem

795
00:44:19,750 --> 00:44:23,170
that I didn't talk about.

796
00:44:23,170 --> 00:44:25,900
What makes RP3 an
experimental facility?

797
00:44:25,900 --> 00:44:28,360
The system monitoring
facilities is the fact

798
00:44:28,360 --> 00:44:32,830
that it uses standard
operating systems,

799
00:44:32,830 --> 00:44:34,960
it has a UNIX based
operating system that

800
00:44:34,960 --> 00:44:37,700
uses conventional
programming languages,

801
00:44:37,700 --> 00:44:41,230
and we have a fairly
extensive suite of application

802
00:44:41,230 --> 00:44:43,450
development and analysis tools.

803
00:44:43,450 --> 00:44:46,780
The RP3 also has adequate
system reliability

804
00:44:46,780 --> 00:44:50,780
to serve as an
experimental facility.

805
00:44:50,780 --> 00:44:54,090
And I'd like to make two points.

806
00:44:54,090 --> 00:44:57,800
One is that this whole
area is extremely

807
00:44:57,800 --> 00:44:59,560
important to our industry.

808
00:44:59,560 --> 00:45:03,020


809
00:45:03,020 --> 00:45:07,410
If I were to net it out I see
two very important trends.

810
00:45:07,410 --> 00:45:11,900
One is the area of
mainframe parallelism.

811
00:45:11,900 --> 00:45:14,450
Here we will construct
the highest performance

812
00:45:14,450 --> 00:45:17,840
uniprocessors and we
will use parallelism

813
00:45:17,840 --> 00:45:21,650
to get performance levels
beyond that possible

814
00:45:21,650 --> 00:45:23,630
with a uniprocessor.

815
00:45:23,630 --> 00:45:25,460
To a large extent,
this will rely

816
00:45:25,460 --> 00:45:28,580
on advances in
technology and packaging

817
00:45:28,580 --> 00:45:30,080
and that's where
the risk will be.

818
00:45:30,080 --> 00:45:32,840
To a large extent
mainframe parallelism

819
00:45:32,840 --> 00:45:36,020
will use evolutionary software.

820
00:45:36,020 --> 00:45:39,320
The other very important
trend is the whole area

821
00:45:39,320 --> 00:45:42,350
of microprocessor-based
parallelism, where

822
00:45:42,350 --> 00:45:46,880
we will use commodity parts and
higher degrees of parallelism.

823
00:45:46,880 --> 00:45:50,360
And the key issue here will
be advances in software

824
00:45:50,360 --> 00:45:53,540
that will be required to
make these efficient use

825
00:45:53,540 --> 00:45:55,290
of these higher
degrees of federalism.

826
00:45:55,290 --> 00:45:57,980


827
00:45:57,980 --> 00:46:00,770
The second sort of summary
comment I would like to make

828
00:46:00,770 --> 00:46:04,300
is that parallel processing
is important today.

829
00:46:04,300 --> 00:46:05,910
It's starting to be used.

830
00:46:05,910 --> 00:46:08,300
And I believe that
this technology will be

831
00:46:08,300 --> 00:46:10,050
very pervasive in the future.

832
00:46:10,050 --> 00:46:13,410
This is one of the key systems
issues that we have to address

833
00:46:13,410 --> 00:46:15,830
and we have to solve all
the way from workstations

834
00:46:15,830 --> 00:46:18,560
to supercomputers.

835
00:46:18,560 --> 00:46:20,810
I would also like to
leave you with the thought

836
00:46:20,810 --> 00:46:24,120
that we are actively
pursuing in IBM

837
00:46:24,120 --> 00:46:26,490
several different
systems approaches.

838
00:46:26,490 --> 00:46:29,900
I've tried to identify
and briefly explain

839
00:46:29,900 --> 00:46:31,400
most of these to you.

840
00:46:31,400 --> 00:46:33,860
And some of these are
being pursued as products

841
00:46:33,860 --> 00:46:36,780
and some of them
are in research.

842
00:46:36,780 --> 00:46:38,900
That brings me to
the end of my talk

843
00:46:38,900 --> 00:46:41,150
and I will now take questions.

844
00:46:41,150 --> 00:46:44,210
And the first question from
Gordon Bell at Arden computer.

845
00:46:44,210 --> 00:46:47,030
Why did the GR11 and
RP3 parallel machines

846
00:46:47,030 --> 00:46:50,210
take so long to develop,
and what did you learn?

847
00:46:50,210 --> 00:46:53,990
I think the basic reason
why we have spent more time

848
00:46:53,990 --> 00:46:57,530
than we initially anticipated
on board these machines is

849
00:46:57,530 --> 00:47:02,100
that they are both in
their own way very complex,

850
00:47:02,100 --> 00:47:03,170
large systems.

851
00:47:03,170 --> 00:47:13,910
In neither case did we really
exploit the CMOS building block

852
00:47:13,910 --> 00:47:18,740
replicated to produce a very
high performance machine.

853
00:47:18,740 --> 00:47:23,280
The RP3 became complex
for a variety of reasons.

854
00:47:23,280 --> 00:47:27,800
There was a significant
amount of basic logic design

855
00:47:27,800 --> 00:47:30,860
that we ended up having to do.

856
00:47:30,860 --> 00:47:34,950
A lot of this was in the
cache and memory system.

857
00:47:34,950 --> 00:47:42,200
We also ended up using multiple
technologies and multiple power

858
00:47:42,200 --> 00:47:45,020
supplies all of
which contributed

859
00:47:45,020 --> 00:47:47,840
to the overall complexity
of the machine.

860
00:47:47,840 --> 00:47:51,850
In the case of GF11 also we
didn't use a standard building

861
00:47:51,850 --> 00:47:52,350
block.

862
00:47:52,350 --> 00:47:57,800
We ended up building our
own uniprocessors that

863
00:47:57,800 --> 00:48:01,840
were fairly significant
in their complexity

864
00:48:01,840 --> 00:48:04,670
and we ended up building a
fairly complex interconnection

865
00:48:04,670 --> 00:48:05,420
network.

866
00:48:05,420 --> 00:48:08,990
For myself Gerald Maguire
at Columbia University.

867
00:48:08,990 --> 00:48:12,440
For such calculations as
used in QCD, for computations

868
00:48:12,440 --> 00:48:14,630
run for a long
period of time, what

869
00:48:14,630 --> 00:48:17,630
is the reliability of the
machine, what about the problem

870
00:48:17,630 --> 00:48:19,070
of detecting faults?

871
00:48:19,070 --> 00:48:22,950
That's a very good question.

872
00:48:22,950 --> 00:48:28,050
As you probably know the GF11--

873
00:48:28,050 --> 00:48:32,670
that the QCD
competition is basically

874
00:48:32,670 --> 00:48:35,730
a fairly stable competition and
it's a Monte Carlo computation.

875
00:48:35,730 --> 00:48:39,150
In fact, the physicist
on the project

876
00:48:39,150 --> 00:48:44,670
has always had the view that
if you randomly lost a bit once

877
00:48:44,670 --> 00:48:46,680
and away, once in a
meaning let's say once

878
00:48:46,680 --> 00:48:49,530
every day or every
several hours,

879
00:48:49,530 --> 00:48:52,080
it wouldn't in fact
significantly affect

880
00:48:52,080 --> 00:48:52,860
the competition.

881
00:48:52,860 --> 00:48:54,060
That's point 1.

882
00:48:54,060 --> 00:48:56,790
Point 2 is that the
GF11 is designed

883
00:48:56,790 --> 00:49:00,720
so that you can do a dump
of the entire memory state

884
00:49:00,720 --> 00:49:02,950
of the system very rapidly.

885
00:49:02,950 --> 00:49:08,190
So you can dump the entire state
in approximately 12 seconds.

886
00:49:08,190 --> 00:49:11,490
And therefore one
could very frequently

887
00:49:11,490 --> 00:49:13,407
checkpoint the system.

888
00:49:13,407 --> 00:49:15,240
So I think basically
the way one will end up

889
00:49:15,240 --> 00:49:18,000
doing the QCD
computation is either

890
00:49:18,000 --> 00:49:21,000
to do every computation twice
and compare the answers,

891
00:49:21,000 --> 00:49:24,780
or to split the machine into
two pieces and run the problem

892
00:49:24,780 --> 00:49:28,180
and compare processor by
processor of the results.

893
00:49:28,180 --> 00:49:31,020
That's how you get your
basic error detection.

894
00:49:31,020 --> 00:49:33,180
And with that, you
would do two things.

895
00:49:33,180 --> 00:49:34,680
You would look at
the errors and you

896
00:49:34,680 --> 00:49:36,690
would see if they were
truly just random errors

897
00:49:36,690 --> 00:49:37,773
or there was some pattern.

898
00:49:37,773 --> 00:49:40,620
If there was a pattern
there would be a problem.

899
00:49:40,620 --> 00:49:42,180
Secondly, you could
always back up

900
00:49:42,180 --> 00:49:44,520
and restart the computation
because of this check

901
00:49:44,520 --> 00:49:46,310
pointing capability.

902
00:49:46,310 --> 00:49:50,190
So I think we'll end up running
GF11 and its architecture

903
00:49:50,190 --> 00:49:52,800
to do that basically
split the machine into

904
00:49:52,800 --> 00:49:56,190
and run the identical
computation on both sides

905
00:49:56,190 --> 00:49:59,540
and compare the
results of processors.

906
00:49:59,540 --> 00:51:17,000