[MUSIC PLAYING]

The Cray T3D system is the first of a new generation of supercomputers from Cray Research based on the principles of massively parallel processor design. This lecture will describe architectural and hardware implementation aspects of the Cray T3D. A second tape in this series will describe the system software and application software environments.

This new family of supercomputers complements the highly successful line of Cray parallel-vector processing systems. Those products, the most recent of which is the Cray C90 and its descendants, have roots tracing back to the original Cray-1. The Cray-1, a universally accepted guidepost for the industry, was designed by Seymour Cray and introduced in the marketplace in 1976.

Significant changes have occurred since those early days at Cray Research. Most compelling are the great advances in integrated circuit density and speed. These advances have spawned multiple generations of microprocessors and inexpensive semiconductor memory. In fact, the latest microprocessors now challenge the performance claims of the original Cray-1, but at a small fraction of the cost. By combining this microprocessor technology with advanced Cray proprietary integrated circuit, packaging, and cooling technology, the Cray T3D has been designed to solve highly parallel, very compute- and I/O-intensive problems.

The Cray T3D is a multiprocessor system, scalable up to configurations of 2,048 processors. It operates in a heterogeneous parallel-vector and massively parallel environment. The Cray T3D supports a globally addressable, physically distributed memory subsystem. I/O is also scalable via I/O gateways, which support high-bandwidth proprietary channels connected to Cray parallel-vector systems and dedicated I/O processors. The configuration supports a high degree of concurrency to achieve high efficiencies from the computational and data motion resources.
Scaling options allow varying subsystem configurations across a wide range. These options allow parallel-vector production environments to explore the potential of MPP systems. Others, with optimized MPP applications in hand, will configure their systems with the majority of the resources on the MPP side.

Closely coupling the MPP with proven parallel-vector technology serves several purposes. Moving MPP technology into the production environment requires a host of operating system services. In the parallel-vector environment, Cray has more than a 10-year development investment in our supercomputer variant of Unix, which we call UNICOS. By adding microkernel technology on the MPP side, it has been possible to maximize MPP OS functionality and stability and minimize time to market.

It is well known that moving applications onto MPP platforms can be a daunting task. The heterogeneous configuration allows an environment in which an application programmer can manage the work in a staged, incremental way.

Finally, the closely coupled connection between the parallel-vector system and the Cray T3D allows new opportunities for high-performance computing. There have been recent reports of applications which have portions of code that map best on the MPP, while other portions of the code do better on the parallel-vector platform. The end result can be performance which outpaces either component working alone. The Cray environment will support efficient heterogeneous computing using the Parallel Virtual Machine software technology. PVM was originally developed through a three-way collaborative effort of researchers at Oak Ridge National Laboratory, the University of Tennessee, and Emory University.
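To give a flavor of the message-passing style PVM supports, here is a minimal sketch using the PVM 3 C interface. The program name, message tag, and master/worker pairing are hypothetical, and error handling is omitted:

    #include <stdio.h>
    #include "pvm3.h"

    #define MSGTAG 1  /* hypothetical message tag */

    int main(void)
    {
        int mytid = pvm_mytid();          /* enroll this task in PVM */
        int ptid  = pvm_parent();         /* parent's task id, if any */

        if (ptid == PvmNoParent) {
            /* Master: spawn one copy of this same program as a worker. */
            int tid;
            pvm_spawn("pvm_sketch", NULL, PvmTaskDefault, "", 1, &tid);

            int value = 42;
            pvm_initsend(PvmDataDefault); /* start a packed send buffer */
            pvm_pkint(&value, 1, 1);      /* pack one integer */
            pvm_send(tid, MSGTAG);        /* explicit send to the worker */
        } else {
            /* Worker: receive the integer from the parent. */
            int value;
            pvm_recv(ptid, MSGTAG);       /* blocking receive */
            pvm_upkint(&value, 1, 1);     /* unpack the payload */
            printf("task %d received %d\n", mytid, value);
        }

        pvm_exit();                       /* leave the virtual machine */
        return 0;
    }

Note how the explicit send and matching receive carry both the data and a low-level synchronization, which is the defining trait of the message-passing model discussed later in this lecture.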
The Cray T3D compute and I/O nodes are interconnected with a three-dimensional toroidal switch. The switch is designed for very high bandwidth and extremely low latencies, setting a new standard for massively parallel switch design. The switch supports independent bidirectional paths to nearest neighbors in each of the six directions, which can be thought of as up, down, north, south, east, and west. A seventh pair of paths provides connectivity to the compute resources associated with the node. The switch manages a peak input flux and, of course, an equal output flux of 16.8 gigabits per second, or 2.1 gigabytes per second. Unidirectional switch and wire latency for a two-byte-wide packet phit, or physical unit, is typically 13.3 nanoseconds. When routing requires a corner turn, there's only an additional 6.7 to 13.3 nanoseconds of delay.

What kinds of problems have the potential to capitalize on the computing power delivered by large numbers of processors, numbering in the hundreds or even thousands? While we are interested in delivering a high degree of sustained, as opposed to peak, performance for real-world applications, the most important objective must be reducing the time to solution. The problem and selected algorithm must, of course, be highly parallel. As is well known, Amdahl's law observes that for practical problems, the total execution time will consist of a parallel portion parceled out to a collection of processors, plus a serial portion which cannot be performed in a parallel manner. If large numbers of processors are frequently forced to wait for the computation of data coming from a single processor in the system, plus the additional time required to distribute that data, the actual benefit of the large numbers of processors will be reduced. At some point, adding more processors is no longer cost-effective in reducing the time to solution. Even for a problem which is 95% parallel, one cannot expect speedups beyond 20, even with an arbitrarily large number of processors. Massively parallel or MPP systems of hundreds of processors require problems that are more than 99% parallel.
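To make those figures concrete, here is the standard Amdahl's law arithmetic; the 99-percent case worked below for 1,024 processors is an illustrative calculation added for this transcript, not a figure from the lecture.

    S(N) = \frac{1}{(1 - p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}

    p = 0.95: \quad S_{\infty} = \frac{1}{0.05} = 20

    p = 0.99,\ N = 1024: \quad S = \frac{1}{0.01 + 0.99/1024} \approx 91

So even a 99-percent-parallel problem tops out near a speedup of 100, no matter how many processors are applied.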
The problem design should not only be highly parallel, but it should also exhibit high algorithmic efficiency. Each calculation should be effectively moving the problem toward the solution. A high sustained floating-point operation rate is diminished in value if the algorithm demands many more intermediate results than a competing algorithmic choice.

Third, it must be possible to efficiently move dependent data to requesting processors. This communication overhead can have the same effect as the serial portion of the computation. Here, we are interested in algorithms which require only modest amounts of interprocessor communication, or which allow mechanisms in the design to effectively make the latency transparent. This is accomplished by hardware and software prefetching techniques.

The importance of each of these criteria (algorithmic efficiency, a high level of parallelism, and moderate or transparent processor communications) can be mitigated by selecting ever-higher-performing processors and communication networks. But in the real world, users are also concerned about price performance. It would be terrific if we could design a system consisting of 1,000 processors, each with gigaflops of performance, an interprocessor network that delivers tens of gigabytes per second of data flux for each processor, and interprocessor latencies below 100 nanoseconds. The problem today is that such a system, if it could be built, would be outrageously expensive. This will no doubt be less true tomorrow. Today, however, we are faced with a series of important architectural and technology trade-offs. Properly made, there is vast potential for improving the price performance of challenging and grand-challenge problems having the characteristics just described.

The rest of the story, then, will take you through the series of architectural and technology choices and trade-offs made by the designers of the Cray T3D.

Three distinct kinds of parallel programming models, and the accompanying high-level language constructs to support them, have evolved.
These models are data-parallel, message passing, and work sharing.

Data-parallel models use a limited, often single, control flow within a system to control a very large data set. Data communication is usually specified implicitly in the high-level language using, for example, naming which references entire arrays of data. Interaction among individual data elements is masked by high-level operators, such as matrix multiply, transpose, or inversion.

Message-passing models recognize distinct data domains, which may represent real domains in the problem space or which may just be artifacts of the organization of the computer system itself. In any case, explicit communication sequences are specified, indeed required, which serve to move data and also provide a low-level synchronization function.

The third model is work sharing. The emphasis here is on implicit data motion requests automatically invoked by the use of variables representing a fine degree of granularity. The naming can effectively specify and use arrays, array elements, and scalars. Multiple control flows are supported through a rich set of synchronization primitives. The objective here is to keep a high percentage of the processing cycles actively performing real work.

Without applications, particularly third-party applications, no supercomputing system can be broadly successful in the marketplace. Although application development remains in an immature state, the variety of applications which do exist, and which must be supported for market success, use all three models: data-parallel, message passing, and the emerging work-sharing model. The Cray T3D was designed to efficiently support all three.
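As one concrete illustration of the work-sharing style, here is a minimal sketch of dynamic loop distribution using a shared counter. This is generic C threads-and-atomics code written for this transcript, not the T3D's own primitives; the fast fetch-and-increment is the kind of synchronization primitive such hardware is meant to make cheap.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define N 1000
    #define NTHREADS 4

    static atomic_int next_index = 0;   /* shared loop counter */
    static double result[N];

    /* Each worker repeatedly claims the next unprocessed iteration,
       so fast processors naturally take more of the work. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            int i = atomic_fetch_add(&next_index, 1);
            if (i >= N)
                break;
            result[i] = i * 2.0;        /* stand-in for real work */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("result[999] = %g\n", result[999]);
        return 0;
    }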
Let's turn our attention to the variety of architectural and implementation choices for massively parallel processor design. It is expected that for highly parallel problems, significant performance gains will come from large numbers of processors working together. But there are two distinct ways this cooperation might be enforced. The collection of processors could follow the Single Instruction, Multiple Data, or SIMD, model, or the Multiple Instruction, Multiple Data model, the MIMD model. In the first case, each processor operates in lockstep with all other processors in the partition, but with different parallel data sets. In the second case, MIMD, each processor follows its own control flow, executing from a distinctly different instruction stream.

SIMD designs are a good match for data-parallel models and can be somewhat simpler to program. On the other hand, they suffer from inefficiencies when, for example, sparse matrix data is being manipulated in rather naive ways, with lots of multiplications or additions of zero operands. Although the operation count may be high, the algorithmic efficiency for this example is low.

MIMD systems provide many more degrees of freedom, with, conceptually, each processor operating from distinctly different optimized control stores. In practice, however, the computation would probably be broken up into a handful of computational domains. There may, for example, be separate code to simulate a physical structure's interior and, separately, its boundary regions. Whereas synchronization is implicit for the SIMD model, with all parallel calculations occurring simultaneously, it is necessary in the MIMD case to manage more explicitly the limited interaction among processors. This adds complexity at the application programmer or at least the compiler designer level. Nevertheless, the opportunity for high real performance has caused almost every MPP vendor to concentrate on MIMD designs. And the Cray approach is no different. The Cray T3D is a multiple instruction, multiple data design.
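To illustrate the sparse-matrix point above, compare a naive dense loop, which spends operations on zero operands, with a compressed-sparse-row loop that touches only the nonzeros. This is a generic C sketch with illustrative array names, added for this transcript:

    #include <stddef.h>

    /* Dense matrix-vector product: the operation count is n*n
       multiply-adds even if most entries of a are zero, so the
       flop rate is high but the algorithmic efficiency is low. */
    void matvec_dense(size_t n, const double *a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            double s = 0.0;
            for (size_t j = 0; j < n; j++)
                s += a[i * n + j] * x[j];   /* zero operands still cost ops */
            y[i] = s;
        }
    }

    /* The same product in compressed sparse row form: rowptr, colidx,
       and val describe only the nonzeros, so every multiply-add moves
       the problem toward the solution. */
    void matvec_csr(size_t n, const size_t *rowptr, const size_t *colidx,
                    const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            double s = 0.0;
            for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
                s += val[k] * x[colidx[k]]; /* only nonzeros are touched */
            y[i] = s;
        }
    }

The sparse form, with its data-dependent loop bounds and indirect indexing, is also exactly the kind of code that is awkward in lockstep on a SIMD machine but natural under MIMD.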
The next important choice involves the way in which memory and, ultimately, data are shared among processors. Parallel-vector systems from Cray Research and others use a uniform-access, physically shared memory implementation. With this approach, the processors and the memory subsystem form two distinct and physically separated resource pools. All processors compete more or less equally for access to each location in the memory subsystem. The latency each processor sees is also approximately equivalent. Although not fundamental to the physically shared design, the communication paths between the processor partition and the memory partition are often highly optimized for high bandwidth and low latency, adding significantly to the supercomputer system cost. Unfortunately, such system organizations do not scale well to large numbers of processors, because the required number of independent data paths grows exponentially.

The non-uniform, physically distributed memory model provides a much more scalable solution, with each processor located physically nearby a set of memory devices. The processors can then have preferential access, that is, lower-latency and/or higher-bandwidth access, to the nearby memory. The non-uniform memory access model can provide a much more cost-effective solution for parallel algorithms that emphasize a high degree of local computation and only moderate amounts of global data movement, whereas the high degree of connectivity in the uniform-access memory system may be underutilized a significant part of the time.

The Cray T3D utilizes the non-uniform, physically distributed memory model in its construction. However, through its memory addressing mechanism, it preserves, in a logical sense, the concept of memory sharing, allowing the Cray T3D to operate as a true multiprocessor.

Many MPP systems to date have been designed as multi-computers, not multiprocessors. Interprocessor communication must depend upon explicit I/O-based requests. Data moving from one processor to another must pass through the I/O ports of the processors. This usually requires the use of processor interrupt mechanisms and even intervention by the operating system.
The target processor must decode or interpret, via software techniques, the address tag and the purpose of the message to determine the exact source or destination for the payload. This results in significant increases in latency for the interprocessor communication process.

A multiprocessor, on the other hand, provides direct paths to each memory location within the entire processor pool partition. Addressing is managed through a single global address space. On the target end of the transfer, the data communication mechanism does not interrupt the processor, but proceeds directly, via hardware support, to that processor's local memory.

Because network communication within less efficient multi-computers lacks a unique global addressing mechanism and also requires processor intervention on each side of the transfer, the programming models on these systems are dependent upon the message-passing model. For a multiprocessor system, however, a true work-sharing model can be implemented. The specification of data using a high-level language, whether local or global, can be consistent. Memory references, local or global, are direct at the hardware level, with very low overhead.

It is very important, however, for the Cray T3D to effectively support message passing. And it does. Explicit message-passing send and receive requests are translated so that, at the lower hardware level, the actual data movement progresses through the multiprocessor memory protocols.

Let's concentrate now on the interprocessor data communication network, to provide motivation for the network topology and switch characteristics selected by the Cray T3D designers. Access by processors to the local memory on the same node results in preferential treatment, with higher bandwidth and lower latency, than access to memory locations residing in neighboring nodes. When making trade-offs among resources spent on local versus global traffic, support for locality of data has priority. We have seen that the Cray T3D is a true multiprocessor.
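To give a flavor of what direct global references look like to a programmer, here is a minimal sketch assuming an OpenSHMEM-style interface, in the spirit of the one-sided SHMEM library associated with the T3D. The ring pattern and variable names are illustrative, not code from the lecture.

    #include <stdio.h>
    #include <shmem.h>

    /* A symmetric variable: every PE has its own copy, and any PE
       can address any other PE's copy directly. */
    static long counter = 0;

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        long value = me;
        int right = (me + 1) % npes;

        /* Store directly into the right-hand neighbor's memory. */
        shmem_long_put(&counter, &value, 1, right);

        shmem_barrier_all();   /* global synchronization */

        printf("PE %d holds %ld, written by its left neighbor\n",
               me, counter);

        shmem_finalize();
        return 0;
    }

The put lands in the neighbor's memory with no receive call and no interrupt on the target side, which is exactly the direct, hardware-supported global reference behavior described above.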
Although processors own local memory resources, that memory appears logically as part of a larger, globally addressed, physically distributed, shared memory subsystem. Although the objective in programming for a distributed memory system should always be to maximize the occurrence of local calculations and thereby minimize global traffic, it is nevertheless helpful to minimize, within reason, global latency. We give this priority because a certain amount of global communication for real-world problems is unavoidable.

But we want more than low latency. We also want to include mechanisms which allow the system to hide latency whenever possible through invocation of prefetch mechanisms. Network paths should obviously be high-bandwidth, within the constraints of system cost trade-offs. The Cray T3D achieves both low latency and high bandwidth by using true supercomputer technology. Microprocessor shell and network switch logic are clocked by a single synchronous 150-megahertz clock distribution network.

Market requirements for massively parallel systems dictate support for a wide range of system sizes with incremental pricing structures. The network must be scalable, with a fairly linear increase in cost to match linear increases in the number of processors. Fault-tolerant features must be considered when designing the switching network. Reconfiguration can allow masking of hardware faults until scheduled maintenance, increasing system availability.

Finally, the system architect should be aware of the technology choices which will be available to those who implement the hardware design. Advanced packaging to support high-clock-rate designs has been a trademark of Cray Research designs through the years. With time to market always an important consideration, it is wise to exploit, but not overextend, the technology reach of each new product.
Good judgment therefore requires a good technology fit with the design and manufacturing processes of the organization.

And now, Steve Oberlin will discuss the details of the interprocessor communication network.

Thanks, Steve. As previously mentioned, Cray T3D Processing Elements, or PEs, communicate over a high-speed network configured in a three-dimensional torus topology. What's that? Well, to illustrate, let's start with networks having fewer than three dimensions. This is a diagram of a simple single-dimension torus network connecting eight nodes. The links between them carry data packets at some maximum rate we're not concerned with for now. There are a few things worth noting about this network.

First, what's the latency of the network going to be? Topologically speaking, there are no ends to the network. During random global communications, a packet can choose to travel whichever direction around the network is the shortest path to its destination. The worst-case distance, then, is to a node halfway around the network. If we measured latency in hops, the maximum for this network would be four hops. If you look at all possible communications between PEs, the average latency is about two hops.

Second, what is the bandwidth of this network? The ability of the network to carry global traffic is measured by counting the number of wires that would be cut if you snipped the network in two at the center. This is called "bisecting" the network. When you multiply the result by the speed data moves on each wire, you get a metric called the "bisection bandwidth." And for this simple 1D example, the bisection is two times the channel rate.

In this diagram, it looks as though the link connecting the ends to form a torus must be a pretty long wire. In reality, it's possible to fold a torus so that all wires are about the same length. This diagram is also a torus.
And it's topologically equivalent to the previous diagram; only the wires go from node to node in an interleaved pattern. In T3D, two of the three dimensions are folded in this fashion.

There's one other characteristic of the torus that's worth mentioning: it's naturally fault-tolerant. Suppose one of the links broke for some reason. The torus provides a second path between nodes, the long way around the network, so that the system can continue running. The more processors in a system, the more important fault tolerance becomes, because the failure of a single PE or link represents a correspondingly smaller fraction of the machine. Who wants to give up their machine for repair if a failure only costs one-thousandth of its capability?

Let's look at a more complex network, an eight-node 2D torus. Now, our nodes are connected to their neighbors in two dimensions. By adding a second dimension, we double the bisection bandwidth to four over the 1D torus. Maximum latency is now three, and the average is less than two. These effects are magnified the more nodes you have. For instance, a 64-node 2D torus has eight times the global bandwidth of a 1D torus and a maximum latency of eight, instead of 32.

You can see where this is going by now. By stacking planes of 2D tori and connecting nodes in the third dimension into a torus as well, we can construct a 3D torus, which is the T3D's network topology. We've again doubled the bisection bandwidth, to eight. And maximum latency is now two, and the average barely over one. Again, the biggest advantages of adding a dimension come with the larger sizes. At 64 nodes, a 3D torus has 16 times the bandwidth of a 1D torus and a maximum latency of six, instead of 32. For comparison, the largest-size T3D has a torus with 1,024 nodes in an eight by 16 by eight configuration.
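Those hop and bandwidth figures are easy to check mechanically. Here is a small C sketch, written for this transcript, that computes the maximum hop count and the bisection bandwidth for a torus of given dimensions:

    #include <stdio.h>

    /* Maximum shortest-path hops in one torus ring of a given size:
       a packet can go either way around, so it's half the ring. */
    static int ring_max(int size) { return size / 2; }

    /* Links cut when a torus is bisected across one dimension:
       one plane of links, doubled because the ring wraps around. */
    static long bisect_links(int dx, int dy)
    {
        return 2L * dx * dy;
    }

    int main(void)
    {
        /* The largest T3D: 8 x 16 x 8 = 1,024 network nodes. */
        int max_hops = ring_max(8) + ring_max(16) + ring_max(8);
        printf("max hops: %d\n", max_hops);       /* 4 + 8 + 4 = 16 */

        /* The 1K-processor example from the lecture: 8 x 8 x 8 nodes,
           two directions per link, 300 megabytes per second each way. */
        long links = bisect_links(8, 8);          /* 128 links cut */
        double gbytes = links * 2 * 300.0 / 1000.0;
        printf("bisection: %.1f GB/s\n", gbytes); /* 76.8 GB/s */
        return 0;
    }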
Data packets route through the T3D network in dimension order. This means that a packet injected into the network routes in the x-dimension first, then the y-dimension, then z, until it reaches the destination node, where it's ejected from the network.

So what does a T3D network node look like? This block diagram shows the communication paths associated with a network switch. Each network node in T3D is shared by two independent PEs. The PEs are logically separate and only share the bandwidth of a network switch and a data mover called the "Block Transfer Engine," which I'll talk about later. There are high-speed channels radiating from the network switch in seven directions. Six of those directions are for the 3D torus connections. The seventh direction is the channel to the PEs. Each channel is actually two unidirectional channels that operate independently. In other words, it's possible to have data streaming through the switch from east to west at the same time other data is moving from west to east.

Packet steering is controlled by a routing tag in the header of a data packet that specifies the number of hops the packet should travel in each direction. In T3D, routing tags are stored in a high-speed lookup table at each node. When a data link fails, the routing directions in the lookup tables are simply changed to avoid the bad link. T3D also provides some extra compute nodes as spares in the event of a PE or node failure, for increased fault tolerance. There are eight redundant PEs for every 512 PEs. Redundant nodes can be electronically switched in place of a failed compute node by rewriting the lookup table to substitute the redundant node for the failed node.
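To make dimension-order routing concrete, here is a small C sketch, again written for this transcript, that builds a routing tag of signed hop counts, picking the shorter way around each ring. The coordinates and tag layout are illustrative, not the hardware's actual tag format:

    #include <stdio.h>

    /* Signed hops along one ring: positive one way around, negative
       the other, always the shorter of the two. */
    static int ring_hops(int from, int to, int size)
    {
        int d = (to - from + size) % size;   /* hops going "up" */
        return (d <= size / 2) ? d : d - size;
    }

    int main(void)
    {
        /* Hypothetical source and destination coordinates in an
           8 x 16 x 8 torus. */
        int src[3] = {1, 2, 3}, dst[3] = {7, 14, 0}, dim[3] = {8, 16, 8};
        int tag[3];

        /* Dimension order: exhaust x hops first, then y, then z. */
        for (int i = 0; i < 3; i++)
            tag[i] = ring_hops(src[i], dst[i], dim[i]);

        printf("route: x %+d, y %+d, z %+d\n", tag[0], tag[1], tag[2]);
        return 0;
    }

For this example the tag comes out x -2, y -4, z -3: in every dimension the wraparound path is shorter than the direct one, which is the torus advantage described earlier.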
The network switch, or router, is actually implemented on three chips. Each chip handles one dimension. We call this a "dimension-sliced router." Its functioning is fairly straightforward. Packets enter the x-dimension switch from a PE and route from x-switch to x-switch, then turn the corner by crossing to the y-switch chip to travel in the y-dimension, then cross to the z-switch chip for z-dimension routing, then finally emerge at the destination PE.

Latency of the switch is very low. Essentially, it takes one clock period, 6.7 nanoseconds at 150 megahertz, for a packet entering a chip to decide which way it's going and leave again. At these clock speeds, the time spent on the wires isn't negligible and must also be included in latency calculations. In T3D, all wires are either one or one and a half clocks long. Each hop in the network then takes one clock for the switch plus one to one and a half clocks for the wire. Turning the corner is similar to routing within a dimension. It takes one clock inside the first chip, one clock for the connection between chips, and one clock for the second chip. And you're on the wires in the next dimension.

Notice this diagram has all the same arrows as the earlier node diagram. Each arrow represents a channel that is 16 bits wide and clocked at 150 megahertz, for a raw bandwidth of 300 megabytes per second. Earlier, we talked about the bisection bandwidth of a network. The bisection bandwidth of a 1K-processor T3D, which has 512 network nodes in an eight by eight by eight torus, is easy to calculate. The number of channels cut when we bisect the torus is 8 times 8, one plane of the torus, times 2 for the connected ends of the torus, times 2 for the channels in each direction, times 300 megabytes per second, giving a total of 76.8 gigabytes per second.
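As a rough sanity check on those per-hop figures (a back-of-the-envelope added for this transcript, ignoring contention and the overhead of entering and leaving the network): the worst-case route in the largest eight by 16 by eight machine is 4 + 8 + 4 = 16 hops, so

    16\ \text{hops} \times (2\ \text{to}\ 2.5\ \text{clocks per hop}) \times 6.7\ \text{ns per clock} \approx 210\ \text{to}\ 270\ \text{ns}

plus roughly one extra clock for each of the two corner turns. Even a worst-case network traversal is a few hundred nanoseconds, comfortably under a microsecond.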
Data packets carry data in two payload sizes, one or four 64-bit words. Packets are broken up into 16-bit chunks called "phits," for physical units, and sent over the network in a string of successive parcels. The routing tag is always in the first phit, and it opens a path through the switches for the following control and data phits. The head of a packet can easily be many switches ahead of the tail in the network. In the scientific literature, this is commonly called "wormhole routing."

Deadlock is a phenomenon, like gridlock in Manhattan at rush hour, that can happen when packets in a network are not guaranteed to have room to make forward progress. Deadlock is impossible in T3D because of the use of layers of buffering called "virtual channels." By separating the buffering for requests and responses, and providing an additional set of buffers for packets that have passed a certain point in a torus ring called the "date line," we guarantee network packets can't be blocked indefinitely. T3D thus has four sets of virtual channel buffers: two for requests and responses before the date line and two for requests and responses after the date line.

Cray chose the 3D torus topology after an extensive investigation ending in a final shootout between a couple of champion networks in a detailed simulation using real and synthetic communication patterns. When evaluated within the constraints of real-world packaging limits and wiring capabilities, the 3D torus provided the highest global bandwidth and lowest global latency for the number of processors we expected to have in T3D, hundreds to thousands. Several additional important characteristics include good physical scalability, fault tolerance, the ability to exploit locality for reduced latency and improved bandwidth, and a good fit with our technology, that is, short wires in a regular wiring pattern, resulting in a high clock speed. There's also significant headroom in the technology that will be used to provide even better network performance in future machines.

Let's define and differentiate two levels of the Cray T3D architecture. These are the macro-architecture level and the micro-architecture level. The macro architecture represents the high-level view of the system seen by the MPP programmer.
This includes the globally addressed, physically distributed memory characteristics; in a general sense, the topology, which defines relationships among compute nodes; and the synchronization features, to the degree they are exposed in high-level language constructs. It is important to present a consistent, although probably slowly evolving, view to the user. This allows generation-to-generation source compatibility. The balance among computation, memory bandwidth, and interconnect support will also be fairly consistent. Application-level optimizations that work well on one generation's system should also work well on the next. If a system is successful in the marketplace, this steadiness of purpose will also work to support the development of standards in the MPP world.

The micro-architecture roadmap, on the other hand, must be much more agile. Micro architecture refers to the CPU and the lower-level logic shell, which maps the CPU into the system overall. When beginning an MPP design, one of the first decisions to be made is whether to design a new proprietary microprocessor or to select from a variety of emerging commodity devices. The primary argument for a proprietary design is that it gives an opportunity to specifically and efficiently tailor the micro-architectural design within the context of the overall macro-architectural goals.

But there is a problem. With commodity microprocessor development cycles lasting only two years, there is opportunity for rapid improvement, which also can mean rapid change. Being competitive with current state-of-the-art microprocessor designs, which by some counts are already, as a class, in the middle of the fourth or fifth generation, demands significant design and integrated circuit process development resources. The human resources must be applied in a widely parallel fashion to minimize time to market. There is always an unfolding next generation coming from large competing design groups.
For these reasons, even the largest of computer corporations have seen fit to build and support consortia to provide sufficient market presence to pay back the design effort. And of course, these devices benefit from high volume, which allows lower pricing. This counterargument, to embrace an emerging commodity micro-architectural implementation, was most persuasive for the Cray T3D project team.

The processor selected was from the Alpha microprocessor family from Digital Equipment Corporation. This device was selected primarily because it best met our objectives of providing high performance and high precision for both floating-point and integer data types. We present only a short list of some of the more important characteristics of this device, but would refer you to the other videos in the UVC series featuring lectures by Digital Equipment Corporation engineers. These tapes cover Alpha architecture and implementation in detail.

The DECchip 21064 represents a 0.75-micron CMOS implementation of a full 64-bit reduced instruction set computer. It has a peak floating-point performance of 150 million floating-point operations per second, utilizing a 150-megahertz clock frequency. Up to two instructions can be issued per clock period, or 300 MIPS. The internal register set consists of 32 integer and 32 floating-point registers. There are two internal caches: 8K bytes of instruction cache and 8K bytes of data cache.

All candidate commodity microprocessors, including the Alpha, were designed primarily to satisfy the requirements of workstations containing single or, at most, a handful of CPU devices. This design focus, while appropriate for the workstation market, results in certain weaknesses when such a device is inserted into a scalable, massively parallel processing system. This is particularly true when the system design has an objective of supporting very large numbers of processors sharing a very large global address space.
Fortunately, there are ways in which the microprocessor can be enhanced with appropriate supporting shell logic.

A globally addressed memory subsystem designed with the objective of scaling to more than 1,000 processors requires a very large physical address. Even a large workstation memory by today's standards is small compared to that of a fully configured Cray T3D system.

When data cannot be found in the local data cache, the cache miss invokes an external memory reference. Typical microprocessors can support only a very limited number of outstanding memory loads, often only one. But in a distributed memory MPP, any global reference, even nearest-neighbor, which stalls the processor will seriously affect computational efficiency. Logic must be added to support and mask many outstanding references, thereby hiding latency. This provides the capability to fully exploit modern compilers that are able to accurately judge when to prefetch streams of data.

Interprocessor synchronization must be fast and scalable. Typical RISC microprocessor designs anticipate only very moderate numbers of processors working together within a workstation. Techniques used are often bus-based, such as the popular bus-snooping protocol. Due to very real physical constraints, these schemes do not scale well beyond about a dozen processors. Software and I/O-based synchronization protocols used in workstation cluster organizations may scale to large numbers, but fall far short of our goal of providing single-microsecond or less global synchronization latency.
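The need for many outstanding references follows from simple arithmetic, an illustrative application of Little's law added for this transcript rather than a figure from the lecture. To keep a remote path busy, the number of references in flight must be about the latency divided by the issue interval:

    \text{references in flight} \approx \frac{\text{remote latency}}{\text{issue interval}} = \frac{500\ \text{ns}}{6.7\ \text{ns per word}} \approx 75

With only a single outstanding load, the same path would deliver barely one percent of that bandwidth, which is why the shell must support and mask many references at once.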
756 00:36:26,150 --> 00:36:29,040 These include the physically distributed memory, 757 00:36:29,040 --> 00:36:31,833 which probably represents both the largest challenge 758 00:36:31,833 --> 00:36:33,500 and the greatest performance opportunity 759 00:36:33,500 --> 00:36:36,662 for a programmer, the global address space, which 760 00:36:36,662 --> 00:36:38,120 provides the ability to communicate 761 00:36:38,120 --> 00:36:40,760 in a fine-grained manner, fast barrier 762 00:36:40,760 --> 00:36:43,400 synchronization, especially useful for data-parallel 763 00:36:43,400 --> 00:36:47,000 programming, fast memory locks and support for dynamic loop 764 00:36:47,000 --> 00:36:50,030 distribution, especially useful for work sharing, 765 00:36:50,030 --> 00:36:52,340 and hardware messaging support, especially 766 00:36:52,340 --> 00:36:55,550 useful for the message-passing programming model. 767 00:36:55,550 --> 00:36:58,040 Also visible to the programmer due to their effect 768 00:36:58,040 --> 00:37:01,550 on performance are the 3D torus interconnect network, 769 00:37:01,550 --> 00:37:04,520 where a programmer may wish to optimize communication patterns 770 00:37:04,520 --> 00:37:07,670 or data locality, the block transfer engine, 771 00:37:07,670 --> 00:37:10,490 which can help hide remote latency, 772 00:37:10,490 --> 00:37:12,595 and the local processor's cache memory, 773 00:37:12,595 --> 00:37:13,970 which a programmer may need to be 774 00:37:13,970 --> 00:37:16,470 aware of for best performance. 775 00:37:16,470 --> 00:37:19,250 Let's take a look at how the PE micro architecture implements 776 00:37:19,250 --> 00:37:22,227 the macro architecture in T3D. 777 00:37:22,227 --> 00:37:23,810 At the heart of the processing element 778 00:37:23,810 --> 00:37:25,760 is the Alpha Microprocessor. 779 00:37:25,760 --> 00:37:28,850 Around the Alpha, we've added a shell of external circuits 780 00:37:28,850 --> 00:37:31,640 to implement the MPP functions we need. 781 00:37:31,640 --> 00:37:33,440 These fall into the three categories 782 00:37:33,440 --> 00:37:36,980 Steve mentioned earlier, address extension, latency hiding, 783 00:37:36,980 --> 00:37:38,910 and synchronization. 784 00:37:38,910 --> 00:37:41,690 The Alpha has a 43-bit virtual address space that is 785 00:37:41,690 --> 00:37:43,850 translated in the on-chip DTB-- 786 00:37:43,850 --> 00:37:46,520 that's Data Translation Lookaside Buffer-- 787 00:37:46,520 --> 00:37:49,340 to a 34-bit physical byte address. 788 00:37:49,340 --> 00:37:52,670 T3D has up to 64 megabytes of local memory 789 00:37:52,670 --> 00:37:55,640 and up to 2K processors, which would require 790 00:37:55,640 --> 00:37:57,920 at least 37 bits of address. 791 00:37:57,920 --> 00:38:00,860 In reality, even that is not enough. 792 00:38:00,860 --> 00:38:02,870 Several additional address bits are also 793 00:38:02,870 --> 00:38:05,570 needed to control caching and to facilitate control 794 00:38:05,570 --> 00:38:08,780 of the memory-mapped mechanisms that implement the external MPP 795 00:38:08,780 --> 00:38:09,950 shell. 796 00:38:09,950 --> 00:38:13,670 So an external address extension mechanism of some sort 797 00:38:13,670 --> 00:38:15,110 is required. 798 00:38:15,110 --> 00:38:18,470 In T3D, this takes the form of a 32-entry register 799 00:38:18,470 --> 00:38:20,870 set called the "DTB annex." 
800 00:38:20,870 --> 00:38:23,540 The way memory is distributed in T3D, 801 00:38:23,540 --> 00:38:26,540 addresses have two parts, a PE number part 802 00:38:26,540 --> 00:38:29,000 that says essentially which bank of memory 803 00:38:29,000 --> 00:38:30,680 and an offset part that says which 804 00:38:30,680 --> 00:38:32,780 memory location within the bank. 805 00:38:32,780 --> 00:38:34,850 When the Alpha initiates a load or store, 806 00:38:34,850 --> 00:38:36,290 the offset portion of the address 807 00:38:36,290 --> 00:38:38,000 is taken from the physical address bits 808 00:38:38,000 --> 00:38:39,890 that are translated on-chip. 809 00:38:39,890 --> 00:38:43,850 And the PE number portion is taken from the DTB annex. 810 00:38:43,850 --> 00:38:45,560 The choice of which of the 32 annex 811 00:38:45,560 --> 00:38:47,480 registers to read for the PE number 812 00:38:47,480 --> 00:38:50,540 is made by a five-bit field above the offset. 813 00:38:50,540 --> 00:38:54,620 The DTB annex is readable and writable by user code. 814 00:38:54,620 --> 00:38:57,890 The value read from the annex is a virtual PE number, 815 00:38:57,890 --> 00:39:00,080 which is translated by a shell mechanism 816 00:39:00,080 --> 00:39:03,170 into a logical PE number and checked for validity prior 817 00:39:03,170 --> 00:39:05,450 to applying it to the routing tag lookup table 818 00:39:05,450 --> 00:39:07,640 to make a remote reference. 819 00:39:07,640 --> 00:39:11,090 The virtual-to-logical PE translation mechanism 820 00:39:11,090 --> 00:39:12,920 is used to create user partitions 821 00:39:12,920 --> 00:39:15,050 within the physical torus and provides 822 00:39:15,050 --> 00:39:17,270 a good deal of flexibility concerning size 823 00:39:17,270 --> 00:39:19,670 and shape of the partition. 824 00:39:19,670 --> 00:39:22,670 Shell circuitry always checks the virtual PE number 825 00:39:22,670 --> 00:39:25,670 and, if it matches the local PE, performs a local memory 826 00:39:25,670 --> 00:39:28,970 reference, instead of a remote reference. 827 00:39:28,970 --> 00:39:31,280 What about latency hiding? 828 00:39:31,280 --> 00:39:33,470 The Alpha, like most microprocessors, 829 00:39:33,470 --> 00:39:36,560 has a very limited external pipeline capability. 830 00:39:36,560 --> 00:39:38,990 When it performs a load, it places the address 831 00:39:38,990 --> 00:39:41,180 on its address bus and waits for the data 832 00:39:41,180 --> 00:39:44,630 to be delivered before it will move on to the next reference. 833 00:39:44,630 --> 00:39:46,610 This may be OK for a desktop system 834 00:39:46,610 --> 00:39:49,820 with a single processor, when the memory is only inches away. 835 00:39:49,820 --> 00:39:51,860 But it's unacceptable in an MPP when 836 00:39:51,860 --> 00:39:53,900 the data is at a remote node. 837 00:39:53,900 --> 00:39:55,520 We can't have the processor stalled 838 00:39:55,520 --> 00:39:57,230 waiting for remote references. 839 00:39:57,230 --> 00:40:01,250 So in T3D, we've extended the memory pipeline externally 840 00:40:01,250 --> 00:40:03,470 by adding a latency hiding mechanism called 841 00:40:03,470 --> 00:40:05,570 the "prefetch queue." 842 00:40:05,570 --> 00:40:08,510 The prefetch queue is a first-in, first-out memory, 843 00:40:08,510 --> 00:40:12,290 16 64-bit words deep, that is used 844 00:40:12,290 --> 00:40:15,380 as the temporary destination of remote loads. 
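Stepping back to the address extension for a moment: the annex mechanism can be pictured with a minimal C sketch. The names (dtb_annex, split_address) and the 26-bit offset width are illustrative assumptions, not the actual T3D bit layout.

    #include <stdint.h>

    /* Sketch of the DTB annex address extension: the low bits form the
       offset within a PE's memory, and the 5-bit field just above
       selects one of 32 annex registers, which supplies the PE number. */
    #define OFFSET_BITS 26                 /* enough for 64 MB of local memory */

    static uint32_t dtb_annex[32];         /* readable/writable by user code */

    static void split_address(uint64_t phys, uint32_t *virt_pe, uint64_t *offset)
    {
        *offset = phys & ((1ULL << OFFSET_BITS) - 1);
        uint32_t sel = (phys >> OFFSET_BITS) & 0x1F;   /* 5-bit annex index */
        *virt_pe = dtb_annex[sel];         /* then translated to a logical PE,
                                              validity-checked, and applied to
                                              the routing tag lookup table */
    }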
845 00:40:15,380 --> 00:40:19,740 Prefetches are initiated by executing a FETCH instruction, 846 00:40:19,740 --> 00:40:22,820 which is an Alpha opcode meant to be used as a prefetch hint 847 00:40:22,820 --> 00:40:25,820 to a board-level secondary cache. 848 00:40:25,820 --> 00:40:28,890 In T3D, a fetch causes a single-word load 849 00:40:28,890 --> 00:40:32,040 to be initiated from memory, possibly remote. 850 00:40:32,040 --> 00:40:34,260 But since the destination is not the processor, 851 00:40:34,260 --> 00:40:35,820 the fetch can be acknowledged as soon 852 00:40:35,820 --> 00:40:37,560 as we've latched the address. 853 00:40:37,560 --> 00:40:40,380 And the processor can continue with other work. 854 00:40:40,380 --> 00:40:42,420 When the load response returns, it's 855 00:40:42,420 --> 00:40:44,790 placed in a slot in the prefetch queue that 856 00:40:44,790 --> 00:40:47,640 was reserved for it at the time the fetch issued. 857 00:40:47,640 --> 00:40:50,610 Sometime later, the processor can come back and pop the data 858 00:40:50,610 --> 00:40:53,520 from the memory-mapped head of the prefetch queue, 859 00:40:53,520 --> 00:40:56,910 thereby hiding the latency of the load. 860 00:40:56,910 --> 00:40:59,160 If the processor tries to pop the data from the queue 861 00:40:59,160 --> 00:41:01,290 before it's returned, the processor 862 00:41:01,290 --> 00:41:03,480 will stall until the data arrives. 863 00:41:03,480 --> 00:41:05,820 Using the prefetch queue, up to 16 loads 864 00:41:05,820 --> 00:41:08,045 can be outstanding at any time. 865 00:41:08,045 --> 00:41:09,420 The prefetch queue is turning out 866 00:41:09,420 --> 00:41:14,130 to be a very powerful mechanism easily used by the compiler. 867 00:41:14,130 --> 00:41:17,620 Alpha stores to memory do not need to wait for a response. 868 00:41:17,620 --> 00:41:21,060 So a large number can naturally be outstanding at any time. 869 00:41:21,060 --> 00:41:22,980 This is a good communication mechanism 870 00:41:22,980 --> 00:41:25,890 if the producer knows who the next consumer will be, 871 00:41:25,890 --> 00:41:28,650 because the Alpha has write buffers on board that 872 00:41:28,650 --> 00:41:32,130 try to accumulate a cache line, four words, of data 873 00:41:32,130 --> 00:41:35,160 prior to performing the actual remote store. 874 00:41:35,160 --> 00:41:39,000 This increases the payload size and effective bandwidth. 875 00:41:39,000 --> 00:41:41,730 A counter in the shell circuitry is incremented each time 876 00:41:41,730 --> 00:41:44,610 a store is issued by Alpha and decremented 877 00:41:44,610 --> 00:41:46,680 each time a write completes. 878 00:41:46,680 --> 00:41:49,080 This counter can be read by the processor at any time 879 00:41:49,080 --> 00:41:50,610 to determine when all of the writes 880 00:41:50,610 --> 00:41:53,970 have completed for synchronization purposes. 881 00:41:53,970 --> 00:41:56,190 The third major latency-hiding mechanism 882 00:41:56,190 --> 00:41:58,140 is an asynchronous memory-to-memory data 883 00:41:58,140 --> 00:42:01,560 mover called the block transfer engine or BLT. 884 00:42:01,560 --> 00:42:04,380 There's one BLT per node-- no sense having two, 885 00:42:04,380 --> 00:42:07,590 since it can saturate the network interface by itself. 886 00:42:07,590 --> 00:42:10,140 Once started, it functions independently of the PE, 887 00:42:10,140 --> 00:42:12,120 transferring data between remote memory 888 00:42:12,120 --> 00:42:14,700 and the local memory in either direction. 
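Returning to the prefetch queue: the software pipelining it enables looks roughly like the sketch below, which keeps the 16-deep queue full while consuming results in order. fetch_remote and pop_prefetch_queue are hypothetical intrinsic names standing in for the FETCH instruction and a load from the memory-mapped queue head.

    /* Hypothetical intrinsics; the real mechanisms are the Alpha FETCH
       instruction and a read of the memory-mapped queue head. */
    extern void   fetch_remote(const double *remote_addr);
    extern double pop_prefetch_queue(void);

    #define QUEUE_DEPTH 16                 /* up to 16 loads outstanding */

    void gather_remote(double *dst, const double *src, int n)
    {
        int issued = 0, popped = 0;
        /* Prime the pipeline with as many fetches as the queue holds. */
        while (issued < n && issued < QUEUE_DEPTH)
            fetch_remote(&src[issued++]);
        /* Pop results in order, issuing a new fetch behind each pop. */
        while (popped < n) {
            dst[popped++] = pop_prefetch_queue();  /* stalls only if data not back */
            if (issued < n)
                fetch_remote(&src[issued++]);
        }
    }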
889 00:42:14,700 --> 00:42:17,730 The BLT can move up to 256K words of memory 890 00:42:17,730 --> 00:42:19,770 with a single invocation. 891 00:42:19,770 --> 00:42:22,230 The local data that is the source on a write block 892 00:42:22,230 --> 00:42:25,710 transfer or the destination on a read block transfer 893 00:42:25,710 --> 00:42:28,440 can be addressed with any constant stride. 894 00:42:28,440 --> 00:42:31,530 Remote addresses can be any constant stride 895 00:42:31,530 --> 00:42:34,020 or can be indirect, meaning the address is 896 00:42:34,020 --> 00:42:36,330 from a separate stream previously stored 897 00:42:36,330 --> 00:42:37,890 in local memory. 898 00:42:37,890 --> 00:42:39,810 Indirect operations are commonly called 899 00:42:39,810 --> 00:42:42,510 "scatter/gather operations" because they can reference data 900 00:42:42,510 --> 00:42:45,270 in completely arbitrary ways. 901 00:42:45,270 --> 00:42:47,520 The BLT also has a unique mechanism 902 00:42:47,520 --> 00:42:49,740 for interpreting the remote address. 903 00:42:49,740 --> 00:42:51,750 This mechanism, called the "Centrifuge," 904 00:42:51,750 --> 00:42:54,390 is used to support and control the distribution of data 905 00:42:54,390 --> 00:42:57,060 among the PEs that is so critical to attaining 906 00:42:57,060 --> 00:43:00,300 good performance on a distributed memory computer. 907 00:43:00,300 --> 00:43:03,390 We previously said that memory addresses have two parts, 908 00:43:03,390 --> 00:43:05,640 an offset and a PE number. 909 00:43:05,640 --> 00:43:07,650 A remote address generated by the BLT 910 00:43:07,650 --> 00:43:09,480 must somehow have its bits interpreted 911 00:43:09,480 --> 00:43:12,000 to determine which PE to send the reference to 912 00:43:12,000 --> 00:43:15,060 and what offset to use once the reference arrives. 913 00:43:15,060 --> 00:43:17,490 I've got a couple of examples to show you how this works 914 00:43:17,490 --> 00:43:19,740 and why it's so powerful. 915 00:43:19,740 --> 00:43:22,290 In a conventional computer with an interleaved memory, 916 00:43:22,290 --> 00:43:23,940 an address is usually interpreted 917 00:43:23,940 --> 00:43:26,640 as having the memory bank number in the low-order bits, 918 00:43:26,640 --> 00:43:29,040 with the offset in the high-order bits. 919 00:43:29,040 --> 00:43:31,860 If we increment this address repetitively by one, 920 00:43:31,860 --> 00:43:34,620 we see that the resulting address pattern visits the same 921 00:43:34,620 --> 00:43:36,330 offset on each bank once. 922 00:43:36,330 --> 00:43:38,640 Then the carry increments the offset field, 923 00:43:38,640 --> 00:43:41,280 and all the banks are visited again. 924 00:43:41,280 --> 00:43:42,840 In a distributed memory computer, 925 00:43:42,840 --> 00:43:44,640 sometimes the offset is interpreted 926 00:43:44,640 --> 00:43:47,160 as lying in the low-order bits of the address, 927 00:43:47,160 --> 00:43:50,790 while the bank or PE number is in the high-order bits. 928 00:43:50,790 --> 00:43:53,490 If we increment this address repetitively by one, 929 00:43:53,490 --> 00:43:55,980 we see that the resulting address pattern traverses 930 00:43:55,980 --> 00:43:57,930 all of the memory within the first bank 931 00:43:57,930 --> 00:44:01,350 before finally incrementing the bank address to the next bank, 932 00:44:01,350 --> 00:44:03,990 where all of its memory is traversed before incrementing 933 00:44:03,990 --> 00:44:06,240 the bank again, et cetera. 
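Before continuing with address interpretation, here is one way to picture a single BLT request as a C descriptor. The real programming interface is a set of memory-mapped registers, and every field name here is an assumption made purely for illustration.

    #include <stdint.h>

    /* Hypothetical block transfer engine request descriptor, reflecting
       the capabilities described above. */
    typedef struct {
        int       direction;       /* read (remote-to-local) or write */
        int       remote_mode;     /* constant stride, or indirect */
        uint64_t  local_base;      /* local source or destination address */
        int64_t   local_stride;    /* any constant stride, in words */
        uint64_t  remote_base;     /* centrifuged before use (see below) */
        int64_t   remote_stride;   /* used in constant-stride mode */
        uint64_t *index_stream;    /* local stream of remote addresses for
                                      indirect (scatter/gather) mode */
        uint32_t  word_count;      /* up to 256K (2^18) words per invocation */
    } blt_request;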
934 00:44:06,240 --> 00:44:09,180 In our MPP, we'd like to have it both ways. 935 00:44:09,180 --> 00:44:11,130 Indeed, we want even more. 936 00:44:11,130 --> 00:44:12,990 We'd like to be able to control exactly 937 00:44:12,990 --> 00:44:15,660 which bits of the address are interpreted as PE 938 00:44:15,660 --> 00:44:17,370 and which are offset. 939 00:44:17,370 --> 00:44:21,210 This would allow us to, for example, slide the PE number up 940 00:44:21,210 --> 00:44:25,690 by, say, four bits so that it had offset bits on either side. 941 00:44:25,690 --> 00:44:28,380 If we start incrementing this address by one, 942 00:44:28,380 --> 00:44:30,210 we see that the address pattern touches 943 00:44:30,210 --> 00:44:33,210 a block of 16 addresses on one PE 944 00:44:33,210 --> 00:44:38,050 before incrementing to the next PE to touch 16 more and so on. 945 00:44:38,050 --> 00:44:40,560 We call this a "block distribution." 946 00:44:40,560 --> 00:44:42,780 The Centrifuge in T3D is controlled 947 00:44:42,780 --> 00:44:46,920 by a mask that indicates which bits are PE number. 948 00:44:46,920 --> 00:44:49,410 Any place a one is set in the mask, 949 00:44:49,410 --> 00:44:52,110 the address is interpreted to be a PE number. 950 00:44:52,110 --> 00:44:54,690 Any place a zero is in the mask, the address 951 00:44:54,690 --> 00:44:56,880 is interpreted as an offset. 952 00:44:56,880 --> 00:45:00,210 The output of the Centrifuge is a separated PE number 953 00:45:00,210 --> 00:45:04,200 and offset that is used to make a memory reference. 954 00:45:04,200 --> 00:45:07,950 Note that bits in the mask do not even need to be contiguous. 955 00:45:07,950 --> 00:45:10,650 This permits precise control of the distribution 956 00:45:10,650 --> 00:45:14,580 of individual dimensions of multidimensional data arrays. 957 00:45:14,580 --> 00:45:17,940 All remote addresses generated by the block transfer engine 958 00:45:17,940 --> 00:45:20,460 are centrifuged prior to use. 959 00:45:20,460 --> 00:45:23,130 The BLT naturally has a larger start-up overhead 960 00:45:23,130 --> 00:45:25,890 than direct processor stores or the prefetch queue. 961 00:45:25,890 --> 00:45:29,870 So it's most useful when moving larger blocks of data. 962 00:45:29,870 --> 00:45:32,880 Now, let's talk about synchronization. 963 00:45:32,880 --> 00:45:34,610 The PE shell circuitry implements 964 00:45:34,610 --> 00:45:37,790 four major synchronization mechanisms, barriers, 965 00:45:37,790 --> 00:45:41,600 fetch and increment, atomic swap, and messaging. 966 00:45:41,600 --> 00:45:43,940 A barrier is a point in a program beyond which 967 00:45:43,940 --> 00:45:46,760 no processor should advance before all processors have 968 00:45:46,760 --> 00:45:47,810 arrived. 969 00:45:47,810 --> 00:45:49,760 Barriers are often used to delineate 970 00:45:49,760 --> 00:45:51,680 between parallel sections of code 971 00:45:51,680 --> 00:45:53,870 when data produced by the first section 972 00:45:53,870 --> 00:45:56,630 is to be consumed during the second section. 973 00:45:56,630 --> 00:45:58,550 It is a block form of synchronization 974 00:45:58,550 --> 00:46:01,250 that is latency-sensitive, since there is usually 975 00:46:01,250 --> 00:46:03,650 little to do while waiting for the knowledge 976 00:46:03,650 --> 00:46:06,470 that the last processor has arrived at a barrier 977 00:46:06,470 --> 00:46:08,330 to be propagated to all processors 978 00:46:08,330 --> 00:46:10,550 so that they can proceed. 
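Returning briefly to the Centrifuge: its semantics are simple to model in software. The routine below is a minimal sketch of the mask-driven separation just described; the hardware performs the same separation in parallel rather than bit by bit.

    #include <stdint.h>

    /* Software model of the Centrifuge: where the mask has a one, the
       address bit joins the PE number; where it has a zero, the bit
       joins the offset.  Mask bits need not be contiguous. */
    void centrifuge(uint64_t addr, uint64_t mask,
                    uint64_t *pe, uint64_t *offset)
    {
        uint64_t pe_out = 0, off_out = 0;
        int pe_pos = 0, off_pos = 0;
        for (int bit = 0; bit < 64; bit++) {
            uint64_t b = (addr >> bit) & 1;
            if ((mask >> bit) & 1)
                pe_out |= b << pe_pos++;   /* next PE-number bit */
            else
                off_out |= b << off_pos++; /* next offset bit */
        }
        *pe = pe_out;
        *offset = off_out;
    }

With a mask whose ones begin at bit 4 (for instance, 0x7FF0 for 2K PEs), incrementing the address walks through 16 offsets on one PE before moving to the next, which is exactly the block distribution of the earlier example.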
979 00:46:10,550 --> 00:46:12,860 T3D has specialized barrier hardware 980 00:46:12,860 --> 00:46:16,340 in the form of 16 parallel logical AND trees 981 00:46:16,340 --> 00:46:18,890 that permit multiple barriers to be pipelined 982 00:46:18,890 --> 00:46:21,260 and the resource to be partitioned. 983 00:46:21,260 --> 00:46:23,660 The barrier resource appears to a PE 984 00:46:23,660 --> 00:46:25,760 as a memory-mapped register. 985 00:46:25,760 --> 00:46:27,890 When a PE arrives at a barrier, it 986 00:46:27,890 --> 00:46:31,340 sets a designated bit in its barrier register to a one 987 00:46:31,340 --> 00:46:33,590 and waits for it to switch to a zero. 988 00:46:33,590 --> 00:46:36,530 When all PEs in a partition have reached the barrier 989 00:46:36,530 --> 00:46:40,580 and set the same bit to a one, the AND function is satisfied. 990 00:46:40,580 --> 00:46:42,770 And the barrier bits are cleared by hardware, 991 00:46:42,770 --> 00:46:47,000 signaling the processors that they may proceed. 992 00:46:47,000 --> 00:46:48,650 As an alternative to repetitively 993 00:46:48,650 --> 00:46:52,280 testing a set barrier bit to see if it switched to a zero, 994 00:46:52,280 --> 00:46:54,320 the processor can enable an interrupt 995 00:46:54,320 --> 00:46:58,730 and perform unrelated work until the barrier is satisfied. 996 00:46:58,730 --> 00:47:00,560 The barrier has a second mode called 997 00:47:00,560 --> 00:47:04,220 Eureka mode that supports search-type operations. 998 00:47:04,220 --> 00:47:07,790 A Eureka is simply a logical OR instead of an AND. 999 00:47:07,790 --> 00:47:10,850 A Eureka is satisfied by any one processor. 1000 00:47:10,850 --> 00:47:14,060 A Eureka may be used during, say, a database search, 1001 00:47:14,060 --> 00:47:16,340 where all processors would enable their interrupt 1002 00:47:16,340 --> 00:47:18,410 on Eureka and proceed with a search 1003 00:47:18,410 --> 00:47:21,200 until any processor succeeds and terminates the search 1004 00:47:21,200 --> 00:47:23,780 by satisfying the Eureka. 1005 00:47:23,780 --> 00:47:27,050 The barrier mechanism in T3D is extremely fast. 1006 00:47:27,050 --> 00:47:29,060 Even in the largest configuration, 1007 00:47:29,060 --> 00:47:33,120 it's less than 50 clock periods, about 330 nanoseconds, 1008 00:47:33,120 --> 00:47:36,050 which is about the latency of a local memory read. 1009 00:47:36,050 --> 00:47:37,970 A second synchronization mechanism 1010 00:47:37,970 --> 00:47:40,340 that supports work distribution well 1011 00:47:40,340 --> 00:47:42,650 is the fetch-and-increment registers. 1012 00:47:42,650 --> 00:47:45,830 This set of registers is globally accessible by the PEs. 1013 00:47:45,830 --> 00:47:47,930 There's one per PE. 1014 00:47:47,930 --> 00:47:50,900 A store to a fetch-and-increment register acts normally, 1015 00:47:50,900 --> 00:47:54,770 but a load from one causes the contents to auto-increment. 1016 00:47:54,770 --> 00:47:57,690 This is very useful for distributing array indices, 1017 00:47:57,690 --> 00:47:59,950 for example, while strip-mining a loop. 1018 00:47:59,950 --> 00:48:01,700 Each time the fetch-and-increment register 1019 00:48:01,700 --> 00:48:05,900 is read, it delivers a value one greater than the last. 1020 00:48:05,900 --> 00:48:08,360 To hide the latency of accesses to the fetch-and-increment 1021 00:48:08,360 --> 00:48:12,890 registers, they can be fetched to the prefetch queue. 
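As a sketch of how this supports work sharing, the loop below hands out iteration indices the way a fetch-and-increment register would. A C11 atomic counter stands in for the hardware register, which on the real machine performs the increment itself on every load with no software read-modify-write.

    #include <stdatomic.h>

    /* Dynamic loop distribution via a shared counter.  Reset next_index
       to zero before each distributed loop. */
    static atomic_long next_index;

    void worker(long n, void (*body)(long))
    {
        for (;;) {
            long i = atomic_fetch_add(&next_index, 1);  /* "load" auto-increments */
            if (i >= n)
                break;                                  /* no iterations left */
            body(i);                                    /* run this iteration */
        }
    }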
1022 00:48:12,890 --> 00:48:15,320 Another fine-grained synchronization mechanism, 1023 00:48:15,320 --> 00:48:18,860 atomic swap, permits construction of locks in memory 1024 00:48:18,860 --> 00:48:20,660 or the emulation of full/empty bits 1025 00:48:20,660 --> 00:48:23,000 on individual words of data. 1026 00:48:23,000 --> 00:48:25,400 The swap is simply an atomic read/write 1027 00:48:25,400 --> 00:48:27,770 of a global memory location. 1028 00:48:27,770 --> 00:48:29,810 The value that is stored in the memory location 1029 00:48:29,810 --> 00:48:32,120 comes from a local memory-mapped register 1030 00:48:32,120 --> 00:48:34,123 called the "swaperand." 1031 00:48:34,123 --> 00:48:36,290 The value that was originally in the memory location 1032 00:48:36,290 --> 00:48:38,480 is returned to the PE. 1033 00:48:38,480 --> 00:48:40,670 The return destination, like any load, 1034 00:48:40,670 --> 00:48:44,750 can be the prefetch queue in order to hide the latency. 1035 00:48:44,750 --> 00:48:46,910 Full/empty bits in memory are emulated 1036 00:48:46,910 --> 00:48:48,830 by designating a token value-- 1037 00:48:48,830 --> 00:48:52,550 say, an unused IEEE floating-point Not-a-Number 1038 00:48:52,550 --> 00:48:53,570 or NaN-- 1039 00:48:53,570 --> 00:48:55,250 to mean "empty." 1040 00:48:55,250 --> 00:48:57,440 When a processor wants to load and lock 1041 00:48:57,440 --> 00:48:59,870 a memory location in order to, for instance, 1042 00:48:59,870 --> 00:49:02,510 perform an atomic add to that location, 1043 00:49:02,510 --> 00:49:06,420 it simply swaps the empty token into the memory location. 1044 00:49:06,420 --> 00:49:08,810 If the return value is also an empty token, 1045 00:49:08,810 --> 00:49:11,810 then the location was already locked by another PE 1046 00:49:11,810 --> 00:49:14,060 and the swap must be retried. 1047 00:49:14,060 --> 00:49:17,570 If the value returned is a valid number, 1048 00:49:17,570 --> 00:49:20,420 then the PE has successfully locked the location 1049 00:49:20,420 --> 00:49:23,750 and may perform its atomic operation. 1050 00:49:23,750 --> 00:49:27,710 To unlock the location, the PE simply stores the valid data 1051 00:49:27,710 --> 00:49:31,190 back to memory, overwriting the empty token. 1052 00:49:31,190 --> 00:49:34,250 Message passing is directly supported by T3D hardware 1053 00:49:34,250 --> 00:49:35,900 in the form of low-level primitives 1054 00:49:35,900 --> 00:49:39,110 that permit cache-line-size messages to be sent directly 1055 00:49:39,110 --> 00:49:40,700 from user space. 1056 00:49:40,700 --> 00:49:42,500 These short messages are initiated 1057 00:49:42,500 --> 00:49:45,620 by performing stores through special memory-mapped space 1058 00:49:45,620 --> 00:49:47,510 to the target PE. 1059 00:49:47,510 --> 00:49:50,420 The message data is stored in a hardware-managed message 1060 00:49:50,420 --> 00:49:53,090 queue in the local memory of the target PE. 1061 00:49:53,090 --> 00:49:55,610 And that PE is interrupted. 1062 00:49:55,610 --> 00:49:57,830 Hardware automatically allocates space 1063 00:49:57,830 --> 00:49:59,690 in the sender's message queue in case 1064 00:49:59,690 --> 00:50:03,920 the destination queue is full and the message is rejected. 1065 00:50:03,920 --> 00:50:06,330 Rejected messages land in the sender's queue 1066 00:50:06,330 --> 00:50:10,040 and interrupt the sending processor for a retry. 
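Returning to the atomic swap: the locking discipline just described can be sketched in a few lines of C. Here atomic_exchange stands in for the hardware swap through the swaperand register, and an arbitrary quiet-NaN bit pattern serves as the designated empty token; both are assumptions for illustration.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Arbitrary quiet-NaN bit pattern chosen as the "empty" token. */
    #define EMPTY UINT64_C(0x7FF8DEADBEEF0000)

    uint64_t lock_word(_Atomic uint64_t *loc)
    {
        uint64_t v;
        do {
            v = atomic_exchange(loc, EMPTY);   /* swap the empty token in */
        } while (v == EMPTY);                  /* already locked: retry */
        return v;                              /* valid datum: lock acquired */
    }

    void unlock_word(_Atomic uint64_t *loc, uint64_t v)
    {
        atomic_store(loc, v);                  /* restore valid data, unlocking */
    }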
1067 00:50:10,040 --> 00:50:12,350 Messages carry 32 bytes of payload 1068 00:50:12,350 --> 00:50:14,330 and are stored in the target message queue, 1069 00:50:14,330 --> 00:50:16,250 with the routing tag and packet control 1070 00:50:16,250 --> 00:50:20,360 information bringing the total to 64 bytes per message. 1071 00:50:20,360 --> 00:50:22,700 The message queue area in local DRAM 1072 00:50:22,700 --> 00:50:28,740 is 256K bytes, permitting up to 4K messages to be received. 1073 00:50:28,740 --> 00:50:31,770 The messaging facility is also key to the network 1074 00:50:31,770 --> 00:50:33,630 error-handling mechanism. 1075 00:50:33,630 --> 00:50:36,330 All network packets have parity on each phit 1076 00:50:36,330 --> 00:50:37,920 of control information. 1077 00:50:37,920 --> 00:50:41,830 Data is protected by SECDED-- single-error correction, double-error detection. If a parity error or a packet 1078 00:50:41,830 --> 00:50:44,910 misroute is detected when a packet arrives at a node, 1079 00:50:44,910 --> 00:50:46,920 the entire packet, header and all, 1080 00:50:46,920 --> 00:50:50,770 is deposited in the apparent destination PE's message queue 1081 00:50:50,770 --> 00:50:53,130 and an error interrupt is issued. 1082 00:50:53,130 --> 00:50:54,990 That pretty much wraps up our overview 1083 00:50:54,990 --> 00:50:57,210 of the T3D micro architecture. 1084 00:50:57,210 --> 00:50:58,830 Let's step back a minute to discuss 1085 00:50:58,830 --> 00:51:01,440 the design of T3D's I/O. 1086 00:51:01,440 --> 00:51:03,960 System I/O is performed via multiple Cray 1087 00:51:03,960 --> 00:51:07,920 high-speed channels that connect T3D to a host Y-MP 1088 00:51:07,920 --> 00:51:10,170 or to standard Cray I/O subsystems 1089 00:51:10,170 --> 00:51:12,690 that contain interfaces to a wide array 1090 00:51:12,690 --> 00:51:15,420 of peripheral devices and networks. 1091 00:51:15,420 --> 00:51:18,120 A single Cray high-speed channel consists 1092 00:51:18,120 --> 00:51:21,990 of two unidirectional channels, one input and one output, 1093 00:51:21,990 --> 00:51:25,020 each capable of 200 megabytes per second, coupled 1094 00:51:25,020 --> 00:51:26,910 with a low-speed control channel that 1095 00:51:26,910 --> 00:51:30,270 allows the passing of supervisory information. 1096 00:51:30,270 --> 00:51:33,300 Each high-speed channel pair connects to a specialized node 1097 00:51:33,300 --> 00:51:36,090 within T3D called a "gateway." 1098 00:51:36,090 --> 00:51:40,050 There are eight gateways for every 512 processors in a T3D 1099 00:51:40,050 --> 00:51:41,220 system. 1100 00:51:41,220 --> 00:51:44,430 Aggregate I/O bandwidth for 512 processors 1101 00:51:44,430 --> 00:51:47,370 is 3.2 gigabytes per second. 1102 00:51:47,370 --> 00:51:49,950 Each gateway itself is actually a pair 1103 00:51:49,950 --> 00:51:52,410 of nodes, each handling one direction 1104 00:51:52,410 --> 00:51:54,000 of the high-speed channel. 1105 00:51:54,000 --> 00:51:57,090 The two nodes within a gateway share the low-speed control 1106 00:51:57,090 --> 00:51:57,840 channel. 1107 00:51:57,840 --> 00:52:01,530 Each of the two gateway nodes has its own CPU for control, 1108 00:52:01,530 --> 00:52:04,680 separate local memory, and a dedicated block transfer 1109 00:52:04,680 --> 00:52:06,540 engine that can move data between memory 1110 00:52:06,540 --> 00:52:08,460 and the high-speed channel. 
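Looking back at the message format for a moment, one queue entry might be laid out as below. The 32-byte payload and 64-byte total come from the description above; how the remaining 32 bytes divide between routing tag and packet control is an assumption for illustration. The 4K-message capacity then follows directly: 256K bytes / 64 bytes = 4K entries.

    #include <stdint.h>

    /* Illustrative layout of one hardware message-queue entry. */
    typedef struct {
        uint64_t payload[4];    /* 32 bytes of user payload (one cache line) */
        uint64_t routing_tag;   /* assumed width */
        uint64_t control[3];    /* packet control information */
    } message_entry;            /* 64 bytes total */

    _Static_assert(sizeof(message_entry) == 64, "entry must be 64 bytes");
    enum { QUEUE_CAPACITY = (256 * 1024) / sizeof(message_entry) };  /* 4096 */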
1111 00:52:08,460 --> 00:52:10,810 A gateway node network switch is connected 1112 00:52:10,810 --> 00:52:13,140 into the network in the x- and z-dimensions 1113 00:52:13,140 --> 00:52:15,780 only, that being all that's necessary to allow 1114 00:52:15,780 --> 00:52:17,520 complete communication. 1115 00:52:17,520 --> 00:52:20,580 Gateways are placed on the diagonal of the 3D torus 1116 00:52:20,580 --> 00:52:23,520 network in order to distribute the I/O bandwidth 1117 00:52:23,520 --> 00:52:27,000 and minimize I/O traffic contention with compute traffic 1118 00:52:27,000 --> 00:52:29,730 in the network. 1119 00:52:29,730 --> 00:52:32,130 This Cray T3D processor element module 1120 00:52:32,130 --> 00:52:35,910 holds eight microprocessors, four on this side 1121 00:52:35,910 --> 00:52:38,040 and four on the back side. 1122 00:52:38,040 --> 00:52:40,950 Each pair of microprocessors within the compute node 1123 00:52:40,950 --> 00:52:45,000 is supported by 21 proprietary integrated circuits. 1124 00:52:45,000 --> 00:52:47,640 These circuits are implemented in high-performance bipolar 1125 00:52:47,640 --> 00:52:49,830 emitter-coupled logic. 1126 00:52:49,830 --> 00:52:53,640 Small daughter boards support wide-word dynamic random access 1127 00:52:53,640 --> 00:52:55,410 memory components. 1128 00:52:55,410 --> 00:52:58,050 These daughter boards can be easily replaced locally 1129 00:52:58,050 --> 00:53:00,990 by field service personnel. 1130 00:53:00,990 --> 00:53:02,850 Each long edge of the motherboard 1131 00:53:02,850 --> 00:53:05,880 is configured with a dense array of zero-insertion-force 1132 00:53:05,880 --> 00:53:07,410 connectors. 1133 00:53:07,410 --> 00:53:09,540 These connectors provide the high degree 1134 00:53:09,540 --> 00:53:12,780 of interconnectivity required for the interprocessor 1135 00:53:12,780 --> 00:53:15,210 channels. 1136 00:53:15,210 --> 00:53:18,240 The module is liquid-cooled by attaching the circuit 1137 00:53:18,240 --> 00:53:21,330 boards and integrated circuits to an anodized aluminum 1138 00:53:21,330 --> 00:53:23,190 cold plate. 1139 00:53:23,190 --> 00:53:26,190 An inert cooling fluid flows into, through, 1140 00:53:26,190 --> 00:53:28,740 and then out of the module via hoses 1141 00:53:28,740 --> 00:53:33,250 connected to a cooling manifold in the chassis. 1142 00:53:33,250 --> 00:53:36,550 A single Cray T3D chassis can support up 1143 00:53:36,550 --> 00:53:41,500 to 512 compute processors, eight I/O gateways, 1144 00:53:41,500 --> 00:53:45,490 and additional redundant processors for fault tolerance. 1145 00:53:45,490 --> 00:53:46,990 The zero-insertion-force connector 1146 00:53:46,990 --> 00:53:50,680 is activated and deactivated by inserting a cam 1147 00:53:50,680 --> 00:53:53,770 down each edge of the module after it has been 1148 00:53:53,770 --> 00:53:56,770 inserted in the module slot. 1149 00:53:56,770 --> 00:54:01,330 The cam works to move small sliding contacts one by one 1150 00:54:01,330 --> 00:54:03,070 out of the chassis-mounted connector 1151 00:54:03,070 --> 00:54:07,990 half of the mated pair into the half mounted on the module. 1152 00:54:07,990 --> 00:54:10,750 Bundles of high-performance Teflon wires attached 1153 00:54:10,750 --> 00:54:12,940 to the chassis connectors provide 1154 00:54:12,940 --> 00:54:16,600 the toroidal communication paths. 
1155 00:54:16,600 --> 00:54:18,190 The arrangement of the wire bundles 1156 00:54:18,190 --> 00:54:21,940 can be changed to support the future addition of modules 1157 00:54:21,940 --> 00:54:23,830 into chassis which were initially 1158 00:54:23,830 --> 00:54:26,680 configured to smaller sizes. 1159 00:54:26,680 --> 00:54:31,360 Switching power supplies, each supplying up to 1,800 amperes, 1160 00:54:31,360 --> 00:54:33,670 provide module power. 1161 00:54:33,670 --> 00:54:37,480 Standard configurations provide for one extra power supply 1162 00:54:37,480 --> 00:54:40,690 for each required voltage, allowing additional fault 1163 00:54:40,690 --> 00:54:43,570 coverage for the system. 1164 00:54:43,570 --> 00:54:45,730 Steve and I have tried to convey our excitement 1165 00:54:45,730 --> 00:54:47,200 about this design. 1166 00:54:47,200 --> 00:54:50,620 We believe the Cray T3D to be a powerful and flexible 1167 00:54:50,620 --> 00:54:52,570 supercomputing tool. 1168 00:54:52,570 --> 00:54:55,190 I hope you've enjoyed our presentation. 1169 00:54:55,190 --> 00:54:56,620 Thank you. 1170 00:54:56,620 --> 00:54:59,970 [MUSIC PLAYING] 1171 00:54:59,970 --> 00:55:57,000