Today I want to talk with you about the DASH project and our research at Stanford on building scalable shared-memory machines. We'll start by talking about some of the general challenges, as well as some of the advantages, of a scalable shared-memory machine, and then in the second half of the talk move into discussing the DASH architecture and the concepts behind it.

Before starting, let's discuss briefly the goals in designing a parallel machine. The first goal is probably the obvious one: we want high performance out of a parallel machine, faster than the available uniprocessors. A second goal, which we think is quite important, is that the cost-performance of a multiprocessor be competitive with the cost-performance of workstations that use the available microprocessor technology. Obviously, scalability has been something that people in parallel processing have talked about extensively, and one would like to have scalable performance in a multiprocessor, as well as scalable cost-performance. Finally, a goal that we've considered very important in our research is to design machines which are general purpose, which have wide applicability to a range of applications.

We think each of these points is important in trying to make parallel processors pervasive, in getting to the point where we all, as a computing community, choose to use parallel processors because they're a better and faster way to solve our problems.

The term scalability has been used to mean a whole variety of things, and I think it's helpful to mention what some of those definitions are and to say what we mean by scalability. One definition of scalability is what I would call an impractical definition. It states that scalability is either speed-up on a constant-size problem across a range of machines, or even speed-up on scaled problems, which for practical machines, machines that can actually be implemented, is unattainable.
A second definition of scalability would be realistic in the sense that we could asymptotically approach that performance, but it might disregard cost and might disregard constant factors which, in practice, would make the machine perhaps scalable but uninteresting from a cost viewpoint. What we mean by scalability, and the definition we've used in our project, is a very pragmatic one. Our interest is in good speed-up and good scalable cost across a range of machine sizes. In particular, what I'm going to talk about today are machines in the range we find interesting: tens to low thousands of processors.

What's required to build a scalable multiprocessor? Well, obviously, we must scale bandwidth and also do something about the scaling of latency. We'll return to the topic of latency shortly; let's just focus on bandwidth for now. When we say we must scale bandwidth, that means we have to be able to scale the local memory bandwidth, the bandwidth that each processor demands, linearly, and we also have to have a way to scale the communication bandwidth between processors as we scale the size of the machine.

An important observation is that scaling the memory bandwidth requires the use of physically distributed memories. It's inherent: there's no cost-efficient way to scale up a machine and scale up the memory bandwidth without physically distributing the memory. I also want to point out that if you want to achieve scalable cost-performance, you've got to take advantage of the best price-performance available in processors. That means using state-of-the-art microprocessors, and I think we can see this trend evidenced in various recently announced machines, the CM-5, for example, which chose to use an off-the-shelf microprocessor rather than build a custom processor for the machine.

Once one has decided to use a distributed memory, a physically distributed memory, there are two interesting architectural alternatives.
One is a message-passing machine, where the memories are private and communication between processors is done by explicitly passing messages. A second alternative, and the one we're going to discuss in some detail in this talk, is a machine that has physically separated memories but a single address space across all those memories.

It's important to observe that either approach requires you to deal with locality of access. Because the memories are physically distributed, programmers have to pay attention to the locality of access. Both approaches also require a scalable interconnection technology, which allows us to scale up the bandwidth between the processors.

Given that, why do we find a single-address-space approach attractive? Well, there are both functional and performance advantages. Among the functional advantages, one of the most important is that a single address space is easier to program in and easier to build compilers for. It's easier to program because it's a better-understood programming model, and it's easier to build compilers for because the compiler can treat the single address space and the physically distributed memory as an optimization problem. There's also, as we'll see in this talk, the opportunity to create an incremental program development methodology, whereby a programmer can relatively easily port a program to this machine and can regard the process of enhancing performance as an incremental process, in which the issues of locality of access are addressed as an optimization problem, a performance enhancement problem, as opposed to a correctness problem that is fundamental to getting the program working. With a little bit of hardware help, a single-address-space machine can also efficiently simulate other models.

There are also performance advantages associated with a single-address-space approach.
One of the most important of these is that small data objects can be communicated more efficiently between processors, because that communication is integrated into the memory model. A second advantage is that it's easier to exploit caching, and we'll see that this is the backbone of much of the work we've done at Stanford.

As one scales up the machine size, the latency to access remote memories increases. This increase is unavoidable. For example, if we look at how that latency scales in machines today, we see typical numbers like these: a local cache access might take one processor cycle; accessing local memory might take anywhere from 10 to 40 cycles; accessing a remote memory across an interconnection network might take anywhere from 50 to 200 cycles; and a remote access that actually requires trapping to the operating system or some other software library might take as many as 1,000, or in fact up to 10,000, cycles. This scaling of remote access time is unavoidable if we want to allow the machine to scale up to large numbers of processors.

How do we deal with this long latency? There are three fundamental ways. The first is that we can lower the frequency of long-latency events. By using better parallel algorithms, or better implementations of parallel algorithms, the programmer can lower the frequency of long-latency events and thereby improve the performance of a program. A second approach is to try to reduce the latency through hardware techniques, for example by making access to remote objects cheaper. A third approach is to tolerate latency, that is, to find some way to overlap computation and communication so that the impact of latency on performance is minimized.

Let's talk first about reducing latency.
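To get a feel for why these approaches matter, here is a minimal back-of-the-envelope sketch in C of the average memory access time a processor would see. The latencies come from the ranges just quoted; the miss rate and the fraction of misses that go remote are purely hypothetical numbers chosen for illustration.

```c
#include <stdio.h>

/* Rough average-memory-access-time model.  Latencies are taken from the
 * ranges quoted above (cache hit ~1 cycle, local memory ~10-40 cycles,
 * remote memory ~50-200 cycles); the miss rate and remote fraction are
 * hypothetical, not measurements from any real program. */
int main(void)
{
    double cache_hit  = 1.0;    /* cycles for a local cache hit            */
    double local_mem  = 30.0;   /* cycles for a miss served by local memory */
    double remote_mem = 150.0;  /* cycles for a miss served across the net  */

    double miss_rate       = 0.05; /* fraction of references that miss      */
    double remote_fraction = 0.40; /* fraction of misses that go remote     */

    double amat = cache_hit
                + miss_rate * ((1.0 - remote_fraction) * local_mem
                               + remote_fraction * remote_mem);

    printf("average access time = %.1f cycles\n", amat);  /* ~4.9 here */
    return 0;
}
```

Even with a modest miss rate, the remote term dominates, which is why lowering the frequency of remote accesses, reducing their latency, and tolerating the latency that remains are all worth pursuing.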
One of the most important methods for reducing latency is to eliminate long-latency accesses through caching. Caching is a time-honored idea that has been used in architecture for a long time and has been extremely useful in reducing latency. When we introduce caching, we can introduce it either as a hardware technique or as a software technique. And, of course, the static versus dynamic behavior of the application affects the suitability of a software technique, which might be more appropriate in a static environment, versus a hardware technique, which might be more appropriate in a dynamic environment.

We can also reduce the communication latency by the use of low-latency, high-bandwidth protocols, ones that both avoid contention and reduce latency. This is an area where there has been dramatic progress, and where it appears there will continue to be important progress, in interconnection networks.

As we said earlier, we can also tolerate latency. One obvious way to tolerate latency is in a message-passing machine, where the programmer has the task of overlapping computation and communication. Another way to tolerate latency is by changing the memory model. The memory model specifies when a processor can continue after doing a load or a store, for example, and when it can assume that access has completed. By relaxing the memory model, we allow the processor to continue earlier, allowing it to ignore the fact that a store, for example, may take quite a long time to propagate across the entire machine.

Two other ways to tolerate latency are prefetching and multiple contexts. In prefetching, the programmer or the compiler inserts instructions into the program which have the effect of bringing data from a remote location to a nearby location, thereby reducing its access time.
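As a hedged illustration of the prefetching idea, the C sketch below issues a prefetch for data a few iterations ahead of where the loop is currently working, so that the (possibly remote) memory access overlaps with computation on the current element. The `__builtin_prefetch` intrinsic and the look-ahead distance used here are just for illustration; DASH exposes prefetching through its own mechanism, described later in the talk.

```c
/* Latency tolerance via software prefetching: a minimal sketch.
 * While computing on element i, we request element i+AHEAD so the fetch
 * overlaps useful work.  __builtin_prefetch is the GCC/Clang intrinsic;
 * AHEAD is a tuning parameter chosen arbitrarily here. */
#define AHEAD 8

double sum_array(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + AHEAD < n)
            __builtin_prefetch(&a[i + AHEAD], /*rw=*/0, /*locality=*/1);
        sum += a[i];   /* works on data that was prefetched earlier */
    }
    return sum;
}
```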
With multiple contexts, we design the machine so that when a long-latency event occurs, the processor context-switches to another thread of execution, thereby overlapping that long-latency event with the other thread's execution.

It's important to observe that all latency-tolerating techniques require extra parallelism to be available in the program, because they overlap the long-latency events with that extra parallelism. They also generate extra bandwidth demands on the machine: by doing prefetching or using multiple contexts, we increase the bandwidth demands that the processor makes on behalf of that particular program.

Let's return to the topic of caching and cache coherency. There are several alternatives we could consider in a single address space with respect to the use of caching, and with respect to whether or not the caches should be kept coherent with hardware assistance. One approach would be to say: let's not cache shared data. As we'll see, that makes accesses to remote data very expensive. A second, related alternative is to cache shared data, but under software control. In that case, we're limited by the ability of the software to do the job perfectly. By that we mean that if the software cannot be certain that it can cache a local copy and still retain a consistent programming model and a consistent memory model for the programmer, then the software must be conservative: it must not make a cached copy, and must use the remote copy instead. That becomes the limiting factor in trying to cache in software.

What would an architecture with private caches that did not attempt to cache shared data look like? Well, that architecture might well have caches, but the caches would contain only private data; therefore accesses to remote data would have to go through the caches, over the interconnection network, and to the remote memory.
Some examples of these kinds of machines are the IBM RP3 and the BBN Butterfly. These machines feature, as we see here, a distributed memory, and they even use caches. There's also a single address space across these distributed memories, and they make use of a scalable interconnection network.

This slide shows why this approach runs into difficulties. The frequency of shared accesses, that is, the percentage of the memory accesses made by the processor that are to shared data, is quite large, often in the range of 35% to 40% of the accesses made by the program. That means, for example, in a machine whose caches hold only private data, that 35% to 40% of the memory accesses are going to take on the order of 100 cycles. Obviously, the penalty for this, and the effect on speed-up, would be quite dramatic.

Now, there's actually good news in these sorts of measurements. One element of good news is that the shared read frequency tends to be much higher than the shared write frequency. That's good news because, from the point of view of a cache designer, it means that caching will probably work quite well.

So, a third alternative for designing a machine with a single address space is to build cache coherency into the hardware. The key question is: can that coherency mechanism be made to scale?

This chart shows how conventional cache coherency has traditionally been implemented, using what's typically called a snoopy bus. In this scheme the processors are connected to a single bus. When a cache miss occurs, that miss is placed on the bus, and the caches are snooped to determine whether or not they contain the data required by the processor. If one of those caches contains the data, it places the data on the bus, and the processor that generated the initial cache miss finds the data present there.
When a processor writes a shared data item, the address of that data item is placed on the bus, and an invalidation is generated on the bus, so that the copies of the data resident in the other caches are eliminated by that invalidation. A good way to think about this is that a snoopy bus protocol is a solution to the multiple-readers, single-writer problem. We allow multiple readers to copy the data item into their caches; when we want to write the data item, we invalidate all those cached copies so that there's only a single copy, which can then be written and updated.

Now, if we ask what the problem is with a snoopy-based protocol, the major problem is that when a cache miss occurs, we must broadcast that address on the bus and require all the caches to snoop to see whether or not they hold that particular address. An important observation here is that this is an inherent feature of the protocol. It is not a property that can be eliminated just by making the bus faster, or even by putting in two buses, because the limitation will be that the cache bandwidth used for snooping quickly saturates. Therefore we can't extend the machine purely by increasing the amount of bus bandwidth.

Let me talk about an alternative to this snoopy-based approach that enables us to have a cache coherency scheme that will scale. This alternative is a cache coherency scheme called directories. In a directory-based scheme we have an object called a directory, which keeps track of all the shared copies of a data item. It keeps track of them with the use of presence bits, which denote which caches have a shared copy of a particular block or line. Each block in memory has a corresponding entry in the directory, which keeps track of that particular block of memory and which caches have a copy.
There's also a dirty bit to indicate when a single copy is the only copy of a particular block anywhere in the machine. So one of two situations exists. Either one or more presence bits are on, indicating, if there is more than one, that the block is shared by multiple readers and exists in multiple caches, or the dirty bit is on, and there is a single copy, with one presence bit indicating the location of that copy.

Now, how do these directories maintain coherency? Well, the next two slides will tell us about that. When a location is read, the reader, the processor doing the reading, causes the directory to be updated to indicate that that processor has a copy of that particular cache line. That basically tracks the readers. Then, when a location is written, the directory does two things. First, it grants ownership to the writer; that is, it determines which of all the processors that want to write gets priority in writing this data item. It grants ownership, and then, using the presence bits contained in the directory, it causes the other copies to be invalidated and eliminated from the caches.

The advantages of this approach are that it has no unnecessary broadcasts, and it imposes no restrictions on the topology of the interconnection network or on the ordering of messages sent around it.

Directory schemes actually predate snoopy-based schemes, but they were passed over. They were passed over because the snoopy schemes are simpler and because, for most multiprocessors built, scaling hasn't been an issue; people have been interested in smaller-scale multiprocessors. With the desire to explore larger-scale machines with cache coherency, interest in directories has been revived. In order to scale these directories, two important things have to be done.
One is that we have to distribute the directories with the memories; that is, when we have multiple memories, we have to have multiple corresponding directories. If we didn't, then the single directory in the machine would become the bottleneck, and it would be no different from having a single memory in the machine. We also have to use a scalable directory scheme, one that allows the total size of the machine, in processor count and memory, to scale up without making the directory overly costly. I'm not going to talk about scalable directory schemes in detail in this talk, although several papers have been written on the topic.

These distributed directory protocols have now been adopted by a wide variety of machines aimed at building scalable cache coherency. The Stanford machine that we're talking about is one of these, but the Scalable Coherent Interface is another example, proposed as an IEEE standard for interconnecting multiple processors. The MIT Alewife machine uses a scalable directory scheme, as does the Kendall Square KSR1.

Now that we've discussed some of the background issues in building scalable single-address-space machines, I want to talk about the DASH architecture and design in particular. DASH is built around the concept of a distributed directory. It uses individual processors connected to their caches, together with a memory and a directory, which form a cluster. The clusters are interconnected with a communication network based on the Caltech mesh, using a high-bandwidth version of that mesh. A single address space is implemented across the entire machine, even though the physical memories are distributed: the high-order bits of every address specify the cluster in which that particular address or word lives. The individual clusters are, in fact, bus-based multiprocessors, Silicon Graphics 4D/340s.
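To make the directory bookkeeping concrete, here is a small, hedged C sketch of the per-block state a directory scheme keeps (presence bits plus a dirty bit), the updates just described for a read and a write, and a home-cluster lookup that, as in DASH, takes the cluster number from the high-order address bits. The widths, bit positions, and names here are illustrative only, not the actual DASH hardware.

```c
#include <stdint.h>

#define NUM_CLUSTERS  16      /* illustrative machine size                   */
#define CLUSTER_SHIFT 28      /* hypothetical: high-order bits pick a cluster */

/* One directory entry per memory block: presence bits plus a dirty bit. */
typedef struct {
    uint32_t presence;        /* bit c set => cluster c holds a copy         */
    int      dirty;           /* set => exactly one, owned, copy exists      */
} dir_entry_t;

/* The home cluster of an address is encoded in its high-order bits. */
static int home_cluster(uint32_t addr)
{
    return (int)(addr >> CLUSTER_SHIFT) & (NUM_CLUSTERS - 1);
}

static void send_invalidation(int cluster)
{
    (void)cluster;            /* network message elided in this sketch       */
}

/* A read by 'cluster' is recorded by setting its presence bit.  If the
 * block was dirty elsewhere, the data movement and write-back that clean
 * it are part of the protocol walk-through later in the talk; only the
 * bookkeeping is shown here. */
static void directory_read(dir_entry_t *e, int cluster)
{
    e->dirty = 0;
    e->presence |= 1u << cluster;
}

/* A write by 'cluster': invalidate every other copy named by the presence
 * bits, then leave behind a single, owned copy. */
static void directory_write(dir_entry_t *e, int cluster)
{
    uint32_t others = e->presence & ~(1u << cluster);
    for (int c = 0; c < NUM_CLUSTERS; c++)
        if (others & (1u << c))
            send_invalidation(c);
    e->presence = 1u << cluster;
    e->dirty = 1;
}
```

Note that the directory only ever touches the clusters named by the presence bits, which is exactly why there are no unnecessary broadcasts and why distributing one such directory with each memory lets the scheme scale.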
The coherency protocol implemented as a snoopy protocol on that bus is used as part of the overall architecture.

Let me give you an idea of what the DASH memory hierarchy looks like, and define some terminology, so we can talk about how the coherency protocol works. At the top level, we have the processor and its two-level cache system; that's part of the Silicon Graphics 4D. The next level of the hierarchy is provided by the other caches in the same cluster, which act, in effect, as a further level of caching via the snooping protocol. If the data for a particular access is not found either in the processor's own caches or in the caches of the other processors in that cluster, we go to the home cluster of that particular address. The home cluster, remember, is given by the upper-order bits of the address.

There are two possible situations here. One is that the home cluster is the same as the cluster generating the request; that is, the address refers to a word in the local memory of that cluster, in which case, of course, we don't need to go across the interconnection network. The other case is that the address belongs to a remote cluster, and we have to send a request across the interconnection network to the home cluster where the address lives.

In fact, there's even another possible level in the hierarchy here. Consider the situation where the only copy of a particular word that a processor has requested lives in yet another cluster: neither the local cluster, which requested the data, nor the home cluster, but a third, remote cluster. In that case, we may have to cross the interconnection network again to go and get that data.

Let's look at how this protocol operates when we try to read a cache line which is held dirty in a remote cluster. The access will begin with the local cluster generating a read request.
That read request will miss in the local caches and cause a message to be sent across the interconnection network, requesting the data from the home cluster. When that message arrives at the home cluster, we do a directory lookup to determine where that particular word currently resides in the machine. We find out that the word is dirty and resides in a remote cluster. That causes the home cluster to forward the request to the dirty remote cluster: a message is created, and the address of the data object is forwarded. When it arrives at the dirty remote cluster, that cluster does a lookup, using the snoopy bus protocol, to access the caches in that cluster. It finds the data word and transmits it directly back to the original requesting local cluster.

Forwarding the reply directly back to the requester eliminates the need to keep state in the home cluster to track that message, and it also reduces the latency. At the same time, the dirty cluster generates a write-back so that the home cluster's memory is updated. This means that future requests don't need to be forwarded to the dirty cluster; they can be satisfied directly at the home cluster, with the data returned to the requesting cluster.

Now, let's look at the case of a write and see how cache coherency is maintained for a write to a shared data item. When the local cluster sees a write to a shared data item to which it does not currently have exclusive access, it generates a read-exclusive request, which is sent to the home directory. The home directory acts as the point of arbitration: if two clusters try to write to the same word or the same cache line at once, those requests will both be forwarded to the home directory, and they'll be serialized there.
When the home directory gets the request, it replies directly to the local requesting cluster, granting it exclusive access. The cluster can then proceed to do the write. In parallel with granting access, the home cluster also causes invalidations to be sent out to all the clusters that are sharing the data, using the presence bits in the directory to determine which clusters have shared copies. Those sharing clusters will then perform the invalidations in their individual caches and send acknowledgments back to the local cluster when the invalidations have completed. These acknowledgments allow us to use a more relaxed memory model, and they also increase the resiliency of the machine in the face of failure.

In addition to the basic mechanism of caches and cache coherency, there are several other features in DASH that are used to reduce latency. I've mentioned the notion of a relaxed memory model. What actually happens in DASH is that when a write occurs, the cluster gets the read-exclusive acknowledgment and immediately continues. That allows it to overlap its execution with the actual process of invalidating the remote clusters. The writes are then pipelined, basically, and we obtain improved performance from that. By going to a relaxed memory model, we don't need to halt the processor until we reach a synchronization point. This also allows the compiler more freedom in reordering shared accesses between two synchronization points.

The DASH architecture also supports prefetching. It does this by including the ability to generate a request to an address, which causes a local copy of that cache line to be brought into the cluster. Then, rather than seeing the long latency of a full remote cluster reference, the data is obtained locally.
One of the important innovations in the DASH prefetching mechanism is that the prefetching is what we sometimes call non-binding; that is, even though a data item is brought locally, its coherency is maintained, so that if it's prefetched early and somebody updates the data item, the semantics remain correct and the prefetch acts purely as an optimization. This is quite critical in allowing the compiler or the programmer freedom in placing prefetches. It means you can place a prefetch anywhere you want, and you will not change the correct behavior of the program.

DASH also includes features to reduce the latency of synchronization operations and to reduce the bandwidth demands of those operations. Two examples are queueing locks, which give us low-latency access to highly contended locks, and a fetch-and-increment operation, which gives us the ability to reduce contention on things like queue indices and loop indices.

Let me actually show you a diagram of what the DASH hardware looks like. In this figure, we can see the clusters organized with the multiple processors and the directory in each cluster, as well as I/O and a portion of the memory. The interconnection network is actually a pair of meshes: one is used for requests, and the other is used for replies.

What actually happens in the machine is that when a request is generated, the directory, recognizing that the request is for a remote address, creates a message, which it transmits over the request network. That message is routed through the network to the appropriate cluster, where the directory at that cluster sees the message as an incoming request. The directory in the target cluster then takes the message off the network, does whatever accesses are required on the bus, and creates a new message containing the data.
637 00:33:21,490 --> 00:33:25,300 That message is then sent back out on the reply network, where 638 00:33:25,300 --> 00:33:27,130 it is routed through the reply network 639 00:33:27,130 --> 00:33:29,983 and back to the original requesting cluster, where 640 00:33:29,983 --> 00:33:31,900 again the directory puts that back on the bus. 641 00:33:31,900 --> 00:33:34,710 642 00:33:34,710 --> 00:33:38,430 Let's look one level deeper into how the directory controller 643 00:33:38,430 --> 00:33:39,930 actually functions. 644 00:33:39,930 --> 00:33:42,750 645 00:33:42,750 --> 00:33:45,900 We see here on the bottom the two buses, which 646 00:33:45,900 --> 00:33:48,570 are the backbone of the cluster, the Silicon Graphics 647 00:33:48,570 --> 00:33:51,210 multiprocessor bus. 648 00:33:51,210 --> 00:33:53,400 The directory controller actually 649 00:33:53,400 --> 00:33:55,920 acts like another snoop on that bus, 650 00:33:55,920 --> 00:33:58,910 sitting there watching the bus activity. 651 00:33:58,910 --> 00:34:02,090 When the directory controller sees the remote request, 652 00:34:02,090 --> 00:34:03,927 it generates a message, which it sends out 653 00:34:03,927 --> 00:34:06,260 through the message routing chip and through the request 654 00:34:06,260 --> 00:34:07,750 network. 655 00:34:07,750 --> 00:34:10,060 That message goes around the machine 656 00:34:10,060 --> 00:34:14,630 and comes in the request network of the remote cluster. 657 00:34:14,630 --> 00:34:19,710 The message then is routed to what we call the Pseudo-CPU. 658 00:34:19,710 --> 00:34:21,840 The Pseudo-CPU is a piece of logic, 659 00:34:21,840 --> 00:34:25,199 which acts on behalf of a remote requester, 660 00:34:25,199 --> 00:34:29,320 as if it were a demon sitting on the bus. 661 00:34:29,320 --> 00:34:31,360 It generates the necessary bus control 662 00:34:31,360 --> 00:34:34,690 signals to get the data either from memory 663 00:34:34,690 --> 00:34:38,080 or from the cache of one of the processors on that bus. 664 00:34:38,080 --> 00:34:41,110 665 00:34:41,110 --> 00:34:45,520 When the access is completed, the directory controller 666 00:34:45,520 --> 00:34:51,250 picks up the data from the bus and sends the message 667 00:34:51,250 --> 00:34:53,679 containing the requested data back out through the message 668 00:34:53,679 --> 00:34:57,010 routing chip on the reply network. 669 00:34:57,010 --> 00:35:01,630 A piece of logic in the original requesting cluster 670 00:35:01,630 --> 00:35:04,450 called the reply controller keeps track 671 00:35:04,450 --> 00:35:06,850 of all outstanding requests. 672 00:35:06,850 --> 00:35:10,060 When the outstanding request returns to the cluster, 673 00:35:10,060 --> 00:35:12,850 the reply controller places the data back on the network 674 00:35:12,850 --> 00:35:18,570 and tells the processor that it can now proceed with execution. 675 00:35:18,570 --> 00:35:20,790 What's important is that this entire process 676 00:35:20,790 --> 00:35:22,890 is totally transparent. 677 00:35:22,890 --> 00:35:25,590 A program issues a load instruction, 678 00:35:25,590 --> 00:35:28,860 and the hardware does this entire mechanism, including 679 00:35:28,860 --> 00:35:30,900 the creation of messages, routing them 680 00:35:30,900 --> 00:35:32,640 through the network, getting the data, 681 00:35:32,640 --> 00:35:34,880 and providing the data back to the processor. 
The DASH design also includes a rather substantial performance monitor, used to watch the behavior of programs. This gives us not only a machine that can execute parallel programs, but an environment for studying parallel programs and watching how they behave. Overall, in terms of gate count, the coherency mechanism adds about 10% to the gate count of an individual cluster, so the overhead is actually quite modest.

Now, let me show you a picture of what the DASH prototype looks like. We've implemented it using state-of-the-art academic packaging. We've actually removed the board supports that we use in this figure, which normally consist of old phone books and TTL catalogs. The design is done so that the directory logic slips right into the Silicon Graphics backplane. With the help of Silicon Graphics, we were able to make a number of small but critical modifications to their processor environment so that the processor can operate correctly with the directory board in its backplane. The message network is actually implemented over the ribbon cables that are visible in this picture.

Now, let me say something about the performance of remote accesses in this machine. This figure shows the breakdown of the time to do a local fill of a cache, that is, to service a cache miss that is satisfied on the local bus; a two-cluster fill, the time to service a cache miss that has to go to one remote cluster; and finally a three-cluster fill, the time to service a cache miss that has to go to two remote clusters. A local fill takes just over 20 processor clocks, and you can see the breakdown, with that time almost evenly divided between memory access time and bus overhead.
One of the surprising things you'll see in both the two- and three-cluster fills, which take respectively about 100 and 130 processor clocks, is that a substantial amount of the time is devoted to getting through the memory hierarchy, onto the bus, and in and out of the clusters. A rather small amount of the overall time is devoted to actually getting through the network. This has some important advantages: it means that in larger-scale machines, the network time will still not be the major bottleneck in the design.

Let me also observe that although these times, 100 processor clocks to go to a remote cluster or 130 to go through two remote clusters, might sound large, we expect that in next-generation versions of this architecture we will be hard pressed just to maintain those processor clock ratios, given the rapid improvement in CPU performance. So we believe these are realistic numbers, numbers that real designs and real programs have to live with.

Given that, one might ask: if remote accesses can take 100 to 130 processor clocks, what's the performance of this machine going to look like on real, interesting applications? Well, I guess I wouldn't be here today if the performance wasn't interesting. So, on this slide, I'm going to show you some numbers that have been captured on the 16-processor prototype of DASH. These numbers are for a wide range of applications, varying from programs that do radiosity calculations for graphics, to a variety of scientific programs that do galaxy evolution, fluid flow, and the classic matrix multiply, as well as more engineering-oriented programs that do things like placement and routing by simulated annealing.
766 00:40:07,640 --> 00:40:11,840 You'll notice the speed-up for a wide range of these problems 767 00:40:11,840 --> 00:40:14,560 is actually quite good, and what I'd 768 00:40:14,560 --> 00:40:16,570 like to do is show you some of the reasons why 769 00:40:16,570 --> 00:40:20,080 the performance on one of these programs is particularly good 770 00:40:20,080 --> 00:40:22,420 and why on one of them, namely the one 771 00:40:22,420 --> 00:40:24,880 on the bottom of the chart, MP3D, 772 00:40:24,880 --> 00:40:28,400 the performance is not so good. 773 00:40:28,400 --> 00:40:31,160 This next photograph shows a picture 774 00:40:31,160 --> 00:40:36,320 of our real-time monitoring system, which actually enables 775 00:40:36,320 --> 00:40:41,120 us to watch what's happening on the machine in real time, 776 00:40:41,120 --> 00:40:45,340 using the performance monitor to capture data. 777 00:40:45,340 --> 00:40:47,000 On the left-hand side of the screen, 778 00:40:47,000 --> 00:40:50,530 you can see four windows, which correspond to the four clusters 779 00:40:50,530 --> 00:40:54,400 and tell you what the remote access latency in those four 780 00:40:54,400 --> 00:40:56,010 clusters looks like. 781 00:40:56,010 --> 00:41:00,230 And you can see that the times are quite tightly packed 782 00:41:00,230 --> 00:41:03,050 together, indicating that not very much contention is 783 00:41:03,050 --> 00:41:05,750 occurring in the machine. 784 00:41:05,750 --> 00:41:07,750 On the right-hand side, you can see four windows 785 00:41:07,750 --> 00:41:10,480 that tell you how much idle time exists 786 00:41:10,480 --> 00:41:13,180 in each one of the processors, with the large green bar 787 00:41:13,180 --> 00:41:15,400 indicating that the processors are busy most 788 00:41:15,400 --> 00:41:18,190 of the time, roughly 90% to 95% of the time 789 00:41:18,190 --> 00:41:19,850 in this application. 790 00:41:19,850 --> 00:41:24,250 So this application, which is simulating the molecular dynamics 791 00:41:24,250 --> 00:41:29,070 of water molecules, actually gets quite good speed-up. 792 00:41:29,070 --> 00:41:32,750 Let's look at what this kind of performance monitor 793 00:41:32,750 --> 00:41:34,730 tells us about the program that doesn't 794 00:41:34,730 --> 00:41:38,570 get good speed-up, MP3D. 795 00:41:38,570 --> 00:41:44,760 If we look at this shot of the screen, we'll see two things. 796 00:41:44,760 --> 00:41:47,100 On the left-hand side, we can see 797 00:41:47,100 --> 00:41:50,160 that there is quite a lot of contention in the machine, 798 00:41:50,160 --> 00:41:53,880 and rather than a sharply focused remote access time, 799 00:41:53,880 --> 00:41:58,200 we've got a remote access time that is smoothed out 800 00:41:58,200 --> 00:41:59,970 and blurred. 801 00:41:59,970 --> 00:42:01,920 And we see that the average remote access 802 00:42:01,920 --> 00:42:05,580 times are considerably larger than the minimum latency, 803 00:42:05,580 --> 00:42:07,710 indicating that there's a lot of contention 804 00:42:07,710 --> 00:42:09,810 in the machine. 805 00:42:09,810 --> 00:42:12,440 We also can see on the right-hand side 806 00:42:12,440 --> 00:42:15,770 that three of the four clusters are idle almost 50% 807 00:42:15,770 --> 00:42:18,380 of the time; that is, the processor utilization 808 00:42:18,380 --> 00:42:20,613 is only about 60%. 809 00:42:20,613 --> 00:42:23,030 And hence, that's why we're not getting very good speed-up 810 00:42:23,030 --> 00:42:26,160 on this application.
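As a rough, hedged check on what those monitor readings imply: on the 16-processor prototype, a processor utilization of about 60% means the machine is doing the useful work of roughly 16 x 0.6, or about 9 to 10 processors, so MP3D's speed-up cannot be much better than that even before the lengthened remote-access latencies caused by contention are charged against it. The 16-processor and 60% figures are the ones quoted above; the bound itself is just the usual speed-up-is-at-most-processors-times-utilization argument, not a separate DASH measurement.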
811 00:42:26,160 --> 00:42:29,210 Now, let me return to this performance chart for a second 812 00:42:29,210 --> 00:42:31,820 and say something about this. 813 00:42:31,820 --> 00:42:34,310 The MP3D application was originally 814 00:42:34,310 --> 00:42:38,030 written for a vector supercomputer, 815 00:42:38,030 --> 00:42:40,760 and it effectively streams its entire memory 816 00:42:40,760 --> 00:42:44,030 image, a three-dimensional representation of space, 817 00:42:44,030 --> 00:42:46,470 through the processor. 818 00:42:46,470 --> 00:42:49,590 One of the things we did was to rewrite this application, 819 00:42:49,590 --> 00:42:53,960 and the rewritten version is called PSIM4. 820 00:42:53,960 --> 00:42:56,360 PSIM4 is actually a more sophisticated version 821 00:42:56,360 --> 00:42:58,610 of this application in that it includes 822 00:42:58,610 --> 00:43:04,050 more complicated chemistry of the upper atmosphere. 823 00:43:04,050 --> 00:43:06,050 However, it obtains substantially better 824 00:43:06,050 --> 00:43:08,480 performance, as we can see on this chart, 825 00:43:08,480 --> 00:43:12,170 and it does that because PSIM4 pays more attention to locality 826 00:43:12,170 --> 00:43:15,380 of access and uses a more sophisticated data 827 00:43:15,380 --> 00:43:17,930 structure to improve that locality. 828 00:43:17,930 --> 00:43:21,050 So, we see that with some work, we can obtain good performance 829 00:43:21,050 --> 00:43:23,810 even on a program whose performance was initially quite dismal. 830 00:43:23,810 --> 00:43:28,337 831 00:43:28,337 --> 00:43:30,920 Let me say something about the software strategy which we are 832 00:43:30,920 --> 00:43:32,810 developing for this machine. 833 00:43:32,810 --> 00:43:36,220 834 00:43:36,220 --> 00:43:38,230 One focus of that software strategy 835 00:43:38,230 --> 00:43:43,210 is compiler-directed parallelism and memory 836 00:43:43,210 --> 00:43:47,660 hierarchy management, whereby the compiler analyzes 837 00:43:47,660 --> 00:43:50,880 scientifically oriented, loop-intensive 838 00:43:50,880 --> 00:43:56,470 programs and both decomposes the parallelism 839 00:43:56,470 --> 00:43:59,680 to run in parallel and manages the memory hierarchy, 840 00:43:59,680 --> 00:44:04,990 aggressively trying to improve the locality of memory access. 841 00:44:04,990 --> 00:44:07,030 We're also developing languages 842 00:44:07,030 --> 00:44:11,470 aimed at parallel programs that have a larger grain size 843 00:44:11,470 --> 00:44:14,140 and a more complicated model of memory 844 00:44:14,140 --> 00:44:15,280 locality. 845 00:44:15,280 --> 00:44:17,200 These languages, Jade and COOL, both have 846 00:44:17,200 --> 00:44:20,200 ways for users to talk about data dependencies 847 00:44:20,200 --> 00:44:22,810 between parallel tasks, as well as 848 00:44:22,810 --> 00:44:24,784 to deal with the issue of locality. 849 00:44:24,784 --> 00:44:29,050 850 00:44:29,050 --> 00:44:33,175 One of the nice things about a machine like DASH 851 00:44:33,175 --> 00:44:35,710 that has a single address space is 852 00:44:35,710 --> 00:44:38,290 that there's a fairly straightforward strategy 853 00:44:38,290 --> 00:44:40,870 for getting an operating system on it. 854 00:44:40,870 --> 00:44:44,410 We took a version of the Silicon Graphics IRIX operating system 855 00:44:44,410 --> 00:44:47,310 and ported it quite easily to DASH.
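To make the locality point concrete, here is a minimal sketch, in C, of the kind of data restructuring that a PSIM4-style rewrite or a locality-conscious compiler aims for. It is purely illustrative; the names and layout are assumptions, and this is neither the PSIM4 source nor output of the Stanford compiler.

    /* Illustrative sketch only: keep each processor's particles in a
     * contiguous block of spatial cells so its working set is served
     * mostly by ~20-clock local fills rather than 100-130 clock remote
     * fills.  All sizes below are assumed for the example.             */
    #define NCELLS 4096   /* spatial cells in the 3-D grid (assumed)    */
    #define NPROCS 16     /* processors, as in the DASH prototype       */

    typedef struct {
        double pos[3];
        double vel[3];
    } Particle;

    typedef struct {
        Particle *particles;           /* all particles, sorted by cell     */
        int cell_start[NCELLS + 1];    /* first particle index of each cell */
    } Space;

    /* Each processor advances only the particles in its own block of
     * cells, so the data it touches can stay resident in its local
     * memory and caches instead of streaming through every processor.  */
    void move_particles(Space *s, int proc)
    {
        int cells_per_proc = NCELLS / NPROCS;
        int first = proc * cells_per_proc;
        int last  = first + cells_per_proc;

        for (int c = first; c < last; c++)
            for (int i = s->cell_start[c]; i < s->cell_start[c + 1]; i++)
                for (int d = 0; d < 3; d++)
                    s->particles[i].pos[d] += s->particles[i].vel[d];
    }

Because each processor's block of cells stays near that processor, most of its cache misses are local fills rather than remote ones, which is the effect the PSIM4 rewrite, and the locality-oriented compiler and language work, are after.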
856 00:44:47,310 --> 00:44:49,620 Some small performance tuning of that port was done in order 857 00:44:49,620 --> 00:44:53,760 to enable the machine to operate more effectively 858 00:44:53,760 --> 00:44:55,290 with multiple clusters. 859 00:44:55,290 --> 00:44:57,470 But the important thing is that it runs UNIX 860 00:44:57,470 --> 00:45:01,800 in a multi-threaded environment in a way that most of us like. 861 00:45:01,800 --> 00:45:04,540 862 00:45:04,540 --> 00:45:06,400 We've also focused quite heavily on trying 863 00:45:06,400 --> 00:45:09,910 to build tools to help people tune their programs 864 00:45:09,910 --> 00:45:13,940 and make them run effectively on a machine like DASH. 865 00:45:13,940 --> 00:45:16,015 This is particularly important because the memory 866 00:45:16,015 --> 00:45:17,390 hierarchy, as we've seen earlier, 867 00:45:17,390 --> 00:45:19,850 is quite sophisticated, and explaining this 868 00:45:19,850 --> 00:45:23,300 to a programmer who hasn't had several courses in computer 869 00:45:23,300 --> 00:45:26,090 architecture can be quite difficult. 870 00:45:26,090 --> 00:45:28,280 In order to assist them, one of the tools we built 871 00:45:28,280 --> 00:45:31,270 is a tool called Mtool. 872 00:45:31,270 --> 00:45:36,730 Mtool actually monitors the behavior of a program 873 00:45:36,730 --> 00:45:40,210 and keeps track of where the performance bottlenecks are. 874 00:45:40,210 --> 00:45:43,720 So, it can tell you, in a particular procedure, 875 00:45:43,720 --> 00:45:45,460 for example, that there's a memory 876 00:45:45,460 --> 00:45:47,720 bottleneck or a synchronization bottleneck 877 00:45:47,720 --> 00:45:51,370 which is preventing you from obtaining good speed-up. 878 00:45:51,370 --> 00:45:53,290 Mtool not only can do that, but it 879 00:45:53,290 --> 00:45:56,740 can isolate the actual lines in the source code, 880 00:45:56,740 --> 00:45:58,360 and it has a fully windowed interface 881 00:45:58,360 --> 00:46:01,480 to enable you to go directly from the memory bottleneck 882 00:46:01,480 --> 00:46:04,150 to the particular set of lines in the source code 883 00:46:04,150 --> 00:46:07,220 that are the source of that bottleneck. 884 00:46:07,220 --> 00:46:09,410 This allows users to quickly focus in 885 00:46:09,410 --> 00:46:11,540 on what the problem in their program 886 00:46:11,540 --> 00:46:13,010 is and allows them to tune it. 887 00:46:13,010 --> 00:46:16,110 888 00:46:16,110 --> 00:46:17,910 Now, let me say a little bit about what 889 00:46:17,910 --> 00:46:20,550 we see for future opportunities, future versions of the DASH 890 00:46:20,550 --> 00:46:22,883 architecture. 891 00:46:22,883 --> 00:46:24,300 One thing we've been looking at is 892 00:46:24,300 --> 00:46:26,190 a much more highly integrated version 893 00:46:26,190 --> 00:46:30,500 of this architecture that would use newer processors. 894 00:46:30,500 --> 00:46:34,850 One could use modern packaging technology to design 895 00:46:34,850 --> 00:46:40,130 a four-processor cluster which, with ASICs and this kind of packaging 896 00:46:40,130 --> 00:46:43,550 technology, could put the four processors with their first- 897 00:46:43,550 --> 00:46:49,250 and second-level caches, 128 megabytes of memory, 898 00:46:49,250 --> 00:46:54,020 an ASIC-based directory controller, 899 00:46:54,020 --> 00:46:57,470 and a new state-of-the-art interconnection network using 900 00:46:57,470 --> 00:47:03,680 new mesh routing chips all together onto a single board.
901 00:47:03,680 --> 00:47:07,520 Using processors that are available in 1992, 902 00:47:07,520 --> 00:47:09,800 you could obtain a peak performance 903 00:47:09,800 --> 00:47:15,260 of 600 MIPS out of that cluster and about 400 megaflops. 904 00:47:15,260 --> 00:47:17,930 If you took this highly integrated cluster 905 00:47:17,930 --> 00:47:23,210 and placed it into a tightly packed system environment, 906 00:47:23,210 --> 00:47:26,330 you could place a thousand-processor machine, that is, 907 00:47:26,330 --> 00:47:29,600 a 16-by-16 grid of clusters with four processors per cluster, 908 00:47:29,600 --> 00:47:32,417 into four large racks. 909 00:47:32,417 --> 00:47:34,250 With today's processors, that might give you 910 00:47:34,250 --> 00:47:39,670 a peak performance of about 600 GIPS and about 400 gigaflops. 911 00:47:39,670 --> 00:47:42,640 But what's perhaps even more exciting is that if you took 912 00:47:42,640 --> 00:47:45,220 processors which we expect to be available 913 00:47:45,220 --> 00:47:49,240 in the '93/'94 time frame, you could obtain a machine 914 00:47:49,240 --> 00:47:53,780 with about 1.6 teraflops of peak performance. 915 00:47:53,780 --> 00:47:56,360 916 00:47:56,360 --> 00:48:02,610 Let me now make some concluding remarks about the DASH design. 917 00:48:02,610 --> 00:48:04,830 What DASH tries to do is combine some 918 00:48:04,830 --> 00:48:08,400 of the scalability advantages that have been demonstrated 919 00:48:08,400 --> 00:48:11,250 by and associated with message-passing machines 920 00:48:11,250 --> 00:48:14,100 together with the programming advantages 921 00:48:14,100 --> 00:48:17,760 that we know from a single-address-space machine. 922 00:48:17,760 --> 00:48:22,450 It does that with the use of a distributed directory protocol. 923 00:48:22,450 --> 00:48:25,990 We see that DASH looks like the beginning of a convergence 924 00:48:25,990 --> 00:48:28,810 between these two approaches, the message-based approach 925 00:48:28,810 --> 00:48:31,920 and the single-address-space approach, 926 00:48:31,920 --> 00:48:36,320 and hopefully it has some of the advantages of both. 927 00:48:36,320 --> 00:48:40,160 Let me say, though, that a single address space, while it 928 00:48:40,160 --> 00:48:43,070 eases the programming effort, does not 929 00:48:43,070 --> 00:48:47,370 allow the programmer to ignore and forget about locality. 930 00:48:47,370 --> 00:48:50,210 So, to really use these machines, 931 00:48:50,210 --> 00:48:54,650 we'll need new algorithms that have scalable parallelism, that 932 00:48:54,650 --> 00:48:59,390 pay attention to locality, that help us make use of latency-tolerating 933 00:48:59,390 --> 00:49:02,320 support, and that help us minimize 934 00:49:02,320 --> 00:49:05,760 global synchronization. 935 00:49:05,760 --> 00:49:08,490 The DASH project has been a large project for us 936 00:49:08,490 --> 00:49:10,680 by academic standards and wouldn't 937 00:49:10,680 --> 00:49:14,160 have been possible without the help of a lot of people. 938 00:49:14,160 --> 00:49:16,380 My colleagues at Stanford have played a big role 939 00:49:16,380 --> 00:49:18,630 in making this machine work; Professors 940 00:49:18,630 --> 00:49:20,520 Gupta, Horowitz, and Lam have all 941 00:49:20,520 --> 00:49:23,080 made invaluable contributions. 942 00:49:23,080 --> 00:49:24,508 Dave Nakahira, our staff engineer, 943 00:49:24,508 --> 00:49:26,800 has actually helped keep the hardware together and make 944 00:49:26,800 --> 00:49:28,940 it really run.
945 00:49:28,940 --> 00:49:30,680 A large number of PhD students have 946 00:49:30,680 --> 00:49:34,070 contributed to both the hardware and software system. 947 00:49:34,070 --> 00:49:38,430 Dan Lenoski and Jim Laudon were key designers on this project. 948 00:49:38,430 --> 00:49:41,360 And finally, I want to thank Silicon Graphics, without whose 949 00:49:41,360 --> 00:49:45,320 help we could have never built a working prototype of this scale 950 00:49:45,320 --> 00:49:49,600 and demonstrated the viability of these ideas. 951 00:49:49,600 --> 00:49:51,450 Thank you. 952 00:49:51,450 --> 00:51:00,000