Today I want to talk with you about the DASH project and our research at Stanford on building scalable shared-memory machines. We'll start by talking about some of the general challenges, as well as some of the advantages, of a scalable shared-memory machine, and then in the second half of the talk move into discussing the DASH architecture and the concepts behind it.

Before starting, let's discuss briefly the goals in designing a parallel machine. The first goal is probably the obvious one: we want high performance out of a parallel machine, faster than the available uniprocessors. A second goal, which we think is quite important, is that the cost-performance of a multiprocessor be competitive with the cost-performance of workstations that use the available microprocessor technology. Obviously, scalability has been something that people in parallel processing have talked about extensively, and one would like to have scalable performance in a multiprocessor, as well as scalable cost-performance. Finally, a goal that we've considered very important in our research is to design machines which are general purpose, which have wide applicability to a range of applications.

We think each of these points is important in trying to make parallel processors pervasive, in getting to the point where we all, as a computing community, choose to use parallel processors because they're a better and faster way to solve our problems.

The term scalability has been used to mean a whole variety of things, and I think it's helpful to mention what some of those definitions are and to say what we mean by scalability. One definition of scalability is what I would call an impractical definition. It states that scalability is either speed-up on a constant-size problem across a range of machines, or even speed-up on scaled problems, which for practical machines, machines that can actually be implemented, is unattainable.
A second definition of scalability would be realistic in the sense that we could asymptotically approach that performance, but it might disregard cost and might disregard constant factors which, in practice, would make the machine perhaps scalable but uninteresting from a cost viewpoint. What we mean by scalability, and the definition we've used in our project, is a very pragmatic one. Our interest is in good speed-up and good scalable cost across a range of machine sizes. In particular, what I'm going to talk about today are machines in the range we find interesting: tens to low thousands of processors.

What's required to build a scalable multiprocessor? Well, obviously, we must scale bandwidth and also do something about the scaling of latency. We'll return to the topic of latency shortly; let's just focus on bandwidth for now. When we say we must scale bandwidth, that means we have to be able to scale the local memory bandwidth, the bandwidth that each processor demands, linearly, and we also have to have a way to scale the communication bandwidth between processors as we scale the size of the machine.

An important observation is that scaling the memory bandwidth requires the use of physically distributed memories. It's inherent: there's no cost-efficient way to scale up a machine and scale up the memory bandwidth without physically distributing the memory. I also want to point out that if you want to achieve scalable cost-performance, you've got to take advantage of the best price-performance available in processors. That means using state-of-the-art microprocessors, and I think we can see this trend evidenced in various recently announced machines, the CM-5, for example, which chose to use an off-the-shelf microprocessor rather than build a custom processor for the machine.

Once one has decided to use a distributed memory, a physically distributed memory, there are two interesting architectural alternatives.
One is a message-passing machine, where the memories are private and communication between processors is done by explicitly passing messages. A second alternative, and the one we're going to discuss in some detail in this talk, is a machine that has physically separated memories but a single address space across all those memories.

It's important to observe that either approach requires you to deal with locality of access. Because the memories are physically distributed, programmers have to pay attention to the locality of access. Both approaches also require a scalable interconnection technology, which allows us to scale up the bandwidth between the processors.

Given that, why do we find a single-address-space approach attractive? Well, there are both functional and performance advantages. Among the functional advantages, one of the most important is that a single address space is easier to program in and easier to build compilers for. It's easier to program because it's a better-understood programming model, and it's easier to build compilers for because the compiler can treat the single address space and the physically distributed memory as an optimization problem. There's also, as we'll see in this talk, the opportunity to create an incremental program development methodology, whereby a programmer can relatively easily port a program to this machine and can regard the process of enhancing performance as an incremental process, in which the issues of locality of access are addressed as an optimization problem, a performance enhancement problem, as opposed to a correctness problem that is fundamental to getting the program working. With a little bit of hardware help, a single-address-space machine can also efficiently simulate other models.

There are also performance advantages associated with a single-address-space approach.
One of the most important of these is that small data objects can be communicated more efficiently between processors, because that communication is integrated into the memory model. A second advantage is that it's easier to exploit caching, and we'll see that this is the backbone of much of the work we've done at Stanford.

As one scales up the machine size, the latency to access remote memories increases. This increase is unavoidable. For example, if we look at how that latency scales in machines today, we see typical numbers like these: a local cache access might take one processor cycle; accessing local memory might take anywhere from 10 to 40 cycles; accessing a remote memory across an interconnection network might take anywhere from 50 to 200 cycles; and a remote access that actually requires trapping to the operating system or some other software library might take as many as 1,000, or in fact up to 10,000, cycles. This scaling of remote access time is unavoidable if we want to allow the machine to scale up to large numbers of processors.

How do we deal with this long latency? There are three fundamental ways. The first is that we can lower the frequency of long-latency events. By using better parallel algorithms, or better implementations of parallel algorithms, the programmer can lower the frequency of long-latency events and thereby improve the performance of a program. A second approach is to try to reduce the latency through hardware techniques, for example by making access to remote objects cheaper. A third approach is to tolerate latency, that is, to find some way to overlap computation and communication so that the impact of latency on performance is minimized.

Let's talk first about reducing latency.
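To get a feel for why these approaches matter, here is a minimal back-of-the-envelope sketch in C of the average memory access time a processor would see. The latencies come from the ranges just quoted; the miss rate and the fraction of misses that go remote are purely hypothetical numbers chosen for illustration.

```c
#include <stdio.h>

/* Rough average-memory-access-time model.  Latencies are taken from the
 * ranges quoted above (cache hit ~1 cycle, local memory ~10-40 cycles,
 * remote memory ~50-200 cycles); the miss rate and remote fraction are
 * hypothetical, not measurements from any real program. */
int main(void)
{
    double cache_hit  = 1.0;    /* cycles for a local cache hit            */
    double local_mem  = 30.0;   /* cycles for a miss served by local memory */
    double remote_mem = 150.0;  /* cycles for a miss served across the net  */

    double miss_rate       = 0.05; /* fraction of references that miss      */
    double remote_fraction = 0.40; /* fraction of misses that go remote     */

    double amat = cache_hit
                + miss_rate * ((1.0 - remote_fraction) * local_mem
                               + remote_fraction * remote_mem);

    printf("average access time = %.1f cycles\n", amat);  /* ~4.9 here */
    return 0;
}
```

Even with a modest miss rate, the remote term dominates, which is why lowering the frequency of remote accesses, reducing their latency, and tolerating the latency that remains are all worth pursuing.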
One of the most important methods for reducing latency is to eliminate long-latency accesses through caching. Caching is a time-honored idea that has been used in architecture for a long time and has been extremely useful in reducing latency. When we introduce caching, we can introduce it either as a hardware technique or as a software technique. And, of course, the static versus dynamic behavior of the application affects the suitability of a software technique, which might be more appropriate in a static environment, versus a hardware technique, which might be more appropriate in a dynamic environment.

We can also reduce the communication latency by the use of low-latency, high-bandwidth protocols, ones that both avoid contention and reduce latency. This is an area where there has been dramatic progress, and where it appears there will continue to be important progress, in interconnection networks.

As we said earlier, we can also tolerate latency. One obvious way to tolerate latency is in a message-passing machine, where the programmer has the task of overlapping computation and communication. Another way to tolerate latency is by changing the memory model. The memory model specifies when a processor can continue after doing a load or a store, for example, and when it can assume that access has completed. By relaxing the memory model, we allow the processor to continue earlier, allowing it to ignore the fact that a store, for example, may take quite a long time to propagate across the entire machine.

Two other ways to tolerate latency are prefetching and multiple contexts. In prefetching, the programmer or the compiler inserts instructions into the program which have the effect of bringing data from a remote location to a nearby location, thereby reducing its access time.
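As a hedged illustration of the prefetching idea, the C sketch below issues a prefetch for data a few iterations ahead of where the loop is currently working, so that the (possibly remote) memory access overlaps with computation on the current element. The `__builtin_prefetch` intrinsic and the look-ahead distance used here are just for illustration; DASH exposes prefetching through its own mechanism, described later in the talk.

```c
/* Latency tolerance via software prefetching: a minimal sketch.
 * While computing on element i, we request element i+AHEAD so the fetch
 * overlaps useful work.  __builtin_prefetch is the GCC/Clang intrinsic;
 * AHEAD is a tuning parameter chosen arbitrarily here. */
#define AHEAD 8

double sum_array(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + AHEAD < n)
            __builtin_prefetch(&a[i + AHEAD], /*rw=*/0, /*locality=*/1);
        sum += a[i];   /* works on data that was prefetched earlier */
    }
    return sum;
}
```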
With multiple contexts, we design the machine so that when a long-latency event occurs, the processor context-switches to another thread of execution, thereby overlapping that long-latency event with the other thread's execution.

It's important to observe that all latency-tolerating techniques require extra parallelism to be available in the program, because they overlap the long-latency events with that extra parallelism. They also generate extra bandwidth demands on the machine: by doing prefetching or using multiple contexts, we increase the bandwidth demands that the processor makes on behalf of that particular program.

Let's return to the topic of caching and cache coherency. There are several alternatives we could consider in a single address space with respect to the use of caching, and with respect to whether or not the caches should be kept coherent with hardware assistance. One approach would be to say: let's not cache shared data. As we'll see, that makes accesses to remote data very expensive. A second, related alternative is to cache shared data, but under software control. In that case, we're limited by the ability of the software to do the job perfectly. By that we mean that if the software cannot be certain that it can cache a local copy and still retain a consistent programming model and a consistent memory model for the programmer, then the software must be conservative: it must not make a cached copy, and must use the remote copy instead. That becomes the limiting factor in trying to cache in software.

What would an architecture with private caches that did not attempt to cache shared data look like? Well, that architecture might well have caches, but the caches would contain only private data; therefore accesses to remote data would have to go through the caches, over the interconnection network, and to the remote memory.
Some examples of these kinds of machines are the IBM RP3 and the BBN Butterfly. These machines feature, as we see here, a distributed memory, and they even use caches. There's also a single address space across these distributed memories, and they make use of a scalable interconnection network.

This slide shows why this approach runs into difficulties. The frequency of shared accesses, that is, the percentage of the memory accesses made by the processor that are to shared data, is quite large, often in the range of 35% to 40% of the accesses made by the program. That means, for example, in a machine whose caches hold only private data, that 35% to 40% of the memory accesses are going to take on the order of 100 cycles. Obviously, the penalty for this, and the effect on speed-up, would be quite dramatic.

Now, there's actually good news in these sorts of measurements. One element of good news is that the shared read frequency tends to be much higher than the shared write frequency. That's good news because, from the point of view of a cache designer, it means that caching will probably work quite well.

So, a third alternative for designing a machine with a single address space is to build cache coherency into the hardware. The key question is: can that coherency mechanism be made to scale?

This chart shows how conventional cache coherency has traditionally been implemented, using what's typically called a snoopy bus. In this scheme the processors are connected to a single bus. When a cache miss occurs, that miss is placed on the bus, and the caches are snooped to determine whether or not they contain the data required by the processor. If one of those caches contains the data, it places the data on the bus, and the processor that generated the initial cache miss finds the data present there.
When a processor writes a shared data item, the address of that data item is placed on the bus, and an invalidation is generated on the bus, so that the copies of the data resident in the other caches are eliminated by that invalidation. A good way to think about this is that a snoopy bus protocol is a solution to the multiple-readers, single-writer problem. We allow multiple readers to copy the data item into their caches; when we want to write the data item, we invalidate all those cached copies so that there's only a single copy, which can then be written and updated.

Now, if we ask what the problem is with a snoopy-based protocol, the major problem is that when a cache miss occurs, we must broadcast that address on the bus and require all the caches to snoop to see whether or not they hold that particular address. An important observation here is that this is an inherent feature of the protocol. It is not a property that can be eliminated just by making the bus faster, or even by putting in two buses, because the limitation will be that the cache bandwidth used for snooping quickly saturates. Therefore we can't extend the machine purely by increasing the amount of bus bandwidth.

Let me talk about an alternative to this snoopy-based approach that enables us to have a cache coherency scheme that will scale. This alternative is a cache coherency scheme called directories. In a directory-based scheme we have an object called a directory, which keeps track of all the shared copies of a data item. It keeps track of them with the use of presence bits, which denote which caches have a shared copy of a particular block or line. Each block in memory has a corresponding entry in the directory, which keeps track of that particular block of memory and which caches have a copy.
There's also a dirty bit to indicate when a single copy is the only copy of a particular block anywhere in the machine. So one of two situations exists. Either one or more presence bits are on, indicating, if there is more than one, that the block is shared by multiple readers and exists in multiple caches, or the dirty bit is on, and there is a single copy, with one presence bit indicating the location of that copy.

Now, how do these directories maintain coherency? Well, the next two slides will tell us about that. When a location is read, the reader, the processor doing the reading, causes the directory to be updated to indicate that that processor has a copy of that particular cache line. That basically tracks the readers. Then, when a location is written, the directory does two things. First, it grants ownership to the writer; that is, it determines which of all the processors that want to write gets priority in writing this data item. It grants ownership, and then, using the presence bits contained in the directory, it causes the other copies to be invalidated and eliminated from the caches.

The advantages of this approach are that it has no unnecessary broadcasts, and it imposes no restrictions on the topology of the interconnection network or on the ordering of messages sent around it.

Directory schemes actually predate snoopy-based schemes, but they were passed over. They were passed over because the snoopy schemes are simpler and because, for most multiprocessors built, scaling hasn't been an issue; people have been interested in smaller-scale multiprocessors. With the desire to explore larger-scale machines with cache coherency, interest in directories has been revived. In order to scale these directories, two important things have to be done.
One is that we have to distribute the directories with the memories; that is, when we have multiple memories, we have to have multiple corresponding directories. If we didn't, then the single directory in the machine would become the bottleneck, and it would be no different from having a single memory in the machine. We also have to use a scalable directory scheme, one that allows the total size of the machine, in processor count and memory, to scale up without making the directory overly costly. I'm not going to talk about scalable directory schemes in detail in this talk, although several papers have been written on the topic.

These distributed directory protocols have now been adopted by a wide variety of machines aimed at building scalable cache coherency. The Stanford machine that we're talking about is one of these, but the Scalable Coherent Interface is another example, proposed as an IEEE standard for interconnecting multiple processors. The MIT Alewife machine uses a scalable directory scheme, as does the Kendall Square KSR1.

Now that we've discussed some of the background issues in building scalable single-address-space machines, I want to talk about the DASH architecture and design in particular. DASH is built around the concept of a distributed directory. It uses individual processors connected to their caches, together with a memory and a directory, which form a cluster. The clusters are interconnected with a communication network based on the Caltech mesh, using a high-bandwidth version of that mesh. A single address space is implemented across the entire machine, even though the physical memories are distributed: the high-order bits of every address specify the cluster in which that particular address or word lives. The individual clusters are, in fact, bus-based multiprocessors, Silicon Graphics 4D/340s.
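To make the directory bookkeeping concrete, here is a small, hedged C sketch of the per-block state a directory scheme keeps (presence bits plus a dirty bit), the updates just described for a read and a write, and a home-cluster lookup that, as in DASH, takes the cluster number from the high-order address bits. The widths, bit positions, and names here are illustrative only, not the actual DASH hardware.

```c
#include <stdint.h>

#define NUM_CLUSTERS  16      /* illustrative machine size                   */
#define CLUSTER_SHIFT 28      /* hypothetical: high-order bits pick a cluster */

/* One directory entry per memory block: presence bits plus a dirty bit. */
typedef struct {
    uint32_t presence;        /* bit c set => cluster c holds a copy         */
    int      dirty;           /* set => exactly one, owned, copy exists      */
} dir_entry_t;

/* The home cluster of an address is encoded in its high-order bits. */
static int home_cluster(uint32_t addr)
{
    return (int)(addr >> CLUSTER_SHIFT) & (NUM_CLUSTERS - 1);
}

static void send_invalidation(int cluster)
{
    (void)cluster;            /* network message elided in this sketch       */
}

/* A read by 'cluster' is recorded by setting its presence bit.  If the
 * block was dirty elsewhere, the data movement and write-back that clean
 * it are part of the protocol walk-through later in the talk; only the
 * bookkeeping is shown here. */
static void directory_read(dir_entry_t *e, int cluster)
{
    e->dirty = 0;
    e->presence |= 1u << cluster;
}

/* A write by 'cluster': invalidate every other copy named by the presence
 * bits, then leave behind a single, owned copy. */
static void directory_write(dir_entry_t *e, int cluster)
{
    uint32_t others = e->presence & ~(1u << cluster);
    for (int c = 0; c < NUM_CLUSTERS; c++)
        if (others & (1u << c))
            send_invalidation(c);
    e->presence = 1u << cluster;
    e->dirty = 1;
}
```

Note that the directory only ever touches the clusters named by the presence bits, which is exactly why there are no unnecessary broadcasts and why distributing one such directory with each memory lets the scheme scale.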
The coherency protocol implemented as a snoopy protocol on that bus is used as part of the overall architecture.

Let me give you an idea of what the DASH memory hierarchy looks like, and define some terminology, so we can talk about how the coherency protocol works. At the top level, we have the processor and its two-level cache system; that's part of the Silicon Graphics 4D. The next level of the hierarchy is provided by the other caches in the same cluster, which act, in effect, as a further level of caching via the snooping protocol. If the data for a particular access is not found either in the processor's own caches or in the caches of the other processors in that cluster, we go to the home cluster of that particular address. The home cluster, remember, is given by the upper-order bits of the address.

There are two possible situations here. One is that the home cluster is the same as the cluster generating the request; that is, the address refers to a word in the local memory of that cluster, in which case, of course, we don't need to go across the interconnection network. The other case is that the address belongs to a remote cluster, and we have to send a request across the interconnection network to the home cluster where the address lives.

In fact, there's even another possible level in the hierarchy here. Consider the situation where the only copy of a particular word that a processor has requested lives in yet another cluster: neither the local cluster, which requested the data, nor the home cluster, but a third, remote cluster. In that case, we may have to cross the interconnection network again to go and get that data.

Let's look at how this protocol operates when we try to read a cache line which is held dirty in a remote cluster. The access will begin with the local cluster generating a read request.
That read request will miss in the local caches and cause a message to be sent across the interconnection network, requesting the data from the home cluster. When that message arrives at the home cluster, we do a directory lookup to determine where that particular word currently resides in the machine. We find out that the word is dirty and resides in a remote cluster. That causes the home cluster to forward the request to the dirty remote cluster: a message is created, and the address of the data object is forwarded. When it arrives at the dirty remote cluster, that cluster does a lookup, using the snoopy bus protocol, to access the caches in that cluster. It finds the data word and transmits it directly back to the original requesting local cluster.

Forwarding the reply directly back to the requester eliminates the need to keep state in the home cluster to track that message, and it also reduces the latency. At the same time, the dirty cluster generates a write-back so that the home cluster's memory is updated. This means that future requests don't need to be forwarded to the dirty cluster; they can be satisfied directly at the home cluster, with the data returned to the requesting cluster.

Now, let's look at the case of a write and see how cache coherency is maintained for a write to a shared data item. When the local cluster sees a write to a shared data item to which it does not currently have exclusive access, it generates a read-exclusive request, which is sent to the home directory. The home directory acts as the point of arbitration: if two clusters try to write to the same word or the same cache line at once, those requests will both be forwarded to the home directory, and they'll be serialized there.
When the home directory gets the request, it replies directly to the local requesting cluster, granting it exclusive access. The cluster can then proceed to do the write. In parallel with granting access, the home cluster also causes invalidations to be sent out to all the clusters that are sharing the data, using the presence bits in the directory to determine which clusters have shared copies. Those sharing clusters will then perform the invalidations in their individual caches and send acknowledgments back to the local cluster when the invalidations have completed. These acknowledgments allow us to use a more relaxed memory model, and they also increase the resiliency of the machine in the face of failure.

In addition to the basic mechanism of caches and cache coherency, there are several other features in DASH that are used to reduce latency. I've mentioned the notion of a relaxed memory model. What actually happens in DASH is that when a write occurs, the cluster gets the read-exclusive acknowledgment and immediately continues. That allows it to overlap its execution with the actual process of invalidating the remote clusters. The writes are then pipelined, basically, and we obtain improved performance from that. By going to a relaxed memory model, we don't need to halt the processor until we reach a synchronization point. This also allows the compiler more freedom in reordering shared accesses between two synchronization points.

The DASH architecture also supports prefetching. It does this by including the ability to generate a request to an address, which causes a local copy of that cache line to be brought into the cluster. Then, rather than seeing the long latency of a full remote cluster reference, the data is obtained locally.
One of the important innovations in the DASH prefetching mechanism is that the prefetching is what we sometimes call non-binding; that is, even though a data item is brought locally, its coherency is maintained, so that if it's prefetched early and somebody updates the data item, the semantics remain correct and the prefetch acts purely as an optimization. This is quite critical in allowing the compiler or the programmer freedom in placing prefetches. It means you can place a prefetch anywhere you want, and you will not change the correct behavior of the program.

DASH also includes features to reduce the latency of synchronization operations and to reduce the bandwidth demands of those operations. Two examples are queueing locks, which give us low-latency access to highly contended locks, and a fetch-and-increment operation, which gives us the ability to reduce contention on things like queue indices and loop indices.

Let me actually show you a diagram of what the DASH hardware looks like. In this figure, we can see the clusters organized with the multiple processors and the directory in each cluster, as well as I/O and a portion of the memory. The interconnection network is actually a pair of meshes: one is used for requests, and the other is used for replies.

What actually happens in the machine is that when a request is generated, the directory, recognizing that the request is for a remote address, creates a message, which it transmits over the request network. That message is routed through the network to the appropriate cluster, where the directory at that cluster sees the message as an incoming request. The directory in the target cluster then takes the message off the network, does whatever accesses are required on the bus, and creates a new message containing the data.
637 00:33:21,490 --> 00:33:25,300 That message is then sent back out on the reply network, where 638 00:33:25,300 --> 00:33:27,130 it is routed through the reply network 639 00:33:27,130 --> 00:33:29,983 and back to the original requesting cluster, where 640 00:33:29,983 --> 00:33:31,900 again the directory puts that back on the bus. 641 00:33:31,900 --> 00:33:34,710 642 00:33:34,710 --> 00:33:38,430 Let's look one level deeper into how the directory controller 643 00:33:38,430 --> 00:33:39,930 actually functions. 644 00:33:39,930 --> 00:33:42,750 645 00:33:42,750 --> 00:33:45,900 We see here on the bottom the two buses, which 646 00:33:45,900 --> 00:33:48,570 are the backbone of the cluster, the Silicon Graphics 647 00:33:48,570 --> 00:33:51,210 multiprocessor bus. 648 00:33:51,210 --> 00:33:53,400 The directory controller actually 649 00:33:53,400 --> 00:33:55,920 acts like another snoop on that bus, 650 00:33:55,920 --> 00:33:58,910 sitting there watching the bus activity. 651 00:33:58,910 --> 00:34:02,090 When the directory controller sees the remote request, 652 00:34:02,090 --> 00:34:03,927 it generates a message, which it sends out 653 00:34:03,927 --> 00:34:06,260 through the message routing chip and through the request 654 00:34:06,260 --> 00:34:07,750 network. 655 00:34:07,750 --> 00:34:10,060 That message goes around the machine 656 00:34:10,060 --> 00:34:14,630 and comes in the request network of the remote cluster. 657 00:34:14,630 --> 00:34:19,710 The message then is routed to what we call the Pseudo-CPU. 658 00:34:19,710 --> 00:34:21,840 The Pseudo-CPU is a piece of logic, 659 00:34:21,840 --> 00:34:25,199 which acts on behalf of a remote requester, 660 00:34:25,199 --> 00:34:29,320 as if it were a demon sitting on the bus. 661 00:34:29,320 --> 00:34:31,360 It generates the necessary bus control 662 00:34:31,360 --> 00:34:34,690 signals to get the data either from memory 663 00:34:34,690 --> 00:34:38,080 or from the cache of one of the processors on that bus. 664 00:34:38,080 --> 00:34:41,110 665 00:34:41,110 --> 00:34:45,520 When the access is completed, the directory controller 666 00:34:45,520 --> 00:34:51,250 picks up the data from the bus and sends the message 667 00:34:51,250 --> 00:34:53,679 containing the requested data back out through the message 668 00:34:53,679 --> 00:34:57,010 routing chip on the reply network. 669 00:34:57,010 --> 00:35:01,630 A piece of logic in the original requesting cluster 670 00:35:01,630 --> 00:35:04,450 called the reply controller keeps track 671 00:35:04,450 --> 00:35:06,850 of all outstanding requests. 672 00:35:06,850 --> 00:35:10,060 When the outstanding request returns to the cluster, 673 00:35:10,060 --> 00:35:12,850 the reply controller places the data back on the network 674 00:35:12,850 --> 00:35:18,570 and tells the processor that it can now proceed with execution. 675 00:35:18,570 --> 00:35:20,790 What's important is that this entire process 676 00:35:20,790 --> 00:35:22,890 is totally transparent. 677 00:35:22,890 --> 00:35:25,590 A program issues a load instruction, 678 00:35:25,590 --> 00:35:28,860 and the hardware does this entire mechanism, including 679 00:35:28,860 --> 00:35:30,900 the creation of messages, routing them 680 00:35:30,900 --> 00:35:32,640 through the network, getting the data, 681 00:35:32,640 --> 00:35:34,880 and providing the data back to the processor. 
The DASH design also includes a rather substantial performance monitor, used to watch the behavior of programs. This gives us not only a machine that can execute parallel programs, but an environment for studying parallel programs and watching how they behave. Overall, in terms of gate count, the coherency mechanism adds about 10% to the gate count of an individual cluster, so the overhead is actually quite modest.

Now, let me show you a picture of what the DASH prototype looks like. We've implemented it using state-of-the-art academic packaging. We've actually removed the board supports that we use in this figure, which normally consist of old phone books and TTL catalogs. The design is done so that the directory logic slips right into the Silicon Graphics backplane. With the help of Silicon Graphics, we were able to make a number of small but critical modifications to their processor environment so that the processor can operate correctly with the directory board in its backplane. The message network is actually implemented over the ribbon cables that are visible in this picture.

Now, let me say something about the performance of remote accesses in this machine. This figure shows the breakdown of the time to do a local fill of a cache, that is, to service a cache miss that is satisfied on the local bus; a two-cluster fill, the time to service a cache miss that has to go to one remote cluster; and finally a three-cluster fill, the time to service a cache miss that has to go to two remote clusters. A local fill takes just over 20 processor clocks, and you can see the breakdown, with that time almost evenly divided between memory access time and bus overhead.
One of the surprising things you'll see in both the two- and three-cluster fills, which take respectively about 100 and 130 processor clocks, is that a substantial amount of the time is devoted to getting through the memory hierarchy, onto the bus, and in and out of the clusters. A rather small amount of the overall time is devoted to actually getting through the network. This has some important advantages: it means that in larger-scale machines, the network time will still not be the major bottleneck in the design.

Let me also observe that although these times, 100 processor clocks to go to a remote cluster or 130 to go through two remote clusters, might sound large, we expect that in next-generation versions of this architecture we will be hard pressed just to maintain those processor clock ratios, given the rapid improvement in CPU performance. So we believe these are realistic numbers, numbers that real designs and real programs have to live with.

Given that, one might ask: if remote accesses can take 100 to 130 processor clocks, what's the performance of this machine going to look like on real, interesting applications? Well, I guess I wouldn't be here today if the performance wasn't interesting. So, on this slide, I'm going to show you some numbers that have been captured on the 16-processor prototype of DASH. These numbers are for a wide range of applications, varying from programs that do radiosity calculations for graphics, to a variety of scientific programs that do galaxy evolution, fluid flow, and the classic matrix multiply, as well as more engineering-oriented programs that do things like placement and routing by simulated annealing.
766 00:40:07,640 --> 00:40:11,840 You'll notice the speed-up for a wide range of these problems 767 00:40:11,840 --> 00:40:14,560 is actually quite good, and what I'd 768 00:40:14,560 --> 00:40:16,570 like to do is show you some of the reasons why 769 00:40:16,570 --> 00:40:20,080 the performance on one of these programs is particularly good 770 00:40:20,080 --> 00:40:22,420 and why on one of them, namely the one 771 00:40:22,420 --> 00:40:24,880 on the bottom of the chart, MP3D, 772 00:40:24,880 --> 00:40:28,400 the performance is not so good. 773 00:40:28,400 --> 00:40:31,160 This next photograph shows a picture 774 00:40:31,160 --> 00:40:36,320 of our real-time monitoring system, which actually enables 775 00:40:36,320 --> 00:40:41,120 us to watch what's happening on the machine in real time, 776 00:40:41,120 --> 00:40:45,340 using the performance monitor to capture data. 777 00:40:45,340 --> 00:40:47,000 On the left-hand side of the screen, 778 00:40:47,000 --> 00:40:50,530 you can see four windows, which correspond to the four clusters 779 00:40:50,530 --> 00:40:54,400 and tell you what the remote access latency in those four 780 00:40:54,400 --> 00:40:56,010 clusters looks like. 781 00:40:56,010 --> 00:41:00,230 And you can see that the times are quite tightly packed 782 00:41:00,230 --> 00:41:03,050 together, indicating that not very much contention is 783 00:41:03,050 --> 00:41:05,750 occurring in the machine. 784 00:41:05,750 --> 00:41:07,750 On the right-hand side, you can see four windows 785 00:41:07,750 --> 00:41:10,480 that tell you how much idle time exists 786 00:41:10,480 --> 00:41:13,180 in each one of the processors, with the large green bar 787 00:41:13,180 --> 00:41:15,400 indicating that the processors are busy most 788 00:41:15,400 --> 00:41:18,190 of the time, roughly 90% to 95% of the time 789 00:41:18,190 --> 00:41:19,850 in this application. 790 00:41:19,850 --> 00:41:24,250 So this application, which is simulating the molecular dynamics 791 00:41:24,250 --> 00:41:29,070 of water molecules, actually gets quite good speed-up. 792 00:41:29,070 --> 00:41:32,750 Let's look at what this kind of performance monitor 793 00:41:32,750 --> 00:41:34,730 tells us about the program that doesn't 794 00:41:34,730 --> 00:41:38,570 get good speed-up, MP3D. 795 00:41:38,570 --> 00:41:44,760 If we look at this shot of the screen, we'll see two things. 796 00:41:44,760 --> 00:41:47,100 On the left-hand side, we can see 797 00:41:47,100 --> 00:41:50,160 that there is quite a lot of contention in the machine, 798 00:41:50,160 --> 00:41:53,880 and rather than a sharply focused remote access time, 799 00:41:53,880 --> 00:41:58,200 we've got a remote access time that is smoothed out 800 00:41:58,200 --> 00:41:59,970 and blurred. 801 00:41:59,970 --> 00:42:01,920 And we see that the average remote access 802 00:42:01,920 --> 00:42:05,580 times are considerably larger than the minimum latency, 803 00:42:05,580 --> 00:42:07,710 indicating that there's a lot of contention 804 00:42:07,710 --> 00:42:09,810 in the machine. 805 00:42:09,810 --> 00:42:12,440 We also can see on the right-hand side 806 00:42:12,440 --> 00:42:15,770 that three of the four clusters are idle almost 50% 807 00:42:15,770 --> 00:42:18,380 of the time; that is, the processor utilization 808 00:42:18,380 --> 00:42:20,613 is only about 60%. 809 00:42:20,613 --> 00:42:23,030 And hence, that's why we're not getting very good speed-up 810 00:42:23,030 --> 00:42:26,160 on this application.
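As a rough, hedged check on what those monitor readings imply: on the 16-processor prototype, a processor utilization of about 60% means the machine is doing the useful work of roughly 16 x 0.6, or about 9 to 10 processors, so MP3D's speed-up cannot be much better than that even before the lengthened remote-access latencies caused by contention are charged against it. The 16-processor and 60% figures are the ones quoted above; the bound itself is just the usual speed-up-is-at-most-processors-times-utilization argument, not a separate DASH measurement.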
811 00:42:26,160 --> 00:42:29,210 Now, let me return to this performance chart for a second 812 00:42:29,210 --> 00:42:31,820 and say something about this. 813 00:42:31,820 --> 00:42:34,310 The MP3D application was originally 814 00:42:34,310 --> 00:42:38,030 written for a vector supercomputer, 815 00:42:38,030 --> 00:42:40,760 and it effectively streams its entire memory 816 00:42:40,760 --> 00:42:44,030 image, a three-dimensional representation of space, 817 00:42:44,030 --> 00:42:46,470 through the processor. 818 00:42:46,470 --> 00:42:49,590 One of the things we did was to rewrite this application, 819 00:42:49,590 --> 00:42:53,960 and the rewritten version is called PSIM4. 820 00:42:53,960 --> 00:42:56,360 PSIM4 is actually a more sophisticated version 821 00:42:56,360 --> 00:42:58,610 of this application in that it includes 822 00:42:58,610 --> 00:43:04,050 more complicated chemistry of the upper atmosphere. 823 00:43:04,050 --> 00:43:06,050 However, it obtains substantially better 824 00:43:06,050 --> 00:43:08,480 performance, as we can see on this chart, 825 00:43:08,480 --> 00:43:12,170 and it does that because PSIM4 pays more attention to locality 826 00:43:12,170 --> 00:43:15,380 of access and uses a more sophisticated data 827 00:43:15,380 --> 00:43:17,930 structure to improve that locality. 828 00:43:17,930 --> 00:43:21,050 So, we see that with some work, we can obtain good performance 829 00:43:21,050 --> 00:43:23,810 even on a program whose performance was initially quite dismal. 830 00:43:23,810 --> 00:43:28,337 831 00:43:28,337 --> 00:43:30,920 Let me say something about the software strategy which we are 832 00:43:30,920 --> 00:43:32,810 developing for this machine. 833 00:43:32,810 --> 00:43:36,220 834 00:43:36,220 --> 00:43:38,230 One focus of that software strategy 835 00:43:38,230 --> 00:43:43,210 is compiler-directed parallelism and memory 836 00:43:43,210 --> 00:43:47,660 hierarchy management, whereby the compiler analyzes 837 00:43:47,660 --> 00:43:50,880 scientifically oriented, loop-intensive 838 00:43:50,880 --> 00:43:56,470 programs and both decomposes the parallelism 839 00:43:56,470 --> 00:43:59,680 to run in parallel and manages the memory hierarchy, 840 00:43:59,680 --> 00:44:04,990 aggressively trying to improve the locality of memory access. 841 00:44:04,990 --> 00:44:07,030 We're also developing languages 842 00:44:07,030 --> 00:44:11,470 aimed at parallel programs that have a larger grain size 843 00:44:11,470 --> 00:44:14,140 and a more complicated model of memory 844 00:44:14,140 --> 00:44:15,280 locality. 845 00:44:15,280 --> 00:44:17,200 These languages, Jade and COOL, both have 846 00:44:17,200 --> 00:44:20,200 ways for users to talk about data dependencies 847 00:44:20,200 --> 00:44:22,810 between parallel tasks, as well as 848 00:44:22,810 --> 00:44:24,784 to deal with the issue of locality. 849 00:44:24,784 --> 00:44:29,050 850 00:44:29,050 --> 00:44:33,175 One of the nice things about a machine like DASH 851 00:44:33,175 --> 00:44:35,710 that has a single address space is 852 00:44:35,710 --> 00:44:38,290 that there's a fairly straightforward strategy 853 00:44:38,290 --> 00:44:40,870 for getting an operating system on it. 854 00:44:40,870 --> 00:44:44,410 We took a version of the Silicon Graphics IRIX operating system 855 00:44:44,410 --> 00:44:47,310 and ported it quite easily to DASH.
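To make the locality point concrete, here is a minimal sketch, in C, of the kind of data restructuring that a PSIM4-style rewrite or a locality-conscious compiler aims for. It is purely illustrative; the names and layout are assumptions, and this is neither the PSIM4 source nor output of the Stanford compiler.

    /* Illustrative sketch only: keep each processor's particles in a
     * contiguous block of spatial cells so its working set is served
     * mostly by ~20-clock local fills rather than 100-130 clock remote
     * fills.  All sizes below are assumed for the example.             */
    #define NCELLS 4096   /* spatial cells in the 3-D grid (assumed)    */
    #define NPROCS 16     /* processors, as in the DASH prototype       */

    typedef struct {
        double pos[3];
        double vel[3];
    } Particle;

    typedef struct {
        Particle *particles;           /* all particles, sorted by cell     */
        int cell_start[NCELLS + 1];    /* first particle index of each cell */
    } Space;

    /* Each processor advances only the particles in its own block of
     * cells, so the data it touches can stay resident in its local
     * memory and caches instead of streaming through every processor.  */
    void move_particles(Space *s, int proc)
    {
        int cells_per_proc = NCELLS / NPROCS;
        int first = proc * cells_per_proc;
        int last  = first + cells_per_proc;

        for (int c = first; c < last; c++)
            for (int i = s->cell_start[c]; i < s->cell_start[c + 1]; i++)
                for (int d = 0; d < 3; d++)
                    s->particles[i].pos[d] += s->particles[i].vel[d];
    }

Because each processor's block of cells stays near that processor, most of its cache misses are local fills rather than remote ones, which is the effect the PSIM4 rewrite, and the locality-oriented compiler and language work, are after.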
856 00:44:47,310 --> 00:44:49,620 Some small performance tuning of that port was done in order 857 00:44:49,620 --> 00:44:53,760 to enable the machine to operate more effectively 858 00:44:53,760 --> 00:44:55,290 with multiple clusters. 859 00:44:55,290 --> 00:44:57,470 But the important thing is that it runs UNIX 860 00:44:57,470 --> 00:45:01,800 in a multi-threaded environment in a way that most of us like. 861 00:45:01,800 --> 00:45:04,540 862 00:45:04,540 --> 00:45:06,400 We've also focused quite heavily on trying 863 00:45:06,400 --> 00:45:09,910 to build tools to help people tune their programs 864 00:45:09,910 --> 00:45:13,940 and make them run effectively on a machine like DASH. 865 00:45:13,940 --> 00:45:16,015 This is particularly important because the memory 866 00:45:16,015 --> 00:45:17,390 hierarchy, as we've seen earlier, 867 00:45:17,390 --> 00:45:19,850 is quite sophisticated, and explaining this 868 00:45:19,850 --> 00:45:23,300 to a programmer who hasn't had several courses in computer 869 00:45:23,300 --> 00:45:26,090 architecture can be quite difficult. 870 00:45:26,090 --> 00:45:28,280 In order to assist them, one of the tools we built 871 00:45:28,280 --> 00:45:31,270 is a tool called Mtool. 872 00:45:31,270 --> 00:45:36,730 Mtool actually monitors the behavior of a program 873 00:45:36,730 --> 00:45:40,210 and keeps track of where the performance bottlenecks are. 874 00:45:40,210 --> 00:45:43,720 So, it can tell you, in a particular procedure, 875 00:45:43,720 --> 00:45:45,460 for example, that there's a memory 876 00:45:45,460 --> 00:45:47,720 bottleneck or a synchronization bottleneck 877 00:45:47,720 --> 00:45:51,370 which is preventing you from obtaining good speed-up. 878 00:45:51,370 --> 00:45:53,290 Mtool not only can do that, but it 879 00:45:53,290 --> 00:45:56,740 can isolate the actual lines in the source code, 880 00:45:56,740 --> 00:45:58,360 and it has a fully windowed interface 881 00:45:58,360 --> 00:46:01,480 to enable you to go directly from the memory bottleneck 882 00:46:01,480 --> 00:46:04,150 to the particular set of lines in the source code 883 00:46:04,150 --> 00:46:07,220 that are the source of that bottleneck. 884 00:46:07,220 --> 00:46:09,410 This allows users to quickly focus in 885 00:46:09,410 --> 00:46:11,540 on what the problem in their program 886 00:46:11,540 --> 00:46:13,010 is and allows them to tune it. 887 00:46:13,010 --> 00:46:16,110 888 00:46:16,110 --> 00:46:17,910 Now, let me say a little bit about what 889 00:46:17,910 --> 00:46:20,550 we see for future opportunities, future versions of the DASH 890 00:46:20,550 --> 00:46:22,883 architecture. 891 00:46:22,883 --> 00:46:24,300 One thing we've been looking at is 892 00:46:24,300 --> 00:46:26,190 a much more highly integrated version 893 00:46:26,190 --> 00:46:30,500 of this architecture that would use newer processors. 894 00:46:30,500 --> 00:46:34,850 One could use modern packaging technology to design 895 00:46:34,850 --> 00:46:40,130 a four-processor cluster which, with ASICs and this kind of packaging 896 00:46:40,130 --> 00:46:43,550 technology, could put the four processors with their first- 897 00:46:43,550 --> 00:46:49,250 and second-level caches, 128 megabytes of memory, 898 00:46:49,250 --> 00:46:54,020 an ASIC-based directory controller, 899 00:46:54,020 --> 00:46:57,470 and a new state-of-the-art interconnection network using 900 00:46:57,470 --> 00:47:03,680 new mesh routing chips all together onto a single board.
901 00:47:03,680 --> 00:47:07,520 Using processors that are available in 1992, 902 00:47:07,520 --> 00:47:09,800 you could obtain a peak performance 903 00:47:09,800 --> 00:47:15,260 of 600 MIPS out of that cluster and about 400 megaflops. 904 00:47:15,260 --> 00:47:17,930 If you took this highly integrated cluster 905 00:47:17,930 --> 00:47:23,210 and placed it into a tightly packed system environment, 906 00:47:23,210 --> 00:47:26,330 you could place a thousand-processor machine, that is, 907 00:47:26,330 --> 00:47:29,600 a 16-by-16 grid of clusters with four processors per cluster, 908 00:47:29,600 --> 00:47:32,417 into four large racks. 909 00:47:32,417 --> 00:47:34,250 With today's processors, that might give you 910 00:47:34,250 --> 00:47:39,670 a peak performance of about 600 GIPS and about 400 gigaflops. 911 00:47:39,670 --> 00:47:42,640 But what's perhaps even more exciting is that if you took 912 00:47:42,640 --> 00:47:45,220 processors which we expect to be available 913 00:47:45,220 --> 00:47:49,240 in the '93/'94 time frame, you could obtain a machine 914 00:47:49,240 --> 00:47:53,780 with about 1.6 teraflops of peak performance. 915 00:47:53,780 --> 00:47:56,360 916 00:47:56,360 --> 00:48:02,610 Let me now make some concluding remarks about the DASH design. 917 00:48:02,610 --> 00:48:04,830 What DASH tries to do is combine some 918 00:48:04,830 --> 00:48:08,400 of the scalability advantages that have been demonstrated 919 00:48:08,400 --> 00:48:11,250 by and associated with message-passing machines 920 00:48:11,250 --> 00:48:14,100 together with the programming advantages 921 00:48:14,100 --> 00:48:17,760 that we know from a single-address-space machine. 922 00:48:17,760 --> 00:48:22,450 It does that with the use of a distributed directory protocol. 923 00:48:22,450 --> 00:48:25,990 We see that DASH looks like the beginning of a convergence 924 00:48:25,990 --> 00:48:28,810 between these two approaches, the message-based approach 925 00:48:28,810 --> 00:48:31,920 and the single-address-space approach, 926 00:48:31,920 --> 00:48:36,320 and hopefully it has some of the advantages of both. 927 00:48:36,320 --> 00:48:40,160 Let me say, though, that a single address space, while it 928 00:48:40,160 --> 00:48:43,070 eases the programming effort, does not 929 00:48:43,070 --> 00:48:47,370 allow the programmer to ignore and forget about locality. 930 00:48:47,370 --> 00:48:50,210 So, to really use these machines, 931 00:48:50,210 --> 00:48:54,650 we'll need new algorithms that have scalable parallelism, that 932 00:48:54,650 --> 00:48:59,390 pay attention to locality, that help us make use of latency-tolerating 933 00:48:59,390 --> 00:49:02,320 support, and that help us minimize 934 00:49:02,320 --> 00:49:05,760 global synchronization. 935 00:49:05,760 --> 00:49:08,490 The DASH project has been a large project for us 936 00:49:08,490 --> 00:49:10,680 by academic standards and wouldn't 937 00:49:10,680 --> 00:49:14,160 have been possible without the help of a lot of people. 938 00:49:14,160 --> 00:49:16,380 My colleagues at Stanford have played a big role 939 00:49:16,380 --> 00:49:18,630 in making this machine work; Professors 940 00:49:18,630 --> 00:49:20,520 Gupta, Horowitz, and Lam have all 941 00:49:20,520 --> 00:49:23,080 made invaluable contributions. 942 00:49:23,080 --> 00:49:24,508 Dave Nakahira, our staff engineer, 943 00:49:24,508 --> 00:49:26,800 has actually helped keep the hardware together and make 944 00:49:26,800 --> 00:49:28,940 it really run.
945 00:49:28,940 --> 00:49:30,680 A large number of PhD students have 946 00:49:30,680 --> 00:49:34,070 contributed to both the hardware and software system. 947 00:49:34,070 --> 00:49:38,430 Dan Lenoski and Jim Laudon were key designers on this project. 948 00:49:38,430 --> 00:49:41,360 And finally, I want to thank Silicon Graphics, without whose 949 00:49:41,360 --> 00:49:45,320 help we could have never built a working prototype of this scale 950 00:49:45,320 --> 00:49:49,600 and demonstrated the viability of these ideas. 951 00:49:49,600 --> 00:49:51,450 Thank you. 952 00:49:51,450 --> 00:51:00,000