[MUSIC PLAYING]

On behalf of the MIT summer session, let me indicate how happy we are to be participating in this edition of the University Video Communications Distinguished Lecture Series. Our speaker tonight is Danny Hillis, a founder and chief guru of Thinking Machines Corporation. He received his PhD from MIT in 1988, but in 1981, he was already becoming interested in the physical limitations of computation and the possibility of constructing highly parallel computers. This work culminated in 1985 with the design of the Connection Machine, which became the foundation of his very successful company. His topic tonight is Connection Machine architectures. Danny--

I'm interested in designing computers for the very largest problems. And I got interested in this kind of computer design originally as a user, because I had some problems that I was studying in artificial intelligence that required much, much faster computers than I could buy -- like 100,000 times faster. And it didn't look like anybody else was going to build them, and so I got interested in the problem of building them. And as it turned out, fortunately, a lot of other people were interested in such computers too. A typical application is -- for example, in designing an airplane, people want to simulate the shape of the airplane wing and the air flowing over it so that they can figure out how the airplane is going to fly before they build it and before they have to test it in a wind tunnel. Or another example is a scientist might use it to model a system that they have no hope of really physically modeling -- for example, the formation of a galaxy or the folding of a protein. And all of these problems require a tremendous amount of computation -- much, much more than we can really do today to do the problems accurately. So it's great for a computer designer, because as soon as you build a faster machine, the users of this kind of problem want a faster one.
And there's really no end in sight for that. They seem to want computers that are certainly thousands of times faster than what we can give them today. Now, another great thing about this set of big users is, because they're solving problems that people haven't solved before, they're generally writing their programs from scratch. And they're very often willing to write their programs in a different way to run them on your machine, which is very important, because it frees up what's one of the biggest constraints for a normal computer architect, which is to run existing programs. So the computers that I'll be talking about today, which are called data parallel computers, use a somewhat different formalism in programming them, some very different primitives, different programming languages, and things like that. But the users are willing to adopt that different formalism, because that's the only way that they can solve their problem. And in fact, as it turns out, it's a very natural formalism for a wide class of problems. So what I'm going to be talking about today is this kind of architecture, which is called a data parallel architecture, of which the Connection Machine is an example. And I will, I hope, convince you that this is the most natural kind of architecture for very large problems. I'm certainly convinced of that. I'm going to start by explaining why conventional Von Neumann machines are a poor fit to very large problems, and why data parallel machines are a much more natural approach. I'm then going to talk about some of the design issues in designing data parallel machines, and specifically, how do you achieve a balance between processing, communications, and I/O, which turns out to be some of the hardest engineering issues. And then finally, I'm going to talk just a little bit about where we stand today and where this technology is likely to lead in the future.
So let me start by talking about Von Neumann machines, which are the kind of machines that we're all used to, and represent almost all of the computers in the world today. The basic idea of a Von Neumann machine is that you have two pieces. One of them is a processor, which operates on data -- does arithmetic, for example -- and the other one is the memory, which holds data and doesn't operate on it. So to operate on data, which is stored in the memory, you move it across, you add two things together, you put it back in the memory, and so on. And typically, you move one word or a few words across at a time. This is the single operation model that's sort of implicit in all of the programming languages that most of you are used to programming in. Now, from an engineering standpoint, the Von Neumann machine was a solution to a technological opportunity -- or a technological problem, depending on how you look at it -- which is that, when those machines were designed, the switching components were very, very fast and very expensive, and the memory components were relatively slow and relatively cheap. So it made sense to build a machine that, on one side of the room, had these big slow things like mercury [INAUDIBLE] and on the other side of the room had these very fast expensive things, like vacuum tubes, or relays, or something like that. And the vacuum tubes switched lots and lots of times for every memory reference, and it worked out very well. In fact, it worked out so well that, as we improved the components in that kind of machine, we kept that same basic design. And so if you look at a modern one-chip computer, you'll see on one side of the chip is the memory, and on the other side of the chip, the switching elements, which are the processors -- only today, it's all made of the same kind of stuff, so it doesn't quite make sense to separate them.
And in fact, there's a tremendous inefficiency in separating them out, which comes from the use of silicon and the activity of the silicon. So if I take a very, very large machine -- remember, those are the ones I'm interested in -- like a Cray or a big [INAUDIBLE], or something like that, or a Connection Machine -- and look at what's inside of it, it's all made of silicon chips. And if I lay out the silicon chips, it probably has something on the order of a square meter of silicon in it. Actually, these days, it's going to be a little bit more -- but something like that. So imagine that I laid out all the silicon and looked at it like that. Now, I would notice that these days, about this much of it is in the processor itself and about this much of it is in the memory system -- mostly in the bits of the memory, and the memory decoding, and things like that. So if I were able to look at this under some kind of imaginary microscope that showed the activity of the chips and showed where useful things were happening, I would see that, when this machine was operating, this part of it was very, very busy most of the time and this part of it would have one memory location being referenced at once. Or maybe, if it was a [INAUDIBLE] pipeline machine, it might have four, or five, or something like that. Well, that's a terribly inefficient design, because, in effect, we're optimizing the activity of this and leaving all this silicon sitting there doing nothing, just holding bits. And in fact, the time that the machine takes is mostly information traveling from here to there and back again, which is what Backus called the Von Neumann bottleneck. The terrible thing about this inefficiency is the bigger that you build a computer, the worse it gets, because the easiest way to build it bigger is for the memory system to become larger, and this proportion of processing to memory gets even more and more out of whack.
And that waiting for data to move back and forth to and from the memory becomes more and more the bottleneck of the machine. So it's tolerable when you build a little machine, which has a closer balance of memory and processing, but it becomes kind of intolerable when you build a big machine for these kinds of problems. Now, obviously, the answer is you want to put in more processing somehow. And the general answer is, somehow, you want to distribute that processing through the memory, and you want to mix up in here processors and memory. And the trick is, how do you do that efficiently? How does it change the programming model, and how do you get it to work? So the Von Neumann machine is basically no good for the big problems, and the potential answer is parallelism -- getting a lot of things to happen at once. And that will get the efficiency of your machine up. Now, just because parallelism is the only answer doesn't mean it's an answer. And it could be that the problems, in fact, don't allow for parallelism. So that's the real question -- is there something fundamental about our problems that requires things to be done one at a time, or can they, in fact, be done in parallel all at once? And if they can be done in parallel, where is the big source of the parallelism? Now, if I take a physical simulation example, it's sort of obvious where the parallelism is -- because, for example, imagine that airplane -- you're flowing air over the wing. Well, when the airplane's flying, the real air is computing that in parallel. The air over here is deciding what to do completely independently of the air over here deciding what to do. And so in some sense, the physical situation is fundamentally parallel, but the question is, is that reflected in the algorithms? And it turns out that people solve that same problem by a variety of different algorithms.
For example, one kind of algorithm -- they do it just like the physics does it -- they model a whole bunch of particles and model the interaction of the particles as they bump into the wing and bump into each other. Those are called either particle algorithms or cellular automata algorithms. Another completely different kind of algorithm is you model the space as a grid of cells, and you model the behavior of each of the cells by a set of equations that says how the air flows through the cell. Another kind of algorithm is you transform that physical problem into a different problem, where you have to solve a system of equations. And you solve those equations, and then you transform the solutions back to the physical problem. So there are a whole lot of different algorithms for solving that same physical problem. It turns out every single one of them has a tremendous amount of parallelism. And the good news is [INAUDIBLE] parallelism that grows with the size of the problem. So the bigger the problem, the more parallelism there is. So that looks good for the side of parallelism. It looks like maybe it's fundamentally there, when you look at the algorithms. But there used to be this thing that would bother people about that argument, which was put most crisply by Gene Amdahl in Amdahl's law. And he made the argument that parallel computers will never really help that much. They'll never get you more than about a factor of 10. And here was the argument. Imagine that this line represents the total amount of computation to be done. Now, imagine that some of it is fundamentally sequential and has to be done a bit at a time, and the rest of it is parallel and can be done all at once. So let's be generous and assume 90% of it is parallel and only 10% of it is sequential.
This would represent how much time it would take on a sequential machine, because we'll do all of it in sequence. But imagine I put 10 processors to work on that same problem. Well, I get to divide this part of the problem by 10, so this becomes a tenth the size. And now the parallel and the sequential parts take about the same time, because the 10 processors don't help me on the sequential part. So I've made the problem a lot faster -- almost 10 times as fast. But now let's say I put 100 processors on the same problem. So I now divide this by 100. Well, if I divide that by 100, then I get something which is almost entirely dominated by the sequential part, so I'm getting diminishing returns now as I put more processors on it. So this is a very compelling argument, [INAUDIBLE] very intuitive, that parallelism -- even if it works on a lot of the problem, as long as there's a sequential part of the problem there, the machines are going to become increasingly inefficient. It used to bother me a lot, but then I started noticing, in my problem, it didn't really seem to come up. In my application, it didn't really seem to come up, and I started noticing, in a lot of other people's applications, it didn't come up. I think a lot of people have noticed this these days -- that, in fact, Amdahl's law doesn't seem to be hurting us. And why not? I believe that the best answer is -- I'm going to give a version of an answer that was published by [INAUDIBLE], and it has to do with scaling. The answer is people are not trying to solve the same problem on the parallel machine that they solved on the serial machine. They don't want to solve the same problem faster. They want to solve a much bigger problem in the same amount of time.
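For concreteness, here is the standard Amdahl's-law arithmetic behind the 90%/10% picture drawn above (an illustrative sketch; the split is the one assumed in the talk, the code is not from it):

```python
# Amdahl's law for a fixed-size problem: a fraction p of the work can be
# done in parallel; the remaining (1 - p) stays sequential.
def amdahl_speedup(p, processors):
    return 1.0 / ((1.0 - p) + p / processors)

p = 0.90  # the generous 90% parallel / 10% sequential split assumed above
for n in (10, 100, 1000):
    print(f"{n:4d} processors -> speedup {amdahl_speedup(p, n):.1f}x")
# 10 -> 5.3x, 100 -> 9.2x, 1000 -> 9.9x: with a fixed 10% sequential part,
# no number of processors ever gets past 10x -- the "factor of 10" ceiling.
```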
If we break up the parallel processing part of the problem -- take something like that airflow -- if we look at an algorithm for airflow, we'll find that for some parts of the airflow problem, the amount of computation we have to do is independent of the size of the problem. So let me call that, using order notation, order 1. It's some constant part of the problem. Then there's a part of the computation that, if I double the linear dimension -- if I go to twice the resolution -- doubles. So if I parameterize the problem in terms of the resolution -- the number of grid points along a dimension, for example -- and I call that n, then there's a part of the problem that grows as order n. Then there's another part of the problem that depends on the boundary conditions and so on that grows as order n squared, and so on. And in fact, there's a big chunk of the problem, which is the relaxation of the interior of the problem, where not only does the number of points grow as n cubed, but the amount of time that you have to relax it grows as another factor of n. So this keeps on going up, and there's a part of the problem that grows as n to the fourth. So we've got a lot of different parts of the problem like this. Well, what happens if somebody comes along and says, I want a problem that's twice as big, I want twice the resolution? Well, this part of the problem doesn't change at all. This part doubles, this quadruples, and so on, and this multiplies by 16. I end up with a much bigger problem, which goes right off the top of the board. But if I renormalize that and draw it again, it looks like almost all of it is the n-to-the-fourth part of the problem. And this other stuff is tiny. And if I double it yet again, I can make that even more extreme.
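A toy version of that scaling argument (the constants below are invented purely for illustration; only the growth rates come from the talk) shows how the high-order, parallelizable work comes to dominate:

```python
# Work splits into terms of different order in the resolution n.
def work_terms(n, c0=1e6, c1=1e4, c2=1e2, c4=1.0):
    return {"O(1)": c0, "O(n)": c1 * n, "O(n^2)": c2 * n**2, "O(n^4)": c4 * n**4}

for n in (32, 64, 128, 256):
    terms = work_terms(n)
    share = terms["O(n^4)"] / sum(terms.values())
    print(f"n = {n:3d}: O(n^4) relaxation is {share:.0%} of the total work")
# Doubling the resolution multiplies the O(n^4) term by 16 while the
# low-order terms barely move, so the parallel fraction keeps climbing.
```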
So it turns out, in practice, when you look at this -- and I'm not certain why this should be true, and I'm certainly not certain it's always true -- but in practice, these higher order terms are always things that can be done in parallel. So in fact, it's not like there's a fixed percentage of the problem that can be done in parallel; in fact, the percentage grows as the problem gets bigger. Since I'm interested in the biggest problems, I can always pick a problem that's big enough where, in fact, that's not a factor. In fact, the customers pick that problem, because the customers will pick a problem where it's big enough that that's not a factor. So as it turns out, Amdahl's law doesn't end up biting us -- I think much to everybody's surprise. I was even surprised, and I'm an optimist. So I guess the lesson to learn from this is that the bigger the problem is, the bigger the opportunity for parallelism. This is one of the differences between designing a machine for the center of the market -- the workstation on your desk -- versus designing a machine for these super applications, because they're very skewed in the types of opportunities for parallelism that you have. The problems that I'm designing for typically have the opportunity for parallelism on the order of hundreds of thousands, or millions, of things happening at once, and very, very little of the problem is in this fundamentally sequential part. OK, so the obvious answer then is I can use lots of processors. VLSI technology lets us stamp out processors relatively inexpensively, and we can interleave them with memory, and we can build a machine that's all mixed up processors and memory. So what are the design issues for designing a machine like that so that it works efficiently? Well, first of all, notice that the big source of the parallelism is in the data. It's because the data is getting bigger.
The size of the problem I keep talking about is really a data size. And so the big opportunity is somehow operating on all of that data at once. And that's also very different from a lot of other kinds of parallel machines. A much more normal way of getting parallelism in a machine is to split up the program into several programs which run at once, each of which operates on its own data, or some shared data, or something like that. But the normal effort involved in making a parallel program is an effort of splitting up the program and looking for control parallelism. And basically, I think that that's a perfectly good kind of effort, and I think it makes sense, and it probably gets you factors of eight or something like that, but it won't get you factors of 100,000, because you can't break up a 10,000-line program into 100,000 things running all at once that are all different. In fact, the data is growing much, much faster than the program. The programs are pretty much fixed in length. So the opportunity for the parallelism is, in fact, in the loops of that program. And so, for example, take the airplane wing example, the airflow example. A sequential computer has a big loop that calculates what the air does here, then calculates what the air does here, then calculates what the air does here, and so on, and so on. And after it finishes that whole line, it starts the next line, and the next line, and the next line, and so on -- whereas the physical fluid calculates all of that simultaneously. So the big opportunity is in calculating all of that simultaneously. So I'm not so much interested in programs that can be split so that you can run several programs at once. I'm interested in really just running one program, but what I want to do in that one program is take out the loop and do all of the iterations of the loop at once.
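Schematically, the shift looks like this (a NumPy sketch on an ordinary serial machine, just to show the shape of the idea; it is not the Connection Machine's own programming language):

```python
import numpy as np

grid = np.random.rand(512, 512)        # one field value per grid point

# Sequential style: a big loop that visits each interior point in turn.
new = grid.copy()
for i in range(1, 511):
    for j in range(1, 511):
        new[i, j] = 0.25 * (grid[i - 1, j] + grid[i + 1, j] +
                            grid[i, j - 1] + grid[i, j + 1])

# Data-parallel style: the loop is gone -- the same update is written as
# one operation over the whole array, conceptually every point at once.
new2 = grid.copy()
new2[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                           grid[1:-1, :-2] + grid[1:-1, 2:])

assert np.allclose(new, new2)
```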
That turns out to have some great advantages for programming, but I'm going to concentrate mostly on the hardware right now. It also has some nice advantages for hardware. So when I do that, that's what I call a data parallel machine. It's a little bit of a different issue than the one you hear people talk about -- single instruction, multiple data, versus multiple instruction, multiple data. It's a related issue, but it's a little bit different, because really, SIMD versus MIMD is kind of an implementation issue: once you've picked a programming model, you figure out what's the cheapest way to implement that model. This is really an issue of what programming model you want to support. And in fact, then you have a choice of whether you want to put some of the control down in the processors or keep it centralized. And that's a kind of engineering issue that just depends on the relative cost of wires and silicon, and it probably changes from year to year. I think a lot of people expect me to defend SIMD machines just because I designed machines that way. I think that was the right way to design machines in 1990, and it maybe won't be in 1995, and maybe it will be again in 2000. It's a small issue. That's not really what I believe in. What I believe in is presenting this data parallel model to the user, and doing that in the most efficient way. So how do you do that? Well, the most obvious issue is arithmetic and processing. And that also turns out to be about the easiest problem to solve. So, for example, the CM2 Connection Machine has two kinds of arithmetic. It has 64,000 bit-serial arithmetic units that are distributed through the memory, and it has 2,000 floating point units, which operate on 64-bit words and are also distributed through the memory. And it has some 32-bit data paths for doing indirect addressing and referencing data.
So it's a mixture of different kinds of arithmetic units that are sprinkled through the memory, evenly distributed to prevent moving data around any more than you have to. And of course, the big issue is not so much building all these [INAUDIBLE], but building a memory system to support them that has sufficient memory bandwidth to keep them busy. Of course, that works out very naturally in a parallel machine, because you put each one of these ALUs right next to its own little piece of memory, and so you can read out 64,000 memory bits in parallel and get a tremendous amount of memory bandwidth to keep all these ALUs busy. So it turns out that the peak floating point performance -- it's a good way of measuring it -- of the machine is on the order of something like 20 gigaflops, which is 20,000 megaflops -- millions of floating point operations per second. So that's fast, but that's not really a very relevant number, because who cares what the megaflop number is? It turns out that, if I look at this problem, only about a third of the problem is, in fact, doing that kind of operation. What is the rest of the time spent doing? Well, the rest of the time is spent doing two other major things. One of them is moving data around in communications, and the other one is input/output -- things like writing to disks. So I guess one of the points that I'd like to make is that, if you're going to achieve the kind of speedup that the revised Amdahl's law promises, you have to do all of the things that you can do in parallel in parallel. So if I only did the floating point operations in parallel, then I'm going to get killed by Amdahl's law again, because I can only speed it up by 30%, no matter how fast I can do those.
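The same Amdahl arithmetic as before makes that concrete (a sketch taking "about a third" literally, not figures from the talk):

```python
# If only the floating-point third of the run time is parallelized, the
# total speedup is capped no matter how fast the arithmetic becomes.
fp_fraction = 1.0 / 3.0
bound = 1.0 / (1.0 - fp_fraction)   # arithmetic time driven to zero
print(f"best possible overall speedup: {bound:.1f}x")   # about 1.5x
# Hence communication and I/O -- the other two thirds -- have to be done
# in parallel as well.
```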
And in fact, one of the critical things that you have to do, which is much harder than the floating point, is the data motion -- because it's all very nice for the hardware designer to say, I'm going to divide up the memory into lots of little memories, each of which talks to its own processor, but unfortunately, the algorithm doesn't work that way. In the algorithm, data has to move all over -- data here has to be added to data here, and so on. So it goes in arbitrary patterns. Now, sometimes those patterns have some locality and some regularity. So, for example, if I solve the fluid flow problem over the airplane by dividing up the space into a nice, regular three-dimensional grid, then that would work very nicely and neatly if I connected up my processors into a nice three-dimensional grid -- as long as it was the same shape as whatever I wanted to solve for that particular airplane. In fact, there's a lot of parallel machines that work that way, and they work very well on that class of problems. The problem is that most of the algorithms are not that neat. And in fact, if I take the implicit algorithms, where basically I'm working with systems of equations, those have communications patterns that are kind of all their own and don't map directly into this three-dimensional space. Or, for example, I might use an algorithm called the free Lagrange method, where I model each chunk of stuff as it flows around. In effect, I'm re-drawing that grid at every time cycle, depending on where the stuff that's flowing is. And there, I really can't even decide beforehand, until I run the problem, what my communications pattern is going to be. So I need a way of allowing the processors to communicate arbitrarily. And that turns out to be one of the most difficult technical problems in the design of the machine. And there are a lot of different ways of doing it, but typically, what you do is you build a physical interconnection network.
On the CM2, the physical interconnection network is that every 16 processors are connected by a crossbar switch so they can all talk to each other. So that's a reasonable number. And then every one of those clusters of 16 processors is on the corner of a hypercube, and they all talk to each other. So that means they each have neighbors in hyperspace. It's a particular way of wiring them. I won't stop and explain what a hypercube is. It doesn't matter. The point is the programmer never really knows what a hypercube is. And I might design it differently next year, and you shouldn't care -- because basically, the hardware engineer comes up with some neat wiring pattern that they've convinced themselves is the best in the world, and then the programmer wants to see a model of totally arbitrary communications. So the programmer just wants to say, this processor, get that memory location. How do you do that? Well, in the Connection Machine, the way that you do that is, on top of that physical pattern, you have a bunch of autonomous switches that route messages through that network. And there are different ways of doing that. One way is circuit switching, kind of like the telephone system works, where you call up and you establish a connection between two things that want to talk to each other. Another one is packet switching, like local area networks work, where you put together a bunch of bits with an address on them and steer them to the right place. Well, it turns out that the Connection Machine is sort of a combination of those two methods. It actually sends out little packets with addresses on them, but it starts sending the beginning of them before it's got the end of them. So they sort of thread through the network like little snakes. And in fact, sometimes the beginning of one is actually being delivered before the end of it has even been sent yet, so they're sort of snaking through. That method of routing, which is used in the Connection Machine and several other machines these days, is called wormhole routing.
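The talk doesn't give formulas, but the standard first-order latency model for the two disciplines (with made-up numbers) shows why letting the head of a message advance before its tail has arrived pays off once messages cross many switches:

```python
# Rough textbook latency models -- illustrative only, not CM-2 figures.
def store_and_forward(hops, msg_bits, bits_per_sec):
    # each switch waits for the whole message before passing it on
    return hops * msg_bits / bits_per_sec

def wormhole(hops, msg_bits, header_bits, bits_per_sec):
    # only the small header pays the per-hop cost; the body streams behind it
    return hops * header_bits / bits_per_sec + msg_bits / bits_per_sec

hops, msg, header, rate = 12, 4096, 32, 1e6
print(store_and_forward(hops, msg, rate))      # ~0.049 s
print(wormhole(hops, msg, header, rate))       # ~0.0045 s
```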
It's a very neat solution for hardware designers today. Again, it's one of these things that probably won't matter a few years from now, as you get better ways of building switching networks. It may be that a few years from now, for instance, we'll build things with optical technology, and the messages will just jump from one part of the machine to another. But the trick is to be able to put in things like that and provide the user with a level of architectural abstraction so that they don't need to know about the details. So the programmer of the Connection Machine doesn't know anything about wormhole routing, or packet switching, or single-bit ALUs and 64-bit ALUs, or even really the number of processors that are on the machine. There are levels of abstraction built into the machine, the equivalent of things like virtual memory, which provide the programmer with a much clearer model. And part of the trick of being a computer architect is to be able to put in all these kinds of messy things that take advantage of what technology you can buy in a given year without exposing that messiness at a level that the programmer sees -- to be able to take advantage of the quirks of technology without messing up the programmer's model. So I think that the data parallel programming model and the data parallel abstraction in the machine is something that transcends the various implementations of the machine. So I don't think you should get too hung up on the particular details of how the CM2 is implemented. OK, I've talked about arithmetic, memory bandwidth, I/O -- no, I haven't talked about I/O. I/O is probably the most neglected part of this equation, because fundamentally, most computer designers kind of think of it as not their problem. It's somebody else's.
The problem is that the users are not content just to sit there and compute. They want to write down the answer someplace. They want to read in things to compute on. And that means that the machine has to stop and talk to some external device. And typically, computer designers treat that as kind of an afterthought, because they don't know exactly what the device is going to be, and it's very application-dependent, and so on. The problem is, in this kind of case, where the I/O is actually increasing with the size of the problem, that attitude would kill you. So you really have to think of I/O as an integral part of the machine. So just as you distribute processing through there, and just as you distribute communications -- that wormhole routing scheme is a way of effectively distributing the communications function through the machine -- you have to distribute I/O. So I/O has to come out of that machine -- out of that square meter of silicon, all over it -- in order to get it out fast enough. The problem isn't just getting out of the memory fast enough or into the memory fast enough. That's relatively easy, because you can just use a lot of wires. And in fact, the Connection Machine -- for example, the CM2 -- has eight I/O channels of 100 megabytes each, which sounds like a lot of I/O. But what do you plug it into? And I would love to just say, well, that's somebody else's problem, but in fact, when you start selling these machines, it becomes your problem, because the customer needs to get it done fast. So for that reason, we started solving the problem of building our own peripheral devices -- specifically, disk arrays. And so in the Connection Machine, for example, on every one of these I/O channels, you use a parallel disk array, which achieves speed by doing a lot of slow things in parallel, just like the machine does.
So you put 40 disk drives in parallel, and all of that goes on one I/O channel. And when you write a 32-bit word, you write the 32 data bits each to its own disk drive, and you write eight check bits -- you compute an error-correcting code and write eight check bits to eight other drives. Actually, you only need seven check bits; the other one's a spare. The advantage of those check bits is that, when a drive crashes -- which it inevitably does -- you still have all of the data, because, having written those check bits, you can reconstruct the entire data even though you're missing any one of the data bits. And the advantage of it is you get a tremendously fast transfer [INAUDIBLE]. When we designed that, I thought, well, that'll be plenty to keep everybody happy -- 40 disk drives at once -- but in fact, very quickly, the customers said, we want faster. So in fact, the CM2 today supports, in addition, an idea of disk striping. So in fact, you can run eight of these data vaults all at once. And so you're using 320 disk drives that you're writing out to at once. And it's very clear, in another year or two, they're not going to be happy with that either. So the I/O problem becomes part of the problem that you have to solve, whether you want to or not. And it becomes one of the biggest bottleneck problems in designing a very large-scale computer. And I'm just touching on the bandwidth issues of it. There are a lot of other issues as to how it fits into the operating system, and protection, and error tolerance, and things like that.
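A simplified sketch of why that check information saves you (this uses a single parity drive rather than the seven-bit error-correcting code described above, but the reconstruction idea is the same once you know which drive has died):

```python
import random

# One bit per data drive, plus one parity drive (a simplification of the
# 32-data-drive, 7-check-drive arrangement in the talk).
data_bits = [random.randint(0, 1) for _ in range(32)]
parity = 0
for b in data_bits:
    parity ^= b            # parity drive holds the XOR of all data bits

failed = 13                # say drive 13 crashes
survivors = data_bits[:failed] + data_bits[failed + 1:]

recovered = parity
for b in survivors:
    recovered ^= b         # XOR of parity and the surviving drives
assert recovered == data_bits[failed]
```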
758 00:31:57,690 --> 00:32:01,250 So that means that the control portions of the program 759 00:32:01,250 --> 00:32:03,080 are really down in the sequential part. 760 00:32:03,080 --> 00:32:05,470 They're not in this parallel part, 761 00:32:05,470 --> 00:32:07,392 which means that there's really no need 762 00:32:07,392 --> 00:32:09,850 to be able to handle those in parallel-- for the most part. 763 00:32:09,850 --> 00:32:13,200 It helps a little bit [INAUDIBLE] in places, 764 00:32:13,200 --> 00:32:15,770 but for the most part, it's not like the control explodes, 765 00:32:15,770 --> 00:32:17,812 because the control is basically dependent on how 766 00:32:17,812 --> 00:32:18,920 long your program is. 767 00:32:18,920 --> 00:32:21,045 And the program doesn't get longer just because you 768 00:32:21,045 --> 00:32:22,890 had more data to operate on. 769 00:32:22,890 --> 00:32:26,660 So that allows the hardware designer 770 00:32:26,660 --> 00:32:29,210 to get some efficiencies by taking advantage of that shared 771 00:32:29,210 --> 00:32:30,930 control. 772 00:32:30,930 --> 00:32:35,270 So the reason that I think that SIMD machines are 773 00:32:35,270 --> 00:32:38,760 the most efficient machines in 1989, 774 00:32:38,760 --> 00:32:42,170 1990 time frame is because the cost 775 00:32:42,170 --> 00:32:48,110 of distributing that control and putting it out is too high. 776 00:32:48,110 --> 00:32:50,090 It becomes too high a cost. 777 00:32:50,090 --> 00:32:52,700 And it's much cheaper to do that control in one place 778 00:32:52,700 --> 00:32:56,730 and distribute out the information to the places 779 00:32:56,730 --> 00:32:58,590 where it's needed, because you really 780 00:32:58,590 --> 00:33:01,252 are only running one program. 781 00:33:01,252 --> 00:33:03,210 Now, that doesn't mean that that's the only way 782 00:33:03,210 --> 00:33:05,972 to support the data parallel programming model, though. 783 00:33:05,972 --> 00:33:08,430 And if, in the future, [? at ?] [? some ?] [? times, ?] you 784 00:33:08,430 --> 00:33:13,620 begin to get more and more of the costs of the machine is 785 00:33:13,620 --> 00:33:16,440 in the communications, rather than in the silicon-- 786 00:33:16,440 --> 00:33:18,955 because silicon's getting cheaper every year, 787 00:33:18,955 --> 00:33:21,330 so it may be after a while that the piece of wire to send 788 00:33:21,330 --> 00:33:23,955 the data is much more expensive than just putting a whole other 789 00:33:23,955 --> 00:33:25,710 computer at the other end of the wire. 790 00:33:25,710 --> 00:33:27,770 At that point, I think it'll make sense 791 00:33:27,770 --> 00:33:29,520 to implement the data parallel programming 792 00:33:29,520 --> 00:33:31,770 model by putting more control out at the [INAUDIBLE].. 793 00:33:31,770 --> 00:33:34,467 And in fact, the CM2 adopts a kind of compromised approach. 794 00:33:34,467 --> 00:33:36,300 It actually has four streams in instructions 795 00:33:36,300 --> 00:33:38,030 to go through that. 796 00:33:38,030 --> 00:33:40,342 So I think that's really the SIMD/MIMD issue. 797 00:33:40,342 --> 00:33:42,050 I don't think it's a philosophical issue. 798 00:33:42,050 --> 00:33:44,260 And I think, in any case, even if you had all this control 799 00:33:44,260 --> 00:33:46,643 out there, you really still want to present to the user 800 00:33:46,643 --> 00:33:48,310 the model of a single thread of control, 801 00:33:48,310 --> 00:33:51,070 because their parallelism is, in fact, not in the control. 
802 00:33:51,070 --> 00:33:54,600 Their parallelism is in the data. 803 00:33:54,600 --> 00:33:56,190 So that's the control issue. 804 00:33:56,190 --> 00:33:58,550 So my conclusion from all of that 805 00:33:58,550 --> 00:34:02,600 is, to achieve a balanced design, what you need to do 806 00:34:02,600 --> 00:34:06,020 is make sure that all of the parts that 807 00:34:06,020 --> 00:34:09,540 can be done in parallel are done in parallel in the hardware. 808 00:34:09,540 --> 00:34:13,909 That means, specifically, you have to do processing, 809 00:34:13,909 --> 00:34:15,800 you have to do communication, and you 810 00:34:15,800 --> 00:34:18,030 have to do I/O in parallel. 811 00:34:18,030 --> 00:34:20,030 And control you get to do whatever way you think 812 00:34:20,030 --> 00:34:22,530 is cheapest. 813 00:34:22,530 --> 00:34:26,080 It's just a matter of efficiency. 814 00:34:26,080 --> 00:34:29,560 So all of that sort of theories about designing 815 00:34:29,560 --> 00:34:32,560 this kind of machine-- how well does it actually 816 00:34:32,560 --> 00:34:35,100 work out in practice? 817 00:34:35,100 --> 00:34:39,600 Well, for many problems today, the Connection Machine 818 00:34:39,600 --> 00:34:42,088 is the fastest machine in the world. 819 00:34:42,088 --> 00:34:44,130 And it depends on if the problem is of this sort. 820 00:34:44,130 --> 00:34:45,300 There are also many problems for which 821 00:34:45,300 --> 00:34:46,758 it's not the fastest in the world-- 822 00:34:46,758 --> 00:34:49,679 probably many problems for which a [? Cray ?] is the fastest. 823 00:34:49,679 --> 00:34:52,317 But an interesting question is, on the fastest problems 824 00:34:52,317 --> 00:34:54,150 and the biggest problems, what's the fastest 825 00:34:54,150 --> 00:34:55,030 machine in the world? 826 00:34:55,030 --> 00:34:56,969 And the answer is the Connection Machine. 827 00:34:56,969 --> 00:35:01,260 Gordon Bell was kind enough to hold a contest where 828 00:35:01,260 --> 00:35:03,990 what you did is you clocked machines 829 00:35:03,990 --> 00:35:07,110 on doing any real problem, a real application, 830 00:35:07,110 --> 00:35:10,350 to see how fast they were-- how many floating point operations 831 00:35:10,350 --> 00:35:11,640 per second. 832 00:35:11,640 --> 00:35:14,880 And the Connection Machine won that-- 833 00:35:14,880 --> 00:35:17,850 the prize this year by a large margin. 834 00:35:17,850 --> 00:35:21,480 And in fact, another Connection Machine program came in second 835 00:35:21,480 --> 00:35:23,440 and won the cost performance prize. 836 00:35:23,440 --> 00:35:26,265 So it's clear that, for the biggest problems and the most 837 00:35:26,265 --> 00:35:28,140 data-intensive problems, this kind of machine 838 00:35:28,140 --> 00:35:30,770 is the fastest right now. 839 00:35:30,770 --> 00:35:32,540 So now, what about for the future? 840 00:35:32,540 --> 00:35:35,050 How is it going to compare to the competition, which 841 00:35:35,050 --> 00:35:38,970 is, by the way, vector machines, like Crays? 842 00:35:38,970 --> 00:35:43,440 Well, if you look at the basic technologies that 843 00:35:43,440 --> 00:35:45,330 are in the Connection Machine, they're 844 00:35:45,330 --> 00:35:47,670 basically technologies that are improving 845 00:35:47,670 --> 00:35:50,910 in cost performance by a factor of two or three 846 00:35:50,910 --> 00:35:52,890 every two years. 
847 00:35:52,890 --> 00:35:54,390 If you look at the technologies that 848 00:35:54,390 --> 00:35:59,130 are in the vector machines, which are things like bipolar 849 00:35:59,130 --> 00:36:02,130 integrated circuits, high-performance cooling 850 00:36:02,130 --> 00:36:05,850 technologies, their improvement in cost performance, 851 00:36:05,850 --> 00:36:08,490 while substantial, is more like a factor of two 852 00:36:08,490 --> 00:36:09,890 every two years. 853 00:36:09,890 --> 00:36:11,460 So it's not as fast. 854 00:36:11,460 --> 00:36:14,970 So I'm convinced that not only are these data parallel 855 00:36:14,970 --> 00:36:17,910 machines fastest today, but as we look out in the future, 856 00:36:17,910 --> 00:36:21,540 they become more and more the fastest machines. 857 00:36:21,540 --> 00:36:25,620 So somebody who has a pile of money, like $30 million or $100 858 00:36:25,620 --> 00:36:28,740 million-- make it up-- will always do much better 859 00:36:28,740 --> 00:36:32,250 if they're solving a big problem to spend that on a data 860 00:36:32,250 --> 00:36:36,810 parallel machine because of this increasing spread in the cost 861 00:36:36,810 --> 00:36:38,530 performance of the components. 862 00:36:38,530 --> 00:36:40,170 So I think the future of these machines 863 00:36:40,170 --> 00:36:43,290 looks extremely bright. 864 00:36:43,290 --> 00:36:45,840 And in fact, I believe that, over the next few years, 865 00:36:45,840 --> 00:36:47,480 we'll see machines of this sort-- 866 00:36:47,480 --> 00:36:51,660 commercial machines of the sort that are performing regularly 867 00:36:51,660 --> 00:36:53,730 in the 100 gigaflop range. 868 00:36:53,730 --> 00:36:56,760 Today they perform in the 5 gigaflop range, 869 00:36:56,760 --> 00:36:59,470 maybe getting up to the 10 gigaflop range. 870 00:36:59,470 --> 00:37:02,280 But I think that we'll see them, on good problems, performing at 100 871 00:37:02,280 --> 00:37:04,080 gigaflops in a few years-- 872 00:37:04,080 --> 00:37:06,810 and by the mid-1990s, I think that we'll 873 00:37:06,810 --> 00:37:09,320 be talking teraflops. 874 00:37:09,320 --> 00:37:14,030 And that's trillions of floating point operations per second. 875 00:37:14,030 --> 00:37:16,660 876 00:37:16,660 --> 00:37:18,910 That's a factor of 1,000 over what 877 00:37:18,910 --> 00:37:22,610 you do on a conventional supercomputer today. 878 00:37:22,610 --> 00:37:26,930 And so I don't think we yet comprehend how to use it, 879 00:37:26,930 --> 00:37:29,780 but a factor of 1,000-- on one level it sounds like an awful 880 00:37:29,780 --> 00:37:33,830 lot, but then, if I've got one of these problems that grows 881 00:37:33,830 --> 00:37:37,160 like n cubed, that's really only less than-- 882 00:37:37,160 --> 00:37:40,050 that's only a factor of 10 for a problem like that. 883 00:37:40,050 --> 00:37:43,310 And if it [INAUDIBLE] only a factor of about five 884 00:37:43,310 --> 00:37:44,610 or something like that. 885 00:37:44,610 --> 00:37:47,720 So if I want five times the resolution-- 886 00:37:47,720 --> 00:37:49,250 I really want to look at turbulence-- 887 00:37:49,250 --> 00:37:51,980 I can use up that factor of 1,000 very fast.
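The arithmetic behind that remark, written out: if the work in a simulation grows like the cube of the linear resolution, a 1,000-fold faster machine only buys a factor of ten in resolution. The fourth-power case below is an assumption on my part -- a three-dimensional grid whose time step also has to shrink with the resolution -- and seems to be what the inaudible aside about "a factor of about five" refers to; the talk itself only states the n-cubed case.

speedup = 1000.0

cost_models = [
    (3, "work grows like n**3 (3-D grid)"),
    (4, "work grows like n**4 (3-D grid plus a finer time step)"),   # assumed model
]

for exponent, description in cost_models:
    resolution_gain = speedup ** (1.0 / exponent)
    print(f"{description}: about {resolution_gain:.1f}x more resolution")

# Prints roughly:
#   work grows like n**3 (3-D grid): about 10.0x more resolution
#   work grows like n**4 (3-D grid plus a finer time step): about 5.6x more resolution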
888 00:37:51,980 --> 00:37:55,160 Or if I want a more accurate model of how a molecule folds, 889 00:37:55,160 --> 00:37:56,630 or if I want to really understand 890 00:37:56,630 --> 00:37:59,150 how that spiral arm of the galaxy forms, 891 00:37:59,150 --> 00:38:00,530 there are lots and lots of problems 892 00:38:00,530 --> 00:38:02,170 that use that up very quickly. 893 00:38:02,170 --> 00:38:04,220 And in fact, we're beginning to get hints 894 00:38:04,220 --> 00:38:07,460 that we'll be able to begin approaching some 895 00:38:07,460 --> 00:38:09,800 of the problems in artificial intelligence 896 00:38:09,800 --> 00:38:13,975 that I originally was interested in quite a few years ago, 897 00:38:13,975 --> 00:38:15,350 when I started getting interested 898 00:38:15,350 --> 00:38:16,820 in this type of machine. 899 00:38:16,820 --> 00:38:19,580 So we're beginning to see the light at the end of the tunnel, 900 00:38:19,580 --> 00:38:22,580 that we can get machines that are really fast enough to approach 901 00:38:22,580 --> 00:38:24,300 these problems. 902 00:38:24,300 --> 00:38:27,290 So it looks like it's going to be an interesting decade. 903 00:38:27,290 --> 00:38:30,460 So let me summarize. 904 00:38:30,460 --> 00:38:32,745 The conventional Von Neumann machine, 905 00:38:32,745 --> 00:38:34,870 the word-at-a-time machine, 906 00:38:34,870 --> 00:38:37,370 is fundamentally inefficient for very, 907 00:38:37,370 --> 00:38:40,600 very large-scale problems, and it becomes more inefficient 908 00:38:40,600 --> 00:38:45,350 as the problem grows larger because of this Von Neumann 909 00:38:45,350 --> 00:38:47,850 bottleneck. 910 00:38:47,850 --> 00:38:53,410 The natural solution is to distribute processing 911 00:38:53,410 --> 00:38:54,870 through the machine. 912 00:38:54,870 --> 00:38:57,810 In order to maintain a balanced design, 913 00:38:57,810 --> 00:38:59,730 we have to distribute not only processing, 914 00:38:59,730 --> 00:39:05,160 but we have to distribute I/O and communications so that we 915 00:39:05,160 --> 00:39:07,800 can do all of it in parallel. 916 00:39:07,800 --> 00:39:13,230 If we do that, then, for the biggest problems, 917 00:39:13,230 --> 00:39:16,450 we'll be able to do arbitrary amounts of it in parallel. 918 00:39:16,450 --> 00:39:19,920 So we'll be able to get very, very good speedups. 919 00:39:19,920 --> 00:39:22,150 In fact, we can always pick a version of the problem 920 00:39:22,150 --> 00:39:23,870 big enough so that we have 921 00:39:23,870 --> 00:39:27,170 enough parallelism to keep the machine busy. 922 00:39:27,170 --> 00:39:29,720 923 00:39:29,720 --> 00:39:33,350 And finally, the result of all of this 924 00:39:33,350 --> 00:39:35,660 is that data parallel machines today 925 00:39:35,660 --> 00:39:38,090 are, in fact, the fastest computers in the world 926 00:39:38,090 --> 00:39:39,860 for a class of problems. 927 00:39:39,860 --> 00:39:42,740 And it looks like that's becoming more and more 928 00:39:42,740 --> 00:39:44,880 true in the future. 929 00:39:44,880 --> 00:39:48,660 So in conclusion, I'd like to say 930 00:39:48,660 --> 00:39:52,290 that I think this is the natural solution to very 931 00:39:52,290 --> 00:39:57,000 large problems, and I hope I've convinced you all that it is, too. 932 00:39:57,000 --> 00:39:58,740 Thank you very much. 933 00:39:58,740 --> 00:40:02,250 I'm Frank Morris from Pennie and Edmonds, New York City.
934 00:40:02,250 --> 00:40:04,560 Can you explain to us what advances 935 00:40:04,560 --> 00:40:07,350 you foresee in the next few years that will enable you 936 00:40:07,350 --> 00:40:10,290 to increase the performance of the Connection Machine 937 00:40:10,290 --> 00:40:14,910 from about 10 gigaflops to a teraflop? 938 00:40:14,910 --> 00:40:15,780 Yeah. 939 00:40:15,780 --> 00:40:18,300 I'll speak in very general terms, 940 00:40:18,300 --> 00:40:20,460 because obviously, we're getting into what 941 00:40:20,460 --> 00:40:23,400 we're developing in some of our next products here. 942 00:40:23,400 --> 00:40:26,250 But fundamentally, there are three sources 943 00:40:26,250 --> 00:40:28,980 of speed improvement in the machine. 944 00:40:28,980 --> 00:40:33,510 One of them is that the basic technology in the machine 945 00:40:33,510 --> 00:40:34,270 is improving. 946 00:40:34,270 --> 00:40:35,970 So that's that factor of two or three 947 00:40:35,970 --> 00:40:37,230 every two years I was talking about. 948 00:40:37,230 --> 00:40:40,643 As we fit more and more onto a VLSI chip, 949 00:40:40,643 --> 00:40:42,060 we can get things closer together, 950 00:40:42,060 --> 00:40:43,380 actually operate them faster. 951 00:40:43,380 --> 00:40:44,880 Clock speeds become faster. 952 00:40:44,880 --> 00:40:47,080 Data paths become wider for the same cost. 953 00:40:47,080 --> 00:40:48,990 So there are some things that, in a sense, 954 00:40:48,990 --> 00:40:51,900 come to you for free just from the technology. 955 00:40:51,900 --> 00:40:54,090 Another big source of improvement 956 00:40:54,090 --> 00:40:58,140 is that we understand much more of what we're doing now. 957 00:40:58,140 --> 00:41:01,440 So we now have a lot of machines out in the field-- 958 00:41:01,440 --> 00:41:04,620 probably about 10% or 20% of the supercomputers 959 00:41:04,620 --> 00:41:06,960 that are out there right now are Connection Machines, 960 00:41:06,960 --> 00:41:09,270 and so we're beginning to get a lot of experience 961 00:41:09,270 --> 00:41:11,700 in how users actually use the machine 962 00:41:11,700 --> 00:41:15,450 and how the machine works in certain cases, where 963 00:41:15,450 --> 00:41:18,340 it doesn't work, where the time is being spent, and so on. 964 00:41:18,340 --> 00:41:21,030 So another big source is optimizing the machine, 965 00:41:21,030 --> 00:41:22,740 because we now know what to optimize for. 966 00:41:22,740 --> 00:41:25,198 And we were kind of shooting in the dark for the first time 967 00:41:25,198 --> 00:41:26,730 around. 968 00:41:26,730 --> 00:41:29,310 There's probably almost an order of magnitude 969 00:41:29,310 --> 00:41:31,270 to be gained by that. 970 00:41:31,270 --> 00:41:34,770 And then the other source of improvement in speed 971 00:41:34,770 --> 00:41:36,630 is just plain building the machines bigger. 972 00:41:36,630 --> 00:41:42,600 And when we first built the Connection Machine, $5 million 973 00:41:42,600 --> 00:41:45,893 seemed like a lot to solve one of these problems, 974 00:41:45,893 --> 00:41:47,310 but in fact, these days people are 975 00:41:47,310 --> 00:41:49,680 used to spending as much as $20 or $30 million 976 00:41:49,680 --> 00:41:52,230 for a very, very large computer. 977 00:41:52,230 --> 00:41:54,730 And so you can just plain old build bigger machines, 978 00:41:54,730 --> 00:41:56,590 and that's another source of the power.
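Those three sources compound multiplicatively, which is what makes the teraflop projection plausible. The specific numbers below are rough stand-ins I have filled in from the figures mentioned in the answer, not figures given in the talk itself.

baseline_gflops = 10.0        # roughly where the talk places the CM2 today

technology    = 2.5 ** 3      # two to three times every two years, over about six years
optimization  = 8.0           # "almost an order of magnitude" from knowing what to tune
bigger_budget = 5.0           # $5 million machines giving way to $20-30 million machines

projected_gflops = baseline_gflops * technology * optimization * bigger_budget
print(f"projected: about {projected_gflops / 1000:.0f} teraflops")   # comfortably past 1 teraflop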
979 00:41:56,590 --> 00:41:58,810 So if you multiply all of those together, 980 00:41:58,810 --> 00:42:01,530 it really is very plausible to get to that teraflops number 981 00:42:01,530 --> 00:42:03,100 in the mid-'90s. 982 00:42:03,100 --> 00:42:07,480 How much bigger do you see the machines becoming? 983 00:42:07,480 --> 00:42:11,740 I believe that, with the way that we understand designing 984 00:42:11,740 --> 00:42:16,570 machines now, we can see the next factor of 100 985 00:42:16,570 --> 00:42:18,070 after the teraflop. 986 00:42:18,070 --> 00:42:19,810 So the designs change. 987 00:42:19,810 --> 00:42:23,050 The way that you build the communication networks change. 988 00:42:23,050 --> 00:42:25,420 The way that you build the processing elements change. 989 00:42:25,420 --> 00:42:27,753 But fundamentally, I think the same kind of architecture 990 00:42:27,753 --> 00:42:30,070 and the same kind of model presented to the user 991 00:42:30,070 --> 00:42:34,270 will follow through for that kind of speed improvement. 992 00:42:34,270 --> 00:42:37,160 We're talking over the next decade, really, for that. 993 00:42:37,160 --> 00:42:40,030 After that, I think even that model begins to fall apart, 994 00:42:40,030 --> 00:42:42,580 and it needs something else to come along-- some other way 995 00:42:42,580 --> 00:42:44,050 of thinking about the problem. 996 00:42:44,050 --> 00:42:46,210 Users will have to change all of their programs 997 00:42:46,210 --> 00:42:48,850 to get beyond the-- 998 00:42:48,850 --> 00:42:51,760 I don't even know what's after a teraflop, 999 00:42:51,760 --> 00:42:53,590 but I'm not going to worry about it. 1000 00:42:53,590 --> 00:42:57,296 I'm worrying about up to about 100. 1001 00:42:57,296 --> 00:43:01,830 I'm [INAUDIBLE] Howard University in Washington DC. 1002 00:43:01,830 --> 00:43:04,690 Thanks for your presentation, especially as 1003 00:43:04,690 --> 00:43:08,170 you addressed the problem of time saving 1004 00:43:08,170 --> 00:43:11,200 for large computation problems. 1005 00:43:11,200 --> 00:43:16,330 For some of the activities in numerical analysis, 1006 00:43:16,330 --> 00:43:19,210 we spend a lot of time in I/Os. 1007 00:43:19,210 --> 00:43:22,960 And I'm looking forward to hear from you what your thoughts are 1008 00:43:22,960 --> 00:43:26,700 in terms of making the I/O devices more intelligent. 1009 00:43:26,700 --> 00:43:28,450 Certainly, people have been talking about, 1010 00:43:28,450 --> 00:43:32,020 for years, putting processors down into the I/O devices, 1011 00:43:32,020 --> 00:43:34,840 in the same sense that, in a data parallel machine, 1012 00:43:34,840 --> 00:43:37,030 you put the processors into the memory. 1013 00:43:37,030 --> 00:43:39,580 I question in a parallel machine how much is really 1014 00:43:39,580 --> 00:43:42,280 to be gained from that, because the limitation is really 1015 00:43:42,280 --> 00:43:45,700 not the data motion between the I/O device 1016 00:43:45,700 --> 00:43:48,190 and the memory of the machine in a parallel machine. 1017 00:43:48,190 --> 00:43:51,140 It's easy to build I/O channels to keep up with that capacity. 1018 00:43:51,140 --> 00:43:54,520 The limitation really is much more in the physical storage 1019 00:43:54,520 --> 00:43:56,570 medium itself of the I/O device. 
1020 00:43:56,570 --> 00:44:00,600 So today, even the very, very fastest, 1021 00:44:00,600 --> 00:44:03,330 very most sophisticated computer still use the Stone Age 1022 00:44:03,330 --> 00:44:05,790 technology of disks spinning around, 1023 00:44:05,790 --> 00:44:08,610 and that's the thing that's really going to have to change, 1024 00:44:08,610 --> 00:44:10,260 and what's the technology that's going 1025 00:44:10,260 --> 00:44:15,180 to replace that is kind of the $64,000 question. 1026 00:44:15,180 --> 00:44:18,660 Optical technology is mostly-- 1027 00:44:18,660 --> 00:44:21,150 what's out there now is still disks spinning around. 1028 00:44:21,150 --> 00:44:24,000 You're writing things on it with optics, instead of 1029 00:44:24,000 --> 00:44:27,420 with magnetics, but it's still that kind of mechanical motion 1030 00:44:27,420 --> 00:44:29,020 determines the speed of the system. 1031 00:44:29,020 --> 00:44:31,830 I think we're eventually going to have to get away from that. 1032 00:44:31,830 --> 00:44:34,410 Some of the possibilities are things 1033 00:44:34,410 --> 00:44:38,130 like storing three-dimensionally in crystal lattices. 1034 00:44:38,130 --> 00:44:41,130 Magnetic bubbles is another thing people have talked about. 1035 00:44:41,130 --> 00:44:43,470 None of them seem really quite there enough 1036 00:44:43,470 --> 00:44:45,750 to be able to predict them, but it's clear 1037 00:44:45,750 --> 00:44:48,810 that we're going to need something like that. 1038 00:44:48,810 --> 00:44:51,090 I know that, when you do error analysis 1039 00:44:51,090 --> 00:44:56,740 on serial machine in terms of different outputs, you have-- 1040 00:44:56,740 --> 00:44:59,100 [INAUDIBLE] to compare the advantage 1041 00:44:59,100 --> 00:45:01,500 in terms of troubleshooting program [INAUDIBLE] 1042 00:45:01,500 --> 00:45:05,040 in parallel processing, more or less [? architecture. ?] 1043 00:45:05,040 --> 00:45:06,240 OK. 1044 00:45:06,240 --> 00:45:10,380 Fundamentally, writing a program on a data parallel machine 1045 00:45:10,380 --> 00:45:12,782 is no more difficult than writing a program 1046 00:45:12,782 --> 00:45:14,490 on a sequential machine, because you only 1047 00:45:14,490 --> 00:45:16,080 have one thread of control. 1048 00:45:16,080 --> 00:45:19,080 So you don't have to worry about issues like synchronization 1049 00:45:19,080 --> 00:45:22,470 or data consistency between processes and things like that. 1050 00:45:22,470 --> 00:45:24,060 It's things like that, where you have 1051 00:45:24,060 --> 00:45:25,950 multiple threads of control, that 1052 00:45:25,950 --> 00:45:29,490 makes some kinds of parallel programming very difficult. 1053 00:45:29,490 --> 00:45:31,800 But debugging a data parallel program 1054 00:45:31,800 --> 00:45:35,130 is really very much like debugging a sequential program, 1055 00:45:35,130 --> 00:45:37,050 and uses essentially the same tools. 1056 00:45:37,050 --> 00:45:39,367 Now, of course, if the program's doing the wrong thing, 1057 00:45:39,367 --> 00:45:41,200 then it can do the wrong thing a lot faster, 1058 00:45:41,200 --> 00:45:42,658 so it can make a lot more mistakes, 1059 00:45:42,658 --> 00:45:45,285 but the truth of the matter is that it can already 1060 00:45:45,285 --> 00:45:47,910 make mistakes a lot faster than you can keep up with it anyway. 1061 00:45:47,910 --> 00:45:51,810 So that difference really doesn't make a difference. 1062 00:45:51,810 --> 00:45:55,115 Now, that's to do with software errors. 
1063 00:45:55,115 --> 00:45:57,240 Another question, which you might have been asking, 1064 00:45:57,240 --> 00:45:58,770 is about hardware errors. 1065 00:45:58,770 --> 00:46:01,710 And one topic that I neglected in the talk, which 1066 00:46:01,710 --> 00:46:04,530 is a very big issue for designing these machines, 1067 00:46:04,530 --> 00:46:07,410 is the problem of fault tolerance. 1068 00:46:07,410 --> 00:46:09,000 And in fact, the machines that we're 1069 00:46:09,000 --> 00:46:11,070 designing in the future all have very high degree 1070 00:46:11,070 --> 00:46:14,400 of fault tolerance, so that one part of the hardware can break 1071 00:46:14,400 --> 00:46:18,540 and the program continues running, and runs-- 1072 00:46:18,540 --> 00:46:20,855 the user sees no error, even though the hardware-- 1073 00:46:20,855 --> 00:46:22,980 the individual hardware elements are making errors. 1074 00:46:22,980 --> 00:46:25,650 And that becomes a necessity when you design very, very 1075 00:46:25,650 --> 00:46:27,510 large-scale machines. 1076 00:46:27,510 --> 00:46:30,330 And that will become increasingly a problem 1077 00:46:30,330 --> 00:46:32,280 in designing machines, and increasingly 1078 00:46:32,280 --> 00:46:35,331 a place for architectural creativity. 1079 00:46:35,331 --> 00:46:38,165 I'm Mark Tuttle from the [INAUDIBLE] Lab. 1080 00:46:38,165 --> 00:46:39,540 I felt, during the talk, you were 1081 00:46:39,540 --> 00:46:42,900 trying to convince me that the data parallel model was 1082 00:46:42,900 --> 00:46:46,170 the best and most natural model for parallel computation, 1083 00:46:46,170 --> 00:46:48,660 and I'd like to press you on that a little more. 1084 00:46:48,660 --> 00:46:52,380 It seems like, in discrediting the other models, 1085 00:46:52,380 --> 00:46:53,940 you're taking a lot of-- 1086 00:46:53,940 --> 00:46:57,450 you're taking advantage of the parallelism that's inherent-- 1087 00:46:57,450 --> 00:46:59,610 the data parallelism that's inherent in the data 1088 00:46:59,610 --> 00:47:01,620 that you're using, like the airflow 1089 00:47:01,620 --> 00:47:03,820 over the wing and stuff like that. 1090 00:47:03,820 --> 00:47:06,990 But it seems like there are a lot of problems where that's 1091 00:47:06,990 --> 00:47:09,600 not really the case, like chasing links 1092 00:47:09,600 --> 00:47:13,740 through a semantic net, where you can, of course, formulate 1093 00:47:13,740 --> 00:47:17,520 that is taking the transitive closure of a graph. 1094 00:47:17,520 --> 00:47:21,120 And I just wanted to ask you about that. 1095 00:47:21,120 --> 00:47:29,330 Well, I certainly believe that other kinds of parallelism 1096 00:47:29,330 --> 00:47:32,000 are legitimate and useful. 1097 00:47:32,000 --> 00:47:34,880 The problem with most kinds of parallelism 1098 00:47:34,880 --> 00:47:37,970 based on control flow is that they 1099 00:47:37,970 --> 00:47:39,985 don't get you the factors of tens of thousands 1100 00:47:39,985 --> 00:47:41,360 or hundreds of thousands that you 1101 00:47:41,360 --> 00:47:45,440 need to really do these very, very large-scale problems. 1102 00:47:45,440 --> 00:47:49,310 So I guess I feel that data parallelism is 1103 00:47:49,310 --> 00:47:51,680 the only way that you'll get the speed in very, very 1104 00:47:51,680 --> 00:47:52,500 large problems. 
1105 00:47:52,500 --> 00:47:54,510 And in fact, the problems that you mentioned, 1106 00:47:54,510 --> 00:47:57,140 which were database query, computing 1107 00:47:57,140 --> 00:47:59,130 the transitive closure of a graph, 1108 00:47:59,130 --> 00:48:01,040 are very, very natural expressions 1109 00:48:01,040 --> 00:48:03,330 in data parallel algorithms, and in fact, work very, 1110 00:48:03,330 --> 00:48:06,560 very well on a machine like this when the problem is large. 1111 00:48:06,560 --> 00:48:09,050 So it's not that I think that the other kinds of problems 1112 00:48:09,050 --> 00:48:10,202 are no good. 1113 00:48:10,202 --> 00:48:11,660 In fact, I think they're very good. 1114 00:48:11,660 --> 00:48:15,260 I think multitasking, forking parallelism 1115 00:48:15,260 --> 00:48:16,830 is very useful in many cases. 1116 00:48:16,830 --> 00:48:19,125 I think pipelining is useful in many cases. 1117 00:48:19,125 --> 00:48:20,750 I think that those make a lot of sense, 1118 00:48:20,750 --> 00:48:22,610 but they're not going to get you the factors of hundreds 1119 00:48:22,610 --> 00:48:24,440 of thousands, so they're really not what 1120 00:48:24,440 --> 00:48:27,868 you have to focus on when designing a machine that's 1121 00:48:27,868 --> 00:48:29,660 going to operate in that performance range. 1122 00:48:29,660 --> 00:48:31,910 It sounds like you're talking about-- 1123 00:48:31,910 --> 00:48:34,580 that's an argument that the architecture that you're 1124 00:48:34,580 --> 00:48:36,440 interested in is a superior architecture, 1125 00:48:36,440 --> 00:48:40,280 because you can make these chips small enough 1126 00:48:40,280 --> 00:48:44,750 that you can make these machines much, much bigger. 1127 00:48:44,750 --> 00:48:48,360 It's more cost effective to make these kinds of machines bigger, 1128 00:48:48,360 --> 00:48:50,660 but that doesn't seem like it's a convincing argument 1129 00:48:50,660 --> 00:48:52,520 that the model-- 1130 00:48:52,520 --> 00:48:55,370 the programming model itself is the most natural model. 1131 00:48:55,370 --> 00:48:57,927 Well, maybe a more-- 1132 00:48:57,927 --> 00:49:00,260 better way to convince you that the programming model is 1133 00:49:00,260 --> 00:49:03,260 the right model is to say that even somebody who 1134 00:49:03,260 --> 00:49:05,460 builds hardware for a completely different model 1135 00:49:05,460 --> 00:49:07,190 almost always ends up programming it 1136 00:49:07,190 --> 00:49:08,703 in the data parallel model. 1137 00:49:08,703 --> 00:49:10,370 So typically, if you have a machine that 1138 00:49:10,370 --> 00:49:13,520 has more than a few dozen processors, even if it wasn't 1139 00:49:13,520 --> 00:49:16,100 designed to support the data parallel programming model, 1140 00:49:16,100 --> 00:49:18,440 you'll see that the way that people program that machine 1141 00:49:18,440 --> 00:49:20,570 is to sort of make that machine pretend like it's 1142 00:49:20,570 --> 00:49:21,772 a data parallel machine. 1143 00:49:21,772 --> 00:49:23,480 And in fact, they only write one program, 1144 00:49:23,480 --> 00:49:26,550 and they run it all on all the processors at once. 1145 00:49:26,550 --> 00:49:29,330 So they hide all of that flexibility 1146 00:49:29,330 --> 00:49:30,950 that they put into the architecture, 1147 00:49:30,950 --> 00:49:32,880 because they don't know how to deal with [INAUDIBLE] software. 
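As an illustration of the earlier point that transitive closure has a natural data parallel expression, here is a small sketch using NumPy arrays as a stand-in for the whole-array operations a data parallel machine provides. It is not Connection Machine code (that would have been written in something like *Lisp or C*), and the function name and example graph are made up.

import numpy as np

def transitive_closure(adjacency):
    """Boolean n-by-n adjacency matrix -> boolean reachability matrix."""
    n = adjacency.shape[0]
    reach = adjacency | np.eye(n, dtype=bool)      # every vertex reaches itself
    # Repeatedly square the relation.  After k rounds, all paths of length
    # up to 2**k are accounted for, so about log2(n) rounds suffice.  Each
    # round is a single elementwise operation over all n*n*n combinations --
    # one "instruction" applied across all the data at once.
    for _ in range(max(1, int(np.ceil(np.log2(n))))):
        reach = reach | (reach[:, :, None] & reach[None, :, :]).any(axis=1)
    return reach

# Tiny example: a four-vertex chain 0 -> 1 -> 2 -> 3.
edges = np.zeros((4, 4), dtype=bool)
edges[0, 1] = edges[1, 2] = edges[2, 3] = True
print(transitive_closure(edges).astype(int))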
1148 00:49:32,880 --> 00:49:34,713 So I think that's a very convincing argument 1149 00:49:34,713 --> 00:49:36,770 that this is the most natural way of expressing 1150 00:49:36,770 --> 00:49:38,690 those algorithms. 1151 00:49:38,690 --> 00:49:41,240 I'm Karen Matthews from UVC, but Danny, 1152 00:49:41,240 --> 00:49:43,850 this question is from Gordon Bell. To 1153 00:49:43,850 --> 00:49:46,310 what extent can we expect a language for expressing 1154 00:49:46,310 --> 00:49:49,440 parallel computation to span a range of types of machines 1155 00:49:49,440 --> 00:49:52,340 so that programmers don't have to be retrained? 1156 00:49:52,340 --> 00:49:57,140 I think that it's realistic for machines 1157 00:49:57,140 --> 00:49:59,810 with roughly the same numbers 1158 00:49:59,810 --> 00:50:04,260 of processors to have the same kind of programming language. 1159 00:50:04,260 --> 00:50:06,500 So I believe that all machines with tens of thousands 1160 00:50:06,500 --> 00:50:08,570 of processors will be programmed in the same way 1161 00:50:08,570 --> 00:50:10,790 by exactly the same programming languages. 1162 00:50:10,790 --> 00:50:13,187 I believe that machines that have dozens of processors 1163 00:50:13,187 --> 00:50:15,770 will be programmed in a pretty different way, because there it 1164 00:50:15,770 --> 00:50:18,502 makes sense to use this fork-join kind of parallelism. 1165 00:50:18,502 --> 00:50:20,960 And it may be that there are some intermediate stages where 1166 00:50:20,960 --> 00:50:23,900 machines with a few processors can be programmed in a message- 1167 00:50:23,900 --> 00:50:27,470 passing way, but I think that all machines that are massively 1168 00:50:27,470 --> 00:50:30,950 parallel will run exactly the same programming languages. 1169 00:50:30,950 --> 00:50:32,720 And interestingly enough, they're 1170 00:50:32,720 --> 00:50:34,910 probably the same programming languages 1171 00:50:34,910 --> 00:50:36,733 that the sequential machines will run, 1172 00:50:36,733 --> 00:50:38,150 and that was a big surprise to me. 1173 00:50:38,150 --> 00:50:41,800 [MUSIC PLAYING] 1174 00:50:41,800 --> 00:52:25,110