1 00:00:00,000 --> 00:00:02,170 2 00:00:02,170 --> 00:00:05,440 University video communication is pleased to present this 3 00:00:05,440 --> 00:00:08,260 video tape edition of the Distinguished Lecture Series , 4 00:00:08,260 --> 00:00:10,750 industry leaders in computer science and electrical 5 00:00:10,750 --> 00:00:11,740 engineering. 6 00:00:11,740 --> 00:00:13,540 Speakers and topics in this series 7 00:00:13,540 --> 00:00:16,420 have been nominated by University faculty who 8 00:00:16,420 --> 00:00:19,900 want industry material for use in their courses. 9 00:00:19,900 --> 00:00:22,720 Today's presentation, parallel processing 10 00:00:22,720 --> 00:00:24,970 is by Tilak K. Agerwala. 11 00:00:24,970 --> 00:00:27,520 An IEEE believe fellow, he received his Bachelor 12 00:00:27,520 --> 00:00:29,560 of technology in electrical engineering 13 00:00:29,560 --> 00:00:32,020 from the Indian Institute of Technology, 14 00:00:32,020 --> 00:00:34,150 and his PhD in electrical engineering 15 00:00:34,150 --> 00:00:36,490 from Johns Hopkins University. 16 00:00:36,490 --> 00:00:38,650 As a member of the corporate technical committee 17 00:00:38,650 --> 00:00:41,110 in the science and technology department at IBM 18 00:00:41,110 --> 00:00:44,400 corporate headquarters, one of his current responsibilities 19 00:00:44,400 --> 00:00:48,910 is for corporate wide assessment of parallel processing at IBM. 20 00:00:48,910 --> 00:00:53,110 UVC is happy to introduce to you Dr. Tilak Agerwala. 21 00:00:53,110 --> 00:00:57,040 It's a pleasure to have this opportunity to speak to you 22 00:00:57,040 --> 00:01:01,840 on one of my favorite topics namely, parallel processing. 23 00:01:01,840 --> 00:01:07,360 The presentation today will be a tutorial presentation 24 00:01:07,360 --> 00:01:11,620 where I will try to give you an overview of some 25 00:01:11,620 --> 00:01:16,870 of the important trends in this area of parallel processing. 26 00:01:16,870 --> 00:01:18,670 Now many of you are quite familiar 27 00:01:18,670 --> 00:01:22,540 with the various motivations for parallel processing, 28 00:01:22,540 --> 00:01:28,480 I would on this particular slide emphasize two. 29 00:01:28,480 --> 00:01:31,690 The performance potential of parallel processing 30 00:01:31,690 --> 00:01:34,570 is one of the key motivations. 31 00:01:34,570 --> 00:01:40,360 And the performance aspects are really common to flavors. 32 00:01:40,360 --> 00:01:44,830 One can talk about the maximum performance possible , 33 00:01:44,830 --> 00:01:46,510 the maximum absolute performance, 34 00:01:46,510 --> 00:01:51,280 one can also be motivated by the ability to back a lot 35 00:01:51,280 --> 00:01:53,650 of performance in the small package, 36 00:01:53,650 --> 00:01:59,270 and one typically may want to do that in workstations. 37 00:01:59,270 --> 00:02:04,960 There is a potential for improved cost performance, 38 00:02:04,960 --> 00:02:09,250 to provide scalable performance, and the other one 39 00:02:09,250 --> 00:02:12,460 on this list that I really do want to emphasize 40 00:02:12,460 --> 00:02:14,740 is reduce development cost. 41 00:02:14,740 --> 00:02:18,910 And reduce development costs also comes in two flavors. 42 00:02:18,910 --> 00:02:22,660 Sometimes it's possible to get significant performance 43 00:02:22,660 --> 00:02:26,950 without developing very sophisticated technologies 44 00:02:26,950 --> 00:02:31,570 and very sophisticated packages that are developed for very 45 00:02:31,570 --> 00:02:34,030 high performance machines. 46 00:02:34,030 --> 00:02:37,510 Secondly if one is interested in really providing 47 00:02:37,510 --> 00:02:41,740 a very wide range of performance for example in IBM, 48 00:02:41,740 --> 00:02:45,670 we cover the range of systems from personal computers 49 00:02:45,670 --> 00:02:48,070 to supercomputers, then one is faced 50 00:02:48,070 --> 00:02:52,090 with the question of how many unique uni-processor designs 51 00:02:52,090 --> 00:02:55,210 should one develop to cover that range of performance. 52 00:02:55,210 --> 00:02:57,130 And clearly with parallel processing 53 00:02:57,130 --> 00:02:59,710 then one has an added dimension so 54 00:02:59,710 --> 00:03:02,410 that the number of unique processor designs 55 00:03:02,410 --> 00:03:04,780 can be reduced and the range can be 56 00:03:04,780 --> 00:03:10,530 covered with the use of parallel processing. 57 00:03:10,530 --> 00:03:15,990 Increased availability is also an important consideration 58 00:03:15,990 --> 00:03:19,860 and is very important in today's systems. 59 00:03:19,860 --> 00:03:23,700 There has been a huge amount of progress in computers. 60 00:03:23,700 --> 00:03:26,760 Generally, I'd just like to state 61 00:03:26,760 --> 00:03:30,300 that the driving force behind much of this progress 62 00:03:30,300 --> 00:03:33,510 is really progress in micro-electronics. 63 00:03:33,510 --> 00:03:39,030 And this the slide illustrates what some of the trends 64 00:03:39,030 --> 00:03:44,470 are in the area of memory, CMOS, and bipolar technologies. 65 00:03:44,470 --> 00:03:49,590 I should emphasize that this slide is not just 66 00:03:49,590 --> 00:03:52,860 obtained by sitting down with log paper 67 00:03:52,860 --> 00:03:57,400 and projecting into the future by drawing straight lines. 68 00:03:57,400 --> 00:04:01,200 This graph is obtained by a very thorough evaluation 69 00:04:01,200 --> 00:04:07,530 of the problems that are likely to be faced in the future 70 00:04:07,530 --> 00:04:11,520 and by considering the solutions that one may use to address 71 00:04:11,520 --> 00:04:12,930 some of these problems. 72 00:04:12,930 --> 00:04:16,110 And the basic message that I want to give you on this slide 73 00:04:16,110 --> 00:04:20,339 is that technology at this level is going straight ahead 74 00:04:20,339 --> 00:04:23,200 and there's no end in sight. 75 00:04:23,200 --> 00:04:29,550 Now, the basic technology capability for example 76 00:04:29,550 --> 00:04:35,700 then translates into the ability to build processing systems. 77 00:04:35,700 --> 00:04:38,730 And what I am illustrating on this slide 78 00:04:38,730 --> 00:04:43,050 is the evolution of microprocessor systems, 79 00:04:43,050 --> 00:04:46,530 that the systems that can be built on a single chip 80 00:04:46,530 --> 00:04:48,960 or on a small collection of chips. 81 00:04:48,960 --> 00:04:51,810 And with 1 micron CMOS technology today, 82 00:04:51,810 --> 00:04:54,900 we are essentially sitting in the range of 20 to 30 83 00:04:54,900 --> 00:04:58,230 million instructions per second and this will steadily 84 00:04:58,230 --> 00:05:01,530 move into the 50 to 70 range. 85 00:05:01,530 --> 00:05:06,310 I should emphasize that this performance comes-- 86 00:05:06,310 --> 00:05:09,060 these numbers are based on the assumption 87 00:05:09,060 --> 00:05:11,070 of having a machine organization that 88 00:05:11,070 --> 00:05:15,930 can deliver approximately 1.2 cycles per instruction 89 00:05:15,930 --> 00:05:17,580 for average workloads. 90 00:05:17,580 --> 00:05:20,580 If one uses more sophisticated machine organizations 91 00:05:20,580 --> 00:05:22,620 and we certainly know how to do that, 92 00:05:22,620 --> 00:05:25,770 these numbers can increase significantly. 93 00:05:25,770 --> 00:05:28,950 Furthermore another factor of two 94 00:05:28,950 --> 00:05:34,870 can be obtained by using liquid nitrogen cooling 95 00:05:34,870 --> 00:05:37,120 and by using some of the developments 96 00:05:37,120 --> 00:05:43,960 in high density, low power, bipolar technologies. 97 00:05:43,960 --> 00:05:47,110 It's certainly feasible in the early 1990s 98 00:05:47,110 --> 00:05:52,210 to get into the 100 million instruction per second range. 99 00:05:52,210 --> 00:05:54,160 I really want to make two points over here. 100 00:05:54,160 --> 00:05:58,510 Firstly that the building blocks that we have available to us 101 00:05:58,510 --> 00:06:01,630 now are becoming powerful enough so 102 00:06:01,630 --> 00:06:07,480 that the small MIPS effects that one would encounter with lower 103 00:06:07,480 --> 00:06:10,780 performance microprocessors are no longer 104 00:06:10,780 --> 00:06:12,730 coming into the picture. 105 00:06:12,730 --> 00:06:17,650 And secondly that with a 40, 50, 100 million instruction 106 00:06:17,650 --> 00:06:19,750 per second building block, one does not 107 00:06:19,750 --> 00:06:23,050 have to go to very high levels of parallel processing 108 00:06:23,050 --> 00:06:25,210 to get fairly significant systems. 109 00:06:25,210 --> 00:06:30,730 So 16 way and 30 way parallelism with 100 microprocessor 110 00:06:30,730 --> 00:06:33,370 puts you into a very interesting range. 111 00:06:33,370 --> 00:06:35,890 Of course, to get the supercomputing kinds 112 00:06:35,890 --> 00:06:39,520 of capabilities as I'll point out one 113 00:06:39,520 --> 00:06:41,380 can use higher levels of parallelism. 114 00:06:41,380 --> 00:06:44,730 115 00:06:44,730 --> 00:06:49,440 Now let me then talk about some of the important trends 116 00:06:49,440 --> 00:06:50,820 in parallel processing. 117 00:06:50,820 --> 00:06:53,010 And this slide will indicate to you 118 00:06:53,010 --> 00:06:56,400 that I am in fact viewing parallel processing 119 00:06:56,400 --> 00:07:01,470 in a very broad sense to cover clustering, multiprocessing, 120 00:07:01,470 --> 00:07:04,270 and very highly parallel processing. 121 00:07:04,270 --> 00:07:07,950 And I will spend the next few minutes defining these terms 122 00:07:07,950 --> 00:07:11,820 and giving you some examples of systems in each of these areas. 123 00:07:11,820 --> 00:07:14,640 124 00:07:14,640 --> 00:07:20,850 In clustering, one typically connects a number 125 00:07:20,850 --> 00:07:25,770 of complete computers that is processor and memory 126 00:07:25,770 --> 00:07:30,750 pairs together to disks. 127 00:07:30,750 --> 00:07:34,500 The key motivation for clustering 128 00:07:34,500 --> 00:07:40,710 is to obtain improved capacity and improved availability yet 129 00:07:40,710 --> 00:07:44,070 at the same time to provide a single system image. 130 00:07:44,070 --> 00:07:45,900 That is the key challenge. 131 00:07:45,900 --> 00:07:49,500 One is trying to develop a system that 132 00:07:49,500 --> 00:07:53,280 consists of multiple processors but looks 133 00:07:53,280 --> 00:07:57,540 like a single computer system to the operator 134 00:07:57,540 --> 00:07:59,140 and to the end user. 135 00:07:59,140 --> 00:08:01,730 Now what do I mean by a single system image? 136 00:08:01,730 --> 00:08:04,290 I basically mean the following, that the system 137 00:08:04,290 --> 00:08:07,290 should look like one computer system 138 00:08:07,290 --> 00:08:09,420 from an administrative point of view. 139 00:08:09,420 --> 00:08:11,730 It should have a single file system. 140 00:08:11,730 --> 00:08:14,340 Users who log on to the system should not 141 00:08:14,340 --> 00:08:17,010 be aware of the particular computer 142 00:08:17,010 --> 00:08:19,410 that they end up executing on. 143 00:08:19,410 --> 00:08:23,190 And the system should provide a high degree of availability 144 00:08:23,190 --> 00:08:25,360 and graceful degradation. 145 00:08:25,360 --> 00:08:27,810 Now, in clustering, this is attained 146 00:08:27,810 --> 00:08:32,460 not by having one single operating system that executes 147 00:08:32,460 --> 00:08:35,039 on all the machines, but rather by having 148 00:08:35,039 --> 00:08:37,650 multiple copies of the operating systems 149 00:08:37,650 --> 00:08:41,220 that exchange information typically at the page level, 150 00:08:41,220 --> 00:08:44,130 and thereby software synchronization techniques 151 00:08:44,130 --> 00:08:45,600 are adequate. 152 00:08:45,600 --> 00:08:48,630 Let me just point out that examples of clustering 153 00:08:48,630 --> 00:08:52,380 are the transparent computing facility, which 154 00:08:52,380 --> 00:08:55,320 is part of IBM's UNIX systems. 155 00:08:55,320 --> 00:08:57,660 And I will talk about that later. 156 00:08:57,660 --> 00:09:00,570 Systems like tandem and vax cluster 157 00:09:00,570 --> 00:09:02,985 are also examples of clusters. 158 00:09:02,985 --> 00:09:05,550 Now, the situation in multiprocessing 159 00:09:05,550 --> 00:09:06,900 is a little different. 160 00:09:06,900 --> 00:09:11,100 Here you have processors that share 161 00:09:11,100 --> 00:09:14,130 information at very high speeds, typically 162 00:09:14,130 --> 00:09:20,100 by executing loads and stores against a single memory system. 163 00:09:20,100 --> 00:09:22,530 The synchronization between the processors 164 00:09:22,530 --> 00:09:24,930 is provided at the hardware level, 165 00:09:24,930 --> 00:09:28,800 because the exchange information at these very high speeds. 166 00:09:28,800 --> 00:09:31,270 As a result of this hardware synchronization, 167 00:09:31,270 --> 00:09:33,390 the extensibility is typically limited. 168 00:09:33,390 --> 00:09:38,760 So these systems are restricted to the 8, 16, 32 processor 169 00:09:38,760 --> 00:09:42,780 range and typically will not go significantly beyond that. 170 00:09:42,780 --> 00:09:46,050 Now the goal here really is to get the multiprocessor 171 00:09:46,050 --> 00:09:50,350 to look and behave exactly like a uniprocessor system. 172 00:09:50,350 --> 00:09:53,760 So all the subsystem interfaces, all the application interfaces 173 00:09:53,760 --> 00:09:56,340 to the operating system should be 174 00:09:56,340 --> 00:09:59,250 identical to what they are in a unique processor. 175 00:09:59,250 --> 00:10:01,950 And this really means taking a unique processor operating 176 00:10:01,950 --> 00:10:06,660 system and having the single operating system execute 177 00:10:06,660 --> 00:10:09,680 across the multiple processors. 178 00:10:09,680 --> 00:10:12,990 And this involves implementing mutual exclusion that 179 00:10:12,990 --> 00:10:15,900 involves analyzing the data structures, 180 00:10:15,900 --> 00:10:18,570 understanding how they're used, and providing 181 00:10:18,570 --> 00:10:23,530 the appropriate levels of process synchronization. 182 00:10:23,530 --> 00:10:25,620 The key, of course, is to do all of that 183 00:10:25,620 --> 00:10:28,750 and yet obtain very high efficiency. 184 00:10:28,750 --> 00:10:31,020 Now multiprocessing is being applied 185 00:10:31,020 --> 00:10:33,910 across a very wide range of systems, 186 00:10:33,910 --> 00:10:38,010 all the way from workstations to high end machines. 187 00:10:38,010 --> 00:10:41,190 The motivations in each case are slightly different, of course, 188 00:10:41,190 --> 00:10:42,720 they are all related. 189 00:10:42,720 --> 00:10:45,960 In high end systems, the overall goal 190 00:10:45,960 --> 00:10:48,480 is to provide levels of performance 191 00:10:48,480 --> 00:10:51,840 beyond that achievable by the highest performance 192 00:10:51,840 --> 00:10:54,910 uniprocessor at any given point in time. 193 00:10:54,910 --> 00:10:59,430 So the key high end issue is really performance. 194 00:10:59,430 --> 00:11:01,230 All the other issues and motivations 195 00:11:01,230 --> 00:11:04,920 that I talked about earlier, do apply as well 196 00:11:04,920 --> 00:11:07,860 but the performance issue I believe dominates. 197 00:11:07,860 --> 00:11:11,370 In the mid-range again performance 198 00:11:11,370 --> 00:11:13,950 comes into the picture but the goal here 199 00:11:13,950 --> 00:11:17,250 is really to provide high levels of performance 200 00:11:17,250 --> 00:11:20,220 without necessarily getting into very high performance 201 00:11:20,220 --> 00:11:22,110 uniprocessor development, without getting 202 00:11:22,110 --> 00:11:25,320 into the high performance technologies and packages that 203 00:11:25,320 --> 00:11:28,390 characterize very high performance uniprocessors. 204 00:11:28,390 --> 00:11:30,300 And at the same time horizontal growth 205 00:11:30,300 --> 00:11:35,500 is a key issue in the mid-range. 206 00:11:35,500 --> 00:11:38,170 Now the workstations the motivation 207 00:11:38,170 --> 00:11:41,320 is to provide the highest levels of performance 208 00:11:41,320 --> 00:11:43,330 in a very small package. 209 00:11:43,330 --> 00:11:46,360 And there's really a fairly interesting trend 210 00:11:46,360 --> 00:11:48,910 that is taking place in the workstations, which 211 00:11:48,910 --> 00:11:52,630 to a large extent is driven by this technology 212 00:11:52,630 --> 00:11:54,340 of parallel processing. 213 00:11:54,340 --> 00:11:58,210 And the situation is as follows, by taking 214 00:11:58,210 --> 00:12:00,910 some of these high performance microprocessor systems that 215 00:12:00,910 --> 00:12:04,150 are under 20, 30 million instructions per second range 216 00:12:04,150 --> 00:12:08,770 and implementing them, a system consisting of 4 or 8 217 00:12:08,770 --> 00:12:12,950 microprocessors, one can get into performance ranges that 218 00:12:12,950 --> 00:12:16,900 were typically characteristic of supercomputers of a few years 219 00:12:16,900 --> 00:12:17,980 ago. 220 00:12:17,980 --> 00:12:22,990 Now, the workstation however, has an additional capability. 221 00:12:22,990 --> 00:12:27,790 These MIPS can also be used to provide graphics 222 00:12:27,790 --> 00:12:30,520 and to provide user interface processing. 223 00:12:30,520 --> 00:12:33,070 And since the computing engine is now 224 00:12:33,070 --> 00:12:34,840 very close to the user interface, 225 00:12:34,840 --> 00:12:38,530 they can be connected together by very, very high bandwidth 226 00:12:38,530 --> 00:12:40,180 interconnects. 227 00:12:40,180 --> 00:12:43,900 And you can typically now get levels of interactivity 228 00:12:43,900 --> 00:12:47,200 between the compute engine, which is in the workstation, 229 00:12:47,200 --> 00:12:50,290 and the graphics processor, which is in the workstation 230 00:12:50,290 --> 00:12:53,500 significantly beyond what is possible by having 231 00:12:53,500 --> 00:12:57,430 a thin pipe that connects your workstation to a supercomputer 232 00:12:57,430 --> 00:12:59,340 in the background. 233 00:12:59,340 --> 00:13:02,330 So this is a key trend that is happening in the workstation 234 00:13:02,330 --> 00:13:05,510 arena, which I think is being driven significantly 235 00:13:05,510 --> 00:13:09,200 by this technology of parallel processing. 236 00:13:09,200 --> 00:13:11,120 Now let me just give you a few examples 237 00:13:11,120 --> 00:13:14,180 of systems in these areas. 238 00:13:14,180 --> 00:13:18,560 In the high end IBM 3,090 mainframes 239 00:13:18,560 --> 00:13:21,620 are multi processors as other carriers 240 00:13:21,620 --> 00:13:28,640 and as will be the SS1 machine from Steve Chen's company. 241 00:13:28,640 --> 00:13:31,190 In the mid-range the machine systems 242 00:13:31,190 --> 00:13:36,470 like the 4,381 alliant and convex 243 00:13:36,470 --> 00:13:40,700 are examples of systems that use parallel processing. 244 00:13:40,700 --> 00:13:46,160 And at the workstation we see a numerous systems today 245 00:13:46,160 --> 00:13:50,660 examples addent, stellar, and silicon graphics. 246 00:13:50,660 --> 00:13:54,470 Now we move on to the area of highly parallel processing. 247 00:13:54,470 --> 00:13:57,470 As many of you are aware, there's 248 00:13:57,470 --> 00:14:02,180 an incredibly large variety of architectures and systems 249 00:14:02,180 --> 00:14:06,530 that are being developed and studied both in universities 250 00:14:06,530 --> 00:14:08,060 and in the industry. 251 00:14:08,060 --> 00:14:10,370 And it is not my intention here to go 252 00:14:10,370 --> 00:14:13,880 into a very complicated taxonomy that 253 00:14:13,880 --> 00:14:17,750 describes these various parallel processors. 254 00:14:17,750 --> 00:14:19,700 What I would like to do, though is 255 00:14:19,700 --> 00:14:24,080 take these highly parallel processing systems 256 00:14:24,080 --> 00:14:28,070 and put them into two very general categories 257 00:14:28,070 --> 00:14:31,250 that I call specialized and multipurpose. 258 00:14:31,250 --> 00:14:35,600 And let me just make a few comments about each of these. 259 00:14:35,600 --> 00:14:39,240 In the case of specialized parallel processing, 260 00:14:39,240 --> 00:14:42,170 the technology is very well understood today 261 00:14:42,170 --> 00:14:45,260 and highly parallel specialized processes today 262 00:14:45,260 --> 00:14:48,710 are being used to solve important problems. 263 00:14:48,710 --> 00:14:52,730 Problems that are either economically important 264 00:14:52,730 --> 00:14:54,330 or of scientific importance. 265 00:14:54,330 --> 00:14:58,010 And I will talk about two examples that we have developed 266 00:14:58,010 --> 00:15:01,160 and are working on in IBM, the logic simulation 267 00:15:01,160 --> 00:15:02,900 machine and the GF11. 268 00:15:02,900 --> 00:15:06,230 269 00:15:06,230 --> 00:15:08,180 Characteristic of these machines is 270 00:15:08,180 --> 00:15:14,570 that they are intended to solve just one single problem or one 271 00:15:14,570 --> 00:15:17,570 single well understood algorithm. 272 00:15:17,570 --> 00:15:19,820 And as a result, you don't require 273 00:15:19,820 --> 00:15:24,680 very sophisticated programming environments and compilers. 274 00:15:24,680 --> 00:15:29,000 This allows you to develop these machines 275 00:15:29,000 --> 00:15:32,420 without great development costs. 276 00:15:32,420 --> 00:15:35,200 277 00:15:35,200 --> 00:15:38,530 It's my estimate that machines that 278 00:15:38,530 --> 00:15:41,740 can deliver hundreds of gigaflops or even a teraflops 279 00:15:41,740 --> 00:15:49,090 are quite feasible and will be used over the next few years 280 00:15:49,090 --> 00:15:52,180 and such machines can have a very significant impact 281 00:15:52,180 --> 00:15:58,780 on science and on engineering. 282 00:15:58,780 --> 00:16:01,210 But I would like to make another point, which 283 00:16:01,210 --> 00:16:04,360 is that there are other specialized machines 284 00:16:04,360 --> 00:16:07,270 and I've given you one example on the slide, the database 285 00:16:07,270 --> 00:16:11,290 machine, which is a little different from a logic 286 00:16:11,290 --> 00:16:14,620 simulation machine or a machine for a physics application. 287 00:16:14,620 --> 00:16:17,590 These may be used by very few people. 288 00:16:17,590 --> 00:16:21,970 But a machine that can do a complex query very rapidly 289 00:16:21,970 --> 00:16:23,770 is performing a specialized function 290 00:16:23,770 --> 00:16:25,630 but it's performing a function that 291 00:16:25,630 --> 00:16:30,560 will be important to a very large class of users. 292 00:16:30,560 --> 00:16:34,300 In fact, I believe this is an example of a more general trend 293 00:16:34,300 --> 00:16:39,790 where clearly subsystems are being identified 294 00:16:39,790 --> 00:16:45,220 and database is one of them, which are often accessed 295 00:16:45,220 --> 00:16:49,400 by through very well defined interfaces, 296 00:16:49,400 --> 00:16:55,840 for example a SQL query, which are amenable to being addressed 297 00:16:55,840 --> 00:17:01,330 by a special hardware in the case of SQL machine 298 00:17:01,330 --> 00:17:03,040 by a highly parallel processor. 299 00:17:03,040 --> 00:17:05,020 And this is an important trend because it's 300 00:17:05,020 --> 00:17:07,660 a specialized function but it's a function 301 00:17:07,660 --> 00:17:12,910 that will be of use to a large number of people. 302 00:17:12,910 --> 00:17:15,940 I think the whole area of specialized highly parallel 303 00:17:15,940 --> 00:17:18,220 processors is an important one and we 304 00:17:18,220 --> 00:17:19,780 have to start understanding the role 305 00:17:19,780 --> 00:17:23,930 that these machines will play in general computing environments. 306 00:17:23,930 --> 00:17:27,730 Now on the other hand, the area of 307 00:17:27,730 --> 00:17:30,220 multipurpose parallel processing though it 308 00:17:30,220 --> 00:17:34,570 has significant potential, I believe 309 00:17:34,570 --> 00:17:37,960 is also an area where a significant amount of research 310 00:17:37,960 --> 00:17:39,640 still remains to be done. 311 00:17:39,640 --> 00:17:43,120 I will talk about this a little more later on 312 00:17:43,120 --> 00:17:47,670 when I describe the RP3 machine that we are working on at IBM. 313 00:17:47,670 --> 00:17:49,920 I would like to make two points at this stage, though. 314 00:17:49,920 --> 00:17:55,030 One is that the availability of hypercubes 315 00:17:55,030 --> 00:17:58,250 has really been very important. 316 00:17:58,250 --> 00:18:00,910 The second point I would like to make is that in my estimate 317 00:18:00,910 --> 00:18:04,780 the area of data flow computing is very important 318 00:18:04,780 --> 00:18:07,180 because it provides the potential 319 00:18:07,180 --> 00:18:11,440 to exploit massively parallel unstructured parallelism. 320 00:18:11,440 --> 00:18:13,840 I believe more work needs to be done in this area 321 00:18:13,840 --> 00:18:16,177 but the potential is very significant. 322 00:18:16,177 --> 00:18:20,080 323 00:18:20,080 --> 00:18:25,090 And to end this introductory section, 324 00:18:25,090 --> 00:18:28,390 let me just say that in addition to the various classes 325 00:18:28,390 --> 00:18:32,110 of machines the two uses of parallelism 326 00:18:32,110 --> 00:18:34,510 to high level uses of parallelism 327 00:18:34,510 --> 00:18:37,960 that we should distinguish between, one of them 328 00:18:37,960 --> 00:18:40,180 I call system parallelism. 329 00:18:40,180 --> 00:18:44,470 And this is the ability to take different users' applications 330 00:18:44,470 --> 00:18:46,960 or different database transactions 331 00:18:46,960 --> 00:18:48,940 and run them in parallel. 332 00:18:48,940 --> 00:18:51,100 This does increase throughput. 333 00:18:51,100 --> 00:18:54,770 It doesn't make any single application run faster. 334 00:18:54,770 --> 00:18:57,280 A big advantage is that it requires no change 335 00:18:57,280 --> 00:18:58,930 to user programs. 336 00:18:58,930 --> 00:19:02,860 And typically, this class of parallelism 337 00:19:02,860 --> 00:19:04,630 is obtained through multiprocessing 338 00:19:04,630 --> 00:19:06,400 and through clustering. 339 00:19:06,400 --> 00:19:10,000 The other kind of parallelism is application parallelism. 340 00:19:10,000 --> 00:19:12,310 And here the goal is really to take 341 00:19:12,310 --> 00:19:16,180 a single job or a single transaction or a unit of work 342 00:19:16,180 --> 00:19:18,310 and break it up into little pieces 343 00:19:18,310 --> 00:19:21,250 and run all these pieces in parallel. 344 00:19:21,250 --> 00:19:24,130 This doesn't make any individual application run-- 345 00:19:24,130 --> 00:19:27,700 this does make individual applications run faster. 346 00:19:27,700 --> 00:19:30,340 It may not increased system throughput 347 00:19:30,340 --> 00:19:33,490 and it may require changes to the user's program. 348 00:19:33,490 --> 00:19:38,350 Typically, multiprocessing and highly parallel processing 349 00:19:38,350 --> 00:19:43,360 have the potential to provide this kind of parallelism. 350 00:19:43,360 --> 00:19:47,020 And this indicates also multiprocessing 351 00:19:47,020 --> 00:19:50,620 is a very important systems trend because it 352 00:19:50,620 --> 00:19:54,520 does address both systems parallelism and application 353 00:19:54,520 --> 00:19:55,810 parallelism. 354 00:19:55,810 --> 00:19:59,140 What I would like to do next is to illustrate 355 00:19:59,140 --> 00:20:00,760 some of the concepts I have talked 356 00:20:00,760 --> 00:20:03,550 about with specific examples. 357 00:20:03,550 --> 00:20:07,570 I've chosen examples from IBM because those are the ones 358 00:20:07,570 --> 00:20:08,860 that I am more familiar with. 359 00:20:08,860 --> 00:20:12,700 360 00:20:12,700 --> 00:20:17,040 So we start off with clustering. 361 00:20:17,040 --> 00:20:23,850 And that's part of IBM's UNIX offerings. 362 00:20:23,850 --> 00:20:27,180 An IBM UNIX is called AIX. 363 00:20:27,180 --> 00:20:32,640 We have a system called the transparent computing facility. 364 00:20:32,640 --> 00:20:36,480 This system provides a single system image of the type 365 00:20:36,480 --> 00:20:39,570 that I had discussed earlier, across 366 00:20:39,570 --> 00:20:41,610 heterogeneous processors. 367 00:20:41,610 --> 00:20:46,170 The processors can be either system 370 machines or PS2 368 00:20:46,170 --> 00:20:47,320 machines. 369 00:20:47,320 --> 00:20:50,220 Now a single system image in this case 370 00:20:50,220 --> 00:20:53,640 means what I had said earlier that is you get a single file 371 00:20:53,640 --> 00:20:56,700 system, you get log on transparency, 372 00:20:56,700 --> 00:21:01,920 you also get a certain amount of local remote execution 373 00:21:01,920 --> 00:21:03,760 in a transparent manner. 374 00:21:03,760 --> 00:21:07,470 For example, you could be developing and editing 375 00:21:07,470 --> 00:21:10,980 your program on a workstation. 376 00:21:10,980 --> 00:21:16,680 The program could be compiled on a mid-range 4,381 or 9,370 377 00:21:16,680 --> 00:21:21,130 and could be executed on a high end 3,090 system. 378 00:21:21,130 --> 00:21:24,700 And all of this would happen in a manner transparent to the end 379 00:21:24,700 --> 00:21:25,200 user. 380 00:21:25,200 --> 00:21:28,470 381 00:21:28,470 --> 00:21:31,790 The second important point is that these systems 382 00:21:31,790 --> 00:21:35,870 can be connected in a variety of ways, either through channel 383 00:21:35,870 --> 00:21:40,910 to channel attach or using local area networks. 384 00:21:40,910 --> 00:21:46,260 Let me move on now to mainframe parallelism. 385 00:21:46,260 --> 00:21:51,610 And let me talk about the 3,090 model 600, 386 00:21:51,610 --> 00:21:56,580 which is a six-way tightly coupled multiprocessor. 387 00:21:56,580 --> 00:22:00,570 These processes have gaseous and there's very extensive 388 00:22:00,570 --> 00:22:05,410 hardware gas coherence that is provided in a system like this. 389 00:22:05,410 --> 00:22:09,540 The boxes labeled SCEs are storage control elements 390 00:22:09,540 --> 00:22:12,960 and are very complex switches that 391 00:22:12,960 --> 00:22:15,300 bring the processes and the channels 392 00:22:15,300 --> 00:22:19,950 together and provide a coherent view of the shared memory 393 00:22:19,950 --> 00:22:21,330 system. 394 00:22:21,330 --> 00:22:25,650 Now, in many ways in a system like this also 395 00:22:25,650 --> 00:22:28,440 some of the key challenges and some of the key issues 396 00:22:28,440 --> 00:22:30,720 are really software issues. 397 00:22:30,720 --> 00:22:34,770 And we have made a tremendous amount of progress 398 00:22:34,770 --> 00:22:39,300 in obtaining true symmetric multiprocessor 399 00:22:39,300 --> 00:22:44,330 capability in our MVS and VM operating systems. 400 00:22:44,330 --> 00:22:47,190 Furthermore this multiprocessor capability 401 00:22:47,190 --> 00:22:53,040 is exploited by subsystems like IMS and GICS 402 00:22:53,040 --> 00:22:56,070 to get system parallelism. 403 00:22:56,070 --> 00:22:58,680 We have made extensions to the Fortran language. 404 00:22:58,680 --> 00:23:03,150 And together with the automatic detection of parallelism 405 00:23:03,150 --> 00:23:05,910 we can now exploit application parallelism 406 00:23:05,910 --> 00:23:10,350 across this six-way tightly coupled multiprocessor. 407 00:23:10,350 --> 00:23:12,150 There's been a fair amount of work 408 00:23:12,150 --> 00:23:16,960 on subroutine libraries for our vector machines. 409 00:23:16,960 --> 00:23:18,900 And these subroutine libraries are now 410 00:23:18,900 --> 00:23:22,800 being extended to provide support for parallel processing 411 00:23:22,800 --> 00:23:24,930 as well. 412 00:23:24,930 --> 00:23:28,080 The point I want to make is that the six-way, 413 00:23:28,080 --> 00:23:30,420 for example in this case six-way tightly coupled 414 00:23:30,420 --> 00:23:34,440 MP is a very well understood technology that 415 00:23:34,440 --> 00:23:37,470 is being exploited both for system parallelism 416 00:23:37,470 --> 00:23:41,340 and for application parallelism side IBM. 417 00:23:41,340 --> 00:23:48,150 Just to give you a brief example of the kind of speedups 418 00:23:48,150 --> 00:23:53,800 that we are capable of getting by the use of parallel Fortran, 419 00:23:53,800 --> 00:23:58,530 here's a slide that shows three applications of varying degrees 420 00:23:58,530 --> 00:24:02,580 of utilization but you can essentially get 4 to 5 421 00:24:02,580 --> 00:24:06,050 and 1/2 way speed up on a six processor system. 422 00:24:06,050 --> 00:24:10,370 423 00:24:10,370 --> 00:24:14,610 Another system that we have we are working on, 424 00:24:14,610 --> 00:24:18,410 which is an experimental prototype 425 00:24:18,410 --> 00:24:24,020 is a combination of concepts and we are calling it form coupling 426 00:24:24,020 --> 00:24:25,910 for the lack of a better name. 427 00:24:25,910 --> 00:24:30,950 This is really taking two 3090 600 complexes 428 00:24:30,950 --> 00:24:34,700 and connecting them at the shared electronic level 429 00:24:34,700 --> 00:24:40,160 with very high bandwidth fiber optic interconnections. 430 00:24:40,160 --> 00:24:43,280 This as I said should be viewed as an exploratory system 431 00:24:43,280 --> 00:24:46,970 and we are trying to understand the various ways in which it 432 00:24:46,970 --> 00:24:47,880 can be used. 433 00:24:47,880 --> 00:24:50,250 Some of which I have listed on the slide. 434 00:24:50,250 --> 00:24:54,800 So you could use one side of the system for application 435 00:24:54,800 --> 00:24:59,240 parallelism, the other side for system parallelism 436 00:24:59,240 --> 00:25:02,000 or you could use both sides for application parallelism 437 00:25:02,000 --> 00:25:05,630 or you could try to take one job and split it across all 12 438 00:25:05,630 --> 00:25:08,720 processes and get application parallelism across 12 439 00:25:08,720 --> 00:25:10,670 processes. 440 00:25:10,670 --> 00:25:13,190 Using the system to address some of the operating 441 00:25:13,190 --> 00:25:16,610 system and compiler and language issues that would 442 00:25:16,610 --> 00:25:19,062 be involved in form coupling. 443 00:25:19,062 --> 00:25:21,780 444 00:25:21,780 --> 00:25:25,260 Let me move on to workstation parallelism. 445 00:25:25,260 --> 00:25:29,940 We have developed in researched a vehicle, 446 00:25:29,940 --> 00:25:33,270 which is again an experimental vehicle 447 00:25:33,270 --> 00:25:37,170 to understand some of the important issues in this area. 448 00:25:37,170 --> 00:25:40,800 In this particular case, what we would like to do 449 00:25:40,800 --> 00:25:44,700 is we would like to study how best parallel workstation can 450 00:25:44,700 --> 00:25:46,560 be used-- 451 00:25:46,560 --> 00:25:50,200 how parallelism can be used and exploited in a workstation. 452 00:25:50,200 --> 00:25:53,280 Again the various forms of parallelism 453 00:25:53,280 --> 00:25:55,890 that one could exploit, either the system parallelism 454 00:25:55,890 --> 00:25:57,930 or the application parallelism. 455 00:25:57,930 --> 00:26:01,020 And this is a vehicle to start studying some of these issues 456 00:26:01,020 --> 00:26:02,970 from a software point of view. 457 00:26:02,970 --> 00:26:09,420 The vehicle consists of eight IBM risk microprocessors, 458 00:26:09,420 --> 00:26:11,880 each with floating point capability and each 459 00:26:11,880 --> 00:26:16,750 with about 8 megabytes of memory. 460 00:26:16,750 --> 00:26:19,950 There's 32 megabytes of shared memory capability. 461 00:26:19,950 --> 00:26:22,410 This machine does have a shared address, a global address 462 00:26:22,410 --> 00:26:23,880 space. 463 00:26:23,880 --> 00:26:26,040 The memories and processors are connected together 464 00:26:26,040 --> 00:26:28,980 by a high speed 80 megabytes per second bars. 465 00:26:28,980 --> 00:26:31,560 Again since this is a research vehicle, 466 00:26:31,560 --> 00:26:34,020 we have provided a real-time hardware monitor 467 00:26:34,020 --> 00:26:36,840 that will allow us to make measurements 468 00:26:36,840 --> 00:26:42,180 on various parts of the system as the software as executing. 469 00:26:42,180 --> 00:26:45,870 Let me now move on to the area of highly parallel processing. 470 00:26:45,870 --> 00:26:49,140 And I will start with specialized highly parallel 471 00:26:49,140 --> 00:26:53,400 processing and I will go through very briefly two examples, 472 00:26:53,400 --> 00:26:55,290 the engineering verification engine 473 00:26:55,290 --> 00:26:58,167 and the GF11 parallel processor. 474 00:26:58,167 --> 00:27:01,030 475 00:27:01,030 --> 00:27:04,090 Now what was the problem that we were 476 00:27:04,090 --> 00:27:07,390 trying to address with EVE, the Engineering Verification 477 00:27:07,390 --> 00:27:07,930 Engine. 478 00:27:07,930 --> 00:27:11,170 The problem is really that of logic simulation. 479 00:27:11,170 --> 00:27:14,020 And in this particular case, the goal 480 00:27:14,020 --> 00:27:17,000 is to simulate the logic of a high pros-- 481 00:27:17,000 --> 00:27:20,380 high performance processor, one of our high end machines. 482 00:27:20,380 --> 00:27:24,580 And logic simulation is a very important part. 483 00:27:24,580 --> 00:27:27,907 And it's the front end part of the design process. 484 00:27:27,907 --> 00:27:28,990 The goal is the following. 485 00:27:28,990 --> 00:27:33,310 The goal is to try to simulate at the logic level processor 486 00:27:33,310 --> 00:27:35,710 yet to be designed, which is in this case 487 00:27:35,710 --> 00:27:40,420 approximately 300,000 gates and to simulate 488 00:27:40,420 --> 00:27:43,780 5 minutes of execution of that process that 489 00:27:43,780 --> 00:27:45,160 remains to be designed. 490 00:27:45,160 --> 00:27:48,290 Now to do this. 491 00:27:48,290 --> 00:27:55,010 This was the problem we faced in the late '70s and early '80s. 492 00:27:55,010 --> 00:27:58,900 And basically to simulate the next generation 493 00:27:58,900 --> 00:28:01,930 high-end machine for five minutes 494 00:28:01,930 --> 00:28:05,200 would take a 1980 mainframe approximately 100 years, 495 00:28:05,200 --> 00:28:09,250 and this is obviously impossible. 496 00:28:09,250 --> 00:28:14,560 The EVE machine will do this in six weeks. 497 00:28:14,560 --> 00:28:18,520 This is an important example of how 498 00:28:18,520 --> 00:28:22,330 you can use parallel processing to take a problem that 499 00:28:22,330 --> 00:28:27,580 is an important design problem, which is essentially impossible 500 00:28:27,580 --> 00:28:30,370 and to make it very practical. 501 00:28:30,370 --> 00:28:33,070 And that's what EVE achieved. 502 00:28:33,070 --> 00:28:36,230 503 00:28:36,230 --> 00:28:39,890 It achieved this by using a highly powerful processor 504 00:28:39,890 --> 00:28:43,370 but a very specialized processor. 505 00:28:43,370 --> 00:28:48,920 That consists of 256 elementary processing elements connected 506 00:28:48,920 --> 00:28:51,540 together by a crossbar switch. 507 00:28:51,540 --> 00:28:55,610 Now each of these processors, these elementary processors 508 00:28:55,610 --> 00:28:58,580 is really a very simple computing element. 509 00:28:58,580 --> 00:29:02,660 And what it does is to read from an instruction memory 510 00:29:02,660 --> 00:29:08,810 and then do very simple logical operations on a data memory. 511 00:29:08,810 --> 00:29:11,300 And it does these operations one at a time. 512 00:29:11,300 --> 00:29:15,140 That is, it'll take four digit items do a logical operation 513 00:29:15,140 --> 00:29:18,890 on them and produce a result, which either goes back 514 00:29:18,890 --> 00:29:23,780 into a local data memory element or goes across the crossbar 515 00:29:23,780 --> 00:29:25,880 to one of the remote processors. 516 00:29:25,880 --> 00:29:29,512 So it's a very simple architecture conceptually. 517 00:29:29,512 --> 00:29:31,220 And the key, of course, is the following. 518 00:29:31,220 --> 00:29:34,100 The key is to avoid the bottleneck of event driven 519 00:29:34,100 --> 00:29:37,310 simulation and to take a logic network 520 00:29:37,310 --> 00:29:43,160 and actually compile onto this simple little structure. 521 00:29:43,160 --> 00:29:45,260 Again, some of the key problems are really 522 00:29:45,260 --> 00:29:46,760 in the software compilation. 523 00:29:46,760 --> 00:29:50,240 These problems were addressed and solved 524 00:29:50,240 --> 00:29:54,170 and we were able to meet the overall goals of logic 525 00:29:54,170 --> 00:29:57,260 simulation that we started with. 526 00:29:57,260 --> 00:30:01,790 The engineering verification engine as the slide says 527 00:30:01,790 --> 00:30:05,900 has a peak rate of about 2 billion gates simulations 528 00:30:05,900 --> 00:30:06,530 per second. 529 00:30:06,530 --> 00:30:09,170 530 00:30:09,170 --> 00:30:14,590 It is operational at several IBM locations. 531 00:30:14,590 --> 00:30:20,080 Some of these machines have 256 processors, some have less, 532 00:30:20,080 --> 00:30:22,900 the smallest one has 64 processors. 533 00:30:22,900 --> 00:30:25,870 And this is possibly one of the earliest production 534 00:30:25,870 --> 00:30:30,680 uses of a highly parallel processing element. 535 00:30:30,680 --> 00:30:34,490 We have been using these EVEs for a few years 536 00:30:34,490 --> 00:30:37,070 now in our development laboratories 537 00:30:37,070 --> 00:30:41,240 to do logic simulation. 538 00:30:41,240 --> 00:30:45,690 The second problem is different. 539 00:30:45,690 --> 00:30:48,080 The second problem is a scientific problem 540 00:30:48,080 --> 00:30:53,120 and it's the problem of quantum chromodynamics. 541 00:30:53,120 --> 00:30:55,940 The theory of quantum chromodynamics deals 542 00:30:55,940 --> 00:30:59,630 with the forces that exist inside the nucleus 543 00:30:59,630 --> 00:31:01,340 at the quark, antiquark level. 544 00:31:01,340 --> 00:31:03,830 So there's a chromoelectric field 545 00:31:03,830 --> 00:31:05,570 that is hypothesized to exist, which 546 00:31:05,570 --> 00:31:08,150 is very similar to the dynamic field that 547 00:31:08,150 --> 00:31:11,090 exists between electrons and the nucleus. 548 00:31:11,090 --> 00:31:14,930 The chromoelectric field exists inside the nucleus. 549 00:31:14,930 --> 00:31:17,660 QCD gives you the transition probabilities 550 00:31:17,660 --> 00:31:20,570 between quark and antiquark configurations 551 00:31:20,570 --> 00:31:23,690 at various points in space and time. 552 00:31:23,690 --> 00:31:26,090 From these transition probabilities, 553 00:31:26,090 --> 00:31:29,730 from first principles one can compute the mass of the proton. 554 00:31:29,730 --> 00:31:32,630 So the goal is to compute the mass of the proton, which 555 00:31:32,630 --> 00:31:36,080 is a known quantity and to see whether the computation matches 556 00:31:36,080 --> 00:31:38,570 what we know the mass of the product to be. 557 00:31:38,570 --> 00:31:41,570 If it does, then we have a certain amount of confidence 558 00:31:41,570 --> 00:31:43,650 that QCD is correct. 559 00:31:43,650 --> 00:31:47,400 If it doesn't match, then we know that QCD is wrong. 560 00:31:47,400 --> 00:31:50,780 Now, this is a structured and regular computation 561 00:31:50,780 --> 00:31:58,370 but the physicists involved with this particular theory at IBM 562 00:31:58,370 --> 00:32:02,090 estimates that 3 times 10 to the 17 operations 563 00:32:02,090 --> 00:32:05,120 would be required to compute the mass of the proton. 564 00:32:05,120 --> 00:32:07,730 On a machine that could sustain 100 flops 565 00:32:07,730 --> 00:32:09,170 this would take 100 years. 566 00:32:09,170 --> 00:32:12,090 And again, is quite impossible. 567 00:32:12,090 --> 00:32:16,680 On the GF11 this problem can be solved in approximately one 568 00:32:16,680 --> 00:32:17,180 year. 569 00:32:17,180 --> 00:32:20,420 This is another somewhat dramatic example 570 00:32:20,420 --> 00:32:22,130 of the use of parallel processing 571 00:32:22,130 --> 00:32:24,770 to take an important problem and this is 572 00:32:24,770 --> 00:32:29,990 an important scientific problem and bring it 573 00:32:29,990 --> 00:32:31,610 into the realm of possibility. 574 00:32:31,610 --> 00:32:37,680 A problem that for all practical purposes today is intractable. 575 00:32:37,680 --> 00:32:41,610 The GF11 architecture is an ASMD architecture, 576 00:32:41,610 --> 00:32:45,450 that is there's central control that broadcasts instructions 577 00:32:45,450 --> 00:32:50,560 to 576 processors. 578 00:32:50,560 --> 00:32:53,240 Now each of these processors is a fairly powerful machine 579 00:32:53,240 --> 00:32:56,140 it's not a little microprocessor. 580 00:32:56,140 --> 00:32:58,720 Each processor has the capability 581 00:32:58,720 --> 00:33:03,550 of executing 20 mega flops, that is 20 million floating point 582 00:33:03,550 --> 00:33:04,870 operations per second. 583 00:33:04,870 --> 00:33:08,650 It can do I/O at the rate of 20 megabytes per second 584 00:33:08,650 --> 00:33:11,980 and it has the capability also of doing 20 million fixed point 585 00:33:11,980 --> 00:33:13,210 operations per second. 586 00:33:13,210 --> 00:33:16,660 If you multiply all that by 576, you 587 00:33:16,660 --> 00:33:19,630 end up with over 11 gigaflops-- 588 00:33:19,630 --> 00:33:22,570 over 11 billion fixed point operations 589 00:33:22,570 --> 00:33:27,260 per second and a total bandwidth of 11 gigabytes per second. 590 00:33:27,260 --> 00:33:31,720 Now it is not our claim that you can 591 00:33:31,720 --> 00:33:35,920 get that level of performance on any arbitrary problem, 592 00:33:35,920 --> 00:33:41,380 but we do believe that on QCD we can get up to 90% utilization 593 00:33:41,380 --> 00:33:43,840 and therefore can sustain a computation 594 00:33:43,840 --> 00:33:47,170 rate of 10 gigaflops per second, which would be required 595 00:33:47,170 --> 00:33:49,840 to solve the QCD problem and compute 596 00:33:49,840 --> 00:33:52,420 the mass of the proton in approximately one year. 597 00:33:52,420 --> 00:33:55,090 598 00:33:55,090 --> 00:33:57,130 The other important feature of this machine 599 00:33:57,130 --> 00:33:59,290 is the interconnection network. 600 00:33:59,290 --> 00:34:03,070 The interconnection network works as follows. 601 00:34:03,070 --> 00:34:04,570 Before you start the problem, you 602 00:34:04,570 --> 00:34:09,310 can select up to 1,000 of the factorial 576 603 00:34:09,310 --> 00:34:11,389 possible permutations. 604 00:34:11,389 --> 00:34:12,820 And then you can actually provide 605 00:34:12,820 --> 00:34:16,150 each one of these permutations every 200 nanoseconds 606 00:34:16,150 --> 00:34:18,719 at execution time. 607 00:34:18,719 --> 00:34:22,870 Now QCD itself does not require that much generality 608 00:34:22,870 --> 00:34:27,070 for the computations, yet this computation network 609 00:34:27,070 --> 00:34:29,920 and the ability to have these various permutations 610 00:34:29,920 --> 00:34:33,699 is important because you want to be able to separate 611 00:34:33,699 --> 00:34:38,080 the physical interconnection of the processors from the way 612 00:34:38,080 --> 00:34:39,610 they are used logically. 613 00:34:39,610 --> 00:34:41,440 And this is important if you want 614 00:34:41,440 --> 00:34:46,550 to be able to operate in the presence of processor failures. 615 00:34:46,550 --> 00:34:49,060 And that is an important consideration 616 00:34:49,060 --> 00:34:50,650 in this architecture. 617 00:34:50,650 --> 00:34:53,920 I'm also using the word permutation somewhat loosely, 618 00:34:53,920 --> 00:34:58,780 you can also provide broadcast every 200 nanoseconds. 619 00:34:58,780 --> 00:35:00,940 And this is one of the features that 620 00:35:00,940 --> 00:35:04,990 will make the GF11 actually suitable for a wider 621 00:35:04,990 --> 00:35:07,870 variety of problems than just the QCD problem. 622 00:35:07,870 --> 00:35:12,650 623 00:35:12,650 --> 00:35:16,310 Let me now move on to the area of multipurpose parallel 624 00:35:16,310 --> 00:35:17,090 processing. 625 00:35:17,090 --> 00:35:20,270 As I said in my introductory section, 626 00:35:20,270 --> 00:35:22,620 I believe this is a very important area, 627 00:35:22,620 --> 00:35:26,510 but I also believe that there are many open questions that 628 00:35:26,510 --> 00:35:27,810 remain to be answered. 629 00:35:27,810 --> 00:35:31,670 We are slowly answering these questions but a lot of work 630 00:35:31,670 --> 00:35:33,440 remains to be done. 631 00:35:33,440 --> 00:35:36,330 How much parallelism is there in applications, 632 00:35:36,330 --> 00:35:39,440 what is the granularity of this parallelism, 633 00:35:39,440 --> 00:35:41,420 how should the parallelism be specified? 634 00:35:41,420 --> 00:35:43,910 Should it be specified explicitly, 635 00:35:43,910 --> 00:35:45,620 should it be implicit in the language, 636 00:35:45,620 --> 00:35:48,230 should it be derived by compilers? 637 00:35:48,230 --> 00:35:51,350 What are the appropriate resource management 638 00:35:51,350 --> 00:35:54,230 and scheduling approaches? 639 00:35:54,230 --> 00:35:57,700 640 00:35:57,700 --> 00:36:00,640 From what class of problems do you need a global address space 641 00:36:00,640 --> 00:36:03,790 for what kinds of problems will explicit message passing be 642 00:36:03,790 --> 00:36:04,510 good? 643 00:36:04,510 --> 00:36:07,600 What is the impact of communication latency 644 00:36:07,600 --> 00:36:08,980 and bandwidth on performance. 645 00:36:08,980 --> 00:36:12,910 And the bottom line really is what kind of performance 646 00:36:12,910 --> 00:36:16,300 can you sustain across a wide range of applications 647 00:36:16,300 --> 00:36:18,790 on a given architecture. 648 00:36:18,790 --> 00:36:21,670 As a community we are making progress on these questions 649 00:36:21,670 --> 00:36:24,550 but I believe a lot more work remains to be done. 650 00:36:24,550 --> 00:36:27,430 Now two or three years ago the situation I believe 651 00:36:27,430 --> 00:36:29,770 was as follows. 652 00:36:29,770 --> 00:36:33,220 We as a technical community had done 653 00:36:33,220 --> 00:36:36,940 a lot of analysis and simulation to start 654 00:36:36,940 --> 00:36:40,420 understanding the answers to some of these questions. 655 00:36:40,420 --> 00:36:45,220 But what was really required was a non-trivial prototypes, 656 00:36:45,220 --> 00:36:47,830 on which we could run real applications 657 00:36:47,830 --> 00:36:49,790 and perform real measurements. 658 00:36:49,790 --> 00:36:52,990 Approximately four years ago, we started down 659 00:36:52,990 --> 00:36:58,930 the path of building a set of prototypes in IBM 660 00:36:58,930 --> 00:37:02,740 to start answering some of these important questions. 661 00:37:02,740 --> 00:37:06,950 I will come to the RP3 program in a minute, 662 00:37:06,950 --> 00:37:11,230 but before I get to that, let me mention that we also 663 00:37:11,230 --> 00:37:14,560 have a message passing parallel processor at IBM 664 00:37:14,560 --> 00:37:16,960 research called Victor. 665 00:37:16,960 --> 00:37:20,230 A 32 node MIMD system has been operational 666 00:37:20,230 --> 00:37:22,600 since the middle of 1987. 667 00:37:22,600 --> 00:37:25,330 This is based on the mass transmuter 668 00:37:25,330 --> 00:37:29,600 and is programmed using C Pascal Fortran outcome. 669 00:37:29,600 --> 00:37:33,210 670 00:37:33,210 --> 00:37:36,780 Victor was and it is a very quick and inexpensive approach 671 00:37:36,780 --> 00:37:41,790 to putting together a parallel process very rapidly. 672 00:37:41,790 --> 00:37:43,620 It has been shown to be exceedingly 673 00:37:43,620 --> 00:37:46,110 good for a variety of applications 674 00:37:46,110 --> 00:37:49,440 where there's a huge amount of parallelism. 675 00:37:49,440 --> 00:37:51,960 Some of these applications are those 676 00:37:51,960 --> 00:37:56,280 that use Monte Carlo methods, ray tracing and fractals. 677 00:37:56,280 --> 00:38:00,330 We are currently implementing circuit simulation on Victor 678 00:38:00,330 --> 00:38:02,760 and we'll use Victor as an experimental vehicle 679 00:38:02,760 --> 00:38:05,610 to study the area of message passing. 680 00:38:05,610 --> 00:38:07,680 A much bigger version of Victor will also 681 00:38:07,680 --> 00:38:09,970 be operational fairly soon. 682 00:38:09,970 --> 00:38:13,180 683 00:38:13,180 --> 00:38:21,220 Moving on then to the RP3 program, our goal in 1984 684 00:38:21,220 --> 00:38:22,440 was really two-fold. 685 00:38:22,440 --> 00:38:25,660 One was to establish the feasibility of building 686 00:38:25,660 --> 00:38:27,850 a multipurpose parallel processor out 687 00:38:27,850 --> 00:38:31,000 of conventional microprocessors and the second 688 00:38:31,000 --> 00:38:34,450 as I said was to provide a set of tools for research 689 00:38:34,450 --> 00:38:42,370 on applications algorithms languages and architectures. 690 00:38:42,370 --> 00:38:46,810 Here's a block diagram of the RP3 machine. 691 00:38:46,810 --> 00:38:52,180 In its initial formulation RP3 was a 512 way parallel 692 00:38:52,180 --> 00:38:57,760 processor and we have actually implemented 1/8 of this 64 693 00:38:57,760 --> 00:39:00,400 processor version of RP3. 694 00:39:00,400 --> 00:39:02,140 Let me just very quickly go through some 695 00:39:02,140 --> 00:39:04,780 of the key attributes of this machine. 696 00:39:04,780 --> 00:39:07,300 First of all, we were really after a very, very powerful 697 00:39:07,300 --> 00:39:10,780 network and the network is built out of the highest performance 698 00:39:10,780 --> 00:39:13,780 bipolar technology and package that was available to us 699 00:39:13,780 --> 00:39:15,800 in that time frame. 700 00:39:15,800 --> 00:39:19,660 The network was also initially the 512 way 701 00:39:19,660 --> 00:39:25,330 and was supposed to have combining basically 702 00:39:25,330 --> 00:39:29,740 lets you access a global memory location that 703 00:39:29,740 --> 00:39:31,930 is end processors can access a global memory 704 00:39:31,930 --> 00:39:35,650 location essentially in constant time for a given machine 705 00:39:35,650 --> 00:39:38,080 and not an order of end-time as would 706 00:39:38,080 --> 00:39:40,000 be the case without combining. 707 00:39:40,000 --> 00:39:42,070 We did not feel that combining was necessary 708 00:39:42,070 --> 00:39:45,070 as part of the 64-way machine and so combining 709 00:39:45,070 --> 00:39:47,930 was not implemented in this version. 710 00:39:47,930 --> 00:39:50,950 So the first important characteristic of RP3 711 00:39:50,950 --> 00:39:52,360 was the switch. 712 00:39:52,360 --> 00:39:54,700 The second important characteristic 713 00:39:54,700 --> 00:40:00,160 was the overall approach to managing caches and managing 714 00:40:00,160 --> 00:40:01,330 memories. 715 00:40:01,330 --> 00:40:04,390 So, cache management was done under software control 716 00:40:04,390 --> 00:40:07,390 hardware mechanisms were not provided, 717 00:40:07,390 --> 00:40:09,370 but essentially data could be tagged as being 718 00:40:09,370 --> 00:40:11,790 cachable or non-cachable. 719 00:40:11,790 --> 00:40:15,550 The memory system was unique in the following sense 720 00:40:15,550 --> 00:40:19,750 even though, the whole machine has a global address space, 721 00:40:19,750 --> 00:40:22,210 this address space can be broken up 722 00:40:22,210 --> 00:40:25,750 into a local address space and a global address space. 723 00:40:25,750 --> 00:40:27,790 And the amount of local and global 724 00:40:27,790 --> 00:40:29,980 can be changed dynamically at runtime 725 00:40:29,980 --> 00:40:31,120 on the program controlled. 726 00:40:31,120 --> 00:40:33,280 So the two extremes of this would 727 00:40:33,280 --> 00:40:35,410 be to have all the memory configured 728 00:40:35,410 --> 00:40:38,410 as part of the global address space in which case, 729 00:40:38,410 --> 00:40:41,770 it becomes a true shared memory machine. 730 00:40:41,770 --> 00:40:44,770 The other extreme would be to configure the address 731 00:40:44,770 --> 00:40:47,230 space as pieces of local address space 732 00:40:47,230 --> 00:40:50,030 only without any shared address space and in that case, 733 00:40:50,030 --> 00:40:53,060 the machine becomes similar to a message passing machine. 734 00:40:53,060 --> 00:40:55,330 So part of the motivation for doing repeat three 735 00:40:55,330 --> 00:40:59,950 was to build a machine that was a flexible tool for studying 736 00:40:59,950 --> 00:41:04,700 various classes of architectures. 737 00:41:04,700 --> 00:41:10,670 A third important feature of RP3 was very extensive 738 00:41:10,670 --> 00:41:13,580 built in monitoring capabilities. 739 00:41:13,580 --> 00:41:17,390 These monitoring capabilities allow us to measure things 740 00:41:17,390 --> 00:41:25,280 like the ratio of local to global memory access, the cache 741 00:41:25,280 --> 00:41:28,310 ratios, instruction frequencies, delays 742 00:41:28,310 --> 00:41:30,560 caused due to network access and so on. 743 00:41:30,560 --> 00:41:33,980 There's a processor as part of every one of these-- 744 00:41:33,980 --> 00:41:37,310 there's a chip which is part of each processing element 745 00:41:37,310 --> 00:41:43,200 that gathers this data while the machine is executing. 746 00:41:43,200 --> 00:41:51,110 So these were some of the important features of the RP3 747 00:41:51,110 --> 00:41:52,420 program. 748 00:41:52,420 --> 00:41:53,990 We also felt initially that it was 749 00:41:53,990 --> 00:41:58,100 important to get a level of performance high enough to make 750 00:41:58,100 --> 00:41:59,870 the machine interesting. 751 00:41:59,870 --> 00:42:02,750 And the initial levels of performance 752 00:42:02,750 --> 00:42:06,500 were to get roughly a billion instructions per second 753 00:42:06,500 --> 00:42:08,720 by the use of 512 processors. 754 00:42:08,720 --> 00:42:11,460 755 00:42:11,460 --> 00:42:13,800 In addition to the hardware there 756 00:42:13,800 --> 00:42:16,740 was a fairly extensive undertaking 757 00:42:16,740 --> 00:42:23,130 in terms of developing software and to develop 758 00:42:23,130 --> 00:42:24,840 software tools that would allow us 759 00:42:24,840 --> 00:42:26,670 to experiment with parallelism. 760 00:42:26,670 --> 00:42:30,390 One of these tools as we VM/EPEX, which 761 00:42:30,390 --> 00:42:36,720 allows experiments with parallelism on a 370 762 00:42:36,720 --> 00:42:39,550 multiprocessor as opposed to the RP3. 763 00:42:39,550 --> 00:42:42,780 It was developed to allow us to do 764 00:42:42,780 --> 00:42:45,990 the necessary work and languages and applications 765 00:42:45,990 --> 00:42:48,000 while RP3 was being developed. 766 00:42:48,000 --> 00:42:50,490 As a side effect this piece of software 767 00:42:50,490 --> 00:42:55,000 also provides real speed up on a 370 multiprocessor. 768 00:42:55,000 --> 00:42:59,170 And this is been operational at several sites. 769 00:42:59,170 --> 00:43:03,910 Together with VM/EPEX we've also developed several tools 770 00:43:03,910 --> 00:43:07,840 which allow us to do post-mortem analysis of applications 771 00:43:07,840 --> 00:43:12,370 that are run under VM/EPEX in a parallel way. 772 00:43:12,370 --> 00:43:14,620 So what you can do is you can take an application, 773 00:43:14,620 --> 00:43:16,510 you can run it under VM/EPEX, you 774 00:43:16,510 --> 00:43:20,200 can gather trace information about the execution, 775 00:43:20,200 --> 00:43:23,110 and then you can run that trace information 776 00:43:23,110 --> 00:43:25,630 through various simulators and schedulers 777 00:43:25,630 --> 00:43:29,240 to try to understand what the effect of the switch would be, 778 00:43:29,240 --> 00:43:32,500 and to try to understand what efficiency you would get out 779 00:43:32,500 --> 00:43:36,130 of an RP3 like machine. 780 00:43:36,130 --> 00:43:38,950 This particular slide is not important 781 00:43:38,950 --> 00:43:40,930 but it shows the kind of information 782 00:43:40,930 --> 00:43:43,210 or one example of the kind of information one 783 00:43:43,210 --> 00:43:48,100 can get out of VM/EPEX and these various tools that I described. 784 00:43:48,100 --> 00:43:51,520 To summarize then, I said earlier 785 00:43:51,520 --> 00:43:55,690 that we had two overriding goals for the RP3 project. 786 00:43:55,690 --> 00:43:58,000 One was to construct a multipurpose parallel 787 00:43:58,000 --> 00:43:59,350 processor. 788 00:43:59,350 --> 00:44:01,840 And the features that make RP3 multi-purpose 789 00:44:01,840 --> 00:44:03,610 are listed on the slide. 790 00:44:03,610 --> 00:44:06,370 It uses a general-purpose microprocessor, 791 00:44:06,370 --> 00:44:09,320 it's an MIMD structure, it has a very low latency, 792 00:44:09,320 --> 00:44:12,040 high bandwidth, self-loading network, 793 00:44:12,040 --> 00:44:15,880 and it provides for flexible local and global addressing 794 00:44:15,880 --> 00:44:19,750 mechanisms together with a fairly general I/O subsystem 795 00:44:19,750 --> 00:44:23,170 that I didn't talk about. 796 00:44:23,170 --> 00:44:25,900 What makes RP3 an experimental facility? 797 00:44:25,900 --> 00:44:28,360 The system monitoring facilities is the fact 798 00:44:28,360 --> 00:44:32,830 that it uses standard operating systems, 799 00:44:32,830 --> 00:44:34,960 it has a UNIX based operating system that 800 00:44:34,960 --> 00:44:37,700 uses conventional programming languages, 801 00:44:37,700 --> 00:44:41,230 and we have a fairly extensive suite of application 802 00:44:41,230 --> 00:44:43,450 development and analysis tools. 803 00:44:43,450 --> 00:44:46,780 The RP3 also has adequate system reliability 804 00:44:46,780 --> 00:44:50,780 to serve as an experimental facility. 805 00:44:50,780 --> 00:44:54,090 And I'd like to make two points. 806 00:44:54,090 --> 00:44:57,800 One is that this whole area is extremely 807 00:44:57,800 --> 00:44:59,560 important to our industry. 808 00:44:59,560 --> 00:45:03,020 809 00:45:03,020 --> 00:45:07,410 If I were to net it out I see two very important trends. 810 00:45:07,410 --> 00:45:11,900 One is the area of mainframe parallelism. 811 00:45:11,900 --> 00:45:14,450 Here we will construct the highest performance 812 00:45:14,450 --> 00:45:17,840 uniprocessors and we will use parallelism 813 00:45:17,840 --> 00:45:21,650 to get performance levels beyond that possible 814 00:45:21,650 --> 00:45:23,630 with a uniprocessor. 815 00:45:23,630 --> 00:45:25,460 To a large extent, this will rely 816 00:45:25,460 --> 00:45:28,580 on advances in technology and packaging 817 00:45:28,580 --> 00:45:30,080 and that's where the risk will be. 818 00:45:30,080 --> 00:45:32,840 To a large extent mainframe parallelism 819 00:45:32,840 --> 00:45:36,020 will use evolutionary software. 820 00:45:36,020 --> 00:45:39,320 The other very important trend is the whole area 821 00:45:39,320 --> 00:45:42,350 of microprocessor-based parallelism, where 822 00:45:42,350 --> 00:45:46,880 we will use commodity parts and higher degrees of parallelism. 823 00:45:46,880 --> 00:45:50,360 And the key issue here will be advances in software 824 00:45:50,360 --> 00:45:53,540 that will be required to make these efficient use 825 00:45:53,540 --> 00:45:55,290 of these higher degrees of federalism. 826 00:45:55,290 --> 00:45:57,980 827 00:45:57,980 --> 00:46:00,770 The second sort of summary comment I would like to make 828 00:46:00,770 --> 00:46:04,300 is that parallel processing is important today. 829 00:46:04,300 --> 00:46:05,910 It's starting to be used. 830 00:46:05,910 --> 00:46:08,300 And I believe that this technology will be 831 00:46:08,300 --> 00:46:10,050 very pervasive in the future. 832 00:46:10,050 --> 00:46:13,410 This is one of the key systems issues that we have to address 833 00:46:13,410 --> 00:46:15,830 and we have to solve all the way from workstations 834 00:46:15,830 --> 00:46:18,560 to supercomputers. 835 00:46:18,560 --> 00:46:20,810 I would also like to leave you with the thought 836 00:46:20,810 --> 00:46:24,120 that we are actively pursuing in IBM 837 00:46:24,120 --> 00:46:26,490 several different systems approaches. 838 00:46:26,490 --> 00:46:29,900 I've tried to identify and briefly explain 839 00:46:29,900 --> 00:46:31,400 most of these to you. 840 00:46:31,400 --> 00:46:33,860 And some of these are being pursued as products 841 00:46:33,860 --> 00:46:36,780 and some of them are in research. 842 00:46:36,780 --> 00:46:38,900 That brings me to the end of my talk 843 00:46:38,900 --> 00:46:41,150 and I will now take questions. 844 00:46:41,150 --> 00:46:44,210 And the first question from Gordon Bell at Arden computer. 845 00:46:44,210 --> 00:46:47,030 Why did the GR11 and RP3 parallel machines 846 00:46:47,030 --> 00:46:50,210 take so long to develop, and what did you learn? 847 00:46:50,210 --> 00:46:53,990 I think the basic reason why we have spent more time 848 00:46:53,990 --> 00:46:57,530 than we initially anticipated on board these machines is 849 00:46:57,530 --> 00:47:02,100 that they are both in their own way very complex, 850 00:47:02,100 --> 00:47:03,170 large systems. 851 00:47:03,170 --> 00:47:13,910 In neither case did we really exploit the CMOS building block 852 00:47:13,910 --> 00:47:18,740 replicated to produce a very high performance machine. 853 00:47:18,740 --> 00:47:23,280 The RP3 became complex for a variety of reasons. 854 00:47:23,280 --> 00:47:27,800 There was a significant amount of basic logic design 855 00:47:27,800 --> 00:47:30,860 that we ended up having to do. 856 00:47:30,860 --> 00:47:34,950 A lot of this was in the cache and memory system. 857 00:47:34,950 --> 00:47:42,200 We also ended up using multiple technologies and multiple power 858 00:47:42,200 --> 00:47:45,020 supplies all of which contributed 859 00:47:45,020 --> 00:47:47,840 to the overall complexity of the machine. 860 00:47:47,840 --> 00:47:51,850 In the case of GF11 also we didn't use a standard building 861 00:47:51,850 --> 00:47:52,350 block. 862 00:47:52,350 --> 00:47:57,800 We ended up building our own uniprocessors that 863 00:47:57,800 --> 00:48:01,840 were fairly significant in their complexity 864 00:48:01,840 --> 00:48:04,670 and we ended up building a fairly complex interconnection 865 00:48:04,670 --> 00:48:05,420 network. 866 00:48:05,420 --> 00:48:08,990 For myself Gerald Maguire at Columbia University. 867 00:48:08,990 --> 00:48:12,440 For such calculations as used in QCD, for computations 868 00:48:12,440 --> 00:48:14,630 run for a long period of time, what 869 00:48:14,630 --> 00:48:17,630 is the reliability of the machine, what about the problem 870 00:48:17,630 --> 00:48:19,070 of detecting faults? 871 00:48:19,070 --> 00:48:22,950 That's a very good question. 872 00:48:22,950 --> 00:48:28,050 As you probably know the GF11-- 873 00:48:28,050 --> 00:48:32,670 that the QCD competition is basically 874 00:48:32,670 --> 00:48:35,730 a fairly stable competition and it's a Monte Carlo computation. 875 00:48:35,730 --> 00:48:39,150 In fact, the physicist on the project 876 00:48:39,150 --> 00:48:44,670 has always had the view that if you randomly lost a bit once 877 00:48:44,670 --> 00:48:46,680 and away, once in a meaning let's say once 878 00:48:46,680 --> 00:48:49,530 every day or every several hours, 879 00:48:49,530 --> 00:48:52,080 it wouldn't in fact significantly affect 880 00:48:52,080 --> 00:48:52,860 the competition. 881 00:48:52,860 --> 00:48:54,060 That's point 1. 882 00:48:54,060 --> 00:48:56,790 Point 2 is that the GF11 is designed 883 00:48:56,790 --> 00:49:00,720 so that you can do a dump of the entire memory state 884 00:49:00,720 --> 00:49:02,950 of the system very rapidly. 885 00:49:02,950 --> 00:49:08,190 So you can dump the entire state in approximately 12 seconds. 886 00:49:08,190 --> 00:49:11,490 And therefore one could very frequently 887 00:49:11,490 --> 00:49:13,407 checkpoint the system. 888 00:49:13,407 --> 00:49:15,240 So I think basically the way one will end up 889 00:49:15,240 --> 00:49:18,000 doing the QCD computation is either 890 00:49:18,000 --> 00:49:21,000 to do every computation twice and compare the answers, 891 00:49:21,000 --> 00:49:24,780 or to split the machine into two pieces and run the problem 892 00:49:24,780 --> 00:49:28,180 and compare processor by processor of the results. 893 00:49:28,180 --> 00:49:31,020 That's how you get your basic error detection. 894 00:49:31,020 --> 00:49:33,180 And with that, you would do two things. 895 00:49:33,180 --> 00:49:34,680 You would look at the errors and you 896 00:49:34,680 --> 00:49:36,690 would see if they were truly just random errors 897 00:49:36,690 --> 00:49:37,773 or there was some pattern. 898 00:49:37,773 --> 00:49:40,620 If there was a pattern there would be a problem. 899 00:49:40,620 --> 00:49:42,180 Secondly, you could always back up 900 00:49:42,180 --> 00:49:44,520 and restart the computation because of this check 901 00:49:44,520 --> 00:49:46,310 pointing capability. 902 00:49:46,310 --> 00:49:50,190 So I think we'll end up running GF11 and its architecture 903 00:49:50,190 --> 00:49:52,800 to do that basically split the machine into 904 00:49:52,800 --> 00:49:56,190 and run the identical computation on both sides 905 00:49:56,190 --> 00:49:59,540 and compare the results of processors. 906 00:49:59,540 --> 00:51:17,000