[MUSIC PLAYING]

To perform a task in parallel is to have multiple entities working together toward a single goal. Likewise, parallel computing is the use of multiple processors together in a single machine to solve a single problem. Vital application areas that would immediately benefit from the availability of parallel computing devices include superconductor physics, atmospheric research, weather modeling, defense, astronomy, genetics, oil exploration, and aerospace. Such machines could improve current computational capabilities by several orders of magnitude. However, today's parallel processors remain in the domain of sophisticated specialists and are extremely difficult to use by scientists and engineers who want to solve routine yet time-consuming problems. The ultimate challenge is to bring parallel computing into the mainstream.

The MONSOON Project is rising to meet this challenge. MONSOON is a joint research effort by the Motorola Computer Group in Cambridge, Massachusetts and Tempe, Arizona and the Massachusetts Institute of Technology Laboratory for Computer Science. The project is funded by DARPA and Motorola. The goal of the project is to demonstrate the feasibility of dataflow multiprocessors using implicitly parallel programming languages.

Professor Arvind, of MIT's Laboratory for Computer Science Computation Structures Group, is a leader in the dataflow research effort:

Why is it that we don't have more parallel machines? The economics of it is not in the way. I think there is fairly strong agreement that software is the obstacle to making parallel systems more readily available. Small multiprocessors available today have had economic success, but only in limited application areas, applications where each processor can be dedicated easily to a specific task. However, such machines can't exploit the volumes of software available on uniprocessors. This says something about the state of the languages in use and the compilers in use.
So users have to resort to some sort of explicit parallel programming to get the job done. It turns out that explicit parallel programming is quite difficult. People are reluctant to do it not only because it's difficult, but because there is no guarantee of portability. It's not clear if a program you have put some effort into making parallel will port to a machine with more processors, or somebody else's machine, or, for that matter, even to the next generation of the same machine. And all of this puts a big damper on trying to parallelize a program.

Dr. Ken Traub of the Motorola Cambridge Research Center:

Another kind of problem with explicit parallel programming, which is even more serious, is that these kinds of programs don't have very good compositionality properties. We're used to writing subroutines, then calling subroutines from all over the place, having subroutine libraries, and so on. But any experienced programmer will tell you that when you're working with parallel subroutine libraries, you can't count on how well they'll work under different circumstances. And this is especially true of the data parallel model.

If you look at the other extreme, which is massively parallel machines, it's obvious that there is some success. We read about these machines in newspapers all the time. People are actually using these machines to solve some very important problems. This is in spite of the extreme difficulty of programming massively parallel machines. Right now, these machines are being used to solve problems which cannot be solved any other way. Working on parallel machines has its own sex appeal; the difficulty becomes the challenge. But the point is, if parallel machines remain so difficult to program, then it's obvious that they will remain on the fringes. They can never become mainline computing engines that we use every day in our offices, in our homes. So the key issue here is the ease of programming and the languages we use to program these machines.
Computer languages play a pivotal role in the degree to which the parallelism of a problem is exposed or obscured. To illustrate, consider the following simple problem, known as "wavefront." The problem is to construct an n by n matrix such that the top row and leftmost column contain 1s, and each of the other elements in the matrix is defined in terms of its neighbors to the north and west, as shown. It is not difficult to see that after the elements in one diagonal have been computed, the elements in the following diagonal can be computed in parallel, and so on, until the matrix is full. The number of operations that can be performed simultaneously at each time step is displayed by a graph called a parallelism profile. Clearly, the number of operations that can be performed in parallel increases until the middle diagonal is computed and then decreases.

Now, consider this Fortran program to solve wavefront. It consists of an initialization step, doubly nested do-loops, and the computation for each element X(I,J). The semantics of Fortran dictates the precise order in which the elements of the matrix are computed, namely, top row to bottom row, and left to right within each row. The difficulty of performing the required analysis to determine what can be done in parallel, even for this simple program, is beyond the capability of almost all commercial compilers.

Let's examine the steps we have taken to execute the Fortran wavefront program in parallel. We have started out with a specification of a problem in which the parallelism is evident, encoded it in a sequential language, Fortran in this case, and then are forced to rely on a sophisticated compiler to produce parallel code from the sequential program. It seems to me to be far more effective to use a language in which we can encode our algorithms such that the parallelism in the original problem is not obscured, but is actually preserved by the code.
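The Fortran source appears only on screen in the video, so it is not reproduced here. As a rough sketch of the sequential computation the narration describes, here is a Python rendering of the same doubly nested loop; the combining function for the interior elements is not stated aloud, so a simple sum of the north and west neighbors is assumed.

```python
# A rough sketch, in Python rather than Fortran, of the sequential wavefront
# computation the narration describes. The combining function is not given in
# the video; a simple sum of the north and west neighbors is assumed here.

def wavefront_sequential(n):
    # Initialization: top row and leftmost column contain 1s.
    x = [[1] * n] + [[1] + [0] * (n - 1) for _ in range(n - 1)]
    # The doubly nested loops fix the order: top row to bottom row,
    # left to right within each row, exactly as Fortran semantics dictate.
    for i in range(1, n):
        for j in range(1, n):
            x[i][j] = x[i - 1][j] + x[i][j - 1]   # north + west (assumed)
    return x

print(wavefront_sequential(4))
# [[1, 1, 1, 1], [1, 2, 3, 4], [1, 3, 6, 10], [1, 4, 10, 20]]
```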
The language Id, developed by Professor Arvind and his colleagues, is such a language. As an example, here is an Id program for wavefront. First, the dimensions of the matrix are defined. Then the boundary elements of the matrix are defined, as well as the remaining elements. Note that this is a declarative specification of the problem. Nothing is implied about the order of computation, so whatever can be computed in parallel, subject to data dependencies, will be computed in parallel.

To illustrate the value of compositionality, consider an extension of wavefront in which the value of element X(I,J) is a function of its neighbors to the north and west and its own previous value. This is to be iterated multiple times. At first glance, the parallelism profile of this program would appear to be that of a series of wavefronts laid end to end. This would be the parallelism obtained in Fortran by simply composing the earlier program in a loop. However, the parallelism that is actually present in this problem is far greater than what might be expected, leading to a much shorter execution time. This parallelism profile is what is obtained in Id when the original Id program is enclosed in a loop.

Thus, the power of implicit parallel programming leads to programs which are both high level and exhibit a degree of parallelism which would be almost unthinkable to express in languages like Fortran. I believe that implicit parallel programming will be able to jump over the final hurdle to make parallel machines really commonplace. It can bring down the programming difficulty to a level where users would as happily program parallel machines as they would program sequential machines.

Compilers for implicitly parallel languages, such as Id, are likely to generate a multitude of threads of computation, reflecting the large amounts of parallelism which can be exposed in programs.
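Likewise, the Id source is shown only on screen. To make the diagonal structure of the dependencies concrete, here is a small Python sketch (not Id) that evaluates the same recurrence one anti-diagonal at a time and records how many elements become computable at each step; the resulting counts form exactly the kind of parallelism profile described above, rising to the middle diagonal and then falling. The sum combining function is again an assumption.

```python
# A sketch (in Python, not Id) of wavefront evaluated one anti-diagonal at a
# time. All elements on a diagonal depend only on the previous diagonal, so
# they could be computed in parallel; the length of each diagonal gives the
# parallelism profile described in the narration.

def wavefront_by_diagonals(n):
    x = [[1 if i == 0 or j == 0 else None for j in range(n)] for i in range(n)]
    profile = []
    for d in range(2, 2 * n - 1):                 # anti-diagonal index i + j
        cells = [(i, d - i) for i in range(1, n) if 1 <= d - i < n]
        for i, j in cells:                        # independent of one another
            x[i][j] = x[i - 1][j] + x[i][j - 1]   # north + west (assumed)
        profile.append(len(cells))
    return x, profile

x, profile = wavefront_by_diagonals(5)
print(profile)   # [1, 2, 3, 4, 3, 2, 1] -- rises to the middle diagonal, then falls
```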
Therefore, it is essential that the hardware architecture for the system be designed with this in mind.

[MUSIC PLAYING]

In the MONSOON architecture, the state of the computation is represented as a tree of activation frames, which are similar to stack frames in conventional machines. However, instead of only having the most recent frame active at any time, in the MONSOON architecture multiple procedures are active simultaneously. In addition, the control state is virtualized, so that within activation frames there are multiple threads of control which may be simultaneously active.

In any machine architecture which supports a parallel model of computation, there are certain fundamental issues which must be addressed in order to achieve efficient processor utilization. The first of these issues is memory latency. Let's look at a computing node in any multinode system. There is a processor component with an on-chip memory cache, typically some local memory, and an interface to the communication network. A reference by a processor to local memory may be five times slower than a reference which hits the processor's local cache. And even in a well-designed machine, the latency incurred by a reference to a different node in the machine may be many times slower yet.

Professor Greg Papadopoulos of MIT's Computation Structures Group:

When you're looking at how you're going to run programs on this machine, you have to ask yourself the question, how often am I going to require communication with another node? And what will the penalty of such communication be? For example, will I just sit there, and wait, and idle while I wait for the response to the request that I've issued? Or can I be more creative and do something else, try to amortize or minimize the latency? One natural way to tolerate communication latency is to issue many remote requests into the network.
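As a back-of-the-envelope illustration of that last point (this is not MONSOON code), the following Python sketch issues eight simulated remote fetches either one at a time or with all of them outstanding at once; the 100-millisecond latency figure is invented for the example.

```python
# A toy illustration of hiding latency by keeping many remote requests
# outstanding. Each simulated remote fetch takes 100 ms; issuing eight of them
# one at a time costs ~800 ms of wall-clock time, while issuing all eight
# before waiting on any response costs only ~100 ms.

import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.1          # assumed round-trip network latency, in seconds

def remote_fetch(addr):
    time.sleep(LATENCY)            # stand-in for a request crossing the network
    return addr * 2                # stand-in for the value stored at 'addr'

addrs = list(range(8))

t0 = time.time()
sequential = [remote_fetch(a) for a in addrs]          # wait after each request
t1 = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    overlapped = list(pool.map(remote_fetch, addrs))   # all requests in flight
t2 = time.time()

assert sequential == overlapped
print(f"one at a time:   {t1 - t0:.2f} s")   # ~0.80 s
print(f"all outstanding: {t2 - t1:.2f} s")   # ~0.10 s
```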
Although each request experiences the latency of the network, the processor which originates the requests is kept busy by the multiple responses which ultimately return. In order to successfully generate sufficient remote memory requests to keep the communication network full and the processor busy, there must be sufficient parallelism available, and the processor must also be able to perform rapid context switching across multiple threads of control, each of which can generate requests into the network.

Remote requests made by a processor will produce responses which return in a different order than they were issued. To impose an ordering requirement on the network is practically unmanageable.

Well, as they say, talk is cheap; it's listening that's an art. It's easy if you assume that you have lots of threads to generate as many outgoing messages as you want. The really hard part is, when you think about a message coming back into the processor, how are you going to relate it to the waiting threads of control? And is this really the only datum that's required to continue the computation?

In order to relate a response to a waiting activity within a processor, the processor must have the ability to perform efficient synchronization. In general, multiple responses may be required to activate a thread.

What's interesting is that if you just tackle this latency problem, you end up with a machine that looks like this. You would imagine a few contexts that you could rapidly switch among in each processor, and then some synchronization capabilities in the memory. But that hardware in and of itself doesn't quite get you to the point that's expected by the compiler. And the key difference is, while the machine needs only enough contexts to keep the communication network filled, the compiler must be able to name enough activities in the machine to support the fully parallel model.
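To make the synchronization problem concrete, here is a toy Python model (not the MONSOON hardware) in which every response carries a tag naming the waiting activity's operand, and the activity is enabled only after all of the responses it needs have arrived, in whatever order they return.

```python
# A toy model of matching out-of-order responses to waiting activities.
# Every request carries a tag naming the operand that is waiting for it;
# an activity becomes runnable only once all of its responses have arrived.

from dataclasses import dataclass, field

@dataclass
class WaitingActivity:
    name: str
    needed: int                       # how many responses must arrive
    values: dict = field(default_factory=dict)

    def deliver(self, tag, value):
        """Accept one response; return True when the activity becomes runnable."""
        self.values[tag] = value
        return len(self.values) == self.needed

# Two responses are required before 'add' can run; arrival order is irrelevant.
act = WaitingActivity("add", needed=2)
print(act.deliver("west", 3))    # False -- still waiting on the other operand
print(act.deliver("north", 4))   # True  -- both operands present, thread enabled
print(sum(act.values.values()))  # 7
```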
So the dataflow approach is, rather than binding the number of threads that we can name at a time to physical resources, like the number of registers in a processor, we instead virtualize this and come up with a very lightweight multithreaded model. So threads in MONSOON look like lightweight computation descriptors, which we can switch amongst very rapidly. In fact, on any processor there can be thousands of thread descriptors awaiting execution at any given time.

The thread descriptors consist of a pointer to an instruction in memory, a pointer to the activation frame the thread is associated with, and a few registers. Two independent thread descriptors may synchronize by using the presence bits found on every word of frame memory. So the hardware design of MONSOON emphasizes very rapid switching across threads. In fact, it is possible to join two threads of computation with one cycle of overhead, and a new thread can be formed from a currently executing thread with zero cycles of overhead.

To summarize, the MONSOON approach to parallel processing brings together an implicitly parallel programming language, Id, which preserves the parallelism present in algorithms, and an architecture which supports the compiling model for the language, as well as addressing the fundamental problems of latency and synchronization.

Certainly, we oftentimes write programs where we don't know where the parallelism is, but the gut feeling is there must be some. The original algorithms for these codes, even though no attention may have been paid to parallelism, perhaps do have sufficient parallelism. If you just took the same algorithm and encoded it in Id, there is a good chance that you would get significant speedups of your program, regardless of how unstructured the code is. The role of Id, or implicit parallel programming, is to make sure that the good properties, the parallelism of your algorithm, are not obscured, that the language is not in the way.
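The presence-bit join can be sketched in the same toy style. The following Python fragment is an analogy, not MONSOON's actual pipeline: the first thread to arrive at a frame slot parks its value and suspends, and the second finds the slot full and continues as the joined computation.

```python
# A toy sketch of a two-way join on a frame slot guarded by a presence bit.
# The first thread to arrive stores its value and stops; the second finds the
# slot full, so the joined computation continues immediately -- the analogue of
# the one-cycle join described above.

EMPTY, FULL = 0, 1

class FrameSlot:
    def __init__(self):
        self.presence = EMPTY
        self.value = None

    def join(self, value, combine):
        """First arrival parks its value; second arrival fires 'combine'."""
        if self.presence == EMPTY:
            self.presence, self.value = FULL, value
            return None                       # this thread has nothing more to do
        self.presence, parked = EMPTY, self.value
        return combine(parked, value)         # continue as the joined thread

slot = FrameSlot()
print(slot.join(3, lambda a, b: a + b))   # None -- first token parked
print(slot.join(4, lambda a, b: a + b))   # 7    -- join completes
```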
So the role of the machine has got to be to efficiently exploit all sorts of parallelism which are exposed by these wonderful languages. It seems to me that the sky is the limit.

[MUSIC PLAYING]