[MUSIC PLAYING]

On behalf of the MIT Summer Session, let me indicate how happy we are to be participating in this edition of the University Video Communications Distinguished Lecture Series. Our speaker tonight, Guy Steele, received his PhD in computer science here at MIT in 1980. He's probably best known to those of us here as the co-developer of Scheme and of Common Lisp, or perhaps as the editor and chief organizer behind The Hacker's Dictionary. He, however, has many talents outside computer science. He is, for example, a composer and singer with the distinction, probably the unique distinction, of having his sheet music published in the Communications of the ACM. He also has demonstrated his programming skills by becoming an outstanding Chinese chef. He is a senior scientist at Thinking Machines Corporation, where he is responsible for the design and implementation of parallel programming languages and other system software for the Connection Machine computer system. He will be talking to us this evening about data parallel algorithms. Guy.

The data parallel programming style is an approach to organizing programs suitable for execution on massively parallel computer systems. In this talk, we're going to characterize the data parallel programming style, we're going to examine the building blocks used to construct data parallel programs, and we will see how to fit these building blocks together to make useful algorithms.

All programs consist of code and data working together, and there's usually some processing agency, a processor, which, using the code as instructions, operates upon the data to compute some useful result. Now, there are different ways of building this processor. If you have a single large processor, or perhaps even a single small processor, operating upon the code and operating on a single data item at a time, you have a typical serial, sequential von Neumann computer.
On the other hand, you might have many individual processing elements interpreting the code and acting upon the data, in which case we have a parallel computing system, and there are different ways of organizing this parallelism.

In control parallelism, the emphasis is on organizing the program so that the processing elements can take advantage of parallelism in the code. Typically, in this style, different processing elements might be at different places in the program text, so that you have different pieces of the program being executed all at the same time. Now, of course, you don't simply operate by executing instructions without regard to the data. Each of the processing elements must have some access to the data, and this may be in a regular or irregular pattern. But the emphasis is on extracting parallelism by orienting the program's organization around the parallelism in the code.

In the data parallel programming style, by contrast, the emphasis is on organizing the programs to extract parallelism from the organization of the data. And the underlying metaphor is to think of there being enough processing elements that for every data item of interest, there's a processor that can be attached to it. And again, it's not that the processing elements are operating on the data without regard to the code; there still must be some means of having the processors interpret the code. Now, typically, but not necessarily, all of the processors will be at roughly the same point in the program. This is characteristic of the data parallel programming style, but it's not a necessary feature. It is possible to organize a program in a data parallel way and yet have the processors be at different points in the code. But having a single locus of control most of the time is a typical characteristic of this style.

Now, I want to distinguish this contrast between control parallelism and data parallelism, on the one hand,
from a slightly different distinction, about architectural styles, about the design of hardware, on the other: the organizations usually labeled SIMD, for single instruction, multiple data, versus MIMD, for multiple instruction, multiple data. I think it's important to draw the distinction between the style in which a program is written and the way in which the underlying hardware is organized, because the two don't necessarily have to be the same. It is perfectly reasonable to write a data parallel program which is then executed on a MIMD computer, or, for that matter, on a sequential computer. And similarly, a control parallel program can be executed on a sequential or a SIMD kind of computer. And so my emphasis in this talk is going to be on styles of organizing programs rather than styles of organizing hardware. It then becomes an engineering issue whether it's worthwhile to match the hardware style to the program style or not.

Now, the sequential programming style, typified by programming languages such as C and Pascal, has certain standard themes and certain standard building blocks which you see all the time. Examples of these are scalar arithmetic operators, control structures such as loops and if-then-else statements, and subscripted array references. And the experienced sequential programmer knows a great many standard and useful ways of fitting these building blocks together to make useful algorithms. Furthermore, he has a good idea of the costs of these individual building blocks. While it's nowhere cast in stone, most of us as sequential programmers know that an addition, a subtraction, and perhaps a comparison all cost roughly the same, whereas a multiplication or division might cost a little bit more, depending on the underlying hardware. Similarly, a procedure call might be a little bit more costly than an if-then-else statement.
And the precise costs depend upon the implementation of the programming language and on the implementation of the hardware. But still, it's important to have a good idea, a reasonably good model, a set of expectations about the underlying cost of the primitives before you can expect to write effective and efficient algorithms in a sequential programming language. Similarly, in order to write good data parallel programs, we need to become familiar with the appropriate building blocks for building data parallel programs, and we need to have a good idea of their relative costs. So now, let us examine certain standard themes that arise naturally in data parallel programs and consider specific building blocks that are used to implement those themes.

The most common themes that I see popping up over and over again in data parallel programs are elementwise operations. These are operations which can be carried out by the individual processors independently, each on its own piece of data, without communication among the processors; typically, these are arithmetic operations and tests of various kinds. There are also conditional operations, which are likewise elementwise, but in which some of the processors perhaps do not participate, or perhaps act in slightly different ways, depending on the content of the data. Then, there are various forms of communication among the processors, which include replication, reduction, permutations, both regular and irregular, and parallel prefix operations. These are by no means the only themes, but these are the ones that I want to focus particular attention on tonight.

Let's take a look at an example of an elementwise operation. Here, we have what looks like an ordinary Fortran or C assignment statement, C gets A plus B. The intended interpretation here is that A and B represent arrays. And here, we see eight processors.
Each processor has its own value for A and its own value for B, and the direction is for each processor to execute an addition and store the result in C. And so you can see that the first processor has added 3 and 6 to get 9. The second processor has different data; it has added 1 and 2 to get 3, and so on.

Similarly, conditional operations can be carried out elementwise. Here, we are comparing the contents of the array A and the array B element by element and setting a flag, shown at the top of the graphic, to indicate whether or not the test succeeded. And so we can see that in the first column, the test failed because 3 is not greater than 6, whereas in the third column, the test succeeded because 4 is greater than 1. And the result is a series of Boolean results, a series of bits, one per processor, which can then be used to conditionalize further operations.

Here, we have an example of using the results of that test to perform a conditional addition. Only processors whose test succeeded in the previous step will execute the addition step. And so we can see that only the third, fourth, fifth, and seventh processors actually performed the addition and stored the result in C. The set of bits which is used to conditionalize the operation is frequently called a condition mask, or a context, for the execution of further operations. And in this way, each processor can perform different computations based on the particular data that it happens to contain.
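As a concrete aside, here is a minimal NumPy sketch of these two themes (not from the talk): one array element stands in for each processor's local value, a Boolean array plays the role of the context mask, and the data values are simply made up.

```python
import numpy as np

# One array element per simulated processor: each processor holds
# its own value of A and of B.  (The values here are made up.)
A = np.array([3, 1, 4, 1, 5, 9, 2, 6])
B = np.array([6, 2, 1, 8, 2, 8, 1, 9])

# Elementwise operation: every processor adds its A to its B at once,
# with no communication between processors.
C = A + B

# Elementwise test: one Boolean flag per processor, which serves as
# the condition mask, or context, for further operations.
mask = A > B

# Conditional operation: only processors whose flag is set store
# A + B into D; the others leave their D untouched.
D = np.zeros_like(A)
D[mask] = A[mask] + B[mask]
```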
Now, let's look at some communications operations. One that happens very frequently is to have a single quantity and want to get a copy of it out to all the processors. This we call broadcasting. Here, we have a single number, 43, and we would like to get it out to all the other processors. Now, there are a variety of ways of doing this, depending on the particular hardware you have to support it and the particular algorithms you want to use. This operation occurs so frequently in data parallel programming that it's worthwhile and easy to support it directly in hardware, and so it's not unusual to see a hardware bus of some kind to carry a single value out to all the multiple processors. On the other hand, there are other ways of doing it as well.

If there are many things to be copied rather than a single thing, a more complex algorithm is called for. Here, we have a typical example in which we have a vector which occupies a row of a matrix, and we would like to copy that vector to all the rows of the matrix. So we don't have a single quantity to be copied; rather, we have many quantities, but they are to be copied in a reasonably regular pattern. Now, one very simple way to do this, if each processor associated with a piece of data in this matrix is able to communicate with its nearest neighbors, is to copy the row step by step down the matrix, each row copying to its nearest neighbor, until the entire matrix is filled. This operation is called spreading. And here, we have managed to copy the first row down to all of the others in seven steps, because this is an 8 by 8 matrix and 7 is 8 minus 1.

But there are other ways of doing it which are faster, if you have additional communications links among the processors that you can exploit. Here, we will assume that a row of a matrix can be communicated not only to the next row down, but, in fact, to a row that is a power of two away. And we will see that we can do the operation much faster. First, we take the first row and copy it to the second row. But now, each of those rows can send copies to their neighbors that are two rows down. And then, each of those rows can send copies to neighbors that are four rows down. And by jumping in increasing steps like this, we can fill the array in only three steps, which is logarithmic in the size of the array, 3 being the base 2 logarithm of 8.
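A minimal sketch of that doubling spread, again simulated in NumPy rather than on real parallel hardware: each pass through the loop stands for one parallel time step in which every already-filled row sends a copy to a row a power of two below it, and the row values are made up.

```python
import numpy as np

n = 8
M = np.zeros((n, n), dtype=int)
M[0] = np.arange(10, 18)        # the row to be spread (made-up values)

# Doubling spread: each loop iteration is one parallel time step in
# which all currently filled rows copy themselves to the rows a fixed
# power-of-two distance below.  The filled region doubles each step,
# so an 8 x 8 matrix is filled in log2(8) = 3 steps.
filled = 1
while filled < n:
    copy = min(filled, n - filled)
    M[filled:filled + copy] = M[:copy]
    filled += copy
```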
And so we have another way of executing the same spread operation, one that is logarithmic in the number of processors rather than linear. And this is typical of a lot of interesting parallel operations that can be carried out in the data parallel style: you see logarithmic execution times popping up all over the place.

Reduction is another very common operation; it is essentially the inverse of broadcasting. In broadcasting, you take a single value and make copies for all of the processing elements. In the case of reduction, each processing element has a value, and you're trying to combine them all in some interesting way to produce a single result. Again, I've shown here how you might want to have some kind of hard-wired facility for doing such a reduction. In this picture, the eight processors shown at the bottom of the screen each have different numbers, and their sum is 27, so the number 27 shows up centrally. There are other ways of doing that as well.

Let us take a look at an algorithm that can be used to sum a vector in logarithmic time. We start out with a vector that has eight elements, x sub 0 through x sub 7. In the first step, we add them pairwise, so that x sub 0 and x sub 1 are added by the second processor, x sub 2 and x sub 3 by the fourth processor, x sub 4 and x sub 5 by the sixth processor, and x sub 6 and x sub 7 by the last processor. And so four of the processors, by doing these additions, have produced partial sums toward the work of computing the total sum. In the second step, we add these partial results pairwise again. And then, in the last step, we again add another pair, and we see that the last processor has received the sum of elements 0 through 7. Notice in this diagram that we've used the notation sigma with a lower and an upper bound to indicate exactly which elements have been summed so far.
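Here is one way to sketch that logarithmic summation in NumPy. The loop body is a single simulated parallel step, the element values are made up (chosen so that their total is 27, like the reduction picture above), and the total lands in the last position, just as it lands in the last processor in the diagram.

```python
import numpy as np

x = np.array([5, 1, 3, 4, 3, 6, 2, 3])   # made-up values whose sum is 27
n = len(x)

# Tree reduction in log2(n) parallel steps: at each step, the receiving
# processors add in the partial sum held `step` positions to their left,
# and the stride doubles, so after the last step the final processor
# holds the sum of the whole vector.
step = 1
while step < n:
    receivers = np.arange(2 * step - 1, n, 2 * step)   # active processors
    x[receivers] += x[receivers - step]                # one parallel addition
    step *= 2

print(x[-1])   # 27, after log2(8) = 3 steps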
Please also note that during most of the course of this algorithm, most of the processors have not been particularly busy. In the first set of additions, only half of the processors have been occupied doing additions; in the next step, only a quarter of them; and in the last step, only one eighth of them. And so we can see that while this is fast, we haven't made use of all of the processors. An interesting question arises: what would happen if you didn't turn off the processors and let some of the other processors also do additions? You would actually get another very interesting building block for data parallel algorithms, which is the parallel prefix operation.

We're going to carry out the same sort of computation again, but leave more processors active. We start out with x sub 0 through x sub 7 laid out. In the first step, we sum pairs, but notice that they're overlapping: we sum x sub 0 and x sub 1, we also sum x sub 1 and x sub 2, and so forth. And so each of the processors except the first has computed a sum. Then, in the next step, we compute more sums; all but the first two processors participate. And then, in the last step, all but the first four processors participate. The result has a very interesting property: each processor has received the sum of what it contained plus the contents of all the processors preceding it. In other words, if we regard the original input as an array, then we've computed the sums of all prefixes of the array, that is, of all initial segments of the array.

I sometimes call this the checkbook operation, because if the numbers that were in the array elements were a series of credits and debits against your checkbook, the checks you'd written and the deposits you'd made, then the result of the sum prefix operation is the series of running balances that should appear in your checkbook. And we will see examples of how this operation can be useful in other ways later, as a building block of other algorithms.
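A minimal NumPy sketch of that sum prefix, or scan, using the same doubling idea (again my own illustration): each pass through the loop is one simulated parallel step, and the zero padding stands for the processors near the front that have nothing to their left to add in.

```python
import numpy as np

x = np.array([5, 1, 3, 4, 3, 6, 2, 3])   # made-up credits and debits
n = len(x)

# Sum prefix (scan) by doubling: at step k, every processor except the
# first 2**k adds in the value held 2**k positions to its left.  After
# log2(n) steps, each processor holds the sum of its own value and all
# the values preceding it, the running balance of the checkbook.
sums = x.copy()
step = 1
while step < n:
    shifted = np.concatenate([np.zeros(step, dtype=sums.dtype), sums[:-step]])
    sums = sums + shifted        # one parallel step: all additions at once
    step *= 2

print(sums)                      # the same result as np.cumsum(x)
```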
A very simple example is an operation known as enumeration. In this operation, we wish to assign a different number to each processor, and this can be accomplished easily in two steps using building blocks we've already seen, the first being broadcasting. We take the number 1 and broadcast it to all the processors. We then apply a sum prefix operation, so that each processor receives the sum of the 1 that it contains plus all the 1s preceding it in the line. And the result is that each processor gets a different number, from 1 through 8. Sometimes this is called the bakery operation, because it's as if all these processors have walked into the bakery, each one has gone up to the little machine and taken a ticket, and each one has gotten a different number. But it's happened very fast, rather than sequentially, one at a time.

Another very important operation is various kinds of motion of data without performing any kind of arithmetic operation, and most of the interesting ones are permutations. Here's an example of shifting a linear array of data. This is an end-around shift by one place; one could imagine shifts of other distances. But in this case, the entire array has been shifted one place to the right, and the rightmost element has been brought back around to the front. And of course, one can do shifting not only on one-dimensional arrays, but also on two-dimensional arrays. You might have a two-dimensional matrix and want to shift it one position to the north, or three positions to the east, or 4,096 positions to the south.

Another kind of permutation is an odd-even swap, shown here for eight elements, in which A and B have exchanged places and C and D have exchanged places. One might also imagine more general forms of swap, where one swaps with a neighbor that is a power of two distance away.
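These regular data motions are easy to sketch in NumPy. The sketch below shows an end-around shift, a two-dimensional shift, and a distance 2-to-the-k swap built by pairing each index with the index that differs from it in bit k; the helper name `swap` and the data values are just made up for illustration.

```python
import numpy as np

v = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# End-around shift by one place: every element moves one position to
# the right, and the rightmost element wraps around to the front.
shifted = np.roll(v, 1)

# The same idea in two dimensions: shift a matrix one position "north"
# (toward lower row indices), with wraparound.
M = np.arange(16).reshape(4, 4)
north = np.roll(M, -1, axis=0)

def swap(a, k):
    # Distance 2**k swap: each element exchanges places with the
    # partner whose index differs from its own in bit k.
    partners = np.arange(len(a)) ^ (1 << k)
    return a[partners]

odd_even  = swap(v, 0)   # A and B exchange, C and D exchange, ...
distance2 = swap(v, 1)   # swap with the neighbor two positions away
```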
Permutations of this form are very often seen as steps in algorithms such as the fast Fourier transform and in various sorting algorithms. The hardware architecture that is capable of performing all distance 2-to-the-k swaps turns out to be equivalent to a hypercube, which is one of the reasons why the hypercube is very popular as an underlying hardware architecture for algorithms of this style: it can easily achieve many of the interesting permutations on data that are necessary building blocks for these algorithms.

Some algorithms call for performing irregular permutations on the data. Most often this comes up when the permutation to be performed is dependent upon the content of the data. In this graphic, we've illustrated a sorting algorithm. And in fact, here it's as if the sort has been computed magically all at once, and every processor has figured out exactly where to send its data. The data is sent from the top row to the bottom row, and, poof, it's ended up sorted. In practice, real sorting algorithms undergo a number of intermediate stages where they calculate intermediate places to send the data, and after a number of steps, the data ends up sorted. But the point is that the particular permutation involved is dependent upon the content of the data. Therefore, it cannot be predicted in advance, and therefore, it cannot be pre-wired into the hardware.

We can see this by considering an example from image processing. Let's suppose that we have an image of, say, 1,000 by 1,000 pixels, a picture that we've taken of a spaceship, and we want to do processing on this image, so we've assigned a processor to each of the pixels of the image. And the question is: where is the rocket ship? Well, some of the operations we need to perform on this image are local. And so we might focus in on a particular region of the image and have each processor look not only at its own pixel value, but also at the pixel values of its neighbors,
and from this determine, for example, whether a particular pixel is on the boundary of an object or somewhere in the middle. This is a local piece of processing. It involves a fixed pattern of communication: regular permutations, shifting the array back and forth so that each processor can look at its neighbor in a particular fixed direction. On the other hand, when it comes time to assemble this boundary information and put it together into a single global object, we can see that the particular patterns of communication are going to be dependent upon the content of the image data. That is, which processors have to communicate with which other ones, so that they can all agree that they are part of the image of the same object, depends upon where the rocket actually happened to be in the image.

Most of the building blocks that we've looked at so far concern themselves with operations on arrays, which are very rigidly organized kinds of data. It is also possible to have very irregularly organized kinds of data. For example, one might have a graph structure in which nodes are organized not by being neighbors within some fixed geometry, but by being connected through pointers instead. In this graphic, we have shown the nodes as being lined up side by side just for ease of seeing what's going on. But in fact, you should imagine them as being eight processors in completely different parts of the machine, known to each other only by an address that uniquely identifies each processor, and we've indicated these addresses by pointers. So we can see that the first processor knows the address of the second processor, and the second processor knows the address of the third.

And I had originally thought that nothing could be more essentially sequential than the need to follow a linked list of pointers, because you just can't find the third one without going through the second one. But I was wrong.
I forgot that there is processing power at each node of the linked list, and you can take advantage of that to do things faster. The most important technique is known as pointer doubling, and this is the pointer analog of the spreading operation that we looked at earlier, in which we made a copy of a vector into every row of a matrix in a logarithmic number of steps.

In the pointer doubling method, each processor first makes a copy of the pointer it has to its neighbor. Then, in a repeated series of steps, each processor looks at the processor that it's pointing to with this extra pointer and gets a copy of its pointer. So at the first step, each processor has a pointer to its neighbor one away in the linked chain. But as the first processor looks into the second processor and gets the pointer to the third processor, and each of the other processors does the same thing, we see that at the next step, each processor will have a pointer to a processor two steps away in the linked chain. And as this operation is repeated, each processor can then have a pointer to a processor four steps away in the chain, except that if you fall off the end of the chain, then you don't update your pointer. And if you keep doing this, then after a number of steps logarithmic in the length of the chain, every processor will have gotten a pointer to the end of the chain. And this happens in only a logarithmic number of steps rather than a linear number of steps. I think this is sort of an interesting and surprising result.
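Here is a minimal way to simulate pointer doubling in NumPy (my own sketch). The successor pointers are just indices into an array, the last node points to itself to mark the end of the chain, which makes the "don't update past the end" rule automatic, and each pass through the loop is one parallel step.

```python
import numpy as np

n = 8
# A linked list simulated with an array of successor addresses:
# processor i points to processor i + 1, and the last processor points
# to itself to mark the end of the chain.  (In general the list could
# be scattered arbitrarily; the index is simply a processor's address.)
ptr = np.arange(1, n + 1)
ptr[-1] = n - 1

# Pointer doubling: every processor replaces its pointer by a copy of
# its neighbor's pointer, so the distance covered doubles each step.
# The self-pointer at the end means falling off the chain is harmless.
for _ in range(int(np.ceil(np.log2(n)))):
    ptr = ptr[ptr]              # one parallel step for all processors

print(ptr)                      # every processor now points at the end
```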
Now, let's see how that can be used by considering nodes that have not only pointers to their successors, but also numbers. We're going to compute a parallel prefix, a set of partial sums, on this linked list, and the algorithm is very similar to the one we saw before on a vector, except that we're taking advantage of the pointer doubling technique.

At the first step, each processor simply takes a pointer to its neighbor. At the next step, each processor takes the value that it holds and adds it into the place pointed to. So, for example, in this diagram, the first processor is going to add its x sub 0 into the x sub 1 in the next processor, while the x sub 1 is going to be added into the x sub 2, and the x sub 2 into the x sub 3, all in parallel. And then, we take a pointer-doubling step, and the result looks like this. The first processor still contains x sub 0, which can also be written as sigma 0 to 0, the sum of x sub 0 through x sub 0, which is simply x sub 0. The second processor has the sum of x sub 0 through x sub 1, the third processor has x sub 1 through x sub 2, and so forth. And we've doubled the pointers.

Now, we do this again, and we see that, for example, the fourth processor now has the sum of x sub 0 through x sub 3, and the seventh processor has x sub 3 through x sub 6. Notice in this last step that the third processor will add its sigma 0 to 2 into the seventh processor's sigma 3 to 6; this will cause processor 7 to have the sum of 0 through 6. And after the third step, in fact, if you carefully trace through it all, you will find that each of the processors has gotten the sum of its own number plus all the preceding ones in the list, and again in a logarithmic number of time steps. So we see that we can do interesting things not only with arrays, but also with linked lists, in this data parallel programming style.
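Putting the two ideas together, here is a sketch, in the same simulated NumPy style, of the sum prefix over a linked list: at each step, every processor that has not yet fallen off the end adds its value into the node it points to and then doubles its pointer. A value of -1 marks the end of the chain, and the data values are the same made-up numbers as before.

```python
import numpy as np

n = 8
vals = np.array([5, 1, 3, 4, 3, 6, 2, 3])   # made-up x0 .. x7
nxt = np.arange(1, n + 1)
nxt[-1] = -1                                 # -1 marks the end of the list

# Sum prefix over a linked list: log2(n) parallel steps.  At each step,
# every still-active processor adds its value into the node it points
# to and then doubles its pointer.
for _ in range(int(np.ceil(np.log2(n)))):
    src = np.flatnonzero(nxt >= 0)           # processors not yet off the end
    dst = nxt[src]
    vals[dst] += vals[src]                   # all additions at once
    nxt[src] = nxt[dst]                      # pointer-doubling step

print(vals)     # each node: its own value plus all values preceding it
```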
Now, let's talk a little bit about the distinction between speed and efficiency. In sequential programming, usually these terms are considered synonymous: if a program is fast, then it must be efficient, and if it's efficient, it must be fast. But this coincidence of terms comes about only because you have a single processor to work with, you're trying to get the most horsepower you can out of it, and it's the only one doing things. In the parallel case, it's not so, because sometimes you can make a program go faster by, in some sense, doing extra work, more than you had to, taking advantage of the fact that you've got extra processors to work with.

Let's consider, for example, the algorithm that we saw earlier for summing a vector, and let's compare that with a serial algorithm. With a serial algorithm, you would write a loop, and within the loop, you would have indexed, subscripted accesses to the array. You set up an accumulator, initialize it to 0, or perhaps initialize it to the first element of the vector to save an addition, and then add in successive elements of the array. So you would do this using one processor, you would take N minus 1 time steps, and you would have to do N minus 1 pairwise additions, because that's the minimum number you can get away with. So we say that the work involved was N minus 1, that being the number of additions performed. The cost, which is the number of processors times the number of time steps, is N minus 1. And the efficiency, which is the amount of work you got done, the number of additions, divided by the cost, is 1.

Now, let's compare that with the parallel algorithm. In the parallel vector sum algorithm, we used N processors to do it, one for each data element. We did it in log N time steps, and we still accomplished N minus 1 additions. The cost of this, the number of processors times the number of time steps, instead of being N minus 1, was N log N. We used a lot of processors, and as we noted earlier, most of the processors weren't busy most of the time, and so we wasted a lot of cycles. And so the efficiency, which is the number of additions divided by the cost, was approximately 1 over log N. So as the size of the problem gets bigger, the efficiency goes down. But still, the speed compares quite favorably with the serial algorithm: log N versus N minus 1.

Now, let's compare this to the vector sum prefix algorithm.
Again, consider a serial version. Just as the data parallel vector sum prefix looked like the data parallel sum reduction, so the serial prefix algorithm looks like the serial sum algorithm, except that you save out all the partial results. You start with the first element of the array and you save it. Then, you add in the next element and you save that sum; you add in the next element and you save that sum. And so after N minus 1 time steps and N minus 1 additions, you've computed the sum prefix, with the additional cost of simply storing the results out to memory. And so, again, you've achieved N minus 1 additions at a cost of N minus 1, so the efficiency is 1.

Now, let's compare that to the parallel version. We've used N processors and done it in log N time steps. But the number of additions is much greater: if you work out the numbers, it turns out to be roughly N times (log N minus 1). Because you used N processors, used them for log N time steps, and kept nearly all of them busy, you did about N times (log N minus 1) additions at a cost of N log N. Dividing those, the efficiency was (log N minus 1) over log N, which as N gets very large is very close to 1. So we can claim that this is a very efficient algorithm, because as N gets large, its efficiency approaches 1.

And yet, this efficiency is somehow bogus. We achieved it by keeping the processors busy doing more than they really had to do, in some sense, because only N minus 1 additions are really required to compute a sum prefix. On the other hand, it appears that more than that are required in order to do it fast. And so we have this curious trade-off: as the speed goes up, the efficiency, in some sense, goes down, but that's masked in this particular efficiency measure by the fact that we actually did more additions than we really had to. So the business of measuring the speed of a parallel algorithm and the efficiency of a parallel algorithm is a very tricky business. And in fact, I think the particular measures that I used in these graphics are a bit naive, and we need to develop a better theory of how to measure the goodness, the efficiency, and the speed of parallel algorithms.
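To make the arithmetic concrete, here is a tiny sketch that plugs the operation counts quoted above into the definitions of cost and efficiency, for a made-up problem size of N = 1024 (so log N = 10). The exact addition count used for the doubling scan, N log N minus (N minus 1), is my own tally of the steps described earlier, consistent with the rough N times (log N minus 1) figure.

```python
from math import log2

def efficiency(work, processors, steps):
    # cost = processors * time steps; efficiency = useful work / cost
    return work / (processors * steps)

N = 1024
lgN = int(log2(N))

# Serial sum (and serial sum prefix): 1 processor, N-1 steps, N-1 adds.
print(efficiency(N - 1, 1, N - 1))              # 1.0

# Parallel sum (tree reduction): N processors, log N steps, N-1 adds.
print(efficiency(N - 1, N, lgN))                # about 1 / log N  (~0.0999)

# Parallel sum prefix (doubling scan): N processors, log N steps,
# and N*log N - (N - 1) adds, roughly N*(log N - 1).
print(efficiency(N * lgN - (N - 1), N, lgN))    # about (log N - 1)/log N (~0.900)
```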
We've seen a useful set of building blocks for constructing data parallel algorithms. Now, let's build some algorithms with those building blocks.

For our first example, let's consider matrix multiplication. This can be done in a large variety of ways, but abstractly, there's a single operation to be performed: given two matrices, we regard one as being composed of rows and the other as being composed of columns, and each row of the first must meet every column of the second. Each interaction of a row and a column produces one element of the result array, by performing an inner product on the two vectors, the row and the column.

One way of doing this very quickly, by a sort of brute-force approach, is to use on the order of n cubed processors. We'll assume that each of these three matrices, the first source, the second source, and the result, is an n by n matrix. We will use n cubed processors, and we will organize the matrices by putting the first source on one face of the cube, the second source on a different face, and having the result come out a third face. And we will see that we can do this very quickly, in a fairly small number of steps.

The first thing we will do is make copies of the first source array, using a spread operation to replicate that matrix through the cube. Then, we will do the same thing with the second source, making copies and spreading those down the cube. Each of these spreading operations takes a logarithmic number of steps, so, so far, we've used O of log n time. Next, we do a lot of elementwise operations. You can see 16 of the multiplications here in this graphic, shown on the outer face of the cube.
719 00:30:21,620 --> 00:30:23,780 But in fact, these elementwise operations 720 00:30:23,780 --> 00:30:25,190 are happening throughout the cube 721 00:30:25,190 --> 00:30:29,000 and so we have n cubed multiplications going on 722 00:30:29,000 --> 00:30:30,750 in a constant amount of time because we're 723 00:30:30,750 --> 00:30:34,670 using O of n cubed processors. 724 00:30:34,670 --> 00:30:37,880 Then, in the last step, we perform a parallel sum 725 00:30:37,880 --> 00:30:41,060 operation using the doubling reduction method 726 00:30:41,060 --> 00:30:44,210 that we saw earlier on each of the rows 727 00:30:44,210 --> 00:30:46,760 coming up through the array toward the result matrix. 728 00:30:46,760 --> 00:30:49,140 And that takes a logarithm amount of time. 729 00:30:49,140 --> 00:30:51,350 And so, in four simple steps, we've 730 00:30:51,350 --> 00:30:54,510 managed to multiply two n by n matrices in log n time, 731 00:30:54,510 --> 00:30:59,060 but at the cost of using O of n cubed processors. 732 00:30:59,060 --> 00:31:01,958 It should also be pointed out that if the very next thing we 733 00:31:01,958 --> 00:31:03,500 wanted to do was to take this product 734 00:31:03,500 --> 00:31:05,930 and add it to one of the source of arrays, 735 00:31:05,930 --> 00:31:07,730 the result is in the wrong place and we're 736 00:31:07,730 --> 00:31:09,573 going to incur yet an additional cost 737 00:31:09,573 --> 00:31:11,240 to move it to the right place so that we 738 00:31:11,240 --> 00:31:13,022 could perform that addition. 739 00:31:13,022 --> 00:31:14,480 And so that should also be factored 740 00:31:14,480 --> 00:31:15,772 into the cost of the operation. 741 00:31:15,772 --> 00:31:18,700 742 00:31:18,700 --> 00:31:21,820 Here is another method for doing matrix multiplication that only 743 00:31:21,820 --> 00:31:24,760 requires n squared processors. 744 00:31:24,760 --> 00:31:28,720 In this arrangement, we take the two source arrays 745 00:31:28,720 --> 00:31:31,513 and put them in the same set of n squared processors. 746 00:31:31,513 --> 00:31:33,430 And the result will also show up in the same n 747 00:31:33,430 --> 00:31:35,593 by n matrix of processors. 748 00:31:35,593 --> 00:31:37,510 And we're going to use a very clever technique 749 00:31:37,510 --> 00:31:41,870 due to Cannon, which involves pre-skewing the two source 750 00:31:41,870 --> 00:31:42,370 arrays. 751 00:31:42,370 --> 00:31:44,658 The first source array has its rows skewed 752 00:31:44,658 --> 00:31:47,200 and you'll see that each row is skewed by a different amount, 753 00:31:47,200 --> 00:31:49,520 depending on its row number. 754 00:31:49,520 --> 00:31:52,540 And similarly, the columns of the second array 755 00:31:52,540 --> 00:31:54,550 are skewed depending on their column number. 756 00:31:54,550 --> 00:31:57,315 757 00:31:57,315 --> 00:31:59,690 Now, we will take these to skewed arrays and overlay them 758 00:31:59,690 --> 00:32:02,810 and the result looks like this. 759 00:32:02,810 --> 00:32:06,560 The essence of the algorithm-- this is a systolic algorithm-- 760 00:32:06,560 --> 00:32:10,190 is to rotate both of the source matrices at the same time. 761 00:32:10,190 --> 00:32:13,400 We're going to rotate the first source matrix horizontally 762 00:32:13,400 --> 00:32:17,250 and the second one vertically like this. 
763 00:32:17,250 --> 00:32:20,060 And so at each time step, each of the two source arrays 764 00:32:20,060 --> 00:32:22,460 will be shifted by one position and the result 765 00:32:22,460 --> 00:32:23,660 looks something like this. 766 00:32:23,660 --> 00:32:27,840 767 00:32:27,840 --> 00:32:31,382 And so we proceed for four time steps, where at each step, 768 00:32:31,382 --> 00:32:33,590 the two source elements that arrive at the processor 769 00:32:33,590 --> 00:32:36,403 are multiplied together and added into a running sum. 770 00:32:36,403 --> 00:32:37,820 And magically, the result has been 771 00:32:37,820 --> 00:32:39,893 computed after n time steps. 772 00:32:39,893 --> 00:32:41,810 Now, let's look at that a little more closely. 773 00:32:41,810 --> 00:32:44,690 Pay attention to the intersection point 774 00:32:44,690 --> 00:32:46,820 at the upper left. 775 00:32:46,820 --> 00:32:50,660 You will see that at the first time step, 776 00:32:50,660 --> 00:32:54,040 the second element of the first row and the second element 777 00:32:54,040 --> 00:32:56,800 of the first column have met in the upper left corner. 778 00:32:56,800 --> 00:32:59,560 They are then multiplied and accumulated. 779 00:32:59,560 --> 00:33:02,260 Then, at the next time step, the third elements 780 00:33:02,260 --> 00:33:04,720 of the row and column meet, are multiplied together, 781 00:33:04,720 --> 00:33:06,190 and added in. 782 00:33:06,190 --> 00:33:09,250 At the next time step, the fourth elements 783 00:33:09,250 --> 00:33:11,350 of the first row and the first column meet. 784 00:33:11,350 --> 00:33:14,230 And at the last step, finally, the first elements 785 00:33:14,230 --> 00:33:15,910 meet and are accumulated. 786 00:33:15,910 --> 00:33:18,370 And the same thing is going on at all the other points 787 00:33:18,370 --> 00:33:20,020 of the matrix. 788 00:33:20,020 --> 00:33:22,120 And the purpose of the pre-skewing 789 00:33:22,120 --> 00:33:24,970 is to ensure that each element of each row 790 00:33:24,970 --> 00:33:27,250 meets the corresponding element of each column at just 791 00:33:27,250 --> 00:33:28,700 the right time. 792 00:33:28,700 --> 00:33:30,310 And if you don't believe that, you'll just have to trust me. 793 00:33:30,310 --> 00:33:30,820 It works. 794 00:33:30,820 --> 00:33:33,040 It's really amazing. 795 00:33:33,040 --> 00:33:35,100 And so the net effect is that using only n 796 00:33:35,100 --> 00:33:39,923 squared processors, we've managed to do it in time n. 797 00:33:39,923 --> 00:33:42,090 And an additional benefit is that the result matrix ends up 798 00:33:42,090 --> 00:33:43,860 in the right place if you want to do yet other things 799 00:33:43,860 --> 00:33:44,860 with the other matrices. 800 00:33:44,860 --> 00:33:47,380 801 00:33:47,380 --> 00:33:49,660 Finally, let us consider a really big example. 802 00:33:49,660 --> 00:33:52,780 We're going to go back to the question of labeling regions 803 00:33:52,780 --> 00:33:53,910 in an image. 804 00:33:53,910 --> 00:33:55,660 Here instead of showing you a rocket ship, 805 00:33:55,660 --> 00:33:57,580 I'm showing you a somewhat more abstract image 806 00:33:57,580 --> 00:33:59,245 in order to keep the example small. 807 00:33:59,245 --> 00:34:01,120 This is one of the difficulties of describing 808 00:34:01,120 --> 00:34:02,890 data parallel algorithms: they 809 00:34:02,890 --> 00:34:05,650 are inherently oriented 810 00:34:05,650 --> 00:34:06,920 toward large amounts of data.
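As a rough illustration of the two matrix multiplication schemes just described, here is a minimal sketch in Python with NumPy that simulates the parallel steps with whole-array operations: broadcasting stands in for the spread steps of the O of n cubed processor method, and np.roll stands in for Cannon's pre-skewing and systolic rotations. The function names and the NumPy framing are illustrative assumptions, not part of the lecture itself.

import numpy as np

def brute_force_matmul(A, B):
    # O(n^3)-"processor" method: spread both sources through an n x n x n cube,
    # multiply elementwise, then sum-reduce along the shared dimension k.
    n = A.shape[0]
    A3 = np.broadcast_to(A[:, :, None], (n, n, n))  # A[i, k] spread over j
    B3 = np.broadcast_to(B[None, :, :], (n, n, n))  # B[k, j] spread over i
    P = A3 * B3                      # n^3 elementwise multiplications at once
    return P.sum(axis=1)             # parallel-sum reduction along k

def cannon_matmul(A, B):
    # Cannon's systolic method on an n x n grid: pre-skew both sources, then
    # perform n rotate-multiply-accumulate steps.  np.roll plays the role of
    # the cyclic shifts performed by the processor grid.
    n = A.shape[0]
    A = np.stack([np.roll(A[i], -i) for i in range(n)])             # skew rows
    B = np.stack([np.roll(B[:, j], -j) for j in range(n)], axis=1)  # skew columns
    C = np.zeros_like(A)
    for _ in range(n):               # n systolic time steps
        C += A * B                   # each grid cell multiplies and accumulates
        A = np.roll(A, -1, axis=1)   # rotate the first source horizontally
        B = np.roll(B, -1, axis=0)   # rotate the second source vertically
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.integers(0, 5, (4, 4))
    B = rng.integers(0, 5, (4, 4))
    assert np.array_equal(brute_force_matmul(A, B), A @ B)
    assert np.array_equal(cannon_matmul(A, B), A @ B)

Note that the sketch serializes what the lecture assumes happens in parallel; it is meant only to make the data movement concrete.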
811 00:34:06,920 --> 00:34:08,199 But if I try to show you a large amount of data, 812 00:34:08,199 --> 00:34:09,324 it won't fit on the screen. 813 00:34:09,324 --> 00:34:10,900 So I'm going to do my best here. 814 00:34:10,900 --> 00:34:14,830 We have a number of regions in this image shown here 815 00:34:14,830 --> 00:34:15,699 by different colors. 816 00:34:15,699 --> 00:34:17,199 We see a large central green region. 817 00:34:17,199 --> 00:34:19,960 This little squiggly reddish orange region up 818 00:34:19,960 --> 00:34:21,219 in the left corner. 819 00:34:21,219 --> 00:34:23,469 And notice that not every region is a different color. 820 00:34:23,469 --> 00:34:25,870 There may be several regions that are the same color. 821 00:34:25,870 --> 00:34:27,370 But as long as they're disconnected, 822 00:34:27,370 --> 00:34:30,239 we want to consider them to be separate regions. 823 00:34:30,239 --> 00:34:32,489 We would like to compute a result that looks something 824 00:34:32,489 --> 00:34:35,489 like this in which each region has been 825 00:34:35,489 --> 00:34:36,989 assigned a distinct number. 826 00:34:36,989 --> 00:34:39,239 We don't particularly care which number gets assigned 827 00:34:39,239 --> 00:34:41,580 to it as long as each region gets a different number 828 00:34:41,580 --> 00:34:44,820 and all contiguous elements of a single region 829 00:34:44,820 --> 00:34:46,270 have received the same number. 830 00:34:46,270 --> 00:34:48,780 So we can see here that in the sample result, 831 00:34:48,780 --> 00:34:52,170 the central green region has had all its pixels assigned the value 832 00:34:52,170 --> 00:34:53,130 19. 833 00:34:53,130 --> 00:34:55,230 And the squiggly region in the upper left corner 834 00:34:55,230 --> 00:34:58,230 has received the value 0 in all its pixels. 835 00:34:58,230 --> 00:35:01,180 The region in the upper right corner, 836 00:35:01,180 --> 00:35:03,750 although the same initial color as the central region, 837 00:35:03,750 --> 00:35:05,083 has received a different number. 838 00:35:05,083 --> 00:35:09,760 It's gotten all fives instead of 19s because it's disjoint. 839 00:35:09,760 --> 00:35:11,240 And we proceed in a number of steps. 840 00:35:11,240 --> 00:35:14,080 And at each step, we'll see how the particular themes 841 00:35:14,080 --> 00:35:17,170 and building blocks we've discussed fit together to make 842 00:35:17,170 --> 00:35:18,645 an interesting algorithm. 843 00:35:18,645 --> 00:35:20,020 The first thing we're going to do 844 00:35:20,020 --> 00:35:22,230 is assign a different number to each processor. 845 00:35:22,230 --> 00:35:24,280 And in fact, here, I've shown the numbers 846 00:35:24,280 --> 00:35:27,220 as progressing across the rows and then down the matrix. 847 00:35:27,220 --> 00:35:29,500 But for the purposes of this algorithm, 848 00:35:29,500 --> 00:35:31,630 any organization would do, provided 849 00:35:31,630 --> 00:35:34,510 we can reasonably quickly assign a different number 850 00:35:34,510 --> 00:35:35,617 to each processor. 851 00:35:35,617 --> 00:35:37,450 And we've seen how the enumeration technique 852 00:35:37,450 --> 00:35:39,800 can do this in a logarithmic number of time steps. 853 00:35:39,800 --> 00:35:43,910 So we'll assume this particular numbering of the processors. 854 00:35:43,910 --> 00:35:48,910 The next thing we do is to have each of the pixels in the image 855 00:35:48,910 --> 00:35:51,850 examine the pixel values of its neighbors.
856 00:35:51,850 --> 00:35:54,460 And this is easily accomplished by using regular permutations, 857 00:35:54,460 --> 00:35:56,650 namely shifts of the matrix. 858 00:35:56,650 --> 00:35:58,270 We take the matrix and we shift it up, 859 00:35:58,270 --> 00:36:00,640 we shift it down, left and right, and also 860 00:36:00,640 --> 00:36:04,180 to the northeast, northwest, southeast, and southwest. 861 00:36:04,180 --> 00:36:06,670 And as a result of these shifting operations, 862 00:36:06,670 --> 00:36:08,710 each processor now knows its own pixel value 863 00:36:08,710 --> 00:36:11,480 plus the pixel values of its eight neighbors. 864 00:36:11,480 --> 00:36:13,910 This is then enough information for each processor 865 00:36:13,910 --> 00:36:16,730 to do elementwise computations and decide-- 866 00:36:16,730 --> 00:36:19,730 each processor can decide whether its pixel 867 00:36:19,730 --> 00:36:22,400 is on the border of its region or not. 868 00:36:22,400 --> 00:36:23,780 There are lots of special cases. 869 00:36:23,780 --> 00:36:25,460 And the details get rather messy and I'm not 870 00:36:25,460 --> 00:36:27,877 going to talk about them too much here because they're not 871 00:36:27,877 --> 00:36:30,290 central to the ideas involved. 872 00:36:30,290 --> 00:36:32,165 But be aware that there are messy details. 873 00:36:32,165 --> 00:36:34,830 874 00:36:34,830 --> 00:36:38,430 In this example, we have darkened the border squares 875 00:36:38,430 --> 00:36:39,510 so you can see them. 876 00:36:39,510 --> 00:36:41,730 And the next several computations to be carried out 877 00:36:41,730 --> 00:36:44,310 will be carried out only by pixel processors that 878 00:36:44,310 --> 00:36:45,263 are on the border. 879 00:36:45,263 --> 00:36:47,430 So this will be an example of conditional operation: 880 00:36:47,430 --> 00:36:49,950 only some of the processors will be participating. 881 00:36:49,950 --> 00:36:53,550 882 00:36:53,550 --> 00:36:56,610 We have each processor, again, consider the pixel values 883 00:36:56,610 --> 00:36:58,650 that came from its neighbors. 884 00:36:58,650 --> 00:37:01,730 And also inquire, again using shifting, 885 00:37:01,730 --> 00:37:03,480 whether its neighbors are border elements. 886 00:37:03,480 --> 00:37:06,220 This is, then, enough information to figure out 887 00:37:06,220 --> 00:37:08,230 which of your neighbors are border elements 888 00:37:08,230 --> 00:37:11,720 in the same region so that you can construct pointers to them. 889 00:37:11,720 --> 00:37:14,890 And so what we've done here is to stitch together 890 00:37:14,890 --> 00:37:16,600 the border of each region into a linked 891 00:37:16,600 --> 00:37:20,030 list that runs around the perimeter of the region. 892 00:37:20,030 --> 00:37:22,600 So you can see a linked list running around the perimeter 893 00:37:22,600 --> 00:37:23,998 of the central green region. 894 00:37:23,998 --> 00:37:25,540 And you can see a linked list running 895 00:37:25,540 --> 00:37:27,370 through the squiggly region in the upper left. 896 00:37:27,370 --> 00:37:29,412 And each region has had its boundary connected up 897 00:37:29,412 --> 00:37:30,520 with linked pointers. 898 00:37:30,520 --> 00:37:35,010 We are now going to use pointer doubling algorithms. 899 00:37:35,010 --> 00:37:37,020 Each pixel processor considers the number 900 00:37:37,020 --> 00:37:40,730 that it was assigned in the enumeration step.
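To make the enumeration and neighbor-examination steps concrete, here is a small sketch in Python with NumPy; np.roll stands in for the eight regular-permutation shifts, and the border test is deliberately simplified (edges wrap around, and the messy special cases mentioned above are ignored). The names and the NumPy framing are assumptions made for illustration only.

import numpy as np

def border_mask(pixels):
    # Mark a pixel as a border pixel if any of the eight shifted copies of the
    # image brings a different pixel value into its position.
    border = np.zeros(pixels.shape, dtype=bool)
    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1),
                   (-1, -1), (-1, 1), (1, -1), (1, 1)]:
        neighbor = np.roll(pixels, (di, dj), axis=(0, 1))  # one regular shift
        border |= neighbor != pixels                       # elementwise compare
    return border

if __name__ == "__main__":
    image = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [3, 3, 3, 2],
                      [3, 3, 3, 2]])
    ids = np.arange(image.size).reshape(image.shape)  # enumeration: one distinct number per processor
    print(ids)
    print(border_mask(image).astype(int))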
901 00:37:40,730 --> 00:37:43,040 And the pixels that are not on the borders 902 00:37:43,040 --> 00:37:47,340 do not participate in this consideration. 903 00:37:47,340 --> 00:37:51,060 We then use a pointer doubling algorithm, first of all, 904 00:37:51,060 --> 00:37:54,540 to do a reduction step using not summation, 905 00:37:54,540 --> 00:37:56,257 but the minimum operation. 906 00:37:56,257 --> 00:37:58,840 Min is as good an operation for combining numbers as summation 907 00:37:58,840 --> 00:38:00,090 is for this purpose. 908 00:38:00,090 --> 00:38:02,770 And pointer doubling works just fine for that. 909 00:38:02,770 --> 00:38:07,480 So each linked border list will perform pointer doubling 910 00:38:07,480 --> 00:38:10,330 around that list and determine what is the smallest 911 00:38:10,330 --> 00:38:11,890 number in that list. 912 00:38:11,890 --> 00:38:14,530 And then, by using another pointer doubling step, 913 00:38:14,530 --> 00:38:16,900 send that smallest number and make copies of it 914 00:38:16,900 --> 00:38:18,190 all around the perimeter. 915 00:38:18,190 --> 00:38:21,448 So now, every border pixel knows the smallest number 916 00:38:21,448 --> 00:38:23,240 that was on a border element in its region. 917 00:38:23,240 --> 00:38:24,795 So for example, the number 19 has 918 00:38:24,795 --> 00:38:26,170 been propagated around the border 919 00:38:26,170 --> 00:38:27,460 of the central green region. 920 00:38:27,460 --> 00:38:29,477 And the number 0 has been propagated down 921 00:38:29,477 --> 00:38:31,060 the squiggly region in the upper left. 922 00:38:31,060 --> 00:38:33,570 923 00:38:33,570 --> 00:38:35,670 Finally, we can use scan operations, not 924 00:38:35,670 --> 00:38:38,580 on the linked lists anymore, we can abandon those now, 925 00:38:38,580 --> 00:38:40,513 but on the columns or, what 926 00:38:40,513 --> 00:38:42,930 amounts to the same thing, the rows-- either one will do. 927 00:38:42,930 --> 00:38:45,660 By doing a scan operation in each direction, 928 00:38:45,660 --> 00:38:51,350 we can copy the processor labels from the borders 929 00:38:51,350 --> 00:38:53,490 to the interior points of the region. 930 00:38:53,490 --> 00:38:55,400 So for example, in the central green region, 931 00:38:55,400 --> 00:38:57,740 it actually suffices to copy from the upper edge 932 00:38:57,740 --> 00:39:00,380 of the region downward, thereby propagating the label 933 00:39:00,380 --> 00:39:01,285 into the interior. 934 00:39:01,285 --> 00:39:02,660 Other regions, particularly those 935 00:39:02,660 --> 00:39:03,650 that are on the edge of the image, 936 00:39:03,650 --> 00:39:05,775 may need the numbers propagated up instead of down. 937 00:39:05,775 --> 00:39:08,390 So you do a scan in both directions doing copying. 938 00:39:08,390 --> 00:39:10,850 And this is done by essentially a variant 939 00:39:10,850 --> 00:39:12,590 of the parallel prefix operation. 940 00:39:12,590 --> 00:39:15,110 But instead of using an operation such as summing 941 00:39:15,110 --> 00:39:18,390 or min, it uses a more complicated operation that 942 00:39:18,390 --> 00:39:20,390 is, in fact, not commutative, but fairly simple, 943 00:39:20,390 --> 00:39:23,497 which is essentially: if you've come across a new number, 944 00:39:23,497 --> 00:39:25,080 take a copy of that and otherwise copy 945 00:39:25,080 --> 00:39:28,120 the old number that came in from your neighbor. 946 00:39:28,120 --> 00:39:29,370 But it's a legitimate variant.
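Here, as a rough sketch under simplifying assumptions, is what the pointer doubling step with min and the copying scan might look like in Python with NumPy. On a cyclic border list the minimum reaches every element after about log2 of the list length doubling steps, so this simulation folds the reduce and copy-around phases described above into one loop; the function names are invented for illustration.

import numpy as np

def pointer_double_min(nxt, val):
    # Pointer doubling (pointer jumping) with min as the combining operation.
    # nxt[i] is the successor of element i on its cyclic border list, and
    # val[i] is the number assigned to it in the enumeration step.
    nxt = np.array(nxt)
    val = np.array(val)
    for _ in range(max(1, int(np.ceil(np.log2(len(val)))))):
        val = np.minimum(val, val[nxt])  # combine with the value 2^k links away
        nxt = nxt[nxt]                   # double the pointer distance
    return val

def copy_scan_down(labels):
    # Column-wise "copy" scan: a cell keeps its own label if it has one
    # (>= 0), otherwise it copies the most recent label from above.  This is
    # a sequential stand-in for the non-commutative parallel prefix variant.
    out = labels.copy()
    for i in range(1, out.shape[0]):
        out[i] = np.where(out[i] >= 0, out[i], out[i - 1])
    return out

if __name__ == "__main__":
    nxt = [1, 2, 0, 4, 5, 3]       # two disjoint cyclic border lists
    ids = [19, 23, 27, 5, 9, 7]    # hypothetical enumeration numbers
    print(pointer_double_min(nxt, ids))  # prints [19 19 19  5  5  5]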
947 00:39:29,370 --> 00:39:32,070 948 00:39:32,070 --> 00:39:33,900 This method of labeling regions in an image 949 00:39:33,900 --> 00:39:36,980 is known as Lim's method, after Willie Lim. 950 00:39:36,980 --> 00:39:38,750 And if you use n squared processors 951 00:39:38,750 --> 00:39:41,180 to represent the image as an n by n array, 952 00:39:41,180 --> 00:39:45,070 it accomplishes the labeling in log n time. 953 00:39:45,070 --> 00:39:47,440 That's because each of the steps involved 954 00:39:47,440 --> 00:39:50,293 took either constant time or time logarithmic 955 00:39:50,293 --> 00:39:51,460 in the number of processors, 956 00:39:51,460 --> 00:39:53,410 given that you had n squared processors 957 00:39:53,410 --> 00:39:56,990 to process n squared pixels. 958 00:39:56,990 --> 00:39:58,550 Data parallel programming makes it 959 00:39:58,550 --> 00:40:03,440 easy to organize computations on large quantities of data 960 00:40:03,440 --> 00:40:06,170 for massively parallel computer systems. 961 00:40:06,170 --> 00:40:07,850 It differs from sequential programming 962 00:40:07,850 --> 00:40:09,740 in that its emphasis is on operations 963 00:40:09,740 --> 00:40:13,010 on entire collections of data rather than single elements 964 00:40:13,010 --> 00:40:14,840 of the data one at a time. 965 00:40:14,840 --> 00:40:16,550 In a data parallel program, you typically 966 00:40:16,550 --> 00:40:19,790 find fewer loops and fewer subscripted references 967 00:40:19,790 --> 00:40:22,490 to arrays than you do in a sequential program. 968 00:40:22,490 --> 00:40:24,530 On the other hand, data parallel programs 969 00:40:24,530 --> 00:40:27,740 are typically very similar to sequential programs 970 00:40:27,740 --> 00:40:31,230 in containing a single thread of control. 971 00:40:31,230 --> 00:40:32,760 At any given point in time, there's 972 00:40:32,760 --> 00:40:35,820 conceptually execution control at only one 973 00:40:35,820 --> 00:40:37,133 point in the program text. 974 00:40:37,133 --> 00:40:39,300 Sometimes this restriction is loosened a little bit, 975 00:40:39,300 --> 00:40:42,240 but it's characteristic of the data parallel programming style. 976 00:40:42,240 --> 00:40:44,610 This can make data parallel programs easier 977 00:40:44,610 --> 00:40:48,570 to understand than control parallel programs. 978 00:40:48,570 --> 00:40:51,120 In order to write good data parallel programs, 979 00:40:51,120 --> 00:40:53,640 we must become familiar with the necessary building 980 00:40:53,640 --> 00:40:56,360 blocks for the construction of data parallel algorithms. 981 00:40:56,360 --> 00:40:58,110 And we need to have a good idea as to what 982 00:40:58,110 --> 00:41:00,360 the relative costs are. 983 00:41:00,360 --> 00:41:02,868 Given one processor per data element, 984 00:41:02,868 --> 00:41:04,410 there are many interesting operations 985 00:41:04,410 --> 00:41:06,300 that can be performed in constant time 986 00:41:06,300 --> 00:41:08,935 on an entire array of data and other operations 987 00:41:08,935 --> 00:41:10,560 which take a logarithmic amount of time 988 00:41:10,560 --> 00:41:13,800 or perhaps a linear amount of time in the amount of data. 989 00:41:13,800 --> 00:41:16,500 And this depends not only on the inherent complexity 990 00:41:16,500 --> 00:41:19,410 of the operation, but also sometimes upon the underlying 991 00:41:19,410 --> 00:41:20,550 implementation.
992 00:41:20,550 --> 00:41:22,440 If a particular piece of hardware 993 00:41:22,440 --> 00:41:24,150 doesn't support sufficient connectivity 994 00:41:24,150 --> 00:41:26,190 among the processors, for example, 995 00:41:26,190 --> 00:41:27,990 a communications-bound operation 996 00:41:27,990 --> 00:41:31,177 may take more time than would otherwise be required. 997 00:41:31,177 --> 00:41:33,010 Once you become familiar with these building 998 00:41:33,010 --> 00:41:34,810 blocks and learn how to fit them together 999 00:41:34,810 --> 00:41:37,360 in standard and conventional ways, 1000 00:41:37,360 --> 00:41:39,910 writing a data parallel program is 1001 00:41:39,910 --> 00:41:42,790 just as easy as and just as hard as writing 1002 00:41:42,790 --> 00:41:44,560 a sequential program. 1003 00:41:44,560 --> 00:41:46,300 And given suitable underlying hardware, 1004 00:41:46,300 --> 00:41:48,145 your programs may run much faster. 1005 00:41:48,145 --> 00:41:51,080 1006 00:41:51,080 --> 00:41:53,660 I'm Bert Halstead, DEC Cambridge Research Lab. 1007 00:41:53,660 --> 00:41:55,580 I thought your examples were very interesting, 1008 00:41:55,580 --> 00:41:58,280 but I was curious whether you ever get into problems when 1009 00:41:58,280 --> 00:42:01,340 you have highly data dependent computations 1010 00:42:01,340 --> 00:42:04,880 and it's difficult to keep more than a very small fraction 1011 00:42:04,880 --> 00:42:07,640 of the processors actually doing the same operation 1012 00:42:07,640 --> 00:42:09,890 at the same time. 1013 00:42:09,890 --> 00:42:10,403 Yes, you do. 1014 00:42:10,403 --> 00:42:11,820 That's a very interesting problem. 1015 00:42:11,820 --> 00:42:14,540 And that's one reason for making the distinction 1016 00:42:14,540 --> 00:42:17,750 between the data parallel style and the question of whether you've 1017 00:42:17,750 --> 00:42:21,350 got SIMD or MIMD style hardware supporting it underneath. 1018 00:42:21,350 --> 00:42:24,020 It's perfectly possible to have hardware 1019 00:42:24,020 --> 00:42:26,060 that is capable of giving you some more 1020 00:42:26,060 --> 00:42:29,090 choice than doing the same thing at the same time. 1021 00:42:29,090 --> 00:42:33,530 Now, precisely what the best method is for designing a total system 1022 00:42:33,530 --> 00:42:35,960 that gives you flexibility without making it extremely 1023 00:42:35,960 --> 00:42:38,210 difficult to control is, I think, still an open research 1024 00:42:38,210 --> 00:42:39,920 question. 1025 00:42:39,920 --> 00:42:41,850 I'm Franklyn Turbak from MIT. 1026 00:42:41,850 --> 00:42:44,780 You indicated that a number of your algorithms 1027 00:42:44,780 --> 00:42:48,860 took time that was logarithmic in the size of the problem. 1028 00:42:48,860 --> 00:42:50,990 And this seems to be based on the assumption 1029 00:42:50,990 --> 00:42:54,860 that you actually have a large enough number of processors 1030 00:42:54,860 --> 00:42:56,960 to match the size of your problem. 1031 00:42:56,960 --> 00:42:58,880 Yet, for any real machine that you're 1032 00:42:58,880 --> 00:43:00,890 going to be running these algorithms on, 1033 00:43:00,890 --> 00:43:04,670 there's going to be, in fact, a limited number of processors. 1034 00:43:04,670 --> 00:43:06,950 And if the size of your problem exceeds 1035 00:43:06,950 --> 00:43:10,820 the number of processors, it seems that the logarithmic time 1036 00:43:10,820 --> 00:43:12,320 growth is no longer justified. 1037 00:43:12,320 --> 00:43:13,912 Can you explain that, please?
1038 00:43:13,912 --> 00:43:16,370 You're absolutely correct and I'm glad you brought that up. 1039 00:43:16,370 --> 00:43:18,030 There is no free lunch. 1040 00:43:18,030 --> 00:43:19,683 And for a fixed amount of hardware, 1041 00:43:19,683 --> 00:43:21,350 making a problem bigger is going to make 1042 00:43:21,350 --> 00:43:22,517 the whole thing run slower. 1043 00:43:22,517 --> 00:43:25,100 It's a question of whether you regard the size of the hardware 1044 00:43:25,100 --> 00:43:25,625 to be fixed. 1045 00:43:25,625 --> 00:43:28,196 1046 00:43:28,196 --> 00:43:31,447 If you have a much larger problem so big that it's not 1047 00:43:31,447 --> 00:43:33,030 going to fit in the computer you have, 1048 00:43:33,030 --> 00:43:34,740 you're simply going to have to buy a bigger computer. 1049 00:43:34,740 --> 00:43:36,150 And that's all there is to it. 1050 00:43:36,150 --> 00:43:40,028 Within the range of problems that your computer can handle, 1051 00:43:40,028 --> 00:43:42,570 then you will, indeed, tend to get linear kinds of trade-offs 1052 00:43:42,570 --> 00:43:43,778 rather than logarithmic ones. 1053 00:43:43,778 --> 00:43:45,195 It's a question of what you choose 1054 00:43:45,195 --> 00:43:46,860 to hold fixed in your analysis and what 1055 00:43:46,860 --> 00:43:49,120 you choose to let vary. 1056 00:43:49,120 --> 00:43:53,100 I'm [INAUDIBLE] from Holland. 1057 00:43:53,100 --> 00:43:55,470 I want to ask you the following question. 1058 00:43:55,470 --> 00:43:59,580 Is there something to say about the portability of the programs 1059 00:43:59,580 --> 00:44:00,570 to different machines? 1060 00:44:00,570 --> 00:44:03,300 1061 00:44:03,300 --> 00:44:04,492 Yes, there is. 1062 00:44:04,492 --> 00:44:06,450 There is a great deal to say about portability. 1063 00:44:06,450 --> 00:44:09,240 And one is that right now it's very difficult because we have 1064 00:44:09,240 --> 00:44:11,460 not come to terms on standards. 1065 00:44:11,460 --> 00:44:14,460 We haven't agreed on what are the right kinds of building 1066 00:44:14,460 --> 00:44:15,760 blocks to support. 1067 00:44:15,760 --> 00:44:20,430 And as a result, you end up with different architectures 1068 00:44:20,430 --> 00:44:22,740 that are being built in hardware lending 1069 00:44:22,740 --> 00:44:24,657 support to different ones of these operations, 1070 00:44:24,657 --> 00:44:26,698 but not supporting them all in a uniform fashion. 1071 00:44:26,698 --> 00:44:29,140 We're still trying to discover which ones are important. 1072 00:44:29,140 --> 00:44:32,535 And this is why you end up with non-portability of efficiencies 1073 00:44:32,535 --> 00:44:33,285 and running times. 1074 00:44:33,285 --> 00:44:35,615 This is completely aside from the question of, 1075 00:44:35,615 --> 00:44:36,990 will the code run at all, which I 1076 00:44:36,990 --> 00:44:39,152 think is merely a matter of agreeing 1077 00:44:39,152 --> 00:44:40,110 to implement compilers. 1078 00:44:40,110 --> 00:44:42,680 1079 00:44:42,680 --> 00:44:44,160 It's a difficult problem. 
1080 00:44:44,160 --> 00:44:48,080 And I think that's part of the point of this talk, 1081 00:44:48,080 --> 00:44:50,720 is to bring forward one particular set of building 1082 00:44:50,720 --> 00:44:53,420 blocks that I think can be treated as universal 1083 00:44:53,420 --> 00:44:56,030 and to try to get hardware architects 1084 00:44:56,030 --> 00:44:59,000 as well as language designers to pay attention 1085 00:44:59,000 --> 00:45:00,500 to these particular building blocks 1086 00:45:00,500 --> 00:45:03,810 as exemplars of important things that need to be supported well 1087 00:45:03,810 --> 00:45:05,060 in both software and hardware. 1088 00:45:05,060 --> 00:45:07,393 And there are probably others that I haven't touched on, 1089 00:45:07,393 --> 00:45:08,875 such as sorting. 1090 00:45:08,875 --> 00:45:12,440 I'm Jim [INAUDIBLE] from Howard University in Washington, DC. 1091 00:45:12,440 --> 00:45:14,000 I have a number of questions that 1092 00:45:14,000 --> 00:45:17,720 came out of the lecture and some other personal professional 1093 00:45:17,720 --> 00:45:21,030 interest with respect to this important topic. 1094 00:45:21,030 --> 00:45:23,450 For dealing with large matrices that 1095 00:45:23,450 --> 00:45:27,830 are highly dense and sparse, there are [INAUDIBLE] methods 1096 00:45:27,830 --> 00:45:29,690 that we normally will use to reduce 1097 00:45:29,690 --> 00:45:32,810 the so-called computational [INAUDIBLE] 1098 00:45:32,810 --> 00:45:38,060 And thus far, as you gave the very good presentation, 1099 00:45:38,060 --> 00:45:40,670 it looks to me that there will be opportunities 1100 00:45:40,670 --> 00:45:45,950 for evolving or solving problems [INAUDIBLE] that 1101 00:45:45,950 --> 00:45:47,780 might involve sparsity. 1102 00:45:47,780 --> 00:45:51,170 If this is true, how do you justify 1103 00:45:51,170 --> 00:45:56,010 the overhead cost in terms of the cost of parallel processing 1104 00:45:56,010 --> 00:45:58,730 as well as the cost of operations? 1105 00:45:58,730 --> 00:46:00,920 I know that there are advantages in terms 1106 00:46:00,920 --> 00:46:03,260 of the operational cost. 1107 00:46:03,260 --> 00:46:06,590 But the overhead cost in terms of implementing such machines, 1108 00:46:06,590 --> 00:46:08,420 especially where budgetary constraints 1109 00:46:08,420 --> 00:46:10,580 might be significant. 1110 00:46:10,580 --> 00:46:12,980 Is the gist of your question that the kinds of matrix 1111 00:46:12,980 --> 00:46:14,980 algorithms that I showed in this lecture are not 1112 00:46:14,980 --> 00:46:16,790 suitable for sparse matrices because so 1113 00:46:16,790 --> 00:46:18,140 much of the processing power will be wasted? 1114 00:46:18,140 --> 00:46:18,710 Yes. 1115 00:46:18,710 --> 00:46:20,168 Yes, that is a very good criticism. 1116 00:46:20,168 --> 00:46:22,760 And it would not be appropriate to use that kind of algorithm 1117 00:46:22,760 --> 00:46:24,440 on a sparse matrix, just as you wouldn't 1118 00:46:24,440 --> 00:46:27,260 use the usual sequential triply nested loop you 1119 00:46:27,260 --> 00:46:29,443 use for a dense matrix on a sparse one. 1120 00:46:29,443 --> 00:46:31,610 Sparse matrix processing on a data parallel computer 1121 00:46:31,610 --> 00:46:33,350 calls for very different approaches 1122 00:46:33,350 --> 00:46:35,330 that I did not illustrate in this talk.
1123 00:46:35,330 --> 00:46:37,310 And they are more difficult and typically 1124 00:46:37,310 --> 00:46:41,450 call for the kinds of irregular permutations and communications 1125 00:46:41,450 --> 00:46:46,820 techniques that I illustrated before. 1126 00:46:46,820 --> 00:46:50,150 Because the particular patterns of communication 1127 00:46:50,150 --> 00:46:51,650 in the processing of sparse matrices 1128 00:46:51,650 --> 00:46:53,960 depend so much on the particular content of the matrix. 1129 00:46:53,960 --> 00:46:55,700 I'd just like you to make a comment 1130 00:46:55,700 --> 00:46:58,940 on the combination of hierarchical structures 1131 00:46:58,940 --> 00:47:03,830 in problem solving rather than just plain parallel 1132 00:47:03,830 --> 00:47:05,570 processing or sequential. 1133 00:47:05,570 --> 00:47:09,963 Using hierarchical structures to enhance some of the approaches 1134 00:47:09,963 --> 00:47:10,880 that you've discussed. 1135 00:47:10,880 --> 00:47:13,228 What will that give you? 1136 00:47:13,228 --> 00:47:14,645 Could you be more specific, please? 1137 00:47:14,645 --> 00:47:16,760 To have a hierarchical structure is 1138 00:47:16,760 --> 00:47:19,760 where you have a main taskmaster that 1139 00:47:19,760 --> 00:47:22,130 controls the sequence of operations that 1140 00:47:22,130 --> 00:47:26,630 is done at the lower level-- the sequence of activities that 1141 00:47:26,630 --> 00:47:28,850 is being done in parallel. 1142 00:47:28,850 --> 00:47:31,580 Where the local members or local elements 1143 00:47:31,580 --> 00:47:34,070 have a chance to communicate among themselves 1144 00:47:34,070 --> 00:47:38,900 as well as direct information to the higher level taskmaster. 1145 00:47:38,900 --> 00:47:40,460 Oh, I think I see. 1146 00:47:40,460 --> 00:47:42,110 I've shown simply a two-level hierarchy. 1147 00:47:42,110 --> 00:47:45,200 At most, I've shown a single global source for, for example, 1148 00:47:45,200 --> 00:47:46,098 broadcasting. 1149 00:47:46,098 --> 00:47:48,140 And then, a sort of homogeneous array of processors. 1150 00:47:48,140 --> 00:47:50,810 One might imagine a more complex hierarchical structuring 1151 00:47:50,810 --> 00:47:52,290 of the processing elements. 1152 00:47:52,290 --> 00:47:52,790 Yes. 1153 00:47:52,790 --> 00:47:53,790 Is that your suggestion? 1154 00:47:53,790 --> 00:47:54,850 Yes. 1155 00:47:54,850 --> 00:47:56,600 I think that is a largely unexplored area. 1156 00:47:56,600 --> 00:47:57,975 That's another thing that I would 1157 00:47:57,975 --> 00:48:00,470 regard as an open research topic as to the best way 1158 00:48:00,470 --> 00:48:01,700 to organize that. 1159 00:48:01,700 --> 00:48:05,710 And the difficulty is providing the flexibility 1160 00:48:05,710 --> 00:48:08,660 at the programming level to be able to organize 1161 00:48:08,660 --> 00:48:11,412 the hierarchy in a way demanded by the problem. 1162 00:48:11,412 --> 00:48:13,370 We have seen a number of hardware architectures 1163 00:48:13,370 --> 00:48:15,200 that organize hardware processing elements 1164 00:48:15,200 --> 00:48:19,910 hierarchically, but with more or less ability to reorganize 1165 00:48:19,910 --> 00:48:21,357 those processors.
1166 00:48:21,357 --> 00:48:22,940 Because I don't think it's well enough 1167 00:48:22,940 --> 00:48:24,857 understood how to make a sufficiently flexible 1168 00:48:24,857 --> 00:48:28,040 communication system that allows you to restructure 1169 00:48:28,040 --> 00:48:30,620 the hierarchy unless, of course, you provide a completely 1170 00:48:30,620 --> 00:48:32,960 general communication system. 1171 00:48:32,960 --> 00:48:34,460 In which case, the hierarchy, again, 1172 00:48:34,460 --> 00:48:35,835 disappears at the hardware level. 1173 00:48:35,835 --> 00:48:38,330 And it's up to the language implementer to provide 1174 00:48:38,330 --> 00:48:41,210 that hierarchical organization at the language level. 1175 00:48:41,210 --> 00:48:43,250 And we have not yet seen that emerging either. 1176 00:48:43,250 --> 00:48:45,120 And that's a very good point. 1177 00:48:45,120 --> 00:48:49,280 So thus far, the algorithms for solving mixed integer-- 1178 00:48:49,280 --> 00:48:52,590 say, maybe linear programming. 1179 00:48:52,590 --> 00:48:54,740 But when you have nonlinear [INAUDIBLE] 1180 00:48:54,740 --> 00:48:58,580 integer programming, one of the topics you covered 1181 00:48:58,580 --> 00:49:00,270 includes enumeration. 1182 00:49:00,270 --> 00:49:02,960 Which suggests that if we really 1183 00:49:02,960 --> 00:49:06,260 want to solve the class of branch and bound problems 1184 00:49:06,260 --> 00:49:10,940 that require either deep search or [INAUDIBLE] search, 1185 00:49:10,940 --> 00:49:15,080 this parallel processing might be a very important tool 1186 00:49:15,080 --> 00:49:20,270 to solve some-- to solve the problem of mixing 1187 00:49:20,270 --> 00:49:24,680 nonlinear type programming with branch and bound. 1188 00:49:24,680 --> 00:49:26,360 Yeah, it may well be. 1189 00:49:26,360 --> 00:49:30,470 It is sometimes possible to use data parallel techniques to do 1190 00:49:30,470 --> 00:49:34,050 what look like unstructured searches in, for example, 1191 00:49:34,050 --> 00:49:37,440 a tree-- for example, a game tree-- using branch 1192 00:49:37,440 --> 00:49:39,440 and bound techniques or other techniques related 1193 00:49:39,440 --> 00:49:42,140 to it, by maintaining a work queue, 1194 00:49:42,140 --> 00:49:45,950 very much as you might in a more control parallel 1195 00:49:45,950 --> 00:49:47,810 oriented style. 1196 00:49:47,810 --> 00:49:52,100 And at every step, taking a large number of task items 1197 00:49:52,100 --> 00:49:54,410 off that work queue simultaneously 1198 00:49:54,410 --> 00:49:56,033 by using an enumeration step and using 1199 00:49:56,033 --> 00:49:57,700 the result of that enumeration to assign 1200 00:49:57,700 --> 00:49:59,468 [INAUDIBLE] processors. 1201 00:49:59,468 --> 00:50:01,760 Sometimes it's effective, provided the rest of the work 1202 00:50:01,760 --> 00:50:04,058 to be done in each of these tasks is sufficiently similar. 1203 00:50:04,058 --> 00:50:05,600 If it's not, then control parallelism 1204 00:50:05,600 --> 00:50:08,190 may be more suited to a task like that. 1205 00:50:08,190 --> 00:50:11,690 [INAUDIBLE] the Netherlands. 1206 00:50:11,690 --> 00:50:15,830 With the current expertise-- software expertise in 4GL, 1207 00:50:15,830 --> 00:50:19,130 fourth generation languages for sequential machines, 1208 00:50:19,130 --> 00:50:23,150 I would like to know your view on whether the development of data parallel 1209 00:50:23,150 --> 00:50:27,800 programming languages will end up at least at the 4GL 1210 00:50:27,800 --> 00:50:29,060 level or above.
1211 00:50:29,060 --> 00:50:32,160 1212 00:50:32,160 --> 00:50:35,370 I have seen the development of software 1213 00:50:35,370 --> 00:50:37,770 based on the data parallel paradigm-- 1214 00:50:37,770 --> 00:50:42,090 1215 00:50:42,090 --> 00:50:43,320 I was about to say "parallel"-- 1216 00:50:43,320 --> 00:50:45,510 let me say "reenact" the history of the development 1217 00:50:45,510 --> 00:50:49,722 of sequential programming languages and software systems. 1218 00:50:49,722 --> 00:50:51,180 And I think we are now at the point 1219 00:50:51,180 --> 00:50:53,055 where we understand how to make data parallel 1220 00:50:53,055 --> 00:50:55,290 languages at about the level of expressiveness 1221 00:50:55,290 --> 00:50:58,710 of C, Fortran, possibly Lisp. 1222 00:50:58,710 --> 00:51:01,200 And I think it's going to take a while before we can rework 1223 00:51:01,200 --> 00:51:03,217 our understanding of this to raise 1224 00:51:03,217 --> 00:51:05,550 the level of expressiveness to that of fourth generation 1225 00:51:05,550 --> 00:51:06,540 languages. 1226 00:51:06,540 --> 00:51:08,540 I think it's just a matter of time and learning. 1227 00:51:08,540 --> 00:51:11,190 We still don't understand how to fit these building 1228 00:51:11,190 --> 00:51:14,050 blocks together to make abstractions at that level. 1229 00:51:14,050 --> 00:51:15,895 And I think that's great because it 1230 00:51:15,895 --> 00:51:18,270 would be no fun if there weren't interesting new problems 1231 00:51:18,270 --> 00:51:20,910 to work on. 1232 00:51:20,910 --> 00:51:24,860 [MUSIC PLAYING] 1233 00:51:24,860 --> 00:53:10,326