[MUSIC PLAYING]

Hi, I'm Josh Fisher from HP Labs. I'm glad to be here today. I'll be talking about instruction-level parallelism. ILP is very different from the parallel processing we hear so much about. Rather than having separate processors work on separate chunks of a program that's been written to do that, ILP is an architectural technique that lies at the bottom end of the CPU and allows the overlap of ordinary RISC-level operations, so that you might be doing a multiply and an add at the same time. The same multiply and add that appear in the user's program, only now they're issued one after the other without the second one waiting for the first to finish, for example. Or they might be issued at the same time.

Instruction-level parallelism is meant to be transparent. That is, the user should not see any of it. We take ordinary programs and make them go a lot faster using this technique. It's an old technique; it's been around for over 30 years in the fastest computers. And after a 30-year history in which it really didn't do an awful lot, particularly for the last 20 of those years, it started really taking off in the '80s. It's become very much the hot topic in architecture conferences, and you just don't build a fast microprocessor these days without using instruction-level parallelism.

It's a very simple concept. Look at a small chunk of code. Here we have just eight operations: some multiplies, an add of their results, two more multiplies, an add of those results, and a couple of adds. I just took this random chunk of code out of an application. Using instruction-level parallelism, you might issue the first multiply in the first cycle. But rather than just stopping there, you might at the same time notice that this add down at the bottom doesn't depend upon anything that came earlier.
These other adds do have to wait for the multiplies that are producing their data. But this add is data independent, and you certainly could start executing it right away. An ordinary computer typically has separate floating-point and integer units, and that integer unit isn't doing anything. So you might get the bright idea of issuing the integer operation at the same time as the floating-point operation rather than waiting. That way you're overlapping operations, and you're able to get a speedup.

In this little expression, if you did each of these operations one at a time, you'd find it took 12 cycles, say, if the multiplies took two cycles and the adds took one; there are eight operations, and four of them are multiplies. If you look at what might happen instead, we're able to complete the same work in six cycles, so we get a factor-of-two speedup. That's a typical sort of thing you can do with this kind of hardware arrangement: hardware you pretty much already have, plus the ability to overlap a lot of the operations.

Now, notice I haven't said anything at all about how you go from being handed this code to this record of execution, of what actually happened when it went twice as fast. The technology of instruction-level parallelism is about getting, somehow, from a sequential program automatically to this kind of speedup, or a good deal more, as we'll see.
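To make the idea concrete, here is a toy greedy scheduler sketched in Python. The operation mix, latencies, and functional-unit counts are my own stand-ins rather than the code on the slide, so the cycle counts it prints are only illustrative, not a reproduction of the 12-versus-6 figure.

    # A toy greedy scheduler in the spirit of that example.  The operation
    # mix, the latencies, and the functional-unit counts below are made-up
    # stand-ins for illustration.
    LATENCY = {"fmul": 2, "fadd": 1, "iadd": 1}
    UNITS = {"fmul": 1, "fadd": 1, "iadd": 1}   # one pipelined unit of each kind

    ops = [                                      # (name, kind, results it must wait for)
        ("m1", "fmul", []), ("m2", "fmul", []), ("a1", "fadd", ["m1", "m2"]),
        ("m3", "fmul", []), ("m4", "fmul", []), ("a2", "fadd", ["m3", "m4"]),
        ("a3", "fadd", ["a1", "a2"]),
        ("a4", "iadd", []),                      # the data-independent integer add
    ]

    sequential = sum(LATENCY[kind] for _, kind, _ in ops)   # one at a time

    ready_at, remaining, cycle, finish = {}, list(ops), 0, 0
    while remaining:
        free = dict(UNITS)                       # each unit can start one op per cycle
        for op in list(remaining):
            name, kind, deps = op
            data_ready = all(ready_at.get(d, 1 << 30) <= cycle for d in deps)
            if data_ready and free[kind] > 0:    # dependence test, then resource test
                free[kind] -= 1
                ready_at[name] = cycle + LATENCY[kind]
                finish = max(finish, ready_at[name])
                remaining.remove(op)
        cycle += 1

    print("one at a time:", sequential, "cycles;  overlapped:", finish, "cycles")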
If we look at the history of instruction-level parallelism, it turns out to have been tremendously influenced by the question of how much ILP there is in typical programs, and by people's beliefs about that. In fact, as we'll see later, that's still a very controversial question; people really don't have a good handle on the answer. But if you look historically at what happened with machines like this, in the late '50s there were some experimental machines being built. Then in the early '60s, CDC delivered the 6600, the fastest scientific computer of its time, sort of the supercomputer of its day. It had eight functional units; they built extra functional units so that you could attempt to start a new operation every single cycle, whether or not the previous ones had finished. That was a very successful machine, and a lot of them were used for a lot of important applications. At the same time, IBM was building the 360/91, which had a lot of instruction-level parallelism as well, of a rather different kind underneath, and we'll talk about that in a few minutes. It was not a commercial success, but there were follow-ons to that machine that used the same kind of CPU design and were very successful.

So in the early '60s, this was the way to make computers go faster. If you talked about parallel processing then, you meant this kind of stuff. Nobody called it instruction-level parallelism; it was just what people did to make computers go faster. It was a topic that got some academic interest and certainly a lot of design interest. And people started asking the very natural question of how much parallelism there is in typical programs.

Several groups set out to answer that question. There were three that I know of for sure, and I believe there were probably others as well. They all got the same answer, and the way they tried to answer the question was as follows. They said: imagine you have a bunch of ordinary programs, but imagine that you also have infinitely much hardware to run these programs on, trying to get as much instruction-level parallelism as you possibly can. So you have 5,000 integer adders, or as many as you'd like, as many floating-point units as you'd like, as much memory bandwidth and as many registers as you need, and the ability to issue as many operations on the tick of a clock as you possibly could.
And then the question is: how much of this kind of parallelism would you get running ordinary programs on machines of that ability? The answer was very uniform, and people still repeat these experiments today, either on purpose or without knowing this history, and you always get pretty much the same range: about one and a half to two and a half or so is the kind of number you get.
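Roughly speaking, what those limit studies compute can be sketched in a few lines of Python; the trace format here is invented, operations are given unit latency, registers are treated as infinitely renamable, and, like the original studies, no operation moves past the end of its basic block.

    # A limit-study style measurement: infinite functional units, perfect
    # knowledge of register dependences, and no motion past branches, so
    # each basic block is scheduled by itself.  Trace format is invented:
    # each block is a list of (destination register, [source registers]).
    def block_cost(block):
        ready = {}                      # register -> cycle its value is available
        depth = 0
        for dest, srcs in block:
            start = max((ready.get(r, 0) for r in srcs), default=0)
            ready[dest] = start + 1     # unit latency, and never a wait for a unit
            depth = max(depth, ready[dest])
        return len(block), depth        # operations executed, dataflow critical path

    def available_ilp(blocks):
        ops = sum(block_cost(b)[0] for b in blocks)
        cycles = sum(block_cost(b)[1] for b in blocks)
        return ops / cycles             # the early studies reported about 1.5 to 2.5

    block = [("r3", ["r1", "r2"]), ("r4", ["r3", "r1"]), ("r5", ["r1", "r2"])]
    print(available_ilp([block]))       # 3 operations over a 2-cycle critical path: 1.5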
That was not considered a very wonderful result. People were pretty pessimistic about all of this when they heard it, and in fact interest in instruction-level parallelism died out around the same time these experiments came out. I can't know for sure that this was a huge factor, but talking to people who were in the computer architecture and compiler worlds at the time, I've heard very often: oh yes, this was a big factor; we just thought this technology had run out of steam right there. And I remember very clearly, as a young university professor, going off to Washington to get my first grants for work in this area. I wanted to do something about getting more parallelism, and I had ways of doing it. And I remember the people in Washington telling me very clearly: oh, don't bother with this. This is the Flynn bottleneck. It even had a name; Flynn had done some of these experiments. You can't get that kind of parallelism; it's just not there to be had. Go do something else. Fortunately, I had read these papers and had a ready answer. But it was very striking how clear that perception was.

Well, it turns out that this belief that there's only that much parallelism is really wrong. It's wrong for a big, general reason, and it's wrong because the experimenters missed one specific, narrow thing right away that makes the results not quite what you want to know when you measure this. The general thing is the question of code transformations. A compiler can do things to code that will expose more of this instruction-level parallelism, and there's no way to predict just what the compiler will do that will give you more of it. There are lots of tricks, similar, for example, to vectorizing tricks, that give you more and more instruction-level parallelism compared to the way the program was originally expressed by the programmer.

If you think about it, in fact, there's a silly argument, a little less silly than it seems at first, that says these limit studies really can't make sense as a flat statement of how much parallelism there is. After all, you could reduce a computation to a table lookup by computing ahead of time all the possible answers for all the possible ranges of data and then doing a one-cycle table lookup to get the computation done in a single cycle. So you take a big matrix of programs and data sets and just look up your answer for the program and data set at hand, and do it in one cycle. Now, that's silly; of course nothing remotely like that is practical. But in fact, a lot of the speedups one does reduce to that trick, and there are even ways in which compilers, not to mention hand coders, will substitute table lookups for more complex, longer computations. So this really is a problem for trying to make a flat statement about how much parallelism there is. A more accurate statement is much narrower: it's conditioned on the compiler technology available to the experimenter at the time. And very rarely do they even have state-of-the-art compiling techniques when they do these experiments, never mind compiling techniques comparable to what we'll have in the future. So in some sense these limit studies are very interesting and very informative, but people should not take them as some absolute that we'll never get beyond.
But there was a much bigger problem right away, and it concerned conditional branches. When these limit studies were done in the early '70s, people said: well, in an ordinary program we have tests and branches. If something is true, go this way; otherwise go that way. And you might want to raise an operation up from beyond one of these branches into the place where you're right now considering how much you can issue. So here you are, issuing operations, pulling them up from below, but you've now run into a conditional branch, and you'd like to pull an operation up from one side or the other. Well, they reasoned, you can't do that, because after all, if you pull an operation up, it's bound to have side effects; there must be a reason you're doing it. And suppose the branch goes the other way. Then you'll have done something that affected your computation, and you didn't mean to do it, and that would be bad. Of course, you can think of some little special cases where it isn't so bad and you get away with it. But people basically felt that this was not a source of big speedups for instruction-level parallelism.

What's interesting is that one of the groups that did this experiment, Riseman and Foster at UMass Amherst, did a second experiment in which they suggested there is a way around this conditional branch problem. They looked at that and measured what could happen, and it really was a very interesting result that they got.

They termed this differently, but I think this is the way you can look at it and understand it best. They said: suppose you take the program in advance and unwind every possible path of computation. So write down every path through the code that the code could take when it runs. At each conditional jump, you now write down two ways: the one leading up to here and going that way, and the one leading up to here and going the other way. So now you've written down every single path through the code.
Now, there are an awful lot of those paths. But just imagine for a second that you had done it. Don't forget, the hypothesis of this experiment is infinite hardware, all you need. So you have a big piece of paper, you write all this stuff down, and then you build a processor for each one of these paths, and you hand that path to that processor. And now you start computing. A processor that's been given a path doesn't have to worry about these conditional jumps anymore. Some processor was told to go this way and then this way. So if it wants to pull an operation up above those jumps, no problem: it knows that in its world the jumps are going this way and the branches are going that way. So it pulls it up. Each processor in turn stops worrying about the branches and starts pulling up all the code that's data ready.

Now, every time a branch resolves, you have a whole lot of processors to throw away: all the ones that had been sent some way inconsistent with the direction in which the branch resolved. So you just throw those away. No problem; infinite hardware. Eventually some processor will have been handed exactly the right path through the code, and that processor will have gotten instruction-level parallelism without having to worry about the conditional jumps. And it's going to get more parallelism than you'd get if you stopped at branches. Well, how much more? A huge amount more, as it turned out. I think the programs and data they looked at averaged a factor of 54 or 55 speedup. And there were some where they said, look, this is silly, this is just dependent on the data set size; you can get any number you want with some of these, and that's certainly true. But there is an order of magnitude and more of speedup available if you could somehow go past conditional branches.
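The Riseman and Foster bookkeeping is easy to restate as a sketch; the branch outcomes below are arbitrary, and the point is only the count of machine copies, not any real mechanism.

    from itertools import product

    # Eager execution across k unresolved branches needs one machine copy
    # per possible combination of outcomes, 2**k in all, and each branch
    # that resolves throws away every copy that guessed it wrong.
    def eager_copies(actual_outcomes):
        copies = list(product([True, False], repeat=len(actual_outcomes)))
        print("copies needed up front:", len(copies))     # 2**16 = 65,536 for k = 16
        for i, outcome in enumerate(actual_outcomes):
            copies = [c for c in copies if c[i] == outcome]   # discard the wrong guesses
        print("copies that guessed the whole path:", len(copies))   # always exactly 1

    eager_copies([True] * 16)       # the outcomes themselves don't matter here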
So then they actually set out to see how crazy an idea this was. What they measured was that if you wanted even a factor of 10 speedup, then at the worst place, the place densest with machines still running the program, you had to have 16 unresolved conditional jumps. That meant you had to have 2 to the 16th, or roughly 64,000, copies of your machine. And that's just to get a factor of 10 speedup; it doesn't even speak to all the machines you had to throw out over the whole rest of the program, which is probably a good deal more than that. So they reasoned that this is ludicrous, a preposterous amount of hardware to throw at the problem in order to get this speedup.

And they wrote a summary. These were very good experimenters, good computer scientists who have gone on to do good work in computer science besides this, and it was a very imaginative experiment. But, I think as prisoners of their time, the conclusion they wrote was not the conclusion one would like, in retrospect, to have seen. Here's the conclusion you'd like to see: boy, there's a lot of this instruction-level parallelism, but the crazy way we thought of trying to tap into it doesn't work. Sorry. Maybe somebody can come along with a better way. Instead they basically wrote it off, and people reading it basically wrote it off. And the message people took away wasn't even what they wrote, which was that there's a lot of instruction-level parallelism but it's impractical. Bad enough, people basically came away with the feeling that there wasn't much instruction-level parallelism at all.

So that's all wrong. It really is the first statement that's the correct conclusion, and it turns out that there is a practical embodiment of this idea. I've spent a lot of time on this one somewhat obscure experiment, but a lot more people know about it now.
That's because the idea that comes out of it, this idea of unwinding all the possible execution paths in order to get more parallelism by going past branches, is central to an awful lot of the technology we have today. We're going to learn about something called speculative execution in a minute, and that really is the way you can tap into huge amounts of instruction-level parallelism. But you have to make this idea of going down paths speculatively practical, and you're not going to do it the way these experimenters set out to measure it.

There is something you can see right away, though, that does bring some practicality to it. Here's the idea. Take a single CPU and start filling it up from below with things to be done, the way we were doing in that first example. But now, when you get to a conditional branch, pick one of the directions. Say you pick taken, and you start pulling things up from the taken side. When you do it, though, make sure you do it in such a way that you don't destroy so much of the state of the computation that, if it turns out untaken was the right answer, you have no ability to recover. So make sure that when you pull operations up and they're clobbering things, changing the state of the computation, you've somehow arranged, in the hardware or the software, to leave around enough state that when untaken turns out to be the right answer, you may not get the advantage of having pulled these things up, but at least the semantics of the program are still correct.

That idea, again, is called speculative execution. It allows you to go past a branch and pull things up. You're in a bit of a danger zone, because you're doing things without being quite sure whether you'll want their results, want the old state back, or want to ignore them, which is usually what you end up having to do. And then at some point the branch resolves and you can continue with the correct computation.
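A minimal way to picture that recovery requirement, with made-up register names and written as software rather than as any particular machine's mechanism:

    # Hoisting an operation above a branch speculatively: do the likely
    # work early, but into a scratch location, so the old state survives
    # if the prediction was wrong.  Names and values here are invented.
    def run(regs, branch_taken):
        # original order:  if branch_taken: r4 = r2 + r3   else: r4 = r4 - 1
        scratch = regs["r2"] + regs["r3"]   # speculated above the branch
        if branch_taken:                    # prediction was right
            regs["r4"] = scratch            # commit the speculative result
        else:                               # prediction was wrong: recover
            regs["r4"] = regs["r4"] - 1     # old r4 was never clobbered, so this is safe
        return regs

    print(run({"r2": 5, "r3": 7, "r4": 100}, branch_taken=True))    # r4 becomes 12
    print(run({"r2": 5, "r3": 7, "r4": 100}, branch_taken=False))   # r4 becomes 99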
Now, in fact, you can speculate from both sides, not just one. When you look down at a branch, whatever you're going to do on the taken side you can simultaneously do on the untaken side, pulling up operations from both and making sure you preserve the semantics of the program both ways. But now what's happened is that we're back in the Riseman and Foster situation: instead of 64,000 computers running this program, we have one computer with 64,000 operations to do in order to get a factor of 10 speedup. It doesn't feel like we've improved things very much.

However, if a branch is very predictable, that is, it tends to go in one direction most of the time and you can figure out what that direction is, then it makes sense to just gamble, a very common trade-off in computer systems: gamble on the more likely event, get a big speedup there, and then make sure you still do the right computation when the less likely event occurs. That's what's behind caches, after all. And that really makes sense if branches are predictable.

Well, it turns out, and these are some recent numbers we've gotten here at HP, that branches are a lot more predictable than a lot of people think. People working in instruction-level parallelism are pretty well aware of this, but I remember very often being told that this is not the way it is, so it was interesting to measure it and see just what it is. Consider a given source branch; maybe it's executed a million times. If, in a run of the program, it's taken 800,000 times and not taken 200,000 times, then you would say it goes in its majority direction 80% of the time, and you would count that as a million branches that go in their majority direction 80% of the time.
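The bookkeeping behind that measurement is simple to restate as a sketch; the trace format, a list of (static branch, taken or not) pairs, is an assumption of mine:

    from collections import Counter

    # For each static (source) branch, find the fraction of its executions
    # that go in its majority direction, and weight it by how many times
    # it executed, so the result is per dynamic branch.
    def majority_bias(trace):
        taken, total = Counter(), Counter()
        for branch, was_taken in trace:
            total[branch] += 1
            taken[branch] += was_taken
        return [(max(taken[b], total[b] - taken[b]) / total[b], total[b]) for b in total]

    # a branch taken 800,000 of 1,000,000 times counts as 1,000,000 dynamic
    # branches that go their majority way 80% of the time
    trace = [("b1", True)] * 800_000 + [("b1", False)] * 200_000
    print(majority_bias(trace))         # [(0.8, 1000000)]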
Now take a wide range of programs over a very wide and varied set of data. We have the SPECmarks here with a lot of added data sets, plus a lot of other C integer codes in particular, which we're most interested in. We took all of the branches, around 7 billion of them or so, measured out of real programs running real data, and these were systems and scientific codes. What we found is that 40% of the branches you execute come from a source branch that's predictable more than 99 and a half percent of the time. And if you cut off at some reasonable place where you'd like your predictions to be right in order to get a payoff from your speculative execution, you get a huge preponderance of branches where going one way seems like the thing to do. In fact, very few branches, dynamically, are unpredictable.

Now, you can always find a program where that's not true. But I'm actually surprised: I haven't been able to find non-pathological ones. I'm sure they're out there; I just haven't run into them. I looked at some programs that I thought would be unpredictable, even some of the SPECmarks. I think most people would predict that the LISP interpreter, for example, would have very 50-50 behavior; it very much fits this pattern. Sometimes there's a little more weight at the unpredictable end, but very little more, and sometimes, in the scientific codes, there's a lot more weight up at the predictable top.

So branches are very predictable, and this gives us a technique for dealing with the problem of going past conditional branches. It also lends some validity to the limit studies that don't stop at conditional branches but say: if you want to know what's available, here's what's available. Now you need tricks to get it. You need architectural techniques that will allow you to go past those conditional jumps. So that's a little bit of narrow history; I'll say much more about the recent history at the end of the talk.
But it's a bit of narrow history and some perspective that I have always found very valuable in understanding the hardware and software techniques that go into instruction-level parallelism, which I'd like to start looking at now.

Think back to the example we did at the beginning, where we took a section of straight-line code, with no conditional branches, and tried to schedule it to see what would be fired when on the instruction-level parallel hardware. We really did two different tasks there, and you heard me doing both out loud, though I didn't separate them at the time. The first task was: can I issue this operation now? That is, is it independent of all the operations that I'm issuing now, or that sit between where it was and now? In other words, is its data ready? Have the things it has to wait for already completed their execution? That's called recognizing the independence of operations, or dependence analysis, which is the term typically used for it. It's a job you have to do in order to find out whether an operation is ready to be fired; if its data isn't ready, it doesn't make any sense to start the add. So that's one job. Then, having done that, you get a handful of things that you would like to fire right now, and you have to ask: where in this hardware am I going to do it? Can I do it this cycle, and using what resources? So the second question is: how do you schedule the operation?
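The first of those two jobs, the dependence test, can be sketched in a few lines if you describe each operation only by the registers it reads and writes; that representation is mine, not any particular machine's:

    # "Can I issue this operation now?"  It must wait for any earlier,
    # still-unfinished operation that writes a register it reads (a true
    # dependence), reads a register it writes (anti), or writes the same
    # register (output).
    def independent(op, earlier):
        return (not (op["reads"] & earlier["writes"]) and      # read after write
                not (op["writes"] & earlier["reads"]) and      # write after read
                not (op["writes"] & earlier["writes"]))        # write after write

    def data_ready(op, in_flight):
        return all(independent(op, e) for e in in_flight)

    mul = {"reads": {"r1", "r2"}, "writes": {"r3"}}
    add = {"reads": {"r7", "r8"}, "writes": {"r9"}}    # touches none of mul's registers
    dep = {"reads": {"r3", "r4"}, "writes": {"r5"}}    # needs the multiply's result

    print(data_ready(add, in_flight=[mul]))    # True: overlap it with the multiply
    print(data_ready(dep, in_flight=[mul]))    # False: it has to wait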
The way you can think of this is that if you fix the endpoints, here's the source code you're starting with, and here's the execution pattern you're trying to get out of instruction-level parallelism at the end, and those are fixed independent of how you do this stuff, then the steps along the way are as follows. You want to run your compiler and turn the program into low-level operations. You might not want to run it all the way into the final version, because, for example, you may not have chosen which functional units or which registers to use; you may do that as part of scheduling. But you do most of the compiling job. Then you want to have some thing, whatever it is, and we'll talk about that in the next slide, that will do the dependence analysis. And then you want something that will schedule the operations onto the hardware.

Now, I've gone on and on about "some thing" on purpose, to emphasize one of the most important points I hope you can take away from this talk. If you think of the source code and the record of execution as fixed, the desired source code and the desired kind of execution, then you have a spectrum of jobs to do in between, and you can draw the line between which of them to do in software, before the program runs, and which to do in hardware, as the program runs, in lots of different places.

There are two extremes. One is to simply take ordinary RISC operations and hand them to the hardware, so that on that side of the boundary everything is hardware: ask the hardware to do the independence recognition, and ask the hardware to schedule the operations onto the functional units, using the registers, whatever. In that case, you have what people now call a superscalar. Alternatively, at the other extreme, and there are intermediate points between these two, you could say: I'm going to allow the dependence analysis and the scheduling to all occur as part of the compiling process, in software, before the program runs. Then the hardware has nothing left to do but execute the code you've already scheduled. If you do that, the hardware you're talking about is called a VLIW, a Very Long Instruction Word machine.
So we're going to look at these two kinds of machines in a little more detail, and we're going to discuss this trade-off. It's very similar to the kinds of hardware/software trade-offs one is always making in systems design. But before going on, it's worth saying that while these are the popular things people talk about, there are clever ideas that try to embody the best of both kinds of processors. That is, they try not to apply the hardware/software trade-off at one extreme or the other, but instead look at all the individual pieces of the job and figure out which ones to parcel out to the software and which to the hardware.

One of the most clever ideas, just to give you an example, is something that appears in the Horizon processor, which I first saw at the Supercomputing Research Center and which, I believe, is now being built by Tera Computer in Washington state. The idea is to draw the boundary between dependence analysis and scheduling, which is an obvious place to draw it if you look at this picture, but the way they did it was very clever. The idea is to do the dependence analysis in the compiler, and then encode statements of what an operation depends upon into the code in the form of hints, so that the hardware can simply run the program step after step without looking at the hints at all. You know the program can run correctly without anybody having to do anything. But if the hardware wants to use the hints to do more clever scheduling and get some instruction-level parallelism, it can do that. And the hints are very simple: just a number attached to each instruction saying how many instructions immediately before this one it is guaranteed to be independent of. So if there's a 10 in that field, the hardware is guaranteed by the compiler that the previous 10 instructions in the instruction stream are all independent of this one. The hardware just has to look and see whether anything older than that is still in flight. If nothing is, it knows it can issue this operation. If something is, it has to play safe and wait, not knowing for certain whether that's a data predecessor of this one or not, but knowing it's within that window. That's called Horizon. Clever idea; nice name for it.
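In code terms, the check the issue hardware has to make might look something like this; the field names and the notion of a single oldest in-flight index are my shorthand for the idea, not Horizon's actual encoding:

    # Horizon-style hint: each instruction carries a number promising that
    # the N instructions immediately before it are all independent of it.
    def can_issue(index, hint, oldest_in_flight):
        # index: this instruction's position in program order
        # hint: how many immediate predecessors the compiler guarantees independent
        # oldest_in_flight: position of the oldest unfinished instruction, or None
        if oldest_in_flight is None:
            return True
        return oldest_in_flight >= index - hint   # everything unfinished is inside the window

    # instruction 42 with hint 10: instructions 32 through 41 are promised independent
    print(can_issue(42, hint=10, oldest_in_flight=35))   # True: issue it now
    print(can_issue(42, hint=10, oldest_in_flight=30))   # False: play safe and wait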
So that's one example, and I went through it to make sure you understand that this really is a continuum. People often talk about VLIWs and superscalars as being opposites, but really they're two approaches to getting the exact same thing; it's just a question of where you draw that boundary. Practitioners in the field are always discovering this, and are always glad to discover it. It's one of those nice moments when you suddenly see what's underlying everything.

OK. Now let's look at how these things work, considering a superscalar first. It's pretty intuitive. You build some extra hardware, and you feed in the instruction stream. It's just an ordinary stream, one operation at a time; it could be the same stream you would have been running on a non-superscalar processor implementing the same architecture. This hardware starts fetching the operations, looking at them, figuring out whether they're data ready or not, doing the independence recognition, and, if they are ready, figuring out where to send them in the system.

Now, there is every imaginable level of sophistication that can be applied to this job. The very simplest thing you can do is take an operation and figure out whether the registers it's going to use are reserved by somebody, reserved in the sense that somebody is planning on writing a result into that register. If they are, just stop. Or, instead of just stopping, you can stick the operation aside and look at the next one.
When I say stop, I mean stop issuing and allow the operations already in flight to complete. Instead of stopping issue, you can put that operation aside in a special buffer, look at the next operation, and ask whether that one can go. Then you need to periodically, say each cycle, look at the ones you've delayed and see whether they can fire. Or you might be even more sophisticated: when you put something aside, arrange to have the hardware notify you when that operation becomes data ready.

So there's the more modest superscalar algorithm, in the 6600, and the rather more ambitious one, called Tomasulo's algorithm, in the 360/91, in the '60s. Those sorts of things have been used in several processors, and as we'll see later, they are very much what we find in the highest-performance processors today. The algorithms used in the CDC and IBM machines lived for quite a long time. The CDC method evolved into something considerably more ambitious as the superscalar unit of the Cray supercomputers: Seymour Cray was at CDC and took that same philosophy with him when he built the Cray-1, and in one form or another these ideas have lived on in that line of computers. The same is true of the IBM mainframes; the ideas lived on in a lot of the most powerful models, and in the RS/6000 and so on.

So that's what you do with a superscalar. And the real question, as we'll see later, is how this hardware scales: how big can it grow before it starts getting impractical? We'll talk about that in a while.
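As a cartoon of those two issue policies over one register scoreboard, not of either machine's real mechanism:

    # Two superscalar issue policies over the same register reservations:
    # stop issuing at the first blocked operation (roughly the 6600's
    # scoreboard style), or park blocked operations in a buffer and keep
    # scanning (in the spirit of Tomasulo's algorithm).
    def issue(ops, busy, park_blocked):
        issued, parked = [], []
        for op in ops:
            blocked = (set(op["reads"]) | set(op["writes"])) & busy
            if not blocked:
                issued.append(op["name"])
                busy |= set(op["writes"])     # reserve the destination register
            elif park_blocked:
                parked.append(op["name"])     # set it aside, look at the next one
            else:
                break                         # simplest hardware: just stop issuing
        return issued, parked

    ops = [{"name": "mul", "reads": ["r1", "r2"], "writes": ["r3"]},
           {"name": "add", "reads": ["r3", "r4"], "writes": ["r5"]},   # needs the mul
           {"name": "sub", "reads": ["r6", "r7"], "writes": ["r8"]}]   # independent

    print(issue(ops, busy=set(), park_blocked=False))   # (['mul'], [])
    print(issue(ops, busy=set(), park_blocked=True))    # (['mul', 'sub'], ['add'])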
An alternative is to throw out that control hardware completely. Instead, let the compiler do the whole job, and do it in such a way that it schedules exactly the record of execution it wants to have happen. Rather than feeding in a sequential chunk of operations, it gives the exact outcome: the record of execution mapped exactly into the code, just as the compiler expects it to happen. Then the hardware doesn't have to do anything in the way of sophisticated scheduling. It simply takes the first instruction (people tend to call these long things instructions, and the constituent RISC operations are called operations, just to distinguish the two), parcels each operation out to the functional unit and registers it says it uses, executes them in one cycle, grabs the next instruction, and so on.

So that's the idea of the VLIW. It's very simple, and it requires much less hardware, because the issue logic of a superscalar is quite complex hardware that you can now throw out. In fact, you often also throw out a little of the control hardware in a VLIW that you would have built even in an ordinary sequential machine, simply because you already need such a sophisticated compiler that there are a few other things control hardware usually does that you can arrange to do a little more efficiently in software.
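A sketch of how little the issue side of a VLIW has to do; the instruction layout here is invented for illustration, not any real machine's format:

    # The compiler has already packed independent operations into one long
    # instruction per cycle, so issue is just handing each slot to its unit.
    program = [
        {"fmul": ("r3", "r1", "r2"), "fadd": None,
         "ialu": ("r9", "r7", "r8"), "mem": ("load", "r10", "r4")},
        {"fmul": None, "fadd": ("r11", "r9", "r12"),
         "ialu": ("r13", "r7", "r9"), "mem": None},
    ]

    def execute(program, units):
        for cycle, instruction in enumerate(program):
            for slot, op in instruction.items():
                if op is not None:
                    units[slot](op)           # no dependence checks, no reordering
            print("cycle", cycle, "issued", [s for s, op in instruction.items() if op])

    # stand-in functional units that just record what they were asked to do
    units = {slot: print for slot in ["fmul", "fadd", "ialu", "mem"]}
    execute(program, units)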
767 00:34:35,870 --> 00:34:38,530 So you tend to see any long latency 768 00:34:38,530 --> 00:34:41,230 operation in instruction level parallel 769 00:34:41,230 --> 00:34:43,389 machine being pipelined. 770 00:34:43,389 --> 00:34:44,500 Now, not all of them are. 771 00:34:44,500 --> 00:34:47,260 That's not always the case, but it is certainly the rule. 772 00:34:47,260 --> 00:34:48,909 And the example I've written here 773 00:34:48,909 --> 00:34:50,380 certainly would count on the fact 774 00:34:50,380 --> 00:34:53,110 that these longer latency operations aren't holding you 775 00:34:53,110 --> 00:34:57,620 up when you issue the next instruction. 776 00:34:57,620 --> 00:35:01,050 Now, I've shown this a issuing a bunch at the same time. 777 00:35:01,050 --> 00:35:02,750 And when we saw the record of execution 778 00:35:02,750 --> 00:35:06,860 at the very beginning for either superscalar or VLIW machines, 779 00:35:06,860 --> 00:35:10,220 we saw it written as a bunch of operations 780 00:35:10,220 --> 00:35:11,700 that fired at the same time. 781 00:35:11,700 --> 00:35:13,460 So it was sort of multi issue. 782 00:35:13,460 --> 00:35:17,030 Another idea that's very similar but just a little bit different 783 00:35:17,030 --> 00:35:19,820 is what's come to be called super pipelined or deeply 784 00:35:19,820 --> 00:35:23,870 pipelined or highly pipelined machines in which you increase 785 00:35:23,870 --> 00:35:28,040 the issue rate tremendously as far as the clock of the issue. 786 00:35:28,040 --> 00:35:30,420 But you issue more than one operator. 787 00:35:30,420 --> 00:35:30,920 I'm sorry. 788 00:35:30,920 --> 00:35:33,410 You don't issue more than one operation per clock. 789 00:35:33,410 --> 00:35:34,910 So that, for example, you might even 790 00:35:34,910 --> 00:35:37,340 have an issue rate a lot faster than an integer add, 791 00:35:37,340 --> 00:35:38,990 which is sort of what we tend to think 792 00:35:38,990 --> 00:35:44,370 of as the atomic unit in the user visible clock in a system. 793 00:35:44,370 --> 00:35:47,570 Instead, you might start an integer add and then issue 794 00:35:47,570 --> 00:35:50,660 more operations before it completes 795 00:35:50,660 --> 00:35:55,030 but at a very, very fast clip. 796 00:35:55,030 --> 00:35:57,460 Now, that's an important hardware trade off 797 00:35:57,460 --> 00:36:00,070 and has lots of implications for how you build your system. 798 00:36:00,070 --> 00:36:01,157 But really in concept. 799 00:36:01,157 --> 00:36:02,740 It's not very different from this kind 800 00:36:02,740 --> 00:36:04,390 of record of execution. 801 00:36:04,390 --> 00:36:07,750 It's just finding the available instruction level parallelism 802 00:36:07,750 --> 00:36:10,660 and scheduling it is going to be a pretty similar job. 803 00:36:10,660 --> 00:36:13,120 And it's likely to be similarly effective, 804 00:36:13,120 --> 00:36:16,390 give or take the advantages and disadvantages of the hardware 805 00:36:16,390 --> 00:36:17,800 trade offs that you get. 806 00:36:17,800 --> 00:36:22,390 And it applies just as well to superscalar and VLIW machines. 807 00:36:22,390 --> 00:36:24,910 It's a little more attractive for a superscalar machine 808 00:36:24,910 --> 00:36:27,760 because you can build a more simple issue unit 809 00:36:27,760 --> 00:36:30,580 and just make it very fast rather than trying to issue 810 00:36:30,580 --> 00:36:31,940 a lot at the same time. 811 00:36:31,940 --> 00:36:34,190 So perhaps that's an advantage there. 
812 00:36:34,190 --> 00:36:36,280 But in any case, that's an idea that people often 813 00:36:36,280 --> 00:36:39,130 think of as being more different than in my opinion 814 00:36:39,130 --> 00:36:40,720 it really is. 815 00:36:40,720 --> 00:36:42,400 So one last thing about VLIWs. 816 00:36:42,400 --> 00:36:46,090 I thought you'd enjoy seeing some code that 817 00:36:46,090 --> 00:36:48,970 is an example of when you're doing incredibly well 818 00:36:48,970 --> 00:36:50,140 on a VLIW. 819 00:36:50,140 --> 00:36:53,560 Now, you very rarely will do this well. 820 00:36:53,560 --> 00:36:56,920 This is two instructions, two clock ticks, 821 00:36:56,920 --> 00:37:02,410 of compiled code for the Multiflow Trace 28/300 822 00:37:02,410 --> 00:37:04,090 that I happen to have around. 823 00:37:04,090 --> 00:37:07,360 And it's for a dense solver. 824 00:37:07,360 --> 00:37:08,860 And dense solvers are things that 825 00:37:08,860 --> 00:37:11,020 are very easy to get parallelism out of. 826 00:37:11,020 --> 00:37:15,220 Almost any form of parallelism works great on a dense solver. 827 00:37:15,220 --> 00:37:16,720 And this is two clock ticks. 828 00:37:16,720 --> 00:37:19,630 First clock tick, you issue all of these operations. 829 00:37:19,630 --> 00:37:22,900 If you look at, if you can read the op codes from a distance, 830 00:37:22,900 --> 00:37:26,630 there's a lot of floating point multiplies integer ALUs, 831 00:37:26,630 --> 00:37:28,090 a lot of loads and stores. 832 00:37:28,090 --> 00:37:29,620 Actually, I guess I grabbed two that 833 00:37:29,620 --> 00:37:32,080 don't have loads and stores now that I look at it. 834 00:37:32,080 --> 00:37:32,770 Oh yes they do. 835 00:37:32,770 --> 00:37:34,510 Lots of loads in stores as well. 836 00:37:34,510 --> 00:37:36,910 And a bunch of branches, because when 837 00:37:36,910 --> 00:37:39,040 you do that many operations at once, 838 00:37:39,040 --> 00:37:41,950 you better not be limited to doing one branch per cycle 839 00:37:41,950 --> 00:37:44,750 or that will really make your code very sparse again. 840 00:37:44,750 --> 00:37:47,260 So in here is the second instruction filled 841 00:37:47,260 --> 00:37:48,520 with all these operations. 842 00:37:48,520 --> 00:37:51,730 Obviously if you issue this much on the first clock tick 843 00:37:51,730 --> 00:37:54,100 and then don't even wait for it to stop before issuing 844 00:37:54,100 --> 00:37:56,050 that much on the second clock tick, 845 00:37:56,050 --> 00:37:57,790 this machine is going to scream. 846 00:37:57,790 --> 00:38:01,120 And sure, a code like this goes incredibly well 847 00:38:01,120 --> 00:38:02,100 on this machine. 848 00:38:02,100 --> 00:38:04,990 If you measure the performance per clock tick of code 849 00:38:04,990 --> 00:38:08,020 like this on a machine like that, it's just tremendous. 850 00:38:08,020 --> 00:38:09,580 Of course, most of the time it's not 851 00:38:09,580 --> 00:38:10,955 going to look anything like that. 852 00:38:10,955 --> 00:38:13,690 853 00:38:13,690 --> 00:38:15,820 But interesting nonetheless. 
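The actual Trace instruction words are not reproduced here, but as a rough, invented sketch of the flavor of the thing: a very long instruction is essentially a record with one slot per functional unit, and issue is nothing more than handing every slot of the current word to its unit on each cycle, with no dependence checking in the hardware.

```c
/* Invented sketch of a VLIW instruction word and its issue loop; the slot
 * layout and the operations are made up, not the Multiflow Trace encoding. */
#include <stdio.h>

struct vliw_insn {
    const char *falu;    /* floating-point slot */
    const char *ialu;    /* integer slot        */
    const char *mem;     /* load/store slot     */
    const char *branch;  /* branch slot         */
};

int main(void) {
    struct vliw_insn code[] = {
        { "fmul f2,f0,f1", "add r7,r5,r6", "load f3,[r8]",   "nop"            },
        { "fmul f5,f3,f4", "add r9,r7,r1", "store f2,[r10]", "blt r9,r2,done" },
    };
    size_t n = sizeof code / sizeof code[0];

    for (size_t pc = 0; pc < n; pc++) {
        /* One cycle per wide instruction: every slot is parceled out at once. */
        printf("cycle %zu: FALU=%-15s IALU=%-14s MEM=%-16s BR=%s\n",
               pc, code[pc].falu, code[pc].ialu, code[pc].mem, code[pc].branch);
    }
    return 0;
}
```

The point of the sketch is only that all the intelligence went into how the two wide words were filled in by the compiler, not into the loop that issues them.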
854 00:38:15,820 --> 00:38:18,580 So now that we've seen the kinds of machines 855 00:38:18,580 --> 00:38:22,210 that there are, and of course in the scope of the time 856 00:38:22,210 --> 00:38:23,950 of this talk I can't really go into a lot 857 00:38:23,950 --> 00:38:26,183 of the architectural details that 858 00:38:26,183 --> 00:38:27,850 are very interesting about this machine, 859 00:38:27,850 --> 00:38:29,890 at the end I'll point you at some references 860 00:38:29,890 --> 00:38:33,650 if you want to learn some more about this. 861 00:38:33,650 --> 00:38:36,740 I'd like to talk about two big problems, two 862 00:38:36,740 --> 00:38:40,070 or three big problems in instruction level parallelism. 863 00:38:40,070 --> 00:38:43,870 And the first one is speculative execution. 864 00:38:43,870 --> 00:38:46,278 Speculative execution is an interesting topic. 865 00:38:46,278 --> 00:38:48,820 You hardly heard anybody talk about it just a couple of years 866 00:38:48,820 --> 00:38:49,420 ago. 867 00:38:49,420 --> 00:38:52,210 And now when you look at, let's see, 868 00:38:52,210 --> 00:38:54,620 I was at the ASPLOS conference recently, 869 00:38:54,620 --> 00:38:56,170 which is a good systems conference, 870 00:38:56,170 --> 00:38:58,690 there were three or four papers on speculative execution, 871 00:38:58,690 --> 00:38:59,560 good papers. 872 00:38:59,560 --> 00:39:01,960 So people are really working on the problems here. 873 00:39:01,960 --> 00:39:04,630 And here's the thing about speculative execution. 874 00:39:04,630 --> 00:39:07,310 There really are two sets of problems. 875 00:39:07,310 --> 00:39:09,190 There's how you implement it and then 876 00:39:09,190 --> 00:39:11,950 there's the big problem you get when you do. 877 00:39:11,950 --> 00:39:14,110 The how you implement it is actually simpler 878 00:39:14,110 --> 00:39:15,820 than you'd think. 879 00:39:15,820 --> 00:39:17,770 Here you are with a program. 880 00:39:17,770 --> 00:39:20,620 And looking ahead, you might decide because of the way 881 00:39:20,620 --> 00:39:22,720 you think these branches will go that you'd 882 00:39:22,720 --> 00:39:27,160 like to grab this operation and issue it ahead of these two 883 00:39:27,160 --> 00:39:29,690 branches, because you think they're going to go this way. 884 00:39:29,690 --> 00:39:31,940 And so you speculate that that'll be true. 885 00:39:31,940 --> 00:39:33,970 Well, now the first job you have to do 886 00:39:33,970 --> 00:39:37,810 is what happens when it was an unprofitable speculative 887 00:39:37,810 --> 00:39:38,660 operation? 888 00:39:38,660 --> 00:39:40,060 How do you clean up after it? 889 00:39:40,060 --> 00:39:43,690 And there's a series of techniques you can use. 890 00:39:43,690 --> 00:39:48,010 One technique you can use is in the hardware, 891 00:39:48,010 --> 00:39:52,570 you can arrange to sort of shadow the work you're doing 892 00:39:52,570 --> 00:39:56,770 and have the old state sitting around and the new state. 893 00:39:56,770 --> 00:39:59,620 For example, you can have a shadow set of registers. 894 00:39:59,620 --> 00:40:01,813 And this is an idea that people have 895 00:40:01,813 --> 00:40:02,980 kicked around several times. 896 00:40:02,980 --> 00:40:06,310 And now I've seen there was a paper at ASPLOS 897 00:40:06,310 --> 00:40:10,660 by Mike Smith at Stanford, who was advocating the use of this. 898 00:40:10,660 --> 00:40:14,770 That you grab an operation, do it, but write its results 899 00:40:14,770 --> 00:40:16,750 in a set of special registers. 
900 00:40:16,750 --> 00:40:18,790 And you don't sort of commit these registers 901 00:40:18,790 --> 00:40:21,790 as the real results until the branches 902 00:40:21,790 --> 00:40:23,630 resolve in that direction. 903 00:40:23,630 --> 00:40:25,420 So I'm paraphrasing a harder thing. 904 00:40:25,420 --> 00:40:27,742 But this idea has been kicked around. 905 00:40:27,742 --> 00:40:29,200 There was a little bit of something 906 00:40:29,200 --> 00:40:32,047 like that in the multiflow VLIW computer 907 00:40:32,047 --> 00:40:34,130 and it's something people have talked about a lot. 908 00:40:34,130 --> 00:40:36,430 So that's one idea that you could do. 909 00:40:36,430 --> 00:40:39,520 Second idea is to use renaming. 910 00:40:39,520 --> 00:40:44,150 That is when you write the results of your operation. 911 00:40:44,150 --> 00:40:46,810 Suppose the operation that you're pulling up 912 00:40:46,810 --> 00:40:48,670 writes a variable x. 913 00:40:48,670 --> 00:40:53,470 Suppose somebody down here reads the old value of x. 914 00:40:53,470 --> 00:40:56,980 Then you can give this new result a different name. 915 00:40:56,980 --> 00:40:59,440 And when you get down here, when the jumps resolve 916 00:40:59,440 --> 00:41:01,240 in the direction that you had planned, 917 00:41:01,240 --> 00:41:05,590 you can put in code that names that result back to x 918 00:41:05,590 --> 00:41:10,470 if it happens that this value of x and the new value of x 919 00:41:10,470 --> 00:41:11,760 reach some place in common. 920 00:41:11,760 --> 00:41:13,403 Then they have to have the same name. 921 00:41:13,403 --> 00:41:15,570 Those of you who are familiar with compilers, that's 922 00:41:15,570 --> 00:41:17,400 a typical kind of problem. 923 00:41:17,400 --> 00:41:21,330 You would just use compiler renaming, putting 924 00:41:21,330 --> 00:41:27,530 a new sort of move to rename the new value to x down here so 925 00:41:27,530 --> 00:41:30,210 that the same value gets read down both branches. 926 00:41:30,210 --> 00:41:33,733 This is the typical thing you would do, especially in a VLIW, 927 00:41:33,733 --> 00:41:35,150 where you're setting up the world, 928 00:41:35,150 --> 00:41:37,700 that we'll see in a superscalar as well, where you're setting 929 00:41:37,700 --> 00:41:42,500 up the world to do speculative execution without destroying 930 00:41:42,500 --> 00:41:44,900 the state of the old values. 931 00:41:44,900 --> 00:41:47,990 And instead of having to do moves and all of the stuff 932 00:41:47,990 --> 00:41:49,940 and renaming, one thing you can do very often 933 00:41:49,940 --> 00:41:52,190 is simply do operations that don't hurt anything 934 00:41:52,190 --> 00:41:53,270 down this branch. 935 00:41:53,270 --> 00:41:55,530 And then just ignore them when you go down there. 936 00:41:55,530 --> 00:41:59,120 So if over here you're writing a value x 937 00:41:59,120 --> 00:42:02,060 but there's no interest in the old value of x down here, 938 00:42:02,060 --> 00:42:04,657 you can just move it up and not worry about it. 939 00:42:04,657 --> 00:42:06,240 Forget about it when it goes that way. 940 00:42:06,240 --> 00:42:07,310 So x has some value. 941 00:42:07,310 --> 00:42:08,870 You don't care. 942 00:42:08,870 --> 00:42:11,300 So those are the sorts of techniques you can use. 943 00:42:11,300 --> 00:42:13,730 This is all perfectly tractable. 944 00:42:13,730 --> 00:42:14,810 It's been implemented. 
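Here is a small, invented illustration of that renaming idea, written in C terms rather than machine code: the speculated result is computed early under a fresh name, and the ordinary move that assigns it back to x happens only on the path where the branches resolve the way the speculation assumed, so the old value of x is never destroyed on the other paths.

```c
/* Invented example of compiler renaming for speculative execution.
 * Before scheduling, x = a * b lived inside the innermost if; after
 * scheduling it has been hoisted above both branches under a new name. */
#include <stdio.h>

int f(int a, int b, int c, int x) {
    int x_spec = a * b;        /* speculative result, under a fresh name       */

    if (c > 0) {
        if (a > b) {
            x = x_spec;        /* predicted path: rename the result back to x  */
        } else {
            x = x - 1;         /* old value of x was never clobbered           */
        }
    }
    return x;
}

int main(void) {
    printf("%d %d\n", f(3, 2, 1, 10), f(2, 3, 1, 10));  /* prints 6 9 */
    return 0;
}
```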
945 00:42:14,810 --> 00:42:18,140 The Multiflow Trace did speculative execution 946 00:42:18,140 --> 00:42:18,950 like this. 947 00:42:18,950 --> 00:42:23,690 And people have done the sort of moving things up 948 00:42:23,690 --> 00:42:25,412 that you can ignore for many years 949 00:42:25,412 --> 00:42:26,870 in various different architectures. 950 00:42:26,870 --> 00:42:28,940 So this has been found around, and it seems 951 00:42:28,940 --> 00:42:30,110 like a reasonable thing. 952 00:42:30,110 --> 00:42:33,450 There's a big gotcha with speculative execution though. 953 00:42:33,450 --> 00:42:36,640 And that is what happens when there's an exception? 954 00:42:36,640 --> 00:42:41,350 So it's easiest if we look at it in terms of loads. 955 00:42:41,350 --> 00:42:44,650 So suppose you have a while loop here. 956 00:42:44,650 --> 00:42:47,980 And what we're doing is while the index 957 00:42:47,980 --> 00:42:51,380 I is below some upper limit, we're 958 00:42:51,380 --> 00:42:53,600 going around and doing some computation. 959 00:42:53,600 --> 00:42:57,840 And the computation involves loading A of I. Now, 960 00:42:57,840 --> 00:43:01,020 you might want to look down further iterations 961 00:43:01,020 --> 00:43:04,080 of this computation and bring some loads further up. 962 00:43:04,080 --> 00:43:06,600 That's a typical thing people like to do, because loads very 963 00:43:06,600 --> 00:43:09,480 often, especially array loads if the array is large enough 964 00:43:09,480 --> 00:43:11,430 and you're bypassing a cache, say, 965 00:43:11,430 --> 00:43:13,380 can have a very long latency. 966 00:43:13,380 --> 00:43:15,180 So you want to bring those loads up. 967 00:43:15,180 --> 00:43:18,270 But at some point, you're going to bring a load up that you 968 00:43:18,270 --> 00:43:21,840 wouldn't have wanted to do because of having 969 00:43:21,840 --> 00:43:23,470 gone past the upper limit. 970 00:43:23,470 --> 00:43:27,500 So now you've loaded off the end of the array. 971 00:43:27,500 --> 00:43:30,410 Well, I'm not a big fan of sports sayings, 972 00:43:30,410 --> 00:43:32,360 which are typically trite. 973 00:43:32,360 --> 00:43:34,430 And there's one in particular that 974 00:43:34,430 --> 00:43:38,600 makes me totally crazy whenever I hear a color person in an NFL 975 00:43:38,600 --> 00:43:40,180 game doing. 976 00:43:40,180 --> 00:43:42,470 They'll very often say, I don't know if any of you 977 00:43:42,470 --> 00:43:45,350 have ever noticed this, they'll say when you throw a pass, 978 00:43:45,350 --> 00:43:47,090 three things can happen and two of them 979 00:43:47,090 --> 00:43:51,140 are bad, being an incompletion and an interception. 980 00:43:51,140 --> 00:43:54,140 And the implication is you shouldn't throw passes. 981 00:43:54,140 --> 00:43:55,567 You should run a lot. 982 00:43:55,567 --> 00:43:57,150 So for those of you who know football, 983 00:43:57,150 --> 00:43:59,850 that makes some sense. 984 00:43:59,850 --> 00:44:01,490 That's of course, patently crazy, 985 00:44:01,490 --> 00:44:05,540 because it doesn't weigh the probabilities of the bad things 986 00:44:05,540 --> 00:44:08,420 and good things happening and the cost and benefit 987 00:44:08,420 --> 00:44:09,687 of the good and bad things. 988 00:44:09,687 --> 00:44:12,020 But you're always hearing them say, if you throw a pass, 989 00:44:12,020 --> 00:44:13,340 three things can happen, two of them are bad. 990 00:44:13,340 --> 00:44:15,680 Well, speculative loads are just like that. 
991 00:44:15,680 --> 00:44:20,000 If you do a speculative load, three things can happen 992 00:44:20,000 --> 00:44:22,003 and two of them are bad. 993 00:44:22,003 --> 00:44:23,420 The good thing that happens is you 994 00:44:23,420 --> 00:44:25,100 do the load, everything's fine. 995 00:44:25,100 --> 00:44:26,570 You fetch the data. 996 00:44:26,570 --> 00:44:28,550 Nobody seems to have a complaint. 997 00:44:28,550 --> 00:44:30,050 If it turns out that you don't want 998 00:44:30,050 --> 00:44:31,250 to use the data, then great. 999 00:44:31,250 --> 00:44:32,063 You ignore it. 1000 00:44:32,063 --> 00:44:33,230 I mean, you just did a load. 1001 00:44:33,230 --> 00:44:34,563 What's the harm of doing a load? 1002 00:44:34,563 --> 00:44:36,650 You can certainly ignore it if it turns out 1003 00:44:36,650 --> 00:44:39,590 that you don't go the direction that would have required 1004 00:44:39,590 --> 00:44:41,580 the use of that load. 1005 00:44:41,580 --> 00:44:43,830 Second thing that can happen, though, and this is bad, 1006 00:44:43,830 --> 00:44:47,130 is that you might have a miss in your memory hierarchy 1007 00:44:47,130 --> 00:44:48,990 somewhere. 1008 00:44:48,990 --> 00:44:51,680 And now you've got to figure out whether to service that miss. 1009 00:44:51,680 --> 00:44:54,260 1010 00:44:54,260 --> 00:44:55,850 Now, you might have no choice. 1011 00:44:55,850 --> 00:44:58,230 Somebody might have built the system in that way. 1012 00:44:58,230 --> 00:45:00,350 But for the system designer, maybe you 1013 00:45:00,350 --> 00:45:05,270 want to build in some ability to allow you to not have to stop. 1014 00:45:05,270 --> 00:45:07,520 Particularly, I mean, OK, if it's a cache miss. 1015 00:45:07,520 --> 00:45:09,770 But suppose the page fault. 1016 00:45:09,770 --> 00:45:11,690 Do you really want to stop and service 1017 00:45:11,690 --> 00:45:16,010 the page fault, which takes an awfully long time, if you start 1018 00:45:16,010 --> 00:45:17,990 getting a lot of them because you've 1019 00:45:17,990 --> 00:45:21,140 run off the end of an array? 1020 00:45:21,140 --> 00:45:23,500 And now you're really getting into the area of data 1021 00:45:23,500 --> 00:45:24,790 that you're not interested in. 1022 00:45:24,790 --> 00:45:28,880 Worse than that, imagine you're chasing pointers. 1023 00:45:28,880 --> 00:45:33,350 And now you start loading from some useless address 1024 00:45:33,350 --> 00:45:35,247 because you're going off into some place 1025 00:45:35,247 --> 00:45:36,830 that the pointer structure never meant 1026 00:45:36,830 --> 00:45:39,470 you to be because your test for the end 1027 00:45:39,470 --> 00:45:41,550 would have stopped you before you got there. 1028 00:45:41,550 --> 00:45:43,880 And now you've got some address that's 1029 00:45:43,880 --> 00:45:46,380 an address you're not even slightly interested in. 1030 00:45:46,380 --> 00:45:48,560 So you have a policy question of whether, 1031 00:45:48,560 --> 00:45:51,080 if you build the hardware, whether to allow yourself 1032 00:45:51,080 --> 00:45:54,830 to not service page fault. So that's the first bad thing that 1033 00:45:54,830 --> 00:45:55,460 can happen. 1034 00:45:55,460 --> 00:45:56,940 Second bad thing that can happen, 1035 00:45:56,940 --> 00:45:59,930 which is maybe less bad, is that you 1036 00:45:59,930 --> 00:46:02,420 might get an actual exception. 
1037 00:46:02,420 --> 00:46:06,530 So you might try to access an address that's 1038 00:46:06,530 --> 00:46:08,000 not only off the end of the array 1039 00:46:08,000 --> 00:46:09,890 but out of your user space. 1040 00:46:09,890 --> 00:46:11,678 Or it might be on Mars. 1041 00:46:11,678 --> 00:46:14,220 There might be some address that doesn't even make any sense. 1042 00:46:14,220 --> 00:46:14,930 So that's easy. 1043 00:46:14,930 --> 00:46:18,620 The hardware raises its hand and says, oops, sorry, bad address. 1044 00:46:18,620 --> 00:46:20,210 Well, now what? 1045 00:46:20,210 --> 00:46:26,270 Chances are you've only gotten that because you've 1046 00:46:26,270 --> 00:46:27,770 done something speculatively. 1047 00:46:27,770 --> 00:46:30,740 The only other alternative is that it's a user error. 1048 00:46:30,740 --> 00:46:32,900 And those are pretty infrequent, 1049 00:46:32,900 --> 00:46:34,340 as it turns out when you measure, 1050 00:46:34,340 --> 00:46:37,580 compared to exceptions raised from speculative operations. 1051 00:46:37,580 --> 00:46:40,100 The Multiflow compiler is one I have some experience with. 1052 00:46:40,100 --> 00:46:41,558 Stefan Freudenberger and I have been 1053 00:46:41,558 --> 00:46:45,200 looking at these properties of the Multiflow compiler at HP. 1054 00:46:45,200 --> 00:46:49,460 And we've been finding that, in fact, some programs 1055 00:46:49,460 --> 00:46:54,760 have hundreds of thousands of fake exceptions. 1056 00:46:54,760 --> 00:46:59,460 That is, exceptions raised by unprofitable speculative loads. 1057 00:46:59,460 --> 00:47:03,070 And if you have to stop and report each of those 1058 00:47:03,070 --> 00:47:06,917 or take some kind of trap during each of those, that's misery. 1059 00:47:06,917 --> 00:47:07,750 But what can you do? 1060 00:47:07,750 --> 00:47:11,790 If you just ignore them, which is what you're tempted to do, 1061 00:47:11,790 --> 00:47:15,720 what do you do when the user actually has an error and does 1062 00:47:15,720 --> 00:47:19,950 a load that your compiler or hardware has marked 1063 00:47:19,950 --> 00:47:22,620 as speculative, and now your system says, oh look, 1064 00:47:22,620 --> 00:47:24,600 here's an exception on the speculative load. 1065 00:47:24,600 --> 00:47:26,400 Ha ha, they're not fooling me. 1066 00:47:26,400 --> 00:47:28,650 Well, what if it turns out it's actually a user error? 1067 00:47:28,650 --> 00:47:30,240 Planes fall out of the sky. 1068 00:47:30,240 --> 00:47:31,050 Bridges collapse. 1069 00:47:31,050 --> 00:47:34,370 So you have to report them or you're in trouble there. 1070 00:47:34,370 --> 00:47:35,620 So now what? 1071 00:47:35,620 --> 00:47:38,410 You've got 200,000 of these things in some program. 1072 00:47:38,410 --> 00:47:40,900 You don't want to report them except if it's user error, 1073 00:47:40,900 --> 00:47:42,907 and then you do want to report it. 1074 00:47:42,907 --> 00:47:44,740 So that's a piece of engineering that people 1075 00:47:44,740 --> 00:47:47,410 have been working very hard on the last couple of years.
1076 00:47:47,410 --> 00:47:51,340 And the basic idea that you'll see in several papers 1077 00:47:51,340 --> 00:47:54,910 that people have written up is sort 1078 00:47:54,910 --> 00:47:57,680 of to mimic the IEEE floating point idea, 1079 00:47:57,680 --> 00:48:03,580 which is to produce some invalid value and set a bit 1080 00:48:03,580 --> 00:48:05,703 somewhere that says so. Of course, integers and addresses, 1081 00:48:05,703 --> 00:48:07,120 things like that, unfortunately don't 1082 00:48:07,120 --> 00:48:10,150 have the nice extra representation 1083 00:48:10,150 --> 00:48:12,880 that floating point numbers have, a bit pattern that makes no sense 1084 00:48:12,880 --> 00:48:13,730 that you can use 1085 00:48:13,730 --> 00:48:15,910 and call not a number. 1086 00:48:15,910 --> 00:48:17,800 So you don't 1087 00:48:17,800 --> 00:48:19,745 get this for free. 1088 00:48:19,745 --> 00:48:21,370 You have to build some hardware to take 1089 00:48:21,370 --> 00:48:24,070 care of this, some extra bits somewhere, 1090 00:48:24,070 --> 00:48:26,230 that allow you to tag something you've 1091 00:48:26,230 --> 00:48:31,750 done as having been speculative and raised an exception. 1092 00:48:31,750 --> 00:48:33,670 And then somehow, you arrange, and this 1093 00:48:33,670 --> 00:48:35,680 is where the engineering comes in, 1094 00:48:35,680 --> 00:48:41,320 somehow you arrange to find out when the branches resolve 1095 00:48:41,320 --> 00:48:43,570 in such a way that you know for a fact 1096 00:48:43,570 --> 00:48:46,270 that it was a profitable speculative operation. 1097 00:48:46,270 --> 00:48:50,420 That is, the flow of control eventually did go that way. 1098 00:48:50,420 --> 00:48:53,132 And so you shouldn't have gotten an exception. 1099 00:48:53,132 --> 00:48:55,090 It wasn't that you didn't mean to do that operation 1100 00:48:55,090 --> 00:48:55,630 in the first place. 1101 00:48:55,630 --> 00:48:56,170 You did. 1102 00:48:56,170 --> 00:48:57,370 It's a real user error. 1103 00:48:57,370 --> 00:48:58,630 You better stop and report it. 1104 00:48:58,630 --> 00:49:00,250 So you have to somehow leave around, 1105 00:49:00,250 --> 00:49:02,050 in that spot when you finally get 1106 00:49:02,050 --> 00:49:04,960 to where the branches resolve the right way, the ability 1107 00:49:04,960 --> 00:49:09,710 to report this user error. 1108 00:49:09,710 --> 00:49:11,020 So that's the idea. 1109 00:49:11,020 --> 00:49:14,830 Hardware delays the report until the branch resolves. 1110 00:49:14,830 --> 00:49:17,185 And you'll see people building systems 1111 00:49:17,185 --> 00:49:18,310 that do this in the future. 1112 00:49:18,310 --> 00:49:20,645 Nothing like this has been built yet. 1113 00:49:20,645 --> 00:49:23,020 Again, the machine that I've had a lot of experience with 1114 00:49:23,020 --> 00:49:24,880 is probably the most ambitious by 1115 00:49:24,880 --> 00:49:26,230 far in terms of doing this step. 1116 00:49:26,230 --> 00:49:31,120 The Multiflow Trace VLIWs did do speculative loads. 1117 00:49:31,120 --> 00:49:33,400 They marked them as speculative. 1118 00:49:33,400 --> 00:49:36,258 But in the end, they would just throw away 1119 00:49:36,258 --> 00:49:38,050 exceptions on speculative loads, and that's 1120 00:49:38,050 --> 00:49:40,690 a pretty dangerous policy and not the sort of thing 1121 00:49:40,690 --> 00:49:42,520 you'd want to be doing.
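A minimal sketch of that delayed-report idea follows, with the tag modeled as an ordinary struct field and the checks done in software. Everything here, including the names spec_load and commit, is invented for illustration; real proposals put the extra bit on the register and the checks in hardware.

```c
/* Invented sketch: a speculative load that would fault does not trap; it
 * returns a value tagged as poisoned, and the trap is reported only if the
 * branches later resolve so that the value is actually needed. */
#include <stdio.h>
#include <stdlib.h>

struct spec_val { int value; int poisoned; };   /* the extra tag bit */

static struct spec_val spec_load(const int *a, int i, int n) {
    struct spec_val r = { 0, 0 };
    if (i < 0 || i >= n)
        r.poisoned = 1;        /* would have been a bad address: defer it */
    else
        r.value = a[i];
    return r;
}

static int commit(struct spec_val v) {
    if (v.poisoned) {          /* control really came this way: report it now */
        fprintf(stderr, "deferred exception: bad speculative load\n");
        exit(1);
    }
    return v.value;
}

int main(void) {
    int a[4] = { 10, 20, 30, 40 };
    int n = 4, i = 3, sum = 0;

    /* hoisted above the loop test: one profitable load, one off the end */
    struct spec_val cur  = spec_load(a, i,     n);
    struct spec_val next = spec_load(a, i + 1, n);   /* poisoned */

    sum += commit(cur);        /* this one really was needed                */
    if (i + 1 < n)
        sum += commit(next);   /* branch goes the other way: never reported */

    printf("sum = %d\n", sum); /* prints 40 */
    return 0;
}
```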
1122 00:49:42,520 --> 00:49:49,030 The second huge area concerns the two real big disadvantages 1123 00:49:49,030 --> 00:49:53,360 of superscalars and VLIWs. 1124 00:49:53,360 --> 00:49:57,530 And I'm going to talk about the VLIW disadvantage, which is 1125 00:49:57,530 --> 00:49:59,570 what the next slide will say. 1126 00:49:59,570 --> 00:50:01,250 But in the course of doing it, I'll 1127 00:50:01,250 --> 00:50:03,500 compare it to superscalars. 1128 00:50:03,500 --> 00:50:08,450 And I'll talk about what the big disadvantage for superscalars is. 1129 00:50:08,450 --> 00:50:14,220 The big disadvantage for VLIWs is object code compatibility. 1130 00:50:14,220 --> 00:50:18,270 This is really the Achilles heel of these architectures. 1131 00:50:18,270 --> 00:50:20,040 And it is a big problem. 1132 00:50:20,040 --> 00:50:23,325 And I'm personally dying to know just how all of this 1133 00:50:23,325 --> 00:50:24,200 is going to turn out. 1134 00:50:24,200 --> 00:50:25,740 I certainly have my guesses. 1135 00:50:25,740 --> 00:50:31,800 But right now VLIWs pose a very big engineering problem 1136 00:50:31,800 --> 00:50:35,250 for computer manufacturers. 1137 00:50:35,250 --> 00:50:39,110 This is not a technical problem in terms of building systems 1138 00:50:39,110 --> 00:50:41,430 that work correctly and go fast. 1139 00:50:41,430 --> 00:50:44,870 It's the problem that when you build a VLIW 1140 00:50:44,870 --> 00:50:47,023 and you're a computer manufacturer, 1141 00:50:47,023 --> 00:50:48,440 then when you 1142 00:50:48,440 --> 00:50:51,530 want to ship your next VLIW, it's 1143 00:50:51,530 --> 00:50:54,080 unlikely to run the code that was 1144 00:50:54,080 --> 00:50:57,490 running on your previous VLIW. 1145 00:50:57,490 --> 00:50:59,140 Let me explain why that is. 1146 00:50:59,140 --> 00:51:02,470 If what you've done in a VLIW is written 1147 00:51:02,470 --> 00:51:09,200 the actual record of execution that you're expecting to get, 1148 00:51:09,200 --> 00:51:12,620 then what happens when you've put a multiply here 1149 00:51:12,620 --> 00:51:15,920 and an add that uses its data three cycles later 1150 00:51:15,920 --> 00:51:19,520 and you build a new machine in which the multiply takes 1151 00:51:19,520 --> 00:51:22,090 five cycles instead of three? 1152 00:51:22,090 --> 00:51:25,240 This old code will not have the right answer waiting for it 1153 00:51:25,240 --> 00:51:26,890 after three cycles. 1154 00:51:26,890 --> 00:51:31,452 Now, you can put in a way of stopping and waiting, spinning 1155 00:51:31,452 --> 00:51:33,160 for a couple of cycles, and that happens. 1156 00:51:33,160 --> 00:51:34,090 But now you're going to throw away 1157 00:51:34,090 --> 00:51:35,740 a lot of the advantage of your new system 1158 00:51:35,740 --> 00:51:36,850 when it runs the old code. 1159 00:51:36,850 --> 00:51:38,642 That's pretty unacceptable, and even that's 1160 00:51:38,642 --> 00:51:41,260 hard to engineer in some aspects. 1161 00:51:41,260 --> 00:51:44,140 So the problem is that when you have your old code 1162 00:51:44,140 --> 00:51:45,940 and you want to run it on your new VLIW, 1163 00:51:45,940 --> 00:51:48,940 it's only OK if you've just scaled the clock. 1164 00:51:48,940 --> 00:51:51,670 If everything is in the same relation to each other, 1165 00:51:51,670 --> 00:51:54,460 if all the latencies are the same relative to each other, 1166 00:51:54,460 --> 00:51:55,400 that's fine. 1167 00:51:55,400 --> 00:51:57,410 Same number of clock signals.
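The toy simulation below replays that three-versus-five-cycle multiply story; the register contents, the latencies, and the two-slot schedule are all invented. The schedule was frozen on the assumption of a 3-cycle multiply, so the dependent add was placed exactly three cycles later, and on a machine where the multiply takes five cycles the add quietly reads a stale register.

```c
/* Invented illustration of why a fixed VLIW schedule breaks when an
 * implementation changes a latency: the add at cycle 3 assumes the
 * multiply issued at cycle 0 has already written its result. */
#include <stdio.h>

static int run(int mul_latency) {
    int r[8] = { 0, 6, 7, 0, 5, 0, 0, 0 };  /* r1 = 6, r2 = 7, r4 = 5 */
    int mul_result = 0, mul_done_cycle = -1;

    for (int cycle = 0; cycle < 8; cycle++) {
        if (cycle == 0) {                    /* slot fixed by the old schedule */
            mul_result = r[1] * r[2];
            mul_done_cycle = cycle + mul_latency;
        }
        if (cycle == mul_done_cycle)
            r[3] = mul_result;               /* multiply writes back           */
        if (cycle == 3)
            r[5] = r[3] + r[4];              /* add scheduled 3 cycles later   */
    }
    return r[5];
}

int main(void) {
    printf("3-cycle multiply (as scheduled): r5 = %d\n", run(3));  /* 47       */
    printf("5-cycle multiply (new machine):  r5 = %d\n", run(5));  /* stale: 5 */
    return 0;
}
```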
1168 00:51:57,410 --> 00:52:00,880 But if you've done anything else about changing the latencies, 1169 00:52:00,880 --> 00:52:03,470 you have a real problem in this. 1170 00:52:03,470 --> 00:52:07,510 Now, there are a lot of hardware and software tricks 1171 00:52:07,510 --> 00:52:10,930 that people have talked about playing for VLIWs. 1172 00:52:10,930 --> 00:52:15,040 And there's a lot of energy being put toward really solving 1173 00:52:15,040 --> 00:52:18,470 this problem in the system by sort 1174 00:52:18,470 --> 00:52:20,750 of importing a little bit of superscalarness 1175 00:52:20,750 --> 00:52:22,700 into the machines. 1176 00:52:22,700 --> 00:52:25,790 And the machines that have been built, both the Multiflow 1177 00:52:25,790 --> 00:52:28,680 machine and the Cydrome machine, for example, 1178 00:52:28,680 --> 00:52:31,040 which were the two most ambitious VLIWs that 1179 00:52:31,040 --> 00:52:32,870 were built in the 1980s, 1180 00:52:32,870 --> 00:52:37,760 both had some mechanism whereby you could do loads 1181 00:52:37,760 --> 00:52:40,430 and in case the load had to delay 1182 00:52:40,430 --> 00:52:42,310 for one reason or the other, the machine 1183 00:52:42,310 --> 00:52:43,310 could take care of that. 1184 00:52:43,310 --> 00:52:44,810 It could sort of stop and wait. 1185 00:52:44,810 --> 00:52:46,760 But you don't want to push that idea too far, 1186 00:52:46,760 --> 00:52:49,880 or you're throwing away all of the performance that the VLIW 1187 00:52:49,880 --> 00:52:51,980 is supposed to be giving you. 1188 00:52:51,980 --> 00:52:54,560 Now, there's a second approach to solving this, 1189 00:52:54,560 --> 00:52:57,320 and that is to distribute something different from fully 1190 00:52:57,320 --> 00:52:58,730 bound object code. 1191 00:52:58,730 --> 00:53:01,010 This has appeared in a few experimental systems. 1192 00:53:01,010 --> 00:53:03,230 Even some little bit of extra work 1193 00:53:03,230 --> 00:53:07,140 is done in linkers in some commercial systems as well. 1194 00:53:07,140 --> 00:53:10,740 So what you distribute is not quite fully bound object code. 1195 00:53:10,740 --> 00:53:12,980 Well, you might want to take that idea to a much 1196 00:53:12,980 --> 00:53:15,260 greater extreme in a VLIW. 1197 00:53:15,260 --> 00:53:17,480 And what you might want to do instead 1198 00:53:17,480 --> 00:53:20,930 is to ship something in which the final scheduling 1199 00:53:20,930 --> 00:53:22,760 step hasn't been done. 1200 00:53:22,760 --> 00:53:25,130 The dependence analysis has been done. 1201 00:53:25,130 --> 00:53:28,460 Some amount of the scheduling perhaps has been done. 1202 00:53:28,460 --> 00:53:32,270 But the linker, basically, or what looks to the user 1203 00:53:32,270 --> 00:53:34,940 like the linker does this final scheduling 1204 00:53:34,940 --> 00:53:37,320 at install time for the code. 1205 00:53:37,320 --> 00:53:41,090 So that's distributing some sort of raised object code, 1206 00:53:41,090 --> 00:53:42,710 something a level beyond what you'd normally expect to ship. 1207 00:53:42,710 --> 00:53:45,110 So those are the directions that people 1208 00:53:45,110 --> 00:53:49,320 have been using for solving this object code compatibility 1209 00:53:49,320 --> 00:53:49,820 problem. 1210 00:53:49,820 --> 00:53:52,310 Now, of course, there's another solution to the object code 1211 00:53:52,310 --> 00:53:56,390 compatibility problem, and that is to build a superscalar.
1212 00:53:56,390 --> 00:53:58,610 And at first glance, and in reality 1213 00:53:58,610 --> 00:54:00,800 in the systems we ship today, this problem 1214 00:54:00,800 --> 00:54:03,950 is completely solved by shipping a superscalar. 1215 00:54:03,950 --> 00:54:06,530 A superscalar, after all, gets the sequential code 1216 00:54:06,530 --> 00:54:08,100 in the hardware. 1217 00:54:08,100 --> 00:54:09,870 The thing that you send to the user 1218 00:54:09,870 --> 00:54:12,480 is the stuff that comes out after the squiggly line 1219 00:54:12,480 --> 00:54:14,730 way over here, not the squiggly line right over there, 1220 00:54:14,730 --> 00:54:16,590 if you remember that slide. 1221 00:54:16,590 --> 00:54:18,870 And all the rest of the stuff that's 1222 00:54:18,870 --> 00:54:20,430 changing according to the latencies 1223 00:54:20,430 --> 00:54:23,020 and so on of the hardware is being done in the hardware. 1224 00:54:23,020 --> 00:54:24,990 So you just build different control hardware 1225 00:54:24,990 --> 00:54:27,090 as part of the implementation of the superscalar. 1226 00:54:27,090 --> 00:54:29,220 It's going to run the sequential code. 1227 00:54:29,220 --> 00:54:32,340 As far as the user's concerned, distribute the sequential code, 1228 00:54:32,340 --> 00:54:33,900 it'll run correctly on every system. 1229 00:54:33,900 --> 00:54:36,300 No problem. 1230 00:54:36,300 --> 00:54:37,980 There is a problem with it. 1231 00:54:37,980 --> 00:54:40,440 For systems with the amount of ILP 1232 00:54:40,440 --> 00:54:47,520 that we see today, the problem really is solved just fine, 1233 00:54:47,520 --> 00:54:49,770 I believe. 1234 00:54:49,770 --> 00:54:54,510 But what I'm not convinced of is that the problem is solved 1235 00:54:54,510 --> 00:54:57,420 for large amounts more ILP. 1236 00:54:57,420 --> 00:54:59,910 And I'd like to explain why and what the problem is. 1237 00:54:59,910 --> 00:55:02,790 Now, this is all conjecture. 1238 00:55:02,790 --> 00:55:04,710 People have never built a superscalar 1239 00:55:04,710 --> 00:55:09,960 that goes beyond the superscalars we have today. 1240 00:55:09,960 --> 00:55:11,610 We finally, in fact, have gotten up 1241 00:55:11,610 --> 00:55:13,680 just sort of to the level of the big machines 1242 00:55:13,680 --> 00:55:17,707 that people were building dozens of years ago. 1243 00:55:17,707 --> 00:55:19,540 And it's going to be very interesting to see 1244 00:55:19,540 --> 00:55:21,880 what direction this goes in now that we 1245 00:55:21,880 --> 00:55:24,880 have machines that have finally gotten up to that boundary. 1246 00:55:24,880 --> 00:55:27,400 But here's the problem with large amounts of instruction 1247 00:55:27,400 --> 00:55:29,610 level parallelism. 1248 00:55:29,610 --> 00:55:32,920 First of all, you might want to ask 1249 00:55:32,920 --> 00:55:35,320 whether that amount of instruction level parallelism 1250 00:55:35,320 --> 00:55:36,737 is there. 1251 00:55:36,737 --> 00:55:38,320 Before the end of this talk, I'm going 1252 00:55:38,320 --> 00:55:40,570 to talk about how much instruction level 1253 00:55:40,570 --> 00:55:43,690 parallelism I think is available in different kinds of codes. 1254 00:55:43,690 --> 00:55:47,290 But one quick answer before I talk 1255 00:55:47,290 --> 00:55:48,880 about this problem with superscalars 1256 00:55:48,880 --> 00:55:51,490 is to say, well, maybe this is all the ILP 1257 00:55:51,490 --> 00:55:54,910 there really is in most programs except for scientific programs. 
1258 00:55:54,910 --> 00:55:56,470 Scientific programs, you give them 1259 00:55:56,470 --> 00:55:58,990 a vector unit along with your modest superscalar 1260 00:55:58,990 --> 00:56:00,730 and then you're just doing fine. 1261 00:56:00,730 --> 00:56:02,970 And if that's all there is, then this is no problem. 1262 00:56:02,970 --> 00:56:05,300 Then we can go on from there. 1263 00:56:05,300 --> 00:56:07,090 So that's one simple answer. 1264 00:56:07,090 --> 00:56:10,090 I don't believe it's going to be true in the end, 1265 00:56:10,090 --> 00:56:12,348 and I certainly hope it's not true in the end. 1266 00:56:12,348 --> 00:56:13,390 But it's always possible. 1267 00:56:13,390 --> 00:56:16,410 There's no way to know for sure right now. 1268 00:56:16,410 --> 00:56:18,900 But what happens when you have large amounts of ILP? 1269 00:56:18,900 --> 00:56:21,420 One negative answer is you can't even 1270 00:56:21,420 --> 00:56:24,455 build a superscalar like that. 1271 00:56:24,455 --> 00:56:26,830 After all, if you have large amounts of instruction level 1272 00:56:26,830 --> 00:56:28,840 parallelism, then you have to be looking 1273 00:56:28,840 --> 00:56:31,710 at a lot of operations at once. 1274 00:56:31,710 --> 00:56:35,080 And to do the independence recognition, 1275 00:56:35,080 --> 00:56:37,540 you have to look at these operations 1276 00:56:37,540 --> 00:56:41,987 basically pairwise and ask, are they independent of each other? 1277 00:56:41,987 --> 00:56:43,570 Whatever algorithm you're going to use 1278 00:56:43,570 --> 00:56:45,880 is going to seem to grow exponentially 1279 00:56:45,880 --> 00:56:49,130 in the number of operations you want to consider. 1280 00:56:49,130 --> 00:56:51,230 If you have a large amount of operations 1281 00:56:51,230 --> 00:56:53,840 that you're trying to fire every cycle, 1282 00:56:53,840 --> 00:56:57,660 you better be considering a huge amount more than that. 1283 00:56:57,660 --> 00:57:01,230 And having this control hardware grow exponentially 1284 00:57:01,230 --> 00:57:04,770 in the number of operations is likely to be really bad news 1285 00:57:04,770 --> 00:57:06,090 for the hardware designer. 1286 00:57:06,090 --> 00:57:07,757 I think if you talk to the people who've 1287 00:57:07,757 --> 00:57:11,730 built today's most ambitious superscalars, most of them, 1288 00:57:11,730 --> 00:57:13,620 maybe they're more plucky in public, 1289 00:57:13,620 --> 00:57:16,390 but they're not really excited about the next steps. 1290 00:57:16,390 --> 00:57:19,840 It really seems hard to go much beyond where we are now. 1291 00:57:19,840 --> 00:57:21,750 So this is a real problem. 1292 00:57:21,750 --> 00:57:22,590 Totally unknown. 1293 00:57:22,590 --> 00:57:25,470 It might turn out that there are nice hardware algorithms you 1294 00:57:25,470 --> 00:57:28,930 can use that will make this problem tractable 1295 00:57:28,930 --> 00:57:30,430 and then everything is fine. 1296 00:57:30,430 --> 00:57:34,720 But question one is, can you really scale superscalars 1297 00:57:34,720 --> 00:57:36,580 much beyond where we are today? 1298 00:57:36,580 --> 00:57:38,680 It's not a problem for VLIWs. 1299 00:57:38,680 --> 00:57:42,587 You have to scale the compiling time, which unfortunately 1300 00:57:42,587 --> 00:57:44,920 in this tape there isn't enough time for me to really go 1301 00:57:44,920 --> 00:57:45,650 into that subject.
1302 00:57:45,650 --> 00:57:47,108 But I don't think compiling time is 1303 00:57:47,108 --> 00:57:50,440 going to be a very big barrier in the end to using VLIWs. 1304 00:57:50,440 --> 00:57:51,940 I think compiling is going to end up 1305 00:57:51,940 --> 00:57:55,240 being very fast, even very sophisticated compiling 1306 00:57:55,240 --> 00:57:55,980 for VLIWs. 1307 00:57:55,980 --> 00:57:57,730 That's something I'm very confident about. 1308 00:57:57,730 --> 00:58:03,910 But scaling these superscalars, so VLIWs scale very nicely. 1309 00:58:03,910 --> 00:58:05,440 You just add more functionality. 1310 00:58:05,440 --> 00:58:10,920 It's a pretty linear design, and it's not a huge problem 1311 00:58:10,920 --> 00:58:11,832 to grow it. 1312 00:58:11,832 --> 00:58:13,290 But growing the superscalar, you've 1313 00:58:13,290 --> 00:58:15,870 got the central control unit more or less all 1314 00:58:15,870 --> 00:58:19,230 in one place that has to do this exponentially growing job. 1315 00:58:19,230 --> 00:58:21,660 It's kind of scary, actually. 1316 00:58:21,660 --> 00:58:23,370 Next question is, what about the register 1317 00:58:23,370 --> 00:58:25,410 architecture of the machine? 1318 00:58:25,410 --> 00:58:27,000 Once you fix your architecture, you 1319 00:58:27,000 --> 00:58:29,770 have to fix the number of registers. 1320 00:58:29,770 --> 00:58:32,710 If you have a lot more instruction level parallelism, 1321 00:58:32,710 --> 00:58:35,260 you have a lot more operations in flight at the same time. 1322 00:58:35,260 --> 00:58:37,570 There's a natural pressure that starts 1323 00:58:37,570 --> 00:58:40,030 going on the number of registers you have. 1324 00:58:40,030 --> 00:58:42,790 Unless you start with a very large number of registers, 1325 00:58:42,790 --> 00:58:47,373 enough to supply data for your most ambitious superscalar, 1326 00:58:47,373 --> 00:58:48,790 then you're going to find yourself 1327 00:58:48,790 --> 00:58:51,820 starved as you go from the inexpensive model 1328 00:58:51,820 --> 00:58:53,080 to the expensive model. 1329 00:58:53,080 --> 00:58:55,430 They all have to have the same register architecture. 1330 00:58:55,430 --> 00:58:59,125 So that's a bit of a limit in scaling superscalars that 1331 00:58:59,125 --> 00:59:01,750 really is a problem if you want to stay object code compatible. 1332 00:59:01,750 --> 00:59:04,450 You can't, after all, send instructions 1333 00:59:04,450 --> 00:59:07,030 that refer to register 39 in a machine that 1334 00:59:07,030 --> 00:59:09,430 only has 32 registers. 1335 00:59:09,430 --> 00:59:10,870 And then the bigger question. 1336 00:59:10,870 --> 00:59:13,060 And here's a real unknown, and I'm 1337 00:59:13,060 --> 00:59:16,090 really curious about what the answer to this is. 1338 00:59:16,090 --> 00:59:18,730 That is, suppose you build two different superscalar 1339 00:59:18,730 --> 00:59:21,777 implementations of the same architecture. 1340 00:59:21,777 --> 00:59:23,860 Now, they're both going to run the code correctly. 1341 00:59:23,860 --> 00:59:25,580 That's not an issue. 1342 00:59:25,580 --> 00:59:26,980 But what happens if the latencies 1343 00:59:26,980 --> 00:59:28,890 are dramatically different? 1344 00:59:28,890 --> 00:59:31,900 For example, you might have one expensive one and one 1345 00:59:31,900 --> 00:59:33,290 low cost model. 
1346 00:59:33,290 --> 00:59:35,050 And if you look at those two, it's 1347 00:59:35,050 --> 00:59:37,570 very likely going to be that the ratio 1348 00:59:37,570 --> 00:59:41,890 of the long latency operations, like memory 1349 00:59:41,890 --> 00:59:44,680 operations and floating point operations, 1350 00:59:44,680 --> 00:59:48,040 compared to integer operations is going to be very much 1351 00:59:48,040 --> 00:59:51,550 greater in the low cost model than it will in the highest 1352 00:59:51,550 --> 00:59:52,330 performance model. 1353 00:59:52,330 --> 00:59:54,520 Because that's sort of where you put your speed 1354 00:59:54,520 --> 00:59:55,770 in the high performance model. 1355 00:59:55,770 --> 01:00:00,370 You engineer those systems, multi-level caches and fast, 1356 01:00:00,370 --> 01:00:02,560 fast floating point into the hardware. 1357 01:00:02,560 --> 01:00:04,270 And that's where you put your dollar. 1358 01:00:04,270 --> 01:00:06,790 So now you have these tremendously different ratios 1359 01:00:06,790 --> 01:00:09,160 of latencies of operations between the two 1360 01:00:09,160 --> 01:00:10,180 implementations. 1361 01:00:10,180 --> 01:00:14,530 Can you really distribute the same object code to both 1362 01:00:14,530 --> 01:00:17,830 and not give up tremendous performance on one of them 1363 01:00:17,830 --> 01:00:21,130 compared to what you could have if you distributed two 1364 01:00:21,130 --> 01:00:23,710 separate sets of object code? 1365 01:00:23,710 --> 01:00:26,140 Big unknown. 1366 01:00:26,140 --> 01:00:30,670 If it turns out that on, for example, the low cost model 1367 01:00:30,670 --> 01:00:33,040 your performance suffers terribly 1368 01:00:33,040 --> 01:00:37,270 because you ship the expensive model's object code, 1369 01:00:37,270 --> 01:00:39,490 third party software vendors are probably 1370 01:00:39,490 --> 01:00:41,140 going to be very reluctant to do that. 1371 01:00:41,140 --> 01:00:43,600 They're going to want their best performance on the machine 1372 01:00:43,600 --> 01:00:45,730 that you sold zillions of, not on the machine 1373 01:00:45,730 --> 01:00:46,960 that you sold a very few of. 1374 01:00:46,960 --> 01:00:49,180 But then what happens is you're losing your advantage 1375 01:00:49,180 --> 01:00:50,290 on the expensive machine. 1376 01:00:50,290 --> 01:00:52,123 If they have to ship two different versions, 1377 01:00:52,123 --> 01:00:53,710 there goes object code compatibility. 1378 01:00:53,710 --> 01:00:55,127 The whole point of this is to only 1379 01:00:55,127 --> 01:00:57,100 have one set of object code. 1380 01:00:57,100 --> 01:01:00,060 1381 01:01:00,060 --> 01:01:01,060 And that's just unknown. 1382 01:01:01,060 --> 01:01:02,848 It could well be that all you have to do 1383 01:01:02,848 --> 01:01:05,140 is get the things that are independent near each other. 1384 01:01:05,140 --> 01:01:06,825 That's good enough and will get you 1385 01:01:06,825 --> 01:01:08,200 close to the best performance you 1386 01:01:08,200 --> 01:01:10,150 can get on either kind of machine, 1387 01:01:10,150 --> 01:01:12,370 on either implementation of the superscalar. 1388 01:01:12,370 --> 01:01:17,190 But that's a really big unknown for us in this field. 1389 01:01:17,190 --> 01:01:18,720 Now, there's one other thing. 
1390 01:01:18,720 --> 01:01:21,170 And that is that you very often hear 1391 01:01:21,170 --> 01:01:24,320 people say that what they don't like about VLIW 1392 01:01:24,320 --> 01:01:26,720 and that they do like about superscalars is 1393 01:01:26,720 --> 01:01:30,350 that the compiling effort is so bad on VLIWs 1394 01:01:30,350 --> 01:01:32,840 and superscalars aren't like that. 1395 01:01:32,840 --> 01:01:37,448 And really, I have my beliefs about these subjects, 1396 01:01:37,448 --> 01:01:38,990 and maybe they come through, but I've 1397 01:01:38,990 --> 01:01:42,710 tried to really say what I think is known and what's not known. 1398 01:01:42,710 --> 01:01:45,080 This is not known either. 1399 01:01:45,080 --> 01:01:47,040 But I just feel so strongly about it, 1400 01:01:47,040 --> 01:01:49,580 I have a lot of trouble believing that the statement 1401 01:01:49,580 --> 01:01:51,410 could possibly be true. 1402 01:01:51,410 --> 01:01:53,913 To me the effort in compiling-- 1403 01:01:53,913 --> 01:01:54,830 and not the statement. 1404 01:01:54,830 --> 01:01:56,205 The statement that you don't have 1405 01:01:56,205 --> 01:02:00,162 to compile very ambitiously for superscalars. 1406 01:02:00,162 --> 01:02:02,370 What I believe is that this statement really is true. 1407 01:02:02,370 --> 01:02:04,460 And that is that the effort in compiling 1408 01:02:04,460 --> 01:02:07,490 is a function of the ILP that you want to get. 1409 01:02:07,490 --> 01:02:10,190 It's not a function of whether the machine is 1410 01:02:10,190 --> 01:02:14,260 a VLIW or a superscalar. 1411 01:02:14,260 --> 01:02:17,100 And my argument is as follows. 1412 01:02:17,100 --> 01:02:19,380 The argument that it's easier for a superscalar 1413 01:02:19,380 --> 01:02:22,470 is that the superscalar looks down the instruction stream 1414 01:02:22,470 --> 01:02:25,170 and it does the data dependence analysis 1415 01:02:25,170 --> 01:02:30,100 and it does the scheduling and it predicts the branches. 1416 01:02:30,100 --> 01:02:32,020 And you can do branch prediction dynamically 1417 01:02:32,020 --> 01:02:34,020 while the program is running and even do it more 1418 01:02:34,020 --> 01:02:36,550 accurately as a result. It does all of the stuff 1419 01:02:36,550 --> 01:02:38,620 that you've talked about doing in the compiler, 1420 01:02:38,620 --> 01:02:40,210 does it while the program's running. 1421 01:02:40,210 --> 01:02:42,830 Why on Earth would you do it in the compiler as well? 1422 01:02:42,830 --> 01:02:44,920 So you don't need an ambitious compiler. 1423 01:02:44,920 --> 01:02:46,630 Well, most of that argument is true, 1424 01:02:46,630 --> 01:02:48,940 but it turns out a crucial piece isn't. 1425 01:02:48,940 --> 01:02:52,780 Yes, the superscalar can look down a long ways 1426 01:02:52,780 --> 01:02:54,880 by predicting branches dynamically. 1427 01:02:54,880 --> 01:02:57,920 And yes, a superscalar probably can do the scheduling. 1428 01:02:57,920 --> 01:03:00,010 Scheduling is a pretty linear job 1429 01:03:00,010 --> 01:03:01,840 and it's probably practical to build 1430 01:03:01,840 --> 01:03:04,780 a fair amount of scheduling into the hardware. 1431 01:03:04,780 --> 01:03:08,550 The problem is in the data dependence analysis. 1432 01:03:08,550 --> 01:03:13,300 And if a superscalar is going to get a lot of instruction level 1433 01:03:13,300 --> 01:03:16,660 parallelism, it's going to have to look very far down, 1434 01:03:16,660 --> 01:03:19,987 hundreds of operations down the instruction stream to get that. 
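As a sketch of what that checking amounts to, the little program below compares every operation in a window against every earlier one; the register numbers and the four-operation window are invented. The number of pairwise checks alone grows as N(N-1)/2 in the window size N, and the hardware to do them all, in parallel, every cycle, is where the real expense shows up.

```c
/* Invented sketch of pairwise dependence checking over an issue window. */
#include <stdio.h>

struct op { int src1, src2, dst; };

static int depends(struct op later, struct op earlier) {
    return later.src1 == earlier.dst  ||   /* read after write  */
           later.src2 == earlier.dst  ||
           later.dst  == earlier.dst  ||   /* write after write */
           later.dst  == earlier.src1 ||   /* write after read  */
           later.dst  == earlier.src2;
}

int main(void) {
    struct op window[] = {
        { 0, 1, 2 },   /* op 0: r2 <- r0 op r1                   */
        { 2, 3, 4 },   /* op 1: reads op 0's result              */
        { 5, 6, 7 },   /* op 2: independent of everything so far */
        { 4, 7, 0 },   /* op 3: reads r4 and r7, rewrites r0     */
    };
    int n = (int)(sizeof window / sizeof window[0]);
    int checks = 0;

    for (int j = 1; j < n; j++)
        for (int i = 0; i < j; i++) {
            checks++;
            if (depends(window[j], window[i]))
                printf("op %d must wait for op %d\n", j, i);
        }
    printf("%d pairwise checks for a window of %d operations\n", checks, n);
    return 0;
}
```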
1435 01:03:19,987 --> 01:03:22,570 Well, the superscalar is going to have something that could be 1436 01:03:22,570 --> 01:03:25,040 called a window of opportunity. 1437 01:03:25,040 --> 01:03:26,890 It's going to have sort of a sliding window 1438 01:03:26,890 --> 01:03:28,660 ahead of where we're executing. 1439 01:03:28,660 --> 01:03:31,240 Where it's looking in order to pull things up, 1440 01:03:31,240 --> 01:03:34,510 there's going to be some finite number of operations 1441 01:03:34,510 --> 01:03:36,500 that it can look down. 1442 01:03:36,500 --> 01:03:39,252 And the operations that are data independent 1443 01:03:39,252 --> 01:03:41,460 are going to be peppered through the code down there. 1444 01:03:41,460 --> 01:03:43,050 They're not going to be all clumped together. 1445 01:03:43,050 --> 01:03:45,110 In fact, the opposite effect tends to happen. 1446 01:03:45,110 --> 01:03:47,935 Usually operations are data dependent upon the ones written 1447 01:03:47,935 --> 01:03:49,310 right before them, because that's 1448 01:03:49,310 --> 01:03:50,518 the way programs are written. 1449 01:03:50,518 --> 01:03:53,425 So do some stream of computation and then some more or less 1450 01:03:53,425 --> 01:03:54,800 independent stream of computation 1451 01:03:54,800 --> 01:03:56,960 later and then finally collect those results 1452 01:03:56,960 --> 01:03:58,530 and do something else. 1453 01:03:58,530 --> 01:04:01,640 So you have to look way down the stream to do that. 1454 01:04:01,640 --> 01:04:05,060 That's going to get a huge window of opportunity that's 1455 01:04:05,060 --> 01:04:07,760 not going to be practical to build superscalar. 1456 01:04:07,760 --> 01:04:12,350 Too much of this exponentially growing dependence analysis. 1457 01:04:12,350 --> 01:04:14,710 And so the solution is that you need 1458 01:04:14,710 --> 01:04:17,920 to have a compiler which does the dependence 1459 01:04:17,920 --> 01:04:21,130 analysis in the beginning, does code rearrangement. 1460 01:04:21,130 --> 01:04:22,990 Probably you want the compiler to arrange 1461 01:04:22,990 --> 01:04:24,817 for speculative execution then. 1462 01:04:24,817 --> 01:04:26,650 That's probably easier than trying to decide 1463 01:04:26,650 --> 01:04:28,538 to do that on the fly. 1464 01:04:28,538 --> 01:04:30,080 And it's hard to see how you're going 1465 01:04:30,080 --> 01:04:34,040 to avoid doing this work in the compiler, virtually 1466 01:04:34,040 --> 01:04:38,600 the same work you have to do to compile for a VLIW. 1467 01:04:38,600 --> 01:04:42,620 The two really do seem very much equivalent to me. 1468 01:04:42,620 --> 01:04:45,680 So I don't believe the argument that a superscalar 1469 01:04:45,680 --> 01:04:48,392 is going to buy you compiling speed at all. 1470 01:04:48,392 --> 01:04:49,850 Now, there's one other thing I want 1471 01:04:49,850 --> 01:04:52,340 to mention on the subject of object code compatibility, 1472 01:04:52,340 --> 01:04:55,280 and that is the VLIW disadvantage 1473 01:04:55,280 --> 01:05:00,900 of having a new implementation have changed latencies. 1474 01:05:00,900 --> 01:05:03,180 And thus having the inability to run 1475 01:05:03,180 --> 01:05:06,570 the old code on the new machine has another aspect to it. 1476 01:05:06,570 --> 01:05:11,220 And that is within a given implementation, a given VLIW, 1477 01:05:11,220 --> 01:05:13,980 certain operations that you want to do 1478 01:05:13,980 --> 01:05:17,380 might have variable latency depending upon the data. 
1479 01:05:17,380 --> 01:05:20,700 So I mentioned, for example, that the Multiflow and Cydrome 1480 01:05:20,700 --> 01:05:24,990 machines both had the property that the designers built 1481 01:05:24,990 --> 01:05:28,020 in the ability for a long, long load 1482 01:05:28,020 --> 01:05:31,290 to occur, longer than the schedule expected, 1483 01:05:31,290 --> 01:05:33,990 and then the machine would stall in order 1484 01:05:33,990 --> 01:05:36,840 to get the correct answer when that happened. 1485 01:05:36,840 --> 01:05:41,250 So in that case, we've seen another disadvantage of VLIWs. 1486 01:05:41,250 --> 01:05:44,790 And that is that a super scalar can dynamically 1487 01:05:44,790 --> 01:05:49,893 arrange the schedule on the fly to only issue the things that 1488 01:05:49,893 --> 01:05:51,810 are data ready, looking at what's data already 1489 01:05:51,810 --> 01:05:52,435 at that moment. 1490 01:05:52,435 --> 01:05:56,010 So it can adjust to longer and shorter variable latencies. 1491 01:05:56,010 --> 01:05:58,650 A VLIW, you have to decide in advance. 1492 01:05:58,650 --> 01:06:00,810 So that's another advantage of superscalars. 1493 01:06:00,810 --> 01:06:03,300 I'm not really convinced that's an important advantage. 1494 01:06:03,300 --> 01:06:05,520 And I was at the same conference, 1495 01:06:05,520 --> 01:06:07,200 the VLIW conference. 1496 01:06:07,200 --> 01:06:09,690 Butler Lampson gave the keynote speech. 1497 01:06:09,690 --> 01:06:12,150 And he mentioned something in a totally unrelated area that 1498 01:06:12,150 --> 01:06:14,100 really reminded me of this. 1499 01:06:14,100 --> 01:06:16,560 And that is in communications, you have the same problem. 1500 01:06:16,560 --> 01:06:19,260 You have asynchronous versus synchronous communications. 1501 01:06:19,260 --> 01:06:23,650 And the trade off is that asynchronous communications 1502 01:06:23,650 --> 01:06:28,430 can adjust to changed latencies and communications. 1503 01:06:28,430 --> 01:06:35,030 And thus, you don't have to do a worst case analysis in order 1504 01:06:35,030 --> 01:06:36,770 to set up the communications. 1505 01:06:36,770 --> 01:06:38,270 Synchronous communications, you have 1506 01:06:38,270 --> 01:06:40,010 to be sure the data is getting there. 1507 01:06:40,010 --> 01:06:42,200 So you have to do a worst case analysis. 1508 01:06:42,200 --> 01:06:44,510 And it's been everyone's intuition 1509 01:06:44,510 --> 01:06:49,610 that asynchronous communications would be faster because you 1510 01:06:49,610 --> 01:06:52,130 didn't have to allow for the worst that could possibly 1511 01:06:52,130 --> 01:06:52,640 happen. 1512 01:06:52,640 --> 01:06:55,040 Of course, there's the overhead of figuring out 1513 01:06:55,040 --> 01:06:56,552 whether everything was OK or not. 1514 01:06:56,552 --> 01:06:58,760 That's what's wrong with asynchronous communications. 1515 01:06:58,760 --> 01:07:00,740 That's a lot like superscalars. 1516 01:07:00,740 --> 01:07:03,650 And synchronous communications are a lot like VLIWs. 1517 01:07:03,650 --> 01:07:06,710 And Butler's comment was everybody 1518 01:07:06,710 --> 01:07:09,080 always expects asynchronous communications to be faster. 1519 01:07:09,080 --> 01:07:12,560 Synchronous communications is always faster, as it turns out. 1520 01:07:12,560 --> 01:07:15,390 And I was very much reminded of this trade off. 
1521 01:07:15,390 --> 01:07:17,390 So I really do believe that that problem 1522 01:07:17,390 --> 01:07:19,490 of changing latencies of functional units is 1523 01:07:19,490 --> 01:07:20,580 just not a big deal. 1524 01:07:20,580 --> 01:07:24,260 People often cite it as one of the real advantages 1525 01:07:24,260 --> 01:07:25,233 of superscalars. 1526 01:07:25,233 --> 01:07:26,900 There's another real advantage, and this 1527 01:07:26,900 --> 01:07:32,280 is another part of the hardware software trade off. 1528 01:07:32,280 --> 01:07:34,170 The typical classical trade off you 1529 01:07:34,170 --> 01:07:36,030 have in system design when you're 1530 01:07:36,030 --> 01:07:38,490 doing hardware versus software choices 1531 01:07:38,490 --> 01:07:41,730 is that in the software-- 1532 01:07:41,730 --> 01:07:46,080 I'm sorry, in the hardware, you have perfect knowledge. 1533 01:07:46,080 --> 01:07:48,300 That is, you're making your decisions 1534 01:07:48,300 --> 01:07:50,400 while the program is running. 1535 01:07:50,400 --> 01:07:54,640 And so you can see all the values of everything. 1536 01:07:54,640 --> 01:07:58,990 Whereas in the software, you have incomplete knowledge. 1537 01:07:58,990 --> 01:08:01,450 You're analyzing a program that hasn't yet run. 1538 01:08:01,450 --> 01:08:02,800 You don't know what the data is. 1539 01:08:02,800 --> 01:08:05,050 You don't know which way the flow of control has gone. 1540 01:08:05,050 --> 01:08:07,460 You're missing an awful lot of information. 1541 01:08:07,460 --> 01:08:11,090 The good news is you have all the time in the world 1542 01:08:11,090 --> 01:08:12,110 to do your analysis. 1543 01:08:12,110 --> 01:08:14,573 Of course, if your compiler is that slow, 1544 01:08:14,573 --> 01:08:16,740 then you're not going to have a very viable product. 1545 01:08:16,740 --> 01:08:19,399 But even within the bounds of what people expect, 1546 01:08:19,399 --> 01:08:21,649 you have a lot more than the four nanoseconds 1547 01:08:21,649 --> 01:08:24,450 the hardware is going to have in which to make its decisions. 1548 01:08:24,450 --> 01:08:27,050 So in the software, you can do a much more sophisticated 1549 01:08:27,050 --> 01:08:28,250 analysis. 1550 01:08:28,250 --> 01:08:30,590 In the hardware, you're really stuck 1551 01:08:30,590 --> 01:08:34,939 with doing something a lot more shallow but with better 1552 01:08:34,939 --> 01:08:36,800 information. 1553 01:08:36,800 --> 01:08:41,240 So that is another trade off here between these two systems. 1554 01:08:41,240 --> 01:08:42,680 But it's really more than anything 1555 01:08:42,680 --> 01:08:46,520 a trade off of sophisticated compiling versus less 1556 01:08:46,520 --> 01:08:47,810 sophisticated compiling. 1557 01:08:47,810 --> 01:08:51,319 Now, it's often cited as an advantage for superscalars 1558 01:08:51,319 --> 01:08:54,470 that they have this runtime information. 1559 01:08:54,470 --> 01:08:56,600 And so they can do a better job of scheduling 1560 01:08:56,600 --> 01:08:59,064 and a better job of dependence analysis. 1561 01:08:59,064 --> 01:09:00,439 But I just have trouble believing 1562 01:09:00,439 --> 01:09:03,020 that that adds very much, given the amount of time 1563 01:09:03,020 --> 01:09:04,490 you have to do it. 1564 01:09:04,490 --> 01:09:08,330 But that is something that does distinguish a superscalar 1565 01:09:08,330 --> 01:09:10,189 from a system that isn't a superscalar 1566 01:09:10,189 --> 01:09:12,859 and can't do that kind of dynamic analysis.
1567 01:09:12,859 --> 01:09:15,080 One important kind of analysis is something 1568 01:09:15,080 --> 01:09:17,240 called memory disambiguation. 1569 01:09:17,240 --> 01:09:19,160 And there's certainly no time in this talk 1570 01:09:19,160 --> 01:09:20,160 to go into that subject. 1571 01:09:20,160 --> 01:09:22,939 But the basic problem is if you have two operations, one 1572 01:09:22,939 --> 01:09:27,490 of which reads an array value and the other of which 1573 01:09:27,490 --> 01:09:30,340 writes some array value, is it permissible to do them 1574 01:09:30,340 --> 01:09:33,310 in the other order to get more instruction level parallelism? 1575 01:09:33,310 --> 01:09:35,727 Well, you have to know whether they're reading and writing 1576 01:09:35,727 --> 01:09:37,279 the same location or not. 1577 01:09:37,279 --> 01:09:39,399 And of course at runtime, you can tell. 1578 01:09:39,399 --> 01:09:41,859 You just look at the address that's in the hardware. 1579 01:09:41,859 --> 01:09:43,600 At compile time, sometimes that's 1580 01:09:43,600 --> 01:09:46,492 a tremendously sophisticated analysis. 1581 01:09:46,492 --> 01:09:48,700 When it comes to arrays, people have gotten very good 1582 01:09:48,700 --> 01:09:51,700 at that analysis, at solving the equation of whether these two 1583 01:09:51,700 --> 01:09:52,970 indices are the same. 1584 01:09:52,970 --> 01:09:55,000 That's called array disambiguation. 1585 01:09:55,000 --> 01:09:58,600 For pointers, where now you have two references to memory 1586 01:09:58,600 --> 01:10:01,310 where the references came through pointers, 1587 01:10:01,310 --> 01:10:02,560 that's a wide open subject. 1588 01:10:02,560 --> 01:10:05,290 People have looked at that in several different contexts 1589 01:10:05,290 --> 01:10:06,700 of system design. 1590 01:10:06,700 --> 01:10:09,420 And I think it's one of the big frontiers in instruction level 1591 01:10:09,420 --> 01:10:09,920 parallelism. 1592 01:10:09,920 --> 01:10:12,580 I think that real progress is going to be 1593 01:10:12,580 --> 01:10:14,320 made in pointer disambiguation. 1594 01:10:14,320 --> 01:10:16,720 It is an incredibly hard problem. 1595 01:10:16,720 --> 01:10:19,990 It may involve some slight change of programming paradigm, 1596 01:10:19,990 --> 01:10:21,910 in the way that we've learned good programming practices. 1597 01:10:21,910 --> 01:10:24,550 I'm not talking about adding wild new constructs 1598 01:10:24,550 --> 01:10:26,120 to languages. 1599 01:10:26,120 --> 01:10:29,200 But the way we've learned new programming practices 1600 01:10:29,200 --> 01:10:31,630 about aliasing and such things in Fortran programming, 1601 01:10:31,630 --> 01:10:33,310 for example, over the decades. 1602 01:10:33,310 --> 01:10:35,860 I think that may, over decades, happen 1603 01:10:35,860 --> 01:10:37,030 with pointer disambiguation. 1604 01:10:37,030 --> 01:10:38,560 But in my opinion, nobody's going 1605 01:10:38,560 --> 01:10:41,420 to do sophisticated pointer disambiguation any time real 1606 01:10:41,420 --> 01:10:41,920 soon. 1607 01:10:41,920 --> 01:10:45,280
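Here is a minimal C sketch of the disambiguation question (illustrative only; the fragment is mine, not from the talk):

    /* May the compiler move the load of a[i] above the store to a[j] so the
       two can overlap?  Only if it can prove i != j.  At run time the hardware
       just compares the two addresses; at compile time deciding whether the
       indices can be equal is array disambiguation, and if these were two
       unrelated pointers instead of one array, it would be the much harder
       pointer disambiguation problem.                                        */
    void update(double *a, int i, int j, double x, double *out)
    {
        a[j] = x;              /* store                                        */
        *out = 2.0 * a[i];     /* load: independent of the store only if i != j */
    }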
1608 01:10:45,280 --> 01:10:45,780 OK. 1609 01:10:45,780 --> 01:10:48,300 Now, what I certainly don't have time to do in this talk 1610 01:10:48,300 --> 01:10:51,360 is go into the slightest amount of detail 1611 01:10:51,360 --> 01:10:54,510 about code generation techniques. 1612 01:10:54,510 --> 01:10:57,260 Code generation techniques are one of the reasons 1613 01:10:57,260 --> 01:10:59,630 that instruction level parallelism has really 1614 01:10:59,630 --> 01:11:02,330 taken off in the last 10 years. 1615 01:11:02,330 --> 01:11:04,760 People are a lot more confident of their ability 1616 01:11:04,760 --> 01:11:06,830 to do the kind of code generation 1617 01:11:06,830 --> 01:11:12,010 that you have to do in order to use these machines effectively. 1618 01:11:12,010 --> 01:11:15,340 One of the techniques is called trace scheduling. 1619 01:11:15,340 --> 01:11:17,190 And it's a technique that came along 1620 01:11:17,190 --> 01:11:18,690 around 12 or 13 years ago. 1621 01:11:18,690 --> 01:11:20,160 It was something I was involved in. 1622 01:11:20,160 --> 01:11:25,040 And it's a way of taking this idea of looking down 1623 01:11:25,040 --> 01:11:29,090 the instruction stream and speculating 1624 01:11:29,090 --> 01:11:33,360 in the most likely direction and doing the code generation 1625 01:11:33,360 --> 01:11:38,130 in advance in order to build correct code that 1626 01:11:38,130 --> 01:11:41,100 gets the advantages of going past conditional jumps. 1627 01:11:41,100 --> 01:11:43,770 And the basic way it works, this is incredibly superficial, 1628 01:11:43,770 --> 01:11:46,858 but there are references that come with this tape 1629 01:11:46,858 --> 01:11:49,150 that if you're interested, you can learn more about it. 1630 01:11:49,150 --> 01:11:52,230 Basic way it works is that the compiler takes the code 1631 01:11:52,230 --> 01:11:56,220 and picks a path through the code that it thinks of as being 1632 01:11:56,220 --> 01:11:59,500 one of the more likely paths. 1633 01:11:59,500 --> 01:12:03,280 And sometimes it will unroll loops and inline subroutines 1634 01:12:03,280 --> 01:12:09,130 and unwind rejoins and so on to make these paths longer. 1635 01:12:09,130 --> 01:12:11,050 And then it just makes the assumption 1636 01:12:11,050 --> 01:12:13,720 that this is the way the code is going to go. 1637 01:12:13,720 --> 01:12:15,460 It builds a big data precedence graph 1638 01:12:15,460 --> 01:12:17,740 on which it does the dependence analysis. 1639 01:12:17,740 --> 01:12:21,890 Schedules this into these wide instruction words 1640 01:12:21,890 --> 01:12:25,600 if you're scheduling for a VLIW or schedules 1641 01:12:25,600 --> 01:12:28,190 the independent operations clumped next to each other 1642 01:12:28,190 --> 01:12:30,610 so the superscalar can find them without having to have 1643 01:12:30,610 --> 01:12:33,250 a huge window of opportunity. 1644 01:12:33,250 --> 01:12:34,010 Schedules them. 1645 01:12:34,010 --> 01:12:36,710 Some of them will be speculative. 1646 01:12:36,710 --> 01:12:39,230 And then schedules whatever code is 1647 01:12:39,230 --> 01:12:42,710 necessary to do renaming and so on at the boundaries 1648 01:12:42,710 --> 01:12:45,800 of this trace. 1649 01:12:45,800 --> 01:12:48,633 Now, when the code goes this way, you do very well. 1650 01:12:48,633 --> 01:12:50,300 You get a lot of performance because you 1651 01:12:50,300 --> 01:12:53,548 speculated ahead a lot and exposed 1652 01:12:53,548 --> 01:12:55,340 a lot of the instruction level parallelism. 1653 01:12:55,340 --> 01:12:57,320 When it doesn't go this way, you've set it up in such a way 1654 01:12:57,320 --> 01:12:58,880 that the semantics are right.
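A tiny before-and-after sketch in C of the speculation idea (illustrative only; the fragment and the assumption that the test is almost always true are mine, not the speaker's example):

    /* Before: suppose profiling says x > 0.0 is the likely direction, so the
       "then" side is the trace the compiler picks.                            */
    double before(double x, double s, const double *a, const double *b, int i)
    {
        if (x > 0.0) {
            double t = a[i] * b[i];   /* long-latency multiply on the likely path */
            s = s + t;
        } else {
            s = s - 1.0;
        }
        return s;
    }

    /* After trace scheduling (sketch): the multiply has been hoisted above the
       conditional jump and runs speculatively.  Here its result is simply
       unused on the unlikely path; in general the compiler adds compensation
       code at the trace boundaries to keep the semantics right.               */
    double after(double x, double s, const double *a, const double *b, int i)
    {
        double t = a[i] * b[i];       /* speculative: issued before the test resolves */
        if (x > 0.0)
            s = s + t;
        else
            s = s - 1.0;
        return s;
    }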
1655 01:12:58,880 --> 01:13:01,280 Now, you're not going to pick every trace and compile it. 1656 01:13:01,280 --> 01:13:03,410 That's too many, remember. 1657 01:13:03,410 --> 01:13:05,660 But what you do is having compiled one, 1658 01:13:05,660 --> 01:13:07,910 you take the code that hasn't been compiled yet, 1659 01:13:07,910 --> 01:13:10,040 and you pick the most likely path 1660 01:13:10,040 --> 01:13:11,440 from what's remaining and so on. 1661 01:13:11,440 --> 01:13:13,190 Now, there's a lot of techniques for doing 1662 01:13:13,190 --> 01:13:16,360 this kind of scalar code generation for instruction 1663 01:13:16,360 --> 01:13:17,360 level parallel machines. 1664 01:13:17,360 --> 01:13:19,318 Some of them are variants on trace scheduling. 1665 01:13:19,318 --> 01:13:22,860 Some of them depart from it quite a bit more. 1666 01:13:22,860 --> 01:13:25,130 But basically, they all follow the same idea 1667 01:13:25,130 --> 01:13:28,880 of trying to take advantage of predicting in advance the most 1668 01:13:28,880 --> 01:13:31,970 likely path and scheduling the entire path at once 1669 01:13:31,970 --> 01:13:35,490 rather than just scheduling little pieces of it. 1670 01:13:35,490 --> 01:13:39,700 Now, for loops, there's another important technique. 1671 01:13:39,700 --> 01:13:43,160 It's called software pipelining. 1672 01:13:43,160 --> 01:13:45,920 And software pipelining is-- 1673 01:13:45,920 --> 01:13:48,350 I'm going to show you how it works very, very 1674 01:13:48,350 --> 01:13:50,768 informally on a very, very simple type of example. 1675 01:13:50,768 --> 01:13:53,060 But I just want you to get the idea of what it's about. 1676 01:13:53,060 --> 01:13:55,340 It's actually quite a large technology and quite 1677 01:13:55,340 --> 01:13:57,680 a sophisticated technology, still in its infancy 1678 01:13:57,680 --> 01:14:00,050 and still with a lot of research being done. 1679 01:14:00,050 --> 01:14:04,220 But the basic idea is suppose you have a tight loop of just 1680 01:14:04,220 --> 01:14:08,350 these five operations A through E. 1681 01:14:08,350 --> 01:14:11,590 And you have hardware that operates in such a way 1682 01:14:11,590 --> 01:14:13,330 that functional unit one can do A, 1683 01:14:13,330 --> 01:14:17,100 functional unit two can do B, and so on. 1684 01:14:17,100 --> 01:14:21,390 Our goal is to come up with a schedule for this loop that, 1685 01:14:21,390 --> 01:14:24,023 instead of doing A, waiting for it to finish, 1686 01:14:24,023 --> 01:14:25,440 doing B, waiting for it to finish-- 1687 01:14:25,440 --> 01:14:29,200 these might be data dependent on each other, for example-- 1688 01:14:29,200 --> 01:14:33,030 instead of doing that, we can start later iterations next 1689 01:14:33,030 --> 01:14:38,320 to these and do pieces of all the different iterations 1690 01:14:38,320 --> 01:14:41,980 at the same time in one cycle or in a handful of cycles 1691 01:14:41,980 --> 01:14:44,440 that we repeat over and over again. 1692 01:14:44,440 --> 01:14:46,240 So the basic way you can think of it, even 1693 01:14:46,240 --> 01:14:47,950 though the technique for generating code 1694 01:14:47,950 --> 01:14:50,440 is very different from this, basic way you can think of it 1695 01:14:50,440 --> 01:14:54,300 is first cycle, start the first iteration. 1696 01:14:54,300 --> 01:14:55,920 Now, as time goes by, you're going 1697 01:14:55,920 --> 01:14:58,770 to do the following operations from that iteration 1698 01:14:58,770 --> 01:15:00,270 on the different functional units, 1699 01:15:00,270 --> 01:15:03,000 starting them each on successive cycles, 1700 01:15:03,000 --> 01:15:05,670 because they depend upon this one. 1701 01:15:05,670 --> 01:15:11,510 Second cycle, start the second iteration. 1702 01:15:11,510 --> 01:15:13,820 And then in following cycles, continue that. 1703 01:15:13,820 --> 01:15:16,140 And then keep going after that. 1704 01:15:16,140 --> 01:15:18,740 And what you find is that eventually, you sort of filled 1705 01:15:18,740 --> 01:15:21,155 your machine with some repeating pattern 1706 01:15:21,155 --> 01:15:23,900 where each functional unit is doing 1707 01:15:23,900 --> 01:15:26,870 the same thing in the simple example over and over again. 1708 01:15:26,870 --> 01:15:29,720 And then this repeating pattern can be just considered 1709 01:15:29,720 --> 01:15:31,770 a single tight loop. 1710 01:15:31,770 --> 01:15:33,770 So now you pull the repeating pattern out. 1711 01:15:33,770 --> 01:15:35,270 This is, again, how you think of it. 1712 01:15:35,270 --> 01:15:37,478 It's not necessarily how you would implement the code 1713 01:15:37,478 --> 01:15:38,210 generator. 1714 01:15:38,210 --> 01:15:40,220 You pull this repeating pattern out, 1715 01:15:40,220 --> 01:15:43,130 and it becomes your tight loop in the center. 1716 01:15:43,130 --> 01:15:46,780 And you set it up with something called 1717 01:15:46,780 --> 01:15:49,870 the prologue that starts all the iterations necessary to get 1718 01:15:49,870 --> 01:15:52,570 into the tight loop in the center. 1719 01:15:52,570 --> 01:15:55,930 And you finish it with an epilogue that 1720 01:15:55,930 --> 01:15:58,660 finishes the iterations that have been started 1721 01:15:58,660 --> 01:16:00,370 but haven't yet been completed. 1722 01:16:00,370 --> 01:16:02,500 Again, there's more in the references about this 1723 01:16:02,500 --> 01:16:05,580 if you're interested in finding out more about it.
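In source terms, a minimal C sketch of what the prologue, kernel, and epilogue look like (illustrative only; the loop, the two-cycle multiply, and the requirement that n be at least 2 are mine, not the example from the talk):

    /* Original loop:  for (i = 0; i < n; i++) c[i] = a[i] * b[i] + k;        */
    void pipelined(double *c, const double *a, const double *b, double k, int n)
    {
        /* assumes n >= 2 */
        double t0 = a[0] * b[0];          /* prologue: start iteration 0       */
        double t1 = a[1] * b[1];          /* prologue: start iteration 1       */
        int i;
        for (i = 0; i < n - 2; i++) {     /* kernel: the repeating pattern     */
            c[i] = t0 + k;                /*   finish iteration i              */
            t0 = t1;
            t1 = a[i + 2] * b[i + 2];     /*   start iteration i + 2           */
        }
        c[n - 2] = t0 + k;                /* epilogue: finish the iterations   */
        c[n - 1] = t1 + k;                /*   still in flight                 */
    }

Each kernel iteration now contains an add from one original iteration and a multiply from an iteration two ahead, which is exactly the kind of overlap the schedule is after.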
1724 01:16:05,580 --> 01:16:07,820 So finally, let me sum up here. 1725 01:16:07,820 --> 01:16:12,480 What we've seen is that in the last few years, 1726 01:16:12,480 --> 01:16:17,270 microprocessor designs from the major manufacturers 1727 01:16:17,270 --> 01:16:19,730 for their fastest workstations have all 1728 01:16:19,730 --> 01:16:22,520 embodied a fair amount of instruction level parallelism. 1729 01:16:22,520 --> 01:16:27,680 The HP PA-RISC 7100, the IBM RS/6000, the announced DEC 1730 01:16:27,680 --> 01:16:30,987 Alpha are all superscalar machines 1731 01:16:30,987 --> 01:16:32,570 with varying degrees but a fair amount 1732 01:16:32,570 --> 01:16:35,570 in each case of instruction level parallelism. 1733 01:16:35,570 --> 01:16:38,090 These are incredibly fast machines. 1734 01:16:38,090 --> 01:16:39,650 They're fast cycle time machines, 1735 01:16:39,650 --> 01:16:43,010 but they get a lot more advantage 1736 01:16:43,010 --> 01:16:45,320 out of the instruction level parallelism 1737 01:16:45,320 --> 01:16:46,440 than you might imagine. 1738 01:16:46,440 --> 01:16:49,430 So these are machines that are very much the state of the art. 1739 01:16:49,430 --> 01:16:51,620 And these days, if you're building high performance 1740 01:16:51,620 --> 01:16:54,860 processors, if you're not doing instruction level parallelism, 1741 01:16:54,860 --> 01:16:58,060 you're just not there. 1742 01:16:58,060 --> 01:17:01,850 In the '80s, we had the phenomenon 1743 01:17:01,850 --> 01:17:06,320 of startups that were building mini-supercomputers that 1744 01:17:06,320 --> 01:17:08,090 embodied instruction level parallelism.
1745 01:17:08,090 --> 01:17:11,210 There were some superscalar ones and there 1746 01:17:11,210 --> 01:17:12,740 were a couple of VLIW ones. 1747 01:17:12,740 --> 01:17:15,680 The Multiflow Trace machine I was associated with, 1748 01:17:15,680 --> 01:17:21,050 the Cydrome that Bob Rau, also at HP, was associated with, 1749 01:17:21,050 --> 01:17:22,730 and Culler, for example. 1750 01:17:22,730 --> 01:17:24,440 For one reason or another, virtually all 1751 01:17:24,440 --> 01:17:26,420 of these mini supercomputer startups 1752 01:17:26,420 --> 01:17:28,790 failed, whether they had very ambitious architectures 1753 01:17:28,790 --> 01:17:30,920 or very humdrum architectures. 1754 01:17:30,920 --> 01:17:33,800 Mostly these failures were not strongly related, in my opinion, 1755 01:17:33,800 --> 01:17:35,480 to the technology that was there. 1756 01:17:35,480 --> 01:17:38,000 Certainly the Multiflow machine had tremendous price 1757 01:17:38,000 --> 01:17:41,960 performance and ease of use in its time 1758 01:17:41,960 --> 01:17:44,090 and had a lot of commercial acceptance. 1759 01:17:44,090 --> 01:17:46,730 I think a total of 120 pretty big machines 1760 01:17:46,730 --> 01:17:48,470 sold over the years. 1761 01:17:48,470 --> 01:17:50,350 But the company finally couldn't make it. 1762 01:17:50,350 --> 01:17:53,180 But what did happen is that a lot of the technology 1763 01:17:53,180 --> 01:17:55,250 got purchased by a lot of large companies. 1764 01:17:55,250 --> 01:18:00,020 And I believe, for example, that the Fujitsu supercomputer that 1765 01:18:00,020 --> 01:18:02,210 was just announced a couple of weeks ago 1766 01:18:02,210 --> 01:18:05,480 has as each of its nodes a long instruction word machine. 1767 01:18:05,480 --> 01:18:08,120 And I believe they're using Multiflow technology, compiling 1768 01:18:08,120 --> 01:18:10,260 technology, and so on in that product. 1769 01:18:10,260 --> 01:18:13,820 I know several large companies have released the Multiflow 1770 01:18:13,820 --> 01:18:16,460 compiler as a compiler for their products. 1771 01:18:16,460 --> 01:18:18,230 And I know that it's being used 1772 01:18:18,230 --> 01:18:20,180 in development efforts 1773 01:18:20,180 --> 01:18:23,900 in a lot of large companies and in some of these products. 1774 01:18:23,900 --> 01:18:27,410 And other startup technologies have found their way in. 1775 01:18:27,410 --> 01:18:31,460 But in fairness, no VLIW has been commercially successful. 1776 01:18:31,460 --> 01:18:35,480 There were the array processors of the '70s that 1777 01:18:35,480 --> 01:18:37,100 were really difficult beasts to use 1778 01:18:37,100 --> 01:18:39,110 but had tremendous price performance 1779 01:18:39,110 --> 01:18:41,923 but never really made it as general purpose computers. 1780 01:18:41,923 --> 01:18:43,340 And what's going to be interesting 1781 01:18:43,340 --> 01:18:46,280 is that now that we're pushing the boundary of instruction 1782 01:18:46,280 --> 01:18:49,130 level parallelism and what's available, 1783 01:18:49,130 --> 01:18:51,085 it really will be an exciting time 1784 01:18:51,085 --> 01:18:52,460 to see which way this thing goes, 1785 01:18:52,460 --> 01:18:55,250 whether we're going to be able to make superscalars practical 1786 01:18:55,250 --> 01:18:56,960 and whether that's the right answer 1787 01:18:56,960 --> 01:18:59,120 or whether VLIWs are going to have to come in 1788 01:18:59,120 --> 01:19:01,610 to have the simpler hardware.
1789 01:19:01,610 --> 01:19:04,160 Very big questions do remain open. 1790 01:19:04,160 --> 01:19:07,070 The biggest question is how much instruction level parallelism 1791 01:19:07,070 --> 01:19:08,720 is there. 1792 01:19:08,720 --> 01:19:10,500 And I'd like to just close by giving you 1793 01:19:10,500 --> 01:19:12,890 my quick summary of how I read the limit 1794 01:19:12,890 --> 01:19:15,620 studies and the practical experience people have had 1795 01:19:15,620 --> 01:19:16,430 so far. 1796 01:19:16,430 --> 01:19:20,772 For scientific codes, technical codes, there's no question. 1797 01:19:20,772 --> 01:19:22,730 There's a lot of instruction level parallelism. 1798 01:19:22,730 --> 01:19:24,050 Most of these vectorize. 1799 01:19:24,050 --> 01:19:26,600 There's a lot of other ways to get this advantage, 1800 01:19:26,600 --> 01:19:30,050 but VLIWs certainly are a very effective way of getting it, 1801 01:19:30,050 --> 01:19:33,050 and superscalars run very well on these codes also. 1802 01:19:33,050 --> 01:19:35,932 Those numbers are huge, and it doesn't even 1803 01:19:35,932 --> 01:19:36,890 make sense to name a single number. 1804 01:19:36,890 --> 01:19:37,940 It's a function of how much hardware 1805 01:19:37,940 --> 01:19:39,270 you throw at the problem. 1806 01:19:39,270 --> 01:19:41,750 So that's easy, the sort of vectorizable array oriented 1807 01:19:41,750 --> 01:19:42,470 codes. 1808 01:19:42,470 --> 01:19:45,230 For scientific codes that aren't quite so easy 1809 01:19:45,230 --> 01:19:47,030 but are still sort of the pipelined floating 1810 01:19:47,030 --> 01:19:50,810 point, memory intensive array codes that you see around, 1811 01:19:50,810 --> 01:19:52,880 probably a factor of five is not in question. 1812 01:19:52,880 --> 01:19:54,920 We can probably easily do that in time. 1813 01:19:54,920 --> 01:19:57,050 And I suspect it'll go a lot farther than that, 1814 01:19:57,050 --> 01:20:00,610 but you can't really make a strong case for it yet. 1815 01:20:00,610 --> 01:20:07,420 For the scientific and technical integer codes that are around, 1816 01:20:07,420 --> 01:20:10,330 it's probably around there also, but it's not so clear 1817 01:20:10,330 --> 01:20:12,317 that you can go much further. 1818 01:20:12,317 --> 01:20:14,650 But that would be my take on what the studies have said. 1819 01:20:14,650 --> 01:20:17,192 A lot of people certainly are far more pessimistic than that. 1820 01:20:17,192 --> 01:20:19,300 Some people are more optimistic. 1821 01:20:19,300 --> 01:20:20,680 Certainly time will tell. 1822 01:20:20,680 --> 01:20:23,560 And finally, for the systems codes, maybe 1823 01:20:23,560 --> 01:20:25,773 it's a factor of two or three that you 1824 01:20:25,773 --> 01:20:27,940 can get from the studies that have been done to date. 1825 01:20:27,940 --> 01:20:31,150 And I think there, things like pointer disambiguation and code 1826 01:20:31,150 --> 01:20:34,570 generation techniques are going to make a very big difference 1827 01:20:34,570 --> 01:20:37,000 in what we're able to do. 1828 01:20:37,000 --> 01:20:38,860 And it'll be very interesting to see 1829 01:20:38,860 --> 01:20:41,073 what the future holds for that. 1830 01:20:41,073 --> 01:20:43,240 Finally, for the commercial codes, I don't think anybody 1831 01:20:43,240 --> 01:20:44,740 knows anything.
1832 01:20:44,740 --> 01:20:47,200 I've heard some anecdotal evidence 1833 01:20:47,200 --> 01:20:49,000 that for a lot of the more business data 1834 01:20:49,000 --> 01:20:51,100 processing oriented codes, there is 1835 01:20:51,100 --> 01:20:53,590 a lot of instruction level parallelism that's 1836 01:20:53,590 --> 01:20:55,218 available to us. 1837 01:20:55,218 --> 01:20:57,010 But I don't know of any studies that people 1838 01:20:57,010 --> 01:20:59,200 have done and published that have told us just 1839 01:20:59,200 --> 01:21:01,490 what those codes look like. 1840 01:21:01,490 --> 01:21:04,420 So if you are interested in knowing more about the subject, 1841 01:21:04,420 --> 01:21:07,810 Bob Rau and I have a paper that was in Science magazine 1842 01:21:07,810 --> 01:21:10,090 last year that's a survey. 1843 01:21:10,090 --> 01:21:12,160 It's not quite on the level of this talk, but in some 1844 01:21:12,160 --> 01:21:14,620 places it goes into a lot more detail 1845 01:21:14,620 --> 01:21:16,310 and has a very good reference list. 1846 01:21:16,310 --> 01:21:19,885 In fact, it's conversations with Bob over the last few years 1847 01:21:19,885 --> 01:21:22,510 that have really given me a lot of the perspective you've heard 1848 01:21:22,510 --> 01:21:24,103 in the course of this talk. 1849 01:21:24,103 --> 01:21:25,520 So thank you very much for coming. 1850 01:21:25,520 --> 01:21:26,728 I've enjoyed this talk a lot. 1851 01:21:26,728 --> 01:21:30,420 1852 01:21:30,420 --> 01:21:33,550 So let's open it up for some questions. 1853 01:21:33,550 --> 01:21:34,660 Yes? 1854 01:21:34,660 --> 01:21:37,030 Josh, could you tell us more about software pipelining 1855 01:21:37,030 --> 01:21:38,290 in commercial systems? 1856 01:21:38,290 --> 01:21:39,290 Sure. 1857 01:21:39,290 --> 01:21:39,790 Thanks. 1858 01:21:39,790 --> 01:21:42,190 Well, software pipelining is a really interesting 1859 01:21:42,190 --> 01:21:42,760 technology. 1860 01:21:42,760 --> 01:21:45,670 It's a lot more mathematical in nature 1861 01:21:45,670 --> 01:21:48,520 than some of the other code generation techniques 1862 01:21:48,520 --> 01:21:50,290 and so has gotten a lot of interest 1863 01:21:50,290 --> 01:21:51,880 from a lot of different groups who 1864 01:21:51,880 --> 01:21:54,290 tried to formulate different ways of doing it. 1865 01:21:54,290 --> 01:21:57,310 It was used in the Floating Point Systems machines early on. 1866 01:21:57,310 --> 01:22:01,060 Floating Point Systems built these array processors that eventually 1867 01:22:01,060 --> 01:22:04,120 had the ROM be substituted-- have RAM substituted 1868 01:22:04,120 --> 01:22:05,650 for the ROM that was in there. 1869 01:22:05,650 --> 01:22:07,442 And there were a lot of the algorithms that 1870 01:22:07,442 --> 01:22:09,700 went very well on that, went very fast on that system, 1871 01:22:09,700 --> 01:22:12,490 because they used software pipelining. 1872 01:22:12,490 --> 01:22:17,200 Cydrome got started because of Bob Rau's software pipelining 1873 01:22:17,200 --> 01:22:19,815 experience and algorithms and particularly because he 1874 01:22:19,815 --> 01:22:21,190 had a lot of good ideas about how 1875 01:22:21,190 --> 01:22:24,310 to build hardware that made it easier to software pipeline. 1876 01:22:24,310 --> 01:22:27,670 We've actually done a lot of that at HP. 1877 01:22:27,670 --> 01:22:31,180 First the Apollo systems-- the people on the DN10000 1878 01:22:31,180 --> 01:22:33,470 had written a software pipeliner for that machine.
1879 01:22:33,470 --> 01:22:36,010 But now, in fact, the commercial compiler 1880 01:22:36,010 --> 01:22:39,250 you get with the PA-RISC products you buy today 1881 01:22:39,250 --> 01:22:41,090 does software pipelining right in it. 1882 01:22:41,090 --> 01:22:43,450 And there's a lot of research going on at HP Labs, 1883 01:22:43,450 --> 01:22:45,613 Bob Rau's group, and a lot of other places 1884 01:22:45,613 --> 01:22:47,530 where people are looking at software pipelining. 1885 01:22:47,530 --> 01:22:49,360 So it's really become an important topic. 1886 01:22:49,360 --> 01:22:53,440 It even makes ordinary sequential machines 1887 01:22:53,440 --> 01:22:56,080 that have a little bit of sort of subtle instruction level 1888 01:22:56,080 --> 01:22:58,940 parallelism in them go faster when you do that. 1889 01:22:58,940 --> 01:23:02,710 So it's a pretty valuable technique. 1890 01:23:02,710 --> 01:23:03,210 Yes? 1891 01:23:03,210 --> 01:23:06,450 1892 01:23:06,450 --> 01:23:09,930 Could you tell us how high level transformations 1893 01:23:09,930 --> 01:23:15,130 like loop unrolling or inlining affect ILP [INAUDIBLE]? 1894 01:23:15,130 --> 01:23:15,630 Absolutely. 1895 01:23:15,630 --> 01:23:17,490 They turn out to be critical effects. 1896 01:23:17,490 --> 01:23:21,240 If you have a subroutine call in the middle of a tight loop, 1897 01:23:21,240 --> 01:23:23,970 you're not going to get anywhere if that subroutine call is 1898 01:23:23,970 --> 01:23:26,610 a boundary beyond which your ILP won't go. 1899 01:23:26,610 --> 01:23:28,680 So the typical thing people will do 1900 01:23:28,680 --> 01:23:32,650 is insert a text copy of the subroutine into that. 1901 01:23:32,650 --> 01:23:35,370 Now of course, that is a source control problem. 1902 01:23:35,370 --> 01:23:38,370 You have to make sure that when you modify that subroutine, 1903 01:23:38,370 --> 01:23:41,310 you catch every place that it was inlined later 1904 01:23:41,310 --> 01:23:42,900 in your [INAUDIBLE] control system. 1905 01:23:42,900 --> 01:23:45,020 So that's one technique that's used all the time. 1906 01:23:45,020 --> 01:23:46,860 Similarly, loop unrolling. 1907 01:23:46,860 --> 01:23:49,080 Unless you're software pipelining, in which case 1908 01:23:49,080 --> 01:23:52,110 you want the tightest copy of the loop you can get, 1909 01:23:52,110 --> 01:23:55,680 you do loop unrolling, which used to be used for compiler optimization 1910 01:23:55,680 --> 01:23:58,652 to eliminate the test between iterations, 1911 01:23:58,652 --> 01:24:00,360 if you could predict that you'd go around 1912 01:24:00,360 --> 01:24:01,530 a certain number of times. 1913 01:24:01,530 --> 01:24:04,110 For ILP, you're happy enough to leave the test in. 1914 01:24:04,110 --> 01:24:06,180 That tends to be a very predictable branch 1915 01:24:06,180 --> 01:24:07,433 at the end of the loop. 1916 01:24:07,433 --> 01:24:08,850 But what you want is, rather 1917 01:24:08,850 --> 01:24:12,760 than a small body of text, a very large body of text. 1918 01:24:12,760 --> 01:24:16,080 And in fact, when people complain about the compile time 1919 01:24:16,080 --> 01:24:19,440 for these what Bill Wulf calls heroic compilers that 1920 01:24:19,440 --> 01:24:22,020 do this sort of thing, the compile time more than anything 1921 01:24:22,020 --> 01:24:23,940 comes about because of the expansion 1922 01:24:23,940 --> 01:24:27,570 in the code size due to loop unrolling 1923 01:24:27,570 --> 01:24:29,700 and subroutine inlining. 1924 01:24:29,700 --> 01:24:33,170 That's really been the biggest factor.
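A minimal C sketch of ILP-oriented unrolling (illustrative only; the loop, the unroll factor of four, and the separate accumulators are my choices, not from the talk):

    /* The loop-back test stays in (it is a very predictable branch); the point
       is the larger body.  The four multiply-adds use separate accumulators,
       so they are data independent of each other and can be overlapped.  The
       four-fold copy of the body is also where the code-size growth that slows
       down "heroic" compilers comes from.                                     */
    double dot(const double *a, const double *b, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)                /* cleanup for leftover iterations   */
            s0 += a[i] * b[i];
        return s0 + s1 + s2 + s3;
    }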
1925 01:24:33,170 --> 01:24:34,160 Yes? 1926 01:24:34,160 --> 01:24:37,470 How do dataflow machines merge into your characterization? 1927 01:24:37,470 --> 01:24:42,440 Is this [INAUDIBLE] superscalar or are there some differences? 1928 01:24:42,440 --> 01:24:44,510 Yes, that's a great question. 1929 01:24:44,510 --> 01:24:46,700 Dataflow machines I really left out, 1930 01:24:46,700 --> 01:24:49,130 even though they are very much instruction 1931 01:24:49,130 --> 01:24:50,360 level parallel machines. 1932 01:24:50,360 --> 01:24:51,950 Right now the emphasis on dataflow 1933 01:24:51,950 --> 01:24:55,550 has moved more off to the multiple thread 1934 01:24:55,550 --> 01:24:57,990 world and MIMD processors. 1935 01:24:57,990 --> 01:25:01,130 So it hasn't been such a practical thing 1936 01:25:01,130 --> 01:25:03,080 as superscalars and VLIWs. 1937 01:25:03,080 --> 01:25:04,580 People haven't really built one that 1938 01:25:04,580 --> 01:25:07,040 would be in any way commercially competitive 1939 01:25:07,040 --> 01:25:09,015 to do just instruction level parallelism. 1940 01:25:09,015 --> 01:25:10,640 But in a sense, they're the purest kind 1941 01:25:10,640 --> 01:25:14,000 of instruction level parallelism when running a single thread. 1942 01:25:14,000 --> 01:25:19,520 What you have is the code itself specifying exactly 1943 01:25:19,520 --> 01:25:22,730 the dependence relationship as part of the way 1944 01:25:22,730 --> 01:25:24,020 the code is written. 1945 01:25:24,020 --> 01:25:27,410 And the hardware simply is notified. 1946 01:25:27,410 --> 01:25:29,690 It has operations which are notified that it's 1947 01:25:29,690 --> 01:25:30,950 time for them to fire. 1948 01:25:30,950 --> 01:25:33,300 And it's very much an instruction level parallelism 1949 01:25:33,300 --> 01:25:33,800 idea. 1950 01:25:33,800 --> 01:25:36,770 And it's a little bit in between superscalars and VLIWs 1951 01:25:36,770 --> 01:25:39,740 in a way, because VLIWs-- and again, you 1952 01:25:39,740 --> 01:25:42,230 can read more about this in the paper, 1953 01:25:42,230 --> 01:25:44,600 where we do discuss dataflow machines in this context-- 1954 01:25:44,600 --> 01:25:48,020 VLIWs require the independence between operations 1955 01:25:48,020 --> 01:25:49,700 to be specified. 1956 01:25:49,700 --> 01:25:52,580 And superscalars don't specify anything in advance. 1957 01:25:52,580 --> 01:25:53,840 The hardware figures it out. 1958 01:25:53,840 --> 01:25:56,690 Dataflow machines have the dependence relationship 1959 01:25:56,690 --> 01:25:59,450 and still have to figure out what's independent 1960 01:25:59,450 --> 01:26:00,890 and what can be scheduled. 1961 01:26:00,890 --> 01:26:02,420 So they really are in some sense one 1962 01:26:02,420 --> 01:26:03,770 of these intermediate solutions. 1963 01:26:03,770 --> 01:26:05,460 They're a wonderfully alluring idea. 1964 01:26:05,460 --> 01:26:07,670 Somebody said they've been the greatest 1965 01:26:07,670 --> 01:26:09,650 idea and the next coming thing in architecture 1966 01:26:09,650 --> 01:26:11,390 for 25 years now. 1967 01:26:11,390 --> 01:26:13,633 So they just haven't made it for this purpose. 1968 01:26:13,633 --> 01:26:15,050 But they seem a lot more promising 1969 01:26:15,050 --> 01:26:19,120 for multithread applications.
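A very loose C sketch of the firing idea described above (illustrative only, not any real machine; all the names are invented):

    /* Each operation node records how many inputs it is still waiting for.
       Delivering the last input makes the node "data ready", and it can then
       fire on any free functional unit.  The dependence relationship is thus
       carried by the code itself rather than rediscovered by the hardware.    */
    struct df_node {
        int    waiting;              /* inputs not yet delivered               */
        double in[2];                /* operand slots                          */
        int    op;                   /* which operation to perform on firing   */
    };

    void deliver(struct df_node *n, int slot, double value,
                 void (*fire)(struct df_node *))
    {
        n->in[slot] = value;
        if (--n->waiting == 0)       /* last operand has arrived               */
            fire(n);                 /* node is data ready: schedule it        */
    }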
1970 01:26:19,120 --> 01:26:20,590 Yes? 1971 01:26:20,590 --> 01:26:24,130 You talked about the predictability of branches. 1972 01:26:24,130 --> 01:26:25,810 And one of the things with compilers 1973 01:26:25,810 --> 01:26:28,900 is that compilers don't know which way branches will go. 1974 01:26:28,900 --> 01:26:31,030 Are you suggesting the use of feedback 1975 01:26:31,030 --> 01:26:34,960 as a compiler mechanism to get optimal use 1976 01:26:34,960 --> 01:26:36,400 of the architecture? 1977 01:26:36,400 --> 01:26:37,090 Yeah. 1978 01:26:37,090 --> 01:26:38,360 That's also a great question. 1979 01:26:38,360 --> 01:26:39,735 I wish there had been time for me 1980 01:26:39,735 --> 01:26:42,130 to talk a lot more about profile based optimization 1981 01:26:42,130 --> 01:26:44,860 in general, which is an interesting subject area even 1982 01:26:44,860 --> 01:26:45,730 independent of this. 1983 01:26:45,730 --> 01:26:49,030 There are a lot of other profile based optimizations you can do. 1984 01:26:49,030 --> 01:26:52,360 What I mean by profile based is just what it sounds like. 1985 01:26:52,360 --> 01:26:56,020 The compiler has access to some quote "representative data" 1986 01:26:56,020 --> 01:26:59,290 and then runs it to get an estimate of which way the jumps 1987 01:26:59,290 --> 01:27:02,710 go and then does optimizations, or in this case 1988 01:27:02,710 --> 01:27:07,120 does its scheduling, based on what those estimates 1989 01:27:07,120 --> 01:27:08,405 from previous runs tell you. 1990 01:27:08,405 --> 01:27:09,530 There are sort of three ways. 1991 01:27:09,530 --> 01:27:11,488 You can either ask the programmer, which is not 1992 01:27:11,488 --> 01:27:14,140 a very acceptable way to tell which way jumps go, 1993 01:27:14,140 --> 01:27:15,370 or you can use heuristics. 1994 01:27:15,370 --> 01:27:17,410 And the jury's out on heuristics. 1995 01:27:17,410 --> 01:27:19,240 Experiments so far have failed. 1996 01:27:19,240 --> 01:27:22,390 But I've been hearing a lot of groundswell 1997 01:27:22,390 --> 01:27:24,400 that really sophisticated heuristics maybe 1998 01:27:24,400 --> 01:27:26,185 do almost as well as profiling. 1999 01:27:26,185 --> 01:27:27,310 And then there's profiling. 2000 01:27:27,310 --> 01:27:31,060 Profiling has the disadvantage that it involves the user. 2001 01:27:31,060 --> 01:27:35,290 And we've yet to see really well engineered systems that 2002 01:27:35,290 --> 01:27:37,810 make this very transparent to the user 2003 01:27:37,810 --> 01:27:41,110 but still allow the compiler to gain access to profiling 2004 01:27:41,110 --> 01:27:42,400 information from these codes. 2005 01:27:42,400 --> 01:27:43,900 It has to involve the users some. 2006 01:27:43,900 --> 01:27:46,210 And I'm, again, interested to see how that works out. 2007 01:27:46,210 --> 01:27:48,200 It certainly is the best way to get data. 2008 01:27:48,200 --> 01:27:50,470 And it turns out you can come very close 2009 01:27:50,470 --> 01:27:52,780 to the best possible. 2010 01:27:52,780 --> 01:27:55,720 [INAUDIBLE] and I-- the reason I had those branch statistics 2011 01:27:55,720 --> 01:27:58,990 is that we just did a study that was in this last ASPLOS. 2012 01:27:58,990 --> 01:28:01,990 And we showed that you really can use very unrelated data 2013 01:28:01,990 --> 01:28:04,570 and do very well most of the time in predicting branches. 2014 01:28:04,570 --> 01:28:07,522
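A minimal C sketch of what profile feedback amounts to (illustrative only; the counter table, the branch numbering, and the comparison are invented for the example and are not HP's mechanism):

    /* The instrumented build bumps a counter for each direction of each
       conditional; a later compile reads the table from the "representative"
       run and treats the frequent direction as the likely path when it picks
       traces and schedules speculation.                                       */
    enum { N_BRANCHES = 128 };
    static long n_taken[N_BRANCHES], n_fallthru[N_BRANCHES];

    /* What the instrumented code does at, say, branch number 7:               */
    double step(double x, double s, const double *a, const double *b, int i)
    {
        if (x > 0.0) { n_taken[7]++;    s += a[i] * b[i]; }
        else         { n_fallthru[7]++; s -= 1.0; }
        return s;
    }

    /* What the recompile asks of the recorded counts:                         */
    int predict_taken(int branch)
    {
        return n_taken[branch] > n_fallthru[branch];
    }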
2015 01:28:07,522 --> 01:28:08,420 Yes? 2016 01:28:08,420 --> 01:28:10,460 To get more ILP, you have to look deeper 2017 01:28:10,460 --> 01:28:11,780 in the instruction stream. 2018 01:28:11,780 --> 01:28:14,435 And I was wondering if there's any idea how much of a trade 2019 01:28:14,435 --> 01:28:17,060 off there is between the gains from the ILP 2020 01:28:17,060 --> 01:28:19,910 and the loss of perhaps data locality. 2021 01:28:19,910 --> 01:28:21,630 Yes, that's a very good question. 2022 01:28:21,630 --> 01:28:24,200 And I think one can generalize it to just 2023 01:28:24,200 --> 01:28:28,880 say, how does ILP play with data locality in general? 2024 01:28:28,880 --> 01:28:32,370 Is it really a good thing or a bad thing? 2025 01:28:32,370 --> 01:28:37,680 And my take on it is that for the kinds of systems 2026 01:28:37,680 --> 01:28:40,067 we're talking about today and the sizes of caches we 2027 01:28:40,067 --> 01:28:42,150 have today, and the ones we'll have in the future, 2028 01:28:42,150 --> 01:28:43,983 which will be even better in this respect, 2029 01:28:43,983 --> 01:28:46,080 that that's just not a big issue. 2030 01:28:46,080 --> 01:28:49,270 That's just from looking at code and getting a feeling for it. 2031 01:28:49,270 --> 01:28:50,910 But time will really tell. 2032 01:28:50,910 --> 01:28:55,050 The most ambitious ILP machine there's ever been, I believe, 2033 01:28:55,050 --> 01:28:57,090 was the Multiflow Trace, which could fire off 2034 01:28:57,090 --> 01:29:00,180 28 operations at a time and keep maybe 80 of them 2035 01:29:00,180 --> 01:29:01,920 going in flight at once. 2036 01:29:01,920 --> 01:29:04,170 And it didn't have a data cache. 2037 01:29:04,170 --> 01:29:07,140 So locality was just not an issue particularly, 2038 01:29:07,140 --> 01:29:09,830 though some forms of locality were an issue for that machine. 2039 01:29:09,830 --> 01:29:12,040 So there really isn't that kind of experience. 2040 01:29:12,040 --> 01:29:13,710 But my guess is that that's not going 2041 01:29:13,710 --> 01:29:16,540 to be a big problem at all. 2042 01:29:16,540 --> 01:29:17,170 But who knows? 2043 01:29:17,170 --> 01:29:18,128 I'm always an optimist. 2044 01:29:18,128 --> 01:29:20,680 2045 01:29:20,680 --> 01:29:23,460 So, well, I'd really like to thank you for coming today. 2046 01:29:23,460 --> 01:29:24,960 I've enjoyed giving this talk a lot. 2047 01:29:24,960 --> 01:29:41,160 2048 01:29:41,160 --> 01:29:44,210 [MUSIC PLAYING] 2049 01:29:44,210 --> 01:30:23,000