Hello, I'm John Crawford, here today to give you an overview of Intel's Pentium microprocessor. The Pentium processor is the fifth generation of Intel's line of PC-compatible microprocessors. I co-managed the chip design of the Pentium processor, starting back in 1989.

I will start out with a historical perspective of Intel's microprocessors from the 386 up through the Pentium processor, and back that up with the underlying historical trends in the semiconductor technology from 1970 through the present. Then I will spend a few minutes describing a few key aspects of the design methodology we used. The bulk of this video is devoted to describing the microarchitecture of the Pentium processor and how the different hardware techniques we chose to include contribute to its performance. After working through this large amount of material, I'll briefly describe the data integrity, or error checking, features we included.
In the last section, we'll describe the compiler technology co-developed with the Pentium processor, and the results delivered on industry-standard SPEC benchmarks that we achieved with the combination of the hardware techniques and compiler optimizations.

Let me start with a brief overview of Intel's microprocessor strategy. We've kept three main goals in mind. First of all, to maintain compatibility -- that is, each generation must be compatible with the last in order to carry forward a very large software base and the momentum that comes with that in the market. A second key aspect of the strategy is to maintain a very aggressive performance ramp of doubling the performance every 18 months. The third aspect of the strategy is to continuously add new functions to enter new markets. We've done this in two directions. First of all, we've added functions to move upward from the PC market into the workstation, minicomputer, and mainframe marketplaces, also known as the server market.
In the other direction, we've added features to move downward into the laptop, notebook, and handheld, generally known as the mobile, marketplace.

So given that this is the overall strategy, let's delve into it in more detail. First of all, let's look at the performance growth from the 386 up through the Pentium processor. Spanning a period from 1986 up until the 1993 timeframe, I've highlighted three different generations of processors. You can see four points on the 386 line where we offered different frequencies of that product. You can again see four points on the 486 product line, and the first point on the Pentium processor line. Now, each of these is plotted in terms of the integer SPEC performance on the y-axis against general system availability date on the x-axis. It's important to note that this is the date when systems are available from our customers -- that is, after they've announced the products and started shipping to their customers through their channels. So you can see here a very nice performance growth rate.
And I've had a spreadsheet program plot the regression line on a semi-log chart, where it computes a compound annual growth rate somewhere close to 1.6 every year, which corresponds to a doubling of performance about every 18 months.

That's a very interesting growth rate. I thought I'd try to relate that growth rate to some real-world examples. We get a little blase in the computer industry about the performance going up and up every year, whereas if the same growth rate occurred in other areas it would be truly astounding. For example, given the same growth rate of doubling every 18 months over a decade, you come up with approximately a 200-fold increase. If the same improvement were applied to automobiles, we'd be traveling around in our cars at about 11,000 miles an hour, or potentially getting about 4,000 miles to the gallon. A point that's maybe a little easier to visualize or comprehend: a flight from Los Angeles to New York would take 90 seconds rather than the five hours or so that it takes currently.
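The growth-rate arithmetic above is easy to check. A quick sketch follows; note that the baselines (55 mph, 20 mpg, a five-hour flight, 35 bushels per acre) are assumptions back-solved from the quoted results, not figures stated in the talk.

```python
# Annual growth factor implied by doubling performance every 18 months.
cagr = 2 ** (12 / 18)
print(round(cagr, 2))        # 1.59 -- the "close to 1.6" figure

# Real-world analogies using the transcript's roughly 200-fold decade factor.
factor = 200
print(55 * factor)           # 11,000 mph
print(20 * factor)           # 4,000 miles per gallon
print(5 * 3600 // factor)    # a 90-second Los Angeles-to-New York flight
print(35 * factor)           # 7,000 bushels per acre
```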
A different way of looking at it might be in the agricultural field, where wheat yields would have gone from about 35 bushels per acre up to about 7,000 bushels an acre. So this is really quite a remarkable growth rate.

Now, where did this performance come from? I believe that we've really developed the performance along three dimensions. Let me go into each of these in a little more detail. Along one dimension, we have silicon technology. And here with the Pentium processor, we've applied 0.8-micron BiCMOS technology as our latest and greatest process. A second dimension is architecture technology, in terms of structuring the computers to extract more parallelism by executing more and more instructions in parallel, using techniques such as superscalar execution, branch prediction, and some more straightforward aspects such as just larger caches on board and wider buses. A third dimension is software technology. And here, improved compilers are key to providing improved performance by maximizing the parallelism that's made available to the hardware.
Let's take a closer look at the semiconductor technology dimension. It really is a key aspect that's driving the performance forward. And here, we have technology scaling as the main driver, and there are really two aspects of this. One is that the die size grows: as we improve our manufacturing processes and are able to process larger wafers, we can economically produce larger and larger die sizes. The second aspect is that circuit dimensions shrink, so that each generation we have smaller devices, and we're able to print thinner wires and more compact transistors. Fortunately, the smaller transistors run faster, and the extra devices also provide the raw material for integrating performance structures such as caches, extra execution pipelines, and in general more parallel performance structures. A third benefit is that new capabilities can also be added using the extra devices that the technology makes available to us.
I have here a few charts showing our technology growth from 1970 through the present, where I've charted certain key aspects of the semiconductor technology used in our microprocessors, from the 4004 in 1971 up through the Pentium processor in 1993. Here, charting the die size, you can see a nice compound annual growth rate, with a regression line that I've plotted only against the lead CPUs. Coming way below the line, at much smaller die sizes, you can see what we call our compaction products, such as the 386 compaction and the 486 compaction. These designs have taken a processor such as the 386, produced on one technology, and compacted it to the next generation, so we get a much smaller die to take into volume manufacturing. If you rule these out and plot just the lead CPUs, we get a very nice compound annual growth rate in the die size that's fairly constant over time.

This chart shows the shrinking of the transistor dimensions. And here, I have a semi-log plot of the square microns per transistor on the y-axis, turning down over time very nicely.
The larger die sizes, compounded with the smaller dimensions for each transistor, combine to produce a very nice growth rate in the transistors per microprocessor. And here, you can see a very nice compound annual growth rate of the number of transistors per microprocessor. And here again, we can see the compaction devices a little below the line, for the same reasons discussed earlier.

Again, let's come back to reality and try to relate these growth curves to some real-world examples. The transistor count per die is doubling every 24 months and has been doing so since 1970. That gives us about a 3,000-fold increase over that timeframe. Coming back to the real-world examples, perhaps the best analogy is the one of the wheat yields. Again, back in 1970, the typical yield was about 33 bushels per acre. If we had a 3,000-fold increase, that would give us about 100,000 bushels per acre, and that would be about three feet of wheat piled up on that acre.
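The 3,000-fold figure follows directly from the doubling period; a quick arithmetic check, assuming the 1970-to-1993 span described above:

```python
# Transistor count doubles every 24 months; compound that over 1970-1993.
years = 1993 - 1970
factor = 2 ** (years / 2)
print(round(factor))   # roughly 2,900 -- the "about a 3,000-fold increase"
```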
Perhaps the only real-world example that's come even close to this growth rate is the growth in our national debt, but I haven't had the courage to work out the figures on that one.

Let me shift gears now back to the present and bring you to the Pentium processor. First of all, I'd like to cover two key aspects of the design methodology used in developing this processor, of the many that were involved. One is that we co-developed software with the hardware so that they would complement each other to deliver maximum performance. In order to do this, we had a big focus on compiler technology. We staffed and funded a very professional compiler team internally so that the compiler developers could work hand in hand with the hardware developers. We also wanted to ensure that this compiler technology was propagated out to our large community of applications developers, so it was very important to make this technology broadly available and not hold it close to our vest. Consequently, we had a big focus on working with external compiler vendors to deliver this technology out through their channels.
Later on, we'll hear from the manager of our internal compiler effort, Beatrice Fu, who will describe this compiler technology and the results that we achieved. We also worked closely with key operating system vendors to make sure that the system aspects of the processor would result in good delivered performance from these operating systems.

A second aspect of the design methodology was quantifying the design decisions. The focus here was to quantify decisions on the architecture features to include -- that is, what new features to add to the instruction set -- and to quantify decisions on the microarchitecture -- that is, the internal structure of the processor -- and on the compiler techniques, the internal structure of the compiler, if you will. We wanted to base decisions in these areas on as much quantitative data as possible. One technique we used was system measurement, where we could measure programs executing on 386 and 486 systems and project forward to the Pentium processor from those system measurements.
At a more detailed level, we developed a very flexible and detailed performance simulator that would very accurately estimate the performance of different software running on the Pentium processor long before it was realized in silicon. In order to drive the simulator, we collected a number of traces across a broad spectrum of application areas and used these traces to drive the performance simulator. The results of this trace-driven simulation gave us quantitative feedback on the value of the different aspects of the hardware and software.

We even carried this quantitative computer architecture into the hardware itself by providing a number of facilities within the processor to support measurement of running systems, so that we have a very nice event monitoring facility that can count both discrete events, such as cache misses, as well as duration events, such as bus stalls. Later on, we'll see how we used these performance monitors to measure the effectiveness of some key microarchitecture techniques included in the Pentium processor, and how they performed when executing the SPEC benchmark suite.
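Intel's actual simulator and traces were internal tools and far more detailed than anything shown here, but purely as an illustration of the trace-driven idea, a toy version might look like this (the cache parameters and structure are hypothetical):

```python
from collections import Counter

def simulate(trace, lines=256, line_size=32):
    """Feed an address trace through a toy direct-mapped cache,
    counting hit/miss events the way an event monitor would."""
    tags = [None] * lines            # one stored tag per cache line
    events = Counter()
    for addr in trace:
        index = (addr // line_size) % lines
        tag = addr // (line_size * lines)
        if tags[index] == tag:
            events["hit"] += 1
        else:
            events["miss"] += 1
            tags[index] = tag        # fill the line on a miss
    return events

# Accesses within an already-fetched line hit; new lines miss.
counts = simulate([0, 4, 8, 32, 8192])
print(dict(counts))   # {'miss': 3, 'hit': 2}
```

Running many such traces through a model and comparing the event counts is the quantitative feedback loop the talk describes.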
But first, let's hear from Dan Alpert, chief architect of the Pentium processor, who will describe key features of its microarchitecture.

Thanks, John. When we started the design of the Pentium processor, we knew there'd be a number of very important challenges in trying to stay on the performance treadmill while maintaining compatibility with the previous generations of Intel microprocessors. What I'll be describing here is a number of the microarchitecture techniques that we developed in designing the Pentium processor. The topics I will cover are the bus interface, superscalar integer pipelines, cache organization, branch prediction, and the pipelined floating-point unit.

Let's start with the bus interface. We needed a high-performance bus to satisfy the Pentium processor's demand for instructions and data from external cache and memory. The data bus is 64 bits wide and operates at the full 66-megahertz speed of the processor core. In contrast, the bus of our previous-generation processor, the Intel 486 model DX2, has a 32-bit data path and runs at 33 megahertz.
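The bus comparison reduces to simple arithmetic; a minimal sketch, taking 1 MB as 10^6 bytes, as bandwidth figures usually do:

```python
pentium_bus = 8 * 66_000_000   # 64-bit (8-byte) bus at 66 MHz, bytes/second
dx2_bus     = 4 * 33_000_000   # 486 DX2: 32-bit (4-byte) bus at 33 MHz
print(pentium_bus / 1e6)       # 528.0 MB/s -- over half a gigabyte per second
print(pentium_bus / dx2_bus)   # 4.0 -- four times the 486 DX2's bandwidth
```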
So the Pentium processor's bus has four times the bandwidth, with a peak rate of over half a gigabyte per second.

Now moving on to the integer execution units, we started with a five-stage pipeline based on that of the Intel 486 processor. The first stage reads instructions from the cache into a prefetch buffer and aligns them for decoding. The first decode stage generates a control word for execution by the pipeline. The second decode stage decodes the control word and generates addresses for memory references. The E stage is used either to calculate a result in the ALU or to access data in the cache. The final stage is used to write results back to the register file.

In the Pentium processor, we have improved performance by decoding two instructions in parallel and replicating the address generation hardware and ALU. You can think of this as being similar to placing two of the 486's integer pipelines on the same chip. We call the two pipelines U and V. One final point is that complex instructions, such as string operations, are executed by generating a sequence of microcode words in the D1 stage.
The microcode is written to take advantage of the hardware that's available in both pipelines.

Let's look in more detail at the instruction-decode stage. Logic in the D1 stage is used to check for dependencies between instructions, and we only issue instructions in parallel to the two pipelines if they are independent. The types of dependency that we check for are resource, control, and data. First of all, we check that both of the instructions are from a class we call -- and this is in quotes -- "simple." The so-called simple instructions include ALU operations, jumps, loads, and stores. Our definition of simple even includes instructions that perform operations from memory to registers. We find that about 90% to 95% of instructions executed are simple. By issuing only a subset of the instructions in parallel, we are able to handle resource dependencies. For example, there's only a single shifter in the U pipe, so all of the shift instructions are issued to the U pipe. Control dependencies occur when the result of one instruction determines whether another instruction will be executed.
We handle this by checking whether the first instruction is a jump instruction. If it is, then we don't issue an instruction in parallel. Data dependencies occur when the result of one instruction is used or modified by another instruction. We handle this by checking that the destination register of the first instruction is neither the source nor the destination register of the second instruction.

We have included logic to improve performance by handling some common cases of dependencies, thereby allowing more instructions to be executed in parallel. First of all are the flags. In the Intel architecture, most of the instructions modify the flags, so that without special handling it would be difficult to pair many instructions. So we do handle the case of parallel instructions that modify the flags, and we update the flags the same as if the instructions were executed sequentially. We also have special logic that handles the case of a conditional branch instruction executed in parallel with an instruction that sets the flags. This occurs quite commonly, especially for compare-branch combinations.
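The basic pairing test just described can be sketched in a few lines. This is an illustrative model only, not the real D1-stage logic: the instruction representation, the membership of the "simple" class, and the omission of the flag and stack-pointer special cases are all simplifying assumptions.

```python
SIMPLE = {"alu", "shift", "jump", "load", "store"}  # assumed stand-in classes

def can_pair(u, v):
    """Return True if instruction v may issue to the V pipe alongside u."""
    # Resource dependency: both must be "simple", and since only the
    # U pipe has a shifter, shifts never go down the V pipe.
    if u["op"] not in SIMPLE or v["op"] not in SIMPLE or v["op"] == "shift":
        return False
    # Control dependency: nothing issues in parallel with a jump.
    if u["op"] == "jump":
        return False
    # Data dependency: u's destination may be neither the source
    # nor the destination of v.
    return u["dest"] not in (v["src"], v["dest"])

add = {"op": "alu",  "dest": "eax", "src": "ebx"}
mov = {"op": "load", "dest": "ecx", "src": "edx"}
dep = {"op": "alu",  "dest": "esi", "src": "eax"}  # reads add's result
print(can_pair(add, mov))  # True: independent simple instructions pair
print(can_pair(add, dep))  # False: a data dependency forces serial issue
```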
A second special case was the logic on the stack pointer that allows pushes and pops to be executed in parallel. For example, when passing parameters for a procedure call, there will often be a sequence of pushes. The logic adjusts the stack pointer appropriately for parallel execution. One final point about the decoder is that we prefetch a complete cache line of 256 bits to keep the decoder busy.

So the question comes up: just how effective is the instruction pairing? Here's a chart showing how the pipelines are utilized on the SPECint92, or integer SPEC, benchmark suite. What I've plotted here, for each of the SPEC benchmarks, is the percentage of instructions that go in the U pipe, which is the primary pipeline, and then the percentage of instructions that go through the V pipe, which is the secondary pipeline. You can see that the pairing ranges roughly around 30% to 40%, with one benchmark [INAUDIBLE] hitting above 40% of the instructions going into the V pipe. You can see on this particular set of applications that the pairing is quite effective.
I would like to point out again that these measurements are taken from an actual running system, using the performance monitoring counters that John described earlier.

At this point, let's take a look at the cache organization. We have separate code and data caches. This was done to eliminate the conflicts between prefetch accesses and data accesses. This was important because we knew the branch predictor was going to drive a lot more references into the code cache, and because the superscalar pipelines would generate two data references in parallel. A consequence of splitting the caches is that we have to add extra logic to handle self-modifying code compatibly. This affected a couple of key areas. One is that each cache has to snoop the other cache's misses so there would never be inconsistent data in the two caches. It turns out that the logic that handles consistency between external memory and the on-chip caches was able to handle consistency between the on-chip caches as well.
The other aspect is that we had to snoop the prefetch buffers also, so that if, for example, we wrote to an area of memory that happened to be in a prefetch buffer, we would need to invalidate the prefetch.

Now we can look at some of the vital statistics of the caches. Each of them is 8K bytes in size. They both use 32-byte lines and are two-way associative. The data cache uses a write-back protocol to minimize the bus traffic, and we use the MESI -- that's M, E, S, I -- four-state protocol to keep the data cache consistent with the code cache as well as the rest of the system, including multilevel caches off-chip and the memory. Once we separated the code and data caches, it became effective to separate the TLBs as well. The TLBs support two page sizes, a 4-kilobyte page and a 4-megabyte page. The larger page is useful for mapping large, resident data structures with a single TLB entry. For example, it can be used to map a graphics frame buffer or the memory-resident portions, either code or data, of the operating system. The data TLB has 64 entries for 4K-byte pages and 8 entries for 4-megabyte pages.
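These numbers pin down the cache geometry: 8K bytes with 32-byte lines and two ways gives 128 sets, and the TLB entry counts determine how much memory each TLB can map at once. A quick sketch of that arithmetic (the tag/index/offset split is the standard consequence of the stated geometry, not something the talk spells out):

```python
size, line, ways = 8 * 1024, 32, 2
sets = size // (line * ways)
print(sets)                       # 128 sets per cache

def decode(addr, offset_bits=5, index_bits=7):
    """Split an address into (tag, set index, byte offset) for this geometry."""
    offset = addr & (line - 1)             # low 5 bits: byte within the line
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

print(decode(0x12345))            # (18, 26, 5)

# TLB reach: memory mapped without a TLB miss.
print(64 * 4 * 1024)              # data TLB, 4K-byte pages: 256 KB
print(8 * 4 * 1024 * 1024)        # data TLB, 4-megabyte pages: 32 MB
```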
427 00:18:46,740 --> 00:18:50,670 The code TLB has 32 entries for 4K byte pages. 428 00:18:50,670 --> 00:18:52,260 The effectiveness of the caches is 429 00:18:52,260 --> 00:18:55,440 shown on the following charts for the SPEC benchmark suite. 430 00:18:55,440 --> 00:18:58,110 Please remember that this data is taken from an actual running 431 00:18:58,110 --> 00:19:00,720 system using the performance monitoring hardware that we 432 00:19:00,720 --> 00:19:02,400 included on-chip. 433 00:19:02,400 --> 00:19:05,490 The first chart shows the hit rate for instructions. 434 00:19:05,490 --> 00:19:09,522 It varies from about 90% to a little over 95%. 435 00:19:09,522 --> 00:19:10,980 What we are showing is the hit rate 436 00:19:10,980 --> 00:19:13,260 for prefetches of 32-byte lines coming out 437 00:19:13,260 --> 00:19:15,090 of the instruction cache. 438 00:19:15,090 --> 00:19:17,430 This is often reported in terms of hits per instruction 439 00:19:17,430 --> 00:19:18,432 instead. 440 00:19:18,432 --> 00:19:19,890 And if we looked at it in this way, 441 00:19:19,890 --> 00:19:22,470 the rate would have been even higher. 442 00:19:22,470 --> 00:19:24,180 The next chart shows the data cache 443 00:19:24,180 --> 00:19:25,995 hit rate based on the hits per read 444 00:19:25,995 --> 00:19:27,990 or write reference to memory. 445 00:19:27,990 --> 00:19:29,910 You can see that here, the data cache 446 00:19:29,910 --> 00:19:34,488 hit rate varies from just below 90% to about 95%. 447 00:19:34,488 --> 00:19:36,780 One of the most interesting aspects of the data cache's 448 00:19:36,780 --> 00:19:40,890 design is its support for dual accesses by the two pipelines. 449 00:19:40,890 --> 00:19:43,410 The Intel architecture has a limited number of registers. 
450 00:19:43,410 --> 00:19:46,260 There are only eight, and that results in more data memory 451 00:19:46,260 --> 00:19:48,840 references per instruction than we would see in architectures 452 00:19:48,840 --> 00:19:50,460 that have more registers. 453 00:19:50,460 --> 00:19:53,100 For example, for optimized 32-bit code, 454 00:19:53,100 --> 00:19:56,610 we see about 0.6 data references per instruction. 455 00:19:56,610 --> 00:19:58,920 In contrast, many of the RISC architectures 456 00:19:58,920 --> 00:20:02,220 that have 32 registers would see only about half that number 457 00:20:02,220 --> 00:20:04,620 of memory references per instruction. 458 00:20:04,620 --> 00:20:06,990 Now, even a small cache can capture the locality 459 00:20:06,990 --> 00:20:08,820 of the additional references. 460 00:20:08,820 --> 00:20:11,370 We see that the compiler for the RISC processors 461 00:20:11,370 --> 00:20:15,150 can capture all those references in the 32-register file. 462 00:20:15,150 --> 00:20:16,620 But it does lead to a requirement 463 00:20:16,620 --> 00:20:19,950 for additional bandwidth in the Pentium processor. 464 00:20:19,950 --> 00:20:23,820 We have implemented the cache using dual-ported tags and TLB 465 00:20:23,820 --> 00:20:26,940 with a single-ported interleaved data array. 466 00:20:26,940 --> 00:20:29,580 Making the data array, which is the bulk of the storage 467 00:20:29,580 --> 00:20:31,980 in the cache, single ported turned out 468 00:20:31,980 --> 00:20:34,380 to be the most efficient use of area. 469 00:20:34,380 --> 00:20:36,090 If we had also made it dual ported, 470 00:20:36,090 --> 00:20:37,590 we would have had a smaller cache 471 00:20:37,590 --> 00:20:39,840 and consequently a higher miss rate. 472 00:20:39,840 --> 00:20:41,880 The effect of increased misses was greater 473 00:20:41,880 --> 00:20:44,050 than the impact of bank conflicts, 474 00:20:44,050 --> 00:20:46,350 so we went with the interleaved approach. 
475 00:20:46,350 --> 00:20:48,840 There's logic to detect when parallel references go 476 00:20:48,840 --> 00:20:50,310 to the same bank. 477 00:20:50,310 --> 00:20:53,160 When they do, the U pipe reference is completed 478 00:20:53,160 --> 00:20:56,490 and the V pipe is forced to stall for one clock. 479 00:20:56,490 --> 00:20:58,350 The same logic also handles the case 480 00:20:58,350 --> 00:21:01,110 of data dependencies between parallel references 481 00:21:01,110 --> 00:21:03,210 because, of course, if two references are 482 00:21:03,210 --> 00:21:05,820 to the same location in memory, they will necessarily 483 00:21:05,820 --> 00:21:07,290 be to the same bank. 484 00:21:07,290 --> 00:21:09,360 One final point about the data cache 485 00:21:09,360 --> 00:21:12,330 is that the two 32-bit paths are combined together 486 00:21:12,330 --> 00:21:15,450 to form a 64-bit path for double-precision floating point 487 00:21:15,450 --> 00:21:16,927 operands. 488 00:21:16,927 --> 00:21:18,510 The next chart shows some measurements 489 00:21:18,510 --> 00:21:21,120 taken from the SPEC integer benchmarks suite, 490 00:21:21,120 --> 00:21:23,490 demonstrating the effectiveness of dual memory reference 491 00:21:23,490 --> 00:21:24,720 support. 492 00:21:24,720 --> 00:21:27,090 The chart shows the frequency of references 493 00:21:27,090 --> 00:21:28,980 that are actually paired. 494 00:21:28,980 --> 00:21:32,700 You can see here that except for [INAUDIBLE], about 20% to 25% 495 00:21:32,700 --> 00:21:35,580 of the memory references are V pipe references paired 496 00:21:35,580 --> 00:21:39,810 with U pipe references and the rate of bank conflicts is low. 497 00:21:39,810 --> 00:21:42,780 [INAUDIBLE] shows 40% of references executed in parallel 498 00:21:42,780 --> 00:21:45,540 in the V pipe, but about one out of eight references 499 00:21:45,540 --> 00:21:47,610 is a conflict. 
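The bank-selection and conflict check described above can be sketched as a small simulation. This is a hedged illustration only: the bank count, bank width, and address slicing below are assumptions chosen for clarity, not the chip's documented parameters.

```python
# Illustrative model of dual access to an interleaved, single-ported
# data array. Bank parameters are assumptions, not Pentium specifics.

NUM_BANKS = 8     # assumed: a 32-byte line split into 8 x 4-byte banks
BANK_WIDTH = 4    # bytes per bank (assumed)

def bank_of(addr: int) -> int:
    """Select a bank from the low-order address bits."""
    return (addr // BANK_WIDTH) % NUM_BANKS

def dual_access(u_addr: int, v_addr: int) -> tuple:
    """Return (u_completes, v_stalls) for a pair of parallel references.

    The U pipe reference always completes; the V pipe reference
    stalls for one clock when both map to the same bank.
    """
    conflict = bank_of(u_addr) == bank_of(v_addr)
    return True, conflict
```

Note that two references to the same memory location necessarily select the same bank, so this one check also catches the data-dependency case mentioned in the talk.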
500 00:21:47,610 --> 00:21:49,770 One of the most important performance enhancements 501 00:21:49,770 --> 00:21:53,070 of the Pentium processor is dynamic branch prediction. 502 00:21:53,070 --> 00:21:55,890 The Intel 486 processor flushes its pipeline 503 00:21:55,890 --> 00:21:59,730 for every taken branch at a cost of 2 clocks delay. 504 00:21:59,730 --> 00:22:03,030 Taken branches amount to 15% to 20% of instructions 505 00:22:03,030 --> 00:22:05,190 executed, representing an opportunity 506 00:22:05,190 --> 00:22:06,900 for substantial improvement. 507 00:22:06,900 --> 00:22:09,240 In the Pentium processor, we cache information 508 00:22:09,240 --> 00:22:11,250 about previous branches and use this 509 00:22:11,250 --> 00:22:13,290 to predict future branches. 510 00:22:13,290 --> 00:22:16,152 The prediction is made in an early stage of the pipeline, 511 00:22:16,152 --> 00:22:17,610 and when the prediction is correct, 512 00:22:17,610 --> 00:22:20,510 we can execute branches with no delay. 513 00:22:20,510 --> 00:22:22,260 All of the branch predictions are verified 514 00:22:22,260 --> 00:22:23,790 at the end of the pipeline. 515 00:22:23,790 --> 00:22:26,160 If the prediction was incorrect, the pipelines 516 00:22:26,160 --> 00:22:28,872 are flushed at a cost of 3 or 4 clocks. 517 00:22:28,872 --> 00:22:30,330 The techniques are similar to those 518 00:22:30,330 --> 00:22:33,060 used in mainframe architectures with adaptation 519 00:22:33,060 --> 00:22:36,470 to the Intel architecture and to superscalar execution. 520 00:22:36,470 --> 00:22:38,970 The memory structures that are used in the branch prediction 521 00:22:38,970 --> 00:22:42,960 hardware make for very efficient VLSI implementation. 522 00:22:42,960 --> 00:22:44,790 The heart of the branch prediction hardware 523 00:22:44,790 --> 00:22:49,530 is an associative memory called a branch target buffer, or BTB. 524 00:22:49,530 --> 00:22:53,820 The BTB has 256 entries, and it is 4-way associative. 
525 00:22:53,820 --> 00:22:56,640 The tag used to access the BTB is the address 526 00:22:56,640 --> 00:22:58,470 of the branch instruction. 527 00:22:58,470 --> 00:23:01,590 The data portion of the BTB includes the branch destination 528 00:23:01,590 --> 00:23:04,740 address and 2 bits of history used to predict whether or not 529 00:23:04,740 --> 00:23:06,600 the branch will be taken. 530 00:23:06,600 --> 00:23:08,580 Using 2 bits of history helps in cases 531 00:23:08,580 --> 00:23:11,160 where a branch is consistently taken for a while then 532 00:23:11,160 --> 00:23:12,870 occasionally not taken. 533 00:23:12,870 --> 00:23:15,000 Only when the prediction is wrong twice in a row 534 00:23:15,000 --> 00:23:16,427 will the prediction be changed. 535 00:23:16,427 --> 00:23:18,510 We have some measurements of the branch prediction 536 00:23:18,510 --> 00:23:21,090 accuracy taken with the built-in performance monitoring 537 00:23:21,090 --> 00:23:22,170 hardware. 538 00:23:22,170 --> 00:23:24,330 As you can see, the branch prediction accuracy 539 00:23:24,330 --> 00:23:28,380 is generally 80% to 85% with the exception of the gcc 540 00:23:28,380 --> 00:23:31,350 benchmark, which is only 73%. 541 00:23:31,350 --> 00:23:34,440 The transistor budget available in the Pentium processor 542 00:23:34,440 --> 00:23:35,910 allowed for dramatic improvements 543 00:23:35,910 --> 00:23:37,950 in the floating point performance. 544 00:23:37,950 --> 00:23:41,000 In the Intel 486 microprocessor, the floating point unit 545 00:23:41,000 --> 00:23:44,330 wasn't pipelined and it typically took 10 to 14 clocks 546 00:23:44,330 --> 00:23:46,160 for most operations. 547 00:23:46,160 --> 00:23:49,430 The faster hardware algorithms in the Pentium processor 548 00:23:49,430 --> 00:23:51,530 allow the operations to be executed 549 00:23:51,530 --> 00:23:53,990 a factor of 3 or more faster. 
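The 2-bit history scheme just described can be sketched as a small simulation. The saturating-counter encoding below is one common convention for 2-bit predictors, not necessarily the chip's exact state assignment.

```python
# 2-bit saturating counter, as kept in a BTB entry alongside the
# branch destination address. States 0..3; predict taken when >= 2.
# A single surprise does not flip the prediction; two in a row do.

def predict(counter: int) -> bool:
    """Predict taken for the two 'strong/weak taken' states."""
    return counter >= 2

def update(counter: int, taken: bool) -> int:
    """Nudge the counter toward the actual outcome, saturating at 0 and 3."""
    if taken:
        return min(counter + 1, 3)
    return max(counter - 1, 0)
```

For a loop branch that is taken many times and then falls through once, the counter sits at 3, drops to 2 on the single not-taken outcome, and still predicts taken on the next trip around the loop, which is exactly the behavior the talk motivates.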
550 00:23:53,990 --> 00:23:55,940 The execution units are also pipelined 551 00:23:55,940 --> 00:23:59,600 to deliver one floating point result per clock. 552 00:23:59,600 --> 00:24:01,880 The floating point pipeline was closely integrated 553 00:24:01,880 --> 00:24:03,830 with the integer pipeline. 554 00:24:03,830 --> 00:24:05,870 This was important in optimizing memory-to-register 555 00:24:05,870 --> 00:24:08,990 operations to deliver one 64-bit memory 556 00:24:08,990 --> 00:24:11,570 reference per clock in parallel with one floating 557 00:24:11,570 --> 00:24:13,040 point operation. 558 00:24:13,040 --> 00:24:14,630 The floating point execution units 559 00:24:14,630 --> 00:24:19,040 support the 80-bit IEEE extended format with short latencies. 560 00:24:19,040 --> 00:24:20,900 The multiplier and adder have latencies 561 00:24:20,900 --> 00:24:23,360 of 3 clocks for all formats. 562 00:24:23,360 --> 00:24:27,440 The divider has a latency of 18 to 38 clocks for precisions 563 00:24:27,440 --> 00:24:29,450 from single to extended. 564 00:24:29,450 --> 00:24:31,430 Along with the hardware to improve performance 565 00:24:31,430 --> 00:24:33,450 of the basic arithmetic functions, 566 00:24:33,450 --> 00:24:36,290 we also reimplemented the transcendental instructions-- 567 00:24:36,290 --> 00:24:39,170 sine, cosine, logarithm, and exponential-- 568 00:24:39,170 --> 00:24:41,090 that are part of our instruction set. 569 00:24:41,090 --> 00:24:43,610 The new algorithms use polynomial approximations 570 00:24:43,610 --> 00:24:46,580 to take advantage of the fast multiplier and adder. 571 00:24:46,580 --> 00:24:49,640 This provides improvements both in performance and accuracy 572 00:24:49,640 --> 00:24:52,260 over the Intel 486 microprocessor. 
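The general technique of approximating a transcendental with a polynomial, so that only the fast multiplier and adder are exercised, can be sketched as follows. This uses a plain Taylor polynomial evaluated with Horner's rule; the actual hardware algorithms and coefficients are not described in the talk, so everything below is an illustrative assumption.

```python
# Hedged sketch: sine by polynomial approximation. A degree-11
# Taylor polynomial in x, evaluated with Horner's rule in x^2, so the
# whole computation reduces to a chain of multiplies and adds.
import math

# Taylor coefficients of sin(x): x - x^3/3! + x^5/5! - ...
_COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(6)]

def poly_sin(x: float) -> float:
    """Approximate sin(x) for a range-reduced argument, |x| <= pi/2."""
    x2 = x * x
    acc = 0.0
    for c in reversed(_COEFFS):   # Horner evaluation in powers of x^2
        acc = acc * x2 + c
    return acc * x                # restore the odd factor of x
```

A real implementation would first range-reduce the argument and would use minimax rather than Taylor coefficients for better worst-case error, but the structure of the computation is the same: pure multiply-add work that a fast multiplier and adder execute quickly.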
573 00:24:52,260 --> 00:24:54,350 A key aspect of the floating point performance 574 00:24:54,350 --> 00:24:56,660 was co-developing the compiler with the hardware 575 00:24:56,660 --> 00:24:59,150 in order to ensure that its pipeline structure would 576 00:24:59,150 --> 00:25:01,070 be used very effectively. 577 00:25:01,070 --> 00:25:03,192 So that completes my portion of the presentation. 578 00:25:03,192 --> 00:25:05,150 I hope that it helped you in your understanding 579 00:25:05,150 --> 00:25:08,528 of the Pentium microprocessor. 580 00:25:08,528 --> 00:25:10,820 Some of the new capabilities we included on the Pentium 581 00:25:10,820 --> 00:25:13,430 processor were increased error detection 582 00:25:13,430 --> 00:25:16,550 and a functional redundancy check mode of operation. 583 00:25:16,550 --> 00:25:18,740 These were included at the request of our customers 584 00:25:18,740 --> 00:25:20,960 at the high end of the computer market, 585 00:25:20,960 --> 00:25:23,240 those involved in delivering solutions 586 00:25:23,240 --> 00:25:26,510 to mission-critical server applications. 587 00:25:26,510 --> 00:25:29,420 In these areas, even extremely rare errors 588 00:25:29,420 --> 00:25:32,520 must be detected and handled appropriately, 589 00:25:32,520 --> 00:25:36,600 so we included these features to satisfy those requirements. 590 00:25:36,600 --> 00:25:39,197 The most important thing is to handle external errors. 591 00:25:39,197 --> 00:25:41,780 And here, we've carried forward the data bus parity generation 592 00:25:41,780 --> 00:25:44,090 and checking logic that was on the 486 593 00:25:44,090 --> 00:25:47,180 and extended that to our 64-bit data bus. 594 00:25:47,180 --> 00:25:50,090 Also, we added a parity bit on the address bus 595 00:25:50,090 --> 00:25:54,110 so that we can also cover the second wide bus going off-chip 596 00:25:54,110 --> 00:25:56,990 with parity generation and checking. 
597 00:25:56,990 --> 00:26:00,320 In addition to this external error detection capability, 598 00:26:00,320 --> 00:26:02,450 within the chip we've included parity bits 599 00:26:02,450 --> 00:26:04,410 on all the major arrays. 600 00:26:04,410 --> 00:26:08,270 This includes the code and data cache arrays, cache tag arrays, 601 00:26:08,270 --> 00:26:11,390 the TLBs, and even the microcode ROM. 602 00:26:11,390 --> 00:26:12,980 All of these have parity bits included 603 00:26:12,980 --> 00:26:16,760 so that any internal errors are detected and reported via a pin. 604 00:26:16,760 --> 00:26:19,580 This internal parity checking covers about half the devices 605 00:26:19,580 --> 00:26:22,520 on the chip and so provides a very nice basic level 606 00:26:22,520 --> 00:26:24,110 of coverage. 607 00:26:24,110 --> 00:26:26,120 Beyond that, in order to offer an option 608 00:26:26,120 --> 00:26:28,010 for the ultimate in error detection, 609 00:26:28,010 --> 00:26:30,590 we have an FRC, or functional redundancy check, 610 00:26:30,590 --> 00:26:33,500 mode on the chip where you can configure two processors 611 00:26:33,500 --> 00:26:35,750 in a master-checker configuration 612 00:26:35,750 --> 00:26:37,730 where the master is operating normally 613 00:26:37,730 --> 00:26:40,490 and the checker, rather than driving its output pins, 614 00:26:40,490 --> 00:26:43,700 turns them into inputs and checks the values driven by the master 615 00:26:43,700 --> 00:26:47,190 and pulls a pin only if a mismatch occurs. 616 00:26:47,190 --> 00:26:48,890 So this increased error detection level 617 00:26:48,890 --> 00:26:51,650 provides a nice capability for some new application 618 00:26:51,650 --> 00:26:55,100 areas where the ultimate in error detection is required. 
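The two mechanisms described above, bus parity and FRC-style master-checker comparison, can be sketched as follows. This is a hedged illustration: the bus width matches the talk's 64-bit data bus, but the even-parity convention and the function names are assumptions, not the chip's documented behavior.

```python
# Illustrative parity generation/checking over a 64-bit bus value,
# plus an FRC-style comparison of a master's outputs against a
# checker's independently computed values.

def parity64(value: int) -> int:
    """Even-parity bit: chosen so the total 1s count (data + parity) is even."""
    return bin(value & (2**64 - 1)).count("1") & 1

def check_parity(value: int, parity_bit: int) -> bool:
    """True when the received parity bit matches the received data."""
    return parity64(value) == parity_bit

def frc_mismatch(master_outputs: list, checker_outputs: list) -> bool:
    """FRC-style check: the checker samples the values the master drives
    and compares them with its own results; any difference flags an error."""
    return master_outputs != checker_outputs
```

Parity of this kind catches any single-bit error on the bus or in an array, while lockstep comparison catches a much broader class of faults, which is why FRC is offered as the option for the most demanding configurations.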
619 00:26:55,100 --> 00:26:58,370 And now, let's go with Beatrice Fu to our performance lab 620 00:26:58,370 --> 00:27:00,110 where she will describe the compiler 621 00:27:00,110 --> 00:27:03,440 technology we developed with the Pentium processor. 622 00:27:03,440 --> 00:27:04,520 Thanks, John. 623 00:27:04,520 --> 00:27:08,390 How compiler technology enhances the performance of the Pentium 624 00:27:08,390 --> 00:27:13,110 processor is best illustrated with a simple example. 625 00:27:13,110 --> 00:27:15,810 Here, we have a simple synthetic program 626 00:27:15,810 --> 00:27:18,130 compiled by two compilers. 627 00:27:18,130 --> 00:27:22,290 The code on the left-hand side is generated by a typical 486 628 00:27:22,290 --> 00:27:24,900 compiler, and the code on the right-hand side 629 00:27:24,900 --> 00:27:28,740 is generated by the new compiler technology optimized 630 00:27:28,740 --> 00:27:31,050 for the Pentium CPU. 631 00:27:31,050 --> 00:27:33,960 The two code sequences are quite different, 632 00:27:33,960 --> 00:27:36,900 even though they are both correct representations 633 00:27:36,900 --> 00:27:39,330 of the original program. 634 00:27:39,330 --> 00:27:43,800 This demonstration shows how the two code sequences proceed 635 00:27:43,800 --> 00:27:46,080 through the Pentium processor. 636 00:27:46,080 --> 00:27:49,530 The graphical block diagram represents the instruction 637 00:27:49,530 --> 00:27:54,930 cache, the data cache, the two execution pipelines as well as 638 00:27:54,930 --> 00:27:57,750 the floating point unit. 639 00:27:57,750 --> 00:28:00,780 While executing, the active instructions 640 00:28:00,780 --> 00:28:04,110 are color coded so that we can follow their journeys 641 00:28:04,110 --> 00:28:06,480 through the Pentium processor. 
642 00:28:06,480 --> 00:28:09,270 As can be seen, the Pentium processor 643 00:28:09,270 --> 00:28:12,660 has many idle resources on the left-hand side when 644 00:28:12,660 --> 00:28:15,300 executing the traditional code sequence, 645 00:28:15,300 --> 00:28:19,140 whereas the new code sequence utilizes the dual execution 646 00:28:19,140 --> 00:28:22,530 pipelines more effectively on the right-hand side. 647 00:28:22,530 --> 00:28:26,640 As a result, the time taken to run the new code sequence 648 00:28:26,640 --> 00:28:29,860 is much less than for the traditional code. 649 00:28:29,860 --> 00:28:31,720 As can be seen, the right-hand side 650 00:28:31,720 --> 00:28:34,210 is finished whereas the left-hand side is still 651 00:28:34,210 --> 00:28:36,120 chugging along. 652 00:28:36,120 --> 00:28:38,760 The advancement of CPU microarchitecture 653 00:28:38,760 --> 00:28:41,520 requires advanced compiler technology 654 00:28:41,520 --> 00:28:44,040 to exploit the on-chip parallelism, 655 00:28:44,040 --> 00:28:47,040 such that features like pipelining and superscalar 656 00:28:47,040 --> 00:28:50,670 execution can be made more effective to deliver the final 657 00:28:50,670 --> 00:28:53,880 performance as illustrated in the previous demonstration. 658 00:28:53,880 --> 00:28:56,040 The role of a compiler has always 659 00:28:56,040 --> 00:28:59,310 been translating high-level language into low-level machine 660 00:28:59,310 --> 00:29:04,230 code so as to hide the machine details from the programmer. 661 00:29:04,230 --> 00:29:06,600 The role of an optimizing compiler 662 00:29:06,600 --> 00:29:11,160 is to generate machine code that runs efficiently on the target 663 00:29:11,160 --> 00:29:12,850 processor. 
664 00:29:12,850 --> 00:29:16,720 In the simple example here, a traditional compiler 665 00:29:16,720 --> 00:29:19,990 will generate code to increment the array index 666 00:29:19,990 --> 00:29:24,160 and compare against the loop bound and end up with a three-cycle loop 667 00:29:24,160 --> 00:29:25,270 body. 668 00:29:25,270 --> 00:29:29,020 Using the same example, our optimizing compiler 669 00:29:29,020 --> 00:29:31,990 displaces the loop count to eliminate 670 00:29:31,990 --> 00:29:34,870 the compare instruction and takes advantage 671 00:29:34,870 --> 00:29:38,620 of the zero-flag setting of the increment instruction, which 672 00:29:38,620 --> 00:29:44,190 improves the loop cycle count from 3 down to 2. 673 00:29:44,190 --> 00:29:47,070 We are again using the same example. 674 00:29:47,070 --> 00:29:50,940 Our new compiler technology further unrolls the loop 675 00:29:50,940 --> 00:29:54,990 and overlaps the instructions from two different iterations 676 00:29:54,990 --> 00:29:58,990 to take full advantage of the superscalar core. 677 00:29:58,990 --> 00:30:03,280 This gives a two-cycle loop count for every two iterations 678 00:30:03,280 --> 00:30:06,950 or effectively one cycle per iteration. 679 00:30:06,950 --> 00:30:11,800 This is one example of how our compiler technology maximizes 680 00:30:11,800 --> 00:30:17,680 the usage of the dual pipelines of the Pentium processor. 681 00:30:17,680 --> 00:30:22,480 In the last decade, the CPU speed has increased more than 682 00:30:22,480 --> 00:30:27,220 10 times, but the speed of memory components has not kept 683 00:30:27,220 --> 00:30:28,210 up. 684 00:30:28,210 --> 00:30:31,960 As caches are introduced into the memory subsystem 685 00:30:31,960 --> 00:30:36,250 to meet the CPU demand, cooperation from compilers 686 00:30:36,250 --> 00:30:40,480 is sometimes needed to ensure efficient use of the memory 687 00:30:40,480 --> 00:30:41,650 bandwidth. 
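The unrolling transformation just described, overlapping two iterations per pass so that independent work can pair into the two pipelines, can be sketched as follows. The loop body here is a hypothetical example; on the Pentium processor the payoff happens at the machine-code level, where the paired statements feed the U and V pipes.

```python
# Sketch of unrolling a loop by 2. The two statements in the unrolled
# body are independent, so they are candidates for pairing in the
# dual pipelines; a cleanup step handles an odd trip count.

def scale_add_rolled(a, x, y):
    """Reference version: one iteration per pass."""
    for i in range(len(x)):
        y[i] += a * x[i]

def scale_add_unrolled_by_2(a, x, y):
    """Unrolled version: two independent iterations per pass."""
    n = len(x)
    for i in range(0, n - 1, 2):
        y[i] += a * x[i]          # these two statements have no
        y[i + 1] += a * x[i + 1]  # dependence on each other
    if n % 2:                     # clean up the odd final element
        y[n - 1] += a * x[n - 1]
```

Both versions compute the same result; the transformation only changes how much independent work each trip around the loop exposes.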
688 00:30:41,650 --> 00:30:45,250 In this next example, with j being the index 689 00:30:45,250 --> 00:30:48,820 of the innermost loop, the array elements 690 00:30:48,820 --> 00:30:53,230 are not accessed from contiguous memory locations. 691 00:30:53,230 --> 00:30:56,500 In other words, even though several elements 692 00:30:56,500 --> 00:31:00,580 are brought into one cache line, only the first element 693 00:31:00,580 --> 00:31:01,870 will be used. 694 00:31:01,870 --> 00:31:04,360 The rest will be discarded and brought 695 00:31:04,360 --> 00:31:08,510 in again and again as we cycle through the outer loop. 696 00:31:08,510 --> 00:31:12,940 Our compiler technology remedies this by interchanging i 697 00:31:12,940 --> 00:31:17,020 and j, making the memory accesses contiguous. 698 00:31:17,020 --> 00:31:19,780 Just now, I illustrated a few techniques 699 00:31:19,780 --> 00:31:21,760 of an optimizing compiler. 700 00:31:21,760 --> 00:31:25,450 Our 32-bit compiler has extensive optimizations 701 00:31:25,450 --> 00:31:28,330 to ensure the maximum performance of the Pentium 702 00:31:28,330 --> 00:31:29,800 processor. 703 00:31:29,800 --> 00:31:33,220 We have state-of-the-art classical optimizations, 704 00:31:33,220 --> 00:31:37,110 like register variable detection, loop-invariant 705 00:31:37,110 --> 00:31:40,090 code motion, and others listed here. 706 00:31:40,090 --> 00:31:51,930 707 00:31:51,930 --> 00:31:56,240 We also pay specific attention to the x86 architecture 708 00:31:56,240 --> 00:31:59,030 as well as our CPU implementation 709 00:31:59,030 --> 00:32:02,480 in terms of selecting effective addressing modes 710 00:32:02,480 --> 00:32:07,220 and code sequences and rearranging instructions to take advantage 711 00:32:07,220 --> 00:32:09,710 of the superscalar core. 
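The loop interchange described above can be sketched as follows. The array here is flattened by hand to make the memory strides explicit; the layout, sizes, and loop body are illustrative assumptions, not the benchmark code from the talk.

```python
# Sketch of loop interchange. The array is stored with consecutive i
# values adjacent in memory (a[j][i] lives at j * N + i). With j
# innermost, successive accesses are N elements apart, so only one
# element of each fetched cache line is used before moving on.
# Interchanging i and j makes the inner accesses stride-1.

N = 4
a = list(range(N * N))            # flattened array; element (j, i) at j*N + i

def sum_before_interchange(a):
    total = 0
    for i in range(N):
        for j in range(N):        # inner index j strides by N: poor locality
            total += a[j * N + i]
    return total

def sum_after_interchange(a):
    total = 0
    for j in range(N):
        for i in range(N):        # inner index i is now stride-1
            total += a[j * N + i]
    return total
```

The two loops visit exactly the same elements and produce the same result; only the visiting order changes, which is why interchange is legal here and why it turns wasted cache-line fetches into fully used ones.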
712 00:32:09,710 --> 00:32:12,330 In the earlier example, I illustrated 713 00:32:12,330 --> 00:32:16,250 how loop interchange can result in more effective use 714 00:32:16,250 --> 00:32:18,380 of the memory bandwidth. 715 00:32:18,380 --> 00:32:22,160 Loop interchange is only one of the memory optimization 716 00:32:22,160 --> 00:32:23,270 techniques. 717 00:32:23,270 --> 00:32:27,410 Our compiler technology can also perform loop distribution 718 00:32:27,410 --> 00:32:32,600 to promote parallelism, blocking to maximize memory reuse once 719 00:32:32,600 --> 00:32:35,840 loaded into the cache, strip mining 720 00:32:35,840 --> 00:32:39,260 to adjust the problem size to the size of the cache 721 00:32:39,260 --> 00:32:43,610 in order to minimize thrashing, and cache preloading 722 00:32:43,610 --> 00:32:46,070 to hide the memory latency. 723 00:32:46,070 --> 00:32:50,400 The next chart shows the result of our compiler efforts. 724 00:32:50,400 --> 00:32:54,680 We are comparing the performance of the 66 megahertz Pentium 725 00:32:54,680 --> 00:32:58,670 processor running the SPEC benchmark suites compiled 726 00:32:58,670 --> 00:33:01,050 by three different compilers. 727 00:33:01,050 --> 00:33:05,000 The compiler technology I just described is shown here. 728 00:33:05,000 --> 00:33:09,560 Compiler A is the best available 486 compiler, 729 00:33:09,560 --> 00:33:13,850 and compiler B was an average compiler in 1990 when 730 00:33:13,850 --> 00:33:16,430 we started the compiler effort. 731 00:33:16,430 --> 00:33:20,150 On the integer side, we were able to go from a performance 732 00:33:20,150 --> 00:33:26,660 level of 41 using the average compiler up to a level of 64.6 733 00:33:26,660 --> 00:33:29,960 using our new compiler technology. 
734 00:33:29,960 --> 00:33:33,410 When compared to the level of 57.6 735 00:33:33,410 --> 00:33:36,740 from the best available 486 compiler, 736 00:33:36,740 --> 00:33:41,210 the difference between the best 486 compiler and the new compiler 737 00:33:41,210 --> 00:33:45,020 is on the order of 10% to 15%, obviously 738 00:33:45,020 --> 00:33:48,980 a very substantial improvement over the average compiler 739 00:33:48,980 --> 00:33:51,560 from the 1990 vintage. 740 00:33:51,560 --> 00:33:56,420 On the floating point side, the results are even more dramatic. 741 00:33:56,420 --> 00:33:58,790 From the new compiler technology, 742 00:33:58,790 --> 00:34:04,010 we are able to achieve a SPEC floating point result of 59.7 compared 743 00:34:04,010 --> 00:34:09,050 to much lower numbers in the 30s using the other compilers. 744 00:34:09,050 --> 00:34:12,620 Throughout the development of the new compiler technology, 745 00:34:12,620 --> 00:34:16,130 we emphasized that we could not optimize performance 746 00:34:16,130 --> 00:34:20,330 of the Pentium processor at the expense of the Intel 486 747 00:34:20,330 --> 00:34:23,570 or even the Intel 386 processors. 748 00:34:23,570 --> 00:34:27,050 So we were very careful in implementing our optimization 749 00:34:27,050 --> 00:34:30,350 techniques, and we constantly measured the performance 750 00:34:30,350 --> 00:34:34,340 of the generated code on the older generation processors. 751 00:34:34,340 --> 00:34:37,190 The efforts really paid off. 752 00:34:37,190 --> 00:34:41,210 This chart shows that instead of optimizing for the Pentium 753 00:34:41,210 --> 00:34:44,900 processor at the expense of the older parts, 754 00:34:44,900 --> 00:34:48,679 we were, in fact, able to improve the performance on the older 755 00:34:48,679 --> 00:34:52,280 parts while optimizing for the Pentium processor. 
756 00:34:52,280 --> 00:34:55,040 We are again comparing the same benchmark suites 757 00:34:55,040 --> 00:34:57,920 using the same three compilers 758 00:34:57,920 --> 00:35:02,390 but running the code on the Intel 486 processor. 759 00:35:02,390 --> 00:35:05,120 You can see that our new compiler technology 760 00:35:05,120 --> 00:35:08,030 is providing a nice performance boost, 761 00:35:08,030 --> 00:35:12,710 not only over the average compiler that we had in 1990 762 00:35:12,710 --> 00:35:17,240 but also over the best available 486 compiler. 763 00:35:17,240 --> 00:35:20,000 Now, I hope you can see the importance of compiler 764 00:35:20,000 --> 00:35:23,320 technology to the overall performance. 765 00:35:23,320 --> 00:35:25,958 This concludes our talk on the Pentium microprocessor. 766 00:35:25,958 --> 00:35:27,750 I hope that you'll take away from this talk 767 00:35:27,750 --> 00:35:30,180 some understanding of the key microarchitecture features 768 00:35:30,180 --> 00:35:33,330 inside the processor as well as some of the key techniques 769 00:35:33,330 --> 00:35:36,090 we've developed with compiler technology. 770 00:35:36,090 --> 00:35:38,250 In combination, these two produce a new level 771 00:35:38,250 --> 00:35:39,930 of performance within our line of PC 772 00:35:39,930 --> 00:35:41,730 compatible microprocessors. 773 00:35:41,730 --> 00:35:43,580 Thank you. 774 00:35:43,580 --> 00:37:09,000