1 00:00:00,000 --> 00:00:00,996 2 00:00:00,996 --> 00:00:04,482 [MUSIC PLAYING] 3 00:00:04,482 --> 00:00:39,350 4 00:00:39,350 --> 00:00:39,850 Hello. 5 00:00:39,850 --> 00:00:43,210 I'm Jim Montanaro, and I'm going to discuss the implementation 6 00:00:43,210 --> 00:00:45,610 of the Alpha CPU chip. 7 00:00:45,610 --> 00:00:48,890 This tape is the second of two tapes about the Alpha program. 8 00:00:48,890 --> 00:00:51,610 The first tape, by Dick Sites and Dirk Meyer, 9 00:00:51,610 --> 00:00:53,500 discusses the architecture and some 10 00:00:53,500 --> 00:00:55,600 of the high-level implementation issues. 11 00:00:55,600 --> 00:00:57,970 This tape will talk about the lower-level implementation 12 00:00:57,970 --> 00:00:58,840 issues. 13 00:00:58,840 --> 00:01:00,325 The two tapes are independent. 14 00:01:00,325 --> 00:01:02,950 You don't need to have seen the first one in order for this one 15 00:01:02,950 --> 00:01:03,940 to make sense. 16 00:01:03,940 --> 00:01:07,150 They do overlap a little bit, but in general, they 17 00:01:07,150 --> 00:01:09,760 just complement each other. 18 00:01:09,760 --> 00:01:11,620 First I'd like to give you some idea of what 19 00:01:11,620 --> 00:01:13,258 I'm going to talk about. 20 00:01:13,258 --> 00:01:14,800 To start with, I'll discuss the goals 21 00:01:14,800 --> 00:01:17,110 of the program and the chip. 22 00:01:17,110 --> 00:01:19,780 Next I'll describe some of the constraints and resources 23 00:01:19,780 --> 00:01:22,090 available which served as a framework around which 24 00:01:22,090 --> 00:01:23,800 the design evolved. 25 00:01:23,800 --> 00:01:26,410 Then I'll try to describe the design strategy 26 00:01:26,410 --> 00:01:28,750 and philosophy of the team. 27 00:01:28,750 --> 00:01:32,240 Having set that background, I'll talk about the chips themselves 28 00:01:32,240 --> 00:01:35,190 and some of the technical problems that we encountered. 
29 00:01:35,190 --> 00:01:37,500 The primary goal of the Alpha program 30 00:01:37,500 --> 00:01:40,023 was to provide hardware and software that 31 00:01:40,023 --> 00:01:41,940 could deliver leadership performance and price 32 00:01:41,940 --> 00:01:45,090 performance in a way that could maintain and expand 33 00:01:45,090 --> 00:01:46,860 Digital's customer base. 34 00:01:46,860 --> 00:01:49,000 The chip team's part of the deal was 35 00:01:49,000 --> 00:01:53,230 to provide the engine, a high-performance RISC CPU chip. 36 00:01:53,230 --> 00:01:55,510 A few things were obvious from the start. 37 00:01:55,510 --> 00:01:57,550 Number one, we were coming into the market 38 00:01:57,550 --> 00:02:00,100 late with a RISC chip, and there was some skepticism 39 00:02:00,100 --> 00:02:02,470 about our ability to deliver. 40 00:02:02,470 --> 00:02:05,500 The first chip had to be fast enough to create a stir 41 00:02:05,500 --> 00:02:08,710 and to be the clear choice for our internal system partners. 42 00:02:08,710 --> 00:02:12,020 An average design would not be sufficient. 43 00:02:12,020 --> 00:02:15,710 Number two, we needed to field a variety of systems, 44 00:02:15,710 --> 00:02:17,780 and given the resources available, 45 00:02:17,780 --> 00:02:21,585 we were only going to be able to design one chip to start with. 46 00:02:21,585 --> 00:02:22,960 Consequently, the chip would have 47 00:02:22,960 --> 00:02:26,230 to be flexible enough to support both the high and lower end 48 00:02:26,230 --> 00:02:28,810 systems, at least at the start. 49 00:02:28,810 --> 00:02:32,410 More specialized chips for parts of the system space 50 00:02:32,410 --> 00:02:35,500 would come if we were successful with the program. 51 00:02:35,500 --> 00:02:38,410 And third, of course, we were in a hurry. 52 00:02:38,410 --> 00:02:42,300 Schedule was important for several reasons. 
53 00:02:42,300 --> 00:02:45,140 One reason was that we had to provide a hardware base 54 00:02:45,140 --> 00:02:47,290 for the software development. 55 00:02:47,290 --> 00:02:49,480 Although the chip design task was hard, 56 00:02:49,480 --> 00:02:51,600 and its performance was important, 57 00:02:51,600 --> 00:02:55,590 the program would live or die by the success of the software. 58 00:02:55,590 --> 00:02:57,200 The second, very important reason 59 00:02:57,200 --> 00:02:59,390 was to establish credibility. 60 00:02:59,390 --> 00:03:02,030 We were designing a CPU in a new architecture. 61 00:03:02,030 --> 00:03:05,150 So we had no established following inside Digital. 62 00:03:05,150 --> 00:03:07,100 As such, we were at risk of cancellation 63 00:03:07,100 --> 00:03:10,440 if we slipped our schedule or missed the performance goals. 64 00:03:10,440 --> 00:03:12,390 At the beginning of the Alpha program 65 00:03:12,390 --> 00:03:14,430 there was some argument as to whether Alpha was 66 00:03:14,430 --> 00:03:16,710 necessary or even desirable. 67 00:03:16,710 --> 00:03:18,810 And it's probably safe to say that we did not 68 00:03:18,810 --> 00:03:22,420 enjoy the unanimous support of all parts of the company. 69 00:03:22,420 --> 00:03:23,880 Some of these risks would decrease 70 00:03:23,880 --> 00:03:25,830 once we had something real, something 71 00:03:25,830 --> 00:03:30,120 that you could send 2 plus 2 to and hopefully get back 72 00:03:30,120 --> 00:03:32,270 the answer 4. 73 00:03:32,270 --> 00:03:34,280 Those were the goals. 74 00:03:34,280 --> 00:03:36,230 As usual, there are the accompanying set 75 00:03:36,230 --> 00:03:39,230 of constraints, some of which seemed at odds with the goals, 76 00:03:39,230 --> 00:03:41,840 and a collection of resources that we hoped 77 00:03:41,840 --> 00:03:43,840 were sufficient to the task. 
78 00:03:43,840 --> 00:03:45,910 First, the constraints-- we needed 79 00:03:45,910 --> 00:03:49,510 to complete the chip by mid-1990 to support the software 80 00:03:49,510 --> 00:03:50,920 development schedule. 81 00:03:50,920 --> 00:03:52,570 However, when we looked at the schedule 82 00:03:52,570 --> 00:03:55,180 for our internal CMOS process development, 83 00:03:55,180 --> 00:03:56,860 we found that the process we needed, 84 00:03:56,860 --> 00:04:00,430 CMOS-4 with 0.75 micron features, 85 00:04:00,430 --> 00:04:02,440 would just be ready by then. 86 00:04:02,440 --> 00:04:04,780 The timing looked awfully tight, and any hiccup 87 00:04:04,780 --> 00:04:06,580 in the process development schedule 88 00:04:06,580 --> 00:04:09,290 would directly impact us. 89 00:04:09,290 --> 00:04:12,620 Then there was competition with the NVAX chip, the VAX 90 00:04:12,620 --> 00:04:16,459 microprocessor that was presented at ISSCC '92, 91 00:04:16,459 --> 00:04:20,430 which was also planning to use CMOS-4 in the same time frame. 92 00:04:20,430 --> 00:04:22,580 The idea of two big chips clamoring 93 00:04:22,580 --> 00:04:24,350 for attention from the process folks 94 00:04:24,350 --> 00:04:26,433 just when they were trying to bring the process up 95 00:04:26,433 --> 00:04:29,460 for the first time didn't seem real appealing. 96 00:04:29,460 --> 00:04:32,780 So we decided to go for plan B. We decided to design 97 00:04:32,780 --> 00:04:34,640 two chips in sequence. 98 00:04:34,640 --> 00:04:39,200 First we would use CMOS-3, a 1-micron CMOS process, 99 00:04:39,200 --> 00:04:42,890 and just implement a subset of the functionality we needed. 
100 00:04:42,890 --> 00:04:45,103 We'd build a system that used that chip, 101 00:04:45,103 --> 00:04:46,520 and the software development could 102 00:04:46,520 --> 00:04:49,070 start while we finished the final chip, which 103 00:04:49,070 --> 00:04:50,900 would contain the full functionality 104 00:04:50,900 --> 00:04:53,260 and be built in CMOS-4. 105 00:04:53,260 --> 00:04:56,920 So there were two chips, called EV3 and EV4, 106 00:04:56,920 --> 00:04:59,590 where the 3 and the 4 correspond to the process in which they 107 00:04:59,590 --> 00:05:01,230 were fabricated. 108 00:05:01,230 --> 00:05:03,540 The original plan was that we'd leave the caches off 109 00:05:03,540 --> 00:05:08,460 of EV3 so that it would fit in the allowable CMOS-3 die size. 110 00:05:08,460 --> 00:05:11,640 Later on, in an effort to accelerate the overall program 111 00:05:11,640 --> 00:05:16,000 schedule by providing chips to the system groups earlier, 112 00:05:16,000 --> 00:05:20,160 we decided to make the EV3 pin bus exactly match the pin 113 00:05:20,160 --> 00:05:22,805 bus which would appear on EV4. 114 00:05:22,805 --> 00:05:24,180 This would mean that, in addition 115 00:05:24,180 --> 00:05:26,880 to software development, the production system 116 00:05:26,880 --> 00:05:30,680 debug could proceed while we finish the final chip. 117 00:05:30,680 --> 00:05:33,260 In order to accomplish this, we changed the subset 118 00:05:33,260 --> 00:05:35,300 of what would go into EV3. 119 00:05:35,300 --> 00:05:36,950 We removed the floating-point unit, 120 00:05:36,950 --> 00:05:40,630 and we added baby I and D caches in its place. 121 00:05:40,630 --> 00:05:43,240 With EV3, the floating point instructions 122 00:05:43,240 --> 00:05:45,270 trap to an emulator. 123 00:05:45,270 --> 00:05:48,990 This change was made about a year after the design started. 124 00:05:48,990 --> 00:05:52,290 It injected a major disturbance into the whole design process. 
125 00:05:52,290 --> 00:05:54,780 And our attempt to hold the original schedule, 126 00:05:54,780 --> 00:05:57,360 despite making a major change like this, 127 00:05:57,360 --> 00:06:00,270 caused a significant strain on the team. 128 00:06:00,270 --> 00:06:02,700 In the end, we slipped the tape-out date of EV3 129 00:06:02,700 --> 00:06:05,680 about four months past the original target. 130 00:06:05,680 --> 00:06:07,690 On the plus side, we are fortunate to have 131 00:06:07,690 --> 00:06:10,510 an impressive array of resources available. 132 00:06:10,510 --> 00:06:14,620 We have in-house CMOS process expertise in manufacturing. 133 00:06:14,620 --> 00:06:18,580 Digital has been developing CMOS processes since about 1982, 134 00:06:18,580 --> 00:06:20,560 and these processes have been optimized 135 00:06:20,560 --> 00:06:23,230 towards high-performance microprocessor design. 136 00:06:23,230 --> 00:06:26,470 Consequently, the chip designers designed to a single CMOS 137 00:06:26,470 --> 00:06:28,420 process, rather than to the union 138 00:06:28,420 --> 00:06:30,580 of several vendors' processes. 139 00:06:30,580 --> 00:06:32,470 And the chip designers have had input 140 00:06:32,470 --> 00:06:35,490 into the direction process development goes. 141 00:06:35,490 --> 00:06:37,470 People working on the process development 142 00:06:37,470 --> 00:06:40,058 sit about 200 feet away from the chip designers. 143 00:06:40,058 --> 00:06:41,850 So if you have questions about the process, 144 00:06:41,850 --> 00:06:45,450 you can walk across the hall and get them answered. 145 00:06:45,450 --> 00:06:48,990 Within our group, we have expertise in systems, software, 146 00:06:48,990 --> 00:06:51,510 architecture, micro-architecture, 147 00:06:51,510 --> 00:06:53,640 circuit design, and layout. 148 00:06:53,640 --> 00:06:55,320 This is very important, because it 149 00:06:55,320 --> 00:06:58,950 allows you to optimize across all these disciplines. 
150 00:06:58,950 --> 00:07:01,450 At the same time we started the first chip design, 151 00:07:01,450 --> 00:07:03,450 there was one person in the group working on a C 152 00:07:03,450 --> 00:07:05,370 compiler for the machine. 153 00:07:05,370 --> 00:07:08,130 There was also a group of four people from our group in Hudson 154 00:07:08,130 --> 00:07:11,400 and from Digital Systems Research Center in California 155 00:07:11,400 --> 00:07:14,880 working on the design of a system that would use the chip. 156 00:07:14,880 --> 00:07:16,680 These efforts were to design prototypes, 157 00:07:16,680 --> 00:07:18,750 not to design products. 158 00:07:18,750 --> 00:07:20,760 But the parallel development efforts 159 00:07:20,760 --> 00:07:22,200 resulted in important information 160 00:07:22,200 --> 00:07:24,300 that we could draw upon to make good trade-offs 161 00:07:24,300 --> 00:07:26,568 during the course of the chip design. 162 00:07:26,568 --> 00:07:28,860 They were also coordinated to the schedule of the chip, 163 00:07:28,860 --> 00:07:31,420 not to the schedule of any particular product. 164 00:07:31,420 --> 00:07:35,007 So for example, we had a socket waiting in the prototype system 165 00:07:35,007 --> 00:07:36,840 to plug the chip into as soon as it came out 166 00:07:36,840 --> 00:07:38,670 of the manufacturing line. 167 00:07:38,670 --> 00:07:40,860 We have a local CAD group of about 100 people, 168 00:07:40,860 --> 00:07:43,800 who develop our design and verification tools. 169 00:07:43,800 --> 00:07:45,360 This provides access to tools which 170 00:07:45,360 --> 00:07:47,460 are not available in the industry at large, 171 00:07:47,460 --> 00:07:50,490 and which can be customized to the requirements of our design, 172 00:07:50,490 --> 00:07:53,573 sometimes on very short notice. 173 00:07:53,573 --> 00:07:55,240 There are several distinct design groups 174 00:07:55,240 --> 00:07:56,740 within the facility. 
175 00:07:56,740 --> 00:07:58,900 Despite some differences in design style 176 00:07:58,900 --> 00:08:01,570 and feuds that go back generations, 177 00:08:01,570 --> 00:08:04,270 each project benefits from the work of the other groups, 178 00:08:04,270 --> 00:08:07,400 either through shared CAD tools or design techniques. 179 00:08:07,400 --> 00:08:10,000 The other groups also serve as a pool of talent and machine 180 00:08:10,000 --> 00:08:13,990 resources for times when a project needs some extra help. 181 00:08:13,990 --> 00:08:16,780 Now I'd like to talk about the design style or philosophy that 182 00:08:16,780 --> 00:08:19,690 was present in the EV3 and EV4 designs. 183 00:08:19,690 --> 00:08:21,620 I have to start with a disclaimer-- 184 00:08:21,620 --> 00:08:24,160 this style is specific to EV3 and EV4, 185 00:08:24,160 --> 00:08:25,840 not to Digital in general, or even 186 00:08:25,840 --> 00:08:27,670 to other groups in this building. 187 00:08:27,670 --> 00:08:30,352 There were even some variations within EV4, 188 00:08:30,352 --> 00:08:31,810 because the floating point unit was 189 00:08:31,810 --> 00:08:33,669 designed by another group that had 190 00:08:33,669 --> 00:08:35,559 a slightly different approach. 191 00:08:35,559 --> 00:08:37,330 Our method works pretty well for us, 192 00:08:37,330 --> 00:08:39,730 but I don't claim that it's ideal for all groups 193 00:08:39,730 --> 00:08:42,100 and certainly not for all types of chip design. 194 00:08:42,100 --> 00:08:43,780 It is aimed at producing a design which 195 00:08:43,780 --> 00:08:46,830 achieves maximum performance. 196 00:08:46,830 --> 00:08:49,950 The design style of the chip teams reflects, to some measure, 197 00:08:49,950 --> 00:08:51,930 the personality of the team leaders. 
198 00:08:51,930 --> 00:08:53,827 If you go into Dan Dobberpuhl's office, 199 00:08:53,827 --> 00:08:55,410 you'll note that the only object which 200 00:08:55,410 --> 00:08:59,580 rises above the stacks of paper is a workstation monitor. 201 00:08:59,580 --> 00:09:04,160 In my office, the bookshelf is stacked with rubber toys. 202 00:09:04,160 --> 00:09:07,100 On one occasion, I found a wind-up toy on my chair 203 00:09:07,100 --> 00:09:09,950 with a note that read, "My nephew was going to throw this away, 204 00:09:09,950 --> 00:09:11,120 but I knew you'd want it." 205 00:09:11,120 --> 00:09:13,220 I still have the toy. 206 00:09:13,220 --> 00:09:14,870 With this preface, you might imagine 207 00:09:14,870 --> 00:09:18,180 that we were not the model of organization or decorum. 208 00:09:18,180 --> 00:09:21,290 The team operated with a minimum of organizational hierarchy, 209 00:09:21,290 --> 00:09:24,300 methodological constraints, and documentation. 210 00:09:24,300 --> 00:09:25,790 For example, the behavioral model 211 00:09:25,790 --> 00:09:27,740 is the only implementation spec. 212 00:09:27,740 --> 00:09:30,530 Some projects like to generate detailed descriptions, 213 00:09:30,530 --> 00:09:33,500 in English, of the functionality to be implemented before they 214 00:09:33,500 --> 00:09:34,970 start the implementation. 215 00:09:34,970 --> 00:09:38,300 The only formal documentation for the physical implementation 216 00:09:38,300 --> 00:09:40,880 is a 50-page document, fairly sketchy, 217 00:09:40,880 --> 00:09:43,730 that includes node naming conventions, circuit sizing 218 00:09:43,730 --> 00:09:47,480 and simulation guidelines, global clock waveforms, 219 00:09:47,480 --> 00:09:51,460 latching and clocking rules, standard latch library, 220 00:09:51,460 --> 00:09:54,880 electromigration limits, CAD tool hints, and guidelines 221 00:09:54,880 --> 00:09:56,710 for layout designers. 
222 00:09:56,710 --> 00:09:59,680 It was revised as the project progressed and includes 223 00:09:59,680 --> 00:10:01,960 paragraphs like the following. 224 00:10:01,960 --> 00:10:05,170 "Digression on speed versus functionality issues-- 225 00:10:05,170 --> 00:10:08,020 based on some discussions with several designers, 226 00:10:08,020 --> 00:10:10,720 the EV3 Critical Path Appeals Board 227 00:10:10,720 --> 00:10:14,000 would like to clarify, emphasize, and in some cases, 228 00:10:14,000 --> 00:10:17,200 state for the first time some assumptions and guidelines 229 00:10:17,200 --> 00:10:19,900 around the simulation of critical speed paths and race 230 00:10:19,900 --> 00:10:21,320 conditions. 231 00:10:21,320 --> 00:10:22,970 Critical speed paths and races are 232 00:10:22,970 --> 00:10:25,100 treated in a fundamentally different manner 233 00:10:25,100 --> 00:10:26,540 on our project." 234 00:10:26,540 --> 00:10:29,810 "You should think of speed paths as an interesting and enjoyable 235 00:10:29,810 --> 00:10:33,560 challenge and races as evil and odious. 236 00:10:33,560 --> 00:10:35,570 For the purpose of this note, a speed path 237 00:10:35,570 --> 00:10:37,790 is any path which will be guaranteed 238 00:10:37,790 --> 00:10:39,410 to work if you slow down the clock 239 00:10:39,410 --> 00:10:43,430 by some reasonable amount, and a race is a path which will not. 240 00:10:43,430 --> 00:10:45,830 Yes, we know you can argue with this terminology. 241 00:10:45,830 --> 00:10:49,160 No, we don't care." 242 00:10:49,160 --> 00:10:51,290 Decisions were made, and information 243 00:10:51,290 --> 00:10:53,480 was passed, more in hallway conversations 244 00:10:53,480 --> 00:10:55,082 than in formal meetings. 245 00:10:55,082 --> 00:10:56,540 This is fairly efficient, but tends 246 00:10:56,540 --> 00:10:58,580 to break down as a group grows. 
247 00:10:58,580 --> 00:11:00,920 It also has a shortcoming that if your office is 248 00:11:00,920 --> 00:11:02,900 not strategically placed, you may not 249 00:11:02,900 --> 00:11:04,670 pick up the hottest gossip. 250 00:11:04,670 --> 00:11:08,520 There are no published minutes from the hallway chats. 251 00:11:08,520 --> 00:11:11,070 Senior people on the project are all technical 252 00:11:11,070 --> 00:11:13,320 and are all involved in doing technical work. 253 00:11:13,320 --> 00:11:15,430 There are no pure managers. 254 00:11:15,430 --> 00:11:18,060 Senior designers wedge in their management responsibilities, 255 00:11:18,060 --> 00:11:19,560 as required. 256 00:11:19,560 --> 00:11:21,060 With a design of this difficulty, 257 00:11:21,060 --> 00:11:24,600 you can't afford to squander your most experienced designers 258 00:11:24,600 --> 00:11:26,910 on management tasks. 259 00:11:26,910 --> 00:11:28,530 The success of a project organized 260 00:11:28,530 --> 00:11:32,005 like this depends much more on the efforts of individuals 261 00:11:32,005 --> 00:11:33,630 than on the proper assignment of people 262 00:11:33,630 --> 00:11:36,840 to tasks and the proper tracking of each step. 263 00:11:36,840 --> 00:11:38,940 This is not an engineering factory. 264 00:11:38,940 --> 00:11:42,060 The demands of the task itself require creativity, 265 00:11:42,060 --> 00:11:45,120 and a structure of this sort encourages that creativity. 266 00:11:45,120 --> 00:11:48,030 It gives considerable latitude to individual designers, 267 00:11:48,030 --> 00:11:50,280 even those with limited experience. 268 00:11:50,280 --> 00:11:52,392 There is, of course, a downside. 269 00:11:52,392 --> 00:11:53,850 It's a bit difficult to get a sense 270 00:11:53,850 --> 00:11:55,410 of how the design is going. 271 00:11:55,410 --> 00:11:58,590 And sometimes creativity just runs amok and creates some 272 00:11:58,590 --> 00:12:00,780 rather spectacular problems. 
273 00:12:00,780 --> 00:12:02,340 And there are also sometimes problems 274 00:12:02,340 --> 00:12:05,070 if people don't get sufficient guidance. 275 00:12:05,070 --> 00:12:06,900 On balance, however, the results are 276 00:12:06,900 --> 00:12:08,580 worth the chaos, the uncertainty, 277 00:12:08,580 --> 00:12:11,000 and the occasional crisis. 278 00:12:11,000 --> 00:12:12,180 That's it for philosophy. 279 00:12:12,180 --> 00:12:15,350 Now I'd like to move on to the organization. 280 00:12:15,350 --> 00:12:17,320 The team is organized by specialty. 281 00:12:17,320 --> 00:12:19,870 Three microarchitects wrote the behavioral model 282 00:12:19,870 --> 00:12:21,573 for nearly the full chip. 283 00:12:21,573 --> 00:12:22,990 The circuit designers were divided 284 00:12:22,990 --> 00:12:26,500 by major functional block and worked with the microarchitects 285 00:12:26,500 --> 00:12:29,260 and the layout designers to mold this behavioral model 286 00:12:29,260 --> 00:12:31,230 into the final design. 287 00:12:31,230 --> 00:12:33,988 There's considerable negotiation throughout this process, 288 00:12:33,988 --> 00:12:35,530 and the behavioral model is generally 289 00:12:35,530 --> 00:12:39,088 rewritten as the physical implementation progresses. 290 00:12:39,088 --> 00:12:40,630 There's an alternative team structure 291 00:12:40,630 --> 00:12:42,410 which is also used in groups here, 292 00:12:42,410 --> 00:12:45,340 which gives smaller, well-defined sections to individuals 293 00:12:45,340 --> 00:12:47,980 who write the behavioral model and do the circuit and logic 294 00:12:47,980 --> 00:12:49,630 design all themselves. 
295 00:12:49,630 --> 00:12:52,060 That structure has the advantage that the designer doing 296 00:12:52,060 --> 00:12:55,600 the work understands all the trade-offs 297 00:12:55,600 --> 00:12:57,860 within that small section very well, 298 00:12:57,860 --> 00:12:59,680 although she's limited in her visibility 299 00:12:59,680 --> 00:13:01,570 outside that section. 300 00:13:01,570 --> 00:13:04,210 It also affords exposure to a larger part 301 00:13:04,210 --> 00:13:06,350 of the design process. 302 00:13:06,350 --> 00:13:08,200 This organizational model was used 303 00:13:08,200 --> 00:13:11,700 by the group who designed the floating point unit for EV4. 304 00:13:11,700 --> 00:13:14,088 The advantage of organizing by specialty 305 00:13:14,088 --> 00:13:15,630 is that the small number of designers 306 00:13:15,630 --> 00:13:18,640 who own the behavioral model get to see the big picture. 307 00:13:18,640 --> 00:13:20,610 They don't worry about transistors at all, 308 00:13:20,610 --> 00:13:23,490 but they know the logic level functionality of the whole chip 309 00:13:23,490 --> 00:13:24,540 very well. 310 00:13:24,540 --> 00:13:27,360 So they can optimize across the whole design. 311 00:13:27,360 --> 00:13:29,523 The challenge of using this sort of method 312 00:13:29,523 --> 00:13:31,440 is that you have to be able to transfer enough 313 00:13:31,440 --> 00:13:34,320 of the information from the behavioral modelers 314 00:13:34,320 --> 00:13:36,660 to the circuit designers so they can understand 315 00:13:36,660 --> 00:13:40,520 what the options are and make an intelligent implementation. 316 00:13:40,520 --> 00:13:42,860 Even though the behavioral model represents the function 317 00:13:42,860 --> 00:13:45,277 in terms of NANDs and NORs, it's not sufficient 318 00:13:45,277 --> 00:13:48,920 to just translate it into gates and reduce the logic. 
319 00:13:48,920 --> 00:13:50,480 You have to understand the intent 320 00:13:50,480 --> 00:13:53,570 and then figure out the best circuit techniques for the job. 321 00:13:53,570 --> 00:13:54,998 Throughout the design process, we 322 00:13:54,998 --> 00:13:57,290 made constant trade-offs between the microarchitecture, 323 00:13:57,290 --> 00:13:59,482 the chip, and the circuit implementation. 324 00:13:59,482 --> 00:14:01,940 And the effect of this effort is evident in the performance 325 00:14:01,940 --> 00:14:03,380 of EV4. 326 00:14:03,380 --> 00:14:04,897 Excluding the floating-point unit, 327 00:14:04,897 --> 00:14:06,980 there were generally between one and three circuit 328 00:14:06,980 --> 00:14:10,310 designers in each of the six major sections. 329 00:14:10,310 --> 00:14:12,680 That number of designers grew for short periods 330 00:14:12,680 --> 00:14:14,990 if there was a local schedule crunch. 331 00:14:14,990 --> 00:14:17,515 The floating-point unit used a team 332 00:14:17,515 --> 00:14:18,890 that included eight engineers who 333 00:14:18,890 --> 00:14:23,780 handled both the behavioral model and the implementation. 334 00:14:23,780 --> 00:14:27,290 Our schematic design style employs very limited hierarchy. 335 00:14:27,290 --> 00:14:28,850 The only circuits that are formally 336 00:14:28,850 --> 00:14:32,090 shared on multiple schematics are the standard latches 337 00:14:32,090 --> 00:14:33,800 that are used by the project. 338 00:14:33,800 --> 00:14:36,020 All transistors are visible on each page 339 00:14:36,020 --> 00:14:38,360 to allow easy optimization of circuits 340 00:14:38,360 --> 00:14:40,580 and to allow designers using the schematics to see 341 00:14:40,580 --> 00:14:42,470 all the interactions. 342 00:14:42,470 --> 00:14:44,510 Our schematics editor supports a form 343 00:14:44,510 --> 00:14:46,190 of symbol definition, which is local 344 00:14:46,190 --> 00:14:47,840 to a particular schematic. 
345 00:14:47,840 --> 00:14:51,320 This allows efficient reuse of circuitry on a given schematic, 346 00:14:51,320 --> 00:14:53,150 while still maintaining visibility 347 00:14:53,150 --> 00:14:55,350 into the details of the circuit. 348 00:14:55,350 --> 00:14:57,720 This type of schematic also allows the schematics 349 00:14:57,720 --> 00:15:01,520 to reflect the physical placement of transistors. 350 00:15:01,520 --> 00:15:04,220 Internal CAD tools exist to check tricky circuits 351 00:15:04,220 --> 00:15:05,458 for proper usage. 352 00:15:05,458 --> 00:15:07,250 And when the need arose during the project, 353 00:15:07,250 --> 00:15:09,630 additional CAD tools were developed. 354 00:15:09,630 --> 00:15:12,180 Schematics are assembled by simply appending the transistor 355 00:15:12,180 --> 00:15:13,320 wirelists. 356 00:15:13,320 --> 00:15:15,330 Signals which cross schematic boundaries 357 00:15:15,330 --> 00:15:17,130 are connected by naming convention, 358 00:15:17,130 --> 00:15:20,700 and other tools exist to check for consistent naming. 359 00:15:20,700 --> 00:15:22,470 Despite the importance of creativity 360 00:15:22,470 --> 00:15:25,110 and individual efforts in the design of EV4, 361 00:15:25,110 --> 00:15:26,970 it was necessary to manage and track 362 00:15:26,970 --> 00:15:29,700 the progress of the project at some level, 363 00:15:29,700 --> 00:15:32,070 if only to adjust the allocation of resources 364 00:15:32,070 --> 00:15:33,900 as the tasks changed. 365 00:15:33,900 --> 00:15:36,420 The challenge is to accomplish this without negatively 366 00:15:36,420 --> 00:15:38,200 impacting the design. 367 00:15:38,200 --> 00:15:40,440 There is a natural tendency to overmanage 368 00:15:40,440 --> 00:15:42,360 complex projects like this, rather than 369 00:15:42,360 --> 00:15:44,610 trusting individuals to behave reasonably 370 00:15:44,610 --> 00:15:46,570 in the absence of management. 
371 00:15:46,570 --> 00:15:49,320 An overmanaged project tends to give the higher 372 00:15:49,320 --> 00:15:51,900 levels of management a warm, fuzzy feeling about how 373 00:15:51,900 --> 00:15:54,390 the project is going, and the manager responsible 374 00:15:54,390 --> 00:15:56,100 is usually rewarded. 375 00:15:56,100 --> 00:15:58,920 However, the goal of the Alpha CPU design project 376 00:15:58,920 --> 00:16:01,470 was to design the fastest chip possible, not 377 00:16:01,470 --> 00:16:02,940 the best-managed project. 378 00:16:02,940 --> 00:16:04,650 And the electrons involved don't really 379 00:16:04,650 --> 00:16:07,230 care how the project was managed. 380 00:16:07,230 --> 00:16:10,170 Consequently, we tried for the minimum acceptable level 381 00:16:10,170 --> 00:16:11,670 of formal management. 382 00:16:11,670 --> 00:16:13,770 The methods used to run the project varied 383 00:16:13,770 --> 00:16:15,930 as the design developed, usually in response 384 00:16:15,930 --> 00:16:19,510 to a recognized deficiency in the current method. 385 00:16:19,510 --> 00:16:22,967 For example, in the early part of the EV3 implementation, 386 00:16:22,967 --> 00:16:24,550 all the circuit designers were getting 387 00:16:24,550 --> 00:16:26,710 used to the single wire two-phase clocking 388 00:16:26,710 --> 00:16:29,470 technique which we were just beginning to use. 389 00:16:29,470 --> 00:16:32,920 We had used a four-phase design on earlier CPUs, 390 00:16:32,920 --> 00:16:36,670 and the new clocking method required new techniques. 391 00:16:36,670 --> 00:16:38,678 While some standard latches had been defined, 392 00:16:38,678 --> 00:16:40,720 there were lots of other issues to be ironed out, 393 00:16:40,720 --> 00:16:44,770 like clever ways to minimize the latch delays in critical paths. 394 00:16:44,770 --> 00:16:47,350 We couldn't just present a set of proven circuits 395 00:16:47,350 --> 00:16:49,180 and tell everybody, use these circuits. 
396 00:16:49,180 --> 00:16:52,970 We didn't know what the proven circuits were. 397 00:16:52,970 --> 00:16:55,220 So for a while, people toiled away individually, 398 00:16:55,220 --> 00:16:57,770 discovering what worked and what didn't. 399 00:16:57,770 --> 00:17:00,470 And during this period we instituted a series 400 00:17:00,470 --> 00:17:03,710 of weekly chats to disseminate information between the circuit 401 00:17:03,710 --> 00:17:04,910 designers. 402 00:17:04,910 --> 00:17:07,430 Chats were about an hour long and were held about once 403 00:17:07,430 --> 00:17:08,619 a week. 404 00:17:08,619 --> 00:17:11,200 People generally presented about once a month, 405 00:17:11,200 --> 00:17:14,500 as the topics of the chats rotated around between circuits 406 00:17:14,500 --> 00:17:16,900 from various sections of the chip. 407 00:17:16,900 --> 00:17:18,400 Presenters were allowed 10 minutes 408 00:17:18,400 --> 00:17:20,140 to get ready for the talk and presented 409 00:17:20,140 --> 00:17:22,400 work that was in progress. 410 00:17:22,400 --> 00:17:24,560 The work to be presented was up to the individual. 411 00:17:24,560 --> 00:17:26,119 And attendance was limited to members 412 00:17:26,119 --> 00:17:28,310 of the design team and one representative 413 00:17:28,310 --> 00:17:32,930 from each of the other two CPU design teams in the building. 414 00:17:32,930 --> 00:17:35,180 The intent was to establish an atmosphere where 415 00:17:35,180 --> 00:17:38,270 half-baked ideas could be freely presented and critiqued 416 00:17:38,270 --> 00:17:42,200 in a friendly manner and to allow all team members to steal 417 00:17:42,200 --> 00:17:44,720 clever ideas from each other. 418 00:17:44,720 --> 00:17:46,630 Some of the chats were very useful 419 00:17:46,630 --> 00:17:49,330 and presented ideas that spread throughout the team. 420 00:17:49,330 --> 00:17:51,317 Others were significantly less useful. 
421 00:17:51,317 --> 00:17:53,650 And as time went on, we started to see the same circuits 422 00:17:53,650 --> 00:17:55,930 over and over again each week, so we just discontinued 423 00:17:55,930 --> 00:17:57,040 the chats. 424 00:17:57,040 --> 00:17:58,840 In all, they went on for about five months 425 00:17:58,840 --> 00:18:01,382 in the beginning of the project, and they're a reasonable way 426 00:18:01,382 --> 00:18:04,930 to spread information around inside a loosely coupled team. 427 00:18:04,930 --> 00:18:06,520 Formal reviews were generally limited 428 00:18:06,520 --> 00:18:09,190 to reviews of schematics before they entered layout. 429 00:18:09,190 --> 00:18:12,160 These were attended by the leader of the section, the chip 430 00:18:12,160 --> 00:18:15,070 leader, the lead layout designer, the designer 431 00:18:15,070 --> 00:18:17,620 of the schematic, and the layout designer who was 432 00:18:17,620 --> 00:18:19,570 going to receive the schematic. 433 00:18:19,570 --> 00:18:22,840 Participants looked for bad circuit structures, 434 00:18:22,840 --> 00:18:26,770 obscure critical paths, layout constraints, and assumptions. 435 00:18:26,770 --> 00:18:28,720 They did not review SPICE simulations 436 00:18:28,720 --> 00:18:31,030 or logical functionality. 437 00:18:31,030 --> 00:18:34,600 Layout constraints like nonminimum wire width spacings 438 00:18:34,600 --> 00:18:36,370 and sensitive nodes were included 439 00:18:36,370 --> 00:18:40,170 in the schematics as a note and were discussed at the meeting. 440 00:18:40,170 --> 00:18:42,030 In addition to providing a circuit review 441 00:18:42,030 --> 00:18:45,450 for the schematic, the layout review served other purposes. 
442 00:18:45,450 --> 00:18:47,340 The meetings allowed the lead layout designer 443 00:18:47,340 --> 00:18:48,930 to more efficiently supervise the layout, 444 00:18:48,930 --> 00:18:50,430 because he understood all the constraints 445 00:18:50,430 --> 00:18:51,870 for each one of the schematics, and he 446 00:18:51,870 --> 00:18:54,030 could check to make sure that they were followed. 447 00:18:54,030 --> 00:18:56,277 It also provided a forum for the layout designers 448 00:18:56,277 --> 00:18:58,110 to object to unreasonable layout assumptions 449 00:18:58,110 --> 00:19:00,110 by circuit designers, and every once in a while, 450 00:19:00,110 --> 00:19:01,095 we slip those in. 451 00:19:01,095 --> 00:19:03,780 And it allowed them to understand what was important 452 00:19:03,780 --> 00:19:07,330 and where they might be able to cheat if they got stuck. 453 00:19:07,330 --> 00:19:09,550 Passing this sort of information between the people 454 00:19:09,550 --> 00:19:12,220 responsible for different stages of the implementation-- 455 00:19:12,220 --> 00:19:14,710 either the microarchitects and the circuit designers 456 00:19:14,710 --> 00:19:16,870 or the circuit designers and the layout designers-- 457 00:19:16,870 --> 00:19:19,110 it's difficult. There's a lot of information. 458 00:19:19,110 --> 00:19:21,610 There's a lot of assumptions which aren't stated, generally, 459 00:19:21,610 --> 00:19:23,068 and it's not always clear what will 460 00:19:23,068 --> 00:19:25,455 be important to the person at the next level. 461 00:19:25,455 --> 00:19:27,580 However, getting enough of this information flowing 462 00:19:27,580 --> 00:19:30,340 back and forth is essential to producing 463 00:19:30,340 --> 00:19:31,660 a high-performance part. 464 00:19:31,660 --> 00:19:33,940 And the success of this sort of interaction 465 00:19:33,940 --> 00:19:37,635 generally determines the success of the project. 
466 00:19:37,635 --> 00:19:40,010 Particularly tricky structures were reviewed individually 467 00:19:40,010 --> 00:19:41,750 before they were ready for layout. 468 00:19:41,750 --> 00:19:44,420 Typical examples include the integer adder and the register 469 00:19:44,420 --> 00:19:49,240 file, and the register conflict logic, which is in the I-Box. 470 00:19:49,240 --> 00:19:50,800 Other sorts of checks on the design 471 00:19:50,800 --> 00:19:54,280 were done by in-house CAD tools, rather than by design review. 472 00:19:54,280 --> 00:19:56,590 Logic verification was performed by running 473 00:19:56,590 --> 00:20:00,025 handcrafted and random patterns on the behavioral model 474 00:20:00,025 --> 00:20:01,900 and comparing the results to the results that 475 00:20:01,900 --> 00:20:04,060 came out of a high-level model. 476 00:20:04,060 --> 00:20:05,890 We ran the equivalent of 30 VAX 477 00:20:05,890 --> 00:20:10,000 780 CPU years, about a billion simulated cycles 478 00:20:10,000 --> 00:20:13,240 on the EV4 behavioral model. 479 00:20:13,240 --> 00:20:16,060 The transistor wirelist from the circuit schematics 480 00:20:16,060 --> 00:20:19,025 is translated to a logic equivalent. 481 00:20:19,025 --> 00:20:20,650 And that's compared against the results 482 00:20:20,650 --> 00:20:24,130 from the behavioral model for about 380 hand-coded patterns 483 00:20:24,130 --> 00:20:26,400 and lots of random patterns. 484 00:20:26,400 --> 00:20:28,740 The hand-coded patterns comprise about 10 million 485 00:20:28,740 --> 00:20:30,940 simulated cycles. 486 00:20:30,940 --> 00:20:33,970 For both the behavioral- and the transistor-level logic 487 00:20:33,970 --> 00:20:36,760 simulations, the random patterns run constantly 488 00:20:36,760 --> 00:20:38,380 as a background job on the cluster 489 00:20:38,380 --> 00:20:41,020 and just suck up any excess CPU time. 
490 00:20:41,020 --> 00:20:42,860 Once these runs are started, you don't have 491 00:20:42,860 --> 00:20:44,110 to pay much attention to them. 492 00:20:44,110 --> 00:20:45,430 They just report errors. 493 00:20:45,430 --> 00:20:49,530 And so it doesn't use up much human effort. 494 00:20:49,530 --> 00:20:51,840 Circuit integrity is assured by a program which 495 00:20:51,840 --> 00:20:55,050 has been developed here at DEC over the last 10 years 496 00:20:55,050 --> 00:20:57,560 to handle our full custom designs. 497 00:20:57,560 --> 00:20:59,310 It runs off the transistor wirelist 498 00:20:59,310 --> 00:21:02,250 and accepts data files generated by other programs 499 00:21:02,250 --> 00:21:03,735 and also by designers. 500 00:21:03,735 --> 00:21:05,760 It checks a large set of rules related 501 00:21:05,760 --> 00:21:09,320 to noise margin, coupling, beta ratios, and that sort of thing. 502 00:21:09,320 --> 00:21:11,820 It's intended to guarantee the functionality of the circuit, 503 00:21:11,820 --> 00:21:14,400 but not necessarily the speed. 504 00:21:14,400 --> 00:21:17,190 To verify the speed, several techniques were used. 505 00:21:17,190 --> 00:21:20,700 In addition to masses of circuit simulations using SPICE, 506 00:21:20,700 --> 00:21:22,770 we run a static timing verification 507 00:21:22,770 --> 00:21:24,420 tool on the whole chip. 508 00:21:24,420 --> 00:21:27,750 It, too, works off the transistor wirelist and accepts 509 00:21:27,750 --> 00:21:30,990 extracted node capacitances and wire resistances 510 00:21:30,990 --> 00:21:32,970 generated by other tools. 511 00:21:32,970 --> 00:21:35,070 At these speeds, wire resistances 512 00:21:35,070 --> 00:21:37,170 are not negligible, even for wires 513 00:21:37,170 --> 00:21:39,100 contained in a single section. 514 00:21:39,100 --> 00:21:41,220 So the ability to accurately model the wire delay 515 00:21:41,220 --> 00:21:42,750 is important. 
516 00:21:42,750 --> 00:21:45,150 The accuracy of the timing tool is good enough 517 00:21:45,150 --> 00:21:47,490 for the vast majority of our circuits. 518 00:21:47,490 --> 00:21:49,950 The major exceptions are circuits like the integer adder 519 00:21:49,950 --> 00:21:52,260 and the register file that use some non-standard design 520 00:21:52,260 --> 00:21:53,610 techniques. 521 00:21:53,610 --> 00:21:55,590 The tool doesn't replace SPICE at all. 522 00:21:55,590 --> 00:21:58,110 But it complements it nicely, typically finding 523 00:21:58,110 --> 00:22:00,660 paths that are slow, not because they're particularly hard, 524 00:22:00,660 --> 00:22:03,118 but because somebody didn't pay a lot of attention to them, 525 00:22:03,118 --> 00:22:06,240 or because the capacitance that ended up on some of the nodes 526 00:22:06,240 --> 00:22:08,560 was not what was expected. 527 00:22:08,560 --> 00:22:11,940 Now I'd like to talk about what we built. This is the EV4 chip. 528 00:22:11,940 --> 00:22:16,620 It's a 200 megahertz, 64-bit CMOS microprocessor. 529 00:22:16,620 --> 00:22:21,030 It's fabricated in our CMOS-4 process, a 0.75 micron 530 00:22:21,030 --> 00:22:23,100 process with half-micron L effectives, 531 00:22:23,100 --> 00:22:24,690 three layers of metal. 532 00:22:24,690 --> 00:22:29,250 The chip is 13.9 by 16.8 millimeters, 533 00:22:29,250 --> 00:22:32,910 contains 1.68 million transistors, about a million 534 00:22:32,910 --> 00:22:35,010 of which are in the two caches. 535 00:22:35,010 --> 00:22:39,570 It's packaged in a 431 pin grid array package, 536 00:22:39,570 --> 00:22:44,280 with 291 signal pins and 140 power pins. 537 00:22:44,280 --> 00:22:49,200 It dissipates 30 watts at 200 megahertz with a 3.3-volt power 538 00:22:49,200 --> 00:22:50,460 supply. 
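The power, voltage, and frequency figures just quoted can be tied together with the usual first-order CMOS relation P = C * Vdd^2 * f. The sketch below is only a back-of-envelope check under that assumed model (it ignores short-circuit and static current); the function names are illustrative and are not from the talk. The implied capacitance comes out in the same range as the 12,500 pF effective switching capacitance the talk quotes later from power measurement.

```python
# Figures quoted in the talk.
POWER_W = 30.0      # dissipation at 200 MHz
VDD_V = 3.3         # supply voltage
CLOCK_HZ = 200e6    # clock frequency


def effective_switched_capacitance(power_w, vdd_v, clock_hz):
    """Capacitance C satisfying P = C * Vdd^2 * f (assumed first-order model)."""
    return power_w / (vdd_v ** 2 * clock_hz)


def power_from_capacitance(cap_f, vdd_v, clock_hz):
    """Dynamic power implied by an effective switched capacitance."""
    return cap_f * vdd_v ** 2 * clock_hz


def average_supply_current(power_w, vdd_v):
    """Average current drawn from the supply, I = P / Vdd."""
    return power_w / vdd_v


c_eff = effective_switched_capacitance(POWER_W, VDD_V, CLOCK_HZ)
i_avg = average_supply_current(POWER_W, VDD_V)
# 12,500 pF is the measured figure quoted later in the talk.
p_check = power_from_capacitance(12500e-12, VDD_V, CLOCK_HZ)

print(f"implied switched capacitance: {c_eff * 1e9:.2f} nF")
print(f"average supply current:       {i_avg:.2f} A")
print(f"power implied by 12,500 pF:   {p_check:.1f} W")
```

Under this model, 30 watts corresponds to roughly 14 nF switched every cycle and about 9 amps of average supply current.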
539 00:22:50,460 --> 00:22:53,460 The chip implements a 43-bit subset 540 00:22:53,460 --> 00:22:57,090 of a defined 64-bit linear virtual address space 541 00:22:57,090 --> 00:23:01,740 and provides addressing for 2 to the 34th physical addresses. 542 00:23:01,740 --> 00:23:04,770 It can issue up to two instructions for each 543 00:23:04,770 --> 00:23:09,000 5-nanosecond cycle between pairwise combinations of four 544 00:23:09,000 --> 00:23:09,990 functional units-- 545 00:23:09,990 --> 00:23:12,840 an integer operation unit, a floating-point operations unit, 546 00:23:12,840 --> 00:23:15,710 a load store unit, and a branch unit. 547 00:23:15,710 --> 00:23:19,700 The chip contains 44 translation lookaside buffer entries, 548 00:23:19,700 --> 00:23:23,510 32 data TLB entries, and 12 instruction TLB entries. 549 00:23:23,510 --> 00:23:25,370 It contains an on-chip write buffer, 550 00:23:25,370 --> 00:23:28,850 which has four 32-byte entries. 551 00:23:28,850 --> 00:23:31,670 There are two primary caches, an 8k ICache 552 00:23:31,670 --> 00:23:33,530 and an 8k DCache. 553 00:23:33,530 --> 00:23:38,780 And the bus interface is either 128- or 64-bit data buses 554 00:23:38,780 --> 00:23:42,320 with a separate 34-bit address bus. 555 00:23:42,320 --> 00:23:45,500 We can interface to either 5-volt CMOS or TTL, 556 00:23:45,500 --> 00:23:46,770 or 100K ECL. 557 00:23:46,770 --> 00:23:50,080 558 00:23:50,080 --> 00:23:51,700 As I said earlier, one of our goals 559 00:23:51,700 --> 00:23:54,930 was to make a chip that was fast enough to create a stir. 560 00:23:54,930 --> 00:23:57,880 At 200 megahertz, we've done this. 561 00:23:57,880 --> 00:23:59,770 Lots of things contributed to making 562 00:23:59,770 --> 00:24:01,023 a chip that runs this fast. 563 00:24:01,023 --> 00:24:02,440 There's not just one silver bullet 564 00:24:02,440 --> 00:24:05,300 that you need to get success. 565 00:24:05,300 --> 00:24:07,600 These are some of the items that we used. 
566 00:24:07,600 --> 00:24:10,270 As I discussed before, we have a CMOS process, 567 00:24:10,270 --> 00:24:13,030 which is optimized for doing high-performance microprocessor 568 00:24:13,030 --> 00:24:14,090 design. 569 00:24:14,090 --> 00:24:15,880 It's got 3.3-volt power supplies, 570 00:24:15,880 --> 00:24:20,513 105-angstrom gate oxides, half-micron L effectives, 571 00:24:20,513 --> 00:24:22,930 and the three layers of metal, the third layer of which is 572 00:24:22,930 --> 00:24:25,090 very thick. 573 00:24:25,090 --> 00:24:27,280 It's got a high-performance architecture, 574 00:24:27,280 --> 00:24:30,160 in which the bottlenecks for the implementation 575 00:24:30,160 --> 00:24:34,060 have been removed, and a superpipelined microarchitecture, 576 00:24:34,060 --> 00:24:36,460 in which each of the stages are carefully 577 00:24:36,460 --> 00:24:39,340 balanced so that we don't overload a particular stage. 578 00:24:39,340 --> 00:24:40,880 It's a single chip implementation, 579 00:24:40,880 --> 00:24:43,750 so there's no chip-to-chip delays. 580 00:24:43,750 --> 00:24:48,420 And we did very careful logic, circuit, and layout design. 581 00:24:48,420 --> 00:24:53,220 Most of the critical paths could be categorized as precharged, 582 00:24:53,220 --> 00:24:57,240 and some are non-precharged with very tricky circuits. 583 00:24:57,240 --> 00:24:59,910 There's very little synthesis in the chip. 584 00:24:59,910 --> 00:25:01,578 We synthesize some of the circuits 585 00:25:01,578 --> 00:25:03,120 in some of the control sections where 586 00:25:03,120 --> 00:25:04,412 they weren't in critical paths. 587 00:25:04,412 --> 00:25:07,350 But in general, at these sorts of speeds, 588 00:25:07,350 --> 00:25:10,500 it's very hard to make synthesized circuits work. 589 00:25:10,500 --> 00:25:12,210 We have a comprehensive set of CAD tools 590 00:25:12,210 --> 00:25:13,830 to analyze both the functionality 591 00:25:13,830 --> 00:25:16,090 and the speed of the chip. 
592 00:25:16,090 --> 00:25:19,120 And when all else fails, we use brute force. 593 00:25:19,120 --> 00:25:22,570 There's 128 nanofarads of on-chip decoupling cap, 594 00:25:22,570 --> 00:25:24,160 built out of thin oxide capacitance 595 00:25:24,160 --> 00:25:27,470 and spread around the chip wherever we had space. 596 00:25:27,470 --> 00:25:32,120 There is a 250,000-micron transistor in the clock driver. 597 00:25:32,120 --> 00:25:33,770 We'll come back to that later. 598 00:25:33,770 --> 00:25:38,280 That 250,000 microns is 10 inches. 599 00:25:38,280 --> 00:25:41,020 This is a photograph of the EV4 chip. 600 00:25:41,020 --> 00:25:44,440 The integer unit is across the top in the center. 601 00:25:44,440 --> 00:25:47,440 The floating-point unit is across the bottom. 602 00:25:47,440 --> 00:25:49,210 The caches are on the right and the left, 603 00:25:49,210 --> 00:25:53,320 with the data cache on the right and the instruction 604 00:25:53,320 --> 00:25:54,970 cache on the left. 605 00:25:54,970 --> 00:25:58,060 The write buffer is in the upper right-hand corner. 606 00:25:58,060 --> 00:26:00,970 In the center of the chip is the clock driver 607 00:26:00,970 --> 00:26:03,040 that extends all the way between the two caches 608 00:26:03,040 --> 00:26:04,690 about in the center of the chip. 609 00:26:04,690 --> 00:26:07,240 And the dark areas on either side of it 610 00:26:07,240 --> 00:26:10,933 are the decoupling cap associated with the clock. 611 00:26:10,933 --> 00:26:12,350 One of the things I mentioned when 612 00:26:12,350 --> 00:26:13,850 I talked about the goals of the chip 613 00:26:13,850 --> 00:26:16,700 was that we needed to design considerable flexibility so 614 00:26:16,700 --> 00:26:19,120 that we could support a wide variety of systems. 615 00:26:19,120 --> 00:26:22,430 The pin bus supports a fair amount of flexibility. 
616 00:26:22,430 --> 00:26:26,750 We support cache sizes from 128 k-bytes to eight megabytes 617 00:26:26,750 --> 00:26:28,580 or no cache at all. 618 00:26:28,580 --> 00:26:33,267 The cache speeds can be between 3 and 16 CPU cycles. 619 00:26:33,267 --> 00:26:35,350 You get to set this up when you power up the chip. 620 00:26:35,350 --> 00:26:38,210 And so you can use either slow chips, slow RAM chips, 621 00:26:38,210 --> 00:26:41,120 or fast RAM chips, depending on what you want. 622 00:26:41,120 --> 00:26:43,370 The external interface runs synchronously 623 00:26:43,370 --> 00:26:45,950 to a system clock, which is supplied by the CPU. 624 00:26:45,950 --> 00:26:48,290 And you can also program how fast that runs when 625 00:26:48,290 --> 00:26:50,250 you power the chip up. 626 00:26:50,250 --> 00:26:52,380 The width of the data bus can be chosen 627 00:26:52,380 --> 00:26:57,810 to be either 128 bits or 64 bits and to be either TTL or CMOS 628 00:26:57,810 --> 00:27:00,210 levels or ECL levels. 629 00:27:00,210 --> 00:27:03,300 We support longword ECC and longword parity. 630 00:27:03,300 --> 00:27:06,120 And we have a simple diagnostic interface 631 00:27:06,120 --> 00:27:09,780 via off-chip serial ROM that provides a mechanism 632 00:27:09,780 --> 00:27:11,580 for loading the ICache on power up 633 00:27:11,580 --> 00:27:14,710 and also for doing some diagnostics. 634 00:27:14,710 --> 00:27:16,720 For most operations, the CPU chip 635 00:27:16,720 --> 00:27:20,290 accesses the cache directly in a combinatorial loop, simply 636 00:27:20,290 --> 00:27:23,470 by presenting an address and waiting n CPU cycles 637 00:27:23,470 --> 00:27:25,780 for control, tag, and data to appear, 638 00:27:25,780 --> 00:27:29,530 where you get to program the value of n on power up. 639 00:27:29,530 --> 00:27:33,200 This allows you to use SRAMs of almost any speed. 
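The programmable cache timing described above amounts to picking the smallest cycle count n that covers the external loop delay. The sketch below illustrates that choice using the 5-nanosecond cycle and the 3-to-16 cycle range quoted in the talk. The `board_overhead_ns` parameter and the function name are hypothetical, introduced only for illustration; how a real system budgets address drive, tag compare, and data return is not described here.

```python
import math

CPU_CYCLE_NS = 5.0                 # 200 MHz clock, from the talk
MIN_CYCLES, MAX_CYCLES = 3, 16     # programmable range quoted in the talk


def cache_cycles(sram_access_ns, board_overhead_ns=0.0):
    """Smallest programmable n such that n CPU cycles cover the loop delay.

    board_overhead_ns is a hypothetical lump for pad, wire, and setup time.
    """
    loop_ns = sram_access_ns + board_overhead_ns
    needed = math.ceil(loop_ns / CPU_CYCLE_NS)
    n = max(MIN_CYCLES, needed)
    if n > MAX_CYCLES:
        raise ValueError("external cache loop is too slow for the programmable range")
    return n


# Illustrative SRAM speeds, not taken from the talk.
for sram_ns in (8, 12, 35):
    print(sram_ns, "ns SRAM ->", cache_cycles(sram_ns, board_overhead_ns=4.0), "cycles")
```

Very fast parts still bottom out at the minimum of 3 cycles, and anything needing more than 16 cycles is rejected rather than silently clamped.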
640 00:27:33,200 --> 00:27:34,990 Obviously, if you use faster SRAMs, 641 00:27:34,990 --> 00:27:36,700 you get better performance. 642 00:27:36,700 --> 00:27:40,390 Simple accesses are done entirely by the CPU chip 643 00:27:40,390 --> 00:27:44,350 without any intervention from the memory subsystem. 644 00:27:44,350 --> 00:27:47,200 Read hits and write hits to unshared lines 645 00:27:47,200 --> 00:27:49,540 are considered to be simple accesses. 646 00:27:49,540 --> 00:27:54,070 More complex accesses defer to a state machine 647 00:27:54,070 --> 00:27:56,727 in the external memory subsystem. 648 00:27:56,727 --> 00:27:58,310 Now I'd like to talk a little bit more 649 00:27:58,310 --> 00:28:01,710 about the clocking method that we use on EV3 and EV4. 650 00:28:01,710 --> 00:28:05,030 This slide shows the waveform on the global clock, CLK, 651 00:28:05,030 --> 00:28:07,160 and one type of latch that we use. 652 00:28:07,160 --> 00:28:08,727 We use a wide variety of latches, 653 00:28:08,727 --> 00:28:10,310 but they all share the characteristics 654 00:28:10,310 --> 00:28:12,590 that they use a single wire clock. 655 00:28:12,590 --> 00:28:14,840 This eliminated the dead time between phases 656 00:28:14,840 --> 00:28:17,870 and allowed us to just route a single clock wire 657 00:28:17,870 --> 00:28:20,177 in our thick metal 3 layer. 658 00:28:20,177 --> 00:28:22,760 Note that there's a race-through problem inherent in this type 659 00:28:22,760 --> 00:28:24,490 of latching method. 660 00:28:24,490 --> 00:28:27,145 When the clock rises, the B-latch is closing 661 00:28:27,145 --> 00:28:28,960 and the A-latch is opening, allowing 662 00:28:28,960 --> 00:28:31,930 new data to travel to the input of the B latch. 663 00:28:31,930 --> 00:28:34,690 If the new data from the A-latch gets into the B-latch 664 00:28:34,690 --> 00:28:37,150 before it closes, you're in trouble. 
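The race-through hazard can be written as a simple inequality: the shortest path from the A-latch to the B-latch must take longer than the window in which both latches are open. The sketch below assumes a linear clock ramp, and assumes the A-latch starts passing data once the clock rises past an n-channel threshold while the B-latch shuts once the clock is within a p-channel threshold of Vdd. The 0.7-volt thresholds and the path delays are hypothetical values chosen for illustration; only the 3.3-volt supply and the half-nanosecond edge come from the talk.

```python
def overlap_window_ns(vdd, vtn, vtp, edge_time_ns):
    """Time a linear 0-to-Vdd clock ramp spends between Vtn and Vdd - Vtp.

    During this window both latches are (partly) open.
    """
    span = vdd - vtp - vtn
    if span <= 0:
        return 0.0
    return edge_time_ns * span / vdd


def race_margin_ns(min_path_delay_ns, vdd, vtn, vtp, edge_time_ns):
    """Positive margin means new data arrives after the B-latch has closed."""
    return min_path_delay_ns - overlap_window_ns(vdd, vtn, vtp, edge_time_ns)


VDD = 3.3            # from the talk
VTN = VTP = 0.7      # hypothetical device thresholds
MIN_PATH_NS = 0.4    # hypothetical shortest latch-to-latch delay

crisp = race_margin_ns(MIN_PATH_NS, VDD, VTN, VTP, edge_time_ns=0.5)
lazy = race_margin_ns(MIN_PATH_NS, VDD, VTN, VTP, edge_time_ns=2.0)
print(f"margin with 0.5 ns edge: {crisp:+.3f} ns")
print(f"margin with 2.0 ns edge: {lazy:+.3f} ns")
```

The cycle time does not appear anywhere in the inequality, so a slow clock edge cannot be rescued by running the chip at a lower frequency.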
665 00:28:37,150 --> 00:28:39,430 With the latches shown, the data from the A-latch 666 00:28:39,430 --> 00:28:43,630 starts propagating when the rising clock hits the n 667 00:28:43,630 --> 00:28:45,010 threshold. 668 00:28:45,010 --> 00:28:47,740 And the B-latch closes when the clock 669 00:28:47,740 --> 00:28:50,230 hits Vdd minus the p threshold. 670 00:28:50,230 --> 00:28:52,810 As long as the clock edge is faster than the propagation 671 00:28:52,810 --> 00:28:55,580 delay between the latches, all is well. 672 00:28:55,580 --> 00:28:57,460 This requirement is not a new constraint, 673 00:28:57,460 --> 00:28:59,140 because it's in line with our desire 674 00:28:59,140 --> 00:29:01,090 to minimize the clock delay across the chip, 675 00:29:01,090 --> 00:29:03,260 so we can just get the thing to run fast. 676 00:29:03,260 --> 00:29:05,110 However, it does raise the stakes. 677 00:29:05,110 --> 00:29:07,810 If we'd chosen a clocking scheme without this transition 678 00:29:07,810 --> 00:29:09,910 time constraint on the clock, and we 679 00:29:09,910 --> 00:29:12,040 were unable to control the clock delay, 680 00:29:12,040 --> 00:29:14,200 we'd have to slow down the cycle time. 681 00:29:14,200 --> 00:29:17,650 That would be disappointing, but it wouldn't be disastrous. 682 00:29:17,650 --> 00:29:19,300 With this sort of latch, if you're 683 00:29:19,300 --> 00:29:21,730 unable to maintain snappy clock edges, 684 00:29:21,730 --> 00:29:24,160 the chips are non-functional at any speed. 685 00:29:24,160 --> 00:29:25,810 Consequently, considerable attention 686 00:29:25,810 --> 00:29:27,730 was paid to the proper design and analysis 687 00:29:27,730 --> 00:29:31,444 of the clock driver and the distribution grid. 688 00:29:31,444 --> 00:29:34,480 The slide shows the distribution of clock load 689 00:29:34,480 --> 00:29:36,730 among the major functional units. 
690 00:29:36,730 --> 00:29:39,470 All sections except for the bus interface unit 691 00:29:39,470 --> 00:29:41,210 receive the clock directly. 692 00:29:41,210 --> 00:29:44,380 The BIU operates from a single-level buffered version 693 00:29:44,380 --> 00:29:46,720 of the main clock. 694 00:29:46,720 --> 00:29:50,650 The total load on the clock node is 3,250 pF, 695 00:29:50,650 --> 00:29:52,780 with the majority of the load occurring in the two 696 00:29:52,780 --> 00:29:55,990 main functional units to the north and south of the clock 697 00:29:55,990 --> 00:29:57,260 driver. 698 00:29:57,260 --> 00:30:01,370 The clock driver and pre-driver represent about 50% 699 00:30:01,370 --> 00:30:03,560 of the total effective switching capacitance 700 00:30:03,560 --> 00:30:06,620 for the chip, which has been determined by power measurement 701 00:30:06,620 --> 00:30:10,620 to be 12,500 pF. 702 00:30:10,620 --> 00:30:13,680 To manage the problem of the di/dt associated with the chip power 703 00:30:13,680 --> 00:30:16,860 pins, explicit decoupling is provided on chip. 704 00:30:16,860 --> 00:30:19,657 This consists of thin oxide capacitance, 705 00:30:19,657 --> 00:30:21,990 which is distributed over the entire surface of the chip 706 00:30:21,990 --> 00:30:23,610 wherever we have space. 707 00:30:23,610 --> 00:30:25,440 In the case of the clock driver, there's 708 00:30:25,440 --> 00:30:27,030 a stripe above and below the driver 709 00:30:27,030 --> 00:30:31,230 itself to supply the charge for the switching of the clock. 710 00:30:31,230 --> 00:30:33,930 The total decoupling cap extracted from layout 711 00:30:33,930 --> 00:30:39,040 is 128,000 pF or 0.13 microfarads. 712 00:30:39,040 --> 00:30:41,500 So the ratio of the decoupling cap to the switching cap 713 00:30:41,500 --> 00:30:43,090 is about 10 to 1. 
714 00:30:43,090 --> 00:30:45,490 And the decoupling cap can supply all the charge 715 00:30:45,490 --> 00:30:49,660 associated with a single CPU cycle with only a 10% hit 716 00:30:49,660 --> 00:30:51,940 in the Vdd supply voltage. 717 00:30:51,940 --> 00:30:54,520 The clock driver itself consists of a binary fanning tree, 718 00:30:54,520 --> 00:30:56,380 which distributes the 200-megahertz signal 719 00:30:56,380 --> 00:31:00,170 in a matched fashion to a row of 145 pre-driver circuits, 720 00:31:00,170 --> 00:31:02,200 each five stages deep. 721 00:31:02,200 --> 00:31:04,810 The signals are shorted at the output of the fanning tree 722 00:31:04,810 --> 00:31:07,920 and in the final three stages of the pre-driver. 723 00:31:07,920 --> 00:31:10,950 The inverter in the final stage, which actually drives the clock 724 00:31:10,950 --> 00:31:15,240 node, contains a PMOS, which is 10 and 11/64 inches wide, 725 00:31:15,240 --> 00:31:19,380 and an NMOS, which is 4 and 5/64 inches wide. 726 00:31:19,380 --> 00:31:22,240 The driver switches its load in half a nanosecond. 727 00:31:22,240 --> 00:31:25,963 And the peak switching current in the driver is about 43 amps. 728 00:31:25,963 --> 00:31:28,380 Now, you can see why we need to have local decoupling cap. 729 00:31:28,380 --> 00:31:30,990 You can't pull those electrons in through the inductance 730 00:31:30,990 --> 00:31:32,670 in the pin wires. 731 00:31:32,670 --> 00:31:34,890 The clock driver is located in the horizontal center 732 00:31:34,890 --> 00:31:37,710 of the chip and drives into a grid structure with metal 3 733 00:31:37,710 --> 00:31:40,470 clock lines running vertically in a regular pattern, 734 00:31:40,470 --> 00:31:43,450 alternating with Vdd and Vss. 735 00:31:43,450 --> 00:31:46,390 The horizontal lines are a higher resistance metal 2, 736 00:31:46,390 --> 00:31:49,030 which are interspersed in a more irregular fashion. 
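The decoupling and clock-driver numbers quoted above hang together under some simple arithmetic. The sketch below is a rough cross-check, not the team's analysis. It assumes the decoupling cap alone supplies one cycle's worth of charge (Q = C_switch * Vdd), and it assumes the clock driver's current pulse is triangular over the half-nanosecond edge, so that the peak is twice the average. Both modelling choices are assumptions made here for illustration.

```python
# Figures quoted in the talk.
C_CLOCK_F = 3250e-12        # total load on the clock node
C_SWITCH_F = 12500e-12      # effective switching capacitance of the chip
C_DECOUPLE_F = 128000e-12   # on-chip decoupling capacitance
VDD_V = 3.3
EDGE_S = 0.5e-9             # time for the driver to switch its load


def supply_droop_fraction(c_switch, c_decouple):
    """Fractional Vdd drop if the decoupling cap alone supplies one cycle's charge.

    Q = c_switch * Vdd, dV = Q / c_decouple, so dV / Vdd = c_switch / c_decouple.
    """
    return c_switch / c_decouple


def triangular_peak_current(c_load, vdd, edge_s):
    """Peak of an assumed triangular current pulse delivering Q = c_load * vdd."""
    charge = c_load * vdd
    return 2.0 * charge / edge_s


ratio = C_DECOUPLE_F / C_SWITCH_F
droop = supply_droop_fraction(C_SWITCH_F, C_DECOUPLE_F)
i_peak = triangular_peak_current(C_CLOCK_F, VDD_V, EDGE_S)

print(f"decoupling / switching ratio: {ratio:.2f}")
print(f"supply droop per cycle:       {droop * 100:.1f} %")
print(f"clock driver peak current:    {i_peak:.1f} A")
```

With these assumptions the ratio, the roughly 10% supply hit, and the peak current of about 43 amps all land close to the quoted values.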
737 00:31:49,030 --> 00:31:51,490 Since the majority of the load on the clock 738 00:31:51,490 --> 00:31:53,050 is above and below the clock driver, 739 00:31:53,050 --> 00:31:56,170 it gets fed directly from the low-resistance vertical metal 740 00:31:56,170 --> 00:31:57,530 3 lines. 741 00:31:57,530 --> 00:32:01,120 There's about 150 microns of horizontal metal 2 shorting 742 00:32:01,120 --> 00:32:04,510 the global clock node in the driver itself to minimize 743 00:32:04,510 --> 00:32:06,940 the delay differences due to load imbalance 744 00:32:06,940 --> 00:32:10,120 between adjacent metal 3 wires. 745 00:32:10,120 --> 00:32:13,630 Both the location of the clock driver and the clock and power 746 00:32:13,630 --> 00:32:16,510 bussing pattern were chosen to minimize clock skew 747 00:32:16,510 --> 00:32:20,755 and ensure crisp edges at all points in the chip. 748 00:32:20,755 --> 00:32:22,130 I mentioned that it was important 749 00:32:22,130 --> 00:32:24,380 that the clock be properly distributed across the chip 750 00:32:24,380 --> 00:32:26,900 to ensure that we had low skew and crisp edges at all 751 00:32:26,900 --> 00:32:27,980 the latches. 752 00:32:27,980 --> 00:32:30,500 We spent considerable time analyzing the grid, 753 00:32:30,500 --> 00:32:33,170 and I'd like to show you the results of that analysis. 754 00:32:33,170 --> 00:32:35,720 We extracted the entire clock node from the layout. 755 00:32:35,720 --> 00:32:40,700 It contained 630,000 resistors and 630,000 capacitors, 756 00:32:40,700 --> 00:32:45,130 and there were 63,000 transistors connected to it. 757 00:32:45,130 --> 00:32:47,920 The network was then analyzed using a linear circuit 758 00:32:47,920 --> 00:32:50,980 simulator derived from Carnegie Mellon's AWEsim circuit 759 00:32:50,980 --> 00:32:53,050 simulation program. 
760 00:32:53,050 --> 00:32:55,480 The delay to each transistor was calculated, 761 00:32:55,480 --> 00:32:57,520 and the transistors were split into buckets 762 00:32:57,520 --> 00:32:59,510 based on the delay. 763 00:32:59,510 --> 00:33:01,610 Each frame in the following sequence 764 00:33:01,610 --> 00:33:05,930 represents a single 10-picosecond delay bucket. 765 00:33:05,930 --> 00:33:08,180 Each transistor in the bucket is represented 766 00:33:08,180 --> 00:33:11,500 by a dot at its physical location. 767 00:33:11,500 --> 00:33:13,360 Now, if we run it again more slowly, 768 00:33:13,360 --> 00:33:15,460 you can see the effect of the clock grid. 769 00:33:15,460 --> 00:33:17,260 The wave starts at the clock generator 770 00:33:17,260 --> 00:33:19,090 in the center of the chip and moves 771 00:33:19,090 --> 00:33:21,085 fairly quickly in the north-south metal 772 00:33:21,085 --> 00:33:24,280 3 lines to the floating point unit and the integer units 773 00:33:24,280 --> 00:33:26,650 directly above and below the generator. 774 00:33:26,650 --> 00:33:28,960 To get to the caches on the right and left, 775 00:33:28,960 --> 00:33:31,660 or to the write buffer in the upper right, 776 00:33:31,660 --> 00:33:33,640 the clock signal must go through the thinner 777 00:33:33,640 --> 00:33:36,400 and sparser horizontal metal 2 lines. 778 00:33:36,400 --> 00:33:38,560 So it takes longer for the clock waveform 779 00:33:38,560 --> 00:33:40,690 to make it to those parts of the chip. 780 00:33:40,690 --> 00:33:42,970 The maximum delay, up in the write buffer, 781 00:33:42,970 --> 00:33:46,740 is about 300 picoseconds. 782 00:33:46,740 --> 00:33:49,200 We use the clock cartoons to identify anomalies 783 00:33:49,200 --> 00:33:50,205 in the clock grid. 
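The frame-by-frame cartoon described above is essentially a histogram of clock arrival times keyed by position. A minimal sketch of the bucketing step follows, assuming each transistor is available as an (x, y, delay) record. The record layout, the function name, and the sample coordinates are hypothetical; only the 10-picosecond bucket width and the roughly 300-picosecond maximum delay come from the talk.

```python
from collections import defaultdict

BUCKET_PS = 10  # bucket width quoted in the talk


def bucket_by_delay(transistors, bucket_ps=BUCKET_PS):
    """Group (x, y, delay_ps) records into frames of equal delay width.

    Returns {frame_index: [(x, y), ...]} ordered by frame index, so that
    plotting the frames in sequence shows the clock wave spreading out.
    """
    frames = defaultdict(list)
    for x, y, delay_ps in transistors:
        frames[int(delay_ps // bucket_ps)].append((x, y))
    return dict(sorted(frames.items()))


# Hypothetical sample: two devices near the driver, one further out,
# and one in a far corner of the die.
sample = [(0, 0, 3.0), (1, 0, 9.9), (5, 2, 10.0), (9, 9, 297.5)]
frames = bucket_by_delay(sample)
for index, dots in frames.items():
    print(f"frame {index:3d} ({index * BUCKET_PS}-{(index + 1) * BUCKET_PS} ps): {dots}")
```

Plotting each frame's dots at their physical locations, one frame after another, reproduces the kind of animation shown in the talk.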
784 00:33:50,205 --> 00:33:52,230 We were especially concerned when 785 00:33:52,230 --> 00:33:54,630 we saw the clock propagation moving inwards back 786 00:33:54,630 --> 00:33:56,610 towards the clock generator, as occurred 787 00:33:56,610 --> 00:33:58,883 in our first analysis of the grid, or in regions 788 00:33:58,883 --> 00:34:00,300 where there were major differences 789 00:34:00,300 --> 00:34:03,180 in delay between two pieces of circuitry 790 00:34:03,180 --> 00:34:04,865 that were close together. 791 00:34:04,865 --> 00:34:06,240 That sort of problem could result 792 00:34:06,240 --> 00:34:08,850 in timing hazards in the associated latches, 793 00:34:08,850 --> 00:34:10,710 and we wanted to get rid of those. 794 00:34:10,710 --> 00:34:12,210 In general, these things were solved 795 00:34:12,210 --> 00:34:13,980 by finding the piece of the layout that 796 00:34:13,980 --> 00:34:17,100 was not properly connected and just hooking it up. 797 00:34:17,100 --> 00:34:19,469 The program which generated the clock delay cartoon 798 00:34:19,469 --> 00:34:21,570 was also used in the electromigration analysis 799 00:34:21,570 --> 00:34:23,130 of the chip interconnect. 800 00:34:23,130 --> 00:34:26,159 The circuit simulation program determined the current flowing 801 00:34:26,159 --> 00:34:28,770 through each metal segment and each contact 802 00:34:28,770 --> 00:34:31,949 and checked that that current was less 803 00:34:31,949 --> 00:34:35,600 than the limit which was allowed by the electromigration rules. 804 00:34:35,600 --> 00:34:38,750 Failing segments and contacts were reported as a drc tick 805 00:34:38,750 --> 00:34:42,110 file, identifying the offending polygon and the required number 806 00:34:42,110 --> 00:34:44,210 of contacts or metal width. 
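The electromigration check described above can be pictured as a per-segment comparison of simulated current against a width-dependent limit. The sketch below shows only the shape of such a check. The limit of 1 milliamp per micron of width, the segment names, and the record layout are all hypothetical, and the real rule deck and report format are not described in the talk.

```python
def check_segments(segments, limit_ma_per_um):
    """Return (name, required_width_um) for each segment carrying too much current.

    segments: iterable of (name, width_um, current_ma) records.
    limit_ma_per_um: hypothetical electromigration limit per micron of width.
    """
    failures = []
    for name, width_um, current_ma in segments:
        required_um = current_ma / limit_ma_per_um
        if required_um > width_um:
            failures.append((name, required_um))
    return failures


# Hypothetical metal segments: (name, drawn width in microns, current in mA).
segments = [
    ("clk_m3_a", 4.0, 3.0),   # wide enough for its current
    ("clk_m2_b", 1.5, 3.0),   # too narrow at 1 mA per micron
]
failures = check_segments(segments, limit_ma_per_um=1.0)
for name, required_um in failures:
    print(f"{name}: widen to at least {required_um:.1f} um")
```

Reporting the required width rather than a bare pass or fail mirrors the talk's description of a report that tells the designer what the polygon needs.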
807 00:34:44,210 --> 00:34:46,639 This level of verification was performed on every node 808 00:34:46,639 --> 00:34:49,340 in the chip to ensure high reliability, 809 00:34:49,340 --> 00:34:52,909 despite the aggressive clock rate. 810 00:34:52,909 --> 00:34:54,739 The IO interface to the chip was designed 811 00:34:54,739 --> 00:34:58,370 to allow direct connection to ordinary 5-volt logic, 812 00:34:58,370 --> 00:35:01,400 even though the chip operates at 3.3 volts. 813 00:35:01,400 --> 00:35:03,200 An additional constraint was a desire 814 00:35:03,200 --> 00:35:06,380 to interface to 100k ECL logic for very 815 00:35:06,380 --> 00:35:08,293 high-performance systems. 816 00:35:08,293 --> 00:35:10,710 These goals were accomplished with unique input and output 817 00:35:10,710 --> 00:35:11,610 circuits. 818 00:35:11,610 --> 00:35:13,680 For inputs, an externally-supplied reference 819 00:35:13,680 --> 00:35:14,940 voltage is defined. 820 00:35:14,940 --> 00:35:17,910 For TTL operation, this is set to the TTL midpoint 821 00:35:17,910 --> 00:35:21,690 of 1.4 volts, typically using a simple resistor divider. 822 00:35:21,690 --> 00:35:23,220 For ECL, the reference voltage can 823 00:35:23,220 --> 00:35:27,035 be supplied by the reference output of a normal 100k ECL 824 00:35:27,035 --> 00:35:28,523 gate. 825 00:35:28,523 --> 00:35:30,190 Because of the problem of noise coupling 826 00:35:30,190 --> 00:35:33,040 onto this extended on-chip reference node, 827 00:35:33,040 --> 00:35:35,260 local low-pass filters and compensation 828 00:35:35,260 --> 00:35:38,740 were used at each receiver, using well resistors and thin 829 00:35:38,740 --> 00:35:40,420 oxide capacitors. 830 00:35:40,420 --> 00:35:42,340 Testing has shown that the chip can operate 831 00:35:42,340 --> 00:35:44,770 with input signals as small as 100 832 00:35:44,770 --> 00:35:48,640 millivolts relative to Vref, even at 200 megahertz. 
833 00:35:48,640 --> 00:35:51,520 The output driver represented a more complicated problem. 834 00:35:51,520 --> 00:35:54,010 To generate signals with good TTL noise margin, 835 00:35:54,010 --> 00:35:56,440 and also to interface to the ECL receivers, 836 00:35:56,440 --> 00:35:59,230 it was highly desirable to be able to drive high levels 837 00:35:59,230 --> 00:36:00,940 to the Vdd rail. 838 00:36:00,940 --> 00:36:03,960 To avoid having pins dedicated to a 5-volt power rail, 839 00:36:03,960 --> 00:36:06,730 it was also desirable to use a floating well type scheme, 840 00:36:06,730 --> 00:36:09,130 employing PMOS pull-up devices. 841 00:36:09,130 --> 00:36:11,253 A unique circuit structure achieves these goals 842 00:36:11,253 --> 00:36:12,670 without handicapping the switching 843 00:36:12,670 --> 00:36:15,403 performance of the driver. 844 00:36:15,403 --> 00:36:17,320 This schematic shows the circuit configuration 845 00:36:17,320 --> 00:36:20,350 of the bidirectional 5-volt compatible output driver. 846 00:36:20,350 --> 00:36:23,170 These three transistors represent the actual driver 847 00:36:23,170 --> 00:36:24,050 for the pin. 848 00:36:24,050 --> 00:36:26,668 You've got to worry about three different effects. 849 00:36:26,668 --> 00:36:28,210 You've got to make sure that you keep 850 00:36:28,210 --> 00:36:33,030 the voltage across the NMOS devices to less than 4 volts. 851 00:36:33,030 --> 00:36:35,130 Otherwise, you'll have a gate oxide breakdown. 852 00:36:35,130 --> 00:36:37,170 That's achieved by simply putting two of them 853 00:36:37,170 --> 00:36:40,110 in series so that you get part of the drop across each one 854 00:36:40,110 --> 00:36:43,620 of the NMOS devices. 855 00:36:43,620 --> 00:36:46,770 Next, you've got to keep the PMOS device off, 856 00:36:46,770 --> 00:36:49,790 even when the output goes above Vdd. 
857 00:36:49,790 --> 00:36:55,520 These transistors make sure that the output will follow the pin 858 00:36:55,520 --> 00:36:59,120 if the driver is supposed to be off 859 00:36:59,120 --> 00:37:02,710 and the output pin goes above Vdd. 860 00:37:02,710 --> 00:37:05,320 Finally, you've got to make sure that the well of the PMOS 861 00:37:05,320 --> 00:37:08,780 stays properly biased when the pin goes above Vdd. 862 00:37:08,780 --> 00:37:11,740 And this set of transistors accomplishes 863 00:37:11,740 --> 00:37:14,150 that under various conditions. 864 00:37:14,150 --> 00:37:17,570 I want to close by summarizing the design history for the EV3 865 00:37:17,570 --> 00:37:19,850 and EV4 chips. 866 00:37:19,850 --> 00:37:22,070 EV3 was designed in our 1-micron process, 867 00:37:22,070 --> 00:37:24,680 starting in June of 1989. 868 00:37:24,680 --> 00:37:27,260 It contained the integer unit and baby caches 869 00:37:27,260 --> 00:37:29,510 and was pin compatible with EV4. 870 00:37:29,510 --> 00:37:32,480 It taped out around Halloween of 1990 871 00:37:32,480 --> 00:37:38,660 and booted Ultrix and VMS in the ADU system in January of 1991. 872 00:37:38,660 --> 00:37:41,540 The ADU is the Alpha Development Unit. 873 00:37:41,540 --> 00:37:43,730 It's the system, the prototype system, 874 00:37:43,730 --> 00:37:46,190 that we designed in our group, in collaboration 875 00:37:46,190 --> 00:37:50,570 with DEC's Systems Research Center in California. 876 00:37:50,570 --> 00:37:52,767 We use the ADU for debugging the chip, 877 00:37:52,767 --> 00:37:54,350 and the software groups have used them 878 00:37:54,350 --> 00:37:56,030 for software development. 879 00:37:56,030 --> 00:37:58,550 We produced about 100 CPU modules, 880 00:37:58,550 --> 00:38:03,100 which are in 50 systems running in various parts of the world.
881 00:38:03,100 --> 00:38:07,600 As an aside, we also designed a non-product PC system 882 00:38:07,600 --> 00:38:10,360 using the chip, just to show that the chip could interface 883 00:38:10,360 --> 00:38:13,470 easily in a low-cost system. 884 00:38:13,470 --> 00:38:15,640 All components except for the CPU chip 885 00:38:15,640 --> 00:38:18,520 were purchased through Computer Shopper. 886 00:38:18,520 --> 00:38:21,700 EV4 added a floating-point unit and real caches and was fabbed 887 00:38:21,700 --> 00:38:24,310 in our 0.75-micron process. 888 00:38:24,310 --> 00:38:28,450 It taped out on July 14 of 1991 and came out 889 00:38:28,450 --> 00:38:31,240 of the manufacturing line on August 31. 890 00:38:31,240 --> 00:38:35,770 It booted VMS and Ultrix on September 3, 1991. 891 00:38:35,770 --> 00:38:37,870 The three days in there was because we 892 00:38:37,870 --> 00:38:40,540 had Labor Day holiday, and even with the holiday in there, 893 00:38:40,540 --> 00:38:42,490 we booted that quickly. 894 00:38:42,490 --> 00:38:44,920 In closing, I'd like to offer my congratulations 895 00:38:44,920 --> 00:38:48,460 to all the people who worked so hard to deliver this chip. 896 00:38:48,460 --> 00:38:52,060 The team worked for a long time and overcame lots of obstacles 897 00:38:52,060 --> 00:38:53,950 to get something that ran this fast. 898 00:38:53,950 --> 00:38:58,240 And their efforts should certainly be appreciated. 899 00:38:58,240 --> 00:39:01,290 [MUSIC PLAYING] 900 00:39:01,290 --> 00:40:33,000