1 00:00:00,000 --> 00:00:00,625 [MUSIC PLAYING] 2 00:00:00,625 --> 00:00:02,228 University Video Communications is 3 00:00:02,228 --> 00:00:04,770 pleased to present this edition of the "Distinguished Lecture 4 00:00:04,770 --> 00:00:05,490 Series-- 5 00:00:05,490 --> 00:00:07,470 Industry Leaders in Computer Science 6 00:00:07,470 --> 00:00:09,090 and Electrical Engineering." 7 00:00:09,090 --> 00:00:12,660 Today, Sun Microsystems brings us Dr. David Patterson 8 00:00:12,660 --> 00:00:14,610 on the story of SPARC. 9 00:00:14,610 --> 00:00:18,360 As the first person on the Sun RISC Project in 1984, 10 00:00:18,360 --> 00:00:20,220 Dr. Patterson is uniquely qualified 11 00:00:20,220 --> 00:00:21,930 to address this subject. 12 00:00:21,930 --> 00:00:25,090 A professor since 1977 at UC Berkeley, 13 00:00:25,090 --> 00:00:27,640 he led three generations of RISC projects 14 00:00:27,640 --> 00:00:30,060 and, in fact, won the Distinguished Teaching Award 15 00:00:30,060 --> 00:00:34,140 in 1982 for his course series on RISC, where many of the ideas 16 00:00:34,140 --> 00:00:35,760 were constructed and developed. 17 00:00:35,760 --> 00:00:39,330 But first, we're fortunate to have with us Wayne Rosing, vice 18 00:00:39,330 --> 00:00:41,940 president of the Desktop Systems Graphics Group at Sun, 19 00:00:41,940 --> 00:00:44,490 who will provide a background of the business environment 20 00:00:44,490 --> 00:00:46,860 during the development and introduction of SPARC. 21 00:00:46,860 --> 00:00:48,300 Wayne Rosing. 22 00:00:48,300 --> 00:00:52,260 Before Professor Dave Patterson discusses the technical aspects 23 00:00:52,260 --> 00:00:56,400 of SPARC, I'd like to spend a few moments discussing 24 00:00:56,400 --> 00:00:59,790 the context, the business context in which we developed 25 00:00:59,790 --> 00:01:01,780 this architecture. 
26 00:01:01,780 --> 00:01:04,349 First of all, it was very important for Sun 27 00:01:04,349 --> 00:01:08,190 that we implement a computer architecture that was simple. 28 00:01:08,190 --> 00:01:12,510 This system was architected with 14 people. 29 00:01:12,510 --> 00:01:16,440 Five people were involved in the design of the gate arrays. 30 00:01:16,440 --> 00:01:20,490 Two engineers actually designed the CPU board. 31 00:01:20,490 --> 00:01:23,460 And approximately five people were involved in the language 32 00:01:23,460 --> 00:01:26,050 development and the OS porting. 33 00:01:26,050 --> 00:01:29,040 So this started out as a very small project 34 00:01:29,040 --> 00:01:32,010 because Sun was only a $30 million company at the time. 35 00:01:32,010 --> 00:01:34,050 And we did not have the resources 36 00:01:34,050 --> 00:01:39,880 to engage in a complex, full custom chip development. 37 00:01:39,880 --> 00:01:45,150 Second, we wanted to have an architecture that 38 00:01:45,150 --> 00:01:49,740 was capable of scaling all the way up from simple gate arrays 39 00:01:49,740 --> 00:01:50,970 to gallium arsenide. 40 00:01:50,970 --> 00:01:55,590 This was a very important focal point for our design. 41 00:01:55,590 --> 00:01:57,930 And I think it was different than the way 42 00:01:57,930 --> 00:02:01,560 many semiconductor companies who developed microprocessors 43 00:02:01,560 --> 00:02:03,660 typically make decisions. 44 00:02:03,660 --> 00:02:05,580 The semiconductor company normally 45 00:02:05,580 --> 00:02:10,410 tends to have one process, like bipolar or CMOS, 46 00:02:10,410 --> 00:02:12,120 or whatever they might have. 47 00:02:12,120 --> 00:02:14,040 And they concentrate all of their energy 48 00:02:14,040 --> 00:02:18,270 on building the best possible part in that process. 49 00:02:18,270 --> 00:02:21,060 And often, these companies compete with each other 50 00:02:21,060 --> 00:02:24,210 on the merits of their semiconductor process.
51 00:02:24,210 --> 00:02:27,240 Since we're fundamentally a systems company, 52 00:02:27,240 --> 00:02:30,990 we felt the need to develop a computer architecture that 53 00:02:30,990 --> 00:02:32,670 would work in technologies that would 54 00:02:32,670 --> 00:02:36,420 be appropriate from inexpensive desktops 55 00:02:36,420 --> 00:02:39,720 all the way up to the implementation 56 00:02:39,720 --> 00:02:43,090 of large supercomputer class machines. 57 00:02:43,090 --> 00:02:46,830 So the architecture had to fit in multiple technologies. 58 00:02:46,830 --> 00:02:50,060 59 00:02:50,060 --> 00:02:53,300 Next, the most important thing really, 60 00:02:53,300 --> 00:02:55,490 from a business consideration, was 61 00:02:55,490 --> 00:02:58,520 to develop an architecture that would quickly 62 00:02:58,520 --> 00:03:02,810 allow us to move the approximately 2,000 software 63 00:03:02,810 --> 00:03:06,620 applications that existed for our Motorola product line 64 00:03:06,620 --> 00:03:07,910 to SPARC. 65 00:03:07,910 --> 00:03:09,680 Bill Gates from Microsoft has said 66 00:03:09,680 --> 00:03:12,690 volume is everything in the software business. 67 00:03:12,690 --> 00:03:15,920 And I think, if we look at the history of computer 68 00:03:15,920 --> 00:03:18,200 architecture development, we've seen 69 00:03:18,200 --> 00:03:20,810 hundreds of very elegant, very technologically 70 00:03:20,810 --> 00:03:25,370 sophisticated computers designed primarily by engineering folks. 71 00:03:25,370 --> 00:03:27,800 But they've mostly been business failures 72 00:03:27,800 --> 00:03:30,530 because these systems were not able to attract 73 00:03:30,530 --> 00:03:33,080 a critical mass of software. 74 00:03:33,080 --> 00:03:36,110 Without software, you don't have customers. 75 00:03:36,110 --> 00:03:38,600 Without customers, you don't have sales. 
76 00:03:38,600 --> 00:03:40,400 And until you have profitable sales, 77 00:03:40,400 --> 00:03:44,660 you cannot generate the gross margin dollars to reinvest 78 00:03:44,660 --> 00:03:48,110 in the engineering to make a computer architecture 79 00:03:48,110 --> 00:03:49,740 successful. 80 00:03:49,740 --> 00:03:51,440 So this is very important. 81 00:03:51,440 --> 00:03:54,560 Probably the dominant consideration 82 00:03:54,560 --> 00:03:58,250 in the development of SPARC was that we 83 00:03:58,250 --> 00:04:01,520 do things in a way that allowed absolute, source 84 00:04:01,520 --> 00:04:04,970 compatible, quick porting of applications 85 00:04:04,970 --> 00:04:08,260 over to the SPARC machines. 86 00:04:08,260 --> 00:04:12,730 Next, looking forward to the kinds of software 87 00:04:12,730 --> 00:04:16,329 that people run on Sun systems, which are highly interactive, 88 00:04:16,329 --> 00:04:19,149 typically graphics-oriented software, 89 00:04:19,149 --> 00:04:21,970 even though it may be very computationally intensive 90 00:04:21,970 --> 00:04:25,240 software, we really looked hard at what 91 00:04:25,240 --> 00:04:27,610 is the style of programming that's 92 00:04:27,610 --> 00:04:31,400 typical in this particular application area. 93 00:04:31,400 --> 00:04:34,660 And if you look at window systems and graphics systems, 94 00:04:34,660 --> 00:04:37,510 they're dominated by the need to have 95 00:04:37,510 --> 00:04:41,750 dynamically-linked libraries operating all the time. 96 00:04:41,750 --> 00:04:44,290 And this is an important thing. 97 00:04:44,290 --> 00:04:47,080 The classic metrics of computer design 98 00:04:47,080 --> 00:04:51,640 tend to be running Fortran programs, big batch programs, 99 00:04:51,640 --> 00:04:55,540 or running C programs that are classically large batch 100 00:04:55,540 --> 00:04:56,590 programs.
101 00:04:56,590 --> 00:04:59,500 We spent a lot of time thinking about what 102 00:04:59,500 --> 00:05:03,880 it means to have the kernel execution of Unix be efficient, 103 00:05:03,880 --> 00:05:08,140 or what it means to have interactive graphics code 104 00:05:08,140 --> 00:05:09,160 be efficient. 105 00:05:09,160 --> 00:05:12,190 And what this consideration motivated 106 00:05:12,190 --> 00:05:17,200 was the large register file in SPARC, 107 00:05:17,200 --> 00:05:19,420 as opposed to a system that would 108 00:05:19,420 --> 00:05:22,230 be more straightforward, for instance, of just 32 109 00:05:22,230 --> 00:05:24,460 32-bit registers. 110 00:05:24,460 --> 00:05:29,560 And this has been a point of controversy about RISC design. 111 00:05:29,560 --> 00:05:32,140 It's almost one of the few religious points 112 00:05:32,140 --> 00:05:35,510 so associated with the RISC machines. 113 00:05:35,510 --> 00:05:38,875 I'm sure Dave Patterson will discuss this somewhat. 114 00:05:38,875 --> 00:05:42,750 115 00:05:42,750 --> 00:05:47,640 Lastly, needless to say, we wanted to make sure that we had 116 00:05:47,640 --> 00:05:52,080 a system that was going to be efficient as programming styles 117 00:05:52,080 --> 00:05:55,120 shifted more to object oriented programming, 118 00:05:55,120 --> 00:06:00,330 both things like LISP, as well as Smalltalk, and C++. 119 00:06:00,330 --> 00:06:03,060 And the early development that we're now 120 00:06:03,060 --> 00:06:06,900 doing in advanced software in this area 121 00:06:06,900 --> 00:06:09,150 indicates these decisions were correct. 122 00:06:09,150 --> 00:06:15,050 123 00:06:15,050 --> 00:06:19,640 We wanted an open, multi-vendor architecture. 124 00:06:19,640 --> 00:06:21,950 Now, it's easy to say, well, it's open. 125 00:06:21,950 --> 00:06:24,230 But what did we really want here? 126 00:06:24,230 --> 00:06:26,030 Why did we want this? 
127 00:06:26,030 --> 00:06:30,230 The most important thing is we, as a company, 128 00:06:30,230 --> 00:06:34,160 wanted to be able to buy the components for our systems 129 00:06:34,160 --> 00:06:35,430 inexpensively. 130 00:06:35,430 --> 00:06:39,020 So it was important to have multiple vendors sourcing 131 00:06:39,020 --> 00:06:42,560 and have competition between multiple vendors. 132 00:06:42,560 --> 00:06:44,690 For instance, in the SPARCstation 1, 133 00:06:44,690 --> 00:06:48,260 we have two sources for the integer unit. 134 00:06:48,260 --> 00:06:51,650 We have two sources for the floating point units. 135 00:06:51,650 --> 00:06:54,350 This is a very important consideration 136 00:06:54,350 --> 00:06:57,290 in order to get the kind of competitive pricing 137 00:06:57,290 --> 00:07:03,300 we need to continue to produce cost-effective machines. 138 00:07:03,300 --> 00:07:06,710 Another consideration, a more subtle consideration, 139 00:07:06,710 --> 00:07:09,890 is Sun can invest so many millions of dollars per 140 00:07:09,890 --> 00:07:13,460 year in fundamental R&D of SPARC chips. 141 00:07:13,460 --> 00:07:15,710 And maybe one semiconductor company 142 00:07:15,710 --> 00:07:18,430 could double that investment. 143 00:07:18,430 --> 00:07:20,000 But if you have four or five, all 144 00:07:20,000 --> 00:07:22,880 of a sudden, the total investment dollars going 145 00:07:22,880 --> 00:07:25,130 against SPARC development goes up 146 00:07:25,130 --> 00:07:27,620 to a very significant number. 147 00:07:27,620 --> 00:07:31,550 And we feel that the investment dollars being applied to SPARC 148 00:07:31,550 --> 00:07:34,640 probably far exceed the investment dollars going 149 00:07:34,640 --> 00:07:39,740 into any one of the other competitive CISC or RISC 150 00:07:39,740 --> 00:07:41,870 computer architectures.
151 00:07:41,870 --> 00:07:45,110 And that kind of leveraging of our dollars 152 00:07:45,110 --> 00:07:49,220 with other companies' dollars is very, very fundamental. 153 00:07:49,220 --> 00:07:53,060 Now, you might think that there's a zero sum game that's 154 00:07:53,060 --> 00:07:54,590 being violated here. 155 00:07:54,590 --> 00:07:57,710 But remember, the semiconductor companies that work with us 156 00:07:57,710 --> 00:08:01,400 do not have to worry about developing languages, operating 157 00:08:01,400 --> 00:08:05,240 systems, workstations, and development platforms. 158 00:08:05,240 --> 00:08:08,180 So they're able to concentrate on what they do best 159 00:08:08,180 --> 00:08:11,540 and not have to subsidize as part of their computer 160 00:08:11,540 --> 00:08:16,820 R&D all of the other aspects of a complete system development. 161 00:08:16,820 --> 00:08:18,830 And that's been an attractive thing 162 00:08:18,830 --> 00:08:22,190 because we have been able to work with small companies, 163 00:08:22,190 --> 00:08:25,280 for instance like Bipolar Integrated Technology, which 164 00:08:25,280 --> 00:08:29,000 is a startup doing an advanced ECL process. 165 00:08:29,000 --> 00:08:31,850 It would have been impossible for a startup computer 166 00:08:31,850 --> 00:08:36,380 company in the ECL area to ever develop a microprocessor 167 00:08:36,380 --> 00:08:39,530 without the kind of leverage that the Sun business 168 00:08:39,530 --> 00:08:43,520 strategy has provided. 169 00:08:43,520 --> 00:08:47,810 Last and most important, we want the pace of innovation 170 00:08:47,810 --> 00:08:51,650 to SPARC development not to be set by Sun, 171 00:08:51,650 --> 00:08:55,190 not to be set by the business strategy of any one 172 00:08:55,190 --> 00:08:56,630 corporation. 173 00:08:56,630 --> 00:09:00,020 We wanted a consortium to control this.
174 00:09:00,020 --> 00:09:02,360 And SPARC International, which we 175 00:09:02,360 --> 00:09:06,800 have formed with all of the SPARC component licensees 176 00:09:06,800 --> 00:09:11,480 and with the participation of a number of the architecture 177 00:09:11,480 --> 00:09:15,110 licensees, now has effective control 178 00:09:15,110 --> 00:09:17,090 of the SPARC architecture. 179 00:09:17,090 --> 00:09:20,300 We think that's a very important consideration, in terms 180 00:09:20,300 --> 00:09:23,180 of making this rather, if you will, 181 00:09:23,180 --> 00:09:26,330 upstart architecture from a systems company 182 00:09:26,330 --> 00:09:30,470 become one of the major players in the volume microprocessor 183 00:09:30,470 --> 00:09:32,790 business. 184 00:09:32,790 --> 00:09:34,680 So thank you for that. 185 00:09:34,680 --> 00:09:37,080 Are there any questions? 186 00:09:37,080 --> 00:09:39,860 How would you describe the growth and acceptance 187 00:09:39,860 --> 00:09:42,920 of the SPARC processor in terms of time? 188 00:09:42,920 --> 00:09:46,190 And what problems have some of your partners experienced 189 00:09:46,190 --> 00:09:50,240 in porting from the Motorola-based computers 190 00:09:50,240 --> 00:09:52,100 to SPARC processors? 191 00:09:52,100 --> 00:09:54,830 Before the Sun 4 was first introduced, the first SPARC 192 00:09:54,830 --> 00:09:58,790 machine, we had successful instances of 500,000 line 193 00:09:58,790 --> 00:10:02,030 programs coming into the porting center, 194 00:10:02,030 --> 00:10:04,890 being recompiled, and running the first time. 195 00:10:04,890 --> 00:10:08,840 So from the beginning, we often had very successful ports. 196 00:10:08,840 --> 00:10:12,170 There were some areas in the early period 197 00:10:12,170 --> 00:10:14,510 of the architecture where we had some compiler problems. 198 00:10:14,510 --> 00:10:18,230 And we had problems with shared Fortran common 199 00:10:18,230 --> 00:10:19,230 that we had to work out. 
200 00:10:19,230 --> 00:10:21,050 So there was about a six month window 201 00:10:21,050 --> 00:10:24,440 there, where we had a little bit of difficulty because 202 00:10:24,440 --> 00:10:26,840 of some data alignment problems that we 203 00:10:26,840 --> 00:10:28,790 had neglected in the software. 204 00:10:28,790 --> 00:10:31,910 Once those problems were overcome, 205 00:10:31,910 --> 00:10:35,300 after that, the porting has been very smooth. 206 00:10:35,300 --> 00:10:38,030 And getting the larger companies to port 207 00:10:38,030 --> 00:10:40,010 has not been a technical issue. 208 00:10:40,010 --> 00:10:42,830 It's typically been a business consideration. 209 00:10:42,830 --> 00:10:45,170 Many of the large software companies 210 00:10:45,170 --> 00:10:50,090 really are not prepared to port their applications to systems 211 00:10:50,090 --> 00:10:55,370 until they see an installed base of a significant enough unit 212 00:10:55,370 --> 00:10:56,480 volumes. 213 00:10:56,480 --> 00:10:59,930 And where SPARC is in very good shape 214 00:10:59,930 --> 00:11:02,610 is, as of I think the end of this summer, 215 00:11:02,610 --> 00:11:04,220 there is no question that SPARC will 216 00:11:04,220 --> 00:11:08,120 have numerically the largest installed base of RISC machines 217 00:11:08,120 --> 00:11:09,420 in the world. 218 00:11:09,420 --> 00:11:12,460 And so that makes the business motivation 219 00:11:12,460 --> 00:11:15,130 for independent software companies very straightforward. 220 00:11:15,130 --> 00:11:16,810 What would you have done differently, 221 00:11:16,810 --> 00:11:18,177 given what you know today? 222 00:11:18,177 --> 00:11:20,260 I think the things we would have done differently, 223 00:11:20,260 --> 00:11:23,840 from a technical point of view, were pretty straightforward. 224 00:11:23,840 --> 00:11:27,130 We would have put integer multiply 225 00:11:27,130 --> 00:11:31,150 and divide full instructions into the architecture. 
226 00:11:31,150 --> 00:11:34,930 Second were mostly implementation issues. 227 00:11:34,930 --> 00:11:37,540 We really should have put about twice as many people 228 00:11:37,540 --> 00:11:39,760 on the project. 229 00:11:39,760 --> 00:11:44,680 And we should have developed our own tightly coupled 230 00:11:44,680 --> 00:11:47,680 floating point strategy in the beginning. 231 00:11:47,680 --> 00:11:51,820 And second to that would have been developing an integrated 232 00:11:51,820 --> 00:11:55,210 memory management unit with multiprocessor support 233 00:11:55,210 --> 00:11:56,620 from the beginning. 234 00:11:56,620 --> 00:12:00,490 When we first went out and started talking to companies 235 00:12:00,490 --> 00:12:06,670 about using SPARC, we at Sun, as I think a bunch of gunslinger 236 00:12:06,670 --> 00:12:08,680 engineers, really have no problem 237 00:12:08,680 --> 00:12:11,200 designing memory management units and whatever 238 00:12:11,200 --> 00:12:12,670 else wasn't there. 239 00:12:12,670 --> 00:12:15,010 But we found a lot of companies really 240 00:12:15,010 --> 00:12:18,640 had succumbed to the traditional semiconductor merchant market 241 00:12:18,640 --> 00:12:21,140 strategy of they do everything for you. 242 00:12:21,140 --> 00:12:23,410 And all you have to do is bolt a few chips in. 243 00:12:23,410 --> 00:12:24,970 And it's all over. 244 00:12:24,970 --> 00:12:28,300 And it took us a while to understand 245 00:12:28,300 --> 00:12:30,970 that the other customers really needed a higher level 246 00:12:30,970 --> 00:12:33,500 of integration than Sun did. 247 00:12:33,500 --> 00:12:36,520 And in the last year and a half, those issues 248 00:12:36,520 --> 00:12:38,980 have been overcome, of course, in our design 249 00:12:38,980 --> 00:12:41,770 of subsequent chipsets. 
250 00:12:41,770 --> 00:12:43,870 What were the roles of each of the partners 251 00:12:43,870 --> 00:12:46,630 that you worked with in the development of the SPARC 252 00:12:46,630 --> 00:12:48,520 chipset? 253 00:12:48,520 --> 00:12:53,320 For the very first generation, SPARC development 254 00:12:53,320 --> 00:12:56,440 of the Fujitsu gate array, Fujitsu primarily 255 00:12:56,440 --> 00:13:01,630 provided us with technical support in the CAD area. 256 00:13:01,630 --> 00:13:04,330 And they had one engineer who just 257 00:13:04,330 --> 00:13:07,510 was so involved that, at the end of that, 258 00:13:07,510 --> 00:13:10,900 that particular engineer knew how to do SPARC machines as 259 00:13:10,900 --> 00:13:12,140 well as we did. 260 00:13:12,140 --> 00:13:14,530 And then he was able to subsequently do 261 00:13:14,530 --> 00:13:18,550 a second generation standard cell design with himself 262 00:13:18,550 --> 00:13:20,950 and just a few other Fujitsu people. 263 00:13:20,950 --> 00:13:23,620 So they mostly helped us out in the CAD system 264 00:13:23,620 --> 00:13:25,960 and learned the architecture. 265 00:13:25,960 --> 00:13:28,570 In the case of working with BIT, it 266 00:13:28,570 --> 00:13:32,500 was a true joint development, where BIT and Sun engineers 267 00:13:32,500 --> 00:13:36,220 worked together from the beginning of the definition 268 00:13:36,220 --> 00:13:40,870 of how we would approach the design, the pipeline design 269 00:13:40,870 --> 00:13:43,960 through the development of all of the chips. 270 00:13:43,960 --> 00:13:47,350 In the case of the Cypress CMOS full custom, 271 00:13:47,350 --> 00:13:51,760 it was a true 50-50 joint development, 272 00:13:51,760 --> 00:13:54,520 where each company shared half the cost 273 00:13:54,520 --> 00:13:57,850 and put in approximately half of the engineers. 274 00:13:57,850 --> 00:14:00,850 And we put everybody together in a building, filled it up 275 00:14:00,850 --> 00:14:01,810 with workstations.
276 00:14:01,810 --> 00:14:03,790 And they went about their business. 277 00:14:03,790 --> 00:14:05,290 They used to call themselves-- 278 00:14:05,290 --> 00:14:08,590 I can't remember-- oh yes, Sunray Semiconductor Inc. 279 00:14:08,590 --> 00:14:11,260 They actually made up a fake company name 280 00:14:11,260 --> 00:14:17,200 and called themselves a company surrounding their project 281 00:14:17,200 --> 00:14:18,640 to get an identity. 282 00:14:18,640 --> 00:14:21,250 And it worked very well. 283 00:14:21,250 --> 00:14:23,590 And now, David Patterson. 284 00:14:23,590 --> 00:14:25,270 I'm delighted to have the opportunity 285 00:14:25,270 --> 00:14:27,160 to communicate with you in this way, 286 00:14:27,160 --> 00:14:29,950 as well as to share this tape with Wayne Rosing. 287 00:14:29,950 --> 00:14:33,070 I first met Wayne almost 10 years ago today. 288 00:14:33,070 --> 00:14:37,540 With this taping, I took a leave of absence from UC Berkeley 289 00:14:37,540 --> 00:14:39,790 and spent it at Digital Equipment Corporation, 290 00:14:39,790 --> 00:14:42,550 where Wayne was in charge of this division. 291 00:14:42,550 --> 00:14:45,010 And together, we spent time figuring out ways 292 00:14:45,010 --> 00:14:47,710 to try and build faster and better VAXes. 293 00:14:47,710 --> 00:14:50,500 I think my involvement in this reduced instruction set, 294 00:14:50,500 --> 00:14:54,190 or RISC, phenomenon started with that. 295 00:14:54,190 --> 00:14:58,480 It seemed to me the difficulty of building VAXes 296 00:14:58,480 --> 00:15:04,240 in VLSI technology gave rise to my involvement 297 00:15:04,240 --> 00:15:07,542 in trying to build a simpler style of machine.
298 00:15:07,542 --> 00:15:09,250 So what I'm going to do today is give you 299 00:15:09,250 --> 00:15:11,950 some principles of computer performance design 300 00:15:11,950 --> 00:15:14,980 and show how those contrast between the traditional 301 00:15:14,980 --> 00:15:18,040 approach and the RISC approach, talk 302 00:15:18,040 --> 00:15:21,550 about some differences of the SPARC architecture 303 00:15:21,550 --> 00:15:24,020 with other RISC architectures, then 304 00:15:24,020 --> 00:15:26,020 give a historical perspective, and then show you 305 00:15:26,020 --> 00:15:28,810 some actual chips and boards that 306 00:15:28,810 --> 00:15:30,250 are using the SPARC architecture, 307 00:15:30,250 --> 00:15:32,290 and then speculate a bit about the future. 308 00:15:32,290 --> 00:15:35,410 What I want to emphasize in this talk today 309 00:15:35,410 --> 00:15:37,480 is a fundamental performance equation 310 00:15:37,480 --> 00:15:39,170 that all computer architects use. 311 00:15:39,170 --> 00:15:43,630 As I go to the first slide, the CPU time today, 312 00:15:43,630 --> 00:15:46,420 or the performance, can be divided into two pieces. 313 00:15:46,420 --> 00:15:50,530 Almost all computers today use a clock, 314 00:15:50,530 --> 00:15:53,200 a standard clock that sets the cycle time of the machine. 315 00:15:53,200 --> 00:15:54,610 It's called the clock cycle time. 316 00:15:54,610 --> 00:15:57,110 Well, then we can divide time into the number of those clock 317 00:15:57,110 --> 00:15:57,910 cycles. 318 00:15:57,910 --> 00:16:01,120 So CPU time is the clock cycles per program 319 00:16:01,120 --> 00:16:03,682 times the clock cycle time. 320 00:16:03,682 --> 00:16:05,140 Well, computer designers have found 321 00:16:05,140 --> 00:16:07,750 it useful to come up with another way of presenting 322 00:16:07,750 --> 00:16:09,070 the same information.
323 00:16:09,070 --> 00:16:11,730 And that is to figure out the number of clock 324 00:16:11,730 --> 00:16:13,840 cycles per instruction. 325 00:16:13,840 --> 00:16:16,638 So that's simply calculated by the clock cycles per program 326 00:16:16,638 --> 00:16:18,180 divided by the number of instructions 327 00:16:18,180 --> 00:16:20,280 you execute in that program, sometimes called 328 00:16:20,280 --> 00:16:21,300 the instruction count. 329 00:16:21,300 --> 00:16:22,890 And that gives you the average number 330 00:16:22,890 --> 00:16:25,500 of clock cycles per instruction, which is 331 00:16:25,500 --> 00:16:28,020 almost always abbreviated CPI. 332 00:16:28,020 --> 00:16:30,120 So now, we can plug those two things together 333 00:16:30,120 --> 00:16:32,130 into this time equation. 334 00:16:32,130 --> 00:16:35,550 And we end up with the CPU time, which 335 00:16:35,550 --> 00:16:39,150 is equal to the instruction count times the CPI 336 00:16:39,150 --> 00:16:41,100 times the clock cycle time. 337 00:16:41,100 --> 00:16:44,100 So performance is trying to-- 338 00:16:44,100 --> 00:16:45,630 you get higher performance by trying 339 00:16:45,630 --> 00:16:48,780 to minimize that product of instruction count, CPI, 340 00:16:48,780 --> 00:16:50,620 and clock cycle time. 341 00:16:50,620 --> 00:16:52,470 So let's see how you take advantage 342 00:16:52,470 --> 00:16:55,590 of that in the two different styles of machine design.
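[Editor's sketch, not part of the talk] The performance equation just described can be written out in a few lines of Python. The function names and the workload numbers below are hypothetical, chosen only to show that the two forms of the equation (cycles times cycle time, and instruction count times CPI times cycle time) give the same CPU time.

```python
def cpi_from_cycles(clock_cycles, instruction_count):
    """Average clock cycles per instruction (CPI)."""
    return clock_cycles / instruction_count


def cpu_time_seconds(instruction_count, cpi, clock_cycle_time_s):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time_s


# Hypothetical workload, for illustration only (not figures from the talk).
clock_cycles = 3_000_000        # clock cycles per program
instruction_count = 2_000_000   # instructions executed by the program
clock_cycle_time_s = 50e-9      # 50 ns clock cycle time

cpi = cpi_from_cycles(clock_cycles, instruction_count)

# First form: clock cycles per program times clock cycle time.
t_direct = clock_cycles * clock_cycle_time_s
# Second form: instruction count times CPI times clock cycle time.
t_factored = cpu_time_seconds(instruction_count, cpi, clock_cycle_time_s)

print(f"CPI = {cpi}")
print(f"CPU time (direct)   = {t_direct:.6f} s")
print(f"CPU time (factored) = {t_factored:.6f} s")
```

Splitting out CPI is what lets the two design styles be compared: one style pushes the instruction count down and accepts a higher CPI, the other does the reverse.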
348 00:17:09,119 --> 00:17:12,000 That was seen as raising the level of the machine, 349 00:17:12,000 --> 00:17:14,910 that you had a higher level, more powerful instruction set. 350 00:17:14,910 --> 00:17:18,569 And in fact, to execute fewer instructions is good. 351 00:17:18,569 --> 00:17:20,640 Higher CPI, well, that wasn't so bad. 352 00:17:20,640 --> 00:17:22,800 And also, it turns out, for that style of machines, 353 00:17:22,800 --> 00:17:24,960 there was an emphasis on reducing 354 00:17:24,960 --> 00:17:28,079 the size of the program, small program size. 355 00:17:28,079 --> 00:17:31,950 Now, in contrast to that is the RISC style design, again trying 356 00:17:31,950 --> 00:17:35,440 to minimize that key performance equation, where with RISC 357 00:17:35,440 --> 00:17:39,240 the emphasis is in reducing the clock cycles per instruction. 358 00:17:39,240 --> 00:17:41,370 And that happens primarily through pipelining, 359 00:17:41,370 --> 00:17:45,360 by having many instructions executing at the same time. 360 00:17:45,360 --> 00:17:47,650 What's the side effect of reducing the CPI? 361 00:17:47,650 --> 00:17:50,070 It's a larger instruction count. 362 00:17:50,070 --> 00:17:52,020 Well, that's considered to be OK because it's 363 00:17:52,020 --> 00:17:55,110 a lot easier to get instruction memory bandwidth 364 00:17:55,110 --> 00:17:57,000 in all computer design. 365 00:17:57,000 --> 00:17:59,730 So that was an acceptable side effect. 366 00:17:59,730 --> 00:18:01,650 The programs are larger as well. 367 00:18:01,650 --> 00:18:04,380 They're larger because of the larger number of instructions 368 00:18:04,380 --> 00:18:06,600 and also because the instructions are 369 00:18:06,600 --> 00:18:09,930 kept in a form that's very easy to get pipelined execution.
370 00:18:09,930 --> 00:18:12,810 And that was OK because, for the last decade, 371 00:18:12,810 --> 00:18:14,940 the DRAMs used to make memory today 372 00:18:14,940 --> 00:18:17,770 are getting cheaper at an incredibly fast rate. 373 00:18:17,770 --> 00:18:20,910 So bigger programs to get better performance 374 00:18:20,910 --> 00:18:24,240 is a tradeoff most people would be happy with. 375 00:18:24,240 --> 00:18:28,440 Now, more perhaps minor or lesser roles 376 00:18:28,440 --> 00:18:30,300 of the reduced instruction set approach 377 00:18:30,300 --> 00:18:32,910 is that given that technology is moving 378 00:18:32,910 --> 00:18:36,750 at an incredibly fast rate, this VLSI technology, 379 00:18:36,750 --> 00:18:38,980 there's an emphasis on simplicity. 380 00:18:38,980 --> 00:18:41,010 Let's try and keep the instruction set simple, 381 00:18:41,010 --> 00:18:43,200 so that we can more closely track 382 00:18:43,200 --> 00:18:45,510 the rapid changes in VLSI. 383 00:18:45,510 --> 00:18:48,360 As a benefit of that style of design, 384 00:18:48,360 --> 00:18:51,120 it made it easier to have a faster clock cycle 385 00:18:51,120 --> 00:18:53,370 time by keeping things simple. 386 00:18:53,370 --> 00:18:57,060 And probably the final linchpin of the RISC designs 387 00:18:57,060 --> 00:19:00,450 is the assumption of a much better compiler technology 388 00:19:00,450 --> 00:19:02,580 because there's been advances in compiler design 389 00:19:02,580 --> 00:19:05,890 as well in the last 10 years. 390 00:19:05,890 --> 00:19:10,080 Now, so I talked about the trade off of instruction and CPI 391 00:19:10,080 --> 00:19:13,290 in this equation we're trying to minimize, right? 392 00:19:13,290 --> 00:19:15,780 So what about the clock cycle time? 393 00:19:15,780 --> 00:19:18,435 Well, the clock cycle time is your worst case situation 394 00:19:18,435 --> 00:19:19,772 in computer design.
395 00:19:19,772 --> 00:19:22,230 And there's lots of things that could limit the clock cycle 396 00:19:22,230 --> 00:19:23,290 time. 397 00:19:23,290 --> 00:19:25,560 But with the two different designs-- 398 00:19:25,560 --> 00:19:28,050 for the traditional or CISC approach, 399 00:19:28,050 --> 00:19:30,420 they've always been microprogrammed. 400 00:19:30,420 --> 00:19:34,620 So one of the limits is the time to fetch a microinstruction. 401 00:19:34,620 --> 00:19:39,060 Typical, traditional machines will have between 4K and 16K 402 00:19:39,060 --> 00:19:41,460 of these microinstructions that are very wide. 403 00:19:41,460 --> 00:19:43,950 So you can't run the cycle time faster 404 00:19:43,950 --> 00:19:48,000 than you can access this microprogram memory. 405 00:19:48,000 --> 00:19:49,900 The RISC designs, on the other hand, 406 00:19:49,900 --> 00:19:52,200 are built on a cache model of design. 407 00:19:52,200 --> 00:19:55,080 So the cache you can adjust. 408 00:19:55,080 --> 00:19:57,348 Caches are made to be adjustable. 409 00:19:57,348 --> 00:19:59,640 They're just the highest level of the memory hierarchy. 410 00:19:59,640 --> 00:20:02,190 So depending on what your clock cycle time is, 411 00:20:02,190 --> 00:20:04,270 you can have a smaller instruction cache. 412 00:20:04,270 --> 00:20:06,810 So the RISC designs aren't as limited by the memory size 413 00:20:06,810 --> 00:20:09,225 because it's adjustable, whereas the traditional designs 414 00:20:09,225 --> 00:20:11,850 you pretty much have to have the whole microprogram you're ever 415 00:20:11,850 --> 00:20:13,150 going to use right there. 416 00:20:13,150 --> 00:20:15,940 And that's going to affect the clock cycle time. 417 00:20:15,940 --> 00:20:20,760 So how does that show up I think is shown in the next slide. 418 00:20:20,760 --> 00:20:23,670 What this is, is a plot of the clock cycle 419 00:20:23,670 --> 00:20:28,547 time on a log scale, over time along the bottom here.
420 00:20:28,547 --> 00:20:30,630 So you see the RISC designs really didn't show up 421 00:20:30,630 --> 00:20:32,280 until about 1986. 422 00:20:32,280 --> 00:20:35,880 But the orange line is the minicomputer line 423 00:20:35,880 --> 00:20:39,690 represented by models of the VAX family and the supercomputers 424 00:20:39,690 --> 00:20:41,940 by the Cray. 425 00:20:41,940 --> 00:20:43,780 It's relative. 426 00:20:43,780 --> 00:20:44,740 This is a log scale. 427 00:20:44,740 --> 00:20:47,490 So they are improving relative to the RISC designs 428 00:20:47,490 --> 00:20:48,660 at a slower rate. 429 00:20:48,660 --> 00:20:53,100 And not only is this RISC line somewhat steeper, 430 00:20:53,100 --> 00:20:55,090 you see this sudden hitch in the curve. 431 00:20:55,090 --> 00:20:57,600 And what that is, is the RISC designs changing technology. 432 00:20:57,600 --> 00:21:01,410 In the earlier part, they were all done in some kind of CMOS 433 00:21:01,410 --> 00:21:02,500 usually design. 434 00:21:02,500 --> 00:21:04,278 They were able to use the ECL design. 435 00:21:04,278 --> 00:21:06,570 So they're able to take advantage of the new technology 436 00:21:06,570 --> 00:21:08,890 and change the clock rate. 437 00:21:08,890 --> 00:21:13,460 So that's how the clock rate affects what's going on. 438 00:21:13,460 --> 00:21:15,180 So let's do a couple of examples. 439 00:21:15,180 --> 00:21:17,900 The first example was 1987. 440 00:21:17,900 --> 00:21:23,080 July 7, 7/7/87, Sun announced the SPARC-based family 441 00:21:23,080 --> 00:21:23,800 of machines. 442 00:21:23,800 --> 00:21:26,530 And so let's compare those figures of merit 443 00:21:26,530 --> 00:21:29,110 for the products of that time. 444 00:21:29,110 --> 00:21:32,810 Sun's traditional line was the Motorola 68000 family, 445 00:21:32,810 --> 00:21:34,450 in particular the 68020.
446 00:21:34,450 --> 00:21:37,450 So if we look at the instruction count, the ratio of those, 447 00:21:37,450 --> 00:21:40,480 it took about 25% more instructions 448 00:21:40,480 --> 00:21:44,080 for the RISC approach, a higher instruction count. 449 00:21:44,080 --> 00:21:47,020 The clock cycle time is-- 450 00:21:47,020 --> 00:21:48,220 I'll explain a little bit. 451 00:21:48,220 --> 00:21:50,220 It's a little bit funny for the Motorola design. 452 00:21:50,220 --> 00:21:55,210 The clock cycle time was actually better for the 68000 453 00:21:55,210 --> 00:21:56,980 than it was for the RISC design. 454 00:21:56,980 --> 00:21:59,650 But the average number of clocks per instruction, you can see, 455 00:21:59,650 --> 00:22:02,950 is quite a bit worse, 5 to 7 or 0.32. 456 00:22:02,950 --> 00:22:04,820 When you multiply all those things together, 457 00:22:04,820 --> 00:22:07,570 you get about a ratio factor of 2 to 1. 458 00:22:07,570 --> 00:22:09,340 The RISC machine is twice as fast. 459 00:22:09,340 --> 00:22:13,990 Or the 68000 takes twice as long to execute. 460 00:22:13,990 --> 00:22:16,280 But the difference in price was very small, 461 00:22:16,280 --> 00:22:18,040 about 10% or 20% more. 462 00:22:18,040 --> 00:22:21,010 So you've got a factor of 2 performance for a 10% to 20% 463 00:22:21,010 --> 00:22:21,760 increase in price. 464 00:22:21,760 --> 00:22:23,270 It's a pretty good deal. 465 00:22:23,270 --> 00:22:26,080 Now, when we give these talks to our friends at Motorola, 466 00:22:26,080 --> 00:22:29,530 they will argue some about the clocks per instruction, 467 00:22:29,530 --> 00:22:30,400 clock cycle time. 468 00:22:30,400 --> 00:22:33,310 It turns out the Motorola, the clock rates, 469 00:22:33,310 --> 00:22:37,210 the 25 megahertz turns out to be 40 nanoseconds. 470 00:22:37,210 --> 00:22:40,870 Well, in the Motorola microcode, every single microinstruction 471 00:22:40,870 --> 00:22:43,210 takes two clock ticks. 
472 00:22:43,210 --> 00:22:46,133 So I had a hard time deciding how to put this together. 473 00:22:46,133 --> 00:22:48,550 The other way you put this is, with an 80 nanosecond clock 474 00:22:48,550 --> 00:22:52,870 cycle time, then the CPI is 2.5 to 3.5. 475 00:22:52,870 --> 00:22:54,580 It doesn't affect the product down here. 476 00:22:54,580 --> 00:22:57,747 Just depending if you're sensitive about what's 477 00:22:57,747 --> 00:22:59,830 the average number of clock cycles per instruction 478 00:22:59,830 --> 00:23:01,570 of your architecture, you may not 479 00:23:01,570 --> 00:23:03,040 like it characterized this way. 480 00:23:03,040 --> 00:23:05,470 But the machine is advertised with a 25 megahertz, 40 481 00:23:05,470 --> 00:23:07,720 nanosecond clock. 482 00:23:07,720 --> 00:23:09,880 So that was one example. 483 00:23:09,880 --> 00:23:12,640 This year, there is an example from DEC. 484 00:23:12,640 --> 00:23:15,640 DEC has produced a VAX and RISC line, 485 00:23:15,640 --> 00:23:17,920 in fact announced the very same day, just like Sun 486 00:23:17,920 --> 00:23:19,750 did two years before that. 487 00:23:19,750 --> 00:23:22,360 Now, when compared to the VAX architecture, 488 00:23:22,360 --> 00:23:25,180 their RISC machine that they used almost 489 00:23:25,180 --> 00:23:29,110 executes 80% more instructions. 490 00:23:29,110 --> 00:23:32,230 These are based on benchmarks from a book 491 00:23:32,230 --> 00:23:34,390 that John Hennessy and I are working on. 492 00:23:34,390 --> 00:23:38,400 But it's the GNU C compiler and the TeX programs. 493 00:23:38,400 --> 00:23:40,900 So these are large programs that these numbers are based on. 494 00:23:40,900 --> 00:23:43,480 We took the compilers with DEC chips in them 495 00:23:43,480 --> 00:23:46,750 and did these measurements. 496 00:23:46,750 --> 00:23:50,830 It was 80% more instructions, so much higher instruction count. 497 00:23:50,830 --> 00:23:51,970 VAX is actually much fewer.
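To make the 68020 clock accounting from a moment ago concrete, here is a minimal sketch. It assumes only the figures quoted in the talk (a 40 nanosecond clock with a CPI of 5 to 7, or an 80 nanosecond clock with a CPI of 2.5 to 3.5); the function name `time_per_instruction_ns` is made up for illustration. It shows why the choice of accounting does not change the product.

```python
def time_per_instruction_ns(cycle_time_ns, cpi):
    """Average time per instruction = clock cycle time x clocks per instruction."""
    return cycle_time_ns * cpi

# Advertised view: 25 MHz clock (40 ns per tick), CPI of 5 to 7.
advertised = [time_per_instruction_ns(40, cpi) for cpi in (5, 7)]

# Microcode view: each microinstruction takes two ticks (80 ns), CPI of 2.5 to 3.5.
microcode = [time_per_instruction_ns(80, cpi) for cpi in (2.5, 3.5)]

print(advertised)  # [200, 280]
print(microcode)   # [200.0, 280.0]
```

Either way the average instruction takes 200 to 280 nanoseconds, so the overall performance comparison is unaffected.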
498 00:23:51,970 --> 00:23:54,940 But you notice this machine, the clock cycle time 499 00:23:54,940 --> 00:23:56,990 was quite a bit less for the RISC machine. 500 00:23:56,990 --> 00:23:59,470 And there's a huge difference in the average number 501 00:23:59,470 --> 00:24:01,220 of clocks per instruction. 502 00:24:01,220 --> 00:24:02,865 So when you multiply it all together, 503 00:24:02,865 --> 00:24:04,240 depending on the program, you get 504 00:24:04,240 --> 00:24:07,480 a factor of performance improvement of about a 3 to 6. 505 00:24:07,480 --> 00:24:10,180 The VAX takes about three times to six times longer 506 00:24:10,180 --> 00:24:14,110 to execute these large programs than the RISC machine. 507 00:24:14,110 --> 00:24:18,160 But yet, using DEC's prices, their list prices, 508 00:24:18,160 --> 00:24:20,000 it was only 25% more expensive. 509 00:24:20,000 --> 00:24:22,960 So again, something is 3 times 6 times more 510 00:24:22,960 --> 00:24:25,280 expensive to start that-- 511 00:24:25,280 --> 00:24:26,230 let's try this again. 512 00:24:26,230 --> 00:24:27,970 3 to 6 times faster-- 513 00:24:27,970 --> 00:24:30,730 that's it-- and only 25% more. 514 00:24:30,730 --> 00:24:32,860 So those are a couple of examples of RISC machines 515 00:24:32,860 --> 00:24:34,870 from two different families. 516 00:24:34,870 --> 00:24:37,900 Now, let's talk in a little bit more technical detail. 517 00:24:37,900 --> 00:24:41,770 And there was an interesting study 518 00:24:41,770 --> 00:24:44,710 we did when we measured these large programs. 519 00:24:44,710 --> 00:24:47,890 And this plot is measuring the ratio really 520 00:24:47,890 --> 00:24:50,860 to the amount of instruction traffic on the VAX. 521 00:24:50,860 --> 00:24:54,520 So let's say that the unit is one for the VAX instructions. 522 00:24:54,520 --> 00:24:56,500 That's shown here, orange. 
523 00:24:56,500 --> 00:24:59,110 You can see that the number of words from memory for the RISC 524 00:24:59,110 --> 00:25:00,820 machine, in blue, is 1.8. 525 00:25:00,820 --> 00:25:02,833 That's like what we said before. 526 00:25:02,833 --> 00:25:05,500 Now, the question is, what about the rest of the memory traffic? 527 00:25:05,500 --> 00:25:07,960 What about the data accesses? 528 00:25:07,960 --> 00:25:11,590 Well, it turns out that, normalized again to VAX, 529 00:25:11,590 --> 00:25:17,080 the VAX takes 1.7 of these units of data memory traffic, 530 00:25:17,080 --> 00:25:19,450 where the RISC machine takes only 0.6. 531 00:25:19,450 --> 00:25:22,360 If you add these things all up, the total memory traffic 532 00:25:22,360 --> 00:25:26,920 is that the VAX has about 10% more traffic than the RISC 533 00:25:26,920 --> 00:25:28,840 machine, which is kind of surprising. 534 00:25:28,840 --> 00:25:30,610 I think most computer designers thought 535 00:25:30,610 --> 00:25:32,860 that if you have a higher instruction count, therefore 536 00:25:32,860 --> 00:25:36,400 you're going to have a higher data memory traffic overall. 537 00:25:36,400 --> 00:25:38,012 But it still might be a good tradeoff. 538 00:25:38,012 --> 00:25:39,970 But the most interesting thing about this slide 539 00:25:39,970 --> 00:25:41,620 isn't that there's a 10% reduction. 540 00:25:41,620 --> 00:25:43,150 It's where the differences are. 541 00:25:43,150 --> 00:25:45,370 From a computer designer's perspective, 542 00:25:45,370 --> 00:25:48,092 it's relatively easy to double your bandwidth 543 00:25:48,092 --> 00:25:50,050 with instruction memory because they're largely 544 00:25:50,050 --> 00:25:51,070 sequential accesses. 545 00:25:51,070 --> 00:25:52,580 Most are the next one. 546 00:25:52,580 --> 00:25:55,510 So simply making the port twice as wide 547 00:25:55,510 --> 00:26:00,190 would probably get you about 1.8 increase in the bandwidth.
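The memory traffic totals just quoted can be checked with a few lines of arithmetic. This is only a sketch using the four normalized figures read off the slide (VAX instruction traffic taken as 1.0); the dictionary names are illustrative.

```python
# Memory traffic normalized so that VAX instruction traffic = 1.0.
vax = {"instruction": 1.0, "data": 1.7}
risc = {"instruction": 1.8, "data": 0.6}

vax_total = sum(vax.values())    # about 2.7 units
risc_total = sum(risc.values())  # about 2.4 units

total_ratio = vax_total / risc_total    # VAX total traffic relative to RISC
data_ratio = vax["data"] / risc["data"]  # VAX data traffic relative to RISC

print(f"VAX total {vax_total:.1f}, RISC total {risc_total:.1f}")
print(f"total ratio {total_ratio:.2f}, data ratio {data_ratio:.2f}")
```

With these figures the VAX moves roughly 12% more words in total, consistent with the "about 10%" quoted, while its data traffic alone is nearly three times that of the RISC machine.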
548 00:26:00,190 --> 00:26:02,200 On the other hand, data memory references 549 00:26:02,200 --> 00:26:04,477 are very random in their very nature. 550 00:26:04,477 --> 00:26:06,310 And so those are great things to get rid of. 551 00:26:06,310 --> 00:26:07,810 So speaking as a computer designer, 552 00:26:07,810 --> 00:26:13,270 if I had a choice of almost doubling my instruction traffic 553 00:26:13,270 --> 00:26:15,760 and reducing my data traffic by a factor of 3, 554 00:26:15,760 --> 00:26:19,050 that's wonderful from my perspective. 555 00:26:19,050 --> 00:26:22,450 So when we're really starting to run real big programs 556 00:26:22,450 --> 00:26:23,950 on these two sides of architectures, 557 00:26:23,950 --> 00:26:25,743 we see some real interesting differences. 558 00:26:25,743 --> 00:26:27,160 To build a much faster VAX, you're 559 00:26:27,160 --> 00:26:29,530 going to have to have a much higher data memory 560 00:26:29,530 --> 00:26:34,150 traffic than you will have for the RISC machines. 561 00:26:34,150 --> 00:26:37,470 Now I'm going to-- before I talk about the differences 562 00:26:37,470 --> 00:26:38,850 in the SPARC architecture, I want 563 00:26:38,850 --> 00:26:41,730 to emphasize that at no time in our history 564 00:26:41,730 --> 00:26:45,270 have we had such similar machines. 565 00:26:45,270 --> 00:26:48,150 The RISC machines are so similar that people sometimes 566 00:26:48,150 --> 00:26:50,650 tend to emphasize the small differences there are. 567 00:26:50,650 --> 00:26:52,175 But let's put this in perspective. 568 00:26:52,175 --> 00:26:53,550 In putting it in perspective, I'm 569 00:26:53,550 --> 00:26:56,190 going to start talking about the traditional microprocessors. 570 00:26:56,190 --> 00:26:58,320 The so-called 16-bit microprocessors 571 00:26:58,320 --> 00:27:01,770 were developed in 1978 to 1980, about eight years 572 00:27:01,770 --> 00:27:03,522 before the RISC machines. 
573 00:27:03,522 --> 00:27:04,980 So when we look at these machines-- 574 00:27:04,980 --> 00:27:09,450 the Intel 86, the Motorola 68000, and the Z8000-- 575 00:27:09,450 --> 00:27:11,680 the first difference is the difference of addressing, 576 00:27:11,680 --> 00:27:13,440 which is a major difference, two of them 577 00:27:13,440 --> 00:27:15,630 using segmented addressing and the Motorola using 578 00:27:15,630 --> 00:27:16,662 flat addressing. 579 00:27:16,662 --> 00:27:18,120 That's about as big a difference as you're 580 00:27:18,120 --> 00:27:20,537 going to get in an architecture because it affects so much 581 00:27:20,537 --> 00:27:22,230 of the system. 582 00:27:22,230 --> 00:27:23,820 There was no protection on the 86. 583 00:27:23,820 --> 00:27:25,755 And it was optional on the other two machines. 584 00:27:25,755 --> 00:27:29,190 And you can see the address size varies. 585 00:27:29,190 --> 00:27:30,975 Even with the segmented architectures, 586 00:27:30,975 --> 00:27:32,100 the address size varied. 587 00:27:32,100 --> 00:27:34,080 So those affect a lot of the software that runs 588 00:27:34,080 --> 00:27:36,000 on the machine, the addresses. 589 00:27:36,000 --> 00:27:39,420 If we look more inside, the register sizes were different-- 590 00:27:39,420 --> 00:27:44,250 16 bits for Zilog and Intel and 32 for Motorola. 591 00:27:44,250 --> 00:27:46,800 The register model was incredibly different. 592 00:27:46,800 --> 00:27:49,260 Maybe that's one of the biggest differences between them. 593 00:27:49,260 --> 00:27:51,780 All the registers were special purpose in 86. 594 00:27:51,780 --> 00:27:53,790 They were divided into eight data 595 00:27:53,790 --> 00:27:57,330 and eight address in the 68000; 16 general-purpose registers 596 00:27:57,330 --> 00:27:58,650 on the Z8000. 597 00:27:58,650 --> 00:28:01,050 The instruction size was variable, either byte variable 598 00:28:01,050 --> 00:28:06,750 in the 86 or 16-bit variable in the Motorola 68000.
599 00:28:06,750 --> 00:28:08,790 Intel had no data alignment restrictions. 600 00:28:08,790 --> 00:28:11,700 And Motorola, on the 68000, and Zilog 601 00:28:11,700 --> 00:28:13,710 required the data to be aligned. 602 00:28:13,710 --> 00:28:17,040 And finally, the input/output was memory mapped for Motorola, 603 00:28:17,040 --> 00:28:20,793 but used special instructions on the other two machines. 604 00:28:20,793 --> 00:28:22,710 So let's contrast that with the RISC machines, 605 00:28:22,710 --> 00:28:28,050 which were done in 1986, or announced 1986 to 1988. 606 00:28:28,050 --> 00:28:29,910 There's the SPARC, the MIPS architecture, 607 00:28:29,910 --> 00:28:33,720 which is being used by DEC in their RISC machine, 608 00:28:33,720 --> 00:28:37,830 and the Motorola 88000, Motorola's RISC machine. 609 00:28:37,830 --> 00:28:39,030 You see the addressing. 610 00:28:39,030 --> 00:28:40,590 All of them use flat addressing. 611 00:28:40,590 --> 00:28:42,630 All of them use page level protection. 612 00:28:42,630 --> 00:28:46,510 All of them have the same address size. 613 00:28:46,510 --> 00:28:50,910 The SPARC-- in terms of what's visible to the programmer, 614 00:28:50,910 --> 00:28:51,620 the-- 615 00:28:51,620 --> 00:28:54,730 well, the width of the registers are 32 bits wide. 616 00:28:54,730 --> 00:28:58,720 There's always the same number available to the programmer. 617 00:28:58,720 --> 00:29:00,308 All of them have 32-bit registers. 618 00:29:00,308 --> 00:29:02,350 The instruction size, they're always all 32 bits. 619 00:29:02,350 --> 00:29:03,940 The data always has to be aligned. 620 00:29:03,940 --> 00:29:05,410 And it's memory mapped. 621 00:29:05,410 --> 00:29:08,200 So in contrast to just a few years ago, 622 00:29:08,200 --> 00:29:11,590 there's incredible agreement on all these basic issues, which 623 00:29:11,590 --> 00:29:14,950 affects why it's so easy to port programs between these RISC 624 00:29:14,950 --> 00:29:16,120 machines.
625 00:29:16,120 --> 00:29:19,420 So keep in mind that the RISC machines are more alike 626 00:29:19,420 --> 00:29:21,490 than any set of machines we've ever seen 627 00:29:21,490 --> 00:29:23,060 in the history of computer design. 628 00:29:23,060 --> 00:29:24,670 But let me now talk about a couple 629 00:29:24,670 --> 00:29:28,160 of differences for the SPARC model. 630 00:29:28,160 --> 00:29:31,960 The first is that SPARC allows overlapped execution of integer 631 00:29:31,960 --> 00:29:33,350 and floating point programs. 632 00:29:33,350 --> 00:29:35,560 So what this means is, in the right circumstances, 633 00:29:35,560 --> 00:29:38,080 if you do have floating point instructions, 634 00:29:38,080 --> 00:29:40,540 you can maybe get complete overlap 635 00:29:40,540 --> 00:29:43,368 with the integer instructions and the floating point instructions. 636 00:29:43,368 --> 00:29:45,160 How do you support that in an architecture? 637 00:29:45,160 --> 00:29:48,340 You have to allow there to be a mechanism so that when you have 638 00:29:48,340 --> 00:29:51,320 an interrupt, you can back up and find out 639 00:29:51,320 --> 00:29:53,570 what floating point instructions hadn't been finished. 640 00:29:53,570 --> 00:29:56,620 So SPARC provides a queue that contains 641 00:29:56,620 --> 00:29:58,540 a copy of the instruction, the floating point 642 00:29:58,540 --> 00:30:01,240 instruction that's in the middle of execution, 643 00:30:01,240 --> 00:30:08,290 as well as a PC of that instruction. 644 00:30:08,290 --> 00:30:10,690 As Wayne Rosing mentioned, there was an emphasis 645 00:30:10,690 --> 00:30:13,480 in the development of SPARC of support for some of the newer 646 00:30:13,480 --> 00:30:14,710 programming ideas-- 647 00:30:14,710 --> 00:30:18,250 the dynamically linked and dynamically typed languages. 648 00:30:18,250 --> 00:30:21,790 That was important in the development of SPARC.
649 00:30:21,790 --> 00:30:25,773 For LISP in particular, it has some interesting requirements, 650 00:30:25,773 --> 00:30:27,940 challenging requirements, for the computer designer. 651 00:30:27,940 --> 00:30:31,540 Just because you see A plus B doesn't 652 00:30:31,540 --> 00:30:35,120 mean that you're adding two numbers or two floating point 653 00:30:35,120 --> 00:30:35,620 numbers. 654 00:30:35,620 --> 00:30:37,480 You could be adding two arrays. 655 00:30:37,480 --> 00:30:39,850 So there has to be the opportunity for the LISP 656 00:30:39,850 --> 00:30:44,120 programmer at runtime to change the Add instruction 657 00:30:44,120 --> 00:30:47,570 to perform these other very complicated things. 658 00:30:47,570 --> 00:30:50,630 So what does a computer designer do about that? 659 00:30:50,630 --> 00:30:54,700 Well, typically, they will tag the data somehow, 660 00:30:54,700 --> 00:30:57,440 some mechanism so that when they're integers, 661 00:30:57,440 --> 00:30:58,690 they'll know they're integers. 662 00:30:58,690 --> 00:30:59,773 And they can do them fast. 663 00:30:59,773 --> 00:31:02,090 And it turns out almost all the time they're integers. 664 00:31:02,090 --> 00:31:05,200 So looking at the slide, SPARC supports 665 00:31:05,200 --> 00:31:07,023 the two least significant bits. 666 00:31:07,023 --> 00:31:08,440 The two least significant bits are 667 00:31:08,440 --> 00:31:09,940 zeros, which indicates these aren't 668 00:31:09,940 --> 00:31:12,590 pointing to something more complicated than just numbers. 669 00:31:12,590 --> 00:31:14,170 They're added together. 670 00:31:14,170 --> 00:31:17,930 And the result is also one of these integers. 671 00:31:17,930 --> 00:31:22,120 So the overflow bit is set on an operation. 672 00:31:22,120 --> 00:31:24,850 If these two least significant bits 673 00:31:24,850 --> 00:31:29,072 are not zeros, or if the result is too big a number 674 00:31:29,072 --> 00:31:30,280 to fit in here, you overflow.
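The tagged add just described can be modeled in a few lines. This is a sketch under stated assumptions rather than the SPARC instruction definition: it treats words as 32 bits, takes a tag of 00 in the two least significant bits to mean "integer", and sets the overflow flag if either operand has a nonzero tag or the signed 32-bit sum does not fit. The names `tagged_add`, `to_signed32`, `MASK32`, and `TAG_MASK` are hypothetical.

```python
MASK32 = 0xFFFFFFFF
TAG_MASK = 0b11  # two least significant bits; 00 marks an integer


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def tagged_add(a, b):
    """Return (32-bit result, overflow flag) for a tagged add of two words."""
    a &= MASK32
    b &= MASK32
    result = (a + b) & MASK32
    # Assumption: any nonzero tag bit in either operand flags overflow.
    tag_mismatch = bool((a | b) & TAG_MASK)
    # Ordinary signed overflow: the true sum does not fit in 32 bits.
    arith_overflow = to_signed32(a) + to_signed32(b) != to_signed32(result)
    return result, tag_mismatch or arith_overflow


# Two tagged integers (values 5 and 7 shifted past the tag bits).
print(tagged_add(5 << 2, 7 << 2))      # (48, False)
# One operand carries a nonzero tag, e.g. a pointer to something else.
print(tagged_add(5 << 2, 0x1001)[1])   # True
# Both are integers, but the sum is too big to fit.
print(tagged_add(0x7FFFFFFC, 4)[1])    # True
```

In the common case, where both operands are integers and the sum fits, the add completes with no extra checking instructions; only the flagged cases need the slower general path.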
675 00:31:30,280 --> 00:31:33,430 So by having some special instructions, 676 00:31:33,430 --> 00:31:35,480 they do add, subtract-- 677 00:31:35,480 --> 00:31:38,020 and because SPARC uses condition codes, 678 00:31:38,020 --> 00:31:39,430 subtract acts as a compare. 679 00:31:39,430 --> 00:31:42,242 So they can do comparisons with support 680 00:31:42,242 --> 00:31:43,450 on a few of the instructions. 681 00:31:43,450 --> 00:31:45,533 The rest of the instructions will ignore the tags. 682 00:31:45,533 --> 00:31:47,800 But there are a few special ones just to support the LISP 683 00:31:47,800 --> 00:31:50,740 and Smalltalk dynamic typing. 684 00:31:50,740 --> 00:31:54,220 As it shows on the slide next, another very important feature 685 00:31:54,220 --> 00:31:56,252 for one of these experimental architectures 686 00:31:56,252 --> 00:31:58,210 is that you can test for errors because there's 687 00:31:58,210 --> 00:32:00,502 lots of funny things that can happen in dynamic typing. 688 00:32:00,502 --> 00:32:02,930 There's a lot of checking to make sure things are correct. 689 00:32:02,930 --> 00:32:05,350 A very nice instruction for that is a conditional trap 690 00:32:05,350 --> 00:32:06,158 instruction. 691 00:32:06,158 --> 00:32:08,200 If you don't have a conditional trap instruction, 692 00:32:08,200 --> 00:32:11,200 you have to conditionally branch around some subroutine 693 00:32:11,200 --> 00:32:12,130 and pass parameters. 694 00:32:12,130 --> 00:32:14,380 Conditional trap takes only one clock cycle 695 00:32:14,380 --> 00:32:15,280 to perform the test. 696 00:32:15,280 --> 00:32:17,560 It's a nice thing to include. 697 00:32:17,560 --> 00:32:20,560 The final thing is register windows, which Wayne alluded to 698 00:32:20,560 --> 00:32:21,860 in his presentation. 699 00:32:21,860 --> 00:32:23,770 So I'll spend some time showing what that is 700 00:32:23,770 --> 00:32:26,290 and what the implications are.
701 00:32:26,290 --> 00:32:30,280 Register windows are based on an observation about the behavior 702 00:32:30,280 --> 00:32:32,860 of most programs or all programs, 703 00:32:32,860 --> 00:32:36,700 just as caches were invented based 704 00:32:36,700 --> 00:32:39,610 on an observation of the behavior of all programs 705 00:32:39,610 --> 00:32:40,970 towards the locality. 706 00:32:40,970 --> 00:32:44,530 So as you see on the slide, let me-- [? what ?] the axis is-- 707 00:32:44,530 --> 00:32:47,020 across is basically time and units 708 00:32:47,020 --> 00:32:48,790 of procedure call and return. 709 00:32:48,790 --> 00:32:51,400 And what goes down the slide is the nesting depth, 710 00:32:51,400 --> 00:32:53,800 so how many levels deep is the subroutine. 711 00:32:53,800 --> 00:32:56,320 And so we get jagged-- 712 00:32:56,320 --> 00:32:58,510 basically, these are four or five calls 713 00:32:58,510 --> 00:33:01,340 in a row followed by a return, and a call, and so on. 714 00:33:01,340 --> 00:33:02,920 So we have this pattern. 715 00:33:02,920 --> 00:33:05,500 When I move up, the program is doing returns. 716 00:33:05,500 --> 00:33:07,930 And we moved down, it was doing a series of calls. 717 00:33:07,930 --> 00:33:10,060 Well, the first thing you can observe by this 718 00:33:10,060 --> 00:33:11,750 is that, at least for this program, 719 00:33:11,750 --> 00:33:13,390 we didn't see very jagged lines. 720 00:33:13,390 --> 00:33:17,260 We didn't see programs that are 100 or 1,000 calls followed 721 00:33:17,260 --> 00:33:18,610 by 100 or 1,000 returns. 722 00:33:18,610 --> 00:33:20,780 There's some locality there. 
723 00:33:20,780 --> 00:33:23,620 And in fact, if you provided a larger buffer 724 00:33:23,620 --> 00:33:25,960 of registers than just the 32 you need, 725 00:33:25,960 --> 00:33:27,910 that many sets of them on the chip, 726 00:33:27,910 --> 00:33:30,610 the question is, how frequently can you 727 00:33:30,610 --> 00:33:32,350 avoid having to go off chip? 728 00:33:32,350 --> 00:33:34,600 Because if you don't have this on every-- 729 00:33:34,600 --> 00:33:35,260 let's see. 730 00:33:35,260 --> 00:33:37,660 On every call, you have to store registers away, 731 00:33:37,660 --> 00:33:38,710 extra store instructions. 732 00:33:38,710 --> 00:33:43,460 And every return, you have to do loads back to bring them back. 733 00:33:43,460 --> 00:33:47,357 So what this shows is if we put a buffer of a size 734 00:33:47,357 --> 00:33:49,940 I think something of some thing like-- let's say it was eight. 735 00:33:49,940 --> 00:33:53,890 These boxes indicate when you do overflow. 736 00:33:53,890 --> 00:33:57,550 So the first several calls and returns all fit on the chip. 737 00:33:57,550 --> 00:34:00,220 And then where this line is, that meant there were too many. 738 00:34:00,220 --> 00:34:02,110 And you did these overflows when you 739 00:34:02,110 --> 00:34:03,670 did have to do memory traffic. 740 00:34:03,670 --> 00:34:05,980 So basically, the success of the buffering scheme 741 00:34:05,980 --> 00:34:07,534 is the number of boxes on this chart. 742 00:34:07,534 --> 00:34:09,159 And you can see, once we get down here, 743 00:34:09,159 --> 00:34:12,139 we're doing lots of calls and returns in a row. 744 00:34:12,139 --> 00:34:13,510 So that's the observation. 745 00:34:13,510 --> 00:34:15,159 But how well does it work in practice? 746 00:34:15,159 --> 00:34:16,659 And in particular, is this something 747 00:34:16,659 --> 00:34:18,010 that only works with C? 748 00:34:18,010 --> 00:34:21,969 Or does it work with other languages? 
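The overflow counting just described can be sketched as a small simulation. This is a simplified model, not the actual SPARC window mechanism: it assumes one register bank per procedure activation, spills the oldest bank when a call finds none free, and reloads a bank when a return reaches one that was spilled. The function name `count_spills` and the event strings are made up for illustration.

```python
def count_spills(trace, banks):
    """Count register bank overflows and underflows for a call/return trace.

    trace: iterable of "call" / "return" events.
    banks: number of register banks held on chip.
    """
    depth = 0    # current nesting depth
    lowest = 0   # shallowest depth whose registers are still on chip
    overflows = underflows = 0
    for event in trace:
        if event == "call":
            depth += 1
            if depth - lowest >= banks:  # no free bank: store the oldest to memory
                lowest += 1
                overflows += 1
        elif event == "return":
            depth -= 1
            if depth < lowest:           # caller's bank was spilled: load it back
                lowest -= 1
                underflows += 1
        else:
            raise ValueError(f"unknown event: {event!r}")
    return overflows, underflows


# Shallow, jagged behavior: everything stays on chip.
print(count_spills(["call", "return"] * 100, banks=8))        # (0, 0)
# Ten calls in a row followed by ten returns, with eight banks.
print(count_spills(["call"] * 10 + ["return"] * 10, banks=8))  # (3, 3)
```

Only the calls that push past the on-chip banks cost memory traffic, which is why the locality in the call and return pattern matters so much.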
749 00:34:21,969 --> 00:34:24,429 This chart shows the percentage of the calls that result 750 00:34:24,429 --> 00:34:25,060 in overflow. 751 00:34:25,060 --> 00:34:27,227 The percentage of those calls on the previous chart, 752 00:34:27,227 --> 00:34:29,150 they were the lines with the box. 753 00:34:29,150 --> 00:34:32,199 And this is plotted against the number of register banks 754 00:34:32,199 --> 00:34:34,170 that you have on the chip. 755 00:34:34,170 --> 00:34:35,050 OK. 756 00:34:35,050 --> 00:34:36,639 And now, with the plots that are there 757 00:34:36,639 --> 00:34:41,440 for several programs in three different languages, because I 758 00:34:41,440 --> 00:34:44,679 certainly thought that when you got into other languages 759 00:34:44,679 --> 00:34:47,580 the pattern might be quite a bit different. 760 00:34:47,580 --> 00:34:50,050 But as you can see on the slide, they all 761 00:34:50,050 --> 00:34:52,600 have the same kind of shape with the knee of the curve 762 00:34:52,600 --> 00:34:56,380 typically being somewhere between 6 and 8 763 00:34:56,380 --> 00:34:58,930 of these register banks. 764 00:34:58,930 --> 00:35:01,570 So the machines that we developed at Berkeley 765 00:35:01,570 --> 00:35:02,590 had eight. 766 00:35:02,590 --> 00:35:05,590 And then the SPARC machines had either seven or eight, 767 00:35:05,590 --> 00:35:07,990 I think the ones that have been announced so far. 768 00:35:07,990 --> 00:35:10,610 So that technique seems to work. 769 00:35:10,610 --> 00:35:12,610 And then if you have about seven or eight banks, 770 00:35:12,610 --> 00:35:14,410 you can reduce the number of stores 771 00:35:14,410 --> 00:35:17,930 to save registers off the chip and loads coming back. 772 00:35:17,930 --> 00:35:20,620 Now, there's some implications to the register windows. 773 00:35:20,620 --> 00:35:22,840 And let's evaluate those implications 774 00:35:22,840 --> 00:35:26,090 in terms of our basic performance formula.
775 00:35:26,090 --> 00:35:27,880 So first off, instruction count-- 776 00:35:27,880 --> 00:35:30,460 running those same programs I alluded to earlier on both 777 00:35:30,460 --> 00:35:34,330 the SPARC machine and the MIPS machine, 778 00:35:34,330 --> 00:35:39,850 MIPS required 40% to 60% more store instructions and only 3% 779 00:35:39,850 --> 00:35:40,875 to 20% more loads. 780 00:35:40,875 --> 00:35:42,250 That's because there's a lot more 781 00:35:42,250 --> 00:35:43,960 loads in the program for other things 782 00:35:43,960 --> 00:35:47,660 than there are simply for saving and restoring registers. 783 00:35:47,660 --> 00:35:51,100 Now, again from a computer design perspective, 784 00:35:51,100 --> 00:35:52,900 your favorite instructions are things that 785 00:35:52,900 --> 00:35:54,110 just add registers together. 786 00:35:54,110 --> 00:35:55,370 You just have to fetch the instruction. 787 00:35:55,370 --> 00:35:56,703 It doesn't bother anything else. 788 00:35:56,703 --> 00:35:59,777 After that, maybe jumps aren't so great. 789 00:35:59,777 --> 00:36:01,360 But at least they don't access memory. 790 00:36:01,360 --> 00:36:03,880 And after that are loads because at least you're reading 791 00:36:03,880 --> 00:36:06,610 instructions and data, it's all coming in the same direction. 792 00:36:06,610 --> 00:36:09,670 But the worst instructions, from my perspective, are the stores. 793 00:36:09,670 --> 00:36:11,820 They're going the wrong way on a one way street. 794 00:36:11,820 --> 00:36:13,690 You have all this data flowing at you. 795 00:36:13,690 --> 00:36:15,863 But instead, what happens is the stores, 796 00:36:15,863 --> 00:36:17,780 you have to send the data the other direction. 797 00:36:17,780 --> 00:36:21,700 So getting rid of stores is a nice thing to get rid of.
798 00:36:21,700 --> 00:36:23,587 The impact on CPI, the average number 799 00:36:23,587 --> 00:36:25,420 of clocks per instruction by your instrument 800 00:36:25,420 --> 00:36:26,878 is basically no impact. 801 00:36:26,878 --> 00:36:28,420 But a lot of people have speculated-- 802 00:36:28,420 --> 00:36:30,220 and I was looking forward to see-- 803 00:36:30,220 --> 00:36:32,560 what was going to happen to the clock cycle time. 804 00:36:32,560 --> 00:36:35,200 So far in the years that people have been building chips, 805 00:36:35,200 --> 00:36:37,030 the SPARC-based chips with register windows 806 00:36:37,030 --> 00:36:40,780 have been as fast or faster than the competing 807 00:36:40,780 --> 00:36:45,100 chips from Motorola and MIPS, indicating 808 00:36:45,100 --> 00:36:46,570 that the size of the register file 809 00:36:46,570 --> 00:36:49,270 physically on the chip, that's not affecting the clock cycle 810 00:36:49,270 --> 00:36:49,480 time. 811 00:36:49,480 --> 00:36:50,800 Something else must be the bottleneck. 812 00:36:50,800 --> 00:36:51,925 That hasn't been the issue. 813 00:36:51,925 --> 00:36:55,340 The technology it's been built with hasn't been the issue. 814 00:36:55,340 --> 00:36:57,717 So I think a lot of people concerned 815 00:36:57,717 --> 00:36:59,800 would agree about the instruction count, these two 816 00:36:59,800 --> 00:37:03,400 points, but wondered whether or not it wouldn't hurt the cycle 817 00:37:03,400 --> 00:37:04,190 time. 818 00:37:04,190 --> 00:37:05,650 And so far, that hasn't been true 819 00:37:05,650 --> 00:37:07,340 for the commercial machines. 820 00:37:07,340 --> 00:37:08,980 There's a couple other things that people have brought up 821 00:37:08,980 --> 00:37:09,730 about register windows. 822 00:37:09,730 --> 00:37:11,260 What about process switching time 823 00:37:11,260 --> 00:37:13,552 when you're switching from working completely different 824 00:37:13,552 --> 00:37:14,980 programs with more registers? 
825 00:37:14,980 --> 00:37:17,110 Well, it turns out that our friends in the operating system 826 00:37:17,110 --> 00:37:19,568 have found so many things to do at process switch time that 827 00:37:19,568 --> 00:37:22,860 actually having to save two or three times as many registers 828 00:37:22,860 --> 00:37:24,610 isn't a big deal. 829 00:37:24,610 --> 00:37:27,190 And less than 20% of the time in Unix 830 00:37:27,190 --> 00:37:30,310 is the saving of the registers away for the SPARC machine. 831 00:37:30,310 --> 00:37:32,020 Another interesting question is, 832 00:37:32,020 --> 00:37:34,490 what about-- it takes more resources on the chip. 833 00:37:34,490 --> 00:37:36,490 How's that going to affect the size of the chip? 834 00:37:36,490 --> 00:37:38,230 And the size of this chip is important 835 00:37:38,230 --> 00:37:39,980 because that affects the cost of the chip. 836 00:37:39,980 --> 00:37:41,740 The way they're manufactured today, 837 00:37:41,740 --> 00:37:44,980 it goes way up with the area of the chip. 838 00:37:44,980 --> 00:37:46,840 It turns out the Cypress die, in part 839 00:37:46,840 --> 00:37:51,535 because they have a nice thin line width, is the smallest 840 00:37:51,535 --> 00:37:53,470 die of all the RISC chips, at the time 841 00:37:53,470 --> 00:37:54,828 of this taping at least. 842 00:37:54,828 --> 00:37:56,620 And then if we looked at that very smallest 843 00:37:56,620 --> 00:37:59,840 die, how much of all that die is dedicated to the register 844 00:37:59,840 --> 00:38:00,340 windows? 845 00:38:00,340 --> 00:38:04,270 We see that it's only 10% of the die or less than that. 846 00:38:04,270 --> 00:38:06,700 Final thing to say about register windows 847 00:38:06,700 --> 00:38:09,370 is that they are optional in the SPARC architecture. 848 00:38:09,370 --> 00:38:11,590 Separate instructions from call and return 849 00:38:11,590 --> 00:38:15,370 invoke the register windows, the save and restore instructions.
850 00:38:15,370 --> 00:38:17,440 So you could take a SPARC architecture 851 00:38:17,440 --> 00:38:21,250 and write a compiler that ignored register windows. 852 00:38:21,250 --> 00:38:23,572 Also, depending on your technology, 853 00:38:23,572 --> 00:38:25,030 big architectures that are variable 854 00:38:25,030 --> 00:38:28,030 have as few as two register banks and as many as 32, 855 00:38:28,030 --> 00:38:31,060 depending on your technology. 856 00:38:31,060 --> 00:38:34,530 So lots to say about register windows. 857 00:38:34,530 --> 00:38:36,340 I want to emphasize one more time 858 00:38:36,340 --> 00:38:38,200 that these machines are so similar, 859 00:38:38,200 --> 00:38:39,920 I've never seen anything like it. 860 00:38:39,920 --> 00:38:42,670 And I don't think these small details 861 00:38:42,670 --> 00:38:45,250 will make the differences in the long term. 862 00:38:45,250 --> 00:38:47,038 But we'll see. 863 00:38:47,038 --> 00:38:48,830 Probably, to me, the most interesting thing 864 00:38:48,830 --> 00:38:52,310 about this SPARC architecture is that, as Wayne Rosing alluded, 865 00:38:52,310 --> 00:38:55,440 they've developed a family of computers. 866 00:38:55,440 --> 00:38:57,120 So they're shown on this slide. 867 00:38:57,120 --> 00:38:59,870 The SPARCstation 1, as they're calling it, 868 00:38:59,870 --> 00:39:03,800 is based on a gate array chip, a 20 megahertz CMOS gate array. 869 00:39:03,800 --> 00:39:08,750 This is the same design that was done in 1987 870 00:39:08,750 --> 00:39:12,080 in a more expensive machine, just upgraded with technology 871 00:39:12,080 --> 00:39:14,780 to the faster clock rate. 872 00:39:14,780 --> 00:39:16,640 After that is the SPARCstation-- 873 00:39:16,640 --> 00:39:18,960 I guess it's called 330. 874 00:39:18,960 --> 00:39:19,820 I see. 875 00:39:19,820 --> 00:39:21,140 The 33 from the 33 megahertz. 876 00:39:21,140 --> 00:39:23,630 I was trying to figure out what that number was. 
877 00:39:23,630 --> 00:39:25,080 This is a full custom design. 878 00:39:25,080 --> 00:39:27,365 This was done in cooperation with Cypress, 879 00:39:27,365 --> 00:39:29,300 as Wayne alluded to. 880 00:39:29,300 --> 00:39:34,550 This is the traditional approach to designing a microprocessor. 881 00:39:34,550 --> 00:39:37,340 Solbourne, one of the companies in the SPARC International 882 00:39:37,340 --> 00:39:41,360 group, has taken these chips and has built a four-way interleave 883 00:39:41,360 --> 00:39:42,860 multiprocessor-- 884 00:39:42,860 --> 00:39:45,080 four processors in the same box-- 885 00:39:45,080 --> 00:39:49,400 using, I think, the 25 megahertz version of the gate array. 886 00:39:49,400 --> 00:39:54,710 Bit is the ECL design that Wayne alluded to. 887 00:39:54,710 --> 00:39:56,930 It's been announced that it's 80 megahertz clock. 888 00:39:56,930 --> 00:39:58,610 That's 12 and 1/2 nanoseconds for you 889 00:39:58,610 --> 00:40:03,860 computer historians, the same clock cycle time as the Cray-1. 890 00:40:03,860 --> 00:40:07,100 This hasn't been announced in a Sun product yet. 891 00:40:07,100 --> 00:40:08,750 And then Prisma, which is a startup 892 00:40:08,750 --> 00:40:11,870 company, another startup company, they've announced-- 893 00:40:11,870 --> 00:40:14,390 and they've given talks on a 250 megahertz 894 00:40:14,390 --> 00:40:16,520 4 nanosecond clock version of the SPARC 895 00:40:16,520 --> 00:40:17,420 architecture. 896 00:40:17,420 --> 00:40:20,960 This is being built with lots of gallium arsenide chips.
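The clock rates and cycle times quoted for this family are two views of the same number: the cycle time is the reciprocal of the clock rate. As a minimal sketch, the conversion can be checked in a few lines of Python. The function name `cycle_time_ns` and the table layout are illustrative assumptions, not anything from the talk; only the megahertz figures come from the speaker.

```python
def cycle_time_ns(clock_mhz: float) -> float:
    """Return the clock period in nanoseconds for a clock rate given in MHz."""
    # 1 MHz is one million cycles per second, so the period is 1000 / MHz nanoseconds.
    return 1000.0 / clock_mhz


# Clock rates mentioned in the talk; the labels are informal descriptions.
clock_rates_mhz = {
    "SPARCstation 1 gate array": 20.0,
    "Gate array used by Solbourne": 25.0,
    "Cypress full custom": 33.0,
    "Bit ECL": 80.0,
    "Prisma gallium arsenide": 250.0,
}

for name, mhz in clock_rates_mhz.items():
    print(f"{name}: {mhz:g} MHz is {cycle_time_ns(mhz):.2f} ns per cycle")
```

Run as written, this reproduces the two figures the speaker states: 12.5 ns at 80 megahertz and 4 ns at 250 megahertz.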
897 00:40:20,960 --> 00:40:22,460 And it's been announced in the paper 898 00:40:22,460 --> 00:40:25,730 that Sun's trying to convince companies that 899 00:40:25,730 --> 00:40:29,900 have traditionally built PC clones and people building 900 00:40:29,900 --> 00:40:31,970 portable to build something that's 901 00:40:31,970 --> 00:40:33,717 binary compatible to SPARC architecture, 902 00:40:33,717 --> 00:40:35,300 so they can all run the same software. 903 00:40:35,300 --> 00:40:37,250 So these machines-- computer family 904 00:40:37,250 --> 00:40:39,170 means that the same binary runs on them all. 905 00:40:39,170 --> 00:40:40,680 And that's true. 906 00:40:40,680 --> 00:40:43,770 So right now, ignoring the thing-- 907 00:40:43,770 --> 00:40:46,020 right now, the things that Sun sells-- and it probably 908 00:40:46,020 --> 00:40:49,130 will change over time-- there's about a factor of-- well, not 909 00:40:49,130 --> 00:40:51,110 Sun, if I include Prisma here. 910 00:40:51,110 --> 00:40:53,090 The range in price and performance 911 00:40:53,090 --> 00:40:56,750 is about a factor of 50 in price and 21 in performance. 912 00:40:56,750 --> 00:41:00,330 The same binary program runs all the way along. 913 00:41:00,330 --> 00:41:00,830 OK. 914 00:41:00,830 --> 00:41:03,890 So let me show you some of the die photos 915 00:41:03,890 --> 00:41:06,920 and then some of the boards for these designs. 916 00:41:06,920 --> 00:41:11,690 So this first die photo is of the good old gate array design. 917 00:41:11,690 --> 00:41:15,230 And although it may be visible, the register file 918 00:41:15,230 --> 00:41:17,818 is this dark area right here. 919 00:41:17,818 --> 00:41:18,860 That's the register file. 920 00:41:18,860 --> 00:41:21,225 And as you can see, that's not much of the chip. 921 00:41:21,225 --> 00:41:23,600 The whole chip would include all the pad array and things 922 00:41:23,600 --> 00:41:25,080 like that. 923 00:41:25,080 --> 00:41:27,590 I think the ALU is over here. 
924 00:41:27,590 --> 00:41:30,138 Here's some of the control over-- 925 00:41:30,138 --> 00:41:32,180 I guess this is the instruction [INAUDIBLE] unit, 926 00:41:32,180 --> 00:41:33,390 and the control down here. 927 00:41:33,390 --> 00:41:36,890 So that gives you an idea of all the functions 928 00:41:36,890 --> 00:41:39,140 that are involved in the gate array design. 929 00:41:39,140 --> 00:41:41,210 And here's the same picture of the chip 930 00:41:41,210 --> 00:41:43,310 without all the labels on it. 931 00:41:43,310 --> 00:41:44,870 I think this one-- 932 00:41:44,870 --> 00:41:47,150 they make this run now at 25 megahertz. 933 00:41:47,150 --> 00:41:52,750 The original SPARC machine was 16.67 megahertz. 934 00:41:52,750 --> 00:41:56,190 This is the picture of the Cypress design. 935 00:41:56,190 --> 00:41:58,530 And the register file has a different orientation. 936 00:41:58,530 --> 00:42:00,960 But this is the register file right here. 937 00:42:00,960 --> 00:42:05,850 And the ALU, I think, is right below this. 938 00:42:05,850 --> 00:42:08,650 And I think this is the bus interface unit over here. 939 00:42:08,650 --> 00:42:11,250 940 00:42:11,250 --> 00:42:13,260 This is a much smaller die. 941 00:42:13,260 --> 00:42:17,910 It's a custom design and is at a higher clock rate. 942 00:42:17,910 --> 00:42:21,010 This is the ECL design, the single chip design. 943 00:42:21,010 --> 00:42:22,680 You see the register file is larger. 944 00:42:22,680 --> 00:42:26,400 It turns out, in this particular technology Bit has, 945 00:42:26,400 --> 00:42:32,850 the RAM cells aren't as dense as they are in the CMOS design. 946 00:42:32,850 --> 00:42:35,520 And the next section is all the arithmetic and logic units. 947 00:42:35,520 --> 00:42:37,980 And then control is laid out over here. 
948 00:42:37,980 --> 00:42:42,120 This chip has 125,000 transistors and is very likely 949 00:42:42,120 --> 00:42:46,410 the largest single chip ECL design, at least logic chip, 950 00:42:46,410 --> 00:42:48,180 that's ever been built. 951 00:42:48,180 --> 00:42:50,723 So you've seen the die photographs. 952 00:42:50,723 --> 00:42:52,140 When they're put into the package, 953 00:42:52,140 --> 00:42:53,610 they look something like this. 954 00:42:53,610 --> 00:42:55,980 Those two chips are actually of historical interest, 955 00:42:55,980 --> 00:43:00,480 in that they're the second SPARC chips ever made. 956 00:43:00,480 --> 00:43:02,970 Let me then show you the inside of the chip 957 00:43:02,970 --> 00:43:06,600 bonded against the pads for the Bipolar or ECL 958 00:43:06,600 --> 00:43:09,790 that I just talked about. 959 00:43:09,790 --> 00:43:14,080 So I think you may be able to see that this is the Bipolar 960 00:43:14,080 --> 00:43:14,580 design. 961 00:43:14,580 --> 00:43:17,340 You can see the register file on the left-hand side and all 962 00:43:17,340 --> 00:43:18,280 the pins around it. 963 00:43:18,280 --> 00:43:21,750 So this is the packaged part, 125,000 transistors, 964 00:43:21,750 --> 00:43:24,840 the whole chip running at 12 and 1/2 nanosecond clock cycle 965 00:43:24,840 --> 00:43:26,580 time. 966 00:43:26,580 --> 00:43:29,850 This is a type of perhaps forerunner 967 00:43:29,850 --> 00:43:32,400 of the RISC super microprocessors I'll 968 00:43:32,400 --> 00:43:33,960 mention again in my last slide. 969 00:43:33,960 --> 00:43:36,990 970 00:43:36,990 --> 00:43:41,050 Boards containing the chips that I was just talking about. 971 00:43:41,050 --> 00:43:42,930 You can see those three boards here. 972 00:43:42,930 --> 00:43:46,800 The smaller board refers to the 20 megahertz version 973 00:43:46,800 --> 00:43:48,540 that's in the SPARCstation 1.
974 00:43:48,540 --> 00:43:53,760 The board next to it contains the 33 megahertz version, 975 00:43:53,760 --> 00:43:54,990 the custom chip design. 976 00:43:54,990 --> 00:43:56,550 And over here is a prototype board 977 00:43:56,550 --> 00:43:59,023 running 70 to 80 megahertz. 978 00:43:59,023 --> 00:44:00,690 You can see, just from this perspective, 979 00:44:00,690 --> 00:44:03,683 that the advantage of the slower clock rate 980 00:44:03,683 --> 00:44:05,850 is it allows them to do more highly integrated chips 981 00:44:05,850 --> 00:44:08,010 and have a considerably smaller board design. 982 00:44:08,010 --> 00:44:10,290 This is no bigger than a sheet of notebook paper. 983 00:44:10,290 --> 00:44:13,080 And the others are quite a bit bigger than that. 984 00:44:13,080 --> 00:44:15,747 So let me focus first on this small board that's 985 00:44:15,747 --> 00:44:18,080 in the SPARCstation and identify some of the components. 986 00:44:18,080 --> 00:44:20,110 So in this lower corner of the board, 987 00:44:20,110 --> 00:44:21,630 this is the integer chip, which 988 00:44:21,630 --> 00:44:26,095 is this gate array that comes from LSI Logic or Fujitsu. 989 00:44:26,095 --> 00:44:28,470 Unfortunately, this board doesn't have the floating point 990 00:44:28,470 --> 00:44:29,220 chip. 991 00:44:29,220 --> 00:44:32,280 The gate array would be right next to it. 992 00:44:32,280 --> 00:44:34,830 Over here is the cache controller. 993 00:44:34,830 --> 00:44:37,740 And the cache data RAMs are these eight chips. 994 00:44:37,740 --> 00:44:42,250 The cache tag RAMs are those five chips over there. 995 00:44:42,250 --> 00:44:44,280 If we move back, you can see there's 996 00:44:44,280 --> 00:44:45,870 a couple of more gate arrays. 997 00:44:45,870 --> 00:44:47,710 This is the memory management unit. 998 00:44:47,710 --> 00:44:48,900 Here's the DMA chip. 999 00:44:48,900 --> 00:44:51,780 This happens to be the SCSI controller.
1000 00:44:51,780 --> 00:44:53,807 Over to the right are the SIMM modules. 1001 00:44:53,807 --> 00:44:55,140 These SIMM modules will stand up. 1002 00:44:55,140 --> 00:44:57,030 There's room for 16 of these SIMM modules, 1003 00:44:57,030 --> 00:44:59,550 which can contain up to 4 megabytes of memory per module 1004 00:44:59,550 --> 00:45:02,730 or 64 megabytes on this notebook-sized sheet of paper. 1005 00:45:02,730 --> 00:45:05,070 There's actually only 50 chips on this board 1006 00:45:05,070 --> 00:45:08,040 using a lot of high integration. 1007 00:45:08,040 --> 00:45:10,680 This is actually fewer chips than is in the Macintosh, 1008 00:45:10,680 --> 00:45:12,970 for example. 1009 00:45:12,970 --> 00:45:16,140 So now, let's go from the small board 1010 00:45:16,140 --> 00:45:19,260 to the 33 megahertz design. 1011 00:45:19,260 --> 00:45:21,600 This is the board that's based on the 33 megahertz 1012 00:45:21,600 --> 00:45:27,150 version, which goes in the SPARCstation 33-0, 330. 1013 00:45:27,150 --> 00:45:29,160 Here is the Cypress chip. 1014 00:45:29,160 --> 00:45:31,020 It's the custom design. 1015 00:45:31,020 --> 00:45:35,490 Next to it is a controller for the floating point, and then 1016 00:45:35,490 --> 00:45:38,130 a TI floating point chip. 1017 00:45:38,130 --> 00:45:42,100 As we zoom back and see more of the board, 1018 00:45:42,100 --> 00:45:46,140 this is the cache controller logic over here. 1019 00:45:46,140 --> 00:45:48,090 And the cache chips are nearby. 1020 00:45:48,090 --> 00:45:50,010 What's interesting about this design 1021 00:45:50,010 --> 00:45:52,350 is that they have a cache strictly 1022 00:45:52,350 --> 00:45:56,220 for I/O, which more or less doubles the I/O performance 1023 00:45:56,220 --> 00:45:59,100 compared to previous systems without such a cache 1024 00:45:59,100 --> 00:46:03,250 and these controllers and these two gate array chips. 1025 00:46:03,250 --> 00:46:06,410 Let's go to the final board design next.
1026 00:46:06,410 --> 00:46:12,290 Here is the board that contains the ECL chip that I just talked 1027 00:46:12,290 --> 00:46:13,610 about earlier in the slide. 1028 00:46:13,610 --> 00:46:16,490 I can show the example of the die. 1029 00:46:16,490 --> 00:46:19,430 Here is the package that I showed you. 1030 00:46:19,430 --> 00:46:21,530 Here's the integer unit. 1031 00:46:21,530 --> 00:46:24,230 Next to it is the cache controller unit. 1032 00:46:24,230 --> 00:46:28,370 Over here are the registers that are for the floating point 1033 00:46:28,370 --> 00:46:31,070 registers, not the integer register file, not the register 1034 00:46:31,070 --> 00:46:34,430 windows, but just the 32 32-bit registers for the floating 1035 00:46:34,430 --> 00:46:35,600 point unit. 1036 00:46:35,600 --> 00:46:37,580 Here is the multiplier. 1037 00:46:37,580 --> 00:46:39,560 And here's the floating point adder, 1038 00:46:39,560 --> 00:46:41,180 floating point multiplier. 1039 00:46:41,180 --> 00:46:43,922 Here are the bus interface units. 1040 00:46:43,922 --> 00:46:45,380 Now, what's interesting-- you can't 1041 00:46:45,380 --> 00:46:46,160 see from this perspective. 1042 00:46:46,160 --> 00:46:47,160 I'll change in a minute. 1043 00:46:47,160 --> 00:46:50,210 But these all have heavy metal hats on it 1044 00:46:50,210 --> 00:46:53,630 to remove the heat from the chips. 1045 00:46:53,630 --> 00:46:55,870 This runs at about 20 watts. 1046 00:46:55,870 --> 00:46:56,730 This is around 18. 1047 00:46:56,730 --> 00:46:58,910 And the rest of the chips, these ECL chips, 1048 00:46:58,910 --> 00:47:00,680 run at about 15 watts. 1049 00:47:00,680 --> 00:47:05,470 You'll be able to see the hats as I turn this sideways. 1050 00:47:05,470 --> 00:47:09,120 Maybe you can zoom in. 1051 00:47:09,120 --> 00:47:10,790 OK. 1052 00:47:10,790 --> 00:47:13,490 So there needs to be greater spacing on this board, 1053 00:47:13,490 --> 00:47:19,790 as you can see, to contain these heatsinks on the top.
1054 00:47:19,790 --> 00:47:23,270 A single board, 70-80 megahertz in this prototype, 1055 00:47:23,270 --> 00:47:26,780 machine running close to the speeds of the Cray-1. 1056 00:47:26,780 --> 00:47:29,510 So now, I've talked about some of the differences in the SPARC 1057 00:47:29,510 --> 00:47:33,170 architectures, where the ideas came from, the technical ideas. 1058 00:47:33,170 --> 00:47:35,220 Let me talk about historically where 1059 00:47:35,220 --> 00:47:37,640 were these ideas developed. 1060 00:47:37,640 --> 00:47:39,860 The first RISC machine was developed at IBM 1061 00:47:39,860 --> 00:47:42,110 in the Yorktown division, the IBM [INAUDIBLE].. 1062 00:47:42,110 --> 00:47:45,860 This was actually a 24-bit ECL minicomputer led 1063 00:47:45,860 --> 00:47:47,930 by John Cocke and George Radin. 1064 00:47:47,930 --> 00:47:50,180 I think the particular emphasis of that group 1065 00:47:50,180 --> 00:47:52,520 was to push the compiler technology. 1066 00:47:52,520 --> 00:47:56,540 It turned out that, for the compiler technology, 1067 00:47:56,540 --> 00:47:59,510 a fairly simple architecture was a better match for the compiler 1068 00:47:59,510 --> 00:48:01,340 technology they were pushing. 1069 00:48:01,340 --> 00:48:06,080 Their emphasis was on a subset of PL/I that they call PL.8. 1070 00:48:06,080 --> 00:48:08,005 They invented a custom operating system. 1071 00:48:08,005 --> 00:48:09,380 And their comparisons were always 1072 00:48:09,380 --> 00:48:14,630 against the IBM 370 family machines, quite naturally. 1073 00:48:14,630 --> 00:48:18,620 The RISC research at Berkeley, which I was involved with, 1074 00:48:18,620 --> 00:48:20,720 led to the first two RISC microprocessors. 1075 00:48:20,720 --> 00:48:22,760 They were 32-bit microprocessors.
1076 00:48:22,760 --> 00:48:24,530 And as I mentioned earlier, the emphasis 1077 00:48:24,530 --> 00:48:26,540 was trying to come up with an architecture 1078 00:48:26,540 --> 00:48:29,750 to track the rapid changes in VLSI. 1079 00:48:29,750 --> 00:48:34,760 And we also gave these family of machines their name. 1080 00:48:34,760 --> 00:48:38,660 The emphasis there was on C and Unix running on RISC machines. 1081 00:48:38,660 --> 00:48:42,590 And the comparisons were against the VAX and the 68000. 1082 00:48:42,590 --> 00:48:44,810 Professor Hennessy at Stanford University 1083 00:48:44,810 --> 00:48:46,790 led another RISC effort. 1084 00:48:46,790 --> 00:48:48,770 They built the third RISC microprocessor 1085 00:48:48,770 --> 00:48:52,840 in pushing compiler technology and trying to track VLSI. 1086 00:48:52,840 --> 00:48:56,510 And the emphasis was in Pascal and typical comparisons 1087 00:48:56,510 --> 00:48:58,270 against PDP-10. 1088 00:48:58,270 --> 00:49:01,820 But I think from a longer term historical perspective that 1089 00:49:01,820 --> 00:49:03,440 precedes all of these designs-- 1090 00:49:03,440 --> 00:49:07,235 and that goes back to Seymour Cray and his classic CDC-6600 1091 00:49:07,235 --> 00:49:09,740 and later in his Cray-1 machine. 1092 00:49:09,740 --> 00:49:13,460 You can see the same principles, in my perspective, applied. 1093 00:49:13,460 --> 00:49:16,250 Now, if you look at-- 1094 00:49:16,250 --> 00:49:20,390 if you try and find Seymour Cray give a talk, first of all, 1095 00:49:20,390 --> 00:49:21,380 go see it. 1096 00:49:21,380 --> 00:49:24,320 He does these about once a decade, as far as I can tell. 1097 00:49:24,320 --> 00:49:27,500 So I happen to be fumbling through the video library 1098 00:49:27,500 --> 00:49:31,670 at Berkeley and find out we had a taping of Seymour Cray 1099 00:49:31,670 --> 00:49:33,950 right before the announcement of the Cray-1. 
1100 00:49:33,950 --> 00:49:36,740 Lawrence Livermore, which always buys the first model 1101 00:49:36,740 --> 00:49:39,860 of the supercomputers, I guess could use that against him 1102 00:49:39,860 --> 00:49:41,390 to make him give a public talk. 1103 00:49:41,390 --> 00:49:43,280 And what this quote is on the slide 1104 00:49:43,280 --> 00:49:45,620 is from part of that talk. 1105 00:49:45,620 --> 00:49:48,710 And this is, I think, the year before the Cray-1. 1106 00:49:48,710 --> 00:49:51,440 So he said, registers-- he was referring to the 6600 1107 00:49:51,440 --> 00:49:52,700 when he was talking about it. 1108 00:49:52,700 --> 00:49:54,533 Registers made the instructions very simple. 1109 00:49:54,533 --> 00:49:56,690 And that thought is still with me 1110 00:49:56,690 --> 00:49:59,180 and is very present in the machine I am designing now, 1111 00:49:59,180 --> 00:50:00,110 the Cray-1. 1112 00:50:00,110 --> 00:50:02,762 That is somewhat unique. 1113 00:50:02,762 --> 00:50:04,220 What's left off the slide, he talks 1114 00:50:04,220 --> 00:50:06,762 about most machines have very elaborate instructions and more 1115 00:50:06,762 --> 00:50:07,730 memory accesses. 1116 00:50:07,730 --> 00:50:10,023 Then he concludes with simplicity, I guess, 1117 00:50:10,023 --> 00:50:10,940 is a way of saying it. 1118 00:50:10,940 --> 00:50:12,380 I am all for simplicity. 1119 00:50:12,380 --> 00:50:15,980 If it's very complicated, I can't understand it. 1120 00:50:15,980 --> 00:50:18,350 OK. 1121 00:50:18,350 --> 00:50:20,000 So what happened to all this research? 1122 00:50:20,000 --> 00:50:22,460 Well, it turned out each of those research projects 1123 00:50:22,460 --> 00:50:24,050 led to commercial products. 1124 00:50:24,050 --> 00:50:28,460 All of them added floating point to the simple integer model. 1125 00:50:28,460 --> 00:50:32,150 The IBM 801 led to the IBM RT PC. 1126 00:50:32,150 --> 00:50:34,655 But it turned out that was a brand new design. 
1127 00:50:34,655 --> 00:50:36,560 Or it was a heavily changed design. 1128 00:50:36,560 --> 00:50:39,320 They cut the number of registers from 32 to 16. 1129 00:50:39,320 --> 00:50:41,360 They introduced 16-bit instructions 1130 00:50:41,360 --> 00:50:43,250 and even microcode in that design. 1131 00:50:43,250 --> 00:50:45,350 So that was a lot of changes. 1132 00:50:45,350 --> 00:50:47,360 The SPARC design this tape's about 1133 00:50:47,360 --> 00:50:49,670 is very close to the RISC design. 1134 00:50:49,670 --> 00:50:51,170 I'll talk about how it was extended, 1135 00:50:51,170 --> 00:50:55,460 but floating point and this LISP support or Smalltalk support. 1136 00:50:55,460 --> 00:50:56,840 At Stanford, the MIPS-- 1137 00:50:56,840 --> 00:51:00,500 Stanford led-- in fact, the name is the name of the company. 1138 00:51:00,500 --> 00:51:03,110 But it was really a brand new instruction set design. 1139 00:51:03,110 --> 00:51:04,700 Having 2x the number of registers, 1140 00:51:04,700 --> 00:51:06,705 having single size instructions, some people 1141 00:51:06,705 --> 00:51:08,080 have said that the MIPS design is 1142 00:51:08,080 --> 00:51:10,780 closer to the Berkeley design than it 1143 00:51:10,780 --> 00:51:11,950 was to the Stanford design. 1144 00:51:11,950 --> 00:51:14,470 But I certainly wouldn't say that about my good friend John 1145 00:51:14,470 --> 00:51:16,810 Hennessy. 1146 00:51:16,810 --> 00:51:18,400 All right. 1147 00:51:18,400 --> 00:51:20,020 So there's this work done at Berkeley. 1148 00:51:20,020 --> 00:51:21,790 What happened at Sun? 1149 00:51:21,790 --> 00:51:24,700 Well, as Wayne mentioned, Sun was a pretty small company 1150 00:51:24,700 --> 00:51:27,130 when it decided on this fairly bold venture 1151 00:51:27,130 --> 00:51:30,670 as to no longer rely exclusively on this giant semiconductor 1152 00:51:30,670 --> 00:51:33,065 house, but to go off on their own.
1153 00:51:33,065 --> 00:51:34,690 Starting with the RISC instruction set, 1154 00:51:34,690 --> 00:51:36,610 a team of operating system people, 1155 00:51:36,610 --> 00:51:39,730 compiler people, and architects and hardware designers 1156 00:51:39,730 --> 00:51:41,530 developed the SPARC instruction set. 1157 00:51:41,530 --> 00:51:43,990 The emphasis was on simplicity because the importance 1158 00:51:43,990 --> 00:51:46,510 of getting this thing to market, the resources 1159 00:51:46,510 --> 00:51:49,240 available in a company-- a $30 million 200 person 1160 00:51:49,240 --> 00:51:50,780 company-- wasn't that great. 1161 00:51:50,780 --> 00:51:52,930 And it needed to be scalable to new technologies. 1162 00:51:52,930 --> 00:51:55,420 That was another reason to keep it simple. 1163 00:51:55,420 --> 00:51:58,680 And so as a result, the 1987 gate array, 1164 00:51:58,680 --> 00:52:00,520 the first time a microprocessor has ever 1165 00:52:00,520 --> 00:52:03,370 been built as a gate array, was faster than the custom designs 1166 00:52:03,370 --> 00:52:04,720 from Intel or Motorola. 1167 00:52:04,720 --> 00:52:08,600 1168 00:52:08,600 --> 00:52:11,090 Now, let me spend some time talking a little bit 1169 00:52:11,090 --> 00:52:13,140 about the future of design. 1170 00:52:13,140 --> 00:52:15,170 So if you get nothing else from this tape, 1171 00:52:15,170 --> 00:52:17,050 I'm going to try and burn into your memory 1172 00:52:17,050 --> 00:52:21,810 this performance equation that you're trying to minimize 1173 00:52:21,810 --> 00:52:23,110 in getting higher performance. 1174 00:52:23,110 --> 00:52:25,903 So this slide is about higher performance. 1175 00:52:25,903 --> 00:52:28,070 Well, we're going to try and improve the clock cycle 1176 00:52:28,070 --> 00:52:29,487 time-- all the RISC machines are-- 1177 00:52:29,487 --> 00:52:31,630 by going to new technologies. 
1178 00:52:31,630 --> 00:52:34,730 There's already these custom ECL chips being developed 1179 00:52:34,730 --> 00:52:37,290 by Sun and other companies. 1180 00:52:37,290 --> 00:52:40,850 There's a lot of interest in the BICMOS, combination Bipolar 1181 00:52:40,850 --> 00:52:44,060 and CMOS designs, and interest as well in gallium arsenide 1182 00:52:44,060 --> 00:52:46,610 designs, like I mentioned. 1183 00:52:46,610 --> 00:52:49,730 And this is all pointed at the fetch clock cycle time. 1184 00:52:49,730 --> 00:52:52,845 The higher integration of trying to get the caches, the memory 1185 00:52:52,845 --> 00:52:54,470 management unit, and the floating point 1186 00:52:54,470 --> 00:52:55,965 all on the same die-- 1187 00:52:55,965 --> 00:52:57,590 this is going to obviously lower costs, 1188 00:52:57,590 --> 00:52:59,960 but also allows a faster clock cycle time 1189 00:52:59,960 --> 00:53:01,850 because they all fit together. 1190 00:53:01,850 --> 00:53:05,300 As you can see on the monitor, what I'm talking about 1191 00:53:05,300 --> 00:53:10,010 is combining the integer chip, the floating point 1192 00:53:10,010 --> 00:53:15,170 chip, the cache control chip, all of the data cache RAMs, 1193 00:53:15,170 --> 00:53:19,760 the tag RAMs, the memory management unit, and-- 1194 00:53:19,760 --> 00:53:23,180 who knows-- maybe even the DMA, all of those chips into one 1195 00:53:23,180 --> 00:53:23,720 die. 1196 00:53:23,720 --> 00:53:25,732 That's going to reduce clearly the board area. 1197 00:53:25,732 --> 00:53:27,440 And by making things smaller, maybe it'll 1198 00:53:27,440 --> 00:53:28,773 have a little faster cycle time. 1199 00:53:28,773 --> 00:53:32,240 1200 00:53:32,240 --> 00:53:36,650 The area that's probably most novel from a computer 1201 00:53:36,650 --> 00:53:38,210 architecture perspective is what's 1202 00:53:38,210 --> 00:53:40,340 been called superscalar design.
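The performance equation this part of the talk keeps returning to multiplies three terms: instruction count, average clocks per instruction (CPI), and clock cycle time. The sketch below is one way to play with it in Python. The function name `cpu_time_seconds` and every workload number (one billion instructions, the CPI values) are made-up assumptions chosen for easy arithmetic, not measurements of any SPARC machine; only the 50 ns and 12.5 ns cycle times correspond to clock rates quoted in the talk (20 and 80 megahertz).

```python
def cpu_time_seconds(instruction_count: int, cpi: float, cycle_time_ns: float) -> float:
    """CPU time = instruction count x clocks per instruction x clock cycle time."""
    return instruction_count * cpi * cycle_time_ns * 1e-9


# Hypothetical workload: one billion instructions at an assumed CPI of 1.5.
baseline = cpu_time_seconds(1_000_000_000, 1.5, 50.0)   # 20 MHz clock, 50 ns cycle

# Same program and CPI, faster technology: only the cycle-time term changes.
faster_clock = cpu_time_seconds(1_000_000_000, 1.5, 12.5)  # 80 MHz clock, 12.5 ns cycle

# Same clock, but an assumed CPI below 1 from issuing more than one
# instruction per cycle: only the CPI term changes.
lower_cpi = cpu_time_seconds(1_000_000_000, 0.75, 50.0)

print(f"baseline:     {baseline:.2f} s")
print(f"faster clock: {faster_clock:.2f} s")
print(f"lower CPI:    {lower_cpi:.2f} s")
```

Because the three terms multiply, a fourfold shorter cycle time and a twofold lower CPI give fourfold and twofold shorter run times respectively, provided the other terms stay fixed. That proviso is the point of the register window discussion earlier: a change that improves one term is only a win if it does not hurt another.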
1203 00:53:40,340 --> 00:53:43,160 Rather than fetching one instruction every clock cycle, 1204 00:53:43,160 --> 00:53:45,830 the RISC machines are headed towards multiple instructions 1205 00:53:45,830 --> 00:53:49,190 per clock cycle, two or three so that if you're 1206 00:53:49,190 --> 00:53:51,230 trying to execute two at the same time, 1207 00:53:51,230 --> 00:53:52,940 you're going to lower the CPI. 1208 00:53:52,940 --> 00:53:56,240 In fact, you're going to lower the CPI below 1 to try 1209 00:53:56,240 --> 00:53:58,075 and minimize this formula. 1210 00:53:58,075 --> 00:54:00,200 And of course, there's going to be continued effort 1211 00:54:00,200 --> 00:54:02,930 on trying to use better technology, better algorithms 1212 00:54:02,930 --> 00:54:05,450 to improve compilers so you can lower the instruction count. 1213 00:54:05,450 --> 00:54:07,250 So that's, for me, the future direction, 1214 00:54:07,250 --> 00:54:09,290 in terms of performance improvements and cost 1215 00:54:09,290 --> 00:54:11,820 improvements. 1216 00:54:11,820 --> 00:54:16,490 Now, the big limit to any computer architecture 1217 00:54:16,490 --> 00:54:18,400 is the size of its address. 1218 00:54:18,400 --> 00:54:22,340 The RISC machines came out with 32-bit architectures. 1219 00:54:22,340 --> 00:54:24,530 Many people, including Gordon Bell 1220 00:54:24,530 --> 00:54:26,630 talked about how the only mistake 1221 00:54:26,630 --> 00:54:28,790 that you can't recover from in computer design 1222 00:54:28,790 --> 00:54:30,060 is the address size. 1223 00:54:30,060 --> 00:54:32,380 So in fact, computer designers have used the addresses 1224 00:54:32,380 --> 00:54:35,090 as an excuse to redesign their instruction sets. 1225 00:54:35,090 --> 00:54:38,660 The RISC machines at 32 bits are pretty close to those limits. 
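To put a number on how close 32-bit machines are to those limits, here is a small Python sketch. It assumes byte addressing and a flat address space, which the talk does not state explicitly, and the helper name `address_space_bytes` is invented for the example. The 64 megabyte figure is the maximum memory quoted earlier for the SPARCstation 1 board.

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 2 ** address_bits


MEGABYTE = 2 ** 20

limit_32 = address_space_bytes(32)
board_memory = 64 * MEGABYTE  # 16 modules of 4 megabytes each, as quoted in the talk

print(f"32-bit address space: {limit_32 // MEGABYTE} megabytes")
print(f"Board memory: {board_memory // MEGABYTE} megabytes")
print(f"Headroom: a factor of {limit_32 // board_memory}")
```

Under these assumptions a desktop board already reaches 1/64 of the whole 32-bit address space, which is only six doublings of memory capacity away from the ceiling.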
1226 00:54:38,660 --> 00:54:40,590 And so I would expect, in the next few years, 1227 00:54:40,590 --> 00:54:43,520 we're going to see proposals for much larger than 32 1228 00:54:43,520 --> 00:54:47,840 bits of addressing in probably all computers, not just RISC 1229 00:54:47,840 --> 00:54:49,950 designs. 1230 00:54:49,950 --> 00:54:55,820 So for my final slide, let me roll the dice here and try 1231 00:54:55,820 --> 00:54:58,940 and describe the future under the possibility somebody 1232 00:54:58,940 --> 00:55:02,600 might even be seeing the tape at the time I'm predicting. 1233 00:55:02,600 --> 00:55:05,013 So let's say that, for some sizable fraction 1234 00:55:05,013 --> 00:55:06,680 of the scientific community in the years 1235 00:55:06,680 --> 00:55:11,420 1993, 1996, that the heart of almost all 1236 00:55:11,420 --> 00:55:13,130 of these systems in this community 1237 00:55:13,130 --> 00:55:16,730 is going to be a RISC supermicroprocessor, millions 1238 00:55:16,730 --> 00:55:18,890 of transistors on one chip implementing 1239 00:55:18,890 --> 00:55:20,540 a RISC-style architecture. 1240 00:55:20,540 --> 00:55:22,160 At the low end is the workstation 1241 00:55:22,160 --> 00:55:25,910 or the desktop, that you have a single one of these devices 1242 00:55:25,910 --> 00:55:27,680 with a fairly simple memory system trying 1243 00:55:27,680 --> 00:55:29,240 to get the cost down. 1244 00:55:29,240 --> 00:55:31,220 The next step up, the file server 1245 00:55:31,220 --> 00:55:34,400 or time-share minicomputer, will see a few of these RISC 1246 00:55:34,400 --> 00:55:36,350 supermicroprocessors with a much better memory 1247 00:55:36,350 --> 00:55:38,930 system and perhaps a much better I/O system as well.
1248 00:55:38,930 --> 00:55:43,670 And at the high end could be the supercomputers with many-- 1249 00:55:43,670 --> 00:55:46,890 with many maybe measured in thousands-- but many RISC 1250 00:55:46,890 --> 00:55:48,520 supermicroprocessors that are going 1251 00:55:48,520 --> 00:55:50,270 to have some kind of communication network 1252 00:55:50,270 --> 00:55:53,360 to allow them all to talk together and a very much larger I/O 1253 00:55:53,360 --> 00:55:55,200 system. 1254 00:55:55,200 --> 00:55:57,450 Thank you very much. 1255 00:55:57,450 --> 00:55:58,710 Are there any questions? 1256 00:55:58,710 --> 00:56:00,350 Are there differences in the type 1257 00:56:00,350 --> 00:56:03,230 of applications that are suitable for the RISC 1258 00:56:03,230 --> 00:56:05,720 versus CISC processor? 1259 00:56:05,720 --> 00:56:08,000 And will that change over time? 1260 00:56:08,000 --> 00:56:11,733 The RISC machines seem to have spread out pretty widely. 1261 00:56:11,733 --> 00:56:13,400 One thing you mentioned in your question 1262 00:56:13,400 --> 00:56:15,650 is, well, the example of transaction processing. 1263 00:56:15,650 --> 00:56:19,555 Well, one of the leaders in transaction processing, Tandem, 1264 00:56:19,555 --> 00:56:20,930 has announced that they are going 1265 00:56:20,930 --> 00:56:23,472 to be building a line of computers based around the RISC machine 1266 00:56:23,472 --> 00:56:24,320 architectures. 1267 00:56:24,320 --> 00:56:26,680 It looks like, for transaction processing, which 1268 00:56:26,680 --> 00:56:28,430 is something I'm learning something about, 1269 00:56:28,430 --> 00:56:30,978 because my current research is involved in higher performance 1270 00:56:30,978 --> 00:56:33,020 I/O and that's one of the areas we're looking at, 1271 00:56:33,020 --> 00:56:37,880 is it's kind of like operating systems. 1272 00:56:37,880 --> 00:56:39,380 It's the same types of instructions.
1273 00:56:39,380 --> 00:56:42,050 So from an instruction set perspective, higher 1274 00:56:42,050 --> 00:56:44,163 performance, lower cost is very attractive. 1275 00:56:44,163 --> 00:56:45,830 There doesn't seem to be anything there. 1276 00:56:45,830 --> 00:56:48,290 I think the one area that people believe 1277 00:56:48,290 --> 00:56:51,410 that there's some interest is in the traditional business data 1278 00:56:51,410 --> 00:56:57,650 processing dealing with the BCD encoding of data. 1279 00:56:57,650 --> 00:57:00,830 That's an area where there's still a lot of argument about 1280 00:57:00,830 --> 00:57:03,450 whether or not RISC machines are the right thing to do. 1281 00:57:03,450 --> 00:57:06,030 I know that the Hewlett-Packard Precision Architecture, 1282 00:57:06,030 --> 00:57:09,977 which was concerned about this, put some support-- 1283 00:57:09,977 --> 00:57:12,560 but it's a very modest amount of support-- in for the business 1284 00:57:12,560 --> 00:57:14,890 data processing. 1285 00:57:14,890 --> 00:57:17,960 But I think it'll have to-- that's probably 1286 00:57:17,960 --> 00:57:20,510 the one area that's a toss-up that you can get good arguments 1287 00:57:20,510 --> 00:57:21,380 both ways. 1288 00:57:21,380 --> 00:57:24,440 The rest of the areas, by examples of machines, 1289 00:57:24,440 --> 00:57:27,655 we seem to see people lining up. 1290 00:57:27,655 --> 00:57:29,780 The only other thing I'd say from that perspective, 1291 00:57:29,780 --> 00:57:31,992 in terms of what Wayne said once again, 1292 00:57:31,992 --> 00:57:33,950 in terms of a software developer's perspective, 1293 00:57:33,950 --> 00:57:36,660 is just the number of machines out there. 1294 00:57:36,660 --> 00:57:40,310 So even if it's a technically sound instruction set, 1295 00:57:40,310 --> 00:57:42,080 the reluctance of porting applications 1296 00:57:42,080 --> 00:57:45,270 is simply how many places can they sell the program.
1297 00:57:45,270 --> 00:57:47,000 And so that's why I think we'll see 1298 00:57:47,000 --> 00:57:52,940 the 8086 family live forever and the IBM 360 live forever, too. 1299 00:57:52,940 --> 00:57:55,760 Are the SPARC machines from the different vendors 1300 00:57:55,760 --> 00:57:58,490 going to execute the same binaries, 1301 00:57:58,490 --> 00:58:02,690 despite the differences in cache and register window size? 1302 00:58:02,690 --> 00:58:03,682 The answer is yes. 1303 00:58:03,682 --> 00:58:05,390 Now, the thing that was interesting to me 1304 00:58:05,390 --> 00:58:07,723 that I really didn't know about the binary compatibility 1305 00:58:07,723 --> 00:58:09,770 is it turns out, like every single company, 1306 00:58:09,770 --> 00:58:12,410 even though they're binary compatible, 1307 00:58:12,410 --> 00:58:13,910 the kernels of the operating systems 1308 00:58:13,910 --> 00:58:15,452 are different on all these companies. 1309 00:58:15,452 --> 00:58:16,580 I didn't know that. 1310 00:58:16,580 --> 00:58:18,455 And because the I/O's are somewhat different, 1311 00:58:18,455 --> 00:58:20,247 when you start it, it's somewhat different. 1312 00:58:20,247 --> 00:58:22,040 So even though the VAX and the IBM families 1313 00:58:22,040 --> 00:58:24,498 are binary compatible, there's some pieces of the operating 1314 00:58:24,498 --> 00:58:25,740 system that are different. 1315 00:58:25,740 --> 00:58:28,070 And that's true for the RISC machines as well. 1316 00:58:28,070 --> 00:58:31,080 There's some details of dealing with the I/O 1317 00:58:31,080 --> 00:58:32,805 system, or the buses, or the caches, 1318 00:58:32,805 --> 00:58:34,430 because the caches vary, where there'll 1319 00:58:34,430 --> 00:58:35,810 be a piece of the operating system that'll 1320 00:58:35,810 --> 00:58:37,740 have to be different for every single machine. 
1321 00:58:37,740 --> 00:58:39,157 But that apparently is what people 1322 00:58:39,157 --> 00:58:42,350 call binary compatibility or user binary compatibility. 1323 00:58:42,350 --> 00:58:47,030 How do you see distributing processing power through memory 1324 00:58:47,030 --> 00:58:50,780 as the memory chips get denser? 1325 00:58:50,780 --> 00:58:54,380 There will be a bottleneck at the chip level. 1326 00:58:54,380 --> 00:58:57,290 I think, in this one area, the supercomputer designs, 1327 00:58:57,290 --> 00:59:00,500 when you're talking about lots of processors-- 1328 00:59:00,500 --> 00:59:04,250 I think that's an area where it's possible 1329 00:59:04,250 --> 00:59:06,260 that, if you're talking about thousands 1330 00:59:06,260 --> 00:59:08,660 of processors and thousands of DRAMs 1331 00:59:08,660 --> 00:59:10,250 and it's enough of a volume, you might 1332 00:59:10,250 --> 00:59:13,160 see some real radical change somewhere down the line 1333 00:59:13,160 --> 00:59:15,440 where the processors are placed in the DRAM chips. 1334 00:59:15,440 --> 00:59:17,360 That might be a much more economical design. 1335 00:59:17,360 --> 00:59:20,750 1336 00:59:20,750 --> 00:59:22,220 Right now, that'd be very dangerous 1337 00:59:22,220 --> 00:59:24,620 because you're fighting against two strong forces-- 1338 00:59:24,620 --> 00:59:28,190 a mass produced microprocessor and mass produced memory. 1339 00:59:28,190 --> 00:59:30,530 And that has strong economic advantages. 1340 00:59:30,530 --> 00:59:32,960 And if you can overcome the potential memory bottlenecks 1341 00:59:32,960 --> 00:59:35,760 with your design, you have tremendous economies of scale. 1342 00:59:35,760 --> 00:59:39,140 If, on the other hand, you're designing your own custom RAM 1343 00:59:39,140 --> 00:59:41,090 with your own custom memory, you're 1344 00:59:41,090 --> 00:59:44,720 in jeopardy of being left way behind the technology as well 1345 00:59:44,720 --> 00:59:45,980 as the cost performance.
1346 00:59:45,980 --> 00:59:47,540 But maybe down the line a little bit, 1347 00:59:47,540 --> 00:59:51,050 if you could cooperate with some semiconductor manufacturer 1348 00:59:51,050 --> 00:59:53,745 in making DRAM, you're able to put your microprocessor 1349 00:59:53,745 --> 00:59:56,120 on there where you have so many thousands of transistors. 1350 00:59:56,120 --> 00:59:59,930 There might be just this huge switch in these high-ends. 1351 00:59:59,930 --> 01:00:01,430 I think, in the low ends, I wouldn't 1352 01:00:01,430 --> 01:00:02,930 expect much of a change, at least 1353 01:00:02,930 --> 01:00:05,450 in the time frame on that slide. 1354 01:00:05,450 --> 01:00:09,990 Most RISC machines today run the Unix operating system. 1355 01:00:09,990 --> 01:00:13,250 Is there anything inherent in SPARC that would make it 1356 01:00:13,250 --> 01:00:16,430 unsuitable for other operating systems? 1357 01:00:16,430 --> 01:00:18,488 I don't know of anything that we did 1358 01:00:18,488 --> 01:00:20,780 in the instruction set design that was particularly 1359 01:00:20,780 --> 01:00:22,970 for Unix. 1360 01:00:22,970 --> 01:00:25,190 Unix enabled the RISC machines because it 1361 01:00:25,190 --> 01:00:27,110 was the first portable operating system. 1362 01:00:27,110 --> 01:00:30,500 It was all written in C. VMS, in contrast, for example, 1363 01:00:30,500 --> 01:00:34,170 or DOS, or OS/2, are all written in assembly language. 1364 01:00:34,170 --> 01:00:36,350 So they were an obstacle to RISC machines 1365 01:00:36,350 --> 01:00:37,730 because they couldn't be ported. 1366 01:00:37,730 --> 01:00:41,570 Unix was portable, which freed instruction set designers 1367 01:00:41,570 --> 01:00:43,890 from having to invent new instruction sets. 1368 01:00:43,890 --> 01:00:45,620 So the first person-- 1369 01:00:45,620 --> 01:00:47,390 once that was possible, we were allowed 1370 01:00:47,390 --> 01:00:50,340 to change instruction sets and make some economic sense.
1371 01:00:50,340 --> 01:00:53,240 So I think any operating system written in a high-level 1372 01:00:53,240 --> 01:00:57,260 language would work on any of these RISC processors. 1373 01:00:57,260 --> 01:01:00,310 [MUSIC PLAYING] 1374 01:01:00,310 --> 01:02:06,000