1
00:00:00,000 --> 00:00:00,625
[MUSIC PLAYING]

2
00:00:00,625 --> 00:00:02,228
University Video
Communications is

3
00:00:02,228 --> 00:00:04,770
pleased to present this edition
of the "Distinguished Lecture

4
00:00:04,770 --> 00:00:05,490
Series--

5
00:00:05,490 --> 00:00:07,470
Industry Leaders
in Computer Science

6
00:00:07,470 --> 00:00:09,090
and Electrical Engineering."

7
00:00:09,090 --> 00:00:12,660
Today, Sun Microsystems
brings us Dr. David Patterson

8
00:00:12,660 --> 00:00:14,610
on the story of SPARC.

9
00:00:14,610 --> 00:00:18,360
As the first person on the
Sun RISC Project in 1984,

10
00:00:18,360 --> 00:00:20,220
Dr. Patterson is
uniquely qualified

11
00:00:20,220 --> 00:00:21,930
to address this subject.

12
00:00:21,930 --> 00:00:25,090
A professor since
1977 at UC Berkeley,

13
00:00:25,090 --> 00:00:27,640
he led three generations
of RISC projects

14
00:00:27,640 --> 00:00:30,060
and, in fact, won the
Distinguished Teaching Award

15
00:00:30,060 --> 00:00:34,140
in 1982 for his course series
on RISC, where many of the ideas

16
00:00:34,140 --> 00:00:35,760
were constructed and developed.

17
00:00:35,760 --> 00:00:39,330
But first, we're fortunate to
have with us Wayne Rosing, vice

18
00:00:39,330 --> 00:00:41,940
president of the Desktop
Systems Graphics Group at Sun,

19
00:00:41,940 --> 00:00:44,490
who will provide a background
of the business environment

20
00:00:44,490 --> 00:00:46,860
during the development
and introduction of SPARC.

21
00:00:46,860 --> 00:00:48,300
Wayne Rosing.

22
00:00:48,300 --> 00:00:52,260
Before Professor Dave Patterson
discusses the technical aspects

23
00:00:52,260 --> 00:00:56,400
of SPARC, I'd like to spend
a few moments discussing

24
00:00:56,400 --> 00:00:59,790
the context, the business
context in which we developed

25
00:00:59,790 --> 00:01:01,780
this architecture.

26
00:01:01,780 --> 00:01:04,349
First of all, it was
very important for Sun

27
00:01:04,349 --> 00:01:08,190
that we implement a computer
architecture that was simple.

28
00:01:08,190 --> 00:01:12,510
This system was
architected with 14 people.

29
00:01:12,510 --> 00:01:16,440
Five people were involved in
the design of the gate arrays.

30
00:01:16,440 --> 00:01:20,490
Two engineers actually
designed the CPU board.

31
00:01:20,490 --> 00:01:23,460
And approximately five people
were involved in the language

32
00:01:23,460 --> 00:01:26,050
development and the OS porting.

33
00:01:26,050 --> 00:01:29,040
So this started out as
a very small project

34
00:01:29,040 --> 00:01:32,010
because Sun was only a $30
million company at the time.

35
00:01:32,010 --> 00:01:34,050
And we did not
have the resources

36
00:01:34,050 --> 00:01:39,880
to engage in a complex, full
custom chip development.

37
00:01:39,880 --> 00:01:45,150
Second, we wanted to
have an architecture that

38
00:01:45,150 --> 00:01:49,740
was capable of scaling all the
way up from simple gate arrays

39
00:01:49,740 --> 00:01:50,970
to gallium arsenide.

40
00:01:50,970 --> 00:01:55,590
This was a very important
focal point for our design.

41
00:01:55,590 --> 00:01:57,930
And I think it was
different than the way

42
00:01:57,930 --> 00:02:01,560
many semiconductor companies
who developed microprocessors

43
00:02:01,560 --> 00:02:03,660
typically make decisions.

44
00:02:03,660 --> 00:02:05,580
The semiconductor
company normally

45
00:02:05,580 --> 00:02:10,410
tends to have one process,
like bipolar or CMOS,

46
00:02:10,410 --> 00:02:12,120
or whatever they might have.

47
00:02:12,120 --> 00:02:14,040
And they concentrate
all of their energy

48
00:02:14,040 --> 00:02:18,270
on building the best possible
part in that process.

49
00:02:18,270 --> 00:02:21,060
And often, these companies
compete with each other

50
00:02:21,060 --> 00:02:24,210
on the merits of their
semiconductor process.

51
00:02:24,210 --> 00:02:27,240
Since we're fundamentally
a systems company,

52
00:02:27,240 --> 00:02:30,990
we felt the need to develop
a computer architecture that

53
00:02:30,990 --> 00:02:32,670
would work in
technologies that would

54
00:02:32,670 --> 00:02:36,420
be appropriate from
inexpensive desktops

55
00:02:36,420 --> 00:02:39,720
all the way up to
the implementation

56
00:02:39,720 --> 00:02:43,090
of large supercomputer
class machines.

57
00:02:43,090 --> 00:02:46,830
So the architecture had to
fit in multiple technologies.

58
00:02:46,830 --> 00:02:50,060


59
00:02:50,060 --> 00:02:53,300
Next, the most
important thing really,

60
00:02:53,300 --> 00:02:55,490
from a business
consideration, was

61
00:02:55,490 --> 00:02:58,520
to develop an architecture
that would quickly

62
00:02:58,520 --> 00:03:02,810
allow us to move the
approximately 2,000 software

63
00:03:02,810 --> 00:03:06,620
applications that existed
for our Motorola product line

64
00:03:06,620 --> 00:03:07,910
to SPARC.

65
00:03:07,910 --> 00:03:09,680
Bill Gates from
Microsoft has said

66
00:03:09,680 --> 00:03:12,690
volume is everything in
the software business.

67
00:03:12,690 --> 00:03:15,920
And I think, if we look
at the history of computer

68
00:03:15,920 --> 00:03:18,200
architecture
development, we've seen

69
00:03:18,200 --> 00:03:20,810
hundreds of very elegant,
very technologically

70
00:03:20,810 --> 00:03:25,370
sophisticated computers designed
primarily by engineering folks.

71
00:03:25,370 --> 00:03:27,800
But they've mostly
been business failures

72
00:03:27,800 --> 00:03:30,530
because these systems
were not able to attract

73
00:03:30,530 --> 00:03:33,080
a critical mass of software.

74
00:03:33,080 --> 00:03:36,110
Without software, you
don't have customers.

75
00:03:36,110 --> 00:03:38,600
Without customers,
you don't have sales.

76
00:03:38,600 --> 00:03:40,400
And until you have
profitable sales,

77
00:03:40,400 --> 00:03:44,660
you cannot generate the gross
margin dollars to reinvest

78
00:03:44,660 --> 00:03:48,110
in the engineering to make
a computer architecture

79
00:03:48,110 --> 00:03:49,740
successful.

80
00:03:49,740 --> 00:03:51,440
So this is very important.

81
00:03:51,440 --> 00:03:54,560
Probably the dominant
consideration

82
00:03:54,560 --> 00:03:58,250
in the development
of SPARC was that we

83
00:03:58,250 --> 00:04:01,520
do things in a way that
allowed absolute, source

84
00:04:01,520 --> 00:04:04,970
compatible, quick
porting of applications

85
00:04:04,970 --> 00:04:08,260
over to the SPARC machines.

86
00:04:08,260 --> 00:04:12,730
Next, looking forward
to the kinds of software

87
00:04:12,730 --> 00:04:16,329
that people run on Sun systems,
which are highly interactive,

88
00:04:16,329 --> 00:04:19,149
typically
graphics-oriented software,

89
00:04:19,149 --> 00:04:21,970
even though it may be very
computationally intensive

90
00:04:21,970 --> 00:04:25,240
software, we really
looked hard at what

91
00:04:25,240 --> 00:04:27,610
is the style of
programming that's

92
00:04:27,610 --> 00:04:31,400
typical in this particular
application area.

93
00:04:31,400 --> 00:04:34,660
And if you look at window
systems and graphics systems,

94
00:04:34,660 --> 00:04:37,510
they're dominated
by the need to have

95
00:04:37,510 --> 00:04:41,750
dynamically-linked libraries
operating all the time.

96
00:04:41,750 --> 00:04:44,290
And this is an important thing.

97
00:04:44,290 --> 00:04:47,080
The classic metrics
of computer design

98
00:04:47,080 --> 00:04:51,640
tend to be running Fortran
programs, big batch programs,

99
00:04:51,640 --> 00:04:55,540
or running C programs that
are classically large batch

100
00:04:55,540 --> 00:04:56,590
programs.

101
00:04:56,590 --> 00:04:59,500
We spent a lot of time
thinking about what

102
00:04:59,500 --> 00:05:03,880
it means to have the kernel
execution of Unix be efficient,

103
00:05:03,880 --> 00:05:08,140
or what it means to have
interactive graphics code

104
00:05:08,140 --> 00:05:09,160
be efficient.

105
00:05:09,160 --> 00:05:12,190
And what this
consideration motivated

106
00:05:12,190 --> 00:05:17,200
was the large register
file in SPARC,

107
00:05:17,200 --> 00:05:19,420
as opposed to a
system that would

108
00:05:19,420 --> 00:05:22,230
be more straightforward,
for instance, of just 32

109
00:05:22,230 --> 00:05:24,460
32-bit registers.

110
00:05:24,460 --> 00:05:29,560
And this has been a point of
controversy about RISC design.

111
00:05:29,560 --> 00:05:32,140
It's almost one of the
few religious points

112
00:05:32,140 --> 00:05:35,510
so associated with
the RISC machines.

113
00:05:35,510 --> 00:05:38,875
I'm sure Dave Patterson
will discuss this somewhat.

114
00:05:38,875 --> 00:05:42,750


115
00:05:42,750 --> 00:05:47,640
Lastly, needless to say, we
wanted to make sure that we had

116
00:05:47,640 --> 00:05:52,080
a system that was going to be
efficient as programming styles

117
00:05:52,080 --> 00:05:55,120
shifted more to
object-oriented programming,

118
00:05:55,120 --> 00:06:00,330
both things like LISP, as
well as Smalltalk, and C++.

119
00:06:00,330 --> 00:06:03,060
And the early development
that we're now

120
00:06:03,060 --> 00:06:06,900
doing in advanced
software in this area

121
00:06:06,900 --> 00:06:09,150
indicates these
decisions were correct.

122
00:06:09,150 --> 00:06:15,050


123
00:06:15,050 --> 00:06:19,640
We wanted an open,
multi-vendor architecture.

124
00:06:19,640 --> 00:06:21,950
Now, it's easy to
say, well, it's open.

125
00:06:21,950 --> 00:06:24,230
But what did we
really want here?

126
00:06:24,230 --> 00:06:26,030
Why did we want this?

127
00:06:26,030 --> 00:06:30,230
The most important thing
is we, as a company,

128
00:06:30,230 --> 00:06:34,160
wanted to be able to buy the
components for our systems

129
00:06:34,160 --> 00:06:35,430
inexpensively.

130
00:06:35,430 --> 00:06:39,020
So it was important to have
multiple vendors sourcing

131
00:06:39,020 --> 00:06:42,560
and have competition
between multiple vendors.

132
00:06:42,560 --> 00:06:44,690
For instance, in
the SPARCstation 1,

133
00:06:44,690 --> 00:06:48,260
we have two sources
for the integer unit.

134
00:06:48,260 --> 00:06:51,650
We have two sources for
the floating point units.

135
00:06:51,650 --> 00:06:54,350
This is a very
important consideration

136
00:06:54,350 --> 00:06:57,290
in order to get the kind
of competitive pricing

137
00:06:57,290 --> 00:07:03,300
we need to continue to produce
cost effective machines.

138
00:07:03,300 --> 00:07:06,710
Another consideration, a
more subtle consideration,

139
00:07:06,710 --> 00:07:09,890
is Sun can invest so
many millions of dollars per

140
00:07:09,890 --> 00:07:13,460
year in fundamental
R&D of SPARC chips.

141
00:07:13,460 --> 00:07:15,710
And maybe one
semiconductor company

142
00:07:15,710 --> 00:07:18,430
could double that investment.

143
00:07:18,430 --> 00:07:20,000
But if you have
four or five, all

144
00:07:20,000 --> 00:07:22,880
of a sudden, the total
investment dollars going

145
00:07:22,880 --> 00:07:25,130
against SPARC
development goes up

146
00:07:25,130 --> 00:07:27,620
to a very significant number.

147
00:07:27,620 --> 00:07:31,550
And we feel that the investment
dollars being applied to SPARC

148
00:07:31,550 --> 00:07:34,640
probably far exceed the
investment dollars going

149
00:07:34,640 --> 00:07:39,740
into any one of the other
competitive CISC or RISC

150
00:07:39,740 --> 00:07:41,870
computer architectures.

151
00:07:41,870 --> 00:07:45,110
And that kind of
leveraging of our dollars

152
00:07:45,110 --> 00:07:49,220
with other companies' dollars
is very, very fundamental.

153
00:07:49,220 --> 00:07:53,060
Now, you might think that
there's a zero sum game that's

154
00:07:53,060 --> 00:07:54,590
being violated here.

155
00:07:54,590 --> 00:07:57,710
But remember, the semiconductor
companies that work with us

156
00:07:57,710 --> 00:08:01,400
do not have to worry about
developing languages, operating

157
00:08:01,400 --> 00:08:05,240
systems, workstations,
and development platforms.

158
00:08:05,240 --> 00:08:08,180
So they're able to concentrate
on what they do best

159
00:08:08,180 --> 00:08:11,540
and not have to subsidize
as part of their computer

160
00:08:11,540 --> 00:08:16,820
R&D all of the other aspects of
a complete system development.

161
00:08:16,820 --> 00:08:18,830
And that's been an
attractive thing

162
00:08:18,830 --> 00:08:22,190
because we have been able to
work with small companies,

163
00:08:22,190 --> 00:08:25,280
for instance like Bipolar
Integrated Technology, which

164
00:08:25,280 --> 00:08:29,000
is a startup doing an
advanced ECL process.

165
00:08:29,000 --> 00:08:31,850
It would have been impossible
for a startup computer

166
00:08:31,850 --> 00:08:36,380
company in the ECL area to
ever develop a microprocessor

167
00:08:36,380 --> 00:08:39,530
without the kind of leverage
that the Sun business

168
00:08:39,530 --> 00:08:43,520
strategy has provided.

169
00:08:43,520 --> 00:08:47,810
Last and most important, we
want the pace of innovation

170
00:08:47,810 --> 00:08:51,650
to SPARC development
not to be set by Sun,

171
00:08:51,650 --> 00:08:55,190
not to be set by the
business strategy of any one

172
00:08:55,190 --> 00:08:56,630
corporation.

173
00:08:56,630 --> 00:09:00,020
We wanted a consortium
to control this.

174
00:09:00,020 --> 00:09:02,360
And SPARC
International, which we

175
00:09:02,360 --> 00:09:06,800
have formed with all of the
SPARC component licensees

176
00:09:06,800 --> 00:09:11,480
and with the participation of
a number of the architecture

177
00:09:11,480 --> 00:09:15,110
licensees, now has
effective control

178
00:09:15,110 --> 00:09:17,090
of the SPARC architecture.

179
00:09:17,090 --> 00:09:20,300
We think that's a very important
consideration, in terms

180
00:09:20,300 --> 00:09:23,180
of making this
rather, if you will,

181
00:09:23,180 --> 00:09:26,330
upstart architecture
from a systems company

182
00:09:26,330 --> 00:09:30,470
become one of the major players
in the volume microprocessor

183
00:09:30,470 --> 00:09:32,790
business.

184
00:09:32,790 --> 00:09:34,680
So thank you for that.

185
00:09:34,680 --> 00:09:37,080
Are there any questions?

186
00:09:37,080 --> 00:09:39,860
How would you describe
the growth and acceptance

187
00:09:39,860 --> 00:09:42,920
of the SPARC processor
in terms of time?

188
00:09:42,920 --> 00:09:46,190
And what problems have some
of your partners experienced

189
00:09:46,190 --> 00:09:50,240
in porting from the
Motorola-based computers

190
00:09:50,240 --> 00:09:52,100
to SPARC processors?

191
00:09:52,100 --> 00:09:54,830
Before the Sun 4 was first
introduced, the first SPARC

192
00:09:54,830 --> 00:09:58,790
machine, we had successful
instances of 500,000 line

193
00:09:58,790 --> 00:10:02,030
programs coming into
the porting center,

194
00:10:02,030 --> 00:10:04,890
being recompiled, and
running the first time.

195
00:10:04,890 --> 00:10:08,840
So from the beginning, we often
had very successful ports.

196
00:10:08,840 --> 00:10:12,170
There were some areas
in the early period

197
00:10:12,170 --> 00:10:14,510
of the architecture where we
had some compiler problems.

198
00:10:14,510 --> 00:10:18,230
And we had problems with
shared Fortran common

199
00:10:18,230 --> 00:10:19,230
that we had to work out.

200
00:10:19,230 --> 00:10:21,050
So there was about
a six month window

201
00:10:21,050 --> 00:10:24,440
there, where we had a little
bit of difficulty because

202
00:10:24,440 --> 00:10:26,840
of some data alignment
problems that we

203
00:10:26,840 --> 00:10:28,790
had neglected in the software.

204
00:10:28,790 --> 00:10:31,910
Once those problems
were overcome,

205
00:10:31,910 --> 00:10:35,300
after that, the porting
has been very smooth.

206
00:10:35,300 --> 00:10:38,030
And getting the larger
companies to port

207
00:10:38,030 --> 00:10:40,010
has not been a technical issue.

208
00:10:40,010 --> 00:10:42,830
It's typically been a
business consideration.

209
00:10:42,830 --> 00:10:45,170
Many of the large
software companies

210
00:10:45,170 --> 00:10:50,090
really are not prepared to port
their applications to systems

211
00:10:50,090 --> 00:10:55,370
until they see an installed base
of significant enough unit

212
00:10:55,370 --> 00:10:56,480
volumes.

213
00:10:56,480 --> 00:10:59,930
And where SPARC is
in very good shape

214
00:10:59,930 --> 00:11:02,610
is, as of I think the
end of this summer,

215
00:11:02,610 --> 00:11:04,220
there is no question
that SPARC will

216
00:11:04,220 --> 00:11:08,120
have numerically the largest
installed base of RISC machines

217
00:11:08,120 --> 00:11:09,420
in the world.

218
00:11:09,420 --> 00:11:12,460
And so that makes the
business motivation

219
00:11:12,460 --> 00:11:15,130
for independent software
companies very straightforward.

220
00:11:15,130 --> 00:11:16,810
What would you have
done differently,

221
00:11:16,810 --> 00:11:18,177
given what you know today?

222
00:11:18,177 --> 00:11:20,260
I think the things we would
have done differently,

223
00:11:20,260 --> 00:11:23,840
from a technical point of view,
were pretty straightforward.

224
00:11:23,840 --> 00:11:27,130
We would have put
integer multiply

225
00:11:27,130 --> 00:11:31,150
and divide full instructions
into the architecture.

226
00:11:31,150 --> 00:11:34,930
Second were mostly
implementation issues.

227
00:11:34,930 --> 00:11:37,540
We really should have put
about twice as many people

228
00:11:37,540 --> 00:11:39,760
on the project.

229
00:11:39,760 --> 00:11:44,680
And we should have developed
our own tightly coupled

230
00:11:44,680 --> 00:11:47,680
floating point strategy
in the beginning.

231
00:11:47,680 --> 00:11:51,820
And second to that would have
been developing an integrated

232
00:11:51,820 --> 00:11:55,210
memory management unit
with multiprocessor support

233
00:11:55,210 --> 00:11:56,620
from the beginning.

234
00:11:56,620 --> 00:12:00,490
When we first went out and
started talking to companies

235
00:12:00,490 --> 00:12:06,670
about using SPARC, we at Sun,
as I think a bunch of gunslinger

236
00:12:06,670 --> 00:12:08,680
engineers, really
have no problem

237
00:12:08,680 --> 00:12:11,200
designing memory management
units and whatever

238
00:12:11,200 --> 00:12:12,670
else wasn't there.

239
00:12:12,670 --> 00:12:15,010
But we found a lot
of companies really

240
00:12:15,010 --> 00:12:18,640
had succumbed to the traditional
semiconductor merchant market

241
00:12:18,640 --> 00:12:21,140
strategy of they do
everything for you.

242
00:12:21,140 --> 00:12:23,410
And all you have to do
is bolt a few chips in.

243
00:12:23,410 --> 00:12:24,970
And it's all over.

244
00:12:24,970 --> 00:12:28,300
And it took us a
while to understand

245
00:12:28,300 --> 00:12:30,970
that the other customers
really needed a higher level

246
00:12:30,970 --> 00:12:33,500
of integration than Sun did.

247
00:12:33,500 --> 00:12:36,520
And in the last year
and a half, those issues

248
00:12:36,520 --> 00:12:38,980
have been overcome, of
course, in our design

249
00:12:38,980 --> 00:12:41,770
of subsequent chipsets.

250
00:12:41,770 --> 00:12:43,870
What were the roles of
each of the partners

251
00:12:43,870 --> 00:12:46,630
that you worked with in the
development of the SPARC

252
00:12:46,630 --> 00:12:48,520
chipset?

253
00:12:48,520 --> 00:12:53,320
For the very first
generation, SPARC development

254
00:12:53,320 --> 00:12:56,440
of the Fujitsu gate
array, Fujitsu primarily

255
00:12:56,440 --> 00:13:01,630
provided us with technical
support in the CAD area.

256
00:13:01,630 --> 00:13:04,330
And they had one
engineer who just

257
00:13:04,330 --> 00:13:07,510
was so involved that,
at the end of that,

258
00:13:07,510 --> 00:13:10,900
that particular engineer knew
how to do SPARC machines as

259
00:13:10,900 --> 00:13:12,140
well as we did.

260
00:13:12,140 --> 00:13:14,530
And then he was able
to subsequently do

261
00:13:14,530 --> 00:13:18,550
a second generation standard
cell design with himself

262
00:13:18,550 --> 00:13:20,950
and just a few other
Fujitsu people.

263
00:13:20,950 --> 00:13:23,620
So they mostly helped
us out in the CAD system

264
00:13:23,620 --> 00:13:25,960
and learned the architecture.

265
00:13:25,960 --> 00:13:28,570
In the case of
working with BIT, it

266
00:13:28,570 --> 00:13:32,500
was a true joint development,
where BIT and Sun engineers

267
00:13:32,500 --> 00:13:36,220
worked together from the
beginning of the definition

268
00:13:36,220 --> 00:13:40,870
of how we would approach the
design, the pipeline design

269
00:13:40,870 --> 00:13:43,960
through the development
of all of the chips.

270
00:13:43,960 --> 00:13:47,350
In the case of the
Cypress CMOS full custom,

271
00:13:47,350 --> 00:13:51,760
it was a true 50-50
joint development,

272
00:13:51,760 --> 00:13:54,520
where each company
shared half the cost

273
00:13:54,520 --> 00:13:57,850
and put in approximately
half of the engineers.

274
00:13:57,850 --> 00:14:00,850
And we put everybody together
in a building, filled it up

275
00:14:00,850 --> 00:14:01,810
with workstations.

276
00:14:01,810 --> 00:14:03,790
And they went about
their business.

277
00:14:03,790 --> 00:14:05,290
They used to call themselves--

278
00:14:05,290 --> 00:14:08,590
I can't remember-- oh yes,
Sunray Semiconductor Inc.

279
00:14:08,590 --> 00:14:11,260
They actually made up
a fake company name

280
00:14:11,260 --> 00:14:17,200
and called themselves a company
surrounding their project

281
00:14:17,200 --> 00:14:18,640
to get an identity.

282
00:14:18,640 --> 00:14:21,250
And it worked very well.

283
00:14:21,250 --> 00:14:23,590
And now, David Patterson.

284
00:14:23,590 --> 00:14:25,270
I'm delighted to
have the opportunity

285
00:14:25,270 --> 00:14:27,160
to communicate with
you in this way,

286
00:14:27,160 --> 00:14:29,950
as well as to share this
tape with Wayne Rosing.

287
00:14:29,950 --> 00:14:33,070
I first met Wayne almost
10 years ago today.

288
00:14:33,070 --> 00:14:37,540
With this taping, I took a leave
of absence from UC Berkeley

289
00:14:37,540 --> 00:14:39,790
and spent it at Digital
Equipment Corporation,

290
00:14:39,790 --> 00:14:42,550
where Wayne was in
charge of this division.

291
00:14:42,550 --> 00:14:45,010
And together, we spent
time figuring out ways

292
00:14:45,010 --> 00:14:47,710
to try and build faster
and better VAXes.

293
00:14:47,710 --> 00:14:50,500
I think my involvement in
this reduced instruction set,

294
00:14:50,500 --> 00:14:54,190
or RISC, phenomenon
started with that.

295
00:14:54,190 --> 00:14:58,480
It seemed to me the
difficulty of building VAXes

296
00:14:58,480 --> 00:15:04,240
and VLSI technology gave
rise to my involvement

297
00:15:04,240 --> 00:15:07,542
in trying to build a
simpler style of machine.

298
00:15:07,542 --> 00:15:09,250
So what I'm going to
do today is give you

299
00:15:09,250 --> 00:15:11,950
some principles of
computer performance design

300
00:15:11,950 --> 00:15:14,980
and show how those contrast
between the traditional

301
00:15:14,980 --> 00:15:18,040
approach and the
RISC approach, talk about

302
00:15:18,040 --> 00:15:21,550
some differences of
the SPARC architecture

303
00:15:21,550 --> 00:15:24,020
with other RISC
architectures, then

304
00:15:24,020 --> 00:15:26,020
give a historical
perspective, and then show you

305
00:15:26,020 --> 00:15:28,810
some actual chips
and boards that

306
00:15:28,810 --> 00:15:30,250
are using the
SPARC architecture,

307
00:15:30,250 --> 00:15:32,290
and then speculate a
bit about the future.

308
00:15:32,290 --> 00:15:35,410
What I want to emphasize
in this talk today

309
00:15:35,410 --> 00:15:37,480
is a fundamental
performance equation

310
00:15:37,480 --> 00:15:39,170
that all computer architects use.

311
00:15:39,170 --> 00:15:43,630
As I go to the first
slide, the CPU time today,

312
00:15:43,630 --> 00:15:46,420
or the performance, can be
divided into two pieces.

313
00:15:46,420 --> 00:15:50,530
Almost all computers
today use a clock,

314
00:15:50,530 --> 00:15:53,200
a standard clock that sets
the cycle time of the machine.

315
00:15:53,200 --> 00:15:54,610
It's called the
clock cycle time.

316
00:15:54,610 --> 00:15:57,110
Well, then we can divide time
into the number of those clock

317
00:15:57,110 --> 00:15:57,910
cycles.

318
00:15:57,910 --> 00:16:01,120
So CPU time is the
clock cycles per program

319
00:16:01,120 --> 00:16:03,682
times the clock cycle time.

320
00:16:03,682 --> 00:16:05,140
Well, computer
designers have found

321
00:16:05,140 --> 00:16:07,750
it useful to come up with
another way of presenting

322
00:16:07,750 --> 00:16:09,070
the same information.

323
00:16:09,070 --> 00:16:11,730
And that is to figure
out the number of clock

324
00:16:11,730 --> 00:16:13,840
cycles per instruction.

325
00:16:13,840 --> 00:16:16,638
So that's simply calculated by
the clock cycles per program

326
00:16:16,638 --> 00:16:18,180
divided by the number
of instructions

327
00:16:18,180 --> 00:16:20,280
you execute in that
program, sometimes called

328
00:16:20,280 --> 00:16:21,300
the instruction count.

329
00:16:21,300 --> 00:16:22,890
And that gives you
the average number

330
00:16:22,890 --> 00:16:25,500
of clock cycles per
instruction, which is

331
00:16:25,500 --> 00:16:28,020
almost always abbreviated CPI.

332
00:16:28,020 --> 00:16:30,120
So now, we can plug
those two things together

333
00:16:30,120 --> 00:16:32,130
into this time equation.

334
00:16:32,130 --> 00:16:35,550
And we end up with
the CPU time, which

335
00:16:35,550 --> 00:16:39,150
is equal to the instruction
count times the CPI

336
00:16:39,150 --> 00:16:41,100
times the clock cycle time.

337
00:16:41,100 --> 00:16:44,100
So performance is trying to--

338
00:16:44,100 --> 00:16:45,630
you get higher
performance by trying

339
00:16:45,630 --> 00:16:48,780
to minimize that product
of instruction count, CPI,

340
00:16:48,780 --> 00:16:50,620
and clock cycle time.
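
The derivation in the cues above can be written out as a worked equation. The symbols IC (instruction count) and T_cycle (clock cycle time) are shorthand introduced here for illustration; the talk itself abbreviates only CPI.

```latex
\begin{align*}
\text{CPU time} &= \frac{\text{clock cycles}}{\text{program}} \times T_{\text{cycle}} \\
\text{CPI} &= \frac{\text{clock cycles per program}}{\text{IC}} \\
\text{CPU time} &= \text{IC} \times \text{CPI} \times T_{\text{cycle}}
\end{align*}
```

Substituting the second line into the first gives the third: higher performance means a smaller product of the three factors.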

341
00:16:50,620 --> 00:16:52,470
So let's see how
you take advantage

342
00:16:52,470 --> 00:16:55,590
of that in the two different
styles of machine design.

343
00:16:55,590 --> 00:16:58,530
Now, the traditional approach,
what they tried to emphasize

344
00:16:58,530 --> 00:17:02,010
was to reduce the
instruction count.

345
00:17:02,010 --> 00:17:03,780
That was an interesting
figure of merit.

346
00:17:03,780 --> 00:17:06,900
The side effect of that
is to increase the CPI.

347
00:17:06,900 --> 00:17:09,119
That wasn't seen as
all bad, however.

348
00:17:09,119 --> 00:17:12,000
That was seen as raising
the level of the machine,

349
00:17:12,000 --> 00:17:14,910
that you had a higher level,
more powerful instruction set.

350
00:17:14,910 --> 00:17:18,569
And in fact, to execute
fewer instructions is good.

351
00:17:18,569 --> 00:17:20,640
Higher CPI, well,
that wasn't so bad.

352
00:17:20,640 --> 00:17:22,800
And also, it turns out,
for that style of machines,

353
00:17:22,800 --> 00:17:24,960
there was an
emphasis on reducing

354
00:17:24,960 --> 00:17:28,079
the size of the program,
small program size.

355
00:17:28,079 --> 00:17:31,950
Now, in contrast to that is the
RISC style design, again trying

356
00:17:31,950 --> 00:17:35,440
to minimize that key performance
equation, where with RISC

357
00:17:35,440 --> 00:17:39,240
the emphasis is in reducing the
clock cycles per instruction.

358
00:17:39,240 --> 00:17:41,370
And that happens primarily
through pipelining,

359
00:17:41,370 --> 00:17:45,360
by having many instructions
executing at the same time.

360
00:17:45,360 --> 00:17:47,650
What's the side effect
of reducing the CPI?

361
00:17:47,650 --> 00:17:50,070
It's a larger instruction count.

362
00:17:50,070 --> 00:17:52,020
Well, that's considered
to be OK because it's

363
00:17:52,020 --> 00:17:55,110
a lot easier to get
instruction memory bandwidth

364
00:17:55,110 --> 00:17:57,000
in all computer design.

365
00:17:57,000 --> 00:17:59,730
So that was an
acceptable side effect.

366
00:17:59,730 --> 00:18:01,650
The programs are larger as well.

367
00:18:01,650 --> 00:18:04,380
They're larger because of the
larger number of instructions

368
00:18:04,380 --> 00:18:06,600
and also because
the instructions are

369
00:18:06,600 --> 00:18:09,930
kept in a form that's very easy
to get pipelined execution.

370
00:18:09,930 --> 00:18:12,810
And that was OK because,
for the last decade,

371
00:18:12,810 --> 00:18:14,940
the DRAMs used to
make memory today

372
00:18:14,940 --> 00:18:17,770
are getting cheaper at
an incredibly fast rate.

373
00:18:17,770 --> 00:18:20,910
So bigger programs to
get better performance

374
00:18:20,910 --> 00:18:24,240
is a tradeoff most people
would be happy with.
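
The tradeoff described above (fewer instructions with a higher CPI, versus more instructions with a lower CPI) can be sketched numerically. This is a minimal illustration, not a measurement: the `Design` class and every number in it are hypothetical and do not come from the talk. Both designs are given the same clock cycle time so that only instruction count and CPI differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Hypothetical machine parameters; none of these numbers come from the talk."""
    name: str
    instruction_count: int   # instructions executed by the program
    cpi: float               # average clock cycles per instruction
    cycle_time_ns: float     # clock cycle time in nanoseconds

    def cpu_time_ms(self) -> float:
        # CPU time = instruction count x CPI x clock cycle time
        return self.instruction_count * self.cpi * self.cycle_time_ns / 1e6


# Illustrative values only: the traditional design executes fewer
# instructions, while the RISC-style design has a lower CPI.
traditional = Design("traditional", 1_000_000, 6.0, 100.0)
risc_style = Design("risc_style", 1_300_000, 1.5, 100.0)

for d in (traditional, risc_style):
    print(f"{d.name}: {d.cpu_time_ms():.1f} ms")
```

With these assumed values, the 30 percent larger instruction count is more than offset by the lower CPI, which is the shape of the argument made in the talk; real ratios depend on the program and the machines.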

375
00:18:24,240 --> 00:18:28,440
Now, more perhaps
minor or lesser roles

376
00:18:28,440 --> 00:18:30,300
of the reduced
instruction set approach

377
00:18:30,300 --> 00:18:32,910
is that given that
technology is moving

378
00:18:32,910 --> 00:18:36,750
at an incredibly fast
rate, this VLSI technology,

379
00:18:36,750 --> 00:18:38,980
there's an emphasis
on simplicity.

380
00:18:38,980 --> 00:18:41,010
Let's try and keep the
instruction set simple,

381
00:18:41,010 --> 00:18:43,200
so that we can more
closely track

382
00:18:43,200 --> 00:18:45,510
the rapid changes in VLSI.

383
00:18:45,510 --> 00:18:48,360
As a benefit of that
style of design,

384
00:18:48,360 --> 00:18:51,120
it made it easier to
have a faster clock cycle

385
00:18:51,120 --> 00:18:53,370
time by keeping things simple.

386
00:18:53,370 --> 00:18:57,060
And probably the final
linchpin of the RISC designs

387
00:18:57,060 --> 00:19:00,450
is the assumption of a much
better compiler technology

388
00:19:00,450 --> 00:19:02,580
because there's been
advances in compiler design

389
00:19:02,580 --> 00:19:05,890
as well in the last 10 years.

390
00:19:05,890 --> 00:19:10,080
Now, so I talked about the
tradeoff of instruction count and CPI

391
00:19:10,080 --> 00:19:13,290
in this equation we're
trying to minimize, right?

392
00:19:13,290 --> 00:19:15,780
So what about the
clock cycle time?

393
00:19:15,780 --> 00:19:18,435
Well, the clock cycle time
is your worst case situation

394
00:19:18,435 --> 00:19:19,772
in computer design.

395
00:19:19,772 --> 00:19:22,230
And there's lots of things that
could limit the clock cycle

396
00:19:22,230 --> 00:19:23,290
time.

397
00:19:23,290 --> 00:19:25,560
But with the two
different designs--

398
00:19:25,560 --> 00:19:28,050
for the traditional
or CISC approach,

399
00:19:28,050 --> 00:19:30,420
they've always
been microprogrammed.

400
00:19:30,420 --> 00:19:34,620
So one of the limits is the time
to fetch a microinstruction.

401
00:19:34,620 --> 00:19:39,060
Typical, traditional machines
will have between 4K and 16K

402
00:19:39,060 --> 00:19:41,460
of these microinstructions
that are very wide.

403
00:19:41,460 --> 00:19:43,950
So you can't run the
cycle time faster

404
00:19:43,950 --> 00:19:48,000
than you can access this
microprogram memory.

405
00:19:48,000 --> 00:19:49,900
The RISC designs,
on the other hand,

406
00:19:49,900 --> 00:19:52,200
are built on a cache
model of design.

407
00:19:52,200 --> 00:19:55,080
So the cache you can adjust.

408
00:19:55,080 --> 00:19:57,348
Caches are made
to be adjustable.

409
00:19:57,348 --> 00:19:59,640
They're just the highest
level of the memory hierarchy.

410
00:19:59,640 --> 00:20:02,190
So depending on what
your clock cycle time is,

411
00:20:02,190 --> 00:20:04,270
you can have a smaller
instruction cache.

412
00:20:04,270 --> 00:20:06,810
So the RISC designs aren't
as limited by the memory size

413
00:20:06,810 --> 00:20:09,225
because it's adjustable,
whereas the traditional designs

414
00:20:09,225 --> 00:20:11,850
you pretty much have to have the
whole microprogram you're ever

415
00:20:11,850 --> 00:20:13,150
going to use right there.

416
00:20:13,150 --> 00:20:15,940
And that's going to affect
the clock cycle time.

417
00:20:15,940 --> 00:20:20,760
So how does that show up I think
is shown in the next slide.

418
00:20:20,760 --> 00:20:23,670
What this is, is a
plot of the clock cycle

419
00:20:23,670 --> 00:20:28,547
time on a log scale, over
time along the bottom here.

420
00:20:28,547 --> 00:20:30,630
So you see the RISC designs
really didn't show up

421
00:20:30,630 --> 00:20:32,280
until about 1986.

422
00:20:32,280 --> 00:20:35,880
But the orange line is
the minicomputer line

423
00:20:35,880 --> 00:20:39,690
represented by models of the VAX
family and the supercomputers

424
00:20:39,690 --> 00:20:41,940
by the Cray.

425
00:20:41,940 --> 00:20:43,780
It's relative.

426
00:20:43,780 --> 00:20:44,740
This is a log scale.

427
00:20:44,740 --> 00:20:47,490
So they are improving
relative to the RISC designs

428
00:20:47,490 --> 00:20:48,660
at a slower rate.

429
00:20:48,660 --> 00:20:53,100
And not only is this RISC
line somewhat steeper,

430
00:20:53,100 --> 00:20:55,090
you see this sudden
hitch in the curve.

431
00:20:55,090 --> 00:20:57,600
And what that is, is the RISC
designs changing technology.

432
00:20:57,600 --> 00:21:01,410
In the earlier part, they were
all done in some kind of CMOS

433
00:21:01,410 --> 00:21:02,500
design, usually.

434
00:21:02,500 --> 00:21:04,278
They were able to
use the ECL design.

435
00:21:04,278 --> 00:21:06,570
So they're able to take
advantage of the new technology

436
00:21:06,570 --> 00:21:08,890
and change the clock rate.

437
00:21:08,890 --> 00:21:13,460
So that's how the clock rate
affects what's going on.

438
00:21:13,460 --> 00:21:15,180
So let's do a
couple of examples.

439
00:21:15,180 --> 00:21:17,900
The first example was 1987.

440
00:21:17,900 --> 00:21:23,080
July 7, 7/7/87, Sun announced
the SPARC-based family

441
00:21:23,080 --> 00:21:23,800
of machines.

442
00:21:23,800 --> 00:21:26,530
And so let's compare
those figures of merit

443
00:21:26,530 --> 00:21:29,110
for the products of that time.

444
00:21:29,110 --> 00:21:32,810
Sun's traditional line was
the Motorola 68000 family,

445
00:21:32,810 --> 00:21:34,450
in particular the 68020.

446
00:21:34,450 --> 00:21:37,450
So if we look at the instruction
count, the ratio of those,

447
00:21:37,450 --> 00:21:40,480
it took about 25%
more instructions

448
00:21:40,480 --> 00:21:44,080
for the RISC approach, a
higher instruction count.

449
00:21:44,080 --> 00:21:47,020
The clock cycle time is--

450
00:21:47,020 --> 00:21:48,220
I'll explain a little bit.

451
00:21:48,220 --> 00:21:50,220
It's a little bit funny
for the Motorola design.

452
00:21:50,220 --> 00:21:55,210
The clock cycle time was
actually better for the 68000

453
00:21:55,210 --> 00:21:56,980
than it was for the RISC design.

454
00:21:56,980 --> 00:21:59,650
But the average number of clocks
per instruction, you can see,

455
00:21:59,650 --> 00:22:02,950
is quite a bit worse,
5 to 7 or 0.32.

456
00:22:02,950 --> 00:22:04,820
When you multiply all
those things together,

457
00:22:04,820 --> 00:22:07,570
you get about a ratio
factor of 2 to 1.

458
00:22:07,570 --> 00:22:09,340
The RISC machine
is twice as fast.

459
00:22:09,340 --> 00:22:13,990
Or the 68000 takes twice
as long to execute.

460
00:22:13,990 --> 00:22:16,280
But the difference in
price was very small,

461
00:22:16,280 --> 00:22:18,040
about 10% or 20% more.

462
00:22:18,040 --> 00:22:21,010
So you've got a factor of 2
performance for a 10% to 20%

463
00:22:21,010 --> 00:22:21,760
increase in price.

464
00:22:21,760 --> 00:22:23,270
It's a pretty good deal.

465
00:22:23,270 --> 00:22:26,080
Now, when we give these talks
to our friends at Motorola,

466
00:22:26,080 --> 00:22:29,530
they will argue some about
the clocks per instruction,

467
00:22:29,530 --> 00:22:30,400
clock cycle time.

468
00:22:30,400 --> 00:22:33,310
It turns out the
Motorola, the clock rates,

469
00:22:33,310 --> 00:22:37,210
the 25 megahertz turns
out to be 40 nanoseconds.

470
00:22:37,210 --> 00:22:40,870
Well, in the Motorola microcode,
every single microinstruction

471
00:22:40,870 --> 00:22:43,210
takes two clock ticks.

472
00:22:43,210 --> 00:22:46,133
So I had a hard time deciding
how to put this together.

473
00:22:46,133 --> 00:22:48,550
The other way you put this is,
with an 80 nanosecond clock

474
00:22:48,550 --> 00:22:52,870
cycle time, then the
CPI of 2.5 to 3.5.

475
00:22:52,870 --> 00:22:54,580
It doesn't affect the
product down here.

476
00:22:54,580 --> 00:22:57,747
Just depending if you're
sensitive about what's

477
00:22:57,747 --> 00:22:59,830
the average number of clock
cycles per instruction

478
00:22:59,830 --> 00:23:01,570
of your architecture,
you may not

479
00:23:01,570 --> 00:23:03,040
like it characterized this way.

480
00:23:03,040 --> 00:23:05,470
But the machine is advertised
with a 25 megahertz 40

481
00:23:05,470 --> 00:23:07,720
nanosecond clock.

482
00:23:07,720 --> 00:23:09,880
So that was one example.

483
00:23:09,880 --> 00:23:12,640
This year, there is
an example from DEC.

484
00:23:12,640 --> 00:23:15,640
DEC has produced a
VAX and RISC line,

485
00:23:15,640 --> 00:23:17,920
in fact announced the very
same day, just like Sun

486
00:23:17,920 --> 00:23:19,750
did two years before that.

487
00:23:19,750 --> 00:23:22,360
Now, when compared to
the VAX architecture,

488
00:23:22,360 --> 00:23:25,180
their RISC machine
that they used almost

489
00:23:25,180 --> 00:23:29,110
executes 80% more instructions.

490
00:23:29,110 --> 00:23:32,230
These are based on
benchmarks from a book

491
00:23:32,230 --> 00:23:34,390
that John Hennessy
and I are working on.

492
00:23:34,390 --> 00:23:38,400
But it's the GNU C compiler
and the TeX programs.

493
00:23:38,400 --> 00:23:40,900
So these are large programs
that these numbers are based on.

494
00:23:40,900 --> 00:23:43,480
We took the compilers
that DEC ships with them

495
00:23:43,480 --> 00:23:46,750
and did these measurements.

496
00:23:46,750 --> 00:23:50,830
It was 80% more instructions, so
much higher instruction count.

497
00:23:50,830 --> 00:23:51,970
VAX is actually much fewer.

498
00:23:51,970 --> 00:23:54,940
But you notice this machine,
the clock cycle time

499
00:23:54,940 --> 00:23:56,990
was quite a bit less
for the RISC machine.

500
00:23:56,990 --> 00:23:59,470
And there's a huge difference
in the average number

501
00:23:59,470 --> 00:24:01,220
of clocks per instruction.

502
00:24:01,220 --> 00:24:02,865
So when you multiply
it all together,

503
00:24:02,865 --> 00:24:04,240
depending on the
program, you get

504
00:24:04,240 --> 00:24:07,480
a factor of performance
improvement of about 3 to 6.

505
00:24:07,480 --> 00:24:10,180
The VAX takes about three
times to six times longer

506
00:24:10,180 --> 00:24:14,110
to execute these large
programs than the RISC machine.

507
00:24:14,110 --> 00:24:18,160
But yet, using DEC's
prices, their list prices,

508
00:24:18,160 --> 00:24:20,000
it was only 25% more expensive.

509
00:24:20,000 --> 00:24:22,960
So again, something is
3 times 6 times more

510
00:24:22,960 --> 00:24:25,280
expensive to start that--

511
00:24:25,280 --> 00:24:26,230
let's try this again.

512
00:24:26,230 --> 00:24:27,970
3 to 6 times faster--

513
00:24:27,970 --> 00:24:30,730
that's it-- and only 25% more.

514
00:24:30,730 --> 00:24:32,860
So those are a couple of
examples of RISC machines

515
00:24:32,860 --> 00:24:34,870
from two different families.

516
00:24:34,870 --> 00:24:37,900
Now, let's talk in a little
bit more technical detail.

517
00:24:37,900 --> 00:24:41,770
And there was an
interesting study

518
00:24:41,770 --> 00:24:44,710
we did when we measured
these large programs.

519
00:24:44,710 --> 00:24:47,890
And this plot is
measuring the ratio really

520
00:24:47,890 --> 00:24:50,860
to the amount of instruction
traffic on the VAX.

521
00:24:50,860 --> 00:24:54,520
So let's say that the unit is
one for the VAX instructions.

522
00:24:54,520 --> 00:24:56,500
That's shown here, orange.

523
00:24:56,500 --> 00:24:59,110
You can see that the number of
words for memory for the RISC

524
00:24:59,110 --> 00:25:00,820
machine, in blue, is 1.8.

525
00:25:00,820 --> 00:25:02,833
That's like what we said before.

526
00:25:02,833 --> 00:25:05,500
Now, the question is, what about
the rest of the memory traffic?

527
00:25:05,500 --> 00:25:07,960
What about the data accesses?

528
00:25:07,960 --> 00:25:11,590
Well, it turns out that,
normalized again to VAX,

529
00:25:11,590 --> 00:25:17,080
it takes 1.7 of these units
of the data memory traffic,

530
00:25:17,080 --> 00:25:19,450
where the RISC machine
takes only 0.6.

531
00:25:19,450 --> 00:25:22,360
If you add these things all
up, the total memory traffic

532
00:25:22,360 --> 00:25:26,920
is that the VAX has about 10%
more traffic than the RISC

533
00:25:26,920 --> 00:25:28,840
machine, which is
kind of surprising.

534
00:25:28,840 --> 00:25:30,610
I think most computer
designers thought

535
00:25:30,610 --> 00:25:32,860
that if you have a higher
instruction count, therefore

536
00:25:32,860 --> 00:25:36,400
you're going to have a higher
data memory traffic overall.

537
00:25:36,400 --> 00:25:38,012
But it still might
be a good tradeoff.

538
00:25:38,012 --> 00:25:39,970
But the most interesting
thing about this slide

539
00:25:39,970 --> 00:25:41,620
isn't that there's
a 10% reduction.

540
00:25:41,620 --> 00:25:43,150
It's where the differences are.

541
00:25:43,150 --> 00:25:45,370
From a computer
designer's perspective,

542
00:25:45,370 --> 00:25:48,092
it's relatively easy to
double your bandwidth

543
00:25:48,092 --> 00:25:50,050
with instruction memory
because they're largely

544
00:25:50,050 --> 00:25:51,070
sequential accesses.

545
00:25:51,070 --> 00:25:52,580
Most are the next one.

546
00:25:52,580 --> 00:25:55,510
So simply making the
port twice as wide

547
00:25:55,510 --> 00:26:00,190
would probably get you about
1.8 increase in the bandwidth.

548
00:26:00,190 --> 00:26:02,200
On the other hand,
data memory references

549
00:26:02,200 --> 00:26:04,477
are very random in
their very nature.

550
00:26:04,477 --> 00:26:06,310
And so those are great
things to get rid of.

551
00:26:06,310 --> 00:26:07,810
So speaking as a
computer designer,

552
00:26:07,810 --> 00:26:13,270
if I had a choice of almost
doubling my instruction traffic

553
00:26:13,270 --> 00:26:15,760
and reducing my data
traffic by a factor of 3,

554
00:26:15,760 --> 00:26:19,050
that's wonderful
from my perspective.

555
00:26:19,050 --> 00:26:22,450
So when we're really starting
to run real big programs

556
00:26:22,450 --> 00:26:23,950
on these two styles
of architectures,

557
00:26:23,950 --> 00:26:25,743
we see some real
interesting differences.

558
00:26:25,743 --> 00:26:27,160
To build a much
faster VAX, you're

559
00:26:27,160 --> 00:26:29,530
going to have to have a
much higher data memory

560
00:26:29,530 --> 00:26:34,150
traffic than you will have
for the RISC machines.

561
00:26:34,150 --> 00:26:37,470
Now I'm going to-- before I
talk about the differences

562
00:26:37,470 --> 00:26:38,850
in the SPARC
architecture, I want

563
00:26:38,850 --> 00:26:41,730
to emphasize that at
no time in our history

564
00:26:41,730 --> 00:26:45,270
have we had such
similar machines.

565
00:26:45,270 --> 00:26:48,150
The RISC machines are so
similar that people sometimes

566
00:26:48,150 --> 00:26:50,650
tend to emphasize the small
differences there are.

567
00:26:50,650 --> 00:26:52,175
But let's put this
in perspective.

568
00:26:52,175 --> 00:26:53,550
In putting it in
perspective, I'm

569
00:26:53,550 --> 00:26:56,190
going to start talking about
the traditional microprocessors.

570
00:26:56,190 --> 00:26:58,320
The so-called 16-bit
microprocessors

571
00:26:58,320 --> 00:27:01,770
were developed in 1978 to
1980, about eight years

572
00:27:01,770 --> 00:27:03,522
before the RISC machines.

573
00:27:03,522 --> 00:27:04,980
So when we look at
these machines--

574
00:27:04,980 --> 00:27:09,450
the Intel 8086, the Motorola
68000, and the Z8000--

575
00:27:09,450 --> 00:27:11,680
the first difference is the
difference of addressing,

576
00:27:11,680 --> 00:27:13,440
which is a major
difference, two of them

577
00:27:13,440 --> 00:27:15,630
using segmented addressing
and the Motorola use

578
00:27:15,630 --> 00:27:16,662
of the flat addressing.

579
00:27:16,662 --> 00:27:18,120
That's about as
big a difference as you're

580
00:27:18,120 --> 00:27:20,537
going to get in an architecture
because it affects so much

581
00:27:20,537 --> 00:27:22,230
of the system.

582
00:27:22,230 --> 00:27:23,820
There was no
protection on the 86.

583
00:27:23,820 --> 00:27:25,755
And it was optional on
the other two machines.

584
00:27:25,755 --> 00:27:29,190
And you can see the
address size vary.

585
00:27:29,190 --> 00:27:30,975
Even with the segmented
architectures,

586
00:27:30,975 --> 00:27:32,100
the address size is varied.

587
00:27:32,100 --> 00:27:34,080
So those affect a lot of
the software that runs

588
00:27:34,080 --> 00:27:36,000
on the machine, the addresses.

589
00:27:36,000 --> 00:27:39,420
If we look more inside, the
register sizes were different--

590
00:27:39,420 --> 00:27:44,250
16 bits for Zilog and
Intel and 32 for Motorola.

591
00:27:44,250 --> 00:27:46,800
The register model was
incredibly different.

592
00:27:46,800 --> 00:27:49,260
Maybe that's one of the biggest
differences between them.

593
00:27:49,260 --> 00:27:51,780
All the registers were
special purpose in 86.

594
00:27:51,780 --> 00:27:53,790
They were divided
into eight data

595
00:27:53,790 --> 00:27:57,330
and eight address in the 68000;
16 general-purpose registers

596
00:27:57,330 --> 00:27:58,650
on the Z8000.

597
00:27:58,650 --> 00:28:01,050
The instruction size was
variable, either byte variable

598
00:28:01,050 --> 00:28:06,750
in the 86 or 16-bit
variable Motorola 68000.

599
00:28:06,750 --> 00:28:08,790
Intel had no data
alignment restrictions.

600
00:28:08,790 --> 00:28:11,700
And Motorola, on
the 68000, and Zilog

601
00:28:11,700 --> 00:28:13,710
had required the
data to be aligned.

602
00:28:13,710 --> 00:28:17,040
And finally, the input/output
was memory mapped for Motorola,

603
00:28:17,040 --> 00:28:20,793
but were special instructions
on the other two machines.

604
00:28:20,793 --> 00:28:22,710
So let's contrast that
with the RISC machines,

605
00:28:22,710 --> 00:28:28,050
which were done in 1986,
or announced 1986, 1988.

606
00:28:28,050 --> 00:28:29,910
There's the SPARC,
the MIPS architecture,

607
00:28:29,910 --> 00:28:33,720
which is being used by
DEC in their RISC machine,

608
00:28:33,720 --> 00:28:37,830
and the Motorola 88000,
Motorola's RISC machine.

609
00:28:37,830 --> 00:28:39,030
You see the addressing.

610
00:28:39,030 --> 00:28:40,590
All of them use flat addressing.

611
00:28:40,590 --> 00:28:42,630
All of them use page
level protection.

612
00:28:42,630 --> 00:28:46,510
All of them have the
same address size.

613
00:28:46,510 --> 00:28:50,910
The SPARC-- in terms of what's
visible to the programmer,

614
00:28:50,910 --> 00:28:51,620
the--

615
00:28:51,620 --> 00:28:54,730
well, the width of the
registers are 32 bits wide.

616
00:28:54,730 --> 00:28:58,720
They all have the same number
available to the programmer.

617
00:28:58,720 --> 00:29:00,308
All of them have
32-bit registers.

618
00:29:00,308 --> 00:29:02,350
The instruction size,
they're always all 32 bits.

619
00:29:02,350 --> 00:29:03,940
The data always
has to be aligned.

620
00:29:03,940 --> 00:29:05,410
And it's memory mapped.

621
00:29:05,410 --> 00:29:08,200
So in contrast to
just a few years ago,

622
00:29:08,200 --> 00:29:11,590
it's incredible agreement on
all these basic issues, which

623
00:29:11,590 --> 00:29:14,950
affects why it's so easy to
port programs between these RISC

624
00:29:14,950 --> 00:29:16,120
machines.

625
00:29:16,120 --> 00:29:19,420
So keep in mind that the
RISC machines are more alike

626
00:29:19,420 --> 00:29:21,490
than any set of
machines we've ever seen

627
00:29:21,490 --> 00:29:23,060
in the history of computer design.

628
00:29:23,060 --> 00:29:24,670
But let me now
talk about a couple

629
00:29:24,670 --> 00:29:28,160
of differences for
the SPARC model.

630
00:29:28,160 --> 00:29:31,960
The first is that SPARC allows
overlapped execution of integer

631
00:29:31,960 --> 00:29:33,350
and floating point programs.

632
00:29:33,350 --> 00:29:35,560
So what this means is, in
the right circumstances,

633
00:29:35,560 --> 00:29:38,080
if you do have floating
point instructions,

634
00:29:38,080 --> 00:29:40,540
you can maybe get
complete overlap

635
00:29:40,540 --> 00:29:43,368
with integer instructions and
the floating point instructions.

636
00:29:43,368 --> 00:29:45,160
How do you support that
in an architecture?

637
00:29:45,160 --> 00:29:48,340
You have to allow there to be a
mechanism so that when you have

638
00:29:48,340 --> 00:29:51,320
an interrupt, you can
back up and find out

639
00:29:51,320 --> 00:29:53,570
what floating point instructions
hadn't been finished.

640
00:29:53,570 --> 00:29:56,620
So SPARC provides a
queue that contains

641
00:29:56,620 --> 00:29:58,540
a copy of the instruction,
the floating point

642
00:29:58,540 --> 00:30:01,240
instruction that's in
the middle of execution,

643
00:30:01,240 --> 00:30:08,290
as well as a PC of
that instruction.

644
00:30:08,290 --> 00:30:10,690
As Wayne Rosing mentioned,
there was an emphasis

645
00:30:10,690 --> 00:30:13,480
in the development of SPARC of
support for some of the newer

646
00:30:13,480 --> 00:30:14,710
programming ideas--

647
00:30:14,710 --> 00:30:18,250
the dynamically linked and
dynamically typed languages.

648
00:30:18,250 --> 00:30:21,790
That was important in
the development of SPARC.

649
00:30:21,790 --> 00:30:25,773
For LISP in particular, it has
some interesting requirements,

650
00:30:25,773 --> 00:30:27,940
challenging requirements,
for the computer designer.

651
00:30:27,940 --> 00:30:31,540
Just because you
see A plus B doesn't

652
00:30:31,540 --> 00:30:35,120
mean that you're adding two
numbers or two floating point

653
00:30:35,120 --> 00:30:35,620
numbers.

654
00:30:35,620 --> 00:30:37,480
You could be adding two arrays.

655
00:30:37,480 --> 00:30:39,850
So there has to be the
opportunity for the LISP

656
00:30:39,850 --> 00:30:44,120
programmer at runtime to
change the Add instruction

657
00:30:44,120 --> 00:30:47,570
to perform these other
very complicated things.

658
00:30:47,570 --> 00:30:50,630
So what does a computer
designer do about that?

659
00:30:50,630 --> 00:30:54,700
Well, typically, they
will tag the data somehow,

660
00:30:54,700 --> 00:30:57,440
some mechanism so that
when they're integers,

661
00:30:57,440 --> 00:30:58,690
they'll know they're integers.

662
00:30:58,690 --> 00:30:59,773
And they can do them fast.

663
00:30:59,773 --> 00:31:02,090
And it turns out almost all
the time they're integers.

664
00:31:02,090 --> 00:31:05,200
So looking at the
slide, SPARC supports

665
00:31:05,200 --> 00:31:07,023
the two least significant bits.

666
00:31:07,023 --> 00:31:08,440
The two least
significant bits are

667
00:31:08,440 --> 00:31:09,940
zeros, which
indicates these aren't

668
00:31:09,940 --> 00:31:12,590
pointing to something more
complicated than just numbers.

669
00:31:12,590 --> 00:31:14,170
They're added together.

670
00:31:14,170 --> 00:31:17,930
And the result is also
one of these integers.

671
00:31:17,930 --> 00:31:22,120
So the overflow bit is
set on an operation.

672
00:31:22,120 --> 00:31:24,850
If these two least
significant bits

673
00:31:24,850 --> 00:31:29,072
are not zeros, or if the
result is too big a number

674
00:31:29,072 --> 00:31:30,280
to fit in here, you overflow.

675
00:31:30,280 --> 00:31:33,430
So by having some
special instructions,

676
00:31:33,430 --> 00:31:35,480
they do add, subtract--

677
00:31:35,480 --> 00:31:38,020
and because SPARC
uses condition codes,

678
00:31:38,020 --> 00:31:39,430
subtract acts as a compare.

679
00:31:39,430 --> 00:31:42,242
So they can do
comparisons with support

680
00:31:42,242 --> 00:31:43,450
on a few of the instructions.

681
00:31:43,450 --> 00:31:45,533
The rest of the instructions
will ignore the tags.

682
00:31:45,533 --> 00:31:47,800
But there's a few special
just to support the LISP

683
00:31:47,800 --> 00:31:50,740
and Smalltalk dynamic typing.

684
00:31:50,740 --> 00:31:54,220
As it shows on the slide next,
another very important feature

685
00:31:54,220 --> 00:31:56,252
for one of these
experimental architectures

686
00:31:56,252 --> 00:31:58,210
is that you can test for
errors because there's

687
00:31:58,210 --> 00:32:00,502
lots of funny things that
can happen in dynamic typing.

688
00:32:00,502 --> 00:32:02,930
There's a lot of checking to
make sure things are correct.

689
00:32:02,930 --> 00:32:05,350
A very nice instruction for
that is a conditional trap

690
00:32:05,350 --> 00:32:06,158
instruction.

691
00:32:06,158 --> 00:32:08,200
If you don't have a
conditional trap instruction,

692
00:32:08,200 --> 00:32:11,200
you have to conditionally
branch around some subroutine

693
00:32:11,200 --> 00:32:12,130
and pass parameters.

694
00:32:12,130 --> 00:32:14,380
Conditional trap takes
only one clock cycle

695
00:32:14,380 --> 00:32:15,280
to perform the test.

696
00:32:15,280 --> 00:32:17,560
It's a nice thing to include.

697
00:32:17,560 --> 00:32:20,560
The final thing is register
windows, which Wayne alluded to

698
00:32:20,560 --> 00:32:21,860
in his presentation.

699
00:32:21,860 --> 00:32:23,770
So I'll spend some time
showing what that is

700
00:32:23,770 --> 00:32:26,290
and what the implications are.

701
00:32:26,290 --> 00:32:30,280
Register windows are based on an
observation about the behavior

702
00:32:30,280 --> 00:32:32,860
of most programs
or all programs,

703
00:32:32,860 --> 00:32:36,700
just as caches
were invented based

704
00:32:36,700 --> 00:32:39,610
on an observation of the
behavior of all programs

705
00:32:39,610 --> 00:32:40,970
towards the locality.

706
00:32:40,970 --> 00:32:44,530
So as you see on the slide, let
me-- [? what ?] the axis is--

707
00:32:44,530 --> 00:32:47,020
across is basically
time and units

708
00:32:47,020 --> 00:32:48,790
of procedure call and return.

709
00:32:48,790 --> 00:32:51,400
And what goes down the
slide is the nesting depth,

710
00:32:51,400 --> 00:32:53,800
so how many levels
deep is the subroutine.

711
00:32:53,800 --> 00:32:56,320
And so we get jagged--

712
00:32:56,320 --> 00:32:58,510
basically, these are
four or five calls

713
00:32:58,510 --> 00:33:01,340
in a row followed by a
return, and a call, and so on.

714
00:33:01,340 --> 00:33:02,920
So we have this pattern.

715
00:33:02,920 --> 00:33:05,500
When I move up, the
program is doing returns.

716
00:33:05,500 --> 00:33:07,930
And when we move down, it was
doing a series of calls.

717
00:33:07,930 --> 00:33:10,060
Well, the first thing
you can observe by this

718
00:33:10,060 --> 00:33:11,750
is that, at least
for this program,

719
00:33:11,750 --> 00:33:13,390
we didn't see very jagged lines.

720
00:33:13,390 --> 00:33:17,260
We didn't see programs that
are 100 or 1,000 calls followed

721
00:33:17,260 --> 00:33:18,610
by 100 or 1,000 returns.

722
00:33:18,610 --> 00:33:20,780
There's some locality there.

723
00:33:20,780 --> 00:33:23,620
And in fact, if you
provided a larger buffer

724
00:33:23,620 --> 00:33:25,960
of registers than
just the 32 you need,

725
00:33:25,960 --> 00:33:27,910
that many sets of
them on the chip,

726
00:33:27,910 --> 00:33:30,610
the question is, how
frequently can you

727
00:33:30,610 --> 00:33:32,350
avoid having to go off chip?

728
00:33:32,350 --> 00:33:34,600
Because if you don't
have this on every--

729
00:33:34,600 --> 00:33:35,260
let's see.

730
00:33:35,260 --> 00:33:37,660
On every call, you have
to store registers away,

731
00:33:37,660 --> 00:33:38,710
extra store instructions.

732
00:33:38,710 --> 00:33:43,460
And every return, you have to do
loads back to bring them back.

733
00:33:43,460 --> 00:33:47,357
So what this shows is if
we put a buffer of a size

734
00:33:47,357 --> 00:33:49,940
I think something like-- let's say it was eight.
like-- let's say it was eight.

735
00:33:49,940 --> 00:33:53,890
These boxes indicate
when you do overflow.

736
00:33:53,890 --> 00:33:57,550
So the first several calls and
returns all fit on the chip.

737
00:33:57,550 --> 00:34:00,220
And then where this line is,
that meant there were too many.

738
00:34:00,220 --> 00:34:02,110
And you did these
overflows when you

739
00:34:02,110 --> 00:34:03,670
did have to do memory traffic.

740
00:34:03,670 --> 00:34:05,980
So basically, the success
of the buffering scheme

741
00:34:05,980 --> 00:34:07,534
is the number of
boxes on this chart.

742
00:34:07,534 --> 00:34:09,159
And you can see, once
we get down here,

743
00:34:09,159 --> 00:34:12,139
we're doing lots of calls
and returns in a row.

744
00:34:12,139 --> 00:34:13,510
So that's the observation.

745
00:34:13,510 --> 00:34:15,159
But how well does
it work in practice?

746
00:34:15,159 --> 00:34:16,659
And in particular,
is this something

747
00:34:16,659 --> 00:34:18,010
that only works with C?

748
00:34:18,010 --> 00:34:21,969
Or does it work with
other languages?

749
00:34:21,969 --> 00:34:24,429
This chart shows the percentage
of the calls that result

750
00:34:24,429 --> 00:34:25,060
in overflow.

751
00:34:25,060 --> 00:34:27,227
The percentage of those
calls on the previous chart,

752
00:34:27,227 --> 00:34:29,150
they were the lines for the box.

753
00:34:29,150 --> 00:34:32,199
And this is plotted against
the number of register banks

754
00:34:32,199 --> 00:34:34,170
that you have on the chip.

755
00:34:34,170 --> 00:34:35,050
OK.

756
00:34:35,050 --> 00:34:36,639
And now, with the
plots that are there

757
00:34:36,639 --> 00:34:41,440
for several programs in three
different languages, because I

758
00:34:41,440 --> 00:34:44,679
certainly thought that when
you got into other languages

759
00:34:44,679 --> 00:34:47,580
the pattern might be
quite a bit different.

760
00:34:47,580 --> 00:34:50,050
But as you can see on
the slide, they all

761
00:34:50,050 --> 00:34:52,600
have the same kind of shape
with the knee of the curve

762
00:34:52,600 --> 00:34:56,380
typically being
somewhere between 6 and 8

763
00:34:56,380 --> 00:34:58,930
of these register banks.

764
00:34:58,930 --> 00:35:01,570
So the machines that we
developed at Berkeley

765
00:35:01,570 --> 00:35:02,590
had eight.

766
00:35:02,590 --> 00:35:05,590
And then the SPARC machines
had either seven or eight,

767
00:35:05,590 --> 00:35:07,990
I think the ones that have
been announced so far.

768
00:35:07,990 --> 00:35:10,610
So that technique seems to work.

769
00:35:10,610 --> 00:35:12,610
And then if you have about
seven or eight banks,

770
00:35:12,610 --> 00:35:14,410
you can reduce the
number of stores

771
00:35:14,410 --> 00:35:17,930
to save registers off the
chip and loads coming back.

772
00:35:17,930 --> 00:35:20,620
Now, there's some implications
to the register windows.

773
00:35:20,620 --> 00:35:22,840
And let's evaluate
those implications

774
00:35:22,840 --> 00:35:26,090
in terms of our basic
performance formula.

775
00:35:26,090 --> 00:35:27,880
So first off,
instruction count--

776
00:35:27,880 --> 00:35:30,460
running those same programs
I alluded to earlier on both

777
00:35:30,460 --> 00:35:34,330
the SPARC machine
and the MIPS machine,

778
00:35:34,330 --> 00:35:39,850
MIPS required 40% to 60% more
store instructions and only 3%

779
00:35:39,850 --> 00:35:40,875
to 20% more loads.

780
00:35:40,875 --> 00:35:42,250
That's because
there's a lot more

781
00:35:42,250 --> 00:35:43,960
loads in the program
for other things

782
00:35:43,960 --> 00:35:47,660
than there are simply for
saving and restoring registers.

783
00:35:47,660 --> 00:35:51,100
Now, again from a computer
design perspective,

784
00:35:51,100 --> 00:35:52,900
your favorite instructions
are things that

785
00:35:52,900 --> 00:35:54,110
just add registers together.

786
00:35:54,110 --> 00:35:55,370
You just have to
fetch the instruction.

787
00:35:55,370 --> 00:35:56,703
It doesn't bother anything else.

788
00:35:56,703 --> 00:35:59,777
After that, maybe
jumps aren't so great.

789
00:35:59,777 --> 00:36:01,360
But at least they
don't access memory.

790
00:36:01,360 --> 00:36:03,880
And after that are loads
because at least you're reading

791
00:36:03,880 --> 00:36:06,610
instructions and data, it's all
coming in the same direction.

792
00:36:06,610 --> 00:36:09,670
But the worst instructions, from
my perspective, are the stores.

793
00:36:09,670 --> 00:36:11,820
They're going the wrong
way on a one-way street.

794
00:36:11,820 --> 00:36:13,690
You have all this
data flowing at you.

795
00:36:13,690 --> 00:36:15,863
But instead, what
happens is the stores,

796
00:36:15,863 --> 00:36:17,780
you have to send the
data the other direction.

797
00:36:17,780 --> 00:36:21,700
So getting rid of stores is
a nice thing to get rid of.

798
00:36:21,700 --> 00:36:23,587
The impact on CPI,
the average number

799
00:36:23,587 --> 00:36:25,420
of clocks per instruction
by your instrument

800
00:36:25,420 --> 00:36:26,878
is basically no impact.

801
00:36:26,878 --> 00:36:28,420
But a lot of people
have speculated--

802
00:36:28,420 --> 00:36:30,220
and I was looking
forward to see--

803
00:36:30,220 --> 00:36:32,560
what was going to happen
to the clock cycle time.

804
00:36:32,560 --> 00:36:35,200
So far in the years that people
have been building chips,

805
00:36:35,200 --> 00:36:37,030
the SPARC-based chips
with register windows

806
00:36:37,030 --> 00:36:40,780
have been as fast or
faster than the competing

807
00:36:40,780 --> 00:36:45,100
chips from Motorola
and MIPS, indicating

808
00:36:45,100 --> 00:36:46,570
that the size of
the register file

809
00:36:46,570 --> 00:36:49,270
physically on the chip, that's
not affecting the clock cycle

810
00:36:49,270 --> 00:36:49,480
time.

811
00:36:49,480 --> 00:36:50,800
Something else must
be the bottleneck.

812
00:36:50,800 --> 00:36:51,925
That hasn't been the issue.

813
00:36:51,925 --> 00:36:55,340
The technology it's been built
with hasn't been the issue.

814
00:36:55,340 --> 00:36:57,717
So I think a lot
of people concerned

815
00:36:57,717 --> 00:36:59,800
would agree about the
instruction count, these two

816
00:36:59,800 --> 00:37:03,400
points, but wondered whether or
not it wouldn't hurt the cycle

817
00:37:03,400 --> 00:37:04,190
time.

818
00:37:04,190 --> 00:37:05,650
And so far, that
hasn't been true

819
00:37:05,650 --> 00:37:07,340
for the commercial machines.

820
00:37:07,340 --> 00:37:08,980
There's a couple other things
that people have brought up

821
00:37:08,980 --> 00:37:09,730
about register windows.

822
00:37:09,730 --> 00:37:11,260
What about process
switching time

823
00:37:11,260 --> 00:37:13,552
when you're switching from
working completely different

824
00:37:13,552 --> 00:37:14,980
programs with more registers?

825
00:37:14,980 --> 00:37:17,110
Well, it turns out that our
friends in the operating system

826
00:37:17,110 --> 00:37:19,568
have found so many things to
do at process switch time that

827
00:37:19,568 --> 00:37:22,860
actually having to save two
or three times as many registers

828
00:37:22,860 --> 00:37:24,610
isn't a big deal.

829
00:37:24,610 --> 00:37:27,190
And less than 20%
of the time in Unix

830
00:37:27,190 --> 00:37:30,310
is the saving of the registers
away for the SPARC machine.

831
00:37:30,310 --> 00:37:32,020
Another interesting
question is,

832
00:37:32,020 --> 00:37:34,490
what about-- it takes more
resources on the chip.

833
00:37:34,490 --> 00:37:36,490
How's that going to affect
the size of the chip?

834
00:37:36,490 --> 00:37:38,230
And the size of this
chip is important

835
00:37:38,230 --> 00:37:39,980
because that affects
the cost of the chip.

836
00:37:39,980 --> 00:37:41,740
The way they're
manufactured today,

837
00:37:41,740 --> 00:37:44,980
it goes way up with
the area of the chip.

838
00:37:44,980 --> 00:37:46,840
It turns out the
Cypress die, in part

839
00:37:46,840 --> 00:37:51,535
because they have a nice thin
line width, is the smallest

840
00:37:51,535 --> 00:37:53,470
die of all the RISC
chips, at the time

841
00:37:53,470 --> 00:37:54,828
of this taping at least.

842
00:37:54,828 --> 00:37:56,620
And then if we looked
at that very smallest

843
00:37:56,620 --> 00:37:59,840
die, how much of all that die
is dedicated to the register

844
00:37:59,840 --> 00:38:00,340
windows?

845
00:38:00,340 --> 00:38:04,270
We see that it's only 10% of
the die or less than that.

846
00:38:04,270 --> 00:38:06,700
Final thing to say
about register windows

847
00:38:06,700 --> 00:38:09,370
is that they are optional
in the SPARC architecture.

848
00:38:09,370 --> 00:38:11,590
Separate instructions
from call and return

849
00:38:11,590 --> 00:38:15,370
invoke the register windows, the
save and restore instructions.

850
00:38:15,370 --> 00:38:17,440
So you could take a
SPARC architecture

851
00:38:17,440 --> 00:38:21,250
and write a compiler that
ignored register windows.

852
00:38:21,250 --> 00:38:23,572
Also, depending on
your technology,

853
00:38:23,572 --> 00:38:25,030
the architecture
is variable. You can

854
00:38:25,030 --> 00:38:28,030
have as few as two register
banks and as many as 32,

855
00:38:28,030 --> 00:38:31,060
depending on your technology.

856
00:38:31,060 --> 00:38:34,530
So lots to say about
register windows.

857
00:38:34,530 --> 00:38:36,340
I want to emphasize
one more time

858
00:38:36,340 --> 00:38:38,200
that these machines
are so similar,

859
00:38:38,200 --> 00:38:39,920
I've never seen
anything like it.

860
00:38:39,920 --> 00:38:42,670
And I don't think
these small details

861
00:38:42,670 --> 00:38:45,250
will make the differences
in the long term.

862
00:38:45,250 --> 00:38:47,038
But we'll see.

863
00:38:47,038 --> 00:38:48,830
Probably, to me, the
most interesting thing

864
00:38:48,830 --> 00:38:52,310
about this SPARC architecture is
that, as Wayne Rosing alluded,

865
00:38:52,310 --> 00:38:55,440
they've developed a
family of computers.

866
00:38:55,440 --> 00:38:57,120
So they're shown on this slide.

867
00:38:57,120 --> 00:38:59,870
The SPARCstation 1,
as they're calling it,

868
00:38:59,870 --> 00:39:03,800
is based on a gate array chip,
a 20 megahertz CMOS gate array.

869
00:39:03,800 --> 00:39:08,750
This is the same design
that was done in 1987

870
00:39:08,750 --> 00:39:12,080
in a more expensive machine,
just upgraded with technology

871
00:39:12,080 --> 00:39:14,780
to the faster clock rate.

872
00:39:14,780 --> 00:39:16,640
After that is the SPARCstation--

873
00:39:16,640 --> 00:39:18,960
I guess it's called 330.

874
00:39:18,960 --> 00:39:19,820
I see.

875
00:39:19,820 --> 00:39:21,140
The 33 from the 33 megahertz.

876
00:39:21,140 --> 00:39:23,630
I was trying to figure
out what that number was.

877
00:39:23,630 --> 00:39:25,080
This is a full custom design.

878
00:39:25,080 --> 00:39:27,365
This was done in
cooperation with Cypress,

879
00:39:27,365 --> 00:39:29,300
as Wayne alluded to.

880
00:39:29,300 --> 00:39:34,550
This is the traditional approach
to designing a microprocessor.

881
00:39:34,550 --> 00:39:37,340
Solbourne, one of the companies
in the SPARC international

882
00:39:37,340 --> 00:39:41,360
group, has taken these chips and
has built a four-way interleave

883
00:39:41,360 --> 00:39:42,860
multiprocessor--

884
00:39:42,860 --> 00:39:45,080
four processors
in the same box--

885
00:39:45,080 --> 00:39:49,400
using, I think, the 25 megahertz
version of the gate array.

886
00:39:49,400 --> 00:39:54,710
Bit is the ECL design
that Wayne alluded to.

887
00:39:54,710 --> 00:39:56,930
It's been announced that
it's an 80 megahertz clock.

888
00:39:56,930 --> 00:39:58,610
That's 12 and 1/2
nanoseconds for you

889
00:39:58,610 --> 00:40:03,860
computer historians, the same
clock cycle time as the Cray-1.

890
00:40:03,860 --> 00:40:07,100
This hasn't been announced
in a Sun product yet.

891
00:40:07,100 --> 00:40:08,750
And then Prisma,
which is a startup

892
00:40:08,750 --> 00:40:11,870
company, another startup
company, they've announced--

893
00:40:11,870 --> 00:40:14,390
and they've given talks
on a 250 megahertz

894
00:40:14,390 --> 00:40:16,520
4 nanosecond clock
version of the SPARC

895
00:40:16,520 --> 00:40:17,420
architecture.
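
The clock figures quoted here can be checked with a short calculation. This is only a sketch: the function name `cycle_time_ns` is an illustrative assumption, and the clock rates are the ones mentioned in the talk.

```python
def cycle_time_ns(clock_mhz: float) -> float:
    """Clock period in nanoseconds for a clock rate given in megahertz."""
    # 1 MHz is one cycle per microsecond, which is 1000 ns per cycle.
    return 1000.0 / clock_mhz


# Clock rates quoted in the talk: the gate array, the Cypress custom chip,
# the ECL design, and the Prisma gallium arsenide machine.
quoted_mhz = [20, 33, 80, 250]
periods = {mhz: cycle_time_ns(mhz) for mhz in quoted_mhz}

for mhz, ns in periods.items():
    print(f"{mhz:>3} MHz -> {ns:.1f} ns per cycle")
```

The 80 megahertz and 250 megahertz entries come out at 12.5 and 4 nanoseconds, matching the figures given in the talk.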

896
00:40:17,420 --> 00:40:20,960
This is being built with lots
of gallium arsenide chips.

897
00:40:20,960 --> 00:40:22,460
And it's been
announced in the paper

898
00:40:22,460 --> 00:40:25,730
that Sun's trying to
convince companies that

899
00:40:25,730 --> 00:40:29,900
have traditionally built PC
clones and people building

900
00:40:29,900 --> 00:40:31,970
portables to build
something that's

901
00:40:31,970 --> 00:40:33,717
binary compatible to
SPARC architecture,

902
00:40:33,717 --> 00:40:35,300
so they can all run
the same software.

903
00:40:35,300 --> 00:40:37,250
So these machines--
computer family

904
00:40:37,250 --> 00:40:39,170
means that the same
binary runs on them all.

905
00:40:39,170 --> 00:40:40,680
And that's true.

906
00:40:40,680 --> 00:40:43,770
So right now,
ignoring the thing--

907
00:40:43,770 --> 00:40:46,020
right now, the things that
Sun sells-- and it probably

908
00:40:46,020 --> 00:40:49,130
will change over time-- there's
about a factor of-- well, not

909
00:40:49,130 --> 00:40:51,110
Sun, if I include Prisma here.

910
00:40:51,110 --> 00:40:53,090
The range in price
and performance

911
00:40:53,090 --> 00:40:56,750
is about a factor of 50 in
price and 21 in performance.

912
00:40:56,750 --> 00:41:00,330
The same binary program
runs all the way along.

913
00:41:00,330 --> 00:41:00,830
OK.

914
00:41:00,830 --> 00:41:03,890
So let me show you
some of the die photos

915
00:41:03,890 --> 00:41:06,920
and then some of the
boards for these designs.

916
00:41:06,920 --> 00:41:11,690
So this first die photo is of
the good old gate array design.

917
00:41:11,690 --> 00:41:15,230
And although it may be
visible, the register file

918
00:41:15,230 --> 00:41:17,818
is this dark area right here.

919
00:41:17,818 --> 00:41:18,860
That's the register file.

920
00:41:18,860 --> 00:41:21,225
And as you can see, that's
not much of the chip.

921
00:41:21,225 --> 00:41:23,600
The whole chip would include
all the pad array and things

922
00:41:23,600 --> 00:41:25,080
like that.

923
00:41:25,080 --> 00:41:27,590
I think the ALU is over here.

924
00:41:27,590 --> 00:41:30,138
Here's some of
the control over--

925
00:41:30,138 --> 00:41:32,180
I guess this is the
instruction [INAUDIBLE] unit,

926
00:41:32,180 --> 00:41:33,390
and the control down here.

927
00:41:33,390 --> 00:41:36,890
So that gives you an
idea of all the functions

928
00:41:36,890 --> 00:41:39,140
that are involved in
the gate array design.

929
00:41:39,140 --> 00:41:41,210
And here's the same
picture of the chip

930
00:41:41,210 --> 00:41:43,310
without all the labels on it.

931
00:41:43,310 --> 00:41:44,870
I think this one--

932
00:41:44,870 --> 00:41:47,150
they make this run
now at 25 megahertz.

933
00:41:47,150 --> 00:41:52,750
The original SPARC machine
was 16.67 megahertz.

934
00:41:52,750 --> 00:41:56,190
This is the picture
of the Cypress design.

935
00:41:56,190 --> 00:41:58,530
And the register file has
a different orientation.

936
00:41:58,530 --> 00:42:00,960
But this is the register
file right here.

937
00:42:00,960 --> 00:42:05,850
And the ALU, I think,
is right below this.

938
00:42:05,850 --> 00:42:08,650
And I think this is the bus
interface unit over here.

939
00:42:08,650 --> 00:42:11,250


940
00:42:11,250 --> 00:42:13,260
This is a much smaller die.

941
00:42:13,260 --> 00:42:17,910
It's a custom design and
is at a higher clock rate.

942
00:42:17,910 --> 00:42:21,010
This is the ECL design,
the single chip design.

943
00:42:21,010 --> 00:42:22,680
You see the register
file is larger.

944
00:42:22,680 --> 00:42:26,400
It turns out, in this
particular technology Bit has,

945
00:42:26,400 --> 00:42:32,850
the RAM cells aren't as dense
as they are in the CMOS design.

946
00:42:32,850 --> 00:42:35,520
And the next section is all
the arithmetic and logic units.

947
00:42:35,520 --> 00:42:37,980
And then control is
laid out over here.

948
00:42:37,980 --> 00:42:42,120
This chip has 125,000
transistors and is very likely

949
00:42:42,120 --> 00:42:46,410
the largest single chip ECL
design, at least logic chip,

950
00:42:46,410 --> 00:42:48,180
that's ever been built.

951
00:42:48,180 --> 00:42:50,723
So you've seen the
die photographs.

952
00:42:50,723 --> 00:42:52,140
When they're put
into the package,

953
00:42:52,140 --> 00:42:53,610
they look something like this.

954
00:42:53,610 --> 00:42:55,980
Those two chips are actually
of historical interest,

955
00:42:55,980 --> 00:43:00,480
in that they're the second
SPARC chips ever made.

956
00:43:00,480 --> 00:43:02,970
Let me then show you
the inside of the chip

957
00:43:02,970 --> 00:43:06,600
bonded against the pads
for the Bipolar or ECL

958
00:43:06,600 --> 00:43:09,790
that I just talked about.

959
00:43:09,790 --> 00:43:14,080
So I think you may be able to
see that this is the Bipolar

960
00:43:14,080 --> 00:43:14,580
design.

961
00:43:14,580 --> 00:43:17,340
You can see the register file
on the left-hand side and all

962
00:43:17,340 --> 00:43:18,280
the pins around it.

963
00:43:18,280 --> 00:43:21,750
So this is the package
part, 125,000 transistors,

964
00:43:21,750 --> 00:43:24,840
the whole chip running at 12
and 1/2 nanosecond clock cycle

965
00:43:24,840 --> 00:43:26,580
time.

966
00:43:26,580 --> 00:43:29,850
This is a type of
perhaps forerunner

967
00:43:29,850 --> 00:43:32,400
of the RISC super
microprocessors I'll

968
00:43:32,400 --> 00:43:33,960
mention again in my last slide.

969
00:43:33,960 --> 00:43:36,990


970
00:43:36,990 --> 00:43:41,050
Boards containing the chips
that I was just talking about.

971
00:43:41,050 --> 00:43:42,930
You can see those
three boards here.

972
00:43:42,930 --> 00:43:46,800
The smaller board refers
to the 20 megahertz version

973
00:43:46,800 --> 00:43:48,540
that's in the SPARCstation 1.

974
00:43:48,540 --> 00:43:53,760
The board next to it contains
the 33 megahertz version,

975
00:43:53,760 --> 00:43:54,990
the custom chip design.

976
00:43:54,990 --> 00:43:56,550
And over here is
a prototype board

977
00:43:56,550 --> 00:43:59,023
running 70 to 80 megahertz.

978
00:43:59,023 --> 00:44:00,690
You can see, just
from this perspective,

979
00:44:00,690 --> 00:44:03,683
that the advantage of
the slower clock rate

980
00:44:03,683 --> 00:44:05,850
is it allows them to do
more highly integrated chips

981
00:44:05,850 --> 00:44:08,010
and have a considerably
smaller board design.

982
00:44:08,010 --> 00:44:10,290
This is no bigger than a
sheet of notebook paper.

983
00:44:10,290 --> 00:44:13,080
And the others are quite
a bit bigger than that.

984
00:44:13,080 --> 00:44:15,747
So let me focus first on
this small board that's

985
00:44:15,747 --> 00:44:18,080
in the SPARCstation and
identify some of the components.

986
00:44:18,080 --> 00:44:20,110
So in this lower
corner of the board,

987
00:44:20,110 --> 00:44:21,630
this is the integer chip, which
in a chip, which

988
00:44:21,630 --> 00:44:26,095
is this gate array that comes
from LSI Logic or Fujitsu.

989
00:44:26,095 --> 00:44:28,470
Unfortunately, this board
doesn't have the floating point

990
00:44:28,470 --> 00:44:29,220
chip.

991
00:44:29,220 --> 00:44:32,280
The gate array would
be right next to it.

992
00:44:32,280 --> 00:44:34,830
Over here is the
cache controller.

993
00:44:34,830 --> 00:44:37,740
And the cache data RAMs
are these eight chips.

994
00:44:37,740 --> 00:44:42,250
The cache tag RAMs are those
five chips over there.

995
00:44:42,250 --> 00:44:44,280
If we move back,
you can see there's

996
00:44:44,280 --> 00:44:45,870
a couple of more gate arrays.

997
00:44:45,870 --> 00:44:47,710
This is the memory
management unit.

998
00:44:47,710 --> 00:44:48,900
Here's the DMA chip.

999
00:44:48,900 --> 00:44:51,780
This happens to be
the SCSI controller.

1000
00:44:51,780 --> 00:44:53,807
Over to the right
are the SIMM modules.

1001
00:44:53,807 --> 00:44:55,140
These SIMM modules will stand up.

1002
00:44:55,140 --> 00:44:57,030
There's room for 16
of these SIMM modules,

1003
00:44:57,030 --> 00:44:59,550
which can contain up to 4
megabytes of memory per module

1004
00:44:59,550 --> 00:45:02,730
or 64 megabytes on this
notebook-sized sheet of paper.

1005
00:45:02,730 --> 00:45:05,070
There's actually only
50 chips on this board

1006
00:45:05,070 --> 00:45:08,040
using a lot of high integration.

1007
00:45:08,040 --> 00:45:10,680
This is actually fewer chips
than is in the Macintosh,

1008
00:45:10,680 --> 00:45:12,970
for example.

1009
00:45:12,970 --> 00:45:16,140
So now, let's go
from the small board

1010
00:45:16,140 --> 00:45:19,260
to the 33 megahertz design.

1011
00:45:19,260 --> 00:45:21,600
This is the board that's
based on the 33 megahertz

1012
00:45:21,600 --> 00:45:27,150
version, which goes in the
SPARCstation 33-0, 330.

1013
00:45:27,150 --> 00:45:29,160
Here is the Cypress chip.

1014
00:45:29,160 --> 00:45:31,020
It's the custom design.

1015
00:45:31,020 --> 00:45:35,490
Next to it is a controller for
the floating point, and then

1016
00:45:35,490 --> 00:45:38,130
a TI floating point chip.

1017
00:45:38,130 --> 00:45:42,100
As we zoom back and
see more of the board,

1018
00:45:42,100 --> 00:45:46,140
this is the cache
controller logic over here.

1019
00:45:46,140 --> 00:45:48,090
And the cache chips are nearby.

1020
00:45:48,090 --> 00:45:50,010
What's interesting
about this design

1021
00:45:50,010 --> 00:45:52,350
is that they have
a cache strictly

1022
00:45:52,350 --> 00:45:56,220
for I/O, which more or less
doubles the I/O performance

1023
00:45:56,220 --> 00:45:59,100
compared to previous
systems without such a cache

1024
00:45:59,100 --> 00:46:03,250
and these controllers and
these two gate array chips.

1025
00:46:03,250 --> 00:46:06,410
Let's go to the final
board design next.

1026
00:46:06,410 --> 00:46:12,290
Here is the board that contains
the ECL chip that I just talked

1027
00:46:12,290 --> 00:46:13,610
about earlier in the slide.

1028
00:46:13,610 --> 00:46:16,490
I can show the
example of the die.

1029
00:46:16,490 --> 00:46:19,430
Here is the package
that I showed you.

1030
00:46:19,430 --> 00:46:21,530
Here's the integer unit.

1031
00:46:21,530 --> 00:46:24,230
Next to it is the
cache controller unit.

1032
00:46:24,230 --> 00:46:28,370
Over here are the registers
that are for the floating point

1033
00:46:28,370 --> 00:46:31,070
registers, not the integer
register file, not the register

1034
00:46:31,070 --> 00:46:34,430
windows, but just the 32 32-bit
registers for the floating

1035
00:46:34,430 --> 00:46:35,600
point unit.

1036
00:46:35,600 --> 00:46:37,580
Here is the multiplier.

1037
00:46:37,580 --> 00:46:39,560
And here's the
floating point adder,

1038
00:46:39,560 --> 00:46:41,180
floating point multiplier.

1039
00:46:41,180 --> 00:46:43,922
Here are the bus
interface units.

1040
00:46:43,922 --> 00:46:45,380
Now, what's
interesting-- you can't

1041
00:46:45,380 --> 00:46:46,160
see from this perspective.

1042
00:46:46,160 --> 00:46:47,160
I'll change in a minute.

1043
00:46:47,160 --> 00:46:50,210
But these all have
heavy metal hats on them

1044
00:46:50,210 --> 00:46:53,630
to remove the heat
from the chips.

1045
00:46:53,630 --> 00:46:55,870
This runs at about 20 watts.

1046
00:46:55,870 --> 00:46:56,730
This is around 18.

1047
00:46:56,730 --> 00:46:58,910
And the rest of the
chips, these ECL chips,

1048
00:46:58,910 --> 00:47:00,680
run at about 15 watts.

1049
00:47:00,680 --> 00:47:05,470
You'll be able to see the
hats as I turn this sideways.

1050
00:47:05,470 --> 00:47:09,120
Maybe you can zoom in.

1051
00:47:09,120 --> 00:47:10,790
OK.

1052
00:47:10,790 --> 00:47:13,490
So there needs to be greater
spacing on this board,

1053
00:47:13,490 --> 00:47:19,790
as you can see, to contain
these heatsinks on the top.

1054
00:47:19,790 --> 00:47:23,270
A single board, 70-80
megahertz in this prototype,

1055
00:47:23,270 --> 00:47:26,780
machine running close to
the speeds of the Cray-1.

1056
00:47:26,780 --> 00:47:29,510
So now, I've talked about some
of the differences in the SPARC

1057
00:47:29,510 --> 00:47:33,170
architectures, where the ideas
came from, the technical ideas.

1058
00:47:33,170 --> 00:47:35,220
Let me talk about
historically where

1059
00:47:35,220 --> 00:47:37,640
were these ideas developed.

1060
00:47:37,640 --> 00:47:39,860
The first RISC machine
was developed at IBM

1061
00:47:39,860 --> 00:47:42,110
in the Yorktown division,
the IBM 801.

1062
00:47:42,110 --> 00:47:45,860
This was actually a 24-bit
ECL minicomputer led

1063
00:47:45,860 --> 00:47:47,930
by John Cocke and George Radin.

1064
00:47:47,930 --> 00:47:50,180
I think the particular
emphasis of that group

1065
00:47:50,180 --> 00:47:52,520
was to push the
compiler technology.

1066
00:47:52,520 --> 00:47:56,540
It turned out that, for
the compiler technology,

1067
00:47:56,540 --> 00:47:59,510
a fairly simple architecture was
a better match for the compiler

1068
00:47:59,510 --> 00:48:01,340
technology they were pushing.

1069
00:48:01,340 --> 00:48:06,080
Their emphasis was on a subset
of PL/I that they call PL.8.

1070
00:48:06,080 --> 00:48:08,005
They invented a custom
operating system.

1071
00:48:08,005 --> 00:48:09,380
And their comparisons
were always

1072
00:48:09,380 --> 00:48:14,630
against the IBM 370 family
machines, quite naturally.

1073
00:48:14,630 --> 00:48:18,620
The RISC research at Berkeley,
which I was involved with,

1074
00:48:18,620 --> 00:48:20,720
led to the first two
RISC microprocessors.

1075
00:48:20,720 --> 00:48:22,760
They were 32-bit
microprocessors.

1076
00:48:22,760 --> 00:48:24,530
And as I mentioned
earlier, the emphasis

1077
00:48:24,530 --> 00:48:26,540
was trying to come up
with an architecture

1078
00:48:26,540 --> 00:48:29,750
to track the rapid
changes in VLSI.

1079
00:48:29,750 --> 00:48:34,760
And we also gave this family
of machines their name.

1080
00:48:34,760 --> 00:48:38,660
The emphasis there was on C and
Unix running on RISC machines.

1081
00:48:38,660 --> 00:48:42,590
And the comparisons were
against the VAX and the 68000.

1082
00:48:42,590 --> 00:48:44,810
Professor Hennessy at
Stanford University

1083
00:48:44,810 --> 00:48:46,790
led another RISC effort.

1084
00:48:46,790 --> 00:48:48,770
They built the third
RISC microprocessor

1085
00:48:48,770 --> 00:48:52,840
in pushing compiler technology
and trying to track VLSI.

1086
00:48:52,840 --> 00:48:56,510
And the emphasis was in
Pascal and typical comparisons

1087
00:48:56,510 --> 00:48:58,270
against PDP-10.

1088
00:48:58,270 --> 00:49:01,820
But I think from a longer term
historical perspective that

1089
00:49:01,820 --> 00:49:03,440
precedes all of these designs--

1090
00:49:03,440 --> 00:49:07,235
and that goes back to Seymour
Cray and his classic CDC-6600

1091
00:49:07,235 --> 00:49:09,740
and later in his Cray-1 machine.

1092
00:49:09,740 --> 00:49:13,460
You can see the same principles,
in my perspective, applied.

1093
00:49:13,460 --> 00:49:16,250
Now, if you look at--

1094
00:49:16,250 --> 00:49:20,390
if you try and find Seymour
Cray give a talk, first of all,

1095
00:49:20,390 --> 00:49:21,380
go see it.

1096
00:49:21,380 --> 00:49:24,320
He does these about once a
decade, as far as I can tell.

1097
00:49:24,320 --> 00:49:27,500
So I happened to be fumbling
through the video library

1098
00:49:27,500 --> 00:49:31,670
at Berkeley and found out we
had a taping of Seymour Cray

1099
00:49:31,670 --> 00:49:33,950
right before the
announcement of the Cray-1.

1100
00:49:33,950 --> 00:49:36,740
Lawrence Livermore, which
always buys the first model

1101
00:49:36,740 --> 00:49:39,860
of the supercomputers, I guess
could use that against him

1102
00:49:39,860 --> 00:49:41,390
to make him give a public talk.

1103
00:49:41,390 --> 00:49:43,280
And what this quote
is on the slide

1104
00:49:43,280 --> 00:49:45,620
is from part of that talk.

1105
00:49:45,620 --> 00:49:48,710
And this is, I think, the
year before the Cray-1.

1106
00:49:48,710 --> 00:49:51,440
So he said, registers-- he
was referring to the 6600

1107
00:49:51,440 --> 00:49:52,700
when he was talking about it.

1108
00:49:52,700 --> 00:49:54,533
Registers made the
instructions very simple.

1109
00:49:54,533 --> 00:49:56,690
And that thought
is still with me

1110
00:49:56,690 --> 00:49:59,180
and is very present in the
machine I am designing now,

1111
00:49:59,180 --> 00:50:00,110
the Cray-1.

1112
00:50:00,110 --> 00:50:02,762
That is somewhat unique.

1113
00:50:02,762 --> 00:50:04,220
What's left off
the slide, he talks

1114
00:50:04,220 --> 00:50:06,762
about most machines have very
elaborate instructions and more

1115
00:50:06,762 --> 00:50:07,730
memory accesses.

1116
00:50:07,730 --> 00:50:10,023
Then he concludes with
simplicity, I guess,

1117
00:50:10,023 --> 00:50:10,940
is a way of saying it.

1118
00:50:10,940 --> 00:50:12,380
I am all for simplicity.

1119
00:50:12,380 --> 00:50:15,980
If it's very complicated,
I can't understand it.

1120
00:50:15,980 --> 00:50:18,350
OK.

1121
00:50:18,350 --> 00:50:20,000
So what happened to
all this research?

1122
00:50:20,000 --> 00:50:22,460
Well, it turned out each
of those research projects

1123
00:50:22,460 --> 00:50:24,050
led to commercial products.

1124
00:50:24,050 --> 00:50:28,460
All of them added floating point
to the simple integer model.

1125
00:50:28,460 --> 00:50:32,150
The IBM 801 led
to the IBM RT PC.

1126
00:50:32,150 --> 00:50:34,655
But it turned out that
was a brand new design.

1127
00:50:34,655 --> 00:50:36,560
Or it was a heavily
changed design.

1128
00:50:36,560 --> 00:50:39,320
They cut the number of
registers from 32 to 16.

1129
00:50:39,320 --> 00:50:41,360
They introduced
16-bit instructions

1130
00:50:41,360 --> 00:50:43,250
and even microcode
in that design.

1131
00:50:43,250 --> 00:50:45,350
So that was a lot of changes.

1132
00:50:45,350 --> 00:50:47,360
The SPARC design
this tape's about

1133
00:50:47,360 --> 00:50:49,670
is very close to
the RISC design.

1134
00:50:49,670 --> 00:50:51,170
I'll talk about how
it was extended,

1135
00:50:51,170 --> 00:50:55,460
with floating point and this LISP
support or Smalltalk support.

1136
00:50:55,460 --> 00:50:56,840
At Stanford, the MIPS--

1137
00:50:56,840 --> 00:51:00,500
Stanford led-- in fact, the
name is the name of the company.

1138
00:51:00,500 --> 00:51:03,110
But it was really a brand
new instruction set design.

1139
00:51:03,110 --> 00:51:04,700
Having 2x the
number of registers,

1140
00:51:04,700 --> 00:51:06,705
having single size
instructions, some people

1141
00:51:06,705 --> 00:51:08,080
have said that
the MIPS design is

1142
00:51:08,080 --> 00:51:10,780
closer to the Berkeley
design than it

1143
00:51:10,780 --> 00:51:11,950
was to the Stanford design.

1144
00:51:11,950 --> 00:51:14,470
But I certainly wouldn't say
that about my good friend John

1145
00:51:14,470 --> 00:51:16,810
Hennessy.

1146
00:51:16,810 --> 00:51:18,400
All right.

1147
00:51:18,400 --> 00:51:20,020
So there's this work
done at Berkeley.

1148
00:51:20,020 --> 00:51:21,790
What happened at Sun?

1149
00:51:21,790 --> 00:51:24,700
Well, as Wayne mentioned, Sun
was a pretty small company

1150
00:51:24,700 --> 00:51:27,130
when it decided on this
fairly bold venture

1151
00:51:27,130 --> 00:51:30,670
as to no longer rely exclusively
on this giant semiconductor

1152
00:51:30,670 --> 00:51:33,065
house, but to go
off on their own.

1153
00:51:33,065 --> 00:51:34,690
Starting with the
RISC instruction set,

1154
00:51:34,690 --> 00:51:36,610
a team of operating
system people,

1155
00:51:36,610 --> 00:51:39,730
compiler people, and architects
and hardware designers

1156
00:51:39,730 --> 00:51:41,530
developed the SPARC
instruction set.

1157
00:51:41,530 --> 00:51:43,990
The emphasis was on simplicity
because of the importance

1158
00:51:43,990 --> 00:51:46,510
of getting this thing
to market, the resources

1159
00:51:46,510 --> 00:51:49,240
available in a company--
a $30 million 200 person

1160
00:51:49,240 --> 00:51:50,780
company-- wasn't that great.

1161
00:51:50,780 --> 00:51:52,930
And it needed to be scalable
to new technologies.

1162
00:51:52,930 --> 00:51:55,420
That was another reason
to keep it simple.

1163
00:51:55,420 --> 00:51:58,680
And so as a result,
the 1987 gate array,

1164
00:51:58,680 --> 00:52:00,520
the first time a
microprocessor has ever

1165
00:52:00,520 --> 00:52:03,370
been built as a gate array, was
faster than the custom designs

1166
00:52:03,370 --> 00:52:04,720
from Intel or Motorola.

1167
00:52:04,720 --> 00:52:08,600


1168
00:52:08,600 --> 00:52:11,090
Now, let me spend some
time talking a little bit

1169
00:52:11,090 --> 00:52:13,140
about the future of design.

1170
00:52:13,140 --> 00:52:15,170
So if you get nothing
else from this tape,

1171
00:52:15,170 --> 00:52:17,050
I'm going to try and
burn into your memory

1172
00:52:17,050 --> 00:52:21,810
this performance equation
that you're trying to minimize

1173
00:52:21,810 --> 00:52:23,110
in getting higher performance.

1174
00:52:23,110 --> 00:52:25,903
So this slide is about
higher performance.

1175
00:52:25,903 --> 00:52:28,070
Well, we're going to try
and improve the clock cycle

1176
00:52:28,070 --> 00:52:29,487
time-- all the
RISC machines are--

1177
00:52:29,487 --> 00:52:31,630
by going to new technologies.

1178
00:52:31,630 --> 00:52:34,730
There's already these custom
ECL chips being developed

1179
00:52:34,730 --> 00:52:37,290
by Sun and other companies.

1180
00:52:37,290 --> 00:52:40,850
There's a lot of interest in
the BICMOS, combination Bipolar

1181
00:52:40,850 --> 00:52:44,060
and CMOS designs, and interest
as well in gallium arsenide

1182
00:52:44,060 --> 00:52:46,610
designs, like I mentioned.

1183
00:52:46,610 --> 00:52:49,730
And this is all pointed at
the clock cycle time.

1184
00:52:49,730 --> 00:52:52,845
The higher integration of trying
to get the caches, the memory

1185
00:52:52,845 --> 00:52:54,470
management unit, and
the floating point

1186
00:52:54,470 --> 00:52:55,965
all on the same die--

1187
00:52:55,965 --> 00:52:57,590
this is going to
obviously lower costs,

1188
00:52:57,590 --> 00:52:59,960
but also allows a
faster clock cycle time

1189
00:52:59,960 --> 00:53:01,850
because they all fit together.

1190
00:53:01,850 --> 00:53:05,300
As you can see on the monitor,
what I'm talking about

1191
00:53:05,300 --> 00:53:10,010
is combining the integer
chip, the floating point

1192
00:53:10,010 --> 00:53:15,170
chip, the cache control chip,
all of the data cache RAMs,

1193
00:53:15,170 --> 00:53:19,760
the tag RAMs, the memory
management unit, and--

1194
00:53:19,760 --> 00:53:23,180
who knows-- maybe even the DMA,
all of those chips into one

1195
00:53:23,180 --> 00:53:23,720
die.

1196
00:53:23,720 --> 00:53:25,732
That's going to reduce
clearly the board area.

1197
00:53:25,732 --> 00:53:27,440
And by making things
smaller, maybe it'll

1198
00:53:27,440 --> 00:53:28,773
have a little faster cycle time.

1199
00:53:28,773 --> 00:53:32,240


1200
00:53:32,240 --> 00:53:36,650
The area that's probably
most novel from a computer

1201
00:53:36,650 --> 00:53:38,210
architecture
perspective is what's

1202
00:53:38,210 --> 00:53:40,340
been called superscalar design.

1203
00:53:40,340 --> 00:53:43,160
Rather than fetching one
instruction every clock cycle,

1204
00:53:43,160 --> 00:53:45,830
the RISC machines are headed
towards multiple instructions

1205
00:53:45,830 --> 00:53:49,190
per clock cycle, two or
three so that if you're

1206
00:53:49,190 --> 00:53:51,230
trying to execute
two at the same time,

1207
00:53:51,230 --> 00:53:52,940
you're going to lower the CPI.

1208
00:53:52,940 --> 00:53:56,240
In fact, you're going to
lower the CPI below 1 to try

1209
00:53:56,240 --> 00:53:58,075
and minimize this formula.

1210
00:53:58,075 --> 00:54:00,200
And of course, there's
going to be continued effort

1211
00:54:00,200 --> 00:54:02,930
on trying to use better
technology, better algorithms

1212
00:54:02,930 --> 00:54:05,450
to improve compilers so you can
lower the instruction count.
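
The three levers named in this part of the talk (clock cycle time, CPI, and instruction count) multiply together into execution time. The sketch below assumes the standard form of that performance equation. The instruction count and the two CPI values are hypothetical numbers chosen for illustration; only the 33 megahertz clock comes from the talk.

```python
def cpu_time_seconds(instruction_count: float, cpi: float, clock_hz: float) -> float:
    """Execution time = instructions x cycles per instruction x seconds per cycle."""
    clock_cycle_time = 1.0 / clock_hz
    return instruction_count * cpi * clock_cycle_time


# Hypothetical program of one million instructions on a 33 MHz clock.
INSTRUCTIONS = 1_000_000
CLOCK_HZ = 33e6

# One instruction per clock (CPI of 1) versus a hypothetical superscalar
# design that issues two instructions per clock (CPI of 0.5).
baseline = cpu_time_seconds(INSTRUCTIONS, cpi=1.0, clock_hz=CLOCK_HZ)
superscalar = cpu_time_seconds(INSTRUCTIONS, cpi=0.5, clock_hz=CLOCK_HZ)
speedup = baseline / superscalar

print(f"baseline:    {baseline * 1e3:.2f} ms")
print(f"superscalar: {superscalar * 1e3:.2f} ms")
print(f"speedup:     {speedup:.1f}x")
```

Because the terms multiply, halving any one of them halves execution time, provided the change does not make one of the other two worse.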

1213
00:54:05,450 --> 00:54:07,250
So that's, for me,
the future direction,

1214
00:54:07,250 --> 00:54:09,290
in terms of performance
improvements and cost

1215
00:54:09,290 --> 00:54:11,820
improvements.

1216
00:54:11,820 --> 00:54:16,490
Now, the big limit to
any computer architecture

1217
00:54:16,490 --> 00:54:18,400
is the size of its address.

1218
00:54:18,400 --> 00:54:22,340
The RISC machines came out
with 32-bit architectures.

1219
00:54:22,340 --> 00:54:24,530
Many people,
including Gordon Bell,

1220
00:54:24,530 --> 00:54:26,630
talked about how
the only mistake

1221
00:54:26,630 --> 00:54:28,790
that you can't recover
from in computer design

1222
00:54:28,790 --> 00:54:30,060
is the address size.

1223
00:54:30,060 --> 00:54:32,380
So in fact, computer designers
have used the addresses

1224
00:54:32,380 --> 00:54:35,090
as an excuse to redesign
their instruction sets.

1225
00:54:35,090 --> 00:54:38,660
The RISC machines at 32 bits are
pretty close to those limits.

1226
00:54:38,660 --> 00:54:40,590
And so I would expect,
in the next few years,

1227
00:54:40,590 --> 00:54:43,520
we're going to see proposals
for much larger than 32

1228
00:54:43,520 --> 00:54:47,840
bits of addressing in probably
all computers, not just RISC

1229
00:54:47,840 --> 00:54:49,950
designs.
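
A rough calculation shows why 32 bits of address is described as close to the limit. This is a sketch that assumes byte addressing. The helper name `addressable_bytes` is illustrative, and the 64 megabyte figure is the maximum memory quoted earlier in the talk for the SPARCstation 1 board.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 2 ** address_bits


MIB = 2 ** 20
GIB = 2 ** 30

space_32 = addressable_bytes(32)
board_max = 64 * MIB  # fully populated desktop board, as quoted in the talk

print(f"32-bit address space: {space_32 // GIB} GiB")
print(f"64 MB board uses 1/{space_32 // board_max} of that space")
```

A desktop board that already fills one sixty-fourth of the address space leaves only six more doublings of memory before the limit is reached.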

1230
00:54:49,950 --> 00:54:55,820
So for my final slide, let
me roll the dice here and try

1231
00:54:55,820 --> 00:54:58,940
and describe the future under
the possibility somebody

1232
00:54:58,940 --> 00:55:02,600
might even be seeing the tape
at the time I'm predicting.

1233
00:55:02,600 --> 00:55:05,013
So let's say that, for
some sizable fraction

1234
00:55:05,013 --> 00:55:06,680
of the scientific
community in the years

1235
00:55:06,680 --> 00:55:11,420
1993, 1996, that the
heart of almost all
hard of almost all

1236
00:55:11,420 --> 00:55:13,130
of these systems
in this community

1237
00:55:13,130 --> 00:55:16,730
is going to be a RISC
supermicroprocessor, millions

1238
00:55:16,730 --> 00:55:18,890
of transistors on
one chip implementing

1239
00:55:18,890 --> 00:55:20,540
a RISC-style architecture.

1240
00:55:20,540 --> 00:55:22,160
At the low end is
the workstation

1241
00:55:22,160 --> 00:55:25,910
or the desktop, that you have
a single one of these devices

1242
00:55:25,910 --> 00:55:27,680
with a fairly simple
memory system trying

1243
00:55:27,680 --> 00:55:29,240
to get the cost down.

1244
00:55:29,240 --> 00:55:31,220
The next step up,
the file server

1245
00:55:31,220 --> 00:55:34,400
or time-share minicomputer,
will see a few of these RISC

1246
00:55:34,400 --> 00:55:36,350
supermicroprocessors
with a much better memory

1247
00:55:36,350 --> 00:55:38,930
system and perhaps a much
better I/O system as well.

1248
00:55:38,930 --> 00:55:43,670
And at the high end could be
the supercomputers with many--

1249
00:55:43,670 --> 00:55:46,890
with many maybe measured in
thousands-- but many RISC

1250
00:55:46,890 --> 00:55:48,520
supermicroprocessors
that are going

1251
00:55:48,520 --> 00:55:50,270
to have some kind of
communication network

1252
00:55:50,270 --> 00:55:53,360
to allow them all to talk together
and a very much larger I/O

1253
00:55:53,360 --> 00:55:55,200
system.

1254
00:55:55,200 --> 00:55:57,450
Thank you very much.

1255
00:55:57,450 --> 00:55:58,710
Are there any questions?

1256
00:55:58,710 --> 00:56:00,350
Are there differences
in the type

1257
00:56:00,350 --> 00:56:03,230
of applications that are
suitable for the RISC

1258
00:56:03,230 --> 00:56:05,720
versus CISC processor?

1259
00:56:05,720 --> 00:56:08,000
And will that change over time?

1260
00:56:08,000 --> 00:56:11,733
The RISC machines seem to
have spread out pretty widely.

1261
00:56:11,733 --> 00:56:13,400
One thing you mentioned
in your question

1262
00:56:13,400 --> 00:56:15,650
is, well, the example of
transaction processing.

1263
00:56:15,650 --> 00:56:19,555
Well, one of the leaders in
transaction processing, Tandem,

1264
00:56:19,555 --> 00:56:20,930
has announced that
they are going

1265
00:56:20,930 --> 00:56:23,472
to be building a line of computers
based around the RISC machine

1266
00:56:23,472 --> 00:56:24,320
architectures.

1267
00:56:24,320 --> 00:56:26,680
It looks like, for
transaction processing, which

1268
00:56:26,680 --> 00:56:28,430
is something I'm
learning something about,

1269
00:56:28,430 --> 00:56:30,978
because my current research is
involved in higher performance

1270
00:56:30,978 --> 00:56:33,020
I/O and that's one of the
areas we're looking at,

1271
00:56:33,020 --> 00:56:37,880
is it's kind of like
operating systems.

1272
00:56:37,880 --> 00:56:39,380
It's the same types
of instructions.

1273
00:56:39,380 --> 00:56:42,050
So from an instruction
set perspective, higher

1274
00:56:42,050 --> 00:56:44,163
performance, lower cost
is very attractive.

1275
00:56:44,163 --> 00:56:45,830
There doesn't seem
to be anything there.

1276
00:56:45,830 --> 00:56:48,290
I think the one area
that people believe

1277
00:56:48,290 --> 00:56:51,410
that there's some interest is
in the traditional business data

1278
00:56:51,410 --> 00:56:57,650
processing dealing with
the BCD encoding of data.

1279
00:56:57,650 --> 00:57:00,830
That's an area where there's
still a lot of argument about

1280
00:57:00,830 --> 00:57:03,450
whether or not RISC machines
are the right thing to do.

1281
00:57:03,450 --> 00:57:06,030
I know that the Hewlett-Packard
Precision Architecture,

1282
00:57:06,030 --> 00:57:09,977
which was concerned about
this, put some support--

1283
00:57:09,977 --> 00:57:12,560
but it's a very modest amount
of support-- in for the business

1284
00:57:12,560 --> 00:57:14,890
data processing.

1285
00:57:14,890 --> 00:57:17,960
But I think it'll have
to-- that's probably

1286
00:57:17,960 --> 00:57:20,510
the one area that's a toss-up
that you can get good arguments

1287
00:57:20,510 --> 00:57:21,380
both ways.

1288
00:57:21,380 --> 00:57:24,440
The rest of the areas,
by examples of machines,

1289
00:57:24,440 --> 00:57:27,655
we seem to see people lining up.

1290
00:57:27,655 --> 00:57:29,780
The only other thing I'd
say from that perspective,

1291
00:57:29,780 --> 00:57:31,992
in terms of what
Wayne said once again,

1292
00:57:31,992 --> 00:57:33,950
in terms of a software
developer's perspective,

1293
00:57:33,950 --> 00:57:36,660
is just the number of
machines out there.

1294
00:57:36,660 --> 00:57:40,310
So even if it's a technically
sound instruction set,

1295
00:57:40,310 --> 00:57:42,080
the reluctance of
porting applications

1296
00:57:42,080 --> 00:57:45,270
is simply how many places
can they sell the program.

1297
00:57:45,270 --> 00:57:47,000
And so that's why
I think we'll see

1298
00:57:47,000 --> 00:57:52,940
the 8086 family live forever and
the IBM 360 live forever, too.

1299
00:57:52,940 --> 00:57:55,760
Are the SPARC machines
from the different vendors

1300
00:57:55,760 --> 00:57:58,490
going to execute
the same binaries,

1301
00:57:58,490 --> 00:58:02,690
despite the differences in
cache and register window size?

1302
00:58:02,690 --> 00:58:03,682
The answer is yes.

1303
00:58:03,682 --> 00:58:05,390
Now, the thing that
was interesting to me

1304
00:58:05,390 --> 00:58:07,723
that I really didn't know
about the binary compatibility

1305
00:58:07,723 --> 00:58:09,770
is it turns out, like
every single company,

1306
00:58:09,770 --> 00:58:12,410
even though they're
binary compatible,

1307
00:58:12,410 --> 00:58:13,910
the kernels of the
operating systems

1308
00:58:13,910 --> 00:58:15,452
are different on
all these companies.

1309
00:58:15,452 --> 00:58:16,580
I didn't know that.

1310
00:58:16,580 --> 00:58:18,455
And because the I/O's
are somewhat different,

1311
00:58:18,455 --> 00:58:20,247
when you start it, it's
somewhat different.

1312
00:58:20,247 --> 00:58:22,040
So even though the VAX
and the IBM families

1313
00:58:22,040 --> 00:58:24,498
are binary compatible, there's
some pieces of the operating

1314
00:58:24,498 --> 00:58:25,740
system that are different.

1315
00:58:25,740 --> 00:58:28,070
And that's true for the
RISC machines as well.

1316
00:58:28,070 --> 00:58:31,080
There's some details
of dealing with the I/O

1317
00:58:31,080 --> 00:58:32,805
system, or the
buses, or the caches,

1318
00:58:32,805 --> 00:58:34,430
because the caches
vary, where there'll

1319
00:58:34,430 --> 00:58:35,810
be a piece of the
operating system that'll

1320
00:58:35,810 --> 00:58:37,740
have to be different for
every single machine.

1321
00:58:37,740 --> 00:58:39,157
But that apparently
is what people

1322
00:58:39,157 --> 00:58:42,350
call binary compatibility or
user binary compatibility.

1323
00:58:42,350 --> 00:58:47,030
How do you see distributing
processing power through memory

1324
00:58:47,030 --> 00:58:50,780
as the memory chips get denser?

1325
00:58:50,780 --> 00:58:54,380
There will be a bottleneck
at the chip level.

1326
00:58:54,380 --> 00:58:57,290
I think, in this one area,
the supercomputer designs,

1327
00:58:57,290 --> 00:59:00,500
when you're talking about
lots of processors--

1328
00:59:00,500 --> 00:59:04,250
I think that's an area
where it's possible

1329
00:59:04,250 --> 00:59:06,260
that, if you're
talking about thousands

1330
00:59:06,260 --> 00:59:08,660
of processors and
thousands of DRAMs

1331
00:59:08,660 --> 00:59:10,250
and it's enough of
a volume, you might

1332
00:59:10,250 --> 00:59:13,160
see some real radical change
somewhere down the line

1333
00:59:13,160 --> 00:59:15,440
where the processors are
placed in the DRAM chips.

1334
00:59:15,440 --> 00:59:17,360
That might be a much
more economical design.

1335
00:59:17,360 --> 00:59:20,750


1336
00:59:20,750 --> 00:59:22,220
Right now, that'd
be very dangerous

1337
00:59:22,220 --> 00:59:24,620
because you're fighting
against two strong forces--

1338
00:59:24,620 --> 00:59:28,190
a mass produced microprocessor
and mass produced memory.

1339
00:59:28,190 --> 00:59:30,530
And that has strong
economic advantages.

1340
00:59:30,530 --> 00:59:32,960
And if you can overcome the
potential memory bottlenecks

1341
00:59:32,960 --> 00:59:35,760
with your design, you have
tremendous economies of scale.

1342
00:59:35,760 --> 00:59:39,140
If, on the other hand, you're
designing your own custom RAM

1343
00:59:39,140 --> 00:59:41,090
with your own custom
memory, you're

1344
00:59:41,090 --> 00:59:44,720
in jeopardy of being left way
behind the technology as well

1345
00:59:44,720 --> 00:59:45,980
as the cost performance.

1346
00:59:45,980 --> 00:59:47,540
But maybe down the
line a little bit,

1347
00:59:47,540 --> 00:59:51,050
if you could cooperate with
some semiconductor manufacturer

1348
00:59:51,050 --> 00:59:53,745
in making DRAM, you're able
to put your microprocessor

1349
00:59:53,745 --> 00:59:56,120
on there where you have so
many thousands of transistors.

1350
00:59:56,120 --> 00:59:59,930
There might be just this huge
switch in these high-ends.

1351
00:59:59,930 --> 01:00:01,430
I think, in the low
ends, I wouldn't

1352
01:00:01,430 --> 01:00:02,930
expect much of a
change, at least

1353
01:00:02,930 --> 01:00:05,450
in the time frame on that slide.

1354
01:00:05,450 --> 01:00:09,990
Most RISC machines today run
the Unix operating system.

1355
01:00:09,990 --> 01:00:13,250
Is there anything inherent
in SPARC that would make it

1356
01:00:13,250 --> 01:00:16,430
unsuitable for other
operating systems?

1357
01:00:16,430 --> 01:00:18,488
I don't know of
anything that we did

1358
01:00:18,488 --> 01:00:20,780
in the instruction set
design that was particularly

1359
01:00:20,780 --> 01:00:22,970
for Unix.

1360
01:00:22,970 --> 01:00:25,190
Unix enabled the RISC
machines because it

1361
01:00:25,190 --> 01:00:27,110
was the first portable
operating system.

1362
01:00:27,110 --> 01:00:30,500
It was all written in C. VMS,
in contrast, for example,

1363
01:00:30,500 --> 01:00:34,170
or DOS, or OS/2, are all
written in assembly language.

1364
01:00:34,170 --> 01:00:36,350
So they were an obstacle
to RISC machines

1365
01:00:36,350 --> 01:00:37,730
because they couldn't be ported.

1366
01:00:37,730 --> 01:00:41,570
Unix was portable, which freed
instruction set designers

1367
01:00:41,570 --> 01:00:43,890
from having to invent
new instruction sets.

1368
01:00:43,890 --> 01:00:45,620
So the first person--

1369
01:00:45,620 --> 01:00:47,390
once that was possible,
we were allowed

1370
01:00:47,390 --> 01:00:50,340
to change instruction sets
and make some economic sense.

1371
01:00:50,340 --> 01:00:53,240
So I think any operating
system written in a high-level

1372
01:00:53,240 --> 01:00:57,260
language would work on
any of these RISC processors.

1373
01:00:57,260 --> 01:01:00,310
[MUSIC PLAYING]

1374
01:01:00,310 --> 01:02:06,000