Hello, I'm John Crawford, here today to give you an overview of Intel's Pentium microprocessor. The Pentium processor is the fifth generation of Intel's line of PC-compatible microprocessors. I co-managed the chip design of the Pentium processor, starting back in 1989.

I will start out with a historical perspective of Intel's microprocessors from the 386 up through the Pentium processor, and back that up with the underlying historical trends in the semiconductor technology from 1970 through the present. Then I will spend a few minutes describing a few key aspects of the design methodology we used. The bulk of this video is devoted to describing the microarchitecture of the Pentium processor and how the different hardware techniques we chose to include contribute to its performance. After working through this large amount of material, I'll briefly describe the data integrity, or error checking, features we included.
In the last section, we'll describe the compiler technology co-developed with the Pentium processor, and the results delivered on industry-standard SPEC benchmarks that we achieved with the combination of the hardware techniques and compiler optimizations.

Let me start with a brief overview of Intel's microprocessor strategy. We've kept three main goals in mind. First of all, to maintain compatibility -- that is, each generation must be compatible with the last in order to carry forward a very large software base and the momentum that comes with that in the market. A second key aspect of the strategy is to maintain a very aggressive performance ramp of doubling the performance every 18 months. The third aspect of the strategy is to continuously add new functions to enter new markets. We've done this in two directions. First of all, we've added functions to move upward from the PC market into the workstation, minicomputer, and mainframe marketplaces, also known as the server market.
In the other direction, we've added features to move downward into the laptop, notebook, and handheld, generally known as the mobile, marketplace.

So given that this is the overall strategy, let's delve into it in more detail. First of all, let's look at the performance growth from the 386 up through the Pentium processor. Spanning a period from 1986 up until the 1993 timeframe, I've highlighted three different generations of processors. You can see four points on the 386 line where we offered different frequencies of that product. You can again see four points on the 486 product line, and the first point on the Pentium processor line. Now, each of these is plotted in terms of the integer SPEC performance on the y-axis against general system availability date on the x-axis. It's important to note that this is the date when systems are available from our customers -- that is, after they've announced the products and started shipping to their customers through their channels. So you can see here a very nice performance growth rate.
And I've had a spreadsheet program plot the regression line on a semi-log chart, where it computes a compound annual growth rate somewhere close to 1.6 every year, which corresponds to a doubling of performance about every 18 months.

That's a very interesting growth rate. I thought I'd try to relate that growth rate to some real-world examples. We get a little blase in the computer industry about the performance going up and up every year, whereas if the same growth rate occurred in other areas it would be truly astounding. For example, given the same growth rate of doubling every 18 months over a decade, you come up with approximately a 200-fold increase. If the same improvement were applied to automobiles, we'd be traveling around in our cars at about 11,000 miles an hour, or potentially getting about 4,000 miles to the gallon. A point that's maybe a little easier to visualize or comprehend: a flight from Los Angeles to New York would take 90 seconds rather than the five hours or so that it takes currently.
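The growth-rate arithmetic above is easy to check. A quick sketch follows; note that the baselines (55 mph, 20 mpg, a five-hour flight, 35 bushels per acre) are assumptions back-solved from the quoted results, not figures stated in the talk.

```python
# Annual growth factor implied by doubling performance every 18 months.
cagr = 2 ** (12 / 18)
print(round(cagr, 2))        # 1.59 -- the "close to 1.6" figure

# Real-world analogies using the transcript's roughly 200-fold decade factor.
factor = 200
print(55 * factor)           # 11,000 mph
print(20 * factor)           # 4,000 miles per gallon
print(5 * 3600 // factor)    # a 90-second Los Angeles-to-New York flight
print(35 * factor)           # 7,000 bushels per acre
```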
A different way of looking at it might be in the agricultural field, where wheat yields would have gone from about 35 bushels per acre up to about 7,000 bushels an acre. So this is really quite a remarkable growth rate.

Now, where did this performance come from? I believe that we've really developed the performance along three dimensions. Let me go into each of these in a little more detail. Along one dimension, we have silicon technology. And here with the Pentium processor, we've applied 0.8-micron BiCMOS technology as our latest and greatest process. A second dimension is architecture technology, in terms of structuring the computers to extract more parallelism by executing more and more instructions in parallel, using techniques such as superscalar execution, branch prediction, and some more straightforward aspects such as just larger caches on board and wider buses. A third dimension is software technology. And here, improved compilers are key to providing improved performance by maximizing the parallelism that's made available to the hardware.
Let's take a closer look at the semiconductor technology dimension. It really is a key aspect that's driving the performance forward. And here, we have technology scaling as the main driver, and there are really two aspects of this. One is that the die size grows: as we improve our manufacturing processes and are able to process larger wafers, we can economically produce larger and larger die sizes. The second aspect is that circuit dimensions shrink, so that each generation we have smaller devices, and we're able to print thinner wires and more compact transistors. Fortunately, the smaller transistors run faster, and the extra devices also provide the raw material for integrating performance structures such as caches, extra execution pipelines, and in general more parallel performance structures. A third benefit is that new capabilities can also be added using the extra devices that the technology makes available to us.
I have here a few charts showing our technology growth from 1970 through the present, where I've charted certain key aspects of the semiconductor technology used in our microprocessors, from the 4004 in 1971 up through the Pentium processor in 1993. Here, charting the die size, you can see a nice compound annual growth rate, with a regression line that I've plotted only against the lead CPUs. Coming way below the line, at much smaller die sizes, you can see what we call our compaction products, such as the 386 compaction and the 486 compaction. These designs have taken a processor such as the 386, produced on one technology, and compacted it to the next generation, so we get a much smaller die to take into volume manufacturing. If you rule these out and plot just the lead CPUs, we get a very nice compound annual growth rate in the die size that's fairly constant over time.

This chart shows the shrinking of the transistor dimensions. And here, I have a semi-log plot of the square microns per transistor on the y-axis, turning down over time very nicely.
The larger die sizes, compounded with the smaller dimensions for each transistor, combine to produce a very nice growth rate in the transistors per microprocessor. And here, you can see a very nice compound annual growth rate of the number of transistors per microprocessor. And here again, we can see the compaction devices a little below the line, for the same reasons discussed earlier.

Again, let's come back to reality and try to relate these growth curves to some real-world examples. The transistor count per die is doubling every 24 months and has been doing so since 1970. That gives us about a 3,000-fold increase over that timeframe. Coming back to the real-world examples, perhaps the best analogy is the one of the wheat yields. Again, back in 1970, the typical yield was about 33 bushels per acre. If we had a 3,000-fold increase, that would give us about 100,000 bushels per acre, and that would be about three feet of wheat piled up on that acre.
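The 3,000-fold figure follows directly from the doubling period; a quick arithmetic check, assuming the 1970-to-1993 span described above:

```python
# Transistor count doubles every 24 months; compound that over 1970-1993.
years = 1993 - 1970
factor = 2 ** (years / 2)
print(round(factor))   # roughly 2,900 -- the "about a 3,000-fold increase"
```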
Perhaps the only real-world example that's come even close to this growth rate is the growth in our national debt, but I haven't had the courage to work out the figures on that one.

Let me shift gears now back to the present and bring you to the Pentium processor. First of all, I'd like to cover two key aspects of the design methodology used in developing this processor, of the many that were involved. One is that we co-developed software with the hardware so that they would complement each other to deliver maximum performance. In order to do this, we had a big focus on compiler technology. We staffed and funded a very professional compiler team internally so that the compiler developers could work hand in hand with the hardware developers. We also wanted to ensure that this compiler technology was propagated out to our large community of applications developers, so it was very important to make this technology broadly available and not hold it close to our vest. Consequently, we had a big focus on working with external compiler vendors to deliver this technology out through their channels.
Later on, we'll hear from the manager of our internal compiler effort, Beatrice Fu, who will describe this compiler technology and the results that we achieved. We also worked closely with key operating system vendors to make sure that the system aspects of the processor would result in good delivered performance from these operating systems.

A second aspect of the design methodology was quantifying the design decisions. The focus here was to quantify decisions on the architecture features to include -- that is, what new features to add to the instruction set -- and to quantify decisions on the microarchitecture -- that is, the internal structure of the processor -- and on the compiler techniques, the internal structure of the compiler, if you will. We wanted to base decisions in these areas on as much quantitative data as possible. One technique we used was system measurement, where we could measure programs executing on 386 and 486 systems and project forward to the Pentium processor from those system measurements.
At a more detailed level, we developed a very flexible and detailed performance simulator that would very accurately estimate the performance of different software running on the Pentium processor long before it was realized in silicon. In order to drive the simulator, we collected a number of traces across a broad spectrum of application areas and used these traces to drive the performance simulator. The results of this trace-driven simulation gave us quantitative feedback on the value of the different aspects of the hardware and software.

We even carried this quantitative computer architecture into the hardware itself by providing a number of facilities within the processor to support measurement of running systems, so that we have a very nice event monitoring facility that can count both discrete events, such as cache misses, as well as duration events, such as bus stalls. Later on, we'll see how we used these performance monitors to measure the effectiveness of some key microarchitecture techniques included in the Pentium processor, and how they performed when executing the SPEC benchmark suite.
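Intel's actual simulator and traces were internal tools and far more detailed than anything shown here, but purely as an illustration of the trace-driven idea, a toy version might look like this (the cache parameters and structure are hypothetical):

```python
from collections import Counter

def simulate(trace, lines=256, line_size=32):
    """Feed an address trace through a toy direct-mapped cache,
    counting hit/miss events the way an event monitor would."""
    tags = [None] * lines            # one stored tag per cache line
    events = Counter()
    for addr in trace:
        index = (addr // line_size) % lines
        tag = addr // (line_size * lines)
        if tags[index] == tag:
            events["hit"] += 1
        else:
            events["miss"] += 1
            tags[index] = tag        # fill the line on a miss
    return events

# Accesses within an already-fetched line hit; new lines miss.
counts = simulate([0, 4, 8, 32, 8192])
print(dict(counts))   # {'miss': 3, 'hit': 2}
```

Running many such traces through a model and comparing the event counts is the quantitative feedback loop the talk describes.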
But first, let's hear from Dan Alpert, chief architect of the Pentium processor, who will describe key features of its microarchitecture.

Thanks, John. When we started the design of the Pentium processor, we knew there'd be a number of very important challenges in trying to stay on the performance treadmill while maintaining compatibility with the previous generations of Intel microprocessors. What I'll be describing here is a number of the microarchitecture techniques that we developed in designing the Pentium processor. The topics I will cover are the bus interface, superscalar integer pipelines, cache organization, branch prediction, and the pipelined floating-point unit.

Let's start with the bus interface. We needed a high-performance bus to satisfy the Pentium processor's demand for instructions and data from external cache and memory. The data bus is 64 bits wide and operates at the full 66-megahertz speed of the processor core. In contrast, the bus of our previous-generation processor, the Intel 486 model DX2, has a 32-bit data path and runs at 33 megahertz.
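The bus comparison reduces to simple arithmetic; a minimal sketch, taking 1 MB as 10^6 bytes, as bandwidth figures usually do:

```python
pentium_bus = 8 * 66_000_000   # 64-bit (8-byte) bus at 66 MHz, bytes/second
dx2_bus     = 4 * 33_000_000   # 486 DX2: 32-bit (4-byte) bus at 33 MHz
print(pentium_bus / 1e6)       # 528.0 MB/s -- over half a gigabyte per second
print(pentium_bus / dx2_bus)   # 4.0 -- four times the 486 DX2's bandwidth
```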
So the Pentium processor's bus has four times the bandwidth, with a peak rate of over half a gigabyte per second.

Now moving on to the integer execution units, we started with a five-stage pipeline based on that of the Intel 486 processor. The first stage reads instructions from the cache into a prefetch buffer and aligns them for decoding. The first decode stage generates a control word for execution by the pipeline. The second decode stage decodes the control word and generates addresses for memory references. The E stage is used either to calculate a result in the ALU or to access data in the cache. The final stage is used to write results back to the register file.

In the Pentium processor, we have improved performance by decoding two instructions in parallel and replicating the address generation hardware and ALU. You can think of this as being similar to placing two of the 486's integer pipelines on the same chip. We call the two pipelines U and V. One final point is that complex instructions, such as string operations, are executed by generating a sequence of microcode words in the D1 stage.
The microcode is written to take advantage of the hardware that's available in both pipelines.

Let's look in more detail at the instruction-decode stage. Logic in the D1 stage is used to check for dependencies between instructions, and we only issue instructions in parallel to the two pipelines if they are independent. The types of dependency that we check for are resource, control, and data. First of all, we check that both of the instructions are from a class we call -- and this is in quotes -- "simple." The so-called simple instructions include ALU operations, jumps, loads, and stores. Our definition of simple even includes instructions that perform operations from memory to registers. We find that about 90% to 95% of instructions executed are simple. By issuing only a subset of the instructions in parallel, we are able to handle resource dependencies. For example, there's only a single shifter in the U pipe, so all of the shift instructions are issued to the U pipe. Control dependencies occur when the result of one instruction determines whether another instruction will be executed.
We handle this by checking whether the first instruction is a jump instruction. If it is, then we don't issue an instruction in parallel. Data dependencies occur when the result of one instruction is used or modified by another instruction. We handle this by checking that the destination register of the first instruction is neither the source nor the destination register of the second instruction.

We have included logic to improve performance by handling some common cases of dependencies, thereby allowing more instructions to be executed in parallel. First of all are the flags. In the Intel architecture, most of the instructions modify the flags, so that without special handling it would be difficult to pair many instructions. So we do handle the case of parallel instructions that modify the flags, and we update the flags the same as if the instructions were executed sequentially. We also have special logic that handles the case of a conditional branch instruction executed in parallel with an instruction that sets the flags. This occurs quite commonly, especially for compare-branch combinations.
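The basic pairing test just described can be sketched in a few lines. This is an illustrative model only, not the real D1-stage logic: the instruction representation, the membership of the "simple" class, and the omission of the flag and stack-pointer special cases are all simplifying assumptions.

```python
SIMPLE = {"alu", "shift", "jump", "load", "store"}  # assumed stand-in classes

def can_pair(u, v):
    """Return True if instruction v may issue to the V pipe alongside u."""
    # Resource dependency: both must be "simple", and since only the
    # U pipe has a shifter, shifts never go down the V pipe.
    if u["op"] not in SIMPLE or v["op"] not in SIMPLE or v["op"] == "shift":
        return False
    # Control dependency: nothing issues in parallel with a jump.
    if u["op"] == "jump":
        return False
    # Data dependency: u's destination may be neither the source
    # nor the destination of v.
    return u["dest"] not in (v["src"], v["dest"])

add = {"op": "alu",  "dest": "eax", "src": "ebx"}
mov = {"op": "load", "dest": "ecx", "src": "edx"}
dep = {"op": "alu",  "dest": "esi", "src": "eax"}  # reads add's result
print(can_pair(add, mov))  # True: independent simple instructions pair
print(can_pair(add, dep))  # False: a data dependency forces serial issue
```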
A second special case was the logic on the stack pointer that allows pushes and pops to be executed in parallel. For example, when passing parameters for a procedure call, there will often be a sequence of pushes. The logic adjusts the stack pointer appropriately for parallel execution. One final point about the decoder is that we prefetch a complete cache line of 256 bits to keep the decoder busy.

So the question comes up: just how effective is the instruction pairing? Here's a chart showing how the pipelines are utilized on the SPECint92, or integer SPEC, benchmark suite. What I've plotted here, for each of the SPEC benchmarks, is the percentage of instructions that go in the U pipe, which is the primary pipeline, and then the percentage of instructions that go through the V pipe, which is the secondary pipeline. You can see that the pairing ranges roughly around 30% to 40%, with one benchmark [INAUDIBLE] hitting above 40% of the instructions going into the V pipe. You can see on this particular set of applications that the pairing is quite effective.
I would like to point out again that these measurements are taken from an actual running system, using the performance monitoring counters that John described earlier.

At this point, let's take a look at the cache organization. We have separate code and data caches. This was done to eliminate the conflicts between prefetch accesses and data accesses. This was important because we knew the branch predictor was going to drive a lot more references into the code cache, and because the superscalar pipelines would generate two data references in parallel. A consequence of splitting the caches is that we have to add extra logic to handle self-modifying code compatibly. This affected a couple of key areas. One is that each cache has to snoop the other cache's misses so there would never be inconsistent data in the two caches. It turns out that the logic that handles consistency between external memory and the on-chip caches was able to handle consistency between the on-chip caches as well.
The other aspect is that we had to snoop the prefetch buffers also, so that if, for example, we wrote to an area of memory that happened to be in a prefetch buffer, we would need to invalidate the prefetch.

Now we can look at some of the vital statistics of the caches. Each of them is 8K bytes in size. They both use 32-byte lines and are two-way associative. The data cache uses a write-back protocol to minimize the bus traffic, and we use the MESI -- that's M, E, S, I -- four-state protocol to keep the data cache consistent with the code cache as well as the rest of the system, including multilevel caches off-chip and the memory. Once we separated the code and data caches, it became effective to separate the TLBs as well. The TLBs support two page sizes, a 4-kilobyte page and a 4-megabyte page. The larger page is useful for mapping large, resident data structures with a single TLB entry. For example, it can be used to map a graphics frame buffer or the memory-resident portions, either code or data, of the operating system. The data TLB has 64 entries for 4K-byte pages and 8 entries for 4-megabyte pages.
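These numbers pin down the cache geometry: 8K bytes with 32-byte lines and two ways gives 128 sets, and the TLB entry counts determine how much memory each TLB can map at once. A quick sketch of that arithmetic (the tag/index/offset split is the standard consequence of the stated geometry, not something the talk spells out):

```python
size, line, ways = 8 * 1024, 32, 2
sets = size // (line * ways)
print(sets)                       # 128 sets per cache

def decode(addr, offset_bits=5, index_bits=7):
    """Split an address into (tag, set index, byte offset) for this geometry."""
    offset = addr & (line - 1)             # low 5 bits: byte within the line
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

print(decode(0x12345))            # (18, 26, 5)

# TLB reach: memory mapped without a TLB miss.
print(64 * 4 * 1024)              # data TLB, 4K-byte pages: 256 KB
print(8 * 4 * 1024 * 1024)        # data TLB, 4-megabyte pages: 32 MB
```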
427 00:18:46,740 --> 00:18:50,670 The code TLB has 32 entries for 4K byte pages. 428 00:18:50,670 --> 00:18:52,260 The effectiveness of the caches is 429 00:18:52,260 --> 00:18:55,440 shown on the following charts for the SPEC benchmark suite. 430 00:18:55,440 --> 00:18:58,110 Please remember that this data is taken from an actual running 431 00:18:58,110 --> 00:19:00,720 system using the performance monitoring hardware that we 432 00:19:00,720 --> 00:19:02,400 included on-chip. 433 00:19:02,400 --> 00:19:05,490 The first chart shows the hit rate for instructions. 434 00:19:05,490 --> 00:19:09,522 It varies from about 90% to a little over 95%. 435 00:19:09,522 --> 00:19:10,980 What we are showing is the hit rate 436 00:19:10,980 --> 00:19:13,260 for prefetches of 32-byte lines coming out 437 00:19:13,260 --> 00:19:15,090 of the instruction cache. 438 00:19:15,090 --> 00:19:17,430 This is often reported in terms of hits per instruction 439 00:19:17,430 --> 00:19:18,432 instead. 440 00:19:18,432 --> 00:19:19,890 And if we looked at it in this way, 441 00:19:19,890 --> 00:19:22,470 the rate would have been even higher. 442 00:19:22,470 --> 00:19:24,180 The next chart shows the data cache 443 00:19:24,180 --> 00:19:25,995 hit rate based on the hits per read 444 00:19:25,995 --> 00:19:27,990 or write reference to memory. 445 00:19:27,990 --> 00:19:29,910 You can see that here, the data cache 446 00:19:29,910 --> 00:19:34,488 hit rate varies from just below 90% to about 95%. 447 00:19:34,488 --> 00:19:36,780 One of the most interesting aspects of the data cache's 448 00:19:36,780 --> 00:19:40,890 design is its support for dual accesses by the two pipelines. 449 00:19:40,890 --> 00:19:43,410 The Intel architecture has a limited number of registers. 
450 00:19:43,410 --> 00:19:46,260 There are only eight, and that results in more data memory 451 00:19:46,260 --> 00:19:48,840 references per instruction than we would see in architectures 452 00:19:48,840 --> 00:19:50,460 that have more registers. 453 00:19:50,460 --> 00:19:53,100 For example, for optimized 32-bit code, 454 00:19:53,100 --> 00:19:56,610 we see about 0.6 data references per instruction. 455 00:19:56,610 --> 00:19:58,920 In contrast, many of the RISC architectures 456 00:19:58,920 --> 00:20:02,220 that have 32 registers would see only about half that number 457 00:20:02,220 --> 00:20:04,620 of memory references per instruction. 458 00:20:04,620 --> 00:20:06,990 Now, even a small cache can capture the locality 459 00:20:06,990 --> 00:20:08,820 of the additional references. 460 00:20:08,820 --> 00:20:11,370 We see that the compiler for the RISC processors 461 00:20:11,370 --> 00:20:15,150 can capture all those references in the 32-register file. 462 00:20:15,150 --> 00:20:16,620 But it does lead to a requirement 463 00:20:16,620 --> 00:20:19,950 for additional bandwidth in the Pentium processor. 464 00:20:19,950 --> 00:20:23,820 We have implemented the cache using dual-ported tags and TLB 465 00:20:23,820 --> 00:20:26,940 with a single-ported interleaved data array. 466 00:20:26,940 --> 00:20:29,580 Making the data array, which is the bulk of the storage 467 00:20:29,580 --> 00:20:31,980 in the cache, single ported turned out 468 00:20:31,980 --> 00:20:34,380 to be the most efficient use of area. 469 00:20:34,380 --> 00:20:36,090 If we had also made it dual ported, 470 00:20:36,090 --> 00:20:37,590 we would have had a smaller cache 471 00:20:37,590 --> 00:20:39,840 and consequently a higher miss rate. 472 00:20:39,840 --> 00:20:41,880 The effect of increased misses was greater 473 00:20:41,880 --> 00:20:44,050 than the impact of bank conflicts, 474 00:20:44,050 --> 00:20:46,350 so we went with the interleaved approach. 
475 00:20:46,350 --> 00:20:48,840 There's logic to detect when parallel references go 476 00:20:48,840 --> 00:20:50,310 to the same bank. 477 00:20:50,310 --> 00:20:53,160 When they do, the U pipe reference is completed 478 00:20:53,160 --> 00:20:56,490 and the V pipe is forced to stall for one clock. 479 00:20:56,490 --> 00:20:58,350 The same logic also handles the case 480 00:20:58,350 --> 00:21:01,110 of data dependencies between parallel references 481 00:21:01,110 --> 00:21:03,210 because, of course, if two references are 482 00:21:03,210 --> 00:21:05,820 to the same location in memory, they will necessarily 483 00:21:05,820 --> 00:21:07,290 be to the same bank. 484 00:21:07,290 --> 00:21:09,360 One final point about the data cache 485 00:21:09,360 --> 00:21:12,330 is that the two 32-bit paths are combined together 486 00:21:12,330 --> 00:21:15,450 to form a 64-bit path for double-precision floating point 487 00:21:15,450 --> 00:21:16,927 operands. 488 00:21:16,927 --> 00:21:18,510 The next chart shows some measurements 489 00:21:18,510 --> 00:21:21,120 taken from the SPEC integer benchmarks suite, 490 00:21:21,120 --> 00:21:23,490 demonstrating the effectiveness of dual memory reference 491 00:21:23,490 --> 00:21:24,720 support. 492 00:21:24,720 --> 00:21:27,090 The chart shows the frequency of references 493 00:21:27,090 --> 00:21:28,980 that are actually paired. 494 00:21:28,980 --> 00:21:32,700 You can see here that except for [INAUDIBLE], about 20% to 25% 495 00:21:32,700 --> 00:21:35,580 of the memory references are V pipe references paired 496 00:21:35,580 --> 00:21:39,810 with U pipe references and the rate of bank conflicts is low. 497 00:21:39,810 --> 00:21:42,780 [INAUDIBLE] shows 40% of references executed in parallel 498 00:21:42,780 --> 00:21:45,540 in the V pipe, but about one out of eight references 499 00:21:45,540 --> 00:21:47,610 is a conflict. 
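The bank-selection and conflict check described above can be sketched as a small simulation. This is a hedged illustration only: the bank count, bank width, and address slicing below are assumptions chosen for clarity, not the chip's documented parameters.

```python
# Illustrative model of dual access to an interleaved, single-ported
# data array. Bank parameters are assumptions, not Pentium specifics.

NUM_BANKS = 8     # assumed: a 32-byte line split into 8 x 4-byte banks
BANK_WIDTH = 4    # bytes per bank (assumed)

def bank_of(addr: int) -> int:
    """Select a bank from the low-order address bits."""
    return (addr // BANK_WIDTH) % NUM_BANKS

def dual_access(u_addr: int, v_addr: int) -> tuple:
    """Return (u_completes, v_stalls) for a pair of parallel references.

    The U pipe reference always completes; the V pipe reference
    stalls for one clock when both map to the same bank.
    """
    conflict = bank_of(u_addr) == bank_of(v_addr)
    return True, conflict
```

Note that two references to the same memory location necessarily select the same bank, so this one check also catches the data-dependency case mentioned in the talk.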
500 00:21:47,610 --> 00:21:49,770 One of the most important performance enhancements 501 00:21:49,770 --> 00:21:53,070 of the Pentium processor is dynamic branch prediction. 502 00:21:53,070 --> 00:21:55,890 The Intel 486 processor flushes its pipeline 503 00:21:55,890 --> 00:21:59,730 for every taken branch at a cost of 2 clocks delay. 504 00:21:59,730 --> 00:22:03,030 Taken branches amount to 15% to 20% of instructions 505 00:22:03,030 --> 00:22:05,190 executed, representing an opportunity 506 00:22:05,190 --> 00:22:06,900 for substantial improvement. 507 00:22:06,900 --> 00:22:09,240 In the Pentium processor, we cache information 508 00:22:09,240 --> 00:22:11,250 about previous branches and use this 509 00:22:11,250 --> 00:22:13,290 to predict future branches. 510 00:22:13,290 --> 00:22:16,152 The prediction is made in an early stage of the pipeline, 511 00:22:16,152 --> 00:22:17,610 and when the prediction is correct, 512 00:22:17,610 --> 00:22:20,510 we can execute branches with no delay. 513 00:22:20,510 --> 00:22:22,260 All of the branch predictions are verified 514 00:22:22,260 --> 00:22:23,790 at the end of the pipeline. 515 00:22:23,790 --> 00:22:26,160 If the prediction was incorrect, the pipelines 516 00:22:26,160 --> 00:22:28,872 are flushed at a cost of 3 or 4 clocks. 517 00:22:28,872 --> 00:22:30,330 The techniques are similar to those 518 00:22:30,330 --> 00:22:33,060 used in mainframe architectures with adaptation 519 00:22:33,060 --> 00:22:36,470 to the Intel architecture and to superscalar execution. 520 00:22:36,470 --> 00:22:38,970 The memory structures that are used in the branch prediction 521 00:22:38,970 --> 00:22:42,960 hardware make for very efficient VLSI implementation. 522 00:22:42,960 --> 00:22:44,790 The heart of the branch prediction hardware 523 00:22:44,790 --> 00:22:49,530 is an associative memory called a branch target buffer, or BTB. 524 00:22:49,530 --> 00:22:53,820 The BTB has 256 entries, and it is 4-way associative. 
525 00:22:53,820 --> 00:22:56,640 The tag used to access the BTB is the address 526 00:22:56,640 --> 00:22:58,470 of the branch instruction. 527 00:22:58,470 --> 00:23:01,590 The data portion of the BTB includes the branch destination 528 00:23:01,590 --> 00:23:04,740 address and 2 bits of history used to predict whether or not 529 00:23:04,740 --> 00:23:06,600 the branch will be taken. 530 00:23:06,600 --> 00:23:08,580 Using 2 bits of history helps in cases 531 00:23:08,580 --> 00:23:11,160 where a branch is consistently taken for a while then 532 00:23:11,160 --> 00:23:12,870 occasionally not taken. 533 00:23:12,870 --> 00:23:15,000 Only when the prediction is wrong twice in a row 534 00:23:15,000 --> 00:23:16,427 will the prediction be changed. 535 00:23:16,427 --> 00:23:18,510 We have some measurements of the branch prediction 536 00:23:18,510 --> 00:23:21,090 accuracy taken with the built-in performance monitoring 537 00:23:21,090 --> 00:23:22,170 hardware. 538 00:23:22,170 --> 00:23:24,330 As you can see, the branch prediction accuracy 539 00:23:24,330 --> 00:23:28,380 is generally 80% to 85% with the exception of the gcc 540 00:23:28,380 --> 00:23:31,350 benchmark, which is only 73%. 541 00:23:31,350 --> 00:23:34,440 The transistor budget available in the Pentium processor 542 00:23:34,440 --> 00:23:35,910 allowed for dramatic improvements 543 00:23:35,910 --> 00:23:37,950 in the floating point performance. 544 00:23:37,950 --> 00:23:41,000 In the Intel 486 microprocessor, the floating point unit 545 00:23:41,000 --> 00:23:44,330 wasn't pipelined and it typically took 10 to 14 clocks 546 00:23:44,330 --> 00:23:46,160 for most operations. 547 00:23:46,160 --> 00:23:49,430 The faster hardware algorithms in the Pentium processor 548 00:23:49,430 --> 00:23:51,530 allow the operations to be executed 549 00:23:51,530 --> 00:23:53,990 a factor of 3 or more faster. 
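The 2-bit history scheme just described can be sketched as a small simulation. The saturating-counter encoding below is one common convention for 2-bit predictors, not necessarily the chip's exact state assignment.

```python
# 2-bit saturating counter, as kept in a BTB entry alongside the
# branch destination address. States 0..3; predict taken when >= 2.
# A single surprise does not flip the prediction; two in a row do.

def predict(counter: int) -> bool:
    """Predict taken for the two 'strong/weak taken' states."""
    return counter >= 2

def update(counter: int, taken: bool) -> int:
    """Nudge the counter toward the actual outcome, saturating at 0 and 3."""
    if taken:
        return min(counter + 1, 3)
    return max(counter - 1, 0)
```

For a loop branch that is taken many times and then falls through once, the counter sits at 3, drops to 2 on the single not-taken outcome, and still predicts taken on the next trip around the loop, which is exactly the behavior the talk motivates.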
550 00:23:53,990 --> 00:23:55,940 The execution units are also pipelined 551 00:23:55,940 --> 00:23:59,600 to deliver one floating point result per clock. 552 00:23:59,600 --> 00:24:01,880 The floating point pipeline was closely integrated 553 00:24:01,880 --> 00:24:03,830 with the integer pipeline. 554 00:24:03,830 --> 00:24:05,870 This was important in optimizing memory-to-register 555 00:24:05,870 --> 00:24:08,990 operations to deliver one 64-bit memory 556 00:24:08,990 --> 00:24:11,570 reference per clock in parallel with one floating 557 00:24:11,570 --> 00:24:13,040 point operation. 558 00:24:13,040 --> 00:24:14,630 The floating point execution units 559 00:24:14,630 --> 00:24:19,040 support the 80-bit IEEE extended format with short latencies. 560 00:24:19,040 --> 00:24:20,900 The multiplier and adder have latencies 561 00:24:20,900 --> 00:24:23,360 of 3 clocks for all formats. 562 00:24:23,360 --> 00:24:27,440 The divider has a latency of 18 to 38 clocks for precisions 563 00:24:27,440 --> 00:24:29,450 from single to extended. 564 00:24:29,450 --> 00:24:31,430 Along with the hardware to improve performance 565 00:24:31,430 --> 00:24:33,450 of the basic arithmetic functions, 566 00:24:33,450 --> 00:24:36,290 we also reimplemented the transcendental instructions-- 567 00:24:36,290 --> 00:24:39,170 sine, cosine, logarithm, and exponential-- 568 00:24:39,170 --> 00:24:41,090 that are part of our instruction set. 569 00:24:41,090 --> 00:24:43,610 The new algorithms use polynomial approximations 570 00:24:43,610 --> 00:24:46,580 to take advantage of the fast multiplier and adder. 571 00:24:46,580 --> 00:24:49,640 This provides improvements both in performance and accuracy 572 00:24:49,640 --> 00:24:52,260 over the Intel 486 microprocessor. 
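The general technique of approximating a transcendental with a polynomial, so that only the fast multiplier and adder are exercised, can be sketched as follows. This uses a plain Taylor polynomial evaluated with Horner's rule; the actual hardware algorithms and coefficients are not described in the talk, so everything below is an illustrative assumption.

```python
# Hedged sketch: sine by polynomial approximation. A degree-11
# Taylor polynomial in x, evaluated with Horner's rule in x^2, so the
# whole computation reduces to a chain of multiplies and adds.
import math

# Taylor coefficients of sin(x): x - x^3/3! + x^5/5! - ...
_COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(6)]

def poly_sin(x: float) -> float:
    """Approximate sin(x) for a range-reduced argument, |x| <= pi/2."""
    x2 = x * x
    acc = 0.0
    for c in reversed(_COEFFS):   # Horner evaluation in powers of x^2
        acc = acc * x2 + c
    return acc * x                # restore the odd factor of x
```

A real implementation would first range-reduce the argument and would use minimax rather than Taylor coefficients for better worst-case error, but the structure of the computation is the same: pure multiply-add work that a fast multiplier and adder execute quickly.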
573 00:24:52,260 --> 00:24:54,350 A key aspect of the floating point performance 574 00:24:54,350 --> 00:24:56,660 was co-developing the compiler with the hardware 575 00:24:56,660 --> 00:24:59,150 in order to ensure that its pipeline structure would 576 00:24:59,150 --> 00:25:01,070 be used very effectively. 577 00:25:01,070 --> 00:25:03,192 So that completes my portion of the presentation. 578 00:25:03,192 --> 00:25:05,150 I hope that it helped you in your understanding 579 00:25:05,150 --> 00:25:08,528 of the Pentium microprocessor. 580 00:25:08,528 --> 00:25:10,820 Some of the new capabilities we included on the Pentium 581 00:25:10,820 --> 00:25:13,430 processor were increased error detection 582 00:25:13,430 --> 00:25:16,550 and a functional redundancy check mode of operation. 583 00:25:16,550 --> 00:25:18,740 These were included at the request of our customers 584 00:25:18,740 --> 00:25:20,960 at the high end of the computer market, 585 00:25:20,960 --> 00:25:23,240 those involved in delivering solutions 586 00:25:23,240 --> 00:25:26,510 to mission-critical server applications. 587 00:25:26,510 --> 00:25:29,420 In these areas, even extremely rare errors 588 00:25:29,420 --> 00:25:32,520 must be detected and handled appropriately, 589 00:25:32,520 --> 00:25:36,600 so we included these features to satisfy those requirements. 590 00:25:36,600 --> 00:25:39,197 The most important thing is to handle external errors. 591 00:25:39,197 --> 00:25:41,780 And here, we've carried forward the data bus parity generation 592 00:25:41,780 --> 00:25:44,090 and checking logic that was on the 486 593 00:25:44,090 --> 00:25:47,180 and extended that to our 64-bit data bus. 594 00:25:47,180 --> 00:25:50,090 Also, we added a parity bit on the address bus 595 00:25:50,090 --> 00:25:54,110 so that we can also cover the second wide bus going off-chip 596 00:25:54,110 --> 00:25:56,990 with parity generation and checking. 
597 00:25:56,990 --> 00:26:00,320 In addition to this external error detection capability, 598 00:26:00,320 --> 00:26:02,450 within the chip we've included parity bits 599 00:26:02,450 --> 00:26:04,410 on all the major arrays. 600 00:26:04,410 --> 00:26:08,270 This includes the code and data cache arrays, cache tag arrays, 601 00:26:08,270 --> 00:26:11,390 the TLBs, and even the microcode ROM. 602 00:26:11,390 --> 00:26:12,980 All of these have parity bits included 603 00:26:12,980 --> 00:26:16,760 so that any internal errors are detected and reported via a pin. 604 00:26:16,760 --> 00:26:19,580 This internal parity checking covers about half the devices 605 00:26:19,580 --> 00:26:22,520 on the chip and so provides a very nice basic level 606 00:26:22,520 --> 00:26:24,110 of coverage. 607 00:26:24,110 --> 00:26:26,120 Beyond that, in order to offer an option 608 00:26:26,120 --> 00:26:28,010 for the ultimate in error detection, 609 00:26:28,010 --> 00:26:30,590 we have an FRC, or functional redundancy check, 610 00:26:30,590 --> 00:26:33,500 mode on the chip where you can configure two processors 611 00:26:33,500 --> 00:26:35,750 in a master-checker configuration 612 00:26:35,750 --> 00:26:37,730 where the master is operating normally 613 00:26:37,730 --> 00:26:40,490 and the checker, rather than driving its output pins, 614 00:26:40,490 --> 00:26:43,700 turns them into inputs and checks the values driven by the master 615 00:26:43,700 --> 00:26:47,190 and pulls a pin only if a mismatch occurs. 616 00:26:47,190 --> 00:26:48,890 So this increased error detection level 617 00:26:48,890 --> 00:26:51,650 provides a nice capability for some new application 618 00:26:51,650 --> 00:26:55,100 areas where the ultimate in error detection is required. 
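The two mechanisms described above, bus parity and FRC-style master-checker comparison, can be sketched as follows. This is a hedged illustration: the bus width matches the talk's 64-bit data bus, but the even-parity convention and the function names are assumptions, not the chip's documented behavior.

```python
# Illustrative parity generation/checking over a 64-bit bus value,
# plus an FRC-style comparison of a master's outputs against a
# checker's independently computed values.

def parity64(value: int) -> int:
    """Even-parity bit: chosen so the total 1s count (data + parity) is even."""
    return bin(value & (2**64 - 1)).count("1") & 1

def check_parity(value: int, parity_bit: int) -> bool:
    """True when the received parity bit matches the received data."""
    return parity64(value) == parity_bit

def frc_mismatch(master_outputs: list, checker_outputs: list) -> bool:
    """FRC-style check: the checker samples the values the master drives
    and compares them with its own results; any difference flags an error."""
    return master_outputs != checker_outputs
```

Parity of this kind catches any single-bit error on the bus or in an array, while lockstep comparison catches a much broader class of faults, which is why FRC is offered as the option for the most demanding configurations.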
619 00:26:55,100 --> 00:26:58,370 And now, let's go with Beatrice Fu to our performance lab 620 00:26:58,370 --> 00:27:00,110 where she will describe the compiler 621 00:27:00,110 --> 00:27:03,440 technology we developed with the Pentium processor. 622 00:27:03,440 --> 00:27:04,520 Thanks, John. 623 00:27:04,520 --> 00:27:08,390 How compiler technology enhances the performance of the Pentium 624 00:27:08,390 --> 00:27:13,110 processor is best illustrated with a simple example. 625 00:27:13,110 --> 00:27:15,810 Here, we have a simple synthetic program 626 00:27:15,810 --> 00:27:18,130 compiled by two compilers. 627 00:27:18,130 --> 00:27:22,290 The code on the left-hand side is generated by a typical 486 628 00:27:22,290 --> 00:27:24,900 compiler, and the code on the right-hand side 629 00:27:24,900 --> 00:27:28,740 is generated by the new compiler technology optimized 630 00:27:28,740 --> 00:27:31,050 for the Pentium CPU. 631 00:27:31,050 --> 00:27:33,960 The two code sequences are quite different, 632 00:27:33,960 --> 00:27:36,900 even though they are both correct representations 633 00:27:36,900 --> 00:27:39,330 of the original program. 634 00:27:39,330 --> 00:27:43,800 This demonstration shows how the two code sequences proceed 635 00:27:43,800 --> 00:27:46,080 through the Pentium processor. 636 00:27:46,080 --> 00:27:49,530 The graphical block diagram represents the instruction 637 00:27:49,530 --> 00:27:54,930 cache, the data cache, the two execution pipelines as well as 638 00:27:54,930 --> 00:27:57,750 the floating point unit. 639 00:27:57,750 --> 00:28:00,780 While executing, the active instructions 640 00:28:00,780 --> 00:28:04,110 are color coded so that we can follow their journeys 641 00:28:04,110 --> 00:28:06,480 through the Pentium processor. 
642 00:28:06,480 --> 00:28:09,270 As can be seen, the Pentium processor 643 00:28:09,270 --> 00:28:12,660 has many idle resources on the left-hand side when 644 00:28:12,660 --> 00:28:15,300 executing the traditional code sequence, 645 00:28:15,300 --> 00:28:19,140 whereas the new code sequence utilizes the dual execution 646 00:28:19,140 --> 00:28:22,530 pipelines more effectively on the right-hand side. 647 00:28:22,530 --> 00:28:26,640 As a result, the time taken to run the new code sequence 648 00:28:26,640 --> 00:28:29,860 is much less than for the traditional code. 649 00:28:29,860 --> 00:28:31,720 As can be seen, the right-hand side 650 00:28:31,720 --> 00:28:34,210 is finished whereas the left-hand side is still 651 00:28:34,210 --> 00:28:36,120 chugging along. 652 00:28:36,120 --> 00:28:38,760 The advancement of CPU microarchitecture 653 00:28:38,760 --> 00:28:41,520 requires advanced compiler technology 654 00:28:41,520 --> 00:28:44,040 to exploit the on-chip parallelism, 655 00:28:44,040 --> 00:28:47,040 such that features like pipelining and superscalar 656 00:28:47,040 --> 00:28:50,670 execution can be made more effective to deliver the final 657 00:28:50,670 --> 00:28:53,880 performance as illustrated in the previous demonstration. 658 00:28:53,880 --> 00:28:56,040 The role of a compiler has always 659 00:28:56,040 --> 00:28:59,310 been translating high-level language into low-level machine 660 00:28:59,310 --> 00:29:04,230 code so as to hide the machine details from the programmer. 661 00:29:04,230 --> 00:29:06,600 The role of an optimizing compiler 662 00:29:06,600 --> 00:29:11,160 is to generate machine code that runs efficiently on the target 663 00:29:11,160 --> 00:29:12,850 processor. 
664 00:29:12,850 --> 00:29:16,720 In the simple example here, a traditional compiler 665 00:29:16,720 --> 00:29:19,990 will generate code to increment the array index 666 00:29:19,990 --> 00:29:24,160 and compare against the loop bound and end up with a three-cycle loop 667 00:29:24,160 --> 00:29:25,270 body. 668 00:29:25,270 --> 00:29:29,020 Using the same example, our optimizing compiler 669 00:29:29,020 --> 00:29:31,990 displaces the loop count to eliminate 670 00:29:31,990 --> 00:29:34,870 the compare instruction and takes advantage 671 00:29:34,870 --> 00:29:38,620 of the zero-flag setting of the increment instruction, which 672 00:29:38,620 --> 00:29:44,190 improves the loop cycle count from 3 down to 2. 673 00:29:44,190 --> 00:29:47,070 We are again using the same example. 674 00:29:47,070 --> 00:29:50,940 Our new compiler technology further unrolls the loop 675 00:29:50,940 --> 00:29:54,990 and overlaps the instructions from two different iterations 676 00:29:54,990 --> 00:29:58,990 to take full advantage of the superscalar core. 677 00:29:58,990 --> 00:30:03,280 This gives a two-cycle loop count for every two iterations 678 00:30:03,280 --> 00:30:06,950 or effectively one cycle per iteration. 679 00:30:06,950 --> 00:30:11,800 This is one example of how our compiler technology maximizes 680 00:30:11,800 --> 00:30:17,680 the usage of the dual pipelines of the Pentium processor. 681 00:30:17,680 --> 00:30:22,480 In the last decade, the CPU speed has increased more than 682 00:30:22,480 --> 00:30:27,220 10 times, but the speed of memory components has not kept 683 00:30:27,220 --> 00:30:28,210 up. 684 00:30:28,210 --> 00:30:31,960 As caches are introduced into the memory subsystem 685 00:30:31,960 --> 00:30:36,250 to meet the CPU demand, cooperation from compilers 686 00:30:36,250 --> 00:30:40,480 is sometimes needed to ensure efficient use of the memory 687 00:30:40,480 --> 00:30:41,650 bandwidth. 
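The unrolling transformation just described, overlapping two iterations per pass so that independent work can pair into the two pipelines, can be sketched as follows. The loop body here is a hypothetical example; on the Pentium processor the payoff happens at the machine-code level, where the paired statements feed the U and V pipes.

```python
# Sketch of unrolling a loop by 2. The two statements in the unrolled
# body are independent, so they are candidates for pairing in the
# dual pipelines; a cleanup step handles an odd trip count.

def scale_add_rolled(a, x, y):
    """Reference version: one iteration per pass."""
    for i in range(len(x)):
        y[i] += a * x[i]

def scale_add_unrolled_by_2(a, x, y):
    """Unrolled version: two independent iterations per pass."""
    n = len(x)
    for i in range(0, n - 1, 2):
        y[i] += a * x[i]          # these two statements have no
        y[i + 1] += a * x[i + 1]  # dependence on each other
    if n % 2:                     # clean up the odd final element
        y[n - 1] += a * x[n - 1]
```

Both versions compute the same result; the transformation only changes how much independent work each trip around the loop exposes.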
688 00:30:41,650 --> 00:30:45,250 In this next example, with j being the index 689 00:30:45,250 --> 00:30:48,820 of the innermost loop, the array elements 690 00:30:48,820 --> 00:30:53,230 are not accessed from contiguous memory locations. 691 00:30:53,230 --> 00:30:56,500 In other words, even though several elements 692 00:30:56,500 --> 00:31:00,580 are brought into one cache line, only the first element 693 00:31:00,580 --> 00:31:01,870 will be used. 694 00:31:01,870 --> 00:31:04,360 The rest will be discarded and brought 695 00:31:04,360 --> 00:31:08,510 in again and again as we cycle through the outer loop. 696 00:31:08,510 --> 00:31:12,940 Our compiler technology remedies this by interchanging i 697 00:31:12,940 --> 00:31:17,020 and j, making the memory accesses contiguous. 698 00:31:17,020 --> 00:31:19,780 Just now, I illustrated a few techniques 699 00:31:19,780 --> 00:31:21,760 of an optimizing compiler. 700 00:31:21,760 --> 00:31:25,450 Our 32-bit compiler has extensive optimizations 701 00:31:25,450 --> 00:31:28,330 to ensure the maximum performance of the Pentium 702 00:31:28,330 --> 00:31:29,800 processor. 703 00:31:29,800 --> 00:31:33,220 We have state-of-the-art classical optimizations, 704 00:31:33,220 --> 00:31:37,110 like register variable detection, loop-invariant 705 00:31:37,110 --> 00:31:40,090 code motion, and others listed here. 706 00:31:40,090 --> 00:31:51,930 707 00:31:51,930 --> 00:31:56,240 We also pay specific attention to the x86 architecture 708 00:31:56,240 --> 00:31:59,030 as well as our CPU implementation 709 00:31:59,030 --> 00:32:02,480 in terms of selecting effective addressing modes 710 00:32:02,480 --> 00:32:07,220 and code sequences and rearranging instructions to take advantage 711 00:32:07,220 --> 00:32:09,710 of the superscalar core. 
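The loop interchange described above can be sketched as follows. The array here is flattened by hand to make the memory strides explicit; the layout, sizes, and loop body are illustrative assumptions, not the benchmark code from the talk.

```python
# Sketch of loop interchange. The array is stored with consecutive i
# values adjacent in memory (a[j][i] lives at j * N + i). With j
# innermost, successive accesses are N elements apart, so only one
# element of each fetched cache line is used before moving on.
# Interchanging i and j makes the inner accesses stride-1.

N = 4
a = list(range(N * N))            # flattened array; element (j, i) at j*N + i

def sum_before_interchange(a):
    total = 0
    for i in range(N):
        for j in range(N):        # inner index j strides by N: poor locality
            total += a[j * N + i]
    return total

def sum_after_interchange(a):
    total = 0
    for j in range(N):
        for i in range(N):        # inner index i is now stride-1
            total += a[j * N + i]
    return total
```

The two loops visit exactly the same elements and produce the same result; only the visiting order changes, which is why interchange is legal here and why it turns wasted cache-line fetches into fully used ones.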
712 00:32:09,710 --> 00:32:12,330 In the earlier example, I illustrated 713 00:32:12,330 --> 00:32:16,250 how loop interchange can result in more effective use 714 00:32:16,250 --> 00:32:18,380 of the memory bandwidth. 715 00:32:18,380 --> 00:32:22,160 Loop interchange is only one of the memory optimization 716 00:32:22,160 --> 00:32:23,270 techniques. 717 00:32:23,270 --> 00:32:27,410 Our compiler technology can also perform loop distribution 718 00:32:27,410 --> 00:32:32,600 to promote parallelism, blocking to maximize memory reuse once 719 00:32:32,600 --> 00:32:35,840 loaded into the cache, strip mining 720 00:32:35,840 --> 00:32:39,260 to adjust the problem size to the size of the cache 721 00:32:39,260 --> 00:32:43,610 in order to minimize thrashing, and cache preloading 722 00:32:43,610 --> 00:32:46,070 to hide the memory latency. 723 00:32:46,070 --> 00:32:50,400 The next chart shows the result of our compiler efforts. 724 00:32:50,400 --> 00:32:54,680 We are comparing the performance of the 66 megahertz Pentium 725 00:32:54,680 --> 00:32:58,670 processor running the SPEC benchmark suites compiled 726 00:32:58,670 --> 00:33:01,050 by three different compilers. 727 00:33:01,050 --> 00:33:05,000 The compiler technology I just described is shown here. 728 00:33:05,000 --> 00:33:09,560 Compiler A is the best available 486 compiler, 729 00:33:09,560 --> 00:33:13,850 and compiler B was an average compiler in 1990 when 730 00:33:13,850 --> 00:33:16,430 we started the compiler effort. 731 00:33:16,430 --> 00:33:20,150 On the integer side, we were able to go from a performance 732 00:33:20,150 --> 00:33:26,660 level of 41 using the average compiler up to a level of 64.6 733 00:33:26,660 --> 00:33:29,960 using our new compiler technology. 
734 00:33:29,960 --> 00:33:33,410 When compared to the level of 57.6 735 00:33:33,410 --> 00:33:36,740 from the best available 486 compiler, 736 00:33:36,740 --> 00:33:41,210 the difference between the best 486 compiler and the new compiler 737 00:33:41,210 --> 00:33:45,020 is on the order of 10% to 15%, obviously 738 00:33:45,020 --> 00:33:48,980 a very substantial improvement over the average compiler 739 00:33:48,980 --> 00:33:51,560 from the 1990 vintage. 740 00:33:51,560 --> 00:33:56,420 On the floating point side, the results are even more dramatic. 741 00:33:56,420 --> 00:33:58,790 From the new compiler technology, 742 00:33:58,790 --> 00:34:04,010 we are able to achieve a SPEC floating point result of 59.7 compared 743 00:34:04,010 --> 00:34:09,050 to much lower numbers in the 30s using the other compilers. 744 00:34:09,050 --> 00:34:12,620 Throughout the development of the new compiler technology, 745 00:34:12,620 --> 00:34:16,130 we emphasized that we could not optimize performance 746 00:34:16,130 --> 00:34:20,330 of the Pentium processor at the expense of the Intel 486 747 00:34:20,330 --> 00:34:23,570 or even the Intel 386 processors. 748 00:34:23,570 --> 00:34:27,050 So we were very careful in implementing our optimization 749 00:34:27,050 --> 00:34:30,350 techniques, and we constantly measured the performance 750 00:34:30,350 --> 00:34:34,340 of the generated code on the older generation processors. 751 00:34:34,340 --> 00:34:37,190 The efforts really paid off. 752 00:34:37,190 --> 00:34:41,210 This chart shows that instead of optimizing for the Pentium 753 00:34:41,210 --> 00:34:44,900 processor at the expense of the older parts, 754 00:34:44,900 --> 00:34:48,679 we were, in fact, able to improve the performance on the older 755 00:34:48,679 --> 00:34:52,280 parts while optimizing for the Pentium processor. 
756 00:34:52,280 --> 00:34:55,040 We are again comparing the same benchmark suites 757 00:34:55,040 --> 00:34:57,920 using the same three compilers 758 00:34:57,920 --> 00:35:02,390 but running the code on the Intel 486 processor. 759 00:35:02,390 --> 00:35:05,120 You can see that our new compiler technology 760 00:35:05,120 --> 00:35:08,030 is providing a nice performance boost, 761 00:35:08,030 --> 00:35:12,710 not only over the average compiler that we had in 1990 762 00:35:12,710 --> 00:35:17,240 but also over the best available 486 compiler. 763 00:35:17,240 --> 00:35:20,000 Now, I hope you can see the importance of compiler 764 00:35:20,000 --> 00:35:23,320 technology to the overall performance. 765 00:35:23,320 --> 00:35:25,958 This concludes our talk on the Pentium microprocessor. 766 00:35:25,958 --> 00:35:27,750 I hope that you'll take away from this talk 767 00:35:27,750 --> 00:35:30,180 some understanding of the key microarchitecture features 768 00:35:30,180 --> 00:35:33,330 inside the processor as well as some of the key techniques 769 00:35:33,330 --> 00:35:36,090 we've developed with compiler technology. 770 00:35:36,090 --> 00:35:38,250 In combination, these two produce a new level 771 00:35:38,250 --> 00:35:39,930 of performance within our line of PC 772 00:35:39,930 --> 00:35:41,730 compatible microprocessors. 773 00:35:41,730 --> 00:35:43,580 Thank you. 774 00:35:43,580 --> 00:37:09,000