[MUSIC PLAYING]

Hi, I'm Dick Sites.

And I'm Dirk Meyer. And we're going to talk today about Alpha.

In discussing Alpha, we make the distinction between the architecture as a paper document and the various implementations. I'm going to talk about the architecture, and then I'll be back a little bit later to talk about the first implementation of that architecture.

I'm going to talk about the Alpha architecture: about the goals, an overview of the architecture, about the things that are different from other RISC architectures, and then about some of the problems that we addressed in designing the architecture and how we thought about them.

Rich Witek and I are the co-architects for Alpha. My part of the Alpha design has been influenced substantially by three other architects: Fred Brooks, John Cocke, and Seymour Cray.

When we started the Alpha project in 1988, we set four goals for the architecture: performance, longevity, scalability, and generality.
In performance, we wanted the architecture to allow implementations that would be faster than anything else in the industry each year for the foreseeable future. That performance goal reflects itself in a number of the details of the architecture, as I'll show you in a few minutes.

For longevity, we wanted the architecture to last 15 or 20 or 25 years, much longer than a 10-year design cycle, and that longevity goal implied a number of design decisions, if we were honest about meeting that goal. The first design decision is that it had to be a 64-bit architecture. All the 32-bit architectures will run out of address bits sometime in the next 10 or 15 or 20 years. In fact, while we were designing the first chip, we ran out of address bits on VAXes last year.

In addition to being a 64-bit architecture, the longevity goal implied that implementations over the course of two decades would have to scale up in performance. We looked at how that scaling could happen. The first chip runs at 200 megahertz, a five-nanosecond cycle.
It looked unlikely that the chip clock speed would improve by a factor of 1,000 over 25 years, but the industry curves say that implementation performance does need to improve by a factor of 1,000 if the architecture is going to last that long. We thought it was realistic for the clock speed to improve by a factor of 10, so we looked for other places to pick up the remaining factor of 100.

If the clock speed isn't going to pick up by more than a factor of 10, then to gain more performance you have to do more work in every clock cycle. So very early in the design, we settled on the idea that the architecture would have to gracefully allow multiple instruction issue, that it would be necessary, for the longevity and performance goals, for the architecture to allow many instructions to be started every clock cycle. We thought realistically that over the course of a few decades, perhaps using compilers that our children build, compiler technology could pick up a factor of 10 using multiple instruction issue. That still left us a factor of 10 short.
So we looked at where else implementations could pick up performance, and the only other way we saw was to have multiple processors, perhaps as many as 10 processors in a box 25 years from now, executing pieces of a common program. So we focused the architecture, because of the longevity and scalability goals, on those three points: a very fast clock cycle, multiple instruction issue, and multiple processors.

Finally, we designed the architecture to be general. It's not just a Unix hotbox. It's designed to be able to run VMS, to run OSF/1, and to run other operating systems, to track different computing paradigms as the industry changes over the next 5 or 10 or 20 years. We also designed it to support a variety of computer languages: not just Fortran and C but also COBOL, Pascal, Ada, BASIC, and lots of other languages.

And finally, in looking at bringing a new architecture to the market, we focused on migrating our current customers from VAXes and from MIPS DECstations to the new architecture.
We rejected things like a hardware compatibility mode in favor of doing software translation of binary images and doing compilers with compatible front ends for source code recompilation.

Here's an overview of the architecture. It looks very much like other RISC machines on the surface. It's a 64-bit load/store RISC machine. Instructions are 32 bits wide. They describe operations on one-, two-, four-, or eight-byte integers. They describe operations on VAX single-precision and double-precision floating point, and also on IEEE single-precision and double-precision floating point.

There are 32 integer registers, each of 64 bits. One of those is hardwired to zero, and one of them is used as a stack pointer. There are also 32 floating-point registers, each 64 bits, and one of those is hardwired to zero.

In addition, the program counter is a full 64-bit virtual address, and there are a number of other counters and registers to allow writing real operating system multitasking software: things like a cycle counter for performance studies and a floating-point status register for IEEE rounding modes and status bits.
64-bit virtual addresses are used throughout the architecture, although implementations are allowed to implement a subset of the virtual address space so long as they check the high-order unimplemented bits for validity.

The instruction formats are very straightforward. There is a 6-bit opcode in every instruction. The first format is for CALL_PAL instructions. It has a 26-bit function field that selects one of a few dozen privileged subroutines that are used to execute a number of the complex operations that operating systems are built on top of. I'll discuss those operations in a few minutes.

The branch instructions have the 6-bit opcode and then a 5-bit number of a register that can be tested for positive, negative, zero, or non-zero, and, in the case of the integer registers, for even or odd. Then there's a 21-bit signed displacement field, actually a longword displacement, that describes a target instruction address anywhere within plus or minus four megabytes relative to the branch instruction.

The memory formats have the 6-bit opcode and then two register fields of five bits each.
The first field specifies the register to be loaded or stored. The second field specifies the base register for the memory address, and the remaining field specifies a 16-bit signed displacement from the base register.

The last format is used for operations between registers: a 6-bit opcode, two source registers, a function field that's really an extension of the opcode, and then a destination register. All the operations read two registers and write one register. They read all 64 bits of the registers, and they write all 64 bits of the registers.

One of the simplest ways of characterizing how this architecture is different from other architectures is by what's not there. There are no condition codes in the Alpha architecture, because they get in the way of doing multiple issue. If you issue six instructions at once, and they all potentially set the condition codes, and there's really only one condition code register, then you have to build a lot of very messy hardware to sort out which of those six instructions actually sets the condition code register.
And if the fourth of those takes an overflow, and then you discover on the fly that the third one has to set the condition code register, getting that right is very messy. So the condition code register is an example of a design technique that's difficult to use when you're planning on multiple instruction issue. So we don't have it in Alpha.

In a similar way, we don't have fixed registers for operations such as string pointers or multiply or divide. Some architectures have a multiplier-quotient register or dedicated registers for other operations. The problem with that is, if you try to launch, say, four instructions at once, and they're all multiplies, and there's only one multiplier-quotient register, it's a bottleneck. You either have to build hardware that shadows multiple copies of the MQ register, or you can't launch multiple multiplies. So we simply don't have it, and all of the operations are completely general between the general registers.

Finally, we don't have byte writes.
If you look at implementations of writing a single byte to memory, somewhere in real implementations memory is accessed in complete memory words, which are typically four or eight bytes wide. Updating a single byte in a memory word somewhere involves reading the entire memory word, possibly correcting a single-bit or double-bit error if the memory system has error correction, then updating the single byte, then recalculating any error correction or parity bits, and then finally writing the entire memory word.

That read-modify-write is really a sequence. And whether it's implemented in hardware in the memory board or in the CPU chip near the pins, wherever it's implemented, there's a tendency to have only one implementation of that sequencing hardware, and that makes it difficult to launch six independent byte writes and have them run in parallel. Again, the byte-write logic becomes a bottleneck. So we took the fairly radical move in Alpha of not having byte writes.
Instead, there are carefully tuned instruction sequences that allow a program to do the read-modify-write as a load, an in-register modify, and then a store. We included in the architecture enough design information to allow programs to do single-byte updates within the same memory word even in the face of other processors updating adjacent bytes in that memory word, and that can all work. And we expect, over the course of a couple of decades, that removing that bottleneck will allow much faster byte-write or byte-manipulation programs than some other architectures that have an instruction that looks like it would help instead.

Finally, we tried to avoid having first-implementation artifacts. In contrast to doing no chip design, shipping the first chip, and, if it's successful, then designing another one that's a little bit different and shipping that, we tried to design into the architecture solutions to all of the design problems we were aware of, even when that made the first chip implementation more difficult.

Finally, we have included in the architecture careful support for multiprocessing.
One of the design issues in multiprocessing is how to do atomic updates of shared memory locations between multiple processors. We picked a paradigm of load-locked / modify / store-conditional that is also used in the MIPS R4000 architecture. We chose that particular paradigm because it's the only thing we could find that would scale up with processor speed over a couple of decades.

The idea is to do a load instruction, but using load-locked, which remembers a few bits of state; then to modify, by doing an add or an OR or whatever the modification is, in registers; and then to do a store-conditional instruction back to the shared memory location. If that entire sequence, from the load-locked through the modify to the store-conditional, runs with no interrupts, no exceptions, and no interfering write from another processor, then the store-conditional stores and leaves a bit in a register for the program to test to determine that the store actually occurred.
If anything goes wrong, if there's an interrupt in the middle, if there's an exception on any of the operations, or if there's an interfering write from another processor, then the store-conditional doesn't store and instead sets the bit saying that it did not store. The program then needs to branch on that bit, go back around, and start again with the load-locked, retrying the modification and the store-conditional. When the entire sequence runs to completion, an atomic update has in fact been done with no interference.

There are two advantages to this paradigm. The first is that, in the absence of an interfering write, the entire load-locked / modify / store-conditional sequence can be done inside an on-chip processor cache. The other advantage is that multiple processors can be doing independent sequences in parallel. There's no common lock bit somewhere in a shared memory system.

For multiprocessor support, we also designed around read-write ordering.
Rather than the traditional strict read-write ordering, we specify in the Alpha architecture that, from the point of view of a second processor, reads and writes launched by a first processor can arrive at the second processor in an arbitrary order, and then we included a memory barrier instruction to limit the permutations allowed in an implementation. I'll talk about that more in a few minutes. Finally, there's the usual set of registers for writing real software, to keep track of which processor you're running on and of thread-specific context.

Here are a few things that are different in the Alpha architecture compared to traditional RISC architectures. First, we tried to avoid first-implementation expedients and instead to do clean features that we hope will last many decades. For example, rather than architecting strict read-write ordering, as I just talked about, we architected arbitrary read-write ordering between processors, with a memory barrier as a mechanism to specify the ordering exactly when it's needed.
It's possible that other architectures that specify strict read-write ordering will find that that's a performance bottleneck over the coming decades.

We also looked at issues like branch prediction and burying the latency of branches. Rather than doing something like a branch delay slot, which could bury one cycle this year or maybe two cycles, we looked at the long term and said that that's a technique that will not scale up well over a couple of decades and will not scale well with multiple instruction issue. So we looked at other ways of solving branch latency problems.

Similarly, we don't have in Alpha complex instructions that do multiple operations, such as a multiply-add operation, which is really two operations in one instruction. We instead seriously designed to allow implementations to gracefully do multiple instruction issue. We believe that, through multiple instruction issue, the same effects you get with the complex combined instructions can be achieved in Alpha, but achieved in a more straightforward way over the coming decades.
We also designed a hazard-free architecture. This means that if an instruction writes to a register and a later instruction reads from that register, the reading instruction always gets the written value: implementations must implement whatever pipeline interlocks or scoreboarding are necessary to deliver the same results on all implementations. There are no hazards where, in some implementations, a register can be read too soon and get an old value, and so be not binary compatible with a different implementation with different timing, where the register turns out to be read later and get a new value.

We also designed the Alpha architecture to have minimal global state. We've discovered, particularly in implementing VAXes, that having global state bits, such as interrupt enables or rounding-mode bits, gets in the way of building very fast pipelined implementations.
The tendency is, when global state bits are changed, to completely drain the pipeline, change the global state bits, and then restart the pipeline, in order to guarantee that subsequent instructions see the state change and previously issued instructions don't. That draining and restarting of the pipeline eventually becomes a performance bottleneck, particularly if the state bits need to be saved and restored at every subroutine boundary. So rather than having global state bits, we designed to minimize them and instead to put things like interrupt enable bits or rounding-mode bits in the instruction itself, so the entire instruction and its local state can be pipelined with no drains.

We also have no mode bits, such as a 32-bit mode versus a 64-bit mode, or VAX floating point versus IEEE floating point. This, again, minimizes the hardware complexity and maximizes the fraction of time that the pipeline can be kept full.
As we were doing the architecture design, we kept in mind a number of implementation issues: keeping many operations going simultaneously, having pipelined implementations, and having long memory latencies that would grow with time. So we included a number of features, such as specifying imprecise arithmetic exceptions rather than precise ones, as an enabling technology to allow multiple instruction issue to be done gracefully without a lot of hardware overhead. We also included things like prefetching operations as a mechanism to cover some of the main memory latencies that we anticipate over the coming years.

Finally, we included in the design a discussion of hardware-software agreements: suggested ground rules for hardware implementers and also for software implementers, so that, to the extent that both groups use the same ground rules, you end up with very high performance implementations running high performance software with very little work.

One of the things that's different about the Alpha architecture is the existence of PALcode. PAL stands for Privileged Architecture Library.
It's a set of privileged subroutines, a lot like the basic input/output system in an IBM PC. They perform specified functions, but how they are implemented inside the subroutines can vary from one implementation to another. The specified functions are the complex operations that an operating system is built on top of, such as taking or returning from an interrupt or exception, loading the memory management registers, or context switching. The PAL subroutines run with the interrupts turned off, and they may have implementation-specific access to physical hardware registers.

There's a different set of PALcode for each operating system, which allows the operating system to run on what's really an extended machine. For example, the port of VMS to Alpha runs with a set of PALcode that provides VAX-like 32 interrupt levels, four memory management modes, and a number of the complex operations that build interrupt vectors, including interlocked queue instructions.
422 00:22:21,660 --> 00:22:24,420 By having all of those in software, 423 00:22:24,420 --> 00:22:27,570 the actual Alpha hardware is quite simple, 424 00:22:27,570 --> 00:22:32,040 but the VMS port took two or three years less 425 00:22:32,040 --> 00:22:34,860 than if VMS had been completely rewritten to avoid 426 00:22:34,860 --> 00:22:37,650 using those VAX features. 427 00:22:37,650 --> 00:22:40,140 In a similar way, the OSF/1 operating system 428 00:22:40,140 --> 00:22:43,050 runs with its PAL code that presents to the operating 429 00:22:43,050 --> 00:22:47,850 system a simple, Unix-like view of the world that does not 430 00:22:47,850 --> 00:22:49,300 have 32 interrupt levels. 431 00:22:49,300 --> 00:22:52,500 It has eight interrupt levels and a number 432 00:22:52,500 --> 00:22:56,430 of other memory-mapping features that many Unix systems 433 00:22:56,430 --> 00:22:57,840 are used to. 434 00:22:57,840 --> 00:23:01,800 In a similar way, the Windows NT operating system 435 00:23:01,800 --> 00:23:04,890 has its own PAL code that presents 436 00:23:04,890 --> 00:23:06,570 a view of the world that's culturally 437 00:23:06,570 --> 00:23:12,385 compatible with the 486 and the MIPS NT implementations. 438 00:23:12,385 --> 00:23:14,760 So those are some of the things in the Alpha architecture 439 00:23:14,760 --> 00:23:17,430 that are different from other RISC architectures. 440 00:23:17,430 --> 00:23:20,520 Now I'm going to talk about a number of the design problems 441 00:23:20,520 --> 00:23:24,510 we had in trying to build a fast architecture to allow 442 00:23:24,510 --> 00:23:27,120 fast implementations. 443 00:23:27,120 --> 00:23:30,810 The first design problem is branch latency.
444 00:23:30,810 --> 00:23:34,200 The problem is when there's a discontinuity 445 00:23:34,200 --> 00:23:36,630 in the instruction stream, when a branch goes off 446 00:23:36,630 --> 00:23:42,870 to some other instruction, how to bury the time that 447 00:23:42,870 --> 00:23:44,790 must occur in real implementations 448 00:23:44,790 --> 00:23:48,180 before fetching the target instruction. 449 00:23:48,180 --> 00:23:50,640 Some implementations or some architectures 450 00:23:50,640 --> 00:23:54,570 specify branch delay slots in which, after a branch has 451 00:23:54,570 --> 00:23:57,120 executed, the following instruction is also 452 00:23:57,120 --> 00:24:00,840 executed on the theory that in real implementations 453 00:24:00,840 --> 00:24:04,680 it's already been pre-fetched, so executing it is almost free, 454 00:24:04,680 --> 00:24:06,780 and it can cover up the cycle of latency 455 00:24:06,780 --> 00:24:10,110 while the target address is being fetched. 456 00:24:10,110 --> 00:24:12,990 That was a good design technique in 1988, 457 00:24:12,990 --> 00:24:15,570 but already we've reached the point where, 458 00:24:15,570 --> 00:24:18,630 in real implementations that are quite fast, 459 00:24:18,630 --> 00:24:20,790 the branch latency that you need to cover up 460 00:24:20,790 --> 00:24:23,640 is two or three cycles instead of just one cycle, 461 00:24:23,640 --> 00:24:28,740 and an architected branch delay slot can only bury one cycle. 462 00:24:28,740 --> 00:24:33,180 In addition, branch delay slots don't make any sense 463 00:24:33,180 --> 00:24:36,240 when you are anticipating implementations 464 00:24:36,240 --> 00:24:38,250 that issue multiple instructions, 465 00:24:38,250 --> 00:24:40,860 and different implementations issue different numbers 466 00:24:40,860 --> 00:24:43,620 of instructions. 
467 00:24:43,620 --> 00:24:45,600 For example, if you have an implementation that 468 00:24:45,600 --> 00:24:49,110 issues a peak of three instructions every cycle 469 00:24:49,110 --> 00:24:50,690 and you have a branch delay slot that 470 00:24:50,690 --> 00:24:55,280 specifies one instruction is executed after each branch, 471 00:24:55,280 --> 00:24:58,460 then after every branch, the other two potential issue slots 472 00:24:58,460 --> 00:25:00,050 have to be wasted. 473 00:25:00,050 --> 00:25:03,780 You can only issue one of those three following instructions. 474 00:25:03,780 --> 00:25:06,110 If branches occur about every eight instructions, 475 00:25:06,110 --> 00:25:09,720 that can be quite a performance bottleneck. 476 00:25:09,720 --> 00:25:11,750 So rather than branch delay slots, 477 00:25:11,750 --> 00:25:14,060 we went back to the original problem 478 00:25:14,060 --> 00:25:17,390 of covering branch latency and looked for other solutions 479 00:25:17,390 --> 00:25:20,700 that we thought would scale well for a couple of decades. 480 00:25:20,700 --> 00:25:22,850 We didn't find any magic bullets, 481 00:25:22,850 --> 00:25:26,420 so instead we approached covering branch latency 482 00:25:26,420 --> 00:25:28,710 on three different fronts. 483 00:25:28,710 --> 00:25:32,150 First, we architected branch prediction hints 484 00:25:32,150 --> 00:25:34,160 for two-way branches. 485 00:25:34,160 --> 00:25:38,600 The rule is that a two-way branch that is a forward branch 486 00:25:38,600 --> 00:25:41,540 is predicted not to be taken, and a two-way branch 487 00:25:41,540 --> 00:25:44,840 that's a backward branch is predicted to be taken.
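A minimal C sketch of that static rule (an illustration, not Digital's hardware; it assumes the 21-bit signed displacement field of the Alpha conditional branch format):

```c
#include <stdint.h>

/* Sign-extend the 21-bit displacement field of an Alpha conditional
   branch instruction (assumed to occupy bits 20:0 of the word). */
int32_t branch_displacement(uint32_t instruction)
{
    return (int32_t)(instruction << 11) >> 11;  /* arithmetic shift */
}

/* The architected hint: backward branches (typically loops) are
   predicted taken, forward branches predicted not taken. */
int predict_taken(uint32_t instruction)
{
    return branch_displacement(instruction) < 0;
}
```

Because both the compiler and the hardware can apply this same rule, no prediction state needs to be stored or communicated.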
488 00:25:44,840 --> 00:25:48,320 To the extent that the hardware implementers and the compiler 489 00:25:48,320 --> 00:25:51,290 writers follow the same rule, you 490 00:25:51,290 --> 00:25:55,370 get a very fast, very simple implementation 491 00:25:55,370 --> 00:25:59,810 that involves less hardware than branch prediction caches 492 00:25:59,810 --> 00:26:02,340 or things like that. 493 00:26:02,340 --> 00:26:07,170 For branches that have calculated targets 494 00:26:07,170 --> 00:26:09,780 such as some subroutine calls or case 495 00:26:09,780 --> 00:26:13,350 statements or subroutine returns, 496 00:26:13,350 --> 00:26:15,850 there were bits left over in those instructions. 497 00:26:15,850 --> 00:26:20,130 So we use those bits as hints to the implementation about where 498 00:26:20,130 --> 00:26:22,980 the most likely target is. 499 00:26:22,980 --> 00:26:26,490 The low 14 bits of those branch instructions 500 00:26:26,490 --> 00:26:29,580 can be used to directly drive an instruction 501 00:26:29,580 --> 00:26:32,970 cache with the address of the most likely target. 502 00:26:32,970 --> 00:26:34,890 Because they are hints, implementations 503 00:26:34,890 --> 00:26:36,510 need not use them. 504 00:26:36,510 --> 00:26:39,480 Also, because they are hints, if the hint is wrong, 505 00:26:39,480 --> 00:26:42,900 implementations must eventually, although perhaps more slowly, 506 00:26:42,900 --> 00:26:45,300 fetch the correctly specified target 507 00:26:45,300 --> 00:26:47,790 and actually branch to it. 508 00:26:47,790 --> 00:26:50,550 We also specified in those hints whether a given 509 00:26:50,550 --> 00:26:54,270 calculated branch was a subroutine call or subroutine 510 00:26:54,270 --> 00:26:56,380 return or neither. 511 00:26:56,380 --> 00:26:59,910 That's enough information for an implementation 512 00:26:59,910 --> 00:27:05,230 to run a small stack of likely subroutine return addresses. 
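That small return-address stack can be modeled in a few lines of C. This is a toy sketch, not the actual hardware; the depth of four matches the EV4 stack mentioned in this talk, and a wrong prediction costs only time, never correctness, because the hints merely steer instruction fetch:

```c
#include <stdint.h>

#define RAS_DEPTH 4

typedef struct {
    uint64_t entry[RAS_DEPTH];
    int top;                 /* number of valid entries, 0..RAS_DEPTH */
} ras_t;

/* On a hinted subroutine call, push the address after the call. */
void ras_push(ras_t *s, uint64_t return_pc)
{
    if (s->top == RAS_DEPTH) {          /* overflow: drop the oldest */
        for (int i = 1; i < RAS_DEPTH; i++)
            s->entry[i - 1] = s->entry[i];
        s->top--;
    }
    s->entry[s->top++] = return_pc;
}

/* On a hinted return, pop a predicted I-cache fetch address
   (0 means "no prediction available"). */
uint64_t ras_pop(ras_t *s)
{
    return s->top ? s->entry[--s->top] : 0;
}
```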
513 00:27:05,230 --> 00:27:08,070 And in fact, as Dirk will talk about in the first EV4 514 00:27:08,070 --> 00:27:14,820 implementation, there is a four level subroutine return stack. 515 00:27:14,820 --> 00:27:16,440 That allows an implementation, when 516 00:27:16,440 --> 00:27:18,810 it encounters a subroutine return, 517 00:27:18,810 --> 00:27:21,930 to pull off the top of that implementation stack the most 518 00:27:21,930 --> 00:27:24,540 likely I-cache address to fetch from 519 00:27:24,540 --> 00:27:27,570 and, under good circumstances, have subroutine returns take 520 00:27:27,570 --> 00:27:30,430 zero or one cycles. 521 00:27:30,430 --> 00:27:34,620 That's an example of how we thought about problems, looked 522 00:27:34,620 --> 00:27:38,490 at existing solutions in the marketplace, 523 00:27:38,490 --> 00:27:41,370 asked the question of whether they would scale well 524 00:27:41,370 --> 00:27:43,450 over the coming decades. 525 00:27:43,450 --> 00:27:45,660 When the answer was yes, we used them 526 00:27:45,660 --> 00:27:48,180 and did not deviate from conventional RISC 527 00:27:48,180 --> 00:27:49,350 architectures. 528 00:27:49,350 --> 00:27:52,290 When the answer was no, we said, what problem 529 00:27:52,290 --> 00:27:54,270 are these techniques solving? 530 00:27:54,270 --> 00:27:56,760 Go back and look at the problem with a clean sheet of paper 531 00:27:56,760 --> 00:27:58,218 and say, how else could the problem 532 00:27:58,218 --> 00:28:01,110 be solved in a way that would scale well 533 00:28:01,110 --> 00:28:03,480 over a couple of decades? 534 00:28:03,480 --> 00:28:06,390 The last thing we did in trying to bury branch 535 00:28:06,390 --> 00:28:09,990 latency is to include conditional move instructions 536 00:28:09,990 --> 00:28:12,840 which test a register and, depending 537 00:28:12,840 --> 00:28:14,610 on the result of the test, conditionally 538 00:28:14,610 --> 00:28:17,310 move a second register to a third register.
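In compiler terms, the effect of a conditional move is what a C ternary gives you when it is mapped onto a compare plus conditional move rather than a branch; the instruction sequences named in the comments are illustrative, not guaranteed compiler output:

```c
#include <stdint.h>

/* Branch-free maximum: the comparison produces a value, and the
   "move" is selected by it, roughly CMPLT a,b,t / MOV a,r / CMOVNE t,b,r. */
int64_t max64(int64_t a, int64_t b)
{
    return (a < b) ? b : a;
}

/* Absolute value the same way: test the sign, conditionally move the
   negated value. No branch, so the basic block stays whole. */
int64_t abs64(int64_t a)
{
    int64_t n = -a;
    return (a < 0) ? n : a;
}
```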
539 00:28:17,310 --> 00:28:20,850 These can be used for simple cases of if-then-else 540 00:28:20,850 --> 00:28:22,950 or doing maximum or minimum. 541 00:28:22,950 --> 00:28:26,100 And when they are used, they completely eliminate branches. 542 00:28:26,100 --> 00:28:28,770 By eliminating branches, they eliminate the issues 543 00:28:28,770 --> 00:28:32,440 of branch delay slots and branch latencies and also, in fact, 544 00:28:32,440 --> 00:28:34,260 make basic blocks bigger and allow 545 00:28:34,260 --> 00:28:37,080 compilers to do more optimization and more code 546 00:28:37,080 --> 00:28:38,770 movement. 547 00:28:38,770 --> 00:28:41,910 So in summary, we rejected the traditional branch delay slots 548 00:28:41,910 --> 00:28:45,180 because they're incompatible with multiple instruction issue 549 00:28:45,180 --> 00:28:49,000 and with scaling up performance over many decades. 550 00:28:49,000 --> 00:28:52,830 We instead looked at fresh mechanisms that 551 00:28:52,830 --> 00:28:55,440 would give the same effect. 552 00:28:55,440 --> 00:28:57,240 Another design problem we looked at 553 00:28:57,240 --> 00:28:59,520 was how to handle arithmetic exceptions, things 554 00:28:59,520 --> 00:29:02,880 like overflows or divide by zero. 555 00:29:02,880 --> 00:29:05,400 Some architectures specify precise exceptions 556 00:29:05,400 --> 00:29:09,120 in which if an instruction gets an arithmetic exception, 557 00:29:09,120 --> 00:29:13,295 the following instructions must not have executed. 558 00:29:13,295 --> 00:29:14,670 Now, that makes it very difficult 559 00:29:14,670 --> 00:29:17,910 to do things like four-way instruction issue.
560 00:29:17,910 --> 00:29:21,030 You either cannot issue multiple instructions that could 561 00:29:21,030 --> 00:29:23,280 possibly get arithmetic exceptions, 562 00:29:23,280 --> 00:29:25,780 or you have to issue them and build a bunch of hardware, 563 00:29:25,780 --> 00:29:28,650 things like trap silos, that back out of the instructions 564 00:29:28,650 --> 00:29:31,470 that really should not have happened if you issue four 565 00:29:31,470 --> 00:29:34,420 and the second one overflows. 566 00:29:34,420 --> 00:29:38,160 So rather than specifying precise arithmetic exceptions, 567 00:29:38,160 --> 00:29:40,890 we specified imprecise arithmetic exceptions. 568 00:29:40,890 --> 00:29:45,300 We said, if you get an overflow, you find out about it later. 569 00:29:45,300 --> 00:29:46,895 Because the architecture is not tied 570 00:29:46,895 --> 00:29:48,270 to any particular implementation, 571 00:29:48,270 --> 00:29:50,670 the architecture doesn't say how much later. 572 00:29:50,670 --> 00:29:51,990 It just says later. 573 00:29:51,990 --> 00:29:54,990 It's the Cray-1 model of exception handling, 574 00:29:54,990 --> 00:29:57,090 but there are completely legitimate reasons 575 00:29:57,090 --> 00:30:01,690 for people to want to get precise exceptions. 576 00:30:01,690 --> 00:30:04,380 So we also included a trap barrier instruction 577 00:30:04,380 --> 00:30:06,180 that specifies, of all the instructions 578 00:30:06,180 --> 00:30:08,820 that have been issued before the barrier, 579 00:30:08,820 --> 00:30:12,390 all the preceding instructions must deliver any exceptions 580 00:30:12,390 --> 00:30:17,100 that they're going to deliver at the barrier or earlier, 581 00:30:17,100 --> 00:30:20,980 and then subsequent instructions can be issued.
582 00:30:20,980 --> 00:30:23,850 So if you write a program with no trap barriers, 583 00:30:23,850 --> 00:30:26,370 it runs at vector-like speeds, and you 584 00:30:26,370 --> 00:30:30,240 get information on something like an overflow that it 585 00:30:30,240 --> 00:30:32,430 occurred but not very precise information 586 00:30:32,430 --> 00:30:34,212 on where it occurred. 587 00:30:34,212 --> 00:30:35,670 If you do something like put a trap 588 00:30:35,670 --> 00:30:38,160 barrier at the end of every subroutine, 589 00:30:38,160 --> 00:30:41,160 that may involve draining a few cycles of the pipeline 590 00:30:41,160 --> 00:30:43,050 at the end of the subroutine, but it then 591 00:30:43,050 --> 00:30:46,650 localizes where the exception occurred to which subroutine. 592 00:30:46,650 --> 00:30:49,380 If you put the trap barrier at the end of every source 593 00:30:49,380 --> 00:30:51,360 language statement, you can localize 594 00:30:51,360 --> 00:30:55,388 the generation of an exception to which statement it was. 595 00:30:55,388 --> 00:30:57,930 And if you put a trap barrier after every single instruction, 596 00:30:57,930 --> 00:31:03,750 you can discover exactly which instruction generated it. 597 00:31:03,750 --> 00:31:06,570 Because the IEEE floating point standard 598 00:31:06,570 --> 00:31:10,560 has an extensive set of exception handling, 599 00:31:10,560 --> 00:31:13,440 we also included in the architecture manual 600 00:31:13,440 --> 00:31:17,560 a design that allows one trap barrier per basic block, 601 00:31:17,560 --> 00:31:21,390 rather than one per instruction, and still allows 602 00:31:21,390 --> 00:31:24,900 complete IEEE exception handling to be performed in software 603 00:31:24,900 --> 00:31:26,235 in the trap routine. 
604 00:31:26,235 --> 00:31:31,290 The basic idea is that there's a software constraint 605 00:31:31,290 --> 00:31:36,180 that the compiler not reuse registers within a basic block 606 00:31:36,180 --> 00:31:39,600 or not reuse registers before the trap barrier 607 00:31:39,600 --> 00:31:41,790 so that a software handler can successfully 608 00:31:41,790 --> 00:31:43,950 backtrack through the instruction stream, 609 00:31:43,950 --> 00:31:46,200 find out which exception occurred first, 610 00:31:46,200 --> 00:31:49,680 process that exception, and then continue. 611 00:31:49,680 --> 00:31:52,890 That allows a substantially faster implementation 612 00:31:52,890 --> 00:31:55,700 than stopping after every single instruction 613 00:31:55,700 --> 00:31:58,940 and waiting for any exceptions to occur. 614 00:31:58,940 --> 00:32:00,400 In addition to the trap barrier, we 615 00:32:00,400 --> 00:32:04,080 included exception enable disable bits 616 00:32:04,080 --> 00:32:09,790 in every instruction so that integer add, for instance, 617 00:32:09,790 --> 00:32:11,580 can be specified in the instruction 618 00:32:11,580 --> 00:32:15,480 to take an overflow trap or not to take an overflow trap. 619 00:32:15,480 --> 00:32:19,200 That allows languages such as Fortran that require overflows 620 00:32:19,200 --> 00:32:22,260 to be detected to simply compile into the instructions 621 00:32:22,260 --> 00:32:25,650 that trap on overflow and also allows languages such as C, 622 00:32:25,650 --> 00:32:28,680 which generally specify no overflow trapping, 623 00:32:28,680 --> 00:32:32,010 to compile into the instructions that do not trap on overflow. 624 00:32:32,010 --> 00:32:35,550 In a similar way, for floating point underflow, 625 00:32:35,550 --> 00:32:37,590 disabling the underflow allows the program 626 00:32:37,590 --> 00:32:40,540 to simply give a result of 0 and keep going.
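A hedged C sketch of that disabled-underflow behavior, substituting an explicit software check for what the hardware does when the trap is disabled (mul_flush_to_zero is a hypothetical helper for illustration, not an Alpha primitive):

```c
#include <float.h>

/* With underflow trapping disabled, a product too small for a
   normalized double is simply replaced by 0.0 and execution
   continues; no trap handler ever runs. */
double mul_flush_to_zero(double a, double b)
{
    double r = a * b;
    if (r > -DBL_MIN && r < DBL_MIN)   /* underflowed (or exactly 0) */
        return 0.0;
    return r;
}
```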
627 00:32:40,540 --> 00:32:42,570 Some of our customers have expressed 628 00:32:42,570 --> 00:32:44,750 a preference for that behavior. 629 00:32:44,750 --> 00:32:48,540 The whole point here is that by choosing compiler options, 630 00:32:48,540 --> 00:32:51,120 you get to choose what level of performance 631 00:32:51,120 --> 00:32:54,870 and what level of precision on exceptions you want rather 632 00:32:54,870 --> 00:32:58,480 than me choosing for you. 633 00:32:58,480 --> 00:33:00,660 So in summary on the arithmetic exceptions, 634 00:33:00,660 --> 00:33:04,620 we rejected an architecture that would require hardware 635 00:33:04,620 --> 00:33:07,060 implementations that issue multiple instructions 636 00:33:07,060 --> 00:33:10,530 to silo register values and then have to roll back 637 00:33:10,530 --> 00:33:12,910 if an exception occurs. 638 00:33:12,910 --> 00:33:17,442 We also rejected implementations that read operands 639 00:33:17,442 --> 00:33:19,650 and then check whether an overflow or something could 640 00:33:19,650 --> 00:33:20,950 possibly occur 641 00:33:20,950 --> 00:33:24,420 and, if so, delay issuing other instructions into the pipeline 642 00:33:24,420 --> 00:33:27,660 until the exception decision has been made. 643 00:33:27,660 --> 00:33:29,220 That doesn't work for multiple issue, 644 00:33:29,220 --> 00:33:31,740 and it doesn't work well for single issue 645 00:33:31,740 --> 00:33:34,830 if the read of the registers and the check of the values 646 00:33:34,830 --> 00:33:39,390 is actually two or three stages down the pipeline. 647 00:33:39,390 --> 00:33:42,710 Another problem we looked at was memory latency. 648 00:33:42,710 --> 00:33:44,840 Over the course of a few decades, 649 00:33:44,840 --> 00:33:48,290 memory latency will get longer and longer as processor 650 00:33:48,290 --> 00:33:52,430 speeds get faster and faster, but memory chip speeds don't.
651 00:33:52,430 --> 00:33:54,470 So we included in the architecture 652 00:33:54,470 --> 00:33:56,120 a pair of pre-fetching instructions 653 00:33:56,120 --> 00:33:58,700 as hints to an implementation. 654 00:33:58,700 --> 00:34:02,090 For example, a Fortran program working down 655 00:34:02,090 --> 00:34:04,460 one column of an array could issue 656 00:34:04,460 --> 00:34:06,470 the pre-fetching instructions to prefetch 657 00:34:06,470 --> 00:34:09,770 the second column of the array during the processing 658 00:34:09,770 --> 00:34:11,630 of the first column. 659 00:34:11,630 --> 00:34:13,850 It looked likely that those instructions 660 00:34:13,850 --> 00:34:19,429 could bury about 100 cycles of memory latency 661 00:34:19,429 --> 00:34:21,570 in a realistic way. 662 00:34:21,570 --> 00:34:25,280 We also looked at memory latency in multiprocessor systems. 663 00:34:25,280 --> 00:34:29,420 We specified that there is no implied read write ordering 664 00:34:29,420 --> 00:34:31,159 in the following sense. 665 00:34:31,159 --> 00:34:35,000 If one processor issues a sequence of reads and writes, 666 00:34:35,000 --> 00:34:37,550 they are allowed to arrive at a second processor 667 00:34:37,550 --> 00:34:38,960 in an arbitrary order. 668 00:34:38,960 --> 00:34:42,290 Implementations are allowed to rearrange reads and writes. 669 00:34:42,290 --> 00:34:45,409 That's an enabling technology that allows things 670 00:34:45,409 --> 00:34:49,489 like write buffers, or multi-bank caches, 671 00:34:49,489 --> 00:34:53,179 or routing networks between processors 672 00:34:53,179 --> 00:34:57,800 if you have lots of processors, or things like memory buses 673 00:34:57,800 --> 00:35:02,060 that do error detection and retry.
674 00:35:02,060 --> 00:35:04,190 For example, if you have a processor that 675 00:35:04,190 --> 00:35:08,670 does two writes and you pipeline those writes down a memory box 676 00:35:08,670 --> 00:35:13,190 and the first write arrives with bad parity, 677 00:35:13,190 --> 00:35:16,100 the second write then arrives, and then the first write 678 00:35:16,100 --> 00:35:18,980 is retried and arrives successfully 679 00:35:18,980 --> 00:35:21,290 with good parity on the second try. 680 00:35:21,290 --> 00:35:24,740 That's an example of a good implementation technique that 681 00:35:24,740 --> 00:35:26,750 has the effect of delivering the writes out 682 00:35:26,750 --> 00:35:30,110 of order to another processor. 683 00:35:30,110 --> 00:35:32,900 We wanted to allow implementations 684 00:35:32,900 --> 00:35:34,622 to run at very high speeds. 685 00:35:34,622 --> 00:35:36,080 And therefore, in the architecture, 686 00:35:36,080 --> 00:35:40,430 we do not require precise read write ordering. 687 00:35:40,430 --> 00:35:43,730 Instead, there's a memory barrier instruction that 688 00:35:43,730 --> 00:35:45,650 says, at this point, I care. 689 00:35:45,650 --> 00:35:47,330 At this point, all the reads and writes 690 00:35:47,330 --> 00:35:49,670 that have been issued by a given processor in front 691 00:35:49,670 --> 00:35:53,900 of the memory barrier have to be delivered to other processors 692 00:35:53,900 --> 00:35:55,910 before any reads and writes issued 693 00:35:55,910 --> 00:35:58,310 after the memory barrier. 694 00:35:58,310 --> 00:35:59,990 And the phrasing was chosen carefully. 695 00:35:59,990 --> 00:36:02,660 It doesn't say that the sending processor 696 00:36:02,660 --> 00:36:06,650 has to stop issuing instructions or run slowly. 697 00:36:06,650 --> 00:36:09,860 It simply says that whatever permutation 698 00:36:09,860 --> 00:36:12,260 could occur in the implementations, 699 00:36:12,260 --> 00:36:15,802 that there is a restriction at the memory barrier.
700 00:36:15,802 --> 00:36:18,260 That may mean, for instance, that the memory barrier itself 701 00:36:18,260 --> 00:36:20,480 is pipelined out along with reads and writes 702 00:36:20,480 --> 00:36:23,300 to other processors, or it may mean exactly 703 00:36:23,300 --> 00:36:27,890 at the memory barrier a pipeline memory bus sends a write 704 00:36:27,890 --> 00:36:30,050 and then waits a few cycles to get confirmation 705 00:36:30,050 --> 00:36:33,290 that that write arrived successfully 706 00:36:33,290 --> 00:36:37,620 before sending writes that come after the memory barrier. 707 00:36:37,620 --> 00:36:40,820 So we view this as an enabling technology for very high speed 708 00:36:40,820 --> 00:36:42,140 implementations. 709 00:36:42,140 --> 00:36:45,860 It's one of the design ideas that 710 00:36:45,860 --> 00:36:48,770 led Cray Research to choose the Alpha architecture 711 00:36:48,770 --> 00:36:52,310 for building a 1,000-processor or bigger massively parallel 712 00:36:52,310 --> 00:36:54,680 processor. 713 00:36:54,680 --> 00:36:57,500 So in summary on the memory system design, 714 00:36:57,500 --> 00:36:59,435 we rejected things like implicit read 715 00:36:59,435 --> 00:37:02,880 write ordering in favor of explicit programmer 716 00:37:02,880 --> 00:37:06,830 statement of where overlap can occur 717 00:37:06,830 --> 00:37:10,080 and where overlap must not occur. 718 00:37:10,080 --> 00:37:14,240 So the Alpha architecture is the new 64-bit architecture 719 00:37:14,240 --> 00:37:16,790 designed to last a long time, designed 720 00:37:16,790 --> 00:37:19,580 to allow high performance implementations 721 00:37:19,580 --> 00:37:22,490 with specific emphasis on allowing multiple instruction 722 00:37:22,490 --> 00:37:28,250 issue, multiple processors, and very fast clock rates. 723 00:37:28,250 --> 00:37:31,490 The architecture supports a very wide range of operating systems 724 00:37:31,490 --> 00:37:33,070 and compiler languages.
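The memory barrier discipline described above can be sketched with C11 atomics, where atomic_thread_fence stands in for the role Alpha's MB instruction plays between a data write and the flag that publishes it (a sketch of the programming rule, not of any particular hardware):

```c
#include <stdatomic.h>

static int payload;          /* ordinary shared data         */
static atomic_int ready;     /* flag that publishes the data */

void producer(void)
{
    payload = 42;                                   /* write the data    */
    atomic_thread_fence(memory_order_release);      /* barrier: data first */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                           /* wait for the flag   */
    atomic_thread_fence(memory_order_acquire);      /* barrier: flag first */
    return payload;                                 /* now sees 42         */
}
```

Without the two fences, the rule that reads and writes may arrive at another processor in arbitrary order would let that processor observe the flag before the data.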
725 00:37:33,070 --> 00:37:35,700 726 00:37:35,700 --> 00:37:38,090 Some of you who looked at earlier tapes in this series 727 00:37:38,090 --> 00:37:39,980 may have seen Dave Patterson when 728 00:37:39,980 --> 00:37:44,270 he predicted that by somewhere between 1993 and '96, 729 00:37:44,270 --> 00:37:47,660 there would be super microprocessors running 730 00:37:47,660 --> 00:37:51,080 anywhere from desktops to supercomputers. 731 00:37:51,080 --> 00:37:54,260 We're very happy with Alpha to deliver on Dave's prediction 732 00:37:54,260 --> 00:37:56,360 a year early. 733 00:37:56,360 --> 00:37:58,040 I think now it's time for Dirk to talk 734 00:37:58,040 --> 00:38:01,700 about the first implementation. 735 00:38:01,700 --> 00:38:04,940 As Dick said, I am here to tell you about the first Alpha 736 00:38:04,940 --> 00:38:07,220 implementation, which internal to [INAUDIBLE], 737 00:38:07,220 --> 00:38:09,087 we refer to as EV4. 738 00:38:09,087 --> 00:38:11,670 Since I'm going to use that term throughout this presentation, 739 00:38:11,670 --> 00:38:13,730 first a couple of words on the term. 740 00:38:13,730 --> 00:38:16,580 When the program started, the original name 741 00:38:16,580 --> 00:38:19,280 for the architecture, which the designers came up with, 742 00:38:19,280 --> 00:38:23,120 was EVAX, for extended VAX, which turned out in retrospect 743 00:38:23,120 --> 00:38:25,410 to be not a very good name because the architecture, 744 00:38:25,410 --> 00:38:28,760 in fact, has no formal relationship to the VAX 745 00:38:28,760 --> 00:38:30,360 architecture. 746 00:38:30,360 --> 00:38:32,580 However, we're engineers and not marketing people. 747 00:38:32,580 --> 00:38:35,038 So that was the name we chose at the beginning; it was later 748 00:38:35,038 --> 00:38:36,290 changed to Alpha.
749 00:38:36,290 --> 00:38:38,150 The name of the chip, however, has 750 00:38:38,150 --> 00:38:41,128 remained EV, for extended VAX, four, 751 00:38:41,128 --> 00:38:43,670 because it was built in a fourth generation CMOS process. 752 00:38:43,670 --> 00:38:46,580 753 00:38:46,580 --> 00:38:50,100 The presentation that I'm going to give has four basic parts. 754 00:38:50,100 --> 00:38:51,920 First of all, I'm going to describe 755 00:38:51,920 --> 00:38:54,470 some of the overall features of the chip, 756 00:38:54,470 --> 00:38:57,110 then I'm going to describe some of the higher level 757 00:38:57,110 --> 00:39:00,290 architectural features such as the pipelines, 758 00:39:00,290 --> 00:39:04,160 the functional units, and the instruction issue rules. 759 00:39:04,160 --> 00:39:05,990 I will next describe just a couple 760 00:39:05,990 --> 00:39:09,440 of the more interesting microarchitectural features of EV4 761 00:39:09,440 --> 00:39:11,930 and relate those back to the higher level 762 00:39:11,930 --> 00:39:15,218 architectural concepts that Dick described previously. 763 00:39:15,218 --> 00:39:16,760 And finally, I will talk a little bit 764 00:39:16,760 --> 00:39:19,880 about the electrical interface between EV4 765 00:39:19,880 --> 00:39:23,410 and the systems into which it's designed. 766 00:39:23,410 --> 00:39:25,990 EV4 is a single chip implementation 767 00:39:25,990 --> 00:39:28,000 of the Alpha architecture. 768 00:39:28,000 --> 00:39:30,700 It's implemented in CMOS technology 769 00:39:30,700 --> 00:39:33,760 and has two cycle time variants, the first of which 770 00:39:33,760 --> 00:39:37,570 operates at a 200 megahertz internal clock rate, 771 00:39:37,570 --> 00:39:41,360 and the second variant operates at 150 megahertz. 772 00:39:41,360 --> 00:39:43,960 Now, these two speed variants don't really 773 00:39:43,960 --> 00:39:46,360 represent separate designs. 774 00:39:46,360 --> 00:39:49,660 They really represent distinct speed bins.
775 00:39:49,660 --> 00:39:53,020 So it's one design, and various chips 776 00:39:53,020 --> 00:39:55,930 happened to fall in one speed bin or the other. 777 00:39:55,930 --> 00:39:58,990 Some work faster than others, and the ones that run fast, 778 00:39:58,990 --> 00:40:01,640 we run fast. 779 00:40:01,640 --> 00:40:04,820 As I said, the chip is built in a CMOS process. 780 00:40:04,820 --> 00:40:08,310 It's built in Digital CMOS four technology, 781 00:40:08,310 --> 00:40:11,880 which means that it's our fourth generation CMOS technology. 782 00:40:11,880 --> 00:40:17,300 This technology features 0.7 micron drawn feature sizes 783 00:40:17,300 --> 00:40:22,320 with a 0.5 micron effective channel length that 784 00:40:22,320 --> 00:40:25,770 also features three layers of metallization, 785 00:40:25,770 --> 00:40:30,120 and the technology is really focused towards high speed 786 00:40:30,120 --> 00:40:32,400 microprocessor implementations. 787 00:40:32,400 --> 00:40:34,800 For example, the third layer of metal 788 00:40:34,800 --> 00:40:39,180 is different from upper layers of metal in other CMOS 789 00:40:39,180 --> 00:40:43,140 technologies where the metal is intended to maximize signal 790 00:40:43,140 --> 00:40:44,100 routability. 791 00:40:44,100 --> 00:40:45,990 In this case, we use the third layer of metal 792 00:40:45,990 --> 00:40:50,100 solely to distribute power and our internal high-speed clock. 793 00:40:50,100 --> 00:40:53,820 And therefore, it's a much thicker, coarser metallization 794 00:40:53,820 --> 00:40:57,510 level which can both bring a lot of power into the chip 795 00:40:57,510 --> 00:41:02,280 and distribute the clock with minimum skew. 796 00:41:02,280 --> 00:41:04,840 The die, physically, is fairly large. 797 00:41:04,840 --> 00:41:08,310 It's about 14 millimeters by 17 millimeters in size, 798 00:41:08,310 --> 00:41:12,100 and it implements about 1.7 million transistors.
799 00:41:12,100 --> 00:41:16,000 The part is packaged in a 431 pin grid array 800 00:41:16,000 --> 00:41:18,940 of which 291 pins are signals. 801 00:41:18,940 --> 00:41:23,510 The remaining 140 are power and ground. 802 00:41:23,510 --> 00:41:28,880 The chip dissipates 30 watts at 200 megahertz or 23 watts 803 00:41:28,880 --> 00:41:32,090 at 150 megahertz. 804 00:41:32,090 --> 00:41:37,760 It implements a 43-bit subset of the architected 64-bit virtual 805 00:41:37,760 --> 00:41:38,960 address space. 806 00:41:38,960 --> 00:41:41,510 As Dick mentioned previously, implementations 807 00:41:41,510 --> 00:41:45,170 are allowed by the architecture to implement a subset 808 00:41:45,170 --> 00:41:47,370 of the 64-bit virtual address. 809 00:41:47,370 --> 00:41:51,360 However, the entire 64 bits is checked. 810 00:41:51,360 --> 00:41:56,040 The chip also supports a 34-bit physical address space, 811 00:41:56,040 --> 00:41:58,800 giving us the ability to address some 16 812 00:41:58,800 --> 00:42:02,300 gigabytes of physical memory. 813 00:42:02,300 --> 00:42:05,330 The chip is a superscalar implementation in the sense 814 00:42:05,330 --> 00:42:09,530 that it can issue, at its peak, two instructions in each CPU 815 00:42:09,530 --> 00:42:14,300 cycle to any one of four fully pipelined functional units. 816 00:42:14,300 --> 00:42:17,630 The functional units include an integer operation unit, 817 00:42:17,630 --> 00:42:21,140 a floating point operation unit, a load/store unit, 818 00:42:21,140 --> 00:42:22,670 and a branch unit. 819 00:42:22,670 --> 00:42:27,230 The chip also includes a total of 44 translation lookaside 820 00:42:27,230 --> 00:42:30,260 buffer entries, 32 of which are dedicated 821 00:42:30,260 --> 00:42:32,600 to data stream references and 12 of which 822 00:42:32,600 --> 00:42:35,600 are dedicated to instruction stream references. 823 00:42:35,600 --> 00:42:39,520 Both translation buffers are fully associative.
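The 43-bit virtual-address subset with full 64-bit checking can be illustrated in C: an address is acceptable only if its upper bits are a sign extension of bit 42 (a sketch of the check, not the chip's actual logic):

```c
#include <stdint.h>

enum { VA_BITS = 43 };   /* bits implemented by EV4 */

/* Returns 1 if the 64-bit virtual address is a valid sign extension
   of its low 43 bits, i.e. bits 63:43 all equal bit 42. */
int va_is_canonical(uint64_t va)
{
    int64_t ext = (int64_t)(va << (64 - VA_BITS)) >> (64 - VA_BITS);
    return (uint64_t)ext == va;
}
```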
824 00:42:39,520 --> 00:42:41,560 There is an on-chip write buffer which 825 00:42:41,560 --> 00:42:45,850 includes four entries where each entry can 826 00:42:45,850 --> 00:42:48,610 contain 32 bytes of data. 827 00:42:48,610 --> 00:42:54,330 The chip also includes two on-chip caches, each 8 kilobytes 828 00:42:54,330 --> 00:42:57,390 in size, one devoted to instruction stream references 829 00:42:57,390 --> 00:43:00,870 and one devoted to data stream references. 830 00:43:00,870 --> 00:43:04,140 Both of them are physical, direct-mapped caches. 831 00:43:04,140 --> 00:43:07,090 The data cache is write through. 832 00:43:07,090 --> 00:43:13,060 The bus interface is flexible and can support system designs 833 00:43:13,060 --> 00:43:19,420 with either 128-bit or 64-bit data paths between the system 834 00:43:19,420 --> 00:43:20,610 and the chip. 835 00:43:20,610 --> 00:43:23,680 And as previously mentioned, the external interface 836 00:43:23,680 --> 00:43:27,060 supports a 34-bit physical address. 837 00:43:27,060 --> 00:43:31,720 The next graphic shows a crude block diagram of the chip. 838 00:43:31,720 --> 00:43:34,490 Starting from the top, you see an instruction cache, 839 00:43:34,490 --> 00:43:37,420 which each cycle can feed two instructions 840 00:43:37,420 --> 00:43:41,110 to the IBOX, or instruction issue unit, of the chip. 841 00:43:41,110 --> 00:43:44,290 The IBOX is responsible for decoding the instructions, 842 00:43:44,290 --> 00:43:46,990 performing all on-chip resource checks, 843 00:43:46,990 --> 00:43:50,410 and sending the instructions off to the respective functional 844 00:43:50,410 --> 00:43:51,700 units. 845 00:43:51,700 --> 00:43:54,100 The EBOX is an integer operate unit 846 00:43:54,100 --> 00:43:58,640 which performs basic integer operations such as add, 847 00:43:58,640 --> 00:44:02,140 subtract, OR, and shifts. 848 00:44:02,140 --> 00:44:07,350 The FBOX is the equivalent floating point operate unit.
849 00:44:07,350 --> 00:44:09,955 The integer register file sits below the EBOX 850 00:44:09,955 --> 00:44:11,330 while the floating point register 851 00:44:11,330 --> 00:44:14,480 file sits below the FBOX. 852 00:44:14,480 --> 00:44:18,260 The lines in the picture represent data buses 853 00:44:18,260 --> 00:44:22,940 which carry data to and from the functional 854 00:44:22,940 --> 00:44:25,870 units and the register files. 855 00:44:25,870 --> 00:44:28,480 You can correlate Dick's discussion of the instruction 856 00:44:28,480 --> 00:44:31,810 formats laid out in the Alpha architecture 857 00:44:31,810 --> 00:44:35,170 to the physical orientation of the chip itself. 858 00:44:35,170 --> 00:44:37,570 From the top of the integer register file, 859 00:44:37,570 --> 00:44:42,970 you see two 64-bit wide buses which carry integer operands 860 00:44:42,970 --> 00:44:46,090 to the EBOX and another 64-bit bus which 861 00:44:46,090 --> 00:44:49,797 carries integer results back to the integer register file. 862 00:44:49,797 --> 00:44:51,880 Similarly, on the floating point side of the chip, 863 00:44:51,880 --> 00:44:55,180 you see two 64-bit buses which carry operands 864 00:44:55,180 --> 00:44:58,990 to the floating point functional unit and another 64-bit bus 865 00:44:58,990 --> 00:45:02,210 which carries results back to the register file. 866 00:45:02,210 --> 00:45:05,080 Below the register files sits the ABOX. 867 00:45:05,080 --> 00:45:07,460 The ABOX, fundamentally, is the load/store unit 868 00:45:07,460 --> 00:45:09,380 within the machine. 869 00:45:09,380 --> 00:45:13,500 You can see two register file ports feeding the ABOX. 870 00:45:13,500 --> 00:45:16,340 One of these contains base addresses. 871 00:45:16,340 --> 00:45:18,620 The other contains store data coming 872 00:45:18,620 --> 00:45:20,240 from the integer register file. 
873 00:45:20,240 --> 00:45:22,190 Similarly, on the floating point side, 874 00:45:22,190 --> 00:45:25,310 you see a single 64-bit wide bus which 875 00:45:25,310 --> 00:45:30,110 carries floating point store data out to the ABOX. 876 00:45:30,110 --> 00:45:34,190 Below the ABOX is the four entry by 32 byte wide write 877 00:45:34,190 --> 00:45:37,910 buffer which gets its data from the ABOX 878 00:45:37,910 --> 00:45:42,140 and supplies data out to the BIU or bus interface unit. 879 00:45:42,140 --> 00:45:44,150 And lastly, at the bottom of the diagram, 880 00:45:44,150 --> 00:45:45,650 you see the on-chip data cache 881 00:45:45,650 --> 00:45:49,130 which, as I previously stated, is 8 kilobytes in size 882 00:45:49,130 --> 00:45:51,200 and a write through cache. 883 00:45:51,200 --> 00:45:54,350 The bus interface unit is responsible for orchestrating 884 00:45:54,350 --> 00:45:58,160 activities between on-chip functional units 885 00:45:58,160 --> 00:46:01,500 and the outside world, meaning the system. 886 00:46:01,500 --> 00:46:04,910 It contains a 64-bit data path which 887 00:46:04,910 --> 00:46:10,220 carries both fill data from the off-chip memory structures 888 00:46:10,220 --> 00:46:12,560 to on-chip primary caches and also 889 00:46:12,560 --> 00:46:17,570 carries write data from the write buffer out onto the pins. 890 00:46:17,570 --> 00:46:20,210 The next graphic shows the integer pipeline 891 00:46:20,210 --> 00:46:21,680 in the machine. 892 00:46:21,680 --> 00:46:23,960 It is 7 stages long, and this diagram 893 00:46:23,960 --> 00:46:27,890 can be used to describe the pipelines which 894 00:46:27,890 --> 00:46:32,520 operate in the IBOX, the EBOX, and the ABOX. 895 00:46:32,520 --> 00:46:36,570 Taking the IBOX first, you see fundamentally 896 00:46:36,570 --> 00:46:38,760 four pipeline stages. 
897 00:46:38,760 --> 00:46:41,760 In stage zero, which we also refer to as the instruction 898 00:46:41,760 --> 00:46:46,800 fetch stage, the IBOX reads a naturally aligned pair 899 00:46:46,800 --> 00:46:50,580 of longword instructions from the instruction cache. 900 00:46:50,580 --> 00:46:56,390 In the next cycle, the IBOX performs two primary functions. 901 00:46:56,390 --> 00:47:00,160 First, it decodes a portion of the instructions 902 00:47:00,160 --> 00:47:02,500 in order to determine which functional units 903 00:47:02,500 --> 00:47:04,690 the instructions should be sent to. 904 00:47:04,690 --> 00:47:07,420 In parallel with that, it determines 905 00:47:07,420 --> 00:47:12,330 for branch instructions the probable target for the branch. 906 00:47:12,330 --> 00:47:16,980 The chip supports two methods for branch prediction. 907 00:47:16,980 --> 00:47:20,820 In the first method, we support the architected hint alluded 908 00:47:20,820 --> 00:47:27,490 to by Dick earlier in this talk, where the hardware looks 909 00:47:27,490 --> 00:47:29,920 at the sign bit of the displacement field 910 00:47:29,920 --> 00:47:31,600 in the branch instruction itself 911 00:47:31,600 --> 00:47:36,460 and predicts that backward branches are taken 912 00:47:36,460 --> 00:47:38,710 but forward branches are not. 913 00:47:38,710 --> 00:47:40,900 Alternatively, the chip also implements 914 00:47:40,900 --> 00:47:43,630 a dynamic branch prediction structure in which, 915 00:47:43,630 --> 00:47:47,720 associated with each instruction in the instruction cache, 916 00:47:47,720 --> 00:47:49,250 is a single bit. 917 00:47:49,250 --> 00:47:52,160 This bit indicates which direction the branch 918 00:47:52,160 --> 00:47:55,600 took the last time it was executed. 
919 00:47:55,600 --> 00:47:57,850 When new instructions are brought into the instruction 920 00:47:57,850 --> 00:48:01,490 cache, we load this bit with the sign 921 00:48:01,490 --> 00:48:03,470 of the displacement as an initial guess, 922 00:48:03,470 --> 00:48:06,430 and then this bit gets updated later. 923 00:48:06,430 --> 00:48:09,430 Branches which are predicted by the hardware to be taken 924 00:48:09,430 --> 00:48:12,610 result in a one cycle bubble being inserted 925 00:48:12,610 --> 00:48:17,020 into the pipeline, meaning that if a branch instruction is 926 00:48:17,020 --> 00:48:20,710 fetched from the instruction cache in cycle zero, 927 00:48:20,710 --> 00:48:22,750 it takes all of cycle one to predict 928 00:48:22,750 --> 00:48:24,520 the direction of the branch and generate 929 00:48:24,520 --> 00:48:26,860 the target for the branch so that we are not 930 00:48:26,860 --> 00:48:29,470 able to go back to I-cache for that target 931 00:48:29,470 --> 00:48:32,830 until what would be cycle two for the branch. 932 00:48:32,830 --> 00:48:35,290 This bubble, which gets inserted into the pipeline, 933 00:48:35,290 --> 00:48:41,350 can in some cases be removed later. 934 00:48:41,350 --> 00:48:44,320 In pipe stage two, labeled I0, we 935 00:48:44,320 --> 00:48:47,010 perform more instruction decode. 936 00:48:47,010 --> 00:48:50,490 And in pipe stage three, labeled I1, all the real action 937 00:48:50,490 --> 00:48:51,750 takes place. 938 00:48:51,750 --> 00:48:55,020 In that pipe stage, we read operands from the register 939 00:48:55,020 --> 00:48:58,620 file and perform resource checks to see whether the instruction 940 00:48:58,620 --> 00:49:00,570 can be issued. 941 00:49:00,570 --> 00:49:03,270 These resource checks include both the availability 942 00:49:03,270 --> 00:49:06,150 of register operands as well as the availability 943 00:49:06,150 --> 00:49:09,410 of the on-chip functional units. 
944 00:49:09,410 --> 00:49:11,900 After pipe stage three, instructions either 945 00:49:11,900 --> 00:49:14,660 execute to completion and write the results in the register 946 00:49:14,660 --> 00:49:18,450 file, or perhaps they get aborted for a host of reasons, 947 00:49:18,450 --> 00:49:20,960 which I'll describe later. 948 00:49:20,960 --> 00:49:24,080 Within the IBOX, there's still a few more activities 949 00:49:24,080 --> 00:49:26,620 that happen later down the pipe. 950 00:49:26,620 --> 00:49:29,110 As Dick mentioned, conditional branches 951 00:49:29,110 --> 00:49:31,000 test a register in order to decide 952 00:49:31,000 --> 00:49:32,890 whether to take the branch. 953 00:49:32,890 --> 00:49:35,590 Since we read the register in cycle three, 954 00:49:35,590 --> 00:49:38,320 we can test its value against zero 955 00:49:38,320 --> 00:49:40,750 in the beginning of cycle four and be in a position 956 00:49:40,750 --> 00:49:45,010 to determine whether the branch should in fact be taken or not. 957 00:49:45,010 --> 00:49:48,820 Notice then that although a branch instruction is fetched 958 00:49:48,820 --> 00:49:51,730 from the instruction cache in cycle zero, 959 00:49:51,730 --> 00:49:55,340 we don't know whether it's going to be taken until cycle four. 960 00:49:55,340 --> 00:49:58,760 What this means is that if the branch was predicted 961 00:49:58,760 --> 00:50:01,760 incorrectly, we incur a four cycle penalty 962 00:50:01,760 --> 00:50:06,150 to go back and get the true branch target. 963 00:50:06,150 --> 00:50:08,730 Also in cycle four, we're in a position 964 00:50:08,730 --> 00:50:12,390 to generate the true virtual PC for the instruction 965 00:50:12,390 --> 00:50:14,740 one stage back in the pipeline. 966 00:50:14,740 --> 00:50:18,850 So we have this virtual PC at the end of pipe stage four. 
967 00:50:18,850 --> 00:50:22,570 In pipeline stage five, we can, from this virtual PC, 968 00:50:22,570 --> 00:50:25,240 generate the corresponding physical PC 969 00:50:25,240 --> 00:50:27,790 so that, at the end of pipe stage five 970 00:50:27,790 --> 00:50:29,510 and the beginning of pipe stage six, 971 00:50:29,510 --> 00:50:31,510 we can determine whether the instruction that we 972 00:50:31,510 --> 00:50:35,080 fetched five cycles ago, in fact, 973 00:50:35,080 --> 00:50:37,350 hit in the instruction cache. 974 00:50:37,350 --> 00:50:39,930 So this means that if an instruction fetched 975 00:50:39,930 --> 00:50:42,850 in pipe stage zero missed the cache, 976 00:50:42,850 --> 00:50:46,130 we don't know it until five cycles later. 977 00:50:46,130 --> 00:50:49,300 Moving on to the activities which happen in the EBOX, 978 00:50:49,300 --> 00:50:52,140 I'll start in pipe stage four. 979 00:50:52,140 --> 00:50:58,410 In that pipeline stage, the EBOX has two 64-bit operands 980 00:50:58,410 --> 00:51:01,080 on which to perform an operation. 981 00:51:01,080 --> 00:51:05,190 Most operations in the EBOX can be performed in a single cycle. 982 00:51:05,190 --> 00:51:09,540 These include add and subtract and the simple logic functions 983 00:51:09,540 --> 00:51:12,000 like and, or, and xor. 984 00:51:12,000 --> 00:51:14,490 So at the end of pipeline stage four, 985 00:51:14,490 --> 00:51:16,860 the results of these instructions are ready for use 986 00:51:16,860 --> 00:51:20,400 by subsequent instructions, although the results won't 987 00:51:20,400 --> 00:51:23,010 be, in fact, written to the register file for two more 988 00:51:23,010 --> 00:51:23,880 cycles. 989 00:51:23,880 --> 00:51:27,150 We can supply the data on bypass paths 990 00:51:27,150 --> 00:51:30,330 back to the multiplexor which physically 991 00:51:30,330 --> 00:51:34,460 sits between the integer register file and the EBOX. 
992 00:51:34,460 --> 00:51:38,320 Shift operations take two cycles to complete. 993 00:51:38,320 --> 00:51:39,880 The results of shifts are therefore 994 00:51:39,880 --> 00:51:42,220 available in pipe stage five. 995 00:51:42,220 --> 00:51:45,490 Shift operations therefore have a latency of two cycles, 996 00:51:45,490 --> 00:51:47,650 although they can be fully pipelined. 997 00:51:47,650 --> 00:51:50,770 Also in pipe stage five, we produce, 998 00:51:50,770 --> 00:51:53,920 based on the results generated in the EBOX, 999 00:51:53,920 --> 00:51:57,010 a bit which indicates whether that result is zero. 1000 00:51:57,010 --> 00:51:59,620 We then store that bit in the register file. 1001 00:51:59,620 --> 00:52:03,900 This makes branch instructions easier to execute. 1002 00:52:03,900 --> 00:52:06,480 Lastly, in pipe stage six, the EBOX 1003 00:52:06,480 --> 00:52:09,320 writes its results into the register file. 1004 00:52:09,320 --> 00:52:10,990 Now, referencing the same diagram, 1005 00:52:10,990 --> 00:52:12,740 I'm going to describe the operations which 1006 00:52:12,740 --> 00:52:15,690 occur within the ABOX. 1007 00:52:15,690 --> 00:52:17,760 The ABOX receives its instructions 1008 00:52:17,760 --> 00:52:21,930 issued from the IBOX at the end of pipe stage three. 1009 00:52:21,930 --> 00:52:24,880 In pipe stage four, it takes the operand 1010 00:52:24,880 --> 00:52:27,960 supplied to it from the register file and adds to it 1011 00:52:27,960 --> 00:52:30,170 the displacement. 1012 00:52:30,170 --> 00:52:32,090 Thus, at the end of pipe stage four, 1013 00:52:32,090 --> 00:52:35,310 it has an effective virtual address. 1014 00:52:35,310 --> 00:52:37,470 Because the size of the data cache 1015 00:52:37,470 --> 00:52:41,970 is equal to the size of the pages used by the memory 1016 00:52:41,970 --> 00:52:44,760 management system, we can start looking 1017 00:52:44,760 --> 00:52:48,390 in the D-cache at the beginning of pipe stage five. 
1018 00:52:48,390 --> 00:52:50,400 We don't have to wait for the translation buffer 1019 00:52:50,400 --> 00:52:53,190 to produce a full physical address. 1020 00:52:53,190 --> 00:52:55,710 Therefore, at the end of pipe stage five, 1021 00:52:55,710 --> 00:52:58,650 we have on our hands the physical address 1022 00:52:58,650 --> 00:53:01,380 that corresponds to the generated virtual address 1023 00:53:01,380 --> 00:53:03,720 and the data which hopefully corresponds 1024 00:53:03,720 --> 00:53:06,690 as well to that virtual address. 1025 00:53:06,690 --> 00:53:08,970 We are in a position to know whether a load 1026 00:53:08,970 --> 00:53:12,720 instruction actually hit or missed at the beginning of pipe 1027 00:53:12,720 --> 00:53:14,890 stage six. 1028 00:53:14,890 --> 00:53:17,080 And if the instruction hit in the data cache, 1029 00:53:17,080 --> 00:53:20,230 we can supply the results of the instruction 1030 00:53:20,230 --> 00:53:24,340 to the functional units by the end of pipe stage six. 1031 00:53:24,340 --> 00:53:27,130 Hence, load instructions which hit in the data cache 1032 00:53:27,130 --> 00:53:31,110 have an effective latency of three CPU cycles. 1033 00:53:31,110 --> 00:53:35,310 For stores, we know as well at the beginning of pipe stage six 1034 00:53:35,310 --> 00:53:37,920 whether the store instruction hit in the data cache. 1035 00:53:37,920 --> 00:53:40,980 If it missed, no further operation 1036 00:53:40,980 --> 00:53:43,020 is required for the data cache. 1037 00:53:43,020 --> 00:53:44,910 The data is moved into the write buffer 1038 00:53:44,910 --> 00:53:49,140 independent of the results of the data cache lookup. 1039 00:53:49,140 --> 00:53:52,080 If the lookup resulted in a data cache hit, 1040 00:53:52,080 --> 00:53:56,340 we write the corresponding data into the data cache array 1041 00:53:56,340 --> 00:53:59,780 the next time that that array is otherwise not busy. 
1042 00:53:59,780 --> 00:54:02,420 Turning next to the diagram which shows the floating point 1043 00:54:02,420 --> 00:54:06,800 pipeline, you see that floating point instructions share 1044 00:54:06,800 --> 00:54:10,760 the first four pipeline stages with integer instructions, 1045 00:54:10,760 --> 00:54:14,240 and those correspond to the activity which I previously 1046 00:54:14,240 --> 00:54:17,240 described in the IBOX. 1047 00:54:17,240 --> 00:54:19,310 The floating point pipeline is quite a bit longer 1048 00:54:19,310 --> 00:54:22,340 than the integer pipeline, and floating point results 1049 00:54:22,340 --> 00:54:25,550 are not available until six CPU cycles 1050 00:54:25,550 --> 00:54:27,620 after the instruction was issued. 1051 00:54:27,620 --> 00:54:30,440 The floating point unit is fully pipelined, however, 1052 00:54:30,440 --> 00:54:32,330 so that we can start a new floating point 1053 00:54:32,330 --> 00:54:34,410 operation every CPU cycle. 1054 00:54:34,410 --> 00:54:36,920 However, as I said, the effective latency 1055 00:54:36,920 --> 00:54:41,233 of this functional unit is six CPU cycles. 1056 00:54:41,233 --> 00:54:42,900 Now I want to talk about the instruction 1057 00:54:42,900 --> 00:54:46,920 issue rules which the hardware within EV4 enforces. 1058 00:54:46,920 --> 00:54:49,890 Generally, if you think back to the diagram which 1059 00:54:49,890 --> 00:54:53,030 showed the functional units within the chip, 1060 00:54:53,030 --> 00:54:56,678 you can see that the possibility for multiple instruction issue 1061 00:54:56,678 --> 00:54:57,470 is certainly there. 1062 00:54:57,470 --> 00:54:59,660 We have four fully pipelined units, 1063 00:54:59,660 --> 00:55:03,500 and the IBOX ought to be able to issue pairs of instructions 1064 00:55:03,500 --> 00:55:04,460 to any of these units. 1065 00:55:04,460 --> 00:55:06,240 And generally, that's the case. 
1066 00:55:06,240 --> 00:55:09,110 We can issue load or store instructions 1067 00:55:09,110 --> 00:55:10,880 with operate instructions. 1068 00:55:10,880 --> 00:55:13,520 We can issue integer operate instructions 1069 00:55:13,520 --> 00:55:15,920 with floating point operate instructions. 1070 00:55:15,920 --> 00:55:19,040 We can issue floating point operate instructions 1071 00:55:19,040 --> 00:55:22,550 with corresponding floating point branches or integer 1072 00:55:22,550 --> 00:55:26,710 operate instructions with integer branches. 1073 00:55:26,710 --> 00:55:30,820 One quirk is that integer stores can't be issued together 1074 00:55:30,820 --> 00:55:34,330 with floating point operates, nor can floating point stores 1075 00:55:34,330 --> 00:55:37,600 be issued together with integer operates. 1076 00:55:37,600 --> 00:55:41,830 This is due to an internal hardware resource constraint 1077 00:55:41,830 --> 00:55:48,190 in the IBOX which, although not really difficult to get around, 1078 00:55:48,190 --> 00:55:50,320 wasn't worth the added complication in terms 1079 00:55:50,320 --> 00:55:53,032 of the performance that it would have bought us. 1080 00:55:53,032 --> 00:55:54,490 Next, I want to address a couple of 1081 00:55:54,490 --> 00:55:56,532 interesting microarchitectural features which we 1082 00:55:56,532 --> 00:55:58,750 included within the EV4 chip. 1083 00:55:58,750 --> 00:56:01,870 The first of these addresses the branch latency problem, 1084 00:56:01,870 --> 00:56:03,730 which Dick referred to earlier. 1085 00:56:03,730 --> 00:56:06,460 Specifically, it addresses the problem associated 1086 00:56:06,460 --> 00:56:08,440 with memory format branches. 1087 00:56:08,440 --> 00:56:11,290 These instructions are used in computed jumps, 1088 00:56:11,290 --> 00:56:15,620 subroutine calls and returns, and the like. 
1089 00:56:15,620 --> 00:56:17,570 The instructions read the register 1090 00:56:17,570 --> 00:56:21,260 file and jump to the virtual address contained 1091 00:56:21,260 --> 00:56:24,710 within the operand register, and they represent 1092 00:56:24,710 --> 00:56:30,040 a special problem for the instruction pre-fetch hardware. 1093 00:56:30,040 --> 00:56:31,990 Thinking back to the earlier diagram which 1094 00:56:31,990 --> 00:56:34,180 showed the machine's pipeline, remember 1095 00:56:34,180 --> 00:56:36,550 that the branch prediction hardware 1096 00:56:36,550 --> 00:56:38,770 runs in pipe stage one of the machine 1097 00:56:38,770 --> 00:56:42,250 while registers aren't read until two cycles later. 1098 00:56:42,250 --> 00:56:45,460 Therefore, in order for the prediction hardware 1099 00:56:45,460 --> 00:56:49,330 to come up with a target for the branch, 1100 00:56:49,330 --> 00:56:51,490 something has to be done in hardware 1101 00:56:51,490 --> 00:56:54,490 if we want to come up with a target early in the pipeline, 1102 00:56:54,490 --> 00:56:57,790 since the actual target isn't available until pipe stage 1103 00:56:57,790 --> 00:56:58,900 three. 1104 00:56:58,900 --> 00:57:00,520 The solution here is to implement 1105 00:57:00,520 --> 00:57:03,070 what we call a JSR stack. 1106 00:57:03,070 --> 00:57:06,580 This stack allows subroutine call and return addresses 1107 00:57:06,580 --> 00:57:08,980 to be maintained back in the pipeline 1108 00:57:08,980 --> 00:57:12,880 by the instruction pre-fetcher and also requires 1109 00:57:12,880 --> 00:57:14,870 use of hint bits. 1110 00:57:14,870 --> 00:57:17,450 On a subroutine call, the compiler 1111 00:57:17,450 --> 00:57:20,690 inserts into the instruction a 2-bit field 1112 00:57:20,690 --> 00:57:24,740 which tells the hardware that this instruction is actually 1113 00:57:24,740 --> 00:57:27,540 implementing a subroutine call. 
1114 00:57:27,540 --> 00:57:31,710 The prediction hardware then takes the virtual address 1115 00:57:31,710 --> 00:57:35,070 of the instruction following the subroutine call 1116 00:57:35,070 --> 00:57:37,570 and places that on a hardware stack. 1117 00:57:37,570 --> 00:57:40,120 When the subsequent return comes along, 1118 00:57:40,120 --> 00:57:42,100 we then pop that stack and use that 1119 00:57:42,100 --> 00:57:45,590 as the target for the return. 1120 00:57:45,590 --> 00:57:47,910 This JSR stack has a depth of four, 1121 00:57:47,910 --> 00:57:50,750 meaning it's capable of holding at any one time 1122 00:57:50,750 --> 00:57:55,000 a stack of four subroutine return addresses. 1123 00:57:55,000 --> 00:57:56,830 We also include hardware to make sure 1124 00:57:56,830 --> 00:57:59,580 that this stack doesn't get corrupted. 1125 00:57:59,580 --> 00:58:02,170 The way it might get corrupted is as follows. 1126 00:58:02,170 --> 00:58:04,650 We could fetch from the instruction cache 1127 00:58:04,650 --> 00:58:07,540 an instruction that's a call. 1128 00:58:07,540 --> 00:58:10,530 We might then send this instruction down the pipeline. 1129 00:58:10,530 --> 00:58:14,070 And later, for whatever reason (an exception, a branch 1130 00:58:14,070 --> 00:58:17,070 mispredict, an instruction cache miss), 1131 00:58:17,070 --> 00:58:19,890 the call instruction may never get executed. 1132 00:58:19,890 --> 00:58:23,070 If we updated the stack when we fetched the instruction rather 1133 00:58:23,070 --> 00:58:25,630 than when we committed to executing the instruction, 1134 00:58:25,630 --> 00:58:27,900 the stack could therefore get out of sync 1135 00:58:27,900 --> 00:58:30,590 with the rest of the machine. 
1136 00:58:30,590 --> 00:58:34,280 Note that this is a performance optimization, 1137 00:58:34,280 --> 00:58:36,680 and if the hardware stack ever gets out 1138 00:58:36,680 --> 00:58:39,170 of sync with the real program flow, 1139 00:58:39,170 --> 00:58:42,980 we do not end up with incorrect operation. 1140 00:58:42,980 --> 00:58:44,540 We merely take a little bit longer 1141 00:58:44,540 --> 00:58:48,560 to fetch the target of a call or return. 1142 00:58:48,560 --> 00:58:50,480 The next feature that I want to describe 1143 00:58:50,480 --> 00:58:52,610 relates to granularity hints. 1144 00:58:52,610 --> 00:58:56,000 The Alpha architecture uses a 2-bit field 1145 00:58:56,000 --> 00:58:58,520 within each page table entry, which 1146 00:58:58,520 --> 00:59:01,700 can be used by software to communicate 1147 00:59:01,700 --> 00:59:06,110 to hardware that a particular page table entry in fact maps 1148 00:59:06,110 --> 00:59:08,090 more than a single page. 1149 00:59:08,090 --> 00:59:10,460 This 2-bit field can be used by the software 1150 00:59:10,460 --> 00:59:12,920 to communicate to the hardware that a given page table 1151 00:59:12,920 --> 00:59:17,270 entry in fact maps a contiguous physical region consisting 1152 00:59:17,270 --> 00:59:23,270 of one page, eight pages, 64 pages, or 512 pages. 1153 00:59:23,270 --> 00:59:27,200 Since EV4 implements an 8 kilobyte page, 1154 00:59:27,200 --> 00:59:33,890 this translates to a region of 8 kilobytes, 64 kilobytes, 512 1155 00:59:33,890 --> 00:59:36,620 kilobytes, or 4 megabytes in size. 1156 00:59:36,620 --> 00:59:40,280 The data stream translation buffer contains 32 entries, 1157 00:59:40,280 --> 00:59:41,810 as I previously stated. 1158 00:59:41,810 --> 00:59:44,090 Each of these entries is able to support 1159 00:59:44,090 --> 00:59:47,270 any of the four granularity hints specified 1160 00:59:47,270 --> 00:59:48,740 by the architecture. 
1161 00:59:48,740 --> 00:59:52,880 This allows a very flexible arrangement for page table 1162 00:59:52,880 --> 00:59:55,310 entries and also allows a relatively small 1163 00:59:55,310 --> 00:59:58,850 associative translation buffer to in fact map 1164 00:59:58,850 --> 01:00:02,480 a large memory space. 1165 01:00:02,480 --> 01:00:05,570 The size of the region mapped by each entry 1166 01:00:05,570 --> 01:00:09,020 is set when the page table entry is written by PAL code 1167 01:00:09,020 --> 01:00:11,690 into the translation buffer structure. 1168 01:00:11,690 --> 01:00:15,440 The I-stream translation buffer consists of 12 entries. 1169 01:00:15,440 --> 01:00:18,830 Eight of these entries are devoted to the smallest 1170 01:00:18,830 --> 01:00:22,220 granularity size, meaning that each of those eight entries 1171 01:00:22,220 --> 01:00:25,040 can only map an 8 kilobyte region 1172 01:00:25,040 --> 01:00:27,980 corresponding to a single page. 1173 01:00:27,980 --> 01:00:32,420 Four of the entries, however, support the largest granularity 1174 01:00:32,420 --> 01:00:35,150 size, which means that each of those entries 1175 01:00:35,150 --> 01:00:38,220 can map a 4 megabyte region. 1176 01:00:38,220 --> 01:00:42,200 This is very useful for mapping large, non-paged instruction 1177 01:00:42,200 --> 01:00:47,030 areas such as the operating system kernel or large shared 1178 01:00:47,030 --> 01:00:48,790 libraries. 1179 01:00:48,790 --> 01:00:52,640 I'm next going to describe EV4's external interface. 1180 01:00:52,640 --> 01:00:56,210 This interface was designed with several goals in mind. 1181 01:00:56,210 --> 01:00:58,240 First of all, we needed the interface 1182 01:00:58,240 --> 01:01:00,220 to be extremely flexible. 
1183 01:01:00,220 --> 01:01:03,490 Since this is the first chip in a new architecture 1184 01:01:03,490 --> 01:01:06,910 and since we didn't have the resources inside the company 1185 01:01:06,910 --> 01:01:09,760 to devote several design teams each to doing 1186 01:01:09,760 --> 01:01:12,430 an implementation targeted towards a particular end 1187 01:01:12,430 --> 01:01:16,420 of the system-wide spectrum, the single design 1188 01:01:16,420 --> 01:01:20,710 needed to be used in systems ranging from low-end desktop 1189 01:01:20,710 --> 01:01:24,490 machines to mid-range multiprocessor servers 1190 01:01:24,490 --> 01:01:27,970 all the way up to high-end, massively parallel 1191 01:01:27,970 --> 01:01:29,860 supercomputer systems. 1192 01:01:29,860 --> 01:01:32,770 With that in mind, we needed each system designer 1193 01:01:32,770 --> 01:01:35,860 to be able to make his or her own cost performance 1194 01:01:35,860 --> 01:01:37,030 trade-offs. 1195 01:01:37,030 --> 01:01:39,040 Lastly, because of the schedule constraints 1196 01:01:39,040 --> 01:01:41,440 confronted by our internal system partners, 1197 01:01:41,440 --> 01:01:44,320 we wanted to define an external interface around which you 1198 01:01:44,320 --> 01:01:48,670 could design a system using off the shelf industry standard 1199 01:01:48,670 --> 01:01:51,030 components. 1200 01:01:51,030 --> 01:01:54,650 The next graphic shows a crude schematic 1201 01:01:54,650 --> 01:01:58,290 for the external interface of the EV4 chip. 1202 01:01:58,290 --> 01:02:02,250 The chip takes as its input a 2x clock. 1203 01:02:02,250 --> 01:02:05,550 So for example, a chip that operates internally 1204 01:02:05,550 --> 01:02:10,300 at 200 megahertz requires a 400 megahertz input oscillator. 1205 01:02:10,300 --> 01:02:12,090 This is the only high speed signal 1206 01:02:12,090 --> 01:02:14,640 that a system designer really has to worry about. 
1207 01:02:14,640 --> 01:02:16,680 The rest of the interface can be programmed 1208 01:02:16,680 --> 01:02:19,170 to operate much more slowly. 1209 01:02:19,170 --> 01:02:21,330 The chip also supports an external cache, 1210 01:02:21,330 --> 01:02:23,880 although this cache is not required 1211 01:02:23,880 --> 01:02:25,650 by the chip design itself. 1212 01:02:25,650 --> 01:02:28,920 Its presence or absence is left solely 1213 01:02:28,920 --> 01:02:31,780 up to the system designer. 1214 01:02:31,780 --> 01:02:36,170 For systems which do implement an off-chip cache, 1215 01:02:36,170 --> 01:02:38,930 the chip is capable of accessing that cache 1216 01:02:38,930 --> 01:02:43,430 with no help from external system-level logic. 1217 01:02:43,430 --> 01:02:47,330 So for example, if a load instruction 1218 01:02:47,330 --> 01:02:50,630 misses the on-chip cache, the on-chip BIU 1219 01:02:50,630 --> 01:02:53,600 can read the off-chip cache RAMs with no interaction 1220 01:02:53,600 --> 01:02:56,620 from the outside world. 1221 01:02:56,620 --> 01:02:58,270 This relieves the system designer 1222 01:02:58,270 --> 01:03:02,740 from having to design logic which has to directly interface 1223 01:03:02,740 --> 01:03:07,450 to the high speed clocking domain inside the chip. 1224 01:03:07,450 --> 01:03:12,260 The graphic shows three RAM structures which 1225 01:03:12,260 --> 01:03:14,300 make up the external cache. 1226 01:03:14,300 --> 01:03:18,140 The tag control RAM consists of four bits: a valid bit, 1227 01:03:18,140 --> 01:03:21,560 a shared bit, a dirty bit, and a parity bit 1228 01:03:21,560 --> 01:03:25,760 which contains parity across the previously mentioned 1229 01:03:25,760 --> 01:03:27,920 three bits. 
1230 01:03:27,920 --> 01:03:31,820 The tag RAM contains a tag corresponding 1231 01:03:31,820 --> 01:03:34,460 to each block in the external cache, 1232 01:03:34,460 --> 01:03:38,240 and the data and check RAMs contain the cache data 1233 01:03:38,240 --> 01:03:42,340 and associated check or parity bits. 1234 01:03:42,340 --> 01:03:45,670 As I previously stated, the external bus 1235 01:03:45,670 --> 01:03:50,790 can be either 64-bits or 128-bits wide, as determined 1236 01:03:50,790 --> 01:03:52,380 by the system designer's need. 1237 01:03:52,380 --> 01:03:55,170 1238 01:03:55,170 --> 01:03:58,740 To begin a cache reference, the chip drives an address out 1239 01:03:58,740 --> 01:04:00,840 onto its address bus. 1240 01:04:00,840 --> 01:04:03,840 This address and associated RAM control flow 1241 01:04:03,840 --> 01:04:06,210 through system-dependent logic which physically 1242 01:04:06,210 --> 01:04:09,210 sits between these buses and the RAMs. 1243 01:04:09,210 --> 01:04:12,150 The RAM access is completely combinatorial. 1244 01:04:12,150 --> 01:04:14,790 The chip drives a new address out to the RAMs 1245 01:04:14,790 --> 01:04:17,460 and then waits a user-specified period of time 1246 01:04:17,460 --> 01:04:21,570 before sampling the associated tag and data fields. 1247 01:04:21,570 --> 01:04:26,640 Some external transactions cannot be completed by EV4 alone 1248 01:04:26,640 --> 01:04:28,350 using the external cache RAMs. 1249 01:04:28,350 --> 01:04:32,130 And for these transactions, system-level components 1250 01:04:32,130 --> 01:04:34,910 have to get into the picture. 1251 01:04:34,910 --> 01:04:38,030 Information between EV4 and these components 1252 01:04:38,030 --> 01:04:40,040 flows through the miscellaneous control 1253 01:04:40,040 --> 01:04:42,350 field listed on this graphic. 
1254 01:04:42,350 --> 01:04:44,930 This field consists of a command field 1255 01:04:44,930 --> 01:04:49,010 onto which EV4 drives commands such as read, write, 1256 01:04:49,010 --> 01:04:52,490 memory barrier, load locked, et cetera, and a pair 1257 01:04:52,490 --> 01:04:55,880 of acknowledgment fields, which system-level components drive 1258 01:04:55,880 --> 01:04:57,890 back to EV4. 1259 01:04:57,890 --> 01:05:01,730 These control fields operate synchronously 1260 01:05:01,730 --> 01:05:06,740 with the internal chip clock but phase aligned to a system 1261 01:05:06,740 --> 01:05:10,310 clock, which is programmable. 1262 01:05:10,310 --> 01:05:13,650 This system clock can be user specified 1263 01:05:13,650 --> 01:05:16,890 to operate at frequencies ranging from 1/2 to 1/8 1264 01:05:16,890 --> 01:05:20,050 the on-chip CPU clock rate. 1265 01:05:20,050 --> 01:05:23,140 In keeping with the previously mentioned 1266 01:05:23,140 --> 01:05:26,560 external interface goals of performance, flexibility, 1267 01:05:26,560 --> 01:05:30,220 and simplicity, EV4 supports external caches 1268 01:05:30,220 --> 01:05:34,570 ranging in size from 128 kilobytes to 8 megabytes. 1269 01:05:34,570 --> 01:05:38,740 Or, as I previously said, we could have no cache at all. 1270 01:05:38,740 --> 01:05:41,710 The external cache, if implemented by the system 1271 01:05:41,710 --> 01:05:45,220 designer, can be built with a variety of industry standard 1272 01:05:45,220 --> 01:05:46,300 RAMs. 1273 01:05:46,300 --> 01:05:49,090 High-end systems can use the fastest, 1274 01:05:49,090 --> 01:05:53,440 largest static RAMs they can buy while low-end systems can use 1275 01:05:53,440 --> 01:05:56,770 smaller, cheaper, slower RAMs. 
1276 01:05:56,770 --> 01:05:58,750 The RAM timing is set by software 1277 01:05:58,750 --> 01:06:02,530 and can range from 3 to 16 CPU cycles, 1278 01:06:02,530 --> 01:06:04,510 determined by the value which software 1279 01:06:04,510 --> 01:06:08,290 places into an on-chip control register. 1280 01:06:08,290 --> 01:06:11,230 The external interface, although it runs synchronously 1281 01:06:11,230 --> 01:06:14,920 to the on-chip CPU clock, can be specified 1282 01:06:14,920 --> 01:06:18,400 to run anywhere from 1/2 to 1/8 the speed 1283 01:06:18,400 --> 01:06:22,360 of this on-chip clock, again, allowing cost-performance 1284 01:06:22,360 --> 01:06:26,310 trade-offs to be made by the system designer. 1285 01:06:26,310 --> 01:06:29,130 Lastly, there are no cache policy decisions 1286 01:06:29,130 --> 01:06:31,230 enforced by the EV4 chip. 1287 01:06:31,230 --> 01:06:34,950 This means that it's left to the system designer to decide 1288 01:06:34,950 --> 01:06:38,670 what cache coherence protocol is appropriate for him or her. 1289 01:06:38,670 --> 01:06:40,950 The valid, dirty, and shared bits, 1290 01:06:40,950 --> 01:06:45,030 as I previously described within the tag control field, 1291 01:06:45,030 --> 01:06:47,640 imply a bias towards a conditional 1292 01:06:47,640 --> 01:06:51,840 write-through cache coherence protocol in name only. 1293 01:06:51,840 --> 01:06:56,430 Really, the shared bit simply specifies to EV4 that it 1294 01:06:56,430 --> 01:07:00,330 cannot write to an external cache block by itself 1295 01:07:00,330 --> 01:07:03,870 but requires external module-level interaction 1296 01:07:03,870 --> 01:07:06,290 in order to complete the write. 
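The two timing knobs just mentioned (a RAM access time of 3 to 16 CPU cycles, and a system clock at 1/2 to 1/8 of the CPU clock) translate directly into nanoseconds at the 200 MHz, 5 ns cycle time quoted earlier in this talk. A small helper sketch; the clamping of out-of-range values is an assumption for illustration, not documented register behavior.

```c
#include <assert.h>

/* External-cache read latency in nanoseconds, given the CPU cycle
 * time and the software-programmed RAM access time (3..16 cycles). */
static double ext_read_ns(double cpu_cycle_ns, int ram_cycles)
{
    if (ram_cycles < 3)  ram_cycles = 3;   /* clamp to the legal range */
    if (ram_cycles > 16) ram_cycles = 16;  /* (assumed behavior)       */
    return cpu_cycle_ns * ram_cycles;
}

/* System-clock period for a divisor of 2..8 of the CPU clock. */
static double sys_clock_ns(double cpu_cycle_ns, int divisor)
{
    if (divisor < 2) divisor = 2;
    if (divisor > 8) divisor = 8;
    return cpu_cycle_ns * divisor;
}
```

At 5 ns per CPU cycle, the fastest setting gives a 15 ns external read and the slowest an 80 ns read, which is the cost-performance range the system designer trades across by choosing RAMs.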
1297 01:07:06,290 --> 01:07:08,920 This means that, in addition to a conditional 1298 01:07:08,920 --> 01:07:12,640 write-through protocol, the system designer could implement 1299 01:07:12,640 --> 01:07:16,360 a pure ownership coherence protocol where the shared 1300 01:07:16,360 --> 01:07:20,080 bit becomes really more of an ownership bit, 1301 01:07:20,080 --> 01:07:23,590 or the system designer could implement 1302 01:07:23,590 --> 01:07:27,280 a pure write-through protocol where the shared bit is always 1303 01:07:27,280 --> 01:07:28,490 set. 1304 01:07:28,490 --> 01:07:32,200 This means that EV4 will always request help 1305 01:07:32,200 --> 01:07:37,160 from module-level components before writing data 1306 01:07:37,160 --> 01:07:39,080 into the external cache. 1307 01:07:39,080 --> 01:07:43,070 The external interface also allows the system designer 1308 01:07:43,070 --> 01:07:47,360 to implement either a 128-bit-wide or 64-bit-wide 1309 01:07:47,360 --> 01:07:48,620 external bus. 1310 01:07:48,620 --> 01:07:53,630 The chip supports both CMOS/TTL and ECL 1311 01:07:53,630 --> 01:07:57,170 and supports either longword error-correcting codes 1312 01:07:57,170 --> 01:07:59,330 or longword parity. 1313 01:07:59,330 --> 01:08:03,410 The chip does not perform on-chip invalidate filtering 1314 01:08:03,410 --> 01:08:06,260 for Dcache invalidates but does provide a mechanism 1315 01:08:06,260 --> 01:08:08,120 so that system designers can filter 1316 01:08:08,120 --> 01:08:13,340 these invalidates off-chip by implementing a backmap. 1317 01:08:13,340 --> 01:08:15,860 Here's a die photograph of EV4. 1318 01:08:15,860 --> 01:08:18,710 On the left is the 8-kilobyte instruction cache 1319 01:08:18,710 --> 01:08:22,920 with the branch history table located just below that. 
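The shared-bit rule described above reduces to a simple predicate: the chip may complete a write to an external cache block on its own only if the block is valid and not shared; otherwise it must raise a command to module-level components first. A sketch with illustrative field names:

```c
#include <assert.h>
#include <stdbool.h>

/* The tag control bits mentioned in the talk; the struct layout
 * is illustrative. */
typedef struct {
    bool valid;
    bool shared;  /* set => writes need module-level help            */
    bool dirty;   /* block differs from memory (ownership-style use) */
} tag_control;

/* True when EV4 can complete the write locally; false when it must
 * request help from system-level components before writing. */
bool can_write_locally(tag_control t)
{
    return t.valid && !t.shared;
}
```

The three protocols in the talk are just three policies for managing that one bit: conditional write-through clears it when the block is private, an ownership protocol clears it once the block is owned, and pure write-through keeps it permanently set so every write goes off-chip.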
1320 01:08:22,920 --> 01:08:26,160 Across the top of the die, you see the instruction 1321 01:08:26,160 --> 01:08:29,370 and execute units, with the EBOX on the left 1322 01:08:29,370 --> 01:08:32,069 and the IBOX on the right. 1323 01:08:32,069 --> 01:08:35,000 In the middle of the chip, you see integer control, 1324 01:08:35,000 --> 01:08:37,700 with the clock driver extending horizontally 1325 01:08:37,700 --> 01:08:40,590 across the center of the die. 1326 01:08:40,590 --> 01:08:43,350 Below the clock driver is located the floating-point 1327 01:08:43,350 --> 01:08:47,069 unit, while the data cache occupies 1328 01:08:47,069 --> 01:08:49,775 the right edge of the die. 1329 01:08:49,775 --> 01:08:54,420 The write buffer can be seen in the upper right-hand corner. 1330 01:08:54,420 --> 01:08:58,109 In conclusion, EV4 is the first implementation 1331 01:08:58,109 --> 01:08:59,880 of the Alpha architecture. 1332 01:08:59,880 --> 01:09:03,000 It's a 200-megahertz single-chip microprocessor 1333 01:09:03,000 --> 01:09:04,740 with a flexible interface to allow 1334 01:09:04,740 --> 01:09:08,490 it to be used in systems ranging from desktop workstations 1335 01:09:08,490 --> 01:09:11,060 to massively parallel supercomputers. 1336 01:09:11,060 --> 01:09:14,760 It'll be supported initially by both VMS and OSF/1, 1337 01:09:14,760 --> 01:09:17,729 with other operating systems to follow, 1338 01:09:17,729 --> 01:09:21,120 and it's the first in the family, with more implementations 1339 01:09:21,120 --> 01:09:22,660 to come. 1340 01:09:22,660 --> 01:09:24,640 Speaking as a hardware system designer, 1341 01:09:24,640 --> 01:09:27,130 I'm grateful for people like Dick Sites, who 1342 01:09:27,130 --> 01:09:29,350 thought long and hard about the problems faced 1343 01:09:29,350 --> 01:09:31,390 by high-performance system designers 1344 01:09:31,390 --> 01:09:33,670 when designing the Alpha architecture. 
1345 01:09:33,670 --> 01:09:36,130 I think that the advantages of this architecture 1346 01:09:36,130 --> 01:09:38,740 are apparent in the high performance achieved 1347 01:09:38,740 --> 01:09:41,740 by its initial implementation, and the advantages 1348 01:09:41,740 --> 01:09:44,660 will become even more apparent in the future. 1349 01:09:44,660 --> 01:09:46,960 I hope this tape has been beneficial to you, 1350 01:09:46,960 --> 01:09:49,780 and thank you very much. 1351 01:09:49,780 --> 01:09:52,530 [MUSIC PLAYING] 1352 01:09:52,530 --> 01:10:59,000