[MUSIC PLAYING]

Hi, I'm Josh Fisher from HP Labs. I'm glad to be here today. I'll be talking about instruction-level parallelism. ILP is very different from the parallel processing we hear so much about. Rather than having separate processors work on separate chunks of a program that's been written to do that, ILP is an architectural technique that lies at the bottom end of the CPU and allows the overlap of ordinary RISC-level operations, so that you might be doing a multiply and an add at the same time. The same multiply and add that appear in the user's program, only now they're issued one after the other without the second one waiting for the first to finish, for example. Or they might be issued at the same time.

Instruction-level parallelism is meant to be transparent. That is, the user should not see any of it. We take ordinary programs and make them go a lot faster using this technique. It's an old technique; it's been around for over 30 years in the fastest computers. And after a 30-year history in which it really didn't do an awful lot, particularly for the last 20 of those years, it started really taking off in the '80s. It's become very much the hot topic in architecture conferences, and you just don't build a fast microprocessor these days without using instruction-level parallelism.

It's a very simple concept. Look at a small chunk of code. Here we have just eight operations: some multiplies, an add of their results, two more multiplies, an add of those results, and a couple of adds. I just took this random chunk of code out of an application. Using instruction-level parallelism, you might issue the first multiply in the first cycle. But rather than just stopping there, you might at the same time notice that this add down at the bottom doesn't depend upon anything that came earlier.
These other adds do have to wait for the multiplies that are producing their data. But this add is data independent, and you certainly could start executing it right away. An ordinary computer typically has separate floating-point and integer units, and that integer unit isn't doing anything. So you might get the bright idea of issuing the integer operation at the same time as the floating-point operation rather than waiting. That way you're overlapping operations, and you're able to get a speedup.

In this little expression, if you did each of these operations one at a time, you'd find it took 12 cycles, say, if the multiplies took two cycles and the adds took one; there are eight operations, and four of them are multiplies. If you look at what might happen instead, we're able to complete the same work in six cycles, so we get a factor-of-two speedup. That's a typical sort of thing you can do with this kind of hardware arrangement: hardware you pretty much already have, plus the ability to overlap a lot of the operations.

Now, notice I haven't said anything at all about how you go from being handed this code to this record of execution, of what actually happened when it went twice as fast. The technology of instruction-level parallelism is about getting, somehow, from a sequential program automatically to this kind of speedup, or a good deal more, as we'll see.
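To make the idea concrete, here is a toy greedy scheduler sketched in Python. The operation mix, latencies, and functional-unit counts are my own stand-ins rather than the code on the slide, so the cycle counts it prints are only illustrative, not a reproduction of the 12-versus-6 figure.

    # A toy greedy scheduler in the spirit of that example.  The operation
    # mix, the latencies, and the functional-unit counts below are made-up
    # stand-ins for illustration.
    LATENCY = {"fmul": 2, "fadd": 1, "iadd": 1}
    UNITS = {"fmul": 1, "fadd": 1, "iadd": 1}   # one pipelined unit of each kind

    ops = [                                      # (name, kind, results it must wait for)
        ("m1", "fmul", []), ("m2", "fmul", []), ("a1", "fadd", ["m1", "m2"]),
        ("m3", "fmul", []), ("m4", "fmul", []), ("a2", "fadd", ["m3", "m4"]),
        ("a3", "fadd", ["a1", "a2"]),
        ("a4", "iadd", []),                      # the data-independent integer add
    ]

    sequential = sum(LATENCY[kind] for _, kind, _ in ops)   # one at a time

    ready_at, remaining, cycle, finish = {}, list(ops), 0, 0
    while remaining:
        free = dict(UNITS)                       # each unit can start one op per cycle
        for op in list(remaining):
            name, kind, deps = op
            data_ready = all(ready_at.get(d, 1 << 30) <= cycle for d in deps)
            if data_ready and free[kind] > 0:    # dependence test, then resource test
                free[kind] -= 1
                ready_at[name] = cycle + LATENCY[kind]
                finish = max(finish, ready_at[name])
                remaining.remove(op)
        cycle += 1

    print("one at a time:", sequential, "cycles;  overlapped:", finish, "cycles")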
If we look at the history of instruction-level parallelism, it turns out to have been tremendously influenced by the question of how much ILP there is in typical programs, and by people's beliefs about that. In fact, as we'll see later, that's still a very controversial question; people really don't have a good handle on the answer. But if you look historically at what happened with machines like this, in the late '50s there were some experimental machines being built. Then in the early '60s, CDC delivered the 6600, the fastest scientific computer of its time, sort of the supercomputer of its day. It had eight functional units; they built extra functional units so that you could attempt to start a new operation every single cycle, whether or not the previous ones had finished. That was a very successful machine, and a lot of them were used for a lot of important applications. At the same time, IBM was building the 360/91, which had a lot of instruction-level parallelism as well, of a rather different kind underneath, and we'll talk about that in a few minutes. It was not a commercial success, but there were follow-ons to that machine that used the same kind of CPU design and were very successful.

So in the early '60s, this was the way to make computers go faster. If you talked about parallel processing then, you meant this kind of stuff. Nobody called it instruction-level parallelism; it was just what people did to make computers go faster. It was a topic that got some academic interest and certainly a lot of design interest. And people started asking the very natural question of how much parallelism there is in typical programs.

Several groups set out to answer that question. There were three that I know of for sure, and I believe there were probably others as well. They all got the same answer, and the way they tried to answer the question was as follows. They said: imagine you have a bunch of ordinary programs, but imagine that you also have infinitely much hardware to run these programs on, trying to get as much instruction-level parallelism as you possibly can. So you have 5,000 integer adders, or as many as you'd like, as many floating-point units as you'd like, as much memory bandwidth and as many registers as you need, and the ability to issue as many operations on the tick of a clock as you possibly could.
And then the question is: how much of this kind of parallelism would you get running ordinary programs on machines of that ability? The answer was very uniform, and people still repeat these experiments today, either on purpose or without knowing this history, and you always get pretty much the same range: about one and a half to two and a half or so is the kind of number you get.
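Roughly speaking, what those limit studies compute can be sketched in a few lines of Python; the trace format here is invented, operations are given unit latency, registers are treated as infinitely renamable, and, like the original studies, no operation moves past the end of its basic block.

    # A limit-study style measurement: infinite functional units, perfect
    # knowledge of register dependences, and no motion past branches, so
    # each basic block is scheduled by itself.  Trace format is invented:
    # each block is a list of (destination register, [source registers]).
    def block_cost(block):
        ready = {}                      # register -> cycle its value is available
        depth = 0
        for dest, srcs in block:
            start = max((ready.get(r, 0) for r in srcs), default=0)
            ready[dest] = start + 1     # unit latency, and never a wait for a unit
            depth = max(depth, ready[dest])
        return len(block), depth        # operations executed, dataflow critical path

    def available_ilp(blocks):
        ops = sum(block_cost(b)[0] for b in blocks)
        cycles = sum(block_cost(b)[1] for b in blocks)
        return ops / cycles             # the early studies reported about 1.5 to 2.5

    block = [("r3", ["r1", "r2"]), ("r4", ["r3", "r1"]), ("r5", ["r1", "r2"])]
    print(available_ilp([block]))       # 3 operations over a 2-cycle critical path: 1.5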
That was not considered a very wonderful result. People were pretty pessimistic about all of this when they heard it, and in fact interest in instruction-level parallelism died out around the same time these experiments came out. I can't know for sure that this was a huge factor, but talking to people who were in the computer architecture and compiler worlds at the time, I've heard very often: oh yes, this was a big factor; we just thought this technology had run out of steam right there. And I remember very clearly, as a young university professor, going off to Washington to get my first grants for work in this area. I wanted to do something about getting more parallelism, and I had ways of doing it. And I remember the people in Washington telling me very clearly: oh, don't bother with this. This is the Flynn bottleneck. It even had a name; Flynn had done some of these experiments. You can't get that kind of parallelism; it's just not there to be had. Go do something else. Fortunately, I had read these papers and had a ready answer. But it was very striking how clear that perception was.

Well, it turns out that this belief that there's only that much parallelism is really wrong. It's wrong for a big, general reason, and it's wrong because the experimenters missed one specific, narrow thing right away that makes the results not quite what you want to know when you measure this. The general thing is the question of code transformations. A compiler can do things to code that will expose more of this instruction-level parallelism, and there's no way to predict just what the compiler will do that will give you more of it. There are lots of tricks, similar, for example, to vectorizing tricks, that give you more and more instruction-level parallelism compared to the way the program was originally expressed by the programmer.

If you think about it, in fact, there's a silly argument, a little less silly than it seems at first, that says these limit studies really can't make sense as a flat statement of how much parallelism there is. After all, you could reduce a computation to a table lookup by computing ahead of time all the possible answers for all the possible ranges of data and then doing a one-cycle table lookup to get the computation done in a single cycle. So you take a big matrix of programs and data sets and just look up your answer for the program and data set at hand, and do it in one cycle. Now, that's silly; of course nothing remotely like that is practical. But in fact, a lot of the speedups one does reduce to that trick, and there are even ways in which compilers, not to mention hand coders, will substitute table lookups for more complex, longer computations. So this really is a problem for trying to make a flat statement about how much parallelism there is. A more accurate statement is much narrower: it's conditioned on the compiler technology available to the experimenter at the time. And very rarely do they even have state-of-the-art compiling techniques when they do these experiments, never mind compiling techniques comparable to what we'll have in the future. So in some sense these limit studies are very interesting and very informative, but people should not take them as some absolute that we'll never get beyond.
But there was a much bigger problem right away, and it concerned conditional branches. When these limit studies were done in the early '70s, people said: well, in an ordinary program we have tests and branches. If something is true, go this way; otherwise go that way. And you might want to raise an operation up from beyond one of these branches into the place where you're right now considering how much you can issue. So here you are, issuing operations, pulling them up from below, but you've now run into a conditional branch, and you'd like to pull an operation up from one side or the other. Well, they reasoned, you can't do that, because after all, if you pull an operation up, it's bound to have side effects; there must be a reason you're doing it. And suppose the branch goes the other way. Then you'll have done something that affected your computation, and you didn't mean to do it, and that would be bad. Of course, you can think of some little special cases where it isn't so bad and you get away with it. But people basically felt that this was not a source of big speedups for instruction-level parallelism.

What's interesting is that one of the groups that did this experiment, Riseman and Foster at UMass Amherst, did a second experiment in which they suggested there is a way around this conditional branch problem. They looked at that and measured what could happen, and it really was a very interesting result that they got.

They termed this differently, but I think this is the way you can look at it and understand it best. They said: suppose you take the program in advance and unwind every possible path of computation. So write down every path through the code that the code could take when it runs. At each conditional jump, you now write down two ways: the one leading up to here and going that way, and the one leading up to here and going the other way. So now you've written down every single path through the code.
Now, there are an awful lot of those paths. But just imagine for a second that you had done it. Don't forget, the hypothesis of this experiment is infinite hardware, all you need. So you have a big piece of paper, you write all this stuff down, and then you build a processor for each one of these paths, and you hand that path to that processor. And now you start computing. A processor that's been given a path doesn't have to worry about these conditional jumps anymore. Some processor was told to go this way and then this way. So if it wants to pull an operation up above those jumps, no problem: it knows that in its world the jumps are going this way and the branches are going that way. So it pulls it up. Each processor in turn stops worrying about the branches and starts pulling up all the code that's data ready.

Now, every time a branch resolves, you have a whole lot of processors to throw away: all the ones that had been sent some way inconsistent with the direction in which the branch resolved. So you just throw those away. No problem; infinite hardware. Eventually some processor will have been handed exactly the right path through the code, and that processor will have gotten instruction-level parallelism without having to worry about the conditional jumps. And it's going to get more parallelism than you'd get if you stopped at branches. Well, how much more? A huge amount more, as it turned out. I think the programs and data they looked at averaged a factor of 54 or 55 speedup. And there were some where they said, look, this is silly, this is just dependent on the data set size; you can get any number you want with some of these, and that's certainly true. But there is an order of magnitude and more of speedup available if you could somehow go past conditional branches.
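The Riseman and Foster bookkeeping is easy to restate as a sketch; the branch outcomes below are arbitrary, and the point is only the count of machine copies, not any real mechanism.

    from itertools import product

    # Eager execution across k unresolved branches needs one machine copy
    # per possible combination of outcomes, 2**k in all, and each branch
    # that resolves throws away every copy that guessed it wrong.
    def eager_copies(actual_outcomes):
        copies = list(product([True, False], repeat=len(actual_outcomes)))
        print("copies needed up front:", len(copies))     # 2**16 = 65,536 for k = 16
        for i, outcome in enumerate(actual_outcomes):
            copies = [c for c in copies if c[i] == outcome]   # discard the wrong guesses
        print("copies that guessed the whole path:", len(copies))   # always exactly 1

    eager_copies([True] * 16)       # the outcomes themselves don't matter here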
So then they actually set out to see how crazy an idea this was. What they measured was that if you wanted even a factor of 10 speedup, then at the worst place, the place densest with machines still running the program, you had to have 16 unresolved conditional jumps. That meant you had to have 2 to the 16th, or roughly 64,000, copies of your machine. And that's just to get a factor of 10 speedup; it doesn't even speak to all the machines you had to throw out over the whole rest of the program, which is probably a good deal more than that. So they reasoned that this is ludicrous, a preposterous amount of hardware to throw at the problem in order to get this speedup.

And they wrote a summary. These were very good experimenters, good computer scientists who have gone on to do good work in computer science besides this, and it was a very imaginative experiment. But, I think as prisoners of their time, the conclusion they wrote was not the conclusion one would like, in retrospect, to have seen. Here's the conclusion you'd like to see: boy, there's a lot of this instruction-level parallelism, but the crazy way we thought of trying to tap into it doesn't work. Sorry. Maybe somebody can come along with a better way. Instead they basically wrote it off, and people reading it basically wrote it off. And the message people took away wasn't even what they wrote, which was that there's a lot of instruction-level parallelism but it's impractical. Bad enough, people basically came away with the feeling that there wasn't much instruction-level parallelism at all.

So that's all wrong. It really is the first statement that's the correct conclusion, and it turns out that there is a practical embodiment of this idea. I've spent a lot of time on this one somewhat obscure experiment, but a lot more people know about it now.
That's because the idea that comes out of it, this idea of unwinding all the possible execution paths in order to get more parallelism by going past branches, is central to an awful lot of the technology we have today. We're going to learn about something called speculative execution in a minute, and that really is the way you can tap into huge amounts of instruction-level parallelism. But you have to make this idea of going down paths speculatively practical, and you're not going to do it the way these experimenters set out to measure it.

There is something you can see right away, though, that does bring some practicality to it. Here's the idea. Take a single CPU and start filling it up from below with things to be done, the way we were doing in that first example. But now, when you get to a conditional branch, pick one of the directions. Say you pick taken, and you start pulling things up from the taken side. When you do it, though, make sure you do it in such a way that you don't destroy so much of the state of the computation that, if it turns out untaken was the right answer, you have no ability to recover. So make sure that when you pull operations up and they're clobbering things, changing the state of the computation, you've somehow arranged, in the hardware or the software, to leave around enough state that when untaken turns out to be the right answer, you may not get the advantage of having pulled these things up, but at least the semantics of the program are still correct.

That idea, again, is called speculative execution. It allows you to go past a branch and pull things up. You're in a bit of a danger zone, because you're doing things without being quite sure whether you'll want their results, want the old state back, or want to ignore them, which is usually what you end up having to do. And then at some point the branch resolves and you can continue with the correct computation.
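A minimal way to picture that recovery requirement, with made-up register names and written as software rather than as any particular machine's mechanism:

    # Hoisting an operation above a branch speculatively: do the likely
    # work early, but into a scratch location, so the old state survives
    # if the prediction was wrong.  Names and values here are invented.
    def run(regs, branch_taken):
        # original order:  if branch_taken: r4 = r2 + r3   else: r4 = r4 - 1
        scratch = regs["r2"] + regs["r3"]   # speculated above the branch
        if branch_taken:                    # prediction was right
            regs["r4"] = scratch            # commit the speculative result
        else:                               # prediction was wrong: recover
            regs["r4"] = regs["r4"] - 1     # old r4 was never clobbered, so this is safe
        return regs

    print(run({"r2": 5, "r3": 7, "r4": 100}, branch_taken=True))    # r4 becomes 12
    print(run({"r2": 5, "r3": 7, "r4": 100}, branch_taken=False))   # r4 becomes 99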
Now, in fact, you can speculate from both sides, not just one. When you look down at a branch, whatever you're going to do on the taken side you can simultaneously do on the untaken side, pulling up operations from both and making sure you preserve the semantics of the program both ways. But now what's happened is that we're back in the Riseman and Foster situation: instead of 64,000 computers running this program, we have one computer with 64,000 operations to do in order to get a factor of 10 speedup. It doesn't feel like we've improved things very much.

However, if a branch is very predictable, that is, it tends to go in one direction most of the time and you can figure out what that direction is, then it makes sense to just gamble, a very common trade-off in computer systems: gamble on the more likely event, get a big speedup there, and then make sure you still do the right computation when the less likely event occurs. That's what's behind caches, after all. And that really makes sense if branches are predictable.

Well, it turns out, and these are some recent numbers we've gotten here at HP, that branches are a lot more predictable than a lot of people think. People working in instruction-level parallelism are pretty well aware of this, but I remember very often being told that this is not the way it is, so it was interesting to measure it and see just what it is. Consider a given source branch; maybe it's executed a million times. If, in a run of the program, it's taken 800,000 times and not taken 200,000 times, then you would say it goes in its majority direction 80% of the time, and you would count that as a million branches that go in their majority direction 80% of the time.
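The bookkeeping behind that measurement is simple to restate as a sketch; the trace format, a list of (static branch, taken or not) pairs, is an assumption of mine:

    from collections import Counter

    # For each static (source) branch, find the fraction of its executions
    # that go in its majority direction, and weight it by how many times
    # it executed, so the result is per dynamic branch.
    def majority_bias(trace):
        taken, total = Counter(), Counter()
        for branch, was_taken in trace:
            total[branch] += 1
            taken[branch] += was_taken
        return [(max(taken[b], total[b] - taken[b]) / total[b], total[b]) for b in total]

    # a branch taken 800,000 of 1,000,000 times counts as 1,000,000 dynamic
    # branches that go their majority way 80% of the time
    trace = [("b1", True)] * 800_000 + [("b1", False)] * 200_000
    print(majority_bias(trace))         # [(0.8, 1000000)]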
Now take a wide range of programs over a very wide and varied set of data. We have the SPECmarks here with a lot of added data sets, plus a lot of other C integer codes in particular, which we're most interested in. We took all of the branches, around 7 billion of them or so, measured out of real programs running real data, and these were systems and scientific codes. What we found is that 40% of the branches you execute come from a source branch that's predictable more than 99 and a half percent of the time. And if you cut off at some reasonable place where you'd like your predictions to be right in order to get a payoff from your speculative execution, you get a huge preponderance of branches where going one way seems like the thing to do. In fact, very few branches, dynamically, are unpredictable.

Now, you can always find a program where that's not true. But I'm actually surprised: I haven't been able to find non-pathological ones. I'm sure they're out there; I just haven't run into them. I looked at some programs that I thought would be unpredictable, even some of the SPECmarks. I think most people would predict that the LISP interpreter, for example, would have very 50-50 behavior; it very much fits this pattern. Sometimes there's a little more weight at the unpredictable end, but very little more, and sometimes, in the scientific codes, there's a lot more weight up at the predictable top.

So branches are very predictable, and this gives us a technique for dealing with the problem of going past conditional branches. It also lends some validity to the limit studies that don't stop at conditional branches but say: if you want to know what's available, here's what's available. Now you need tricks to get it. You need architectural techniques that will allow you to go past those conditional jumps. So that's a little bit of narrow history; I'll say much more about the recent history at the end of the talk.
But it's a bit of narrow history and some perspective that I have always found very valuable in understanding the hardware and software techniques that go into instruction-level parallelism, which I'd like to start looking at now.

Think back to the example we did at the beginning, where we took a section of straight-line code, with no conditional branches, and tried to schedule it to see what would be fired when on the instruction-level parallel hardware. We really did two different tasks there, and you heard me doing both out loud, though I didn't separate them at the time. The first task was: can I issue this operation now? That is, is it independent of all the operations that I'm issuing now, or that sit between where it was and now? In other words, is its data ready? Have the things it has to wait for already completed their execution? That's called recognizing the independence of operations, or dependence analysis, which is the term typically used for it. It's a job you have to do in order to find out whether an operation is ready to be fired; if its data isn't ready, it doesn't make any sense to start the add. So that's one job. Then, having done that, you get a handful of things that you would like to fire right now, and you have to ask: where in this hardware am I going to do it? Can I do it this cycle, and using what resources? So the second question is: how do you schedule the operation?
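The first of those two jobs, the dependence test, can be sketched in a few lines if you describe each operation only by the registers it reads and writes; that representation is mine, not any particular machine's:

    # "Can I issue this operation now?"  It must wait for any earlier,
    # still-unfinished operation that writes a register it reads (a true
    # dependence), reads a register it writes (anti), or writes the same
    # register (output).
    def independent(op, earlier):
        return (not (op["reads"] & earlier["writes"]) and      # read after write
                not (op["writes"] & earlier["reads"]) and      # write after read
                not (op["writes"] & earlier["writes"]))        # write after write

    def data_ready(op, in_flight):
        return all(independent(op, e) for e in in_flight)

    mul = {"reads": {"r1", "r2"}, "writes": {"r3"}}
    add = {"reads": {"r7", "r8"}, "writes": {"r9"}}    # touches none of mul's registers
    dep = {"reads": {"r3", "r4"}, "writes": {"r5"}}    # needs the multiply's result

    print(data_ready(add, in_flight=[mul]))    # True: overlap it with the multiply
    print(data_ready(dep, in_flight=[mul]))    # False: it has to wait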
The way you can think of this is that if you fix the endpoints, here's the source code you're starting with, and here's the execution pattern you're trying to get out of instruction-level parallelism at the end, and those are fixed independent of how you do this stuff, then the steps along the way are as follows. You want to run your compiler and turn the program into low-level operations. You might not want to run it all the way into the final version, because, for example, you may not have chosen which functional units or which registers to use; you may do that as part of scheduling. But you do most of the compiling job. Then you want to have some thing, whatever it is, and we'll talk about that in the next slide, that will do the dependence analysis. And then you want something that will schedule the operations onto the hardware.

Now, I've gone on and on about "some thing" on purpose, to emphasize one of the most important points I hope you can take away from this talk. If you think of the source code and the record of execution as fixed, the desired source code and the desired kind of execution, then you have a spectrum of jobs to do in between, and you can draw the line between which of them to do in software, before the program runs, and which to do in hardware, as the program runs, in lots of different places.

There are two extremes. One is to simply take ordinary RISC operations and hand them to the hardware, so that on that side of the boundary everything is hardware: ask the hardware to do the independence recognition, and ask the hardware to schedule the operations onto the functional units, using the registers, whatever. In that case, you have what people now call a superscalar. Alternatively, at the other extreme, and there are intermediate points between these two, you could say: I'm going to allow the dependence analysis and the scheduling to all occur as part of the compiling process, in software, before the program runs. Then the hardware has nothing left to do but execute the code you've already scheduled. If you do that, the hardware you're talking about is called a VLIW, a Very Long Instruction Word machine.
So we're going to look at these two kinds of machines in a little more detail, and we're going to discuss this trade-off. It's very similar to the kinds of hardware/software trade-offs one is always making in systems design. But before going on, it's worth saying that while these are the popular things people talk about, there are clever ideas that try to embody the best of both kinds of processors. That is, they try not to apply the hardware/software trade-off at one extreme or the other, but instead look at all the individual pieces of the job and figure out which ones to parcel out to the software and which to the hardware.

One of the most clever ideas, just to give you an example, is something that appears in the Horizon processor, which I first saw at the Supercomputing Research Center and which, I believe, is now being built by Tera Computer in Washington state. The idea is to draw the boundary between dependence analysis and scheduling, which is an obvious place to draw it if you look at this picture, but the way they did it was very clever. The idea is to do the dependence analysis in the compiler, and then encode statements of what an operation depends upon into the code in the form of hints, so that the hardware can simply run the program step after step without looking at the hints at all. You know the program can run correctly without anybody having to do anything. But if the hardware wants to use the hints to do more clever scheduling and get some instruction-level parallelism, it can do that. And the hints are very simple: just a number attached to each instruction saying how many instructions immediately before this one it is guaranteed to be independent of. So if there's a 10 in that field, the hardware is guaranteed by the compiler that the previous 10 instructions in the instruction stream are all independent of this one. The hardware just has to look and see whether anything older than that is still in flight. If nothing is, it knows it can issue this operation. If something is, it has to play safe and wait, not knowing for certain whether that's a data predecessor of this one or not, but knowing it's within that window. That's called Horizon. Clever idea; nice name for it.
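In code terms, the check the issue hardware has to make might look something like this; the field names and the notion of a single oldest in-flight index are my shorthand for the idea, not Horizon's actual encoding:

    # Horizon-style hint: each instruction carries a number promising that
    # the N instructions immediately before it are all independent of it.
    def can_issue(index, hint, oldest_in_flight):
        # index: this instruction's position in program order
        # hint: how many immediate predecessors the compiler guarantees independent
        # oldest_in_flight: position of the oldest unfinished instruction, or None
        if oldest_in_flight is None:
            return True
        return oldest_in_flight >= index - hint   # everything unfinished is inside the window

    # instruction 42 with hint 10: instructions 32 through 41 are promised independent
    print(can_issue(42, hint=10, oldest_in_flight=35))   # True: issue it now
    print(can_issue(42, hint=10, oldest_in_flight=30))   # False: play safe and wait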
So that's one example, and I went through it to make sure you understand that this really is a continuum. People often talk about VLIWs and superscalars as being opposites, but really they're two approaches to getting the exact same thing; it's just a question of where you draw that boundary. Practitioners in the field are always discovering this, and are always glad to discover it. It's one of those nice moments when you suddenly see what's underlying everything.

OK. Now let's look at how these things work, considering a superscalar first. It's pretty intuitive. You build some extra hardware, and you feed in the instruction stream. It's just an ordinary stream, one operation at a time; it could be the same stream you would have been running on a non-superscalar processor implementing the same architecture. This hardware starts fetching the operations, looking at them, figuring out whether they're data ready or not, doing the independence recognition, and, if they are ready, figuring out where to send them in the system.

Now, there is every imaginable level of sophistication that can be applied to this job. The very simplest thing you can do is take an operation and figure out whether the registers it's going to use are reserved by somebody, reserved in the sense that somebody is planning on writing a result into that register. If they are, just stop. Or, instead of just stopping, you can stick the operation aside and look at the next one.
When I say stop, I mean stop issuing and allow the operations already in flight to complete. Instead of stopping issue, you can put that operation aside in a special buffer, look at the next operation, and ask whether that one can go. Then you need to periodically, say each cycle, look at the ones you've delayed and see whether they can fire. Or you might be even more sophisticated: when you put something aside, arrange to have the hardware notify you when that operation becomes data ready.

So there's the more modest superscalar algorithm, in the 6600, and the rather more ambitious one, called Tomasulo's algorithm, in the 360/91, in the '60s. Those sorts of things have been used in several processors, and as we'll see later, they are very much what we find in the highest-performance processors today. The algorithms used in the CDC and IBM machines lived for quite a long time. The CDC method evolved into something considerably more ambitious as the superscalar unit of the Cray supercomputers: Seymour Cray was at CDC and took that same philosophy with him when he built the Cray-1, and in one form or another these ideas have lived on in that line of computers. The same is true of the IBM mainframes; the ideas lived on in a lot of the most powerful models, and in the RS/6000 and so on.

So that's what you do with a superscalar. And the real question, as we'll see later, is how this hardware scales: how big can it grow before it starts getting impractical? We'll talk about that in a while.
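As a cartoon of those two issue policies over one register scoreboard, not of either machine's real mechanism:

    # Two superscalar issue policies over the same register reservations:
    # stop issuing at the first blocked operation (roughly the 6600's
    # scoreboard style), or park blocked operations in a buffer and keep
    # scanning (in the spirit of Tomasulo's algorithm).
    def issue(ops, busy, park_blocked):
        issued, parked = [], []
        for op in ops:
            blocked = (set(op["reads"]) | set(op["writes"])) & busy
            if not blocked:
                issued.append(op["name"])
                busy |= set(op["writes"])     # reserve the destination register
            elif park_blocked:
                parked.append(op["name"])     # set it aside, look at the next one
            else:
                break                         # simplest hardware: just stop issuing
        return issued, parked

    ops = [{"name": "mul", "reads": ["r1", "r2"], "writes": ["r3"]},
           {"name": "add", "reads": ["r3", "r4"], "writes": ["r5"]},   # needs the mul
           {"name": "sub", "reads": ["r6", "r7"], "writes": ["r8"]}]   # independent

    print(issue(ops, busy=set(), park_blocked=False))   # (['mul'], [])
    print(issue(ops, busy=set(), park_blocked=True))    # (['mul', 'sub'], ['add'])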
An alternative is to throw out that control hardware completely. Instead, let the compiler do the whole job, and do it in such a way that it schedules exactly the record of execution it wants to have happen. Rather than feeding in a sequential chunk of operations, it gives the exact outcome: the record of execution mapped exactly into the code, just as the compiler expects it to happen. Then the hardware doesn't have to do anything in the way of sophisticated scheduling. It simply takes the first instruction (people tend to call these long things instructions, and the constituent RISC operations are called operations, just to distinguish the two), parcels each operation out to the functional unit and registers it says it uses, executes them in one cycle, grabs the next instruction, and so on.

So that's the idea of the VLIW. It's very simple, and it requires much less hardware, because the issue logic of a superscalar is quite complex hardware that you can now throw out. In fact, you often also throw out a little of the control hardware in a VLIW that you would have built even in an ordinary sequential machine, simply because you already need such a sophisticated compiler that there are a few other things control hardware usually does that you can arrange to do a little more efficiently in software.
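A sketch of how little the issue side of a VLIW has to do; the instruction layout here is invented for illustration, not any real machine's format:

    # The compiler has already packed independent operations into one long
    # instruction per cycle, so issue is just handing each slot to its unit.
    program = [
        {"fmul": ("r3", "r1", "r2"), "fadd": None,
         "ialu": ("r9", "r7", "r8"), "mem": ("load", "r10", "r4")},
        {"fmul": None, "fadd": ("r11", "r9", "r12"),
         "ialu": ("r13", "r7", "r9"), "mem": None},
    ]

    def execute(program, units):
        for cycle, instruction in enumerate(program):
            for slot, op in instruction.items():
                if op is not None:
                    units[slot](op)           # no dependence checks, no reordering
            print("cycle", cycle, "issued", [s for s, op in instruction.items() if op])

    # stand-in functional units that just record what they were asked to do
    units = {slot: print for slot in ["fmul", "fadd", "ialu", "mem"]}
    execute(program, units)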
767 00:34:35,870 --> 00:34:38,530 So you tend to see any long latency 768 00:34:38,530 --> 00:34:41,230 operation in instruction level parallel 769 00:34:41,230 --> 00:34:43,389 machine being pipelined. 770 00:34:43,389 --> 00:34:44,500 Now, not all of them are. 771 00:34:44,500 --> 00:34:47,260 That's not always the case, but it is certainly the rule. 772 00:34:47,260 --> 00:34:48,909 And the example I've written here 773 00:34:48,909 --> 00:34:50,380 certainly would count on the fact 774 00:34:50,380 --> 00:34:53,110 that these longer latency operations aren't holding you 775 00:34:53,110 --> 00:34:57,620 up when you issue the next instruction. 776 00:34:57,620 --> 00:35:01,050 Now, I've shown this a issuing a bunch at the same time. 777 00:35:01,050 --> 00:35:02,750 And when we saw the record of execution 778 00:35:02,750 --> 00:35:06,860 at the very beginning for either superscalar or VLIW machines, 779 00:35:06,860 --> 00:35:10,220 we saw it written as a bunch of operations 780 00:35:10,220 --> 00:35:11,700 that fired at the same time. 781 00:35:11,700 --> 00:35:13,460 So it was sort of multi issue. 782 00:35:13,460 --> 00:35:17,030 Another idea that's very similar but just a little bit different 783 00:35:17,030 --> 00:35:19,820 is what's come to be called super pipelined or deeply 784 00:35:19,820 --> 00:35:23,870 pipelined or highly pipelined machines in which you increase 785 00:35:23,870 --> 00:35:28,040 the issue rate tremendously as far as the clock of the issue. 786 00:35:28,040 --> 00:35:30,420 But you issue more than one operator. 787 00:35:30,420 --> 00:35:30,920 I'm sorry. 788 00:35:30,920 --> 00:35:33,410 You don't issue more than one operation per clock. 789 00:35:33,410 --> 00:35:34,910 So that, for example, you might even 790 00:35:34,910 --> 00:35:37,340 have an issue rate a lot faster than an integer add, 791 00:35:37,340 --> 00:35:38,990 which is sort of what we tend to think 792 00:35:38,990 --> 00:35:44,370 of as the atomic unit in the user visible clock in a system. 793 00:35:44,370 --> 00:35:47,570 Instead, you might start an integer add and then issue 794 00:35:47,570 --> 00:35:50,660 more operations before it completes 795 00:35:50,660 --> 00:35:55,030 but at a very, very fast clip. 796 00:35:55,030 --> 00:35:57,460 Now, that's an important hardware trade off 797 00:35:57,460 --> 00:36:00,070 and has lots of implications for how you build your system. 798 00:36:00,070 --> 00:36:01,157 But really in concept. 799 00:36:01,157 --> 00:36:02,740 It's not very different from this kind 800 00:36:02,740 --> 00:36:04,390 of record of execution. 801 00:36:04,390 --> 00:36:07,750 It's just finding the available instruction level parallelism 802 00:36:07,750 --> 00:36:10,660 and scheduling it is going to be a pretty similar job. 803 00:36:10,660 --> 00:36:13,120 And it's likely to be similarly effective, 804 00:36:13,120 --> 00:36:16,390 give or take the advantages and disadvantages of the hardware 805 00:36:16,390 --> 00:36:17,800 trade offs that you get. 806 00:36:17,800 --> 00:36:22,390 And it applies just as well to superscalar and VLIW machines. 807 00:36:22,390 --> 00:36:24,910 It's a little more attractive for a superscalar machine 808 00:36:24,910 --> 00:36:27,760 because you can build a more simple issue unit 809 00:36:27,760 --> 00:36:30,580 and just make it very fast rather than trying to issue 810 00:36:30,580 --> 00:36:31,940 a lot at the same time. 811 00:36:31,940 --> 00:36:34,190 So perhaps that's an advantage there. 
812 00:36:34,190 --> 00:36:36,280 But in any case, that's an idea that people often 813 00:36:36,280 --> 00:36:39,130 think of as being more different than in my opinion 814 00:36:39,130 --> 00:36:40,720 it really is. 815 00:36:40,720 --> 00:36:42,400 So one last thing about VLIWs. 816 00:36:42,400 --> 00:36:46,090 I thought you'd enjoy seeing some code that 817 00:36:46,090 --> 00:36:48,970 is an example of when you're doing incredibly well 818 00:36:48,970 --> 00:36:50,140 on a VLIW. 819 00:36:50,140 --> 00:36:53,560 Now, you very rarely will do this well. 820 00:36:53,560 --> 00:36:56,920 This is two instructions, two clock ticks, 821 00:36:56,920 --> 00:37:02,410 of compiled code for the Multiflow Trace 28/300 822 00:37:02,410 --> 00:37:04,090 that I happen to have around. 823 00:37:04,090 --> 00:37:07,360 And it's for a dense solver. 824 00:37:07,360 --> 00:37:08,860 And dense solvers are things that 825 00:37:08,860 --> 00:37:11,020 are very easy to get parallelism out of. 826 00:37:11,020 --> 00:37:15,220 Almost any form of parallelism works great on a dense solver. 827 00:37:15,220 --> 00:37:16,720 And this is two clock ticks. 828 00:37:16,720 --> 00:37:19,630 First clock tick, you issue all of these operations. 829 00:37:19,630 --> 00:37:22,900 If you look at, if you can read the op codes from a distance, 830 00:37:22,900 --> 00:37:26,630 there's a lot of floating point multiplies integer ALUs, 831 00:37:26,630 --> 00:37:28,090 a lot of loads and stores. 832 00:37:28,090 --> 00:37:29,620 Actually, I guess I grabbed two that 833 00:37:29,620 --> 00:37:32,080 don't have loads and stores now that I look at it. 834 00:37:32,080 --> 00:37:32,770 Oh yes they do. 835 00:37:32,770 --> 00:37:34,510 Lots of loads in stores as well. 836 00:37:34,510 --> 00:37:36,910 And a bunch of branches, because when 837 00:37:36,910 --> 00:37:39,040 you do that many operations at once, 838 00:37:39,040 --> 00:37:41,950 you better not be limited to doing one branch per cycle 839 00:37:41,950 --> 00:37:44,750 or that will really make your code very sparse again. 840 00:37:44,750 --> 00:37:47,260 So in here is the second instruction filled 841 00:37:47,260 --> 00:37:48,520 with all these operations. 842 00:37:48,520 --> 00:37:51,730 Obviously if you issue this much on the first clock tick 843 00:37:51,730 --> 00:37:54,100 and then don't even wait for it to stop before issuing 844 00:37:54,100 --> 00:37:56,050 that much on the second clock tick, 845 00:37:56,050 --> 00:37:57,790 this machine is going to scream. 846 00:37:57,790 --> 00:38:01,120 And sure, a code like this goes incredibly well 847 00:38:01,120 --> 00:38:02,100 on this machine. 848 00:38:02,100 --> 00:38:04,990 If you measure the performance per clock tick of code 849 00:38:04,990 --> 00:38:08,020 like this on a machine like that, it's just tremendous. 850 00:38:08,020 --> 00:38:09,580 Of course, most of the time it's not 851 00:38:09,580 --> 00:38:10,955 going to look anything like that. 852 00:38:10,955 --> 00:38:13,690 853 00:38:13,690 --> 00:38:15,820 But interesting nonetheless. 
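The actual Trace instruction words are not reproduced here, but as a rough, invented sketch of the flavor of the thing: a very long instruction is essentially a record with one slot per functional unit, and issue is nothing more than handing every slot of the current word to its unit on each cycle, with no dependence checking in the hardware.

```c
/* Invented sketch of a VLIW instruction word and its issue loop; the slot
 * layout and the operations are made up, not the Multiflow Trace encoding. */
#include <stdio.h>

struct vliw_insn {
    const char *falu;    /* floating-point slot */
    const char *ialu;    /* integer slot        */
    const char *mem;     /* load/store slot     */
    const char *branch;  /* branch slot         */
};

int main(void) {
    struct vliw_insn code[] = {
        { "fmul f2,f0,f1", "add r7,r5,r6", "load f3,[r8]",   "nop"            },
        { "fmul f5,f3,f4", "add r9,r7,r1", "store f2,[r10]", "blt r9,r2,done" },
    };
    size_t n = sizeof code / sizeof code[0];

    for (size_t pc = 0; pc < n; pc++) {
        /* One cycle per wide instruction: every slot is parceled out at once. */
        printf("cycle %zu: FALU=%-15s IALU=%-14s MEM=%-16s BR=%s\n",
               pc, code[pc].falu, code[pc].ialu, code[pc].mem, code[pc].branch);
    }
    return 0;
}
```

The point of the sketch is only that all the intelligence went into how the two wide words were filled in by the compiler, not into the loop that issues them.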
854 00:38:15,820 --> 00:38:18,580 So now that we've seen the kinds of machines 855 00:38:18,580 --> 00:38:22,210 that there are, and of course in the scope of the time 856 00:38:22,210 --> 00:38:23,950 of this talk I can't really go into a lot 857 00:38:23,950 --> 00:38:26,183 of the architectural details that 858 00:38:26,183 --> 00:38:27,850 are very interesting about this machine, 859 00:38:27,850 --> 00:38:29,890 at the end I'll point you at some references 860 00:38:29,890 --> 00:38:33,650 if you want to learn some more about this. 861 00:38:33,650 --> 00:38:36,740 I'd like to talk about two big problems, two 862 00:38:36,740 --> 00:38:40,070 or three big problems in instruction level parallelism. 863 00:38:40,070 --> 00:38:43,870 And the first one is speculative execution. 864 00:38:43,870 --> 00:38:46,278 Speculative execution is an interesting topic. 865 00:38:46,278 --> 00:38:48,820 You hardly heard anybody talk about it just a couple of years 866 00:38:48,820 --> 00:38:49,420 ago. 867 00:38:49,420 --> 00:38:52,210 And now when you look at, let's see, 868 00:38:52,210 --> 00:38:54,620 I was at the ASPLOS conference recently, 869 00:38:54,620 --> 00:38:56,170 which is a good systems conference, 870 00:38:56,170 --> 00:38:58,690 there were three or four papers on speculative execution, 871 00:38:58,690 --> 00:38:59,560 good papers. 872 00:38:59,560 --> 00:39:01,960 So people are really working on the problems here. 873 00:39:01,960 --> 00:39:04,630 And here's the thing about speculative execution. 874 00:39:04,630 --> 00:39:07,310 There really are two sets of problems. 875 00:39:07,310 --> 00:39:09,190 There's how you implement it and then 876 00:39:09,190 --> 00:39:11,950 there's the big problem you get when you do. 877 00:39:11,950 --> 00:39:14,110 The how you implement it is actually simpler 878 00:39:14,110 --> 00:39:15,820 than you'd think. 879 00:39:15,820 --> 00:39:17,770 Here you are with a program. 880 00:39:17,770 --> 00:39:20,620 And looking ahead, you might decide because of the way 881 00:39:20,620 --> 00:39:22,720 you think these branches will go that you'd 882 00:39:22,720 --> 00:39:27,160 like to grab this operation and issue it ahead of these two 883 00:39:27,160 --> 00:39:29,690 branches, because you think they're going to go this way. 884 00:39:29,690 --> 00:39:31,940 And so you speculate that that'll be true. 885 00:39:31,940 --> 00:39:33,970 Well, now the first job you have to do 886 00:39:33,970 --> 00:39:37,810 is what happens when it was an unprofitable speculative 887 00:39:37,810 --> 00:39:38,660 operation? 888 00:39:38,660 --> 00:39:40,060 How do you clean up after it? 889 00:39:40,060 --> 00:39:43,690 And there's a series of techniques you can use. 890 00:39:43,690 --> 00:39:48,010 One technique you can use is in the hardware, 891 00:39:48,010 --> 00:39:52,570 you can arrange to sort of shadow the work you're doing 892 00:39:52,570 --> 00:39:56,770 and have the old state sitting around and the new state. 893 00:39:56,770 --> 00:39:59,620 For example, you can have a shadow set of registers. 894 00:39:59,620 --> 00:40:01,813 And this is an idea that people have 895 00:40:01,813 --> 00:40:02,980 kicked around several times. 896 00:40:02,980 --> 00:40:06,310 And now I've seen there was a paper at ASPLOS 897 00:40:06,310 --> 00:40:10,660 by Mike Smith at Stanford, who was advocating the use of this. 898 00:40:10,660 --> 00:40:14,770 That you grab an operation, do it, but write its results 899 00:40:14,770 --> 00:40:16,750 in a set of special registers. 
900 00:40:16,750 --> 00:40:18,790 And you don't sort of commit these registers 901 00:40:18,790 --> 00:40:21,790 as the real results until the branches 902 00:40:21,790 --> 00:40:23,630 resolve in that direction. 903 00:40:23,630 --> 00:40:25,420 So I'm paraphrasing a harder thing. 904 00:40:25,420 --> 00:40:27,742 But this idea has been kicked around. 905 00:40:27,742 --> 00:40:29,200 There was a little bit of something 906 00:40:29,200 --> 00:40:32,047 like that in the multiflow VLIW computer 907 00:40:32,047 --> 00:40:34,130 and it's something people have talked about a lot. 908 00:40:34,130 --> 00:40:36,430 So that's one idea that you could do. 909 00:40:36,430 --> 00:40:39,520 Second idea is to use renaming. 910 00:40:39,520 --> 00:40:44,150 That is when you write the results of your operation. 911 00:40:44,150 --> 00:40:46,810 Suppose the operation that you're pulling up 912 00:40:46,810 --> 00:40:48,670 writes a variable x. 913 00:40:48,670 --> 00:40:53,470 Suppose somebody down here reads the old value of x. 914 00:40:53,470 --> 00:40:56,980 Then you can give this new result a different name. 915 00:40:56,980 --> 00:40:59,440 And when you get down here, when the jumps resolve 916 00:40:59,440 --> 00:41:01,240 in the direction that you had planned, 917 00:41:01,240 --> 00:41:05,590 you can put in code that names that result back to x 918 00:41:05,590 --> 00:41:10,470 if it happens that this value of x and the new value of x 919 00:41:10,470 --> 00:41:11,760 reach some place in common. 920 00:41:11,760 --> 00:41:13,403 Then they have to have the same name. 921 00:41:13,403 --> 00:41:15,570 Those of you who are familiar with compilers, that's 922 00:41:15,570 --> 00:41:17,400 a typical kind of problem. 923 00:41:17,400 --> 00:41:21,330 You would just use compiler renaming, putting 924 00:41:21,330 --> 00:41:27,530 a new sort of move to rename the new value to x down here so 925 00:41:27,530 --> 00:41:30,210 that the same value gets read down both branches. 926 00:41:30,210 --> 00:41:33,733 This is the typical thing you would do, especially in a VLIW, 927 00:41:33,733 --> 00:41:35,150 where you're setting up the world, 928 00:41:35,150 --> 00:41:37,700 that we'll see in a superscalar as well, where you're setting 929 00:41:37,700 --> 00:41:42,500 up the world to do speculative execution without destroying 930 00:41:42,500 --> 00:41:44,900 the state of the old values. 931 00:41:44,900 --> 00:41:47,990 And instead of having to do moves and all of the stuff 932 00:41:47,990 --> 00:41:49,940 and renaming, one thing you can do very often 933 00:41:49,940 --> 00:41:52,190 is simply do operations that don't hurt anything 934 00:41:52,190 --> 00:41:53,270 down this branch. 935 00:41:53,270 --> 00:41:55,530 And then just ignore them when you go down there. 936 00:41:55,530 --> 00:41:59,120 So if over here you're writing a value x 937 00:41:59,120 --> 00:42:02,060 but there's no interest in the old value of x down here, 938 00:42:02,060 --> 00:42:04,657 you can just move it up and not worry about it. 939 00:42:04,657 --> 00:42:06,240 Forget about it when it goes that way. 940 00:42:06,240 --> 00:42:07,310 So x has some value. 941 00:42:07,310 --> 00:42:08,870 You don't care. 942 00:42:08,870 --> 00:42:11,300 So those are the sorts of techniques you can use. 943 00:42:11,300 --> 00:42:13,730 This is all perfectly tractable. 944 00:42:13,730 --> 00:42:14,810 It's been implemented. 
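Here is a small, invented illustration of that renaming idea, written in C terms rather than machine code: the speculated result is computed early under a fresh name, and the ordinary move that assigns it back to x happens only on the path where the branches resolve the way the speculation assumed, so the old value of x is never destroyed on the other paths.

```c
/* Invented example of compiler renaming for speculative execution.
 * Before scheduling, x = a * b lived inside the innermost if; after
 * scheduling it has been hoisted above both branches under a new name. */
#include <stdio.h>

int f(int a, int b, int c, int x) {
    int x_spec = a * b;        /* speculative result, under a fresh name       */

    if (c > 0) {
        if (a > b) {
            x = x_spec;        /* predicted path: rename the result back to x  */
        } else {
            x = x - 1;         /* old value of x was never clobbered           */
        }
    }
    return x;
}

int main(void) {
    printf("%d %d\n", f(3, 2, 1, 10), f(2, 3, 1, 10));  /* prints 6 9 */
    return 0;
}
```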
945 00:42:14,810 --> 00:42:18,140 The Multiflow Trace did speculative execution 946 00:42:18,140 --> 00:42:18,950 like this. 947 00:42:18,950 --> 00:42:23,690 And people have done the sort of moving things up 948 00:42:23,690 --> 00:42:25,412 that you can ignore for many years 949 00:42:25,412 --> 00:42:26,870 in various different architectures. 950 00:42:26,870 --> 00:42:28,940 So this has been found around, and it seems 951 00:42:28,940 --> 00:42:30,110 like a reasonable thing. 952 00:42:30,110 --> 00:42:33,450 There's a big gotcha with speculative execution though. 953 00:42:33,450 --> 00:42:36,640 And that is what happens when there's an exception? 954 00:42:36,640 --> 00:42:41,350 So it's easiest if we look at it in terms of loads. 955 00:42:41,350 --> 00:42:44,650 So suppose you have a while loop here. 956 00:42:44,650 --> 00:42:47,980 And what we're doing is while the index 957 00:42:47,980 --> 00:42:51,380 I is below some upper limit, we're 958 00:42:51,380 --> 00:42:53,600 going around and doing some computation. 959 00:42:53,600 --> 00:42:57,840 And the computation involves loading A of I. Now, 960 00:42:57,840 --> 00:43:01,020 you might want to look down further iterations 961 00:43:01,020 --> 00:43:04,080 of this computation and bring some loads further up. 962 00:43:04,080 --> 00:43:06,600 That's a typical thing people like to do, because loads very 963 00:43:06,600 --> 00:43:09,480 often, especially array loads if the array is large enough 964 00:43:09,480 --> 00:43:11,430 and you're bypassing a cache, say, 965 00:43:11,430 --> 00:43:13,380 can have a very long latency. 966 00:43:13,380 --> 00:43:15,180 So you want to bring those loads up. 967 00:43:15,180 --> 00:43:18,270 But at some point, you're going to bring a load up that you 968 00:43:18,270 --> 00:43:21,840 wouldn't have wanted to do because of having 969 00:43:21,840 --> 00:43:23,470 gone past the upper limit. 970 00:43:23,470 --> 00:43:27,500 So now you've loaded off the end of the array. 971 00:43:27,500 --> 00:43:30,410 Well, I'm not a big fan of sports sayings, 972 00:43:30,410 --> 00:43:32,360 which are typically trite. 973 00:43:32,360 --> 00:43:34,430 And there's one in particular that 974 00:43:34,430 --> 00:43:38,600 makes me totally crazy whenever I hear a color person in an NFL 975 00:43:38,600 --> 00:43:40,180 game doing. 976 00:43:40,180 --> 00:43:42,470 They'll very often say, I don't know if any of you 977 00:43:42,470 --> 00:43:45,350 have ever noticed this, they'll say when you throw a pass, 978 00:43:45,350 --> 00:43:47,090 three things can happen and two of them 979 00:43:47,090 --> 00:43:51,140 are bad, being an incompletion and an interception. 980 00:43:51,140 --> 00:43:54,140 And the implication is you shouldn't throw passes. 981 00:43:54,140 --> 00:43:55,567 You should run a lot. 982 00:43:55,567 --> 00:43:57,150 So for those of you who know football, 983 00:43:57,150 --> 00:43:59,850 that makes some sense. 984 00:43:59,850 --> 00:44:01,490 That's of course, patently crazy, 985 00:44:01,490 --> 00:44:05,540 because it doesn't weigh the probabilities of the bad things 986 00:44:05,540 --> 00:44:08,420 and good things happening and the cost and benefit 987 00:44:08,420 --> 00:44:09,687 of the good and bad things. 988 00:44:09,687 --> 00:44:12,020 But you're always hearing them say, if you throw a pass, 989 00:44:12,020 --> 00:44:13,340 three things can happen, two of them are bad. 990 00:44:13,340 --> 00:44:15,680 Well, speculative loads are just like that. 
991 00:44:15,680 --> 00:44:20,000 If you do a speculative load, three things can happen 992 00:44:20,000 --> 00:44:22,003 and two of them are bad. 993 00:44:22,003 --> 00:44:23,420 The good thing that happens is you 994 00:44:23,420 --> 00:44:25,100 do the load, everything's fine. 995 00:44:25,100 --> 00:44:26,570 You fetch the data. 996 00:44:26,570 --> 00:44:28,550 Nobody seems to have a complaint. 997 00:44:28,550 --> 00:44:30,050 If it turns out that you don't want 998 00:44:30,050 --> 00:44:31,250 to use the data, then great. 999 00:44:31,250 --> 00:44:32,063 You ignore it. 1000 00:44:32,063 --> 00:44:33,230 I mean, you just did a load. 1001 00:44:33,230 --> 00:44:34,563 What's the harm of doing a load? 1002 00:44:34,563 --> 00:44:36,650 You can certainly ignore it if it turns out 1003 00:44:36,650 --> 00:44:39,590 that you don't go the direction that would have required 1004 00:44:39,590 --> 00:44:41,580 the use of that load. 1005 00:44:41,580 --> 00:44:43,830 Second thing that can happen, though, and this is bad, 1006 00:44:43,830 --> 00:44:47,130 is that you might have a miss in your memory hierarchy 1007 00:44:47,130 --> 00:44:48,990 somewhere. 1008 00:44:48,990 --> 00:44:51,680 And now you've got to figure out whether to service that miss. 1009 00:44:51,680 --> 00:44:54,260 1010 00:44:54,260 --> 00:44:55,850 Now, you might have no choice. 1011 00:44:55,850 --> 00:44:58,230 Somebody might have built the system in that way. 1012 00:44:58,230 --> 00:45:00,350 But for the system designer, maybe you 1013 00:45:00,350 --> 00:45:05,270 want to build in some ability to allow you to not have to stop. 1014 00:45:05,270 --> 00:45:07,520 Particularly, I mean, OK, if it's a cache miss. 1015 00:45:07,520 --> 00:45:09,770 But suppose the page fault. 1016 00:45:09,770 --> 00:45:11,690 Do you really want to stop and service 1017 00:45:11,690 --> 00:45:16,010 the page fault, which takes an awfully long time, if you start 1018 00:45:16,010 --> 00:45:17,990 getting a lot of them because you've 1019 00:45:17,990 --> 00:45:21,140 run off the end of an array? 1020 00:45:21,140 --> 00:45:23,500 And now you're really getting into the area of data 1021 00:45:23,500 --> 00:45:24,790 that you're not interested in. 1022 00:45:24,790 --> 00:45:28,880 Worse than that, imagine you're chasing pointers. 1023 00:45:28,880 --> 00:45:33,350 And now you start loading from some useless address 1024 00:45:33,350 --> 00:45:35,247 because you're going off into some place 1025 00:45:35,247 --> 00:45:36,830 that the pointer structure never meant 1026 00:45:36,830 --> 00:45:39,470 you to be because your test for the end 1027 00:45:39,470 --> 00:45:41,550 would have stopped you before you got there. 1028 00:45:41,550 --> 00:45:43,880 And now you've got some address that's 1029 00:45:43,880 --> 00:45:46,380 an address you're not even slightly interested in. 1030 00:45:46,380 --> 00:45:48,560 So you have a policy question of whether, 1031 00:45:48,560 --> 00:45:51,080 if you build the hardware, whether to allow yourself 1032 00:45:51,080 --> 00:45:54,830 to not service page fault. So that's the first bad thing that 1033 00:45:54,830 --> 00:45:55,460 can happen. 1034 00:45:55,460 --> 00:45:56,940 Second bad thing that can happen, 1035 00:45:56,940 --> 00:45:59,930 which is maybe less bad, is that you 1036 00:45:59,930 --> 00:46:02,420 might get an actual exception. 
1037 00:46:02,420 --> 00:46:06,530 So you might try to access an address that's 1038 00:46:06,530 --> 00:46:08,000 not only off the end of the array 1039 00:46:08,000 --> 00:46:09,890 but out of your user space. 1040 00:46:09,890 --> 00:46:11,678 Or it might be on Mars. 1041 00:46:11,678 --> 00:46:14,220 There might be some address that doesn't even make any sense. 1042 00:46:14,220 --> 00:46:14,930 So that's easy. 1043 00:46:14,930 --> 00:46:18,620 The hardware raises its hand and says, oops, sorry, bad address. 1044 00:46:18,620 --> 00:46:20,210 Well, now what? 1045 00:46:20,210 --> 00:46:26,270 Chances are you've only gotten that because you've 1046 00:46:26,270 --> 00:46:27,770 done something speculatively. 1047 00:46:27,770 --> 00:46:30,740 The only other alternative is that it's a user error. 1048 00:46:30,740 --> 00:46:32,900 And those are pretty infrequent, 1049 00:46:32,900 --> 00:46:34,340 as it turns out when you measure, 1050 00:46:34,340 --> 00:46:37,580 compared to exceptions raised from speculative operations. 1051 00:46:37,580 --> 00:46:40,100 The Multiflow compiler is one I have some experience with. 1052 00:46:40,100 --> 00:46:41,558 Stefan Freudenberger and I have been 1053 00:46:41,558 --> 00:46:45,200 looking at these properties of the Multiflow compiler at HP. 1054 00:46:45,200 --> 00:46:49,460 And we've been finding that, in fact, some programs 1055 00:46:49,460 --> 00:46:54,760 have hundreds of thousands of fake exceptions. 1056 00:46:54,760 --> 00:46:59,460 That is, exceptions raised by unprofitable speculative loads. 1057 00:46:59,460 --> 00:47:03,070 And if you have to stop and report each of those 1058 00:47:03,070 --> 00:47:06,917 or take some kind of trap during each of those, that's misery. 1059 00:47:06,917 --> 00:47:07,750 But what can you do? 1060 00:47:07,750 --> 00:47:11,790 If you just ignore them, which is what you're tempted to do, 1061 00:47:11,790 --> 00:47:15,720 what do you do when the user actually has an error and does 1062 00:47:15,720 --> 00:47:19,950 a load that your compiler or hardware has marked 1063 00:47:19,950 --> 00:47:22,620 as speculative, and now your system says, oh look, 1064 00:47:22,620 --> 00:47:24,600 here's an exception on the speculative load. 1065 00:47:24,600 --> 00:47:26,400 Ha ha, they're not fooling me. 1066 00:47:26,400 --> 00:47:28,650 Well, what if it turns out it's actually a user error? 1067 00:47:28,650 --> 00:47:30,240 Planes fall out of the sky. 1068 00:47:30,240 --> 00:47:31,050 Bridges collapse. 1069 00:47:31,050 --> 00:47:34,370 So you have to report them or you're in trouble there. 1070 00:47:34,370 --> 00:47:35,620 So now what? 1071 00:47:35,620 --> 00:47:38,410 You've got 200,000 of these things in some program. 1072 00:47:38,410 --> 00:47:40,900 You don't want to report them except if it's user error, 1073 00:47:40,900 --> 00:47:42,907 and then you do want to report it. 1074 00:47:42,907 --> 00:47:44,740 So that's a piece of engineering that people 1075 00:47:44,740 --> 00:47:47,410 have been working very hard on the last couple of years.
1076 00:47:47,410 --> 00:47:51,340 And the basic idea that you'll see in several papers 1077 00:47:51,340 --> 00:47:54,910 that people have written up is sort 1078 00:47:54,910 --> 00:47:57,680 of to mimic the IEEE floating point idea, 1079 00:47:57,680 --> 00:48:03,580 which is to produce some invalid value and set a bit 1080 00:48:03,580 --> 00:48:05,703 somewhere that says so. Of course, integers and addresses, 1081 00:48:05,703 --> 00:48:07,120 things like that, unfortunately don't 1082 00:48:07,120 --> 00:48:10,150 have the nice extra representation 1083 00:48:10,150 --> 00:48:12,880 that floating point numbers have, a bit pattern that makes no sense 1084 00:48:12,880 --> 00:48:13,730 that you can use 1085 00:48:13,730 --> 00:48:15,910 and call not a number. 1086 00:48:15,910 --> 00:48:17,800 So you don't 1087 00:48:17,800 --> 00:48:19,745 get this for free. 1088 00:48:19,745 --> 00:48:21,370 You have to build some hardware to take 1089 00:48:21,370 --> 00:48:24,070 care of this, some extra bits somewhere, 1090 00:48:24,070 --> 00:48:26,230 that allow you to tag something you've 1091 00:48:26,230 --> 00:48:31,750 done as having been speculative and raised an exception. 1092 00:48:31,750 --> 00:48:33,670 And then somehow, you arrange, and this 1093 00:48:33,670 --> 00:48:35,680 is where the engineering comes in, 1094 00:48:35,680 --> 00:48:41,320 somehow you arrange to find out when the branches resolve 1095 00:48:41,320 --> 00:48:43,570 in such a way that you know for a fact 1096 00:48:43,570 --> 00:48:46,270 that it was a profitable speculative operation. 1097 00:48:46,270 --> 00:48:50,420 That is, the flow of control eventually did go that way. 1098 00:48:50,420 --> 00:48:53,132 And so you shouldn't have gotten an exception. 1099 00:48:53,132 --> 00:48:55,090 It wasn't that you didn't mean to do that operation 1100 00:48:55,090 --> 00:48:55,630 in the first place. 1101 00:48:55,630 --> 00:48:56,170 You did. 1102 00:48:56,170 --> 00:48:57,370 It's a real user error. 1103 00:48:57,370 --> 00:48:58,630 You better stop and report it. 1104 00:48:58,630 --> 00:49:00,250 So you have to somehow leave around, 1105 00:49:00,250 --> 00:49:02,050 in that spot when you finally get 1106 00:49:02,050 --> 00:49:04,960 to where the branches resolve the right way, the ability 1107 00:49:04,960 --> 00:49:09,710 to report this user error. 1108 00:49:09,710 --> 00:49:11,020 So that's the idea. 1109 00:49:11,020 --> 00:49:14,830 Hardware delays the report until the branch resolves. 1110 00:49:14,830 --> 00:49:17,185 And you'll see people building systems 1111 00:49:17,185 --> 00:49:18,310 that do this in the future. 1112 00:49:18,310 --> 00:49:20,645 Nothing like this has been built yet. 1113 00:49:20,645 --> 00:49:23,020 Again, the machine that I've had a lot of experience with 1114 00:49:23,020 --> 00:49:24,880 is probably the most ambitious by 1115 00:49:24,880 --> 00:49:26,230 far in terms of doing this step. 1116 00:49:26,230 --> 00:49:31,120 The Multiflow Trace VLIWs did do speculative loads. 1117 00:49:31,120 --> 00:49:33,400 They marked them as speculative. 1118 00:49:33,400 --> 00:49:36,258 But in the end, they would just throw away 1119 00:49:36,258 --> 00:49:38,050 exceptions on speculative loads, and that's 1120 00:49:38,050 --> 00:49:40,690 a pretty dangerous policy and not the sort of thing 1121 00:49:40,690 --> 00:49:42,520 you'd want to be doing.
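A minimal sketch of that delayed-report idea follows, with the tag modeled as an ordinary struct field and the checks done in software. Everything here, including the names spec_load and commit, is invented for illustration; real proposals put the extra bit on the register and the checks in hardware.

```c
/* Invented sketch: a speculative load that would fault does not trap; it
 * returns a value tagged as poisoned, and the trap is reported only if the
 * branches later resolve so that the value is actually needed. */
#include <stdio.h>
#include <stdlib.h>

struct spec_val { int value; int poisoned; };   /* the extra tag bit */

static struct spec_val spec_load(const int *a, int i, int n) {
    struct spec_val r = { 0, 0 };
    if (i < 0 || i >= n)
        r.poisoned = 1;        /* would have been a bad address: defer it */
    else
        r.value = a[i];
    return r;
}

static int commit(struct spec_val v) {
    if (v.poisoned) {          /* control really came this way: report it now */
        fprintf(stderr, "deferred exception: bad speculative load\n");
        exit(1);
    }
    return v.value;
}

int main(void) {
    int a[4] = { 10, 20, 30, 40 };
    int n = 4, i = 3, sum = 0;

    /* hoisted above the loop test: one profitable load, one off the end */
    struct spec_val cur  = spec_load(a, i,     n);
    struct spec_val next = spec_load(a, i + 1, n);   /* poisoned */

    sum += commit(cur);        /* this one really was needed                */
    if (i + 1 < n)
        sum += commit(next);   /* branch goes the other way: never reported */

    printf("sum = %d\n", sum); /* prints 40 */
    return 0;
}
```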
1122 00:49:42,520 --> 00:49:49,030 The second huge area concerns the two real big disadvantages 1123 00:49:49,030 --> 00:49:53,360 of superscalars and VLIWs. 1124 00:49:53,360 --> 00:49:57,530 And I'm going to talk about the VLIW disadvantage, which is 1125 00:49:57,530 --> 00:49:59,570 what the next slide will say. 1126 00:49:59,570 --> 00:50:01,250 But in the course of doing it, I'll 1127 00:50:01,250 --> 00:50:03,500 compare it to superscalars. 1128 00:50:03,500 --> 00:50:08,450 And I'll talk about what the big disadvantage for superscalars is. 1129 00:50:08,450 --> 00:50:14,220 The big disadvantage for VLIWs is object code compatibility. 1130 00:50:14,220 --> 00:50:18,270 This is really the Achilles heel of these architectures. 1131 00:50:18,270 --> 00:50:20,040 And it is a big problem. 1132 00:50:20,040 --> 00:50:23,325 And I'm personally dying to know just how all of this 1133 00:50:23,325 --> 00:50:24,200 is going to turn out. 1134 00:50:24,200 --> 00:50:25,740 I certainly have my guesses. 1135 00:50:25,740 --> 00:50:31,800 But right now VLIWs pose a very big engineering problem 1136 00:50:31,800 --> 00:50:35,250 for computer manufacturers. 1137 00:50:35,250 --> 00:50:39,110 This is not a technical problem in terms of building systems 1138 00:50:39,110 --> 00:50:41,430 that work correctly and go fast. 1139 00:50:41,430 --> 00:50:44,870 It's the problem that when you build a VLIW 1140 00:50:44,870 --> 00:50:47,023 and you're a computer manufacturer, 1141 00:50:47,023 --> 00:50:48,440 then when you 1142 00:50:48,440 --> 00:50:51,530 want to ship your next VLIW, it's 1143 00:50:51,530 --> 00:50:54,080 unlikely to run the code that was 1144 00:50:54,080 --> 00:50:57,490 running on your previous VLIW. 1145 00:50:57,490 --> 00:50:59,140 Let me explain why that is. 1146 00:50:59,140 --> 00:51:02,470 If what you've done in a VLIW is written 1147 00:51:02,470 --> 00:51:09,200 the actual record of execution that you're expecting to get, 1148 00:51:09,200 --> 00:51:12,620 then what happens when you've put a multiply here 1149 00:51:12,620 --> 00:51:15,920 and an add that uses its data three cycles later 1150 00:51:15,920 --> 00:51:19,520 and you build a new machine in which the multiply takes 1151 00:51:19,520 --> 00:51:22,090 five cycles instead of three? 1152 00:51:22,090 --> 00:51:25,240 This old code will not have the right answer waiting for it 1153 00:51:25,240 --> 00:51:26,890 after three cycles. 1154 00:51:26,890 --> 00:51:31,452 Now, you can put in a way of stopping and waiting, spinning 1155 00:51:31,452 --> 00:51:33,160 for a couple of cycles, and that happens. 1156 00:51:33,160 --> 00:51:34,090 But now you're going to throw away 1157 00:51:34,090 --> 00:51:35,740 a lot of the advantage of your new system 1158 00:51:35,740 --> 00:51:36,850 when it runs the old code. 1159 00:51:36,850 --> 00:51:38,642 That's pretty unacceptable, and even that's 1160 00:51:38,642 --> 00:51:41,260 hard to engineer in some aspects. 1161 00:51:41,260 --> 00:51:44,140 So the problem is that when you have your old code 1162 00:51:44,140 --> 00:51:45,940 and you want to run it on your new VLIW, 1163 00:51:45,940 --> 00:51:48,940 it's only OK if you've just scaled the clock. 1164 00:51:48,940 --> 00:51:51,670 If everything is in the same relation to each other, 1165 00:51:51,670 --> 00:51:54,460 if all the latencies are the same relative to each other, 1166 00:51:54,460 --> 00:51:55,400 that's fine. 1167 00:51:55,400 --> 00:51:57,410 Same number of clock signals.
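The toy simulation below replays that three-versus-five-cycle multiply story; the register contents, the latencies, and the two-slot schedule are all invented. The schedule was frozen on the assumption of a 3-cycle multiply, so the dependent add was placed exactly three cycles later, and on a machine where the multiply takes five cycles the add quietly reads a stale register.

```c
/* Invented illustration of why a fixed VLIW schedule breaks when an
 * implementation changes a latency: the add at cycle 3 assumes the
 * multiply issued at cycle 0 has already written its result. */
#include <stdio.h>

static int run(int mul_latency) {
    int r[8] = { 0, 6, 7, 0, 5, 0, 0, 0 };  /* r1 = 6, r2 = 7, r4 = 5 */
    int mul_result = 0, mul_done_cycle = -1;

    for (int cycle = 0; cycle < 8; cycle++) {
        if (cycle == 0) {                    /* slot fixed by the old schedule */
            mul_result = r[1] * r[2];
            mul_done_cycle = cycle + mul_latency;
        }
        if (cycle == mul_done_cycle)
            r[3] = mul_result;               /* multiply writes back           */
        if (cycle == 3)
            r[5] = r[3] + r[4];              /* add scheduled 3 cycles later   */
    }
    return r[5];
}

int main(void) {
    printf("3-cycle multiply (as scheduled): r5 = %d\n", run(3));  /* 47       */
    printf("5-cycle multiply (new machine):  r5 = %d\n", run(5));  /* stale: 5 */
    return 0;
}
```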
1168 00:51:57,410 --> 00:52:00,880 But if you've done anything else about changing the latencies, 1169 00:52:00,880 --> 00:52:03,470 you have a real problem in this. 1170 00:52:03,470 --> 00:52:07,510 Now, there are a lot of hardware and software tricks 1171 00:52:07,510 --> 00:52:10,930 that people have talked about playing for VLIWs. 1172 00:52:10,930 --> 00:52:15,040 And there's a lot of energy being put toward really solving 1173 00:52:15,040 --> 00:52:18,470 this problem in the system by sort 1174 00:52:18,470 --> 00:52:20,750 of importing a little bit of superscalarness 1175 00:52:20,750 --> 00:52:22,700 into the machines. 1176 00:52:22,700 --> 00:52:25,790 And the machines that have been built, both the Multiflow 1177 00:52:25,790 --> 00:52:28,680 machine and the Cydrome machine, for example, 1178 00:52:28,680 --> 00:52:31,040 which were the two most ambitious VLIWs that 1179 00:52:31,040 --> 00:52:32,870 were built in the 1980s, 1180 00:52:32,870 --> 00:52:37,760 both had some mechanism whereby you could do loads 1181 00:52:37,760 --> 00:52:40,430 and in case the load had to delay 1182 00:52:40,430 --> 00:52:42,310 for one reason or the other, the machine 1183 00:52:42,310 --> 00:52:43,310 could take care of that. 1184 00:52:43,310 --> 00:52:44,810 It could sort of stop and wait. 1185 00:52:44,810 --> 00:52:46,760 But you don't want to push that idea too far, 1186 00:52:46,760 --> 00:52:49,880 or you're throwing away all of the performance that the VLIW 1187 00:52:49,880 --> 00:52:51,980 is supposed to be giving you. 1188 00:52:51,980 --> 00:52:54,560 Now, there's a second approach to solving this, 1189 00:52:54,560 --> 00:52:57,320 and that is to distribute something different from fully 1190 00:52:57,320 --> 00:52:58,730 bound object code. 1191 00:52:58,730 --> 00:53:01,010 This has appeared in a few experimental systems. 1192 00:53:01,010 --> 00:53:03,230 Even some little bit of extra work 1193 00:53:03,230 --> 00:53:07,140 is done in linkers in some commercial systems as well. 1194 00:53:07,140 --> 00:53:10,740 So what you distribute is not quite fully bound object code. 1195 00:53:10,740 --> 00:53:12,980 Well, you might want to take that idea to a much 1196 00:53:12,980 --> 00:53:15,260 greater extreme in a VLIW. 1197 00:53:15,260 --> 00:53:17,480 And what you might want to do instead 1198 00:53:17,480 --> 00:53:20,930 is to ship something in which the final scheduling 1199 00:53:20,930 --> 00:53:22,760 step hasn't been done. 1200 00:53:22,760 --> 00:53:25,130 The dependence analysis has been done. 1201 00:53:25,130 --> 00:53:28,460 Some amount of the scheduling perhaps has been done. 1202 00:53:28,460 --> 00:53:32,270 But the linker, basically, or what looks to the user 1203 00:53:32,270 --> 00:53:34,940 like the linker does this final scheduling 1204 00:53:34,940 --> 00:53:37,320 at install time for the code. 1205 00:53:37,320 --> 00:53:41,090 So that's distributing some sort of raised object code, 1206 00:53:41,090 --> 00:53:42,710 something a level beyond what you'd normally expect to ship. 1207 00:53:42,710 --> 00:53:45,110 So those are the directions that people 1208 00:53:45,110 --> 00:53:49,320 have been using for solving this object code compatibility 1209 00:53:49,320 --> 00:53:49,820 problem. 1210 00:53:49,820 --> 00:53:52,310 Now, of course, there's another solution to the object code 1211 00:53:52,310 --> 00:53:56,390 compatibility problem, and that is to build a superscalar.
1212 00:53:56,390 --> 00:53:58,610 And at first glance, and in reality 1213 00:53:58,610 --> 00:54:00,800 in the systems we ship today, this problem 1214 00:54:00,800 --> 00:54:03,950 is completely solved by shipping a superscalar. 1215 00:54:03,950 --> 00:54:06,530 A superscalar, after all, gets the sequential code 1216 00:54:06,530 --> 00:54:08,100 in the hardware. 1217 00:54:08,100 --> 00:54:09,870 The thing that you send to the user 1218 00:54:09,870 --> 00:54:12,480 is the stuff that comes out after the squiggly line 1219 00:54:12,480 --> 00:54:14,730 way over here, not the squiggly line right over there, 1220 00:54:14,730 --> 00:54:16,590 if you remember that slide. 1221 00:54:16,590 --> 00:54:18,870 And all the rest of the stuff that's 1222 00:54:18,870 --> 00:54:20,430 changing according to the latencies 1223 00:54:20,430 --> 00:54:23,020 and so on of the hardware is being done in the hardware. 1224 00:54:23,020 --> 00:54:24,990 So you just build different control hardware 1225 00:54:24,990 --> 00:54:27,090 as part of the implementation of the superscalar. 1226 00:54:27,090 --> 00:54:29,220 It's going to run the sequential code. 1227 00:54:29,220 --> 00:54:32,340 As far as the user's concerned, distribute the sequential code, 1228 00:54:32,340 --> 00:54:33,900 it'll run correctly on every system. 1229 00:54:33,900 --> 00:54:36,300 No problem. 1230 00:54:36,300 --> 00:54:37,980 There is a problem with it. 1231 00:54:37,980 --> 00:54:40,440 For systems with the amount of ILP 1232 00:54:40,440 --> 00:54:47,520 that we see today, the problem really is solved just fine, 1233 00:54:47,520 --> 00:54:49,770 I believe. 1234 00:54:49,770 --> 00:54:54,510 But what I'm not convinced of is that the problem is solved 1235 00:54:54,510 --> 00:54:57,420 for large amounts more ILP. 1236 00:54:57,420 --> 00:54:59,910 And I'd like to explain why and what the problem is. 1237 00:54:59,910 --> 00:55:02,790 Now, this is all conjecture. 1238 00:55:02,790 --> 00:55:04,710 People have never built a superscalar 1239 00:55:04,710 --> 00:55:09,960 that goes beyond the superscalars we have today. 1240 00:55:09,960 --> 00:55:11,610 We finally, in fact, have gotten up 1241 00:55:11,610 --> 00:55:13,680 just sort of to the level of the big machines 1242 00:55:13,680 --> 00:55:17,707 that people were building dozens of years ago. 1243 00:55:17,707 --> 00:55:19,540 And it's going to be very interesting to see 1244 00:55:19,540 --> 00:55:21,880 what direction this goes in now that we 1245 00:55:21,880 --> 00:55:24,880 have machines that have finally gotten up to that boundary. 1246 00:55:24,880 --> 00:55:27,400 But here's the problem with large amounts of instruction 1247 00:55:27,400 --> 00:55:29,610 level parallelism. 1248 00:55:29,610 --> 00:55:32,920 First of all, you might want to ask 1249 00:55:32,920 --> 00:55:35,320 whether that amount of instruction level parallelism 1250 00:55:35,320 --> 00:55:36,737 is there. 1251 00:55:36,737 --> 00:55:38,320 Before the end of this talk, I'm going 1252 00:55:38,320 --> 00:55:40,570 to talk about how much instruction level 1253 00:55:40,570 --> 00:55:43,690 parallelism I think is available in different kinds of codes. 1254 00:55:43,690 --> 00:55:47,290 But one quick answer before I talk 1255 00:55:47,290 --> 00:55:48,880 about this problem with superscalars 1256 00:55:48,880 --> 00:55:51,490 is to say, well, maybe this is all the ILP 1257 00:55:51,490 --> 00:55:54,910 there really is in most programs except for scientific programs. 
1258 00:55:54,910 --> 00:55:56,470 Scientific programs, you give them 1259 00:55:56,470 --> 00:55:58,990 a vector unit along with your modest superscalar 1260 00:55:58,990 --> 00:56:00,730 and then you're just doing fine. 1261 00:56:00,730 --> 00:56:02,970 And if that's all there is, then this is no problem. 1262 00:56:02,970 --> 00:56:05,300 Then we can go on from there. 1263 00:56:05,300 --> 00:56:07,090 So that's one simple answer. 1264 00:56:07,090 --> 00:56:10,090 I don't believe it's going to be true in the end, 1265 00:56:10,090 --> 00:56:12,348 and I certainly hope it's not true in the end. 1266 00:56:12,348 --> 00:56:13,390 But it's always possible. 1267 00:56:13,390 --> 00:56:16,410 There's no way to know for sure right now. 1268 00:56:16,410 --> 00:56:18,900 But what happens when you have large amounts of ILP? 1269 00:56:18,900 --> 00:56:21,420 One negative answer is you can't even 1270 00:56:21,420 --> 00:56:24,455 build a superscalar like that. 1271 00:56:24,455 --> 00:56:26,830 After all, if you have large amounts of instruction level 1272 00:56:26,830 --> 00:56:28,840 parallelism, then you have to be looking 1273 00:56:28,840 --> 00:56:31,710 at a lot of operations at once. 1274 00:56:31,710 --> 00:56:35,080 And to do the independence recognition, 1275 00:56:35,080 --> 00:56:37,540 you have to look at these operations 1276 00:56:37,540 --> 00:56:41,987 basically pairwise and ask, are they independent of each other? 1277 00:56:41,987 --> 00:56:43,570 Whatever algorithm you're going to use 1278 00:56:43,570 --> 00:56:45,880 is going to seem to grow exponentially 1279 00:56:45,880 --> 00:56:49,130 in the number of operations you want to consider. 1280 00:56:49,130 --> 00:56:51,230 If you have a large amount of operations 1281 00:56:51,230 --> 00:56:53,840 that you're trying to fire every cycle, 1282 00:56:53,840 --> 00:56:57,660 you better be considering a huge amount more than that. 1283 00:56:57,660 --> 00:57:01,230 And having this control hardware grow exponentially 1284 00:57:01,230 --> 00:57:04,770 in the number of operations is likely to be really bad news 1285 00:57:04,770 --> 00:57:06,090 for the hardware designer. 1286 00:57:06,090 --> 00:57:07,757 I think if you talk to the people who've 1287 00:57:07,757 --> 00:57:11,730 built today's most ambitious superscalars, most of them, 1288 00:57:11,730 --> 00:57:13,620 maybe they're more plucky in public, 1289 00:57:13,620 --> 00:57:16,390 but they're not really excited about the next steps. 1290 00:57:16,390 --> 00:57:19,840 It really seems hard to go much beyond where we are now. 1291 00:57:19,840 --> 00:57:21,750 So this is a real problem. 1292 00:57:21,750 --> 00:57:22,590 Totally unknown. 1293 00:57:22,590 --> 00:57:25,470 It might turn out that there are nice hardware algorithms you 1294 00:57:25,470 --> 00:57:28,930 can use that will make this problem tractable 1295 00:57:28,930 --> 00:57:30,430 and then everything is fine. 1296 00:57:30,430 --> 00:57:34,720 But question one is, can you really scale superscalars 1297 00:57:34,720 --> 00:57:36,580 much beyond where we are today? 1298 00:57:36,580 --> 00:57:38,680 It's not a problem for VLIWs. 1299 00:57:38,680 --> 00:57:42,587 You have to scale the compiling time, which unfortunately 1300 00:57:42,587 --> 00:57:44,920 in this tape there isn't enough time for me to really go 1301 00:57:44,920 --> 00:57:45,650 into that subject.
1302 00:57:45,650 --> 00:57:47,108 But I don't think compiling time is 1303 00:57:47,108 --> 00:57:50,440 going to be a very big barrier in the end to using VLIWs. 1304 00:57:50,440 --> 00:57:51,940 I think compiling is going to end up 1305 00:57:51,940 --> 00:57:55,240 being very fast, even very sophisticated compiling 1306 00:57:55,240 --> 00:57:55,980 for VLIWs. 1307 00:57:55,980 --> 00:57:57,730 That's something I'm very confident about. 1308 00:57:57,730 --> 00:58:03,910 But scaling these superscalars, so VLIWs scale very nicely. 1309 00:58:03,910 --> 00:58:05,440 You just add more functionality. 1310 00:58:05,440 --> 00:58:10,920 It's a pretty linear design, and it's not a huge problem 1311 00:58:10,920 --> 00:58:11,832 to grow it. 1312 00:58:11,832 --> 00:58:13,290 But growing the superscalar, you've 1313 00:58:13,290 --> 00:58:15,870 got the central control unit more or less all 1314 00:58:15,870 --> 00:58:19,230 in one place that has to do this exponentially growing job. 1315 00:58:19,230 --> 00:58:21,660 It's kind of scary, actually. 1316 00:58:21,660 --> 00:58:23,370 Next question is, what about the register 1317 00:58:23,370 --> 00:58:25,410 architecture of the machine? 1318 00:58:25,410 --> 00:58:27,000 Once you fix your architecture, you 1319 00:58:27,000 --> 00:58:29,770 have to fix the number of registers. 1320 00:58:29,770 --> 00:58:32,710 If you have a lot more instruction level parallelism, 1321 00:58:32,710 --> 00:58:35,260 you have a lot more operations in flight at the same time. 1322 00:58:35,260 --> 00:58:37,570 There's a natural pressure that starts 1323 00:58:37,570 --> 00:58:40,030 going on the number of registers you have. 1324 00:58:40,030 --> 00:58:42,790 Unless you start with a very large number of registers, 1325 00:58:42,790 --> 00:58:47,373 enough to supply data for your most ambitious superscalar, 1326 00:58:47,373 --> 00:58:48,790 then you're going to find yourself 1327 00:58:48,790 --> 00:58:51,820 starved as you go from the inexpensive model 1328 00:58:51,820 --> 00:58:53,080 to the expensive model. 1329 00:58:53,080 --> 00:58:55,430 They all have to have the same register architecture. 1330 00:58:55,430 --> 00:58:59,125 So that's a bit of a limit in scaling superscalars that 1331 00:58:59,125 --> 00:59:01,750 really is a problem if you want to stay object code compatible. 1332 00:59:01,750 --> 00:59:04,450 You can't, after all, send instructions 1333 00:59:04,450 --> 00:59:07,030 that refer to register 39 in a machine that 1334 00:59:07,030 --> 00:59:09,430 only has 32 registers. 1335 00:59:09,430 --> 00:59:10,870 And then the bigger question. 1336 00:59:10,870 --> 00:59:13,060 And here's a real unknown, and I'm 1337 00:59:13,060 --> 00:59:16,090 really curious about what the answer to this is. 1338 00:59:16,090 --> 00:59:18,730 That is, suppose you build two different superscalar 1339 00:59:18,730 --> 00:59:21,777 implementations of the same architecture. 1340 00:59:21,777 --> 00:59:23,860 Now, they're both going to run the code correctly. 1341 00:59:23,860 --> 00:59:25,580 That's not an issue. 1342 00:59:25,580 --> 00:59:26,980 But what happens if the latencies 1343 00:59:26,980 --> 00:59:28,890 are dramatically different? 1344 00:59:28,890 --> 00:59:31,900 For example, you might have one expensive one and one 1345 00:59:31,900 --> 00:59:33,290 low cost model. 
1346 00:59:33,290 --> 00:59:35,050 And if you look at those two, it's 1347 00:59:35,050 --> 00:59:37,570 very likely going to be that the ratio 1348 00:59:37,570 --> 00:59:41,890 of the long latency operations, like memory 1349 00:59:41,890 --> 00:59:44,680 operations and floating point operations, 1350 00:59:44,680 --> 00:59:48,040 compared to integer operations is going to be very much 1351 00:59:48,040 --> 00:59:51,550 greater in the low cost model than it will in the highest 1352 00:59:51,550 --> 00:59:52,330 performance model. 1353 00:59:52,330 --> 00:59:54,520 Because that's sort of where you put your speed 1354 00:59:54,520 --> 00:59:55,770 in the high performance model. 1355 00:59:55,770 --> 01:00:00,370 You engineer those systems, multi-level caches and fast, 1356 01:00:00,370 --> 01:00:02,560 fast floating point into the hardware. 1357 01:00:02,560 --> 01:00:04,270 And that's where you put your dollar. 1358 01:00:04,270 --> 01:00:06,790 So now you have these tremendously different ratios 1359 01:00:06,790 --> 01:00:09,160 of latencies of operations between the two 1360 01:00:09,160 --> 01:00:10,180 implementations. 1361 01:00:10,180 --> 01:00:14,530 Can you really distribute the same object code to both 1362 01:00:14,530 --> 01:00:17,830 and not give up tremendous performance on one of them 1363 01:00:17,830 --> 01:00:21,130 compared to what you could have if you distributed two 1364 01:00:21,130 --> 01:00:23,710 separate sets of object code? 1365 01:00:23,710 --> 01:00:26,140 Big unknown. 1366 01:00:26,140 --> 01:00:30,670 If it turns out that on, for example, the low cost model 1367 01:00:30,670 --> 01:00:33,040 your performance suffers terribly 1368 01:00:33,040 --> 01:00:37,270 because you ship the expensive model's object code, 1369 01:00:37,270 --> 01:00:39,490 third party software vendors are probably 1370 01:00:39,490 --> 01:00:41,140 going to be very reluctant to do that. 1371 01:00:41,140 --> 01:00:43,600 They're going to want their best performance on the machine 1372 01:00:43,600 --> 01:00:45,730 that you sold zillions of, not on the machine 1373 01:00:45,730 --> 01:00:46,960 that you sold a very few of. 1374 01:00:46,960 --> 01:00:49,180 But then what happens is you're losing your advantage 1375 01:00:49,180 --> 01:00:50,290 on the expensive machine. 1376 01:00:50,290 --> 01:00:52,123 If they have to ship two different versions, 1377 01:00:52,123 --> 01:00:53,710 there goes object code compatibility. 1378 01:00:53,710 --> 01:00:55,127 The whole point of this is to only 1379 01:00:55,127 --> 01:00:57,100 have one set of object code. 1380 01:00:57,100 --> 01:01:00,060 1381 01:01:00,060 --> 01:01:01,060 And that's just unknown. 1382 01:01:01,060 --> 01:01:02,848 It could well be that all you have to do 1383 01:01:02,848 --> 01:01:05,140 is get the things that are independent near each other. 1384 01:01:05,140 --> 01:01:06,825 That's good enough and will get you 1385 01:01:06,825 --> 01:01:08,200 close to the best performance you 1386 01:01:08,200 --> 01:01:10,150 can get on either kind of machine, 1387 01:01:10,150 --> 01:01:12,370 on either implementation of the superscalar. 1388 01:01:12,370 --> 01:01:17,190 But that's a really big unknown for us in this field. 1389 01:01:17,190 --> 01:01:18,720 Now, there's one other thing. 
1390 01:01:18,720 --> 01:01:21,170 And that is that you very often hear 1391 01:01:21,170 --> 01:01:24,320 people say that what they don't like about VLIW 1392 01:01:24,320 --> 01:01:26,720 and that they do like about superscalars is 1393 01:01:26,720 --> 01:01:30,350 that the compiling effort is so bad on VLIWs 1394 01:01:30,350 --> 01:01:32,840 and superscalars aren't like that. 1395 01:01:32,840 --> 01:01:37,448 And really, I have my beliefs about these subjects, 1396 01:01:37,448 --> 01:01:38,990 and maybe they come through, but I've 1397 01:01:38,990 --> 01:01:42,710 tried to really say what I think is known and what's not known. 1398 01:01:42,710 --> 01:01:45,080 This is not known either. 1399 01:01:45,080 --> 01:01:47,040 But I just feel so strongly about it, 1400 01:01:47,040 --> 01:01:49,580 I have a lot of trouble believing that the statement 1401 01:01:49,580 --> 01:01:51,410 could possibly be true. 1402 01:01:51,410 --> 01:01:53,913 To me the effort in compiling-- 1403 01:01:53,913 --> 01:01:54,830 and not the statement. 1404 01:01:54,830 --> 01:01:56,205 The statement that you don't have 1405 01:01:56,205 --> 01:02:00,162 to compile very ambitiously for superscalars. 1406 01:02:00,162 --> 01:02:02,370 What I believe is that this statement really is true. 1407 01:02:02,370 --> 01:02:04,460 And that is that the effort in compiling 1408 01:02:04,460 --> 01:02:07,490 is a function of the ILP that you want to get. 1409 01:02:07,490 --> 01:02:10,190 It's not a function of whether the machine is 1410 01:02:10,190 --> 01:02:14,260 a VLIW or a superscalar. 1411 01:02:14,260 --> 01:02:17,100 And my argument is as follows. 1412 01:02:17,100 --> 01:02:19,380 The argument that it's easier for a superscalar 1413 01:02:19,380 --> 01:02:22,470 is that the superscalar looks down the instruction stream 1414 01:02:22,470 --> 01:02:25,170 and it does the data dependence analysis 1415 01:02:25,170 --> 01:02:30,100 and it does the scheduling and it predicts the branches. 1416 01:02:30,100 --> 01:02:32,020 And you can do branch prediction dynamically 1417 01:02:32,020 --> 01:02:34,020 while the program is running and even do it more 1418 01:02:34,020 --> 01:02:36,550 accurately as a result. It does all of the stuff 1419 01:02:36,550 --> 01:02:38,620 that you've talked about doing in the compiler, 1420 01:02:38,620 --> 01:02:40,210 does it while the program's running. 1421 01:02:40,210 --> 01:02:42,830 Why on Earth would you do it in the compiler as well? 1422 01:02:42,830 --> 01:02:44,920 So you don't need an ambitious compiler. 1423 01:02:44,920 --> 01:02:46,630 Well, most of that argument is true, 1424 01:02:46,630 --> 01:02:48,940 but it turns out a crucial piece isn't. 1425 01:02:48,940 --> 01:02:52,780 Yes, the superscalar can look down a long ways 1426 01:02:52,780 --> 01:02:54,880 by predicting branches dynamically. 1427 01:02:54,880 --> 01:02:57,920 And yes, a superscalar probably can do the scheduling. 1428 01:02:57,920 --> 01:03:00,010 Scheduling is a pretty linear job 1429 01:03:00,010 --> 01:03:01,840 and it's probably practical to build 1430 01:03:01,840 --> 01:03:04,780 a fair amount of scheduling into the hardware. 1431 01:03:04,780 --> 01:03:08,550 The problem is in the data dependence analysis. 1432 01:03:08,550 --> 01:03:13,300 And if a superscalar is going to get a lot of instruction level 1433 01:03:13,300 --> 01:03:16,660 parallelism, it's going to have to look very far down, 1434 01:03:16,660 --> 01:03:19,987 hundreds of operations down the instruction stream to get that. 
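As a sketch of what that checking amounts to, the little program below compares every operation in a window against every earlier one; the register numbers and the four-operation window are invented. The number of pairwise checks alone grows as N(N-1)/2 in the window size N, and the hardware to do them all, in parallel, every cycle, is where the real expense shows up.

```c
/* Invented sketch of pairwise dependence checking over an issue window. */
#include <stdio.h>

struct op { int src1, src2, dst; };

static int depends(struct op later, struct op earlier) {
    return later.src1 == earlier.dst  ||   /* read after write  */
           later.src2 == earlier.dst  ||
           later.dst  == earlier.dst  ||   /* write after write */
           later.dst  == earlier.src1 ||   /* write after read  */
           later.dst  == earlier.src2;
}

int main(void) {
    struct op window[] = {
        { 0, 1, 2 },   /* op 0: r2 <- r0 op r1                   */
        { 2, 3, 4 },   /* op 1: reads op 0's result              */
        { 5, 6, 7 },   /* op 2: independent of everything so far */
        { 4, 7, 0 },   /* op 3: reads r4 and r7, rewrites r0     */
    };
    int n = (int)(sizeof window / sizeof window[0]);
    int checks = 0;

    for (int j = 1; j < n; j++)
        for (int i = 0; i < j; i++) {
            checks++;
            if (depends(window[j], window[i]))
                printf("op %d must wait for op %d\n", j, i);
        }
    printf("%d pairwise checks for a window of %d operations\n", checks, n);
    return 0;
}
```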
1435 01:03:19,987 --> 01:03:22,570 Well, the superscalar is going to have something that could be 1436 01:03:22,570 --> 01:03:25,040 called a window of opportunity. 1437 01:03:25,040 --> 01:03:26,890 It's going to have sort of a sliding window 1438 01:03:26,890 --> 01:03:28,660 ahead of where we're executing. 1439 01:03:28,660 --> 01:03:31,240 Where it's looking in order to pull things up, 1440 01:03:31,240 --> 01:03:34,510 there's going to be some finite number of operations 1441 01:03:34,510 --> 01:03:36,500 that it can look down. 1442 01:03:36,500 --> 01:03:39,252 And the operations that are data independent 1443 01:03:39,252 --> 01:03:41,460 are going to be peppered through the code down there. 1444 01:03:41,460 --> 01:03:43,050 They're not going to be all clumped together. 1445 01:03:43,050 --> 01:03:45,110 In fact, the opposite effect tends to happen. 1446 01:03:45,110 --> 01:03:47,935 Usually operations are data dependent upon the ones written 1447 01:03:47,935 --> 01:03:49,310 right before them, because that's 1448 01:03:49,310 --> 01:03:50,518 the way programs are written. 1449 01:03:50,518 --> 01:03:53,425 So do some stream of computation and then some more or less 1450 01:03:53,425 --> 01:03:54,800 independent stream of computation 1451 01:03:54,800 --> 01:03:56,960 later and then finally collect those results 1452 01:03:56,960 --> 01:03:58,530 and do something else. 1453 01:03:58,530 --> 01:04:01,640 So you have to look way down the stream to do that. 1454 01:04:01,640 --> 01:04:05,060 That's going to get a huge window of opportunity that's 1455 01:04:05,060 --> 01:04:07,760 not going to be practical to build superscalar. 1456 01:04:07,760 --> 01:04:12,350 Too much of this exponentially growing dependence analysis. 1457 01:04:12,350 --> 01:04:14,710 And so the solution is that you need 1458 01:04:14,710 --> 01:04:17,920 to have a compiler which does the dependence 1459 01:04:17,920 --> 01:04:21,130 analysis in the beginning, does code rearrangement. 1460 01:04:21,130 --> 01:04:22,990 Probably you want the compiler to arrange 1461 01:04:22,990 --> 01:04:24,817 for speculative execution then. 1462 01:04:24,817 --> 01:04:26,650 That's probably easier than trying to decide 1463 01:04:26,650 --> 01:04:28,538 to do that on the fly. 1464 01:04:28,538 --> 01:04:30,080 And it's hard to see how you're going 1465 01:04:30,080 --> 01:04:34,040 to avoid doing this work in the compiler, virtually 1466 01:04:34,040 --> 01:04:38,600 the same work you have to do to compile for a VLIW. 1467 01:04:38,600 --> 01:04:42,620 The two really do seem very much equivalent to me. 1468 01:04:42,620 --> 01:04:45,680 So I don't believe the argument that a superscalar 1469 01:04:45,680 --> 01:04:48,392 is going to buy you compiling speed at all. 1470 01:04:48,392 --> 01:04:49,850 Now, there's one other thing I want 1471 01:04:49,850 --> 01:04:52,340 to mention on the subject of object code compatibility, 1472 01:04:52,340 --> 01:04:55,280 and that is the VLIW disadvantage 1473 01:04:55,280 --> 01:05:00,900 of having a new implementation have changed latencies. 1474 01:05:00,900 --> 01:05:03,180 And thus having the inability to run 1475 01:05:03,180 --> 01:05:06,570 the old code on the new machine has another aspect to it. 1476 01:05:06,570 --> 01:05:11,220 And that is within a given implementation, a given VLIW, 1477 01:05:11,220 --> 01:05:13,980 certain operations that you want to do 1478 01:05:13,980 --> 01:05:17,380 might have variable latency depending upon the data. 
1479 01:05:17,380 --> 01:05:20,700 So I mentioned, for example, that the Multiflow and Cydrome 1480 01:05:20,700 --> 01:05:24,990 machines both had the property that the designers built 1481 01:05:24,990 --> 01:05:28,020 in the ability for a long, long load 1482 01:05:28,020 --> 01:05:31,290 to occur, longer than the schedule expected, 1483 01:05:31,290 --> 01:05:33,990 and then the machine would stall in order 1484 01:05:33,990 --> 01:05:36,840 to get the correct answer when that happened. 1485 01:05:36,840 --> 01:05:41,250 So in that case, we've seen another disadvantage of VLIWs. 1486 01:05:41,250 --> 01:05:44,790 And that is that a super scalar can dynamically 1487 01:05:44,790 --> 01:05:49,893 arrange the schedule on the fly to only issue the things that 1488 01:05:49,893 --> 01:05:51,810 are data ready, looking at what's data already 1489 01:05:51,810 --> 01:05:52,435 at that moment. 1490 01:05:52,435 --> 01:05:56,010 So it can adjust to longer and shorter variable latencies. 1491 01:05:56,010 --> 01:05:58,650 A VLIW, you have to decide in advance. 1492 01:05:58,650 --> 01:06:00,810 So that's another advantage of superscalars. 1493 01:06:00,810 --> 01:06:03,300 I'm not really convinced that's an important advantage. 1494 01:06:03,300 --> 01:06:05,520 And I was at the same conference, 1495 01:06:05,520 --> 01:06:07,200 the VLIW conference. 1496 01:06:07,200 --> 01:06:09,690 Butler Lampson gave the keynote speech. 1497 01:06:09,690 --> 01:06:12,150 And he mentioned something in a totally unrelated area that 1498 01:06:12,150 --> 01:06:14,100 really reminded me of this. 1499 01:06:14,100 --> 01:06:16,560 And that is in communications, you have the same problem. 1500 01:06:16,560 --> 01:06:19,260 You have asynchronous versus synchronous communications. 1501 01:06:19,260 --> 01:06:23,650 And the trade off is that asynchronous communications 1502 01:06:23,650 --> 01:06:28,430 can adjust to changed latencies and communications. 1503 01:06:28,430 --> 01:06:35,030 And thus, you don't have to do a worst case analysis in order 1504 01:06:35,030 --> 01:06:36,770 to set up the communications. 1505 01:06:36,770 --> 01:06:38,270 Synchronous communications, you have 1506 01:06:38,270 --> 01:06:40,010 to be sure the data is getting there. 1507 01:06:40,010 --> 01:06:42,200 So you have to do a worst case analysis. 1508 01:06:42,200 --> 01:06:44,510 And it's been everyone's intuition 1509 01:06:44,510 --> 01:06:49,610 that asynchronous communications would be faster because you 1510 01:06:49,610 --> 01:06:52,130 didn't have to allow for the worst that could possibly 1511 01:06:52,130 --> 01:06:52,640 happen. 1512 01:06:52,640 --> 01:06:55,040 Of course, there's the overhead of figuring out 1513 01:06:55,040 --> 01:06:56,552 whether everything was OK or not. 1514 01:06:56,552 --> 01:06:58,760 That's what's wrong with asynchronous communications. 1515 01:06:58,760 --> 01:07:00,740 That's a lot like superscalars. 1516 01:07:00,740 --> 01:07:03,650 And synchronous communications are a lot like VLIWs. 1517 01:07:03,650 --> 01:07:06,710 And Butler's comment was everybody 1518 01:07:06,710 --> 01:07:09,080 always expects asynchronous communications to be faster. 1519 01:07:09,080 --> 01:07:12,560 Synchronous communications is always faster, as it turns out. 1520 01:07:12,560 --> 01:07:15,390 And I was very much reminded of this trade off. 
1521 01:07:15,390 --> 01:07:17,390 So I really do believe that that problem 1522 01:07:17,390 --> 01:07:19,490 of changing latencies of functional units is 1523 01:07:19,490 --> 01:07:20,580 just not a big deal. 1524 01:07:20,580 --> 01:07:24,260 People often cite it as one of the real advantages 1525 01:07:24,260 --> 01:07:25,233 of superscalars. 1526 01:07:25,233 --> 01:07:26,900 There's another real advantage, and this 1527 01:07:26,900 --> 01:07:32,280 is another part of the hardware software trade off. 1528 01:07:32,280 --> 01:07:34,170 The typical classical trade off you 1529 01:07:34,170 --> 01:07:36,030 have in system design when you're 1530 01:07:36,030 --> 01:07:38,490 doing hardware versus software choices 1531 01:07:38,490 --> 01:07:41,730 is that in the software-- 1532 01:07:41,730 --> 01:07:46,080 I'm sorry, in the hardware, you have perfect knowledge. 1533 01:07:46,080 --> 01:07:48,300 That is, you're making your decisions 1534 01:07:48,300 --> 01:07:50,400 while the program is running. 1535 01:07:50,400 --> 01:07:54,640 And so you can see all the values of everything. 1536 01:07:54,640 --> 01:07:58,990 Whereas in the software, you have incomplete knowledge. 1537 01:07:58,990 --> 01:08:01,450 You're analyzing a program that hasn't yet run. 1538 01:08:01,450 --> 01:08:02,800 You don't know what the data is. 1539 01:08:02,800 --> 01:08:05,050 You don't know which way the flow of control has gone. 1540 01:08:05,050 --> 01:08:07,460 You're missing an awful lot of information. 1541 01:08:07,460 --> 01:08:11,090 The good news is you have all the time in the world 1542 01:08:11,090 --> 01:08:12,110 to do your analysis. 1543 01:08:12,110 --> 01:08:14,573 Of course, if your compiler is that slow, 1544 01:08:14,573 --> 01:08:16,740 then you're not going to have a very viable product. 1545 01:08:16,740 --> 01:08:19,399 But even within the bounds of what people expect, 1546 01:08:19,399 --> 01:08:21,649 you have a lot more than the four nanoseconds 1547 01:08:21,649 --> 01:08:24,450 the hardware is going to have in which to make its decisions. 1548 01:08:24,450 --> 01:08:27,050 So in the software, you can do a much more sophisticated 1549 01:08:27,050 --> 01:08:28,250 analysis. 1550 01:08:28,250 --> 01:08:30,590 In the hardware, you're really stuck 1551 01:08:30,590 --> 01:08:34,939 with doing something a lot more shallow but with better 1552 01:08:34,939 --> 01:08:36,800 information. 1553 01:08:36,800 --> 01:08:41,240 So that is another trade off here between these two systems. 1554 01:08:41,240 --> 01:08:42,680 But it's really more than anything 1555 01:08:42,680 --> 01:08:46,520 a trade off of sophisticated compiling versus less 1556 01:08:46,520 --> 01:08:47,810 sophisticated compiling. 1557 01:08:47,810 --> 01:08:51,319 Now, it's often cited as an advantage for superscalars 1558 01:08:51,319 --> 01:08:54,470 that they have this runtime information. 1559 01:08:54,470 --> 01:08:56,600 And so they can do a better job of scheduling 1560 01:08:56,600 --> 01:08:59,064 and a better job of dependence analysis. 1561 01:08:59,064 --> 01:09:00,439 But I just have trouble believing 1562 01:09:00,439 --> 01:09:03,020 that that adds very much, given the amount of time 1563 01:09:03,020 --> 01:09:04,490 you have to do it. 1564 01:09:04,490 --> 01:09:08,330 But that is something that does distinguish a superscalar 1565 01:09:08,330 --> 01:09:10,189 from a system that isn't a superscalar 1566 01:09:10,189 --> 01:09:12,859 and can't do that kind of dynamic analysis.
1567 01:09:12,859 --> 01:09:15,080 One important kind of analysis is something 1568 01:09:15,080 --> 01:09:17,240 called memory disambiguation. 1569 01:09:17,240 --> 01:09:19,160 And there's certainly no time in this talk 1570 01:09:19,160 --> 01:09:20,160 to go into that subject. 1571 01:09:20,160 --> 01:09:22,939 But the basic problem is if you have two operations, one 1572 01:09:22,939 --> 01:09:27,490 of which reads an array value and the other of which 1573 01:09:27,490 --> 01:09:30,340 writes some array value, is it permissible to do them 1574 01:09:30,340 --> 01:09:33,310 in the other order to get more instruction level parallelism? 1575 01:09:33,310 --> 01:09:35,727 Well, you have to know whether they're reading and writing 1576 01:09:35,727 --> 01:09:37,279 the same location or not. 1577 01:09:37,279 --> 01:09:39,399 And of course at runtime, you can tell. 1578 01:09:39,399 --> 01:09:41,859 You just look at the address that's in the hardware. 1579 01:09:41,859 --> 01:09:43,600 At compile time, sometimes that's 1580 01:09:43,600 --> 01:09:46,492 a tremendously sophisticated analysis. 1581 01:09:46,492 --> 01:09:48,700 When it comes to arrays, people have gotten very good 1582 01:09:48,700 --> 01:09:51,700 at that analysis, at solving the equation of whether these two 1583 01:09:51,700 --> 01:09:52,970 indices are the same. 1584 01:09:52,970 --> 01:09:55,000 That's called array disambiguation. 1585 01:09:55,000 --> 01:09:58,600 For pointers, where now you have two references to memory 1586 01:09:58,600 --> 01:10:01,310 where the references came through pointers, 1587 01:10:01,310 --> 01:10:02,560 that's a wide open subject. 1588 01:10:02,560 --> 01:10:05,290 People have looked at that in several different contexts 1589 01:10:05,290 --> 01:10:06,700 of system design. 1590 01:10:06,700 --> 01:10:09,420 And I think it's one of the big frontiers in instruction level 1591 01:10:09,420 --> 01:10:09,920 parallelism. 1592 01:10:09,920 --> 01:10:12,580 I think that real progress is going to be 1593 01:10:12,580 --> 01:10:14,320 made in pointer disambiguation. 1594 01:10:14,320 --> 01:10:16,720 It is an incredibly hard problem. 1595 01:10:16,720 --> 01:10:19,990 It may involve some slight change of programming paradigm, 1596 01:10:19,990 --> 01:10:21,910 in the way that we've learned good programming practices. 1597 01:10:21,910 --> 01:10:24,550 I'm not talking about adding wild new constructs 1598 01:10:24,550 --> 01:10:26,120 to languages. 1599 01:10:26,120 --> 01:10:29,200 But the way we've learned new programming practices 1600 01:10:29,200 --> 01:10:31,630 about aliasing and such things in Fortran programming, 1601 01:10:31,630 --> 01:10:33,310 for example, over the decades. 1602 01:10:33,310 --> 01:10:35,860 I think that may, over decades, happen 1603 01:10:35,860 --> 01:10:37,030 with pointer disambiguation. 1604 01:10:37,030 --> 01:10:38,560 But in my opinion, nobody's going 1605 01:10:38,560 --> 01:10:41,420 to do sophisticated pointer disambiguation any time real 1606 01:10:41,420 --> 01:10:41,920 soon. 1607 01:10:41,920 --> 01:10:45,280
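Here is a minimal C sketch of the disambiguation question (illustrative only; the fragment is mine, not from the talk):

    /* May the compiler move the load of a[i] above the store to a[j] so the
       two can overlap?  Only if it can prove i != j.  At run time the hardware
       just compares the two addresses; at compile time deciding whether the
       indices can be equal is array disambiguation, and if these were two
       unrelated pointers instead of one array, it would be the much harder
       pointer disambiguation problem.                                        */
    void update(double *a, int i, int j, double x, double *out)
    {
        a[j] = x;              /* store                                        */
        *out = 2.0 * a[i];     /* load: independent of the store only if i != j */
    }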
1608 01:10:45,280 --> 01:10:45,780 OK. 1609 01:10:45,780 --> 01:10:48,300 Now, what I certainly don't have time to do in this talk 1610 01:10:48,300 --> 01:10:51,360 is go into the slightest amount of detail 1611 01:10:51,360 --> 01:10:54,510 about code generation techniques. 1612 01:10:54,510 --> 01:10:57,260 Code generation techniques are one of the reasons 1613 01:10:57,260 --> 01:10:59,630 that instruction level parallelism has really 1614 01:10:59,630 --> 01:11:02,330 taken off in the last 10 years. 1615 01:11:02,330 --> 01:11:04,760 People are a lot more confident of their ability 1616 01:11:04,760 --> 01:11:06,830 to do the kind of code generation 1617 01:11:06,830 --> 01:11:12,010 that you have to do in order to use these machines effectively. 1618 01:11:12,010 --> 01:11:15,340 One of the techniques is called trace scheduling. 1619 01:11:15,340 --> 01:11:17,190 And it's a technique that came along 1620 01:11:17,190 --> 01:11:18,690 around 12 or 13 years ago. 1621 01:11:18,690 --> 01:11:20,160 It was something I was involved in. 1622 01:11:20,160 --> 01:11:25,040 And it's a way of taking this idea of looking down 1623 01:11:25,040 --> 01:11:29,090 the instruction stream and speculating 1624 01:11:29,090 --> 01:11:33,360 in the most likely direction and doing the code generation 1625 01:11:33,360 --> 01:11:38,130 in advance in order to build correct code that 1626 01:11:38,130 --> 01:11:41,100 gets the advantages of going past conditional jumps. 1627 01:11:41,100 --> 01:11:43,770 And the basic way it works, this is incredibly superficial, 1628 01:11:43,770 --> 01:11:46,858 but there are references that come with this tape 1629 01:11:46,858 --> 01:11:49,150 that if you're interested, you can learn more about it. 1630 01:11:49,150 --> 01:11:52,230 Basic way it works is that the compiler takes the code 1631 01:11:52,230 --> 01:11:56,220 and picks a path through the code that it thinks of as being 1632 01:11:56,220 --> 01:11:59,500 one of the more likely paths. 1633 01:11:59,500 --> 01:12:03,280 And sometimes it will unroll loops and inline subroutines 1634 01:12:03,280 --> 01:12:09,130 and unwind rejoins and so on to make these paths longer. 1635 01:12:09,130 --> 01:12:11,050 And then it just makes the assumption 1636 01:12:11,050 --> 01:12:13,720 that this is the way the code is going to go. 1637 01:12:13,720 --> 01:12:15,460 It builds a big data precedence graph 1638 01:12:15,460 --> 01:12:17,740 on which it does the dependence analysis. 1639 01:12:17,740 --> 01:12:21,890 Schedules this into these wide instruction words 1640 01:12:21,890 --> 01:12:25,600 if you're scheduling for a VLIW or schedules 1641 01:12:25,600 --> 01:12:28,190 the independent operations clumped next to each other 1642 01:12:28,190 --> 01:12:30,610 so the superscalar can find them without having to have 1643 01:12:30,610 --> 01:12:33,250 a huge window of opportunity. 1644 01:12:33,250 --> 01:12:34,010 Schedules them. 1645 01:12:34,010 --> 01:12:36,710 Some of them will be speculative. 1646 01:12:36,710 --> 01:12:39,230 And then schedules whatever code is 1647 01:12:39,230 --> 01:12:42,710 necessary to do renaming and so on at the boundaries 1648 01:12:42,710 --> 01:12:45,800 of this trace. 1649 01:12:45,800 --> 01:12:48,633 Now, when the code goes this way, you do very well. 1650 01:12:48,633 --> 01:12:50,300 You get a lot of performance because you 1651 01:12:50,300 --> 01:12:53,548 speculated ahead a lot and exposed 1652 01:12:53,548 --> 01:12:55,340 a lot of the instruction level parallelism. 1653 01:12:55,340 --> 01:12:57,320 When it doesn't go this way, you've set it up in such a way 1654 01:12:57,320 --> 01:12:58,880 that the semantics are right.
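A tiny before-and-after sketch in C of the speculation idea (illustrative only; the fragment and the assumption that the test is almost always true are mine, not the speaker's example):

    /* Before: suppose profiling says x > 0.0 is the likely direction, so the
       "then" side is the trace the compiler picks.                            */
    double before(double x, double s, const double *a, const double *b, int i)
    {
        if (x > 0.0) {
            double t = a[i] * b[i];   /* long-latency multiply on the likely path */
            s = s + t;
        } else {
            s = s - 1.0;
        }
        return s;
    }

    /* After trace scheduling (sketch): the multiply has been hoisted above the
       conditional jump and runs speculatively.  Here its result is simply
       unused on the unlikely path; in general the compiler adds compensation
       code at the trace boundaries to keep the semantics right.               */
    double after(double x, double s, const double *a, const double *b, int i)
    {
        double t = a[i] * b[i];       /* speculative: issued before the test resolves */
        if (x > 0.0)
            s = s + t;
        else
            s = s - 1.0;
        return s;
    }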
1655 01:12:58,880 --> 01:13:01,280 Now, you're not going to pick every trace and compile it. 1656 01:13:01,280 --> 01:13:03,410 That's too many, remember. 1657 01:13:03,410 --> 01:13:05,660 But what you do is having compiled one, 1658 01:13:05,660 --> 01:13:07,910 you take the code that hasn't been compiled yet, 1659 01:13:07,910 --> 01:13:10,040 and you pick the most likely path 1660 01:13:10,040 --> 01:13:11,440 from what's remaining and so on. 1661 01:13:11,440 --> 01:13:13,190 Now, there's a lot of techniques for doing 1662 01:13:13,190 --> 01:13:16,360 this kind of scalar code generation for instruction 1663 01:13:16,360 --> 01:13:17,360 level parallel machines. 1664 01:13:17,360 --> 01:13:19,318 Some of them are variants on trace scheduling. 1665 01:13:19,318 --> 01:13:22,860 Some of them depart from it quite a bit more. 1666 01:13:22,860 --> 01:13:25,130 But basically, they all follow the same idea 1667 01:13:25,130 --> 01:13:28,880 of trying to take advantage of predicting in advance the most 1668 01:13:28,880 --> 01:13:31,970 likely path and scheduling the entire path at once 1669 01:13:31,970 --> 01:13:35,490 rather than just scheduling little pieces of it. 1670 01:13:35,490 --> 01:13:39,700 Now, for loops, there's another important technique. 1671 01:13:39,700 --> 01:13:43,160 It's called software pipelining. 1672 01:13:43,160 --> 01:13:45,920 And software pipelining is-- 1673 01:13:45,920 --> 01:13:48,350 I'm going to show you how it works very, very 1674 01:13:48,350 --> 01:13:50,768 informally on a very, very simple type of example. 1675 01:13:50,768 --> 01:13:53,060 But I just want you to get the idea of what it's about. 1676 01:13:53,060 --> 01:13:55,340 It's actually quite a large technology and quite 1677 01:13:55,340 --> 01:13:57,680 a sophisticated technology, still in its infancy 1678 01:13:57,680 --> 01:14:00,050 and still with a lot of research being done. 1679 01:14:00,050 --> 01:14:04,220 But the basic idea is suppose you have a tight loop of just 1680 01:14:04,220 --> 01:14:08,350 these five operations A through E. 1681 01:14:08,350 --> 01:14:11,590 And you have hardware that operates in such a way 1682 01:14:11,590 --> 01:14:13,330 that functional unit one can do A, 1683 01:14:13,330 --> 01:14:17,100 functional unit two can do B, and so on. 1684 01:14:17,100 --> 01:14:21,390 Our goal is to come up with a schedule for this loop that, 1685 01:14:21,390 --> 01:14:24,023 instead of doing A, waiting for it to finish, 1686 01:14:24,023 --> 01:14:25,440 doing B, waiting for it to finish-- 1687 01:14:25,440 --> 01:14:29,200 these might be data dependent on each other, for example-- 1688 01:14:29,200 --> 01:14:33,030 instead of doing that, we can start later iterations next 1689 01:14:33,030 --> 01:14:38,320 to these and do pieces of all the different iterations 1690 01:14:38,320 --> 01:14:41,980 at the same time in one cycle or in a handful of cycles 1691 01:14:41,980 --> 01:14:44,440 that we repeat over and over again. 1692 01:14:44,440 --> 01:14:46,240 So the basic way you can think of it, even 1693 01:14:46,240 --> 01:14:47,950 though the technique for generating code 1694 01:14:47,950 --> 01:14:50,440 is very different from this, basic way you can think of it 1695 01:14:50,440 --> 01:14:54,300 is first cycle, start the first iteration. 1696 01:14:54,300 --> 01:14:55,920 Now, as time goes by, you're going 1697 01:14:55,920 --> 01:14:58,770 to do the following operations from that iteration 1698 01:14:58,770 --> 01:15:00,270 on the different functional units, 1699 01:15:00,270 --> 01:15:03,000 starting them each on successive cycles, 1700 01:15:03,000 --> 01:15:05,670 because they depend upon this one. 1701 01:15:05,670 --> 01:15:11,510 Second cycle, start the second iteration. 1702 01:15:11,510 --> 01:15:13,820 And then in following cycles, continue that. 1703 01:15:13,820 --> 01:15:16,140 And then keep going after that. 1704 01:15:16,140 --> 01:15:18,740 And what you find is that eventually, you sort of filled 1705 01:15:18,740 --> 01:15:21,155 your machine with some repeating pattern 1706 01:15:21,155 --> 01:15:23,900 where each functional unit is doing 1707 01:15:23,900 --> 01:15:26,870 the same thing in the simple example over and over again. 1708 01:15:26,870 --> 01:15:29,720 And then this repeating pattern can be just considered 1709 01:15:29,720 --> 01:15:31,770 a single tight loop. 1710 01:15:31,770 --> 01:15:33,770 So now you pull the repeating pattern out. 1711 01:15:33,770 --> 01:15:35,270 This is, again, how you think of it. 1712 01:15:35,270 --> 01:15:37,478 It's not necessarily how you would implement the code 1713 01:15:37,478 --> 01:15:38,210 generator. 1714 01:15:38,210 --> 01:15:40,220 You pull this repeating pattern out, 1715 01:15:40,220 --> 01:15:43,130 and it becomes your tight loop in the center. 1716 01:15:43,130 --> 01:15:46,780 And you set it up with something called 1717 01:15:46,780 --> 01:15:49,870 the prologue that starts all the iterations necessary to get 1718 01:15:49,870 --> 01:15:52,570 into the tight loop in the center. 1719 01:15:52,570 --> 01:15:55,930 And you finish it with an epilogue that 1720 01:15:55,930 --> 01:15:58,660 finishes the iterations that have been started 1721 01:15:58,660 --> 01:16:00,370 but haven't yet been completed. 1722 01:16:00,370 --> 01:16:02,500 Again, there's more in the references about this 1723 01:16:02,500 --> 01:16:05,580 if you're interested in finding out more about it.
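In source terms, a minimal C sketch of what the prologue, kernel, and epilogue look like (illustrative only; the loop, the two-cycle multiply, and the requirement that n be at least 2 are mine, not the example from the talk):

    /* Original loop:  for (i = 0; i < n; i++) c[i] = a[i] * b[i] + k;        */
    void pipelined(double *c, const double *a, const double *b, double k, int n)
    {
        /* assumes n >= 2 */
        double t0 = a[0] * b[0];          /* prologue: start iteration 0       */
        double t1 = a[1] * b[1];          /* prologue: start iteration 1       */
        int i;
        for (i = 0; i < n - 2; i++) {     /* kernel: the repeating pattern     */
            c[i] = t0 + k;                /*   finish iteration i              */
            t0 = t1;
            t1 = a[i + 2] * b[i + 2];     /*   start iteration i + 2           */
        }
        c[n - 2] = t0 + k;                /* epilogue: finish the iterations   */
        c[n - 1] = t1 + k;                /*   still in flight                 */
    }

Each kernel iteration now contains an add from one original iteration and a multiply from an iteration two ahead, which is exactly the kind of overlap the schedule is after.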
1724 01:16:05,580 --> 01:16:07,820 So finally, let me sum up here. 1725 01:16:07,820 --> 01:16:12,480 What we've seen is that in the last few years, 1726 01:16:12,480 --> 01:16:17,270 microprocessor designs from the major manufacturers 1727 01:16:17,270 --> 01:16:19,730 for their fastest workstations have all 1728 01:16:19,730 --> 01:16:22,520 embodied a fair amount of instruction level parallelism. 1729 01:16:22,520 --> 01:16:27,680 The HP PA-RISC 7100, the IBM RS/6000, the announced DEC 1730 01:16:27,680 --> 01:16:30,987 Alpha are all superscalar machines 1731 01:16:30,987 --> 01:16:32,570 with varying degrees but a fair amount 1732 01:16:32,570 --> 01:16:35,570 in each case of instruction level parallelism. 1733 01:16:35,570 --> 01:16:38,090 These are incredibly fast machines. 1734 01:16:38,090 --> 01:16:39,650 They're fast cycle time machines, 1735 01:16:39,650 --> 01:16:43,010 but they get a lot more advantage 1736 01:16:43,010 --> 01:16:45,320 out of the instruction level parallelism 1737 01:16:45,320 --> 01:16:46,440 than you might imagine. 1738 01:16:46,440 --> 01:16:49,430 So these are machines that are very much the state of the art. 1739 01:16:49,430 --> 01:16:51,620 And these days, if you're building high performance 1740 01:16:51,620 --> 01:16:54,860 processors, if you're not doing instruction level parallelism, 1741 01:16:54,860 --> 01:16:58,060 you're just not there. 1742 01:16:58,060 --> 01:17:01,850 In the '80s, we had the phenomenon 1743 01:17:01,850 --> 01:17:06,320 of startups that were building mini-supercomputers that 1744 01:17:06,320 --> 01:17:08,090 embodied instruction level parallelism.
1745 01:17:08,090 --> 01:17:11,210 There were some superscalar ones and there 1746 01:17:11,210 --> 01:17:12,740 were a couple of VLIW ones. 1747 01:17:12,740 --> 01:17:15,680 The Multiflow Trace machine I was associated with, 1748 01:17:15,680 --> 01:17:21,050 the Cydrome that Bob Rau, also at HP, was associated with, 1749 01:17:21,050 --> 01:17:22,730 and Culler, for example. 1750 01:17:22,730 --> 01:17:24,440 For one reason or another, virtually all 1751 01:17:24,440 --> 01:17:26,420 of these mini supercomputer startups 1752 01:17:26,420 --> 01:17:28,790 failed, whether they had very ambitious architectures 1753 01:17:28,790 --> 01:17:30,920 or very humdrum architectures. 1754 01:17:30,920 --> 01:17:33,800 Mostly these failures were not strongly related, in my opinion, 1755 01:17:33,800 --> 01:17:35,480 to the technology that was there. 1756 01:17:35,480 --> 01:17:38,000 Certainly the Multiflow machine had tremendous price 1757 01:17:38,000 --> 01:17:41,960 performance and ease of use in its time 1758 01:17:41,960 --> 01:17:44,090 and had a lot of commercial acceptance. 1759 01:17:44,090 --> 01:17:46,730 I think a total of 120 pretty big machines 1760 01:17:46,730 --> 01:17:48,470 sold over the years. 1761 01:17:48,470 --> 01:17:50,350 But the company finally couldn't make it. 1762 01:17:50,350 --> 01:17:53,180 But what did happen is that a lot of the technology 1763 01:17:53,180 --> 01:17:55,250 got purchased by a lot of large companies. 1764 01:17:55,250 --> 01:18:00,020 And I believe, for example, that the Fujitsu supercomputer that 1765 01:18:00,020 --> 01:18:02,210 was just announced a couple of weeks ago 1766 01:18:02,210 --> 01:18:05,480 has as each of its nodes a long instruction word machine. 1767 01:18:05,480 --> 01:18:08,120 And I believe they're using Multiflow technology, compiling 1768 01:18:08,120 --> 01:18:10,260 technology, and so on in that product. 1769 01:18:10,260 --> 01:18:13,820 I know several large companies have released the Multiflow 1770 01:18:13,820 --> 01:18:16,460 compiler as a compiler for their products. 1771 01:18:16,460 --> 01:18:18,230 And I know that it's being used 1772 01:18:18,230 --> 01:18:20,180 in development efforts 1773 01:18:20,180 --> 01:18:23,900 in a lot of large companies and in some of these products. 1774 01:18:23,900 --> 01:18:27,410 And other startup technologies have found their way in. 1775 01:18:27,410 --> 01:18:31,460 But in fairness, no VLIW has been commercially successful. 1776 01:18:31,460 --> 01:18:35,480 There were the array processors of the '70s that 1777 01:18:35,480 --> 01:18:37,100 were really difficult beasts to use 1778 01:18:37,100 --> 01:18:39,110 but had tremendous price performance 1779 01:18:39,110 --> 01:18:41,923 but never really made it as general purpose computers. 1780 01:18:41,923 --> 01:18:43,340 And what's going to be interesting 1781 01:18:43,340 --> 01:18:46,280 is that now that we're pushing the boundary of instruction 1782 01:18:46,280 --> 01:18:49,130 level parallelism and what's available, 1783 01:18:49,130 --> 01:18:51,085 it really will be an exciting time 1784 01:18:51,085 --> 01:18:52,460 to see which way this thing goes, 1785 01:18:52,460 --> 01:18:55,250 whether we're going to be able to make superscalars practical 1786 01:18:55,250 --> 01:18:56,960 and whether that's the right answer 1787 01:18:56,960 --> 01:18:59,120 or whether VLIWs are going to have to come in 1788 01:18:59,120 --> 01:19:01,610 to have the simpler hardware.
1789 01:19:01,610 --> 01:19:04,160 Very big questions do remain open. 1790 01:19:04,160 --> 01:19:07,070 The biggest question is how much instruction level parallelism 1791 01:19:07,070 --> 01:19:08,720 is there. 1792 01:19:08,720 --> 01:19:10,500 And I'd like to just close by giving you 1793 01:19:10,500 --> 01:19:12,890 my quick summary of how I read the limit 1794 01:19:12,890 --> 01:19:15,620 studies and the practical experience people have had 1795 01:19:15,620 --> 01:19:16,430 so far. 1796 01:19:16,430 --> 01:19:20,772 For scientific codes, technical codes, there's no question. 1797 01:19:20,772 --> 01:19:22,730 There's a lot of instruction level parallelism. 1798 01:19:22,730 --> 01:19:24,050 Most of these vectorize. 1799 01:19:24,050 --> 01:19:26,600 There's a lot of other ways to get this advantage, 1800 01:19:26,600 --> 01:19:30,050 but VLIWs certainly are a very effective way of getting it, 1801 01:19:30,050 --> 01:19:33,050 and superscalars run very well on these codes also. 1802 01:19:33,050 --> 01:19:35,932 Those numbers are huge, and it doesn't even 1803 01:19:35,932 --> 01:19:36,890 make sense to name a single number. 1804 01:19:36,890 --> 01:19:37,940 It's a function of how much hardware 1805 01:19:37,940 --> 01:19:39,270 you throw at the problem. 1806 01:19:39,270 --> 01:19:41,750 So that's easy, the sort of vectorizable array oriented 1807 01:19:41,750 --> 01:19:42,470 codes. 1808 01:19:42,470 --> 01:19:45,230 For scientific codes that aren't quite so easy 1809 01:19:45,230 --> 01:19:47,030 but are still sort of the pipelined floating 1810 01:19:47,030 --> 01:19:50,810 point, memory intensive array codes that you see around, 1811 01:19:50,810 --> 01:19:52,880 probably a factor of five is not in question. 1812 01:19:52,880 --> 01:19:54,920 We can probably easily do that in time. 1813 01:19:54,920 --> 01:19:57,050 And I suspect it'll go a lot farther than that, 1814 01:19:57,050 --> 01:20:00,610 but you can't really make a strong case for it yet. 1815 01:20:00,610 --> 01:20:07,420 For the scientific and technical integer codes that are around, 1816 01:20:07,420 --> 01:20:10,330 it's probably around there also, but it's not so clear 1817 01:20:10,330 --> 01:20:12,317 that you can go much further. 1818 01:20:12,317 --> 01:20:14,650 But that would be my take on what the studies have said. 1819 01:20:14,650 --> 01:20:17,192 A lot of people certainly are far more pessimistic than that. 1820 01:20:17,192 --> 01:20:19,300 Some people are more optimistic. 1821 01:20:19,300 --> 01:20:20,680 Certainly time will tell. 1822 01:20:20,680 --> 01:20:23,560 And finally, for the systems codes, maybe 1823 01:20:23,560 --> 01:20:25,773 it's a factor of two or three that you 1824 01:20:25,773 --> 01:20:27,940 can get from the studies that have been done to date. 1825 01:20:27,940 --> 01:20:31,150 And I think there, things like pointer disambiguation and code 1826 01:20:31,150 --> 01:20:34,570 generation techniques are going to make a very big difference 1827 01:20:34,570 --> 01:20:37,000 in what we're able to do. 1828 01:20:37,000 --> 01:20:38,860 And it'll be very interesting to see 1829 01:20:38,860 --> 01:20:41,073 what the future holds for that. 1830 01:20:41,073 --> 01:20:43,240 Finally, for the commercial codes, I don't think anybody 1831 01:20:43,240 --> 01:20:44,740 knows anything.
1832 01:20:44,740 --> 01:20:47,200 I've heard some anecdotal evidence 1833 01:20:47,200 --> 01:20:49,000 that for a lot of the more business data 1834 01:20:49,000 --> 01:20:51,100 processing oriented codes, there is 1835 01:20:51,100 --> 01:20:53,590 a lot of instruction level parallelism that's 1836 01:20:53,590 --> 01:20:55,218 available to us. 1837 01:20:55,218 --> 01:20:57,010 But I don't know of any studies that people 1838 01:20:57,010 --> 01:20:59,200 have done and published that have told us just 1839 01:20:59,200 --> 01:21:01,490 what those codes look like. 1840 01:21:01,490 --> 01:21:04,420 So if you are interested in knowing more about the subject, 1841 01:21:04,420 --> 01:21:07,810 Bob Rau and I have a paper that was in Science magazine 1842 01:21:07,810 --> 01:21:10,090 last year that's a survey. 1843 01:21:10,090 --> 01:21:12,160 It's not quite on the level of this talk, but in some 1844 01:21:12,160 --> 01:21:14,620 places it goes into a lot more detail 1845 01:21:14,620 --> 01:21:16,310 and has a very good reference list. 1846 01:21:16,310 --> 01:21:19,885 In fact, it's conversations with Bob over the last few years 1847 01:21:19,885 --> 01:21:22,510 that have really given me a lot of the perspective you've heard 1848 01:21:22,510 --> 01:21:24,103 in the course of this talk. 1849 01:21:24,103 --> 01:21:25,520 So thank you very much for coming. 1850 01:21:25,520 --> 01:21:26,728 I've enjoyed this talk a lot. 1851 01:21:26,728 --> 01:21:30,420 1852 01:21:30,420 --> 01:21:33,550 So let's open it up for some questions. 1853 01:21:33,550 --> 01:21:34,660 Yes? 1854 01:21:34,660 --> 01:21:37,030 Josh, could you tell us more about software pipelining 1855 01:21:37,030 --> 01:21:38,290 in commercial systems? 1856 01:21:38,290 --> 01:21:39,290 Sure. 1857 01:21:39,290 --> 01:21:39,790 Thanks. 1858 01:21:39,790 --> 01:21:42,190 Well, software pipelining is a really interesting 1859 01:21:42,190 --> 01:21:42,760 technology. 1860 01:21:42,760 --> 01:21:45,670 It's a lot more mathematical in nature 1861 01:21:45,670 --> 01:21:48,520 than some of the other code generation techniques 1862 01:21:48,520 --> 01:21:50,290 and so has gotten a lot of interest 1863 01:21:50,290 --> 01:21:51,880 from a lot of different groups who 1864 01:21:51,880 --> 01:21:54,290 tried to formulate different ways of doing it. 1865 01:21:54,290 --> 01:21:57,310 It was used in the Floating Point Systems machines early on. 1866 01:21:57,310 --> 01:22:01,060 Floating Point Systems built these array processors that eventually 1867 01:22:01,060 --> 01:22:04,120 had the ROM be substituted-- have RAM substituted 1868 01:22:04,120 --> 01:22:05,650 for the ROM that was in there. 1869 01:22:05,650 --> 01:22:07,442 And there were a lot of the algorithms that 1870 01:22:07,442 --> 01:22:09,700 went very well on that, went very fast on that system, 1871 01:22:09,700 --> 01:22:12,490 because they used software pipelining. 1872 01:22:12,490 --> 01:22:17,200 Cydrome got started because of Bob Rau's software pipelining 1873 01:22:17,200 --> 01:22:19,815 experience and algorithms and particularly because he 1874 01:22:19,815 --> 01:22:21,190 had a lot of good ideas about how 1875 01:22:21,190 --> 01:22:24,310 to build hardware that made it easier to software pipeline. 1876 01:22:24,310 --> 01:22:27,670 We've actually done a lot of that at HP. 1877 01:22:27,670 --> 01:22:31,180 First the Apollo systems-- the people on the DN10000 1878 01:22:31,180 --> 01:22:33,470 had written a software pipeliner for that machine.
1879 01:22:33,470 --> 01:22:36,010 But now, in fact, the commercial compiler 1880 01:22:36,010 --> 01:22:39,250 you get with the PA-RISC products you buy today 1881 01:22:39,250 --> 01:22:41,090 does software pipelining right in it. 1882 01:22:41,090 --> 01:22:43,450 And there's a lot of research going on at HP Labs, 1883 01:22:43,450 --> 01:22:45,613 Bob Rau's group, and a lot of other places 1884 01:22:45,613 --> 01:22:47,530 where people are looking at software pipelining. 1885 01:22:47,530 --> 01:22:49,360 So it's really become an important topic. 1886 01:22:49,360 --> 01:22:53,440 It even makes ordinary sequential machines 1887 01:22:53,440 --> 01:22:56,080 that have a little bit of sort of subtle instruction level 1888 01:22:56,080 --> 01:22:58,940 parallelism in them go faster when you do that. 1889 01:22:58,940 --> 01:23:02,710 So it's a pretty valuable technique. 1890 01:23:02,710 --> 01:23:03,210 Yes? 1891 01:23:03,210 --> 01:23:06,450 1892 01:23:06,450 --> 01:23:09,930 Could you tell us how high level transformations 1893 01:23:09,930 --> 01:23:15,130 like loop unrolling or inlining affect ILP [INAUDIBLE]? 1894 01:23:15,130 --> 01:23:15,630 Absolutely. 1895 01:23:15,630 --> 01:23:17,490 They turn out to be critical effects. 1896 01:23:17,490 --> 01:23:21,240 If you have a subroutine call in the middle of a tight loop, 1897 01:23:21,240 --> 01:23:23,970 you're not going to get anywhere if that subroutine call is 1898 01:23:23,970 --> 01:23:26,610 a boundary beyond which your ILP won't go. 1899 01:23:26,610 --> 01:23:28,680 So the typical thing people will do 1900 01:23:28,680 --> 01:23:32,650 is insert a text copy of the subroutine into that. 1901 01:23:32,650 --> 01:23:35,370 Now of course, that is a source control problem. 1902 01:23:35,370 --> 01:23:38,370 You have to make sure that when you modify that subroutine, 1903 01:23:38,370 --> 01:23:41,310 you catch every place that it was inlined later 1904 01:23:41,310 --> 01:23:42,900 in your [INAUDIBLE] control system. 1905 01:23:42,900 --> 01:23:45,020 So that's one technique that's used all the time. 1906 01:23:45,020 --> 01:23:46,860 Similarly, loop unrolling. 1907 01:23:46,860 --> 01:23:49,080 Unless you're software pipelining, in which case 1908 01:23:49,080 --> 01:23:52,110 you want the tightest copy of the loop you can get, 1909 01:23:52,110 --> 01:23:55,680 you do loop unrolling, which used to be used for compiler optimization 1910 01:23:55,680 --> 01:23:58,652 to eliminate the test between iterations, 1911 01:23:58,652 --> 01:24:00,360 if you could predict that you'd go around 1912 01:24:00,360 --> 01:24:01,530 a certain number of times. 1913 01:24:01,530 --> 01:24:04,110 For ILP, you're happy enough to leave the test in. 1914 01:24:04,110 --> 01:24:06,180 That tends to be a very predictable branch 1915 01:24:06,180 --> 01:24:07,433 at the end of the loop. 1916 01:24:07,433 --> 01:24:08,850 But what you want is, rather 1917 01:24:08,850 --> 01:24:12,760 than a small body of text, a very large body of text. 1918 01:24:12,760 --> 01:24:16,080 And in fact, when people complain about the compile time 1919 01:24:16,080 --> 01:24:19,440 for these what Bill Wulf calls heroic compilers that 1920 01:24:19,440 --> 01:24:22,020 do this sort of thing, the compile time more than anything 1921 01:24:22,020 --> 01:24:23,940 comes about because of the expansion 1922 01:24:23,940 --> 01:24:27,570 in the code size due to loop unrolling 1923 01:24:27,570 --> 01:24:29,700 and subroutine inlining. 1924 01:24:29,700 --> 01:24:33,170 That's really been the biggest factor.
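A minimal C sketch of ILP-oriented unrolling (illustrative only; the loop, the unroll factor of four, and the separate accumulators are my choices, not from the talk):

    /* The loop-back test stays in (it is a very predictable branch); the point
       is the larger body.  The four multiply-adds use separate accumulators,
       so they are data independent of each other and can be overlapped.  The
       four-fold copy of the body is also where the code-size growth that slows
       down "heroic" compilers comes from.                                     */
    double dot(const double *a, const double *b, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)                /* cleanup for leftover iterations   */
            s0 += a[i] * b[i];
        return s0 + s1 + s2 + s3;
    }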
1925 01:24:33,170 --> 01:24:34,160 Yes? 1926 01:24:34,160 --> 01:24:37,470 How do dataflow machines merge into your characterization? 1927 01:24:37,470 --> 01:24:42,440 Is this [INAUDIBLE] superscalar or are there some differences? 1928 01:24:42,440 --> 01:24:44,510 Yes, that's a great question. 1929 01:24:44,510 --> 01:24:46,700 Dataflow machines I really left out, 1930 01:24:46,700 --> 01:24:49,130 even though they are very much instruction 1931 01:24:49,130 --> 01:24:50,360 level parallel machines. 1932 01:24:50,360 --> 01:24:51,950 Right now the emphasis on dataflow 1933 01:24:51,950 --> 01:24:55,550 has moved more off to the multiple thread 1934 01:24:55,550 --> 01:24:57,990 world and MIMD processors. 1935 01:24:57,990 --> 01:25:01,130 So it hasn't been such a practical thing 1936 01:25:01,130 --> 01:25:03,080 as superscalars and VLIWs. 1937 01:25:03,080 --> 01:25:04,580 People haven't really built one that 1938 01:25:04,580 --> 01:25:07,040 would be in any way commercially competitive 1939 01:25:07,040 --> 01:25:09,015 to do just instruction level parallelism. 1940 01:25:09,015 --> 01:25:10,640 But in a sense, they're the purest kind 1941 01:25:10,640 --> 01:25:14,000 of instruction level parallelism when running a single thread. 1942 01:25:14,000 --> 01:25:19,520 What you have is the code itself specifying exactly 1943 01:25:19,520 --> 01:25:22,730 the dependence relationship as part of the way 1944 01:25:22,730 --> 01:25:24,020 the code is written. 1945 01:25:24,020 --> 01:25:27,410 And the hardware simply is notified. 1946 01:25:27,410 --> 01:25:29,690 It has operations which are notified that it's 1947 01:25:29,690 --> 01:25:30,950 time for them to fire. 1948 01:25:30,950 --> 01:25:33,300 And it's very much an instruction level parallelism 1949 01:25:33,300 --> 01:25:33,800 idea. 1950 01:25:33,800 --> 01:25:36,770 And it's a little bit in between superscalars and VLIWs 1951 01:25:36,770 --> 01:25:39,740 in a way, because VLIWs-- and again, you 1952 01:25:39,740 --> 01:25:42,230 can read more about this in the paper, 1953 01:25:42,230 --> 01:25:44,600 where we do discuss dataflow machines in this context-- 1954 01:25:44,600 --> 01:25:48,020 VLIWs require the independence between operations 1955 01:25:48,020 --> 01:25:49,700 to be specified. 1956 01:25:49,700 --> 01:25:52,580 And superscalars don't specify anything in advance. 1957 01:25:52,580 --> 01:25:53,840 The hardware figures it out. 1958 01:25:53,840 --> 01:25:56,690 Dataflow machines have the dependence relationship 1959 01:25:56,690 --> 01:25:59,450 and still have to figure out what's independent 1960 01:25:59,450 --> 01:26:00,890 and what can be scheduled. 1961 01:26:00,890 --> 01:26:02,420 So they really are in some sense one 1962 01:26:02,420 --> 01:26:03,770 of these intermediate solutions. 1963 01:26:03,770 --> 01:26:05,460 They're a wonderfully alluring idea. 1964 01:26:05,460 --> 01:26:07,670 Somebody said they've been the greatest 1965 01:26:07,670 --> 01:26:09,650 idea and the next coming thing in architecture 1966 01:26:09,650 --> 01:26:11,390 for 25 years now. 1967 01:26:11,390 --> 01:26:13,633 So they just haven't made it for this purpose. 1968 01:26:13,633 --> 01:26:15,050 But they seem a lot more promising 1969 01:26:15,050 --> 01:26:19,120 for multithread applications.
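A very loose C sketch of the firing idea described above (illustrative only, not any real machine; all the names are invented):

    /* Each operation node records how many inputs it is still waiting for.
       Delivering the last input makes the node "data ready", and it can then
       fire on any free functional unit.  The dependence relationship is thus
       carried by the code itself rather than rediscovered by the hardware.    */
    struct df_node {
        int    waiting;              /* inputs not yet delivered               */
        double in[2];                /* operand slots                          */
        int    op;                   /* which operation to perform on firing   */
    };

    void deliver(struct df_node *n, int slot, double value,
                 void (*fire)(struct df_node *))
    {
        n->in[slot] = value;
        if (--n->waiting == 0)       /* last operand has arrived               */
            fire(n);                 /* node is data ready: schedule it        */
    }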
1970 01:26:19,120 --> 01:26:20,590 Yes? 1971 01:26:20,590 --> 01:26:24,130 You talked about the predictability of branches. 1972 01:26:24,130 --> 01:26:25,810 And one of the things with compilers 1973 01:26:25,810 --> 01:26:28,900 is that compilers don't know which way branches will go. 1974 01:26:28,900 --> 01:26:31,030 Are you suggesting the use of feedback 1975 01:26:31,030 --> 01:26:34,960 as a compiler mechanism to get optimal use 1976 01:26:34,960 --> 01:26:36,400 of the architecture? 1977 01:26:36,400 --> 01:26:37,090 Yeah. 1978 01:26:37,090 --> 01:26:38,360 That's also a great question. 1979 01:26:38,360 --> 01:26:39,735 I wish there had been time for me 1980 01:26:39,735 --> 01:26:42,130 to talk a lot more about profile based optimization 1981 01:26:42,130 --> 01:26:44,860 in general, which is an interesting subject area even 1982 01:26:44,860 --> 01:26:45,730 independent of this. 1983 01:26:45,730 --> 01:26:49,030 There are a lot of other profile based optimizations you can do. 1984 01:26:49,030 --> 01:26:52,360 What I mean by profile based is just what it sounds like. 1985 01:26:52,360 --> 01:26:56,020 The compiler has access to some quote "representative data" 1986 01:26:56,020 --> 01:26:59,290 and then runs it to get an estimate of which way the jumps 1987 01:26:59,290 --> 01:27:02,710 go and then does optimizations, or in this case 1988 01:27:02,710 --> 01:27:07,120 does its scheduling, based on what those estimates 1989 01:27:07,120 --> 01:27:08,405 from previous runs tell you. 1990 01:27:08,405 --> 01:27:09,530 There are sort of three ways. 1991 01:27:09,530 --> 01:27:11,488 You can either ask the programmer, which is not 1992 01:27:11,488 --> 01:27:14,140 a very acceptable way to tell which way jumps go, 1993 01:27:14,140 --> 01:27:15,370 or you can use heuristics. 1994 01:27:15,370 --> 01:27:17,410 And the jury's out on heuristics. 1995 01:27:17,410 --> 01:27:19,240 Experiments so far have failed. 1996 01:27:19,240 --> 01:27:22,390 But I've been hearing a lot of groundswell 1997 01:27:22,390 --> 01:27:24,400 that really sophisticated heuristics maybe 1998 01:27:24,400 --> 01:27:26,185 do almost as well as profiling. 1999 01:27:26,185 --> 01:27:27,310 And then there's profiling. 2000 01:27:27,310 --> 01:27:31,060 Profiling has the disadvantage that it involves the user. 2001 01:27:31,060 --> 01:27:35,290 And we've yet to see really well engineered systems that 2002 01:27:35,290 --> 01:27:37,810 make this very transparent to the user 2003 01:27:37,810 --> 01:27:41,110 but still allow the compiler to gain access to profiling 2004 01:27:41,110 --> 01:27:42,400 information from these codes. 2005 01:27:42,400 --> 01:27:43,900 It has to involve the users some. 2006 01:27:43,900 --> 01:27:46,210 And I'm, again, interested to see how that works out. 2007 01:27:46,210 --> 01:27:48,200 It certainly is the best way to get data. 2008 01:27:48,200 --> 01:27:50,470 And it turns out you can come very close 2009 01:27:50,470 --> 01:27:52,780 to the best possible. 2010 01:27:52,780 --> 01:27:55,720 [INAUDIBLE] and I-- the reason I had those branch statistics 2011 01:27:55,720 --> 01:27:58,990 is that we just did a study that was in this last ASPLOS. 2012 01:27:58,990 --> 01:28:01,990 And we showed that you really can use very unrelated data 2013 01:28:01,990 --> 01:28:04,570 and do very well most of the time in predicting branches. 2014 01:28:04,570 --> 01:28:07,522
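A minimal C sketch of what profile feedback amounts to (illustrative only; the counter table, the branch numbering, and the comparison are invented for the example and are not HP's mechanism):

    /* The instrumented build bumps a counter for each direction of each
       conditional; a later compile reads the table from the "representative"
       run and treats the frequent direction as the likely path when it picks
       traces and schedules speculation.                                       */
    enum { N_BRANCHES = 128 };
    static long n_taken[N_BRANCHES], n_fallthru[N_BRANCHES];

    /* What the instrumented code does at, say, branch number 7:               */
    double step(double x, double s, const double *a, const double *b, int i)
    {
        if (x > 0.0) { n_taken[7]++;    s += a[i] * b[i]; }
        else         { n_fallthru[7]++; s -= 1.0; }
        return s;
    }

    /* What the recompile asks of the recorded counts:                         */
    int predict_taken(int branch)
    {
        return n_taken[branch] > n_fallthru[branch];
    }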
2015 01:28:07,522 --> 01:28:08,420 Yes? 2016 01:28:08,420 --> 01:28:10,460 To get more ILP, you have to look deeper 2017 01:28:10,460 --> 01:28:11,780 in the instruction stream. 2018 01:28:11,780 --> 01:28:14,435 And I was wondering if there's any idea how much of a trade 2019 01:28:14,435 --> 01:28:17,060 off there is between the gains from the ILP 2020 01:28:17,060 --> 01:28:19,910 and the loss of perhaps data locality. 2021 01:28:19,910 --> 01:28:21,630 Yes, that's a very good question. 2022 01:28:21,630 --> 01:28:24,200 And I think one can generalize it to just 2023 01:28:24,200 --> 01:28:28,880 say, how does ILP play with data locality in general? 2024 01:28:28,880 --> 01:28:32,370 Is it really a good thing or a bad thing? 2025 01:28:32,370 --> 01:28:37,680 And my take on it is that for the kinds of systems 2026 01:28:37,680 --> 01:28:40,067 we're talking about today and the sizes of caches we 2027 01:28:40,067 --> 01:28:42,150 have today, and the ones we'll have in the future, 2028 01:28:42,150 --> 01:28:43,983 which will be even better in this respect, 2029 01:28:43,983 --> 01:28:46,080 that that's just not a big issue. 2030 01:28:46,080 --> 01:28:49,270 That's just from looking at code and getting a feeling for it. 2031 01:28:49,270 --> 01:28:50,910 But time will really tell. 2032 01:28:50,910 --> 01:28:55,050 The most ambitious ILP machine there's ever been, I believe, 2033 01:28:55,050 --> 01:28:57,090 was the Multiflow Trace, which could fire off 2034 01:28:57,090 --> 01:29:00,180 28 operations at a time and keep maybe 80 of them 2035 01:29:00,180 --> 01:29:01,920 going in flight at once. 2036 01:29:01,920 --> 01:29:04,170 And it didn't have a data cache. 2037 01:29:04,170 --> 01:29:07,140 So locality was just not an issue particularly, 2038 01:29:07,140 --> 01:29:09,830 though some forms of locality were an issue for that machine. 2039 01:29:09,830 --> 01:29:12,040 So there really isn't that kind of experience. 2040 01:29:12,040 --> 01:29:13,710 But my guess is that that's not going 2041 01:29:13,710 --> 01:29:16,540 to be a big problem at all. 2042 01:29:16,540 --> 01:29:17,170 But who knows? 2043 01:29:17,170 --> 01:29:18,128 I'm always an optimist. 2044 01:29:18,128 --> 01:29:20,680 2045 01:29:20,680 --> 01:29:23,460 So, well, I'd really like to thank you for coming today. 2046 01:29:23,460 --> 01:29:24,960 I've enjoyed giving this talk a lot. 2047 01:29:24,960 --> 01:29:41,160 2048 01:29:41,160 --> 01:29:44,210 [MUSIC PLAYING] 2049 01:29:44,210 --> 01:30:23,000