1
00:00:00,000 --> 00:00:03,437
[MUSIC PLAYING]

2
00:00:03,437 --> 00:00:13,770


3
00:00:13,770 --> 00:00:15,960
Thanks a lot for coming
and watching this tape.

4
00:00:15,960 --> 00:00:17,880
I'm going to talk
about input/output.

5
00:00:17,880 --> 00:00:19,740
Now, input/output
is an area that's

6
00:00:19,740 --> 00:00:22,785
been an orphan of computer
architecture long neglected.

7
00:00:22,785 --> 00:00:25,410
So I'm going to need to motivate
why you want to stay and watch

8
00:00:25,410 --> 00:00:26,948
this tape.

9
00:00:26,948 --> 00:00:29,490
To give you an idea of how much
this area has been neglected,

10
00:00:29,490 --> 00:00:32,100
first of all, the
whole equipment

11
00:00:32,100 --> 00:00:34,200
is referred to as
peripherals, as opposed

12
00:00:34,200 --> 00:00:36,330
to the central processor.

13
00:00:36,330 --> 00:00:38,600
And secondly, there was a
programming language that

14
00:00:38,600 --> 00:00:41,100
was the ancestor of most of the
programming languages people

15
00:00:41,100 --> 00:00:44,370
use today. This language
is called ALGOL 60.

16
00:00:44,370 --> 00:00:46,920
And when it was
invented, it didn't

17
00:00:46,920 --> 00:00:50,850
have any input or output at all
in the language, and nobody cared.

18
00:00:50,850 --> 00:00:56,220
So let me motivate why you want
to watch this tape, starting

19
00:00:56,220 --> 00:00:58,760
off with our first slide.

20
00:00:58,760 --> 00:01:02,390
What's happening is a
renaissance in CPU performance

21
00:01:02,390 --> 00:01:05,269
improving by 50%
to 100% per year.

22
00:01:05,269 --> 00:01:08,330
In fact, what we have
today is the supercomputer

23
00:01:08,330 --> 00:01:09,020
in the desktop.

24
00:01:09,020 --> 00:01:09,980
It's something that
people have been

25
00:01:09,980 --> 00:01:11,600
talking about for a long time.

26
00:01:11,600 --> 00:01:14,030
What you see on the
left-hand part of this slide

27
00:01:14,030 --> 00:01:17,480
is the HP 735 workstation and
the various characteristics

28
00:01:17,480 --> 00:01:18,260
of that.

29
00:01:18,260 --> 00:01:20,090
In contrast, on
the right-hand side

30
00:01:20,090 --> 00:01:23,690
is the Cray 1 supercomputer, the
original vector supercomputer

31
00:01:23,690 --> 00:01:25,100
and its characteristics.

32
00:01:25,100 --> 00:01:27,320
When you line these
characteristics up

33
00:01:27,320 --> 00:01:33,230
by all measures, the HP
735 is faster or better

34
00:01:33,230 --> 00:01:36,650
than the Cray Vector
supercomputer, even including

35
00:01:36,650 --> 00:01:40,622
the cost at about 1%
or half a percent.

36
00:01:40,622 --> 00:01:45,230
Another way of saying
that is that the HP 735,

37
00:01:45,230 --> 00:01:47,450
if it was announced
in the late 1970s,

38
00:01:47,450 --> 00:01:49,980
would have been a hell
of a good supercomputer.

39
00:01:49,980 --> 00:01:52,210
So we've got the
supercomputer on the desk.

40
00:01:52,210 --> 00:01:53,690
What an amazing development.

41
00:01:53,690 --> 00:01:59,210
Moreover, people are taking tens
or hundreds of these processors

42
00:01:59,210 --> 00:02:01,700
and putting them together
to create supercomputers out

43
00:02:01,700 --> 00:02:03,890
of lots of these processors.

44
00:02:03,890 --> 00:02:06,530
So that is moving at
an even faster rate,

45
00:02:06,530 --> 00:02:08,220
say, 150% per year.

46
00:02:08,220 --> 00:02:10,190
So why do we care about I/O?

47
00:02:10,190 --> 00:02:12,170
Well, because I/O hasn't
been moving as fast

48
00:02:12,170 --> 00:02:13,550
as the processor design.

49
00:02:13,550 --> 00:02:15,410
It's limited by
mechanical delays

50
00:02:15,410 --> 00:02:17,480
and has been growing
maybe 5% or 10%

51
00:02:17,480 --> 00:02:19,780
per year over that
same time period.

52
00:02:19,780 --> 00:02:22,510


53
00:02:22,510 --> 00:02:24,990
What's going to happen if
we don't improve the I/O

54
00:02:24,990 --> 00:02:25,690
component?

55
00:02:25,690 --> 00:02:27,820
Gene Amdahl came up
with Amdahl's law

56
00:02:27,820 --> 00:02:32,190
that says system speed is
limited by the slowest part.

57
00:02:32,190 --> 00:02:34,430
So let's do a specific example.

58
00:02:34,430 --> 00:02:36,820
If we spend 10% of
our time today in I/O,

59
00:02:36,820 --> 00:02:39,550
and we get a 10 times faster
CPU, which comes along pretty

60
00:02:39,550 --> 00:02:42,260
quickly, we're only going to
get half of that performance

61
00:02:42,260 --> 00:02:42,760
potential.

62
00:02:42,760 --> 00:02:46,130
We're going to lose 50%
of that performance.

63
00:02:46,130 --> 00:02:48,760
Similarly, if we take that same
program with 10% of its time

64
00:02:48,760 --> 00:02:51,160
today and get 100 times
faster CPU, possibly

65
00:02:51,160 --> 00:02:54,430
with a multiprocessor,
we're going to lose 90% of that,

66
00:02:54,430 --> 00:02:57,800
getting only 1/10
of its potential.
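
The two cases above can be checked with a short calculation. This is a minimal sketch of Amdahl's law, assuming that only the CPU share of the run time speeds up and the I/O share stays fixed; the function name `overall_speedup` is illustrative and not from the talk.

```python
def overall_speedup(io_fraction, cpu_speedup):
    """Amdahl's law: only the CPU share of the run time gets faster."""
    cpu_fraction = 1.0 - io_fraction
    new_time = io_fraction + cpu_fraction / cpu_speedup
    return 1.0 / new_time


# 10% of today's run time is I/O, as in the example in the talk.
for cpu_speedup in (10, 100):
    s = overall_speedup(0.10, cpu_speedup)
    print(f"{cpu_speedup}x CPU gives {s:.2f}x overall, "
          f"{s / cpu_speedup:.0%} of the potential")
```

With a 10 times faster CPU the program runs a little over 5 times faster, roughly half the potential; with 100 times it runs about 9 times faster, roughly a tenth.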

67
00:02:57,800 --> 00:03:00,770
That means we have an
I/O bottleneck facing us

68
00:03:00,770 --> 00:03:02,030
very shortly.

69
00:03:02,030 --> 00:03:04,040
Because a diminishing
fraction of the time

70
00:03:04,040 --> 00:03:05,990
is going to be spent in
the CPU in the future

71
00:03:05,990 --> 00:03:07,610
if we don't do
anything about I/O.

72
00:03:07,610 --> 00:03:11,300
That means a diminishing
value of faster CPUs, which

73
00:03:11,300 --> 00:03:13,850
means a diminishing
value of researchers

74
00:03:13,850 --> 00:03:16,760
who are working on CPUs, not
to mention a diminishing value

75
00:03:16,760 --> 00:03:19,518
of high-paid academic
consultants on CPUs.

76
00:03:19,518 --> 00:03:21,560
So I think we all agree
this is pretty important,

77
00:03:21,560 --> 00:03:25,457
even if we're not going
to work on I/O ourselves.

78
00:03:25,457 --> 00:03:27,790
So what have our colleagues
in magnetic disk design been

79
00:03:27,790 --> 00:03:30,190
doing all these years while
we've been making processors

80
00:03:30,190 --> 00:03:31,370
faster?

81
00:03:31,370 --> 00:03:33,730
They've been concentrating
on capacity and dollars

82
00:03:33,730 --> 00:03:34,960
per megabyte.

83
00:03:34,960 --> 00:03:38,440
They are improving
at about 25% per year

84
00:03:38,440 --> 00:03:42,250
historically, and more recently,
at about 50% per year--

85
00:03:42,250 --> 00:03:44,165
both of those measures.

86
00:03:44,165 --> 00:03:45,790
The other thing that
they've been doing

87
00:03:45,790 --> 00:03:48,047
is evolving to smaller
disks, from things

88
00:03:48,047 --> 00:03:50,380
that were the size of washing
machines or refrigerators,

89
00:03:50,380 --> 00:03:52,240
to the things that you
can fit in your hand.

90
00:03:52,240 --> 00:03:54,350
Here's a specific example.

91
00:03:54,350 --> 00:03:58,250
So this I have in my hand is a
2 and 1/2 inch diameter disk.

92
00:03:58,250 --> 00:04:00,790
You can see it's very thin,
maybe 1/2 inch thick.

93
00:04:00,790 --> 00:04:03,490
And what's on the back of
it is the electronics--

94
00:04:03,490 --> 00:04:05,500
all the integrated circuits.

95
00:04:05,500 --> 00:04:07,660
This disk, at the time
we're making this tape,

96
00:04:07,660 --> 00:04:09,490
can hold 320 megabytes.

97
00:04:09,490 --> 00:04:11,590
As you can see, there's
just two platters here.

98
00:04:11,590 --> 00:04:14,035
This is remarkable
shrinkage in design.

99
00:04:14,035 --> 00:04:16,660
That's what our disk colleagues
have been doing-- making things

100
00:04:16,660 --> 00:04:21,938
smaller, the cost cheaper,
and the capacity greater.

101
00:04:21,938 --> 00:04:23,230
That's what they've been doing.

102
00:04:23,230 --> 00:04:24,970
And in fact, in
a few years, this

103
00:04:24,970 --> 00:04:27,130
is going to seem like a
dinosaur monster disk,

104
00:04:27,130 --> 00:04:30,343
and people are working on disks
that'll be 1 inch in diameter,

105
00:04:30,343 --> 00:04:32,260
much smaller than a 2
inch, and you won't even

106
00:04:32,260 --> 00:04:35,147
be able to know whether I've
got one in my hand or not.

107
00:04:35,147 --> 00:04:37,480
What our colleagues in the
disk industry have been doing

108
00:04:37,480 --> 00:04:41,650
is making disks that are
smaller and cheaper, as opposed

109
00:04:41,650 --> 00:04:43,490
to larger and faster.

110
00:04:43,490 --> 00:04:47,390
The faster processors need
larger and faster disks.

111
00:04:47,390 --> 00:04:50,140
So the question we asked
ourselves six or seven years

112
00:04:50,140 --> 00:04:53,230
ago was, can these smaller
disks be used somehow

113
00:04:53,230 --> 00:04:56,680
to close the gap in performance
between disks and CPUs?

114
00:04:56,680 --> 00:04:59,920
Or how could we use these
smaller disks to do that?

115
00:04:59,920 --> 00:05:02,760
So the idea is to replace a
small number of large disks

116
00:05:02,760 --> 00:05:06,675
with a large number
of small disks.

117
00:05:06,675 --> 00:05:11,910
This next slide shows
how it would work.

118
00:05:11,910 --> 00:05:14,060
What we'd see on
the top is the way

119
00:05:14,060 --> 00:05:15,810
disks are traditionally
manufactured.

120
00:05:15,810 --> 00:05:18,840
What you see is four
different designs

121
00:05:18,840 --> 00:05:20,430
having four
different engineering

122
00:05:20,430 --> 00:05:23,460
teams, each concentrating
on the high end

123
00:05:23,460 --> 00:05:25,975
to the low end of the efforts.

124
00:05:25,975 --> 00:05:28,920
What we're talking about
instead is concentrating

125
00:05:28,920 --> 00:05:32,820
the engineering talents in the
lowest, smallest-diameter disk,

126
00:05:32,820 --> 00:05:34,530
building the best disk
you can, and simply

127
00:05:34,530 --> 00:05:37,860
replicating it to get mid-range
and then high-end designs.

128
00:05:37,860 --> 00:05:41,880
That's the El Dorado of
disk array design here.

129
00:05:41,880 --> 00:05:43,380
That's what people
are trying to do.

130
00:05:43,380 --> 00:05:46,980
Well, how well would that work?

131
00:05:46,980 --> 00:05:50,910
What's shown on the left column
is the IBM mainframe disk.

132
00:05:50,910 --> 00:05:53,350
Back when IBM made
a lot of money,

133
00:05:53,350 --> 00:05:57,600
this is where a lot of money
was made, in the mainframe disk.

134
00:05:57,600 --> 00:06:00,730
If we contrast that
with this narrower disk

135
00:06:00,730 --> 00:06:04,180
also from IBM, which is
in the middle column,

136
00:06:04,180 --> 00:06:05,980
we can see some big
differences there.

137
00:06:05,980 --> 00:06:08,820
But if we got enough of those
small disks, in this case, 70,

138
00:06:08,820 --> 00:06:10,920
so that we would get the
same capacity-- that's

139
00:06:10,920 --> 00:06:12,960
this right column, it's
the same capacity--

140
00:06:12,960 --> 00:06:15,010
we have some interesting
characteristics.

141
00:06:15,010 --> 00:06:17,040
First of all, it's actually
got some advantages

142
00:06:17,040 --> 00:06:18,870
in size and power.

143
00:06:18,870 --> 00:06:20,490
But the really
exciting advantage

144
00:06:20,490 --> 00:06:22,620
from the system
designer's perspective,

145
00:06:22,620 --> 00:06:25,980
given this CPU
performance gap, is

146
00:06:25,980 --> 00:06:28,290
the data rate or
I/O rate-- data rate

147
00:06:28,290 --> 00:06:30,150
megabytes per
second transferred,

148
00:06:30,150 --> 00:06:32,550
or the number of
I/Os per second.

149
00:06:32,550 --> 00:06:36,580
Given that we have 70 disks
with 70 arms operating

150
00:06:36,580 --> 00:06:38,610
at the same time instead
of just a small number,

151
00:06:38,610 --> 00:06:40,360
we can get a much
higher data rate.

152
00:06:40,360 --> 00:06:42,960
You can see that's about
a factor of 8 improvement.

153
00:06:42,960 --> 00:06:45,305
Even though the big
disks run much faster,

154
00:06:45,305 --> 00:06:46,680
we've got so many
small ones that

155
00:06:46,680 --> 00:06:48,510
are close enough we get a
factor of 8 improvement,

156
00:06:48,510 --> 00:06:49,927
and the number of
I/Os per second,

157
00:06:49,927 --> 00:06:52,890
similarly, increases by
about a factor of 6.

158
00:06:52,890 --> 00:06:54,710
And let's say the cost
is about the same.

159
00:06:54,710 --> 00:06:56,190
But what about
this last row here?

160
00:06:56,190 --> 00:06:59,190
What about the reliability?

161
00:06:59,190 --> 00:07:01,927
This is an acronym for
the mean time to data loss.

162
00:07:01,927 --> 00:07:04,260
So we can think of it as the
mean time between failures.

163
00:07:04,260 --> 00:07:07,030
How well will that work?

164
00:07:07,030 --> 00:07:10,030
Well, turns out if
things fail randomly,

165
00:07:10,030 --> 00:07:13,710
then the reliability of
N things is 1 over N.

166
00:07:13,710 --> 00:07:15,420
So we take the mean
time between failure

167
00:07:15,420 --> 00:07:18,910
from about 50,000 hours,
divided by 70, to about 700 hours.

168
00:07:18,910 --> 00:07:21,720
In other words, the mean
time between failure

169
00:07:21,720 --> 00:07:24,840
drops from six
years to one month.
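
The arithmetic behind that drop can be sketched as follows, assuming independent failures at a constant rate and no redundancy, so that any single disk failure counts as data loss. The helper name `array_mttf_hours` is illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def array_mttf_hours(disk_mttf_hours, n_disks):
    """Mean time to failure of n independent disks with no redundancy."""
    return disk_mttf_hours / n_disks


single = 50_000                       # hours, the figure quoted in the talk
array = array_mttf_hours(single, 70)  # 70 small disks

print(f"one disk: {single / HOURS_PER_YEAR:.1f} years")
print(f"70 disks: {array:.0f} hours, "
      f"about {array / (HOURS_PER_YEAR / 12):.1f} months")
```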

170
00:07:24,840 --> 00:07:28,980
Therefore, disk arrays are
too unreliable to be useful.

171
00:07:28,980 --> 00:07:30,657
Therefore, it's a bad idea.

172
00:07:30,657 --> 00:07:31,990
And this is the end of the tape.

173
00:07:31,990 --> 00:07:34,210
Thank you very much.

174
00:07:34,210 --> 00:07:35,290
No.

175
00:07:35,290 --> 00:07:38,050
Don't-- hold that dial.

176
00:07:38,050 --> 00:07:41,350
What we're going to
do is add extra disks

177
00:07:41,350 --> 00:07:43,690
and turn a weakness
into a strength.

178
00:07:43,690 --> 00:07:45,850
Hence, the name of RAID.

179
00:07:45,850 --> 00:07:49,870
What we're going to do
is take redundant arrays

180
00:07:49,870 --> 00:07:52,720
of inexpensive disks by
having some extras to overcome

181
00:07:52,720 --> 00:07:55,060
this reliability disadvantage.

182
00:07:55,060 --> 00:07:56,630
Now, there's two
advantages here.

183
00:07:56,630 --> 00:07:58,625
One is subtle and one's obvious.

184
00:07:58,625 --> 00:08:00,250
What we're going to
do is have extra disks

185
00:08:00,250 --> 00:08:03,320
so that when a disk fails, we
can reconstruct the lost data.

186
00:08:03,320 --> 00:08:05,940
That's the first point.

187
00:08:05,940 --> 00:08:08,570
But if we can reconstruct
the data sometimes,

188
00:08:08,570 --> 00:08:11,237
that means that on the
fly we can reconstruct

189
00:08:11,237 --> 00:08:12,320
the data that's been lost.

190
00:08:12,320 --> 00:08:16,850
So just because at the end of
the term when everything's busy

191
00:08:16,850 --> 00:08:19,435
a disk crashes doesn't mean
you can't get your term paper.

192
00:08:19,435 --> 00:08:20,810
Because it can be
reconstructed--

193
00:08:20,810 --> 00:08:22,268
it's going to run
a little slower--

194
00:08:22,268 --> 00:08:24,860
but reconstructed on the
fly and given back to you.

195
00:08:24,860 --> 00:08:28,460
We'll see a tape of that
later, a demo showing it

196
00:08:28,460 --> 00:08:30,890
runs a little slower,
but the information's

197
00:08:30,890 --> 00:08:33,559
available continuously.

198
00:08:33,559 --> 00:08:37,820
And that's the basic idea of
RAID of-- so this redundancy.

199
00:08:37,820 --> 00:08:41,030
Now, it turns out there's lots
of ways to do this redundancy.

200
00:08:41,030 --> 00:08:43,370
What I'm going to do is
cover four of them here.

201
00:08:43,370 --> 00:08:45,020
What's shown on
the left-hand side

202
00:08:45,020 --> 00:08:48,120
is kind of the English
descriptions of these things.

203
00:08:48,120 --> 00:08:50,970
What's shown on the right-hand
side is the levels of the RAID.

204
00:08:50,970 --> 00:08:52,520
This was in our original paper.

205
00:08:52,520 --> 00:08:54,615
We used these levels
to explain things.

206
00:08:54,615 --> 00:08:56,240
And as you can see,
from top to bottom,

207
00:08:56,240 --> 00:08:58,370
they're getting
more sophisticated.

208
00:08:58,370 --> 00:09:01,400
And these level numbers
have caught on, probably

209
00:09:01,400 --> 00:09:04,350
to the regret of some
English teachers.

210
00:09:04,350 --> 00:09:06,560
You may see these levels--

211
00:09:06,560 --> 00:09:09,020
these numbers here
that are put together

212
00:09:09,020 --> 00:09:11,480
to describe advertisements.

213
00:09:11,480 --> 00:09:14,152
You'll see level 5 RAIDs or
RAID 5s and things like that.

214
00:09:14,152 --> 00:09:16,610
What I'm going to do is go over
each of these organizations

215
00:09:16,610 --> 00:09:20,660
in this next part of the tape,
and explain the advantages.

216
00:09:20,660 --> 00:09:22,720
And basically, as
we go down the line,

217
00:09:22,720 --> 00:09:24,470
we're going to have
less and less overhead

218
00:09:24,470 --> 00:09:28,680
to provide redundancy
to get that reliability.

219
00:09:28,680 --> 00:09:32,000
The first organization is called
either mirroring or shadowing,

220
00:09:32,000 --> 00:09:35,580
depending where your--

221
00:09:35,580 --> 00:09:37,700
what manufacturer
sells it to you.

222
00:09:37,700 --> 00:09:39,650
So what's shown on the
left is the data disk.

223
00:09:39,650 --> 00:09:42,950
And all these examples will
have, say, eight data disks.

224
00:09:42,950 --> 00:09:46,670
Now, what you need to do
to have reliability using

225
00:09:46,670 --> 00:09:49,580
the mirroring scheme is every
disk has its own mirror,

226
00:09:49,580 --> 00:09:51,200
so that if you
write to this disk,

227
00:09:51,200 --> 00:09:52,730
you also write to the mirror.

228
00:09:52,730 --> 00:09:55,940
That way if this
disk fails, we know

229
00:09:55,940 --> 00:09:59,280
where to go look to find
the data that's missing.

230
00:09:59,280 --> 00:10:02,120
This is the organization that's
the favorite organization

231
00:10:02,120 --> 00:10:05,240
of disk manufacturers.

232
00:10:05,240 --> 00:10:08,850
The reason is that you have
to buy twice as many disks.

233
00:10:08,850 --> 00:10:11,870
Unfortunately, for those of
us on a university budget,

234
00:10:11,870 --> 00:10:14,360
getting extra reliability by
buying twice as many disks

235
00:10:14,360 --> 00:10:17,150
isn't something we could
afford, so this wasn't something

236
00:10:17,150 --> 00:10:19,340
we were interested in.

237
00:10:19,340 --> 00:10:23,120
The next organization,
again, has eight data disks,

238
00:10:23,120 --> 00:10:25,880
and we've cut the
redundancy down to four.

239
00:10:25,880 --> 00:10:28,080
If you've taken a
course in memory design,

240
00:10:28,080 --> 00:10:29,810
you've heard about
error correction codes

241
00:10:29,810 --> 00:10:32,845
that can correct single errors.

242
00:10:32,845 --> 00:10:34,220
You know there's
a way to do this

243
00:10:34,220 --> 00:10:35,780
without doubling the memory.

244
00:10:35,780 --> 00:10:38,030
And in fact, what
happens is you

245
00:10:38,030 --> 00:10:41,090
have, in this case, four
redundant disks, and parity

246
00:10:41,090 --> 00:10:43,880
is calculated over
subsets of the data disks.

247
00:10:43,880 --> 00:10:47,210
And you can see how this
works in this example.

248
00:10:47,210 --> 00:10:50,180
Let's suppose disk
11 breaks right here.

249
00:10:50,180 --> 00:10:52,160
What would happen if we
calculated the parities

250
00:10:52,160 --> 00:10:54,440
in the subset is we
get a 1 for this group,

251
00:10:54,440 --> 00:10:56,330
because disk 11's included.

252
00:10:56,330 --> 00:10:58,880
Disk 11 is not in
the next group, so that's a 0.

253
00:10:58,880 --> 00:11:02,100
In this group, disk 11 is
included, so we get a 1 there,

254
00:11:02,100 --> 00:11:03,350
and this group would have a 1.

255
00:11:03,350 --> 00:11:06,380
So we'd have the
pattern 1, 0, 1, 1.

256
00:11:06,380 --> 00:11:08,810
The fact that this
pattern is not zeros

257
00:11:08,810 --> 00:11:11,150
means there's an error.

258
00:11:11,150 --> 00:11:13,100
And not only that,
we need to figure out

259
00:11:13,100 --> 00:11:14,100
which disk is the error.

260
00:11:14,100 --> 00:11:15,892
And if you remember
your binary here,

261
00:11:15,892 --> 00:11:18,700
1, 0, 1, 1-- if you convert
that into decimal, that's 11.

262
00:11:18,700 --> 00:11:21,170
So in fact, kind of
as a parlor trick,

263
00:11:21,170 --> 00:11:22,850
the encoding that
shows [INAUDIBLE] here

264
00:11:22,850 --> 00:11:24,680
also points out the
disk that has failed.

265
00:11:24,680 --> 00:11:27,230
So there is an error,
and it's disk 11.
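
The parlor trick can be reproduced with a standard Hamming code. This is a sketch under assumptions not stated in the talk: the 12 disks are numbered 1 to 12, check disks sit at positions 8, 4, 2 and 1, and each parity group covers every position whose number has that bit set. The slide's actual numbering may differ.

```python
CHECK_POSITIONS = (8, 4, 2, 1)   # assumed check-disk positions
N_DISKS = 12                     # 8 data disks + 4 check disks


def encode(data_bits):
    """Place 8 data bits on the non-check positions and fill in the check bits."""
    data_positions = [p for p in range(1, N_DISKS + 1)
                      if p not in CHECK_POSITIONS]
    word = {p: 0 for p in range(1, N_DISKS + 1)}
    for p, bit in zip(data_positions, data_bits):
        word[p] = bit
    for c in CHECK_POSITIONS:
        # Check bit c makes the parity of its whole group even.
        word[c] = sum(word[p] for p in data_positions if p & c) % 2
    return word


def group_parities(word):
    """Parity of each group, listed in the order 8, 4, 2, 1."""
    return [sum(bit for p, bit in word.items() if p & c) % 2
            for c in CHECK_POSITIONS]


word = encode([1, 0, 1, 1, 0, 0, 1, 0])
print(group_parities(word))          # all zeros: no error

word[11] ^= 1                        # disk 11 returns a wrong bit
pattern = group_parities(word)
failed_disk = int("".join(str(b) for b in pattern), 2)
print(pattern, failed_disk)
```

Disk 11 belongs to the groups for 8, 2 and 1 but not 4, so the pattern read in that order is 1, 0, 1, 1, which is 11 in binary.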

266
00:11:27,230 --> 00:11:30,950
Now we've reduced the cost, in
this case, from eight disks

267
00:11:30,950 --> 00:11:33,020
to four disks, but that
still would be expensive

268
00:11:33,020 --> 00:11:35,670
on the university budget.

269
00:11:35,670 --> 00:11:39,110
The next scheme, RAID level
3, or redundancy via parity,

270
00:11:39,110 --> 00:11:43,010
recognizes one of the really
great things about disks today,

271
00:11:43,010 --> 00:11:44,810
in that disks are smart.

272
00:11:44,810 --> 00:11:47,510
Disks have controllers that
know if they're working or not.

273
00:11:47,510 --> 00:11:49,022
Moreover, on every
single sector--

274
00:11:49,022 --> 00:11:51,230
that's kind of every block
that you can read or write

275
00:11:51,230 --> 00:11:54,740
individually-- it has
extra error-checking codes

276
00:11:54,740 --> 00:11:56,930
to determine with
very high probability

277
00:11:56,930 --> 00:11:59,300
whether or not what you
read was read correctly,

278
00:11:59,300 --> 00:12:02,540
or if you write it, it's
likely to be read correctly

279
00:12:02,540 --> 00:12:04,650
again later.

280
00:12:04,650 --> 00:12:08,220
Since there's the
ability to determine

281
00:12:08,220 --> 00:12:09,360
if a disk has failed--

282
00:12:09,360 --> 00:12:12,090
the disk will raise its hand if
it failed-- all we have to do

283
00:12:12,090 --> 00:12:14,670
is calculate the data that's
missing from that disk,

284
00:12:14,670 --> 00:12:16,452
and we can do that
with one extra disk.

285
00:12:16,452 --> 00:12:17,910
So this is the type
of thing that's

286
00:12:17,910 --> 00:12:20,700
interesting in a
university budget.

287
00:12:20,700 --> 00:12:21,900
So how does that work?

288
00:12:21,900 --> 00:12:23,400
You can think of
this disk as having

289
00:12:23,400 --> 00:12:26,220
the sum of all the information
on all the other disks.

290
00:12:26,220 --> 00:12:27,840
Then if this disk
fails, we'll know

291
00:12:27,840 --> 00:12:30,690
that from the Reed-Solomon
codes, all we have to do

292
00:12:30,690 --> 00:12:34,060
is subtract the remaining
information from the sum disk,

293
00:12:34,060 --> 00:12:36,607
and that difference must
have been that value there.

294
00:12:36,607 --> 00:12:37,440
That's what happens.

295
00:12:37,440 --> 00:12:39,480
Actually, we do this
all with parity calculations,

296
00:12:39,480 --> 00:12:41,500
but you can think
of it that way.
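
The sum-and-subtract picture maps directly onto XOR parity, where adding and subtracting are the same operation. This is a minimal sketch with eight data disks and one parity disk; the block contents and the name `xor_blocks` are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*blocks))


# Eight data disks, each holding one small block (contents are arbitrary).
data = [bytes([i, 2 * i, 255 - i]) for i in range(8)]
parity = xor_blocks(data)            # the "sum" disk

failed = 5                           # the disk that raised its hand
survivors = [blk for i, blk in enumerate(data) if i != failed]

# "Subtract" the surviving data from the sum to get the missing block.
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data[failed])
```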

297
00:12:41,500 --> 00:12:42,970
So this seems
pretty interesting.

298
00:12:42,970 --> 00:12:46,500
Instead of having twice as many
disks, we have one extra disk.

299
00:12:46,500 --> 00:12:50,580
We can afford that on
a university budget.

300
00:12:50,580 --> 00:12:52,207
Now, so how well will this work?

301
00:12:52,207 --> 00:12:54,540
So far you've probably been
thinking of this as we write

302
00:12:54,540 --> 00:12:57,630
a bit to every single one of
the disks, so that there's--

303
00:12:57,630 --> 00:13:00,270
a read will read from all
nine disks in this case,

304
00:13:00,270 --> 00:13:02,600
and a write will write to
all nine disks in this case.

305
00:13:02,600 --> 00:13:04,373
That's probably a good
way to think of it.

306
00:13:04,373 --> 00:13:05,790
You can think of
this when they're

307
00:13:05,790 --> 00:13:08,310
accessing all the disks
as large accesses,

308
00:13:08,310 --> 00:13:10,483
large reads and large writes.

309
00:13:10,483 --> 00:13:12,150
If you think about
this a little longer,

310
00:13:12,150 --> 00:13:14,850
you realize though if every
single disk has this error

311
00:13:14,850 --> 00:13:16,800
correction information
in each sector,

312
00:13:16,800 --> 00:13:19,722
that would allow these disks
to be accessed independently

313
00:13:19,722 --> 00:13:22,180
with blocks that don't have
anything to do with each other.

314
00:13:22,180 --> 00:13:25,260
So if what we write to
each one is a whole sector

315
00:13:25,260 --> 00:13:27,570
at a time, or even
a larger unit,

316
00:13:27,570 --> 00:13:31,540
then we can let reads
happen independently.

317
00:13:31,540 --> 00:13:34,350
This would allow us to
increase the number of I/Os

318
00:13:34,350 --> 00:13:36,670
per second in terms of
reads from this information.

319
00:13:36,670 --> 00:13:39,477
So we could have the large reads
that involve all the disks,

320
00:13:39,477 --> 00:13:41,310
the large writes that
involve all the disks,

321
00:13:41,310 --> 00:13:46,380
and small reads without
really any extra work.

322
00:13:46,380 --> 00:13:49,050
That brings us to small writes.

323
00:13:49,050 --> 00:13:52,267
What would happen if we wanted
to write just to this one disk?

324
00:13:52,267 --> 00:13:53,850
What it would seem
like you want to do

325
00:13:53,850 --> 00:13:56,267
if you wanted to write to this
disk, what you'd have to do

326
00:13:56,267 --> 00:14:00,120
is read all the information
from the corresponding disks,

327
00:14:00,120 --> 00:14:02,940
calculate the new parity
and write that out.

328
00:14:02,940 --> 00:14:04,720
Well, that's one
way you could do it.

329
00:14:04,720 --> 00:14:08,040
But because it's got this sum
and difference relationship,

330
00:14:08,040 --> 00:14:09,810
the other way we
could do it is say,

331
00:14:09,810 --> 00:14:12,450
let's compare the new
data to the old data,

332
00:14:12,450 --> 00:14:14,820
see which bits change,
and then change

333
00:14:14,820 --> 00:14:16,740
those bits in the parity disk.

334
00:14:16,740 --> 00:14:19,650
Thus, a small
write could involve

335
00:14:19,650 --> 00:14:23,053
only two disks-- the parity
disk and the disk with the data.

336
00:14:23,053 --> 00:14:24,720
So that seems better
than this other way

337
00:14:24,720 --> 00:14:25,890
that involved all the disks.

338
00:14:25,890 --> 00:14:29,130
So we can do the large reads--
everybody supplies bits;

339
00:14:29,130 --> 00:14:31,587
large writes-- everybody
gets some bits.

340
00:14:31,587 --> 00:14:33,670
The small reads, which
allow things independently,

341
00:14:33,670 --> 00:14:34,670
and the writes--

342
00:14:34,670 --> 00:14:36,810
unfortunately, the writes
take four disk accesses--

343
00:14:36,810 --> 00:14:39,060
reading and writing that
disk, and reading and writing

344
00:14:39,060 --> 00:14:40,320
the parity disk.
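
The four accesses of a small write can be sketched like this, assuming XOR parity: read the old data, read the old parity, then write the new data and the new parity. The helper names and block values are illustrative.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(blocks):
    """Parity computed the slow way, from every data block."""
    return reduce(xor_bytes, blocks)


def small_write_parity(old_data, old_parity, new_data):
    """New parity from only the old data, old parity, and new data."""
    changed_bits = xor_bytes(old_data, new_data)
    return xor_bytes(old_parity, changed_bits)


blocks = [b"\x0f\xf0", b"\x33\xcc", b"\x55\xaa", b"\x00\xff"]
parity = full_parity(blocks)

new_block = b"\x12\x34"
new_parity = small_write_parity(blocks[2], parity, new_block)
blocks[2] = new_block

# The shortcut agrees with recomputing parity over all the disks.
print(new_parity == full_parity(blocks))
```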

345
00:14:40,320 --> 00:14:42,540
Moreover, what the problem
is, the parity disk

346
00:14:42,540 --> 00:14:44,040
is going to be the
write bottleneck.

347
00:14:44,040 --> 00:14:46,752
Every write has to
update the parity disk.

348
00:14:46,752 --> 00:14:48,210
Therefore, the
speed is going to be

349
00:14:48,210 --> 00:14:51,960
limited by how fast the
parity disk can write.

350
00:14:51,960 --> 00:14:55,300
This inspired the
next organization.

351
00:14:55,300 --> 00:14:56,970
Now, this is a
different diagram,

352
00:14:56,970 --> 00:15:00,090
with the disk in the top being
shown in exploded form going

353
00:15:00,090 --> 00:15:02,250
down the page, where
each one of these squares

354
00:15:02,250 --> 00:15:04,110
is supposed to
represent a block.

355
00:15:04,110 --> 00:15:07,480
You can think of that as
a sector for right now.

356
00:15:07,480 --> 00:15:10,550
So let's try this idea.

357
00:15:10,550 --> 00:15:12,660
You notice what
we've done is taken

358
00:15:12,660 --> 00:15:15,600
all the parity, which before
was always on one disk,

359
00:15:15,600 --> 00:15:16,500
and spread it out.

360
00:15:16,500 --> 00:15:18,570
We've rotated the parity.

361
00:15:18,570 --> 00:15:21,420
So the parity for
this first row, which

362
00:15:21,420 --> 00:15:25,560
we refer to as a strike,
this first row is here,

363
00:15:25,560 --> 00:15:27,630
and then it moves its way down.

364
00:15:27,630 --> 00:15:29,830
So how will small writes
work in this case?

365
00:15:29,830 --> 00:15:31,860
Suppose we're going
to write to D3.

366
00:15:31,860 --> 00:15:35,670
Well, we're going to have to
update the block in D3, which

367
00:15:35,670 --> 00:15:39,360
ties up this disk; update
the parity right here, which

368
00:15:39,360 --> 00:15:42,730
corresponds to D3,
which occupies this disk.

369
00:15:42,730 --> 00:15:44,970
So we have two of the
five disks occupied.

370
00:15:44,970 --> 00:15:48,180
If the other disk we wanted
to occupy, wanted to update,

371
00:15:48,180 --> 00:15:50,370
the data we wanted to
update was disk eight,

372
00:15:50,370 --> 00:15:52,590
we'd also occupy
this parity block.

373
00:15:52,590 --> 00:15:55,470
So that would occupy
these two disks.

374
00:15:55,470 --> 00:15:58,620
Thus, without any really
extra work, just by rotating

375
00:15:58,620 --> 00:16:01,830
the parity, we're
capable of getting

376
00:16:01,830 --> 00:16:05,640
more small write bandwidth,
which makes it more attractive.
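
A small sketch of why rotating the parity helps. The layout here is an assumption, since the slide is not visible in the transcript: five disks, four data blocks per stripe, and the parity block moving one disk to the left on each successive stripe. The function names are hypothetical.

```python
N_DISKS = 5   # assumed: four data blocks plus one parity block per stripe


def locate(block):
    """Return (stripe, data_disk, parity_disk) for a data block number."""
    stripe, index = divmod(block, N_DISKS - 1)
    parity_disk = (N_DISKS - 1 - stripe) % N_DISKS   # rotates each stripe
    # Data blocks fill the stripe left to right, skipping the parity disk.
    data_disk = index if index < parity_disk else index + 1
    return stripe, data_disk, parity_disk


def disks_touched(block):
    """The two disks a small write to this block keeps busy."""
    _, data_disk, parity_disk = locate(block)
    return {data_disk, parity_disk}


a = disks_touched(3)
b = disks_touched(8)
print(a, b, a.isdisjoint(b))
```

Under this layout, small writes to blocks 3 and 8 use two different pairs of disks, so they can proceed at the same time. With a single dedicated parity disk both writes would queue on that disk.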

377
00:16:05,640 --> 00:16:08,640
This final organization,
redundancy with rotated parity,

378
00:16:08,640 --> 00:16:12,570
or RAID level 5, is
the end of our progress

379
00:16:12,570 --> 00:16:15,682
in getting performance
and providing redundancy.

380
00:16:15,682 --> 00:16:17,640
And this is the scheme
that we used at Berkeley

381
00:16:17,640 --> 00:16:20,110
and is very popular today.

382
00:16:20,110 --> 00:16:22,650
You can even read
articles in Byte magazine,

383
00:16:22,650 --> 00:16:25,140
amazingly enough, at least
when we make this tape,

384
00:16:25,140 --> 00:16:29,840
and you'll see RAID 5, and
now you know what that means.

385
00:16:29,840 --> 00:16:31,200
OK.

386
00:16:31,200 --> 00:16:34,290
Now in addition to coming
up with these ideas,

387
00:16:34,290 --> 00:16:37,450
we actually like trying to think
about building these things.

388
00:16:37,450 --> 00:16:39,310
And how are disks
constructed today?

389
00:16:39,310 --> 00:16:41,310
Well, the way disks work
today is you don't just

390
00:16:41,310 --> 00:16:43,200
plug the disk into a computer.

391
00:16:43,200 --> 00:16:45,150
It's connected to a cable.

392
00:16:45,150 --> 00:16:46,530
It's typically called a string.

393
00:16:46,530 --> 00:16:49,170
And you put several disks
on this string that's

394
00:16:49,170 --> 00:16:51,450
connected to a string
controller that, in this case,

395
00:16:51,450 --> 00:16:53,760
would be connected to an
array controller, let's say.

396
00:16:53,760 --> 00:16:57,395
But the idea here is disks are
connected over a set of wires

397
00:16:57,395 --> 00:16:58,770
to a computer,
and that's the way

398
00:16:58,770 --> 00:17:01,935
things are done in many
kinds of computer systems.

399
00:17:01,935 --> 00:17:03,560
It would seem what
you would do is just

400
00:17:03,560 --> 00:17:08,119
put that stripe right down
this row along this cable.

401
00:17:08,119 --> 00:17:09,780
What's wrong with that?

402
00:17:09,780 --> 00:17:12,410
The problem is now we have
the string controller, a piece

403
00:17:12,410 --> 00:17:14,480
of electronics which can break.

404
00:17:14,480 --> 00:17:17,000
If this string
were to break, that

405
00:17:17,000 --> 00:17:24,349
would disconnect all of these
disks from the computer system.

406
00:17:24,349 --> 00:17:26,230
Now, this RAID thing
works, provided

407
00:17:26,230 --> 00:17:27,579
we only have one broken disk.

408
00:17:27,579 --> 00:17:28,550
We can reconstruct it.

409
00:17:28,550 --> 00:17:30,250
But all of the
information in this group

410
00:17:30,250 --> 00:17:32,020
would be disconnected.

411
00:17:32,020 --> 00:17:37,030
That would mean we couldn't
reconstruct that information.

412
00:17:37,030 --> 00:17:41,100
However, if we do it
orthogonal, if we change it

413
00:17:41,100 --> 00:17:43,750
so that the disks are
perpendicular to the strings,

414
00:17:43,750 --> 00:17:46,350
what happens if
we lose a string?

415
00:17:46,350 --> 00:17:49,620
If this string goes out,
we only lose one disk

416
00:17:49,620 --> 00:17:53,700
from each one of the stripes,
which allows us to reconstruct

417
00:17:53,700 --> 00:17:57,480
that in the RAID
project because that's

418
00:17:57,480 --> 00:17:58,658
the way the system works.

419
00:17:58,658 --> 00:18:00,450
If we lose one just
from this [INAUDIBLE],

420
00:18:00,450 --> 00:18:02,140
it can reconstruct
the lost information.

421
00:18:02,140 --> 00:18:05,040
So, again, not with
extra hardware, but just

422
00:18:05,040 --> 00:18:07,260
by thinking about the
right way you organize it,

423
00:18:07,260 --> 00:18:11,010
you can get a more comfortable
system, a more reliable system.

424
00:18:11,010 --> 00:18:14,203


425
00:18:14,203 --> 00:18:15,870
Another question that
we asked ourselves

426
00:18:15,870 --> 00:18:17,250
is what about spare disks?

427
00:18:17,250 --> 00:18:19,080
What would happen
if we were to have

428
00:18:19,080 --> 00:18:21,120
some disks without
any data on them

429
00:18:21,120 --> 00:18:24,810
ready to repair in
case something broke?

430
00:18:24,810 --> 00:18:27,210
Right now you've been
assuming that the disk breaks

431
00:18:27,210 --> 00:18:30,570
and you call up your
local system manager,

432
00:18:30,570 --> 00:18:33,528
and he or she gets in their
car and drives to RadioShack,

433
00:18:33,528 --> 00:18:35,320
I guess, and buys a
disk and brings it back,

434
00:18:35,320 --> 00:18:36,300
and that might
take a long while.

435
00:18:36,300 --> 00:18:37,800
What happens if
there's a spare just

436
00:18:37,800 --> 00:18:39,450
sitting there ready to be used?

437
00:18:39,450 --> 00:18:42,910
What this actually shows is
the mean time to data loss.

438
00:18:42,910 --> 00:18:44,520
That's like the mean
time to failure.

439
00:18:44,520 --> 00:18:47,910
How many, in this case,
in thousands of hours,

440
00:18:47,910 --> 00:18:50,130
before it would break?

441
00:18:50,130 --> 00:18:52,800
Well, the study that was
done here, assuming disks

442
00:18:52,800 --> 00:18:55,147
fail independently, if we had--

443
00:18:55,147 --> 00:18:56,730
assuming we started
off with 70 disks,

444
00:18:56,730 --> 00:19:00,240
with just one spare disk we
would get this factor of 40

445
00:19:00,240 --> 00:19:01,920
improvement in reliability--

446
00:19:01,920 --> 00:19:03,600
40 times as reliable.

447
00:19:03,600 --> 00:19:05,490
Interestingly, with
two spare disks,

448
00:19:05,490 --> 00:19:08,220
we're about as good as an
infinite number of spare disks.

449
00:19:08,220 --> 00:19:10,380
So this data suggests
one or two spare disks

450
00:19:10,380 --> 00:19:12,178
would really be an
interesting thing

451
00:19:12,178 --> 00:19:14,220
to consider in terms of
improving the reliability

452
00:19:14,220 --> 00:19:15,880
of RAID systems.

453
00:19:15,880 --> 00:19:17,580
But at the top of
the slide, it says

454
00:19:17,580 --> 00:19:19,590
we're assuming
independent failures.

455
00:19:19,590 --> 00:19:21,450
We know disks don't
fail independently

456
00:19:21,450 --> 00:19:24,833
because of those strings that
connect the disks together.

457
00:19:24,833 --> 00:19:26,250
That's what this
next slide shows.

458
00:19:26,250 --> 00:19:28,500
So what would happen
with dependent failures

459
00:19:28,500 --> 00:19:32,330
because the disks are
connected on strings?

460
00:19:32,330 --> 00:19:35,502
Well, what happens is the
relationship is changed to now

461
00:19:35,502 --> 00:19:36,710
we need a whole spare string.

462
00:19:36,710 --> 00:19:39,080
In this case, we are assuming
that a string contains

463
00:19:39,080 --> 00:19:40,350
seven disks.

464
00:19:40,350 --> 00:19:42,300
So here are the disks
across the bottom.

465
00:19:42,300 --> 00:19:45,560
And these lines indicate
one string full of disks

466
00:19:45,560 --> 00:19:47,295
or two strings full of disks.

467
00:19:47,295 --> 00:19:49,670
So now if we have a whole
spare string because the string

468
00:19:49,670 --> 00:19:51,962
controller could fail and
knock out all of these disks,

469
00:19:51,962 --> 00:19:54,920
we get a 50 times improvement in
the mean time between failures.

470
00:19:54,920 --> 00:19:57,690
But two strings was as good as
an infinite number of strings.

471
00:19:57,690 --> 00:20:00,830
So you can argue that one,
possibly two spare strings

472
00:20:00,830 --> 00:20:03,650
would be a good idea if you
had about 70 disks in terms

473
00:20:03,650 --> 00:20:07,190
of improving the reliability.

474
00:20:07,190 --> 00:20:09,560
Well, not only do we try and
come up with organizations

475
00:20:09,560 --> 00:20:11,018
and thinking about
building things,

476
00:20:11,018 --> 00:20:13,340
we actually try and do
this in a research context

477
00:20:13,340 --> 00:20:15,358
that involves
multiple disciplines.

478
00:20:15,358 --> 00:20:17,900
So the project that I've been
talking about, the RAID effort,

479
00:20:17,900 --> 00:20:21,680
was led by Professor
Randy Katz and myself.

480
00:20:21,680 --> 00:20:23,450
But the context that
we did this research

481
00:20:23,450 --> 00:20:27,560
included network
operating system

482
00:20:27,560 --> 00:20:30,980
and file system work
by John Ousterhout,

483
00:20:30,980 --> 00:20:34,490
particularly his Sprite project,
a Unix-like operating system

484
00:20:34,490 --> 00:20:36,380
with many interesting features.

485
00:20:36,380 --> 00:20:39,380
At the same time also,
Mike Stonebraker,

486
00:20:39,380 --> 00:20:43,280
in the database processing--
transaction processing area

487
00:20:43,280 --> 00:20:45,870
led the Postgres effort,
his follow-on to Ingres.

488
00:20:45,870 --> 00:20:49,310
So this work was being
done in this triumvirate

489
00:20:49,310 --> 00:20:53,010
of hardware, operating system,
and database research.

490
00:20:53,010 --> 00:20:56,900
The reason I really like this
style of multidisciplinary work

491
00:20:56,900 --> 00:21:00,148
is what happens, as
shown on the next slide.

492
00:21:00,148 --> 00:21:02,690
So Garth Gibson, who's now an
assistant professor at Carnegie

493
00:21:02,690 --> 00:21:04,640
Mellon University,
Randy Katz and I

494
00:21:04,640 --> 00:21:07,170
did the papers where we talked
about the advantages of RAID,

495
00:21:07,170 --> 00:21:09,950
but it has this drawback of
these small writes being slow,

496
00:21:09,950 --> 00:21:12,630
which I talked
about earlier.

497
00:21:12,630 --> 00:21:14,810
Well, this inspired the
operating system people,

498
00:21:14,810 --> 00:21:16,850
in particular,
Mendel Rosenblum, who's

499
00:21:16,850 --> 00:21:18,860
an assistant professor
at Stanford University,

500
00:21:18,860 --> 00:21:20,420
and John Ousterhout
to start thinking

501
00:21:20,420 --> 00:21:22,670
about what would a file
system be like that

502
00:21:22,670 --> 00:21:24,530
didn't do small writes?

503
00:21:24,530 --> 00:21:27,320
They came up with
the idea of making

504
00:21:27,320 --> 00:21:32,720
the writes go as fast as
the disk could accept them.

505
00:21:32,720 --> 00:21:34,580
And so instead of
doing accesses,

506
00:21:34,580 --> 00:21:36,920
it's just writing
without moving the head.

507
00:21:36,920 --> 00:21:40,340
So it writes a stream
or a log of the updates

508
00:21:40,340 --> 00:21:41,760
rather than updating in place.

509
00:21:41,760 --> 00:21:43,520
This is the
log-structured file system

510
00:21:43,520 --> 00:21:46,070
which is a pretty
popular concept today.

511
00:21:46,070 --> 00:21:49,697
This, in turn,
inspired Margo Seltzer,

512
00:21:49,697 --> 00:21:51,530
who is now an assistant
professor at Harvard

513
00:21:51,530 --> 00:21:53,238
University, and Mike
Stonebraker to think

514
00:21:53,238 --> 00:21:55,970
about having
transactions support

515
00:21:55,970 --> 00:21:57,620
in the log-structured
file system

516
00:21:57,620 --> 00:22:00,990
and see how all that would work
to make databases run well.

517
00:22:00,990 --> 00:22:05,210
And so this symbiotic
relationship of the hardware

518
00:22:05,210 --> 00:22:08,330
advances inspiring operating
system advances and databases,

519
00:22:08,330 --> 00:22:09,890
again, makes the
hardware look good.

520
00:22:09,890 --> 00:22:12,560
So that's one of the reasons
I like these efforts.

521
00:22:12,560 --> 00:22:16,310
Also, it's really a wonderful
environment for the students

522
00:22:16,310 --> 00:22:18,860
involved because students are
educating each other about what

523
00:22:18,860 --> 00:22:20,510
the important
issues are, leaving

524
00:22:20,510 --> 00:22:23,900
the faculty out of that loop
frequently, to their benefit,

525
00:22:23,900 --> 00:22:24,530
I think--

526
00:22:24,530 --> 00:22:26,655
the benefit of the students,
maybe not the faculty.

527
00:22:26,655 --> 00:22:28,810


528
00:22:28,810 --> 00:22:31,670
So in addition to all those
things-- producing students,

529
00:22:31,670 --> 00:22:33,420
writing papers, thinking
about building,

530
00:22:33,420 --> 00:22:35,080
we actually try and
build some things.

531
00:22:35,080 --> 00:22:36,870
The first one we
called RAID the first,

532
00:22:36,870 --> 00:22:39,600
and this was just an
off-the-shelf system.

533
00:22:39,600 --> 00:22:42,180
It's been operational
for several years.

534
00:22:42,180 --> 00:22:46,060
It's off-the-shelf, using now
pretty old-fashioned technology:

535
00:22:46,060 --> 00:22:48,840
a Sun-4/280, some
interface cards,

536
00:22:48,840 --> 00:22:50,340
and really old-fashioned disks--

537
00:22:50,340 --> 00:22:55,950
5 and 1/4 inch disks, and we
had 10 gigabytes of storage.

538
00:22:55,950 --> 00:22:58,890
But this was our testing ground
for trying these RAID ideas out

539
00:22:58,890 --> 00:23:00,360
and seeing what we could learn.

540
00:23:00,360 --> 00:23:02,790
And what we decided
the follow-on would be,

541
00:23:02,790 --> 00:23:04,710
which we call RAID
the second, was

542
00:23:04,710 --> 00:23:06,570
going to try and
actually explore

543
00:23:06,570 --> 00:23:09,990
the idea of a diskless
supercomputer.

544
00:23:09,990 --> 00:23:14,370
The system is organized
for high performance

545
00:23:14,370 --> 00:23:17,820
by integrating the network
and the disk systems.

546
00:23:17,820 --> 00:23:21,120
This next slide gives you an
idea what we're talking about.

547
00:23:21,120 --> 00:23:23,280
Here's the server on the right.

548
00:23:23,280 --> 00:23:25,860
Now, the server is really--

549
00:23:25,860 --> 00:23:28,860
to my surprise, doesn't really
care that much about the data.

550
00:23:28,860 --> 00:23:30,690
Servers care about metadata.

551
00:23:30,690 --> 00:23:32,760
But the server that's
typically built today

552
00:23:32,760 --> 00:23:36,210
is really just a workstation
without a monitor on it.

553
00:23:36,210 --> 00:23:40,140
That memory system can be
a bottleneck to the disk

554
00:23:40,140 --> 00:23:40,990
and to the network.

555
00:23:40,990 --> 00:23:44,430
So what we did instead is
put memory on the I/O card

556
00:23:44,430 --> 00:23:46,200
so that we could
have a memory, and we

557
00:23:46,200 --> 00:23:47,970
could design the
number of megabytes

558
00:23:47,970 --> 00:23:49,950
per second of bandwidth
of that memory device

559
00:23:49,950 --> 00:23:52,650
to connect the disk
to the network.

560
00:23:52,650 --> 00:23:55,830
We happened to use an
Ultranet, a commercial gigabit

561
00:23:55,830 --> 00:23:58,580
per second network over
the standard HIPPI, which

562
00:23:58,580 --> 00:24:01,080
is the standard high-speed bus
to connect to these network

563
00:24:01,080 --> 00:24:01,920
cards.

564
00:24:01,920 --> 00:24:04,860
So the server dealt
with the metadata.

565
00:24:04,860 --> 00:24:07,972
These disks dealt with the
data transfer themselves,

566
00:24:07,972 --> 00:24:09,930
but the memory is the
buffer between these two,

567
00:24:09,930 --> 00:24:14,000
and then out over the network
to the rest of the community.

568
00:24:14,000 --> 00:24:15,640
This is what we call
RAID the second.

569
00:24:15,640 --> 00:24:20,290
What I'm going to do next is
show a videotape of a demo.

570
00:24:20,290 --> 00:24:22,000
And this demo in
particular is going

571
00:24:22,000 --> 00:24:23,950
to show what happens
in terms of performance

572
00:24:23,950 --> 00:24:26,920
by using graphics
when a disk is to fail

573
00:24:26,920 --> 00:24:29,170
and showing continuous
update on the fly.

574
00:24:29,170 --> 00:24:31,300
And then the actual
daredevil demo,

575
00:24:31,300 --> 00:24:35,770
where a disk is going to be
removed from the live array.

576
00:24:35,770 --> 00:24:38,650
And you'll see the data
being reconstructed.

577
00:24:38,650 --> 00:24:40,420
The person who's
giving this talk is

578
00:24:40,420 --> 00:24:42,765
Ed Lee, one of the several
graduate students working

579
00:24:42,765 --> 00:24:43,390
on the project.

580
00:24:43,390 --> 00:24:45,760
His portion of the demo
was doing the live portion

581
00:24:45,760 --> 00:24:48,340
of the reconstruction because
he did the reconstruction

582
00:24:48,340 --> 00:24:50,170
software.

583
00:24:50,170 --> 00:24:52,430
And Lee's getting
his PhD shortly

584
00:24:52,430 --> 00:24:55,323
and going to join DEC at
the Systems Research Center.

585
00:24:55,323 --> 00:24:55,990
Here's the demo.

586
00:24:55,990 --> 00:24:56,882
[VIDEO PLAYBACK]

587
00:24:56,882 --> 00:25:01,040
- So as we mentioned before,
our RAID [INAUDIBLE] server is

588
00:25:01,040 --> 00:25:04,470
based upon a RAID
level 5 disk array,

589
00:25:04,470 --> 00:25:07,327
which means that, in
the event of a disk failure,

590
00:25:07,327 --> 00:25:11,880
we can continue supplying data
uninterrupted to our clients.

591
00:25:11,880 --> 00:25:16,288
And also, we can dynamically
rebuild the contents

592
00:25:16,288 --> 00:25:18,370
of the failed disk
onto a spare disk

593
00:25:18,370 --> 00:25:21,145
without interrupting the
flow of data to our client.

594
00:25:21,145 --> 00:25:23,770
So in this demo, there will
be three primary parts.

595
00:25:23,770 --> 00:25:26,220
In the first part, we will
cause a disk to be failed.

596
00:25:26,220 --> 00:25:28,715
In the second part,
we will rebuild

597
00:25:28,715 --> 00:25:31,390
the contents of the failed
disk onto a spare disk.

598
00:25:31,390 --> 00:25:34,430
And in the third part, I will
actually physically remove

599
00:25:34,430 --> 00:25:37,030
a disk that is in--

600
00:25:37,030 --> 00:25:38,890
that is configured
into the system.

601
00:25:38,890 --> 00:25:40,630
And in all three
phases, you'll see

602
00:25:40,630 --> 00:25:45,950
that there is no interruption
of data to our clients.

603
00:25:45,950 --> 00:25:51,556
So for the first part, we go
back to our video display here.

604
00:25:51,556 --> 00:25:53,830
When we
initially fail a disk,

605
00:25:53,830 --> 00:25:58,870
you'll see that there will be a
brief pause while the RAID disk

606
00:25:58,870 --> 00:26:01,710
array reconfigures itself,
[INAUDIBLE] the pause there,

607
00:26:01,710 --> 00:26:06,060
but now is continuing to display
the frames at approximately

608
00:26:06,060 --> 00:26:08,190
its normal rate.

609
00:26:08,190 --> 00:26:12,790
If we go over to the
performance monitor,

610
00:26:12,790 --> 00:26:16,000
you see that at the point
at which the disk failed,

611
00:26:16,000 --> 00:26:19,300
there is a slight spike
in CPU utilization caused

612
00:26:19,300 --> 00:26:22,990
by the extra overhead of
rearranging certain components

613
00:26:22,990 --> 00:26:24,700
of the RAID driver.

614
00:26:24,700 --> 00:26:28,030
And afterwards, you see that the
utilization is slightly higher

615
00:26:28,030 --> 00:26:32,630
due to the extra work
required to reconstruct

616
00:26:32,630 --> 00:26:35,420
the contents of the
failed disk as it

617
00:26:35,420 --> 00:26:38,010
is requested by the client.

618
00:26:38,010 --> 00:26:42,040
Now we'll start the
rebuild process.

619
00:26:42,040 --> 00:26:43,730
And now the contents
of the failed disk

620
00:26:43,730 --> 00:26:47,210
are being rebuilt
onto a [INAUDIBLE].

621
00:26:47,210 --> 00:26:49,070
And you see that
the CPU utilization

622
00:26:49,070 --> 00:26:51,110
has increased significantly.

623
00:26:51,110 --> 00:26:52,640
And this is due to--

624
00:26:52,640 --> 00:26:53,990
by approximately 20%.

625
00:26:53,990 --> 00:26:57,680
And this is due to all the
additional I/Os required

626
00:26:57,680 --> 00:27:01,820
to rebuild the contents
of that failed disk.

627
00:27:01,820 --> 00:27:06,070
We note that the numbers of
disk I/Os, reads and writes,

628
00:27:06,070 --> 00:27:07,370
have not changed significantly.

629
00:27:07,370 --> 00:27:11,420
And this is because this
performance monitor measures

630
00:27:11,420 --> 00:27:14,210
logical I/O that is performed
through the disk array rather

631
00:27:14,210 --> 00:27:17,400
than the physical disk activity.

632
00:27:17,400 --> 00:27:19,510
So now the rebuild
phase is in [INAUDIBLE].

633
00:27:19,510 --> 00:27:21,700
And you see that
the CPU utilization

634
00:27:21,700 --> 00:27:25,090
is returning to its
approximately normal rate.

635
00:27:25,090 --> 00:27:28,960
And during that entire
period, the supply

636
00:27:28,960 --> 00:27:31,481
of data for our clients
was not interrupted.

637
00:27:31,481 --> 00:27:35,540
And the clients continued
to display the images

638
00:27:35,540 --> 00:27:40,694
at approximately
their former rates.

639
00:27:40,694 --> 00:27:44,740
OK, so for the final more
dramatic portion of our demo,

640
00:27:44,740 --> 00:27:46,565
I will actually
physically remove

641
00:27:46,565 --> 00:27:48,850
a disk which is
currently configured

642
00:27:48,850 --> 00:27:51,870
and running in the RAID server.

643
00:27:51,870 --> 00:27:55,825
And you'll see that
even in that case,

644
00:27:55,825 --> 00:28:00,885
RAID-II can provide the
necessary data to its clients.

645
00:28:00,885 --> 00:28:03,360
Here, I am about
to remove the disk.

646
00:28:03,360 --> 00:28:08,310


647
00:28:08,310 --> 00:28:11,016
Now I actually physically
remove the disk.

648
00:28:11,016 --> 00:28:14,960
As we go back to
our monitor, you

649
00:28:14,960 --> 00:28:18,920
see that the client is still
getting its necessary stream

650
00:28:18,920 --> 00:28:22,130
of data, and it's still
running at approximately

651
00:28:22,130 --> 00:28:23,410
its original frame rate.

652
00:28:23,410 --> 00:28:27,350


653
00:28:27,350 --> 00:28:30,620
This shows that the dynamic
reconstruction of data

654
00:28:30,620 --> 00:28:32,580
is working properly.

655
00:28:32,580 --> 00:28:36,010


656
00:28:36,010 --> 00:28:38,934
So that concludes the
RAID demonstration.

657
00:28:38,934 --> 00:28:40,810
We'd like for you
to keep in mind

658
00:28:40,810 --> 00:28:44,950
that many of the bottlenecks
that we've illustrated today

659
00:28:44,950 --> 00:28:47,470
have been due to
either limitations

660
00:28:47,470 --> 00:28:50,230
on the client or
the disk controller.

661
00:28:50,230 --> 00:28:53,420
So as we appropriate faster
clients and disk controllers,

662
00:28:53,420 --> 00:28:56,720
we would expect performance
to scale appropriately.

663
00:28:56,720 --> 00:28:58,687
So thank you for watching.

664
00:28:58,687 --> 00:28:59,270
[END PLAYBACK]

665
00:28:59,270 --> 00:29:00,860
So one of the
questions was, well,

666
00:29:00,860 --> 00:29:03,380
just how well do all these
redundancy schemes work?

667
00:29:03,380 --> 00:29:04,700
Remember, we're going
to take this weakness

668
00:29:04,700 --> 00:29:05,900
and turn it into strength.

669
00:29:05,900 --> 00:29:07,580
How big a strength is it?

670
00:29:07,580 --> 00:29:11,270
And you can see that
on this next slide.

671
00:29:11,270 --> 00:29:13,520
Everything's the same
as the earlier slide.

672
00:29:13,520 --> 00:29:16,220
We're going to have a
few more disks there

673
00:29:16,220 --> 00:29:18,920
to provide the reliability, so
the cost is a little higher,

674
00:29:18,920 --> 00:29:21,180
but we can say the
cost is about the same.

675
00:29:21,180 --> 00:29:23,570
But the big change is the
mean time to data loss.

676
00:29:23,570 --> 00:29:26,000
By having this many
extra disks, we

677
00:29:26,000 --> 00:29:28,370
can take this
weakness, remember,

678
00:29:28,370 --> 00:29:31,550
that was down to about a month
of reliability, and instead,

679
00:29:31,550 --> 00:29:33,800
turn it into a
tremendous strength--

680
00:29:33,800 --> 00:29:36,373
a ridiculous
calculation of ours.

681
00:29:36,373 --> 00:29:37,790
So what we wanted
to do is make it

682
00:29:37,790 --> 00:29:40,490
as reliable or more reliable
than a mainframe disk,

683
00:29:40,490 --> 00:29:43,280
and we can do that.

684
00:29:43,280 --> 00:29:45,050
In addition to
doing these ideas,

685
00:29:45,050 --> 00:29:47,845
let's talk about
the future of this.

686
00:29:47,845 --> 00:29:50,220
In a few years, we're going
to have these one-inch disks.

687
00:29:50,220 --> 00:29:53,510
And you can imagine on a single
notebook-sized computer having

688
00:29:53,510 --> 00:29:56,300
32 one-inch-diameter
disks providing you,

689
00:29:56,300 --> 00:29:58,790
say, 8 gigabytes of storage,
more than 100 megabits

690
00:29:58,790 --> 00:30:04,010
per second transfer rate, and
very, very high reliability.

691
00:30:04,010 --> 00:30:06,710
In addition to producing
students, writing papers,

692
00:30:06,710 --> 00:30:09,500
giving talks, making
videotapes, we also

693
00:30:09,500 --> 00:30:12,500
like to transfer the technology
to companies to use it.

694
00:30:12,500 --> 00:30:15,470
There is even today
amazingly a RAID newsletter

695
00:30:15,470 --> 00:30:17,810
to keep people up to date
on what's happening in RAID.

696
00:30:17,810 --> 00:30:21,320
And I think there is-- it
says 100, but maybe even 200

697
00:30:21,320 --> 00:30:24,620
companies involved in RAID
systems or selling components.

698
00:30:24,620 --> 00:30:27,500
Several of the major companies
you'd recognize there

699
00:30:27,500 --> 00:30:30,740
are selling them right
now, and other companies

700
00:30:30,740 --> 00:30:33,020
are developing RAID systems.

701
00:30:33,020 --> 00:30:34,880
At the bottom, I
mention the IBM story.

702
00:30:34,880 --> 00:30:37,340
One of the companies there,
Storage Technology,

703
00:30:37,340 --> 00:30:40,408
decided to talk about their
product, a very complicated

704
00:30:40,408 --> 00:30:42,200
product, that not only
does the RAID ideas,

705
00:30:42,200 --> 00:30:46,250
but does compression and other
things on the fly, which--

706
00:30:46,250 --> 00:30:49,070
and their code name for it
was the Iceberg Project.

707
00:30:49,070 --> 00:30:52,010
Well, they went
around and gave 700

708
00:30:52,010 --> 00:30:55,550
non-disclosure
presentations on Iceberg.

709
00:30:55,550 --> 00:30:59,270
So those of you who can
imagine with human nature

710
00:30:59,270 --> 00:31:03,830
realize that 700
nondisclosures is an oxymoron.

711
00:31:03,830 --> 00:31:06,105
This is really obviously--
well, not obviously,

712
00:31:06,105 --> 00:31:08,480
unfortunately, even though
they swore they wouldn't, they

713
00:31:08,480 --> 00:31:09,830
started talking about them.

714
00:31:09,830 --> 00:31:13,370
So IBM salesmen heard
a lot about RAID.

715
00:31:13,370 --> 00:31:16,400
And so I was told
shortly after that

716
00:31:16,400 --> 00:31:18,320
that IBM salesmen were
going around saying

717
00:31:18,320 --> 00:31:20,150
RAID was a bad idea.

718
00:31:20,150 --> 00:31:22,550
So I knew this was
a really good sign.

719
00:31:22,550 --> 00:31:25,925
If IBM salesmen are marketing
against a research idea,

720
00:31:25,925 --> 00:31:27,050
something must be going on.

721
00:31:27,050 --> 00:31:29,222
And in fact, today,
as I said, you

722
00:31:29,222 --> 00:31:30,680
can read about it
in Byte magazine.

723
00:31:30,680 --> 00:31:34,130
And IBM, their president
of their disk company

724
00:31:34,130 --> 00:31:36,710
has said that RAID is
just like a fine wine.

725
00:31:36,710 --> 00:31:38,660
They won't deliver
it before its time.

726
00:31:38,660 --> 00:31:41,180
And when RAID's ready to
go, IBM will be there.

727
00:31:41,180 --> 00:31:44,150
And that's the way they wanted
to go with their mainframe.

728
00:31:44,150 --> 00:31:46,610
They already are
developing and offering

729
00:31:46,610 --> 00:31:50,370
for sale RAID systems
in other markets,

730
00:31:50,370 --> 00:31:52,490
particularly their
supercomputer kind of markets.

731
00:31:52,490 --> 00:31:56,510
But eventually, the expectations
are that IBM will switch over

732
00:31:56,510 --> 00:32:00,487
to RAID for their big systems.

733
00:32:00,487 --> 00:32:02,070
So that's kind of
the end of the RAID

734
00:32:02,070 --> 00:32:03,420
part of this presentation.

735
00:32:03,420 --> 00:32:04,950
And this is--

736
00:32:04,950 --> 00:32:06,778
I've used past tense
a lot because this

737
00:32:06,778 --> 00:32:08,070
is all done and all wrapped up.

738
00:32:08,070 --> 00:32:09,990
Just about everybody is
going to be graduating soon

739
00:32:09,990 --> 00:32:11,970
from Berkeley, and people
are using the ideas.

740
00:32:11,970 --> 00:32:13,410
What I'm going to
talk about next

741
00:32:13,410 --> 00:32:15,240
is stuff that we're
working on now

742
00:32:15,240 --> 00:32:19,830
and intend to work on in the future,
and it's going to be more futuristic.

743
00:32:19,830 --> 00:32:21,495
There's three
pieces of this part.

744
00:32:21,495 --> 00:32:23,092
And we liked working
on I/O so much

745
00:32:23,092 --> 00:32:24,550
we decided we'd
keep working on it.

746
00:32:24,550 --> 00:32:27,180
In fact, if it was good to work
on secondary storage of disks,

747
00:32:27,180 --> 00:32:29,790
well, tertiary storage
must be wonderful.

748
00:32:29,790 --> 00:32:31,650
And that's what we're
working on today.

749
00:32:31,650 --> 00:32:33,540
And if you can think
as the small disk

750
00:32:33,540 --> 00:32:37,635
is what inspired us to
work on the disks today--

751
00:32:37,635 --> 00:32:40,655


752
00:32:40,655 --> 00:32:41,780
remember these small disks.

753
00:32:41,780 --> 00:32:42,950
What can we do
with these things?

754
00:32:42,950 --> 00:32:44,867
I think what's inspiring
this research project

755
00:32:44,867 --> 00:32:47,350
today are these tapes.

756
00:32:47,350 --> 00:32:51,370
This is a tape that's right
out of your camcorder,

757
00:32:51,370 --> 00:32:53,920
and this tape can
hold, amazingly

758
00:32:53,920 --> 00:32:56,170
enough, 5 gigabytes
of information.

759
00:32:56,170 --> 00:32:58,990
Putting that in perspective,
that's like 5,000 books.

760
00:32:58,990 --> 00:33:00,490
That's how much can
be in this tape.

761
00:33:00,490 --> 00:33:02,430
This tape costs $10 to $20.

762
00:33:02,430 --> 00:33:03,880
I can hold it in my hand.

763
00:33:03,880 --> 00:33:05,680
So the question that
we ask ourselves, gee,

764
00:33:05,680 --> 00:33:09,670
with a tape like this, how
could a computer system designer

765
00:33:09,670 --> 00:33:11,260
use that and put it to use?

766
00:33:11,260 --> 00:33:12,640
And that's, I
think, this tape is

767
00:33:12,640 --> 00:33:15,430
what's inspiring the next
part of our research effort.

768
00:33:15,430 --> 00:33:17,480
So there's three parts
of this technology.

769
00:33:17,480 --> 00:33:19,840
It's these helical
scan tapes, which

770
00:33:19,840 --> 00:33:22,900
I'll talk more about what
helical scan means shortly;

771
00:33:22,900 --> 00:33:25,150
tape robots that can
hold lots of these tapes

772
00:33:25,150 --> 00:33:26,767
and make them
automatically available;

773
00:33:26,767 --> 00:33:27,850
and then data compression.

774
00:33:27,850 --> 00:33:30,528
And that's-- I'll be spending
the rest of this tape talking

775
00:33:30,528 --> 00:33:33,070
about these technologies, then
talking about the applications

776
00:33:33,070 --> 00:33:33,987
of these technologies.

777
00:33:33,987 --> 00:33:36,430
But try and remember,
you can't see my arms

778
00:33:36,430 --> 00:33:37,960
waving the whole
time, but my arms

779
00:33:37,960 --> 00:33:38,940
are waving the whole
time I'm talking.

780
00:33:38,940 --> 00:33:41,440
Because this is what we're going
to do rather than something

781
00:33:41,440 --> 00:33:43,940
we've already done.

782
00:33:43,940 --> 00:33:46,310
So what's all this about tape?

783
00:33:46,310 --> 00:33:48,295
I've been chairman
at Berkeley for a while,

784
00:33:48,295 --> 00:33:49,670
and I go into some
rooms and I'll

785
00:33:49,670 --> 00:33:52,592
see these rooms full
of magnetic tapes.

786
00:33:52,592 --> 00:33:54,050
So pretty much what
I've determined

787
00:33:54,050 --> 00:33:56,900
is tapes so far have
been written once

788
00:33:56,900 --> 00:34:00,080
and read zero or one time.

789
00:34:00,080 --> 00:34:02,780
People do these backups
of tremendous amounts

790
00:34:02,780 --> 00:34:05,095
of information, and I'll try
and reclaim storage space.

791
00:34:05,095 --> 00:34:05,720
And oh, no, no.

792
00:34:05,720 --> 00:34:08,100
You can't get rid of
that, that 1985 tape.

793
00:34:08,100 --> 00:34:10,022
Somebody may ask for
that file some day.

794
00:34:10,022 --> 00:34:11,480
Now, I don't think
very many people

795
00:34:11,480 --> 00:34:13,909
ask for 1985 tapes
in this day and age,

796
00:34:13,909 --> 00:34:15,949
but we've got these
rooms full of these tapes.

797
00:34:15,949 --> 00:34:18,620
And they also get used
for distributing software.

798
00:34:18,620 --> 00:34:20,989
What I'm talking about is
using tapes very differently,

799
00:34:20,989 --> 00:34:23,909
using them like you use disks,
reading them and writing them,

800
00:34:23,909 --> 00:34:27,139
making them automatically
available for people to use.

801
00:34:27,139 --> 00:34:30,409
And that's what the
new opportunity is.

802
00:34:30,409 --> 00:34:32,090
So there's a fundamental
relationship

803
00:34:32,090 --> 00:34:33,679
between tapes and disks.

804
00:34:33,679 --> 00:34:36,679
And longitudinal tapes actually
use the same technology

805
00:34:36,679 --> 00:34:37,472
as hard disks.

806
00:34:37,472 --> 00:34:39,639
And they're going to track
each other's improvement.

807
00:34:39,639 --> 00:34:41,969
So the media itself
is pretty similar.

808
00:34:41,969 --> 00:34:44,389
The differences are
inherent in the geometries.

809
00:34:44,389 --> 00:34:47,000
Disks-- remember, I showed
you before-- have platters

810
00:34:47,000 --> 00:34:50,690
rapidly rotating, with
these gaps for these arms

811
00:34:50,690 --> 00:34:54,020
to move in and out, which gives
them the random access.

812
00:34:54,020 --> 00:34:57,770
Because these arms are flying
so close to all these platters,

813
00:34:57,770 --> 00:35:00,380
they have to seal it so
the drive and the media

814
00:35:00,380 --> 00:35:01,760
are a single unit.

815
00:35:01,760 --> 00:35:05,880
In contrast, tapes are this
magnetic information spread

816
00:35:05,880 --> 00:35:08,420
on these removable strips
that are on a spool.

817
00:35:08,420 --> 00:35:11,130
There's sequential
access, no random access.

818
00:35:11,130 --> 00:35:13,380
But because of the nature
of the way the readers work,

819
00:35:13,380 --> 00:35:16,730
you can insert and
remove these tapes

820
00:35:16,730 --> 00:35:19,260
so you can have many
tapes to a reader.

821
00:35:19,260 --> 00:35:21,180
And so it's got these
fundamental advantages

822
00:35:21,180 --> 00:35:23,930
of random access
versus sequential access,

823
00:35:23,930 --> 00:35:26,480
and multiple media per reader
versus one media per reader.

824
00:35:26,480 --> 00:35:28,610
So I expect 20 years
from now, we'll

825
00:35:28,610 --> 00:35:32,530
still have these kinds of
relationships between these two.

826
00:35:32,530 --> 00:35:35,030
But there's this new technology,
which I showed you earlier,

827
00:35:35,030 --> 00:35:36,080
called helical scan.

828
00:35:36,080 --> 00:35:38,480
So what helical scan
does is different

829
00:35:38,480 --> 00:35:39,680
than the longitudinal tapes.

830
00:35:39,680 --> 00:35:44,730
Longitudinal tapes, the information
is across the tape as it moves.

831
00:35:44,730 --> 00:35:47,000
So helical scan is
the tape is spinning,

832
00:35:47,000 --> 00:35:49,170
it has the head at
an angle to the tape,

833
00:35:49,170 --> 00:35:51,000
and it's spinning
very, very rapidly,

834
00:35:51,000 --> 00:35:54,380
so as the tape moves by, it
can record lots of information.

835
00:35:54,380 --> 00:35:56,840
So you can get factors
of almost 100 increase

836
00:35:56,840 --> 00:35:59,630
in the density of the tapes
with this new helical scan

837
00:35:59,630 --> 00:36:01,100
technology.

838
00:36:01,100 --> 00:36:02,880
Now, as you can
see on the slide,

839
00:36:02,880 --> 00:36:05,720
this is not some exotic
technology that's unavailable.

840
00:36:05,720 --> 00:36:07,280
This is pretty
standard technology

841
00:36:07,280 --> 00:36:10,640
that's being used in every
VCR, in every camcorder.

842
00:36:10,640 --> 00:36:13,370
And if you happen to have a
digital audio tape stereo,

843
00:36:13,370 --> 00:36:15,600
it's being used in that
technology as well.

844
00:36:15,600 --> 00:36:17,690
Now, you would
think that there's,

845
00:36:17,690 --> 00:36:19,250
given the sequential
nature, it would

846
00:36:19,250 --> 00:36:21,640
be very slow, which is true.

847
00:36:21,640 --> 00:36:23,640
It is pretty slow,
especially compared to disks.

848
00:36:23,640 --> 00:36:26,090
But because they keep
a longitudinal track,

849
00:36:26,090 --> 00:36:27,770
they have a fast search mode.

850
00:36:27,770 --> 00:36:30,710
So actually, here's
three different--

851
00:36:30,710 --> 00:36:32,090
these three
different columns are

852
00:36:32,090 --> 00:36:34,640
examples of different kinds
of helical scan technology.

853
00:36:34,640 --> 00:36:36,650
Random searches will
take tens of seconds,

854
00:36:36,650 --> 00:36:38,192
where you might
think they might take

855
00:36:38,192 --> 00:36:40,610
several minutes because they
have fast search mode here.

856
00:36:40,610 --> 00:36:43,610
This shows the density of these
tapes-- this one at about--

857
00:36:43,610 --> 00:36:46,770
this is in megabytes, so this
is almost 5 gigabytes here.

858
00:36:46,770 --> 00:36:49,350
Here's a couple of other
ones that are going on.

859
00:36:49,350 --> 00:36:52,100
And this compares it to magnetic
disks and conventional tape.

860
00:36:52,100 --> 00:36:53,933
And you can see comparing
conventional tape,

861
00:36:53,933 --> 00:36:59,710
a dramatic difference in the
density of these technologies.

862
00:36:59,710 --> 00:37:02,500
Now usually when I give
a talk, and there's

863
00:37:02,500 --> 00:37:04,960
people who can talk back, they--
somebody raises their hand

864
00:37:04,960 --> 00:37:07,330
and asks, what
about optical disk?

865
00:37:07,330 --> 00:37:10,630
Optical disk is a very
interesting technology,

866
00:37:10,630 --> 00:37:13,300
especially if you're going
to make one copy of things

867
00:37:13,300 --> 00:37:17,320
and want to stamp it out
for [INAUDIBLE] many people

868
00:37:17,320 --> 00:37:18,910
dramatically inexpensively.

869
00:37:18,910 --> 00:37:21,280
For a couple of dollars,
1 gigabyte of information

870
00:37:21,280 --> 00:37:25,280
can be stamped out repeatedly
and used at many places.

871
00:37:25,280 --> 00:37:30,610
But if we compare the media
cost to these 8 millimeter

872
00:37:30,610 --> 00:37:32,110
tapes, the helical
scan tapes, you

873
00:37:32,110 --> 00:37:34,360
see there's a very large
difference, so that basically

874
00:37:34,360 --> 00:37:36,460
there's almost a factor
of 100 difference in terms

875
00:37:36,460 --> 00:37:39,370
of the media in terms
of dollars per megabyte

876
00:37:39,370 --> 00:37:41,770
of helical scan versus
the optical disk.

877
00:37:41,770 --> 00:37:44,080
Optical disks are moving
more or less at the rate

878
00:37:44,080 --> 00:37:46,570
the standards committees
can agree on standards,

879
00:37:46,570 --> 00:37:48,220
while the helical
scan tapes tend

880
00:37:48,220 --> 00:37:50,230
to be pushing the technology.

881
00:37:50,230 --> 00:37:52,300
Now, there's some-- so
it's got that advantage,

882
00:37:52,300 --> 00:37:53,675
the helical scan
tape, and that's

883
00:37:53,675 --> 00:37:56,020
what's, again, waving our
hands is where we are today

884
00:37:56,020 --> 00:37:58,475
at Berkeley and pushing
the helical scan tapes.

885
00:37:58,475 --> 00:38:00,100
But we've learned a
few things that are

886
00:38:00,100 --> 00:38:01,450
disadvantages of these tapes.

887
00:38:01,450 --> 00:38:03,010
First of all, they wear out.

888
00:38:03,010 --> 00:38:07,780
Helical scan devices, because
they move very rapidly

889
00:38:07,780 --> 00:38:10,930
over that tape, the tape
can wear out more quickly.

890
00:38:10,930 --> 00:38:14,590
Even longitudinal tapes can
only have thousands of passes.

891
00:38:14,590 --> 00:38:16,550
Moreover, the heads
wear out as well.

892
00:38:16,550 --> 00:38:21,590
So that's some disadvantages
of that technology.

893
00:38:21,590 --> 00:38:24,325
So your economic model has
to factor in the advantage

894
00:38:24,325 --> 00:38:25,390
of helical scan tapes.

895
00:38:25,390 --> 00:38:27,882
You have to subtract out the
fact that the tapes wear out

896
00:38:27,882 --> 00:38:28,840
and the heads wear out.

897
00:38:28,840 --> 00:38:31,690
But there's still, when you
have a factor of 100 advantage,

898
00:38:31,690 --> 00:38:33,310
that's a serious advantage.

899
00:38:33,310 --> 00:38:36,740
Also, right now, if you looked
at the tapes the way they work,

900
00:38:36,740 --> 00:38:40,510
there's this very long rewind,
eject, load, and spin-up times.

901
00:38:40,510 --> 00:38:42,370
But I believe
that's not inherent.

902
00:38:42,370 --> 00:38:44,770
It's just there's no
market yet for somebody

903
00:38:44,770 --> 00:38:46,990
using these tapes in
this unconventional way,

904
00:38:46,990 --> 00:38:48,340
as if they were disks.

905
00:38:48,340 --> 00:38:51,070
If that was important, I believe
engineers could design them

906
00:38:51,070 --> 00:38:57,400
that they wouldn't be so slow in
terms of loading and rewinding.

907
00:38:57,400 --> 00:38:59,380
Now, that's the
tapes themselves.

908
00:38:59,380 --> 00:39:01,810
What about the tape
robots, the second piece

909
00:39:01,810 --> 00:39:03,610
of this technology?

910
00:39:03,610 --> 00:39:06,640
If you were able to
see this setup that we

911
00:39:06,640 --> 00:39:09,190
have in this room, we
could fit on the same stage

912
00:39:09,190 --> 00:39:13,750
where I'm talking a 10 foot by 8
foot monster called the Storage

913
00:39:13,750 --> 00:39:15,430
Technologies 4400.

914
00:39:15,430 --> 00:39:17,527
This robot, which isn't
that big, and at a half

915
00:39:17,527 --> 00:39:19,360
million dollars isn't
really that expensive,

916
00:39:19,360 --> 00:39:22,420
could hold 6,000 tapes.

917
00:39:22,420 --> 00:39:25,600
In 1992, that would
be about 5 terabytes.

918
00:39:25,600 --> 00:39:28,330
Next year they're going
to make the transition

919
00:39:28,330 --> 00:39:31,060
from longitudinal
to helical scan,

920
00:39:31,060 --> 00:39:36,205
so that same robot could
hold 120 terabytes.

921
00:39:36,205 --> 00:39:38,080
When we start talking
about numbers like this

922
00:39:38,080 --> 00:39:40,420
it's hard to keep track of
what we're talking about.

923
00:39:40,420 --> 00:39:42,850
It's like talking about
the national debt.

924
00:39:42,850 --> 00:39:46,930
A trillion dollars-- how
much is a trillion dollars?

925
00:39:46,930 --> 00:39:49,070
An extra trillion
dollars-- is that bad?

926
00:39:49,070 --> 00:39:50,270
What does it mean?

927
00:39:50,270 --> 00:39:53,980
Well, terabytes-- 120
terabytes is a fantastic amount

928
00:39:53,980 --> 00:39:55,090
of information.

929
00:39:55,090 --> 00:39:58,150
If we were to go to
the Library of Congress

930
00:39:58,150 --> 00:40:01,900
and see their attempt to
capture the sum of humankind's

931
00:40:01,900 --> 00:40:04,660
knowledge, to give you an idea
of how aggressive they are

932
00:40:04,660 --> 00:40:07,260
in getting information,
2/3 of their holdings

933
00:40:07,260 --> 00:40:08,260
are not even in English.

934
00:40:08,260 --> 00:40:10,480
They're trying to get
everything that's been printed

935
00:40:10,480 --> 00:40:12,670
and keep it at the
Library of Congress.

936
00:40:12,670 --> 00:40:15,700
If we could magically
transform all the texts and all

937
00:40:15,700 --> 00:40:18,340
of those books and put
it onto a computer,

938
00:40:18,340 --> 00:40:21,460
it's been estimated
that's about 20 terabytes.

939
00:40:21,460 --> 00:40:24,190
So with 120 terabytes,
we could have six copies

940
00:40:24,190 --> 00:40:27,970
of all of humankind's
knowledge right up here

941
00:40:27,970 --> 00:40:31,130
on the stage next to me
for about a half a million.

942
00:40:31,130 --> 00:40:35,960
So this is an extraordinary
amount of information.

943
00:40:35,960 --> 00:40:38,740
So the tape robots which
are available right now

944
00:40:38,740 --> 00:40:43,075
give us this automatic access,
still with a long access time.

945
00:40:43,075 --> 00:40:44,950
Now, those of you who've
been in the industry

946
00:40:44,950 --> 00:40:47,140
for a while would
seem-- even this is--

947
00:40:47,140 --> 00:40:50,680
seems like this is infinite
amount of information.

948
00:40:50,680 --> 00:40:54,110
But you know when dealing with
programmers that they always

949
00:40:54,110 --> 00:40:54,610
complain.

950
00:40:54,610 --> 00:40:56,150
So this can't be enough.

951
00:40:56,150 --> 00:40:57,820
So we have to get
even more than this,

952
00:40:57,820 --> 00:41:02,380
and that leads to the next
topic of data compression.

953
00:41:02,380 --> 00:41:06,830
So compression has some simple
terms, easy to figure out.

954
00:41:06,830 --> 00:41:09,340
The first one is a style of
compression called lossless,

955
00:41:09,340 --> 00:41:11,680
and that gives you
typically factors of 2 or 3.

956
00:41:11,680 --> 00:41:14,200
And your contract is
you won't lose any bits,

957
00:41:14,200 --> 00:41:16,720
but it will take one half
to one third the space.

958
00:41:16,720 --> 00:41:19,750
And text is typically
done with compression.

959
00:41:19,750 --> 00:41:22,542
The second category
called lossy allows

960
00:41:22,542 --> 00:41:24,250
you to lose some bits
as long as it still

961
00:41:24,250 --> 00:41:25,420
looks good to the visual--

962
00:41:25,420 --> 00:41:27,760
to the human eye,
visually interesting.

963
00:41:27,760 --> 00:41:28,900
So these are images.

964
00:41:28,900 --> 00:41:30,940
And sometimes they
get factors of 20,

965
00:41:30,940 --> 00:41:33,650
sometimes less, sometimes
much more than that.

966
00:41:33,650 --> 00:41:35,680
So the question as
computer systems designers

967
00:41:35,680 --> 00:41:37,840
with this new
technology is, where

968
00:41:37,840 --> 00:41:42,040
is compression going to be used,
and where should it be used?

969
00:41:42,040 --> 00:41:44,910
So this complicated
line drawing here,

970
00:41:44,910 --> 00:41:48,510
let me go show you
how that works.

971
00:41:48,510 --> 00:41:49,650
Here's the tape itself.

972
00:41:49,650 --> 00:41:50,820
Here is its controller.

973
00:41:50,820 --> 00:41:53,340
It's connected over
a cable, a SCSI cable

974
00:41:53,340 --> 00:41:55,200
to the SCSI host bus adapter.

975
00:41:55,200 --> 00:41:58,080
SCSI is one of those cables
that connect peripherals

976
00:41:58,080 --> 00:42:01,350
to computers-- through a file
server over a long haul

977
00:42:01,350 --> 00:42:04,260
network, to another file server
over a local area network,

978
00:42:04,260 --> 00:42:07,530
to the processor and memory,
and it shows up on the screen.

979
00:42:07,530 --> 00:42:09,550
So where is it being used today?

980
00:42:09,550 --> 00:42:11,850
Well, the compression
is being used right now

981
00:42:11,850 --> 00:42:12,810
in the controller.

982
00:42:12,810 --> 00:42:14,700
And in fact, you have
to be pretty careful

983
00:42:14,700 --> 00:42:18,120
when you buy either a
tape controller or a disk

984
00:42:18,120 --> 00:42:20,520
to find out whether
or not when they

985
00:42:20,520 --> 00:42:23,243
argue the capacity of
this tape or the disk,

986
00:42:23,243 --> 00:42:24,910
whether they're
counting on compression.

987
00:42:24,910 --> 00:42:27,150
So the same tape may
hold 5 gigabytes,

988
00:42:27,150 --> 00:42:31,080
but it may be advertised as a 10
gigabyte or 15 gigabyte tape,

989
00:42:31,080 --> 00:42:33,750
because they're counting on
getting this factor 2 or 3

990
00:42:33,750 --> 00:42:35,010
with lossless compression.

991
00:42:35,010 --> 00:42:36,120
So you have to--

992
00:42:36,120 --> 00:42:39,480
user beware here, buyer beware.

993
00:42:39,480 --> 00:42:40,980
From a computer
systems perspective,

994
00:42:40,980 --> 00:42:42,570
what do we think of that?

995
00:42:42,570 --> 00:42:45,270
Well, what we think is
that's the stupidest place

996
00:42:45,270 --> 00:42:46,470
you could possibly do it.

997
00:42:46,470 --> 00:42:50,850
Because after you get that
data and you decompress it,

998
00:42:50,850 --> 00:42:54,330
then you have to ship that
larger piece of information

999
00:42:54,330 --> 00:42:57,090
by factors of 2 or 3 over
each one of these cables,

1000
00:42:57,090 --> 00:43:00,160
all the way until just
before it gets to the screen.

1001
00:43:00,160 --> 00:43:02,328
Therefore, you get the
compression advantage

1002
00:43:02,328 --> 00:43:04,120
of the tape, but you
don't get any benefit.

1003
00:43:04,120 --> 00:43:06,510
And all of these wires,
you have to pay extra,

1004
00:43:06,510 --> 00:43:08,552
or it takes longer to
transfer them to get there.

1005
00:43:08,552 --> 00:43:10,770
What we really want is
just-in-time decompression.

1006
00:43:10,770 --> 00:43:13,420
What we want is keep it
in the compressed form,

1007
00:43:13,420 --> 00:43:15,060
send it all the way
around, and just

1008
00:43:15,060 --> 00:43:17,790
before it pops up in the
screen it gets depressed--

1009
00:43:17,790 --> 00:43:18,885
decompressed.

1010
00:43:18,885 --> 00:43:22,637
It may get depressed, too, but,
it should get decompressed.

1011
00:43:22,637 --> 00:43:23,970
Now, there's advantages of that.

1012
00:43:23,970 --> 00:43:27,060
If we're up here, we know
what kind of data it is.

1013
00:43:27,060 --> 00:43:30,720
And if it is an image, we
can use image compression

1014
00:43:30,720 --> 00:43:32,790
and get even greater
than factors of 2 or 3.

1015
00:43:32,790 --> 00:43:34,470
So this is the
right place to do it

1016
00:43:34,470 --> 00:43:37,720
for a couple of
different reasons.

1017
00:43:37,720 --> 00:43:39,420
Now, when we're
talking about hundreds

1018
00:43:39,420 --> 00:43:40,920
of terabytes, what
application would

1019
00:43:40,920 --> 00:43:43,050
we use at Berkeley to
be able to drive this?

1020
00:43:43,050 --> 00:43:44,790
What we chose to
use is this project

1021
00:43:44,790 --> 00:43:49,560
called Sequoia 2000, which
is a global change research

1022
00:43:49,560 --> 00:43:52,110
effort involving Earth system
scientists and computer

1023
00:43:52,110 --> 00:43:53,070
scientists.

1024
00:43:53,070 --> 00:43:55,620
These people are trying to
worry about the problems facing

1025
00:43:55,620 --> 00:43:56,520
our planet.

1026
00:43:56,520 --> 00:43:59,183
They're worrying
about CO2 content.

1027
00:43:59,183 --> 00:44:01,350
They're worrying about the
melting of the snow caps.

1028
00:44:01,350 --> 00:44:03,030
They're worrying
about the ozone holes.

1029
00:44:03,030 --> 00:44:04,447
These are the
researchers that are

1030
00:44:04,447 --> 00:44:06,840
dealing with it, either
with simulating data

1031
00:44:06,840 --> 00:44:09,430
or sensing data from space.

1032
00:44:09,430 --> 00:44:12,420
The project at Berkeley
involves computer scientists

1033
00:44:12,420 --> 00:44:17,220
at several UC campuses, Earth
system scientists at several UC

1034
00:44:17,220 --> 00:44:19,830
campuses, and even some
people from the real world

1035
00:44:19,830 --> 00:44:26,160
who are trying to use this data
to do public policy decisions.

1036
00:44:26,160 --> 00:44:31,110
These global change researchers
are drowning in data.

1037
00:44:31,110 --> 00:44:34,050
They get all this
data from remote sensing

1038
00:44:34,050 --> 00:44:36,540
from satellites in space that
they need to deal with today.

1039
00:44:36,540 --> 00:44:39,750
Modelers will typically create a
tenth of a terabyte for a year.

1040
00:44:39,750 --> 00:44:41,280
In just a few
years, they're going

1041
00:44:41,280 --> 00:44:45,660
to put up a series of satellites
in space, that once they're

1042
00:44:45,660 --> 00:44:49,960
in place, they're going to
broadcast 2 terabytes per day,

1043
00:44:49,960 --> 00:44:51,607
these bits raining
down from space,

1044
00:44:51,607 --> 00:44:53,190
and they're going
to keep broadcasting

1045
00:44:53,190 --> 00:44:56,320
that information for 15 years.

1046
00:44:56,320 --> 00:44:59,633
And what they want to do is
capture all that information,

1047
00:44:59,633 --> 00:45:01,800
and so that it's digitally
recorded so that they can

1048
00:45:01,800 --> 00:45:03,420
do simulations in
the future to see

1049
00:45:03,420 --> 00:45:05,400
if their theories
about the climate

1050
00:45:05,400 --> 00:45:08,050
bear up to this
15-year case study.

1051
00:45:08,050 --> 00:45:10,380
So this is really
going to challenge

1052
00:45:10,380 --> 00:45:13,250
all levels of computer systems
to be able to pull this off.

1053
00:45:13,250 --> 00:45:15,750
To give you an example of the
type of thing we'd like to do,

1054
00:45:15,750 --> 00:45:18,995
I'm going to show you this
next videotape as an example.

1055
00:45:18,995 --> 00:45:20,370
This is not
something we've done.

1056
00:45:20,370 --> 00:45:21,840
This is the type of
thing we'd like it

1057
00:45:21,840 --> 00:45:23,940
so that Earth system
scientists, global change

1058
00:45:23,940 --> 00:45:27,390
researchers, could try on their
screen to find things out.

1059
00:45:27,390 --> 00:45:28,860
For this example,
it's going to be

1060
00:45:28,860 --> 00:45:32,640
using green to represent the
chlorophyll content of plants.

1061
00:45:32,640 --> 00:45:35,940
And what you're going to see is
the trade winds blow the rain,

1062
00:45:35,940 --> 00:45:39,300
the chlorophyll will
move across South America

1063
00:45:39,300 --> 00:45:40,980
and bumping into the Andes.

1064
00:45:40,980 --> 00:45:43,170
And this would show
what would happen

1065
00:45:43,170 --> 00:45:47,070
if there was an environmental
accident on the East

1066
00:45:47,070 --> 00:45:50,910
Coast of South America, how
rapidly that might contaminate

1067
00:45:50,910 --> 00:45:52,260
the South American continent.

1068
00:45:52,260 --> 00:45:54,970
And you can see as it
contraposes this information

1069
00:45:54,970 --> 00:45:57,345
according to altitudes, you
get some interesting insights

1070
00:45:57,345 --> 00:45:58,375
to what's going on.

1071
00:45:58,375 --> 00:46:00,000
This is an example
of something that we

1072
00:46:00,000 --> 00:46:02,820
would like to do in
the Sequoia Project

1073
00:46:02,820 --> 00:46:05,033
and should fire
your imagination.

1074
00:46:05,033 --> 00:46:05,700
Here's the demo.

1075
00:46:05,700 --> 00:46:09,473


1076
00:46:09,473 --> 00:46:10,140
[VIDEO PLAYBACK]

1077
00:46:10,140 --> 00:46:14,540
- So now let's compare North
America with South America.

1078
00:46:14,540 --> 00:46:17,150
The relation of
vegetation production

1079
00:46:17,150 --> 00:46:21,650
to global climatic patterns
is clearly reflected here.

1080
00:46:21,650 --> 00:46:24,500
The tropical easterly
winds spread rain

1081
00:46:24,500 --> 00:46:28,210
across the continent
to make it green.

1082
00:46:28,210 --> 00:46:30,620
They drop the last
of their moisture

1083
00:46:30,620 --> 00:46:34,940
when they meet the tall
Andes mountains on the left.

1084
00:46:34,940 --> 00:46:38,060
The black area of low
production on the left

1085
00:46:38,060 --> 00:46:40,850
reflects the dry lands
and the rain shadow

1086
00:46:40,850 --> 00:46:44,000
produced by the Andes.

1087
00:46:44,000 --> 00:46:46,610
Why, then, is there
another black shadow

1088
00:46:46,610 --> 00:46:49,020
in the lower portion
of South America,

1089
00:46:49,020 --> 00:46:51,980
but on the opposite
side of the Andes?

1090
00:46:51,980 --> 00:46:55,730
This occurs at exactly 30
degrees South latitude,

1091
00:46:55,730 --> 00:46:58,490
where the tropical
easterly winds shift

1092
00:46:58,490 --> 00:47:03,500
to the westerly winds which
characterize temperate zones.

1093
00:47:03,500 --> 00:47:07,190
GRASS is a system designed to
support scientific research

1094
00:47:07,190 --> 00:47:10,670
and to answer land
management questions.

1095
00:47:10,670 --> 00:47:13,670
This presentation
suggests only a few

1096
00:47:13,670 --> 00:47:16,190
of GRASS's potential
applications.

1097
00:47:16,190 --> 00:47:19,490
It does illustrate how today's
technology can integrate

1098
00:47:19,490 --> 00:47:22,880
the latest satellite imagery,
computer manipulation

1099
00:47:22,880 --> 00:47:26,090
techniques, and
hardware capabilities

1100
00:47:26,090 --> 00:47:29,780
for visualizing our
fragile ecosystem in ways

1101
00:47:29,780 --> 00:47:32,268
not previously possible.

1102
00:47:32,268 --> 00:47:35,726
[MUSIC PLAYING]

1103
00:47:35,726 --> 00:48:01,347


1104
00:48:01,347 --> 00:48:01,930
[END PLAYBACK]

1105
00:48:01,930 --> 00:48:04,097
That's one application
of the technologies

1106
00:48:04,097 --> 00:48:05,680
to help the global
change researchers,

1107
00:48:05,680 --> 00:48:07,930
but another
interesting application

1108
00:48:07,930 --> 00:48:10,340
is the electronic library
or digital library.

1109
00:48:10,340 --> 00:48:12,988
And you can see that
on the next slide.

1110
00:48:12,988 --> 00:48:14,530
If you visit the
Berkeley campus, one

1111
00:48:14,530 --> 00:48:18,070
of our nicest buildings on
campus, the Bancroft Library,

1112
00:48:18,070 --> 00:48:21,043
that has just
372,000 books in it.

1113
00:48:21,043 --> 00:48:22,460
And if you convert
that into text,

1114
00:48:22,460 --> 00:48:24,160
that'd be about half a terabyte.

1115
00:48:24,160 --> 00:48:26,800
If instead of just having
the text, if what you had

1116
00:48:26,800 --> 00:48:29,500
was the images of
the full page, that

1117
00:48:29,500 --> 00:48:32,390
might take maybe 20 terabytes.

1118
00:48:32,390 --> 00:48:35,890
That's not a very big
piece of that storage robot

1119
00:48:35,890 --> 00:48:37,450
that we talked about earlier.

1120
00:48:37,450 --> 00:48:39,100
Moreover, if you
visited the campus

1121
00:48:39,100 --> 00:48:40,750
right now while
we're filming, we're

1122
00:48:40,750 --> 00:48:44,200
in the middle of a four-year,
$45 million project

1123
00:48:44,200 --> 00:48:49,570
to create a building
to contain two million books.

1124
00:48:49,570 --> 00:48:51,700
Right now in the state of
California, $45 million

1125
00:48:51,700 --> 00:48:53,510
is a lot of money.

1126
00:48:53,510 --> 00:48:57,730
And what we're doing is building
this building for books.

1127
00:48:57,730 --> 00:48:59,320
How good an idea is that?

1128
00:48:59,320 --> 00:49:00,940
We could fit all
that information,

1129
00:49:00,940 --> 00:49:03,730
even the page images,
into one of these robots.

1130
00:49:03,730 --> 00:49:05,890
And it's pretty
expensive to create this.

1131
00:49:05,890 --> 00:49:09,317
I wonder whether or not if you
visit the Berkeley campus in 10

1132
00:49:09,317 --> 00:49:10,900
years and come on a
tour guide, people

1133
00:49:10,900 --> 00:49:13,780
are going to refer to this
new building as the mausoleum

1134
00:49:13,780 --> 00:49:16,780
of dead trees, Tien's folly,
where they spent $45 million

1135
00:49:16,780 --> 00:49:18,670
to hold all these books.

1136
00:49:18,670 --> 00:49:21,400
And in fact, I've
given this talk before

1137
00:49:21,400 --> 00:49:23,290
and talked about
how libraries work

1138
00:49:23,290 --> 00:49:25,753
so much that I can see
clearly in my future

1139
00:49:25,753 --> 00:49:28,420
how libraries are going to be so
different from the way they are

1140
00:49:28,420 --> 00:49:29,110
today.

1141
00:49:29,110 --> 00:49:30,850
And when we look
back at these times,

1142
00:49:30,850 --> 00:49:32,152
people are going to chuckle.

1143
00:49:32,152 --> 00:49:33,610
And to put that in
perspective, let

1144
00:49:33,610 --> 00:49:35,420
me tell you what's
happened in my lifetime

1145
00:49:35,420 --> 00:49:37,090
in terms of learning
how to program,

1146
00:49:37,090 --> 00:49:40,970
and then we'll fast forward and
talk about how libraries work.

1147
00:49:40,970 --> 00:49:44,207
So the way I learned to program
is I would write my program out

1148
00:49:44,207 --> 00:49:45,040
on a sheet of paper.

1149
00:49:45,040 --> 00:49:46,810
I would then go to
a vending machine,

1150
00:49:46,810 --> 00:49:48,352
put a quarter in
the vending machine,

1151
00:49:48,352 --> 00:49:49,880
and get a stack of IBM cards.

1152
00:49:49,880 --> 00:49:51,910
I would then take
my piece of paper,

1153
00:49:51,910 --> 00:49:54,220
wander over to the
keypunch machine, and type

1154
00:49:54,220 --> 00:49:56,680
in, keypunch, put little
holes in cardboard

1155
00:49:56,680 --> 00:49:58,030
all of those characters.

1156
00:49:58,030 --> 00:49:59,950
I would take this
stack of cards,

1157
00:49:59,950 --> 00:50:02,475
wander over to somebody
at the counter, smile,

1158
00:50:02,475 --> 00:50:04,600
try and get to know the
person, be the best friend,

1159
00:50:04,600 --> 00:50:06,280
hoping that that stack
of cards will get

1160
00:50:06,280 --> 00:50:08,140
put at the front of some queue.

1161
00:50:08,140 --> 00:50:11,470
Then what I did was go home,
come back the next morning,

1162
00:50:11,470 --> 00:50:14,260
look, bend down, find
my slot with my name,

1163
00:50:14,260 --> 00:50:17,140
and pick out a line printer
listing to see what happened.

1164
00:50:17,140 --> 00:50:19,940
Ah, left out a comma.

1165
00:50:19,940 --> 00:50:22,510
Take the cards, go over,
get some more cards,

1166
00:50:22,510 --> 00:50:25,750
replicate, insert the
comma, hand it to the guy,

1167
00:50:25,750 --> 00:50:30,400
smile, come back the next
day for the next printing

1168
00:50:30,400 --> 00:50:31,280
of the cards.

1169
00:50:31,280 --> 00:50:34,060
Now, that is the way
I learned to program.

1170
00:50:34,060 --> 00:50:35,830
How do people learn
to program today?

1171
00:50:35,830 --> 00:50:37,900
I'm not sure that
students today even know

1172
00:50:37,900 --> 00:50:39,370
the syntax of the language.

1173
00:50:39,370 --> 00:50:41,890
They start typing, commas
get inserted automatically

1174
00:50:41,890 --> 00:50:43,120
by the editor.

1175
00:50:43,120 --> 00:50:44,962
I can imagine--
suppose what we did on

1176
00:50:44,962 --> 00:50:47,170
the Berkeley campus would
say, boy, things are tough.

1177
00:50:47,170 --> 00:50:48,630
We're having to
build this library,

1178
00:50:48,630 --> 00:50:49,630
it's using up our funds.

1179
00:50:49,630 --> 00:50:51,580
So what we're going to do is
go back and teach programming

1180
00:50:51,580 --> 00:50:53,740
the way the faculty
learned how to program.

1181
00:50:53,740 --> 00:50:55,780
We would have a riot
on the Berkeley campus

1182
00:50:55,780 --> 00:50:57,460
to put People's Park to shame.

1183
00:50:57,460 --> 00:51:00,070
They'd say, nobody can
learn to program that way.

1184
00:51:00,070 --> 00:51:02,440
That's prehistoric, impossible.

1185
00:51:02,440 --> 00:51:05,940
Only an idiot would
even suggest it.

1186
00:51:05,940 --> 00:51:07,660
So let's talk about libraries.

1187
00:51:07,660 --> 00:51:09,820
Imagine we fast
forward about 10 years,

1188
00:51:09,820 --> 00:51:13,250
and let's explain to the
people 10 years in the future

1189
00:51:13,250 --> 00:51:15,200
how we use libraries today.

1190
00:51:15,200 --> 00:51:18,370
So the way we use libraries
today is if you're lucky,

1191
00:51:18,370 --> 00:51:20,140
and you've got
electronic card catalog,

1192
00:51:20,140 --> 00:51:22,015
you can find what you
want and you write down

1193
00:51:22,015 --> 00:51:22,900
the call letters.

1194
00:51:22,900 --> 00:51:24,460
Then you get up
out of your office,

1195
00:51:24,460 --> 00:51:28,978
or you come from home to
the campus, wander around,

1196
00:51:28,978 --> 00:51:30,520
and if you have a
special pad, you're

1197
00:51:30,520 --> 00:51:33,190
allowed to get into the stacks,
go up and down the stacks,

1198
00:51:33,190 --> 00:51:36,370
find the call letters, look
where the book's supposed

1199
00:51:36,370 --> 00:51:38,770
to be in the shelf, and
then you look all around

1200
00:51:38,770 --> 00:51:41,060
because it's probably not
where it's supposed to be.

1201
00:51:41,060 --> 00:51:42,310
And if it's not
there, you look over

1202
00:51:42,310 --> 00:51:44,020
to the cart that's
right next to it, which

1203
00:51:44,020 --> 00:51:45,853
has the books that
haven't been put away yet

1204
00:51:45,853 --> 00:51:46,990
and see if you can find it.

1205
00:51:46,990 --> 00:51:49,060
If you're lucky, you
find the book you want.

1206
00:51:49,060 --> 00:51:51,700
You wander down the stacks,
go to the front desk.

1207
00:51:51,700 --> 00:51:53,800
If you're fortunate and
you have a modern library,

1208
00:51:53,800 --> 00:51:56,050
they'll use a grocery
scanner to scan it in.

1209
00:51:56,050 --> 00:51:58,410
If not, you have to write
your name on this little card.

1210
00:51:58,410 --> 00:51:59,350
You take the book.

1211
00:51:59,350 --> 00:52:01,450
You go back to your office
or go all the way home,

1212
00:52:01,450 --> 00:52:05,050
read the 10 or 20 pages you
cared about, and then you try

1213
00:52:05,050 --> 00:52:07,090
and remember to
take that book back,

1214
00:52:07,090 --> 00:52:10,720
because no one else gets to
use that book while you're

1215
00:52:10,720 --> 00:52:11,360
having it.

1216
00:52:11,360 --> 00:52:14,050
And then when you do get
that postcard in the mail

1217
00:52:14,050 --> 00:52:16,210
after three weeks or
after the end of semester

1218
00:52:16,210 --> 00:52:18,130
to remind you to bring that
book back, you bring it back,

1219
00:52:18,130 --> 00:52:19,450
and it gets put on the shelves.

1220
00:52:19,450 --> 00:52:21,970
So imagine telling somebody
10 years from now this is

1221
00:52:21,970 --> 00:52:24,400
how we did scholarly research.

1222
00:52:24,400 --> 00:52:27,760
This is how we found out
what other people were doing.

1223
00:52:27,760 --> 00:52:30,640
They'll say, but you must
never have read books.

1224
00:52:30,640 --> 00:52:33,460
You must not have taken
any time at all to do that,

1225
00:52:33,460 --> 00:52:34,270
so it's so painful.

1226
00:52:34,270 --> 00:52:35,200
It must have taken you hours.

1227
00:52:35,200 --> 00:52:35,992
Yes, it took hours.

1228
00:52:35,992 --> 00:52:37,533
And it must have
been very expensive.

1229
00:52:37,533 --> 00:52:38,750
Oh, it was very expensive.

1230
00:52:38,750 --> 00:52:41,470
Every time you check a
book out of a library,

1231
00:52:41,470 --> 00:52:43,150
a library is doing
a very good job

1232
00:52:43,150 --> 00:52:47,017
if it only costs them $1 to put
the book back on the shelf.

1233
00:52:47,017 --> 00:52:49,600
So the best thing you could do
to help libraries out right now

1234
00:52:49,600 --> 00:52:51,332
is to not check out any books.

1235
00:52:51,332 --> 00:52:52,790
That would save
them a lot of money

1236
00:52:52,790 --> 00:52:55,510
if nobody checked out books.

1237
00:52:55,510 --> 00:53:00,500
Libraries also have to buy books
in anticipation of their use.

1238
00:53:00,500 --> 00:53:03,460
So a librarian is doing
an outstanding job if only

1239
00:53:03,460 --> 00:53:06,790
20% of the books that they buy--
on the Berkeley campus, that's

1240
00:53:06,790 --> 00:53:08,050
about a $3 million budget.

1241
00:53:08,050 --> 00:53:10,960
So 20% of the books that
they buy, no one ever,

1242
00:53:10,960 --> 00:53:12,370
ever checks out.

1243
00:53:12,370 --> 00:53:13,660
Not once.

1244
00:53:13,660 --> 00:53:16,600
So that's just-- those books
just sit there and use up

1245
00:53:16,600 --> 00:53:18,032
shelf space.

1246
00:53:18,032 --> 00:53:19,990
Similarly, to create a
catalog entry for a book

1247
00:53:19,990 --> 00:53:22,130
costs a big fraction of
the price of the book.

1248
00:53:22,130 --> 00:53:25,240
So the system we have today
is extraordinarily expensive,

1249
00:53:25,240 --> 00:53:27,040
and it's extraordinarily
inconvenient.

1250
00:53:27,040 --> 00:53:29,560
Imagine having to get up out
of your chair and go do that.

1251
00:53:29,560 --> 00:53:31,570
You can imagine 10 years
or so in the future,

1252
00:53:31,570 --> 00:53:33,548
people are going to
be doing-- searching

1253
00:53:33,548 --> 00:53:35,590
for the right information,
getting those 10 or 20

1254
00:53:35,590 --> 00:53:37,715
pages they're interested
in pop up on their screen,

1255
00:53:37,715 --> 00:53:40,358
reading that, inserting what
they need, and moving ahead.

1256
00:53:40,358 --> 00:53:42,400
It'll be dramatically
different than what we do today,

1257
00:53:42,400 --> 00:53:44,320
as dramatically different
as the way I learned

1258
00:53:44,320 --> 00:53:46,930
to program versus the way people
are learning to program today.

1259
00:53:46,930 --> 00:53:48,160
In fact, I've given
this talk enough that

1260
00:53:48,160 --> 00:53:50,350
whenever I go to the library
I get angry that I've

1261
00:53:50,350 --> 00:53:51,880
got to go through
all this rigmarole

1262
00:53:51,880 --> 00:53:53,470
because I know
it's not necessary.

1263
00:53:53,470 --> 00:53:56,180
The technology is
there to do that.

1264
00:53:56,180 --> 00:53:59,950
So let's start wrapping
this talk up here.

1265
00:53:59,950 --> 00:54:01,900
What I'm talking about
in the second half

1266
00:54:01,900 --> 00:54:04,100
is a new technology.

1267
00:54:04,100 --> 00:54:06,820
This is a pretty
classic curve that's

1268
00:54:06,820 --> 00:54:09,743
been shown for a long time where
there's DRAM and magnetic disks.

1269
00:54:09,743 --> 00:54:11,410
And this has been
called the access gap.

1270
00:54:11,410 --> 00:54:14,170
This is a log scale in terms
of dollars per megabyte.

1271
00:54:14,170 --> 00:54:16,120
This is the most expensive.

1272
00:54:16,120 --> 00:54:18,970
This is the fastest down here--
log scale in access time.

1273
00:54:18,970 --> 00:54:21,433
So DRAM's expensive and fast.

1274
00:54:21,433 --> 00:54:23,350
And this is an access
gap that a lot of people

1275
00:54:23,350 --> 00:54:25,342
have tried to invent
technologies to fill.

1276
00:54:25,342 --> 00:54:26,800
What I'm telling
you, there's going

1277
00:54:26,800 --> 00:54:29,320
to be a new access gap,
this robo-line tape that

1278
00:54:29,320 --> 00:54:32,410
is much cheaper and much
slower, and how do we

1279
00:54:32,410 --> 00:54:36,528
figure out how to use
that as systems designers?

1280
00:54:36,528 --> 00:54:38,820
So there's lots of research
issues we've got to attack.

1281
00:54:38,820 --> 00:54:40,110
And again, remember,
I'm waving my hands

1282
00:54:40,110 --> 00:54:41,587
because it's things
we need to do.

1283
00:54:41,587 --> 00:54:43,920
There's, how are we going to
manage three or four levels

1284
00:54:43,920 --> 00:54:45,503
of storage hierarchy,
when in the past

1285
00:54:45,503 --> 00:54:46,860
we've only managed two?

1286
00:54:46,860 --> 00:54:49,440
How are we going to manage
that inherent latency

1287
00:54:49,440 --> 00:54:52,920
of this new technology
that's very cheap?

1288
00:54:52,920 --> 00:54:55,400
Other examples: what are we
going to do with compression?

1289
00:54:55,400 --> 00:54:56,400
Can we do it on the fly?

1290
00:54:56,400 --> 00:54:57,233
What about hardware?

1291
00:54:57,233 --> 00:54:58,858
How are we going to
keep this reliable?

1292
00:54:58,858 --> 00:55:00,570
If you have the sum of
humankind's knowledge,

1293
00:55:00,570 --> 00:55:02,700
it's not OK if the sum
of humankind's knowledge

1294
00:55:02,700 --> 00:55:03,540
goes down.

1295
00:55:03,540 --> 00:55:06,540
It's not OK if you lose Mark
Twain's collected works.

1296
00:55:06,540 --> 00:55:09,070
And what does it mean to
back up 100 terabytes?

1297
00:55:09,070 --> 00:55:11,350
So there's lots of
interesting issues there.

1298
00:55:11,350 --> 00:55:15,960
So for my last slide, let me
conclude with a prediction

1299
00:55:15,960 --> 00:55:19,320
that this new storage technology
is going to really change

1300
00:55:19,320 --> 00:55:22,410
our society provided
a couple of things,

1301
00:55:22,410 --> 00:55:24,275
that for a cost
of a minicomputer,

1302
00:55:24,275 --> 00:55:25,650
if you can get a
factor of 1,000,

1303
00:55:25,650 --> 00:55:28,290
that's a pretty big impact and
that's going to change things.

1304
00:55:28,290 --> 00:55:30,510
The obstacles aren't
technical, though, in my view.

1305
00:55:30,510 --> 00:55:33,090
The technical obstacles that
come up I bet we can attack.

1306
00:55:33,090 --> 00:55:34,800
We've done it in the past.

1307
00:55:34,800 --> 00:55:36,568
First of all, it's
the legal copyrights

1308
00:55:36,568 --> 00:55:37,860
that's going to be an obstacle.

1309
00:55:37,860 --> 00:55:40,740
I'm not allowed to make an
online copy of all the books

1310
00:55:40,740 --> 00:55:43,860
in my library at home
because the copyright says

1311
00:55:43,860 --> 00:55:45,030
that's a copy.

1312
00:55:45,030 --> 00:55:47,580
So copyright is an
obstacle to online.

1313
00:55:47,580 --> 00:55:50,880
Similarly, business model
is going to be an obstacle.

1314
00:55:50,880 --> 00:55:53,640
Paper-based publishers
are used to having a book,

1315
00:55:53,640 --> 00:55:55,170
meaning they get money.

1316
00:55:55,170 --> 00:55:57,550
If they place that
information online,

1317
00:55:57,550 --> 00:56:01,093
what guarantee do they have that
they'll get any sales at all?

1318
00:56:01,093 --> 00:56:02,260
Why won't it just be copied?

1319
00:56:02,260 --> 00:56:05,850
So we have, as technologists,
to provide those guarantees.

1320
00:56:05,850 --> 00:56:09,060
So my prediction is that by
the end of this decade,

1321
00:56:09,060 --> 00:56:12,150
before the next century, if
we can address the first two

1322
00:56:12,150 --> 00:56:15,360
non-technical issues,
this factor of 1,000

1323
00:56:15,360 --> 00:56:16,920
increase in online
storage is going

1324
00:56:16,920 --> 00:56:19,410
to have a much greater
impact in our society

1325
00:56:19,410 --> 00:56:22,050
than this factor of 1,000
increase in CPU speed.

1326
00:56:22,050 --> 00:56:23,790
So thanks very much
for your attention.

1327
00:56:23,790 --> 00:56:26,165
You really did stick through
and listen to the whole tape

1328
00:56:26,165 --> 00:56:27,030
about input/output.

1329
00:56:27,030 --> 00:56:28,655
But I hope you could
see from this tape

1330
00:56:28,655 --> 00:56:31,440
just why input/output is so
much more exciting than processor

1331
00:56:31,440 --> 00:56:34,500
design, and you'll agree
that terabytes is a lot more

1332
00:56:34,500 --> 00:56:35,760
important than teraflops.

1333
00:56:35,760 --> 00:56:36,480
Thanks very much.

1334
00:56:36,480 --> 00:56:42,420


1335
00:56:42,420 --> 00:56:44,400
Yeah, where's the applause?

1336
00:56:44,400 --> 00:56:46,050
Yeah, where is the applause?

1337
00:56:46,050 --> 00:56:46,920
If this was on TV,

1338
00:56:46,920 --> 00:56:48,210
we could insert the laughter.

1339
00:56:48,210 --> 00:56:51,560
[LAUGHTER]

1340
00:56:51,560 --> 00:57:31,000