1 00:00:00,000 --> 00:00:03,437 [MUSIC PLAYING] 2 00:00:03,437 --> 00:00:13,770 3 00:00:13,770 --> 00:00:15,960 Thanks a lot for coming and watching this tape. 4 00:00:15,960 --> 00:00:17,880 I'm going to talk about input/output. 5 00:00:17,880 --> 00:00:19,740 Now, input/output is an area that's 6 00:00:19,740 --> 00:00:22,785 been an orphan of computer architecture, long neglected. 7 00:00:22,785 --> 00:00:25,410 So I'm going to need to motivate why you want to stay and watch 8 00:00:25,410 --> 00:00:26,948 this tape. 9 00:00:26,948 --> 00:00:29,490 To give you an idea of how much this area has been neglected, 10 00:00:29,490 --> 00:00:32,100 first of all, the whole equipment 11 00:00:32,100 --> 00:00:34,200 is referred to as peripherals, as opposed 12 00:00:34,200 --> 00:00:36,330 to the central processor. 13 00:00:36,330 --> 00:00:38,600 And secondly, there was a programming language that 14 00:00:38,600 --> 00:00:41,100 was the ancestor of most of the programming languages people 15 00:00:41,100 --> 00:00:44,370 use today. This language is called ALGOL 60. 16 00:00:44,370 --> 00:00:46,920 And when it was invented, it didn't 17 00:00:46,920 --> 00:00:50,850 have any input or output at all in the language, and nobody cared. 18 00:00:50,850 --> 00:00:56,220 So let me motivate why you want to watch this tape, starting 19 00:00:56,220 --> 00:00:58,760 off with our first slide. 20 00:00:58,760 --> 00:01:02,390 What's happening is a renaissance in CPU performance, 21 00:01:02,390 --> 00:01:05,269 improving by 50% to 100% per year. 22 00:01:05,269 --> 00:01:08,330 In fact, what we have today is the supercomputer 23 00:01:08,330 --> 00:01:09,020 on the desktop. 24 00:01:09,020 --> 00:01:09,980 It's something that people have been 25 00:01:09,980 --> 00:01:11,600 talking about for a long time. 
26 00:01:11,600 --> 00:01:14,030 What you see on the left-hand part of this slide 27 00:01:14,030 --> 00:01:17,480 is the HP 735 workstation and the various characteristics 28 00:01:17,480 --> 00:01:18,260 of that. 29 00:01:18,260 --> 00:01:20,090 In contrast, on the right-hand side 30 00:01:20,090 --> 00:01:23,690 is the Cray 1 supercomputer, the original vector supercomputer 31 00:01:23,690 --> 00:01:25,100 and its characteristics. 32 00:01:25,100 --> 00:01:27,320 When you line these characteristics up, 33 00:01:27,320 --> 00:01:33,230 by all measures, the HP 735 is faster or better 34 00:01:33,230 --> 00:01:36,650 than the Cray vector supercomputer, even including 35 00:01:36,650 --> 00:01:40,622 the cost at about 1% or half a percent. 36 00:01:40,622 --> 00:01:45,230 Another way of saying that is that the HP 735, 37 00:01:45,230 --> 00:01:47,450 if it was announced in the late 1970s, 38 00:01:47,450 --> 00:01:49,980 would have been a hell of a good supercomputer. 39 00:01:49,980 --> 00:01:52,210 So we've got the supercomputer on the desk. 40 00:01:52,210 --> 00:01:53,690 What an amazing development. 41 00:01:53,690 --> 00:01:59,210 Moreover, people are taking tens or hundreds of these processors 42 00:01:59,210 --> 00:02:01,700 and putting them together to create supercomputers out 43 00:02:01,700 --> 00:02:03,890 of lots of these processors. 44 00:02:03,890 --> 00:02:06,530 So that is moving at an even faster rate, 45 00:02:06,530 --> 00:02:08,220 say, 150% per year. 46 00:02:08,220 --> 00:02:10,190 So why do we care about I/O? 47 00:02:10,190 --> 00:02:12,170 Well, because I/O hasn't been moving as fast 48 00:02:12,170 --> 00:02:13,550 as the processor design. 49 00:02:13,550 --> 00:02:15,410 It's limited by mechanical delays 50 00:02:15,410 --> 00:02:17,480 and has been growing maybe 5% or 10% 51 00:02:17,480 --> 00:02:19,780 per year over that same time period. 
52 00:02:19,780 --> 00:02:22,510 53 00:02:22,510 --> 00:02:24,990 What's going to happen if we don't improve the I/O 54 00:02:24,990 --> 00:02:25,690 component? 55 00:02:25,690 --> 00:02:27,820 Gene Amdahl came up with Amdahl's law 56 00:02:27,820 --> 00:02:32,190 that says system speed is limited by the slowest part. 57 00:02:32,190 --> 00:02:34,430 So let's do a specific example. 58 00:02:34,430 --> 00:02:36,820 If we spend 10% of our time today in I/O, 59 00:02:36,820 --> 00:02:39,550 and we get a 10 times faster CPU, which comes along pretty 60 00:02:39,550 --> 00:02:42,260 quickly, we're only going to get half of that performance 61 00:02:42,260 --> 00:02:42,760 potential. 62 00:02:42,760 --> 00:02:46,130 We're going to lose 50% of that performance. 63 00:02:46,130 --> 00:02:48,760 Similarly, if we take that same program with 10% of its time 64 00:02:48,760 --> 00:02:51,160 today and get 100 times faster CPU, possibly 65 00:02:51,160 --> 00:02:54,430 with a multiple processor, we're going to lose 90% of that, 66 00:02:54,430 --> 00:02:57,800 getting only 1/10 of its potential. 67 00:02:57,800 --> 00:03:00,770 That means we have an I/O bottleneck facing us 68 00:03:00,770 --> 00:03:02,030 very shortly. 69 00:03:02,030 --> 00:03:04,040 Because a diminishing fraction of the time 70 00:03:04,040 --> 00:03:05,990 is going to be spent in the CPU in the future 71 00:03:05,990 --> 00:03:07,610 if we don't do anything about I/O. 72 00:03:07,610 --> 00:03:11,300 That means a diminishing value of faster CPUs, which 73 00:03:11,300 --> 00:03:13,850 means a diminishing value of researchers 74 00:03:13,850 --> 00:03:16,760 who are working on CPUs, not to mention a diminishing value 75 00:03:16,760 --> 00:03:19,518 of high-paid academic consultants on CPUs. 76 00:03:19,518 --> 00:03:21,560 So I think we all agree this is pretty important, 77 00:03:21,560 --> 00:03:25,457 even if we're not going to work on I/O ourselves. 
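The Amdahl's law arithmetic in this example can be checked with a short Python sketch. It assumes a program's run time splits cleanly into an I/O part that does not speed up and a CPU part that does; the function name `overall_speedup` is made up for illustration and is not from the talk.

```python
def overall_speedup(io_fraction, cpu_speedup):
    """Amdahl's law, assuming only the CPU part of the run time gets faster."""
    cpu_fraction = 1.0 - io_fraction
    # The I/O time is unchanged; the CPU time shrinks by the CPU speedup.
    new_time = io_fraction + cpu_fraction / cpu_speedup
    return 1.0 / new_time


# The two cases from the talk: 10% of the time in I/O, CPU 10x or 100x faster.
for cpu_speedup in (10, 100):
    s = overall_speedup(0.10, cpu_speedup)
    print(f"CPU {cpu_speedup}x faster -> system {s:.1f}x faster "
          f"({s / cpu_speedup:.0%} of the potential)")
```

With 10% of the time in I/O, the 10 times faster CPU gives a bit over 5 times overall, and the 100 times faster CPU gives a bit over 9 times, which matches the "half" and "1/10" figures quoted above.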
78 00:03:25,457 --> 00:03:27,790 So what have our colleagues in magnetic disk design been 79 00:03:27,790 --> 00:03:30,190 doing all these years while we've been making processors 80 00:03:30,190 --> 00:03:31,370 faster? 81 00:03:31,370 --> 00:03:33,730 They've been concentrating on capacity and dollars 82 00:03:33,730 --> 00:03:34,960 per megabyte. 83 00:03:34,960 --> 00:03:38,440 They are improving at about 25% per year 84 00:03:38,440 --> 00:03:42,250 historically, and more recently, at about 50% per year-- 85 00:03:42,250 --> 00:03:44,165 both of those measures. 86 00:03:44,165 --> 00:03:45,790 The other thing that they've been doing 87 00:03:45,790 --> 00:03:48,047 is evolving to smaller disks, from things 88 00:03:48,047 --> 00:03:50,380 that were the size of washing machines or refrigerators, 89 00:03:50,380 --> 00:03:52,240 to the things that you can fit in your hand. 90 00:03:52,240 --> 00:03:54,350 Here's a specific example. 91 00:03:54,350 --> 00:03:58,250 So this I have in my hand is a 2 and 1/2 inch diameter disk. 92 00:03:58,250 --> 00:04:00,790 You can see it's very thin, maybe a 1/2 inch thick. 93 00:04:00,790 --> 00:04:03,490 And what's on the back of it is the electronics-- 94 00:04:03,490 --> 00:04:05,500 all the integrated circuits. 95 00:04:05,500 --> 00:04:07,660 This disk, at the time we're making this tape, 96 00:04:07,660 --> 00:04:09,490 can hold 320 megabytes. 97 00:04:09,490 --> 00:04:11,590 As you can see, there's just two platters here. 98 00:04:11,590 --> 00:04:14,035 This is remarkable shrinkage in design. 99 00:04:14,035 --> 00:04:16,660 That's what our disk colleagues have been doing-- making things 100 00:04:16,660 --> 00:04:21,938 smaller, the cost cheaper, and the capacity greater. 101 00:04:21,938 --> 00:04:23,230 That's what they've been doing. 
102 00:04:23,230 --> 00:04:24,970 And in fact, in a few years, this 103 00:04:24,970 --> 00:04:27,130 is going to seem like a dinosaur monster disk, 104 00:04:27,130 --> 00:04:30,343 and people are working on disks that'll be 1 inch in diameter, 105 00:04:30,343 --> 00:04:32,260 much smaller than a 2 inch, and you won't even 106 00:04:32,260 --> 00:04:35,147 be able to know whether I've got one in my hand or not. 107 00:04:35,147 --> 00:04:37,480 What our colleagues in the disk industry have been doing 108 00:04:37,480 --> 00:04:41,650 is making disks that are smaller and cheaper, as opposed 109 00:04:41,650 --> 00:04:43,490 to larger and faster. 110 00:04:43,490 --> 00:04:47,390 The faster processors need larger and faster disks. 111 00:04:47,390 --> 00:04:50,140 So the question we asked ourselves six or seven years 112 00:04:50,140 --> 00:04:53,230 ago was, can these smaller disks be used somehow 113 00:04:53,230 --> 00:04:56,680 to close the gap in performance between disks and CPUs? 114 00:04:56,680 --> 00:04:59,920 Or how could we use these smaller disks to do that? 115 00:04:59,920 --> 00:05:02,760 So the idea is to replace a small number of large disks 116 00:05:02,760 --> 00:05:06,675 with a large number of small disks. 117 00:05:06,675 --> 00:05:11,910 This next slide shows how it would work. 118 00:05:11,910 --> 00:05:14,060 What we'd see on the top is the way 119 00:05:14,060 --> 00:05:15,810 disks are traditionally manufactured. 120 00:05:15,810 --> 00:05:18,840 What you see is four different designs 121 00:05:18,840 --> 00:05:20,430 having four different engineering 122 00:05:20,430 --> 00:05:23,460 teams, each concentrating on the high end 123 00:05:23,460 --> 00:05:25,975 to the low end of the efforts. 
124 00:05:25,975 --> 00:05:28,920 What we're talking about instead is concentrating 125 00:05:28,920 --> 00:05:32,820 the engineering talents in the lowest smallest diameter disk, 126 00:05:32,820 --> 00:05:34,530 build the best disk you can, and simply 127 00:05:34,530 --> 00:05:37,860 replicating to get mid-range and then high-end designs. 128 00:05:37,860 --> 00:05:41,880 That's the El Dorado of disk array design here. 129 00:05:41,880 --> 00:05:43,380 That's what people are trying to do. 130 00:05:43,380 --> 00:05:46,980 Well, how well would that work? 131 00:05:46,980 --> 00:05:50,910 What's shown on the left column is the IBM mainframe disk. 132 00:05:50,910 --> 00:05:53,350 Back when IBM made a lot of money, 133 00:05:53,350 --> 00:05:57,600 this is where a lot of money was made, in the mainframe disk. 134 00:05:57,600 --> 00:06:00,730 If we contrast that with this narrower disk 135 00:06:00,730 --> 00:06:04,180 also from IBM, which is in the middle column, 136 00:06:04,180 --> 00:06:05,980 we can see some big differences there. 137 00:06:05,980 --> 00:06:08,820 But if we got enough of those small disks, in this case, 70, 138 00:06:08,820 --> 00:06:10,920 so that we would get the same capacity-- that's 139 00:06:10,920 --> 00:06:12,960 this right column, it's the same capacity-- 140 00:06:12,960 --> 00:06:15,010 we have some interesting characteristics. 141 00:06:15,010 --> 00:06:17,040 First of all, it's actually got some advantages 142 00:06:17,040 --> 00:06:18,870 in size and power. 143 00:06:18,870 --> 00:06:20,490 But the really exciting advantage 144 00:06:20,490 --> 00:06:22,620 from the system designer's perspective, 145 00:06:22,620 --> 00:06:25,980 given this CPU performance gap, is 146 00:06:25,980 --> 00:06:28,290 the data rate or I/O rate-- data rate 147 00:06:28,290 --> 00:06:30,150 megabytes per second transferred, 148 00:06:30,150 --> 00:06:32,550 or the number of I/Os per second. 
149 00:06:32,550 --> 00:06:36,580 Given that we have 70 disks with 70 arms operating 150 00:06:36,580 --> 00:06:38,610 at the same time instead of just a small number, 151 00:06:38,610 --> 00:06:40,360 we can get a much higher data rate. 152 00:06:40,360 --> 00:06:42,960 You can see that's about a factor of 8 improvement. 153 00:06:42,960 --> 00:06:45,305 Even though the big disks run much faster, 154 00:06:45,305 --> 00:06:46,680 we've got so many small ones that 155 00:06:46,680 --> 00:06:48,510 are close enough we get a factor of 8 improvement, 156 00:06:48,510 --> 00:06:49,927 and the number of I/Os per second, 157 00:06:49,927 --> 00:06:52,890 similarly, increases by about a factor of 6. 158 00:06:52,890 --> 00:06:54,710 And let's say the cost is about the same. 159 00:06:54,710 --> 00:06:56,190 But what about this last row here? 160 00:06:56,190 --> 00:06:59,190 What about the reliability? 161 00:06:59,190 --> 00:07:01,927 This is an acronym for the mean time to data loss. 162 00:07:01,927 --> 00:07:04,260 So we can think of it as the mean time between failures. 163 00:07:04,260 --> 00:07:07,030 How well will that work? 164 00:07:07,030 --> 00:07:10,030 Well, turns out if things fail randomly, 165 00:07:10,030 --> 00:07:13,710 then the reliability of N things is 1 over N. 166 00:07:13,710 --> 00:07:15,420 So we take the mean time between failure 167 00:07:15,420 --> 00:07:18,910 from about 50,000 hours, divided by 70, to about 700 hours. 168 00:07:18,910 --> 00:07:21,720 In other words, the mean time between failure 169 00:07:21,720 --> 00:07:24,840 drops from six years to one month. 170 00:07:24,840 --> 00:07:28,980 Therefore, disk arrays are too unreliable to be useful. 171 00:07:28,980 --> 00:07:30,657 Therefore, it's a bad idea. 172 00:07:30,657 --> 00:07:31,990 And this is the end of the tape. 173 00:07:31,990 --> 00:07:34,210 Thank you very much. 174 00:07:34,210 --> 00:07:35,290 No. 175 00:07:35,290 --> 00:07:38,050 Don't-- hold that dial. 
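The "1 over N" reliability arithmetic can be written out as a small Python sketch. It assumes, as the talk does at this point, that disks fail randomly and independently at a constant rate, so that an array with no redundancy loses data as soon as any one disk fails. The function name `array_mttf` is illustrative only.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365


def array_mttf(disk_mttf_hours, n_disks):
    """Mean time to the first failure among n independent, identical disks."""
    return disk_mttf_hours / n_disks


single_disk = 50_000                      # hours, the figure quoted in the talk
array = array_mttf(single_disk, 70)       # 70 small disks with no redundancy

print(f"one disk : {single_disk / HOURS_PER_YEAR:.1f} years")
print(f"70 disks : {array:.0f} hours, about {array / HOURS_PER_DAY:.0f} days")
```

The result is a little over 700 hours, which is where the "six years to one month" comparison comes from.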
176 00:07:38,050 --> 00:07:41,350 What we're going to do is add extra disks 177 00:07:41,350 --> 00:07:43,690 and turn a weakness into a strength. 178 00:07:43,690 --> 00:07:45,850 Hence, the name of RAID. 179 00:07:45,850 --> 00:07:49,870 What we're going to do is take redundant arrays 180 00:07:49,870 --> 00:07:52,720 of inexpensive disks by having some extras to overcome 181 00:07:52,720 --> 00:07:55,060 this reliability disadvantage. 182 00:07:55,060 --> 00:07:56,630 Now, there's two advantages here. 183 00:07:56,630 --> 00:07:58,625 One is subtle and one's obvious. 184 00:07:58,625 --> 00:08:00,250 What we're going to do is have extra disks 185 00:08:00,250 --> 00:08:03,320 so that when a disk fails, we can reconstruct the lost data. 186 00:08:03,320 --> 00:08:05,940 That's the first point. 187 00:08:05,940 --> 00:08:08,570 But if we can reconstruct the data sometimes, 188 00:08:08,570 --> 00:08:11,237 that means that on the fly we can reconstruct 189 00:08:11,237 --> 00:08:12,320 the data that's been lost. 190 00:08:12,320 --> 00:08:16,850 So just because at the end of the term when everything's busy 191 00:08:16,850 --> 00:08:19,435 a disk crashes doesn't mean you can't get your term paper. 192 00:08:19,435 --> 00:08:20,810 Because it can be reconstructed-- 193 00:08:20,810 --> 00:08:22,268 it's going to run a little slower-- 194 00:08:22,268 --> 00:08:24,860 but reconstructed on the fly and given back to you. 195 00:08:24,860 --> 00:08:28,460 We'll see a tape of that later, a demo showing it 196 00:08:28,460 --> 00:08:30,890 runs a little slower, but the information's 197 00:08:30,890 --> 00:08:33,559 available continuously. 198 00:08:33,559 --> 00:08:37,820 And that's the basic idea of RAID of-- so this redundancy. 199 00:08:37,820 --> 00:08:41,030 Now, it turns out there's lots of ways to do this redundancy. 200 00:08:41,030 --> 00:08:43,370 What I'm going to do is cover four of them here. 
201 00:08:43,370 --> 00:08:45,020 What's shown on the left-hand side 202 00:08:45,020 --> 00:08:48,120 is kind of the English descriptions of these things. 203 00:08:48,120 --> 00:08:50,970 What's shown on the right-hand side is the levels of the RAID. 204 00:08:50,970 --> 00:08:52,520 This was in our original paper. 205 00:08:52,520 --> 00:08:54,615 We used these levels to explain things. 206 00:08:54,615 --> 00:08:56,240 And as you can see, from top to bottom, 207 00:08:56,240 --> 00:08:58,370 they're getting more sophisticated. 208 00:08:58,370 --> 00:09:01,400 And these level numbers have caught on, probably 209 00:09:01,400 --> 00:09:04,350 to the regret of some English teachers. 210 00:09:04,350 --> 00:09:06,560 You may see these levels-- 211 00:09:06,560 --> 00:09:09,020 these numbers here that are put together 212 00:09:09,020 --> 00:09:11,480 to describe advertisements. 213 00:09:11,480 --> 00:09:14,152 You'll see level 5 RAIDs or RAID 5s and things like that. 214 00:09:14,152 --> 00:09:16,610 What I'm going to do is go over each of these organizations 215 00:09:16,610 --> 00:09:20,660 in this next part of the tape, and explain the advantages. 216 00:09:20,660 --> 00:09:22,720 And basically, as we go down the line, 217 00:09:22,720 --> 00:09:24,470 we're going to have less and less overhead 218 00:09:24,470 --> 00:09:28,680 to provide redundancy to get that reliability. 219 00:09:28,680 --> 00:09:32,000 The first organization is called either mirroring or shadowing, 220 00:09:32,000 --> 00:09:35,580 depending where your-- 221 00:09:35,580 --> 00:09:37,700 what manufacturer sells it to you. 222 00:09:37,700 --> 00:09:39,650 So what's shown on the left is the data disk. 223 00:09:39,650 --> 00:09:42,950 And all these examples will have, say, eight data disks. 
224 00:09:42,950 --> 00:09:46,670 Now, what you need to do to have reliability using 225 00:09:46,670 --> 00:09:49,580 the mirroring scheme is give every disk its own mirror, 226 00:09:49,580 --> 00:09:51,200 so that if you write to this disk, 227 00:09:51,200 --> 00:09:52,730 you also write to the mirror. 228 00:09:52,730 --> 00:09:55,940 That way if this disk fails, we know 229 00:09:55,940 --> 00:09:59,280 where to go look to find the data that's missing. 230 00:09:59,280 --> 00:10:02,120 This is the organization that's the favorite organization 231 00:10:02,120 --> 00:10:05,240 of disk manufacturers. 232 00:10:05,240 --> 00:10:08,850 The reason is that you have to buy twice as many disks. 233 00:10:08,850 --> 00:10:11,870 Unfortunately, for those of us on a university budget, 234 00:10:11,870 --> 00:10:14,360 getting extra reliability by buying twice as many disks 235 00:10:14,360 --> 00:10:17,150 isn't something we could afford, so this wasn't something 236 00:10:17,150 --> 00:10:19,340 we were interested in. 237 00:10:19,340 --> 00:10:23,120 The next organization, again, has eight data disks, 238 00:10:23,120 --> 00:10:25,880 and we've cut the redundancy down to four. 239 00:10:25,880 --> 00:10:28,080 If you've taken a course in memory design, 240 00:10:28,080 --> 00:10:29,810 you've heard about error correction codes 241 00:10:29,810 --> 00:10:32,845 that can correct single errors. 242 00:10:32,845 --> 00:10:34,220 You know there's a way to do this 243 00:10:34,220 --> 00:10:35,780 without doubling the memory. 244 00:10:35,780 --> 00:10:38,030 And in fact, what happens is you 245 00:10:38,030 --> 00:10:41,090 have, in this case, four redundant disks, and parity 246 00:10:41,090 --> 00:10:43,880 is calculated over subsets of the data disks. 247 00:10:43,880 --> 00:10:47,210 And you can see how this works in this example. 248 00:10:47,210 --> 00:10:50,180 Let's suppose disk 11 breaks right here. 
249 00:10:50,180 --> 00:10:52,160 What would happen if we calculated the parities 250 00:10:52,160 --> 00:10:54,440 in the subset is we get a 1 for this group, 251 00:10:54,440 --> 00:10:56,330 because disk 11's included. 252 00:10:56,330 --> 00:10:58,880 Disk 11 is not there, so that's a 0. 253 00:10:58,880 --> 00:11:02,100 In this group, disk 11 is included, so we get a 1 there, 254 00:11:02,100 --> 00:11:03,350 and this group would have a 1. 255 00:11:03,350 --> 00:11:06,380 So we'd have the pattern 1, 0, 1, 1. 256 00:11:06,380 --> 00:11:08,810 The fact that this pattern is not zeros 257 00:11:08,810 --> 00:11:11,150 means there's an error. 258 00:11:11,150 --> 00:11:13,100 And not only that, we need to figure out 259 00:11:13,100 --> 00:11:14,100 which disk is the error. 260 00:11:14,100 --> 00:11:15,892 And if you remember your binary here, 261 00:11:15,892 --> 00:11:18,700 1, 0, 1, 1-- if you convert that into decimal, that's 11. 262 00:11:18,700 --> 00:11:21,170 So in fact, kind of as a parlor trick, 263 00:11:21,170 --> 00:11:22,850 the encoding that shows [INAUDIBLE] here 264 00:11:22,850 --> 00:11:24,680 also points out the disk that has failed. 265 00:11:24,680 --> 00:11:27,230 So there is an error, and it's disk 11. 266 00:11:27,230 --> 00:11:30,950 Now we've reduced the cost, in this case, from eight disks 267 00:11:30,950 --> 00:11:33,020 to four disks, but that still would be expensive 268 00:11:33,020 --> 00:11:35,670 on the university budget. 269 00:11:35,670 --> 00:11:39,110 The next scheme, RAID level 3, or redundancy via parity, 270 00:11:39,110 --> 00:11:43,010 recognizes one of the really great things about disks today, 271 00:11:43,010 --> 00:11:44,810 in that disks are smart. 272 00:11:44,810 --> 00:11:47,510 Disks have controllers that know if they're working or not. 
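The disk 11 "parlor trick" from the error-correction-code organization can be sketched in Python as a classic Hamming code. This is an illustration under stated assumptions, not the slide itself: the twelve disks are numbered 1 to 12, the four check disks sit at positions 1, 2, 4, and 8, each disk is reduced to a single bit, and a failure is modeled as that bit flipping. The names `group`, `encode`, and `syndrome` are made up for this sketch.

```python
CHECK_POSITIONS = (1, 2, 4, 8)   # assumed numbering: check disks at powers of two
N_DISKS = 12                     # eight data disks plus four check disks


def group(p):
    """Disks covered by check disk p: every disk whose number has bit p set."""
    return [d for d in range(1, N_DISKS + 1) if d & p]


def encode(data_bits):
    """Place 8 data bits on the non-check disks, then fill in the check disks."""
    disks = {}
    data = iter(data_bits)
    for d in range(1, N_DISKS + 1):
        disks[d] = 0 if d in CHECK_POSITIONS else next(data)
    for p in CHECK_POSITIONS:
        # Choose the check bit so that its whole group has even parity.
        disks[p] = sum(disks[d] for d in group(p)) % 2
    return disks


def syndrome(disks):
    """Add up the check positions whose group parity is wrong (0 means no error)."""
    return sum(p for p in CHECK_POSITIONS
               if sum(disks[d] for d in group(p)) % 2)


disks = encode([1, 0, 1, 1, 0, 0, 1, 0])
disks[11] ^= 1                               # disk 11 goes bad
print(format(syndrome(disks), "04b"), "->", syndrome(disks))
```

Disk 11 belongs to the groups for check positions 8, 2, and 1 but not 4, which is why the pattern reads 1, 0, 1, 1 and converts to 11.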
273 00:11:47,510 --> 00:11:49,022 Moreover, on every single sector-- 274 00:11:49,022 --> 00:11:51,230 that's kind of every block that you can read or write 275 00:11:51,230 --> 00:11:54,740 individually-- it has extra error-checking codes 276 00:11:54,740 --> 00:11:56,930 to determine with very high probability 277 00:11:56,930 --> 00:11:59,300 whether or not what you read was read correctly, 278 00:11:59,300 --> 00:12:02,540 or if you write it, it's likely to be read correctly 279 00:12:02,540 --> 00:12:04,650 again later. 280 00:12:04,650 --> 00:12:08,220 Since there's the ability to determine 281 00:12:08,220 --> 00:12:09,360 if a disk has failed-- 282 00:12:09,360 --> 00:12:12,090 the disk will raise its hand if it failed-- alls we have to do 283 00:12:12,090 --> 00:12:14,670 is calculate the data that's missing from that disk, 284 00:12:14,670 --> 00:12:16,452 and we can do that with one extra disk. 285 00:12:16,452 --> 00:12:17,910 So this is the type of thing that's 286 00:12:17,910 --> 00:12:20,700 interesting in a university budget. 287 00:12:20,700 --> 00:12:21,900 So how does that work? 288 00:12:21,900 --> 00:12:23,400 You can think of this disk as having 289 00:12:23,400 --> 00:12:26,220 the sum of all the information on all the other disks. 290 00:12:26,220 --> 00:12:27,840 Then if this disk fails, we'll know 291 00:12:27,840 --> 00:12:30,690 that from the Reed/Solomon codes, alls we have to do 292 00:12:30,690 --> 00:12:34,060 is subtract the remaining information from the sum disk, 293 00:12:34,060 --> 00:12:36,607 and that difference must have been that value there. 294 00:12:36,607 --> 00:12:37,440 That's what happens. 295 00:12:37,440 --> 00:12:39,480 Actually, we do this all with parity calculations, 296 00:12:39,480 --> 00:12:41,500 but you can think of it that way. 297 00:12:41,500 --> 00:12:42,970 So this seems pretty interesting. 298 00:12:42,970 --> 00:12:46,500 Instead of having twice as many disks, we have one extra disk. 
299 00:12:46,500 --> 00:12:50,580 We can afford that on a university budget. 300 00:12:50,580 --> 00:12:52,207 Now, so how well will this work? 301 00:12:52,207 --> 00:12:54,540 So far you've probably been thinking of this as we write 302 00:12:54,540 --> 00:12:57,630 a bit to every single one of the disks, so that there's-- 303 00:12:57,630 --> 00:13:00,270 a read will read from all nine disks in this case, 304 00:13:00,270 --> 00:13:02,600 and a write will write to all nine disks in this case. 305 00:13:02,600 --> 00:13:04,373 That's probably a good way to think of it. 306 00:13:04,373 --> 00:13:05,790 You can think of this when they're 307 00:13:05,790 --> 00:13:08,310 accessing all the disks as large accesses, 308 00:13:08,310 --> 00:13:10,483 large reads and large writes. 309 00:13:10,483 --> 00:13:12,150 If you think about this a little longer, 310 00:13:12,150 --> 00:13:14,850 you realize though if every single disk has this error 311 00:13:14,850 --> 00:13:16,800 correction information in each sector, 312 00:13:16,800 --> 00:13:19,722 that would allow these disks to be accessed independently 313 00:13:19,722 --> 00:13:22,180 with blocks that don't have anything to do with each other. 314 00:13:22,180 --> 00:13:25,260 So if what we write to each one is a whole sector 315 00:13:25,260 --> 00:13:27,570 at a time, or even a larger unit, 316 00:13:27,570 --> 00:13:31,540 then we can let reads happen independently. 317 00:13:31,540 --> 00:13:34,350 This would allow us to increase the number of I/Os 318 00:13:34,350 --> 00:13:36,670 per second in terms of reads from this information. 319 00:13:36,670 --> 00:13:39,477 So we could have the large reads that involve all the disks, 320 00:13:39,477 --> 00:13:41,310 the large writes that involve all the disks, 321 00:13:41,310 --> 00:13:46,380 and small reads without really any extra work. 322 00:13:46,380 --> 00:13:49,050 That brings us to small writes. 
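The "sum disk" idea behind the single parity disk can be sketched in Python using XOR, which is the parity calculation the talk says is actually used. The block contents, the block size, and the helper name `xor_blocks` are assumptions for illustration; the sketch also assumes the failed disk has identified itself, as described above.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (the 'sum' in the talk)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Eight data disks, each holding one small block of made-up contents.
data = [bytes((i * 37 + j) % 256 for j in range(4)) for i in range(8)]
parity = xor_blocks(data)                    # contents of the one extra disk

failed = 3                                   # the disk that 'raises its hand'
survivors = [block for i, block in enumerate(data) if i != failed]

# XOR of the surviving data and the parity gives back the missing block.
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data[failed])
```

XOR plays the role of both the sum and the subtraction, so rebuilding a lost block is the same operation as computing the parity in the first place.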
323 00:13:49,050 --> 00:13:52,267 What would happen if we wanted to write just to this one disk? 324 00:13:52,267 --> 00:13:53,850 What it would seem like you want to do 325 00:13:53,850 --> 00:13:56,267 if you wanted to write to this disk, what you'd have to do 326 00:13:56,267 --> 00:14:00,120 is read all the information from the corresponding disks, 327 00:14:00,120 --> 00:14:02,940 calculate the new parity and write that out. 328 00:14:02,940 --> 00:14:04,720 Well, that's one way you could do it. 329 00:14:04,720 --> 00:14:08,040 But because it's got this sum and difference relationship, 330 00:14:08,040 --> 00:14:09,810 the other way we could do it is say, 331 00:14:09,810 --> 00:14:12,450 let's compare the new data to the old data, 332 00:14:12,450 --> 00:14:14,820 see which bits change, and then change 333 00:14:14,820 --> 00:14:16,740 those bits in the parity disk. 334 00:14:16,740 --> 00:14:19,650 Thus, a small write could involve 335 00:14:19,650 --> 00:14:23,053 only two disks-- the parity disk and the disk with the data. 336 00:14:23,053 --> 00:14:24,720 So that seems better than this other way 337 00:14:24,720 --> 00:14:25,890 that involved all the disks. 338 00:14:25,890 --> 00:14:29,130 So we can do the large reads-- everybody supplies disks; 339 00:14:29,130 --> 00:14:31,587 large writes-- everybody gets some bits. 340 00:14:31,587 --> 00:14:33,670 The small reads, which allow things independently, 341 00:14:33,670 --> 00:14:34,670 and the writes-- 342 00:14:34,670 --> 00:14:36,810 unfortunately, the writes take four disk accesses-- 343 00:14:36,810 --> 00:14:39,060 reading and writing that disk, and reading and writing 344 00:14:39,060 --> 00:14:40,320 the parity disk. 345 00:14:40,320 --> 00:14:42,540 Moreover, what the problem is, the parity disk 346 00:14:42,540 --> 00:14:44,040 is going to be the write bottleneck. 347 00:14:44,040 --> 00:14:46,752 Every write has to update the parity disk. 
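The small-write shortcut just described (compare new data with old, then flip the same bits in the parity) can be sketched in Python as follows. The function names and the in-memory lists standing in for disks are assumptions for illustration; the comments mark the four disk accesses the talk counts.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(blocks):
    """Parity recomputed the slow way, by reading every data disk."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = xor_bytes(parity, block)
    return parity


def small_write(disks, parity, index, new_block):
    """Update one data block and the parity using only two disks."""
    old_block = disks[index]                      # access 1: read old data
    old_parity = parity                           # access 2: read old parity
    changed_bits = xor_bytes(old_block, new_block)
    new_parity = xor_bytes(old_parity, changed_bits)
    disks[index] = new_block                      # access 3: write new data
    return new_parity                             # access 4: write new parity


disks = [bytes([i, i + 1, i + 2]) for i in range(8)]
parity = full_parity(disks)
parity = small_write(disks, parity, 5, b"\xff\x00\x7f")
print(parity == full_parity(disks))
```

The shortcut produces the same parity as rereading all eight data disks, at the cost of two reads and two writes.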
348 00:14:46,752 --> 00:14:48,210 Therefore, the speed is going to be 349 00:14:48,210 --> 00:14:51,960 limited by how fast the parity disk can write. 350 00:14:51,960 --> 00:14:55,300 This inspired the next organization. 351 00:14:55,300 --> 00:14:56,970 Now, this is a different diagram, 352 00:14:56,970 --> 00:15:00,090 with the disk in the top being shown in exploded form going 353 00:15:00,090 --> 00:15:02,250 down the page, where each one of these squares 354 00:15:02,250 --> 00:15:04,110 is supposed to represent a block. 355 00:15:04,110 --> 00:15:07,480 You can think of that as a sector for right now. 356 00:15:07,480 --> 00:15:10,550 So let's try this idea. 357 00:15:10,550 --> 00:15:12,660 You notice what we've done is taken 358 00:15:12,660 --> 00:15:15,600 all the parity, which before was always on one disk, 359 00:15:15,600 --> 00:15:16,500 and spread it out. 360 00:15:16,500 --> 00:15:18,570 We've rotated the parity. 361 00:15:18,570 --> 00:15:21,420 So the parity for this first row, which 362 00:15:21,420 --> 00:15:25,560 we refer to as a stripe, this first row is here, 363 00:15:25,560 --> 00:15:27,630 and then it moves its way down. 364 00:15:27,630 --> 00:15:29,830 So how will small writes work in this case? 365 00:15:29,830 --> 00:15:31,860 Suppose we're going to write to D3. 366 00:15:31,860 --> 00:15:35,670 Well, we're going to have to update the block in D3, which 367 00:15:35,670 --> 00:15:39,360 ties up this disk; update the parity right here, which 368 00:15:39,360 --> 00:15:42,730 corresponds to the D3 which occupies this disk. 369 00:15:42,730 --> 00:15:44,970 So we have two of the five disks occupied. 370 00:15:44,970 --> 00:15:48,180 If the other disk we wanted to occupy, wanted to update, 371 00:15:48,180 --> 00:15:50,370 the data we wanted to update was disk eight, 372 00:15:50,370 --> 00:15:52,590 we'd also occupy this parity block. 373 00:15:52,590 --> 00:15:55,470 So that would occupy these two disks. 
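The rotated-parity layout can be sketched in Python as a mapping from a data block number to the disks it touches. The exact rotation on the slide is not recoverable from the transcript, so this sketch assumes five disks with the parity for the first stripe on the last disk and moving one disk to the left on each following stripe; the function name `locate` and the block numbering D0, D1, ... are also assumptions.

```python
def locate(block, n_disks=5):
    """Return (stripe, data_disk, parity_disk) for data block number `block`.

    Assumed layout: each stripe holds n_disks - 1 data blocks plus one parity
    block, and the parity block starts on the last disk and rotates left.
    """
    blocks_per_stripe = n_disks - 1
    stripe, offset = divmod(block, blocks_per_stripe)
    parity_disk = (n_disks - 1 - stripe) % n_disks
    # Data blocks fill the stripe left to right, skipping the parity disk.
    data_disk = offset if offset < parity_disk else offset + 1
    return stripe, data_disk, parity_disk


for block in (3, 8):
    stripe, data_disk, parity_disk = locate(block)
    print(f"D{block}: stripe {stripe}, data on disk {data_disk}, "
          f"parity on disk {parity_disk}")
```

Under this assumed layout, a small write to D3 and a small write to D8 touch two different pairs of disks, so the two writes can proceed at the same time.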
374 00:15:55,470 --> 00:15:58,620 Thus, without any really extra work, just by rotating 375 00:15:58,620 --> 00:16:01,830 the parity, we're capable of getting 376 00:16:01,830 --> 00:16:05,640 more small write bandwidth, which makes it more attractive. 377 00:16:05,640 --> 00:16:08,640 This final organization, redundancy with rotated parity, 378 00:16:08,640 --> 00:16:12,570 or RAID level 5, is the end of our progress 379 00:16:12,570 --> 00:16:15,682 in getting performance and providing redundancy. 380 00:16:15,682 --> 00:16:17,640 And this is the scheme that we used at Berkeley 381 00:16:17,640 --> 00:16:20,110 and is very popular today. 382 00:16:20,110 --> 00:16:22,650 You can even read articles in Byte magazine, 383 00:16:22,650 --> 00:16:25,140 amazingly enough, at least when we make this tape, 384 00:16:25,140 --> 00:16:29,840 and you'll see RAID 5, and now you know what that means. 385 00:16:29,840 --> 00:16:31,200 OK. 386 00:16:31,200 --> 00:16:34,290 Now in addition to coming up with these ideas, 387 00:16:34,290 --> 00:16:37,450 we actually like trying to think about building these things. 388 00:16:37,450 --> 00:16:39,310 And how are disks constructed today? 389 00:16:39,310 --> 00:16:41,310 Well, the way disks work today is you don't just 390 00:16:41,310 --> 00:16:43,200 plug the disk into a computer. 391 00:16:43,200 --> 00:16:45,150 It's connected to a cable. 392 00:16:45,150 --> 00:16:46,530 It's typically called a string. 393 00:16:46,530 --> 00:16:49,170 And you put several disks on this string that's 394 00:16:49,170 --> 00:16:51,450 connected to a string controller that, in this case, 395 00:16:51,450 --> 00:16:53,760 would be connected to an array controller, let's say. 396 00:16:53,760 --> 00:16:57,395 But the idea here is disks are connected over a set of wires 397 00:16:57,395 --> 00:16:58,770 to a computer, and that's the way 398 00:16:58,770 --> 00:17:01,935 things are done in many kinds of computer systems. 
399 00:17:01,935 --> 00:17:03,560 It would seem what you would do is just 400 00:17:03,560 --> 00:17:08,119 put that stripe right down this row along this cable. 401 00:17:08,119 --> 00:17:09,780 What's wrong with that? 402 00:17:09,780 --> 00:17:12,410 The problem is now we have the string controller, a piece 403 00:17:12,410 --> 00:17:14,480 of electronics which can break. 404 00:17:14,480 --> 00:17:17,000 If this string were to break, that 405 00:17:17,000 --> 00:17:24,349 would disconnect all of these disks from the computer system. 406 00:17:24,349 --> 00:17:26,230 Now, this RAID thing works, provided 407 00:17:26,230 --> 00:17:27,579 we only have one broken disk. 408 00:17:27,579 --> 00:17:28,550 We can reconstruct it. 409 00:17:28,550 --> 00:17:30,250 But all of the information in this group 410 00:17:30,250 --> 00:17:32,020 would be disconnected. 411 00:17:32,020 --> 00:17:37,030 That would mean we couldn't reconstruct that information. 412 00:17:37,030 --> 00:17:41,100 However, if we do it orthogonal, if we change it 413 00:17:41,100 --> 00:17:43,750 so that the disks are perpendicular to the strings, 414 00:17:43,750 --> 00:17:46,350 what happens if we lose a string? 415 00:17:46,350 --> 00:17:49,620 If this string goes out, we only lose one disk 416 00:17:49,620 --> 00:17:53,700 from each one of the stripes, which allows us to reconstruct 417 00:17:53,700 --> 00:17:57,480 that in the RAID project because that's 418 00:17:57,480 --> 00:17:58,658 the way the system works. 419 00:17:58,658 --> 00:18:00,450 If we lose one just from this [INAUDIBLE], 420 00:18:00,450 --> 00:18:02,140 it can reconstruct the lost information. 421 00:18:02,140 --> 00:18:05,040 So again, not with extra hardware, just 422 00:18:05,040 --> 00:18:07,260 by thinking about the right way you organize it, 423 00:18:07,260 --> 00:18:11,010 you can get a more comfortable system, a more reliable system. 
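The difference between laying a stripe along a string and laying it across the strings can be sketched in Python by counting how many disks each stripe loses when one string controller fails. The five-by-five grid and all names here are assumptions for illustration; the talk does not give these dimensions.

```python
N_STRINGS = 5        # assumed number of strings (cables with a controller)
DISKS_PER_STRING = 5 # assumed number of disks on each string

# A disk is identified by (string, position on that string).
# Layout A: each stripe runs along one string.
along_string = [[(s, p) for p in range(DISKS_PER_STRING)]
                for s in range(N_STRINGS)]
# Layout B: each stripe takes one disk from every string (orthogonal).
orthogonal = [[(s, p) for s in range(N_STRINGS)]
              for p in range(DISKS_PER_STRING)]


def disks_lost_per_stripe(stripes, failed_string):
    """Count, for every stripe, the disks that sit on the failed string."""
    return [sum(1 for string, _ in stripe if string == failed_string)
            for stripe in stripes]


def survives(stripes, failed_string):
    """Single parity can rebuild a stripe only if it lost at most one disk."""
    return max(disks_lost_per_stripe(stripes, failed_string)) <= 1


print("along string:", disks_lost_per_stripe(along_string, 2),
      survives(along_string, 2))
print("orthogonal  :", disks_lost_per_stripe(orthogonal, 2),
      survives(orthogonal, 2))
```

With the stripe along the string, one stripe loses all of its disks and cannot be rebuilt; with the orthogonal layout, every stripe loses exactly one disk, which single parity can reconstruct.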
424 00:18:11,010 --> 00:18:14,203 425 00:18:14,203 --> 00:18:15,870 Another question that we asked ourselves 426 00:18:15,870 --> 00:18:17,250 is what about spare disks? 427 00:18:17,250 --> 00:18:19,080 What would happen if we were to have 428 00:18:19,080 --> 00:18:21,120 some disks without any data on them 429 00:18:21,120 --> 00:18:24,810 ready to repair in case something broke? 430 00:18:24,810 --> 00:18:27,210 Right now you've been assuming that the disk breaks 431 00:18:27,210 --> 00:18:30,570 and you call up your local system manager, 432 00:18:30,570 --> 00:18:33,528 and he or she gets in their car and drives to RadioShack, 433 00:18:33,528 --> 00:18:35,320 I guess, and buys a disk and bring it back, 434 00:18:35,320 --> 00:18:36,300 and that might take a long while. 435 00:18:36,300 --> 00:18:37,800 What happens if there's a spare just 436 00:18:37,800 --> 00:18:39,450 sitting there ready to be used? 437 00:18:39,450 --> 00:18:42,910 What this actually shows is the mean time to data loss. 438 00:18:42,910 --> 00:18:44,520 That's like the mean time to failure. 439 00:18:44,520 --> 00:18:47,910 How many, in this case, in thousands of hours, 440 00:18:47,910 --> 00:18:50,130 before it would break? 441 00:18:50,130 --> 00:18:52,800 Well, the study that was done here, assuming disks 442 00:18:52,800 --> 00:18:55,147 fail independently, if we had-- 443 00:18:55,147 --> 00:18:56,730 assuming we started off with 70 disks, 444 00:18:56,730 --> 00:19:00,240 with just one spare disk we would get this factor of 40 445 00:19:00,240 --> 00:19:01,920 improvement in reliability-- 446 00:19:01,920 --> 00:19:03,600 40 times as reliable. 447 00:19:03,600 --> 00:19:05,490 Interestingly, with two spare disks, 448 00:19:05,490 --> 00:19:08,220 we're about as good as an infinite number of spare disks. 
449 00:19:08,220 --> 00:19:10,380 So this data suggests one or two spare disks 450 00:19:10,380 --> 00:19:12,178 would really be an interesting thing 451 00:19:12,178 --> 00:19:14,220 to consider in terms of improving the reliability 452 00:19:14,220 --> 00:19:15,880 of RAID systems. 453 00:19:15,880 --> 00:19:17,580 But at the top of the slide, it says 454 00:19:17,580 --> 00:19:19,590 we're assuming independent failures. 455 00:19:19,590 --> 00:19:21,450 We know disks don't fail independently 456 00:19:21,450 --> 00:19:24,833 because of those strings that connect the disks together. 457 00:19:24,833 --> 00:19:26,250 That's what this next slide shows. 458 00:19:26,250 --> 00:19:28,500 So what would happen with dependent failures 459 00:19:28,500 --> 00:19:32,330 because the disks are connected on strings? 460 00:19:32,330 --> 00:19:35,502 Well, what happens is the relationship is changed to now 461 00:19:35,502 --> 00:19:36,710 we need a whole spare string. 462 00:19:36,710 --> 00:19:39,080 In this case, we are assuming that a string has 463 00:19:39,080 --> 00:19:40,350 seven disks on it. 464 00:19:40,350 --> 00:19:42,300 So here are the disks across the bottom. 465 00:19:42,300 --> 00:19:45,560 And these lines indicate one string full of disks 466 00:19:45,560 --> 00:19:47,295 or two strings full of disks. 467 00:19:47,295 --> 00:19:49,670 So now if we have a whole spare string because the string 468 00:19:49,670 --> 00:19:51,962 controller could fail and knock out all of these disks, 469 00:19:51,962 --> 00:19:54,920 we get a 50 times improvement in the mean time between failures. 470 00:19:54,920 --> 00:19:57,690 But two strings was as good as an infinite number of strings. 471 00:19:57,690 --> 00:20:00,830 So you can argue that one, possibly two spare strings 472 00:20:00,830 --> 00:20:03,650 would be a good idea if you had about 70 disks in terms 473 00:20:03,650 --> 00:20:07,190 of improving the reliability.
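Editorial sketch, not part of the talk: one way to see why a ready spare helps is a commonly used back-of-the-envelope approximation for single-parity arrays under independent failures, in which mean time to data loss scales inversely with repair time. This is not the model behind the study on the slide, and it will not reproduce the factors of 40 or 50 quoted above. The disk MTTF, group size, and repair times below are made-up illustrative values.

```python
def mttdl_hours(disk_mttf_hours: float, total_disks: int,
                group_size: int, mttr_hours: float) -> float:
    """Approximate mean time to data loss for single-parity groups.

    Assumes independent, exponentially distributed disk failures.
    Data is lost when a second disk in the same group fails before
    the first failed disk has been repaired.
    group_size counts the parity disk as well as the data disks.
    """
    return disk_mttf_hours ** 2 / (total_disks * (group_size - 1) * mttr_hours)


# Illustrative assumptions only: 150,000-hour disks, 70 disks in groups of 10.
no_spare = mttdl_hours(150_000, 70, 10, mttr_hours=24.0)   # wait for a new disk
hot_spare = mttdl_hours(150_000, 70, 10, mttr_hours=1.0)   # rebuild starts at once

improvement = hot_spare / no_spare
print(f"improvement from shorter repair time: {improvement:.0f}x")
```

Under this approximation the gain comes entirely from the shorter repair window. It says nothing about dependent failures such as a string controller taking out several disks at once, which is the case the talk turns to next.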
474 00:20:07,190 --> 00:20:09,560 Well, not only do we try and come up with organizations 475 00:20:09,560 --> 00:20:11,018 and thinking about building things, 476 00:20:11,018 --> 00:20:13,340 we actually try and do this in a research context 477 00:20:13,340 --> 00:20:15,358 that involves multiple disciplines. 478 00:20:15,358 --> 00:20:17,900 So the project that I've been talking about, the RAID effort, 479 00:20:17,900 --> 00:20:21,680 was led by Professor Randy Katz and myself. 480 00:20:21,680 --> 00:20:23,450 But the context that we did this research 481 00:20:23,450 --> 00:20:27,560 included network operating system 482 00:20:27,560 --> 00:20:30,980 and file system work by John Ousterhout, 483 00:20:30,980 --> 00:20:34,490 particularly his Sprite project, a Unix-like operating system 484 00:20:34,490 --> 00:20:36,380 with many interesting features. 485 00:20:36,380 --> 00:20:39,380 At the same time also, Mike Stonebraker, 486 00:20:39,380 --> 00:20:43,280 in the database processing-- transaction processing area 487 00:20:43,280 --> 00:20:45,870 led the Postgres effort, his follow-on to Ingres. 488 00:20:45,870 --> 00:20:49,310 So this work was being done in this triumvirate 489 00:20:49,310 --> 00:20:53,010 of hardware, operating system, and database research. 490 00:20:53,010 --> 00:20:56,900 The reason I really like this style of multidisciplinary work 491 00:20:56,900 --> 00:21:00,148 is what happens, as shown on the next slide. 492 00:21:00,148 --> 00:21:02,690 So Garth Gibson, who's now an assistant professor at Carnegie 493 00:21:02,690 --> 00:21:04,640 Mellon University, Randy Katz and I 494 00:21:04,640 --> 00:21:07,170 did the papers where we talked about the advantages of RAID, 495 00:21:07,170 --> 00:21:09,950 but it has this drawback of these small writes being slow, 496 00:21:09,950 --> 00:21:12,630 which I talked about earlier.
497 00:21:12,630 --> 00:21:14,810 Well, this inspired the operating system people, 498 00:21:14,810 --> 00:21:16,850 in particular, Mendel Rosenblum, who's 499 00:21:16,850 --> 00:21:18,860 an assistant professor at Stanford University, 500 00:21:18,860 --> 00:21:20,420 and John Ousterhout to start thinking 501 00:21:20,420 --> 00:21:22,670 about what would a file system be like that 502 00:21:22,670 --> 00:21:24,530 didn't do small writes? 503 00:21:24,530 --> 00:21:27,320 They came up with the idea of making 504 00:21:27,320 --> 00:21:32,720 the writes go as fast as the disk could accept them. 505 00:21:32,720 --> 00:21:34,580 And so instead of doing accesses, 506 00:21:34,580 --> 00:21:36,920 it's just writing without moving the head. 507 00:21:36,920 --> 00:21:40,340 So it writes a stream or a log of the updates 508 00:21:40,340 --> 00:21:41,760 rather than updating in place. 509 00:21:41,760 --> 00:21:43,520 This is the log-structured file system 510 00:21:43,520 --> 00:21:46,070 which is a pretty popular concept today. 511 00:21:46,070 --> 00:21:49,697 This, in turn, inspired Margo Seltzer, 512 00:21:49,697 --> 00:21:51,530 who is now an assistant professor at Harvard 513 00:21:51,530 --> 00:21:53,238 University, and Mike Stonebraker to think 514 00:21:53,238 --> 00:21:55,970 about having transaction support 515 00:21:55,970 --> 00:21:57,620 in the log-structured file system 516 00:21:57,620 --> 00:22:00,990 and see how all that would work to make databases run well. 517 00:22:00,990 --> 00:22:05,210 And so this symbiotic relationship of the hardware 518 00:22:05,210 --> 00:22:08,330 advances inspiring operating system advances and databases, 519 00:22:08,330 --> 00:22:09,890 again, makes the hardware look good. 520 00:22:09,890 --> 00:22:12,560 So that's one of the reasons I like these efforts.
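Editorial sketch, not part of the talk: the log-structured idea described above, writing a stream of updates rather than updating in place, can be shown with a toy in-memory model. The class `ToyLog` and its methods are hypothetical names for illustration. A real log-structured file system also needs segment cleaning, checkpoints, and crash recovery, none of which appear here.

```python
class ToyLog:
    """Toy model of log-structured writes (not a real file system)."""

    def __init__(self):
        self.log = []     # append-only sequence of (block_id, data) records
        self.index = {}   # block_id -> position of its newest record in the log

    def write(self, block_id, data):
        # Every update is appended at the tail of the log, so a run of
        # small writes becomes one sequential stream. Older versions stay
        # behind in the log as garbage until something cleans them up.
        self.index[block_id] = len(self.log)
        self.log.append((block_id, data))

    def read(self, block_id):
        # The index points at the most recent version of the block.
        return self.log[self.index[block_id]][1]


fs = ToyLog()
fs.write("a", b"first version")
fs.write("b", b"other block")
fs.write("a", b"second version")   # appended, not overwritten in place
print(fs.read("a"))
```

For a disk array, the appeal is that large sequential writes can fill whole stripes, which avoids the small-write penalty mentioned earlier in the talk.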
521 00:22:12,560 --> 00:22:16,310 Also, it's really a wonderful environment for the students 522 00:22:16,310 --> 00:22:18,860 involved because students are educating each other about what 523 00:22:18,860 --> 00:22:20,510 the important issues are, leaving 524 00:22:20,510 --> 00:22:23,900 the faculty out of that loop frequently, to their benefit, 525 00:22:23,900 --> 00:22:24,530 I think-- 526 00:22:24,530 --> 00:22:26,655 the benefit of the students, maybe not the faculty. 527 00:22:26,655 --> 00:22:28,810 528 00:22:28,810 --> 00:22:31,670 So in addition to all those things-- producing students, 529 00:22:31,670 --> 00:22:33,420 writing papers, thinking about building, 530 00:22:33,420 --> 00:22:35,080 we actually try and build some things. 531 00:22:35,080 --> 00:22:36,870 The first one we called RAID the first, 532 00:22:36,870 --> 00:22:39,600 and this was just an off-the-shelf system. 533 00:22:39,600 --> 00:22:42,180 It's been operational for several years. 534 00:22:42,180 --> 00:22:46,060 It's off-the-shelf, by now pretty old-fashioned technology: 535 00:22:46,060 --> 00:22:48,840 a Sun-4/280, some interface cards, 536 00:22:48,840 --> 00:22:50,340 and really old-fashioned disks-- 537 00:22:50,340 --> 00:22:55,950 5 and 1/4 inch disks, and we had 10 gigabytes of storage. 538 00:22:55,950 --> 00:22:58,890 But this was our testing ground of trying these RAID ideas out 539 00:22:58,890 --> 00:23:00,360 and see what we could learn. 540 00:23:00,360 --> 00:23:02,790 And what we decided the follow-on would be, 541 00:23:02,790 --> 00:23:04,710 which we call RAID the second, was 542 00:23:04,710 --> 00:23:06,570 going to try and actually explore 543 00:23:06,570 --> 00:23:09,990 the idea of a diskless supercomputer. 544 00:23:09,990 --> 00:23:14,370 The system is organized for high performance 545 00:23:14,370 --> 00:23:17,820 by integrating the network and the disk systems.
546 00:23:17,820 --> 00:23:21,120 This next slide gives you an idea what we're talking about. 547 00:23:21,120 --> 00:23:23,280 Here's the server on the right. 548 00:23:23,280 --> 00:23:25,860 Now, the server is really-- 549 00:23:25,860 --> 00:23:28,860 to my surprise, doesn't really care that much about the data. 550 00:23:28,860 --> 00:23:30,690 Servers care about metadata. 551 00:23:30,690 --> 00:23:32,760 But the servers which are typically built today 552 00:23:32,760 --> 00:23:36,210 is really just a workstation without a monitor on it. 553 00:23:36,210 --> 00:23:40,140 That memory system can be a bottleneck to the disk 554 00:23:40,140 --> 00:23:40,990 and to the network. 555 00:23:40,990 --> 00:23:44,430 So what we did instead is put memory on the I/O card 556 00:23:44,430 --> 00:23:46,200 so that we could have a memory, and we 557 00:23:46,200 --> 00:23:47,970 could design the number of megabytes 558 00:23:47,970 --> 00:23:49,950 per second of bandwidth of that memory device 559 00:23:49,950 --> 00:23:52,650 to connect the disk to the network. 560 00:23:52,650 --> 00:23:55,830 We happened to use an Ultranet, a commercial gigabit 561 00:23:55,830 --> 00:23:58,580 per second network over the standard HIPPI, which 562 00:23:58,580 --> 00:24:01,080 is the standard high-speed buses to connect to these network 563 00:24:01,080 --> 00:24:01,920 cards. 564 00:24:01,920 --> 00:24:04,860 So the server dealt with the metadata. 565 00:24:04,860 --> 00:24:07,972 These disks dealt with the data transfer themselves, 566 00:24:07,972 --> 00:24:09,930 but the memory is the buffer between these two, 567 00:24:09,930 --> 00:24:14,000 and then out over the network to the rest of the community. 568 00:24:14,000 --> 00:24:15,640 This is what we call RAID the second. 569 00:24:15,640 --> 00:24:20,290 What I'm going to do next is show a videotape of a demo. 
570 00:24:20,290 --> 00:24:22,000 And this demo in particular is going 571 00:24:22,000 --> 00:24:23,950 to show what happens in terms of performance 572 00:24:23,950 --> 00:24:26,920 by using graphics when a disk is to fail 573 00:24:26,920 --> 00:24:29,170 and showing continuous update on the fly. 574 00:24:29,170 --> 00:24:31,300 And then the actual daredevil demo, 575 00:24:31,300 --> 00:24:35,770 where a disk is going to be removed from the live array. 576 00:24:35,770 --> 00:24:38,650 And you'll see the data being reconstructed. 577 00:24:38,650 --> 00:24:40,420 The person giving this talk is 578 00:24:40,420 --> 00:24:42,765 Ed Lee, one of the several graduate students working 579 00:24:42,765 --> 00:24:43,390 on the project. 580 00:24:43,390 --> 00:24:45,760 His portion of the demo was doing the live portion 581 00:24:45,760 --> 00:24:48,340 of the reconstruction because he did the reconstruction 582 00:24:48,340 --> 00:24:50,170 software. 583 00:24:50,170 --> 00:24:52,430 And Lee's getting his PhD shortly 584 00:24:52,430 --> 00:24:55,323 and going to join DEC at the Systems Research Center. 585 00:24:55,323 --> 00:24:55,990 Here's the demo. 586 00:24:55,990 --> 00:24:56,882 [VIDEO PLAYBACK] 587 00:24:56,882 --> 00:25:01,040 - So as we mentioned before, our RAID [INAUDIBLE] server is 588 00:25:01,040 --> 00:25:04,470 based upon a RAID level 5 disk array, 589 00:25:04,470 --> 00:25:07,327 which means that in the event of a disk failure, 590 00:25:07,327 --> 00:25:11,880 we can continue supplying data uninterrupted to our clients. 591 00:25:11,880 --> 00:25:16,288 And also, we can dynamically rebuild the contents 592 00:25:16,288 --> 00:25:18,370 of the failed disk onto a spare disk 593 00:25:18,370 --> 00:25:21,145 without interrupting the flow of data to our client. 594 00:25:21,145 --> 00:25:23,770 So in this demo, there will be three primary parts. 595 00:25:23,770 --> 00:25:26,220 The first part, we will cause a disk to be failed.
596 00:25:26,220 --> 00:25:28,715 In the second part, we will rebuild 597 00:25:28,715 --> 00:25:31,390 the contents of the failed disk onto a spare disk. 598 00:25:31,390 --> 00:25:34,430 And in the third part, I will actually physically remove 599 00:25:34,430 --> 00:25:37,030 a disk that is in-- 600 00:25:37,030 --> 00:25:38,890 that is configured into the system. 601 00:25:38,890 --> 00:25:40,630 And in all three phases, you'll see 602 00:25:40,630 --> 00:25:45,950 that there is no interruption of data to our clients. 603 00:25:45,950 --> 00:25:51,556 So for the first part, we go back to our video display here. 604 00:25:51,556 --> 00:25:53,830 When we initially fail a disk, 605 00:25:53,830 --> 00:25:58,870 you'll see that there will be a brief pause while the RAID disk 606 00:25:58,870 --> 00:26:01,710 array reconfigures itself, [INAUDIBLE] the pause there, 607 00:26:01,710 --> 00:26:06,060 but now is continuing to display the frames at approximately 608 00:26:06,060 --> 00:26:08,190 its normal rate. 609 00:26:08,190 --> 00:26:12,790 If we go over to the performance monitor, 610 00:26:12,790 --> 00:26:16,000 you see that at the point at which the disk failed, 611 00:26:16,000 --> 00:26:19,300 there is a slight spike in CPU utilization caused 612 00:26:19,300 --> 00:26:22,990 by the extra overhead and rearranging certain components 613 00:26:22,990 --> 00:26:24,700 of the RAID driver. 614 00:26:24,700 --> 00:26:28,030 And afterwards, you see that the utilization is slightly higher 615 00:26:28,030 --> 00:26:32,630 due to the extra work required to reconstruct 616 00:26:32,630 --> 00:26:35,420 the contents of the failed disk as it 617 00:26:35,420 --> 00:26:38,010 is requested by the client. 618 00:26:38,010 --> 00:26:42,040 Now we'll start the rebuild process. 619 00:26:42,040 --> 00:26:43,730 And now the contents of the failed disk 620 00:26:43,730 --> 00:26:47,210 is being rebuilt onto a [INAUDIBLE].
621 00:26:47,210 --> 00:26:49,070 And you see that the CPU utilization 622 00:26:49,070 --> 00:26:51,110 has increased significantly. 623 00:26:51,110 --> 00:26:52,640 And this is due to-- 624 00:26:52,640 --> 00:26:53,990 by approximately 20%. 625 00:26:53,990 --> 00:26:57,680 And this is due to all the additional I/Os required 626 00:26:57,680 --> 00:27:01,820 to rebuild the contents of that failed disk. 627 00:27:01,820 --> 00:27:06,070 We know that the number of the disk I/O's reads and writes 628 00:27:06,070 --> 00:27:07,370 have not changed significantly. 629 00:27:07,370 --> 00:27:11,420 And this is because this performance monitor measured 630 00:27:11,420 --> 00:27:14,210 logical I/O that is performed through the disk array rather 631 00:27:14,210 --> 00:27:17,400 than the physical disk activity. 632 00:27:17,400 --> 00:27:19,510 So now the rebuild phase is in [INAUDIBLE].. 633 00:27:19,510 --> 00:27:21,700 And you see that the CPU utilization 634 00:27:21,700 --> 00:27:25,090 is returning to its approximately normal rate. 635 00:27:25,090 --> 00:27:28,960 And during that entire period, the supply 636 00:27:28,960 --> 00:27:31,481 of data for our clients was not interrupted. 637 00:27:31,481 --> 00:27:35,540 And the clients continued to display the images 638 00:27:35,540 --> 00:27:40,694 at approximately their former rates. 639 00:27:40,694 --> 00:27:44,740 OK, so for the final more dramatic portion of our demo, 640 00:27:44,740 --> 00:27:46,565 I will actually physically remove 641 00:27:46,565 --> 00:27:48,850 a disk which is currently configured 642 00:27:48,850 --> 00:27:51,870 and running in the RAID server. 643 00:27:51,870 --> 00:27:55,825 And you'll see that even in that case, 644 00:27:55,825 --> 00:28:00,885 RAID-II can provide the necessary data to its clients. 645 00:28:00,885 --> 00:28:03,360 Here, I am about to remove the disk. 646 00:28:03,360 --> 00:28:08,310 647 00:28:08,310 --> 00:28:11,016 Now I actually physically remove the disk. 
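Editorial sketch, not part of the demo: the reconstruction the demo relies on, serving reads from a failed disk and rebuilding it onto a spare, rests on XOR parity. The helper name `xor_blocks` and the sample block contents are assumptions for illustration. The actual RAID-II driver is not shown in the talk and certainly does more than this.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus a parity block over them.
data = [b"RAID", b"demo", b"disk"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails. XOR of the surviving data
# blocks and the parity block yields the missing block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)
```

Every rebuilt block requires reading all surviving disks in the stripe, which fits the extra physical I/O the speaker attributes to the rebuild phase.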
648 00:28:11,016 --> 00:28:14,960 As we go back to our monitor, you 649 00:28:14,960 --> 00:28:18,920 see that the client is still getting its necessary stream 650 00:28:18,920 --> 00:28:22,130 of data, and it's still running at approximately 651 00:28:22,130 --> 00:28:23,410 its original frame rate. 652 00:28:23,410 --> 00:28:27,350 653 00:28:27,350 --> 00:28:30,620 This shows that the dynamic reconstruction of data 654 00:28:30,620 --> 00:28:32,580 is working properly. 655 00:28:32,580 --> 00:28:36,010 656 00:28:36,010 --> 00:28:38,934 So that concludes the RAID demonstration. 657 00:28:38,934 --> 00:28:40,810 We'd like for you to keep in mind 658 00:28:40,810 --> 00:28:44,950 that many of the bottlenecks that we've illustrated today 659 00:28:44,950 --> 00:28:47,470 has been due to either our limitations 660 00:28:47,470 --> 00:28:50,230 on the client or the disk controller. 661 00:28:50,230 --> 00:28:53,420 So as we appropriate faster clients and disk controllers, 662 00:28:53,420 --> 00:28:56,720 we would expect performance to scale appropriately. 663 00:28:56,720 --> 00:28:58,687 So thank you for watching. 664 00:28:58,687 --> 00:28:59,270 [END PLAYBACK] 665 00:28:59,270 --> 00:29:00,860 So one of the questions was, well, 666 00:29:00,860 --> 00:29:03,380 just how well does all these redundancy schemes work? 667 00:29:03,380 --> 00:29:04,700 Remember, we're going to take this weakness 668 00:29:04,700 --> 00:29:05,900 and turn it into strength. 669 00:29:05,900 --> 00:29:07,580 How big a strength is it? 670 00:29:07,580 --> 00:29:11,270 And you can see that on this next slide. 671 00:29:11,270 --> 00:29:13,520 Everything's the same as the earlier slide. 672 00:29:13,520 --> 00:29:16,220 We're going to have a few more disks there 673 00:29:16,220 --> 00:29:18,920 to find the reliability if the cost is a little higher, 674 00:29:18,920 --> 00:29:21,180 but we can say the cost is about the same. 
675 00:29:21,180 --> 00:29:23,570 But the big change is the mean time to data loss. 676 00:29:23,570 --> 00:29:26,000 By having this many extra disks, we 677 00:29:26,000 --> 00:29:28,370 can take this weakness, remember, 678 00:29:28,370 --> 00:29:31,550 that was down to about a month of reliability, and instead, 679 00:29:31,550 --> 00:29:33,800 turns it into a tremendous strength-- 680 00:29:33,800 --> 00:29:36,373 a ridiculous calculation of ours. 681 00:29:36,373 --> 00:29:37,790 So what we wanted to do is make it 682 00:29:37,790 --> 00:29:40,490 as reliable or more reliable than a mainframe disk, 683 00:29:40,490 --> 00:29:43,280 and we can do that. 684 00:29:43,280 --> 00:29:45,050 In addition to doing these ideas, 685 00:29:45,050 --> 00:29:47,845 let's talk about the future of this. 686 00:29:47,845 --> 00:29:50,220 In a few years, we're going to have these one-inch disks. 687 00:29:50,220 --> 00:29:53,510 And you can imagine on a single notebook-sized computer having 688 00:29:53,510 --> 00:29:56,300 32 of 1-inch diameter disks provide you, 689 00:29:56,300 --> 00:29:58,790 say, 8 gigabytes of storage, more than 100 megabits 690 00:29:58,790 --> 00:30:04,010 per second transfer rate, and very, very high reliability. 691 00:30:04,010 --> 00:30:06,710 In addition to producing students, writing papers, 692 00:30:06,710 --> 00:30:09,500 giving talks, making videotapes, we also 693 00:30:09,500 --> 00:30:12,500 like to transfer the technology to companies to use it. 694 00:30:12,500 --> 00:30:15,470 There is even today amazingly a RAID newsletter 695 00:30:15,470 --> 00:30:17,810 to keep people up to date on what's happening in RAID. 696 00:30:17,810 --> 00:30:21,320 And I think there is-- it says 100, but maybe even 200 697 00:30:21,320 --> 00:30:24,620 companies involved in RAID systems are selling components. 
698 00:30:24,620 --> 00:30:27,500 Several of the major companies you'd recognize there 699 00:30:27,500 --> 00:30:30,740 are selling them right now, and other companies 700 00:30:30,740 --> 00:30:33,020 are developing RAID systems. 701 00:30:33,020 --> 00:30:34,880 At the bottom, I mention the IBM story. 702 00:30:34,880 --> 00:30:37,340 One of the companies there, Storage Technologies, 703 00:30:37,340 --> 00:30:40,408 decided to talk about their product, a very complicated 704 00:30:40,408 --> 00:30:42,200 product, that not only does the RAID ideas, 705 00:30:42,200 --> 00:30:46,250 but does compression and other things on the fly, which-- 706 00:30:46,250 --> 00:30:49,070 and their code name for it was the Iceberg Project. 707 00:30:49,070 --> 00:30:52,010 Well, they went around and gave 700 708 00:30:52,010 --> 00:30:55,550 non-disclosure presentations on Iceberg. 709 00:30:55,550 --> 00:30:59,270 So those of you who can imagine with human nature 710 00:30:59,270 --> 00:31:03,830 realize that 700 nondisclosures is an oxymoron. 711 00:31:03,830 --> 00:31:06,105 This is really obviously-- well, not obviously, 712 00:31:06,105 --> 00:31:08,480 unfortunately, even though they swore they wouldn't, they 713 00:31:08,480 --> 00:31:09,830 started talking about them. 714 00:31:09,830 --> 00:31:13,370 So IBM salesmen heard a lot about RAID. 715 00:31:13,370 --> 00:31:16,400 And so I was told shortly after that 716 00:31:16,400 --> 00:31:18,320 that IBM salesmen were going around saying 717 00:31:18,320 --> 00:31:20,150 RAID was a bad idea. 718 00:31:20,150 --> 00:31:22,550 So I knew this was a really good sign. 719 00:31:22,550 --> 00:31:25,925 If IBM salesmen are marketing against a research idea, 720 00:31:25,925 --> 00:31:27,050 something must be going on. 721 00:31:27,050 --> 00:31:29,222 And in fact, today, as I said, you 722 00:31:29,222 --> 00:31:30,680 can read about it in Byte magazine. 
723 00:31:30,680 --> 00:31:34,130 And IBM, their president of their disk company 724 00:31:34,130 --> 00:31:36,710 has said that RAID is just like a fine wine. 725 00:31:36,710 --> 00:31:38,660 They won't deliver it before its time. 726 00:31:38,660 --> 00:31:41,180 And when RAID's ready to go, IBM will be there. 727 00:31:41,180 --> 00:31:44,150 And that's the way they wanted to go with their mainframe. 728 00:31:44,150 --> 00:31:46,610 They already are developing and offering 729 00:31:46,610 --> 00:31:50,370 for sale RAID systems in other markets, 730 00:31:50,370 --> 00:31:52,490 particularly their supercomputer kind of markets. 731 00:31:52,490 --> 00:31:56,510 But eventually, the expectations are that IBM will switch over 732 00:31:56,510 --> 00:32:00,487 to RAID for their big systems. 733 00:32:00,487 --> 00:32:02,070 So that's kind of the end of the RAIDs 734 00:32:02,070 --> 00:32:03,420 part of this presentation. 735 00:32:03,420 --> 00:32:04,950 And this is-- 736 00:32:04,950 --> 00:32:06,778 I've used past tense a lot because this 737 00:32:06,778 --> 00:32:08,070 is all done and all wrapped up. 738 00:32:08,070 --> 00:32:09,990 Just about everybody is going to be graduating soon 739 00:32:09,990 --> 00:32:11,970 from Berkeley, and people are using the ideas. 740 00:32:11,970 --> 00:32:13,410 What I'm going to talk about next 741 00:32:13,410 --> 00:32:15,240 is stuff that we're working on now 742 00:32:15,240 --> 00:32:19,830 and intend to work on in the future, and it's to be more futuristic. 743 00:32:19,830 --> 00:32:21,495 There's three pieces of this part. 744 00:32:21,495 --> 00:32:23,092 And we liked working on I/O so much 745 00:32:23,092 --> 00:32:24,550 we decided we'd keep working on it. 746 00:32:24,550 --> 00:32:27,180 In fact, if it was good to work on secondary storage of disks, 747 00:32:27,180 --> 00:32:29,790 well, tertiary storage must be wonderful. 748 00:32:29,790 --> 00:32:31,650 And that's what we're working on today.
749 00:32:31,650 --> 00:32:33,540 And if you can think as the small disk 750 00:32:33,540 --> 00:32:37,635 is what inspired us to work on the disks today-- 751 00:32:37,635 --> 00:32:40,655 752 00:32:40,655 --> 00:32:41,780 remember these small disks. 753 00:32:41,780 --> 00:32:42,950 What can we do with these things? 754 00:32:42,950 --> 00:32:44,867 I think what's inspiring this research project 755 00:32:44,867 --> 00:32:47,350 today are these tapes. 756 00:32:47,350 --> 00:32:51,370 This is a tape that's right out of your camcorder camera, 757 00:32:51,370 --> 00:32:53,920 and this tape can hold, amazingly 758 00:32:53,920 --> 00:32:56,170 enough, 5 gigabytes of information. 759 00:32:56,170 --> 00:32:58,990 Putting that in perspective, that's like 5,000 books. 760 00:32:58,990 --> 00:33:00,490 That's how much can be in this tape. 761 00:33:00,490 --> 00:33:02,430 This tape costs $10 to $20. 762 00:33:02,430 --> 00:33:03,880 I can hold it in my hand. 763 00:33:03,880 --> 00:33:05,680 So the question that we ask ourselves, gee, 764 00:33:05,680 --> 00:33:09,670 with a tape like this, how could a computer system designer 765 00:33:09,670 --> 00:33:11,260 use that and put it to use? 766 00:33:11,260 --> 00:33:12,640 And that's, I think, this tape is 767 00:33:12,640 --> 00:33:15,430 what's inspiring for the next part of our research effort. 768 00:33:15,430 --> 00:33:17,480 So there's three parts of this technology. 769 00:33:17,480 --> 00:33:19,840 It's these helical scan tapes, which 770 00:33:19,840 --> 00:33:22,900 I'll talk more what helical scan means shortly; 771 00:33:22,900 --> 00:33:25,150 tape robots that can hold lots of these tapes 772 00:33:25,150 --> 00:33:26,767 and make them automatically available; 773 00:33:26,767 --> 00:33:27,850 and then data compression.
774 00:33:27,850 --> 00:33:30,528 And that's-- I'll be spending the rest of this tape talking 775 00:33:30,528 --> 00:33:33,070 about these technologies, then talking about the applications 776 00:33:33,070 --> 00:33:33,987 of these technologies. 777 00:33:33,987 --> 00:33:36,430 But try and remember, you can't see my arms 778 00:33:36,430 --> 00:33:37,960 waving the whole time, but my arms 779 00:33:37,960 --> 00:33:38,940 are waving the whole time I'm talking. 780 00:33:38,940 --> 00:33:41,440 Because this is what we're going to do rather than something 781 00:33:41,440 --> 00:33:43,940 we've already done. 782 00:33:43,940 --> 00:33:46,310 So what's all this about tape? 783 00:33:46,310 --> 00:33:48,295 If I've been chairman at Berkeley for a while, 784 00:33:48,295 --> 00:33:49,670 and I go into some rooms and I'll 785 00:33:49,670 --> 00:33:52,592 see these rooms full of magnetic tapes. 786 00:33:52,592 --> 00:33:54,050 So pretty much what I've determined 787 00:33:54,050 --> 00:33:56,900 is tapes so far have been written once 788 00:33:56,900 --> 00:34:00,080 and read zero or one time. 789 00:34:00,080 --> 00:34:02,780 People do these backups of tremendous amounts 790 00:34:02,780 --> 00:34:05,095 of information, and I'll try and reclaim storage space. 791 00:34:05,095 --> 00:34:05,720 And oh, no, no. 792 00:34:05,720 --> 00:34:08,100 You can't get rid of that, that 1985 tape. 793 00:34:08,100 --> 00:34:10,022 Somebody may ask for that file some day. 794 00:34:10,022 --> 00:34:11,480 Now, I don't think very many people 795 00:34:11,480 --> 00:34:13,909 ask for 1985 tapes in this day and age, 796 00:34:13,909 --> 00:34:15,949 but we've got these roomful of these tapes. 797 00:34:15,949 --> 00:34:18,620 And they also get used for distributing software. 
798 00:34:18,620 --> 00:34:20,989 What I'm talking about, using tapes very differently, 799 00:34:20,989 --> 00:34:23,909 using them like you use disks, reading them and writing them, 800 00:34:23,909 --> 00:34:27,139 making them automatically available for people to use. 801 00:34:27,139 --> 00:34:30,409 And that's what the new opportunity is. 802 00:34:30,409 --> 00:34:32,090 So that's a fundamental relationship 803 00:34:32,090 --> 00:34:33,679 between tapes and disks. 804 00:34:33,679 --> 00:34:36,679 And longitudinal tapes actually use the same technology 805 00:34:36,679 --> 00:34:37,472 as hard disks. 806 00:34:37,472 --> 00:34:39,639 And they're going to track each other's improvement. 807 00:34:39,639 --> 00:34:41,969 So the media itself is pretty similar. 808 00:34:41,969 --> 00:34:44,389 The differences are inherent in the geometries. 809 00:34:44,389 --> 00:34:47,000 Disks-- remember, I showed you before-- has platters 810 00:34:47,000 --> 00:34:50,690 and randomly rotating with these gaps for these arms 811 00:34:50,690 --> 00:34:54,020 to move in and out gives them the random access. 812 00:34:54,020 --> 00:34:57,770 Because these arms are flying so close to all these platters, 813 00:34:57,770 --> 00:35:00,380 they have to seal it so the drive and the media 814 00:35:00,380 --> 00:35:01,760 are a single unit. 815 00:35:01,760 --> 00:35:05,880 In contrast, tapes are this magnetic information spread 816 00:35:05,880 --> 00:35:08,420 on these removable strips that are on a spool. 817 00:35:08,420 --> 00:35:11,130 There's sequential access, no random access. 818 00:35:11,130 --> 00:35:13,380 But because of the nature of the way the readers work, 819 00:35:13,380 --> 00:35:16,730 you can insert and remove these tapes 820 00:35:16,730 --> 00:35:19,260 so you can have many tapes to a reader. 
821 00:35:19,260 --> 00:35:21,180 And so it's got these fundamental advantages 822 00:35:21,180 --> 00:35:23,930 as a random access versus sequential access, 823 00:35:23,930 --> 00:35:26,480 and multiple media per reader versus one media per reader. 824 00:35:26,480 --> 00:35:28,610 So I expect 20 years from now, we'll 825 00:35:28,610 --> 00:35:32,530 still have these kind of relationships in these two. 826 00:35:32,530 --> 00:35:35,030 But there's this new technology, which I showed you earlier, 827 00:35:35,030 --> 00:35:36,080 called helical scan. 828 00:35:36,080 --> 00:35:38,480 So what helical scan does is different 829 00:35:38,480 --> 00:35:39,680 than the longitudinal tapes. 830 00:35:39,680 --> 00:35:44,730 Longitudinal tapes, the information is across the tape as it moves. 831 00:35:44,730 --> 00:35:47,000 So helical scan is the tape is spinning, 832 00:35:47,000 --> 00:35:49,170 it has the head at an angle to the tape, 833 00:35:49,170 --> 00:35:51,000 and it's spinning very, very rapidly, 834 00:35:51,000 --> 00:35:54,380 so as the tape moves by, it can record lots of information. 835 00:35:54,380 --> 00:35:56,840 So you can get factors of almost 100 increase 836 00:35:56,840 --> 00:35:59,630 in the density of the tapes with this new helical scan 837 00:35:59,630 --> 00:36:01,100 technology. 838 00:36:01,100 --> 00:36:02,880 Now, as you can see on the slide, 839 00:36:02,880 --> 00:36:05,720 this is not some exotic technology that's unavailable. 840 00:36:05,720 --> 00:36:07,280 This is pretty standard technology 841 00:36:07,280 --> 00:36:10,640 that's being used in every VCR, in every camcorder. 842 00:36:10,640 --> 00:36:13,370 And if you happen to have a digital audio tape stereo, 843 00:36:13,370 --> 00:36:15,600 it's being used in that technology as well.
844 00:36:15,600 --> 00:36:17,690 Now, you would think that there's, 845 00:36:17,690 --> 00:36:19,250 given the sequential nature, it would 846 00:36:19,250 --> 00:36:21,640 be very slow, which is true. 847 00:36:21,640 --> 00:36:23,640 It is pretty slow, especially compared to disks. 848 00:36:23,640 --> 00:36:26,090 But because they keep a longitudinal track, 849 00:36:26,090 --> 00:36:27,770 they have a fast search mode. 850 00:36:27,770 --> 00:36:30,710 So actually, here's three different-- 851 00:36:30,710 --> 00:36:32,090 these three different columns are 852 00:36:32,090 --> 00:36:34,640 examples of different kinds of helical scan technology. 853 00:36:34,640 --> 00:36:36,650 Random searches will take tens of seconds, 854 00:36:36,650 --> 00:36:38,192 where you might think they might take 855 00:36:38,192 --> 00:36:40,610 several minutes because they have fast search mode here. 856 00:36:40,610 --> 00:36:43,610 This shows the density of these tapes-- this one at about-- 857 00:36:43,610 --> 00:36:46,770 this is in megabytes, so this is almost 5 gigabytes here. 858 00:36:46,770 --> 00:36:49,350 Here's a couple of other ones that are going on. 859 00:36:49,350 --> 00:36:52,100 And this compares it to magnetic disks in the conventional tape. 860 00:36:52,100 --> 00:36:53,933 And you can see comparing conventional tape, 861 00:36:53,933 --> 00:36:59,710 a dramatic difference in the density of these technologies. 862 00:36:59,710 --> 00:37:02,500 Now usually when I give a talk, and there's 863 00:37:02,500 --> 00:37:04,960 people who can talk back, they-- somebody raises their hand 864 00:37:04,960 --> 00:37:07,330 and asks, what about optical disk? 865 00:37:07,330 --> 00:37:10,630 Optical disk is a very interesting technology, 866 00:37:10,630 --> 00:37:13,300 especially if you're going to make one copy of things 867 00:37:13,300 --> 00:37:17,320 and want to stamp it out for [INAUDIBLE] many people 868 00:37:17,320 --> 00:37:18,910 dramatically inexpensively. 
869 00:37:18,910 --> 00:37:21,280 For a couple of dollars, 1 gigabyte of information 870 00:37:21,280 --> 00:37:25,280 can be stamped out repeatedly and used at many places. 871 00:37:25,280 --> 00:37:30,610 But if we compare the media cost to these 8 millimeter 872 00:37:30,610 --> 00:37:32,110 tapes, the helical scan tapes, you 873 00:37:32,110 --> 00:37:34,360 see there's a very large difference, so that basically 874 00:37:34,360 --> 00:37:36,460 there's almost a factor of 100 difference in terms 875 00:37:36,460 --> 00:37:39,370 of the media in terms of dollars per megabyte 876 00:37:39,370 --> 00:37:41,770 of helical scan versus the optical disk. 877 00:37:41,770 --> 00:37:44,080 Optical disks are moving more or less at the rate 878 00:37:44,080 --> 00:37:46,570 the standards committees can agree on standards, 879 00:37:46,570 --> 00:37:48,220 while the helical scan tapes tend 880 00:37:48,220 --> 00:37:50,230 to be pushing the technology. 881 00:37:50,230 --> 00:37:52,300 Now, there's some-- so it's got that advantage, 882 00:37:52,300 --> 00:37:53,675 the helical scan tape, and that's 883 00:37:53,675 --> 00:37:56,020 what's, again, waving our hands is where we are today 884 00:37:56,020 --> 00:37:58,475 at Berkeley and pushing the helical scan tapes. 885 00:37:58,475 --> 00:38:00,100 But we've learned a few things that are 886 00:38:00,100 --> 00:38:01,450 disadvantages of these tapes. 887 00:38:01,450 --> 00:38:03,010 First of all, they wear out. 888 00:38:03,010 --> 00:38:07,780 Helical scan devices, because they move very rapidly 889 00:38:07,780 --> 00:38:10,930 over that tape, the tape can wear out more quickly. 890 00:38:10,930 --> 00:38:14,590 Even longitudinal tapes can only have thousands of passes. 891 00:38:14,590 --> 00:38:16,550 Moreover, the heads wear out as well. 892 00:38:16,550 --> 00:38:21,590 So that's some disadvantages of that technology. 
893 00:38:21,590 --> 00:38:24,325 So your economic model has to factor in the advantage 894 00:38:24,325 --> 00:38:25,390 of helical scan tapes. 895 00:38:25,390 --> 00:38:27,882 You have to subtract out the fact that the tapes wear out 896 00:38:27,882 --> 00:38:28,840 and the heads wear out. 897 00:38:28,840 --> 00:38:31,690 But there's still, when you have a factor of 100 advantage, 898 00:38:31,690 --> 00:38:33,310 that's a serious advantage. 899 00:38:33,310 --> 00:38:36,740 Also, right now, if you looked at the tapes the way they work, 900 00:38:36,740 --> 00:38:40,510 there's this very long rewind, eject, load, and spin-up times. 901 00:38:40,510 --> 00:38:42,370 But I believe that's not inherent. 902 00:38:42,370 --> 00:38:44,770 It's just there's no market yet for somebody 903 00:38:44,770 --> 00:38:46,990 using these tapes in this unconventional way, 904 00:38:46,990 --> 00:38:48,340 as if they were disks. 905 00:38:48,340 --> 00:38:51,070 If that was important, I believe engineers could design them 906 00:38:51,070 --> 00:38:57,400 that they wouldn't be so slow in terms of loading and rewinding. 907 00:38:57,400 --> 00:38:59,380 Now, that's the tapes themselves. 908 00:38:59,380 --> 00:39:01,810 What about the tape robots, the second piece 909 00:39:01,810 --> 00:39:03,610 of this technology? 910 00:39:03,610 --> 00:39:06,640 If you were able to see this setup that we 911 00:39:06,640 --> 00:39:09,190 have in this room, we could fit on the same stage 912 00:39:09,190 --> 00:39:13,750 where I'm talking a 10 foot by 8 foot monster called the Storage 913 00:39:13,750 --> 00:39:15,430 Technologies 4400. 914 00:39:15,430 --> 00:39:17,527 This robot, which isn't that big, and at a half 915 00:39:17,527 --> 00:39:19,360 million dollars isn't really that expensive, 916 00:39:19,360 --> 00:39:22,420 could hold 6,000 tapes. 917 00:39:22,420 --> 00:39:25,600 In 1992, that would be about 5 terabytes. 
918 00:39:25,600 --> 00:39:28,330 Next year they're going to make the transition 919 00:39:28,330 --> 00:39:31,060 from longitudinal to helical scan, 920 00:39:31,060 --> 00:39:36,205 so that same robot could hold 120 terabytes. 921 00:39:36,205 --> 00:39:38,080 When we start talking about numbers like this 922 00:39:38,080 --> 00:39:40,420 it's hard to keep track of what we're talking about. 923 00:39:40,420 --> 00:39:42,850 It's like talking about the national debt. 924 00:39:42,850 --> 00:39:46,930 A trillion dollars-- how much is a trillion dollars? 925 00:39:46,930 --> 00:39:49,070 An extra trillion dollars-- is that bad? 926 00:39:49,070 --> 00:39:50,270 What does it mean? 927 00:39:50,270 --> 00:39:53,980 Well, terabytes-- 120 terabytes is a fantastic amount 928 00:39:53,980 --> 00:39:55,090 of information. 929 00:39:55,090 --> 00:39:58,150 If we were to go to the Library of Congress 930 00:39:58,150 --> 00:40:01,900 and see their attempt to capture the sum of humankind's 931 00:40:01,900 --> 00:40:04,660 knowledge, to give you an idea of how aggressive they are 932 00:40:04,660 --> 00:40:07,260 in getting information, 2/3 of their holdings 933 00:40:07,260 --> 00:40:08,260 are not even in English. 934 00:40:08,260 --> 00:40:10,480 They're trying to get everything that's been printed 935 00:40:10,480 --> 00:40:12,670 and keep it at the Library of Congress. 936 00:40:12,670 --> 00:40:15,700 If we could magically transform all the texts and all 937 00:40:15,700 --> 00:40:18,340 of those books and put it onto a computer, 938 00:40:18,340 --> 00:40:21,460 it's been estimated that's about 20 terabytes. 939 00:40:21,460 --> 00:40:24,190 So with 120 terabytes, we could have six copies 940 00:40:24,190 --> 00:40:27,970 of all of humankind's knowledge right up here 941 00:40:27,970 --> 00:40:31,130 on the stage next to me for about a half a million. 942 00:40:31,130 --> 00:40:35,960 So this is an extraordinary amount of information. 
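The robot figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: the tape count, capacities, robot price, and Library of Congress estimate are the ones stated in the talk, while the decimal unit convention (1 TB = 1000 GB) and all variable names are assumptions made for illustration.

```python
# Back-of-the-envelope check of the tape robot figures quoted in the talk.
# Assumption: decimal units, 1 TB = 1000 GB. Variable names are illustrative.
TAPES_IN_ROBOT = 6_000
ROBOT_COST_DOLLARS = 500_000
LONGITUDINAL_TB = 5            # robot capacity quoted for 1992
HELICAL_TB = 120               # robot capacity after the move to helical scan
LIBRARY_OF_CONGRESS_TB = 20    # estimated text of the Library of Congress
GB_PER_TB = 1_000

# Capacity each tape must hold for the robot totals to work out.
gb_per_tape_longitudinal = LONGITUDINAL_TB * GB_PER_TB / TAPES_IN_ROBOT
gb_per_tape_helical = HELICAL_TB * GB_PER_TB / TAPES_IN_ROBOT

# Growth from longitudinal to helical scan, and copies of the library text.
capacity_gain = HELICAL_TB / LONGITUDINAL_TB
library_copies = HELICAL_TB / LIBRARY_OF_CONGRESS_TB

# Robot price spread over its capacity (media cost is not included).
robot_dollars_per_gb = ROBOT_COST_DOLLARS / (HELICAL_TB * GB_PER_TB)

print(f"GB per tape, longitudinal: {gb_per_tape_longitudinal:.2f}")
print(f"GB per tape, helical scan: {gb_per_tape_helical:.2f}")
print(f"capacity gain: {capacity_gain:.0f}x")
print(f"copies of the library text: {library_copies:.0f}")
print(f"robot dollars per GB: {robot_dollars_per_gb:.2f}")
```

Under these assumptions the six copies quoted in the talk fall straight out of 120 divided by 20, and the implied per-tape capacity rises from under 1 GB to 20 GB.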
943 00:40:35,960 --> 00:40:38,740 So the tape robots which are available right now 944 00:40:38,740 --> 00:40:43,075 give us this automatic access, still with a long access time. 945 00:40:43,075 --> 00:40:44,950 Now, those of you who've been in the industry 946 00:40:44,950 --> 00:40:47,140 for a while would seem-- even this is-- 947 00:40:47,140 --> 00:40:50,680 seems like this is infinite amount of information. 948 00:40:50,680 --> 00:40:54,110 But you know when dealing with programmers that they always 949 00:40:54,110 --> 00:40:54,610 complain. 950 00:40:54,610 --> 00:40:56,150 So this can't be enough. 951 00:40:56,150 --> 00:40:57,820 So we have to get even more than this, 952 00:40:57,820 --> 00:41:02,380 and that leads to the next topic of data compression. 953 00:41:02,380 --> 00:41:06,830 So compression has some simple terms, easy to figure out. 954 00:41:06,830 --> 00:41:09,340 The first one is a style of compression called lossless, 955 00:41:09,340 --> 00:41:11,680 and that gives you typically factors of 2 or 3. 956 00:41:11,680 --> 00:41:14,200 And your contract is you won't lose any bits, 957 00:41:14,200 --> 00:41:16,720 but it will take one half to one third the space. 958 00:41:16,720 --> 00:41:19,750 And text is typically done with compression. 959 00:41:19,750 --> 00:41:22,542 The second category called lossy allows 960 00:41:22,542 --> 00:41:24,250 you to lose some bits as long as it still 961 00:41:24,250 --> 00:41:25,420 looks good to the visual-- 962 00:41:25,420 --> 00:41:27,760 to the human eye, visually interesting. 963 00:41:27,760 --> 00:41:28,900 So these are images. 964 00:41:28,900 --> 00:41:30,940 And sometimes they get factors of 20, 965 00:41:30,940 --> 00:41:33,650 sometimes less, sometimes much more than that. 
966 00:41:33,650 --> 00:41:35,680 So the question as computer systems designers 967 00:41:35,680 --> 00:41:37,840 with this new technology is, where 968 00:41:37,840 --> 00:41:42,040 is compression going to be used, and where should it be used? 969 00:41:42,040 --> 00:41:44,910 So this complicated line drawing here, 970 00:41:44,910 --> 00:41:48,510 let me go show you how that works. 971 00:41:48,510 --> 00:41:49,650 Here's the tape itself. 972 00:41:49,650 --> 00:41:50,820 Here is its controller. 973 00:41:50,820 --> 00:41:53,340 It's connected over a cable, a SCSI cable 974 00:41:53,340 --> 00:41:55,200 to the SCSI host bus adapter-- 975 00:41:55,200 --> 00:41:58,080 SCSI is one of those cables that connect peripherals 976 00:41:58,080 --> 00:42:01,350 to computers-- through a file server, over a long haul 977 00:42:01,350 --> 00:42:04,260 network, to another file server, over a local area network, 978 00:42:04,260 --> 00:42:07,530 to the processor and memory, and it shows up on the screen. 979 00:42:07,530 --> 00:42:09,550 So where is it being used today? 980 00:42:09,550 --> 00:42:11,850 Well, the compression is being used right now 981 00:42:11,850 --> 00:42:12,810 in the controller. 982 00:42:12,810 --> 00:42:14,700 And in fact, you have to be pretty careful 983 00:42:14,700 --> 00:42:18,120 when you buy either a tape controller or a disk 984 00:42:18,120 --> 00:42:20,520 to find out whether or not when they 985 00:42:20,520 --> 00:42:23,243 argue the capacity of this tape or the disk, 986 00:42:23,243 --> 00:42:24,910 whether they're counting on compression. 987 00:42:24,910 --> 00:42:27,150 So the same tape may hold 5 gigabytes, 988 00:42:27,150 --> 00:42:31,080 but it may be advertised as a 10 gigabyte or 15 gigabyte tape, 989 00:42:31,080 --> 00:42:33,750 because they're counting on getting this factor of 2 or 3 990 00:42:33,750 --> 00:42:35,010 with lossless compression.
991 00:42:35,010 --> 00:42:36,120 So you have to-- 992 00:42:36,120 --> 00:42:39,480 user beware here, buyer beware. 993 00:42:39,480 --> 00:42:40,980 From a computer systems perspective, 994 00:42:40,980 --> 00:42:42,570 what do we think of that? 995 00:42:42,570 --> 00:42:45,270 Well, what we think is that's the stupidest place 996 00:42:45,270 --> 00:42:46,470 you could possibly do it. 997 00:42:46,470 --> 00:42:50,850 Because after you get that data and you decompress it, 998 00:42:50,850 --> 00:42:54,330 then you have to ship that larger piece of information 999 00:42:54,330 --> 00:42:57,090 by factors of 2 or 3 over each one of these cables, 1000 00:42:57,090 --> 00:43:00,160 all the way until just before it gets to the screen. 1001 00:43:00,160 --> 00:43:02,328 Therefore, you get the compression advantage 1002 00:43:02,328 --> 00:43:04,120 on the tape, but you don't get any benefit 1003 00:43:04,120 --> 00:43:06,510 on all of these wires. You have to pay extra, 1004 00:43:06,510 --> 00:43:08,552 or it takes longer to transfer the data to get there. 1005 00:43:08,552 --> 00:43:10,770 What we really want is just-in-time decompression. 1006 00:43:10,770 --> 00:43:13,420 What we want is keep it in the compressed form, 1007 00:43:13,420 --> 00:43:15,060 send it all the way around, and just 1008 00:43:15,060 --> 00:43:17,790 before it pops up on the screen it gets depressed-- 1009 00:43:17,790 --> 00:43:18,885 decompressed. 1010 00:43:18,885 --> 00:43:22,637 It may get depressed, too, but, it should get decompressed. 1011 00:43:22,637 --> 00:43:23,970 Now, there's advantages of that. 1012 00:43:23,970 --> 00:43:27,060 If we're up here, we know what kind of data it is. 1013 00:43:27,060 --> 00:43:30,720 And if it is an image, we can use image compression 1014 00:43:30,720 --> 00:43:32,790 and get even greater than factors of 2 or 3.
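The argument for just-in-time decompression can be made concrete with a small model. The following is a rough sketch, not a measurement: the list of hops is taken from the path described in the talk, while the function name `gb_per_hop`, the 5 GB payload, and the compression ratio of 2 are illustrative assumptions.

```python
# Toy model of where decompression happens along the path from tape to screen.
# The hops follow the path described in the talk; the function name, payload
# size, and compression ratio are illustrative assumptions.
HOPS = ["SCSI cable", "long haul network", "local area network", "memory bus"]


def gb_per_hop(compressed_gb, ratio, decompress_before_hop):
    """Return the gigabytes carried on each hop.

    Hops with an index below decompress_before_hop carry compressed data;
    every later hop carries the expanded data (compressed_gb * ratio).
    """
    return [
        compressed_gb if i < decompress_before_hop else compressed_gb * ratio
        for i in range(len(HOPS))
    ]


compressed_gb = 5.0   # data as stored on the tape
ratio = 2.0           # lossless compression factor, "typically 2 or 3"

# Decompress in the tape controller: every hop carries the expanded data.
in_controller = gb_per_hop(compressed_gb, ratio, decompress_before_hop=0)

# Just-in-time: data stays compressed across every hop.
just_in_time = gb_per_hop(compressed_gb, ratio, decompress_before_hop=len(HOPS))

for name, early, late in zip(HOPS, in_controller, just_in_time):
    print(f"{name:20s} controller: {early:5.1f} GB   just-in-time: {late:5.1f} GB")

print("total moved, controller:  ", sum(in_controller), "GB")
print("total moved, just-in-time:", sum(just_in_time), "GB")
```

In this model the traffic on every link shrinks by the compression ratio when decompression is deferred to the end of the path, which is the benefit the talk describes.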
1015 00:43:32,790 --> 00:43:34,470 So this is the right place to do it 1016 00:43:34,470 --> 00:43:37,720 for a couple of different reasons. 1017 00:43:37,720 --> 00:43:39,420 Now, when we're talking about hundreds 1018 00:43:39,420 --> 00:43:40,920 of terabytes, what application would 1019 00:43:40,920 --> 00:43:43,050 we use at Berkeley to be able to drive this? 1020 00:43:43,050 --> 00:43:44,790 What we chose to use is this project 1021 00:43:44,790 --> 00:43:49,560 called Sequoia 2000, which is a global change research 1022 00:43:49,560 --> 00:43:52,110 effort involving Earth system scientists and computer 1023 00:43:52,110 --> 00:43:53,070 scientists. 1024 00:43:53,070 --> 00:43:55,620 These people are trying to worry about the problems facing 1025 00:43:55,620 --> 00:43:56,520 our planet. 1026 00:43:56,520 --> 00:43:59,183 They're worrying about CO2 content. 1027 00:43:59,183 --> 00:44:01,350 They're worrying about the melting of the snow caps. 1028 00:44:01,350 --> 00:44:03,030 They're worrying about the ozone holes. 1029 00:44:03,030 --> 00:44:04,447 These are the researchers that are 1030 00:44:04,447 --> 00:44:06,840 dealing with it, either with simulating data 1031 00:44:06,840 --> 00:44:09,430 or sensing data from space. 1032 00:44:09,430 --> 00:44:12,420 The project at Berkeley involves computer scientists 1033 00:44:12,420 --> 00:44:17,220 at several UC campuses, Earth system scientists at several UC 1034 00:44:17,220 --> 00:44:19,830 campuses, and even some people from the real world 1035 00:44:19,830 --> 00:44:26,160 who are trying to use this data to do public policy decisions. 1036 00:44:26,160 --> 00:44:31,110 These global change researchers are drowning in data. 1037 00:44:31,110 --> 00:44:34,050 They get all this data from remote sensing 1038 00:44:34,050 --> 00:44:36,540 from satellites in space that they need to deal with today. 1039 00:44:36,540 --> 00:44:39,750 Modelers will typically create a tenth of a terabyte per year.
1040 00:44:39,750 --> 00:44:41,280 In just a few years, they're going 1041 00:44:41,280 --> 00:44:45,660 to put up a series of satellites in space, that once they're 1042 00:44:45,660 --> 00:44:49,960 in place, they're going to broadcast 2 terabytes per day, 1043 00:44:49,960 --> 00:44:51,607 these bits raining down from space, 1044 00:44:51,607 --> 00:44:53,190 and they're going to keep broadcasting 1045 00:44:53,190 --> 00:44:56,320 that information for 15 years. 1046 00:44:56,320 --> 00:44:59,633 And what they want to do is capture all that information, 1047 00:44:59,633 --> 00:45:01,800 and so that it's digitally recorded so that they can 1048 00:45:01,800 --> 00:45:03,420 do simulations in the future to see 1049 00:45:03,420 --> 00:45:05,400 if their theories about the climate 1050 00:45:05,400 --> 00:45:08,050 bear up to this 15-year case study. 1051 00:45:08,050 --> 00:45:10,380 So this is really going to challenge 1052 00:45:10,380 --> 00:45:13,250 all levels of computer systems to be able to pull this off. 1053 00:45:13,250 --> 00:45:15,750 To give you an example of the type of thing we'd like to do, 1054 00:45:15,750 --> 00:45:18,995 I'm going to show you this next videotape as an example. 1055 00:45:18,995 --> 00:45:20,370 This is not something we've done. 1056 00:45:20,370 --> 00:45:21,840 This is the type of thing we'd like it 1057 00:45:21,840 --> 00:45:23,940 so that Earth's system scientists, global change 1058 00:45:23,940 --> 00:45:27,390 researchers, could try on their screen to find things out. 1059 00:45:27,390 --> 00:45:28,860 For this example, it's going to be 1060 00:45:28,860 --> 00:45:32,640 using green to represent the chlorophyll content of plants. 1061 00:45:32,640 --> 00:45:35,940 And what you're going to see is the trade winds blow the rain, 1062 00:45:35,940 --> 00:45:39,300 the chlorophyll will move across South America 1063 00:45:39,300 --> 00:45:40,980 and bumping into the Andes. 
1064 00:45:40,980 --> 00:45:43,170 And this would show what would happen 1065 00:45:43,170 --> 00:45:47,070 if there was an environmental accident on the East 1066 00:45:47,070 --> 00:45:50,910 Coast of South America, how rapidly that might contaminate 1067 00:45:50,910 --> 00:45:52,260 the South American continent. 1068 00:45:52,260 --> 00:45:54,970 And you can see as it juxtaposes this information 1069 00:45:54,970 --> 00:45:57,345 according to altitudes, you get some interesting insights 1070 00:45:57,345 --> 00:45:58,375 to what's going on. 1071 00:45:58,375 --> 00:46:00,000 This is an example of something that we 1072 00:46:00,000 --> 00:46:02,820 would like to do in the Sequoia Project 1073 00:46:02,820 --> 00:46:05,033 and should fire your imagination. 1074 00:46:05,033 --> 00:46:05,700 Here's the demo. 1075 00:46:05,700 --> 00:46:09,473 1076 00:46:09,473 --> 00:46:10,140 [VIDEO PLAYBACK] 1077 00:46:10,140 --> 00:46:14,540 - So now let's compare North America with South America. 1078 00:46:14,540 --> 00:46:17,150 The relation of vegetation production 1079 00:46:17,150 --> 00:46:21,650 to global climatic patterns is clearly reflected here. 1080 00:46:21,650 --> 00:46:24,500 The tropical easterly winds spread rain 1081 00:46:24,500 --> 00:46:28,210 across the continent to make it green. 1082 00:46:28,210 --> 00:46:30,620 They drop the last of their moisture 1083 00:46:30,620 --> 00:46:34,940 when they meet the tall Andes mountains on the left. 1084 00:46:34,940 --> 00:46:38,060 The black area of low production on the left 1085 00:46:38,060 --> 00:46:40,850 reflects the dry lands and the rain shadow 1086 00:46:40,850 --> 00:46:44,000 produced by the Andes. 1087 00:46:44,000 --> 00:46:46,610 Why, then, is there another black shadow 1088 00:46:46,610 --> 00:46:49,020 in the lower portion of South America, 1089 00:46:49,020 --> 00:46:51,980 but on the opposite side of the Andes?
1090 00:46:51,980 --> 00:46:55,730 This occurs at exactly 30 degrees South latitude, 1091 00:46:55,730 --> 00:46:58,490 where the tropical easterly winds shift 1092 00:46:58,490 --> 00:47:03,500 to the westerly winds which characterize temperate zones. 1093 00:47:03,500 --> 00:47:07,190 GRASS is a system designed to support scientific research 1094 00:47:07,190 --> 00:47:10,670 and to answer land management questions. 1095 00:47:10,670 --> 00:47:13,670 This presentation suggests only a few 1096 00:47:13,670 --> 00:47:16,190 of GRASS's potential applications. 1097 00:47:16,190 --> 00:47:19,490 It does illustrate how today's technology can integrate 1098 00:47:19,490 --> 00:47:22,880 the latest satellite imagery, computer manipulation 1099 00:47:22,880 --> 00:47:26,090 techniques, and hardware capabilities 1100 00:47:26,090 --> 00:47:29,780 for visualizing our fragile ecosystem in ways 1101 00:47:29,780 --> 00:47:32,268 not previously possible. 1102 00:47:32,268 --> 00:47:35,726 [MUSIC PLAYING] 1103 00:47:35,726 --> 00:48:01,347 1104 00:48:01,347 --> 00:48:01,930 [END PLAYBACK] 1105 00:48:01,930 --> 00:48:04,097 That's one application of the technologies 1106 00:48:04,097 --> 00:48:05,680 to help the global change researchers, 1107 00:48:05,680 --> 00:48:07,930 but another interesting application 1108 00:48:07,930 --> 00:48:10,340 is the electronic library or digital library. 1109 00:48:10,340 --> 00:48:12,988 And you can see that on the next slide. 1110 00:48:12,988 --> 00:48:14,530 If you visit the Berkeley campus, one 1111 00:48:14,530 --> 00:48:18,070 of our nicest buildings on campus, the Bancroft Library, 1112 00:48:18,070 --> 00:48:21,043 that has just 372,000 books in it. 1113 00:48:21,043 --> 00:48:22,460 And if you convert that into text, 1114 00:48:22,460 --> 00:48:24,160 that'd be about half a terabyte.
1115 00:48:24,160 --> 00:48:26,800 If instead of just having the text, if what you had 1116 00:48:26,800 --> 00:48:29,500 was the images of the full page, that 1117 00:48:29,500 --> 00:48:32,390 might take maybe 20 terabytes. 1118 00:48:32,390 --> 00:48:35,890 That's not a very big piece of that storage robot 1119 00:48:35,890 --> 00:48:37,450 that we talked about earlier. 1120 00:48:37,450 --> 00:48:39,100 Moreover, if you visited the campus 1121 00:48:39,100 --> 00:48:40,750 right now while we're filming, we're 1122 00:48:40,750 --> 00:48:44,200 in the middle of a four-year, $45 million project 1123 00:48:44,200 --> 00:48:49,570 to create a building to contain two million books. 1124 00:48:49,570 --> 00:48:51,700 Right now in the state of California, $45 million 1125 00:48:51,700 --> 00:48:53,510 is a lot of money. 1126 00:48:53,510 --> 00:48:57,730 And what we're doing is building this building for books. 1127 00:48:57,730 --> 00:48:59,320 How good an idea is that? 1128 00:48:59,320 --> 00:49:00,940 We could fit all that information, 1129 00:49:00,940 --> 00:49:03,730 even the page images, into one of these robots. 1130 00:49:03,730 --> 00:49:05,890 And it's pretty expensive to create this. 1131 00:49:05,890 --> 00:49:09,317 I wonder whether or not if you visit the Berkeley campus in 10 1132 00:49:09,317 --> 00:49:10,900 years and come on a guided tour, people 1133 00:49:10,900 --> 00:49:13,780 are going to refer to this new building as the mausoleum 1134 00:49:13,780 --> 00:49:16,780 of dead trees, Tien's folly, where they spent $45 million 1135 00:49:16,780 --> 00:49:18,670 to hold all these books. 1136 00:49:18,670 --> 00:49:21,400 And in fact, I've given this talk before 1137 00:49:21,400 --> 00:49:23,290 and talked about how libraries work 1138 00:49:23,290 --> 00:49:25,753 so much that I can see clearly in my future 1139 00:49:25,753 --> 00:49:28,420 how libraries are going to be so different from the way they are 1140 00:49:28,420 --> 00:49:29,110 today.
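The claim that the new building's two million books would fit in one robot, page images included, can be sanity-checked by scaling the Bancroft Library figures. This is a sketch under stated assumptions: the book counts and terabyte estimates are the ones quoted in the talk, while the decimal units, the assumption that per-book sizes stay the same for the new collection, and the variable names are illustrative.

```python
# Scale the Bancroft Library estimates up to the planned two-million-book building.
# Assumptions: decimal units (1 TB = 1,000,000 MB) and the same average size
# per book for the new collection. Variable names are illustrative.
BANCROFT_BOOKS = 372_000
BANCROFT_TEXT_TB = 0.5       # text only, as estimated in the talk
BANCROFT_IMAGES_TB = 20      # full page images, as estimated in the talk
NEW_BUILDING_BOOKS = 2_000_000
ROBOT_CAPACITY_TB = 120      # helical scan robot capacity quoted earlier in the talk
MB_PER_TB = 1_000_000

# Implied average size of one book in each representation.
text_mb_per_book = BANCROFT_TEXT_TB * MB_PER_TB / BANCROFT_BOOKS
image_mb_per_book = BANCROFT_IMAGES_TB * MB_PER_TB / BANCROFT_BOOKS

# Storage the new building's collection would need at the same per-book sizes.
new_text_tb = NEW_BUILDING_BOOKS * text_mb_per_book / MB_PER_TB
new_images_tb = NEW_BUILDING_BOOKS * image_mb_per_book / MB_PER_TB

print(f"text per book:   {text_mb_per_book:.2f} MB")
print(f"images per book: {image_mb_per_book:.2f} MB")
print(f"two million books as text:   {new_text_tb:.1f} TB")
print(f"two million books as images: {new_images_tb:.1f} TB")
print("fits in one robot:", new_images_tb <= ROBOT_CAPACITY_TB)
```

With these numbers the page images for two million books come to a little under 108 TB, which is inside the 120 TB quoted for a single robot.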
1141 00:49:29,110 --> 00:49:30,850 And when we look back at these times, 1142 00:49:30,850 --> 00:49:32,152 people are going to chuckle. 1143 00:49:32,152 --> 00:49:33,610 And to put that in perspective, let 1144 00:49:33,610 --> 00:49:35,420 me tell you what's happened in my lifetime 1145 00:49:35,420 --> 00:49:37,090 in terms of learning how to program, 1146 00:49:37,090 --> 00:49:40,970 and then we'll fast forward and talk about how libraries work. 1147 00:49:40,970 --> 00:49:44,207 So the way I learned to program is I would write my program out 1148 00:49:44,207 --> 00:49:45,040 on a sheet of paper. 1149 00:49:45,040 --> 00:49:46,810 I would then go to a vending machine, 1150 00:49:46,810 --> 00:49:48,352 put a quarter in the vending machine, 1151 00:49:48,352 --> 00:49:49,880 and get a stack of IBM cards. 1152 00:49:49,880 --> 00:49:51,910 I would then take my piece of paper, 1153 00:49:51,910 --> 00:49:54,220 wander over to the keypunch machine, and type 1154 00:49:54,220 --> 00:49:56,680 in, keypunch, put little holes in cardboard 1155 00:49:56,680 --> 00:49:58,030 for all of those characters. 1156 00:49:58,030 --> 00:49:59,950 I would take this stack of cards, 1157 00:49:59,950 --> 00:50:02,475 wander over to somebody at the counter, smile, 1158 00:50:02,475 --> 00:50:04,600 try and get to know the person, be the best friend, 1159 00:50:04,600 --> 00:50:06,280 hoping that that stack of cards will get 1160 00:50:06,280 --> 00:50:08,140 put at the front of some queue. 1161 00:50:08,140 --> 00:50:11,470 Then what I did was go home, come back the next morning, 1162 00:50:11,470 --> 00:50:14,260 look, bend down, find my slot with my name, 1163 00:50:14,260 --> 00:50:17,140 and pick out a line printer listing to see what happened. 1164 00:50:17,140 --> 00:50:19,940 Ah, left out a comma.
1165 00:50:19,940 --> 00:50:22,510 Take the cards, go over, get some more cards, 1166 00:50:22,510 --> 00:50:25,750 replicate, insert the comma, hand it to the guy, 1167 00:50:25,750 --> 00:50:30,400 smile, come back the next day for the next printing 1168 00:50:30,400 --> 00:50:31,280 of the cards. 1169 00:50:31,280 --> 00:50:34,060 Now, that is the way I learned to program. 1170 00:50:34,060 --> 00:50:35,830 How do people learn to program today? 1171 00:50:35,830 --> 00:50:37,900 I'm not sure that students today even know 1172 00:50:37,900 --> 00:50:39,370 the syntax of the language. 1173 00:50:39,370 --> 00:50:41,890 They start typing, commas get inserted automatically 1174 00:50:41,890 --> 00:50:43,120 by the editor. 1175 00:50:43,120 --> 00:50:44,962 I can imagine-- suppose what we did on 1176 00:50:44,962 --> 00:50:47,170 the Berkeley campus would say, boy, things are tough. 1177 00:50:47,170 --> 00:50:48,630 We're having to build this library, 1178 00:50:48,630 --> 00:50:49,630 it's using up our funds. 1179 00:50:49,630 --> 00:50:51,580 So what we're going to do is go back and teach programming 1180 00:50:51,580 --> 00:50:53,740 the way the faculty learned how to program. 1181 00:50:53,740 --> 00:50:55,780 We would have a riot on the Berkeley campus 1182 00:50:55,780 --> 00:50:57,460 to put People's Park to shame. 1183 00:50:57,460 --> 00:51:00,070 They'd say, nobody can learn to program that way. 1184 00:51:00,070 --> 00:51:02,440 That's prehistoric, impossible. 1185 00:51:02,440 --> 00:51:05,940 Only an idiot would even suggest it. 1186 00:51:05,940 --> 00:51:07,660 So let's talk about libraries. 1187 00:51:07,660 --> 00:51:09,820 Imagine we fast forward about 10 years, 1188 00:51:09,820 --> 00:51:13,250 and let's explain to the people 10 years in the future 1189 00:51:13,250 --> 00:51:15,200 how we use libraries today. 
1190 00:51:15,200 --> 00:51:18,370 So the way we use libraries today is if you're lucky, 1191 00:51:18,370 --> 00:51:20,140 and you've got an electronic card catalog, 1192 00:51:20,140 --> 00:51:22,015 you can find what you want and you write down 1193 00:51:22,015 --> 00:51:22,900 the call letters. 1194 00:51:22,900 --> 00:51:24,460 Then you get up out of your office, 1195 00:51:24,460 --> 00:51:28,978 or you come from home to the campus, wander around, 1196 00:51:28,978 --> 00:51:30,520 and if you have a special pass, you're 1197 00:51:30,520 --> 00:51:33,190 allowed to get into the stacks, go up and down the stacks, 1198 00:51:33,190 --> 00:51:36,370 find the call letters, look where the book's supposed 1199 00:51:36,370 --> 00:51:38,770 to be in the shelf, and then you look all around 1200 00:51:38,770 --> 00:51:41,060 because it's probably not where it's supposed to be. 1201 00:51:41,060 --> 00:51:42,310 And if it's not there, you look over 1202 00:51:42,310 --> 00:51:44,020 to the cart that's right next to it, which 1203 00:51:44,020 --> 00:51:45,853 has the books that haven't been put away yet 1204 00:51:45,853 --> 00:51:46,990 and see if you can find it. 1205 00:51:46,990 --> 00:51:49,060 If you're lucky, you find the book you want. 1206 00:51:49,060 --> 00:51:51,700 You wander down the stacks, go to the checkout desk. 1207 00:51:51,700 --> 00:51:53,800 If you're fortunate and you have a modern library, 1208 00:51:53,800 --> 00:51:56,050 they'll use a grocery scanner to scan it in. 1209 00:51:56,050 --> 00:51:58,410 If not, you have to write your name on this little card. 1210 00:51:58,410 --> 00:51:59,350 You take the book.
1211 00:51:59,350 --> 00:52:01,450 You go back to your office or go all the way home, 1212 00:52:01,450 --> 00:52:05,050 read the 10 or 20 pages you cared about, and then you try 1213 00:52:05,050 --> 00:52:07,090 and remember to take that book back, 1214 00:52:07,090 --> 00:52:10,720 because no one else gets to use that book while you're 1215 00:52:10,720 --> 00:52:11,360 having it. 1216 00:52:11,360 --> 00:52:14,050 And then when you do get that postcard in the mail 1217 00:52:14,050 --> 00:52:16,210 after three weeks or after the end of semester 1218 00:52:16,210 --> 00:52:18,130 to remind you to bring that book back, you bring it back, 1219 00:52:18,130 --> 00:52:19,450 and it gets put on the shelves. 1220 00:52:19,450 --> 00:52:21,970 So imagine telling somebody 10 years from now this is 1221 00:52:21,970 --> 00:52:24,400 how we did scholarly research. 1222 00:52:24,400 --> 00:52:27,760 This is how we found out what other people were doing. 1223 00:52:27,760 --> 00:52:30,640 They'll say, but you must never have read books. 1224 00:52:30,640 --> 00:52:33,460 You must not have taken any time at all to do that, 1225 00:52:33,460 --> 00:52:34,270 so it's so painful. 1226 00:52:34,270 --> 00:52:35,200 It must have taken you hours. 1227 00:52:35,200 --> 00:52:35,992 Yes, it took hours. 1228 00:52:35,992 --> 00:52:37,533 And it must have been very expensive. 1229 00:52:37,533 --> 00:52:38,750 Oh, it was very expensive. 1230 00:52:38,750 --> 00:52:41,470 Every time you check a book out of a library, 1231 00:52:41,470 --> 00:52:43,150 a library is doing a very good job 1232 00:52:43,150 --> 00:52:47,017 if it only cost them $1 to put the book back on the shelf. 1233 00:52:47,017 --> 00:52:49,600 So the best thing you could do to help libraries out right now 1234 00:52:49,600 --> 00:52:51,332 is to not check out any books. 1235 00:52:51,332 --> 00:52:52,790 That would save them a lot of money 1236 00:52:52,790 --> 00:52:55,510 if nobody checked out books. 
1237 00:52:55,510 --> 00:53:00,500 Libraries also have to buy books in anticipation of their use. 1238 00:53:00,500 --> 00:53:03,460 So a librarian is doing an outstanding job if only 1239 00:53:03,460 --> 00:53:06,790 20% of the books that they buy on the Berkeley campus, that's 1240 00:53:06,790 --> 00:53:08,050 about a $3 million budget. 1241 00:53:08,050 --> 00:53:10,960 So 20% of the books that they buy, no one ever, 1242 00:53:10,960 --> 00:53:12,370 ever checks out. 1243 00:53:12,370 --> 00:53:13,660 Not once. 1244 00:53:13,660 --> 00:53:16,600 So that's just-- those books just sit there and use up 1245 00:53:16,600 --> 00:53:18,032 shelf space. 1246 00:53:18,032 --> 00:53:19,990 Similarly, to create a catalog entry for a book 1247 00:53:19,990 --> 00:53:22,130 costs a big fraction of the price of the book. 1248 00:53:22,130 --> 00:53:25,240 So the system we have today is extraordinarily expensive, 1249 00:53:25,240 --> 00:53:27,040 and it's extraordinarily inconvenient. 1250 00:53:27,040 --> 00:53:29,560 Imagine having to get up out of your chair and go do that. 1251 00:53:29,560 --> 00:53:31,570 You can imagine 10 years or so in the future, 1252 00:53:31,570 --> 00:53:33,548 people are going to be doing-- searching 1253 00:53:33,548 --> 00:53:35,590 for the right information, getting those 10 or 20 1254 00:53:35,590 --> 00:53:37,715 pages they're interested in pop up on their screen, 1255 00:53:37,715 --> 00:53:40,358 reading that, inserting what they need, and move ahead. 1256 00:53:40,358 --> 00:53:42,400 It'll be dramatically different than we do today, 1257 00:53:42,400 --> 00:53:44,320 as dramatically different as the way I learned 1258 00:53:44,320 --> 00:53:46,930 to program versus the way people are learning to program today. 
1259 00:53:46,930 --> 00:53:48,160 In fact, I've given this talk enough 1260 00:53:48,160 --> 00:53:50,350 that whenever I go to the library I get angry that I've 1261 00:53:50,350 --> 00:53:51,880 got to go through all this rigmarole 1262 00:53:51,880 --> 00:53:53,470 because I know it's not necessary. 1263 00:53:53,470 --> 00:53:56,180 The technology is there to do that. 1264 00:53:56,180 --> 00:53:59,950 So let's start wrapping this talk up here. 1265 00:53:59,950 --> 00:54:01,900 What I'm talking about in the second half 1266 00:54:01,900 --> 00:54:04,100 is a new technology. 1267 00:54:04,100 --> 00:54:06,820 This is a pretty classic curve that's 1268 00:54:06,820 --> 00:54:09,743 been shown for a long time where there's DRAM and magnetic disks. 1269 00:54:09,743 --> 00:54:11,410 And this has been called the access gap. 1270 00:54:11,410 --> 00:54:14,170 This is a log scale in terms of dollars per megabyte. 1271 00:54:14,170 --> 00:54:16,120 This is the most expensive. 1272 00:54:16,120 --> 00:54:18,970 This is the fastest down here-- log scale in access time. 1273 00:54:18,970 --> 00:54:21,433 So DRAM's expensive and fast. 1274 00:54:21,433 --> 00:54:23,350 And this is an access gap that a lot of people 1275 00:54:23,350 --> 00:54:25,342 have tried to invent technologies to fill. 1276 00:54:25,342 --> 00:54:26,800 What I'm telling you, there's going 1277 00:54:26,800 --> 00:54:29,320 to be a new access gap, this robo-line tape that 1278 00:54:29,320 --> 00:54:32,410 is much cheaper and much slower, and how do we 1279 00:54:32,410 --> 00:54:36,528 figure out how to use that as systems designers? 1280 00:54:36,528 --> 00:54:38,820 So there's lots of research issues we've got to attack. 1281 00:54:38,820 --> 00:54:40,110 And again, remember, I'm waving my hands 1282 00:54:40,110 --> 00:54:41,587 because it's things we need to do.
1283 00:54:41,587 --> 00:54:43,920 There's, how are we going to manage three or four levels 1284 00:54:43,920 --> 00:54:45,503 of storage hierarchy, when in the past 1285 00:54:45,503 --> 00:54:46,860 we've only managed two? 1286 00:54:46,860 --> 00:54:49,440 How are we going to manage that inherent latency 1287 00:54:49,440 --> 00:54:52,920 of this new technology that's very cheap? 1288 00:54:52,920 --> 00:54:55,400 Other examples: what are we going to do with compression? 1289 00:54:55,400 --> 00:54:56,400 Can we do it on the fly? 1290 00:54:56,400 --> 00:54:57,233 What about hardware? 1291 00:54:57,233 --> 00:54:58,858 How are we going to keep this reliable? 1292 00:54:58,858 --> 00:55:00,570 If you have the sum of humankind's knowledge, 1293 00:55:00,570 --> 00:55:02,700 it's not OK if the sum of humankind's knowledge 1294 00:55:02,700 --> 00:55:03,540 goes down. 1295 00:55:03,540 --> 00:55:06,540 It's not OK if you lose Mark Twain's collected works. 1296 00:55:06,540 --> 00:55:09,070 And what does it mean to back up 100 terabytes? 1297 00:55:09,070 --> 00:55:11,350 So there's lots of interesting issues there. 1298 00:55:11,350 --> 00:55:15,960 So for my last slide, let me conclude with a prediction 1299 00:55:15,960 --> 00:55:19,320 that this new storage technology is going to really change 1300 00:55:19,320 --> 00:55:22,410 our society provided a couple of things, 1301 00:55:22,410 --> 00:55:24,275 that for a cost of a minicomputer, 1302 00:55:24,275 --> 00:55:25,650 if you can get a factor of 1,000, 1303 00:55:25,650 --> 00:55:28,290 that's a pretty big impact and that's going to change things. 1304 00:55:28,290 --> 00:55:30,510 The obstacles aren't technical, though, in my view. 1305 00:55:30,510 --> 00:55:33,090 The technical obstacles that come up I bet we can attack. 1306 00:55:33,090 --> 00:55:34,800 We've done it in the past.
1307 00:55:34,800 --> 00:55:36,568 First of all, it's the legal copyrights 1308 00:55:36,568 --> 00:55:37,860 that's going to be an obstacle. 1309 00:55:37,860 --> 00:55:40,740 I'm not allowed to make an online copy of all the books 1310 00:55:40,740 --> 00:55:43,860 in my library at home because the copyright says 1311 00:55:43,860 --> 00:55:45,030 that's a copy. 1312 00:55:45,030 --> 00:55:47,580 So copyright is an obstacle to online. 1313 00:55:47,580 --> 00:55:50,880 Similarly, business model is going to be an obstacle. 1314 00:55:50,880 --> 00:55:53,640 Paper-based publishers are used to having a book, 1315 00:55:53,640 --> 00:55:55,170 meaning they get money. 1316 00:55:55,170 --> 00:55:57,550 If they place that information online, 1317 00:55:57,550 --> 00:56:01,093 what guarantee do they have that they'll get any sales at all? 1318 00:56:01,093 --> 00:56:02,260 Why won't it just be copied? 1319 00:56:02,260 --> 00:56:05,850 So we have, as technologists, to provide those guarantees. 1320 00:56:05,850 --> 00:56:09,060 So my prediction is that by the end of this decade, 1321 00:56:09,060 --> 00:56:12,150 before the next century, if we can address the first two 1322 00:56:12,150 --> 00:56:15,360 non-technical issues, this factor of 1,000 1323 00:56:15,360 --> 00:56:16,920 increase in online storage is going 1324 00:56:16,920 --> 00:56:19,410 to have a much greater impact in our society 1325 00:56:19,410 --> 00:56:22,050 than this factor of 1,000 increase in CPU speed. 1326 00:56:22,050 --> 00:56:23,790 So thanks very much for your attention. 1327 00:56:23,790 --> 00:56:26,165 You really did stick through and listen to the whole tape 1328 00:56:26,165 --> 00:56:27,030 about input/output.
1329 00:56:27,030 --> 00:56:28,655 But I hope you could see from this tape 1330 00:56:28,655 --> 00:56:31,440 just why input/output is so much more exciting than processor 1331 00:56:31,440 --> 00:56:34,500 design, and you'll agree that terabytes is a lot more 1332 00:56:34,500 --> 00:56:35,760 important than teraflops. 1333 00:56:35,760 --> 00:56:36,480 Thanks very much. 1334 00:56:36,480 --> 00:56:42,420 1335 00:56:42,420 --> 00:56:44,400 Yeah, where's the applause? 1336 00:56:44,400 --> 00:56:46,050 Yeah, where is the applause? 1337 00:56:46,050 --> 00:56:46,920 If this was on TV, 1338 00:56:46,920 --> 00:56:48,210 we could insert the laughter. 1339 00:56:48,210 --> 00:56:51,560 [LAUGHTER] 1340 00:56:51,560 --> 00:57:31,000