[MUSIC PLAYING]

Hi, I'm Dick Sites.

And I'm Dirk Meyer. And we're going to talk today about Alpha.

In discussing Alpha, we make the distinction between the architecture as a paper document and the various implementations. I'm going to talk about the architecture, and then I'll be back a little bit later to talk about the first implementation of that architecture.

I'm going to talk about the Alpha architecture: about the goals, an overview of the architecture, about the things that are different from other RISC architectures, and then about some of the problems that we addressed in designing the architecture and how we thought about them.

Rich Witek and I are the co-architects for Alpha. My part of the Alpha design has been influenced substantially by three other architects: Fred Brooks, John Cocke, and Seymour Cray.

When we started the Alpha project in 1988, we set four goals for the architecture: performance, longevity, scalability, and generality.
In performance, we wanted the architecture to allow implementations that would be faster than anything else in the industry each year for the foreseeable future. That performance goal reflects itself in a number of the details of the architecture, as I'll show you in a few minutes.

For longevity, we wanted the architecture to last 15 or 20 or 25 years, much longer than a 10-year design cycle, and that longevity goal implied a number of design decisions, if we were honest about meeting that goal. The first design decision is that it had to be a 64-bit architecture. All the 32-bit architectures will run out of address bits sometime in the next 10 or 15 or 20 years. In fact, while we were designing the first chip, we ran out of address bits on VAXes last year.

In addition to being a 64-bit architecture, the longevity goal implied that implementations over the course of two decades would have to scale up in performance. We looked at how that scaling could happen. The first chip runs at 200 megahertz, a five-nanosecond cycle.
It looked unlikely that the chip clock speed would improve by a factor of 1,000 over 25 years, but the industry curves say that implementation performance does need to improve by a factor of 1,000 if the architecture is going to last that long. We thought it was realistic for the clock speed to improve by a factor of 10, so we looked for other places to pick up the remaining factor of 100.

If the clock speed isn't going to pick up by more than a factor of 10, then to gain more performance you have to do more work in every clock cycle. So very early in the design, we settled on the idea that the architecture would have to gracefully allow multiple instruction issue, that it would be necessary, for the longevity and performance goals, for the architecture to allow many instructions to be started every clock cycle. We thought realistically that over the course of a few decades, perhaps using compilers that our children build, compiler technology could pick up a factor of 10 using multiple instruction issue. That still left us a factor of 10 short.
So we looked at where else implementations could pick up performance, and the only other way we saw was to have multiple processors, perhaps as many as 10 processors in a box 25 years from now, executing pieces of a common program. So we focused the architecture, because of the longevity and scalability goals, on those three points: a very fast clock cycle, multiple instruction issue, and multiple processors.

Finally, we designed the architecture to be general. It's not just a Unix hotbox. It's designed to be able to run VMS, to run OSF/1, and to run other operating systems, to track different computing paradigms as the industry changes over the next 5 or 10 or 20 years. We also designed it to support a variety of computer languages: not just Fortran and C but also COBOL, Pascal, Ada, BASIC, and lots of other languages.

And finally, in looking at bringing a new architecture to the market, we focused on migrating our current customers from VAXes and from MIPS DECstations to the new architecture.
We rejected things like a hardware compatibility mode in favor of doing software translation of binary images and doing compilers with compatible front ends for source code recompilation.

Here's an overview of the architecture. It looks very much like other RISC machines on the surface. It's a 64-bit load/store RISC machine. Instructions are 32 bits wide. They describe operations on one-, two-, four-, or eight-byte integers. They describe operations on VAX single-precision and double-precision floating point, and also on IEEE single-precision and double-precision floating point.

There are 32 integer registers, each of 64 bits. One of those is hardwired to zero, and one of them is used as a stack pointer. There are also 32 floating-point registers, each 64 bits, and one of those is hardwired to zero.

In addition, the program counter is a full 64-bit virtual address, and there are a number of other counters and registers to allow writing real operating system multitasking software: things like a cycle counter for performance studies and a floating-point status register for IEEE rounding modes and status bits.
64-bit virtual addresses are used throughout the architecture, although implementations are allowed to implement a subset of the virtual address space so long as they check the high-order unimplemented bits for validity.

The instruction formats are very straightforward. There is a 6-bit opcode in every instruction. The first format is for CALL_PAL instructions. It has a 26-bit function field that selects one of a few dozen privileged subroutines that are used to execute a number of the complex operations that operating systems are built on top of. I'll discuss those operations in a few minutes.

The branch instructions have the 6-bit opcode and then a 5-bit number of a register that can be tested for positive, negative, zero, or non-zero, and, in the case of the integer registers, for even or odd. Then there's a 21-bit signed displacement field, actually a longword displacement, that describes a target instruction address anywhere within plus or minus four megabytes relative to the branch instruction.

The memory formats have the 6-bit opcode and then two register fields of five bits each.
The first field specifies the register to be loaded or stored. The second field specifies the base register for the memory address, and the remaining field specifies a 16-bit signed displacement from the base register.

The last format is used for operations between registers: a 6-bit opcode, two source registers, a function field that's really an extension of the opcode, and then a destination register. All the operations read two registers and write one register. They read all 64 bits of the registers, and they write all 64 bits of the registers.

One of the simplest ways of characterizing how this architecture is different from other architectures is by what's not there. There are no condition codes in the Alpha architecture, because they get in the way of doing multiple issue. If you issue six instructions at once, and they all potentially set the condition codes, and there's really only one condition code register, then you have to build a lot of very messy hardware to sort out which of those six instructions actually sets the condition code register.
And if the fourth of those takes an overflow, and then you discover on the fly that the third one has to set the condition code register, getting that right is very messy. So the condition code register is an example of a design technique that's difficult to use when you're planning on multiple instruction issue. So we don't have it in Alpha.

In a similar way, we don't have fixed registers for operations such as string pointers or multiply or divide. Some architectures have a multiplier-quotient register or dedicated registers for other operations. The problem with that is, if you try to launch, say, four instructions at once, and they're all multiplies, and there's only one multiplier-quotient register, it's a bottleneck. You either have to build hardware that shadows multiple copies of the MQ register, or you can't launch multiple multiplies. So we simply don't have it, and all of the operations are completely general between the general registers.

Finally, we don't have byte writes.
If you look at implementations of writing a single byte to memory, somewhere in real implementations memory is accessed in complete memory words, which are typically four or eight bytes wide. Updating a single byte in a memory word somewhere involves reading the entire memory word, possibly correcting a single-bit or double-bit error if the memory system has error correction, then updating the single byte, then recalculating any error correction or parity bits, and then finally writing the entire memory word.

That read-modify-write is really a sequence. And whether it's implemented in hardware in the memory board or in the CPU chip near the pins, wherever it's implemented, there's a tendency to have only one implementation of that sequencing hardware, and that makes it difficult to launch six independent byte writes and have them run in parallel. Again, the byte-write logic becomes a bottleneck. So we took the fairly radical move in Alpha of not having byte writes.
Instead, there are carefully tuned instruction sequences that allow a program to do the read-modify-write as a load, an in-register modify, and then a store. We included in the architecture enough design information to allow programs to do single-byte updates within the same memory word even in the face of other processors updating adjacent bytes in that memory word, and that can all work. And we expect, over the course of a couple of decades, that removing that bottleneck will allow much faster byte-write or byte-manipulation programs than some other architectures that have an instruction that looks like it would help instead.

Finally, we tried to avoid having first-implementation artifacts. In contrast to doing no chip design, shipping the first chip, and, if it's successful, then designing another one that's a little bit different and shipping that, we tried to design into the architecture solutions to all of the design problems we were aware of, even when that made the first chip implementation more difficult.

Finally, we have included in the architecture careful support for multiprocessing.
One of the design issues in multiprocessing is how to do atomic updates of shared memory locations between multiple processors. We picked a paradigm of load-locked / modify / store-conditional that is also used in the MIPS R4000 architecture. We chose that particular paradigm because it's the only thing we could find that would scale up with processor speed over a couple of decades.

The idea is to do a load instruction, but using load-locked, which remembers a few bits of state; then to modify, by doing an add or an OR or whatever the modification is, in registers; and then to do a store-conditional instruction back to the shared memory location. If that entire sequence, from the load-locked through the modify to the store-conditional, runs with no interrupts, no exceptions, and no interfering write from another processor, then the store-conditional stores and leaves a bit in a register for the program to test to determine that the store actually occurred.
If anything goes wrong, if there's an interrupt in the middle, if there's an exception on any of the operations, or if there's an interfering write from another processor, then the store-conditional doesn't store and instead sets the bit saying that it did not store. The program then needs to branch on that bit, go back around, and start again with the load-locked, retrying the modification and the store-conditional. When the entire sequence runs to completion, an atomic update has in fact been done with no interference.

There are two advantages to this paradigm. The first is that, in the absence of an interfering write, the entire load-locked / modify / store-conditional sequence can be done inside an on-chip processor cache. The other advantage is that multiple processors can be doing independent sequences in parallel. There's no common lock bit somewhere in a shared memory system.

For multiprocessor support, we also designed around read-write ordering.
Rather than the traditional strict read-write ordering, we specify in the Alpha architecture that, from the point of view of a second processor, reads and writes launched by a first processor can arrive at the second processor in an arbitrary order, and then we included a memory barrier instruction to limit the permutations allowed in an implementation. I'll talk about that more in a few minutes. Finally, there's the usual set of registers for writing real software, to keep track of which processor you're running on and of thread-specific context.

Here are a few things that are different in the Alpha architecture compared to traditional RISC architectures. First, we tried to avoid first-implementation expedients and instead to do clean features that we hope will last many decades. For example, rather than architecting strict read-write ordering, as I just talked about, we architected arbitrary read-write ordering between processors, with a memory barrier as a mechanism to specify the ordering exactly when it's needed.
It's possible that other architectures that specify strict read-write ordering will find that that's a performance bottleneck over the coming decades.

We also looked at issues like branch prediction and burying the latency of branches. Rather than doing something like a branch delay slot, which could bury one cycle this year or maybe two cycles, we looked at the long term and said that that's a technique that will not scale up well over a couple of decades and will not scale well with multiple instruction issue. So we looked at other ways of solving branch latency problems.

Similarly, we don't have in Alpha complex instructions that do multiple operations, such as a multiply-add operation, which is really two operations in one instruction. We instead seriously designed to allow implementations to gracefully do multiple instruction issue. We believe that, through multiple instruction issue, the same effects you get with the complex combined instructions can be achieved in Alpha, but achieved in a more straightforward way over the coming decades.
We also designed a hazard-free architecture. This means that if an instruction writes to a register and a later instruction reads from that register, the reading instruction always gets the written value: implementations must implement whatever pipeline interlocks or scoreboarding are necessary to deliver the same results on all implementations. There are no hazards where, in some implementations, a register can be read too soon and get an old value, and so be not binary compatible with a different implementation with different timing, where the register turns out to be read later and get a new value.

We also designed the Alpha architecture to have minimal global state. We've discovered, particularly in implementing VAXes, that having global state bits, such as interrupt enables or rounding-mode bits, gets in the way of building very fast pipelined implementations.
The tendency is, when global state bits are changed, to completely drain the pipeline, change the global state bits, and then restart the pipeline, in order to guarantee that subsequent instructions see the state change and previously issued instructions don't. That draining and restarting of the pipeline eventually becomes a performance bottleneck, particularly if the state bits need to be saved and restored at every subroutine boundary. So rather than having global state bits, we designed to minimize them and instead to put things like interrupt enable bits or rounding-mode bits in the instruction itself, so the entire instruction and its local state can be pipelined with no drains.

We also have no mode bits, such as a 32-bit mode versus a 64-bit mode, or VAX floating point versus IEEE floating point. This, again, minimizes the hardware complexity and maximizes the fraction of time that the pipeline can be kept full.
As we were doing the architecture design, we kept in mind a number of implementation issues: keeping many operations going simultaneously, having pipelined implementations, and having long memory latencies that would grow with time. So we included a number of features, such as specifying imprecise arithmetic exceptions rather than precise ones, as an enabling technology to allow multiple instruction issue to be done gracefully without a lot of hardware overhead. We also included things like prefetching operations as a mechanism to cover some of the main memory latencies that we anticipate over the coming years.

Finally, we included in the design a discussion of hardware-software agreements: suggested ground rules for hardware implementers and also for software implementers, so that, to the extent that both groups use the same ground rules, you end up with very high performance implementations running high performance software with very little work.

One of the things that's different about the Alpha architecture is the existence of PALcode. PAL stands for Privileged Architecture Library.
It's a set of privileged subroutines, a lot like the basic input/output system in an IBM PC. They perform specified functions, but how they are implemented inside the subroutines can vary from one implementation to another. The specified functions are the complex operations that an operating system is built on top of, such as taking or returning from an interrupt or exception, loading the memory management registers, or context switching. The PAL subroutines run with the interrupts turned off, and they may have implementation-specific access to physical hardware registers.

There's a different set of PALcode for each operating system, which allows the operating system to run on what's really an extended machine. For example, the port of VMS to Alpha runs with a set of PALcode that provides VAX-like 32 interrupt levels, four memory management modes, and a number of the complex operations that build interrupt vectors, including interlocked queue instructions.
422 00:22:21,660 --> 00:22:24,420 By having all of those in software, 423 00:22:24,420 --> 00:22:27,570 the actual Alpha hardware is quite simple, 424 00:22:27,570 --> 00:22:32,040 but the VMS port took two or three years less 425 00:22:32,040 --> 00:22:34,860 than if VMS had been completely rewritten to avoid 426 00:22:34,860 --> 00:22:37,650 using those VAX features. 427 00:22:37,650 --> 00:22:40,140 In a similar way, the OSF/1 operating system 428 00:22:40,140 --> 00:22:43,050 runs with its PAL code that presents to the operating 429 00:22:43,050 --> 00:22:47,850 system a simple, Unix-like view of the world that does not 430 00:22:47,850 --> 00:22:49,300 have 32 interrupt levels. 431 00:22:49,300 --> 00:22:52,500 It has eight interrupt levels and a number 432 00:22:52,500 --> 00:22:56,430 of other memory-mapping features that many Unix systems 433 00:22:56,430 --> 00:22:57,840 are used to. 434 00:22:57,840 --> 00:23:01,800 In a similar way, the Windows NT operating system 435 00:23:01,800 --> 00:23:04,890 has its own PAL code that presents 436 00:23:04,890 --> 00:23:06,570 a view of the world that's culturally 437 00:23:06,570 --> 00:23:12,385 compatible with the 486 and the MIPS NT implementations. 438 00:23:12,385 --> 00:23:14,760 So those are some of the things in the Alpha architecture 439 00:23:14,760 --> 00:23:17,430 that are different from other RISC architectures. 440 00:23:17,430 --> 00:23:20,520 Now I'm going to talk about a number of the design problems 441 00:23:20,520 --> 00:23:24,510 we had in trying to build a fast architecture to allow 442 00:23:24,510 --> 00:23:27,120 fast implementations. 443 00:23:27,120 --> 00:23:30,810 The first design problem is branch latency.
444 00:23:30,810 --> 00:23:34,200 The problem is when there's a discontinuity 445 00:23:34,200 --> 00:23:36,630 in the instruction stream, when a branch goes off 446 00:23:36,630 --> 00:23:42,870 to some other instruction, how to bury the time that 447 00:23:42,870 --> 00:23:44,790 must occur in real implementations 448 00:23:44,790 --> 00:23:48,180 before fetching the target instruction. 449 00:23:48,180 --> 00:23:50,640 Some implementations or some architectures 450 00:23:50,640 --> 00:23:54,570 specify branch delay slots in which, after a branch has 451 00:23:54,570 --> 00:23:57,120 executed, the following instruction is also 452 00:23:57,120 --> 00:24:00,840 executed on the theory that in real implementations 453 00:24:00,840 --> 00:24:04,680 it's already been pre-fetched, so executing it is almost free, 454 00:24:04,680 --> 00:24:06,780 and it can cover up the cycle of latency 455 00:24:06,780 --> 00:24:10,110 while the target address is being fetched. 456 00:24:10,110 --> 00:24:12,990 That was a good design technique in 1988, 457 00:24:12,990 --> 00:24:15,570 but already we've reached the point where, 458 00:24:15,570 --> 00:24:18,630 in real implementations that are quite fast, 459 00:24:18,630 --> 00:24:20,790 the branch latency that you need to cover up 460 00:24:20,790 --> 00:24:23,640 is two or three cycles instead of just one cycle, 461 00:24:23,640 --> 00:24:28,740 and an architected branch delay slot can only bury one cycle. 462 00:24:28,740 --> 00:24:33,180 In addition, branch delay slots don't make any sense 463 00:24:33,180 --> 00:24:36,240 when you are anticipating implementations 464 00:24:36,240 --> 00:24:38,250 that issue multiple instructions, 465 00:24:38,250 --> 00:24:40,860 and different implementations issue different numbers 466 00:24:40,860 --> 00:24:43,620 of instructions. 
467 00:24:43,620 --> 00:24:45,600 For example, if you have an implementation that 468 00:24:45,600 --> 00:24:49,110 issues a peak of three instructions every cycle 469 00:24:49,110 --> 00:24:50,690 and you have a branch delay slot that 470 00:24:50,690 --> 00:24:55,280 specifies one instruction is executed after each branch, 471 00:24:55,280 --> 00:24:58,460 then after every branch, the other two potential issue slots 472 00:24:58,460 --> 00:25:00,050 have to be wasted. 473 00:25:00,050 --> 00:25:03,780 You can only issue one of those three following instructions. 474 00:25:03,780 --> 00:25:06,110 If branches occur about every eight instructions, 475 00:25:06,110 --> 00:25:09,720 that can be quite a performance bottleneck. 476 00:25:09,720 --> 00:25:11,750 So rather than branch delay slots, 477 00:25:11,750 --> 00:25:14,060 we went back to the original problem 478 00:25:14,060 --> 00:25:17,390 of covering branch latency and looked for other solutions 479 00:25:17,390 --> 00:25:20,700 that we thought would scale well for a couple of decades. 480 00:25:20,700 --> 00:25:22,850 We didn't find any magic bullets, 481 00:25:22,850 --> 00:25:26,420 so instead we approached covering branch latency 482 00:25:26,420 --> 00:25:28,710 on three different fronts. 483 00:25:28,710 --> 00:25:32,150 First, we architected branch prediction hints 484 00:25:32,150 --> 00:25:34,160 for two-way branches. 485 00:25:34,160 --> 00:25:38,600 The rule is that a two-way branch that is a forward branch 486 00:25:38,600 --> 00:25:41,540 is predicted not to be taken, and a two-way branch 487 00:25:41,540 --> 00:25:44,840 that's a backward branch is predicted to be taken.
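A minimal C sketch of that static rule (an illustration, not Digital's hardware; it assumes the 21-bit signed displacement field of the Alpha conditional branch format):

```c
#include <stdint.h>

/* Sign-extend the 21-bit displacement field of an Alpha conditional
   branch instruction (assumed to occupy bits 20:0 of the word). */
int32_t branch_displacement(uint32_t instruction)
{
    return (int32_t)(instruction << 11) >> 11;  /* arithmetic shift */
}

/* The architected hint: backward branches (typically loops) are
   predicted taken, forward branches predicted not taken. */
int predict_taken(uint32_t instruction)
{
    return branch_displacement(instruction) < 0;
}
```

Because both the compiler and the hardware can apply this same rule, no prediction state needs to be stored or communicated.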
488 00:25:44,840 --> 00:25:48,320 To the extent that the hardware implementers and the compiler 489 00:25:48,320 --> 00:25:51,290 writers follow the same rule, you 490 00:25:51,290 --> 00:25:55,370 get a very fast, very simple implementation 491 00:25:55,370 --> 00:25:59,810 that involves less hardware than branch prediction caches 492 00:25:59,810 --> 00:26:02,340 or things like that. 493 00:26:02,340 --> 00:26:07,170 For branches that have calculated targets 494 00:26:07,170 --> 00:26:09,780 such as some subroutine calls or case 495 00:26:09,780 --> 00:26:13,350 statements or subroutine returns, 496 00:26:13,350 --> 00:26:15,850 there were bits left over in those instructions. 497 00:26:15,850 --> 00:26:20,130 So we use those bits as hints to the implementation about where 498 00:26:20,130 --> 00:26:22,980 the most likely target is. 499 00:26:22,980 --> 00:26:26,490 The low 14 bits of those branch instructions 500 00:26:26,490 --> 00:26:29,580 can be used to directly drive an instruction 501 00:26:29,580 --> 00:26:32,970 cache with the address of the most likely target. 502 00:26:32,970 --> 00:26:34,890 Because they are hints, implementations 503 00:26:34,890 --> 00:26:36,510 need not use them. 504 00:26:36,510 --> 00:26:39,480 Also, because they are hints, if the hint is wrong, 505 00:26:39,480 --> 00:26:42,900 implementations must eventually, although perhaps more slowly, 506 00:26:42,900 --> 00:26:45,300 fetch the correctly specified target 507 00:26:45,300 --> 00:26:47,790 and actually branch to it. 508 00:26:47,790 --> 00:26:50,550 We also specified in those hints whether a given 509 00:26:50,550 --> 00:26:54,270 calculated branch was a subroutine call or subroutine 510 00:26:54,270 --> 00:26:56,380 return or neither. 511 00:26:56,380 --> 00:26:59,910 That's enough information for an implementation 512 00:26:59,910 --> 00:27:05,230 to run a small stack of likely subroutine return addresses. 
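That small return-address stack can be modeled in a few lines of C. This is a toy sketch, not the actual hardware; the depth of four matches the EV4 stack mentioned in this talk, and a wrong prediction costs only time, never correctness, because the hints merely steer instruction fetch:

```c
#include <stdint.h>

#define RAS_DEPTH 4

typedef struct {
    uint64_t entry[RAS_DEPTH];
    int top;                 /* number of valid entries, 0..RAS_DEPTH */
} ras_t;

/* On a hinted subroutine call, push the address after the call. */
void ras_push(ras_t *s, uint64_t return_pc)
{
    if (s->top == RAS_DEPTH) {          /* overflow: drop the oldest */
        for (int i = 1; i < RAS_DEPTH; i++)
            s->entry[i - 1] = s->entry[i];
        s->top--;
    }
    s->entry[s->top++] = return_pc;
}

/* On a hinted return, pop a predicted I-cache fetch address
   (0 means "no prediction available"). */
uint64_t ras_pop(ras_t *s)
{
    return s->top ? s->entry[--s->top] : 0;
}
```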
513 00:27:05,230 --> 00:27:08,070 And in fact, as Dirk will talk about in the first EV4 514 00:27:08,070 --> 00:27:14,820 implementation, there is a four level subroutine return stack. 515 00:27:14,820 --> 00:27:16,440 That allows an implementation, when 516 00:27:16,440 --> 00:27:18,810 it encounters a subroutine return, 517 00:27:18,810 --> 00:27:21,930 to pull off the top of that implementation stack the most 518 00:27:21,930 --> 00:27:24,540 likely I-cache address to fetch from 519 00:27:24,540 --> 00:27:27,570 and, under good circumstances, have subroutine returns take 520 00:27:27,570 --> 00:27:30,430 zero or one cycles. 521 00:27:30,430 --> 00:27:34,620 That's an example of how we thought about problems, looked 522 00:27:34,620 --> 00:27:38,490 at existing solutions in the marketplace, 523 00:27:38,490 --> 00:27:41,370 asked the question of whether they would scale well 524 00:27:41,370 --> 00:27:43,450 over the coming decades. 525 00:27:43,450 --> 00:27:45,660 When the answer was yes, we used them 526 00:27:45,660 --> 00:27:48,180 and did not deviate from conventional RISC 527 00:27:48,180 --> 00:27:49,350 architectures. 528 00:27:49,350 --> 00:27:52,290 When the answer was no, we said, what problem 529 00:27:52,290 --> 00:27:54,270 are these techniques solving? 530 00:27:54,270 --> 00:27:56,760 Go back and look at the problem with a clean sheet of paper 531 00:27:56,760 --> 00:27:58,218 and say, how else could the problem 532 00:27:58,218 --> 00:28:01,110 be solved in a way that would scale well 533 00:28:01,110 --> 00:28:03,480 over a couple of decades? 534 00:28:03,480 --> 00:28:06,390 The last thing we did in trying to bury branch 535 00:28:06,390 --> 00:28:09,990 latency is to include conditional move instructions 536 00:28:09,990 --> 00:28:12,840 which test a register and, depending 537 00:28:12,840 --> 00:28:14,610 on the result of the test, conditionally 538 00:28:14,610 --> 00:28:17,310 move a second register to a third register.
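In compiler terms, the effect of a conditional move is what a C ternary gives you when it is mapped onto a compare plus conditional move rather than a branch; the instruction sequences named in the comments are illustrative, not guaranteed compiler output:

```c
#include <stdint.h>

/* Branch-free maximum: the comparison produces a value, and the
   "move" is selected by it, roughly CMPLT a,b,t / MOV a,r / CMOVNE t,b,r. */
int64_t max64(int64_t a, int64_t b)
{
    return (a < b) ? b : a;
}

/* Absolute value the same way: test the sign, conditionally move the
   negated value. No branch, so the basic block stays whole. */
int64_t abs64(int64_t a)
{
    int64_t n = -a;
    return (a < 0) ? n : a;
}
```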
539 00:28:17,310 --> 00:28:20,850 These can be used for simple cases of if-then-else 540 00:28:20,850 --> 00:28:22,950 or doing maximum or minimum. 541 00:28:22,950 --> 00:28:26,100 And when they are used, they completely eliminate branches. 542 00:28:26,100 --> 00:28:28,770 By eliminating branches, they eliminate the issues 543 00:28:28,770 --> 00:28:32,440 of branch delay slots and branch latencies and also, in fact, 544 00:28:32,440 --> 00:28:34,260 make basic blocks bigger and allow 545 00:28:34,260 --> 00:28:37,080 compilers to do more optimization and more code 546 00:28:37,080 --> 00:28:38,770 movement. 547 00:28:38,770 --> 00:28:41,910 So in summary, we rejected the traditional branch delay slots 548 00:28:41,910 --> 00:28:45,180 because they're incompatible with multiple instruction issue 549 00:28:45,180 --> 00:28:49,000 and with scaling up performance over many decades. 550 00:28:49,000 --> 00:28:52,830 We instead looked at fresh mechanisms that 551 00:28:52,830 --> 00:28:55,440 would give the same effect. 552 00:28:55,440 --> 00:28:57,240 Another design problem we looked at 553 00:28:57,240 --> 00:28:59,520 was how to handle arithmetic exceptions, things 554 00:28:59,520 --> 00:29:02,880 like overflows or divide by zero. 555 00:29:02,880 --> 00:29:05,400 Some architectures specify precise exceptions 556 00:29:05,400 --> 00:29:09,120 in which if an instruction gets an arithmetic exception, 557 00:29:09,120 --> 00:29:13,295 the following instructions must not have executed. 558 00:29:13,295 --> 00:29:14,670 Now, that makes it very difficult 559 00:29:14,670 --> 00:29:17,910 to do things like four-way instruction issue.
560 00:29:17,910 --> 00:29:21,030 You either cannot issue multiple instructions that could 561 00:29:21,030 --> 00:29:23,280 possibly get arithmetic exceptions, 562 00:29:23,280 --> 00:29:25,780 or you have to issue them and build a bunch of hardware, 563 00:29:25,780 --> 00:29:28,650 things like trap silos, that back out of the instructions 564 00:29:28,650 --> 00:29:31,470 that really should not have happened if you issue four 565 00:29:31,470 --> 00:29:34,420 and the second one overflows. 566 00:29:34,420 --> 00:29:38,160 So rather than specifying precise arithmetic exceptions, 567 00:29:38,160 --> 00:29:40,890 we specified imprecise arithmetic exceptions. 568 00:29:40,890 --> 00:29:45,300 We said, if you get an overflow, you find out about it later. 569 00:29:45,300 --> 00:29:46,895 Because the architecture is not tied 570 00:29:46,895 --> 00:29:48,270 to any particular implementation, 571 00:29:48,270 --> 00:29:50,670 the architecture doesn't say how much later. 572 00:29:50,670 --> 00:29:51,990 It just says later. 573 00:29:51,990 --> 00:29:54,990 It's the Cray-1 model of exception handling, 574 00:29:54,990 --> 00:29:57,090 but there are completely legitimate reasons 575 00:29:57,090 --> 00:30:01,690 for people to want to get precise exceptions. 576 00:30:01,690 --> 00:30:04,380 So we also included a trap barrier instruction 577 00:30:04,380 --> 00:30:06,180 that specifies, of all the instructions 578 00:30:06,180 --> 00:30:08,820 that have been issued before the barrier, 579 00:30:08,820 --> 00:30:12,390 all the preceding instructions must deliver any exceptions 580 00:30:12,390 --> 00:30:17,100 that they're going to deliver at the barrier or earlier, 581 00:30:17,100 --> 00:30:20,980 and then subsequent instructions can be issued.
582 00:30:20,980 --> 00:30:23,850 So if you write a program with no trap barriers, 583 00:30:23,850 --> 00:30:26,370 it runs at vector-like speeds, and you 584 00:30:26,370 --> 00:30:30,240 get information on something like an overflow that it 585 00:30:30,240 --> 00:30:32,430 occurred but not very precise information 586 00:30:32,430 --> 00:30:34,212 on where it occurred. 587 00:30:34,212 --> 00:30:35,670 If you do something like put a trap 588 00:30:35,670 --> 00:30:38,160 barrier at the end of every subroutine, 589 00:30:38,160 --> 00:30:41,160 that may involve draining a few cycles of the pipeline 590 00:30:41,160 --> 00:30:43,050 at the end of the subroutine, but it then 591 00:30:43,050 --> 00:30:46,650 localizes where the exception occurred to which subroutine. 592 00:30:46,650 --> 00:30:49,380 If you put the trap barrier at the end of every source 593 00:30:49,380 --> 00:30:51,360 language statement, you can localize 594 00:30:51,360 --> 00:30:55,388 the generation of an exception to which statement it was. 595 00:30:55,388 --> 00:30:57,930 And if you put a trap barrier after every single instruction, 596 00:30:57,930 --> 00:31:03,750 you can discover exactly which instruction generated it. 597 00:31:03,750 --> 00:31:06,570 Because the IEEE floating point standard 598 00:31:06,570 --> 00:31:10,560 has an extensive set of exception handling, 599 00:31:10,560 --> 00:31:13,440 we also included in the architecture manual 600 00:31:13,440 --> 00:31:17,560 a design that allows one trap barrier per basic block, 601 00:31:17,560 --> 00:31:21,390 rather than one per instruction, and still allows 602 00:31:21,390 --> 00:31:24,900 complete IEEE exception handling to be performed in software 603 00:31:24,900 --> 00:31:26,235 in the trap routine. 
604 00:31:26,235 --> 00:31:31,290 The basic idea is that there's a software constraint 605 00:31:31,290 --> 00:31:36,180 that the compiler not reuse registers within a basic block 606 00:31:36,180 --> 00:31:39,600 or not reuse registers before the trap barrier 607 00:31:39,600 --> 00:31:41,790 so that a software handler can successfully 608 00:31:41,790 --> 00:31:43,950 backtrack through the instruction stream, 609 00:31:43,950 --> 00:31:46,200 find out which exception occurred first, 610 00:31:46,200 --> 00:31:49,680 process that exception, and then continue. 611 00:31:49,680 --> 00:31:52,890 That allows a substantially faster implementation 612 00:31:52,890 --> 00:31:55,700 than stopping after every single instruction 613 00:31:55,700 --> 00:31:58,940 and waiting for any exceptions to occur. 614 00:31:58,940 --> 00:32:00,400 In addition to the trap barrier, we 615 00:32:00,400 --> 00:32:04,080 included exception enable disable bits 616 00:32:04,080 --> 00:32:09,790 in every instruction so that integer add, for instance, 617 00:32:09,790 --> 00:32:11,580 can be specified in the instruction 618 00:32:11,580 --> 00:32:15,480 to take an overflow trap or not to take an overflow trap. 619 00:32:15,480 --> 00:32:19,200 That allows languages such as Fortran that require overflows 620 00:32:19,200 --> 00:32:22,260 to be detected to simply compile into the instructions 621 00:32:22,260 --> 00:32:25,650 that trap on overflow and also allows languages such as C, 622 00:32:25,650 --> 00:32:28,680 which generally specify no overflow trapping, 623 00:32:28,680 --> 00:32:32,010 to compile into the instructions that do not trap on overflow. 624 00:32:32,010 --> 00:32:35,550 In a similar way, for floating point underflow, 625 00:32:35,550 --> 00:32:37,590 disabling the underflow allows the program 626 00:32:37,590 --> 00:32:40,540 to simply give a result of 0 and keep going.
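A hedged C sketch of that disabled-underflow behavior, substituting an explicit software check for what the hardware does when the trap is disabled (mul_flush_to_zero is a hypothetical helper for illustration, not an Alpha primitive):

```c
#include <float.h>

/* With underflow trapping disabled, a product too small for a
   normalized double is simply replaced by 0.0 and execution
   continues; no trap handler ever runs. */
double mul_flush_to_zero(double a, double b)
{
    double r = a * b;
    if (r > -DBL_MIN && r < DBL_MIN)   /* underflowed (or exactly 0) */
        return 0.0;
    return r;
}
```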
627 00:32:40,540 --> 00:32:42,570 Some of our customers have expressed 628 00:32:42,570 --> 00:32:44,750 a preference for that behavior. 629 00:32:44,750 --> 00:32:48,540 The whole point here is that by choosing compiler options, 630 00:32:48,540 --> 00:32:51,120 you get to choose what level of performance 631 00:32:51,120 --> 00:32:54,870 and what level of precision on exceptions you want rather 632 00:32:54,870 --> 00:32:58,480 than me choosing for you. 633 00:32:58,480 --> 00:33:00,660 So in summary on the arithmetic exceptions, 634 00:33:00,660 --> 00:33:04,620 we rejected an architecture that would require hardware 635 00:33:04,620 --> 00:33:07,060 implementations that issue multiple instructions 636 00:33:07,060 --> 00:33:10,530 to silo register values and then have to roll back 637 00:33:10,530 --> 00:33:12,910 if an exception occurs. 638 00:33:12,910 --> 00:33:17,442 We also rejected implementations that read operands 639 00:33:17,442 --> 00:33:19,650 and then check whether an overflow or something could 640 00:33:19,650 --> 00:33:20,950 possibly occur 641 00:33:20,950 --> 00:33:24,420 and, if so, delay issuing other instructions into the pipeline 642 00:33:24,420 --> 00:33:27,660 until the exception decision has been made. 643 00:33:27,660 --> 00:33:29,220 That doesn't work for multiple issue, 644 00:33:29,220 --> 00:33:31,740 and it doesn't work well for single issue 645 00:33:31,740 --> 00:33:34,830 if the read of the registers and the check of the values 646 00:33:34,830 --> 00:33:39,390 is actually two or three stages down the pipeline. 647 00:33:39,390 --> 00:33:42,710 Another problem we looked at was memory latency. 648 00:33:42,710 --> 00:33:44,840 Over the course of a few decades, 649 00:33:44,840 --> 00:33:48,290 memory latency will get longer and longer as processor 650 00:33:48,290 --> 00:33:52,430 speeds get faster and faster, but memory chip speeds don't.
651 00:33:52,430 --> 00:33:54,470 So we included in the architecture 652 00:33:54,470 --> 00:33:56,120 a pair of pre-fetching instructions 653 00:33:56,120 --> 00:33:58,700 as hints to an implementation. 654 00:33:58,700 --> 00:34:02,090 For example, a Fortran program working down 655 00:34:02,090 --> 00:34:04,460 one column of an array could issue 656 00:34:04,460 --> 00:34:06,470 the pre-fetching instructions to prefetch 657 00:34:06,470 --> 00:34:09,770 the second column of the array during the processing 658 00:34:09,770 --> 00:34:11,630 of the first column. 659 00:34:11,630 --> 00:34:13,850 It looked likely that those instructions 660 00:34:13,850 --> 00:34:19,429 could bury about 100 cycles of memory latency 661 00:34:19,429 --> 00:34:21,570 in a realistic way. 662 00:34:21,570 --> 00:34:25,280 We also looked at memory latency in multiprocessor systems. 663 00:34:25,280 --> 00:34:29,420 We specified that there is no implied read write ordering 664 00:34:29,420 --> 00:34:31,159 in the following sense. 665 00:34:31,159 --> 00:34:35,000 If one processor issues a sequence of reads and writes, 666 00:34:35,000 --> 00:34:37,550 they are allowed to arrive at a second processor 667 00:34:37,550 --> 00:34:38,960 in an arbitrary order. 668 00:34:38,960 --> 00:34:42,290 Implementations are allowed to rearrange reads and writes. 669 00:34:42,290 --> 00:34:45,409 That's an enabling technology that allows things 670 00:34:45,409 --> 00:34:49,489 like write buffers, or multi-bank caches, 671 00:34:49,489 --> 00:34:53,179 or routing networks between processors 672 00:34:53,179 --> 00:34:57,800 if you have lots of processors, or things like memory buses 673 00:34:57,800 --> 00:35:02,060 that do error detection and retry.
674 00:35:02,060 --> 00:35:04,190 For example, if you have a processor that 675 00:35:04,190 --> 00:35:08,670 does two writes and you pipeline those writes down a memory box 676 00:35:08,670 --> 00:35:13,190 and the first write arrives with bad parity, 677 00:35:13,190 --> 00:35:16,100 the second write then arrives, and then the first write 678 00:35:16,100 --> 00:35:18,980 is retried and arrives successfully 679 00:35:18,980 --> 00:35:21,290 with good parity on the second try. 680 00:35:21,290 --> 00:35:24,740 That's an example of a good implementation technique that 681 00:35:24,740 --> 00:35:26,750 has the effect of delivering the writes out 682 00:35:26,750 --> 00:35:30,110 of order to another processor. 683 00:35:30,110 --> 00:35:32,900 We wanted to allow implementations 684 00:35:32,900 --> 00:35:34,622 to run at very high speeds. 685 00:35:34,622 --> 00:35:36,080 And therefore, in the architecture, 686 00:35:36,080 --> 00:35:40,430 we do not require precise read write ordering. 687 00:35:40,430 --> 00:35:43,730 Instead, there's a memory barrier instruction that 688 00:35:43,730 --> 00:35:45,650 says, at this point, I care. 689 00:35:45,650 --> 00:35:47,330 At this point, all the reads and writes 690 00:35:47,330 --> 00:35:49,670 that have been issued by a given processor in front 691 00:35:49,670 --> 00:35:53,900 of the memory barrier have to be delivered to other processors 692 00:35:53,900 --> 00:35:55,910 before any reads and writes issued 693 00:35:55,910 --> 00:35:58,310 after the memory barrier. 694 00:35:58,310 --> 00:35:59,990 And the phrasing was chosen carefully. 695 00:35:59,990 --> 00:36:02,660 It doesn't say that the sending processor 696 00:36:02,660 --> 00:36:06,650 has to stop issuing instructions or run slowly. 697 00:36:06,650 --> 00:36:09,860 It simply says that whatever permutation 698 00:36:09,860 --> 00:36:12,260 could occur in the implementations, 699 00:36:12,260 --> 00:36:15,802 that there is a restriction at the memory barrier.
700 00:36:15,802 --> 00:36:18,260 That may mean, for instance, that the memory barrier itself 701 00:36:18,260 --> 00:36:20,480 is pipelined out along with reads and writes 702 00:36:20,480 --> 00:36:23,300 to other processors, or it may mean exactly 703 00:36:23,300 --> 00:36:27,890 at the memory barrier a pipeline memory bus sends a write 704 00:36:27,890 --> 00:36:30,050 and then waits a few cycles to get confirmation 705 00:36:30,050 --> 00:36:33,290 that that write arrived successfully 706 00:36:33,290 --> 00:36:37,620 before sending writes that come after the memory barrier. 707 00:36:37,620 --> 00:36:40,820 So we view this as an enabling technology for very high speed 708 00:36:40,820 --> 00:36:42,140 implementations. 709 00:36:42,140 --> 00:36:45,860 It's one of the design ideas that 710 00:36:45,860 --> 00:36:48,770 led Cray Research to choose the Alpha architecture 711 00:36:48,770 --> 00:36:52,310 for building a 1,000-processor or bigger massively parallel 712 00:36:52,310 --> 00:36:54,680 processor. 713 00:36:54,680 --> 00:36:57,500 So in summary on the memory system design, 714 00:36:57,500 --> 00:36:59,435 we rejected things like implicit read 715 00:36:59,435 --> 00:37:02,880 write ordering in favor of explicit programmer 716 00:37:02,880 --> 00:37:06,830 statement of where overlap can occur 717 00:37:06,830 --> 00:37:10,080 and where overlap must not occur. 718 00:37:10,080 --> 00:37:14,240 So the Alpha architecture is the new 64-bit architecture 719 00:37:14,240 --> 00:37:16,790 designed to last a long time, designed 720 00:37:16,790 --> 00:37:19,580 to allow high performance implementations 721 00:37:19,580 --> 00:37:22,490 with specific emphasis on allowing multiple instruction 722 00:37:22,490 --> 00:37:28,250 issue, multiple processors, and very fast clock rates. 723 00:37:28,250 --> 00:37:31,490 The architecture supports a very wide range of operating systems 724 00:37:31,490 --> 00:37:33,070 and compiler languages.
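The memory barrier discipline described above can be sketched with C11 atomics, where atomic_thread_fence stands in for the role Alpha's MB instruction plays between a data write and the flag that publishes it (a sketch of the programming rule, not of any particular hardware):

```c
#include <stdatomic.h>

static int payload;          /* ordinary shared data         */
static atomic_int ready;     /* flag that publishes the data */

void producer(void)
{
    payload = 42;                                   /* write the data    */
    atomic_thread_fence(memory_order_release);      /* barrier: data first */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                           /* wait for the flag   */
    atomic_thread_fence(memory_order_acquire);      /* barrier: flag first */
    return payload;                                 /* now sees 42         */
}
```

Without the two fences, the rule that reads and writes may arrive at another processor in arbitrary order would let that processor observe the flag before the data.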
725 00:37:33,070 --> 00:37:35,700 726 00:37:35,700 --> 00:37:38,090 Some of you who looked at earlier tapes in this series 727 00:37:38,090 --> 00:37:39,980 may have seen Dave Patterson when 728 00:37:39,980 --> 00:37:44,270 he predicted that by somewhere between 1993 and '96, 729 00:37:44,270 --> 00:37:47,660 there would be super microprocessors running 730 00:37:47,660 --> 00:37:51,080 anywhere from desktops to supercomputers. 731 00:37:51,080 --> 00:37:54,260 We're very happy with Alpha to deliver on Dave's prediction 732 00:37:54,260 --> 00:37:56,360 a year early. 733 00:37:56,360 --> 00:37:58,040 I think now it's time for Dirk to talk 734 00:37:58,040 --> 00:38:01,700 about the first implementation. 735 00:38:01,700 --> 00:38:04,940 As Dick said, I am here to tell you about the first Alpha 736 00:38:04,940 --> 00:38:07,220 implementation, which internal to [INAUDIBLE], 737 00:38:07,220 --> 00:38:09,087 we refer to as EV4. 738 00:38:09,087 --> 00:38:11,670 Since I'm going to use that term throughout this presentation, 739 00:38:11,670 --> 00:38:13,730 first a couple of words on the term. 740 00:38:13,730 --> 00:38:16,580 When the program started, the original name 741 00:38:16,580 --> 00:38:19,280 for the architecture, which the designers came up with, 742 00:38:19,280 --> 00:38:23,120 was EVAX, for extended VAX, which turned out in retrospect 743 00:38:23,120 --> 00:38:25,410 to be not a very good name because the architecture, 744 00:38:25,410 --> 00:38:28,760 in fact, has no formal relationship to the VAX 745 00:38:28,760 --> 00:38:30,360 architecture. 746 00:38:30,360 --> 00:38:32,580 However, we're engineers and not marketing people. 747 00:38:32,580 --> 00:38:35,038 So that was the name we chose at the beginning; it was later 748 00:38:35,038 --> 00:38:36,290 changed to Alpha.
749 00:38:36,290 --> 00:38:38,150 The name of the chip, however, has 750 00:38:38,150 --> 00:38:41,128 remained EV, for extended VAX, four, 751 00:38:41,128 --> 00:38:43,670 because it was built in a fourth generation CMOS process. 752 00:38:43,670 --> 00:38:46,580 753 00:38:46,580 --> 00:38:50,100 The presentation that I'm going to give has four basic parts. 754 00:38:50,100 --> 00:38:51,920 First of all, I'm going to describe 755 00:38:51,920 --> 00:38:54,470 some of the overall features of the chip, 756 00:38:54,470 --> 00:38:57,110 then I'm going to describe some of the higher level 757 00:38:57,110 --> 00:39:00,290 architectural features such as the pipelines, 758 00:39:00,290 --> 00:39:04,160 the functional units, and the instruction issue rules. 759 00:39:04,160 --> 00:39:05,990 I will next describe just a couple 760 00:39:05,990 --> 00:39:09,440 of the more interesting microarchitectural features of EV4 761 00:39:09,440 --> 00:39:11,930 and relate those back to the higher level 762 00:39:11,930 --> 00:39:15,218 architectural concepts that Dick described previously. 763 00:39:15,218 --> 00:39:16,760 And finally, I will talk a little bit 764 00:39:16,760 --> 00:39:19,880 about the electrical interface between EV4 765 00:39:19,880 --> 00:39:23,410 and the systems into which it's designed. 766 00:39:23,410 --> 00:39:25,990 EV4 is a single chip implementation 767 00:39:25,990 --> 00:39:28,000 of the Alpha architecture. 768 00:39:28,000 --> 00:39:30,700 It's implemented in CMOS technology 769 00:39:30,700 --> 00:39:33,760 and has two cycle time variants, the first of which 770 00:39:33,760 --> 00:39:37,570 operates at a 200 megahertz internal clock rate, 771 00:39:37,570 --> 00:39:41,360 and the second variant operates at 150 megahertz. 772 00:39:41,360 --> 00:39:43,960 Now, these two speed variants don't really 773 00:39:43,960 --> 00:39:46,360 represent separate designs. 774 00:39:46,360 --> 00:39:49,660 They really represent distinct speed bins.
775 00:39:49,660 --> 00:39:53,020 So it's one design, and various chips 776 00:39:53,020 --> 00:39:55,930 happened to fall in one speed bin or the other. 777 00:39:55,930 --> 00:39:58,990 Some work faster than others, and the ones that run fast, 778 00:39:58,990 --> 00:40:01,640 we run fast. 779 00:40:01,640 --> 00:40:04,820 As I said, the chip is built in a CMOS process. 780 00:40:04,820 --> 00:40:08,310 It's built in Digital CMOS four technology, 781 00:40:08,310 --> 00:40:11,880 which means that it's our fourth generation CMOS technology. 782 00:40:11,880 --> 00:40:17,300 This technology features 0.7 micron drawn feature sizes 783 00:40:17,300 --> 00:40:22,320 with a 0.5 micron effective channel length that 784 00:40:22,320 --> 00:40:25,770 also features three layers of metallization, 785 00:40:25,770 --> 00:40:30,120 and the technology is really focused towards high speed 786 00:40:30,120 --> 00:40:32,400 microprocessor implementations. 787 00:40:32,400 --> 00:40:34,800 For example, the third layer of metal 788 00:40:34,800 --> 00:40:39,180 is different from upper layers of metal in other CMOS 789 00:40:39,180 --> 00:40:43,140 technologies where the metal is intended to maximize signal 790 00:40:43,140 --> 00:40:44,100 routability. 791 00:40:44,100 --> 00:40:45,990 In this case, we use the third layer of metal 792 00:40:45,990 --> 00:40:50,100 solely to distribute power and our internal high-speed clock. 793 00:40:50,100 --> 00:40:53,820 And therefore, it's a much thicker, coarser metallization 794 00:40:53,820 --> 00:40:57,510 level which can both bring a lot of power into the chip 795 00:40:57,510 --> 00:41:02,280 and distribute the clock with minimum skew. 796 00:41:02,280 --> 00:41:04,840 The die, physically, is fairly large. 797 00:41:04,840 --> 00:41:08,310 It's about 14 millimeters by 17 millimeters in size, 798 00:41:08,310 --> 00:41:12,100 and it implements about 1.7 million transistors.
799 00:41:12,100 --> 00:41:16,000 The part is packaged in a 431 pin grid array 800 00:41:16,000 --> 00:41:18,940 of which 291 pins are signals. 801 00:41:18,940 --> 00:41:23,510 The remaining 140 are power and ground. 802 00:41:23,510 --> 00:41:28,880 The chip dissipates 30 watts at 200 megahertz or 23 watts 803 00:41:28,880 --> 00:41:32,090 at 150 megahertz. 804 00:41:32,090 --> 00:41:37,760 It implements a 43-bit subset of the architected 64-bit virtual 805 00:41:37,760 --> 00:41:38,960 address space. 806 00:41:38,960 --> 00:41:41,510 As Dick mentioned previously, implementations 807 00:41:41,510 --> 00:41:45,170 are allowed by the architecture to implement a subset 808 00:41:45,170 --> 00:41:47,370 of the 64-bit virtual address. 809 00:41:47,370 --> 00:41:51,360 However, the entire 64 bits is checked. 810 00:41:51,360 --> 00:41:56,040 The chip also supports a 34-bit physical address space, 811 00:41:56,040 --> 00:41:58,800 giving us the ability to address some 16 812 00:41:58,800 --> 00:42:02,300 gigabytes of physical memory. 813 00:42:02,300 --> 00:42:05,330 The chip is a superscalar implementation in the sense 814 00:42:05,330 --> 00:42:09,530 that it can issue, at its peak, two instructions in each CPU 815 00:42:09,530 --> 00:42:14,300 cycle to any one of four fully pipelined functional units. 816 00:42:14,300 --> 00:42:17,630 The functional units include an integer operation unit, 817 00:42:17,630 --> 00:42:21,140 a floating point operation unit, a load/store unit, 818 00:42:21,140 --> 00:42:22,670 and a branch unit. 819 00:42:22,670 --> 00:42:27,230 The chip also includes a total of 44 translation lookaside 820 00:42:27,230 --> 00:42:30,260 buffer entries, 32 of which are dedicated 821 00:42:30,260 --> 00:42:32,600 to data stream references and 12 of which 822 00:42:32,600 --> 00:42:35,600 are dedicated to instruction stream references. 823 00:42:35,600 --> 00:42:39,520 Both translation buffers are fully associative.
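The 43-bit virtual-address subset with full 64-bit checking can be illustrated in C: an address is acceptable only if its upper bits are a sign extension of bit 42 (a sketch of the check, not the chip's actual logic):

```c
#include <stdint.h>

enum { VA_BITS = 43 };   /* bits implemented by EV4 */

/* Returns 1 if the 64-bit virtual address is a valid sign extension
   of its low 43 bits, i.e. bits 63:43 all equal bit 42. */
int va_is_canonical(uint64_t va)
{
    int64_t ext = (int64_t)(va << (64 - VA_BITS)) >> (64 - VA_BITS);
    return (uint64_t)ext == va;
}
```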
824 00:42:39,520 --> 00:42:41,560 There is an on-chip write buffer which 825 00:42:41,560 --> 00:42:45,850 includes four entries where each entry can 826 00:42:45,850 --> 00:42:48,610 contain 32 bytes of data. 827 00:42:48,610 --> 00:42:54,330 The chip also includes two on-chip caches, each 8 kilobytes 828 00:42:54,330 --> 00:42:57,390 in size, one devoted to instruction stream references 829 00:42:57,390 --> 00:43:00,870 and one devoted to data stream references. 830 00:43:00,870 --> 00:43:04,140 Both of them are physical, direct-mapped caches. 831 00:43:04,140 --> 00:43:07,090 The data cache is write through. 832 00:43:07,090 --> 00:43:13,060 The bus interface is flexible and can support system designs 833 00:43:13,060 --> 00:43:19,420 with either 128-bit or 64-bit data paths between the system 834 00:43:19,420 --> 00:43:20,610 and the chip. 835 00:43:20,610 --> 00:43:23,680 And as previously mentioned, the external interface 836 00:43:23,680 --> 00:43:27,060 supports a 34-bit physical address. 837 00:43:27,060 --> 00:43:31,720 The next graphic shows a crude block diagram of the chip. 838 00:43:31,720 --> 00:43:34,490 Starting from the top, you see an instruction cache, 839 00:43:34,490 --> 00:43:37,420 which each cycle can feed two instructions 840 00:43:37,420 --> 00:43:41,110 to the IBOX, or instruction issue unit, of the chip. 841 00:43:41,110 --> 00:43:44,290 The IBOX is responsible for decoding the instructions, 842 00:43:44,290 --> 00:43:46,990 performing all on-chip resource checks, 843 00:43:46,990 --> 00:43:50,410 and sending the instructions off to the respective functional 844 00:43:50,410 --> 00:43:51,700 units. 845 00:43:51,700 --> 00:43:54,100 The EBOX is an integer operate unit 846 00:43:54,100 --> 00:43:58,640 which performs basic integer operations such as add, 847 00:43:58,640 --> 00:44:02,140 subtract, OR, and shifts. 848 00:44:02,140 --> 00:44:07,350 The FBOX is the equivalent floating point operate unit.
849 00:44:07,350 --> 00:44:09,955 The integer register file sits below the EBOX 850 00:44:09,955 --> 00:44:11,330 while the floating point register 851 00:44:11,330 --> 00:44:14,480 file sits below the FBOX. 852 00:44:14,480 --> 00:44:18,260 The lines in the picture represent data buses 853 00:44:18,260 --> 00:44:22,940 which carry data to and from the functional 854 00:44:22,940 --> 00:44:25,870 units and the register files. 855 00:44:25,870 --> 00:44:28,480 You can correlate Dick's discussion of the instruction 856 00:44:28,480 --> 00:44:31,810 formats laid out in the Alpha architecture 857 00:44:31,810 --> 00:44:35,170 to the physical orientation of the chip itself. 858 00:44:35,170 --> 00:44:37,570 From the top of the integer register file, 859 00:44:37,570 --> 00:44:42,970 you see two 64-bit wide buses which carry integer operands 860 00:44:42,970 --> 00:44:46,090 to the EBOX and another 64-bit bus which 861 00:44:46,090 --> 00:44:49,797 carries integer results back to the integer register file. 862 00:44:49,797 --> 00:44:51,880 Similarly, on the floating point side of the chip, 863 00:44:51,880 --> 00:44:55,180 you see two 64-bit buses which carry operands 864 00:44:55,180 --> 00:44:58,990 to the floating point functional unit and another 64-bit bus 865 00:44:58,990 --> 00:45:02,210 which carries results back to the register file. 866 00:45:02,210 --> 00:45:05,080 Below the register files sits the ABOX. 867 00:45:05,080 --> 00:45:07,460 The ABOX, fundamentally, is the load/store unit 868 00:45:07,460 --> 00:45:09,380 within the machine. 869 00:45:09,380 --> 00:45:13,500 You can see two register file ports feeding the ABOX. 870 00:45:13,500 --> 00:45:16,340 One of these contains base addresses. 871 00:45:16,340 --> 00:45:18,620 The other contains store data coming 872 00:45:18,620 --> 00:45:20,240 from the integer register file. 
873 00:45:20,240 --> 00:45:22,190 Similarly, on the floating point side, 874 00:45:22,190 --> 00:45:25,310 you see a single 64-bit wide bus which 875 00:45:25,310 --> 00:45:30,110 carries floating point store data out to the ABOX. 876 00:45:30,110 --> 00:45:34,190 Below the ABOX is the four entry by 32 byte wide write 877 00:45:34,190 --> 00:45:37,910 buffer which gets its data from the ABOX 878 00:45:37,910 --> 00:45:42,140 and supplies data out to the BIU or bus interface unit. 879 00:45:42,140 --> 00:45:44,150 And lastly, at the bottom of the diagram, 880 00:45:44,150 --> 00:45:45,650 you see the on-chip data cache 881 00:45:45,650 --> 00:45:49,130 which, as I previously stated, is 8 kilobytes in size 882 00:45:49,130 --> 00:45:51,200 and a write through cache. 883 00:45:51,200 --> 00:45:54,350 The bus interface unit is responsible for orchestrating 884 00:45:54,350 --> 00:45:58,160 activities between on-chip functional units 885 00:45:58,160 --> 00:46:01,500 and the outside world, meaning the system. 886 00:46:01,500 --> 00:46:04,910 It contains a 64-bit data path which 887 00:46:04,910 --> 00:46:10,220 carries both fill data from the off-chip memory structures 888 00:46:10,220 --> 00:46:12,560 to on-chip primary caches and also 889 00:46:12,560 --> 00:46:17,570 carries write data from the write buffer out onto the pins. 890 00:46:17,570 --> 00:46:20,210 The next graphic shows the integer pipeline 891 00:46:20,210 --> 00:46:21,680 in the machine. 892 00:46:21,680 --> 00:46:23,960 It is 7 stages long, and this diagram 893 00:46:23,960 --> 00:46:27,890 can be used to describe the pipelines which 894 00:46:27,890 --> 00:46:32,520 operate in the IBOX, the EBOX, and the ABOX. 895 00:46:32,520 --> 00:46:36,570 Taking the IBOX first, you see fundamentally 896 00:46:36,570 --> 00:46:38,760 four pipeline stages. 
897 00:46:38,760 --> 00:46:41,760 In stage zero, which we also refer to as the instruction 898 00:46:41,760 --> 00:46:46,800 fetch stage, the IBOX reads a naturally aligned pair 899 00:46:46,800 --> 00:46:50,580 of longword instructions from the instruction cache. 900 00:46:50,580 --> 00:46:56,390 In the next cycle, the IBOX performs two primary functions. 901 00:46:56,390 --> 00:47:00,160 First, it decodes a portion of the instructions 902 00:47:00,160 --> 00:47:02,500 in order to determine which functional units 903 00:47:02,500 --> 00:47:04,690 the instructions should be sent to. 904 00:47:04,690 --> 00:47:07,420 In parallel with that, it determines 905 00:47:07,420 --> 00:47:12,330 for branch instructions the probable target for the branch. 906 00:47:12,330 --> 00:47:16,980 The chip supports two methods for branch prediction. 907 00:47:16,980 --> 00:47:20,820 In the first method, we support the architected hint alluded 908 00:47:20,820 --> 00:47:27,490 to by Dick earlier in this talk, where the hardware looks 909 00:47:27,490 --> 00:47:29,920 at the sign bit of the displacement field 910 00:47:29,920 --> 00:47:31,600 in the branch instruction itself 911 00:47:31,600 --> 00:47:36,460 and predicts that backward branches are taken 912 00:47:36,460 --> 00:47:38,710 but forward branches are not. 913 00:47:38,710 --> 00:47:40,900 Alternatively, the chip also implements 914 00:47:40,900 --> 00:47:43,630 a dynamic branch prediction structure in which, 915 00:47:43,630 --> 00:47:47,720 associated with each instruction in the instruction cache, 916 00:47:47,720 --> 00:47:49,250 is a single bit. 917 00:47:49,250 --> 00:47:52,160 This bit indicates which direction the branch 918 00:47:52,160 --> 00:47:55,600 took the last time it was executed. 
919 00:47:55,600 --> 00:47:57,850 When new instructions are brought into the instruction 920 00:47:57,850 --> 00:48:01,490 cache, we load this bit with the sign 921 00:48:01,490 --> 00:48:03,470 of the displacement as an initial guess, 922 00:48:03,470 --> 00:48:06,430 and then this bit gets updated later. 923 00:48:06,430 --> 00:48:09,430 Branches which are predicted by the hardware to be taken 924 00:48:09,430 --> 00:48:12,610 result in a one cycle bubble being inserted 925 00:48:12,610 --> 00:48:17,020 into the pipeline, meaning that if a branch instruction is 926 00:48:17,020 --> 00:48:20,710 fetched from the instruction cache in cycle zero, 927 00:48:20,710 --> 00:48:22,750 it takes all of cycle one to predict 928 00:48:22,750 --> 00:48:24,520 the direction of the branch and generate 929 00:48:24,520 --> 00:48:26,860 the target for the branch so that we are not 930 00:48:26,860 --> 00:48:29,470 able to go back to I-cache for that target 931 00:48:29,470 --> 00:48:32,830 until what would be cycle two for the branch. 932 00:48:32,830 --> 00:48:35,290 This bubble, which gets inserted into the pipeline, 933 00:48:35,290 --> 00:48:41,350 can in some cases be removed later. 934 00:48:41,350 --> 00:48:44,320 In pipe stage two, labeled I0, we 935 00:48:44,320 --> 00:48:47,010 perform more instruction decode. 936 00:48:47,010 --> 00:48:50,490 And in pipe stage three, labeled I1, all the real action 937 00:48:50,490 --> 00:48:51,750 takes place. 938 00:48:51,750 --> 00:48:55,020 In that pipe stage, we read operands from the register 939 00:48:55,020 --> 00:48:58,620 file and perform resource checks to see whether the instruction 940 00:48:58,620 --> 00:49:00,570 can be issued. 941 00:49:00,570 --> 00:49:03,270 These resource checks include both the availability 942 00:49:03,270 --> 00:49:06,150 of register operands as well as the availability 943 00:49:06,150 --> 00:49:09,410 of the on-chip functional units. 
944 00:49:09,410 --> 00:49:11,900 After pipe stage three, instructions either 945 00:49:11,900 --> 00:49:14,660 execute to completion and write the results in the register 946 00:49:14,660 --> 00:49:18,450 file, or perhaps they get aborted for a host of reasons, 947 00:49:18,450 --> 00:49:20,960 which I'll describe later. 948 00:49:20,960 --> 00:49:24,080 Within the IBOX, there's still a few more activities 949 00:49:24,080 --> 00:49:26,620 that happen later down the pipe. 950 00:49:26,620 --> 00:49:29,110 As Dick mentioned, conditional branches 951 00:49:29,110 --> 00:49:31,000 test a register in order to decide 952 00:49:31,000 --> 00:49:32,890 whether to take the branch. 953 00:49:32,890 --> 00:49:35,590 Since we read the register in cycle three, 954 00:49:35,590 --> 00:49:38,320 we can test its value against zero 955 00:49:38,320 --> 00:49:40,750 in the beginning of cycle four and be in a position 956 00:49:40,750 --> 00:49:45,010 to determine whether the branch should in fact be taken or not. 957 00:49:45,010 --> 00:49:48,820 Notice then that although a branch instruction is fetched 958 00:49:48,820 --> 00:49:51,730 from the instruction cache in cycle zero, 959 00:49:51,730 --> 00:49:55,340 we don't know whether it's going to be taken until cycle four. 960 00:49:55,340 --> 00:49:58,760 What this means is that if the branch was predicted 961 00:49:58,760 --> 00:50:01,760 incorrectly, we incur a four cycle penalty 962 00:50:01,760 --> 00:50:06,150 to go back and get the true branch target. 963 00:50:06,150 --> 00:50:08,730 Also in cycle four, we're in a position 964 00:50:08,730 --> 00:50:12,390 to generate the true virtual PC for the instruction 965 00:50:12,390 --> 00:50:14,740 one stage back in the pipeline. 966 00:50:14,740 --> 00:50:18,850 So we have this virtual PC at the end of pipe stage four. 
967 00:50:18,850 --> 00:50:22,570 In pipeline stage five, we can, from this virtual PC, 968 00:50:22,570 --> 00:50:25,240 generate the corresponding physical PC 969 00:50:25,240 --> 00:50:27,790 so that, at the end of pipe stage five 970 00:50:27,790 --> 00:50:29,510 and the beginning of pipe stage six, 971 00:50:29,510 --> 00:50:31,510 we can determine whether the instruction that we 972 00:50:31,510 --> 00:50:35,080 fetched five cycles ago, in fact, 973 00:50:35,080 --> 00:50:37,350 hit in the instruction cache. 974 00:50:37,350 --> 00:50:39,930 So this means that if an instruction fetched 975 00:50:39,930 --> 00:50:42,850 in pipe stage zero missed the cache, 976 00:50:42,850 --> 00:50:46,130 we don't know it until five cycles later. 977 00:50:46,130 --> 00:50:49,300 Moving on to the activities which happen in the EBOX, 978 00:50:49,300 --> 00:50:52,140 I'll start in pipe stage four. 979 00:50:52,140 --> 00:50:58,410 In that pipeline stage, the EBOX has two 64-bit operands 980 00:50:58,410 --> 00:51:01,080 on which to perform an operation. 981 00:51:01,080 --> 00:51:05,190 Most operations in the EBOX can be performed in a single cycle. 982 00:51:05,190 --> 00:51:09,540 These include add and subtract and the simple logic functions 983 00:51:09,540 --> 00:51:12,000 like and, or, and xor. 984 00:51:12,000 --> 00:51:14,490 So at the end of pipeline stage four, 985 00:51:14,490 --> 00:51:16,860 the results of these instructions are ready for use 986 00:51:16,860 --> 00:51:20,400 by subsequent instructions, although the results won't 987 00:51:20,400 --> 00:51:23,010 be, in fact, written to the register file for two more 988 00:51:23,010 --> 00:51:23,880 cycles. 989 00:51:23,880 --> 00:51:27,150 We can supply the data on bypass paths 990 00:51:27,150 --> 00:51:30,330 back to the multiplexor which physically 991 00:51:30,330 --> 00:51:34,460 sits between the integer register file and the EBOX. 
992 00:51:34,460 --> 00:51:38,320 Shift operations take two cycles to complete. 993 00:51:38,320 --> 00:51:39,880 The results of shifts are therefore 994 00:51:39,880 --> 00:51:42,220 available in pipe stage five. 995 00:51:42,220 --> 00:51:45,490 Shift operations therefore have a latency of two cycles, 996 00:51:45,490 --> 00:51:47,650 although they can be fully pipelined. 997 00:51:47,650 --> 00:51:50,770 Also in pipe stage five, we produce, 998 00:51:50,770 --> 00:51:53,920 based on the results generated in the EBOX, 999 00:51:53,920 --> 00:51:57,010 a bit which indicates whether that result is zero. 1000 00:51:57,010 --> 00:51:59,620 We then store that bit in the register file. 1001 00:51:59,620 --> 00:52:03,900 This makes branch instructions easier to execute. 1002 00:52:03,900 --> 00:52:06,480 Lastly, in pipe stage six, the EBOX 1003 00:52:06,480 --> 00:52:09,320 writes its results into the register file. 1004 00:52:09,320 --> 00:52:10,990 Now, referencing the same diagram, 1005 00:52:10,990 --> 00:52:12,740 I'm going to describe the operations which 1006 00:52:12,740 --> 00:52:15,690 occur within the ABOX. 1007 00:52:15,690 --> 00:52:17,760 The ABOX receives its instructions 1008 00:52:17,760 --> 00:52:21,930 issued from the IBOX at the end of pipe stage three. 1009 00:52:21,930 --> 00:52:24,880 In pipe stage four, it takes the operand 1010 00:52:24,880 --> 00:52:27,960 supplied to it from the register file and adds to it 1011 00:52:27,960 --> 00:52:30,170 the displacement. 1012 00:52:30,170 --> 00:52:32,090 Thus, at the end of pipe stage four, 1013 00:52:32,090 --> 00:52:35,310 it has an effective virtual address. 1014 00:52:35,310 --> 00:52:37,470 Because the size of the data cache 1015 00:52:37,470 --> 00:52:41,970 is equal to the size of the pages used by the memory 1016 00:52:41,970 --> 00:52:44,760 management system, we can start looking 1017 00:52:44,760 --> 00:52:48,390 in the D-cache at the beginning of pipe stage five. 
1018 00:52:48,390 --> 00:52:50,400 We don't have to wait for the translation buffer 1019 00:52:50,400 --> 00:52:53,190 to produce a full physical address. 1020 00:52:53,190 --> 00:52:55,710 Therefore, at the end of pipe stage five, 1021 00:52:55,710 --> 00:52:58,650 we have on our hands the physical address 1022 00:52:58,650 --> 00:53:01,380 that corresponds to the generated virtual address 1023 00:53:01,380 --> 00:53:03,720 and the data which hopefully corresponds 1024 00:53:03,720 --> 00:53:06,690 as well to that virtual address. 1025 00:53:06,690 --> 00:53:08,970 We are in a position to know whether a load 1026 00:53:08,970 --> 00:53:12,720 instruction actually hit or missed at the beginning of pipe 1027 00:53:12,720 --> 00:53:14,890 stage six. 1028 00:53:14,890 --> 00:53:17,080 And if the instruction hit in the data cache, 1029 00:53:17,080 --> 00:53:20,230 we can supply the results of the instruction 1030 00:53:20,230 --> 00:53:24,340 to the functional units by the end of pipe stage six. 1031 00:53:24,340 --> 00:53:27,130 Hence, load instructions which hit in the data cache 1032 00:53:27,130 --> 00:53:31,110 have an effective latency of three CPU cycles. 1033 00:53:31,110 --> 00:53:35,310 For stores, we know as well at the beginning of pipe stage six 1034 00:53:35,310 --> 00:53:37,920 whether the store instruction hit in the data cache. 1035 00:53:37,920 --> 00:53:40,980 If it missed, no further operation 1036 00:53:40,980 --> 00:53:43,020 is required for the data cache. 1037 00:53:43,020 --> 00:53:44,910 The data is moved into the write buffer 1038 00:53:44,910 --> 00:53:49,140 independent of the results of the data cache lookup. 1039 00:53:49,140 --> 00:53:52,080 If the lookup resulted in a data cache hit, 1040 00:53:52,080 --> 00:53:56,340 we write the corresponding data into the data cache array 1041 00:53:56,340 --> 00:53:59,780 the next time that that array is otherwise not busy. 
1042 00:53:59,780 --> 00:54:02,420 Turning next to the diagram which shows the floating point 1043 00:54:02,420 --> 00:54:06,800 pipeline, you see that floating point instructions share 1044 00:54:06,800 --> 00:54:10,760 the first four pipeline stages with integer instructions, 1045 00:54:10,760 --> 00:54:14,240 and those correspond to the activity which I previously 1046 00:54:14,240 --> 00:54:17,240 described in the IBOX. 1047 00:54:17,240 --> 00:54:19,310 The floating point pipeline is quite a bit longer 1048 00:54:19,310 --> 00:54:22,340 than the integer pipeline, and floating point results 1049 00:54:22,340 --> 00:54:25,550 are not available until six CPU cycles 1050 00:54:25,550 --> 00:54:27,620 after the instruction was issued. 1051 00:54:27,620 --> 00:54:30,440 The floating point unit is fully pipelined, however, 1052 00:54:30,440 --> 00:54:32,330 so that we can start a new floating point 1053 00:54:32,330 --> 00:54:34,410 operation every CPU cycle. 1054 00:54:34,410 --> 00:54:36,920 However, as I said, the effective latency 1055 00:54:36,920 --> 00:54:41,233 of this functional unit is six CPU cycles. 1056 00:54:41,233 --> 00:54:42,900 Now I want to talk about the instruction 1057 00:54:42,900 --> 00:54:46,920 issue rules which the hardware within EV4 enforces. 1058 00:54:46,920 --> 00:54:49,890 Generally, if you think back to the diagram which 1059 00:54:49,890 --> 00:54:53,030 showed the functional units within the chip, 1060 00:54:53,030 --> 00:54:56,678 you can see that the possibility for multiple instruction issue 1061 00:54:56,678 --> 00:54:57,470 is certainly there. 1062 00:54:57,470 --> 00:54:59,660 We have four fully pipelined units, 1063 00:54:59,660 --> 00:55:03,500 and the IBOX ought to be able to issue pairs of instructions 1064 00:55:03,500 --> 00:55:04,460 to any of these units. 1065 00:55:04,460 --> 00:55:06,240 And generally, that's the case. 
1066 00:55:06,240 --> 00:55:09,110 We can issue load or store instructions 1067 00:55:09,110 --> 00:55:10,880 with operate instructions. 1068 00:55:10,880 --> 00:55:13,520 We can issue integer operate instructions 1069 00:55:13,520 --> 00:55:15,920 with floating point operate instructions. 1070 00:55:15,920 --> 00:55:19,040 We can issue floating point operate instructions 1071 00:55:19,040 --> 00:55:22,550 with corresponding floating point branches or integer 1072 00:55:22,550 --> 00:55:26,710 operate instructions with integer branches. 1073 00:55:26,710 --> 00:55:30,820 One quirk is that integer stores can't be issued together 1074 00:55:30,820 --> 00:55:34,330 with floating point operates, nor can floating point stores 1075 00:55:34,330 --> 00:55:37,600 be issued together with integer operates. 1076 00:55:37,600 --> 00:55:41,830 This is due to an internal hardware resource constraint 1077 00:55:41,830 --> 00:55:48,190 in the IBOX which, although not really difficult to get around, 1078 00:55:48,190 --> 00:55:50,320 wasn't worth the added complication in terms 1079 00:55:50,320 --> 00:55:53,032 of the performance that it would have bought us. 1080 00:55:53,032 --> 00:55:54,490 Next, I want to address a couple of 1081 00:55:54,490 --> 00:55:56,532 interesting microarchitectural features which we 1082 00:55:56,532 --> 00:55:58,750 included within the EV4 chip. 1083 00:55:58,750 --> 00:56:01,870 The first of these addresses the branch latency problem, 1084 00:56:01,870 --> 00:56:03,730 which Dick referred to earlier. 1085 00:56:03,730 --> 00:56:06,460 Specifically, it addresses the problem associated 1086 00:56:06,460 --> 00:56:08,440 with memory format branches. 1087 00:56:08,440 --> 00:56:11,290 These instructions are used in computed jumps, 1088 00:56:11,290 --> 00:56:15,620 subroutine calls and returns, and the like. 
1089 00:56:15,620 --> 00:56:17,570 The instructions read the register 1090 00:56:17,570 --> 00:56:21,260 file and jump to the virtual address contained 1091 00:56:21,260 --> 00:56:24,710 within the operand register, and they represent 1092 00:56:24,710 --> 00:56:30,040 a special problem for the instruction pre-fetch hardware. 1093 00:56:30,040 --> 00:56:31,990 Thinking back to the earlier diagram which 1094 00:56:31,990 --> 00:56:34,180 showed the machine's pipeline, remember 1095 00:56:34,180 --> 00:56:36,550 that the branch prediction hardware 1096 00:56:36,550 --> 00:56:38,770 runs in pipe stage one of the machine 1097 00:56:38,770 --> 00:56:42,250 while registers aren't read until two cycles later. 1098 00:56:42,250 --> 00:56:45,460 Therefore, in order for the prediction hardware 1099 00:56:45,460 --> 00:56:49,330 to come up with a target for the branch, 1100 00:56:49,330 --> 00:56:51,490 something has to be done in hardware 1101 00:56:51,490 --> 00:56:54,490 if we want to come up with a target early in the pipeline, 1102 00:56:54,490 --> 00:56:57,790 since the actual target isn't available until pipe stage 1103 00:56:57,790 --> 00:56:58,900 three. 1104 00:56:58,900 --> 00:57:00,520 The solution here is to implement 1105 00:57:00,520 --> 00:57:03,070 what we call a JSR stack. 1106 00:57:03,070 --> 00:57:06,580 This stack allows subroutine call and return addresses 1107 00:57:06,580 --> 00:57:08,980 to be maintained back in the pipeline 1108 00:57:08,980 --> 00:57:12,880 by the instruction pre-fetcher and also requires 1109 00:57:12,880 --> 00:57:14,870 use of hint bits. 1110 00:57:14,870 --> 00:57:17,450 On a subroutine call, the compiler 1111 00:57:17,450 --> 00:57:20,690 inserts into the instruction a 2-bit field 1112 00:57:20,690 --> 00:57:24,740 which tells the hardware that this instruction is actually 1113 00:57:24,740 --> 00:57:27,540 implementing a subroutine call. 
1114 00:57:27,540 --> 00:57:31,710 The prediction hardware then takes the virtual address 1115 00:57:31,710 --> 00:57:35,070 of the instruction following the subroutine call 1116 00:57:35,070 --> 00:57:37,570 and places that on a hardware stack. 1117 00:57:37,570 --> 00:57:40,120 When the subsequent return comes along, 1118 00:57:40,120 --> 00:57:42,100 we then pop that stack and use that 1119 00:57:42,100 --> 00:57:45,590 as the target for the return. 1120 00:57:45,590 --> 00:57:47,910 This JSR stack has a depth of four, 1121 00:57:47,910 --> 00:57:50,750 meaning it's capable of holding at any one time 1122 00:57:50,750 --> 00:57:55,000 a stack of four subroutine return addresses. 1123 00:57:55,000 --> 00:57:56,830 We also include hardware to make sure 1124 00:57:56,830 --> 00:57:59,580 that this stack doesn't get corrupted. 1125 00:57:59,580 --> 00:58:02,170 The way it might get corrupted is as follows. 1126 00:58:02,170 --> 00:58:04,650 We could fetch from the instruction cache 1127 00:58:04,650 --> 00:58:07,540 an instruction that's a call. 1128 00:58:07,540 --> 00:58:10,530 We might then send this instruction down the pipeline. 1129 00:58:10,530 --> 00:58:14,070 And later, for whatever reason (an exception, a branch 1130 00:58:14,070 --> 00:58:17,070 mispredict, an instruction cache miss), 1131 00:58:17,070 --> 00:58:19,890 the call instruction may never get executed. 1132 00:58:19,890 --> 00:58:23,070 If we updated the stack when we fetched the instruction rather 1133 00:58:23,070 --> 00:58:25,630 than when we committed to executing the instruction, 1134 00:58:25,630 --> 00:58:27,900 the stack could therefore get out of sync 1135 00:58:27,900 --> 00:58:30,590 with the rest of the machine. 
1136 00:58:30,590 --> 00:58:34,280 Note that this is a performance optimization, 1137 00:58:34,280 --> 00:58:36,680 and if the hardware stack ever gets out 1138 00:58:36,680 --> 00:58:39,170 of sync with the real program flow, 1139 00:58:39,170 --> 00:58:42,980 we do not end up with incorrect operation. 1140 00:58:42,980 --> 00:58:44,540 We merely take a little bit longer 1141 00:58:44,540 --> 00:58:48,560 to fetch the target of a call or return. 1142 00:58:48,560 --> 00:58:50,480 The next feature that I want to describe 1143 00:58:50,480 --> 00:58:52,610 relates to granularity hints. 1144 00:58:52,610 --> 00:58:56,000 The Alpha architecture uses a 2-bit field 1145 00:58:56,000 --> 00:58:58,520 within each page table entry, which 1146 00:58:58,520 --> 00:59:01,700 can be used by software to communicate 1147 00:59:01,700 --> 00:59:06,110 to hardware that a particular page table entry in fact maps 1148 00:59:06,110 --> 00:59:08,090 more than a single page. 1149 00:59:08,090 --> 00:59:10,460 This 2-bit field can be used by the software 1150 00:59:10,460 --> 00:59:12,920 to communicate to the hardware that a given page table 1151 00:59:12,920 --> 00:59:17,270 entry in fact maps a contiguous physical region consisting 1152 00:59:17,270 --> 00:59:23,270 of one page, eight pages, 64 pages, or 512 pages. 1153 00:59:23,270 --> 00:59:27,200 Since EV4 implements an 8 kilobyte page, 1154 00:59:27,200 --> 00:59:33,890 this translates to a region of 8 kilobytes, 64 kilobytes, 512 1155 00:59:33,890 --> 00:59:36,620 kilobytes, or 4 megabytes in size. 1156 00:59:36,620 --> 00:59:40,280 The data stream translation buffer contains 32 entries, 1157 00:59:40,280 --> 00:59:41,810 as I previously stated. 1158 00:59:41,810 --> 00:59:44,090 Each of these entries is able to support 1159 00:59:44,090 --> 00:59:47,270 any of the four granularity hints specified 1160 00:59:47,270 --> 00:59:48,740 by the architecture. 
1161 00:59:48,740 --> 00:59:52,880 This allows a very flexible arrangement for page table 1162 00:59:52,880 --> 00:59:55,310 entries and also allows a relatively small 1163 00:59:55,310 --> 00:59:58,850 associative translation buffer to in fact map 1164 00:59:58,850 --> 01:00:02,480 a large memory space. 1165 01:00:02,480 --> 01:00:05,570 The size of the region mapped by each entry 1166 01:00:05,570 --> 01:00:09,020 is set when the page table entry is written by PAL code 1167 01:00:09,020 --> 01:00:11,690 into the translation buffer structure. 1168 01:00:11,690 --> 01:00:15,440 The I-stream translation buffer consists of 12 entries. 1169 01:00:15,440 --> 01:00:18,830 Eight of these entries are devoted to the smallest 1170 01:00:18,830 --> 01:00:22,220 granularity size, meaning that each of those eight entries 1171 01:00:22,220 --> 01:00:25,040 can only map an 8 kilobyte region 1172 01:00:25,040 --> 01:00:27,980 corresponding to a single page. 1173 01:00:27,980 --> 01:00:32,420 Four of the entries, however, support the largest granularity 1174 01:00:32,420 --> 01:00:35,150 size, which means that each of those entries 1175 01:00:35,150 --> 01:00:38,220 can map a 4 megabyte region. 1176 01:00:38,220 --> 01:00:42,200 This is very useful for mapping large, non-paged instruction 1177 01:00:42,200 --> 01:00:47,030 areas such as the operating system kernel or large shared 1178 01:00:47,030 --> 01:00:48,790 libraries. 1179 01:00:48,790 --> 01:00:52,640 I'm next going to describe EV4's external interface. 1180 01:00:52,640 --> 01:00:56,210 This interface was designed with several goals in mind. 1181 01:00:56,210 --> 01:00:58,240 First of all, we needed the interface 1182 01:00:58,240 --> 01:01:00,220 to be extremely flexible. 
1183 01:01:00,220 --> 01:01:03,490 Since this is the first chip in a new architecture 1184 01:01:03,490 --> 01:01:06,910 and since we didn't have the resources inside the company 1185 01:01:06,910 --> 01:01:09,760 to devote several design teams each to doing 1186 01:01:09,760 --> 01:01:12,430 an implementation targeted towards a particular end 1187 01:01:12,430 --> 01:01:16,420 of the system-wide spectrum, the single design 1188 01:01:16,420 --> 01:01:20,710 needed to be used in systems ranging from low-end desktop 1189 01:01:20,710 --> 01:01:24,490 machines to mid-range multiprocessor servers 1190 01:01:24,490 --> 01:01:27,970 all the way up to high-end, massively parallel 1191 01:01:27,970 --> 01:01:29,860 supercomputer systems. 1192 01:01:29,860 --> 01:01:32,770 With that in mind, we needed each system designer 1193 01:01:32,770 --> 01:01:35,860 to be able to make his or her own cost performance 1194 01:01:35,860 --> 01:01:37,030 trade-offs. 1195 01:01:37,030 --> 01:01:39,040 Lastly, because of the schedule constraints 1196 01:01:39,040 --> 01:01:41,440 confronted by our internal system partners, 1197 01:01:41,440 --> 01:01:44,320 we wanted to define an external interface around which you 1198 01:01:44,320 --> 01:01:48,670 could design a system using off the shelf industry standard 1199 01:01:48,670 --> 01:01:51,030 components. 1200 01:01:51,030 --> 01:01:54,650 The next graphic shows a crude schematic 1201 01:01:54,650 --> 01:01:58,290 for the external interface of the EV4 chip. 1202 01:01:58,290 --> 01:02:02,250 The chip takes as its input a 2x clock. 1203 01:02:02,250 --> 01:02:05,550 So for example, a chip that operates internally 1204 01:02:05,550 --> 01:02:10,300 at 200 megahertz requires a 400 megahertz input oscillator. 1205 01:02:10,300 --> 01:02:12,090 This is the only high speed signal 1206 01:02:12,090 --> 01:02:14,640 that a system designer really has to worry about. 
1207 01:02:14,640 --> 01:02:16,680 The rest of the interface can be programmed 1208 01:02:16,680 --> 01:02:19,170 to operate much more slowly. 1209 01:02:19,170 --> 01:02:21,330 The chip also supports an external cache, 1210 01:02:21,330 --> 01:02:23,880 although this cache is not required 1211 01:02:23,880 --> 01:02:25,650 by the chip design itself. 1212 01:02:25,650 --> 01:02:28,920 Its presence or absence is left solely 1213 01:02:28,920 --> 01:02:31,780 up to the system designer. 1214 01:02:31,780 --> 01:02:36,170 For systems which do implement an off-chip cache, 1215 01:02:36,170 --> 01:02:38,930 the chip is capable of accessing that cache 1216 01:02:38,930 --> 01:02:43,430 with no help from external system-level logic. 1217 01:02:43,430 --> 01:02:47,330 So for example, if a load instruction 1218 01:02:47,330 --> 01:02:50,630 misses the on-chip cache, the on-chip BIU 1219 01:02:50,630 --> 01:02:53,600 can read the off-chip cache RAMs with no interaction 1220 01:02:53,600 --> 01:02:56,620 from the outside world. 1221 01:02:56,620 --> 01:02:58,270 This relieves the system designer 1222 01:02:58,270 --> 01:03:02,740 from having to design logic which has to directly interface 1223 01:03:02,740 --> 01:03:07,450 to the high speed clocking domain inside the chip. 1224 01:03:07,450 --> 01:03:12,260 The graphic shows three RAM structures which 1225 01:03:12,260 --> 01:03:14,300 make up the external cache. 1226 01:03:14,300 --> 01:03:18,140 The tag control RAM consists of four bits: a valid bit, 1227 01:03:18,140 --> 01:03:21,560 a shared bit, a dirty bit, and a parity bit 1228 01:03:21,560 --> 01:03:25,760 which contains parity across the previously mentioned 1229 01:03:25,760 --> 01:03:27,920 three bits. 
1230 01:03:27,920 --> 01:03:31,820 The tag RAM contains a tag corresponding 1231 01:03:31,820 --> 01:03:34,460 to each block in the external cache, 1232 01:03:34,460 --> 01:03:38,240 and the data and check RAMs contain the cache data 1233 01:03:38,240 --> 01:03:42,340 and associated check or parity bits. 1234 01:03:42,340 --> 01:03:45,670 As I previously stated, the external bus 1235 01:03:45,670 --> 01:03:50,790 can be either 64-bits or 128-bits wide, as determined 1236 01:03:50,790 --> 01:03:52,380 by the system designer's need. 1237 01:03:52,380 --> 01:03:55,170 1238 01:03:55,170 --> 01:03:58,740 To begin a cache reference, the chip drives an address out 1239 01:03:58,740 --> 01:04:00,840 onto its address bus. 1240 01:04:00,840 --> 01:04:03,840 This address and associated RAM control flow 1241 01:04:03,840 --> 01:04:06,210 through system-dependent logic which physically 1242 01:04:06,210 --> 01:04:09,210 sits between these buses and the RAMs. 1243 01:04:09,210 --> 01:04:12,150 The RAM access is completely combinatorial. 1244 01:04:12,150 --> 01:04:14,790 The chip drives a new address out to the RAMs 1245 01:04:14,790 --> 01:04:17,460 and then waits a user-specified period of time 1246 01:04:17,460 --> 01:04:21,570 before sampling the associated tag and data fields. 1247 01:04:21,570 --> 01:04:26,640 Some external transactions cannot be completed by EV4 alone 1248 01:04:26,640 --> 01:04:28,350 using the external cache RAMs. 1249 01:04:28,350 --> 01:04:32,130 And for these transactions, system-level components 1250 01:04:32,130 --> 01:04:34,910 have to get into the picture. 1251 01:04:34,910 --> 01:04:38,030 Information between EV4 and these components 1252 01:04:38,030 --> 01:04:40,040 flows through the miscellaneous control 1253 01:04:40,040 --> 01:04:42,350 field listed on this graphic. 
1254 01:04:42,350 --> 01:04:44,930 This field consists of a command field 1255 01:04:44,930 --> 01:04:49,010 onto which EV4 drives commands such as read, write, 1256 01:04:49,010 --> 01:04:52,490 memory barrier, load locked, et cetera, and a pair 1257 01:04:52,490 --> 01:04:55,880 of acknowledgment fields, which system-level components drive 1258 01:04:55,880 --> 01:04:57,890 back to EV4. 1259 01:04:57,890 --> 01:05:01,730 These control fields operate synchronously 1260 01:05:01,730 --> 01:05:06,740 with the internal chip clock but phase aligned to a system 1261 01:05:06,740 --> 01:05:10,310 clock, which is programmable. 1262 01:05:10,310 --> 01:05:13,650 This system clock can be user specified 1263 01:05:13,650 --> 01:05:16,890 to operate at frequencies ranging from 1/2 to 1/8 1264 01:05:16,890 --> 01:05:20,050 the on-chip CPU clock rate. 1265 01:05:20,050 --> 01:05:23,140 In keeping with the previously mentioned 1266 01:05:23,140 --> 01:05:26,560 external interface goals of performance, flexibility, 1267 01:05:26,560 --> 01:05:30,220 and simplicity, EV4 supports external caches 1268 01:05:30,220 --> 01:05:34,570 ranging in size from 128 kilobytes to 8 megabytes. 1269 01:05:34,570 --> 01:05:38,740 Or, as I previously said, we could have no cache at all. 1270 01:05:38,740 --> 01:05:41,710 The external cache, if implemented by the system 1271 01:05:41,710 --> 01:05:45,220 designer, can be built with a variety of industry standard 1272 01:05:45,220 --> 01:05:46,300 RAMs. 1273 01:05:46,300 --> 01:05:49,090 High-end systems can use the fastest, 1274 01:05:49,090 --> 01:05:53,440 largest static RAMs they can buy while low-end systems can use 1275 01:05:53,440 --> 01:05:56,770 smaller, cheaper, slower RAMs. 
1276 01:05:56,770 --> 01:05:58,750 The RAM timing is set by software 1277 01:05:58,750 --> 01:06:02,530 and can range from 3 to 16 CPU cycles, 1278 01:06:02,530 --> 01:06:04,510 determined by the value which software 1279 01:06:04,510 --> 01:06:08,290 places into an on-chip control register. 1280 01:06:08,290 --> 01:06:11,230 The external interface, although it runs synchronously 1281 01:06:11,230 --> 01:06:14,920 to the on-chip CPU clock, can be specified 1282 01:06:14,920 --> 01:06:18,400 to run anywhere from 1/2 to 1/8 the speed 1283 01:06:18,400 --> 01:06:22,360 of this on-chip clock, again, allowing cost-performance 1284 01:06:22,360 --> 01:06:26,310 trade-offs to be made by the system designer. 1285 01:06:26,310 --> 01:06:29,130 Lastly, there are no cache policy decisions 1286 01:06:29,130 --> 01:06:31,230 enforced by the EV4 chip. 1287 01:06:31,230 --> 01:06:34,950 This means that it's left to the system designer to decide 1288 01:06:34,950 --> 01:06:38,670 what cache coherence protocol is appropriate for him or her. 1289 01:06:38,670 --> 01:06:40,950 The valid, dirty, and shared bits, 1290 01:06:40,950 --> 01:06:45,030 as I previously described within the tag control field, 1291 01:06:45,030 --> 01:06:47,640 imply a bias towards a conditional 1292 01:06:47,640 --> 01:06:51,840 write-through cache coherence protocol in name only. 1293 01:06:51,840 --> 01:06:56,430 Really, the shared bit simply specifies to EV4 that it 1294 01:06:56,430 --> 01:07:00,330 cannot write to an external cache block by itself 1295 01:07:00,330 --> 01:07:03,870 but requires external module-level interaction 1296 01:07:03,870 --> 01:07:06,290 in order to complete the write. 
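The two timing knobs just mentioned (a RAM access time of 3 to 16 CPU cycles, and a system clock at 1/2 to 1/8 of the CPU clock) translate directly into nanoseconds at the 200 MHz, 5 ns cycle time quoted earlier in this talk. A small helper sketch; the clamping of out-of-range values is an assumption for illustration, not documented register behavior.

```c
#include <assert.h>

/* External-cache read latency in nanoseconds, given the CPU cycle
 * time and the software-programmed RAM access time (3..16 cycles). */
static double ext_read_ns(double cpu_cycle_ns, int ram_cycles)
{
    if (ram_cycles < 3)  ram_cycles = 3;   /* clamp to the legal range */
    if (ram_cycles > 16) ram_cycles = 16;  /* (assumed behavior)       */
    return cpu_cycle_ns * ram_cycles;
}

/* System-clock period for a divisor of 2..8 of the CPU clock. */
static double sys_clock_ns(double cpu_cycle_ns, int divisor)
{
    if (divisor < 2) divisor = 2;
    if (divisor > 8) divisor = 8;
    return cpu_cycle_ns * divisor;
}
```

At 5 ns per CPU cycle, the fastest setting gives a 15 ns external read and the slowest an 80 ns read, which is the cost-performance range the system designer trades across by choosing RAMs.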
1297 01:07:06,290 --> 01:07:08,920 This means that, in addition to a conditional 1298 01:07:08,920 --> 01:07:12,640 write-through protocol, the system designer could implement 1299 01:07:12,640 --> 01:07:16,360 a pure ownership coherence protocol where the shared 1300 01:07:16,360 --> 01:07:20,080 bit becomes really more of an ownership bit, 1301 01:07:20,080 --> 01:07:23,590 or the system designer could implement 1302 01:07:23,590 --> 01:07:27,280 a pure write-through protocol where the shared bit is always 1303 01:07:27,280 --> 01:07:28,490 set. 1304 01:07:28,490 --> 01:07:32,200 This means that EV4 will always request help 1305 01:07:32,200 --> 01:07:37,160 from module-level components before writing data 1306 01:07:37,160 --> 01:07:39,080 into the external cache. 1307 01:07:39,080 --> 01:07:43,070 The external interface also allows the system designer 1308 01:07:43,070 --> 01:07:47,360 to implement either a 128-bit-wide or 64-bit-wide 1309 01:07:47,360 --> 01:07:48,620 external bus. 1310 01:07:48,620 --> 01:07:53,630 The chip supports both CMOS/TTL and ECL 1311 01:07:53,630 --> 01:07:57,170 and supports either longword error-correcting codes 1312 01:07:57,170 --> 01:07:59,330 or longword parity. 1313 01:07:59,330 --> 01:08:03,410 The chip does not perform on-chip invalidate filtering 1314 01:08:03,410 --> 01:08:06,260 for Dcache invalidates but does provide a mechanism 1315 01:08:06,260 --> 01:08:08,120 so that system designers can filter 1316 01:08:08,120 --> 01:08:13,340 these invalidates off-chip by implementing a backmap. 1317 01:08:13,340 --> 01:08:15,860 Here's a die photograph of EV4. 1318 01:08:15,860 --> 01:08:18,710 On the left is the 8-kilobyte instruction cache 1319 01:08:18,710 --> 01:08:22,920 with the branch history table located just below that. 
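The shared-bit rule described above reduces to a simple predicate: the chip may complete a write to an external cache block on its own only if the block is valid and not shared; otherwise it must raise a command to module-level components first. A sketch with illustrative field names:

```c
#include <assert.h>
#include <stdbool.h>

/* The tag control bits mentioned in the talk; the struct layout
 * is illustrative. */
typedef struct {
    bool valid;
    bool shared;  /* set => writes need module-level help            */
    bool dirty;   /* block differs from memory (ownership-style use) */
} tag_control;

/* True when EV4 can complete the write locally; false when it must
 * request help from system-level components before writing. */
bool can_write_locally(tag_control t)
{
    return t.valid && !t.shared;
}
```

The three protocols in the talk are just three policies for managing that one bit: conditional write-through clears it when the block is private, an ownership protocol clears it once the block is owned, and pure write-through keeps it permanently set so every write goes off-chip.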
1320 01:08:22,920 --> 01:08:26,160 Across the top of the die, you see the instruction 1321 01:08:26,160 --> 01:08:29,370 and execute units, with the EBOX on the left 1322 01:08:29,370 --> 01:08:32,069 and the IBOX on the right. 1323 01:08:32,069 --> 01:08:35,000 In the middle of the chip, you see integer control, 1324 01:08:35,000 --> 01:08:37,700 with the clock driver extending horizontally 1325 01:08:37,700 --> 01:08:40,590 across the center of the die. 1326 01:08:40,590 --> 01:08:43,350 Below the clock driver is located the floating-point 1327 01:08:43,350 --> 01:08:47,069 unit, while the data cache occupies 1328 01:08:47,069 --> 01:08:49,775 the right edge of the die. 1329 01:08:49,775 --> 01:08:54,420 The write buffer can be seen in the upper right-hand corner. 1330 01:08:54,420 --> 01:08:58,109 In conclusion, EV4 is the first implementation 1331 01:08:58,109 --> 01:08:59,880 of the Alpha architecture. 1332 01:08:59,880 --> 01:09:03,000 It's a 200-megahertz single-chip microprocessor 1333 01:09:03,000 --> 01:09:04,740 with a flexible interface to allow 1334 01:09:04,740 --> 01:09:08,490 it to be used in systems ranging from desktop workstations 1335 01:09:08,490 --> 01:09:11,060 to massively parallel supercomputers. 1336 01:09:11,060 --> 01:09:14,760 It'll be supported initially by both VMS and OSF/1, 1337 01:09:14,760 --> 01:09:17,729 with other operating systems to follow, 1338 01:09:17,729 --> 01:09:21,120 and it's the first in the family, with more implementations 1339 01:09:21,120 --> 01:09:22,660 to come. 1340 01:09:22,660 --> 01:09:24,640 Speaking as a hardware system designer, 1341 01:09:24,640 --> 01:09:27,130 I'm grateful for people like Dick Sites, who 1342 01:09:27,130 --> 01:09:29,350 thought long and hard about the problems faced 1343 01:09:29,350 --> 01:09:31,390 by high-performance system designers 1344 01:09:31,390 --> 01:09:33,670 when designing the Alpha architecture. 
1345 01:09:33,670 --> 01:09:36,130 I think that the advantages of this architecture 1346 01:09:36,130 --> 01:09:38,740 are apparent in the high performance achieved 1347 01:09:38,740 --> 01:09:41,740 by its initial implementation, and the advantages 1348 01:09:41,740 --> 01:09:44,660 will become even more apparent in the future. 1349 01:09:44,660 --> 01:09:46,960 I hope this tape has been beneficial to you, 1350 01:09:46,960 --> 01:09:49,780 and thank you very much. 1351 01:09:49,780 --> 01:09:52,530 [MUSIC PLAYING] 1352 01:09:52,530 --> 01:10:59,000