[MUSIC PLAYING]

The Cray T3D system is the first of a new generation of supercomputers from Cray Research based on the principles of massively parallel processor design. This lecture will describe architectural and hardware implementation aspects of the Cray T3D. A second tape in this series will describe the system software and application software environments.

This new family of supercomputers complements the highly successful line of Cray parallel-vector processing systems. Those products, the most recent of which is the Cray C90 and its descendants, have roots tracing back to the original Cray-1. The Cray-1, a universally accepted guidepost for the industry, was designed by Seymour Cray and introduced in the marketplace in 1976.

Significant changes have occurred since those early days at Cray Research. Most compelling are the great advances in integrated circuit density and speed. These advances have spawned multiple generations of microprocessors and inexpensive semiconductor memory. In fact, the latest microprocessors now challenge the performance claims of the original Cray-1, but at a small fraction of the cost. By combining this microprocessor technology with advanced Cray proprietary integrated circuit, packaging, and cooling technology, the Cray T3D has been designed to solve highly parallel, very compute- and I/O-intensive problems.

The Cray T3D is a multiprocessor system, scalable up to configurations of 2,048 processors. It operates in a heterogeneous parallel-vector and massively parallel environment. The Cray T3D supports a globally addressable, physically distributed memory subsystem. I/O is also scalable via I/O gateways, which support high-bandwidth proprietary channels connected to Cray parallel-vector systems and dedicated I/O processors. The configuration supports a high degree of concurrency to achieve high efficiencies from the computational and data motion resources.
Scaling options allow varying subsystem configurations across a wide range. These options allow parallel-vector production environments to explore the potential of MPP systems. Others, with optimized MPP applications in hand, will configure their systems with the majority of the resources on the MPP side.

Closely coupling the MPP with proven parallel-vector technology serves several purposes. Moving MPP technology into the production environment requires a host of operating system services. In the parallel-vector environment, Cray has more than a 10-year development investment in our supercomputer variant of Unix, which we call UNICOS. By adding microkernel technology on the MPP side, it has been possible to maximize MPP OS functionality and stability and minimize time to market.

It is well known that moving applications onto MPP platforms can be a daunting task. The heterogeneous configuration allows an environment in which an application programmer can manage the work in a staged, incremental way.

Finally, the closely coupled connection between the parallel-vector system and the Cray T3D allows new opportunities for high-performance computing. There have been recent reports of applications which have portions of code that map best on the MPP, while other portions of the code do better on the parallel-vector platform. The end result can be performance which outpaces either component working alone. The Cray environment will support efficient heterogeneous computing using the Parallel Virtual Machine software technology. PVM was originally developed through a three-way collaborative effort of researchers at Oak Ridge National Laboratory, the University of Tennessee, and Emory University.
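To give a flavor of the message-passing style PVM supports, here is a minimal sketch using the PVM 3 C interface. The program name, message tag, and master/worker pairing are hypothetical, and error handling is omitted:

    #include <stdio.h>
    #include "pvm3.h"

    #define MSGTAG 1  /* hypothetical message tag */

    int main(void)
    {
        int mytid = pvm_mytid();          /* enroll this task in PVM */
        int ptid  = pvm_parent();         /* parent's task id, if any */

        if (ptid == PvmNoParent) {
            /* Master: spawn one copy of this same program as a worker. */
            int tid;
            pvm_spawn("pvm_sketch", NULL, PvmTaskDefault, "", 1, &tid);

            int value = 42;
            pvm_initsend(PvmDataDefault); /* start a packed send buffer */
            pvm_pkint(&value, 1, 1);      /* pack one integer */
            pvm_send(tid, MSGTAG);        /* explicit send to the worker */
        } else {
            /* Worker: receive the integer from the parent. */
            int value;
            pvm_recv(ptid, MSGTAG);       /* blocking receive */
            pvm_upkint(&value, 1, 1);     /* unpack the payload */
            printf("task %d received %d\n", mytid, value);
        }

        pvm_exit();                       /* leave the virtual machine */
        return 0;
    }

Note how the explicit send and matching receive carry both the data and a low-level synchronization, which is the defining trait of the message-passing model discussed later in this lecture.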
The Cray T3D compute and I/O nodes are interconnected with a three-dimensional toroidal switch. The switch is designed for very high bandwidth and extremely low latencies, setting a new standard for massively parallel switch design. The switch supports independent bidirectional paths to nearest neighbors in each of the six directions, which can be thought of as up, down, north, south, east, and west. A seventh pair of paths provides connectivity to the compute resources associated with the node. The switch manages a peak input flux and, of course, an equal output flux of 16.8 gigabits per second, or 2.1 gigabytes per second. Unidirectional switch and wire latency for a two-byte-wide packet phit, or physical unit, is typically 13.3 nanoseconds. When routing requires a corner turn, there's only an additional 6.7 to 13.3 nanoseconds of delay.

What kinds of problems have the potential to capitalize on the computing power delivered by large numbers of processors, numbering in the hundreds or even thousands? While we are interested in delivering a high degree of sustained, as opposed to peak, performance for real-world applications, the most important objective must be reducing the time to solution. The problem and selected algorithm must, of course, be highly parallel. As is well known, Amdahl's law observes that for practical problems, the total execution time will consist of a parallel portion parceled out to a collection of processors, plus a serial portion which cannot be performed in a parallel manner. If large numbers of processors are frequently forced to wait for the computation of data coming from a single processor in the system, plus the additional time required to distribute that data, the actual benefit of the large numbers of processors will be reduced. At some point, adding more processors is no longer cost-effective in reducing the time to solution. Even for a problem which is 95% parallel, one cannot expect speedups beyond 20, even with an arbitrarily large number of processors. Massively parallel or MPP systems of hundreds of processors require problems that are more than 99% parallel.
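To make those figures concrete, here is the standard Amdahl's law arithmetic; the 99-percent case worked below for 1,024 processors is an illustrative calculation added for this transcript, not a figure from the lecture.

    S(N) = \frac{1}{(1 - p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}

    p = 0.95: \quad S_{\infty} = \frac{1}{0.05} = 20

    p = 0.99,\ N = 1024: \quad S = \frac{1}{0.01 + 0.99/1024} \approx 91

So even a 99-percent-parallel problem tops out near a speedup of 100, no matter how many processors are applied.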
The problem design should not only be highly parallel, but it should also exhibit high algorithmic efficiency. Each calculation should be effectively moving the problem toward the solution. A high sustained floating-point operation rate is diminished in value if the algorithm demands many more intermediate results than a competing algorithmic choice.

Third, it must be possible to efficiently move dependent data to requesting processors. This communication overhead can have the same effect as the serial portion of the computation. Here, we are interested in algorithms which require only modest amounts of interprocessor communication, or which allow mechanisms in the design to effectively make the latency transparent. This is accomplished by hardware and software prefetching techniques.

The importance of each of these criteria (algorithmic efficiency, a high level of parallelism, and moderate or transparent processor communications) can be mitigated by selecting ever-higher-performing processors and communication networks. But in the real world, users are also concerned about price performance. It would be terrific if we could design a system consisting of 1,000 processors, each with gigaflops of performance, an interprocessor network that delivers tens of gigabytes per second of data flux for each processor, and interprocessor latencies below 100 nanoseconds. The problem today is that such a system, if it could be built, would be outrageously expensive. This will no doubt be less true tomorrow. Today, however, we are faced with a series of important architectural and technology trade-offs. Properly made, there is vast potential for improving the price performance of challenging and grand-challenge problems having the characteristics just described.

The rest of the story, then, will take you through the series of architectural and technology choices and trade-offs made by the designers of the Cray T3D.

Three distinct kinds of parallel programming models, and the accompanying high-level language constructs to support them, have evolved.
These models are data-parallel, message passing, and work sharing.

Data-parallel models use a limited, often single, control flow within a system to control a very large data set. Data communication is usually specified implicitly in the high-level language using, for example, naming which references entire arrays of data. Interaction among individual data elements is masked by high-level operators, such as matrix multiply, transpose, or inversion.

Message-passing models recognize distinct data domains, which may represent real domains in the problem space or which may just be artifacts of the organization of the computer system itself. In any case, explicit communication sequences are specified, indeed required, which serve to move data and also provide a low-level synchronization function.

The third model is work sharing. The emphasis here is on implicit data motion requests automatically invoked by the use of variables representing a fine degree of granularity. The naming can effectively specify and use arrays, array elements, and scalars. Multiple control flows are supported through a rich set of synchronization primitives. The objective here is to keep a high percentage of the processing cycles actively performing real work.

Without applications, particularly third-party applications, no supercomputing system can be broadly successful in the marketplace. Although application development remains in an immature state, the variety of applications which do exist, and which must be supported for market success, use all three models: data-parallel, message passing, and the emerging work-sharing model. The Cray T3D was designed to efficiently support all three.
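As one concrete illustration of the work-sharing style, here is a minimal sketch of dynamic loop distribution using a shared counter. This is generic C threads-and-atomics code written for this transcript, not the T3D's own primitives; the fast fetch-and-increment is the kind of synchronization primitive such hardware is meant to make cheap.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define N 1000
    #define NTHREADS 4

    static atomic_int next_index = 0;   /* shared loop counter */
    static double result[N];

    /* Each worker repeatedly claims the next unprocessed iteration,
       so fast processors naturally take more of the work. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            int i = atomic_fetch_add(&next_index, 1);
            if (i >= N)
                break;
            result[i] = i * 2.0;        /* stand-in for real work */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("result[999] = %g\n", result[999]);
        return 0;
    }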
Let's turn our attention to the variety of architectural and implementation choices for massively parallel processor design. It is expected that for highly parallel problems, significant performance gains will come from large numbers of processors working together. But there are two distinct ways this cooperation might be enforced. The collection of processors could follow the Single Instruction, Multiple Data, or SIMD, model, or the Multiple Instruction, Multiple Data model, the MIMD model. In the first case, each processor operates in lockstep with all other processors in the partition, but with different parallel data sets. In the second case, MIMD, each processor follows its own control flow, executing from a distinctly different instruction stream.

SIMD designs are a good match for data-parallel models and can be somewhat simpler to program. On the other hand, they suffer from inefficiencies when, for example, sparse matrix data is being manipulated in rather naive ways, with lots of multiplications or additions of zero operands. Although the operation count may be high, the algorithmic efficiency for this example is low.

MIMD systems provide many more degrees of freedom, with, conceptually, each processor operating from distinctly different optimized control stores. In practice, however, the computation would probably be broken up into a handful of computational domains. There may, for example, be separate code to simulate a physical structure's interior and, separately, its boundary regions. Whereas synchronization is implicit for the SIMD model, with all parallel calculations occurring simultaneously, it is necessary in the MIMD case to manage more explicitly the limited interaction among processors. This adds complexity at the application programmer or at least the compiler designer level. Nevertheless, the opportunity for high real performance has caused almost every MPP vendor to concentrate on MIMD designs. And the Cray approach is no different. The Cray T3D is a multiple instruction, multiple data design.
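To illustrate the sparse-matrix point above, compare a naive dense loop, which spends operations on zero operands, with a compressed-sparse-row loop that touches only the nonzeros. This is a generic C sketch with illustrative array names, added for this transcript:

    #include <stddef.h>

    /* Dense matrix-vector product: the operation count is n*n
       multiply-adds even if most entries of a are zero, so the
       flop rate is high but the algorithmic efficiency is low. */
    void matvec_dense(size_t n, const double *a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            double s = 0.0;
            for (size_t j = 0; j < n; j++)
                s += a[i * n + j] * x[j];   /* zero operands still cost ops */
            y[i] = s;
        }
    }

    /* The same product in compressed sparse row form: rowptr, colidx,
       and val describe only the nonzeros, so every multiply-add moves
       the problem toward the solution. */
    void matvec_csr(size_t n, const size_t *rowptr, const size_t *colidx,
                    const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            double s = 0.0;
            for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
                s += val[k] * x[colidx[k]]; /* only nonzeros are touched */
            y[i] = s;
        }
    }

The sparse form, with its data-dependent loop bounds and indirect indexing, is also exactly the kind of code that is awkward in lockstep on a SIMD machine but natural under MIMD.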
The next important choice involves the way in which memory and, ultimately, data are shared among processors. Parallel-vector systems from Cray Research and others use a uniform-access, physically shared memory implementation. With this approach, the processors and the memory subsystem form two distinct and physically separated resource pools. All processors compete more or less equally for access to each location in the memory subsystem. The latency each processor sees is also approximately equivalent. Although not fundamental to the physically shared design, the communication paths between the processor partition and the memory partition are often highly optimized for high bandwidth and low latency, adding significantly to the supercomputer system cost. Unfortunately, such system organizations do not scale well to large numbers of processors, because the required number of independent data paths grows exponentially.

The non-uniform, physically distributed memory model provides a much more scalable solution, with each processor located physically nearby a set of memory devices. The processors can then have preferential access, that is, lower-latency and/or higher-bandwidth access, to the nearby memory. The non-uniform memory access model can provide a much more cost-effective solution for parallel algorithms that emphasize a high degree of local computation and only moderate amounts of global data movement, whereas the high degree of connectivity in the uniform-access memory system may be underutilized a significant part of the time.

The Cray T3D utilizes the non-uniform, physically distributed memory model in its construction. However, through its memory addressing mechanism, it preserves, in a logical sense, the concept of memory sharing, allowing the Cray T3D to operate as a true multiprocessor.

Many MPP systems to date have been designed as multi-computers, not multiprocessors. Interprocessor communication must depend upon explicit I/O-based requests. Data moving from one processor to another must pass through the I/O ports of the processors. This usually requires the use of processor interrupt mechanisms and even intervention by the operating system.
The target processor must decode or interpret, via software techniques, the address tag and the purpose of the message to determine the exact source or destination for the payload. This results in significant increases in latency for the interprocessor communication process.

A multiprocessor, on the other hand, provides direct paths to each memory location within the entire processor pool partition. Addressing is managed through a single global address space. On the target end of the transfer, the data communication mechanism does not interrupt the processor, but proceeds directly, via hardware support, to that processor's local memory.

Because network communication within less efficient multi-computers lacks a unique global addressing mechanism and also requires processor intervention on each side of the transfer, the programming models on these systems are dependent upon the message-passing model. For a multiprocessor system, however, a true work-sharing model can be implemented. The specification of data using a high-level language, whether local or global, can be consistent. Memory references, local or global, are direct at the hardware level, with very low overhead.

It is very important, however, for the Cray T3D to effectively support message passing. And it does. Explicit message-passing send and receive requests are translated so that, at the lower hardware level, the actual data movement progresses through the multiprocessor memory protocols.

Let's concentrate now on the interprocessor data communication network, to provide motivation for the network topology and switch characteristics selected by the Cray T3D designers. Access by processors to the local memory on the same node results in preferential treatment, with higher bandwidth and lower latency, than access to memory locations residing in neighboring nodes. When making trade-offs among resources spent on local versus global traffic, support for locality of data has priority. We have seen that the Cray T3D is a true multiprocessor.
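To give a flavor of what direct global references look like to a programmer, here is a minimal sketch assuming an OpenSHMEM-style interface, in the spirit of the one-sided SHMEM library associated with the T3D. The ring pattern and variable names are illustrative, not code from the lecture.

    #include <stdio.h>
    #include <shmem.h>

    /* A symmetric variable: every PE has its own copy, and any PE
       can address any other PE's copy directly. */
    static long counter = 0;

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        long value = me;
        int right = (me + 1) % npes;

        /* Store directly into the right-hand neighbor's memory. */
        shmem_long_put(&counter, &value, 1, right);

        shmem_barrier_all();   /* global synchronization */

        printf("PE %d holds %ld, written by its left neighbor\n",
               me, counter);

        shmem_finalize();
        return 0;
    }

The put lands in the neighbor's memory with no receive call and no interrupt on the target side, which is exactly the direct, hardware-supported global reference behavior described above.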
Although processors own local memory resources, that memory appears logically as part of a larger, globally addressed, physically distributed, shared memory subsystem. Although the objective in programming for a distributed memory system should always be to maximize the occurrence of local calculations and thereby minimize global traffic, it is nevertheless helpful to minimize, within reason, global latency. We give this priority because a certain amount of global communication for real-world problems is unavoidable.

But we want more than low latency. We also want to include mechanisms which allow the system to hide latency whenever possible through invocation of prefetch mechanisms. Network paths should obviously be high-bandwidth, within the constraints of system cost trade-offs. The Cray T3D achieves both low latency and high bandwidth by using true supercomputer technology. Microprocessor shell and network switch logic are clocked by a single synchronous 150-megahertz clock distribution network.

Market requirements for massively parallel systems dictate support for a wide range of system sizes with incremental pricing structures. The network must be scalable, with a fairly linear increase in cost to match linear increases in the number of processors. Fault-tolerant features must be considered when designing the switching network. Reconfiguration can allow masking of hardware faults until scheduled maintenance, increasing system availability.

Finally, the system architect should be aware of the technology choices which will be available to those who implement the hardware design. Advanced packaging to support high-clock-rate designs has been a trademark of Cray Research designs through the years. With time to market always an important consideration, it is wise to exploit, but not overextend, the technology reach of each new product.
Good judgment therefore requires a good technology fit with the design and manufacturing processes of the organization.

And now, Steve Oberlin will discuss the details of the interprocessor communication network.

Thanks, Steve. As previously mentioned, Cray T3D Processing Elements, or PEs, communicate over a high-speed network configured in a three-dimensional torus topology. What's that? Well, to illustrate, let's start with networks having fewer than three dimensions. This is a diagram of a simple single-dimension torus network connecting eight nodes. The links between them carry data packets at some maximum rate we're not concerned with for now. There are a few things worth noting about this network.

First, what's the latency of the network going to be? Topologically speaking, there are no ends to the network. During random global communications, a packet can choose to travel whichever direction around the network is the shortest path to its destination. The worst-case distance, then, is to a node halfway around the network. If we measured latency in hops, the maximum for this network would be four hops. If you look at all possible communications between PEs, the average latency is about two hops.

Second, what is the bandwidth of this network? The ability of the network to carry global traffic is measured by counting the number of wires that would be cut if you snipped the network in two at the center. This is called "bisecting" the network. When you multiply the result by the speed data moves on each wire, you get a metric called the "bisection bandwidth." And for this simple 1D example, the bisection is two times the channel rate.

In this diagram, it looks as though the link connecting the ends to form a torus must be a pretty long wire. In reality, it's possible to fold a torus so that all wires are about the same length. This diagram is also a torus.
And it's topologically equivalent to the previous diagram; only the wires go from node to node in an interleaved pattern. In T3D, two of the three dimensions are folded in this fashion.

There's one other characteristic of the torus that's worth mentioning: it's naturally fault-tolerant. Suppose one of the links broke for some reason. The torus provides a second path between nodes, the long way around the network, so that the system can continue running. The more processors in a system, the more important fault tolerance becomes, because the failure of a single PE or link represents a correspondingly smaller fraction of the machine. Who wants to give up their machine for repair if a failure only costs one-thousandth of its capability?

Let's look at a more complex network, an eight-node 2D torus. Now, our nodes are connected to their neighbors in two dimensions. By adding a second dimension, we double the bisection bandwidth to four over the 1D torus. Maximum latency is now three, and the average is less than two. These effects are magnified the more nodes you have. For instance, a 64-node 2D torus has eight times the global bandwidth of a 1D torus and a maximum latency of eight, instead of 32.

You can see where this is going by now. By stacking planes of 2D tori and connecting nodes in the third dimension into a torus as well, we can construct a 3D torus, which is the T3D's network topology. We've again doubled the bisection bandwidth, to eight. And maximum latency is now two, and the average barely over one. Again, the biggest advantages of adding a dimension come with the larger sizes. At 64 nodes, a 3D torus has 16 times the bandwidth of a 1D torus and a maximum latency of six, instead of 32. For comparison, the largest-size T3D has a torus with 1,024 nodes in an eight by 16 by eight configuration.
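Those hop and bandwidth figures are easy to check mechanically. Here is a small C sketch, written for this transcript, that computes the maximum hop count and the bisection bandwidth for a torus of given dimensions:

    #include <stdio.h>

    /* Maximum shortest-path hops in one torus ring of a given size:
       a packet can go either way around, so it's half the ring. */
    static int ring_max(int size) { return size / 2; }

    /* Links cut when a torus is bisected across one dimension:
       one plane of links, doubled because the ring wraps around. */
    static long bisect_links(int dx, int dy)
    {
        return 2L * dx * dy;
    }

    int main(void)
    {
        /* The largest T3D: 8 x 16 x 8 = 1,024 network nodes. */
        int max_hops = ring_max(8) + ring_max(16) + ring_max(8);
        printf("max hops: %d\n", max_hops);       /* 4 + 8 + 4 = 16 */

        /* The 1K-processor example from the lecture: 8 x 8 x 8 nodes,
           two directions per link, 300 megabytes per second each way. */
        long links = bisect_links(8, 8);          /* 128 links cut */
        double gbytes = links * 2 * 300.0 / 1000.0;
        printf("bisection: %.1f GB/s\n", gbytes); /* 76.8 GB/s */
        return 0;
    }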
Data packets route through the T3D network in dimension order. This means that a packet injected into the network routes in the x-dimension first, then the y-dimension, then z, until it reaches the destination node, where it's ejected from the network.

So what does a T3D network node look like? This block diagram shows the communication paths associated with a network switch. Each network node in T3D is shared by two independent PEs. The PEs are logically separate and only share the bandwidth of a network switch and a data mover called the "Block Transfer Engine," which I'll talk about later. There are high-speed channels radiating from the network switch in seven directions. Six of those directions are for the 3D torus connections. The seventh direction is the channel to the PEs. Each channel is actually two unidirectional channels that operate independently. In other words, it's possible to have data streaming through the switch from east to west at the same time other data is moving from west to east.

Packet steering is controlled by a routing tag in the header of a data packet that specifies the number of hops the packet should travel in each direction. In T3D, routing tags are stored in a high-speed lookup table at each node. When a data link fails, the routing directions in the lookup tables are simply changed to avoid the bad link. T3D also provides some extra compute nodes as spares in the event of a PE or node failure, for increased fault tolerance. There are eight redundant PEs for every 512 PEs. Redundant nodes can be electronically switched in place of a failed compute node by rewriting the lookup table to substitute the redundant node for the failed node.
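To make dimension-order routing concrete, here is a small C sketch, again written for this transcript, that builds a routing tag of signed hop counts, picking the shorter way around each ring. The coordinates and tag layout are illustrative, not the hardware's actual tag format:

    #include <stdio.h>

    /* Signed hops along one ring: positive one way around, negative
       the other, always the shorter of the two. */
    static int ring_hops(int from, int to, int size)
    {
        int d = (to - from + size) % size;   /* hops going "up" */
        return (d <= size / 2) ? d : d - size;
    }

    int main(void)
    {
        /* Hypothetical source and destination coordinates in an
           8 x 16 x 8 torus. */
        int src[3] = {1, 2, 3}, dst[3] = {7, 14, 0}, dim[3] = {8, 16, 8};
        int tag[3];

        /* Dimension order: exhaust x hops first, then y, then z. */
        for (int i = 0; i < 3; i++)
            tag[i] = ring_hops(src[i], dst[i], dim[i]);

        printf("route: x %+d, y %+d, z %+d\n", tag[0], tag[1], tag[2]);
        return 0;
    }

For this example the tag comes out x -2, y -4, z -3: in every dimension the wraparound path is shorter than the direct one, which is the torus advantage described earlier.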
The network switch, or router, is actually implemented on three chips. Each chip handles one dimension. We call this a "dimension-sliced router." Its functioning is fairly straightforward. Packets enter the x-dimension switch from a PE and route from x-switch to x-switch, then turn the corner by crossing to the y-switch chip to travel in the y-dimension, then cross to the z-switch chip for z-dimension routing, then finally emerge at the destination PE.

Latency of the switch is very low. Essentially, it takes one clock period, 6.7 nanoseconds at 150 megahertz, for a packet entering a chip to decide which way it's going and leave again. At these clock speeds, the time spent on the wires isn't negligible and must also be included in latency calculations. In T3D, all wires are either one or one and a half clocks long. Each hop in the network then takes one clock for the switch plus one to one and a half clocks for the wire. Turning the corner is similar to routing within a dimension. It takes one clock inside the first chip, one clock for the connection between chips, and one clock for the second chip. And you're on the wires in the next dimension.

Notice this diagram has all the same arrows as the earlier node diagram. Each arrow represents a channel that is 16 bits wide and clocked at 150 megahertz, for a raw bandwidth of 300 megabytes per second. Earlier, we talked about the bisection bandwidth of a network. The bisection bandwidth of a 1K-processor T3D, which has 512 network nodes in an eight by eight by eight torus, is easy to calculate. The number of channels cut when we bisect the torus is 8 times 8, one plane of the torus, times 2 for the connected ends of the torus, times 2 for the channels in each direction, times 300 megabytes per second, giving a total of 76.8 gigabytes per second.
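As a rough sanity check on those per-hop figures (a back-of-the-envelope added for this transcript, ignoring contention and the overhead of entering and leaving the network): the worst-case route in the largest eight by 16 by eight machine is 4 + 8 + 4 = 16 hops, so

    16\ \text{hops} \times (2\ \text{to}\ 2.5\ \text{clocks per hop}) \times 6.7\ \text{ns per clock} \approx 210\ \text{to}\ 270\ \text{ns}

plus roughly one extra clock for each of the two corner turns. Even a worst-case network traversal is a few hundred nanoseconds, comfortably under a microsecond.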
Data packets carry data in two payload sizes, one or four 64-bit words. Packets are broken up into 16-bit chunks called "phits," for physical units, and sent over the network in a string of successive parcels. The routing tag is always in the first phit, and it opens a path through the switches for the following control and data phits. The head of a packet can easily be many switches ahead of the tail in the network. In the scientific literature, this is commonly called "wormhole routing."

Deadlock is a phenomenon, like gridlock in Manhattan at rush hour, that can happen when packets in a network are not guaranteed to have room to make forward progress. Deadlock is impossible in T3D because of the use of layers of buffering called "virtual channels." By separating the buffering for requests and responses, and providing an additional set of buffers for packets that have passed a certain point in a torus ring called the "date line," we guarantee network packets can't be blocked indefinitely. T3D thus has four sets of virtual channel buffers: two for requests and responses before the date line and two for requests and responses after the date line.

Cray chose the 3D torus topology after an extensive investigation ending in a final shootout between a couple of champion networks in a detailed simulation using real and synthetic communication patterns. When evaluated within the constraints of real-world packaging limits and wiring capabilities, the 3D torus provided the highest global bandwidth and lowest global latency for the number of processors we expected to have in T3D, hundreds to thousands. Several additional important characteristics include good physical scalability, fault tolerance, the ability to exploit locality for reduced latency and improved bandwidth, and a good fit with our technology, that is, short wires in a regular wiring pattern, resulting in a high clock speed. There's also significant headroom in the technology that will be used to provide even better network performance in future machines.

Let's define and differentiate two levels of the Cray T3D architecture. These are the macro-architecture level and the micro-architecture level. The macro architecture represents the high-level view of the system seen by the MPP programmer.
This includes the globally addressed, physically distributed memory characteristics; in a general sense, the topology, which defines relationships among compute nodes; and the synchronization features, to the degree they are exposed in high-level language constructs. It is important to present a consistent, although probably slowly evolving, view to the user. This allows generation-to-generation source compatibility. The balance among computation, memory bandwidth, and interconnect support will also be fairly consistent. Application-level optimizations that work well on one generation's system should also work well on the next. If a system is successful in the marketplace, this steadiness of purpose will also work to support the development of standards in the MPP world.

The micro-architecture roadmap, on the other hand, must be much more agile. Micro architecture refers to the CPU and the lower-level logic shell, which maps the CPU into the system overall. When beginning an MPP design, one of the first decisions to be made is whether to design a new proprietary microprocessor or to select from a variety of emerging commodity devices. The primary argument for a proprietary design is that it gives an opportunity to specifically and efficiently tailor the micro-architectural design within the context of the overall macro-architectural goals.

But there is a problem. With commodity microprocessor development cycles lasting only two years, there is opportunity for rapid improvement, which also can mean rapid change. Being competitive with current state-of-the-art microprocessor designs, which by some counts are already, as a class, in the middle of the fourth or fifth generation, demands significant design and integrated circuit process development resources. The human resources must be applied in a widely parallel fashion to minimize time to market. There is always an unfolding next generation coming from large competing design groups.
For these reasons, even the largest of computer corporations have seen fit to build and support consortia to provide sufficient market presence to pay back the design effort. And of course, these devices benefit from high volume, which allows lower pricing. This counterargument, to embrace an emerging commodity micro-architectural implementation, was most persuasive for the Cray T3D project team.

The processor selected was from the Alpha microprocessor family from Digital Equipment Corporation. This device was selected primarily because it best met our objectives of providing high performance and high precision for both floating-point and integer data types. We present only a short list of some of the more important characteristics of this device, but would refer you to the other videos in the UVC series featuring lectures by Digital Equipment Corporation engineers. These tapes cover Alpha architecture and implementation in detail.

The DECchip 21064 represents a 0.75-micron CMOS implementation of a full 64-bit reduced instruction set computer. It has a peak floating-point performance of 150 million floating-point operations per second, utilizing a 150-megahertz clock frequency. Up to two instructions can be issued per clock period, or 300 MIPS. The internal register set consists of 32 integer and 32 floating-point registers. There are two internal caches: 8K bytes of instruction cache and 8K bytes of data cache.

All candidate commodity microprocessors, including the Alpha, were designed primarily to satisfy the requirements of workstations containing single or, at most, a handful of CPU devices. This design focus, while appropriate for the workstation market, results in certain weaknesses when such a device is inserted into a scalable, massively parallel processing system. This is particularly true when the system design has an objective of supporting very large numbers of processors sharing a very large global address space.
Fortunately, there are ways in which the microprocessor can be enhanced with appropriate supporting shell logic.

A globally addressed memory subsystem designed with the objective of scaling to more than 1,000 processors requires a very large physical address. Even a large workstation memory by today's standards is small compared to that of a fully configured Cray T3D system.

When data cannot be found in the local data cache, the cache miss invokes an external memory reference. Typical microprocessors can support only a very limited number of outstanding memory loads, often only one. But in a distributed memory MPP, any global reference, even nearest-neighbor, which stalls the processor will seriously affect computational efficiency. Logic must be added to support and mask many outstanding references, thereby hiding latency. This provides the capability to fully exploit modern compilers that are able to accurately judge when to prefetch streams of data.

Interprocessor synchronization must be fast and scalable. Typical RISC microprocessor designs anticipate only very moderate numbers of processors working together within a workstation. Techniques used are often bus-based, such as the popular bus-snooping protocol. Due to very real physical constraints, these schemes do not scale well beyond about a dozen processors. Software and I/O-based synchronization protocols used in workstation cluster organizations may scale to large numbers, but fall far short of our goal of providing single-microsecond or less global synchronization latency.
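The need for many outstanding references follows from simple arithmetic, an illustrative application of Little's law added for this transcript rather than a figure from the lecture. To keep a remote path busy, the number of references in flight must be about the latency divided by the issue interval:

    \text{references in flight} \approx \frac{\text{remote latency}}{\text{issue interval}} = \frac{500\ \text{ns}}{6.7\ \text{ns per word}} \approx 75

With only a single outstanding load, the same path would deliver barely one percent of that bandwidth, which is why the shell must support and mask many references at once.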
756 00:36:26,150 --> 00:36:29,040 These include the physically distributed memory, 757 00:36:29,040 --> 00:36:31,833 which probably represents both the largest challenge 758 00:36:31,833 --> 00:36:33,500 and the greatest performance opportunity 759 00:36:33,500 --> 00:36:36,662 for a programmer, the global address space, which 760 00:36:36,662 --> 00:36:38,120 provides the ability to communicate 761 00:36:38,120 --> 00:36:40,760 in a fine-grained manner, fast barrier 762 00:36:40,760 --> 00:36:43,400 synchronization, especially useful for data-parallel 763 00:36:43,400 --> 00:36:47,000 programming, fast memory locks and support for dynamic loop 764 00:36:47,000 --> 00:36:50,030 distribution, especially useful for work sharing, 765 00:36:50,030 --> 00:36:52,340 and hardware messaging support, especially 766 00:36:52,340 --> 00:36:55,550 useful for the message-passing programming model. 767 00:36:55,550 --> 00:36:58,040 Also visible to the programmer due to their effect 768 00:36:58,040 --> 00:37:01,550 on performance are the 3D torus interconnect network, 769 00:37:01,550 --> 00:37:04,520 where a programmer may wish to optimize communication patterns 770 00:37:04,520 --> 00:37:07,670 or data locality, the block transfer engine, 771 00:37:07,670 --> 00:37:10,490 which can help hide remote latency, 772 00:37:10,490 --> 00:37:12,595 and the local processor's cache memory, 773 00:37:12,595 --> 00:37:13,970 which a programmer may need to be 774 00:37:13,970 --> 00:37:16,470 aware of for best performance. 775 00:37:16,470 --> 00:37:19,250 Let's take a look at how the PE micro architecture implements 776 00:37:19,250 --> 00:37:22,227 the macro architecture in T3D. 777 00:37:22,227 --> 00:37:23,810 At the heart of the processing element 778 00:37:23,810 --> 00:37:25,760 is the Alpha Microprocessor. 779 00:37:25,760 --> 00:37:28,850 Around the Alpha, we've added a shell of external circuits 780 00:37:28,850 --> 00:37:31,640 to implement the MPP functions we need. 781 00:37:31,640 --> 00:37:33,440 These fall into the three categories 782 00:37:33,440 --> 00:37:36,980 Steve mentioned earlier, address extension, latency hiding, 783 00:37:36,980 --> 00:37:38,910 and synchronization. 784 00:37:38,910 --> 00:37:41,690 The Alpha has a 43-bit virtual address space that is 785 00:37:41,690 --> 00:37:43,850 translated in the on-chip DTB-- 786 00:37:43,850 --> 00:37:46,520 that's Data Translation Lookaside Buffer-- 787 00:37:46,520 --> 00:37:49,340 to a 34-bit physical byte address. 788 00:37:49,340 --> 00:37:52,670 T3D has up to 64 megabytes of local memory 789 00:37:52,670 --> 00:37:55,640 and up to 2K processors, which would require 790 00:37:55,640 --> 00:37:57,920 at least 37 bits of address. 791 00:37:57,920 --> 00:38:00,860 In reality, even that is not enough. 792 00:38:00,860 --> 00:38:02,870 Several additional address bits are also 793 00:38:02,870 --> 00:38:05,570 needed to control caching and to facilitate control 794 00:38:05,570 --> 00:38:08,780 of the memory-mapped mechanisms that implement the external MPP 795 00:38:08,780 --> 00:38:09,950 shell. 796 00:38:09,950 --> 00:38:13,670 So an external address extension mechanism of some sort 797 00:38:13,670 --> 00:38:15,110 is required. 798 00:38:15,110 --> 00:38:18,470 In T3D, this takes the form of a 32-entry register 799 00:38:18,470 --> 00:38:20,870 set called the "DTB annex." 
800 00:38:20,870 --> 00:38:23,540 The way memory is distributed in T3D, 801 00:38:23,540 --> 00:38:26,540 addresses have two parts, a PE number part 802 00:38:26,540 --> 00:38:29,000 that says essentially which bank of memory 803 00:38:29,000 --> 00:38:30,680 and an offset part that says which 804 00:38:30,680 --> 00:38:32,780 memory location within the bank. 805 00:38:32,780 --> 00:38:34,850 When the Alpha initiates a load or store, 806 00:38:34,850 --> 00:38:36,290 the offset portion of the address 807 00:38:36,290 --> 00:38:38,000 is taken from the physical address bits 808 00:38:38,000 --> 00:38:39,890 that are translated on-chip. 809 00:38:39,890 --> 00:38:43,850 And the PE number portion is taken from the DTB annex. 810 00:38:43,850 --> 00:38:45,560 The choice of which of the 32 annex 811 00:38:45,560 --> 00:38:47,480 registers to read for the PE number 812 00:38:47,480 --> 00:38:50,540 is made by a five-bit field above the offset. 813 00:38:50,540 --> 00:38:54,620 The DTB annex is readable and writable by user code. 814 00:38:54,620 --> 00:38:57,890 The value read from the annex is a virtual PE number, 815 00:38:57,890 --> 00:39:00,080 which is translated by a shell mechanism 816 00:39:00,080 --> 00:39:03,170 into a logical PE number and checked for validity prior 817 00:39:03,170 --> 00:39:05,450 to applying it to the routing tag lookup table 818 00:39:05,450 --> 00:39:07,640 to make a remote reference. 819 00:39:07,640 --> 00:39:11,090 The virtual-to-logical PE translation mechanism 820 00:39:11,090 --> 00:39:12,920 is used to create user partitions 821 00:39:12,920 --> 00:39:15,050 within the physical torus and provides 822 00:39:15,050 --> 00:39:17,270 a good deal of flexibility concerning size 823 00:39:17,270 --> 00:39:19,670 and shape of the partition. 824 00:39:19,670 --> 00:39:22,670 Shell circuitry always checks the virtual PE number 825 00:39:22,670 --> 00:39:25,670 and, if it matches the local PE, performs a local memory 826 00:39:25,670 --> 00:39:28,970 reference, instead of a remote reference. 827 00:39:28,970 --> 00:39:31,280 What about latency hiding? 828 00:39:31,280 --> 00:39:33,470 The Alpha, like most microprocessors, 829 00:39:33,470 --> 00:39:36,560 has a very limited external pipeline capability. 830 00:39:36,560 --> 00:39:38,990 When it performs a load, it places the address 831 00:39:38,990 --> 00:39:41,180 on its address bus and waits for the data 832 00:39:41,180 --> 00:39:44,630 to be delivered before it will move on to the next reference. 833 00:39:44,630 --> 00:39:46,610 This may be OK for a desktop system 834 00:39:46,610 --> 00:39:49,820 with a single processor, when the memory is only inches away. 835 00:39:49,820 --> 00:39:51,860 But it's unacceptable in an MPP when 836 00:39:51,860 --> 00:39:53,900 the data is at a remote node. 837 00:39:53,900 --> 00:39:55,520 We can't have the processor stalled 838 00:39:55,520 --> 00:39:57,230 waiting for remote references. 839 00:39:57,230 --> 00:40:01,250 So in T3D, we've extended the memory pipeline externally 840 00:40:01,250 --> 00:40:03,470 by adding a latency hiding mechanism called 841 00:40:03,470 --> 00:40:05,570 the "prefetch queue." 842 00:40:05,570 --> 00:40:08,510 The prefetch queue is a first-in, first-out memory, 843 00:40:08,510 --> 00:40:12,290 16 64-bit words deep, that is used 844 00:40:12,290 --> 00:40:15,380 as the temporary destination of remote loads. 
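Stepping back to the address extension for a moment: the annex mechanism can be pictured with a minimal C sketch. The names (dtb_annex, split_address) and the 26-bit offset width are illustrative assumptions, not the actual T3D bit layout.

    #include <stdint.h>

    /* Sketch of the DTB annex address extension: the low bits form the
       offset within a PE's memory, and the 5-bit field just above
       selects one of 32 annex registers, which supplies the PE number. */
    #define OFFSET_BITS 26                 /* enough for 64 MB of local memory */

    static uint32_t dtb_annex[32];         /* readable/writable by user code */

    static void split_address(uint64_t phys, uint32_t *virt_pe, uint64_t *offset)
    {
        *offset = phys & ((1ULL << OFFSET_BITS) - 1);
        uint32_t sel = (phys >> OFFSET_BITS) & 0x1F;   /* 5-bit annex index */
        *virt_pe = dtb_annex[sel];         /* then translated to a logical PE,
                                              validity-checked, and applied to
                                              the routing tag lookup table */
    }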
845 00:40:15,380 --> 00:40:19,740 Prefetches are initiated by executing a FETCH instruction, 846 00:40:19,740 --> 00:40:22,820 which is an Alpha opcode meant to be used as a prefetch hint 847 00:40:22,820 --> 00:40:25,820 to a board-level secondary cache. 848 00:40:25,820 --> 00:40:28,890 In T3D, a fetch causes a single-word load 849 00:40:28,890 --> 00:40:32,040 to be initiated from memory, possibly remote. 850 00:40:32,040 --> 00:40:34,260 But since the destination is not the processor, 851 00:40:34,260 --> 00:40:35,820 the fetch can be acknowledged as soon 852 00:40:35,820 --> 00:40:37,560 as we've latched the address. 853 00:40:37,560 --> 00:40:40,380 And the processor can continue with other work. 854 00:40:40,380 --> 00:40:42,420 When the load response returns, it's 855 00:40:42,420 --> 00:40:44,790 placed in a slot in the prefetch queue that 856 00:40:44,790 --> 00:40:47,640 was reserved for it at the time the fetch issued. 857 00:40:47,640 --> 00:40:50,610 Sometime later, the processor can come back and pop the data 858 00:40:50,610 --> 00:40:53,520 from the memory-mapped head of the prefetch queue, 859 00:40:53,520 --> 00:40:56,910 thereby hiding the latency of the load. 860 00:40:56,910 --> 00:40:59,160 If the processor tries to pop the data from the queue 861 00:40:59,160 --> 00:41:01,290 before it's returned, the processor 862 00:41:01,290 --> 00:41:03,480 will stall until the data arrives. 863 00:41:03,480 --> 00:41:05,820 Using the prefetch queue, up to 16 loads 864 00:41:05,820 --> 00:41:08,045 can be outstanding at any time. 865 00:41:08,045 --> 00:41:09,420 The prefetch queue is turning out 866 00:41:09,420 --> 00:41:14,130 to be a very powerful mechanism easily used by the compiler. 867 00:41:14,130 --> 00:41:17,620 Alpha stores to memory do not need to wait for a response. 868 00:41:17,620 --> 00:41:21,060 So a large number can naturally be outstanding at any time. 869 00:41:21,060 --> 00:41:22,980 This is a good communication mechanism 870 00:41:22,980 --> 00:41:25,890 if the producer knows who the next consumer will be, 871 00:41:25,890 --> 00:41:28,650 because the Alpha has write buffers on board that 872 00:41:28,650 --> 00:41:32,130 try to accumulate a cache line, four words, of data 873 00:41:32,130 --> 00:41:35,160 prior to performing the actual remote store. 874 00:41:35,160 --> 00:41:39,000 This increases the payload size and effective bandwidth. 875 00:41:39,000 --> 00:41:41,730 A counter in the shell circuitry is incremented each time 876 00:41:41,730 --> 00:41:44,610 a store is issued by Alpha and decremented 877 00:41:44,610 --> 00:41:46,680 each time a write completes. 878 00:41:46,680 --> 00:41:49,080 This counter can be read by the processor at any time 879 00:41:49,080 --> 00:41:50,610 to determine when all of the writes 880 00:41:50,610 --> 00:41:53,970 have completed for synchronization purposes. 881 00:41:53,970 --> 00:41:56,190 The third major latency-hiding mechanism 882 00:41:56,190 --> 00:41:58,140 is an asynchronous memory-to-memory data 883 00:41:58,140 --> 00:42:01,560 mover called the block transfer engine or BLT. 884 00:42:01,560 --> 00:42:04,380 There's one BLT per node-- no sense having two, 885 00:42:04,380 --> 00:42:07,590 since it can saturate the network interface by itself. 886 00:42:07,590 --> 00:42:10,140 Once started, it functions independently of the PE, 887 00:42:10,140 --> 00:42:12,120 transferring data between remote memory 888 00:42:12,120 --> 00:42:14,700 and the local memory in either direction. 
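Returning to the prefetch queue: the software pipelining it enables looks roughly like the sketch below, which keeps the 16-deep queue full while consuming results in order. fetch_remote and pop_prefetch_queue are hypothetical intrinsic names standing in for the FETCH instruction and a load from the memory-mapped queue head.

    /* Hypothetical intrinsics; the real mechanisms are the Alpha FETCH
       instruction and a read of the memory-mapped queue head. */
    extern void   fetch_remote(const double *remote_addr);
    extern double pop_prefetch_queue(void);

    #define QUEUE_DEPTH 16                 /* up to 16 loads outstanding */

    void gather_remote(double *dst, const double *src, int n)
    {
        int issued = 0, popped = 0;
        /* Prime the pipeline with as many fetches as the queue holds. */
        while (issued < n && issued < QUEUE_DEPTH)
            fetch_remote(&src[issued++]);
        /* Pop results in order, issuing a new fetch behind each pop. */
        while (popped < n) {
            dst[popped++] = pop_prefetch_queue();  /* stalls only if data not back */
            if (issued < n)
                fetch_remote(&src[issued++]);
        }
    }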
889 00:42:14,700 --> 00:42:17,730 The BLT can move up to 256K words of memory 890 00:42:17,730 --> 00:42:19,770 with a single invocation. 891 00:42:19,770 --> 00:42:22,230 The local data that is the source on a write block 892 00:42:22,230 --> 00:42:25,710 transfer or the destination on a read block transfer 893 00:42:25,710 --> 00:42:28,440 can be addressed with any constant stride. 894 00:42:28,440 --> 00:42:31,530 Remote addresses can be any constant stride 895 00:42:31,530 --> 00:42:34,020 or can be indirect, meaning the address is 896 00:42:34,020 --> 00:42:36,330 from a separate stream previously stored 897 00:42:36,330 --> 00:42:37,890 in local memory. 898 00:42:37,890 --> 00:42:39,810 Indirect operations are commonly called 899 00:42:39,810 --> 00:42:42,510 "scatter/gather operations" because they can reference data 900 00:42:42,510 --> 00:42:45,270 in completely arbitrary ways. 901 00:42:45,270 --> 00:42:47,520 The BLT also has a unique mechanism 902 00:42:47,520 --> 00:42:49,740 for interpreting the remote address. 903 00:42:49,740 --> 00:42:51,750 This mechanism, called the "Centrifuge," 904 00:42:51,750 --> 00:42:54,390 is used to support and control the distribution of data 905 00:42:54,390 --> 00:42:57,060 among the PEs that is so critical to attaining 906 00:42:57,060 --> 00:43:00,300 good performance on a distributed memory computer. 907 00:43:00,300 --> 00:43:03,390 We previously said that memory addresses have two parts, 908 00:43:03,390 --> 00:43:05,640 an offset and a PE number. 909 00:43:05,640 --> 00:43:07,650 A remote address generated by the BLT 910 00:43:07,650 --> 00:43:09,480 must somehow have its bits interpreted 911 00:43:09,480 --> 00:43:12,000 to determine which PE to send the reference to 912 00:43:12,000 --> 00:43:15,060 and what offset to use once the reference arrives. 913 00:43:15,060 --> 00:43:17,490 I've got a couple of examples to show you how this works 914 00:43:17,490 --> 00:43:19,740 and why it's so powerful. 915 00:43:19,740 --> 00:43:22,290 In a conventional computer with an interleaved memory, 916 00:43:22,290 --> 00:43:23,940 an address is usually interpreted 917 00:43:23,940 --> 00:43:26,640 as having the memory bank number in the low-order bits, 918 00:43:26,640 --> 00:43:29,040 with the offset in the high-order bits. 919 00:43:29,040 --> 00:43:31,860 If we increment this address repetitively by one, 920 00:43:31,860 --> 00:43:34,620 we see that the resulting address pattern visits the same 921 00:43:34,620 --> 00:43:36,330 offset on each bank once. 922 00:43:36,330 --> 00:43:38,640 Then the carry increments the offset field, 923 00:43:38,640 --> 00:43:41,280 and all the banks are visited again. 924 00:43:41,280 --> 00:43:42,840 In a distributed memory computer, 925 00:43:42,840 --> 00:43:44,640 sometimes the offset is interpreted 926 00:43:44,640 --> 00:43:47,160 as lying in the low-order bits of the address, 927 00:43:47,160 --> 00:43:50,790 while the bank or PE number is in the high-order bits. 928 00:43:50,790 --> 00:43:53,490 If we increment this address repetitively by one, 929 00:43:53,490 --> 00:43:55,980 we see that the resulting address pattern traverses 930 00:43:55,980 --> 00:43:57,930 all of the memory within the first bank 931 00:43:57,930 --> 00:44:01,350 before finally incrementing the bank address to the next bank, 932 00:44:01,350 --> 00:44:03,990 where all of its memory is traversed before incrementing 933 00:44:03,990 --> 00:44:06,240 the bank again, et cetera. 
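Before continuing with address interpretation, here is one way to picture a single BLT request as a C descriptor. The real programming interface is a set of memory-mapped registers, and every field name here is an assumption made purely for illustration.

    #include <stdint.h>

    /* Hypothetical block transfer engine request descriptor, reflecting
       the capabilities described above. */
    typedef struct {
        int       direction;       /* read (remote-to-local) or write */
        int       remote_mode;     /* constant stride, or indirect */
        uint64_t  local_base;      /* local source or destination address */
        int64_t   local_stride;    /* any constant stride, in words */
        uint64_t  remote_base;     /* centrifuged before use (see below) */
        int64_t   remote_stride;   /* used in constant-stride mode */
        uint64_t *index_stream;    /* local stream of remote addresses for
                                      indirect (scatter/gather) mode */
        uint32_t  word_count;      /* up to 256K (2^18) words per invocation */
    } blt_request;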
934 00:44:06,240 --> 00:44:09,180 In our MPP, we'd like to have it both ways. 935 00:44:09,180 --> 00:44:11,130 Indeed, we want even more. 936 00:44:11,130 --> 00:44:12,990 We'd like to be able to control exactly 937 00:44:12,990 --> 00:44:15,660 which bits of the address are interpreted as PE 938 00:44:15,660 --> 00:44:17,370 and which are offset. 939 00:44:17,370 --> 00:44:21,210 This would allow us to, for example, slide the PE number up 940 00:44:21,210 --> 00:44:25,690 by, say, four bits so that it had offset bits on either side. 941 00:44:25,690 --> 00:44:28,380 If we start incrementing this address by one, 942 00:44:28,380 --> 00:44:30,210 we see that the address pattern touches 943 00:44:30,210 --> 00:44:33,210 a block of 16 addresses on one PE 944 00:44:33,210 --> 00:44:38,050 before incrementing to the next PE to touch 16 more and so on. 945 00:44:38,050 --> 00:44:40,560 We call this a "block distribution." 946 00:44:40,560 --> 00:44:42,780 The Centrifuge in T3D is controlled 947 00:44:42,780 --> 00:44:46,920 by a mask that indicates which bits are PE number. 948 00:44:46,920 --> 00:44:49,410 Any place a one is set in the mask, 949 00:44:49,410 --> 00:44:52,110 the address is interpreted to be a PE number. 950 00:44:52,110 --> 00:44:54,690 Any place a zero is in the mask, the address 951 00:44:54,690 --> 00:44:56,880 is interpreted as an offset. 952 00:44:56,880 --> 00:45:00,210 The output of the Centrifuge is a separated PE number 953 00:45:00,210 --> 00:45:04,200 and offset that is used to make a memory reference. 954 00:45:04,200 --> 00:45:07,950 Note that bits in the mask do not even need to be contiguous. 955 00:45:07,950 --> 00:45:10,650 This permits precise control of the distribution 956 00:45:10,650 --> 00:45:14,580 of individual dimensions of multidimensional data arrays. 957 00:45:14,580 --> 00:45:17,940 All remote addresses generated by the block transfer engine 958 00:45:17,940 --> 00:45:20,460 are centrifuged prior to use. 959 00:45:20,460 --> 00:45:23,130 The BLT naturally has a larger start-up overhead 960 00:45:23,130 --> 00:45:25,890 than direct processor stores or the prefetch queue. 961 00:45:25,890 --> 00:45:29,870 So it's most useful when moving larger blocks of data. 962 00:45:29,870 --> 00:45:32,880 Now, let's talk about synchronization. 963 00:45:32,880 --> 00:45:34,610 The PE shell circuitry implements 964 00:45:34,610 --> 00:45:37,790 four major synchronization mechanisms, barriers, 965 00:45:37,790 --> 00:45:41,600 fetch and increment, atomic swap, and messaging. 966 00:45:41,600 --> 00:45:43,940 A barrier is a point in a program beyond which 967 00:45:43,940 --> 00:45:46,760 no processor should advance before all processors have 968 00:45:46,760 --> 00:45:47,810 arrived. 969 00:45:47,810 --> 00:45:49,760 Barriers are often used to delineate 970 00:45:49,760 --> 00:45:51,680 between parallel sections of code 971 00:45:51,680 --> 00:45:53,870 when data produced by the first section 972 00:45:53,870 --> 00:45:56,630 is to be consumed during the second section. 973 00:45:56,630 --> 00:45:58,550 It is a block form of synchronization 974 00:45:58,550 --> 00:46:01,250 that is latency-sensitive, since there is usually 975 00:46:01,250 --> 00:46:03,650 little to do while waiting for the knowledge 976 00:46:03,650 --> 00:46:06,470 that the last processor has arrived at a barrier 977 00:46:06,470 --> 00:46:08,330 to be propagated to all processors 978 00:46:08,330 --> 00:46:10,550 so that they can proceed. 
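Returning briefly to the Centrifuge: its semantics are simple to model in software. The routine below is a minimal sketch of the mask-driven separation just described; the hardware performs the same separation in parallel rather than bit by bit.

    #include <stdint.h>

    /* Software model of the Centrifuge: where the mask has a one, the
       address bit joins the PE number; where it has a zero, the bit
       joins the offset.  Mask bits need not be contiguous. */
    void centrifuge(uint64_t addr, uint64_t mask,
                    uint64_t *pe, uint64_t *offset)
    {
        uint64_t pe_out = 0, off_out = 0;
        int pe_pos = 0, off_pos = 0;
        for (int bit = 0; bit < 64; bit++) {
            uint64_t b = (addr >> bit) & 1;
            if ((mask >> bit) & 1)
                pe_out |= b << pe_pos++;   /* next PE-number bit */
            else
                off_out |= b << off_pos++; /* next offset bit */
        }
        *pe = pe_out;
        *offset = off_out;
    }

With a mask whose ones begin at bit 4 (for instance, 0x7FF0 for 2K PEs), incrementing the address walks through 16 offsets on one PE before moving to the next, which is exactly the block distribution of the earlier example.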
979 00:46:10,550 --> 00:46:12,860 T3D has specialized barrier hardware 980 00:46:12,860 --> 00:46:16,340 in the form of 16 parallel logical AND trees 981 00:46:16,340 --> 00:46:18,890 that permit multiple barriers to be pipelined 982 00:46:18,890 --> 00:46:21,260 and the resource to be partitioned. 983 00:46:21,260 --> 00:46:23,660 The barrier resource appears to a PE 984 00:46:23,660 --> 00:46:25,760 as a memory-mapped register. 985 00:46:25,760 --> 00:46:27,890 When a PE arrives at a barrier, it 986 00:46:27,890 --> 00:46:31,340 sets a designated bit in its barrier register to a one 987 00:46:31,340 --> 00:46:33,590 and waits for it to switch to a zero. 988 00:46:33,590 --> 00:46:36,530 When all PEs in a partition have reached the barrier 989 00:46:36,530 --> 00:46:40,580 and set the same bit to a one, the AND function is satisfied. 990 00:46:40,580 --> 00:46:42,770 And the barrier bits are cleared by hardware, 991 00:46:42,770 --> 00:46:47,000 signaling the processors that they may proceed. 992 00:46:47,000 --> 00:46:48,650 As an alternative to repetitively 993 00:46:48,650 --> 00:46:52,280 testing a set barrier bit to see if it switched to a zero, 994 00:46:52,280 --> 00:46:54,320 the processor can enable an interrupt 995 00:46:54,320 --> 00:46:58,730 and perform unrelated work until the barrier is satisfied. 996 00:46:58,730 --> 00:47:00,560 The barrier has a second mode called 997 00:47:00,560 --> 00:47:04,220 Eureka mode that supports search-type operations. 998 00:47:04,220 --> 00:47:07,790 A Eureka is simply a logical OR instead of an AND. 999 00:47:07,790 --> 00:47:10,850 A Eureka is satisfied by any one processor. 1000 00:47:10,850 --> 00:47:14,060 A Eureka may be used during, say, a database search, 1001 00:47:14,060 --> 00:47:16,340 where all processors would enable their interrupt 1002 00:47:16,340 --> 00:47:18,410 on Eureka and proceed with a search 1003 00:47:18,410 --> 00:47:21,200 until any processor succeeds and terminates the search 1004 00:47:21,200 --> 00:47:23,780 by satisfying the Eureka. 1005 00:47:23,780 --> 00:47:27,050 The barrier mechanism in T3D is extremely fast. 1006 00:47:27,050 --> 00:47:29,060 Even in the largest configuration, 1007 00:47:29,060 --> 00:47:33,120 it's less than 50 clock periods, about 330 nanoseconds, 1008 00:47:33,120 --> 00:47:36,050 which is about the latency of a local memory read. 1009 00:47:36,050 --> 00:47:37,970 A second synchronization mechanism 1010 00:47:37,970 --> 00:47:40,340 that supports work distribution well 1011 00:47:40,340 --> 00:47:42,650 is the fetch-and-increment registers. 1012 00:47:42,650 --> 00:47:45,830 This set of registers is globally accessible by the PEs. 1013 00:47:45,830 --> 00:47:47,930 There's one per PE. 1014 00:47:47,930 --> 00:47:50,900 A store to a fetch-and-increment register acts normally, 1015 00:47:50,900 --> 00:47:54,770 but a load from one causes the contents to auto-increment. 1016 00:47:54,770 --> 00:47:57,690 This is very useful for distributing array indices, 1017 00:47:57,690 --> 00:47:59,950 for example, while strip-mining a loop. 1018 00:47:59,950 --> 00:48:01,700 Each time the fetch-and-increment register 1019 00:48:01,700 --> 00:48:05,900 is read, it delivers a value one greater than the last. 1020 00:48:05,900 --> 00:48:08,360 To hide the latency of accesses to the fetch-and-increment 1021 00:48:08,360 --> 00:48:12,890 registers, they can be fetched to the prefetch queue. 
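As a sketch of how this supports work sharing, the loop below hands out iteration indices the way a fetch-and-increment register would. A C11 atomic counter stands in for the hardware register, which on the real machine performs the increment itself on every load with no software read-modify-write.

    #include <stdatomic.h>

    /* Dynamic loop distribution via a shared counter.  Reset next_index
       to zero before each distributed loop. */
    static atomic_long next_index;

    void worker(long n, void (*body)(long))
    {
        for (;;) {
            long i = atomic_fetch_add(&next_index, 1);  /* "load" auto-increments */
            if (i >= n)
                break;                                  /* no iterations left */
            body(i);                                    /* run this iteration */
        }
    }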
1022 00:48:12,890 --> 00:48:15,320 Another fine-grained synchronization mechanism, 1023 00:48:15,320 --> 00:48:18,860 atomic swap, permits construction of locks in memory 1024 00:48:18,860 --> 00:48:20,660 or the emulation of full/empty bits 1025 00:48:20,660 --> 00:48:23,000 on individual words of data. 1026 00:48:23,000 --> 00:48:25,400 The swap is simply an atomic read/write 1027 00:48:25,400 --> 00:48:27,770 of a global memory location. 1028 00:48:27,770 --> 00:48:29,810 The value that is stored in the memory location 1029 00:48:29,810 --> 00:48:32,120 comes from a local memory-mapped register 1030 00:48:32,120 --> 00:48:34,123 called the "swaperand." 1031 00:48:34,123 --> 00:48:36,290 The value that was originally in the memory location 1032 00:48:36,290 --> 00:48:38,480 is returned to the PE. 1033 00:48:38,480 --> 00:48:40,670 The return destination, like any load, 1034 00:48:40,670 --> 00:48:44,750 can be the prefetch queue in order to hide the latency. 1035 00:48:44,750 --> 00:48:46,910 Full/empty bits in memory are emulated 1036 00:48:46,910 --> 00:48:48,830 by designating a token value-- 1037 00:48:48,830 --> 00:48:52,550 say, an unused IEEE floating-point Not-a-Number 1038 00:48:52,550 --> 00:48:53,570 or NaN-- 1039 00:48:53,570 --> 00:48:55,250 to mean "empty." 1040 00:48:55,250 --> 00:48:57,440 When a processor wants to load and lock 1041 00:48:57,440 --> 00:48:59,870 a memory location in order to, for instance, 1042 00:48:59,870 --> 00:49:02,510 perform an atomic add to that location, 1043 00:49:02,510 --> 00:49:06,420 it simply swaps the empty token into the memory location. 1044 00:49:06,420 --> 00:49:08,810 If the return value is also an empty token, 1045 00:49:08,810 --> 00:49:11,810 then the location was already locked by another PE 1046 00:49:11,810 --> 00:49:14,060 and the swap must be retried. 1047 00:49:14,060 --> 00:49:17,570 If the value returned is a valid number, 1048 00:49:17,570 --> 00:49:20,420 then the PE has successfully locked the location 1049 00:49:20,420 --> 00:49:23,750 and may perform its atomic operation. 1050 00:49:23,750 --> 00:49:27,710 To unlock the location, the PE simply stores the valid data 1051 00:49:27,710 --> 00:49:31,190 back to memory, overwriting the empty token. 1052 00:49:31,190 --> 00:49:34,250 Message passing is directly supported by T3D hardware 1053 00:49:34,250 --> 00:49:35,900 in the form of low-level primitives 1054 00:49:35,900 --> 00:49:39,110 that permit cache-line-size messages to be sent directly 1055 00:49:39,110 --> 00:49:40,700 from user space. 1056 00:49:40,700 --> 00:49:42,500 These short messages are initiated 1057 00:49:42,500 --> 00:49:45,620 by performing stores through special memory-mapped space 1058 00:49:45,620 --> 00:49:47,510 to the target PE. 1059 00:49:47,510 --> 00:49:50,420 The message data is stored in a hardware-managed message 1060 00:49:50,420 --> 00:49:53,090 queue in the local memory of the target PE. 1061 00:49:53,090 --> 00:49:55,610 And that PE is interrupted. 1062 00:49:55,610 --> 00:49:57,830 Hardware automatically allocates space 1063 00:49:57,830 --> 00:49:59,690 in the sender's message queue in case 1064 00:49:59,690 --> 00:50:03,920 the destination queue is full and the message is rejected. 1065 00:50:03,920 --> 00:50:06,330 Rejected messages land in the sender's queue 1066 00:50:06,330 --> 00:50:10,040 and interrupt the sending processor for a retry. 
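Returning to the atomic swap: the locking discipline just described can be sketched in a few lines of C. Here atomic_exchange stands in for the hardware swap through the swaperand register, and an arbitrary quiet-NaN bit pattern serves as the designated empty token; both are assumptions for illustration.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Arbitrary quiet-NaN bit pattern chosen as the "empty" token. */
    #define EMPTY UINT64_C(0x7FF8DEADBEEF0000)

    uint64_t lock_word(_Atomic uint64_t *loc)
    {
        uint64_t v;
        do {
            v = atomic_exchange(loc, EMPTY);   /* swap the empty token in */
        } while (v == EMPTY);                  /* already locked: retry */
        return v;                              /* valid datum: lock acquired */
    }

    void unlock_word(_Atomic uint64_t *loc, uint64_t v)
    {
        atomic_store(loc, v);                  /* restore valid data, unlocking */
    }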
1067 00:50:10,040 --> 00:50:12,350 Messages carry 32 bytes of payload 1068 00:50:12,350 --> 00:50:14,330 and are stored in the target message queue, 1069 00:50:14,330 --> 00:50:16,250 with the routing tag and packet control 1070 00:50:16,250 --> 00:50:20,360 information bringing the total to 64 bytes per message. 1071 00:50:20,360 --> 00:50:22,700 The message queue area in local DRAM 1072 00:50:22,700 --> 00:50:28,740 is 256K bytes, permitting up to 4K messages to be received. 1073 00:50:28,740 --> 00:50:31,770 The messaging facility is also key to the network 1074 00:50:31,770 --> 00:50:33,630 error-handling mechanism. 1075 00:50:33,630 --> 00:50:36,330 All network packets have parity on each phit 1076 00:50:36,330 --> 00:50:37,920 of control information. 1077 00:50:37,920 --> 00:50:41,830 Data is protected by SECDED-- single-error correction, double-error detection. If a parity error or a packet 1078 00:50:41,830 --> 00:50:44,910 misroute is detected when a packet arrives at a node, 1079 00:50:44,910 --> 00:50:46,920 the entire packet, header and all, 1080 00:50:46,920 --> 00:50:50,770 is deposited in the apparent destination PE's message queue 1081 00:50:50,770 --> 00:50:53,130 and an error interrupt is issued. 1082 00:50:53,130 --> 00:50:54,990 That pretty much wraps up our overview 1083 00:50:54,990 --> 00:50:57,210 of the T3D micro architecture. 1084 00:50:57,210 --> 00:50:58,830 Let's step back a minute to discuss 1085 00:50:58,830 --> 00:51:01,440 the design of T3D's I/O. 1086 00:51:01,440 --> 00:51:03,960 System I/O is performed via multiple Cray 1087 00:51:03,960 --> 00:51:07,920 high-speed channels that connect T3D to a host Y-MP 1088 00:51:07,920 --> 00:51:10,170 or to standard Cray I/O subsystems 1089 00:51:10,170 --> 00:51:12,690 that contain interfaces to a wide array 1090 00:51:12,690 --> 00:51:15,420 of peripheral devices and networks. 1091 00:51:15,420 --> 00:51:18,120 A single Cray high-speed channel consists 1092 00:51:18,120 --> 00:51:21,990 of two unidirectional channels, one input and one output, 1093 00:51:21,990 --> 00:51:25,020 each capable of 200 megabytes per second, coupled 1094 00:51:25,020 --> 00:51:26,910 with a low-speed control channel that 1095 00:51:26,910 --> 00:51:30,270 allows the passing of supervisory information. 1096 00:51:30,270 --> 00:51:33,300 Each high-speed channel pair connects to a specialized node 1097 00:51:33,300 --> 00:51:36,090 within T3D called a "gateway." 1098 00:51:36,090 --> 00:51:40,050 There are eight gateways for every 512 processors in a T3D 1099 00:51:40,050 --> 00:51:41,220 system. 1100 00:51:41,220 --> 00:51:44,430 Aggregate I/O bandwidth for 512 processors 1101 00:51:44,430 --> 00:51:47,370 is 3.2 gigabytes per second. 1102 00:51:47,370 --> 00:51:49,950 Each gateway itself is actually a pair 1103 00:51:49,950 --> 00:51:52,410 of nodes, each handling one direction 1104 00:51:52,410 --> 00:51:54,000 of the high-speed channel. 1105 00:51:54,000 --> 00:51:57,090 The two nodes within a gateway share the low-speed control 1106 00:51:57,090 --> 00:51:57,840 channel. 1107 00:51:57,840 --> 00:52:01,530 Each of the two gateway nodes has its own CPU for control, 1108 00:52:01,530 --> 00:52:04,680 separate local memory, and a dedicated block transfer 1109 00:52:04,680 --> 00:52:06,540 engine that can move data between memory 1110 00:52:06,540 --> 00:52:08,460 and the high-speed channel. 
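Looking back at the message format for a moment, one queue entry might be laid out as below. The 32-byte payload and 64-byte total come from the description above; how the remaining 32 bytes divide between routing tag and packet control is an assumption for illustration. The 4K-message capacity then follows directly: 256K bytes / 64 bytes = 4K entries.

    #include <stdint.h>

    /* Illustrative layout of one hardware message-queue entry. */
    typedef struct {
        uint64_t payload[4];    /* 32 bytes of user payload (one cache line) */
        uint64_t routing_tag;   /* assumed width */
        uint64_t control[3];    /* packet control information */
    } message_entry;            /* 64 bytes total */

    _Static_assert(sizeof(message_entry) == 64, "entry must be 64 bytes");
    enum { QUEUE_CAPACITY = (256 * 1024) / sizeof(message_entry) };  /* 4096 */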
1111 00:52:08,460 --> 00:52:10,810 A gateway node network switch is connected 1112 00:52:10,810 --> 00:52:13,140 into the network in the x- and z-dimensions 1113 00:52:13,140 --> 00:52:15,780 only, that being all that's necessary to allow 1114 00:52:15,780 --> 00:52:17,520 complete communication. 1115 00:52:17,520 --> 00:52:20,580 Gateways are placed on the diagonal of the 3D torus 1116 00:52:20,580 --> 00:52:23,520 network in order to distribute the I/O bandwidth 1117 00:52:23,520 --> 00:52:27,000 and minimize I/O traffic contention with compute traffic 1118 00:52:27,000 --> 00:52:29,730 in the network. 1119 00:52:29,730 --> 00:52:32,130 This Cray T3D processor element module 1120 00:52:32,130 --> 00:52:35,910 holds eight microprocessors, four on this side 1121 00:52:35,910 --> 00:52:38,040 and four on the back side. 1122 00:52:38,040 --> 00:52:40,950 Each pair of microprocessors within the compute node 1123 00:52:40,950 --> 00:52:45,000 is supported by 21 proprietary integrated circuits. 1124 00:52:45,000 --> 00:52:47,640 These circuits are implemented in high-performance bipolar 1125 00:52:47,640 --> 00:52:49,830 emitter-coupled logic. 1126 00:52:49,830 --> 00:52:53,640 Small daughter boards support wide-word dynamic random access 1127 00:52:53,640 --> 00:52:55,410 memory components. 1128 00:52:55,410 --> 00:52:58,050 These daughter boards can be easily replaced locally 1129 00:52:58,050 --> 00:53:00,990 by field service personnel. 1130 00:53:00,990 --> 00:53:02,850 Each long edge of the motherboard 1131 00:53:02,850 --> 00:53:05,880 is configured with a dense array of zero-insertion-force 1132 00:53:05,880 --> 00:53:07,410 connectors. 1133 00:53:07,410 --> 00:53:09,540 These connectors provide the high degree 1134 00:53:09,540 --> 00:53:12,780 of interconnectivity required for the interprocessor 1135 00:53:12,780 --> 00:53:15,210 channels. 1136 00:53:15,210 --> 00:53:18,240 The module is liquid-cooled by attaching the circuit 1137 00:53:18,240 --> 00:53:21,330 boards and integrated circuits to an anodized aluminum 1138 00:53:21,330 --> 00:53:23,190 cold plate. 1139 00:53:23,190 --> 00:53:26,190 An inert cooling fluid flows into, through, 1140 00:53:26,190 --> 00:53:28,740 and then out of the module via hoses 1141 00:53:28,740 --> 00:53:33,250 connected to a cooling manifold in the chassis. 1142 00:53:33,250 --> 00:53:36,550 A single Cray T3D chassis can support up 1143 00:53:36,550 --> 00:53:41,500 to 512 compute processors, eight I/O gateways, 1144 00:53:41,500 --> 00:53:45,490 and additional redundant processors for fault tolerance. 1145 00:53:45,490 --> 00:53:46,990 The zero-insertion-force connector 1146 00:53:46,990 --> 00:53:50,680 is activated and deactivated by inserting a cam 1147 00:53:50,680 --> 00:53:53,770 down each edge of the module after it has been 1148 00:53:53,770 --> 00:53:56,770 inserted in the module slot. 1149 00:53:56,770 --> 00:54:01,330 The cam works to move small sliding contacts one by one 1150 00:54:01,330 --> 00:54:03,070 out of the chassis-mounted connector 1151 00:54:03,070 --> 00:54:07,990 half of the mated pair into the half mounted on the module. 1152 00:54:07,990 --> 00:54:10,750 Bundles of high-performance Teflon wires attached 1153 00:54:10,750 --> 00:54:12,940 to the chassis connectors provide 1154 00:54:12,940 --> 00:54:16,600 the toroidal communication paths. 
1155 00:54:16,600 --> 00:54:18,190 The arrangement of the wire bundles 1156 00:54:18,190 --> 00:54:21,940 can be changed to support the future addition of modules 1157 00:54:21,940 --> 00:54:23,830 into chassis which were initially 1158 00:54:23,830 --> 00:54:26,680 configured to smaller sizes. 1159 00:54:26,680 --> 00:54:31,360 Switching power supplies, each supplying up to 1,800 amperes, 1160 00:54:31,360 --> 00:54:33,670 provide module power. 1161 00:54:33,670 --> 00:54:37,480 Standard configurations provide for one extra power supply 1162 00:54:37,480 --> 00:54:40,690 for each required voltage, allowing additional fault 1163 00:54:40,690 --> 00:54:43,570 coverage for the system. 1164 00:54:43,570 --> 00:54:45,730 Steve and I have tried to convey our excitement 1165 00:54:45,730 --> 00:54:47,200 about this design. 1166 00:54:47,200 --> 00:54:50,620 We believe the Cray T3D to be a powerful and flexible 1167 00:54:50,620 --> 00:54:52,570 supercomputing tool. 1168 00:54:52,570 --> 00:54:55,190 I hope you've enjoyed our presentation. 1169 00:54:55,190 --> 00:54:56,620 Thank you. 1170 00:54:56,620 --> 00:54:59,970 [MUSIC PLAYING] 1171 00:54:59,970 --> 00:55:57,000