In 1971, I was a member of a group of scientists who proposed a five-year research effort towards a demonstration of a large-vocabulary, connected speech understanding system. Instead of setting vague objectives, we proposed specific performance goals. The proposed speech understanding system was required to accept phrases and sentences from many speakers, based on a 1,000-word vocabulary, using a task-oriented grammar within a constraining task. We wanted the system to perform with less than 10% semantic errors, using about 300 million computer instructions per second of speech. Although these goals were highly ambitious at the time of our proposal, we felt that the required system could be demonstrated within five years.

The Harpy system, developed at Carnegie Mellon University, not only satisfies these goals but exceeds some of the stated objectives. It recognizes connected speech from male and female speakers using a 1,011-word document retrieval task. It achieves 95% semantic accuracy, and runs an order of magnitude faster than expected.
Harpy is all the more interesting because it is one of the few examples in artificial intelligence research of a five-year prediction that was in fact realized.

"I am interested in speech understanding."

[MUSIC PLAYING]

Harpy is the result of combining and improving the best features from two speech systems previously developed at Carnegie Mellon University: Hearsay-I and Dragon. Two features of the Harpy system that led to a successful demonstration are its representation of knowledge and the use of new search techniques.

In developing speech recognition systems, it is necessary to devise a means of acquiring and representing the many diverse types of knowledge that characterize speech. We must also develop matching and search techniques that convert this passive knowledge into an active process for understanding an utterance in the presence of error, noise, and uncertainty.

First, let us consider some aspects of knowledge representation. The document retrieval task we saw earlier uses a vocabulary of 1,011 words.
When a user speaks to the system, his utterance must conform to the grammar and vocabulary for that task. Internally, Harpy stores all legal sentences in a finite-state graph structure.

Here is a graph of a simple grammar. This knowledge is organized as a network of nodes, where each node holds a word in the vocabulary. The nodes are interconnected such that any path through this word network constitutes an acceptable sentence.

"Tell us about Nixon."

"Give me the headlines."

"Tell me all about China."

Many words have more than one acceptable pronunciation.

"Vodka and tomato juice."

"We don't have any tomato juice."

"What?"

"The next time the hot tomatoes are in town, you better have tomato juice."

Alternative pronunciations can be represented as a separate network of acoustic gestures called phones. Phones are the smallest elements of speech knowledge represented in the Harpy network. Each path through the network represents an acceptable pronunciation of a word.

"Tomato."
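The word network described here can be sketched as a small finite-state graph in which every start-to-end path is a legal sentence. The following is a minimal illustrative sketch; the grammar and word list are invented for this example and are not Harpy's actual task grammar:

```python
# A toy finite-state word network: each node holds a word, and edges
# define which words may legally follow it.
network = {
    "START": ["tell", "give"],
    "tell": ["me"],
    "give": ["me"],
    "me": ["all", "the"],
    "all": ["about"],
    "the": ["headlines"],
    "about": ["China"],
    "China": ["END"],
    "headlines": ["END"],
}

def sentences(node="START", prefix=()):
    """Enumerate every legal sentence by walking all paths through the graph."""
    if node == "END":
        yield " ".join(prefix)
        return
    for nxt in network[node]:
        words = prefix if node == "START" else prefix + (node,)
        yield from sentences(nxt, words)

for s in sentences():
    print(s)
```

With this toy grammar the walk yields exactly four acceptable sentences, including "tell me all about China" and "give me the headlines" — the same idea, scaled up, lets Harpy hold every legal sentence of the 1,011-word task in one graph.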
"Tomato."

Dialects and accents can also be integrated into these pronunciation networks as separate paths. This is the pronunciation network for the word "tell."

"Tale."

"Tail."

The pronunciation network for "tell" is really a further specification of "tell" in the word network. Thus, we can replace "tell" in the word network with its pronunciation network. Likewise, replacing every node in the word network with its pronunciation network generates a new finite-state graph, where each path is the pronunciation of an acceptable sentence.

"Tell me all about China."

The substitution of pronunciation networks for words adds a second level of knowledge to our graph.

Another aspect of speech knowledge involves phenomena that occur at word boundaries. In written text, word boundaries are clearly defined by spaces: "Tell me all about China." In spoken language, word boundaries tend to overlap, making it difficult to detect the end of one word and the beginning of the next.
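The substitution step described above — replacing each word node with its pronunciation network — can be sketched as expanding a word sequence into every acceptable phone sequence. The phone labels and variant pronunciations below are invented for illustration, not Harpy's actual phone inventory:

```python
# Toy pronunciation networks: each word maps to its alternative
# phone sequences (two variants of "tell", one of "me").
pronunciations = {
    "tell": [["T", "EH", "L"], ["T", "AH", "L"]],
    "me":   [["M", "IY"]],
}

def expand(sentence_words):
    """Expand a word sequence into every acceptable phone-level pronunciation."""
    paths = [[]]
    for word in sentence_words:
        # Substitute the word node with each path through its
        # pronunciation network, multiplying out the alternatives.
        paths = [p + variant for p in paths for variant in pronunciations[word]]
    return paths

for phones in expand(["tell", "me"]):
    print(phones)
```

Applying this substitution to every node of the word network is what turns the sentence-level graph into the phone-level finite-state graph the film describes.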
This problem is similar to writing "Tell me all about China" without spaces, or with spaces within words. As a result, complex phenomena arise where phones are inserted or deleted at word boundaries. Knowledge about such phenomena can also be represented in a Harpy network. This third type of speech knowledge completes the Harpy network representation.

The overall size of the network is reduced by special heuristics. This is achieved by detecting and removing redundant paths. A separate knowledge network is compiled for each Harpy task. This eliminates the need for dynamic interpretation of knowledge during the recognition process. The finite-state representation makes it possible to store knowledge of grammar, vocabulary, pronunciation, and acoustic phonetics in a single network.

Let's consider how this knowledge is used in the recognition process.

"Tell me all about China."

The utterance is input to the computer. Several preprocessing steps are taken to prepare the data for recognition.
The utterance is digitized, segmented into acoustic units, and analyzed to determine the segmental features and parameters. This utterance has been divided into 23 segments.

At this point, an attempt is made to match the first segment with one of the 98 possible phone labels. Since an absolute assignment cannot be made reliably, the system calculates a match probability based on the acoustic information in each segment. Thus, the silence phone is given the highest acoustic match for this segment, then the K phone, then TH, and so on.

As the acoustic matches are generated, Harpy begins the recognition process. The goal of the recognition task is to find an optimal sequence of phones satisfying two criteria: the sequence must represent a legal path through the knowledge network, and it should consist of phones with high acoustic matches.

Harpy uses a beam search to locate this optimal sequence of phones. This is a search technique in which a group of near-miss alternatives around the best path are examined.
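The per-segment acoustic matching just described amounts to scoring every candidate phone label against a segment and ranking the results. A small sketch, with invented scores that mirror the film's example (silence first, then K, then TH):

```python
# Hypothetical acoustic match probabilities for one segment; in Harpy
# every one of the 98 phone labels would be scored, not just these five.
segment_scores = {"SIL": 0.62, "K": 0.21, "TH": 0.09, "T": 0.05, "AA": 0.03}

# Rank phones by match probability rather than committing to one label,
# since an absolute assignment cannot be made reliably.
ranked = sorted(segment_scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:3])
```

Keeping the whole ranked list, instead of only the top label, is what lets the later search recover when the best acoustic match turns out to be wrong.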
By searching many alternatives simultaneously, the beam search eliminates the need for backtracking. The search is executed by creating and examining a tree structure of phones whose connections are consistent with transitions in the knowledge network. Each ply in the recognition tree represents one segment of the digitized utterance.

Harpy begins the beam search by taking all legal phones for the start of a sentence from the knowledge network and entering them in the recognition tree. Next, a path probability is calculated for each candidate. This is a cumulative value based on the path probability of the previous node and the acoustic match probability of the current node.

The path with the best probability is determined, and remaining candidates are compared with it. Those that fall below a threshold of acceptability are pruned from further searching. The successors of the surviving candidates are expanded, based on the information in the knowledge network, and the search continues.

When paths are expanded, two phones may generate the same successor.
Instead of retaining two independent paths through the same node, we can collapse them into a common path, avoiding redundant computation. Only the path with the highest value is relevant at this point; thus, lesser-value paths can be discarded, because their path probabilities can never exceed the one with the highest value.

The path probabilities are calculated as before. The best path is established, and unpromising alternatives are pruned.

[MUSIC PLAYING]

The forward search continues, expanding the recognition tree and saving those connections that satisfy the threshold, until we reach the end of the utterance.

[MUSIC PLAYING]

Of all the paths that survive to the end of the utterance, the one with the best path probability is the solution that we are seeking. This is the only path that satisfies the two criteria of the recognition process: it provides the best interpretation of the acoustic matches while satisfying the constraints of the knowledge network.
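The forward search described above — expand legal successors, accumulate path probabilities, collapse paths that meet at the same node, and prune everything below a threshold relative to the best path — can be sketched roughly as follows. The phone network, match scores, and beam threshold are all invented for illustration; Harpy's real network has thousands of states:

```python
# A minimal beam-search sketch in the spirit of Harpy's forward pass.
# `transitions` is a toy phone network; `matches[t]` holds invented
# acoustic match probabilities for segment t.
transitions = {
    "START": ["T", "SIL"],  # phones legal at the start of a sentence
    "SIL": ["T"],
    "T": ["EH"],
    "EH": ["L"],
}
matches = [
    {"T": 0.7, "SIL": 0.3},
    {"EH": 0.8, "T": 0.2},
    {"L": 0.9, "EH": 0.1},
]
BEAM = 0.1  # candidates scoring below BEAM * best are pruned

def forward_search():
    # Ply 0: every legal starting phone, scored against the first segment.
    beam = [((p,), matches[0].get(p, 0.0)) for p in transitions["START"]]
    for scores in matches[1:]:
        expanded = {}
        for path, prob in beam:
            for nxt in transitions.get(path[-1], []):
                # Cumulative path probability: the previous path probability
                # times the acoustic match of the successor phone.
                new_prob = prob * scores.get(nxt, 0.0)
                # Collapse paths meeting at the same node: keep only the
                # best, since a lesser path can never overtake it.
                if nxt not in expanded or new_prob > expanded[nxt][1]:
                    expanded[nxt] = (path + (nxt,), new_prob)
        best = max(prob for _, prob in expanded.values())
        # Prune candidates that fall below the acceptability threshold.
        beam = [c for c in expanded.values() if c[1] >= BEAM * best]
    # The backtrace is implicit here: each candidate carries its full path,
    # so the answer is simply the surviving path with the best probability.
    return max(beam, key=lambda c: c[1])

path, prob = forward_search()
print(path, prob)
```

Note one simplification: this sketch stores whole paths, whereas Harpy records backpointers in the recognition tree and recovers the winning path with a separate backtrace; the effect is the same.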
A backtrace through the recognition tree reveals the desired solution. This is purely a lookup operation and does not involve any search. What appears to be the best choice in the forward search may not, in fact, be part of the overall solution found by the backtrace. Thus, errors introduced by the acoustic matches are easily recovered by delaying commitment to a particular path until the forward search is completed. Thus the forward search may be errorful without affecting the final solution.

Let's look at the recognition process again as it occurred on the computer.

"Tell me all about China."

The utterance is digitized, segmented, and acoustic matches are generated. The four words with the best path probabilities appear below each segment. The backtrace selects the words leading to the optimal path.

This is an example of Harpy working with a very simple task grammar. Let's look at the system's performance on several complex grammars.

"Land subject to inundation. Point. OK."

Speech understanding systems offer two advantages.
They allow humans to communicate in a natural manner.

"Highway, double line."

And free their hands for other work. The cartography task requires vocal identification of topological features as they are traced into the computer.

"What is beta times gamma?"

This is an example of a desk calculator task. The speaker requests arithmetic operations to be performed by the computer.

[MUSIC PLAYING]

"Do any papers discuss hill climbing?"

This is an example of the 1,000-word document retrieval task. The knowledge network for this task contains over 14,000 nodes. It is the largest task currently running on the Harpy system.

[MUSIC PLAYING]

"Knight to King's Bishop 3."

Here is an example of a voice chess system. The player speaks his move to the computer in standard notation.

[MUSIC PLAYING]

"Springer F-drei."

When the system is confronted with an utterance that is inconsistent with the task vocabulary, it still searches for the optimal path.
However, when the overall path probability falls below an acceptable threshold, the recognized utterance is rejected.

By combining the best features of earlier systems, Harpy is able to recognize speech faster and more accurately. Since the end of the five-year project, the system has been further improved and at present runs an order of magnitude faster.

The two features of the system that contribute to Harpy's success are its representation of knowledge and the beam search technique. By pre-compiling all the diverse sources of knowledge into a single integrated network, Harpy achieves a level of efficiency that is unattainable by systems that dynamically interpret their knowledge. This permits Harpy to consider many more alternatives, and to deal with error and uncertainty gracefully.

Backtracking and redundant computations have always been problematic in AI systems. The Harpy system eliminates both of these in an elegant manner.

It is not always possible to capture all the language constraints as a finite-state graph.
A graph structure that attempts to represent every possible variation requires a great deal of memory. More compact representations of knowledge, involving a greater degree of dynamic interpretation, can certainly be used in Harpy-like systems. This is purely a space-time trade-off.

Occasionally, the heuristics associated with the beam search miss the optimal path. But because the acoustic matches are less than accurate, attempting to find the optimal path at great cost and effort leads to little or no improvement in the overall performance.

Everybody is an expert in speech, and so we naturally expect flawless performance from the computer. However, given the immense complexity of the task, we must continue to build more and more complex speech understanding systems, so that we may one day have systems approaching human performance. Continued research in this area will also teach us how to build complex knowledge-based systems, and provide us with a deeper insight into the nature of intelligence.

[MUSIC PLAYING]