In 1971, I was a member of a group of scientists who proposed a five-year research effort towards a demonstration of a large-vocabulary, connected speech understanding system. Instead of setting vague objectives, we proposed specific performance goals. The proposed speech understanding system was required to accept phrases and sentences from many speakers, based on a 1,000-word vocabulary, using a task-oriented grammar within a constraining task. We wanted the system to perform with less than 10% semantic errors, using about 300 million computer instructions per second of speech. Although these goals were highly ambitious at the time of our proposal, we felt that the required system could be demonstrated within five years.

The Harpy system, developed at Carnegie Mellon University, not only satisfies these goals but exceeds some of the stated objectives. It recognizes connected speech from male and female speakers using a 1,011-word document retrieval task. It achieves 95% semantic accuracy, and runs an order of magnitude faster than expected.
Harpy is all the more interesting because it is one of the few examples in artificial intelligence research of a five-year prediction that was in fact realized.

"I am interested in speech understanding."

[MUSIC PLAYING]

Harpy is the result of combining and improving the best features from two speech systems previously developed at Carnegie Mellon University: Hearsay-I and Dragon. Two features of the Harpy system that led to a successful demonstration are its representation of knowledge and the use of new search techniques.

In developing speech recognition systems, it is necessary to devise a means of acquiring and representing the many diverse types of knowledge that characterize speech. We must also develop matching and search techniques that convert this passive knowledge into an active process for understanding an utterance in the presence of error, noise, and uncertainty.

First, let us consider some aspects of knowledge representation. The document retrieval task we saw earlier uses a vocabulary of 1,011 words.
When a user speaks to the system, his utterance must conform to the grammar and vocabulary for that task. Internally, Harpy stores all legal sentences in a finite-state graph structure.

Here is a graph of a simple grammar. This knowledge is organized as a network of nodes, where each node holds a word in the vocabulary. The nodes are interconnected such that any path through this word network constitutes an acceptable sentence.

"Tell us about Nixon."

"Give me the headlines."

"Tell me all about China."

Many words have more than one acceptable pronunciation.

"Vodka and tomato juice."

"We don't have any tomato juice."

"What?"

"The next time the hot tomatoes are in town, you better have tomato juice."

Alternative pronunciations can be represented as a separate network of acoustic gestures called phones. Phones are the smallest elements of speech knowledge represented in the Harpy network. Each path through the network represents an acceptable pronunciation of a word.

"Tomato."
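The word network described here can be sketched as a small finite-state graph in which every start-to-end path is a legal sentence. The following is a minimal illustrative sketch; the grammar and word list are invented for this example and are not Harpy's actual task grammar:

```python
# A toy finite-state word network: each node holds a word, and edges
# define which words may legally follow it.
network = {
    "START": ["tell", "give"],
    "tell": ["me"],
    "give": ["me"],
    "me": ["all", "the"],
    "all": ["about"],
    "the": ["headlines"],
    "about": ["China"],
    "China": ["END"],
    "headlines": ["END"],
}

def sentences(node="START", prefix=()):
    """Enumerate every legal sentence by walking all paths through the graph."""
    if node == "END":
        yield " ".join(prefix)
        return
    for nxt in network[node]:
        words = prefix if node == "START" else prefix + (node,)
        yield from sentences(nxt, words)

for s in sentences():
    print(s)
```

With this toy grammar the walk yields exactly four acceptable sentences, including "tell me all about China" and "give me the headlines" — the same idea, scaled up, lets Harpy hold every legal sentence of the 1,011-word task in one graph.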
"Tomato."

Dialects and accents can also be integrated into these pronunciation networks as separate paths. This is the pronunciation network for the word "tell."

"Tale."

"Tail."

The pronunciation network for "tell" is really a further specification of "tell" in the word network. Thus, we can replace "tell" in the word network with its pronunciation network. Likewise, replacing every node in the word network with its pronunciation network generates a new finite-state graph, where each path is the pronunciation of an acceptable sentence.

"Tell me all about China."

The substitution of pronunciation networks for words adds a second level of knowledge to our graph.

Another aspect of speech knowledge involves phenomena that occur at word boundaries. In written text, word boundaries are clearly defined by spaces: "Tell me all about China." In spoken language, word boundaries tend to overlap, making it difficult to detect the end of one word and the beginning of the next.
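The substitution step described above — replacing each word node with its pronunciation network — can be sketched as expanding a word sequence into every acceptable phone sequence. The phone labels and variant pronunciations below are invented for illustration, not Harpy's actual phone inventory:

```python
# Toy pronunciation networks: each word maps to its alternative
# phone sequences (two variants of "tell", one of "me").
pronunciations = {
    "tell": [["T", "EH", "L"], ["T", "AH", "L"]],
    "me":   [["M", "IY"]],
}

def expand(sentence_words):
    """Expand a word sequence into every acceptable phone-level pronunciation."""
    paths = [[]]
    for word in sentence_words:
        # Substitute the word node with each path through its
        # pronunciation network, multiplying out the alternatives.
        paths = [p + variant for p in paths for variant in pronunciations[word]]
    return paths

for phones in expand(["tell", "me"]):
    print(phones)
```

Applying this substitution to every node of the word network is what turns the sentence-level graph into the phone-level finite-state graph the film describes.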
This problem is similar to writing "Tell me all about China" without spaces, or with spaces within words. As a result, complex phenomena arise where phones are inserted or deleted at word boundaries. Knowledge about such phenomena can also be represented in a Harpy network. This third type of speech knowledge completes the Harpy network representation.

The overall size of the network is reduced by special heuristics. This is achieved by detecting and removing redundant paths. A separate knowledge network is compiled for each Harpy task. This eliminates the need for dynamic interpretation of knowledge during the recognition process. The finite-state representation makes it possible to store knowledge of grammar, vocabulary, pronunciation, and acoustic phonetics in a single network.

Let's consider how this knowledge is used in the recognition process.

"Tell me all about China."

The utterance is input to the computer. Several preprocessing steps are taken to prepare the data for recognition.
The utterance is digitized, segmented into acoustic units, and analyzed to determine the segmental features and parameters. This utterance has been divided into 23 segments.

At this point, an attempt is made to match the first segment with one of the 98 possible phone labels. Since an absolute assignment cannot be made reliably, the system calculates a match probability based on the acoustic information in each segment. Thus, the silence phone is given the highest acoustic match for this segment, then the K phone, then TH, and so on.

As the acoustic matches are generated, Harpy begins the recognition process. The goal of the recognition task is to find an optimal sequence of phones satisfying two criteria: the sequence must represent a legal path through the knowledge network, and it should consist of phones with high acoustic matches.

Harpy uses a beam search to locate this optimal sequence of phones. This is a search technique in which a group of near-miss alternatives around the best path are examined.
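The per-segment acoustic matching just described amounts to scoring every candidate phone label against a segment and ranking the results. A small sketch, with invented scores that mirror the film's example (silence first, then K, then TH):

```python
# Hypothetical acoustic match probabilities for one segment; in Harpy
# every one of the 98 phone labels would be scored, not just these five.
segment_scores = {"SIL": 0.62, "K": 0.21, "TH": 0.09, "T": 0.05, "AA": 0.03}

# Rank phones by match probability rather than committing to one label,
# since an absolute assignment cannot be made reliably.
ranked = sorted(segment_scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:3])
```

Keeping the whole ranked list, instead of only the top label, is what lets the later search recover when the best acoustic match turns out to be wrong.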
By searching many alternatives simultaneously, the beam search eliminates the need for backtracking. The search is executed by creating and examining a tree structure of phones whose connections are consistent with transitions in the knowledge network. Each ply in the recognition tree represents one segment of the digitized utterance.

Harpy begins the beam search by taking all legal phones for the start of a sentence from the knowledge network and entering them in the recognition tree. Next, a path probability is calculated for each candidate. This is a cumulative value based on the path probability of the previous node and the acoustic match probability of the current node.

The path with the best probability is determined, and remaining candidates are compared with it. Those that fall below a threshold of acceptability are pruned from further searching. The successors of the surviving candidates are expanded, based on the information in the knowledge network, and the search continues.

When paths are expanded, two phones may generate the same successor.
Instead of retaining two independent paths through the same node, we can collapse them into a common path, avoiding redundant computation. Only the path with the highest value is relevant at this point; thus, lesser-value paths can be discarded, because their path probabilities can never exceed the one with the highest value.

The path probabilities are calculated as before. The best path is established, and unpromising alternatives are pruned.

[MUSIC PLAYING]

The forward search continues, expanding the recognition tree and saving those connections that satisfy the threshold, until we reach the end of the utterance.

[MUSIC PLAYING]

Of all the paths that survive to the end of the utterance, the one with the best path probability is the solution that we are seeking. This is the only path that satisfies the two criteria of the recognition process: it provides the best interpretation of the acoustic matches while satisfying the constraints of the knowledge network.
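The forward search described above — expand legal successors, accumulate path probabilities, collapse paths that meet at the same node, and prune everything below a threshold relative to the best path — can be sketched roughly as follows. The phone network, match scores, and beam threshold are all invented for illustration; Harpy's real network has thousands of states:

```python
# A minimal beam-search sketch in the spirit of Harpy's forward pass.
# `transitions` is a toy phone network; `matches[t]` holds invented
# acoustic match probabilities for segment t.
transitions = {
    "START": ["T", "SIL"],  # phones legal at the start of a sentence
    "SIL": ["T"],
    "T": ["EH"],
    "EH": ["L"],
}
matches = [
    {"T": 0.7, "SIL": 0.3},
    {"EH": 0.8, "T": 0.2},
    {"L": 0.9, "EH": 0.1},
]
BEAM = 0.1  # candidates scoring below BEAM * best are pruned

def forward_search():
    # Ply 0: every legal starting phone, scored against the first segment.
    beam = [((p,), matches[0].get(p, 0.0)) for p in transitions["START"]]
    for scores in matches[1:]:
        expanded = {}
        for path, prob in beam:
            for nxt in transitions.get(path[-1], []):
                # Cumulative path probability: the previous path probability
                # times the acoustic match of the successor phone.
                new_prob = prob * scores.get(nxt, 0.0)
                # Collapse paths meeting at the same node: keep only the
                # best, since a lesser path can never overtake it.
                if nxt not in expanded or new_prob > expanded[nxt][1]:
                    expanded[nxt] = (path + (nxt,), new_prob)
        best = max(prob for _, prob in expanded.values())
        # Prune candidates that fall below the acceptability threshold.
        beam = [c for c in expanded.values() if c[1] >= BEAM * best]
    # The backtrace is implicit here: each candidate carries its full path,
    # so the answer is simply the surviving path with the best probability.
    return max(beam, key=lambda c: c[1])

path, prob = forward_search()
print(path, prob)
```

Note one simplification: this sketch stores whole paths, whereas Harpy records backpointers in the recognition tree and recovers the winning path with a separate backtrace; the effect is the same.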
A backtrace through the recognition tree reveals the desired solution. This is purely a lookup operation and does not involve any search. What appears to be the best choice in the forward search may not, in fact, be part of the overall solution found by the backtrace. Thus, errors introduced by the acoustic matches are easily recovered by delaying commitment to a particular path until the forward search is completed. Thus the forward search may be errorful without affecting the final solution.

Let's look at the recognition process again as it occurred on the computer.

"Tell me all about China."

The utterance is digitized, segmented, and acoustic matches are generated. The four words with the best path probabilities appear below each segment. The backtrace selects the words leading to the optimal path.

This is an example of Harpy working with a very simple task grammar. Let's look at the system's performance on several complex grammars.

"Land subject to inundation. Point. OK."

Speech understanding systems offer two advantages.
They allow humans to communicate in a natural manner.

"Highway, double line."

And free their hands for other work. The cartography task requires vocal identification of topological features as they are traced into the computer.

"What is beta times gamma?"

This is an example of a desk calculator task. The speaker requests arithmetic operations to be performed by the computer.

[MUSIC PLAYING]

"Do any papers discuss hill climbing?"

This is an example of the 1,000-word document retrieval task. The knowledge network for this task contains over 14,000 nodes. It is the largest task currently running on the Harpy system.

[MUSIC PLAYING]

"Knight to King's Bishop 3."

Here is an example of a voice chess system. The player speaks his move to the computer in standard notation.

[MUSIC PLAYING]

"Springer F-drei."

When the system is confronted with an utterance that is inconsistent with the task vocabulary, it still searches for the optimal path.
However, when the overall path probability falls below an acceptable threshold, the recognized utterance is rejected.

By combining the best features of earlier systems, Harpy is able to recognize speech faster and more accurately. Since the end of the five-year project, the system has been further improved and at present runs an order of magnitude faster.

The two features of the system that contribute to Harpy's success are its representation of knowledge and the beam search technique. By pre-compiling all the diverse sources of knowledge into a single integrated network, Harpy achieves a level of efficiency that is unattainable by systems that dynamically interpret their knowledge. This permits Harpy to consider many more alternatives, and to deal with error and uncertainty gracefully.

Backtracking and redundant computations have always been problematic in AI systems. The Harpy system eliminates both of these in an elegant manner.

It is not always possible to capture all the language constraints as a finite-state graph.
A graph structure that attempts to represent every possible variation requires a great deal of memory. More compact representations of knowledge, involving a greater degree of dynamic interpretation, can certainly be used in Harpy-like systems. This is purely a space-time trade-off.

Occasionally, the heuristics associated with the beam search miss the optimal path. But because the acoustic matches are less than accurate, attempting to find the optimal path at great cost and effort leads to little or no improvement in the overall performance.

Everybody is an expert in speech, and so we naturally expect flawless performance from the computer. However, given the immense complexity of the task, we must continue to build more and more complex speech understanding systems, so that we may one day have systems approaching human performance. Continued research in this area will also teach us how to build complex knowledge-based systems, and provide us with a deeper insight into the nature of intelligence.

[MUSIC PLAYING]