1 00:00:00,000 --> 00:00:13,490 2 00:00:13,490 --> 00:00:18,050 All right, I'm going to start this particular utterance again 3 00:00:18,050 --> 00:00:18,860 by segmenting it. 4 00:00:18,860 --> 00:00:21,580 5 00:00:21,580 --> 00:00:28,140 This is a very short segment, probably just a single segment 6 00:00:28,140 --> 00:00:29,570 here. 7 00:00:29,570 --> 00:00:31,320 For the time being there's no indication-- 8 00:00:31,320 --> 00:00:34,620 This is a speech spectrogram. 9 00:00:34,620 --> 00:00:37,200 A speech spectrogram is a visual display 10 00:00:37,200 --> 00:00:44,370 of the energy in the speech wave in terms of frequency and time. 11 00:00:44,370 --> 00:00:47,220 This film examines the performance of an expert 12 00:00:47,220 --> 00:00:49,710 spectrogram reader. 13 00:00:49,710 --> 00:00:51,870 Now the first-pass segmentation is done. 14 00:00:51,870 --> 00:00:53,010 I'm going to do it here. 15 00:00:53,010 --> 00:00:55,020 There is obviously a [INAUDIBLE] and it's-- 16 00:00:55,020 --> 00:00:56,220 I'm going to say it's a "w." 17 00:00:56,220 --> 00:00:59,580 It's very easy to see that the second formant is climbing. 18 00:00:59,580 --> 00:01:02,670 There's a [INAUDIBLE] here with the second and third formants 19 00:01:02,670 --> 00:01:03,510 going together. 20 00:01:03,510 --> 00:01:05,430 I have to say this is an "e." 21 00:01:05,430 --> 00:01:08,730 There's another nasal here. 22 00:01:08,730 --> 00:01:09,900 There's another nasal here. 23 00:01:09,900 --> 00:01:11,750 I'm going to say again it's an "e." 24 00:01:11,750 --> 00:01:15,480 There's another segment here followed by a nasal at the end. 25 00:01:15,480 --> 00:01:22,140 In 1971, Dr. Victor Zue began to study spectrograms. 26 00:01:22,140 --> 00:01:25,680 His goal, to learn to identify the phonetic content 27 00:01:25,680 --> 00:01:30,450 of an unknown utterance from a spectrographic display. 28 00:01:30,450 --> 00:01:34,680 Since 1971 Dr.
Zue has devoted about an hour a day 29 00:01:34,680 --> 00:01:36,743 to learning to read speech spectrograms. 30 00:01:36,743 --> 00:01:38,160 --appears to be coming out of an-- 31 00:01:38,160 --> 00:01:41,730 Altogether he has spent between 2,000 and 3,000 32 00:01:41,730 --> 00:01:45,300 hours developing this skill. 33 00:01:45,300 --> 00:01:49,590 In 1977 and 1978 we studied Dr. Zue's ability 34 00:01:49,590 --> 00:01:53,610 to identify phonetic segments from spectrograms of utterances 35 00:01:53,610 --> 00:01:55,290 which were unknown to him. 36 00:01:55,290 --> 00:01:57,520 F1 is around 500 hertz. 37 00:01:57,520 --> 00:02:00,583 F2 is around 1,500 hertz or so. 38 00:02:00,583 --> 00:02:02,250 I would say it's probably an [INAUDIBLE] 39 00:02:02,250 --> 00:02:03,660 before an [INAUDIBLE]. 40 00:02:03,660 --> 00:02:06,690 All of them going down to this particular voiced fricative. 41 00:02:06,690 --> 00:02:09,990 I would say this is probably a [INAUDIBLE] before [INAUDIBLE]. 42 00:02:09,990 --> 00:02:12,863 F3 is lower than 2,000 hertz. 43 00:02:12,863 --> 00:02:14,280 These two are very close together. 44 00:02:14,280 --> 00:02:16,120 I would say there is probably a schwa. 45 00:02:16,120 --> 00:02:18,210 And given that there appears-- 46 00:02:18,210 --> 00:02:20,310 they appear to be separating again. 47 00:02:20,310 --> 00:02:25,020 I'm going to say there is a schwa here. 48 00:02:25,020 --> 00:02:26,940 This particular transition indicates 49 00:02:26,940 --> 00:02:30,480 that it's probably an "a" and with the possibility 50 00:02:30,480 --> 00:02:32,880 that it being an "e." 51 00:02:32,880 --> 00:02:35,610 This [INAUDIBLE] is definitely an "a." 52 00:02:35,610 --> 00:02:37,830 Again, here we see the possibility 53 00:02:37,830 --> 00:02:42,540 of an [? r-color, ?] either "r" [? "ruh" ?] or just 54 00:02:42,540 --> 00:02:44,510 a [INAUDIBLE].
55 00:02:44,510 --> 00:02:46,400 And going back to the last nasal, 56 00:02:46,400 --> 00:02:49,190 which with the transitions all coming down, 57 00:02:49,190 --> 00:02:51,230 I would say it's probably an [? "n." ?] 58 00:02:51,230 --> 00:02:55,640 Dr. Zue was presented with over 40 spectrograms. 59 00:02:55,640 --> 00:02:58,850 Included were normal English sentences, 60 00:02:58,850 --> 00:03:01,400 semantically anomalous sentences, 61 00:03:01,400 --> 00:03:06,020 and sequences of connected words and nonsense words. 62 00:03:06,020 --> 00:03:08,540 To evaluate Dr. Zue's performance 63 00:03:08,540 --> 00:03:10,730 we asked three phoneticians to listen 64 00:03:10,730 --> 00:03:13,610 to the original utterances and produce 65 00:03:13,610 --> 00:03:16,490 phonetic transcriptions. 66 00:03:16,490 --> 00:03:20,990 Dr. Zue's labeling agreed with the phoneticians on 80% to 90% 67 00:03:20,990 --> 00:03:22,220 of the segments. 68 00:03:22,220 --> 00:03:24,570 I think the only place that I might 69 00:03:24,570 --> 00:03:26,820 want to add something would be saying this is probably 70 00:03:26,820 --> 00:03:27,620 [INAUDIBLE]. 71 00:03:27,620 --> 00:03:31,640 This is probably-- a possibility of a [INAUDIBLE]. 72 00:03:31,640 --> 00:03:34,910 What do you think the utterance is? 73 00:03:34,910 --> 00:03:37,820 The first word is "winning." 74 00:03:37,820 --> 00:03:46,396 "Winning is never"-- at the end I have, "for him." 75 00:03:46,396 --> 00:03:57,490 So "winning is never a" something "for him," 76 00:03:57,490 --> 00:04:00,650 and I'm stuck here. 77 00:04:00,650 --> 00:04:05,690 "Per-- pretty-- pretty"-- 78 00:04:05,690 --> 00:04:07,910 something. 79 00:04:07,910 --> 00:04:11,060 "Pretty [? had"-- ?] I give up. 80 00:04:11,060 --> 00:04:12,950 "Thing." 81 00:04:12,950 --> 00:04:14,870 "Pretty thing for him"? 82 00:04:14,870 --> 00:04:19,110 "Winning is never a pretty thing for him." 83 00:04:19,110 --> 00:04:20,000 Pretty thing for him.
84 00:04:20,000 --> 00:04:21,225 What a weird sentence. 85 00:04:21,225 --> 00:04:24,620 86 00:04:24,620 --> 00:04:27,500 Analysis of the expert's performance reveals that 87 00:04:27,500 --> 00:04:30,560 spectrograms are read in two steps: 88 00:04:30,560 --> 00:04:34,280 segmentation and labeling. 89 00:04:34,280 --> 00:04:36,350 The purpose of the segmentation process 90 00:04:36,350 --> 00:04:39,560 is to divide the speech wave into units that correspond 91 00:04:39,560 --> 00:04:41,630 to phonetic segments. 92 00:04:41,630 --> 00:04:44,540 The main cue to segmentation is an abrupt spectral 93 00:04:44,540 --> 00:04:46,760 discontinuity. 94 00:04:46,760 --> 00:04:50,330 In some cases when an abrupt change is not observed, 95 00:04:50,330 --> 00:04:53,480 duration is used to determine whether one or two segments 96 00:04:53,480 --> 00:04:55,810 exist. 97 00:04:55,810 --> 00:04:56,420 All right. 98 00:04:56,420 --> 00:04:59,570 First of all, I'm going to segment 99 00:04:59,570 --> 00:05:08,850 this particular utterance into as many syllables as I can find 100 00:05:08,850 --> 00:05:16,400 and basically using the spectral change 101 00:05:16,400 --> 00:05:19,550 as a parameter for making-- marking the boundary. 102 00:05:19,550 --> 00:05:22,540 103 00:05:22,540 --> 00:05:25,238 Some of the places, for example, here is awful weak. 104 00:05:25,238 --> 00:05:26,530 It's kind of hard to determine. 105 00:05:26,530 --> 00:05:32,370 So for the time being I'm going to put a marker here and here, 106 00:05:32,370 --> 00:05:36,310 here, and here. 107 00:05:36,310 --> 00:05:38,590 All right. 108 00:05:38,590 --> 00:05:42,370 The formant motion-- the second formant motion 109 00:05:42,370 --> 00:05:45,490 indicates that within this segment 110 00:05:45,490 --> 00:05:48,990 it's going through sort of maybe two different steady states. 111 00:05:48,990 --> 00:05:53,740 So there's a possibility of an additional segment.
112 00:05:53,740 --> 00:06:00,360 Here for this particular one, I see that first of all, 113 00:06:00,360 --> 00:06:02,310 there's almost a discontinuity here, 114 00:06:02,310 --> 00:06:05,490 and also the third formant is rising. 115 00:06:05,490 --> 00:06:07,730 So I'm going to postulate that there 116 00:06:07,730 --> 00:06:11,010 is an additional segment there. 117 00:06:11,010 --> 00:06:14,300 There is intensity-- sharp intensity difference, 118 00:06:14,300 --> 00:06:18,110 coupled with a change in the second and third formant 119 00:06:18,110 --> 00:06:21,670 so I'm going to put these boundaries there. 120 00:06:21,670 --> 00:06:23,060 There's again a-- 121 00:06:23,060 --> 00:06:27,140 Dr. Zue is able to detect the existence of more than 95% 122 00:06:27,140 --> 00:06:29,360 of all segments. 123 00:06:29,360 --> 00:06:32,450 Once the segment boundaries have been identified, 124 00:06:32,450 --> 00:06:35,300 the labeling process begins. 125 00:06:35,300 --> 00:06:38,690 Dr. Zue's ability to accurately label phonetic segments 126 00:06:38,690 --> 00:06:40,745 is a unique and complex skill. 127 00:06:40,745 --> 00:06:43,920 128 00:06:43,920 --> 00:06:45,360 There is something here. 129 00:06:45,360 --> 00:06:55,150 It's either a-- it's a very weak fricative, 130 00:06:55,150 --> 00:06:59,023 or it's going to be a unreleased stop. 131 00:06:59,023 --> 00:07:02,250 132 00:07:02,250 --> 00:07:05,010 Something like that. 133 00:07:05,010 --> 00:07:09,660 Here is another stop and formant transition again indicate 134 00:07:09,660 --> 00:07:12,210 that perhaps this is a labial. 135 00:07:12,210 --> 00:07:19,560 And we again have a voiced stop because of the short voicing 136 00:07:19,560 --> 00:07:20,550 onset time. 137 00:07:20,550 --> 00:07:23,790 So I'm going to guess that it is a "b." 138 00:07:23,790 --> 00:07:27,280 139 00:07:27,280 --> 00:07:31,140 We're going to leave that at that for the moment.
140 00:07:31,140 --> 00:07:35,970 Here I see, again, a vowel gliding into something. 141 00:07:35,970 --> 00:07:38,040 The third formant is going way up 142 00:07:38,040 --> 00:07:40,470 and the second formant is coming down, 143 00:07:40,470 --> 00:07:42,330 and the proximity of these two formants, 144 00:07:42,330 --> 00:07:45,900 first and second formants indicates that this is probably 145 00:07:45,900 --> 00:07:49,140 a lateral [INAUDIBLE]. 146 00:07:49,140 --> 00:07:51,720 Many segments, such as vowels, have 147 00:07:51,720 --> 00:07:53,520 characteristic spectral patterns that 148 00:07:53,520 --> 00:07:56,235 are recognizable over a wide variety of contexts. 149 00:07:56,235 --> 00:07:58,830 150 00:07:58,830 --> 00:08:01,140 One example is the spreading formant pattern 151 00:08:01,140 --> 00:08:02,385 of the diphthong ay. 152 00:08:02,385 --> 00:08:07,200 153 00:08:07,200 --> 00:08:11,610 Reduced vowels can be identified by their short duration. 154 00:08:11,610 --> 00:08:14,340 When a vowel cannot be identified by a characteristic 155 00:08:14,340 --> 00:08:19,110 pattern, formant frequencies may be measured. 156 00:08:19,110 --> 00:08:22,375 Nasal consonants are recognizable by sharp amplitude 157 00:08:22,375 --> 00:08:22,875 drops. 158 00:08:22,875 --> 00:08:27,590 159 00:08:27,590 --> 00:08:30,290 r's are identifiable by a third formant 160 00:08:30,290 --> 00:08:37,400 that drops below 2,000 hertz. 161 00:08:37,400 --> 00:08:40,220 Quite often a segment is influenced by the context 162 00:08:40,220 --> 00:08:42,870 in which it occurs. 163 00:08:42,870 --> 00:08:46,670 For example, in the utterance, "Tom stole a butter plate," 164 00:08:46,670 --> 00:08:48,860 we observe that the "t" in initial position 165 00:08:48,860 --> 00:08:53,060 is relatively long with pronounced aspiration. 166 00:08:53,060 --> 00:08:55,790 In an initial s-t cluster, the aspiration 167 00:08:55,790 --> 00:09:01,550 is no longer observed, and in "butter" it becomes a flap.
168 00:09:01,550 --> 00:09:04,790 In final position, the "t" is often unreleased. 169 00:09:04,790 --> 00:09:07,770 170 00:09:07,770 --> 00:09:10,350 By studying thousands of spectrograms 171 00:09:10,350 --> 00:09:12,870 Dr. Zue has learned to recognize the influence 172 00:09:12,870 --> 00:09:15,630 of phonetic context and the operation 173 00:09:15,630 --> 00:09:16,770 of phonological rules. 174 00:09:16,770 --> 00:09:19,340 175 00:09:19,340 --> 00:09:22,640 Here is an example of the application of this knowledge. 176 00:09:22,640 --> 00:09:28,190 This is a strong fricative with energy well above 4,000 hertz. 177 00:09:28,190 --> 00:09:30,350 So is this one. 178 00:09:30,350 --> 00:09:31,770 So I would postulate at least they 179 00:09:31,770 --> 00:09:33,270 have the same place of articulation, 180 00:09:33,270 --> 00:09:39,020 and I think this one probably is an "s" because it's very 181 00:09:39,020 --> 00:09:42,470 strong, very high in frequency. 182 00:09:42,470 --> 00:09:45,140 This one is shorter than that one. 183 00:09:45,140 --> 00:09:47,390 It's possible that this one is a "z." 184 00:09:47,390 --> 00:09:51,440 However, it's also possible that this particular fricative 185 00:09:51,440 --> 00:09:55,950 is in a cluster with this particular stop here, 186 00:09:55,950 --> 00:10:02,000 which will reduce the duration of the fricative. 187 00:10:02,000 --> 00:10:04,790 For lack of any further information 188 00:10:04,790 --> 00:10:06,380 I'm going to say this one is either 189 00:10:06,380 --> 00:10:10,190 a short "s" because it's a cluster, or it's a "z." 190 00:10:10,190 --> 00:10:13,160 Once labeling was completed, Dr. Zue 191 00:10:13,160 --> 00:10:16,040 was asked if he could read off the utterance, something 192 00:10:16,040 --> 00:10:19,660 he does not often do in his own work. 193 00:10:19,660 --> 00:10:20,160 OK. 194 00:10:20,160 --> 00:10:23,420 Let me try to string the phonemes into words.
195 00:10:23,420 --> 00:10:28,970 I propose the word "yesterday" followed by the word "bill." 196 00:10:28,970 --> 00:10:33,380 That's probably an [INAUDIBLE] or I'm missing a segment here. 197 00:10:33,380 --> 00:10:37,760 That's either "the" or "va," and I probably 198 00:10:37,760 --> 00:10:39,350 would say it's "the." 199 00:10:39,350 --> 00:10:42,680 "Yesterday Bill saw the-- 200 00:10:42,680 --> 00:10:48,088 [INAUDIBLE] Goodyear blimp." 201 00:10:48,088 --> 00:10:48,812 [INAUDIBLE]? 202 00:10:48,812 --> 00:10:49,312 Yeah. 203 00:10:49,312 --> 00:10:52,090 Good-- Goodyear blimp, yeah. 204 00:10:52,090 --> 00:10:56,770 "Yesterday Bill saw the Goodyear blimp." 205 00:10:56,770 --> 00:10:59,170 In the 30 years since the invention of the speech 206 00:10:59,170 --> 00:11:01,510 spectrograph there have been several attempts 207 00:11:01,510 --> 00:11:04,630 to teach people to read spectrograms. 208 00:11:04,630 --> 00:11:07,360 Although some encouraging results were achieved, 209 00:11:07,360 --> 00:11:09,190 the general conclusion of this research 210 00:11:09,190 --> 00:11:11,050 has been that spectrogram reading is not 211 00:11:11,050 --> 00:11:13,390 possible because of the variability inherent 212 00:11:13,390 --> 00:11:15,910 in fluent speech. 213 00:11:15,910 --> 00:11:18,400 In this film Dr. Zue has demonstrated 214 00:11:18,400 --> 00:11:21,170 that despite the difficulties involved, 215 00:11:21,170 --> 00:11:22,555 such a skill can be acquired. 216 00:11:22,555 --> 00:11:25,390 217 00:11:25,390 --> 00:11:28,330 "Winning is never a-- 218 00:11:28,330 --> 00:11:34,260 winning is never a--" this is probably "for him." 219 00:11:34,260 --> 00:11:45,820 So "winning is never a something for him." 220 00:11:45,820 --> 00:11:49,960 Pretty-- pretty thing for him? 221 00:11:49,960 --> 00:11:51,160 Pretty thing for him. 222 00:11:51,160 --> 00:11:53,640 What a weird sentence. 223 00:11:53,640 --> 00:12:02,325