1.EQ
2delim $$
3.EN
4.CH "1  WHY SPEECH OUTPUT?"
5.ds RT "Why speech output?
6.ds CX "Principles of computer speech
7.pp
Speech is our everyday, informal communication medium.  But although we use
9it a lot, we probably don't assimilate as much information through our
10ears as we do through our eyes, by reading or looking at pictures and diagrams.
11You go to a technical lecture to get the feel of a subject \(em the overall
12arrangement of ideas and the motivation behind them \(em and fill in the details,
13if you still want to know them, from a book.  You probably find out more about
14the news from ten minutes with a newspaper than from a ten-minute news broadcast.
15So it should be emphasized from the start that speech output from computers is
16not a panacea.  It doesn't solve the problems of communicating with computers;
17it simply enriches the possibilities for communication.
18.pp
19What, then, are the advantages of speech output?  One good reason for listening
20to a radio news broadcast instead of spending the time with a newspaper
21is that you can listen while shaving, doing the housework, or driving the car.
22Speech leaves hands and eyes free for other tasks.
23Moreover, it is omnidirectional, and does not require a free line of sight.
24Related to this is the
25use of speech as a secondary medium for status reports and warning messages.
26Occasional interruptions by voice do not interfere with other activities,
27unless they demand unusual concentration, and people can assimilate spoken messages
28and queue them for later action quite easily and naturally.
29.pp
30The second key feature of speech communication stems from the telephone.
31It is the universality of the telephone receiver itself that is important
32here, rather than the existence of a world-wide distribution network;
33for with special equipment (a modem and a VDU) one does not need speech to take advantage of
34the telephone network for information transfer.
35But speech needs no tools other than the telephone, and this gives
36it a substantial advantage.  You can go into a phone booth anywhere in the world,
37carrying no special equipment, and have access to your computer within seconds.
38The problem of data input is still there:  perhaps your computer
39system has a limited word recognizer, or you use the touchtone telephone
40keypad (or a portable calculator-sized tone generator).  Easy remote access
41without special equipment is a great, and unique, asset to speech communication.
42.pp
43The third big advantage of speech output is that it is potentially very cheap.
44Being all-electronic, except for the loudspeaker, speech systems are well
45suited to high-volume, low-cost, LSI manufacture.  Other computer output
46devices are at present tied either to mechanical moving parts or to the CRT.
This was realized quickly by the computer hobbyist market, where speech output
48peripherals have been selling like hot cakes since the mid 1970's.
49.pp
50A further point in favour of speech is that it is natural-seeming and
51somehow cuddly when compared with printers or VDU's.  It would have been much
52more difficult to make this point before the advent of talking toys like
Texas Instruments' "Speak & Spell" in 1978, but now it is an accepted fact that friendly
54computer-based gadgets can speak \(em there are talking pocket-watches
55that really do "tell" the time, talking microwave ovens, talking pinball machines, and,
56of course, talking calculators.
It is difficult, however, to assess how far the appeal stems from
the novelty of mechanical speech \(em it
is still something of a gimmick \(em and to what extent it is tied up with
economic factors.
61After all, most of the population don't use high-quality VDU's, and their major
62experience of real-time interactive computing is through the very limited displays
63and keypads provided on video games and teletext systems.
64.pp
65Articles on speech communication with computers often list many more advantages of voice output
66(see Hill 1971, Turn 1974, Lea 1980).
67.[
68Hill 1971 Man-machine interaction using speech
69.]
70.[
71Lea 1980
72.]
73.[
74Turn 1974 Speech as a man-computer communication channel
75.]
76For example, speech
77.LB
78.NP
79can be used in the dark
80.NP
81can be varied from a (confidential) whisper to a (loud) shout
82.NP
83requires very little energy
84.NP
85is not appreciably affected by weightlessness or vibration.
86.LE
87However, these either derive from the three advantages we have discussed above,
88or relate
89mainly to exotic applications in space modules and divers' helmets.
90.pp
91Useful as it is at present, speech output would be even more attractive if it could
92be coupled with speech input.  In many ways, speech input is its "big brother".
93Many of the benefits of speech output are even more striking for speech input.
94Although people can assimilate information faster through the eyes than the
95ears, the majority of us can generate information faster with the mouth than
96with the hands.  Rapid typing is a relatively uncommon skill, and even high
97typing rates are much slower than speaking rates (although whether we can
originate ideas quickly enough to keep up with fast speech is another matter!).  To
99take full advantage of the telephone for interaction with machines, machine
100recognition of speech is obviously necessary.  A microwave oven, calculator,
101pinball machine, or alarm clock that responds to spoken commands is certainly
102more attractive than one that just generates spoken status messages.  A book
103that told you how to recognize speech by machine would undoubtedly be more
104useful than one like this that just discusses how to synthesize it!  But the
105technology of speech recognition is nowhere near as advanced as that of
106synthesis \(em it's a much more difficult problem.  However, because speech input
107is obviously complementary to speech output, and even very limited input
108capabilities will greatly enhance many speech output systems, it is worth
109summarizing the present state of the art of speech recognition.
110.pp
111Commercial speech recognizers do exist.  Almost invariably, they accept
112words spoken in isolation, with gaps of silence between them, rather than
113connected utterances.
114It is not difficult to discriminate with high accuracy up to a hundred
115different words spoken by the same speaker, especially if the vocabulary
116is carefully selected to avoid words which sound similar.  If several
117different speakers are to be comprehended, performance can be greatly improved
118if the machine is given an opportunity to calibrate their voices in a training
119session, and is informed at recognition time which one is to speak.
120With a large population of unknown speakers, accurate recognition is difficult
121for vocabularies of more than a few carefully-chosen words.
122.pp
123A half-way house between isolated word discrimination and recognition of connected
124speech is the problem of spotting known words in continuous speech.  This
125allows much more natural input, if the dialogue is structured as keywords
126which may be
127interspersed by unimportant "noise words".  To speak in truly isolated
128words requires a great deal of self-discipline and concentration \(em it is
129surprising how much of ordinary speech is accounted for by vague sounds
130like um's and aah's, and false starts.  Word spotting disregards these and so
131permits a more relaxed style of speech.  Some progress has been made on it in
research laboratories, but the vocabularies that can be accommodated are still
133very small.
134.pp
135The difficulty of recognizing connected speech depends crucially on what is
136known in advance about the dialogue:  its pragmatic, semantic, and syntactic
137constraints.  Highly structured dialogues constrain very heavily the choice of
138the next word.  Recognizers which can deal with vocabularies of over 1000 words
139have been built in research laboratories, but the structure of the input has
140been such that the average "branching factor" \(em the size of the set out of
141which the next word must be selected \(em is only around 10 (Lea, 1980).
142.[
143Lea 1980
144.]
145Whether such
146highly constrained languages would be acceptable in many practical applications
147is a moot point.  One commercial recognizer, developed in 1978, can cope with
148up to five words spoken continuously from a basic 120-word vocabulary.
149.pp
150There has been much debate about whether it will ever be possible for a speech
151recognizer to step outside rigid constraints imposed on the utterances it can
152understand, and act, say, as an automatic dictation machine.  Certainly the most
153advanced recognizers to date depend very strongly on a tight context being
154available.  Informed opinion seems to accept that in ten years' time,
155voice data entry in the office will be an important and economically feasible
156prospect, but that it would be rash to predict the appearance of unconstrained
157automatic dictation by then.
158.pp
159Let's return now to speech output and take a look at some systems which use it,
160to illustrate the advantages and disadvantages of speech in practical
161applications.
162.sh "1.1  Talking calculator"
163.pp
164Figure 1.1 shows a calculator that speaks.
165.FC "Figure 1.1"
166Whenever a key is pressed,
167the device confirms the action by saying the key's name.
168The result of any computation is also spoken aloud.
169For most people, the addition of speech output to a calculator is simply a
170gimmick.
171(Note incidentally that speech
172.ul
173input
174is a different matter altogether.  The ability to dictate lists of numbers and
175commands to a calculator, without lifting one's eyes from the page, would have
176very great advantages over keypad input.)  Used-car
177salesmen find that speech output sometimes helps to clinch a deal:  they key in
178the basic car price and their bargain-basement deductions, and the customer is so
179bemused by the resulting price being spoken aloud to him by a machine that he
180signs the cheque without thinking!  More seriously, there may be some small
181advantage to be gained when keying a list of figures by touch from having their
182values read back for confirmation.  For blind people, however, such devices
183are a boon \(em and there are many other applications, like talking elevators
184and talking clocks, which benefit from even very restricted voice output.
185Much more sophisticated is a typewriter with audio feedback, designed by
186IBM for the blind.  Although blind typists can remember where the keys on a
187typewriter are without difficulty, they rely on sighted proof-readers to help
188check
189their work.  This device could make them more useful as office typists and
190secretaries.  As well as verbalizing the material (including punctuation)
191that has been typed, either by attempting to pronounce the words or by spelling
192them out as individual letters, it prompts the user through the more complex action sequences
193that are possible on the typewriter.
194.pp
195The vocabulary of the talking calculator comprises the 24 words of Table 1.1.
196.RF
197.nr x1 2.0i+\w'percent'u
198.nr x1 (\n(.l-\n(x1)/2
199.in \n(x1u
200.ta 2.0i
201zero	percent
202one	low
203two	over
204three	root
205four	em (m)
206five	times
207six	point
208seven	overflow
209eight	minus
210nine	plus
211times-minus	clear
212equals	swap
213.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
214.in 0
215.FG "Table 1.1  Vocabulary of a talking calculator"
216This represents a total of about 13 seconds of speech.  It is stored
217electronically in read-only memory (ROM), and Figure 1.2 shows the circuitry
218of the speech module inside the calculator.
219.FC "Figure 1.2"
220There are three large integrated circuits.
221Two of them are ROMs, and the other is a special synthesis chip which decodes the
222highly compressed stored data into an audio waveform.
223Although the mechanisms used for storing speech by commercial devices are
224not widely advertised by the manufacturers, the talking calculator almost
225certainly uses linear predictive coding \(em a technique that we will examine
226in Chapter 6.
227The speech quality is very poor because of the highly compressed storage, and
228words are spoken in a grating monotone.
229However, because of the very small vocabulary, the quality is certainly good
230enough for reliable identification.
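.pp
To see why 13 seconds of speech fits comfortably into a couple of ROM chips,
it is worth doing the arithmetic.  The rates in the sketch below are
assumptions made purely for the estimate:  64\ kbit/s corresponds to
straightforward 8-bit digitization at an 8\ kHz sampling rate, while heavily
compressed linear predictive storage of the kind used in such devices runs
at something like 1 to 2\ kbit/s.
.LB
.nf
# Rough ROM-size estimate for the calculator's 13-second vocabulary.
# Both coding rates are illustrative assumptions, not manufacturer's figures.
seconds = 13
rates = {"direct 8-bit, 8 kHz digitization": 64000,    # bit/s
         "compressed linear predictive coding": 1200}   # bit/s (assumed)

for name, rate in rates.items():
    bits = seconds * rate
    print(f"{name}: {bits} bits ({bits // 8} bytes)")
.fi
.LE
Direct digitization would need on the order of a hundred kilobytes, whereas
the compressed form needs only about two \(em not out of keeping with the
two ROM packages of Figure 1.2.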
231.sh "1.2  Computer-generated wiring instructions"
232.pp
233I mentioned earlier that one big advantage of speech over visual output is that
234it leaves the eyes free for other tasks.
235When wiring telephone equipment during manufacture, the operator needs to use
236his hands as well as eyes to keep his place in the task.
237For some time tape-recorded instructions have been used for this in certain
238manufacturing plants.  For example, the instruction
239.LB
240.NI
241Red 2.5    11A terminal strip    7A tube socket
242.LE
243directs the operator to cut 2.5" of red wire, attach one end to a specified point
244on the terminal strip, and attach the other to a pin of the tube socket.  The
245tape recorder is fitted with a pedal switch to allow a sequence of such instructions
246to be executed by the operator at his own pace.
247.pp
248The usual way of recording the instruction tape is to have a human reader
dictate the instructions from a printed list.
250The tape is then checked against the list by another listener to ensure that
251the instructions are correct.  Since wiring lists are usually stored and
252maintained in machine-readable form, it is natural to consider whether speech
253synthesis techniques could be used to generate the acoustic tape directly by
254a computer (Flanagan
255.ul
256et al,
2571972).
258.[
259Flanagan Rabiner Schafer Denman 1972
260.]
261.pp
262Table 1.2 shows the vocabulary needed for this application.
263.RF
264.nr x1 2.0i+2.0i+\w'tube socket'u
265.nr x1 (\n(.l-\n(x1)/2
266.in \n(x1u
267.ta 2.0i +2.0i
268A	green	seventeen
269black	left	six
270bottom	lower	sixteen
271break	make	strip
272C	nine	ten
273capacitor	nineteen	terminal
274eight	one	thirteen
275eighteen	P	thirty
276eleven	point	three
277fifteen	R	top
278fifty	red	tube socket
279five	repeat coil	twelve
280forty	resistor	twenty
281four	right	two
282fourteen	seven	upper
283.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
284.in 0
285.FG "Table 1.2  Vocabulary needed for computer-generated wiring instructions"
286It is rather larger
287than that of the talking calculator \(em about 25 seconds of speech \(em but well
288within the limits of single-chip storage in ROM, compressed by the linear
289predictive technique.  However, at the time that the scheme was investigated
290(1970\-71) the method of linear predictive coding had not been fully developed,
291and the technology for low-cost microcircuit implementation was not available.
292But this is not important for this particular application, for there is
293no need to perform the synthesis on a miniature low-cost computer system,
294nor need it
295be accomplished in real time.  In fact a technique of concatenating
296spectrally-encoded words was used (described in Chapter 7), and it was
297implemented on a minicomputer.  Operating much slower than real-time, the system
298calculated the speech waveform and wrote it to disk storage.  A subsequent phase
299read the pre-computed messages and recorded them on a computer-controlled analogue
300tape recorder.
301.pp
302Informal evaluation showed the scheme to be quite successful.  Indeed, the
303synthetic speech, whose quality was not high, was actually preferred to
304natural speech in the noisy environment of the production line, for each
305instruction was spoken in the same format, with the same programmed pause
306between the items.
307A list of 58 instructions of the form shown above was recorded and used
308to wire several pieces of apparatus without errors.
309.sh "1.3  Telephone enquiry service"
310.pp
311The computer-generated wiring scheme illustrates how speech can be used to give
312instructions without diverting visual attention from the task at hand.
313The next system we examine shows how speech output can make the telephone
314receiver into a remote computer terminal for a variety of purposes
315(Witten and Madams, 1977).
316.[
317Witten Madams 1977 Telephone Enquiry Service
318.]
319The caller employs the touch-tone keypad shown in Figure 1.3 for input, and the
320computer generates
321a synthetic voice response.
322.FC "Figure 1.3"
323Table 1.3 shows the process of making
324contact with the system.
325.RF
326.fi
327.nh
328.na
329.in 0.3i
330.nr x0 \w'COMPUTER:  '
331.nr x1 \w'CALLER:  '
332.in+\n(x0u
333.ti-\n(x0u
334CALLER:\h'\n(x0u-\n(x1u'  Dials the service.
335.ti-\n(x0u
336COMPUTER:  Answers telephone.
337"Hello, Telephone Enquiry Service.  Please
338enter your user number".
339.ti-\n(x0u
340CALLER:\h'\n(x0u-\n(x1u'  Enters user number.
341.ti-\n(x0u
342COMPUTER:  "Please enter your password".
343.ti-\n(x0u
344CALLER:\h'\n(x0u-\n(x1u'  Enters password.
345.ti-\n(x0u
346COMPUTER:  Checks validity of password.
347If invalid, the user is asked to re-enter
348his user number.
349Otherwise,
350"Which service do you require?"
351.ti-\n(x0u
352CALLER:\h'\n(x0u-\n(x1u'  Enters service number.
353.in 0
354.nf
355.FG "Table 1.3  Making contact with the telephone enquiry system"
356.pp
Advantage is taken of the disparate speeds of input (keypad) and
358output (speech) to hasten the dialogue by imposing a question-answer structure
359on it, with the computer taking the initiative.  The machine can
360afford to be slightly verbose if by so doing it makes the caller's
response easier, and therefore more rapid.  Moreover, callers who
362are experienced enough with the system to anticipate questions can
363easily forestall them just by typing ahead, for the computer is programmed
364to examine its input buffer before issuing prompts and to suppress them if
365input has already been provided.
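.pp
The type-ahead arrangement is simple to program, as the following sketch
suggests.  The names used here (keypad_buffer, speak, prompt) are invented
for illustration; the text does not describe the system's internal
interfaces.
.LB
.nf
# Minimal sketch of the type-ahead convention: a prompt is spoken only
# if the caller has not already keyed something.
keypad_buffer = []                  # keys received but not yet consumed

def speak(message):
    print("COMPUTER:", message)     # stands in for the voice response unit

def prompt(message):
    if not keypad_buffer:           # has the caller typed ahead?
        speak(message)              # no - speak the prompt as usual

# An experienced caller keys the service number before being asked for it,
# so the prompt below is suppressed:
keypad_buffer.extend(list("101#"))
prompt("Which service do you require?")
.fi
.LE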
366.pp
367An important aim of the system is to allow application programmers with no
368special knowledge of speech to write independent services for it.
369Table 1.4 shows an example of the use of one such application program,
370.RF
371.fi
372.nh
373.na
374.in 0.3i
375.nr x0 \w'COMPUTER:  '
376.nr x1 \w'CALLER:  '
377.in+\n(x0u
378.ti-\n(x0u
379COMPUTER:  "Stores Information Service.  Please enter
380component name".
381.ti-\n(x0u
382CALLER:\h'\n(x0u-\n(x1u'  Enters "SN7406#".
383.ti-\n(x0u
384COMPUTER:  "The component name is SN7406.  Is this correct?"
385.ti-\n(x0u
386CALLER:\h'\n(x0u-\n(x1u'  Enters "*1#" (system convention for "yes").
387.ti-\n(x0u
388COMPUTER:  "This component is in stores".
389.ti-\n(x0u
390CALLER:\h'\n(x0u-\n(x1u'  Enters "*7#" (command for "price").
391.ti-\n(x0u
392COMPUTER:  "The component price is 35 pence".
393.ti-\n(x0u
394CALLER:\h'\n(x0u-\n(x1u'  Enters "*8#" (command for "minimum number").
395.ti-\n(x0u
396COMPUTER:  "The minimum number of this component kept
397in stores is 10".
398.ti-\n(x0u
399CALLER:\h'\n(x0u-\n(x1u'  Enters "SN7417#".
400.ti-\n(x0u
401COMPUTER:  "The component name is SN7417.  Is this correct?"
402.ti-\n(x0u
403CALLER:\h'\n(x0u-\n(x1u'  Enters "*1#".
404.ti-\n(x0u
405COMPUTER:  "This component is not in stores".
406.ti-\n(x0u
407CALLER:\h'\n(x0u-\n(x1u'  Enters "*9#" (command for "delivery time").
408.ti-\n(x0u
409COMPUTER:  "The expected delivery time is 14 days".
410.ti-\n(x0u
411CALLER:\h'\n(x0u-\n(x1u'  Enters "*0#".
412.ti-\n(x0u
413COMPUTER:  "Which service do you require?"
414.in 0
415.nf
416.FG "Table 1.4  The Stores Information Service"
417the
418Stores Information Service, which permits enquiries to be made of a database
419holding information on electronic components kept in stock.
420This subsystem is driven by
421.ul
422alphanumeric
423data entered on the touch-tone keypad.  Two or three letters are associated
424with each digit, in a manner which is fairly standard in touch-tone telephone
425applications.  These are printed on a card overlay
426that fits the keypad (see Figure 1.3).  Although true alphanumeric data entry
427would require a multiple key press for each character,
428the ambiguity inherent in
429a single-key-per-character convention can usually be resolved by the computer,
430if it has a list of permissible entries.  For example, the component names
431SN7406 and ZTX300 are read by the machine as "767406" and "189300", respectively.
432Confusion rarely occurs if the machine is expecting a valid component code.
433The same holds true of people's names, and file names \(em although with these
434one must take care not to identify a series of files by similar names, like
435TX38A, TX38B, TX38C.  It is easy for the machine to detect the rare cases
436where ambiguity occurs, and respond by requesting further information:  "The
437component name is SN7406.  Is this correct?"  (In fact, the Stores Information
438Service illustrated in Table 1.4 is defective in that it
439.ul
440always
441requests confirmation of an entry, even when no ambiguity exists.)  The
442use of a telephone keypad for data entry will be taken up again in Chapter 10.
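.pp
The disambiguation step is easy to visualize in code.  The sketch below
assumes the traditional letter layout with Q and Z assigned to the 1 key,
which is consistent with the two component names quoted above; the layout
actually used is whatever appears on the card overlay of Figure 1.3.
.LB
.nf
# Single-key-per-character entry resolved against a list of valid codes.
# The letter-to-digit layout is an assumption (see the overlay of Figure 1.3).
LAYOUT = {"1": "QZ", "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
          "6": "MNO", "7": "PRS", "8": "TUV", "9": "WXY"}
TO_DIGIT = {letter: digit for digit, letters in LAYOUT.items()
            for letter in letters}

def keyed_form(name):
    # Digit string produced when the name is keyed one key per character.
    return "".join(TO_DIGIT.get(ch, ch) for ch in name.upper())

def resolve(keys, valid_codes):
    # Return the valid codes that could have produced this digit string.
    return [code for code in valid_codes if keyed_form(code) == keys]

stock = ["SN7406", "SN7417", "ZTX300"]
print(keyed_form("SN7406"), keyed_form("ZTX300"))   # 767406 189300
print(resolve("767406", stock))                     # ['SN7406'] - unambiguous
.fi
.LE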
443.pp
444A distinction is drawn throughout the system between data entries and
445commands, the latter being prefixed by a "*".  In this example, the
446programmer chose to define a command for each possible question about a
447component, so that a new component name can be entered at any time
448without ambiguity.  The price paid for the resulting brevity of dialogue
449is the burden of memorizing the meaning of the commands.  This is an
450inherent disadvantage of a one-dimensional auditory display over the
451more conventional graphical output:   presenting menus by speech is tedious and
452long-winded.  In practice, however, for a simple task such as the
453Stores Information Service it is quite convenient for the caller to
454search for the appropriate command by trying out all possibilities \(em there
455are only a few.
456.pp
457The problem of memorizing commands is alleviated by establishing some
458system-wide conventions.  Each input is terminated by a "#", and
459the meaning of standard commands is given in Table 1.5.
460.RF
461.fi
462.nh
463.na
464.in 0.3i
465.nr x0 \w'# alone  '
466.nr x1 \w'\(em  '
467.ta \n(x0u +\n(x1u
468.nr x2 \n(x0+\n(x1
469.in+\n(x2u
470.ti-\n(x2u
471*#	\(em	Erase this input line, regardless of what has
472been typed before the "*".
473.ti-\n(x2u
474*0#	\(em	Stop.  Used to exit from any service.
475.ti-\n(x2u
476*1#	\(em	Yes.
477.ti-\n(x2u
478*2#	\(em	No.
479.ti-\n(x2u
480*3#	\(em	Repeat question or summarize state of current
481transaction.
482.ti-\n(x2u
483# alone	\(em	Short form of repeat.  Repeats or summarizes
484in an abbreviated fashion.
485.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
486.in 0
487.nf
488.FG "Table 1.5  System-wide conventions for the service"
489.pp
490A summary of services available on the system is given in
491Table 1.6.
492.RF
493.fi
494.na
495.in 0.3i
496.nr x0 \w'000  '
497.nr x1 \w'\(em  '
498.nr x2 \n(x0+\n(x1
499.in+\n(x2u
500.ta \n(x0u +\n(x1u
501.ti-\n(x2u
502\0\01	\(em	tells the time
503.ti-\n(x2u
504\0\02	\(em	Biffo (a game of NIM)
505.ti-\n(x2u
506\0\03	\(em	MOO (a game similar to that marketed under the name "Mastermind")
507.ti-\n(x2u
508\0\04	\(em	error demonstration
509.ti-\n(x2u
510\0\05	\(em	speak a file in phonetic format
511.ti-\n(x2u
512\0\06	\(em	listening test
513.ti-\n(x2u
514\0\07	\(em	music (allows you to enter a tune and play it)
515.ti-\n(x2u
516\0\08	\(em	gives the date
517.sp
518.ti-\n(x2u
519100	\(em	squash ladder
520.ti-\n(x2u
521101	\(em	stores information service
522.ti-\n(x2u
523102	\(em	computes means and standard deviations
524.ti-\n(x2u
525103	\(em	telephone directory
526.sp
527.ti-\n(x2u
528411	\(em	user information
529.ti-\n(x2u
530412	\(em	change password
531.ti-\n(x2u
532413	\(em	gripe (permits feedback on services from caller)
533.sp
534.ti-\n(x2u
535600	\(em	first year laboratory marks entering service
536.sp
537.ti-\n(x2u
538910	\(em	repeat utterance (allows testing of system)
539.ti-\n(x2u
540911	\(em	speak utterance (allows testing of system)
541.ti-\n(x2u
542912	\(em	enable/disable user 100 (a no-password guest user number)
543.ti-\n(x2u
544913	\(em	mount a magnetic tape on the computer
545.ti-\n(x2u
546914	\(em	set/reset demonstration mode (prohibits access by low-priority users)
547.ti-\n(x2u
548915	\(em	inhibit games
549.ti-\n(x2u
550916	\(em	inhibit the MOO game
551.ti-\n(x2u
552917	\(em	disable password checking when users log in
553.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
554.in 0
555.nf
556.FG "Table 1.6  Summary of services on a telephone enquiry system"
557They range from simple games and demonstrations, through serious database
558services, to system maintenance facilities.
559A priority structure is imposed upon them, with higher
560service numbers being available only to higher priority users.
561Services in the lowest range (1\-99) can be obtained by all, while
562those in the highest range (900\-999) are maintenance services,
563available only to the system designers.  Access to the lower-numbered
564"games" services can be inhibited by a priority user \(em this was
565found necessary to prevent over-use of the system!  Another advantage
566of telephone access to an information retrieval system is that some
567day-to-day maintenance can be done remotely, from the office telephone.
568.pp
569This telephone enquiry service, which was built in 1974, demonstrated that
570speech synthesis had moved from a specialist phonetic discipline into the
571province of engineering practicability.  The speech was generated "by rule"
572from a phonetic input (the method is covered in Chapters 7 and 8), which
573has very low data storage requirements of around 75\ bit/s of speech.
Thus an enormous vocabulary and range of services could be accommodated on a
575small computer system.
576Despite the fairly low quality of the speech, the response from callers was
577most encouraging.  Admittedly the user population was a self-selected body of
578University staff, which one might suppose to have high tolerance to new ideas,
579and a system designed for the general public would require more effort to be
580spent on developing speech of greater intelligibility.  Although it was
581observed that some callers failed to understand parts of the responses, even
after repetition, communication was largely unhindered in most cases, because
users were driven by a high motivation to help the system help them.
584.pp
585The use of speech output in conjunction with a simple input device requires
586careful thought for interaction to be successful and comfortable.  It is
587necessary that the computer direct the conversation as much as possible,
588without seeming to be taking charge.  Provision for eliminating prompts
589which are unwanted by sophisticated users is essential to avoid frustration.
590We will return to the topic of programming techniques for speech interaction
591in Chapter 10.
592.pp
593Making a computer system available over the telephone results in a sudden
594vast increase in the user population.  Although people's reaction to a new
595computer terminal in every office was overwhelmingly favourable, careful
596resource allocation was essential to prevent the service being hogged by a
597persistent few.  As with all multi-access computer systems, it is particularly
598important that error recovery is effected automatically and gracefully.
599.sh "1.4  Speech output in the telephone exchange"
600.pp
601The telephone enquiry service was an experimental vehicle for research on speech
602interaction, and was developed in 1974.
603Since then, speech has begun to be used in real commercial applications.
604One example is System\ X, the British Post Office's computer-controlled
605telephone exchange.  This incorporates many features
606not found in conventional telephone exchanges.
607For example, if a number is found to be busy, the call can be attempted
608again by a "repeat last call" command, without having to re-dial the full number.
609Alternatively, the last number can be stored for future re-dialling, freeing
610the phone for other calls.
611"Short code
612dialling" allows a customer to associate short codes with commonly-dialled
613numbers.
614Alarm calls can be booked at specified times, and are made automatically
615without human intervention.
616Incoming calls can be barred, as can outgoing ones.  A diversion service
617allows all incoming calls to be diverted to another telephone, either
618immediately, or if a call to the original number remains unanswered for
619a specified period of time, or if the original number is busy.
620Three-party calls can be set up automatically, without involving the
621operator.
622.pp
623Making use of these facilities presents the caller with something of a problem.
624With conventional telephone exchanges, feedback is provided on what is happening
625to a call by the use of four tones \(em the dial tone, the busy tone,
626the ringing tone, and the number unavailable tone.
627For the more sophisticated interaction which is expected on the advanced
628exchange, a much greater variety of status signals is required.
629The obvious solution is to use
630computer-generated spoken
631messages to inform the caller when these services are invoked, and to guide him
632through the sequences of actions needed to set up facilities like call
633re-direction.  For example, the messages used by the exchange when a user
634accesses the alarm call
635service are
636.LB
637.NI
638Alarm call service.
639Dial the time of your alarm call followed by square\u\(dg\d.
640.FN 1
641\(dg\d"Square" is the term used for the "#" key on the touch-tone telephone.\u
642.EF
643.NI
644You have booked an alarm call for seven thirty hours.
645.NI
646Alarm call operator.  At the third stroke it will be seven thirty.
647.LE
648.pp
Because the vocabulary is rather small, because the number of messages is
small enough for them to be stored in their entirety rather than formed by
concatenating smaller units, and because only a short time was available for
development, System\ X stores speech as a time waveform, slightly compressed
by a time-domain encoding operation (such techniques are described in Chapter 3).
654Utterances which contain variable parts, like the time of alarm in the messages
655above, are formed by inserting separately-recorded digits in a fixed
656"carrier" message.  No attempt is made to apply uniform intonation
657contours to the synthetic utterances.  The resulting speech is of excellent
658quality (being a slightly compressed recording of a human voice), but sometimes
659exhibits somewhat anomalous pitch contours.
660For example, the digits comprising numbers often sound rather jerky and
661out-of-context \(em which indeed they are.
662.pp
663Even more advanced facilities can be expected on telephone exchanges in
664the future.  A message storage capability is one example.  Although
665automatic call recording machines have been available for years, a centralized
666facility could time and date a message, collect the caller's identity
667(using the telephone keypad), and allow the recipient to select messages left
668for him through an interactive dialogue so that he could control the order
669in which he listens to them.  He could choose to leave certain messages to be
670dealt with later, or re-route them to a colleague.  He may even wish to leave
671reminders for himself, to be dialled automatically at specified times (like
672alarm calls with user-defined information attached).  The sender of a message
673could be informed automatically by the system when it is delivered.  None of
674this requires speech recognition, but it does need economical speech
675.ul
676storage,
677and also speech
678.ul
679synthesis
680(for time and date tags).
681.sh "1.5  Travel consultant"
682.pp
683Some current research in speech communication with computers is aimed at
684embedding the dialogue in a task environment which provides it with an overall
685contextual framework.
686This in fact was the motivation for the experimental telephone enquiry service
687described above, and it was observed that people are able to tolerate rather
688low quality speech if they feel that the interaction is worthwhile.  One step
689further is the use of speech
690.ul
691input
692as well as output, and here the context of the conversation which is provided
693by the task environment can be used to improve recognition accuracy
694significantly.
695A system set up at Bell Telephone Laboratories indicates what we can expect
696to find operating commercially in the future (Levinson and Shipley, 1980).
697.[
698Levinson Shipley 1980
699.]
700It provides timetable information about airline flights, and allows the user
701to make reservations.
702.pp
703An example dialogue is given in Table 1.7.
704.RF
705.fi
706.nh
707.na
708.in 0.3i
709.nr x0 \w'COMPUTER:  '
710.nr x1 \w'CALLER:  '
711.in+\n(x0u
712.ti-\n(x0u
713COMPUTER:  This is the Bell Laboratories flight information system.  May I help you?
714.ti-\n(x0u
715CALLER:\h'\n(x0u-\n(x1u'  I would like to go to Chicago.
716.ti-\n(x0u
717COMPUTER:  Please specify the day of your departure.
718.ti-\n(x0u
CALLER:\h'\n(x0u-\n(x1u'  I want to leave at four p.m.
720.ti-\n(x0u
721COMPUTER:  Please specify the day of your departure.
722.ti-\n(x0u
723CALLER:\h'\n(x0u-\n(x1u'  I want to depart on the two five May.
724.ti-\n(x0u
COMPUTER:  Flight number nine two leaves New York at four p.m., arrives in Chicago at five twenty five p.m.
726.ti-\n(x0u
727CALLER:\h'\n(x0u-\n(x1u'  What plane is on the flight?
728.ti-\n(x0u
729COMPUTER:  The aircraft is a Boeing seven twenty seven.
730.ti-\n(x0u
731CALLER:\h'\n(x0u-\n(x1u'  How much is the fare?
732.ti-\n(x0u
733COMPUTER:  The round trip coach fare is one hundred forty two dollars.
734.ti-\n(x0u
735CALLER:\h'\n(x0u-\n(x1u'  I want a first class seat.
736.ti-\n(x0u
737COMPUTER:  One seat thank you.
738.ti-\n(x0u
739CALLER:\h'\n(x0u-\n(x1u'  What is the fare?
740.ti-\n(x0u
741COMPUTER:  The round trip first class fare is one hundred eighty four dollars.
742.ti-\n(x0u
743CALLER:\h'\n(x0u-\n(x1u'  Please repeat the arrival time.
744.ti-\n(x0u
COMPUTER:  The arrival time is five twenty five p.m.
746.ti-\n(x0u
747CALLER:\h'\n(x0u-\n(x1u'  I will take the flight.
748.ti-\n(x0u
749COMPUTER:  Your reservation on flight number nine two to Chicago on Friday is confirmed.  Thank you.
750.in 0
751.nf
752.MT 2
753Table 1.7  A conversation with an airline flight information service
754(from Levinson and Shipley, 1980)
755.TE
756.[
757Levinson Shipley 1980
758.]
759The user dials the system from an
760ordinary telephone.  The recognition side must be trained by each user, and
761accepts isolated words spoken with brief pauses between them.
762The voice response unit has a vocabulary of around 200 words, and
763synthesizes its answers by slotting words into "templates" evoked by the speech
764understanding part in response to a query.  For example,
765.LB
766.NI
767This flight makes \(em stops
768.NI
769Flight number \(em leaves \(em at \(em , arrives in \(em at \(em
770.LE
771are templates which when called with specific slot fillers could produce the
772utterances
773.LB
774.NI
775This flight makes three stops
776.NI
777Flight number nine two leaves New York at four p.m.,
778arrives in Chicago at five twenty-five p.m.
779.LE
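.pp
Slot filling of this kind is straightforward to express in code.  The
fragment below is an illustration of the idea only:  the underscore slot
marker is an invented detail, and where the sketch joins strings of text
the real system concatenates recorded, digitized words.
.LB
.nf
# Illustration of template-and-slot voice response.  Plain strings stand
# in for the recorded utterance fragments used by the real system.
def fill(template, fillers):
    parts = template.split("_")         # "_" marks a slot (invented notation)
    out = parts[0]
    for filler, rest in zip(fillers, parts[1:]):
        out += filler + rest
    return out

print(fill("This flight makes _ stops", ["three"]))
print(fill("Flight number _ leaves _ at _ , arrives in _ at _",
           ["nine two", "New York", "four p.m.",
            "Chicago", "five twenty-five p.m."]))
.fi
.LE
.pp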
780The chief research interest of the system is in its speech understanding
781capabilities, and the method used for speech output is relatively
782straightforward.  The templates and words are recorded, digitized, compressed
783slightly, and stored on disk files (totalling a few hundred thousand bytes of
784storage), using techniques similar to those of System\ X.
785Again, no independent manipulation of pitch is possible, and so the utterances
786sound intelligible but the transition between templates and slot fillers is not
787completely fluent.  However, the overall context of the interaction means that
788the communication is not seriously disrupted even if the machine occasionally
789misunderstands the man or vice versa.  The user's attention is drawn away from
790recognition accuracy and focussed on the exchange of information with the machine.
791The authors conclude that progress in speech recognition can best be made by
792studying it in the context of communication rather than in a vacuum or as part
793of a one-way channel, and the same is undoubtedly true of speech synthesis as
794well.
795.sh "1.6  Reading machine for the blind"
796.pp
797Perhaps the most advanced attempt to provide speech output from a computer
798is the Kurzweil reading machine for the blind, first marketed in the late
7991970's (Figure 1.4).
800.FC "Figure 1.4"
801This device reads an ordinary book aloud.  Users adjust the reading
802speed according to the content of the material and their familiarity with
803it, and the maximum rate has recently been improved to around 225 words per
804minute \(em perhaps half as fast again as normal human speech rates.
805.pp
806As well as generating speech from text, the machine has to scan the document
807being read and identify the characters presented to it.  A scanning camera
808is used, controlled by a program which searches for and tracks the lines of
809text.  The output of the camera is digitized, and the image is enhanced
810using signal-processing techniques.  Next each individual letter must be
811isolated, and its geometric features identified and compared with a pre-stored
812table of letter shapes.  Isolation of letters is not at all trivial, for
813many type fonts have "ligatures" which are combinations of characters joined
together (for example, the letters "fi" are often run together).  The
815machine must cope with many printed type fonts, as well as typewritten ones.
816The text-recognition side of the Kurzweil reading machine is in fact one of
817its most advanced features.
818.pp
819We will discuss the problem of speech generation from text in Chapter 9.
820It has many facets.  First there is pronunciation, the
821translation of letters to sounds.  It is important to take into account
822the morphological structure of words, dividing them into "root" and "endings".
823Many words have concatenated suffixes (like "like-li-ness").  These are
824important to detect, because a final "e" which appears on a root word
825is not pronounced itself but affects the pronunciation of the previous
826vowel.  Then there is the difficulty that some words look the same
827but are pronounced differently, depending on their meaning or on the syntactic
828part that they play in the sentence.
829Appropriate intonation is extremely difficult to generate from a plain textual
830representation, for it depends on the meaning of the text and the way in which
831emphasis is given to it by the reader.  Similarly the rhythmic structure is
832important, partly for correct pronunciation and partly for purposes of
833emphasis.
834Finally the sounds that have been deduced from the text need to be synthesized
835into acoustic form, taking due account of the many and varied contextual effects
836that occur in natural speech.  This by itself is a challenging problem.
837.pp
838The performance of the Kurzweil reading machine is not good.  While it seems
839to be true that some blind people can make use of it, it is far from
840comprehensible to an untrained listener.  For example,
841it will miss out words and even whole phrases, hesitate in a
842stuttering manner, blatantly mis-pronounce many words, fail to detect
843"e"s which should be silent, and give completely wrong rhythms
844to words, making them impossible to understand.
845Its intonation is decidedly unnatural, monotonous, and often downright
846misleading.  When it reads completely new text to people unfamiliar with its
847quirks,
848they invariably fail to understand more than an odd word here and there,
849and do not improve significantly when the text is repeated more than once.
850Naturally performance improves if the material is familiar or expected
851in some way.
852One useful feature is the machine's ability to spell out difficult words
853on command from the user.
854.pp
I do not wish to denigrate the Kurzweil machine, which is a remarkable
achievement in that it integrates many different advanced technologies;
but there is no doubt that the state of the art in speech synthesis
directly from unadorned text is, at present, extremely primitive.
859It is vital not to overemphasize the potential usefulness of abysmal speech,
860which takes a great deal of training on the part of the user before
861it becomes at all intelligible.  To make a rather extreme analogy,
862Morse code could be used as
863audio output, requiring a great deal of training, but capable of being understood
864at quite high rates by an expert.
865It could be generated very cheaply.
866But clearly the man in the street would find it quite unacceptable as
867an audio output medium, because of the excessive effort required to learn to use
868it.  In many applications, very bad synthetic speech is just as useless.
869However, the issue is complicated by the fact that for people who use
870synthesizers regularly, synthetic speech becomes quite easily comprehensible.
871We will return to the problem of evaluating the quality of artificial speech
872later in the book (Chapter 8).
873.sh "1.7  System considerations for speech output"
874.pp
875Fortunately, very many of the applications of speech output from computers
876do not need to read unadorned text.
877In all the example systems described above (except the reading machine),
878it is enough to be able to store utterances in some representation which can
879include pre-programmed cues for pronunciation, rhythm, and intonation in
880a much more explicit way than ordinary text does.
881.pp
882Of course, techniques
883for storing audio information have been in use for decades.
884For example, a domestic cassette tape recorder stores speech at much better
885than telephone quality at very low cost.  The method of direct
886recording of an analogue waveform is currently used for announcements in
887the telephone network to provide information such as the time, weather
888forecasts, and even bedtime stories.
889However, it is difficult to provide rapid access to messages stored in
890analogue form, and although some computer peripherals which use analogue
891recordings for voice-response applications have been marketed \(em they are
892discussed briefly at the beginning of Chapter 3 \(em they have been
893superseded by digital storage techniques.
894.pp
895Although direct storage of a digitized audio waveform is used in some
896voice-response systems, the approach has certain limitations.  The most
897obvious one is the large storage requirement:  suitable coding can reduce
898the data-rate of speech to as little as one hundredth of that needed by
899direct digitization, and textual representations reduce it by another factor
900of ten or twenty.  (Of course, the speech quality is inevitably compromised
901somewhat by data-compression techniques.)  However, the cost of storage is
902dropping so fast that this is not necessarily an overriding factor.
903A more fundamental limitation is that utterances stored directly cannot sensibly
904be modified in any way to take account of differing contexts.
905.pp
906If the results of certain kinds of analyses
907of utterances are stored, instead of simply the digitized waveform,
908a great deal more flexibility can be gained.
909It is possible to separate out the features of intonation and amplitude from
910the articulation of the speech, and this raises the attractive possibility
911of regenerating utterances with pitch contours different from those with which they were
912recorded.
913The primary analysis technique used for this purpose is
914.ul
915linear prediction
916of speech, and this is treated in some detail in Chapter 6.  It also reduces drastically the
917data-rate of speech, by a factor of around 50.
918It is likely that many voice-response systems in the short- and medium-term
919future will use linear predictive representations for utterance storage.
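.pp
It is worth setting the data rates of these representations side by side.
The short calculation below takes 8-bit samples at an 8\ kHz rate \(em an
assumption made for the comparison \(em as the reference for direct
telephone-quality digitization, and applies the reduction factors quoted
in this chapter.
.LB
.nf
# Rough comparison of storage rates for the representations discussed above.
direct = 8 * 8000                   # bit/s: 8-bit samples at 8 kHz (assumed)
coded_waveform = direct // 100      # "as little as one hundredth"
linear_prediction = direct // 50    # a factor of around 50
phonetic_text = 75                  # synthesis by rule (Section 1.3)

for name, rate in [("direct digitization", direct),
                   ("heavily coded waveform", coded_waveform),
                   ("linear prediction", linear_prediction),
                   ("phonetic text", phonetic_text)]:
    per_minute = rate * 60 // 8     # bytes per minute of speech
    print(f"{name:22s} {rate:6d} bit/s  {per_minute:7d} bytes per minute")
.fi
.LE
A minute of speech thus drops from nearly half a megabyte when stored
directly to around ten kilobytes under linear prediction, and to a few
hundred bytes in phonetic textual form.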
920.pp
921For maximum flexibility, however, it is preferable to store a textual
922representation of the utterance.
923There is an important distinction between speech
924.ul
925storage,
926where an actual human utterance is recorded, perhaps processed to lower
927the data-rate, and stored for subsequent regeneration when required,
928and speech
929.ul
930synthesis,
931where the machine produces its own individual utterances which are not based
932on recordings of a person saying the same thing.  The difference is summarized
933in Figure 1.5.
934.FC "Figure 1.5"
935In both cases something is stored:  for the first it is
936a direct representation of an actual human utterance, while for the second
937it is a typed
938.ul
939description
940of the utterance in terms of the sounds, or phonemes, which constitute it.
941The accent and tone of voice of the human speaker will be apparent in
942the stored speech output, while for synthetic speech the accent is the
943machine's and the tone of voice is determined by the synthesis program.
944.pp
945Probably the most attractive representation of utterances in man-machine
946systems is ordinary English text, as used by the Kurzweil reading machine.
947But, as noted above, this poses extraordinarily difficult problems for the
948synthesis procedure, and these inevitably result in severely degraded speech.
949Although in the very long term these problems may indeed be solved,
950most speech output systems can adopt as their representation of an utterance
951a description of it which explicitly conveys the difficult features of
952intonation, rhythm, and even pronunciation.
953In the kind of applications described above (barring the reading machine),
954input will be prepared by a
955programmer as he builds the software system which supports the interactive
956dialogue.
957Although it is important that the method of specifying utterances be easily
958learned, it is not necessary that plain English
be used.  It should be simple for the programmer to enter new
960utterances and modify them on-line in cut-and-try attempts to render the
961man-machine dialogue as natural as possible.  A phonetic input
962can be quite adequate for this, especially if the system allows the
963programmer to hear immediately the synthesized version of the message
964he types.  Furthermore, markers which indicate rhythm and intonation can
965be added to the message so that the system does not have to deduce these features
966by attempting to "understand" the plain text.
967.pp
968This brings us to another disadvantage of speech storage as compared with
969speech synthesis.  To provide utterances for a voice response system using
stored human speech, one must assemble special input hardware,
971a quiet room, and (probably) a dedicated computer.  If the speech is to be
972heavily encoded, either expensive special hardware is required or the encoding
973process, if performed by software on a general-purpose computer, will take
974a considerable length of time (perhaps hundreds of times real-time).  In
975either case, time-consuming editing of the speech will be necessary, with
976follow-up recordings to clarify sections of speech which turn out to be
977unsuitable or badly recorded.  If at a later date the voice response
978system needs modification, it will be necessary to recall the same speaker,
979or re-record the entire utterance set.  This discourages the application
980programmer from adjusting his dialogue in the light of experience.
981Synthesizing from a textual representation, on the other hand, allows him
982to change a speech prompt as simply as he could a VDU one, and evaluate
983its effect immediately.
984.pp
985We will return to methods of digitizing and compacting speech in Chapters 3
986and 4, and carry on to consider speech synthesis in subsequent chapters.
987Firstly, however, it is necessary to take a look at what speech is and how
988people produce it.
989.sh "1.8  References"
990.LB "nnnn"
991.[
992$LIST$
993.]
994.LE "nnnn"
995.sh "1.9  Further reading"
996.pp
997There are remarkably few general books on speech output, although a
998substantial specialist literature exists for the subject.
999In addition to the references listed above, I suggest that you look
1000at the following.
1001.LB "nn"
1002.\"Ainsworth-1976-1
1003.]-
1004.ds [A Ainsworth, W.A.
1005.ds [D 1976
1006.ds [T Mechanisms of speech recognition
1007.ds [I Pergamon
1008.nr [T 0
1009.nr [A 1
1010.nr [O 0
1011.][ 2 book
1012.in+2n
1013A nice, easy-going introduction to speech recognition, this book covers
1014the acoustic structure of the speech signal in a way which makes
1015it useful as background reading for speech synthesis as well.
It complements Lea, 1980, cited above, which presents more recent results
1017in greater depth.
1018.in-2n
1019.\"Flanagan-1973-2
1020.]-
1021.ds [A Flanagan, J.L.
1022.as [A " and Rabiner, L.R. (Editors)
1023.ds [D 1973
1024.ds [T Speech synthesis
1025.ds [I Wiley
1026.nr [T 0
1027.nr [A 0
1028.nr [O 0
1029.][ 2 book
1030.in+2n
1031This is a collection of previously-published research papers on speech
1032synthesis, rather than a unified book.
1033It contains many of the classic papers on the subject from 1940\ -\ 1972,
1034and is a very useful reference work.
1035.in-2n
1036.\"LeBoss-1980-3
1037.]-
1038.ds [A LeBoss, B.
1039.ds [D 1980
1040.ds [K *
1041.ds [T Speech I/O is making itself heard
1042.ds [J Electronics
1043.ds [O May\ 22
1044.ds [P 95-105
1045.nr [P 1
1046.nr [T 0
1047.nr [A 1
1048.nr [O 0
1049.][ 1 journal-article
1050.in+2n
1051The magazine
1052.ul
1053Electronics
1054is an excellent source of up-to-the-minute news, product announcements,
1055titbits, and rumours in the commercial speech technology world.
1056This particular article discusses the projected size of the voice
1057output market and gives a brief synopsis of the activities of several
1058interested companies.
1059.in-2n
1060.\"Witten-1980-5
1061.]-
1062.ds [A Witten, I.H.
1063.ds [D 1980
1064.ds [T Communicating with microcomputers
1065.ds [I Academic Press
1066.ds [C London
1067.nr [T 0
1068.nr [A 1
1069.nr [O 0
1070.][ 2 book
1071.in+2n
1072A recent book on microcomputer technology, this is unusual in that
1073it contains a major section on speech communication
1074with computers (as well as ones
1075on computer buses, interfaces, and graphics).
1076.in-2n
1077.LE "nn"
1078.EQ
1079delim $$
1080.EN
1081.CH "2  WHAT IS SPEECH?"
1082.ds RT "What is speech?
1083.ds CX "Principles of computer speech
1084.pp
1085People speak by using their vocal cords as a sound source, and making rapid
1086gestures of the articulatory organs (tongue, lips, jaw, and so on).
1087The resulting changes in shape of the vocal tract allow production
1088of the different sounds that we know as the vowels and consonants of
1089ordinary language.
1090.pp
1091What is it necessary to learn about this process for the purposes of
1092speech output from computers?
1093That depends crucially upon how speech is represented in the system.
1094If utterances are stored as time waveforms \(em and this is what we will be
1095discussing in the next chapter \(em the structure of speech is not important.
1096If frequency-related parameters of particular natural utterances are
1097stored, then it is advantageous to take into account some of the
1098acoustic properties of the speech waveform.
1099.pp
1100This point can be brought into focus by contrasting the transmission
1101(or storage) of speech with that of real-life television pictures,
1102as has been proposed for a videophone service.
1103Massive data reductions, of the order of 50:1, can be achieved for speech,
1104using techniques that are described in later chapters.  For pictures,
1105data reduction is still an important issue \(em even more so for the
1106videophone than for the telephone, because of the vastly higher
1107information rates involved.
1108Unfortunately, the potential for data reduction is much
1109smaller \(em nothing like the 50:1 figure quoted above.
1110This is because speech sounds have definite characteristics, imparted
1111by the fact that they are produced by a human vocal tract, which
1112can be exploited for data reduction.
1113Television pictures have no equivalent generative structure, for
1114they show just those things that the camera points at.
1115.pp
1116Moving up from frequency-related parameters of
1117.ul
1118particular
1119utterances, it
1120is possible to store such parameters in a
1121.ul
1122general
1123form which characterizes the sound segments that appear in spoken language.
1124This immediately raises the issue of
1125.ul
1126classification
1127of sound segments, to form a basis for storing generalized acoustic
1128information and for retrieval of the information needed to synthesize
1129any particular utterance.
1130Speech is by nature continuous, and any synthesis system based upon
1131discrete classification must come to terms with this by tackling
1132the problems of transition from one segment to another,
1133and local modification of sound segments as a function of their context.
1134.pp
1135This brings us to another level of representation.
1136So far we have talked of the
1137.ul
1138acoustic
1139nature of speech, but when we have to cope with transitions between
1140discrete sound segments it may be fruitful to consider
1141.ul
1142articulatory
1143properties as well.
1144Any model of the speech production process
1145is in effect a model of the articulatory process that generates the speech.
1146Some speech research is concerned with
1147modelling
1148the vocal tract directly, rather than modelling the acoustic output from it.
1149One might specify, for example, position of tongue and posture of jaw and lips
1150for a vowel, instead of giving frequency-related
1151characteristics of it.  This is a potent
1152tool in linguistic research, for it brings one closer to human production of
1153speech \(em in particular to the connection between brain and articulators.
1154.pp
1155Articulatory
1156synthesis holds a promise of high-quality speech, for the transitional
1157effects caused by tongue and jaw inertia can be modelled directly.
1158However, this potential has
1159not yet been realized.
1160Speech from current articulatory models is of much poorer quality than
1161that from acoustically-based synthesis methods.
1162The major problem is in gaining data about articulatory
1163behaviour during running speech \(em it is much easier to perform acoustic
1164analysis on the resulting sound than it is to examine the vocal organs in
1165action.  Because of this, the subject is not treated in this book.
1166We will only look at articulatory properties insofar as they help us
1167to understand, in a qualitative way, the acoustic nature of speech.
1168.pp
1169Speech, however, is much more than mere articulation.
1170Consider \(em admittedly a rather extreme and chauvinistic example \(em the
1171number of ways a girl can say "yes".
1172Breathy voice, slow tempo, low pitch \(em these are all characteristics which
1173affect the utterance as a whole, rather than being classifiable into
1174individual sound segments.  Linguists call them "prosodic" or
1175"suprasegmental" features, for they relate to overall aspects of the
1176utterance, and distinguish them from "segmental" ones which concern
1177the articulation of individual segments of syllables.
1178The most important prosodic features are pitch, or fundamental frequency
1179of the voice, and rhythm.
1180.pp
1181This chapter provides a brief introduction to the nature of the speech
1182signal.  Depending upon what speech output techniques we use, it may be
1183necessary to understand something of the acoustic nature of the speech
1184signal; the system that generates it (the vocal tract); commonly-used
1185classifications of sound segments; and the prosodic aspects of speech.
1186This material is little used in the early chapters of the book, but
1187becomes increasingly important as the story unfolds.
1188Hence you may skip the remainder of this chapter if you wish, but
1189should return to it later to pick up more background whenever it
1190becomes necessary.
1191.sh "2.1  The anatomy of speech"
1192.pp
1193The so-called "voiced" sounds of speech \(em like the sound you make when
1194you say "aaah" \(em are produced by passing air up from the lungs through
1195the larynx or voicebox, which is situated just behind the Adam's apple.
1196The vocal tract from the larynx to the lips acts as a resonant cavity,
1197amplifying certain frequencies and attenuating others.
1198.pp
1199The waveform generated by the larynx, however, is not simply sinusoidal.
1200(If it were, the vocal tract resonances would merely
1201give a sine wave of the same frequency but amplified or
1202attenuated according to how close it was to the nearest resonance.)  The
1203larynx contains two folds of skin \(em the vocal cords \(em which blow apart and flap
1204together again in each cycle of the pitch period.
1205The pitch of a male voice in speech varies from as low as 50\ Hz
1206(cycles per second) to perhaps
1207250\ Hz, with a typical median value of 100\ Hz.
1208For a female voice the range is higher, up to about 500\ Hz in speech.
1209Singing can go much higher:  a top C sung by a soprano has a frequency
1210of just over 1000\ Hz, and some opera singers can reach
1211substantially higher than this.
1212.pp
1213The flapping action of the vocal cords
1214gives a waveform which can be approximated by a
1215triangular pulse (this and other approximations will be discussed in
1216Chapter 5).
1217It has a rich spectrum of harmonics,
1218decaying at around 12\ dB/octave, and each harmonic is affected
1219by the vocal tract resonances.
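.pp
The 12\ dB/octave figure follows from the idealization of the glottal
waveform as a triangular pulse: the harmonic amplitudes of an ideal
triangular wave fall off roughly as the inverse square of the harmonic
number, so doubling the frequency divides the amplitude by four.
The following short Python fragment is no more than a sketch of this
arithmetic, under that idealized assumption.
.LB
.nf
import math

# Idealized triangular glottal pulse: harmonic amplitudes fall as 1/k**2,
# so one octave (a doubling of k) drops the amplitude by a factor of 4.
amp = lambda k: 1.0 / k ** 2
octave_drop_db = 20 * math.log10(amp(1) / amp(2))
print(round(octave_drop_db, 1), "dB per octave")   # prints 12.0
.fi
.LE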
1220.rh "Vocal tract resonances."
1221A simple model of the vocal tract is an organ-pipe-like cylindrical tube
1222(Figure 2.1),
1223with a sound source at one end (the larynx) and open at the other (the lips).
1224.FC "Figure 2.1"
1225This has resonances at wavelengths $4L$, $4L/3$, $4L/5$, ..., where $L$
1226is the length of the tube;
1227and these correspond to frequencies $c/4L$, $3c/4L$, $5c/4L$, ...\ Hz, $c$
1228being the speed of
1229sound in air.
1230Calculating these frequencies, using a typical figure for the
1231distance between larynx and lips of 17\ cm,
1232and $c = 340$\ m/s for the speed of sound, leads to resonances at
1233approximately 500\ Hz, 1500\ Hz, 2500\ Hz, ... .
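.pp
These figures are easy to recompute for other assumptions about tract length.
As a small illustrative sketch (the numbers are the nominal ones used above,
not measurements), the following Python fragment evaluates the quarter-wave
resonance formula.
.LB
.nf
# Resonances of a uniform tube closed at the larynx and open at the lips:
# f_n = (2n - 1) * c / (4 * L).  Nominal figures, for illustration only.

def tube_resonances(length_m=0.17, c=340.0, count=3):
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]

print(tube_resonances())        # [500.0, 1500.0, 2500.0]
print(tube_resonances(0.14))    # a shorter tract gives higher resonances
.fi
.LE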
1234.pp
1235When excited by the harmonic-rich waveform of the larynx,
1236the vocal tract resonances produce
1237peaks known as
1238.ul
1239formants
1240in the energy spectrum of the speech wave (Figure 2.2).
1241.FC "Figure 2.2"
1242The lowest formant, called formant one, varies from around 200\ Hz
1243to 1000\ Hz during speech, the exact range depending on the size
1244of the vocal tract.
1245Formant two varies from around 500 to 2500\ Hz, and formant three
1246from around 1500 to 3500\ Hz.
1247.pp
1248You can easily hear the lowest formant by whispering the vowels in
1249the words "heed", "hid", "head", "had", "hod", "hawed", and "who'd".
1250They appear to have a steadily descending pitch, yet since you are
1251whispering there is no fundamental frequency.
1252What you hear is the lowest resonance of the vocal tract \(em formant one.
1253Some masochistic people can play simple tunes with this formant by putting
1254their mouth in successive vowel shapes and knocking the top of their head
1255with their knuckles \(em hard!
1256.pp
1257A difficulty occurs when trying to identify the lower formants for speakers
1258with high-pitched voices.
1259When a formant frequency falls below the fundamental excitation frequency
1260of the voice, its effect is diminished \(em although it is still present.
1261The vibrato used by opera singers provides a very low-frequency excitation
1262(at the vibrato rate) which helps to illuminate the lower formants even
1263when the pitch of the voice is very high.
1264.pp
1265Of course, speech is not a static phenomenon.
1266The organ-pipe model describes the speech spectrum during a continuously
1267held vowel with the mouth in a neutral position such as for "aaah".
1268But in real speech the tongue and lips are in continuous motion,
1269altering the shape of the vocal tract and hence the positions of the resonances.
1270It is as if the organ-pipe were being squeezed and expanded in
1271different places all the time.
1272Say
1273.ul
1274ee
1275as in "heed" and feel how close your tongue is to the roof of your mouth,
1276causing a constriction near the front of the vocal cavity.
1277.pp
1278Linguists and speech engineers use a special frequency analyser called a
1279"sound spectrograph" to make a three-dimensional plot of the variation
1280of the speech energy spectrum with time.
1281Figure 2.3 shows a spectrogram of the
1282utterance "go away".
1283.FC "Figure 2.3"
1284Frequency is given on the vertical axis,
1285and bands are shown at the beginning to indicate the scale.
1286Time is plotted horizontally,
1287and energy is given by the darkness of any particular area.
1288The lower few formants can be seen as dark bands extending horizontally,
1289and they are in continuous motion.
1290In the neutral first vowel of "away", the formant frequencies
1291pass through
1292approximately the 500\ Hz, 1500\ Hz, and 2500\ Hz that we calculated earlier.
1293(In fact, formants two and three are somewhat lower than these values.)
1294.pp
1295The
1296fine vertical striations in the spectrogram correspond to single openings of the vocal cords.
1297Pitch changes continuously throughout an utterance,
1298and this can be seen on the spectrogram by the differences in spacing
1299of the striations.
1300Pitch change, or
1301.ul
1302intonation,
1303is singularly important in
1304lending naturalness to speech.
1305.pp
1306On a spectrogram, a continuously held vowel shows up as a static energy spectrum.
1307But beware \(em what we call a vowel in everyday language is not the same thing as a
1308"vowel" in phonetic terms.
1309Say "I" and feel how the tongue moves continuously while you're speaking.
1310Technically, this is a
1311.ul
1312diphthong
1313or slide between two vowel positions,
1314and not a single vowel.
1315If you say
1316.ul
1317ar
1318as in "hard",
1319and change slowly to
1320.ul
1321ee
1322as in "heed", you will obtain a diphthong not unlike that in "I".
1323And there are many more phonetically different vowel sounds
1324than the a, e, i, o, and u that we normally think of.
1325The words "hood" and "mood" have different vowels, for example, as do "head" and "mead".
1326The principal acoustic difference between the various vowel sounds
1327is in the frequencies of the first two formants.
1328.pp
1329A further complication is introduced by the nasal tract.  This is
1330a large cavity which is coupled to the oral tract by a passage at the
1331back of the mouth.
1332The passage is guarded by a flap of skin called the "velum".
1333You know about this because inadvertent opening of the velum while
1334swallowing causes food or drink to go up your nose.
1335The nasal cavity is switched in and out of the vocal tract
1336by the velum during speech.
1337It is used for consonants
1338.ul
1339m,
1340.ul
1341n,
1342and the
1343.ul
1344ng
1345sound in the word
1346"singing".
1347Vowels are frequently nasalized too.
1348A very effective demonstration of the amount of nasalization in ordinary
1349speech can be obtained by cutting a nose-shaped hole in a large
1350baffle which divides a room, speaking normally with one's nose in the hole,
1351and having someone listen on the other side.
1352The frequency of occurrence of
1353nasal sounds, and the volume of sound that is emitted
1354through the nose, are both surprisingly large.
1355Interestingly enough, when we say in conversation that someone sounds
1356"nasal", we usually mean "non-nasal".  When the nasal passages are
1357blocked by a cold, nasal sounds are missing \(em
1358.ul
1359n\c
1360\&'s turn into
1361.ul
1362d\c
1363\&'s,
1364and
1365.ul
1366m\c
1367\&'s to
1368.ul
1369b\c
1370\&'s.
1371.pp
1372When the nasal cavity is switched in to the vocal tract, it introduces
1373formant resonances, just as the oral cavity does.
1374Although we cannot
1375alter the shape of the nasal tract significantly, the nasal formant
1376pattern is not fixed, because the oral tract does play a part in nasal
1377resonances.
1378If you say
1379.ul
1380m,
1381.ul
1382n,
1383and
1384.ul
1385ng
1386continuously, you can hear the difference and feel how it is produced by
1387altering the combined nasal/oral tract resonances with your tongue position.
1388The nasal cavity operates in parallel with
1389the oral one:  this causes the two resonance patterns to be summed
1390together, with resulting complications which will be discussed in Chapter 5.
1391.rh "Sound sources."
1392Speech involves sounds other than those caused by regular vibration of
1393the larynx.
1394When you whisper, the folds of the larynx are held slightly
1395apart so that the air passing between them becomes turbulent, causing a noisy excitation
1396of the resonant cavity.
1397The formant peaks are still present, superimposed on the noise.  Such
1398"aspirated" sounds occur in the
1399.ul
1400h
1401of "hello", and for a very short time
1402after the lips are opened at the beginning of "pit".
1403.pp
1404Constrictions made in the mouth produce hissy noises such as
1405.ul
1406ss,
1407.ul
1408sh,
1409and
1410.ul
1411f.
1412For example, in
1413.ul
1414ss
1415the tip of the tongue is high up,
1416very close to the roof of the mouth.
1417Turbulent air passing through this constriction causes a
1418random noise excitation, known as "frication".
1419Actually, the roof of the mouth is quite a complicated object.
1420You can feel with your tongue a bony hump or ridge just behind the front
1421teeth, and it is this that forms a constriction with the tongue for
1422.ul
1423s.
1424In
1425.ul
1426sh,
1427the tongue is flattened close to the roof of the mouth slightly farther back,
1428in a position rather similar to that for
1429.ul
1430ee
1431but with a narrower
1432constriction,
1433while
1434.ul
1435f
1436is produced with the upper teeth and lower lip.
1437Because they are made near the front of the mouth,
1438the resonances of the vocal tract have little effect on these fricative
1439sounds.
1440.pp
1441To distinguish them from aspiration and frication, the ordinary speech
1442sounds (like "aaah") which have their source in larynx vibration are
1443known technically as "voiced".  Aspirated and fricative sounds are called
1444"unvoiced".  Thus the three different sound types can be classified as
1445.LB
1446.NP
1447voiced
1448.NP
1449unvoiced (fricative)
1450.NP
1451unvoiced (aspirated).
1452.LE
1453Can any of these three types occur together?
It would seem that voicing and aspiration cannot, for the former requires
the larynx to be vibrating regularly, whereas for the latter it must be
generating turbulent noise.
1457However, there is a condition known technically as "breathy voice"
1458which occurs when the vocal cords are slightly apart, still vibrating,
1459but with a large volume of air passing between to create turbulence.
1460Voicing can easily occur in conjunction with frication.
1461Corresponding to
1462.ul
1463s,
1464.ul
1465sh,
1466and
1467.ul
1468f
1469we get the
1470.ul
1471voiced
1472fricatives
1473.ul
1474z,
1475the sound in the middle of words like "vision" which I will call
1476.ul
1477zh,
1478and
1479.ul
1480v.
1481A simple illustration of voicing is to say "ffffvvvvffff\ ...".
1482During the voiced part you can feel the larynx vibrations with a finger
1483on your Adam's apple, and it can be heard quite clearly if you stop up
1484your ears.
1485Technically, there is nothing to prevent frication and aspiration
1486from occurring together \(em they do, for example, when a voiced fricative
1487is whispered \(em but the combination is not an important one.
1488.pp
1489The complicated acoustic effects of noisy excitations in speech can be
1490seen in the spectrogram in Figure 2.4 of
1491"high altitude jets whizz past screaming".
1492.FC "Figure 2.4"
1493.rh "The source-filter model of speech production."
1494We have been talking in terms of a sound source (be it voiced or unvoiced)
exciting the resonances of the oral (and possibly the nasal) tract.
1496This model, which is used extensively in speech analysis and synthesis,
1497is known as
1498the source-filter model of speech production.  The reason for its success
1499is that the effect of the resonances can be modelled as a frequency-selective
1500filter, operating on an input which is the source excitation.
1501Thus the frequency spectrum of the source is modified by multiplying it
1502by the frequency characteristic of the filter (or adding it, if amplitudes
1503are expressed logarithmically).
1504This can be seen in Figure 2.5, which shows a source
1505spectrum and filter characteristic which combine to give the overall
1506spectrum of Figure 2.2.
1507.FC "Figure 2.5"
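.pp
As a purely numerical sketch of this combination (the amplitudes below are
invented for illustration and are not taken from Figure 2.5), the following
Python fragment multiplies a source spectrum by a filter characteristic and
confirms that, expressed in decibels, the two simply add.
.LB
.nf
import math

# Toy source-filter combination; all values are invented for illustration.
source_amp  = [1.0, 0.5, 0.25, 0.125]   # source amplitudes at some frequencies
filter_gain = [2.0, 8.0, 1.0, 4.0]      # filter (resonance) gain at the same frequencies

def db(x):
    return 20.0 * math.log10(x)

for s, g in zip(source_amp, filter_gain):
    out = s * g                                   # multiply amplitude spectra ...
    assert abs(db(s) + db(g) - db(out)) < 1e-9    # ... or add their dB values
    print("%6.1f dB + %6.1f dB = %6.1f dB" % (db(s), db(g), db(out)))
.fi
.LE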
1508.pp
1509Although, as mentioned above, the various fricatives are not subjected
1510to the resonances of the vocal tract to the same extent
1511that voiced and aspirated
1512sounds are, they can still be modelled as a noise source followed by
1513a filter to give them their different sound qualities.
1514.pp
1515The source-filter model is an oversimplification of the actual speech
1516production system.  There is inevitably some coupling between the vocal
1517tract and the lungs, through the glottis, during the period when
1518it is open.  This effectively makes the filter characteristics
1519change during each individual cycle of the excitation.
1520However, although the effect is of interest to speech researchers,
1521it is probably not of great significance for practical speech output.
1522.pp
1523One very interesting implication of the
1524source-filter model is that the prosodic features of
1525pitch and amplitude are largely properties of the source; while
1526segmental ones are introduced by the filter.  This makes it possible to
1527separate some aspects of
1528overall prosody from the actual segmental content of an
1529utterance, so that, for example, a human utterance can be stored initially
1530and then spoken by a machine with a variety of different intonations.
1531.sh "2.2  Classification of speech sounds"
1532.pp
1533The need to classify sound segments as a basis for storing generalized acoustic
1534information and retrieving it was mentioned earlier.  There is a real
1535difficulty here because speech is by nature continuous and classifications are
1536discrete.
1537It is important to remember this difficulty because it is all too easy
1538to criticize the complex and often confusing attempts of linguists to
1539tackle the classification task.
1540.pp
1541Linguists call a written representation of the
1542.ul
1543sounds
1544of an utterance a "phonetic
1545transcription" of it.  The same utterance can be transcribed at
1546different levels of detail:  simple transcriptions are called "broad"
1547and more specific ones are called "narrow".
1548Perhaps the most logically satisfying kind of transcription employs units
1549termed "phonemes".  This is the broadest transcription,
1550and is sometimes called a
1551.ul
1552phonemic
transcription to emphasize that it is in terms of phonemes.
1554Unfortunately, the word "phoneme" is often used somewhat loosely.
1555In its true sense, a phoneme is a
1556.ul
1557logical
1558unit, rather than a physical, acoustic, one,
1559and is defined in relation to a particular language by reference
1560to its use in discriminating different words.
1561Classifications of sounds which are based on their
1562semantic
1563role as word-discriminators are called
1564.ul
1565phonological
1566classifications:  we could ensure that there is no ambiguity in the sense
1567with which we use the term "phoneme" by calling it a phonological unit, and
1568the phonemic transcription could be called a phonological one.
1569.rh "Broad phonetic transcription."
1570A phoneme is an abstract unit representing a set of different sounds.
1571The issue is confused by the fact that the members of the set actually
1572sound very similar, if not identical, to the untrained ear \(em precisely because
1573the difference between them plays no part in distinguishing words from
1574each other in the particular language concerned.
1575.pp
1576Take the words "key" and "caw", for example.  Despite the difference in
1577spelling, both of them begin with a
1578.ul
1579k
1580sound that belongs (in English)
1581to the same phoneme set, called
1582.ul
1583k.
1584However, say them two or three times each, concentrating on the position of
1585the tongue during the
1586.ul
1587k.
1588It is quite different in each case.  For "key", it
1589is raised, close to the roof of the mouth, in preparation for the
1590.ul
1591ee,
1592whereas in "caw" it is much lower down.
1593The sound of the
1594.ul
1595k
1596is actually quite different in the two cases.
1597Yet they belong to the same phoneme, for there is no pair of words which
1598relies on this difference to distinguish them \(em "key" and "caw" are
1599obviously distinguished by their vowels, not by the initial
1600consonant.
1601You probably cannot hear clearly the difference between the two
1602.ul
1603k\c
1604\&'s,
1605precisely because they belong to the same phoneme and so the difference
1606is not important (for English).
1607.pp
1608The point is sharpened by considering another language where we make a
1609distinction \(em and hence can hear the difference \(em between two sounds
1610that belong, in the language, to the same phoneme.
1611Japanese does not distinguish
1612.ul
1613r
1614from
1615.ul
1616l.
1617Japanese people
1618.ul
1619do not hear
1620the difference between "lice" and "rice", in the same way that you do
1621not hear the difference between the two
1622.ul
1623k\c
1624\&'s above.
1625Cockneys do not hear, except with a special effort, the difference
1626between "has" and "as", or "haitch" and "aitch", for the Cockney dialect
1627does not recognize initial
1628.ul
1629h\c
1630\&'s.
1631.pp
1632So what is a phoneme?  It is a set of sounds whose members do not
1633discriminate between any words in the language under consideration.
1634If you are mathematically minded you could think of it as an equivalence
1635class of sounds, determined by the relationship
1636.LB
1637$sound sub 1$ is related to $sound sub 2$ if $sound sub 1$ and $sound sub 2$
1638do not discriminate any pair of words in the language.
1639.LE
1640The
1641.ul
1642p
1643and
1644.ul
1645d
1646in
1647"pig" and "dig" belong to different phonemes (in English),
1648because they discriminate
1649the two words.
1650.ul
1651b,
1652.ul
1653f,
1654and
1655.ul
1656j
1657belong to different phonemes again.
1658.ul
1659i
1660and
1661.ul
1662a
1663in "hid" and "had" belong to different phonemes too.
1664Proceeding like this, a list of phonemes can be drawn up.
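.pp
For the computationally minded, the minimal-pair reasoning behind this
procedure is easy to mimic.  The toy Python sketch below (its little
"lexicon" of broad transcriptions is invented purely for illustration,
not a serious phonemic analysis) looks for pairs of words that differ in
exactly one sound; the two sounds concerned must then belong to different
phonemes.  Carried out over a whole dictionary, the procedure yields the
list of phonemes.
.LB
.nf
# Toy minimal-pair finder; the lexicon is invented for illustration only.
lexicon = {
    "pig": ("p", "i", "g"),
    "dig": ("d", "i", "g"),
    "hid": ("h", "i", "d"),
    "had": ("h", "aa", "d"),
}

def minimal_pairs(words):
    # Yield (word1, word2, sound1, sound2) for words differing in one position.
    items = list(words.items())
    for i, (w1, t1) in enumerate(items):
        for w2, t2 in items[i + 1:]:
            if len(t1) != len(t2):
                continue
            diffs = [(a, b) for a, b in zip(t1, t2) if a != b]
            if len(diffs) == 1:
                yield w1, w2, diffs[0][0], diffs[0][1]

for w1, w2, s1, s2 in minimal_pairs(lexicon):
    print(w1, w2, "show that", s1, "and", s2, "belong to different phonemes")
.fi
.LE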
1665.pp
1666Such a list is shown in Table 2.1, for British English.
1667(The layout of the list does have some significance in terms of different
1668categories of phonemes, which will be explained later.)  In fact,
1669linguists use an
1670assortment of English letters, foreign letters, and special
1671symbols to represent phonemes.  In this book we use one- or two-letter
1672codes, partly because they are more mnemonic, and partly because
1673they are more suitable for communication to computers using standard
1674peripheral devices.
1675They are
1676a direct transliteration of linguists' standard International Phonetic
1677Association symbols.
1678.RF
1679.nr x1 3m+1.0i+0.5i+0.5i+0.5i+\w'y'u
1680.nr x1 (\n(.l-\n(x1)/2
1681.in \n(x1u
1682.ta 3m +1.0i +0.5i +0.5i +0.5i +0.5i +0.5i
1683\fIuh\fR	(the)	\fIp\fR	\fIt\fR	\fIk\fR
1684\fIa\fR	(bud)	\fIb\fR	\fId\fR	\fIg\fR
1685\fIe\fR	(head)	\fIm\fR	\fIn\fR	\fIng\fR
1686\fIi\fR	(hid)
1687\fIo\fR	(hod)	\fIr\fR	\fIw\fR	\fIl\fR	\fIy\fR
1688\fIu\fR	(hood)
1689\fIaa\fR	(had)	\fIs\fR	\fIz\fR
1690\fIee\fR	(heed)	\fIsh\fR	\fIzh\fR
1691\fIer\fR	(heard)	\fIf\fR	\fIv\fR
1692\fIuu\fR	(food)	\fIth\fR	\fIdh\fR
1693\fIar\fR	(hard)	\fIch\fR	\fIj\fR
1694\fIaw\fR	(hoard)	\fIh\fR
1695.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
1696.in 0
1697.FG "Table 2.1 The phonemes of British English"
1698.pp
1699We will discuss the sounds which make up each of these phoneme classes
1700shortly.  First, however, it is worthwhile pointing out some rather
1701tricky points in the definition of these phonemes.
1702.rh "Phonological difficulties."
1703There are snags with phonological classification, as there are
1704in any area where attempts are made to make completely logical
1705statements about human activity.
1706Consider
1707.ul
1708h
1709and the
1710.ul
1711ng
1712in "singing".
1713(\c
1714.ul
1715ng
1716is certainly not an
1717.ul
1718n
1719sound followed by a
1720.ul
1721g
1722sound, although
1723it is true that in some English accents "singing" is rendered with
1724the
1725.ul
1726ng
1727followed by a
1728.ul
1729g
1730at each of its two occurrences.)  No words
1731end with
1732.ul
1733h,
1734and none begin with
1735.ul
1736ng.
1737(Notice that we are still talking about British English.
1738In Chinese, the sound
1739.ul
1740ng
1741is a word in its own right, and is a common
1742family name.
1743But we must stick with one language for phonological classification.)  Hence
1744it follows that there is no pair of words which is distinguished
1745by the difference between
1746.ul
1747h
1748and
1749.ul
1750ng.
1751Technically,
1752they belong to the same phoneme.  However, technical considerations
1753in this case must take second place to common sense!
1754.pp
1755The
1756.ul
1757j
1758in "jig" is another interesting case.  It can be considered
1759to belong to a
1760.ul
1761j
1762phoneme, or to be a sequence of two
1763phonemes,
1764.ul
1765d
1766followed by
1767.ul
1768zh
1769(the sound in "vision").  There is
1770disagreement on this point in phonetics textbooks, and we do not
1771have the time (nor, probably, the inclination!) to consider the
1772pros and cons of this moot point.
1773I have resolved the matter arbitrarily by writing it as a separate
1774phoneme.  The
1775.ul
1776ch
1777in "choose" is a similar case
1778(\c
1779.ul
1780t
1781followed by the
1782.ul
1783sh
1784in "shoes").
1785.pp
1786Another difficulty, this time where Table 2.1 does not show how to
1787distinguish between two sounds which
1788.ul
1789do
1790discriminate words in many people's English, is the
1791.ul
1792w
1793in "witch"
1794and that in "which".  The latter is conventionally transcribed
1795as a sequence of two phonemes,
1796.ul
1797h w.
1798.pp
1799The last few difficulties are all to do with deciding whether a
1800sound belongs to a single phoneme class, or comprises a sequence
1801of sounds each of which belongs to a phoneme.
1802Are the
1803.ul
1804j
1805in "jug", the
1806.ul
1807ch
1808in "chug", and the
1809.ul
1810w
1811in "which",
1812single phonemes or not?  The definition above of a phoneme
1813as a "set of sounds whose members do not discriminate any words
1814in the language" does not help us to answer this question.
1815As far as this definition is concerned, we could go so far as
1816to call each and every word of the language an individual phoneme!
1817It is clear that some acoustic evidence, and quite a lot of judgement,
1818is being used when phonemes such as those of Table 2.1 are defined.
1819.pp
1820So much for the consonants.  This same problem occurs in vowel sounds,
1821particularly in diphthongs, which are sequences of two vowel-like sounds.
1822Do the vowels of "main" and "man" belong to different phonemes?
1823Clearly so, if they are both transcribed as single units, for they
1824distinguish the two words.
1825Notwithstanding the fact that they are sequences of separate sounds,
1826a logically consistent system could be constructed which gave separate,
1827unitary, symbols to each diphthong.
1828However, it is usual to employ a compound symbol which indicates explicitly
1829the character of the two vowel-like sounds involved.
1830We will transcribe the diphthong of "main" as a sequence of two
1831vowels,
1832.ul
1833e
1834(as in "head") and
1835.ul
1836i
1837(as in "hid", not "I").
1838This is done primarily for economy of symbols, choosing the constituent
1839sounds on the basis of the closest match to existing vowel sounds.
1840(Note that this again violates purely
1841.ul
1842logical
1843criteria for identifying phonemes.)
1844.rh "Categories of speech sounds."
A phoneme is defined as a set of sounds whose members do not discriminate
1846between any words in the language under consideration.
1847The phonemes themselves can be classified into groups which reflect
1848similarities between them.
1849This can be done in many different ways, using various criteria
1850for classification.  In fact, one branch of linguistic research
1851is concerned with defining a set of "distinctive
1852features" such that a phoneme class is uniquely identified by
1853the values of the features.  Distinctive features are binary,
1854and include such things as voiced\(emunvoiced, fricative\(emnot\ fricative,
1855aspirated\(emunaspirated.  We will not be concerned here with such
1856detailed classifications, but it is as well to know that they exist.
1857.pp
1858There is an everyday distinction between vowels and consonants.
1859A vowel forms the nucleus of every syllable, and one or more consonants
1860may optionally surround the vowel.
1861But the distinction sometimes becomes a little ambiguous.
1862Syllables like
1863.ul
1864sh
1865are commonly uttered and certainly do not
1866contain a vowel.  Furthermore, when we say "vowel" in everyday
1867language we usually refer to the
1868.ul
1869written
1870vowels a, e, i, o, and u; there are many more vowel sounds.
1871A vowel in orthography is different to a vowel as a phoneme.
1872Is a diphthong a phonetic vowel?  \(em certainly, by the syllable-nucleus
1873criterion; but it is a little different from ordinary vowels because
1874it is a changing sound rather than a constant one.
1875.pp
1876Table 2.2 shows one classification of the phonemes of Table 2.1, which
1877will be useful in our later studies of speech synthesis from phonetics.
1878It shows twelve vowels, including the rather peculiar one
1879.ul
1880uh
1881(which corresponds to the first vowel in the word "above").
1882This is the sound produced by the vocal tract when it is in a relaxed,
1883neutral position; and it never occurs in prominent, stressed,
1884syllables.  The vowels later in the list are almost always longer
1885than the earlier ones.  In fact, the first six
1886(\c
1887.ul
1888uh, a, e, i, o, u\c
1889)
1890are often called "short" vowels, and the last five
1891(\c
1892.ul
1893ee, er, uu, ar, aw\c
1894)
1895"long" ones.  The shortness or longness of the one in the middle
1896(\c
1897.ul
1898aa\c
1899)
1900is rather ambiguous.
1901.RF
1902.nr x0 \w'000unvoiced fricative    'u
1903.nr x1 \n(x0+\w'[not classified as individual phonemes]'u
1904.nr x1 (\n(.l-\n(x1)/2
1905.in \n(x1u
1906.ta \n(x0u
1907.fi
1908vowel	\c
1909.ul
1910uh  a  e  i  o  u  aa  ee  er  uu  ar  aw
1911.br
1912diphthong	[not classified as individual phonemes]
1913.br
1914glide (or liquid)	\c
1915.ul
1916r  w  l  y
1917.br
1918stop
1919.br
1920\0\0\0unvoiced stop	\c
1921.ul
1922p  t  k
1923.br
1924\0\0\0voiced stop	\c
1925.ul
1926b  d  g
1927.br
1928nasal	\c
1929.ul
1930m  n  ng
1931.br
1932fricative
1933.br
1934\0\0\0unvoiced fricative	\c
1935.ul
1936s  sh  f  th
1937.br
1938\0\0\0voiced fricative	\c
1939.ul
1940z  zh  v  dh
1941.br
1942affricate
1943.br
1944\0\0\0unvoiced affricate	\c
1945.ul
1946ch
1947.br
1948\0\0\0voiced affricate	\c
1949.ul
1950j
1951.br
1952aspirate	\c
1953.ul
1954h
1955.nf
1956.in 0
1957.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
1958.FG "Table 2.2 Phoneme categories"
1959.pp
1960Diphthongs pose no problem here because we have not classified them
1961as single phonemes.
1962.pp
1963The remaining categories are consonants.  The glides are quite
1964similar to vowels and diphthongs, though; for they are voiced,
1965continuous sounds.  You can say them and prolong them.
1966(This is also true of the fricatives.)
1967.ul
1968r
1969is interesting
1970because it can be realized acoustically in very different ways.
1971Some people curl the tip of the tongue
1972back \(em a so-called retroflex action of the tongue.  Many people
1973cannot do this, and their
1974.ul
1975r\c
1976\&'s sound like
1977.ul
1978w\c
1979\&'s.
1980The stage Scotsman's
1981.ul
1982r
1983is a trill where the tip of the tongue vibrates against the roof of the mouth.
1984.ul
1985l
1986is also
1987slightly unusual, for it is the only English phoneme which is "lateral" \(em
1988air passes either side of it, in two separate passages.  Welsh
1989has another lateral sound, a fricative, which is written "ll" as
1990in "Llandudno".
1991.pp
1992The next category is the stops.  These are formed by stopping up
1993the mouth, so that air pressure builds up behind the lips, and
1994releasing this pressure suddenly.  The result is a little
1995explosion (and the stops are often called "plosives"), which
1996usually creates a very short burst of fricative noise (and, in some cases,
1997aspiration as well).  They are further subdivided into voiced and
1998unvoiced stops, depending upon whether voicing starts as soon as
1999the plosion occurs (sometimes even before) or well after it.
2000If you put your hand in front of your mouth when saying "pit" you
2001can easily feel the puff of air that signals the plosion on the
2002.ul
2003p,
2004and probably on the
2005.ul
2006t
2007as well.
2008.pp
2009In a sense, nasals are really stops as well (and they are often
2010called stops), for the oral tract is blocked although the nasal
2011one is not.  The peculiar fact that the nasal
2012.ul
2013ng
2014never occurs at the beginning of a word (in English) was mentioned
2015earlier.  Notice that for stops and nasals there is a similarity in the
2016.ul
2017vertical
2018direction of Table 2.2, between
2019.ul
2020p,
2021.ul
2022b,
2023and
2024.ul
2025m;
2026.ul
2027t,
2028.ul
2029d,
2030and
2031.ul
2032n;
2033and
2034.ul
2035k,
2036.ul
2037g,
2038and
2039.ul
2040ng.
2041.ul
2042p
2043is an unvoiced version of
2044.ul
2045b
2046(try saying them),
2047and
2048.ul
2049m
2050is a nasalized version (for
2051.ul
2052b
2053is what you get when you
2054have a cold and try to say
2055.ul
2056m\c
2057).
2058These three sounds are all made
2059at the front of the mouth, while
2060.ul
2061t,
2062.ul
2063d,
2064and
2065.ul
2066n,
2067which bear the
2068same resemblance to each other, are made in the middle; and
2069.ul
2070k,
2071.ul
2072g,
2073and
2074.ul
2075ng
2076are made at the back.  This introduces another
2077possible classification, according to
2078.ul
2079place of articulation.
2080.pp
2081The unvoiced fricatives are quite straightforward, except perhaps
2082for
2083.ul
2084th,
2085which is the sound at the beginning of "thigh".
2086They are paired with the voiced fricatives on the basis of place
2087of articulation.  The voiced version of
2088.ul
2089th
2090is the
2091.ul
2092dh
2093at
2094the beginning of "thy".
2095.ul
2096zh
2097is a fairly rare phoneme, which
2098is heard in the middle of "vision".  Affricates are similar to
2099fricatives but begin with a stopped posture, and we mentioned earlier
2100the controversy as to whether they should be considered to be
2101single phonemes, or
2102sequences of stop phonemes and fricatives.
2103Finally comes the lonely aspirate,
2104.ul
2105h.
2106Aspiration does occur
2107elsewhere in speech, during the plosive burst of unvoiced stops.
2108.rh "Narrow phonetic transcription."
2109The phonological classification outlined above is based upon a clear
2110rationale for distinguishing between sounds according to how
2111they affect meaning \(em although the rationale does become
2112somewhat muddied in difficult cases.
2113Narrower transcriptions are not so systematic.
2114They use units called
2115.ul
2116allophones,
2117which are defined by reference to physical, acoustic, criteria rather
2118than purely logical ones.
2119("Phone" is a more old-fashioned term for the same thing,
2120and the misused word "phoneme" is often employed where allophone is
2121meant, that is, as a physical rather than a logical
2122unit.)  Each phoneme has several allophones,
2123more or less depending on how narrow or broad the transcription is,
2124and the allophones are different acoustic realizations of the same
2125logical unit.
2126For example, the
2127.ul
2128k\c
2129\&'s in "key" and "caw" may be considered as different
2130allophones (in a slightly narrow transcription).
2131Although we will not use symbols for allophones here,
2132they are often indicated by diacritical marks in a text
2133which modify the basic phoneme classes.
2134For example, a tilde (~) over a vowel means that it is nasalized, while a small
2135circle underneath a consonant means that it is devoiced.
2136.pp
2137Allophonic variation in speech is governed by a mechanism called
2138.ul
2139coarticulation,
2140where a sound is affected by those that come either side of it.
2141"Key"\-"caw" is a clear example of this, where the tongue
2142position in the
2143.ul
2144k
2145anticipates that of the following vowel \(em high
2146in the first case, low in the second.
2147Most allophonic variation in English is anticipatory, in that the sound
2148is influenced by the following articulation rather than by
2149preceding ones.
2150.pp
2151Nasalization is a feature which applies to vowels in English through
2152anticipatory coarticulation.
2153In many languages (for example, French) it is a
2154.ul
2155distinctive
2156feature for vowels in that it serves to distinguish one vowel phoneme class
2157from another.
2158That this is not so in English sometimes tempts us to assume,
2159incorrectly, that nasalization does not occur in vowels.
2160It does, typically when the vowel is followed by a nasal consonant, and it is
2161important for synthesis that nasalized vowel allophones are recognized and
2162treated accordingly.
2163.pp
2164Coarticulation can be predicted by phonological rules, which show
2165how a phonemic sequence will be realized by allophones.
2166Such rules have been studied extensively by linguists.
2167.pp
2168The reason for coarticulation, and for the existence of allophones,
2169lies in the physical constraints imposed by the motion
2170of the articulatory organs \(em particularly their acceleration and deceleration.
2171An immensely crude model is that the brain decides what phonemes to
2172say (for it is concerned with semantic things, and the definition
2173of a phoneme is a semantic one).
2174It then takes this sequence and translates it into neural commands
2175which actually move the articulators into target positions.
2176However, other commands may be issued, and executed, before these targets
2177are reached, and this accounts for coarticulation effects.
2178Phonological rules for converting a phonemic sequence to an
2179allophonic one are a sort of discrete model of the process.
2180Particularly for work involving computers, it is possible that this
2181rule-based approach will be overtaken by potentially more accurate
2182methods which attempt to model the continuous articulatory phenomena
2183directly.
2184.sh "2.3  Prosody"
2185.pp
2186The phonetic classification introduced above divides speech into
2187segments and classifies these into phonemes or allophones.
2188Riding on top of this stream of segments are other, more global,
2189attributes that dictate the overall prosody of the utterance.
2190Prosody is defined by the Oxford English Dictionary as the
2191"science of versification, laws of metre,"
2192which emphasizes the aspects of stress and rhythm that are central
2193to classical verse.
2194There are, however, many other features which are more or less
2195global.
2196These are collectively called prosodic or, equivalently, suprasegmental,
2197features, for they lie above the level of phoneme or syllable segments.
2198.pp
2199Prosodic features can be split into two basic categories:  features
2200of voice quality and features of voice dynamics.
2201Variations in voice quality, which are sometimes called
2202"paralinguistic" phenomena, are accounted for by anatomical
2203differences and long-term muscular idiosyncrasies (like a sore
2204throat), and have little part to play in the kind of applications
2205for speech output that have been sketched in Chapter 1.
2206Variations in voice dynamics occur in three dimensions:  pitch
2207or fundamental frequency of the voice, time, and amplitude.
2208Within the first, the pattern of pitch variation, or
2209.ul
2210intonation,
2211can be distinguished from the overall range within which that variation
2212occurs.
2213The time dimension encompasses the rhythm of the speech, pauses, and the
2214overall tempo \(em whether it is uttered quickly or slowly.
2215The third dimension, amplitude, is of relatively minor importance.
2216Intonation and rhythm work together to produce an effect commonly called
2217"stress", and we will elaborate further on the nature of stress and discuss
2218algorithms for synthesizing intonation and rhythm in Chapter 8.
2219.pp
2220These features have a very important role to play in communicating meaning.
2221They are not fancy, optional components.
2222It is their neglect which is largely responsible for the layman's
2223stereotype of computer speech,
2224a caricature of living speech \(em abrupt, arhythmic, and in a grating
2225monotone \(em
2226which was well characterized by Isaac Asimov when he wrote of speaking
2227"all in capital letters".
2228.pp
2229Timing has a syntactic function in that it sometimes helps to
2230distinguish nouns from
2231verbs
2232(\c
2233.ul
2234ex\c
2235tract versus ex\c
2236.ul
2237tract\c
)
and adjectives from verbs (app\c
2240.ul
2241rox\c
2242imate versus approxi\c
2243.ul
2244mate\c
2245) \(em although segmental aspects play a part here too, for the vowel
2246qualities differ in each pair of words.
2247Nevertheless, if you make a mistake when assigning stress to words
2248like these in conversation you are very likely to be queried as
2249to what you actually said.
2250.pp
2251Intonation has a big effect on meaning too.
2252Pitch often \(em but by no means always \(em rises on a question,
2253the extent and abruptness of the rise depending on features like whether
2254a genuine information-bearing reply or merely confirmation is expected.
2255A distinctive pitch pattern accompanies the introduction of a new topic.
2256In conjunction with rhythm, intonation can be used to bring out contrasts
2257as in
2258.LB
2259.NI
2260"He didn't have a
2261.ul
2262red
2263car, he had a
2264.ul
2265black
2266one."
2267.LE
2268In general, the intonation patterns used by a reader depend not only on
2269the text itself, but on his interpretation of it, and also on his
2270expectation of the listener's interpretation of it.
2271For example:
2272.LB
2273.NI
2274"He had a
2275.ul
2276red
2277car" (I think you thought it was black),
2278.NI
2279"He had a red
2280.ul
2281bi\c
2282cycle" (I think you thought it was a car).
2283.LE
2284.pp
2285In natural speech, prosodic features are significantly influenced by
2286whether the utterance is generated spontaneously or read aloud.
2287The variations in spontaneous speech are enormous.
2288There are all sorts of emotions which are plainly audible in
2289everyday speech:  sarcasm, excitement, rudeness, disagreement,
2290sadness, fright, love.
2291Variations in voice quality certainly play a part here.
2292Even with "ordinary" cooperative friendly conversation, the need to find
2293words and somehow fit them into an overall utterance produces great
2294diversity of prosodic structures.
2295Applications for speech output from computers do not, however, call for
2296spontaneous conversation, but for a controlled delivery which is
2297like that when reading aloud.
2298Here, the speaker is articulating utterances which have been set out for
2299him, reducing his cognitive load to one of understanding and interpreting
2300the text rather than generating it.
2301Unfortunately for us, linguists are (quite rightly)
2302primarily interested in living,
2303spontaneous speech rather than pre-prepared readings.
2304.pp
2305Nevertheless, the richness of prosody in speech even when reading from
2306a book should not be underestimated.
2307Read aloud to an audience and listen to the contrasts in voice dynamics
2308deliberately introduced for variety's sake.
2309If stories are to be read there is even a case for controlling voice
2310.ul
2311quality
2312to cope with quotations and affective imitations.
2313.pp
2314We saw earlier that the source-filter model is particularly
2315helpful in distinguishing prosodic features, which are largely
2316properties of the source, from segmental ones, which belong to
2317the filter.
2318Pitch and amplitude are primarily source properties.
2319Rhythm and speed of speaking are not, but neither are they filter
2320properties, for they belong to the source-filter system as a whole
2321and not specifically to either part of it.
2322The difficult notion of stress is, from an acoustic point of view,
2323a combination of pitch, rhythm, and amplitude.
2324Even some features of voice quality can be attributed to the source
2325(like laryngitis), although others \(em cleft palate, badly-fitting
2326dentures \(em affect segmental features as well.
2327.sh "2.4  Further reading"
2328.pp
2329This chapter has been no more than a cursory introduction to some
2330of the difficult problems of linguistics and phonetics.
2331Here are some readable books which discuss these problems further.
2332.LB "nn"
2333.\"Abercrombie-1967-1
2334.ds [F 1
2335.]-
2336.ds [A Abercrombie, D.
2337.ds [D 1967
2338.ds [T Elements of general phonetics
.ds [I Edinburgh University Press
2340.nr [T 0
2341.nr [A 1
2342.nr [O 0
2343.][ 2 book
2344.in+2n
2345This is an excellent book which covers all of the areas of this
2346chapter, in much more detail than has been possible here.
2347.in-2n
2348.\"Brown-1980-2
2349.ds [F 2
2350.]-
2351.ds [A Brown, Gill
2352.as [A ", Currie, K.L.
2353.as [A ", and Kenworthy, J.
2354.ds [D 1980
2355.ds [T Questions of intonation
2356.ds [I Croom Helm
2357.ds [C London
2358.nr [T 0
2359.nr [A 1
2360.nr [O 0
2361.][ 2 book
2362.in+2n
2363An intensive study of the prosodics of colloquial, living speech
2364is presented, with particular reference to intonation.  Although
2365not particularly relevant to speech output from computers,
2366this book gives great insight into how conversational speech
2367differs from reading aloud.
2368.in-2n
2369.\"Fry-1979-1
2370.ds [F 1
2371.]-
2372.ds [A Fry, D.B.
2373.ds [D 1979
2374.ds [T The physics of speech
2375.ds [I Cambridge University Press
2376.ds [C Cambridge, England
2377.nr [T 0
2378.nr [A 1
2379.nr [O 0
2380.][ 2 book
2381.in+2n
2382This is a simple and readable account of speech science, with a good
2383and completely non-mathematical introduction to frequency analysis.
2384.in-2n
2385.\"Ladefoged-1975-4
2386.ds [F 4
2387.]-
2388.ds [A Ladefoged, P.
2389.ds [D 1975
2390.ds [T A course in phonetics
.ds [I Harcourt Brace Jovanovich
2392.ds [C New York
2393.nr [T 0
2394.nr [A 1
2395.nr [O 0
2396.][ 2 book
2397.in+2n
Usually books entitled "A course in ..." are dreadfully dull, but
2399this is a wonderful exception.  An exciting, readable, almost racy
2400introduction to phonetics, full of little experiments you can try
2401yourself.
2402.in-2n
2403.\"Lehiste-1970-5
2404.ds [F 5
2405.]-
2406.ds [A Lehiste, I.
2407.ds [D 1970
2408.ds [T Suprasegmentals
2409.ds [I MIT Press
2410.ds [C Cambridge, Massachusetts
2411.nr [T 0
2412.nr [A 1
2413.nr [O 0
2414.][ 2 book
2415.in+2n
2416This fairly comprehensive study of the prosodics of speech
2417complements Ladefoged's book, which is mainly concerned with segmental
2418phonetics.
2419.in-2n
2420.\"O'Connor-1973-1
2421.ds [F 1
2422.]-
2423.ds [A O'Connor, J.D.
2424.ds [D 1973
2425.ds [T Phonetics
2426.ds [I Penguin
2427.ds [C London
2428.nr [T 0
2429.nr [A 1
2430.nr [O 0
2431.][ 2 book
2432.in+2n
2433This is another introductory book on phonetics.
2434It is packed with information on all aspects of the subject.
2435.in-2n
2436.LE "nn"
2437.EQ
2438delim $$
2439.EN
2440.CH "3  SPEECH STORAGE"
2441.ds RT "Speech storage
2442.ds CX "Principles of computer speech
2443.pp
2444The most familiar device that produces speech output is the ordinary tape
2445recorder, which stores information in analogue form on magnetic tape.
2446However, this is unsuitable for speech output from computers.
2447One reason is that it is difficult to access different utterances quickly.
2448Although random-access tape recorders do exist, they are expensive and
2449subject to mechanical breakdown because of the stresses associated with
2450frequent starting and stopping.
2451.pp
2452Storing speech on a rotating drum instead of
2453tape offers the possibility of access to any track within one revolution time.
For example, the IBM 7770 Audio Response Unit employs drums rotating twice
a second which are able to store up to 32 words of 500\ msec each.  These can be accessed
randomly, within half a second at most.
2457Although one can
2458arrange to store longer words by allowing overflow on to an adjacent track at
2459the end of the rotation period, the discrete time-slots provided by this
2460system make it virtually impossible for it to generate connected utterances
2461by assembling appropriate words from the store.
2462.pp
2463The Cognitronics Speechmaker has a similar structure, but with
2464the analogue speech waveform recorded on photographic film.
2465Storing audio waveforms optically is not an unusual technique, for this is how
2466soundtracks are recorded on ordinary movie films.  The original version of
2467the "speaking clock" of the British Post Office used optical storage in
2468concentric tracks on flat glass discs.
2469It is described by Speight and Gill (1937),
2470who include a fascinating account of how the utterances are synchronized.
2471.[
2472Speight Gill 1937
2473.]
2474A 4\ Hz signal from a pendulum clock was used to supply current to an electric
2475motor, which drove a shaft equipped with cams and gears that rotated
2476the glass discs containing utterances for seconds, minutes, and hours
2477at appropriate speeds!
2478.pp
2479A second reason for avoiding analogue storage is price.  It is difficult to see how a random-access
2480tape recorder could be incorporated into a talking pocket calculator or
2481child's toy without considerably inflating the cost.
2482Solid-state electronics is much cheaper than mechanics.
2483.pp
2484But the best reason is that, in many of the applications we have discussed,
2485it is necessary to form utterances by concatenating separately-recorded
2486parts.  It is totally infeasible, for example, to store each and every
2487possible telephone number as an individual recording!  And
2488utterances that are formed by concatenating individual words which were
2489recorded in isolation, or in a different context, do not sound completely
2490natural.  For example, in an early experiment, Stowe and Hampton (1961) recorded
individual words on magnetic tape, spliced the tape with the words in a different
2492order to make sentences, and played the result to subjects who were scored on
2493the number of key words which they identified correctly.
2494.[
2495Stowe Hampton 1961
2496.]
2497The overall conclusion was that while embedding a word in normally-spoken sentences
2498.ul
2499increases
2500the probability of recognition (because the extra context gives clues about the
2501word), embedding a word in a constructed sentence, where intonation and rhythm
2502are not properly rendered,
2503.ul
2504decreases
2505the probability of recognition.  When the speech was uttered slowly,
2506however, a considerable improvement was noticed, indicating that if the
2507listener has more processing time he can overcome the lack of proper intonation
2508and rhythm.
2509.pp
2510Nevertheless, many present-day voice response systems
2511.ul
2512do
2513store what amounts to a direct recording of the acoustic wave.
2514However, the storage medium is digital rather than analogue.
2515This means that standard computer storage devices can be used, providing
2516rapid access to any segment of the speech at relatively low cost \(em for
2517the economics of mass-production ensures a low price for random-access
2518digital devices compared with random-access analogue ones.
2519Furthermore, it reduces the amount of special equipment needed for speech
2520output.  One can buy very cheap speech input/output interfaces for home computers
2521which connect to standard hobby buses.
2522Another advantage of digital over analogue recording is that
2523integrated circuit read-only memories (ROMs)
2524can be used for hand-held devices which need small quantities of speech.
2525Hence this chapter begins by showing how waveforms are stored digitally,
2526and then describes some techniques for reducing the data needed for a given
2527utterance.
2528.sh "3.1  Storing waveforms digitally"
2529.pp
2530When an analogue signal is converted to digital form, it is made discrete
2531both in time and in amplitude.  Discretization in time is the operation of
2532.ul
2533sampling,
2534whilst in amplitude it is
2535.ul
2536quantizing.
2537It is worth pointing out that the transmission of analogue information by
2538digital means is called "PCM" (standing for "pulse code modulation") in
2539telecommunications jargon.
2540Much of the theory of digital signal processing investigates signals which
2541are sampled but not quantized (or quantized into sufficiently many levels to
2542avoid inaccuracies).  The operation of quantization, being non-linear,
2543is not very amenable to theoretical analysis.  Quantization introduces issues
2544such as accumulation of round-off noise in arithmetic operations,
2545which, although they are very important in practical implementations, can only
2546be treated theoretically under certain somewhat unrealistic assumptions
2547(in particular, independence of the quantization error from sample to sample).
2548.rh "Sampling."
2549A fundamental theorem of telecommunications states that a signal can only be
2550reconstructed accurately from a sampled version if it does not contain
2551components whose frequency is greater than half the frequency at which the
2552sampling takes place.  Figure 3.1(a) shows how a component of slightly greater
2553than half the sampling frequency can masquerade, as far as an observer with
2554access only to the sampled data can tell, as a component at slightly less
2555than half the sampling frequency.
2556.FC "Figure 3.1"
2557Call the sampling interval $T$ seconds, so that the
2558sampling frequency is $1/T$\ Hz.
2559Then components at $1/2T+f$, $3/2T-f$, $3/2T+f$ and so on all masquerade
2560as a component at $1/2T-f$.  Similarly, components at frequencies just under
2561the sampling frequency masquerade as very low-frequency components, as shown
2562in Figure 3.1(b).  This phenomenon is often called "aliasing".
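.pp
A small numerical sketch makes the masquerade concrete.  In the Python
fragment below the sampling frequency and the offset are arbitrary
illustrative figures; the samples of a cosine just above half the sampling
frequency are identical to those of one the same distance below it, and a
component just under the sampling frequency is indistinguishable from a
very low-frequency one.
.LB
.nf
import math

# Aliasing sketch; the frequencies are arbitrary illustrative choices.
fs = 8000.0                 # sampling frequency, 1/T
T = 1.0 / fs
f = 300.0                   # offset from half the sampling frequency

def samples(freq, n=8):
    # Sample cos(2*pi*freq*t) at t = 0, T, 2T, ...
    return [math.cos(2 * math.pi * freq * k * T) for k in range(n)]

low  = samples(fs / 2 - f)  # 3700 Hz
high = samples(fs / 2 + f)  # 4300 Hz masquerades as 3700 Hz
assert all(abs(a - b) < 1e-9 for a, b in zip(low, high))

# A component just under fs masquerades as a very low frequency:
assert all(abs(a - b) < 1e-9
           for a, b in zip(samples(fs - 50.0), samples(50.0)))
print("the sampled values are identical")
.fi
.LE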
2563.pp
2564Thus the continuous, infinite, frequency axis for the unsampled signal, where
2565two components at different frequencies can always be distinguished, maps
2566into a repetitive frequency axis when the signal is sampled.  As depicted
2567in Figure 3.2, the frequency
2568interval $[1/T,~ 2/T)$ \u\(dg\d
2569.FN 3
2570.sp
2571\u\(dg\dIntervals are specified in brackets, with a square bracket representing
2572a closed end of the interval and a round one representing an open one.
2573Thus the interval $[1/T,~ 2/T)$ specifies the range $1/T ~ <= ~ frequency
2574~ < ~ 2/T$.
2575.EF
2576is mapped back into the band $[0,~ 1/T)$, as are the
2577intervals $[2/T,~ 3/T)$,  $[3/T,~ 4/T)$, and so on.
2578.FC "Figure 3.2"
2579Furthermore, the interval $[1/2T,~ 1/T)$ between half the sampling frequency and the sampling
2580frequency, is mapped back into the interval
2581below half the sampling frequency; but this time the mapping is backwards,
2582with frequencies at just under $1/T$ being mapped to frequencies slightly greater
2583than zero, and frequencies just over $1/2T$ being mapped to ones
2584just under $1/2T$.
2585The best way to represent a repeating frequency axis like this is as a circle.
2586Figure 3.3 shows how the linear frequency axis for continuous systems maps
2587on to a circular axis for sampled systems.
2588.FC "Figure 3.3"
2589For present purposes it is
2590easiest to imagine the bottom half of the circle as being reflected into
2591the top half, so that traversing the upper semicircle in the anticlockwise direction
2592corresponds to frequencies increasing from 0 to $1/2T$ (half the sample frequency),
2593and returning along the lower semicircle is actually the same as coming
2594back round the upper one, and corresponds to frequencies from $1/2T$ to $1/T$
2595being mapped into the range $1/2T$ to 0.
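.pp
The folding described here can be captured in a couple of lines of Python.
The helper below (an illustration only, not part of any standard library)
reduces an arbitrary frequency to the apparent frequency between zero and
half the sampling frequency, and reproduces the figures used in the sketch
above.
.LB
.nf
# Fold a frequency on to the circular frequency axis of a sampled system.
# Illustrative helper only.

def alias(freq, fs):
    # Every input frequency is observed as one between 0 and fs/2.
    f = freq % fs            # wrap on to one revolution of the circle
    return min(f, fs - f)    # reflect the upper semicircle back down

assert alias(4300.0, 8000.0) == 3700.0
assert alias(7950.0, 8000.0) == 50.0
assert alias(9000.0, 8000.0) == 1000.0
.fi
.LE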
2596.pp
2597As far as speech is concerned, then, we must ensure that before sampling a
2598signal no significant components at greater than half the sample frequency
2599are present.  Furthermore, the sampled signal will only contain information
about frequency components less than this, so the sampling frequency must be
chosen as at least twice the highest frequency of interest.
2602For example, consider telephone-quality speech.
2603Telephones provide a familiar standard of speech quality which,
2604although it can only be an approximate "standard",
2605will be much used throughout this book.
2606The telephone network
2607aims to transmit only frequencies lower than 3.4\ kHz.  We saw in the
2608previous chapter that this region will contain the information-bearing formants,
2609and some \(em but not all \(em of the fricative and aspiration energy.
2610Actually, transmitting speech through the telephone system degrades its
2611quality very significantly, probably more than you realize since everyone is
2612so accustomed to telephone speech.  Try the dial-a-disc service and compare
2613it with high-fidelity music for a striking example of the kind of degradation
2614suffered.
2615.pp
2616For telephone speech, the sampling frequency must be chosen to be
2617at least 6.8\ kHz.
2618Since speech contains significant amounts of energy above 3.4\ kHz, it should be
2619filtered before sampling to remove this; otherwise the higher components
2620would be mapped back into the baseband and distort the low-frequency information.
2621Because it is difficult to make filters that cut off very sharply, the
2622sampling frequency is chosen rather greater than twice the highest frequency of
2623interest.  For example, the digital telephone network samples at 8\ kHz.
2624The pre-sampling filter should have a cutoff frequency of 4\ kHz; aim for
2625negligible distortion below 3.4\ kHz; and transmit negligible components
2626above 4.6\ kHz \(em for these are reflected back into the band of interest,
2627namely 0 to 3.4\ kHz.  Figure 3.4 shows a block diagram for the input hardware.
2628.FC "Figure 3.4"
2629.rh "Quantization."
2630Before considering specifications for the pre-sampling filter, let us turn
2631from discretization in time to discretization in amplitude, that is,
2632quantization.
2633This is performed by an A/D converter (analogue-to-digital), which takes as input
2634a constant analogue voltage (produced by the sampler) and generates a
2635corresponding binary value as output.  The simplest correspondence is
2636.ul
2637uniform
2638quantization, where the amplitude range is split into equal regions by points
2639termed "quantization levels", and the output is a binary representation of
2640the nearest quantization level to the input voltage.
2641Typically, 11-bit conversion is used for speech, giving 2048 quantization
2642levels, and the signal is adjusted to have zero mean so that half the
2643levels correspond to negative input voltages and the other half to positive
2644ones.
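.pp
The following Python sketch shows what such a uniform quantizer does, in
the simplest possible terms.  It is an illustration only; a real A/D
converter performs the operation in hardware, and details of scaling and
overload behaviour differ from device to device.
.LB
.nf
# Uniform quantizer sketch: N bits, input assumed to lie in [-1.0, +1.0).
# Illustrative only; not a model of any particular converter.

def quantize(x, n_bits=11):
    # Integer code of the nearest quantization level.
    levels = 2 ** n_bits                 # 2048 levels for 11 bits
    step = 2.0 / levels                  # spacing between adjacent levels
    code = int(round(x / step))
    return max(-levels // 2, min(levels // 2 - 1, code))   # clip at the ends

def reconstruct(code, n_bits=11):
    return code * (2.0 / 2 ** n_bits)

x = 0.2345
c = quantize(x)
print(c, reconstruct(c), abs(x - reconstruct(c)))   # error is below half a step
.fi
.LE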
2645.pp
2646It is, at first sight, surprising that as many as 11 bits are needed for
2647adequate representation of speech signals.  Research on the digital telephone
2648network, for example, has concluded that a signal-to-noise ratio of
2649some 26\-27\ dB is enough to avoid undue harshness of quality, loss
2650of intelligibility, and listener fatigue for speech at a comfortable
2651level in an otherwise reasonably good channel.
2652Rabiner and Schafer (1978) suggest that about 36\ dB signal-to-noise ratio
2653would "most likely provide adequate quality in a communications system".
2654.[
2655Rabiner Schafer 1978 Digital processing of speech signals
2656.]
2657But 11-bit quantization seems to give a very much better signal-to-noise
2658ratio than these figures.  To estimate its magnitude, note that for N-bit quantization
2659the error for each sample will lie between
2660.LB
2661$
2662- ~ 1 over 2 ~. 2 sup -N$    and    $+ ~ 1 over 2 ~. 2 sup -N .
2663$
2664.LE
2665Assuming that it is uniformly distributed in this range \(em an assumption
2666which is likely to be justified if the number of levels is sufficiently
2667large \(em leads to a mean-squared error of
2668.LB
2669.EQ
2670integral from {-2 sup -N-1} to {2 sup -N-1} ~e sup 2 p(e) de,
2671.EN
2672.LE
2673where $p(e)$, the probability density function of the error $e$, is a constant
2674which satisfies the usual probability normalization constraint, namely
2675.LB
2676.EQ
2677integral from {-2 sup -N-1} to {2 sup -N-1} ~ p(e) de ~~=~ 1.
2678.EN
2679.LE
2680Hence $p(e)=2 sup N $, and so the mean-squared error is  $2 sup -2N /12$.
2681This is  $10 ~ log sub 10 (2 sup -2N /12)$\ dB, or around \-77\ dB for 11-bit
2682quantization.
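.pp
As a rough check on this figure, the following short sketch (in Python, with
the NumPy library; the test signal and sample count are arbitrary) quantizes a
uniformly-distributed signal with an 11-bit uniform quantizer and measures the
resulting noise power directly.
.LB
.nf
import numpy as np

N = 11                                # bits of uniform quantization
step = 2.0 ** (-N)                    # step size for a unit full-scale range
rng = np.random.default_rng(0)
x = rng.uniform(-0.5, 0.5, 100000)    # test signal spanning the full range
xq = np.round(x / step) * step        # round to the nearest quantization level
noise = np.mean((xq - x) ** 2)
print("measured:", 10 * np.log10(noise), "dB")
print("theory:  ", 10 * np.log10(step ** 2 / 12), "dB")
# both come out at about -77 dB for 11 bits
.fi
.LE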
2683.pp
2684This noise level is relative to the maximum amplitude range of the conversion.
2685A maximum-amplitude sine wave has a power of \-9\ dB relative to the same
2686reference, giving a signal-to-noise ratio of some 68\ dB.  This is far in excess
2687of that needed for telephone-quality speech.  However, look at the very peaky
2688nature of the typical speech waveform given in Figure 3.5.
2689.FC "Figure 3.5"
2690If clipping is to be avoided, the maximum amplitude level of the A/D converter
2691must be set at a value which makes the power of the speech signal very much
2692less than a maximum-amplitude sine wave.  Furthermore, different people
2693speak at very different volumes, and the overall level fluctuates constantly
2694with just one speaker.  Experience shows that while 8- or 9-bit quantization
2695may provide sufficient signal-to-noise ratio to preserve telephone-quality
2696speech if the overall speaker levels are carefully controlled, about 11 bits
2697are generally required to provide high-quality representation of speech with
2698a uniform quantization.  With 11 bits, a sine wave whose amplitude is only 1/32
2699of the full-scale value would be digitized with a signal-to-noise ratio
2700of around 36\ dB, the most pessimistic figure quoted above for adequate quality.
2701Even then it is useful if the speaker is provided
2702with an indication of the amplitude of his speech:  a traffic-light
2703indicator with red signifying clipping overload, orange a suitable level,
2704and green too low a value, is often convenient for this.
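.pp
The effect of this headroom is easy to check numerically.  The following
sketch (Python with NumPy again; the test frequency and duration are
arbitrary) quantizes a sine wave at 1/32 of full-scale amplitude with 11 bits
and measures the resulting signal-to-noise ratio.
.LB
.nf
import numpy as np

bits = 11
step = 2.0 ** (-bits)                  # step size for a unit full-scale range
t = np.arange(80000) / 8000.0          # ten seconds at 8 kHz
x = (0.5 / 32) * np.sin(2 * np.pi * 997 * t)   # sine at 1/32 of full scale
xq = np.round(x / step) * step
snr = 10 * np.log10(np.mean(x ** 2) / np.mean((xq - x) ** 2))
print("SNR of a 1/32 full-scale sine:", round(snr, 1), "dB")
# about 38 dB, of the same order as the 36 dB figure quoted above
.fi
.LE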
2705.rh "Logarithmic quantization."
2706For the purposes of speech
2707.ul
2708processing,
2709it is essential to have the signal quantized uniformly.  This is because
2710all of the theory applies to linear systems, and nonlinearities introduce
2711complexities which are not amenable to analysis.
2712Uniform quantization, although a nonlinear operation, is linear in the
2713limiting case as the number of levels becomes large, and for most purposes
2714its effect can be modelled by assuming that the quantized signal is obtained
2715from the original analogue one by the addition of a small amount of
2716uniformly-distributed quantizing noise, as in fact was done above.
2717Usually the quantization noise is disregarded in subsequent analysis.
2718.pp
2719However, the peakiness of the speech signal illustrated in Figure 3.5 leads
2720one to suspect that a non-linear representation, for example a logarithmic one,
2721could provide a better signal-to-noise ratio over a wider range of input
2722amplitudes, and hence be more useful than linear quantization \(em at least
2723for speech storage (and transmission).
2724And indeed this is the case.  Linear quantization has the unfortunate effect
2725that the absolute noise level is independent of the signal level, so that an excessive
2726number of bits must be used if a reasonable ratio is to be achieved for peaky
2727signals.  It can be shown that a logarithmic representation like
2728.LB
2729.EQ
2730y ~ = ~ 1 ~ + ~ k ~ log ~ x,
2731.EN
2732.LE
2733where $x$ is the original signal and $y$ is the value which is to be quantized,
2734gives a
2735signal-to-noise
2736.ul
2737ratio
2738which is independent of the input signal level.
2739This relationship cannot be realized physically, for it is undefined when the signal
2740is negative and diverges when it is zero.
2741However, realizable approximations to it can be made which retain the advantages
2742of constant signal-to-noise ratio within a useful range of signal amplitudes.
2743Figure 3.6 shows the logarithmic relation with one widely-used approximation to it,
2744called the A-law.
2745.FC "Figure 3.6"
2746The idea of non-linearly quantizing a signal to achieve adequate signal-to-noise
2747ratios for a wide variety of amplitudes is called "companding", a contraction
2748of "compressing-expanding".  The original signal can be retrieved from
2749its A-law compression by antilogarithmic expansion.
2750.pp
2751Figure 3.6 also
2752shows one common coding scheme which is a piecewise linear approximation
2753to the A-law.  This provides an 8-bit code, and gives the equivalent
2754of 12-bit linear quantization for small signal levels.  It approximates
2755the A-law in 16 linear segments, 8 for positive and 8 for negative
2756inputs.
2757Consider the positive part of the curve.  The first two segments, which
2758are actually collinear, correspond exactly to 12-bit linear conversion.
2759Thus the output codes 0 to 31 correspond to inputs from 0 to 31/2048,
2760in equal steps.  (Remember that both positive and negative signals
2761must be converted, so a 12-bit linear converter will allocate 2048 levels
2762for positive signals and 2048 for negative ones.)  The next
2763segment provides 11-bit linear quantization,
2764output codes 32 to 47 corresponding to inputs from 16/1024 to 31/1024.
2765Similarly, the next segment corresponds to 10-bit quantization, covering
2766inputs from 16/512 to 31/512.  And so on, the last section giving 6-bit
2767quantization of inputs from 16/32 to 31/32, the full-scale positive value.
2768Negative inputs are converted similarly.
For signal levels of less than 32/2048, that is, $2 sup -6$, this implementation
2770of the A-law provides full 12-bit precision.
2771As the signal level increases, the precision decreases gradually to 6 bits
2772at maximum amplitudes.
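.pp
The segment structure just described translates directly into a short encoding
routine.  The sketch below (Python; the routine name is ours, and the input is
assumed to have been quantized already to a signed 12-bit integer in the range
\-2048 to 2047) produces the sign, exponent and mantissa fields of the 8-bit
code.
.LB
.nf
def alaw_encode(sample):
    # sample: signed 12-bit integer, -2048 .. 2047
    sign = 0 if sample >= 0 else 1
    m = min(abs(sample), 2047)
    if m < 32:                           # the two collinear 12-bit segments
        exponent, mantissa = m >> 4, m & 15
    else:                                # segments 2 to 7
        exponent = m.bit_length() - 4
        mantissa = (m >> (exponent - 1)) & 15
    # (transmission codecs also invert alternate bits of the codeword,
    #  a detail omitted here)
    return (sign << 7) | (exponent << 4) | mantissa
.fi
.LE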
2773.pp
2774Logarithmic encoding provides what is in effect a floating-point representation
2775of the input.  The conventional floating-point format, however, is not used
2776because many different codes can represent the same value.  For example, with
2777a 4-bit exponent preceding a 4-bit mantissa, the words 0000:1000,
27780001:0100, 0010:0010, and 0011:0001 represent the numbers
2779$0.1 ~ times ~ 2 sup 0$,  $0.01 ~ times ~ 2 sup 1
2780$,  $0.001 ~ times ~ 2 sup 2$,  \c
2781and  $0.0001 ~ times ~ 2 sup 3$  respectively,
2782which are the same.  (Some floating-point conventions assume that an unwritten
2783"1" bit precedes the mantissa, except when the whole word is zero; but this
2784gives decreased resolution around zero \(em which is exactly where we want the
2785resolution to be greatest.)  Table 3.1 shows the 8-bit A-law codes,
2786.RF
2787.in+0.7i
2788.ta 1.6i +\w'bits 1-3   'u
27898-bit codeword:	bit 0	sign bit
2790	bits 1-3	3-bit exponent
2791	bits 4-7	4-bit mantissa
2792.sp2
2793.ta 1.6i 3.5i
2794.ul
2795 codeword	   interpretation
2796.sp
27970000 0000	\h'\w'\0-\0  +  'u'$.0000 ~ times ~ 2 sup -7$
2798\0\0\0...	\0\0\0\0...
27990000 1111	\h'\w'\0-\0  +  'u'$.1111 ~ times ~ 2 sup -7$
28000001 0000	$2 sup -7 ~~ + ~~ .0000 ~ times ~ 2 sup -7$
2801\0\0\0...	\0\0\0\0...
28020001 1111	$2 sup -7 ~~ + ~~ .1111 ~ times ~ 2 sup -7$
28030010 0000	$2 sup -6 ~~ + ~~ .0000 ~ times ~ 2 sup -6$
2804\0\0\0...	\0\0\0\0...
28050010 1111	$2 sup -6 ~~ + ~~ .1111 ~ times ~ 2 sup -6$
28060011 0000	$2 sup -5 ~~ + ~~ .0000 ~ times ~ 2 sup -5$
2807\0\0\0...	\0\0\0\0...
28080011 1111	$2 sup -5 ~~ + ~~ .1111 ~ times ~ 2 sup -5$
28090100 0000	$2 sup -4 ~~ + ~~ .0000 ~ times ~ 2 sup -4$
2810\0\0\0...	\0\0\0\0...
28110100 1111	$2 sup -4 ~~ + ~~ .1111 ~ times ~ 2 sup -4$
28120101 0000	$2 sup -3 ~~ + ~~ .0000 ~ times ~ 2 sup -3$
2813\0\0\0...	\0\0\0\0...
28140101 1111	$2 sup -3 ~~ + ~~ .1111 ~ times ~ 2 sup -3$
28150110 0000	$2 sup -2 ~~ + ~~ .0000 ~ times ~ 2 sup -2$
2816\0\0\0...	\0\0\0\0...
28170110 1111	$2 sup -2 ~~ + ~~ .1111 ~ times ~ 2 sup -2$
28180111 0000	$2 sup -1 ~~ + ~~ .0000 ~ times ~ 2 sup -1$
2819\0\0\0...	\0\0\0\0...
28200111 1111	$2 sup -1 ~~ + ~~ .1111 ~ times ~ 2 sup -1$
2821
28221000 0000	\h'\w'\0-\0  'u'$- ~~ .0000 ~ times ~ 2 sup -7$	negative numbers treated as
2823\0\0\0...	\0\0\0\0...	above, with a sign bit of 1
28241111 1111	\h'-\w'\- 'u'\- $2 sup -1 ~~ - ~~ .1111 ~ times ~ 2 sup -1$
2825.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
2826.in 0
2827.FG "Table 3.1  8-bit A-law codes, with their floating-point equivalents"
2828according
2829to the piecewise linear approximation of Figure 3.6, written in a notation which
2830suggests floating point.  Each linear segment has a different exponent except
2831the first two segments, which as explained above are collinear.
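.pp
Conversely, the interpretation given in Table 3.1 can be written down directly
as a decoding routine (again a sketch; the value returned is the lower edge of
the quantization interval, expressed as a fraction of full scale).
.LB
.nf
def alaw_decode(code):
    # code: 8-bit codeword laid out as in Table 3.1
    sign = -1.0 if code & 128 else 1.0
    exponent = (code >> 4) & 7
    mantissa = (code & 15) / 16.0
    if exponent == 0:
        return sign * mantissa * 2.0 ** (-7)
    return sign * (1 + mantissa) * 2.0 ** (exponent - 8)
.fi
.LE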
2832.pp
2833Logarithmic encoders and decoders are available from many semiconductor
2834manufacturers as single-chip devices
2835called "codecs" (for "coder/decoder").  Intended for use on digital communication
2836links, these generally provide a serial output bit-stream, which
2837should be converted to parallel by a shift register if the data is intended
2838for a computer.
2839Because of the potentially vast market for codecs in telecommunications,
2840they are made in great quantities and are consequently very cheap.
Estimates of the speech quality necessary for telephone applications indicate
that somewhat less than 8-bit accuracy is needed \(em 7-bit logarithmic encoding
2843was used in early digital communications links, and it may be that even 6 bits
2844are adequate.  However, during the transition period when digital
2845networks must coexist with the present analogue one, it is anticipated that
2846a particular telephone call may have to pass through several links, some
2847using analogue technology and some being digital.  The possibility of
2848several successive encodings and decodings has led telecommunications
2849engineers to standardize on 8-bit representations, leaving some margin
2850before additional degradation of signal quality becomes unduly distracting.
2851.pp
2852Unfortunately, world telecommunications authorities cannot agree on a single
2853standard for logarithmic encoding.  The A-law, which we have described,
2854is the European standard, but there is another system, called
2855the $mu$-law, which is used universally in North America.  It also is available
2856in single-chip form with an 8-bit code.  It has very similar
2857quantization error characteristics to the A-law, and would be indistinguishable
2858from it on the scale of Figure 3.6.
2859.rh "The pre-sampling filter."
2860Now that we have some idea of the accuracy requirements for quantization,
2861let us discuss quantitative specifications for the pre-sampling filter.
2862Figure 3.7 sketches the characteristics of this filter.
2863.FC "Figure 3.7"
2864Assume a
2865sampling frequency of 8\ kHz and a range of interest from 0 to 3.4\ kHz.
2866Although all components at frequencies above 4\ kHz will fold back into
2867the 0\ \-\ 4\ kHz baseband, those below 4.6\ kHz fold back above 3.4\ kHz and are
2868therefore outside the range of interest.  This gives a "guard band" between
28693.4 and 4.6\ kHz which separates the passband from the stopband.  The filter
2870should transmit negligible components in the stopband above 4.6\ kHz.
2871To reduce the harmonic distortion caused by aliasing to the same level
2872as the quantization noise in 11-bit linear conversion, the stopband
2873attenuation should be around \-68\ dB (the signal-to-noise ratio for a full-scale
2874sine wave).  Passband ripple is not so critical,
for two reasons.  Firstly, whilst the presence of aliased components means that
information has been lost about the frequency components within the range of
interest, passband ripple does not actually cause a loss of information but
only a distortion, and could, if necessary, be compensated by a suitable
filter acting on the digitized waveform.  Secondly, distortion of the
passband spectrum is not nearly so audible as the frequency images caused
by aliasing.  Hence one usually aims for a passband ripple of around 0.5\ dB.
2882.pp
2883The pass and stopband targets we have mentioned above can be achieved with
2884a 9'th order elliptic filter.  While such a filter is often used in
2885high-quality signal-processing systems, for telephone-quality speech
2886much less stringent specifications seem to be sufficient.  Figure 3.8, for
2887example, shows a template which has been recommended by telecommunications
2888authorities.
2889.FC "Figure 3.8"
2890A 5'th order elliptic filter can easily meet this specification.
2891Such filters, implemented by switched-capacitor means, are available in
2892single-chip form.  Integrated CCD (charge-coupled device)
2893filters which meet the same specification
2894are also marketed.  Indeed, some codecs provide input filtering on the same
2895chip as the A/D converter.
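.pp
For experimentation, a filter meeting roughly the specification discussed
above can be designed in a line or two with standard signal-processing
software.  The sketch below uses the SciPy library; the ripple and attenuation
figures are the ones quoted in the text, and the design is illustrative rather
than a recipe for a production filter.
.LB
.nf
import numpy as np
from scipy import signal

# 9th-order analogue elliptic low-pass: 0.5 dB passband ripple to 3.4 kHz,
# at least 68 dB of attenuation in the stopband
b, a = signal.ellip(9, 0.5, 68, 2 * np.pi * 3400, btype="low", analog=True)
w, h = signal.freqs(b, a, worN=2 * np.pi * np.array([3400.0, 4600.0]))
print(20 * np.log10(np.abs(h)))   # gain at the passband edge and at 4.6 kHz
.fi
.LE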
2896.pp
2897Instead of implementing a filter by analogue means to meet the aliasing
2898specifications, digital filtering can be used.  A high sample-rate A/D
2899converter, operating at, say, 32\ kHz, and preceded by a very simple low-pass
2900pre-sampling filter, is followed by a digital filter which meets the
2901desired specification, and its output is subsampled to provide an 8\ kHz sample
2902rate.  While such implementations may be economic where a multichannel digitizing
2903capability is required, as in local telephone exchanges where the subscriber
2904connection is an analogue one, they are unlikely to prove cost-effective for
2905a single channel.
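.pp
The principle can be sketched in a few lines (Python with SciPy; the 1\ kHz
test tone is merely a stand-in for the oversampled input signal).
.LB
.nf
import numpy as np
from scipy import signal

fs_fast = 32000                        # rate of the simple front-end converter
t = np.arange(fs_fast) / fs_fast
x32 = np.sin(2 * np.pi * 1000 * t)     # one second of oversampled input

# sharp digital low-pass filtering followed by subsampling: 32 kHz down to 8 kHz
x8 = signal.decimate(x32, 4, ftype="fir")
.fi
.LE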
2906.rh "Reconstructing the analogue waveform."
Once a signal has been digitized and stored, it must be passed through a D/A
converter (digital-to-analogue) and a low-pass filter when it is replayed.
2909D/A converters are cheaper than A/D converters, and the characteristics of the
2910low-pass filter for output can be the same as those for input.
2911However, the desampling operation introduces an additional distortion, which
2912has an effect on the component at frequency $f$ of
2913.LB
2914.EQ
2915{ sin ( pi f/f sub s )} over { pi f/f sub s } ~ ,
2916.EN
2917.LE
2918where $f sub s$ is the sampling frequency.  An "aperture correction" filter is
2919needed to compensate for this, although many systems simply do without it.
2920Such a filter is sometimes incorporated into the codec chip.
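.pp
The size of the effect is easily calculated.  At the top of the telephone band,
for instance, with the 8\ kHz sampling rate assumed earlier (a quick check in
Python):
.LB
.nf
import numpy as np

fs = 8000.0
f = 3400.0
x = np.pi * f / fs
print(20 * np.log10(np.sin(x) / x))    # about -2.75 dB of droop at 3.4 kHz
.fi
.LE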
2921.rh "Summary."
2922For telephone-quality speech, existing codec chips,
2923coupled if necessary with integrated pre-sampling filters, can
2924be used, at a remarkably low cost.
2925For higher-quality speech storage the analogue interface can become quite complex.
2926A comprehensive study of the problems as they relate to digitization of audio,
2927which demands much greater fidelity than speech, has been made by Blesser (1978).
2928.[
2929Blesser 1978
2930.]
2931He notes the following sources of error (amongst others):
2932.LB
2933.NP
2934slew-rate distortion in the pre-sampling filter for signals at the upper end
2935of the audio band;
2936.NP
2937insufficient filtering of high-frequency input signals;
2938.NP
2939noise generated by the sample-and-hold amplifier or pre-sampling filter;
2940.NP
2941acquisition errors because of the finite settling time of the sample-and-hold
2942circuit;
2943.NP
2944insufficient settling time in the A/D conversion;
2945.NP
2946errors in the quantization levels of the A/D and D/A converters;
2947.NP
2948noise in the converters;
2949.NP
2950jitter on the clock used for timing input or output samples;
2951.NP
2952aperture distortion in the output sampler;
2953.NP
2954noise in the output filter as a result of limited dynamic range of the
2955integrated circuits;
2956.NP
2957power-supply noise injection or ground coupling;
2958.NP
2959changes in characteristics as a result of temperature or ageing.
2960.LE
2961Care must be taken with the analogue interface to ensure that the precision
2962implied by the resolution of the A/D and D/A converters is not compromised
2963by inadequate analogue circuitry.  It is especially important to eliminate
2964high-frequency noise caused by fast edges on nearby computer buses.
2965.sh "3.2  Coding in the time domain"
2966.pp
2967There are several methods of coding the time waveform of a speech signal to
2968reduce the data rate for a given signal-to-noise ratio, or alternatively to
2969reduce the signal-to-noise ratio for a given data rate.  They almost all require
2970more processing, both at the encoding (for storage) and decoding (for
2971regeneration) ends of the digitization process.  They are sometimes used to
2972economize on memory in systems using stored speech,
2973for example the System\ X telephone exchange and the travel consultant described
2974in Chapter 1, and so will be described here.  However, it is to be expected
2975that simple time-domain coding techniques will be superseded by the more complex
2976linear predictive method, which is covered in Chapter 6, because this
2977can give a much more substantial reduction in the data rate for only a small
2978degradation in speech quality.  Hence the aim of this section is to introduce
2979the ideas in a qualitative way:  theoretical development and summaries of
2980results of listening tests can be found elsewhere (eg Rabiner and Schafer, 1978).
2981.[
2982Rabiner Schafer 1978 Digital processing of speech signals
2983.]
2984The methods we will examine are summarized in Table 3.2.
2985.RF
2986.nr x0 \w'linear PCM      'u
2987.nr x1 \n(x0+\w'    adaptive quantization, or adaptive prediction,'u
2988.nr x2 (\n(.l-\n(x1)/2
2989.in \n(x2u
2990.ta \n(x0u
2991\l'\n(x1u\(ul'
2992.sp
2993linear PCM	linearly-quantized pulse code modulation
2994.sp
2995log PCM	logarithmically-quantized pulse code modulation
2996	    (instantaneous companding)
2997.sp
2998APCM	adaptively quantized pulse code modulation
2999	    (usually syllabic companding)
3000.sp
3001DPCM	differential pulse code modulation
3002.sp
3003ADPCM	differential pulse code modulation with either
3004	    adaptive quantization, or adaptive prediction,
3005	    or both
3006.sp
3007DM	delta modulation (1-bit DPCM)
3008.sp
3009ADM	delta modulation with adaptive quantization
3010\l'\n(x1u\(ul'
3011.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
3012.in 0
3013.FG "Table 3.2  Time-domain encoding techniques"
3014.rh "Syllabic companding."
3015We have already studied one time-domain encoding technique, namely logarithmic
3016quantization, or log PCM (sometimes called "instantaneous companding").  A more
3017sophisticated encoder could track slowly varying trends in the overall amplitude
3018of the speech signal and use this information to adjust the quantization
3019levels dynamically.  Speech coding methods based on this principle are called
3020adaptive pulse code modulation systems (APCM).  Because the overall amplitude
3021changes slowly, it is sufficient to adjust the quantization relatively infrequently
3022(compared with the sampling rate), and this is often done at rates approximating
3023the syllable rate of running speech, leading to the term "syllabic companding".
3024A block floating-point format can be used, with a common exponent being
stored every M samples (with M, say, 125, giving a block of about 16\ msec
at 8\ kHz sampling), but the mantissa being stored at the regular sample rate.  The overall
3027energy in the block,
3028.LB
3029$sum from n=h to h+M-1 ~x(n) sup 2$    ($M = 125$, say),
3030.LE
3031is used to determine a suitable exponent, and every sample
3032in the block \(em namely
3033$x(h)$, $x(h+1)$, ..., $x(h+M-1)$ \(em is scaled according to that exponent.
3034Note that for speech transmission systems this method necessitates a delay of
3035$M$ samples at the encoder, and indeed some methods base the exponent on the
3036energy in the last block to avoid this.  For speech storage, however, the delay
3037is irrelevant.  A rather different, nonsyllabic, method of adaptive PCM is
continually to change the step size of a uniform quantizer, multiplying it at
each sample by a factor which depends on the magnitude of the previous code
word.
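.pp
A minimal sketch of the block floating-point idea follows (Python with NumPy;
the block length, mantissa width and headroom factor are illustrative choices,
not taken from any particular system).
.LB
.nf
import numpy as np

def apcm_encode(x, block=125, mantissa_bits=6):
    # x: speech scaled to the range -1..1
    exps, codes = [], []
    half = 2 ** (mantissa_bits - 1)
    for h in range(0, len(x), block):
        seg = np.asarray(x[h:h + block], dtype=float)
        rms = np.sqrt(np.mean(seg ** 2)) + 1e-12
        exp = min(int(np.ceil(np.log2(4 * rms))), 0)  # exponent from block energy
        exps.append(exp)                              # one exponent per block ...
        q = np.clip(np.round(seg / 2.0 ** exp * half), -half, half - 1)
        codes.append(q.astype(int))                   # ... and a mantissa per sample
    return exps, codes
.fi
.LE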
3041.pp
3042Adaptive quantization exploits information about the amplitude of the signal,
3043and, as a rough generalization, yields a reduction of one bit per sample
3044in the data rate for telephone-quality speech over ordinary logarithmic
3045quantization, for a given signal-to-noise ratio.  Alternatively, for the
3046same data rate an improvement of 6\ dB in signal-to-noise ratio can be obtained.
3047Some results for actual schemes are given by Rabiner and Schafer (1978).
3048.[
3049Rabiner Schafer 1978 Digital processing of speech signals
3050.]
3051However, there is other information in the time waveform of speech, namely, the
3052sample-to-sample correlation, which can be exploited to give further reductions.
3053.rh "Differential coding."
3054Differential pulse code modulation (DPCM), in its simplest form, uses the
3055present speech sample as a prediction of the next one,
3056and stores the prediction error \(em that is, the sample-to-sample difference.
3057This is a simple case of predictive encoding.
3058Referring back to the speech waveform displayed in Figure 3.5,
3059it seems plausible that the data rate can be reduced by transmitting the difference
between successive samples instead of their absolute values:  fewer bits are
required for the difference signal for a given overall accuracy because it
3062does not assume such extreme values as the absolute signal level.
3063Actually, the improvement is not all that great \(em about 4\ \-\ 5\ dB in
3064signal-to-noise ratio, or just under one bit per sample for a given
3065signal-to-noise ratio \(em for the difference signal can be nearly as large as
3066the absolute signal level.
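.pp
In its simplest form the scheme amounts to little more than the following (a
Python sketch; the quantizer width and step size are arbitrary illustrative
figures, and the usual closed-loop arrangement is used, in which the encoder
tracks the values the decoder will reconstruct).
.LB
.nf
import numpy as np

def dpcm_encode(x, bits=5, step=1.0 / 64):
    prediction, codes = 0.0, []
    top = 2 ** (bits - 1) - 1
    for sample in x:
        code = int(np.clip(round((sample - prediction) / step), -top - 1, top))
        codes.append(code)
        prediction += code * step        # the value the decoder will hold
    return codes

def dpcm_decode(codes, step=1.0 / 64):
    y, value = [], 0.0
    for code in codes:
        value += code * step
        y.append(value)
    return y
.fi
.LE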
3067.pp
3068If DPCM is used in conjunction with adaptive quantization, giving one form of
3069adaptive differential pulse code modulation (ADPCM), both the overall amplitude
3070variation and the sample-to-sample correlation are exploited, leading to a
3071combined gain of 10\ \-\ 11\ dB in signal-to-noise ratio (or just under two bits
3072reduction per sample for telephone-quality speech).  Another form of adaptation
3073is to alter the predictor by multiplying the previous sample value by a
3074parameter which is adjusted for best performance.
3075Then the transmitted signal at time $n$ is
3076.LB
3077.EQ
3078e(n) ~~ = ~~ x(n)~ - ~ax(n-1),
3079.EN
3080.LE
3081where the parameter $a$ is adapted (and stored) on a syllabic time-scale.  This
3082leads to a slight improvement in signal-to-noise ratio, which can be combined
3083with that achieved by adaptive quantization.  Much more substantial benefits
3084can be realized by using a weighted sum of the past several (up to 15) speech
3085samples, and adapting all the weights.  This is the basic idea of linear
3086prediction, which is developed in Chapter 6.
3087.rh "Delta modulation."
3088The coding methods presented so far all increase the complexity of the
3089analogue-to-digital interface (or, if the sampled waveform is coded
3090digitally, they increase the processing required before and after storage).
3091One method which considerably
3092.ul
3093simplifies
3094the interface is the limiting case
3095of DPCM with just 1-bit quantization.  Only the sign of the difference between
3096the current and last values is transmitted.  Figure 3.9 shows the conversion
3097hardware.
3098.FC "Figure 3.9"
The encoding part is essentially the same as a tracking A/D converter,
3100where the value in a counter is forced to track the analogue input by
3101incrementing or decrementing the counter according as the input exceeds or
3102falls short of the analogue equivalent of the counter's contents.  However,
3103for this encoding scheme, called "delta modulation", the increment-decrement
3104signal itself forms the discrete representation of the waveform, instead of the counter's
3105contents.  The analogue waveform can be reconstituted from the bit stream with
3106another counter and D/A converter.  Alternatively, an all-analogue implementation
3107can be used, both for the encoder and decoder, with a capacitor as integrator
3108whose charging current is controlled digitally.  This is a much cheaper realization.
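.pp
A software simulation of delta modulation takes only a few lines (a Python
sketch; the fixed step size is arbitrary, and in practice the decoder output
would be followed by a smoothing low-pass filter).
.LB
.nf
def delta_modulate(x, step=0.02):
    bits, estimate = [], 0.0
    for sample in x:
        bit = 1 if sample > estimate else 0     # sign of the difference only
        bits.append(bit)
        estimate += step if bit else -step      # local copy of the decoder
    return bits

def delta_demodulate(bits, step=0.02):
    y, estimate = [], 0.0
    for bit in bits:
        estimate += step if bit else -step
        y.append(estimate)
    return y
.fi
.LE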
3109.pp
3110It is fairly obvious that the sampling frequency for delta modulation will need
3111to be considerably higher than for straightforward PCM.  Figure 3.10 shows
3112an effect called "slope overload" which occurs when the sampling rate is too low.
3113.FC "Figure 3.10"
3114Either a higher sample rate or a larger step size will reduce the overload;
3115however, larger steps increase the noise level of the alternate 1's and \-1's
3116that occur when no input is present \(em called "granular noise".  A compromise
3117is necessary between slope overload and granular noise for a given bit rate.
3118Delta modulation results in lower data rates than logarithmic quantization
3119for a given signal-to-noise ratio if that ratio is low (poor-quality speech).
3120As the desired speech quality is increased its data rate grows faster than
that of logarithmic PCM.  The crossover point occurs well below
telephone quality, and so although delta modulation is used for some
3123applications where the permissible data rate is severely constrained,
3124it is not really suitable for speech output from computers.
3125.pp
3126It is profitable to adjust the step size, leading to
3127.ul
3128adaptive
3129delta modulation.
3130A common strategy is to increase or decrease the step size by a multiplicative
3131constant, which depends on whether the new transmitted bit will be equal to
3132or different from the last one.  That is,
3133.LB "nnnn"
3134.NI "nn"
3135$stepsize(n+1)  =  stepsize(n) times 2$  if $x(n+1)<x(n)<x(n-1)$
3136or $x(n+1)>x(n)>x(n-1)$
3137.br
3138(slope overload condition);
3139.NI "nn"
3140$stepsize(n+1) = stepsize(n)/2$  if $x(n+1),~x(n-1)<x(n)$
3141or $x(n+1),~x(n-1)>x(n)$
3142.br
3143(granular noise condition).
3144.LE "nnnn"
3145Despite these adaptive equations, the step size should be constrained to
3146lie between a predetermined fixed maximum and minimum, to prevent it from
becoming so large or so small that rapid accommodation to changing input signals is
3148impossible.
3149Then, in a period of potential slope overload the step size will grow, preventing
3150overload, possibly to its maximum value when overload may resume.  In a quiet
3151period it will decrease to its minimum value which determines the granular
3152noise in the idle condition.  Note that the step size need not be stored, for
3153it can be deduced from the bit changes in the digitized data.  Although
3154adaptation improves the performance of delta modulation, it is still inferior to
3155PCM at telephone qualities.
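.pp
In code, the adaptation amounts to doubling the step size whenever the new bit
repeats the previous one and halving it otherwise, within fixed limits (a
sketch; the limits and the factor of two are illustrative).  As noted above,
the decoder can regenerate the same step sizes from the bit stream itself.
.LB
.nf
def adaptive_delta_modulate(x, step_min=0.002, step_max=0.5):
    bits, estimate, step, last_bit = [], 0.0, step_min, 0
    for sample in x:
        bit = 1 if sample > estimate else 0
        if bit == last_bit:                     # potential slope overload
            step = min(step * 2, step_max)
        else:                                   # potential granular noise
            step = max(step / 2, step_min)
        estimate += step if bit else -step
        bits.append(bit)
        last_bit = bit
    return bits
.fi
.LE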
3156.rh "Summary."
3157It seems that ADPCM, with
3158adaptive quantization and adaptive prediction, can provide a worthwhile
3159advantage for speech storage, reducing the number of bits needed per sample of
3160telephone-quality speech from 7 for logarithmic PCM to perhaps 5, and the data
3161rate from 56\ Kbit/s to 40\ Kbit/s.  Disadvantages are additional complexity
3162in the encoding and decoding processes, and the fact that byte-oriented storage,
3163with 8 bits/sample in logarithmic PCM, is more convenient for computer use.
For low-quality speech where hardware complexity is to be minimized,
adaptive delta modulation could provide worthwhile savings \(em although the ready
3166availability of PCM codec chips reduces the cost advantage.
3167.sh "3.3  References"
3168.LB "nnnn"
3169.[
3170$LIST$
3171.]
3172.LE "nnnn"
3173.sh "3.4  Further reading"
3174.pp
3175Probably the best single reference on time-domain coding of speech is
3176the book by Rabiner and Schafer (1978), cited above.
3177However, this does not contain a great deal of information on practical
3178aspects of the analogue-to-digital conversion process; this is
3179covered by Blesser (1978) above, who is especially interested in
3180high-quality conversion for digital audio applications,
3181and Garrett (1978) below.
3182There are many textbooks in the telecommunications area which
3183are relevant to the subject of the chapter,
3184although they concentrate primarily on fundamental theoretical aspects rather
3185than the practical application of the technology.
3186.LB "nn"
3187.\"Cattermole-1969-1
3188.]-
3189.ds [A Cattermole, K.W.
3190.ds [D 1969
3191.ds [T Principles of pulse code modulation
3192.ds [I Iliffe
3193.ds [C London
3194.nr [T 0
3195.nr [A 1
3196.nr [O 0
3197.][ 2 book
3198.in+2n
3199This is a standard, definitive, work on PCM, and provides a good grounding
3200in the theory.
3201It goes into the subject in much more depth than we have been able to here.
3202.in-2n
3203.\"Garrett-1978-1
3204.]-
3205.ds [A Garrett, P.H.
3206.ds [D 1978
3207.ds [T Analog systems for microprocessors and minicomputers
3208.ds [I Reston Publishing Company
3209.ds [C Reston, Virginia
3210.nr [T 0
3211.nr [A 1
3212.nr [O 0
3213.][ 2 book
3214.in+2n
3215Garrett discusses the technology of data conversion systems, including
3216A/D and D/A converters and basic analogue filter design, in a
3217clear and practical manner.
3218.in-2n
3219.\"Inose-1979-2
3220.]-
3221.ds [A Inose, H.
3222.ds [D 1979
3223.ds [T An introduction to digital integrated communications systems
3224.ds [I Peter Peregrinus
3225.ds [C Stevenage, England
3226.nr [T 0
3227.nr [A 1
3228.nr [O 0
3229.][ 2 book
3230.in+2n
3231Inose's book is a recent one which covers the whole area of digital
3232transmission and switching technology.
3233It gives a good idea of what is happening to the telephone networks
3234in the era of digital communications.
3235.in-2n
3236.\"Steele-1975-3
3237.]-
3238.ds [A Steele, R.
3239.ds [D 1975
3240.ds [T Delta modulation systems
3241.ds [I Pentech Press
3242.ds [C London
3243.nr [T 0
3244.nr [A 1
3245.nr [O 0
3246.][ 2 book
3247.in+2n
3248Again a standard work, this time on delta modulation techniques.
3249Steele gives an excellent and exhaustive treatment of the subject from a
3250communications viewpoint.
3251.in-2n
3252.LE "nn"
3253.EQ
3254delim $$
3255.EN
3256.CH "4  SPEECH ANALYSIS"
3257.ds RT "Speech analysis
3258.ds CX "Principles of computer speech
3259.pp
3260Digital recordings of speech provide a jumping-off point for
3261further processing of the audio waveform, which is usually necessary for
3262the purpose of speech output.
3263It is difficult to synthesize natural sounds by concatenating
3264individually-spoken words.
3265Pitch is perhaps the most perceptually significant contextual effect
3266which must be
3267taken into account when forming connected speech out of isolated words.
3268The intonation of an utterance, which manifests itself as a
3269continually changing pitch, is a holistic property of the utterance
3270and not the sum of components determined by the individual words alone.
3271Happily, and quite coincidentally, communications engineers in their quest
3272for reduced-bandwidth telephony have invented methods of coding speech that
3273separate the pitch information from that carried by the articulation.
3274.pp
3275Although these analysis techniques, which were first introduced in the late
32761930's (Dudley, 1939), were originally implemented by analogue means \(em and
3277in many systems still are (Blankenship, 1978, describes a recent
3278switched-capacitor realization) \(em there is a continuing trend
3279towards digital implementations, particularly for the more sophisticated coding
3280schemes.
3281.[
3282Dudley 1939
3283.]
3284.[
3285Blankenship 1978
3286.]
3287It is hard to see how the technique of linear prediction of speech,
3288which is described in detail in Chapter 6, could be accomplished in the
3289absence of digital processing.
3290Some groundwork is laid for the theory of digital signal analysis in this
3291chapter.
The ideas are not presented in a formal, axiomatic way, but are developed as
3293and when they are needed to examine some of the structures that turn out to be
3294useful in speech processing.
3295.pp
3296Most speech analysis views speech according to the source-filter model which
3297was introduced in Chapter 2, and aims to separate the effects of the source from
3298those of the filter.  The frequency spectrum of the vocal tract filter is of
3299great interest, and the technique of discrete Fourier transformation is
3300discussed in this chapter.  For many purposes it is better to extract the formant
3301frequencies from the spectrum and use these alone (or in conjunction with their
3302bandwidths) to characterize it.  As far as the signal source in the source-filter
3303model is concerned, its most interesting features are pitch and amplitude \(em the
3304latter being easy to estimate.  Hence we go on to look at pitch extraction.
3305Related to this is the problem of deciding whether a segment of speech has
3306voiced or unvoiced excitation, or both.
3307.pp
3308Estimating formant and pitch parameters is one of the messiest areas of
3309speech processing.  There is a delightful paper which points this out
3310(Schroeder, 1970), entitled "Parameter estimation in speech: a lesson in unorthodoxy".
3311.[
3312Schroeder 1970
3313.]
3314It emphasizes that the most successful estimation procedures "have often relied
3315on intuition based on knowledge of speech signals and their production in the
3316human vocal apparatus rather than routine applications of well-established
3317theoretical methods".
3318Fortunately, the emphasis of the present book is on speech
3319.ul
3320output,
3321which involves parameter estimation only in so far as it is needed to produce
3322coded speech for storage, and to illuminate the acoustic nature of speech
3323for the development of synthesis by rule from phonetics or text.
3324Hence the many methods of formant and pitch estimation are treated rather
3325cursorily and qualitatively here:  our main interest is in how to
3326.ul
3327use
3328such information for speech output.
3329.pp
3330If the incoming speech can be analysed into its formant frequencies, amplitude,
3331excitation mode, and pitch (if voiced), it is quite easy to resynthesize
3332it directly from these parameters.  Speech synthesizers are described in the
3333next chapter.  They can be realized in either analogue or digital
3334hardware, the former being predominant in production systems and the latter
3335in research systems \(em although, as in other areas of electronics, the balance
3336is changing in favour of digital implementations.
3337.sh "4.1  The channel vocoder"
3338.pp
3339A direct representation of the frequency spectrum of a signal can be obtained
3340by a bank of bandpass filters.  This is the basis of
3341the
3342.ul
3343channel vocoder,
3344which was the first device that attempted to take advantage of the source-filter
3345model for speech coding (Dudley, 1939).
3346.[
3347Dudley 1939
3348.]
3349The word "vocoder" is a contraction
3350of
3351.ul
3352vo\c
3353ice
3354.ul
3355coder.
3356The energy in each filter band is
3357estimated by rectification and smoothing, and the resulting approximation to
3358the frequency spectrum is transmitted or stored.  The source properties are
3359represented by the type of excitation (voiced or unvoiced), and if voiced,
3360the pitch.  It is not necessary to include the overall amplitude of the speech
3361explicitly, because this is conveyed by the energy levels from the separate
3362bandpass filters.
3363.pp
3364Figure 4.1 shows the encoding part of a channel vocoder which has been used
3365successfully for many years (Holmes, 1980).
3366.[
3367Holmes 1980 JSRU channel vocoder
3368.]
3369.FC "Figure 4.1"
3370We will discuss the block labelled "pre-emphasis" shortly.
3371The shape of the spectrum is estimated by 19 bandpass filters, whose spacing
3372and bandwidth decrease slightly with decreasing frequency to obtain the rather
3373greater resolution that is needed in the lower frequency region,
3374as shown in Table 4.1.
3375.RF
3376.nr x0 4n+2.6i+\w'\0\0'u+(\w'bandwidth'/2)
3377.nr x1 (\n(.l-\n(x0)/2
3378.in \n(x1u
3379.ta 4n +1.3i +1.3i
3380\l'\n(x0u\(ul'
3381.sp
3382.nr x1 (\w'channel'/2)
3383.nr x2 (\w'centre'/2)
3384.nr x3 (\w'analysis'/2)
3385	\0\h'-\n(x1u'channel	\0\h'-\n(x2u'centre	\0\0\h'-\n(x3u'analysis
3386.nr x1 (\w'number'/2)
3387.nr x2 (\w'frequency'/2)
3388.nr x3 (\w'bandwidth'/2)
3389	\0\h'-\n(x1u'number	\0\0\h'-\n(x2u'frequency	\0\0\h'-\n(x3u'bandwidth
3390.nr x2 (\w'(Hz)'/2)
3391		\0\h'-\n(x2u'(Hz)	\0\0\h'-\n(x2u'(Hz)
3392\l'\n(x0u\(ul'
3393.sp
3394	\01	\0240	\0120
3395	\02	\0360	\0120
3396	\03	\0480	\0120
3397	\04	\0600	\0120
3398	\05	\0720	\0120
3399	\06	\0840	\0120
3400	\07	1000	\0150
3401	\08	1150	\0150
3402	\09	1300	\0150
3403	10	1450	\0150
3404	11	1600	\0150
3405	12	1800	\0200
3406	13	2000	\0200
3407	14	2200	\0200
3408	15	2400	\0200
3409	16	2700	\0200
3410	17	3000	\0300
3411	18	3300	\0300
3412	19	3750	\0500
3413\l'\n(x0u\(ul'
3414.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
3415.in 0
3416.FG "Table 4.1  Filter specifications for a vocoder analyser (after Holmes, 1980)"
3417.[
3418Holmes 1980 JSRU channel vocoder
3419.]
3420The 3\ dB points
3421of adjacent filters are halfway between their centre frequencies, so that there
3422is some overlap between bands.
3423The filter characteristics do not need to have very sharp edges, because the energy
3424in neighbouring bands is fairly highly correlated.  Indeed, there is a
3425disadvantage in making them too sharp, because the phase delays associated
3426with sharp cutoff filters induce "smearing" of the spectrum in the time domain.
3427This particular channel vocoder uses second-order Butterworth bandpass filters.
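.pp
The analysis side of such a vocoder can be mimicked in a few lines of Python
with SciPy.  The sketch below is an illustration only, not the Holmes design:
each channel is a two-pole Butterworth bandpass filter whose output is
rectified and smoothed to give a slowly-varying energy contour, and just three
of the centre-frequency and bandwidth pairs of Table 4.1 are shown.
.LB
.nf
import numpy as np
from scipy import signal

def channel_energies(x, fs=8000, channels=((240, 120), (360, 120), (1000, 150))):
    # channels: (centre frequency, bandwidth) pairs from Table 4.1
    smooth = signal.butter(2, 50, btype="low", fs=fs)   # ~50 Hz smoothing filter
    outputs = []
    for centre, bw in channels:
        band = signal.butter(1, [centre - bw / 2, centre + bw / 2],
                             btype="bandpass", fs=fs)   # two-pole bandpass
        rectified = np.abs(signal.lfilter(band[0], band[1], x))
        outputs.append(signal.lfilter(smooth[0], smooth[1], rectified))
    return np.array(outputs)
.fi
.LE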
3428.pp
3429For regenerating speech stored in this way, an excitation of unit impulses
3430at the specified pitch period (for voiced sounds) or white noise (for unvoiced
3431sounds) is produced and passed through a bank of bandpass filters similar
3432to the analysis ones.  The excitation has a flat spectrum, for regular impulses
3433have harmonics at multiples of the repetition frequency which are all of the
3434same size, and so the spectrum of the output signal is completely determined
3435by the filter bank.  The gain of each filter is controlled by the stored
3436magnitude of the spectrum at that frequency.
3437.pp
3438The frequency spectrum and voicing pitch of speech change at much slower rates
3439than the time waveform.  The changes are due to movements of the articulatory
3440organs (tongue, lips, etc) in the speaker, and so are limited in their speed
3441by physical constraints.  A typical rate of production of phonemes is 15 per
3442second, but in fact the spectrum can change quite a lot within a single
3443phoneme (especially a stop sound).
3444Between 10 and 25\ msec (100\ Hz and 40\ Hz)
3445is generally thought to be a satisfactory interval for transmitting or storing
3446the spectrum, to preserve a reasonably faithful representation of the speech.
3447Of course, the entire spectrum, as well as the source characteristics, must
3448be stored at this rate.
3449The channel vocoder described by Holmes (1980) uses 48 bits to encode
3450the information.
3451.[
3452Holmes 1980 JSRU channel vocoder
3453.]
3454Repeated every 20\ msec, this gives a data rate of 2400\ bit/s \(em very
3455considerably less than any of the time-domain encoding techniques.
3456.pp
3457It needs some care to encode the output of 19 filters, the excitation type,
3458and the pitch into 48 bits of information.  Holmes uses 6 bits for pitch,
3459logarithmically encoded,
3460and one bit for excitation type.
3461This leaves 41 bits to encode the output of the 19 filters, and so a differential
3462technique is used which transmits just the difference between adjacent
3463channels \(em for the spectrum does not change abruptly in the frequency domain.
3464Three bits are used for the absolute level in channel 1, and two bits
3465for each channel-to-channel difference, giving a total of 39 bits for the whole
3466spectrum.  The remaining two bits per frame are reserved for signalling or
3467monitoring purposes.
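.pp
As an illustration of the bit budget (the ordering of fields within the frame
is our own assumption, made purely for the sketch), a 48-bit frame can be
packed as follows.
.LB
.nf
def pack_frame(pitch, voiced, level1, diffs, spare=0):
    # pitch: 6 bits, voiced: 1 bit, level1: 3 bits (channel 1 absolute level),
    # diffs: 18 two-bit channel-to-channel differences, spare: 2 bits
    assert len(diffs) == 18
    word = ((pitch & 63) << 42) | ((voiced & 1) << 41) | ((level1 & 7) << 38)
    for i, d in enumerate(diffs):
        word |= (d & 3) << (36 - 2 * i)
    return word | (spare & 3)     # 48 bits; 2400 bit/s at one frame per 20 msec
.fi
.LE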
3468.pp
3469A 2400 bit/s channel vocoder degrades the speech in a telephone channel quite
3470perceptibly.  It is sufficient for interactive communication, where
3471if you do not understand something you can always ask for it to be repeated.
3472It is probably not good enough for most voice response applications.
3473However, the vocoder principle can be used with larger filter banks and much
3474higher bit rates, and still reduce the data rate substantially below that
3475required by log PCM.
3476.sh "4.2  Pre-emphasis"
3477.pp
3478There is an
3479overall \-6\ dB/octave trend in speech radiated from the lips,
3480as frequency increases.
3481We will discuss why this is so in the next chapter.
3482Notice that this trend means that the signal power is reduced
by a factor of 4, or the signal amplitude by a factor of 2, for each
3484doubling in frequency.
3485For vocoders, and indeed for other methods of spectral analysis of speech,
3486it is usually desirable to equalize this by a +6\ dB/octave lift prior to
3487processing, so that the channel outputs occupy a similar range of levels.
3488On regeneration, the output speech is passed through an inverse filter which
3489provides 6\ dB/octave of attenuation.
3490.pp
3491For a digital system, such pre-emphasis
3492can either be implemented as an analogue circuit which precedes the presampling
3493filter and digitizer, or as a digital operation on the sampled and quantized
3494signal.  In the former case, the characteristic is usually flat up to a certain
3495breakpoint, which occurs somewhere between 100\ Hz and 1\ kHz \(em the exact
3496position does not seem to be critical \(em at which point the +6\ dB/octave lift
3497begins.  Although de-emphasis on output ought to have an exactly inverse
3498characteristic, it is sometimes modified or even eliminated altogether in an
3499attempt to counteract approximately
3500the  $sin( pi f/f sub s )/( pi f/f sub s )$  distortion
3501introduced by the desampling operation, which was discussed in an earlier
3502section.  Above half the sampling frequency, the characteristic of the
3503pre-emphasis is irrelevant because any effect will be suppressed by the presampling
3504filter.
3505.pp
3506The effect of a 6\ dB/octave lift can also be achieved digitally, by differencing
3507the input.  The operation
3508.LB
3509.EQ
3510y(n)~~ = ~~ x(n)~ -~ ax(n-1)
3511.EN
3512.LE
3513is suitable, where the constant parameter $a$ is usually chosen between 0.9 and 1.
3514The latter value gives straightforward differencing, and this amounts to
3515creating a DPCM signal as input to the spectral analysis.  Figure 4.2 plots
3516the frequency response of this operation, with a sample frequency of 8\ kHz,
3517for two values of the parameter; together with that of a 6\ dB/octave lift
3518above 100\ Hz.
3519.FC "Figure 4.2"
3520The vertical positions of the plots have been adjusted to give
3521the same gain, 20\ dB, at 1\ kHz.
3522The difference at 3.4\ kHz, the upper end of the telephone spectrum, is just
3523over 2\ dB.  At frequencies below the breakpoint, in this case 100\ Hz, the
3524difference between analogue and digital pre-emphasis can be very great.  For
3525$a=0.9$ the attenuation at DC (zero frequency) is 18\ dB below that at 1\ kHz,
3526which happens to be close to that of the analogue filter for frequencies below the
3527breakpoint.  However, if the breakpoint had been at 1\ kHz there would have been
352820\ dB difference between the analogue and $a=0.9$ plots at DC.  And of course
3529the $a=1$ characteristic has infinite attenuation at DC.
3530In practice, however, the exact form of the pre-emphasis does not seem to be at all
3531critical.
3532.pp
3533The above remarks apply only to voiced speech.  For unvoiced speech there appears
3534to be no real need for pre-emphasis; indeed, it may do harm by reinforcing
3535the already large high-frequency components.  There is a case for altering the
3536parameter $a$ according to the excitation mode of the speech:  $a=1$ for voiced
3537excitation and $a=0$ for unvoiced gives pre-emphasis just when it is needed.
3538This can be achieved by expressing the parameter in terms of the autocorrelation
3539of the incoming signal, as
3540.LB
3541.EQ
3542a ~~ = ~~ R(1) over R(0) ~ ,
3543.EN
3544.LE
3545where $R(1)$ is the correlation of the signal with itself delayed by one sample,
3546and $R(0)$ is the correlation without delay (that is, the signal variance).
3547This is reasonable intuitively because high sample-to-sample correlation
3548is to be expected in voiced speech, so that $R(1)$ is very nearly as great as
3549$R(0)$ and the ratio becomes 1; whereas little or no sample-to-sample correlation
3550will be present in unvoiced speech, making the ratio close to 0.  Such a
3551scheme is reminiscent of ADPCM with adaptive prediction.
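.pp
In code the adaptive choice of parameter is a one-line calculation (a sketch in
Python with NumPy; the frame is assumed to be a short block of samples).
.LB
.nf
import numpy as np

def preemphasis_parameter(frame):
    # a = R(1)/R(0): near 1 for a strongly correlated (voiced) frame,
    # near 0 when there is little sample-to-sample correlation
    frame = np.asarray(frame, dtype=float)
    return np.dot(frame[1:], frame[:-1]) / (np.dot(frame, frame) + 1e-12)

def preemphasize(frame, a):
    frame = np.asarray(frame, dtype=float)
    return frame - a * np.concatenate(([0.0], frame[:-1]))
.fi
.LE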
3552.pp
3553However, this sophisticated pre-emphasis method does not seem to be worthwhile
3554in practice.  Usually the breakpoint in an analogue pre-emphasis filter is
3555chosen to be rather greater than 100\ Hz to limit the amplification of fricative
3556energy.  In fact, the channel vocoder described by Holmes (1980) has the
3557breakpoint at 1\ kHz, limiting the gain to 12\ dB at 4\ kHz, two octaves above.
3558.[
3559Holmes 1980 JSRU channel vocoder
3560.]
3561.sh "4.3  Digital signal analysis"
3562.pp
3563You may be wondering how the frequency response for the digital pre-emphasis
3564filters, displayed in Figure 4.2, can be calculated.  Suppose a digitized
3565sinusoid is applied as input to the filter
3566.LB
3567.EQ
3568y(n) ~~ = ~~ x(n)~ - ~ax(n-1).
3569.EN
3570.LE
3571A sine wave of frequency $f$ has equation  $x(t) ~ = ~ sin ~ 2 pi ft$, and when
sampled at $t=0,~ T,~ 2T,~ ...$ (where $T$ is the sampling interval, 125\ $mu$sec for
3573an 8\ kHz sample rate), this becomes  $x(n) ~ = ~ sin ~ 2 pi fnT.$  It is much
3574more convenient to consider a complex exponential
3575input,  $e sup { j2 pi fnT}$  \(em the response to a sinusoid can then be derived
3576by taking imaginary parts, if necessary.  The output for this input is
3577.LB
3578.EQ
3579y(n) ~~ = ~~ e sup {j2 pi fnT} ~~-~ae sup {j2 pi f(n-1)T} ~~ = ~~
3580(1~-~ae sup {-j2 pi fT} )~e sup {j2 pi fnT} ,
3581.EN
3582.LE
3583a sinusoid at the same frequency as the input.  The
3584factor  $1~-~ae sup {-j2 pi fT}$  is complex, with both amplitude and phase
3585components.  Thus the output will be a phase-shifted and amplified version
3586of the input.  The amplitude response at frequency $f$ is therefore
3587.LB
3588.EQ
3589|1~ - ~ ae sup {-j2 pi fT} | ~~ = ~~
3590[1~ +~ a sup 2 ~-~ 2a~cos~2 pi fT ] sup 1/2 ,
3591.EN
3592.LE
3593or
3594.LB
3595.EQ
359610 ~ log sub 10 (1~ +~ a sup 2 ~ - ~ 2a~ cos 2 pi fT)
3597.EN
3598dB.
3599.LE
3600Normalizing to 20\ dB at 1\ kHz, and assuming 8\ kHz sampling, yields
3601.LB
3602.EQ
360320~ + ~~ 10~ log sub 10 (1~ +~ a sup 2 ~-~ 2a~ cos ~ { pi f} over 4000 )
3604~~ -~ 10~ log sub 10 (1~ +~ a sup 2 ~-~ 2a~ cos ~ pi over 4 )
3605.EN
3606dB.
3607.LE
3608With $a=0.9$ and 1 this gives the graphs of Figure 4.2.
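.pp
The curves can be reproduced directly from this expression (a short check in
Python with NumPy; the 8\ kHz sampling rate is the one assumed above).
.LB
.nf
import numpy as np

def preemphasis_response_db(f, a, fs=8000.0):
    # 20 dB at 1 kHz, plus 10 log10(1 + a^2 - 2a cos 2 pi fT) relative to 1 kHz
    def raw(freq):
        return 10 * np.log10(1 + a ** 2 - 2 * a * np.cos(2 * np.pi * freq / fs))
    return 20 + raw(f) - raw(1000.0)

# values at the top of the telephone band, for comparison with Figure 4.2
print(preemphasis_response_db(3400.0, 0.9), preemphasis_response_db(3400.0, 1.0))
.fi
.LE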
3609.pp
3610Frequency responses for analogue filters are often plotted with a logarithmic
3611frequency scale, as well as a logarithmic amplitude one, to bring out the
3612asymptotes in dB/octave as straight lines.  For digital filters the response
3613is usually drawn on a
3614.ul
3615linear
3616frequency axis extending to half the sampling frequency.  The response is
3617symmetric about this point.
3618.pp
3619Analyses like the above are usually expressed in terms of the $z$-transform.
3620Denote the unit delay operation by $z sup -1$.  The choice of the inverse rather
3621than $z$ itself is of course an arbitrary matter, but the convention has stuck.
3622Then the filter can be characterized
3623by Figure 4.3, which signifies that the output is the input minus a delayed
3624and scaled version of itself.
3625.FC "Figure 4.3"
3626The transfer function of the filter is
3627.LB
3628.EQ
3629H(z) ~~ = ~~ 1~ -~ az sup -1 ,
3630.EN
3631.LE
3632and we have seen that the effect of the system on a (complex) exponential of
3633frequency $f$ is to multiply it by
3634.LB
3635.EQ
36361~ -~ ae sup {-j2 pi fT}.
3637.EN
3638.LE
3639To get the frequency response from the transfer function, replace $z sup -1$
3640by $e sup {-j2 pi fT}$.  Amplitude and phase responses can then be found by
3641taking the modulus and angle of the complex frequency response.
3642.pp
3643If $z sup -1$ is treated as an
3644.ul
3645operator,
3646it is quite in order to summarize the action of the filter by
3647.LB
3648.EQ
3649y(n) ~~ = ~~ x(n)~ - ~az sup -1 x(n) ~~ = ~~ (1~ -~ az sup -1 )x(n).
3650.EN
3651.LE
3652However, it is usual to derive from the sequence $x(n)$ a
3653.ul
3654transform
3655$X(z)$ upon which $z sup -1$ acts as a
3656.ul
3657multiplier.
3658If the transform of $x(n)$ is defined as
3659.LB
3660.EQ
3661X(z) ~~ = ~~ sum from {n=- infinity} to infinity ~x(n) z sup -n ,
3662.EN
3663.LE
3664then on multiplication by $z sup -1$ we get a new transform, say $V(z)$:
3665.LB
3666.EQ
3667V(z) ~~ = ~~ z sup -1 X(z) ~~ =
3668~~ z sup -1 sum from {n=- infinity} to infinity ~x(n) z sup -n ~~ =
3669~~ sum ~x(n)z sup -n-1 ~~ =
3670~~ sum ~x(n-1)z sup -n .
3671.EN
3672.LE
3673$V(z)$ can also be expressed as the transform of a new sequence, say $v(n)$, by
3674.LB
3675.EQ
3676V(z) ~~ = ~~ sum from {n=- infinity} to infinity ~v(n) z sup -n ,
3677.EN
3678.LE
3679from which it becomes apparent that
3680.LB
3681.EQ
3682v(n) ~~ = ~~ x(n-1).
3683.EN
3684.LE
3685Thus $v(n)$ is a delayed version of $x(n)$, and we have accomplished what we
3686set out to do, namely to show that the delay
3687.ul
3688operator
3689$z sup -1$ can be treated as an ordinary
3690.ul
3691multiplier
3692in the $z$-transform domain, where $z$-transforms are defined as the infinite
3693sums given above.
3694.pp
3695In terms of $z$-transforms, the filter can be written
3696.LB
3697.EQ
3698Y(z) ~~ = ~~ (1~ -~ az sup -1 )X(z),
3699.EN
3700.LE
3701where $z sup -1$ is now treated as a multiplier.
3702The transfer function of the filter is
3703.LB
3704.EQ
3705H(z) ~~ = ~~ Y(z) over X(z) ~~ = ~~ 1 - az sup -1 ,
3706.EN
3707.LE
3708the ratio of the output to the input transform.
3709.pp
3710It may seem that little has been gained by inventing this rather abstract
3711notion of transform, simply to change an operator to a multiplier.  After
3712all, the equation of the filter is no simpler in the transform domain than
3713it was in the time domain using $z sup -1$ as an operator.  However, we will
3714need to go on to examine more complex filters.  Consider, for example, the
3715transfer function
3716.LB
3717.EQ
3718H(z) ~~ = ~~ {1~+~az sup -1 ~+~bz sup -2} over {1~+~cz sup -1 ~+~dz sup -2} ~ .
3719.EN
3720.LE
3721If $z sup -1$ is treated as an operator, it is not immediately obvious how
3722this transfer function can be realized by a time-domain recurrence relation.
3723However, with $z sup -1$ as an ordinary multiplier in the transform domain, we can
3724make purely mechanical manipulations with infinite sums to see what the transfer
3725function means as a recurrence relation.
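.pp
For instance, cross-multiplying gives
$Y(z)(1~+~cz sup -1 ~+~dz sup -2 ) ~=~ X(z)(1~+~az sup -1 ~+~bz sup -2 )$,
and reading $z sup -1$ as a one-sample delay turns this into the recurrence
$y(n) ~=~ x(n)~+~ax(n-1)~+~bx(n-2)~-~cy(n-1)~-~dy(n-2)$.
The sketch below (Python with SciPy; the coefficient values are arbitrary)
checks this recurrence against a library filtering routine.
.LB
.nf
import numpy as np
from scipy import signal

def second_order(x, a, b, c, d):
    # y(n) = x(n) + a x(n-1) + b x(n-2) - c y(n-1) - d y(n-2)
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = (x[n]
                + a * (x[n - 1] if n >= 1 else 0) + b * (x[n - 2] if n >= 2 else 0)
                - c * (y[n - 1] if n >= 1 else 0) - d * (y[n - 2] if n >= 2 else 0))
    return y

x = np.random.randn(100)
a, b, c, d = 0.5, 0.25, -0.9, 0.81             # arbitrary illustrative values
print(np.allclose(second_order(x, a, b, c, d),
                  signal.lfilter([1, a, b], [1, c, d], x)))   # prints True
.fi
.LE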
3726.pp
3727It is worth noting the similarity between the $z$-transform in the discrete
3728domain and the Fourier and Laplace transforms in the continuous domains.
3729In fact, the $z$-transform plays an analogous role in digital signal processing
3730to the Laplace transform in continuous theory, for the delay operator
3731$z sup -1$
3732performs a similar service to the differentiation operator $s$.
3733Recall first the continuous Fourier transform,
3734.LB
3735$
3736G(f) ~~ = ~~
3737integral from {- infinity} to infinity ~g(t)~e sup {-j2 pi ft} dt
3738$,    where $f$ is real,
3739.LE
3740and the Laplace transform,
3741.LB
3742$
3743F(s) ~~ = ~~
3744integral from 0 to infinity ~f(t)~e sup -st dt
3745$,    where $s$ is complex.
3746.LE
3747The main difference between these two transforms is that the range of integration
3748begins at -$infinity$ for the Fourier transform and at 0 for the Laplace.
3749Advocates of the Fourier transform, which typically include people involved with
3750telecommunications, enjoy the freedom from initial conditions which is bestowed
3751by an origin way back in the mists of time.  Advocates of Laplace, including
3752most analogue filter theorists, invariably
3753consider systems where all is quiet before $t=0$ \(em altering the origin
3754of measurement of time to achieve this if necessary \(em and welcome the opportunity
3755to include initial conditions explicitly
3756.ul
3757without
3758having to worry about what happens in the mists of time.
3759Although there is a two-sided Laplace transform where the integration begins
3760at -$infinity$, it is not generally used because it causes some convergence
3761complications.  Ignoring this difference between the transforms (by considering
3762signals which are zero when $t<0$), the Fourier spectrum can be found from the
3763Laplace transform by writing  $s=j2 pi f$; that is, by considering values
3764of $s$ which lie on the imaginary axis.
3765.pp
3766The $z$-transform is
3767.LB
3768$
3769H(z) ~~ = ~~ sum from n=0 to infinity ~h(n)~z sup -n
3770$,    or    $
3771H(z) ~~ = ~~ sum from {n=- infinity} to infinity ~h(n)~z sup -n ,
3772$
3773.LE
3774depending on whether a one-sided or two-sided transform is used.  The advantages
3775and disadvantages of one- and two-sided transforms are the same as in the
3776analogue case.
3777$z$ plays the role of $e sup sT $, and so it is not surprising that the response
3778to a (sampled) sinusoid input can be found by setting
3779.LB
3780.EQ
3781z ~~ = ~~ e sup {j2 pi fT}
3782.EN
3783.LE
3784in $H(z)$, as we proved explicitly above for the pre-emphasis filter.
3785.pp
3786The above relation between $z$ and $f$ means that real-valued frequencies correspond
3787to points where $|z|=1$, that is, the unit circle in the complex $z$-plane.
3788As you travel anticlockwise around this unit circle, starting from the
3789point $z=1$, the corresponding frequency increases from 0, to $1/2T$ half-way
3790round ($z=-1$), to $1/T$ when you get back to the beginning ($z=1$) again.
3791Frequencies greater than the sampling frequency are aliased back into the
3792sampling band, corresponding to further circuits of $|z|=1$ with frequency
3793going from $1/T$ to $2/T$, $2/T$ to $3/T$, and so on.  In fact, this is the circle
3794of Figure 3.3 which was used earlier to explain how sampling affects the frequency
3795spectrum!
3796.sh "4.4  Discrete Fourier transform"
3797.pp
3798Let us return from this brief digression into techniques of digital signal
3799analysis to the problem of determining the frequency spectrum of speech.
3800Although a bank of bandpass filters such as is used in the channel vocoder
is perhaps the most straightforward way to obtain a frequency spectrum,
3802there are other techniques which are in fact more commonly used in digital speech
3803processing.
3804.pp
3805It is possible to define the Fourier transform of a discrete sequence of
3806points.  To motivate the definition, consider first the
3807ordinary Fourier transform (FT), which is
3808.LB
3809$
3810g(t) ~~ = ~~
3811integral from {- infinity} to infinity ~G(f)~e sup {+j2 pi ft} df
3812~~~~~~~~~~~~~~~~
3813G(f) ~~ = ~~
3814integral from {- infinity} to infinity ~g(t)~e sup {-j2 pi ft} dt .
3815$
3816.LE
3817This takes a continuous time domain into a continuous frequency domain.
3818Sometimes you see a normalizing factor $1/2 pi$ multiplying the integral in
3819either the forward or the reverse transform.  This is only needed
3820when the frequency variable is expressed in radians/s, and we will find it
3821more convenient to express frequencies in\ Hz.
3822.pp
3823The Fourier series (FS), which should also be familiar to you,
3824operates on a periodic time waveform (or, equivalently,
3825one that only exists for a finite period of time, which is notionally extended
3826periodically).  If a period lies in the time range $[0,b)$, then the transform is
3827.LB
3828$
3829g(t) ~~ = ~~
3830sum from {r = - infinity} to infinity ~G(r)~e sup {+j2 pi rt/b}
3831~~~~~~~~~~~~~~~~
3832G(r) ~~ = ~~ 1 over b ~ integral from 0 to b ~g(t)~e sup {-j2 pi rt/b} dt .
3833$
3834.LE
3835The Fourier series takes a periodic time-domain function into a discrete frequency-domain one.
3836Because of the basic duality between the time and frequency domains in the
3837Fourier transforms, it is not surprising that another version of the transform
3838can be defined which takes a periodic
3839.ul
3840frequency\c
3841-domain function into a
3842discrete
3843.ul
3844time\c
3845-domain one.
3846.pp
3847Fourier transforms can only deal with a finite stretch of a time signal
3848by assuming that the signal is periodic, for if $g(t)$ is evaluated from
3849its transform $G(r)$ according to the formula above, and $t$ is chosen outside
3850the interval $[0,b)$, then a periodic extension of the function $g(t)$ is obtained
3851automatically.
3852Furthermore, periodicity in one domain implies discreteness in the other.
3853Hence if we transform a
3854.ul
3855finite
3856stretch of a
3857.ul
3858discrete
3859time waveform,
3860we get a frequency-domain representation which is also finite (or, equivalently,
3861periodic), and discrete.
3862This is the discrete Fourier transform (DFT),
3863and takes a discrete periodic time-domain function into a discrete
3864periodic frequency-domain one as illustrated in Figure 4.4.
3865.FC "Figure 4.4"
3866It is defined by
3867.LB
3868$
3869g(n) ~~ = ~~
38701 over N ~ sum from r=0 to N-1~G(r)~e sup { + j2 pi rn/N}
3871~~~~~~~~~~~~~~~~
3872G(r) ~~ = ~~ sum from n=0 to N-1 ~g(n)~e sup { - j2 pi rn/N} ,
3873$
3874.LE
3875or, writing  $W=e sup {-j2 pi /N}$,
3876.LB
3877$
3878g(n) ~~ = ~~
38791 over N ~ sum from r=0 to N-1~G(r)~W sup -rn
3880~~~~~~~~~~~~~~~~
3881G(r) ~~ = ~~ sum from n=0 to N-1 ~g(n)~W sup rn .
3882$
3883.LE
3884.sp
3885The $1/N$ in the first equation is the same normalizing
3886factor as the $1/b$ in the Fourier series,
3887for the finite time domain is $[0,N)$
3888in the discrete case and $[0,b)$ in the Fourier series case.
3889It does not matter
3890whether it is written into the forward or the reverse transform, but it is usually
3891placed as shown above as a matter of convention.
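.pp
To make the definition concrete, here is a minimal sketch of the forward and
inverse transforms in Python (NumPy is assumed for the complex arithmetic,
and the function names are illustrative only).  It evaluates the sums
directly, and so corresponds to the "slow" method; the fast algorithm
described later in this chapter computes exactly the same values.
.LB
.nf
import numpy as np

def dft(g):
    # G(r) = sum over n of g(n) * W**(r*n),  where W = exp(-j 2 pi / N)
    N = len(g)
    W = np.exp(-2j * np.pi / N)
    return np.array([sum(g[n] * W**(r * n) for n in range(N))
                     for r in range(N)])

def idft(G):
    # g(n) = (1/N) * sum over r of G(r) * W**(-r*n)
    N = len(G)
    W = np.exp(-2j * np.pi / N)
    return np.array([sum(G[r] * W**(-r * n) for r in range(N))
                     for n in range(N)]) / N
.fi
.LE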
3892.pp
3893As illustrated by Figure 4.5, discrete Fourier transforms
3894take an input of $N$ real values, representing equally-spaced time samples
3895in the interval $[0,b)$, and produce as output $N$ complex values, representing
3896equally-spaced frequency samples in the interval $[0,N/b)$.
3897.FC "Figure 4.5"
3898Note that the end-point of this frequency interval is the sampling frequency.
3899It seems odd that the input is real and the output is the same number of
3900.ul
3901complex
3902quantities:  we seem to be getting some numbers for nothing!
3903However, this isn't so, for it is easy to show that if the input sequence is
3904real, the output frequency
3905spectrum has a symmetry about its mid-point (half the sampling frequency).
3906This can be expressed as
3907.LB
3908DFT symmetry:\0\0\0\0\0\0 $
3909~ mark G( half N +r) ~=~ G( half N -r) sup *$  if $g$ is real-valued,
3910.LE
3911where $*$ denotes the conjugate of a complex quantity
3912(that is, $(a+jb) sup * = a-jb$).
3913.pp
3914It was argued above that the frequency spectrum in the DFT is periodic, with
3915the spectrum from 0 to the sampling frequency being repeated regularly up and
down the frequency axis.  It can easily be seen from the DFT equation that
this is so.  The property can be written
3918.LB
3919DFT periodicity:$ lineup G(N+r) ~=~ G(r)$  always.
3920.LE
3921Figure 4.6 illustrates the properties of symmetry and periodicity.
3922.FC "Figure 4.6"
3923.sh "4.5  Estimating the frequency spectrum of speech using the DFT"
3924.pp
3925Speech signals are not exactly periodic.  Although the waveform in a particular
3926pitch period will usually resemble those in the preceding and following pitch
3927periods, it will certainly not be identical to them.
3928As the articulation of the speech changes, the formant positions will alter.
3929As we saw in Chapter 2, the pitch itself is certainly not constant.
3930Hence the fundamental assumption of the DFT, that the waveform is periodic,
3931is not really justified.  However, the signal is quasi-periodic, for changes
3932from period to period will not usually be very great.  One way of computing
3933the short-term frequency spectrum of speech is to use
3934.ul
3935pitch-synchronous
3936Fourier transformation, where single pitch periods are isolated from the
3937waveform and processed with the DFT.  This gives a rather accurate estimate
3938of the spectrum.  Unfortunately, it is difficult to determine the beginning
3939and end of each pitch cycle, as we shall see later in this chapter when
3940discussing pitch extraction techniques.
3941.pp
3942If a finite stretch of a speech waveform is isolated and Fourier transformed,
3943without regard to pitch of the speech, then the periodicity assumption will
3944be grossly violated.  Figure 4.7 illustrates that the effect is the same
3945as
3946multiplying the signal by a rectangular
3947.ul
3948window function,
3949which is 0 except during the period to be analysed, where it is 1.
3950.FC "Figure 4.7"
3951The windowed sequence will almost certainly have discontinuities at its edges,
3952and these will affect the resulting spectrum.  The effect can be analysed
3953quite easily, but we will not do so here.  It is enough to say that the
3954high frequencies associated with the edges of the window cause considerable
3955distortion of the spectrum.  The effect can be alleviated by
3956using a smoother window than a rectangular one,
3957and several have been investigated extensively.  The commonly-used windows of
3958Bartlett, Blackman, and Hamming are illustrated in Figure 4.8.
3959.FC "Figure 4.8"
3960.pp
3961Because the DFT produces the same number of frequency samples, equally spaced,
3962as there were points in the time waveform, there is a tradeoff between
3963frequency resolution and time resolution (for a given sampling rate).
3964For example, a 256-point transform with a sample rate of 8\ kHz gives the 256
3965equally-spaced frequency components between 0 and 8\ kHz that are shown in Table
39664.2.
3967.RF
3968.nr x0 (\w'time domain'/2)
3969.nr x1 (\w'frequency domain'/2)
3970.in+1.0i
3971.ta 1.0i 3.0i 4.0i
3972\h'0.5i+2n-\n(x0u'time domain\h'|3.5i+2n-\n(x1u'frequency domain
3973.sp
3974sample	time	sample	\h'-3n'frequency
3975number		number
3976.nr x0 1i+\w'00000'
3977\l'\n(x0u\(ul'	\l'\n(x0u\(ul'
3978.sp
3979\0\0\00	\0\0\0\00 $mu$sec	\0\0\00	\0\0\0\00 Hz
3980\0\0\01	\0\0125	\0\0\01	\0\0\031
3981\0\0\02	\0\0250	\0\0\02	\0\0\062
3982\0\0\03	\0\0375	\0\0\03	\0\0\094
3983\0\0\04	\0\0500	\0\0\04	\0\0125
3984.nr x2 (\w'...'/2)
3985\h'0.5i+4n-\n(x2u'...\h'|3.5i+4n-\n(x2u'...
3986\h'0.5i+4n-\n(x2u'...\h'|3.5i+4n-\n(x2u'...
3987\h'0.5i+4n-\n(x2u'...\h'|3.5i+4n-\n(x2u'...
3988.sp
3989\0254	31750	\0254	\07938
3990\0255	31875 $mu$sec	\0255	\07969 Hz
3991\l'\n(x0u\(ul'	\l'\n(x0u\(ul'
3992.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
3993.in 0
3994.MT 2
3995Table 4.2  Time domain and frequency domain samples for a 256-point DFT,
3996with 8\ kHz sampling
3997.TE
3998The top half of the frequency spectrum is of no interest, because
3999it contains the complex conjugates of the bottom half (in reverse order),
4000corresponding to frequencies greater than half the sampling frequency.
4001Thus for a 30\ Hz resolution in the frequency domain,
4002256 time samples, or a 32\ msec stretch of speech, needs to be transformed.
4003A common technique is to take overlapping periods in the time domain to
4004give a new frequency spectrum every 16\ msec.  From the acoustic point
4005of view this is a reasonable rate to re-compute the spectrum, for as noted
4006above when discussing channel vocoders the rate of change in the spectrum
is limited by the speed at which the speaker can move his vocal organs, and
4008anything between 10 and 25\ msec is a reasonable figure for transmitting
4009or storing the spectrum.
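.pp
The framing arithmetic can be sketched as follows (Python with NumPy is
assumed; the frame length, step and window are simply the figures quoted
above, and the function name is illustrative).  A Hamming window is used
here, but any of the windows of Figure 4.8 would do.
.LB
.nf
import numpy as np

def short_time_spectra(x, N=256, step=128):
    # 256 samples at 8 kHz is 32 msec; stepping by 128 samples gives a new
    # spectrum every 16 msec.  Only bins 0 to N/2 are kept, since the upper
    # half of each spectrum contains the complex conjugates of the lower half.
    x = np.asarray(x, dtype=float)
    window = np.hamming(N)
    frames = [x[m:m + N] * window for m in range(0, len(x) - N + 1, step)]
    return np.array([np.fft.fft(frame)[:N // 2 + 1] for frame in frames])
.fi
.LE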
4010.pp
4011The DFT is a complex transform, and speech is a real signal.  It is possible
4012to do two DFT's at once by putting one time waveform into the real parts
4013of the input and another into the imaginary parts.  This destroys the DFT
4014symmetry property, for it only holds for real inputs.  But given the DFT
4015of a complex sequence formed in this way, it is easy to separate out the
4016DFT's of the two real time sequences.  If the two time sequences are
4017$x(n)$ and $y(n)$, then the transform of the complex sequence
4018.LB
4019.EQ
4020g(n) ~~ = ~~ x(n) ~+~ jy(n)
4021.EN
4022.LE
4023is
4024.LB
4025.EQ
G(r) ~~ = ~~ sum from n=0 to N-1 ~[x(n)W sup rn ~+~ jy(n)W sup rn ] .
4027.EN
4028.LE
It follows that the complex conjugates of the aliased parts of the spectrum,
in the upper frequency region, are
4031.LB
4032.EQ
G(N-r) sup * ~~ = ~~ sum from n=0 to N-1 ~[x(n)W sup -(N-r)n
~-~ jy(n)W sup -(N-r)n ] ,
4035.EN
4036.LE
4037and this is the same as
4038.LB
4039.EQ
G(N-r) sup * ~~ = ~~ sum from n=0 to N-1 ~[x(n)W sup rn
~-~ jy(n)W sup rn ] ,
4042.EN
4043.LE
4044because $W sup N$ is 1 (recall the definition of $W$),
4045and so $W sup -Nn$ is 1 for any $n$.
4046Thus
4047.LB
4048.EQ
4049X(r) ~~ = ~~ {G(r) ~+~ G(N-r) sup * } over 2
4050~~~~~~~~~~~~~~~~
Y(r) ~~ = ~~ {G(r) ~-~ G(N-r) sup * } over 2j
4052.EN
4053.LE
4054extracts the transforms $X(r)$ and $Y(r)$ of the original sequences
4055$x$ and $y$.
4056.pp
4057With speech, this trick is frequently used to calculate two spectra at once.
4058Using 256-point transforms, a new estimate of the spectrum can be obtained
4059every 16\ msec by taking overlapping 32\ msec stretches of speech, with a
4060computational requirement of one 256-point transform every 32\ msec.
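.pp
A sketch of this packing trick, using NumPy's FFT routine and the relations
just derived, might look like this (the names are illustrative):
.LB
.nf
import numpy as np

def two_real_transforms(x, y):
    # Pack two real frames into one complex sequence g(n) = x(n) + j y(n),
    # transform it once, and untangle the two spectra afterwards:
    #     X(r) = (G(r) + G(N-r)*) / 2
    #     Y(r) = (G(r) - G(N-r)*) / 2j
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(x)
    G = np.fft.fft(x + 1j * y)
    G_rev = np.conj(G[(N - np.arange(N)) % N])   # G(N-r)*, with G(N) read as G(0)
    return (G + G_rev) / 2, (G - G_rev) / 2j
.fi
.LE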
4061.sh "4.6  The fast Fourier transform"
4062.pp
4063Straightforward calculation of the DFT, expressed as
4064.LB
4065.EQ
4066G(r) ~~ = ~~ sum from n=0 to N-1 ~g(n)~W sup nr ,
4067.EN
4068.LE
4069for $r=0,~ 1,~ 2,~ ...,~ N-1$, takes $N sup 2$ operations, where each operation
4070is a complex multiply and add (for $W$ is, of course, a complex number).
4071There is a better way, invented in the early sixties, which reduces this to
4072$N ~ log sub 2 N$ operations \(em a very considerable improvement.
4073Dubbed the "fast Fourier transform" (FFT) for historical reasons, it would actually
4074be better called the "Fourier transform", with the straightforward method above
4075known as the "slow Fourier transform"!  There
4076is no reason nowadays to use the slow method, except for tiny transforms.
4077It is worth describing the basic principle of the FFT, for it is surprisingly
4078simple.  More details on actual implementations can be found in Brigham (1974).
4079.[
4080Brigham 1974
4081.]
4082.pp
4083It is important to realize that the FFT involves no approximation.
4084It is an
4085.ul
4086exact
4087calculation of the values that would be obtained by the slow method
4088(although it may be affected differently by round-off errors).
4089Problems of aliasing and windowing occur in all discrete Fourier transforms,
4090and they are neither alleviated nor exacerbated by the FFT.
4091.pp
4092To gain insight into the working of the FFT, imagine the sequence $g(n)$ split
4093into two halves, containing the even and odd points
4094respectively.
4095.LB
4096even half $e(n)$ is $g(0)~ g(2)~ .~ .~ .~ g(N-2)$
4097.br
4098odd  half $o(n)$ is $g(1)~ g(3)~ .~ .~ .~ g(N-1)$.
4099.LE
4100Then it is easy to show that if $G$ is the transform of $g$,
4101$E$ the transform of $e$,
4102and $O$ that of $o$, then
4103.LB
4104$
4105G(r) ~~ = ~~ E(r) ~+~ W sup r O(r)$  for  $r=0,~ 1,~ ...,~ half N -1$,
4106.LE
4107and
4108.LB
4109$
4110G( half N +r ) ~~ = ~~ E(r) ~+~ W sup { half N +r} O(r)$  for  $
4111r = 0,~ 1,~ ...,~ half N -1$.
4112.LE
4113Calculation of the $E$ and $O$ transforms involves $( half N) sup 2$ operations each,
4114while combining them together according to the above relationship occupies
4115$N$ operations.  Thus the total is  $N + half N sup 2 $  operations, which is considerably
4116less than $N sup 2$.
4117.pp
4118But don't stop there!  The even half can itself be broken down into
4119even and odd parts to expedite its calculation, and the same with the odd half.
4120The only constraint is that the number of elements in the sequences splits
4121exactly into two at each stage.
4122Providing $N$ is a power of 2, then, we are left at the end with some 1-point
4123transforms to do.  But transforming a single point leaves it unaffected!  (Check
4124the definition of the DFT.)  A quick calculation shows that the number of operations
4125needed is not  $N + half N sup 2$, but $N~ log sub 2 N$.
4126Figure 4.9 compares this with $N sup 2$, the number of operations for
4127straightforward DFT calculation, and it can be seen that the FFT is very much
4128faster.
4129.FC "Figure 4.9"
4130.pp
4131The only restriction on the use of the FFT is that $N$ must be a power of two.
4132If it is not, alternative, more complicated, algorithms can be used which
4133give comparable computational advantages.  However, for speech processing
4134the number of samples that are transformed is usually arranged to be a power
of two.  If a pitch-synchronous analysis is undertaken, the
4136time stretch that is to be transformed is dictated by the length of the pitch
4137period, and will vary from time to time.  Then, it is usual to pad out the
4138time waveform with zeros to bring the number of samples up to a power of two;
4139otherwise, if different-length time stretches were transformed the scale
4140of the resulting frequency components would vary too.
4141.pp
4142The FFT provides very worthwhile cost savings over the use of a bank of
4143bandpass filters for spectral analysis.  Take the example of a 256-point
4144transform with 8\ kHz sampling, giving 128 frequency components spaced
4145by 31.25\ Hz from 0 up to almost 4\ kHz.  This can be computed on overlapping
414632\ msec stretches of the time waveform, giving a new spectrum every 16\ msec,
4147by a single FFT calculation every 32\ msec (putting successive pairs of
4148time stretches in the real and imaginary parts of the complex input sequence,
4149as described earlier).  The FFT algorithm requires $N~ log sub 2 N$ operations,
4150which is 2048 when $N=256$.  An additional 512 operations are required
4151for the windowing calculation.  Repeated every 32\ msec, this gives
a rate of 80,000 operations per second.  Even to achieve a much lower frequency
resolution with 20 bandpass filters, each of which is fourth-order,
requires a great many more operations.  Each filter will need between 4 and 8
4155multiplications per sample, depending on its exact digital implementation.  But new
4156samples appear every 125
4157.ul
4158micro\c
4159seconds, and so somewhere around a million
4160operations will be required every second.
4161If we increased the frequency resolution to that obtained by the FFT, 128
filters would be needed, requiring between 4 and 8 million operations per second!
4163.sh "4.7  Formant estimation"
4164.pp
4165Once the frequency spectrum of a speech signal has been calculated, it may
4166seem a simple matter to estimate the positions of the formants.  But it is
4167not!  Spectra obtained in practice are not usually like the idealized ones
4168of Figure 2.2.  One reason for this is that, unless the analysis is
4169pitch-synchronous, the frequency spectrum of the excitation source is mixed
4170in with that of the vocal tract filter.  There are other reasons, which will
4171be discussed later in this section.  But first, let us consider how to
4172extract the vocal tract filter characteristics from the combined spectrum
4173of source and filter.  To do so we must begin to explore the theory of linear
4174systems.
4175.rh "Discrete linear systems."
4176Figure 4.10 shows an input signal exciting a filter to produce an output
4177signal.
4178.FC "Figure 4.10"
4179For present purposes, imagine the input to be a glottal
4180waveform, the filter a vocal tract one, and the output a
4181speech signal (which is then subjected to high-frequency de-emphasis
4182by radiation from the lips).
4183We will consider here
4184.ul
4185discrete
4186systems, so that the input $x(n)$ and output $y(n)$ are sampled signals,
4187defined only when $n$ is integral.  The theory is quite similar for continuous
4188systems.
4189.pp
4190Assume that the system is
4191.ul
4192linear,
4193that is, if input $x sub 1 (n)$ produces output $y sub 1 (n)$ and
4194input $x sub 2 (n)$ produces output $y sub 2 (n)$,
4195then the sum of $x sub 1 (n)$ and
4196$x sub 2 (n)$ will produce the sum of $y sub 1 (n)$ and $y sub 2 (n)$.
4197It is easy to show from this that, for any constant multiplier $a$,
4198the input $ax(n)$ will produce output $ay(n)$ \(em it is pretty obvious
4199when $a=2$,
4200or indeed any positive integer; for then $ax(n)$ can be written as
4201$x(n)+x(n)+...$ .
4202Assume further that the system is
4203.ul
4204time-invariant,
4205that is, if input $x(n)$
4206produces output $y(n)$ then a time-shifted version of $x$,
4207say $x(n+n sub 0 )$ for
4208some constant $n sub 0$, will produce the same output, only time-shifted; namely
4209$y(n+n sub 0)$.
4210.pp
4211Now consider the discrete delta function $delta (n)$, which is 0 except at
4212$n=0$ when it is 1.
4213If this single impulse is presented as input to the system, the output is called
4214the
4215.ul
4216impulse response,
4217and will be denoted by $h(n)$.
4218The fact that the system is time-invariant guarantees that the response does
4219not depend upon the particular time at which the impulse occurred, so that,
4220for example, the impulsive input $delta (n+n sub 0 )$ will produce output
4221$h(n+n sub 0 )$.
4222A delta-function input and corresponding impulse response are shown in Figure
42234.10.
4224.pp
4225The impulse response of a linear, time-invariant system is an extremely useful
4226thing to
4227know, for it can be used to calculate the output of the system for any input
4228at all!  Specifically, an input signal $x(n)$ can be written
4229.LB
4230.EQ
4231x(n)~ = ~~ sum from {k=- infinity} to infinity ~ x(k) delta (n-k) ,
4232.EN
4233.LE
4234because $delta (n-k)$ is non-zero only when $k=n$, and so for any
4235particular value of $n$, the summation contains only
4236one non-zero term \(em that is, $x(n)$.
4237The action of the system on each term of the sum is to produce an output
4238$x(k)h(n-k)$, because $x(k)$ is just a constant, and
4239the system is linear.
4240Furthermore, the complete input $x(n)$ is just the sum of such terms, and since
4241the system is linear, the output is the sum of $x(k)h(n-k)$.
4242Hence the response of the system to an arbitrary input is
4243.LB
4244.EQ
4245y(n)~ = ~~ sum from {k=- infinity} to infinity ~ x(k) h(n-k) .
4246.EN
4247.LE
4248This is called a
4249.ul
4250convolution sum,
4251and is sometimes written
4252.LB
4253.EQ
4254y(n)~ =~ x(n) ~*~ h(n).
4255.EN
4256.LE
4257.pp
Let's write this in terms of $z$-transforms.  The (two-sided) $z$-transform of $y(n)$
4259is
4260.LB
4261.EQ
4262Y(z)~ = ~~ sum from {n=- infinity} to infinity ~y(n)z sup -n ~~ =
~~ sum from n ~ sum from k ~x(k)h(n-k) ~z sup -n .
4264.EN
4265.LE
4266Writing $z sup -n$ as  $z sup -(n-k) z sup -k$,  and interchanging the order
4267of summation, this becomes
4268.LB
4269.EQ
4270Y(z)~ mark = ~~ sum from k ~[~ sum from n ~ h(n-k)z sup -(n-k) ~]~x(k)z sup -k
4271.EN
4272.br
4273.EQ
lineup = ~~ sum from k ~H(z)~x(k)~z sup -k ~~ = ~~ H(z)~ sum from k ~x(k)z sup
-k ~~=~~H(z)X(z) .
4276.EN
4277.LE
4278Thus convolution in the time domain is the same as multiplication in the
4279$z$-transform domain; a very important result.  Applied to the linear system of
4280Figure 4.10, this means that the output $z$-transform is the input $z$-transform
4281multiplied by the $z$-transform of the system's impulse response.
4282.pp
4283What we really want to do is to relate the frequency spectrum of
4284the output to the response of the system and the spectrum of the
4285input.
4286In fact, frequency spectra are very closely connected with $z$-transforms.  A
4287periodic signal $x(n)$ which repeats every $N$ samples has DFT
4288.LB
4289.EQ
4290sum from n=0 to N-1 ~x(n)~e sup {-j2 pi rn/N} ,
4291.EN
4292.LE
4293and its $z$-transform is
4294.LB
4295.EQ
4296sum from {n=- infinity} to infinity ~x(n) ~z sup -n .
4297.EN
4298.LE
4299Hence the DFT is the same as the $z$-transform of a single cycle of the signal,
4300evaluated at the points  $z= e sup {j2 pi r/N}$  for $r=0,~ 1,~ ...~ ,~ N-1$.
4301In other
4302words, the frequency components are samples of the $z$-transform at $N$
4303equally-spaced points around the unit circle.
4304Hence the frequency spectrum at the output of a linear system is the product of
4305the
4306input spectrum and the frequency response of the system itself (that is, the
4307transform of its impulse response function).
4308It should be admitted that this statement is somewhat questionable,
4309because to get from $z$-transforms to DFT's we have assumed that
4310a single cycle only is transformed \(em and the impulse response function of
4311a system is not necessarily periodic.  The real action of the system is
4312to multiply $z$-transforms, not DFT's.  However, it is useful in imagining
4313the behaviour of the system to think in terms of products of DFT's; and in
4314practice it is always these rather than $z$-transforms which are computed
4315because of the existence of the FFT algorithm.
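.pp
The relationship is easy to verify numerically.  In the following fragment
(Python with NumPy; the signals are arbitrary) the transforms are zero-padded
to the length of the full convolution, which sidesteps the periodicity
problem just mentioned:
.LB
.nf
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # an arbitrary input sequence
h = np.array([1.0, -0.95])              # impulse response of a simple filter
L = len(x) + len(h) - 1                 # length of the full convolution
direct  = np.convolve(x, h)             # the convolution sum
via_fft = np.real(np.fft.ifft(np.fft.fft(x, L) * np.fft.fft(h, L)))
assert np.allclose(direct, via_fft)     # the two agree
.fi
.LE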
4316.pp
4317Figure 4.11 shows the frequency spectrum of a typical voiced speech signal.
4318.FC "Figure 4.11"
4319The overall shape shows humps at the formant positions, like those in the
4320idealized Figure 2.2.  However, superimposed on this is an "oscillation"
4321(in the frequency domain!) at the pitch frequency.  This occurs because the
4322transform of the vocal tract filter has been multiplied by that of the
4323pitch pulse, the latter having components at harmonics of the pitch frequency.
4324The oscillation must be suppressed before the formants
4325can be estimated to any degree of accuracy.
4326.pp
4327One way of eliminating the oscillation is to perform pitch-synchronous
4328analysis.
4329This removes the influence of pitch from the frequency domain by dealing with
4330it in the time domain!  The snag is, of course, that it is not easy to estimate
4331the pitch frequency:  some techniques for doing so are discussed in the next
4332main section.
4333Another way is to use linear predictive analysis, which really does get rid
4334of pitch information without having to estimate the pitch period first.  A
4335smooth
4336frequency spectrum can be produced using the analysis techniques described in
4337Chapter 6, which provides
4338a suitable starting-point for formant frequency estimation.
4339The third method is to remove the pitch ripple from the frequency spectrum
4340directly.  This will be discussed in an intuitive rather than a
4341theoretical way, because linear predictive methods are becoming dominant
4342in speech processing.
4343.rh "Cepstral processing of speech."
4344Suppose the frequency spectrum of Figure 4.11 were actually a time waveform.
4345To remove the high-frequency pitch ripple is easy:  just filter it out!
4346However,
4347filtering removes
4348.ul
4349additive
4350ripples, whereas this is a
4351.ul
4352multiplicative
4353ripple.  To turn multiplication into addition, take logarithms.  Then the
4354procedure would be
4355.LB
4356.NP
4357compute the DFT of the speech waveform (windowed, overlapped);
4358.NP
4359take the logarithm of the transform;
4360.NP
4361filter out the high-frequency part, corresponding to pitch ripple.
4362.LE
4363.pp
4364Filtering is often best done using the DFT.  If the rippled waveform of Figure
43654.11 is transformed, a strong component could be expected at the ripple
4366frequency, with weaker ones at its harmonics.  These components can be
4367simply removed by setting them to zero, and inverse-transforming the result
4368to give a smoothed version of the original frequency spectrum.
4369A spectrum of the logarithm of a frequency spectrum is often called a
4370.ul
4371cepstrum
4372\(em a sort of backwards spectrum.  The horizontal axis of the cepstrum,
4373having the dimension of time, is called "quefrency"!  Note that high-frequency
4374signals have low quefrencies and vice versa.  In practice,
4375because the pitch ripple is usually well above the quefrency of interest for
4376formants, the upper end of the cepstrum is often simply cut off from a fixed
4377quefrency which corresponds to the maximum pitch expected.  However, identifying
4378the pitch peaks of the cepstrum has the useful byproduct of giving the pitch
4379period of the original speech.
4380.pp
4381To summarize, then, the procedure for spectral smoothing by the cepstral method
4382is
4383.LB
4384.NP
4385compute the DFT of the speech waveform (windowed, overlapped);
4386.NP
4387take the logarithm of the transform;
4388.NP
4389take the DFT of this log-transform, calling it the cepstrum;
4390.NP
identify the lowest-quefrency peak in the cepstrum as the pitch,
4392confirming it by examining its harmonics, which should be
4393equally spaced at the pitch quefrency;
4394.NP
4395remove pitch effects from the cepstrum by cutting off its high-quefrency
4396part above either the pitch quefrency or some constant representing the maximum
4397expected pitch (which is the minimum expected pitch quefrency);
4398.NP
4399inverse DFT the resulting cepstrum to give a smoothed spectrum.
4400.LE
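.pp
A sketch of this procedure in Python (NumPy is assumed) is given below.  For
simplicity it uses a fixed cut-off corresponding to a maximum expected pitch
of 400\ Hz, rather than locating the pitch peak itself; the names and figures
are illustrative only.
.LB
.nf
import numpy as np

def cepstrally_smoothed_spectrum(frame, fs=8000, max_pitch_hz=400):
    # DFT of the windowed frame, then the log of its magnitude
    N = len(frame)
    log_mag = np.log(np.abs(np.fft.fft(frame * np.hamming(N))) + 1e-10)
    # the DFT of the log spectrum is the cepstrum
    cepstrum = np.fft.fft(log_mag)
    # cut off everything above the minimum expected pitch quefrency
    cutoff = int(fs / max_pitch_hz)
    lifter = np.zeros(N)
    lifter[:cutoff] = 1.0
    lifter[N - cutoff + 1:] = 1.0        # keep the symmetric upper end too
    # inverse DFT gives a smoothed log spectrum
    return np.real(np.fft.ifft(cepstrum * lifter))
.fi
.LE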
4401.rh "Estimating formant frequencies from smoothed spectra."
4402The difficulties of formant extraction are not over even when a smooth frequency
4403spectrum has been obtained.  A simple peak-picking algorithm which identifies
4404a peak at the $k$'th frequency component whenever
4405.LB
4406$
4407X(k-1) ~<~ X(k)
4408$  and  $
4409X(k) ~>~ X(k+1)
4410$
4411.LE
4412will quite often identify formants incorrectly.
4413It helps to specify in advance minimum and maximum formant frequencies \(em say
4414100\ Hz and 3\ kHz for three-formant identification, and ignore peaks lying
4415outside these limits.  It helps to estimate
4416the bandwidth of the peaks and reject those with bandwidths greater than
4417500\ Hz \(em for real formants are never this wide.  However, if two formants are
4418very close, then they may appear as a single, wide, peak and be rejected by
4419this criterion.  It is usual to take account of formant positions identified
4420in previous frames under these conditions.
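.pp
The simple peak-picker, with the frequency limits just mentioned, amounts to
no more than the following (plain Python; bandwidth checking and the use of
neighbouring frames are left out):
.LB
.nf
def pick_peaks(X, bin_hz, f_min=100.0, f_max=3000.0):
    # X is a smoothed magnitude spectrum; a peak is any k with
    # X(k-1) < X(k) > X(k+1) whose frequency lies between f_min and f_max.
    peaks = []
    for k in range(1, len(X) - 1):
        f = k * bin_hz
        if X[k - 1] < X[k] > X[k + 1] and f_min <= f <= f_max:
            peaks.append(f)
    return peaks
.fi
.LE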
4421.pp
4422Markel and Gray (1976) describe in detail several estimation algorithms.
4423.[
4424Markel Gray 1976 Linear prediction of speech
4425.]
Their simplest algorithm uses the number of peaks identified in the raw spectrum
(under 3\ kHz, and with
bandwidths of less than 500\ Hz) to determine what to do.  If exactly three
4429peaks are found, they are used as the formant positions.  It is claimed that
4430this happens about 85% to 90% of the time.
4431If only one peak is found, the present frame is ignored and the
4432previously-identified
4433formant positions are used (this happens less than 1% of the time).
4434The remaining cases are two peaks \(em corresponding to omission of one formant \(em
4435and four peaks \(em corresponding to an extra formant being included.  (More
4436than
4437four peaks never occurred in their data.)  Under these conditions,
4438a nearest-neighbour measure is used for disambiguation.  The measure is
4439.LB
4440.EQ
4441v sub ij ~ = ~ |{ F sup * } sub i (k) ~-~ F sub j (k-1)| ,
4442.EN
4443.LE
where $F sub j (k-1)$ is the $j$'th formant frequency defined
in the previous frame
4446$k-1$ and ${ F sup * } sub i (k)$ is the $i$'th raw data frequency estimate
4447for frame $k$.
4448If two peaks only are found, this measure is used to identify
4449the closest peaks in the previous frame; and then the
4450third peak of that frame is taken to be the missing formant
4451position.  If four peaks are found, the measure is used to
4452determine which of them is furthest from the previous formant
4453values, and this one is discarded.
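.pp
One way of realizing this bookkeeping is sketched below (plain Python; the
frequencies are in Hz, and the handling of the two- and four-peak cases
follows the description above, though the details are guesses rather than
Markel and Gray's own procedure):
.LB
.nf
def three_formants(peaks, previous):
    # peaks:    raw peak frequencies found in the current frame
    # previous: the three formant frequencies from the previous frame
    if len(peaks) == 3:
        return sorted(peaks)
    if len(peaks) == 2:
        # the previous formant least well matched by any current peak
        # supplies the missing value
        missing = max(previous, key=lambda f: min(abs(f - p) for p in peaks))
        return sorted(peaks + [missing])
    if len(peaks) == 4:
        # discard the peak furthest from all the previous formant values
        extra = max(peaks, key=lambda p: min(abs(p - f) for f in previous))
        return sorted(p for p in peaks if p != extra)
    return list(previous)        # one peak (or none): keep the previous values
.fi
.LE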
4454.pp
4455This procedure works forwards, using the previous frame to
4456disambiguate peaks given in the current one.  More sophisticated
4457algorithms work backwards as well, identifying
4458.ul
4459anchor points
4460in the data which have clearly-defined formant positions, and
4461moving in both directions from these to disambiguate
4462neighbouring frames of data.  Finally, absolute limits can be
4463imposed upon the magnitude of formant movements between frames
4464to give an overall smoothing to the formant tracks.
4465.pp
4466Very often, people will refine the result of such automatic formant
4467estimation procedures by hand, looking at the tracks, knowing
4468what was said, and making adjustments in the light of their
4469experience of how formants move in speech.  Unfortunately, it is difficult to
4470obtain high-quality formant tracks by completely automatic
4471means.
4472.pp
4473One of the most difficult cases in formant estimation is where
4474two formants are so close together that the individual peaks
4475cannot be resolved.  One simple solution to this problem is to
4476employ "analysis-by-synthesis", whereby once a formant is
4477identified, a standard formant shape at this position is
4478synthesized and
4479subtracted from the
4480logarithmic spectrum (Coker, 1963).
4481.[
4482Coker 1963
4483.]
4484Then, even if two formants
4485are right on top of each other, the second is not missed because
4486it remains after the first one has been subtracted.
4487.pp
4488Unfortunately, however, the single peak which appears when
4489two formants are close together usually does not correspond exactly with the
4490position of either one.
4491There is one rather advanced signal-processing technique that
4492can help in this case.
4493The frequency spectrum of
4494speech is determined by
4495.ul
4496poles
4497which lie in the complex $z$-plane inside the unit circle.  (They
4498must be inside the unit circle if the system is stable.  Those
4499familiar with Laplace analysis of analogue systems may like to note that the
4500left half of the $s$-plane corresponds with the inside of the unit
4501circle in the $z$-plane.)  As shown earlier, computing a DFT is tantamount to
4502evaluating the $z$-transform at equally-spaced points around the
4503unit circle.  However, better resolution is obtained by
4504evaluating around a circle which lies
4505.ul
4506inside
4507the unit circle, but
4508.ul
4509outside
4510the outermost pole position.  Such a circle is sketched in
4511Figure 4.12.
4512.FC "Figure 4.12"
4513.pp
4514Recall that the FFT is a fast way of calculating the DFT of a
4515sequence.  Is there a similarly fast way of evaluating the
4516$z$-transform inside the unit circle?  The answer is yes, and the
4517technique is known as the "chirp $z$-transform", because it
4518involves considering a signal whose frequency increases
4519linearly \(em just like a radar chirp signal.  The chirp method
4520allows the $z$-transform to be computed quickly at equally-spaced
4521points along spirally-shaped contours around the origin of the
4522$z$-plane \(em corresponding to signals of linearly increasing
4523complex frequency.  The spiral nature of these curves is not of
4524particular interest in speech processing.  What
4525.ul
4526is
of interest, though, is that the spiral can begin at any point
in the $z$-plane, and its pitch can be set arbitrarily.
4530If we begin spiralling at $z=0.9$, say, and set the pitch
4531to zero, the contour becomes a circle inside the unit one, with
4532radius 0.9.  Such a circle is exactly what is needed to refine
4533formant resolution.
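.pp
Even without the fast chirp algorithm, the idea is easy to try out: the
$z$-transform can simply be evaluated directly at points on a circle of the
chosen radius (Python with NumPy; the radius of 0.9 is just the figure used
above, and a radius of 1.0 reproduces the ordinary DFT):
.LB
.nf
import numpy as np

def spectrum_on_circle(h, radius=0.9, num_points=256):
    # Evaluate H(z) = sum over n of h(n) * z**(-n) at equally-spaced points
    # around a circle of the given radius in the z-plane.  This is the direct
    # (slow) evaluation; the chirp z-transform computes the same values quickly.
    h = np.asarray(h, dtype=float)
    n = np.arange(len(h))
    z = radius * np.exp(2j * np.pi * np.arange(num_points) / num_points)
    return np.array([np.sum(h * zk ** (-n)) for zk in z])
.fi
.LE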
4534.sh "4.8  Pitch extraction"
4535.pp
4536The last section discussed how to characterize the vocal tract filter
4537in the source-filter model of speech production:  this one looks
4538at how the most important property of the source \(em that is, the
4539pitch period \(em can be derived.  In many ways pitch extraction
4540is more important from a practical point of view than is formant
4541estimation.  In a voice-output system, formant estimation is
4542only necessary if speech is to be stored in formant-coded form.
4543For linear predictive storage of speech, or for speech synthesis
4544from phonetics or text, formant extraction is unnecessary \(em
4545although of course general information about formant
4546frequencies and formant tracks in natural speech is needed
4547before a synthesis-from-phonetics system can be built.
4548However, knowledge of the pitch contour is needed for
4549many different purposes.  For example, compact encoding of
4550linearly predicted speech relies on the pitch being estimated and
4551stored as a parameter separate from the articulation.
4552Significant improvements in frequency analysis can be made by
4553performing pitch-synchronous Fourier transformations,
4554because the need to window is eliminated.
4555Many synthesis-from-phonetics systems require the pitch contour
for utterances to be stored rather than computed from markers in the
4557phonetic text.
4558.pp
4559Another issue which is closely bound up with pitch extraction is
4560the voiced-unvoiced distinction.   A good pitch estimator ought to
4561fail when presented with aperiodic input such as an unvoiced
4562sound, and so give a reliable indication of whether the frame of
4563speech is voiced or not.
4564.pp
4565One method of pitch estimation, which uses the cepstrum, has been outlined
4566above.  It involves a substantial amount of computation,
4567and has a high degree of complexity.  However, if implemented
4568properly it gives excellent results, because the source-filter
4569structure of the speech is fully utilized.
4570Another method, using the
4571linear prediction residual, will be described in Chapter 6.
4572Again, this requires a great deal of computation of a fairly sophisticated
4573nature, and gives good results \(em although it relies on a
4574somewhat more
4575restricted version of the source-filter model than cepstral
4576analysis.
4577.rh "Autocorrelation methods."
4578The most reliable way of estimating the pitch of a periodic
4579signal which is corrupted by noise is to examine its
4580short-time autocorrelation function.
4581The autocorrelation of a signal $x(n)$ with lag $k$ is defined as
4582.LB
4583.EQ
4584phi (k) ~~ = ~~ sum from {n=- infinity} to infinity ~ x(n)x(n+k) .
4585.EN
4586.LE
4587If the signal is quasi-periodic, with slowly varying period,
4588a finite stretch of it can be isolated with a window
4589$w(i)$, which is 0 when $i$ is outside the range $[0,N)$.
4590Beginning this window at sample $m$ gives the windowed signal
4591.LB
4592.EQ
4593x(n)w(n-m),
4594.EN
4595.LE
4596whose autocorrelation,
4597the
4598.ul
4599short-time
autocorrelation of the signal $x$ at point $m$, is
4601.LB
4602.EQ
4603phi sub m (k)~ = ~~ sum from n ~ x(n)w(n-m)x(n+k)w(n-m+k) .
4604.EN
4605.LE
4606.pp
4607The autocorrelation function exhibits peaks at lags which correspond to
4608the pitch periods and multiples of it.  At such lags, the signal is in
4609phase with a delayed version of itself, giving high correlation.
The pitch of natural speech ranges over about three octaves, from 50\ Hz (low-pitched men) to around
4611400\ Hz (children).  To ensure that at least two pitch cycles are seen, even at
4612the
4613low end, the window needs to be at least 40\ msec long, and the autocorrelation
4614function calculated for lags up to 20\ msec.  The peaks which occur at lags
4615corresponding to multiples of the pitch become smaller as the multiple
4616increases, because the speech waveform will change slightly and the pitch
4617period is not perfectly constant.  If signals at the high end of the pitch
4618range, 400\ Hz, are
4619viewed through a 40\ msec autocorrelation window, considerable smearing of
4620pitch resolution in the time domain is to be expected.  Finally, for unvoiced
4621speech, no substantial peaks of autocorrelation will occur.
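.pp
A bare-bones autocorrelation pitch estimator along these lines might look as
follows (Python with NumPy; a rectangular window is used, and the 30%
voicing threshold is an arbitrary illustrative figure):
.LB
.nf
import numpy as np

def pitch_by_autocorrelation(frame, fs=8000, f_lo=50.0, f_hi=400.0):
    # The frame should span at least two periods of the lowest expected
    # pitch (40 msec or more at 50 Hz).  The pitch period is taken as the
    # lag of the largest autocorrelation peak in the expected range.
    x = np.asarray(frame, dtype=float)
    acf = np.correlate(x, x, mode='full')[len(x) - 1:]   # lags 0, 1, 2, ...
    lo, hi = int(fs / f_hi), int(fs / f_lo)
    lag = lo + np.argmax(acf[lo:hi])
    if acf[lag] < 0.3 * acf[0]:          # no substantial peak: call it unvoiced
        return 0.0
    return fs / lag                      # pitch frequency in Hz
.fi
.LE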
4622.pp
4623If all deviations from perfect periodicity can be attributed to
4624additive, white, Gaussian noise, then it can be shown from
4625standard detection theory that autocorrelation methods are
4626appropriate for pitch identification.  Unfortunately, this is
4627certainly not the case for speech signals.  Although the
4628short-time autocorrelation of voiced speech exhibits peaks at
4629multiples of the pitch period, it is not clear that it is any
4630easier to detect these peaks in the autocorrelation function
4631than it is in the original time waveform!  To take a simple
4632example, if a signal contains a fundamental and in-phase first
4633and second harmonics,
4634.LB
4635.EQ
4636x(n)~ =~ a sin 2 pi fnT ~+~ b sin 4 pi fnT ~+~ c sin 6 pi fnT ,
4637.EN
4638.LE
4639then its autocorrelation function is
4640.LB
4641.EQ
phi (k) ~=~~ {a sup 2 ~cos~2 pi fkT~+~b sup 2 ~cos~4 pi
fkT~+~c sup 2 ~cos~6 pi fkT} over 2 ~ .
4644.EN
4645.LE
4646There is no reason to believe that detection of the fundamental
4647period of this signal will be any easier in the autocorrelation
4648domain than in the time domain.
4649.pp
4650The most common error of pitch detection by autocorrelation
4651analysis is that the periodicities of the formants are confused
4652with the pitch.  This typically leads to the repetition time
4653being identified as  $T sub pitch ~ +- ~ T sub formant1$,  where the
4654$T$'s are the periods of the pitch and first formant.  Fortunately,
4655there are simple ways of processing the signal non-linearly to
4656reduce the effect of formants on pitch estimation using autocorrelation.
4657.pp
4658One way
is to low-pass filter the
signal with a cut-off above the maximum pitch frequency, say 600
Hz.  However, formant 1 is often below this value.  A different
4662technique, which may be used in conjunction with filtering, is
4663to "centre-clip" the signal as shown in Figure 4.13.
4664.FC "Figure 4.13"
4665This
4666removes many of
4667the ripples which are associated with formants.  However, it
4668entails the use of an adjustable clipping threshold to cater for
4669speech of varying amplitudes.  Sondhi (1968), who introduced the
4670technique, set the clipping level at 30% of the maximum
4671amplitude.
4672.[
4673Sondhi 1968
4674.]
An alternative, which achieves
much the same effect without the need to fiddle with thresholds,
4677is to cube the signal, or raise it to some other high (odd!)
4678power, before taking the autocorrelation.  This highlights the
4679peaks and suppresses the effect of low-amplitude parts.
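.pp
Both preprocessing steps are one-liners.  Here is a sketch (Python with
NumPy; the clipping rule is one common form of centre-clipper, and may
differ in detail from that of Figure 4.13):
.LB
.nf
import numpy as np

def centre_clip(x, fraction=0.3):
    # Samples within the clipping band are set to zero; the threshold is
    # subtracted from the remainder.  Sondhi's figure of 30% of the maximum
    # amplitude is the default.
    x = np.asarray(x, dtype=float)
    c = fraction * np.max(np.abs(x))
    return np.where(x > c, x - c, np.where(x < -c, x + c, 0.0))

def cube(x):
    # The threshold-free alternative: an odd power emphasizes the large
    # pitch peaks relative to the formant ripple.
    return np.asarray(x, dtype=float) ** 3
.fi
.LE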
4680.pp
4681For very accurate pitch detection, it is best to combine the evidence
4682from several different methods of analysis of the time waveform.
4683The autocorrelation function provides one source of evidence;
4684and the cepstrum provides another.
4685A third source comes from the time waveform itself.
4686McGonegal
4687.ul
4688et al
4689(1975) have described a semi-automatic method of pitch
4690detection which uses human judgement to make a final decision based upon these
4691three sources of evidence.
4692.[
4693McGonegal Rabiner Rosenberg 1975 SAPD
4694.]
4695This appears to provide highly accurate pitch contours at the expense of
4696considerable human effort \(em it takes an experienced user 30 minutes to
4697process each second of speech.
4698.rh "Speeding up autocorrelation."
4699Calculating the autocorrelation function is an
4700arithmetic-intensive procedure.  For large lags, it can best be
4701done using FFT methods; although there are simpler arithmetic
4702tricks which speed it up without going to such complexity.
4703However, with the availability of analogue delay lines using
4704charge-coupled devices, autocorrelation can now be done
4705effectively and cheaply by analogue, sampled-data, hardware.
4706.pp
4707Nevertheless, some techniques to speed up digital
4708calculation of short-time autocorrelations are in wide use.  It
4709is tempting to hard-limit the signal so that it becomes binary
4710(Figure 4.14(a)), thus eliminating multiplication.
4711.FC "Figure 4.14"
4712This can be
4713disastrous, however, because hard-limited speech is known to
4714retain considerable intelligibility and therefore the formant
4715structure is still there.  A better plan is to take
4716centre-clipped speech and hard-limit that to a ternary signal
4717(Figure 4.14(b)).  This simplifies the computation considerably
4718with essentially no degradation in performance (Dubnowski
4719.ul
4720et al,
47211976).
4722.[
4723Dubnowski Schafer Rabiner 1976 Digital hardware pitch detector
4724.]
4725.pp
4726A different approach to reducing the amount of calculation is to
4727perform a kind of autocorrelation which does not use
4728multiplications.  The
4729"average magnitude difference function",
4730which is defined by
4731.LB
4732.EQ
4733d(k)~ = ~~ sum from {n=- infinity} to infinity ~ |x(n)-x(n+k)| ,
4734.EN
4735.LE
4736has been used for this purpose with some success (Ross
4737.ul
4738et al,
47391974).
4740.[
4741Ross Schafer Cohen Freuberg Manley 1974
4742.]
4743It exhibits dips at pitch periods (instead of the peaks of the
4744autocorrelation function).
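.pp
A sketch of the calculation (Python with NumPy; the search range simply
corresponds to the 50 to 400\ Hz pitch range used earlier):
.LB
.nf
import numpy as np

def amdf_pitch_period(frame, fs=8000, f_lo=50.0, f_hi=400.0):
    # d(k) = sum over n of |x(n) - x(n+k)|, taken over a finite frame; it
    # dips, rather than peaks, at the pitch period and its multiples.
    x = np.asarray(frame, dtype=float)
    lo, hi = int(fs / f_hi), int(fs / f_lo)
    d = np.array([np.sum(np.abs(x[:len(x) - k] - x[k:])) for k in range(lo, hi)])
    return lo + np.argmin(d)             # estimated pitch period, in samples
.fi
.LE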
4745.rh "Feature-extraction methods."
4746Another possible way of extracting pitch in the time domain is to try to
4747integrate information from different sources to give reliable
4748pitch estimates.  Several features of the time
4749waveform can be defined, each of which provides an estimate of the pitch period,
4750and
4751an overall estimate can be obtained by majority vote.
4752.pp
4753For example, suppose that the only feature of the speech
4754waveform which is retained is the height and position of the
4755peaks, where a "peak" is defined by the simplistic criterion
4756.LB
4757$
4758x(n-1) ~<~ x(n)
4759$  and  $
x(n) ~>~ x(n+1) .
4761$
4762.LE
4763Having found a peak which is thought to represent a pitch pulse,
4764one could define a "blanking period", based upon the current
4765pitch estimate, within which the next pitch pulse could not
4766occur.  When this period has expired, the next pitch pulse is
4767sought.  At first, a stringent criterion should be used for
4768identifying the next peak as a pitch pulse; but it can gradually be
4769relaxed if time goes on without a suitable pulse being
4770located.  Figure 4.15 shows a convenient way of doing this:  a
4771decaying exponential is begun at the end of the blanking period
4772and when a peak shows above, it is identified as a pitch pulse.
4773.FC "Figure 4.15"
4774One big advantage of this type of algorithm is that the data is
4775greatly reduced by considering peaks only \(em which can be
4776detected by simple hardware.  Thus it can permit real-time
4777operation on a small processor with minimal special-purpose
4778hardware.
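.pp
A sketch of such a single-feature detector is given below (plain Python with
NumPy for the exponential; the blanking period, decay constant and the use of
the last peak height as the starting threshold are all illustrative guesses
rather than values from any particular implementation):
.LB
.nf
import numpy as np

def pitch_pulse_positions(x, fs=8000, blank_ms=4.0, decay_ms=4.0):
    # After each accepted pulse a blanking period is imposed; a threshold then
    # decays exponentially from the last pulse height, and the first peak to
    # rise above it is taken as the next pitch pulse.
    x = np.asarray(x, dtype=float)
    blank = int(blank_ms * fs / 1000)
    tau = decay_ms * fs / 1000
    pulses, last_n, last_h = [], None, None
    for n in range(1, len(x) - 1):
        if not (x[n - 1] < x[n] > x[n + 1]):      # the simple peak criterion
            continue
        if last_n is None:
            pulses.append(n)
            last_n, last_h = n, x[n]
        elif n - last_n > blank:
            threshold = last_h * np.exp(-(n - last_n - blank) / tau)
            if x[n] > threshold:
                pulses.append(n)
                last_n, last_h = n, x[n]
    return pulses                # sample numbers of the detected pitch pulses
.fi
.LE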
4779.pp
4780Such a pitch pulse detector is exceedingly simplistic, and will
4781often identify the pitch incorrectly.  However, it can be used
4782in conjunction with other features to produce good pitch
4783estimates.  Gold and Rabiner (1969), who pioneered the
4784approach, used six features:
4785.[
4786Gold Rabiner 1969 Parallel processing techniques for pitch periods
4787.]
4788.LB
4789.NP
4790peak height
4791.NP
4792valley depth
4793.NP
4794valley-to-peak height
4795.NP
4796peak-to-valley depth
4797.NP
4798peak-to-peak height (if greater than 0)
4799.NP
4800valley-to-valley depth (if greater than 0).
4801.LE
4802The features are symmetric with regard to peaks and valleys.
4803The first feature is the one described above, and the second one works in
4804exactly the same way.
4805The third feature records the
height between each valley and the succeeding peak, and the fourth
uses the depth between each peak and the succeeding valley.  The
4808purpose of the final two detectors is to eliminate secondary,
4809but rather large, peaks from consideration.  Figure 4.16 shows
4810the kind of waveform on which the other features might
incorrectly double the pitch, but the last two features identify
it correctly.
4813.FC "Figure 4.16"
4814.pp
4815Gold and Rabiner also included the last two pitch estimates from each
4816feature detector.
4817Furthermore, for each feature, the present estimate
4818was added to the previous one to make a fourth, and the previous one to
4819the one before that to make a fifth, and all three were added together
4820to make a sixth; so that for each feature there were 6 separate estimates of
4821pitch.  The reason for this is that if three consecutive estimates of the
fundamental period are $T sub 0$, $T sub 1$ and $T sub 2$, then if some peaks are
4823being falsely identified, the actual period could be any of
4824.LB
4825.EQ
4826T sub 0 ~+~ T sub 1 ~~~~ T sub 1 ~+~ T sub 2 ~~~~
4827T sub 0 ~+~ T sub 1 ~+~ T sub 2 .
4828.EN
4829.LE
4830It is essential to do this, because
4831a feature of a given type can occur more than once in a pitch period \(em
4832secondary peaks usually exist.
4833.pp
4834Six features, each contributing six separate estimates, makes 36 estimates
4835of pitch in all.
4836An overall figure was obtained from this
4837set by selecting the most popular estimate (within some
4838pre-specified tolerance).  The complete scheme has been
4839evaluated extensively (Rabiner
4840.ul
4841et al,
48421976) and compares
4843favourably with other methods.
4844.[
4845Rabiner Cheng Rosenberg McGonegal 1976
4846.]
4847.pp
4848However, it must be admitted that this procedure seems to be rather
4849.ul
4850ad hoc
4851(as are many other successful speech parameter estimation
4852algorithms!).  Specifically, it is not easy to predict what
4853kinds of waveforms it will fail on, and evaluation of it can
4854only be pragmatic.  When used to
4855estimate the pitch of musical
4856instruments and singers over a 6-octave range (40\ Hz to 2.5\ kHz),
4857instances were found where it failed dramatically (Tucker and Bates, 1978).
4858.[
4859Tucker Bates 1978
4860.]
4861This is, of
4862course, a much more difficult problem than pitch estimation for
4863speech, where the range is typically 3 octaves.
4864In fact, for speech the feature
4865detectors are usually preceded by
4866a low-pass filter to attenuate the myriad
4867of peaks
4868caused by higher formants, and this
4869is inappropriate for
4870musical applications.
4871.pp
4872There is evidence which shows that additional features can
4873assist with pitch identification.  The above features are all
4874based upon the signal amplitude, and could be described as
4875.ul
4876secondary
4877features derived from a single
4878.ul
4879primary
4880feature.  Other primary features can easily be defined.
4881Tucker and Bates (1978) used a centre-clipped waveform, and considered only
4882the peaks rising above the central region.
4883.[
4884Tucker Bates 1978
4885.]
4886They defined two
4887further primary features, in addition to the peak amplitude:  the
4888.ul
4889time width
4890of a peak (period for which it is
4891outside the clipping level), and its
4892.ul
4893energy
4894(again, outside the clipping level).  The primary
4895features are shown in Figure 4.17.
4896.FC "Figure 4.17"
4897Secondary features are
4898defined, based on these three primary ones, and pitch estimates
4899are made for each one.  A further innovation was to combine the
individual estimates in a way which is based upon
4901autocorrelation analysis, reducing to some degree the
4902.ul
4903ad-hocery
4904of the pitch detection process.
4905.sh "4.9  References"
4906.LB "nnnn"
4907.[
4908$LIST$
4909.]
4910.LE "nnnn"
4911.sh "4.10  Further reading"
4912.pp
4913There are a lot of books on digital signal analysis, although in general
4914I find them rather turgid and difficult to read.
4915.LB "nn"
4916.\"Ackroyd-1973-1
4917.]-
4918.ds [A Ackroyd, M.H.
4919.ds [D 1973
4920.ds [T Digital filters
4921.ds [I Butterworths
4922.ds [C London
4923.nr [T 0
4924.nr [A 1
4925.nr [O 0
4926.][ 2 book
4927.in+2n
4928Here is the exception to prove the rule.
4929This book
4930.ul
4931is
4932easy to read.
4933It provides a good introduction to digital signal processing,
4934together with a wealth of practical design information on digital filters.
4935.in-2n
4936.\"Committee.I.D.S.P-1979-3
4937.]-
4938.ds [A IEEE Digital Signal Processing Committee
4939.ds [D 1979
4940.ds [T Programs for digital signal processing
4941.ds [I Wiley
4942.ds [C New York
4943.nr [T 0
4944.nr [A 0
4945.nr [O 0
4946.][ 2 book
4947.in+2n
4948This is a remarkable collection of tried and tested Fortran programs
4949for digital signal analysis.
4950They are all available from the IEEE in machine-readable form on magnetic
4951tape.
4952Included are programs for digital filter design, discrete Fourier transformation,
4953and cepstral analysis, as well as others (like linear predictive analysis;
4954see Chapter 6).
4955Each program is accompanied by a concise, well-written description of how
4956it works, with references to the relevant literature.
4957.in-2n
4958.\"Oppenheim-1975-4
4959.]-
4960.ds [A Oppenheim, A.V.
4961.as [A " and Schafer, R.W.
4962.ds [D 1975
4963.ds [T Digital signal processing
4964.ds [I Prentice Hall
4965.ds [C Englewood Cliffs, New Jersey
4966.nr [T 0
4967.nr [A 1
4968.nr [O 0
4969.][ 2 book
4970.in+2n
4971This is one of the standard texts on most aspects of digital signal processing.
4972It treats the $z$-transform, digital filters, and discrete Fourier transformation
4973in far more detail than we have been able to here.
4974.in-2n
4975.\"Rabiner-1975-5
4976.]-
4977.ds [A Rabiner, L.R.
4978.as [A " and Gold, B.
4979.ds [D 1975
4980.ds [T Theory and application of digital signal processing
4981.ds [I Prentice Hall
4982.ds [C Englewood Cliffs, New Jersey
4983.nr [T 0
4984.nr [A 1
4985.nr [O 0
4986.][ 2 book
4987.in+2n
4988This is the other standard text on digital signal processing.
4989It covers the same ground as Oppenheim and Schafer (1975) above,
4990but with a slightly faster (and consequently more difficult) presentation.
4991It also contains major sections on special-purpose hardware for
4992digital signal processing.
4993.in-2n
4994.\"Rabiner-1978-1
4995.]-
4996.ds [A Rabiner, L.R.
4997.as [A " and Schafer, R.W.
4998.ds [D 1978
4999.ds [T Digital processing of speech signals
5000.ds [I Prentice Hall
5001.ds [C Englewood Cliffs, New Jersey
5002.nr [T 0
5003.nr [A 1
5004.nr [O 0
5005.][ 2 book
5006.in+2n
5007Probably the best single reference for digital speech analysis,
5008as it is for the time-domain encoding techniques of the last chapter.
5009Unlike the books cited above, it is specifically oriented to speech processing.
5010.in-2n
5011.LE "nn"
5012.EQ
5013delim $$
5014.EN
5015.CH "5  RESONANCE SPEECH SYNTHESIZERS"
5016.ds RT "Resonance speech synthesizers
5017.ds CX "Principles of computer speech
5018.pp
5019This chapter considers the design of speech synthesizers which
5020implement a direct electrical analogue of
5021the resonance properties of the vocal tract by providing a filter for each
5022formant whose resonant frequency is to be controlled.  Another method is the
5023channel vocoder, with a bank of fixed filters whose gains are varied to match
5024the spectrum of the speech as described in Chapter 4.  This is not generally
5025used for synthesis from a written representation, however, because it is hard
5026to get good quality speech.  It
5027.ul
5028is
5029used sometimes for low-bandwidth
5030transmission and storage, for
5031it is fairly easy to analyse natural speech into fixed frequency bands.
5032A second alternative to the resonance synthesizer is the linear predictive
5033synthesizer, which at present is used quite extensively and is likely to become
5034even more popular.  This is covered in the next chapter.
5035Another alternative is the articulatory synthesizer, which
5036attempts to model the vocal tract directly, rather than
5037modelling the acoustic output from it.
5038Although, as noted in Chapter 2, articulatory synthesis holds a promise of
5039high-quality speech \(em for the coarticulation effects caused by tongue
5040and jaw inertia can be modelled directly \(em this has not yet been realized.
5041.pp
5042The source-filter model of speech production indicates that an electrical
5043analogue of the vocal tract can be obtained by considering the source
5044excitation and the filter that produces the formant frequencies separately.
5045This approach was pioneered by Fant (1960), and we shall present much of his
5046work in this chapter.
5047.[
5048Fant 1960 Acoustic theory of speech production
5049.]
5050There has been some discussion over whether the source-filter model really
5051is a good one, and some
5052synthesizers
5053explicitly introduce an element of
5054"sub-glottal coupling", which simulates the effect of the lung cavity
5055on the vocal tract transfer function during the periods when the glottis is
5056open (for an example see Rabiner, 1968).
5057.[
5058Rabiner 1968 Digital formant synthesizer JASA
5059.]
5060However, this is very much a low-order effect when considering
5061speech synthesized by rule from a written representation, for the software
5062which calculates parameter values to drive the synthesizer is a far greater
5063source of degradation in speech quality.
5064.sh "5.1  Overall spectral considerations"
5065.pp
5066Figure 5.1 shows the source-filter model of speech production.
5067.FC "Figure 5.1"
5068For voiced speech, the excitation source produces a waveform whose frequency
5069components decay at about 12\ dB/octave, as we shall see in a later section.
5070The excitation passes into the vocal tract filter.  Conceptually, this can best
5071be viewed as an infinite series of formant filters, although for implementation
5072purposes only the first few are modelled explicitly and the effect of the rest
5073is lumped together into a higher-formant compensation network.  In either case
5074the overall frequency profile of the filter is a flat one, upon which humps are
5075superimposed at the various formant frequencies.  Thus the output of the
5076vocal tract filter falls off at 12\ dB/octave just as the input does.
5077However, measurements of actual speech show a 6\ dB/octave decay with increasing
5078frequency.  This is explained by the effect of radiation of speech from the
5079lips, which in fact has a "differentiating" action, producing a 6\ dB/octave
5080rise in the frequency spectrum.  This 6\ dB/octave lift is similar to that
5081provided by a treble boost control on a radio or amplifier.  Speech synthesized
5082without it sounds unnaturally heavy and bassy.
5083.pp
5084These overall spectral shapes, which are derived from considering the human
5085vocal tract, are summarized in the upper annotations in Figure 5.1.  But there
5086is no real necessity for a synthesizer to model the frequency characteristics
5087of the human vocal tract at intermediate points:  only the output speech is of
5088any concern.  Because the system is a linear one, the filter blocks in the
5089figure can be shuffled around to suit engineering requirements.  One such
5090requirement is the desire to minimize internally-generated noise in the
5091electrical implementation, most of which will arise in the vocal tract filter
5092(because it is much more complicated than the other components).  For this
5093reason an excitation source with a flat spectrum is often preferred, as shown
5094in the lower annotations.  This can be generated either by taking the desired
5095glottal pulse shape, with its 12\ dB/octave fall-off, and passing it through a
5096filter giving 12\ dB/octave lift at higher frequencies; or, if the pulse shape
5097is to be stored digitally, by storing its second derivative instead.
5098Then the radiation compensation, which is now more properly called
5099"spectral equalization", will comprise a 6\ dB/octave fall-off to give the
5100required trend in the output spectrum.
5101.pp
5102For a given pitch period, this scheme yields exactly the same spectral
5103characteristics as the original system which modelled the human vocal tract.
5104However, when the pitch varies there will be a difference, for sounds with
5105higher excitation frequencies will be attenuated by \-6\ dB/octave in the new
5106system and +6\ dB/octave in the old by the final spectral equalization.
5107In practice, the pitch of the human voice lies quite low in the frequency
5108region \(em usually below 400\ Hz \(em and if all filter characteristics begin
5109their roll-off at this frequency the two systems will be the same.  This
5110simplifies the implementation with a slight compromise in its accuracy in
5111modelling the spectral trend of human speech, for the overall \-6\ dB/octave
5112decay actually begins at a frequency of around 100\ Hz.  If this is
5113implemented, some adjustment will need to be made to the amplitudes to ensure
5114that high-pitched sounds are not attenuated unduly.
5115.pp
5116The discussion so far pertains to voiced speech only.  The source spectrum of
5117the random excitation in unvoiced sounds is substantially flat, and combines
5118with the radiation from the lips to give a +6\ dB/octave rise in the output
5119spectrum.  Hence if spectral equalization is changed to \-6\ dB/octave to
accommodate a voiced excitation with flat spectrum, the noise source should
5121show a 12\ dB/octave rise to give the correct overall effect.
5122.sh "5.2  The excitation sources"
5123.pp
5124In human speech, the excitation source for voiced sounds is produced by two
5125flaps of skin called the "vocal cords".  These are blown apart by pressure from
5126the lungs.  When they come apart the pressure is relieved, and the muscles
5127tensioning the skin cause the flaps to come together again.  Subsequently, the
5128lung pressure \(em called "sub-glottal pressure" \(em builds up once more and the
5129process is repeated.  The factors which influence the rate and nature of
5130vibration are muscular tension of the cords and the sub-glottal pressure.  The detail
5131of the excitation has considerable importance to speech synthesis because it
5132greatly influences the apparent naturalness of the sound produced.  For example,
5133if you have inflamed vocal cords caused by laryngitis the sound quality
5134changes dramatically.  Old people who do not have proper muscular control over
5135their vocal cord tension produce a quavering sound.  Shouted speech can easily
5136be distinguished from quiet speech even when the volume cue is absent \(em you
5137can verify this by fiddling with the volume control of a tape recorder \(em because
5138when shouting, the vocal cords stay apart for a much smaller fraction of the
5139pitch cycle than at normal volumes.
5140.rh "Voiced excitation in natural speech."
5141There are two basic ways to examine the shape of the excitation source in
5142people.  One is to use a dentist's mirror and high-speed photography to observe
5143the vocal cords directly.  Although it seems a lot to ask someone to speak
5144naturally with a mirror stuck down the back of his throat, the method has been
5145used and photographs can be found, for example, in Flanagan (1972).
5146.[
5147Flanagan 1972 Speech analysis synthesis and perception
5148.]
5149The second
5150technique is to process the acoustic waveform digitally, identifying the
5151formant positions and deducting the formant contributions from the waveform by
5152filtering.  This leaves the basic excitation waveform, which can then be
5153displayed.  Such techniques lead to excitation shapes like those sketched in
5154Figure 5.2, in which the gradual opening and abrupt closure of the vocal cords
5155can easily be seen.
5156.FC "Figure 5.2"
5157.pp
5158It is a fact that if a periodic function has one or more discontinuities, its frequency
5159spectrum will decay at sufficiently high frequencies at the rate of 6\ dB/octave.
5160For example, the components of the square wave
5161.LB
5162$
5163g(t) ~~ = ~~ mark 0
5164$  for $
51650 <= t < h
5166$
5167.br
5168$
5169lineup 1
5170$  for $
5171h <= t < b
5172$
5173.LE
5174can be calculated from the Fourier series
5175.LB
5176.EQ
G(r) ~~ = ~~ 1 over b ~ integral from 0 to b ~g(t)~e sup {-j2 pi rt/b} ~dt
~~ = ~~ j over {2 pi r} ~(1~-~e sup {-j2 pi rh/b} ) ,
.EN
.LE
so the magnitude $|G(r)|$ falls off as $1/r$ (apart from a bounded oscillatory factor), and the trend over one octave is
5182.LB
5183.EQ
518420~log sub 10 ~ |G(2r)| over |G(r)|
5185~~=~~20~log sub 10 ~ 1 over 2
5186~~ = ~
5187.EN
5188\-6\ dB.
5189.LE
5190However, if the discontinuities are ones of slope only, then the asymptotic decay
5191at high frequencies is 12\ dB/octave.  Thus the glottal excitation of Figure 5.2
5192will decay at this rate.
5193Note that it is not the
5194.ul
5195number
5196but the
5197.ul
5198type
5199of discontinuities which are important in determining the asymptotic spectral
5200trend.
5201.rh "Voiced excitation in synthetic speech."
5202There are several ways that glottal excitation can be simulated in a synthesizer,
5203four of which are shown in Figure 5.3.
5204.FC "Figure 5.3"
5205The square pulse and the sawtooth pulse
5206both exhibit discontinuities, and so will have the wrong asymptotic rate of
5207decay (6\ dB/octave instead of 12\ dB/octave).  A better bet is the triangular
5208pulse.  This has the correct decay, for there are only discontinuities of slope.
5209However, although the asymptotic rate of decay is of first importance, the fine
5210structure of the frequency spectrum at the lower end is also significant, and
5211the fact that there are two discontinuities of slope instead of just one in the
5212natural waveform means that the spectra cannot match closely.
5213.pp
5214Rosenberg (1971) has investigated several different shapes using listening
5215tests, and he found that the polynomial approximation sketched in Figure 5.3
5216was preferred by listeners.
5217.[
5218Rosenberg 1971
5219.]
5220This has one slope discontinuity, and comprises
5221three sections:
5222.LB
5223$g(t) ~~ = ~~ 0$  for $0 <= t < t sub 1$    (flat during the period of closure)
5224.sp
5225$g(t) ~~ = ~~ A~ u sup 2 (3 - 2u) $,	where
5226$u ~=~ {t-t sub 1} over {t sub 2 -t sub 1} $ ,    for
5227$t sub 1 <= t < t sub 2$  (opening phase)
5228.sp
5229.sp
5230$g(t) ~~ = ~~ A~ (1 - v sup 2 )$,	where
5231$v ~=~ {t-t sub 2} over {b-t sub 2} $ ,    for
5232$t sub 2 <= t < b$    (closing phase).
5233.LE
5234It is easy to see that the joins between the first and second section, and
5235between the second and third section, are smooth; but that the slope of the third
5236section at the end of the cycle when $t=b$ is
5237.LB
5238.EQ
dg over dt ~~ = ~~ -~ {2A} over {b ~-~ t sub 2} ~ .
5240.EN
5241.LE
5242$A$ is the maximum amplitude of the pulse, and is reached when $t=t sub 2$.
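.pp
For concreteness, here is a short program (written in Python purely for
illustration) which evaluates one period of this three-section pulse.  The
period, closure time, peak time, and amplitude used are arbitrary round
numbers, not values recommended by Rosenberg.
.LB
.nf
# Rosenberg-style glottal pulse: closed phase, polynomial opening phase,
# parabolic closing phase.  A sketch only, with illustrative timings.
def rosenberg_pulse(t, t1, t2, b, A):
    if t < t1:                         # closed phase
        return 0.0
    if t < t2:                         # opening phase
        u = (t - t1) / (t2 - t1)
        return A * u * u * (3.0 - 2.0 * u)
    v = (t - t2) / (b - t2)            # closing phase
    return A * (1.0 - v * v)

# One 10 ms period (100 Hz pitch) sampled at 10 kHz
b, t1, t2, A = 0.010, 0.004, 0.008, 1.0
pulse = [rosenberg_pulse(n * 0.0001, t1, t2, b, A) for n in range(100)]
.fi
.LE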
5243.pp
5244A much simpler glottal pulse shape to implement is the filtered impulse.
5245Passing an impulse through a filter with characteristic
5246.LB
5247.EQ
52481 over {(1+sT) sup 2}
5249.EN
5250.LE
5251imparts a 12\ dB/octave decay after frequency $1/T$.  This gives a pulse shape of
5252.LB
5253.EQ
5254g(t) ~~ = ~~ A~ t over T ~e sup {1-t/T} ,
5255.EN
5256.LE
5257which is sketched in Figure 5.4.
5258.FC "Figure 5.4"
5259The pulse is the wrong way round in time
5260when compared with the desired one; but this is not important under most
5261listening conditions because phase differences are not noticeable (this
5262point is discussed further below).
5263The maximum is reached when $t=T$ and has
5264height $A$.  The value zero is never actually attained, for the decay to it
5265is asymptotic, and if the slight discontinuity between pulses shown in the
5266Figure is left, the asymptotic rate of decay of the frequency spectrum will
5267be 6\ dB/octave rather than 12\ dB/octave.  However, in a real implementation
5268involving filtering an impulse there will be no such discontinuity, for the
5269next pulse will start off where the last one ended.
5270.pp
5271This seems to be an attractive scheme because of its simplicity,
5272and indeed is sometimes used in speech synthesis.  However, it does not have
5273the right properties when the pitch is varied, for in real glottal
5274waveforms the maximum occurs at a fixed
5275.ul
5276fraction
5277of the period, whereas the filtered impulse's maximum is at a fixed time, $T$.
5278If $T$ is chosen to make the system correct at high pitch frequencies (say
5279400\ Hz), then the pulse will be much too narrow at low pitches and sound rather
5280harsh.  The only solution is to vary the filter parameters with the pitch,
5281leading to complexity again.
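.pp
A minimal Python sketch makes the difficulty concrete:  the peak of the
filtered-impulse pulse stays at $t=T$ however long the pitch period is.
The value of $T$ chosen here is arbitrary.
.LB
.nf
import math

# Filtered-impulse glottal pulse g(t) = A*(t/T)*exp(1 - t/T)
def filtered_impulse(t, T, A=1.0):
    return A * (t / T) * math.exp(1.0 - t / T)

T = 0.0025                            # filter time constant, 2.5 ms
for period in (0.0025, 0.010):        # 400 Hz and 100 Hz pitch periods
    ts = [n * period / 100.0 for n in range(100)]
    peak_at = max(ts, key=lambda t: filtered_impulse(t, T))
    print(period, peak_at)            # peak stays near t = T = 2.5 ms
.fi
.LE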
5282.pp
5283Holmes (1973) has made an extensive study of the effect of the glottal
5284waveshape on the naturalness of high-quality synthesized speech.
5285.[
5286Holmes 1973 Influence of glottal waveform on naturalness
5287.]
5288He employed a rather special speech synthesizer, which provides far more
5289comprehensive and sophisticated control than most.  It was driven by parameters
5290which were extracted from natural utterances by hand \(em but the process of
5291generating and tuning them took many months of a skilled person's time.
5292By using the pulse shape
5293extracted from the natural utterance, he found that synthetic and natural
5294versions could actually be made indistinguishable to most people, even under high-quality
5295listening conditions using headphones.  Performance dropped quite drastically
5296when one of Rosenberg's pulse shapes, similar to the three-section one given
5297above, was used.  Holmes also investigated phase effects and found that whilst
5298different pulse shapes with identical frequency spectra could easily be
5299distinguished when listening over headphones, there was no perceptible difference
5300if the listener was placed at a comfortable distance from a loudspeaker in
5301a room.  This is attributable to the fact that the room itself imposes a
5302complex modification to the phase characteristics of the speech signal.
5303.pp
5304Although a great deal of care must be taken with the glottal pulse shape for very
5305high-quality synthetic speech, for speech synthesized by rule from a written
5306representation the degradation which stems from incorrect control of the
5307synthesizer parameters is much greater than that caused by using a slightly
5308inferior glottal pulse.  The triangular pulse illustrated in Figure 5.3
5309has been found quite satisfactory for speech synthesis by rule.
5310.rh "Unvoiced excitation."
5311Speech quality is much less sensitive to the characteristics of the unvoiced
5312excitation.  Broadband white noise will serve admirably.  It is quite
5313acceptable to generate this digitally, using a pseudo-random feedback shift
5314register.  This gives a bit sequence whose autocorrelation is zero except at
5315multiples of the repetition length.  The repetition length
5316can easily be made as long as the number of states in the shift
5317register (less one) \(em in this case, the configuration is called
5318"maximal length" (Gaines, 1969).
5319.[
5320Gaines 1969 Stochastic computing advances in information science
5321.]
5322For example, an 18-bit maximal-length shift register will repeat
5323every $2 sup 18 -1$ cycles.  If the bit-stream is used as a source of analogue
5324noise, the autocorrelation function will have triangular parts whose width is
5325twice the clock period, as shown in Figure 5.5.
5326.FC "Figure 5.5"
5327According to a well-known
result (the Wiener-Khinchine theorem; see for example Chirlian, 1973)
5329the power density of the frequency
5330spectrum is the same as the Fourier transform of the autocorrelation function.
5331.[
5332Chirlian 1973
5333.]
5334Since the feedback shift register gives a periodic autocorrelation function,
5335its transform is a Fourier series.  The $r$'th frequency component is
5336.LB
5337.EQ
5338G(r) ~~ = ~~ {R sup 2} over {4 pi sup 2 r sup 2 T}
5339~(1~-~~cos~{{2 pi rT} over R}) ~ .
5340.EN
5341.LE
5342Here, $T$ is the clock period and  $R=(2 sup N -1)T$  is the repetition time of
5343an $N$-bit shift register.
5344.pp
5345The spectrum is a bar spectrum, with components spaced
5346at
5347.LB
5348$
5349{1 over R}~~=~~{1 over {(2 sup N -1)T}}$   Hz.
5350.LE
5351These are very close together \(em with $N=18$ and
5352sampling at 20\ kHz (50\ $mu$sec)
5353the spacing becomes under 0.1\ Hz \(em and so it is reasonable to treat the
5354spectrum as continuous, with
5355.LB
5356.EQ
5357G(f) ~~ = ~~ 1 over {4 pi sup 2 f sup 2 T}~~(1~-~cos 2 pi fT) .
5358.EN
5359.LE
5360This spectrum is sketched in Figure 5.6(a), and the measured result of an actual
5361implementation in Figure 5.6(b).
5362.FC "Figure 5.6"
5363The 3\ dB point occurs when
5364.LB
5365.EQ
5366{G(f) over G(0)} ~~=~~{1 over 2} ~ ,
5367.EN
5368.LE
5369and $G(0)$ is $T/2$.  Hence, at the 3\ dB point,
5370.LB
5371.EQ
5372{1~-~cos 2 pi fT} over {2 pi sup 2 f sup 2 T sup 2}
5373~~ = ~~ 1 over 2 ~ ,
5374.EN
5375.LE
5376which has solution  $f=0.45/T$.
5377Thus a pseudo-random shift register generates
5378noise whose spectrum is substantially flat up to half the clock frequency.
5379Anything over 10\ kHz is therefore a suitable clocking rate for speech-quality
5380noise.  Choose 20\ kHz to err on the conservative side.  If the repetition occurs
5381in less than 3 or 4 seconds, it can be heard quite clearly; but above this figure
5382it is not noticeable.  An 18-bit shift register clocked at 20\ kHz repeats
5383every  $(2 sup 18 -1)/20000 ~ = ~ 13$ seconds, which is more than adequate.
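.pp
A maximal-length register of this kind takes only a few lines of code.  The
Python sketch below uses feedback taps at bits 18 and 11, one of the commonly
tabulated maximal-length choices for an 18-bit register (any maximal-length
tap set would serve), and delivers noise samples of \(+-1.
.LB
.nf
# 18-bit maximal-length feedback shift register (taps 18 and 11).
# Each call to step() returns one pseudo-random noise sample; clocked
# at 20 kHz the sequence repeats only every 13 seconds or so.
class NoiseGenerator:
    def __init__(self, seed=1):
        self.state = seed & 0x3FFFF       # any non-zero 18-bit seed

    def step(self):
        feedback = ((self.state >> 17) ^ (self.state >> 10)) & 1
        self.state = ((self.state << 1) | feedback) & 0x3FFFF
        return 1.0 if feedback else -1.0

gen = NoiseGenerator()
samples = [gen.step() for n in range(1000)]   # 50 ms of noise at 20 kHz
.fi
.LE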
5384.sh "5.3  Simulating vocal tract resonances"
5385.pp
5386The vocal tract, from glottis to lips, can be modelled as an unconstricted
5387tube of varying cross-section with no side branches and no sub-glottal coupling.
5388This has an all-pole transfer function, which can be written in the form
5389.LB
5390.EQ
5391H(s) ~~ = ~~
5392{w sub 1 sup 2} over {s sup 2 ~+~ b sub 1 s ~+~ w sub 1 sup 2}
5393~.~{w sub 2 sup 2} over {s sup 2 ~+~ b sub 2 s ~+~ w sub 2 sup 2} ~~ .~ .~ .
5394.EN
5395.LE
5396There is an unspecified (conceptually infinite) number of terms in the
5397product.  Each of them produces a peak in the energy spectrum,
5398and these are the formants we observed in Chapter 2.
5399.pp
5400Formants appear even in an over-simplified
5401model of the tract as a tube of uniform cross-section, with a sound source
5402at one end (the larynx) and open at the other (the lips).
5403This extremely crude model was discussed in Chapter 2, and surprisingly,
5404perhaps, it gives a good approximation to the observed formant frequencies
5405for a neutral, relaxed vowel such as that in
5406.ul
5407"a\c
5408bove".
5409.pp
5410Speech is made by varying the postures of the various organs of the vocal tract.
5411Different vowels, for example, result largely from different tongue positions
5412and lip postures.  Naturally, such physical changes alter the frequencies of the
5413resonances, and successful automatic speech synthesis depends upon
5414successful movement of the formants.  Fortunately, only the first three or
5415four resonances need to be altered even for extremely realistic synthesis, and
5416virtually all existing synthesizers provide control over these formants only.
5417.rh "Analysis of a single formant."
5418Each formant is modelled as a second-order resonance, with transfer function
5419.LB
5420.EQ
5421H(s) ~~ = ~~ {w sub c sup 2} over {s sup 2 ~+~ b s ~+~ w sub c sup 2} ~ .
5422.EN
5423.LE
5424As will be shown below, $w sub c$ is the nominal resonant frequency in
5425radians/s, and $b$ is the
5426approximate 3\ dB bandwidth of the resonance.  The term $w sub c sup 2$ in the
5427numerator adjusts the gain to be unity at DC ($s=0$).
5428.pp
5429To calculate the frequency response of the formant, write  $s=jw$.  Then the
5430energy spectrum is
5431.LB
5432.EQ
5433|H(jw)| sup 2 ~~ mark = ~~
5434{w sub c sup 4} over {(w sup 2 - w sub c sup 2 ) sup 2 ~+~ b sup 2 w sup 2}
5435.EN
5436.sp
5437.sp
5438.EQ
5439lineup = ~~
5440{w sub c sup 4} over
5441{[w sup 2 ~-~(w sub c sup 2 -~ {b sup 2} over 2 )] sup 2 ~~
5442+~~b sup 2 (w sub c sup 2~-~{{b sup 2} over 4})} ~ .
5443.EN
5444.sp
5445.LE
5446This reaches a maximum when the squared term in the denominator of the second
5447expression is zero, namely when  $w=(w sub c sup 2 ~-~ b sup 2 /2) sup 1/2$.
5448However,
5449formant bandwidths are low compared with their centre frequencies, and so to
5450a good approximation the peak occurs
5451at  $w=w sub c$  and is of amplitude  $w sub c /b$,  that
is,  $20~log sub 10 ~(w sub c /b)$\ dB above the DC gain.
5453At frequencies higher than the peak the energy falls off as $1/w sup 4$,
5454a factor of 1/16 for each doubling
5455in frequency, and so the asymptotic decay is 12\ dB/octave.
5456.pp
5457At the points which are 3\ dB below the peak,
5458.LB
5459.EQ
5460|H(jw sub 3dB )| sup 2 ~~ = ~~
54611 over 2 ~|H(jw sub max )| sup 2 ~~ = ~~
54621 over 2 ~ times ~ {w sub c sup 2} over {b sup 2} ~ ,
5463.EN
5464.LE
5465and it is easy to show that
5466this is satisfied by  $w sub 3dB ~ = ~ w sub c ~ +- ~ b/2$  to a
5467good approximation (neglecting higher powers of $b/w sub c )$.  Figure 5.7
5468summarizes the shape of an individual formant resonance.
5469.FC "Figure 5.7"
5470.pp
5471The bandwidth of a formant is fairly constant, regardless of the formant
5472frequency.  This makes the formant filter a slightly unusual one:  most
5473engineering applications which use variable-frequency resonances require
5474the bandwidth to be a constant proportion of the resonant
5475frequency \(em the ratio
5476$w sub c /b$, often called the "$Q$" of the filter, is to be constant.
5477For formants, we wish the Q to increase linearly with resonant frequency.
5478Since the amplitude gain of the formant at resonance is $w sub c /b$,
5479this peak gain increases as the formant frequency is increased.
5480.pp
5481Although it is easy to measure formant frequencies on a spectrogram
5482(cf Chapter 2),
5483it is not so easy to measure bandwidths accurately.  One rather unusual method
5484was reported by van den Berg (1955), who took a subject who had had a partial
5485laryngectomy, an operation which left an opening into the vocal tract near
5486the larynx position.  Into this he inserted a sound source and made a
5487swept-frequency calibration of the vocal tract!
5488.[
5489Berg van den 1955
5490.]
5491Almost as bizarre is a
5492technique which involves setting off a spark inside the mouth of a subject
5493as he holds his articulators in a given position.
5494.pp
5495The results of several different kinds of experiment are reported by Dunn (1961),
5496and are summarized in Table 5.1, along with the formant frequency ranges.
5497.[
5498Dunn 1961
5499.]
5500.RF
5501.in+0.5i
5502.ta 1.7i +2.5i
5503.nr x1 (\w'range of formant'/2)
5504.nr x2 (\w'range of bandwidths'/2)
5505	\h'-\n(x1u'range of formant	\h'-\n(x2u'range of bandwidths
5506.nr x1 (\w'frequencies (Hz)'/2)
5507.nr x2 (\w'as measured in different'/2)
5508	\h'-\n(x1u'frequencies (Hz)	\h'-\n(x2u'as measured in different
5509.nr x1 (\w'experiments (Hz)'/2)
5510		\h'-\n(x1u'experiments (Hz)
5511.nr x1 (\w'0000 \- 0000'/2)
5512.nr x2 (\w'000 \- 000'/2)
5513.nr x0 2.5i+(\w'range of formant'/2)+(\w'as measured in different'/2)
5514.nr x3 (\w'range of formant'/2)
5515	\h'-\n(x3u'\l'\n(x0u\(ul'
5516.sp
5517formant 1	\h'-\n(x1u'\0100 \- 1100	\h'-\n(x2u'\045 \- 130
5518formant 2	\h'-\n(x1u'\0500 \- 2500	\h'-\n(x2u'\050 \- 190
5519formant 3	\h'-\n(x1u'1500 \- 3500	\h'-\n(x2u'\070 \- 260
5520	\h'-\n(x3u'\l'\n(x0u\(ul'
5521.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
5522.in-0.5i
5523.MT 2
5524Table 5.1  Different estimates of formant bandwidths, with range of
5525formant frequencies for reference
5526.TE
5527Note that the bandwidths really are narrow compared with the resonant frequencies
5528of the filters, except at the lower end of the formant 1 range.  Choosing the
5529lowest bandwidth estimate leads to an amplification factor at resonance of 50 for formant 2
5530when its frequency is at the top of its range; and formant 3 happens to give
5531the same value.
5532.rh "Series synthesizers."
5533The simplest realization of the vocal tract filter is a chain of formant
5534filters in series, as illustrated in Figure 5.8.
5535.FC "Figure 5.8"
5536This leads to particular difficulties if the frequencies of two formants
5537stray close together.  The worst case occurs if formants 2 and 3 have the
5538same resonant frequencies, at the top of the range of formant 2, namely 2500\ Hz.
5539In this case, and if the bandwidths of the formants are set to the lowest
5540estimates, a combined amplification factor
5541of  $(2500/50) times (2500/70)=1800$  is
5542obtained at the point of resonance \(em that is,
554365\ dB above the DC value.  This is enough
to tax most analogue implementations, and can cause clipping in the formant
5545filters, with a very noticeable effect on speech quality.  This
5546extreme case will not occur during synthesis of realistic speech, for
5547although the formant
5548.ul
5549ranges
5550overlap, the values for any particular (human) sound will not coincide exactly.  However,
5551it illustrates the difficulty of designing a series synthesizer which copes
5552sensibly with arbitrary parameter settings, and explains why designers often
5553choose formant bandwidths in the top half of the ranges given in Table 5.1.
5554.pp
5555The problem of excessive amplification within a series synthesizer can be
5556alleviated to a small extent by choosing carefully the order in which the
5557filters are placed in the chain.  In a linear system, of course, the order in
5558which the components occur does not matter.
5559In physical implementations, however, it is advantageous to minimize extreme
5560amplification at intermediate points.  By placing the formant 1 filter between
5561formants 2 and 3, the formant 2 resonance is attenuated somewhat before it
5562reaches formant 3.  Continuing with the extreme example above, where both
5563formants 2 and 3 were set to 2500\ Hz; assume that formant 1 is at its
5564nominal value of 500\ Hz.  It provides attenuation at approximately 12\ dB/octave
5565above this, and so at the formant 2 peak, 2.3\ octaves higher, the attenuation
5566is 28\ dB.  Thus the gain at 2500\ Hz,
5567which is $20 ~ log sub 10 ~ 2500/50 ~ = ~ 34$\ dB after
5568passing through the formant 2 filter, is reduced to 6\ dB by formant 1, only
5569to be increased by $20 ~ log sub 10 ~ 2500/70 ~ = ~ 31$\ dB to
5570a value of 37\ dB by formant 3.
5571This avoids the extreme 65\ dB gain of formants 2 and 3 combined.
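.pp
The gain bookkeeping for this worst case is easily reproduced.  The Python
fragment below (illustration only) follows the level of the 2500\ Hz resonance
along the chain F2\(emF1\(emF3.
.LB
.nf
import math

# Worst case: F2 = F3 = 2500 Hz with the narrowest bandwidths, F1 = 500 Hz
gain_f2  = 20.0 * math.log10(2500.0 / 50.0)     # about 34 dB
atten_f1 = -12.0 * math.log2(2500.0 / 500.0)    # about -28 dB (12 dB/octave)
gain_f3  = 20.0 * math.log10(2500.0 / 70.0)     # about 31 dB

print(gain_f2)                                  # after F2:        34 dB
print(gain_f2 + atten_f1)                       # after F2 and F1:  6 dB
print(gain_f2 + atten_f1 + gain_f3)             # after F3:        37 dB
.fi
.LE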
5572.pp
5573Figure 5.8 shows only three formant filters modelled explicitly.
5574The effect of the rest \(em and they do have an effect, although it is small
5575at low frequencies \(em is
5576incorporated by lumping them together into the "higher-formant correction" filter.
5577To calculate the characteristics of this filter, assume that the lumped
5578formants have the values given by the simple uniform-tube model of Chapter 2,
5579namely 3500\ Hz for formant 4, 4500\ Hz for formant 5, and, in general,
5580$500(2n-1)$\ Hz for formant $n$.  The effect of each of these on the spectrum is
5581.LB
5582.EQ
558310~ log sub 10  {w sub n sup 4} over {(w sup 2 ~-~w sub n sup 2 ) sup 2
5584~~+~~b sub n sup 2 w sup 2}
5585~~ = ~~ -~ 10~ log sub 10 ~[(1~-~~{{w sup 2} over {w sub n sup 2}}) sup 2
5586~~+~~ {{b sub n sup 2 w sup 2} over {w sub n sup 4}}]
5587.EN
5588dB,
5589.LE
5590following from what was calculated above.
5591We will have to approximate this by assuming that
5592$b sub n sup 2 /w sub n sup 2$ is
5593negligible \(em this is quite reasonable for these higher formants because
5594Table 5.1 shows that the bandwidth does not increase in proportion to the
5595formant frequency range \(em and approximate the logarithm by the first
5596term of its series expansion:
5597.LB
5598.EQ
5599-10 ~ log sub 10 ~ (1~-~~{{w sup 2} over {w sub n sup 2}}) sup 2
5600~~ = ~~ -20~ log sub 10 ~ e ~ log sub e
5601(1~-~~{{w sup 2} over {w sub n sup 2}})
5602~~ = ~~ 20~ log sub 10 ~ e ~ times ~ {w sup 2} over {w sub n sup 2} ~ .
5603.EN
5604.LE
5605.pp
5606Now the total effect of formants 4, 5, ... at frequency $f$\ Hz (as distinct
5607from $w$\ radians/s) is
5608.LB
5609.EQ
561020~ log sub 10 ~ e ~ times ~ sum from n=4 to infinity
5611~{{f sup 2} over {500 sup 2 (2n-1) sup 2}} ~ .
5612.EN
5613.LE
5614This expression is
5615.LB
5616.EQ
561720~ log sub 10 ~ e ~ times ~
5618{{f sup 2} over {500 sup 2}}~~(~sum from n=1 to infinity
5619~{1 over {(2n-1) sup 2}} ~~-~~ sum from n=1 to 3 ~{1 over {(2n-1) sup 2}}~)
5620~ .
5621.EN
5622.LE
5623The infinite sum can actually be calculated in closed form, and is equal
5624to  $pi sup 2 /8$.  Hence the total correction is
5625.LB
5626.EQ
562720~ log sub 10 ~ e ~ times {{f sup 2} over {500 sup 2}}
5628~~(~{pi sup 2} over 8 ~~-~~ sum from n=1 to 3 ~{1 over {(2n-1) sup 2}}~)
5629~~ = ~~ 2.87 times 10 sup -6 f sup 2
5630.EN
5631dB.
5632.LE
5633.pp
5634Although this may at first seem to be a rather small correction,
5635it is in fact 72\ dB when
5636$f=5$\ kHz!  On further reflection this is not an unreasonable figure, for the
563712\ dB/octave decays contributed by formants 1, 2, and 3 must all be annihilated
5638by the higher-formant correction to give an overall flat spectral trend.
5639In fact, formant 1 will contribute
564012\ dB/octave from 500\ Hz (3.3\ octaves to 5\ kHz, representing 40\ dB); formant
56412 will contribute 12\ dB/octave from 1500\ Hz (1.7\ octaves to 5\ kHz, representing
564221\ dB); and formant 3 will contribute 12\ dB/octave from 2500\ Hz (1\ octave to 5\ kHz,
5643representing 12\ dB).
5644These sum to 73\ dB.
5645.pp
5646If the first five formants are synthesized explicitly instead of just the
5647first three, the correction is
5648.LB
5649.EQ
565020~ log sub 10 ~ e ~ times ~ {{f sup 2} over {500 sup 2}}
5651~~(~{pi sup 2} over 8 ~-~~ sum from n=1 to 5 ~{1 over {(2n-1) sup 2}}~)
5652~~ = ~~ 1.73 times 10 sup -6  f sup 2
5653.EN
5654dB,
5655.LE
5656giving a rather more reasonable value of 43\ dB when $f=5$\ kHz.  In actual
5657implementations, fixed filters are sometimes included explicitly for
5658formants 4 and 5.  Although this lowers the gain of the higher-formant
5659correction filter, the total amplification at 5\ kHz of the combined correction
5660is still 72\ dB.  If one is less demanding and aims for a synthesizer that
5661produces a correct spectrum only up to 3.5\ kHz, it is 35\ dB.
5662This places quite stringent requirements on the preceding formant filters if
5663the stray noise that they generate internally is not to be amplified to
5664perceptible magnitudes by the correction filter at high frequencies.
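.pp
The correction figures quoted above follow directly from the closed-form
expression, as the short Python calculation below (illustration only) confirms.
.LB
.nf
import math

# Higher-formant correction in dB at frequency f, when the first
# 'explicit' formants are modelled and the rest are lumped together
def correction_db(f_hz, explicit=3):
    partial = sum(1.0 / (2*n - 1)**2 for n in range(1, explicit + 1))
    coeff = 20.0 * math.log10(math.e) * (math.pi**2 / 8.0 - partial) / 500.0**2
    return coeff * f_hz * f_hz

print(correction_db(5000.0, 3))    # about 72 dB with three explicit formants
print(correction_db(5000.0, 5))    # about 43 dB with five
print(correction_db(3500.0, 3))    # about 35 dB at telephone bandwidth
.fi
.LE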
5665.pp
5666Explicit inclusion of fixed filters for formants 4 and 5 undoubtedly improves
5667the accuracy of the higher-formant correction.  Recall that the above derivation
5668of the correction filter characteristic used the first-order approximation
5669.LB
5670.EQ
5671log sub e (1~-~{{w sup 2} over {w sub n sup 2}})
5672~~ = ~~ -~ {w sup 2} over {w sub n sup 2} ~ ,
5673.EN
5674.LE
5675which is only valid if $w << w sub n$.
5676Thus it only holds at frequencies less than
5677the highest explicitly synthesized formant,
5678and so with formants 4 (3.5\ kHz) and
56795 (4.5\ kHz) included a reasonable correction should be obtained for
5680telephone-quality speech.  However, detailed analysis with a second-order
5681approximation shows that the coefficient of the neglected term is in fact
5682small (Fant, 1960).
5683.[
5684Fant 1960 Acoustic theory of speech production
5685.]
5686A second, perhaps more compelling, reason for explicitly
5687including a couple of fixed formants is that the otherwise enormous amplification
5688provided by the correction can be distributed throughout the formant chain.
We saw earlier why there is reason to place the formant 1 filter between
formants 2 and 3, rather than use the natural order F1\(emF2\(emF3.
5691With explicit formants 4 and 5, a suitable order which helps
5692to keep the amplification at intermediate points in the chain within reasonable
5693bounds is F3\(emF5\(emF2\(emF4\(emF1.
5694.rh "Parallel synthesizers."
5695A series synthesizer models the vocal tract resonances by a chain of formant
5696filters in series.  A parallel synthesizer utilizes a parallel connection of
5697filters as illustrated in Figure 5.9.
5698.FC "Figure 5.9"
5699.pp
5700Consider a parallel combination of two formants with individually-controllable
5701amplitudes.  The combined transfer function is
5702.LB
5703.EQ
5704H(s) ~~ mark = ~~ {A sub 1 w sub 1 sup 2} over
5705{s sup 2 ~+~ b sub 1 s ~+~ w sub 1 sup 2}
5706~~+~~{A sub 2 w sub 2 sup 2} over {s sup 2 ~+~ b sub 2 s ~+~ w sub 2 sup 2}
5707.EN
5708.sp
5709.sp
5710.EQ
5711lineup = ~~ { (A sub 1 w sub 1 sup 2 + A sub 2 w sub 2 sup 2 )s sup 2
5712~+~(A sub 1 b sub 2 w sub 1 sup 2 + A sub 2 b sub 1 w sub 2 sup 2 )s
5713~+~ (A sub 1 +A sub 2 )w sub 1 sup 2 w sub 2 sup 2 }
5714over
5715{ (s sup 2 ~+~b sub 1 s~+~w sub 1 sup 2 )
5716(s sup 2 ~+~b sub 2 s~+~w sub 2 sup 2 ) }
5717.EN
5718.LE
5719If the formant bandwidths $b sub 1$ and $b sub 2$
5720are equal and the amplitudes are
5721chosen as
5722.LB
5723.EQ
5724A sub 1 ~~=~~ {w sub 2 sup 2} over {w sub 2 sup 2 -w sub 1 sup 2}
5725~~~~~~~~
5726A sub 2 ~~=~~-~ {w sub 1 sup 2} over {w sub 2 sup 2 -w sub 1 sup 2} ~ ,
5727.EN
5728.LE
5729then the transfer function becomes the same as that of a two-formant series synthesizer,
5730namely
5731.LB
5732.EQ
5733H(s) ~~ = ~~ {w sub 1 sup 2} over {s sup 2 ~+~ b sub 1 s ~+~ w sub 1 sup 2}
5734~ . ~{w sub 2 sup 2} over {s sup 2 ~+~ b sub 2 s ~+~ w sub 2 sup 2} ~ .
5735.EN
5736.LE
5737The argument can be extended to any number of formants, under the assumption
5738that the formant bandwidths are equal.  Note that the signs of $A sub 1$
5739and $A sub 2$
5740differ:  in general the formant amplitudes for a parallel synthesizer alternate
5741in sign.
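.pp
This equivalence can be verified numerically.  The Python fragment below
(with illustrative formant values) evaluates both forms at an arbitrary
frequency and shows that they agree.
.LB
.nf
import math

# Two formants, equal bandwidths; amplitudes chosen as in the text
w1 = 2.0 * math.pi * 500.0         # formant 1, rad/s
w2 = 2.0 * math.pi * 1500.0        # formant 2, rad/s
b  = 2.0 * math.pi * 80.0          # common bandwidth, rad/s

A1 =  w2*w2 / (w2*w2 - w1*w1)
A2 = -w1*w1 / (w2*w2 - w1*w1)

def formant(s, wc):
    return wc*wc / (s*s + b*s + wc*wc)

s = 1j * 2.0 * math.pi * 1000.0              # evaluate at 1 kHz
series   = formant(s, w1) * formant(s, w2)
parallel = A1 * formant(s, w1) + A2 * formant(s, w2)
print(abs(series - parallel))                # essentially zero
.fi
.LE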
5742.pp
5743In theory, therefore, it would be possible to use five parallel formants to
5744model a five-formant series synthesizer exactly.  Then the same higher-formant
5745correction filter would be needed for the parallel synthesizer as for the
5746series one.  If the formant amplitudes were set slightly incorrectly, however,
5747the five filters would not combine to give a total of 60\ dB/octave high-frequency
5748decay above the resonances.  It is easy to see this in the context of the
5749simplified two-formant combination above:  if the amplitudes were not chosen
5750exactly right then the $s sup 2$
5751term in the numerator would not be quite zero.
5752Then, the decay in the two-formant combination would be \-12\ dB/octave instead
5753of \-24\ dB/octave, and in the five-formant case the decay would in fact still be
5754\-12\ dB/octave.  Advantage can be taken of this to equalize the levels
5755within the synthesizer so that large amplitude variations do not occur.
5756This can best be done by associating relatively low-gain fixed correction filters
5757with each formant instead of providing one comprehensive correction to the
5758combined spectrum:  these are shown in Figure 5.9.
5759Suitable correction filters
5760have been determined empirically by Holmes (1972).
5761.[
5762Holmes 1972 Speech synthesis
5763.]
5764They provide a 6\ dB/octave
5765lift above 640\ Hz for formant 1, and 6\ dB/octave lift above 300\ Hz for formant
57662.  Formants 3 and 4 are uncorrected, whilst for formant 5 the correction begins
5767as a 6\ dB/octave decay above 600\ Hz and increases to an 18\ dB/octave decay
5768above 5.5\ kHz.
5769.pp
5770The disadvantage of a parallel synthesizer is that the amplitudes of the
5771formants must be specified as well as their frequencies.  (Furthermore, the
formant bandwidths should all be equal; in practice they are often made equal
in series synthesizers too, because of the uncertainty about their exact
values.)  However, the extra amplitude parameters clearly give greater
5775control over the frequency spectrum of the synthesized speech.
5776.pp
5777A good example of how this extra control can usefully be exploited is the
5778synthesis of nasal sounds.
5779Nasalization introduces a cavity parallel to the oral tract, as illustrated
5780in Figure 5.10, and this causes zeros in the transfer function.
5781.FC "Figure 5.10"
5782It is as if two different copies of the vocal tract transfer function, one for
5783the oral and the other for the nasal passage, were added
5784together.  We have seen the effect of this above when considering parallel
5785synthesis.  The combination
5786.LB
5787.EQ
5788H(s) ~~ = ~~ {A sub 1 w sub o sup 2} over
5789{s sup 2 ~+~ b sub o s ~+~ w sub o sup 2}
5790~~+~~{A sub 2 w sub n sup 2}
5791over {s sup 2 ~+~ b sub n s ~+~ w sub n sup 2} ~ ,
5792.EN
5793.LE
5794where the subscript "$o$" stands for oral and "$n$" for nasal,
5795produces zeros in the
5796numerator (unless the amplitudes are carefully adjusted to avoid them).
5797These cannot be modelled by a series synthesizer, but they obviously can be
5798by a parallel one.
5799.pp
5800Although they are certainly needed for accurate imitation of human speech,
5801transfer function zeros to simulate nasal sounds are not essential for
5802synthesis of intelligible English.  It is not difficult to get a sound
5803like a nasal consonant
5804(\c
5805.ul
5806n,
5807or
5808.ul
5809m\c
5810)
5811with an all-pole synthesizer.
5812Nevertheless, it is certainly true that a parallel synthesizer gives better
5813.ul
5814potential
5815control over the spectrum than a series one.  Whether the added flexibility
5816can be used properly by a synthesis-by-rule computer program is another matter.
5817.rh "Implementation of formant filters."
5818Formant filters can be built in either analogue or digital form.  A
5819second-order resonance is needed, whose centre frequency can be controlled
5820but whose bandwidth is fixed.  If the control can be arranged as two
5821tracking resistors, then the simple analogue configuration of Figure 5.11,
5822with two operational amplifiers, will suffice.
5823.FC "Figure 5.11"
5824.pp
5825The transfer function of this arrangement is
5826.LB
5827.EQ
5828- ~~ { 1/C sub 1 R sub 1 C sub 2 R sub 2 } over
5829{ s sup 2 ~~+~~ {1 over {C sub 2 R sub 2}}~s
5830~~+~~{1 over {C sub 1 R' sub 1 C sub 2 R sub 2 }}} ~ ,
5831.EN
5832.LE
5833which characterizes it as a low-pass resonator with DC gain
5834of  $- R' sub 1 /R sub 1 $,  bandwidth of  $1/2 pi C sub 2 R sub 2$\ Hz,  and
5835centre frequency of  $1/2 pi (C sub 1 R' sub 1 C sub 2 R sub 2 ) sup 1/2$\ Hz.
5836Tracking $R' sub 1$ with $R sub 1$ ensures that the DC gain remains constant,
5837and that the centre frequency follows  $R sub 1 sup -1/2$.  Moreover,
5838neither is especially sensitive to slight departures from exact tracking
5839of $R' sub 1$ with $R sub 1$.
5840Such a filter has been used in a simple hand-controlled speech synthesizer,
5841built for demonstration and amusement (Witten and Madams, 1978).
5842.[
5843Witten Madams 1978 Chatterbox
5844.]
5845However, the need for tracking resistors, and the inverse square root variation
5846of the formant frequency with $R sub 1$, makes it rather unsuitable for serious
5847applications.
5848.pp
5849A better analogue filter is the ring-of-three configuration
5850shown in Figure 5.12.
5851.FC "Figure 5.12"
5852(Ignore the secondary output for now.)  Control
5853is achieved over the centre frequency by two multipliers, driven from
5854the same control input $k$.  These have a high-impedance output, producing a
5855current $kx$ if the input voltage is $x$.
5856It is not too difficult to show that the transfer function of the circuit is
5857.LB
5858.EQ
5859- ~~ { {k sup 2} over {C sup 2} } over
5860{ s sup 2 ~~+~~ 2 over RC ~s
5861~~+~~{1+k sup 2 R sup 2} over {R sup 2 C sup 2} } ~ .
5862.EN
5863.LE
5864Suppose that $R$ is chosen so that  $k sup 2 R sup 2 ~ >>~ 1$.  Then this is a
5865unity-gain resonator with constant bandwidth  $1/ pi RC$\ Hz  and centre
5866frequency  $k/2 pi C$\ Hz.  Note that it is the combination of both multipliers that
5867makes the centre frequency grow linearly with $k$:  with one multiplier there
5868would be a square-root relationship.
5869.pp
5870The ring-of-three filter of Figure 5.12 is arranged in a slightly unusual
5871way, with an inverting stage at the beginning and the two resonant stages
5872following it.  This ensures that the signal level at intermediate
5873points in the filter does not exceed that at the output, and gives the filter
5874the best chance of coping with a wide range of input amplitudes without
5875clipping.  This contrasts markedly with the resonator of Figure 5.11, where
5876the voltage at the output of the first integrator is $w/b$ times the final output \(em a
5877factor of 50 in the worst case.
5878.pp
5879For a digital implementation of a formant, consider the recurrence relation
5880.LB
5881.EQ
5882y(n)~ = ~~ a sub 1 y(n-1) ~-~ a sub 2 y(n-2) ~+~ a sub 0 x(n) ,
5883.EN
5884.LE
5885where $x(n)$ is the input and $y(n)$ the output at time $n$,
5886$y(n-1)$ and $y(n-2)$ are the previous two values of the output,
5887and $a sub 0$, $a sub 1$, and $a sub 2$ are (real) constants.
5888The minus sign is in front of the second term because it makes $a sub 2$
5889turn out to be
5890positive.  To calculate the $z$-transform version of this relationship, multiply
5891through by $z sup -n$ and sum from $n=- infinity$ to $infinity$ :
5892.LB "nn"
5893.EQ
5894sum from {n=- infinity} to infinity ~y(n)z sup -n ~~ mark =~~
5895a sub 1 sum from {n=- infinity} to infinity ~y(n-1)z sup -n ~~-~
5896a sub 2 sum from {n=- infinity} to infinity ~y(n-2)z sup -n ~~+~
5897a sub 0 sum from {n=- infinity} to infinity ~x(n)z sup -n
5898.EN
5899.sp
5900.EQ
5901lineup = ~~ a sub 1 z sup -1 ~ sum ~y(n-1)z sup -(n-1) ~~-~~
5902a sub 2 z sup -2 ~ sum ~y(n-2)z sup -(n-2)
~~+~~ a sub 0 ~ sum ~x(n)z sup -n ~ .
5904.EN
5905.LE "nn"
5906Writing this in terms of $z$-transforms,
5907.LB
5908.EQ
5909Y(z)~ = ~~ a sub 1 z sup -1 Y(z) ~-~ a sub 2 z sup -2 Y(z) ~+~ a sub 0 X(z) .
5910.EN
5911.LE
5912Thus the input-output transfer function of the system is
5913.LB
5914.EQ
5915H(z)~ = ~~ Y(z) over X(z)
5916~~=~~ {a sub 0 } over {1~-~a sub 1 z sup -1 ~+~a sub 2 z sup -2} ~ .
5917.EN
5918.LE
5919.pp
5920We learned in the previous chapter that the frequency response is obtained
5921from the $z$-transform of a system by replacing $z sup -1$
5922by  $e sup {-j2 pi fT}$,  where $f$ is the frequency variable in\ Hz.
5923Hence the amplitude response of the digital formant filter is
5924.LB
5925.EQ
5926|H(e sup {j2 pi fT} )| sup 2
~~ = ~~ left | {a sub 0} over {1~-~a sub 1 e sup {-j2 pi fT}
~+~a sub 2 e sup {-j4 pi fT} } ~ right | sup 2 ~ .
5929.EN
5930.sp
5931.LE
5932It is fairly obvious from this that a DC gain of 1 is obtained if
5933.LB
5934.EQ
5935a sub 0 ~ = ~~ 1 ~-~ a sub 1  ~+~ a sub 2 ,
5936.EN
5937.LE
5938for  $e sup {-j2 pi fT}$  is 1 at a frequency of 0\ Hz.  Some manipulation is
5939required to show that, under the usual assumption that the bandwidth is
5940small, the centre frequency is
5941.LB
5942.EQ
59431 over {2 pi T} ~~ cos sup -1 ~ {a sub 1} over {2 a sub 2 sup 1/2} ~
5944.EN
5945Hz.
5946.LE
5947Furthermore, the 3\ dB bandwidth of the resonance is given approximately by
5948.LB
5949.EQ
5950-~ 1 over {2 pi T} ~~ log sub e a sub 2 ~
5951.EN
5952Hz.
5953.LE
5954.pp
5955As an example, Figure 5.13 shows an amplitude response for this digital filter.
5956.FC "Figure 5.13"
5957The parameters $a sub 0$, $a sub 1$ and $a sub 2$
5958were generated from the above
5959relationships for a sampling frequency of 8\ kHz, centre frequency of 1\ kHz,
5960and bandwidth of 75\ Hz.
5961It exhibits a peak of approximately the right bandwidth at the correct
5962frequency, 1\ kHz.  Note that the response is flat at half the sampling
5963frequency, for the frequency response from 4\ kHz to 8\ kHz is just a reflection of
5964that up to 4\ kHz.
5965This contrasts sharply with that of an analogue formant filter, also shown
5966in Figure 5.13, which slopes
5967at \-12\ dB/octave at frequencies above resonance.
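.pp
The coefficient relationships just given translate directly into code.  The
Python sketch below reproduces the parameters of Figure 5.13 (8\ kHz sampling,
1\ kHz centre frequency, 75\ Hz bandwidth) and applies the recurrence to an
input sequence.
.LB
.nf
import math

# Digital formant filter  y(n) = a1*y(n-1) - a2*y(n-2) + a0*x(n)
def formant_coefficients(fc_hz, bw_hz, fs_hz):
    T  = 1.0 / fs_hz
    a2 = math.exp(-2.0 * math.pi * bw_hz * T)             # sets the bandwidth
    a1 = 2.0 * math.sqrt(a2) * math.cos(2.0 * math.pi * fc_hz * T)
    a0 = 1.0 - a1 + a2                                    # unity gain at DC
    return a0, a1, a2

def formant_filter(x, a0, a1, a2):
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = a1 * y1 - a2 * y2 + a0 * xn
        y.append(yn)
        y1, y2 = yn, y1
    return y

a0, a1, a2 = formant_coefficients(1000.0, 75.0, 8000.0)
response = formant_filter([1.0] + [0.0] * 99, a0, a1, a2)   # impulse response
.fi
.LE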
5968.pp
5969The behaviour of a digital formant filter at frequencies above
5970resonance actually makes it preferable to an analogue implementation.
5971We saw earlier that considerable trouble must be taken with the latter to
5972compensate for the cumulative effect of \-12\ dB/octave at higher frequencies for
5973each of the formants.
5974This is not necessary with digital implementations, for the response of
5975a digital formant filter is flat at half the sampling frequency.  In fact, further
5976study shows that digital synthesizers without any higher-pole correction
5977give a closer approximation to the vocal tract than analogue ones with higher-pole
5978correction (Gold and Rabiner, 1968).
5979.[
5980Gold Rabiner 1968 Analysis of digital and analogue formant synthesizers
5981.]
5982.rh "Time-domain methods."
5983An interesting alternative to frequency-domain speech synthesis is to construct
5984the formants in the time domain.  When a second-order resonance is excited by
5985an impulse, an exponentially decaying sinusoid is produced, as illustrated by
5986Figure 5.14.
5987.FC "Figure 5.14"
5988The oscillation occurs at the resonant frequency of the filter,
5989while the decay is related to the bandwidth.  In fact, if the formant filter
5990has transfer function
5991.LB
5992.EQ
5993{w sup 2} over {s sup 2 ~+~ b s ~+~ w sup 2} ~ ,
5994.EN
5995.LE
5996the time waveform for impulsive excitation is
5997.LB
5998.EQ
5999x(t)~ = ~~ w~ e sup -bt/2 ~ sin ~ wt ~~~~~~~~
6000.EN
6001(neglecting  $b sup 2 /w sup 2$).
6002.LE
6003It is the combination of several such time waveforms, coupled with the regular
6004reappearance of excitation at the pitch period, that produces the characteristic
6005wiggly waveform of voiced speech.
6006.pp
6007Now suppose we take a sine wave of frequency $w$ and multiply it by a
6008decaying exponential  $e sup -bt/2$.  This gives a signal
6009.LB
6010.EQ
6011x(t)~ = ~~ e sup -bt/2 ~ sin ~ wt ,
6012.EN
6013.LE
6014which is identical with the filtered impulse except for a factor $w$.
6015If there are several formants in parallel, all with the same bandwidth,
6016the exponential factor is the same for each:
6017.LB
6018.EQ
6019x(t)~ = ~~ e sup -bt/2 ~ (A sub 1 ~ sin ~ w sub 1 t
6020~~+ ~~ A sub 2 ~ sin ~ w sub 2 t ~~ + ~~ A sub 3 ~  sin ~ w sub 3 t) .
6021.EN
6022.LE
6023$A sub 1$, $A sub 2$, and $A sub 3$ control the formant amplitudes,
6024as in an ordinary parallel synthesizer;
6025except that they need adjusting to account for the missing
6026factors $w sub 1$, $w sub 2$, and $w sub 3$.
6027.pp
6028A neat way of implementing such a synthesizer digitally is to store one cycle of a
6029sine wave in a read-only memory (ROM).  Then, the formant frequencies can be
6030controlled by reading the ROM at different rates.  For example, if twice the
6031basic frequency is desired, every second value should be read.
6032Multiplication is needed for amplitude control of each formant:  this can be
6033accomplished by shifting the digital word (each place shifted accounts for
60346\ dB of attenuation).  Finally, the exponential damping factor can be
6035provided in analogue hardware by a single capacitor after the D/A converter.
6036This implementation gives a system for hardware-software synthesis which
6037involves an absolutely minimal amount of extra hardware apart from the computer,
6038and does not need hardware multiplication for real-time operation.
6039It could easily be made to work in real time with a microprocessor coupled
6040to a D/A converter, damping capacitor, and fixed tone-control filter to give
6041the required spectral equalization.
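.pp
A minimal software sketch of this scheme is given below (in Python, for
illustration), with the analogue damping capacitor replaced by a per-sample
decay factor.  The formant frequencies, amplitudes and bandwidth chosen are
arbitrary.
.LB
.nf
import math

FS    = 10000                                    # sampling rate, Hz
TABLE = [math.sin(2.0 * math.pi * n / 256) for n in range(256)]

def pitch_period(pitch_hz, formants, amplitudes, bandwidth_hz):
    decay = math.exp(-math.pi * bandwidth_hz / FS)    # exp(-b t / 2) per sample
    env, samples = 1.0, []
    for n in range(int(FS / pitch_hz)):
        value = 0.0
        for f, a in zip(formants, amplitudes):
            index = int(256 * f * n / FS) % 256       # read the table at rate f
            value = value + a * TABLE[index]
        samples.append(env * value)
        env = env * decay
    return samples

period = pitch_period(100.0, [500.0, 1500.0, 2500.0], [1.0, 0.5, 0.25], 100.0)
.fi
.LE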
6042.pp
6043Because the overall spectral decay of an impulse exciting a second-order
6044formant filter is 12\ dB/octave, the appropriate equalization is +6\ dB/octave
6045lift at high frequencies, to give an overall \-6\ dB/octave spectral trend.
6046.pp
6047Note, however, that this synthesis model is an extremely basic one.  Only
impulsive excitation can be accommodated.  For fricatives, which we will
6049discuss in more detail below, a different implementation is needed.  A
6050hardware noise generator, with a few fixed filters \(em one
6051for each fricative type \(em will suffice for a simple system.  More damaging
6052is the lack of aspiration, where random noise excites the vocal tract resonances.
6053This cannot be simulated in the model.  The
6054.ul
6055h
6056sound can be provided by
6057treating it as a fricative, and although it will not sound completely realistic,
6058because there will be no variation with the formant positions of adjacent phonemes,
6059this can be tolerated because
6060.ul
6061h
6062is not too important for speech intelligibility.
6063A bigger disadvantage is the lack of proper aspiration control for producing
unvoiced stops, which as mentioned in Chapter 2 consist of a silent phase
6065followed by a burst of aspiration.
6066Experience has shown that although it is difficult to drive such a synthesizer
6067from a software synthesis-by-rule system, quite intelligible output can
6068be obtained if parameters are derived from real speech and tweaked by hand.
6069Then, for each aspiration burst the most closely-matching fricative sound
6070can be used.
6071.sh "5.4  Aspiration and frication"
6072.pp
6073The model of the vocal tract as a filter which affects the frequency spectrum
6074of the basic voiced excitation breaks down if there are constrictions in it,
6075for these introduce new sound sources caused by turbulent air.
6076The generation of unvoiced excitation has been discussed earlier in this
6077chapter:  now we must consider how to simulate the filtering action of
6078the vocal tract for unvoiced sounds.
6079.pp
6080Aspiration and frication need to be dealt with separately.  The former
6081is caused by excitation at the vocal cords \(em the cords are held
6082so close together that turbulent noise is produced.
6083This noise passes through the same vocal tract filter that modifies voiced
6084sounds, and the same kind of formant structure can be observed.
6085All that is needed to simulate it is to replace the voiced excitation
6086source by white noise, as shown in the upper part of Figure 5.15.
6087.FC "Figure 5.15"
6088.pp
6089Speech can be whispered by substituting aspiration for voicing throughout.
6090Of course, there is no fundamental frequency associated with aspiration.
6091An interesting way of assessing informally the degradation caused by inadequate
6092pitch control in a speech synthesis-by-rule system is to listen to
6093whispered speech, in which pitch variations play no part.
6094.pp
6095Voiced and aspirative excitation are rarely produced at the same time
6096in natural speech (but see the discussion in Chapter 2 about breathy voice).
6097However, the excitation can change from one to the other quite quickly, and
6098when this happens there is no discontinuity in the formant structure.
6099.pp
6100Fricative, or sibilant, excitation is quite different from aspiration,
6101because it introduces a new sound source at a different place from the vocal
6102cords.  The constriction which produces the sound may be at the lips,
6103the teeth, the hard ridge just behind the top front teeth, or further
6104back along the palate.
6105These positions each produce a different sound
6106(\c
6107.ul
6108f,
6109.ul
6110th,
6111.ul
6112s,
6113and
6114.ul
6115sh
6116respectively).  However, smooth transitions from one of these sounds to another
6117do not occur in natural speech; and dynamical movement of the frequency
6118spectrum during a fricative is unnecessary for speech synthesis.
6119.pp
6120It is necessary, however, to be able to produce an approximation to the
6121noise spectrum for each of these sound types.  This is commonly achieved
6122by a single high-pass resonance whose centre frequency can be controlled.
6123This is the purpose of the secondary output
6124of the formant filter of Figure 5.12.
6125Taking the output from this point gives a high-pass instead of a low-pass
6126resonance, and this same filter configuration is quite acceptable for
6127fricatives.  Figure 5.15 shows the fricative sound path as a noise generator
6128followed by such a filter.
6129.pp
6130Unlike aspiration, fricative excitation is frequently combined with voicing.
6131This gives the voiced fricative sounds
6132.ul
6133v,
6134.ul
6135dh,
6136.ul
6137z,
6138and
6139.ul
6140zh.
6141It is possible to produce frication and aspiration together, and although
6142there are no examples of this in English, speech synthesis-by-rule
6143programs often use a short burst of aspiration
6144.ul
6145and
6146frication when simulating the opening of unvoiced stops.
6147Separate amplitude controls are therefore needed for voicing and frication,
6148but the former can be used for aspiration as well, with a "glottal excitation
6149type" switch to indicate aspiration rather than voicing.
6150.sh "5.5  Summary"
6151.pp
6152A resonance speech synthesizer consists of a vocal tract filter, excited by
6153either a periodic pitch pulse or aspiration noise.  In addition, a set of
6154sibilant sounds must be provided.  The vocal tract filter is dynamic, with
6155three controllable resonances.  These, coupled with some fixed spectral
6156compensation, give it a fairly high order \(em about 10 complex poles are
6157needed.  Although several different sibilant sound types must be simulated,
6158dynamical movement is less important in fricative sound spectra than
6159for voiced and aspirated sounds because
6160smooth transitions between one fricative and another are not important
6161in speech.
6162However, fricative timing and amplitude must be controlled rather precisely.
6163.pp
6164The speech synthesizer is controlled by several parameters.
6165These include fundamental frequency (if voiced), amplitude of voicing,
6166frequency of the first few \(em typically three \(em formants,
6167aspiration amplitude, sibilance amplitude, and frequency of one (or more)
6168sibilance filters.
6169Additionally, if the synthesizer is a parallel one, parameters for the
6170amplitudes of individual formants will need to be included.
6171It may be that some control over formant bandwidths is provided too.
6172Thus synthesizers have from eight up to about 20 parameters (Klatt, 1980,
6173describes one with 20 parameters).
6174.[
6175Klatt 1980 Software for a cascade/parallel formant synthesizer
6176.]
6177.pp
6178The parameters are supplied to the synthesizer at regular intervals of time.
6179For a 10-parameter synthesizer, the control can be thought of as a set of
618010 graphs, each representing the time evolution of one parameter.
6181They are usually called parameter
6182.ul
6183tracks,
6184the terminology dating from the days when a track was painted on a glass
6185slide for each parameter to provide dynamic control of the synthesizer
6186(Lawrence, 1953).
6187.[
6188Lawrence 1953
6189.]
6190The pitch track is often called a pitch
6191.ul
6192contour;
6193this is a common phonetician's usage.
6194Do not confuse this with the everyday meaning of "contour"
6195as a line joining points of equal height on a map \(em a pitch contour is
6196just the time evolution of the pitch frequency.
6197.pp
6198For computer-controlled synthesizers, of course, the parameter tracks
6199are sampled, typically every 5 to 20\ msec.
6200The rate is determined by the need to generate fast amplitude transitions
6201for nasals and stop consonants.
6202Contrast it with the 125\ $mu$sec sampling period needed to digitize
6203telephone-quality speech.
6204The raw data rate for a 10-parameter synthesizer updated every 10 msec
6205is 1,000 parameters/sec, or 6\ Kbit/s if each parameter is represented
6206by 6\ bits.
6207This is a substantial reduction over the 56\ Kbit/s needed for PCM representation.
6208For speech synthesis by rule (Chapter 7), these parameter tracks
6209are generated by a computer program from a phonetic (or English)
6210version of the utterance, lowering the data rate by a further one or two
6211orders of magnitude.
6212.pp
6213Filters for speech
6214synthesizers can be implemented in either analogue or digital form.
6215High-order filters are usually broken down into second-order sections in
6216parallel or in series.  A third possibility, which has not been discussed
6217above, is to implement a single high-order filter directly.  Finally, the
6218action of formant filters can be synthesized in the time domain.  This gives
6219eight possibilities which are summarized in Table 5.2.
6220.RF
6221.in +0.5i
6222.ta 2.1i +2.0i
6223.nr x1 (\w'Analogue'/2)
6224.nr x2 (\w'Digital'/2)
6225	\h'-\n(x1u'Analogue	\h'-\n(x2u'Digital
6226.nr x0 2.0i+(\w'Liljencrants (1968)'/2)+(\w'Morris and Paillet (1972)'/2)
6227.nr x3 (\w'Liljencrants (1968)'/2)
6228	\h'-\n(x3u'\l'\n(x0u\(ul'
6229.sp
6230.nr x1 (\w'Rice (1976)'/2)
6231.nr x2 (\w'Rabiner \fIet al\fR'/2)
6232Series	\h'-\n(x1u'Rice (1976)	\h'-\n(x2u'Rabiner \fIet al\fR
6233.nr x1 (\w'Liljencrants (1968)'/2)
6234.nr x2 (\w'Holmes (1973)'/2)
6235Parallel	\h'-\n(x1u'Liljencrants (1968)	\h'-\n(x2u'Holmes (1973)
6236.nr x1 (\w'unpublished'/2)
.nr x2 (\w'unpublished'/2)
6238Time-domain	\h'-\n(x1u'unpublished	\h'-\n(x2u'unpublished
6239.nr x1 (\w'\(em'/2)
6240.nr x2 (\w'Morris and Paillet (1972)'/2)
6241High-order filter	\h'-\n(x1u'\(em	\h'-\n(x2u'Morris and Paillet (1972)
6242	\h'-\n(x3u'\l'\n(x0u\(ul'
6243.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
6244.in-0.5i
6245.FG "Table 5.2  Implementation options for resonance speech synthesizers"
6246.[
6247Rice 1976 Byte
6248.]
6249.[
6250Rabiner Jackson Schafer Coker 1971
6251.]
6252.[
6253Liljencrants 1968
6254.]
6255.[
6256Holmes 1973 Influence of glottal waveform on naturalness
6257.]
6258.[
6259Morris and Paillet 1972
6260.]
6261All but one have certainly been used as the basis for synthesis, and
6262the table includes reference to published descriptions.
6263.pp
6264Each method has advantages and disadvantages.  Series decomposition obviates
6265the need for control over the amplitudes of individual formants, but does
6266not allow synthesis of sounds which use the nasal tract as well as the oral
6267one; for these are in parallel.  Analogue implementation of series synthesizers
6268is complicated by the need for higher-pole correction, and the fact that
6269the gains at different frequencies can vary widely throughout the system.
6270Higher-pole correction is not so important for digital synthesizers.
6271Parallel decomposition eliminates some of these problems:  higher-pole correction
6272can be implemented individually for each formant.  However, the formant
6273amplitudes must be controlled rather precisely to simulate the vocal tract,
6274which is essentially serial.
6275Time-domain synthesis is associated with low hardware costs but does not
6276easily allow proper control over the excitation sources.  In particular,
6277it cannot simulate dynamical movement of the spectrum during aspiration.
6278Implementation of the entire vocal tract model as a single high-order filter,
6279without breaking it down into individual formants in series or parallel,
is attractive from the computational point of view because fewer arithmetic
6281operations are required.  It is best analysed in terms of linear predictive
6282coding, which is the subject of the next chapter.
6283.sh "5.6  References"
6284.LB "nnnn"
6285.[
6286$LIST$
6287.]
6288.LE "nnnn"
6289.sh "5.7  Further reading"
6290.pp
6291Historically-minded readers should look at the early speech synthesizer
6292designed by Lawrence (1953).
6293This and other classic papers on the subject
6294are reprinted in Flanagan and Rabiner (1973).
6295A good description of a quite sophisticated parallel synthesizer can
6296be found in Holmes (1973), above, and another of a switchable
6297series/parallel one in Klatt (1980), who even includes a listing of
6298the Fortran program that implements it.
6299Here are some useful books on speech synthesizers.
6300.LB "nn"
6301.\"Fant-1960-1
6302.]-
6303.ds [A Fant, G.
6304.ds [D 1960
6305.ds [T Acoustic theory of speech production
6306.ds [I Mouton
6307.ds [C The Hague
6308.nr [T 0
6309.nr [A 1
6310.nr [O 0
6311.][ 2 book
6312.in+2n
6313Fant really started the study of the vocal tract as an acoustic system,
6314and this book marks the beginning of modern speech synthesis.
6315.in-2n
6316.\"Flanagan-1972-1
6317.]-
6318.ds [A Flanagan, J.L.
6319.ds [D 1972
6320.ds [T Speech analysis, synthesis, and perception (2nd, expanded, edition)
6321.ds [I Springer Verlag
6322.ds [C Berlin
6323.nr [T 0
6324.nr [A 1
6325.nr [O 0
6326.][ 2 book
6327.in+2n
6328This book is the speech researcher's bible, and like the bible, it's not
6329all that easy to read.
6330However, it is an essential reference source for speech acoustics and
6331speech synthesis (as well as for human speech perception).
6332.in-2n
6333.\"Flanagan-1973-2
6334.]-
6335.ds [A Flanagan, J.L.
6336.as [A " and Rabiner, L.R.(Editors)
6337.ds [D 1973
6338.ds [T Speech synthesis
.ds [I Dowden, Hutchinson and Ross
6340.ds [C Stroudsburg, Pennsylvania
6341.nr [T 0
6342.nr [A 0
6343.nr [O 0
6344.][ 2 book
6345.in+2n
6346I recommended this book at the end of Chapter 1 as a collection of
6347classic papers on the subject of speech synthesis and synthesizers.
6348.in-2n
6349.\"Holmes-1972-3
6350.]-
6351.ds [A Holmes, J.N.
6352.ds [D 1972
6353.ds [T Speech synthesis
.ds [I Mills and Boon
6355.ds [C London
6356.nr [T 0
6357.nr [A 1
6358.nr [O 0
6359.][ 2 book
6360.in+2n
6361This little book, by one of Britain's foremost workers in the field,
6362introduces the subject of speech synthesis and speech synthesizers.
6363It has a particularly good discussion of parallel synthesizers.
6364.in-2n
6365.LE "nn"
6366.EQ
6367delim $$
6368.EN
6369.CH "6  LINEAR PREDICTION OF SPEECH"
6370.ds RT "Linear prediction of speech
6371.ds CX "Principles of computer speech
6372.pp
6373The speech coding techniques which were discussed in Chapter 3 operate
6374in the time domain, while the analysis and synthesis techniques
6375of Chapters 4 and 5 are
6376based in the frequency domain.  Linear prediction is a relatively
6377new method of speech analysis-synthesis,
6378introduced in the early 1970's and used
6379extensively since then, which is primarily a time-domain coding method
6380but can be used to give frequency-domain parameters like formant
6381frequency, bandwidth, and amplitude.
6382.pp
6383It has several advantages over other speech analysis techniques, and is
6384likely to become increasingly dominant in speech output systems.
6385As well as bridging the gap between time- and frequency-domain techniques, it
6386is of equal value for both speech storage and speech synthesis, and forms
6387an extremely convenient basis for speech-output systems which use high-quality
6388stored speech for routine messages and synthesis from phonetics or text
6389for unusual or exceptional conditions.  Linear prediction can be used to
6390separate the excitation source properties of pitch and amplitude from the
6391vocal tract filter which governs phoneme articulation, or, in other words,
6392to separate much of the prosodic from the segmental information.
6393Hence it makes it easy to use stored segmentals with synthetic prosody,
6394which is just what is needed to enhance the flexibility of stored speech by
6395providing overall intonation contours for utterances formed by word
6396concatenation (see Chapter 7).
6397.pp
6398The frequency-domain analysis technique
6399of Fourier transformation necessarily involves approximation because it
6400applies only to periodic waveforms, and so the artificial operation
6401of windowing is required to suppress the aperiodicity of real
6402speech.  In contrast, the linear predictive technique, being a time-domain
6403method, can \(em in certain forms \(em deal more rationally with aperiodic
6404signals.
6405.pp
6406The basic idea of linear predictive coding is exactly the same as
6407one form of adaptive differential pulse code modulation which
6408was introduced briefly in Chapter 3.  There it was noted that a speech
6409sample $x(n)$ can be predicted quite closely by the previous sample
6410$x(n-1)$.  The prediction can be improved by multiplying the previous
6411sample by a number, say $a sub 1$, which is adapted on a syllabic
6412time-scale.  This can be utilized for speech coding by transmitting
6413only the prediction error
6414.LB
6415.EQ
6416e(n)~=~~x(n)~-~a sub 1 x(n-1),
6417.EN
6418.LE
6419and using it (and the value of $a sub 1$) to reconstitute the signal
6420$x(n)$ at the receiver.  It is worthwhile noting that
6421exactly the same relationship was used for digital
6422preemphasis in Chapter 4, with the value of $a sub 1$
6423being constant at about 0.9 \(em although
6424the possibility of adapting it to take into account the difference
6425between voiced and unvoiced speech was discussed.
6426.pp
6427An obvious extension is to use several past values of the signal to form
6428the prediction, instead of just one.  Different multipliers for each would
6429be needed, so that the prediction error could be written as
6430.LB
6431.EQ
6432e(n)~~ mark =~~x(n)~-~a sub 1 x(n-1)~-~a sub 2 x(n-2)~-~...~-~a sub p x(n-p)
6433.EN
6434.sp
6435.EQ
6436lineup =~~x(n)~-~~sum from k=1 to p ~a sub k x(n-k).
6437.EN
6438.LE
6439The multipliers $a sub k$ should be adapted to minimize the error signal,
6440and we will consider how to do this in the next section.  It turns out
6441that they must be re-calculated and transmitted on a time-scale that is
6442rather faster than syllabic but much slower than
6443the basic sampling rate:  intervals
of 10\-25\ msec are usually used (compare this with the 125\ $mu$sec sampling
interval for telephone-quality speech).
6446A configuration for high-order adaptive differential
6447pulse code modulation is shown in Figure 6.1.
6448.FC "Figure 6.1"
6449.pp
6450Figure 6.2 shows typical time waveforms for each of the ten coefficients
6451over a 1-second stretch of speech.
6452.FC "Figure 6.2"
6453Notice that they vary much more slowly than, say, the speech waveform of
6454Figure 3.5.
6455.pp
6456Turning the above relationship into $z$-transforms gives
6457.LB
6458.EQ
6459E(z)~~=~~X(z)~-~~sum from k=1 to p ~a sub k z sup -k ~X(z)~~=~~(1~-~~
6460sum from k=1 to p ~a sub k z sup -k )~X(z).
6461.EN
6462.LE
6463Rewriting the speech signal in terms of the error,
6464.LB
6465.EQ
6466X(z)~~=~~1 over {1~-~~ sum ~a sub k z sup -k }~.~E(z) .
6467.EN
6468.LE
6469.pp
6470Now let us bring together some facts from the previous chapter which will
6471allow the time-domain technique of linear prediction to be interpreted
6472in terms of the frequency-domain formant model of speech.  Recall that speech
6473can be viewed as an excitation source passing through a vocal tract filter,
6474followed by another filter to model the effect of radiation from the lips.
6475The overall spectral levels can be reassigned as in Figure 5.1 so that
6476the excitation source has a 0\ dB/octave spectral profile, and hence is
6477essentially impulsive.
6478Considering the vocal tract filter as a series connection
6479of digital formant filters, its transfer function is the product of terms like
6480.LB
6481.EQ
64821 over {1~-~b sub 1 z sup -1 ~+~b sub 2 z sup -2}~ ,
6483.EN
6484.LE
6485where $b sub 1$ and $b sub 2$ control the position and bandwidth of the formant resonances.
6486The \-6\ dB/octave spectral compensation can be modelled by the
6487first-order digital filter
6488.LB
6489.EQ
64901 over {1~-~bz sup -1}~ .
6491.EN
6492.LE
6493The product of all these terms, when multiplied out, will have the
6494form
6495.LB
6496.EQ
64971 over {1~-~c sub 1 z sup -1 ~-~c sub 2 z sup -2 ~-~...~-~
6498c sub q z sup -q }~ ,
6499.EN
6500.LE
6501where $q$ is twice the number of formants plus one, and the $c$'s are calculated
6502from the positions and bandwidths of the formant resonances and the spectral
6503compensation parameter.  Hence
6504the $z$-transform of the speech is
6505.LB
6506.EQ
6507X(z)~=~~1 over {1~-~~ sum from k=1 to q ~c sub k z sup -k }~.~I(z) ,
6508.EN
6509.LE
6510where $I(z)$ is the transform of the impulsive excitation.
6511.pp
6512This is remarkably similar to the linear prediction relation given earlier!  If
6513$p$ and $q$ are the same, then the linear predictive coefficients $a sub k$
6514form a $p$'th order polynomial which is the same as that obtained by multiplying
6515together the second-order polynomials representing the individual formants
6516(together with the first-order one for spectral compensation).
6517Furthermore, the predictive error $E(z)$ can be identified with the
6518impulsive excitation $I(z)$.  This raises the very interesting
6519possibility of parametrizing the error signal by its frequency and
6520amplitude \(em two relatively slowly-varying quantities \(em instead of
6521transmitting it sample-by-sample (at an 8\ kHz rate).  This is how
6522linear prediction separates out the excitation properties of the source
6523from the vocal tract filter:  the source parameters can be derived
6524from the error signal and the vocal tract filter is represented by
6525the linear predictive coefficients.
6526Figure 6.3 shows how this can be used for speech transmission.
6527.FC "Figure 6.3"
6528Note that
6529.ul
6530no
6531signals need now be transmitted at the speech sampling rate; for the
6532source parameters vary relatively slowly.  This leads to an extremely
6533low data rate.
6534.pp
6535Practical linear predictive coding schemes operate with a value of $p$ between
653610 and 15, corresponding approximately to 4-formant and 7-formant synthesis
6537respectively.  The $a sub k$'s are re-calculated every 10 to 25\ msec, and
6538transmitted to the receiver.  Also, the pitch and amplitude
6539of the speech are estimated and transmitted at the same rate.
6540If the speech
6541is unvoiced, there is no pitch value:  an "unvoiced flag" is
6542transmitted instead.
6543Because the linear predictive coefficients are intimately related to
6544formant frequencies and bandwidths, a "frame rate" in the region
of 10 to 25\ msec is appropriate, for this approximates the maximum rate
6546at which acoustic events happen in speech production.
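.pp
To fix ideas, the following Pascal fragment shows one possible layout for
a single frame of transmitted parameters.  It is purely illustrative:  the
record structure, the field names, and the choice of $p$ are ours, and a
real coder would pack these quantities into far fewer bits.
.LB
.nf
const p = 12;                      { predictor order: 10 to 15 in practice }
type
  cvec = array [1..p] of real;
  lpcframe = record
    voiced: boolean;               { voicing decision for this frame }
    pitch: real;                   { pitch period in samples, if voiced }
    gain: real;                    { amplitude of the excitation }
    coeff: cvec                    { predictor coefficients a[1..p] }
  end;
.fi
.LE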
6547.pp
6548At the receiver, the excitation waveform
6549is reconstituted.
6550For voiced speech, it is impulsive at the specified
6551frequency and with the specified amplitude, while for unvoiced speech it
6552is random, with the specified amplitude.  This signal $e(n)$, together
6553with the transmitted parameters $a sub 1$, ..., $a sub p$, is used
6554to regenerate the speech waveform by
6555.LB
6556.EQ
6557x(n)~=~~e(n)~+~~sum from k=1 to p ~a sub k x(n-k) ,
6558.EN
6559.LE
6560\(em which is the inverse of the transmitter's formula for calculating $e(n)$,
6561namely
6562.LB
6563.EQ
6564e(n)~=~~x(n)~-~~sum from k=1 to p ~a sub k x(n-k) .
6565.EN
6566.LE
6567This relies on knowing the past $p$ values of the speech samples.
6568Many systems set these past values to zero at the beginning of each pitch
6569cycle.
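.pp
As a concrete sketch of this reconstruction, the following Pascal function
generates one output sample from the current excitation sample and the
previous $p$ outputs.  It is not a complete receiver \(em the excitation
generator and the per-frame update of the coefficients are omitted \(em and
the function and variable names are ours.
.LB
.nf
const p = 12;
type cvec = array [1..p] of real;

{ one sample of direct-form synthesis:  e is the excitation sample,
  coeff[1..p] the predictor coefficients, and past[1..p] the previous
  output samples (past[1] most recent), initially zero }
function synthesize(e: real; var coeff, past: cvec): real;
var k: integer;  x: real;
begin
  x := e;
  for k := 1 to p do
    x := x + coeff[k]*past[k];     { x(n) = e(n) + sum of a(k)x(n-k) }
  for k := p downto 2 do           { shift the output history along }
    past[k] := past[k\-1];
  past[1] := x;
  synthesize := x
end;
.fi
.LE
Systems which reset the past values at the beginning of each pitch cycle,
as just mentioned, need only clear the array past at the appropriate times.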
6570.pp
6571Linear prediction can also be used for speech analysis, rather than
6572for speech coding, as shown in Figure 6.4.
6573.FC "Figure 6.4"
6574Instead of transmitting the coefficients $a sub k$,
6575they are used to determine the formant positions and bandwidths.
6576We saw above that the polynomial
6577.LB
6578.EQ
65791~-~a sub 1 z sup -1 ~-~a sub 2 z sup -2 ~-~...~-~a sub p z sup -p ,
6580.EN
6581.LE
6582when factored into a product of second-order terms, gives the formant
6583characteristics (as well as the spectral compensation term).
6584Factoring is equivalent to finding the complex roots of the polynomial,
6585and this is fairly demanding computationally \(em especially if done at
6586a high rate.  Consequently, peak-picking algorithms are sometimes
used instead.  The reciprocal of the absolute value of the polynomial gives the
frequency spectrum of the vocal tract filter, and the formants
6589appear as peaks \(em just as they do in cepstrally smoothed speech
6590(see Chapter 4).
6591.pp
6592The chief deficiency in the linear predictive method, whether it
6593is used for speech coding or for speech analysis, is that \(em like a series
6594synthesizer \(em it
6595implements an all-pole model of the vocal tract.
6596We mentioned in Chapter 5 that this is rather simplistic,
6597especially for nasalized sounds which involve a cavity in parallel
6598with the oral one.  Some research has been done on incorporating zeros
6599into a linear predictive model, but it complicates the problem of
6600calculating the parameters enormously.  For most purposes people seem
6601to be able to live with the limitations of the all-pole model.
6602.sh "6.1  Linear predictive analysis"
6603.pp
6604The key problem in linear predictive coding is to determine the values
6605of the coefficients $a sub 1$, ..., $a sub p$.
6606If the error signal is to be transmitted on a sample-by-sample basis,
6607as it is in adaptive differential pulse code modulation, then it can be most
6608economically encoded if its mean power is as small as possible.
6609Thus the coefficients are chosen to minimize
6610.LB
6611.EQ
6612sum ~e(n) sup 2
6613.EN
6614.LE
6615over some period of time.
6616The period of time used is related to the frame rate at which the
6617coefficients are transmitted or stored, although there is no need
6618to make it exactly the same as one frame interval.  As mentioned above,
6619the frame size
6620is usually chosen to be in the region of 10 to 25\ msec.  Some
6621schemes minimize the error signal over as few as 30 samples
6622(corresponding to 3\ msec at a 10\ kHz sampling rate).  Others take
longer, using up to 250 samples (25\ msec).
6624.pp
6625However, if the error signal is to be considered as impulsive and
6626parametrized by its frequency and amplitude before transmission,
6627or if the coefficients $a sub k$ are to be used for spectral calculations,
6628then it is not immediately obvious how the coefficients should be
6629calculated.
6630In fact, it is still best to choose them to minimize the above sum.
6631This is at least plausible, for an impulsive excitation will have a
6632rather small mean power \(em most of the samples are zero.
6633It can be justified theoretically in terms of
6634.ul
6635spectral whitening,
6636for it can be shown that minimizing the mean-squared error
6637produces an error signal whose spectrum is maximally flat.
6638Now the only two waveforms whose spectra are absolutely flat
6639are a single impulse and white noise.  Hence if
6640the speech is voiced, minimizing the mean-squared error
6641will lead to an error signal which is as nearly impulsive
6642as possible.  Provided the time-frame for minimizing is short enough,
6643the impulse will correspond to a single excitation pulse.
6644If the speech is unvoiced, minimization will lead to an error
6645signal which is as nearly white noise as possible.
6646.pp
6647How does one choose the linear predictive coefficients to minimize
6648the mean-squared error?  The total squared prediction error is
6649.LB
6650.EQ
6651M~=~~sum from n ~e(n) sup 2~~=~~sum from n
~[x(n)~-~ sum from k=1 to p ~a sub k x(n-k) ] sup 2 ,
6653.EN
6654.LE
6655leaving the range of summation unspecified for the moment.
6656To minimize $M$ by choice of the coefficients $a sub j$, differentiate
6657with respect to each of them and set the resulting derivatives
6658to zero.
6659.LB
6660.EQ
6661dM over {da sub j} ~~=~~-2 sum from n ~x(n-j)[x(n)~-~~
6662sum from k=1 to p ~a sub k x(n-k)]~~=~0~,
6663.EN
6664.LE
6665so
6666.LB
6667.EQ
6668sum from k=1 to p ~a sub k ~ sum from n ~x(n-j)x(n-k)~~=~~
6669sum from n ~x(n)x(n-j)~~~~j~=~1,~2,~...,~p.
6670.EN
6671.LE
6672.pp
6673This is a set of $p$ linear equations for the $p$ unknowns $a sub 1$, ...,
6674$a sub p$.
6675Solving it is equivalent to inverting a $p times p$ matrix.
6676This job must be repeated at the frame rate, and so if
6677real-time operation is desired quite a lot of calculation is needed.
6678.rh "The autocorrelation method."
6679So far, the range of the $n$-summation has been left open.  The
6680coefficients of the matrix equation have the form
6681.LB
6682.EQ
6683sum from n ~x(n-j)x(n-k).
6684.EN
6685.LE
6686If a doubly-infinite summation were made, with $x(n)$ being defined
as zero outside a finite range, we could make use of the fact that
6688.sp
6689.ce
6690.EQ
6691sum from {n=- infinity} to infinity ~x(n-j)x(n-k)~=~~
6692sum from {n=- infinity} to infinity ~x(n-j+1)x(n-k+1)~=~...~=~~
6693sum from {n=- infinity} to infinity ~x(n)x(n+j-k)
6694.EN
6695.sp
6696to simplify the matrix equation.  This just states that the
6697autocorrelation of an infinite sequence depends only on the lag at which
6698it is computed, and not on absolute time.
6699.pp
6700Defining $R(m)$ as the
6701autocorrelation at lag $m$, that is,
6702.LB
6703.EQ
6704R(m)~=~ sum from n ~x(n)x(n+m),
6705.EN
6706.LE
6707the matrix equation becomes
6708.LB
6709.ne7
6710.nf
6711.EQ
6712R(0)a sub 1 ~+~R(1)a sub 2 ~+~R(2)a sub 3 ~+~...~~=~R(1)
6713.EN
6714.EQ
6715R(1)a sub 1 ~+~R(0)a sub 2 ~+~R(1)a sub 3 ~+~...~~=~R(2)
6716.EN
6717.EQ
6718R(2)a sub 1 ~+~R(1)a sub 2 ~+~R(0)a sub 3 ~+~...~~=~R(3)
6719.EN
6720.EQ
6721etc
6722.EN
6723.fi
6724.LE
6725An elegant method due to Durbin and Levinson exists for solving this
6726special system of equations.  It requires much less computational
6727effort than is generally needed for symmetric matrix equations.
6728.pp
Of course, an infinite range of summation cannot be used in
6730practice.  For one thing, the power spectrum is changing, and
6731only the data from a short time-frame should be used for
6732a realistic estimate of the optimum linear predictive coefficients.
6733Hence a windowing procedure,
6734.LB
6735.EQ
6736x(n) sup * ~=~w sub n x(n),
6737.EN
6738.LE
6739is used to reduce the signal to zero outside a finite range of
6740interest.  Windows were discussed in Chapter 4 from the
6741point of view of Fourier analysis of speech signals, and the same
6742sort of considerations apply to choosing a window for linear
6743prediction.
6744.pp
6745This is known as the
6746.ul
6747autocorrelation method
6748of computing prediction parameters.  Typically a window of
6749100 to 250 samples is used for analysis of one frame of speech.
6750.rh "Algorithm for the autocorrelation method."
6751The algorithm for obtaining linear prediction coefficients
6752by the autocorrelation method is quite simple.  It is
6753straightforward to compute the matrix coefficients
6754$R(m)$ from the speech samples and window coefficients.
6755The Durbin-Levinson method of solving matrix equations operates
6756directly on this $R$-vector to produce the coefficient vector $a sub k$.
6757The complete procedure is given as Procedure 6.1, and is shown
6758diagrammatically in Figure 6.5.
6759.FC "Figure 6.5"
6760.RF
6761.fi
6762.na
6763.nh
6764.ul
6765const
6766N=256; p=15;
6767.ul
6768type
6769svec =
6770.ul
6771array
6772[0..N\-1]
6773.ul
6774of
6775real;
6776cvec =
6777.ul
6778array
6779[1..p]
6780.ul
6781of
6782real;
6783.sp
6784.ul
6785procedure
autocorrelation(signal: svec; window: svec;
6787.ul
6788var
6789coeff: cvec);
6790.sp
6791{computes linear prediction coefficients by autocorrelation method
6792in coeff[1..p]}
6793.sp
6794.ul
6795var
6796R, temp:
6797.ul
6798array
6799[0..p]
6800.ul
6801of
6802real;
n: 0..N\-1; i,j: 0..p; E: real;
6804.sp
6805.ul
6806begin
6807{window the signal}
6808.in+6n
6809.ul
6810for
6811n:=0
6812.ul
6813to
6814N\-1
6815.ul
6816do
6817signal[n] := signal[n]*window[n];
6818.sp
6819{compute autocorrelation vector}
6820.br
6821.ul
6822for
6823i:=0
6824.ul
6825to
6826p
6827.ul
6828do begin
6829.in+2n
6830R[i] := 0;
6831.br
6832.ul
6833for
6834n:=0
6835.ul
6836to
6837N\-1\-i
6838.ul
6839do
6840R[i] := R[i] + signal[n]*signal[n+i]
6841.in-2n
6842.ul
6843end;
6844.sp
6845{solve the matrix equation by the Durbin-Levinson method}
6846.br
6847E := R[0];
6848.br
6849coeff[1] := R[1]/E;
6850.br
6851.ul
6852for
6853i:=2
6854.ul
6855to
6856p
6857.ul
6858do begin
6859.in+2n
6860E := (1\-coeff[i\-1]*coeff[i\-1])*E;
6861.br
6862coeff[i] := R[i];
6863.br
6864.ul
6865for
6866j:=1
6867.ul
6868to
6869i\-1
6870.ul
6871do
6872coeff[i] := coeff[i] \- R[i\-j]*coeff[j];
6873.br
6874coeff[i] := coeff[i]/E;
6875.br
6876.ul
6877for
6878j:=1
6879.ul
6880to
6881i\-1
6882.ul
6883do
6884temp[j] := coeff[j] \- coeff[i]*coeff[i\-j];
6885.br
6886.ul
6887for
6888j:=1
6889.ul
6890to
6891i\-1
6892.ul
6893do
6894coeff[j] := temp[j]
6895.in-2n
6896.ul
6897end
6898.in-6n
6899.ul
6900end.
6901.nf
6902.FG "Procedure 6.1  Pascal algorithm for the autocorrelation method"
6903.pp
6904This algorithm is not quite as efficient as it might be, for some
6905multiplications are repeated during the calculation of the
6906autocorrelation vector.  Blankinship (1974) shows how
6907the number of multiplications can be reduced by about half.
6908.[
6909Blankinship 1974
6910.]
6911.pp
6912If the algorithm is performed in fixed-point arithmetic
6913(as it often is in practice because of speed considerations),
6914some scaling must be done.  The maximum and minimum values of
6915the windowed signal can be determined within the window
6916calculation loop, and one extra pass over the vector will
6917suffice to scale it to maximum significance.
6918(Incidentally, if all sample values are the same the procedure
6919cannot produce a solution because $E$ becomes zero, and this
6920can easily be checked when scaling.)
6921.pp
6922The absolute value of the $R$-vector has no significance, and since
6923$R(0)$ is always the greatest element, this can be set to the largest
6924fixed-point number and the other $R$'s scaled down appropriately
6925after they have been calculated.
6926These scaling operations are shown as dashed boxes in Figure 6.5.
6927$E$ decreases monotonically
6928as the computation proceeds, so it is safe to initialize it to $R(0)$
6929without extra scaling.  The remainder of the scaling is straightforward,
6930with the linear prediction coefficients $a sub k$ appearing as fractions.
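.pp
The following fragment sketches this scaling for a hypothetical 16-bit
fixed-point representation in which fractions are held as integers in the
range \-32768 to 32767; it simply maps $R(0)$ on to the largest positive
value and scales the remaining elements in proportion.  The variable names
are ours.
.LB
.nf
var R: array [0..p] of real;       { the autocorrelation vector of Procedure 6.1 }
    Rfix: array [0..p] of integer; { scaled fixed-point copy }
    scale: real;  i: integer;
begin
  { a silent frame, with R[0] zero, must be caught before this point }
  scale := 32767.0/R[0];           { R[0] is always the largest element }
  for i := 0 to p do
    Rfix[i] := round(R[i]*scale)
end;
.fi
.LE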
6931.rh "The covariance method."
One of the advantages promised earlier for the linear predictive
method was that it allows us to escape from
6934the problem of windowing.  To do this, we must abandon the
6935requirement that the coefficients of the matrix equation have
6936the symmetry property of autocorrelations.  Instead, suppose
6937that the range of $n$-summation uses a fixed number of
6938elements, say N, starting at $n=h$, to estimate the prediction
6939coefficients between sample number $h$ and sample number $h+N$.
6940.pp
6941This leads to the matrix equation
6942.LB
6943.EQ
6944sum from k=1 to p ~a sub k sum from n=h to h+N-1 ~x(n-j)x(n-k) ~~=~~
6945sum from n=h to h+N-1 ~x(n)x(n-j)~~~~j~=~1,~2,~...,~p.
6946.EN
6947.LE
6948Alternatively, we could write
6949.LB
6950.EQ
6951sum from k=1 to p ~a sub k ~ Q sub jk sup h~~=~~Q sub 0j sup h
6952~~~~j~=~1,~2,~...,~p;
6953.EN
6954.LE
6955where
6956.LB
6957.EQ
6958Q sub jk sup h~~=~~sum from n=h to h+N-1 ~x(n-j)x(n-k).
6959.EN
6960.LE
6961Note that some values of $x(n)$ outside the range  $h ~ <= ~ n ~ < ~ h+N$  are
6962required:  these are shown diagrammatically in Figure 6.6.
6963.FC "Figure 6.6"
6964.pp
6965Now  $Q sub jk sup h ~=~ Q sub kj sup h$,  so the equation has
6966a diagonally symmetric matrix; and in fact the matrix $Q sup h$ can
6967be shown to be positive semidefinite \(em and is almost always positive
6968definite in practice.  Advantage can be taken of these facts
6969to provide a computationally efficient method for solving the
6970equation.  According to a result called Cholesky's theorem, a
6971positive definite symmetric matrix $Q$ can be factored into the form
6972$Q ~ = ~ LL sup T$, where $L$ is a lower triangular matrix.
6973This leads to an efficient
6974solution algorithm.
6975.pp
6976This method of computing prediction coefficients has become known
6977as the
6978.ul
6979covariance method.
6980It does not use windowing of the speech signal, and can give accurate
6981estimates of the prediction coefficients with a smaller analysis
6982frame than the autocorrelation method.  Typically, 50 to 100 speech samples
6983might be used to estimate the coefficients, and they are re-calculated
6984every 100 to 250 samples.
6985.rh "Algorithm for the covariance method."
6986An algorithm for the covariance method is given in Procedure 6.2,
6987.RF
6988.fi
6989.na
6990.nh
6991.ul
6992const
6993N=100; p=15;
6994.ul
6995type
6996svec =
6997.ul
6998array
6999[\-p..N\-1]
7000.ul
7001of
7002real;
7003cvec =
7004.ul
7005array
7006[1..p]
7007.ul
7008of
7009real;
7010.sp
7011.ul
7012procedure
7013covariance(signal: svec;
7014.ul
7015var
7016coeff: cvec);
7017.sp
7018{computes linear prediction coefficients by covariance method
7019in coeff[1..p]}
7020.sp
7021.ul
7022var
7023Q:
7024.ul
7025array
7026[0..p,0..p]
7027.ul
7028of
7029real;
n: 0..N\-1; i,j,r: 0..p; X: real;
7031.sp
7032.ul
7033begin
7034{calculate upper-triangular covariance matrix in Q}
7035.in+6n
7036.ul
7037for
7038i:=0
7039.ul
7040to
7041p
7042.ul
7043do
7044.in+2n
7045.ul
7046for
7047j:=i
7048.ul
7049to
7050p
7051.ul
7052do begin
7053.in+2n
7054Q[i,j]:=0;
7055.br
7056.ul
7057for
7058n:=0
7059.ul
7060to
7061N\-1
7062.ul
7063do
7064.in+2n
7065Q[i,j] := Q[i,j] + signal[n\-i]*signal[n\-j]
7066.in-2n
7067.in-2n
7068.ul
7069end;
7070.in-2n
7071.sp
7072{calculate the square root of Q}
7073.br
7074.ul
7075for
7076r:=2
7077.ul
7078to
7079p
7080.ul
7081do
7082.in+2n
7083.ul
7084begin
7085.in+2n
7086.ul
7087for
7088i:=2
7089.ul
7090to
7091r\-1
7092.ul
7093do
7094.in+2n
7095.ul
7096for
7097j:=1
7098.ul
7099to
7100i\-1
7101.ul
7102do
7103.in+2n
7104Q[i,r] := Q[i,r] \- Q[j,i]*Q[j,r];
7105.in-2n
7106.ul
7107for
7108j:=1
7109.ul
7110to
7111r\-1
7112.ul
7113do
7114.in+2n
7115.ul
7116begin
7117.in+2n
7118X := Q[j,r];
7119.br
Q[j,r] := Q[j,r]/Q[j,j];
7121.br
7122Q[r,r] := Q[r,r] \- Q[j,r]*X
7123.in-2n
7124.ul
7125end
7126.in-2n
7127.in-2n
7128.in-2n
7129.ul
7130end;
7131.in-2n
7132.sp
7133{calculate coeff[1..p]}
7134.br
7135.ul
7136for
7137r:=2
7138.ul
7139to
7140p
7141.ul
7142do
7143.in+2n
7144.ul
7145for
7146i:=1
7147.ul
7148to
7149r\-1
7150.ul
7151do
7152Q[0,r] := Q[0,r] \- Q[i,r]*Q[0,i];
7153.in-2n
7154.ul
7155for
7156r:=1
7157.ul
7158to
7159p
7160.ul
7161do
7162Q[0,r] := Q[0,r]/Q[r,r];
7163.br
7164.ul
7165for
7166r:=p\-1
7167.ul
7168downto
71691
7170.ul
7171do
7172.in+2n
7173.ul
7174for
7175i:=r+1
7176.ul
7177to
7178p
7179.ul
7180do
7181Q[0,r] := Q[0,r] \- Q[r,i]*Q[0,i];
7182.in-2n
7183.ul
7184for
7185r:=1
7186.ul
7187to
7188p
7189.ul
7190do
7191coeff[r] := Q[0,r]
7192.in-6n
7193.ul
7194end.
7195.nf
7196.FG "Procedure 6.2  Pascal algorithm for the covariance method"
7197and is shown diagrammatically in Figure 6.7.
7198.FC "Figure 6.7"
7199The algorithm shown is not terribly efficient from a computation
7200and storage point of view, although it is workable.  For one thing,
7201it uses the obvious method for computing the covariance matrix
7202by calculating
7203.EQ
7204Q sub 01 sup h ,
7205.EN
7206.EQ
7207Q sub 02 sup h , ~ ...,
7208.EN
7209.EQ
7210Q sub 0p sup h ,
7211.EN
7212.EQ
7213Q sub 11 sup h , ...,
7214.EN
7215in turn, which repeats most of the multiplications $p$ times \(em not
7216an efficient procedure.  A simple alternative is to precompute the necessary
7217multiplications and store them in a  $(N+h) times (p+1)$ diagonally symmetric
7218table, but even apart from the extra storage required for this, the number
7219of additions which must be performed subsequently to give the $Q$'s is far
7220larger than necessary.  It is possible, however, to write a procedure which is
7221both time- and space-efficient (Witten, 1980).
7222.[
7223Witten 1980 Algorithms for linear prediction
7224.]
7225.pp
7226The scaling problem is rather more tricky for the covariance
7227method than for the autocorrelation method.  The $x$-vector
7228should be scaled initially in the same way as before, but now there
7229are $p+1$ diagonal elements of the covariance matrix, any of which could
7230be the greatest element.  Of course,
7231.LB
7232.EQ
7233Q sub jk ~~ <= ~~ Max ( Q sub 11 , Q sub 22 , ..., Q sub pp ),
7234.EN
7235.LE
7236but despite the considerable communality in the summands of the diagonal
7237elements, there are no
7238.ul
7239a priori
7240bounds on the ratios between them.
7241.pp
The only way to scale the $Q$ matrix properly is to calculate each of its
diagonal elements and use the greatest as a scaling factor.
7244Alternatively, the fact that
7245.LB
7246.EQ
7247Q sub jk ~~ <= ~~ N times Max( x sub n sup 2 )
7248.EN
7249.LE
7250can be used to give a bound for scaling purposes; however, this
7251is usually a rather conservative bound, and as $N$ is often around 100, several
7252bits of significance will be lost.
7253.pp
7254Scaling difficulties do not cease when $Q$ has been determined.  It is possible
7255to show that the elements of the lower-triangular matrix $L$ which represents
7256the square root of $Q$ are actually
7257.ul
7258unbounded.
7259In fact there is a slightly different variant of the Cholesky decomposition
7260algorithm which guarantees bounded coefficients but suffers from the
7261disadvantage that it requires square roots to be taken (Martin
7262.ul
7263et al,
72641965).
7265.[
7266Martin Peters Wilkinson 1965
7267.]
7268However, experience with the method indicates that it is rare for the elements
7269of $L$ to exceed 16 times the maximum element of $Q$, and the possibility of
7270occasional failure to adjust the coefficients may be tolerable in a practical
7271linear prediction system.
7272.rh "Comparison of autocorrelation and covariance analysis."
7273There are various factors which should be taken into account when
7274deciding whether to use the autocorrelation or covariance method for linear
7275predictive analysis.  Furthermore, there is a rather different technique,
7276called the "lattice method", which will be discussed shortly.
7277The autocorrelation method involves windowing, which means that in
7278practice a rather longer stretch of speech should be used
7279for analysis.  We have illustrated this by setting $N$=256 in the
7280autocorrelation algorithm and 100 in the covariance one.
7281Offsetting the extra calculation that this entails is the
7282fact that the Durbin-Levinson method of inverting a matrix is much more
7283efficient than Cholesky decomposition.  In practice, this means
7284that similar amounts of computation are needed for each method \(em a
7285detailed comparison is made in Witten (1980).
7286.[
7287Witten 1980 Algorithms for linear prediction
7288.]
7289.pp
7290A factor which weighs against the covariance method is the
7291difficulty of scaling intermediate quantities within the algorithm.
7292The autocorrelation method can be implemented quite satisfactorily
7293in fixed-point arithmetic, and this makes it more suitable for
7294hardware implementation.  Furthermore, serious instabilities sometimes
7295arise with the covariance method, whereas it can be shown that
7296the autocorrelation one is always stable.  Nevertheless, the approximations
7297inherent in the windowing operation, and the smearing effect of taking a
7298larger number of sample points, mean that covariance-method coefficients
7299tend to represent the speech more accurately, if they can be obtained.
7300.pp
7301One way of using the covariance method which has proved to be rather
7302satisfactory in practice is to synchronize the analysis frame with
7303the beginning of a pitch period, when the excitation is strongest.
7304Pitch synchronous techniques were discussed in Chapter 4 in the context
7305of discrete Fourier transformation of speech.  The snag, of course, is that
7306pitch peaks do not occur uniformly in time, and furthermore it is difficult
7307to estimate their locations precisely.
7308.sh "6.2  Linear predictive synthesis"
7309.pp
7310If the linear predictive coefficients and the error signal are available,
7311it is easy to regenerate the original speech by
7312.LB
7313.EQ
7314x(n)~=~~e(n)~+~~ sum from k=1 to p ~a sub k x(n-k) .
7315.EN
7316.LE
7317If the error signal is parametrized into the sound source type
7318(voiced or unvoiced), amplitude, and pitch (if voiced), it can be
7319regenerated by an impulse repeated at the appropriate pitch
7320frequency (if voiced), or white noise (if unvoiced).
7321.pp
7322However, it may be that the filter represented by the coefficients $a sub k$ is
7323unstable, causing the output speech signal to oscillate wildly.
In fact, only the covariance method can produce an unstable filter;
the autocorrelation method cannot \(em although even
7326with the latter, truncation of the $a sub k$'s for transmission may turn
7327a stable filter into an unstable one.  Furthermore, the coefficients
7328$a sub k$ are not suitable candidates for quantization, because small
7329changes in them can have a dramatic effect on the characteristics of
7330the synthesis filter.
7331.pp
7332Both of these problems can be solved by using a different set of numbers,
7333called
7334.ul
7335reflection coefficients,
7336for quantization and transmission.  Thus, for example, in Figures 6.1
7337and 6.3 these reflection coefficients could be derived at the
7338transmitter, quantized, and used by the receiver to reproduce
7339the speech waveform.  They can be related to reflection and transmission
7340parameters at the junctions of an acoustic tube model of the vocal tract;
7341hence the name.  Procedure 6.3 shows an algorithm for calculating the
7342reflection coefficients from the filter coefficients $a sub k$.
7343.RF
7344.fi
7345.na
7346.nh
7347.ul
7348const
7349p=15;
7350.ul
7351type
7352cvec =
7353.ul
7354array
7355[1..p]
7356.ul
7357of
7358real;
7359.sp
7360.ul
7361procedure
7362reflection(coeff: cvec;
7363.ul
7364var
7365refl: cvec);
7366.sp
7367{computes reflection coefficients in refl[1..p] corresponding
7368to linear prediction coefficients in coeff[1..p]}
7369.sp
7370.ul
7371var
7372temp: cvec;  i, m: 1..p;
7373.sp
7374.ul
7375begin
7376.in+6n
7377.ul
7378for
7379m:=p
7380.ul
7381downto
73821
7383.ul
7384do begin
7385.in+2n
7386refl[m] := coeff[m];
7387.br
7388.ul
7389for
7390i:=1
7391.ul
7392to
7393m\-1
7394.ul
7395do
7396temp[i] := coeff[i];
7397.br
7398.ul
7399for
7400i:=1
7401.ul
7402to
7403m\-1
7404.ul
7405do
7406.ti+2n
7407coeff[i] :=
7408.ti+4n
7409(coeff[i] + refl[m]*temp[m\-i]) / (1 \- refl[m]*refl[m]);
7410.in-2n
7411.ul
7412end
7413.in-6n
7414.ul
7415end.
7416.nf
7417.MT 2
7418Procedure 6.3  Pascal algorithm for producing reflection coefficients
7419from filter coefficients
7420.TE
7421.pp
7422Although we will not go into the theoretical details here,
7423reflection coefficients are bounded by $+-$1 for stable filters,
7424and hence form a useful test for stability.  Having a limited
7425range makes them easy to quantize for transmission, and in fact
7426they behave better under quantization than do the filter coefficients.
7427One could resynthesize speech from reflection coefficients by first
7428converting them to filter coefficients and using the synthesis
7429method described above.  However, it is natural to seek a single-stage
7430procedure which can regenerate speech directly from reflection
7431coefficients.
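.pp
Before turning to that, here is a sketch of the two-stage route just
mentioned:  the inverse of Procedure 6.3, using the same declarations and
sign conventions.  It simply re-runs the recurrence of the Durbin-Levinson
method, treating each reflection coefficient in turn as the highest-order
filter coefficient; the procedure name is ours.
.LB
.nf
{ produces filter coefficients in coeff[1..p] from reflection
  coefficients in refl[1..p];  cvec is as declared in Procedure 6.3 }
procedure stepup(refl: cvec; var coeff: cvec);
var temp: cvec;  i, m: 1..p;
begin
  for m := 1 to p do begin
    coeff[m] := refl[m];
    for i := 1 to m\-1 do
      temp[i] := coeff[i];
    for i := 1 to m\-1 do
      coeff[i] := temp[i] \- refl[m]*temp[m\-i]
  end
end;
.fi
.LE
The resulting coefficients can then be used in the direct-form synthesis
calculation given at the start of this section.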
7432.pp
7433Such a procedure does exist, and is called a
7434.ul
7435lattice filter.
7436Figure 6.8 shows one form of lattice for speech synthesis.
7437.FC "Figure 6.8"
7438The error signal (whether transmitted or synthesized)
7439enters at the upper left-hand corner, passes along the top forward
7440signal path, being modified on the way, to give the output signal
7441at the right-hand side.
7442Then it passes back through a chain of delays along the bottom,
7443backward, path, and is used to modify subsequent forward signals.
7444Finally it is discarded at the lower left-hand corner.
7445.pp
7446There are $p$ stages in the lattice structure of Figure 6.8, where $p$ is the
7447order of the linear predictive filter.
7448Each stage involves two multiplications by the appropriate
7449reflection coefficients, one by the backward signal \(em the
7450result of which is added into the forward path \(em and the other by
7451the forward signal \(em the result of which is subtracted from the
7452backward path.  Thus the number of multiplications is twice
7453the order of the filter, and hence twice as many as for the
7454realization using coefficients $a sub k$.  If the labour necessary
7455to turn the reflection coefficients into $a sub k$'s is included,
7456the computational load becomes the same.  Moreover, since the
7457reflection coefficients need fewer quantization bits than the $a sub k$'s
7458(for a given speech quality), the word lengths are smaller in the
7459lattice realization.
7460.pp
7461The advantages of the lattice method of synthesis over direct evaluation
7462of the prediction using filter coefficients $a sub k$, then, are:
7463.LB
7464.NP
7465the reflection coefficients are used directly
7466.NP
7467the stability of the filter is obvious from the reflection coefficient
7468values
7469.NP
7470the system is more tolerant to quantization errors in fixed-point
7471implementations.
7472.LE
7473Although it may seem unlikely that an unstable filter would be produced
7474by linear predictive analysis, instability is in fact a real problem
7475in non-lattice implementations.  For example,
7476coefficients are often interpolated at the receiver, to allow longer
7477frame times and smooth over sudden transitions, and it is quite likely that
7478an unstable configuration is obtained when interpolating filter coefficients
7479between two stable configurations.
7480This cannot happen with reflection coefficients, however, because a
7481necessary and sufficient condition for stability is that all
7482coefficients lie in the interval $(-1,+1)$.
7483.sh "6.3  Lattice filtering"
7484.pp
7485Lattice filters are an important new method of linear predictive
7486.ul
7487analysis
7488as well as synthesis, and so
7489it is worth considering the theory behind them a little further.
7490.rh "Theory of the lattice synthesis filter."
7491Figure 6.9 shows a single stage of the synthesis lattice given earlier.
7492.FC "Figure 6.9"
7493There are two signals at each side of the lattice, and the $z$-transforms
7494of these have been labelled $X sup +$ and $X sup -$ at the left-hand side
7495and $Y sup +$ and $Y sup -$ at the right-hand side.
7496The direction of signal flow is forwards along the upper ("positive") path
7497and backwards along the lower ("negative") one.
7498.pp
7499The signal flows show that the following two relationships hold:
7500.LB
7501.EQ
7502Y sup + ~=~~ X sup + ~+~ k z sup -1 Y sup - ~~~~~~
7503.EN
7504for the forward (upper) path
7505.br
7506.EQ
7507X sup - ~ =~ -kY sup + ~+~ z sup -1 Y sup - ~~~~~~~
7508.EN
7509\h'-\w'\-'u'for the backward (lower) path.
7510.LE
7511Re-arranging the first equation yields
7512.LB
7513.EQ
7514X sup + ~ =~~ Y sup + ~-~ k z sup -1 Y sup - ,
7515.EN
7516.LE
7517and so we can describe the function of the lattice by a single matrix
7518equation:
7519.LB
7520.ne4
7521.EQ
7522left [ matrix {ccol {X sup + above X sup -}} right ] ~~=~~
7523left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ]
7524~ left [ matrix {ccol {Y sup + above Y sup -}} right ] ~ .
7525.EN
7526.LE
7527It would be nice to be able to
7528call this an input-output equation, but it is not;
7529for the input signals to the lattice stage are $X sup +$ and $Y sup -$,
7530and the outputs are $X sup -$ and $Y sup +$.
7531We have written it in this form because it allows a multi-stage lattice to
7532be described by cascading these matrix equations.
7533.pp
7534A single-stage lattice filter has $Y sup +$ and $Y sup -$ connected together,
7535forming its output (call this $X sub output$), while the input is $X sup +$
7536($X sub input$).
7537Hence the input is related to the output by
7538.LB
7539.EQ
7540left [ matrix {ccol {X sub input above \(sq }} right ] ~~ =
7541~~ left [ matrix {ccol {1 above -k} ccol {-k z sup -1
7542above z sup -1}} right ]
7543~ left [ matrix {ccol {X sub output above X sub output}} right ] ~ ,
7544.EN
7545.LE
7546so
7547.LB
7548.EQ
7549X sub input ~ = ~~ (1~-~ k z sup -1 )~X sub output ,
7550.EN
7551.LE
7552or
7553.LB
7554.EQ
7555{X sub output} over {X sub input} ~~=~~ 1 over {1~-~ k sub 1 z sup -1} ~ .
7556.EN
7557.LE
7558(The symbol \(sq is used here and elsewhere
7559to indicate an unimportant element of a vector
7560or matrix.)  This certainly has the form of a linear predictive
7561synthesis filter, which is
7562.LB
7563.EQ
7564X(z) over E(z) ~~=~~ 1 over {1~-~~ sum from k=1 to p ~a sub k
7565z sup -k}~~=~~ 1 over {1~-~a sub 1 z sup -1 } ~~~~~~
7566.EN
7567when $p=1$.
7568.LE
7569.pp
7570The behaviour of a second-order lattice filter, shown in Figure 6.10,
7571can be described by
7572.LB
7573.ne4
7574.EQ
7575left [ matrix {ccol {X sub 3 sup + above X sub 3 sup -}} right ] ~~ =
7576~~ left [ matrix {ccol {1 above -k sub 2 } ccol {-k sub 2 z sup -1
7577above z sup -1}} right ]
7578~ left [ matrix {ccol {X sub 2 sup + above X sub 2 sup -}} right ]
7579.EN
7580.sp
7581.ne4
7582.EQ
7583left [ matrix {ccol {X sub 2 sup + above X sub 2 sup -}} right ] ~~ =
7584~~ left [ matrix {ccol {1 above -k sub 1 } ccol {-k sub 1 z sup -1
7585above z sup -1}} right ]
7586~ left [ matrix {ccol {X sub 1 sup + above X sub 1 sup -}} right ]
7587.EN
7588.LE
7589with
7590.LB
7591.ne3
7592.EQ
7593X sub 3 sup + ~=~X sub input
7594.EN
7595.br
7596.EQ
7597X sub 1 sup + ~=~ X sub 1 sup - ~=~ X sub output .
7598.EN
7599.LE
7600.FC "Figure 6.10"
7601$X sub 2 sup +$ and $X sub 2 sup -$ can be eliminated by substituting the
7602second equation into the first, which yields
7603.LB
7604.EQ
7605left [ matrix {ccol {X sub input above \(sq }} right ] ~~ mark =
7606~~ left [ matrix {ccol {1 above -k sub 2 } ccol {-k sub 2 z sup -1
7607above z sup -1}} right ]
7608~ left [ matrix {ccol {1 above -k sub 1 } ccol {-k sub 1 z sup -1
7609above z sup -1}} right ]
7610~ left [ matrix {ccol {X sub output above X sub output}} right ]
7611.EN
7612.sp
7613.sp
7614.EQ
7615lineup = ~~ left [ matrix {ccol {1+k sub 1 k sub 2 z sup -1 above \(sq }
7616ccol { -k sub 1 z sup -1 -k sub 2 z sup -2 above \(sq }} right ]
7617~ left [ matrix {ccol {X sub output above X sub output}} right ] ~ .
7618.EN
7619.LE
7620This leads to an input-output relationship
7621.LB
7622.EQ
7623{X sub output} over {X sub input} ~~ = ~~
76241 over {1~+~k sub 1 (k sub 2 -1)z sup -1 ~-~k sub 2 z sup -2} ~ ,
7625.EN
7626.LE
7627which has the required form, namely
7628.LB
7629.EQ
76301 over {1~-~~ sum from k=1 to p ~a sub k z sup -k } ~~~~~~ (p=2)
7631.EN
7632.LE
7633when
7634.LB
7635.EQ
7636a sub 1 ~=~-k sub 1 (k sub 2 -1)
7637.EN
7638.br
7639.EQ
7640a sub 2 ~=~k sub 2.
7641.EN
7642.LE
7643.pp
7644A third-order filter is described by
7645.LB
7646.EQ
7647left [ matrix {ccol {X sub input above \(sq }} right ] ~~ =
7648~~ left [ matrix {ccol {1 above -k sub 3 } ccol {-k sub 3 z sup -1
7649above z sup -1}} right ]
7650~ left [ matrix {ccol {1 above -k sub 2 } ccol {-k sub 2 z sup -1
7651above z sup -1}} right ]
7652~ left [ matrix {ccol {1 above -k sub 1 } ccol {-k sub 1 z sup -1
7653above z sup -1}} right ]
7654~ left [ matrix {ccol {X sub output above X sub output}} right ] ~ ,
7655.EN
7656.LE
7657and brave souls can verify that this gives an input-output
7658relationship
7659.LB
7660.EQ
7661{X sub output} over {X sub input} ~~ = ~~
1 over {1~+~[k sub 2 k sub 3 - k sub 1 (1-k sub 2 )] z sup -1 ~+~
7663[k sub 1 k sub 3 (1-k sub 2 ) -k sub 2 ] z sup -2 ~-~ k sub 3 z sup -3 } ~ .
7664.EN
7665.LE
7666It is fairly obvious that a $p$'th order lattice filter will give the
7667required all-pole $p$'th order synthesis form,
7668.LB
7669.EQ
76701 over { 1~-~~ sum from k=1 to p ~a sub k z sup -k } ~ .
7671.EN
7672.LE
7673.pp
7674We have not shown that the algorithm given in Procedure 6.3 for producing
7675reflection coefficients from filter coefficients gives those values
7676for $k sub i$ which are necessary to make the lattice filter equivalent
7677to the ordinary synthesis filter.  However, this is the case, and it is
7678easy to verify by hand for the first, second, and third-order cases.
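.pp
Brave souls can also check the equivalence numerically.  The following
Pascal sketch implements one sample of a $p$'th order synthesis lattice in
the two-multiplier form of Figure 6.9, with $k sub p$ at the excitation end
as in the cascades above; the function and array names are ours, and the
delayed backward-path values are held in the array b, which should be
cleared before the first sample.
.LB
.nf
const p = 12;
type kvec = array [1..p] of real;
     bvec = array [0..p] of real;

{ one sample of lattice synthesis:  e is the excitation sample, k[1..p]
  the reflection coefficients, and b[0..p] the delayed backward-path
  signals from the previous sample }
function lattice(e: real; var k: kvec; var b: bvec): real;
var m: integer;  f: real;
begin
  f := e;                        { excitation enters at the upper left }
  for m := p downto 1 do begin
    f := f + k[m]*b[m\-1];       { forward (upper) path }
    b[m] := b[m\-1] \- k[m]*f    { backward (lower) path;  b[p] is the
                                   signal discarded at the lower left }
  end;
  b[0] := f;                     { the output re-enters the delay chain }
  lattice := f
end;
.fi
.LE
Driving this with $p=2$ and comparing the output against the direct form
with  $a sub 1 = k sub 1 (1-k sub 2 )$  and  $a sub 2 = k sub 2$  provides
a quick check.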
7679.rh "Different lattice configurations."
7680The lattice filters of Figures 6.8, 6.9, and 6.10 have two multipliers
7681per section.
7682This is called a "two-multiplier" configuration.
7683However, there are other configurations which achieve
7684the same effect, but require different numbers of multiplies.
7685Figure 6.11 shows one-multiplier and four-multiplier configurations,
7686along with the familiar two-multiplier one.
7687.FC "Figure 6.11"
7688It is easy to verify that the three configurations can be modelled in
7689matrix terms by
7690.LB
7691.ne4
7692$
7693left [ matrix {ccol {X sup + above X sup -}} right ] ~~ = ~~
7694left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ]
7695~ left [ matrix {ccol {Y sup + above Y sup -}} right ]
7696$		two-multiplier configuration
7697.sp
7698.sp
7699.ne4
7700$
7701left [ matrix {ccol {X sup + above X sup -}} right ] ~~ = ~~
7702left [ {1-k over 1+k} right ] sup 1/2 ~
7703left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ]
7704~ left [ matrix {ccol {Y sup + above Y sup -}} right ]
7705$	one-multiplier configuration
7706.sp
7707.sp
7708.ne4
7709$
7710left [ matrix {ccol {X sup + above X sup -}} right ] ~~ = ~~
77111 over {(1-k sup 2) sup 1/2} ~
7712left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ]
7713~ left [ matrix {ccol {Y sup + above Y sup -}} right ]
7714$	four-multiplier configuration.
7715.LE
7716Each of the three has the same frequency-domain response, although
7717a different constant factor is involved in each case.
7718The effect of this can be annulled by performing a single multiply
7719operation on the output of a complete lattice chain.
7720The multiplier has the form
7721.LB
7722.EQ
7723left [ {1 - k sub p} over {1 + k sub p} ~.~
7724{1 - k sub p-1} over {1 + k sub p-1} ~.~...~.~
7725{1 - k sub 1} over {1 + k sub 1} right ] sup 1/2
7726.EN
7727.sp
7728.LE
7729for single-multiplier lattices, and
7730.LB
7731.EQ
7732left [ 1 over {1 - k sub p sup 2} ~.~
77331 over {1 - k sub p-1 sup 2} ~.~...~.~
77341 over {1 - k sub 1 sup 2} right ] sup 1/2
7735.EN
7736.LE
7737for four-multiplier lattices, where the reflection coefficients
7738in the lattice are $k sub p$, $k sub p-1$, ..., $k sub 1$.
7739.pp
7740There are important differences between these three configurations.
7741If multiplication is time-consuming, the one-multiplier model has obvious
7742computational advantages over the other two methods.
7743However, the four-multiplier structure behaves substantially better
7744in finite word-length implementations.  It is easy to show that, with this
7745configuration,
7746.LB
7747.EQ
7748(X sup - ) sup 2 ~+~ (Y sup + ) sup 2 ~~ = ~~
7749(X sup + ) sup 2 ~+~ (z sup -1 Y sup - ) sup 2 ,
7750.EN
7751.LE
7752\(em a relationship which suggests that the "energy" in the
input signals, namely  $X sup +$ and $Y sup -$,  is preserved in the output
7754signals,  $X sup -$ and $Y sup +$.
7755Notice that care must be taken with the $z$-transforms, since squaring is a
7756non-linear operation.  $(z sup -1 Y sup - ) sup 2$  means the square of
7757the previous value of  $Y sup -$,  which is not the same
7758as  $z sup -2 (Y sup - ) sup 2$.
7759.pp
7760It has been shown (Gray and Markel, 1975) that the four-multiplier
7761configuration has some stability properties which are not shared by other
7762digital filter structures.
7763.[
7764Gray Markel 1975 Normalized digital filter structure
7765.]
7766When a linear predictive filter is used for synthesis, the parameters
7767of the filter \(em the $k$-parameters in the case of lattice filters,
7768and the $a$-parameters in the case of direct ones \(em change with time.
7769It is usually rather difficult to guarantee stability in the case of
7770time-varying filter parameters, but some guarantees can be made for a
7771chain of four-multiplier lattices.  Furthermore, if the input is a
7772discrete delta function, the cumulative energies at each stage of the
7773lattice are the same, and so maximum dynamic range will be achieved
7774for the whole filter if each section is implemented with the same
7775word size.
7776.rh "Lattice analysis."
7777It is quite easy to construct a filter which is inverse to
7778a single-stage lattice.
7779The structure of Figure 6.12(a) does the job.
7780(Ignore for a moment
7781the dashed lines connecting Figure 6.12(a) and (b).)  Its matrix transfer
7782function is
7783.FC "Figure 6.12"
7784.LB
7785.ne4
7786$
7787left [ matrix {ccol {Y sup + above Y sup -}} right ] ~~=~~
7788left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ]
7789~ left [ matrix {ccol {X sup + above X sup -}} right ]
7790$	analysis lattice (Figure 6.12(a)).
7791.LE
7792Notice that this is exactly the same as the transfer function of the
7793synthesis lattice of Figure 6.9, which is reproduced
7794in Figure 6.12(b), except that the $X$'s and $Y$'s are reversed:
7795.LB
7796.ne4
7797$
7798left [ matrix {ccol {X sup + above X sup -}} right ] ~~=~~
7799left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ]
7800~ left [ matrix {ccol {Y sup + above Y sup -}} right ]
7801$	synthesis lattice (Figure 6.12(b)),
7802.LE
7803or, in other words,
7804.LB
7805.ne4
7806$
7807left [ matrix {ccol {Y sup + above Y sup -}} right ] ~~ = ~~
7808left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}}
7809right ] sup -1
7810~ left [ matrix {ccol {X sup + above X sup -}} right ]
7811$	synthesis lattice (Figure 6.12(b)).
7812.LE
7813Hence if the filters of Figures 6.12(a) and (b) were connected together
7814as shown by the dashed lines, they
7815would cancel each other out, and the overall transfer would be unity:
7816.LB
7817.ne4
7818.EQ
7819left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}}
7820right ] ~
7821left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}}
7822right ] sup -1 ~~ = ~~
7823left [ matrix {ccol {1 above 0} ccol {0 above 1}} right ] ~ .
7824.EN
7825.LE
7826Actually, such a connection is not possible in physical terms,
7827for although the upper paths can be joined together the lower ones can not.
7828The right-hand lower point of Figure 6.12(a) is an
7829.ul
7830output
7831terminal, and so is the left-hand lower one of Figure 6.12(b)!  However,
7832there is no need to envisage a physical connection of the lower paths.
7833It is sufficient for cancellation just to assume that the signals at both
7834of the points turn out to be the same.
7835.pp
7836And they do.
7837The general case of a $p$-stage analysis lattice
7838connected to a $p$-stage synthesis
7839lattice is shown in Figure 6.13.
7840.FC "Figure 6.13"
7841Notice that the forward and backward paths are connected together at both
7842of the extreme ends of the system.
7843It is not difficult to show that under these
conditions the signal at the lower right-hand
terminal of the analysis chain will equal that at the lower left-hand
7846terminal of the synthesis chain, even though they are not connected,
7847provided the upper terminals are connected together as shown by the dashed
7848line.
7849Of course, the reflection coefficients  $k sub 1$, $k sub 2$, ...,
7850$k sub p$  in the analysis lattice must equal those in the synthesis
7851lattice, and as Figure 6.13 shows the order is reversed in the synthesis
7852lattice.
7853Successive analysis and synthesis sections pair off, working from
7854the middle outwards.  At each stage the sections cancel each other out,
7855giving a unit transfer function as demonstrated above.
7856.rh "Estimating reflection coefficients."
7857As stated earlier in this chapter, the key problem in linear prediction is to
7858determine the values of the predictive coefficients \(em in this case, the
7859reflection coefficients.
7860If this is done correctly, we have shown using Procedure 6.3 that
the synthesis part of Figure 6.13 performs the same calculation that
7862a conventional direct-form linear predictive synthesizer would, and hence
7863the signal that excites it \(em that is, the signal represented by the
7864dashed line \(em must be the prediction residual, or error signal, discussed
7865earlier.  The system is effectively the same as the high-order adaptive
7866differential pulse code modulation one of Figure 6.1.
7867.pp
7868One of the most interesting features of the lattice structure for
7869analysis filters is that calculation of suitable values for the
7870reflection coefficients can be done locally at each stage of the lattice.
7871For example, consider the $i$'th section of the analysis lattice in
7872Figure 6.13.  It is possible to determine a suitable value of $k sub i$
7873simply by performing a calculation on the inputs to the $i$'th
7874section (ie $X sup +$ and $X sup -$ in Figure 6.12).
7875No longer need the complicated global optimization technique of matrix
7876inversion be used, as in the autocorrelation and covariance methods discussed
7877earlier.
7878.pp
7879A suitable value for $k$ in the single lattice section of Figure 6.12 is
7880.LB
7881.EQ
7882k~ = ~~ {E[ x sup + (n) x sup - (n-1)]} over
7883{( E[ x sup + (n) sup 2 ] E[ x sup - (n-1) sup 2 ] ) sup 1/2} ~~ ;
7884.EN
7885.LE
7886that is, the statistical correlation between $x sup + (n)$ and
7887$x sup - (n-1)$.
7888Here, $x sup + (n)$ and $x sup - (n)$ represent the input signals to the
7889upper and lower paths (recall that $X sup +$ and $X sup -$
7890are their $z$-transforms).
7891$x sup - (n-1)$ is just $x sup - (n)$ delayed by one time unit, that is,
7892the output of the $z sup -1$ box in the Figure.
7893.pp
7894The criterion of optimality for the autocorrelation and covariance methods
7895was that the prediction error, that is, the signal which emerges from
7896the right-hand end of the upper path of a lattice analysis filter,
7897should be minimized in a mean-square sense.
7898The reflection coefficients obtained from the above formula do not necessarily
7899satisfy any such global minimization criterion.
7900Nevertheless, they do keep the error signal small, and have been used with
7901success in speech analysis systems.
7902.pp
7903It is easy to minimize the output from either the upper or the lower path
7904of the lattice filter at each stage.  For example, the $z$-transform of the
7905upper output is given by
7906.LB
7907.EQ
7908Y sup + ~~=~~ X sup + ~-~ k z sup -1 X sup - ,
7909.EN
7910.LE
7911or
7912.LB
7913.EQ
7914y sup + (n) ~~=~~ x sup + (n) ~-~ k x sup - (n-1) .
7915.EN
7916.LE
7917Hence
7918.LB
7919.EQ
7920E[y sup + (n) sup 2 ] ~~ = ~~ E[x sup + (n) sup 2 ] ~-~
79212kE[x sup + (n) x sup - (n-1) ] ~+~ k sup 2 E [x sup - (n-1) sup 2 ] ,
7922.EN
7923.LE
7924where $E$ stands for expected value, and this reaches a minimum when the
7925derivative with respect to $k$ becomes zero:
7926.LB
7927.EQ
7928-2E[x sup + (n) x sup - (n-1) ] ~+~ 2kE[x sup - (n-1) sup 2 ] ~~=~0 ,
7929.EN
7930.LE
7931that is, when
7932.LB
7933.EQ
7934k~ = ~~ {E[x sup + (n) x sup - (n-1) ]} over {E[x sup - (n-1) sup 2 ]
7935} ~ .
7936.EN
7937.LE
7938A similar calculation shows that the output of the lower path is minimized
7939when
7940.LB
7941.EQ
k~ = ~~ {E[x sup + (n) x sup - (n-1) ]} over {E[x sup + (n) sup 2 ]
7943} ~ .
7944.EN
7945.LE
7946Unfortunately, either of these expressions can exceed 1, leading to an
7947unstable filter.
7948The value of $k$ cited earlier is the geometric mean of these two
expressions, and being a correlation coefficient it cannot exceed 1 in magnitude.
7950.pp
7951Another possibility is to minimize the expected value of the sum of the
7952squares of the upper and lower outputs:
7953.LB
7954.EQ
7955y sup + (n) sup 2 ~+~ y sup - (n) sup 2 ~~ = ~~
(1+k sup 2 )x sup + (n) sup 2 ~-~ 4kx sup + (n) x sup - (n-1) ~+~
(1+k sup 2 )x sup - (n-1) sup 2 .
7958.EN
7959.LE
Taking expected values and setting the derivative with respect to $k$ to zero
7961leads to
7962.LB
7963.EQ
7964k~ = ~~ {E[x sup + (n) x sup - (n-1) ]} over
7965{ half ~ E[x sup + (n) sup 2 ~+~ x sup - (n-1) sup 2 ]} ~.
7966.EN
7967.LE
This too is guaranteed not to exceed 1 in magnitude, and has given good results
7969in speech analysis systems.
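.pp
A sketch of this estimate, computed over a block of samples rather than as
a true expected value, is given below.  The function and array names are
ours, and the block length is an arbitrary choice;  f[n] and bd[n] stand
for  $x sup + (n)$  and  $x sup - (n-1)$  respectively.
.LB
.nf
const N = 160;                     { 20 msec at an 8 kHz sampling rate }
type svec = array [0..N\-1] of real;

{ reflection coefficient for one lattice stage, estimated over one
  block from the forward signal f and the delayed backward signal bd }
function estimate(var f, bd: svec): real;
var n: integer;  num, den: real;
begin
  num := 0;  den := 0;
  for n := 0 to N\-1 do begin
    num := num + f[n]*bd[n];
    den := den + 0.5*(f[n]*f[n] + bd[n]*bd[n])
  end;
  { a silent block, with den zero, must be trapped by the caller }
  estimate := num/den
end;
.fi
.LE
In a real implementation this block estimate would be replaced by the
running, low-pass filtered correlator described next.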
7970.pp
7971Figure 6.14 shows the implementation of a single section of an analysis
7972lattice.
7973.FC "Figure 6.14"
7974The signals $x sup + (n)$ and $x sup - (n-1)$ are fed to a
7975correlator, which produces a suitable value for $k$.
7976This value is used to calculate the output of the lattice section,
7977and hence the input to the next lattice section.
7978The reflection coefficient needs to be low-pass filtered, because it will
7979only be transmitted to the synthesizer occasionally (say every 20\ msec) and so a
7980short-term average is required.
7981.pp
7982One implementation of the correlator is shown in Figure 6.15 (Kang, 1974).
7983.[
7984Kang 1974
7985.]
7986.FC "Figure 6.15"
7987This calculates the value of $k$ given by the last equation above, and does it
7988by summing and differencing the two
7989signals $x sup + (n)$ and $x sup - (n-1)$, squaring the results to give
7990.LB
7991.EQ
7992x sup + (n) sup 2 + 2x sup + (n mark ) x sup - (n-1) +x sup - (n-1) sup 2
7993~~~~~~~~ x sup + (n) sup 2 - 2x sup + (n) x sup - (n-1) +x sup - (n-1) sup 2
7994~ ,
7995.EN
7996.LE
7997and summing and differencing these, to yield
7998.LB
7999.EQ
8000lineup 2x sup + (n) sup 2 + 2x sup - (n-1) sup 2 ~~~~~~~~
80014x sup + (n) x sup - (n-1) ~ .
8002.EN
8003.LE
8004.sp
8005Before these are divided to give the final coefficient $k$, they are
8006individually low-pass filtered.
8007While some rather complex schemes have been proposed,
8008based upon Kalman filter theory (eg Matsui
8009.ul
8010et al,
80111972),
8012.[
8013Matsui Nakajima Suzuki Omura 1972
8014.]
8015a simple exponential weighted past average has been found to be
8016satisfactory.  This has $z$-transform
8017.LB
8018.EQ
80191 over {64 - 63 z sup -1} ~ ,
8020.EN
8021.LE
that is, in the time domain (writing $x(n)$ for the filter input and $y(n)$ for its output),
.LB
.EQ
y(n)~ = ~~ 63 over 64 ~ y(n-1) ~+~ 1 over 64 ~ x(n) ~ .
8026.EN
8027.LE
8028This filter exponentially averages past sample values
8029with a time-constant of 64 sampling intervals
8030\(em that is, 8\ msec at an 8\ kHz sampling rate.
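.pp
The smoothing recursion itself is trivial; as a purely illustrative sketch
in Python (the variable names are arbitrary):
.LB
.nf
def smooth(samples):
    # y(n) = (63/64) y(n-1) + (1/64) x(n): an exponentially weighted
    # average with a time constant of 64 samples, which is 8 msec at
    # an 8 kHz sampling rate.
    y, out = 0.0, []
    for x in samples:
        y = (63.0 / 64.0) * y + (1.0 / 64.0) * x
        out.append(y)
    return out
.fi
.LE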
8031.sh "6.4  Pitch estimation"
8032.pp
8033It is sometimes useful to think of linear prediction as a kind of
8034curve-fitting technique.
8035Figure 6.16 illustrates how four samples of a speech signal can predict
8036the next one.
8037.FC "Figure 6.16"
8038In essence, a curve is drawn through four points
8039to predict the position of the fifth, and only the prediction error
8040is actually transmitted.  Now if the order of linear prediction
8041is high enough (at least 10), and if the coefficients are chosen
8042correctly, the prediction will closely model the resonances of the
8043vocal tract.  Thus the error will actually be zero, except at pitch
8044pulses.
8045.pp
8046Figure 6.17 shows a segment of voiced speech together with the prediction
8047error (often called the prediction residual).
8048.FC "Figure 6.17"
8049It is apparent that the
8050error is indeed small, except at pitch pulses.
8051This suggests that a good way to determine the pitch period is to examine
8052the error signal, perhaps by looking at its autocorrelation function.
8053As with all pitch detection methods, one must be
8054careful:  spurious peaks can occur, especially in nasal sounds when
8055the all-pole model provided by linear prediction fails.  Continuity
8056constraints, which use previous values of pitch period when determining
8057which peak to accept as a new pitch impulse, can eliminate many of these
8058spurious peaks.  Unvoiced speech should produce an error signal with no
8059prominent peaks, and this needs to be detected.
8060Voiced fricatives are a difficult case:  peaks should be present
8061but the general noise level of the error signal will be greater than
8062it is in
8063purely voiced speech.
8064Such considerations have been taken into account in a practical pitch
8065estimation system based upon this technique (Markel, 1972).
8066.[
8067Markel 1972 SIFT
8068.]
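.pp
Ignoring the voicing decision and the continuity constraints, the core of
such a scheme can be sketched in a few lines of Python; the search range of
50\-400\ Hz and the names used here are simply assumptions for illustration.
.LB
.nf
def pitch_period(residual, fs=8000, f_min=50.0, f_max=400.0):
    # Pick the lag with the largest autocorrelation of the prediction
    # residual, searching only lags corresponding to plausible pitches.
    n = len(residual)
    lo = int(fs / f_max)                  # shortest lag considered
    hi = min(int(fs / f_min), n - 1)      # longest lag considered
    best_lag, best_val = lo, float("-inf")
    for lag in range(lo, hi + 1):
        acf = sum(residual[i] * residual[i - lag] for i in range(lag, n))
        if acf > best_val:
            best_lag, best_val = lag, acf
    return best_lag                       # pitch period in samples
.fi
.LE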
8069.pp
8070This method of pitch detection highlights another advantage of the lattice
8071analysis technique.  When using autocorrelation or covariance analysis to
8072determine the filter (or reflection) coefficients, the error signal is not
8073normally produced.  It can, of course, be found by taking the speech samples
8074which constitute the current frame and running them through an analysis
8075filter whose parameters are those determined by the analysis, but this
8076is a computationally demanding exercise, for the filter must run at the
8077speech sampling rate (say 8\ kHz) instead of at the frame rate (say 50\ Hz).
8078Usually, pitch is estimated by other methods, like those discussed in
8079Chapter 4, when using autocorrelation or covariance linear prediction.
8080However, we have seen above that with the lattice method, the error
8081signal is produced as a byproduct:  it appears at the right-hand end
8082of the  upper path of the lattice chain.  Thus it is already available
8083for use in determining pitch periods.
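.pp
For comparison, what the extra computation amounts to can be sketched as
follows (in Python, with assumed names); every speech sample requires one
multiplication per predictor coefficient.
.LB
.nf
def residual(speech, a):
    # e(n) = s(n) - sum of a[k]*s(n-k) for k = 1..p:
    # inverse filtering with the predictor coefficients (a[0] unused).
    p = len(a) - 1
    e = []
    for n in range(len(speech)):
        pred = sum(a[k] * speech[n - k] for k in range(1, p + 1) if n >= k)
        e.append(speech[n] - pred)
    return e
.fi
.LE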
8084.sh "6.5  Parameter coding for linear predictive storage or transmission"
8085.pp
8086In this section, the coding requirements of linear predictive parameters
8087will be examined.  The parameters that need to be stored or transmitted
8088are:
8089.LB
8090.NP
8091pitch
8092.NP
8093voiced-unvoiced flag
8094.NP
8095overall amplitude level
8096.NP
8097filter coefficients or reflection coefficients.
8098.LE
8099The first three are parameters of the excitation source.
8100They can be derived directly from the error signal as indicated above, if
8101it is generated (as it is in lattice implementations); or by other
8102methods if no error signal is calculated.
8103The filter or reflection coefficients are, of course, the main product
8104of linear predictive analysis.
8105.pp
8106It is generally agreed that around 60 levels, logarithmically spaced,
8107are needed to represent pitch for telephone quality speech.
8108The voiced-unvoiced indication requires one bit, but since pitch is
8109irrelevant in unvoiced speech it can be coded as one of the pitch
8110levels.  For example, with 6-bit coding of pitch, the value 0 can be
8111reserved to indicate unvoiced speech, with values 1\-63 indicating the
8112pitch of voiced speech.
8113The overall gain has not been discussed above:  it is simply the average
8114amplitude of the error signal.  Five bits on a logarithmic scale
8115are sufficient to represent it.
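.pp
A minimal sketch of such a pitch-and-voicing code is given below in Python;
the 50\-400\ Hz range is merely an assumption chosen for illustration.
.LB
.nf
import math

F_LO, F_HI = 50.0, 400.0                # assumed pitch range in Hz

def encode_pitch(f0, voiced):
    # 6-bit code: 0 means unvoiced; 1-63 index a logarithmic pitch scale.
    if not voiced:
        return 0
    x = (math.log(f0) - math.log(F_LO)) / (math.log(F_HI) - math.log(F_LO))
    return 1 + min(62, max(0, int(round(x * 62))))

def decode_pitch(code):
    if code == 0:
        return None                     # unvoiced frame
    x = (code - 1) / 62.0
    return F_LO * math.exp(x * (math.log(F_HI) - math.log(F_LO)))
.fi
.LE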
8116.pp
8117Filter coefficients are not very amenable to quantization.  At least
81188\-10\ bits are required for each one.  However, reflection coefficients
8119are better behaved, and 5\-6\ bits each seems adequate.  The number of
8120coefficients that must be stored or transmitted is the same as the
8121order of the linear prediction:  10 is commonly used for low-quality
8122speech, with as many as 15 for higher qualities.
8123.pp
These figures give around 100\ bits/frame for a 10th order system using
filter coefficients, and around 65\ bits/frame for a 10th order system
8126using reflection coefficients.  Frame lengths vary between 10\ msec
8127and 25\ msec, depending on the quality desired.  Thus for 20\ msec frames,
8128the data rates work out at around 5000\ bit/s using filter coefficients,
8129and 3250\ bit/s using reflection coefficients.
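.pp
As a check on the arithmetic, the figures can be reproduced with a few lines
of Python (the per-coefficient allocations are mid-range assumptions):
.LB
.nf
frames_per_sec = 1000.0 / 20.0        # 20 msec frames

bits_filter = 6 + 5 + 10 * 9          # pitch+voicing, gain, ~9 bits/coefficient
bits_refl   = 6 + 5 + 10 * 5.5        # ~5-6 bits per reflection coefficient

print(bits_filter * frames_per_sec)   # roughly 5000 bit/s
print(bits_refl * frames_per_sec)     # roughly 3300 bit/s
.fi
.LE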
8130.pp
8131Substantially lower data rates can be achieved by more careful
8132coding of parameters.  In 1976, the US Government defined a standard
8133coding scheme for 10-pole linear prediction with a data rate of
81342400\ bit/s \(em conveniently chosen as one of the
8135commonly-used rates for serial data transmission.
8136This standard, called LPC-10, tackles the difficult problem of
8137protection against transmission errors (Fussell
8138.ul
8139et al,
81401978).
8141.[
8142Fussell Boudra Abzug Cowing 1978
8143.]
8144.pp
8145Whenever data rates are reduced, redundancy inherent in the signal is
8146necessarily lost and so the effect of transmission errors becomes
8147greatly magnified.
8148For example, a single corrupted sample in PCM transmission of speech
8149will probably not be noticed, and even a short burst of errors will be
8150perceived as a click which can readily be distinguished from the speech.
8151However, any error in LPC transmission will last for one entire
8152frame \(em say 20\ msec \(em and worse still, it will be integrated into the
8153speech signal and not easily discriminated from it by the listener's brain.
8154A single corruption may, for example, change a voiced frame into an
8155unvoiced one, or vice versa.  Even if it affects only
8156a reflection coefficient it will change the resonance characteristics
8157of that frame, and change them in a way that does not simply sound like
8158superimposed noise.
8159.pp
8160Table 6.1 shows the LPC-10 coding scheme.
8161.RF
8162.in+0.1i
8163.ta 2.0i +1.8i +0.6i
8164.nr x1 (\w'voiced sounds'/2)
8165.nr x2 (\w'unvoiced sounds'/2)
8166.ul
8167	\h'-\n(x1u'voiced sounds	\h'-\n(x2u'unvoiced sounds
8168.sp
8169pitch/voicing	7	7	60 pitch levels, Hamming
8170			\h'\w'00 'u'and Gray coded
8171energy	5	5	logarithmically coded
8172$k sub 1$	5	5	coded by table lookup
8173$k sub 2$	5	5	coded by table lookup
8174$k sub 3$	5	5
8175$k sub 4$	5	5
8176$k sub 5$	4	\-
8177$k sub 6$	4	\-
8178$k sub 7$	4	\-
8179$k sub 8$	4	\-
8180$k sub 9$	3	\-
8181$k sub 10$	2	\-
8182synchronization	1	1	alternating 1,0 pattern
8183error detection/	\-	\h'-\w'0'u'21
8184correction
8185	\h'-\w'__'u+\w'0'u'__	\h'-\w'__'u+\w'0'u'__
8186.sp
8187	\h'-\w'0'u'54	\h'-\w'0'u'54
8188.sp
8189.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
8190	frame rate: 44.4\ Hz (22.5\ msec frames)
8191.in 0
8192.FG "Table 6.1  Bit requirements for each parameter in LPC-10 coding scheme"
8193Different coding is used for voiced and unvoiced frames.
8194Only four reflection coefficients are transmitted for unvoiced frames,
8195because it has been determined that no perceptible increase in speech quality
8196occurs when more are used.
8197The bits saved are more fruitfully employed to provide error detection
8198and correction for the other parameters.
8199Seven bits are used for pitch and the voiced-unvoiced flag, and they are
8200redundant in that only 60 possible pitch values are
8201allowed.
Most transmission errors in this field will be detected by the receiver,
which can then use an estimate of pitch based on previous values and
8204discard the erroneous one.  Pitch values are also Gray coded so that
8205even if errors are not detected, there is a good chance that an adjacent
8206pitch value is read instead.
8207Different numbers of bits are allocated to the various reflection
8208coefficients:  experience shows that the lower-numbered ones contribute
8209most highly to intelligibility and so these are quantized most finely.
8210In addition, a table lookup operation is performed on the code
8211generated for the first two, providing a non-linear quantization which is
8212chosen to minimize the error on a statistical basis.
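.pp
Gray coding itself is a one-line operation; a purely illustrative Python
sketch of the conversion in each direction is:
.LB
.nf
def to_gray(n):
    # adjacent pitch indices differ in only one bit of their Gray codes
    return n ^ (n >> 1)

def from_gray(g):
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n
.fi
.LE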
8213.pp
8214With 54\ bits/frame and 22.5\ msec frames, LPC-10 requires a 2400\ bit/s
8215data rate.  Even lower rates have been used successfully for lower-quality
8216speech.  The Speak 'n Spell toy, described in Chapter 11, has an
8217average data rate of 1200\ bit/s.  Rates as low as 600\ bit/s have
8218been achieved (Kang and Coulter, 1976) by pattern recognition techniques operating
8219on the reflection coefficients:  however, the speech quality is not good.
8220.[
8221Kang Coulter 1976
8222.]
8223.sh "6.6  References"
8224.LB "nnnn"
8225.[
8226$LIST$
8227.]
8228.LE "nnnn"
8229.sh "6.7  Further reading"
8230.pp
8231Most recent books on digital signal processing contain some information
8232on linear prediction (see Oppenheim and Schafer, 1975; Rabiner and Gold, 1975;
8233and Rabiner and Schafer, 1978; all referenced at the end of Chapter 4).
8234.LB "nn"
8235.\"Atal-1971-1
8236.]-
8237.ds [A Atal, B.S.
8238.as [A " and Hanauer, S.L.
8239.ds [D 1971
8240.ds [T Speech analysis and synthesis by linear prediction of the acoustic wave
8241.ds [J JASA
8242.ds [V 50
8243.ds [P 637-655
8244.nr [P 1
8245.ds [O August
8246.nr [T 0
8247.nr [A 1
8248.nr [O 0
8249.][ 1 journal-article
8250.in+2n
8251This paper is of historical importance because it introduced the idea
8252of linear prediction to the speech processing community.
8253.in-2n
8254.\"Makhoul-1975-2
8255.]-
8256.ds [A Makhoul, J.I.
8257.ds [D 1975
8258.ds [K *
8259.ds [T Linear prediction: a tutorial review
8260.ds [J Proc IEEE
8261.ds [V 63
8262.ds [N 4
8263.ds [P 561-580
8264.nr [P 1
8265.ds [O April
8266.nr [T 0
8267.nr [A 1
8268.nr [O 0
8269.][ 1 journal-article
8270.in+2n
8271An interesting, informative, and readable survey of linear prediction.
8272.in-2n
8273.\"Markel-1976-3
8274.]-
8275.ds [A Markel, J.D.
8276.as [A " and Gray, A.H.
8277.ds [D 1976
8278.ds [T Linear prediction of speech
8279.ds [I Springer Verlag
8280.ds [C Berlin
8281.nr [T 0
8282.nr [A 1
8283.nr [O 0
8284.][ 2 book
8285.in+2n
8286This is the only book which is entirely devoted to linear prediction of speech.
8287It is an essential reference work for those interested in the subject.
8288.in-2n
8289.\"Wiener-1947-4
8290.]-
8291.ds [A Wiener, N.
8292.ds [D 1947
8293.ds [T Extrapolation, interpolation and smoothing of stationary time series
8294.ds [I MIT Press
8295.ds [C Cambridge, Massachusetts
8296.nr [T 0
8297.nr [A 1
8298.nr [O 0
8299.][ 2 book
8300.in+2n
8301Linear prediction is often thought of as a relatively new technique,
8302but it is only its application to speech processing that is novel.
8303Wiener develops all of the basic mathematics used in linear prediction
8304of speech, except the lattice filter structure.
8305.in-2n
8306.LE "nn"
8307.EQ
8308delim $$
8309.EN
8310.CH "7  JOINING SEGMENTS OF SPEECH"
8311.ds RT "Joining segments of speech
8312.ds CX "Principles of computer speech
8313.pp
8314The obvious way to provide speech output from computers
8315is to select the basic acoustic units to be used; record them;
8316and generate utterances by concatenating together appropriate segments
8317from this pre-stored inventory.
8318The crucial question then becomes, what are the basic units?
8319Should they be whole sentences, words, syllables, or phonemes?
8320.pp
8321There are several trade-offs to be considered here.
8322The larger the units, the more utterances have to be stored.
8323It is not so much the length of individual utterances that is of concern,
8324but rather their variety, which tends to increase exponentially instead
8325of linearly with the size of the basic unit.  Numbers provide an
8326easy example:  there are $10 sup 7$ 7-digit telephone numbers, and it is
8327certainly infeasible to record each one individually.
8328Note that as storage technology improves the limitation is becoming
8329more and more one of recording the utterances in the first place rather
8330than finding somewhere to store them.
8331At a PCM data rate of 50\ Kbit/s, a 100\ Mbyte disk can hold over 4\ hours
8332of continuous speech.
8333With linear predictive coding at 1\ Kbit/s it holds 0.8 of a
8334megasecond \(em well over a week.  And this is a 24-hour 7-day week,
8335which corresponds to a working month; and continuous speech \(em without
8336pauses \(em which probably requires another factor of five for
8337production by a person.
8338Setting up a recording session to fill the disk would be a formidable
8339task indeed!
8340Furthermore, the use of videodisks \(em which will be common domestic items
8341by the end of the decade \(em could increase these figures by a factor of 50.
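.pp
These figures are easily reproduced; for example, in Python:
.LB
.nf
disk_bits = 100e6 * 8              # a 100 Mbyte disk

print(disk_bits / 50e3 / 3600)     # PCM at 50 Kbit/s: about 4.4 hours
print(disk_bits / 1e3 / 1e6)       # LPC at 1 Kbit/s: 0.8 megaseconds
.fi
.LE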
8342.pp
8343The word seems to be a sensibly-sized basic unit.
8344Many applications use a rather limited vocabulary \(em 190 words
8345for the airline reservation system described in Chapter 1.
8346Even at PCM data rates, this will consume less than 0.5\ Mbyte of
8347storage.
8348Unfortunately, coarticulation and prosodic factors now come into play.
8349.pp
8350Real speech is connected \(em there are few gaps between words.
8351Coarticulation, where sounds are affected by those on either side,
8352naturally operates across word boundaries.
8353And the time constants of coarticulation are associated with the
8354mechanics of the vocal tract and hence measure tens or hundreds
8355of msec.  Thus the effects straddle several pitch periods (100\ Hz pitch
8356has 10\ msec period) and cannot be simulated by simple interpolation of the
8357speech waveform.
8358.pp
8359Prosodic features \(em notably pitch and rhythm \(em span much longer
8360stretches of speech than single words.  As far as most speech output
8361applications are concerned, they operate at the utterance level of
8362a single, sentence-sized, information unit.  They cannot be
accommodated if speech waveforms of individual words of
8364the utterance are stored,
8365for it is rarely feasible to alter the fundamental
8366frequency or duration of a time waveform without changing all the formant
8367resonances as well.
8368However, both word-to-word coarticulation and the essential features
8369of rhythm and intonation can be incorporated if the stored words are
8370coded in source-filter form.
8371.pp
8372For more general applications of speech output, the limitations of
8373word storage soon become apparent.  Although people's daily
8374vocabularies are not large, most words have a variety
of inflected forms which need to be treated separately if a strict
policy of word storage is adopted.  For instance, in this book
8377there are 84,000 words, and 6,500 (8%) different ones (counting
8378inflected forms).
8379In Chapter 1 alone, there are 6,800 words and 1,700 (25%) different ones.
8380.pp
8381It seems crazy to treat a simple inflection like "$-s$" or its voiced
8382counterpart, "$-z$" (as in "inflection\c
8383.ul
8384s\c
8385"),
8386as a totally different word from the base form.
8387But once you consider storing roots and endings separately,
8388it becomes apparent
8389that there is a vast number of different endings, and it is difficult to know
8390where to draw the line.  It is natural to think instead of simply
8391using the syllable as the basic unit.
8392.pp
8393A generous estimate of the number of different syllables in English is 10,000.
8394At three a second, only about an
8395hour's storage is required for them all.  But waveform storage
8396will certainly not do.
8397Although coarticulation effects between words are needed to make
8398speech sound fluent, coarticulation between syllables is necessary
8399for it even to be
8400.ul
8401comprehensible.
8402Adopting a source-filter form of representation is essential, as is
8403some scheme of interpolation between syllables which simulates
8404coarticulation.
8405Unfortunately, a great deal of acoustic action occurs at syllable
8406boundaries \(em stops are exploded, the sound source changes
8407between voicing and frication, and so on.  It may be more appropriate
8408to consider inverse syllables, comprising a vowel-consonant-vowel sequence
8409instead of consonant-vowel-consonant.
8410(These have jokingly been dubbed "lisibles"!)
8411.pp
8412There is again some considerable practical difficulty in creating
8413an inventory of syllables, or lisibles.
8414Now it is not so much the recording that is impractical, but
8415the editing needed to ensure that the cuts between syllables are made
8416at exactly the right point.  As units get smaller, the exact
8417placement of the boundaries becomes ever more critical; and several thousand
8418sensitive editing jobs is no easy task.
8419.pp
Since quite general effects of coarticulation must be accommodated
8421with syllable synthesis, there will not necessarily be significant
8422deterioration if smaller, demisyllable, units are employed.
8423This reduces the segment inventory to an estimated 1000\-2000 entries,
8424and the tedious job of editing each one individually becomes at
8425least feasible, if not enviable.
8426Alternatively, the segment inventory could be created by artificial
8427means involving cut-and-try experiments with resonance parameters.
8428.pp
8429The ultimate in economy of inventory size, of course, is to use
8430phonemes as the basic unit.  This makes the most critical
8431part of the task interpolation between units, rather than their
8432construction or recording.  With only about 40 phonemes
8433in English, each one can be examined in many different contexts to
8434ascertain the best data to store.
8435There is no need to record them directly from a human voice \(em it
would be difficult anyway, for most cannot be produced in isolation.
8437In fact, a phoneme is an abstract unit, not a particular sound
8438(recall the discussion of phonology in Chapter 2), and so it is
8439most appropriate that data be abstracted from several different
8440realizations rather than an exact record made of any one.
8441.pp
8442If information is stored about phonological units of
8443speech \(em phonemes \(em the difficult task of phonological-to-phonetic
8444conversion must necessarily be performed automatically.
8445Allophones are created by altering the transitions between units,
8446and to a lesser extent by modifying the central parts of the units
8447themselves.
8448The rules for making transitions will have a big effect on the
8449quality of the resulting speech.
8450Instead of trying to perform this task automatically by a computer
8451program, the allophones themselves could be stored.  This will
8452ease the job of generating transitions between segments, but
8453will certainly not eliminate it.
8454The total number of allophones will depend on the narrowness of the
8455transcription system:  60\-80 is typical, and it is unlikely to exceed
8456one or two hundred.  In any case there will not be a storage problem.
8457However, now the burden of producing an allophonic transcription
8458has been transferred to the person who codes the utterance prior
8459to synthesizing it.  If he is skilful and patient, he should
8460be able to coax the system into producing fairly understandable
8461speech, but the effort required for this on a per-utterance basis
8462should not be underestimated.
8463.RF
8464.nr x0 \w'sentences  '
8465.nr x1 \w'  '
8466.nr x2 \w'depends on  '
8467.nr x3 \w'generalized or  '
8468.nr x4 \w'natural speech  '
8469.nr x5 \w'author of segment'
8470.nr x6 \n(x0u+\n(x1u+\n(x2u+\n(x3u+\n(x4u+\n(x5u
8471.nr x7 (\n(.l-\n(x6)/2
8472.in \n(x7u
8473.ta \n(x0u +\n(x1u +\n(x2u +\n(x3u +\n(x4u
8474	|	size of	storage	source of	principal
8475	|	utterance	method	utterance	burden is
8476	|	inventory		inventory	placed on
8477	|\h'-1.0i'\l'\n(x6u\(ul'
8478	|
8479sentences	|	depends on	waveform or	natural speech	recording artist,
8480	|	application	source-filter		storage medium
8481	|		parameters
8482	|
8483words	|	depends on	source-filter	natural speech	recording artist
8484	|	application	parameters		and editor,
8485	|				storage medium
8486	|
8487syllables/	|	\0\0\010000	source-filter	natural speech	recording editor
8488  lisibles	|		parameters
8489	|
8490demi-	|	\0\0\0\01000	source-filter	natural speech	recording editor
8491  syllables	|		parameters	or artificially	or inventory
8492	|			generated	compiler
8493	|
8494phonemes	|	\0\0\0\0\0\040	generalized	artificially	author of segment
8495	|		parameters	generated	concatenation
8496	|				program
8497	|
8498allophones	|	\0\050\-100	generalized or	artificially	coder of
8499	|		source-filter	generated or	synthesized
8500	|		parameters	natural speech	utterances
8501	|\h'-1.0i'\l'\n(x6u\(ul'
8502.in 0
8503.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
8504.FG "Table 7.1  Some issues relevant to choice of basic unit"
8505.pp
8506Table 7.1 summarizes in broad brush-strokes the issues which relate to the
8507choice of basic unit for concatenation.
8508The sections which follow provide more detail about the different
8509methods of joining segments of speech together.
8510Only segmental aspects are considered, for the important problems of
8511prosody will be treated in the next chapter.
8512All of the methods rely to some extent on the acoustic properties of speech,
8513and as smaller basic units are considered the role of speech acoustics
8514becomes more important.
8515It is impossible in a book like this to give a detailed account of acoustic
8516phonetics, for it would take several volumes!
8517What I aim to do in the following pages is to highlight some salient features
8518which are relevant to segment concatenation, without attempting to be
8519complete.
8520.sh "7.1  Word concatenation"
8521.pp
8522For general speech output, word concatenation is an inherently limited
8523technique because of the large number of phonetically different words.
8524Despite this fact, it is at present the most widely-used synthesis
8525method, and is likely to remain so for several years.
8526We have seen that the primary problems are word-to-word
8527coarticulation and prosody; and both can be overcome, at least to a useful
8528approximation, by coding the words in source-filter form.
8529.rh "Time-domain techniques."
8530Nevertheless, a surprising number of applications simply store
8531the time waveform, coded, usually, by one of the techniques described in
8532Chapter 3.
8533From an implementation point of view there are many advantages to this.
8534Speech quality can easily be controlled by selecting a suitable sampling
8535rate and coding scheme.
A natural-sounding voice is guaranteed, male or female as desired.
8537The equipment required is minimal \(em a digital-to-analogue
8538converter and post-sampling filter will do for synthesis if
8539PCM coding is used, and
8540DPCM, ADPCM, and delta modulation decoders are not much more complicated.
8541.pp
8542From a speech point of view, the resulting utterances can never be made
8543convincingly fluent.
8544We discussed the early experiments of Stowe and Hampton (1961)
8545at the beginning of Chapter 3.
8546.[
8547Stowe Hampton 1961
8548.]
8549A major drawback to word concatenation in the
8550analogue domain is the introduction of clicks and other interference
8551between words:  it is difficult to prevent the time waveform transitions
8552from adding extraneous sounds.
8553This poses no problem with digital storage, however, for the waveforms
8554can be edited accurately prior to storage so that they start
8555and finish at an exactly
8556zero level.
8557Rather, the lack of fluency stems from the absence of proper control
8558of coarticulation and prosody.
8559.pp
8560But this is not necessarily a serious drawback if the application is
8561a sufficiently limited one.  Complete, invariant utterances can be
8562stored as one unit.  Often they must contain data-dependent
8563slot-fillers, as in
8564.LB
8565This flight makes \(em stops
8566.LE
8567and
8568.LB
8569Flight number \(em leaves \(em at \(em , arrives in \(em at \(em
8570.LE
8571(taken from the airline reservation system of Chapter 1
8572(Levinson and Shipley, 1980)).
8573.[
8574Levinson Shipley 1980
8575.]
8576Then, each slot-filling word is recorded in an intonation consistent
8577both with its position in the template utterance and with the
8578intonation of that utterance.
8579This could be done by embedding the word in the utterance
8580for recording, and excising it by digital editing before storage.
8581It would be dangerous to try to take into account coarticulation effects,
8582for the coarticulation could not be made consistent with both the
8583several slot-fillers and the single template.
8584This could be overcome if several versions of the template were stored,
8585but then the scheme becomes subject to combinatorial explosion
8586if there is more than one slot in a single utterance.
8587But it is not really necessary, for the lack of fluency will probably
8588be interpreted by a benevolent listener as an attempt to convey the
8589information as clearly as possible.
8590.pp
8591Difficulties will occur if the same slot-filler is used in different
8592contexts.  For instance, the first gap in each of the sentences above
8593contains a number; yet the intonation of that number is different.
8594Many systems simply ignore this problem.
8595Then one does notice anomalies, if one is attentive:  the words come,
8596as it were, from different mouths, without fluency.
8597However, the problem is not necessarily acute.  If it is, two or more
8598versions of each slot-filler can be recorded, one for each context.
8599.pp
8600As an example, consider the synthesis of 7-digit telephone numbers,
8601like 289\-5371.  If one version only of each digit is stored,
8602it should be recorded in a level tone of voice.  A pause should be
8603inserted after the third digit of the synthetic number, to accord
8604with common elocution.  The result will certainly be unnatural, although
8605it should be clear and intelligible.
8606Any pitch errors in the recordings will make certain numbers
8607audibly anomalous.
8608At the other extreme, 70 single digits could be stored, one version of
8609each digit for each position in the number.  The recording will be
8610tedious and error-prone, and the synthetic utterances will still not
8611be fluent \(em for coarticulation is ignored \(em but instead
8612unnaturally clearly enunciated.  A compromise is to record only
8613three versions of each digit, one for any of the
8614five positions
8615.nr x1 \w'\(ul'
8616.nr x2 (8*\n(x1)
8617.nr x3 0.2m
8618\zx\h'\n(x1u'\zx\h'\n(x1u'\h'\n(x1u'\z\-\h'\n(x1u'\zx\h'\n(x1u'\zx\h'\n(x1u'\c
8619\zx\h'\n(x1u'\h'\n(x1u'\v'\n(x3u'\l'-\n(x2u\(ul'\v'-\n(x3u' ,
8620another one for the third position
8621\h'\n(x1u'\h'\n(x1u'\zx\h'\n(x1u'\z\-\h'\n(x1u'\h'\n(x1u'\c
8622\h'\n(x1u'\h'\n(x1u'\h'\n(x1u'\v'\n(x3u'\l'-\n(x2u\(ul'\v'-\n(x3u' ,
8623and the last for the final position
8624\h'\n(x1u'\h'\n(x1u'\h'\n(x1u'\z\-\h'\n(x1u'\h'\n(x1u'\c
8625\h'\n(x1u'\h'\n(x1u'\zx\h'\n(x1u'\v'\n(x3u'\l'-\n(x2u\(ul'\v'-\n(x3u' .
The first version will be in a level voice, the second in an
incomplete, rising tone, and the third in a final, dropping pitch.
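.pp
In programming terms the selection rule is trivial.  Here is an illustrative
Python sketch; the routine that actually plays a stored recording is assumed,
and is merely passed in as "play_recording".
.LB
.nf
import time

def speak_number(digits, play_recording):
    # digits: a 7-character string such as "2895371".
    # Three stored versions of each digit are assumed:  "level",
    # "rising" (third position, before the pause) and "falling" (last).
    for i, d in enumerate(digits):
        if i == len(digits) - 1:
            version = "falling"
        elif i == 2:
            version = "rising"
        else:
            version = "level"
        play_recording(d, version)
        if i == 2:
            time.sleep(0.3)        # pause after the third digit
.fi
.LE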
8628.rh "Joining formant-coded words."
8629The limitations of the time-domain method are lack of
8630fluency caused by unnatural transitions between words, and the
8631combinatorial explosion created by recording slot-fillers several times
8632in different contexts.
8633Both of these problems can be alleviated by storing formant tracks,
8634concatenating them with suitable interpolation, and applying a complete
8635pitch contour suitable for the whole utterance.
8636But one can still not generate conversational speech, for natural speech
8637rhythms cause non-linear warpings of the time axis which cannot reasonably
8638be imitated by this method.
8639.pp
8640Solving problems often creates others.
8641As we saw in Chapter 4, it is not easy to obtain reliable formant tracks
8642automatically.  Yet hand-editing of formant parameters adds a whole new
8643dimension to the problem of vocabulary construction, for it is
8644an exceedingly tiresome and time-consuming task.
8645Even after such tweaking, resynthesized utterances will be degraded
8646considerably from the original, for the source-filter model is by no means
8647a perfect one.
8648A hardware or real-time software formant synthesizer must be added
8649to the system, presenting design problems and creating extra cost.
8650Should a serial or parallel synthesizer be used? \(em the latter offers
8651potentially better speech (especially in nasal sounds), but requires
8652additional parameters, namely formant amplitudes, to be estimated.
8653Finally, as we will see in the next chapter, it is not an easy matter to
8654generate a suitable pitch contour and apply it to the utterance.
8655.pp
8656Strangely enough, the interpolation itself does not present any great
8657difficulty, for there is not enough information in the formant-coded
8658words to make possible sophisticated coarticulation.
8659The need for interpolation is most pressing when one word ends with
8660a voiced sound and the next begins with one.
8661If either the end of the first or the beginning of the second word
8662(or both) is unvoiced, unnatural formant transitions do not matter
8663for they will not be heard.
8664Actually, this is only strictly true for fricative transitions:  if
8665the juncture is aspirated then formants will be perceived in the
8666aspiration.  However,
8667.ul
8668h
8669is the only fully aspirated sound in English,
8670and it is relatively uncommon.
8671It is not absolutely necessary to interpolate the fricative filter resonance,
8672because smooth transitions from one fricative sound to another are rare
8673in natural speech.
8674.pp
8675Hence unless both sides of the junction are voiced, no interpolation
8676is needed:  simple abuttal of the stored parameter tracks will do.
8677Note that this is
8678.ul
8679not
8680the same as joining time waveforms, for the synthesizer
8681will automatically ensure a relatively smooth transition from one
8682segment to another because of energy storage in the filters.
8683A new set of resonance parameters for the formant-coded words will be stored
8684every 10 or 20 msec (see Chapter 5), and so the transition will automatically
8685be smoothed over this time period.
8686.pp
8687For voiced-to-voiced transitions, some interpolation is needed.
8688An overlap period of duration, say, 50\ msec, is established, and
8689the resonance parameters in the final 50\ msec of the first word are
8690averaged with those in the first 50\ msec of the second.
8691The average is weighted, with the first word's formants dominating
8692at the beginning and their effect progressively dying out
8693in favour of the second word.
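.pp
A minimal sketch of this cross-fade in Python, assuming each word is held as
a list of per-frame parameter vectors (at one frame every 10\ msec, a
50\ msec overlap is five frames):
.LB
.nf
def join_tracks(word1, word2, overlap=5):
    # Average the last 'overlap' frames of word1 with the first
    # 'overlap' frames of word2, the weight moving linearly from
    # the first word to the second across the overlap region.
    blend = []
    for i in range(overlap):
        w = (i + 1.0) / (overlap + 1.0)      # weight of the second word
        f1, f2 = word1[-overlap + i], word2[i]
        blend.append([(1.0 - w) * a + w * b for a, b in zip(f1, f2)])
    return word1[:-overlap] + blend + word2[overlap:]
.fi
.LE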
8694.pp
8695More sophisticated than a simple average is to weight the components
8696according to how rapidly they are changing.
8697If the spectral change in one word is much greater than that in the
8698other, we might expect that this will dominate the transition.
A simple measure of spectral derivative at any given time can be found
by summing the magnitudes of the changes in each formant frequency
from one sample to the next.
8702The spectral change in the transition region can be obtained by summing
8703the spectral derivatives at each sample in the region.
8704Such a measure can perhaps be made more accurate by taking into
8705account the relative importance of the formants, but will probably
8706never be more than a rough and ready yardstick.
8707At any rate, it can be used to load the average in favour of the
8708dominant side of the junction.
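.pp
Such a weighting might be sketched as follows (again purely illustrative,
using the same frame-list representation as above); the result could be used
to bias the cross-fade towards the side of the junction that is changing faster.
.LB
.nf
def spectral_change(frames):
    # Sum, over the region, of the frame-to-frame changes in each formant.
    total = 0.0
    for prev, cur in zip(frames[:-1], frames[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, cur))
    return total

def junction_weight(word1, word2, overlap=5):
    # Fraction of the transition credited to the second word.
    c1 = spectral_change(word1[-overlap:])
    c2 = spectral_change(word2[:overlap])
    return c2 / (c1 + c2) if (c1 + c2) > 0 else 0.5
.fi
.LE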
8709.pp
8710Much more important for naturalness of the speech are the effects
8711of rhythm and intonation, discussed in the next chapter.
8712.pp
8713Such a scheme has been implemented and tested on \(em guess what! \(em 7-digit
8714telephone numbers (Rabiner
8715.ul
8716et al,
87171971).
8718.[
8719Rabiner Schafer Flanagan 1971
8720.]
8721Significant improvement (at the 5% level of statistical
8722significance) in people's
8723ability to recall numbers was found for this method over direct
8724abuttal of either natural or synthetic versions of the digits.
8725Although the method seemed, on balance, to produce utterances that were
8726recalled less accurately than completely natural spoken
8727telephone numbers, the difference was not significant (at the 5% level).
8728The system was also used to generate wiring instructions by computer
8729directly from the connection list, as described in Chapter 1.
8730As noted there, synthetic speech was actually preferred to natural speech
8731in the noisy environment of the production line.
8732.rh "Joining linear predictive coded words."
8733Because obtaining accurate formant tracks for natural utterances
8734by Fourier transform methods is difficult, it is worth considering
8735the use of linear prediction as the source-filter model.
8736Actually, formant resonances can be extracted from linear predictive
8737coefficients quite easily, but there is no need to do this because
8738the reflection coefficients themselves are quite suitable
8739for interpolation.
8740.pp
8741A slightly different interpolation scheme from that described in the
8742previous section has been reported (Olive, 1975).
8743.[
8744Olive 1975
8745.]
8746The reflection coefficients were spliced during an overlap region of
8747only 20\ msec.
8748More interestingly, attempts were made to suppress the plosive bursts
8749of stop sounds in cases where they were followed by another stop at
8750the beginning of the next word.
8751This is a common coarticulation, occurring, for instance, in the phrase
8752"stop burst".  In running speech, the plosion on the
8753.ul
8754p
8755of "stop" is
8756normally suppressed because it is followed by another stop.
8757This is a particularly striking case because the place of articulation
8758of the two stops
8759.ul
8760p
8761and
8762.ul
8763b
8764is the same:  complete suppression is not as likely
8765to happen in "stop gap", for example (although it may occur).
8766Here is an instance of how extra information could improve the
8767quality of the synthetic transitions considerably.
8768However, automatically identifying the place of articulation of stops is
8769a difficult job, of a complexity far above what is appropriate for
8770simply joining words stored in source-filter form.
8771.pp
8772Another innovation was introduced into the transition between two
8773vowel sounds, when the second word began with an accented syllable.
8774A glottal stop was placed at the juncture.
8775Although the glottal stop was not described in Chapter 2, it is a sound
8776used in many dialects of English.  It frequently occurs
8777in the utterance "uh-uh", meaning "no".  Here it
8778.ul
8779is
8780used to separate two vowel sounds, but in fact this is not particularly
8781common in most dialects.
8782One could say "the apple", "the orange", "the onion" with a neutral vowel
8783in "the" (to rhyme with "\c
8784.ul
8785a\c
8786bove") and a glottal stop as separator,
8787but it is much more usual to rhyme "the" with "he" and introduce a
8788.ul
8789y
8790between the words.
8791Similarly, even speakers who do not normally pronounce an
8792.ul
8793r
8794at the
8795end of words will introduce one in "bigger apple", rather than
8796using a glottal stop.
8797Note that it would be wrong to put an
8798.ul
8799r
8800in "the apple", even
8801for speakers who usually terminate "the" and "bigger" with the same sound.
8802Such effects occur at a high level of processing, and are practically
8803impossible to simulate with word-interpolation rules.
8804Hence the expedient of introducing a glottal stop is a good one, although
8805it is certainly unnatural.
8806.sh "7.2  Concatenating whole or partial syllables"
8807.pp
8808The use of segments larger than a single phoneme or allophone but smaller
8809than a word as the basic unit for speech synthesis has an interesting
8810history.
8811It has long been realized that transitions between phonemes are
8812extremely sensitive and critical components of speech, and thus are
8813essential for successful synthesis.
8814Consider the unvoiced stop sounds
8815.ul
8816p, t,
8817and
8818.ul
8819k.
8820Their central portion is actually silence!  (Try saying a word like
8821"butter" with a very long
8822.ul
8823t.\c
8824)  Hence
8825in this case it is
8826.ul
8827only
8828the transitional information which can distinguish these sounds from
8829each other.
8830.pp
8831Sound segments which comprise the transition from the centre of one phoneme
8832to the centre of the next are called
8833.ul
8834dyads
8835or
8836.ul
8837diphones.
8838The possibility of using them as the basic units for concatenation
8839was first mooted in the mid 1950's.
8840The idea is attractive because there is relatively little spectral
8841movement in the central, so-called "steady-state", portion of many
8842phonemes \(em in the extreme case of unvoiced stops there is not only
8843no spectral movement, but no spectrum at all in the steady state!
8844At that time the resonance synthesizer was in its infancy, and
8845so recorded segments of live speech were used.  The early experiments
8846met with little success because of the technical difficulties
8847of joining analogue waveforms and inevitable discrepancies between
8848the steady-state parts of a phoneme recorded in different contexts \(em not
8849to mention the problems of coarticulation and prosody which effectively
8850preclude the use of waveform concatenation at such a low level.
8851.pp
8852In the mid 1960's, with the growing use of resonance synthesizers,
8853it became possible to generate diphones by copying resonance parameters
8854manually from a spectrogram, and improving the result by trial and error.
8855It was not feasible to extract formant frequencies automatically from real
8856speech, though, because the fast Fourier transform was not yet widely
8857known and the computational burden of slow Fourier transformation was
8858prohibitive.
8859For example, a project at IBM stored manually-derived parameter tracks
8860for diphones, identified by pairs of phoneme names (Dixon and Maxey, 1968).
8861.[
8862Dixon Maxey 1968
8863.]
8864To generate a synthetic utterance it was coded in
8865phonetic form and used to access
8866the diphone table to give a set of parameter tracks for the complete
8867utterance.  Note that this is the first system we have encountered
8868whose input is a phonetic transcription which relates to an inventory
8869of truly synthetic character:  all previous schemes used recordings of
8870live speech, albeit processed in some form.
8871Since the inventory was synthetic, there was no difficulty in ensuring
8872that discontinuities did not arise between segments beginning and ending with
8873the same phoneme.  Thus interpolation was irrelevant, and the synthesis
8874procedure concentrated on prosodic questions.  The resulting speech
8875was reported to be quite impressive.
8876.pp
8877Strictly speaking, diphones are not demisyllables but phoneme pairs.
8878In the simplest case they happen to be similar, for two primary diphones
8879characterize a consonant-vowel-consonant syllable.
8880There is an advantage to using demisyllables rather than diphones as the basic
8881unit, for many syllables begin or end with complicated consonant clusters
8882which are not easy to produce convincingly by diphone
8883concatenation.
8884But they are not easy to produce by hand-editing resonance parameters
8885either!
8886Now that speech analysis methods have been developed and refined,
8887resonance parameters or linear predictive coefficients
8888can be extracted automatically
8889from natural utterances, and there has been a resurgence of interest in
8890syllabic and demisyllabic synthesis methods.  The wheel has turned
8891full circle, from segments of natural speech to hand-tailored parameters
8892and back again!
8893.pp
8894The advantage of storing demisyllables over syllables (or lisibles) from
8895the point of view of storage capacity has already been pointed out
8896(perhaps 1,000\-2,000 demisyllables as opposed to 4,000\-10,000 syllables).
8897But it is probably not too significant with the continuing decline
8898of storage costs.
8899The requirements are of the order of 25\ Kbyte versus 0.5\ Mbyte
8900for 1200\ bit/s linear predictive coding, and the latter could
almost be accommodated today \(em 1981 \(em on a state-of-the-art
8902read-only memory chip.
8903A bigger advantage comes from rhythmic considerations.
8904As we will see in the next chapter, the rhythms of fluent speech cause
8905dramatic variations in syllable duration, but these seem to affect
8906the vowel and closing consonant cluster much more than the initial consonant
8907cluster.  Thus if a demisyllable is deemed to begin shortly (say 60\ msec)
8908after onset of the vowel, when the formant structure has settled down,
8909the bulk of the vowel and the closing consonant cluster will form a
8910single demisyllable.  The opening cluster of the next syllable will lie
8911in the next demisyllable.  Then differential lengthening can be applied
8912to that part of the syllable which tends to be stretched in live speech.
8913.pp
8914One system for demisyllable concatenation has produced excellent results
8915for monosyllabic English words (Lovins and Fujimura, 1976).
8916.[
8917Lovins Fujimura 1976
8918.]
8919Complex word-final consonant clusters are excluded from the inventory by
8920using syllable affixes
8921.ul
8922s, z, t,
8923and
8924.ul
8925d;
8926these are attached to the
8927syllabic core as a separate exercise (Macchi and Nigro, 1977).
8928.[
8929Macchi Nigro 1977
8930.]
8931Prosodic rather than segmental considerations are likely to prove the major
8932limiting factor when this scheme is extended to running speech.
8933.pp
8934Monosyllabic words spoken in isolation are coded as linear predictive
8935reflection coefficients, and segmented by digital editing into the initial
8936consonant cluster and the vocalic nucleus plus final cluster.
8937The cut is made 60\ msec into the vowel, as suggested above.
8938This minimizes the difficulty of interpolation when concatenating
8939segments, for there is ample voicing on either side of the juncture.
8940The reflection coefficients should not differ radically because the
8941vowel is the same in each demisyllable.
8942A 40\ msec overlap is used, with the usual linear interpolation.
8943An alternative smoothing rule applies when the second segment has
8944a nasal or glide after the vowel.  In this case anticipatory coarticulation
8945occurs, affecting even the early part of the vowel.  For example, a vowel
8946is frequently nasalized when followed by a nasal sound \(em even in English
8947where nasalization is not a distinctive feature in vowels (see Chapter 2).
8948Under these circumstances the overlap area is moved forward in time so
8949that the colouration applies throughout almost the whole vowel.
8950.sh "7.3  Phoneme synthesis"
8951.pp
8952Acoustic phonetics is the study of how the acoustic
8953signal relates to the phonetic sequence which was spoken or heard.
8954People \(em especially engineers \(em often ask, how could phonetics not
8955be acoustic?  In fact it can be articulatory, auditory, or linguistic
8956(phonological), for example, and we have touched on the first and last
8957in Chapter 2.
8958The invention of the sound spectrograph in the late 1940's was an
8959event of colossal significance for acoustic phonetics, for it somehow
8960seemed to make the intricacies of speech visible.
(This was thought to be a greater advance than it actually turned
out to be:  historically-minded readers should refer to Potter
8963.ul
8964et al,
89651947,
8966for an enthusiastic contemporary appraisal of the invention.)  A
8967.[
8968Potter Kopp Green 1947
8969.]
8970result of several years of research at Haskins Laboratories in New York
8971during the 1950's was a set of "minimal rules for synthesizing speech",
8972which showed how stylized formant patterns could generate cues for
8973identifying vowels and, particularly, consonants
8974(Liberman, 1957; Liberman
8975.ul
8976et al,
89771959).
8978.[
8979Liberman 1957 Some results of research on speech perception
8980.]
8981.[
8982Liberman Ingemann Lisker Delattre Cooper 1959
8983.]
8984.pp
8985These were to form the basis of many speech synthesis-by-rule computer
8986programs in the ensuing decades.  Such programs take as input a
8987phonetic transcription of the utterance and generate a spoken version
8988of it.  The transcription may be broad or narrow, depending on the
8989system.  Experience has shown that the Haskins rules really are
8990minimal, and the success of a synthesis-by-rule program depends on
a vast collection of minutiae, each seemingly insignificant in isolation
8992but whose effects combine to influence the speech quality dramatically.
8993The best current systems produce clearly understandable
8994speech which is nevertheless something of a strain to listen to for
8995long periods.
8996However, many are not good; and some are execrable.
8997In recent times commercial influences have unfortunately restricted
8998the free exchange of results and programs between academic researchers,
8999thus slowing down progress.
9000Research attention has turned to prosodic factors,
9001which are certainly less well understood than segmental ones, and
9002to synthesis from plain English text rather than from phonetic transcriptions.
9003.pp
9004The remainder of this chapter describes the techniques of segmental
9005synthesis.  First it is necessary to introduce some
9006elements of acoustic phonetics.
9007It may be worth re-reading Chapter 2 at this point, to refresh
9008your memory about the classification of speech sounds.
9009.sh "7.4  Acoustic characterization of phonemes"
9010.pp
9011Shortly after the invention of the sound spectrograph an inverse
9012instrument was developed, called the "pattern playback" synthesizer.
9013This took as input a spectrogram, either in its original form or
9014painted by hand.
An optical arrangement was used to modulate the amplitude of some
9016fifty harmonically-related oscillators by the lightness or darkness
9017of each point on the frequency axis of the spectrogram.
9018As it was drawn past the playing head, sound was produced which
9019had approximately the frequency components shown on the spectrogram,
9020although the fundamental frequency was constant.
9021.pp
9022This device allowed the complicated
9023acoustic effects seen on a spectrogram (see for example Figures 2.3 and 2.4)
9024to be replayed in either original or simplified form.
9025Hence the features which are important for perception of the different sounds
9026could be isolated.  The procedure was to copy from an actual spectrogram
9027the features which were most prominent visually, and then to make further
9028changes by trial and error until the result was judged to have
9029reasonable intelligibility when replayed.
9030.pp
9031For the purpose of acoustic characterization of particular phonemes,
9032it is useful to consider the central, steady-state part separately from
9033transitions into and out of the segment.
9034The steady-state part is that sound which is heard when the phoneme
9035is prolonged.  The term "phoneme" is being used in a rather loose sense
9036here:  it is more appropriate to think of a "sound segment" rather than
9037the abstract unit which forms the basis of phonological classification,
9038and this is the terminology I will adopt.
9039.pp
9040The essential auditory characteristics of some sound segments are inherent in
9041their steady states.
9042If a vowel, for example, is spoken and prolonged, it can readily be
9043identified by listening to any part of the utterance.
9044This is not true for diphthongs:  if you say "I" very slowly and freeze
9045your vocal tract posture at any time, the resulting steady-state sound
9046will not be sufficient to identify the diphthong.  Rather, it will be
9047a vowel somewhere between
9048.ul
9049aa
9050(in "had") or
9051.ul
9052ar
9053(in "hard") and
9054.ul
9055ee
9056(in "heed").
9057Neither is it true for glides, for prolonging
9058.ul
9059w
9060(in "want") or
9061.ul
9062y
9063(in "you") results in vowels resembling respectively
9064.ul
9065u
9066("hood") or
9067.ul
9068ee
9069("heed").
9070Fricatives, voiced or unvoiced, can be identified from the steady state;
but stops cannot, for theirs is silent (or \(em in the case
9072of voiced stops \(em something close to it).
9073.pp
9074Segments which are identifiable from their steady state are easy to synthesize.
9075The difficulty lies with the others, for it must be the transitions which
9076carry the information.  Thus "transitions" are an essential part of speech,
9077and perhaps the term is unfortunate for it calls to mind an unimportant
9078bridge between one segment and the next.
9079It is tempting to use the words "continuant" and "non-continuant" to distinguish
9080the two categories; unfortunately they are used by phoneticians in a different
9081sense.
9082We will call them "steady-state" and "transient" segments.  The latter term
9083is not particularly appropriate, for even sounds in this class
9084.ul
9085can
9086be prolonged:  the point is that the identifying information is in the
9087transitions rather than the steady state.
9088.RF
9089.nr x1 (\w'excitation'/2)
9090.nr x2 (\w'formant resonance'/2)
9091.nr x3 (\w'fricative'/2)
9092.nr x4 (\w'frequencies (Hz)'/2)
9093.nr x5 (\w'resonance (Hz)'/2)
9094.nr x0 4n+1.7i+0.8i+0.6i+0.6i+1.0i+\w'00'+\n(x5
9095.nr x6 (\n(.l-\n(x0)/2
9096.in \n(x6u
9097.ta 4n +1.7i +0.8i +0.6i +0.6i +1.0i
9098		\h'-\n(x1u'excitation		\0\0\h'-\n(x2u'formant resonance	\0\0\h'-\n(x3u'fricative
9099				\0\0\h'-\n(x4u'frequencies (Hz)	\0\0\c
9100\h'-\n(x5u'resonance (Hz)
9101\l'\n(x0u\(ul'
9102.sp
9103.nr x1 (\w'voicing'/2)
9104\fIuh\fR	(the)	\h'-\n(x1u'voicing	\0500	1500	2500
9105\fIa\fR	(bud)	\h'-\n(x1u'voicing	\0700	1250	2550
9106\fIe\fR	(head)	\h'-\n(x1u'voicing	\0550	1950	2650
9107\fIi\fR	(hid)	\h'-\n(x1u'voicing	\0350	2100	2700
9108\fIo\fR	(hod)	\h'-\n(x1u'voicing	\0600	\0900	2600
9109\fIu\fR	(hood)	\h'-\n(x1u'voicing	\0400	\0950	2450
9110\fIaa\fR	(had)	\h'-\n(x1u'voicing	\0750	1750	2600
9111\fIee\fR	(heed)	\h'-\n(x1u'voicing	\0300	2250	3100
9112\fIer\fR	(heard)	\h'-\n(x1u'voicing	\0600	1400	2450
9113\fIar\fR	(hard)	\h'-\n(x1u'voicing	\0700	1100	2550
9114\fIaw\fR	(hoard)	\h'-\n(x1u'voicing	\0450	\0750	2650
9115\fIuu\fR	(food)	\h'-\n(x1u'voicing	\0300	\0950	2300
9116.nr x1 (\w'aspiration'/2)
9117\fIh\fR	(he)	\h'-\n(x1u'aspiration
9118.nr x1 (\w'frication'/2)
9119.nr x2 (\w'frication and voicing'/2)
9120\fIs\fR	(sin)	\h'-\n(x1u'frication				6000
9121\fIz\fR	(zed)	\h'-\n(x2u'frication and voicing				6000
9122\fIsh\fR	(shin)	\h'-\n(x1u'frication				2300
9123\fIzh\fR	(vision)	\h'-\n(x2u'frication and voicing				2300
9124\fIf\fR	(fin)	\h'-\n(x1u'frication				4000
9125\fIv\fR	(vat)	\h'-\n(x2u'frication and voicing				4000
9126\fIth\fR	(thin)	\h'-\n(x1u'frication				5000
9127\fIdh\fR	(that)	\h'-\n(x2u'frication and voicing				5000
9128\l'\n(x0u\(ul'
9129.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
9130.in 0
9131.FG "Table 7.2  Resonance synthesizer parameters for steady-state sounds"
9132.rh "Steady-state segments."
9133Table 7.2 shows appropriate values for the resonance parameters and
9134excitation sources of a resonance synthesizer, for steady-state
9135segments only.
9136There are several points to note about it.
9137Firstly, all the frequencies involved obviously depend upon the
9138speaker \(em the size of his vocal tract, his accent and speaking habits.
9139The values given are nominal ones for a male speaker with a dialect of
9140British English called "received pronunciation" (RP) \(em for it is what
9141used to be "received" on the wireless in the old days
9142before the British Broadcasting Corporation
9143adopted a policy of more informal, more regional, speech.
9144Female speakers have formant frequencies approximately 15% higher
9145than male ones.
9146Secondly, the third formant is relatively unimportant for vowel
9147identification; it is
9148the first and second that give the vowels their character.
9149Thirdly, formant values for
9150.ul
9151h
9152are not given, for they would be meaningless.
9153Although it is certainly a steady-state sound,
9154.ul
9155h
9156changes radically
9157in context.  If you say "had", "heed", "hud", and so on, and freeze
9158your vocal tract posture on the initial
9159.ul
9160h,
9161you will find it
9162already configured for the following vowel \(em an excellent
9163example of anticipatory coarticulation.
9164Fourthly, amplitude values do play some part in identification,
9165particularly for fricatives.
9166.ul
9167th
9168is the weakest sound, closely followed by
9169.ul
9170f,
9171with
9172.ul
9173s
9174and
9175.ul
9176sh
9177the
9178strongest.  It is necessary to get a reasonable mix of excitation in
9179the voiced fricatives; the voicing amplitude is considerably less than
9180in vowels.  Finally, there are other sounds that might be considered
9181steady state ones.  You can probably identify
9182.ul
9183m, n,
9184and
9185.ul
9186ng
9187just by
9188their steady states.  However, the difference is not particularly
9189strong; it is the transitional parts which discriminate most effectively
9190between these sounds.  The steady state of
9191.ul
9192r
9193is quite distinctive, too,
9194for most speakers, because the top of the tongue is curled back in a
9195so-called "retroflex" action and this causes a radical change in the
9196third formant resonance.
9197.rh "Transient segments."
9198Transient sounds include diphthongs, glides,
9199nasals, voiced and unvoiced stops, and affricates.
9200The first two are relatively easy to characterize, for they are
9201basically continuous, gradual transitions from one vocal tract posture
9202to another \(em sort of dynamic vowels.  Diphthongs and glides are
9203similar to each other.  In fact "you" could be transcribed as
9204a triphthong,
9205.ul
9206i e uu,
9207except that in the initial posture the tongue
9208is even higher, and the vocal tract correspondingly more constricted,
9209than in
9210.ul
9211i
9212("hid") \(em though not as constricted as in
9213.ul
9214sh.
9215Both categories can be represented in terms of target formant
9216values, on the understanding that these are not to be
9217interpreted as steady state configurations but strictly as
9218extreme values at the beginning or end of the formant motion (for
9219transitions out of and into the segment, respectively).
9220.pp
9221Nasals have a steady-state portion comprising a strong nasal formant
9222at a fairly low frequency, on account of the large size of the
9223combined nasal and oral cavity which is resonating.
9224Higher formants are relatively weak, because of attenuation effects.
9225Transitions into and out of nasals are strongly nasalized,
9226as indeed are adjacent vocalic segments, with
9227the oral and nasal tract operating in parallel.  As discussed in
9228Chapter 5, this cannot be simulated on a series synthesizer.
9229However, extremely fast motions of the formants occur on account of
9230the binary switching action of the velum, and it turns out that
9231fast formant transitions are sufficient to simulate nasals because
9232the speech perception mechanism is accustomed to hearing them only
9233in that context!  Contrast this with the extremely slow transitions
9234in diphthongs and glides.
9235.pp
9236Stops form the most interesting category, and research using the pattern
9237playback synthesizer was instrumental in providing adequate acoustic
9238characterizations for them.  Consider unvoiced stops.
9239They each have three phases:  transition in, silent central portion,
9240and transition out.  There is a lot of action on the transition out
9241(and many phoneticians would divide this part alone into several "phases").
9242First, as the release occurs, there is a small burst of fricative noise.
9243Say "t\ t\ t\ ..." as in "tut-tut", without producing any voicing.
9244Actually, when used as an admonishment this is accompanied by
9245an ingressive, inhaling air-stream instead of the normal egressive,
9246exhaling one used in English speech (although some languages
9247do have ingressive sounds).
9248In any case, a short fricative somewhat resembling a tiny
9249.ul
9250s
9251can be heard as the tongue leaves the roof of the mouth.
9252Frication is produced when the gap is very narrow, and ceases
9253rapidly as it becomes wider.
9254Next, when an unvoiced stop is released, a significant amount of aspiration
9255follows the release.
9256Say "pot", "tot", "cot" with force and you will hear the
9257.ul
9258h\c
9259-like
9260aspiration quite clearly.
9261It doesn't always occur, though; for example you will hear little
9262aspiration when a fricative like
9263.ul
9264s
9265precedes the stop in the
9266same syllable, as in "spot", "scot".  The aspiration is a distinguishing
9267feature between "white spot" and the rather unlikely "White's pot".
9268It tends to increase as the emphasis on the syllable increases,
and this is an example of a prosodic feature influencing segmental
9270characteristics.  Finally, at the end of the segment,
9271the aspiration \(em if any \(em will turn to voicing.
9272.pp
9273What has been described applies to
9274.ul
9275all
9276unvoiced stops.
9277What distinguishes one from another?
9278The tiny fricative burst will be different because the noise is produced
9279at different places in the vocal tract \(em at the lips for
9280.ul
9281p,
9282tongue and front of palate for
9283.ul
9284t,
9285and tongue and back of palate for
9286.ul
9287k.
9288The most important difference, however, is the formant motion illuminated
9289by the last vestiges of voicing at closure and by both aspiration and the
9290onset of voicing at opening.
9291Each stop has target formant values which, although
9292they cannot be heard during the stopped portion (for there is no
9293sound there), do affect the transitions in and out.
9294An added complexity is that the target positions themselves vary to some
9295extent depending on the adjacent segments.
9296If the stop is heavily aspirated, the vocal posture will have almost
9297attained that for the following vowel before voicing begins, but
9298the formant transitions will be perceived because they affect
9299the sound quality of aspiration.
9300.pp
9301The voiced stops
9302.ul
9303b, d,
9304and
9305.ul
9306g
9307are quite similar to their unvoiced analogues
9308.ul
9309p, t,
9310and
9311.ul
9312k.
9313What distinguishes them from each other are the formant transitions to
9314target positions, heard during closure and opening.
9315They are distinguished from their unvoiced counterparts by the fact
9316that more voicing is present:  it lingers on longer at closure
9317and begins earlier on opening.  Thus little or no aspiration appears
9318during the opening phase.  If an unvoiced stop is uttered in a context
9319where aspiration is suppressed, as in "spot", it is almost identical to the
9320corresponding voiced stop, "sbot".  Luckily no words in English require
9321us to make a distinction in such contexts.
9322Voicing sometimes pervades the entire stopped portion of a voiced stop,
9323especially when it is surrounded by other voiced segments.
9324When saying a word like "baby" slowly you can choose whether or not to
9325prolong voicing throughout the second
9326.ul
9327b.
9328If you do, creating what is
9329called a "voice bar" in spectrograms,
9330the sound escapes through the cheeks, for
9331the lips are closed \(em try doing it for a very long time and your cheeks
9332will fill up with air!
9333This severely attenuates high-frequency components, and can
9334be simulated with a weak first formant at a low resonant frequency.
9335.RF
9336.nr x0 \w'unvoiced stops:    'u
9337.nr x1 4n
9338.nr x2 \n(x0+\n(x1+\w'aspiration burst (context- and emphasis-dependent)'u
9339.nr x3 (\n(.l-\n(x2)/2
9340.in \n(x3u
9341.ta \n(x0u +\n(x1u
9342unvoiced stops:	closure (early cessation of voicing)
9343	silent steady state
9344	opening, comprising
9345		short fricative burst
9346		aspiration burst (context- and emphasis-dependent)
9347		onset of voicing
9348.sp
9349voiced stops:	closure (late cessation of voicing)
9350	steady state (possibility of voice bar)
9351	opening, comprising
9352		pre-voicing
9353		short fricative burst
9354.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
9355.in 0
9356.FG "Table 7.3  Acoustic phases of stop consonants"
9357.pp
9358Table 7.3 summarizes some of the acoustic phases of voiced and unvoiced
9359stops.  There are many variations that have not been mentioned.
9360Nasal plosion ("good news") occurs (at the word boundary, in this case)
9361when the nasal formant pervades the
9362opening phase.  Stop bursts are suppressed when the next sound is a stop
9363too (the burst on the
9364.ul
9365p
9366of "apt", for example).
9367It is difficult to distinguish a voiced stop from an unvoiced one
9368at the end of a word ("cab" and "cap"); if the speaker is trying to
9369make himself particularly clear he will put a short neutral vowel
9370after the voiced stop to emphasize its early onset of voicing.
9371(If he is Italian he will probably do this anyway, for it is the norm
9372in his own language.)
9373.pp
9374Finally, we turn to affricates, of which there are only two
9375in English:
9376.ul
9377ch
9378("chin") and
9379.ul
9380j
9381("djinn").
9382They are very similar to the stops
9383.ul
9384t
9385and
9386.ul
9387d
9388followed by the fricatives
9389.ul
9390sh
9391and
9392.ul
9393zh
9394respectively, and their acoustic characterization is similar to that
9395of the phoneme pair.
9396.ul
9397ch
9398has a closing phase, a stopped phase, and a long fricative burst.
9399There is no aspiration,
9400for the vocal cords are not involved.
9401.ul
9402j
9403is the same except that voicing extends further into the stopped
9404portion, and the terminating fricative is also voiced.
9405It may be pronounced with a voice bar if the preceding segment is voiced
9406("adjunct").
9407.sh "7.5  Speech synthesis by rule"
9408.pp
9409Generation of speech by rules acting upon a phonetic transcription
9410was first investigated in the early 1960's (Kelly and Gerstman, 1961).
9411.[
9412Kelly Gerstman 1961
9413.]
9414Most systems employ a hardware resonance synthesizer, analogue or digital,
9415series or parallel,
9416to reduce the load on the computer which operates the rules.
9417The speech-by-rule program, rather than the
9418synthesizer, inevitably contributes by far the greater part of the
9419degradation in the resulting speech.
9420Although parallel synthesizers offer greater potential control over
9421the spectrum, it is not clear to what extent a synthesis program can take
9422advantage of this.  Parameter tracks for a series synthesizer can
9423easily be converted into linear predictive coefficients, and systems
9424which use a linear predictive synthesizer will probably become popular
9425in the near future.
9426.pp
9427The phrase "synthesis by rule", which is in common use, does not
9428make it clear just what sort of features the rules are supposed to
accommodate, and what information must be included explicitly in the
9430input transcription.
9431Early systems made no attempt to simulate prosodics.
9432Pitch and rhythm could be controlled, but only by inserting
9433pitch specifiers and duration markers in the input.
9434Some kind of prosodic control was often incorporated later,
9435but usually as a completely separate phase from segmental synthesis.
9436This does not allow interaction effects (such as the extra
9437aspiration for voiceless stops in accented syllables) to be taken
9438into account easily.
9439Even systems which perform prosodic operations invariably need to have
9440prosodic specifications embedded explicitly in the input.
9441.pp
9442Generating parameter tracks for a synthesizer from a phonetic transcription
9443is a process of data
9444.ul
9445expansion.
9446Six bits are ample to specify a phoneme, and a speaking rate of 12 phonemes/sec
9447leads to an input data rate of 72 bit/s.
9448The data rate required to control the synthesizer will depend upon the number
9449of parameters and the rate at which they are sampled,
9450but a typical figure is 6 Kbit/s (Chapter 5).
9451Hence there is something like a hundredfold data expansion.
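.pp
The arithmetic can be checked directly.  The figures in the sketch below
simply restate the nominal rates quoted above; they are not measurements.
.LB
.nf
# Nominal data rates for synthesis by rule (figures from the text).
phoneme_bits = 6                # ample to specify a phoneme
phonemes_per_second = 12
input_rate = phoneme_bits * phonemes_per_second       # 72 bit/s

synthesizer_rate = 6000         # typical control rate, bit/s (Chapter 5)
expansion = synthesizer_rate / input_rate
print(input_rate, synthesizer_rate, round(expansion))  # 72 6000 83
.fi
.LE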
9452.pp
9453Figure 7.1 shows the parameter tracks for a series synthesizer's rendering
9454of the utterance
9455.ul
9456s i k s.
9457.FC "Figure 7.1"
9458There are eight parameters.
9459You can see the onset of frication at the beginning and end (parameter 5),
9460and the amplitude of voicing (parameter 1) come on for the
9461.ul
9462i
9463and off again before the
9464.ul
9465k.
9466The pitch (parameter 0) is falling slowly throughout the utterance.
9467These tracks are stylized:  they come from a computer synthesis-by-rule
9468program and not from a human utterance.
9469With a parameter update rate of 10 msec, the graphs can be represented
9470by 90 sets of eight parameter values, a total of 720 values or 4320 bits
9471if a 6-bit representation is used for each value.
9472Contrast this with the input of only four phoneme segments, or say 24 bits.
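.pp
A minimal sketch of how such a stylized track might be stored is given
below.  The layout (a list of frames, one every 10 msec, each holding eight
parameter values) follows the figures just quoted; the names are
illustrative and not those of any particular program.
.LB
.nf
# 90 frames of eight 6-bit parameter values: 4320 bits in all,
# against about 24 bits of phonetic input (illustrative sketch).
FRAME_PERIOD_MS = 10
N_PARAMETERS = 8                # parameter 0 is pitch, 1 is voicing, ...

def track_bits(n_frames, bits_per_value=6):
    return n_frames * N_PARAMETERS * bits_per_value

utterance = [[0] * N_PARAMETERS for _ in range(90)]   # "s i k s", 0.9 sec
print(track_bits(len(utterance)))                     # 4320
.fi
.LE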
9473.rh "A segment-by-segment system."
9474A seminal paper appearing in 1964 was the first comprehensive
9475description of a computer-based synthesis-by-rule system
9476(Holmes
9477.ul
9478et al,
94791964).
9480.[
9481Holmes Mattingly Shearme 1964
9482.]
9483The same system is still in use and has been reimplemented in a more
9484portable form (Wright, 1976).
9485.[
9486Wright 1976
9487.]
9488The inventory of sound segments
9489includes the phonemes listed in Table 2.1, as well as diphthongs and
9490a second allophone of
9491.ul
9492l.
9493(Many British speakers use quite a different vocal posture for
9494pre- and post-vocalic
9495.ul
9496l\c
9497\&'s, called clear and dark
9498.ul
9499l\c
9500\&'s
9501respectively.)  Some phonemes are expanded into sub-phonemic
9502"phases" by the program.  Stops have three phases, corresponding to
9503the closure, silent steady state, and opening.
9504Diphthongs have two phases.  We will call individual phases and
9505single-phase phonemes "segments", for they are subject to exactly
9506the same transition rules.
9507.pp
9508Parameter tracks are constructed out of linear pieces.
9509Consider a pair of adjacent segments in an utterance to be synthesized.
9510Each one has a steady-state portion and an internal transition.
9511The internal transition of one phoneme is dubbed "external"
9512as far as the other is concerned.
9513This is important because instead of each segment being responsible
9514for its own internal transition, one of the pair is identified
9515as "dominant" and it controls the duration of both transitions \(em its
9516internal one and its external (the other's internal) one.
9517For example, in Figure 7.2 the segment
9518.ul
9519sh
9520dominates
9521.ul
9522ee
9523and so it
9524governs the duration of both transitions shown.
9525.FC "Figure 7.2"
9526Note that each
9527segment contributes as many as three linear pieces to the parameter track.
9528.pp
9529The notion of domination is similar to that discussed earlier for
9530word concatenation.
9531The difference is that for word concatenation the dominant segment was
9532determined by computing the spectral derivative over the transition
9533region, whereas for synthesis-by-rule
9534segments are ranked according to a static precedence,
9535and the higher-ranking segment dominates.
9536Segments of stop consonants have the highest rank (and also
9537the greatest spectral derivative), while fricatives, nasals, glides,
9538and vowels follow in that order.
9539.pp
9540The concatenation procedure is controlled by a table which associates
954125 quantities with each segment.  They are
9542.LB
9543.NI
9544rank
9545.NI
95462\ \ overall durations (for stressed and unstressed occurrences)
9547.NI
95484\ \ transition durations (for internal and external transitions of
9549formant frequencies and amplitudes)
9550.NI
95518\ \ target parameter values (amplitudes and frequencies of three
9552formant resonances, plus fricative information)
9553.NI
95545\ \ quantities which specify how to calculate boundary values for
9555formant frequencies (two for each formant except the third,
9556which has only one)
9557.NI
95585\ \ quantities which specify how to calculate boundary values for
9559amplitudes.
9560.LE
9561This table is rather large.  There are 80 segments in all (remember
9562that many phonemes are represented by more than one segment),
9563and so it has 2000 entries.  The system was an offline one which ran on
9564what was then \(em 1964 \(em a large computer.
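.pp
The flavour of the table can be conveyed by sketching a single entry.
The field names below are mine, chosen to match the list of 25 quantities
above; they are not those of the original program.
.LB
.nf
# One entry of the segment table (illustrative field names only).
from dataclasses import dataclass
from typing import List

@dataclass
class SegmentEntry:
    rank: int                         # dominance rank
    duration_stressed_ms: int         # 2 overall durations
    duration_unstressed_ms: int
    transition_durations_ms: List[int]   # 4: internal/external, freq/amp
    targets: List[float]              # 8: three formant amplitudes and
                                      #    frequencies, plus frication
    freq_boundary_coeffs: List[float]    # 5: two per formant, one for F3
    amp_boundary_coeffs: List[float]     # 5

# 80 segments of 25 quantities each give the 2000 entries mentioned above.
.fi
.LE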
9565.pp
9566The advantage of such a large table of "rules" is the
9567flexibility it affords.
9568Notice that transition durations are specified independently for
9569formant frequency and amplitude parameters \(em this permits
9570fine control which is particularly useful for stops.
9571For each parameter the boundary value between segments is calculated
9572using a fixed contribution from the dominant one
9573and a proportion of the steady state value of the other.
9574.pp
9575It is possible that the two transition durations which are
9576calculated for a segment actually exceed the overall duration specified
9577for it.  In this case, the steady-state target values will be approached
9578but not actually attained, simulating a situation where coarticulation
9579effects prevent a target value from being reached.
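.pp
For a single parameter, the calculation just described can be sketched as
follows.  The coefficient values are invented for illustration; only the
form of the rule is taken from the description above.
.LB
.nf
# Boundary value and linear transition for one parameter (a sketch).

def boundary_value(dominant_fixed, other_steady, other_proportion):
    # a fixed contribution from the dominant segment plus a proportion
    # of the steady-state value of the other segment
    return dominant_fixed + other_proportion * other_steady

def transition(start, end, duration_ms, frame_ms=10):
    # a linear piece, one value per frame
    n = max(1, duration_ms // frame_ms)
    return [start + (end - start) * i / n for i in range(n + 1)]

# Example: sh dominates ee, so sh supplies the boundary coefficients and
# the durations of both its internal and its external transition.
b = boundary_value(dominant_fixed=1800.0, other_steady=2270.0,
                   other_proportion=0.25)
into_ee = transition(b, 2270.0, duration_ms=60)   # the piece into ee
# If the two transition durations exceed the segment's overall duration,
# the steady-state target is approached but never actually attained.
.fi
.LE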
9580.rh "An event-based system."
9581The synthesis system described above, in common with many others, takes
9582an uncompromisingly segment-by-segment view of speech.
9583The next phoneme is read, perhaps split into a few segments, and
9584these are synthesized one by one with due attention being paid
9585to transitions between them.
9586Some later work has taken a more syllabic view.
9587Mattingly (1976) urges a return to syllables for both practical and
9588theoretical reasons.
9589.[
9590Mattingly 1976 Syllable synthesis
9591.]
9592Transitional effects are particularly strong
9593within a syllable and comparatively weak (but by no means negligible)
9594from one syllable to the next.  From a theoretical viewpoint,
9595there are much stronger phonetic restrictions on phoneme sequences
9596than there are on syllable sequences:  pretty well any syllable can
9597follow another (although whether the pair makes sense is
9598a different matter), but the linguistically
9599acceptable phoneme sequences are only a fraction
9600of those formed by combining phonemes in all
9601possible ways.
Hill (1978) argues against what he calls the "segmental assumption"
9603that progress through the utterance should be made one segment at a time,
9604and recommends a description of speech based upon perceptually relevant
9605"events".
9606.[
9607Hill 1978 A program structure for event-based speech synthesis by rules
9608.]
9609This framework is interesting because it provides an opportunity for prosodic
9610considerations to be treated as an integral part of the synthesis
9611process.
9612.pp
9613The phonetic segments and other information that specify an utterance
9614can be regarded as a list of events which describes it
9615at a relatively high level.
9616Synthesis-by-rule is the act of taking this list and elaborating on it
9617to produce lower-level events which are realized by the vocal tract,
9618or acoustically simulated by a resonance synthesizer, to give a speech
9619waveform.
9620In articulatory terms, an event might be "begin tongue motion towards
9621upper teeth with a given effort", while in resonance terms it could be
9622"begin second formant transition towards 1500\ Hz at a given rate".
9623(These two examples are
9624.ul
9625not
9626intended to describe the same event:  a tongue motion causes much more
9627than the transition of a single formant.)  Coarticulation
9628issues such as stop burst suppression and nasal plosion should
9629be easier to imitate within an event-based scheme than a segment-to-segment
9630one.
9631.pp
9632The ISP system (Witten and Abbess, 1979) is event-based.
9633.[
9634Witten Abbess 1979
9635.]
9636The key to its operation is the
9637.ul
9638synthesis list.
9639To prepare an utterance for synthesis, the lexical items which specify
9640it are joined into a linked list.  Figure 7.3 shows the start of
9641the list created for
9642.LB
96431
9644.ul
9645dh i z  i z  /*d zh aa k s  /h aa u s
9646.LE
9647(this is Jack's house); the "1\ ...\ /*\ ...\ /\ ..." are
9648prosodic markers which will be discussed in the next chapter.
9649.FC "Figure 7.3"
9650Next, the rhythm and pitch assignment routines
9651augment the list with syllable boundaries, phoneme
9652cluster identifiers, and duration and pitch specifications.
9653Then it is passed to the segmental synthesis routine
9654which chains events into the appropriate places and, as it
9655proceeds, removes the no longer useful elements (phoneme names,
9656pitch specifiers, etc) which originally constituted the synthesis list.
9657Finally, an interrupt-driven speech synthesizer handler removes
9658events from the list as they become due and uses them to control
9659the hardware synthesizer.
9660.pp
9661By adopting the synthesis list as a uniform data structure for
9662holding utterances at every stage of processing, the problems of storage
9663allocation and garbage collection are minimized.
9664Each list element has a forward pointer and five data words, the first
9665indicating what type of element it is.
9666Lexical items which may appear in the input are
9667.LB
9668.NI
9669end of utterance (".", "!", ",", ";")
9670.NI
9671intonation indicator ("1", ...)
9672.NI
9673rhythm indicator ("/", "/*")
9674.NI
9675word boundary ("  ")
9676.NI
9677syllable boundary ("'")
9678.NI
9679phoneme segment
9680(\c
9681.ul
9682ar, b, ng, ...\c
9683)
9684.NI
9685explicit duration or pitch information.
9686.LE
9687Several of these have to do with prosodic features \(em a prime
9688advantage of the structure is that it does not create an artificial
9689division between segmentals and prosody.
9690Syllable boundaries and duration and pitch information are optional.
9691They will normally be computed by ISP, but the user can override them in the
9692input in a natural way.
9693The actual characters which identify lexical items are not fixed
9694but are taken from the rule table.
9695.pp
9696As synthesis
9697proceeds, new elements are chained in to the synthesis list.
9698For segmental purposes, three types of event are defined \(em
9699target events, increment events, and aspiration events.
9700With each event is associated a time at which the event becomes due.
9701For a target event, a parameter number, target parameter value,
9702and time-increment are specified.
9703When it becomes due, motion of the parameter towards the
9704target is begun.  If no other event for that parameter intervenes,
9705the target value will be reached after the given time-increment.
9706However, another target event for the parameter may change its motion
9707before the target has been attained.
9708Increment events contain a parameter number, a parameter increment,
9709and a time-increment.  The fixed increment is added to the parameter value
9710throughout the time specified.  This provides an easy way to make a
9711fricative burst during the opening phase of a stop consonant.
9712Aspiration events switch the mode of excitation from voicing to aspiration
9713for a given period of time.  Thus the aspirated part of unvoiced stops
can be accommodated in a natural manner, by changing the mode of excitation
9715for the duration of the aspiration.
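.pp
The three event types, and the action of the handler on a target event,
can be sketched as follows.  The record layouts and names are illustrative
only; they are not the actual ISP data structures.
.LB
.nf
# Illustrative event records for an event-based scheme (a sketch).
from dataclasses import dataclass

@dataclass
class TargetEvent:
    due_ms: int
    parameter: int
    target: float
    time_increment_ms: int   # time to reach the target, barring new events

@dataclass
class IncrementEvent:
    due_ms: int
    parameter: int
    increment: float         # added to the parameter over the whole period
    time_increment_ms: int

@dataclass
class AspirationEvent:
    due_ms: int
    duration_ms: int         # excitation switched from voicing to aspiration

def step(value, ev, frame_ms=10):
    # one frame of linear motion towards the target; the handler decrements
    # ev.time_increment_ms each frame, and a later target event for the
    # same parameter may redirect the motion before the target is reached
    frames_left = max(1, ev.time_increment_ms // frame_ms)
    return value + (ev.target - value) / frames_left
.fi
.LE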
9716.RF
9717.nr x1 (\w'excitation'/2)
9718.nr x2 (\w'formant resonance'/2)
9719.nr x3 (\w'fricative'/2)
9720.nr x4 (\w'type'/2)
9721.nr x5 (\w'frequencies (Hz)'/2)
9722.nr x6 (\w'resonance (Hz)'/2)
9723.nr x0 1.0i+0.7i+0.6i+0.6i+1.0i+1.2i+(\w'long vowel'/2)
9724.nr x7 (\n(.l-\n(x0)/2
9725.in \n(x7u
9726.ta 1.0i +0.7i +0.6i +0.6i +1.0i +1.2i
9727	\h'-\n(x1u'excitation		\0\0\h'-\n(x2u'formant resonance	\0\0\h'-\n(x3u'fricative	\h'-\n(x4u'type
9728			\0\0\h'-\n(x5u'frequencies (Hz)	\0\0\h'-\n(x6u'resonance (Hz)
9729\l'\n(x0u\(ul'
9730.sp
9731.nr x1 (\w'voicing'/2)
9732.nr x2 (\w'vowel'/2)
9733\fIuh\fR	\h'-\n(x1u'voicing	\0490	1480	2500		\c
9734\h'-\n(x2u'vowel
9735\fIa\fR	\h'-\n(x1u'voicing	\0720	1240	2540		\h'-\n(x2u'vowel
9736\fIe\fR	\h'-\n(x1u'voicing	\0560	1970	2640		\h'-\n(x2u'vowel
9737\fIi\fR	\h'-\n(x1u'voicing	\0360	2100	2700		\h'-\n(x2u'vowel
9738\fIo\fR	\h'-\n(x1u'voicing	\0600	\0890	2600		\h'-\n(x2u'vowel
9739\fIu\fR	\h'-\n(x1u'voicing	\0380	\0950	2440		\h'-\n(x2u'vowel
9740\fIaa\fR	\h'-\n(x1u'voicing	\0750	1750	2600		\h'-\n(x2u'vowel
9741.nr x2 (\w'long vowel'/2)
9742\fIee\fR	\h'-\n(x1u'voicing	\0290	2270	3090		\h'-\n(x2u'long vowel
9743\fIer\fR	\h'-\n(x1u'voicing	\0580	1380	2440		\h'-\n(x2u'long vowel
9744\fIar\fR	\h'-\n(x1u'voicing	\0680	1080	2540		\h'-\n(x2u'long vowel
9745\fIaw\fR	\h'-\n(x1u'voicing	\0450	\0740	2640		\h'-\n(x2u'long vowel
9746\fIuu\fR	\h'-\n(x1u'voicing	\0310	\0940	2320		\h'-\n(x2u'long vowel
9747.nr x1 (\w'aspiration'/2)
9748.nr x2 (\w'h'/2)
9749\fIh\fR	\h'-\n(x1u'aspiration					\h'-\n(x2u'h
9750.nr x1 (\w'voicing'/2)
9751.nr x2 (\w'glide'/2)
9752\fIr\fR	\h'-\n(x1u'voicing	\0240	1190	1550			 \h'-\n(x2u'glide
9753\fIw\fR	\h'-\n(x1u'voicing	\0240	\0650			\h'-\n(x2u'glide
9754\fIl\fR	\h'-\n(x1u'voicing	\0380	1190			\h'-\n(x2u'glide
9755\fIy\fR	\h'-\n(x1u'voicing	\0240	2270			\h'-\n(x2u'glide
9756.nr x2 (\w'nasal'/2)
9757\fIm\fR	\h'-\n(x1u'voicing	\0190	\0690	2000		\h'-\n(x2u'nasal
9758.nr x1 (\w'none'/2)
9759.nr x2 (\w'stop'/2)
9760\fIb\fR	\h'-\n(x1u'none	\0100	\0690	2000		\h'-\n(x2u'stop
9761\fIp\fR	\h'-\n(x1u'none	\0100	\0690	2000		\h'-\n(x2u'stop
9762.nr x1 (\w'voicing'/2)
9763.nr x2 (\w'nasal'/2)
9764\fIn\fR	\h'-\n(x1u'voicing	\0190	1780	3300		\h'-\n(x2u'nasal
9765.nr x1 (\w'none'/2)
9766.nr x2 (\w'stop'/2)
9767\fId\fR	\h'-\n(x1u'none	\0100	1780	3300		\h'-\n(x2u'stop
9768\fIt\fR	\h'-\n(x1u'none	\0100	1780	3300		\h'-\n(x2u'stop
9769.nr x1 (\w'voicing'/2)
9770.nr x2 (\w'nasal'/2)
9771\fIng\fR	\h'-\n(x1u'voicing	\0190	2300	2500		\h'-\n(x2u'nasal
9772.nr x1 (\w'none'/2)
9773.nr x2 (\w'stop'/2)
9774\fIg\fR	\h'-\n(x1u'none	\0100	2300	2500		\h'-\n(x2u'stop
9775\fIk\fR	\h'-\n(x1u'none	\0100	2300	2500		\h'-\n(x2u'stop
9776.nr x1 (\w'frication'/2)
9777.nr x2 (\w'voice + fric'/2)
9778.nr x3 (\w'fricative'/2)
9779\fIs\fR	\h'-\n(x1u'frication				6000	\h'-\n(x3u'fricative
9780\fIz\fR	\h'-\n(x2u'voice + fric	\0190	1780	3300	6000	\h'-\n(x3u'fricative
9781\fIsh\fR	\h'-\n(x1u'frication				2300	\h'-\n(x3u'fricative
9782\fIzh\fR	\h'-\n(x2u'voice + fric	\0190	2120	2700	2300	\h'-\n(x3u'fricative
9783\fIf\fR	\h'-\n(x1u'frication				4000	\h'-\n(x3u'fricative
9784\fIv\fR	\h'-\n(x2u'voice + fric	\0190	\0690	3300	4000	\h'-\n(x3u'fricative
9785\fIth\fR	\h'-\n(x1u'frication				5000	\h'-\n(x3u'fricative
9786\fIdh\fR	\h'-\n(x2u'voice + fric	\0190	1780	3300	5000	\h'-\n(x3u'fricative
9787\l'\n(x0u\(ul'
9788.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
9789.in 0
9790.FG "Table 7.4  Rule table for an event-based synthesis-by-rule program"
9791.pp
9792Now the rule table, which is shown in Table 7.4,
9793holds simple target positions for each phoneme segment, as well as
9794the segment type.  The latter is used to trigger events by computer
9795procedures which have access to the context of the segment.
9796In principle, this allows considerably more sophistication to be
9797introduced than does a simple segment-by-segment approach.
9798.RF
9799.nr x1 0.5i+0.5i+\w'preceding consonant in this syllable (suppress burst if fricative)'u
9800.nr x1 (\n(.l-\n(x1)/2
9801.in \n(x1u
9802.ta 0.5i +0.5i
9803fricative bursts on stops
9804aspiration bursts on unvoiced stops, affected by
9805	preceding consonant in this syllable (suppress burst if fricative)
9806	following consonant (suppress burst if another stop; introduce
9807		nasal plosion if a nasal)
9808	prosodics (increase burst if syllable is stressed)
9809voice bar on voiced stops (in intervocalic position)
9810post-voicing on terminating voiced stops, if syllable is stressed
9811anticipatory coarticulation for \fIh\fR
9812vowel colouring when a nasal or glide follows
9813.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
9814.in 0
9815.FG "Table 7.5  Some coarticulation effects"
9816.pp
9817For example, Table 7.5 summarizes some of the subtleties of the
9818speech production process which have been mentioned earlier in this
9819chapter.  Most of them are context-dependent, with the prosodic
9820context (whether two segments are in the same syllable; whether a
9821syllable is stressed) playing a significant role.  A scheme where
9822data-dependent "demons" fire on particular patterns in a linked list
9823seems to be a sensible approach towards incorporating such rules.
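.pp
As an illustration, here is one such rule, for the aspiration burst on
unvoiced stops, expressed as a small procedure.  The context fields and the
durations are invented for the sketch; only the conditions themselves come
from Table 7.5.
.LB
.nf
# One context-dependent rule from Table 7.5 (illustrative sketch).
from dataclasses import dataclass

@dataclass
class StopContext:
    preceded_by_fricative_in_syllable: bool    # "spot", "scot"
    followed_by_stop: bool                     # "apt"
    syllable_stressed: bool

def aspiration_burst_ms(ctx):
    if ctx.preceded_by_fricative_in_syllable or ctx.followed_by_stop:
        return 0                # burst suppressed
    burst = 40                  # nominal duration, msec
    if ctx.syllable_stressed:
        burst = 60              # more aspiration on a stressed syllable
    return burst

print(aspiration_burst_ms(StopContext(False, False, True)))    # 60
.fi
.LE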
9824.rh "Discussion."
9825There are two opposing trends in speech synthesis by rule.
9826On the one hand larger and larger segment inventories can be used,
9827containing more and more allophones explicitly.
9828This is the approach of the Votrax sound-segment synthesizer,
9829discussed in Chapter 11.
9830It puts an increasing burden on the person who codes the utterances
9831for synthesis, although, as we shall see, computer programs can assist with
9832this task.
9833On the other hand the segment inventory can be kept small, perhaps
9834comprising just the logical phonemes as in the ISP system.
This places the onus on the computer program to accommodate allophonic variations,
9836and to do so it must take account of the segmental and prosodic
9837context of each phoneme.
9838An event-based approach seems to give the best chance of incorporating
9839contextual modification whilst avoiding undesired interactions.
9840.pp
9841The second trend brings synthesis closer to the articulatory process
9842of speech production.  In fact an event-based system would be
9843an ideal way of implementing an articulatory model for speech synthesis
9844by rule.  It would be much more satisfying to have the rule table
9845contain articulatory target positions instead of resonance ones,
9846with events like "begin tongue motion towards upper teeth with a given
9847effort".  The problem is that hard data on articulatory postures and
9848constraints is much more difficult to gather than resonance information.
9849.pp
9850An interesting question that relates to articulation is whether formant
9851motion can be simulated adequately by a small number of linear pieces.
9852The segment-by-segment system described above had as many as nine
9853pieces for a single phoneme, for some phonemes had three phases
9854and each one contributes up to three pieces (transition in,
9855steady state, and transition out).
9856Another system used curves of decaying exponential
9857form which ensured that all transitions started rapidly towards
9858the target position but slowed down as it was approached (Rabiner, 1968, 1969).
9859.[
9860Rabiner 1968 Speech synthesis by rule Bell System Technical J
9861.]
9862.[
9863Rabiner 1969 A model for synthesizing speech by rule
9864.]
9865The time-constant of decay was stored with each segment in the rule
9866table.  The rhythm of the synthetic speech was controlled at this level,
9867for the next segment was begun when all the formants had attained
9868values sufficiently close to the current targets.
9869This is a poor model of the human speech production process, where rhythm
9870is dictated at a relatively high level and the next phoneme is not
9871simply started when the current one happens to end.
9872Nevertheless, the algorithm produced smooth, continuous formant motions
9873not unlike those found in spectrograms.
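.pp
The decaying-exponential scheme is easily sketched:  on every frame each
parameter moves a fixed fraction of its remaining distance to the target,
the fraction being derived from the time-constant stored with the segment,
and the next segment is begun once the value is close enough to the target.
The constants below are illustrative only.
.LB
.nf
# Decaying-exponential motion towards a target (a sketch).
import math

def exponential_track(start, target, time_constant_ms,
                      frame_ms=10, close_enough=0.05):
    alpha = 1.0 - math.exp(-frame_ms / time_constant_ms)
    values, value = [start], start
    while abs(target - value) > close_enough * abs(target - start):
        value += alpha * (target - value)    # fixed fraction of remainder
        values.append(value)
    return values    # the loop's termination is what sets the rhythm

track = exponential_track(1200.0, 1800.0, time_constant_ms=40.0)
.fi
.LE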
9874.pp
9875There is, however, by no means universal agreement on decaying exponential formant
9876motions.  Lawrence (1974) divided segments into "checked" and "free"
9877categories, corresponding roughly to consonants and vowels; and postulated
9878.ul
9879increasing
9880exponential transitions into checked segments, and decaying transitions into
9881free ones.
9882.[
9883Lawrence 1974
9884.]
9885This is a reasonable supposition if you consider the mechanics of
9886articulation.  The speed of movement of the tongue (for example) is likely
9887to increase until it is physically stopped by reaching the roof of the
9888mouth.
9889When moving away from a checked posture into a free one the transition will
9890be rapid at first but slow down to approach the target asymptotically,
9891governed by proprioceptive feedback.
9892.pp
9893The only thing that seems to be agreed is that the formant tracks should
9894certainly
9895.ul
9896not
9897be piecewise linear.  However, in the face of
9898conflicting opinions as to whether exponentials should be decaying
9899or increasing, piecewise linear motions seem to be a reasonable
9900compromise!  It is likely that the precise shape of formant
9901tracks is unimportant so long as the gross features are imitated
9902correctly.
9903Nevertheless, this is a question which an articulatory model
9904could help to answer.
9905.sh "7.6  References"
9906.LB "nnnn"
9907.[
9908$LIST$
9909.]
9910.LE "nnnn"
9911.sh "7.7  Further reading"
9912.pp
9913There are unfortunately few books to recommend on the subject of
9914joining segments of speech.
9915The references form a representative and moderately comprehensive bibliography.
9916Here is some relevant background reading in linguistics.
9917.LB "nn"
9918.\"Fry-1976-1
9919.]-
9920.ds [A Fry, D.B.(Editor)
9921.ds [D 1976
9922.ds [T Acoustic phonetics
9923.ds [I Cambridge Univ Press
9924.ds [C Cambridge, England
9925.nr [T 0
9926.nr [A 0
9927.nr [O 0
9928.][ 2 book
9929.in+2n
9930This book of readings contains many classic papers on acoustic phonetics
9931published from 1922\-1965.
9932It covers much of the history of the subject, and is intended
9933primarily for students of linguistics.
9934.in-2n
9935.\"Lehiste-1967-2
9936.]-
9937.ds [A Lehiste, I.(Editor)
9938.ds [D 1967
9939.ds [T Readings in acoustic phonetics
9940.ds [I MIT Press
9941.ds [C Cambridge, Massachusetts
9942.nr [T 0
9943.nr [A 0
9944.nr [O 0
9945.][ 2 book
9946.in+2n
9947Another basic collection of references which covers much the same ground
9948as Fry (1976), above.
9949.in-2n
9950.\"Sivertsen-1961-3
9951.]-
9952.ds [A Sivertsen, E.
9953.ds [D 1961
9954.ds [K *
9955.ds [T Segment inventories for speech synthesis
9956.ds [J Language and Speech
9957.ds [V 4
9958.ds [P 27-89
9959.nr [P 1
9960.nr [T 0
9961.nr [A 1
9962.nr [O 0
9963.][ 1 journal-article
9964.in+2n
9965This is a careful early study of the quantitative implications of using
9966phonemes, demisyllables, syllables, and words as the basic building
9967blocks for speech synthesis.
9968.in-2n
9969.LE "nn"
9970.EQ
9971delim $$
9972.EN
9973.CH "8  PROSODIC FEATURES IN SPEECH SYNTHESIS"
9974.ds RT "Prosodic features
9975.ds CX "Principles of computer speech
9976.pp
9977Prosodic features are those which characterize an utterance as a whole,
9978rather than having a local influence on individual sound segments.
9979For speech output from computers, an "utterance" usually comprises a
9980single unit of information which stretches over several words \(em a clause
9981or sentence.  In natural speech an utterance can be very much longer, but
9982it will be broken into prosodic units which are again roughly the size of a
9983clause or sentence.  These prosodic units are certainly closely related
9984to each other.  For example, the pitch contour used when introducing a new
9985topic is usually different from those employed to develop it subsequently.
9986However, for the purposes of synthesis the successive prosodic units can
9987be treated independently, and information about pitch contours to be used
9988will have to be specified in the input for each one.
9989The independence between them is not complete, however, and
9990lower-level contextual effects, such as interpolation of pitch between
9991the end of one prosodic unit and the start of the next, must still be
9992imitated.
9993.pp
9994Prosodic features were introduced briefly in Chapter 2.
9995Variations in voice dynamics occur in three dimensions:  pitch of the voice,
9996time, and amplitude.
These dimensions are inextricably intertwined in living speech.
9998Variations in voice quality are much less important for the factual
9999kind of speech usually sought in voice response applications,
although they can play a considerable part in conveying emotions
10001(for a discussion of the acoustic manifestations of emotion in speech,
10002see Williams and Stevens, 1972).
10003.[
10004Williams Stevens 1972
10005.]
10006.pp
10007The distinction between prosodic and segmental effects is a traditional one,
10008but it becomes rather fuzzy when examined in detail.
10009It is analogous to the distinction between hardware and
software in computer science:  although useful from some points of view,
10011the borderline becomes blurred as one gets closer to actual systems \(em with
10012microcode, interrupts, memory management, and the like.
10013At a trivial level, prosodics
10014cannot exist without segmentals, for there must be some vehicle to carry the
10015prosodic contrasts.
10016Timing \(em a prosodic feature \(em is actually realized by the durations of
10017individual segments.  Pauses are tantamount to silent segments.
10018.pp
10019While pitch may seem to be relatively independent of segmentals \(em and
10020this view is reinforced by the success of the source-filter model
10021which separates the frequency of the
10022excitation source from the filter characteristics \(em there
10023are some subtle phonetic effects of pitch.
10024It has been observed that it drops on the transition into certain
10025consonants, and rises again on the transition out (Haggard
10026.ul
10027et al,
100281970).
10029.[
10030Haggard Ambler Callow 1970
10031.]
10032This can be explained in terms of variations in pressure from the
10033lungs on the vocal cords (Ladefoged, 1967).
10034.[
10035Ladefoged 1967
10036.]
10037Briefly, the increase in mouth pressure which occurs during some consonants
10038causes a reduction in the pressure difference across the vocal cords
10039and in the rate of flow of air between them.
10040This results in a decrease in their frequency of vibration.
10041When the constriction is released, there is a temporary increase in the air
10042flow which increases the pitch again.
10043The phenomenon is called "microintonation".
10044It is particularly noticeable in voiced stops, but also occurs in voiced
10045fricatives and unvoiced stops.
10046Simulation of the effect in synthesis-by-rule has often been found to give
10047noticeable improvements in the speech quality.
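.pp
One simple way of imitating microintonation in a synthesis-by-rule program
is to superimpose a small dip on the pitch contour around the consonant,
falling into the closure and recovering after the release.  The depth and
timing used below are nominal values for illustration, not taken from any
published rule set.
.LB
.nf
# An illustrative microintonation correction, in Hz, added to the pitch
# parameter around a voiced stop (nominal figures only).

def microintonation_offset(t_ms, closure_ms, release_ms, depth_hz=8.0):
    if closure_ms - 50 <= t_ms < closure_ms:
        return -depth_hz * (t_ms - (closure_ms - 50)) / 50.0   # falling in
    if closure_ms <= t_ms <= release_ms:
        return -depth_hz                                       # constriction
    if release_ms < t_ms <= release_ms + 50:
        return -depth_hz * (1.0 - (t_ms - release_ms) / 50.0)  # recovering
    return 0.0
.fi
.LE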
10048.pp
10049Loudness also has a segmental role.  For example, we noted in the last chapter
10050that amplitude values play a small part in identification of fricatives.
10051In fact loudness is a very
10052.ul
10053weak
10054prosodic feature.  It contributes little to the perception of stress.
10055Even for shouting the distinction from normal speech is as much in the voice
10056quality as in amplitude
10057.ul
10058per se.
10059It is not necessary to consider varying loudness on a prosodic basis
10060in most speech synthesis systems.
10061.pp
10062The above examples show how prosodic features have segmental influences
10063as well.
10064The converse is also true:  some segmental features have a prosodic effect.
10065The last chapter described how stress is associated with increased aspiration
10066of syllable-initial unvoiced stops.  Furthermore, stressed syllables
10067are articulated with greater effort than unstressed ones, and hence the formant
10068transitions are more likely to attain their target values
10069under circumstances which would otherwise cause them to fall short.
10070In unstressed syllables, extreme vowels (like
10071.ul
10072ee, aa, uu\c
10073)
tend towards more centralized sounds
10075(like
10076.ul
10077i, uh, u
10078respectively).
10079Although all British English vowels
10080.ul
10081can
10082appear in unstressed syllables, they often become "reduced" into a
10083centralized form.
10084Consider the following examples.
10085.LB
10086.NI
10087diplomat	\
10088.ul
10089d i p l uh m aa t
10090.NI
10091diplomacy	\
10092.ul
10093d i p l uh u m uh s i
10094.NI
10095diplomatic	\
10096.ul
10097d i p l uh m aa t i k.
10098.LE
10099The vowel of the second syllable is reduced to
10100.ul
10101uh
10102in "diplomat" and "diplomatic", whereas the root form "diploma", and also
10103"diplomacy", has a diphthong
10104(\c
10105.ul
10106uh u\c
10107)
10108there.  The third syllable has an
10109.ul
10110aa
10111in "diplomat" and "diplomatic" which is reduced to
10112.ul
10113uh
10114in "diplomacy".
10115In these cases the reduction is shown explicitly in the phonetic transcription;
10116but in more marginal examples where it is less extreme it will not be.
10117.pp
10118I have tried to emphasize in previous chapters that prosodic features are
10119important in speech synthesis.
10120There is something very basic about them.
10121Rhythm is an essential part of all bodily activity \(em of breathing,
10122walking, working and playing \(em and so it pervades speech too.
10123Mothers and babies communicate effectively using intonation alone.
10124Some experiments have indicated that the language environment of
10125an infant affects his babbling at an early age, before he has effective
10126segmental control.
10127There is no doubt that "tone of voice" plays a large part in human
10128communication.
10129.pp
10130However, early attempts at synthesis did not pay too
10131much attention to prosodics, perhaps because it was thought sufficient to get the
10132meaning across by providing clear segmentals.
10133As artificial speech grows more widespread, however, it is becoming
10134apparent that its acceptability to users, and hence its ultimate
10135success, depends to a large extent on incorporating natural-sounding
prosodics.  Flat, arrhythmic speech may be comprehensible in short stretches,
10137but it strains the concentration in significant discourse and people
10138are not usually prepared to listen to it.
10139Unfortunately, current commercial speech output systems do not really tackle
10140prosodic questions, which indicates our present rather inadequate
10141state of knowledge.
10142.pp
10143The importance of prosodics for automatic speech
10144.ul
10145recognition
10146is beginning to be appreciated too.  Some research projects
10147have attended to the automatic identification of points of stress,
10148in the hope that the clear articulation of stressed syllables can be used
10149to provide anchor points in an unknown utterance (for example, see Lea
10150.ul
10151et al,
101521975).
10153.[
10154Lea Medress Skinner 1975
10155.]
10156.pp
10157But prosodics and segmentals are closely intertwined.
10158I have chosen to
10159treat them in separate chapters in order to split the material up into
10160manageable chunks rather than to enforce a deep division between them.
10161It is also true that synthesis of prosodic features is an uncharted and
10162controversial area, which gives this chapter rather a different
10163flavour from the last.
10164It is hard to be as definite about alternative strategies
10165and methods as you can for segment concatenation.
10166In order to make the treatment as concrete and down-to-earth as possible,
10167I will describe in some detail two example projects in prosodic synthesis.
10168The first treats the problem of transferring pitch from one utterance to
10169another, while the second considers how artificial timing and pitch can be
10170assigned to synthetic speech.
10171These examples illustrate quite different problems, and are reasonably
10172representative of current research activity.
10173(Other systems are described by Mattingly, 1966; Rabiner
10174.ul
10175et al,
101761969.)  Before
10177.[
10178Mattingly 1966
10179.]
10180.[
10181Rabiner Levitt Rosenberg 1969
10182.]
10183looking at the two examples, we will discuss
10184a feature which is certainly prosodic but does not appear in the
10185list given earlier \(em stress.
10186.sh "8.1  Stress"
10187.pp
10188Stress is an everyday notion, and when
10189listening to natural speech people can usually agree on which syllables
10190are stressed.  But it is difficult to characterize in acoustic terms.
10191From the speaker's point of view, a stressed syllable is produced by
10192pushing more air out of the lungs.  For a listener, the points of stress
10193are "obvious".
10194You may think that stressed syllables are louder than the others:  however,
10195instrumental studies show that this is not necessarily (nor even usually)
10196so (eg Lehiste and Peterson, 1959).
10197.[
10198Lehiste Peterson 1959
10199.]
10200Stressed syllables frequently have a longer vowel than unstressed
10201ones, but this is by no means universally true \(em if you say "little"
10202or "bigger" you will find that the vowel in the first, stressed, syllable
10203is short and shows little sign of lengthening as you increase the emphasis.
10204Moreover, experiments using bisyllabic nonsense words have indicated
10205that some people consistently judge the
10206.ul
10207shorter
10208syllable to be stressed in the absence of other clues (Morton and Jassem,
102091965).
10210.[
10211Morton Jassem 1965
10212.]
10213Pitch often helps to indicate stress.
10214It is not that stressed syllables are always higher- or lower-pitched
10215than neighbouring ones, or even that they are uttered with a rising or
10216falling pitch.  It is the
10217.ul
10218rate of change
10219of pitch that tends to be greater
10220for stressed syllables:  a sharp rise or fall,
10221or a reversal of direction, helps to give emphasis.
10222.pp
10223Stress is acoustically manifested in timing and pitch,
10224and to a much lesser extent in loudness.
10225However it is a rather subtle feature and does
10226.ul
10227not
10228correspond simply to duration increases or pitch rises.
10229It seems that listeners unconsciously put together all the clues
10230that are present in an utterance in order to deduce which syllables are
10231stressed.
10232It may be that speech is perceived by a listener with reference to how
10233he would have produced it himself, and that this is how he detects which syllables
10234were given greater vocal effort.
10235.pp
10236The situation is confused by the fact that certain syllables in words are
10237often said in ordinary language to be "stressed" on account of their
10238position in the word.  For example, the words
10239"diplomat", "diplomacy", and "diplomatic" have stress on the first,
10240second, and third syllables respectively.
10241But here we are talking about the word itself rather than
10242any particular utterance of it.  The "stress" is really
10243.ul
10244latent
10245in the indicated syllables and only made manifest upon uttering them,
10246and then to a greater or lesser degree depending on exactly how
10247they are uttered.
10248.pp
10249Some linguists draw a careful distinction between salient syllables,
10250accented syllables, and stressed syllables,
10251although the words are sometimes used differently by different authorities.
10252I will not adopt a precise terminology here,
10253but it is as well to be aware of the subtle distinctions involved.
10254The term "salience" is applied to actual utterances, and salient
10255syllables are those that are perceived as being more prominent than their
10256neighbours.
10257"Accent" is the potential for salience, as marked, for example,
10258in a dictionary or lexicon.
10259Thus the discussion of the "diplo-" words above is about accent.
10260Stress is an articulatory phenomenon associated with increased
10261muscular activity.
10262Usually, syllables which are perceived as salient were produced with stress,
10263but in shouting, for example, all syllables can be stressed \(em even
10264non-salient ones.
10265Furthermore, accented syllables may not be salient.
10266For instance, the first syllable of the word "very" is accented,
10267that is, potentially salient, but in a sentence as uttered it may or may not be
10268salient.  One can say
10269.LB
10270"\c
10271.ul
10272he's
10273very good"
10274.LE
10275with salience on "he" and possibly "good", or
10276.LB
10277"he's
10278.ul
10279very
10280good"
10281.LE
10282with salience on the first syllable of "very", and possibly "good".
10283.pp
10284Non-standard stress patterns are frequently used to bring out contrasts.
10285Words like "a" and "the" are normally unstressed, but can be stressed
10286in contexts where ambiguity has arisen.
10287Thus factors which operate at a much higher level than the phonetic structure
10288of the utterance must be taken into account when deciding where stress
10289should be assigned.  These include syntactic and semantic considerations,
10290as well as the attitude of the speaker and the likely attitude of
10291the listener to the material being spoken.
10292For example, I might say
10293.LB
10294"Anna
10295.ul
10296and
10297Nikki should go",
10298.LE
10299with emphasis on the "and" purely because I was aware that my listener
10300might quibble about the expense of sending them both.
10301Clearly some notation is needed to communicate to the synthesis process
10302how the utterance is supposed to be rendered.
10303.sh "8.2  Transferring pitch from one utterance to another"
10304.pp
10305For speech stored in source-filter form and concatenated on a
10306slot-filling basis, it would be useful to
10307have stored typical pitch contours which can be applied to the
10308synthetic utterances.
10309From a practical point of view it is important to be able to generate
10310natural-sounding pitch for high-quality artificial speech.
10311Although several algorithms for creating completely synthetic contours
10312have been proposed \(em and we will examine one later in this chapter \(em
10313they are unsuitable for high-quality speech.
10314They are generally designed for use with synthesis-by-rule from phonetics,
10315and the rather poor quality of articulation does not encourage the
10316development of excellent pitch assignment procedures.  With speech
10317synthesized by rule there is generally an emphasis on keeping the
10318data storage requirements to a minimum, and so it is not appropriate
10319to store complete contours.
10320Moreover, if speech is entered in textual
10321form as phoneme strings, it is natural to attach pitch information as markers
10322in the text rather than by entering a complete and detailed contour.
10323.pp
10324The picture is rather different for concatenated segments of natural speech.
10325In the airline reservation system, with utterances formed from templates like
10326.LB
10327Flight number \(em leaves \(em at \(em , arrives in \(em at \(em ,
10328.LE
10329it is attractive to store the pitch contour of one complete instance of the
10330utterance and apply it to all synthetic versions.
10331.pp
10332There is an enormous literature on the anatomy of intonation, and much of it
10333rests upon the notion of a pitch contour as a descriptive aid to analysis.
10334Underlying this is the assumption, usually unstated, that a contour can be
10335discussed independently of the particular stream of words that manifests it;
10336that a single contour can somehow be bound to any sentence (or phrase, or
10337clause) to produce an acceptable utterance.  But the contour, and its binding,
10338are generally described only at the grossest level, the details being left
10339unspecified.
10340.pp
10341There are phonetic influences on pitch \(em the characteristic lowering
10342during certain consonants was mentioned above \(em and these are
10343not normally considered as part of intonation.
10344Such effects will certainly spoil attempts to store contours extracted
10345from living speech and apply them to different utterances, but the impairment
10346may not be too great, for pitch is only one of many segmental clues to
10347consonant identification.
10348.pp
10349In the system mentioned earlier which generated 7-digit telephone numbers
10350by concatenating formant-coded words, a single natural pitch contour
10351was applied to all utterances.
10352It was taken to match as well as possible the general shape of the
10353contours measured in naturally-spoken telephone numbers.  However, this is a very
10354restricted environment, for telephone numbers exhibit almost no variety in
10355the configuration of stressed and unstressed syllables \(em
10356the only digit which is not a monosyllable is "seven".
10357Significant problems arise when more general utterances are considered.
10358.pp
10359Suppose the pitch contour of one utterance (the "source")
10360is to be transferred to another (the "target").
10361Assume that the utterances are encoded in source-filter form,
10362either as parameter tracks for a formant synthesizer or as linear predictive
10363coefficients.
10364Then there are no technical obstacles to combining pitch and segmentals.
10365The source must be available as a complete utterance, while the target
10366may be formed by concatenating smaller units such as words.
10367.pp
10368For definiteness, we will consider utterances of the form
10369.LB
10370The price is \(em dollars and \(em cents,
10371.LE
10372where the slots are filled by numbers less than 100;
10373and of the form
10374.LB
10375The price is \(em cents.
10376.LE
10377The domain of prices encompasses a wide range of syllable
10378configurations.
10379There are between one and five syllables in each variable part,
10380if the numbers are restricted to be less than 100.
10381The sentences have a constant pragmatic, semantic, and syntactic structure.
10382As in the vast majority of real-life situations,
10383minimal phonetic distinctions between utterances do not occur.
10384.pp
10385Pitch transfer is complicated by the fact that values of the source pitch
10386are only known during the voiced parts of the utterance.
10387Although it would certainly be possible to extrapolate pitch
10388over unvoiced parts, this would introduce some artificiality into
10389the otherwise completely natural contours.
10390Let us assume, therefore, that the pitch contour
10391of the voiced nucleus of each syllable in the source is applied to the
10392corresponding syllable nucleus in the target.
10393.pp
10394The primary factors which might tend to inhibit successful transfer
10395are
10396.LB
10397.NP
10398different numbers of syllables in the utterances;
10399.NP
10400variations in the pattern of stressed and unstressed syllables;
10401.NP
10402different syllable durations;
10403.NP
10404pitch discontinuities;
10405.NP
10406phonetic differences between the utterances.
10407.LE
10408.rh "Syllabification."
10409It is essential to take into account the syllable structures
10410of the utterances, so that pitch is transferred between
10411corresponding syllables rather than over the utterance
10412as a whole.
10413Fortunately, syllable boundaries can be detected automatically
10414with a fair degree of accuracy, especially if the speech is carefully
10415enunciated.
10416It is worth considering briefly how this can be done, even though it takes
10417us off the main topic of synthesis and into speech analysis.
10418.pp
10419A procedure developed by Mermelstein (1975)
10420involves integrating the spectral energy
10421at each point in the utterance.
10422.[
10423Mermelstein 1975 Automatic segmentation of speech into syllabic units
10424.]
10425First the low (<500\ Hz) and high (>4000\ Hz) ends are filtered out
10426with 12\ dB/octave cutoffs.
10427The resulting energy signal is smoothed
10428by a 40\ Hz lowpass filter, giving a so-called "loudness"
10429function.
10430All this can be accomplished with simple recursive digital filters.
10431.pp
10432Then, the loudness function is compared with its convex hull.
10433The convex hull is the shape a piece of elastic would assume if
10434stretched over the top of the loudness function and anchored down at
10435both ends, as illustrated in Figure 8.1.
10436.FC "Figure 8.1"
10437The point of maximum difference between the hull and loudness function
10438is taken to be a tentative syllable
10439boundary.
10440The hull is recomputed, but anchored to the actual loudness function
10441at the tentative boundary,
10442and the points of maximum hull-loudness difference in each of the
10443two halves  are selected as further tentative
10444boundaries.
10445The procedure continues recursively until the maximum hull-loudness
10446difference, with the hull anchored at each tentative boundary,
10447falls below a certain minimum (say 4\ dB).
10448.pp
10449At this stage, the number of tentative boundaries will greatly exceed
10450the actual number of syllables (by a factor of around 5).
10451Many of the extraneous boundaries are eliminated by the following
10452constraints:
10453.LB
10454.NP
10455if two boundaries lie within a certain time of each other
10456(say 120\ msec), one of them is discarded;
10457.NP
10458if the maximum loudness within a tentative syllable falls too
10459far short of the overall maximum for the utterance
10460(more than 20\ dB), one boundary is discarded.
10461.LE
10462The question of which boundary to discard can be decided by
10463examining the voicing continuity of the utterance.
10464If possible, voicing across a syllable boundary should be avoided.
10465Otherwise, the boundary with the smallest hull-loudness
10466difference should be rejected.
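.pp
The procedure can be sketched in a few lines.  The thresholds are the
nominal ones quoted above; the loudness function is assumed to be one value
per 10 msec frame, in dB, and only the first pruning rule is shown.
.LB
.nf
# Convex-hull syllabification (a sketch of the procedure described above).

def upper_hull(y, lo, hi):
    # the "stretched elastic" of Figure 8.1: the upper convex hull of the
    # loudness samples between the two anchors, evaluated at every frame
    pts = []
    for k in range(lo, hi + 1):
        while len(pts) >= 2:
            x1, x2 = pts[-2], pts[-1]
            cross = (x2 - x1) * (y[k] - y[x1]) - (y[x2] - y[x1]) * (k - x1)
            if cross >= 0:      # middle point lies on or below the chord
                pts.pop()
            else:
                break
        pts.append(k)
    hull = [0.0] * (hi - lo + 1)
    for a, b in zip(pts, pts[1:]):
        for k in range(a, b + 1):
            hull[k - lo] = y[a] + (y[b] - y[a]) * (k - a) / (b - a)
    return hull

def tentative_boundaries(loudness, lo, hi, min_dip_db=4.0):
    # place a boundary at the point of maximum hull-loudness difference,
    # re-anchor the hull there, and recurse until the dip falls below 4 dB
    if hi - lo < 2:
        return []
    diffs = [h - loudness[lo + k]
             for k, h in enumerate(upper_hull(loudness, lo, hi))]
    k = max(range(len(diffs)), key=diffs.__getitem__)
    if diffs[k] < min_dip_db:
        return []
    split = lo + k
    return (tentative_boundaries(loudness, lo, split, min_dip_db) + [split]
            + tentative_boundaries(loudness, split, hi, min_dip_db))

def prune(boundaries, frame_ms=10, min_gap_ms=120):
    # first constraint only: discard a boundary closer than 120 msec to
    # the previous one (the 20 dB test and voicing check would follow)
    kept = []
    for b in sorted(boundaries):
        if not kept or (b - kept[-1]) * frame_ms >= min_gap_ms:
            kept.append(b)
    return kept
.fi
.LE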
10467.RF
10468.nr x0 \w'boundaries moved slightly to correspond better with voicing:'
10469.nr x1 (\n(.l-\n(x0)/2
10470.in \n(x1u
10471.ta 3.4i +0.5i
10472\l'\n(x0u\(ul'
10473.sp
10474total syllable count:	332
10475boundaries missed by algorithm:	\0\09	(3%)
10476extra boundaries inserted by algorithm:	\029	(9%)
10477boundaries moved slightly to correspond better with voicing:
10478	\0\03	(1%)
10479.sp
10480total errors:	\041	(12%)
10481\l'\n(x0u\(ul'
10482.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
10483.in 0
10484.FG "Table 8.1  Success of the syllable segmentation procedure"
10485.pp
10486Table 8.1 illustrates the success of this syllabification
10487procedure, in a particular example.
10488Segmentation is performed with less than 10% of extraneous
10489boundaries being inserted,
10490and much less than 10% of actual boundaries being missed.
10491These figures are rather sensitive to the values of the
10492three thresholds.
10493The values were chosen to err on the side
10494of over-zealous syllabification, because all the boundaries need to be checked
10495by ear and eye and it is easier to delete
10496a boundary by hand than to insert one at an appropriate place.
10497It may well be that with careful optimization of thresholds,
10498better figures could be
10499achieved.
10500.rh "Stressed and unstressed syllables."
10501If the source and target utterances have the same number of
10502syllables, and the same pattern of stressed and unstressed syllables,
10503pitch can simply be transferred from a syllable in the source
10504to the corresponding one in the target.
10505But if the pattern differs \(em even though the
10506number of syllables may be the same, as in "eleven" and "seventeen" \(em
10507then a one-to-one mapping will conflict with the stress points,
10508and certainly sound unnatural.
10509Hence an attempt should be made to ensure that the pitch is mapped in a
10510plausible way.
10511.pp
10512The syllables of each utterance can be classified as "stressed"
10513and "unstressed".
10514This distinction could be made automatically by
10515inspection of the pitch contour, within the domain of utterances used,
10516and possibly even in general (Lea
10517.ul
10518et al,
105191975).
10520.[
10521Lea Medress Skinner 1975
10522.]
10523However, in many cases it is expedient to perform the job by hand.
10524In our example, the sentences have fixed "carrier" parts and
10525variable "number" parts.
10526The stressed carrier syllables, namely
10527.LB
10528"... price ... dol\- ... cents",
10529.LE
10530can be marked as such, by hand,
10531to facilitate proper alignment between the source and target.
10532This marking would be difficult to do automatically
10533because it would be hard to distinguish the carrier from the numbers.
10534.pp
10535Even after classifying the syllables as "carrier stressed",
10536"stressed", and "unstressed", alignment still presents problems,
10537because the configuration of syllables in the variable parts
10538of the utterances may differ.
10539Syllables in the source which have no
10540correspondence in the target can be ignored.
10541The pitch track of
10542the source syllable can be replicated for each
10543additional syllable in corresponding
10544position in the target.
10545Of course, a stressed syllable should be selected for copying
10546if the unmatched target syllable is stressed,
10547and similarly for unstressed ones.
10548It is rather dangerous to copy exactly a part of a pitch
10549contour, for the ear is very sensitive to the juxtaposition of
10550identically intoned segments of speech \(em especially when the segment is stressed.
10551To avoid this, whenever a stressed syllable is replicated the
10552pitch values should be decreased by, say, 20%, on the second copy.
10553It sometimes happens that a single stressed syllable in the source
10554needs to cover a stressed-unstressed pair in the target:  in
10555this case the first part of the source pitch track can be used
10556for the stressed syllable, and the remainder for the
10557unstressed one.
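.pp
The replication and splitting rules can be made concrete with another
short fragment.
The representation (a list of (stress, pitch track) pairs for each
utterance, with the carrier syllables assumed to have been matched
already) and the simple matching-by-position strategy are assumptions
made for the sake of illustration;
only the replication, the 20% lowering and the 70/30 split come from
the rules above.
.nf
.in+2n
LOWERING = 0.8   # a replicated stressed contour is lowered by 20 per cent
SPLIT = 0.7      # leading fraction kept when a stressed source syllable
                 # must also cover a following unstressed target syllable

def align(source, target):
    # source: list of (stress, track); target: list of (stress, None)
    # where stress is "stressed" or "unstressed" and track is a list of
    # pitch values; returns one track per target syllable
    pools = {s: [t for st, t in source if st == s]
             for s in ("stressed", "unstressed")}
    taken = {"stressed": 0, "unstressed": 0}
    out = []
    for stress, _ in target:
        pool, i = pools[stress], taken[stress]
        if i < len(pool):                 # a source syllable corresponds
            out.append(list(pool[i]))
            taken[stress] = i + 1
        elif pool:                        # no correspondence: replicate,
            track = list(pool[-1])        # but never juxtapose identical
            if stress == "stressed":      # stressed contours unaltered
                track = [p * LOWERING for p in track]
            out.append(track)
        else:                             # unstressed target with no source:
            prev = out[-1]                # borrow the tail of the preceding
            cut = int(len(prev) * SPLIT)  # (stressed) contour
            out[-1], tail = prev[:cut], prev[cut:]
            out.append(tail)
    return out
.in-2n
.fi
Source syllables which have no correspondence in the target are simply
never taken from the pools, and so are ignored.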
10558.pp
10559The example of Figure 8.2 will help to make these rules clear.
10560.FC "Figure 8.2"
Note that only the marking is done by hand.
10562The detailed mapping decisions can be left to the computer.
10563The rules were derived intuitively, and do not have any sound theoretical
10564basis.
10565They are intended to give reasonable results in the majority of cases.
10566.pp
10567Figure 8.3 shows the result of transferring the pitch from "the price is ten
10568cents" to "the price is seventy-seven cents".
10569.FC "Figure 8.3"
10570The syllable boundaries which are marked were determined automatically.
10571The use of the last 30% of the
10572"ten" contour to cover the first "-en" syllable, and its replication
10573to serve the "-ty" syllable, can be seen.
10574However, the 70%\(em30% proportion is applied to the source contour,
10575and the linear distortion (described next) upsets the proportion in the
10576target utterance.
10577The contour of the second "seven" can be seen to be a
10578replication of that of the first one, lowered by 20%.
10579Notice that the pitch extraction procedure has introduced an artifact into the final
10580part of one of the "cents" contours by doubling the pitch.
10581.rh "Stretching and squashing."
10582The pitch contour over a source syllable nucleus must be stretched
10583or squashed to match the duration
10584of the target nucleus.
10585It is difficult to see how anything other than linear stretching
10586and squashing could be done without considerably increasing the
10587complexity of the procedure.
10588The gross non-linearities will have been accounted for
10589by the syllable alignment process, and so simple linear time-distortion
10590should not cause too much degradation.
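.pp
Linear time-distortion amounts to no more than resampling the source
contour at the target length, as in this illustrative fragment:
.nf
.in+2n
def stretch(track, m):
    # resample the n pitch values of a source nucleus on to the m frames
    # of the target nucleus by linear interpolation
    n = len(track)
    if n == 1 or m == 1:
        return [track[0]] * m
    out = []
    for j in range(m):
        x = j * (n - 1) / float(m - 1)
        i = min(int(x), n - 2)
        frac = x - i
        out.append(track[i] * (1 - frac) + track[i + 1] * frac)
    return out
.in-2n
.fi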
10591.rh "Pitch discontinuities."
10592Sudden jumps in pitch during voiced speech sound peculiar,
10593although they can in fact be produced naturally (by yodelling).
10594People frequently burst into laughter on hearing them in synthetic speech.
10595It is particularly important to avoid this diverting effect in
10596voice response applications,
10597for the listener's attention is instantly directed
10598away from what is said to the voice that speaks.
10599.pp
10600Discontinuities can arise in the pitch-transfer procedure either by a
10601voiced-unvoiced-voiced transition between syllables mapping on to
10602a voiced-voiced transition in the target,
10603or by voicing continuity being broken when the syllable
10604alignment procedure drops or replicates a syllable.
10605There are several ways in which at least some of the possibilities can
10606be avoided.
10607For example, one could hold unstressed syllables at a constant pitch
10608whose value coincides with either the end of the previous
10609syllable's contour or the beginning of the next syllable's contour,
10610depending on which transition is voiced.
10611Alternatively, the policy of reserving the trailing part
10612of a stressed syllable in the source to cover an unmatched following
10613unstressed syllable in the target could be generalized to allow use of the leading 30%
10614of the next stressed syllable's contour instead,
10615if that maintained voicing continuity.
10616A third solution is simply to merge the pitch contours
10617at a discontinuity by mixing the average pitch value at the break
10618with the pitch contour on either side of it in a proportion which
10619increases linearly from the edges of the domain of influence to the discontinuity.
10620Figure 8.4 shows the effect of this merging,
10621when the pitch contour of "the price is seven cents"
10622is transferred to "the price is eleven cents".
10623.FC "Figure 8.4"
10624Of course, the
10625interpolated part will not necessarily be linear.
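.pp
The merging operation of the third solution can be sketched as follows.
The length of the domain of influence on each side of the break
(the parameter span, measured in frames) is an assumption of the
sketch; the pitch at the break itself becomes the average of the two
sides, and the adjustment fades out towards the edges of the domain.
.nf
.in+2n
def merge(left, right, span):
    # left, right: pitch tracks on either side of the discontinuity
    target = (left[-1] + right[0]) / 2.0   # average pitch at the break
    left, right = list(left), list(right)
    for k in range(1, span + 1):           # k = 1 is the frame at the break
        w = 1.0 - (k - 1) / float(span)    # full weight at the break,
        if k <= len(left):                 # fading out towards the edges
            left[-k] = (1 - w) * left[-k] + w * target
        if k <= len(right):
            right[k - 1] = (1 - w) * right[k - 1] + w * target
    return left, right
.in-2n
.fi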
10626.rh "Results of an experiment on pitch transfer."
10627Some experiments have been conducted to evaluate the performance
10628of this pitch transfer method on the kind of utterances discussed above
10629(Witten, 1979).
10630.[
10631Witten 1979 On transferring pitch from one utterance to another
10632.]
10633First, the source and target sentences
10634were chosen to be lexically identical, that is, the same words were spoken.
10635For this experiment alone,
10636expert judges were employed.
10637Each sentence was recorded twice (by the same person),
10638and pitch was transferred from copy A
10639to copy B and vice versa.  Also, the originals were resynthesized from their linear
10640predictive coefficients with their own pitch contours.
10641Although all four often sounded extremely similar, sometimes the pitch
10642contours of originals A and B were quite different,
10643and in these cases it was immediately obvious to the ear that two of
10644the four utterances shared the same intonation,
which was different from that shared by the other two.
10646.pp
10647Experienced researchers in speech analysis-synthesis served as
10648judges.
10649In order to make the test as stringent as possible it was explained
10650to them exactly what had been done,
10651except that the order of the utterances in each quadruple was kept secret.
10652They were asked to identify which two of the four sentences did not have their
10653original contours,
10654and were allowed to listen to each quadruple as often as they liked.
10655On occasion they were prepared to identify only one, or even none,
10656of the sentences as artificial.
10657.pp
10658The result was that an utterance with pitch transferred
10659from another, lexically identical, one is indistinguishable from
10660a resynthesized version of the original, even to a skilled ear.
10661(To be more precise, this hypothesis
10662could not be rejected even at the 1% level of statistical significance.)  This
10663gave confidence in the transfer procedure.
10664However, one particular judge was quite successful at identifying the bogus contours,
10665and he attributed his success to the fact that
10666on occasion the segmental durations did not accord with the
10667pitch contour.
10668This casts a shadow of suspicion on the linear stretching and
10669squashing mechanism.
10670.pp
10671The second experiment examined pitch transfers between utterances having only one variable part
10672each ("the price is ... cents") to test the transfer
10673method under relatively controlled conditions.
10674Ten sentences of the form
10675.LB
10676"The price is \(em cents"
10677.LE
10678were selected to cover
10679a wide range of syllable structures.
10680Each one was regenerated with pitch transferred from each of
10681the other nine,
10682and these nine versions were paired with the original resynthesized
10683with its natural pitch.
10684The $10 times 9=90$ resulting pairs were recorded on tape in random order.
10685.pp
10686Five males and five females, with widely differing occupations
10687(secretaries, teachers, academics, and students), served as judges.
10688Written instructions explained that the tape contained pairs of
10689sentences which were lexically identical but had a slight difference
10690in "tone of voice", and that the subjects were to judge which of
10691each pair sounded "most natural and intelligible".  The
10692response form gave the price associated with each pair \(em
10693a preliminary experiment had shown that there was never
10694any difficulty in identifying this \(em and a column for decision.
10695With each decision, the subjects recorded their confidence in the decision.
10696Subjects could rest at any time during the test, which lasted for about
1069730 minutes, but they were not permitted to hear any pair a second time.
10698.pp
10699Defining a "success" to be a choice of the utterance with
10700natural pitch as the best of a pair,
10701the overall success rate was about 60%.
10702If choices were random, one would of course expect only a 50% success rate,
10703and the figure obtained was significantly different from this.
10704Almost half the choices were correct and made with high confidence;
10705high-confidence but incorrect choices accounted for a quarter of the
10706judgements.
10707.pp
10708To investigate structural effects in the pitch transfer process,
low-confidence decisions were ignored to eliminate noise, and the others were
lumped together and tabulated by source and target utterance.
10711The number of stressed and unstressed syllables does not appear to play
10712an important part in determining whether a particular utterance is an
10713easy target.
10714For example, it proved to be particularly difficult to tell
10715.EQ
10716delim @@
10717.EN
10718natural from transferred contours with utterances $0.37 and $0.77.
10719.EQ
10720delim $$
10721.EN
10722In fact, the results showed no better than random discrimination for them,
10723even though the decisions in which listeners expressed little confidence
10724had been discarded.
10725Hence it seems that the syllable alignment procedure and the policy
10726of replication were successful.
10727.pp
10728.EQ
10729delim @@
10730.EN
10731The worst target scores were for utterances $0.11 and $0.79.
10732.EQ
10733delim $$
10734.EN
10735Both of these contained large unbroken voiced periods
10736in the "variable" part \(em almost twice as long as the next longest
10737voiced period.
10738The first has an unstressed syllable followed by
10739a stressed one with no break in voicing,
10740involving, in a natural contour,
10741a fast but continuous climb in pitch over the juncture,
10742and it is not surprising that it proved to be the most difficult target.
10743A more sophisticated "smoothing" algorithm than the
10744one used may be worth investigating.
10745.pp
10746In a third experiment, sentences with two variable parts were used to check
10747that the results of the second experiment extended to more complex
10748utterances.
10749The overall success rate was 75%, significantly different from chance.
10750However, a breakdown of the results by source and target utterance
10751showed that there was one contour (for the utterance
10752"the price is 19 dollars and 8 cents") which exhibited very successful
10753transfer, subjects identifying the transferred-pitch utterances at only
10754a chance level.
10755.pp
10756Finally, transfers of pitch from utterances with two variable parts
10757to those with one variable part were tested.
10758Pitch contours were transferred to sentences with the same "cents"
10759figure but no "dollars" part; for example,
10760"the price is five dollars and thirteen cents"
10761to
10762"the price is thirteen cents".  The
10763contour was simply copied between the corresponding
10764syllables, so that no adjustment needed to be made
10765for different syllable structures.
10766The overall score was 60 successes in 100 judgements \(em
10767the same percentage as in the second experiment.
10768.pp
10769To summarize the results of these four experiments,
10770.LB
10771.NP
10772even accomplished linguists cannot distinguish an utterance from one with
10773pitch transferred from a different recording of it;
10774.NP
10775when the utterance contained only one variable part embedded in a
10776carrier sentence,
10777lay listeners identified the original correctly in 60% of cases,
10778over a wide variety of syllable structures:  this
10779figure differs significantly from the chance value of 50%;
10780.NP
10781lay listeners identified the original confidently and correctly in
1078250% of cases; confidently but incorrectly in 25% of cases;
10783.NP
10784the greatest hindrance to successful transfer was the presence of
10785a long uninterrupted period of voicing in the target utterance;
10786.NP
10787the performance of the method deteriorates as the number
10788of variable parts in the utterances increases;
10789.NP
10790some utterances seemed to serve better than others as the pitch source for
10791transfer, although this was not correlated with complexity of syllable structure;
10792.NP
10793even when the utterance contained two variable parts,
10794there was one source utterance whose pitch contour was
10795transferred to all the others so successfully that listeners could not identify
10796the original.
10797.LE
10798.pp
10799The fact that only 60% of originals in the second experiment were
10800spotted by lay listeners in a stringent
10801paired-comparison test \(em many of them being identified without confidence \(em
10802does encourage the use of the procedure for generating stereotyped,
10803but different, utterances of high quality in voice-response systems.
10804The experiments indicate that although different syllable patterns
10805can be handled satisfactorily by this procedure,
10806long voiced periods should be avoided if possible when designing
10807the message set,
10808and that if individual utterances must contain multiple variable parts
10809the source utterance should be chosen with the aid of listening tests.
10810.sh "8.3  Assigning timing and pitch to synthetic speech"
10811.pp
10812The pitch transfer method can give good results within a fairly narrow
10813domain of application.
10814But like any speech output technique which treats complete utterances
10815as a single unit, with provision for a small number of slot-fillers to
accommodate data-dependent messages, it becomes unmanageable in more general
10817situations with a large variety of utterances.
10818As with segmental synthesis it becomes necessary to consider methods
10819which use a textual rather than an acoustically-based representation
10820of the prosodic features.
10821.pp
10822This raises a problem with prosodics that was not there for segmentals:  how
10823.ul
10824can
10825prosodic features be written in text form?
10826The standard phonetic transcription method does not give much help with
10827notation for prosodics.  It does provide a diacritical mark to indicate
10828stress, but this is by no means enough information for synthesis.
10829Furthermore, text-to-speech procedures (described in the next chapter)
10830promise to allow segmentals to be specified by an ordinary orthographic
10831representation of the utterance; but we have seen that considerable
10832intelligence is required to derive prosodic features from text.
10833(More than mere intelligence may be needed:  this is underlined by a paper
10834(Bolinger, 1972)
10835delightfully entitled
10836"Accent is predictable \(em if you're a mind reader"!)
10837.[
10838Bolinger 1972 Accent is predictable \(em if you're a mind reader
10839.]
10840.pp
10841If synthetic speech is to be used as a computer output medium rather
10842than as an experimental tool for linguistic research, it is important
10843that the method of specifying utterances is natural and easy to learn.
10844Prosodic features must be communicated to the computer in a manner
10845considerably simpler than individual duration and pitch specifications
10846for each phoneme, as was required in early synthesis-by-rule systems.
10847Fortunately, a notation has been developed for conveying some of the
10848prosodic features of utterances, as a by-product of the linguistically
10849important task of classifying the intonation contours used in
10850conversational English (Halliday, 1967).
10851.[
10852Halliday 1967
10853.]
10854This system has even been used to help foreigners speak English
10855(Halliday, 1970) \(em which emphasizes the fact that it was designed for use
10856by laymen, not just linguists!
10857.[
10858Halliday 1970 Course in spoken English: Intonation
10859.]
10860.pp
10861Here are examples of the way utterances can be conveyed to the ISP
10862speech synthesis system which was described in the previous chapter.
10863The notation is based upon Halliday's.
10864.LB
10865.NI
108663
10867.ul
10868^  aw\ t\ uh/m\ aa\ t\ i\ k  /s\ i\ n\ th\ uh\ s\ i\ s  uh\ v  /*s\ p\ ee\ t\ sh,
10869.NI
108701
10871.ul
10872^  f\ r\ uh\ m  uh  f\ uh/*n\ e\ t\ i\ k  /r\ e\ p\ r\ uh\ z\ e\ n/t\ e\ i\ sh\ uh\ n.
10873.LE
10874(Automatic synthesis of speech, from a phonetic representation.)  Three
10875levels of stress are distinguished:  tonic or "sentence" stress,
10876marked by "*" before the syllable; foot stress (marked by "/");
10877and unstressed syllables.
10878The notion of a "foot" controls the rhythm of the speech in a way that
10879will be described shortly.
10880A fourth level of stress is indicated on a segmental basis when a syllable
10881contains a reduced vowel.
10882.pp
10883Utterances are divided by punctuation into
10884.ul
10885tone groups,
10886which are the basic prosodic unit \(em there are two in the example.
10887The shape of the pitch contour is governed by a numeral at the start of
10888each tone group.
10889Crude control over pauses is achieved by punctuation marks:  full stop, for
10890example, signals a pause while comma does not.
10891(Longer pauses can be obtained by several full stops as in "...".)  The
10892"^" character stands for a so-called "silent stress" or breath point.
10893Word boundaries are marked by two spaces between phonemes.
10894As mentioned in the previous chapter, syllable boundaries and explicit
10895pitch and duration specifiers can also be included in the input.
10896If they are not, the ISP system will attempt to compute them.
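.pp
A program can split such a transcription into its feet quite simply.
The fragment below handles one tone group;
the data structure it returns, and the details of the parsing,
are illustrative assumptions rather than a description of the ISP
input routine.
.nf
.in+2n
def parse_tone_group(text):
    # text is one tone group, e.g. "3 ^  aw t uh/m aa t i k  /s i n ..."
    text = text.rstrip(",.")
    number, _, body = text.partition(" ")  # contour number for the group
    feet, tonic = [], None
    for chunk in body.split("/"):          # "/" begins a new foot
        chunk = chunk.strip()
        if not chunk:
            continue
        if chunk.startswith("*"):          # "/*" marks the tonic foot
            tonic = len(feet)
            chunk = chunk[1:]
        # two spaces separate words, single spaces separate phonemes;
        # "^" appears as a silent stress in the first position of a foot
        words = [w.split() for w in chunk.split("  ") if w.strip()]
        feet.append(words)
    if tonic is None:                      # no asterisk: take the final
        tonic = len(feet) - 1              # foot as the tonic
    return number, tonic, feet
.in-2n
.fi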
10897.rh "Rhythm."
10898Our understanding of speech rhythm knows many laws but little order.
10899In the mid 1970's there was a spate of publications reporting new data
10900on segmental duration in various contexts, and there is a growing
10901awareness that segmental duration is influenced by a great many factors,
10902ranging from the structure of a discourse, through semantic and syntactic
10903attributes of the utterances, their phonemic and phonetic make-up,
10904right down to physiological constraints
10905(these multifarious influences are ably documented and reviewed by
10906Klatt, 1976).
10907.[
10908Klatt 1976 Linguistic uses of segment duration in English
10909.]
10910What seems to be lacking in this work is a conceptual framework on to
10911which new information about segmental duration can be nailed.
10912.pp
10913One starting-point for imitating the rhythm of English speech is the
10914hypothesis of regularly recurring stresses.
10915These stresses are primarily
10916.ul
10917rhythmic
10918ones, and should be distinguished from the tonic stress mentioned above which
10919is primarily an
10920.ul
10921intonational
10922one.
10923Rhythmic stresses are marked in the transcription by a "/".
10924The stretch between one and the next is called a "foot",
10925and the hypothesis above is often referred to as that of isochronous feet
10926("isochronous" means "of equal time").
10927There is considerable controversy about this hypothesis.
10928It is most popular among British linguists and, it must be admitted,
10929amongst those who work by introspection and intuition and do not actually
10930.ul
10931measure
10932things.
10933Although the question of isochrony of feet has long been debated, there
10934seems to be general agreement
10935\(em even amongst American linguists \(em
10936that there is at least a tendency towards
10937equal spacing of foot boundaries.
10938However, little is known about the strength of this tendency and the extent
10939of deviations from it (see Hill
10940.ul
10941et al,
109421979, for an attempt
10943to quantify it) \(em and there is even evidence to suggest that it may in part
10944be a
10945.ul
10946perceptual
10947phenomenon (Lehiste, 1973).
10948.[
10949Hill Jassem Witten 1979
10950.]
10951.[
10952Lehiste 1973
10953.]
10954On this basic point, as on many others, the designer of a prosodic synthesis
strategy is forced to make assumptions which cannot be properly justified.
10956.pp
10957From a pragmatic point of view there are two advantages to basing
10958a synthesis strategy on this hypothesis.
10959Firstly, it provides a way to represent the many influences of higher-level
10960processes (like syntax and semantics) on rhythm using a simple notation which
10961fits naturally into the phonetic utterance representation,
10962and which people find quite easy to understand and generate.
10963Secondly, it tends to produce a heavily accentuated, but not unnatural,
10964speech rhythm which can easily be moderated into a more acceptable rhythm
10965by departing from isochrony in a controlled manner.
10966.pp
10967The ISP procedure does not make feet exactly isochronous.
10968It starts with a standard foot time and attempts to fit the syllables of the
10969foot into this time.
10970If doing so would result in certain syllables having less than a preset minimum
10971duration, the isochrony constraint is relaxed and the foot is expanded.
10972There is no preset
10973.ul
10974maximum
10975syllable length.
10976However, when the durations of individual phoneme postures are adjusted
10977to realize the calculated syllable durations,
10978limits are imposed on the amount by which individual phonemes can be expanded
10979or contracted.
10980Thus a hierarchy of limits exists.
10981.pp
10982The rate of talking is determined by the standard foot time.
10983If this time is short, many feet will be forced to have durations longer than
10984the standard, and the speech will be "less isochronous".
10985This seems to accord with common human experience.
10986If the standard time is longer, however, the minimum syllable limit
10987will always be exceeded and the speech will be completely isochronous.
10988If it is too long, the above-mentioned limits to phoneme expansion will
10989come into play and again partially destroy the isochrony.
10990.pp
It has often been observed that the final foot of an utterance tends to be
longer than the others, as does the tonic foot \(em the foot that bears the
major stress.
This is easy to accommodate, simply by making the target duration
longer for these feet.
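.pp
The essence of the scheme is captured by a few lines of program.
The particular durations below are illustrative assumptions,
not the values used by ISP:
.nf
.in+2n
STANDARD_FOOT = 500  # standard foot duration in msec (sets the speaking rate)
MIN_SYLLABLE = 100   # preset minimum syllable duration in msec
LENGTHEN = 1.2       # tonic and utterance-final feet get a longer target

def foot_duration(n_syllables, tonic=False, final=False):
    target = STANDARD_FOOT * (LENGTHEN if (tonic or final) else 1.0)
    # if the syllables cannot all reach the minimum duration within the
    # target time, the isochrony constraint is relaxed and the foot expands
    return max(target, n_syllables * MIN_SYLLABLE)
.in-2n
.fi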
10996.rh "From feet to syllables."
A foot is a succession of one or more syllables.
Since some feet contain more syllables than others,
some syllables must clearly occupy less time than others in order to preserve
the tendency towards isochrony of feet.
11001.pp
11002However, the duration of a foot is not divided evenly between its constituent
11003syllables.  The syllables have a definite rhythm of their own, which seems
11004to be governed by
11005.LB
11006.NP
11007the nature of the salient (that is, the first) syllable of the foot
11008.NP
11009the presence of word boundaries within the foot.
11010.LE
11011A salient syllable tends to be long either if it contains one of
11012a class of so-called "long" vowels, or if there is a cluster of two or more
11013consonants following the vowel.
11014The pattern of syllables and word boundaries governs the rhythm of the foot,
11015and Table 8.2 shows the possibilities for one-, two-, and three-syllable feet.
11016This theory of speech rhythm is due to Abercrombie (1964).
11017.[
11018Abercrombie 1964 Syllable quantity and enclitics in English
11019.]
11020.RF
11021.nr x2 \w'three-syllable feet  'u
11022.nr x3 \w'sal-short  'u
11023.nr x4 \w'weak [#]  'u
11024.nr x5 \w'weak      'u
11025.nr x6 \w'/\fIit s incon\fR/ceivable    'u
11026.nr x1 (\w'syllable rhythm'/2)
11027.nr x7 \n(x2+\n(x3+\n(x4+\n(x5+\n(x6+\n(x1+\n(x1
11028.nr x7 (\n(.l-\n(x7)/2
11029.in \n(x7u
11030.ta \n(x2u +\n(x3u +\n(x4u +\n(x5u +\n(x6u
11031.ul
11032	syllable pattern		example	\0\0\h'-\n(x1u'syllable rhythm
11033.sp
11034one-syllable feet	salient			/\fIgood\fR /show	1
11035	^	weak		/\fI^ good\fR/bye	2:1
11036.sp
11037two-syllable feet	sal-long	weak		/\fIcentre\fR /forward	1:1
11038	sal-short	weak		/\fIatom\fR /bomb	1:2
11039	salient  #	weak		/\fItea for\fR /two	2:1
11040.sp
11041three-syllable feet	salient  #	weak [#]	weak	/\fIone for the\fR /road	2:1:1
11042				/\fIit's incon\fR/ceivable
11043	sal-long	weak #	weak	/\fIafter the\fR /war	2:3:1
11044	sal-short	weak #	weak	/\fImiddle to\fR /top	1:3:2
11045	sal-long	weak	weak	/\fInobody\fR /knows	3:1:2
11046	sal-short	weak	weak	/\fIanything\fR /more	1:1:1
11047.sp
11048	# denotes a word boundary;
11049	[#] is an optional word boundary
11050.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
11051.FG "Table 8.2  Syllable patterns and rhythms"
11052.pp
11053A foot may have the rhythmical characteristics of a two-syllable foot
11054while having only one syllable, if the first place in it is filled by a
11055silent stress (marked by "^").
11056This is shown in the second one-syllable example of
11057Table 8.2.
11058A similar effect may occur with two- and three-syllable feet,
11059although examples are not given in the table.
11060Feet of four and five syllables \(em with or without a silent stress \(em are
11061considerably rarer.
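.pp
In program form the rhythms of Table 8.2 are no more than a small
lookup table.
The encoding of the syllable patterns below is an assumption made for
illustration; the symbol "#" marks a word boundary within the foot,
"^" a silent stress, and the optional boundary of the 2:1:1 pattern is
assumed to have been removed before the table is consulted.
.nf
.in+2n
# rhythm ratios for one-, two- and three-syllable feet (Table 8.2)
RATIOS = {
    ("salient",):                       (1,),
    ("^", "weak"):                      (2, 1),
    ("sal-long", "weak"):               (1, 1),
    ("sal-short", "weak"):              (1, 2),
    ("salient", "#", "weak"):           (2, 1),
    ("salient", "#", "weak", "weak"):   (2, 1, 1),
    ("sal-long", "weak", "#", "weak"):  (2, 3, 1),
    ("sal-short", "weak", "#", "weak"): (1, 3, 2),
    ("sal-long", "weak", "weak"):       (3, 1, 2),
    ("sal-short", "weak", "weak"):      (1, 1, 1),
}

def syllable_durations(pattern, foot_msec):
    # divide the duration of the foot among its syllables in the
    # proportions of Table 8.2 (a silent stress takes a share just as a
    # syllable does); equal shares if the pattern is unknown
    ratios = RATIOS.get(tuple(pattern),
                        (1,) * len([p for p in pattern if p != "#"]))
    unit = foot_msec / float(sum(ratios))
    return [r * unit for r in ratios]
.in-2n
.fi
Combined with the foot duration computed in the previous sketch, this
yields a duration for every syllable.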
11062.pp
11063Syllabification \(em splitting an utterance into syllables \(em is a job
11064which had to be done for the pitch-transfer procedure described earlier,
11065and the nature of syllable rhythms calls for it here too.
11066Even though the utterance is now specified phonetically instead of
11067acoustically, the same basic principle applies.
11068Syllables normally coincide with peaks of sonority,
11069where "sonority" measures the inherent loudness of a sound relative to
11070other sounds of the same duration and pitch.
11071However, difficult cases exist where it seems to be unclear how many syllables
11072there are in a word.  (Ladefoged, 1975, discusses this problem with examples
11073such as "real", "realistic", and "reality".)  Furthermore,
11074.[
11075Ladefoged 1975
11076.]
11077care must be taken to avoid counting two syllables in a word like "sky"
11078because of its two peaks of sonority \(em for the stop
11079.ul
11080k
11081has lower
11082sonority than the fricative
11083.ul
11084s.
11085.pp
11086Three levels of notional sonority are enough for syllabification.
If phoneme segments are divided into
.ul
sonorants
(glides and nasals),
.ul
obstruents
(stops and fricatives), and vowels, then a general syllable has the form
11094.LB
11095.EQ
11096<obstruent> sup * ~ <sonorant> sup * ~ <vowel> sup * ~ <sonorant> sup * ~
11097<obstruent> sup * ~ ,
11098.EN
11099.LE
11100where "*" means repetition, that is, occurrence zero or more times.
11101This sidesteps the "sky" problem by giving fricatives the same
11102sonority as stops.
11103It is easy to use the above structure to count the number
11104of syllables in a given utterance by counting the sonority
11105peaks.
11106.pp
11107However, what is required is an indication of syllable
11108.ul
11109boundaries
11110as well as a syllable count.
11111For slow conversational speech, these can be approximated as follows.
11112Word divisions obviously form syllable boundaries, as should
11113foot markers \(em but it may be wise not to assume that the latter do if the
11114utterance has been prepared by someone with little knowledge of linguistics.
11115Syllable boundaries should be made to coincide with sonority minima.
11116As an
11117.ul
11118ad hoc
11119pragmatic
11120rule, if only one segment has the minimum sonority the boundary is placed
11121before it.
11122If there are two segments, each with the minimum sonority, it is placed between
11123them, while for three or more it is placed after the first two.
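.pp
These rules translate directly into program form.
The classification table below is abbreviated and purely illustrative
(a real system would hold the sonority class of every phoneme),
and the sketch assumes that every syllable contains a vowel:
.nf
.in+2n
SONORITY = {"vowel": 2, "sonorant": 1, "obstruent": 0}

def classify(ph):
    # abbreviated, illustrative classification of a few phoneme symbols
    if ph in ("p", "b", "t", "d", "k", "g", "f", "v", "th", "s", "z", "sh"):
        return "obstruent"
    if ph in ("m", "n", "ng", "l", "r", "w", "y"):
        return "sonorant"
    return "vowel"

def syllable_boundaries(phonemes):
    son = [SONORITY[classify(p)] for p in phonemes]
    nuclei = [i for i, s in enumerate(son) if s == 2]  # the sonority peaks
    cuts = []
    for a, b in zip(nuclei, nuclei[1:]):
        if b - a < 2:
            continue                     # adjacent vowels: same nucleus
        valley = range(a + 1, b)         # consonants between two nuclei
        low = min(son[i] for i in valley)
        run = [i for i in valley if son[i] == low]
        if len(run) == 1:
            cuts.append(run[0])          # boundary before the segment
        elif len(run) == 2:
            cuts.append(run[1])          # boundary between the two
        else:
            cuts.append(run[0] + 2)      # boundary after the first two
    return cuts      # indices of the phonemes that begin new syllables
.in-2n
.fi
With plausible phonetic transcriptions, a rule of this form reproduces
the divisions given in the next paragraph.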
11124.pp
11125These rules produce obviously acceptable divisions in many cases
11126(to'day, ash'tray, tax'free), with perhaps unexpected positioning of the
11127boundary in others (ins'pire, de'par'tment).
11128Actually, people do differ in placement of syllable boundaries
11129(Abercrombie, 1967).
11130.[
11131Abercrombie 1967
11132.]
11133.rh "From syllables to segments."
11134The theory of isochronous feet (with the caveats noted earlier)
11135and that of syllable rhythms provide a way of producing durations for
11136individual syllables.  But where are these durations supposed to be measured?
11137There is a beat point, or tapping point, near the beginning of each syllable.
11138This is the place where a listener will tap if asked to give one tap to each
11139syllable; it has been investigated experimentally by Allen (1972).
11140.[
11141Allen 1972 Location of rhythmic stress beats in English One
11142.]
11143It is not necessarily at the very beginning of the syllable.
11144For example, in "straight", the tapping point is certainly after the
11145.ul
11146s
11147and the stopped part of the
11148.ul
11149t.
11150.pp
11151Another factor which relates to the division of the syllable duration
11152amongst phonetic segments is the often-observed fact that the length of the
11153vocalic nucleus is a strong clue to the degree of voicing of the terminating
11154cluster (Lehiste, 1970).
11155.[
11156Lehiste 1970 Suprasegmentals
11157.]
11158If you say in pairs words like "cap", "cab"; "cat", "cad"; "tack", "tag"
11159you will find that the vowel in the first word of each pair is significantly
11160shorter than that in the second.
11161In fact, the major difference between such pairs is the vowel length,
11162not the final consonant.
11163.pp
11164Such effects can be taken into account by considering a syllable to comprise
11165an initial consonant cluster, followed by a vocalic nucleus and a final
11166consonant cluster.
11167Any of these elements can be missing \(em the most unusual case where the
11168nucleus is absent occurs, for example, in so-called syllabic
11169.ul
11170n\c
11171\&'s
11172(as in renderings of "button", "pudding" which might be written
11173"butt'n", "pudd'n").
11174However, it is convenient to modify the definition of the nucleus
11175so as to rule out the possibility of it being empty.
11176Using the characterization of the syllable given above, the clusters can
11177be defined as
11178.LB
11179.NI
11180initial cluster	=  <obstruent>\u*\d <sonorant>\u*\d
11181.NI
11182nucleus	=  <vowel>\u*\d <sonorant>\u*\d
11183.NI
11184final cluster	=  <obstruent>\u*\d.
11185.LE
11186Sonorants are included in the nucleus so that it is always present,
11187even in the case of a syllabic consonant.
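.pp
A sketch of the division, operating on the sonority classes used
earlier (the labels "obstruent", "sonorant" and "vowel"),
shows how the nucleus absorbs the sonorants and so is never empty;
the function itself is illustrative, not part of ISP:
.nf
.in+2n
def split_syllable(kinds):
    # kinds: the syllable as a list of "obstruent"/"sonorant"/"vowel"
    # labels; returns the initial cluster, the nucleus and the final
    # cluster as three sub-lists (either cluster may be empty, but the
    # nucleus never is, provided the syllable contains a vowel or a
    # sonorant)
    if "vowel" in kinds:
        v0 = kinds.index("vowel")
        v1 = len(kinds) - 1 - kinds[::-1].index("vowel")
        n1 = v1
        while n1 + 1 < len(kinds) and kinds[n1 + 1] == "sonorant":
            n1 += 1                  # trailing sonorants join the nucleus
        return kinds[:v0], kinds[v0:n1 + 1], kinds[n1 + 1:]
    # no vowel (a syllabic consonant): the sonorant run is the nucleus
    s = [i for i, k in enumerate(kinds) if k == "sonorant"]
    return kinds[:s[0]], kinds[s[0]:s[-1] + 1], kinds[s[-1] + 1:]
.in-2n
.fi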
11188.pp
11189Then, rules can be used to divide the syllable duration between the
11190initial cluster, nucleus, and final cluster.
11191These must distinguish between situations where the terminating cluster
11192is voiced or unvoiced so that the characteristic differences in vowel lengths
can be accommodated.
11194.pp
11195Finally, the cluster durations must be apportioned amongst their constituent
11196phonetic segments.  There is little published data on which to base this.
11197Two simple schemes which have been used in ISP are described in
11198Witten (1977) and Witten & Smith (1977).
11199.[
11200Witten 1977 A flexible scheme for assigning timing and pitch to synthetic speech
11201.]
11202.[
11203Witten Smith 1977 Synthesizing British English rhythm
11204.]
11205.rh "Pitch."
11206There are two basically different ways of looking at the pitch of an
11207utterance.
11208One is to imagine pitch
11209.ul
11210levels
11211attached to individual syllables.
11212This has been popular amongst American linguists, and some people
11213have even gone so far as to associate pitch levels with levels of
11214stress.
11215The second approach is to consider pitch
11216.ul
11217contours,
11218as we did earlier when examining how to transfer pitch from one utterance
11219to another.
11220This seems to be easier for the person who transcribes the utterances
11221to produce, for the information required is much less detailed than levels
11222attached to each syllable.  Some indication needs to be given of how
11223the contour is to be bound to the utterance, and in the notation introduced above
11224the most prominent, or "tonic", syllable is indicated in the transcription.
11225.pp
11226Halliday's (1970) classification identifies five different primary intonation
11227contours, each hinging on the tonic syllable.
11228.[
11229Halliday 1970 Course in spoken English: Intonation
11230.]
11231These are sketched in Figure 8.5, in the style of Halliday.
11232.FC "Figure 8.5"
11233Several secondary contours, which are variations on the primary ones,
11234are defined as well.
11235However, this classification scheme is intended for consumption by people,
11236who bring to the problem a wealth of prior knowledge of speech and years
11237of experience with it!  It captures only the gross features
11238of the infinite variety of pitch contours found in living speech.
11239In a sense, the classification is
11240.ul
11241phonological
11242rather than
11243.ul
11244phonetic,
11245for it attempts to distinguish the features which make a logical difference
11246to the listener instead of the acoustic details of the pitch contours.
11247.pp
11248It is necessary to take these contours and subject them to a sort of
11249phonological-to-phonetic embellishment before applying them in synthetic
11250speech.
11251For example, the stretches with constant pitch which precede the tonic
11252syllable in tone groups 1, 2, and 3 sound
11253most unnatural when synthesized \(em for pitch is hardly ever
11254exactly constant in living speech.
11255Some pretonic pitch variation is necessary,
11256and this can be made to emphasize the salient syllable
11257of each foot.  A "lilting" effect which reaches a peak at each foot
11258boundary, and drops rather faster at the beginning of the foot than it
11259rises at the end, sounds more natural.  The magnitude of this inflection
11260can be altered slightly to add interest, but a considerable increase in it
11261produces a semantic change by making the utterance sound more emphatic.
11262It is a major problem to pin down exactly the turning points of pitch in
11263the falling-rising and rising-falling contours (4 and 5 in Figure 8.5).
11264And even deciding on precise values for the pitch frequencies involved is not
11265always easy.
11266.pp
11267The aim of the pitch assignment method of ISP is to allow the person
(or program) that originates a spoken message to exercise a great deal
of control over its intonation, without being concerned with
11270foot or syllable structure.  The message to be spoken must be broken down
11271into tone groups,
11272which correspond roughly to Halliday's tone groups.
11273Each one comprises a
11274.ul
11275tonic
11276of one or more feet, which is optionally preceded by a
11277.ul
11278pretonic,
11279also with a number of feet.  It is advantageous to allow a tone group
11280boundary to occur in the middle of a foot (whereas Halliday's scheme
11281insists that it occurs at a foot boundary).
11282The first foot of the tonic, the
11283.ul
11284tonic foot,
11285is marked by an asterisk at the beginning.
11286It is on the first syllable of this foot \(em the
11287"tonic" or "nuclear"
11288syllable \(em that the major stress of the tone group occurs.
11289If there is no asterisk in a tone group,
11290ISP takes the final foot as the tonic
11291(since this is the most common case).
11292.pp
11293The pitch contour on a tone group is specified by an array of ten numbers.
11294Of course, the system cannot generate all conceivable contours for a tone
11295group, but the definitions of the ten specifiable quantities have been
11296chosen to give a useful range of contours.
11297If necessary, more precise control over the pitch of an utterance can
11298be achieved by making the tone groups smaller.
11299.pp
11300The overall pitch movement is controlled by specifying the pitch at three
11301places:  the beginning of the tone group, the beginning of the tonic syllable,
11302and the end of the tone group.
11303Provision is made for an abrupt pitch break at the start of the tonic
11304syllable in order to simulate tone groups 2 and 3, and, to a lesser
11305extent, tone groups 4 and 5.
11306The pitch is interpolated linearly over the first part of the
11307tone group (up to the tonic syllable) and over the last part (from there to
11308the end), except that it is possible to specify a non-linearity on the tonic
11309syllable, for emphasis, as shown in Figure 8.6.
11310.FC "Figure 8.6"
11311.pp
11312On this basic shape are superimposed two finer pitch patterns.
11313One of these is an initialization-continuation option which allows
11314the pitch to rise (or fall) independently on the initial and final feet
11315to specified values, without affecting the contour on the rest
11316of the tone group (Figure 8.7).
11317.FC "Figure 8.7"
11318The other is a foot pattern which is superimposed on each pretonic foot,
11319to give the stressed syllables of the pretonic added prominence and avoid
11320the monotony of constant pitch.
11321This is specified by a
11322.ul
11323non-linearity
11324parameter which distorts the contour on the foot at a pre-determined
11325point along it.
11326Figure 8.8 shows the effect.
11327.FC "Figure 8.8"
11328.pp
11329The ten quantities that define a pitch contour are summarized in
11330Table 8.3, and shown diagrammatically in Figure 8.9.
11331.FC "Figure 8.9"
11332.RF
11333.nr x0 \w'H:    'u
11334.nr x1 \n(x0+\w'fraction along foot of the non-linearity position, for the tonic foot'u
11335.nr x1 (\n(.l-\n(x1)/2
11336.in \n(x1u
11337.ta \n(x0u +4n
11338A:	continuation from previous tone group
11339		zero gives no continuation
11340		non-zero gives pitch at start of tone group
11341B:	notional pitch at start
11342C:	pitch range on whole of pretonic
11343D:	departure from linearity on each foot of pretonic
11344E:	pitch change at start of tonic
11345F:	pitch range on tonic
11346G:	departure from linearity on tonic
11347H:	continuation to next tone group
11348		zero gives no continuation
11349		non-zero gives pitch at end of tone group
11350I:	fraction along foot of the non-linearity position, for pretonic feet
11351J:	fraction along foot of the non-linearity position, for the tonic foot
11352.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
11353.in 0
11354.FG "Table 8.3  The quantities that define a pitch contour"
11355.pp
11356The intention of this parametric method of specifying contours
11357is that the parameters should be easily derivable from semantic variables
like emphasis, novelty of idea, surprise, uncertainty, and incompleteness.
11359Here we really are getting into controversial, unresearched areas.
11360Roughly speaking, parameters D and G control emphasis, G by itself
11361controls novelty and surprise, and H and the relative sizes of E and F
11362control uncertainty and incompleteness.
Certain parameters (notably I and J) are included because, although they
do not appear to correspond to semantic distinctions, we do not yet know
how to generate them automatically.
11366.RF
11367.nr x0 0.6i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+\w'0000'
11368.nr x1 (\n(.l-\n(x0)/2
11369.in \n(x1u
11370.ta 0.6i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i
11371Halliday's
11372tone group	\0\0A	\0\0B	\0\0C	\0\0D	\0\0E	\0\0F	\0\0G	\0\0H	\0\0I	\0\0J
11373\l'\n(x0u\(ul'
11374.sp
11375	1	\0\0\00	\0175	\0\0\00	\0\-40	\0\0\00	\-100	\0\-40	\0\0\00	0.33	\00.5
11376	2	\0\0\00	\0280	\0\0\00	\0\-40	\-190	\0100	\0\0\00	\0\0\00	0.33	\00.5
11377	3	\0\0\00	\0175	\0\0\00	\0\-40	\0\-70	\0\045	\0\-10	\0\0\00	0.33	\00.5
11378	4	\0\0\00	\0280	\-100	\0\-40	\0\020	\0\045	\0\-45	\0\0\00	0.33	\00.5
11379	5	\0\0\00	\0175	\0\060	\0\-40	\0\-20	\0\-45	\0\045	\0\0\00	0.33	\00.5
11380\l'\n(x0u\(ul'
11381.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
11382.in 0
11383.FG "Table 8.4  Pitch contour table for Halliday's primary tone groups"
11384.pp
11385One basic requirement of the pitch assignment scheme was the ability to
11386generate contours which approximate Halliday's five primary tone groups.
11387Values of the ten specifiable quantities are given in Table 8.4, for each
11388tone group.
11389All pitches are given in\ Hz.
11390A distinctly dipping pitch movement has been given to each pretonic foot
11391(parameter D),
11392to lend prominence to the salient syllables.
11393.sh "8.4  Evaluating prosodic synthesis"
11394.pp
11395It is extraordinarily difficult to evaluate schemes for prosodic synthesis,
11396and this is surely a large part of the reason why prosodics are among the
11397least advanced aspects of artificial speech.
11398Segmental synthesis can be tested by playing people minimal pairs of
11399words which differ in just one feature that is being investigated.
11400For example, one might experiment with "pit", "bit"; "tot", "dot";
11401"cot", "got" to test the rules which discriminate unvoiced from voiced stops.
11402There are standard word-lists for intelligibility tests which can be
11403used to compare systems, too.
11404No equivalent of such micro-level evaluation exists for prosodics,
11405for they by definition have a holistic effect on utterances.
11406They are most noticeable, and most important, in longish stretches of speech.
Even monotonous, arrhythmic speech will be intelligible in
11408sufficiently short samples provided the segmentals are good enough;
11409but it is quite impossible to concentrate on such speech in quantity.
11410Some attempts at evaluation appear in Ainsworth (1974) and McHugh (1976),
11411but these are primarily directed at assessing the success of pronunciation
11412rules, which are discussed in the next chapter.
11413.[
11414Ainsworth 1974 Performance of a speech synthesis system
11415.]
11416.[
11417McHugh 1976 Listener preference and comprehension tests
11418.]
11419.pp
11420One evaluation technique is to compare synthetic with natural versions
11421of utterances, as was done in the pitch transfer experiment.
11422The method described earlier used a sensitive paired-comparison test,
11423where subjects heard both versions in quick succession and were asked
11424to judge which was "most natural and intelligible".
11425This is quite a stringent test, and one that may not be so useful
11426for inferior, completely synthetic, contours.
11427It is essential to degrade the "natural" utterance so that it is
11428comparable segmentally to the synthetic one:  this was done in the
11429experiment described by extracting its pitch and resynthesizing it
11430from linear predictive coefficients.
11431.pp
11432Several other experiments could be undertaken to evaluate artificial
11433prosody.
11434For example, one could compare
11435.LB
11436.NP
11437natural and artificial rhythms, using artificial segmental synthesis
11438in both cases;
11439.NP
11440natural and artificial pitch contours, using artificial segmental synthesis
11441in both cases;
11442.NP
11443natural and artificial pitch contours, using segmentals extracted from
11444natural utterances.
11445.LE
11446There are many other topics which have not yet been fully investigated.
11447It would be interesting, for example, to define rules for generating speech
11448at different tempos.
11449Elisions, where phonemes or even whole syllables are suppressed,
11450occur in fast speech; these have been analyzed by linguists
11451but not yet incorporated into synthetic models.
11452It should be possible to simulate emotion by altering parameters such as
11453pitch range and mean pitch level; but this seems exceptionally difficult
11454to evaluate.  One situation where it would perhaps be possible to
11455measure emotion is in the reading of sports results \(em in fact a study
11456has already been made of intonation in soccer results (Bonnet, 1980)!
11457.[
11458Bonnet 1980
11459.]
11460Even the synthesis of voices with different pitch ranges requires
11461investigation, for, as noted earlier, it is difficult to place
11462precise frequency specifications on phonological contours such as
11463those sketched in Figure 8.5.
11464Clearly the topic of prosodic synthesis is a rich and potentially
11465rewarding area of research.
11466.sh "8.5  References"
11467.LB "nnnn"
11468.[
11469$LIST$
11470.]
11471.LE "nnnn"
11472.sh "8.6  Further reading"
11473.pp
11474There are quite a lot of books in the field of linguistics which
11475describe prosodic features.
11476Here is a small but representative sample from both sides of the Atlantic.
11477.LB "nn"
11478.\"Abercrombie-1965-1
11479.]-
11480.ds [A Abercrombie, D.
11481.ds [D 1965
11482.ds [T Studies in phonetics and linguistics
11483.ds [I Oxford Univ Press
11484.ds [C London
11485.nr [T 0
11486.nr [A 1
11487.nr [O 0
11488.][ 2 book
11489.in+2n
11490Abercrombie is one of the leading English authorities on phonetics,
11491and this is a collection of essays which he has written over the years.
11492Some of them treat prosodics explicitly, and others show the influence
11493of verse structure on Abercrombie's thinking.
11494.in-2n
11495.\"Bolinger-1972-2
11496.]-
11497.ds [A Bolinger, D.(Editor)
11498.ds [D 1972
11499.ds [T Intonation
11500.ds [I Penguin
11501.ds [C Middlesex, England
11502.nr [T 0
11503.nr [A 0
11504.nr [O 0
11505.][ 2 book
11506.in+2n
11507A collection of papers that treat a wide variety of different aspects
11508of intonation in living speech.
11509.in-2n
11510.\"Crystal-1969-3
11511.]-
11512.ds [A Crystal, D.
11513.ds [D 1969
11514.ds [T Prosodic systems and intonation in English
11515.ds [I Cambridge Univ Press
11516.nr [T 0
11517.nr [A 1
11518.nr [O 0
11519.][ 2 book
11520.in+2n
11521This book attempts to develop a theoretical basis for the study of British
11522English intonation.
11523.in-2n
11524.\"Gimson-1966-3
11525.]-
11526.ds [A Gimson, A.C.
11527.ds [D 1966
11528.ds [T The linguistic relevance of stress in English
11529.ds [B Phonetics and linguistics
11530.ds [E W.E.Jones and J.Laver
11531.ds [P 94-102
11532.nr [P 1
11533.ds [I Longmans
11534.ds [C London
11535.nr [T 0
11536.nr [A 1
11537.nr [O 0
11538.][ 3 article-in-book
11539.in+2n
11540Here is a careful discussion of what is meant by "stress", with much more
11541detail than has been possible in this chapter.
11542.in-2n
11543.\"Lehiste-1970-4
11544.]-
11545.ds [A Lehiste, I.
11546.ds [D 1970
11547.ds [T Suprasegmentals
11548.ds [I MIT Press
11549.ds [C Cambridge, Massachusetts
11550.nr [T 0
11551.nr [A 1
11552.nr [O 0
11553.][ 2 book
11554.in+2n
11555This is a comprehensive study of suprasegmental phenomena in natural speech.
11556It is divided into three major sections:  quantity (timing), tonal features
11557(pitch), and stress.
11558.in-2n
11559.\"Pike-1945-5
11560.]-
11561.ds [A Pike, K.L.
11562.ds [D 1945
11563.ds [T The intonation of American English
11564.ds [I Univ of Michigan Press
11565.ds [C Ann Arbor, Michigan
11566.nr [T 0
11567.nr [A 1
11568.nr [O 0
11569.][ 2 book
11570.in+2n
11571A classic, although somewhat dated, study.
11572Notice that it deals specifically with American English.
11573.in-2n
11574.LE "nn"
11575.EQ
11576delim $$
11577.EN
11578.CH "9  GENERATING SPEECH FROM TEXT"
11579.ds RT "Generating speech from text
11580.ds CX "Principles of computer speech
11581.pp
11582In the preceding two chapters I have described how artificial speech
11583can be produced from a written phonetic representation with additional
11584markers indicating intonation contours, points of major stress, rhythm,
11585and pauses.
11586This representation is substantially the same as that used by linguists
11587when recording natural utterances.
11588What we will discuss now are techniques for generating this information,
11589or at least some of it, from text.
11590.pp
11591Figure 9.1 shows various levels of the speech synthesis process.
11592.FC "Figure 9.1"
11593Starting from the top with plain text, the first box splits it into
11594intonation units (tone groups), decides where the major emphases
11595(tonic stresses) should be placed,
11596and further subdivides the tone group into rhythmic units (feet).
11597For intonation analysis it is necessary to decide on an "interpretation"
11598of the text, which in turn, as was emphasized at the beginning of the
11599previous chapter, depends both on the semantics of what is being said and
11600on the attitude of the speaker to his material.
11601The resulting representation will be at the level of Halliday's notation
11602for utterances, with the words still in English rather than phonetics.
11603Table 9.1 illustrates the utterance representation at the various levels
11604of the Figure.
11605.RF
11606.nr x0 \w'pitch and duration    '+\w'at 8 kHz sampling rate a 4-second utterance'
11607.nr x1 (\n(.l-\n(x0)/2
11608.in \n(x1u
11609.ta \w'pitch and duration    'u +\w'pause  'u +\w'00 msec   'u
11610representation	example
11611\l'\n(x0u\(ul'
11612.sp
11613plain text	Automatic synthesis of speech,
11614	from a phonetic representation.
11615.sp
11616text adorned with	3\0^ auto/matic /synthesis of /*speech,
11617prosodic markers	1\0^ from a pho/*netic /represen/tation.
11618.sp
11619phonetic text with	3\0\fI^  aw t uh/m aa t i k  /s i n th uh s i s\fR
11620prosodic markers	\0\0\fIuh v  /*s p ee t sh\fR ,
11621	1\0\fI^  f r uh m  uh  f uh/*n e t i k\fR
11622	\0\0\fI/r e p r uh z e n/t e i sh uh n\fR .
11623.sp
11624phonemes with	pause	80 msec
11625pitch and duration	\fIaw\fR	70 msec	105 Hz
11626	\fIt\fR	40 msec	136 Hz
11627	\fIuh\fR	50 msec	148 Hz
11628	\fIm\fR	70 msec	175 Hz
11629	\fIaa\fR	90 msec	140 Hz
11630		...
11631		...
11632		...
11633.sp
11634parameters for	10 parameters, each updated at a frame
11635formant or linear	rate of 10 msec
11636predictive	(4 second utterance gives 400 frames,
11637synthesizer	or 4,000 data values)
11638.sp
11639acoustic wave	at 8 kHz sampling rate a 4-second utterance
11640	has 32,000 samples
11641\l'\n(x0u\(ul'
11642.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
11643.in 0
11644.FG "Table 9.1  Utterance representations at various levels in speech synthesis"
11645.pp
11646The next job is to translate the plain text into a broad phonetic
11647transcription.
11648This requires knowledge of letter-to-sound pronunciation
11649rules for the language under consideration.
11650But much more is needed.  The structure of each word must be examined for
11651prefixes and suffixes, because they \(em especially the latter \(em have a
11652strong influence on pronunciation.
11653This is called "morphological" analysis.
11654Actually it is also required for rhythmical purposes, because prefixes
11655are frequently unstressed (note that the word "prefix" is itself an
11656exception to this!).
11657Thus the appealing segmentation of the overall problem shown in Figure 9.1
11658is not very accurate, for the individual processes cannot be rigidly
11659separated as it implies.  In fact, we saw earlier how this intermixing of
11660levels occurs with prosodic and segmental features.
11661Nevertheless, it is helpful to structure discussion of the problem by
11662separating levels as a first approximation.
11663Further influences on pronunciation come from the semantics and syntax
11664of the utterance \(em and both also play a part in intonation and rhythm analysis.
11665The result of this second process is a phonetic representation, still
11666adorned with prosodic markers.
11667.pp
11668Now we move down from higher-level intonation and rhythm considerations
11669to the details of the pitch contour and segment durations.
11670This process was the subject of the previous chapter.
11671The problems are twofold:  to map an appropriate acoustic pitch contour
11672on to the utterance, using tonic stress point and foot boundaries as
11673anchor points; and to assign durations to segments using the
11674foot\(emsyllable\(emcluster\(emsegment hierarchy.
11675If it is accepted that the overall rhythm can be captured adequately by foot
11676markers, this process does not interact with earlier ones.
However, many researchers do not accept this, believing instead that rhythm is
11678syntactically determined at a very detailed level.
11679This will, of course, introduce strong interaction between the duration
11680assignment process and the levels above.
11681(Klatt, 1975, puts it into his title \(em
11682"Vowel lengthening is syntactically determined in a connected discourse".
11683.[
11684Klatt 1975 Vowel lengthening is syntactically determined
11685.]
11686Contrast this with the paper cited earlier (Bolinger, 1972) entitled
11687"Accent is predictable \(em if you're a mind reader".
11688.[
11689Bolinger 1972 Accent is predictable \(em if you're a mind reader
11690.]
11691No-one would disagree that "accent" is an influential factor in vowel length!)
11692.pp
11693Notice incidentally that the representation of the result of the pitch
11694and duration assignment process in Table 9.1 is inadequate, for each segment
11695is shown as having just one pitch.
11696In practice the pitch varies considerably throughout every segment,
11697and can easily rise and fall on a single one.  For example,
11698.LB
11699"he's
11700.ul
11701very
11702good"
11703.LE
11704may have a rise-fall on the vowel of "very".
11705The linked event-list data-structure of ISP is much more suitable
11706than a textual string for utterance representation at this level.
11707.pp
11708The fourth and fifth processes of Figure 9.1 have little interaction with
11709the first two, which are the subject of this chapter.  Segmental
11710concatenation, which was treated in Chapter 7, is affected by prosodic
11711features like stress; but a notation which indicates stressed syllables
11712(like Halliday's) is sufficient to capture this influence.
11713Contextual modification of segments, by which I mean
11714the coarticulation effects which govern allophones of phonemes,
11715is included explicitly in the fourth process to emphasize that the upper levels
11716need only provide a broad phonemic transcription rather than a detailed
11717phonetic one.
11718Signal synthesis can be performed by either a formant synthesizer or a
11719linear predictive one (discussed in Chapters 5 and 6).
11720This will affect the details of the segmental concatenation process but should have no
11721impact at all on the upper levels.
11722.pp
11723Figure 9.1 performs a useful function by summarizing where we have
11724been in earlier chapters \(em the lower three boxes \(em and introducing the
11725remaining problems that must be faced by a full text-to-speech system.
11726It also serves to illustrate an important point:  that a speech output system
11727can demand that its utterances be entered in any of a wide range of
11728representations.
11729Thus one can enter at a low level with a digitized waveform or linear
11730predictive parameters; or higher up with a phonetic representation
11731that includes detailed pitch and duration specification at the phoneme level;
11732or with a phonetic text or plain text adorned with prosodic markers;
11733or at the very top with plain text as it would appear in a book.
11734A heavy price in naturalness and intelligibility is paid by moving up
11735.ul
11736any
11737of these levels \(em and this is just as true at the top of the Figure as
11738at the bottom.
11739.sh "9.1  Deriving prosodic features"
11740.pp
11741If you really need to start with plain text,
11742some very difficult problems present themselves.
11743The text should be understood, first of all, and then decisions need to be
11744made about how it is to be interpreted.
11745For an excellent speaker \(em like an actor \(em these decisions will be artistic,
11746at least in part.
11747They should certainly depend upon the opinion and attitude of the speaker,
11748and his perception of the structure and progress of the dialogue.
11749Very little is known about this upper level of speech synthesis from text.
11750In practice it is almost completely ignored \(em and the speech is at most
11751barely intelligible, and certainly uncomfortable to listen to.
11752Hence anybody contemplating building or using a speech output system which
11753starts from something close to plain text should consider carefully whether some extra
11754semantic information can be coded into the initial utterances to help with
11755prosodic interpretation.
11756Only rarely is this impossible \(em and reading machines for the blind are
11757a prime example of a situation where arbitrary, unannotated, texts
11758must be read.
11759.rh "Intonation analysis."
11760One distinction which a program can usefully try
11761to make is between basically rising
11762and basically falling pitch contours.  It is often said that pitch rises on
11763a question and falls on a statement, but if you listen to speech you will
11764find this to be a gross oversimplification.  It normally
11765falls on statements, certainly; but it falls as often as it rises on questions.
11766It is more accurate to say that pitch rises on "yes-no" questions
11767and falls on other utterances, although this rule is still only a rough guide.
11768A simple test which operates lexically on the input text is to determine
11769whether a sentence is a question by looking at the
11770punctuation mark at its end, and then to examine the first word.
11771If it is a "wh"-word like "what", "which", "when", "why" (and also "how")
11772a falling contour is likely to fit.
11773If not, the question is probably a yes-no one, and the contour
11774should rise.
11775Such a crude rule will certainly not be very accurate
11776(it fails, for example, when the "wh"-word is embedded in a phrase as in
11777"at what time are you going?"), but at least it provides a starting-point.
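.pp
This crude test is easily written down as a program.  The following Python
fragment is illustrative only:  the word list and the names in it are not
taken from any particular system, and it inherits the failure on embedded
"wh"-words just mentioned.
.LB
WH_WORDS = {"what", "which", "when", "why", "where", "who", "how"}

def contour_direction(sentence):
    # decide between a basically rising and a basically falling contour
    words = sentence.strip().rstrip("?!.").lower().split()
    if not words:
        return "falling"
    if sentence.strip().endswith("?"):
        # a "wh"-question usually falls; otherwise assume a yes-no question
        return "falling" if words[0] in WH_WORDS else "rising"
    return "falling"

print(contour_direction("Is this train going to London?"))    # rising
print(contour_direction("What time does it leave?"))          # falling
print(contour_direction("At what time are you going?"))       # rising (wrongly)
.LE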
11778.pp
11779An air of finality is given to an utterance when it bears a definite
11780fall in pitch, dropping to a rather low value at the end.
11781This should accompany the last intonation unit in an utterance
11782(unless it is a yes-no question).
11783However, a rise-fall contour such as Halliday's tone group 5 (Figure 8.5)
11784can easily be used in utterance-final position by one person
11785in a conversation \(em
11786although it would be unlikely to terminate the dialogue altogether.
11787A new topic is frequently introduced by a fall-rise contour \(em such as
11788Halliday's tone group 4 \(em and this often begins a paragraph.
11789.pp
11790Determining the type of pitch contour is only one part of
11791intonation assignment.  There are really three separate problems:
11792.LB
11793.NP
11794dividing the utterance into tone groups
11795.NP
11796choosing the tonic syllable, or major stress point, of each one
11797.NP
11798assigning a pitch contour to each tone group.
11799.LE
11800Let us continue to use the Halliday notation for intonation, which was introduced
11801in simplified form in the previous chapter.
11802Moreover, assume that the foot boundaries can be placed correctly \(em
11803this problem will be discussed in the next subsection.
11804Then a scheme which considers only the lexical form of the utterance
and does not attempt to "understand" it (whatever that means) is as follows (a program sketch of the rules is given after the list):
11806.LB
11807.NP
11808place a tone group boundary at every punctuation mark
11809.NP
11810place the tonic at the first syllable of the last foot in a tone group
11811.NP
11812use contour 4 for the first tone group in a paragraph and contour 1
11813elsewhere, except for a yes-no question which receives contour 2.
11814.LE
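.pp
A program rendering of these three rules might look as follows.  The sketch
is illustrative only:  tone groups are cut at punctuation marks, feet are
crudely approximated by words, and the names are invented.
.LB
import re

def assign_intonation(paragraph):
    WH_WORDS = {"what", "which", "when", "why", "where", "who", "how"}
    # rule 1: a tone group boundary at every punctuation mark
    pieces = re.split(r"([,;:.?!])", paragraph)
    texts = pieces[0::2]
    marks = pieces[1::2] + [""]
    result = []
    index = 0
    for text, mark in zip(texts, marks):
        feet = text.split()        # stand-in for a real division into feet
        if not feet:
            continue
        # rule 2: the tonic goes on the first syllable of the last foot
        # (here, simply the last word)
        tonic = feet[-1]
        # rule 3: contour 4 paragraph-initially, 2 for a yes-no question,
        # and contour 1 everywhere else
        if mark == "?" and feet[0].lower() not in WH_WORDS:
            contour = 2
        elif index == 0:
            contour = 4
        else:
            contour = 1
        result.append((contour, tonic, text.strip()))
        index += 1
    return result

for group in assign_intonation("From Scarborough to Whitby is a very "
                               "pleasant journey, with very beautiful "
                               "countryside."):
    print(group)
# (4, 'journey', 'From Scarborough to Whitby is a very pleasant journey')
# (1, 'countryside', 'with very beautiful countryside')
.LE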
11815.RF
11816.nr x0 \w'From Scarborough to Whitby\0\0\0\0'+\w'4  ^  from /Scarborough to /*Whitby is a'
11817.nr x1 (\n(.l-\n(x0)/2
11818.in \n(x1u
11819.ta \w'From Scarborough to Whitby\0\0\0\0\0\0'u
11820plain text	text adorned with prosodic markers
11821\l'\n(x0u\(ul'
11822.sp
11823From Scarborough to Whitby is a	4 ^ from /Scarborough to /*Whitby is a
11824very pleasant journey, with	1\- very /pleasant /*journey with
11825very beautiful countryside.	1\- very /beautiful /*countryside ...
11826In fact the Yorkshire coast is	1+ ^ in /fact the /Yorkshire /coast is
11827\0\0\0\0lovely,	\0\0\0\0/*lovely
11828all along, ex-	1+ all a/*long ex
11829cept the parts that are covered	_4 cept the /parts that are /covered
11830\0\0\0\0in caravans of course; and	\0\0\0\0in /*caravans of /course and
11831if you go in spring,	4 if you /go in /*spring
11832when the gorse is out,	4 ^ when the /*gorse is /out
11833or in summer,	4 ^ or in /*summer
11834when the heather's out,	4 ^ when the /*heather's /out
11835it's really one of the most	13 ^ it's /really /one of the /most
11836\0\0\0\0delightful areas in the	\0\0\0\0de/*lightful /*areas in the
11837whole country.	1 whole /*country
11838.sp
11839The moorland is	4 ^ the /*moorland is
11840rather high up, and	1 rather /high /*up and
11841fairly flat \(em a	1 fairly /*flat a
11842sort of plateau.	1 sort of /*plateau ...
11843At least,	1 ^ at /*least
11844it isn't really flat,	13 ^ it /*isn't /really /*flat
11845when you get up on the top;	\-3 ^ when you /get up on the /*top
11846it's rolling moorland	1 ^ it's /rolling /*moorland
11847cut across by steep valleys.  But	1 cut across by /steep /*valleys but
11848seen from the coast it's	4 seen from the /*coast it's ...
11849"up there on the moors", and you	1 up there on the /*moors and you
11850always think of it as a	_4 always /*think of it as a
11851kind of tableland.	1 kind of /*tableland
11852\l'\n(x0u\(ul'
11853.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
11854.in 0
11855.FG "Table 9.2  Example of intonation and rhythm analysis (from Halliday, 1970)"
11856.[
11857Halliday 1970 Course in spoken English: Intonation
11858.]
11859.pp
11860These extremely crude and simplistic rules are really the most that one can do
11861without subjecting the utterance to a complicated semantic analysis.
11862In statistical terms, they are actually remarkably effective.
11863Table 9.2 shows part of a spontaneous monologue which was transcribed by
11864Halliday and appears in his teaching text on intonation
11865(Halliday, 1970, p 133).
11866.[
11867Halliday 1970 Course in Spoken English: Intonation
11868.]
11869Among the prosodic markers are some that were not introduced in Chapter 8.
11870Firstly, each tone group has secondary contours which are identified
11871by "1+", "1\-" (for tone group 1), and so on.
11872Secondly, the mark "..." is used to indicate a pause which disrupts
11873the speech rhythm.
11874Notice that its positioning belies the advice of the old elocutionists:
11875.br
11876.ev2
11877.in 0
11878.LB
11879.fi
11880A Comma stops the Voice while we may privately tell
11881.NI
11882.ul
11883one,
11884a Semi-colon
11885.ul
11886two;
11887a Colon
11888.ul
11889three:\c
11890  and a Period
11891.ul
11892four.
11893.br
11894.nr x0 \w'\fIone,\fR a Semi-colon \fItwo;\fR a Colon \fIthree:\fR  and a Period \fIfour.'-\w'(Mason,\fR 1748)'
11895.NI
11896\h'\n(x0u'(Mason, 1748)
11897.nf
11898.LE
11899.br
11900.ev
11901Thirdly, compound tone groups such as "13" appear which contain
11902.ul
11903two
11904tonic syllables.
11905This differs from a simple concatenation of tone groups
11906(with contours 1 and 3 in this case) because the second is in some sense subsidiary to
11907the first.
11908Typically it forms an adjunct clause, while the first clause gives the
11909main information.  Halliday provides many examples, such as
11910.LB
11911.NI
11912/Jane goes /shopping in /*town /every /*Friday
11913.NI
11914/^ I /met /*Arthur on the /*train.
11915.LE
11916But he does not comment on the
11917.ul
11918acoustic
11919difference between a compound tone group and a concatenation of simple ones \(em
11920which is, after all, the information needed for synthesis.
11921A final, minor, difference between Halliday's scheme and that outlined earlier
11922is that he compels tone group boundaries to occur at the beginning
11923of a foot.
11924.RF
11925.nr x0 3.3i+1.3i+\w'complete'
11926.nr x1 (\n(.l-\n(x0)/2
11927.in \n(x1u
11928.ta 3.3i +1.3i
11929	excerpt in	complete
11930	Table 9.2	passage
11931\l'\n(x0u\(ul'
11932.sp
11933number of tone groups	25	74
11934.sp
11935number of boundaries correctly	19 (76%)	47 (64%)
11936placed
11937.sp
11938number of boundaries incorrectly	\00	\01 (\01%)
11939placed
11940.sp
11941number of tone groups having a	22 (88%)	60 (81%)
11942tonic syllable at the beginning
11943of the final foot
11944.sp
11945number of tone groups whose	17 (68%)	51 (69%)
11946contours are correctly assigned
11947\l'\n(x0u\(ul'
11948.sp
11949number of compound tone groups	\02 (\08%)	\06 (\08%)
11950.sp
11951number of secondary intonation	\07 (28%)	13 (17%)
11952contours
11953\l'\n(x0u\(ul'
11954.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
11955.in 0
11956.FG "Table 9.3  Success of simple intonation assignment rules"
11957.pp
11958Applying the simple rules given above to the text of Table 9.2 leads to
11959the results in the first column of Table 9.3.
Three-quarters of the tone group boundaries are flagged by
11961punctuation marks, with no extraneous ones being included.
1196288% of tone groups have a tonic syllable at the start of the final foot.
11963However, the compound tone groups each have two tonic syllables,
11964and of course only the second one is predicted by the final-foot rule.
11965Assigning intonation contours on the extremely simple basis of using
11966contour 4 for the first tone group in a paragraph, and contour 1 thereafter,
11967also seems to work quite well.  Secondary contours such as "1+" and "1\-"
11968have been mapped into the appropriate primary contour (1, in this case)
11969for the present purpose, and compound tone groups have been assigned the first
11970contour of the pair.
11971The result is that 68% of contours are given correctly.
11972.pp
11973In order to give some idea of the reliability of these figures, the results
11974for the whole passage transcribed by Halliday \(em of which Table 9.2 is an
11975excerpt \(em are shown in the second column of Table 9.3.  Although it
11976looks as though the rules may have been slightly lucky with the excerpt,
11977the general trends are the same, with 65% to 80% of features being assigned
11978correctly.
11979It could be argued, though, that the complete text is punctuated fairly liberally by
11980present-day standards, so that the tone-group boundary rule is unusually
11981successful.
11982.pp
11983These results are really astonishingly good, considering the crudeness of
11984the rules.  However, they should be interpreted with caution.
11985What is missed by the rules, although appearing to comprise only
1198620% to 35% of the features, is certain to include the important,
11987information-bearing, and variety-producing features that give the utterance
11988its liveliness and interest.
11989It would be rash to assume that all tone-group boundaries,
11990all tonic positions, and all intonation contours, are equally
11991important for intelligibility and naturalness.
11992It is much more likely that the rules predict a
11993default pattern, while most information is borne by deviations from
11994them.
11995To give an engineering analogy, it may be as though the carrier waveform
11996of a modulated transmission is being simulated, instead of the
11997information-bearing signal!
11998Certainly the utterance will, if synthesized with intonation given by these
11999rules, sound extremely dull and repetitive, mainly because of the
12000overwhelming predominance of tone group 1 and the universal placement
12001of tonic stress on the final foot.
12002.pp
12003There are certainly many different ways to orate any particular text,
12004and that given by Halliday and reproduced in Table 9.2 is only one possible
12005version.
12006However, it is fair to say that the default intonation discussed above
12007could only occur naturally under very unusual circumstances \(em such as
12008a petulant child, unwilling and sulky, having been forced to read aloud.
12009This is hardly how we want our computers to speak!
12010.rh "Rhythm analysis."
12011Consider now how to decide where foot boundaries should be placed
12012in English text.
12013Clearly semantic considerations sometimes play a part in this \(em one could
12014say
12015.LB
12016/^ is /this /train /going /*to /London
12017.LE
12018instead of the more usual
12019.LB
12020/^ is /this /train /going to /*London
12021.LE
12022in circumstances where the train might be going
12023.ul
12024to
12025or
12026.ul
12027from
12028London.
12029Such effects are ignored here, although it is worth noting in passing that the
12030rogue words will often be marked by underscoring or italicizing
12031(as in the previous sentence).
12032If the text is liberally underlined, semantic analysis may
12033be unnecessary for the purposes of rhythm.
12034.pp
12035A rough and ready rule for placing foot boundaries is to insert one before
12036each word which is not in a small closed set of "function words".
12037The set includes, for example, "a", "and", "but", "for", "is", "the", "to".
12038If a verb or adjective begins with a prefix, the boundary should be moved
12039between it and the root \(em but not for a noun.
12040This will give the distinction between
12041.ul
12042con\c
12043vert (noun) and con\c
12044.ul
12045vert
12046(verb),
12047.ul
12048ex\c
12049tract and ex\c
12050.ul
12051tract,
12052and for many North American speakers,
12053will help to distinguish
12054.ul
12055in\c
12056quiry from in\c
12057.ul
12058quire.
12059However, detecting prefixes by a simple splitting algorithm is dangerous.
12060For example, "predate" is a verb with stress on what appears to be a prefix,
12061contrary to the rule; while the "pre" in "predator" is not a prefix \(em at
12062least, it is not pronounced as the prefix "pre" normally is.
12063Moreover, polysyllabic words like "/diplomat", "dip/lomacy", "diplo/matic";
12064or "/telegraph", "te/legraphy", "tele/graphic" cannot be handled on such a simple
12065basis.
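.pp
As far as it goes, the rough rule can be written down directly.  The fragment
below is only a sketch:  the function word list is a token sample, the part
of speech is assumed to be supplied from elsewhere, and the prefix test is
the naive splitting just warned against, with all of its failings.
.LB
FUNCTION_WORDS = {"a", "an", "and", "but", "for", "is",
                  "the", "to", "of", "in", "they"}
PREFIXES = ("con", "ex", "in", "re", "pre")

def mark_feet(words, parts_of_speech):
    # insert "/" before each word that is not a function word; for verbs
    # and adjectives move the boundary between a recognized prefix and
    # the root
    out = []
    for word, pos in zip(words, parts_of_speech):
        w = word.lower()
        if w in FUNCTION_WORDS:
            out.append(word)
            continue
        if pos in ("verb", "adjective"):
            for p in PREFIXES:
                if w.startswith(p) and len(w) > len(p) + 2:
                    out.append(word[:len(p)] + "/" + word[len(p):])
                    break
            else:
                out.append("/" + word)
        else:
            out.append("/" + word)
    return " ".join(out)

print(mark_feet(["they", "convert", "the", "convert"],
                ["pronoun", "verb", "article", "noun"]))
# they con/vert the /convert
.LE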
12066.pp
12067In 1968, a remarkable work on English sound structure was published
12068(Chomsky and Halle, 1968) which proposes a system of rules to transform
12069English text into a phonetic representation in terms of distinctive features,
12070with the aid of a lexicon.
12071.[
12072Chomsky Halle 1968
12073.]
12074A great deal of attention is paid to stress, and rules are given which
12075perform well in many tricky cases.
12076.pp
Chomsky and Halle use the American system of levels of stress, marking
12078so-called primary stress with a superscript 1, secondary stress with a
12079superscript 2, and so on.
12080The superscripts are written on the vowel of the stressed
12081syllable:  completely unstressed syllables receive no annotation.
12082For example, the sentence "take John's blackboard eraser" is written
12083.LB
12084ta\u2\dke Jo\u3\dhn's bla\u1\dckboa\u5\drd era\u4\dser.
12085.LE
12086In foot notation this utterance
12087is
12088.LB
12089/take /John's /*blackboard e/raser.
12090.LE
12091It undoubtedly contains less information than the stress-level version.
12092For example, the second syllable of "blackboard" and the first one of "erase"
12093are both unstressed, although the rhythm rules given in Chapter 8
12094will cause them
12095to be treated differently because they occupy different places in the
12096syllable pattern of the foot.
12097"Take", "John's", and the second syllable of "erase" are all non-tonic
12098foot-initial syllables and hence are not distinguished in the notation;
12099although the pitch contours schematized in Figure 8.9 will give them different
12100intonations.
12101.pp
12102An indefinite number of levels of stress can be used.  For example, according
12103to the rules given by Chomsky and Halle, the word "sad" in
12104.LB
12105my friend can't help being shocked at anyone who would fail to consider
12106his sad plight
12107.LE
12108has level-8 stress, the final two words being annotated
12109as "sa\u8\dd pli\u1\dght".
12110However, only the first few levels are used regularly, and
12111it is doubtful whether acoustic distinctions are made in speech
12112between the weaker ones.
12113.pp
12114Chomsky and Halle are concerned to distinguish between such utterances as
12115.LB
12116.NI
12117bla\u2\dck boa\u1\drd-era\u3\dser    ("board eraser that is black")
12118.NI
12119bla\u1\dckboa\u3\drd era\u2\dser     ("eraser for a blackboard")
12120.NI
12121bla\u3\dck boa\u1\drd era\u2\dser    ("eraser of a black board"),
12122.LE
12123and their stress assignment rules do indeed produce each version when
12124appropriate.
12125In foot notation the distinctions can still be made:
12126.LB
12127.NI
12128/black /*board-eraser/
12129.NI
12130/*blackboard e/raser/
12131.NI
12132/black /*board e/raser/
12133.LE
12134.pp
12135The rules operate on a grammatical derivation tree
12136of the text.
12137For instance, input for the three examples would be written
12138.LB
12139.NI
12140[\dNP\u[\dA\u black ]\dA\u [\dN\u[\dN\u board]\dN\u
12141[\dN\u eraser ]\dN\u]\dN\u]\dNP\u
12142.NI
12143[\dN\u[\dN\u[\dA\u black ]\dA\u [\dN\u board ]\dN\u]\dN\u [\dN\u eraser ]\dN\u]\dN\u
12144.NI
12145[\dN\u[\dNP\u[\dA\u black ]\dA\u [\dN\u board ]\dN\u]\dNP\u [\dN\u eraser ]\dN\u]\dN\u,
12146.LE
12147representing the trees shown in Figure 9.2.
12148.FC "Figure 9.2"
12149Here, N stands for a noun, NP for a noun phrase, and A for an adjective.
12150These categories appear explicitly as nodes in the tree.
12151In the linearized textual representation they are used to label
12152brackets which represent the tree structure.
12153An additional piece of information which is needed is the lexical entry for
12154"eraser", which would show that it has only one accented
12155(that is, potentially stressed) syllable, namely, the second.
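.pp
For concreteness, one way of holding such labelled bracketings inside a
program is as nested structures of the form (category, constituents ...),
with the words themselves as plain strings.  The representation below is
illustrative only; it is not the one used by Chomsky and Halle, nor by any
particular synthesis system.
.LB
# the three bracketings of "black board eraser" given above
board_eraser_that_is_black = ("NP", ("A", "black"),
                                    ("N", ("N", "board"), ("N", "eraser")))
eraser_for_a_blackboard = ("N", ("N", ("A", "black"), ("N", "board")),
                                ("N", "eraser"))
eraser_of_a_black_board = ("N", ("NP", ("A", "black"), ("N", "board")),
                                ("N", "eraser"))

def words(tree):
    # read the terminal words back off a tree
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree[1:]:
        out.extend(words(child))
    return out

print(words(eraser_for_a_blackboard))    # ['black', 'board', 'eraser']
.LE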
12156.pp
12157Consider now how to account for stress in prefixed and
12158suffixed words, and those polysyllabic ones with more than one potential
12159stress point.
12160For these, the morphological structure must appear in the input.
12161.pp
12162Now
12163.ul
12164morphemes
12165are well-defined minimal units of grammatical analysis from which a word
12166may be composed.
12167For example,  [went]\ =\ [go]\ +\ [ed]  is
12168a morphemic decomposition, where "[ed]" denotes the
12169past-tense morpheme.
12170This representation is not particularly suitable for speech synthesis
12171for the obvious reason that the result bears no phonetic resemblance to
12172the input.
12173What is needed is a decomposition into
12174.ul
12175morphs,
12176which occur only when the lexical or phonetic representation of a word may
12177easily be segmented into parts.
12178Thus  [wanting]\ =\ [want]\ +\ [ing]  and  [bigger]\ =\ [big]\ +\ [er]  are
12179simultaneously morphic and morphemic decompositions.
12180Notice that in the second example, a rule about final consonant doubling has
12181been applied at the lexical level (although it is not needed in
12182a phonetic representation):  this comes into the sphere
12183of "easy" segmentation.
12184Contrast this with  [went]\ =\ [go]\ +\ [ed]  which
12185is certainly not an easy segmentation and hence a
12186morphemic but not a morphic decomposition.
12187But between these extremes there are some difficult
12188cases:  [specific]\ =\ [specify]\ +\ [ic]  is probably morphic
12189as well as morphemic, but it is not clear
12190that  [galactic]\ =\ [galaxy]\ +\ [ic]  is.
12191.pp
12192Assuming that the input is given as a derivation tree with morphological
12193structure made explicit, Chomsky and Halle present rules which assign stress
12194correctly in nearly all cases.  For example, their rules give
12195.LB
12196.NI
12197[\dA\u[\dN\u incident ]\dN\u + al]\dA\u  \(em>  i\u2\dncide\u1\dntal;
12198.LE
12199and if the stem is marked by  [\dS\u\ ...\ ]\dS\u  in prefixed words,
12200they can deduce
12201.LB
12202.NI
12203[\dN\u tele [\dS\u graph ]\dS\u]\dN\u		\(em>  te\u1\dlegra\u3\dph
12204.NI
12205[\dN\u[\dN\u tele [\dS\u graph ]\dS\u]\dN\u y ]\dN\u	\(em>  tele\u1\dgraphy
12206.NI
12207[\dA\u[\dN\u tele [\dS\u graph ]\dS\u]\dN\u ic ]\dA\u	\(em>  te\u3\dlegra\u1\dphi\u2\dc.
12208.LE
12209.pp
12210There are two rules which account for the word-level stress
12211on such examples:  the "main stress"
12212rule and the "alternating stress" rule.
12213In essence, the main stress rule emphasizes the last strong syllable
12214of a stem.
12215A syllable is "strong" either if it contains one of a class of so-called
12216"long" vowels, or if there is a cluster of two or more consonants
12217following the vowel; otherwise it is "weak".
12218(If you are exceptionally observant you will notice that this strong\(emweak
distinction has been used before, when discussing the rhythm of syllables
within feet.)  Thus the verb "torment" receives stress on the second syllable,
12221for it is a strong one.
12222A noun like "torment" is treated as being derived from the corresponding verb,
12223and the rule assigns stress to the verb first and then modifies it for the noun.
12224The second, "alternating stress", rule gives some stress to alternate
12225syllables of polysyllabic words like "form\c
12226.ul
12227al\c
12228de\c
12229.ul
12230hyde\c
12231".
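.pp
The test for a strong syllable, which lies at the heart of the main stress
rule, is easily stated in program form.  The sketch below assumes that a
syllable is supplied as its vowel together with the consonants that follow
it; the set of "long" vowels is a stand-in and is not Chomsky and Halle's
feature analysis.
.LB
LONG_VOWELS = {"aa", "ee", "ii", "oo", "uu", "ai", "ei", "au", "ou"}

def is_strong(vowel, following_consonants):
    # strong if the vowel is "long", or if a cluster of two or more
    # consonants follows it; otherwise weak
    return vowel in LONG_VOWELS or len(following_consonants) >= 2

# the second syllable of the verb "torment": a short vowel, but the
# cluster "nt" follows it, so the syllable is strong and takes the stress
print(is_strong("e", ["n", "t"]))    # True
.LE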
12232.pp
12233It is quite easy to incorporate the word-level rules into a computer
12234program which uses feet rather than stress levels as the basis for prosodic
12235description.
12236A foot boundary is simply placed before the primary-stressed (level-1) syllable,
12237except for function words, which do not begin a foot.
12238The other stress levels should be ignored,
12239except that for slow, deliberate speech, secondary (level-2) stress is
12240mapped into a foot boundary too, if it precedes the primary stress.
12241There is also a rule which reduces vowels in unstressed
12242syllables.
12243.pp
12244The stress assignment rules can work on phonemic script, as well as English.
12245For example, starting from the phonetic
12246form  [\d\V\u\ \c
12247.ul
12248aa\ s\ t\ o\ n\ i\ sh\ \c
12249]\dV\u,
12250the stress assignment rules
12251produce  \c
12252.ul
12253aa\ s\ t\ o\u1\d\ n\ i\ sh\ ;\c
12254  the
12255vowel reduction rule
12256generates  \c
12257.ul
12258uh\ s\ t\ o\u1\d\ n\ i\ sh\ ;\c
12259  and
12260the foot conversion process
12261gives  \c
12262.ul
12263uh\ s/t\ o\ n\ i\ sh.
12264This appears to provide a fairly reliable algorithm for foot boundary
12265placement.
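.pp
The foot conversion step itself is almost trivial to program.  In the sketch
below a word is assumed to arrive as a list of (phoneme string, stress level)
pairs; a foot boundary is written before each primary-stressed syllable, a
preceding secondary stress is promoted only for slow, deliberate speech, and
the function-word exception is ignored.
.LB
def to_feet(syllables, slow_speech=False):
    out = []
    primary_seen = False
    for phonemes, stress in syllables:
        boundary = stress == 1 or (slow_speech and stress == 2
                                   and not primary_seen)
        if stress == 1:
            primary_seen = True
        if boundary:
            out.append("/")
        out.append(phonemes)
    return " ".join(out)

# "astonish" after stress assignment and vowel reduction
print(to_feet([("uh s", 0), ("t o n", 1), ("i sh", 0)]))   # uh s / t o n i sh
.LE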
12266.rh "Speech synthesis from concept."
12267I argued earlier that in order to derive prosodic features
12268of an utterance from text it
12269is necessary to understand its role in the dialogue, its semantics,
12270its syntax, and \(em as we have just seen \(em its morphological structure.
12271This is a very tall order, and the problem of natural language comprehension
12272by machine is a vast research area in its own right.
12273However, in many applications requiring speech output,
12274utterances are generated by the computer from internally stored data
12275rather than being read aloud from pre-prepared text.
12276Then the problem of comprehending text may be evaded, for
12277presumably the language-generation module can provide a semantic,
12278syntactic, and even morphological decomposition of the utterance,
12279as well as some indication of its role in the dialogue
12280(that is, why it is necessary to say it).
12281.pp
12282This forms the basis of the appealing notion of "speech synthesis from concept".
12283It has some advantages over speech generation from text, and in principle
12284should provide more natural-sounding speech.
12285Every word produced by the system can have a complete lexical entry which
12286shows its morphological decomposition and potential stress points.
12287The full syntactic history of each utterance is known.
12288The Chomsky-Halle rules described above can therefore be used to place
12289foot boundaries accurately, without the need for a complex parsing program
12290and without the risk of having to make guesses about unknown words.
12291.pp
12292However, it is not clear how to take advantage of any semantic information
12293which is available.  Ideally, it should be possible to place tone group
12294boundaries and tonic stress points, and assign intonation contours, in
12295a natural-sounding way.
12296But look again at the example text of Table 9.2 and imagine that you have
12297at your disposal as much semantic information as is needed.
12298It is
12299.ul
12300still
12301far from obvious how the intonation features could be assigned!
12302It is, in the ultimate analysis, interpretive and stylistic
12303.ul
12304choices
12305that add variety and interest to speech.
12306.pp
12307Take the problem of determining pitch contours, for instance.
12308Some of them may be explicable.
12309Contour 4 on
12310.LB
12311.NI
12312except the parts that are covered in caravans of course
12313.LE
12314is due to its being a contrastive clause, for it presents
12315essentially new information.
12316Similarly, the succession
12317.LB
12318.NI
12319if you go in spring
12320.NI
12321when the gorse is out
12322.NI
12323or in summer
12324.NI
12325when the heather's out
12326.LE
could be considered contrastive, being in the subjunctive mood, and
12328this could explain why contour 4's were used.
12329But this is all conjecture, and it is difficult to apply throughout the
12330passage.
12331Halliday (1970) explains the contexts in which each tone group is typically
12332used, but in an extremely high-level manner which would be impossible
12333to embody directly in a computer program.
12334.[
12335Halliday 1970 Course in spoken English: Intonation
12336.]
12337At the other end of the spectrum, computer systems for written
12338discourse production do not seem to provide the subtle information needed
12339to make intonation decisions (see, for example, Davey, 1978, for a fairly
12340complete description of such a system).
12341.[
12342Davey 1978
12343.]
12344.pp
12345One project which uses such a method for generating speech has been
12346described (Young and Fallside, 1980).
12347.[
12348Young Fallside 1980
12349.]
12350Although some attention is paid to rhythm, the intonation contours
12351which are generated are disappointingly repetitive and lacking in
12352richness.
12353In fact, very little semantic information is used to assign contours; really
12354just that inferred by the crude punctuation-driven method described
12355earlier.
12356.pp
12357The higher-level semantic problems associated with speech output were
studied some years ago under the
12359title "synthetic elocution" (Vanderslice, 1968).
12360.[
12361Vanderslice 1968
12362.]
12363A set of rules was generated and tested by hand on a sample passage,
12364the first part of which is shown in Table 9.4.
12365However, no attempt was made to formalize the rules in a computer program,
12366and indeed it was recognized that a number of important questions,
12367such as the form of the semantic information assumed at the input,
12368had been left unanswered.
12369.RF
12370.nr x0 \w'\0\0  psychologist   '+\w'emphasis assigned because of antithesis with  '
12371.nr x1 (\n(.l-\n(x0)/2
12372.in \n(x1u
12373.ta \w'\0\0  psychologist   'u
12374\l'\n(x0u\(ul'
12375.sp
12376Human experience and human behaviour are accessible to
12377observation by everyone.  The psychologist tries to bring
12378them under systematic study.  What he perceives, however,
12379anyone can perceive; for his task he requires no microscope
12380or electronic gear.
12381.sp2
12382\0\0  word	comments
12383\l'\n(x0u\(ul'
12384.sp
12385\01  Human	special treatment because paragraph-initial
12386\04  human	accent deleted because it echoes word 1
1238713  psychologist	emphasis assigned because of antithesis with
12388	"everyone"
1238917  them	anaphoric to "Human experience and human
12390	behaviour"
1239119  systematic	emphasis assigned because of contrast with
12392	"observation"
1239320  study	emphasis? \(em text is ambiguous whether
12394	"observation" is a kind of study that is
12395	nonsystematic, or an activity contrasting
12396	with the entire concept of "systematic study"
1239721  What	increase in pitch for "What he perceives"
12398	because it is not the subject
1239922  he	accented although anaphoric to word 13
12400	because of antithesis with word 25
1240124  however	decrease in pitch because it is parenthetical
1240225  anyone	emphasized by antithesis with word 22
1240327  perceive	unaccented because it echoes word 23,
12404	"perceives"
12405\0\0  ;	semicolon assigns falling intonation
1240630  task	unaccented because it is anaphoric with
12407	"tries to bring them under systematic study"
12408\l'\n(x0u\(ul'
12409.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
12410.in 0
12411.FG "Table 9.4  Sample passage and comments pertinent to synthetic elocution"
12412.pp
12413The comments in the table, which are selected and slightly edited versions
12414of those appearing in the original work (Vanderslice, 1968), are intended
12415as examples of the nature and subtlety of the prosodic influences which
12416were examined.
12417.[
12418Vanderslice 1968
12419.]
12420The concepts of "accent" and "emphasis" are used; these relate to stress
12421but are not easy to define precisely in our tone-group terminology.
12422Fortunately we do not need an exact characterization of them for the present
12423purpose.
12424Roughly speaking, "accent" encompasses both foot-initial stress and
12425tonic stress, whereas "emphasis" is something more than this,
12426typically being realized by the fall-rise or rise-fall contours of
12427Halliday's tone groups 4 and 5 (Figure 8.5).
12428.pp
12429Particular attention is paid to anaphora and antithesis (amongst other things).
12430The first term means the repetition of a word or phrase in the text,
12431and is often applied to pronoun references.
12432In the example, the word "human" is repeated in the first few words;
12433"them" in the second sentence refers to "human experience and human
12434behaviour"; "he" in the third sentence is the previously-mentioned
12435psychologist; and "task" is anaphoric with "tries to bring them under
12436systematic study".
12437Other things being equal, anaphoric references are unaccented.
12438In our terms this means that they certainly do not receive tonic stress
12439and may not even receive foot stress.
12440.pp
12441Antithesis is defined as the contrast of ideas expressed by parallelism of
12442strongly contrasting words or phrases; and the second element taking part
12443in it is generally emphasized.
12444"Psychologist" in the passage is an antithesis of "everyone";
12445"systematic" and possibly "study" of "observation".
12446Thus
12447.LB
12448.NI
12449/^ the psy/*chologist
12450.LE
12451would probably receive intonation contour 4, since it is also introducing
12452a new actor; while
12453.LB
12454.NI
/tries to /bring them /under /syste/*matic /study
12456.LE
12457could receive contour 5.
12458"He" and "everyone" are antithetical; not only does the latter receive
12459emphasis but the former has its accent restored \(em for otherwise
12460it would have been removed because of anaphora with "psychologist".
12461Hence it will certainly begin a foot, possibly a tonic foot.
12462.pp
12463A factor that does not affect the sample passage is the accentuation
12464of unusual syllables of similar words to bring out a contrast.
12465For example,
12466.LB
12467.NI
12468he went
12469.ul
12470out\c
12471side, not
12472.ul
12473in\c
12474side.
12475.LE
12476Although this may seem to be just another facet of antithesis,
12477Vanderslice points out that it is phonetic rather than structural
12478similarity that is contrasted:
12479.LB
12480.NI
12481I said
12482.ul
12483de\c
12484plane, not
12485.ul
12486com\c
12487plain.
12488.LE
12489This introduces an interesting interplay between the phonetic and
12490prosodic levels.
12491.pp
12492Anaphora and antithesis provide an ideal domain for speech synthesis from
12493concept.
12494Determining them from plain text is a very difficult problem,
12495requiring a great deal of real-world knowledge.
12496The first has received some attention in the field of natural language
12497understanding.
12498Finding pronoun referents is an important problem for language translation,
12499for their gender is frequently distinguished in, say, French where it is not
12500in English.
12501Examples such as
12502.LB
12503.NI
12504I bought the wine, sat on a table, and drank it
12505.NI
12506I bought the wine, sat on a table, and broke it
12507.LE
12508have been closely studied (Wilks, 1975); for if they were to be translated
12509into French the pronoun "it" would be rendered differently in each case
12510(\c
12511.ul
12512le
12513vin,
12514.ul
12515la
12516table).
12517.[
12518Wilks 1975 An intelligent analyzer and understander of English
12519.]
12520.pp
12521In spoken language, emphasis is used to indicate the referent of a pronoun
12522when it would not otherwise be obvious.
12523Vanderslice gives the example
12524.LB
12525.NI
12526Bill saw John across the room and he ran over to him
12527.NI
12528Bill saw John across the room and
12529.ul
12530he
12531ran over to
12532.ul
12533him,
12534.LE
12535where the emphasis reverses the pronoun referents
12536(so that John did the running).
12537He suggests accenting a personal pronoun whenever the true
12538antecedent is not the same as the "unmarked" or default one.
12539Unfortunately he does not elaborate on what is meant by "unmarked".
12540Does it mean that the referent cannot be predicted from
12541knowledge of the words alone \(em as in the second example above?
12542If so, this is a clear candidate for speech synthesis from concept,
12543for the distinction cannot be made from text!
12544.sh "9.2  Pronunciation"
12545.pp
12546English pronunciation is notoriously irregular.
12547A poem by Charivarius, the pseudonym of a Dutch high school teacher
and linguist G.N. Trenite (1870\-1946), surveys the problems in an amusing
12549way and is worth quoting in full.
12550.br
12551.ev2
12552.in 0
12553.LB "nnnnnnnnnnnnnnnn"
12554.ul
12555              The Chaos
12556.sp2
12557.ne4
12558Dearest creature in Creation
12559Studying English pronunciation,
12560.in +5n
12561I will teach you in my verse
12562Sounds like corpse, corps, horse and worse.
12563.ne4
12564.in -5n
12565It will keep you, Susy, busy,
12566Make your head with heat grow dizzy;
12567.in +5n
12568Tear in eye your dress you'll tear.
12569So shall I!  Oh, hear my prayer:
12570.ne4
12571.in -5n
12572Pray, console your loving poet,
12573Make my coat look new, dear, sew it.
12574.in +5n
12575Just compare heart, beard and heard,
12576Dies and diet, lord and word.
12577.ne4
12578.in -5n
12579Sword and sward, retain and Britain,
12580(Mind the latter, how it's written).
12581.in +5n
12582Made has not the sound of bade,
12583Say \(em said, pay \(em paid, laid, but plaid.
12584.ne4
12585.in -5n
12586Now I surely will not plague you
12587With such words as vague and ague,
12588.in +5n
12589But be careful how you speak:
12590Say break, steak, but bleak and streak,
12591.ne4
12592.in -5n
12593Previous, precious; fuchsia, via;
12594Pipe, shipe, recipe and choir;
12595.in +5n
12596Cloven, oven; how and low;
12597Script, receipt; shoe, poem, toe.
12598.ne4
12599.in -5n
12600Hear me say, devoid of trickery;
12601Daughter, laughter and Terpsichore;
12602.in +5n
12603Typhoid, measles, topsails, aisles;
12604Exiles, similes, reviles;
12605.ne4
12606.in -5n
12607Wholly, holly; signal, signing;
12608Thames, examining, combining;
12609.in +5n
12610Scholar, vicar and cigar,
12611Solar, mica, war and far.
12612.ne4
12613.in -5n
12614Desire \(em desirable, admirable \(em admire;
12615Lumber, plumber; bier but brier;
12616.in +5n
12617Chatham, brougham; renown but known,
12618Knowledge; done, but gone and tone,
12619.ne4
12620.in -5n
12621One, anemone; Balmoral,
12622Kitchen, lichen; laundry, laurel;
12623.in +5n
12624Gertrude, German; wind and mind;
12625Scene, Melpemone, mankind;
12626.ne4
12627.in -5n
12628Tortoise, turquoise, chamois-leather,
12629Reading, Reading; heathen, heather.
12630.in +5n
12631This phonetic labyrinth
12632Gives:  moss, gross; brook, brooch; ninth, plinth.
12633.ne4
12634.in -5n
12635Billet does not end like ballet;
12636Bouquet, wallet, mallet, chalet;
12637.in +5n
12638Blood and flood are not like food,
12639Nor is mould like should and would.
12640.ne4
12641.in -5n
12642Banquet is not nearly parquet,
12643Which is said to rime with darky
12644.in +5n
12645Viscous, viscount; load and broad;
12646Toward, to forward, to reward.
12647.ne4
12648.in -5n
12649And your pronunciation's O.K.
12650When you say correctly:  croquet;
12651.in +5n
12652Rounded, wounded; grieve and sieve;
12653Friend and fiend, alive and live
12654.ne4
12655.in -5n
12656Liberty, library; heave and heaven;
12657Rachel, ache, moustache; eleven.
12658We say hallowed, but allowed;
12659People, leopard; towed, but vowed.
12660.in +5n
12661Mark the difference moreover
12662Between mover, plover, Dover;
12663.ne4
12664.in -5n
12665Leeches, breeches; wise, precise;
12666Chalice, but police and lice.
12667.in +5n
12668Camel, constable, unstable,
12669Principle, discipline, label;
12670.ne4
12671.in -5n
12672Petal, penal and canal;
12673Wait, surmise, plait, promise; pal.
12674.in +5n
12675Suit, suite, ruin; circuit, conduit,
12676Rime with:  "shirk it" and "beyond it";
12677.ne4
12678.in -5n
12679But it is not hard to tell
12680Why it's pall, mall, but Pall Mall.
12681.in +5n
12682Muscle, muscular; goal and iron;
12683Timber, climber; bullion, lion;
12684.ne4
12685.in -5n
12686Worm and storm; chaise, chaos, chair;
12687Senator, spectator, mayor.
12688.in +5n
12689Ivy, privy; famous, clamour
12690and enamour rime with "hammer".
12691.ne4
12692.in -5n
12693Pussy, hussy and possess,
12694Desert, but dessert, address.
12695.in +5n
12696Golf, wolf; countenants; lieutenants
12697Hoist, in lieu of flags, left pennants.
12698.ne4
12699.in -5n
12700River, rival; tomb, bomb, comb;
12701Doll and roll, and some and home.
12702.in +5n
12703Stranger does not rime with anger,
12704Neither does devour with clangour.
12705.ne4
12706.in -5n
12707Soul, but foul; and gaunt, but aunt;
12708Font, front, won't; want, grand and grant;
12709.in +5n
12710Shoes, goes, does.  Now first say:  finger,
12711And then; singer, ginger, linger.
12712.ne4
12713.in -5n
12714Real, zeal; mauve, gauze and gauge;
12715Marriage, foliage, mirage, age.
12716.in +5n
12717Query does not rime with very,
12718Nor does fury sound like bury.
12719.ne4
12720.in -5n
12721Dost, lost, post; and doth, cloth, loth;
12722Job, Job; blossom, bosom, oath.
12723.in +5n
12724Though the difference seems little
12725We say actual, but victual;
12726.ne4
12727.in -5n
12728Seat, sweat; chaste, caste; Leigh, eight, height;
12729Put, nut; granite but unite.
12730.in +5n
12731Reefer does not rime with deafer,
12732Feoffer does, and zephyr, heifer.
12733.ne4
12734.in -5n
12735Dull, bull; Geoffrey, George; ate, late;
12736Hint, pint; senate, but sedate.
12737.in +5n
12738Scenic, Arabic, Pacific;
12739Science, conscience, scientific.
12740.ne4
12741.in -5n
12742Tour, but our, and succour, four;
12743Gas, alas and Arkansas!
12744.in +5n
12745Sea, idea, guinea, area,
12746Psalm, Maria, but malaria.
12747.ne4
12748.in -5n
12749Youth, south, southern; cleanse and clean;
12750Doctrine, turpentine, marine.
12751.in +5n
12752Compare alien with Italian.
12753Dandelion with battalion,
12754.ne4
12755.in -5n
12756Sally with ally, Yea, Ye,
12757Eye, I, ay, aye, whey, key, quay.
12758Say aver, but ever, fever,
12759Neither, leisure, skein, receiver.
12760.in +5n
12761Never guess \(em it is not safe;
12762We say calves, valves, half, but Ralf.
12763.ne4
12764.in -5n
12765Heron, granary, canary;
12766Crevice and device and eyrie;
12767.in +5n
12768Face, preface, but efface,
12769Phlegm, phlegmatic; ass, glass, bass;
12770.ne4
12771.in -5n
12772Large, but target, gin, give, verging;
12773Ought, out, joust and scour, but scourging;
12774.in +5n
12775Ear, but earn; and wear and tear
12776Do not rime with "here", but "ere".
12777.ne4
12778.in -5n
12779Seven is right, but so is even;
12780Hyphen, roughen, nephew, Stephen;
12781.in +5n
12782Monkey, donkey; clerk and jerk;
12783Asp, grasp, wasp; and cork and work.
12784.ne4
12785.in -5n
12786Pronunciation \(em think of psyche -
12787Is a paling, stout and spikey;
12788.in +5n
12789Won't it make you lose your wits,
12790Writing groats and saying "groats"?
12791.ne4
12792.in -5n
12793It's a dark abyss or tunnel,
12794Strewn with stones, like rowlock, gunwale,
12795.in +5n
12796Islington and Isle of Wight,
12797Housewife, verdict and indict.
12798.ne4
12799.in -5n
12800Don't you think so, reader, rather
12801Saying lather, bather, father?
12802.in +5n
12803Finally:  which rimes with "enough",
12804Though, through, plough, cough, hough or tough?
12805.ne4
12806.in -5n
12807Hiccough has the sound of "cup",
12808My advice is ... give it up!
12809.LE "nnnnnnnnnnnnnnnn"
12810.br
12811.ev
12812.rh "Letter-to-sound rules."
12813Despite such irregularities, it is surprising how much can be done
12814with simple letter-to-sound rules.
12815These specify phonetic equivalents of word fragments and single letters.
12816The longest stored fragment which matches the current word is translated,
12817and then the same strategy is adopted on the remainder of the word.
12818Table 9.5 shows some English fragments and their pronunciations.
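.pp
The matching strategy itself can be sketched in a few lines.  The fragment
table below is a toy one, included only to make the program self-contained;
a realistic table, like that of Table 9.5, contains hundreds of entries and
uses the "-" and "|" conventions described shortly.
.LB
FRAGMENTS = {
    "ph": "f", "sh": "sh", "ee": "ii", "ight": "ai t",
    # single letters as a last resort
    "f": "f", "l": "l", "p": "p", "s": "s", "t": "t",
    "a": "a", "e": "e", "i": "i",
}

def transcribe(word):
    # translate the longest matching fragment at the current position,
    # then carry on with the remainder of the word
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):
            fragment = word[i:i + length]
            if fragment in FRAGMENTS:
                phones.append(FRAGMENTS[fragment])
                i += length
                break
        else:
            i += 1        # no rule at all for this letter: skip it
    return " ".join(phones)

print(transcribe("flight"))   # f l ai t
print(transcribe("sheep"))    # sh ii p
.LE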
12819.RF
12820.nr x0 1.5i+\w'pronunciation  '
12821.nr x1 (\n(.l-\n(x0)/2
12822.in \n(x1u
12823.ta 1.5i
12824fragment	pronunciation
12825\l'\n(x0u\(ul'
12826.sp
12827-p-	\fIp\fR
12828-ph-	\fIf\fR
12829-phe|	\fIf ee\fR
12830-phe|s	\fIf ee z\fR
12831-phot-	\fIf uh u t\fR
12832-place|-	\fIp l e i s\fR
12833-plac|i-	\fIp l e i s i\fR
12834-ple|ment-	\fIp l i m e n t\fR
12835-plie|-	\fIp l aa i y\fR
12836-post	\fIp uh u s t\fR
12837-pp-	\fIp\fR
12838-pp|ly-	\fIp l ee\fR
12839-preciou-	\fIp r e s uh\fR
12840-proce|d-	\fIp r uh u s ee d\fR
12841-prope|r-	\fIp r o p uh r\fR
12842-prov-	\fIp r uu v\fR
12843-purpose-	\fIp er p uh s\fR
12844-push-	\fIp u sh\fR
12845-put	\fIp u t\fR
12846-puts	\fIp u t s\fR
12847\l'\n(x0u\(ul'
12848.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
12849.in 0
12850.FG "Table 9.5  Word fragments and their pronunciations"
12851.pp
12852It is sometimes important to specify that a rule applies only when
12853the fragment is matched at the beginning or end of a word.
12854In the Table "-" means that other fragments can precede or follow this
12855one.
12856The "|" sign is used to separate suffixes from a word stem,
12857as will be explained
12858shortly.
12859.pp
12860An advantage of the longest-string search strategy is that it is easy
12861to account for exceptions simply by incorporating them into the fragment
12862table.
12863If they occur in the input, the complete word will automatically be
12864matched first, before any fragment of it is translated.
12865The exception list of complete words can be surprisingly small for
12866quite respectable performance.
12867Table 9.6 shows the entire dictionary for an excellent early pronunciation
12868system written at Bell Laboratories (McIlroy, 1974).
12869.[
12870McIlroy 1974
12871.]
12872Some of the words are notorious exceptions in English, while others are
12873included simply because the rules would run amok on them.
12874Notice that the exceptions are all quite short, with only a few of them
12875having more than two syllables.
12876.RF
12877.nr x1 0.9i+0.9i+0.9i+0.9i+0.9i+0.9i
12878.nr x1 (\n(.l-\n(x1)/2
12879.in \n(x1u
12880.ta 0.9i +0.9i +0.9i +0.9i +0.9i
12881a	doesn't	guest	meant	reader	those
12882alkali	doing	has	moreover	refer	to
12883always	done	have	mr	says	today
12884any	dr	having	mrs	seven	tomorrow
12885april	early	heard	nature	shall	tuesday
12886are	earn	his	none	someone	two
12887as	eleven	imply	nothing	something	upon
12888because	enable	into	nowhere	than	very
12889been	engine	is	nuisance	that	water
12890being	etc	island	of	the	wednesday
12891below	evening	john	on	their	were
12892body	every	july	once	them	who
12893both	everyone	live	one	there	whom
12894busy	february	lived	only	thereby	whose
12895copy	finally	living	over	these	woman
12896do	friday	many	people	they	women
12897does	gas	maybe	read	this	yes
12898.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
12899.in 0
12900.FG "Table 9.6  Exception table for a simple pronunciation program"
12901.pp
12902Special action has to be taken with final "e"'s.
12903These lengthen and alter the quality
12904of the preceding vowel, so that "bit" becomes "bite" and so on.
12905Unfortunately, if the word has a suffix the "e" must be detected even though
12906it is no longer final, as in "lonely", and it is even dropped sometimes
12907("biting") \(em otherwise these would be pronounced "lonelly", "bitting".
12908To make matters worse the suffix may be another word:  we do not
12909want "kiteflying" to have an extra syllable which rhymes with "deaf"!
12910Although simple procedures can be developed to take care of common
12911word endings like "-ly", "-ness", "-d", it is difficult to decompose
12912compound words like "wisecrack" and "bumblebee" reliably \(em but this must
12913be done if they are not to be articulated with three syllables instead of two.
12914Of course, there are exceptions to the final "e" rule.
12915Many common words ("some", "done", "[live]\dV\u") disobey the rule by not
12916lengthening the main vowel, while in other, rarer, ones ("anemone",
12917"catastrophe", "epitome") the final "e" is actually pronounced.
12918There are also some complete anomalies ("fete").
12919.pp
12920McIlroy's (1974) system is a superb example of a robust program which takes
12921a pragmatic approach to these problems, accepting that they will never be
12922fully solved, and which is careful to degrade
12923gracefully when stumped.
12924.[
12925McIlroy 1974
12926.]
The pronunciation of each word is found by a succession of increasingly
desperate trials (a skeleton of this procedure is sketched in program form after the list):
12929.LB
12930.NP
12931replace upper- by lower-case letters, strip punctuation, and try again;
12932.NP
12933remove final "-s", replace final "ie" by "y", and try again;
12934.NP
12935reject a word without a vowel;
12936.NP
12937repeatedly mark any suffixes with "|";
12938.NP
12939mark with "|" probable morph divisions in compound words;
12940.NP
12941mark potential long vowels indicated by "e|",
12942and long vowels elsewhere in the word;
12943.NP
12944mark voiced medial "s" as in "busy", "usual";
12945replace final "-s" if stripped;
12946.NP
12947scanning the word from left to right, apply letter-to-sound rules
12948to word fragments;
12949.NP
12950when all else fails spell the word, punctuation and all
12951(burp on letters for which no spelling rule exists).
12952.LE
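.pp
The control structure of this procedure, though not its details, is captured
by the following skeleton.  It is not McIlroy's program:  the two routines
passed in as arguments stand for the suffix-marking, vowel-marking and
letter-to-sound stages and for the spelling fall-back, and the treatment of
the stripped "-s" is crude.
.LB
def pronounce(word, letter_to_sound, spell_out):
    w = word.lower().strip(",.;:?!")      # fold case, strip punctuation
    stripped_s = w.endswith("s") and not w.endswith("us")
    if stripped_s:
        w = w[:-1]                         # remove final -s for the moment
    if w.endswith("ie"):
        w = w[:-2] + "y"                   # final ie becomes y
    if not any(c in "aeiouy" for c in w):
        return spell_out(word)             # no vowel: reject and spell it
    phones = letter_to_sound(w)            # suffix marking, long-vowel marking
                                           # and fragment rules, not shown
    if phones is None:
        return spell_out(word)             # when all else fails, spell it
    if stripped_s:
        phones = phones + " z"             # restore the stripped -s (crudely)
    return phones
.LE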
12953.RF
12954.nr x0 \w'| ment\0\0\0'+\w'replace final ie by y\0\0\0'+\w'except when no vowel would remain in  '
12955.nr x1 (\n(.l-\n(x0)/2
12956.in \n(x1u
12957.ta \w'| ment\0\0\0'u +\w'replace final ie by y\0\0\0'u
12958suffix	action	notes and exceptions
12959\l'\n(x0u\(ul'
12960.sp
12961s	strip off final s	except in context us
12962\&'	strip off final '
12963ie	replace final ie by y
12964e	replace final e by E	when it is the only vowel in a word
12965	(long "e")
12966
12967| able	place suffix mark as	except when no vowel would remain in
12968| ably	shown	the rest of the word
12969e | d
12970e | n
12971e | r
12972e | ry
12973e | st
12974e | y
12975| ful
12976| ing
12977| less
12978| ly
12979| ment
12980| ness
12981| or
12982
12983| ic	place suffix mark as
12984| ical	shown and terminate
12985e |	final e processing
12986\l'\n(x0u\(ul'
12987.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
12988.in 0
12989.FG "Table 9.7  Rules for detecting suffixes for final 'e' processing"
12990.pp
12991Table 9.7 shows the suffixes which the program recognizes, with some comments
12992on their processing.
12993Multiple suffixes are detected and marked in words like
12994"force|ful|ly" and "spite|ful|ness".
12995This allows silent "e"'s to be spotted even when they occur far back in a
12996word.
12997Notice that the suffix marks are available to the word-fragment
12998rules of Table 9.5, and are frequently used by them.
12999.pp
13000The program has some
13001.ul
13002ad hoc
13003rules for dealing with compound words like "race|track", "house|boat";
13004these are applied as well as normal suffix splitting so that multiple
13005decompositions like "pace|make|r" can be accomplished.
13006The rules look for short letter sequences which do not
13007usually appear in monomorphemic words.
13008It is impossible, however, to detect every morph boundary
13009by such rules, and the program inevitably makes mistakes.
13010Examples of boundaries which go undetected are
13011"edge|ways", "fence|post", "horse|back", "large|mouth", "where|in";
13012while boundaries are incorrectly inserted into "comple|mentary",
13013"male|volent", "prole|tariat", "Pame|la".
13014.pp
13015We now seem to have presented two opposing points of view on the pronunciation
13016problem.
13017Charivarius, the Dutch poet, shows that an enormous number of
13018exceptional words exist; whereas McIlroy's program makes do with a tiny
13019exception dictionary.
13020These views can be reconciled by noting that most of Charivarius' words
13021are relatively uncommon.
13022McIlroy tested his program against the 2000 most frequent words in a large
13023corpus (Kucera and Francis, 1967),
13024and found that 97% were pronounced correctly if word frequencies were
13025taken into account.
13026.[
13027Kucera Francis 1967
13028.]
13029(The notion of "correctness" is of course a rather subjective one.)  However,
13030he estimated that on the remaining words the success rate was only 88%.
13031.pp
13032The system is particularly impressive in that it is prepared to say
13033anything:  if used, for example, on source programs in a high-level
computer language it will say the keywords and pronounceable
13035identifiers, spell the other identifiers, and even give the names of special
13036symbols (like +, <, =) correctly!
13037.rh "Morphological analysis."
13038The use of letter-to-sound rules provides a cheap and fast technique
13039for pronunciation \(em the fragment table and exception dictionary for the
13040program described above occupy only 11 Kbyte of storage, and can easily
13041be kept in solid-state read-only memory.
13042It produces reasonable results if careful attention is paid to rules
13043for suffix-splitting.
13044However, it is inherently limited because it is not possible in general
13045to detect compound words by simple rules which operate on the lexical
13046structure of the word.
13047.pp
13048Compounds can only be found reliably by using a morph dictionary.
13049This gives the added advantage that syntactic information
13050can be stored with the morphs to assist with rhythm assignment according
13051to the Chomsky-Halle theory.
13052However, it was noted earlier that morphs, unlike the grammatically-determined
13053morphemes, are not very well defined from a linguistic point of view.
13054Some morphemic decompositions are obviously not morphic because the
13055constituents do not in any way resemble the final word;
13056while others, where the word is simply a concatenation
13057of its components, are clearly morphic.
13058Between these extremes lies a hazy region where what one considers
13059to be a morph depends upon how complex one is prepared to make the
13060concatenation rules.
13061The following description draws on techniques used in a project at MIT
13062in which a morph-based pronunciation system has been implemented
13063(Lee, 1969; Allen, 1976).
13064.[
13065Lee 1969
13066.]
13067.[
13068Allen 1976 Synthesis of speech from unrestricted text
13069.]
13070.pp
13071Estimates of the number of morphs in English vary from 10,000 to 30,000.
13072Although these seem to be very large numbers, they are considerably less
13073than the number of words in the language.
13074For example, Webster's
13075.ul
13076New Collegiate Dictionary
(7th edition) contains about 100,000 entries.
13078If all forms of the words were included, this number would probably
13079double.
13080.pp
13081There are several classes of morphs, with restrictions on the combinations
13082that occur.
13083A general word has prefixes, a root, and suffixes, as shown in Figure 9.3;
13084only the root is mandatory.
13085.FC "Figure 9.3"
13086Suffixes usually perform a grammatical role, affecting the
13087conjugation of a verb or declension of a noun; or transforming one
13088part of speech into another
13089("-al" can make a noun into an adjective, while "-ness" performs the reverse
13090transformation.)  Other
13091suffixes, such as "-dom" or "-ship", only apply to certain parts of
13092speech (nouns, in this case), but do not change the grammatical
13093role of the word.  Such suffixes, and all prefixes, alter the meaning
13094of a word.
13095.pp
13096Some root morphs cannot combine with other morphs but always stand
13097alone \(em for instance, "this".
13098Others, called free morphs, can either occur on their own or combine
13099with further morphs to form a word.
13100Thus the root "house" can be joined on either side by another root,
13101such as "boat",
13102or by a suffix such as "ing".
13103A third type of root morph is one which
13104.ul
13105must
13106combine with another morph, like "crimin-", "-ceive".
13107.pp
13108Even with a morph dictionary, decomposing a word into a sequence
13109of morphs is not a trivial operation.
13110The process of lexical concatenation often results in a
13111minor change in the constituents.
13112How big this change is allowed to be governs the morph system being used.
13113For example, Allen (1976) gives three concatenation rules:  a
13114final "e" can be omitted, as in
13115.ta 1.1i
13116.LB
13117.NI
13118give + ing	\(em>  giving;
13119.LE
13120the last consonant of the root can be doubled, as in
13121.LB
13122.NI
13123bid + ing	\(em>  bidding;
13124.LE
13125or a final "y" can change to an "i", as in
13126.LB
13127.NI
13128handy + cap	\(em>  handicap.
13129.[
13130Allen 1976 Synthesis of speech from unrestricted text
13131.]
13132.LE
13133If these are the only rules permitted, the morph dictionary will
13134have to include multiple versions of some suffixes.
13135For example, the plural morpheme [-s] needs to be represented both by
13136"-s" and "-es", to account for
13137.LB
13138.NI
13139pea + s	\(em>  peas
13140.LE
13141and
13142.LB
13143.NI
13144baby + es	\(em>  babies  (using the "y" \(em> "i" rule).
13145.LE
13146This would not be necessary if a  "y" \(em> "ie"  rule were included too.
13147Similarly, the morpheme [-ic] will include morphs
13148"-ic" and "-c"; the latter to cope with
13149.LB
13150.NI
13151specify + c	\(em>  specific    (using the "y" \(em> "i" rule).
13152.LE
13153Furthermore, non-morphemic roots such as "galact" need to be included because
13154the concatenation rules do not capture the transformation
13155.LB
13156.NI
13157galaxy + ic	\(em>  galactic.
13158.LE
13159There is clearly a trade-off between the size of the morph dictionary
13160and the complexity of the concatenation rules.
13161.pp
13162Since a text-to-speech system is presented with already-concatenated
13163morphs, it must be prepared to reverse the effects of the concatenation
13164rules to deduce the constituents of a word.
13165When two morphs combine with any of the three rules given above,
13166the changes in spelling occur only in the lefthand one.
13167Therefore the word is best scanned in a right-to-left direction to
13168split off the morphs starting with suffixes, as McIlroy's program does.
13169If the procedure fails at any point, one of the three rules is
13170hypothesized, its effect is undone, and splitting continues.
13171For example, consider the word
13172.LB
13173.NI
13174grasshoppers	<\(em  grass + hop + er + s
13175.LE
13176(Lee, 1969).
13177.[
13178Lee 1969
13179.]
13180The "-s" is detected first, then "-er"; these are both stored in
13181the dictionary as suffixes.
13182The remainder, "grasshopp", cannot be decomposed and does not appear
13183in the dictionary.
13184So each of the rules above is hypothesized in turn, and the
13185result investigated.  (The "y" \(em> "i" rule is obviously not
13186applicable.)  When
13187the final-consonant-doubling rule is considered, the sequence
13188"grasshop" is investigated.
13189"Shop" could be split off this, but then the unknown morph "gras"
13190would result.
13191The alternative, to remove "hop", leaves a remainder "grass" which
13192.ul
13193is
13194a free morph, as desired.
13195Thus a unique and correct decomposition is obtained.
13196Notice that the procedure would fail if, for example, "grass" had
13197been inadvertently omitted from the dictionary.
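.pp
The splitting procedure is easy to express recursively.
Here is a minimal sketch of it in Python; the dictionary and the set of
concatenation rules are illustrative only, and are not those of any
actual system.
.LB
.nf
# Minimal sketch of right-to-left morph decomposition.  The dictionary
# and the concatenation rules are illustrative only.

ROOTS    = {"grass", "hop", "shop", "give", "bid", "handy"}
SUFFIXES = {"s", "er", "ing"}

def undo_rules(stem):
    # yield candidate spellings of a left-hand morph before concatenation
    yield stem                                  # no change
    yield stem + "e"                            # a final "e" was dropped
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        yield stem[:-1]                         # a final consonant was doubled
    if stem.endswith("i"):
        yield stem[:-1] + "y"                   # a final "y" became "i"

def decompose(word):
    # return every analysis of word as root (+ root) (+ suffixes)
    analyses = []
    for candidate in undo_rules(word):
        if candidate in ROOTS:
            analyses.append([candidate])
        for i in range(1, len(candidate)):      # compounds such as grass + hop
            if candidate[:i] in ROOTS and candidate[i:] in ROOTS:
                analyses.append([candidate[:i], candidate[i:]])
    for suffix in SUFFIXES:                     # otherwise strip a suffix and recurse
        if word.endswith(suffix) and len(word) > len(suffix):
            for left in decompose(word[: -len(suffix)]):
                analyses.append(left + [suffix])
    return analyses

print(decompose("grasshoppers"))    # [['grass', 'hop', 'er', 's']]
print(decompose("bidding"))         # [['bid', 'ing']]
.fi
.LE
Applied to "grasshoppers" it finds the single decomposition discussed above;
with a fuller dictionary it would sometimes return several candidates,
which is precisely the ambiguity problem considered next.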
13198.pp
13199Sometimes, several seemingly valid decompositions present themselves
13200(Allen, 1976).
13201.[
13202Allen 1976 Synthesis of speech from unrestricted text
13203.]
13204For example:
13205.LB
13206.NI
13207scarcity	<\(em  scar + city
13208.NI
13209	<\(em  scarce + ity  (using final-"e" deletion)
13210.NI
13211	<\(em  scar + cite + y  (using final-"e" deletion)
13212.NI
13213resting	<\(em  rest + ing
13214.NI
13215	<\(em  re + sting
13216.NI
13217biding	<\(em  bide + ing  (using final-"e" deletion)
13218.NI
13219	<\(em  bid + ing
13220.NI
13221unionized	<\(em  un + ion + ize + d
13222.NI
13223	<\(em  union + ize + d
13224.NI
13225winding	<\(em  [wind]\dN\u + ing
13226.NI
13227	<\(em  [wind]\dV\u + ing.
13228.LE
13229The last distinction is important because the pronunciation of "wind"
13230depends on whether it is a noun or a verb.
13231.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
13232.pp
13233Several sources of information can be used to resolve these ambiguities.
13234The word structure of Figure 9.3, together with the division of root
13235morphs into bound and free ones, may eliminate some possibilities.
13236Certain letter sequences (such as "rp") do not appear at the beginning
13237of a word or morph, and others never occur at the end.
13238Knowledge of these sequences can reject some unacceptable
13239decompositions \(em or perhaps more importantly, can enable intelligent guesses
13240to be made in cases where a constituent morph has been omitted from the
13241dictionary.
13242The grammatical function of suffixes allows suffix sequences to be
13243checked for compatibility.
13244The syntax of the sentence, together with suffix knowledge, can
13245rule out other combinations.
13246Semantic knowledge will occasionally be necessary (as in the "unionized"
13247and "winding" examples above \(em compare a "winding road" with a "winding
13248blow").
13249Finally, Allen (1976) suggests that a preference structure on composition
13250rules can be used to resolve ambiguity.
13251.[
13252Allen 1976 Synthesis of speech from unrestricted text
13253.]
13254.pp
13255Once the morphological structure has been determined,
13256the rest of the pronunciation
13257process is relatively easy.
13258A phonetic transcription of each morph may be stored in the morph dictionary,
13259or else letter-to-sound rules can be used on individual morphs.
These are likely to be quite successful because final-"e" processing can
now be done with confidence:  there are no hidden final "e"'s in the middle
13262of morphs.
13263In either case the resulting phonetic transcriptions of the individual morphs
13264must be concatenated to give the transcription of the complete word.
13265Although some contextual modification has to be accounted for,
13266it is relatively straightforward and easy to predict.
13267For example, the plural morphs "-s" and "-es" can be realized phonetically
13268by
13269.ul
13270uh\ z,
13271.ul
13272s,
13273or
13274.ul
13275z
13276depending on context.
13277Similarly the past-tense suffix "-ed" may be rendered as
13278.ul
13279uh\ d,
13280.ul
13281t,
13282or
13283.ul
13284d.
13285The suffixes "-ion" and "-ure" sometimes cause modification of the previous
13286morph:  for example
13287.LB
13288.NI
13289act + ion  \(em>  \c
13290.ul
13291a k t\c
13292  + ion  \(em>  \c
13293.ul
13294a k sh uh n.
13295.LE
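.pp
These contextual rules are simple enough to state directly in code.
The following Python fragment is a minimal sketch; the phoneme mnemonics
are invented for illustration and belong to no particular transcription
scheme.
.LB
.nf
# Sketch of the context-dependent realization of the plural and
# past-tense suffixes.  The phoneme mnemonics are invented for
# illustration only.

SIBILANTS = {"s", "z", "sh", "zh", "ch", "j"}
VOICELESS = {"p", "t", "k", "f", "th", "s", "sh", "ch"}

def plural(stem):
    last = stem[-1]
    if last in SIBILANTS:
        return stem + ["uh", "z"]     # "horses"
    if last in VOICELESS:
        return stem + ["s"]           # "cats"
    return stem + ["z"]               # "peas", "dogs"

def past_tense(stem):
    last = stem[-1]
    if last in ("t", "d"):
        return stem + ["uh", "d"]     # "wanted"
    if last in VOICELESS:
        return stem + ["t"]           # "walked"
    return stem + ["d"]               # "played"

print(plural(["p", "ee"]))            # ['p', 'ee', 'z'], i.e. "peas"
print(past_tense(["a", "k", "t"]))    # ['a', 'k', 't', 'uh', 'd'], i.e. "acted"
.fi
.LE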
13296.pp
13297The morph dictionary does not remove the need for a lexicon of exceptional
13298words.
13299The irregular final-"e" words mentioned earlier ("done", "anemone", "fete")
13300need to be treated on an individual basis,
13301as do words such as "quadruped" which have misleading endings
13302(it should not be decomposed as "quadrup|ed").
13303.rh "Pronunciation of languages other than English."
13304Text-to-speech systems for other languages have been reported in
13305the literature.
13306(For example, French, Esperanto,
13307Italian, Russian, Spanish, and German are covered
13308by Lesmo
13309.ul
13310et al,
133111978; O'Shaughnessy
13312.ul
13313et al,
133141981; Sherwood, 1978;
13315Mangold and Stall, 1978).
13316.[
13317Lesmo 1978
13318.]
13319.[
13320O'Shaughnessy Lennig Mermelstein Divay 1981
13321.]
13322.[
13323Sherwood 1978
13324.]
13325.[
13326Mangold Stall 1978
13327.]
13328Generally speaking, these present fewer difficulties than does English.
13329Esperanto is particularly easy because each letter in its orthography
13330has only one sound, making the pronunciation problem trivial.
13331Moreover, stress in polysyllabic words always occurs on the penultimate
13332syllable.
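.pp
For a language as regular as Esperanto the whole pronunciation component
reduces to a table lookup plus a stress rule, as the following toy Python
sketch suggests.  The letter table is deliberately incomplete, and upper
case is used simply to mark the stressed vowel.
.LB
.nf
# Toy sketch of Esperanto pronunciation:  one letter, one sound, and
# stress on the penultimate syllable of a polysyllabic word.

LETTER_TO_SOUND = {
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
    "b": "b", "d": "d", "f": "f", "g": "g", "k": "k", "l": "l",
    "m": "m", "n": "n", "p": "p", "r": "r", "s": "s", "t": "t", "v": "v",
}
VOWELS = "aeiou"

def transcribe(word):
    phonemes = [LETTER_TO_SOUND[letter] for letter in word]
    vowel_positions = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if len(vowel_positions) > 1:               # polysyllabic word:
        stressed = vowel_positions[-2]         # stress the penultimate syllable
        phonemes[stressed] = phonemes[stressed].upper()
    return phonemes

print(transcribe("esperanto"))   # ['e', 's', 'p', 'e', 'r', 'A', 'n', 't', 'o']
.fi
.LE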
13333.pp
13334It is tempting and often sensible when designing a synthesis system for
13335English to use an utterance representation somewhere between phonetics and
13336ordinary spelling.
13337This may happen in practice even if it is not intended:  a user, finding
13338that a given word is pronounced incorrectly, will alter the spelling to
13339make it work.
The World English Spelling alphabet (Dewey, 1971), amongst others (Haas, 1966),
13341is a simplified and apparently natural scheme which was developed by the
13342spelling reform movement.
13343.[
13344Dewey 1971
13345.]
13346.[
13347Haas 1966
13348.]
13349It maps very simply on to a phonetic representation, just like Esperanto.
13350However, it can provide little help with the crucial problem of stress
13351assignment, except perhaps by explicitly indicating reduced vowels.
13352.sh "9.3  Discussion"
13353.pp
13354This chapter has really only touched the tip of a linguistic iceberg.
13355I have given some examples of representations, rules, algorithms,
13356and exceptions, to make the concepts more tangible; but a whole mass of
13357detail has been swept under the carpet.
13358.pp
There are two important messages that are worth reiterating.
13360The first is that the representation of the input \(em that is,
13361whether it be a "concept"
13362in some semantic domain, a syntactic description of an utterance, a
13363decomposition into morphs, plain text or some contrived re-spelling of it \(em
13364is crucial to the quality of the output.
13365Almost any extra information about the utterance can be taken into account
13366and used to improve the speech.
13367It is difficult to derive such information if it is not provided explicitly,
13368for the process of climbing the tree from text to semantic representation is
13369at least as hard as descending it to a phonetic transcription.
13370.pp
13371Secondly, simple algorithms perform remarkably well \(em witness the
13372punctuation-driven intonation assignment scheme, and word fragment rules
13373for pronunciation.
13374However, the combined degradation contributed by several imperfect
13375processes is likely to impair speech quality very seriously.
13376And great complexity is introduced when these simple algorithms are
13377discarded in favour of more sophisticated ones.
13378There is, for example, a world of difference between a pronunciation
13379program that copes with 97% of common words and one that deals correctly
13380with 99% of a random sample from a dictionary.
13381.pp
13382Some of the options that face the system designer are recapitulated in
13383Figure 9.4.
13384.FC "Figure 9.4"
13385Starting from text, one can take the simple approach of lexically-based
13386suffix-splitting, letter-to-sound rules, and prosodics derived
13387from punctuation, to generate a phonetic transcription.
13388This will provide a cheap system which is relatively easy to implement
13389but whose speech quality will probably not be acceptable to any but the
13390most dedicated listener
13391(such as a blind person with no other access to reading material).
13392.pp
13393The biggest improvement in speech quality from such a system would
13394almost certainly come from more intelligent prosodic
13395control \(em particularly of intonation.
This improvement, unfortunately, is also by far the most difficult to make unless
13397intonation contours, tonic stresses, and tone-group boundaries are hand-coded
13398into the input.
13399To generate the appropriate information from text one has to climb to the
13400upper levels in Figure 9.4 \(em and even when these are reached, the problems
13401are by no means over.
13402Still, let us climb the tree.
13403.pp
13404For syntax analysis, part-of-speech information is needed; and for this
13405the grammatical roles of individual words in the text must be ascertained.
13406A morph dictionary is the most reliable way to do this.
13407A linguist may prefer to go from morphs to syntax by way of morphemes;
13408but this is not necessary for the present purpose.
13409Just the information that
13410the morph "went" is a verb can be stored in the dictionary, instead
13411of its decomposition  [went]\ =\ [go]\ +\ [ed].
13412.pp
13413Now that we have the morphological structure of the text, stress assignment rules
13414can be applied to produce more accurate speech rhythms.
13415The morph decomposition will also allow improvements to be made to the
13416pronunciation, particularly in the case of silent "e"'s in compound words.
13417But the ability to assign intonation has hardly been improved at all.
13418.pp
13419Let us proceed upwards.
13420Now the problems become really difficult.
13421A semantic representation of the text is needed; but what exactly does this
13422mean?
13423We certainly must have
13424.ul
13425morphemic
13426knowledge, for now the fact that "went" is a derivative of "go"
13427(rather than any other verb) becomes crucial.
13428Very well, let us augment the morph dictionary with morphemic information.
13429But this does not attack the problem of semantic representation.
13430We may wish to resolve pronoun references to help assign stress.
13431Parts of the problem are solved in principle
13432and reported in the artificial intelligence
13433literature, but if such an ability is incorporated into the speech
13434synthesis system it will become enormously complicated.
13435In addition, we have seen that knowledge of antitheses in the text will greatly
13436assist intonation assignment, but procedures for extracting this
13437information constitute a research topic in their own right.
13438.pp
13439Now step back and take a top-down approach.
13440What could we do with this semantic understanding and knowledge of the structure
13441of the discourse if we had it?
13442Suppose the input were a "concept" in some as yet undetermined representation.
13443What are the
13444.ul
13445acoustic
13446manifestations of such high-level features as anaphoric references or
13447antithetical comparisons,
13448of parenthetical or satirical remarks,
13449of emotions:  warmth, sarcasm, sadness and despair?
13450Can we program the art of elocution?
13451These are good questions.
13452.sh "9.4  References"
13453.LB "nnnn"
13454.[
13455$LIST$
13456.]
13457.LE "nnnn"
13458.sh "9.5  Further reading"
13459.pp
13460Books on pronunciation give surprisingly little help in designing
13461a text-to-speech procedure.
13462The best aid is a good on-line dictionary and flexible software to
13463search it and record rules, examples, and exceptions.
13464Here are some papers that describe existing systems.
13465.LB "nn"
13466.\"Ainsworth-1974-1
13467.]-
13468.ds [A Ainsworth, W.A.
13469.ds [D 1974
13470.ds [T A system for converting text into speech
13471.ds [J IEEE Trans Audio and Electroacoustics
13472.ds [V AU-21
13473.ds [P 288-290
13474.nr [P 1
13475.nr [T 0
13476.nr [A 1
13477.nr [O 0
13478.][ 1 journal-article
13479.in+2n
13480.in-2n
13481.\"Colby-1978-2
13482.]-
13483.ds [A Colby, K.M.
13484.as [A ", Christinaz, D.
13485.as [A ", and Graham, S.
13486.ds [D 1978
13487.ds [K *
13488.ds [T A computer-driven, personal, portable, and intelligent speech prosthesis
13489.ds [J Computers and Biomedical Research
13490.ds [V 11
13491.ds [P 337-343
13492.nr [P 1
13493.nr [T 0
13494.nr [A 1
13495.nr [O 0
13496.][ 1 journal-article
13497.in+2n
13498.in-2n
13499.\"Elovitz-1976-3
13500.]-
13501.ds [A Elovitz, H.S.
13502.as [A ", Johnson, R.W.
13503.as [A ", McHugh, A.
13504.as [A ", and Shore, J.E.
13505.ds [D 1976
13506.ds [K *
13507.ds [T Letter-to-sound rules for automatic translation of English text to phonetics
13508.ds [J IEEE Trans Acoustics, Speech and Signal Processing
13509.ds [V ASSP-24
13510.ds [N 6
13511.ds [P 446-459
13512.nr [P 1
13513.ds [O December
13514.nr [T 0
13515.nr [A 1
13516.nr [O 0
13517.][ 1 journal-article
13518.in+2n
13519.in-2n
13520.\"Kooi-1978-4
13521.]-
13522.ds [A Kooi, R.
13523.as [A " and Lim, W.C.
13524.ds [D 1978
13525.ds [T An on-line minicomputer-based system for reading printed text aloud
13526.ds [J IEEE Trans Systems, Man and Cybernetics
13527.ds [V SMC-8
13528.ds [P 57-62
13529.nr [P 1
13530.ds [O January
13531.nr [T 0
13532.nr [A 1
13533.nr [O 0
13534.][ 1 journal-article
13535.in+2n
13536.in-2n
13537.\"Umeda-1975-5
13538.]-
13539.ds [A Umeda, N.
13540.as [A " and Teranishi, R.
13541.ds [D 1975
13542.ds [K *
13543.ds [T The parsing program for automatic text-to-speech synthesis developed at the Electrotechnical Laboratory in 1968
13544.ds [J IEEE Trans Acoustics, Speech and Signal Processing
13545.ds [V ASSP-23
13546.ds [N 2
13547.ds [P 183-188
13548.nr [P 1
13549.ds [O April
13550.nr [T 0
13551.nr [A 1
13552.nr [O 0
13553.][ 1 journal-article
13554.in+2n
13555.in-2n
13556.\"Umeda-1976-6
13557.]-
13558.ds [A Umeda, N.
13559.ds [D 1976
13560.ds [K *
13561.ds [T Linguistic rules for text-to-speech synthesis
13562.ds [J Proc IEEE
13563.ds [V 64
13564.ds [N 4
13565.ds [P 443-451
13566.nr [P 1
13567.ds [O April
13568.nr [T 0
13569.nr [A 1
13570.nr [O 0
13571.][ 1 journal-article
13572.in+2n
13573.in-2n
13574.LE "nn"
13575.EQ
13576delim $$
13577.EN
13578.CH "10  DESIGNING THE MAN-COMPUTER DIALOGUE"
13579.ds RT "The man-computer dialogue
13580.ds CX "Principles of computer speech
13581.pp
13582Interactive computers are being used more and more by non-specialist people
13583without much previous computer experience.
13584As processing costs continue to decline, the overall expense of providing
13585highly interactive systems
13586becomes increasingly dominated by terminal and communications equipment.
13587Taken together, these two factors highlight the need for easy-to-use,
13588low-bandwidth interactive terminals that make maximum use of the existing
13589telephone network for remote access.
13590.pp
13591Speech output can provide versatile feedback from a computer at very low
13592cost in distribution and terminal equipment.  It is attractive from several
13593points of view.
13594Terminals \(em telephones \(em are invariably in place already.
13595People without experience of computers are accustomed to their use,
13596and are not intimidated by them.
13597The telephone network is cheap to use and extends all over the world.
13598The touch-tone keypad (or a portable tone generator)
13599provides a complementary data input device which will do for many
13600purposes until the technology of speech recognition becomes better developed
13601and more widespread.
13602Indeed, many applications \(em especially information retrieval ones \(em need
13603a much smaller bandwidth from user to computer than in the reverse direction,
13604and voice output combined with restricted keypad entry provides a good match
13605to their requirements.
13606.pp
13607There are, however, severe problems in implementing natural and useful
13608interactive systems using speech output.
13609The eye can absorb information at a far greater rate than can the ear.
13610You can scan a page of text in a way which has no analogy in auditory terms.
13611Even so, it is difficult to design a dialogue which allows you to search
13612computer output visually at high speed.
13613In practice, scanning a new report is often better done at your desk
13614with a printed copy than at a computer terminal with a viewing program
13615(although this is likely to change in the near future).
13616.pp
13617With speech, the problem of organizing output becomes even harder.
13618Most of the information we learn using our ears is presented in a
13619conversational way, either in face-to-face discussions or over the telephone.
13620Verbal but non-conversational presentations, as in the
13621university lecture theatre, are known to be a rather inefficient way
13622of transmitting information.
13623The degree of interaction is extremely high even in a telephone conversation,
13624and communication relies heavily on speech gestures such as hesitations,
13625grunts, and pauses; on prosodic features such as intonation, pitch range,
13626tempo, and voice quality; and on conversational gambits such as interruption
13627and long silence.
13628I emphasized in the last two chapters the rudimentary state of knowledge
13629about how to synthesize
13630prosodic features, and the situation is even worse
13631for the other, paralinguistic, phenomena.
13632.pp
13633There is also a very special problem with voice output, namely, the transient
13634nature of the speech signal.
13635If you miss an utterance, it's gone.
13636With a visual display unit, at least the last few interactions usually remain
13637available.
13638Even then, it is not uncommon to look up beyond the top of the screen and
13639wish that more of the history was still visible!
13640This obviously places a premium on a voice response system's
13641ability to repeat utterances.
13642Moreover, the dialogue designer must do his utmost to ensure that the user
13643is always aware of the current state of the interaction,
13644for there is no opportunity to refresh the memory by glancing at earlier
13645entries and responses.
13646.pp
13647There are two separate aspects to the man-computer interface in a voice
13648response system.
13649The first is the relationship between the system and the end user,
13650that is, the "consumer" of the synthesized dialogue.
13651The second is the relationship between the system and the applications
13652programmer who creates the dialogue.
13653These are treated separately in the next two sections.
13654We will have more to say about the former aspect,
13655for it is ultimately more important to more people.
13656But the applications programmer's view is important, too; for without him
13657no systems would exist!
13658The technical difficulties in creating synthetic dialogues
13659for the majority of voice systems probably
13660explain why speech output technology is still greatly under-used.
13661Finally we look at techniques for using small keypads such as those on
13662touch-tone telephones,
13663for they are an essential part of many voice response systems.
13664.sh "10.1  Programming principles for natural interaction"
13665.pp
Special attention must be paid to the details of the man-machine interface
13667in speech-output systems.
13668This section summarizes experience of human factors considerations
13669gained in developing the remote
13670telephone enquiry service described in Chapter 1 (Witten and Madams, 1977),
13671which employs an ordinary touch-tone keypad for input in conjunction with
13672synthetic voice response.
13673.[
13674Witten Madams 1977 Telephone Enquiry Service
13675.]
13676Most of the principles which emerged were the result of natural evolution
13677of the system, and were not clear at the outset.
13678Basically, they stem from the fact that speech is both more intrusive
13679and more ephemeral than writing, and so they are applicable in general to
13680speech output information retrieval systems with keyboard or even voice
13681input.
13682Be warned, however, that they are based upon casual observation and
13683speculation rather than empirical research.
13684There is a desperate need for proper studies of user psychology in speech
13685systems.
13686.rh "Echoing."
13687Most alphanumeric input peripherals echo on a character-by-character basis.
13688Although one can expect quite a high proportion of mistakes with
13689unconventional keyboards, especially when entering alphabetic data on a
13690basically numeric keypad, audio character echoing is distracting and annoying.
13691If you type "123" and the computer echoes
13692.LB
13693.NI
13694"one ... two ... three"
13695.LE
13696after the individual key-presses, it is liable to divert your
13697attention, for voice output is much more intrusive than a purely visual "echo".
13698.pp
13699Instead, an immediate response to a completed input line is preferable.
This response can take the form of a reply to a query, or, if successive
13701data items are being typed, confirmation of the data entered.
13702In the latter case, it is helpful if the information can be generated in
13703the same way that the user himself would be likely to verbalize it.
13704Thus, for example, when entering numbers:
13705.LB
13706.nr x0 \w'COMPUTER:'
13707.nr x1 \w'USER:'
13708.NI
13709USER:\h'\n(x0u-\n(x1u' "123#"	(# is the end-of-line character)
13710.NI
13711COMPUTER: "One hundred and twenty-three."
13712.LE
13713For a query which requires lengthy processing, the input should be
13714repeated in a neat, meaningful format to give the user a chance to abort
13715the request.
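.pp
Generating confirmation in the user's own terms is largely a matter of
verbalizing the data sensibly.  A minimal Python sketch of the
number-to-words conversion used in the echoing example above (whole
numbers below one thousand only) might look like this.
.LB
.nf
# Minimal sketch:  spell out a whole number the way a user would say it,
# e.g. 123 becomes "one hundred and twenty-three".  Everything here is
# illustrative, and only 0..999 is handled.

UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen",
         "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
         "nineteen"]
TENS  = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
         "seventy", "eighty", "ninety"]

def say_tens(n):                       # 0..99
    if n < 20:
        return UNITS[n]
    if n % 10 == 0:
        return TENS[n // 10]
    return TENS[n // 10] + "-" + UNITS[n % 10]

def say_number(n):                     # 0..999
    if n < 100:
        return say_tens(n)
    if n % 100 == 0:
        return UNITS[n // 100] + " hundred"
    return UNITS[n // 100] + " hundred and " + say_tens(n % 100)

print(say_number(123))                 # one hundred and twenty-three
.fi
.LE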
13716.rh "Retracting actions."
13717Because commands are entered directly without explicit confirmation,
13718it must always be easy for the user to revoke his actions.
13719The utility of an "undo" command is now commonly recognized for
13720any interactive system, and it becomes even more important in speech
13721systems because it is easier for the user to lose his place in the
13722dialogue and so make errors.
13723.rh "Interrupting."
13724A command which interrupts output and returns to a known state
13725should be recognized at every level of the system.
13726It is essential that voice output be terminated immediately,
13727rather than at the end of the utterance.
13728We do not want the user to live in fear of the system embarking on
13729a long, boring monologue that is impossible to interrupt!
13730Again, the same is true of interactive dialogues which do not use speech,
13731but becomes particularly important with voice response because it takes
13732longer to transmit information.
13733.rh "Forestalling prompts."
13734Computer-generated prompts must be explicit and frequent enough
13735to allow new users to understand what they are expected to do.
13736Experienced users will "type ahead" quite naturally,
13737and the system should suppress unnecessary prompts under these conditions
13738by inspecting the input buffer before prompting.
13739This allows the user to concatenate frequently-used commands into chunks whose
13740size is entirely at his own discretion.
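.pp
The mechanism itself is trivial:  before issuing a prompt, look at the
input buffer.  The following Python fragment is only an illustration of
the idea, with the type-ahead buffer simulated by a queue; it is not the
code of any actual service.
.LB
.nf
# Sketch of prompt suppression:  a prompt is spoken only if the user has
# not already typed ahead.  speak() just prints here.

from collections import deque

type_ahead = deque(["1234", "9999", "7"])   # an expert keys everything at once

def speak(text):
    print("COMPUTER:", text)

def prompted_read(prompt):
    if not type_ahead:            # nothing typed ahead: the novice gets a prompt
        speak(prompt)
        return input()
    return type_ahead.popleft()   # typed-ahead input is consumed silently

user_number = prompted_read("Please key your user number")
password    = prompted_read("Now key your password")
service     = prompted_read("Which service do you require?")
.fi
.LE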
13741.pp
13742With the above-mentioned telephone enquiry service, for example,
13743it was found that people often took advantage of the prompt-suppression
13744feature to enter their
13745user number, password, and required service number as a single keying
13746sequence.
As you become familiar with a service you quickly and easily learn to
13748forestall expected prompts by typing ahead.
13749This provides a very natural way for the system to adapt itself automatically
13750to the experience of the user.
13751New users will naturally wait to be prompted, and proceed through the dialogue
13752at a slower and more relaxed pace.
13753.pp
13754Suppressing unnecessary prompts is a good idea in any interactive system,
13755whether or not it uses the medium of speech \(em although it is hardly ever done
13756in conventional systems.
13757It is particularly important with speech, however, because an unexpected
13758or unwanted
13759prompt is quite distracting, and it is not so easy to ignore it as it is
13760with a visual display.
13761Furthermore, speech messages usually take longer to present
than displayed ones, so that the user is distracted for longer.
13763.rh "Information units."
13764Lengthy computer voice responses are inappropriate for conveying information,
13765because attention wanders if one is not actively involved in the conversation.
13766A sequential exchange of terse messages, each designed to dispense one
13767small unit of information, forces the user to take a meaningful part in the
13768dialogue.
13769It has other advantages, too, allowing a higher degree of input-dependent
13770branching, and permitting rapid recovery from errors.
13771.pp
The following extract from the "Acidosis program", an audio response system
designed to help physicians diagnose acidosis, is a good example
of what
13775.ul
13776not
13777to do.
13778.LB
13779"(Chime) A VALUE OF SIX-POINT-ZERO-ZERO HAS BEEN ENTERED FOR PH.
13780THIS VALUE IS IMPOSSIBLE.
13781TO CONTINUE THE PROGRAM, ENTER A NEW VALUE FOR PH IN THE RANGE
13782BETWEEN SIX-POINT-SIX AND EIGHT-POINT-ZERO
13783(beep dah beep-beep)"  (Smith and Goodwin, 1970).
13784.[
13785Smith Goodwin 1970
13786.]
13787.LE
13788The use of extraneous noises (for example, a "chime" heralds an error message,
13789and a "beep dah beep-beep" requests data input in the form
13790<digit><point><digit><digit>)
13791was thought necessary in the Acidosis program to keep the user awake
13792and help him with the format of the interaction.
13793Rather than a long monologue like this,
13794it seems much better to design a sequential interchange of terse messages,
13795so that the caller can be guided into a state where he can rectify his error.
13796For example,
13797.LB
13798.nf
13799.ne11
13800.nr x0 \w'COMPUTER:'
13801.nr x1 \w'CALLER:'
13802CALLER:\h'\n(x0u-\n(x1u' "6*00#"
13803COMPUTER: "Entry out of range"
13804CALLER:\h'\n(x0u-\n(x1u' "6*00#"  (persists)
13805COMPUTER: "The minimum acceptable pH value is 6.6"
13806CALLER:\h'\n(x0u-\n(x1u' "9*03#"
13807COMPUTER: "The maximum acceptable pH value is 8.0"
13808.fi
13809.LE
13810This dialogue allows a rapid exit from the error situation in the likely
13811event that the entry has simply been mis-typed.
13812If the error persists, the caller is given just one piece of information
13813at a time, and forced to continue to play an active role in the interaction.
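.pp
The style of interchange illustrated above needs only a little state:  the
system counts how many times in succession an entry has been rejected, and
releases one further piece of information each time.  A Python sketch
follows; the limits and wording are taken from the example, and everything
else is invented.
.LB
.nf
# Sketch of a terse, escalating error dialogue for a pH entry.
# speak() just prints, and read_value() stands for keypad entry
# already decoded to a number.

PH_MIN, PH_MAX = 6.6, 8.0

def speak(text):
    print("COMPUTER:", text)

def read_ph(read_value):
    rejections = 0
    while True:
        value = read_value()
        if PH_MIN <= value <= PH_MAX:
            return value
        rejections += 1
        if rejections == 1:
            speak("Entry out of range")      # first rejection: terse message
        elif value < PH_MIN:
            speak("The minimum acceptable pH value is 6.6")
        else:
            speak("The maximum acceptable pH value is 8.0")

# e.g.  read_ph(lambda: float(input("pH? ")))
.fi
.LE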
13814.rh "Input timeouts."
13815In general, input timeouts are dangerous, because they introduce apparent
13816acausality in the system seen by the user.
13817A case has been reported where a user became "highly agitated and refused
13818to go near the terminal again after her first timed-out prompt.
13819She had been quietly thinking what to do and the terminal suddenly
13820interjecting and making its
13821own suggestions was just too much for her" (Gaines and Facey, 1975).
13822.[
13823Gaines Facey 1975
13824.]
13825.pp
13826However, voice response systems lack the satisfying visual feedback
13827of end-of-line on termination of an entry.
13828Hence a timed-out reminder is appropriate if a delay occurs after some
13829characters have been entered.
13830This requires the operating system to support a character-by-character mode
13831of input, rather than the usual line-by-line mode.
13832.rh "Repeat requests."
13833Any voice response system must support a universal "repeat last utterance"
13834command, because old output does not remain visible.
13835A fairly sophisticated facility is desirable, as repeat requests are
13836very frequent in practice.
13837They may be due to a simple inability to understand a response,
13838to forgetting what was said, or to distraction of attention \(em which is
13839especially common with office terminals.
13840.pp
13841In the telephone enquiry service two distinct commands were employed,
13842one to repeat the last utterance in case of misrecognition,
13843and the other to summarize the current state of the interaction
13844in case of distraction.
13845For the former, it is essential to avoid simply regenerating an utterance
13846identical with the last.
13847Some variation of intonation and rhythm is needed to prevent an annoying,
13848stereotyped response.
13849A second consecutive repeat request should trigger a paraphrased reply.
13850An error recovery sequence could be used which presented the misunderstood
13851information in a different way with more interaction, but experience
13852indicates that this is of minor importance, especially if information units
13853are kept small anyway.
13854To summarize the current state of the interaction in response to the second
type of repeat command requires the system to maintain a model of
13856the user.
13857Even a poor model, like a record of his last few transactions and their
13858results, is well worth having.
13859.rh "Varied speech."
13860Synthetic speech is usually rather dreary to listen to.
13861Successive utterances with identical intonations should be carefully avoided.
13862Small changes in speaking rate, pitch range, and mean pitch level,
13863all serve to add variety.
13864Unfortunately, little is known at present about the role of intonation in
13865interactive dialogue, although this is an active research area and
13866new developments can be expected (for a detailed report of a recent
13867research project relevant to this topic see Brown
13868.ul
13869et al,
138701980).
13871.[
13872Brown Currie Kenworthy 1980
13873.]
13874However, even random variations in certain parameters of the pitch contour
13875are useful to relieve the tedium of repetitive intonation patterns.
13876.sh "10.2  The applications programming environment"
13877.pp
13878The comments in the last section are aimed at the applications programmer
13879who is designing the dialogue and constructing the interactive system.
13880But what kind of environment should
13881.ul
13882he
13883be given to assist with this work?
13884.pp
13885The best help the applications programmer can have is a speech generation
13886method which makes it easy for him to enter new utterances and modify
13887them on-line in cut-and-try attempts to render the man-machine dialogue
13888as natural as possible.
13889This is perhaps the most important advantage of synthesizing speech by rule
13890from a textual representation.
13891If encoded versions of natural utterances are stored, it becomes quite
13892difficult to make minor modifications to the dialogue in the light of
13893experience with it, for a recording session must be set up
13894to acquire new utterances.
13895This is especially true if more than one voice is used, or if the
13896voice belongs to a person who cannot be recalled quickly by the programmer
13897to augment the utterance library.
13898Even if it is his own voice there will still be delays, for recording
13899speech is a real-time job which usually needs a stand-alone processor,
13900and if data compression is used a substantial amount of computation will
be needed before the utterance is in a usable form.
13902.pp
13903The broad phonetic input required by segmental speech synthesis-by-rule
13904systems is quite suitable for utterance representation.
13905Utterances can be entered quickly from a standard computer terminal,
13906and edited as text files.
13907Programmers must acquire skill in phonetic transcription,
13908but this is a small inconvenience.
13909The art is easily learned in an interactive situation where the effect
13910of modifications to the transcription can be heard immediately.
13911If allophones must be represented explicitly in the input then the
13912programmer's task becomes considerably more complicated because of the
13913combinatorial explosion in trial-and-error modifications.
13914.pp
13915Plain text input is also quite suitable.
13916A significant rate of error is tolerable if immediate audio feedback
13917of the result is available, so that the operator can adjust his text
13918to suit the pronunciation idiosyncrasies of the program.
13919But it is acceptable, and indeed preferable, if prosodic features are
13920represented explicitly in the input rather than being assigned automatically
13921by a computer program.
13922.pp
13923The application of voice response to interactive computer dialogue is
13924quite different to the problem of reading aloud from text.
13925We have seen that a major concern with reading machines is how to glean
13926information about intonation, rhythm, emphasis, tone of voice, and so on,
13927from an input of ordinary English text.
13928The significant problems of semantic processing, utilization of pragmatic
13929knowledge, and syntactic analysis do not, fortunately, arise in interactive
13930information retrieval systems.
13931In these, the end user is communicating with a program which has been
13932created by a person who knows what he wants it to say.
13933Thus the major difficulty is in
13934.ul
13935describing
13936the prosodic features rather than
13937.ul
13938deriving
13939them from text.
13940.pp
13941Speech synthesis by rule is a subsidiary process to the main interactive
13942procedure.
13943It would be unwise to allow
13944the updating of resonance parameter tracks to be interrupted by
13945other calls on the system, and so the synthesis process needs to be executed
13946in real time.
13947If a stand-alone processor is used for the interactive dialogue, it may
13948be able to handle the synthesis rules as well.
13949In this case the speech-by-rule program could be a library procedure,
13950if the system is implemented in a compiled language.
13951An interesting alternative with an interpretive-language implementation,
13952such as Basic, is to alter the language interpreter to add a new
13953command, "speak", which simply transfers a string representing an utterance
13954to an asynchronous process which synthesizes it.
However, there must be some way for an interpreted program to abort the
13956current synthesis in the event of an interrupt signal from the user.
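.pp
One way to picture the "speak" command is as a queue feeding an
asynchronous synthesis process, with an abort signal that silences the
current utterance and discards any that are queued.  The following Python
sketch uses a thread to stand in for the separate process; all of the
names are invented.
.LB
.nf
# Sketch of an asynchronous "speak" facility with an interrupt that
# silences the current utterance and discards queued ones.  The
# synthesize() routine stands in for real-time synthesis by rule.

import queue, threading, time

utterances = queue.Queue()
abort_flag = threading.Event()

def synthesize(text):
    # pretend to speak, checking the abort flag often enough that
    # interruption feels immediate to the user
    for word in text.split():
        if abort_flag.is_set():
            return
        time.sleep(0.1)

def worker():
    while True:
        text = utterances.get()
        abort_flag.clear()
        synthesize(text)

threading.Thread(target=worker, daemon=True).start()

def speak(text):               # the new interpreter command
    utterances.put(text)

def interrupt():               # called on an interrupt signal from the user
    abort_flag.set()
    try:
        while True:
            utterances.get_nowait()
    except queue.Empty:
        pass

speak("One hundred and twenty-three")
time.sleep(1)                  # let the daemon worker run in this demonstration
.fi
.LE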
13957.pp
13958If the main computer system is time-shared, the synthesis-by-rule
13959procedure is best executed by an independent processor.
13960For example, a 16-bit microcomputer controlling a hardware
13961formant synthesizer has been used to run the
13962ISP system in real time without too much difficulty (Witten and Abbess, 1979).
13963.[
13964Witten Abbess 1979
13965.]
13966An important task is to define an interface between the two which
13967allows the main process to control relevant aspects of the prosody of
13968the speech in a way which is appropriate to the state of the interaction,
13969without having to bother about such things as matching the intonation contour
13970to the utterance and the details of syllable rhythm.
13971Halliday's notation appears to be quite suitable for this purpose.
13972.pp
13973If there is only one synthesizer on the system, there will be no
13974difficulty in addressing it.
13975One way of dealing with multiple synthesizers is to treat them as
13976assignable devices in the same way that non-spooling peripherals
13977are in many operating systems.
13978Notice that the data rate to the synthesizer is quite low
13979if the utterance is represented as text with prosodic markers,
13980and can easily be handled by a low-speed asynchronous serial line.
13981.pp
13982The Votrax ML-I synthesizer which is discussed in the next chapter has an
13983interface which interposes it between a visual display unit and the serial
13984port that connects it to the computer.
13985The VDU terminal can be used quite normally, except that a special sequence
13986of two control characters will cause Votrax to intercept the following
13987message up to another control character, and interpret it as speech.
13988The fact that the characters which specify the spoken message do not appear
13989on the VDU screen means that the operation is invisible to the user.
13990However, this transparency can be inhibited by a switch on the synthesizer
13991to allow visual checking of the sound-segment character sequence.
13992.pp
13993Votrax buffers up to 64 sound segments, which is sufficient to generate
13994isolated spoken messages.
13995For longer passages, it can be synchronized with the constant-rate
13996serial output using the modem control lines of the serial interface,
13997together with appropriate device-driving software.
13998.pp
13999This is a particularly convenient interfacing technique in cases when the
14000synthesizer should always be associated with a certain terminal.
14001As an example of how it can be used,
14002one can arrange files each of whose lines contain a printed message,
14003together with its Votrax equivalent bracketed by the appropriate
14004control characters.
14005When such a file is listed, or examined with an editor program, the lines
14006appear simultaneously in spoken and typed English.
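.pp
Preparing such a file is simply a matter of bracketing each sound-segment
string with the synthesizer's control characters.  The following Python
sketch shows the idea; the control characters and the segment codes used
here are invented placeholders, not the actual Votrax ones.
.LB
.nf
# Sketch of writing a file whose lines carry a printed message together
# with its spoken equivalent bracketed by control characters.

SPEECH_ON  = chr(1) + chr(2)    # two-character sequence that starts interception
SPEECH_OFF = chr(3)             # character that ends it

def spoken_line(printed_text, sound_segments):
    return printed_text + SPEECH_ON + sound_segments + SPEECH_OFF

with open("messages.txt", "w") as f:
    print(spoken_line("Ready", "R EH D IY"), file=f)
    print(spoken_line("Entry out of range",
                      "EH N T R IY  AW T  AH V  R EY N JH"), file=f)
.fi
.LE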
14007.pp
14008If a phonetic representation is used for utterances, with real-time
14009synthesis using a separate process (or processor), it is easy for
14010the programmer to fiddle about with the interactive dialogue to get
14011it feeling right.
14012For him, each utterance is just a textual string which
14013can be stored as a string constant within his program just as a VDU prompt
14014would be.  He can edit it as part of his program, and "print" it to
14015the speech synthesis device to hear it.
14016There are no more technical problems to developing an interactive dialogue
14017with speech output than there are for a conventional interactive program.
14018Of course, there are more human problems, and the points discussed
14019in the last section should always be borne in mind.
14020.sh "10.3  Using the keypad"
14021.pp
14022One of the greatest advantages of speech output from computers is the
14023ubiquity of the telephone network and the possibility of using it without
14024the need for special equipment at the terminal.
14025The requirement for input as well as output obviously presents something of a problem
14026because of the restricted nature of the telephone keypad.
14027.pp
14028Figure 10.1 shows the layout of the keypad.
14029.FC "Figure 10.1"
14030Signalling is achieved by dual-frequency tones.
14031For example, if key 7 is pressed, sinusoidal components at 852\ Hz and 1209\ Hz
14032are transmitted down the line.
14033During the process of dialling these are received by the telephone exchange
14034equipment, which assembles the digits that form a number and attempts to route
14035the call appropriately.
14036Once a connection is made, either party is free to press keys if desired
14037and the signals will be transmitted to the other end,
14038where they can be decoded by simple electronic circuits.
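.pp
Each key is thus identified by one frequency from a low group, shared by
the keys in its row, and one from a high group, shared by the keys in its
column.  The following Python sketch records the standard assignment and
generates the two sinusoids for a key press.
.LB
.nf
# The dual-tone frequency assignment of the touch-tone keypad, and a
# sketch of generating the tone pair for one key press.

import math

ROW_HZ = [697, 770, 852, 941]      # one frequency for each keypad row
COL_HZ = [1209, 1336, 1477]        # one for each column (a fourth, 1633 Hz,
                                   # belongs to the unimplemented keys)
KEYS   = ["123", "456", "789", "*0#"]

def key_frequencies(key):
    for r, row in enumerate(KEYS):
        if key in row:
            return ROW_HZ[r], COL_HZ[row.index(key)]
    raise ValueError("not a keypad symbol: " + key)

def key_samples(key, duration=0.1, rate=8000):
    # sum of two sinusoids, sampled at a telephone-like 8 kHz
    f_low, f_high = key_frequencies(key)
    return [math.sin(2 * math.pi * f_low * n / rate) +
            math.sin(2 * math.pi * f_high * n / rate)
            for n in range(int(duration * rate))]

print(key_frequencies("7"))        # (852, 1209), as in the text
.fi
.LE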
14039.pp
14040Dial telephones signal with closely-spaced dial pulses.
14041One pulse is generated for a "1", two for a "2", and so on.
14042(Obviously, ten pulses are generated for a "0", rather than none!)  Unfortunately,
14043once the connection is made it is difficult to signal with dial pulses.
14044They cannot be decoded reliably at the other end because the telephone
14045network is not designed to transmit such low frequencies.
14046However, hand-held tone generators can be purchased for use with dial
14047telephones.
14048Although these are undeniably extra equipment, and one purpose of using speech
14049output is to avoid this, they are very cheap and portable compared with other
14050computer terminal equipment.
14051.pp
14052The small number of keys on the telephone pad makes it rather difficult
14053to use for communicating with computers.
14054Provision is made for 16 keys, but only 12 are implemented \(em the others
14055may be used for some military purposes.
14056Of course, if a separate tone generator is used then advantage can be taken
14057of the extra keys, but this will introduce incompatibility with those
14058who use unmodified touch-tone phones.
14059More sophisticated terminals are available which extend the keypad \(em such
as the Displayphone of Northern Telecom.
14061However, they are designed as a complete communications terminal and
14062contain their own visual display as well.
14063.rh "Keying alphabetic data."
14064Figure 10.2 shows the near-universal scheme for overlaying alphabetic letters
14065on to the telephone keypad.
14066.FC "Figure 10.2"
14067Since more than one symbol occupies each key, it is obviously necessary
14068to have multiple keystrokes per character if the input sequence is to be
14069decodable as a string of letters.
14070One way of doing this is to depress the appropriate button the number of
14071times corresponding to the position of the letter on it.
14072For example, to enter the letter "L" the user would key the "5" button
14073three times in rapid succession.
14074Keying rhythm must be used to distinguish the four entries "J\ J\ J",
14075"J\ K", "K\ J", and "L", unless one of the bottom three buttons is used
14076as a separator.
14077A different method is to use "*", "0", and "#" as shift keys to indicate whether
14078the first, second, or third letter on a key is intended.
14079Then "#5" would represent "L".
14080Alternatively, the shift could follow the key instead of preceding it,
14081so that "5#" represented "L".
14082.pp
14083If numeric as well as alphabetic information may be entered, a mode-shift
14084operation is commonly used to switch between numeric and alphabetic modes.
14085.pp
14086The relative merits of these three methods, multiple depressions, shift
14087key prefix, and shift key suffix, have been investigated
14088experimentally (Kramer, 1970).
14089.[
14090Kramer 1970
14091.]
14092The results were rather inconclusive.
14093The first method seemed to be slightly inferior in terms of user accuracy.
14094It seemed that preceding rather than following shifts gave higher accuracy,
14095although this is perhaps rather counter-intuitive and may have been
14096fortuitous.
14097The most useful result from the experiments was that users exhibited
14098significant learning behaviour, and a training period of at least two hours
14099was recommended.
14100Operators were found able to key at rates of at least three to four
14101characters per second, and faster with practice.
14102.pp
14103If a greater range of characters must be represented then the coding problem
14104becomes more complex.
14105Figure 10.3 shows a keypad which can be used for entry of the full 64-character
14106standard upper-case ASCII alphabet (Shew, 1975).
14107.[
14108Shew 1975
14109.]
14110.FC "Figure 10.3"
14111The system is intended for remote vocabulary updating in a phonetically-based
14112speech synthesis system.
14113There are three modes of operation:  numeric, alphabetic, and symbolic.
14114These are entered by "##", "**", and "*0" respectively.
14115Two function modes, signalled by "#0" and "#*", allow some
14116rudimentary line-editing and monitor facilities to be incorporated.
14117Line-editing commands include character and line delete, and two kinds of
14118read-back commands \(em one tries to pronounce the words in a line
14119and the other spells out the characters.
14120The monitor commands allow the user to repeat the effect of the last input line
14121as though he had entered it again, to order the system to read back the
14122last complete output line, and to query time and system status.
14123.rh "Incomplete keying of alphanumeric data."
14124It is obviously going to be rather difficult for the operator to key
14125alphanumeric information unambiguously on a 12-key pad.
14126In the description of the telephone enquiry service in Chapter 1,
14127it was mentioned that single-key entry can be useful for alphanumeric data
14128if the ambiguity can be resolved by the computer.
14129If a multiple-character entry is known to refer to an item on a given
14130list, the characters can be keyed directly according to the coding scheme
14131of Figure 10.2.
14132.pp
14133Under most circumstances no ambiguity will arise.
14134For example, Table 10.1 shows the keystrokes that would be entered for the
14135first 50 5-letter words in an English dictionary.
Only two clashes occur \(em between "adore" and "afore", and
14137"agate" and "agave".
14138.RF
14139.nr x2 \w'abeam  'u
14140.nr x3 \w'00000#    'u
14141.nr x0 \n(x2u+\n(x3u+\n(x2u+\n(x3u+\n(x2u+\n(x3u+\n(x2u+\n(x3u+\n(x2u+\w'00000#'u
14142.nr x1 (\n(.l-\n(x0)/2
14143.in \n(x1u
14144.ta \n(x2u +\n(x3u +\n(x2u +\n(x3u +\n(x2u +\n(x3u +\n(x2u +\n(x3u +\n(x2u
14145\l'\n(x0u\(ul'
14146.sp
14147aback	22225#	abide	22433#	adage	23243#	adore	23673#	after	23837#
14148abaft	22238#	abode	22633#	adapt	23278#	adorn	23676#	again	24246#
14149abase	22273#	abort	22678#	adder	23337#	adult	23858#	agape	24273#
14150abash	22274#	about	22688#	addle	23353#	adust	23878#	agate	24283#
14151abate	22283#	above	22683#	adept	23378#	aeger	23437#	agave	24283#
14152abbey	22239#	abuse	22873#	adieu	23438#	aegis	23447#	agent	24368#
14153abbot	22268#	abyss	22977#	admit	23648#	aerie	23743#	agile	24453#
14154abeam	22326#	acorn	22676#	admix	23649#	affix	23349#	aglet	24538#
14155abele	22353#	acrid	22743#	adobe	23623#	afoot	23668#	agony	24669#
14156abhor	22467#	actor	22867#	adopt	23678#	afore	23673#	agree	24733#
14157\l'\n(x0u\(ul'
14158.in 0
14159.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
14160.FG "Table 10.1  Keying equivalents of some words"
14161As a more extensive example, in a dictionary of 24,500 words, just under 2,000
14162ambiguities (8% of words) were discovered.
14163Such ambiguities would have to be resolved interactively by the system explaining
14164its dilemma, and asking the user for a choice.
14165Notice incidentally that although the keyed sequences do not have the same
14166lexicographic order as the words,
14167no extra cost will be associated with the table-searching
14168operation if the dictionary is stored in inverted form, with each legal
14169number pointing to its English equivalent or equivalents.
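.pp
Building the inverted dictionary and detecting the clashes takes only a
few lines of code.  The following Python sketch uses the letter assignment
of Figure 10.2 and a handful of the words of Table 10.1.
.LB
.nf
# Sketch of single-key-per-letter encoding and the inverted dictionary
# that maps each keyed number back to its possible words.

LETTER_KEY = {}
for key, letters in {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
                     "6": "mno", "7": "prs", "8": "tuv", "9": "wxy"}.items():
    for letter in letters:
        LETTER_KEY[letter] = key

def keyed(word):
    return "".join(LETTER_KEY[c] for c in word) + "#"

WORDS = ["aback", "abide", "adore", "afore", "agate", "agave", "agree"]

inverted = {}
for word in WORDS:
    inverted.setdefault(keyed(word), []).append(word)

print(keyed("aback"))        # 22225#
print(inverted["23673#"])    # ['adore', 'afore'], a clash the dialogue must resolve
.fi
.LE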
14170.pp
14171A command language syntax is also a powerful way of disambiguating
14172keystrokes entered.
14173Figure 10.4 shows the keypad layout for a telephone voice calculator
14174(Newhouse and Sibley, 1969).
14175.[
14176Newhouse Sibley 1969
14177.]
14178.FC "Figure 10.4"
14179This calculator provides the standard arithmetic operators,
14180ten numeric registers, a range of pre-defined mathematical functions,
14181and even the ability for a user to enter his own functions over the
14182telephone.
14183The number representation is fixed-point, with user control (through a system
14184function) over the precision.
14185Input of numbers is free format.
14186.pp
14187Despite the power of the calculator language, the dialogue is defined
14188so that each keystroke is unique in context and never has to be disambiguated
14189explicitly by the user.
14190Table 10.2 summarizes the command language syntax in an informal and rather
14191heterogeneous notation.
14192.RF
14193.nr x0 1.3i+1.7i+\w'some functions do not need the <value> part'u
14194.nr x1 (\n(.l-\n(x0)/2
14195.in \n(x1u
14196.ta 1.3i +1.7i
14197\l'\n(x0u\(ul'
14198construct	definition	explanation
14199\l'\n(x0u\(ul'
14200.sp
14201<calculation>		a sequence of <operation>s followed by a
14202		call to the system function  \fIE  X  I  T\fR
14203.sp
14204<operation>	<add> OR <subtract> OR
14205	<multiply> OR <divide> OR
14206	<function> OR <clear> OR
14207	<erase> OR <answer> OR
14208	<display-last> OR <display> OR
14209	<repeat> OR <cancel>
14210.sp
14211<add>	+  <value>  #  OR  +  #  <function>
14212.sp
14213<subtract>
14214<multiply>		similar to <add>
14215<divide>
14216.sp
14217<value>	<numeric-value>  OR  \fIregister\fR <single-digit>
14218.sp
14219<numeric-value>		a sequence of keystrokes like
14220		1  .  2  3  4  or  1  2  3  .  4  or  1  2  3  4
14221.sp
14222<function>	\fIfunction\fR <name>  #  <value>  #
14223		some functions do not need the <value> part
14224.sp
14225<name>		a sequence of keystrokes like
14226		\fIS  I  N\fR  or  \fIE  X  I  T\fR  or  \fIM  Y  F  U  N  C\fR
14227.sp
14228<clear>	\fIclear register\fR <single-digit>  #
14229		clears one of the 10 registers
14230.sp
14231<erase>	\fIerase\fR  #	undoes the effect of the last operation
14232.sp
14233<answer>	\fIanswer register\fR <single-digit>  #
14234		reads the contents of a register
14235.sp
14236<display-last>
14237<display>		these provide "repeat" facilities
14238<repeat>
14239.sp
14240<cancel>		aborts the current utterance
14241\l'\n(x0u\(ul'
14242.in 0
14243.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
14244.FG "Table 10.2  Syntax for a telephone calculator"
14245A calculation is a sequence of operations followed by an EXIT function call.
14246There are twelve different operations, one for each button on the keypad.
14247Actually, two of them \(em
14248.ul
14249cancel
14250and
14251.ul
14252function
14253\(em share the same key so that "#" can be reserved for use as a
14254separator; but the context ensures that they cannot be confused by the system.
14255.pp
14256Six of the operations give control over the dialogue.
14257There are three different "repeat" commands; a command (called
14258.ul
14259erase\c
14260)
14261which undoes the effect of the last operation;
14262one which reads out the value of a register;
14263and one which aborts the current utterance.
14264Four more commands provide the basic arithmetic operations of add,
14265subtract, multiply, and divide.
14266The operands of these may be keyed literal numbers, or register values,
14267or function calls.
14268A further command clears a register.
14269.pp
14270It is through functions that the extensibility of the language is achieved.
14271A function has a name (like SIN, EXIT, MYFUNC) which is keyed with an
14272appropriate single-key-per-character sequence (namely 746, 3948, 693862
14273respectively).
14274One function, DEFINE, allows new ones to be entered.
14275Another, LOOP, repeats sequences of operations.
14276TEST incorporates arithmetic testing.
14277The details of these are not important:  what is interesting is the evident
14278power of the calculator.
14279.pp
14280For example, the keying sequence
14281.LB
14282.NI
142835  #  1  1  2  3  #  2  1  .  2  #  9  #  6  #  2  1  .  4  #
14284.LE
14285would be decoded as
14286.LB
14287.NI
14288.ul
14289clear\c
14290  +  123  \-  1.2  \c
14291.ul
14292display  erase\c
14293  \-  1.4.
14294.LE
14295One of the difficulties with such a tight syntax is that almost any sequence
will be interpreted as a valid calculation \(em syntax errors are nearly
14297impossible.
14298Thus a small mistake by the user can have a catastrophic effect on the
14299calculation.
14300Here, however, speech output gives an advantage over conventional
14301character-by-character echoing
14302on visual displays.
14303It is quite adequate to echo syntactic units as they are decoded, instead
14304of echoing keys as they are entered.
14305It was suggested earlier in this chapter that confirmation of entry
14306should be generated in the same way that the user would be likely to
14307verbalize it himself.
14308Thus the synthetic voice could respond to the above keying sequence as
14309shown in the second line, except that the
14310.ul
14311display
14312command would also state the result
14313(and possibly summarize the calculation so far).
14314Numbers could be verbalized as "one hundred and twenty-three"
14315instead of as "one ... two ... three".
14316(Note, however, that this will make it necessary to await the "#" terminator
14317after numbers and function names before they can be echoed.)
14318.sh "10.4  References"
14319.LB "nnnn"
14320.[
14321$LIST$
14322.]
14323.LE "nnnn"
14324.sh "10.5  Further reading"
14325.pp
14326There are no books which relate techniques of man-computer dialogue
14327to speech interaction.
14328The best I can do is to guide you to some of the standard works on
14329interactive techniques.
14330.LB "nn"
14331.\"Gilb-1977-1
14332.]-
14333.ds [A Gilb, T.
14334.as [A " and Weinberg, G.M.
14335.ds [D 1977
14336.ds [T Humanized input
14337.ds [I Winthrop
14338.ds [C Cambridge, Massachusetts
14339.nr [T 0
14340.nr [A 1
14341.nr [O 0
14342.][ 2 book
14343.in+2n
14344This book is subtitled "techniques for reliable keyed input",
14345and considers most aspects of the problem of data entry by
14346professional key operators.
14347.in-2n
14348.\"Martin-1973-2
14349.]-
14350.ds [A Martin, J.
14351.ds [D 1973
14352.ds [T Design of man-computer dialogues
14353.ds [I Prentice-Hall
14354.ds [C Englewood Cliffs, New Jersey
14355.nr [T 0
14356.nr [A 1
14357.nr [O 0
14358.][ 2 book
14359.in+2n
14360Martin concerns himself with all aspects of man-computer dialogue,
14361and the book even contains a short chapter on  the use of
14362voice response systems.
14363.in-2n
14364.\"Smith-1980-3
14365.]-
14366.ds [A Smith, H.T.
14367.as [A " and Green, T.R.G.(Editors)
14368.ds [D 1980
14369.ds [T Human interaction with computers
14370.ds [I Academic Press
14371.ds [C London
14372.nr [T 0
14373.nr [A 0
14374.nr [O 0
14375.][ 2 book
14376.in+2n
14377A recent collection of contributions on man-computer systems and programming
14378research.
14379.in-2n
14380.LE "nn"
14381.EQ
14382delim $$
14383.EN
14384.CH "11  COMMERCIAL SPEECH OUTPUT DEVICES"
14385.ds RT "Commercial speech output devices
14386.ds CX "Principles of computer speech
14387.pp
14388This chapter takes a look at four speech output peripherals that are
14389available today.
14390It is risky in a book of this nature to descend so close to the technology
14391as to discuss particular examples of commercial products,
14392for such information becomes dated very quickly.
14393Nevertheless, having covered the principles of various types of speech
14394synthesizer, and the methods of driving them from widely differing utterance
14395representations, it seems worthwhile to see how these principles are
14396embodied in a few products actually on the market.
14397.pp
14398Developments in electronic speech devices are moving so fast that it is
14399hard to keep up with them, and the newest technology today will undoubtedly
14400be superseded next year.
14401Hence I have not tried to choose examples from the very latest technology.
14402Instead, this chapter discusses synthesizers which exemplify rather different
14403principles and architectures, in order to give an idea of the range of options
14404which face the system designer.
14405.pp
14406Three of the devices are landmarks in the commercial adoption of speech
14407technology, and have stood the test of time.
14408Votrax was introduced in the early 1970's, and has been re-implemented
14409several times since in an attempt to cover different market sectors.
14410The Computalker appeared in 1976.
14411It was aimed primarily at the burgeoning computer hobbies market.
14412One of its most far-reaching effects was to stimulate the interest of
14413hobbyists, always eager for new low-cost peripherals, in speech synthesis;
14414and so provide a useful new source of experimentation and expertise
14415which will undoubtedly help this heretofore rather esoteric discipline to
14416mature.
14417Computalker is certainly the longest-lived and probably still the most
14418popular hobbyist's speech synthesizer.
14419The Texas Instruments speech synthesis chip brought speech output technology to the
14420consumer.
14421It was the first single-chip speech synthesizer, and is still the biggest
14422seller.
14423It forms the heart of the "Speak 'n Spell" talking toy which appeared in
14424toyshops in the summer of 1978.
14425Although talking calculators had existed several years before, they were
14426exotic gadgets rather than household toys.
14427.sh "11.1  Formant synthesizer"
14428.pp
14429The Computalker is a straightforward implementation of a serial formant
14430synthesizer.
14431A block diagram of it is shown in Figure 11.1.
14432.FC "Figure 11.1"
14433In the centre is the main vocal tract path, with three formant filters
14434whose resonant frequencies can be controlled individually.
14435A separate nasal branch in parallel with the oral one is provided,
14436with a nasal formant of fixed frequency.
14437It is less important to allow for variation of the nasal formant
14438frequency than it is for the oral ones, because the size and
14439shape of the nasal tract is relatively fixed.
14440However, it is essential to control the nasal amplitude, in particular to turn
14441it off during non-nasal sounds.
14442Computalker provides independent oral and nasal amplitude parameters.
14443.pp
Unvoiced excitation can be passed through the main vocal tract
via the aspiration amplitude control AH.
14446In practice, the voicing amplitudes AV and AN will probably always be zero when AH
14447is non-zero, for physiological constraints prohibit simultaneous voicing
14448and aspiration.
14449A second unvoiced excitation path passes through a fricative formant filter
14450whose resonant frequency can be varied, and has its amplitude independently
14451controlled by AF.
14452.rh "Control parameters."
14453Table 11.1 summarizes the nine parameters which drive Computalker.
14454.RF
14455.nr x0 \w'address0'+\w'fundamental frequency of voicing00'+\w'0 bits0'+\w'logarithmic00'+\w'0000\-00000 Hz'
14456.nr x1 (\n(.l-\n(x0)/2
14457.in \n(x1u
14458.ta \w'000'u \w'address0'u +\w'fundamental frequency of voicing00'u +\w'0 bits0'u +\w'logarithmic00'u
14459address	meaning	width		\0\0\0range
14460\l'\n(x0u\(ul'
14461.sp
14462\00	AV	amplitude of voicing	8 bits
14463\01	AN	nasal amplitude	8 bits
14464\02	AH	amplitude of aspiration	8 bits
14465\03	AF	amplitude of frication	8 bits
14466\04	FV	fundamental frequency of voicing	8 bits	logarithmic	\0\075\-\0\0470 Hz
14467\05	F1	formant 1 resonant frequency	8 bits	logarithmic	\0170\-\01450 Hz
14468\06	F2	formant 2 resonant frequency	8 bits	logarithmic	\0520\-\04400 Hz
14469\07	F3	formant 3 resonant frequency	8 bits	logarithmic	1700\-\05500 Hz
14470\08	FF	fricative resonant frequency	8 bits	logarithmic	1700\-14000 Hz
14471\09		not used
1447210		not used
1447311		not used
1447412		not used
1447513		not used
1447614		not used
1447715	SW	audio on-off switch	1 bit
14478\l'\n(x0u\(ul'
14479.in 0
14480.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
14481.FG "Table 11.1  Computalker control parameters"
14482Four of them control amplitudes, while the others control frequencies.
14483In the latter case the parameter value is logarithmically related to
14484the actual frequency of the excitation (FV) or resonance (F1, F2, F3, FF).
The range over which each frequency can be controlled is shown in Table 11.1.
14486An independent calibration of one particular Computalker has shown that
14487the logarithmic specifications are met remarkably well.
14488.pp
14489Each parameter is specified to Computalker as an 8-bit number.
14490Parameters are addressed by a 4-bit code, and so a total of 12 bits
14491is transferred in parallel to Computalker from the computer
14492for each parameter update.
14493Parameters 9 to 14 are unassigned ("reserved for future expansion" is
14494the official phrase), and the last parameter, SW, governs the position of
14495an audio on-off switch.
14496.pp
14497Computalker does not contain a clock that is accessible to the user,
14498and so the timing of parameter updates is entirely up to the host computer.
14499Typically, a 10\ msec interval between frames is used,
14500with interrupts generated by a separate timer.
14501In fact the frame interval can be anywhere between 2\ msec and 50\ msec,
14502and can be changed to alter the rate of speaking.
14503However, it is rather naive to view fast speech as slow
14504speech speeded up by a linear time compression, for in human
14505speech production the rhythm changes and elisions occur in a rather
14506more subtle way.
14507Thus it is not particularly useful to be able to alter the frame rate.
14508.pp
14509At each interrupt, the host computer transfers values for all of the nine
14510parameters to Computalker, a total of 108 data bits.
14511In theory, perhaps, it is only necessary to transmit those parameters
14512whose values have changed; but in practice all of them should be updated
14513regardless.
14514This is because the parameters are stored for the duration of the frame
14515in analogue sample-and-hold devices.  Essentially, the parameter value
14516is represented as the charge on a capacitor.
14517In time \(em and it takes only a short time \(em the values drift.
14518Although the drift over 10\ msec is insignificant, it becomes very
14519noticeable over longer time periods.
If the parameters are not updated at all, the result is a
"whoosh" as the sound drifts up to maximum amplitude over a period of a second or two.
14522Hence it is essential that Computalker be serviced by the computer regularly,
14523to update all its parameters.
14524The audio on-off switch is provided so that the computer can turn off
14525the sound directly if another program, which does not use the device,
14526is to be run.
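.pp
To make the servicing requirement concrete, the fragment below sketches a
frame-update routine in C (a language chosen here purely for exposition).
The routine ct_write, which stands for the 12-bit parallel transfer
(a 4-bit parameter address together with an 8-bit value) described above,
is hypothetical:  the real interface code depends entirely on the host computer.
.LB
.nf
#define N_PARAMS 9          /* AV, AN, AH, AF, FV, F1, F2, F3, FF        */
#define SW_ADDR  15         /* address of the audio on-off switch        */

extern void ct_write(int address, int value); /* hypothetical interface  */

/* called from the timer interrupt on every frame (typically 10 msec)    */
static void computalker_service(const unsigned char frame[N_PARAMS])
{
    int i;

    /* rewrite every parameter so that the sample-and-holds cannot drift */
    for (i = 0; i < N_PARAMS; i++)
        ct_write(i, frame[i]);          /* addresses 0 to 8, 8-bit values */
}

/* before running a program which does not use the device, silence it    */
static void computalker_off(void)
{
    ct_write(SW_ADDR, 0);
}
.fi
.LE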
14527.rh "Filter implementation."
14528It is hard to get definite information on the implementation
14529of Computalker.
14530Because it is a commercial device, circuit diagrams are not published.
14531It is certainly an analogue rather than a digital implementation.
14532The designer suggests that a configuration like that of Figure 11.2 is used
14533for the formant filters (Rice, 1976).
14534.[
14535Rice 1976 Byte
14536.]
14537.FC "Figure 11.2"
14538Control is obtained over the resonant frequency by varying the resistance
14539at the bottom in sympathy with the parameter value.
14540The middle two operational amplifiers can be modelled by a resistance
$-R/k$ in the forward path, where $k$ is the digital control value.
14542This gives the circuit in Figure 11.3, which can be analysed to obtain
14543the transfer function
14544.LB
14545.EQ
14546- ~ k over {R~R sub 1 C sub 2 C sub 3} ~ . ~ {R sub 2 C sub 2 ~s ~+~1} over
14547{ s sup 2 ~+~~
14548( 1 over {R sub 3 C sub 3} ~+~ {k R sub 2} over {R~R sub 1 C sub 3})~s ~~+~
14549k over {R~R sub 1 C sub 2 C sub 3}} ~ .
14550.EN
14551.LE
14552.FC "Figure 11.3"
14553.pp
14554This expression has a DC gain of \-1, and the denominator is similar to those
14555of the analogue formant resonators discussed in Chapter 5.
14556However, unlike them the transfer function has a numerator which creates
14557a zero at
14558.LB
14559.EQ
14560s~~=~~-~ 1 over {R sub 2 C sub 2} ~ .
14561.EN
14562.LE
14563If  $R sub 2 C sub 2$  is sufficiently small, this zero will have
14564negligible effect at audio frequencies, and the filter has
14565the following parameters:
14566.LB
14567centre frequency:    $~ mark
145681 over {2 pi}~~( k over {R~R sub 1 C sub 2 C sub 3} ~ ) sup 1/2$  Hz
14569.sp
14570bandwidth:$lineup
145711 over {2 pi}~~( 1 over {R sub 3 C sub 3}~+~
14572{k R sub 2} over {R~R sub 1 C sub 3} ~ )$  Hz.
14573.LE
14574.pp
14575Note first that the centre frequency is proportional to the square root of
14576the control value $k$.
14577Hence a non-linear transformation must be implemented on the control
14578signal, after D/A conversion, to achieve the required logarithmic relationship
14579between parameter value and resonant frequency.
14580The formant bandwidth is not constant, as it should be (see Chapter 5),
14581but depends upon the control value $k$.
14582This dependency can be minimized by selecting component values such that
14583.LB
14584.EQ
14585{k R sub 2} over {R~R sub 1 C sub 3}~~<<~~1 over {R sub 3 C sub 3}
14586.EN
14587.LE
14588for the largest value of $k$ which can occur.
14589Then the bandwidth is solely determined by the time constant  $R sub 3 C sub 3$.
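.pp
The following C fragment (the language is used purely for illustration)
evaluates these two expressions directly.
The component values are left as arguments, since Computalker's actual
values are not published; the square-root dependence of centre frequency
on $k$ noted above is evident in the first line of the calculation.
.LB
.nf
#include <math.h>

#define PI 3.14159265358979

struct formant { double fc, bw; };   /* centre frequency and bandwidth, Hz */

static struct formant formant_filter(double k,   /* digital control value  */
                                     double R,  double R1, double R2,
                                     double R3, double C2, double C3)
{
    struct formant f;

    f.fc = sqrt(k / (R * R1 * C2 * C3)) / (2.0 * PI);
    f.bw = (1.0 / (R3 * C3) + k * R2 / (R * R1 * C3)) / (2.0 * PI);
    return f;
}
.fi
.LE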
14590.pp
14591The existence of the zero can be exploited for the fricative resonance.
14592This should have zero DC gain, and so the component values for the fricative
14593filter should make the time-constant  $R sub 2 C sub 2$  large enough to place
14594the zero sufficiently near the frequency origin.
14595.rh "Market orientation."
14596As mentioned above, Computalker is designed for the computer hobbies market.
14597Figure 11.4 shows a photograph of the device.
14598.FC "Figure 11.4"
14599It plugs into the S\-100 bus which has been a
14600.ul
14601de facto
14602standard for hobbyists for several years, and has recently been adopted
as a standard by the Institute of Electrical and Electronics Engineers.
14604This makes it immediately accessible to many microcomputer systems.
14605.pp
14606An inexpensive synthesis-by-rule program, which runs on
14607the popular 8080 microprocessor, is available to drive Computalker.
14608The input is coded in a machine-readable version of the standard phonetic
14609alphabet, similar to that which was introduced in Chapter 2 (Table 2.1).
14610Stress digits may appear in the transcription, and the program caters for
14611five levels of stress.
14612The punctuation mark at the end of an utterance has some effect on pitch.
14613The program is perhaps remarkable in that it occupies only 6\ Kbyte of storage
14614(including phoneme tables), and runs on an 8-bit microprocessor
14615(but not in real time).
14616It is, however,
14617.ul
14618un\c
14619remarkable in that it produces rather poor speech.
14620According to a demonstration cassette,
14621"most people find the speech to be readily intelligible,
14622especially after a little practice listening to it,"
14623but this seems extremely optimistic.
14624It also cunningly insinuates that if you don't understand it, you yourself
14625may share the blame with the synthesizer \(em after all,
14626.ul
14627most
14628people do!
14629Nevertheless, Computalker has made synthetic speech accessible to a large
14630number of home computer users.
14631.sh "11.2  Sound-segment synthesizer"
14632.pp
14633Votrax was the first fully commercial speech synthesizer, and at the time of
14634writing is still the only off-the-shelf speech output
14635peripheral (as distinct from reading machine) which is aimed
14636specifically at synthesis-by-rule rather than storage of parameter tracks
14637extracted from natural utterances.
14638Figure 11.5 shows a photograph of the Votrax ML-I.
14639.FC "Figure 11.5"
14640.pp
14641Votrax accepts as input a string of codes representing sound segments,
14642each with additional bits to control the duration and pitch of the segment.
14643In the earlier versions (eg model VS-6) there are 63 sound segments, specified
14644by a 6-bit code, and two further bits accompany each segment to provide a
146454-level control over pitch.
14646Four pitch levels are quite inadequate to generate acceptable intonation
14647contours for anything but isolated words spoken in citation form.
14648However, a later model (ML-I) uses an 8-level pitch specification,
14649as well as a 4-level duration qualifier,
14650associated with each sound segment.
14651It provides a vocabulary of 80 sound segments, together with an additional
14652code which allows local amplitude modifications and extra duration alterations
14653to following segments.
A further, low-cost model (VS-K) is now available which plugs into the S\-100
14655bus, and
14656is aimed primarily at
14657computer hobbyists.
14658It provides no pitch control at all and is therefore
14659quite unsuited to serious voice response applications.
14660The device has recently been packaged as an LSI circuit (model SC\-01),
14661using analogue switched-capacitor filter technology.
14662.pp
14663One point where the ML-I scores favourably over other speech synthesis
14664peripherals is the remarkably convenient engineering of its
14665computer interface, which was outlined in the previous chapter.
14666.pp
14667The internal workings of Votrax are not divulged by the manufacturer.
14668Figure 11.6 shows a block diagram at the level of detail that they supply.
14669.FC "Figure 11.6"
14670It seems to be essentially a formant synthesizer with analogue function
14671generators and parameter smoothing circuits that provide transitions between
14672sound segments.
14673.rh "Sound segments."
14674The 80 segments of the high-range ML-I model
14675are summarized in Table 11.2.
14676.FC "Table 11.2"
14677They are divided into phoneme classes according to the
14678classification discussed in Chapter 2.
14679The segments break down into the following categories.
14680(Numbers in parentheses are the corresponding figures for VS-6.)
14681.LB "00 (00) "
14682.NI "00 (00) "
1468311 (11) vowel sounds which are representative of the phonological
14684vowel classes for English
14685.NI "00 (00) "
14686\09 \0(7) vowel allophones, with slightly different sound qualities from the
14687above
14688.NI "00 (00) "
1468920 (15) segments whose sound qualities are identical to the segments above, but with
14690different durations
14691.NI "00 (00) "
1469222 (22) consonant sounds which are representative of the phonological
14693consonant classes for English
14694.NI "00 (00) "
1469511 \0(6) consonant allophones
14696.NI "00 (00) "
14697\04 \0(0) segments to be used in conjunction with unvoiced plosives to increase
14698their aspiration
14699.NI "00 (00) "
14700\02 \0(2) silent segments, with different pause durations
14701.NI "00 (00) "
14702\01 \0(0) very short silent segment (about 5\ msec).
14703.LE "00 (00) "
14704Somewhat under half of the 80 elements
14705can be put into one-to-one correspondence with the phonemes of English;
14706the rest are either allophonic variations or additional sounds which can
14707sensibly be combined with certain phonemes in certain contexts.
The Votrax literature, and consequently Votrax users, persist in calling
14709all elements "phonemes", and this can cause considerable confusion.
14710I prefer to use the term "sound segment" instead, reserving "phoneme" for its
14711proper linguistic use.
14712.pp
14713The rules which Votrax uses for transitions between sound segments are not
14714made public by the manufacturer, and are embedded in encapsulated circuits
14715in the hardware.
14716They are clearly very crude.
14717The key to successful encoding of utterances is to use the many
14718non-phonemic segments in an appropriate way as transitions between the main
14719segments which represent phonetic classes.  This is a tricky process, and
14720I have heard of one commercial establishment giving up in despair at the
14721extreme difficulty of generating the utterances it wanted.
14722It probably explains the proliferation of letter-to-sound rules for
14723Votrax which have been developed in research laboratories
14724(Colby
14725.ul
14726et al,
147271978; Elovitz
14728.ul
14729et al,
147301976; McIlroy, 1974; Sherwood, 1978).
14731.[
14732Colby Christinaz Graham 1978
14733.]
14734.[
14735Elovitz 1976 IEEE Trans Acoustics Speech and Signal Processing
14736.]
14737.[
14738McIlroy 1974
14739.]
14740.[
14741Sherwood 1978
14742.]
14743Nevertheless, with luck, skill, and especially persistence,
14744excellent results can be
14745obtained.  The ML-I manual (Votrax, 1976) contains a list of about 625 words and short phrases,
14746and they are usually clearly recognizable.
14747.[
14748Votrax 1976
14749.]
14750.rh "Duration and pitch qualifiers."
14751Each sound segment has a different duration.
14752Table 11.2 shows the measured duration of the segments, although no
14753calibration data is given by Votrax.
14754As mentioned earlier, a 2-bit number accompanies each segment to modify
14755its duration, and
14756this was set to 3 (least duration) for the measurements.
14757The qualifier has a multiplicative effect, shown in Table 11.3.
14758.RF
14759.nr x1 (\w'rate qualifier'/2)
14760.nr x2 (\w'in Table 11.2 by'/2)
14761.nr x0 \n(x1+2i+\w'00'+\n(x2
14762.nr x3 (\n(.l-\n(x0)/2
14763.in \n(x3u
14764.ta \n(x1u +2i
14765\l'\n(x0u\(ul'
14766.sp
14767.nr x2 (\w'multiply duration'/2)
14768rate qualifier		\0\0\h'-\n(x2u'multiply duration
14769.nr x2 (\w'in Table 11.2 by'/2)
14770		\0\0\h'-\n(x2u'in Table 11.2 by
14771\l'\n(x0u\(ul'
14772.sp
14773	3	1.00
14774	2	1.11
14775	1	1.22
14776	0	1.35
14777\l'\n(x0u\(ul'
14778.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
14779.in 0
14780.FG "Table 11.3  Effect of the 2-bit per-segment rate qualifier"
14781.pp
14782As well as the 2-bit rate qualifier, each sound segment is accompanied by
14783a 3-bit pitch specification.  This provides a linear control over fundamental
14784frequency, and Table 11.4 shows the measured values.
14785.RF
14786.nr x1 (\w'pitch specifier'/2)
14787.nr x2 (\w'pitch (Hz)'/2)
14788.nr x0 \n(x1+1.5i+\n(x2
14789.nr x3 (\n(.l-\n(x0)/2
14790.in \n(x3u
14791.ta \n(x1u +1.5i
14792\l'\n(x0u\(ul'
14793.sp
14794pitch specifier	\h'-\n(x2u'pitch (Hz)
14795\l'\n(x0u\(ul'
14796.sp
14797	0	\057.5
14798	1	\064.1
14799	2	\069.4
14800	3	\075.8
14801	4	\080.6
14802	5	\087.7
14803	6	\094.3
14804	7	100.0
14805\l'\n(x0u\(ul'
14806.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
14807.in 0
14808.FG "Table 11.4  Effect of the 3-bit per-segment pitch specifier"
14809The quantization interval varies from
14810one to two semitones.
Votrax interpolates pitch from one sound segment to the next in a highly satisfactory
14812manner, and this permits surprisingly sophisticated intonation contours
14813to be generated considering the crude 8-level quantization.
14814.pp
14815The notation in which the Votrax manual defines utterances
14816gives duration qualifiers and pitch specifications as digits
14817preceding the sound segment, and separated from it by a slash (/).
14818Thus, for example,
14819.LB
1482014/THV
14821.LE
14822defines the sound segment THV with duration qualifier 1 (multiplies the
1482370\ msec duration of Table 11.2 by 1.22 \(em from Table 11.3 \(em to give 85\ msec)
14824and pitch specification 4 (81 Hz).
14825This representation of a segment is transformed into two ASCII characters before transmission
14826to the synthesizer.
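.pp
The arithmetic implied by the duration qualifier is simple, as the following
C fragment (given for illustration only) shows; it reproduces the 14/THV
example, in which qualifier 1 stretches the 70\ msec of THV to about 85\ msec.
The two-character encoding actually transmitted to the synthesizer is not
reproduced here.
.LB
.nf
/* multipliers from Table 11.3, indexed by the 2-bit rate qualifier */
static const double rate_factor[4] = { 1.35, 1.22, 1.11, 1.00 };

/* nominal_msec is the segment duration of Table 11.2;
   qualifier runs from 0 to 3 (3 gives the least duration) */
static double segment_duration(double nominal_msec, int qualifier)
{
    return nominal_msec * rate_factor[qualifier & 3];
}
.fi
.LE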
14827.rh "Converting a phonetic transcription to sound segments."
14828It would be useful to have a computer procedure to produce a specification for
14829an utterance in terms of Votrax sound segments from a standard phonetic
14830transcription.
14831This could remove much of the tedium from utterance preparation
14832by incorporating the contextual rules given in the Votrax manual.
14833Starting with a phonetic transcription, each phoneme should be converted
14834to its default Votrax representative.
14835The resulting "wide" Votrax transcription must be
14836transformed into a "narrow" one by application of contextual rules.
14837Separate rules are needed for
14838.LB
14839.NP
14840vowel clusters (diphthongs)
14841.NP
14842vowel transitions (ie consonant-vowel and vowel-consonant,
14843where the vowel segment is altered)
14844.NP
14845intervocalic consonants
14846.NP
14847consonant transitions (ie consonant-vowel and vowel-consonant,
14848where the consonant segment is altered)
14849.NP
14850consonant clusters
14851.NP
14852stressed-syllable effects
14853.NP
14854utterance-final effects.
14855.LE
14856Stressed-syllable effects (which include
14857extra aspiration for unvoiced stops beginning stressed syllables)
14858can be applied only if stress markers are included in the phonetic
14859transcription.
14860.pp
14861To specify a rule, it is necessary to give a
14862.ul
14863matching part
14864and a
14865.ul
14866context,
14867which define at what points in an utterance it is applicable, and a
14868.ul
14869replacement part
14870which is used to replace the matching part.
14871The context can be specified in mathematical set notation using curly brackets.
14872For example,
14873.LB
14874{G SH W K} OO		IU OO
14875.LE
14876states that the matching part OO is replaced by IU OO, after a G, SH, W, or K.
14877In fact, allophonic variations of each sound segment
14878should also be accepted as valid context, so this rule will also replace OO
14879after .G, CH, .W, .K, or .X1 (Table 11.2 gives allophones of each segment).
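.pp
A rule of this kind is easy to apply mechanically.
The C fragment below (illustrative only, and far simpler than a full rule
interpreter) applies just the single rule above to an array of sound-segment
names; a real implementation would read the rules of Table 11.5 from a table,
and would treat allophones as members of the context sets.
.LB
.nf
#include <string.h>

/* returns 1 if seg is a member of the context set */
static int in_set(const char *seg, const char *set[], int n)
{
    int i;
    for (i = 0; i < n; i++)
        if (strcmp(seg, set[i]) == 0)
            return 1;
    return 0;
}

/* Rewrite every OO that follows G, SH, W or K as the pair IU OO.
   segs[] holds the transcription and must have room for one extra
   entry per replacement; nseg is updated in place.                 */
static void apply_oo_rule(const char *segs[], int *nseg)
{
    static const char *ctx[] = { "G", "SH", "W", "K" };
    int i, j;

    for (i = 1; i < *nseg; i++)
        if (strcmp(segs[i], "OO") == 0 && in_set(segs[i - 1], ctx, 4)) {
            for (j = *nseg; j > i; j--)      /* open a gap ...          */
                segs[j] = segs[j - 1];
            segs[i] = "IU";                  /* ... for the new segment */
            (*nseg)++;
            i++;                             /* skip the inserted IU    */
        }
}
.fi
.LE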
14880.pp
14881Table 11.5 gives some rules that have been used for this purpose.
14882.FC "Table 11.5"
14883They were derived from careful study of the hints given in the
14884ML-I manual (Votrax, 1976).
14885.[
14886Votrax 1976
14887.]
14888Classes such as "voiced" and "stop-consonant" in the context specify sets
14889of sound segments in the obvious way.
14890The beginning of a stressed syllable is marked in the input by ".syll".
14891Parentheses in the replacement part have a significance which is explained in
14892the next section.
14893.rh "Handling prosodic features."
14894We know from Chapter 8 the vital importance of prosodic features
14895in synthesizing lifelike speech.
14896To allow them to be assigned to Votrax utterances, an intermediate
14897output from a prosodic analysis program like ISP can be used.
14898For example,
14899.LB
149001  \c
14901.ul
14902dh i s  i z  /*d zh aa k s  /h aa u s;
14903.LE
14904which specifies "this is Jack's house" in a declarative intonation with
14905emphasis on the "Jack's", can be intercepted in the following form:
14906.LB
14907\&.syll
14908.ul
14909dh\c
14910\ 50\ (0\ 110)
14911.ul
14912i\c
14913\ 60
14914.ul
14915s\c
14916\ 90\ (0\ 99)
14917.ul
14918i\c
14919\ 60
14920.ul
14921z\c
14922\ 60\ (50\ 110)
14923\&.syll
14924.ul
14925d\c
14926\ 50\ (0\ 110)
14927.ul
14928zh\c
14929\ 50
14930.ul
14931aa\c
14932\ 90
14933.ul
14934k\c
14935\ 120\ (10\ 90)
14936.ul
14937s\c
14938\ 90
14939\&.syll
14940.ul
14941h\c
14942\ 60
14943.ul
14944aa\c
14945\ 140
14946.ul
14947u\c
14948\ 60
14949.ul
14950s\c
14951\ 140
14952^\ 50\ (40\ 70) .
14953.LE
14954Syllable boundaries, pitches, and durations have been assigned by the
14955procedures given earlier (Chapter 8).
14956A number always follows each phoneme to specify its duration
14957(in msec).
14958Pairs of numbers in parentheses define a pitch specification at some
14959point during the preceding phoneme:  the first number of the pair defines
14960the time offset of the specification from the beginning
14961of the phoneme, while the second gives the pitch itself (in Hz).
14962This form of utterance specification can then be passed to a Votrax
14963conversion procedure.
14964.pp
14965The phonetic transcription is converted
14966to Votrax sound segments using the method described above.  The "wide" Votrax
14967transcription is
14968.LB
14969\&.syll THV I S I Z .syll D ZH AE K S .syll H AE OO S PA0 ;
14970.LE
14971which is transformed to the following "narrow" one according to the rules
14972of Table 11.5:
14973.LB
14974\&.syll THV I S I Z .syll D J (AE EH3) K S .syll H1 (AH1 .UH2) (O U)
14975S PA0 .
14976.LE
14977The duration and pitch specifications are preserved by the transformation
14978in their original positions in the string, although they are not shown above.
14979The next stage uses them to expand the transcription by adjusting
14980the segments to have durations as close as possible to the specifications, and
14981computing pitch numbers to be associated with each phoneme.
14982.pp
14983Correct duration-expansion can, in general, require a great amount of
14984computation.
14985Associated with each sound segment is a set of elements with the same sound quality
14986but different durations, formed by attaching each of the four duration
14987qualifiers of Table 11.3 to the segment and any others which are
14988sound-equivalents to it.  For example, the segment Z has the duration-set
14989.LB
14990{3/Z   2/Z   1/Z   0/Z}
14991.LE
14992with durations
14993.LB
14994{ 70   78   85   95}
14995.LE
14996msec respectively, where the initial numerals denote the duration qualifier.
14997The segment I has the much larger duration-set
14998.LB
14999{3/I2   2/I2   1/I2   0/I2   3/I1   2/I1   1/I1   0/I1   3/I   2/I   1/I   0/I}
15000.LE
15001with durations
15002.LB
15003{ 58   64   71   78   83   92   101   112   118   131   144   159},
15004.LE
15005because segments I1 and I2 are sound-equivalents to it.
15006Duration assignment is a matter of selecting elements from the
15007duration-set whose total duration is as close as possible to that desired
15008for the segment.
15009It happens that Votrax deals sensibly with concatenations of more than one
15010identical plosive, suppressing the stop burst on all but the last.
15011Although the general problem of approximating durations in
15012this way is computationally demanding, a simple recursive exhaustive search
15013works in a reasonable amount of time because the desired duration is usually
15014not very much greater than the longest member of the duration-set, and so
15015the search terminates quite quickly.
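.pp
A sketch of such a search, in C and purely for illustration, is given below.
It chooses elements from a duration-set (for example the four values
70, 78, 85 and 95\ msec given above for Z) so that their total comes as close
as possible to the target duration; the pruning when the target is overshot
is what keeps the search short in practice.
.LB
.nf
#define MAXPICK 8               /* more than enough for realistic targets */

static int best_err;            /* error of the best total found so far   */
static int best_n;              /* number of elements in that solution    */
static int best_pick[MAXPICK];  /* indices into the duration-set          */

static int iabs(int x) { return x < 0 ? -x : x; }

/* set[] holds the nset durations (msec) available for this segment;
   target is the duration still to be covered; pick[] and npick record
   the elements chosen so far on this branch of the search              */
static void search(const int set[], int nset, int target,
                   int pick[], int npick)
{
    int i;

    if (iabs(target) < best_err) {        /* record the closest total yet */
        best_err = iabs(target);
        best_n = npick;
        for (i = 0; i < npick; i++)
            best_pick[i] = pick[i];
    }
    if (target <= 0 || npick == MAXPICK)  /* overshooting cannot improve  */
        return;
    for (i = 0; i < nset; i++) {          /* otherwise try every element  */
        pick[npick] = i;
        search(set, nset, target - set[i], pick, npick + 1);
    }
}

/* returns the number of chosen durations, placing them in out[] */
static int expand(const int set[], int nset, int target, int out[])
{
    int pick[MAXPICK], i;

    best_err = target;                    /* the empty choice             */
    best_n = 0;
    search(set, nset, target, pick, 0);
    for (i = 0; i < best_n; i++)
        out[i] = set[best_pick[i]];
    return best_n;
}
.fi
.LE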
15016.pp
15017At this point, the role of the parentheses which appear on the right-hand side
15018of Table 11.5 becomes apparent.  Because durations are only associated with
15019the input phonemes, which may each be expanded into several Votrax
15020segments, it is necessary to keep track of the segments which have descended
15021from a single phoneme.
15022Target durations are simply spread equally across any parenthesized groups
15023to which they apply.
15024.pp
15025Having expanded durations, mapping pitches on to the sound segments is
15026a simple matter.  The ISP system for formant synthesizers (Chapters 7 and 8)
15027uses linear interpolation between pitch specifications, and the frequency which
15028results for each sound segment needs to be converted to a Votrax specification
15029using the information in Table 11.4.
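.pp
This final conversion is no more than a nearest-value search over the eight
entries of Table 11.4, as the small C fragment below (for illustration only)
shows; for instance 81\ Hz maps to specifier 4, whose measured pitch is
80.6\ Hz.
.LB
.nf
/* measured pitches (Hz) from Table 11.4, indexed by the 3-bit specifier */
static const double votrax_pitch[8] = {
    57.5, 64.1, 69.4, 75.8, 80.6, 87.7, 94.3, 100.0
};

/* return the specifier whose pitch is nearest to the interpolated value */
static int pitch_specifier(double hz)
{
    int i, best = 0;
    double err, best_err = 1.0e9;

    for (i = 0; i < 8; i++) {
        err = hz - votrax_pitch[i];
        if (err < 0.0)
            err = -err;
        if (err < best_err) {
            best_err = err;
            best = i;
        }
    }
    return best;
}
.fi
.LE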
15030.pp
15031After applying these procedures to the example utterance, it becomes
15032.LB
1503314/THV  14/I1  03/S  14/I1  04/Z  04/D  04/J  33/AE  33/EH3  \c
1503402/K  02/K  02/S  02/H1  01/AH2  01/.UH2  31/O2  31/U1  01/S  \c
1503510/S  30/PA0  30/PA0  .
15036.LE
15037In several places, shorter sound-equivalents have been substituted
15038(I1 for I, AH2 for AH1, O2 for O, and U1 for U), while doubling-up also occurs
15039(in the K, S, and PA0 segments).
15040.pp
15041The speech which results from the use of these procedures with the
15042Votrax synthesizer sounds remarkably similar to that generated by the
15043ISP system which uses
15044parametrically-controlled synthesizers.  Formal evaluation experiments have
15045not been undertaken, but it seems clear from careful listening that it would
15046be rather difficult, and probably pointless, to evaluate the Votrax conversion
15047algorithm, for the outcome would be completely dominated by the success of the
15048original pitch and rhythm assignment procedures.
15049.sh "11.3  Linear predictive synthesizer"
15050.pp
15051The first single-chip speech synthesizer was introduced by
15052Texas Instruments (TI) in the summer of 1978 (Wiggins and Brantingham, 1978).
15053.[
15054Wiggins Brantingham 1978
15055.]
15056It was a remarkable development, combining recent advances in signal processing
15057with the very latest in VLSI technology.
15058Packaged in the Speak 'n Spell toy (Figure 11.7), it was a striking demonstration
15059of imagination and prowess in integrated electronics.
15060.FC "Figure 11.7"
15061It gave TI a long lead over its competitors and surprised many experts
15062in the speech field.
15063.EQ
15064delim @@
15065.EN
15066Overnight, it seemed, digital speech technology had descended from
15067research laboratories with their expensive and specialized equipment into
15068a $50.00 consumer item.
15069.EQ
15070delim $$
15071.EN
15072Naturally TI did not sell the chip separately but only as part of their
15073mass-market product; nor would they make available information on how to
15074drive it directly.
Only recently, when other similar devices appeared on the market, did they
unbundle the package and sell the chip.
15077.rh "The Speak 'n Spell toy."
15078The TI chip (TMC0280) uses the linear predictive method of synthesis,
15079primarily because of the ease of the speech analysis procedure and the known
15080high quality at low data rates.
15081Speech researchers, incidentally, sometimes scoff at what they perceive to be
15082the poor quality of the toy's speech; but considering the data rate
15083used (which averages 1200 bits per second of speech) it is remarkably good.
15084Anyway, I have never heard a child complain! \(em although it is not uncommon
15085to misunderstand a word.
15086Two 128\ Kbit read-only memories are used in the toy to hold data for about
15087330 words and phrases \(em lasting between 3 and 4 minutes \(em of speech.
15088At the time (mid-1978) these memories were the largest that were available
15089in the industry.
15090The data flow and user dialogue are handled by a microprocessor,
15091which is the fourth LSI circuit in the photograph of Figure 11.8.
15092.FC "Figure 11.8"
15093.pp
15094A schematic diagram of the toy is given in Figure 11.9.
15095.FC "Figure 11.9"
15096It has a small display which shows upper-case letters.
15097(Some teachers of spelling hold that the lack of lower case destroys
15098any educational value that the toy may have.)  It
15099has a full 26-key alphanumeric keyboard with 14 additional control keys.
15100(This is the toy's Achilles' heel, for the keys fall out after extended use.
15101More recent toys from TI use an improved keyboard.)  The
keyboard is laid out alphabetically instead of in QWERTY order, possibly
15103missing an opportunity to teach kids to type as well as spell.
15104An internal connector permits vocabulary expansion with up to 14 more
15105read-only memory chips.
15106Controlling the toy is a 4-bit microprocessor (a modified TMS1000).
15107However, the synthesizer chip does not receive data from the processor.
15108During speech, it accesses the memory directly and only returns control
15109to the processor when an end-of-phrase marker is found in the data stream.
15110Meanwhile the processor is idle, and cannot even be interrupted from the
15111keyboard.
15112Moreover, in one operational mode ("say-it") the toy embarks upon a long
15113monologue and remains deaf to the keyboard \(em it cannot even be turned off.
15114Any three-year-old will quickly discover that a sharp slap solves the problem!
15115A useful feature is that the device switches itself off if unused for more
15116than a few minutes.
15117A fascinating account of the development of the toy from the point of view
15118of product design and market assessment has been published
15119(Frantz and Wiggins, 1981).
15120.[
15121Frantz Wiggins 1981
15122.]
15123.rh "Control parameters."
15124The lattice filtering method of linear predictive synthesis (see Chapter 6)
15125was selected because of its good stability properties and guaranteed
15126performance with small word sizes.
15127The lattice has 10 stages.
15128All the control parameters are represented as 10-bit fixed-point numbers,
15129and the lattice operates with an internal precision of 14 bits (including
15130sign).
15131.pp
15132There are twelve parameters for the device:  ten reflection coefficients,
15133energy, and pitch.
15134These are updated every 20\ msec.
15135However, if 10-bit values were stored for each, a data rate of 120 bits
15136every 20\ msec, or 6\ Kbit/s, would be needed.
15137This would reduce the capacity of the two read-only memory chips to well
15138under a minute of speech \(em perhaps 65 words and phrases.
15139But one of the desirable properties of the reflection coefficients
15140which drive the lattice filter is that they are amenable to quantization.
15141A non-linear quantization scheme is used, with the parameter data addressing
15142an on-chip quantization table to yield a 10-bit coefficient.
15143.pp
15144Table 11.6 shows the number of bits devoted to each parameter.
15145.RF
15146.in+0.3i
15147.ta \w'repeat flag00'u +1.3i +0.8i
15148.nr x0 \w'repeat flag00'+1.3i+\w'00'+(\w'size (10-bit words)'/2)
15149\l'\n(x0u\(ul'
15150.nr x1 (\w'bits'/2)
15151.nr x2 (\w'quantization table'/2)
15152.nr x3 0.2m
15153parameter	\0\h'-\n(x1u'bits	\0\0\h'-\n(x2u'quantization table
15154.nr x2 (\w'size (10-bit words)'/2)
15155		\0\0\h'-\n(x2u'size (10-bit words)
15156\l'\n(x0u\(ul'
15157.sp
15158energy	\04	\016	\v'\n(x3u'_\v'-\n(x3u'\z4\v'\n(x3u'_\v'-\n(x3u'  energy=0 means 4-bit frame
15159pitch	\05	\032
15160repeat flag	\01	\0\(em	\z1\v'\n(x3u'_\v'-\n(x3u'\z0\v'\n(x3u'_\v'-\n(x3u'  repeat flag =1 means 10-bit frame
15161k1	\05	\032
15162k2	\05	\032
15163k3	\04	\016
15164k4	\04	\016	\z2\v'\n(x3u'_\v'-\n(x3u'\z8\v'\n(x3u'_\v'-\n(x3u'  pitch=0 (unvoiced) means 28-bit frame
15165k5	\04	\016
15166k6	\04	\016
15167k7	\04	\016
15168k8	\03	\0\08
15169k9	\03	\0\08
15170k10	\03	\0\08	\z4\v'\n(x3u'_\v'-\n(x3u'\z9\v'\n(x3u'_\v'-\n(x3u'  otherwise 49-bit frame
15171	__	___
15172.sp
15173	49 bits	216 words
15174\l'\n(x0u\(ul'
15175.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
15176.in-0.3i
15177.FG "Table 11.6  Bit allocation for Speak 'n Spell chip"
15178There are 4 bits for energy, and 5 bits for pitch and the first two
15179reflection coefficients.
15180Thereafter the number of bits allocated to reflection coefficients decreases
15181steadily, for higher coefficients are less important for intelligibility
15182than lower ones.
15183(Note that using a 10-stage filter is tantamount to allocating
15184.ul
15185no
15186bits to coefficients higher than the tenth.)  With a
151871-bit "repeat" flag, whose role is discussed shortly, the frame size
15188becomes 49 bits.
15189Updated every 20\ msec, this gives a data rate of just under 2.5\ Kbit/s.
15190.pp
15191The parameters are expanded into 10-bit numbers by a separate quantization
15192table for each one.
15193For example, the five pitch bits address a 32-word look-up table which
15194returns a 10-bit value.
15195The transformation is logarithmic in this case, the lowest pitch being
15196around 50 Hz and the highest 190 Hz.
15197As shown in Table 11.6, a total of 216 10-bit words suffices to hold all
15198twelve quantization tables; and they are implemented on the synthesizer
15199chip.
15200To provide further smoothing of the control parameters,
15201they are interpolated linearly from one frame to the next at eight points
15202within the frame.
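.pp
A one-line illustration of this smoothing in C (the chip's own fixed-point
arithmetic is of course different) is:
.LB
.nf
/* value of a parameter at interpolation point 'step' (1 to 8) of a frame */
static int interpolate(int prev, int next, int step)
{
    return prev + ((next - prev) * step) / 8;
}
.fi
.LE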
15203.pp
15204The raw data rate of 2.5\ Kbit/s is reduced to an average of 1200\ bit/s
15205by further coding techniques.
15206Firstly, if the energy parameter is zero the frame is silent,
15207and no more parameters are transmitted (4-bit frame).
15208Secondly, if the "repeat" flag is 1 all reflection coefficients are held
15209over from the previous frame, giving a constant filter but with the ability
15210to vary amplitude and pitch (10-bit frame).
15211Finally, if the frame is unvoiced (signalled by the pitch value being zero)
15212only four reflection coefficients are transmitted, because the ear is
15213relatively insensitive to spectral detail in unvoiced speech (28-bit frame).
15214The end of the utterance is signalled by the energy bits all being 1.
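.pp
The effect of these coding rules on the bit stream can be summarized by the
following C fragment, which is an illustration of the frame structure of
Table 11.6 and not a description of the chip itself:  the routine get_bits
is hypothetical, and the order in which the fields are taken from the data
stream is assumed here to follow the order of the table.
.LB
.nf
/* bits per reflection coefficient k1 to k10, from Table 11.6 */
static const int k_bits[10] = { 5, 5, 4, 4, 4, 4, 4, 3, 3, 3 };

struct frame {
    int energy, pitch, repeat;
    int k[10];            /* quantization-table indices, not 10-bit values */
    int nk;               /* how many coefficients were present            */
};

extern int get_bits(int n); /* hypothetical: next n bits of speech data    */

/* returns 0 at the end of the utterance, 1 otherwise */
static int read_frame(struct frame *f)
{
    int i;

    f->nk = 0;
    f->energy = get_bits(4);
    if (f->energy == 0)
        return 1;         /* silent frame:  4 bits                         */
    if (f->energy == 15)
        return 0;         /* energy bits all 1:  end of utterance          */
    f->pitch  = get_bits(5);
    f->repeat = get_bits(1);
    if (f->repeat)
        return 1;         /* repeat frame:  10 bits, previous filter held  */
    f->nk = (f->pitch == 0) ? 4 : 10;  /* unvoiced frames carry only k1-k4 */
    for (i = 0; i < f->nk; i++)
        f->k[i] = get_bits(k_bits[i]);
    return 1;             /* 28-bit unvoiced or 49-bit voiced frame        */
}
.fi
.LE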
15215.rh "Chip organization."
15216The configuration of the lattice filter is shown in Figure 11.10.
15217.FC "Figure 11.10"
15218The "two-multiplier" structure (Chapter 6) is used, so the 10-stage filter
15219requires 19 multiplications and 19 additions
15220per speech sample.
15221(The last operation in the reverse path at the bottom is not needed.)  Since
15222a 10\ kHz sample rate is used, just 100\ $mu$sec are available for each
15223speech sample.
15224A single 5\ $mu$sec adder and a pipelined multiplier are implemented on
15225the chip, and multiplexed among the 19 operations.
15226The latter begins a new multiplication every 5\ $mu$sec, and finishes it
1522740\ $mu$sec later.
15228These times are within the capability of p-channel MOS technology,
15229allowing the chip to be produced at low cost.
15230The time slot for the 20'th, unnecessary, filter multiplication is used
15231for an overall gain adjustment.
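.pp
For comparison with the hardware description, the C fragment below shows one
sample of a ten-stage two-multiplier lattice computed serially; it is an
illustration of the structure of Figure 11.10, not of the chip's fixed-point
arithmetic.
The nineteen multiply-add operations mentioned above appear as ten updates of
the forward path and nine of the reverse path.
.LB
.nf
#define STAGES 10

static double b[STAGES];        /* delayed reverse-path values between samples */

/* k[0] to k[9] hold the reflection coefficients k1 to k10 */
static double lattice_sample(double excitation, const double k[STAGES])
{
    double f = excitation;      /* forward path, entering at the top stage     */
    int j;

    for (j = STAGES - 1; j >= 0; j--) {
        f = f - k[j] * b[j];                /* ten forward multiply-adds       */
        if (j + 1 < STAGES)
            b[j + 1] = b[j] + k[j] * f;     /* nine reverse multiply-adds; the */
    }                                       /* unused tenth is omitted         */
    b[0] = f;                   /* the output re-enters the reverse path       */
    return f;
}
.fi
.LE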
15232.pp
15233The final analogue signal is produced by an 8-bit on-chip D/A converter
15234which drives a 200 milliwatt speaker through an impedance-matching
15235transformer.
15236These constitute the necessary analogue low-pass desampling filter.
15237.pp
15238Figure 11.11 summarizes the organization of the synthesis chip.
15239.FC "Figure 11.11"
15240Serial data enters directly from the read-only memories, although a control
15241signal from the processor begins synthesis and another signal is returned
15242to it upon termination.
15243The data is decoded into individual parameters, which are used to address
15244the quantization tables to generate the full 10-bit parameter
15245values.
15246These are interpolated from one frame to the next.
15247The lower part of the Figure shows the speech generation subsystem.
15248An excitation waveform for voiced speech is stored in read-only
15249memory and read out repeatedly at a rate determined by the pitch.
15250The source for unvoiced sounds is hard-limited noise provided by a digital
15251pseudo-random bit generator.
15252The sound source that is used depends on whether the pitch value is zero
15253or not:  notice that this precludes mixed excitation for voiced fricatives
15254(and the sound is noticeably poor in words like "zee").
15255A gain multiplication is performed before the signal is passed through the
15256lattice synthesis filter, described earlier.
15257.sh "11.4  Programmable signal processors"
15258.pp
15259The TI chip has a fixed architecture, and is destined forever
15260to implement the same vocal tract model \(em a 10'th order lattice filter.
15261A more recent device, the Programmable Digital Signal Processor
15262(Caldwell, 1980) from Telesensory Systems allows more flexibility
15263in the type of model.
15264.[
15265Caldwell 1980
15266.]
15267It can serve as a digital formant synthesizer or a linear predictive
15268synthesizer, and the order of model (number of formants, in the former case)
15269can be changed.
15270.pp
15271Before describing the PDSP, it is worth looking at an earlier microprocessor
15272which was designed for digital signal processing.
15273Some industry observers have said that this processor, the Intel 2920,
15274is to the analogue design engineer what the first microprocessor was to
15275the random logic engineer way back in the mists of time (early 1970's).
15276.rh "The 'analogue microprocessor'."
15277The 2920 is a digital microprocessor.
15278However, it contains an on-chip D/A converter, which can be used in
15279successive approximation fashion for A/D conversion under program control,
15280and its architecture is designed to aid digital signal processing calculations.
15281Although the precision of conversion is 9 bits, internal arithmetic is
done with 25 bits to accommodate the accumulation of round-off errors in
15283arithmetic operations.
15284An on-chip programmable read-only memory holds a 192-instruction program,
15285which is executed in sequence with no program jumps allowed.
15286This ensures that each pass through the program takes the same time,
15287so that the analogue waveform is regularly sampled and processed.
15288.pp
15289The device is implemented in n-channel MOS technology, which makes it
15290slightly faster than the pMOS Speak 'n Spell chip.
15291At its fastest operating speed each instruction takes 400 nsec.
The 192-instruction program therefore executes in 76.8\ $mu$sec, corresponding
to a sampling rate of just over 13\ kHz.
15294Thus the processor can handle signals with a bandwidth of 6.5\ kHz \(em ample
15295for high-quality speech.
15296However, a special EOP (end of program) instruction is provided which
15297causes an immediate jump back to the beginning.
15298Hence if the program occupies less than 192 instructions, faster sampling
15299rates can be used.
15300For example, a single second-order formant resonance
15301requires only 14 instructions and so can
15302be executed at over 150\ kHz.
15303.pp
15304Despite this speed, the 2920 is only marginally capable of synthesizing
15305speech.
15306Table 11.7 gives approximate numbers of instructions needed to do some
15307subtasks for speech generation (Hoff and Li, 1980).
15308.[
15309Hoff Li 1980 Software makes a big talker
15310.]
15311.RF
15312.nr x0 \w'parameter entry and data distribution0000'+\w'00000'
15313.nr x1 \w'instructions'
15314.nr x2 (\n(.l-\n(x0)/2
15315.in \n(x2u
15316.ta \w'parameter entry and data distribution0000'u
15317\l'\n(x0u\(ul'
15318.sp
15319task	\0\0\0\0\0\h'-\n(x1u'instructions
15320\l'\n(x0u\(ul'
15321.sp
15322parameter entry and data distribution	35\-40
15323glottal pulse generation	\0\0\0\08
15324noise generation	\0\0\011
15325lattice section	\0\0\020
15326formant filter	\0\0\014
15327\l'\n(x0u\(ul'
15328.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
15329.in 0
15330.FG "Table 11.7  2920 instruction counts for typical speech subsystems"
15331The parameter entry and data distribution procedure
15332collects 10 8-bit parameters from a serial input stream, at a frame rate of
15333100 frames/s.
15334The parameter data rate is 8\ Kbit/s, and the routine assumes that the
153352920 performs each complete cycle in 125\ $mu$sec to generate sampled speech
15336at 8\ kHz.
15337Therefore one bit of parameter data is accepted on every cycle.
15338The glottal pulse program generates an asymmetrical triangular waveform
15339(Chapter 5), while the noise generator uses a 17-bit pseudo-random feedback
15340shift register.
15341About 30% of the 192-instruction program memory is consumed by these
15342essential tasks.
15343A two-multiplier lattice section takes 20 instructions,
15344and so only six sections can fit into the remaining program space.
It may be possible to use two 2920's to implement a complete 10'th or 12'th
order lattice, but the intermediate results must then be passed from one
device to the other, as analogue or digital data, through the
2920's analogue ports \(em not a terribly satisfactory method.
15349.pp
15350Since a formant filter occupies only 14 instructions, up to nine of them
15351would fit in the program space left after the above-mentioned essential
15352subsystems.
15353Although other necessary house-keeping tasks may reduce this number
15354substantially,
15355it does seem possible to implement a formant synthesizer on a single 2920.
15356.rh "The Programmable Digital Signal Processor."
15357Whereas the 2920 is intended for general signal-processing jobs,
15358Telesensory Systems' PDSP
15359(Programmable Digital Signal Processor) is aimed specifically at speech
15360synthesis.
15361It comprises two separate chips, a control unit and an arithmetic unit.
15362To build a synthesizer these must be augmented with external memory
15363and a D/A converter, arranged in a configuration like that of Figure 11.12.
15364.FC "Figure 11.12"
15365.pp
15366The control unit accepts parameter data from a host computer, one byte at a time.
15367The data is temporarily held in buffer memory before being serialized and passed
15368to the arithmetic unit.
15369Notice that for the 2920 we assumed that parameters were presented
15370to the chip already serialized and precisely timed:  the PDSP control unit
15371effectively releases the host from this high-speed real-time operation.
15372But it does more.
15373It generates both a voiced and an unvoiced excitation source and passes them
15374to the arithmetic unit, to relieve the latter of the general-purpose
15375programming required for both these tasks and allow its instruction set
15376to be highly specialized for digital filtering.
15377.pp
15378The arithmetic unit has rather a peculiar structure.
It accommodates only 16 program steps and can execute the full 16-instruction
15380program at a rate of 10\ kHz.
15381The internal word-length is 18 bits, but coefficients and the digital output
15382are only 10 bits.
15383Each instruction can accomplish quite a lot of work.
15384Figure 11.13 shows that there are four separate blocks of store in addition
15385to the program memory.
15386.FC "Figure 11.13"
15387One location of each block is automatically associated with each program step.
15388Thus on instruction 2, for example, two 18-bit scratchpad registers MA(2)
15389and MB(2), and two 10-bit coefficient registers A1(2) and A2(2), are
15390accessible.
15391In addition five general registers, curiously numbered R1, R2, R5, R6, R7,
15392are available to every program step.
15393.pp
15394Each instruction has five fields.
15395A single instruction loads all the general registers and simultaneously
15396performs two multiplications and up to three additions.
15397The fields specify exactly which operands are involved in these operations.
15398.pp
15399The instructions of the PDSP arithmetic unit are really very powerful.
15400For example, a second-order digital formant resonator requires only
15401two program steps.
15402A two-multiplier lattice stage needs only one step, and
15403a complete 12-stage lattice filter can be implemented in the 16 steps available.
15404An important feature of the architecture is that it
15405is quite easy to incorporate more than one
15406arithmetic unit into a system, with a single control unit.
15407Intermediate data can be transferred digitally between arithmetic units
15408since the D/A converter is off-chip.
15409A four-multiplier normalized lattice (Chapter 6) with 12 stages can be implemented
15410on two arithmetic units, as can a lattice filter which incorporates zeros
15411as well as poles, and a complex series/parallel formant synthesizer
15412with a total of 12 resonators whose centre frequencies and bandwidths
15413can be controlled independently (Klatt, 1980).
15414.[
15415Klatt 1980
15416.]
15417.pp
How this device will fare in actual commercial products remains to be seen.
It is certainly much more sophisticated than the TI Speak 'n Spell chip,
but a complete system will necessitate a much higher chip count and consequently
greater expense.
15422Telesensory Systems are committed to producing a text-to-speech
15423system based upon it
15424for use both in a reading machine for the blind and as a text-input
15425speech-output computer peripheral.
15426.sh "11.5  References"
15427.LB "nnnn"
15428.[
15429$LIST$
15430.]
15431.LE "nnnn"
15432.bp
15433.ev2
15434.ta \w'\fIsilence\fR 'u +\w'.EH100'u +\w'(used to change amplitude and duration)00'u +\w'00000000000test word'u
15435.nr x0 \w'\fIsilence\fR '+\w'.EH100'+\w'(used to change amplitude and duration)00'+\w'00000000000test word'
15436\l'\n(x0u\(ul'
15437.sp
15438.nr x1 (\w'Votrax'/2)
15439.nr x2 (\w'duration (msec)'/2)
15440.nr x3 \w'test word'
15441	\h'-\n(x1u'Votrax		\0\h'-\n(x2u'duration (msec)	\h'-\n(x3u'test word
15442\l'\n(x0u\(ul'
15443.sp
15444.nr x3 \w'hid'
15445\fIi\fR	I		118	\h'-\n(x3u'hid
15446	I1	(sound equivalent of I)	\083
15447	I2	(sound equivalent of I)	\058
15448	I3	(allophone of I)	\058
15449	.I3	(sound equivalent of I3)	\083
15450	AY	(allophone of I)	\065
15451.nr x3 \w'head'
15452\fIe\fR	EH		118	\h'-\n(x3u'head
15453	EH1	(sound equivalent of EH)	\070
15454	EH2	(sound equivalent of EH)	\060
15455	EH3	(allophone of EH)	\060
15456	.EH2	(sound equivalent of EH3)	\070
15457	A1	(allophone of EH)	100
15458	A2	(sound equivalent of A1)	\095
15459.nr x3 \w'had'
15460\fIaa\fR	AE		100	\h'-\n(x3u'had
15461	AE1	(sound equivalent of AE)	100
15462.nr x3 \w'hod'
15463\fIo\fR	AW		235	\h'-\n(x3u'hod
15464	AW2	(sound equivalent of AW)	\090
15465	AW1	(allophone of AW)	143
15466.nr x3 \w'hood'
15467\fIu\fR	OO		178	\h'-\n(x3u'hood
15468	OO1	(sound equivalent of OO)	103
15469	IU	(allophone of OO)	\063
15470.nr x3 \w'hud'
15471\fIa\fR	UH		103	\h'-\n(x3u'hud
15472	UH1	(sound equivalent of UH)	\095
15473	UH2	(sound equivalent of UH)	\050
15474	UH3	(allophone of UH)	\070
15475	.UH3	(sound equivalent of UH3)	103
15476	.UH2	(allophone of UH)	\060
15477.nr x3 \w'hard'
15478\fIar\fR	AH1		143	\h'-\n(x3u'hard
15479	AH2	(sound equivalent of AH1)	\070
15480.nr x3 \w'hawed'
15481\fIaw\fR	O		178	\h'-\n(x3u'hawed
15482	O1	(sound equivalent of O)	118
15483	O2	(sound equivalent of O)	\083
15484	.O	(allophone of O)	178
15485	.O1	(sound equivalent of .O)	123
15486	.O2	(sound equivalent of .O)	\090
15487.nr x3 \w'who d'
15488\fIuu\fR	U		178	\h'-\n(x3u'who'd
15489	U1	(sound equivalent of U)	\090
15490.nr x3 \w'heard'
15491\fIer\fR	ER		143	\h'-\n(x3u'heard
15492.nr x3 \w'heed'
15493\fIee\fR	E		178	\h'-\n(x3u'heed
15494	E1	(sound equivalent of E)	118
15495\fIr\fR	R		\090
15496	.R	(allophone of R)	\050
15497\fIw\fR	W		\083
15498	.W	(allophone of W)	\083
15499\l'\n(x0u\(ul'
15500.sp3
15501.ce
15502Table 11.2  Votrax sound segments and their durations
15503.bp
15504\l'\n(x0u\(ul'
15505.sp
15506.nr x1 (\w'Votrax'/2)
15507.nr x2 (\w'duration (msec)'/2)
15508.nr x3 \w'test word'
	\h'-\n(x1u'Votrax		\0\h'-\n(x2u'duration (msec)	\h'-\n(x3u'test word
15510\l'\n(x0u\(ul'
15511.sp
15512\fIl\fR	L		105
15513	L1	(allophone of L)	105
15514\fIy\fR	Y		103
15515	Y1	(allophone of Y)	\083
15516\fIm\fR	M		105
15517\fIb\fR	B		\070
15518\fIp\fR	P		100
15519	.PH	(aspiration burst for use with P)	\088
15520\fIn\fR	N		\083
15521\fId\fR	D		\050
15522	.D	(allophone of D)	\053
15523\fIt\fR	T		\090
15524	DT	(allophone of T)	\050
15525	.S	(aspiration burst for use with T)	\070
15526\fIng\fR	NG		120
15527\fIg\fR	G		\075
15528	.G	(allophone of G)	\075
15529\fIk\fR	K		\075
15530	.K	(allophone of K)	\080
15531	.X1	(aspiration burst for use with K)	\068
15532\fIs\fR	S		\090
15533\fIz\fR	Z		\070
15534\fIsh\fR	SH		118
15535	CH	(allophone of SH)	\055
15536\fIzh\fR	ZH		\090
15537	J	(allophone of ZH)	\050
15538\fIf\fR	F		100
15539\fIv\fR	V		\070
15540\fIth\fR	TH		\070
15541\fIdh\fR	THV		\070
15542\fIh\fR	H		\070
15543	H1	(allophone of H)	\070
15544	.H1	(allophone of H)	\048
15545\fIsilence\fR	PA0		\045
15546	PA1		175
15547	.PA1		\0\05
15548
15549	.PA2 (used to change amplitude and duration)	\0\0\-
15550\l'\n(x0u\(ul'
15551.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
15552.sp3
15553.ce
15554Table 11.2  (continued)
15555.bp
15556.ta 0.8i +2.6i +\w'(AH1 .UH2)  (O U)000'u
15557.nr x0 0.8i+2.6i+\w'(AH1 .UH2)  (O U)000'+\w'; i uh  \- here'
15558\l'\n(x0u\(ul'
15559.sp
15560vowel clusters
15561	EH I	A1 AY	; e i  \- hey
15562	UH OO	O U	; uh u  \- ho
15563	AE I	(AH1 EH3) I	; aa i  \- hi
15564	AE OO	(AH1 .UH2) (O U)	; aa u  \- how
15565	AW I	(O UH) E	; o i  \- hoi
15566	I UH	E I	; i uh  \- here
15567	EH UH	(EH A1) EH	; e uh  \- hair
15568	OO UH	OO UH	; u uh  \- poor
15569	Y U	Y1 (IU U)
15570.sp
15571vowel transitions
15572	{F M B P} O	(.O1 O)
15573	{L R} EH	(EH3 EH)
15574	{B K T D R} UH	(UH3 UH)
15575	{T D} A1	(EH3 A1)
15576	{T D} AW	(AH1 AW)
15577	{W} I	(I3 I)
15578	{G SH W K} OO	(IU OO)
15579	AY {K G T D}	(AY Y)
15580	E {M T}	(E Y)
15581	I {M T}	(I Y)
15582	E {L}	(I3 UH)
15583	EH {R N S D T}	(EH EH3)
15584	I {R T}	(I I3)
15585	AE {S N}	(AE EH)
15586	AE {K}	(AE EH3)
15587	A1 {R}	(A1 EH1)
15588	AH1 {R P K}	(AH1 UH)
15589	AH1 {ZH}	(AH1 EH3)
15590.sp
15591intervocalics
15592	{voiced} T {voiced}	DT
15593.sp
15594consonant transitions
15595	L {EH}	L1
15596	H {U OO IU}	H1
15597\l'\n(x0u\(ul'
15598.sp3
15599.ce
15600Table 11.5  Contextual rules for Votrax sound segments
15601.bp
15602\l'\n(x0u\(ul'
15603.sp
15604consonant clusters
15605	B {stop-consonant}	(B PA0)
15606	P {stop-consonant}	(P PA0)
15607	D {stop-consonant}	(D PA0)
15608	T {stop-consonant}	(T PA0)
15609	DT {stop-consonant}	(T PA0)
15610	G {stop-consonant}	(G PA0)
15611	K {stop-consonant}	(K PA0)
15612	{D T} R	(.X1 R)
15613	K R	.K (.X1 R)
15614	{consonant} R	.R
15615	{consonant} L	L1
15616	K W	.K .W
15617	D ZH	D J
15618	T SH	T CH
15619.sp
15620initial effects
15621	{.syll} P {vowel}	(P .PH)
15622	{.syll} K {vowel}	(K .H1)
15623	{.syll} T {vowel}	(T .S)
15624	{.syll} L	L1
15625	{.syll} H {U OO O AW AH1}	H1
15626.sp
15627terminal effects
15628	E {PA0}	(E Y)
15629\l'\n(x0u\(ul'
15630.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
15631.sp3
15632.ce
15633Table 11.5  (continued)
15634.ev
15635