.EQ
delim $$
.EN
.CH "1 WHY SPEECH OUTPUT?"
.ds RT "Why speech output?
.ds CX "Principles of computer speech
.pp
Speech is our everyday, informal communication medium. But although we use
it a lot, we probably don't assimilate as much information through our
ears as we do through our eyes, by reading or looking at pictures and diagrams.
You go to a technical lecture to get the feel of a subject \(em the overall
arrangement of ideas and the motivation behind them \(em and fill in the details,
if you still want to know them, from a book. You probably find out more about
the news from ten minutes with a newspaper than from a ten-minute news broadcast.
So it should be emphasized from the start that speech output from computers is
not a panacea. It doesn't solve the problems of communicating with computers;
it simply enriches the possibilities for communication.
.pp
What, then, are the advantages of speech output? One good reason for listening
to a radio news broadcast instead of spending the time with a newspaper
is that you can listen while shaving, doing the housework, or driving the car.
Speech leaves hands and eyes free for other tasks.
Moreover, it is omnidirectional, and does not require a free line of sight.
Related to this is the
use of speech as a secondary medium for status reports and warning messages.
Occasional interruptions by voice do not interfere with other activities,
unless they demand unusual concentration, and people can assimilate spoken messages
and queue them for later action quite easily and naturally.
.pp
The second key feature of speech communication stems from the telephone.
It is the universality of the telephone receiver itself that is important
here, rather than the existence of a world-wide distribution network;
for with special equipment (a modem and a VDU) one does not need speech to take advantage of
the telephone network for information transfer.
But speech needs no tools other than the telephone, and this gives
it a substantial advantage. You can go into a phone booth anywhere in the world,
carrying no special equipment, and have access to your computer within seconds.
The problem of data input is still there: perhaps your computer
system has a limited word recognizer, or you use the touchtone telephone
keypad (or a portable calculator-sized tone generator). Easy remote access
without special equipment is a great, and unique, asset to speech communication.
.pp
The third big advantage of speech output is that it is potentially very cheap.
Being all-electronic, except for the loudspeaker, speech systems are well
suited to high-volume, low-cost, LSI manufacture. Other computer output
devices are at present tied either to mechanical moving parts or to the CRT.
This was realized quickly by the computer hobbyist market, where speech output
peripherals have been selling like hot cakes since the mid 1970's.
.pp
A further point in favour of speech is that it is natural-seeming and
somehow cuddly when compared with printers or VDU's. It would have been much
more difficult to make this point before the advent of talking toys like
Texas Instruments' "Speak & Spell" in 1978, but now it is an accepted fact that friendly
computer-based gadgets can speak \(em there are talking pocket-watches
that really do "tell" the time, talking microwave ovens, talking pinball machines, and,
of course, talking calculators.
57It is, however, difficult to assess whether the appeal stems from 58mechanical speech's novelty \(em it 59is still a gimmick \(em and also to what extent it is tied up with 60economic factors. 61After all, most of the population don't use high-quality VDU's, and their major 62experience of real-time interactive computing is through the very limited displays 63and keypads provided on video games and teletext systems. 64.pp 65Articles on speech communication with computers often list many more advantages of voice output 66(see Hill 1971, Turn 1974, Lea 1980). 67.[ 68Hill 1971 Man-machine interaction using speech 69.] 70.[ 71Lea 1980 72.] 73.[ 74Turn 1974 Speech as a man-computer communication channel 75.] 76For example, speech 77.LB 78.NP 79can be used in the dark 80.NP 81can be varied from a (confidential) whisper to a (loud) shout 82.NP 83requires very little energy 84.NP 85is not appreciably affected by weightlessness or vibration. 86.LE 87However, these either derive from the three advantages we have discussed above, 88or relate 89mainly to exotic applications in space modules and divers' helmets. 90.pp 91Useful as it is at present, speech output would be even more attractive if it could 92be coupled with speech input. In many ways, speech input is its "big brother". 93Many of the benefits of speech output are even more striking for speech input. 94Although people can assimilate information faster through the eyes than the 95ears, the majority of us can generate information faster with the mouth than 96with the hands. Rapid typing is a relatively uncommon skill, and even high 97typing rates are much slower than speaking rates (although whether we can 98originate ideas quickly enough to keep up with fast speech is another matter!) To 99take full advantage of the telephone for interaction with machines, machine 100recognition of speech is obviously necessary. A microwave oven, calculator, 101pinball machine, or alarm clock that responds to spoken commands is certainly 102more attractive than one that just generates spoken status messages. A book 103that told you how to recognize speech by machine would undoubtedly be more 104useful than one like this that just discusses how to synthesize it! But the 105technology of speech recognition is nowhere near as advanced as that of 106synthesis \(em it's a much more difficult problem. However, because speech input 107is obviously complementary to speech output, and even very limited input 108capabilities will greatly enhance many speech output systems, it is worth 109summarizing the present state of the art of speech recognition. 110.pp 111Commercial speech recognizers do exist. Almost invariably, they accept 112words spoken in isolation, with gaps of silence between them, rather than 113connected utterances. 114It is not difficult to discriminate with high accuracy up to a hundred 115different words spoken by the same speaker, especially if the vocabulary 116is carefully selected to avoid words which sound similar. If several 117different speakers are to be comprehended, performance can be greatly improved 118if the machine is given an opportunity to calibrate their voices in a training 119session, and is informed at recognition time which one is to speak. 120With a large population of unknown speakers, accurate recognition is difficult 121for vocabularies of more than a few carefully-chosen words. 
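.pp
Commercial recognizers of this kind typically work by template matching: the
training session stores, for each vocabulary word and for each enrolled speaker,
a "template" consisting of a sequence of short-term feature vectors, and an
unknown utterance is assigned to whichever template it matches most closely
once the two sequences have been time-aligned by dynamic time warping.
The fragment below is a purely illustrative sketch of that idea; the function
and data names are invented, and a real recognizer works on sequences of
short-term spectra rather than the toy numbers shown.
.sp
.nf
.in+2n
# Illustrative sketch (not from any system described here): speaker-dependent
# isolated-word recognition by nearest-template matching with dynamic time
# warping (DTW), which tolerates words being spoken faster or slower than
# their training templates.

def dtw_distance(a, b):
    """Cost of the best time-alignment between feature sequences a and b."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j],      # stretch the template
                                 cost[i][j - 1],      # stretch the input
                                 cost[i - 1][j - 1])  # advance both together
    return cost[n][m]

def recognize(unknown, templates):
    """Return the vocabulary word whose stored template is nearest."""
    return min(templates, key=lambda word: dtw_distance(unknown, templates[word]))

# Tiny artificial example with one-dimensional "feature vectors".
templates = {"yes": [[1.0], [3.0], [1.0]], "no": [[2.0], [2.0], [2.0]]}
print(recognize([[1.1], [2.9], [2.8], [1.2]], templates))   # prints "yes"
.in-2n
.fi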
.pp
A half-way house between isolated word discrimination and recognition of connected
speech is the problem of spotting known words in continuous speech. This
allows much more natural input, if the dialogue is structured as keywords
which may be
interspersed with unimportant "noise words". To speak in truly isolated
words requires a great deal of self-discipline and concentration \(em it is
surprising how much of ordinary speech is accounted for by vague sounds
like um's and aah's, and false starts. Word spotting disregards these and so
permits a more relaxed style of speech. Some progress has been made on it in
research laboratories, but the vocabularies that can be accommodated are still
very small.
.pp
The difficulty of recognizing connected speech depends crucially on what is
known in advance about the dialogue: its pragmatic, semantic, and syntactic
constraints. Highly structured dialogues constrain very heavily the choice of
the next word. Recognizers which can deal with vocabularies of over 1000 words
have been built in research laboratories, but the structure of the input has
been such that the average "branching factor" \(em the size of the set out of
which the next word must be selected \(em is only around 10 (Lea, 1980).
.[
Lea 1980
.]
Whether such
highly constrained languages would be acceptable in many practical applications
is a moot point. One commercial recognizer, developed in 1978, can cope with
up to five words spoken continuously from a basic 120-word vocabulary.
.pp
There has been much debate about whether it will ever be possible for a speech
recognizer to step outside rigid constraints imposed on the utterances it can
understand, and act, say, as an automatic dictation machine. Certainly the most
advanced recognizers to date depend very strongly on a tight context being
available. Informed opinion seems to accept that in ten years' time,
voice data entry in the office will be an important and economically feasible
prospect, but that it would be rash to predict the appearance of unconstrained
automatic dictation by then.
.pp
Let's return now to speech output and take a look at some systems which use it,
to illustrate the advantages and disadvantages of speech in practical
applications.
.sh "1.1 Talking calculator"
.pp
Figure 1.1 shows a calculator that speaks.
.FC "Figure 1.1"
Whenever a key is pressed,
the device confirms the action by saying the key's name.
The result of any computation is also spoken aloud.
For most people, the addition of speech output to a calculator is simply a
gimmick.
(Note incidentally that speech
.ul
input
is a different matter altogether. The ability to dictate lists of numbers and
commands to a calculator, without lifting one's eyes from the page, would have
very great advantages over keypad input.) Used-car
salesmen find that speech output sometimes helps to clinch a deal: they key in
the basic car price and their bargain-basement deductions, and the customer is so
bemused by the resulting price being spoken aloud to him by a machine that he
signs the cheque without thinking! More seriously, there may be some small
advantage to be gained when keying a list of figures by touch from having their
values read back for confirmation.
For blind people, however, such devices 183are a boon \(em and there are many other applications, like talking elevators 184and talking clocks, which benefit from even very restricted voice output. 185Much more sophisticated is a typewriter with audio feedback, designed by 186IBM for the blind. Although blind typists can remember where the keys on a 187typewriter are without difficulty, they rely on sighted proof-readers to help 188check 189their work. This device could make them more useful as office typists and 190secretaries. As well as verbalizing the material (including punctuation) 191that has been typed, either by attempting to pronounce the words or by spelling 192them out as individual letters, it prompts the user through the more complex action sequences 193that are possible on the typewriter. 194.pp 195The vocabulary of the talking calculator comprises the 24 words of Table 1.1. 196.RF 197.nr x1 2.0i+\w'percent'u 198.nr x1 (\n(.l-\n(x1)/2 199.in \n(x1u 200.ta 2.0i 201zero percent 202one low 203two over 204three root 205four em (m) 206five times 207six point 208seven overflow 209eight minus 210nine plus 211times-minus clear 212equals swap 213.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 214.in 0 215.FG "Table 1.1 Vocabulary of a talking calculator" 216This represents a total of about 13 seconds of speech. It is stored 217electronically in read-only memory (ROM), and Figure 1.2 shows the circuitry 218of the speech module inside the calculator. 219.FC "Figure 1.2" 220There are three large integrated circuits. 221Two of them are ROMs, and the other is a special synthesis chip which decodes the 222highly compressed stored data into an audio waveform. 223Although the mechanisms used for storing speech by commercial devices are 224not widely advertised by the manufacturers, the talking calculator almost 225certainly uses linear predictive coding \(em a technique that we will examine 226in Chapter 6. 227The speech quality is very poor because of the highly compressed storage, and 228words are spoken in a grating monotone. 229However, because of the very small vocabulary, the quality is certainly good 230enough for reliable identification. 231.sh "1.2 Computer-generated wiring instructions" 232.pp 233I mentioned earlier that one big advantage of speech over visual output is that 234it leaves the eyes free for other tasks. 235When wiring telephone equipment during manufacture, the operator needs to use 236his hands as well as eyes to keep his place in the task. 237For some time tape-recorded instructions have been used for this in certain 238manufacturing plants. For example, the instruction 239.LB 240.NI 241Red 2.5 11A terminal strip 7A tube socket 242.LE 243directs the operator to cut 2.5" of red wire, attach one end to a specified point 244on the terminal strip, and attach the other to a pin of the tube socket. The 245tape recorder is fitted with a pedal switch to allow a sequence of such instructions 246to be executed by the operator at his own pace. 247.pp 248The usual way of recording the instruction tape is to have a human reader 249dictate them from a printed list. 250The tape is then checked against the list by another listener to ensure that 251the instructions are correct. Since wiring lists are usually stored and 252maintained in machine-readable form, it is natural to consider whether speech 253synthesis techniques could be used to generate the acoustic tape directly by 254a computer (Flanagan 255.ul 256et al, 2571972). 
258.[ 259Flanagan Rabiner Schafer Denman 1972 260.] 261.pp 262Table 1.2 shows the vocabulary needed for this application. 263.RF 264.nr x1 2.0i+2.0i+\w'tube socket'u 265.nr x1 (\n(.l-\n(x1)/2 266.in \n(x1u 267.ta 2.0i +2.0i 268A green seventeen 269black left six 270bottom lower sixteen 271break make strip 272C nine ten 273capacitor nineteen terminal 274eight one thirteen 275eighteen P thirty 276eleven point three 277fifteen R top 278fifty red tube socket 279five repeat coil twelve 280forty resistor twenty 281four right two 282fourteen seven upper 283.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 284.in 0 285.FG "Table 1.2 Vocabulary needed for computer-generated wiring instructions" 286It is rather larger 287than that of the talking calculator \(em about 25 seconds of speech \(em but well 288within the limits of single-chip storage in ROM, compressed by the linear 289predictive technique. However, at the time that the scheme was investigated 290(1970\-71) the method of linear predictive coding had not been fully developed, 291and the technology for low-cost microcircuit implementation was not available. 292But this is not important for this particular application, for there is 293no need to perform the synthesis on a miniature low-cost computer system, 294nor need it 295be accomplished in real time. In fact a technique of concatenating 296spectrally-encoded words was used (described in Chapter 7), and it was 297implemented on a minicomputer. Operating much slower than real-time, the system 298calculated the speech waveform and wrote it to disk storage. A subsequent phase 299read the pre-computed messages and recorded them on a computer-controlled analogue 300tape recorder. 301.pp 302Informal evaluation showed the scheme to be quite successful. Indeed, the 303synthetic speech, whose quality was not high, was actually preferred to 304natural speech in the noisy environment of the production line, for each 305instruction was spoken in the same format, with the same programmed pause 306between the items. 307A list of 58 instructions of the form shown above was recorded and used 308to wire several pieces of apparatus without errors. 309.sh "1.3 Telephone enquiry service" 310.pp 311The computer-generated wiring scheme illustrates how speech can be used to give 312instructions without diverting visual attention from the task at hand. 313The next system we examine shows how speech output can make the telephone 314receiver into a remote computer terminal for a variety of purposes 315(Witten and Madams, 1977). 316.[ 317Witten Madams 1977 Telephone Enquiry Service 318.] 319The caller employs the touch-tone keypad shown in Figure 1.3 for input, and the 320computer generates 321a synthetic voice response. 322.FC "Figure 1.3" 323Table 1.3 shows the process of making 324contact with the system. 325.RF 326.fi 327.nh 328.na 329.in 0.3i 330.nr x0 \w'COMPUTER: ' 331.nr x1 \w'CALLER: ' 332.in+\n(x0u 333.ti-\n(x0u 334CALLER:\h'\n(x0u-\n(x1u' Dials the service. 335.ti-\n(x0u 336COMPUTER: Answers telephone. 337"Hello, Telephone Enquiry Service. Please 338enter your user number". 339.ti-\n(x0u 340CALLER:\h'\n(x0u-\n(x1u' Enters user number. 341.ti-\n(x0u 342COMPUTER: "Please enter your password". 343.ti-\n(x0u 344CALLER:\h'\n(x0u-\n(x1u' Enters password. 345.ti-\n(x0u 346COMPUTER: Checks validity of password. 347If invalid, the user is asked to re-enter 348his user number. 349Otherwise, 350"Which service do you require?" 351.ti-\n(x0u 352CALLER:\h'\n(x0u-\n(x1u' Enters service number. 
353.in 0 354.nf 355.FG "Table 1.3 Making contact with the telephone enquiry system" 356.pp 357Advantage is taken of the disparate speeds of input (keyboard) and 358output (speech) to hasten the dialogue by imposing a question-answer structure 359on it, with the computer taking the initiative. The machine can 360afford to be slightly verbose if by so doing it makes the caller's 361response easier, and therefore more rapid. Moreover, operators who 362are experienced enough with the system to anticipate questions can 363easily forestall them just by typing ahead, for the computer is programmed 364to examine its input buffer before issuing prompts and to suppress them if 365input has already been provided. 366.pp 367An important aim of the system is to allow application programmers with no 368special knowledge of speech to write independent services for it. 369Table 1.4 shows an example of the use of one such application program, 370.RF 371.fi 372.nh 373.na 374.in 0.3i 375.nr x0 \w'COMPUTER: ' 376.nr x1 \w'CALLER: ' 377.in+\n(x0u 378.ti-\n(x0u 379COMPUTER: "Stores Information Service. Please enter 380component name". 381.ti-\n(x0u 382CALLER:\h'\n(x0u-\n(x1u' Enters "SN7406#". 383.ti-\n(x0u 384COMPUTER: "The component name is SN7406. Is this correct?" 385.ti-\n(x0u 386CALLER:\h'\n(x0u-\n(x1u' Enters "*1#" (system convention for "yes"). 387.ti-\n(x0u 388COMPUTER: "This component is in stores". 389.ti-\n(x0u 390CALLER:\h'\n(x0u-\n(x1u' Enters "*7#" (command for "price"). 391.ti-\n(x0u 392COMPUTER: "The component price is 35 pence". 393.ti-\n(x0u 394CALLER:\h'\n(x0u-\n(x1u' Enters "*8#" (command for "minimum number"). 395.ti-\n(x0u 396COMPUTER: "The minimum number of this component kept 397in stores is 10". 398.ti-\n(x0u 399CALLER:\h'\n(x0u-\n(x1u' Enters "SN7417#". 400.ti-\n(x0u 401COMPUTER: "The component name is SN7417. Is this correct?" 402.ti-\n(x0u 403CALLER:\h'\n(x0u-\n(x1u' Enters "*1#". 404.ti-\n(x0u 405COMPUTER: "This component is not in stores". 406.ti-\n(x0u 407CALLER:\h'\n(x0u-\n(x1u' Enters "*9#" (command for "delivery time"). 408.ti-\n(x0u 409COMPUTER: "The expected delivery time is 14 days". 410.ti-\n(x0u 411CALLER:\h'\n(x0u-\n(x1u' Enters "*0#". 412.ti-\n(x0u 413COMPUTER: "Which service do you require?" 414.in 0 415.nf 416.FG "Table 1.4 The Stores Information Service" 417the 418Stores Information Service, which permits enquiries to be made of a database 419holding information on electronic components kept in stock. 420This subsystem is driven by 421.ul 422alphanumeric 423data entered on the touch-tone keypad. Two or three letters are associated 424with each digit, in a manner which is fairly standard in touch-tone telephone 425applications. These are printed on a card overlay 426that fits the keypad (see Figure 1.3). Although true alphanumeric data entry 427would require a multiple key press for each character, 428the ambiguity inherent in 429a single-key-per-character convention can usually be resolved by the computer, 430if it has a list of permissible entries. For example, the component names 431SN7406 and ZTX300 are read by the machine as "767406" and "189300", respectively. 432Confusion rarely occurs if the machine is expecting a valid component code. 433The same holds true of people's names, and file names \(em although with these 434one must take care not to identify a series of files by similar names, like 435TX38A, TX38B, TX38C. 
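As a purely illustrative sketch of how such a resolution step might work
(the names and helper functions below are invented, and the placing of Q and Z
on the 1 key is an assumption, chosen so that ZTX300 reads as "189300" as
quoted above), the machine might translate each permissible entry into the
digit string it would produce on the keypad, and collect the entries that
match what was actually keyed:
.sp
.nf
.in+2n
# Hypothetical sketch, not the Telephone Enquiry Service code: resolving a
# single-key-per-character entry against a list of valid component codes.

KEYS = {"2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL", "6": "MNO",
        "7": "PRS", "8": "TUV", "9": "WXY", "1": "QZ"}

def as_digits(name):
    """The digit string produced by keying this name one key per character."""
    digits = []
    for ch in name.upper():
        if ch.isdigit():
            digits.append(ch)
        else:
            digits.append(next(d for d, letters in KEYS.items() if ch in letters))
    return "".join(digits)

def resolve(keyed, valid_entries):
    """Every permissible entry consistent with the keyed digit string."""
    return [entry for entry in valid_entries if as_digits(entry) == keyed]

stock = ["SN7406", "SN7417", "ZTX300"]
print(as_digits("SN7406"))         # prints 767406, as in the text
print(resolve("767406", stock))    # prints ['SN7406']: an unambiguous match
.in-2n
.fi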
It is easy for the machine to detect the rare cases 436where ambiguity occurs, and respond by requesting further information: "The 437component name is SN7406. Is this correct?" (In fact, the Stores Information 438Service illustrated in Table 1.4 is defective in that it 439.ul 440always 441requests confirmation of an entry, even when no ambiguity exists.) The 442use of a telephone keypad for data entry will be taken up again in Chapter 10. 443.pp 444A distinction is drawn throughout the system between data entries and 445commands, the latter being prefixed by a "*". In this example, the 446programmer chose to define a command for each possible question about a 447component, so that a new component name can be entered at any time 448without ambiguity. The price paid for the resulting brevity of dialogue 449is the burden of memorizing the meaning of the commands. This is an 450inherent disadvantage of a one-dimensional auditory display over the 451more conventional graphical output: presenting menus by speech is tedious and 452long-winded. In practice, however, for a simple task such as the 453Stores Information Service it is quite convenient for the caller to 454search for the appropriate command by trying out all possibilities \(em there 455are only a few. 456.pp 457The problem of memorizing commands is alleviated by establishing some 458system-wide conventions. Each input is terminated by a "#", and 459the meaning of standard commands is given in Table 1.5. 460.RF 461.fi 462.nh 463.na 464.in 0.3i 465.nr x0 \w'# alone ' 466.nr x1 \w'\(em ' 467.ta \n(x0u +\n(x1u 468.nr x2 \n(x0+\n(x1 469.in+\n(x2u 470.ti-\n(x2u 471*# \(em Erase this input line, regardless of what has 472been typed before the "*". 473.ti-\n(x2u 474*0# \(em Stop. Used to exit from any service. 475.ti-\n(x2u 476*1# \(em Yes. 477.ti-\n(x2u 478*2# \(em No. 479.ti-\n(x2u 480*3# \(em Repeat question or summarize state of current 481transaction. 482.ti-\n(x2u 483# alone \(em Short form of repeat. Repeats or summarizes 484in an abbreviated fashion. 485.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 486.in 0 487.nf 488.FG "Table 1.5 System-wide conventions for the service" 489.pp 490A summary of services available on the system is given in 491Table 1.6. 
.RF
.fi
.na
.in 0.3i
.nr x0 \w'000 '
.nr x1 \w'\(em '
.nr x2 \n(x0+\n(x1
.in+\n(x2u
.ta \n(x0u +\n(x1u
.ti-\n(x2u
\0\01 \(em tells the time
.ti-\n(x2u
\0\02 \(em Biffo (a game of NIM)
.ti-\n(x2u
\0\03 \(em MOO (a game similar to that marketed under the name "Mastermind")
.ti-\n(x2u
\0\04 \(em error demonstration
.ti-\n(x2u
\0\05 \(em speak a file in phonetic format
.ti-\n(x2u
\0\06 \(em listening test
.ti-\n(x2u
\0\07 \(em music (allows you to enter a tune and play it)
.ti-\n(x2u
\0\08 \(em gives the date
.sp
.ti-\n(x2u
100 \(em squash ladder
.ti-\n(x2u
101 \(em stores information service
.ti-\n(x2u
102 \(em computes means and standard deviations
.ti-\n(x2u
103 \(em telephone directory
.sp
.ti-\n(x2u
411 \(em user information
.ti-\n(x2u
412 \(em change password
.ti-\n(x2u
413 \(em gripe (permits feedback on services from caller)
.sp
.ti-\n(x2u
600 \(em first year laboratory marks entering service
.sp
.ti-\n(x2u
910 \(em repeat utterance (allows testing of system)
.ti-\n(x2u
911 \(em speak utterance (allows testing of system)
.ti-\n(x2u
912 \(em enable/disable user 100 (a no-password guest user number)
.ti-\n(x2u
913 \(em mount a magnetic tape on the computer
.ti-\n(x2u
914 \(em set/reset demonstration mode (prohibits access by low-priority users)
.ti-\n(x2u
915 \(em inhibit games
.ti-\n(x2u
916 \(em inhibit the MOO game
.ti-\n(x2u
917 \(em disable password checking when users log in
.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
.in 0
.nf
.FG "Table 1.6 Summary of services on a telephone enquiry system"
They range from simple games and demonstrations, through serious database
services, to system maintenance facilities.
A priority structure is imposed upon them, with higher
service numbers being available only to higher-priority users.
Services in the lowest range (1\-99) can be obtained by all, while
those in the highest range (900\-999) are maintenance services,
available only to the system designers. Access to the lower-numbered
"games" services can be inhibited by a priority user \(em this was
found necessary to prevent over-use of the system! Another advantage
of telephone access to an information retrieval system is that some
day-to-day maintenance can be done remotely, from the office telephone.
.pp
This telephone enquiry service, which was built in 1974, demonstrated that
speech synthesis had moved from a specialist phonetic discipline into the
province of engineering practicability. The speech was generated "by rule"
from a phonetic input (the method is covered in Chapters 7 and 8), which
has very low data storage requirements of around 75\ bit/s of speech.
Thus an enormous vocabulary and range of services could be accommodated on a
small computer system.
Despite the fairly low quality of the speech, the response from callers was
most encouraging. Admittedly the user population was a self-selected body of
University staff, which one might suppose to have high tolerance to new ideas,
and a system designed for the general public would require more effort to be
spent on developing speech of greater intelligibility.
Although it was
observed that some callers failed to understand parts of the responses, even
after repetition, communication was largely unhindered in most cases, since
users were driven by a high motivation to help the system help them.
.pp
The use of speech output in conjunction with a simple input device requires
careful thought for interaction to be successful and comfortable. It is
necessary that the computer direct the conversation as much as possible,
without seeming to be taking charge. Provision for eliminating prompts
which are unwanted by sophisticated users is essential to avoid frustration.
We will return to the topic of programming techniques for speech interaction
in Chapter 10.
.pp
Making a computer system available over the telephone results in a sudden
vast increase in the user population. Although people's reaction to a new
computer terminal in every office was overwhelmingly favourable, careful
resource allocation was essential to prevent the service being hogged by a
persistent few. As with all multi-access computer systems, it is particularly
important that error recovery is effected automatically and gracefully.
.sh "1.4 Speech output in the telephone exchange"
.pp
The telephone enquiry service was an experimental vehicle for research on speech
interaction, and was developed in 1974.
Since then, speech has begun to be used in real commercial applications.
One example is System\ X, the British Post Office's computer-controlled
telephone exchange. This incorporates many features
not found in conventional telephone exchanges.
For example, if a number is found to be busy, the call can be attempted
again by a "repeat last call" command, without having to re-dial the full number.
Alternatively, the last number can be stored for future re-dialling, freeing
the phone for other calls.
"Short code
dialling" allows a customer to associate short codes with commonly-dialled
numbers.
Alarm calls can be booked at specified times, and are made automatically
without human intervention.
Incoming calls can be barred, as can outgoing ones. A diversion service
allows all incoming calls to be diverted to another telephone, either
immediately, or if a call to the original number remains unanswered for
a specified period of time, or if the original number is busy.
Three-party calls can be set up automatically, without involving the
operator.
.pp
Making use of these facilities presents the caller with something of a problem.
With conventional telephone exchanges, feedback is provided on what is happening
to a call by the use of four tones \(em the dial tone, the busy tone,
the ringing tone, and the number unavailable tone.
For the more sophisticated interaction which is expected on the advanced
exchange, a much greater variety of status signals is required.
The obvious solution is to use
computer-generated spoken
messages to inform the caller when these services are invoked, and to guide him
through the sequences of actions needed to set up facilities like call
re-direction. For example, the messages used by the exchange when a user
accesses the alarm call
service are
.LB
.NI
Alarm call service.
Dial the time of your alarm call followed by square\u\(dg\d.
.FN 1
\(dg\d"Square" is the term used for the "#" key on the touch-tone telephone.\u
.EF
.NI
You have booked an alarm call for seven thirty hours.
645.NI 646Alarm call operator. At the third stroke it will be seven thirty. 647.LE 648.pp 649Because of the rather small vocabulary, the number of messages that can be 650stored in their entirety rather than being formed by concatenation of 651smaller units, and the short time which was available for development, 652System\ X stores speech as a time waveform, slightly compressed by a time-domain 653encoding operation (such techniques are described in Chapter 3). 654Utterances which contain variable parts, like the time of alarm in the messages 655above, are formed by inserting separately-recorded digits in a fixed 656"carrier" message. No attempt is made to apply uniform intonation 657contours to the synthetic utterances. The resulting speech is of excellent 658quality (being a slightly compressed recording of a human voice), but sometimes 659exhibits somewhat anomalous pitch contours. 660For example, the digits comprising numbers often sound rather jerky and 661out-of-context \(em which indeed they are. 662.pp 663Even more advanced facilities can be expected on telephone exchanges in 664the future. A message storage capability is one example. Although 665automatic call recording machines have been available for years, a centralized 666facility could time and date a message, collect the caller's identity 667(using the telephone keypad), and allow the recipient to select messages left 668for him through an interactive dialogue so that he could control the order 669in which he listens to them. He could choose to leave certain messages to be 670dealt with later, or re-route them to a colleague. He may even wish to leave 671reminders for himself, to be dialled automatically at specified times (like 672alarm calls with user-defined information attached). The sender of a message 673could be informed automatically by the system when it is delivered. None of 674this requires speech recognition, but it does need economical speech 675.ul 676storage, 677and also speech 678.ul 679synthesis 680(for time and date tags). 681.sh "1.5 Travel consultant" 682.pp 683Some current research in speech communication with computers is aimed at 684embedding the dialogue in a task environment which provides it with an overall 685contextual framework. 686This in fact was the motivation for the experimental telephone enquiry service 687described above, and it was observed that people are able to tolerate rather 688low quality speech if they feel that the interaction is worthwhile. One step 689further is the use of speech 690.ul 691input 692as well as output, and here the context of the conversation which is provided 693by the task environment can be used to improve recognition accuracy 694significantly. 695A system set up at Bell Telephone Laboratories indicates what we can expect 696to find operating commercially in the future (Levinson and Shipley, 1980). 697.[ 698Levinson Shipley 1980 699.] 700It provides timetable information about airline flights, and allows the user 701to make reservations. 702.pp 703An example dialogue is given in Table 1.7. 704.RF 705.fi 706.nh 707.na 708.in 0.3i 709.nr x0 \w'COMPUTER: ' 710.nr x1 \w'CALLER: ' 711.in+\n(x0u 712.ti-\n(x0u 713COMPUTER: This is the Bell Laboratories flight information system. May I help you? 714.ti-\n(x0u 715CALLER:\h'\n(x0u-\n(x1u' I would like to go to Chicago. 716.ti-\n(x0u 717COMPUTER: Please specify the day of your departure. 718.ti-\n(x0u 719CALLER:\h'\n(x0u-\n(x1u' I want to leave at four p.m.. 
720.ti-\n(x0u 721COMPUTER: Please specify the day of your departure. 722.ti-\n(x0u 723CALLER:\h'\n(x0u-\n(x1u' I want to depart on the two five May. 724.ti-\n(x0u 725COMPUTER: Flight number nine two leaves New York at four p.m. arrives in Chicago at five twenty five p.m.. 726.ti-\n(x0u 727CALLER:\h'\n(x0u-\n(x1u' What plane is on the flight? 728.ti-\n(x0u 729COMPUTER: The aircraft is a Boeing seven twenty seven. 730.ti-\n(x0u 731CALLER:\h'\n(x0u-\n(x1u' How much is the fare? 732.ti-\n(x0u 733COMPUTER: The round trip coach fare is one hundred forty two dollars. 734.ti-\n(x0u 735CALLER:\h'\n(x0u-\n(x1u' I want a first class seat. 736.ti-\n(x0u 737COMPUTER: One seat thank you. 738.ti-\n(x0u 739CALLER:\h'\n(x0u-\n(x1u' What is the fare? 740.ti-\n(x0u 741COMPUTER: The round trip first class fare is one hundred eighty four dollars. 742.ti-\n(x0u 743CALLER:\h'\n(x0u-\n(x1u' Please repeat the arrival time. 744.ti-\n(x0u 745COMPUTER: The arrival time is five twenty five p.m.. 746.ti-\n(x0u 747CALLER:\h'\n(x0u-\n(x1u' I will take the flight. 748.ti-\n(x0u 749COMPUTER: Your reservation on flight number nine two to Chicago on Friday is confirmed. Thank you. 750.in 0 751.nf 752.MT 2 753Table 1.7 A conversation with an airline flight information service 754(from Levinson and Shipley, 1980) 755.TE 756.[ 757Levinson Shipley 1980 758.] 759The user dials the system from an 760ordinary telephone. The recognition side must be trained by each user, and 761accepts isolated words spoken with brief pauses between them. 762The voice response unit has a vocabulary of around 200 words, and 763synthesizes its answers by slotting words into "templates" evoked by the speech 764understanding part in response to a query. For example, 765.LB 766.NI 767This flight makes \(em stops 768.NI 769Flight number \(em leaves \(em at \(em , arrives in \(em at \(em 770.LE 771are templates which when called with specific slot fillers could produce the 772utterances 773.LB 774.NI 775This flight makes three stops 776.NI 777Flight number nine two leaves New York at four p.m., 778arrives in Chicago at five twenty-five p.m. 779.LE 780The chief research interest of the system is in its speech understanding 781capabilities, and the method used for speech output is relatively 782straightforward. The templates and words are recorded, digitized, compressed 783slightly, and stored on disk files (totalling a few hundred thousand bytes of 784storage), using techniques similar to those of System\ X. 785Again, no independent manipulation of pitch is possible, and so the utterances 786sound intelligible but the transition between templates and slot fillers is not 787completely fluent. However, the overall context of the interaction means that 788the communication is not seriously disrupted even if the machine occasionally 789misunderstands the man or vice versa. The user's attention is drawn away from 790recognition accuracy and focussed on the exchange of information with the machine. 791The authors conclude that progress in speech recognition can best be made by 792studying it in the context of communication rather than in a vacuum or as part 793of a one-way channel, and the same is undoubtedly true of speech synthesis as 794well. 795.sh "1.6 Reading machine for the blind" 796.pp 797Perhaps the most advanced attempt to provide speech output from a computer 798is the Kurzweil reading machine for the blind, first marketed in the late 7991970's (Figure 1.4). 800.FC "Figure 1.4" 801This device reads an ordinary book aloud. 
Users adjust the reading 802speed according to the content of the material and their familiarity with 803it, and the maximum rate has recently been improved to around 225 words per 804minute \(em perhaps half as fast again as normal human speech rates. 805.pp 806As well as generating speech from text, the machine has to scan the document 807being read and identify the characters presented to it. A scanning camera 808is used, controlled by a program which searches for and tracks the lines of 809text. The output of the camera is digitized, and the image is enhanced 810using signal-processing techniques. Next each individual letter must be 811isolated, and its geometric features identified and compared with a pre-stored 812table of letter shapes. Isolation of letters is not at all trivial, for 813many type fonts have "ligatures" which are combinations of characters joined 814together (for example, the letters "fi" are often run together.) The 815machine must cope with many printed type fonts, as well as typewritten ones. 816The text-recognition side of the Kurzweil reading machine is in fact one of 817its most advanced features. 818.pp 819We will discuss the problem of speech generation from text in Chapter 9. 820It has many facets. First there is pronunciation, the 821translation of letters to sounds. It is important to take into account 822the morphological structure of words, dividing them into "root" and "endings". 823Many words have concatenated suffixes (like "like-li-ness"). These are 824important to detect, because a final "e" which appears on a root word 825is not pronounced itself but affects the pronunciation of the previous 826vowel. Then there is the difficulty that some words look the same 827but are pronounced differently, depending on their meaning or on the syntactic 828part that they play in the sentence. 829Appropriate intonation is extremely difficult to generate from a plain textual 830representation, for it depends on the meaning of the text and the way in which 831emphasis is given to it by the reader. Similarly the rhythmic structure is 832important, partly for correct pronunciation and partly for purposes of 833emphasis. 834Finally the sounds that have been deduced from the text need to be synthesized 835into acoustic form, taking due account of the many and varied contextual effects 836that occur in natural speech. This by itself is a challenging problem. 837.pp 838The performance of the Kurzweil reading machine is not good. While it seems 839to be true that some blind people can make use of it, it is far from 840comprehensible to an untrained listener. For example, 841it will miss out words and even whole phrases, hesitate in a 842stuttering manner, blatantly mis-pronounce many words, fail to detect 843"e"s which should be silent, and give completely wrong rhythms 844to words, making them impossible to understand. 845Its intonation is decidedly unnatural, monotonous, and often downright 846misleading. When it reads completely new text to people unfamiliar with its 847quirks, 848they invariably fail to understand more than an odd word here and there, 849and do not improve significantly when the text is repeated more than once. 850Naturally performance improves if the material is familiar or expected 851in some way. 852One useful feature is the machine's ability to spell out difficult words 853on command from the user. 
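.pp
To make one of the subproblems listed above a little more concrete, consider
the morphological decomposition of a word into root and endings. The fragment
below is a purely illustrative sketch, with an invented suffix list and rules
far cruder than any real text-to-speech front end: it simply peels recognizable
endings off a word so that spelling rules, such as the one governing a silent
final "e", can be applied to the root rather than to the word as typed.
.sp
.nf
.in+2n
# Hypothetical sketch of morphological decomposition for letter-to-sound rules.
# The suffix list is invented and grossly incomplete.

SUFFIXES = ["ness", "less", "ment", "ing", "ed", "ly", "li", "s"]

def decompose(word):
    """Split a word into (root, endings) by repeatedly stripping known suffixes."""
    endings = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                endings.insert(0, suffix)
                word = word[:-len(suffix)]
                stripped = True
                break
    return word, endings

for w in ["likeliness", "statement", "hopeless", "endings"]:
    root, endings = decompose(w)
    note = "(root has a silent final e)" if root.endswith("e") else ""
    print(w, "->", root, "+", "-".join(endings), note)
.in-2n
.fi
A real system must also undo the spelling changes that accompany many suffixes,
such as the "e" dropped before "-ed" and "-ing" or the doubled consonant in
"stopping", which is one reason the problem is harder than this sketch suggests.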
.pp
While I do not wish to denigrate the Kurzweil machine, which is a remarkable
achievement in that it integrates many different advanced
technologies, there is no doubt that the state of the art in speech synthesis
directly from unadorned text is extremely primitive at present.
It is vital not to overemphasize the potential usefulness of abysmal speech,
which takes a great deal of training on the part of the user before
it becomes at all intelligible. To make a rather extreme analogy,
Morse code could be used as
audio output, requiring a great deal of training, but capable of being understood
at quite high rates by an expert.
It could be generated very cheaply.
But clearly the man in the street would find it quite unacceptable as
an audio output medium, because of the excessive effort required to learn to use
it. In many applications, very bad synthetic speech is just as useless.
However, the issue is complicated by the fact that for people who use
synthesizers regularly, synthetic speech becomes quite easily comprehensible.
We will return to the problem of evaluating the quality of artificial speech
later in the book (Chapter 8).
.sh "1.7 System considerations for speech output"
.pp
Fortunately, very many of the applications of speech output from computers
do not need to read unadorned text.
In all the example systems described above (except the reading machine),
it is enough to be able to store utterances in some representation which can
include pre-programmed cues for pronunciation, rhythm, and intonation in
a much more explicit way than ordinary text does.
.pp
Of course, techniques
for storing audio information have been in use for decades.
For example, a domestic cassette tape recorder stores speech at much better
than telephone quality at very low cost. The method of direct
recording of an analogue waveform is currently used for announcements in
the telephone network to provide information such as the time, weather
forecasts, and even bedtime stories.
However, it is difficult to provide rapid access to messages stored in
analogue form, and although some computer peripherals which use analogue
recordings for voice-response applications have been marketed \(em they are
discussed briefly at the beginning of Chapter 3 \(em they have been
superseded by digital storage techniques.
.pp
Although direct storage of a digitized audio waveform is used in some
voice-response systems, the approach has certain limitations. The most
obvious one is the large storage requirement: suitable coding can reduce
the data-rate of speech to as little as one hundredth of that needed by
direct digitization, and textual representations reduce it by another factor
of ten or twenty. (Of course, the speech quality is inevitably compromised
somewhat by data-compression techniques.) However, the cost of storage is
dropping so fast that this is not necessarily an overriding factor.
A more fundamental limitation is that utterances stored directly cannot sensibly
be modified in any way to take account of differing contexts.
.pp
If the results of certain kinds of analyses
of utterances are stored, instead of simply the digitized waveform,
a great deal more flexibility can be gained.
909It is possible to separate out the features of intonation and amplitude from 910the articulation of the speech, and this raises the attractive possibility 911of regenerating utterances with pitch contours different from those with which they were 912recorded. 913The primary analysis technique used for this purpose is 914.ul 915linear prediction 916of speech, and this is treated in some detail in Chapter 6. It also reduces drastically the 917data-rate of speech, by a factor of around 50. 918It is likely that many voice-response systems in the short- and medium-term 919future will use linear predictive representations for utterance storage. 920.pp 921For maximum flexibility, however, it is preferable to store a textual 922representation of the utterance. 923There is an important distinction between speech 924.ul 925storage, 926where an actual human utterance is recorded, perhaps processed to lower 927the data-rate, and stored for subsequent regeneration when required, 928and speech 929.ul 930synthesis, 931where the machine produces its own individual utterances which are not based 932on recordings of a person saying the same thing. The difference is summarized 933in Figure 1.5. 934.FC "Figure 1.5" 935In both cases something is stored: for the first it is 936a direct representation of an actual human utterance, while for the second 937it is a typed 938.ul 939description 940of the utterance in terms of the sounds, or phonemes, which constitute it. 941The accent and tone of voice of the human speaker will be apparent in 942the stored speech output, while for synthetic speech the accent is the 943machine's and the tone of voice is determined by the synthesis program. 944.pp 945Probably the most attractive representation of utterances in man-machine 946systems is ordinary English text, as used by the Kurzweil reading machine. 947But, as noted above, this poses extraordinarily difficult problems for the 948synthesis procedure, and these inevitably result in severely degraded speech. 949Although in the very long term these problems may indeed be solved, 950most speech output systems can adopt as their representation of an utterance 951a description of it which explicitly conveys the difficult features of 952intonation, rhythm, and even pronunciation. 953In the kind of applications described above (barring the reading machine), 954input will be prepared by a 955programmer as he builds the software system which supports the interactive 956dialogue. 957Although it is important that the method of specifying utterances be easily 958learned, it is not necessary that plain English 959is used. It should be simple for the programmer to enter new 960utterances and modify them on-line in cut-and-try attempts to render the 961man-machine dialogue as natural as possible. A phonetic input 962can be quite adequate for this, especially if the system allows the 963programmer to hear immediately the synthesized version of the message 964he types. Furthermore, markers which indicate rhythm and intonation can 965be added to the message so that the system does not have to deduce these features 966by attempting to "understand" the plain text. 967.pp 968This brings us to another disadvantage of speech storage as compared with 969speech synthesis. To provide utterances for a voice response system using 970stored human speech, one must assemble together special input hardware, 971a quiet room, and (probably) a dedicated computer. 
If the speech is to be 972heavily encoded, either expensive special hardware is required or the encoding 973process, if performed by software on a general-purpose computer, will take 974a considerable length of time (perhaps hundreds of times real-time). In 975either case, time-consuming editing of the speech will be necessary, with 976follow-up recordings to clarify sections of speech which turn out to be 977unsuitable or badly recorded. If at a later date the voice response 978system needs modification, it will be necessary to recall the same speaker, 979or re-record the entire utterance set. This discourages the application 980programmer from adjusting his dialogue in the light of experience. 981Synthesizing from a textual representation, on the other hand, allows him 982to change a speech prompt as simply as he could a VDU one, and evaluate 983its effect immediately. 984.pp 985We will return to methods of digitizing and compacting speech in Chapters 3 986and 4, and carry on to consider speech synthesis in subsequent chapters. 987Firstly, however, it is necessary to take a look at what speech is and how 988people produce it. 989.sh "1.8 References" 990.LB "nnnn" 991.[ 992$LIST$ 993.] 994.LE "nnnn" 995.sh "1.9 Further reading" 996.pp 997There are remarkably few general books on speech output, although a 998substantial specialist literature exists for the subject. 999In addition to the references listed above, I suggest that you look 1000at the following. 1001.LB "nn" 1002.\"Ainsworth-1976-1 1003.]- 1004.ds [A Ainsworth, W.A. 1005.ds [D 1976 1006.ds [T Mechanisms of speech recognition 1007.ds [I Pergamon 1008.nr [T 0 1009.nr [A 1 1010.nr [O 0 1011.][ 2 book 1012.in+2n 1013A nice, easy-going introduction to speech recognition, this book covers 1014the acoustic structure of the speech signal in a way which makes 1015it useful as background reading for speech synthesis as well. 1016It complements Lea, 1980, cited above; which presents more recent results 1017in greater depth. 1018.in-2n 1019.\"Flanagan-1973-2 1020.]- 1021.ds [A Flanagan, J.L. 1022.as [A " and Rabiner, L.R. (Editors) 1023.ds [D 1973 1024.ds [T Speech synthesis 1025.ds [I Wiley 1026.nr [T 0 1027.nr [A 0 1028.nr [O 0 1029.][ 2 book 1030.in+2n 1031This is a collection of previously-published research papers on speech 1032synthesis, rather than a unified book. 1033It contains many of the classic papers on the subject from 1940\ -\ 1972, 1034and is a very useful reference work. 1035.in-2n 1036.\"LeBoss-1980-3 1037.]- 1038.ds [A LeBoss, B. 1039.ds [D 1980 1040.ds [K * 1041.ds [T Speech I/O is making itself heard 1042.ds [J Electronics 1043.ds [O May\ 22 1044.ds [P 95-105 1045.nr [P 1 1046.nr [T 0 1047.nr [A 1 1048.nr [O 0 1049.][ 1 journal-article 1050.in+2n 1051The magazine 1052.ul 1053Electronics 1054is an excellent source of up-to-the-minute news, product announcements, 1055titbits, and rumours in the commercial speech technology world. 1056This particular article discusses the projected size of the voice 1057output market and gives a brief synopsis of the activities of several 1058interested companies. 1059.in-2n 1060.\"Witten-1980-5 1061.]- 1062.ds [A Witten, I.H. 
1063.ds [D 1980 1064.ds [T Communicating with microcomputers 1065.ds [I Academic Press 1066.ds [C London 1067.nr [T 0 1068.nr [A 1 1069.nr [O 0 1070.][ 2 book 1071.in+2n 1072A recent book on microcomputer technology, this is unusual in that 1073it contains a major section on speech communication 1074with computers (as well as ones 1075on computer buses, interfaces, and graphics). 1076.in-2n 1077.LE "nn" 1078.EQ 1079delim $$ 1080.EN 1081.CH "2 WHAT IS SPEECH?" 1082.ds RT "What is speech? 1083.ds CX "Principles of computer speech 1084.pp 1085People speak by using their vocal cords as a sound source, and making rapid 1086gestures of the articulatory organs (tongue, lips, jaw, and so on). 1087The resulting changes in shape of the vocal tract allow production 1088of the different sounds that we know as the vowels and consonants of 1089ordinary language. 1090.pp 1091What is it necessary to learn about this process for the purposes of 1092speech output from computers? 1093That depends crucially upon how speech is represented in the system. 1094If utterances are stored as time waveforms \(em and this is what we will be 1095discussing in the next chapter \(em the structure of speech is not important. 1096If frequency-related parameters of particular natural utterances are 1097stored, then it is advantageous to take into account some of the 1098acoustic properties of the speech waveform. 1099.pp 1100This point can be brought into focus by contrasting the transmission 1101(or storage) of speech with that of real-life television pictures, 1102as has been proposed for a videophone service. 1103Massive data reductions, of the order of 50:1, can be achieved for speech, 1104using techniques that are described in later chapters. For pictures, 1105data reduction is still an important issue \(em even more so for the 1106videophone than for the telephone, because of the vastly higher 1107information rates involved. 1108Unfortunately, the potential for data reduction is much 1109smaller \(em nothing like the 50:1 figure quoted above. 1110This is because speech sounds have definite characteristics, imparted 1111by the fact that they are produced by a human vocal tract, which 1112can be exploited for data reduction. 1113Television pictures have no equivalent generative structure, for 1114they show just those things that the camera points at. 1115.pp 1116Moving up from frequency-related parameters of 1117.ul 1118particular 1119utterances, it 1120is possible to store such parameters in a 1121.ul 1122general 1123form which characterizes the sound segments that appear in spoken language. 1124This immediately raises the issue of 1125.ul 1126classification 1127of sound segments, to form a basis for storing generalized acoustic 1128information and for retrieval of the information needed to synthesize 1129any particular utterance. 1130Speech is by nature continuous, and any synthesis system based upon 1131discrete classification must come to terms with this by tackling 1132the problems of transition from one segment to another, 1133and local modification of sound segments as a function of their context. 1134.pp 1135This brings us to another level of representation. 1136So far we have talked of the 1137.ul 1138acoustic 1139nature of speech, but when we have to cope with transitions between 1140discrete sound segments it may be fruitful to consider 1141.ul 1142articulatory 1143properties as well. 1144Any model of the speech production process 1145is in effect a model of the articulatory process that generates the speech. 
1146Some speech research is concerned with 1147modelling 1148the vocal tract directly, rather than modelling the acoustic output from it. 1149One might specify, for example, position of tongue and posture of jaw and lips 1150for a vowel, instead of giving frequency-related 1151characteristics of it. This is a potent 1152tool in linguistic research, for it brings one closer to human production of 1153speech \(em in particular to the connection between brain and articulators. 1154.pp 1155Articulatory 1156synthesis holds a promise of high-quality speech, for the transitional 1157effects caused by tongue and jaw inertia can be modelled directly. 1158However, this potential has 1159not yet been realized. 1160Speech from current articulatory models is of much poorer quality than 1161that from acoustically-based synthesis methods. 1162The major problem is in gaining data about articulatory 1163behaviour during running speech \(em it is much easier to perform acoustic 1164analysis on the resulting sound than it is to examine the vocal organs in 1165action. Because of this, the subject is not treated in this book. 1166We will only look at articulatory properties insofar as they help us 1167to understand, in a qualitative way, the acoustic nature of speech. 1168.pp 1169Speech, however, is much more than mere articulation. 1170Consider \(em admittedly a rather extreme and chauvinistic example \(em the 1171number of ways a girl can say "yes". 1172Breathy voice, slow tempo, low pitch \(em these are all characteristics which 1173affect the utterance as a whole, rather than being classifiable into 1174individual sound segments. Linguists call them "prosodic" or 1175"suprasegmental" features, for they relate to overall aspects of the 1176utterance, and distinguish them from "segmental" ones which concern 1177the articulation of individual segments of syllables. 1178The most important prosodic features are pitch, or fundamental frequency 1179of the voice, and rhythm. 1180.pp 1181This chapter provides a brief introduction to the nature of the speech 1182signal. Depending upon what speech output techniques we use, it may be 1183necessary to understand something of the acoustic nature of the speech 1184signal; the system that generates it (the vocal tract); commonly-used 1185classifications of sound segments; and the prosodic aspects of speech. 1186This material is little used in the early chapters of the book, but 1187becomes increasingly important as the story unfolds. 1188Hence you may skip the remainder of this chapter if you wish, but 1189should return to it later to pick up more background whenever it 1190becomes necessary. 1191.sh "2.1 The anatomy of speech" 1192.pp 1193The so-called "voiced" sounds of speech \(em like the sound you make when 1194you say "aaah" \(em are produced by passing air up from the lungs through 1195the larynx or voicebox, which is situated just behind the Adam's apple. 1196The vocal tract from the larynx to the lips acts as a resonant cavity, 1197amplifying certain frequencies and attenuating others. 1198.pp 1199The waveform generated by the larynx, however, is not simply sinusoidal. 1200(If it were, the vocal tract resonances would merely 1201give a sine wave of the same frequency but amplified or 1202attenuated according to how close it was to the nearest resonance.) The 1203larynx contains two folds of skin \(em the vocal cords \(em which blow apart and flap 1204together again in each cycle of the pitch period. 
1205The pitch of a male voice in speech varies from as low as 50\ Hz 1206(cycles per second) to perhaps 1207250\ Hz, with a typical median value of 100\ Hz. 1208For a female voice the range is higher, up to about 500\ Hz in speech. 1209Singing can go much higher: a top C sung by a soprano has a frequency 1210of just over 1000\ Hz, and some opera singers can reach 1211substantially higher than this. 1212.pp 1213The flapping action of the vocal cords 1214gives a waveform which can be approximated by a 1215triangular pulse (this and other approximations will be discussed in 1216Chapter 5). 1217It has a rich spectrum of harmonics, 1218decaying at around 12\ dB/octave, and each harmonic is affected 1219by the vocal tract resonances. 1220.rh "Vocal tract resonances." 1221A simple model of the vocal tract is an organ-pipe-like cylindrical tube 1222(Figure 2.1), 1223with a sound source at one end (the larynx) and open at the other (the lips). 1224.FC "Figure 2.1" 1225This has resonances at wavelengths $4L$, $4L/3$, $4L/5$, ..., where $L$ 1226is the length of the tube; 1227and these correspond to frequencies $c/4L$, $3c/4L$, $5c/4L$, ...\ Hz, $c$ 1228being the speed of 1229sound in air. 1230Calculating these frequencies, using a typical figure for the 1231distance between larynx and lips of 17\ cm, 1232and $c = 340$\ m/s for the speed of sound, leads to resonances at 1233approximately 500\ Hz, 1500\ Hz, 2500\ Hz, ... . 1234.pp 1235When excited by the harmonic-rich waveform of the larynx, 1236the vocal tract resonances produce 1237peaks known as 1238.ul 1239formants 1240in the energy spectrum of the speech wave (Figure 2.2). 1241.FC "Figure 2.2" 1242The lowest formant, called formant one, varies from around 200\ Hz 1243to 1000\ Hz during speech, the exact range depending on the size 1244of the vocal tract. 1245Formant two varies from around 500 to 2500\ Hz, and formant three 1246from around 1500 to 3500\ Hz. 1247.pp 1248You can easily hear the lowest formant by whispering the vowels in 1249the words "heed", "hid", "head", "had", "hod", "hawed", and "who'd". 1250They appear to have a steadily descending pitch, yet since you are 1251whispering there is no fundamental frequency. 1252What you hear is the lowest resonance of the vocal tract \(em formant one. 1253Some masochistic people can play simple tunes with this formant by putting 1254their mouth in successive vowel shapes and knocking the top of their head 1255with their knuckles \(em hard! 1256.pp 1257A difficulty occurs when trying to identify the lower formants for speakers 1258with high-pitched voices. 1259When a formant frequency falls below the fundamental excitation frequency 1260of the voice, its effect is diminished \(em although it is still present. 1261The vibrato used by opera singers provides a very low-frequency excitation 1262(at the vibrato rate) which helps to illuminate the lower formants even 1263when the pitch of the voice is very high. 1264.pp 1265Of course, speech is not a static phenomenon. 1266The organ-pipe model describes the speech spectrum during a continuously 1267held vowel with the mouth in a neutral position such as for "aaah". 1268But in real speech the tongue and lips are in continuous motion, 1269altering the shape of the vocal tract and hence the positions of the resonances. 1270It is as if the organ-pipe were being squeezed and expanded in 1271different places all the time. 
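.pp
The figures quoted in this section are easy to check with a few lines of arithmetic.
The sketch below is written in Python purely for illustration; it plays no part in the
synthesis techniques described later in the book. It confirms that harmonics falling off
as the inverse square of the harmonic number, as the triangular approximation implies,
drop by about 12\ dB per octave, and it evaluates the organ-pipe resonances
$c/4L$, $3c/4L$ and $5c/4L$ for a 17\ cm tract.
.LB
.nf
# Illustrative arithmetic only; not part of any synthesis scheme.
import math

# Harmonic amplitudes of a roughly triangular glottal pulse fall off
# as about 1/n^2, so each octave (n to 2n) costs close to 12 dB.
for n in (1, 2, 4):
    drop = 20 * math.log10((1.0 / (2 * n) ** 2) / (1.0 / n ** 2))
    print("harmonic %d to %d: %.1f dB" % (n, 2 * n, drop))

# Resonances of a uniform tube closed at the larynx and open at the lips.
c = 340.0   # speed of sound in air, m/s
L = 0.17    # larynx-to-lips distance, m
for k in (1, 2, 3):
    print("resonance %d: %.0f Hz" % (k, (2 * k - 1) * c / (4 * L)))
.fi
.LE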
1272Say 1273.ul 1274ee 1275as in "heed" and feel how close your tongue is to the roof of your mouth, 1276causing a constriction near the front of the vocal cavity. 1277.pp 1278Linguists and speech engineers use a special frequency analyser called a 1279"sound spectrograph" to make a three-dimensional plot of the variation 1280of the speech energy spectrum with time. 1281Figure 2.3 shows a spectrogram of the 1282utterance "go away". 1283.FC "Figure 2.3" 1284Frequency is given on the vertical axis, 1285and bands are shown at the beginning to indicate the scale. 1286Time is plotted horizontally, 1287and energy is given by the darkness of any particular area. 1288The lower few formants can be seen as dark bands extending horizontally, 1289and they are in continuous motion. 1290In the neutral first vowel of "away", the formant frequencies 1291pass through 1292approximately the 500\ Hz, 1500\ Hz, and 2500\ Hz that we calculated earlier. 1293(In fact, formants two and three are somewhat lower than these values.) 1294.pp 1295The 1296fine vertical striations in the spectrogram correspond to single openings of the vocal cords. 1297Pitch changes continuously throughout an utterance, 1298and this can be seen on the spectrogram by the differences in spacing 1299of the striations. 1300Pitch change, or 1301.ul 1302intonation, 1303is singularly important in 1304lending naturalness to speech. 1305.pp 1306On a spectrogram, a continuously held vowel shows up as a static energy spectrum. 1307But beware \(em what we call a vowel in everyday language is not the same thing as a 1308"vowel" in phonetic terms. 1309Say "I" and feel how the tongue moves continuously while you're speaking. 1310Technically, this is a 1311.ul 1312diphthong 1313or slide between two vowel positions, 1314and not a single vowel. 1315If you say 1316.ul 1317ar 1318as in "hard", 1319and change slowly to 1320.ul 1321ee 1322as in "heed", you will obtain a diphthong not unlike that in "I". 1323And there are many more phonetically different vowel sounds 1324than the a, e, i, o, and u that we normally think of. 1325The words "hood" and "mood" have different vowels, for example, as do "head" and "mead". 1326The principal acoustic difference between the various vowel sounds 1327is in the frequencies of the first two formants. 1328.pp 1329A further complication is introduced by the nasal tract. This is 1330a large cavity which is coupled to the oral tract by a passage at the 1331back of the mouth. 1332The passage is guarded by a flap of skin called the "velum". 1333You know about this because inadvertent opening of the velum while 1334swallowing causes food or drink to go up your nose. 1335The nasal cavity is switched in and out of the vocal tract 1336by the velum during speech. 1337It is used for consonants 1338.ul 1339m, 1340.ul 1341n, 1342and the 1343.ul 1344ng 1345sound in the word 1346"singing". 1347Vowels are frequently nasalized too. 1348A very effective demonstration of the amount of nasalization in ordinary 1349speech can be obtained by cutting a nose-shaped hole in a large 1350baffle which divides a room, speaking normally with one's nose in the hole, 1351and having someone listen on the other side. 1352The frequency of occurrence of 1353nasal sounds, and the volume of sound that is emitted 1354through the nose, are both surprisingly large. 1355Interestingly enough, when we say in conversation that someone sounds 1356"nasal", we usually mean "non-nasal". 
When the nasal passages are 1357blocked by a cold, nasal sounds are missing \(em 1358.ul 1359n\c 1360\&'s turn into 1361.ul 1362d\c 1363\&'s, 1364and 1365.ul 1366m\c 1367\&'s to 1368.ul 1369b\c 1370\&'s. 1371.pp 1372When the nasal cavity is switched in to the vocal tract, it introduces 1373formant resonances, just as the oral cavity does. 1374Although we cannot 1375alter the shape of the nasal tract significantly, the nasal formant 1376pattern is not fixed, because the oral tract does play a part in nasal 1377resonances. 1378If you say 1379.ul 1380m, 1381.ul 1382n, 1383and 1384.ul 1385ng 1386continuously, you can hear the difference and feel how it is produced by 1387altering the combined nasal/oral tract resonances with your tongue position. 1388The nasal cavity operates in parallel with 1389the oral one: this causes the two resonance patterns to be summed 1390together, with resulting complications which will be discussed in Chapter 5. 1391.rh "Sound sources." 1392Speech involves sounds other than those caused by regular vibration of 1393the larynx. 1394When you whisper, the folds of the larynx are held slightly 1395apart so that the air passing between them becomes turbulent, causing a noisy excitation 1396of the resonant cavity. 1397The formant peaks are still present, superimposed on the noise. Such 1398"aspirated" sounds occur in the 1399.ul 1400h 1401of "hello", and for a very short time 1402after the lips are opened at the beginning of "pit". 1403.pp 1404Constrictions made in the mouth produce hissy noises such as 1405.ul 1406ss, 1407.ul 1408sh, 1409and 1410.ul 1411f. 1412For example, in 1413.ul 1414ss 1415the tip of the tongue is high up, 1416very close to the roof of the mouth. 1417Turbulent air passing through this constriction causes a 1418random noise excitation, known as "frication". 1419Actually, the roof of the mouth is quite a complicated object. 1420You can feel with your tongue a bony hump or ridge just behind the front 1421teeth, and it is this that forms a constriction with the tongue for 1422.ul 1423s. 1424In 1425.ul 1426sh, 1427the tongue is flattened close to the roof of the mouth slightly farther back, 1428in a position rather similar to that for 1429.ul 1430ee 1431but with a narrower 1432constriction, 1433while 1434.ul 1435f 1436is produced with the upper teeth and lower lip. 1437Because they are made near the front of the mouth, 1438the resonances of the vocal tract have little effect on these fricative 1439sounds. 1440.pp 1441To distinguish them from aspiration and frication, the ordinary speech 1442sounds (like "aaah") which have their source in larynx vibration are 1443known technically as "voiced". Aspirated and fricative sounds are called 1444"unvoiced". Thus the three different sound types can be classified as 1445.LB 1446.NP 1447voiced 1448.NP 1449unvoiced (fricative) 1450.NP 1451unvoiced (aspirated). 1452.LE 1453Can any of these three types occur together? 1454It would seem that voicing and aspiration can not, for the former requires 1455the larynx to be vibrating regularly, but for the latter it must be 1456generating turbulent noise. 1457However, there is a condition known technically as "breathy voice" 1458which occurs when the vocal cords are slightly apart, still vibrating, 1459but with a large volume of air passing between to create turbulence. 1460Voicing can easily occur in conjunction with frication. 
Corresponding to
.ul
s,
.ul
sh,
and
.ul
f
we get the
.ul
voiced
fricatives
.ul
z,
the sound in the middle of words like "vision" which I will call
.ul
zh,
and
.ul
v.
A simple illustration of voicing is to say "ffffvvvvffff\ ...".
During the voiced part you can feel the larynx vibrations with a finger
on your Adam's apple, and it can be heard quite clearly if you stop up
your ears.
Technically, there is nothing to prevent frication and aspiration
from occurring together \(em they do, for example, when a voiced fricative
is whispered \(em but the combination is not an important one.
.pp
The complicated acoustic effects of noisy excitations in speech can be
seen in the spectrogram in Figure 2.4 of
"high altitude jets whizz past screaming".
.FC "Figure 2.4"
.rh "The source-filter model of speech production."
We have been talking in terms of a sound source (be it voiced or unvoiced)
exciting the resonances of the oral (and possibly the nasal) tract.
This model, which is used extensively in speech analysis and synthesis,
is known as
the source-filter model of speech production. The reason for its success
is that the effect of the resonances can be modelled as a frequency-selective
filter, operating on an input which is the source excitation.
Thus the frequency spectrum of the source is modified by multiplying it
by the frequency characteristic of the filter (or adding it, if amplitudes
are expressed logarithmically).
This can be seen in Figure 2.5, which shows a source
spectrum and filter characteristic which combine to give the overall
spectrum of Figure 2.2.
.FC "Figure 2.5"
.pp
Although, as mentioned above, the various fricatives are not subjected
to the resonances of the vocal tract to the same extent
that voiced and aspirated
sounds are, they can still be modelled as a noise source followed by
a filter to give them their different sound qualities.
.pp
The source-filter model is an oversimplification of the actual speech
production system. There is inevitably some coupling between the vocal
tract and the lungs, through the glottis, during the period when
it is open. This effectively makes the filter characteristics
change during each individual cycle of the excitation.
However, although the effect is of interest to speech researchers,
it is probably not of great significance for practical speech output.
.pp
One very interesting implication of the
source-filter model is that the prosodic features of
pitch and amplitude are largely properties of the source, while
segmental ones are introduced by the filter. This makes it possible to
separate some aspects of
overall prosody from the actual segmental content of an
utterance, so that, for example, a human utterance can be stored initially
and then spoken by a machine with a variety of different intonations.
.sh "2.2 Classification of speech sounds"
.pp
The need to classify sound segments as a basis for storing generalized acoustic
information and retrieving it was mentioned earlier. There is a real
difficulty here because speech is by nature continuous and classifications are
discrete.
1537It is important to remember this difficulty because it is all too easy 1538to criticize the complex and often confusing attempts of linguists to 1539tackle the classification task. 1540.pp 1541Linguists call a written representation of the 1542.ul 1543sounds 1544of an utterance a "phonetic 1545transcription" of it. The same utterance can be transcribed at 1546different levels of detail: simple transcriptions are called "broad" 1547and more specific ones are called "narrow". 1548Perhaps the most logically satisfying kind of transcription employs units 1549termed "phonemes". This is the broadest transcription, 1550and is sometimes called a 1551.ul 1552phonemic 1553transcription to emphasize that that it is in terms of phonemes. 1554Unfortunately, the word "phoneme" is often used somewhat loosely. 1555In its true sense, a phoneme is a 1556.ul 1557logical 1558unit, rather than a physical, acoustic, one, 1559and is defined in relation to a particular language by reference 1560to its use in discriminating different words. 1561Classifications of sounds which are based on their 1562semantic 1563role as word-discriminators are called 1564.ul 1565phonological 1566classifications: we could ensure that there is no ambiguity in the sense 1567with which we use the term "phoneme" by calling it a phonological unit, and 1568the phonemic transcription could be called a phonological one. 1569.rh "Broad phonetic transcription." 1570A phoneme is an abstract unit representing a set of different sounds. 1571The issue is confused by the fact that the members of the set actually 1572sound very similar, if not identical, to the untrained ear \(em precisely because 1573the difference between them plays no part in distinguishing words from 1574each other in the particular language concerned. 1575.pp 1576Take the words "key" and "caw", for example. Despite the difference in 1577spelling, both of them begin with a 1578.ul 1579k 1580sound that belongs (in English) 1581to the same phoneme set, called 1582.ul 1583k. 1584However, say them two or three times each, concentrating on the position of 1585the tongue during the 1586.ul 1587k. 1588It is quite different in each case. For "key", it 1589is raised, close to the roof of the mouth, in preparation for the 1590.ul 1591ee, 1592whereas in "caw" it is much lower down. 1593The sound of the 1594.ul 1595k 1596is actually quite different in the two cases. 1597Yet they belong to the same phoneme, for there is no pair of words which 1598relies on this difference to distinguish them \(em "key" and "caw" are 1599obviously distinguished by their vowels, not by the initial 1600consonant. 1601You probably cannot hear clearly the difference between the two 1602.ul 1603k\c 1604\&'s, 1605precisely because they belong to the same phoneme and so the difference 1606is not important (for English). 1607.pp 1608The point is sharpened by considering another language where we make a 1609distinction \(em and hence can hear the difference \(em between two sounds 1610that belong, in the language, to the same phoneme. 1611Japanese does not distinguish 1612.ul 1613r 1614from 1615.ul 1616l. 1617Japanese people 1618.ul 1619do not hear 1620the difference between "lice" and "rice", in the same way that you do 1621not hear the difference between the two 1622.ul 1623k\c 1624\&'s above. 1625Cockneys do not hear, except with a special effort, the difference 1626between "has" and "as", or "haitch" and "aitch", for the Cockney dialect 1627does not recognize initial 1628.ul 1629h\c 1630\&'s. 
1631.pp 1632So what is a phoneme? It is a set of sounds whose members do not 1633discriminate between any words in the language under consideration. 1634If you are mathematically minded you could think of it as an equivalence 1635class of sounds, determined by the relationship 1636.LB 1637$sound sub 1$ is related to $sound sub 2$ if $sound sub 1$ and $sound sub 2$ 1638do not discriminate any pair of words in the language. 1639.LE 1640The 1641.ul 1642p 1643and 1644.ul 1645d 1646in 1647"pig" and "dig" belong to different phonemes (in English), 1648because they discriminate 1649the two words. 1650.ul 1651b, 1652.ul 1653f, 1654and 1655.ul 1656j 1657belong to different phonemes again. 1658.ul 1659i 1660and 1661.ul 1662a 1663in "hid" and "had" belong to different phonemes too. 1664Proceeding like this, a list of phonemes can be drawn up. 1665.pp 1666Such a list is shown in Table 2.1, for British English. 1667(The layout of the list does have some significance in terms of different 1668categories of phonemes, which will be explained later.) In fact, 1669linguists use an 1670assortment of English letters, foreign letters, and special 1671symbols to represent phonemes. In this book we use one- or two-letter 1672codes, partly because they are more mnemonic, and partly because 1673they are more suitable for communication to computers using standard 1674peripheral devices. 1675They are 1676a direct transliteration of linguists' standard International Phonetic 1677Association symbols. 1678.RF 1679.nr x1 3m+1.0i+0.5i+0.5i+0.5i+\w'y'u 1680.nr x1 (\n(.l-\n(x1)/2 1681.in \n(x1u 1682.ta 3m +1.0i +0.5i +0.5i +0.5i +0.5i +0.5i 1683\fIuh\fR (the) \fIp\fR \fIt\fR \fIk\fR 1684\fIa\fR (bud) \fIb\fR \fId\fR \fIg\fR 1685\fIe\fR (head) \fIm\fR \fIn\fR \fIng\fR 1686\fIi\fR (hid) 1687\fIo\fR (hod) \fIr\fR \fIw\fR \fIl\fR \fIy\fR 1688\fIu\fR (hood) 1689\fIaa\fR (had) \fIs\fR \fIz\fR 1690\fIee\fR (heed) \fIsh\fR \fIzh\fR 1691\fIer\fR (heard) \fIf\fR \fIv\fR 1692\fIuu\fR (food) \fIth\fR \fIdh\fR 1693\fIar\fR (hard) \fIch\fR \fIj\fR 1694\fIaw\fR (hoard) \fIh\fR 1695.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 1696.in 0 1697.FG "Table 2.1 The phonemes of British English" 1698.pp 1699We will discuss the sounds which make up each of these phoneme classes 1700shortly. First, however, it is worthwhile pointing out some rather 1701tricky points in the definition of these phonemes. 1702.rh "Phonological difficulties." 1703There are snags with phonological classification, as there are 1704in any area where attempts are made to make completely logical 1705statements about human activity. 1706Consider 1707.ul 1708h 1709and the 1710.ul 1711ng 1712in "singing". 1713(\c 1714.ul 1715ng 1716is certainly not an 1717.ul 1718n 1719sound followed by a 1720.ul 1721g 1722sound, although 1723it is true that in some English accents "singing" is rendered with 1724the 1725.ul 1726ng 1727followed by a 1728.ul 1729g 1730at each of its two occurrences.) No words 1731end with 1732.ul 1733h, 1734and none begin with 1735.ul 1736ng. 1737(Notice that we are still talking about British English. 1738In Chinese, the sound 1739.ul 1740ng 1741is a word in its own right, and is a common 1742family name. 1743But we must stick with one language for phonological classification.) Hence 1744it follows that there is no pair of words which is distinguished 1745by the difference between 1746.ul 1747h 1748and 1749.ul 1750ng. 1751Technically, 1752they belong to the same phoneme. 
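.pp
The equivalence-class idea can even be applied mechanically. The sketch below is
written in Python, and its tiny word list and broad transcriptions are invented purely
for illustration; it merges two sounds whenever the list contains no pair of words
distinguished by swapping one for the other, and on a list with no suitable minimal
pairs it duly reports that \fIh\fR and \fIng\fR could share a phoneme.
.LB
.nf
# Toy illustration of the minimal-pair test.  The word list and its broad
# transcriptions are invented for the example and are far too small to be
# representative of English.
from itertools import combinations

lexicon = [
    ("p", "i", "g"), ("d", "i", "g"), ("b", "i", "g"),
    ("h", "i", "d"), ("h", "aa", "d"), ("h", "e", "d"),
    ("s", "i", "ng"), ("s", "aa", "ng"), ("s", "o", "ng"),
]

def contrastive(a, b):
    """True if some pair of words differs only by a single a/b swap."""
    for w1, w2 in combinations(lexicon, 2):
        if len(w1) != len(w2):
            continue
        diffs = [(x, y) for x, y in zip(w1, w2) if x != y]
        if len(diffs) == 1 and set(diffs[0]) == {a, b}:
            return True
    return False

print(contrastive("p", "d"))    # True:  "pig" and "dig" form a minimal pair
print(contrastive("i", "aa"))   # True:  "hid" and "had" form a minimal pair
print(contrastive("h", "ng"))   # False: nothing here separates h from ng
.fi
.LE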
However, technical considerations 1753in this case must take second place to common sense! 1754.pp 1755The 1756.ul 1757j 1758in "jig" is another interesting case. It can be considered 1759to belong to a 1760.ul 1761j 1762phoneme, or to be a sequence of two 1763phonemes, 1764.ul 1765d 1766followed by 1767.ul 1768zh 1769(the sound in "vision"). There is 1770disagreement on this point in phonetics textbooks, and we do not 1771have the time (nor, probably, the inclination!) to consider the 1772pros and cons of this moot point. 1773I have resolved the matter arbitrarily by writing it as a separate 1774phoneme. The 1775.ul 1776ch 1777in "choose" is a similar case 1778(\c 1779.ul 1780t 1781followed by the 1782.ul 1783sh 1784in "shoes"). 1785.pp 1786Another difficulty, this time where Table 2.1 does not show how to 1787distinguish between two sounds which 1788.ul 1789do 1790discriminate words in many people's English, is the 1791.ul 1792w 1793in "witch" 1794and that in "which". The latter is conventionally transcribed 1795as a sequence of two phonemes, 1796.ul 1797h w. 1798.pp 1799The last few difficulties are all to do with deciding whether a 1800sound belongs to a single phoneme class, or comprises a sequence 1801of sounds each of which belongs to a phoneme. 1802Are the 1803.ul 1804j 1805in "jug", the 1806.ul 1807ch 1808in "chug", and the 1809.ul 1810w 1811in "which", 1812single phonemes or not? The definition above of a phoneme 1813as a "set of sounds whose members do not discriminate any words 1814in the language" does not help us to answer this question. 1815As far as this definition is concerned, we could go so far as 1816to call each and every word of the language an individual phoneme! 1817It is clear that some acoustic evidence, and quite a lot of judgement, 1818is being used when phonemes such as those of Table 2.1 are defined. 1819.pp 1820So much for the consonants. This same problem occurs in vowel sounds, 1821particularly in diphthongs, which are sequences of two vowel-like sounds. 1822Do the vowels of "main" and "man" belong to different phonemes? 1823Clearly so, if they are both transcribed as single units, for they 1824distinguish the two words. 1825Notwithstanding the fact that they are sequences of separate sounds, 1826a logically consistent system could be constructed which gave separate, 1827unitary, symbols to each diphthong. 1828However, it is usual to employ a compound symbol which indicates explicitly 1829the character of the two vowel-like sounds involved. 1830We will transcribe the diphthong of "main" as a sequence of two 1831vowels, 1832.ul 1833e 1834(as in "head") and 1835.ul 1836i 1837(as in "hid", not "I"). 1838This is done primarily for economy of symbols, choosing the constituent 1839sounds on the basis of the closest match to existing vowel sounds. 1840(Note that this again violates purely 1841.ul 1842logical 1843criteria for identifying phonemes.) 1844.rh "Categories of speech sounds." 1845A phoneme is defined as a set of sounds whose members to not discriminate 1846between any words in the language under consideration. 1847The phonemes themselves can be classified into groups which reflect 1848similarities between them. 1849This can be done in many different ways, using various criteria 1850for classification. In fact, one branch of linguistic research 1851is concerned with defining a set of "distinctive 1852features" such that a phoneme class is uniquely identified by 1853the values of the features. 
Distinctive features are binary, 1854and include such things as voiced\(emunvoiced, fricative\(emnot\ fricative, 1855aspirated\(emunaspirated. We will not be concerned here with such 1856detailed classifications, but it is as well to know that they exist. 1857.pp 1858There is an everyday distinction between vowels and consonants. 1859A vowel forms the nucleus of every syllable, and one or more consonants 1860may optionally surround the vowel. 1861But the distinction sometimes becomes a little ambiguous. 1862Syllables like 1863.ul 1864sh 1865are commonly uttered and certainly do not 1866contain a vowel. Furthermore, when we say "vowel" in everyday 1867language we usually refer to the 1868.ul 1869written 1870vowels a, e, i, o, and u; there are many more vowel sounds. 1871A vowel in orthography is different to a vowel as a phoneme. 1872Is a diphthong a phonetic vowel? \(em certainly, by the syllable-nucleus 1873criterion; but it is a little different from ordinary vowels because 1874it is a changing sound rather than a constant one. 1875.pp 1876Table 2.2 shows one classification of the phonemes of Table 2.1, which 1877will be useful in our later studies of speech synthesis from phonetics. 1878It shows twelve vowels, including the rather peculiar one 1879.ul 1880uh 1881(which corresponds to the first vowel in the word "above"). 1882This is the sound produced by the vocal tract when it is in a relaxed, 1883neutral position; and it never occurs in prominent, stressed, 1884syllables. The vowels later in the list are almost always longer 1885than the earlier ones. In fact, the first six 1886(\c 1887.ul 1888uh, a, e, i, o, u\c 1889) 1890are often called "short" vowels, and the last five 1891(\c 1892.ul 1893ee, er, uu, ar, aw\c 1894) 1895"long" ones. The shortness or longness of the one in the middle 1896(\c 1897.ul 1898aa\c 1899) 1900is rather ambiguous. 1901.RF 1902.nr x0 \w'000unvoiced fricative 'u 1903.nr x1 \n(x0+\w'[not classified as individual phonemes]'u 1904.nr x1 (\n(.l-\n(x1)/2 1905.in \n(x1u 1906.ta \n(x0u 1907.fi 1908vowel \c 1909.ul 1910uh a e i o u aa ee er uu ar aw 1911.br 1912diphthong [not classified as individual phonemes] 1913.br 1914glide (or liquid) \c 1915.ul 1916r w l y 1917.br 1918stop 1919.br 1920\0\0\0unvoiced stop \c 1921.ul 1922p t k 1923.br 1924\0\0\0voiced stop \c 1925.ul 1926b d g 1927.br 1928nasal \c 1929.ul 1930m n ng 1931.br 1932fricative 1933.br 1934\0\0\0unvoiced fricative \c 1935.ul 1936s sh f th 1937.br 1938\0\0\0voiced fricative \c 1939.ul 1940z zh v dh 1941.br 1942affricate 1943.br 1944\0\0\0unvoiced affricate \c 1945.ul 1946ch 1947.br 1948\0\0\0voiced affricate \c 1949.ul 1950j 1951.br 1952aspirate \c 1953.ul 1954h 1955.nf 1956.in 0 1957.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 1958.FG "Table 2.2 Phoneme categories" 1959.pp 1960Diphthongs pose no problem here because we have not classified them 1961as single phonemes. 1962.pp 1963The remaining categories are consonants. The glides are quite 1964similar to vowels and diphthongs, though; for they are voiced, 1965continuous sounds. You can say them and prolong them. 1966(This is also true of the fricatives.) 1967.ul 1968r 1969is interesting 1970because it can be realized acoustically in very different ways. 1971Some people curl the tip of the tongue 1972back \(em a so-called retroflex action of the tongue. Many people 1973cannot do this, and their 1974.ul 1975r\c 1976\&'s sound like 1977.ul 1978w\c 1979\&'s. 
1980The stage Scotsman's 1981.ul 1982r 1983is a trill where the tip of the tongue vibrates against the roof of the mouth. 1984.ul 1985l 1986is also 1987slightly unusual, for it is the only English phoneme which is "lateral" \(em 1988air passes either side of it, in two separate passages. Welsh 1989has another lateral sound, a fricative, which is written "ll" as 1990in "Llandudno". 1991.pp 1992The next category is the stops. These are formed by stopping up 1993the mouth, so that air pressure builds up behind the lips, and 1994releasing this pressure suddenly. The result is a little 1995explosion (and the stops are often called "plosives"), which 1996usually creates a very short burst of fricative noise (and, in some cases, 1997aspiration as well). They are further subdivided into voiced and 1998unvoiced stops, depending upon whether voicing starts as soon as 1999the plosion occurs (sometimes even before) or well after it. 2000If you put your hand in front of your mouth when saying "pit" you 2001can easily feel the puff of air that signals the plosion on the 2002.ul 2003p, 2004and probably on the 2005.ul 2006t 2007as well. 2008.pp 2009In a sense, nasals are really stops as well (and they are often 2010called stops), for the oral tract is blocked although the nasal 2011one is not. The peculiar fact that the nasal 2012.ul 2013ng 2014never occurs at the beginning of a word (in English) was mentioned 2015earlier. Notice that for stops and nasals there is a similarity in the 2016.ul 2017vertical 2018direction of Table 2.2, between 2019.ul 2020p, 2021.ul 2022b, 2023and 2024.ul 2025m; 2026.ul 2027t, 2028.ul 2029d, 2030and 2031.ul 2032n; 2033and 2034.ul 2035k, 2036.ul 2037g, 2038and 2039.ul 2040ng. 2041.ul 2042p 2043is an unvoiced version of 2044.ul 2045b 2046(try saying them), 2047and 2048.ul 2049m 2050is a nasalized version (for 2051.ul 2052b 2053is what you get when you 2054have a cold and try to say 2055.ul 2056m\c 2057). 2058These three sounds are all made 2059at the front of the mouth, while 2060.ul 2061t, 2062.ul 2063d, 2064and 2065.ul 2066n, 2067which bear the 2068same resemblance to each other, are made in the middle; and 2069.ul 2070k, 2071.ul 2072g, 2073and 2074.ul 2075ng 2076are made at the back. This introduces another 2077possible classification, according to 2078.ul 2079place of articulation. 2080.pp 2081The unvoiced fricatives are quite straightforward, except perhaps 2082for 2083.ul 2084th, 2085which is the sound at the beginning of "thigh". 2086They are paired with the voiced fricatives on the basis of place 2087of articulation. The voiced version of 2088.ul 2089th 2090is the 2091.ul 2092dh 2093at 2094the beginning of "thy". 2095.ul 2096zh 2097is a fairly rare phoneme, which 2098is heard in the middle of "vision". Affricates are similar to 2099fricatives but begin with a stopped posture, and we mentioned earlier 2100the controversy as to whether they should be considered to be 2101single phonemes, or 2102sequences of stop phonemes and fricatives. 2103Finally comes the lonely aspirate, 2104.ul 2105h. 2106Aspiration does occur 2107elsewhere in speech, during the plosive burst of unvoiced stops. 2108.rh "Narrow phonetic transcription." 2109The phonological classification outlined above is based upon a clear 2110rationale for distinguishing between sounds according to how 2111they affect meaning \(em although the rationale does become 2112somewhat muddied in difficult cases. 2113Narrower transcriptions are not so systematic. 
2114They use units called 2115.ul 2116allophones, 2117which are defined by reference to physical, acoustic, criteria rather 2118than purely logical ones. 2119("Phone" is a more old-fashioned term for the same thing, 2120and the misused word "phoneme" is often employed where allophone is 2121meant, that is, as a physical rather than a logical 2122unit.) Each phoneme has several allophones, 2123more or less depending on how narrow or broad the transcription is, 2124and the allophones are different acoustic realizations of the same 2125logical unit. 2126For example, the 2127.ul 2128k\c 2129\&'s in "key" and "caw" may be considered as different 2130allophones (in a slightly narrow transcription). 2131Although we will not use symbols for allophones here, 2132they are often indicated by diacritical marks in a text 2133which modify the basic phoneme classes. 2134For example, a tilde (~) over a vowel means that it is nasalized, while a small 2135circle underneath a consonant means that it is devoiced. 2136.pp 2137Allophonic variation in speech is governed by a mechanism called 2138.ul 2139coarticulation, 2140where a sound is affected by those that come either side of it. 2141"Key"\-"caw" is a clear example of this, where the tongue 2142position in the 2143.ul 2144k 2145anticipates that of the following vowel \(em high 2146in the first case, low in the second. 2147Most allophonic variation in English is anticipatory, in that the sound 2148is influenced by the following articulation rather than by 2149preceding ones. 2150.pp 2151Nasalization is a feature which applies to vowels in English through 2152anticipatory coarticulation. 2153In many languages (for example, French) it is a 2154.ul 2155distinctive 2156feature for vowels in that it serves to distinguish one vowel phoneme class 2157from another. 2158That this is not so in English sometimes tempts us to assume, 2159incorrectly, that nasalization does not occur in vowels. 2160It does, typically when the vowel is followed by a nasal consonant, and it is 2161important for synthesis that nasalized vowel allophones are recognized and 2162treated accordingly. 2163.pp 2164Coarticulation can be predicted by phonological rules, which show 2165how a phonemic sequence will be realized by allophones. 2166Such rules have been studied extensively by linguists. 2167.pp 2168The reason for coarticulation, and for the existence of allophones, 2169lies in the physical constraints imposed by the motion 2170of the articulatory organs \(em particularly their acceleration and deceleration. 2171An immensely crude model is that the brain decides what phonemes to 2172say (for it is concerned with semantic things, and the definition 2173of a phoneme is a semantic one). 2174It then takes this sequence and translates it into neural commands 2175which actually move the articulators into target positions. 2176However, other commands may be issued, and executed, before these targets 2177are reached, and this accounts for coarticulation effects. 2178Phonological rules for converting a phonemic sequence to an 2179allophonic one are a sort of discrete model of the process. 2180Particularly for work involving computers, it is possible that this 2181rule-based approach will be overtaken by potentially more accurate 2182methods which attempt to model the continuous articulatory phenomena 2183directly. 2184.sh "2.3 Prosody" 2185.pp 2186The phonetic classification introduced above divides speech into 2187segments and classifies these into phonemes or allophones. 
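.pp
One way to picture this segment-by-segment description in concrete terms is as the
output of a small rewriting process. The sketch below is written in Python; the rule,
the symbols and the example transcriptions are deliberately simplified, and it is
offered only as an illustration of the phoneme-to-allophone rules discussed in the
previous section. It tags a vowel as nasalized, using the tilde convention mentioned
above, whenever a nasal consonant follows it.
.LB
.nf
# Toy phoneme-to-allophone rule: anticipatory nasalization of vowels.
# Everything here is simplified for illustration.

VOWELS = {"uh", "a", "e", "i", "o", "u", "aa", "ee", "er", "uu", "ar", "aw"}
NASALS = {"m", "n", "ng"}

def apply_nasalization(phonemes):
    """Return an allophonic transcription in which vowels followed by a
    nasal are tagged with '~', standing in for the nasalization diacritic."""
    allophones = []
    for here, after in zip(phonemes, phonemes[1:] + [None]):
        if here in VOWELS and after in NASALS:
            allophones.append(here + "~")
        else:
            allophones.append(here)
    return allophones

print(apply_nasalization(["k", "aa", "n"]))   # ['k', 'aa~', 'n']
print(apply_nasalization(["k", "aa", "t"]))   # ['k', 'aa', 't']
.fi
.LE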
2188Riding on top of this stream of segments are other, more global, 2189attributes that dictate the overall prosody of the utterance. 2190Prosody is defined by the Oxford English Dictionary as the 2191"science of versification, laws of metre," 2192which emphasizes the aspects of stress and rhythm that are central 2193to classical verse. 2194There are, however, many other features which are more or less 2195global. 2196These are collectively called prosodic or, equivalently, suprasegmental, 2197features, for they lie above the level of phoneme or syllable segments. 2198.pp 2199Prosodic features can be split into two basic categories: features 2200of voice quality and features of voice dynamics. 2201Variations in voice quality, which are sometimes called 2202"paralinguistic" phenomena, are accounted for by anatomical 2203differences and long-term muscular idiosyncrasies (like a sore 2204throat), and have little part to play in the kind of applications 2205for speech output that have been sketched in Chapter 1. 2206Variations in voice dynamics occur in three dimensions: pitch 2207or fundamental frequency of the voice, time, and amplitude. 2208Within the first, the pattern of pitch variation, or 2209.ul 2210intonation, 2211can be distinguished from the overall range within which that variation 2212occurs. 2213The time dimension encompasses the rhythm of the speech, pauses, and the 2214overall tempo \(em whether it is uttered quickly or slowly. 2215The third dimension, amplitude, is of relatively minor importance. 2216Intonation and rhythm work together to produce an effect commonly called 2217"stress", and we will elaborate further on the nature of stress and discuss 2218algorithms for synthesizing intonation and rhythm in Chapter 8. 2219.pp 2220These features have a very important role to play in communicating meaning. 2221They are not fancy, optional components. 2222It is their neglect which is largely responsible for the layman's 2223stereotype of computer speech, 2224a caricature of living speech \(em abrupt, arhythmic, and in a grating 2225monotone \(em 2226which was well characterized by Isaac Asimov when he wrote of speaking 2227"all in capital letters". 2228.pp 2229Timing has a syntactic function in that it sometimes helps to 2230distinguish nouns from 2231verbs 2232(\c 2233.ul 2234ex\c 2235tract versus ex\c 2236.ul 2237tract\c 2238). 2239and adjectives from verbs (app\c 2240.ul 2241rox\c 2242imate versus approxi\c 2243.ul 2244mate\c 2245) \(em although segmental aspects play a part here too, for the vowel 2246qualities differ in each pair of words. 2247Nevertheless, if you make a mistake when assigning stress to words 2248like these in conversation you are very likely to be queried as 2249to what you actually said. 2250.pp 2251Intonation has a big effect on meaning too. 2252Pitch often \(em but by no means always \(em rises on a question, 2253the extent and abruptness of the rise depending on features like whether 2254a genuine information-bearing reply or merely confirmation is expected. 2255A distinctive pitch pattern accompanies the introduction of a new topic. 2256In conjunction with rhythm, intonation can be used to bring out contrasts 2257as in 2258.LB 2259.NI 2260"He didn't have a 2261.ul 2262red 2263car, he had a 2264.ul 2265black 2266one." 2267.LE 2268In general, the intonation patterns used by a reader depend not only on 2269the text itself, but on his interpretation of it, and also on his 2270expectation of the listener's interpretation of it. 
2271For example: 2272.LB 2273.NI 2274"He had a 2275.ul 2276red 2277car" (I think you thought it was black), 2278.NI 2279"He had a red 2280.ul 2281bi\c 2282cycle" (I think you thought it was a car). 2283.LE 2284.pp 2285In natural speech, prosodic features are significantly influenced by 2286whether the utterance is generated spontaneously or read aloud. 2287The variations in spontaneous speech are enormous. 2288There are all sorts of emotions which are plainly audible in 2289everyday speech: sarcasm, excitement, rudeness, disagreement, 2290sadness, fright, love. 2291Variations in voice quality certainly play a part here. 2292Even with "ordinary" cooperative friendly conversation, the need to find 2293words and somehow fit them into an overall utterance produces great 2294diversity of prosodic structures. 2295Applications for speech output from computers do not, however, call for 2296spontaneous conversation, but for a controlled delivery which is 2297like that when reading aloud. 2298Here, the speaker is articulating utterances which have been set out for 2299him, reducing his cognitive load to one of understanding and interpreting 2300the text rather than generating it. 2301Unfortunately for us, linguists are (quite rightly) 2302primarily interested in living, 2303spontaneous speech rather than pre-prepared readings. 2304.pp 2305Nevertheless, the richness of prosody in speech even when reading from 2306a book should not be underestimated. 2307Read aloud to an audience and listen to the contrasts in voice dynamics 2308deliberately introduced for variety's sake. 2309If stories are to be read there is even a case for controlling voice 2310.ul 2311quality 2312to cope with quotations and affective imitations. 2313.pp 2314We saw earlier that the source-filter model is particularly 2315helpful in distinguishing prosodic features, which are largely 2316properties of the source, from segmental ones, which belong to 2317the filter. 2318Pitch and amplitude are primarily source properties. 2319Rhythm and speed of speaking are not, but neither are they filter 2320properties, for they belong to the source-filter system as a whole 2321and not specifically to either part of it. 2322The difficult notion of stress is, from an acoustic point of view, 2323a combination of pitch, rhythm, and amplitude. 2324Even some features of voice quality can be attributed to the source 2325(like laryngitis), although others \(em cleft palate, badly-fitting 2326dentures \(em affect segmental features as well. 2327.sh "2.4 Further reading" 2328.pp 2329This chapter has been no more than a cursory introduction to some 2330of the difficult problems of linguistics and phonetics. 2331Here are some readable books which discuss these problems further. 2332.LB "nn" 2333.\"Abercrombie-1967-1 2334.ds [F 1 2335.]- 2336.ds [A Abercrombie, D. 2337.ds [D 1967 2338.ds [T Elements of general phonetics 2339.ds [I Edinburgh Univ Press 2340.nr [T 0 2341.nr [A 1 2342.nr [O 0 2343.][ 2 book 2344.in+2n 2345This is an excellent book which covers all of the areas of this 2346chapter, in much more detail than has been possible here. 2347.in-2n 2348.\"Brown-1980-2 2349.ds [F 2 2350.]- 2351.ds [A Brown, Gill 2352.as [A ", Currie, K.L. 2353.as [A ", and Kenworthy, J. 2354.ds [D 1980 2355.ds [T Questions of intonation 2356.ds [I Croom Helm 2357.ds [C London 2358.nr [T 0 2359.nr [A 1 2360.nr [O 0 2361.][ 2 book 2362.in+2n 2363An intensive study of the prosodics of colloquial, living speech 2364is presented, with particular reference to intonation. 
Although 2365not particularly relevant to speech output from computers, 2366this book gives great insight into how conversational speech 2367differs from reading aloud. 2368.in-2n 2369.\"Fry-1979-1 2370.ds [F 1 2371.]- 2372.ds [A Fry, D.B. 2373.ds [D 1979 2374.ds [T The physics of speech 2375.ds [I Cambridge University Press 2376.ds [C Cambridge, England 2377.nr [T 0 2378.nr [A 1 2379.nr [O 0 2380.][ 2 book 2381.in+2n 2382This is a simple and readable account of speech science, with a good 2383and completely non-mathematical introduction to frequency analysis. 2384.in-2n 2385.\"Ladefoged-1975-4 2386.ds [F 4 2387.]- 2388.ds [A Ladefoged, P. 2389.ds [D 1975 2390.ds [T A course in phonetics 2391.ds [I Harcourt Brace and Johanovich 2392.ds [C New York 2393.nr [T 0 2394.nr [A 1 2395.nr [O 0 2396.][ 2 book 2397.in+2n 2398Usually books entitled "A course on ..." are dreadfully dull, but 2399this is a wonderful exception. An exciting, readable, almost racy 2400introduction to phonetics, full of little experiments you can try 2401yourself. 2402.in-2n 2403.\"Lehiste-1970-5 2404.ds [F 5 2405.]- 2406.ds [A Lehiste, I. 2407.ds [D 1970 2408.ds [T Suprasegmentals 2409.ds [I MIT Press 2410.ds [C Cambridge, Massachusetts 2411.nr [T 0 2412.nr [A 1 2413.nr [O 0 2414.][ 2 book 2415.in+2n 2416This fairly comprehensive study of the prosodics of speech 2417complements Ladefoged's book, which is mainly concerned with segmental 2418phonetics. 2419.in-2n 2420.\"O'Connor-1973-1 2421.ds [F 1 2422.]- 2423.ds [A O'Connor, J.D. 2424.ds [D 1973 2425.ds [T Phonetics 2426.ds [I Penguin 2427.ds [C London 2428.nr [T 0 2429.nr [A 1 2430.nr [O 0 2431.][ 2 book 2432.in+2n 2433This is another introductory book on phonetics. 2434It is packed with information on all aspects of the subject. 2435.in-2n 2436.LE "nn" 2437.EQ 2438delim $$ 2439.EN 2440.CH "3 SPEECH STORAGE" 2441.ds RT "Speech storage 2442.ds CX "Principles of computer speech 2443.pp 2444The most familiar device that produces speech output is the ordinary tape 2445recorder, which stores information in analogue form on magnetic tape. 2446However, this is unsuitable for speech output from computers. 2447One reason is that it is difficult to access different utterances quickly. 2448Although random-access tape recorders do exist, they are expensive and 2449subject to mechanical breakdown because of the stresses associated with 2450frequent starting and stopping. 2451.pp 2452Storing speech on a rotating drum instead of 2453tape offers the possibility of access to any track within one revolution time. 2454For example, the IBM 7770 Audio Response Unit employs drums rotating twice 2455a second which are able to store up to 32 500-msec words. These can be accessed 2456randomly, within half a second at most. 2457Although one can 2458arrange to store longer words by allowing overflow on to an adjacent track at 2459the end of the rotation period, the discrete time-slots provided by this 2460system make it virtually impossible for it to generate connected utterances 2461by assembling appropriate words from the store. 2462.pp 2463The Cognitronics Speechmaker has a similar structure, but with 2464the analogue speech waveform recorded on photographic film. 2465Storing audio waveforms optically is not an unusual technique, for this is how 2466soundtracks are recorded on ordinary movie films. The original version of 2467the "speaking clock" of the British Post Office used optical storage in 2468concentric tracks on flat glass discs. 
2469It is described by Speight and Gill (1937), 2470who include a fascinating account of how the utterances are synchronized. 2471.[ 2472Speight Gill 1937 2473.] 2474A 4\ Hz signal from a pendulum clock was used to supply current to an electric 2475motor, which drove a shaft equipped with cams and gears that rotated 2476the glass discs containing utterances for seconds, minutes, and hours 2477at appropriate speeds! 2478.pp 2479A second reason for avoiding analogue storage is price. It is difficult to see how a random-access 2480tape recorder could be incorporated into a talking pocket calculator or 2481child's toy without considerably inflating the cost. 2482Solid-state electronics is much cheaper than mechanics. 2483.pp 2484But the best reason is that, in many of the applications we have discussed, 2485it is necessary to form utterances by concatenating separately-recorded 2486parts. It is totally infeasible, for example, to store each and every 2487possible telephone number as an individual recording! And 2488utterances that are formed by concatenating individual words which were 2489recorded in isolation, or in a different context, do not sound completely 2490natural. For example, in an early experiment, Stowe and Hampton (1961) recorded 2491individual words on acoustic tape, spliced the tape with the words in a different 2492order to make sentences, and played the result to subjects who were scored on 2493the number of key words which they identified correctly. 2494.[ 2495Stowe Hampton 1961 2496.] 2497The overall conclusion was that while embedding a word in normally-spoken sentences 2498.ul 2499increases 2500the probability of recognition (because the extra context gives clues about the 2501word), embedding a word in a constructed sentence, where intonation and rhythm 2502are not properly rendered, 2503.ul 2504decreases 2505the probability of recognition. When the speech was uttered slowly, 2506however, a considerable improvement was noticed, indicating that if the 2507listener has more processing time he can overcome the lack of proper intonation 2508and rhythm. 2509.pp 2510Nevertheless, many present-day voice response systems 2511.ul 2512do 2513store what amounts to a direct recording of the acoustic wave. 2514However, the storage medium is digital rather than analogue. 2515This means that standard computer storage devices can be used, providing 2516rapid access to any segment of the speech at relatively low cost \(em for 2517the economics of mass-production ensures a low price for random-access 2518digital devices compared with random-access analogue ones. 2519Furthermore, it reduces the amount of special equipment needed for speech 2520output. One can buy very cheap speech input/output interfaces for home computers 2521which connect to standard hobby buses. 2522Another advantage of digital over analogue recording is that 2523integrated circuit read-only memories (ROMs) 2524can be used for hand-held devices which need small quantities of speech. 2525Hence this chapter begins by showing how waveforms are stored digitally, 2526and then describes some techniques for reducing the data needed for a given 2527utterance. 2528.sh "3.1 Storing waveforms digitally" 2529.pp 2530When an analogue signal is converted to digital form, it is made discrete 2531both in time and in amplitude. Discretization in time is the operation of 2532.ul 2533sampling, 2534whilst in amplitude it is 2535.ul 2536quantizing. 
2537It is worth pointing out that the transmission of analogue information by 2538digital means is called "PCM" (standing for "pulse code modulation") in 2539telecommunications jargon. 2540Much of the theory of digital signal processing investigates signals which 2541are sampled but not quantized (or quantized into sufficiently many levels to 2542avoid inaccuracies). The operation of quantization, being non-linear, 2543is not very amenable to theoretical analysis. Quantization introduces issues 2544such as accumulation of round-off noise in arithmetic operations, 2545which, although they are very important in practical implementations, can only 2546be treated theoretically under certain somewhat unrealistic assumptions 2547(in particular, independence of the quantization error from sample to sample). 2548.rh "Sampling." 2549A fundamental theorem of telecommunications states that a signal can only be 2550reconstructed accurately from a sampled version if it does not contain 2551components whose frequency is greater than half the frequency at which the 2552sampling takes place. Figure 3.1(a) shows how a component of slightly greater 2553than half the sampling frequency can masquerade, as far as an observer with 2554access only to the sampled data can tell, as a component at slightly less 2555than half the sampling frequency. 2556.FC "Figure 3.1" 2557Call the sampling interval $T$ seconds, so that the 2558sampling frequency is $1/T$\ Hz. 2559Then components at $1/2T+f$, $3/2T-f$, $3/2T+f$ and so on all masquerade 2560as a component at $1/2T-f$. Similarly, components at frequencies just under 2561the sampling frequency masquerade as very low-frequency components, as shown 2562in Figure 3.1(b). This phenomenon is often called "aliasing". 2563.pp 2564Thus the continuous, infinite, frequency axis for the unsampled signal, where 2565two components at different frequencies can always be distinguished, maps 2566into a repetitive frequency axis when the signal is sampled. As depicted 2567in Figure 3.2, the frequency 2568interval $[1/T,~ 2/T)$ \u\(dg\d 2569.FN 3 2570.sp 2571\u\(dg\dIntervals are specified in brackets, with a square bracket representing 2572a closed end of the interval and a round one representing an open one. 2573Thus the interval $[1/T,~ 2/T)$ specifies the range $1/T ~ <= ~ frequency 2574~ < ~ 2/T$. 2575.EF 2576is mapped back into the band $[0,~ 1/T)$, as are the 2577intervals $[2/T,~ 3/T)$, $[3/T,~ 4/T)$, and so on. 2578.FC "Figure 3.2" 2579Furthermore, the interval $[1/2T,~ 1/T)$ between half the sampling frequency and the sampling 2580frequency, is mapped back into the interval 2581below half the sampling frequency; but this time the mapping is backwards, 2582with frequencies at just under $1/T$ being mapped to frequencies slightly greater 2583than zero, and frequencies just over $1/2T$ being mapped to ones 2584just under $1/2T$. 2585The best way to represent a repeating frequency axis like this is as a circle. 2586Figure 3.3 shows how the linear frequency axis for continuous systems maps 2587on to a circular axis for sampled systems. 
2588.FC "Figure 3.3" 2589For present purposes it is 2590easiest to imagine the bottom half of the circle as being reflected into 2591the top half, so that traversing the upper semicircle in the anticlockwise direction 2592corresponds to frequencies increasing from 0 to $1/2T$ (half the sample frequency), 2593and returning along the lower semicircle is actually the same as coming 2594back round the upper one, and corresponds to frequencies from $1/2T$ to $1/T$ 2595being mapped into the range $1/2T$ to 0. 2596.pp 2597As far as speech is concerned, then, we must ensure that before sampling a 2598signal no significant components at greater than half the sample frequency 2599are present. Furthermore, the sampled signal will only contain information 2600about frequency components less than this, so the sample frequency must be 2601chosen as twice the highest frequency of interest. 2602For example, consider telephone-quality speech. 2603Telephones provide a familiar standard of speech quality which, 2604although it can only be an approximate "standard", 2605will be much used throughout this book. 2606The telephone network 2607aims to transmit only frequencies lower than 3.4\ kHz. We saw in the 2608previous chapter that this region will contain the information-bearing formants, 2609and some \(em but not all \(em of the fricative and aspiration energy. 2610Actually, transmitting speech through the telephone system degrades its 2611quality very significantly, probably more than you realize since everyone is 2612so accustomed to telephone speech. Try the dial-a-disc service and compare 2613it with high-fidelity music for a striking example of the kind of degradation 2614suffered. 2615.pp 2616For telephone speech, the sampling frequency must be chosen to be 2617at least 6.8\ kHz. 2618Since speech contains significant amounts of energy above 3.4\ kHz, it should be 2619filtered before sampling to remove this; otherwise the higher components 2620would be mapped back into the baseband and distort the low-frequency information. 2621Because it is difficult to make filters that cut off very sharply, the 2622sampling frequency is chosen rather greater than twice the highest frequency of 2623interest. For example, the digital telephone network samples at 8\ kHz. 2624The pre-sampling filter should have a cutoff frequency of 4\ kHz; aim for 2625negligible distortion below 3.4\ kHz; and transmit negligible components 2626above 4.6\ kHz \(em for these are reflected back into the band of interest, 2627namely 0 to 3.4\ kHz. Figure 3.4 shows a block diagram for the input hardware. 2628.FC "Figure 3.4" 2629.rh "Quantization." 2630Before considering specifications for the pre-sampling filter, let us turn 2631from discretization in time to discretization in amplitude, that is, 2632quantization. 2633This is performed by an A/D converter (analogue-to-digital), which takes as input 2634a constant analogue voltage (produced by the sampler) and generates a 2635corresponding binary value as output. The simplest correspondence is 2636.ul 2637uniform 2638quantization, where the amplitude range is split into equal regions by points 2639termed "quantization levels", and the output is a binary representation of 2640the nearest quantization level to the input voltage. 2641Typically, 11-bit conversion is used for speech, giving 2048 quantization 2642levels, and the signal is adjusted to have zero mean so that half the 2643levels correspond to negative input voltages and the other half to positive 2644ones. 
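.pp
Both operations are easily imitated in a few lines of code. The sketch below is written
in Python and assumes the numpy library; it is purely illustrative and does not describe
any real converter. It first demonstrates the aliasing effect of Figure 3.1, by sampling
two cosine waves placed symmetrically about half the sampling frequency and finding that
their samples coincide, and then quantizes a nearly full-scale sine wave uniformly to
11 bits and measures the resulting signal-to-noise ratio.
.LB
.nf
# Illustrative sketch only; assumes the numpy library is available.
import numpy as np

fs = 8000.0                        # sampling frequency, Hz
n = np.arange(800)                 # a tenth of a second of samples

# Aliasing: components equally far above and below half the sampling
# frequency yield exactly the same samples, so the upper one masquerades
# as the lower one (compare Figure 3.1).
f = 600.0
upper = np.cos(2 * np.pi * (fs / 2 + f) * n / fs)
lower = np.cos(2 * np.pi * (fs / 2 - f) * n / fs)
print(np.allclose(upper, lower))                  # True

# Uniform 11-bit quantization of a nearly full-scale sine wave, and the
# signal-to-noise ratio that results.
step = 2.0 / 2 ** 11                              # 2048 levels across -1..+1
signal = 0.99 * np.sin(2 * np.pi * 440.0 * n / fs)
quantized = np.round(signal / step) * step
noise = quantized - signal
snr = 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))
print("%.1f dB" % snr)                            # close to 68 dB
.fi
.LE
The measured figure can be compared with the estimates derived in the next few
paragraphs.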
2645.pp 2646It is, at first sight, surprising that as many as 11 bits are needed for 2647adequate representation of speech signals. Research on the digital telephone 2648network, for example, has concluded that a signal-to-noise ratio of 2649some 26\-27\ dB is enough to avoid undue harshness of quality, loss 2650of intelligibility, and listener fatigue for speech at a comfortable 2651level in an otherwise reasonably good channel. 2652Rabiner and Schafer (1978) suggest that about 36\ dB signal-to-noise ratio 2653would "most likely provide adequate quality in a communications system". 2654.[ 2655Rabiner Schafer 1978 Digital processing of speech signals 2656.] 2657But 11-bit quantization seems to give a very much better signal-to-noise 2658ratio than these figures. To estimate its magnitude, note that for N-bit quantization 2659the error for each sample will lie between 2660.LB 2661$ 2662- ~ 1 over 2 ~. 2 sup -N$ and $+ ~ 1 over 2 ~. 2 sup -N . 2663$ 2664.LE 2665Assuming that it is uniformly distributed in this range \(em an assumption 2666which is likely to be justified if the number of levels is sufficiently 2667large \(em leads to a mean-squared error of 2668.LB 2669.EQ 2670integral from {-2 sup -N-1} to {2 sup -N-1} ~e sup 2 p(e) de, 2671.EN 2672.LE 2673where $p(e)$, the probability density function of the error $e$, is a constant 2674which satisfies the usual probability normalization constraint, namely 2675.LB 2676.EQ 2677integral from {-2 sup -N-1} to {2 sup -N-1} ~ p(e) de ~~=~ 1. 2678.EN 2679.LE 2680Hence $p(e)=2 sup N $, and so the mean-squared error is $2 sup -2N /12$. 2681This is $10 ~ log sub 10 (2 sup -2N /12)$\ dB, or around \-77\ dB for 11-bit 2682quantization. 2683.pp 2684This noise level is relative to the maximum amplitude range of the conversion. 2685A maximum-amplitude sine wave has a power of \-9\ dB relative to the same 2686reference, giving a signal-to-noise ratio of some 68\ dB. This is far in excess 2687of that needed for telephone-quality speech. However, look at the very peaky 2688nature of the typical speech waveform given in Figure 3.5. 2689.FC "Figure 3.5" 2690If clipping is to be avoided, the maximum amplitude level of the A/D converter 2691must be set at a value which makes the power of the speech signal very much 2692less than a maximum-amplitude sine wave. Furthermore, different people 2693speak at very different volumes, and the overall level fluctuates constantly 2694with just one speaker. Experience shows that while 8- or 9-bit quantization 2695may provide sufficient signal-to-noise ratio to preserve telephone-quality 2696speech if the overall speaker levels are carefully controlled, about 11 bits 2697are generally required to provide high-quality representation of speech with 2698a uniform quantization. With 11 bits, a sine wave whose amplitude is only 1/32 2699of the full-scale value would be digitized with a signal-to-noise ratio 2700of around 36\ dB, the most pessimistic figure quoted above for adequate quality. 2701Even then it is useful if the speaker is provided 2702with an indication of the amplitude of his speech: a traffic-light 2703indicator with red signifying clipping overload, orange a suitable level, 2704and green too low a value, is often convenient for this. 2705.rh "Logarithmic quantization." 2706For the purposes of speech 2707.ul 2708processing, 2709it is essential to have the signal quantized uniformly. 
This is because 2710all of the theory applies to linear systems, and nonlinearities introduce 2711complexities which are not amenable to analysis. 2712Uniform quantization, although a nonlinear operation, is linear in the 2713limiting case as the number of levels becomes large, and for most purposes 2714its effect can be modelled by assuming that the quantized signal is obtained 2715from the original analogue one by the addition of a small amount of 2716uniformly-distributed quantizing noise, as in fact was done above. 2717Usually the quantization noise is disregarded in subsequent analysis. 2718.pp 2719However, the peakiness of the speech signal illustrated in Figure 3.5 leads 2720one to suspect that a non-linear representation, for example a logarithmic one, 2721could provide a better signal-to-noise ratio over a wider range of input 2722amplitudes, and hence be more useful than linear quantization \(em at least 2723for speech storage (and transmission). 2724And indeed this is the case. Linear quantization has the unfortunate effect 2725that the absolute noise level is independent of the signal level, so that an excessive 2726number of bits must be used if a reasonable ratio is to be achieved for peaky 2727signals. It can be shown that a logarithmic representation like 2728.LB 2729.EQ 2730y ~ = ~ 1 ~ + ~ k ~ log ~ x, 2731.EN 2732.LE 2733where $x$ is the original signal and $y$ is the value which is to be quantized, 2734gives a 2735signal-to-noise 2736.ul 2737ratio 2738which is independent of the input signal level. 2739This relationship cannot be realized physically, for it is undefined when the signal 2740is negative and diverges when it is zero. 2741However, realizable approximations to it can be made which retain the advantages 2742of constant signal-to-noise ratio within a useful range of signal amplitudes. 2743Figure 3.6 shows the logarithmic relation with one widely-used approximation to it, 2744called the A-law. 2745.FC "Figure 3.6" 2746The idea of non-linearly quantizing a signal to achieve adequate signal-to-noise 2747ratios for a wide variety of amplitudes is called "companding", a contraction 2748of "compressing-expanding". The original signal can be retrieved from 2749its A-law compression by antilogarithmic expansion. 2750.pp 2751Figure 3.6 also 2752shows one common coding scheme which is a piecewise linear approximation 2753to the A-law. This provides an 8-bit code, and gives the equivalent 2754of 12-bit linear quantization for small signal levels. It approximates 2755the A-law in 16 linear segments, 8 for positive and 8 for negative 2756inputs. 2757Consider the positive part of the curve. The first two segments, which 2758are actually collinear, correspond exactly to 12-bit linear conversion. 2759Thus the output codes 0 to 31 correspond to inputs from 0 to 31/2048, 2760in equal steps. (Remember that both positive and negative signals 2761must be converted, so a 12-bit linear converter will allocate 2048 levels 2762for positive signals and 2048 for negative ones.) The next 2763segment provides 11-bit linear quantization, 2764output codes 32 to 47 corresponding to inputs from 16/1024 to 31/1024. 2765Similarly, the next segment corresponds to 10-bit quantization, covering 2766inputs from 16/512 to 31/512. And so on, the last section giving 6-bit 2767quantization of inputs from 16/32 to 31/32, the full-scale positive value. 2768Negative inputs are converted similarly. 
2769For signal levels of less than 32/2048, that is, $2 sup -8$, this implementation 2770of the A-law provides full 12-bit precision. 2771As the signal level increases, the precision decreases gradually to 6 bits 2772at maximum amplitudes. 2773.pp 2774Logarithmic encoding provides what is in effect a floating-point representation 2775of the input. The conventional floating-point format, however, is not used 2776because many different codes can represent the same value. For example, with 2777a 4-bit exponent preceding a 4-bit mantissa, the words 0000:1000, 27780001:0100, 0010:0010, and 0011:0001 represent the numbers 2779$0.1 ~ times ~ 2 sup 0$, $0.01 ~ times ~ 2 sup 1 2780$, $0.001 ~ times ~ 2 sup 2$, \c 2781and $0.0001 ~ times ~ 2 sup 3$ respectively, 2782which are the same. (Some floating-point conventions assume that an unwritten 2783"1" bit precedes the mantissa, except when the whole word is zero; but this 2784gives decreased resolution around zero \(em which is exactly where we want the 2785resolution to be greatest.) Table 3.1 shows the 8-bit A-law codes, 2786.RF 2787.in+0.7i 2788.ta 1.6i +\w'bits 1-3 'u 27898-bit codeword: bit 0 sign bit 2790 bits 1-3 3-bit exponent 2791 bits 4-7 4-bit mantissa 2792.sp2 2793.ta 1.6i 3.5i 2794.ul 2795 codeword interpretation 2796.sp 27970000 0000 \h'\w'\0-\0 + 'u'$.0000 ~ times ~ 2 sup -7$ 2798\0\0\0... \0\0\0\0... 27990000 1111 \h'\w'\0-\0 + 'u'$.1111 ~ times ~ 2 sup -7$ 28000001 0000 $2 sup -7 ~~ + ~~ .0000 ~ times ~ 2 sup -7$ 2801\0\0\0... \0\0\0\0... 28020001 1111 $2 sup -7 ~~ + ~~ .1111 ~ times ~ 2 sup -7$ 28030010 0000 $2 sup -6 ~~ + ~~ .0000 ~ times ~ 2 sup -6$ 2804\0\0\0... \0\0\0\0... 28050010 1111 $2 sup -6 ~~ + ~~ .1111 ~ times ~ 2 sup -6$ 28060011 0000 $2 sup -5 ~~ + ~~ .0000 ~ times ~ 2 sup -5$ 2807\0\0\0... \0\0\0\0... 28080011 1111 $2 sup -5 ~~ + ~~ .1111 ~ times ~ 2 sup -5$ 28090100 0000 $2 sup -4 ~~ + ~~ .0000 ~ times ~ 2 sup -4$ 2810\0\0\0... \0\0\0\0... 28110100 1111 $2 sup -4 ~~ + ~~ .1111 ~ times ~ 2 sup -4$ 28120101 0000 $2 sup -3 ~~ + ~~ .0000 ~ times ~ 2 sup -3$ 2813\0\0\0... \0\0\0\0... 28140101 1111 $2 sup -3 ~~ + ~~ .1111 ~ times ~ 2 sup -3$ 28150110 0000 $2 sup -2 ~~ + ~~ .0000 ~ times ~ 2 sup -2$ 2816\0\0\0... \0\0\0\0... 28170110 1111 $2 sup -2 ~~ + ~~ .1111 ~ times ~ 2 sup -2$ 28180111 0000 $2 sup -1 ~~ + ~~ .0000 ~ times ~ 2 sup -1$ 2819\0\0\0... \0\0\0\0... 28200111 1111 $2 sup -1 ~~ + ~~ .1111 ~ times ~ 2 sup -1$ 2821 28221000 0000 \h'\w'\0-\0 'u'$- ~~ .0000 ~ times ~ 2 sup -7$ negative numbers treated as 2823\0\0\0... \0\0\0\0... above, with a sign bit of 1 28241111 1111 \h'-\w'\- 'u'\- $2 sup -1 ~~ - ~~ .1111 ~ times ~ 2 sup -1$ 2825.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 2826.in 0 2827.FG "Table 3.1 8-bit A-law codes, with their floating-point equivalents" 2828according 2829to the piecewise linear approximation of Figure 3.6, written in a notation which 2830suggests floating point. Each linear segment has a different exponent except 2831the first two segments, which as explained above are collinear. 2832.pp 2833Logarithmic encoders and decoders are available from many semiconductor 2834manufacturers as single-chip devices 2835called "codecs" (for "coder/decoder"). Intended for use on digital communication 2836links, these generally provide a serial output bit-stream, which 2837should be converted to parallel by a shift register if the data is intended 2838for a computer. 
2839Because of the potentially vast market for codecs in telecommunications, 2840they are made in great quantities and are consequently very cheap. 2841Estimates of the speech quality necessary for telephone applications indicate 2842that somewhat less than this accuracy is needed \(em 7-bit logarithmic encoding 2843was used in early digital communications links, and it may be that even 6 bits 2844are adequate. However, during the transition period when digital 2845networks must coexist with the present analogue one, it is anticipated that 2846a particular telephone call may have to pass through several links, some 2847using analogue technology and some being digital. The possibility of 2848several successive encodings and decodings has led telecommunications 2849engineers to standardize on 8-bit representations, leaving some margin 2850before additional degradation of signal quality becomes unduly distracting. 2851.pp 2852Unfortunately, world telecommunications authorities cannot agree on a single 2853standard for logarithmic encoding. The A-law, which we have described, 2854is the European standard, but there is another system, called 2855the $mu$-law, which is used universally in North America. It also is available 2856in single-chip form with an 8-bit code. It has very similar 2857quantization error characteristics to the A-law, and would be indistinguishable 2858from it on the scale of Figure 3.6. 2859.rh "The pre-sampling filter." 2860Now that we have some idea of the accuracy requirements for quantization, 2861let us discuss quantitative specifications for the pre-sampling filter. 2862Figure 3.7 sketches the characteristics of this filter. 2863.FC "Figure 3.7" 2864Assume a 2865sampling frequency of 8\ kHz and a range of interest from 0 to 3.4\ kHz. 2866Although all components at frequencies above 4\ kHz will fold back into 2867the 0\ \-\ 4\ kHz baseband, those below 4.6\ kHz fold back above 3.4\ kHz and are 2868therefore outside the range of interest. This gives a "guard band" between 28693.4 and 4.6\ kHz which separates the passband from the stopband. The filter 2870should transmit negligible components in the stopband above 4.6\ kHz. 2871To reduce the harmonic distortion caused by aliasing to the same level 2872as the quantization noise in 11-bit linear conversion, the stopband 2873attenuation should be around \-68\ dB (the signal-to-noise ratio for a full-scale 2874sine wave). Passband ripple is not so critical, 2875for two reasons. Whilst the presence of aliased components means that 2876information has been lost about the frequency components within the range of 2877interest, passband ripple does not actually cause a loss of information but 2878only a distortion, and could, if necessary, be compensated by a suitable 2879filter acting on the digitized waveform. Secondly, distortion of the 2880passband spectrum is not nearly so audible as the frequency images caused 2881by aliasing. Hence one usually aims for a passband ripple of around 0.5\ dB. 2882.pp 2883The pass and stopband targets we have mentioned above can be achieved with 2884a 9'th order elliptic filter. While such a filter is often used in 2885high-quality signal-processing systems, for telephone-quality speech 2886much less stringent specifications seem to be sufficient. Figure 3.8, for 2887example, shows a template which has been recommended by telecommunications 2888authorities. 2889.FC "Figure 3.8" 2890A 5'th order elliptic filter can easily meet this specification. 
2891Such filters, implemented by switched-capacitor means, are available in 2892single-chip form. Integrated CCD (charge-coupled device) 2893filters which meet the same specification 2894are also marketed. Indeed, some codecs provide input filtering on the same 2895chip as the A/D converter. 2896.pp 2897Instead of implementing a filter by analogue means to meet the aliasing 2898specifications, digital filtering can be used. A high sample-rate A/D 2899converter, operating at, say, 32\ kHz, and preceded by a very simple low-pass 2900pre-sampling filter, is followed by a digital filter which meets the 2901desired specification, and its output is subsampled to provide an 8\ kHz sample 2902rate. While such implementations may be economic where a multichannel digitizing 2903capability is required, as in local telephone exchanges where the subscriber 2904connection is an analogue one, they are unlikely to prove cost-effective for 2905a single channel. 2906.rh "Reconstructing the analogue waveform." 2907Once a signal has been digitized and stored, it needs to be passed through a D/A 2908converter (digital-to-analogue) and low-pass filter when replayed. 2909D/A converters are cheaper than A/D converters, and the characteristics of the 2910low-pass filter for output can be the same as those for input. 2911However, the desampling operation introduces an additional distortion, which 2912has an effect on the component at frequency $f$ of 2913.LB 2914.EQ 2915{ sin ( pi f/f sub s )} over { pi f/f sub s } ~ , 2916.EN 2917.LE 2918where $f sub s$ is the sampling frequency. An "aperture correction" filter is 2919needed to compensate for this, although many systems simply do without it. 2920Such a filter is sometimes incorporated into the codec chip. 2921.rh "Summary." 2922For telephone-quality speech, existing codec chips, 2923coupled if necessary with integrated pre-sampling filters, can 2924be used, at a remarkably low cost. 2925For higher-quality speech storage the analogue interface can become quite complex. 2926A comprehensive study of the problems as they relate to digitization of audio, 2927which demands much greater fidelity than speech, has been made by Blesser (1978). 2928.[ 2929Blesser 1978 2930.] 2931He notes the following sources of error (amongst others): 2932.LB 2933.NP 2934slew-rate distortion in the pre-sampling filter for signals at the upper end 2935of the audio band; 2936.NP 2937insufficient filtering of high-frequency input signals; 2938.NP 2939noise generated by the sample-and-hold amplifier or pre-sampling filter; 2940.NP 2941acquisition errors because of the finite settling time of the sample-and-hold 2942circuit; 2943.NP 2944insufficient settling time in the A/D conversion; 2945.NP 2946errors in the quantization levels of the A/D and D/A converters; 2947.NP 2948noise in the converters; 2949.NP 2950jitter on the clock used for timing input or output samples; 2951.NP 2952aperture distortion in the output sampler; 2953.NP 2954noise in the output filter as a result of limited dynamic range of the 2955integrated circuits; 2956.NP 2957power-supply noise injection or ground coupling; 2958.NP 2959changes in characteristics as a result of temperature or ageing. 2960.LE 2961Care must be taken with the analogue interface to ensure that the precision 2962implied by the resolution of the A/D and D/A converters is not compromised 2963by inadequate analogue circuitry. It is especially important to eliminate 2964high-frequency noise caused by fast edges on nearby computer buses.
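.pp
Before leaving the analogue interface, it may help to see the segment coding of
Table 3.1 written out procedurally, since this is essentially what a codec chip does.
The sketch below is in Python and assumes samples normalized to the range -1 to +1;
it follows the sign bit, 3-bit exponent and 4-bit mantissa layout described earlier,
and ignores details such as exact rounding and the bit inversions used on real
transmission links, in which practical codecs differ.
.LB
.nf
def alaw_encode(x):
    # Map a sample in -1.0 ... +1.0 to an 8-bit codeword laid out as in
    # Table 3.1: sign bit, 3-bit exponent (segment), 4-bit mantissa.
    sign = 0 if x >= 0 else 1
    mag = min(abs(x), 1.0 - 2.0 ** -12)          # stay just inside full scale
    if mag < 2.0 ** -7:                          # the two collinear segments:
        exponent = 0                             # full 12-bit precision near zero
        mantissa = int(mag * 2 ** 11)
    else:
        exponent = 1                             # find e with 2**(e-8) <= mag < 2**(e-7)
        while mag >= 2.0 ** (exponent - 7):
            exponent += 1
        mantissa = int((mag - 2.0 ** (exponent - 8)) * 2 ** (12 - exponent))
    return (sign << 7) | (exponent << 4) | mantissa

def alaw_decode(code):
    # Inverse mapping, following the interpretations listed in Table 3.1
    # (the bottom edge of each quantization interval is returned).
    sign, exponent, mantissa = code >> 7, (code >> 4) & 7, code & 15
    if exponent == 0:
        mag = mantissa * 2.0 ** -11
    else:
        mag = 2.0 ** (exponent - 8) + mantissa * 2.0 ** (exponent - 12)
    return -mag if sign else mag
.fi
.LE
Encoding and then decoding a few samples of different sizes shows the step size growing
with the signal level, which is the companding effect described above.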
2965.sh "3.2 Coding in the time domain" 2966.pp 2967There are several methods of coding the time waveform of a speech signal to 2968reduce the data rate for a given signal-to-noise ratio, or alternatively to 2969reduce the signal-to-noise ratio for a given data rate. They almost all require 2970more processing, both at the encoding (for storage) and decoding (for 2971regeneration) ends of the digitization process. They are sometimes used to 2972economize on memory in systems using stored speech, 2973for example the System\ X telephone exchange and the travel consultant described 2974in Chapter 1, and so will be described here. However, it is to be expected 2975that simple time-domain coding techniques will be superseded by the more complex 2976linear predictive method, which is covered in Chapter 6, because this 2977can give a much more substantial reduction in the data rate for only a small 2978degradation in speech quality. Hence the aim of this section is to introduce 2979the ideas in a qualitative way: theoretical development and summaries of 2980results of listening tests can be found elsewhere (eg Rabiner and Schafer, 1978). 2981.[ 2982Rabiner Schafer 1978 Digital processing of speech signals 2983.] 2984The methods we will examine are summarized in Table 3.2. 2985.RF 2986.nr x0 \w'linear PCM 'u 2987.nr x1 \n(x0+\w' adaptive quantization, or adaptive prediction,'u 2988.nr x2 (\n(.l-\n(x1)/2 2989.in \n(x2u 2990.ta \n(x0u 2991\l'\n(x1u\(ul' 2992.sp 2993linear PCM linearly-quantized pulse code modulation 2994.sp 2995log PCM logarithmically-quantized pulse code modulation 2996 (instantaneous companding) 2997.sp 2998APCM adaptively quantized pulse code modulation 2999 (usually syllabic companding) 3000.sp 3001DPCM differential pulse code modulation 3002.sp 3003ADPCM differential pulse code modulation with either 3004 adaptive quantization, or adaptive prediction, 3005 or both 3006.sp 3007DM delta modulation (1-bit DPCM) 3008.sp 3009ADM delta modulation with adaptive quantization 3010\l'\n(x1u\(ul' 3011.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 3012.in 0 3013.FG "Table 3.2 Time-domain encoding techniques" 3014.rh "Syllabic companding." 3015We have already studied one time-domain encoding technique, namely logarithmic 3016quantization, or log PCM (sometimes called "instantaneous companding"). A more 3017sophisticated encoder could track slowly varying trends in the overall amplitude 3018of the speech signal and use this information to adjust the quantization 3019levels dynamically. Speech coding methods based on this principle are called 3020adaptive pulse code modulation systems (APCM). Because the overall amplitude 3021changes slowly, it is sufficient to adjust the quantization relatively infrequently 3022(compared with the sampling rate), and this is often done at rates approximating 3023the syllable rate of running speech, leading to the term "syllabic companding". 3024A block floating-point format can be used, with a common exponent being 3025stored every M samples (with M, say, 125 for a 100\ msec block rate at 8\ kHz 3026sampling), but the mantissa being stored at the regular sample rate. The overall 3027energy in the block, 3028.LB 3029$sum from n=h to h+M-1 ~x(n) sup 2$ ($M = 125$, say), 3030.LE 3031is used to determine a suitable exponent, and every sample 3032in the block \(em namely 3033$x(h)$, $x(h+1)$, ..., $x(h+M-1)$ \(em is scaled according to that exponent. 
3034Note that for speech transmission systems this method necessitates a delay of 3035$M$ samples at the encoder, and indeed some methods base the exponent on the 3036energy in the last block to avoid this. For speech storage, however, the delay 3037is irrelevant. A rather different, nonsyllabic, method of adaptive PCM is 3038continually to change the step size of a uniform quantizer, by multiplying it by 3039a constant at each sample which is based on the magnitude of the previous code 3040word. 3041.pp 3042Adaptive quantization exploits information about the amplitude of the signal, 3043and, as a rough generalization, yields a reduction of one bit per sample 3044in the data rate for telephone-quality speech over ordinary logarithmic 3045quantization, for a given signal-to-noise ratio. Alternatively, for the 3046same data rate an improvement of 6\ dB in signal-to-noise ratio can be obtained. 3047Some results for actual schemes are given by Rabiner and Schafer (1978). 3048.[ 3049Rabiner Schafer 1978 Digital processing of speech signals 3050.] 3051However, there is other information in the time waveform of speech, namely, the 3052sample-to-sample correlation, which can be exploited to give further reductions. 3053.rh "Differential coding." 3054Differential pulse code modulation (DPCM), in its simplest form, uses the 3055present speech sample as a prediction of the next one, 3056and stores the prediction error \(em that is, the sample-to-sample difference. 3057This is a simple case of predictive encoding. 3058Referring back to the speech waveform displayed in Figure 3.5, 3059it seems plausible that the data rate can be reduced by transmitting the difference 3060between successive samples instead of their absolute values: less bits are 3061required for the difference signal for a given overall accuracy because it 3062does not assume such extreme values as the absolute signal level. 3063Actually, the improvement is not all that great \(em about 4\ \-\ 5\ dB in 3064signal-to-noise ratio, or just under one bit per sample for a given 3065signal-to-noise ratio \(em for the difference signal can be nearly as large as 3066the absolute signal level. 3067.pp 3068If DPCM is used in conjunction with adaptive quantization, giving one form of 3069adaptive differential pulse code modulation (ADPCM), both the overall amplitude 3070variation and the sample-to-sample correlation are exploited, leading to a 3071combined gain of 10\ \-\ 11\ dB in signal-to-noise ratio (or just under two bits 3072reduction per sample for telephone-quality speech). Another form of adaptation 3073is to alter the predictor by multiplying the previous sample value by a 3074parameter which is adjusted for best performance. 3075Then the transmitted signal at time $n$ is 3076.LB 3077.EQ 3078e(n) ~~ = ~~ x(n)~ - ~ax(n-1), 3079.EN 3080.LE 3081where the parameter $a$ is adapted (and stored) on a syllabic time-scale. This 3082leads to a slight improvement in signal-to-noise ratio, which can be combined 3083with that achieved by adaptive quantization. Much more substantial benefits 3084can be realized by using a weighted sum of the past several (up to 15) speech 3085samples, and adapting all the weights. This is the basic idea of linear 3086prediction, which is developed in Chapter 6. 3087.rh "Delta modulation." 3088The coding methods presented so far all increase the complexity of the 3089analogue-to-digital interface (or, if the sampled waveform is coded 3090digitally, they increase the processing required before and after storage). 
3091One method which considerably 3092.ul 3093simplifies 3094the interface is the limiting case 3095of DPCM with just 1-bit quantization. Only the sign of the difference between 3096the current and last values is transmitted. Figure 3.9 shows the conversion 3097hardware. 3098.FC "Figure 3.9" 3099The encoding part is essentially the same as a tracking D/A, 3100where the value in a counter is forced to track the analogue input by 3101incrementing or decrementing the counter according as the input exceeds or 3102falls short of the analogue equivalent of the counter's contents. However, 3103for this encoding scheme, called "delta modulation", the increment-decrement 3104signal itself forms the discrete representation of the waveform, instead of the counter's 3105contents. The analogue waveform can be reconstituted from the bit stream with 3106another counter and D/A converter. Alternatively, an all-analogue implementation 3107can be used, both for the encoder and decoder, with a capacitor as integrator 3108whose charging current is controlled digitally. This is a much cheaper realization. 3109.pp 3110It is fairly obvious that the sampling frequency for delta modulation will need 3111to be considerably higher than for straightforward PCM. Figure 3.10 shows 3112an effect called "slope overload" which occurs when the sampling rate is too low. 3113.FC "Figure 3.10" 3114Either a higher sample rate or a larger step size will reduce the overload; 3115however, larger steps increase the noise level of the alternate 1's and \-1's 3116that occur when no input is present \(em called "granular noise". A compromise 3117is necessary between slope overload and granular noise for a given bit rate. 3118Delta modulation results in lower data rates than logarithmic quantization 3119for a given signal-to-noise ratio if that ratio is low (poor-quality speech). 3120As the desired speech quality is increased its data rate grows faster than 3121that of logarithmic PCM. The crossover point occurs at a quality much lower than 3122telephone quality, and so although delta modulation is used for some 3123applications where the permissible data rate is severely constrained, 3124it is not really suitable for speech output from computers. 3125.pp 3126It is profitable to adjust the step size, leading to 3127.ul 3128adaptive 3129delta modulation. 3130A common strategy is to increase or decrease the step size by a multiplicative 3131constant, which depends on whether the new transmitted bit will be equal to 3132or different from the last one. That is, 3133.LB "nnnn" 3134.NI "nn" 3135$stepsize(n+1) = stepsize(n) times 2$ if $x(n+1)<x(n)<x(n-1)$ 3136or $x(n+1)>x(n)>x(n-1)$ 3137.br 3138(slope overload condition); 3139.NI "nn" 3140$stepsize(n+1) = stepsize(n)/2$ if $x(n+1),~x(n-1)<x(n)$ 3141or $x(n+1),~x(n-1)>x(n)$ 3142.br 3143(granular noise condition). 3144.LE "nnnn" 3145Despite these adaptive equations, the step size should be constrained to 3146lie between a predetermined fixed maximum and minimum, to prevent it from 3147becoming so large or so small that rapid accommodation to changing input signals is 3148impossible. 3149Then, in a period of potential slope overload the step size will grow, preventing 3150overload, possibly to its maximum value when overload may resume. In a quiet 3151period it will decrease to its minimum value which determines the granular 3152noise in the idle condition. Note that the step size need not be stored, for 3153it can be deduced from the bit changes in the digitized data.
Although 3154adaptation improves the performance of delta modulation, it is still inferior to 3155PCM at telephone qualities. 3156.rh "Summary." 3157It seems that ADPCM, with 3158adaptive quantization and adaptive prediction, can provide a worthwhile 3159advantage for speech storage, reducing the number of bits needed per sample of 3160telephone-quality speech from 7 for logarithmic PCM to perhaps 5, and the data 3161rate from 56\ Kbit/s to 40\ Kbit/s. Disadvantages are additional complexity 3162in the encoding and decoding processes, and the fact that byte-oriented storage, 3163with 8 bits/sample in logarithmic PCM, is more convenient for computer use. 3164For low-quality speech where hardware complexity is to be minimized, 3165adaptive delta modulation could prove worthwhile \(em although the ready 3166availability of PCM codec chips reduces the cost advantage. 3167.sh "3.3 References" 3168.LB "nnnn" 3169.[ 3170$LIST$ 3171.] 3172.LE "nnnn" 3173.sh "3.4 Further reading" 3174.pp 3175Probably the best single reference on time-domain coding of speech is 3176the book by Rabiner and Schafer (1978), cited above. 3177However, this does not contain a great deal of information on practical 3178aspects of the analogue-to-digital conversion process; this is 3179covered by Blesser (1978) above, who is especially interested in 3180high-quality conversion for digital audio applications, 3181and Garrett (1978) below. 3182There are many textbooks in the telecommunications area which 3183are relevant to the subject of the chapter, 3184although they concentrate primarily on fundamental theoretical aspects rather 3185than the practical application of the technology. 3186.LB "nn" 3187.\"Cattermole-1969-1 3188.]- 3189.ds [A Cattermole, K.W. 3190.ds [D 1969 3191.ds [T Principles of pulse code modulation 3192.ds [I Iliffe 3193.ds [C London 3194.nr [T 0 3195.nr [A 1 3196.nr [O 0 3197.][ 2 book 3198.in+2n 3199This is a standard, definitive, work on PCM, and provides a good grounding 3200in the theory. 3201It goes into the subject in much more depth than we have been able to here. 3202.in-2n 3203.\"Garrett-1978-1 3204.]- 3205.ds [A Garrett, P.H. 3206.ds [D 1978 3207.ds [T Analog systems for microprocessors and minicomputers 3208.ds [I Reston Publishing Company 3209.ds [C Reston, Virginia 3210.nr [T 0 3211.nr [A 1 3212.nr [O 0 3213.][ 2 book 3214.in+2n 3215Garrett discusses the technology of data conversion systems, including 3216A/D and D/A converters and basic analogue filter design, in a 3217clear and practical manner. 3218.in-2n 3219.\"Inose-1979-2 3220.]- 3221.ds [A Inose, H. 3222.ds [D 1979 3223.ds [T An introduction to digital integrated communications systems 3224.ds [I Peter Peregrinus 3225.ds [C Stevenage, England 3226.nr [T 0 3227.nr [A 1 3228.nr [O 0 3229.][ 2 book 3230.in+2n 3231Inose's book is a recent one which covers the whole area of digital 3232transmission and switching technology. 3233It gives a good idea of what is happening to the telephone networks 3234in the era of digital communications. 3235.in-2n 3236.\"Steele-1975-3 3237.]- 3238.ds [A Steele, R. 3239.ds [D 1975 3240.ds [T Delta modulation systems 3241.ds [I Pentech Press 3242.ds [C London 3243.nr [T 0 3244.nr [A 1 3245.nr [O 0 3246.][ 2 book 3247.in+2n 3248Again a standard work, this time on delta modulation techniques. 3249Steele gives an excellent and exhaustive treatment of the subject from a 3250communications viewpoint.
3251.in-2n 3252.LE "nn" 3253.EQ 3254delim $$ 3255.EN 3256.CH "4 SPEECH ANALYSIS" 3257.ds RT "Speech analysis 3258.ds CX "Principles of computer speech 3259.pp 3260Digital recordings of speech provide a jumping-off point for 3261further processing of the audio waveform, which is usually necessary for 3262the purpose of speech output. 3263It is difficult to synthesize natural sounds by concatenating 3264individually-spoken words. 3265Pitch is perhaps the most perceptually significant contextual effect 3266which must be 3267taken into account when forming connected speech out of isolated words. 3268The intonation of an utterance, which manifests itself as a 3269continually changing pitch, is a holistic property of the utterance 3270and not the sum of components determined by the individual words alone. 3271Happily, and quite coincidentally, communications engineers in their quest 3272for reduced-bandwidth telephony have invented methods of coding speech that 3273separate the pitch information from that carried by the articulation. 3274.pp 3275Although these analysis techniques, which were first introduced in the late 32761930's (Dudley, 1939), were originally implemented by analogue means \(em and 3277in many systems still are (Blankenship, 1978, describes a recent 3278switched-capacitor realization) \(em there is a continuing trend 3279towards digital implementations, particularly for the more sophisticated coding 3280schemes. 3281.[ 3282Dudley 1939 3283.] 3284.[ 3285Blankenship 1978 3286.] 3287It is hard to see how the technique of linear prediction of speech, 3288which is described in detail in Chapter 6, could be accomplished in the 3289absence of digital processing. 3290Some groundwork is laid for the theory of digital signal analysis in this 3291chapter. 3292The ideas are not presented in a formal, axiomatic way; but are developed as 3293and when they are needed to examine some of the structures that turn out to be 3294useful in speech processing. 3295.pp 3296Most speech analysis views speech according to the source-filter model which 3297was introduced in Chapter 2, and aims to separate the effects of the source from 3298those of the filter. The frequency spectrum of the vocal tract filter is of 3299great interest, and the technique of discrete Fourier transformation is 3300discussed in this chapter. For many purposes it is better to extract the formant 3301frequencies from the spectrum and use these alone (or in conjunction with their 3302bandwidths) to characterize it. As far as the signal source in the source-filter 3303model is concerned, its most interesting features are pitch and amplitude \(em the 3304latter being easy to estimate. Hence we go on to look at pitch extraction. 3305Related to this is the problem of deciding whether a segment of speech has 3306voiced or unvoiced excitation, or both. 3307.pp 3308Estimating formant and pitch parameters is one of the messiest areas of 3309speech processing. There is a delightful paper which points this out 3310(Schroeder, 1970), entitled "Parameter estimation in speech: a lesson in unorthodoxy". 3311.[ 3312Schroeder 1970 3313.] 3314It emphasizes that the most successful estimation procedures "have often relied 3315on intuition based on knowledge of speech signals and their production in the 3316human vocal apparatus rather than routine applications of well-established 3317theoretical methods". 
3318Fortunately, the emphasis of the present book is on speech 3319.ul 3320output, 3321which involves parameter estimation only in so far as it is needed to produce 3322coded speech for storage, and to illuminate the acoustic nature of speech 3323for the development of synthesis by rule from phonetics or text. 3324Hence the many methods of formant and pitch estimation are treated rather 3325cursorily and qualitatively here: our main interest is in how to 3326.ul 3327use 3328such information for speech output. 3329.pp 3330If the incoming speech can be analysed into its formant frequencies, amplitude, 3331excitation mode, and pitch (if voiced), it is quite easy to resynthesize 3332it directly from these parameters. Speech synthesizers are described in the 3333next chapter. They can be realized in either analogue or digital 3334hardware, the former being predominant in production systems and the latter 3335in research systems \(em although, as in other areas of electronics, the balance 3336is changing in favour of digital implementations. 3337.sh "4.1 The channel vocoder" 3338.pp 3339A direct representation of the frequency spectrum of a signal can be obtained 3340by a bank of bandpass filters. This is the basis of 3341the 3342.ul 3343channel vocoder, 3344which was the first device that attempted to take advantage of the source-filter 3345model for speech coding (Dudley, 1939). 3346.[ 3347Dudley 1939 3348.] 3349The word "vocoder" is a contraction 3350of 3351.ul 3352vo\c 3353ice 3354.ul 3355coder. 3356The energy in each filter band is 3357estimated by rectification and smoothing, and the resulting approximation to 3358the frequency spectrum is transmitted or stored. The source properties are 3359represented by the type of excitation (voiced or unvoiced), and if voiced, 3360the pitch. It is not necessary to include the overall amplitude of the speech 3361explicitly, because this is conveyed by the energy levels from the separate 3362bandpass filters. 3363.pp 3364Figure 4.1 shows the encoding part of a channel vocoder which has been used 3365successfully for many years (Holmes, 1980). 3366.[ 3367Holmes 1980 JSRU channel vocoder 3368.] 3369.FC "Figure 4.1" 3370We will discuss the block labelled "pre-emphasis" shortly. 3371The shape of the spectrum is estimated by 19 bandpass filters, whose spacing 3372and bandwidth decrease slightly with decreasing frequency to obtain the rather 3373greater resolution that is needed in the lower frequency region, 3374as shown in Table 4.1. 
3375.RF 3376.nr x0 4n+2.6i+\w'\0\0'u+(\w'bandwidth'/2) 3377.nr x1 (\n(.l-\n(x0)/2 3378.in \n(x1u 3379.ta 4n +1.3i +1.3i 3380\l'\n(x0u\(ul' 3381.sp 3382.nr x1 (\w'channel'/2) 3383.nr x2 (\w'centre'/2) 3384.nr x3 (\w'analysis'/2) 3385 \0\h'-\n(x1u'channel \0\h'-\n(x2u'centre \0\0\h'-\n(x3u'analysis 3386.nr x1 (\w'number'/2) 3387.nr x2 (\w'frequency'/2) 3388.nr x3 (\w'bandwidth'/2) 3389 \0\h'-\n(x1u'number \0\0\h'-\n(x2u'frequency \0\0\h'-\n(x3u'bandwidth 3390.nr x2 (\w'(Hz)'/2) 3391 \0\h'-\n(x2u'(Hz) \0\0\h'-\n(x2u'(Hz) 3392\l'\n(x0u\(ul' 3393.sp 3394 \01 \0240 \0120 3395 \02 \0360 \0120 3396 \03 \0480 \0120 3397 \04 \0600 \0120 3398 \05 \0720 \0120 3399 \06 \0840 \0120 3400 \07 1000 \0150 3401 \08 1150 \0150 3402 \09 1300 \0150 3403 10 1450 \0150 3404 11 1600 \0150 3405 12 1800 \0200 3406 13 2000 \0200 3407 14 2200 \0200 3408 15 2400 \0200 3409 16 2700 \0200 3410 17 3000 \0300 3411 18 3300 \0300 3412 19 3750 \0500 3413\l'\n(x0u\(ul' 3414.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 3415.in 0 3416.FG "Table 4.1 Filter specifications for a vocoder analyser (after Holmes, 1980)" 3417.[ 3418Holmes 1980 JSRU channel vocoder 3419.] 3420The 3\ dB points 3421of adjacent filters are halfway between their centre frequencies, so that there 3422is some overlap between bands. 3423The filter characteristics do not need to have very sharp edges, because the energy 3424in neighbouring bands is fairly highly correlated. Indeed, there is a 3425disadvantage in making them too sharp, because the phase delays associated 3426with sharp cutoff filters induce "smearing" of the spectrum in the time domain. 3427This particular channel vocoder uses second-order Butterworth bandpass filters. 3428.pp 3429For regenerating speech stored in this way, an excitation of unit impulses 3430at the specified pitch period (for voiced sounds) or white noise (for unvoiced 3431sounds) is produced and passed through a bank of bandpass filters similar 3432to the analysis ones. The excitation has a flat spectrum, for regular impulses 3433have harmonics at multiples of the repetition frequency which are all of the 3434same size, and so the spectrum of the output signal is completely determined 3435by the filter bank. The gain of each filter is controlled by the stored 3436magnitude of the spectrum at that frequency. 3437.pp 3438The frequency spectrum and voicing pitch of speech change at much slower rates 3439than the time waveform. The changes are due to movements of the articulatory 3440organs (tongue, lips, etc) in the speaker, and so are limited in their speed 3441by physical constraints. A typical rate of production of phonemes is 15 per 3442second, but in fact the spectrum can change quite a lot within a single 3443phoneme (especially a stop sound). 3444Between 10 and 25\ msec (100\ Hz and 40\ Hz) 3445is generally thought to be a satisfactory interval for transmitting or storing 3446the spectrum, to preserve a reasonably faithful representation of the speech. 3447Of course, the entire spectrum, as well as the source characteristics, must 3448be stored at this rate. 3449The channel vocoder described by Holmes (1980) uses 48 bits to encode 3450the information. 3451.[ 3452Holmes 1980 JSRU channel vocoder 3453.] 3454Repeated every 20\ msec, this gives a data rate of 2400\ bit/s \(em very 3455considerably less than any of the time-domain encoding techniques. 3456.pp 3457It needs some care to encode the output of 19 filters, the excitation type, 3458and the pitch into 48 bits of information. 
Holmes uses 6 bits for pitch, 3459logarithmically encoded, 3460and one bit for excitation type. 3461This leaves 41 bits to encode the output of the 19 filters, and so a differential 3462technique is used which transmits just the difference between adjacent 3463channels \(em for the spectrum does not change abruptly in the frequency domain. 3464Three bits are used for the absolute level in channel 1, and two bits 3465for each channel-to-channel difference, giving a total of 39 bits for the whole 3466spectrum. The remaining two bits per frame are reserved for signalling or 3467monitoring purposes. 3468.pp 3469A 2400 bit/s channel vocoder degrades the speech in a telephone channel quite 3470perceptibly. It is sufficient for interactive communication, where 3471if you do not understand something you can always ask for it to be repeated. 3472It is probably not good enough for most voice response applications. 3473However, the vocoder principle can be used with larger filter banks and much 3474higher bit rates, and still reduce the data rate substantially below that 3475required by log PCM. 3476.sh "4.2 Pre-emphasis" 3477.pp 3478There is an 3479overall \-6\ dB/octave trend in speech radiated from the lips, 3480as frequency increases. 3481We will discuss why this is so in the next chapter. 3482Notice that this trend means that the signal power is reduced 3483by a factor of 4, or the signal amplitude by a factor of 2, for each 3484doubling in frequency. 3485For vocoders, and indeed for other methods of spectral analysis of speech, 3486it is usually desirable to equalize this by a +6\ dB/octave lift prior to 3487processing, so that the channel outputs occupy a similar range of levels. 3488On regeneration, the output speech is passed through an inverse filter which 3489provides 6\ dB/octave of attenuation. 3490.pp 3491For a digital system, such pre-emphasis 3492can either be implemented as an analogue circuit which precedes the presampling 3493filter and digitizer, or as a digital operation on the sampled and quantized 3494signal. In the former case, the characteristic is usually flat up to a certain 3495breakpoint, which occurs somewhere between 100\ Hz and 1\ kHz \(em the exact 3496position does not seem to be critical \(em at which point the +6\ dB/octave lift 3497begins. Although de-emphasis on output ought to have an exactly inverse 3498characteristic, it is sometimes modified or even eliminated altogether in an 3499attempt to counteract approximately 3500the $sin( pi f/f sub s )/( pi f/f sub s )$ distortion 3501introduced by the desampling operation, which was discussed in an earlier 3502section. Above half the sampling frequency, the characteristic of the 3503pre-emphasis is irrelevant because any effect will be suppressed by the presampling 3504filter. 3505.pp 3506The effect of a 6\ dB/octave lift can also be achieved digitally, by differencing 3507the input. The operation 3508.LB 3509.EQ 3510y(n)~~ = ~~ x(n)~ -~ ax(n-1) 3511.EN 3512.LE 3513is suitable, where the constant parameter $a$ is usually chosen between 0.9 and 1. 3514The latter value gives straightforward differencing, and this amounts to 3515creating a DPCM signal as input to the spectral analysis. Figure 4.2 plots 3516the frequency response of this operation, with a sample frequency of 8\ kHz, 3517for two values of the parameter; together with that of a 6\ dB/octave lift 3518above 100\ Hz. 3519.FC "Figure 4.2" 3520The vertical positions of the plots have been adjusted to give 3521the same gain, 20\ dB, at 1\ kHz.
3522The difference at 3.4\ kHz, the upper end of the telephone spectrum, is just 3523over 2\ dB. At frequencies below the breakpoint, in this case 100\ Hz, the 3524difference between analogue and digital pre-emphasis can be very great. For 3525$a=0.9$ the attenuation at DC (zero frequency) is 18\ dB below that at 1\ kHz, 3526which happens to be close to that of the analogue filter for frequencies below the 3527breakpoint. However, if the breakpoint had been at 1\ kHz there would have been 352820\ dB difference between the analogue and $a=0.9$ plots at DC. And of course 3529the $a=1$ characteristic has infinite attenuation at DC. 3530In practice, however, the exact form of the pre-emphasis does not seem to be at all 3531critical. 3532.pp 3533The above remarks apply only to voiced speech. For unvoiced speech there appears 3534to be no real need for pre-emphasis; indeed, it may do harm by reinforcing 3535the already large high-frequency components. There is a case for altering the 3536parameter $a$ according to the excitation mode of the speech: $a=1$ for voiced 3537excitation and $a=0$ for unvoiced gives pre-emphasis just when it is needed. 3538This can be achieved by expressing the parameter in terms of the autocorrelation 3539of the incoming signal, as 3540.LB 3541.EQ 3542a ~~ = ~~ R(1) over R(0) ~ , 3543.EN 3544.LE 3545where $R(1)$ is the correlation of the signal with itself delayed by one sample, 3546and $R(0)$ is the correlation without delay (that is, the signal variance). 3547This is reasonable intuitively because high sample-to-sample correlation 3548is to be expected in voiced speech, so that $R(1)$ is very nearly as great as 3549$R(0)$ and the ratio becomes 1; whereas little or no sample-to-sample correlation 3550will be present in unvoiced speech, making the ratio close to 0. Such a 3551scheme is reminiscent of ADPCM with adaptive prediction. 3552.pp 3553However, this sophisticated pre-emphasis method does not seem to be worthwhile 3554in practice. Usually the breakpoint in an analogue pre-emphasis filter is 3555chosen to be rather greater than 100\ Hz to limit the amplification of fricative 3556energy. In fact, the channel vocoder described by Holmes (1980) has the 3557breakpoint at 1\ kHz, limiting the gain to 12\ dB at 4\ kHz, two octaves above. 3558.[ 3559Holmes 1980 JSRU channel vocoder 3560.] 3561.sh "4.3 Digital signal analysis" 3562.pp 3563You may be wondering how the frequency response for the digital pre-emphasis 3564filters, displayed in Figure 4.2, can be calculated. Suppose a digitized 3565sinusoid is applied as input to the filter 3566.LB 3567.EQ 3568y(n) ~~ = ~~ x(n)~ - ~ax(n-1). 3569.EN 3570.LE 3571A sine wave of frequency $f$ has equation $x(t) ~ = ~ sin ~ 2 pi ft$, and when 3572sampled at $t=0,~ T,~ 2T,~ ...$ (where $T$ is the sampling interval, 125\ $mu$sec for 3573an 8\ kHz sample rate), this becomes $x(n) ~ = ~ sin ~ 2 pi fnT.$ It is much 3574more convenient to consider a complex exponential 3575input, $e sup { j2 pi fnT}$ \(em the response to a sinusoid can then be derived 3576by taking imaginary parts, if necessary. The output for this input is 3577.LB 3578.EQ 3579y(n) ~~ = ~~ e sup {j2 pi fnT} ~~-~ae sup {j2 pi f(n-1)T} ~~ = ~~ 3580(1~-~ae sup {-j2 pi fT} )~e sup {j2 pi fnT} , 3581.EN 3582.LE 3583a sinusoid at the same frequency as the input. The 3584factor $1~-~ae sup {-j2 pi fT}$ is complex, with both amplitude and phase 3585components. Thus the output will be a phase-shifted and amplified version 3586of the input.
The amplitude response at frequency $f$ is therefore 3587.LB 3588.EQ 3589|1~ - ~ ae sup {-j2 pi fT} | ~~ = ~~ 3590[1~ +~ a sup 2 ~-~ 2a~cos~2 pi fT ] sup 1/2 , 3591.EN 3592.LE 3593or 3594.LB 3595.EQ 359610 ~ log sub 10 (1~ +~ a sup 2 ~ - ~ 2a~ cos 2 pi fT) 3597.EN 3598dB. 3599.LE 3600Normalizing to 20\ dB at 1\ kHz, and assuming 8\ kHz sampling, yields 3601.LB 3602.EQ 360320~ + ~~ 10~ log sub 10 (1~ +~ a sup 2 ~-~ 2a~ cos ~ { pi f} over 4000 ) 3604~~ -~ 10~ log sub 10 (1~ +~ a sup 2 ~-~ 2a~ cos ~ pi over 4 ) 3605.EN 3606dB. 3607.LE 3608With $a=0.9$ and 1 this gives the graphs of Figure 4.2. 3609.pp 3610Frequency responses for analogue filters are often plotted with a logarithmic 3611frequency scale, as well as a logarithmic amplitude one, to bring out the 3612asymptotes in dB/octave as straight lines. For digital filters the response 3613is usually drawn on a 3614.ul 3615linear 3616frequency axis extending to half the sampling frequency. The response is 3617symmetric about this point. 3618.pp 3619Analyses like the above are usually expressed in terms of the $z$-transform. 3620Denote the unit delay operation by $z sup -1$. The choice of the inverse rather 3621than $z$ itself is of course an arbitrary matter, but the convention has stuck. 3622Then the filter can be characterized 3623by Figure 4.3, which signifies that the output is the input minus a delayed 3624and scaled version of itself. 3625.FC "Figure 4.3" 3626The transfer function of the filter is 3627.LB 3628.EQ 3629H(z) ~~ = ~~ 1~ -~ az sup -1 , 3630.EN 3631.LE 3632and we have seen that the effect of the system on a (complex) exponential of 3633frequency $f$ is to multiply it by 3634.LB 3635.EQ 36361~ -~ ae sup {-j2 pi fT}. 3637.EN 3638.LE 3639To get the frequency response from the transfer function, replace $z sup -1$ 3640by $e sup {-j2 pi fT}$. Amplitude and phase responses can then be found by 3641taking the modulus and angle of the complex frequency response. 3642.pp 3643If $z sup -1$ is treated as an 3644.ul 3645operator, 3646it is quite in order to summarize the action of the filter by 3647.LB 3648.EQ 3649y(n) ~~ = ~~ x(n)~ - ~az sup -1 x(n) ~~ = ~~ (1~ -~ az sup -1 )x(n). 3650.EN 3651.LE 3652However, it is usual to derive from the sequence $x(n)$ a 3653.ul 3654transform 3655$X(z)$ upon which $z sup -1$ acts as a 3656.ul 3657multiplier. 3658If the transform of $x(n)$ is defined as 3659.LB 3660.EQ 3661X(z) ~~ = ~~ sum from {n=- infinity} to infinity ~x(n) z sup -n , 3662.EN 3663.LE 3664then on multiplication by $z sup -1$ we get a new transform, say $V(z)$: 3665.LB 3666.EQ 3667V(z) ~~ = ~~ z sup -1 X(z) ~~ = 3668~~ z sup -1 sum from {n=- infinity} to infinity ~x(n) z sup -n ~~ = 3669~~ sum ~x(n)z sup -n-1 ~~ = 3670~~ sum ~x(n-1)z sup -n . 3671.EN 3672.LE 3673$V(z)$ can also be expressed as the transform of a new sequence, say $v(n)$, by 3674.LB 3675.EQ 3676V(z) ~~ = ~~ sum from {n=- infinity} to infinity ~v(n) z sup -n , 3677.EN 3678.LE 3679from which it becomes apparent that 3680.LB 3681.EQ 3682v(n) ~~ = ~~ x(n-1). 3683.EN 3684.LE 3685Thus $v(n)$ is a delayed version of $x(n)$, and we have accomplished what we 3686set out to do, namely to show that the delay 3687.ul 3688operator 3689$z sup -1$ can be treated as an ordinary 3690.ul 3691multiplier 3692in the $z$-transform domain, where $z$-transforms are defined as the infinite 3693sums given above. 
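.pp
As a quick numerical check on these results, the frequency response can be evaluated by
placing $z$ on the unit circle, exactly as described. The short Python sketch below (the
8\ kHz sample rate and the values 0.9 and 1 for $a$ are those used in this chapter)
confirms that the gain in dB obtained from $H(z)$ at $z ~ = ~ e sup {j2 pi fT}$ agrees
with the closed-form expression for the amplitude response derived above.
.LB
.nf
import numpy as np

fs = 8000.0                        # sample rate assumed in the text
T = 1.0 / fs

def H(z, a):
    return 1.0 - a / z             # transfer function H(z) = 1 - a*z**-1

for a in (0.9, 1.0):
    for f in (100.0, 1000.0, 3400.0):
        z = np.exp(2j * np.pi * f * T)     # point on the unit circle for frequency f
        from_H = 20 * np.log10(abs(H(z, a)))
        closed = 10 * np.log10(1 + a**2 - 2 * a * np.cos(2 * np.pi * f * T))
        print(a, f, from_H, closed)        # the last two columns agree
.fi
.LE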
3694.pp 3695In terms of $z$-transforms, the filter can be written 3696.LB 3697.EQ 3698Y(z) ~~ = ~~ (1~ -~ az sup -1 )X(z), 3699.EN 3700.LE 3701where $z sup -1$ is now treated as a multiplier. 3702The transfer function of the filter is 3703.LB 3704.EQ 3705H(z) ~~ = ~~ Y(z) over X(z) ~~ = ~~ 1 - az sup -1 , 3706.EN 3707.LE 3708the ratio of the output to the input transform. 3709.pp 3710It may seem that little has been gained by inventing this rather abstract 3711notion of transform, simply to change an operator to a multiplier. After 3712all, the equation of the filter is no simpler in the transform domain than 3713it was in the time domain using $z sup -1$ as an operator. However, we will 3714need to go on to examine more complex filters. Consider, for example, the 3715transfer function 3716.LB 3717.EQ 3718H(z) ~~ = ~~ {1~+~az sup -1 ~+~bz sup -2} over {1~+~cz sup -1 ~+~dz sup -2} ~ . 3719.EN 3720.LE 3721If $z sup -1$ is treated as an operator, it is not immediately obvious how 3722this transfer function can be realized by a time-domain recurrence relation. 3723However, with $z sup -1$ as an ordinary multiplier in the transform domain, we can 3724make purely mechanical manipulations with infinite sums to see what the transfer 3725function means as a recurrence relation. 3726.pp 3727It is worth noting the similarity between the $z$-transform in the discrete 3728domain and the Fourier and Laplace transforms in the continuous domains. 3729In fact, the $z$-transform plays an analogous role in digital signal processing 3730to the Laplace transform in continuous theory, for the delay operator 3731$z sup -1$ 3732performs a similar service to the differentiation operator $s$. 3733Recall first the continuous Fourier transform, 3734.LB 3735$ 3736G(f) ~~ = ~~ 3737integral from {- infinity} to infinity ~g(t)~e sup {-j2 pi ft} dt 3738$, where $f$ is real, 3739.LE 3740and the Laplace transform, 3741.LB 3742$ 3743F(s) ~~ = ~~ 3744integral from 0 to infinity ~f(t)~e sup -st dt 3745$, where $s$ is complex. 3746.LE 3747The main difference between these two transforms is that the range of integration 3748begins at -$infinity$ for the Fourier transform and at 0 for the Laplace. 3749Advocates of the Fourier transform, which typically include people involved with 3750telecommunications, enjoy the freedom from initial conditions which is bestowed 3751by an origin way back in the mists of time. Advocates of Laplace, including 3752most analogue filter theorists, invariably 3753consider systems where all is quiet before $t=0$ \(em altering the origin 3754of measurement of time to achieve this if necessary \(em and welcome the opportunity 3755to include initial conditions explicitly 3756.ul 3757without 3758having to worry about what happens in the mists of time. 3759Although there is a two-sided Laplace transform where the integration begins 3760at -$infinity$, it is not generally used because it causes some convergence 3761complications. Ignoring this difference between the transforms (by considering 3762signals which are zero when $t<0$), the Fourier spectrum can be found from the 3763Laplace transform by writing $s=j2 pi f$; that is, by considering values 3764of $s$ which lie on the imaginary axis. 3765.pp 3766The $z$-transform is 3767.LB 3768$ 3769H(z) ~~ = ~~ sum from n=0 to infinity ~h(n)~z sup -n 3770$, or $ 3771H(z) ~~ = ~~ sum from {n=- infinity} to infinity ~h(n)~z sup -n , 3772$ 3773.LE 3774depending on whether a one-sided or two-sided transform is used. 
The advantages 3775and disadvantages of one- and two-sided transforms are the same as in the 3776analogue case. 3777$z$ plays the role of $e sup sT $, and so it is not surprising that the response 3778to a (sampled) sinusoid input can be found by setting 3779.LB 3780.EQ 3781z ~~ = ~~ e sup {j2 pi fT} 3782.EN 3783.LE 3784in $H(z)$, as we proved explicitly above for the pre-emphasis filter. 3785.pp 3786The above relation between $z$ and $f$ means that real-valued frequencies correspond 3787to points where $|z|=1$, that is, the unit circle in the complex $z$-plane. 3788As you travel anticlockwise around this unit circle, starting from the 3789point $z=1$, the corresponding frequency increases from 0, to $1/2T$ half-way 3790round ($z=-1$), to $1/T$ when you get back to the beginning ($z=1$) again. 3791Frequencies greater than the sampling frequency are aliased back into the 3792sampling band, corresponding to further circuits of $|z|=1$ with frequency 3793going from $1/T$ to $2/T$, $2/T$ to $3/T$, and so on. In fact, this is the circle 3794of Figure 3.3 which was used earlier to explain how sampling affects the frequency 3795spectrum! 3796.sh "4.4 Discrete Fourier transform" 3797.pp 3798Let us return from this brief digression into techniques of digital signal 3799analysis to the problem of determining the frequency spectrum of speech. 3800Although a bank of bandpass filters such as is used in the channel vocoder 3801is the perhaps most straightforward way to obtain a frequency spectrum, 3802there are other techniques which are in fact more commonly used in digital speech 3803processing. 3804.pp 3805It is possible to define the Fourier transform of a discrete sequence of 3806points. To motivate the definition, consider first the 3807ordinary Fourier transform (FT), which is 3808.LB 3809$ 3810g(t) ~~ = ~~ 3811integral from {- infinity} to infinity ~G(f)~e sup {+j2 pi ft} df 3812~~~~~~~~~~~~~~~~ 3813G(f) ~~ = ~~ 3814integral from {- infinity} to infinity ~g(t)~e sup {-j2 pi ft} dt . 3815$ 3816.LE 3817This takes a continuous time domain into a continuous frequency domain. 3818Sometimes you see a normalizing factor $1/2 pi$ multiplying the integral in 3819either the forward or the reverse transform. This is only needed 3820when the frequency variable is expressed in radians/s, and we will find it 3821more convenient to express frequencies in\ Hz. 3822.pp 3823The Fourier series (FS), which should also be familiar to you, 3824operates on a periodic time waveform (or, equivalently, 3825one that only exists for a finite period of time, which is notionally extended 3826periodically). If a period lies in the time range $[0,b)$, then the transform is 3827.LB 3828$ 3829g(t) ~~ = ~~ 3830sum from {r = - infinity} to infinity ~G(r)~e sup {+j2 pi rt/b} 3831~~~~~~~~~~~~~~~~ 3832G(r) ~~ = ~~ 1 over b ~ integral from 0 to b ~g(t)~e sup {-j2 pi rt/b} dt . 3833$ 3834.LE 3835The Fourier series takes a periodic time-domain function into a discrete frequency-domain one. 3836Because of the basic duality between the time and frequency domains in the 3837Fourier transforms, it is not surprising that another version of the transform 3838can be defined which takes a periodic 3839.ul 3840frequency\c 3841-domain function into a 3842discrete 3843.ul 3844time\c 3845-domain one. 
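.pp
The Fourier series pair quoted above is easy to verify numerically, and doing so gives a
feel for the duality just mentioned. In the Python sketch below (the 100\ Hz fundamental
and the particular waveform are arbitrary choices for illustration) the integral for
$G(r)$ is approximated by an average over closely spaced points within one period, and
the waveform is then rebuilt from a handful of coefficients.
.LB
.nf
import numpy as np

b = 0.01                                     # one period: 10 msec, fundamental 100 Hz
t = np.linspace(0.0, b, 2000, endpoint=False)
g = np.sin(2 * np.pi * 100 * t) + 0.5 * np.cos(2 * np.pi * 300 * t)

def G(r):
    # (1/b) * integral of g(t) exp(-j 2 pi r t / b) over one period,
    # approximated by the mean over the sample points
    return np.mean(g * np.exp(-2j * np.pi * r * t / b))

rebuilt = sum(G(r) * np.exp(2j * np.pi * r * t / b) for r in range(-5, 6))
print(np.max(np.abs(g - rebuilt.real)))      # negligible: only harmonics 1 and 3 are present
.fi
.LE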
3846.pp 3847Fourier transforms can only deal with a finite stretch of a time signal 3848by assuming that the signal is periodic, for if $g(t)$ is evaluated from 3849its transform $G(r)$ according to the formula above, and $t$ is chosen outside 3850the interval $[0,b)$, then a periodic extension of the function $g(t)$ is obtained 3851automatically. 3852Furthermore, periodicity in one domain implies discreteness in the other. 3853Hence if we transform a 3854.ul 3855finite 3856stretch of a 3857.ul 3858discrete 3859time waveform, 3860we get a frequency-domain representation which is also finite (or, equivalently, 3861periodic), and discrete. 3862This is the discrete Fourier transform (DFT), 3863and takes a discrete periodic time-domain function into a discrete 3864periodic frequency-domain one as illustrated in Figure 4.4. 3865.FC "Figure 4.4" 3866It is defined by 3867.LB 3868$ 3869g(n) ~~ = ~~ 38701 over N ~ sum from r=0 to N-1~G(r)~e sup { + j2 pi rn/N} 3871~~~~~~~~~~~~~~~~ 3872G(r) ~~ = ~~ sum from n=0 to N-1 ~g(n)~e sup { - j2 pi rn/N} , 3873$ 3874.LE 3875or, writing $W=e sup {-j2 pi /N}$, 3876.LB 3877$ 3878g(n) ~~ = ~~ 38791 over N ~ sum from r=0 to N-1~G(r)~W sup -rn 3880~~~~~~~~~~~~~~~~ 3881G(r) ~~ = ~~ sum from n=0 to N-1 ~g(n)~W sup rn . 3882$ 3883.LE 3884.sp 3885The $1/N$ in the first equation is the same normalizing 3886factor as the $1/b$ in the Fourier series, 3887for the finite time domain is $[0,N)$ 3888in the discrete case and $[0,b)$ in the Fourier series case. 3889It does not matter 3890whether it is written into the forward or the reverse transform, but it is usually 3891placed as shown above as a matter of convention. 3892.pp 3893As illustrated by Figure 4.5, discrete Fourier transforms 3894take an input of $N$ real values, representing equally-spaced time samples 3895in the interval $[0,b)$, and produce as output $N$ complex values, representing 3896equally-spaced frequency samples in the interval $[0,N/b)$. 3897.FC "Figure 4.5" 3898Note that the end-point of this frequency interval is the sampling frequency. 3899It seems odd that the input is real and the output is the same number of 3900.ul 3901complex 3902quantities: we seem to be getting some numbers for nothing! 3903However, this isn't so, for it is easy to show that if the input sequence is 3904real, the output frequency 3905spectrum has a symmetry about its mid-point (half the sampling frequency). 3906This can be expressed as 3907.LB 3908DFT symmetry:\0\0\0\0\0\0 $ 3909~ mark G( half N +r) ~=~ G( half N -r) sup *$ if $g$ is real-valued, 3910.LE 3911where $*$ denotes the conjugate of a complex quantity 3912(that is, $(a+jb) sup * = a-jb$). 3913.pp 3914It was argued above that the frequency spectrum in the DFT is periodic, with 3915the spectrum from 0 to the sampling frequency being repeated regularly up and 3916down the frequency axis. It can easily be seen from the DFT equation that 3917this is so. It can be written 3918.LB 3919DFT periodicity:$ lineup G(N+r) ~=~ G(r)$ always. 3920.LE 3921Figure 4.6 illustrates the properties of symmetry and periodicity. 3922.FC "Figure 4.6" 3923.sh "4.5 Estimating the frequency spectrum of speech using the DFT" 3924.pp 3925Speech signals are not exactly periodic. Although the waveform in a particular 3926pitch period will usually resemble those in the preceding and following pitch 3927periods, it will certainly not be identical to them. 3928As the articulation of the speech changes, the formant positions will alter. 
3929As we saw in Chapter 2, the pitch itself is certainly not constant. 3930Hence the fundamental assumption of the DFT, that the waveform is periodic, 3931is not really justified. However, the signal is quasi-periodic, for changes 3932from period to period will not usually be very great. One way of computing 3933the short-term frequency spectrum of speech is to use 3934.ul 3935pitch-synchronous 3936Fourier transformation, where single pitch periods are isolated from the 3937waveform and processed with the DFT. This gives a rather accurate estimate 3938of the spectrum. Unfortunately, it is difficult to determine the beginning 3939and end of each pitch cycle, as we shall see later in this chapter when 3940discussing pitch extraction techniques. 3941.pp 3942If a finite stretch of a speech waveform is isolated and Fourier transformed, 3943without regard to pitch of the speech, then the periodicity assumption will 3944be grossly violated. Figure 4.7 illustrates that the effect is the same 3945as 3946multiplying the signal by a rectangular 3947.ul 3948window function, 3949which is 0 except during the period to be analysed, where it is 1. 3950.FC "Figure 4.7" 3951The windowed sequence will almost certainly have discontinuities at its edges, 3952and these will affect the resulting spectrum. The effect can be analysed 3953quite easily, but we will not do so here. It is enough to say that the 3954high frequencies associated with the edges of the window cause considerable 3955distortion of the spectrum. The effect can be alleviated by 3956using a smoother window than a rectangular one, 3957and several have been investigated extensively. The commonly-used windows of 3958Bartlett, Blackman, and Hamming are illustrated in Figure 4.8. 3959.FC "Figure 4.8" 3960.pp 3961Because the DFT produces the same number of frequency samples, equally spaced, 3962as there were points in the time waveform, there is a tradeoff between 3963frequency resolution and time resolution (for a given sampling rate). 3964For example, a 256-point transform with a sample rate of 8\ kHz gives the 256 3965equally-spaced frequency components between 0 and 8\ kHz that are shown in Table 39664.2. 3967.RF 3968.nr x0 (\w'time domain'/2) 3969.nr x1 (\w'frequency domain'/2) 3970.in+1.0i 3971.ta 1.0i 3.0i 4.0i 3972\h'0.5i+2n-\n(x0u'time domain\h'|3.5i+2n-\n(x1u'frequency domain 3973.sp 3974sample time sample \h'-3n'frequency 3975number number 3976.nr x0 1i+\w'00000' 3977\l'\n(x0u\(ul' \l'\n(x0u\(ul' 3978.sp 3979\0\0\00 \0\0\0\00 $mu$sec \0\0\00 \0\0\0\00 Hz 3980\0\0\01 \0\0125 \0\0\01 \0\0\031 3981\0\0\02 \0\0250 \0\0\02 \0\0\062 3982\0\0\03 \0\0375 \0\0\03 \0\0\094 3983\0\0\04 \0\0500 \0\0\04 \0\0125 3984.nr x2 (\w'...'/2) 3985\h'0.5i+4n-\n(x2u'...\h'|3.5i+4n-\n(x2u'... 3986\h'0.5i+4n-\n(x2u'...\h'|3.5i+4n-\n(x2u'... 3987\h'0.5i+4n-\n(x2u'...\h'|3.5i+4n-\n(x2u'... 3988.sp 3989\0254 31750 \0254 \07938 3990\0255 31875 $mu$sec \0255 \07969 Hz 3991\l'\n(x0u\(ul' \l'\n(x0u\(ul' 3992.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 3993.in 0 3994.MT 2 3995Table 4.2 Time domain and frequency domain samples for a 256-point DFT, 3996with 8\ kHz sampling 3997.TE 3998The top half of the frequency spectrum is of no interest, because 3999it contains the complex conjugates of the bottom half (in reverse order), 4000corresponding to frequencies greater than half the sampling frequency. 4001Thus for a 30\ Hz resolution in the frequency domain, 4002256 time samples, or a 32\ msec stretch of speech, needs to be transformed. 
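.pp
A minimal sketch of the arithmetic behind Table 4.2 is given below (Python,
with numpy assumed). A synthetic 500\ Hz sinusoid stands in for a 32\ msec
stretch of speech, a Hamming window of the kind shown in Figure 4.8 is applied
to soften the frame edges, and only the lower half of the 256 frequency samples
is examined, the upper half being redundant for a real input.
.LB
.nf
import numpy as np
fs = 8000.0                               # sampling frequency in Hz
N = 256                                   # transform size (a 32 msec frame)
t = np.arange(N) / fs
frame = np.sin(2 * np.pi * 500.0 * t)     # synthetic stand-in for speech
windowed = frame * np.hamming(N)          # soften the frame edges
spectrum = np.fft.fft(windowed)
spacing = fs / N                          # 31.25 Hz between frequency samples
magnitudes = np.abs(spectrum[:N // 2 + 1])    # upper half is redundant
print(np.argmax(magnitudes) * spacing)        # prints 500.0, the sinusoid's bin
.fi
.LE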
A common technique is to take overlapping periods in the time domain to
give a new frequency spectrum every 16\ msec.  From the acoustic point
of view this is a reasonable rate to re-compute the spectrum, for as noted
above when discussing channel vocoders the rate of change in the spectrum
is limited by the speed at which the speaker can move his vocal organs, and
anything between 10 and 25\ msec is a reasonable figure for transmitting
or storing the spectrum.
.pp
The DFT is a complex transform, and speech is a real signal.  It is possible
to do two DFT's at once by putting one time waveform into the real parts
of the input and another into the imaginary parts.  This destroys the DFT
symmetry property, for it only holds for real inputs.  But given the DFT
of a complex sequence formed in this way, it is easy to separate out the
DFT's of the two real time sequences.  If the two time sequences are
$x(n)$ and $y(n)$, then the transform of the complex sequence
.LB
.EQ
g(n) ~~ = ~~ x(n) ~+~ jy(n)
.EN
.LE
is
.LB
.EQ
G(r) ~~ = ~~ sum from n=0 to N-1 ~[x(n)W sup rn ~+~ jy(n)W sup rn ] .
.EN
.LE
It follows that the complex conjugate of the aliased parts of the spectrum,
in the upper frequency region, are
.LB
.EQ
G(N-r) sup * ~~ = ~~ sum from n=0 to N-1 ~[x(n)W sup -(N-r)n
~-~ jy(n)W sup -(N-r)n ] ,
.EN
.LE
and this is the same as
.LB
.EQ
G(N-r) sup * ~~ = ~~ sum from n=0 to N-1 ~[x(n)W sup rn
~-~ jy(n)W sup rn ] ,
.EN
.LE
because $W sup N$ is 1 (recall the definition of $W$),
and so $W sup -Nn$ is 1 for any $n$.
Thus
.LB
.EQ
X(r) ~~ = ~~ {G(r) ~+~ G(N-r) sup * } over 2
~~~~~~~~~~~~~~~~
Y(r) ~~ = ~~ {G(r) ~-~ G(N-r) sup * } over {2j}
.EN
.LE
extracts the transforms $X(r)$ and $Y(r)$ of the original sequences
$x$ and $y$.
.pp
With speech, this trick is frequently used to calculate two spectra at once.
Using 256-point transforms, a new estimate of the spectrum can be obtained
every 16\ msec by taking overlapping 32\ msec stretches of speech, with a
computational requirement of one 256-point transform every 32\ msec.
.sh "4.6 The fast Fourier transform"
.pp
Straightforward calculation of the DFT, expressed as
.LB
.EQ
G(r) ~~ = ~~ sum from n=0 to N-1 ~g(n)~W sup nr ,
.EN
.LE
for $r=0,~ 1,~ 2,~ ...,~ N-1$, takes $N sup 2$ operations, where each operation
is a complex multiply and add (for $W$ is, of course, a complex number).
There is a better way, invented in the early sixties, which reduces this to
$N ~ log sub 2 N$ operations \(em a very considerable improvement.
Dubbed the "fast Fourier transform" (FFT) for historical reasons, it would actually
be better called the "Fourier transform", with the straightforward method above
known as the "slow Fourier transform"!  There
is no reason nowadays to use the slow method, except for tiny transforms.
It is worth describing the basic principle of the FFT, for it is surprisingly
simple.  More details on actual implementations can be found in Brigham (1974).
.[
Brigham 1974
.]
.pp
It is important to realize that the FFT involves no approximation.
It is an
.ul
exact
calculation of the values that would be obtained by the slow method
(although it may be affected differently by round-off errors).
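.pp
The packing and unpacking steps for transforming two real frames at once can
be sketched very briefly (Python with numpy assumed; random vectors stand in
for the two 32\ msec stretches of speech, and the library FFT routine plays
the role of the DFT).
.LB
.nf
import numpy as np
N = 256
x = np.random.randn(N)               # first real frame
y = np.random.randn(N)               # second real frame
G = np.fft.fft(x + 1j * y)           # one complex transform instead of two
# Form G(N-r)* with the convention G(N) = G(0), then unpack as above.
G_rev = np.conj(np.roll(G[::-1], 1))
X = (G + G_rev) / 2
Y = (G - G_rev) / (2j)
print(np.allclose(X, np.fft.fft(x)), np.allclose(Y, np.fft.fft(y)))   # True True
.fi
.LE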
4089Problems of aliasing and windowing occur in all discrete Fourier transforms, 4090and they are neither alleviated nor exacerbated by the FFT. 4091.pp 4092To gain insight into the working of the FFT, imagine the sequence $g(n)$ split 4093into two halves, containing the even and odd points 4094respectively. 4095.LB 4096even half $e(n)$ is $g(0)~ g(2)~ .~ .~ .~ g(N-2)$ 4097.br 4098odd half $o(n)$ is $g(1)~ g(3)~ .~ .~ .~ g(N-1)$. 4099.LE 4100Then it is easy to show that if $G$ is the transform of $g$, 4101$E$ the transform of $e$, 4102and $O$ that of $o$, then 4103.LB 4104$ 4105G(r) ~~ = ~~ E(r) ~+~ W sup r O(r)$ for $r=0,~ 1,~ ...,~ half N -1$, 4106.LE 4107and 4108.LB 4109$ 4110G( half N +r ) ~~ = ~~ E(r) ~+~ W sup { half N +r} O(r)$ for $ 4111r = 0,~ 1,~ ...,~ half N -1$. 4112.LE 4113Calculation of the $E$ and $O$ transforms involves $( half N) sup 2$ operations each, 4114while combining them together according to the above relationship occupies 4115$N$ operations. Thus the total is $N + half N sup 2 $ operations, which is considerably 4116less than $N sup 2$. 4117.pp 4118But don't stop there! The even half can itself be broken down into 4119even and odd parts to expedite its calculation, and the same with the odd half. 4120The only constraint is that the number of elements in the sequences splits 4121exactly into two at each stage. 4122Providing $N$ is a power of 2, then, we are left at the end with some 1-point 4123transforms to do. But transforming a single point leaves it unaffected! (Check 4124the definition of the DFT.) A quick calculation shows that the number of operations 4125needed is not $N + half N sup 2$, but $N~ log sub 2 N$. 4126Figure 4.9 compares this with $N sup 2$, the number of operations for 4127straightforward DFT calculation, and it can be seen that the FFT is very much 4128faster. 4129.FC "Figure 4.9" 4130.pp 4131The only restriction on the use of the FFT is that $N$ must be a power of two. 4132If it is not, alternative, more complicated, algorithms can be used which 4133give comparable computational advantages. However, for speech processing 4134the number of samples that are transformed is usually arranged to be a power 4135of two. If a pitch synchronous analysis is undertaken, the 4136time stretch that is to be transformed is dictated by the length of the pitch 4137period, and will vary from time to time. Then, it is usual to pad out the 4138time waveform with zeros to bring the number of samples up to a power of two; 4139otherwise, if different-length time stretches were transformed the scale 4140of the resulting frequency components would vary too. 4141.pp 4142The FFT provides very worthwhile cost savings over the use of a bank of 4143bandpass filters for spectral analysis. Take the example of a 256-point 4144transform with 8\ kHz sampling, giving 128 frequency components spaced 4145by 31.25\ Hz from 0 up to almost 4\ kHz. This can be computed on overlapping 414632\ msec stretches of the time waveform, giving a new spectrum every 16\ msec, 4147by a single FFT calculation every 32\ msec (putting successive pairs of 4148time stretches in the real and imaginary parts of the complex input sequence, 4149as described earlier). The FFT algorithm requires $N~ log sub 2 N$ operations, 4150which is 2048 when $N=256$. An additional 512 operations are required 4151for the windowing calculation. Repeated every 32\ msec, this gives 4152a rate of 80,000 operations per second. 
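.pp
The even-and-odd decimation described above translates directly into a short
recursive routine, sketched below in Python.  It is intended only to make the
principle concrete, not to be an efficient implementation; the cost comparison
with a bank of bandpass filters continues after it.
.LB
.nf
import cmath
def fft(g):
    N = len(g)                   # N must be a power of two
    if N == 1:
        return list(g)           # a 1-point transform leaves the point unchanged
    E = fft(g[0::2])             # transform of the even-numbered points
    O = fft(g[1::2])             # transform of the odd-numbered points
    G = [0j] * N
    for r in range(N // 2):
        Wr = cmath.exp(-2j * cmath.pi * r / N)   # W**r, with W = exp(-j*2*pi/N)
        G[r] = E[r] + Wr * O[r]
        G[r + N // 2] = E[r] - Wr * O[r]         # since W**(N/2) = -1
    return G
.fi
.LE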
To achieve a much lower frequency 4153resolution with 20 bandpass filters, each of which are fourth-order, 4154will need a great deal more operations. Each filter will need between 4 and 8 4155multiplications per sample, depending on its exact digital implementation. But new 4156samples appear every 125 4157.ul 4158micro\c 4159seconds, and so somewhere around a million 4160operations will be required every second. 4161If we increased the frequency resolution to that obtained by the FFT, 128 4162filters would be needed, requiring between 4 and 8 million operations! 4163.sh "4.7 Formant estimation" 4164.pp 4165Once the frequency spectrum of a speech signal has been calculated, it may 4166seem a simple matter to estimate the positions of the formants. But it is 4167not! Spectra obtained in practice are not usually like the idealized ones 4168of Figure 2.2. One reason for this is that, unless the analysis is 4169pitch-synchronous, the frequency spectrum of the excitation source is mixed 4170in with that of the vocal tract filter. There are other reasons, which will 4171be discussed later in this section. But first, let us consider how to 4172extract the vocal tract filter characteristics from the combined spectrum 4173of source and filter. To do so we must begin to explore the theory of linear 4174systems. 4175.rh "Discrete linear systems." 4176Figure 4.10 shows an input signal exciting a filter to produce an output 4177signal. 4178.FC "Figure 4.10" 4179For present purposes, imagine the input to be a glottal 4180waveform, the filter a vocal tract one, and the output a 4181speech signal (which is then subjected to high-frequency de-emphasis 4182by radiation from the lips). 4183We will consider here 4184.ul 4185discrete 4186systems, so that the input $x(n)$ and output $y(n)$ are sampled signals, 4187defined only when $n$ is integral. The theory is quite similar for continuous 4188systems. 4189.pp 4190Assume that the system is 4191.ul 4192linear, 4193that is, if input $x sub 1 (n)$ produces output $y sub 1 (n)$ and 4194input $x sub 2 (n)$ produces output $y sub 2 (n)$, 4195then the sum of $x sub 1 (n)$ and 4196$x sub 2 (n)$ will produce the sum of $y sub 1 (n)$ and $y sub 2 (n)$. 4197It is easy to show from this that, for any constant multiplier $a$, 4198the input $ax(n)$ will produce output $ay(n)$ \(em it is pretty obvious 4199when $a=2$, 4200or indeed any positive integer; for then $ax(n)$ can be written as 4201$x(n)+x(n)+...$ . 4202Assume further that the system is 4203.ul 4204time-invariant, 4205that is, if input $x(n)$ 4206produces output $y(n)$ then a time-shifted version of $x$, 4207say $x(n+n sub 0 )$ for 4208some constant $n sub 0$, will produce the same output, only time-shifted; namely 4209$y(n+n sub 0)$. 4210.pp 4211Now consider the discrete delta function $delta (n)$, which is 0 except at 4212$n=0$ when it is 1. 4213If this single impulse is presented as input to the system, the output is called 4214the 4215.ul 4216impulse response, 4217and will be denoted by $h(n)$. 4218The fact that the system is time-invariant guarantees that the response does 4219not depend upon the particular time at which the impulse occurred, so that, 4220for example, the impulsive input $delta (n+n sub 0 )$ will produce output 4221$h(n+n sub 0 )$. 4222A delta-function input and corresponding impulse response are shown in Figure 42234.10. 
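.pp
The definitions just given are easily illustrated.  In the sketch below (plain
Python), a hypothetical first-order recursive system, chosen purely for
illustration, is driven first by the delta function and then by a delayed
delta function; the second response is simply a delayed copy of the first,
as time-invariance requires.
.LB
.nf
def delta(n):
    return 1.0 if n == 0 else 0.0
def run_system(x, length):
    # A hypothetical first-order recursive system y(n) = 0.8*y(n-1) + x(n).
    y, prev = [], 0.0
    for n in range(length):
        prev = 0.8 * prev + x(n)
        y.append(prev)
    return y
h = run_system(delta, 8)                     # impulse response 1, 0.8, 0.64, ...
d = run_system(lambda n: delta(n - 3), 8)    # delayed impulse in, delayed response out
print(h)
print(d)
.fi
.LE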
.pp
The impulse response of a linear, time-invariant system is an extremely useful
thing to
know, for it can be used to calculate the output of the system for any input
at all!  Specifically, an input signal $x(n)$ can be written
.LB
.EQ
x(n)~ = ~~ sum from {k=- infinity} to infinity ~ x(k) delta (n-k) ,
.EN
.LE
because $delta (n-k)$ is non-zero only when $k=n$, and so for any
particular value of $n$, the summation contains only
one non-zero term \(em that is, $x(n)$.
The action of the system on each term of the sum is to produce an output
$x(k)h(n-k)$, because $x(k)$ is just a constant, and
the system is linear.
Furthermore, the complete input $x(n)$ is just the sum of such terms, and since
the system is linear, the output is the sum of the terms $x(k)h(n-k)$, taken
over all $k$.
Hence the response of the system to an arbitrary input is
.LB
.EQ
y(n)~ = ~~ sum from {k=- infinity} to infinity ~ x(k) h(n-k) .
.EN
.LE
This is called a
.ul
convolution sum,
and is sometimes written
.LB
.EQ
y(n)~ =~ x(n) ~*~ h(n).
.EN
.LE
.pp
Let's write this in terms of $z$-transforms.  The (two-sided) $z$-transform of $y(n)$
is
.LB
.EQ
Y(z)~ = ~~ sum from {n=- infinity} to infinity ~y(n)z sup -n ~~ =
~~ sum from n ~ sum from k ~x(k)h(n-k) ~z sup -n .
.EN
.LE
Writing $z sup -n$ as $z sup -(n-k) z sup -k$, and interchanging the order
of summation, this becomes
.LB
.EQ
Y(z)~ mark = ~~ sum from k ~[~ sum from n ~ h(n-k)z sup -(n-k) ~]~x(k)z sup -k
.EN
.br
.EQ
lineup = ~~ sum from k ~H(z)~x(k)~z sup -k ~~ = ~~ H(z)~ sum from k ~x(k)z sup
-k ~~=~~H(z)X(z) .
.EN
.LE
Thus convolution in the time domain is the same as multiplication in the
$z$-transform domain; a very important result.  Applied to the linear system of
Figure 4.10, this means that the output $z$-transform is the input $z$-transform
multiplied by the $z$-transform of the system's impulse response.
.pp
What we really want to do is to relate the frequency spectrum of
the output to the response of the system and the spectrum of the
input.
In fact, frequency spectra are very closely connected with $z$-transforms.  A
periodic signal $x(n)$ which repeats every $N$ samples has DFT
.LB
.EQ
sum from n=0 to N-1 ~x(n)~e sup {-j2 pi rn/N} ,
.EN
.LE
and its $z$-transform is
.LB
.EQ
sum from {n=- infinity} to infinity ~x(n) ~z sup -n .
.EN
.LE
Hence the DFT is the same as the $z$-transform of a single cycle of the signal,
evaluated at the points $z= e sup {j2 pi r/N}$ for $r=0,~ 1,~ ...~ ,~ N-1$.
In other
words, the frequency components are samples of the $z$-transform at $N$
equally-spaced points around the unit circle.
Hence the frequency spectrum at the output of a linear system is the product of
the
input spectrum and the frequency response of the system itself (that is, the
transform of its impulse response function).
It should be admitted that this statement is somewhat questionable,
because to get from $z$-transforms to DFT's we have assumed that
a single cycle only is transformed \(em and the impulse response function of
a system is not necessarily periodic.  The real action of the system is
to multiply $z$-transforms, not DFT's.
However, it is useful in imagining 4313the behaviour of the system to think in terms of products of DFT's; and in 4314practice it is always these rather than $z$-transforms which are computed 4315because of the existence of the FFT algorithm. 4316.pp 4317Figure 4.11 shows the frequency spectrum of a typical voiced speech signal. 4318.FC "Figure 4.11" 4319The overall shape shows humps at the formant positions, like those in the 4320idealized Figure 2.2. However, superimposed on this is an "oscillation" 4321(in the frequency domain!) at the pitch frequency. This occurs because the 4322transform of the vocal tract filter has been multiplied by that of the 4323pitch pulse, the latter having components at harmonics of the pitch frequency. 4324The oscillation must be suppressed before the formants 4325can be estimated to any degree of accuracy. 4326.pp 4327One way of eliminating the oscillation is to perform pitch-synchronous 4328analysis. 4329This removes the influence of pitch from the frequency domain by dealing with 4330it in the time domain! The snag is, of course, that it is not easy to estimate 4331the pitch frequency: some techniques for doing so are discussed in the next 4332main section. 4333Another way is to use linear predictive analysis, which really does get rid 4334of pitch information without having to estimate the pitch period first. A 4335smooth 4336frequency spectrum can be produced using the analysis techniques described in 4337Chapter 6, which provides 4338a suitable starting-point for formant frequency estimation. 4339The third method is to remove the pitch ripple from the frequency spectrum 4340directly. This will be discussed in an intuitive rather than a 4341theoretical way, because linear predictive methods are becoming dominant 4342in speech processing. 4343.rh "Cepstral processing of speech." 4344Suppose the frequency spectrum of Figure 4.11 were actually a time waveform. 4345To remove the high-frequency pitch ripple is easy: just filter it out! 4346However, 4347filtering removes 4348.ul 4349additive 4350ripples, whereas this is a 4351.ul 4352multiplicative 4353ripple. To turn multiplication into addition, take logarithms. Then the 4354procedure would be 4355.LB 4356.NP 4357compute the DFT of the speech waveform (windowed, overlapped); 4358.NP 4359take the logarithm of the transform; 4360.NP 4361filter out the high-frequency part, corresponding to pitch ripple. 4362.LE 4363.pp 4364Filtering is often best done using the DFT. If the rippled waveform of Figure 43654.11 is transformed, a strong component could be expected at the ripple 4366frequency, with weaker ones at its harmonics. These components can be 4367simply removed by setting them to zero, and inverse-transforming the result 4368to give a smoothed version of the original frequency spectrum. 4369A spectrum of the logarithm of a frequency spectrum is often called a 4370.ul 4371cepstrum 4372\(em a sort of backwards spectrum. The horizontal axis of the cepstrum, 4373having the dimension of time, is called "quefrency"! Note that high-frequency 4374signals have low quefrencies and vice versa. In practice, 4375because the pitch ripple is usually well above the quefrency of interest for 4376formants, the upper end of the cepstrum is often simply cut off from a fixed 4377quefrency which corresponds to the maximum pitch expected. However, identifying 4378the pitch peaks of the cepstrum has the useful byproduct of giving the pitch 4379period of the original speech. 
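.pp
Before summarizing the procedure, here is a minimal sketch of the smoothing
idea in Python (numpy assumed).  The frame is taken to be a windowed stretch
of speech samples, and the quefrency cut-off of 32 points is an arbitrary
illustrative value rather than one derived from the expected pitch range.
.LB
.nf
import numpy as np
def smoothed_log_spectrum(frame, cutoff=32):
    # The small constant guards against taking the log of zero.
    log_spec = np.log(np.abs(np.fft.fft(frame)) + 1e-10)   # log magnitude spectrum
    cep = np.fft.fft(log_spec)                              # "spectrum of a spectrum"
    cep[cutoff:len(frame) - cutoff + 1] = 0.0               # discard high quefrencies
    return np.fft.ifft(cep).real                            # smoothed log spectrum
.fi
.LE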
.pp
To summarize, then, the procedure for spectral smoothing by the cepstral method
is
.LB
.NP
compute the DFT of the speech waveform (windowed, overlapped);
.NP
take the logarithm of the transform;
.NP
take the DFT of this log-transform, calling it the cepstrum;
.NP
identify the lowest-quefrency peak in the cepstrum as the pitch,
confirming it by examining its harmonics, which should be
equally spaced at the pitch quefrency;
.NP
remove pitch effects from the cepstrum by cutting off its high-quefrency
part above either the pitch quefrency or some constant representing the maximum
expected pitch (which is the minimum expected pitch quefrency);
.NP
inverse DFT the resulting cepstrum to give a smoothed spectrum.
.LE
.rh "Estimating formant frequencies from smoothed spectra."
The difficulties of formant extraction are not over even when a smooth frequency
spectrum has been obtained.  A simple peak-picking algorithm which identifies
a peak at the $k$'th frequency component whenever
.LB
$
X(k-1) ~<~ X(k)
$ and $
X(k) ~>~ X(k+1)
$
.LE
will quite often identify formants incorrectly.
It helps to specify in advance minimum and maximum formant frequencies \(em say
100\ Hz and 3\ kHz for three-formant identification, and ignore peaks lying
outside these limits.  It helps to estimate
the bandwidth of the peaks and reject those with bandwidths greater than
500\ Hz \(em for real formants are never this wide.  However, if two formants are
very close, then they may appear as a single, wide, peak and be rejected by
this criterion.  It is usual to take account of formant positions identified
in previous frames under these conditions.
.pp
Markel and Gray (1976) describe in detail several estimation algorithms.
.[
Markel Gray 1976 Linear prediction of speech
.]
Their simplest uses the number of peaks identified in the raw spectrum
(under 3\ kHz, and with
bandwidths less than 500\ Hz), to determine what to do.  If exactly three
peaks are found, they are used as the formant positions.  It is claimed that
this happens about 85% to 90% of the time.
If only one peak is found, the present frame is ignored and the
previously-identified
formant positions are used (this happens less than 1% of the time).
The remaining cases are two peaks \(em corresponding to omission of one formant \(em
and four peaks \(em corresponding to an extra formant being included.  (More
than
four peaks never occurred in their data.)  Under these conditions,
a nearest-neighbour measure is used for disambiguation.  The measure is
.LB
.EQ
v sub ij ~ = ~ |{ F sup * } sub i (k) ~-~ F sub j (k-1)| ,
.EN
.LE
where $F sub j (k-1)$ is the $j$'th formant frequency defined
in the previous frame
$k-1$ and ${ F sup * } sub i (k)$ is the $i$'th raw data frequency estimate
for frame $k$.
If two peaks only are found, this measure is used to identify
the closest peaks in the previous frame; and then the
third peak of that frame is taken to be the missing formant
position.  If four peaks are found, the measure is used to
determine which of them is furthest from the previous formant
values, and this one is discarded.
.pp
This procedure works forwards, using the previous frame to
disambiguate peaks given in the current one; a sketch of the basic
peak-picking step is given below.
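.pp
The sketch below (plain Python) applies only the local-maximum test and the
frequency limits quoted earlier; in a real system the bandwidth test and the
frame-to-frame disambiguation described above would follow, and the constants
are merely the illustrative values quoted in the text.
.LB
.nf
def formant_candidates(X, spacing, fmin=100.0, fmax=3000.0):
    # X is a smoothed magnitude spectrum, spacing is the bin spacing in Hz.
    peaks = []
    for k in range(1, len(X) - 1):
        if X[k - 1] < X[k] and X[k] > X[k + 1]:      # local maximum
            f = k * spacing
            if fmin <= f <= fmax:                    # keep only plausible formants
                peaks.append(f)
    return peaks
.fi
.LE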
More sophisticated 4457algorithms work backwards as well, identifying 4458.ul 4459anchor points 4460in the data which have clearly-defined formant positions, and 4461moving in both directions from these to disambiguate 4462neighbouring frames of data. Finally, absolute limits can be 4463imposed upon the magnitude of formant movements between frames 4464to give an overall smoothing to the formant tracks. 4465.pp 4466Very often, people will refine the result of such automatic formant 4467estimation procedures by hand, looking at the tracks, knowing 4468what was said, and making adjustments in the light of their 4469experience of how formants move in speech. Unfortunately, it is difficult to 4470obtain high-quality formant tracks by completely automatic 4471means. 4472.pp 4473One of the most difficult cases in formant estimation is where 4474two formants are so close together that the individual peaks 4475cannot be resolved. One simple solution to this problem is to 4476employ "analysis-by-synthesis", whereby once a formant is 4477identified, a standard formant shape at this position is 4478synthesized and 4479subtracted from the 4480logarithmic spectrum (Coker, 1963). 4481.[ 4482Coker 1963 4483.] 4484Then, even if two formants 4485are right on top of each other, the second is not missed because 4486it remains after the first one has been subtracted. 4487.pp 4488Unfortunately, however, the single peak which appears when 4489two formants are close together usually does not correspond exactly with the 4490position of either one. 4491There is one rather advanced signal-processing technique that 4492can help in this case. 4493The frequency spectrum of 4494speech is determined by 4495.ul 4496poles 4497which lie in the complex $z$-plane inside the unit circle. (They 4498must be inside the unit circle if the system is stable. Those 4499familiar with Laplace analysis of analogue systems may like to note that the 4500left half of the $s$-plane corresponds with the inside of the unit 4501circle in the $z$-plane.) As shown earlier, computing a DFT is tantamount to 4502evaluating the $z$-transform at equally-spaced points around the 4503unit circle. However, better resolution is obtained by 4504evaluating around a circle which lies 4505.ul 4506inside 4507the unit circle, but 4508.ul 4509outside 4510the outermost pole position. Such a circle is sketched in 4511Figure 4.12. 4512.FC "Figure 4.12" 4513.pp 4514Recall that the FFT is a fast way of calculating the DFT of a 4515sequence. Is there a similarly fast way of evaluating the 4516$z$-transform inside the unit circle? The answer is yes, and the 4517technique is known as the "chirp $z$-transform", because it 4518involves considering a signal whose frequency increases 4519linearly \(em just like a radar chirp signal. The chirp method 4520allows the $z$-transform to be computed quickly at equally-spaced 4521points along spirally-shaped contours around the origin of the 4522$z$-plane \(em corresponding to signals of linearly increasing 4523complex frequency. The spiral nature of these curves is not of 4524particular interest in speech processing. What 4525.ul 4526is 4527of interest, though, is that the spiral can begin at any point 4528on 4529the $z=0$ axis, and its pitch can be set arbitrarily. 4530If we begin spiralling at $z=0.9$, say, and set the pitch 4531to zero, the contour becomes a circle inside the unit one, with 4532radius 0.9. Such a circle is exactly what is needed to refine 4533formant resolution. 
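.pp
This special case is particularly simple to compute: weighting the windowed
frame by $0.9 sup -n$ before transforming it gives exactly the samples of the
$z$-transform on a circle of radius 0.9.  The sketch below (Python with numpy
assumed) does just this; the full chirp $z$-transform is more general, but this
is enough to sharpen closely-spaced formant peaks.
.LB
.nf
import numpy as np
def spectrum_on_circle(frame, radius=0.9):
    # Samples of the z-transform at z = radius*exp(j*2*pi*r/N), r = 0 ... N-1.
    n = np.arange(len(frame), dtype=float)
    return np.fft.fft(frame * radius ** (-n))
.fi
.LE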
.sh "4.8 Pitch extraction"
.pp
The last section discussed how to characterize the vocal tract filter
in the source-filter model of speech production: this one looks
at how the most important property of the source \(em that is, the
pitch period \(em can be derived.  In many ways pitch extraction
is more important from a practical point of view than is formant
estimation.  In a voice-output system, formant estimation is
only necessary if speech is to be stored in formant-coded form.
For linear predictive storage of speech, or for speech synthesis
from phonetics or text, formant extraction is unnecessary \(em
although of course general information about formant
frequencies and formant tracks in natural speech is needed
before a synthesis-from-phonetics system can be built.
However, knowledge of the pitch contour is needed for
many different purposes.  For example, compact encoding of
linearly predicted speech relies on the pitch being estimated and
stored as a parameter separate from the articulation.
Significant improvements in frequency analysis can be made by
performing pitch-synchronous Fourier transformations,
because the need to window is eliminated.
Many synthesis-from-phonetics systems require the pitch contour
for utterances to be stored rather than computed from markers in the
phonetic text.
.pp
Another issue which is closely bound up with pitch extraction is
the voiced-unvoiced distinction.  A good pitch estimator ought to
fail when presented with aperiodic input such as an unvoiced
sound, and so give a reliable indication of whether the frame of
speech is voiced or not.
.pp
One method of pitch estimation, which uses the cepstrum, has been outlined
above.  It involves a substantial amount of computation,
and is relatively complex to implement.  However, if implemented
properly it gives excellent results, because the source-filter
structure of the speech is fully utilized.
Another method, using the
linear prediction residual, will be described in Chapter 6.
Again, this requires a great deal of computation of a fairly sophisticated
nature, and gives good results \(em although it relies on a
somewhat more
restricted version of the source-filter model than cepstral
analysis.
.rh "Autocorrelation methods."
The most reliable way of estimating the pitch of a periodic
signal which is corrupted by noise is to examine its
short-time autocorrelation function.
The autocorrelation of a signal $x(n)$ with lag $k$ is defined as
.LB
.EQ
phi (k) ~~ = ~~ sum from {n=- infinity} to infinity ~ x(n)x(n+k) .
.EN
.LE
If the signal is quasi-periodic, with slowly varying period,
a finite stretch of it can be isolated with a window
$w(i)$, which is 0 when $i$ is outside the range $[0,N)$.
Beginning this window at sample $m$ gives the windowed signal
.LB
.EQ
x(n)w(n-m),
.EN
.LE
whose autocorrelation,
the
.ul
short-time
autocorrelation of the signal $x$ at point $m$, is
.LB
.EQ
phi sub m (k)~ = ~~ sum from n ~ x(n)w(n-m)x(n+k)w(n-m+k) .
.EN
.LE
.pp
The autocorrelation function exhibits peaks at lags which correspond to
the pitch period and multiples of it.  At such lags, the signal is in
phase with a delayed version of itself, giving high correlation.
The pitch of natural speech ranges over about three octaves, from 50\ Hz (low-pitched men) to around
400\ Hz (children).  To ensure that at least two pitch cycles are seen, even at
the
low end, the window needs to be at least 40\ msec long, and the autocorrelation
function calculated for lags up to 20\ msec.  The peaks which occur at lags
corresponding to multiples of the pitch become smaller as the multiple
increases, because the speech waveform will change slightly and the pitch
period is not perfectly constant.  If signals at the high end of the pitch
range, 400\ Hz, are
viewed through a 40\ msec autocorrelation window, considerable smearing of
pitch resolution in the time domain is to be expected.  Finally, for unvoiced
speech, no substantial peaks of autocorrelation will occur.
.pp
If all deviations from perfect periodicity can be attributed to
additive, white, Gaussian noise, then it can be shown from
standard detection theory that autocorrelation methods are
appropriate for pitch identification.  Unfortunately, this is
certainly not the case for speech signals.  Although the
short-time autocorrelation of voiced speech exhibits peaks at
multiples of the pitch period, it is not clear that it is any
easier to detect these peaks in the autocorrelation function
than it is in the original time waveform!  To take a simple
example, if a signal contains a fundamental and in-phase first
and second harmonics,
.LB
.EQ
x(n)~ =~ a sin 2 pi fnT ~+~ b sin 4 pi fnT ~+~ c sin 6 pi fnT ,
.EN
.LE
then its autocorrelation function is
.LB
.EQ
phi (k) ~=~~ {a sup 2 ~cos~2 pi fkT~+~b sup 2 ~cos~4 pi
fkT~+~c sup 2 ~cos~6 pi fkT} over 2 ~ .
.EN
.LE
There is no reason to believe that detection of the fundamental
period of this signal will be any easier in the autocorrelation
domain than in the time domain.
.pp
The most common error of pitch detection by autocorrelation
analysis is that the periodicities of the formants are confused
with the pitch.  This typically leads to the repetition time
being identified as $T sub pitch ~ +- ~ T sub formant1$, where the
$T$'s are the periods of the pitch and first formant.  Fortunately,
there are simple ways of processing the signal non-linearly to
reduce the effect of formants on pitch estimation using autocorrelation.
.pp
One way
is to low-pass filter the
signal with a cut-off above the maximum pitch frequency, say
600\ Hz.  However, formant 1 is often below this value.  A different
technique, which may be used in conjunction with filtering, is
to "centre-clip" the signal as shown in Figure 4.13.
.FC "Figure 4.13"
This
removes many of
the ripples which are associated with formants.  However, it
entails the use of an adjustable clipping threshold to cater for
speech of varying amplitudes.  Sondhi (1968), who introduced the
technique, set the clipping level at 30% of the maximum
amplitude.
.[
Sondhi 1968
.]
An alternative, which achieves
much the same effect without the need to fiddle with thresholds,
is to cube the signal, or raise it to some other high (odd!)
power, before taking the autocorrelation.  This highlights the
peaks and suppresses the effect of low-amplitude parts.
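.pp
A minimal sketch of pitch estimation by short-time autocorrelation, with
centre-clipping at 30% of the peak amplitude, is given below (Python with numpy
assumed).  The particular form of clipper used here, which subtracts the
clipping level from the surviving samples, is one common variant; the lag
search is confined to the 50 to 400\ Hz pitch range discussed above, and the
frame is assumed to be at least 40\ msec long.
.LB
.nf
import numpy as np
def pitch_period(frame, fs=8000.0, fmin=50.0, fmax=400.0):
    level = 0.3 * np.max(np.abs(frame))                  # clipping level, 30% of peak
    clipped = np.where(np.abs(frame) > level,
                       frame - np.sign(frame) * level, 0.0)
    full = np.correlate(clipped, clipped, mode="full")   # all lags, negative and positive
    acf = full[len(frame) - 1:]                          # keep non-negative lags
    lo, hi = int(fs / fmax), int(fs / fmin)              # lag range for 400 Hz down to 50 Hz
    lag = lo + int(np.argmax(acf[lo:hi + 1]))            # strongest peak in that range
    return lag / fs                                      # estimated pitch period in seconds
.fi
.LE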
4680.pp 4681For very accurate pitch detection, it is best to combine the evidence 4682from several different methods of analysis of the time waveform. 4683The autocorrelation function provides one source of evidence; 4684and the cepstrum provides another. 4685A third source comes from the time waveform itself. 4686McGonegal 4687.ul 4688et al 4689(1975) have described a semi-automatic method of pitch 4690detection which uses human judgement to make a final decision based upon these 4691three sources of evidence. 4692.[ 4693McGonegal Rabiner Rosenberg 1975 SAPD 4694.] 4695This appears to provide highly accurate pitch contours at the expense of 4696considerable human effort \(em it takes an experienced user 30 minutes to 4697process each second of speech. 4698.rh "Speeding up autocorrelation." 4699Calculating the autocorrelation function is an 4700arithmetic-intensive procedure. For large lags, it can best be 4701done using FFT methods; although there are simpler arithmetic 4702tricks which speed it up without going to such complexity. 4703However, with the availability of analogue delay lines using 4704charge-coupled devices, autocorrelation can now be done 4705effectively and cheaply by analogue, sampled-data, hardware. 4706.pp 4707Nevertheless, some techniques to speed up digital 4708calculation of short-time autocorrelations are in wide use. It 4709is tempting to hard-limit the signal so that it becomes binary 4710(Figure 4.14(a)), thus eliminating multiplication. 4711.FC "Figure 4.14" 4712This can be 4713disastrous, however, because hard-limited speech is known to 4714retain considerable intelligibility and therefore the formant 4715structure is still there. A better plan is to take 4716centre-clipped speech and hard-limit that to a ternary signal 4717(Figure 4.14(b)). This simplifies the computation considerably 4718with essentially no degradation in performance (Dubnowski 4719.ul 4720et al, 47211976). 4722.[ 4723Dubnowski Schafer Rabiner 1976 Digital hardware pitch detector 4724.] 4725.pp 4726A different approach to reducing the amount of calculation is to 4727perform a kind of autocorrelation which does not use 4728multiplications. The 4729"average magnitude difference function", 4730which is defined by 4731.LB 4732.EQ 4733d(k)~ = ~~ sum from {n=- infinity} to infinity ~ |x(n)-x(n+k)| , 4734.EN 4735.LE 4736has been used for this purpose with some success (Ross 4737.ul 4738et al, 47391974). 4740.[ 4741Ross Schafer Cohen Freuberg Manley 1974 4742.] 4743It exhibits dips at pitch periods (instead of the peaks of the 4744autocorrelation function). 4745.rh "Feature-extraction methods." 4746Another possible way of extracting pitch in the time domain is to try to 4747integrate information from different sources to give reliable 4748pitch estimates. Several features of the time 4749waveform can be defined, each of which provides an estimate of the pitch period, 4750and 4751an overall estimate can be obtained by majority vote. 4752.pp 4753For example, suppose that the only feature of the speech 4754waveform which is retained is the height and position of the 4755peaks, where a "peak" is defined by the simplistic criterion 4756.LB 4757$ 4758x(n-1) ~<~ x(n) 4759$ and $ 4760x(n) $>$ x(n+1) . 4761$ 4762.LE 4763Having found a peak which is thought to represent a pitch pulse, 4764one could define a "blanking period", based upon the current 4765pitch estimate, within which the next pitch pulse could not 4766occur. When this period has expired, the next pitch pulse is 4767sought. 
At first, a stringent criterion should be used for 4768identifying the next peak as a pitch pulse; but it can gradually be 4769relaxed if time goes on without a suitable pulse being 4770located. Figure 4.15 shows a convenient way of doing this: a 4771decaying exponential is begun at the end of the blanking period 4772and when a peak shows above, it is identified as a pitch pulse. 4773.FC "Figure 4.15" 4774One big advantage of this type of algorithm is that the data is 4775greatly reduced by considering peaks only \(em which can be 4776detected by simple hardware. Thus it can permit real-time 4777operation on a small processor with minimal special-purpose 4778hardware. 4779.pp 4780Such a pitch pulse detector is exceedingly simplistic, and will 4781often identify the pitch incorrectly. However, it can be used 4782in conjunction with other features to produce good pitch 4783estimates. Gold and Rabiner (1969), who pioneered the 4784approach, used six features: 4785.[ 4786Gold Rabiner 1969 Parallel processing techniques for pitch periods 4787.] 4788.LB 4789.NP 4790peak height 4791.NP 4792valley depth 4793.NP 4794valley-to-peak height 4795.NP 4796peak-to-valley depth 4797.NP 4798peak-to-peak height (if greater than 0) 4799.NP 4800valley-to-valley depth (if greater than 0). 4801.LE 4802The features are symmetric with regard to peaks and valleys. 4803The first feature is the one described above, and the second one works in 4804exactly the same way. 4805The third feature records the 4806height between each valley and the succeeding peak, and fourth 4807uses the depth between each peak and the succeeding valley. The 4808purpose of the final two detectors is to eliminate secondary, 4809but rather large, peaks from consideration. Figure 4.16 shows 4810the kind of waveform on which the other features might 4811incorrectly double the pitch, but the last two features identify 4812correctly. 4813.FC "Figure 4.16" 4814.pp 4815Gold and Rabiner also included the last two pitch estimates from each 4816feature detector. 4817Furthermore, for each feature, the present estimate 4818was added to the previous one to make a fourth, and the previous one to 4819the one before that to make a fifth, and all three were added together 4820to make a sixth; so that for each feature there were 6 separate estimates of 4821pitch. The reason for this is that if three consecutive estimates of the 4822fundamental period are $T sub 0$, $T sub 1$ and $T sub 2$; then if some peaks are 4823being falsely identified, the actual period could be any of 4824.LB 4825.EQ 4826T sub 0 ~+~ T sub 1 ~~~~ T sub 1 ~+~ T sub 2 ~~~~ 4827T sub 0 ~+~ T sub 1 ~+~ T sub 2 . 4828.EN 4829.LE 4830It is essential to do this, because 4831a feature of a given type can occur more than once in a pitch period \(em 4832secondary peaks usually exist. 4833.pp 4834Six features, each contributing six separate estimates, makes 36 estimates 4835of pitch in all. 4836An overall figure was obtained from this 4837set by selecting the most popular estimate (within some 4838pre-specified tolerance). The complete scheme has been 4839evaluated extensively (Rabiner 4840.ul 4841et al, 48421976) and compares 4843favourably with other methods. 4844.[ 4845Rabiner Cheng Rosenberg McGonegal 1976 4846.] 4847.pp 4848However, it must be admitted that this procedure seems to be rather 4849.ul 4850ad hoc 4851(as are many other successful speech parameter estimation 4852algorithms!). 
Specifically, it is not easy to predict what 4853kinds of waveforms it will fail on, and evaluation of it can 4854only be pragmatic. When used to 4855estimate the pitch of musical 4856instruments and singers over a 6-octave range (40\ Hz to 2.5\ kHz), 4857instances were found where it failed dramatically (Tucker and Bates, 1978). 4858.[ 4859Tucker Bates 1978 4860.] 4861This is, of 4862course, a much more difficult problem than pitch estimation for 4863speech, where the range is typically 3 octaves. 4864In fact, for speech the feature 4865detectors are usually preceded by 4866a low-pass filter to attenuate the myriad 4867of peaks 4868caused by higher formants, and this 4869is inappropriate for 4870musical applications. 4871.pp 4872There is evidence which shows that additional features can 4873assist with pitch identification. The above features are all 4874based upon the signal amplitude, and could be described as 4875.ul 4876secondary 4877features derived from a single 4878.ul 4879primary 4880feature. Other primary features can easily be defined. 4881Tucker and Bates (1978) used a centre-clipped waveform, and considered only 4882the peaks rising above the central region. 4883.[ 4884Tucker Bates 1978 4885.] 4886They defined two 4887further primary features, in addition to the peak amplitude: the 4888.ul 4889time width 4890of a peak (period for which it is 4891outside the clipping level), and its 4892.ul 4893energy 4894(again, outside the clipping level). The primary 4895features are shown in Figure 4.17. 4896.FC "Figure 4.17" 4897Secondary features are 4898defined, based on these three primary ones, and pitch estimates 4899are made for each one. A further innovation was to combine the 4900individual estimates on a way which is based upon 4901autocorrelation analysis, reducing to some degree the 4902.ul 4903ad-hocery 4904of the pitch detection process. 4905.sh "4.9 References" 4906.LB "nnnn" 4907.[ 4908$LIST$ 4909.] 4910.LE "nnnn" 4911.sh "4.10 Further reading" 4912.pp 4913There are a lot of books on digital signal analysis, although in general 4914I find them rather turgid and difficult to read. 4915.LB "nn" 4916.\"Ackroyd-1973-1 4917.]- 4918.ds [A Ackroyd, M.H. 4919.ds [D 1973 4920.ds [T Digital filters 4921.ds [I Butterworths 4922.ds [C London 4923.nr [T 0 4924.nr [A 1 4925.nr [O 0 4926.][ 2 book 4927.in+2n 4928Here is the exception to prove the rule. 4929This book 4930.ul 4931is 4932easy to read. 4933It provides a good introduction to digital signal processing, 4934together with a wealth of practical design information on digital filters. 4935.in-2n 4936.\"Committee.I.D.S.P-1979-3 4937.]- 4938.ds [A IEEE Digital Signal Processing Committee 4939.ds [D 1979 4940.ds [T Programs for digital signal processing 4941.ds [I Wiley 4942.ds [C New York 4943.nr [T 0 4944.nr [A 0 4945.nr [O 0 4946.][ 2 book 4947.in+2n 4948This is a remarkable collection of tried and tested Fortran programs 4949for digital signal analysis. 4950They are all available from the IEEE in machine-readable form on magnetic 4951tape. 4952Included are programs for digital filter design, discrete Fourier transformation, 4953and cepstral analysis, as well as others (like linear predictive analysis; 4954see Chapter 6). 4955Each program is accompanied by a concise, well-written description of how 4956it works, with references to the relevant literature. 4957.in-2n 4958.\"Oppenheim-1975-4 4959.]- 4960.ds [A Oppenheim, A.V. 4961.as [A " and Schafer, R.W. 
4962.ds [D 1975 4963.ds [T Digital signal processing 4964.ds [I Prentice Hall 4965.ds [C Englewood Cliffs, New Jersey 4966.nr [T 0 4967.nr [A 1 4968.nr [O 0 4969.][ 2 book 4970.in+2n 4971This is one of the standard texts on most aspects of digital signal processing. 4972It treats the $z$-transform, digital filters, and discrete Fourier transformation 4973in far more detail than we have been able to here. 4974.in-2n 4975.\"Rabiner-1975-5 4976.]- 4977.ds [A Rabiner, L.R. 4978.as [A " and Gold, B. 4979.ds [D 1975 4980.ds [T Theory and application of digital signal processing 4981.ds [I Prentice Hall 4982.ds [C Englewood Cliffs, New Jersey 4983.nr [T 0 4984.nr [A 1 4985.nr [O 0 4986.][ 2 book 4987.in+2n 4988This is the other standard text on digital signal processing. 4989It covers the same ground as Oppenheim and Schafer (1975) above, 4990but with a slightly faster (and consequently more difficult) presentation. 4991It also contains major sections on special-purpose hardware for 4992digital signal processing. 4993.in-2n 4994.\"Rabiner-1978-1 4995.]- 4996.ds [A Rabiner, L.R. 4997.as [A " and Schafer, R.W. 4998.ds [D 1978 4999.ds [T Digital processing of speech signals 5000.ds [I Prentice Hall 5001.ds [C Englewood Cliffs, New Jersey 5002.nr [T 0 5003.nr [A 1 5004.nr [O 0 5005.][ 2 book 5006.in+2n 5007Probably the best single reference for digital speech analysis, 5008as it is for the time-domain encoding techniques of the last chapter. 5009Unlike the books cited above, it is specifically oriented to speech processing. 5010.in-2n 5011.LE "nn" 5012.EQ 5013delim $$ 5014.EN 5015.CH "5 RESONANCE SPEECH SYNTHESIZERS" 5016.ds RT "Resonance speech synthesizers 5017.ds CX "Principles of computer speech 5018.pp 5019This chapter considers the design of speech synthesizers which 5020implement a direct electrical analogue of 5021the resonance properties of the vocal tract by providing a filter for each 5022formant whose resonant frequency is to be controlled. Another method is the 5023channel vocoder, with a bank of fixed filters whose gains are varied to match 5024the spectrum of the speech as described in Chapter 4. This is not generally 5025used for synthesis from a written representation, however, because it is hard 5026to get good quality speech. It 5027.ul 5028is 5029used sometimes for low-bandwidth 5030transmission and storage, for 5031it is fairly easy to analyse natural speech into fixed frequency bands. 5032A second alternative to the resonance synthesizer is the linear predictive 5033synthesizer, which at present is used quite extensively and is likely to become 5034even more popular. This is covered in the next chapter. 5035Another alternative is the articulatory synthesizer, which 5036attempts to model the vocal tract directly, rather than 5037modelling the acoustic output from it. 5038Although, as noted in Chapter 2, articulatory synthesis holds a promise of 5039high-quality speech \(em for the coarticulation effects caused by tongue 5040and jaw inertia can be modelled directly \(em this has not yet been realized. 5041.pp 5042The source-filter model of speech production indicates that an electrical 5043analogue of the vocal tract can be obtained by considering the source 5044excitation and the filter that produces the formant frequencies separately. 5045This approach was pioneered by Fant (1960), and we shall present much of his 5046work in this chapter. 5047.[ 5048Fant 1960 Acoustic theory of speech production 5049.] 
5050There has been some discussion over whether the source-filter model really 5051is a good one, and some 5052synthesizers 5053explicitly introduce an element of 5054"sub-glottal coupling", which simulates the effect of the lung cavity 5055on the vocal tract transfer function during the periods when the glottis is 5056open (for an example see Rabiner, 1968). 5057.[ 5058Rabiner 1968 Digital formant synthesizer JASA 5059.] 5060However, this is very much a low-order effect when considering 5061speech synthesized by rule from a written representation, for the software 5062which calculates parameter values to drive the synthesizer is a far greater 5063source of degradation in speech quality. 5064.sh "5.1 Overall spectral considerations" 5065.pp 5066Figure 5.1 shows the source-filter model of speech production. 5067.FC "Figure 5.1" 5068For voiced speech, the excitation source produces a waveform whose frequency 5069components decay at about 12\ dB/octave, as we shall see in a later section. 5070The excitation passes into the vocal tract filter. Conceptually, this can best 5071be viewed as an infinite series of formant filters, although for implementation 5072purposes only the first few are modelled explicitly and the effect of the rest 5073is lumped together into a higher-formant compensation network. In either case 5074the overall frequency profile of the filter is a flat one, upon which humps are 5075superimposed at the various formant frequencies. Thus the output of the 5076vocal tract filter falls off at 12\ dB/octave just as the input does. 5077However, measurements of actual speech show a 6\ dB/octave decay with increasing 5078frequency. This is explained by the effect of radiation of speech from the 5079lips, which in fact has a "differentiating" action, producing a 6\ dB/octave 5080rise in the frequency spectrum. This 6\ dB/octave lift is similar to that 5081provided by a treble boost control on a radio or amplifier. Speech synthesized 5082without it sounds unnaturally heavy and bassy. 5083.pp 5084These overall spectral shapes, which are derived from considering the human 5085vocal tract, are summarized in the upper annotations in Figure 5.1. But there 5086is no real necessity for a synthesizer to model the frequency characteristics 5087of the human vocal tract at intermediate points: only the output speech is of 5088any concern. Because the system is a linear one, the filter blocks in the 5089figure can be shuffled around to suit engineering requirements. One such 5090requirement is the desire to minimize internally-generated noise in the 5091electrical implementation, most of which will arise in the vocal tract filter 5092(because it is much more complicated than the other components). For this 5093reason an excitation source with a flat spectrum is often preferred, as shown 5094in the lower annotations. This can be generated either by taking the desired 5095glottal pulse shape, with its 12\ dB/octave fall-off, and passing it through a 5096filter giving 12\ dB/octave lift at higher frequencies; or, if the pulse shape 5097is to be stored digitally, by storing its second derivative instead. 5098Then the radiation compensation, which is now more properly called 5099"spectral equalization", will comprise a 6\ dB/octave fall-off to give the 5100required trend in the output spectrum. 5101.pp 5102For a given pitch period, this scheme yields exactly the same spectral 5103characteristics as the original system which modelled the human vocal tract. 
5104However, when the pitch varies there will be a difference, for sounds with 5105higher excitation frequencies will be attenuated by \-6\ dB/octave in the new 5106system and +6\ dB/octave in the old by the final spectral equalization. 5107In practice, the pitch of the human voice lies quite low in the frequency 5108region \(em usually below 400\ Hz \(em and if all filter characteristics begin 5109their roll-off at this frequency the two systems will be the same. This 5110simplifies the implementation with a slight compromise in its accuracy in 5111modelling the spectral trend of human speech, for the overall \-6\ dB/octave 5112decay actually begins at a frequency of around 100\ Hz. If this is 5113implemented, some adjustment will need to be made to the amplitudes to ensure 5114that high-pitched sounds are not attenuated unduly. 5115.pp 5116The discussion so far pertains to voiced speech only. The source spectrum of 5117the random excitation in unvoiced sounds is substantially flat, and combines 5118with the radiation from the lips to give a +6\ dB/octave rise in the output 5119spectrum. Hence if spectral equalization is changed to \-6\ dB/octave to 5120accomodate a voiced excitation with flat spectrum, the noise source should 5121show a 12\ dB/octave rise to give the correct overall effect. 5122.sh "5.2 The excitation sources" 5123.pp 5124In human speech, the excitation source for voiced sounds is produced by two 5125flaps of skin called the "vocal cords". These are blown apart by pressure from 5126the lungs. When they come apart the pressure is relieved, and the muscles 5127tensioning the skin cause the flaps to come together again. Subsequently, the 5128lung pressure \(em called "sub-glottal pressure" \(em builds up once more and the 5129process is repeated. The factors which influence the rate and nature of 5130vibration are muscular tension of the cords and the sub-glottal pressure. The detail 5131of the excitation has considerable importance to speech synthesis because it 5132greatly influences the apparent naturalness of the sound produced. For example, 5133if you have inflamed vocal cords caused by laryngitis the sound quality 5134changes dramatically. Old people who do not have proper muscular control over 5135their vocal cord tension produce a quavering sound. Shouted speech can easily 5136be distinguished from quiet speech even when the volume cue is absent \(em you 5137can verify this by fiddling with the volume control of a tape recorder \(em because 5138when shouting, the vocal cords stay apart for a much smaller fraction of the 5139pitch cycle than at normal volumes. 5140.rh "Voiced excitation in natural speech." 5141There are two basic ways to examine the shape of the excitation source in 5142people. One is to use a dentist's mirror and high-speed photography to observe 5143the vocal cords directly. Although it seems a lot to ask someone to speak 5144naturally with a mirror stuck down the back of his throat, the method has been 5145used and photographs can be found, for example, in Flanagan (1972). 5146.[ 5147Flanagan 1972 Speech analysis synthesis and perception 5148.] 5149The second 5150technique is to process the acoustic waveform digitally, identifying the 5151formant positions and deducting the formant contributions from the waveform by 5152filtering. This leaves the basic excitation waveform, which can then be 5153displayed. 
Such techniques lead to excitation shapes like those sketched in 5154Figure 5.2, in which the gradual opening and abrupt closure of the vocal cords 5155can easily be seen. 5156.FC "Figure 5.2" 5157.pp 5158It is a fact that if a periodic function has one or more discontinuities, its frequency 5159spectrum will decay at sufficiently high frequencies at the rate of 6\ dB/octave. 5160For example, the components of the square wave 5161.LB 5162$ 5163g(t) ~~ = ~~ mark 0 5164$ for $ 51650 <= t < h 5166$ 5167.br 5168$ 5169lineup 1 5170$ for $ 5171h <= t < b 5172$ 5173.LE 5174can be calculated from the Fourier series 5175.LB 5176.EQ 5177G(r) ~~ = ~~ 1 over b ~ integral from 0 to b ~g(t)~e sup {-j2 pi rt/b} ~dt 5178~~ = ~~ j over {2 pi r} ~e sup {-j2 pi rh/b} , 5179.EN 5180.LE 5181so $|G(r)|$ is proportional to $1/r$, and the change in one octave is 5182.LB 5183.EQ 518420~log sub 10 ~ |G(2r)| over |G(r)| 5185~~=~~20~log sub 10 ~ 1 over 2 5186~~ = ~ 5187.EN 5188\-6\ dB. 5189.LE 5190However, if the discontinuities are ones of slope only, then the asymptotic decay 5191at high frequencies is 12\ dB/octave. Thus the glottal excitation of Figure 5.2 5192will decay at this rate. 5193Note that it is not the 5194.ul 5195number 5196but the 5197.ul 5198type 5199of discontinuities which are important in determining the asymptotic spectral 5200trend. 5201.rh "Voiced excitation in synthetic speech." 5202There are several ways that glottal excitation can be simulated in a synthesizer, 5203four of which are shown in Figure 5.3. 5204.FC "Figure 5.3" 5205The square pulse and the sawtooth pulse 5206both exhibit discontinuities, and so will have the wrong asymptotic rate of 5207decay (6\ dB/octave instead of 12\ dB/octave). A better bet is the triangular 5208pulse. This has the correct decay, for there are only discontinuities of slope. 5209However, although the asymptotic rate of decay is of first importance, the fine 5210structure of the frequency spectrum at the lower end is also significant, and 5211the fact that there are two discontinuities of slope instead of just one in the 5212natural waveform means that the spectra cannot match closely. 5213.pp 5214Rosenberg (1971) has investigated several different shapes using listening 5215tests, and he found that the polynomial approximation sketched in Figure 5.3 5216was preferred by listeners. 5217.[ 5218Rosenberg 1971 5219.] 5220This has one slope discontinuity, and comprises 5221three sections: 5222.LB 5223$g(t) ~~ = ~~ 0$ for $0 <= t < t sub 1$ (flat during the period of closure) 5224.sp 5225$g(t) ~~ = ~~ A~ u sup 2 (3 - 2u) $, where 5226$u ~=~ {t-t sub 1} over {t sub 2 -t sub 1} $ , for 5227$t sub 1 <= t < t sub 2$ (opening phase) 5228.sp 5229.sp 5230$g(t) ~~ = ~~ A~ (1 - v sup 2 )$, where 5231$v ~=~ {t-t sub 2} over {b-t sub 2} $ , for 5232$t sub 2 <= t < b$ (closing phase). 5233.LE 5234It is easy to see that the joins between the first and second section, and 5235between the second and third section, are smooth; but that the slope of the third 5236section at the end of the cycle when $t=b$ is 5237.LB 5238.EQ 5239dg over dt ~~ = ~~ -~ 2A. 5240.EN 5241.LE 5242$A$ is the maximum amplitude of the pulse, and is reached when $t=t sub 2$. 5243.pp 5244A much simpler glottal pulse shape to implement is the filtered impulse. 5245Passing an impulse through a filter with characteristic 5246.LB 5247.EQ 52481 over {(1+sT) sup 2} 5249.EN 5250.LE 5251imparts a 12\ dB/octave decay after frequency $1/T$. 
This gives a pulse shape of 5252.LB 5253.EQ 5254g(t) ~~ = ~~ A~ t over T ~e sup {1-t/T} , 5255.EN 5256.LE 5257which is sketched in Figure 5.4. 5258.FC "Figure 5.4" 5259The pulse is the wrong way round in time 5260when compared with the desired one; but this is not important under most 5261listening conditions because phase differences are not noticeable (this 5262point is discussed further below). 5263The maximum is reached when $t=T$ and has 5264height $A$. The value zero is never actually attained, for the decay to it 5265is asymptotic, and if the slight discontinuity between pulses shown in the 5266Figure is left, the asymptotic rate of decay of the frequency spectrum will 5267be 6\ dB/octave rather than 12\ dB/octave. However, in a real implementation 5268involving filtering an impulse there will be no such discontinuity, for the 5269next pulse will start off where the last one ended. 5270.pp 5271This seems to be an attractive scheme because of its simplicity, 5272and indeed is sometimes used in speech synthesis. However, it does not have 5273the right properties when the pitch is varied, for in real glottal 5274waveforms the maximum occurs at a fixed 5275.ul 5276fraction 5277of the period, whereas the filtered impulse's maximum is at a fixed time, $T$. 5278If $T$ is chosen to make the system correct at high pitch frequencies (say 5279400\ Hz), then the pulse will be much too narrow at low pitches and sound rather 5280harsh. The only solution is to vary the filter parameters with the pitch, 5281leading to complexity again. 5282.pp 5283Holmes (1973) has made an extensive study of the effect of the glottal 5284waveshape on the naturalness of high-quality synthesized speech. 5285.[ 5286Holmes 1973 Influence of glottal waveform on naturalness 5287.] 5288He employed a rather special speech synthesizer, which provides far more 5289comprehensive and sophisticated control than most. It was driven by parameters 5290which were extracted from natural utterances by hand \(em but the process of 5291generating and tuning them took many months of a skilled person's time. 5292By using the pulse shape 5293extracted from the natural utterance, he found that synthetic and natural 5294versions could actually be made indistinguishable to most people, even under high-quality 5295listening conditions using headphones. Performance dropped quite drastically 5296when one of Rosenberg's pulse shapes, similar to the three-section one given 5297above, was used. Holmes also investigated phase effects and found that whilst 5298different pulse shapes with identical frequency spectra could easily be 5299distinguished when listening over headphones, there was no perceptible difference 5300if the listener was placed at a comfortable distance from a loudspeaker in 5301a room. This is attributable to the fact that the room itself imposes a 5302complex modification to the phase characteristics of the speech signal. 5303.pp 5304Although a great deal of care must be taken with the glottal pulse shape for very 5305high-quality synthetic speech, for speech synthesized by rule from a written 5306representation the degradation which stems from incorrect control of the 5307synthesizer parameters is much greater than that caused by using a slightly 5308inferior glottal pulse. The triangular pulse illustrated in Figure 5.3 5309has been found quite satisfactory for speech synthesis by rule. 5310.rh "Unvoiced excitation." 5311Speech quality is much less sensitive to the characteristics of the unvoiced 5312excitation. 
Broadband white noise will serve admirably. It is quite
acceptable to generate this digitally, using a pseudo-random feedback shift
register. This gives a bit sequence whose autocorrelation is zero except at
multiples of the repetition length. The repetition length
can easily be made as long as the number of states in the shift
register (less one) \(em in this case, the configuration is called
"maximal length" (Gaines, 1969).
.[
Gaines 1969 Stochastic computing advances in information science
.]
For example, an 18-bit maximal-length shift register will repeat
every $2 sup 18 -1$ cycles. If the bit-stream is used as a source of analogue
noise, the autocorrelation function will have triangular parts whose width is
twice the clock period, as shown in Figure 5.5.
.FC "Figure 5.5"
According to a well-known
result (the Wiener-Khinchine theorem; see for example Chirlian, 1973)
the power density of the frequency
spectrum is the same as the Fourier transform of the autocorrelation function.
.[
Chirlian 1973
.]
Since the feedback shift register gives a periodic autocorrelation function,
its transform is a Fourier series. The $r$'th frequency component is
.LB
.EQ
G(r) ~~ = ~~ {R sup 2} over {4 pi sup 2 r sup 2 T}
~(1~-~~cos~{{2 pi rT} over R}) ~ .
.EN
.LE
Here, $T$ is the clock period and $R=(2 sup N -1)T$ is the repetition time of
an $N$-bit shift register.
.pp
The spectrum is a bar spectrum, with components spaced
at
.LB
$
{1 over R}~~=~~{1 over {(2 sup N -1)T}}$ Hz.
.LE
These are very close together \(em with $N=18$ and
sampling at 20\ kHz (50\ $mu$sec)
the spacing becomes under 0.1\ Hz \(em and so it is reasonable to treat the
spectrum as continuous, with
.LB
.EQ
G(f) ~~ = ~~ 1 over {4 pi sup 2 f sup 2 T}~~(1~-~cos 2 pi fT) .
.EN
.LE
This spectrum is sketched in Figure 5.6(a), and the measured result of an actual
implementation in Figure 5.6(b).
.FC "Figure 5.6"
The 3\ dB point occurs when
.LB
.EQ
{G(f) over G(0)} ~~=~~{1 over 2} ~ ,
.EN
.LE
and $G(0)$ is $T/2$. Hence, at the 3\ dB point,
.LB
.EQ
{1~-~cos 2 pi fT} over {2 pi sup 2 f sup 2 T sup 2}
~~ = ~~ 1 over 2 ~ ,
.EN
.LE
which has solution $f=0.45/T$.
Thus a pseudo-random shift register generates
noise whose spectrum is substantially flat up to half the clock frequency.
Anything over 10\ kHz is therefore a suitable clocking rate for speech-quality
noise. Choose 20\ kHz to err on the conservative side. If the repetition occurs
in less than 3 or 4 seconds, it can be heard quite clearly; but above this figure
it is not noticeable. An 18-bit shift register clocked at 20\ kHz repeats
every $(2 sup 18 -1)/20000 ~ = ~ 13$ seconds, which is more than adequate.
.sh "5.3 Simulating vocal tract resonances"
.pp
The vocal tract, from glottis to lips, can be modelled as an unconstricted
tube of varying cross-section with no side branches and no sub-glottal coupling.
This has an all-pole transfer function, which can be written in the form
.LB
.EQ
H(s) ~~ = ~~
{w sub 1 sup 2} over {s sup 2 ~+~ b sub 1 s ~+~ w sub 1 sup 2}
~.~{w sub 2 sup 2} over {s sup 2 ~+~ b sub 2 s ~+~ w sub 2 sup 2} ~~ .~ .~ .
.EN
.LE
There is an unspecified (conceptually infinite) number of terms in the
product.
Each of them produces a peak in the energy spectrum,
and these are the formants we observed in Chapter 2.
.pp
Formants appear even in an over-simplified
model of the tract as a tube of uniform cross-section, with a sound source
at one end (the larynx) and open at the other (the lips).
This extremely crude model was discussed in Chapter 2, and surprisingly,
perhaps, it gives a good approximation to the observed formant frequencies
for a neutral, relaxed vowel such as that in
.ul
"a\c
bove".
.pp
Speech is made by varying the postures of the various organs of the vocal tract.
Different vowels, for example, result largely from different tongue positions
and lip postures. Naturally, such physical changes alter the frequencies of the
resonances, and successful automatic speech synthesis depends upon
accurate movement of the formants. Fortunately, only the first three or
four resonances need to be altered even for extremely realistic synthesis, and
virtually all existing synthesizers provide control over these formants only.
.rh "Analysis of a single formant."
Each formant is modelled as a second-order resonance, with transfer function
.LB
.EQ
H(s) ~~ = ~~ {w sub c sup 2} over {s sup 2 ~+~ b s ~+~ w sub c sup 2} ~ .
.EN
.LE
As will be shown below, $w sub c$ is the nominal resonant frequency in
radians/s, and $b$ is the
approximate 3\ dB bandwidth of the resonance. The term $w sub c sup 2$ in the
numerator adjusts the gain to be unity at DC ($s=0$).
.pp
To calculate the frequency response of the formant, write $s=jw$. Then the
energy spectrum is
.LB
.EQ
|H(jw)| sup 2 ~~ mark = ~~
{w sub c sup 4} over {(w sup 2 - w sub c sup 2 ) sup 2 ~+~ b sup 2 w sup 2}
.EN
.sp
.sp
.EQ
lineup = ~~
{w sub c sup 4} over
{[w sup 2 ~-~(w sub c sup 2 -~ {b sup 2} over 2 )] sup 2 ~~
+~~b sup 2 (w sub c sup 2~-~{{b sup 2} over 4})} ~ .
.EN
.sp
.LE
This reaches a maximum when the squared term in the denominator of the second
expression is zero, namely when $w=(w sub c sup 2 ~-~ b sup 2 /2) sup 1/2$.
However,
formant bandwidths are low compared with their centre frequencies, and so to
a good approximation the peak occurs
at $w=w sub c$ and is of amplitude $w sub c /b$, that
is, $20~log sub 10 w sub c /b$\ dB above the DC gain.
At frequencies higher than the peak the energy falls off as $1/w sup 4$,
a factor of 1/16 for each doubling
in frequency, and so the asymptotic decay is 12\ dB/octave.
.pp
At the points which are 3\ dB below the peak,
.LB
.EQ
|H(jw sub 3dB )| sup 2 ~~ = ~~
1 over 2 ~|H(jw sub max )| sup 2 ~~ = ~~
1 over 2 ~ times ~ {w sub c sup 2} over {b sup 2} ~ ,
.EN
.LE
and it is easy to show that
this is satisfied by $w sub 3dB ~ = ~ w sub c ~ +- ~ b/2$ to a
good approximation (neglecting higher powers of $b/w sub c$). Figure 5.7
summarizes the shape of an individual formant resonance.
.FC "Figure 5.7"
.pp
The bandwidth of a formant is fairly constant, regardless of the formant
frequency. This makes the formant filter a slightly unusual one: most
engineering applications which use variable-frequency resonances require
the bandwidth to be a constant proportion of the resonant
frequency \(em the ratio
$w sub c /b$, often called the "$Q$" of the filter, is to be constant.
5477For formants, we wish the Q to increase linearly with resonant frequency. 5478Since the amplitude gain of the formant at resonance is $w sub c /b$, 5479this peak gain increases as the formant frequency is increased. 5480.pp 5481Although it is easy to measure formant frequencies on a spectrogram 5482(cf Chapter 2), 5483it is not so easy to measure bandwidths accurately. One rather unusual method 5484was reported by van den Berg (1955), who took a subject who had had a partial 5485laryngectomy, an operation which left an opening into the vocal tract near 5486the larynx position. Into this he inserted a sound source and made a 5487swept-frequency calibration of the vocal tract! 5488.[ 5489Berg van den 1955 5490.] 5491Almost as bizarre is a 5492technique which involves setting off a spark inside the mouth of a subject 5493as he holds his articulators in a given position. 5494.pp 5495The results of several different kinds of experiment are reported by Dunn (1961), 5496and are summarized in Table 5.1, along with the formant frequency ranges. 5497.[ 5498Dunn 1961 5499.] 5500.RF 5501.in+0.5i 5502.ta 1.7i +2.5i 5503.nr x1 (\w'range of formant'/2) 5504.nr x2 (\w'range of bandwidths'/2) 5505 \h'-\n(x1u'range of formant \h'-\n(x2u'range of bandwidths 5506.nr x1 (\w'frequencies (Hz)'/2) 5507.nr x2 (\w'as measured in different'/2) 5508 \h'-\n(x1u'frequencies (Hz) \h'-\n(x2u'as measured in different 5509.nr x1 (\w'experiments (Hz)'/2) 5510 \h'-\n(x1u'experiments (Hz) 5511.nr x1 (\w'0000 \- 0000'/2) 5512.nr x2 (\w'000 \- 000'/2) 5513.nr x0 2.5i+(\w'range of formant'/2)+(\w'as measured in different'/2) 5514.nr x3 (\w'range of formant'/2) 5515 \h'-\n(x3u'\l'\n(x0u\(ul' 5516.sp 5517formant 1 \h'-\n(x1u'\0100 \- 1100 \h'-\n(x2u'\045 \- 130 5518formant 2 \h'-\n(x1u'\0500 \- 2500 \h'-\n(x2u'\050 \- 190 5519formant 3 \h'-\n(x1u'1500 \- 3500 \h'-\n(x2u'\070 \- 260 5520 \h'-\n(x3u'\l'\n(x0u\(ul' 5521.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 5522.in-0.5i 5523.MT 2 5524Table 5.1 Different estimates of formant bandwidths, with range of 5525formant frequencies for reference 5526.TE 5527Note that the bandwidths really are narrow compared with the resonant frequencies 5528of the filters, except at the lower end of the formant 1 range. Choosing the 5529lowest bandwidth estimate leads to an amplification factor at resonance of 50 for formant 2 5530when its frequency is at the top of its range; and formant 3 happens to give 5531the same value. 5532.rh "Series synthesizers." 5533The simplest realization of the vocal tract filter is a chain of formant 5534filters in series, as illustrated in Figure 5.8. 5535.FC "Figure 5.8" 5536This leads to particular difficulties if the frequencies of two formants 5537stray close together. The worst case occurs if formants 2 and 3 have the 5538same resonant frequencies, at the top of the range of formant 2, namely 2500\ Hz. 5539In this case, and if the bandwidths of the formants are set to the lowest 5540estimates, a combined amplification factor 5541of $(2500/50) times (2500/70)=1800$ is 5542obtained at the point of resonance \(em that is, 554365\ dB above the DC value. This is enough 5544to tax most analogue implementations, and can evoke clipping in the formant 5545filters, with a very noticeable effect on speech quality. This 5546extreme case will not occur during synthesis of realistic speech, for 5547although the formant 5548.ul 5549ranges 5550overlap, the values for any particular (human) sound will not coincide exactly. 
However, 5551it illustrates the difficulty of designing a series synthesizer which copes 5552sensibly with arbitrary parameter settings, and explains why designers often 5553choose formant bandwidths in the top half of the ranges given in Table 5.1. 5554.pp 5555The problem of excessive amplification within a series synthesizer can be 5556alleviated to a small extent by choosing carefully the order in which the 5557filters are placed in the chain. In a linear system, of course, the order in 5558which the components occur does not matter. 5559In physical implementations, however, it is advantageous to minimize extreme 5560amplification at intermediate points. By placing the formant 1 filter between 5561formants 2 and 3, the formant 2 resonance is attenuated somewhat before it 5562reaches formant 3. Continuing with the extreme example above, where both 5563formants 2 and 3 were set to 2500\ Hz; assume that formant 1 is at its 5564nominal value of 500\ Hz. It provides attenuation at approximately 12\ dB/octave 5565above this, and so at the formant 2 peak, 2.3\ octaves higher, the attenuation 5566is 28\ dB. Thus the gain at 2500\ Hz, 5567which is $20 ~ log sub 10 ~ 2500/50 ~ = ~ 34$\ dB after 5568passing through the formant 2 filter, is reduced to 6\ dB by formant 1, only 5569to be increased by $20 ~ log sub 10 ~ 2500/70 ~ = ~ 31$\ dB to 5570a value of 37\ dB by formant 3. 5571This avoids the extreme 65\ dB gain of formants 2 and 3 combined. 5572.pp 5573Figure 5.8 shows only three formant filters modelled explicitly. 5574The effect of the rest \(em and they do have an effect, although it is small 5575at low frequencies \(em is 5576incorporated by lumping them together into the "higher-formant correction" filter. 5577To calculate the characteristics of this filter, assume that the lumped 5578formants have the values given by the simple uniform-tube model of Chapter 2, 5579namely 3500\ Hz for formant 4, 4500\ Hz for formant 5, and, in general, 5580$500(2n-1)$\ Hz for formant $n$. The effect of each of these on the spectrum is 5581.LB 5582.EQ 558310~ log sub 10 {w sub n sup 4} over {(w sup 2 ~-~w sub n sup 2 ) sup 2 5584~~+~~b sub n sup 2 w sup 2} 5585~~ = ~~ -~ 10~ log sub 10 ~[(1~-~~{{w sup 2} over {w sub n sup 2}}) sup 2 5586~~+~~ {{b sub n sup 2 w sup 2} over {w sub n sup 4}}] 5587.EN 5588dB, 5589.LE 5590following from what was calculated above. 5591We will have to approximate this by assuming that 5592$b sub n sup 2 /w sub n sup 2$ is 5593negligible \(em this is quite reasonable for these higher formants because 5594Table 5.1 shows that the bandwidth does not increase in proportion to the 5595formant frequency range \(em and approximate the logarithm by the first 5596term of its series expansion: 5597.LB 5598.EQ 5599-10 ~ log sub 10 ~ (1~-~~{{w sup 2} over {w sub n sup 2}}) sup 2 5600~~ = ~~ -20~ log sub 10 ~ e ~ log sub e 5601(1~-~~{{w sup 2} over {w sub n sup 2}}) 5602~~ = ~~ 20~ log sub 10 ~ e ~ times ~ {w sup 2} over {w sub n sup 2} ~ . 5603.EN 5604.LE 5605.pp 5606Now the total effect of formants 4, 5, ... at frequency $f$\ Hz (as distinct 5607from $w$\ radians/s) is 5608.LB 5609.EQ 561020~ log sub 10 ~ e ~ times ~ sum from n=4 to infinity 5611~{{f sup 2} over {500 sup 2 (2n-1) sup 2}} ~ . 5612.EN 5613.LE 5614This expression is 5615.LB 5616.EQ 561720~ log sub 10 ~ e ~ times ~ 5618{{f sup 2} over {500 sup 2}}~~(~sum from n=1 to infinity 5619~{1 over {(2n-1) sup 2}} ~~-~~ sum from n=1 to 3 ~{1 over {(2n-1) sup 2}}~) 5620~ . 
5621.EN 5622.LE 5623The infinite sum can actually be calculated in closed form, and is equal 5624to $pi sup 2 /8$. Hence the total correction is 5625.LB 5626.EQ 562720~ log sub 10 ~ e ~ times {{f sup 2} over {500 sup 2}} 5628~~(~{pi sup 2} over 8 ~~-~~ sum from n=1 to 3 ~{1 over {(2n-1) sup 2}}~) 5629~~ = ~~ 2.87 times 10 sup -6 f sup 2 5630.EN 5631dB. 5632.LE 5633.pp 5634Although this may at first seem to be a rather small correction, 5635it is in fact 72\ dB when 5636$f=5$\ kHz! On further reflection this is not an unreasonable figure, for the 563712\ dB/octave decays contributed by formants 1, 2, and 3 must all be annihilated 5638by the higher-formant correction to give an overall flat spectral trend. 5639In fact, formant 1 will contribute 564012\ dB/octave from 500\ Hz (3.3\ octaves to 5\ kHz, representing 40\ dB); formant 56412 will contribute 12\ dB/octave from 1500\ Hz (1.7\ octaves to 5\ kHz, representing 564221\ dB); and formant 3 will contribute 12\ dB/octave from 2500\ Hz (1\ octave to 5\ kHz, 5643representing 12\ dB). 5644These sum to 73\ dB. 5645.pp 5646If the first five formants are synthesized explicitly instead of just the 5647first three, the correction is 5648.LB 5649.EQ 565020~ log sub 10 ~ e ~ times ~ {{f sup 2} over {500 sup 2}} 5651~~(~{pi sup 2} over 8 ~-~~ sum from n=1 to 5 ~{1 over {(2n-1) sup 2}}~) 5652~~ = ~~ 1.73 times 10 sup -6 f sup 2 5653.EN 5654dB, 5655.LE 5656giving a rather more reasonable value of 43\ dB when $f=5$\ kHz. In actual 5657implementations, fixed filters are sometimes included explicitly for 5658formants 4 and 5. Although this lowers the gain of the higher-formant 5659correction filter, the total amplification at 5\ kHz of the combined correction 5660is still 72\ dB. If one is less demanding and aims for a synthesizer that 5661produces a correct spectrum only up to 3.5\ kHz, it is 35\ dB. 5662This places quite stringent requirements on the preceding formant filters if 5663the stray noise that they generate internally is not to be amplified to 5664perceptible magnitudes by the correction filter at high frequencies. 5665.pp 5666Explicit inclusion of fixed filters for formants 4 and 5 undoubtedly improves 5667the accuracy of the higher-formant correction. Recall that the above derivation 5668of the correction filter characteristic used the first-order approximation 5669.LB 5670.EQ 5671log sub e (1~-~{{w sup 2} over {w sub n sup 2}}) 5672~~ = ~~ -~ {w sup 2} over {w sub n sup 2} ~ , 5673.EN 5674.LE 5675which is only valid if $w << w sub n$. 5676Thus it only holds at frequencies less than 5677the highest explicitly synthesized formant, 5678and so with formants 4 (3.5\ kHz) and 56795 (4.5\ kHz) included a reasonable correction should be obtained for 5680telephone-quality speech. However, detailed analysis with a second-order 5681approximation shows that the coefficient of the neglected term is in fact 5682small (Fant, 1960). 5683.[ 5684Fant 1960 Acoustic theory of speech production 5685.] 5686A second, perhaps more compelling, reason for explicitly 5687including a couple of fixed formants is that the otherwise enormous amplification 5688provided by the correction can be distributed throughout the formant chain. 5689We saw earlier why there is reason to prefer the 5690order F3\(emF1\(emF2 over F1\(emF2\(emF3. 5691With explicit formants 4 and 5, a suitable order which helps 5692to keep the amplification at intermediate points in the chain within reasonable 5693bounds is F3\(emF5\(emF2\(emF4\(emF1. 5694.rh "Parallel synthesizers." 
5695A series synthesizer models the vocal tract resonances by a chain of formant 5696filters in series. A parallel synthesizer utilizes a parallel connection of 5697filters as illustrated in Figure 5.9. 5698.FC "Figure 5.9" 5699.pp 5700Consider a parallel combination of two formants with individually-controllable 5701amplitudes. The combined transfer function is 5702.LB 5703.EQ 5704H(s) ~~ mark = ~~ {A sub 1 w sub 1 sup 2} over 5705{s sup 2 ~+~ b sub 1 s ~+~ w sub 1 sup 2} 5706~~+~~{A sub 2 w sub 2 sup 2} over {s sup 2 ~+~ b sub 2 s ~+~ w sub 2 sup 2} 5707.EN 5708.sp 5709.sp 5710.EQ 5711lineup = ~~ { (A sub 1 w sub 1 sup 2 + A sub 2 w sub 2 sup 2 )s sup 2 5712~+~(A sub 1 b sub 2 w sub 1 sup 2 + A sub 2 b sub 1 w sub 2 sup 2 )s 5713~+~ (A sub 1 +A sub 2 )w sub 1 sup 2 w sub 2 sup 2 } 5714over 5715{ (s sup 2 ~+~b sub 1 s~+~w sub 1 sup 2 ) 5716(s sup 2 ~+~b sub 2 s~+~w sub 2 sup 2 ) } 5717.EN 5718.LE 5719If the formant bandwidths $b sub 1$ and $b sub 2$ 5720are equal and the amplitudes are 5721chosen as 5722.LB 5723.EQ 5724A sub 1 ~~=~~ {w sub 2 sup 2} over {w sub 2 sup 2 -w sub 1 sup 2} 5725~~~~~~~~ 5726A sub 2 ~~=~~-~ {w sub 1 sup 2} over {w sub 2 sup 2 -w sub 1 sup 2} ~ , 5727.EN 5728.LE 5729then the transfer function becomes the same as that of a two-formant series synthesizer, 5730namely 5731.LB 5732.EQ 5733H(s) ~~ = ~~ {w sub 1 sup 2} over {s sup 2 ~+~ b sub 1 s ~+~ w sub 1 sup 2} 5734~ . ~{w sub 2 sup 2} over {s sup 2 ~+~ b sub 2 s ~+~ w sub 2 sup 2} ~ . 5735.EN 5736.LE 5737The argument can be extended to any number of formants, under the assumption 5738that the formant bandwidths are equal. Note that the signs of $A sub 1$ 5739and $A sub 2$ 5740differ: in general the formant amplitudes for a parallel synthesizer alternate 5741in sign. 5742.pp 5743In theory, therefore, it would be possible to use five parallel formants to 5744model a five-formant series synthesizer exactly. Then the same higher-formant 5745correction filter would be needed for the parallel synthesizer as for the 5746series one. If the formant amplitudes were set slightly incorrectly, however, 5747the five filters would not combine to give a total of 60\ dB/octave high-frequency 5748decay above the resonances. It is easy to see this in the context of the 5749simplified two-formant combination above: if the amplitudes were not chosen 5750exactly right then the $s sup 2$ 5751term in the numerator would not be quite zero. 5752Then, the decay in the two-formant combination would be \-12\ dB/octave instead 5753of \-24\ dB/octave, and in the five-formant case the decay would in fact still be 5754\-12\ dB/octave. Advantage can be taken of this to equalize the levels 5755within the synthesizer so that large amplitude variations do not occur. 5756This can best be done by associating relatively low-gain fixed correction filters 5757with each formant instead of providing one comprehensive correction to the 5758combined spectrum: these are shown in Figure 5.9. 5759Suitable correction filters 5760have been determined empirically by Holmes (1972). 5761.[ 5762Holmes 1972 Speech synthesis 5763.] 5764They provide a 6\ dB/octave 5765lift above 640\ Hz for formant 1, and 6\ dB/octave lift above 300\ Hz for formant 57662. Formants 3 and 4 are uncorrected, whilst for formant 5 the correction begins 5767as a 6\ dB/octave decay above 600\ Hz and increases to an 18\ dB/octave decay 5768above 5.5\ kHz. 
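.pp
The algebraic equivalence between the two-formant parallel and series
combinations noted above is easy to check numerically. The fragment below is
a minimal sketch in Python; the formant frequencies of 500\ Hz and 1500\ Hz and
the common 50\ Hz bandwidth are invented values chosen only for the example.
If either amplitude is perturbed slightly the difference is no longer zero,
and the asymptotic decay reverts to 12\ dB/octave, as described above.
.sp
.nf
.in +2n
# Check that two parallel formants with amplitudes
#   A1 = w2^2/(w2^2 - w1^2)   and   A2 = -w1^2/(w2^2 - w1^2)
# and equal bandwidths reproduce the series (cascade) response.
import numpy as np

def formant(s, wc, b):
    # second-order resonance with unity gain at DC
    return wc**2 / (s**2 + b*s + wc**2)

f1, f2 = 500.0, 1500.0              # illustrative formant frequencies, Hz
w1, w2 = 2*np.pi*f1, 2*np.pi*f2     # rad/s
b = 2*np.pi*50.0                    # common bandwidth, rad/s

A1 = w2**2 / (w2**2 - w1**2)
A2 = -w1**2 / (w2**2 - w1**2)

f = np.linspace(10.0, 4000.0, 400)  # evaluation frequencies, Hz
s = 1j * 2*np.pi*f

series = formant(s, w1, b) * formant(s, w2, b)
parallel = A1*formant(s, w1, b) + A2*formant(s, w2, b)
print(np.max(np.abs(series - parallel)))    # essentially zero
.in -2n
.fi
.sp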
5769.pp 5770The disadvantage of a parallel synthesizer is that the amplitudes of the 5771formants must be specified as well as their frequencies. (Furthermore, the 5772formant bandwidths should all be equal, but they are often chosen to be such 5773in series synthesizers because of the uncertainty as to their exact 5774values.) However, the extra amplitude parameters clearly give greater 5775control over the frequency spectrum of the synthesized speech. 5776.pp 5777A good example of how this extra control can usefully be exploited is the 5778synthesis of nasal sounds. 5779Nasalization introduces a cavity parallel to the oral tract, as illustrated 5780in Figure 5.10, and this causes zeros in the transfer function. 5781.FC "Figure 5.10" 5782It is as if two different copies of the vocal tract transfer function, one for 5783the oral and the other for the nasal passage, were added 5784together. We have seen the effect of this above when considering parallel 5785synthesis. The combination 5786.LB 5787.EQ 5788H(s) ~~ = ~~ {A sub 1 w sub o sup 2} over 5789{s sup 2 ~+~ b sub o s ~+~ w sub o sup 2} 5790~~+~~{A sub 2 w sub n sup 2} 5791over {s sup 2 ~+~ b sub n s ~+~ w sub n sup 2} ~ , 5792.EN 5793.LE 5794where the subscript "$o$" stands for oral and "$n$" for nasal, 5795produces zeros in the 5796numerator (unless the amplitudes are carefully adjusted to avoid them). 5797These cannot be modelled by a series synthesizer, but they obviously can be 5798by a parallel one. 5799.pp 5800Although they are certainly needed for accurate imitation of human speech, 5801transfer function zeros to simulate nasal sounds are not essential for 5802synthesis of intelligible English. It is not difficult to get a sound 5803like a nasal consonant 5804(\c 5805.ul 5806n, 5807or 5808.ul 5809m\c 5810) 5811with an all-pole synthesizer. 5812Nevertheless, it is certainly true that a parallel synthesizer gives better 5813.ul 5814potential 5815control over the spectrum than a series one. Whether the added flexibility 5816can be used properly by a synthesis-by-rule computer program is another matter. 5817.rh "Implementation of formant filters." 5818Formant filters can be built in either analogue or digital form. A 5819second-order resonance is needed, whose centre frequency can be controlled 5820but whose bandwidth is fixed. If the control can be arranged as two 5821tracking resistors, then the simple analogue configuration of Figure 5.11, 5822with two operational amplifiers, will suffice. 5823.FC "Figure 5.11" 5824.pp 5825The transfer function of this arrangement is 5826.LB 5827.EQ 5828- ~~ { 1/C sub 1 R sub 1 C sub 2 R sub 2 } over 5829{ s sup 2 ~~+~~ {1 over {C sub 2 R sub 2}}~s 5830~~+~~{1 over {C sub 1 R' sub 1 C sub 2 R sub 2 }}} ~ , 5831.EN 5832.LE 5833which characterizes it as a low-pass resonator with DC gain 5834of $- R' sub 1 /R sub 1 $, bandwidth of $1/2 pi C sub 2 R sub 2$\ Hz, and 5835centre frequency of $1/2 pi (C sub 1 R' sub 1 C sub 2 R sub 2 ) sup 1/2$\ Hz. 5836Tracking $R' sub 1$ with $R sub 1$ ensures that the DC gain remains constant, 5837and that the centre frequency follows $R sub 1 sup -1/2$. Moreover, 5838neither is especially sensitive to slight departures from exact tracking 5839of $R' sub 1$ with $R sub 1$. 5840Such a filter has been used in a simple hand-controlled speech synthesizer, 5841built for demonstration and amusement (Witten and Madams, 1978). 5842.[ 5843Witten Madams 1978 Chatterbox 5844.] 
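.pp
As a quick numerical illustration of these relationships, the short sketch
below (in Python) evaluates the DC gain, bandwidth and centre frequency for
one hypothetical set of component values; the values are invented for the
example and are not taken from any published design.
.sp
.nf
.in +2n
# Resonator of Figure 5.11: DC gain, bandwidth and centre frequency
# for hypothetical component values.
import math

C1 = 10e-9       # farads (illustrative values only)
C2 = 10e-9
R1 = 1.3e3       # ohms
R1p = 1.3e3      # R'1, tracking R1
R2 = 200e3

dc_gain = -R1p / R1                                  # -R'1/R1
bandwidth = 1.0 / (2*math.pi*C2*R2)                  # Hz, about 80 Hz
centre = 1.0 / (2*math.pi*math.sqrt(C1*R1p*C2*R2))   # Hz, about 990 Hz

print(dc_gain, bandwidth, centre)
# Halving R1 and R'1 together leaves the gain and bandwidth alone
# but raises the centre frequency by a factor of root 2.
.in -2n
.fi
.sp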
However, the need for tracking resistors, and the inverse square root variation
of the formant frequency with $R sub 1$, make it rather unsuitable for serious
applications.
.pp
A better analogue filter is the ring-of-three configuration
shown in Figure 5.12.
.FC "Figure 5.12"
(Ignore the secondary output for now.) Control
is achieved over the centre frequency by two multipliers, driven from
the same control input $k$. These have a high-impedance output, producing a
current $kx$ if the input voltage is $x$.
It is not too difficult to show that the transfer function of the circuit is
.LB
.EQ
- ~~ { {k sup 2} over {C sup 2} } over
{ s sup 2 ~~+~~ 2 over RC ~s
~~+~~{1+k sup 2 R sup 2} over {R sup 2 C sup 2} } ~ .
.EN
.LE
Suppose that $R$ is chosen so that $k sup 2 R sup 2 ~ >>~ 1$. Then this is a
unity-gain resonator with constant bandwidth $1/ pi RC$\ Hz and centre
frequency $k/2 pi C$\ Hz. Note that it is the combination of both multipliers that
makes the centre frequency grow linearly with $k$: with one multiplier there
would be a square-root relationship.
.pp
The ring-of-three filter of Figure 5.12 is arranged in a slightly unusual
way, with an inverting stage at the beginning and the two resonant stages
following it. This ensures that the signal level at intermediate
points in the filter does not exceed that at the output, and gives the filter
the best chance of coping with a wide range of input amplitudes without
clipping. This contrasts markedly with the resonator of Figure 5.11, where
the voltage at the output of the first integrator is $w/b$ times the final output \(em a
factor of 50 in the worst case.
.pp
For a digital implementation of a formant, consider the recurrence relation
.LB
.EQ
y(n)~ = ~~ a sub 1 y(n-1) ~-~ a sub 2 y(n-2) ~+~ a sub 0 x(n) ,
.EN
.LE
where $x(n)$ is the input and $y(n)$ the output at time $n$,
$y(n-1)$ and $y(n-2)$ are the previous two values of the output,
and $a sub 0$, $a sub 1$, and $a sub 2$ are (real) constants.
The minus sign is in front of the second term because it makes $a sub 2$
turn out to be
positive. To calculate the $z$-transform version of this relationship, multiply
through by $z sup -n$ and sum from $n=- infinity$ to $infinity$ :
.LB "nn"
.EQ
sum from {n=- infinity} to infinity ~y(n)z sup -n ~~ mark =~~
a sub 1 sum from {n=- infinity} to infinity ~y(n-1)z sup -n ~~-~
a sub 2 sum from {n=- infinity} to infinity ~y(n-2)z sup -n ~~+~
a sub 0 sum from {n=- infinity} to infinity ~x(n)z sup -n
.EN
.sp
.EQ
lineup = ~~ a sub 1 z sup -1 ~ sum ~y(n-1)z sup -(n-1) ~~-~~
a sub 2 z sup -2 ~ sum ~y(n-2)z sup -(n-2)
~~+~~ a sub 0 ~ sum ~x(n)z sup -n ~ .
.EN
.LE "nn"
Writing this in terms of $z$-transforms,
.LB
.EQ
Y(z)~ = ~~ a sub 1 z sup -1 Y(z) ~-~ a sub 2 z sup -2 Y(z) ~+~ a sub 0 X(z) .
.EN
.LE
Thus the input-output transfer function of the system is
.LB
.EQ
H(z)~ = ~~ Y(z) over X(z)
~~=~~ {a sub 0 } over {1~-~a sub 1 z sup -1 ~+~a sub 2 z sup -2} ~ .
.EN
.LE
.pp
We learned in the previous chapter that the frequency response is obtained
from the $z$-transform of a system by replacing $z sup -1$
by $e sup {-j2 pi fT}$, where $f$ is the frequency variable in\ Hz.
5923Hence the amplitude response of the digital formant filter is 5924.LB 5925.EQ 5926|H(e sup {j2 pi fT} )| sup 2 5927~~ = ~~ left [ {a sub 0} over {1~-~a sub 1 e sup {-j2 pi fT} 5928~+~a sub 2 e sup {-j4 pi fT} } ~ right ] sup 2 ~ . 5929.EN 5930.sp 5931.LE 5932It is fairly obvious from this that a DC gain of 1 is obtained if 5933.LB 5934.EQ 5935a sub 0 ~ = ~~ 1 ~-~ a sub 1 ~+~ a sub 2 , 5936.EN 5937.LE 5938for $e sup {-j2 pi fT}$ is 1 at a frequency of 0\ Hz. Some manipulation is 5939required to show that, under the usual assumption that the bandwidth is 5940small, the centre frequency is 5941.LB 5942.EQ 59431 over {2 pi T} ~~ cos sup -1 ~ {a sub 1} over {2 a sub 2 sup 1/2} ~ 5944.EN 5945Hz. 5946.LE 5947Furthermore, the 3\ dB bandwidth of the resonance is given approximately by 5948.LB 5949.EQ 5950-~ 1 over {2 pi T} ~~ log sub e a sub 2 ~ 5951.EN 5952Hz. 5953.LE 5954.pp 5955As an example, Figure 5.13 shows an amplitude response for this digital filter. 5956.FC "Figure 5.13" 5957The parameters $a sub 0$, $a sub 1$ and $a sub 2$ 5958were generated from the above 5959relationships for a sampling frequency of 8\ kHz, centre frequency of 1\ kHz, 5960and bandwidth of 75\ Hz. 5961It exhibits a peak of approximately the right bandwidth at the correct 5962frequency, 1\ kHz. Note that the response is flat at half the sampling 5963frequency, for the frequency response from 4\ kHz to 8\ kHz is just a reflection of 5964that up to 4\ kHz. 5965This contrasts sharply with that of an analogue formant filter, also shown 5966in Figure 5.13, which slopes 5967at \-12\ dB/octave at frequencies above resonance. 5968.pp 5969The behaviour of a digital formant filter at frequencies above 5970resonance actually makes it preferable to an analogue implementation. 5971We saw earlier that considerable trouble must be taken with the latter to 5972compensate for the cumulative effect of \-12\ dB/octave at higher frequencies for 5973each of the formants. 5974This is not necessary with digital implementations, for the response of 5975a digital formant filter is flat at half the sampling frequency. In fact, further 5976study shows that digital synthesizers without any higher-pole correction 5977give a closer approximation to the vocal tract than analogue ones with higher-pole 5978correction (Gold and Rabiner, 1968). 5979.[ 5980Gold Rabiner 1968 Analysis of digital and analogue formant synthesizers 5981.] 5982.rh "Time-domain methods." 5983An interesting alternative to frequency-domain speech synthesis is to construct 5984the formants in the time domain. When a second-order resonance is excited by 5985an impulse, an exponentially decaying sinusoid is produced, as illustrated by 5986Figure 5.14. 5987.FC "Figure 5.14" 5988The oscillation occurs at the resonant frequency of the filter, 5989while the decay is related to the bandwidth. In fact, if the formant filter 5990has transfer function 5991.LB 5992.EQ 5993{w sup 2} over {s sup 2 ~+~ b s ~+~ w sup 2} ~ , 5994.EN 5995.LE 5996the time waveform for impulsive excitation is 5997.LB 5998.EQ 5999x(t)~ = ~~ w~ e sup -bt/2 ~ sin ~ wt ~~~~~~~~ 6000.EN 6001(neglecting $b sup 2 /w sup 2$). 6002.LE 6003It is the combination of several such time waveforms, coupled with the regular 6004reappearance of excitation at the pitch period, that produces the characteristic 6005wiggly waveform of voiced speech. 6006.pp 6007Now suppose we take a sine wave of frequency $w$ and multiply it by a 6008decaying exponential $e sup -bt/2$. 
This gives a signal
.LB
.EQ
x(t)~ = ~~ e sup -bt/2 ~ sin ~ wt ,
.EN
.LE
which is identical with the filtered impulse except for a factor $w$.
If there are several formants in parallel, all with the same bandwidth,
the exponential factor is the same for each:
.LB
.EQ
x(t)~ = ~~ e sup -bt/2 ~ (A sub 1 ~ sin ~ w sub 1 t
~~+ ~~ A sub 2 ~ sin ~ w sub 2 t ~~ + ~~ A sub 3 ~ sin ~ w sub 3 t) .
.EN
.LE
$A sub 1$, $A sub 2$, and $A sub 3$ control the formant amplitudes,
as in an ordinary parallel synthesizer,
except that they need adjusting to account for the missing
factors $w sub 1$, $w sub 2$, and $w sub 3$.
.pp
A neat way of implementing such a synthesizer digitally is to store one cycle of a
sine wave in a read-only memory (ROM). Then, the formant frequencies can be
controlled by reading the ROM at different rates. For example, if twice the
basic frequency is desired, every second value should be read.
Multiplication is needed for amplitude control of each formant: this can be
accomplished by shifting the digital word (each place shifted accounts for
6\ dB of attenuation). Finally, the exponential damping factor can be
provided in analogue hardware by a single capacitor after the D/A converter.
This implementation gives a system for hardware-software synthesis which
involves an absolutely minimal amount of extra hardware apart from the computer,
and does not need hardware multiplication for real-time operation.
It could easily be made to work in real time with a microprocessor coupled
to a D/A converter, damping capacitor, and fixed tone-control filter to give
the required spectral equalization.
.pp
Because the overall spectral decay of an impulse exciting a second-order
formant filter is 12\ dB/octave, the appropriate equalization is +6\ dB/octave
lift at high frequencies, to give an overall \-6\ dB/octave spectral trend.
.pp
Note, however, that this synthesis model is an extremely basic one. Only
impulsive excitation can be accommodated. For fricatives, which we will
discuss in more detail below, a different implementation is needed. A
hardware noise generator, with a few fixed filters \(em one
for each fricative type \(em will suffice for a simple system. More damaging
is the lack of aspiration, where random noise excites the vocal tract resonances.
This cannot be simulated in the model. The
.ul
h
sound can be provided by
treating it as a fricative, and although it will not sound completely realistic,
because there will be no variation with the formant positions of adjacent phonemes,
this can be tolerated because
.ul
h
is not too important for speech intelligibility.
A bigger disadvantage is the lack of proper aspiration control for producing
unvoiced stops, which as mentioned in Chapter 2 consist of a silent phase
followed by a burst of aspiration.
Experience has shown that although it is difficult to drive such a synthesizer
from a software synthesis-by-rule system, quite intelligible output can
be obtained if parameters are derived from real speech and tweaked by hand.
Then, for each aspiration burst the most closely-matching fricative sound
can be used.
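.pp
Before leaving the subject of vocal tract resonances, the digital formant
filter described earlier in this section can be drawn together in a short
sketch. The fragment below (in Python) derives the three coefficients from a
chosen centre frequency and bandwidth using the relations given above, and
applies the difference equation to an input sequence; the 8\ kHz sampling rate,
1\ kHz centre frequency and 75\ Hz bandwidth repeat the example of Figure 5.13,
and the function names are invented for the illustration.
.sp
.nf
.in +2n
# Second-order digital formant filter.  From the relations above:
#   bandwidth B = -(1/2piT) ln a2    =>  a2 = exp(-2 pi B T)
#   centre    F = (1/2piT) acos(a1/(2 sqrt(a2)))
#                                    =>  a1 = 2 sqrt(a2) cos(2 pi F T)
#   unity DC gain                    =>  a0 = 1 - a1 + a2
import math

def formant_coefficients(F, B, fs):
    T = 1.0 / fs
    a2 = math.exp(-2*math.pi*B*T)
    a1 = 2*math.sqrt(a2)*math.cos(2*math.pi*F*T)
    a0 = 1 - a1 + a2
    return a0, a1, a2

def formant_filter(x, F, B, fs):
    # y(n) = a1 y(n-1) - a2 y(n-2) + a0 x(n)
    a0, a1, a2 = formant_coefficients(F, B, fs)
    y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = a1*y1 - a2*y2 + a0*xn
        y.append(yn)
        y1, y2 = yn, y1
    return y

# the example of Figure 5.13: 8 kHz sampling, 1 kHz centre, 75 Hz bandwidth
impulse = [1.0] + [0.0]*199
response = formant_filter(impulse, F=1000.0, B=75.0, fs=8000.0)
# the impulse response is a sinusoid at about 1 kHz, decaying at a
# rate set by the bandwidth
.in -2n
.fi
.sp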
6071.sh "5.4 Aspiration and frication" 6072.pp 6073The model of the vocal tract as a filter which affects the frequency spectrum 6074of the basic voiced excitation breaks down if there are constrictions in it, 6075for these introduce new sound sources caused by turbulent air. 6076The generation of unvoiced excitation has been discussed earlier in this 6077chapter: now we must consider how to simulate the filtering action of 6078the vocal tract for unvoiced sounds. 6079.pp 6080Aspiration and frication need to be dealt with separately. The former 6081is caused by excitation at the vocal cords \(em the cords are held 6082so close together that turbulent noise is produced. 6083This noise passes through the same vocal tract filter that modifies voiced 6084sounds, and the same kind of formant structure can be observed. 6085All that is needed to simulate it is to replace the voiced excitation 6086source by white noise, as shown in the upper part of Figure 5.15. 6087.FC "Figure 5.15" 6088.pp 6089Speech can be whispered by substituting aspiration for voicing throughout. 6090Of course, there is no fundamental frequency associated with aspiration. 6091An interesting way of assessing informally the degradation caused by inadequate 6092pitch control in a speech synthesis-by-rule system is to listen to 6093whispered speech, in which pitch variations play no part. 6094.pp 6095Voiced and aspirative excitation are rarely produced at the same time 6096in natural speech (but see the discussion in Chapter 2 about breathy voice). 6097However, the excitation can change from one to the other quite quickly, and 6098when this happens there is no discontinuity in the formant structure. 6099.pp 6100Fricative, or sibilant, excitation is quite different from aspiration, 6101because it introduces a new sound source at a different place from the vocal 6102cords. The constriction which produces the sound may be at the lips, 6103the teeth, the hard ridge just behind the top front teeth, or further 6104back along the palate. 6105These positions each produce a different sound 6106(\c 6107.ul 6108f, 6109.ul 6110th, 6111.ul 6112s, 6113and 6114.ul 6115sh 6116respectively). However, smooth transitions from one of these sounds to another 6117do not occur in natural speech; and dynamical movement of the frequency 6118spectrum during a fricative is unnecessary for speech synthesis. 6119.pp 6120It is necessary, however, to be able to produce an approximation to the 6121noise spectrum for each of these sound types. This is commonly achieved 6122by a single high-pass resonance whose centre frequency can be controlled. 6123This is the purpose of the secondary output 6124of the formant filter of Figure 5.12. 6125Taking the output from this point gives a high-pass instead of a low-pass 6126resonance, and this same filter configuration is quite acceptable for 6127fricatives. Figure 5.15 shows the fricative sound path as a noise generator 6128followed by such a filter. 6129.pp 6130Unlike aspiration, fricative excitation is frequently combined with voicing. 6131This gives the voiced fricative sounds 6132.ul 6133v, 6134.ul 6135dh, 6136.ul 6137z, 6138and 6139.ul 6140zh. 6141It is possible to produce frication and aspiration together, and although 6142there are no examples of this in English, speech synthesis-by-rule 6143programs often use a short burst of aspiration 6144.ul 6145and 6146frication when simulating the opening of unvoiced stops. 
6147Separate amplitude controls are therefore needed for voicing and frication, 6148but the former can be used for aspiration as well, with a "glottal excitation 6149type" switch to indicate aspiration rather than voicing. 6150.sh "5.5 Summary" 6151.pp 6152A resonance speech synthesizer consists of a vocal tract filter, excited by 6153either a periodic pitch pulse or aspiration noise. In addition, a set of 6154sibilant sounds must be provided. The vocal tract filter is dynamic, with 6155three controllable resonances. These, coupled with some fixed spectral 6156compensation, give it a fairly high order \(em about 10 complex poles are 6157needed. Although several different sibilant sound types must be simulated, 6158dynamical movement is less important in fricative sound spectra than 6159for voiced and aspirated sounds because 6160smooth transitions between one fricative and another are not important 6161in speech. 6162However, fricative timing and amplitude must be controlled rather precisely. 6163.pp 6164The speech synthesizer is controlled by several parameters. 6165These include fundamental frequency (if voiced), amplitude of voicing, 6166frequency of the first few \(em typically three \(em formants, 6167aspiration amplitude, sibilance amplitude, and frequency of one (or more) 6168sibilance filters. 6169Additionally, if the synthesizer is a parallel one, parameters for the 6170amplitudes of individual formants will need to be included. 6171It may be that some control over formant bandwidths is provided too. 6172Thus synthesizers have from eight up to about 20 parameters (Klatt, 1980, 6173describes one with 20 parameters). 6174.[ 6175Klatt 1980 Software for a cascade/parallel formant synthesizer 6176.] 6177.pp 6178The parameters are supplied to the synthesizer at regular intervals of time. 6179For a 10-parameter synthesizer, the control can be thought of as a set of 618010 graphs, each representing the time evolution of one parameter. 6181They are usually called parameter 6182.ul 6183tracks, 6184the terminology dating from the days when a track was painted on a glass 6185slide for each parameter to provide dynamic control of the synthesizer 6186(Lawrence, 1953). 6187.[ 6188Lawrence 1953 6189.] 6190The pitch track is often called a pitch 6191.ul 6192contour; 6193this is a common phonetician's usage. 6194Do not confuse this with the everyday meaning of "contour" 6195as a line joining points of equal height on a map \(em a pitch contour is 6196just the time evolution of the pitch frequency. 6197.pp 6198For computer-controlled synthesizers, of course, the parameter tracks 6199are sampled, typically every 5 to 20\ msec. 6200The rate is determined by the need to generate fast amplitude transitions 6201for nasals and stop consonants. 6202Contrast it with the 125\ $mu$sec sampling period needed to digitize 6203telephone-quality speech. 6204The raw data rate for a 10-parameter synthesizer updated every 10 msec 6205is 1,000 parameters/sec, or 6\ Kbit/s if each parameter is represented 6206by 6\ bits. 6207This is a substantial reduction over the 56\ Kbit/s needed for PCM representation. 6208For speech synthesis by rule (Chapter 7), these parameter tracks 6209are generated by a computer program from a phonetic (or English) 6210version of the utterance, lowering the data rate by a further one or two 6211orders of magnitude. 6212.pp 6213Filters for speech 6214synthesizers can be implemented in either analogue or digital form. 
6215High-order filters are usually broken down into second-order sections in 6216parallel or in series. A third possibility, which has not been discussed 6217above, is to implement a single high-order filter directly. Finally, the 6218action of formant filters can be synthesized in the time domain. This gives 6219eight possibilities which are summarized in Table 5.2. 6220.RF 6221.in +0.5i 6222.ta 2.1i +2.0i 6223.nr x1 (\w'Analogue'/2) 6224.nr x2 (\w'Digital'/2) 6225 \h'-\n(x1u'Analogue \h'-\n(x2u'Digital 6226.nr x0 2.0i+(\w'Liljencrants (1968)'/2)+(\w'Morris and Paillet (1972)'/2) 6227.nr x3 (\w'Liljencrants (1968)'/2) 6228 \h'-\n(x3u'\l'\n(x0u\(ul' 6229.sp 6230.nr x1 (\w'Rice (1976)'/2) 6231.nr x2 (\w'Rabiner \fIet al\fR'/2) 6232Series \h'-\n(x1u'Rice (1976) \h'-\n(x2u'Rabiner \fIet al\fR 6233.nr x1 (\w'Liljencrants (1968)'/2) 6234.nr x2 (\w'Holmes (1973)'/2) 6235Parallel \h'-\n(x1u'Liljencrants (1968) \h'-\n(x2u'Holmes (1973) 6236.nr x1 (\w'unpublished'/2) 6237.nr x2 (\w'unpublished'/2 6238Time-domain \h'-\n(x1u'unpublished \h'-\n(x2u'unpublished 6239.nr x1 (\w'\(em'/2) 6240.nr x2 (\w'Morris and Paillet (1972)'/2) 6241High-order filter \h'-\n(x1u'\(em \h'-\n(x2u'Morris and Paillet (1972) 6242 \h'-\n(x3u'\l'\n(x0u\(ul' 6243.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 6244.in-0.5i 6245.FG "Table 5.2 Implementation options for resonance speech synthesizers" 6246.[ 6247Rice 1976 Byte 6248.] 6249.[ 6250Rabiner Jackson Schafer Coker 1971 6251.] 6252.[ 6253Liljencrants 1968 6254.] 6255.[ 6256Holmes 1973 Influence of glottal waveform on naturalness 6257.] 6258.[ 6259Morris and Paillet 1972 6260.] 6261All but one have certainly been used as the basis for synthesis, and 6262the table includes reference to published descriptions. 6263.pp 6264Each method has advantages and disadvantages. Series decomposition obviates 6265the need for control over the amplitudes of individual formants, but does 6266not allow synthesis of sounds which use the nasal tract as well as the oral 6267one; for these are in parallel. Analogue implementation of series synthesizers 6268is complicated by the need for higher-pole correction, and the fact that 6269the gains at different frequencies can vary widely throughout the system. 6270Higher-pole correction is not so important for digital synthesizers. 6271Parallel decomposition eliminates some of these problems: higher-pole correction 6272can be implemented individually for each formant. However, the formant 6273amplitudes must be controlled rather precisely to simulate the vocal tract, 6274which is essentially serial. 6275Time-domain synthesis is associated with low hardware costs but does not 6276easily allow proper control over the excitation sources. In particular, 6277it cannot simulate dynamical movement of the spectrum during aspiration. 6278Implementation of the entire vocal tract model as a single high-order filter, 6279without breaking it down into individual formants in series or parallel, 6280is attractive from the computational point of view because less arithmetic 6281operations are required. It is best analysed in terms of linear predictive 6282coding, which is the subject of the next chapter. 6283.sh "5.6 References" 6284.LB "nnnn" 6285.[ 6286$LIST$ 6287.] 6288.LE "nnnn" 6289.sh "5.7 Further reading" 6290.pp 6291Historically-minded readers should look at the early speech synthesizer 6292designed by Lawrence (1953). 6293This and other classic papers on the subject 6294are reprinted in Flanagan and Rabiner (1973). 
A good description of a quite sophisticated parallel synthesizer can
be found in Holmes (1973), above, and another of a switchable
series/parallel one in Klatt (1980), who even includes a listing of
the Fortran program that implements it.
Here are some useful books on speech synthesizers.
.LB "nn"
.\"Fant-1960-1
.]-
.ds [A Fant, G.
.ds [D 1960
.ds [T Acoustic theory of speech production
.ds [I Mouton
.ds [C The Hague
.nr [T 0
.nr [A 1
.nr [O 0
.][ 2 book
.in+2n
Fant really started the study of the vocal tract as an acoustic system,
and this book marks the beginning of modern speech synthesis.
.in-2n
.\"Flanagan-1972-1
.]-
.ds [A Flanagan, J.L.
.ds [D 1972
.ds [T Speech analysis, synthesis, and perception (2nd, expanded, edition)
.ds [I Springer Verlag
.ds [C Berlin
.nr [T 0
.nr [A 1
.nr [O 0
.][ 2 book
.in+2n
This book is the speech researcher's bible, and like the bible, it's not
all that easy to read.
However, it is an essential reference source for speech acoustics and
speech synthesis (as well as for human speech perception).
.in-2n
.\"Flanagan-1973-2
.]-
.ds [A Flanagan, J.L.
.as [A " and Rabiner, L.R. (Editors)
.ds [D 1973
.ds [T Speech synthesis
.ds [I Dowden, Hutchinson and Ross
.ds [C Stroudsburg, Pennsylvania
.nr [T 0
.nr [A 0
.nr [O 0
.][ 2 book
.in+2n
I recommended this book at the end of Chapter 1 as a collection of
classic papers on the subject of speech synthesis and synthesizers.
.in-2n
.\"Holmes-1972-3
.]-
.ds [A Holmes, J.N.
.ds [D 1972
.ds [T Speech synthesis
.ds [I Mills and Boon
.ds [C London
.nr [T 0
.nr [A 1
.nr [O 0
.][ 2 book
.in+2n
This little book, by one of Britain's foremost workers in the field,
introduces the subject of speech synthesis and speech synthesizers.
It has a particularly good discussion of parallel synthesizers.
.in-2n
.LE "nn"
.EQ
delim $$
.EN
.CH "6 LINEAR PREDICTION OF SPEECH"
.ds RT "Linear prediction of speech
.ds CX "Principles of computer speech
.pp
The speech coding techniques which were discussed in Chapter 3 operate
in the time domain, while the analysis and synthesis techniques
of Chapters 4 and 5 are
based in the frequency domain. Linear prediction is a relatively
new method of speech analysis-synthesis,
introduced in the early 1970's and used
extensively since then, which is primarily a time-domain coding method
but can be used to give frequency-domain parameters like formant
frequency, bandwidth, and amplitude.
.pp
It has several advantages over other speech analysis techniques, and is
likely to become increasingly dominant in speech output systems.
As well as bridging the gap between time- and frequency-domain techniques, it
is of equal value for both speech storage and speech synthesis, and forms
an extremely convenient basis for speech-output systems which use high-quality
stored speech for routine messages and synthesis from phonetics or text
for unusual or exceptional conditions. Linear prediction can be used to
separate the excitation source properties of pitch and amplitude from the
vocal tract filter which governs phoneme articulation, or, in other words,
to separate much of the prosodic from the segmental information.
6393Hence it makes it easy to use stored segmentals with synthetic prosody, 6394which is just what is needed to enhance the flexibility of stored speech by 6395providing overall intonation contours for utterances formed by word 6396concatenation (see Chapter 7). 6397.pp 6398The frequency-domain analysis technique 6399of Fourier transformation necessarily involves approximation because it 6400applies only to periodic waveforms, and so the artificial operation 6401of windowing is required to suppress the aperiodicity of real 6402speech. In contrast, the linear predictive technique, being a time-domain 6403method, can \(em in certain forms \(em deal more rationally with aperiodic 6404signals. 6405.pp 6406The basic idea of linear predictive coding is exactly the same as 6407one form of adaptive differential pulse code modulation which 6408was introduced briefly in Chapter 3. There it was noted that a speech 6409sample $x(n)$ can be predicted quite closely by the previous sample 6410$x(n-1)$. The prediction can be improved by multiplying the previous 6411sample by a number, say $a sub 1$, which is adapted on a syllabic 6412time-scale. This can be utilized for speech coding by transmitting 6413only the prediction error 6414.LB 6415.EQ 6416e(n)~=~~x(n)~-~a sub 1 x(n-1), 6417.EN 6418.LE 6419and using it (and the value of $a sub 1$) to reconstitute the signal 6420$x(n)$ at the receiver. It is worthwhile noting that 6421exactly the same relationship was used for digital 6422preemphasis in Chapter 4, with the value of $a sub 1$ 6423being constant at about 0.9 \(em although 6424the possibility of adapting it to take into account the difference 6425between voiced and unvoiced speech was discussed. 6426.pp 6427An obvious extension is to use several past values of the signal to form 6428the prediction, instead of just one. Different multipliers for each would 6429be needed, so that the prediction error could be written as 6430.LB 6431.EQ 6432e(n)~~ mark =~~x(n)~-~a sub 1 x(n-1)~-~a sub 2 x(n-2)~-~...~-~a sub p x(n-p) 6433.EN 6434.sp 6435.EQ 6436lineup =~~x(n)~-~~sum from k=1 to p ~a sub k x(n-k). 6437.EN 6438.LE 6439The multipliers $a sub k$ should be adapted to minimize the error signal, 6440and we will consider how to do this in the next section. It turns out 6441that they must be re-calculated and transmitted on a time-scale that is 6442rather faster than syllabic but much slower than 6443the basic sampling rate: intervals 6444of 10\-25\ msec are usually used (compare this with the 125\ $mu$sec sampling 6445rate for telephone-quality speech). 6446A configuration for high-order adaptive differential 6447pulse code modulation is shown in Figure 6.1. 6448.FC "Figure 6.1" 6449.pp 6450Figure 6.2 shows typical time waveforms for each of the ten coefficients 6451over a 1-second stretch of speech. 6452.FC "Figure 6.2" 6453Notice that they vary much more slowly than, say, the speech waveform of 6454Figure 3.5. 6455.pp 6456Turning the above relationship into $z$-transforms gives 6457.LB 6458.EQ 6459E(z)~~=~~X(z)~-~~sum from k=1 to p ~a sub k z sup -k ~X(z)~~=~~(1~-~~ 6460sum from k=1 to p ~a sub k z sup -k )~X(z). 6461.EN 6462.LE 6463Rewriting the speech signal in terms of the error, 6464.LB 6465.EQ 6466X(z)~~=~~1 over {1~-~~ sum ~a sub k z sup -k }~.~E(z) . 6467.EN 6468.LE 6469.pp 6470Now let us bring together some facts from the previous chapter which will 6471allow the time-domain technique of linear prediction to be interpreted 6472in terms of the frequency-domain formant model of speech. 
Recall that speech 6473can be viewed as an excitation source passing through a vocal tract filter, 6474followed by another filter to model the effect of radiation from the lips. 6475The overall spectral levels can be reassigned as in Figure 5.1 so that 6476the excitation source has a 0\ dB/octave spectral profile, and hence is 6477essentially impulsive. 6478Considering the vocal tract filter as a series connection 6479of digital formant filters, its transfer function is the product of terms like 6480.LB 6481.EQ 64821 over {1~-~b sub 1 z sup -1 ~+~b sub 2 z sup -2}~ , 6483.EN 6484.LE 6485where $b sub 1$ and $b sub 2$ control the position and bandwidth of the formant resonances. 6486The \-6\ dB/octave spectral compensation can be modelled by the 6487first-order digital filter 6488.LB 6489.EQ 64901 over {1~-~bz sup -1}~ . 6491.EN 6492.LE 6493The product of all these terms, when multiplied out, will have the 6494form 6495.LB 6496.EQ 64971 over {1~-~c sub 1 z sup -1 ~-~c sub 2 z sup -2 ~-~...~-~ 6498c sub q z sup -q }~ , 6499.EN 6500.LE 6501where $q$ is twice the number of formants plus one, and the $c$'s are calculated 6502from the positions and bandwidths of the formant resonances and the spectral 6503compensation parameter. Hence 6504the $z$-transform of the speech is 6505.LB 6506.EQ 6507X(z)~=~~1 over {1~-~~ sum from k=1 to q ~c sub k z sup -k }~.~I(z) , 6508.EN 6509.LE 6510where $I(z)$ is the transform of the impulsive excitation. 6511.pp 6512This is remarkably similar to the linear prediction relation given earlier! If 6513$p$ and $q$ are the same, then the linear predictive coefficients $a sub k$ 6514form a $p$'th order polynomial which is the same as that obtained by multiplying 6515together the second-order polynomials representing the individual formants 6516(together with the first-order one for spectral compensation). 6517Furthermore, the predictive error $E(z)$ can be identified with the 6518impulsive excitation $I(z)$. This raises the very interesting 6519possibility of parametrizing the error signal by its frequency and 6520amplitude \(em two relatively slowly-varying quantities \(em instead of 6521transmitting it sample-by-sample (at an 8\ kHz rate). This is how 6522linear prediction separates out the excitation properties of the source 6523from the vocal tract filter: the source parameters can be derived 6524from the error signal and the vocal tract filter is represented by 6525the linear predictive coefficients. 6526Figure 6.3 shows how this can be used for speech transmission. 6527.FC "Figure 6.3" 6528Note that 6529.ul 6530no 6531signals need now be transmitted at the speech sampling rate; for the 6532source parameters vary relatively slowly. This leads to an extremely 6533low data rate. 6534.pp 6535Practical linear predictive coding schemes operate with a value of $p$ between 653610 and 15, corresponding approximately to 4-formant and 7-formant synthesis 6537respectively. The $a sub k$'s are re-calculated every 10 to 25\ msec, and 6538transmitted to the receiver. Also, the pitch and amplitude 6539of the speech are estimated and transmitted at the same rate. 6540If the speech 6541is unvoiced, there is no pitch value: an "unvoiced flag" is 6542transmitted instead. 6543Because the linear predictive coefficients are intimately related to 6544formant frequencies and bandwidths, a "frame rate" in the region 6545of 10 to 25\ msec is appropriate because this approximates the maximum rate 6546at which acoustic events happen in speech production. 
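.pp
As a minimal sketch of the transmitter side of Figure 6.3, the fragment below
(in Python) applies the prediction-error relation to one frame of samples.
The coefficients are assumed to have already been found (the next section
discusses how); the single coefficient of 0.9 and the sample values are
invented purely for illustration.
.sp
.nf
.in +2n
# Prediction-error (analysis) filter:  e(n) = x(n) - sum_k a_k x(n-k).
# Samples before the start of the frame are taken as zero.
def prediction_error(x, a):
    p = len(a)
    e = []
    for n in range(len(x)):
        pred = sum(a[k-1]*x[n-k] for k in range(1, p+1) if n-k >= 0)
        e.append(x[n] - pred)
    return e

# With a single coefficient of 0.9 this is just the pre-emphasis
# filter of Chapter 4; higher orders whiten the spectrum further.
frame = [0.0, 1.0, 0.8, 0.3, -0.2, -0.5, -0.4, -0.1]   # invented samples
residual = prediction_error(frame, a=[0.9])
.in -2n
.fi
.sp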
6547.pp 6548At the receiver, the excitation waveform 6549is reconstituted. 6550For voiced speech, it is impulsive at the specified 6551frequency and with the specified amplitude, while for unvoiced speech it 6552is random, with the specified amplitude. This signal $e(n)$, together 6553with the transmitted parameters $a sub 1$, ..., $a sub p$, is used 6554to regenerate the speech waveform by 6555.LB 6556.EQ 6557x(n)~=~~e(n)~+~~sum from k=1 to p ~a sub k x(n-k) , 6558.EN 6559.LE 6560\(em which is the inverse of the transmitter's formula for calculating $e(n)$, 6561namely 6562.LB 6563.EQ 6564e(n)~=~~x(n)~-~~sum from k=1 to p ~a sub k x(n-k) . 6565.EN 6566.LE 6567This relies on knowing the past $p$ values of the speech samples. 6568Many systems set these past values to zero at the beginning of each pitch 6569cycle. 6570.pp 6571Linear prediction can also be used for speech analysis, rather than 6572for speech coding, as shown in Figure 6.4. 6573.FC "Figure 6.4" 6574Instead of transmitting the coefficients $a sub k$, 6575they are used to determine the formant positions and bandwidths. 6576We saw above that the polynomial 6577.LB 6578.EQ 65791~-~a sub 1 z sup -1 ~-~a sub 2 z sup -2 ~-~...~-~a sub p z sup -p , 6580.EN 6581.LE 6582when factored into a product of second-order terms, gives the formant 6583characteristics (as well as the spectral compensation term). 6584Factoring is equivalent to finding the complex roots of the polynomial, 6585and this is fairly demanding computationally \(em especially if done at 6586a high rate. Consequently, peak-picking algorithms are sometimes 6587used instead. The absolute value of the polynomial gives the 6588frequency spectrum of the vocal tract filter, and the formants 6589appear as peaks \(em just as they do in cepstrally smoothed speech 6590(see Chapter 4). 6591.pp 6592The chief deficiency in the linear predictive method, whether it 6593is used for speech coding or for speech analysis, is that \(em like a series 6594synthesizer \(em it 6595implements an all-pole model of the vocal tract. 6596We mentioned in Chapter 5 that this is rather simplistic, 6597especially for nasalized sounds which involve a cavity in parallel 6598with the oral one. Some research has been done on incorporating zeros 6599into a linear predictive model, but it complicates the problem of 6600calculating the parameters enormously. For most purposes people seem 6601to be able to live with the limitations of the all-pole model. 6602.sh "6.1 Linear predictive analysis" 6603.pp 6604The key problem in linear predictive coding is to determine the values 6605of the coefficients $a sub 1$, ..., $a sub p$. 6606If the error signal is to be transmitted on a sample-by-sample basis, 6607as it is in adaptive differential pulse code modulation, then it can be most 6608economically encoded if its mean power is as small as possible. 6609Thus the coefficients are chosen to minimize 6610.LB 6611.EQ 6612sum ~e(n) sup 2 6613.EN 6614.LE 6615over some period of time. 6616The period of time used is related to the frame rate at which the 6617coefficients are transmitted or stored, although there is no need 6618to make it exactly the same as one frame interval. As mentioned above, 6619the frame size 6620is usually chosen to be in the region of 10 to 25\ msec. Some 6621schemes minimize the error signal over as few as 30 samples 6622(corresponding to 3\ msec at a 10\ kHz sampling rate). Others take 6623longer; up to 250 samples (25\ msec). 
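.pp
The quantity being minimized is easily stated in program terms. The
function below is a sketch only: it merely evaluates the summed squared
prediction error over one frame for a given trial set of coefficients
(the declarations echo those of the procedures given later in this
chapter, and the function name is invented for the illustration).
The analysis methods that follow find the minimizing coefficients
directly, rather than by any such explicit evaluation.
.LB
.nf
const N = 200; p = 10;
.sp
type svec = array [\-p..N\-1] of real;   {one frame, plus p past samples}
     cvec = array [1..p] of real;
.sp
function residual(var x: svec; var a: cvec): real;
  {summed squared prediction error over the frame x[0..N\-1]}
  var n, k: integer; e, sum: real;
begin
  sum := 0;
  for n := 0 to N\-1 do begin
    e := x[n];
    for k := 1 to p do e := e \- a[k]*x[n\-k];
    sum := sum + e*e
  end;
  residual := sum
end;
.fi
.LE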
6624.pp 6625However, if the error signal is to be considered as impulsive and 6626parametrized by its frequency and amplitude before transmission, 6627or if the coefficients $a sub k$ are to be used for spectral calculations, 6628then it is not immediately obvious how the coefficients should be 6629calculated. 6630In fact, it is still best to choose them to minimize the above sum. 6631This is at least plausible, for an impulsive excitation will have a 6632rather small mean power \(em most of the samples are zero. 6633It can be justified theoretically in terms of 6634.ul 6635spectral whitening, 6636for it can be shown that minimizing the mean-squared error 6637produces an error signal whose spectrum is maximally flat. 6638Now the only two waveforms whose spectra are absolutely flat 6639are a single impulse and white noise. Hence if 6640the speech is voiced, minimizing the mean-squared error 6641will lead to an error signal which is as nearly impulsive 6642as possible. Provided the time-frame for minimizing is short enough, 6643the impulse will correspond to a single excitation pulse. 6644If the speech is unvoiced, minimization will lead to an error 6645signal which is as nearly white noise as possible. 6646.pp 6647How does one choose the linear predictive coefficients to minimize 6648the mean-squared error? The total squared prediction error is 6649.LB 6650.EQ 6651M~=~~sum from n ~e(n) sup 2~~=~~sum from n 6652~[x(n)~-~ sum from k=1 to p ~a sub k x sub n-k ] sup 2 , 6653.EN 6654.LE 6655leaving the range of summation unspecified for the moment. 6656To minimize $M$ by choice of the coefficients $a sub j$, differentiate 6657with respect to each of them and set the resulting derivatives 6658to zero. 6659.LB 6660.EQ 6661dM over {da sub j} ~~=~~-2 sum from n ~x(n-j)[x(n)~-~~ 6662sum from k=1 to p ~a sub k x(n-k)]~~=~0~, 6663.EN 6664.LE 6665so 6666.LB 6667.EQ 6668sum from k=1 to p ~a sub k ~ sum from n ~x(n-j)x(n-k)~~=~~ 6669sum from n ~x(n)x(n-j)~~~~j~=~1,~2,~...,~p. 6670.EN 6671.LE 6672.pp 6673This is a set of $p$ linear equations for the $p$ unknowns $a sub 1$, ..., 6674$a sub p$. 6675Solving it is equivalent to inverting a $p times p$ matrix. 6676This job must be repeated at the frame rate, and so if 6677real-time operation is desired quite a lot of calculation is needed. 6678.rh "The autocorrelation method." 6679So far, the range of the $n$-summation has been left open. The 6680coefficients of the matrix equation have the form 6681.LB 6682.EQ 6683sum from n ~x(n-j)x(n-k). 6684.EN 6685.LE 6686If a doubly-infinite summation were made, with $x(n)$ being defined 6687as zero whenever $n<0$, we could make use of the fact that 6688.sp 6689.ce 6690.EQ 6691sum from {n=- infinity} to infinity ~x(n-j)x(n-k)~=~~ 6692sum from {n=- infinity} to infinity ~x(n-j+1)x(n-k+1)~=~...~=~~ 6693sum from {n=- infinity} to infinity ~x(n)x(n+j-k) 6694.EN 6695.sp 6696to simplify the matrix equation. This just states that the 6697autocorrelation of an infinite sequence depends only on the lag at which 6698it is computed, and not on absolute time. 
6699.pp 6700Defining $R(m)$ as the 6701autocorrelation at lag $m$, that is, 6702.LB 6703.EQ 6704R(m)~=~ sum from n ~x(n)x(n+m), 6705.EN 6706.LE 6707the matrix equation becomes 6708.LB 6709.ne7 6710.nf 6711.EQ 6712R(0)a sub 1 ~+~R(1)a sub 2 ~+~R(2)a sub 3 ~+~...~~=~R(1) 6713.EN 6714.EQ 6715R(1)a sub 1 ~+~R(0)a sub 2 ~+~R(1)a sub 3 ~+~...~~=~R(2) 6716.EN 6717.EQ 6718R(2)a sub 1 ~+~R(1)a sub 2 ~+~R(0)a sub 3 ~+~...~~=~R(3) 6719.EN 6720.EQ 6721etc 6722.EN 6723.fi 6724.LE 6725An elegant method due to Durbin and Levinson exists for solving this 6726special system of equations. It requires much less computational 6727effort than is generally needed for symmetric matrix equations. 6728.pp 6729Of course, an infinite range of summation can not be used in 6730practice. For one thing, the power spectrum is changing, and 6731only the data from a short time-frame should be used for 6732a realistic estimate of the optimum linear predictive coefficients. 6733Hence a windowing procedure, 6734.LB 6735.EQ 6736x(n) sup * ~=~w sub n x(n), 6737.EN 6738.LE 6739is used to reduce the signal to zero outside a finite range of 6740interest. Windows were discussed in Chapter 4 from the 6741point of view of Fourier analysis of speech signals, and the same 6742sort of considerations apply to choosing a window for linear 6743prediction. 6744.pp 6745This is known as the 6746.ul 6747autocorrelation method 6748of computing prediction parameters. Typically a window of 6749100 to 250 samples is used for analysis of one frame of speech. 6750.rh "Algorithm for the autocorrelation method." 6751The algorithm for obtaining linear prediction coefficients 6752by the autocorrelation method is quite simple. It is 6753straightforward to compute the matrix coefficients 6754$R(m)$ from the speech samples and window coefficients. 6755The Durbin-Levinson method of solving matrix equations operates 6756directly on this $R$-vector to produce the coefficient vector $a sub k$. 6757The complete procedure is given as Procedure 6.1, and is shown 6758diagrammatically in Figure 6.5. 
.FC "Figure 6.5"
.RF
.fi
.na
.nh
.ul
const
N=256; p=15;
.ul
type
svec =
.ul
array
[0..N\-1]
.ul
of
real;
cvec =
.ul
array
[1..p]
.ul
of
real;
.sp
.ul
procedure
autocorrelation(signal: svec; window: svec;
.ul
var
coeff: cvec);
.sp
{computes linear prediction coefficients by autocorrelation method
in coeff[1..p]}
.sp
.ul
var
R, temp:
.ul
array
[0..p]
.ul
of
real;
n: 0..N\-1; i,j: 0..p; E: real;
.sp
.ul
begin
{window the signal}
.in+6n
.ul
for
n:=0
.ul
to
N\-1
.ul
do
signal[n] := signal[n]*window[n];
.sp
{compute autocorrelation vector}
.br
.ul
for
i:=0
.ul
to
p
.ul
do begin
.in+2n
R[i] := 0;
.br
.ul
for
n:=0
.ul
to
N\-1\-i
.ul
do
R[i] := R[i] + signal[n]*signal[n+i]
.in-2n
.ul
end;
.sp
{solve the matrix equation by the Durbin-Levinson method}
.br
E := R[0];
.br
coeff[1] := R[1]/E;
.br
.ul
for
i:=2
.ul
to
p
.ul
do begin
.in+2n
E := (1\-coeff[i\-1]*coeff[i\-1])*E;
.br
coeff[i] := R[i];
.br
.ul
for
j:=1
.ul
to
i\-1
.ul
do
coeff[i] := coeff[i] \- R[i\-j]*coeff[j];
.br
coeff[i] := coeff[i]/E;
.br
.ul
for
j:=1
.ul
to
i\-1
.ul
do
temp[j] := coeff[j] \- coeff[i]*coeff[i\-j];
.br
.ul
for
j:=1
.ul
to
i\-1
.ul
do
coeff[j] := temp[j]
.in-2n
.ul
end
.in-6n
.ul
end.
.nf
.FG "Procedure 6.1 Pascal algorithm for the autocorrelation method"
.pp
This algorithm is not quite as efficient as it might be, for some
multiplications are repeated during the calculation of the
autocorrelation vector. Blankinship (1974) shows how
the number of multiplications can be reduced by about half.
.[
Blankinship 1974
.]
.pp
If the algorithm is performed in fixed-point arithmetic
(as it often is in practice because of speed considerations),
some scaling must be done. The maximum and minimum values of
the windowed signal can be determined within the window
calculation loop, and one extra pass over the vector will
suffice to scale it to maximum significance.
(Incidentally, if all sample values are the same the procedure
cannot produce a solution because $E$ becomes zero, and this
can easily be checked when scaling.)
.pp
The absolute value of the $R$-vector has no significance, and since
$R(0)$ is always the greatest element, this can be set to the largest
fixed-point number and the other $R$'s scaled down appropriately
after they have been calculated.
These scaling operations are shown as dashed boxes in Figure 6.5.
$E$ decreases monotonically
as the computation proceeds, so it is safe to initialize it to $R(0)$
without extra scaling. The remainder of the scaling is straightforward,
with the linear prediction coefficients $a sub k$ appearing as fractions.
.rh "The covariance method."
One of the advantages of linear predictive methods that was
promised earlier was that it allows us to escape from
the problem of windowing.
To do this, we must abandon the 6935requirement that the coefficients of the matrix equation have 6936the symmetry property of autocorrelations. Instead, suppose 6937that the range of $n$-summation uses a fixed number of 6938elements, say N, starting at $n=h$, to estimate the prediction 6939coefficients between sample number $h$ and sample number $h+N$. 6940.pp 6941This leads to the matrix equation 6942.LB 6943.EQ 6944sum from k=1 to p ~a sub k sum from n=h to h+N-1 ~x(n-j)x(n-k) ~~=~~ 6945sum from n=h to h+N-1 ~x(n)x(n-j)~~~~j~=~1,~2,~...,~p. 6946.EN 6947.LE 6948Alternatively, we could write 6949.LB 6950.EQ 6951sum from k=1 to p ~a sub k ~ Q sub jk sup h~~=~~Q sub 0j sup h 6952~~~~j~=~1,~2,~...,~p; 6953.EN 6954.LE 6955where 6956.LB 6957.EQ 6958Q sub jk sup h~~=~~sum from n=h to h+N-1 ~x(n-j)x(n-k). 6959.EN 6960.LE 6961Note that some values of $x(n)$ outside the range $h ~ <= ~ n ~ < ~ h+N$ are 6962required: these are shown diagrammatically in Figure 6.6. 6963.FC "Figure 6.6" 6964.pp 6965Now $Q sub jk sup h ~=~ Q sub kj sup h$, so the equation has 6966a diagonally symmetric matrix; and in fact the matrix $Q sup h$ can 6967be shown to be positive semidefinite \(em and is almost always positive 6968definite in practice. Advantage can be taken of these facts 6969to provide a computationally efficient method for solving the 6970equation. According to a result called Cholesky's theorem, a 6971positive definite symmetric matrix $Q$ can be factored into the form 6972$Q ~ = ~ LL sup T$, where $L$ is a lower triangular matrix. 6973This leads to an efficient 6974solution algorithm. 6975.pp 6976This method of computing prediction coefficients has become known 6977as the 6978.ul 6979covariance method. 6980It does not use windowing of the speech signal, and can give accurate 6981estimates of the prediction coefficients with a smaller analysis 6982frame than the autocorrelation method. Typically, 50 to 100 speech samples 6983might be used to estimate the coefficients, and they are re-calculated 6984every 100 to 250 samples. 6985.rh "Algorithm for the covariance method." 
An algorithm for the covariance method is given in Procedure 6.2,
.RF
.fi
.na
.nh
.ul
const
N=100; p=15;
.ul
type
svec =
.ul
array
[\-p..N\-1]
.ul
of
real;
cvec =
.ul
array
[1..p]
.ul
of
real;
.sp
.ul
procedure
covariance(signal: svec;
.ul
var
coeff: cvec);
.sp
{computes linear prediction coefficients by covariance method
in coeff[1..p]}
.sp
.ul
var
Q:
.ul
array
[0..p,0..p]
.ul
of
real;
n: 0..N\-1; i,j,r: 0..p; X: real;
.sp
.ul
begin
{calculate upper-triangular covariance matrix in Q}
.in+6n
.ul
for
i:=0
.ul
to
p
.ul
do
.in+2n
.ul
for
j:=i
.ul
to
p
.ul
do begin
.in+2n
Q[i,j]:=0;
.br
.ul
for
n:=0
.ul
to
N\-1
.ul
do
.in+2n
Q[i,j] := Q[i,j] + signal[n\-i]*signal[n\-j]
.in-2n
.in-2n
.ul
end;
.in-2n
.sp
{calculate the square root of Q}
.br
.ul
for
r:=2
.ul
to
p
.ul
do
.in+2n
.ul
begin
.in+2n
.ul
for
i:=2
.ul
to
r\-1
.ul
do
.in+2n
.ul
for
j:=1
.ul
to
i\-1
.ul
do
.in+2n
Q[i,r] := Q[i,r] \- Q[j,i]*Q[j,r];
.in-2n
.ul
for
j:=1
.ul
to
r\-1
.ul
do
.in+2n
.ul
begin
.in+2n
X := Q[j,r];
.br
Q[j,r] := Q[j,r]/Q[j,j];
.br
Q[r,r] := Q[r,r] \- Q[j,r]*X
.in-2n
.ul
end
.in-2n
.in-2n
.in-2n
.ul
end;
.in-2n
.sp
{calculate coeff[1..p]}
.br
.ul
for
r:=2
.ul
to
p
.ul
do
.in+2n
.ul
for
i:=1
.ul
to
r\-1
.ul
do
Q[0,r] := Q[0,r] \- Q[i,r]*Q[0,i];
.in-2n
.ul
for
r:=1
.ul
to
p
.ul
do
Q[0,r] := Q[0,r]/Q[r,r];
.br
.ul
for
r:=p\-1
.ul
downto
1
.ul
do
.in+2n
.ul
for
i:=r+1
.ul
to
p
.ul
do
Q[0,r] := Q[0,r] \- Q[r,i]*Q[0,i];
.in-2n
.ul
for
r:=1
.ul
to
p
.ul
do
coeff[r] := Q[0,r]
.in-6n
.ul
end.
.nf
.FG "Procedure 6.2 Pascal algorithm for the covariance method"
and is shown diagrammatically in Figure 6.7.
.FC "Figure 6.7"
The algorithm shown is not terribly efficient from a computation
and storage point of view, although it is workable. For one thing,
it uses the obvious method for computing the covariance matrix
by calculating
.EQ
Q sub 01 sup h ,
.EN
.EQ
Q sub 02 sup h , ~ ...,
.EN
.EQ
Q sub 0p sup h ,
.EN
.EQ
Q sub 11 sup h , ...,
.EN
in turn, which repeats most of the multiplications $p$ times \(em not
an efficient procedure. A simple alternative is to precompute the necessary
multiplications and store them in a $(N+h) times (p+1)$ diagonally symmetric
table, but even apart from the extra storage required for this, the number
of additions which must be performed subsequently to give the $Q$'s is far
larger than necessary. It is possible, however, to write a procedure which is
both time- and space-efficient (Witten, 1980).
.[
Witten 1980 Algorithms for linear prediction
.]
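.pp
One standard device \(em not necessarily the scheme of the reference \(em
exploits the fact that neighbouring entries of the covariance matrix sum
almost exactly the same products: each entry can be obtained from the
entry diagonally above and to its left by adding one product and
subtracting another. In the notation of Procedure 6.2 (which takes $h=0$),
the covariance matrix could be formed like this sketch:
.LB
.nf
{row 0 of Q is computed directly; every other entry is then derived from
 its upper-left neighbour at the cost of only two multiplications}
for j := 0 to p do begin
  Q[0,j] := 0;
  for n := 0 to N\-1 do Q[0,j] := Q[0,j] + signal[n]*signal[n\-j]
end;
for i := 1 to p do
  for j := i to p do
    Q[i,j] := Q[i\-1,j\-1] + signal[\-i]*signal[\-j]
                          \- signal[N\-i]*signal[N\-j];
.fi
.LE
This reduces the number of multiplications from roughly $N p sup 2 /2$ to
roughly $Np$, with no extra storage.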
7225.pp 7226The scaling problem is rather more tricky for the covariance 7227method than for the autocorrelation method. The $x$-vector 7228should be scaled initially in the same way as before, but now there 7229are $p+1$ diagonal elements of the covariance matrix, any of which could 7230be the greatest element. Of course, 7231.LB 7232.EQ 7233Q sub jk ~~ <= ~~ Max ( Q sub 11 , Q sub 22 , ..., Q sub pp ), 7234.EN 7235.LE 7236but despite the considerable communality in the summands of the diagonal 7237elements, there are no 7238.ul 7239a priori 7240bounds on the ratios between them. 7241.pp 7242The only way to scale the $Q$ matrix properly is to calculate each of its $p$ 7243diagonal elements and use the greatest as a scaling factor. 7244Alternatively, the fact that 7245.LB 7246.EQ 7247Q sub jk ~~ <= ~~ N times Max( x sub n sup 2 ) 7248.EN 7249.LE 7250can be used to give a bound for scaling purposes; however, this 7251is usually a rather conservative bound, and as $N$ is often around 100, several 7252bits of significance will be lost. 7253.pp 7254Scaling difficulties do not cease when $Q$ has been determined. It is possible 7255to show that the elements of the lower-triangular matrix $L$ which represents 7256the square root of $Q$ are actually 7257.ul 7258unbounded. 7259In fact there is a slightly different variant of the Cholesky decomposition 7260algorithm which guarantees bounded coefficients but suffers from the 7261disadvantage that it requires square roots to be taken (Martin 7262.ul 7263et al, 72641965). 7265.[ 7266Martin Peters Wilkinson 1965 7267.] 7268However, experience with the method indicates that it is rare for the elements 7269of $L$ to exceed 16 times the maximum element of $Q$, and the possibility of 7270occasional failure to adjust the coefficients may be tolerable in a practical 7271linear prediction system. 7272.rh "Comparison of autocorrelation and covariance analysis." 7273There are various factors which should be taken into account when 7274deciding whether to use the autocorrelation or covariance method for linear 7275predictive analysis. Furthermore, there is a rather different technique, 7276called the "lattice method", which will be discussed shortly. 7277The autocorrelation method involves windowing, which means that in 7278practice a rather longer stretch of speech should be used 7279for analysis. We have illustrated this by setting $N$=256 in the 7280autocorrelation algorithm and 100 in the covariance one. 7281Offsetting the extra calculation that this entails is the 7282fact that the Durbin-Levinson method of inverting a matrix is much more 7283efficient than Cholesky decomposition. In practice, this means 7284that similar amounts of computation are needed for each method \(em a 7285detailed comparison is made in Witten (1980). 7286.[ 7287Witten 1980 Algorithms for linear prediction 7288.] 7289.pp 7290A factor which weighs against the covariance method is the 7291difficulty of scaling intermediate quantities within the algorithm. 7292The autocorrelation method can be implemented quite satisfactorily 7293in fixed-point arithmetic, and this makes it more suitable for 7294hardware implementation. Furthermore, serious instabilities sometimes 7295arise with the covariance method, whereas it can be shown that 7296the autocorrelation one is always stable. 
Nevertheless, the approximations 7297inherent in the windowing operation, and the smearing effect of taking a 7298larger number of sample points, mean that covariance-method coefficients 7299tend to represent the speech more accurately, if they can be obtained. 7300.pp 7301One way of using the covariance method which has proved to be rather 7302satisfactory in practice is to synchronize the analysis frame with 7303the beginning of a pitch period, when the excitation is strongest. 7304Pitch synchronous techniques were discussed in Chapter 4 in the context 7305of discrete Fourier transformation of speech. The snag, of course, is that 7306pitch peaks do not occur uniformly in time, and furthermore it is difficult 7307to estimate their locations precisely. 7308.sh "6.2 Linear predictive synthesis" 7309.pp 7310If the linear predictive coefficients and the error signal are available, 7311it is easy to regenerate the original speech by 7312.LB 7313.EQ 7314x(n)~=~~e(n)~+~~ sum from k=1 to p ~a sub k x(n-k) . 7315.EN 7316.LE 7317If the error signal is parametrized into the sound source type 7318(voiced or unvoiced), amplitude, and pitch (if voiced), it can be 7319regenerated by an impulse repeated at the appropriate pitch 7320frequency (if voiced), or white noise (if unvoiced). 7321.pp 7322However, it may be that the filter represented by the coefficients $a sub k$ is 7323unstable, causing the output speech signal to oscillate wildly. 7324In fact, it is only possible for the covariance method to produce an 7325unstable filter, and not the autocorrelation method \(em although even 7326with the latter, truncation of the $a sub k$'s for transmission may turn 7327a stable filter into an unstable one. Furthermore, the coefficients 7328$a sub k$ are not suitable candidates for quantization, because small 7329changes in them can have a dramatic effect on the characteristics of 7330the synthesis filter. 7331.pp 7332Both of these problems can be solved by using a different set of numbers, 7333called 7334.ul 7335reflection coefficients, 7336for quantization and transmission. Thus, for example, in Figures 6.1 7337and 6.3 these reflection coefficients could be derived at the 7338transmitter, quantized, and used by the receiver to reproduce 7339the speech waveform. They can be related to reflection and transmission 7340parameters at the junctions of an acoustic tube model of the vocal tract; 7341hence the name. Procedure 6.3 shows an algorithm for calculating the 7342reflection coefficients from the filter coefficients $a sub k$. 7343.RF 7344.fi 7345.na 7346.nh 7347.ul 7348const 7349p=15; 7350.ul 7351type 7352cvec = 7353.ul 7354array 7355[1..p] 7356.ul 7357of 7358real; 7359.sp 7360.ul 7361procedure 7362reflection(coeff: cvec; 7363.ul 7364var 7365refl: cvec); 7366.sp 7367{computes reflection coefficients in refl[1..p] corresponding 7368to linear prediction coefficients in coeff[1..p]} 7369.sp 7370.ul 7371var 7372temp: cvec; i, m: 1..p; 7373.sp 7374.ul 7375begin 7376.in+6n 7377.ul 7378for 7379m:=p 7380.ul 7381downto 73821 7383.ul 7384do begin 7385.in+2n 7386refl[m] := coeff[m]; 7387.br 7388.ul 7389for 7390i:=1 7391.ul 7392to 7393m\-1 7394.ul 7395do 7396temp[i] := coeff[i]; 7397.br 7398.ul 7399for 7400i:=1 7401.ul 7402to 7403m\-1 7404.ul 7405do 7406.ti+2n 7407coeff[i] := 7408.ti+4n 7409(coeff[i] + refl[m]*temp[m\-i]) / (1 \- refl[m]*refl[m]); 7410.in-2n 7411.ul 7412end 7413.in-6n 7414.ul 7415end. 
7416.nf 7417.MT 2 7418Procedure 6.3 Pascal algorithm for producing reflection coefficients 7419from filter coefficients 7420.TE 7421.pp 7422Although we will not go into the theoretical details here, 7423reflection coefficients are bounded by $+-$1 for stable filters, 7424and hence form a useful test for stability. Having a limited 7425range makes them easy to quantize for transmission, and in fact 7426they behave better under quantization than do the filter coefficients. 7427One could resynthesize speech from reflection coefficients by first 7428converting them to filter coefficients and using the synthesis 7429method described above. However, it is natural to seek a single-stage 7430procedure which can regenerate speech directly from reflection 7431coefficients. 7432.pp 7433Such a procedure does exist, and is called a 7434.ul 7435lattice filter. 7436Figure 6.8 shows one form of lattice for speech synthesis. 7437.FC "Figure 6.8" 7438The error signal (whether transmitted or synthesized) 7439enters at the upper left-hand corner, passes along the top forward 7440signal path, being modified on the way, to give the output signal 7441at the right-hand side. 7442Then it passes back through a chain of delays along the bottom, 7443backward, path, and is used to modify subsequent forward signals. 7444Finally it is discarded at the lower left-hand corner. 7445.pp 7446There are $p$ stages in the lattice structure of Figure 6.8, where $p$ is the 7447order of the linear predictive filter. 7448Each stage involves two multiplications by the appropriate 7449reflection coefficients, one by the backward signal \(em the 7450result of which is added into the forward path \(em and the other by 7451the forward signal \(em the result of which is subtracted from the 7452backward path. Thus the number of multiplications is twice 7453the order of the filter, and hence twice as many as for the 7454realization using coefficients $a sub k$. If the labour necessary 7455to turn the reflection coefficients into $a sub k$'s is included, 7456the computational load becomes the same. Moreover, since the 7457reflection coefficients need fewer quantization bits than the $a sub k$'s 7458(for a given speech quality), the word lengths are smaller in the 7459lattice realization. 7460.pp 7461The advantages of the lattice method of synthesis over direct evaluation 7462of the prediction using filter coefficients $a sub k$, then, are: 7463.LB 7464.NP 7465the reflection coefficients are used directly 7466.NP 7467the stability of the filter is obvious from the reflection coefficient 7468values 7469.NP 7470the system is more tolerant to quantization errors in fixed-point 7471implementations. 7472.LE 7473Although it may seem unlikely that an unstable filter would be produced 7474by linear predictive analysis, instability is in fact a real problem 7475in non-lattice implementations. For example, 7476coefficients are often interpolated at the receiver, to allow longer 7477frame times and smooth over sudden transitions, and it is quite likely that 7478an unstable configuration is obtained when interpolating filter coefficients 7479between two stable configurations. 7480This cannot happen with reflection coefficients, however, because a 7481necessary and sufficient condition for stability is that all 7482coefficients lie in the interval $(-1,+1)$. 
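.pp
The point is easily made concrete. A receiver that interpolates linearly
between the reflection coefficients of successive frames might contain a
fragment like the following sketch, in which the names are illustrative
only and "frac" is the position, between 0 and 1, reached within the
current frame:
.LB
.nf
const p = 10;
.sp
type cvec = array [1..p] of real;
.sp
{linear interpolation between the reflection coefficients of two
 successive frames; because each input lies strictly between \-1 and +1,
 so does every interpolated value, and the synthesis filter stays stable}
procedure interpolate(var kprev, knext: cvec; frac: real; var k: cvec);
  var i: integer;
begin
  for i := 1 to p do k[i] := (1\-frac)*kprev[i] + frac*knext[i]
end;
.fi
.LE
Exactly the same operation on the filter coefficients $a sub k$ carries
no such guarantee, which is why interpolation is one of the situations in
which instability shows up in practice.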
7483.sh "6.3 Lattice filtering" 7484.pp 7485Lattice filters are an important new method of linear predictive 7486.ul 7487analysis 7488as well as synthesis, and so 7489it is worth considering the theory behind them a little further. 7490.rh "Theory of the lattice synthesis filter." 7491Figure 6.9 shows a single stage of the synthesis lattice given earlier. 7492.FC "Figure 6.9" 7493There are two signals at each side of the lattice, and the $z$-transforms 7494of these have been labelled $X sup +$ and $X sup -$ at the left-hand side 7495and $Y sup +$ and $Y sup -$ at the right-hand side. 7496The direction of signal flow is forwards along the upper ("positive") path 7497and backwards along the lower ("negative") one. 7498.pp 7499The signal flows show that the following two relationships hold: 7500.LB 7501.EQ 7502Y sup + ~=~~ X sup + ~+~ k z sup -1 Y sup - ~~~~~~ 7503.EN 7504for the forward (upper) path 7505.br 7506.EQ 7507X sup - ~ =~ -kY sup + ~+~ z sup -1 Y sup - ~~~~~~~ 7508.EN 7509\h'-\w'\-'u'for the backward (lower) path. 7510.LE 7511Re-arranging the first equation yields 7512.LB 7513.EQ 7514X sup + ~ =~~ Y sup + ~-~ k z sup -1 Y sup - , 7515.EN 7516.LE 7517and so we can describe the function of the lattice by a single matrix 7518equation: 7519.LB 7520.ne4 7521.EQ 7522left [ matrix {ccol {X sup + above X sup -}} right ] ~~=~~ 7523left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] 7524~ left [ matrix {ccol {Y sup + above Y sup -}} right ] ~ . 7525.EN 7526.LE 7527It would be nice to be able to 7528call this an input-output equation, but it is not; 7529for the input signals to the lattice stage are $X sup +$ and $Y sup -$, 7530and the outputs are $X sup -$ and $Y sup +$. 7531We have written it in this form because it allows a multi-stage lattice to 7532be described by cascading these matrix equations. 7533.pp 7534A single-stage lattice filter has $Y sup +$ and $Y sup -$ connected together, 7535forming its output (call this $X sub output$), while the input is $X sup +$ 7536($X sub input$). 7537Hence the input is related to the output by 7538.LB 7539.EQ 7540left [ matrix {ccol {X sub input above \(sq }} right ] ~~ = 7541~~ left [ matrix {ccol {1 above -k} ccol {-k z sup -1 7542above z sup -1}} right ] 7543~ left [ matrix {ccol {X sub output above X sub output}} right ] ~ , 7544.EN 7545.LE 7546so 7547.LB 7548.EQ 7549X sub input ~ = ~~ (1~-~ k z sup -1 )~X sub output , 7550.EN 7551.LE 7552or 7553.LB 7554.EQ 7555{X sub output} over {X sub input} ~~=~~ 1 over {1~-~ k sub 1 z sup -1} ~ . 7556.EN 7557.LE 7558(The symbol \(sq is used here and elsewhere 7559to indicate an unimportant element of a vector 7560or matrix.) This certainly has the form of a linear predictive 7561synthesis filter, which is 7562.LB 7563.EQ 7564X(z) over E(z) ~~=~~ 1 over {1~-~~ sum from k=1 to p ~a sub k 7565z sup -k}~~=~~ 1 over {1~-~a sub 1 z sup -1 } ~~~~~~ 7566.EN 7567when $p=1$. 
.LE
.pp
The behaviour of a second-order lattice filter, shown in Figure 6.10,
can be described by
.LB
.ne4
.EQ
left [ matrix {ccol {X sub 3 sup + above X sub 3 sup -}} right ] ~~ =
~~ left [ matrix {ccol {1 above -k sub 2 } ccol {-k sub 2 z sup -1
above z sup -1}} right ]
~ left [ matrix {ccol {X sub 2 sup + above X sub 2 sup -}} right ]
.EN
.sp
.ne4
.EQ
left [ matrix {ccol {X sub 2 sup + above X sub 2 sup -}} right ] ~~ =
~~ left [ matrix {ccol {1 above -k sub 1 } ccol {-k sub 1 z sup -1
above z sup -1}} right ]
~ left [ matrix {ccol {X sub 1 sup + above X sub 1 sup -}} right ]
.EN
.LE
with
.LB
.ne3
.EQ
X sub 3 sup + ~=~X sub input
.EN
.br
.EQ
X sub 1 sup + ~=~ X sub 1 sup - ~=~ X sub output .
.EN
.LE
.FC "Figure 6.10"
$X sub 2 sup +$ and $X sub 2 sup -$ can be eliminated by substituting the
second equation into the first, which yields
.LB
.EQ
left [ matrix {ccol {X sub input above \(sq }} right ] ~~ mark =
~~ left [ matrix {ccol {1 above -k sub 2 } ccol {-k sub 2 z sup -1
above z sup -1}} right ]
~ left [ matrix {ccol {1 above -k sub 1 } ccol {-k sub 1 z sup -1
above z sup -1}} right ]
~ left [ matrix {ccol {X sub output above X sub output}} right ]
.EN
.sp
.sp
.EQ
lineup = ~~ left [ matrix {ccol {1+k sub 1 k sub 2 z sup -1 above \(sq }
ccol { -k sub 1 z sup -1 -k sub 2 z sup -2 above \(sq }} right ]
~ left [ matrix {ccol {X sub output above X sub output}} right ] ~ .
.EN
.LE
This leads to an input-output relationship
.LB
.EQ
{X sub output} over {X sub input} ~~ = ~~
1 over {1~+~k sub 1 (k sub 2 -1)z sup -1 ~-~k sub 2 z sup -2} ~ ,
.EN
.LE
which has the required form, namely
.LB
.EQ
1 over {1~-~~ sum from k=1 to p ~a sub k z sup -k } ~~~~~~ (p=2)
.EN
.LE
when
.LB
.EQ
a sub 1 ~=~-k sub 1 (k sub 2 -1)
.EN
.br
.EQ
a sub 2 ~=~k sub 2.
.EN
.LE
.pp
A third-order filter is described by
.LB
.EQ
left [ matrix {ccol {X sub input above \(sq }} right ] ~~ =
~~ left [ matrix {ccol {1 above -k sub 3 } ccol {-k sub 3 z sup -1
above z sup -1}} right ]
~ left [ matrix {ccol {1 above -k sub 2 } ccol {-k sub 2 z sup -1
above z sup -1}} right ]
~ left [ matrix {ccol {1 above -k sub 1 } ccol {-k sub 1 z sup -1
above z sup -1}} right ]
~ left [ matrix {ccol {X sub output above X sub output}} right ] ~ ,
.EN
.LE
and brave souls can verify that this gives an input-output
relationship
.LB
.EQ
{X sub output} over {X sub input} ~~ = ~~
1 over {1~+~[k sub 2 k sub 3 - k sub 1 (1-k sub 2 )] z sup -1 ~+~
[k sub 1 k sub 3 (1-k sub 2 ) -k sub 2 ] z sup -2 ~-~ k sub 3 z sup -3 } ~ .
.EN
.LE
It is fairly obvious that a $p$'th order lattice filter will give the
required all-pole $p$'th order synthesis form,
.LB
.EQ
1 over { 1~-~~ sum from k=1 to p ~a sub k z sup -k } ~ .
.EN
.LE
.pp
We have not shown that the algorithm given in Procedure 6.3 for producing
reflection coefficients from filter coefficients gives those values
for $k sub i$ which are necessary to make the lattice filter equivalent
to the ordinary synthesis filter. However, this is the case, and it is
easy to verify by hand for the first, second, and third-order cases.
.rh "Different lattice configurations."
The lattice filters of Figures 6.8, 6.9, and 6.10 have two multipliers
per section.
This is called a "two-multiplier" configuration.
However, there are other configurations which achieve
the same effect, but require different numbers of multiplies.
Figure 6.11 shows one-multiplier and four-multiplier configurations,
along with the familiar two-multiplier one.
.FC "Figure 6.11"
It is easy to verify that the three configurations can be modelled in
matrix terms by
.LB
.ne4
$
left [ matrix {ccol {X sup + above X sup -}} right ] ~~ = ~~
left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ]
~ left [ matrix {ccol {Y sup + above Y sup -}} right ]
$ two-multiplier configuration
.sp
.sp
.ne4
$
left [ matrix {ccol {X sup + above X sup -}} right ] ~~ = ~~
left [ {1-k over 1+k} right ] sup 1/2 ~
left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ]
~ left [ matrix {ccol {Y sup + above Y sup -}} right ]
$ one-multiplier configuration
.sp
.sp
.ne4
$
left [ matrix {ccol {X sup + above X sup -}} right ] ~~ = ~~
1 over {(1-k sup 2) sup 1/2} ~
left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ]
~ left [ matrix {ccol {Y sup + above Y sup -}} right ]
$ four-multiplier configuration.
.LE
Each of the three has the same frequency-domain response, although
a different constant factor is involved in each case.
The effect of this can be annulled by performing a single multiply
operation on the output of a complete lattice chain.
The multiplier has the form
.LB
.EQ
left [ {1 - k sub p} over {1 + k sub p} ~.~
{1 - k sub p-1} over {1 + k sub p-1} ~.~...~.~
{1 - k sub 1} over {1 + k sub 1} right ] sup 1/2
.EN
.sp
.LE
for single-multiplier lattices, and
.LB
.EQ
left [ 1 over {1 - k sub p sup 2} ~.~
1 over {1 - k sub p-1 sup 2} ~.~...~.~
1 over {1 - k sub 1 sup 2} right ] sup 1/2
.EN
.LE
for four-multiplier lattices, where the reflection coefficients
in the lattice are $k sub p$, $k sub p-1$, ..., $k sub 1$.
.pp
There are important differences between these three configurations.
If multiplication is time-consuming, the one-multiplier model has obvious
computational advantages over the other two methods.
However, the four-multiplier structure behaves substantially better
in finite word-length implementations. It is easy to show that, with this
configuration,
.LB
.EQ
(X sup - ) sup 2 ~+~ (Y sup + ) sup 2 ~~ = ~~
(X sup + ) sup 2 ~+~ (z sup -1 Y sup - ) sup 2 ,
.EN
.LE
\(em a relationship which suggests that the "energy" in the
input signals, namely $X sup +$ and $Y sup -$, is preserved in the output
signals, $X sup -$ and $Y sup +$.
Notice that care must be taken with the $z$-transforms, since squaring is a
non-linear operation. $(z sup -1 Y sup - ) sup 2$ means the square of
the previous value of $Y sup -$, which is not the same
as $z sup -2 (Y sup - ) sup 2$.
.pp
It has been shown (Gray and Markel, 1975) that the four-multiplier
configuration has some stability properties which are not shared by other
digital filter structures.
.[
Gray Markel 1975 Normalized digital filter structure
.]
7766When a linear predictive filter is used for synthesis, the parameters 7767of the filter \(em the $k$-parameters in the case of lattice filters, 7768and the $a$-parameters in the case of direct ones \(em change with time. 7769It is usually rather difficult to guarantee stability in the case of 7770time-varying filter parameters, but some guarantees can be made for a 7771chain of four-multiplier lattices. Furthermore, if the input is a 7772discrete delta function, the cumulative energies at each stage of the 7773lattice are the same, and so maximum dynamic range will be achieved 7774for the whole filter if each section is implemented with the same 7775word size. 7776.rh "Lattice analysis." 7777It is quite easy to construct a filter which is inverse to 7778a single-stage lattice. 7779The structure of Figure 6.12(a) does the job. 7780(Ignore for a moment 7781the dashed lines connecting Figure 6.12(a) and (b).) Its matrix transfer 7782function is 7783.FC "Figure 6.12" 7784.LB 7785.ne4 7786$ 7787left [ matrix {ccol {Y sup + above Y sup -}} right ] ~~=~~ 7788left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] 7789~ left [ matrix {ccol {X sup + above X sup -}} right ] 7790$ analysis lattice (Figure 6.12(a)). 7791.LE 7792Notice that this is exactly the same as the transfer function of the 7793synthesis lattice of Figure 6.9, which is reproduced 7794in Figure 6.12(b), except that the $X$'s and $Y$'s are reversed: 7795.LB 7796.ne4 7797$ 7798left [ matrix {ccol {X sup + above X sup -}} right ] ~~=~~ 7799left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] 7800~ left [ matrix {ccol {Y sup + above Y sup -}} right ] 7801$ synthesis lattice (Figure 6.12(b)), 7802.LE 7803or, in other words, 7804.LB 7805.ne4 7806$ 7807left [ matrix {ccol {Y sup + above Y sup -}} right ] ~~ = ~~ 7808left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} 7809right ] sup -1 7810~ left [ matrix {ccol {X sup + above X sup -}} right ] 7811$ synthesis lattice (Figure 6.12(b)). 7812.LE 7813Hence if the filters of Figures 6.12(a) and (b) were connected together 7814as shown by the dashed lines, they 7815would cancel each other out, and the overall transfer would be unity: 7816.LB 7817.ne4 7818.EQ 7819left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} 7820right ] ~ 7821left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} 7822right ] sup -1 ~~ = ~~ 7823left [ matrix {ccol {1 above 0} ccol {0 above 1}} right ] ~ . 7824.EN 7825.LE 7826Actually, such a connection is not possible in physical terms, 7827for although the upper paths can be joined together the lower ones can not. 7828The right-hand lower point of Figure 6.12(a) is an 7829.ul 7830output 7831terminal, and so is the left-hand lower one of Figure 6.12(b)! However, 7832there is no need to envisage a physical connection of the lower paths. 7833It is sufficient for cancellation just to assume that the signals at both 7834of the points turn out to be the same. 7835.pp 7836And they do. 7837The general case of a $p$-stage analysis lattice 7838connected to a $p$-stage synthesis 7839lattice is shown in Figure 6.13. 7840.FC "Figure 6.13" 7841Notice that the forward and backward paths are connected together at both 7842of the extreme ends of the system. 
It is not difficult to show that under these
conditions the signal at the lower right-hand
terminal of the analysis chain will equal that at the lower left-hand
terminal of the synthesis chain, even though they are not connected,
provided the upper terminals are connected together as shown by the dashed
line.
Of course, the reflection coefficients $k sub 1$, $k sub 2$, ...,
$k sub p$ in the analysis lattice must equal those in the synthesis
lattice, and as Figure 6.13 shows the order is reversed in the synthesis
lattice.
Successive analysis and synthesis sections pair off, working from
the middle outwards. At each stage the sections cancel each other out,
giving a unit transfer function as demonstrated above.
.rh "Estimating reflection coefficients."
As stated earlier in this chapter, the key problem in linear prediction is to
determine the values of the predictive coefficients \(em in this case, the
reflection coefficients.
If this is done correctly, we have shown using Procedure 6.3 that
the synthesis part of Figure 6.13 performs the same calculation that
a conventional direct-form linear predictive synthesizer would, and hence
the signal that excites it \(em that is, the signal represented by the
dashed line \(em must be the prediction residual, or error signal, discussed
earlier. The system is effectively the same as the high-order adaptive
differential pulse code modulation one of Figure 6.1.
.pp
One of the most interesting features of the lattice structure for
analysis filters is that calculation of suitable values for the
reflection coefficients can be done locally at each stage of the lattice.
For example, consider the $i$'th section of the analysis lattice in
Figure 6.13. It is possible to determine a suitable value of $k sub i$
simply by performing a calculation on the inputs to the $i$'th
section (ie $X sup +$ and $X sup -$ in Figure 6.12).
No longer need the complicated global optimization technique of matrix
inversion be used, as in the autocorrelation and covariance methods discussed
earlier.
.pp
A suitable value for $k$ in the single lattice section of Figure 6.12 is
.LB
.EQ
k~ = ~~ {E[ x sup + (n) x sup - (n-1)]} over
{( E[ x sup + (n) sup 2 ] E[ x sup - (n-1) sup 2 ] ) sup 1/2} ~~ ;
.EN
.LE
that is, the statistical correlation between $x sup + (n)$ and
$x sup - (n-1)$.
Here, $x sup + (n)$ and $x sup - (n)$ represent the input signals to the
upper and lower paths (recall that $X sup +$ and $X sup -$
are their $z$-transforms).
$x sup - (n-1)$ is just $x sup - (n)$ delayed by one time unit, that is,
the output of the $z sup -1$ box in the Figure.
.pp
The criterion of optimality for the autocorrelation and covariance methods
was that the prediction error, that is, the signal which emerges from
the right-hand end of the upper path of a lattice analysis filter,
should be minimized in a mean-square sense.
The reflection coefficients obtained from the above formula do not necessarily
satisfy any such global minimization criterion.
Nevertheless, they do keep the error signal small, and have been used with
success in speech analysis systems.
.pp
It is easy to minimize the output from either the upper or the lower path
of the lattice filter at each stage.
For example, the $z$-transform of the
upper output is given by
.LB
.EQ
Y sup + ~~=~~ X sup + ~-~ k z sup -1 X sup - ,
.EN
.LE
or
.LB
.EQ
y sup + (n) ~~=~~ x sup + (n) ~-~ k x sup - (n-1) .
.EN
.LE
Hence
.LB
.EQ
E[y sup + (n) sup 2 ] ~~ = ~~ E[x sup + (n) sup 2 ] ~-~
2kE[x sup + (n) x sup - (n-1) ] ~+~ k sup 2 E [x sup - (n-1) sup 2 ] ,
.EN
.LE
where $E$ stands for expected value, and this reaches a minimum when the
derivative with respect to $k$ becomes zero:
.LB
.EQ
-2E[x sup + (n) x sup - (n-1) ] ~+~ 2kE[x sup - (n-1) sup 2 ] ~~=~0 ,
.EN
.LE
that is, when
.LB
.EQ
k~ = ~~ {E[x sup + (n) x sup - (n-1) ]} over {E[x sup - (n-1) sup 2 ]
} ~ .
.EN
.LE
A similar calculation shows that the output of the lower path is minimized
when
.LB
.EQ
k~ = ~~ {E[x sup + (n) x sup - (n-1) ]} over {E[x sup + (n) sup 2 ]
} ~ .
.EN
.LE
Unfortunately, either of these expressions can exceed 1, leading to an
unstable filter.
The value of $k$ cited earlier is the geometric mean of these two
expressions, and since it is a correlation coefficient, cannot exceed 1
in magnitude.
.pp
Another possibility is to minimize the expected value of the sum of the
squares of the upper and lower outputs:
.LB
.EQ
y sup + (n) sup 2 ~+~ y sup - (n) sup 2 ~~ = ~~
(1+k sup 2 )x sup + (n) sup 2 ~-~ 4kx sup + (n) x sup - (n-1) ~+~
(1+k sup 2 )x sup - (n-1) sup 2 .
.EN
.LE
Taking expected values and setting the derivative with respect to k to zero
leads to
.LB
.EQ
k~ = ~~ {E[x sup + (n) x sup - (n-1) ]} over
{ half ~ E[x sup + (n) sup 2 ~+~ x sup - (n-1) sup 2 ]} ~.
.EN
.LE
This also is guaranteed to be less than 1 in magnitude, and has given good
results in speech analysis systems.
.pp
Figure 6.14 shows the implementation of a single section of an analysis
lattice.
.FC "Figure 6.14"
The signals $x sup + (n)$ and $x sup - (n-1)$ are fed to a
correlator, which produces a suitable value for $k$.
This value is used to calculate the output of the lattice section,
and hence the input to the next lattice section.
The reflection coefficient needs to be low-pass filtered, because it will
only be transmitted to the synthesizer occasionally (say every 20\ msec) and so a
short-term average is required.
.pp
One implementation of the correlator is shown in Figure 6.15 (Kang, 1974).
.[
Kang 1974
.]
.FC "Figure 6.15"
This calculates the value of $k$ given by the last equation above, and does it
by summing and differencing the two
signals $x sup + (n)$ and $x sup - (n-1)$, squaring the results to give
.LB
.EQ
x sup + (n) sup 2 + 2x sup + (n mark ) x sup - (n-1) +x sup - (n-1) sup 2
~~~~~~~~ x sup + (n) sup 2 - 2x sup + (n) x sup - (n-1) +x sup - (n-1) sup 2
~ ,
.EN
.LE
and summing and differencing these, to yield
.LB
.EQ
lineup 2x sup + (n) sup 2 + 2x sup - (n-1) sup 2 ~~~~~~~~
4x sup + (n) x sup - (n-1) ~ .
.EN
.LE
.sp
Before these are divided to give the final coefficient $k$, they are
individually low-pass filtered.
While some rather complex schemes have been proposed,
based upon Kalman filter theory (eg Matsui
.ul
et al,
1972),
.[
Matsui Nakajima Suzuki Omura 1972
.]
a simple exponential weighted past average has been found to be
satisfactory. This has $z$-transform
.LB
.EQ
1 over {64 - 63 z sup -1} ~ ,
.EN
.LE
that is, in the time domain,
.LB
.EQ
y(n)~ = ~~ 63 over 64 ~ y(n-1) ~+~ 1 over 64 ~ u(n) ~ ,
.EN
.LE
where $u(n)$ is the input to the filter and $y(n)$ its output.
This filter exponentially averages past sample values
with a time-constant of 64 sampling intervals
\(em that is, 8\ msec at an 8\ kHz sampling rate.
.sh "6.4 Pitch estimation"
.pp
It is sometimes useful to think of linear prediction as a kind of
curve-fitting technique.
Figure 6.16 illustrates how four samples of a speech signal can predict
the next one.
.FC "Figure 6.16"
In essence, a curve is drawn through four points
to predict the position of the fifth, and only the prediction error
is actually transmitted. Now if the order of linear prediction
is high enough (at least 10), and if the coefficients are chosen
correctly, the prediction will closely model the resonances of the
vocal tract. Thus the error will actually be zero, except at pitch
pulses.
.pp
Figure 6.17 shows a segment of voiced speech together with the prediction
error (often called the prediction residual).
.FC "Figure 6.17"
It is apparent that the
error is indeed small, except at pitch pulses.
This suggests that a good way to determine the pitch period is to examine
the error signal, perhaps by looking at its autocorrelation function.
As with all pitch detection methods, one must be
careful: spurious peaks can occur, especially in nasal sounds when
the all-pole model provided by linear prediction fails. Continuity
constraints, which use previous values of pitch period when determining
which peak to accept as a new pitch impulse, can eliminate many of these
spurious peaks. Unvoiced speech should produce an error signal with no
prominent peaks, and this needs to be detected.
Voiced fricatives are a difficult case: peaks should be present
but the general noise level of the error signal will be greater than
it is in
purely voiced speech.
Such considerations have been taken into account in a practical pitch
estimation system based upon this technique (Markel, 1972).
.[
Markel 1972 SIFT
.]
.pp
This method of pitch detection highlights another advantage of the lattice
analysis technique. When using autocorrelation or covariance analysis to
determine the filter (or reflection) coefficients, the error signal is not
normally produced. It can, of course, be found by taking the speech samples
which constitute the current frame and running them through an analysis
filter whose parameters are those determined by the analysis, but this
is a computationally demanding exercise, for the filter must run at the
speech sampling rate (say 8\ kHz) instead of at the frame rate (say 50\ Hz).
Usually, pitch is estimated by other methods, like those discussed in
Chapter 4, when using autocorrelation or covariance linear prediction.
However, we have seen above that with the lattice method, the error
signal is produced as a byproduct: it appears at the right-hand end
of the upper path of the lattice chain. Thus it is already available
for use in determining pitch periods.
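.pp
In its crudest form the idea takes only a few lines of code. The sketch
below simply returns the lag of the largest autocorrelation peak of the
residual within a plausible range of pitch periods; the constants and the
function name are invented for the illustration, and a practical system
such as the one cited above would add the continuity constraints and
voicing checks just discussed.
.LB
.nf
const N = 400;                  {residual samples examined: 50 msec at 8 kHz}
      minlag = 20;              {2.5 msec, ie 400 Hz}
      maxlag = 160;             {20 msec, ie 50 Hz}
.sp
type rvec = array [0..N\-1] of real;
.sp
function pitchlag(var e: rvec): integer;
  var lag, n, best: integer; r, biggest: real;
begin
  best := minlag;  biggest := 0;
  for lag := minlag to maxlag do begin
    r := 0;
    for n := 0 to N\-1\-lag do r := r + e[n]*e[n+lag];
    if r > biggest then begin biggest := r; best := lag end
  end;
  pitchlag := best
end;
.fi
.LE
Comparing the winning peak with the overall energy of the residual gives
one simple basis for deciding whether the frame is voiced at all.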
8084.sh "6.5 Parameter coding for linear predictive storage or transmission" 8085.pp 8086In this section, the coding requirements of linear predictive parameters 8087will be examined. The parameters that need to be stored or transmitted 8088are: 8089.LB 8090.NP 8091pitch 8092.NP 8093voiced-unvoiced flag 8094.NP 8095overall amplitude level 8096.NP 8097filter coefficients or reflection coefficients. 8098.LE 8099The first three are parameters of the excitation source. 8100They can be derived directly from the error signal as indicated above, if 8101it is generated (as it is in lattice implementations); or by other 8102methods if no error signal is calculated. 8103The filter or reflection coefficients are, of course, the main product 8104of linear predictive analysis. 8105.pp 8106It is generally agreed that around 60 levels, logarithmically spaced, 8107are needed to represent pitch for telephone quality speech. 8108The voiced-unvoiced indication requires one bit, but since pitch is 8109irrelevant in unvoiced speech it can be coded as one of the pitch 8110levels. For example, with 6-bit coding of pitch, the value 0 can be 8111reserved to indicate unvoiced speech, with values 1\-63 indicating the 8112pitch of voiced speech. 8113The overall gain has not been discussed above: it is simply the average 8114amplitude of the error signal. Five bits on a logarithmic scale 8115are sufficient to represent it. 8116.pp 8117Filter coefficients are not very amenable to quantization. At least 81188\-10\ bits are required for each one. However, reflection coefficients 8119are better behaved, and 5\-6\ bits each seems adequate. The number of 8120coefficients that must be stored or transmitted is the same as the 8121order of the linear prediction: 10 is commonly used for low-quality 8122speech, with as many as 15 for higher qualities. 8123.pp 8124These figures give around 100\ bits/frame for a 10'th order system using 8125filter coefficients, and around 65\ bits/frame for a 10'th order system 8126using reflection coefficients. Frame lengths vary between 10\ msec 8127and 25\ msec, depending on the quality desired. Thus for 20\ msec frames, 8128the data rates work out at around 5000\ bit/s using filter coefficients, 8129and 3250\ bit/s using reflection coefficients. 8130.pp 8131Substantially lower data rates can be achieved by more careful 8132coding of parameters. In 1976, the US Government defined a standard 8133coding scheme for 10-pole linear prediction with a data rate of 81342400\ bit/s \(em conveniently chosen as one of the 8135commonly-used rates for serial data transmission. 8136This standard, called LPC-10, tackles the difficult problem of 8137protection against transmission errors (Fussell 8138.ul 8139et al, 81401978). 8141.[ 8142Fussell Boudra Abzug Cowing 1978 8143.] 8144.pp 8145Whenever data rates are reduced, redundancy inherent in the signal is 8146necessarily lost and so the effect of transmission errors becomes 8147greatly magnified. 8148For example, a single corrupted sample in PCM transmission of speech 8149will probably not be noticed, and even a short burst of errors will be 8150perceived as a click which can readily be distinguished from the speech. 8151However, any error in LPC transmission will last for one entire 8152frame \(em say 20\ msec \(em and worse still, it will be integrated into the 8153speech signal and not easily discriminated from it by the listener's brain. 8154A single corruption may, for example, change a voiced frame into an 8155unvoiced one, or vice versa. 
Even if it affects only 8156a reflection coefficient it will change the resonance characteristics 8157of that frame, and change them in a way that does not simply sound like 8158superimposed noise. 8159.pp 8160Table 6.1 shows the LPC-10 coding scheme. 8161.RF 8162.in+0.1i 8163.ta 2.0i +1.8i +0.6i 8164.nr x1 (\w'voiced sounds'/2) 8165.nr x2 (\w'unvoiced sounds'/2) 8166.ul 8167 \h'-\n(x1u'voiced sounds \h'-\n(x2u'unvoiced sounds 8168.sp 8169pitch/voicing 7 7 60 pitch levels, Hamming 8170 \h'\w'00 'u'and Gray coded 8171energy 5 5 logarithmically coded 8172$k sub 1$ 5 5 coded by table lookup 8173$k sub 2$ 5 5 coded by table lookup 8174$k sub 3$ 5 5 8175$k sub 4$ 5 5 8176$k sub 5$ 4 \- 8177$k sub 6$ 4 \- 8178$k sub 7$ 4 \- 8179$k sub 8$ 4 \- 8180$k sub 9$ 3 \- 8181$k sub 10$ 2 \- 8182synchronization 1 1 alternating 1,0 pattern 8183error detection/ \- \h'-\w'0'u'21 8184correction 8185 \h'-\w'__'u+\w'0'u'__ \h'-\w'__'u+\w'0'u'__ 8186.sp 8187 \h'-\w'0'u'54 \h'-\w'0'u'54 8188.sp 8189.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 8190 frame rate: 44.4\ Hz (22.5\ msec frames) 8191.in 0 8192.FG "Table 6.1 Bit requirements for each parameter in LPC-10 coding scheme" 8193Different coding is used for voiced and unvoiced frames. 8194Only four reflection coefficients are transmitted for unvoiced frames, 8195because it has been determined that no perceptible increase in speech quality 8196occurs when more are used. 8197The bits saved are more fruitfully employed to provide error detection 8198and correction for the other parameters. 8199Seven bits are used for pitch and the voiced-unvoiced flag, and they are 8200redundant in that only 60 possible pitch values are 8201allowed. 8202Most transmission errors in this field will be detected by the receiver; 8203which can then use an estimate of pitch based on previous values and 8204discard the erroneous one. Pitch values are also Gray coded so that 8205even if errors are not detected, there is a good chance that an adjacent 8206pitch value is read instead. 8207Different numbers of bits are allocated to the various reflection 8208coefficients: experience shows that the lower-numbered ones contribute 8209most highly to intelligibility and so these are quantized most finely. 8210In addition, a table lookup operation is performed on the code 8211generated for the first two, providing a non-linear quantization which is 8212chosen to minimize the error on a statistical basis. 8213.pp 8214With 54\ bits/frame and 22.5\ msec frames, LPC-10 requires a 2400\ bit/s 8215data rate. Even lower rates have been used successfully for lower-quality 8216speech. The Speak 'n Spell toy, described in Chapter 11, has an 8217average data rate of 1200\ bit/s. Rates as low as 600\ bit/s have 8218been achieved (Kang and Coulter, 1976) by pattern recognition techniques operating 8219on the reflection coefficients: however, the speech quality is not good. 8220.[ 8221Kang Coulter 1976 8222.] 8223.sh "6.6 References" 8224.LB "nnnn" 8225.[ 8226$LIST$ 8227.] 8228.LE "nnnn" 8229.sh "6.7 Further reading" 8230.pp 8231Most recent books on digital signal processing contain some information 8232on linear prediction (see Oppenheim and Schafer, 1975; Rabiner and Gold, 1975; 8233and Rabiner and Schafer, 1978; all referenced at the end of Chapter 4). 8234.LB "nn" 8235.\"Atal-1971-1 8236.]- 8237.ds [A Atal, B.S. 8238.as [A " and Hanauer, S.L. 
8239.ds [D 1971 8240.ds [T Speech analysis and synthesis by linear prediction of the acoustic wave 8241.ds [J JASA 8242.ds [V 50 8243.ds [P 637-655 8244.nr [P 1 8245.ds [O August 8246.nr [T 0 8247.nr [A 1 8248.nr [O 0 8249.][ 1 journal-article 8250.in+2n 8251This paper is of historical importance because it introduced the idea 8252of linear prediction to the speech processing community. 8253.in-2n 8254.\"Makhoul-1975-2 8255.]- 8256.ds [A Makhoul, J.I. 8257.ds [D 1975 8258.ds [K * 8259.ds [T Linear prediction: a tutorial review 8260.ds [J Proc IEEE 8261.ds [V 63 8262.ds [N 4 8263.ds [P 561-580 8264.nr [P 1 8265.ds [O April 8266.nr [T 0 8267.nr [A 1 8268.nr [O 0 8269.][ 1 journal-article 8270.in+2n 8271An interesting, informative, and readable survey of linear prediction. 8272.in-2n 8273.\"Markel-1976-3 8274.]- 8275.ds [A Markel, J.D. 8276.as [A " and Gray, A.H. 8277.ds [D 1976 8278.ds [T Linear prediction of speech 8279.ds [I Springer Verlag 8280.ds [C Berlin 8281.nr [T 0 8282.nr [A 1 8283.nr [O 0 8284.][ 2 book 8285.in+2n 8286This is the only book which is entirely devoted to linear prediction of speech. 8287It is an essential reference work for those interested in the subject. 8288.in-2n 8289.\"Wiener-1947-4 8290.]- 8291.ds [A Wiener, N. 8292.ds [D 1947 8293.ds [T Extrapolation, interpolation and smoothing of stationary time series 8294.ds [I MIT Press 8295.ds [C Cambridge, Massachusetts 8296.nr [T 0 8297.nr [A 1 8298.nr [O 0 8299.][ 2 book 8300.in+2n 8301Linear prediction is often thought of as a relatively new technique, 8302but it is only its application to speech processing that is novel. 8303Wiener develops all of the basic mathematics used in linear prediction 8304of speech, except the lattice filter structure. 8305.in-2n 8306.LE "nn" 8307.EQ 8308delim $$ 8309.EN 8310.CH "7 JOINING SEGMENTS OF SPEECH" 8311.ds RT "Joining segments of speech 8312.ds CX "Principles of computer speech 8313.pp 8314The obvious way to provide speech output from computers 8315is to select the basic acoustic units to be used; record them; 8316and generate utterances by concatenating together appropriate segments 8317from this pre-stored inventory. 8318The crucial question then becomes, what are the basic units? 8319Should they be whole sentences, words, syllables, or phonemes? 8320.pp 8321There are several trade-offs to be considered here. 8322The larger the units, the more utterances have to be stored. 8323It is not so much the length of individual utterances that is of concern, 8324but rather their variety, which tends to increase exponentially instead 8325of linearly with the size of the basic unit. Numbers provide an 8326easy example: there are $10 sup 7$ 7-digit telephone numbers, and it is 8327certainly infeasible to record each one individually. 8328Note that as storage technology improves the limitation is becoming 8329more and more one of recording the utterances in the first place rather 8330than finding somewhere to store them. 8331At a PCM data rate of 50\ Kbit/s, a 100\ Mbyte disk can hold over 4\ hours 8332of continuous speech. 8333With linear predictive coding at 1\ Kbit/s it holds 0.8 of a 8334megasecond \(em well over a week. And this is a 24-hour 7-day week, 8335which corresponds to a working month; and continuous speech \(em without 8336pauses \(em which probably requires another factor of five for 8337production by a person. 8338Setting up a recording session to fill the disk would be a formidable 8339task indeed! 
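The arithmetic behind these figures is easy to check; a throwaway
sketch (decimal megabytes are assumed) follows.
.LB
.nf
# Rough storage arithmetic for the figures above (illustrative only).

def seconds_of_speech(storage_megabytes, bits_per_second):
    bits = storage_megabytes * 8 * 1000 * 1000
    return bits / bits_per_second

pcm = seconds_of_speech(100, 50000)   # PCM at 50 Kbit/s
lpc = seconds_of_speech(100, 1000)    # linear prediction at 1 Kbit/s

print(pcm / 3600)    # about 4.4 hours
print(lpc)           # 800000 seconds, i.e. 0.8 of a megasecond
print(lpc / 86400)   # a little over 9 days of continuous speech
.fi
.LE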
8340Furthermore, the use of videodisks \(em which will be common domestic items 8341by the end of the decade \(em could increase these figures by a factor of 50. 8342.pp 8343The word seems to be a sensibly-sized basic unit. 8344Many applications use a rather limited vocabulary \(em 190 words 8345for the airline reservation system described in Chapter 1. 8346Even at PCM data rates, this will consume less than 0.5\ Mbyte of 8347storage. 8348Unfortunately, coarticulation and prosodic factors now come into play. 8349.pp 8350Real speech is connected \(em there are few gaps between words. 8351Coarticulation, where sounds are affected by those on either side, 8352naturally operates across word boundaries. 8353And the time constants of coarticulation are associated with the 8354mechanics of the vocal tract and hence measure tens or hundreds 8355of msec. Thus the effects straddle several pitch periods (100\ Hz pitch 8356has 10\ msec period) and cannot be simulated by simple interpolation of the 8357speech waveform. 8358.pp 8359Prosodic features \(em notably pitch and rhythm \(em span much longer 8360stretches of speech than single words. As far as most speech output 8361applications are concerned, they operate at the utterance level of 8362a single, sentence-sized, information unit. They cannot be 8363accomodated if speech waveforms of individual words of 8364the utterance are stored, 8365for it is rarely feasible to alter the fundamental 8366frequency or duration of a time waveform without changing all the formant 8367resonances as well. 8368However, both word-to-word coarticulation and the essential features 8369of rhythm and intonation can be incorporated if the stored words are 8370coded in source-filter form. 8371.pp 8372For more general applications of speech output, the limitations of 8373word storage soon become apparent. Although people's daily 8374vocabularies are not large, most words have a variety 8375of inflected forms which need to be treated separately if a strict 8376policy is adopted of word storage. For instance, in this book 8377there are 84,000 words, and 6,500 (8%) different ones (counting 8378inflected forms). 8379In Chapter 1 alone, there are 6,800 words and 1,700 (25%) different ones. 8380.pp 8381It seems crazy to treat a simple inflection like "$-s$" or its voiced 8382counterpart, "$-z$" (as in "inflection\c 8383.ul 8384s\c 8385"), 8386as a totally different word from the base form. 8387But once you consider storing roots and endings separately, 8388it becomes apparent 8389that there is a vast number of different endings, and it is difficult to know 8390where to draw the line. It is natural to think instead of simply 8391using the syllable as the basic unit. 8392.pp 8393A generous estimate of the number of different syllables in English is 10,000. 8394At three a second, only about an 8395hour's storage is required for them all. But waveform storage 8396will certainly not do. 8397Although coarticulation effects between words are needed to make 8398speech sound fluent, coarticulation between syllables is necessary 8399for it even to be 8400.ul 8401comprehensible. 8402Adopting a source-filter form of representation is essential, as is 8403some scheme of interpolation between syllables which simulates 8404coarticulation. 8405Unfortunately, a great deal of acoustic action occurs at syllable 8406boundaries \(em stops are exploded, the sound source changes 8407between voicing and frication, and so on. 
It may be more appropriate
to consider inverse syllables, comprising a vowel-consonant-vowel sequence
instead of consonant-vowel-consonant.
(These have jokingly been dubbed "lisibles"!)
.pp
There is again some considerable practical difficulty in creating
an inventory of syllables, or lisibles.
Now it is not so much the recording that is impractical, but
the editing needed to ensure that the cuts between syllables are made
at exactly the right point. As units get smaller, the exact
placement of the boundaries becomes ever more critical; and several thousand
sensitive editing jobs is no easy task.
.pp
Since quite general effects of coarticulation must be accommodated
with syllable synthesis, there will not necessarily be significant
deterioration if smaller, demisyllable, units are employed.
This reduces the segment inventory to an estimated 1000\-2000 entries,
and the tedious job of editing each one individually becomes at
least feasible, if not enviable.
Alternatively, the segment inventory could be created by artificial
means involving cut-and-try experiments with resonance parameters.
.pp
The ultimate in economy of inventory size, of course, is to use
phonemes as the basic unit. This makes the most critical
part of the task interpolation between units, rather than their
construction or recording. With only about 40 phonemes
in English, each one can be examined in many different contexts to
ascertain the best data to store.
There is no need to record them directly from a human voice \(em it
would be difficult anyway, for most cannot be produced in isolation.
In fact, a phoneme is an abstract unit, not a particular sound
(recall the discussion of phonology in Chapter 2), and so it is
most appropriate that data be abstracted from several different
realizations rather than an exact record made of any one.
.pp
If information is stored about phonological units of
speech \(em phonemes \(em the difficult task of phonological-to-phonetic
conversion must necessarily be performed automatically.
Allophones are created by altering the transitions between units,
and to a lesser extent by modifying the central parts of the units
themselves.
The rules for making transitions will have a big effect on the
quality of the resulting speech.
Instead of trying to perform this task automatically by a computer
program, the allophones themselves could be stored. This will
ease the job of generating transitions between segments, but
will certainly not eliminate it.
The total number of allophones will depend on the narrowness of the
transcription system: 60\-80 is typical, and it is unlikely to exceed
one or two hundred. In any case there will not be a storage problem.
However, now the burden of producing an allophonic transcription
has been transferred to the person who codes the utterance prior
to synthesizing it. If he is skilful and patient, he should
be able to coax the system into producing fairly understandable
speech, but the effort required for this on a per-utterance basis
should not be underestimated.
8463.RF 8464.nr x0 \w'sentences ' 8465.nr x1 \w' ' 8466.nr x2 \w'depends on ' 8467.nr x3 \w'generalized or ' 8468.nr x4 \w'natural speech ' 8469.nr x5 \w'author of segment' 8470.nr x6 \n(x0u+\n(x1u+\n(x2u+\n(x3u+\n(x4u+\n(x5u 8471.nr x7 (\n(.l-\n(x6)/2 8472.in \n(x7u 8473.ta \n(x0u +\n(x1u +\n(x2u +\n(x3u +\n(x4u 8474 | size of storage source of principal 8475 | utterance method utterance burden is 8476 | inventory inventory placed on 8477 |\h'-1.0i'\l'\n(x6u\(ul' 8478 | 8479sentences | depends on waveform or natural speech recording artist, 8480 | application source-filter storage medium 8481 | parameters 8482 | 8483words | depends on source-filter natural speech recording artist 8484 | application parameters and editor, 8485 | storage medium 8486 | 8487syllables/ | \0\0\010000 source-filter natural speech recording editor 8488 lisibles | parameters 8489 | 8490demi- | \0\0\0\01000 source-filter natural speech recording editor 8491 syllables | parameters or artificially or inventory 8492 | generated compiler 8493 | 8494phonemes | \0\0\0\0\0\040 generalized artificially author of segment 8495 | parameters generated concatenation 8496 | program 8497 | 8498allophones | \0\050\-100 generalized or artificially coder of 8499 | source-filter generated or synthesized 8500 | parameters natural speech utterances 8501 |\h'-1.0i'\l'\n(x6u\(ul' 8502.in 0 8503.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 8504.FG "Table 7.1 Some issues relevant to choice of basic unit" 8505.pp 8506Table 7.1 summarizes in broad brush-strokes the issues which relate to the 8507choice of basic unit for concatenation. 8508The sections which follow provide more detail about the different 8509methods of joining segments of speech together. 8510Only segmental aspects are considered, for the important problems of 8511prosody will be treated in the next chapter. 8512All of the methods rely to some extent on the acoustic properties of speech, 8513and as smaller basic units are considered the role of speech acoustics 8514becomes more important. 8515It is impossible in a book like this to give a detailed account of acoustic 8516phonetics, for it would take several volumes! 8517What I aim to do in the following pages is to highlight some salient features 8518which are relevant to segment concatenation, without attempting to be 8519complete. 8520.sh "7.1 Word concatenation" 8521.pp 8522For general speech output, word concatenation is an inherently limited 8523technique because of the large number of phonetically different words. 8524Despite this fact, it is at present the most widely-used synthesis 8525method, and is likely to remain so for several years. 8526We have seen that the primary problems are word-to-word 8527coarticulation and prosody; and both can be overcome, at least to a useful 8528approximation, by coding the words in source-filter form. 8529.rh "Time-domain techniques." 8530Nevertheless, a surprising number of applications simply store 8531the time waveform, coded, usually, by one of the techniques described in 8532Chapter 3. 8533From an implementation point of view there are many advantages to this. 8534Speech quality can easily be controlled by selecting a suitable sampling 8535rate and coding scheme. 8536A natural-sounding voice is guaranteed; male or female as desired. 
8537The equipment required is minimal \(em a digital-to-analogue 8538converter and post-sampling filter will do for synthesis if 8539PCM coding is used, and 8540DPCM, ADPCM, and delta modulation decoders are not much more complicated. 8541.pp 8542From a speech point of view, the resulting utterances can never be made 8543convincingly fluent. 8544We discussed the early experiments of Stowe and Hampton (1961) 8545at the beginning of Chapter 3. 8546.[ 8547Stowe Hampton 1961 8548.] 8549A major drawback to word concatenation in the 8550analogue domain is the introduction of clicks and other interference 8551between words: it is difficult to prevent the time waveform transitions 8552from adding extraneous sounds. 8553This poses no problem with digital storage, however, for the waveforms 8554can be edited accurately prior to storage so that they start 8555and finish at an exactly 8556zero level. 8557Rather, the lack of fluency stems from the absence of proper control 8558of coarticulation and prosody. 8559.pp 8560But this is not necessarily a serious drawback if the application is 8561a sufficiently limited one. Complete, invariant utterances can be 8562stored as one unit. Often they must contain data-dependent 8563slot-fillers, as in 8564.LB 8565This flight makes \(em stops 8566.LE 8567and 8568.LB 8569Flight number \(em leaves \(em at \(em , arrives in \(em at \(em 8570.LE 8571(taken from the airline reservation system of Chapter 1 8572(Levinson and Shipley, 1980)). 8573.[ 8574Levinson Shipley 1980 8575.] 8576Then, each slot-filling word is recorded in an intonation consistent 8577both with its position in the template utterance and with the 8578intonation of that utterance. 8579This could be done by embedding the word in the utterance 8580for recording, and excising it by digital editing before storage. 8581It would be dangerous to try to take into account coarticulation effects, 8582for the coarticulation could not be made consistent with both the 8583several slot-fillers and the single template. 8584This could be overcome if several versions of the template were stored, 8585but then the scheme becomes subject to combinatorial explosion 8586if there is more than one slot in a single utterance. 8587But it is not really necessary, for the lack of fluency will probably 8588be interpreted by a benevolent listener as an attempt to convey the 8589information as clearly as possible. 8590.pp 8591Difficulties will occur if the same slot-filler is used in different 8592contexts. For instance, the first gap in each of the sentences above 8593contains a number; yet the intonation of that number is different. 8594Many systems simply ignore this problem. 8595Then one does notice anomalies, if one is attentive: the words come, 8596as it were, from different mouths, without fluency. 8597However, the problem is not necessarily acute. If it is, two or more 8598versions of each slot-filler can be recorded, one for each context. 8599.pp 8600As an example, consider the synthesis of 7-digit telephone numbers, 8601like 289\-5371. If one version only of each digit is stored, 8602it should be recorded in a level tone of voice. A pause should be 8603inserted after the third digit of the synthetic number, to accord 8604with common elocution. The result will certainly be unnatural, although 8605it should be clear and intelligible. 8606Any pitch errors in the recordings will make certain numbers 8607audibly anomalous. 
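A concatenation program for this simplest scheme amounts to little more
than table lookup; a minimal sketch follows, in which the recording names
and the fixed 250\ msec pause are illustrative assumptions rather than
features of any particular system.
.LB
.nf
# Sketch of the single-version scheme described above: one stored
# recording per digit, spoken in a level tone, with a pause inserted
# after the third digit.

PAUSE = "silence-250ms"

def playback_sequence(number):
    digits = [d for d in number if d.isdigit()]
    items = []
    for i, d in enumerate(digits):
        items.append("digit-" + d)    # name of the stored recording
        if i == 2:                    # pause after the third digit
            items.append(PAUSE)
    return items

print(playback_sequence("289-5371"))
# ['digit-2', 'digit-8', 'digit-9', 'silence-250ms',
#  'digit-5', 'digit-3', 'digit-7', 'digit-1']
.fi
.LE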
8608At the other extreme, 70 single digits could be stored, one version of 8609each digit for each position in the number. The recording will be 8610tedious and error-prone, and the synthetic utterances will still not 8611be fluent \(em for coarticulation is ignored \(em but instead 8612unnaturally clearly enunciated. A compromise is to record only 8613three versions of each digit, one for any of the 8614five positions 8615.nr x1 \w'\(ul' 8616.nr x2 (8*\n(x1) 8617.nr x3 0.2m 8618\zx\h'\n(x1u'\zx\h'\n(x1u'\h'\n(x1u'\z\-\h'\n(x1u'\zx\h'\n(x1u'\zx\h'\n(x1u'\c 8619\zx\h'\n(x1u'\h'\n(x1u'\v'\n(x3u'\l'-\n(x2u\(ul'\v'-\n(x3u' , 8620another one for the third position 8621\h'\n(x1u'\h'\n(x1u'\zx\h'\n(x1u'\z\-\h'\n(x1u'\h'\n(x1u'\c 8622\h'\n(x1u'\h'\n(x1u'\h'\n(x1u'\v'\n(x3u'\l'-\n(x2u\(ul'\v'-\n(x3u' , 8623and the last for the final position 8624\h'\n(x1u'\h'\n(x1u'\h'\n(x1u'\z\-\h'\n(x1u'\h'\n(x1u'\c 8625\h'\n(x1u'\h'\n(x1u'\zx\h'\n(x1u'\v'\n(x3u'\l'-\n(x2u\(ul'\v'-\n(x3u' . 8626The first version will be in a level voice, the second an 8627incomplete, rising tone; and the third a final, dropping pitch. 8628.rh "Joining formant-coded words." 8629The limitations of the time-domain method are lack of 8630fluency caused by unnatural transitions between words, and the 8631combinatorial explosion created by recording slot-fillers several times 8632in different contexts. 8633Both of these problems can be alleviated by storing formant tracks, 8634concatenating them with suitable interpolation, and applying a complete 8635pitch contour suitable for the whole utterance. 8636But one can still not generate conversational speech, for natural speech 8637rhythms cause non-linear warpings of the time axis which cannot reasonably 8638be imitated by this method. 8639.pp 8640Solving problems often creates others. 8641As we saw in Chapter 4, it is not easy to obtain reliable formant tracks 8642automatically. Yet hand-editing of formant parameters adds a whole new 8643dimension to the problem of vocabulary construction, for it is 8644an exceedingly tiresome and time-consuming task. 8645Even after such tweaking, resynthesized utterances will be degraded 8646considerably from the original, for the source-filter model is by no means 8647a perfect one. 8648A hardware or real-time software formant synthesizer must be added 8649to the system, presenting design problems and creating extra cost. 8650Should a serial or parallel synthesizer be used? \(em the latter offers 8651potentially better speech (especially in nasal sounds), but requires 8652additional parameters, namely formant amplitudes, to be estimated. 8653Finally, as we will see in the next chapter, it is not an easy matter to 8654generate a suitable pitch contour and apply it to the utterance. 8655.pp 8656Strangely enough, the interpolation itself does not present any great 8657difficulty, for there is not enough information in the formant-coded 8658words to make possible sophisticated coarticulation. 8659The need for interpolation is most pressing when one word ends with 8660a voiced sound and the next begins with one. 8661If either the end of the first or the beginning of the second word 8662(or both) is unvoiced, unnatural formant transitions do not matter 8663for they will not be heard. 8664Actually, this is only strictly true for fricative transitions: if 8665the juncture is aspirated then formants will be perceived in the 8666aspiration. However, 8667.ul 8668h 8669is the only fully aspirated sound in English, 8670and it is relatively uncommon. 
8671It is not absolutely necessary to interpolate the fricative filter resonance, 8672because smooth transitions from one fricative sound to another are rare 8673in natural speech. 8674.pp 8675Hence unless both sides of the junction are voiced, no interpolation 8676is needed: simple abuttal of the stored parameter tracks will do. 8677Note that this is 8678.ul 8679not 8680the same as joining time waveforms, for the synthesizer 8681will automatically ensure a relatively smooth transition from one 8682segment to another because of energy storage in the filters. 8683A new set of resonance parameters for the formant-coded words will be stored 8684every 10 or 20 msec (see Chapter 5), and so the transition will automatically 8685be smoothed over this time period. 8686.pp 8687For voiced-to-voiced transitions, some interpolation is needed. 8688An overlap period of duration, say, 50\ msec, is established, and 8689the resonance parameters in the final 50\ msec of the first word are 8690averaged with those in the first 50\ msec of the second. 8691The average is weighted, with the first word's formants dominating 8692at the beginning and their effect progressively dying out 8693in favour of the second word. 8694.pp 8695More sophisticated than a simple average is to weight the components 8696according to how rapidly they are changing. 8697If the spectral change in one word is much greater than that in the 8698other, we might expect that this will dominate the transition. 8699A simple measure of spectral derivative at any given time can be found 8700by adding the magnitude of the discrepancies in each formant frequency 8701between one sample and the next. 8702The spectral change in the transition region can be obtained by summing 8703the spectral derivatives at each sample in the region. 8704Such a measure can perhaps be made more accurate by taking into 8705account the relative importance of the formants, but will probably 8706never be more than a rough and ready yardstick. 8707At any rate, it can be used to load the average in favour of the 8708dominant side of the junction. 8709.pp 8710Much more important for naturalness of the speech are the effects 8711of rhythm and intonation, discussed in the next chapter. 8712.pp 8713Such a scheme has been implemented and tested on \(em guess what! \(em 7-digit 8714telephone numbers (Rabiner 8715.ul 8716et al, 87171971). 8718.[ 8719Rabiner Schafer Flanagan 1971 8720.] 8721Significant improvement (at the 5% level of statistical 8722significance) in people's 8723ability to recall numbers was found for this method over direct 8724abuttal of either natural or synthetic versions of the digits. 8725Although the method seemed, on balance, to produce utterances that were 8726recalled less accurately than completely natural spoken 8727telephone numbers, the difference was not significant (at the 5% level). 8728The system was also used to generate wiring instructions by computer 8729directly from the connection list, as described in Chapter 1. 8730As noted there, synthetic speech was actually preferred to natural speech 8731in the noisy environment of the production line. 8732.rh "Joining linear predictive coded words." 8733Because obtaining accurate formant tracks for natural utterances 8734by Fourier transform methods is difficult, it is worth considering 8735the use of linear prediction as the source-filter model. 
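The overlap-averaging rule described above carries over essentially
unchanged whichever parameter set is adopted.
A minimal sketch of the weighted splice follows, assuming a 10\ msec frame
interval and a 50\ msec overlap, and omitting the spectral-derivative
refinement.
.LB
.nf
# Illustrative sketch of a weighted-overlap splice between two words.
# Each word is a list of frames; a frame is a list of parameter values
# (formant frequencies here, but reflection coefficients work the same).

FRAME_MSEC = 10
OVERLAP_MSEC = 50

def splice(word1, word2):
    n = OVERLAP_MSEC // FRAME_MSEC        # frames in the overlap region
    head, tail1 = word1[:-n], word1[-n:]  # last n frames of first word
    tail2, rest = word2[:n], word2[n:]    # first n frames of second word
    blended = []
    for i in range(n):
        w = (i + 1) / (n + 1)             # weight shifts towards word2
        frame = [(1 - w) * a + w * b for a, b in zip(tail1[i], tail2[i])]
        blended.append(frame)
    return head + blended + rest

word1 = [[500, 1500, 2500]] * 6           # two made-up formant tracks,
word2 = [[300, 2250, 3100]] * 6           # six frames each
print(len(splice(word1, word2)))          # 7 frames: 1 + 5 blended + 1
.fi
.LE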
8736Actually, formant resonances can be extracted from linear predictive 8737coefficients quite easily, but there is no need to do this because 8738the reflection coefficients themselves are quite suitable 8739for interpolation. 8740.pp 8741A slightly different interpolation scheme from that described in the 8742previous section has been reported (Olive, 1975). 8743.[ 8744Olive 1975 8745.] 8746The reflection coefficients were spliced during an overlap region of 8747only 20\ msec. 8748More interestingly, attempts were made to suppress the plosive bursts 8749of stop sounds in cases where they were followed by another stop at 8750the beginning of the next word. 8751This is a common coarticulation, occurring, for instance, in the phrase 8752"stop burst". In running speech, the plosion on the 8753.ul 8754p 8755of "stop" is 8756normally suppressed because it is followed by another stop. 8757This is a particularly striking case because the place of articulation 8758of the two stops 8759.ul 8760p 8761and 8762.ul 8763b 8764is the same: complete suppression is not as likely 8765to happen in "stop gap", for example (although it may occur). 8766Here is an instance of how extra information could improve the 8767quality of the synthetic transitions considerably. 8768However, automatically identifying the place of articulation of stops is 8769a difficult job, of a complexity far above what is appropriate for 8770simply joining words stored in source-filter form. 8771.pp 8772Another innovation was introduced into the transition between two 8773vowel sounds, when the second word began with an accented syllable. 8774A glottal stop was placed at the juncture. 8775Although the glottal stop was not described in Chapter 2, it is a sound 8776used in many dialects of English. It frequently occurs 8777in the utterance "uh-uh", meaning "no". Here it 8778.ul 8779is 8780used to separate two vowel sounds, but in fact this is not particularly 8781common in most dialects. 8782One could say "the apple", "the orange", "the onion" with a neutral vowel 8783in "the" (to rhyme with "\c 8784.ul 8785a\c 8786bove") and a glottal stop as separator, 8787but it is much more usual to rhyme "the" with "he" and introduce a 8788.ul 8789y 8790between the words. 8791Similarly, even speakers who do not normally pronounce an 8792.ul 8793r 8794at the 8795end of words will introduce one in "bigger apple", rather than 8796using a glottal stop. 8797Note that it would be wrong to put an 8798.ul 8799r 8800in "the apple", even 8801for speakers who usually terminate "the" and "bigger" with the same sound. 8802Such effects occur at a high level of processing, and are practically 8803impossible to simulate with word-interpolation rules. 8804Hence the expedient of introducing a glottal stop is a good one, although 8805it is certainly unnatural. 8806.sh "7.2 Concatenating whole or partial syllables" 8807.pp 8808The use of segments larger than a single phoneme or allophone but smaller 8809than a word as the basic unit for speech synthesis has an interesting 8810history. 8811It has long been realized that transitions between phonemes are 8812extremely sensitive and critical components of speech, and thus are 8813essential for successful synthesis. 8814Consider the unvoiced stop sounds 8815.ul 8816p, t, 8817and 8818.ul 8819k. 8820Their central portion is actually silence! 
(Try saying a word like 8821"butter" with a very long 8822.ul 8823t.\c 8824) Hence 8825in this case it is 8826.ul 8827only 8828the transitional information which can distinguish these sounds from 8829each other. 8830.pp 8831Sound segments which comprise the transition from the centre of one phoneme 8832to the centre of the next are called 8833.ul 8834dyads 8835or 8836.ul 8837diphones. 8838The possibility of using them as the basic units for concatenation 8839was first mooted in the mid 1950's. 8840The idea is attractive because there is relatively little spectral 8841movement in the central, so-called "steady-state", portion of many 8842phonemes \(em in the extreme case of unvoiced stops there is not only 8843no spectral movement, but no spectrum at all in the steady state! 8844At that time the resonance synthesizer was in its infancy, and 8845so recorded segments of live speech were used. The early experiments 8846met with little success because of the technical difficulties 8847of joining analogue waveforms and inevitable discrepancies between 8848the steady-state parts of a phoneme recorded in different contexts \(em not 8849to mention the problems of coarticulation and prosody which effectively 8850preclude the use of waveform concatenation at such a low level. 8851.pp 8852In the mid 1960's, with the growing use of resonance synthesizers, 8853it became possible to generate diphones by copying resonance parameters 8854manually from a spectrogram, and improving the result by trial and error. 8855It was not feasible to extract formant frequencies automatically from real 8856speech, though, because the fast Fourier transform was not yet widely 8857known and the computational burden of slow Fourier transformation was 8858prohibitive. 8859For example, a project at IBM stored manually-derived parameter tracks 8860for diphones, identified by pairs of phoneme names (Dixon and Maxey, 1968). 8861.[ 8862Dixon Maxey 1968 8863.] 8864To generate a synthetic utterance it was coded in 8865phonetic form and used to access 8866the diphone table to give a set of parameter tracks for the complete 8867utterance. Note that this is the first system we have encountered 8868whose input is a phonetic transcription which relates to an inventory 8869of truly synthetic character: all previous schemes used recordings of 8870live speech, albeit processed in some form. 8871Since the inventory was synthetic, there was no difficulty in ensuring 8872that discontinuities did not arise between segments beginning and ending with 8873the same phoneme. Thus interpolation was irrelevant, and the synthesis 8874procedure concentrated on prosodic questions. The resulting speech 8875was reported to be quite impressive. 8876.pp 8877Strictly speaking, diphones are not demisyllables but phoneme pairs. 8878In the simplest case they happen to be similar, for two primary diphones 8879characterize a consonant-vowel-consonant syllable. 8880There is an advantage to using demisyllables rather than diphones as the basic 8881unit, for many syllables begin or end with complicated consonant clusters 8882which are not easy to produce convincingly by diphone 8883concatenation. 8884But they are not easy to produce by hand-editing resonance parameters 8885either! 8886Now that speech analysis methods have been developed and refined, 8887resonance parameters or linear predictive coefficients 8888can be extracted automatically 8889from natural utterances, and there has been a resurgence of interest in 8890syllabic and demisyllabic synthesis methods. 
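Whatever the basic unit, the machinery of a system like the IBM one is
essentially table lookup: the phonetic transcription is turned into a
sequence of diphone names, and the stored parameter tracks are
concatenated.
A minimal sketch follows; the table contents are invented placeholders,
not real diphone data.
.LB
.nf
# Sketch of diphone-table lookup: each diphone is named by a pair of
# phonemes and stores a parameter track (a list of frames).

DIPHONES = {
    ("s", "i"): [[0, 0, 6000]] * 4 + [[350, 2100, 2700]] * 3,
    ("i", "k"): [[350, 2100, 2700]] * 3 + [[0, 0, 0]] * 4,
    ("k", "s"): [[0, 0, 0]] * 4 + [[0, 0, 6000]] * 4,
}

def synthesize_tracks(phonemes):
    tracks = []
    for pair in zip(phonemes, phonemes[1:]):   # successive phoneme pairs
        tracks.extend(DIPHONES[pair])          # concatenate stored frames
    return tracks

print(len(synthesize_tracks(["s", "i", "k", "s"])))   # frames for "six"
.fi
.LE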
The wheel has turned 8891full circle, from segments of natural speech to hand-tailored parameters 8892and back again! 8893.pp 8894The advantage of storing demisyllables over syllables (or lisibles) from 8895the point of view of storage capacity has already been pointed out 8896(perhaps 1,000\-2,000 demisyllables as opposed to 4,000\-10,000 syllables). 8897But it is probably not too significant with the continuing decline 8898of storage costs. 8899The requirements are of the order of 25\ Kbyte versus 0.5\ Mbyte 8900for 1200\ bit/s linear predictive coding, and the latter could 8901almost be accomodated today \(em 1981 \(em on a state-of-the-art 8902read-only memory chip. 8903A bigger advantage comes from rhythmic considerations. 8904As we will see in the next chapter, the rhythms of fluent speech cause 8905dramatic variations in syllable duration, but these seem to affect 8906the vowel and closing consonant cluster much more than the initial consonant 8907cluster. Thus if a demisyllable is deemed to begin shortly (say 60\ msec) 8908after onset of the vowel, when the formant structure has settled down, 8909the bulk of the vowel and the closing consonant cluster will form a 8910single demisyllable. The opening cluster of the next syllable will lie 8911in the next demisyllable. Then differential lengthening can be applied 8912to that part of the syllable which tends to be stretched in live speech. 8913.pp 8914One system for demisyllable concatenation has produced excellent results 8915for monosyllabic English words (Lovins and Fujimura, 1976). 8916.[ 8917Lovins Fujimura 1976 8918.] 8919Complex word-final consonant clusters are excluded from the inventory by 8920using syllable affixes 8921.ul 8922s, z, t, 8923and 8924.ul 8925d; 8926these are attached to the 8927syllabic core as a separate exercise (Macchi and Nigro, 1977). 8928.[ 8929Macchi Nigro 1977 8930.] 8931Prosodic rather than segmental considerations are likely to prove the major 8932limiting factor when this scheme is extended to running speech. 8933.pp 8934Monosyllabic words spoken in isolation are coded as linear predictive 8935reflection coefficients, and segmented by digital editing into the initial 8936consonant cluster and the vocalic nucleus plus final cluster. 8937The cut is made 60\ msec into the vowel, as suggested above. 8938This minimizes the difficulty of interpolation when concatenating 8939segments, for there is ample voicing on either side of the juncture. 8940The reflection coefficients should not differ radically because the 8941vowel is the same in each demisyllable. 8942A 40\ msec overlap is used, with the usual linear interpolation. 8943An alternative smoothing rule applies when the second segment has 8944a nasal or glide after the vowel. In this case anticipatory coarticulation 8945occurs, affecting even the early part of the vowel. For example, a vowel 8946is frequently nasalized when followed by a nasal sound \(em even in English 8947where nasalization is not a distinctive feature in vowels (see Chapter 2). 8948Under these circumstances the overlap area is moved forward in time so 8949that the colouration applies throughout almost the whole vowel. 8950.sh "7.3 Phoneme synthesis" 8951.pp 8952Acoustic phonetics is the study of how the acoustic 8953signal relates to the phonetic sequence which was spoken or heard. 8954People \(em especially engineers \(em often ask, how could phonetics not 8955be acoustic? 
In fact it can be articulatory, auditory, or linguistic
(phonological), for example, and we have touched on the first and last
in Chapter 2.
The invention of the sound spectrograph in the late 1940's was an
event of colossal significance for acoustic phonetics, for it somehow
seemed to make the intricacies of speech visible.
(This was thought to be a greater advance than it actually turned
out to be: historically-minded readers should refer to Potter
.ul
et al,
1947,
for an enthusiastic contemporary appraisal of the invention.) A
.[
Potter Kopp Green 1947
.]
result of several years of research at Haskins Laboratories in New York
during the 1950's was a set of "minimal rules for synthesizing speech",
which showed how stylized formant patterns could generate cues for
identifying vowels and, particularly, consonants
(Liberman, 1957; Liberman
.ul
et al,
1959).
.[
Liberman 1957 Some results of research on speech perception
.]
.[
Liberman Ingemann Lisker Delattre Cooper 1959
.]
.pp
These were to form the basis of many speech synthesis-by-rule computer
programs in the ensuing decades. Such programs take as input a
phonetic transcription of the utterance and generate a spoken version
of it. The transcription may be broad or narrow, depending on the
system. Experience has shown that the Haskins rules really are
minimal, and the success of a synthesis-by-rule program depends on
a vast collection of minutiae, each seemingly insignificant in isolation
but whose effects combine to influence the speech quality dramatically.
The best current systems produce clearly understandable
speech which is nevertheless something of a strain to listen to for
long periods.
However, many are not good; and some are execrable.
In recent times commercial influences have unfortunately restricted
the free exchange of results and programs between academic researchers,
thus slowing down progress.
Research attention has turned to prosodic factors,
which are certainly less well understood than segmental ones, and
to synthesis from plain English text rather than from phonetic transcriptions.
.pp
The remainder of this chapter describes the techniques of segmental
synthesis. First it is necessary to introduce some
elements of acoustic phonetics.
It may be worth re-reading Chapter 2 at this point, to refresh
your memory about the classification of speech sounds.
.sh "7.4 Acoustic characterization of phonemes"
.pp
Shortly after the invention of the sound spectrograph an inverse
instrument was developed, called the "pattern playback" synthesizer.
This took as input a spectrogram, either in its original form or
painted by hand.
An optical arrangement was used to modulate the amplitude of some
fifty harmonically-related oscillators by the lightness or darkness
of each point on the frequency axis of the spectrogram.
As it was drawn past the playing head, sound was produced which
had approximately the frequency components shown on the spectrogram,
although the fundamental frequency was constant.
.pp
This device allowed the complicated
acoustic effects seen on a spectrogram (see for example Figures 2.3 and 2.4)
to be replayed in either original or simplified form.
Hence the features which are important for perception of the different sounds
could be isolated.
The procedure was to copy from an actual spectrogram
the features which were most prominent visually, and then to make further
changes by trial and error until the result was judged to have
reasonable intelligibility when replayed.
.pp
For the purpose of acoustic characterization of particular phonemes,
it is useful to consider the central, steady-state part separately from
transitions into and out of the segment.
The steady-state part is that sound which is heard when the phoneme
is prolonged. The term "phoneme" is being used in a rather loose sense
here: it is more appropriate to think of a "sound segment" rather than
the abstract unit which forms the basis of phonological classification,
and this is the terminology I will adopt.
.pp
The essential auditory characteristics of some sound segments are inherent in
their steady states.
If a vowel, for example, is spoken and prolonged, it can readily be
identified by listening to any part of the utterance.
This is not true for diphthongs: if you say "I" very slowly and freeze
your vocal tract posture at any time, the resulting steady-state sound
will not be sufficient to identify the diphthong. Rather, it will be
a vowel somewhere between
.ul
aa
(in "had") or
.ul
ar
(in "hard") and
.ul
ee
(in "heed").
Neither is it true for glides, for prolonging
.ul
w
(in "want") or
.ul
y
(in "you") results in vowels resembling respectively
.ul
u
("hood") or
.ul
ee
("heed").
Fricatives, voiced or unvoiced, can be identified from the steady state;
but stops cannot, for theirs is silent (or \(em in the case
of voiced stops \(em something close to it).
.pp
Segments which are identifiable from their steady state are easy to synthesize.
The difficulty lies with the others, for it must be the transitions which
carry the information. Thus "transitions" are an essential part of speech,
and perhaps the term is unfortunate for it calls to mind an unimportant
bridge between one segment and the next.
It is tempting to use the words "continuant" and "non-continuant" to distinguish
the two categories; unfortunately they are used by phoneticians in a different
sense.
We will call them "steady-state" and "transient" segments. The latter term
is not particularly appropriate, for even sounds in this class
.ul
can
be prolonged: the point is that the identifying information is in the
transitions rather than the steady state.
9088.RF 9089.nr x1 (\w'excitation'/2) 9090.nr x2 (\w'formant resonance'/2) 9091.nr x3 (\w'fricative'/2) 9092.nr x4 (\w'frequencies (Hz)'/2) 9093.nr x5 (\w'resonance (Hz)'/2) 9094.nr x0 4n+1.7i+0.8i+0.6i+0.6i+1.0i+\w'00'+\n(x5 9095.nr x6 (\n(.l-\n(x0)/2 9096.in \n(x6u 9097.ta 4n +1.7i +0.8i +0.6i +0.6i +1.0i 9098 \h'-\n(x1u'excitation \0\0\h'-\n(x2u'formant resonance \0\0\h'-\n(x3u'fricative 9099 \0\0\h'-\n(x4u'frequencies (Hz) \0\0\c 9100\h'-\n(x5u'resonance (Hz) 9101\l'\n(x0u\(ul' 9102.sp 9103.nr x1 (\w'voicing'/2) 9104\fIuh\fR (the) \h'-\n(x1u'voicing \0500 1500 2500 9105\fIa\fR (bud) \h'-\n(x1u'voicing \0700 1250 2550 9106\fIe\fR (head) \h'-\n(x1u'voicing \0550 1950 2650 9107\fIi\fR (hid) \h'-\n(x1u'voicing \0350 2100 2700 9108\fIo\fR (hod) \h'-\n(x1u'voicing \0600 \0900 2600 9109\fIu\fR (hood) \h'-\n(x1u'voicing \0400 \0950 2450 9110\fIaa\fR (had) \h'-\n(x1u'voicing \0750 1750 2600 9111\fIee\fR (heed) \h'-\n(x1u'voicing \0300 2250 3100 9112\fIer\fR (heard) \h'-\n(x1u'voicing \0600 1400 2450 9113\fIar\fR (hard) \h'-\n(x1u'voicing \0700 1100 2550 9114\fIaw\fR (hoard) \h'-\n(x1u'voicing \0450 \0750 2650 9115\fIuu\fR (food) \h'-\n(x1u'voicing \0300 \0950 2300 9116.nr x1 (\w'aspiration'/2) 9117\fIh\fR (he) \h'-\n(x1u'aspiration 9118.nr x1 (\w'frication'/2) 9119.nr x2 (\w'frication and voicing'/2) 9120\fIs\fR (sin) \h'-\n(x1u'frication 6000 9121\fIz\fR (zed) \h'-\n(x2u'frication and voicing 6000 9122\fIsh\fR (shin) \h'-\n(x1u'frication 2300 9123\fIzh\fR (vision) \h'-\n(x2u'frication and voicing 2300 9124\fIf\fR (fin) \h'-\n(x1u'frication 4000 9125\fIv\fR (vat) \h'-\n(x2u'frication and voicing 4000 9126\fIth\fR (thin) \h'-\n(x1u'frication 5000 9127\fIdh\fR (that) \h'-\n(x2u'frication and voicing 5000 9128\l'\n(x0u\(ul' 9129.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 9130.in 0 9131.FG "Table 7.2 Resonance synthesizer parameters for steady-state sounds" 9132.rh "Steady-state segments." 9133Table 7.2 shows appropriate values for the resonance parameters and 9134excitation sources of a resonance synthesizer, for steady-state 9135segments only. 9136There are several points to note about it. 9137Firstly, all the frequencies involved obviously depend upon the 9138speaker \(em the size of his vocal tract, his accent and speaking habits. 9139The values given are nominal ones for a male speaker with a dialect of 9140British English called "received pronunciation" (RP) \(em for it is what 9141used to be "received" on the wireless in the old days 9142before the British Broadcasting Corporation 9143adopted a policy of more informal, more regional, speech. 9144Female speakers have formant frequencies approximately 15% higher 9145than male ones. 9146Secondly, the third formant is relatively unimportant for vowel 9147identification; it is 9148the first and second that give the vowels their character. 9149Thirdly, formant values for 9150.ul 9151h 9152are not given, for they would be meaningless. 9153Although it is certainly a steady-state sound, 9154.ul 9155h 9156changes radically 9157in context. If you say "had", "heed", "hud", and so on, and freeze 9158your vocal tract posture on the initial 9159.ul 9160h, 9161you will find it 9162already configured for the following vowel \(em an excellent 9163example of anticipatory coarticulation. 9164Fourthly, amplitude values do play some part in identification, 9165particularly for fricatives. 9166.ul 9167th 9168is the weakest sound, closely followed by 9169.ul 9170f, 9171with 9172.ul 9173s 9174and 9175.ul 9176sh 9177the 9178strongest. 
It is necessary to get a reasonable mix of excitation in 9179the voiced fricatives; the voicing amplitude is considerably less than 9180in vowels. Finally, there are other sounds that might be considered 9181steady state ones. You can probably identify 9182.ul 9183m, n, 9184and 9185.ul 9186ng 9187just by 9188their steady states. However, the difference is not particularly 9189strong; it is the transitional parts which discriminate most effectively 9190between these sounds. The steady state of 9191.ul 9192r 9193is quite distinctive, too, 9194for most speakers, because the top of the tongue is curled back in a 9195so-called "retroflex" action and this causes a radical change in the 9196third formant resonance. 9197.rh "Transient segments." 9198Transient sounds include diphthongs, glides, 9199nasals, voiced and unvoiced stops, and affricates. 9200The first two are relatively easy to characterize, for they are 9201basically continuous, gradual transitions from one vocal tract posture 9202to another \(em sort of dynamic vowels. Diphthongs and glides are 9203similar to each other. In fact "you" could be transcribed as 9204a triphthong, 9205.ul 9206i e uu, 9207except that in the initial posture the tongue 9208is even higher, and the vocal tract correspondingly more constricted, 9209than in 9210.ul 9211i 9212("hid") \(em though not as constricted as in 9213.ul 9214sh. 9215Both categories can be represented in terms of target formant 9216values, on the understanding that these are not to be 9217interpreted as steady state configurations but strictly as 9218extreme values at the beginning or end of the formant motion (for 9219transitions out of and into the segment, respectively). 9220.pp 9221Nasals have a steady-state portion comprising a strong nasal formant 9222at a fairly low frequency, on account of the large size of the 9223combined nasal and oral cavity which is resonating. 9224Higher formants are relatively weak, because of attenuation effects. 9225Transitions into and out of nasals are strongly nasalized, 9226as indeed are adjacent vocalic segments, with 9227the oral and nasal tract operating in parallel. As discussed in 9228Chapter 5, this cannot be simulated on a series synthesizer. 9229However, extremely fast motions of the formants occur on account of 9230the binary switching action of the velum, and it turns out that 9231fast formant transitions are sufficient to simulate nasals because 9232the speech perception mechanism is accustomed to hearing them only 9233in that context! Contrast this with the extremely slow transitions 9234in diphthongs and glides. 9235.pp 9236Stops form the most interesting category, and research using the pattern 9237playback synthesizer was instrumental in providing adequate acoustic 9238characterizations for them. Consider unvoiced stops. 9239They each have three phases: transition in, silent central portion, 9240and transition out. There is a lot of action on the transition out 9241(and many phoneticians would divide this part alone into several "phases"). 9242First, as the release occurs, there is a small burst of fricative noise. 9243Say "t\ t\ t\ ..." as in "tut-tut", without producing any voicing. 9244Actually, when used as an admonishment this is accompanied by 9245an ingressive, inhaling air-stream instead of the normal egressive, 9246exhaling one used in English speech (although some languages 9247do have ingressive sounds). 
9248In any case, a short fricative somewhat resembling a tiny 9249.ul 9250s 9251can be heard as the tongue leaves the roof of the mouth. 9252Frication is produced when the gap is very narrow, and ceases 9253rapidly as it becomes wider. 9254Next, when an unvoiced stop is released, a significant amount of aspiration 9255follows the release. 9256Say "pot", "tot", "cot" with force and you will hear the 9257.ul 9258h\c 9259-like 9260aspiration quite clearly. 9261It doesn't always occur, though; for example you will hear little 9262aspiration when a fricative like 9263.ul 9264s 9265precedes the stop in the 9266same syllable, as in "spot", "scot". The aspiration is a distinguishing 9267feature between "white spot" and the rather unlikely "White's pot". 9268It tends to increase as the emphasis on the syllable increases, 9269and this in an example of a prosodic feature influencing segmental 9270characteristics. Finally, at the end of the segment, 9271the aspiration \(em if any \(em will turn to voicing. 9272.pp 9273What has been described applies to 9274.ul 9275all 9276unvoiced stops. 9277What distinguishes one from another? 9278The tiny fricative burst will be different because the noise is produced 9279at different places in the vocal tract \(em at the lips for 9280.ul 9281p, 9282tongue and front of palate for 9283.ul 9284t, 9285and tongue and back of palate for 9286.ul 9287k. 9288The most important difference, however, is the formant motion illuminated 9289by the last vestiges of voicing at closure and by both aspiration and the 9290onset of voicing at opening. 9291Each stop has target formant values which, although 9292they cannot be heard during the stopped portion (for there is no 9293sound there), do affect the transitions in and out. 9294An added complexity is that the target positions themselves vary to some 9295extent depending on the adjacent segments. 9296If the stop is heavily aspirated, the vocal posture will have almost 9297attained that for the following vowel before voicing begins, but 9298the formant transitions will be perceived because they affect 9299the sound quality of aspiration. 9300.pp 9301The voiced stops 9302.ul 9303b, d, 9304and 9305.ul 9306g 9307are quite similar to their unvoiced analogues 9308.ul 9309p, t, 9310and 9311.ul 9312k. 9313What distinguishes them from each other are the formant transitions to 9314target positions, heard during closure and opening. 9315They are distinguished from their unvoiced counterparts by the fact 9316that more voicing is present: it lingers on longer at closure 9317and begins earlier on opening. Thus little or no aspiration appears 9318during the opening phase. If an unvoiced stop is uttered in a context 9319where aspiration is suppressed, as in "spot", it is almost identical to the 9320corresponding voiced stop, "sbot". Luckily no words in English require 9321us to make a distinction in such contexts. 9322Voicing sometimes pervades the entire stopped portion of a voiced stop, 9323especially when it is surrounded by other voiced segments. 9324When saying a word like "baby" slowly you can choose whether or not to 9325prolong voicing throughout the second 9326.ul 9327b. 9328If you do, creating what is 9329called a "voice bar" in spectrograms, 9330the sound escapes through the cheeks, for 9331the lips are closed \(em try doing it for a very long time and your cheeks 9332will fill up with air! 9333This severely attenuates high-frequency components, and can 9334be simulated with a weak first formant at a low resonant frequency. 
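.pp
Expressed as data, the phases just described are easy to represent in a
synthesis program; the sketch below lists those of an unvoiced stop, with
rough durations that are illustrative guesses rather than measured values.
.LB
.nf
# Sketch of an unvoiced stop as an ordered list of acoustic phases.
# Phase names follow the discussion above; durations are guesses.

UNVOICED_STOP_PHASES = [
    ("closure", 40),               # voicing ceases early
    ("silent steady state", 60),   # no sound at all
    ("fricative burst", 15),       # tiny s-like noise at release
    ("aspiration", 50),            # context- and emphasis-dependent
    ("voicing onset", 10),         # aspiration gives way to the vowel
]

def total_msec(phases):
    return sum(duration for name, duration in phases)

print(total_msec(UNVOICED_STOP_PHASES))   # about 175 msec in this sketch
.fi
.LE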
9335.RF 9336.nr x0 \w'unvoiced stops: 'u 9337.nr x1 4n 9338.nr x2 \n(x0+\n(x1+\w'aspiration burst (context- and emphasis-dependent)'u 9339.nr x3 (\n(.l-\n(x2)/2 9340.in \n(x3u 9341.ta \n(x0u +\n(x1u 9342unvoiced stops: closure (early cessation of voicing) 9343 silent steady state 9344 opening, comprising 9345 short fricative burst 9346 aspiration burst (context- and emphasis-dependent) 9347 onset of voicing 9348.sp 9349voiced stops: closure (late cessation of voicing) 9350 steady state (possibility of voice bar) 9351 opening, comprising 9352 pre-voicing 9353 short fricative burst 9354.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 9355.in 0 9356.FG "Table 7.3 Acoustic phases of stop consonants" 9357.pp 9358Table 7.3 summarizes some of the acoustic phases of voiced and unvoiced 9359stops. There are many variations that have not been mentioned. 9360Nasal plosion ("good news") occurs (at the word boundary, in this case) 9361when the nasal formant pervades the 9362opening phase. Stop bursts are suppressed when the next sound is a stop 9363too (the burst on the 9364.ul 9365p 9366of "apt", for example). 9367It is difficult to distinguish a voiced stop from an unvoiced one 9368at the end of a word ("cab" and "cap"); if the speaker is trying to 9369make himself particularly clear he will put a short neutral vowel 9370after the voiced stop to emphasize its early onset of voicing. 9371(If he is Italian he will probably do this anyway, for it is the norm 9372in his own language.) 9373.pp 9374Finally, we turn to affricates, of which there are only two 9375in English: 9376.ul 9377ch 9378("chin") and 9379.ul 9380j 9381("djinn"). 9382They are very similar to the stops 9383.ul 9384t 9385and 9386.ul 9387d 9388followed by the fricatives 9389.ul 9390sh 9391and 9392.ul 9393zh 9394respectively, and their acoustic characterization is similar to that 9395of the phoneme pair. 9396.ul 9397ch 9398has a closing phase, a stopped phase, and a long fricative burst. 9399There is no aspiration, 9400for the vocal cords are not involved. 9401.ul 9402j 9403is the same except that voicing extends further into the stopped 9404portion, and the terminating fricative is also voiced. 9405It may be pronounced with a voice bar if the preceding segment is voiced 9406("adjunct"). 9407.sh "7.5 Speech synthesis by rule" 9408.pp 9409Generation of speech by rules acting upon a phonetic transcription 9410was first investigated in the early 1960's (Kelly and Gerstman, 1961). 9411.[ 9412Kelly Gerstman 1961 9413.] 9414Most systems employ a hardware resonance synthesizer, analogue or digital, 9415series or parallel, 9416to reduce the load on the computer which operates the rules. 9417The speech-by-rule program, rather than the 9418synthesizer, inevitably contributes by far the greater part of the 9419degradation in the resulting speech. 9420Although parallel synthesizers offer greater potential control over 9421the spectrum, it is not clear to what extent a synthesis program can take 9422advantage of this. Parameter tracks for a series synthesizer can 9423easily be converted into linear predictive coefficients, and systems 9424which use a linear predictive synthesizer will probably become popular 9425in the near future. 9426.pp 9427The phrase "synthesis by rule", which is in common use, does not 9428make it clear just what sort of features the rules are supposed to 9429accomodate, and what information must be included explicitly in the 9430input transcription. 9431Early systems made no attempt to simulate prosodics. 
9432Pitch and rhythm could be controlled, but only by inserting 9433pitch specifiers and duration markers in the input. 9434Some kind of prosodic control was often incorporated later, 9435but usually as a completely separate phase from segmental synthesis. 9436This does not allow interaction effects (such as the extra 9437aspiration for voiceless stops in accented syllables) to be taken 9438into account easily. 9439Even systems which perform prosodic operations invariably need to have 9440prosodic specifications embedded explicitly in the input. 9441.pp 9442Generating parameter tracks for a synthesizer from a phonetic transcription 9443is a process of data 9444.ul 9445expansion. 9446Six bits are ample to specify a phoneme, and a speaking rate of 12 phonemes/sec 9447leads to an input data rate of 72 bit/s. 9448The data rate required to control the synthesizer will depend upon the number 9449of parameters and the rate at which they are sampled, 9450but a typical figure is 6 Kbit/s (Chapter 5). 9451Hence there is something like a hundredfold data expansion. 9452.pp 9453Figure 7.1 shows the parameter tracks for a series synthesizer's rendering 9454of the utterance 9455.ul 9456s i k s. 9457.FC "Figure 7.1" 9458There are eight parameters. 9459You can see the onset of frication at the beginning and end (parameter 5), 9460and the amplitude of voicing (parameter 1) come on for the 9461.ul 9462i 9463and off again before the 9464.ul 9465k. 9466The pitch (parameter 0) is falling slowly throughout the utterance. 9467These tracks are stylized: they come from a computer synthesis-by-rule 9468program and not from a human utterance. 9469With a parameter update rate of 10 msec, the graphs can be represented 9470by 90 sets of eight parameter values, a total of 720 values or 4320 bits 9471if a 6-bit representation is used for each value. 9472Contrast this with the input of only four phoneme segments, or say 24 bits. 9473.rh "A segment-by-segment system." 9474A seminal paper appearing in 1964 was the first comprehensive 9475description of a computer-based synthesis-by-rule system 9476(Holmes 9477.ul 9478et al, 94791964). 9480.[ 9481Holmes Mattingly Shearme 1964 9482.] 9483The same system is still in use and has been reimplemented in a more 9484portable form (Wright, 1976). 9485.[ 9486Wright 1976 9487.] 9488The inventory of sound segments 9489includes the phonemes listed in Table 2.1, as well as diphthongs and 9490a second allophone of 9491.ul 9492l. 9493(Many British speakers use quite a different vocal posture for 9494pre- and post-vocalic 9495.ul 9496l\c 9497\&'s, called clear and dark 9498.ul 9499l\c 9500\&'s 9501respectively.) Some phonemes are expanded into sub-phonemic 9502"phases" by the program. Stops have three phases, corresponding to 9503the closure, silent steady state, and opening. 9504Diphthongs have two phases. We will call individual phases and 9505single-phase phonemes "segments", for they are subject to exactly 9506the same transition rules. 9507.pp 9508Parameter tracks are constructed out of linear pieces. 9509Consider a pair of adjacent segments in an utterance to be synthesized. 9510Each one has a steady-state portion and an internal transition. 9511The internal transition of one phoneme is dubbed "external" 9512as far as the other is concerned. 
9513This is important because instead of each segment being responsible 9514for its own internal transition, one of the pair is identified 9515as "dominant" and it controls the duration of both transitions \(em its 9516internal one and its external (the other's internal) one. 9517For example, in Figure 7.2 the segment 9518.ul 9519sh 9520dominates 9521.ul 9522ee 9523and so it 9524governs the duration of both transitions shown. 9525.FC "Figure 7.2" 9526Note that each 9527segment contributes as many as three linear pieces to the parameter track. 9528.pp 9529The notion of domination is similar to that discussed earlier for 9530word concatenation. 9531The difference is that for word concatenation the dominant segment was 9532determined by computing the spectral derivative over the transition 9533region, whereas for synthesis-by-rule 9534segments are ranked according to a static precedence, 9535and the higher-ranking segment dominates. 9536Segments of stop consonants have the highest rank (and also 9537the greatest spectral derivative), while fricatives, nasals, glides, 9538and vowels follow in that order. 9539.pp 9540The concatenation procedure is controlled by a table which associates 954125 quantities with each segment. They are 9542.LB 9543.NI 9544rank 9545.NI 95462\ \ overall durations (for stressed and unstressed occurrences) 9547.NI 95484\ \ transition durations (for internal and external transitions of 9549formant frequencies and amplitudes) 9550.NI 95518\ \ target parameter values (amplitudes and frequencies of three 9552formant resonances, plus fricative information) 9553.NI 95545\ \ quantities which specify how to calculate boundary values for 9555formant frequencies (two for each formant except the third, 9556which has only one) 9557.NI 95585\ \ quantities which specify how to calculate boundary values for 9559amplitudes. 9560.LE 9561This table is rather large. There are 80 segments in all (remember 9562that many phonemes are represented by more than one segment), 9563and so it has 2000 entries. The system was an offline one which ran on 9564what was then \(em 1964 \(em a large computer. 9565.pp 9566The advantage of such a large table of "rules" is the 9567flexibility it affords. 9568Notice that transition durations are specified independently for 9569formant frequency and amplitude parameters \(em this permits 9570fine control which is particularly useful for stops. 9571For each parameter the boundary value between segments is calculated 9572using a fixed contribution from the dominant one 9573and a proportion of the steady state value of the other. 9574.pp 9575It is possible that the two transition durations which are 9576calculated for a segment actually exceed the overall duration specified 9577for it. In this case, the steady-state target values will be approached 9578but not actually attained, simulating a situation where coarticulation 9579effects prevent a target value from being reached. 9580.rh "An event-based system." 9581The synthesis system described above, in common with many others, takes 9582an uncompromisingly segment-by-segment view of speech. 9583The next phoneme is read, perhaps split into a few segments, and 9584these are synthesized one by one with due attention being paid 9585to transitions between them. 9586Some later work has taken a more syllabic view. 9587Mattingly (1976) urges a return to syllables for both practical and 9588theoretical reasons. 9589.[ 9590Mattingly 1976 Syllable synthesis 9591.] 
Transitional effects are particularly strong
within a syllable and comparatively weak (but by no means negligible)
from one syllable to the next. From a theoretical viewpoint,
there are much stronger phonetic restrictions on phoneme sequences
than there are on syllable sequences: pretty well any syllable can
follow another (although whether the pair makes sense is
a different matter), but the linguistically
acceptable phoneme sequences are only a fraction
of those formed by combining phonemes in all
possible ways.
Hill (1978) argues against what he calls the "segmental assumption"
that progress through the utterance should be made one segment at a time,
and recommends a description of speech based upon perceptually relevant
"events".
.[
Hill 1978 A program structure for event-based speech synthesis by rules
.]
This framework is interesting because it provides an opportunity for prosodic
considerations to be treated as an integral part of the synthesis
process.
.pp
The phonetic segments and other information that specify an utterance
can be regarded as a list of events which describes it
at a relatively high level.
Synthesis-by-rule is the act of taking this list and elaborating on it
to produce lower-level events which are realized by the vocal tract,
or acoustically simulated by a resonance synthesizer, to give a speech
waveform.
In articulatory terms, an event might be "begin tongue motion towards
upper teeth with a given effort", while in resonance terms it could be
"begin second formant transition towards 1500\ Hz at a given rate".
(These two examples are
.ul
not
intended to describe the same event: a tongue motion causes much more
than the transition of a single formant.) Coarticulation
issues such as stop burst suppression and nasal plosion should
be easier to imitate within an event-based scheme than a segment-by-segment
one.
.pp
The ISP system (Witten and Abbess, 1979) is event-based.
.[
Witten Abbess 1979
.]
The key to its operation is the
.ul
synthesis list.
To prepare an utterance for synthesis, the lexical items which specify
it are joined into a linked list. Figure 7.3 shows the start of
the list created for
.LB
1
.ul
dh i z i z /*d zh aa k s /h aa u s
.LE
(this is Jack's house); the "1\ ...\ /*\ ...\ /\ ..." are
prosodic markers which will be discussed in the next chapter.
.FC "Figure 7.3"
Next, the rhythm and pitch assignment routines
augment the list with syllable boundaries, phoneme
cluster identifiers, and duration and pitch specifications.
Then it is passed to the segmental synthesis routine
which chains events into the appropriate places and, as it
proceeds, removes the no longer useful elements (phoneme names,
pitch specifiers, etc) which originally constituted the synthesis list.
Finally, an interrupt-driven speech synthesizer handler removes
events from the list as they become due and uses them to control
the hardware synthesizer.
.pp
By adopting the synthesis list as a uniform data structure for
holding utterances at every stage of processing, the problems of storage
allocation and garbage collection are minimized.
Each list element has a forward pointer and five data words, the first
indicating what type of element it is.
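.pp
The flavour of this structure is easily conveyed by a small sketch.
The following fragment, written in Python purely for concreteness, is not the ISP
code and its names are invented; it simply shows lexical items being chained into
a list of elements, each carrying a forward pointer and a few data words of which
the first identifies the element type.
.sp
.nf
.in+2n
# Illustrative sketch only (invented names, not the ISP code): each list
# element carries a forward pointer and five data words, the first of
# which identifies the element type.

class Element:
    def __init__(self, kind, *words):
        self.words = [kind] + list(words)[:4]   # data word 0 is the type
        self.next = None                        # forward pointer

def chain(items):
    """Join lexical items, given as (kind, word, ...) tuples, into a list."""
    head = tail = None
    for item in items:
        el = Element(*item)
        if tail is None:
            head = tail = el
        else:
            tail.next = el
            tail = el
    return head

# Start of the list for "this is Jack's house" (phoneme names as in the text):
utterance = chain([
    ("intonation", "1"),
    ("phoneme", "dh"), ("phoneme", "i"), ("phoneme", "z"),
    ("word boundary",),
    ("phoneme", "i"), ("phoneme", "z"),
    ("word boundary",),
    ("rhythm", "/*"), ("phoneme", "d"), ("phoneme", "zh"), ("phoneme", "aa"),
])
.in-2n
.fi
.sp
.pp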
Lexical items which may appear in the input are
.LB
.NI
end of utterance (".", "!", ",", ";")
.NI
intonation indicator ("1", ...)
.NI
rhythm indicator ("/", "/*")
.NI
word boundary (" ")
.NI
syllable boundary ("'")
.NI
phoneme segment
(\c
.ul
ar, b, ng, ...\c
)
.NI
explicit duration or pitch information.
.LE
Several of these have to do with prosodic features \(em a prime
advantage of the structure is that it does not create an artificial
division between segmentals and prosody.
Syllable boundaries and duration and pitch information are optional.
They will normally be computed by ISP, but the user can override them in the
input in a natural way.
The actual characters which identify lexical items are not fixed
but are taken from the rule table.
.pp
As synthesis
proceeds, new elements are chained into the synthesis list.
For segmental purposes, three types of event are defined \(em
target events, increment events, and aspiration events.
With each event is associated a time at which the event becomes due.
For a target event, a parameter number, target parameter value,
and time-increment are specified.
When it becomes due, motion of the parameter towards the
target is begun. If no other event for that parameter intervenes,
the target value will be reached after the given time-increment.
However, another target event for the parameter may change its motion
before the target has been attained.
Increment events contain a parameter number, a parameter increment,
and a time-increment. The fixed increment is added to the parameter value
throughout the time specified. This provides an easy way to make a
fricative burst during the opening phase of a stop consonant.
Aspiration events switch the mode of excitation from voicing to aspiration
for a given period of time. Thus the aspirated part of unvoiced stops
can be accommodated in a natural manner, by changing the mode of excitation
for the duration of the aspiration.
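.pp
To make the three event types concrete, here is a minimal sketch of how due
events might drive a frame-by-frame parameter update. Again it is written in
Python for illustration only: the names are invented, the timing is simplified,
and it should not be read as a description of the ISP implementation.
.sp
.nf
.in+2n
# Minimal sketch (invented names, not the ISP code) of how the three
# event types might drive a frame-based parameter update, one frame
# every 10 msec.  Event due-times are assumed to fall on frame boundaries.

TICK = 10        # msec between synthesizer parameter frames

class TargetEvent:        # begin moving a parameter towards a target value
    def __init__(self, due, param, target, dt):
        self.due, self.param, self.target, self.dt = due, param, target, dt

class IncrementEvent:     # spread a fixed increment over a period of time
    def __init__(self, due, param, increment, dt):
        self.due, self.param, self.increment, self.dt = due, param, increment, dt

class AspirationEvent:    # switch excitation from voicing to aspiration
    def __init__(self, due, dt):
        self.due, self.dt = due, dt

def synthesize(events, n_params, n_frames):
    params = [0.0] * n_params
    motion = [None] * n_params          # (slope per frame, frames remaining)
    frames = []
    for frame in range(n_frames):
        t = frame * TICK
        aspirated = False
        for e in events:
            if isinstance(e, TargetEvent) and e.due == t:
                # a later target event overrides any motion still in progress
                ticks = max(1, e.dt // TICK)
                motion[e.param] = ((e.target - params[e.param]) / ticks, ticks)
            elif isinstance(e, IncrementEvent) and e.due <= t < e.due + e.dt:
                params[e.param] += e.increment * TICK / e.dt
            elif isinstance(e, AspirationEvent) and e.due <= t < e.due + e.dt:
                aspirated = True
        for p in range(n_params):
            if motion[p] is not None:
                slope, left = motion[p]
                params[p] += slope
                motion[p] = (slope, left - 1) if left > 1 else None
        frames.append((list(params), aspirated))
    return frames
.in-2n
.fi
.sp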
9716.RF 9717.nr x1 (\w'excitation'/2) 9718.nr x2 (\w'formant resonance'/2) 9719.nr x3 (\w'fricative'/2) 9720.nr x4 (\w'type'/2) 9721.nr x5 (\w'frequencies (Hz)'/2) 9722.nr x6 (\w'resonance (Hz)'/2) 9723.nr x0 1.0i+0.7i+0.6i+0.6i+1.0i+1.2i+(\w'long vowel'/2) 9724.nr x7 (\n(.l-\n(x0)/2 9725.in \n(x7u 9726.ta 1.0i +0.7i +0.6i +0.6i +1.0i +1.2i 9727 \h'-\n(x1u'excitation \0\0\h'-\n(x2u'formant resonance \0\0\h'-\n(x3u'fricative \h'-\n(x4u'type 9728 \0\0\h'-\n(x5u'frequencies (Hz) \0\0\h'-\n(x6u'resonance (Hz) 9729\l'\n(x0u\(ul' 9730.sp 9731.nr x1 (\w'voicing'/2) 9732.nr x2 (\w'vowel'/2) 9733\fIuh\fR \h'-\n(x1u'voicing \0490 1480 2500 \c 9734\h'-\n(x2u'vowel 9735\fIa\fR \h'-\n(x1u'voicing \0720 1240 2540 \h'-\n(x2u'vowel 9736\fIe\fR \h'-\n(x1u'voicing \0560 1970 2640 \h'-\n(x2u'vowel 9737\fIi\fR \h'-\n(x1u'voicing \0360 2100 2700 \h'-\n(x2u'vowel 9738\fIo\fR \h'-\n(x1u'voicing \0600 \0890 2600 \h'-\n(x2u'vowel 9739\fIu\fR \h'-\n(x1u'voicing \0380 \0950 2440 \h'-\n(x2u'vowel 9740\fIaa\fR \h'-\n(x1u'voicing \0750 1750 2600 \h'-\n(x2u'vowel 9741.nr x2 (\w'long vowel'/2) 9742\fIee\fR \h'-\n(x1u'voicing \0290 2270 3090 \h'-\n(x2u'long vowel 9743\fIer\fR \h'-\n(x1u'voicing \0580 1380 2440 \h'-\n(x2u'long vowel 9744\fIar\fR \h'-\n(x1u'voicing \0680 1080 2540 \h'-\n(x2u'long vowel 9745\fIaw\fR \h'-\n(x1u'voicing \0450 \0740 2640 \h'-\n(x2u'long vowel 9746\fIuu\fR \h'-\n(x1u'voicing \0310 \0940 2320 \h'-\n(x2u'long vowel 9747.nr x1 (\w'aspiration'/2) 9748.nr x2 (\w'h'/2) 9749\fIh\fR \h'-\n(x1u'aspiration \h'-\n(x2u'h 9750.nr x1 (\w'voicing'/2) 9751.nr x2 (\w'glide'/2) 9752\fIr\fR \h'-\n(x1u'voicing \0240 1190 1550 \h'-\n(x2u'glide 9753\fIw\fR \h'-\n(x1u'voicing \0240 \0650 \h'-\n(x2u'glide 9754\fIl\fR \h'-\n(x1u'voicing \0380 1190 \h'-\n(x2u'glide 9755\fIy\fR \h'-\n(x1u'voicing \0240 2270 \h'-\n(x2u'glide 9756.nr x2 (\w'nasal'/2) 9757\fIm\fR \h'-\n(x1u'voicing \0190 \0690 2000 \h'-\n(x2u'nasal 9758.nr x1 (\w'none'/2) 9759.nr x2 (\w'stop'/2) 9760\fIb\fR \h'-\n(x1u'none \0100 \0690 2000 \h'-\n(x2u'stop 9761\fIp\fR \h'-\n(x1u'none \0100 \0690 2000 \h'-\n(x2u'stop 9762.nr x1 (\w'voicing'/2) 9763.nr x2 (\w'nasal'/2) 9764\fIn\fR \h'-\n(x1u'voicing \0190 1780 3300 \h'-\n(x2u'nasal 9765.nr x1 (\w'none'/2) 9766.nr x2 (\w'stop'/2) 9767\fId\fR \h'-\n(x1u'none \0100 1780 3300 \h'-\n(x2u'stop 9768\fIt\fR \h'-\n(x1u'none \0100 1780 3300 \h'-\n(x2u'stop 9769.nr x1 (\w'voicing'/2) 9770.nr x2 (\w'nasal'/2) 9771\fIng\fR \h'-\n(x1u'voicing \0190 2300 2500 \h'-\n(x2u'nasal 9772.nr x1 (\w'none'/2) 9773.nr x2 (\w'stop'/2) 9774\fIg\fR \h'-\n(x1u'none \0100 2300 2500 \h'-\n(x2u'stop 9775\fIk\fR \h'-\n(x1u'none \0100 2300 2500 \h'-\n(x2u'stop 9776.nr x1 (\w'frication'/2) 9777.nr x2 (\w'voice + fric'/2) 9778.nr x3 (\w'fricative'/2) 9779\fIs\fR \h'-\n(x1u'frication 6000 \h'-\n(x3u'fricative 9780\fIz\fR \h'-\n(x2u'voice + fric \0190 1780 3300 6000 \h'-\n(x3u'fricative 9781\fIsh\fR \h'-\n(x1u'frication 2300 \h'-\n(x3u'fricative 9782\fIzh\fR \h'-\n(x2u'voice + fric \0190 2120 2700 2300 \h'-\n(x3u'fricative 9783\fIf\fR \h'-\n(x1u'frication 4000 \h'-\n(x3u'fricative 9784\fIv\fR \h'-\n(x2u'voice + fric \0190 \0690 3300 4000 \h'-\n(x3u'fricative 9785\fIth\fR \h'-\n(x1u'frication 5000 \h'-\n(x3u'fricative 9786\fIdh\fR \h'-\n(x2u'voice + fric \0190 1780 3300 5000 \h'-\n(x3u'fricative 9787\l'\n(x0u\(ul' 9788.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 9789.in 0 9790.FG "Table 7.4 Rule table for an event-based synthesis-by-rule program" 9791.pp 9792Now the rule table, which is shown in Table 7.4, 9793holds 
simple target positions for each phoneme segment, as well as 9794the segment type. The latter is used to trigger events by computer 9795procedures which have access to the context of the segment. 9796In principle, this allows considerably more sophistication to be 9797introduced than does a simple segment-by-segment approach. 9798.RF 9799.nr x1 0.5i+0.5i+\w'preceding consonant in this syllable (suppress burst if fricative)'u 9800.nr x1 (\n(.l-\n(x1)/2 9801.in \n(x1u 9802.ta 0.5i +0.5i 9803fricative bursts on stops 9804aspiration bursts on unvoiced stops, affected by 9805 preceding consonant in this syllable (suppress burst if fricative) 9806 following consonant (suppress burst if another stop; introduce 9807 nasal plosion if a nasal) 9808 prosodics (increase burst if syllable is stressed) 9809voice bar on voiced stops (in intervocalic position) 9810post-voicing on terminating voiced stops, if syllable is stressed 9811anticipatory coarticulation for \fIh\fR 9812vowel colouring when a nasal or glide follows 9813.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 9814.in 0 9815.FG "Table 7.5 Some coarticulation effects" 9816.pp 9817For example, Table 7.5 summarizes some of the subtleties of the 9818speech production process which have been mentioned earlier in this 9819chapter. Most of them are context-dependent, with the prosodic 9820context (whether two segments are in the same syllable; whether a 9821syllable is stressed) playing a significant role. A scheme where 9822data-dependent "demons" fire on particular patterns in a linked list 9823seems to be a sensible approach towards incorporating such rules. 9824.rh "Discussion." 9825There are two opposing trends in speech synthesis by rule. 9826On the one hand larger and larger segment inventories can be used, 9827containing more and more allophones explicitly. 9828This is the approach of the Votrax sound-segment synthesizer, 9829discussed in Chapter 11. 9830It puts an increasing burden on the person who codes the utterances 9831for synthesis, although, as we shall see, computer programs can assist with 9832this task. 9833On the other hand the segment inventory can be kept small, perhaps 9834comprising just the logical phonemes as in the ISP system. 9835This places the onus on the computer program to accomodate allophonic variations, 9836and to do so it must take account of the segmental and prosodic 9837context of each phoneme. 9838An event-based approach seems to give the best chance of incorporating 9839contextual modification whilst avoiding undesired interactions. 9840.pp 9841The second trend brings synthesis closer to the articulatory process 9842of speech production. In fact an event-based system would be 9843an ideal way of implementing an articulatory model for speech synthesis 9844by rule. It would be much more satisfying to have the rule table 9845contain articulatory target positions instead of resonance ones, 9846with events like "begin tongue motion towards upper teeth with a given 9847effort". The problem is that hard data on articulatory postures and 9848constraints is much more difficult to gather than resonance information. 9849.pp 9850An interesting question that relates to articulation is whether formant 9851motion can be simulated adequately by a small number of linear pieces. 
9852The segment-by-segment system described above had as many as nine 9853pieces for a single phoneme, for some phonemes had three phases 9854and each one contributes up to three pieces (transition in, 9855steady state, and transition out). 9856Another system used curves of decaying exponential 9857form which ensured that all transitions started rapidly towards 9858the target position but slowed down as it was approached (Rabiner, 1968, 1969). 9859.[ 9860Rabiner 1968 Speech synthesis by rule Bell System Technical J 9861.] 9862.[ 9863Rabiner 1969 A model for synthesizing speech by rule 9864.] 9865The time-constant of decay was stored with each segment in the rule 9866table. The rhythm of the synthetic speech was controlled at this level, 9867for the next segment was begun when all the formants had attained 9868values sufficiently close to the current targets. 9869This is a poor model of the human speech production process, where rhythm 9870is dictated at a relatively high level and the next phoneme is not 9871simply started when the current one happens to end. 9872Nevertheless, the algorithm produced smooth, continuous formant motions 9873not unlike those found in spectrograms. 9874.pp 9875There is, however, by no means universal agreement on decaying exponential formant 9876motions. Lawrence (1974) divided segments into "checked" and "free" 9877categories, corresponding roughly to consonants and vowels; and postulated 9878.ul 9879increasing 9880exponential transitions into checked segments, and decaying transitions into 9881free ones. 9882.[ 9883Lawrence 1974 9884.] 9885This is a reasonable supposition if you consider the mechanics of 9886articulation. The speed of movement of the tongue (for example) is likely 9887to increase until it is physically stopped by reaching the roof of the 9888mouth. 9889When moving away from a checked posture into a free one the transition will 9890be rapid at first but slow down to approach the target asymptotically, 9891governed by proprioceptive feedback. 9892.pp 9893The only thing that seems to be agreed is that the formant tracks should 9894certainly 9895.ul 9896not 9897be piecewise linear. However, in the face of 9898conflicting opinions as to whether exponentials should be decaying 9899or increasing, piecewise linear motions seem to be a reasonable 9900compromise! It is likely that the precise shape of formant 9901tracks is unimportant so long as the gross features are imitated 9902correctly. 9903Nevertheless, this is a question which an articulatory model 9904could help to answer. 9905.sh "7.6 References" 9906.LB "nnnn" 9907.[ 9908$LIST$ 9909.] 9910.LE "nnnn" 9911.sh "7.7 Further reading" 9912.pp 9913There are unfortunately few books to recommend on the subject of 9914joining segments of speech. 9915The references form a representative and moderately comprehensive bibliography. 9916Here is some relevant background reading in linguistics. 9917.LB "nn" 9918.\"Fry-1976-1 9919.]- 9920.ds [A Fry, D.B.(Editor) 9921.ds [D 1976 9922.ds [T Acoustic phonetics 9923.ds [I Cambridge Univ Press 9924.ds [C Cambridge, England 9925.nr [T 0 9926.nr [A 0 9927.nr [O 0 9928.][ 2 book 9929.in+2n 9930This book of readings contains many classic papers on acoustic phonetics 9931published from 1922\-1965. 9932It covers much of the history of the subject, and is intended 9933primarily for students of linguistics. 
.in-2n
.\"Lehiste-1967-2
.]-
.ds [A Lehiste, I.(Editor)
.ds [D 1967
.ds [T Readings in acoustic phonetics
.ds [I MIT Press
.ds [C Cambridge, Massachusetts
.nr [T 0
.nr [A 0
.nr [O 0
.][ 2 book
.in+2n
Another basic collection of references which covers much the same ground
as Fry (1976), above.
.in-2n
.\"Sivertsen-1961-3
.]-
.ds [A Sivertsen, E.
.ds [D 1961
.ds [K *
.ds [T Segment inventories for speech synthesis
.ds [J Language and Speech
.ds [V 4
.ds [P 27-89
.nr [P 1
.nr [T 0
.nr [A 1
.nr [O 0
.][ 1 journal-article
.in+2n
This is a careful early study of the quantitative implications of using
phonemes, demisyllables, syllables, and words as the basic building
blocks for speech synthesis.
.in-2n
.LE "nn"
.EQ
delim $$
.EN
.CH "8 PROSODIC FEATURES IN SPEECH SYNTHESIS"
.ds RT "Prosodic features
.ds CX "Principles of computer speech
.pp
Prosodic features are those which characterize an utterance as a whole,
rather than having a local influence on individual sound segments.
For speech output from computers, an "utterance" usually comprises a
single unit of information which stretches over several words \(em a clause
or sentence. In natural speech an utterance can be very much longer, but
it will be broken into prosodic units which are again roughly the size of a
clause or sentence. These prosodic units are certainly closely related
to each other. For example, the pitch contour used when introducing a new
topic is usually different from those employed to develop it subsequently.
However, for the purposes of synthesis the successive prosodic units can
be treated independently, and information about pitch contours to be used
will have to be specified in the input for each one.
The independence between them is not complete, however, and
lower-level contextual effects, such as interpolation of pitch between
the end of one prosodic unit and the start of the next, must still be
imitated.
.pp
Prosodic features were introduced briefly in Chapter 2.
Variations in voice dynamics occur in three dimensions: pitch of the voice,
time, and amplitude.
These dimensions are inextricably intertwined in living speech.
Variations in voice quality are much less important for the factual
kind of speech usually sought in voice response applications,
although they can play a considerable part in conveying emotions
(for a discussion of the acoustic manifestations of emotion in speech,
see Williams and Stevens, 1972).
.[
Williams Stevens 1972
.]
.pp
The distinction between prosodic and segmental effects is a traditional one,
but it becomes rather fuzzy when examined in detail.
It is analogous to the distinction between hardware and
software in computer science: although useful from some points of view,
the borderline becomes blurred as one gets closer to actual systems \(em with
microcode, interrupts, memory management, and the like.
At a trivial level, prosodics
cannot exist without segmentals, for there must be some vehicle to carry the
prosodic contrasts.
Timing \(em a prosodic feature \(em is actually realized by the durations of
individual segments. Pauses are tantamount to silent segments.
10018.pp 10019While pitch may seem to be relatively independent of segmentals \(em and 10020this view is reinforced by the success of the source-filter model 10021which separates the frequency of the 10022excitation source from the filter characteristics \(em there 10023are some subtle phonetic effects of pitch. 10024It has been observed that it drops on the transition into certain 10025consonants, and rises again on the transition out (Haggard 10026.ul 10027et al, 100281970). 10029.[ 10030Haggard Ambler Callow 1970 10031.] 10032This can be explained in terms of variations in pressure from the 10033lungs on the vocal cords (Ladefoged, 1967). 10034.[ 10035Ladefoged 1967 10036.] 10037Briefly, the increase in mouth pressure which occurs during some consonants 10038causes a reduction in the pressure difference across the vocal cords 10039and in the rate of flow of air between them. 10040This results in a decrease in their frequency of vibration. 10041When the constriction is released, there is a temporary increase in the air 10042flow which increases the pitch again. 10043The phenomenon is called "microintonation". 10044It is particularly noticeable in voiced stops, but also occurs in voiced 10045fricatives and unvoiced stops. 10046Simulation of the effect in synthesis-by-rule has often been found to give 10047noticeable improvements in the speech quality. 10048.pp 10049Loudness also has a segmental role. For example, we noted in the last chapter 10050that amplitude values play a small part in identification of fricatives. 10051In fact loudness is a very 10052.ul 10053weak 10054prosodic feature. It contributes little to the perception of stress. 10055Even for shouting the distinction from normal speech is as much in the voice 10056quality as in amplitude 10057.ul 10058per se. 10059It is not necessary to consider varying loudness on a prosodic basis 10060in most speech synthesis systems. 10061.pp 10062The above examples show how prosodic features have segmental influences 10063as well. 10064The converse is also true: some segmental features have a prosodic effect. 10065The last chapter described how stress is associated with increased aspiration 10066of syllable-initial unvoiced stops. Furthermore, stressed syllables 10067are articulated with greater effort than unstressed ones, and hence the formant 10068transitions are more likely to attain their target values 10069under circumstances which would otherwise cause them to fall short. 10070In unstressed syllables, extreme vowels (like 10071.ul 10072ee, aa, uu\c 10073) 10074tend to more centralized sounds 10075(like 10076.ul 10077i, uh, u 10078respectively). 10079Although all British English vowels 10080.ul 10081can 10082appear in unstressed syllables, they often become "reduced" into a 10083centralized form. 10084Consider the following examples. 10085.LB 10086.NI 10087diplomat \ 10088.ul 10089d i p l uh m aa t 10090.NI 10091diplomacy \ 10092.ul 10093d i p l uh u m uh s i 10094.NI 10095diplomatic \ 10096.ul 10097d i p l uh m aa t i k. 10098.LE 10099The vowel of the second syllable is reduced to 10100.ul 10101uh 10102in "diplomat" and "diplomatic", whereas the root form "diploma", and also 10103"diplomacy", has a diphthong 10104(\c 10105.ul 10106uh u\c 10107) 10108there. The third syllable has an 10109.ul 10110aa 10111in "diplomat" and "diplomatic" which is reduced to 10112.ul 10113uh 10114in "diplomacy". 
10115In these cases the reduction is shown explicitly in the phonetic transcription; 10116but in more marginal examples where it is less extreme it will not be. 10117.pp 10118I have tried to emphasize in previous chapters that prosodic features are 10119important in speech synthesis. 10120There is something very basic about them. 10121Rhythm is an essential part of all bodily activity \(em of breathing, 10122walking, working and playing \(em and so it pervades speech too. 10123Mothers and babies communicate effectively using intonation alone. 10124Some experiments have indicated that the language environment of 10125an infant affects his babbling at an early age, before he has effective 10126segmental control. 10127There is no doubt that "tone of voice" plays a large part in human 10128communication. 10129.pp 10130However, early attempts at synthesis did not pay too 10131much attention to prosodics, perhaps because it was thought sufficient to get the 10132meaning across by providing clear segmentals. 10133As artificial speech grows more widespread, however, it is becoming 10134apparent that its acceptability to users, and hence its ultimate 10135success, depends to a large extent on incorporating natural-sounding 10136prosodics. Flat, arhythmic speech may be comprehensible in short stretches, 10137but it strains the concentration in significant discourse and people 10138are not usually prepared to listen to it. 10139Unfortunately, current commercial speech output systems do not really tackle 10140prosodic questions, which indicates our present rather inadequate 10141state of knowledge. 10142.pp 10143The importance of prosodics for automatic speech 10144.ul 10145recognition 10146is beginning to be appreciated too. Some research projects 10147have attended to the automatic identification of points of stress, 10148in the hope that the clear articulation of stressed syllables can be used 10149to provide anchor points in an unknown utterance (for example, see Lea 10150.ul 10151et al, 101521975). 10153.[ 10154Lea Medress Skinner 1975 10155.] 10156.pp 10157But prosodics and segmentals are closely intertwined. 10158I have chosen to 10159treat them in separate chapters in order to split the material up into 10160manageable chunks rather than to enforce a deep division between them. 10161It is also true that synthesis of prosodic features is an uncharted and 10162controversial area, which gives this chapter rather a different 10163flavour from the last. 10164It is hard to be as definite about alternative strategies 10165and methods as you can for segment concatenation. 10166In order to make the treatment as concrete and down-to-earth as possible, 10167I will describe in some detail two example projects in prosodic synthesis. 10168The first treats the problem of transferring pitch from one utterance to 10169another, while the second considers how artificial timing and pitch can be 10170assigned to synthetic speech. 10171These examples illustrate quite different problems, and are reasonably 10172representative of current research activity. 10173(Other systems are described by Mattingly, 1966; Rabiner 10174.ul 10175et al, 101761969.) Before 10177.[ 10178Mattingly 1966 10179.] 10180.[ 10181Rabiner Levitt Rosenberg 1969 10182.] 10183looking at the two examples, we will discuss 10184a feature which is certainly prosodic but does not appear in the 10185list given earlier \(em stress. 
10186.sh "8.1 Stress" 10187.pp 10188Stress is an everyday notion, and when 10189listening to natural speech people can usually agree on which syllables 10190are stressed. But it is difficult to characterize in acoustic terms. 10191From the speaker's point of view, a stressed syllable is produced by 10192pushing more air out of the lungs. For a listener, the points of stress 10193are "obvious". 10194You may think that stressed syllables are louder than the others: however, 10195instrumental studies show that this is not necessarily (nor even usually) 10196so (eg Lehiste and Peterson, 1959). 10197.[ 10198Lehiste Peterson 1959 10199.] 10200Stressed syllables frequently have a longer vowel than unstressed 10201ones, but this is by no means universally true \(em if you say "little" 10202or "bigger" you will find that the vowel in the first, stressed, syllable 10203is short and shows little sign of lengthening as you increase the emphasis. 10204Moreover, experiments using bisyllabic nonsense words have indicated 10205that some people consistently judge the 10206.ul 10207shorter 10208syllable to be stressed in the absence of other clues (Morton and Jassem, 102091965). 10210.[ 10211Morton Jassem 1965 10212.] 10213Pitch often helps to indicate stress. 10214It is not that stressed syllables are always higher- or lower-pitched 10215than neighbouring ones, or even that they are uttered with a rising or 10216falling pitch. It is the 10217.ul 10218rate of change 10219of pitch that tends to be greater 10220for stressed syllables: a sharp rise or fall, 10221or a reversal of direction, helps to give emphasis. 10222.pp 10223Stress is acoustically manifested in timing and pitch, 10224and to a much lesser extent in loudness. 10225However it is a rather subtle feature and does 10226.ul 10227not 10228correspond simply to duration increases or pitch rises. 10229It seems that listeners unconsciously put together all the clues 10230that are present in an utterance in order to deduce which syllables are 10231stressed. 10232It may be that speech is perceived by a listener with reference to how 10233he would have produced it himself, and that this is how he detects which syllables 10234were given greater vocal effort. 10235.pp 10236The situation is confused by the fact that certain syllables in words are 10237often said in ordinary language to be "stressed" on account of their 10238position in the word. For example, the words 10239"diplomat", "diplomacy", and "diplomatic" have stress on the first, 10240second, and third syllables respectively. 10241But here we are talking about the word itself rather than 10242any particular utterance of it. The "stress" is really 10243.ul 10244latent 10245in the indicated syllables and only made manifest upon uttering them, 10246and then to a greater or lesser degree depending on exactly how 10247they are uttered. 10248.pp 10249Some linguists draw a careful distinction between salient syllables, 10250accented syllables, and stressed syllables, 10251although the words are sometimes used differently by different authorities. 10252I will not adopt a precise terminology here, 10253but it is as well to be aware of the subtle distinctions involved. 10254The term "salience" is applied to actual utterances, and salient 10255syllables are those that are perceived as being more prominent than their 10256neighbours. 10257"Accent" is the potential for salience, as marked, for example, 10258in a dictionary or lexicon. 10259Thus the discussion of the "diplo-" words above is about accent. 
10260Stress is an articulatory phenomenon associated with increased 10261muscular activity. 10262Usually, syllables which are perceived as salient were produced with stress, 10263but in shouting, for example, all syllables can be stressed \(em even 10264non-salient ones. 10265Furthermore, accented syllables may not be salient. 10266For instance, the first syllable of the word "very" is accented, 10267that is, potentially salient, but in a sentence as uttered it may or may not be 10268salient. One can say 10269.LB 10270"\c 10271.ul 10272he's 10273very good" 10274.LE 10275with salience on "he" and possibly "good", or 10276.LB 10277"he's 10278.ul 10279very 10280good" 10281.LE 10282with salience on the first syllable of "very", and possibly "good". 10283.pp 10284Non-standard stress patterns are frequently used to bring out contrasts. 10285Words like "a" and "the" are normally unstressed, but can be stressed 10286in contexts where ambiguity has arisen. 10287Thus factors which operate at a much higher level than the phonetic structure 10288of the utterance must be taken into account when deciding where stress 10289should be assigned. These include syntactic and semantic considerations, 10290as well as the attitude of the speaker and the likely attitude of 10291the listener to the material being spoken. 10292For example, I might say 10293.LB 10294"Anna 10295.ul 10296and 10297Nikki should go", 10298.LE 10299with emphasis on the "and" purely because I was aware that my listener 10300might quibble about the expense of sending them both. 10301Clearly some notation is needed to communicate to the synthesis process 10302how the utterance is supposed to be rendered. 10303.sh "8.2 Transferring pitch from one utterance to another" 10304.pp 10305For speech stored in source-filter form and concatenated on a 10306slot-filling basis, it would be useful to 10307have stored typical pitch contours which can be applied to the 10308synthetic utterances. 10309From a practical point of view it is important to be able to generate 10310natural-sounding pitch for high-quality artificial speech. 10311Although several algorithms for creating completely synthetic contours 10312have been proposed \(em and we will examine one later in this chapter \(em 10313they are unsuitable for high-quality speech. 10314They are generally designed for use with synthesis-by-rule from phonetics, 10315and the rather poor quality of articulation does not encourage the 10316development of excellent pitch assignment procedures. With speech 10317synthesized by rule there is generally an emphasis on keeping the 10318data storage requirements to a minimum, and so it is not appropriate 10319to store complete contours. 10320Moreover, if speech is entered in textual 10321form as phoneme strings, it is natural to attach pitch information as markers 10322in the text rather than by entering a complete and detailed contour. 10323.pp 10324The picture is rather different for concatenated segments of natural speech. 10325In the airline reservation system, with utterances formed from templates like 10326.LB 10327Flight number \(em leaves \(em at \(em , arrives in \(em at \(em , 10328.LE 10329it is attractive to store the pitch contour of one complete instance of the 10330utterance and apply it to all synthetic versions. 10331.pp 10332There is an enormous literature on the anatomy of intonation, and much of it 10333rests upon the notion of a pitch contour as a descriptive aid to analysis. 
10334Underlying this is the assumption, usually unstated, that a contour can be 10335discussed independently of the particular stream of words that manifests it; 10336that a single contour can somehow be bound to any sentence (or phrase, or 10337clause) to produce an acceptable utterance. But the contour, and its binding, 10338are generally described only at the grossest level, the details being left 10339unspecified. 10340.pp 10341There are phonetic influences on pitch \(em the characteristic lowering 10342during certain consonants was mentioned above \(em and these are 10343not normally considered as part of intonation. 10344Such effects will certainly spoil attempts to store contours extracted 10345from living speech and apply them to different utterances, but the impairment 10346may not be too great, for pitch is only one of many segmental clues to 10347consonant identification. 10348.pp 10349In the system mentioned earlier which generated 7-digit telephone numbers 10350by concatenating formant-coded words, a single natural pitch contour 10351was applied to all utterances. 10352It was taken to match as well as possible the general shape of the 10353contours measured in naturally-spoken telephone numbers. However, this is a very 10354restricted environment, for telephone numbers exhibit almost no variety in 10355the configuration of stressed and unstressed syllables \(em 10356the only digit which is not a monosyllable is "seven". 10357Significant problems arise when more general utterances are considered. 10358.pp 10359Suppose the pitch contour of one utterance (the "source") 10360is to be transferred to another (the "target"). 10361Assume that the utterances are encoded in source-filter form, 10362either as parameter tracks for a formant synthesizer or as linear predictive 10363coefficients. 10364Then there are no technical obstacles to combining pitch and segmentals. 10365The source must be available as a complete utterance, while the target 10366may be formed by concatenating smaller units such as words. 10367.pp 10368For definiteness, we will consider utterances of the form 10369.LB 10370The price is \(em dollars and \(em cents, 10371.LE 10372where the slots are filled by numbers less than 100; 10373and of the form 10374.LB 10375The price is \(em cents. 10376.LE 10377The domain of prices encompasses a wide range of syllable 10378configurations. 10379There are between one and five syllables in each variable part, 10380if the numbers are restricted to be less than 100. 10381The sentences have a constant pragmatic, semantic, and syntactic structure. 10382As in the vast majority of real-life situations, 10383minimal phonetic distinctions between utterances do not occur. 10384.pp 10385Pitch transfer is complicated by the fact that values of the source pitch 10386are only known during the voiced parts of the utterance. 10387Although it would certainly be possible to extrapolate pitch 10388over unvoiced parts, this would introduce some artificiality into 10389the otherwise completely natural contours. 10390Let us assume, therefore, that the pitch contour 10391of the voiced nucleus of each syllable in the source is applied to the 10392corresponding syllable nucleus in the target. 
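.pp
In source-filter form this operation is mechanically trivial.
The following sketch assumes that each nucleus contour is held as a list of
pitch values sampled once per frame; the names are invented and the syllable
correspondence is taken as given. It shows one syllable's contour being applied
to another of different length by simple linear interpolation; the stretching
and squashing involved is discussed further below.
.sp
.nf
.in+2n
# Minimal sketch of applying one syllable nucleus's pitch contour to
# another of different duration, by linear stretching or squashing.
# Contours are assumed to be lists of pitch values, one per frame;
# names are invented for the example.

def transfer_nucleus(source_contour, target_length):
    """Resample source_contour to target_length frames by linear interpolation."""
    n = len(source_contour)
    if target_length == 1 or n == 1:
        return [source_contour[0]] * target_length
    out = []
    for i in range(target_length):
        # position of target frame i within the source contour
        x = i * (n - 1) / (target_length - 1)
        lo = int(x)
        hi = min(lo + 1, n - 1)
        frac = x - lo
        out.append((1 - frac) * source_contour[lo] + frac * source_contour[hi])
    return out

def transfer_pitch(source_nuclei, target_lengths):
    """Apply each source nucleus contour to the corresponding target nucleus,
    assuming the syllables are already in one-to-one correspondence
    (the alignment rules are discussed later in the chapter)."""
    return [transfer_nucleus(src, length)
            for src, length in zip(source_nuclei, target_lengths)]
.in-2n
.fi
.sp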
10393.pp 10394The primary factors which might tend to inhibit successful transfer 10395are 10396.LB 10397.NP 10398different numbers of syllables in the utterances; 10399.NP 10400variations in the pattern of stressed and unstressed syllables; 10401.NP 10402different syllable durations; 10403.NP 10404pitch discontinuities; 10405.NP 10406phonetic differences between the utterances. 10407.LE 10408.rh "Syllabification." 10409It is essential to take into account the syllable structures 10410of the utterances, so that pitch is transferred between 10411corresponding syllables rather than over the utterance 10412as a whole. 10413Fortunately, syllable boundaries can be detected automatically 10414with a fair degree of accuracy, especially if the speech is carefully 10415enunciated. 10416It is worth considering briefly how this can be done, even though it takes 10417us off the main topic of synthesis and into speech analysis. 10418.pp 10419A procedure developed by Mermelstein (1975) 10420involves integrating the spectral energy 10421at each point in the utterance. 10422.[ 10423Mermelstein 1975 Automatic segmentation of speech into syllabic units 10424.] 10425First the low (<500\ Hz) and high (>4000\ Hz) ends are filtered out 10426with 12\ dB/octave cutoffs. 10427The resulting energy signal is smoothed 10428by a 40\ Hz lowpass filter, giving a so-called "loudness" 10429function. 10430All this can be accomplished with simple recursive digital filters. 10431.pp 10432Then, the loudness function is compared with its convex hull. 10433The convex hull is the shape a piece of elastic would assume if 10434stretched over the top of the loudness function and anchored down at 10435both ends, as illustrated in Figure 8.1. 10436.FC "Figure 8.1" 10437The point of maximum difference between the hull and loudness function 10438is taken to be a tentative syllable 10439boundary. 10440The hull is recomputed, but anchored to the actual loudness function 10441at the tentative boundary, 10442and the points of maximum hull-loudness difference in each of the 10443two halves are selected as further tentative 10444boundaries. 10445The procedure continues recursively until the maximum hull-loudness 10446difference, with the hull anchored at each tentative boundary, 10447falls below a certain minimum (say 4\ dB). 10448.pp 10449At this stage, the number of tentative boundaries will greatly exceed 10450the actual number of syllables (by a factor of around 5). 10451Many of the extraneous boundaries are eliminated by the following 10452constraints: 10453.LB 10454.NP 10455if two boundaries lie within a certain time of each other 10456(say 120\ msec), one of them is discarded; 10457.NP 10458if the maximum loudness within a tentative syllable falls too 10459far short of the overall maximum for the utterance 10460(more than 20\ dB), one boundary is discarded. 10461.LE 10462The question of which boundary to discard can be decided by 10463examining the voicing continuity of the utterance. 10464If possible, voicing across a syllable boundary should be avoided. 10465Otherwise, the boundary with the smallest hull-loudness 10466difference should be rejected. 
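.pp
The recursive convex-hull step is the heart of the procedure, and a short
sketch may make it clearer. The fragment below is illustrative only: the
loudness function is assumed to be supplied as a list of values in decibels,
one per frame, and only the boundary-finding step is shown, not the filtering
or the subsequent pruning of extraneous boundaries.
.sp
.nf
.in+2n
# Illustrative sketch of the convex-hull segmentation step (after
# Mermelstein, 1975).  "loudness" is a list of smoothed, band-limited
# energy values in dB, one per frame.

MIN_HULL_DIFF = 4.0   # dB; stop splitting below this difference

def convex_hull(loudness):
    """Upper convex hull of the loudness function, anchored at both ends."""
    hull = [0.0] * len(loudness)
    stack = [0]
    for i in range(1, len(loudness)):
        while len(stack) >= 2:
            a, b = stack[-2], stack[-1]
            # drop b if it lies on or below the chord from a to i
            if (loudness[b] - loudness[a]) * (i - a) <= (loudness[i] - loudness[a]) * (b - a):
                stack.pop()
            else:
                break
        stack.append(i)
    # fill in hull values by linear interpolation between hull vertices
    for a, b in zip(stack, stack[1:]):
        for i in range(a, b + 1):
            frac = (i - a) / (b - a) if b > a else 0.0
            hull[i] = (1 - frac) * loudness[a] + frac * loudness[b]
    return hull

def tentative_boundaries(loudness, start=0, end=None):
    """Recursively place tentative syllable boundaries at points of
    maximum hull-loudness difference."""
    if end is None:
        end = len(loudness)
    segment = loudness[start:end]
    if len(segment) < 3:
        return []
    hull = convex_hull(segment)
    diffs = [h - x for h, x in zip(hull, segment)]
    peak = max(range(len(diffs)), key=lambda i: diffs[i])
    if diffs[peak] < MIN_HULL_DIFF:
        return []
    boundary = start + peak
    # the hull is re-anchored at the tentative boundary in each half
    return (tentative_boundaries(loudness, start, boundary + 1)
            + [boundary]
            + tentative_boundaries(loudness, boundary, end))
.in-2n
.fi
.sp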
10467.RF 10468.nr x0 \w'boundaries moved slightly to correspond better with voicing:' 10469.nr x1 (\n(.l-\n(x0)/2 10470.in \n(x1u 10471.ta 3.4i +0.5i 10472\l'\n(x0u\(ul' 10473.sp 10474total syllable count: 332 10475boundaries missed by algorithm: \0\09 (3%) 10476extra boundaries inserted by algorithm: \029 (9%) 10477boundaries moved slightly to correspond better with voicing: 10478 \0\03 (1%) 10479.sp 10480total errors: \041 (12%) 10481\l'\n(x0u\(ul' 10482.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 10483.in 0 10484.FG "Table 8.1 Success of the syllable segmentation procedure" 10485.pp 10486Table 8.1 illustrates the success of this syllabification 10487procedure, in a particular example. 10488Segmentation is performed with less than 10% of extraneous 10489boundaries being inserted, 10490and much less than 10% of actual boundaries being missed. 10491These figures are rather sensitive to the values of the 10492three thresholds. 10493The values were chosen to err on the side 10494of over-zealous syllabification, because all the boundaries need to be checked 10495by ear and eye and it is easier to delete 10496a boundary by hand than to insert one at an appropriate place. 10497It may well be that with careful optimization of thresholds, 10498better figures could be 10499achieved. 10500.rh "Stressed and unstressed syllables." 10501If the source and target utterances have the same number of 10502syllables, and the same pattern of stressed and unstressed syllables, 10503pitch can simply be transferred from a syllable in the source 10504to the corresponding one in the target. 10505But if the pattern differs \(em even though the 10506number of syllables may be the same, as in "eleven" and "seventeen" \(em 10507then a one-to-one mapping will conflict with the stress points, 10508and certainly sound unnatural. 10509Hence an attempt should be made to ensure that the pitch is mapped in a 10510plausible way. 10511.pp 10512The syllables of each utterance can be classified as "stressed" 10513and "unstressed". 10514This distinction could be made automatically by 10515inspection of the pitch contour, within the domain of utterances used, 10516and possibly even in general (Lea 10517.ul 10518et al, 105191975). 10520.[ 10521Lea Medress Skinner 1975 10522.] 10523However, in many cases it is expedient to perform the job by hand. 10524In our example, the sentences have fixed "carrier" parts and 10525variable "number" parts. 10526The stressed carrier syllables, namely 10527.LB 10528"... price ... dol\- ... cents", 10529.LE 10530can be marked as such, by hand, 10531to facilitate proper alignment between the source and target. 10532This marking would be difficult to do automatically 10533because it would be hard to distinguish the carrier from the numbers. 10534.pp 10535Even after classifying the syllables as "carrier stressed", 10536"stressed", and "unstressed", alignment still presents problems, 10537because the configuration of syllables in the variable parts 10538of the utterances may differ. 10539Syllables in the source which have no 10540correspondence in the target can be ignored. 10541The pitch track of 10542the source syllable can be replicated for each 10543additional syllable in corresponding 10544position in the target. 10545Of course, a stressed syllable should be selected for copying 10546if the unmatched target syllable is stressed, 10547and similarly for unstressed ones. 
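.pp
One plausible reading of these alignment rules is sketched below; it is
intended only to fix ideas, for the rules themselves are informal.
The classification of each syllable is assumed to be given, and the names
are invented.
.sp
.nf
.in+2n
# One plausible reading of the alignment rules, for illustration only;
# the text emphasizes that the rules were derived intuitively.  A source
# syllable is a (stress_class, contour) pair; a target syllable is just
# its stress class ("carrier", "stressed" or "unstressed").

def align(source, target_classes):
    """Choose a source contour for every target syllable."""
    chosen = []
    pos = 0                      # next source syllable not yet used
    last_of = {}                 # most recently used contour of each class
    for cls in target_classes:
        # look for the next unused source syllable of the same class;
        # source syllables with no correspondence are simply passed over
        k = pos
        while k < len(source) and source[k][0] != cls:
            k += 1
        if k < len(source):
            contour = source[k][1]
            pos = k + 1
        else:
            # no such syllable remains: replicate an earlier one of the
            # same class (a replicated contour would in practice be
            # modified slightly, as described in the text)
            contour = last_of.get(cls, source[-1][1] if source else None)
        last_of[cls] = contour
        chosen.append(contour)
    return chosen
.in-2n
.fi
.sp
.pp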
10548It is rather dangerous to copy exactly a part of a pitch 10549contour, for the ear is very sensitive to the juxtaposition of 10550identically intoned segments of speech \(em especially when the segment is stressed. 10551To avoid this, whenever a stressed syllable is replicated the 10552pitch values should be decreased by, say, 20%, on the second copy. 10553It sometimes happens that a single stressed syllable in the source 10554needs to cover a stressed-unstressed pair in the target: in 10555this case the first part of the source pitch track can be used 10556for the stressed syllable, and the remainder for the 10557unstressed one. 10558.pp 10559The example of Figure 8.2 will help to make these rules clear. 10560.FC "Figure 8.2" 10561Note that the marking alone is done by hand. 10562The detailed mapping decisions can be left to the computer. 10563The rules were derived intuitively, and do not have any sound theoretical 10564basis. 10565They are intended to give reasonable results in the majority of cases. 10566.pp 10567Figure 8.3 shows the result of transferring the pitch from "the price is ten 10568cents" to "the price is seventy-seven cents". 10569.FC "Figure 8.3" 10570The syllable boundaries which are marked were determined automatically. 10571The use of the last 30% of the 10572"ten" contour to cover the first "-en" syllable, and its replication 10573to serve the "-ty" syllable, can be seen. 10574However, the 70%\(em30% proportion is applied to the source contour, 10575and the linear distortion (described next) upsets the proportion in the 10576target utterance. 10577The contour of the second "seven" can be seen to be a 10578replication of that of the first one, lowered by 20%. 10579Notice that the pitch extraction procedure has introduced an artifact into the final 10580part of one of the "cents" contours by doubling the pitch. 10581.rh "Stretching and squashing." 10582The pitch contour over a source syllable nucleus must be stretched 10583or squashed to match the duration 10584of the target nucleus. 10585It is difficult to see how anything other than linear stretching 10586and squashing could be done without considerably increasing the 10587complexity of the procedure. 10588The gross non-linearities will have been accounted for 10589by the syllable alignment process, and so simple linear time-distortion 10590should not cause too much degradation. 10591.rh "Pitch discontinuities." 10592Sudden jumps in pitch during voiced speech sound peculiar, 10593although they can in fact be produced naturally (by yodelling). 10594People frequently burst into laughter on hearing them in synthetic speech. 10595It is particularly important to avoid this diverting effect in 10596voice response applications, 10597for the listener's attention is instantly directed 10598away from what is said to the voice that speaks. 10599.pp 10600Discontinuities can arise in the pitch-transfer procedure either by a 10601voiced-unvoiced-voiced transition between syllables mapping on to 10602a voiced-voiced transition in the target, 10603or by voicing continuity being broken when the syllable 10604alignment procedure drops or replicates a syllable. 10605There are several ways in which at least some of the possibilities can 10606be avoided. 10607For example, one could hold unstressed syllables at a constant pitch 10608whose value coincides with either the end of the previous 10609syllable's contour or the beginning of the next syllable's contour, 10610depending on which transition is voiced. 
10611Alternatively, the policy of reserving the trailing part 10612of a stressed syllable in the source to cover an unmatched following 10613unstressed syllable in the target could be generalized to allow use of the leading 30% 10614of the next stressed syllable's contour instead, 10615if that maintained voicing continuity. 10616A third solution is simply to merge the pitch contours 10617at a discontinuity by mixing the average pitch value at the break 10618with the pitch contour on either side of it in a proportion which 10619increases linearly from the edges of the domain of influence to the discontinuity. 10620Figure 8.4 shows the effect of this merging, 10621when the pitch contour of "the price is seven cents" 10622is transferred to "the price is eleven cents". 10623.FC "Figure 8.4" 10624Of course, the 10625interpolated part will not necessarily be linear. 10626.rh "Results of an experiment on pitch transfer." 10627Some experiments have been conducted to evaluate the performance 10628of this pitch transfer method on the kind of utterances discussed above 10629(Witten, 1979). 10630.[ 10631Witten 1979 On transferring pitch from one utterance to another 10632.] 10633First, the source and target sentences 10634were chosen to be lexically identical, that is, the same words were spoken. 10635For this experiment alone, 10636expert judges were employed. 10637Each sentence was recorded twice (by the same person), 10638and pitch was transferred from copy A 10639to copy B and vice versa. Also, the originals were resynthesized from their linear 10640predictive coefficients with their own pitch contours. 10641Although all four often sounded extremely similar, sometimes the pitch 10642contours of originals A and B were quite different, 10643and in these cases it was immediately obvious to the ear that two of 10644the four utterances shared the same intonation, 10645which was different to that shared by the other two. 10646.pp 10647Experienced researchers in speech analysis-synthesis served as 10648judges. 10649In order to make the test as stringent as possible it was explained 10650to them exactly what had been done, 10651except that the order of the utterances in each quadruple was kept secret. 10652They were asked to identify which two of the four sentences did not have their 10653original contours, 10654and were allowed to listen to each quadruple as often as they liked. 10655On occasion they were prepared to identify only one, or even none, 10656of the sentences as artificial. 10657.pp 10658The result was that an utterance with pitch transferred 10659from another, lexically identical, one is indistinguishable from 10660a resynthesized version of the original, even to a skilled ear. 10661(To be more precise, this hypothesis 10662could not be rejected even at the 1% level of statistical significance.) This 10663gave confidence in the transfer procedure. 10664However, one particular judge was quite successful at identifying the bogus contours, 10665and he attributed his success to the fact that 10666on occasion the segmental durations did not accord with the 10667pitch contour. 10668This casts a shadow of suspicion on the linear stretching and 10669squashing mechanism. 10670.pp 10671The second experiment examined pitch transfers between utterances having only one variable part 10672each ("the price is ... cents") to test the transfer 10673method under relatively controlled conditions. 
10674Ten sentences of the form 10675.LB 10676"The price is \(em cents" 10677.LE 10678were selected to cover 10679a wide range of syllable structures. 10680Each one was regenerated with pitch transferred from each of 10681the other nine, 10682and these nine versions were paired with the original resynthesized 10683with its natural pitch. 10684The $10 times 9=90$ resulting pairs were recorded on tape in random order. 10685.pp 10686Five males and five females, with widely differing occupations 10687(secretaries, teachers, academics, and students), served as judges. 10688Written instructions explained that the tape contained pairs of 10689sentences which were lexically identical but had a slight difference 10690in "tone of voice", and that the subjects were to judge which of 10691each pair sounded "most natural and intelligible". The 10692response form gave the price associated with each pair \(em 10693a preliminary experiment had shown that there was never 10694any difficulty in identifying this \(em and a column for decision. 10695With each decision, the subjects recorded their confidence in the decision. 10696Subjects could rest at any time during the test, which lasted for about 1069730 minutes, but they were not permitted to hear any pair a second time. 10698.pp 10699Defining a "success" to be a choice of the utterance with 10700natural pitch as the best of a pair, 10701the overall success rate was about 60%. 10702If choices were random, one would of course expect only a 50% success rate, 10703and the figure obtained was significantly different from this. 10704Almost half the choices were correct and made with high confidence; 10705high-confidence but incorrect choices accounted for a quarter of the 10706judgements. 10707.pp 10708To investigate structural effects in the pitch transfer process, 10709low confidence decisions were ignored to eliminate noise, and the others 10710lumped together and tabulated by source and target utterance. 10711The number of stressed and unstressed syllables does not appear to play 10712an important part in determining whether a particular utterance is an 10713easy target. 10714For example, it proved to be particularly difficult to tell 10715.EQ 10716delim @@ 10717.EN 10718natural from transferred contours with utterances $0.37 and $0.77. 10719.EQ 10720delim $$ 10721.EN 10722In fact, the results showed no better than random discrimination for them, 10723even though the decisions in which listeners expressed little confidence 10724had been discarded. 10725Hence it seems that the syllable alignment procedure and the policy 10726of replication were successful. 10727.pp 10728.EQ 10729delim @@ 10730.EN 10731The worst target scores were for utterances $0.11 and $0.79. 10732.EQ 10733delim $$ 10734.EN 10735Both of these contained large unbroken voiced periods 10736in the "variable" part \(em almost twice as long as the next longest 10737voiced period. 10738The first has an unstressed syllable followed by 10739a stressed one with no break in voicing, 10740involving, in a natural contour, 10741a fast but continuous climb in pitch over the juncture, 10742and it is not surprising that it proved to be the most difficult target. 10743A more sophisticated "smoothing" algorithm than the 10744one used may be worth investigating. 10745.pp 10746In a third experiment, sentences with two variable parts were used to check 10747that the results of the second experiment extended to more complex 10748utterances. 
The overall success rate was 75%, significantly different from chance.
However, a breakdown of the results by source and target utterance
showed that there was one contour (for the utterance
"the price is 19 dollars and 8 cents") which exhibited very successful
transfer, subjects identifying the transferred-pitch utterances at only
a chance level.
.pp
Finally, transfers of pitch from utterances with two variable parts
to those with one variable part were tested.
Pitch contours were transferred to sentences with the same "cents"
figure but no "dollars" part; for example,
"the price is five dollars and thirteen cents"
to
"the price is thirteen cents". The
contour was simply copied between the corresponding
syllables, so that no adjustment needed to be made
for different syllable structures.
The overall score was 60 successes in 100 judgements \(em
the same percentage as in the second experiment.
.pp
To summarize the results of these four experiments,
.LB
.NP
even accomplished linguists cannot distinguish an utterance from one with
pitch transferred from a different recording of it;
.NP
when the utterance contained only one variable part embedded in a
carrier sentence,
lay listeners identified the original correctly in 60% of cases,
over a wide variety of syllable structures: this
figure differs significantly from the chance value of 50%;
.NP
lay listeners identified the original confidently and correctly in
50% of cases; confidently but incorrectly in 25% of cases;
.NP
the greatest hindrance to successful transfer was the presence of
a long uninterrupted period of voicing in the target utterance;
.NP
the performance of the method deteriorates as the number
of variable parts in the utterances increases;
.NP
some utterances seemed to serve better than others as the pitch source for
transfer, although this was not correlated with complexity of syllable structure;
.NP
even when the utterance contained two variable parts,
there was one source utterance whose pitch contour was
transferred to all the others so successfully that listeners could not identify
the original.
.LE
.pp
The fact that only 60% of originals in the second experiment were
spotted by lay listeners in a stringent
paired-comparison test \(em many of them being identified without confidence \(em
does encourage the use of the procedure for generating stereotyped,
but different, utterances of high quality in voice-response systems.
The experiments indicate that although different syllable patterns
can be handled satisfactorily by this procedure,
long voiced periods should be avoided if possible when designing
the message set,
and that if individual utterances must contain multiple variable parts
the source utterance should be chosen with the aid of listening tests.
.sh "8.3 Assigning timing and pitch to synthetic speech"
.pp
The pitch transfer method can give good results within a fairly narrow
domain of application.
But like any speech output technique which treats complete utterances
as a single unit, with provision for a small number of slot-fillers to
accommodate data-dependent messages, it becomes unmanageable in more general
situations with a large variety of utterances.
10818As with segmental synthesis it becomes necessary to consider methods 10819which use a textual rather than an acoustically-based representation 10820of the prosodic features. 10821.pp 10822This raises a problem with prosodics that was not there for segmentals: how 10823.ul 10824can 10825prosodic features be written in text form? 10826The standard phonetic transcription method does not give much help with 10827notation for prosodics. It does provide a diacritical mark to indicate 10828stress, but this is by no means enough information for synthesis. 10829Furthermore, text-to-speech procedures (described in the next chapter) 10830promise to allow segmentals to be specified by an ordinary orthographic 10831representation of the utterance; but we have seen that considerable 10832intelligence is required to derive prosodic features from text. 10833(More than mere intelligence may be needed: this is underlined by a paper 10834(Bolinger, 1972) 10835delightfully entitled 10836"Accent is predictable \(em if you're a mind reader"!) 10837.[ 10838Bolinger 1972 Accent is predictable \(em if you're a mind reader 10839.] 10840.pp 10841If synthetic speech is to be used as a computer output medium rather 10842than as an experimental tool for linguistic research, it is important 10843that the method of specifying utterances is natural and easy to learn. 10844Prosodic features must be communicated to the computer in a manner 10845considerably simpler than individual duration and pitch specifications 10846for each phoneme, as was required in early synthesis-by-rule systems. 10847Fortunately, a notation has been developed for conveying some of the 10848prosodic features of utterances, as a by-product of the linguistically 10849important task of classifying the intonation contours used in 10850conversational English (Halliday, 1967). 10851.[ 10852Halliday 1967 10853.] 10854This system has even been used to help foreigners speak English 10855(Halliday, 1970) \(em which emphasizes the fact that it was designed for use 10856by laymen, not just linguists! 10857.[ 10858Halliday 1970 Course in spoken English: Intonation 10859.] 10860.pp 10861Here are examples of the way utterances can be conveyed to the ISP 10862speech synthesis system which was described in the previous chapter. 10863The notation is based upon Halliday's. 10864.LB 10865.NI 108663 10867.ul 10868^ aw\ t\ uh/m\ aa\ t\ i\ k /s\ i\ n\ th\ uh\ s\ i\ s uh\ v /*s\ p\ ee\ t\ sh, 10869.NI 108701 10871.ul 10872^ f\ r\ uh\ m uh f\ uh/*n\ e\ t\ i\ k /r\ e\ p\ r\ uh\ z\ e\ n/t\ e\ i\ sh\ uh\ n. 10873.LE 10874(Automatic synthesis of speech, from a phonetic representation.) Three 10875levels of stress are distinguished: tonic or "sentence" stress, 10876marked by "*" before the syllable; foot stress (marked by "/"); 10877and unstressed syllables. 10878The notion of a "foot" controls the rhythm of the speech in a way that 10879will be described shortly. 10880A fourth level of stress is indicated on a segmental basis when a syllable 10881contains a reduced vowel. 10882.pp 10883Utterances are divided by punctuation into 10884.ul 10885tone groups, 10886which are the basic prosodic unit \(em there are two in the example. 10887The shape of the pitch contour is governed by a numeral at the start of 10888each tone group. 10889Crude control over pauses is achieved by punctuation marks: full stop, for 10890example, signals a pause while comma does not. 10891(Longer pauses can be obtained by several full stops as in "...".) 
The 10892"^" character stands for a so-called "silent stress" or breath point. 10893Word boundaries are marked by two spaces between phonemes. 10894As mentioned in the previous chapter, syllable boundaries and explicit 10895pitch and duration specifiers can also be included in the input. 10896If they are not, the ISP system will attempt to compute them. 10897.rh "Rhythm." 10898Our understanding of speech rhythm knows many laws but little order. 10899In the mid 1970's there was a spate of publications reporting new data 10900on segmental duration in various contexts, and there is a growing 10901awareness that segmental duration is influenced by a great many factors, 10902ranging from the structure of a discourse, through semantic and syntactic 10903attributes of the utterances, their phonemic and phonetic make-up, 10904right down to physiological constraints 10905(these multifarious influences are ably documented and reviewed by 10906Klatt, 1976). 10907.[ 10908Klatt 1976 Linguistic uses of segment duration in English 10909.] 10910What seems to be lacking in this work is a conceptual framework on to 10911which new information about segmental duration can be nailed. 10912.pp 10913One starting-point for imitating the rhythm of English speech is the 10914hypothesis of regularly recurring stresses. 10915These stresses are primarily 10916.ul 10917rhythmic 10918ones, and should be distinguished from the tonic stress mentioned above which 10919is primarily an 10920.ul 10921intonational 10922one. 10923Rhythmic stresses are marked in the transcription by a "/". 10924The stretch between one and the next is called a "foot", 10925and the hypothesis above is often referred to as that of isochronous feet 10926("isochronous" means "of equal time"). 10927There is considerable controversy about this hypothesis. 10928It is most popular among British linguists and, it must be admitted, 10929amongst those who work by introspection and intuition and do not actually 10930.ul 10931measure 10932things. 10933Although the question of isochrony of feet has long been debated, there 10934seems to be general agreement 10935\(em even amongst American linguists \(em 10936that there is at least a tendency towards 10937equal spacing of foot boundaries. 10938However, little is known about the strength of this tendency and the extent 10939of deviations from it (see Hill 10940.ul 10941et al, 109421979, for an attempt 10943to quantify it) \(em and there is even evidence to suggest that it may in part 10944be a 10945.ul 10946perceptual 10947phenomenon (Lehiste, 1973). 10948.[ 10949Hill Jassem Witten 1979 10950.] 10951.[ 10952Lehiste 1973 10953.] 10954On this basic point, as on many others, the designer of a prosodic synthesis 10955strategy must needs make assumptions which cannot be properly justified. 10956.pp 10957From a pragmatic point of view there are two advantages to basing 10958a synthesis strategy on this hypothesis. 10959Firstly, it provides a way to represent the many influences of higher-level 10960processes (like syntax and semantics) on rhythm using a simple notation which 10961fits naturally into the phonetic utterance representation, 10962and which people find quite easy to understand and generate. 10963Secondly, it tends to produce a heavily accentuated, but not unnatural, 10964speech rhythm which can easily be moderated into a more acceptable rhythm 10965by departing from isochrony in a controlled manner. 10966.pp 10967The ISP procedure does not make feet exactly isochronous. 
It starts with a standard foot time and attempts to fit the syllables of the
foot into this time.
If doing so would result in certain syllables having less than a preset minimum
duration, the isochrony constraint is relaxed and the foot is expanded.
There is no preset
.ul
maximum
syllable length.
However, when the durations of individual phoneme postures are adjusted
to realize the calculated syllable durations,
limits are imposed on the amount by which individual phonemes can be expanded
or contracted.
Thus a hierarchy of limits exists.
.pp
The rate of talking is determined by the standard foot time.
If this time is short, many feet will be forced to have durations longer than
the standard, and the speech will be "less isochronous".
This seems to accord with common human experience.
If the standard time is longer, however, the minimum syllable limit
will always be exceeded and the speech will be completely isochronous.
If it is too long, the above-mentioned limits to phoneme expansion will
come into play and again partially destroy the isochrony.
.pp
It has often been observed that the final foot of an utterance tends to be
longer than others; as does the tonic foot \(em that which bears the
major stress.
This is easy to accommodate, simply by making the target duration
longer for these feet.
.rh "From feet to syllables."
A foot is a succession of one or more syllables.
Since there are more syllables in some feet than
in others, some syllables must clearly occupy less time than others in order to preserve
the tendency towards isochrony of feet.
.pp
However, the duration of a foot is not divided evenly between its constituent
syllables. The syllables have a definite rhythm of their own, which seems
to be governed by
.LB
.NP
the nature of the salient (that is, the first) syllable of the foot
.NP
the presence of word boundaries within the foot.
.LE
A salient syllable tends to be long either if it contains one of
a class of so-called "long" vowels, or if there is a cluster of two or more
consonants following the vowel.
The pattern of syllables and word boundaries governs the rhythm of the foot,
and Table 8.2 shows the possibilities for one-, two-, and three-syllable feet.
This theory of speech rhythm is due to Abercrombie (1964).
.[
Abercrombie 1964 Syllable quantity and enclitics in English
.]
11020.RF 11021.nr x2 \w'three-syllable feet 'u 11022.nr x3 \w'sal-short 'u 11023.nr x4 \w'weak [#] 'u 11024.nr x5 \w'weak 'u 11025.nr x6 \w'/\fIit s incon\fR/ceivable 'u 11026.nr x1 (\w'syllable rhythm'/2) 11027.nr x7 \n(x2+\n(x3+\n(x4+\n(x5+\n(x6+\n(x1+\n(x1 11028.nr x7 (\n(.l-\n(x7)/2 11029.in \n(x7u 11030.ta \n(x2u +\n(x3u +\n(x4u +\n(x5u +\n(x6u 11031.ul 11032 syllable pattern example \0\0\h'-\n(x1u'syllable rhythm 11033.sp 11034one-syllable feet salient /\fIgood\fR /show 1 11035 ^ weak /\fI^ good\fR/bye 2:1 11036.sp 11037two-syllable feet sal-long weak /\fIcentre\fR /forward 1:1 11038 sal-short weak /\fIatom\fR /bomb 1:2 11039 salient # weak /\fItea for\fR /two 2:1 11040.sp 11041three-syllable feet salient # weak [#] weak /\fIone for the\fR /road 2:1:1 11042 /\fIit's incon\fR/ceivable 11043 sal-long weak # weak /\fIafter the\fR /war 2:3:1 11044 sal-short weak # weak /\fImiddle to\fR /top 1:3:2 11045 sal-long weak weak /\fInobody\fR /knows 3:1:2 11046 sal-short weak weak /\fIanything\fR /more 1:1:1 11047.sp 11048 # denotes a word boundary; 11049 [#] is an optional word boundary 11050.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 11051.FG "Table 8.2 Syllable patterns and rhythms" 11052.pp 11053A foot may have the rhythmical characteristics of a two-syllable foot 11054while having only one syllable, if the first place in it is filled by a 11055silent stress (marked by "^"). 11056This is shown in the second one-syllable example of 11057Table 8.2. 11058A similar effect may occur with two- and three-syllable feet, 11059although examples are not given in the table. 11060Feet of four and five syllables \(em with or without a silent stress \(em are 11061considerably rarer. 11062.pp 11063Syllabification \(em splitting an utterance into syllables \(em is a job 11064which had to be done for the pitch-transfer procedure described earlier, 11065and the nature of syllable rhythms calls for it here too. 11066Even though the utterance is now specified phonetically instead of 11067acoustically, the same basic principle applies. 11068Syllables normally coincide with peaks of sonority, 11069where "sonority" measures the inherent loudness of a sound relative to 11070other sounds of the same duration and pitch. 11071However, difficult cases exist where it seems to be unclear how many syllables 11072there are in a word. (Ladefoged, 1975, discusses this problem with examples 11073such as "real", "realistic", and "reality".) Furthermore, 11074.[ 11075Ladefoged 1975 11076.] 11077care must be taken to avoid counting two syllables in a word like "sky" 11078because of its two peaks of sonority \(em for the stop 11079.ul 11080k 11081has lower 11082sonority than the fricative 11083.ul 11084s. 11085.pp 11086Three levels of notional sonority are enough for syllabification. 11087Dividing phoneme segments into 11088.ul 11089sonorants 11090(glides and nasals), 11091.ul 11092obstruents 11093(stops and fricatives), and vowels; a general syllable has the form 11094.LB 11095.EQ 11096<obstruent> sup * ~ <sonorant> sup * ~ <vowel> sup * ~ <sonorant> sup * ~ 11097<obstruent> sup * ~ , 11098.EN 11099.LE 11100where "*" means repetition, that is, occurrence zero or more times. 11101This sidesteps the "sky" problem by giving fricatives the same 11102sonority as stops. 11103It is easy to use the above structure to count the number 11104of syllables in a given utterance by counting the sonority 11105peaks. 
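.pp
As an illustration, the peak-counting can be written down in a few lines.
The sketch below is in Python rather than the language of the ISP system,
and the phoneme grouping it uses is indicative only, not a complete inventory:
.LB
.nf
VOWELS = {"aa", "ae", "aw", "e", "ee", "ei", "i", "o", "oo", "u", "uh"}
SONORANTS = {"m", "n", "ng", "l", "r", "w", "y"}      # glides and nasals

def sonority(phoneme):
    # three notional levels: vowel = 2, sonorant = 1, obstruent = 0
    if phoneme in VOWELS:
        return 2
    if phoneme in SONORANTS:
        return 1
    return 0          # stops and fricatives share the lowest level

def count_syllables(phonemes):
    # count the peaks in the sonority profile of a phoneme list
    levels = [sonority(p) for p in phonemes]
    peaks, rising = 0, False
    for previous, current in zip([0] + levels, levels):
        if current > previous:
            rising = True
        elif current < previous and rising:
            peaks, rising = peaks + 1, False
    return peaks + (1 if rising else 0)

count_syllables(["s", "k", "aa", "i"])      # "sky": one peak, one syllable
count_syllables(["b", "uh", "t", "n"])      # "butt'n": two peaks, two syllables
.fi
.LE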
11106.pp 11107However, what is required is an indication of syllable 11108.ul 11109boundaries 11110as well as a syllable count. 11111For slow conversational speech, these can be approximated as follows. 11112Word divisions obviously form syllable boundaries, as should 11113foot markers \(em but it may be wise not to assume that the latter do if the 11114utterance has been prepared by someone with little knowledge of linguistics. 11115Syllable boundaries should be made to coincide with sonority minima. 11116As an 11117.ul 11118ad hoc 11119pragmatic 11120rule, if only one segment has the minimum sonority the boundary is placed 11121before it. 11122If there are two segments, each with the minimum sonority, it is placed between 11123them, while for three or more it is placed after the first two. 11124.pp 11125These rules produce obviously acceptable divisions in many cases 11126(to'day, ash'tray, tax'free), with perhaps unexpected positioning of the 11127boundary in others (ins'pire, de'par'tment). 11128Actually, people do differ in placement of syllable boundaries 11129(Abercrombie, 1967). 11130.[ 11131Abercrombie 1967 11132.] 11133.rh "From syllables to segments." 11134The theory of isochronous feet (with the caveats noted earlier) 11135and that of syllable rhythms provide a way of producing durations for 11136individual syllables. But where are these durations supposed to be measured? 11137There is a beat point, or tapping point, near the beginning of each syllable. 11138This is the place where a listener will tap if asked to give one tap to each 11139syllable; it has been investigated experimentally by Allen (1972). 11140.[ 11141Allen 1972 Location of rhythmic stress beats in English One 11142.] 11143It is not necessarily at the very beginning of the syllable. 11144For example, in "straight", the tapping point is certainly after the 11145.ul 11146s 11147and the stopped part of the 11148.ul 11149t. 11150.pp 11151Another factor which relates to the division of the syllable duration 11152amongst phonetic segments is the often-observed fact that the length of the 11153vocalic nucleus is a strong clue to the degree of voicing of the terminating 11154cluster (Lehiste, 1970). 11155.[ 11156Lehiste 1970 Suprasegmentals 11157.] 11158If you say in pairs words like "cap", "cab"; "cat", "cad"; "tack", "tag" 11159you will find that the vowel in the first word of each pair is significantly 11160shorter than that in the second. 11161In fact, the major difference between such pairs is the vowel length, 11162not the final consonant. 11163.pp 11164Such effects can be taken into account by considering a syllable to comprise 11165an initial consonant cluster, followed by a vocalic nucleus and a final 11166consonant cluster. 11167Any of these elements can be missing \(em the most unusual case where the 11168nucleus is absent occurs, for example, in so-called syllabic 11169.ul 11170n\c 11171\&'s 11172(as in renderings of "button", "pudding" which might be written 11173"butt'n", "pudd'n"). 11174However, it is convenient to modify the definition of the nucleus 11175so as to rule out the possibility of it being empty. 11176Using the characterization of the syllable given above, the clusters can 11177be defined as 11178.LB 11179.NI 11180initial cluster = <obstruent>\u*\d <sonorant>\u*\d 11181.NI 11182nucleus = <vowel>\u*\d <sonorant>\u*\d 11183.NI 11184final cluster = <obstruent>\u*\d. 11185.LE 11186Sonorants are included in the nucleus so that it is always present, 11187even in the case of a syllabic consonant. 
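.pp
Continuing the same illustrative sketch (again not the ISP code itself),
the division of a syllable into these three clusters might be programmed as
follows; sonorants are assigned to the nucleus unless a vowel follows them,
so that the nucleus is never empty:
.LB
.nf
VOWELS = {"aa", "e", "ee", "ei", "i", "o", "uh"}       # indicative only
SONORANTS = {"m", "n", "ng", "l", "r", "w", "y"}

def sonority(p):
    return 2 if p in VOWELS else (1 if p in SONORANTS else 0)

def split_syllable(phonemes):
    # initial = obstruent* sonorant*, nucleus = vowel* sonorant*,
    # final = obstruent*; sonorants with no following vowel stay in
    # the nucleus, which therefore covers syllabic consonants too
    n = len(phonemes)
    j = n
    while j > 0 and sonority(phonemes[j - 1]) == 0:
        j -= 1                                  # trailing obstruents
    i = 0
    while i < j and sonority(phonemes[i]) == 0:
        i += 1                                  # leading obstruents
    k = i
    while k < j and sonority(phonemes[k]) == 1:
        k += 1                                  # leading sonorants ...
    if k < j and sonority(phonemes[k]) == 2:
        i = k                                   # ... only if a vowel follows
    return phonemes[:i], phonemes[i:j], phonemes[j:]

split_syllable(["s", "t", "r", "ei", "t"])   # (['s','t','r'], ['ei'], ['t'])
split_syllable(["n"])                        # ([], ['n'], []) for "butt'n"
.fi
.LE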
.pp
Then, rules can be used to divide the syllable duration between the
initial cluster, nucleus, and final cluster.
These must distinguish between situations where the terminating cluster
is voiced or unvoiced so that the characteristic differences in vowel lengths
can be accommodated.
.pp
Finally, the cluster durations must be apportioned amongst their constituent
phonetic segments. There is little published data on which to base this.
Two simple schemes which have been used in ISP are described in
Witten (1977) and Witten & Smith (1977).
.[
Witten 1977 A flexible scheme for assigning timing and pitch to synthetic speech
.]
.[
Witten Smith 1977 Synthesizing British English rhythm
.]
.rh "Pitch."
There are two basically different ways of looking at the pitch of an
utterance.
One is to imagine pitch
.ul
levels
attached to individual syllables.
This has been popular amongst American linguists, and some people
have even gone so far as to associate pitch levels with levels of
stress.
The second approach is to consider pitch
.ul
contours,
as we did earlier when examining how to transfer pitch from one utterance
to another.
This seems to be easier for the person who transcribes the utterances
to produce, for the information required is much less detailed than levels
attached to each syllable. Some indication needs to be given of how
the contour is to be bound to the utterance, and in the notation introduced above
the most prominent, or "tonic", syllable is indicated in the transcription.
.pp
Halliday's (1970) classification identifies five different primary intonation
contours, each hinging on the tonic syllable.
.[
Halliday 1970 Course in spoken English: Intonation
.]
These are sketched in Figure 8.5, in the style of Halliday.
.FC "Figure 8.5"
Several secondary contours, which are variations on the primary ones,
are defined as well.
However, this classification scheme is intended for consumption by people,
who bring to the problem a wealth of prior knowledge of speech and years
of experience with it! It captures only the gross features
of the infinite variety of pitch contours found in living speech.
In a sense, the classification is
.ul
phonological
rather than
.ul
phonetic,
for it attempts to distinguish the features which make a logical difference
to the listener instead of the acoustic details of the pitch contours.
.pp
It is necessary to take these contours and subject them to a sort of
phonological-to-phonetic embellishment before applying them in synthetic
speech.
For example, the stretches with constant pitch which precede the tonic
syllable in tone groups 1, 2, and 3 sound
most unnatural when synthesized \(em for pitch is hardly ever
exactly constant in living speech.
Some pretonic pitch variation is necessary,
and this can be made to emphasize the salient syllable
of each foot. A "lilting" effect which reaches a peak at each foot
boundary, and drops rather faster at the beginning of the foot than it
rises at the end, sounds more natural. The magnitude of this inflection
can be altered slightly to add interest, but a considerable increase in it
produces a semantic change by making the utterance sound more emphatic.
11262It is a major problem to pin down exactly the turning points of pitch in 11263the falling-rising and rising-falling contours (4 and 5 in Figure 8.5). 11264And even deciding on precise values for the pitch frequencies involved is not 11265always easy. 11266.pp 11267The aim of the pitch assignment method of ISP is to allow the person 11268(or program) which originates a spoken message to exercise a great deal 11269of control over its intonation, without having to concern himself with 11270foot or syllable structure. The message to be spoken must be broken down 11271into tone groups, 11272which correspond roughly to Halliday's tone groups. 11273Each one comprises a 11274.ul 11275tonic 11276of one or more feet, which is optionally preceded by a 11277.ul 11278pretonic, 11279also with a number of feet. It is advantageous to allow a tone group 11280boundary to occur in the middle of a foot (whereas Halliday's scheme 11281insists that it occurs at a foot boundary). 11282The first foot of the tonic, the 11283.ul 11284tonic foot, 11285is marked by an asterisk at the beginning. 11286It is on the first syllable of this foot \(em the 11287"tonic" or "nuclear" 11288syllable \(em that the major stress of the tone group occurs. 11289If there is no asterisk in a tone group, 11290ISP takes the final foot as the tonic 11291(since this is the most common case). 11292.pp 11293The pitch contour on a tone group is specified by an array of ten numbers. 11294Of course, the system cannot generate all conceivable contours for a tone 11295group, but the definitions of the ten specifiable quantities have been 11296chosen to give a useful range of contours. 11297If necessary, more precise control over the pitch of an utterance can 11298be achieved by making the tone groups smaller. 11299.pp 11300The overall pitch movement is controlled by specifying the pitch at three 11301places: the beginning of the tone group, the beginning of the tonic syllable, 11302and the end of the tone group. 11303Provision is made for an abrupt pitch break at the start of the tonic 11304syllable in order to simulate tone groups 2 and 3, and, to a lesser 11305extent, tone groups 4 and 5. 11306The pitch is interpolated linearly over the first part of the 11307tone group (up to the tonic syllable) and over the last part (from there to 11308the end), except that it is possible to specify a non-linearity on the tonic 11309syllable, for emphasis, as shown in Figure 8.6. 11310.FC "Figure 8.6" 11311.pp 11312On this basic shape are superimposed two finer pitch patterns. 11313One of these is an initialization-continuation option which allows 11314the pitch to rise (or fall) independently on the initial and final feet 11315to specified values, without affecting the contour on the rest 11316of the tone group (Figure 8.7). 11317.FC "Figure 8.7" 11318The other is a foot pattern which is superimposed on each pretonic foot, 11319to give the stressed syllables of the pretonic added prominence and avoid 11320the monotony of constant pitch. 11321This is specified by a 11322.ul 11323non-linearity 11324parameter which distorts the contour on the foot at a pre-determined 11325point along it. 11326Figure 8.8 shows the effect. 11327.FC "Figure 8.8" 11328.pp 11329The ten quantities that define a pitch contour are summarized in 11330Table 8.3, and shown diagrammatically in Figure 8.9. 
11331.FC "Figure 8.9" 11332.RF 11333.nr x0 \w'H: 'u 11334.nr x1 \n(x0+\w'fraction along foot of the non-linearity position, for the tonic foot'u 11335.nr x1 (\n(.l-\n(x1)/2 11336.in \n(x1u 11337.ta \n(x0u +4n 11338A: continuation from previous tone group 11339 zero gives no continuation 11340 non-zero gives pitch at start of tone group 11341B: notional pitch at start 11342C: pitch range on whole of pretonic 11343D: departure from linearity on each foot of pretonic 11344E: pitch change at start of tonic 11345F: pitch range on tonic 11346G: departure from linearity on tonic 11347H: continuation to next tone group 11348 zero gives no continuation 11349 non-zero gives pitch at end of tone group 11350I: fraction along foot of the non-linearity position, for pretonic feet 11351J: fraction along foot of the non-linearity position, for the tonic foot 11352.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 11353.in 0 11354.FG "Table 8.3 The quantities that define a pitch contour" 11355.pp 11356The intention of this parametric method of specifying contours 11357is that the parameters should be easily derivable from semantic variables 11358like emphasis, novelty of idea, surprise, uncertainty, incompleteness. 11359Here we really are getting into controversial, unresearched areas. 11360Roughly speaking, parameters D and G control emphasis, G by itself 11361controls novelty and surprise, and H and the relative sizes of E and F 11362control uncertainty and incompleteness. 11363Certain parameters (notably I and J) are defined because although they 11364do not appear to correspond to semantic distinctions, we do not yet know 11365how to generate them automatically. 11366.RF 11367.nr x0 0.6i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+\w'0000' 11368.nr x1 (\n(.l-\n(x0)/2 11369.in \n(x1u 11370.ta 0.6i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i 11371Halliday's 11372tone group \0\0A \0\0B \0\0C \0\0D \0\0E \0\0F \0\0G \0\0H \0\0I \0\0J 11373\l'\n(x0u\(ul' 11374.sp 11375 1 \0\0\00 \0175 \0\0\00 \0\-40 \0\0\00 \-100 \0\-40 \0\0\00 0.33 \00.5 11376 2 \0\0\00 \0280 \0\0\00 \0\-40 \-190 \0100 \0\0\00 \0\0\00 0.33 \00.5 11377 3 \0\0\00 \0175 \0\0\00 \0\-40 \0\-70 \0\045 \0\-10 \0\0\00 0.33 \00.5 11378 4 \0\0\00 \0280 \-100 \0\-40 \0\020 \0\045 \0\-45 \0\0\00 0.33 \00.5 11379 5 \0\0\00 \0175 \0\060 \0\-40 \0\-20 \0\-45 \0\045 \0\0\00 0.33 \00.5 11380\l'\n(x0u\(ul' 11381.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 11382.in 0 11383.FG "Table 8.4 Pitch contour table for Halliday's primary tone groups" 11384.pp 11385One basic requirement of the pitch assignment scheme was the ability to 11386generate contours which approximate Halliday's five primary tone groups. 11387Values of the ten specifiable quantities are given in Table 8.4, for each 11388tone group. 11389All pitches are given in\ Hz. 11390A distinctly dipping pitch movement has been given to each pretonic foot 11391(parameter D), 11392to lend prominence to the salient syllables. 11393.sh "8.4 Evaluating prosodic synthesis" 11394.pp 11395It is extraordinarily difficult to evaluate schemes for prosodic synthesis, 11396and this is surely a large part of the reason why prosodics are among the 11397least advanced aspects of artificial speech. 11398Segmental synthesis can be tested by playing people minimal pairs of 11399words which differ in just one feature that is being investigated. 
11400For example, one might experiment with "pit", "bit"; "tot", "dot"; 11401"cot", "got" to test the rules which discriminate unvoiced from voiced stops. 11402There are standard word-lists for intelligibility tests which can be 11403used to compare systems, too. 11404No equivalent of such micro-level evaluation exists for prosodics, 11405for they by definition have a holistic effect on utterances. 11406They are most noticeable, and most important, in longish stretches of speech. 11407Even monotonous, arhythmic speech will be intelligible in 11408sufficiently short samples provided the segmentals are good enough; 11409but it is quite impossible to concentrate on such speech in quantity. 11410Some attempts at evaluation appear in Ainsworth (1974) and McHugh (1976), 11411but these are primarily directed at assessing the success of pronunciation 11412rules, which are discussed in the next chapter. 11413.[ 11414Ainsworth 1974 Performance of a speech synthesis system 11415.] 11416.[ 11417McHugh 1976 Listener preference and comprehension tests 11418.] 11419.pp 11420One evaluation technique is to compare synthetic with natural versions 11421of utterances, as was done in the pitch transfer experiment. 11422The method described earlier used a sensitive paired-comparison test, 11423where subjects heard both versions in quick succession and were asked 11424to judge which was "most natural and intelligible". 11425This is quite a stringent test, and one that may not be so useful 11426for inferior, completely synthetic, contours. 11427It is essential to degrade the "natural" utterance so that it is 11428comparable segmentally to the synthetic one: this was done in the 11429experiment described by extracting its pitch and resynthesizing it 11430from linear predictive coefficients. 11431.pp 11432Several other experiments could be undertaken to evaluate artificial 11433prosody. 11434For example, one could compare 11435.LB 11436.NP 11437natural and artificial rhythms, using artificial segmental synthesis 11438in both cases; 11439.NP 11440natural and artificial pitch contours, using artificial segmental synthesis 11441in both cases; 11442.NP 11443natural and artificial pitch contours, using segmentals extracted from 11444natural utterances. 11445.LE 11446There are many other topics which have not yet been fully investigated. 11447It would be interesting, for example, to define rules for generating speech 11448at different tempos. 11449Elisions, where phonemes or even whole syllables are suppressed, 11450occur in fast speech; these have been analyzed by linguists 11451but not yet incorporated into synthetic models. 11452It should be possible to simulate emotion by altering parameters such as 11453pitch range and mean pitch level; but this seems exceptionally difficult 11454to evaluate. One situation where it would perhaps be possible to 11455measure emotion is in the reading of sports results \(em in fact a study 11456has already been made of intonation in soccer results (Bonnet, 1980)! 11457.[ 11458Bonnet 1980 11459.] 11460Even the synthesis of voices with different pitch ranges requires 11461investigation, for, as noted earlier, it is difficult to place 11462precise frequency specifications on phonological contours such as 11463those sketched in Figure 8.5. 11464Clearly the topic of prosodic synthesis is a rich and potentially 11465rewarding area of research. 11466.sh "8.5 References" 11467.LB "nnnn" 11468.[ 11469$LIST$ 11470.] 
11471.LE "nnnn" 11472.sh "8.6 Further reading" 11473.pp 11474There are quite a lot of books in the field of linguistics which 11475describe prosodic features. 11476Here is a small but representative sample from both sides of the Atlantic. 11477.LB "nn" 11478.\"Abercrombie-1965-1 11479.]- 11480.ds [A Abercrombie, D. 11481.ds [D 1965 11482.ds [T Studies in phonetics and linguistics 11483.ds [I Oxford Univ Press 11484.ds [C London 11485.nr [T 0 11486.nr [A 1 11487.nr [O 0 11488.][ 2 book 11489.in+2n 11490Abercrombie is one of the leading English authorities on phonetics, 11491and this is a collection of essays which he has written over the years. 11492Some of them treat prosodics explicitly, and others show the influence 11493of verse structure on Abercrombie's thinking. 11494.in-2n 11495.\"Bolinger-1972-2 11496.]- 11497.ds [A Bolinger, D.(Editor) 11498.ds [D 1972 11499.ds [T Intonation 11500.ds [I Penguin 11501.ds [C Middlesex, England 11502.nr [T 0 11503.nr [A 0 11504.nr [O 0 11505.][ 2 book 11506.in+2n 11507A collection of papers that treat a wide variety of different aspects 11508of intonation in living speech. 11509.in-2n 11510.\"Crystal-1969-3 11511.]- 11512.ds [A Crystal, D. 11513.ds [D 1969 11514.ds [T Prosodic systems and intonation in English 11515.ds [I Cambridge Univ Press 11516.nr [T 0 11517.nr [A 1 11518.nr [O 0 11519.][ 2 book 11520.in+2n 11521This book attempts to develop a theoretical basis for the study of British 11522English intonation. 11523.in-2n 11524.\"Gimson-1966-3 11525.]- 11526.ds [A Gimson, A.C. 11527.ds [D 1966 11528.ds [T The linguistic relevance of stress in English 11529.ds [B Phonetics and linguistics 11530.ds [E W.E.Jones and J.Laver 11531.ds [P 94-102 11532.nr [P 1 11533.ds [I Longmans 11534.ds [C London 11535.nr [T 0 11536.nr [A 1 11537.nr [O 0 11538.][ 3 article-in-book 11539.in+2n 11540Here is a careful discussion of what is meant by "stress", with much more 11541detail than has been possible in this chapter. 11542.in-2n 11543.\"Lehiste-1970-4 11544.]- 11545.ds [A Lehiste, I. 11546.ds [D 1970 11547.ds [T Suprasegmentals 11548.ds [I MIT Press 11549.ds [C Cambridge, Massachusetts 11550.nr [T 0 11551.nr [A 1 11552.nr [O 0 11553.][ 2 book 11554.in+2n 11555This is a comprehensive study of suprasegmental phenomena in natural speech. 11556It is divided into three major sections: quantity (timing), tonal features 11557(pitch), and stress. 11558.in-2n 11559.\"Pike-1945-5 11560.]- 11561.ds [A Pike, K.L. 11562.ds [D 1945 11563.ds [T The intonation of American English 11564.ds [I Univ of Michigan Press 11565.ds [C Ann Arbor, Michigan 11566.nr [T 0 11567.nr [A 1 11568.nr [O 0 11569.][ 2 book 11570.in+2n 11571A classic, although somewhat dated, study. 11572Notice that it deals specifically with American English. 11573.in-2n 11574.LE "nn" 11575.EQ 11576delim $$ 11577.EN 11578.CH "9 GENERATING SPEECH FROM TEXT" 11579.ds RT "Generating speech from text 11580.ds CX "Principles of computer speech 11581.pp 11582In the preceding two chapters I have described how artificial speech 11583can be produced from a written phonetic representation with additional 11584markers indicating intonation contours, points of major stress, rhythm, 11585and pauses. 11586This representation is substantially the same as that used by linguists 11587when recording natural utterances. 11588What we will discuss now are techniques for generating this information, 11589or at least some of it, from text. 11590.pp 11591Figure 9.1 shows various levels of the speech synthesis process. 
11592.FC "Figure 9.1" 11593Starting from the top with plain text, the first box splits it into 11594intonation units (tone groups), decides where the major emphases 11595(tonic stresses) should be placed, 11596and further subdivides the tone group into rhythmic units (feet). 11597For intonation analysis it is necessary to decide on an "interpretation" 11598of the text, which in turn, as was emphasized at the beginning of the 11599previous chapter, depends both on the semantics of what is being said and 11600on the attitude of the speaker to his material. 11601The resulting representation will be at the level of Halliday's notation 11602for utterances, with the words still in English rather than phonetics. 11603Table 9.1 illustrates the utterance representation at the various levels 11604of the Figure. 11605.RF 11606.nr x0 \w'pitch and duration '+\w'at 8 kHz sampling rate a 4-second utterance' 11607.nr x1 (\n(.l-\n(x0)/2 11608.in \n(x1u 11609.ta \w'pitch and duration 'u +\w'pause 'u +\w'00 msec 'u 11610representation example 11611\l'\n(x0u\(ul' 11612.sp 11613plain text Automatic synthesis of speech, 11614 from a phonetic representation. 11615.sp 11616text adorned with 3\0^ auto/matic /synthesis of /*speech, 11617prosodic markers 1\0^ from a pho/*netic /represen/tation. 11618.sp 11619phonetic text with 3\0\fI^ aw t uh/m aa t i k /s i n th uh s i s\fR 11620prosodic markers \0\0\fIuh v /*s p ee t sh\fR , 11621 1\0\fI^ f r uh m uh f uh/*n e t i k\fR 11622 \0\0\fI/r e p r uh z e n/t e i sh uh n\fR . 11623.sp 11624phonemes with pause 80 msec 11625pitch and duration \fIaw\fR 70 msec 105 Hz 11626 \fIt\fR 40 msec 136 Hz 11627 \fIuh\fR 50 msec 148 Hz 11628 \fIm\fR 70 msec 175 Hz 11629 \fIaa\fR 90 msec 140 Hz 11630 ... 11631 ... 11632 ... 11633.sp 11634parameters for 10 parameters, each updated at a frame 11635formant or linear rate of 10 msec 11636predictive (4 second utterance gives 400 frames, 11637synthesizer or 4,000 data values) 11638.sp 11639acoustic wave at 8 kHz sampling rate a 4-second utterance 11640 has 32,000 samples 11641\l'\n(x0u\(ul' 11642.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 11643.in 0 11644.FG "Table 9.1 Utterance representations at various levels in speech synthesis" 11645.pp 11646The next job is to translate the plain text into a broad phonetic 11647transcription. 11648This requires knowledge of letter-to-sound pronunciation 11649rules for the language under consideration. 11650But much more is needed. The structure of each word must be examined for 11651prefixes and suffixes, because they \(em especially the latter \(em have a 11652strong influence on pronunciation. 11653This is called "morphological" analysis. 11654Actually it is also required for rhythmical purposes, because prefixes 11655are frequently unstressed (note that the word "prefix" is itself an 11656exception to this!). 11657Thus the appealing segmentation of the overall problem shown in Figure 9.1 11658is not very accurate, for the individual processes cannot be rigidly 11659separated as it implies. In fact, we saw earlier how this intermixing of 11660levels occurs with prosodic and segmental features. 11661Nevertheless, it is helpful to structure discussion of the problem by 11662separating levels as a first approximation. 11663Further influences on pronunciation come from the semantics and syntax 11664of the utterance \(em and both also play a part in intonation and rhythm analysis. 
The result of this second process is a phonetic representation, still
adorned with prosodic markers.
.pp
Now we move down from higher-level intonation and rhythm considerations
to the details of the pitch contour and segment durations.
This process was the subject of the previous chapter.
The problems are twofold: to map an appropriate acoustic pitch contour
on to the utterance, using tonic stress point and foot boundaries as
anchor points; and to assign durations to segments using the
foot\(emsyllable\(emcluster\(emsegment hierarchy.
If it is accepted that the overall rhythm can be captured adequately by foot
markers, this process does not interact with earlier ones.
However, many researchers do not accept this, believing instead that rhythm is
syntactically determined at a very detailed level.
This will, of course, introduce strong interaction between the duration
assignment process and the levels above.
(Klatt, 1975, puts it into his title \(em
"Vowel lengthening is syntactically determined in a connected discourse".
.[
Klatt 1975 Vowel lengthening is syntactically determined
.]
Contrast this with the paper cited earlier (Bolinger, 1972) entitled
"Accent is predictable \(em if you're a mind reader".
.[
Bolinger 1972 Accent is predictable \(em if you're a mind reader
.]
No-one would disagree that "accent" is an influential factor in vowel length!)
.pp
Notice incidentally that the representation of the result of the pitch
and duration assignment process in Table 9.1 is inadequate, for each segment
is shown as having just one pitch.
In practice the pitch varies considerably throughout every segment,
and can easily rise and fall on a single one. For example,
.LB
"he's
.ul
very
good"
.LE
may have a rise-fall on the vowel of "very".
The linked event-list data-structure of ISP is much more suitable
than a textual string for utterance representation at this level.
.pp
The fourth and fifth processes of Figure 9.1 have little interaction with
the first two, which are the subject of this chapter. Segmental
concatenation, which was treated in Chapter 7, is affected by prosodic
features like stress; but a notation which indicates stressed syllables
(like Halliday's) is sufficient to capture this influence.
Contextual modification of segments, by which I mean
the coarticulation effects which govern allophones of phonemes,
is included explicitly in the fourth process to emphasize that the upper levels
need only provide a broad phonemic transcription rather than a detailed
phonetic one.
Signal synthesis can be performed by either a formant synthesizer or a
linear predictive one (discussed in Chapters 5 and 6).
This will affect the details of the segmental concatenation process but should have no
impact at all on the upper levels.
.pp
Figure 9.1 performs a useful function by summarizing where we have
been in earlier chapters \(em the lower three boxes \(em and introducing the
remaining problems that must be faced by a full text-to-speech system.
It also serves to illustrate an important point: that a speech output system
can demand that its utterances be entered in any of a wide range of
representations.
11729Thus one can enter at a low level with a digitized waveform or linear 11730predictive parameters; or higher up with a phonetic representation 11731that includes detailed pitch and duration specification at the phoneme level; 11732or with a phonetic text or plain text adorned with prosodic markers; 11733or at the very top with plain text as it would appear in a book. 11734A heavy price in naturalness and intelligibility is paid by moving up 11735.ul 11736any 11737of these levels \(em and this is just as true at the top of the Figure as 11738at the bottom. 11739.sh "9.1 Deriving prosodic features" 11740.pp 11741If you really need to start with plain text, 11742some very difficult problems present themselves. 11743The text should be understood, first of all, and then decisions need to be 11744made about how it is to be interpreted. 11745For an excellent speaker \(em like an actor \(em these decisions will be artistic, 11746at least in part. 11747They should certainly depend upon the opinion and attitude of the speaker, 11748and his perception of the structure and progress of the dialogue. 11749Very little is known about this upper level of speech synthesis from text. 11750In practice it is almost completely ignored \(em and the speech is at most 11751barely intelligible, and certainly uncomfortable to listen to. 11752Hence anybody contemplating building or using a speech output system which 11753starts from something close to plain text should consider carefully whether some extra 11754semantic information can be coded into the initial utterances to help with 11755prosodic interpretation. 11756Only rarely is this impossible \(em and reading machines for the blind are 11757a prime example of a situation where arbitrary, unannotated, texts 11758must be read. 11759.rh "Intonation analysis." 11760One distinction which a program can usefully try 11761to make is between basically rising 11762and basically falling pitch contours. It is often said that pitch rises on 11763a question and falls on a statement, but if you listen to speech you will 11764find this to be a gross oversimplification. It normally 11765falls on statements, certainly; but it falls as often as it rises on questions. 11766It is more accurate to say that pitch rises on "yes-no" questions 11767and falls on other utterances, although this rule is still only a rough guide. 11768A simple test which operates lexically on the input text is to determine 11769whether a sentence is a question by looking at the 11770punctuation mark at its end, and then to examine the first word. 11771If it is a "wh"-word like "what", "which", "when", "why" (and also "how") 11772a falling contour is likely to fit. 11773If not, the question is probably a yes-no one, and the contour 11774should rise. 11775Such a crude rule will certainly not be very accurate 11776(it fails, for example, when the "wh"-word is embedded in a phrase as in 11777"at what time are you going?"), but at least it provides a starting-point. 11778.pp 11779An air of finality is given to an utterance when it bears a definite 11780fall in pitch, dropping to a rather low value at the end. 11781This should accompany the last intonation unit in an utterance 11782(unless it is a yes-no question). 11783However, a rise-fall contour such as Halliday's tone group 5 (Figure 8.5) 11784can easily be used in utterance-final position by one person 11785in a conversation \(em 11786although it would be unlikely to terminate the dialogue altogether. 
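.pp
The test is easily written down; the following sketch is in Python, and its
word list is indicative only:
.LB
.nf
WH_WORDS = {"what", "which", "when", "where", "who", "whose", "why", "how"}

def basic_contour(sentence):
    # falling contour for statements and for "wh"-questions,
    # rising contour for (probable) yes-no questions
    words = sentence.split()
    if not words or not sentence.rstrip().endswith("?"):
        return "falling"
    return "falling" if words[0].lower() in WH_WORDS else "rising"

basic_contour("Is this train going to London?")    # rising
basic_contour("Why is the train late?")            # falling
basic_contour("At what time are you going?")       # rising: the embedded
                                                   # "wh"-word defeats the rule
.fi
.LE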
11787A new topic is frequently introduced by a fall-rise contour \(em such as 11788Halliday's tone group 4 \(em and this often begins a paragraph. 11789.pp 11790Determining the type of pitch contour is only one part of 11791intonation assignment. There are really three separate problems: 11792.LB 11793.NP 11794dividing the utterance into tone groups 11795.NP 11796choosing the tonic syllable, or major stress point, of each one 11797.NP 11798assigning a pitch contour to each tone group. 11799.LE 11800Let us continue to use the Halliday notation for intonation, which was introduced 11801in simplified form in the previous chapter. 11802Moreover, assume that the foot boundaries can be placed correctly \(em 11803this problem will be discussed in the next subsection. 11804Then a scheme which considers only the lexical form of the utterance 11805and does not attempt to "understand" it (whatever that means) is as follows: 11806.LB 11807.NP 11808place a tone group boundary at every punctuation mark 11809.NP 11810place the tonic at the first syllable of the last foot in a tone group 11811.NP 11812use contour 4 for the first tone group in a paragraph and contour 1 11813elsewhere, except for a yes-no question which receives contour 2. 11814.LE 11815.RF 11816.nr x0 \w'From Scarborough to Whitby\0\0\0\0'+\w'4 ^ from /Scarborough to /*Whitby is a' 11817.nr x1 (\n(.l-\n(x0)/2 11818.in \n(x1u 11819.ta \w'From Scarborough to Whitby\0\0\0\0\0\0'u 11820plain text text adorned with prosodic markers 11821\l'\n(x0u\(ul' 11822.sp 11823From Scarborough to Whitby is a 4 ^ from /Scarborough to /*Whitby is a 11824very pleasant journey, with 1\- very /pleasant /*journey with 11825very beautiful countryside. 1\- very /beautiful /*countryside ... 11826In fact the Yorkshire coast is 1+ ^ in /fact the /Yorkshire /coast is 11827\0\0\0\0lovely, \0\0\0\0/*lovely 11828all along, ex- 1+ all a/*long ex 11829cept the parts that are covered _4 cept the /parts that are /covered 11830\0\0\0\0in caravans of course; and \0\0\0\0in /*caravans of /course and 11831if you go in spring, 4 if you /go in /*spring 11832when the gorse is out, 4 ^ when the /*gorse is /out 11833or in summer, 4 ^ or in /*summer 11834when the heather's out, 4 ^ when the /*heather's /out 11835it's really one of the most 13 ^ it's /really /one of the /most 11836\0\0\0\0delightful areas in the \0\0\0\0de/*lightful /*areas in the 11837whole country. 1 whole /*country 11838.sp 11839The moorland is 4 ^ the /*moorland is 11840rather high up, and 1 rather /high /*up and 11841fairly flat \(em a 1 fairly /*flat a 11842sort of plateau. 1 sort of /*plateau ... 11843At least, 1 ^ at /*least 11844it isn't really flat, 13 ^ it /*isn't /really /*flat 11845when you get up on the top; \-3 ^ when you /get up on the /*top 11846it's rolling moorland 1 ^ it's /rolling /*moorland 11847cut across by steep valleys. But 1 cut across by /steep /*valleys but 11848seen from the coast it's 4 seen from the /*coast it's ... 11849"up there on the moors", and you 1 up there on the /*moors and you 11850always think of it as a _4 always /*think of it as a 11851kind of tableland. 1 kind of /*tableland 11852\l'\n(x0u\(ul' 11853.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 11854.in 0 11855.FG "Table 9.2 Example of intonation and rhythm analysis (from Halliday, 1970)" 11856.[ 11857Halliday 1970 Course in spoken English: Intonation 11858.] 
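.pp
For concreteness, here is a sketch of these three rules in Python; it assumes
that the foot boundaries have already been marked with "/", and its treatment
of punctuation and of the yes-no test is a simplified stand-in rather than the
procedure used in any particular system:
.LB
.nf
import re

WH_WORDS = {"what", "which", "when", "where", "who", "whose", "why", "how"}

def assign_intonation(paragraph):
    # Rule 1: a tone group boundary at every punctuation mark.
    parts = re.split(r"([,;:.?!])", paragraph)
    texts, marks = parts[0::2], parts[1::2] + [""]
    result, first = [], True
    for text, mark in zip(texts, marks):
        feet = [f.strip() for f in text.split("/") if f.strip()]
        if not feet:
            continue
        first_word = feet[0].split()[0].lower()
        # Rule 2: tonic on the first syllable of the last foot ("*").
        feet[-1] = "*" + feet[-1]
        # Rule 3: contour 2 for a yes-no question, contour 4 for the
        # first tone group of a paragraph, contour 1 elsewhere.
        if mark == "?" and first_word not in WH_WORDS:
            contour = 2
        elif first:
            contour = 4
        else:
            contour = 1
        result.append((contour, " /".join(feet)))
        first = False
    return result

assign_intonation("from /Scarborough to /Whitby is a /very /pleasant /journey, "
                  "with /very /beautiful /countryside.")
# [(4, 'from /Scarborough to /Whitby is a /very /pleasant /*journey'),
#  (1, 'with /very /beautiful /*countryside')]
.fi
.LE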
.pp
These extremely crude and simplistic rules are really the most that one can do
without subjecting the utterance to a complicated semantic analysis.
In statistical terms, they are actually remarkably effective.
Table 9.2 shows part of a spontaneous monologue which was transcribed by
Halliday and appears in his teaching text on intonation
(Halliday, 1970, p 133).
.[
Halliday 1970 Course in Spoken English: Intonation
.]
Among the prosodic markers are some that were not introduced in Chapter 8.
Firstly, each tone group has secondary contours which are identified
by "1+", "1\-" (for tone group 1), and so on.
Secondly, the mark "..." is used to indicate a pause which disrupts
the speech rhythm.
Notice that its positioning belies the advice of the old elocutionists:
.br
.ev2
.in 0
.LB
.fi
A Comma stops the Voice while we may privately tell
.NI
.ul
one,
a Semi-colon
.ul
two;
a Colon
.ul
three:\c
 and a Period
.ul
four.
.br
.nr x0 \w'\fIone,\fR a Semi-colon \fItwo;\fR a Colon \fIthree:\fR and a Period \fIfour.'-\w'(Mason,\fR 1748)'
.NI
\h'\n(x0u'(Mason, 1748)
.nf
.LE
.br
.ev
Thirdly, compound tone groups such as "13" appear which contain
.ul
two
tonic syllables.
This differs from a simple concatenation of tone groups
(with contours 1 and 3 in this case) because the second is in some sense subsidiary to
the first.
Typically it forms an adjunct clause, while the first clause gives the
main information. Halliday provides many examples, such as
.LB
.NI
/Jane goes /shopping in /*town /every /*Friday
.NI
/^ I /met /*Arthur on the /*train.
.LE
But he does not comment on the
.ul
acoustic
difference between a compound tone group and a concatenation of simple ones \(em
which is, after all, the information needed for synthesis.
A final, minor, difference between Halliday's scheme and that outlined earlier
is that he compels tone group boundaries to occur at the beginning
of a foot.
.RF
.nr x0 3.3i+1.3i+\w'complete'
.nr x1 (\n(.l-\n(x0)/2
.in \n(x1u
.ta 3.3i +1.3i
 excerpt in complete
 Table 9.2 passage
\l'\n(x0u\(ul'
.sp
number of tone groups 25 74
.sp
number of boundaries correctly 19 (76%) 47 (64%)
placed
.sp
number of boundaries incorrectly \00 \01 (\01%)
placed
.sp
number of tone groups having a 22 (88%) 60 (81%)
tonic syllable at the beginning
of the final foot
.sp
number of tone groups whose 17 (68%) 51 (69%)
contours are correctly assigned
\l'\n(x0u\(ul'
.sp
number of compound tone groups \02 (\08%) \06 (\08%)
.sp
number of secondary intonation \07 (28%) 13 (17%)
contours
\l'\n(x0u\(ul'
.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
.in 0
.FG "Table 9.3 Success of simple intonation assignment rules"
.pp
Applying the simple rules given above to the text of Table 9.2 leads to
the results in the first column of Table 9.3.
Three-quarters of the tone group boundaries are flagged by
punctuation marks, with no extraneous ones being included.
88% of tone groups have a tonic syllable at the start of the final foot.
11963However, the compound tone groups each have two tonic syllables, 11964and of course only the second one is predicted by the final-foot rule. 11965Assigning intonation contours on the extremely simple basis of using 11966contour 4 for the first tone group in a paragraph, and contour 1 thereafter, 11967also seems to work quite well. Secondary contours such as "1+" and "1\-" 11968have been mapped into the appropriate primary contour (1, in this case) 11969for the present purpose, and compound tone groups have been assigned the first 11970contour of the pair. 11971The result is that 68% of contours are given correctly. 11972.pp 11973In order to give some idea of the reliability of these figures, the results 11974for the whole passage transcribed by Halliday \(em of which Table 9.2 is an 11975excerpt \(em are shown in the second column of Table 9.3. Although it 11976looks as though the rules may have been slightly lucky with the excerpt, 11977the general trends are the same, with 65% to 80% of features being assigned 11978correctly. 11979It could be argued, though, that the complete text is punctuated fairly liberally by 11980present-day standards, so that the tone-group boundary rule is unusually 11981successful. 11982.pp 11983These results are really astonishingly good, considering the crudeness of 11984the rules. However, they should be interpreted with caution. 11985What is missed by the rules, although appearing to comprise only 1198620% to 35% of the features, is certain to include the important, 11987information-bearing, and variety-producing features that give the utterance 11988its liveliness and interest. 11989It would be rash to assume that all tone-group boundaries, 11990all tonic positions, and all intonation contours, are equally 11991important for intelligibility and naturalness. 11992It is much more likely that the rules predict a 11993default pattern, while most information is borne by deviations from 11994them. 11995To give an engineering analogy, it may be as though the carrier waveform 11996of a modulated transmission is being simulated, instead of the 11997information-bearing signal! 11998Certainly the utterance will, if synthesized with intonation given by these 11999rules, sound extremely dull and repetitive, mainly because of the 12000overwhelming predominance of tone group 1 and the universal placement 12001of tonic stress on the final foot. 12002.pp 12003There are certainly many different ways to orate any particular text, 12004and that given by Halliday and reproduced in Table 9.2 is only one possible 12005version. 12006However, it is fair to say that the default intonation discussed above 12007could only occur naturally under very unusual circumstances \(em such as 12008a petulant child, unwilling and sulky, having been forced to read aloud. 12009This is hardly how we want our computers to speak! 12010.rh "Rhythm analysis." 12011Consider now how to decide where foot boundaries should be placed 12012in English text. 12013Clearly semantic considerations sometimes play a part in this \(em one could 12014say 12015.LB 12016/^ is /this /train /going /*to /London 12017.LE 12018instead of the more usual 12019.LB 12020/^ is /this /train /going to /*London 12021.LE 12022in circumstances where the train might be going 12023.ul 12024to 12025or 12026.ul 12027from 12028London. 12029Such effects are ignored here, although it is worth noting in passing that the 12030rogue words will often be marked by underscoring or italicizing 12031(as in the previous sentence). 
12032If the text is liberally underlined, semantic analysis may 12033be unnecessary for the purposes of rhythm. 12034.pp 12035A rough and ready rule for placing foot boundaries is to insert one before 12036each word which is not in a small closed set of "function words". 12037The set includes, for example, "a", "and", "but", "for", "is", "the", "to". 12038If a verb or adjective begins with a prefix, the boundary should be moved 12039between it and the root \(em but not for a noun. 12040This will give the distinction between 12041.ul 12042con\c 12043vert (noun) and con\c 12044.ul 12045vert 12046(verb), 12047.ul 12048ex\c 12049tract and ex\c 12050.ul 12051tract, 12052and for many North American speakers, 12053will help to distinguish 12054.ul 12055in\c 12056quiry from in\c 12057.ul 12058quire. 12059However, detecting prefixes by a simple splitting algorithm is dangerous. 12060For example, "predate" is a verb with stress on what appears to be a prefix, 12061contrary to the rule; while the "pre" in "predator" is not a prefix \(em at 12062least, it is not pronounced as the prefix "pre" normally is. 12063Moreover, polysyllabic words like "/diplomat", "dip/lomacy", "diplo/matic"; 12064or "/telegraph", "te/legraphy", "tele/graphic" cannot be handled on such a simple 12065basis. 12066.pp 12067In 1968, a remarkable work on English sound structure was published 12068(Chomsky and Halle, 1968) which proposes a system of rules to transform 12069English text into a phonetic representation in terms of distinctive features, 12070with the aid of a lexicon. 12071.[ 12072Chomsky Halle 1968 12073.] 12074A great deal of attention is paid to stress, and rules are given which 12075perform well in many tricky cases. 12076.pp 12077It uses the American system of levels of stress, marking 12078so-called primary stress with a superscript 1, secondary stress with a 12079superscript 2, and so on. 12080The superscripts are written on the vowel of the stressed 12081syllable: completely unstressed syllables receive no annotation. 12082For example, the sentence "take John's blackboard eraser" is written 12083.LB 12084ta\u2\dke Jo\u3\dhn's bla\u1\dckboa\u5\drd era\u4\dser. 12085.LE 12086In foot notation this utterance 12087is 12088.LB 12089/take /John's /*blackboard e/raser. 12090.LE 12091It undoubtedly contains less information than the stress-level version. 12092For example, the second syllable of "blackboard" and the first one of "erase" 12093are both unstressed, although the rhythm rules given in Chapter 8 12094will cause them 12095to be treated differently because they occupy different places in the 12096syllable pattern of the foot. 12097"Take", "John's", and the second syllable of "erase" are all non-tonic 12098foot-initial syllables and hence are not distinguished in the notation; 12099although the pitch contours schematized in Figure 8.9 will give them different 12100intonations. 12101.pp 12102An indefinite number of levels of stress can be used. For example, according 12103to the rules given by Chomsky and Halle, the word "sad" in 12104.LB 12105my friend can't help being shocked at anyone who would fail to consider 12106his sad plight 12107.LE 12108has level-8 stress, the final two words being annotated 12109as "sa\u8\dd pli\u1\dght". 12110However, only the first few levels are used regularly, and 12111it is doubtful whether acoustic distinctions are made in speech 12112between the weaker ones. 
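.pp
Returning for a moment to the rough and ready rule given earlier, it is easily pinned down in code.
The sketch below is purely illustrative (and in Python simply for concreteness): the function-word list contains only the examples quoted above, where a real system needs a much larger one, and the initial silent foot and the shifting of the boundary past the prefix of a verb or adjective are left as comments.
.LB
.nf
# Illustrative sketch of the rough and ready foot-boundary rule:
# a boundary before every word that is not a function word.
# The closed set here is just the sample quoted in the text.
FUNCTION_WORDS = {"a", "and", "but", "for", "is", "the", "to"}

def mark_feet(words):
    out = []
    for word in words:
        if word.lower() not in FUNCTION_WORDS:
            out.append("/")        # a new foot begins here
        out.append(word + " ")
    # (not handled: the initial silent foot, and moving the boundary
    #  between prefix and root for verbs and adjectives)
    return "".join(out).strip()

# mark_feet("is this train going to London".split())
#   gives   "is /this /train /going to /London"
.fi
.LE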
12113.pp 12114Chomsky and Halle are concerned to distinguish between such utterances as 12115.LB 12116.NI 12117bla\u2\dck boa\u1\drd-era\u3\dser ("board eraser that is black") 12118.NI 12119bla\u1\dckboa\u3\drd era\u2\dser ("eraser for a blackboard") 12120.NI 12121bla\u3\dck boa\u1\drd era\u2\dser ("eraser of a black board"), 12122.LE 12123and their stress assignment rules do indeed produce each version when 12124appropriate. 12125In foot notation the distinctions can still be made: 12126.LB 12127.NI 12128/black /*board-eraser/ 12129.NI 12130/*blackboard e/raser/ 12131.NI 12132/black /*board e/raser/ 12133.LE 12134.pp 12135The rules operate on a grammatical derivation tree 12136of the text. 12137For instance, input for the three examples would be written 12138.LB 12139.NI 12140[\dNP\u[\dA\u black ]\dA\u [\dN\u[\dN\u board]\dN\u 12141[\dN\u eraser ]\dN\u]\dN\u]\dNP\u 12142.NI 12143[\dN\u[\dN\u[\dA\u black ]\dA\u [\dN\u board ]\dN\u]\dN\u [\dN\u eraser ]\dN\u]\dN\u 12144.NI 12145[\dN\u[\dNP\u[\dA\u black ]\dA\u [\dN\u board ]\dN\u]\dNP\u [\dN\u eraser ]\dN\u]\dN\u, 12146.LE 12147representing the trees shown in Figure 9.2. 12148.FC "Figure 9.2" 12149Here, N stands for a noun, NP for a noun phrase, and A for an adjective. 12150These categories appear explicitly as nodes in the tree. 12151In the linearized textual representation they are used to label 12152brackets which represent the tree structure. 12153An additional piece of information which is needed is the lexical entry for 12154"eraser", which would show that it has only one accented 12155(that is, potentially stressed) syllable, namely, the second. 12156.pp 12157Consider now how to account for stress in prefixed and 12158suffixed words, and those polysyllabic ones with more than one potential 12159stress point. 12160For these, the morphological structure must appear in the input. 12161.pp 12162Now 12163.ul 12164morphemes 12165are well-defined minimal units of grammatical analysis from which a word 12166may be composed. 12167For example, [went]\ =\ [go]\ +\ [ed] is 12168a morphemic decomposition, where "[ed]" denotes the 12169past-tense morpheme. 12170This representation is not particularly suitable for speech synthesis 12171for the obvious reason that the result bears no phonetic resemblance to 12172the input. 12173What is needed is a decomposition into 12174.ul 12175morphs, 12176which occur only when the lexical or phonetic representation of a word may 12177easily be segmented into parts. 12178Thus [wanting]\ =\ [want]\ +\ [ing] and [bigger]\ =\ [big]\ +\ [er] are 12179simultaneously morphic and morphemic decompositions. 12180Notice that in the second example, a rule about final consonant doubling has 12181been applied at the lexical level (although it is not needed in 12182a phonetic representation): this comes into the sphere 12183of "easy" segmentation. 12184Contrast this with [went]\ =\ [go]\ +\ [ed] which 12185is certainly not an easy segmentation and hence a 12186morphemic but not a morphic decomposition. 12187But between these extremes there are some difficult 12188cases: [specific]\ =\ [specify]\ +\ [ic] is probably morphic 12189as well as morphemic, but it is not clear 12190that [galactic]\ =\ [galaxy]\ +\ [ic] is. 12191.pp 12192Assuming that the input is given as a derivation tree with morphological 12193structure made explicit, Chomsky and Halle present rules which assign stress 12194correctly in nearly all cases. 
For example, their rules give 12195.LB 12196.NI 12197[\dA\u[\dN\u incident ]\dN\u + al]\dA\u \(em> i\u2\dncide\u1\dntal; 12198.LE 12199and if the stem is marked by [\dS\u\ ...\ ]\dS\u in prefixed words, 12200they can deduce 12201.LB 12202.NI 12203[\dN\u tele [\dS\u graph ]\dS\u]\dN\u \(em> te\u1\dlegra\u3\dph 12204.NI 12205[\dN\u[\dN\u tele [\dS\u graph ]\dS\u]\dN\u y ]\dN\u \(em> tele\u1\dgraphy 12206.NI 12207[\dA\u[\dN\u tele [\dS\u graph ]\dS\u]\dN\u ic ]\dA\u \(em> te\u3\dlegra\u1\dphi\u2\dc. 12208.LE 12209.pp 12210There are two rules which account for the word-level stress 12211on such examples: the "main stress" 12212rule and the "alternating stress" rule. 12213In essence, the main stress rule emphasizes the last strong syllable 12214of a stem. 12215A syllable is "strong" either if it contains one of a class of so-called 12216"long" vowels, or if there is a cluster of two or more consonants 12217following the vowel; otherwise it is "weak". 12218(If you are exceptionally observant you will notice that this strong\(emweak 12219distinction has been used before, when discussing the rhythm of feet in 12220syllables.) Thus the verb "torment" receives stress on the second syllable, 12221for it is a strong one. 12222A noun like "torment" is treated as being derived from the corresponding verb, 12223and the rule assigns stress to the verb first and then modifies it for the noun. 12224The second, "alternating stress", rule gives some stress to alternate 12225syllables of polysyllabic words like "form\c 12226.ul 12227al\c 12228de\c 12229.ul 12230hyde\c 12231". 12232.pp 12233It is quite easy to incorporate the word-level rules into a computer 12234program which uses feet rather than stress levels as the basis for prosodic 12235description. 12236A foot boundary is simply placed before the primary-stressed (level-1) syllable, 12237except for function words, which do not begin a foot. 12238The other stress levels should be ignored, 12239except that for slow, deliberate speech, secondary (level-2) stress is 12240mapped into a foot boundary too, if it precedes the primary stress. 12241There is also a rule which reduces vowels in unstressed 12242syllables. 12243.pp 12244The stress assignment rules can work on phonemic script, as well as English. 12245For example, starting from the phonetic 12246form [\d\V\u\ \c 12247.ul 12248aa\ s\ t\ o\ n\ i\ sh\ \c 12249]\dV\u, 12250the stress assignment rules 12251produce \c 12252.ul 12253aa\ s\ t\ o\u1\d\ n\ i\ sh\ ;\c 12254 the 12255vowel reduction rule 12256generates \c 12257.ul 12258uh\ s\ t\ o\u1\d\ n\ i\ sh\ ;\c 12259 and 12260the foot conversion process 12261gives \c 12262.ul 12263uh\ s/t\ o\ n\ i\ sh. 12264This appears to provide a fairly reliable algorithm for foot boundary 12265placement. 12266.rh "Speech synthesis from concept." 12267I argued earlier that in order to derive prosodic features 12268of an utterance from text it 12269is necessary to understand its role in the dialogue, its semantics, 12270its syntax, and \(em as we have just seen \(em its morphological structure. 12271This is a very tall order, and the problem of natural language comprehension 12272by machine is a vast research area in its own right. 12273However, in many applications requiring speech output, 12274utterances are generated by the computer from internally stored data 12275rather than being read aloud from pre-prepared text. 
12276Then the problem of comprehending text may be evaded, for 12277presumably the language-generation module can provide a semantic, 12278syntactic, and even morphological decomposition of the utterance, 12279as well as some indication of its role in the dialogue 12280(that is, why it is necessary to say it). 12281.pp 12282This forms the basis of the appealing notion of "speech synthesis from concept". 12283It has some advantages over speech generation from text, and in principle 12284should provide more natural-sounding speech. 12285Every word produced by the system can have a complete lexical entry which 12286shows its morphological decomposition and potential stress points. 12287The full syntactic history of each utterance is known. 12288The Chomsky-Halle rules described above can therefore be used to place 12289foot boundaries accurately, without the need for a complex parsing program 12290and without the risk of having to make guesses about unknown words. 12291.pp 12292However, it is not clear how to take advantage of any semantic information 12293which is available. Ideally, it should be possible to place tone group 12294boundaries and tonic stress points, and assign intonation contours, in 12295a natural-sounding way. 12296But look again at the example text of Table 9.2 and imagine that you have 12297at your disposal as much semantic information as is needed. 12298It is 12299.ul 12300still 12301far from obvious how the intonation features could be assigned! 12302It is, in the ultimate analysis, interpretive and stylistic 12303.ul 12304choices 12305that add variety and interest to speech. 12306.pp 12307Take the problem of determining pitch contours, for instance. 12308Some of them may be explicable. 12309Contour 4 on 12310.LB 12311.NI 12312except the parts that are covered in caravans of course 12313.LE 12314is due to its being a contrastive clause, for it presents 12315essentially new information. 12316Similarly, the succession 12317.LB 12318.NI 12319if you go in spring 12320.NI 12321when the gorse is out 12322.NI 12323or in summer 12324.NI 12325when the heather's out 12326.LE 12327could be considered contrastive, being in the subjunctive voice, and 12328this could explain why contour 4's were used. 12329But this is all conjecture, and it is difficult to apply throughout the 12330passage. 12331Halliday (1970) explains the contexts in which each tone group is typically 12332used, but in an extremely high-level manner which would be impossible 12333to embody directly in a computer program. 12334.[ 12335Halliday 1970 Course in spoken English: Intonation 12336.] 12337At the other end of the spectrum, computer systems for written 12338discourse production do not seem to provide the subtle information needed 12339to make intonation decisions (see, for example, Davey, 1978, for a fairly 12340complete description of such a system). 12341.[ 12342Davey 1978 12343.] 12344.pp 12345One project which uses such a method for generating speech has been 12346described (Young and Fallside, 1980). 12347.[ 12348Young Fallside 1980 12349.] 12350Although some attention is paid to rhythm, the intonation contours 12351which are generated are disappointingly repetitive and lacking in 12352richness. 12353In fact, very little semantic information is used to assign contours; really 12354just that inferred by the crude punctuation-driven method described 12355earlier. 
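.pp
To make the crudeness concrete, a version of that punctuation-driven scheme can be sketched in a dozen lines (Python again, purely for illustration); it assumes the text has already been divided into feet, and it returns each tone group as a contour number together with a list of feet in which the tonic is marked.
.LB
.nf
import re

# Illustrative only: the crude rules discussed earlier in the chapter.
# A tone-group boundary at every punctuation mark; the tonic on the
# final foot of each group; contour 4 for the first group of a
# paragraph and contour 1 thereafter.
def assign_intonation(paragraph):
    pieces = [p.strip() for p in re.split("[,;:.?!]", paragraph)
              if p.strip()]
    groups = []
    for i, piece in enumerate(pieces):
        feet = [f.strip() for f in piece.split("/") if f.strip()]
        contour = 4 if i == 0 else 1   # paragraph-initial group gets 4
        if feet:
            feet[-1] = "*" + feet[-1]  # tonic stress on the final foot
        groups.append((contour, feet))
    return groups
.fi
.LE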
12356.pp 12357The higher-level semantic problems associated with speech output were 12358studied some years ago under the 12359title "synthetic elocution" (Vanderslice, 1968). 12360.[ 12361Vanderslice 1968 12362.] 12363A set of rules was generated and tested by hand on a sample passage, 12364the first part of which is shown in Table 9.4. 12365However, no attempt was made to formalize the rules in a computer program, 12366and indeed it was recognized that a number of important questions, 12367such as the form of the semantic information assumed at the input, 12368had been left unanswered. 12369.RF 12370.nr x0 \w'\0\0 psychologist '+\w'emphasis assigned because of antithesis with ' 12371.nr x1 (\n(.l-\n(x0)/2 12372.in \n(x1u 12373.ta \w'\0\0 psychologist 'u 12374\l'\n(x0u\(ul' 12375.sp 12376Human experience and human behaviour are accessible to 12377observation by everyone. The psychologist tries to bring 12378them under systematic study. What he perceives, however, 12379anyone can perceive; for his task he requires no microscope 12380or electronic gear. 12381.sp2 12382\0\0 word comments 12383\l'\n(x0u\(ul' 12384.sp 12385\01 Human special treatment because paragraph-initial 12386\04 human accent deleted because it echoes word 1 1238713 psychologist emphasis assigned because of antithesis with 12388 "everyone" 1238917 them anaphoric to "Human experience and human 12390 behaviour" 1239119 systematic emphasis assigned because of contrast with 12392 "observation" 1239320 study emphasis? \(em text is ambiguous whether 12394 "observation" is a kind of study that is 12395 nonsystematic, or an activity contrasting 12396 with the entire concept of "systematic study" 1239721 What increase in pitch for "What he perceives" 12398 because it is not the subject 1239922 he accented although anaphoric to word 13 12400 because of antithesis with word 25 1240124 however decrease in pitch because it is parenthetical 1240225 anyone emphasized by antithesis with word 22 1240327 perceive unaccented because it echoes word 23, 12404 "perceives" 12405\0\0 ; semicolon assigns falling intonation 1240630 task unaccented because it is anaphoric with 12407 "tries to bring them under systematic study" 12408\l'\n(x0u\(ul' 12409.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 12410.in 0 12411.FG "Table 9.4 Sample passage and comments pertinent to synthetic elocution" 12412.pp 12413The comments in the table, which are selected and slightly edited versions 12414of those appearing in the original work (Vanderslice, 1968), are intended 12415as examples of the nature and subtlety of the prosodic influences which 12416were examined. 12417.[ 12418Vanderslice 1968 12419.] 12420The concepts of "accent" and "emphasis" are used; these relate to stress 12421but are not easy to define precisely in our tone-group terminology. 12422Fortunately we do not need an exact characterization of them for the present 12423purpose. 12424Roughly speaking, "accent" encompasses both foot-initial stress and 12425tonic stress, whereas "emphasis" is something more than this, 12426typically being realized by the fall-rise or rise-fall contours of 12427Halliday's tone groups 4 and 5 (Figure 8.5). 12428.pp 12429Particular attention is paid to anaphora and antithesis (amongst other things). 12430The first term means the repetition of a word or phrase in the text, 12431and is often applied to pronoun references.
12432In the example, the word "human" is repeated in the first few words; 12433"them" in the second sentence refers to "human experience and human 12434behaviour"; "he" in the third sentence is the previously-mentioned 12435psychologist; and "task" is anaphoric with "tries to bring them under 12436systematic study". 12437Other things being equal, anaphoric references are unaccented. 12438In our terms this means that they certainly do not receive tonic stress 12439and may not even receive foot stress. 12440.pp 12441Antithesis is defined as the contrast of ideas expressed by parallelism of 12442strongly contrasting words or phrases; and the second element taking part 12443in it is generally emphasized. 12444"Psychologist" in the passage is an antithesis of "everyone"; 12445"systematic" and possibly "study" of "observation". 12446Thus 12447.LB 12448.NI 12449/^ the psy/*chologist 12450.LE 12451would probably receive intonation contour 4, since it is also introducing 12452a new actor; while 12453.LB 12454.NI 12455/tries to /bring them /under /system/*matic /study 12456.LE 12457could receive contour 5. 12458"He" and "everyone" are antithetical; not only does the latter receive 12459emphasis but the former has its accent restored \(em for otherwise 12460it would have been removed because of anaphora with "psychologist". 12461Hence it will certainly begin a foot, possibly a tonic foot. 12462.pp 12463A factor that does not affect the sample passage is the accentuation 12464of unusual syllables of similar words to bring out a contrast. 12465For example, 12466.LB 12467.NI 12468he went 12469.ul 12470out\c 12471side, not 12472.ul 12473in\c 12474side. 12475.LE 12476Although this may seem to be just another facet of antithesis, 12477Vanderslice points out that it is phonetic rather than structural 12478similarity that is contrasted: 12479.LB 12480.NI 12481I said 12482.ul 12483de\c 12484plane, not 12485.ul 12486com\c 12487plain. 12488.LE 12489This introduces an interesting interplay between the phonetic and 12490prosodic levels. 12491.pp 12492Anaphora and antithesis provide an ideal domain for speech synthesis from 12493concept. 12494Determining them from plain text is a very difficult problem, 12495requiring a great deal of real-world knowledge. 12496The first has received some attention in the field of natural language 12497understanding. 12498Finding pronoun referents is an important problem for language translation, 12499for their gender is frequently distinguished in, say, French where it is not 12500in English. 12501Examples such as 12502.LB 12503.NI 12504I bought the wine, sat on a table, and drank it 12505.NI 12506I bought the wine, sat on a table, and broke it 12507.LE 12508have been closely studied (Wilks, 1975); for if they were to be translated 12509into French the pronoun "it" would be rendered differently in each case 12510(\c 12511.ul 12512le 12513vin, 12514.ul 12515la 12516table). 12517.[ 12518Wilks 1975 An intelligent analyzer and understander of English 12519.] 12520.pp 12521In spoken language, emphasis is used to indicate the referent of a pronoun 12522when it would not otherwise be obvious. 12523Vanderslice gives the example 12524.LB 12525.NI 12526Bill saw John across the room and he ran over to him 12527.NI 12528Bill saw John across the room and 12529.ul 12530he 12531ran over to 12532.ul 12533him, 12534.LE 12535where the emphasis reverses the pronoun referents 12536(so that John did the running). 
12537He suggests accenting a personal pronoun whenever the true 12538antecedent is not the same as the "unmarked" or default one. 12539Unfortunately he does not elaborate on what is meant by "unmarked". 12540Does it mean that the referent cannot be predicted from 12541knowledge of the words alone \(em as in the second example above? 12542If so, this is a clear candidate for speech synthesis from concept, 12543for the distinction cannot be made from text! 12544.sh "9.2 Pronunciation" 12545.pp 12546English pronunciation is notoriously irregular. 12547A poem by Charivarius, the pseudonym of a Dutch high school teacher 12548and linguist G.N.Trenite (1870\-1946), surveys the problems in an amusing 12549way and is worth quoting in full. 12550.br 12551.ev2 12552.in 0 12553.LB "nnnnnnnnnnnnnnnn" 12554.ul 12555 The Chaos 12556.sp2 12557.ne4 12558Dearest creature in Creation 12559Studying English pronunciation, 12560.in +5n 12561I will teach you in my verse 12562Sounds like corpse, corps, horse and worse. 12563.ne4 12564.in -5n 12565It will keep you, Susy, busy, 12566Make your head with heat grow dizzy; 12567.in +5n 12568Tear in eye your dress you'll tear. 12569So shall I! Oh, hear my prayer: 12570.ne4 12571.in -5n 12572Pray, console your loving poet, 12573Make my coat look new, dear, sew it. 12574.in +5n 12575Just compare heart, beard and heard, 12576Dies and diet, lord and word. 12577.ne4 12578.in -5n 12579Sword and sward, retain and Britain, 12580(Mind the latter, how it's written). 12581.in +5n 12582Made has not the sound of bade, 12583Say \(em said, pay \(em paid, laid, but plaid. 12584.ne4 12585.in -5n 12586Now I surely will not plague you 12587With such words as vague and ague, 12588.in +5n 12589But be careful how you speak: 12590Say break, steak, but bleak and streak, 12591.ne4 12592.in -5n 12593Previous, precious; fuchsia, via; 12594Pipe, shipe, recipe and choir; 12595.in +5n 12596Cloven, oven; how and low; 12597Script, receipt; shoe, poem, toe. 12598.ne4 12599.in -5n 12600Hear me say, devoid of trickery; 12601Daughter, laughter and Terpsichore; 12602.in +5n 12603Typhoid, measles, topsails, aisles; 12604Exiles, similes, reviles; 12605.ne4 12606.in -5n 12607Wholly, holly; signal, signing; 12608Thames, examining, combining; 12609.in +5n 12610Scholar, vicar and cigar, 12611Solar, mica, war and far. 12612.ne4 12613.in -5n 12614Desire \(em desirable, admirable \(em admire; 12615Lumber, plumber; bier but brier; 12616.in +5n 12617Chatham, brougham; renown but known, 12618Knowledge; done, but gone and tone, 12619.ne4 12620.in -5n 12621One, anemone; Balmoral, 12622Kitchen, lichen; laundry, laurel; 12623.in +5n 12624Gertrude, German; wind and mind; 12625Scene, Melpemone, mankind; 12626.ne4 12627.in -5n 12628Tortoise, turquoise, chamois-leather, 12629Reading, Reading; heathen, heather. 12630.in +5n 12631This phonetic labyrinth 12632Gives: moss, gross; brook, brooch; ninth, plinth. 12633.ne4 12634.in -5n 12635Billet does not end like ballet; 12636Bouquet, wallet, mallet, chalet; 12637.in +5n 12638Blood and flood are not like food, 12639Nor is mould like should and would. 12640.ne4 12641.in -5n 12642Banquet is not nearly parquet, 12643Which is said to rime with darky 12644.in +5n 12645Viscous, viscount; load and broad; 12646Toward, to forward, to reward. 12647.ne4 12648.in -5n 12649And your pronunciation's O.K. 
12650When you say correctly: croquet; 12651.in +5n 12652Rounded, wounded; grieve and sieve; 12653Friend and fiend, alive and live 12654.ne4 12655.in -5n 12656Liberty, library; heave and heaven; 12657Rachel, ache, moustache; eleven. 12658We say hallowed, but allowed; 12659People, leopard; towed, but vowed. 12660.in +5n 12661Mark the difference moreover 12662Between mover, plover, Dover; 12663.ne4 12664.in -5n 12665Leeches, breeches; wise, precise; 12666Chalice, but police and lice. 12667.in +5n 12668Camel, constable, unstable, 12669Principle, discipline, label; 12670.ne4 12671.in -5n 12672Petal, penal and canal; 12673Wait, surmise, plait, promise; pal. 12674.in +5n 12675Suit, suite, ruin; circuit, conduit, 12676Rime with: "shirk it" and "beyond it"; 12677.ne4 12678.in -5n 12679But it is not hard to tell 12680Why it's pall, mall, but Pall Mall. 12681.in +5n 12682Muscle, muscular; goal and iron; 12683Timber, climber; bullion, lion; 12684.ne4 12685.in -5n 12686Worm and storm; chaise, chaos, chair; 12687Senator, spectator, mayor. 12688.in +5n 12689Ivy, privy; famous, clamour 12690and enamour rime with "hammer". 12691.ne4 12692.in -5n 12693Pussy, hussy and possess, 12694Desert, but dessert, address. 12695.in +5n 12696Golf, wolf; countenants; lieutenants 12697Hoist, in lieu of flags, left pennants. 12698.ne4 12699.in -5n 12700River, rival; tomb, bomb, comb; 12701Doll and roll, and some and home. 12702.in +5n 12703Stranger does not rime with anger, 12704Neither does devour with clangour. 12705.ne4 12706.in -5n 12707Soul, but foul; and gaunt, but aunt; 12708Font, front, won't; want, grand and grant; 12709.in +5n 12710Shoes, goes, does. Now first say: finger, 12711And then; singer, ginger, linger. 12712.ne4 12713.in -5n 12714Real, zeal; mauve, gauze and gauge; 12715Marriage, foliage, mirage, age. 12716.in +5n 12717Query does not rime with very, 12718Nor does fury sound like bury. 12719.ne4 12720.in -5n 12721Dost, lost, post; and doth, cloth, loth; 12722Job, Job; blossom, bosom, oath. 12723.in +5n 12724Though the difference seems little 12725We say actual, but victual; 12726.ne4 12727.in -5n 12728Seat, sweat; chaste, caste; Leigh, eight, height; 12729Put, nut; granite but unite. 12730.in +5n 12731Reefer does not rime with deafer, 12732Feoffer does, and zephyr, heifer. 12733.ne4 12734.in -5n 12735Dull, bull; Geoffrey, George; ate, late; 12736Hint, pint; senate, but sedate. 12737.in +5n 12738Scenic, Arabic, Pacific; 12739Science, conscience, scientific. 12740.ne4 12741.in -5n 12742Tour, but our, and succour, four; 12743Gas, alas and Arkansas! 12744.in +5n 12745Sea, idea, guinea, area, 12746Psalm, Maria, but malaria. 12747.ne4 12748.in -5n 12749Youth, south, southern; cleanse and clean; 12750Doctrine, turpentine, marine. 12751.in +5n 12752Compare alien with Italian. 12753Dandelion with battalion, 12754.ne4 12755.in -5n 12756Sally with ally, Yea, Ye, 12757Eye, I, ay, aye, whey, key, quay. 12758Say aver, but ever, fever, 12759Neither, leisure, skein, receiver. 12760.in +5n 12761Never guess \(em it is not safe; 12762We say calves, valves, half, but Ralf. 12763.ne4 12764.in -5n 12765Heron, granary, canary; 12766Crevice and device and eyrie; 12767.in +5n 12768Face, preface, but efface, 12769Phlegm, phlegmatic; ass, glass, bass; 12770.ne4 12771.in -5n 12772Large, but target, gin, give, verging; 12773Ought, out, joust and scour, but scourging; 12774.in +5n 12775Ear, but earn; and wear and tear 12776Do not rime with "here", but "ere". 
12777.ne4 12778.in -5n 12779Seven is right, but so is even; 12780Hyphen, roughen, nephew, Stephen; 12781.in +5n 12782Monkey, donkey; clerk and jerk; 12783Asp, grasp, wasp; and cork and work. 12784.ne4 12785.in -5n 12786Pronunciation \(em think of psyche - 12787Is a paling, stout and spikey; 12788.in +5n 12789Won't it make you lose your wits, 12790Writing groats and saying "groats"? 12791.ne4 12792.in -5n 12793It's a dark abyss or tunnel, 12794Strewn with stones, like rowlock, gunwale, 12795.in +5n 12796Islington and Isle of Wight, 12797Housewife, verdict and indict. 12798.ne4 12799.in -5n 12800Don't you think so, reader, rather 12801Saying lather, bather, father? 12802.in +5n 12803Finally: which rimes with "enough", 12804Though, through, plough, cough, hough or tough? 12805.ne4 12806.in -5n 12807Hiccough has the sound of "cup", 12808My advice is ... give it up! 12809.LE "nnnnnnnnnnnnnnnn" 12810.br 12811.ev 12812.rh "Letter-to-sound rules." 12813Despite such irregularities, it is surprising how much can be done 12814with simple letter-to-sound rules. 12815These specify phonetic equivalents of word fragments and single letters. 12816The longest stored fragment which matches the current word is translated, 12817and then the same strategy is adopted on the remainder of the word. 12818Table 9.5 shows some English fragments and their pronunciations. 12819.RF 12820.nr x0 1.5i+\w'pronunciation ' 12821.nr x1 (\n(.l-\n(x0)/2 12822.in \n(x1u 12823.ta 1.5i 12824fragment pronunciation 12825\l'\n(x0u\(ul' 12826.sp 12827-p- \fIp\fR 12828-ph- \fIf\fR 12829-phe| \fIf ee\fR 12830-phe|s \fIf ee z\fR 12831-phot- \fIf uh u t\fR 12832-place|- \fIp l e i s\fR 12833-plac|i- \fIp l e i s i\fR 12834-ple|ment- \fIp l i m e n t\fR 12835-plie|- \fIp l aa i y\fR 12836-post \fIp uh u s t\fR 12837-pp- \fIp\fR 12838-pp|ly- \fIp l ee\fR 12839-preciou- \fIp r e s uh\fR 12840-proce|d- \fIp r uh u s ee d\fR 12841-prope|r- \fIp r o p uh r\fR 12842-prov- \fIp r uu v\fR 12843-purpose- \fIp er p uh s\fR 12844-push- \fIp u sh\fR 12845-put \fIp u t\fR 12846-puts \fIp u t s\fR 12847\l'\n(x0u\(ul' 12848.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 12849.in 0 12850.FG "Table 9.5 Word fragments and their pronunciations" 12851.pp 12852It is sometimes important to specify that a rule applies only when 12853the fragment is matched at the beginning or end of a word. 12854In the Table "-" means that other fragments can precede or follow this 12855one. 12856The "|" sign is used to separate suffixes from a word stem, 12857as will be explained 12858shortly. 12859.pp 12860An advantage of the longest-string search strategy is that it is easy 12861to account for exceptions simply by incorporating them into the fragment 12862table. 12863If they occur in the input, the complete word will automatically be 12864matched first, before any fragment of it is translated. 12865The exception list of complete words can be surprisingly small for 12866quite respectable performance. 12867Table 9.6 shows the entire dictionary for an excellent early pronunciation 12868system written at Bell Laboratories (McIlroy, 1974). 12869.[ 12870McIlroy 1974 12871.] 12872Some of the words are notorious exceptions in English, while others are 12873included simply because the rules would run amok on them. 12874Notice that the exceptions are all quite short, with only a few of them 12875having more than two syllables. 
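.pp
The longest-match strategy itself takes only a few lines of code.
The following sketch is illustrative rather than a reproduction of McIlroy's program: the table holds a handful of fragments loosely based on Table 9.5 plus some single letters invented for the example, the word-boundary anchoring conveyed by "-" in Table 9.5 is ignored, and exception words (here "women", with a guessed transcription) are simply entered as fragments so that, being longest, they are matched before any of their parts.
.LB
.nf
# Illustrative longest-match translation.  The entries are a tiny,
# partly invented sample; a real fragment table has thousands.
FRAGMENTS = {
    "women": "w i m i n",    # a whole exception word
    "phot":  "f uh u t",
    "ph":    "f",
    "p":     "p",
    "o":     "o",
    "n":     "n",
    "t":     "t",
}

def transcribe(word):
    phones = []
    while word:
        # longest stored fragment matching the front of the word
        for length in range(len(word), 0, -1):
            if word[:length] in FRAGMENTS:
                phones.append(FRAGMENTS[word[:length]])
                word = word[length:]
                break
        else:
            word = word[1:]   # no rule applies: here we just skip it
    return " ".join(phones)

# transcribe("photon") gives "f uh u t o n"
# transcribe("women")  gives "w i m i n"
.fi
.LE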
12876.RF 12877.nr x1 0.9i+0.9i+0.9i+0.9i+0.9i+0.9i 12878.nr x1 (\n(.l-\n(x1)/2 12879.in \n(x1u 12880.ta 0.9i +0.9i +0.9i +0.9i +0.9i 12881a doesn't guest meant reader those 12882alkali doing has moreover refer to 12883always done have mr says today 12884any dr having mrs seven tomorrow 12885april early heard nature shall tuesday 12886are earn his none someone two 12887as eleven imply nothing something upon 12888because enable into nowhere than very 12889been engine is nuisance that water 12890being etc island of the wednesday 12891below evening john on their were 12892body every july once them who 12893both everyone live one there whom 12894busy february lived only thereby whose 12895copy finally living over these woman 12896do friday many people they women 12897does gas maybe read this yes 12898.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 12899.in 0 12900.FG "Table 9.6 Exception table for a simple pronunciation program" 12901.pp 12902Special action has to be taken with final "e"'s. 12903These lengthen and alter the quality 12904of the preceding vowel, so that "bit" becomes "bite" and so on. 12905Unfortunately, if the word has a suffix the "e" must be detected even though 12906it is no longer final, as in "lonely", and it is even dropped sometimes 12907("biting") \(em otherwise these would be pronounced "lonelly", "bitting". 12908To make matters worse the suffix may be another word: we do not 12909want "kiteflying" to have an extra syllable which rhymes with "deaf"! 12910Although simple procedures can be developed to take care of common 12911word endings like "-ly", "-ness", "-d", it is difficult to decompose 12912compound words like "wisecrack" and "bumblebee" reliably \(em but this must 12913be done if they are not to be articulated with three syllables instead of two. 12914Of course, there are exceptions to the final "e" rule. 12915Many common words ("some", "done", "[live]\dV\u") disobey the rule by not 12916lengthening the main vowel, while in other, rarer, ones ("anemone", 12917"catastrophe", "epitome") the final "e" is actually pronounced. 12918There are also some complete anomalies ("fete"). 12919.pp 12920McIlroy's (1974) system is a superb example of a robust program which takes 12921a pragmatic approach to these problems, accepting that they will never be 12922fully solved, and which is careful to degrade 12923gracefully when stumped. 12924.[ 12925McIlroy 1974 12926.] 12927The pronunciation of each word is found by a succession of increasingly 12928desperate trials: 12929.LB 12930.NP 12931replace upper- by lower-case letters, strip punctuation, and try again; 12932.NP 12933remove final "-s", replace final "ie" by "y", and try again; 12934.NP 12935reject a word without a vowel; 12936.NP 12937repeatedly mark any suffixes with "|"; 12938.NP 12939mark with "|" probable morph divisions in compound words; 12940.NP 12941mark potential long vowels indicated by "e|", 12942and long vowels elsewhere in the word; 12943.NP 12944mark voiced medial "s" as in "busy", "usual"; 12945replace final "-s" if stripped; 12946.NP 12947scanning the word from left to right, apply letter-to-sound rules 12948to word fragments; 12949.NP 12950when all else fails spell the word, punctuation and all 12951(burp on letters for which no spelling rule exists). 
12952.LE 12953.RF 12954.nr x0 \w'| ment\0\0\0'+\w'replace final ie by y\0\0\0'+\w'except when no vowel would remain in ' 12955.nr x1 (\n(.l-\n(x0)/2 12956.in \n(x1u 12957.ta \w'| ment\0\0\0'u +\w'replace final ie by y\0\0\0'u 12958suffix action notes and exceptions 12959\l'\n(x0u\(ul' 12960.sp 12961s strip off final s except in context us 12962\&' strip off final ' 12963ie replace final ie by y 12964e replace final e by E when it is the only vowel in a word 12965 (long "e") 12966 12967| able place suffix mark as except when no vowel would remain in 12968| ably shown the rest of the word 12969e | d 12970e | n 12971e | r 12972e | ry 12973e | st 12974e | y 12975| ful 12976| ing 12977| less 12978| ly 12979| ment 12980| ness 12981| or 12982 12983| ic place suffix mark as 12984| ical shown and terminate 12985e | final e processing 12986\l'\n(x0u\(ul' 12987.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 12988.in 0 12989.FG "Table 9.7 Rules for detecting suffixes for final 'e' processing" 12990.pp 12991Table 9.7 shows the suffixes which the program recognizes, with some comments 12992on their processing. 12993Multiple suffixes are detected and marked in words like 12994"force|ful|ly" and "spite|ful|ness". 12995This allows silent "e"'s to be spotted even when they occur far back in a 12996word. 12997Notice that the suffix marks are available to the word-fragment 12998rules of Table 9.5, and are frequently used by them. 12999.pp 13000The program has some 13001.ul 13002ad hoc 13003rules for dealing with compound words like "race|track", "house|boat"; 13004these are applied as well as normal suffix splitting so that multiple 13005decompositions like "pace|make|r" can be accomplished. 13006The rules look for short letter sequences which do not 13007usually appear in monomorphemic words. 13008It is impossible, however, to detect every morph boundary 13009by such rules, and the program inevitably makes mistakes. 13010Examples of boundaries which go undetected are 13011"edge|ways", "fence|post", "horse|back", "large|mouth", "where|in"; 13012while boundaries are incorrectly inserted into "comple|mentary", 13013"male|volent", "prole|tariat", "Pame|la". 13014.pp 13015We now seem to have presented two opposing points of view on the pronunciation 13016problem. 13017Charivarius, the Dutch poet, shows that an enormous number of 13018exceptional words exist; whereas McIlroy's program makes do with a tiny 13019exception dictionary. 13020These views can be reconciled by noting that most of Charivarius' words 13021are relatively uncommon. 13022McIlroy tested his program against the 2000 most frequent words in a large 13023corpus (Kucera and Francis, 1967), 13024and found that 97% were pronounced correctly if word frequencies were 13025taken into account. 13026.[ 13027Kucera Francis 1967 13028.] 13029(The notion of "correctness" is of course a rather subjective one.) However, 13030he estimated that on the remaining words the success rate was only 88%. 13031.pp 13032The system is particularly impressive in that it is prepared to say 13033anything: if used, for example, on source programs in a high-level 13034computer language it will say the keywords and pronouncable 13035identifiers, spell the other identifiers, and even give the names of special 13036symbols (like +, <, =) correctly! 13037.rh "Morphological analysis." 
13038The use of letter-to-sound rules provides a cheap and fast technique 13039for pronunciation \(em the fragment table and exception dictionary for the 13040program described above occupy only 11 Kbyte of storage, and can easily 13041be kept in solid-state read-only memory. 13042It produces reasonable results if careful attention is paid to rules 13043for suffix-splitting. 13044However, it is inherently limited because it is not possible in general 13045to detect compound words by simple rules which operate on the lexical 13046structure of the word. 13047.pp 13048Compounds can only be found reliably by using a morph dictionary. 13049This gives the added advantage that syntactic information 13050can be stored with the morphs to assist with rhythm assignment according 13051to the Chomsky-Halle theory. 13052However, it was noted earlier that morphs, unlike the grammatically-determined 13053morphemes, are not very well defined from a linguistic point of view. 13054Some morphemic decompositions are obviously not morphic because the 13055constituents do not in any way resemble the final word; 13056while others, where the word is simply a concatenation 13057of its components, are clearly morphic. 13058Between these extremes lies a hazy region where what one considers 13059to be a morph depends upon how complex one is prepared to make the 13060concatenation rules. 13061The following description draws on techniques used in a project at MIT 13062in which a morph-based pronunciation system has been implemented 13063(Lee, 1969; Allen, 1976). 13064.[ 13065Lee 1969 13066.] 13067.[ 13068Allen 1976 Synthesis of speech from unrestricted text 13069.] 13070.pp 13071Estimates of the number of morphs in English vary from 10,000 to 30,000. 13072Although these seem to be very large numbers, they are considerably less 13073than the number of words in the language. 13074For example, Webster's 13075.ul 13076New Collegiate Dictionary 13077(7'th edition) contains about 100,000 entries. 13078If all forms of the words were included, this number would probably 13079double. 13080.pp 13081There are several classes of morphs, with restrictions on the combinations 13082that occur. 13083A general word has prefixes, a root, and suffixes, as shown in Figure 9.3; 13084only the root is mandatory. 13085.FC "Figure 9.3" 13086Suffixes usually perform a grammatical role, affecting the 13087conjugation of a verb or declension of a noun; or transforming one 13088part of speech into another 13089("-al" can make a noun into an adjective, while "-ness" performs the reverse 13090transformation.) Other 13091suffixes, such as "-dom" or "-ship", only apply to certain parts of 13092speech (nouns, in this case), but do not change the grammatical 13093role of the word. Such suffixes, and all prefixes, alter the meaning 13094of a word. 13095.pp 13096Some root morphs cannot combine with other morphs but always stand 13097alone \(em for instance, "this". 13098Others, called free morphs, can either occur on their own or combine 13099with further morphs to form a word. 13100Thus the root "house" can be joined on either side by another root, 13101such as "boat", 13102or by a suffix such as "ing". 13103A third type of root morph is one which 13104.ul 13105must 13106combine with another morph, like "crimin-", "-ceive". 13107.pp 13108Even with a morph dictionary, decomposing a word into a sequence 13109of morphs is not a trivial operation. 13110The process of lexical concatenation often results in a 13111minor change in the constituents. 
13112How big this change is allowed to be governs the morph system being used. 13113For example, Allen (1976) gives three concatenation rules: a 13114final "e" can be omitted, as in 13115.ta 1.1i 13116.LB 13117.NI 13118give + ing \(em> giving; 13119.LE 13120the last consonant of the root can be doubled, as in 13121.LB 13122.NI 13123bid + ing \(em> bidding; 13124.LE 13125or a final "y" can change to an "i", as in 13126.LB 13127.NI 13128handy + cap \(em> handicap. 13129.[ 13130Allen 1976 Synthesis of speech from unrestricted text 13131.] 13132.LE 13133If these are the only rules permitted, the morph dictionary will 13134have to include multiple versions of some suffixes. 13135For example, the plural morpheme [-s] needs to be represented both by 13136"-s" and "-es", to account for 13137.LB 13138.NI 13139pea + s \(em> peas 13140.LE 13141and 13142.LB 13143.NI 13144baby + es \(em> babies (using the "y" \(em> "i" rule). 13145.LE 13146This would not be necessary if a "y" \(em> "ie" rule were included too. 13147Similarly, the morpheme [-ic] will include morphs 13148"-ic" and "-c"; the latter to cope with 13149.LB 13150.NI 13151specify + c \(em> specific (using the "y" \(em> "i" rule). 13152.LE 13153Furthermore, non-morphemic roots such as "galact" need to be included because 13154the concatenation rules do not capture the transformation 13155.LB 13156.NI 13157galaxy + ic \(em> galactic. 13158.LE 13159There is clearly a trade-off between the size of the morph dictionary 13160and the complexity of the concatenation rules. 13161.pp 13162Since a text-to-speech system is presented with already-concatenated 13163morphs, it must be prepared to reverse the effects of the concatenation 13164rules to deduce the constituents of a word. 13165When two morphs combine with any of the three rules given above, 13166the changes in spelling occur only in the lefthand one. 13167Therefore the word is best scanned in a right-to-left direction to 13168split off the morphs starting with suffixes, as McIlroy's program does. 13169If the procedure fails at any point, one of the three rules is 13170hypothesized, its effect is undone, and splitting continues. 13171For example, consider the word 13172.LB 13173.NI 13174grasshoppers <\(em grass + hop + er + s 13175.LE 13176(Lee, 1969). 13177.[ 13178Lee 1969 13179.] 13180The "-s" is detected first, then "-er"; these are both stored in 13181the dictionary as suffixes. 13182The remainder, "grasshopp", cannot be decomposed and does not appear 13183in the dictionary. 13184So each of the rules above is hypothesized in turn, and the 13185result investigated. (The "y" \(em> "i" rule is obviously not 13186applicable.) When 13187the final-consonant-doubling rule is considered, the sequence 13188"grasshop" is investigated. 13189"Shop" could be split off this, but then the unknown morph "gras" 13190would result. 13191The alternative, to remove "hop", leaves a remainder "grass" which 13192.ul 13193is 13194a free morph, as desired. 13195Thus a unique and correct decomposition is obtained. 13196Notice that the procedure would fail if, for example, "grass" had 13197been inadvertently omitted from the dictionary. 13198.pp 13199Sometimes, several seemingly valid decompositions present themselves 13200(Allen, 1976). 13201.[ 13202Allen 1976 Synthesis of speech from unrestricted text 13203.] 
13204For example: 13205.LB 13206.NI 13207scarcity <\(em scar + city 13208.NI 13209 <\(em scarce + ity (using final-"e" deletion) 13210.NI 13211 <\(em scar + cite + y (using final-"e" deletion) 13212.NI 13213resting <\(em rest + ing 13214.NI 13215 <\(em re + sting 13216.NI 13217biding <\(em bide + ing (using final-"e" deletion) 13218.NI 13219 <\(em bid + ing 13220.NI 13221unionized <\(em un + ion + ize + d 13222.NI 13223 <\(em union + ize + d 13224.NI 13225winding <\(em [wind]\dN\u + ing 13226.NI 13227 <\(em [wind]\dV\u + ing. 13228.LE 13229The last distinction is important because the pronunciation of "wind" 13230depends on whether it is a noun or a verb. 13231.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 13232.pp 13233Several sources of information can be used to resolve these ambiguities. 13234The word structure of Figure 9.3, together with the division of root 13235morphs into bound and free ones, may eliminate some possibilities. 13236Certain letter sequences (such as "rp") do not appear at the beginning 13237of a word or morph, and others never occur at the end. 13238Knowledge of these sequences can reject some unacceptable 13239decompositions \(em or perhaps more importantly, can enable intelligent guesses 13240to be made in cases where a constituent morph has been omitted from the 13241dictionary. 13242The grammatical function of suffixes allows suffix sequences to be 13243checked for compatibility. 13244The syntax of the sentence, together with suffix knowledge, can 13245rule out other combinations. 13246Semantic knowledge will occasionally be necessary (as in the "unionized" 13247and "winding" examples above \(em compare a "winding road" with a "winding 13248blow"). 13249Finally, Allen (1976) suggests that a preference structure on composition 13250rules can be used to resolve ambiguity. 13251.[ 13252Allen 1976 Synthesis of speech from unrestricted text 13253.] 13254.pp 13255Once the morphological structure has been determined, 13256the rest of the pronunciation 13257process is relatively easy. 13258A phonetic transcription of each morph may be stored in the morph dictionary, 13259or else letter-to-sound rules can be used on individual morphs. 13260These are likely to be quite successful because final-"e" processing can 13261now be done with confidence: there are no hidden final "e"'s in the middle 13262of morphs. 13263In either case the resulting phonetic transcriptions of the individual morphs 13264must be concatenated to give the transcription of the complete word. 13265Although some contextual modification has to be accounted for, 13266it is relatively straightforward and easy to predict. 13267For example, the plural morphs "-s" and "-es" can be realized phonetically 13268by 13269.ul 13270uh\ z, 13271.ul 13272s, 13273or 13274.ul 13275z 13276depending on context. 13277Similarly the past-tense suffix "-ed" may be rendered as 13278.ul 13279uh\ d, 13280.ul 13281t, 13282or 13283.ul 13284d. 13285The suffixes "-ion" and "-ure" sometimes cause modification of the previous 13286morph: for example 13287.LB 13288.NI 13289act + ion \(em> \c 13290.ul 13291a k t\c 13292 + ion \(em> \c 13293.ul 13294a k sh uh n. 13295.LE 13296.pp 13297The morph dictionary does not remove the need for a lexicon of exceptional 13298words.
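.pp
Before turning to these exceptions, it may help to see the splitting procedure itself in miniature.
The following sketch (in Python, purely for illustration) works from right to left, hypothesizing the three concatenation rules in turn as described above; the suffix and root lists are tiny samples invented for the example, it returns just the first decomposition it finds, and it attempts none of the arbitration between rival analyses discussed above.
.LB
.nf
# Toy right-to-left morph decomposition.  SUFFIXES and ROOTS stand in
# for a real morph dictionary.
SUFFIXES = {"s", "er", "ing", "ed"}
ROOTS = {"grass", "hop", "shop", "bid", "bide", "rest"}

def undone(stem):
    # candidate spellings of a stem, with each concatenation rule
    # hypothesized in turn and its effect reversed
    yield stem
    yield stem + "e"                  # final "e" may have been dropped
    if len(stem) > 1 and stem[-1] == stem[-2]:
        yield stem[:-1]               # final consonant may have doubled
    if stem.endswith("i"):
        yield stem[:-1] + "y"         # "y" may have been changed to "i"

def decompose(word):
    if word in ROOTS:
        return [word]
    for i in range(len(word) - 1, 0, -1):
        tail = word[i:]
        if tail in SUFFIXES or tail in ROOTS:
            for stem in undone(word[:i]):
                head = decompose(stem)
                if head:
                    return head + [tail]
    return None

# decompose("grasshoppers") gives ["grass", "hop", "er", "s"]
.fi
.LE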
13299The irregular final-"e" words mentioned earlier ("done", "anemone", "fete") 13300need to be treated on an individual basis, 13301as do words such as "quadruped" which have misleading endings 13302(it should not be decomposed as "quadrup|ed"). 13303.rh "Pronunciation of languages other than English." 13304Text-to-speech systems for other languages have been reported in 13305the literature. 13306(For example, French, Esperanto, 13307Italian, Russian, Spanish, and German are covered 13308by Lesmo 13309.ul 13310et al, 133111978; O'Shaughnessy 13312.ul 13313et al, 133141981; Sherwood, 1978; 13315Mangold and Stall, 1978). 13316.[ 13317Lesmo 1978 13318.] 13319.[ 13320O'Shaughnessy Lennig Mermelstein Divay 1981 13321.] 13322.[ 13323Sherwood 1978 13324.] 13325.[ 13326Mangold Stall 1978 13327.] 13328Generally speaking, these present fewer difficulties than does English. 13329Esperanto is particularly easy because each letter in its orthography 13330has only one sound, making the pronunciation problem trivial. 13331Moreover, stress in polysyllabic words always occurs on the penultimate 13332syllable. 13333.pp 13334It is tempting and often sensible when designing a synthesis system for 13335English to use an utterance representation somewhere between phonetics and 13336ordinary spelling. 13337This may happen in practice even if it is not intended: a user, finding 13338that a given word is pronounced incorrectly, will alter the spelling to 13339make it work. 13340The World English Spelling alphabet (Dewey, 1971), amongst others (Haas, 1966), 13341is a simplified and apparently natural scheme which was developed by the 13342spelling reform movement. 13343.[ 13344Dewey 1971 13345.] 13346.[ 13347Haas 1966 13348.] 13349It maps very simply on to a phonetic representation, just like Esperanto. 13350However, it can provide little help with the crucial problem of stress 13351assignment, except perhaps by explicitly indicating reduced vowels. 13352.sh "9.3 Discussion" 13353.pp 13354This chapter has really only touched the tip of a linguistic iceberg. 13355I have given some examples of representations, rules, algorithms, 13356and exceptions, to make the concepts more tangible; but a whole mass of 13357detail has been swept under the carpet. 13358.pp 13359There are two important messages that are worth reiterating once more. 13360The first is that the representation of the input \(em that is, 13361whether it be a "concept" 13362in some semantic domain, a syntactic description of an utterance, a 13363decomposition into morphs, plain text or some contrived re-spelling of it \(em 13364is crucial to the quality of the output. 13365Almost any extra information about the utterance can be taken into account 13366and used to improve the speech. 13367It is difficult to derive such information if it is not provided explicitly, 13368for the process of climbing the tree from text to semantic representation is 13369at least as hard as descending it to a phonetic transcription. 13370.pp 13371Secondly, simple algorithms perform remarkably well \(em witness the 13372punctuation-driven intonation assignment scheme, and word fragment rules 13373for pronunciation. 13374However, the combined degradation contributed by several imperfect 13375processes is likely to impair speech quality very seriously. 13376And great complexity is introduced when these simple algorithms are 13377discarded in favour of more sophisticated ones.
13378There is, for example, a world of difference between a pronunciation 13379program that copes with 97% of common words and one that deals correctly 13380with 99% of a random sample from a dictionary. 13381.pp 13382Some of the options that face the system designer are recapitulated in 13383Figure 9.4. 13384.FC "Figure 9.4" 13385Starting from text, one can take the simple approach of lexically-based 13386suffix-splitting, letter-to-sound rules, and prosodics derived 13387from punctuation, to generate a phonetic transcription. 13388This will provide a cheap system which is relatively easy to implement 13389but whose speech quality will probably not be acceptable to any but the 13390most dedicated listener 13391(such as a blind person with no other access to reading material). 13392.pp 13393The biggest improvement in speech quality from such a system would 13394almost certainly come from more intelligent prosodic 13395control \(em particularly of intonation. 13396This, unfortunately, is also by far the most difficult to make unless 13397intonation contours, tonic stresses, and tone-group boundaries are hand-coded 13398into the input. 13399To generate the appropriate information from text one has to climb to the 13400upper levels in Figure 9.4 \(em and even when these are reached, the problems 13401are by no means over. 13402Still, let us climb the tree. 13403.pp 13404For syntax analysis, part-of-speech information is needed; and for this 13405the grammatical roles of individual words in the text must be ascertained. 13406A morph dictionary is the most reliable way to do this. 13407A linguist may prefer to go from morphs to syntax by way of morphemes; 13408but this is not necessary for the present purpose. 13409Just the information that 13410the morph "went" is a verb can be stored in the dictionary, instead 13411of its decomposition [went]\ =\ [go]\ +\ [ed]. 13412.pp 13413Now that we have the morphological structure of the text, stress assignment rules 13414can be applied to produce more accurate speech rhythms. 13415The morph decomposition will also allow improvements to be made to the 13416pronunciation, particularly in the case of silent "e"'s in compound words. 13417But the ability to assign intonation has hardly been improved at all. 13418.pp 13419Let us proceed upwards. 13420Now the problems become really difficult. 13421A semantic representation of the text is needed; but what exactly does this 13422mean? 13423We certainly must have 13424.ul 13425morphemic 13426knowledge, for now the fact that "went" is a derivative of "go" 13427(rather than any other verb) becomes crucial. 13428Very well, let us augment the morph dictionary with morphemic information. 13429But this does not attack the problem of semantic representation. 13430We may wish to resolve pronoun references to help assign stress. 13431Parts of the problem are solved in principle 13432and reported in the artificial intelligence 13433literature, but if such an ability is incorporated into the speech 13434synthesis system it will become enormously complicated. 13435In addition, we have seen that knowledge of antitheses in the text will greatly 13436assist intonation assignment, but procedures for extracting this 13437information constitute a research topic in their own right. 13438.pp 13439Now step back and take a top-down approach. 13440What could we do with this semantic understanding and knowledge of the structure 13441of the discourse if we had it? 
13442Suppose the input were a "concept" in some as yet undetermined representation. 13443What are the 13444.ul 13445acoustic 13446manifestations of such high-level features as anaphoric references or 13447antithetical comparisons, 13448of parenthetical or satirical remarks, 13449of emotions: warmth, sarcasm, sadness and despair? 13450Can we program the art of elocution? 13451These are good questions. 13452.sh "9.4 References" 13453.LB "nnnn" 13454.[ 13455$LIST$ 13456.] 13457.LE "nnnn" 13458.sh "9.5 Further reading" 13459.pp 13460Books on pronunciation give surprisingly little help in designing 13461a text-to-speech procedure. 13462The best aid is a good on-line dictionary and flexible software to 13463search it and record rules, examples, and exceptions. 13464Here are some papers that describe existing systems. 13465.LB "nn" 13466.\"Ainsworth-1974-1 13467.]- 13468.ds [A Ainsworth, W.A. 13469.ds [D 1974 13470.ds [T A system for converting text into speech 13471.ds [J IEEE Trans Audio and Electroacoustics 13472.ds [V AU-21 13473.ds [P 288-290 13474.nr [P 1 13475.nr [T 0 13476.nr [A 1 13477.nr [O 0 13478.][ 1 journal-article 13479.in+2n 13480.in-2n 13481.\"Colby-1978-2 13482.]- 13483.ds [A Colby, K.M. 13484.as [A ", Christinaz, D. 13485.as [A ", and Graham, S. 13486.ds [D 1978 13487.ds [K * 13488.ds [T A computer-driven, personal, portable, and intelligent speech prosthesis 13489.ds [J Computers and Biomedical Research 13490.ds [V 11 13491.ds [P 337-343 13492.nr [P 1 13493.nr [T 0 13494.nr [A 1 13495.nr [O 0 13496.][ 1 journal-article 13497.in+2n 13498.in-2n 13499.\"Elovitz-1976-3 13500.]- 13501.ds [A Elovitz, H.S. 13502.as [A ", Johnson, R.W. 13503.as [A ", McHugh, A. 13504.as [A ", and Shore, J.E. 13505.ds [D 1976 13506.ds [K * 13507.ds [T Letter-to-sound rules for automatic translation of English text to phonetics 13508.ds [J IEEE Trans Acoustics, Speech and Signal Processing 13509.ds [V ASSP-24 13510.ds [N 6 13511.ds [P 446-459 13512.nr [P 1 13513.ds [O December 13514.nr [T 0 13515.nr [A 1 13516.nr [O 0 13517.][ 1 journal-article 13518.in+2n 13519.in-2n 13520.\"Kooi-1978-4 13521.]- 13522.ds [A Kooi, R. 13523.as [A " and Lim, W.C. 13524.ds [D 1978 13525.ds [T An on-line minicomputer-based system for reading printed text aloud 13526.ds [J IEEE Trans Systems, Man and Cybernetics 13527.ds [V SMC-8 13528.ds [P 57-62 13529.nr [P 1 13530.ds [O January 13531.nr [T 0 13532.nr [A 1 13533.nr [O 0 13534.][ 1 journal-article 13535.in+2n 13536.in-2n 13537.\"Umeda-1975-5 13538.]- 13539.ds [A Umeda, N. 13540.as [A " and Teranishi, R. 13541.ds [D 1975 13542.ds [K * 13543.ds [T The parsing program for automatic text-to-speech synthesis developed at the Electrotechnical Laboratory in 1968 13544.ds [J IEEE Trans Acoustics, Speech and Signal Processing 13545.ds [V ASSP-23 13546.ds [N 2 13547.ds [P 183-188 13548.nr [P 1 13549.ds [O April 13550.nr [T 0 13551.nr [A 1 13552.nr [O 0 13553.][ 1 journal-article 13554.in+2n 13555.in-2n 13556.\"Umeda-1976-6 13557.]- 13558.ds [A Umeda, N. 
13559.ds [D 1976 13560.ds [K * 13561.ds [T Linguistic rules for text-to-speech synthesis 13562.ds [J Proc IEEE 13563.ds [V 64 13564.ds [N 4 13565.ds [P 443-451 13566.nr [P 1 13567.ds [O April 13568.nr [T 0 13569.nr [A 1 13570.nr [O 0 13571.][ 1 journal-article 13572.in+2n 13573.in-2n 13574.LE "nn" 13575.EQ 13576delim $$ 13577.EN 13578.CH "10 DESIGNING THE MAN-COMPUTER DIALOGUE" 13579.ds RT "The man-computer dialogue 13580.ds CX "Principles of computer speech 13581.pp 13582Interactive computers are being used more and more by non-specialist people 13583without much previous computer experience. 13584As processing costs continue to decline, the overall expense of providing 13585highly interactive systems 13586becomes increasingly dominated by terminal and communications equipment. 13587Taken together, these two factors highlight the need for easy-to-use, 13588low-bandwidth interactive terminals that make maximum use of the existing 13589telephone network for remote access. 13590.pp 13591Speech output can provide versatile feedback from a computer at very low 13592cost in distribution and terminal equipment. It is attractive from several 13593points of view. 13594Terminals \(em telephones \(em are invariably in place already. 13595People without experience of computers are accustomed to their use, 13596and are not intimidated by them. 13597The telephone network is cheap to use and extends all over the world. 13598The touch-tone keypad (or a portable tone generator) 13599provides a complementary data input device which will do for many 13600purposes until the technology of speech recognition becomes better developed 13601and more widespread. 13602Indeed, many applications \(em especially information retrieval ones \(em need 13603a much smaller bandwidth from user to computer than in the reverse direction, 13604and voice output combined with restricted keypad entry provides a good match 13605to their requirements. 13606.pp 13607There are, however, severe problems in implementing natural and useful 13608interactive systems using speech output. 13609The eye can absorb information at a far greater rate than can the ear. 13610You can scan a page of text in a way which has no analogy in auditory terms. 13611Even so, it is difficult to design a dialogue which allows you to search 13612computer output visually at high speed. 13613In practice, scanning a new report is often better done at your desk 13614with a printed copy than at a computer terminal with a viewing program 13615(although this is likely to change in the near future). 13616.pp 13617With speech, the problem of organizing output becomes even harder. 13618Most of the information we learn using our ears is presented in a 13619conversational way, either in face-to-face discussions or over the telephone. 13620Verbal but non-conversational presentations, as in the 13621university lecture theatre, are known to be a rather inefficient way 13622of transmitting information. 13623The degree of interaction is extremely high even in a telephone conversation, 13624and communication relies heavily on speech gestures such as hesitations, 13625grunts, and pauses; on prosodic features such as intonation, pitch range, 13626tempo, and voice quality; and on conversational gambits such as interruption 13627and long silence. 13628I emphasized in the last two chapters the rudimentary state of knowledge 13629about how to synthesize 13630prosodic features, and the situation is even worse 13631for the other, paralinguistic, phenomena. 
.pp
There is also a very special problem with voice output, namely, the transient
nature of the speech signal.
If you miss an utterance, it's gone.
With a visual display unit, at least the last few interactions usually remain
available.
Even then, it is not uncommon to look up beyond the top of the screen and
wish that more of the history was still visible!
This obviously places a premium on a voice response system's
ability to repeat utterances.
Moreover, the dialogue designer must do his utmost to ensure that the user
is always aware of the current state of the interaction,
for there is no opportunity to refresh the memory by glancing at earlier
entries and responses.
.pp
There are two separate aspects to the man-computer interface in a voice
response system.
The first is the relationship between the system and the end user,
that is, the "consumer" of the synthesized dialogue.
The second is the relationship between the system and the applications
programmer who creates the dialogue.
These are treated separately in the next two sections.
We will have more to say about the former aspect,
for it is ultimately more important to more people.
But the applications programmer's view is important, too; for without him
no systems would exist!
The technical difficulties in creating synthetic dialogues
for the majority of voice systems probably
explain why speech output technology is still greatly under-used.
Finally we look at techniques for using small keypads such as those on
touch-tone telephones,
for they are an essential part of many voice response systems.
.sh "10.1 Programming principles for natural interaction"
.pp
Special attention must be paid to the details of the man-machine interface
in speech-output systems.
This section summarizes human factors experience
gained in developing the remote
telephone enquiry service described in Chapter 1 (Witten and Madams, 1977),
which employs an ordinary touch-tone keypad for input in conjunction with
synthetic voice response.
.[
Witten Madams 1977 Telephone Enquiry Service
.]
Most of the principles which emerged were the result of natural evolution
of the system, and were not clear at the outset.
Basically, they stem from the fact that speech is both more intrusive
and more ephemeral than writing, and so they are applicable in general to
speech output information retrieval systems with keyboard or even voice
input.
Be warned, however, that they are based upon casual observation and
speculation rather than empirical research.
There is a desperate need for proper studies of user psychology in speech
systems.
.rh "Echoing."
Most alphanumeric input peripherals echo on a character-by-character basis.
Although one can expect quite a high proportion of mistakes with
unconventional keyboards, especially when entering alphabetic data on a
basically numeric keypad, audio character echoing is distracting and annoying.
If you type "123" and the computer echoes
.LB
.NI
"one ... two ... three"
.LE
after the individual key-presses, it is liable to divert your
attention, for voice output is much more intrusive than a purely visual "echo".
.pp
Instead, an immediate response to a completed input line is preferable.
This response can take the form of a reply to a query, or, if successive
data items are being typed, confirmation of the data entered.
In the latter case, it is helpful if the information can be generated in
the same way that the user himself would be likely to verbalize it.
Thus, for example, when entering numbers:
.LB
.nr x0 \w'COMPUTER:'
.nr x1 \w'USER:'
.NI
USER:\h'\n(x0u-\n(x1u' "123#" (# is the end-of-line character)
.NI
COMPUTER: "One hundred and twenty-three."
.LE
For a query which requires lengthy processing, the input should be
repeated in a neat, meaningful format to give the user a chance to abort
the request.
.rh "Retracting actions."
Because commands are entered directly without explicit confirmation,
it must always be easy for the user to revoke his actions.
The utility of an "undo" command is now commonly recognized for
any interactive system, and it becomes even more important in speech
systems because it is easier for the user to lose his place in the
dialogue and so make errors.
.rh "Interrupting."
A command which interrupts output and returns to a known state
should be recognized at every level of the system.
It is essential that voice output be terminated immediately,
rather than at the end of the utterance.
We do not want the user to live in fear of the system embarking on
a long, boring monologue that is impossible to interrupt!
Again, the same is true of interactive dialogues which do not use speech,
but it becomes particularly important with voice response because it takes
longer to transmit information.
.rh "Forestalling prompts."
Computer-generated prompts must be explicit and frequent enough
to allow new users to understand what they are expected to do.
Experienced users will "type ahead" quite naturally,
and the system should suppress unnecessary prompts under these conditions
by inspecting the input buffer before prompting.
This allows the user to concatenate frequently-used commands into chunks whose
size is entirely at his own discretion.
.pp
With the above-mentioned telephone enquiry service, for example,
it was found that people often took advantage of the prompt-suppression
feature to enter their
user number, password, and required service number as a single keying
sequence.
As you become familiar with a service you quickly and easily learn to
forestall expected prompts by typing ahead.
This provides a very natural way for the system to adapt itself automatically
to the experience of the user.
New users will naturally wait to be prompted, and proceed through the dialogue
at a slower and more relaxed pace.
.pp
Suppressing unnecessary prompts is a good idea in any interactive system,
whether or not it uses the medium of speech \(em although it is hardly ever done
in conventional systems.
It is particularly important with speech, however, because an unexpected
or unwanted
prompt is quite distracting, and it is not so easy to ignore it as it is
with a visual display.
Furthermore, speech messages usually take longer to present
than displayed ones, so that the user is distracted for longer.
.rh "Information units."
13764Lengthy computer voice responses are inappropriate for conveying information, 13765because attention wanders if one is not actively involved in the conversation. 13766A sequential exchange of terse messages, each designed to dispense one 13767small unit of information, forces the user to take a meaningful part in the 13768dialogue. 13769It has other advantages, too, allowing a higher degree of input-dependent 13770branching, and permitting rapid recovery from errors. 13771.pp 13772The following example from the "Acidosis program", an audio response system 13773designed to help physicians to diagnose acidosis, is a good example 13774of what 13775.ul 13776not 13777to do. 13778.LB 13779"(Chime) A VALUE OF SIX-POINT-ZERO-ZERO HAS BEEN ENTERED FOR PH. 13780THIS VALUE IS IMPOSSIBLE. 13781TO CONTINUE THE PROGRAM, ENTER A NEW VALUE FOR PH IN THE RANGE 13782BETWEEN SIX-POINT-SIX AND EIGHT-POINT-ZERO 13783(beep dah beep-beep)" (Smith and Goodwin, 1970). 13784.[ 13785Smith Goodwin 1970 13786.] 13787.LE 13788The use of extraneous noises (for example, a "chime" heralds an error message, 13789and a "beep dah beep-beep" requests data input in the form 13790<digit><point><digit><digit>) 13791was thought necessary in the Acidosis program to keep the user awake 13792and help him with the format of the interaction. 13793Rather than a long monologue like this, 13794it seems much better to design a sequential interchange of terse messages, 13795so that the caller can be guided into a state where he can rectify his error. 13796For example, 13797.LB 13798.nf 13799.ne11 13800.nr x0 \w'COMPUTER:' 13801.nr x1 \w'CALLER:' 13802CALLER:\h'\n(x0u-\n(x1u' "6*00#" 13803COMPUTER: "Entry out of range" 13804CALLER:\h'\n(x0u-\n(x1u' "6*00#" (persists) 13805COMPUTER: "The minimum acceptable pH value is 6.6" 13806CALLER:\h'\n(x0u-\n(x1u' "9*03#" 13807COMPUTER: "The maximum acceptable pH value is 8.0" 13808.fi 13809.LE 13810This dialogue allows a rapid exit from the error situation in the likely 13811event that the entry has simply been mis-typed. 13812If the error persists, the caller is given just one piece of information 13813at a time, and forced to continue to play an active role in the interaction. 13814.rh "Input timeouts." 13815In general, input timeouts are dangerous, because they introduce apparent 13816acausality in the system seen by the user. 13817A case has been reported where a user became "highly agitated and refused 13818to go near the terminal again after her first timed-out prompt. 13819She had been quietly thinking what to do and the terminal suddenly 13820interjecting and making its 13821own suggestions was just too much for her" (Gaines and Facey, 1975). 13822.[ 13823Gaines Facey 1975 13824.] 13825.pp 13826However, voice response systems lack the satisfying visual feedback 13827of end-of-line on termination of an entry. 13828Hence a timed-out reminder is appropriate if a delay occurs after some 13829characters have been entered. 13830This requires the operating system to support a character-by-character mode 13831of input, rather than the usual line-by-line mode. 13832.rh "Repeat requests." 13833Any voice response system must support a universal "repeat last utterance" 13834command, because old output does not remain visible. 13835A fairly sophisticated facility is desirable, as repeat requests are 13836very frequent in practice. 
13837They may be due to a simple inability to understand a response, 13838to forgetting what was said, or to distraction of attention \(em which is 13839especially common with office terminals. 13840.pp 13841In the telephone enquiry service two distinct commands were employed, 13842one to repeat the last utterance in case of misrecognition, 13843and the other to summarize the current state of the interaction 13844in case of distraction. 13845For the former, it is essential to avoid simply regenerating an utterance 13846identical with the last. 13847Some variation of intonation and rhythm is needed to prevent an annoying, 13848stereotyped response. 13849A second consecutive repeat request should trigger a paraphrased reply. 13850An error recovery sequence could be used which presented the misunderstood 13851information in a different way with more interaction, but experience 13852indicates that this is of minor importance, especially if information units 13853are kept small anyway. 13854To summarize the current state of the interaction in response to the second 13855type of repeat command necessitates the system maintaining a model of 13856the user. 13857Even a poor model, like a record of his last few transactions and their 13858results, is well worth having. 13859.rh "Varied speech." 13860Synthetic speech is usually rather dreary to listen to. 13861Successive utterances with identical intonations should be carefully avoided. 13862Small changes in speaking rate, pitch range, and mean pitch level, 13863all serve to add variety. 13864Unfortunately, little is known at present about the role of intonation in 13865interactive dialogue, although this is an active research area and 13866new developments can be expected (for a detailed report of a recent 13867research project relevant to this topic see Brown 13868.ul 13869et al, 138701980). 13871.[ 13872Brown Currie Kenworthy 1980 13873.] 13874However, even random variations in certain parameters of the pitch contour 13875are useful to relieve the tedium of repetitive intonation patterns. 13876.sh "10.2 The applications programming environment" 13877.pp 13878The comments in the last section are aimed at the applications programmer 13879who is designing the dialogue and constructing the interactive system. 13880But what kind of environment should 13881.ul 13882he 13883be given to assist with this work? 13884.pp 13885The best help the applications programmer can have is a speech generation 13886method which makes it easy for him to enter new utterances and modify 13887them on-line in cut-and-try attempts to render the man-machine dialogue 13888as natural as possible. 13889This is perhaps the most important advantage of synthesizing speech by rule 13890from a textual representation. 13891If encoded versions of natural utterances are stored, it becomes quite 13892difficult to make minor modifications to the dialogue in the light of 13893experience with it, for a recording session must be set up 13894to acquire new utterances. 13895This is especially true if more than one voice is used, or if the 13896voice belongs to a person who cannot be recalled quickly by the programmer 13897to augment the utterance library. 13898Even if it is his own voice there will still be delays, for recording 13899speech is a real-time job which usually needs a stand-alone processor, 13900and if data compression is used a substantial amount of computation will 13901be needed before the utterance is in a useable form. 
.pp
The broad phonetic input required by segmental speech synthesis-by-rule
systems is quite suitable for utterance representation.
Utterances can be entered quickly from a standard computer terminal,
and edited as text files.
Programmers must acquire skill in phonetic transcription,
but this is a small inconvenience.
The art is easily learned in an interactive situation where the effect
of modifications to the transcription can be heard immediately.
If allophones must be represented explicitly in the input then the
programmer's task becomes considerably more complicated because of the
combinatorial explosion in trial-and-error modifications.
.pp
Plain text input is also quite suitable.
A significant rate of error is tolerable if immediate audio feedback
of the result is available, so that the operator can adjust his text
to suit the pronunciation idiosyncrasies of the program.
But it is acceptable, and indeed preferable, if prosodic features are
represented explicitly in the input rather than being assigned automatically
by a computer program.
.pp
The application of voice response to interactive computer dialogue is
quite different to the problem of reading aloud from text.
We have seen that a major concern with reading machines is how to glean
information about intonation, rhythm, emphasis, tone of voice, and so on,
from an input of ordinary English text.
The significant problems of semantic processing, utilization of pragmatic
knowledge, and syntactic analysis do not, fortunately, arise in interactive
information retrieval systems.
In these, the end user is communicating with a program which has been
created by a person who knows what he wants it to say.
Thus the major difficulty is in
.ul
describing
the prosodic features rather than
.ul
deriving
them from text.
.pp
Speech synthesis by rule is a subsidiary process to the main interactive
procedure.
It would be unwise to allow
the updating of resonance parameter tracks to be interrupted by
other calls on the system, and so the synthesis process needs to be executed
in real time.
If a stand-alone processor is used for the interactive dialogue, it may
be able to handle the synthesis rules as well.
In this case the speech-by-rule program could be a library procedure,
if the system is implemented in a compiled language.
An interesting alternative with an interpretive-language implementation,
such as Basic, is to alter the language interpreter to add a new
command, "speak", which simply transfers a string representing an utterance
to an asynchronous process which synthesizes it.
However, there must be some way for an interpreted program to abort the
current synthesis in the event of an interrupt signal from the user.
.pp
If the main computer system is time-shared, the synthesis-by-rule
procedure is best executed by an independent processor.
For example, a 16-bit microcomputer controlling a hardware
formant synthesizer has been used to run the
ISP system in real time without too much difficulty (Witten and Abbess, 1979).
.[
Witten Abbess 1979
.]
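.pp
In outline, the "speak" primitive amounts to handing the utterance string to an
asynchronous worker, and remembering how to stop it.
The following sketch is purely illustrative (it is written in Python, and the
frame generator and device write are placeholders, not real synthesis rules
or driver code); it shows the main program posting utterances to a separate
synthesis process and aborting them when the user interrupts.
.LB
.nf
import queue
import threading

class Speaker:
    """Hand utterance strings to an asynchronous synthesis process."""

    def __init__(self):
        self._pending = queue.Queue()
        self._abort = threading.Event()
        threading.Thread(target=self._worker, daemon=True).start()

    def speak(self, utterance):
        """Queue a phonetic/prosodic string and return immediately."""
        self._pending.put(utterance)

    def interrupt(self):
        """Abort the current utterance and discard any queued ones."""
        self._abort.set()
        while not self._pending.empty():
            try:
                self._pending.get_nowait()
            except queue.Empty:
                break

    def _worker(self):
        while True:
            utterance = self._pending.get()
            self._abort.clear()
            for frame in self._frames_of(utterance):
                if self._abort.is_set():   # stop mid-utterance, not at its end
                    break
                self._write_frame(frame)

    def _frames_of(self, utterance):
        return [utterance]                 # placeholder for the synthesis rules

    def _write_frame(self, frame):
        pass                               # placeholder for the device driver
.fi
.LE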
13966An important task is to define an interface between the two which 13967allows the main process to control relevant aspects of the prosody of 13968the speech in a way which is appropriate to the state of the interaction, 13969without having to bother about such things as matching the intonation contour 13970to the utterance and the details of syllable rhythm. 13971Halliday's notation appears to be quite suitable for this purpose. 13972.pp 13973If there is only one synthesizer on the system, there will be no 13974difficulty in addressing it. 13975One way of dealing with multiple synthesizers is to treat them as 13976assignable devices in the same way that non-spooling peripherals 13977are in many operating systems. 13978Notice that the data rate to the synthesizer is quite low 13979if the utterance is represented as text with prosodic markers, 13980and can easily be handled by a low-speed asynchronous serial line. 13981.pp 13982The Votrax ML-I synthesizer which is discussed in the next chapter has an 13983interface which interposes it between a visual display unit and the serial 13984port that connects it to the computer. 13985The VDU terminal can be used quite normally, except that a special sequence 13986of two control characters will cause Votrax to intercept the following 13987message up to another control character, and interpret it as speech. 13988The fact that the characters which specify the spoken message do not appear 13989on the VDU screen means that the operation is invisible to the user. 13990However, this transparency can be inhibited by a switch on the synthesizer 13991to allow visual checking of the sound-segment character sequence. 13992.pp 13993Votrax buffers up to 64 sound segments, which is sufficient to generate 13994isolated spoken messages. 13995For longer passages, it can be synchronized with the constant-rate 13996serial output using the modem control lines of the serial interface, 13997together with appropriate device-driving software. 13998.pp 13999This is a particularly convenient interfacing technique in cases when the 14000synthesizer should always be associated with a certain terminal. 14001As an example of how it can be used, 14002one can arrange files each of whose lines contain a printed message, 14003together with its Votrax equivalent bracketed by the appropriate 14004control characters. 14005When such a file is listed, or examined with an editor program, the lines 14006appear simultaneously in spoken and typed English. 14007.pp 14008If a phonetic representation is used for utterances, with real-time 14009synthesis using a separate process (or processor), it is easy for 14010the programmer to fiddle about with the interactive dialogue to get 14011it feeling right. 14012For him, each utterance is just a textual string which 14013can be stored as a string constant within his program just as a VDU prompt 14014would be. He can edit it as part of his program, and "print" it to 14015the speech synthesis device to hear it. 14016There are no more technical problems to developing an interactive dialogue 14017with speech output than there are for a conventional interactive program. 14018Of course, there are more human problems, and the points discussed 14019in the last section should always be borne in mind. 14020.sh "10.3 Using the keypad" 14021.pp 14022One of the greatest advantages of speech output from computers is the 14023ubiquity of the telephone network and the possibility of using it without 14024the need for special equipment at the terminal. 
14025The requirement for input as well as output obviously presents something of a problem 14026because of the restricted nature of the telephone keypad. 14027.pp 14028Figure 10.1 shows the layout of the keypad. 14029.FC "Figure 10.1" 14030Signalling is achieved by dual-frequency tones. 14031For example, if key 7 is pressed, sinusoidal components at 852\ Hz and 1209\ Hz 14032are transmitted down the line. 14033During the process of dialling these are received by the telephone exchange 14034equipment, which assembles the digits that form a number and attempts to route 14035the call appropriately. 14036Once a connection is made, either party is free to press keys if desired 14037and the signals will be transmitted to the other end, 14038where they can be decoded by simple electronic circuits. 14039.pp 14040Dial telephones signal with closely-spaced dial pulses. 14041One pulse is generated for a "1", two for a "2", and so on. 14042(Obviously, ten pulses are generated for a "0", rather than none!) Unfortunately, 14043once the connection is made it is difficult to signal with dial pulses. 14044They cannot be decoded reliably at the other end because the telephone 14045network is not designed to transmit such low frequencies. 14046However, hand-held tone generators can be purchased for use with dial 14047telephones. 14048Although these are undeniably extra equipment, and one purpose of using speech 14049output is to avoid this, they are very cheap and portable compared with other 14050computer terminal equipment. 14051.pp 14052The small number of keys on the telephone pad makes it rather difficult 14053to use for communicating with computers. 14054Provision is made for 16 keys, but only 12 are implemented \(em the others 14055may be used for some military purposes. 14056Of course, if a separate tone generator is used then advantage can be taken 14057of the extra keys, but this will introduce incompatibility with those 14058who use unmodified touch-tone phones. 14059More sophisticated terminals are available which extend the keypad \(em such 14060as the Displayphone of Northern Telecommunications. 14061However, they are designed as a complete communications terminal and 14062contain their own visual display as well. 14063.rh "Keying alphabetic data." 14064Figure 10.2 shows the near-universal scheme for overlaying alphabetic letters 14065on to the telephone keypad. 14066.FC "Figure 10.2" 14067Since more than one symbol occupies each key, it is obviously necessary 14068to have multiple keystrokes per character if the input sequence is to be 14069decodable as a string of letters. 14070One way of doing this is to depress the appropriate button the number of 14071times corresponding to the position of the letter on it. 14072For example, to enter the letter "L" the user would key the "5" button 14073three times in rapid succession. 14074Keying rhythm must be used to distinguish the four entries "J\ J\ J", 14075"J\ K", "K\ J", and "L", unless one of the bottom three buttons is used 14076as a separator. 14077A different method is to use "*", "0", and "#" as shift keys to indicate whether 14078the first, second, or third letter on a key is intended. 14079Then "#5" would represent "L". 14080Alternatively, the shift could follow the key instead of preceding it, 14081so that "5#" represented "L". 14082.pp 14083If numeric as well as alphabetic information may be entered, a mode-shift 14084operation is commonly used to switch between numeric and alphabetic modes. 
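.pp
Decoding the shift-prefix scheme is straightforward.
The sketch below is purely illustrative (Python); it assumes the usual
assignment of letters to keys shown in Figure 10.2, with "*", "0", and "#"
selecting the first, second, and third letter on the following key, and it
ignores the numeric/alphabetic mode shift.
.LB
.nf
LETTERS = {'2': 'ABC', '3': 'DEF', '4': 'GHI', '5': 'JKL',
           '6': 'MNO', '7': 'PRS', '8': 'TUV', '9': 'WXY'}
SHIFTS = {'*': 0, '0': 1, '#': 2}      # first, second or third letter on the key

def decode_prefix_shift(keys):
    """Decode a shift-before-digit sequence, e.g. "#5" -> "L"."""
    text, i = [], 0
    while i < len(keys):
        position = SHIFTS[keys[i]]     # which letter on the key is meant
        text.append(LETTERS[keys[i + 1]][position])
        i += 2
    return ''.join(text)

# decode_prefix_shift("*505#5") == "JKL"
.fi
.LE
The suffix variant differs only in taking the digit first and the shift key
second.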
14085.pp 14086The relative merits of these three methods, multiple depressions, shift 14087key prefix, and shift key suffix, have been investigated 14088experimentally (Kramer, 1970). 14089.[ 14090Kramer 1970 14091.] 14092The results were rather inconclusive. 14093The first method seemed to be slightly inferior in terms of user accuracy. 14094It seemed that preceding rather than following shifts gave higher accuracy, 14095although this is perhaps rather counter-intuitive and may have been 14096fortuitous. 14097The most useful result from the experiments was that users exhibited 14098significant learning behaviour, and a training period of at least two hours 14099was recommended. 14100Operators were found able to key at rates of at least three to four 14101characters per second, and faster with practice. 14102.pp 14103If a greater range of characters must be represented then the coding problem 14104becomes more complex. 14105Figure 10.3 shows a keypad which can be used for entry of the full 64-character 14106standard upper-case ASCII alphabet (Shew, 1975). 14107.[ 14108Shew 1975 14109.] 14110.FC "Figure 10.3" 14111The system is intended for remote vocabulary updating in a phonetically-based 14112speech synthesis system. 14113There are three modes of operation: numeric, alphabetic, and symbolic. 14114These are entered by "##", "**", and "*0" respectively. 14115Two function modes, signalled by "#0" and "#*", allow some 14116rudimentary line-editing and monitor facilities to be incorporated. 14117Line-editing commands include character and line delete, and two kinds of 14118read-back commands \(em one tries to pronounce the words in a line 14119and the other spells out the characters. 14120The monitor commands allow the user to repeat the effect of the last input line 14121as though he had entered it again, to order the system to read back the 14122last complete output line, and to query time and system status. 14123.rh "Incomplete keying of alphanumeric data." 14124It is obviously going to be rather difficult for the operator to key 14125alphanumeric information unambiguously on a 12-key pad. 14126In the description of the telephone enquiry service in Chapter 1, 14127it was mentioned that single-key entry can be useful for alphanumeric data 14128if the ambiguity can be resolved by the computer. 14129If a multiple-character entry is known to refer to an item on a given 14130list, the characters can be keyed directly according to the coding scheme 14131of Figure 10.2. 14132.pp 14133Under most circumstances no ambiguity will arise. 14134For example, Table 10.1 shows the keystrokes that would be entered for the 14135first 50 5-letter words in an English dictionary. 14136Only two clashes occur \(em between " adore" and "afore", and 14137"agate" and "agave". 
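.pp
Checking a word list for such clashes, and building the inverted table needed
to decode keyed entries later, takes only a few lines.
The sketch below is purely illustrative (Python); the key assignment is that
of Figure 10.2, and the word list is just a handful of the words from
Table 10.1.
.LB
.nf
KEY_OF = {}
for digit, letters in {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
                       '6': 'mno', '7': 'prs', '8': 'tuv', '9': 'wxy'}.items():
    for letter in letters:
        KEY_OF[letter] = digit

def keying(word):
    """Single-key-per-character encoding, e.g. "abode" -> "22633"."""
    return ''.join(KEY_OF[c] for c in word.lower())

def invert(word_list):
    """Map each keyed sequence to the word or words it could stand for."""
    inverse = {}
    for word in word_list:
        inverse.setdefault(keying(word) + '#', []).append(word)
    return inverse

table = invert(['adore', 'afore', 'agate', 'agave', 'abode'])
clashes = {k: v for k, v in table.items() if len(v) > 1}
# clashes == {'23673#': ['adore', 'afore'], '24283#': ['agate', 'agave']}
.fi
.LE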
14138.RF 14139.nr x2 \w'abeam 'u 14140.nr x3 \w'00000# 'u 14141.nr x0 \n(x2u+\n(x3u+\n(x2u+\n(x3u+\n(x2u+\n(x3u+\n(x2u+\n(x3u+\n(x2u+\w'00000#'u 14142.nr x1 (\n(.l-\n(x0)/2 14143.in \n(x1u 14144.ta \n(x2u +\n(x3u +\n(x2u +\n(x3u +\n(x2u +\n(x3u +\n(x2u +\n(x3u +\n(x2u 14145\l'\n(x0u\(ul' 14146.sp 14147aback 22225# abide 22433# adage 23243# adore 23673# after 23837# 14148abaft 22238# abode 22633# adapt 23278# adorn 23676# again 24246# 14149abase 22273# abort 22678# adder 23337# adult 23858# agape 24273# 14150abash 22274# about 22688# addle 23353# adust 23878# agate 24283# 14151abate 22283# above 22683# adept 23378# aeger 23437# agave 24283# 14152abbey 22239# abuse 22873# adieu 23438# aegis 23447# agent 24368# 14153abbot 22268# abyss 22977# admit 23648# aerie 23743# agile 24453# 14154abeam 22326# acorn 22676# admix 23649# affix 23349# aglet 24538# 14155abele 22353# acrid 22743# adobe 23623# afoot 23668# agony 24669# 14156abhor 22467# actor 22867# adopt 23678# afore 23673# agree 24733# 14157\l'\n(x0u\(ul' 14158.in 0 14159.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 14160.FG "Table 10.1 Keying equivalents of some words" 14161As a more extensive example, in a dictionary of 24,500 words, just under 2,000 14162ambiguities (8% of words) were discovered. 14163Such ambiguities would have to be resolved interactively by the system explaining 14164its dilemma, and asking the user for a choice. 14165Notice incidentally that although the keyed sequences do not have the same 14166lexicographic order as the words, 14167no extra cost will be associated with the table-searching 14168operation if the dictionary is stored in inverted form, with each legal 14169number pointing to its English equivalent or equivalents. 14170.pp 14171A command language syntax is also a powerful way of disambiguating 14172keystrokes entered. 14173Figure 10.4 shows the keypad layout for a telephone voice calculator 14174(Newhouse and Sibley, 1969). 14175.[ 14176Newhouse Sibley 1969 14177.] 14178.FC "Figure 10.4" 14179This calculator provides the standard arithmetic operators, 14180ten numeric registers, a range of pre-defined mathematical functions, 14181and even the ability for a user to enter his own functions over the 14182telephone. 14183The number representation is fixed-point, with user control (through a system 14184function) over the precision. 14185Input of numbers is free format. 14186.pp 14187Despite the power of the calculator language, the dialogue is defined 14188so that each keystroke is unique in context and never has to be disambiguated 14189explicitly by the user. 14190Table 10.2 summarizes the command language syntax in an informal and rather 14191heterogeneous notation. 14192.RF 14193.nr x0 1.3i+1.7i+\w'some functions do not need the <value> part'u 14194.nr x1 (\n(.l-\n(x0)/2 14195.in \n(x1u 14196.ta 1.3i +1.7i 14197\l'\n(x0u\(ul' 14198construct definition explanation 14199\l'\n(x0u\(ul' 14200.sp 14201<calculation> a sequence of <operation>s followed by a 14202 call to the system function \fIE X I T\fR 14203.sp 14204<operation> <add> OR <subtract> OR 14205 <multiply> OR <divide> OR 14206 <function> OR <clear> OR 14207 <erase> OR <answer> OR 14208 <display-last> OR <display> OR 14209 <repeat> OR <cancel> 14210.sp 14211<add> + <value> # OR + # <function> 14212.sp 14213<subtract> 14214<multiply> similar to <add> 14215<divide> 14216.sp 14217<value> <numeric-value> OR \fIregister\fR <single-digit> 14218.sp 14219<numeric-value> a sequence of keystrokes like 14220 1 . 2 3 4 or 1 2 3 . 
4 or 1 2 3 4
.sp
<function> \fIfunction\fR <name> # <value> #
 some functions do not need the <value> part
.sp
<name> a sequence of keystrokes like
 \fIS I N\fR or \fIE X I T\fR or \fIM Y F U N C\fR
.sp
<clear> \fIclear register\fR <single-digit> #
 clears one of the 10 registers
.sp
<erase> \fIerase\fR # undoes the effect of the last operation
.sp
<answer> \fIanswer register\fR <single-digit> #
 reads the contents of a register
.sp
<display-last>
<display> these provide "repeat" facilities
<repeat>
.sp
<cancel> aborts the current utterance
\l'\n(x0u\(ul'
.in 0
.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
.FG "Table 10.2 Syntax for a telephone calculator"
A calculation is a sequence of operations followed by an EXIT function call.
There are twelve different operations, one for each button on the keypad.
Actually, two of them \(em
.ul
cancel
and
.ul
function
\(em share the same key so that "#" can be reserved for use as a
separator; but the context ensures that they cannot be confused by the system.
.pp
Six of the operations give control over the dialogue.
There are three different "repeat" commands; a command (called
.ul
erase\c
)
which undoes the effect of the last operation;
one which reads out the value of a register;
and one which aborts the current utterance.
Four more commands provide the basic arithmetic operations of add,
subtract, multiply, and divide.
The operands of these may be keyed literal numbers, or register values,
or function calls.
A further command clears a register.
.pp
It is through functions that the extensibility of the language is achieved.
A function has a name (like SIN, EXIT, MYFUNC) which is keyed with an
appropriate single-key-per-character sequence (namely 746, 3948, 693862
respectively).
One function, DEFINE, allows new ones to be entered.
Another, LOOP, repeats sequences of operations.
TEST incorporates arithmetic testing.
The details of these are not important: what is interesting is the evident
power of the calculator.
.pp
For example, the keying sequence
.LB
.NI
5 # 1 1 2 3 # 2 1 . 2 # 9 # 6 # 2 1 . 4 #
.LE
would be decoded as
.LB
.NI
.ul
clear\c
 + 123 \- 1.2 \c
.ul
display erase\c
 \- 1.4.
.LE
One of the difficulties with such a tight syntax is that almost any sequence
will be interpreted as a valid calculation \(em syntax errors are nearly
impossible.
Thus a small mistake by the user can have a catastrophic effect on the
calculation.
Here, however, speech output gives an advantage over conventional
character-by-character echoing
on visual displays.
It is quite adequate to echo syntactic units as they are decoded, instead
of echoing keys as they are entered.
It was suggested earlier in this chapter that confirmation of entry
should be generated in the same way that the user would be likely to
verbalize it himself.
Thus the synthetic voice could respond to the above keying sequence as
shown in the second line, except that the
.ul
display
command would also state the result
(and possibly summarize the calculation so far).
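.pp
Generating such a confirmation is mostly a matter of turning a keyed digit
string into the words a person would say.
A sketch (Python, purely illustrative, and limited to whole numbers below a
million, spoken in the style used throughout this book):
.LB
.nf
UNITS = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
         'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
         'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty',
        'sixty', 'seventy', 'eighty', 'ninety']

def say(n):
    """Spell out 0..999999 as spoken, e.g. say(123) == 'one hundred and twenty-three'."""
    if n < 20:
        return UNITS[n]
    if n < 100:
        tens, units = divmod(n, 10)
        return TENS[tens] + ('-' + UNITS[units] if units else '')
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return UNITS[hundreds] + ' hundred' + (' and ' + say(rest) if rest else '')
    thousands, rest = divmod(n, 1000)
    if not rest:
        return say(thousands) + ' thousand'
    joiner = ' and ' if rest < 100 else ' '
    return say(thousands) + ' thousand' + joiner + say(rest)
.fi
.LE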
14314Numbers could be verbalized as "one hundred and twenty-three" 14315instead of as "one ... two ... three". 14316(Note, however, that this will make it necessary to await the "#" terminator 14317after numbers and function names before they can be echoed.) 14318.sh "10.4 References" 14319.LB "nnnn" 14320.[ 14321$LIST$ 14322.] 14323.LE "nnnn" 14324.sh "10.5 Further reading" 14325.pp 14326There are no books which relate techniques of man-computer dialogue 14327to speech interaction. 14328The best I can do is to guide you to some of the standard works on 14329interactive techniques. 14330.LB "nn" 14331.\"Gilb-1977-1 14332.]- 14333.ds [A Gilb, T. 14334.as [A " and Weinberg, G.M. 14335.ds [D 1977 14336.ds [T Humanized input 14337.ds [I Winthrop 14338.ds [C Cambridge, Massachusetts 14339.nr [T 0 14340.nr [A 1 14341.nr [O 0 14342.][ 2 book 14343.in+2n 14344This book is subtitled "techniques for reliable keyed input", 14345and considers most aspects of the problem of data entry by 14346professional key operators. 14347.in-2n 14348.\"Martin-1973-2 14349.]- 14350.ds [A Martin, J. 14351.ds [D 1973 14352.ds [T Design of man-computer dialogues 14353.ds [I Prentice-Hall 14354.ds [C Englewood Cliffs, New Jersey 14355.nr [T 0 14356.nr [A 1 14357.nr [O 0 14358.][ 2 book 14359.in+2n 14360Martin concerns himself with all aspects of man-computer dialogue, 14361and the book even contains a short chapter on the use of 14362voice response systems. 14363.in-2n 14364.\"Smith-1980-3 14365.]- 14366.ds [A Smith, H.T. 14367.as [A " and Green, T.R.G.(Editors) 14368.ds [D 1980 14369.ds [T Human interaction with computers 14370.ds [I Academic Press 14371.ds [C London 14372.nr [T 0 14373.nr [A 0 14374.nr [O 0 14375.][ 2 book 14376.in+2n 14377A recent collection of contributions on man-computer systems and programming 14378research. 14379.in-2n 14380.LE "nn" 14381.EQ 14382delim $$ 14383.EN 14384.CH "11 COMMERCIAL SPEECH OUTPUT DEVICES" 14385.ds RT "Commercial speech output devices 14386.ds CX "Principles of computer speech 14387.pp 14388This chapter takes a look at four speech output peripherals that are 14389available today. 14390It is risky in a book of this nature to descend so close to the technology 14391as to discuss particular examples of commercial products, 14392for such information becomes dated very quickly. 14393Nevertheless, having covered the principles of various types of speech 14394synthesizer, and the methods of driving them from widely differing utterance 14395representations, it seems worthwhile to see how these principles are 14396embodied in a few products actually on the market. 14397.pp 14398Developments in electronic speech devices are moving so fast that it is 14399hard to keep up with them, and the newest technology today will undoubtedly 14400be superseded next year. 14401Hence I have not tried to choose examples from the very latest technology. 14402Instead, this chapter discusses synthesizers which exemplify rather different 14403principles and architectures, in order to give an idea of the range of options 14404which face the system designer. 14405.pp 14406Three of the devices are landmarks in the commercial adoption of speech 14407technology, and have stood the test of time. 14408Votrax was introduced in the early 1970's, and has been re-implemented 14409several times since in an attempt to cover different market sectors. 14410The Computalker appeared in 1976. 14411It was aimed primarily at the burgeoning computer hobbies market. 
14412One of its most far-reaching effects was to stimulate the interest of 14413hobbyists, always eager for new low-cost peripherals, in speech synthesis; 14414and so provide a useful new source of experimentation and expertise 14415which will undoubtedly help this heretofore rather esoteric discipline to 14416mature. 14417Computalker is certainly the longest-lived and probably still the most 14418popular hobbyist's speech synthesizer. 14419The Texas Instruments speech synthesis chip brought speech output technology to the 14420consumer. 14421It was the first single-chip speech synthesizer, and is still the biggest 14422seller. 14423It forms the heart of the "Speak 'n Spell" talking toy which appeared in 14424toyshops in the summer of 1978. 14425Although talking calculators had existed several years before, they were 14426exotic gadgets rather than household toys. 14427.sh "11.1 Formant synthesizer" 14428.pp 14429The Computalker is a straightforward implementation of a serial formant 14430synthesizer. 14431A block diagram of it is shown in Figure 11.1. 14432.FC "Figure 11.1" 14433In the centre is the main vocal tract path, with three formant filters 14434whose resonant frequencies can be controlled individually. 14435A separate nasal branch in parallel with the oral one is provided, 14436with a nasal formant of fixed frequency. 14437It is less important to allow for variation of the nasal formant 14438frequency than it is for the oral ones, because the size and 14439shape of the nasal tract is relatively fixed. 14440However, it is essential to control the nasal amplitude, in particular to turn 14441it off during non-nasal sounds. 14442Computalker provides independent oral and nasal amplitude parameters. 14443.pp 14444Unvoiced excitation can be passed through the main vocal tract 14445through the aspiration amplitude control AH. 14446In practice, the voicing amplitudes AV and AN will probably always be zero when AH 14447is non-zero, for physiological constraints prohibit simultaneous voicing 14448and aspiration. 14449A second unvoiced excitation path passes through a fricative formant filter 14450whose resonant frequency can be varied, and has its amplitude independently 14451controlled by AF. 14452.rh "Control parameters." 14453Table 11.1 summarizes the nine parameters which drive Computalker. 
.RF
.nr x0 \w'address0'+\w'fundamental frequency of voicing00'+\w'0 bits0'+\w'logarithmic00'+\w'0000\-00000 Hz'
.nr x1 (\n(.l-\n(x0)/2
.in \n(x1u
.ta \w'000'u \w'address0'u +\w'fundamental frequency of voicing00'u +\w'0 bits0'u +\w'logarithmic00'u
address meaning width \0\0\0range
\l'\n(x0u\(ul'
.sp
\00 AV amplitude of voicing 8 bits
\01 AN nasal amplitude 8 bits
\02 AH amplitude of aspiration 8 bits
\03 AF amplitude of frication 8 bits
\04 FV fundamental frequency of voicing 8 bits logarithmic \0\075\-\0\0470 Hz
\05 F1 formant 1 resonant frequency 8 bits logarithmic \0170\-\01450 Hz
\06 F2 formant 2 resonant frequency 8 bits logarithmic \0520\-\04400 Hz
\07 F3 formant 3 resonant frequency 8 bits logarithmic 1700\-\05500 Hz
\08 FF fricative resonant frequency 8 bits logarithmic 1700\-14000 Hz
\09 not used
10 not used
11 not used
12 not used
13 not used
14 not used
15 SW audio on-off switch 1 bit
\l'\n(x0u\(ul'
.in 0
.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
.FG "Table 11.1 Computalker control parameters"
Four of them control amplitudes, while the others control frequencies.
In the latter case the parameter value is logarithmically related to
the actual frequency of the excitation (FV) or resonance (F1, F2, F3, FF).
The ranges over which each frequency can be controlled are shown in the Table.
An independent calibration of one particular Computalker has shown that
the logarithmic specifications are met remarkably well.
.pp
Each parameter is specified to Computalker as an 8-bit number.
Parameters are addressed by a 4-bit code, and so a total of 12 bits
is transferred in parallel to Computalker from the computer
for each parameter update.
Parameters 9 to 14 are unassigned ("reserved for future expansion" is
the official phrase), and the last parameter, SW, governs the position of
an audio on-off switch.
.pp
Computalker does not contain a clock that is accessible to the user,
and so the timing of parameter updates is entirely up to the host computer.
Typically, a 10\ msec interval between frames is used,
with interrupts generated by a separate timer.
In fact the frame interval can be anywhere between 2\ msec and 50\ msec,
and can be changed to alter the rate of speaking.
However, it is rather naive to view fast speech as slow
speech speeded up by a linear time compression, for in human
speech production the rhythm changes and elisions occur in a rather
more subtle way.
Thus it is not particularly useful to be able to alter the frame rate.
.pp
At each interrupt, the host computer transfers values for all of the nine
parameters to Computalker, a total of 108 data bits.
In theory, perhaps, it is only necessary to transmit those parameters
whose values have changed; but in practice all of them should be updated
regardless.
This is because the parameters are stored for the duration of the frame
in analogue sample-and-hold devices. Essentially, the parameter value
is represented as the charge on a capacitor.
In time \(em and it takes only a short time \(em the values drift.
Although the drift over 10\ msec is insignificant, it becomes very
noticeable over longer time periods.
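.pp
In outline, the servicing loop this implies is as follows (a sketch only,
in Python; write_parameter stands for whatever 12-bit transfer the host
interface provides, and in practice a timer interrupt would replace the sleep).
.LB
.nf
import time

FRAME = 0.010           # a typical 10 msec frame interval
ADDRESSES = range(9)    # 0-8: AV, AN, AH, AF, FV, F1, F2, F3, FF (Table 11.1)

def service(frames, write_parameter):
    """Send every parameter every frame, changed or not, so that the
    analogue sample-and-hold stores are refreshed before they drift."""
    for frame in frames:               # each frame: nine 8-bit values
        start = time.monotonic()
        for address in ADDRESSES:
            write_parameter(address, frame[address])   # 4-bit address, 8-bit value
        time.sleep(max(0.0, FRAME - (time.monotonic() - start)))
.fi
.LE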
14520If parameters are not updated at all, the result is a 14521"whooosh" sound up to maximum amplitude, in a period of a second or two. 14522Hence it is essential that Computalker be serviced by the computer regularly, 14523to update all its parameters. 14524The audio on-off switch is provided so that the computer can turn off 14525the sound directly if another program, which does not use the device, 14526is to be run. 14527.rh "Filter implementation." 14528It is hard to get definite information on the implementation 14529of Computalker. 14530Because it is a commercial device, circuit diagrams are not published. 14531It is certainly an analogue rather than a digital implementation. 14532The designer suggests that a configuration like that of Figure 11.2 is used 14533for the formant filters (Rice, 1976). 14534.[ 14535Rice 1976 Byte 14536.] 14537.FC "Figure 11.2" 14538Control is obtained over the resonant frequency by varying the resistance 14539at the bottom in sympathy with the parameter value. 14540The middle two operational amplifiers can be modelled by a resistance 14541$-R/k$ in the forward path, where k is the digital control value. 14542This gives the circuit in Figure 11.3, which can be analysed to obtain 14543the transfer function 14544.LB 14545.EQ 14546- ~ k over {R~R sub 1 C sub 2 C sub 3} ~ . ~ {R sub 2 C sub 2 ~s ~+~1} over 14547{ s sup 2 ~+~~ 14548( 1 over {R sub 3 C sub 3} ~+~ {k R sub 2} over {R~R sub 1 C sub 3})~s ~~+~ 14549k over {R~R sub 1 C sub 2 C sub 3}} ~ . 14550.EN 14551.LE 14552.FC "Figure 11.3" 14553.pp 14554This expression has a DC gain of \-1, and the denominator is similar to those 14555of the analogue formant resonators discussed in Chapter 5. 14556However, unlike them the transfer function has a numerator which creates 14557a zero at 14558.LB 14559.EQ 14560s~~=~~-~ 1 over {R sub 2 C sub 2} ~ . 14561.EN 14562.LE 14563If $R sub 2 C sub 2$ is sufficiently small, this zero will have 14564negligible effect at audio frequencies, and the filter has 14565the following parameters: 14566.LB 14567centre frequency: $~ mark 145681 over {2 pi}~~( k over {R~R sub 1 C sub 2 C sub 3} ~ ) sup 1/2$ Hz 14569.sp 14570bandwidth:$lineup 145711 over {2 pi}~~( 1 over {R sub 3 C sub 3}~+~ 14572{k R sub 2} over {R~R sub 1 C sub 3} ~ )$ Hz. 14573.LE 14574.pp 14575Note first that the centre frequency is proportional to the square root of 14576the control value $k$. 14577Hence a non-linear transformation must be implemented on the control 14578signal, after D/A conversion, to achieve the required logarithmic relationship 14579between parameter value and resonant frequency. 14580The formant bandwidth is not constant, as it should be (see Chapter 5), 14581but depends upon the control value $k$. 14582This dependency can be minimized by selecting component values such that 14583.LB 14584.EQ 14585{k R sub 2} over {R~R sub 1 C sub 3}~~<<~~1 over {R sub 3 C sub 3} 14586.EN 14587.LE 14588for the largest value of $k$ which can occur. 14589Then the bandwidth is solely determined by the time constant $R sub 3 C sub 3$. 14590.pp 14591The existence of the zero can be exploited for the fricative resonance. 14592This should have zero DC gain, and so the component values for the fricative 14593filter should make the time-constant $R sub 2 C sub 2$ large enough to place 14594the zero sufficiently near the frequency origin. 14595.rh "Market orientation." 14596As mentioned above, Computalker is designed for the computer hobbies market. 14597Figure 11.4 shows a photograph of the device. 
14598.FC "Figure 11.4" 14599It plugs into the S\-100 bus which has been a 14600.ul 14601de facto 14602standard for hobbyists for several years, and has recently been adopted 14603as a standard by the Institute of Electrical and Electronic Engineers. 14604This makes it immediately accessible to many microcomputer systems. 14605.pp 14606An inexpensive synthesis-by-rule program, which runs on 14607the popular 8080 microprocessor, is available to drive Computalker. 14608The input is coded in a machine-readable version of the standard phonetic 14609alphabet, similar to that which was introduced in Chapter 2 (Table 2.1). 14610Stress digits may appear in the transcription, and the program caters for 14611five levels of stress. 14612The punctuation mark at the end of an utterance has some effect on pitch. 14613The program is perhaps remarkable in that it occupies only 6\ Kbyte of storage 14614(including phoneme tables), and runs on an 8-bit microprocessor 14615(but not in real time). 14616It is, however, 14617.ul 14618un\c 14619remarkable in that it produces rather poor speech. 14620According to a demonstration cassette, 14621"most people find the speech to be readily intelligible, 14622especially after a little practice listening to it," 14623but this seems extremely optimistic. 14624It also cunningly insinuates that if you don't understand it, you yourself 14625may share the blame with the synthesizer \(em after all, 14626.ul 14627most 14628people do! 14629Nevertheless, Computalker has made synthetic speech accessible to a large 14630number of home computer users. 14631.sh "11.2 Sound-segment synthesizer" 14632.pp 14633Votrax was the first fully commercial speech synthesizer, and at the time of 14634writing is still the only off-the-shelf speech output 14635peripheral (as distinct from reading machine) which is aimed 14636specifically at synthesis-by-rule rather than storage of parameter tracks 14637extracted from natural utterances. 14638Figure 11.5 shows a photograph of the Votrax ML-I. 14639.FC "Figure 11.5" 14640.pp 14641Votrax accepts as input a string of codes representing sound segments, 14642each with additional bits to control the duration and pitch of the segment. 14643In the earlier versions (eg model VS-6) there are 63 sound segments, specified 14644by a 6-bit code, and two further bits accompany each segment to provide a 146454-level control over pitch. 14646Four pitch levels are quite inadequate to generate acceptable intonation 14647contours for anything but isolated words spoken in citation form. 14648However, a later model (ML-I) uses an 8-level pitch specification, 14649as well as a 4-level duration qualifier, 14650associated with each sound segment. 14651It provides a vocabulary of 80 sound segments, together with an additional 14652code which allows local amplitude modifications and extra duration alterations 14653to following segments. 14654A further, low-cost model (VS-K) is now available which plugs in to the S\-100 14655bus, and 14656is aimed primarily at 14657computer hobbyists. 14658It provides no pitch control at all and is therefore 14659quite unsuited to serious voice response applications. 14660The device has recently been packaged as an LSI circuit (model SC\-01), 14661using analogue switched-capacitor filter technology. 14662.pp 14663One point where the ML-I scores favourably over other speech synthesis 14664peripherals is the remarkably convenient engineering of its 14665computer interface, which was outlined in the previous chapter. 
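.pp
Driving that interface from a program amounts to bracketing the sound-segment
text with the control characters as it is written to the terminal stream.
The sketch below is illustrative only (Python); in particular the two control
sequences are placeholders, not the actual ML-I codes.
.LB
.nf
import sys

# Placeholders only: the real intercept and terminator codes are those
# defined by the ML-I interface, and are not reproduced here.
SPEAK_ON = chr(1) + chr(2)   # hypothetical two-character "intercept" sequence
SPEAK_OFF = chr(3)           # hypothetical terminating control character

def line(printed, spoken, out=sys.stdout):
    """Write one line that the VDU displays as `printed` and the
    synthesizer speaks as the sound-segment string `spoken`."""
    print(printed + SPEAK_ON + spoken + SPEAK_OFF, file=out)
.fi
.LE
A file whose lines are built this way appears simultaneously in spoken and
typed English whenever it is listed, as described in the previous chapter.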
14666.pp 14667The internal workings of Votrax are not divulged by the manufacturer. 14668Figure 11.6 shows a block diagram at the level of detail that they supply. 14669.FC "Figure 11.6" 14670It seems to be essentially a formant synthesizer with analogue function 14671generators and parameter smoothing circuits that provide transitions between 14672sound segments. 14673.rh "Sound segments." 14674The 80 segments of the high-range ML-I model 14675are summarized in Table 11.2. 14676.FC "Table 11.2" 14677They are divided into phoneme classes according to the 14678classification discussed in Chapter 2. 14679The segments break down into the following categories. 14680(Numbers in parentheses are the corresponding figures for VS-6.) 14681.LB "00 (00) " 14682.NI "00 (00) " 1468311 (11) vowel sounds which are representative of the phonological 14684vowel classes for English 14685.NI "00 (00) " 14686\09 \0(7) vowel allophones, with slightly different sound qualities from the 14687above 14688.NI "00 (00) " 1468920 (15) segments whose sound qualities are identical to the segments above, but with 14690different durations 14691.NI "00 (00) " 1469222 (22) consonant sounds which are representative of the phonological 14693consonant classes for English 14694.NI "00 (00) " 1469511 \0(6) consonant allophones 14696.NI "00 (00) " 14697\04 \0(0) segments to be used in conjunction with unvoiced plosives to increase 14698their aspiration 14699.NI "00 (00) " 14700\02 \0(2) silent segments, with different pause durations 14701.NI "00 (00) " 14702\01 \0(0) very short silent segment (about 5\ msec). 14703.LE "00 (00) " 14704Somewhat under half of the 80 elements 14705can be put into one-to-one correspondence with the phonemes of English; 14706the rest are either allophonic variations or additional sounds which can 14707sensibly be combined with certain phonemes in certain contexts. 14708The Votrax literature, and consequently Votrax users, persists in calling 14709all elements "phonemes", and this can cause considerable confusion. 14710I prefer to use the term "sound segment" instead, reserving "phoneme" for its 14711proper linguistic use. 14712.pp 14713The rules which Votrax uses for transitions between sound segments are not 14714made public by the manufacturer, and are embedded in encapsulated circuits 14715in the hardware. 14716They are clearly very crude. 14717The key to successful encoding of utterances is to use the many 14718non-phonemic segments in an appropriate way as transitions between the main 14719segments which represent phonetic classes. This is a tricky process, and 14720I have heard of one commercial establishment giving up in despair at the 14721extreme difficulty of generating the utterances it wanted. 14722It probably explains the proliferation of letter-to-sound rules for 14723Votrax which have been developed in research laboratories 14724(Colby 14725.ul 14726et al, 147271978; Elovitz 14728.ul 14729et al, 147301976; McIlroy, 1974; Sherwood, 1978). 14731.[ 14732Colby Christinaz Graham 1978 14733.] 14734.[ 14735Elovitz 1976 IEEE Trans Acoustics Speech and Signal Processing 14736.] 14737.[ 14738McIlroy 1974 14739.] 14740.[ 14741Sherwood 1978 14742.] 14743Nevertheless, with luck, skill, and especially persistence, 14744excellent results can be 14745obtained. The ML-I manual (Votrax, 1976) contains a list of about 625 words and short phrases, 14746and they are usually clearly recognizable. 14747.[ 14748Votrax 1976 14749.] 14750.rh "Duration and pitch qualifiers." 
14751Each sound segment has a different duration. 14752Table 11.2 shows the measured duration of the segments, although no 14753calibration data is given by Votrax. 14754As mentioned earlier, a 2-bit number accompanies each segment to modify 14755its duration, and 14756this was set to 3 (least duration) for the measurements. 14757The qualifier has a multiplicative effect, shown in Table 11.3. 14758.RF 14759.nr x1 (\w'rate qualifier'/2) 14760.nr x2 (\w'in Table 11.2 by'/2) 14761.nr x0 \n(x1+2i+\w'00'+\n(x2 14762.nr x3 (\n(.l-\n(x0)/2 14763.in \n(x3u 14764.ta \n(x1u +2i 14765\l'\n(x0u\(ul' 14766.sp 14767.nr x2 (\w'multiply duration'/2) 14768rate qualifier \0\0\h'-\n(x2u'multiply duration 14769.nr x2 (\w'in Table 11.2 by'/2) 14770 \0\0\h'-\n(x2u'in Table 11.2 by 14771\l'\n(x0u\(ul' 14772.sp 14773 3 1.00 14774 2 1.11 14775 1 1.22 14776 0 1.35 14777\l'\n(x0u\(ul' 14778.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 14779.in 0 14780.FG "Table 11.3 Effect of the 2-bit per-segment rate qualifier" 14781.pp 14782As well as the 2-bit rate qualifier, each sound segment is accompanied by 14783a 3-bit pitch specification. This provides a linear control over fundamental 14784frequency, and Table 11.4 shows the measured values. 14785.RF 14786.nr x1 (\w'pitch specifier'/2) 14787.nr x2 (\w'pitch (Hz)'/2) 14788.nr x0 \n(x1+1.5i+\n(x2 14789.nr x3 (\n(.l-\n(x0)/2 14790.in \n(x3u 14791.ta \n(x1u +1.5i 14792\l'\n(x0u\(ul' 14793.sp 14794pitch specifier \h'-\n(x2u'pitch (Hz) 14795\l'\n(x0u\(ul' 14796.sp 14797 0 \057.5 14798 1 \064.1 14799 2 \069.4 14800 3 \075.8 14801 4 \080.6 14802 5 \087.7 14803 6 \094.3 14804 7 100.0 14805\l'\n(x0u\(ul' 14806.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 14807.in 0 14808.FG "Table 11.4 Effect of the 3-bit per-segment pitch specifier" 14809The quantization interval varies from 14810one to two semitones. 14811Votrax interpolates pitch from phoneme to phoneme in a highly satisfactory 14812manner, and this permits surprisingly sophisticated intonation contours 14813to be generated considering the crude 8-level quantization. 14814.pp 14815The notation in which the Votrax manual defines utterances 14816gives duration qualifiers and pitch specifications as digits 14817preceding the sound segment, and separated from it by a slash (/). 14818Thus, for example, 14819.LB 1482014/THV 14821.LE 14822defines the sound segment THV with duration qualifier 1 (multiplies the 1482370\ msec duration of Table 11.2 by 1.22 \(em from Table 11.3 \(em to give 85\ msec) 14824and pitch specification 4 (81 Hz). 14825This representation of a segment is transformed into two ASCII characters before transmission 14826to the synthesizer. 14827.rh "Converting a phonetic transcription to sound segments." 14828It would be useful to have a computer procedure to produce a specification for 14829an utterance in terms of Votrax sound segments from a standard phonetic 14830transcription. 14831This could remove much of the tedium from utterance preparation 14832by incorporating the contextual rules given in the Votrax manual. 14833Starting with a phonetic transcription, each phoneme should be converted 14834to its default Votrax representative. 14835The resulting "wide" Votrax transcription must be 14836transformed into a "narrow" one by application of contextual rules. 
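.pp
The first, "wide", step is no more than a table lookup.
A sketch in Python, showing only a few of the default correspondences
(they are taken from Table 11.2), might be:
.LB
.nf
# Default phoneme-to-segment correspondences (incomplete; see Table 11.2)
DEFAULT_SEGMENT = {
    "i": "I",   "e": "EH",  "aa": "AE", "o": "AW",  "u": "OO",  "a": "UH",
    "ar": "AH1", "aw": "O", "uu": "U",  "er": "ER", "ee": "E",
    "dh": "THV", "s": "S",  "z": "Z",   "d": "D",   "zh": "ZH",
    "k": "K",   "h": "H",
}

def wide_transcription(phonemes):
    # stress markers such as ".syll" are passed through unchanged
    return [DEFAULT_SEGMENT.get(p, p) for p in phonemes]
.fi
.LE
The second step is harder, because the contextual rules must examine
neighbouring segments.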
14837Separate rules are needed for 14838.LB 14839.NP 14840vowel clusters (diphthongs) 14841.NP 14842vowel transitions (ie consonant-vowel and vowel-consonant, 14843where the vowel segment is altered) 14844.NP 14845intervocalic consonants 14846.NP 14847consonant transitions (ie consonant-vowel and vowel-consonant, 14848where the consonant segment is altered) 14849.NP 14850consonant clusters 14851.NP 14852stressed-syllable effects 14853.NP 14854utterance-final effects. 14855.LE 14856Stressed-syllable effects (which include 14857extra aspiration for unvoiced stops beginning stressed syllables) 14858can be applied only if stress markers are included in the phonetic 14859transcription. 14860.pp 14861To specify a rule, it is necessary to give a 14862.ul 14863matching part 14864and a 14865.ul 14866context, 14867which define at what points in an utterance it is applicable, and a 14868.ul 14869replacement part 14870which is used to replace the matching part. 14871The context can be specified in mathematical set notation using curly brackets. 14872For example, 14873.LB 14874{G SH W K} OO IU OO 14875.LE 14876states that the matching part OO is replaced by IU OO, after a G, SH, W, or K. 14877In fact, allophonic variations of each sound segment 14878should also be accepted as valid context, so this rule will also replace OO 14879after .G, CH, .W, .K, or .X1 (Table 11.2 gives allophones of each segment). 14880.pp 14881Table 11.5 gives some rules that have been used for this purpose. 14882.FC "Table 11.5" 14883They were derived from careful study of the hints given in the 14884ML-I manual (Votrax, 1976). 14885.[ 14886Votrax 1976 14887.] 14888Classes such as "voiced" and "stop-consonant" in the context specify sets 14889of sound segments in the obvious way. 14890The beginning of a stressed syllable is marked in the input by ".syll". 14891Parentheses in the replacement part have a significance which is explained in 14892the next section. 14893.rh "Handling prosodic features." 14894We know from Chapter 8 the vital importance of prosodic features 14895in synthesizing lifelike speech. 14896To allow them to be assigned to Votrax utterances, an intermediate 14897output from a prosodic analysis program like ISP can be used. 14898For example, 14899.LB 149001 \c 14901.ul 14902dh i s i z /*d zh aa k s /h aa u s; 14903.LE 14904which specifies "this is Jack's house" in a declarative intonation with 14905emphasis on the "Jack's", can be intercepted in the following form: 14906.LB 14907\&.syll 14908.ul 14909dh\c 14910\ 50\ (0\ 110) 14911.ul 14912i\c 14913\ 60 14914.ul 14915s\c 14916\ 90\ (0\ 99) 14917.ul 14918i\c 14919\ 60 14920.ul 14921z\c 14922\ 60\ (50\ 110) 14923\&.syll 14924.ul 14925d\c 14926\ 50\ (0\ 110) 14927.ul 14928zh\c 14929\ 50 14930.ul 14931aa\c 14932\ 90 14933.ul 14934k\c 14935\ 120\ (10\ 90) 14936.ul 14937s\c 14938\ 90 14939\&.syll 14940.ul 14941h\c 14942\ 60 14943.ul 14944aa\c 14945\ 140 14946.ul 14947u\c 14948\ 60 14949.ul 14950s\c 14951\ 140 14952^\ 50\ (40\ 70) . 14953.LE 14954Syllable boundaries, pitches, and durations have been assigned by the 14955procedures given earlier (Chapter 8). 14956A number always follows each phoneme to specify its duration 14957(in msec). 14958Pairs of numbers in parentheses define a pitch specification at some 14959point during the preceding phoneme: the first number of the pair defines 14960the time offset of the specification from the beginning 14961of the phoneme, while the second gives the pitch itself (in Hz). 
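.pp
Reading this intermediate form is straightforward.
The sketch below (in Python; the procedure and variable names are invented
here for illustration, and the typeset escape codes are assumed to have been
reduced to plain text) collects, for each phoneme, its duration and any
pitch specifications:
.LB
.nf
def parse_intermediate(spec):
    # spec is a string such as ".syll dh 50 (0 110) i 60 s 90 (0 99) ..."
    parsed, current, pending = [], None, None
    for tok in spec.replace("(", " ( ").replace(")", " ) ").split():
        if tok == ".syll":
            parsed.append(".syll")          # syllable boundary marker
        elif tok == "(":
            pending = []                    # start of an (offset, pitch) pair
        elif tok == ")":
            current["pitch"].append(tuple(pending))
            pending = None
        elif tok.isdigit():
            if pending is not None:
                pending.append(int(tok))    # offset or pitch value
            else:
                current["duration"] = int(tok)   # duration in msec
        else:                               # a phoneme or boundary symbol
            current = {"phoneme": tok, "duration": None, "pitch": []}
            parsed.append(current)
    return parsed
.fi
.LE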
14962This form of utterance specification can then be passed to a Votrax 14963conversion procedure. 14964.pp 14965The phonetic transcription is converted 14966to Votrax sound segments using the method described above. The "wide" Votrax 14967transcription is 14968.LB 14969\&.syll THV I S I Z .syll D ZH AE K S .syll H AE OO S PA0 ; 14970.LE 14971which is transformed to the following "narrow" one according to the rules 14972of Table 11.5: 14973.LB 14974\&.syll THV I S I Z .syll D J (AE EH3) K S .syll H1 (AH1 .UH2) (O U) 14975S PA0 . 14976.LE 14977The duration and pitch specifications are preserved by the transformation 14978in their original positions in the string, although they are not shown above. 14979The next stage uses them to expand the transcription by adjusting 14980the segments to have durations as close as possible to the specifications, and 14981computing pitch numbers to be associated with each phoneme. 14982.pp 14983Correct duration-expansion can, in general, require a great amount of 14984computation. 14985Associated with each sound segment is a set of elements with the same sound quality 14986but different durations, formed by attaching each of the four duration 14987qualifiers of Table 11.3 to the segment and any others which are 14988sound-equivalents to it. For example, the segment Z has the duration-set 14989.LB 14990{3/Z 2/Z 1/Z 0/Z} 14991.LE 14992with durations 14993.LB 14994{ 70 78 85 95} 14995.LE 14996msec respectively, where the initial numerals denote the duration qualifier. 14997The segment I has the much larger duration-set 14998.LB 14999{3/I2 2/I2 1/I2 0/I2 3/I1 2/I1 1/I1 0/I1 3/I 2/I 1/I 0/I} 15000.LE 15001with durations 15002.LB 15003{ 58 64 71 78 83 92 101 112 118 131 144 159}, 15004.LE 15005because segments I1 and I2 are sound-equivalents to it. 15006Duration assignment is a matter of selecting elements from the 15007duration-set whose total duration is as close as possible to that desired 15008for the segment. 15009It happens that Votrax deals sensibly with concatenations of more than one 15010identical plosive, suppressing the stop burst on all but the last. 15011Although the general problem of approximating durations in 15012this way is computationally demanding, a simple recursive exhaustive search 15013works in a reasonable amount of time because the desired duration is usually 15014not very much greater than the longest member of the duration-set, and so 15015the search terminates quite quickly. 15016.pp 15017At this point, the role of the parentheses which appear on the right-hand side 15018of Table 11.5 becomes apparent. Because durations are only associated with 15019the input phonemes, which may each be expanded into several Votrax 15020segments, it is necessary to keep track of the segments which have descended 15021from a single phoneme. 15022Target durations are simply spread equally across any parenthesized groups 15023to which they apply. 15024.pp 15025Having expanded durations, mapping pitches on to the sound segments is 15026a simple matter. The ISP system for formant synthesizers (Chapters 7 and 8) 15027uses linear interpolation between pitch specifications, and the frequency which 15028results for each sound segment needs to be converted to a Votrax specification 15029using the information in Table 11.4. 15030.pp 15031After applying these procedures to the example utterance, it becomes 15032.LB 1503314/THV 14/I1 03/S 14/I1 04/Z 04/D 04/J 33/AE 33/EH3 \c 1503402/K 02/K 02/S 02/H1 01/AH2 01/.UH2 31/O2 31/U1 01/S \c 1503510/S 30/PA0 30/PA0 . 
15036.LE 15037In several places, shorter sound-equivalents have been substituted 15038(I1 for I, AH2 for AH1, O2 for O, and U1 for U), while doubling-up also occurs 15039(in the K, S, and PA0 segments). 15040.pp 15041The speech which results from the use of these procedures with the 15042Votrax synthesizer sounds remarkably similar to that generated by the 15043ISP system which uses 15044parametrically-controlled synthesizers. Formal evaluation experiments have 15045not been undertaken, but it seems clear from careful listening that it would 15046be rather difficult, and probably pointless, to evaluate the Votrax conversion 15047algorithm, for the outcome would be completely dominated by the success of the 15048original pitch and rhythm assignment procedures. 15049.sh "11.3 Linear predictive synthesizer" 15050.pp 15051The first single-chip speech synthesizer was introduced by 15052Texas Instruments (TI) in the summer of 1978 (Wiggins and Brantingham, 1978). 15053.[ 15054Wiggins Brantingham 1978 15055.] 15056It was a remarkable development, combining recent advances in signal processing 15057with the very latest in VLSI technology. 15058Packaged in the Speak 'n Spell toy (Figure 11.7), it was a striking demonstration 15059of imagination and prowess in integrated electronics. 15060.FC "Figure 11.7" 15061It gave TI a long lead over its competitors and surprised many experts 15062in the speech field. 15063.EQ 15064delim @@ 15065.EN 15066Overnight, it seemed, digital speech technology had descended from 15067research laboratories with their expensive and specialized equipment into 15068a $50.00 consumer item. 15069.EQ 15070delim $$ 15071.EN 15072Naturally TI did not sell the chip separately but only as part of their 15073mass-market product; nor would they make available information on how to 15074drive it directly. 15075Only recently when other similar devices appeared on the market did they 15076unbundle the package and sell the chip. 15077.rh "The Speak 'n Spell toy." 15078The TI chip (TMC0280) uses the linear predictive method of synthesis, 15079primarily because of the ease of the speech analysis procedure and the known 15080high quality at low data rates. 15081Speech researchers, incidentally, sometimes scoff at what they perceive to be 15082the poor quality of the toy's speech; but considering the data rate 15083used (which averages 1200 bits per second of speech) it is remarkably good. 15084Anyway, I have never heard a child complain! \(em although it is not uncommon 15085to misunderstand a word. 15086Two 128\ Kbit read-only memories are used in the toy to hold data for about 15087330 words and phrases \(em lasting between 3 and 4 minutes \(em of speech. 15088At the time (mid-1978) these memories were the largest that were available 15089in the industry. 15090The data flow and user dialogue are handled by a microprocessor, 15091which is the fourth LSI circuit in the photograph of Figure 11.8. 15092.FC "Figure 11.8" 15093.pp 15094A schematic diagram of the toy is given in Figure 11.9. 15095.FC "Figure 11.9" 15096It has a small display which shows upper-case letters. 15097(Some teachers of spelling hold that the lack of lower case destroys 15098any educational value that the toy may have.) It 15099has a full 26-key alphanumeric keyboard with 14 additional control keys. 15100(This is the toy's Achilles' heel, for the keys fall out after extended use. 15101More recent toys from TI use an improved keyboard.) 
The 15102keyboard is laid out alphabetically instead of in QWERTY order; possibly 15103missing an opportunity to teach kids to type as well as spell. 15104An internal connector permits vocabulary expansion with up to 14 more 15105read-only memory chips. 15106Controlling the toy is a 4-bit microprocessor (a modified TMS1000). 15107However, the synthesizer chip does not receive data from the processor. 15108During speech, it accesses the memory directly and only returns control 15109to the processor when an end-of-phrase marker is found in the data stream. 15110Meanwhile the processor is idle, and cannot even be interrupted from the 15111keyboard. 15112Moreover, in one operational mode ("say-it") the toy embarks upon a long 15113monologue and remains deaf to the keyboard \(em it cannot even be turned off. 15114Any three-year-old will quickly discover that a sharp slap solves the problem! 15115A useful feature is that the device switches itself off if unused for more 15116than a few minutes. 15117A fascinating account of the development of the toy from the point of view 15118of product design and market assessment has been published 15119(Frantz and Wiggins, 1981). 15120.[ 15121Frantz Wiggins 1981 15122.] 15123.rh "Control parameters." 15124The lattice filtering method of linear predictive synthesis (see Chapter 6) 15125was selected because of its good stability properties and guaranteed 15126performance with small word sizes. 15127The lattice has 10 stages. 15128All the control parameters are represented as 10-bit fixed-point numbers, 15129and the lattice operates with an internal precision of 14 bits (including 15130sign). 15131.pp 15132There are twelve parameters for the device: ten reflection coefficients, 15133energy, and pitch. 15134These are updated every 20\ msec. 15135However, if 10-bit values were stored for each, a data rate of 120 bits 15136every 20\ msec, or 6\ Kbit/s, would be needed. 15137This would reduce the capacity of the two read-only memory chips to well 15138under a minute of speech \(em perhaps 65 words and phrases. 15139But one of the desirable properties of the reflection coefficients 15140which drive the lattice filter is that they are amenable to quantization. 15141A non-linear quantization scheme is used, with the parameter data addressing 15142an on-chip quantization table to yield a 10-bit coefficient. 15143.pp 15144Table 11.6 shows the number of bits devoted to each parameter. 
15145.RF 15146.in+0.3i 15147.ta \w'repeat flag00'u +1.3i +0.8i 15148.nr x0 \w'repeat flag00'+1.3i+\w'00'+(\w'size (10-bit words)'/2) 15149\l'\n(x0u\(ul' 15150.nr x1 (\w'bits'/2) 15151.nr x2 (\w'quantization table'/2) 15152.nr x3 0.2m 15153parameter \0\h'-\n(x1u'bits \0\0\h'-\n(x2u'quantization table 15154.nr x2 (\w'size (10-bit words)'/2) 15155 \0\0\h'-\n(x2u'size (10-bit words) 15156\l'\n(x0u\(ul' 15157.sp 15158energy \04 \016 \v'\n(x3u'_\v'-\n(x3u'\z4\v'\n(x3u'_\v'-\n(x3u' energy=0 means 4-bit frame 15159pitch \05 \032 15160repeat flag \01 \0\(em \z1\v'\n(x3u'_\v'-\n(x3u'\z0\v'\n(x3u'_\v'-\n(x3u' repeat flag =1 means 10-bit frame 15161k1 \05 \032 15162k2 \05 \032 15163k3 \04 \016 15164k4 \04 \016 \z2\v'\n(x3u'_\v'-\n(x3u'\z8\v'\n(x3u'_\v'-\n(x3u' pitch=0 (unvoiced) means 28-bit frame 15165k5 \04 \016 15166k6 \04 \016 15167k7 \04 \016 15168k8 \03 \0\08 15169k9 \03 \0\08 15170k10 \03 \0\08 \z4\v'\n(x3u'_\v'-\n(x3u'\z9\v'\n(x3u'_\v'-\n(x3u' otherwise 49-bit frame 15171 __ ___ 15172.sp 15173 49 bits 216 words 15174\l'\n(x0u\(ul' 15175.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 15176.in-0.3i 15177.FG "Table 11.6 Bit allocation for Speak 'n Spell chip" 15178There are 4 bits for energy, and 5 bits for pitch and the first two 15179reflection coefficients. 15180Thereafter the number of bits allocated to reflection coefficients decreases 15181steadily, for higher coefficients are less important for intelligibility 15182than lower ones. 15183(Note that using a 10-stage filter is tantamount to allocating 15184.ul 15185no 15186bits to coefficients higher than the tenth.) With a 151871-bit "repeat" flag, whose role is discussed shortly, the frame size 15188becomes 49 bits. 15189Updated every 20\ msec, this gives a data rate of just under 2.5\ Kbit/s. 15190.pp 15191The parameters are expanded into 10-bit numbers by a separate quantization 15192table for each one. 15193For example, the five pitch bits address a 32-word look-up table which 15194returns a 10-bit value. 15195The transformation is logarithmic in this case, the lowest pitch being 15196around 50 Hz and the highest 190 Hz. 15197As shown in Table 11.6, a total of 216 10-bit words suffices to hold all 15198twelve quantization tables; and they are implemented on the synthesizer 15199chip. 15200To provide further smoothing of the control parameters, 15201they are interpolated linearly from one frame to the next at eight points 15202within the frame. 15203.pp 15204The raw data rate of 2.5\ Kbit/s is reduced to an average of 1200\ bit/s 15205by further coding techniques. 15206Firstly, if the energy parameter is zero the frame is silent, 15207and no more parameters are transmitted (4-bit frame). 15208Secondly, if the "repeat" flag is 1 all reflection coefficients are held 15209over from the previous frame, giving a constant filter but with the ability 15210to vary amplitude and pitch (10-bit frame). 15211Finally, if the frame is unvoiced (signalled by the pitch value being zero) 15212only four reflection coefficients are transmitted, because the ear is 15213relatively insensitive to spectral detail in unvoiced speech (28-bit frame). 15214The end of the utterance is signalled by the energy bits all being 1. 15215.rh "Chip organization." 15216The configuration of the lattice filter is shown in Figure 11.10. 15217.FC "Figure 11.10" 15218The "two-multiplier" structure (Chapter 6) is used, so the 10-stage filter 15219requires 19 multiplications and 19 additions 15220per speech sample. 
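.pp
The per-sample recursion is simple enough to write down.
The sketch below (in Python, using floating-point arithmetic rather than the
chip's 10-bit coefficients and 14-bit internal precision) is the standard
two-multiplier form of Chapter 6, not TI's actual microcode:
.LB
.nf
def lattice_synthesize(excitation, k, gain=1.0):
    # k[0..9] are the reflection coefficients k1..k10; b[0..9] hold the
    # backward-path delay elements
    N = len(k)
    b = [0.0] * N
    speech = []
    for e in excitation:
        f = gain * e                        # forward value entering stage N
        for i in range(N, 0, -1):
            f = f - k[i-1] * b[i-1]         # forward path
            if i < N:                       # reverse path; the update feeding
                b[i] = b[i-1] + k[i-1] * f  # the unused top value is omitted
        b[0] = f                            # b0 is just the output, delayed
        speech.append(f)
    return speech
.fi
.LE
Counting the operations in the inner loop gives the 19 multiplications and 19
additions mentioned above; the initial gain multiplication is the extra,
twentieth, one.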
(The last operation in the reverse path at the bottom is not needed.) Since
a 10\ kHz sample rate is used, just 100\ $mu$sec are available for each
speech sample.
A single 5\ $mu$sec adder and a pipelined multiplier are implemented on
the chip, and multiplexed among the 19 operations.
The latter begins a new multiplication every 5\ $mu$sec, and finishes it
40\ $mu$sec later.
These times are within the capability of p-channel MOS technology,
allowing the chip to be produced at low cost.
The time slot for the 20'th, unnecessary, filter multiplication is used
for an overall gain adjustment.
.pp
The final analogue signal is produced by an 8-bit on-chip D/A converter
which drives a 200 milliwatt speaker through an impedance-matching
transformer.
These constitute the necessary analogue low-pass desampling filter.
.pp
Figure 11.11 summarizes the organization of the synthesis chip.
.FC "Figure 11.11"
Serial data enters directly from the read-only memories, although a control
signal from the processor begins synthesis and another signal is returned
to it upon termination.
The data is decoded into individual parameters, which are used to address
the quantization tables to generate the full 10-bit parameter
values.
These are interpolated from one frame to the next.
The lower part of the Figure shows the speech generation subsystem.
An excitation waveform for voiced speech is stored in read-only
memory and read out repeatedly at a rate determined by the pitch.
The source for unvoiced sounds is hard-limited noise provided by a digital
pseudo-random bit generator.
The sound source that is used depends on whether the pitch value is zero
or not: notice that this precludes mixed excitation for voiced fricatives
(and the sound is noticeably poor in words like "zee").
A gain multiplication is performed before the signal is passed through the
lattice synthesis filter, described earlier.
.sh "11.4 Programmable signal processors"
.pp
The TI chip has a fixed architecture, and is destined forever
to implement the same vocal tract model \(em a 10'th order lattice filter.
A more recent device, the Programmable Digital Signal Processor
(Caldwell, 1980) from Telesensory Systems, allows more flexibility
in the type of model.
.[
Caldwell 1980
.]
It can serve as a digital formant synthesizer or a linear predictive
synthesizer, and the order of the model (number of formants, in the former case)
can be changed.
.pp
Before describing the PDSP, it is worth looking at an earlier microprocessor
which was designed for digital signal processing.
Some industry observers have said that this processor, the Intel 2920,
is to the analogue design engineer what the first microprocessor was to
the random logic engineer way back in the mists of time (early 1970's).
.rh "The 'analogue microprocessor'."
The 2920 is a digital microprocessor.
However, it contains an on-chip D/A converter, which can be used in
successive approximation fashion for A/D conversion under program control,
and its architecture is designed to aid digital signal processing calculations.
Although the precision of conversion is 9 bits, internal arithmetic is
done with 25 bits to accommodate the accumulation of round-off errors in
arithmetic operations.
An on-chip programmable read-only memory holds a 192-instruction program,
which is executed in sequence with no program jumps allowed.
This ensures that each pass through the program takes the same time,
so that the analogue waveform is regularly sampled and processed.
.pp
The device is implemented in n-channel MOS technology, which makes it
slightly faster than the pMOS Speak 'n Spell chip.
At its fastest operating speed each instruction takes 400 nsec.
The 192-instruction program therefore executes in 76.8\ $mu$sec, corresponding
to a sampling rate of about 13\ kHz.
Thus the processor can handle signals with a bandwidth of 6.5\ kHz \(em ample
for high-quality speech.
However, a special EOP (end of program) instruction is provided which
causes an immediate jump back to the beginning.
Hence if the program occupies fewer than 192 instructions, faster sampling
rates can be used.
For example, a single second-order formant resonance
requires only 14 instructions and so can
be executed at over 150\ kHz.
.pp
Despite this speed, the 2920 is only marginally capable of synthesizing
speech.
Table 11.7 gives approximate numbers of instructions needed to do some
subtasks for speech generation (Hoff and Li, 1980).
.[
Hoff Li 1980 Software makes a big talker
.]
.RF
.nr x0 \w'parameter entry and data distribution0000'+\w'00000'
.nr x1 \w'instructions'
.nr x2 (\n(.l-\n(x0)/2
.in \n(x2u
.ta \w'parameter entry and data distribution0000'u
\l'\n(x0u\(ul'
.sp
task \0\0\0\0\0\h'-\n(x1u'instructions
\l'\n(x0u\(ul'
.sp
parameter entry and data distribution 35\-40
glottal pulse generation \0\0\0\08
noise generation \0\0\011
lattice section \0\0\020
formant filter \0\0\014
\l'\n(x0u\(ul'
.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i
.in 0
.FG "Table 11.7 2920 instruction counts for typical speech subsystems"
The parameter entry and data distribution procedure
collects 10 8-bit parameters from a serial input stream, at a frame rate of
100 frames/s.
The parameter data rate is 8\ Kbit/s, and the routine assumes that the
2920 performs each complete cycle in 125\ $mu$sec to generate sampled speech
at 8\ kHz.
Therefore one bit of parameter data is accepted on every cycle.
The glottal pulse program generates an asymmetrical triangular waveform
(Chapter 5), while the noise generator uses a 17-bit pseudo-random feedback
shift register.
About 30% of the 192-instruction program memory is consumed by these
essential tasks.
A two-multiplier lattice section takes 20 instructions,
and so only six sections can fit into the remaining program space.
It may be possible to use two 2920's to implement a complete 10 or 12'th
order lattice, but the results of the first stage must be passed to the
second by transmitting analogue or digital data between each of the
2920's analogue ports \(em not a terribly satisfactory method.
.pp
Since a formant filter occupies only 14 instructions, up to nine of them
would fit in the program space left after the above-mentioned essential
subsystems.
Although other necessary house-keeping tasks may reduce this number
substantially,
it does seem possible to implement a formant synthesizer on a single 2920.
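.pp
To see what one of the 14-instruction formant filters of Table 11.7 must
compute, here is the difference equation of a second-order digital resonator,
sketched in Python.
The coefficient formulas are the usual ones for specifying a resonance by its
centre frequency and bandwidth (compare Chapter 5); they are given for
illustration only and are not taken from the 2920 code.
.LB
.nf
from math import exp, cos, pi

def resonator(x, freq, bw, fs=8000.0):
    # y[n] = A*x[n] + B*y[n-1] + C*y[n-2], with the coefficients chosen
    # so that the filter resonates at freq Hz with bandwidth bw Hz
    T = 1.0 / fs
    C = -exp(-2.0 * pi * bw * T)
    B = 2.0 * exp(-pi * bw * T) * cos(2.0 * pi * freq * T)
    A = 1.0 - B - C                         # unity gain at zero frequency
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = A * s + B * y1 + C * y2
        y2, y1 = y1, y
        out.append(y)
    return out
.fi
.LE
One such section per formant, connected in cascade, forms the heart of a
simple series formant synthesizer.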
15356.rh "The Programmable Digital Signal Processor." 15357Whereas the 2920 is intended for general signal-processing jobs, 15358Telesensory Systems' PDSP 15359(Programmable Digital Signal Processor) is aimed specifically at speech 15360synthesis. 15361It comprises two separate chips, a control unit and an arithmetic unit. 15362To build a synthesizer these must be augmented with external memory 15363and a D/A converter, arranged in a configuration like that of Figure 11.12. 15364.FC "Figure 11.12" 15365.pp 15366The control unit accepts parameter data from a host computer, one byte at a time. 15367The data is temporarily held in buffer memory before being serialized and passed 15368to the arithmetic unit. 15369Notice that for the 2920 we assumed that parameters were presented 15370to the chip already serialized and precisely timed: the PDSP control unit 15371effectively releases the host from this high-speed real-time operation. 15372But it does more. 15373It generates both a voiced and an unvoiced excitation source and passes them 15374to the arithmetic unit, to relieve the latter of the general-purpose 15375programming required for both these tasks and allow its instruction set 15376to be highly specialized for digital filtering. 15377.pp 15378The arithmetic unit has rather a peculiar structure. 15379It accomodates only 16 program steps and can execute the full 16-instruction 15380program at a rate of 10\ kHz. 15381The internal word-length is 18 bits, but coefficients and the digital output 15382are only 10 bits. 15383Each instruction can accomplish quite a lot of work. 15384Figure 11.13 shows that there are four separate blocks of store in addition 15385to the program memory. 15386.FC "Figure 11.13" 15387One location of each block is automatically associated with each program step. 15388Thus on instruction 2, for example, two 18-bit scratchpad registers MA(2) 15389and MB(2), and two 10-bit coefficient registers A1(2) and A2(2), are 15390accessible. 15391In addition five general registers, curiously numbered R1, R2, R5, R6, R7, 15392are available to every program step. 15393.pp 15394Each instruction has five fields. 15395A single instruction loads all the general registers and simultaneously 15396performs two multiplications and up to three additions. 15397The fields specify exactly which operands are involved in these operations. 15398.pp 15399The instructions of the PDSP arithmetic unit are really very powerful. 15400For example, a second-order digital formant resonator requires only 15401two program steps. 15402A two-multiplier lattice stage needs only one step, and 15403a complete 12-stage lattice filter can be implemented in the 16 steps available. 15404An important feature of the architecture is that it 15405is quite easy to incorporate more than one 15406arithmetic unit into a system, with a single control unit. 15407Intermediate data can be transferred digitally between arithmetic units 15408since the D/A converter is off-chip. 15409A four-multiplier normalized lattice (Chapter 6) with 12 stages can be implemented 15410on two arithmetic units, as can a lattice filter which incorporates zeros 15411as well as poles, and a complex series/parallel formant synthesizer 15412with a total of 12 resonators whose centre frequencies and bandwidths 15413can be controlled independently (Klatt, 1980). 15414.[ 15415Klatt 1980 15416.] 15417.pp 15418How this device will fare in actual commercial products is yet to be seen. 
15419It is certainly much more sophisticated than the TI Speak 'n Spell chip, 15420and a complete system will necessitate a much higher chip count and consequently 15421more expense. 15422Telesensory Systems are committed to producing a text-to-speech 15423system based upon it 15424for use both in a reading machine for the blind and as a text-input 15425speech-output computer peripheral. 15426.sh "11.5 References" 15427.LB "nnnn" 15428.[ 15429$LIST$ 15430.] 15431.LE "nnnn" 15432.bp 15433.ev2 15434.ta \w'\fIsilence\fR 'u +\w'.EH100'u +\w'(used to change amplitude and duration)00'u +\w'00000000000test word'u 15435.nr x0 \w'\fIsilence\fR '+\w'.EH100'+\w'(used to change amplitude and duration)00'+\w'00000000000test word' 15436\l'\n(x0u\(ul' 15437.sp 15438.nr x1 (\w'Votrax'/2) 15439.nr x2 (\w'duration (msec)'/2) 15440.nr x3 \w'test word' 15441 \h'-\n(x1u'Votrax \0\h'-\n(x2u'duration (msec) \h'-\n(x3u'test word 15442\l'\n(x0u\(ul' 15443.sp 15444.nr x3 \w'hid' 15445\fIi\fR I 118 \h'-\n(x3u'hid 15446 I1 (sound equivalent of I) \083 15447 I2 (sound equivalent of I) \058 15448 I3 (allophone of I) \058 15449 .I3 (sound equivalent of I3) \083 15450 AY (allophone of I) \065 15451.nr x3 \w'head' 15452\fIe\fR EH 118 \h'-\n(x3u'head 15453 EH1 (sound equivalent of EH) \070 15454 EH2 (sound equivalent of EH) \060 15455 EH3 (allophone of EH) \060 15456 .EH2 (sound equivalent of EH3) \070 15457 A1 (allophone of EH) 100 15458 A2 (sound equivalent of A1) \095 15459.nr x3 \w'had' 15460\fIaa\fR AE 100 \h'-\n(x3u'had 15461 AE1 (sound equivalent of AE) 100 15462.nr x3 \w'hod' 15463\fIo\fR AW 235 \h'-\n(x3u'hod 15464 AW2 (sound equivalent of AW) \090 15465 AW1 (allophone of AW) 143 15466.nr x3 \w'hood' 15467\fIu\fR OO 178 \h'-\n(x3u'hood 15468 OO1 (sound equivalent of OO) 103 15469 IU (allophone of OO) \063 15470.nr x3 \w'hud' 15471\fIa\fR UH 103 \h'-\n(x3u'hud 15472 UH1 (sound equivalent of UH) \095 15473 UH2 (sound equivalent of UH) \050 15474 UH3 (allophone of UH) \070 15475 .UH3 (sound equivalent of UH3) 103 15476 .UH2 (allophone of UH) \060 15477.nr x3 \w'hard' 15478\fIar\fR AH1 143 \h'-\n(x3u'hard 15479 AH2 (sound equivalent of AH1) \070 15480.nr x3 \w'hawed' 15481\fIaw\fR O 178 \h'-\n(x3u'hawed 15482 O1 (sound equivalent of O) 118 15483 O2 (sound equivalent of O) \083 15484 .O (allophone of O) 178 15485 .O1 (sound equivalent of .O) 123 15486 .O2 (sound equivalent of .O) \090 15487.nr x3 \w'who d' 15488\fIuu\fR U 178 \h'-\n(x3u'who'd 15489 U1 (sound equivalent of U) \090 15490.nr x3 \w'heard' 15491\fIer\fR ER 143 \h'-\n(x3u'heard 15492.nr x3 \w'heed' 15493\fIee\fR E 178 \h'-\n(x3u'heed 15494 E1 (sound equivalent of E) 118 15495\fIr\fR R \090 15496 .R (allophone of R) \050 15497\fIw\fR W \083 15498 .W (allophone of W) \083 15499\l'\n(x0u\(ul' 15500.sp3 15501.ce 15502Table 11.2 Votrax sound segments and their durations 15503.bp 15504\l'\n(x0u\(ul' 15505.sp 15506.nr x1 (\w'Votrax'/2) 15507.nr x2 (\w'duration (msec)'/2) 15508.nr x3 \w'test word' 15509 \h'-\n(1u'Votrax \0\h'-\n(x2u'duration (msec) \h'-\n(x3u'test word 15510\l'\n(x0u\(ul' 15511.sp 15512\fIl\fR L 105 15513 L1 (allophone of L) 105 15514\fIy\fR Y 103 15515 Y1 (allophone of Y) \083 15516\fIm\fR M 105 15517\fIb\fR B \070 15518\fIp\fR P 100 15519 .PH (aspiration burst for use with P) \088 15520\fIn\fR N \083 15521\fId\fR D \050 15522 .D (allophone of D) \053 15523\fIt\fR T \090 15524 DT (allophone of T) \050 15525 .S (aspiration burst for use with T) \070 15526\fIng\fR NG 120 15527\fIg\fR G \075 15528 .G (allophone of G) \075 15529\fIk\fR K \075 15530 .K 
(allophone of K) \080 15531 .X1 (aspiration burst for use with K) \068 15532\fIs\fR S \090 15533\fIz\fR Z \070 15534\fIsh\fR SH 118 15535 CH (allophone of SH) \055 15536\fIzh\fR ZH \090 15537 J (allophone of ZH) \050 15538\fIf\fR F 100 15539\fIv\fR V \070 15540\fIth\fR TH \070 15541\fIdh\fR THV \070 15542\fIh\fR H \070 15543 H1 (allophone of H) \070 15544 .H1 (allophone of H) \048 15545\fIsilence\fR PA0 \045 15546 PA1 175 15547 .PA1 \0\05 15548 15549 .PA2 (used to change amplitude and duration) \0\0\- 15550\l'\n(x0u\(ul' 15551.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 15552.sp3 15553.ce 15554Table 11.2 (continued) 15555.bp 15556.ta 0.8i +2.6i +\w'(AH1 .UH2) (O U)000'u 15557.nr x0 0.8i+2.6i+\w'(AH1 .UH2) (O U)000'+\w'; i uh \- here' 15558\l'\n(x0u\(ul' 15559.sp 15560vowel clusters 15561 EH I A1 AY ; e i \- hey 15562 UH OO O U ; uh u \- ho 15563 AE I (AH1 EH3) I ; aa i \- hi 15564 AE OO (AH1 .UH2) (O U) ; aa u \- how 15565 AW I (O UH) E ; o i \- hoi 15566 I UH E I ; i uh \- here 15567 EH UH (EH A1) EH ; e uh \- hair 15568 OO UH OO UH ; u uh \- poor 15569 Y U Y1 (IU U) 15570.sp 15571vowel transitions 15572 {F M B P} O (.O1 O) 15573 {L R} EH (EH3 EH) 15574 {B K T D R} UH (UH3 UH) 15575 {T D} A1 (EH3 A1) 15576 {T D} AW (AH1 AW) 15577 {W} I (I3 I) 15578 {G SH W K} OO (IU OO) 15579 AY {K G T D} (AY Y) 15580 E {M T} (E Y) 15581 I {M T} (I Y) 15582 E {L} (I3 UH) 15583 EH {R N S D T} (EH EH3) 15584 I {R T} (I I3) 15585 AE {S N} (AE EH) 15586 AE {K} (AE EH3) 15587 A1 {R} (A1 EH1) 15588 AH1 {R P K} (AH1 UH) 15589 AH1 {ZH} (AH1 EH3) 15590.sp 15591intervocalics 15592 {voiced} T {voiced} DT 15593.sp 15594consonant transitions 15595 L {EH} L1 15596 H {U OO IU} H1 15597\l'\n(x0u\(ul' 15598.sp3 15599.ce 15600Table 11.5 Contextual rules for Votrax sound segments 15601.bp 15602\l'\n(x0u\(ul' 15603.sp 15604consonant clusters 15605 B {stop-consonant} (B PA0) 15606 P {stop-consonant} (P PA0) 15607 D {stop-consonant} (D PA0) 15608 T {stop-consonant} (T PA0) 15609 DT {stop-consonant} (T PA0) 15610 G {stop-consonant} (G PA0) 15611 K {stop-consonant} (K PA0) 15612 {D T} R (.X1 R) 15613 K R .K (.X1 R) 15614 {consonant} R .R 15615 {consonant} L L1 15616 K W .K .W 15617 D ZH D J 15618 T SH T CH 15619.sp 15620initial effects 15621 {.syll} P {vowel} (P .PH) 15622 {.syll} K {vowel} (K .H1) 15623 {.syll} T {vowel} (T .S) 15624 {.syll} L L1 15625 {.syll} H {U OO O AW AH1} H1 15626.sp 15627terminal effects 15628 E {PA0} (E Y) 15629\l'\n(x0u\(ul' 15630.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i 15631.sp3 15632.ce 15633Table 11.5 (continued) 15634.ev 15635