Grammar {#estgram}
=====================

# Overview {#estgramoverview}

To aid speech recognition and general text processing, the speech
tools library provides a number of techniques for processing various
types of grammars.

These consist of a continually expanding set of classes and related
programs. These are at various levels of development, but all have
fairly stable and working basic features.

## N-grams {#estgramngrams}

N-gram language models are supported by the EST_Ngrammar class and
associated routines and programs. Support is provided for building,
tuning, smoothing and running n-gram models. The n-grams themselves
can be stored internally in either a *dense* or a *sparse*
representation. In the dense case all possible states are represented
with probability distributions, while the sparse case only has those
states for which data is available. The sparse representation will be
much smaller for very large models; however, we consider the dense case
to be the most fully developed.

Formally, n-grams can be viewed as a special case of weighted finite
state machines. Our implementation reflects that where possible (backoff
options break this simple view), thus a start state is provided and
traversal operations are given. This method of using n-grams is by far
the most efficient, as only one new piece of information is required at
each stage, so no vectors of tokens need be collected (or shifted) and
presented to the n-gram class. However, as this finite state machine
view cannot always reasonably be used, we also support access through a
vector of tokens.

### Building ngram language models {#estgramngrambuild}

The program \ref ngram_build_manual estimates ngram language models
from data. The data can be in a number of formats, and models can be
saved in both an ascii (easier for humans to read) and a binary
(quick to load) format.

#### Vocabularies

The vocabulary of an ngram must be predefined. This is required to
allow an efficient internal representation. This implementation
supports two vocabularies, one for the n-1 tokens in an ngram and one
for the nth token, as potentially this "predictee" could be from a
different class. We also support the notion of an out of vocabulary
word, so any token found in the input data that is not in the
vocabulary may be mapped to that token.

In building n-grams there are options for what to do with n-grams
which contain out of vocabulary tokens. They may be mapped to the
specified out of vocabulary word, the ngram can be ignored, or the
whole sentence containing the out of vocabulary word can be ignored.

#### ngram data input formats

The input data can be in a number of formats depending on how much
preprocessing you wish to do before building. The most basic form is
to submit n-grams, that is, n tokens on each line. For example, for a
tri-gram model of phones it might look like

    0 # a1
    # a1 f
    a1 f r
    f r i
    r i k
    i k aa1
    k aa1 n
    aa1 n @

In this case the data preparation stage must create each n-gram
itself, single-stepping through the data. We call this format
`ngram_per_line`.
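As a concrete illustration of the `ngram_per_line` format, the
following sketch (plain Python, not part of the speech tools; the
function name, arguments and padding defaults are purely illustrative)
expands a phone sequence into one n-gram per line, padding the start of
the sequence with the boundary tokens used in the example above.

\code{.py}
def ngram_per_line(tokens, n=3, prev_prev_tag="0", prev_tag="#"):
    """Yield one n-gram (a list of n tokens) per output line.

    The start of the sequence is padded with prev_tag, and with
    prev_prev_tag for any further look-back, mirroring the boundary
    tokens "0" and "#" in the tri-gram example above.
    """
    padded = [prev_prev_tag] * max(n - 2, 0) + [prev_tag] + list(tokens)
    for i in range(len(tokens)):
        yield padded[i:i + n]

for gram in ngram_per_line("a1 f r i k aa1 n @".split(), n=3):
    print(" ".join(gram))
# 0 # a1
# # a1 f
# a1 f r
# ...
# aa1 n @
\endcode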
A second format is `sentence_per_line`, where each line of a file is a
complete "sentence". N-grams for each n-tuple will be automatically
created and cumulated. In this case the input file might look like

    a1 f r i k aa1 n @
    ei1 b r @ h a m

In this mode, n-grams for the tokens at the start of a sentence are
created by defining a `prev_tag` (and, if necessary, a
`prev_prev_tag`). Thus, given the above sentence-per-line file, a
`prev_tag` of "#" and a `prev_prev_tag` of "0", the first few
tri-grams cumulated are

    0 # a1
    # a1 f
    a1 f r

If the ngram size requires looking back further, the `prev_prev_tag`
is repeated indefinitely. Likewise an `end_tag` is appended to the end
of every sentence (i.e. the end of every line).

A third data input format is `sentence_per_file`, where line breaks
are no longer significant and n-grams are created for all n-tuples in
the file. The beginning and end of the file receive the same special
treatment as the beginning and end of each line do in the
`sentence_per_line` case.

#### Smoothing and Backoff

We support a number of different techniques to deal with lack of data
in a training set.

Good-Turing smoothing \cite churchgale1991 is supported, allowing
smoothing of n-grams whose frequency is less than M. We also support
*backoff*, where the n-1 grams are (recursively) built to provide an
estimate of the probability distributions for unseen n-grams.
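For illustration only, the following sketch (plain Python, not the
library's estimator; all names are hypothetical) shows the Good-Turing
adjusted-count formula and a deliberately simplified backoff lookup
with a fixed penalty weight. A real backoff model computes the backoff
weights so that each distribution normalises.

\code{.py}
from collections import Counter

def ngram_counts(sentences, n):
    """Count all k-grams for k = 1..n over a corpus of token lists."""
    counts = {k: Counter() for k in range(1, n + 1)}
    for toks in sentences:
        for k in range(1, n + 1):
            for i in range(len(toks) - k + 1):
                counts[k][tuple(toks[i:i + k])] += 1
    return counts

def good_turing(c, freq_of_freq, max_freq=5):
    """Good-Turing adjusted count c* = (c + 1) * N_{c+1} / N_c for small c;
    counts at or above max_freq (and counts with no N_{c+1}) are left as is."""
    if 0 < c < max_freq and freq_of_freq.get(c + 1, 0) > 0:
        return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
    return float(c)

def backoff_prob(gram, counts, alpha=0.4):
    """Simplified backoff: maximum-likelihood estimate when the full
    n-gram was seen, otherwise recurse on the shorter history with a
    fixed penalty alpha (a real model derives alpha per history)."""
    n = len(gram)
    if n == 1:
        total = sum(counts[1].values())
        return counts[1][gram] / total if total else 0.0
    hist = gram[:-1]
    if counts[n][gram] > 0 and counts[n - 1][hist] > 0:
        return counts[n][gram] / counts[n - 1][hist]
    return alpha * backoff_prob(gram[1:], counts, alpha)

corpus = [["a1", "f", "r", "i", "k", "aa1", "n", "@"],
          ["ei1", "b", "r", "@", "h", "a", "m"]]
counts = ngram_counts(corpus, 3)
print(backoff_prob(("a1", "f", "r"), counts))  # seen tri-gram: ML estimate
print(backoff_prob(("a1", "f", "m"), counts))  # unseen: backs off to bigram, then unigram
\endcode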
### Testing ngram language models {#estngramtesting}

\ref ngram_test_manual computes language model entropy/perplexity on
test data. The test data may be in any of the formats described above.

## SCFGs {#estngramscfg}

Stochastic context-free grammars are a version of context-free
grammars with probabilities on the rules. In this implementation we
assume SCFGs are always in Chomsky Normal Form (CNF), where rules may
only be binary or unary branching. When binary, both daughters must be
non-terminals, and when unary, the daughter must be a terminal.

The implementation here is primarily based on \cite pereira1992inside,
thus allowing unsupervised training of SCFGs as well as seeding with a
bracketed corpus, which can vastly reduce training time and improve
results. Training uses the inside-outside algorithm.

The support is split into four parts: making grammars, training
grammars, testing grammars and parsing.

A grammar file consists of a set of rules. Each rule is a bracketed
list of a probability and a nonterminal, followed by either two
nonterminals or one terminal. A simple example is

    (0.5 S A D)
    (0.5 S C B)
    (0.79 B S C)
    (0.21 B a)
    (0.79 D S A)
    (0.21 D b)
    (1 A b)
    (1 C a)

The mother non-terminal in the first rule is the distinguished symbol.

Grammars may be constructed by hand, by the program \ref scfg_make_manual,
or by some other external process. The program \ref scfg_make_manual
constructs a full grammar given a list (or number) of terminals and
nonterminals. The rules can be assigned equal probabilities or random
ones. The "probabilities" may be actual probabilities or log
probabilities. For example, given a file `wp19` containing a list of
terminals, a grammar suitable for training with 15 non-terminals may be
created by the command

    scfg_make -nonterms 15 -terms wp19 -domain prob \
       -values random -o wsj_15_19.gram

The non-terminal or terminal names will be automatically generated if a
number is given, or will be as specified if a file name is given. In
the case of a filename being given, the file should contain no
brackets, just whitespace-separated tokens.

A corpus consists of a number of sentences; each sentence must be
contained within a set of parentheses. The sentences themselves may
additionally contain further bracketing (for training and testing).
Each sentence is read by the Lisp reader, so comments (semi-colon to
end of line) may be included. For example

\code{.lisp}
((((n n) punc ((cd n) j) punc)
  (md (v (dt n) (in (dt j n)) (n cd)))
  punc))
(((n n) (v ((n) (in ((n n) punc (dt n v n))))) punc))
((((n n) punc (((cd n) j) cc ((j n) (in (n n n n)))) punc)
  (v (v (((dt j n) (in (dt j j n))))))
  punc))
...
\endcode

Training is done by estimating the inside and outside probabilities of
the rules based on their current probabilities and a corpus. The corpus
may optionally include internal bracketing, which is used by the
training algorithm to preclude some possible parses, hence making the
training typically faster (and sometimes more accurate). After each
training pass the grammar rule probabilities are updated and the
process starts again. Note that depending on the number of training
sentences, training passes may take a very long time. After each pass
the cross entropy for the current version of the grammar is printed.
This should normally decrease until the "best" grammar has been found.
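To illustrate the quantity being estimated, the following sketch (plain
Python, independent of the EST implementation; names are illustrative)
computes the inside probabilities for a toy CNF grammar written as
(probability, mother, daughters) rules in the format shown above. The
outside pass and the re-estimation step are omitted.

\code{.py}
from collections import defaultdict

# Toy grammar as (prob, mother, daughters) tuples, mirroring the grammar
# file above (two non-terminal daughters, or a single terminal daughter).
rules = [(0.5, "S", ("A", "D")), (0.5, "S", ("C", "B")),
         (0.79, "B", ("S", "C")), (0.21, "B", ("a",)),
         (0.79, "D", ("S", "A")), (0.21, "D", ("b",)),
         (1.0, "A", ("b",)), (1.0, "C", ("a",))]

def inside(words, rules):
    """beta[(i, j)][X] = probability that X derives words[i:j]."""
    unary = [r for r in rules if len(r[2]) == 1]
    binary = [r for r in rules if len(r[2]) == 2]
    beta = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):                  # terminal rules, spans of 1
        for p, mother, (t,) in unary:
            if t == w:
                beta[(i, i + 1)][mother] += p
    for span in range(2, len(words) + 1):          # longer spans, bottom up
        for i in range(len(words) - span + 1):
            j = i + span
            for k in range(i + 1, j):              # split point
                for p, mother, (d1, d2) in binary:
                    beta[(i, j)][mother] += p * beta[(i, k)][d1] * beta[(k, j)][d2]
    return beta

words = ["b", "a", "a", "b"]                       # a string of terminals
beta = inside(words, rules)
print(beta[(0, len(words))]["S"])                  # P(words | grammar), here ~0.0415
\endcode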
The program \ref scfg_train_manual takes an initial grammar and a
corpus and, by default, will train for 100 passes. Because it can take
prohibitively long to run the desired number of passes, an option is
available to select only an N percent chunk of the training set on
each pass, cycling through the other N percent chunks of the corpus on
successive passes. Experiments have shown that this not only produces
much faster training, but that the accuracy of the fully trained
grammar is very similar. Given the choice between waiting 10 days or
48 hours, it is highly recommended.

After every N passes the current state of the grammar may be saved;
the number of passes between saves is specified by the `-checkpoint`
option. The grammar is saved in the output file with the pass number
appended.

Because the partitioned training will select different partitions
depending on the pass number, you can optionally specify the starting
pass number, making it much easier to continue training after some
interruption.

Testing is done by the program \ref scfg_test_manual, which takes a
grammar and a corpus. That corpus may be fully bracketed or not. By
default the mean cross entropy value from analysing these sentences is
printed, along with the number of sentences that fail to parse.

Alternatively, a *bracketing accuracy* may be calculated: this is the
percentage of phrases in a parsed sentence that are compatible with
the bracketing in the corpus example.

The fourth program provides a mechanism for parsing one or more
sentences. The corpus this time should contain no bracketing except
around the beginning and end of the sentence itself. Two forms of
parse are produced: a full form with start and end points for each
phrase, the related non-terminal and the probability, and a simple
form where only the bracketing is given. Note that only one parse (or
none) is given. For any phrase only the best example tree is given,
though the probability is given as the sum of all possible derivations
of that non-terminal for that phrase.

    scfg_parse -grammar wsj_15_19.gram -corpus news.data -brackets -o news.out

Note the input must be strings of terminals for the given grammar. For
real parsing of real text it is likely that the grammar uses part of
speech tags as terminals, while the data is actual words rather than
part of speech tags. If you want to parse texts then you can use the
Festival script `festival/examples/scfg_parse_text`, which takes
arbitrary text, runs the part of speech tagger on it after standard
tokenizing rules, and parses the output, saving the parse to the
specified file.

## WFSTs {#estngramwfst}

The speech tools contain a small but growing library of basic
functions for building and manipulating weighted finite state
transducers. Although not complete, it already provides many of the
basic operations and compilers one needs for using these devices.

Given a WFST the following operations are supported: \ref deterimise,
\ref minimize, \ref complement.

Given two WFSTs the following operations are supported: \ref intersection,
\ref union, \ref difference, \ref concatenation and \ref compose.

In addition to these operations, compilers are provided for a number of
basic input formats: regular expressions, regular grammars, context-free
grammars (with a depth restriction) and Kay/Kaplan/Koskenniemi two-level
morphology rules.

Still missing is a complete treatment of the weights through some basic
operations (e.g. minimization doesn't preserve weights). Techniques for
learning WFSTs from data, or at least weighting existing FSTs from
data, will also be added in later versions.

In general, input symbols are of the form X or X/Y, where X/Y denotes
an input/output symbol pair. When X alone is given it is treated as
X/X (except when using the WFST as a transducer); thus using single
symbols will mostly cause the WFST mechanisms to act as if they were
finite state machines.

The two main programs are \ref wfst_build_manual and \ref wfst_run_manual.
\ref wfst_run_manual runs in both recognition and transduction mode.

\ref wfst_build_manual builds WFSTs from description files or through
combination of existing ones. The output may optionally be
determinized, or determinized and minimized.
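To make the X/Y symbol convention concrete, here is a toy sketch (plain
Python, unrelated to the EST_WFST class, and deterministic for
simplicity, so ambiguity and weight semirings are ignored). Each
transition carries an input symbol, an output symbol and a weight, and
a bare symbol X behaves as X/X.

\code{.py}
# A toy weighted transducer: states are integers, and transitions map
# (state, input_symbol) -> (next_state, output_symbol, weight).
transitions = {
    (0, "a"): (1, "a", 1.0),       # a, i.e. a/a
    (1, "b"): (2, "p", 0.5),       # b/p: input b, output p
    (2, "c"): (2, "c", 1.0),       # c/c, self loop
}
start, finals = 0, {2}

def transduce(inp):
    """Map an input symbol sequence to (output sequence, weight),
    or None if the input is not accepted (recognition fails)."""
    state, out, weight = start, [], 1.0
    for sym in inp:
        if (state, sym) not in transitions:
            return None
        state, osym, w = transitions[(state, sym)]
        out.append(osym)
        weight *= w
    return (out, weight) if state in finals else None

print(transduce(["a", "b", "c", "c"]))   # (['a', 'p', 'c', 'c'], 0.5)
print(transduce(["b"]))                  # None: not accepted
\endcode

Running the same machine in recognition mode simply amounts to asking
whether `transduce` returns a result, ignoring the output string.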
### Kay/Kaplan/Koskenniemi morphological rules {#estngramwfstkaykaplan}

One of the major drivers of interest in WFSTs has been their use in
morphology @cite kaplan94. Hence we provide a method for compiling
Kay/Kaplan/Koskenniemi type (restricted) context-sensitive rewrite
rules. The exact form is given in the example below.

This example covers basic letter-to-letter mappings, but also
epenthesis for e-insertion in words like `churches` and `boxes`.

\code{.lisp}
(KKrules
 engmorph
 (Alphabets
  ;; Input Alphabet
  (a b c d e f g h i j k l m n o p q r s t u v w x y z #)
  ;; Output Alphabet
  (a b c d e f g h i j k l m n o p q r s t u v w x y z + #)
 )
 (Sets
  (LET a b c d e f g h i j k l m n o p q r s t u v w x y z)
 )
 (Rules
  ;; The basic rules
  ( a => nil --- nil)
  ( b => nil --- nil)
  ( c => nil --- nil)
  ( d => nil --- nil)
  ( e => nil --- nil)
  ( f => nil --- nil)
  ( g => nil --- nil)
  ( h => nil --- nil)
  ( i => nil --- nil)
  ( j => nil --- nil)
  ( k => nil --- nil)
  ( l => nil --- nil)
  ( m => nil --- nil)
  ( n => nil --- nil)
  ( o => nil --- nil)
  ( p => nil --- nil)
  ( q => nil --- nil)
  ( r => nil --- nil)
  ( s => nil --- nil)
  ( t => nil --- nil)
  ( u => nil --- nil)
  ( v => nil --- nil)
  ( w => nil --- nil)
  ( x => nil --- nil)
  ( y => nil --- nil)
  ( z => nil --- nil)
  ( # => nil --- nil)
  ( _epsilon_/+ => (or LET _epsilon_/e) --- nil)

  ;; Epenthesis
  ;;   churches -> church+s
  ;;   boxes -> box+s
  (e/+ <=> (or (s h) (or s x z) (i/y) (c h))
       ---
       (s))
 )
)
\endcode

A substantially larger example of morphographemic rules is distributed
with the Festival speech synthesis system in
`festival/lib/engmorph.scm`. This is based on the English description
in @cite ritchie92.

For a definition of the semantics of the basic rule types (surface
coercion, context restriction and combined rules) see @cite ritchie92.
Note that these rules are run in parallel (the transducers are
intersected), which can make the rules interact in ways the author
might not intend. A good rule debugger is really required in order to
write a substantial set of rules in this formalism.

The rule compilation method used differs from Kay and Kaplan, and also
from @cite mohri96; it actually follows the method used in
@cite ritchie92, though in this case, unlike @cite ritchie92, the
technique is followed through to true WFSTs. The actual compilation
method should be described somewhere.

The above may be compiled into a WFST by the following command
(assuming the rules are in the file `engmorph.scm`):

    wfst_build -type kk -o engmorph.wfst -detmin engmorph.scm

This rule compiler has also been used to find equivalent transducers
for restricted forms of decision tree (following @cite sproat96) and
may be viewed as mostly stable.

### Regular expressions {#estngramwfstregex}

A simple method for building WFSTs from regular expressions is also
provided. An example is

\code{.lisp}
((a b c)
 (a b c)
 (and a (+ (or b c)) d))
\endcode

This consists of the input alphabet and the output alphabet, followed
by a Lisp s-expression containing the regular expression. The
supported operators are `and`, `or`, `+`, `*` and `not`.

Compilation is by the following command:

    wfst_build -type regex -o t1.wfst -detmin t1.regex
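The semantics of these operators can be illustrated with a toy
interpreter (plain Python, unrelated to the wfst_build compiler; all
names are illustrative) that checks whether a token sequence is matched
by such an s-expression. The `not` operator is omitted here, since
complementation requires the full alphabet.

\code{.py}
def spans(expr, seq):
    """Return the set of prefix lengths of seq that expr can match."""
    if isinstance(expr, str):                       # a single symbol
        return {1} if seq[:1] == [expr] else set()
    op, args = expr[0], expr[1:]
    if op == "and":                                 # concatenation
        lens = {0}
        for a in args:
            lens = {i + j for i in lens for j in spans(a, seq[i:])}
        return lens
    if op == "or":                                  # alternation
        return set().union(*(spans(a, seq) for a in args))
    if op == "*":                                   # zero or more repetitions
        lens, frontier = {0}, {0}
        while frontier:
            new = {i + j for i in frontier for j in spans(args[0], seq[i:]) if j > 0}
            frontier = new - lens
            lens |= new
        return lens
    if op == "+":                                   # one or more repetitions
        return spans(["and", args[0], ["*", args[0]]], seq)
    raise ValueError("unsupported operator: %s" % op)

def full_match(expr, tokens):
    return len(tokens) in spans(expr, tokens)

# (and a (+ (or b c)) d) from the example above
expr = ["and", "a", ["+", ["or", "b", "c"]], "d"]
print(full_match(expr, ["a", "b", "c", "b", "d"]))   # True
print(full_match(expr, ["a", "d"]))                  # False: "+" needs at least one b or c
\endcode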
### Regular Grammars {#estngramwfstregulargrammars}

A compilation method also exists for regular grammars. These grammars
do not need to be in a normal form; in fact, no check is made that
they are regular, so if they contain centre-embedding the construction
algorithm will go into a loop and eventually run out of storage. The
correction for that would be to add a depth limit, which would then
allow WFST approximations of context-free grammars, which would be
useful.

An example regular grammar is

\code{.lisp}
(RegularGrammar
 engsuffixmorphosyntax
 ;; Sets
 (
  (V a e i o u y)
  (C b c d f g h j k l m n p q r s t v w x y z)
 )
 ;; Rules

 (
  ;; A word *must* have a suffix to be recognized
  (Word -> # Syls Suffix )
  (Word -> # Syls End )

  ;; This matches any string of characters that contains at least one vowel
  (Syls -> Syl Syls )
  (Syls -> Syl )
  (Syl -> Cs V Cs )
  (Cs -> C Cs )
  (Cs -> )

  (Suffix -> VerbSuffix )
  (Suffix -> NounSuffix )
  (Suffix -> AdjSuffix )
  (VerbSuffix -> VerbFinal End )
  (VerbSuffix -> VerbtoNoun NounSuffix )
  (VerbSuffix -> VerbtoNoun End )
  (VerbSuffix -> VerbtoAdj AdjSuffix )
  (VerbSuffix -> VerbtoAdj End )
  (NounSuffix -> NounFinal End )
  (NounSuffix -> NountoNoun NounSuffix )
  (NounSuffix -> NountoNoun End )
  (NounSuffix -> NountoAdj AdjSuffix )
  (NounSuffix -> NountoAdj End )
  (NounSuffix -> NountoVerb VerbSuffix )
  (NounSuffix -> NountoVerb End )
  (AdjSuffix -> AdjFinal End )
  (AdjSuffix -> AdjtoAdj AdjSuffix)
  (AdjSuffix -> AdjtoAdj End)
  (AdjSuffix -> AdjtoAdv End)  ;; isn't any Adv to anything

  (End -> # )  ;; word boundary symbol *always* present

  (VerbFinal -> + e d)
  (VerbFinal -> + i n g)
  (VerbFinal -> + s)

  (VerbtoNoun -> + e r)
  (VerbtoNoun -> + e s s)
  (VerbtoNoun -> + a t i o n)
  (VerbtoNoun -> + i n g)
  (VerbtoNoun -> + m e n t)

  (VerbtoAdj -> + a b l e)

  (NounFinal -> + s)

  (NountoNoun -> + i s m)
  (NountoNoun -> + i s t)
  (NountoNoun -> + s h i p)

  (NountoAdj -> + l i k e)
  (NountoAdj -> + l e s s)
  (NountoAdj -> + i s h)
  (NountoAdj -> + o u s)

  (NountoVerb -> + i f y)
  (NountoVerb -> + i s e)
  (NountoVerb -> + i z e)

  (AdjFinal -> + e r)
  (AdjFinal -> + e s t)

  (AdjtoAdj -> + i s h)
  (AdjtoAdv -> + l y)
  (AdjtoNoun -> + n e s s)
  (AdjtoVerb -> + i s e)
  (AdjtoVerb -> + i z e)

 )
)
\endcode

The above is a simple morpho-syntax for English.


# Programs {#estngramprograms}

The following are executable programs for grammars:

 - scfg_make_manual: make a set of rules for an SCFG.
 - scfg_train_manual: train a set of rules for an SCFG.
 - scfg_parse_manual: perform parsing using a pre-trained grammar.
 - scfg_test_manual: test the output of a parser.
 - wfst_build_manual: build a weighted finite state transducer.
 - wfst_run_manual: run a weighted finite state transducer.
 - ngram_build_manual: train an n-gram from text.
 - ngram_test_manual: calculate the perplexity etc. of an n-gram.

# Classes {#estngramclasses}

 - EST_SCFG
 - EST_SCFG_Rule
 - EST_SCFG_traintest
 - EST_bracketed_string
 - EST_Ngrammar
 - EST_WFST