Grammar  {#estgram}
=====================

# Overview {#estgramoverview}

To aid speech recognition and general text processing, the speech
tools library provides a number of techniques for processing various
types of grammars.

These consist of a continually expanding set of classes and related
programs.  These are at various levels of development, but
all have fairly stable and working basic features.

## N-grams {#estgramngrams}

N-gram language models are supported by the EST_Ngrammar class and
associated routines and programs.  Support is provided for
building, tuning, smoothing and running n-gram models.  N-grams
themselves can be stored internally in either a
*dense* or a *sparse* representation.  In
the dense case all possible states are represented with probability
distributions, while the sparse case only contains those states
for which data is available.  The sparse representation will be much
smaller for very large models.  However, we consider the dense case to be
the most fully developed.

Formally, n-grams can be viewed as a special case of weighted finite
state machines.  Our implementation reflects that where possible (backoff
options break this simple view): a start state is provided
and traversal operations are given.  This method of using n-grams is
by far the most efficient, as only one new piece of information
is required at each stage, so no vectors of tokens need be collected
(or shifted) and presented to the n-gram class.  However, as this finite
state machine view cannot always reasonably be used, we also support
access through a vector of tokens.

### Building ngram language models {#estgramngrambuild}

The program \ref ngram_build_manual estimates n-gram language
models from data.  The data can be in a number of formats, and the
resulting models can be saved in either an ascii (easier for humans
to read) or a binary (quick to load) format.
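
As an illustrative sketch only (the option names and file names here are
assumptions and should be checked against \ref ngram_build_manual), a
tri-gram model might be estimated from a sentence-per-line data file with
a command along the lines of

    ngram_build -order 3 -input_format sentence_per_line \
                -w phones.vocab -prev_tag "#" -prev_prev_tag "0" \
                -o phones.ngram phones.data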

#### Vocabularies

The vocabulary of an n-gram must be predefined.  This is required to
allow an efficient internal representation.  The implementation
supports two vocabularies: one for the first n-1 tokens in an n-gram
and one for the nth token, as this "predictee" could potentially
be from a different class.  We also support the notion of an
out of vocabulary word, so any token found in the input data
that is not in the vocabulary may be mapped to that token.
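
A vocabulary file is simply a whitespace-separated list of tokens.  For
the phone examples used below, a (purely illustrative) vocabulary file
might contain

    # 0 a1 aa1 @ b ei1 f h i k m n r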

When building n-grams there are options for what to do with n-grams
that contain out of vocabulary tokens.  They may be mapped
to the specified out of vocabulary word, the n-gram can be ignored, or
the whole sentence containing the out of vocabulary word can
be ignored.

#### ngram data input formats

The input data can be in a number of formats depending on how
much preprocessing you wish to do before building.   The most
basic form is to submit n-grams directly, that is, n tokens on each
line.  For example, for a tri-gram model of phones it might
look like

    0 # a1
    # a1 f
    a1 f r
    f r i
    r i k
    i k aa1
    k aa1 n
    aa1 n @

In this case the data preparation stage must create each n-gram by
single-stepping through the data.  We call this
format `ngram_per_line`.

A second format is `sentence_per_line`, where each
line of the file is a complete "sentence".  N-grams for each n-tuple
will be automatically created and accumulated.   In this case the input
file might look like

    a1 f r i k aa1 n @
    ei1 b r @ h a m

In this mode, n-grams for the tokens at the start of a sentence are
created by defining a `prev_tag` (and if necessary a
`prev_prev_tag`).  Thus, given the above sentence-per-line file,
a `prev_tag` of "#" and a `prev_prev_tag` of "0", the first few
tri-grams accumulated are

    0 # a1
    # a1 f
    a1 f r

If the n-gram size requires looking back further, the `prev_prev_tag` is
repeated as necessary.  Likewise, an end tag is appended to the end
of every sentence (i.e. the end of every line).

A third data input format is `sentence_per_file`,
where line breaks are no longer significant and n-grams are created
for all n-tuples in the file.  The beginning and end of the file are
given the same special treatment as the beginning and end of a
line in the `sentence_per_line` case.

#### Smoothing and Backoff

We support a number of different techniques to deal with lack of data in
a training set.

Good-Turing smoothing \cite churchgale1991 is supported,
allowing smoothing of n-grams whose frequency is less than M.  We also
support *backoff*, where the (n-1)-grams are
(recursively) built to provide an estimate of the probability
distributions for unseen n-grams.
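
For reference, the Good-Turing estimate replaces an observed frequency r
with a smoothed frequency r* computed from the counts-of-counts N_r (the
number of distinct n-grams seen exactly r times):

\f[ r^* = (r+1)\frac{N_{r+1}}{N_r} \f]

This is the standard formulation discussed in \cite churchgale1991.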

### Testing ngram language models {#estngramtesting}

\ref ngram_test_manual computes language model entropy/perplexity
on test data. The test data may be in any of the formats described
above.
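
Again as a sketch only (the option names here are assumptions; see
\ref ngram_test_manual for the actual options), computing the perplexity
of a model on held-out data might look like

    ngram_test -g phones.ngram -input_format sentence_per_line test.data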

## SCFGs {#estngramscfg}

Stochastic context-free grammars are a version of context-free grammars
with probabilities on rules.  In this implementation we assume SCFGs
are always in Chomsky Normal Form (CNF), where rules may only be binary
or unary branching.  When binary, both daughters must be non-terminals;
when unary, the daughter must be a terminal.

The implementation here is primarily based on \cite pereira1992inside,
thus allowing unsupervised training of SCFGs as well
as seeding with a bracketed corpus, which can vastly reduce
training time and improve results.  Training uses the inside-outside
algorithm.

The support is split into four parts: making grammars, training grammars,
testing grammars and parsing.

A grammar file consists of a set of rules.  Each rule is a bracketed
list consisting of a probability and a nonterminal, followed by either
two nonterminals or one terminal.  A simple example is

    (0.5 S A D)
    (0.5 S C B)
    (0.79 B S C)
    (0.21 B a)
    (0.79 D S A)
    (0.21 D b)
    (1 A b)
    (1 C a)

The mother non-terminal in the first rule is the distinguished symbol.

Grammars may be constructed by hand, by the program \ref scfg_make_manual, or
by some other external process.  The program \ref scfg_make_manual constructs
a full grammar given a list of (or a number of) terminals and nonterminals.
The rules can be assigned equal probabilities or random ones.  The
"probabilities" may be actual probabilities or log probabilities.
For example, given a file `wp19` with a list of terminals, a grammar
suitable for training with 15 non-terminals may be created by the command

    scfg_make -nonterms 15 -terms wp19 -domain prob \
              -values random -o wsj_15_19.gram

If a number is given, the non-terminal or terminal names will be
generated automatically; if a filename is given, the names will be as
specified in that file.  In the latter case the file should contain no
brackets, just whitespace-separated tokens.

A corpus consists of a number of sentences, each of which must be
contained within a set of parentheses.  The sentences themselves may
additionally contain further bracketing (for training and testing).
Each sentence is read by the Lisp reader, so comments (semicolon to
end of line) may be included.  For example

\code{.lisp}
((((n n) punc ((cd n) j) punc)
 (md (v (dt n) (in (dt j n)) (n cd)))
 punc))
(((n n) (v ((n) (in ((n n) punc (dt n v n))))) punc))
((((n n) punc (((cd n) j) cc ((j n) (in (n n n n)))) punc)
 (v (v (((dt j n) (in (dt j j n))))))
 punc))
...
\endcode

Training is done by estimating the inside and outside probabilities of
the rules based on their current probabilities and a corpus.  The corpus
may optionally include internal bracketing, which is used by the training
algorithm to preclude some possible parses, hence making the training
typically faster (and sometimes more accurate).  After each training
pass the grammar rule probabilities are updated and the process starts
again.  Note that, depending on the number of training sentences, training
passes may take a very long time.  After each pass the cross entropy
for the current version of the grammar is printed.  This should normally
decrease until the "best" grammar has been found.

The program \ref scfg_train_manual takes an initial grammar and a corpus and,
by default, will train for 100 passes.  Because the desired number of passes
can take prohibitively long, an option is available to select
only an N percent chunk of the training set on each pass, cycling
through the other N percent chunks of the corpus on successive passes.
Experiments have shown that this not only produces much faster training,
but that the accuracy of the fully trained grammar is very similar.  Given
the choice between waiting 10 days or 48 hours for training, it is highly
recommended.

After every N passes the current state of the grammar may be saved; the
number of passes between saves is specified by the `-checkpoint`
option.  The grammar is saved in the output file with the pass
number appended.

Because partitioned training selects different partitions
depending on the pass number, you can optionally specify the starting
pass number, making it much easier to continue training after an
interruption.
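
As a sketch (apart from `-checkpoint`, the option names here are
assumptions; see \ref scfg_train_manual for the actual options), a training
run that saves the grammar every 5 passes might look like

    scfg_train -grammar wsj_15_19.gram -corpus train.data \
               -checkpoint 5 -o wsj_15_19_trained.gram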

Testing is done by the program \ref scfg_test_manual, which takes a grammar and a
corpus.  That corpus may be fully bracketed or not.  By default
the mean cross entropy from analyzing the test sentences is
printed, along with the number of sentences that fail to parse.

Alternatively, a *bracketing accuracy* may be calculated: this is the
percentage of phrases in a parsed sentence that are compatible with the
bracketing in the corpus example.
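
Following the same conventions as the parsing example below (the exact
options should be checked against \ref scfg_test_manual), a test run might
look like

    scfg_test -grammar wsj_15_19_trained.gram -corpus test.data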

The fourth program provides a mechanism for parsing one or more
sentences.  The corpus this time should contain no bracketing except
around the beginning and end of the sentence itself.  Two forms of parse
are produced: a full form with start and end points for each phrase,
the related non-terminal and the probability, and a simple form where
only the bracketing is given.  Note that only one parse (or none) is given.
For any phrase only the best example tree is given, though the
probability is given as the sum of all possible derivations of that
non-terminal for that phrase.

    scfg_parse -grammar wsj_15_19.gram -corpus news.data -brackets -o news.out

Note that the input must be strings of terminals for the given grammar.
For real parsing of real text it is likely that the grammar uses part of
speech tags as terminals while the data consists of actual words, not part of
speech tags.  If you want to parse text then you can use the Festival
script `festival/examples/scfg_parse_text`, which takes arbitrary
text, runs the part of speech tagger on it after standard tokenizing
rules, and parses the output, saving the parse to the specified file.

## WFSTs {#estngramwfst}

The speech tools contain a small but growing library of
basic functions for building and manipulating weighted finite
state transducers.  Although not complete, it already provides many of
the basic operations and compilers one needs for using these devices.

Given a WFST, the following operations are supported: \ref deterimise,
\ref minimize, \ref complement.

Given two WFSTs, the following operations are supported: \ref intersection,
\ref union, \ref difference, \ref concatenation and \ref compose.

In addition to these operations, compilers are provided for a number of
basic input formats: regular expressions, regular grammars, context-free
grammars (with depth restriction) and Kay/Kaplan/Koskenniemi two-level
morphology rules.

Still missing is a complete treatment of the weights through some
basic operations (e.g. minimization doesn't preserve weights).  Also,
techniques for learning WFSTs from data, or at least weighting existing
FSTs from data, will be added in later versions.

In general, input symbols are of the form X or X/Y, where X/Y denotes
an input/output symbol pair.  When a single symbol X is given it is
(except when using the wfst as a transducer) treated as X/X.  Thus
using single symbols will mostly cause the wfst mechanisms to act as
if they are finite state machines.

The two main programs are \ref wfst_build_manual and \ref wfst_run_manual.
\ref wfst_run_manual runs in both recognition and transduction mode.

\ref wfst_build_manual builds wfst's from description files or through
combination of existing ones.  The output may be optionally
determinized, or determinized and minimized.

### Kay/Kaplan/Koskenniemi morphological rules {#estngramwfstkaykaplan}

One of the major drivers of interest in wfsts has been their
use in morphology @cite kaplan94.  Hence we provide a method for
compiling Kay/Kaplan/Koskenniemi type (restricted) context-sensitive
rewrite rules.  The exact form is given in the example below.

This example covers basic letter-to-letter mappings, but also epenthesis for
e-insertion in words like `churches` and `boxes`.
\code{.lisp}
(KKrules
 engmorph
 (Alphabets
  ;; Input Alphabet
  (a b c d e f g h i j k l m n o p q r s t u v w x y z #)
  ;; Output Alphabet
  (a b c d e f g h i j k l m n o p q r s t u v w x y z + #)
 )
 (Sets
  (LET a b c d e f g h i j k l m n o p q r s t u v w x y z)
 )
 (Rules
 ;; The basic rules
 ( a => nil --- nil)
 ( b => nil --- nil)
 ( c => nil --- nil)
 ( d => nil --- nil)
 ( e => nil --- nil)
 ( f => nil --- nil)
 ( g => nil --- nil)
 ( h => nil --- nil)
 ( i => nil --- nil)
 ( j => nil --- nil)
 ( k => nil --- nil)
 ( l => nil --- nil)
 ( m => nil --- nil)
 ( n => nil --- nil)
 ( o => nil --- nil)
 ( p => nil --- nil)
 ( q => nil --- nil)
 ( r => nil --- nil)
 ( s => nil --- nil)
 ( t => nil --- nil)
 ( u => nil --- nil)
 ( v => nil --- nil)
 ( w => nil --- nil)
 ( x => nil --- nil)
 ( y => nil --- nil)
 ( z => nil --- nil)
 ( # => nil --- nil)
 ( _epsilon_/+ => (or LET _epsilon_/e) --- nil)

 ;; Epenthesis
 ;;   churches -> church+s
 ;;   boxes -> box+s
 (e/+ <=> (or (s h) (or s x z) (i/y) (c h))
            ---
            (s))
)
\endcode

A substantially larger example of morphographemic rules is distributed with
the Festival speech synthesis system in `festival/lib/engmorph.scm`.
This is based on the English description in @cite ritchie92 .

For a definition of the semantics of the basic types of rule
(surface coercion, context restriction and combined rules)
see @cite ritchie92.  Note that these rules are run in
parallel (the transducers are intersected), which can make the rules
interact in ways that the author might not intend.  A good rule
debugger is really required in order to write a substantial set
of rules in this formalism.

The rule compilation method used differs from Kay and Kaplan, and
also from @cite mohri96, and actually follows the method used
in @cite ritchie92, though in this case, unlike @cite ritchie92,
the technique is followed through to true wfst's.  The actual
compilation method should be described somewhere.

The above may be compiled into a wfst by the following command (assuming it
is in the file `engmorph.scm`):

    wfst_build -type kk -o engmorph.wfst -detmin engmorph.scm

This rule compiler has also been used to find equivalent transducers
for restricted forms of decision tree (following @cite sproat96) and
may be viewed as mostly stable.
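
The compiled transducer can then be exercised with \ref wfst_run_manual.
As a sketch only (the option names here are assumptions; see
\ref wfst_run_manual for the actual options), transducing surface forms
might look like

    wfst_run -wfst engmorph.wfst -transduce surface.forms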

### Regular expressions {#estngramwfstregex}

A simple method for building wfst's from regular expressions is also
provided.

An example is
\code{.lisp} ((a b c)
 (a b c)
 (and a (+ (or b c)) d))
\endcode

This consists of the input alphabet and the output alphabet,
followed by a LISP s-expression containing the regex.  The supported
operators are `and`, `or`, `+`, `*` and `not`.

Compilation is by the following command:

    wfst_build -type regex -o t1.wfst -detmin t1.regex

### Regular Grammars {#estngramwfstregulargrammars}

A compilation method also exists for regular grammars.  These grammars
do not need to be in a normal form; in fact no check is made that they
are regular, so if they contain centre-embedding the construction algorithm
will go into a loop and eventually run out of storage.  The
correction for that would be to add a depth limit, which would then
allow wfst approximations of context-free grammars, which would
be useful.

An example regular grammar is
\code{.lisp} (RegularGrammar
 engsuffixmorphosyntax
 ;; Sets
 (
 (V a e i o u y)
 (C b c d f g h j k l m n p q r s t v w x y z)
 )
 ;; Rules

 (
 ;; A word *must* have a suffix to be recognized
 (Word -> # Syls Suffix )
 (Word -> # Syls End )

 ;; This matches any string of characters that contains at least one vowel
 (Syls -> Syl Syls )
 (Syls -> Syl )
 (Syl -> Cs V Cs )
 (Cs -> C Cs )
 (Cs -> )

 (Suffix -> VerbSuffix )
 (Suffix -> NounSuffix )
 (Suffix -> AdjSuffix )
 (VerbSuffix -> VerbFinal End )
 (VerbSuffix -> VerbtoNoun NounSuffix )
 (VerbSuffix -> VerbtoNoun End )
 (VerbSuffix -> VerbtoAdj  AdjSuffix )
 (VerbSuffix -> VerbtoAdj End )
 (NounSuffix -> NounFinal End )
 (NounSuffix -> NountoNoun NounSuffix )
 (NounSuffix -> NountoNoun End )
 (NounSuffix -> NountoAdj AdjSuffix )
 (NounSuffix -> NountoAdj End )
 (NounSuffix -> NountoVerb VerbSuffix )
 (NounSuffix -> NountoVerb End )
 (AdjSuffix -> AdjFinal End )
 (AdjSuffix -> AdjtoAdj AdjSuffix)
 (AdjSuffix -> AdjtoAdj End)
 (AdjSuffix -> AdjtoAdv End)  ;; isn't any Adv to anything

 (End -> # )  ;; word boundary symbol *always* present

 (VerbFinal -> + e d)
 (VerbFinal -> + i n g)
 (VerbFinal -> + s)

 (VerbtoNoun -> + e r)
 (VerbtoNoun -> + e s s)
 (VerbtoNoun -> + a t i o n)
 (VerbtoNoun -> + i n g)
 (VerbtoNoun -> + m e n t)

 (VerbtoAdj -> + a b l e)

 (NounFinal -> + s)

 (NountoNoun -> + i s m)
 (NountoNoun -> + i s t)
 (NountoNoun -> + s h i p)

 (NountoAdj -> + l i k e)
 (NountoAdj -> + l e s s)
 (NountoAdj -> + i s h)
 (NountoAdj -> + o u s)

 (NountoVerb -> + i f y)
 (NountoVerb -> + i s e)
 (NountoVerb -> + i z e)

 (AdjFinal -> + e r)
 (AdjFinal -> + e s t)

 (AdjtoAdj -> + i s h)
 (AdjtoAdv -> + l y)
 (AdjtoNoun -> + n e s s)
 (AdjtoVerb -> + i s e)
 (AdjtoVerb -> + i z e)

)
)
\endcode

The above is a simple morpho-syntax for English.
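
By analogy with the `kk` and `regex` examples above, such a grammar can be
compiled with \ref wfst_build_manual; the `-type` value and file name below
are assumptions and should be checked against the manual:

    wfst_build -type rg -o engsuffix.wfst -detmin engsuffix.gram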


# Programs {#estngramprograms}

The following executable programs are provided for grammars:

  - scfg_make_manual: make a set of rules for a SCFG.
  - scfg_train_manual: train a set of rules for a SCFG.
  - scfg_parse_manual: perform parsing using a pre-trained grammar.
  - scfg_test_manual: test the output of a parser.
  - wfst_build_manual: build a weighted finite state transducer.
  - wfst_run_manual: run a weighted finite state transducer.
  - ngram_build_manual: train an n-gram from text.
  - ngram_test_manual: calculate the perplexity etc. of an n-gram.

# Classes {#estngramclasses}

  - EST_SCFG
  - EST_SCFG_Rule
  - EST_SCFG_traintest
  - EST_bracketed_string
  - EST_Ngrammar
  - EST_WFST
