Grammar  {#estgram}
=====================

# Overview {#estgramoverview}

To aid speech recognition and general text processing, the speech
tools library provides a number of techniques for processing various
types of grammars.

These consist of a continually expanding set of classes and related
programs.  These are at various levels of development, but
all have fairly stable and working basic features.

## N-grams {#estgramngrams}

N-gram language models are supported by the EST_Ngrammar class and
associated routines and programs.  Support is provided for
building, tuning, smoothing and running n-gram models.  N-grams
themselves can be stored internally in either a
*dense* or a *sparse* representation.  In
the dense case all possible states are represented with probability
distributions, while the sparse case only contains those states
for which data is available.  The sparse representation will be much
smaller for very large models.  However, we consider the dense case to be
the most fully developed.

Formally, n-grams can be viewed as a special case of weighted finite
state machines.  Our implementation reflects that where possible (backoff
options break this simple view): a start state is provided
and traversal operations are given.  This method of using n-grams is
by far the most efficient, as only one new piece of information
is required at each stage, so no vectors of tokens need be collected
(or shifted) and presented to the n-gram class.  However, as this finite
state machine view cannot always reasonably be used, we also support
access through a vector of tokens.

### Building ngram language models {#estgramngrambuild}

The program \ref ngram_build_manual estimates n-gram language
models from data.  The data can be in a number of formats, and the
resulting models can be saved in either an ascii (easier for humans
to read) or a binary (quick to load) format.
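
As an illustrative sketch only (the option names and file names here are
assumptions and should be checked against \ref ngram_build_manual), a
tri-gram model might be estimated from a sentence-per-line data file with
a command along the lines of

    ngram_build -order 3 -input_format sentence_per_line \
                -w phones.vocab -prev_tag "#" -prev_prev_tag "0" \
                -o phones.ngram phones.data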

#### Vocabularies

The vocabulary of an n-gram must be predefined.  This is required to
allow an efficient internal representation.  The implementation
supports two vocabularies: one for the first n-1 tokens in an n-gram
and one for the nth token, as this "predictee" could potentially
be from a different class.  We also support the notion of an
out of vocabulary word, so any token found in the input data
that is not in the vocabulary may be mapped to that token.
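
A vocabulary file is simply a whitespace-separated list of tokens.  For
the phone examples used below, a (purely illustrative) vocabulary file
might contain

    # 0 a1 aa1 @ b ei1 f h i k m n r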

When building n-grams there are options for what to do with n-grams
that contain out of vocabulary tokens.  They may be mapped
to the specified out of vocabulary word, the n-gram can be ignored, or
the whole sentence containing the out of vocabulary word can
be ignored.

#### ngram data input formats

The input data can be in a number of formats depending on how
much preprocessing you wish to do before building.   The most
basic form is to submit n-grams directly, that is, n tokens on each
line.  For example, for a tri-gram model of phones it might
look like

    0 # a1
    # a1 f
    a1 f r
    f r i
    r i k
    i k aa1
    k aa1 n
    aa1 n @

In this case the data preparation stage must create each n-gram by
single-stepping through the data.  We call this
format `ngram_per_line`.

A second format is `sentence_per_line`, where each
line of the file is a complete "sentence".  N-grams for each n-tuple
will be automatically created and accumulated.   In this case the input
file might look like

    a1 f r i k aa1 n @
    ei1 b r @ h a m

In this mode, n-grams for the tokens at the start of a sentence are
created by defining a `prev_tag` (and if necessary a
`prev_prev_tag`).  Thus, given the above sentence-per-line file,
a `prev_tag` of "#" and a `prev_prev_tag` of "0", the first few
tri-grams accumulated are

    0 # a1
    # a1 f
    a1 f r

If the n-gram size requires looking back further, the `prev_prev_tag` is
repeated as necessary.  Likewise, an end tag is appended to the end
of every sentence (i.e. the end of every line).

A third data input format is `sentence_per_file`,
where line breaks are no longer significant and n-grams are created
for all n-tuples in the file.  The beginning and end of the file are
given the same special treatment as the beginning and end of a
line in the `sentence_per_line` case.

#### Smoothing and Backoff

We support a number of different techniques to deal with lack of data in
a training set.

Good-Turing smoothing \cite churchgale1991 is supported,
allowing smoothing of n-grams whose frequency is less than M.  We also
support *backoff*, where the (n-1)-grams are
(recursively) built to provide an estimate of the probability
distributions for unseen n-grams.
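
For reference, the Good-Turing estimate replaces an observed frequency r
with a smoothed frequency r* computed from the counts-of-counts N_r (the
number of distinct n-grams seen exactly r times):

\f[ r^* = (r+1)\frac{N_{r+1}}{N_r} \f]

This is the standard formulation discussed in \cite churchgale1991.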

### Testing ngram language models {#estngramtesting}

\ref ngram_test_manual computes language model entropy/perplexity
on test data. The test data may be in any of the formats described
above.
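
Again as a sketch only (the option names here are assumptions; see
\ref ngram_test_manual for the actual options), computing the perplexity
of a model on held-out data might look like

    ngram_test -g phones.ngram -input_format sentence_per_line test.data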

## SCFGs {#estngramscfg}

Stochastic context-free grammars are a version of context-free grammars
with probabilities on rules.  In this implementation we assume SCFGs
are always in Chomsky Normal Form (CNF), where rules may only be binary
or unary branching.  When binary, both daughters must be non-terminals;
when unary, the daughter must be a terminal.

The implementation here is primarily based on \cite pereira1992inside,
thus allowing unsupervised training of SCFGs as well
as seeding with a bracketed corpus, which can vastly reduce
training time and improve results.  Training uses the inside-outside
algorithm.

The support is split into four parts: making grammars, training grammars,
testing grammars and parsing.

A grammar file consists of a set of rules.  Each rule is a bracketed
list consisting of a probability and a nonterminal, followed by either
two nonterminals or one terminal.  A simple example is

    (0.5 S A D)
    (0.5 S C B)
    (0.79 B S C)
    (0.21 B a)
    (0.79 D S A)
    (0.21 D b)
    (1 A b)
    (1 C a)

The mother non-terminal in the first rule is the distinguished symbol.

Grammars may be constructed by hand, by the program \ref scfg_make_manual, or
by some other external process.  The program \ref scfg_make_manual constructs
a full grammar given a list of (or a number of) terminals and nonterminals.
The rules can be assigned equal probabilities or random ones.  The
"probabilities" may be actual probabilities or log probabilities.
For example, given a file `wp19` with a list of terminals, a grammar
suitable for training with 15 non-terminals may be created by the command

    scfg_make -nonterms 15 -terms wp19 -domain prob \
              -values random -o wsj_15_19.gram

If a number is given, the non-terminal or terminal names will be
generated automatically; if a filename is given, the names will be as
specified in that file.  In the latter case the file should contain no
brackets, just whitespace-separated tokens.

A corpus consists of a number of sentences, each of which must be
contained within a set of parentheses.  The sentences themselves may
additionally contain further bracketing (for training and testing).
Each sentence is read by the Lisp reader, so comments (semicolon to
end of line) may be included.  For example

\code{.lisp}
((((n n) punc ((cd n) j) punc)
 (md (v (dt n) (in (dt j n)) (n cd)))
 punc))
(((n n) (v ((n) (in ((n n) punc (dt n v n))))) punc))
((((n n) punc (((cd n) j) cc ((j n) (in (n n n n)))) punc)
 (v (v (((dt j n) (in (dt j j n))))))
 punc))
...
\endcode

Training is done by estimating the inside and outside probabilities of
the rules based on their current probabilities and a corpus.  The corpus
may optionally include internal bracketing, which is used by the training
algorithm to preclude some possible parses, hence making the training
typically faster (and sometimes more accurate).  After each training
pass the grammar rule probabilities are updated and the process starts
again.  Note that, depending on the number of training sentences, training
passes may take a very long time.  After each pass the cross entropy
for the current version of the grammar is printed.  This should normally
decrease until the "best" grammar has been found.

The program \ref scfg_train_manual takes an initial grammar and a corpus and,
by default, will train for 100 passes.  Because the desired number of passes
can take prohibitively long, an option is available to select
only an N percent chunk of the training set on each pass, cycling
through the other N percent chunks of the corpus on successive passes.
Experiments have shown that this not only produces much faster training,
but that the accuracy of the fully trained grammar is very similar.  Given
the choice between waiting 10 days or 48 hours for training, it is highly
recommended.

After every N passes the current state of the grammar may be saved; the
number of passes between saves is specified by the `-checkpoint`
option.  The grammar is saved in the output file with the pass
number appended.

Because partitioned training selects different partitions
depending on the pass number, you can optionally specify the starting
pass number, making it much easier to continue training after an
interruption.
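
As a sketch (apart from `-checkpoint`, the option names here are
assumptions; see \ref scfg_train_manual for the actual options), a training
run that saves the grammar every 5 passes might look like

    scfg_train -grammar wsj_15_19.gram -corpus train.data \
               -checkpoint 5 -o wsj_15_19_trained.gram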

Testing is done by the program \ref scfg_test_manual, which takes a grammar and a
corpus.  That corpus may be fully bracketed or not.  By default
the mean cross entropy from analyzing the test sentences is
printed, along with the number of sentences that fail to parse.

Alternatively, a *bracketing accuracy* may be calculated: this is the
percentage of phrases in a parsed sentence that are compatible with the
bracketing in the corpus example.
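
Following the same conventions as the parsing example below (the exact
options should be checked against \ref scfg_test_manual), a test run might
look like

    scfg_test -grammar wsj_15_19_trained.gram -corpus test.data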

The fourth program provides a mechanism for parsing one or more
sentences.  The corpus this time should contain no bracketing except
around the beginning and end of the sentence itself.  Two forms of parse
are produced: a full form with start and end points for each phrase,
the related non-terminal and the probability, and a simple form where
only the bracketing is given.  Note that only one parse (or none) is given.
For any phrase only the best example tree is given, though the
probability is given as the sum of all possible derivations of that
non-terminal for that phrase.

    scfg_parse -grammar wsj_15_19.gram -corpus news.data -brackets -o news.out

Note that the input must be strings of terminals for the given grammar.
For real parsing of real text it is likely that the grammar uses part of
speech tags as terminals while the data consists of actual words, not part of
speech tags.  If you want to parse text then you can use the Festival
script `festival/examples/scfg_parse_text`, which takes arbitrary
text, runs the part of speech tagger on it after standard tokenizing
rules, and parses the output, saving the parse to the specified file.

## WFSTs {#estngramwfst}

The speech tools contain a small but growing library of
basic functions for building and manipulating weighted finite
state transducers.  Although not complete, it already provides many of
the basic operations and compilers one needs for using these devices.

Given a WFST, the following operations are supported: \ref deterimise,
\ref minimize, \ref complement.

Given two WFSTs, the following operations are supported: \ref intersection,
\ref union, \ref difference, \ref concatenation and \ref compose.

In addition to these operations, compilers are provided for a number of
basic input formats: regular expressions, regular grammars, context-free
grammars (with depth restriction) and Kay/Kaplan/Koskenniemi two-level
morphology rules.

Still missing is a complete treatment of the weights through some
basic operations (e.g. minimization doesn't preserve weights).  Also,
techniques for learning WFSTs from data, or at least weighting existing
FSTs from data, will be added in later versions.

In general, input symbols are of the form X or X/Y, where X/Y denotes
an input/output symbol pair.  When a single symbol X is given it is
(except when using the wfst as a transducer) treated as X/X.  Thus
using single symbols will mostly cause the wfst mechanisms to act as
if they are finite state machines.

The two main programs are \ref wfst_build_manual and \ref wfst_run_manual.
\ref wfst_run_manual runs in both recognition and transduction mode.

\ref wfst_build_manual builds wfst's from description files or through
combination of existing ones.  The output may be optionally
determinized, or determinized and minimized.

### Kay/Kaplan/Koskenniemi morphological rules {#estngramwfstkaykaplan}

One of the major drivers of interest in wfsts has been their
use in morphology @cite kaplan94.  Hence we provide a method for
compiling Kay/Kaplan/Koskenniemi type (restricted) context-sensitive
rewrite rules.  The exact form is given in the example below.

This example covers basic letter-to-letter mappings, but also epenthesis for
e-insertion in words like `churches` and `boxes`.
\code{.lisp}
(KKrules
 engmorph
 (Alphabets
  ;; Input Alphabet
  (a b c d e f g h i j k l m n o p q r s t u v w x y z #)
  ;; Output Alphabet
  (a b c d e f g h i j k l m n o p q r s t u v w x y z + #)
 )
 (Sets
  (LET a b c d e f g h i j k l m n o p q r s t u v w x y z)
 )
 (Rules
 ;; The basic rules
 ( a => nil --- nil)
 ( b => nil --- nil)
 ( c => nil --- nil)
 ( d => nil --- nil)
 ( e => nil --- nil)
 ( f => nil --- nil)
 ( g => nil --- nil)
 ( h => nil --- nil)
 ( i => nil --- nil)
 ( j => nil --- nil)
 ( k => nil --- nil)
 ( l => nil --- nil)
 ( m => nil --- nil)
 ( n => nil --- nil)
 ( o => nil --- nil)
 ( p => nil --- nil)
 ( q => nil --- nil)
 ( r => nil --- nil)
 ( s => nil --- nil)
 ( t => nil --- nil)
 ( u => nil --- nil)
 ( v => nil --- nil)
 ( w => nil --- nil)
 ( x => nil --- nil)
 ( y => nil --- nil)
 ( z => nil --- nil)
 ( # => nil --- nil)
 ( _epsilon_/+ => (or LET _epsilon_/e) --- nil)

 ;; Epenthesis
 ;;   churches -> church+s
 ;;   boxes -> box+s
 (e/+ <=> (or (s h) (or s x z) (i/y) (c h))
            ---
            (s))
)
\endcode

A substantially larger example of morphographemic rules is distributed with
the Festival speech synthesis system in `festival/lib/engmorph.scm`.
This is based on the English description in @cite ritchie92 .

For a definition of the semantics of the basic types of rule
(surface coercion, context restriction and combined rules)
see @cite ritchie92.  Note that these rules are run in
parallel (the transducers are intersected), which can make the rules
interact in ways that the author might not intend.  A good rule
debugger is really required in order to write a substantial set
of rules in this formalism.

The rule compilation method used differs from Kay and Kaplan, and
also from @cite mohri96, and actually follows the method used
in @cite ritchie92, though in this case, unlike @cite ritchie92,
the technique is followed through to true wfst's.  The actual
compilation method should be described somewhere.

The above may be compiled into a wfst by the following command (assuming it
is in the file `engmorph.scm`):

    wfst_build -type kk -o engmorph.wfst -detmin engmorph.scm

This rule compiler has also been used to find equivalent transducers
for restricted forms of decision tree (following @cite sproat96) and
may be viewed as mostly stable.
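
The compiled transducer can then be exercised with \ref wfst_run_manual.
As a sketch only (the option names here are assumptions; see
\ref wfst_run_manual for the actual options), transducing surface forms
might look like

    wfst_run -wfst engmorph.wfst -transduce surface.forms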

### Regular expressions {#estngramwfstregex}

A simple method for building wfst's from regular expressions is also
provided.

An example is
\code{.lisp} ((a b c)
 (a b c)
 (and a (+ (or b c)) d))
\endcode

This consists of the input alphabet and the output alphabet,
followed by a LISP s-expression containing the regex.  The supported
operators are `and`, `or`, `+`, `*` and `not`.

Compilation is by the following command:

    wfst_build -type regex -o t1.wfst -detmin t1.regex

### Regular Grammars {#estngramwfstregulargrammars}

A compilation method also exists for regular grammars.  These grammars
do not need to be in a normal form; in fact no check is made that they
are regular, so if they contain centre-embedding the construction algorithm
will go into a loop and eventually run out of storage.  The
correction for that would be to add a depth limit, which would then
allow wfst approximations of context-free grammars, which would
be useful.

An example regular grammar is
\code{.lisp} (RegularGrammar
 engsuffixmorphosyntax
 ;; Sets
 (
 (V a e i o u y)
 (C b c d f g h j k l m n p q r s t v w x y z)
 )
 ;; Rules

 (
 ;; A word *must* have a suffix to be recognized
 (Word -> # Syls Suffix )
 (Word -> # Syls End )

 ;; This matches any string of characters that contains at least one vowel
 (Syls -> Syl Syls )
 (Syls -> Syl )
 (Syl -> Cs V Cs )
 (Cs -> C Cs )
 (Cs -> )

 (Suffix -> VerbSuffix )
 (Suffix -> NounSuffix )
 (Suffix -> AdjSuffix )
 (VerbSuffix -> VerbFinal End )
 (VerbSuffix -> VerbtoNoun NounSuffix )
 (VerbSuffix -> VerbtoNoun End )
 (VerbSuffix -> VerbtoAdj  AdjSuffix )
 (VerbSuffix -> VerbtoAdj End )
 (NounSuffix -> NounFinal End )
 (NounSuffix -> NountoNoun NounSuffix )
 (NounSuffix -> NountoNoun End )
 (NounSuffix -> NountoAdj AdjSuffix )
 (NounSuffix -> NountoAdj End )
 (NounSuffix -> NountoVerb VerbSuffix )
 (NounSuffix -> NountoVerb End )
 (AdjSuffix -> AdjFinal End )
 (AdjSuffix -> AdjtoAdj AdjSuffix)
 (AdjSuffix -> AdjtoAdj End)
 (AdjSuffix -> AdjtoAdv End)  ;; isn't any Adv to anything

 (End -> # )  ;; word boundary symbol *always* present

 (VerbFinal -> + e d)
 (VerbFinal -> + i n g)
 (VerbFinal -> + s)

 (VerbtoNoun -> + e r)
 (VerbtoNoun -> + e s s)
 (VerbtoNoun -> + a t i o n)
 (VerbtoNoun -> + i n g)
 (VerbtoNoun -> + m e n t)

 (VerbtoAdj -> + a b l e)

 (NounFinal -> + s)

 (NountoNoun -> + i s m)
 (NountoNoun -> + i s t)
 (NountoNoun -> + s h i p)

 (NountoAdj -> + l i k e)
 (NountoAdj -> + l e s s)
 (NountoAdj -> + i s h)
 (NountoAdj -> + o u s)

 (NountoVerb -> + i f y)
 (NountoVerb -> + i s e)
 (NountoVerb -> + i z e)

 (AdjFinal -> + e r)
 (AdjFinal -> + e s t)

 (AdjtoAdj -> + i s h)
 (AdjtoAdv -> + l y)
 (AdjtoNoun -> + n e s s)
 (AdjtoVerb -> + i s e)
 (AdjtoVerb -> + i z e)

)
)
\endcode

The above is a simple morpho-syntax for English.
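
By analogy with the `kk` and `regex` examples above, such a grammar can be
compiled with \ref wfst_build_manual; the `-type` value and file name below
are assumptions and should be checked against the manual:

    wfst_build -type rg -o engsuffix.wfst -detmin engsuffix.gram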


# Programs {#estngramprograms}

The following executable programs are provided for grammars:

  - scfg_make_manual: make a set of rules for a SCFG.
  - scfg_train_manual: train a set of rules for a SCFG.
  - scfg_parse_manual: perform parsing using a pre-trained grammar.
  - scfg_test_manual: test the output of a parser.
  - wfst_build_manual: build a weighted finite state transducer.
  - wfst_run_manual: run a weighted finite state transducer.
  - ngram_build_manual: train an n-gram from text.
  - ngram_test_manual: calculate the perplexity etc. of an n-gram.

# Classes {#estngramclasses}

  - EST_SCFG
  - EST_SCFG_Rule
  - EST_SCFG_traintest
  - EST_bracketed_string
  - EST_Ngrammar
  - EST_WFST
