This document describes the internal design of this crate, which is an object
lesson in what happens when you take a fairly simple old algorithm like
Aho-Corasick and make it fast and production ready.

The target audience of this document is Rust programmers who have some
familiarity with string searching. However, one does not need to know the
Aho-Corasick algorithm in order to read this (it is explained below). One
should, however, know what a trie is. (If you don't, go read its Wikipedia
article.)

The centerpiece of this crate is an implementation of Aho-Corasick. On its
own, Aho-Corasick isn't that complicated. The complex pieces come from the
different variants of Aho-Corasick implemented in this crate. Specifically,
they are:

* Aho-Corasick as an NFA, using dense transitions near the root with sparse
  transitions elsewhere.
* Aho-Corasick as a DFA. (An NFA is slower to search, but cheaper to construct
  and uses less memory.)
  * A DFA with pre-multiplied state identifiers. This saves a multiplication
    instruction in the core search loop.
  * A DFA with equivalence classes of bytes as the alphabet, instead of the
    traditional 256-byte alphabet. This shrinks the size of the DFA in memory,
    but adds an extra lookup in the core search loop to map the input byte to
    its equivalence class.
* The option to choose how state identifiers are represented, via one of
  u8, u16, u32, u64 or usize. This permits creating compact automatons when
  matching a small number of patterns.
* Supporting "standard" match semantics, along with its overlapping variant,
  in addition to leftmost-first and leftmost-longest semantics. The "standard"
  semantics are typically what you see in a textbook description of
  Aho-Corasick. However, Aho-Corasick is also useful as an optimization in
  regex engines, which often use leftmost-first or leftmost-longest semantics.
  Thus, it is useful to implement those semantics here. The "standard" and
  "leftmost" search algorithms are subtly different, and also require slightly
  different construction algorithms.
* Support for ASCII case insensitive matching.
* Support for accelerating searches when the patterns all start with a small
  number of fixed bytes. Or alternatively, when the patterns all contain a
  small number of rare bytes. (Searching for these bytes uses SIMD vectorized
  code courtesy of `memchr`.)
* Transparent support for alternative SIMD vectorized search routines for
  smaller numbers of literals, such as the Teddy algorithm. We call these
  "packed" search routines because they use SIMD. They can often be an order of
  magnitude faster than just Aho-Corasick, but don't scale as well.
* Support for searching streams. This can reuse most of the underlying code,
  but does require careful buffering support.
* Support for anchored searches, which permit efficient `is_prefix` checks for
  a large number of patterns.

When you combine all of this together along with trying to make everything as
fast as possible, what you end up with is entirely too much code with too much
`unsafe`. Alas, I was not smart enough to figure out how to reduce it. Instead,
we will explain it.


# Basics

The fundamental problem this crate is trying to solve is to determine the
occurrences of possibly many patterns in a haystack. The naive way to solve
this is to look for a match for each pattern at each position in the haystack:

    for i in 0..haystack.len():
      for p in patterns.iter():
        if haystack[i..].starts_with(p.bytes()):
          return Match(p.id(), i, i + p.bytes().len())

Those four lines are effectively all this crate does. The problem with those
four lines is that they are very slow, especially when you're searching for a
large number of patterns.

While there are many different algorithms available to solve this, a popular
one is Aho-Corasick. It's a common solution because it's not too hard to
implement, scales quite well even when searching for thousands of patterns and
is generally pretty fast. Aho-Corasick does well here because, regardless of
the number of patterns you're searching for, it always visits each byte in the
haystack exactly once. This means, generally speaking, adding more patterns to
an Aho-Corasick automaton does not make it slower. (Strictly speaking, however,
this is not true, since a larger automaton will make less effective use of the
CPU's cache.)

Aho-Corasick can be succinctly described as a trie with state transitions
between some of the nodes that efficiently instruct the search algorithm to
try matching alternative keys in the automaton. The trick is that these state
transitions are arranged such that each byte of input needs to be inspected
only once. These state transitions are typically called "failure transitions,"
because they instruct the searcher (the thing traversing the automaton while
reading from the haystack) what to do when a byte in the haystack does not
correspond to a valid transition in the current state of the trie.

More formally, a failure transition points to a state in the automaton that may
lead to a match whose prefix is a proper suffix of the path traversed through
the trie so far. (If no such proper suffix exists, then the failure transition
points back to the start state of the trie, effectively restarting the search.)
This is perhaps simpler to explain pictorially. For example, let's say we built
an Aho-Corasick automaton with the following patterns: 'abcd' and 'cef'. The
trie looks like this:

         a - S1 - b - S2 - c - S3 - d - S4*
        /
    S0 - c - S5 - e - S6 - f - S7*

where states marked with a `*` are match states (meaning, the search algorithm
should stop and report a match to the caller).

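Building such a trie is a straightforward insertion loop. Here is a sketch (the
`Trie` type and its `HashMap`-based layout are illustrative only; the crate's
actual representation uses dense and sparse transition tables):

```rust
use std::collections::HashMap;

// A toy trie: state 0 is the start; each state maps a byte to a child.
// Illustrative only; not this crate's actual representation.
struct Trie {
    transitions: Vec<HashMap<u8, usize>>,
    is_match: Vec<bool>,
}

impl Trie {
    fn new() -> Trie {
        Trie { transitions: vec![HashMap::new()], is_match: vec![false] }
    }

    fn add_pattern(&mut self, pattern: &[u8]) {
        let mut state = 0;
        for &b in pattern {
            state = match self.transitions[state].get(&b).copied() {
                Some(next) => next,
                None => {
                    // Allocate a fresh state and wire it in.
                    let next = self.transitions.len();
                    self.transitions.push(HashMap::new());
                    self.is_match.push(false);
                    self.transitions[state].insert(b, next);
                    next
                }
            };
        }
        // The state at the end of each pattern is a match state.
        self.is_match[state] = true;
    }
}
```

Inserting 'abcd' and then 'cef' into a fresh `Trie` produces exactly the eight
states in the diagram above, since the two patterns share no common prefix.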
So given this trie, it should be somewhat straightforward to see how it can
be used to determine whether any particular haystack *starts* with either
`abcd` or `cef`. It's easy to express this in code:

    fn has_prefix(trie: &Trie, haystack: &[u8]) -> bool {
      let mut state_id = trie.start();
      // If the empty pattern is in trie, then state_id is a match state.
      if trie.is_match(state_id) {
        return true;
      }
      for &b in haystack.iter() {
        state_id = match trie.next_state(state_id, b) {
          Some(id) => id,
          // If there was no transition for this state and byte, then we know
          // the haystack does not start with one of the patterns in our trie.
          None => return false,
        };
        if trie.is_match(state_id) {
          return true;
        }
      }
      false
    }

And that's pretty much it. All we do is move through the trie starting with the
bytes at the beginning of the haystack. If we find ourselves in a position
where we can't move, or if we've looked through the entire haystack without
seeing a match state, then we know the haystack does not start with any of the
patterns in the trie.
The meat of the Aho-Corasick algorithm is in how we add failure transitions to
our trie to keep searching efficient. Specifically, it permits us to not only
check whether a haystack *starts* with any one of a number of patterns, but
rather, whether the haystack contains any of a number of patterns *anywhere* in
the haystack.

As mentioned before, a failure transition connects a proper suffix of the path
traversed through the trie so far with a path that leads to a match that has a
prefix corresponding to that proper suffix. So in our case, for patterns `abcd`
and `cef`, with a haystack `abcef`, we want to transition to state `S5` (from
the diagram above) from `S3` upon seeing that the byte following `c` is not
`d`. Namely, the proper suffix in this example is `c`, which is a prefix of
`cef`. So the modified diagram looks like this:

         a - S1 - b - S2 - c - S3 - d - S4*
        /                      /
       /       ----------------
      /       /
    S0 - c - S5 - e - S6 - f - S7*

One thing that isn't shown in this diagram is that *all* states have a failure
transition, but only `S3` has a *non-trivial* failure transition. That is, all
other states have a failure transition back to the start state. So if our
haystack was `abzabcd`, then the searcher would transition back to `S0` after
seeing `z`, which effectively restarts the search. (Because there is no pattern
in our trie that has a prefix of `bz` or `z`.)

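Failure transitions are computed with a breadth-first traversal of the trie,
because a state's children derive their failure transitions from their
parent's, so parents must be processed first. Here is a sketch of that
computation (the `Vec<HashMap<u8, usize>>` trie representation is illustrative;
see `src/nfa.rs` for the real construction):

```rust
use std::collections::{HashMap, VecDeque};

// Compute a failure transition for every trie state (state 0 is the start)
// via a breadth-first traversal, so that a parent's failure transition is
// known before its children's are computed.
fn build_failure_transitions(trie: &[HashMap<u8, usize>]) -> Vec<usize> {
    let mut fail = vec![0; trie.len()];
    let mut queue = VecDeque::new();
    // States at depth 1 can only fail back to the start state.
    for (_, &child) in &trie[0] {
        queue.push_back(child);
    }
    while let Some(state) = queue.pop_front() {
        for (&b, &child) in &trie[state] {
            // Walk `state`'s failure chain until some state has a
            // transition on `b`, or bottom out at the start state.
            let mut f = fail[state];
            fail[child] = loop {
                if let Some(&next) = trie[f].get(&b) {
                    break next;
                }
                if f == 0 {
                    break 0;
                }
                f = fail[f];
            };
            queue.push_back(child);
        }
    }
    fail
}
```

Running this on the 'abcd'/'cef' trie yields exactly the diagram above: every
state fails back to `S0` except `S4`'s parent path at `S3`, which fails to `S5`.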
The code for traversing this *automaton* or *finite state machine* (it is no
longer just a trie) is not that much different from the `has_prefix` code
above:

    fn contains(fsm: &FiniteStateMachine, haystack: &[u8]) -> bool {
      let mut state_id = fsm.start();
      // If the empty pattern is in fsm, then state_id is a match state.
      if fsm.is_match(state_id) {
        return true;
      }
      for &b in haystack.iter() {
        // While the diagram above doesn't show this, we may wind up needing
        // to follow multiple failure transitions before we land on a state
        // in which we can advance. Therefore, when searching for the next
        // state, we need to loop until we don't see a failure transition.
        //
        // This loop terminates because failure transitions always point to
        // shallower states, bottoming out at the start state. Every
        // transition from the start state either points to another state,
        // or loops back to the start state.
        loop {
          match fsm.next_state(state_id, b) {
            Some(id) => {
              state_id = id;
              break;
            }
            // Unlike our code above, if there was no transition for this
            // state, then we don't quit. Instead, we look for this state's
            // failure transition and follow that instead.
            None => {
              state_id = fsm.next_fail_state(state_id);
            }
          };
        }
        if fsm.is_match(state_id) {
          return true;
        }
      }
      false
    }

Other than the complication around traversing failure transitions, this code
is still roughly "traverse the automaton with bytes from the haystack, and quit
when a match is seen."

And that concludes our section on the basics. While we didn't go deep into
how the automaton is built (see `src/nfa.rs`, which has detailed comments about
that), the basic structure of Aho-Corasick should be reasonably clear.


# NFAs and DFAs

There are generally two types of finite automata: non-deterministic finite
automata (NFA) and deterministic finite automata (DFA). The difference between
them is, principally, that an NFA can be in multiple states at once. This is
typically accomplished by things called _epsilon_ transitions, where the
automaton can move to a new state without consuming any bytes from the input.
(The other mechanism by which NFAs can be in more than one state is where the
same byte in a particular state transitions to multiple distinct states.) In
contrast, a DFA can only ever be in one state at a time. A DFA has no epsilon
transitions, and for any given state, a byte transitions to at most one other
state.

By this formulation, the Aho-Corasick automaton described in the previous
section is an NFA. This is because failure transitions are, effectively,
epsilon transitions. That is, whenever the automaton is in state `S`, it is
actually in the set of states that are reachable by recursively following
failure transitions from `S`. (This means that, for example, the start state
is always active since the start state is reachable via failure transitions
from any state in the automaton.)

NFAs have a lot of nice properties. They tend to be easier to construct, and
also tend to use less memory. However, their primary downside is that they are
typically slower to execute. For example, the code above showing how to search
with an Aho-Corasick automaton needs to potentially iterate through many
failure transitions for every byte of input. While this is a fairly small
amount of overhead, this can add up, especially if the automaton has a lot of
overlapping patterns with a lot of failure transitions.

A DFA's search code, by contrast, looks like this:

    fn contains(dfa: &DFA, haystack: &[u8]) -> bool {
      let mut state_id = dfa.start();
      // If the empty pattern is in dfa, then state_id is a match state.
      if dfa.is_match(state_id) {
        return true;
      }
      for &b in haystack.iter() {
        // An Aho-Corasick DFA *never* has a missing state that requires
        // failure transitions to be followed. One byte of input advances the
        // automaton by one state. Always.
        state_id = dfa.next_state(state_id, b);
        if dfa.is_match(state_id) {
          return true;
        }
      }
      false
    }

The search logic here is much simpler than for the NFA, and this tends to
translate into significant performance benefits as well, since there's a lot
less work being done for each byte in the haystack. How is this accomplished?
It's done by pre-following all failure transitions for all states for all bytes
in the alphabet, and then building a single state transition table. Building
this DFA can be much more costly than building the NFA, and use much more
memory, but the better performance can be worth it.

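That "pre-following" step can be sketched directly, assuming the NFA is given
as a trie plus a precomputed failure-transition table (both representations
here are illustrative, not the crate's actual ones):

```rust
use std::collections::HashMap;

// Build a dense DFA transition table from a trie and its failure
// transitions: `dfa[state * 256 + byte]` is the next state, with every
// failure transition resolved at build time instead of at search time.
fn build_dfa(trie: &[HashMap<u8, usize>], fail: &[usize]) -> Vec<usize> {
    let mut dfa = vec![0; trie.len() * 256];
    for state in 0..trie.len() {
        for byte in 0..=255u8 {
            // Follow the failure chain now, exactly as the NFA search
            // loop would at search time, and record where it lands.
            let mut s = state;
            dfa[state * 256 + byte as usize] = loop {
                if let Some(&next) = trie[s].get(&byte) {
                    break next;
                }
                if s == 0 {
                    break 0;
                }
                s = fail[s];
            };
        }
    }
    dfa
}
```

For the 'abcd'/'cef' example, the resulting table sends `S3` on `e` straight to
`S6` in a single lookup, where the NFA would first follow the failure
transition to `S5` and then advance.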
Users of this crate can actually choose between using an NFA or a DFA. By
default, an NFA is used, because it typically strikes the best balance between
space usage and search performance. But the DFA option is available for cases
where a little extra memory and upfront time building the automaton is okay.
For example, the `AhoCorasick::auto_configure` and
`AhoCorasickBuilder::auto_configure` methods will enable the DFA setting if
there are a small number of patterns.


# More DFA tricks

As described in the previous section, one of the downsides of using a DFA
is that it uses more memory and can take longer to build. One small way of
mitigating these concerns is to map the alphabet used by the automaton into
a smaller space. Typically, the alphabet of a DFA has 256 elements in it:
one element for each possible value that fits into a byte. However, in many
cases, one does not need the full alphabet. For example, if all patterns in an
Aho-Corasick automaton are ASCII letters, then this only uses up 52 distinct
bytes. As far as the automaton is concerned, the remaining 204 bytes are
indistinguishable from one another: they will never discriminate between a
match or a non-match. Therefore, in cases like that, the alphabet can be shrunk
to just 53 elements. One for each ASCII letter, and then another to serve as a
placeholder for every other unused byte.
In practice, this library doesn't quite compute the optimal set of equivalence
classes, but it's close enough in most cases. The key idea is that this then
allows the transition table for the DFA to be potentially much smaller. The
downside of doing this, however, is that since the transition table is defined
in terms of this smaller alphabet space, every byte in the haystack must be
re-mapped to this smaller space. This requires an additional 256-byte table.
In practice, this can lead to a small search time hit, but it can be difficult
to measure. Moreover, it can sometimes lead to faster search times for bigger
automata, since it could be the difference between more parts of the automaton
staying in the CPU cache or not.

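As an illustration of the idea (this is cruder than what the crate actually
computes, which derives classes from the transitions present in the automaton),
one can give every byte that appears in some pattern its own class, and lump
all remaining bytes into a single placeholder class:

```rust
// Map each byte to an equivalence class: bytes that appear in some pattern
// each get their own class, and every unused byte shares class 0. Returns
// the map and the total number of classes.
fn byte_classes(patterns: &[&[u8]]) -> ([u8; 256], usize) {
    let mut used = [false; 256];
    for p in patterns {
        for &b in *p {
            used[b as usize] = true;
        }
    }
    let mut classes = [0u8; 256];
    // Class 0 is a placeholder shared by all bytes in no pattern.
    let mut count = 1usize;
    for b in 0..256 {
        if used[b] {
            classes[b] = count as u8;
            count += 1;
        }
    }
    (classes, count)
}
```

Each row of the transition table then needs only `count` entries instead of
256, at the cost of one `classes[byte]` lookup per haystack byte searched.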
One other trick for DFAs employed by this crate is the notion of premultiplying
state identifiers. Specifically, the normal way to compute the next transition
in a DFA is via the following (assuming that the transition table is laid out
sequentially in memory, in row-major order, where the rows are states):

    next_state_id = dfa.transitions[current_state_id * 256 + current_byte]

However, since the value `256` is a fixed constant, we can actually premultiply
the state identifiers in the table when we build the table initially. Then, the
next transition computation simply becomes:

    next_state_id = dfa.transitions[current_state_id + current_byte]

This doesn't seem like much, but when this is being executed for every byte of
input that you're searching, saving that extra multiplication instruction can
add up.

The same optimization works even when equivalence classes are enabled, as
described above. The only difference is that the premultiplication is by the
total number of equivalence classes instead of 256.

There isn't much downside to premultiplying state identifiers, other than the
fact that you may need to choose a bigger integer representation than you would
otherwise. For example, if you don't premultiply state identifiers, then an
automaton that uses `u8` as a state identifier can hold up to 256 states.
However, if they are premultiplied, then it can only hold up to
`floor(256 / len(alphabet))` states. Thus premultiplication impacts how compact
your DFA can be. In practice, it's pretty rare to use `u8` as a state
identifier, so premultiplication is usually a good thing to do.

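The transformation can be sketched on a toy transition table. Here, a two-state
DFA over a 4-element alphabet is premultiplied so that each state identifier
doubles as its row's offset (the table and helper names are made up for
illustration):

```rust
// Premultiply every state identifier in a dense transition table by the
// alphabet size, so that an identifier doubles as its row's offset.
fn premultiply(dfa: &mut [usize], alphabet_len: usize) {
    for next in dfa.iter_mut() {
        *next *= alphabet_len;
    }
}

// The search loop can then index with a plain add: no multiplication.
fn next_state(dfa: &[usize], state_id: usize, class: u8) -> usize {
    dfa[state_id + class as usize]
}
```

After premultiplication, "state 1" is represented as `4` everywhere (its row
offset in a 4-element alphabet), which is exactly why the identifier type must
be able to hold `num_states * alphabet_len`.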
Both equivalence classes and premultiplication are tunable parameters via the
`AhoCorasickBuilder` type, and both are enabled by default.


# Match semantics

One of the more interesting things about this implementation of Aho-Corasick,
which (as far as this author knows) separates it from other implementations, is
that it natively supports leftmost-first and leftmost-longest match semantics.
Briefly, match semantics refer to the decision procedure by which searching
will disambiguate matches when there are multiple to choose from:

* **standard** match semantics emits matches as soon as they are detected by
  the automaton. This is typically equivalent to the textbook non-overlapping
  formulation of Aho-Corasick.
* **leftmost-first** match semantics means that 1) the next match is the match
  starting at the leftmost position and 2) among multiple matches starting at
  the same leftmost position, the match corresponding to the pattern provided
  first by the caller is reported.
* **leftmost-longest** is like leftmost-first, except when there are multiple
  matches starting at the same leftmost position, the pattern corresponding to
  the longest match is returned.

(The crate API documentation discusses these differences, with examples, in
more depth on the `MatchKind` type.)

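The two leftmost semantics can be pinned down with a naive reference
implementation. This is a specification of the behavior, not how the crate
searches; as described below, the real automata encode these rules in their
structure:

```rust
// Return the leftmost match as (pattern_id, start, end). At the leftmost
// matching position, leftmost-first keeps the earliest pattern provided,
// while leftmost-longest keeps the longest match.
fn leftmost_find(
    patterns: &[&[u8]],
    haystack: &[u8],
    longest: bool,
) -> Option<(usize, usize, usize)> {
    for start in 0..=haystack.len() {
        let mut best: Option<(usize, usize, usize)> = None;
        for (pid, p) in patterns.iter().enumerate() {
            if haystack[start..].starts_with(p) {
                match best {
                    // Leftmost-first: the first pattern to match here wins.
                    None => best = Some((pid, start, start + p.len())),
                    // Leftmost-longest: a strictly longer match displaces it.
                    Some((_, s, e)) if longest && p.len() > e - s => {
                        best = Some((pid, start, start + p.len()));
                    }
                    _ => {}
                }
            }
        }
        if best.is_some() {
            return best;
        }
    }
    None
}
```

For the patterns `Sam` and `Samwise` (in that order) against the haystack
`Samwise`, leftmost-first reports `Sam` while leftmost-longest reports
`Samwise`.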
The reason why supporting these match semantics is important is that it
gives the user more control over the match procedure. For example,
leftmost-first permits users to implement match priority by simply putting the
higher priority patterns first. Leftmost-longest, on the other hand, permits
finding the longest possible match, which might be useful when trying to find
words matching a dictionary. Additionally, regex engines often want to use
Aho-Corasick as an optimization when searching for an alternation of literals.
In order to preserve correct match semantics, regex engines typically can't use
the standard textbook definition directly, since they implement either
leftmost-first (Perl-like) or leftmost-longest (POSIX) match semantics.

Supporting leftmost semantics requires a couple key changes:

* Constructing the Aho-Corasick automaton changes a bit in both how the trie is
  constructed and how failure transitions are found. Namely, only a subset of
  the failure transitions are added. Specifically, only the failure transitions
  that either do not occur after a match or do occur after a match but preserve
  that match are kept. (More details on this can be found in `src/nfa.rs`.)
* The search algorithm changes slightly. Since we are looking for the leftmost
  match, we cannot quit as soon as a match is detected. Instead, after a match
  is detected, we must keep searching until either the end of the input or
  until a dead state is seen. (Dead states are not used for standard match
  semantics. Dead states mean that searching should stop after a match has been
  found.)

Other implementations of Aho-Corasick do support leftmost match semantics, but
they do it with more overhead at search time, or even worse, with a queue of
matches and sophisticated hijinks to disambiguate the matches. While our
construction algorithm becomes a bit more complicated, the correct match
semantics fall out from the structure of the automaton itself.


# Overlapping matches

One of the nice properties of an Aho-Corasick automaton is that it can report
all possible matches, even when they overlap with one another. In this mode,
the match semantics don't matter, since all possible matches are reported.
Overlapping searches work just like regular searches, except the state
identifier at which the previous search left off is carried over to the next
search, so that it can pick up where it left off. If there are additional
matches at that state, then they are reported before resuming the search.

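The set of matches an overlapping search must produce can be specified naively
(this quadratic reference code is for illustration; the automaton produces the
same set of matches, though possibly in a different order, while visiting each
haystack byte once):

```rust
// Every occurrence of every pattern, as (pattern_id, start, end), even
// when occurrences overlap. An overlapping Aho-Corasick search with
// standard semantics reports exactly this set of matches.
fn all_overlapping(patterns: &[&[u8]], haystack: &[u8]) -> Vec<(usize, usize, usize)> {
    let mut matches = Vec::new();
    for start in 0..haystack.len() {
        for (pid, p) in patterns.iter().enumerate() {
            if !p.is_empty() && haystack[start..].starts_with(p) {
                matches.push((pid, start, start + p.len()));
            }
        }
    }
    matches
}
```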
Enabling leftmost-first or leftmost-longest match semantics causes the
automaton to use a subset of all failure transitions, which means that
overlapping searches cannot be used. Therefore, if leftmost match semantics are
used, attempting to do an overlapping search will panic. Thus, to get
overlapping searches, the caller must use the default standard match semantics.
This behavior was chosen because there are only two alternatives, which were
deemed worse:

* Compile two automatons internally, one for standard semantics and one for
  the semantics requested by the caller (if not standard).
* Create a new type, distinct from the `AhoCorasick` type, which has different
  capabilities based on the configuration options.

The first is untenable because of the amount of memory used by the automaton.
The second increases the complexity of the API too much by adding too many
types that do similar things. It is conceptually much simpler to keep all
searching isolated to a single type. Callers may query whether the automaton
supports overlapping searches via the `AhoCorasick::supports_overlapping`
method.


# Stream searching

Since Aho-Corasick is an automaton, it is possible to do partial searches on
pieces of the haystack, and then resume that search on subsequent pieces of
the haystack. This is useful when the haystack you're trying to search is not
stored contiguously in memory, or if one does not want to read the entire
haystack into memory at once.

Currently, only standard semantics are supported for stream searching. This is
some of the more complicated code in this crate, and is something I would very
much like to improve. In particular, it currently has the restriction that it
must buffer at least enough of the haystack in memory in order to fit the
longest possible match. The difficulty in getting stream searching right is
that the implementation choices (such as the buffer size) often impact what the
API looks like and what it's allowed to do.


# Prefilters

In some cases, Aho-Corasick is not the fastest way to find matches containing
multiple patterns. Sometimes, the search can be accelerated using highly
optimized SIMD routines. For example, consider searching the following
patterns:

    Sherlock
    Moriarty
    Watson

It is plausible that it would be much faster to quickly look for occurrences of
the leading bytes, `S`, `M` or `W`, before trying to start searching via the
automaton. Indeed, this is exactly what this crate will do.

When there are more than three distinct starting bytes, this crate will
look for three distinct bytes occurring at any position in the patterns, while
preferring bytes that are heuristically determined to be rare over others. For
example:

    Abuzz
    Sanchez
    Vasquez
    Topaz
    Waltz

Here, we have more than 3 distinct starting bytes, but all of the patterns
contain `z`, which is typically a rare byte. In this case, the prefilter will
scan for `z`, back up a bit, and then execute the Aho-Corasick automaton.

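The "back up a bit" step can be sketched as follows. The scanning here is a
plain `position` where the crate delegates to `memchr`'s SIMD routines, and the
real offset bookkeeping is more careful than shown, but the shape is the same:

```rust
// Find the next candidate position for the automaton to resume at: scan
// for the rare byte, then back up by the largest offset at which that
// byte appears in any pattern, so that no match containing it is skipped.
fn prefilter_candidate(
    haystack: &[u8],
    at: usize,
    rare: u8,
    max_offset: usize,
) -> Option<usize> {
    let found = at + haystack[at..].iter().position(|&b| b == rare)?;
    Some(found.saturating_sub(max_offset))
}
```

For the patterns above, `z` occurs at offset 6 at the latest (in `Sanchez` and
`Vasquez`), so the automaton resumes six bytes before each `z` the scan finds.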
If all of that fails, then a packed multiple substring algorithm will be
attempted. Currently, the only algorithm available for this is Teddy, but more
may be added in the future. Teddy is unlike the above prefilters in that it
confirms its own matches, so when Teddy is active, it might not be necessary
for Aho-Corasick to run at all. (See `Automaton::leftmost_find_at_no_state_imp`
in `src/automaton.rs`.) However, the current Teddy implementation only works
on `x86_64` when SSSE3 or AVX2 are available, and moreover, only works
_well_ when there are a small number of patterns (say, less than 100). Teddy
also requires the haystack to be of a certain length (more than 16-34 bytes).
When the haystack is shorter than that, Rabin-Karp is used instead. (See
`src/packed/rabinkarp.rs`.)

There is a more thorough description of Teddy at
[`src/packed/teddy/README.md`](src/packed/teddy/README.md).