This document describes the internal design of this crate, which is an object
lesson in what happens when you take a fairly simple old algorithm like
Aho-Corasick and make it fast and production ready.

The target audience of this document is Rust programmers who have some
familiarity with string searching. However, one does not need to know the
Aho-Corasick algorithm in order to read this (it is explained below). One
should, however, know what a trie is. (If you don't, go read its Wikipedia
article.)

The centerpiece of this crate is an implementation of Aho-Corasick. On its
own, Aho-Corasick isn't that complicated. The complex pieces come from the
different variants of Aho-Corasick implemented in this crate. Specifically,
they are:

* Aho-Corasick as an NFA, using dense transitions near the root with sparse
  transitions elsewhere.
* Aho-Corasick as a DFA. (An NFA is slower to search, but cheaper to construct
  and uses less memory.)
  * A DFA with pre-multiplied state identifiers. This saves a multiplication
    instruction in the core search loop.
  * A DFA with equivalence classes of bytes as the alphabet, instead of the
    traditional 256-byte alphabet. This shrinks the size of the DFA in memory,
    but adds an extra lookup in the core search loop to map the input byte to
    its equivalence class.
* The option to choose how state identifiers are represented, via one of
  u8, u16, u32, u64 or usize. This permits creating compact automatons when
  matching a small number of patterns.
* Supporting "standard" match semantics, along with its overlapping variant,
  in addition to leftmost-first and leftmost-longest semantics. The "standard"
  semantics are typically what you see in a textbook description of
  Aho-Corasick. However, Aho-Corasick is also useful as an optimization in
  regex engines, which often use leftmost-first or leftmost-longest semantics.
  Thus, it is useful to implement those semantics here. The "standard" and
  "leftmost" search algorithms are subtly different, and also require slightly
  different construction algorithms.
* Support for ASCII case insensitive matching.
* Support for accelerating searches when the patterns all start with a small
  number of fixed bytes. Or alternatively, when the patterns all contain a
  small number of rare bytes. (Searching for these bytes uses SIMD vectorized
  code courtesy of `memchr`.)
* Transparent support for alternative SIMD vectorized search routines for
  smaller numbers of literals, such as the Teddy algorithm. We call these
  "packed" search routines because they use SIMD. They can often be an order of
  magnitude faster than just Aho-Corasick, but don't scale as well.
* Support for searching streams. This can reuse most of the underlying code,
  but does require careful buffering support.
* Support for anchored searches, which permit efficient `is_prefix` checks for
  a large number of patterns.

When you combine all of this together along with trying to make everything as
fast as possible, what you end up with is entirely too much code with too much
`unsafe`. Alas, I was not smart enough to figure out how to reduce it. Instead,
we will explain it.


# Basics

The fundamental problem this crate is trying to solve is to determine the
occurrences of possibly many patterns in a haystack. The naive way to solve
this is to look for a match for each pattern at each position in the haystack:

    for i in 0..haystack.len():
      for p in patterns.iter():
        if haystack[i..].starts_with(p.bytes()):
          return Match(p.id(), i, i + p.bytes().len())

Those four lines are effectively all this crate does. The problem with those
four lines is that they are very slow, especially when you're searching for a
large number of patterns.

While there are many different algorithms available to solve this, a popular
one is Aho-Corasick. It's a common solution because it's not too hard to
implement, scales quite well even when searching for thousands of patterns and
is generally pretty fast. Aho-Corasick does well here because, regardless of
the number of patterns you're searching for, it always visits each byte in the
haystack exactly once. This means, generally speaking, adding more patterns to
an Aho-Corasick automaton does not make it slower. (Strictly speaking, however,
this is not true, since a larger automaton will make less effective use of the
CPU's cache.)

Aho-Corasick can be succinctly described as a trie with state transitions
between some of the nodes that efficiently instruct the search algorithm to
try matching alternative keys in the automaton. The trick is that these state
transitions are arranged such that each byte of input needs to be inspected
only once. These state transitions are typically called "failure transitions,"
because they instruct the searcher (the thing traversing the automaton while
reading from the haystack) what to do when a byte in the haystack does not
correspond to a valid transition in the current state of the trie.

More formally, a failure transition points to a state in the automaton that may
lead to a match whose prefix is a proper suffix of the path traversed through
the trie so far. (If no such proper suffix exists, then the failure transition
points back to the start state of the trie, effectively restarting the search.)
This is perhaps simpler to explain pictorially. For example, let's say we built
an Aho-Corasick automaton with the following patterns: 'abcd' and 'cef'. The
trie looks like this:

         a - S1 - b - S2 - c - S3 - d - S4*
        /
    S0 - c - S5 - e - S6 - f - S7*

where states marked with a `*` are match states (meaning, the search algorithm
should stop and report a match to the caller).

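Building such a trie is a straightforward insertion loop. Here is a sketch (the
`Trie` type and its `HashMap`-based layout are illustrative only; the crate's
actual representation uses dense and sparse transition tables):

```rust
use std::collections::HashMap;

// A toy trie: state 0 is the start; each state maps a byte to a child.
// Illustrative only; not this crate's actual representation.
struct Trie {
    transitions: Vec<HashMap<u8, usize>>,
    is_match: Vec<bool>,
}

impl Trie {
    fn new() -> Trie {
        Trie { transitions: vec![HashMap::new()], is_match: vec![false] }
    }

    fn add_pattern(&mut self, pattern: &[u8]) {
        let mut state = 0;
        for &b in pattern {
            state = match self.transitions[state].get(&b).copied() {
                Some(next) => next,
                None => {
                    // Allocate a fresh state and wire it in.
                    let next = self.transitions.len();
                    self.transitions.push(HashMap::new());
                    self.is_match.push(false);
                    self.transitions[state].insert(b, next);
                    next
                }
            };
        }
        // The state at the end of each pattern is a match state.
        self.is_match[state] = true;
    }
}
```

Inserting 'abcd' and then 'cef' into a fresh `Trie` produces exactly the eight
states in the diagram above, since the two patterns share no common prefix.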
So given this trie, it should be somewhat straightforward to see how it can
be used to determine whether any particular haystack *starts* with either
`abcd` or `cef`. It's easy to express this in code:

    fn has_prefix(trie: &Trie, haystack: &[u8]) -> bool {
      let mut state_id = trie.start();
      // If the empty pattern is in trie, then state_id is a match state.
      if trie.is_match(state_id) {
        return true;
      }
      for &b in haystack.iter() {
        state_id = match trie.next_state(state_id, b) {
          Some(id) => id,
          // If there was no transition for this state and byte, then we know
          // the haystack does not start with one of the patterns in our trie.
          None => return false,
        };
        if trie.is_match(state_id) {
          return true;
        }
      }
      false
    }

And that's pretty much it. All we do is move through the trie starting with the
bytes at the beginning of the haystack. If we find ourselves in a position
where we can't move, or if we've looked through the entire haystack without
seeing a match state, then we know the haystack does not start with any of the
patterns in the trie.
The meat of the Aho-Corasick algorithm is in how we add failure transitions to
our trie to keep searching efficient. Specifically, it permits us to not only
check whether a haystack *starts* with any one of a number of patterns, but
rather, whether the haystack contains any of a number of patterns *anywhere* in
the haystack.

As mentioned before, a failure transition connects a proper suffix of the path
traversed through the trie so far with a path that leads to a match that has a
prefix corresponding to that proper suffix. So in our case, for patterns `abcd`
and `cef`, with a haystack `abcef`, we want to transition to state `S5` (from
the diagram above) from `S3` upon seeing that the byte following `c` is not
`d`. Namely, the proper suffix in this example is `c`, which is a prefix of
`cef`. So the modified diagram looks like this:

         a - S1 - b - S2 - c - S3 - d - S4*
        /                      /
       /       ----------------
      /       /
    S0 - c - S5 - e - S6 - f - S7*

One thing that isn't shown in this diagram is that *all* states have a failure
transition, but only `S3` has a *non-trivial* failure transition. That is, all
other states have a failure transition back to the start state. So if our
haystack was `abzabcd`, then the searcher would transition back to `S0` after
seeing `z`, which effectively restarts the search. (Because there is no pattern
in our trie that has a prefix of `bz` or `z`.)

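Failure transitions are computed with a breadth-first traversal of the trie,
because a state's children derive their failure transitions from their
parent's, so parents must be processed first. Here is a sketch of that
computation (the `Vec<HashMap<u8, usize>>` trie representation is illustrative;
see `src/nfa.rs` for the real construction):

```rust
use std::collections::{HashMap, VecDeque};

// Compute a failure transition for every trie state (state 0 is the start)
// via a breadth-first traversal, so that a parent's failure transition is
// known before its children's are computed.
fn build_failure_transitions(trie: &[HashMap<u8, usize>]) -> Vec<usize> {
    let mut fail = vec![0; trie.len()];
    let mut queue = VecDeque::new();
    // States at depth 1 can only fail back to the start state.
    for (_, &child) in &trie[0] {
        queue.push_back(child);
    }
    while let Some(state) = queue.pop_front() {
        for (&b, &child) in &trie[state] {
            // Walk `state`'s failure chain until some state has a
            // transition on `b`, or bottom out at the start state.
            let mut f = fail[state];
            fail[child] = loop {
                if let Some(&next) = trie[f].get(&b) {
                    break next;
                }
                if f == 0 {
                    break 0;
                }
                f = fail[f];
            };
            queue.push_back(child);
        }
    }
    fail
}
```

Running this on the 'abcd'/'cef' trie yields exactly the diagram above: every
state fails back to `S0` except `S4`'s parent path at `S3`, which fails to `S5`.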
The code for traversing this *automaton* or *finite state machine* (it is no
longer just a trie) is not that much different from the `has_prefix` code
above:

    fn contains(fsm: &FiniteStateMachine, haystack: &[u8]) -> bool {
      let mut state_id = fsm.start();
      // If the empty pattern is in fsm, then state_id is a match state.
      if fsm.is_match(state_id) {
        return true;
      }
      for &b in haystack.iter() {
        // While the diagram above doesn't show this, we may wind up needing
        // to follow multiple failure transitions before we land on a state
        // in which we can advance. Therefore, when searching for the next
        // state, we need to loop until we don't see a failure transition.
        //
        // This loop terminates because failure transitions always point to
        // shallower states, bottoming out at the start state. Every
        // transition from the start state either points to another state,
        // or loops back to the start state.
        loop {
          match fsm.next_state(state_id, b) {
            Some(id) => {
              state_id = id;
              break;
            }
            // Unlike our code above, if there was no transition for this
            // state, then we don't quit. Instead, we look for this state's
            // failure transition and follow that instead.
            None => {
              state_id = fsm.next_fail_state(state_id);
            }
          };
        }
        if fsm.is_match(state_id) {
          return true;
        }
      }
      false
    }

Other than the complication around traversing failure transitions, this code
is still roughly "traverse the automaton with bytes from the haystack, and quit
when a match is seen."

And that concludes our section on the basics. While we didn't go deep into
how the automaton is built (see `src/nfa.rs`, which has detailed comments about
that), the basic structure of Aho-Corasick should be reasonably clear.


# NFAs and DFAs

There are generally two types of finite automata: non-deterministic finite
automata (NFA) and deterministic finite automata (DFA). The difference between
them is, principally, that an NFA can be in multiple states at once. This is
typically accomplished by things called _epsilon_ transitions, where the
automaton can move to a new state without consuming any bytes from the input.
(The other mechanism by which NFAs can be in more than one state is where the
same byte in a particular state transitions to multiple distinct states.) In
contrast, a DFA can only ever be in one state at a time. A DFA has no epsilon
transitions, and for any given state, a byte transitions to at most one other
state.

By this formulation, the Aho-Corasick automaton described in the previous
section is an NFA. This is because failure transitions are, effectively,
epsilon transitions. That is, whenever the automaton is in state `S`, it is
actually in the set of states that are reachable by recursively following
failure transitions from `S`. (This means that, for example, the start state
is always active since the start state is reachable via failure transitions
from any state in the automaton.)

NFAs have a lot of nice properties. They tend to be easier to construct, and
also tend to use less memory. However, their primary downside is that they are
typically slower to execute. For example, the code above showing how to search
with an Aho-Corasick automaton needs to potentially iterate through many
failure transitions for every byte of input. While this is a fairly small
amount of overhead, this can add up, especially if the automaton has a lot of
overlapping patterns with a lot of failure transitions.

A DFA's search code, by contrast, looks like this:

    fn contains(dfa: &DFA, haystack: &[u8]) -> bool {
      let mut state_id = dfa.start();
      // If the empty pattern is in dfa, then state_id is a match state.
      if dfa.is_match(state_id) {
        return true;
      }
      for &b in haystack.iter() {
        // An Aho-Corasick DFA *never* has a missing state that requires
        // failure transitions to be followed. One byte of input advances the
        // automaton by one state. Always.
        state_id = dfa.next_state(state_id, b);
        if dfa.is_match(state_id) {
          return true;
        }
      }
      false
    }

The search logic here is much simpler than for the NFA, and this tends to
translate into significant performance benefits as well, since there's a lot
less work being done for each byte in the haystack. How is this accomplished?
It's done by pre-following all failure transitions for all states for all bytes
in the alphabet, and then building a single state transition table. Building
this DFA can be much more costly than building the NFA, and use much more
memory, but the better performance can be worth it.

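That "pre-following" step can be sketched directly, assuming the NFA is given
as a trie plus a precomputed failure-transition table (both representations
here are illustrative, not the crate's actual ones):

```rust
use std::collections::HashMap;

// Build a dense DFA transition table from a trie and its failure
// transitions: `dfa[state * 256 + byte]` is the next state, with every
// failure transition resolved at build time instead of at search time.
fn build_dfa(trie: &[HashMap<u8, usize>], fail: &[usize]) -> Vec<usize> {
    let mut dfa = vec![0; trie.len() * 256];
    for state in 0..trie.len() {
        for byte in 0..=255u8 {
            // Follow the failure chain now, exactly as the NFA search
            // loop would at search time, and record where it lands.
            let mut s = state;
            dfa[state * 256 + byte as usize] = loop {
                if let Some(&next) = trie[s].get(&byte) {
                    break next;
                }
                if s == 0 {
                    break 0;
                }
                s = fail[s];
            };
        }
    }
    dfa
}
```

For the 'abcd'/'cef' example, the resulting table sends `S3` on `e` straight to
`S6` in a single lookup, where the NFA would first follow the failure
transition to `S5` and then advance.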
Users of this crate can actually choose between using an NFA or a DFA. By
default, an NFA is used, because it typically strikes the best balance between
space usage and search performance. But the DFA option is available for cases
where a little extra memory and upfront time building the automaton is okay.
For example, the `AhoCorasick::auto_configure` and
`AhoCorasickBuilder::auto_configure` methods will enable the DFA setting if
there are a small number of patterns.


# More DFA tricks

As described in the previous section, one of the downsides of using a DFA
is that it uses more memory and can take longer to build. One small way of
mitigating these concerns is to map the alphabet used by the automaton into
a smaller space. Typically, the alphabet of a DFA has 256 elements in it:
one element for each possible value that fits into a byte. However, in many
cases, one does not need the full alphabet. For example, if all patterns in an
Aho-Corasick automaton are ASCII letters, then this only uses up 52 distinct
bytes. As far as the automaton is concerned, the remaining 204 bytes are
indistinguishable from one another: they will never discriminate between a
match or a non-match. Therefore, in cases like that, the alphabet can be shrunk
to just 53 elements. One for each ASCII letter, and then another to serve as a
placeholder for every other unused byte.
In practice, this library doesn't quite compute the optimal set of equivalence
classes, but it's close enough in most cases. The key idea is that this then
allows the transition table for the DFA to be potentially much smaller. The
downside of doing this, however, is that since the transition table is defined
in terms of this smaller alphabet space, every byte in the haystack must be
re-mapped to this smaller space. This requires an additional 256-byte table.
In practice, this can lead to a small search time hit, but it can be difficult
to measure. Moreover, it can sometimes lead to faster search times for bigger
automata, since it could be the difference between more parts of the automaton
staying in the CPU cache or not.

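As an illustration of the idea (this is cruder than what the crate actually
computes, which derives classes from the transitions present in the automaton),
one can give every byte that appears in some pattern its own class, and lump
all remaining bytes into a single placeholder class:

```rust
// Map each byte to an equivalence class: bytes that appear in some pattern
// each get their own class, and every unused byte shares class 0. Returns
// the map and the total number of classes.
fn byte_classes(patterns: &[&[u8]]) -> ([u8; 256], usize) {
    let mut used = [false; 256];
    for p in patterns {
        for &b in *p {
            used[b as usize] = true;
        }
    }
    let mut classes = [0u8; 256];
    // Class 0 is a placeholder shared by all bytes in no pattern.
    let mut count = 1usize;
    for b in 0..256 {
        if used[b] {
            classes[b] = count as u8;
            count += 1;
        }
    }
    (classes, count)
}
```

Each row of the transition table then needs only `count` entries instead of
256, at the cost of one `classes[byte]` lookup per haystack byte searched.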
One other trick for DFAs employed by this crate is the notion of premultiplying
state identifiers. Specifically, the normal way to compute the next transition
in a DFA is via the following (assuming that the transition table is laid out
sequentially in memory, in row-major order, where the rows are states):

    next_state_id = dfa.transitions[current_state_id * 256 + current_byte]

However, since the value `256` is a fixed constant, we can actually premultiply
the state identifiers in the table when we build the table initially. Then, the
next transition computation simply becomes:

    next_state_id = dfa.transitions[current_state_id + current_byte]

This doesn't seem like much, but when this is being executed for every byte of
input that you're searching, saving that extra multiplication instruction can
add up.

The same optimization works even when equivalence classes are enabled, as
described above. The only difference is that the premultiplication is by the
total number of equivalence classes instead of 256.

There isn't much downside to premultiplying state identifiers, other than the
fact that you may need to choose a bigger integer representation than you would
otherwise. For example, if you don't premultiply state identifiers, then an
automaton that uses `u8` as a state identifier can hold up to 256 states.
However, if they are premultiplied, then it can only hold up to
`floor(256 / len(alphabet))` states. Thus premultiplication impacts how compact
your DFA can be. In practice, it's pretty rare to use `u8` as a state
identifier, so premultiplication is usually a good thing to do.

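The transformation can be sketched on a toy transition table. Here, a two-state
DFA over a 4-element alphabet is premultiplied so that each state identifier
doubles as its row's offset (the table and helper names are made up for
illustration):

```rust
// Premultiply every state identifier in a dense transition table by the
// alphabet size, so that an identifier doubles as its row's offset.
fn premultiply(dfa: &mut [usize], alphabet_len: usize) {
    for next in dfa.iter_mut() {
        *next *= alphabet_len;
    }
}

// The search loop can then index with a plain add: no multiplication.
fn next_state(dfa: &[usize], state_id: usize, class: u8) -> usize {
    dfa[state_id + class as usize]
}
```

After premultiplication, "state 1" is represented as `4` everywhere (its row
offset in a 4-element alphabet), which is exactly why the identifier type must
be able to hold `num_states * alphabet_len`.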
Both equivalence classes and premultiplication are tunable parameters via the
`AhoCorasickBuilder` type, and both are enabled by default.


# Match semantics

One of the more interesting things about this implementation of Aho-Corasick,
which (as far as this author knows) separates it from other implementations, is
that it natively supports leftmost-first and leftmost-longest match semantics.
Briefly, match semantics refer to the decision procedure by which searching
will disambiguate matches when there are multiple to choose from:

* **standard** match semantics emits matches as soon as they are detected by
  the automaton. This is typically equivalent to the textbook non-overlapping
  formulation of Aho-Corasick.
* **leftmost-first** match semantics means that 1) the next match is the match
  starting at the leftmost position and 2) among multiple matches starting at
  the same leftmost position, the match corresponding to the pattern provided
  first by the caller is reported.
* **leftmost-longest** is like leftmost-first, except when there are multiple
  matches starting at the same leftmost position, the pattern corresponding to
  the longest match is returned.

(The crate API documentation discusses these differences, with examples, in
more depth on the `MatchKind` type.)

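The two leftmost semantics can be pinned down with a naive reference
implementation. This is a specification of the behavior, not how the crate
searches; as described below, the real automata encode these rules in their
structure:

```rust
// Return the leftmost match as (pattern_id, start, end). At the leftmost
// matching position, leftmost-first keeps the earliest pattern provided,
// while leftmost-longest keeps the longest match.
fn leftmost_find(
    patterns: &[&[u8]],
    haystack: &[u8],
    longest: bool,
) -> Option<(usize, usize, usize)> {
    for start in 0..=haystack.len() {
        let mut best: Option<(usize, usize, usize)> = None;
        for (pid, p) in patterns.iter().enumerate() {
            if haystack[start..].starts_with(p) {
                match best {
                    // Leftmost-first: the first pattern to match here wins.
                    None => best = Some((pid, start, start + p.len())),
                    // Leftmost-longest: a strictly longer match displaces it.
                    Some((_, s, e)) if longest && p.len() > e - s => {
                        best = Some((pid, start, start + p.len()));
                    }
                    _ => {}
                }
            }
        }
        if best.is_some() {
            return best;
        }
    }
    None
}
```

For the patterns `Sam` and `Samwise` (in that order) against the haystack
`Samwise`, leftmost-first reports `Sam` while leftmost-longest reports
`Samwise`.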
The reason why supporting these match semantics is important is that it
gives the user more control over the match procedure. For example,
leftmost-first permits users to implement match priority by simply putting the
higher priority patterns first. Leftmost-longest, on the other hand, permits
finding the longest possible match, which might be useful when trying to find
words matching a dictionary. Additionally, regex engines often want to use
Aho-Corasick as an optimization when searching for an alternation of literals.
In order to preserve correct match semantics, regex engines typically can't use
the standard textbook definition directly, since they implement either
leftmost-first (Perl-like) or leftmost-longest (POSIX) match semantics.

Supporting leftmost semantics requires a couple key changes:

* Constructing the Aho-Corasick automaton changes a bit in both how the trie is
  constructed and how failure transitions are found. Namely, only a subset of
  the failure transitions are added. Specifically, only the failure transitions
  that either do not occur after a match or do occur after a match but preserve
  that match are kept. (More details on this can be found in `src/nfa.rs`.)
* The search algorithm changes slightly. Since we are looking for the leftmost
  match, we cannot quit as soon as a match is detected. Instead, after a match
  is detected, we must keep searching until either the end of the input or
  until a dead state is seen. (Dead states are not used for standard match
  semantics. Dead states mean that searching should stop after a match has been
  found.)

Other implementations of Aho-Corasick do support leftmost match semantics, but
they do it with more overhead at search time, or even worse, with a queue of
matches and sophisticated hijinks to disambiguate the matches. While our
construction algorithm becomes a bit more complicated, the correct match
semantics fall out from the structure of the automaton itself.


# Overlapping matches

One of the nice properties of an Aho-Corasick automaton is that it can report
all possible matches, even when they overlap with one another. In this mode,
the match semantics don't matter, since all possible matches are reported.
Overlapping searches work just like regular searches, except the state
identifier at which the previous search left off is carried over to the next
search, so that it can pick up where it left off. If there are additional
matches at that state, then they are reported before resuming the search.

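The set of matches an overlapping search must produce can be specified naively
(this quadratic reference code is for illustration; the automaton produces the
same set of matches, though possibly in a different order, while visiting each
haystack byte once):

```rust
// Every occurrence of every pattern, as (pattern_id, start, end), even
// when occurrences overlap. An overlapping Aho-Corasick search with
// standard semantics reports exactly this set of matches.
fn all_overlapping(patterns: &[&[u8]], haystack: &[u8]) -> Vec<(usize, usize, usize)> {
    let mut matches = Vec::new();
    for start in 0..haystack.len() {
        for (pid, p) in patterns.iter().enumerate() {
            if !p.is_empty() && haystack[start..].starts_with(p) {
                matches.push((pid, start, start + p.len()));
            }
        }
    }
    matches
}
```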
Enabling leftmost-first or leftmost-longest match semantics causes the
automaton to use a subset of all failure transitions, which means that
overlapping searches cannot be used. Therefore, if leftmost match semantics are
used, attempting to do an overlapping search will panic. Thus, to get
overlapping searches, the caller must use the default standard match semantics.
This behavior was chosen because there are only two alternatives, which were
deemed worse:

* Compile two automatons internally, one for standard semantics and one for
  the semantics requested by the caller (if not standard).
* Create a new type, distinct from the `AhoCorasick` type, which has different
  capabilities based on the configuration options.

The first is untenable because of the amount of memory used by the automaton.
The second increases the complexity of the API too much by adding too many
types that do similar things. It is conceptually much simpler to keep all
searching isolated to a single type. Callers may query whether the automaton
supports overlapping searches via the `AhoCorasick::supports_overlapping`
method.


# Stream searching

Since Aho-Corasick is an automaton, it is possible to do partial searches on
pieces of the haystack, and then resume that search on subsequent pieces of
the haystack. This is useful when the haystack you're trying to search is not
stored contiguously in memory, or if one does not want to read the entire
haystack into memory at once.

Currently, only standard semantics are supported for stream searching. This is
some of the more complicated code in this crate, and is something I would very
much like to improve. In particular, it currently has the restriction that it
must buffer at least enough of the haystack in memory in order to fit the
longest possible match. The difficulty in getting stream searching right is
that the implementation choices (such as the buffer size) often impact what the
API looks like and what it's allowed to do.


# Prefilters

In some cases, Aho-Corasick is not the fastest way to find matches containing
multiple patterns. Sometimes, the search can be accelerated using highly
optimized SIMD routines. For example, consider searching the following
patterns:

    Sherlock
    Moriarty
    Watson

It is plausible that it would be much faster to quickly look for occurrences of
the leading bytes, `S`, `M` or `W`, before trying to start searching via the
automaton. Indeed, this is exactly what this crate will do.

When there are more than three distinct starting bytes, this crate will
look for three distinct bytes occurring at any position in the patterns, while
preferring bytes that are heuristically determined to be rare over others. For
example:

    Abuzz
    Sanchez
    Vasquez
    Topaz
    Waltz

Here, we have more than 3 distinct starting bytes, but all of the patterns
contain `z`, which is typically a rare byte. In this case, the prefilter will
scan for `z`, back up a bit, and then execute the Aho-Corasick automaton.

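The "back up a bit" step can be sketched as follows. The scanning here is a
plain `position` where the crate delegates to `memchr`'s SIMD routines, and the
real offset bookkeeping is more careful than shown, but the shape is the same:

```rust
// Find the next candidate position for the automaton to resume at: scan
// for the rare byte, then back up by the largest offset at which that
// byte appears in any pattern, so that no match containing it is skipped.
fn prefilter_candidate(
    haystack: &[u8],
    at: usize,
    rare: u8,
    max_offset: usize,
) -> Option<usize> {
    let found = at + haystack[at..].iter().position(|&b| b == rare)?;
    Some(found.saturating_sub(max_offset))
}
```

For the patterns above, `z` occurs at offset 6 at the latest (in `Sanchez` and
`Vasquez`), so the automaton resumes six bytes before each `z` the scan finds.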
If all of that fails, then a packed multiple substring algorithm will be
attempted. Currently, the only algorithm available for this is Teddy, but more
may be added in the future. Teddy is unlike the above prefilters in that it
confirms its own matches, so when Teddy is active, it might not be necessary
for Aho-Corasick to run at all. (See `Automaton::leftmost_find_at_no_state_imp`
in `src/automaton.rs`.) However, the current Teddy implementation only works
on `x86_64` when SSSE3 or AVX2 are available, and moreover, only works
_well_ when there are a small number of patterns (say, less than 100). Teddy
also requires the haystack to be of a certain length (more than 16-34 bytes).
When the haystack is shorter than that, Rabin-Karp is used instead. (See
`src/packed/rabinkarp.rs`.)

There is a more thorough description of Teddy at
[`src/packed/teddy/README.md`](src/packed/teddy/README.md).