This document describes the internal design of this crate, which is an object
lesson in what happens when you take a fairly simple old algorithm like
Aho-Corasick and make it fast and production ready.

The target audience of this document is Rust programmers who have some
familiarity with string searching. However, one does not need to know the
Aho-Corasick algorithm in order to read this (it is explained below). One
should, however, know what a trie is. (If you don't, go read its Wikipedia
article.)

The centerpiece of this crate is an implementation of Aho-Corasick. On its
own, Aho-Corasick isn't that complicated. The complex pieces come from the
different variants of Aho-Corasick implemented in this crate. Specifically,
they are:

* Aho-Corasick as an NFA, using dense transitions near the root with sparse
  transitions elsewhere.
* Aho-Corasick as a DFA. (An NFA is slower to search, but cheaper to construct
  and uses less memory.)
  * A DFA with pre-multiplied state identifiers. This saves a multiplication
    instruction in the core search loop.
  * A DFA with equivalence classes of bytes as the alphabet, instead of the
    traditional 256-byte alphabet. This shrinks the size of the DFA in memory,
    but adds an extra lookup in the core search loop to map the input byte to
    its equivalence class.
* The option to choose how state identifiers are represented, via one of
  u8, u16, u32, u64 or usize. This permits creating compact automatons when
  matching a small number of patterns.
* Supporting "standard" match semantics, along with its overlapping variant,
  in addition to leftmost-first and leftmost-longest semantics. The "standard"
  semantics are typically what you see in a textbook description of
  Aho-Corasick. However, Aho-Corasick is also useful as an optimization in
  regex engines, which often use leftmost-first or leftmost-longest semantics.
  Thus, it is useful to implement those semantics here. The "standard" and
  "leftmost" search algorithms are subtly different, and also require slightly
  different construction algorithms.
* Support for ASCII case insensitive matching.
* Support for accelerating searches when the patterns all start with a small
  number of fixed bytes, or, alternatively, when the patterns all contain a
  small number of rare bytes. (Searching for these bytes uses SIMD vectorized
  code courtesy of `memchr`.)
* Transparent support for alternative SIMD vectorized search routines for a
  small number of literals, such as the Teddy algorithm. We call these
  "packed" search routines because they use SIMD. They can often be an order
  of magnitude faster than just Aho-Corasick, but don't scale as well.
* Support for searching streams. This can reuse most of the underlying code,
  but does require careful buffering support.
* Support for anchored searches, which permit efficient `is_prefix` checks for
  a large number of patterns.

When you combine all of this together along with trying to make everything as
fast as possible, what you end up with is entirely too much code with too much
`unsafe`. Alas, I was not smart enough to figure out how to reduce it.
Instead, we will explain it.


# Basics

The fundamental problem this crate is trying to solve is to determine the
occurrences of possibly many patterns in a haystack.
The naive way to solve this is to look for a match for each pattern at each
position in the haystack:

    for i in 0..haystack.len():
        for p in patterns.iter():
            if haystack[i..].starts_with(p.bytes()):
                return Match(p.id(), i, i + p.bytes().len())

Those four lines are effectively all this crate does. The problem with those
four lines is that they are very slow, especially when you're searching for a
large number of patterns.

While there are many different algorithms available to solve this, a popular
one is Aho-Corasick. It's a common solution because it's not too hard to
implement, scales quite well even when searching for thousands of patterns,
and is generally pretty fast. Aho-Corasick does well here because, regardless
of the number of patterns you're searching for, it always visits each byte in
the haystack exactly once. This means, generally speaking, adding more
patterns to an Aho-Corasick automaton does not make it slower. (Strictly
speaking, however, this is not true, since a larger automaton will make less
effective use of the CPU's cache.)

Aho-Corasick can be succinctly described as a trie with state transitions
between some of the nodes that efficiently instruct the search algorithm to
try matching alternative keys in the automaton. The trick is that these state
transitions are arranged such that each byte of input needs to be inspected
only once. These state transitions are typically called "failure transitions,"
because they instruct the searcher (the thing traversing the automaton while
reading from the haystack) what to do when a byte in the haystack does not
correspond to a valid transition in the current state of the trie.

More formally, a failure transition points to a state in the automaton that
may lead to a match whose prefix is a proper suffix of the path traversed
through the trie so far. (If no such proper suffix exists, then the failure
transition points back to the start state of the trie, effectively restarting
the search.) This is perhaps simpler to explain pictorially. For example,
let's say we built an Aho-Corasick automaton with the following patterns:
'abcd' and 'cef'. The trie looks like this:

         a - S1 - b - S2 - c - S3 - d - S4*
        /
    S0 - c - S5 - e - S6 - f - S7*

where states marked with a `*` are match states (meaning, the search algorithm
should stop and report a match to the caller).

So given this trie, it should be somewhat straightforward to see how it can
be used to determine whether any particular haystack *starts* with either
`abcd` or `cef`. It's easy to express this in code:

    fn has_prefix(trie: &Trie, haystack: &[u8]) -> bool {
        let mut state_id = trie.start();
        // If the empty pattern is in the trie, then state_id is a match
        // state.
        if trie.is_match(state_id) {
            return true;
        }
        for &b in haystack.iter() {
            state_id = match trie.next_state(state_id, b) {
                Some(id) => id,
                // If there was no transition for this state and byte, then
                // we know the haystack does not start with one of the
                // patterns in our trie.
                None => return false,
            };
            if trie.is_match(state_id) {
                return true;
            }
        }
        false
    }

And that's pretty much it. All we do is move through the trie starting with
the bytes at the beginning of the haystack.
If we find ourselves in a position where we can't move, or if we've looked
through the entire haystack without seeing a match state, then we know the
haystack does not start with any of the patterns in the trie.

The meat of the Aho-Corasick algorithm is in how we add failure transitions
to our trie to keep searching efficient. Specifically, they permit us to not
only check whether a haystack *starts* with any one of a number of patterns,
but rather, whether the haystack contains any of a number of patterns
*anywhere* in the haystack.

As mentioned before, a failure transition connects a proper suffix of the
path traversed through the trie so far with a path that leads to a match
whose prefix corresponds to that proper suffix. So in our case, for patterns
`abcd` and `cef`, with a haystack `abcef`, we want to transition to state `S5`
(from the diagram above) from `S3` upon seeing that the byte following `c` is
not `d`. Namely, the proper suffix in this example is `c`, which is a prefix
of `cef`. So the modified diagram looks like this:

         a - S1 - b - S2 - c - S3 - d - S4*
        /                      /
       /     ------------------
      /     /
    S0 - c - S5 - e - S6 - f - S7*

One thing that isn't shown in this diagram is that *all* states have a failure
transition, but only `S3` has a *non-trivial* failure transition. That is, all
other states have a failure transition back to the start state. So if our
haystack was `abzabcd`, then the searcher would transition back to `S0` after
seeing `z`, which effectively restarts the search. (Because there is no
pattern in our trie that has a prefix of `bz` or `z`.)

The code for traversing this *automaton* or *finite state machine* (it is no
longer just a trie) is not that much different from the `has_prefix` code
above:

    fn contains(fsm: &FiniteStateMachine, haystack: &[u8]) -> bool {
        let mut state_id = fsm.start();
        // If the empty pattern is in the fsm, then state_id is a match
        // state.
        if fsm.is_match(state_id) {
            return true;
        }
        for &b in haystack.iter() {
            // While the diagram above doesn't show this, we may wind up
            // needing to follow multiple failure transitions before we land
            // on a state in which we can advance. Therefore, when searching
            // for the next state, we need to loop until we don't see a
            // failure transition.
            //
            // This loop terminates because the start state has a transition
            // defined for every byte: each byte either moves to another
            // state or loops back to the start state. So once we fall back
            // to the start state, this loop stops.
            loop {
                match fsm.next_state(state_id, b) {
                    Some(id) => {
                        state_id = id;
                        break;
                    }
                    // Unlike our code above, if there was no transition for
                    // this state, then we don't quit. Instead, we look for
                    // this state's failure transition and follow that
                    // instead.
                    None => {
                        state_id = fsm.next_fail_state(state_id);
                    }
                }
            }
            if fsm.is_match(state_id) {
                return true;
            }
        }
        false
    }

Other than the complication around traversing failure transitions, this code
is still roughly "traverse the automaton with bytes from the haystack, and
quit when a match is seen."

And that concludes our section on the basics. While we didn't go deep into
how the automaton is built (see `src/nfa.rs`, which has detailed comments
about that), the basic structure of Aho-Corasick should be reasonably clear.
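
That said, the core of the construction is worth a sketch. The following is a
minimal sketch of the classic breadth-first computation of failure
transitions, not the crate's actual code (which lives in `src/nfa.rs`). It
reuses the hypothetical `FiniteStateMachine` API from the snippets above, plus
two assumed helpers: `transitions`, assumed to return a state's defined
`(byte, state_id)` pairs by value, and `set_fail_state`.

    use std::collections::VecDeque;

    fn fill_failure_transitions(fsm: &mut FiniteStateMachine) {
        let start = fsm.start();
        // States at depth 1 can only fail back to the start state, since
        // the only proper suffix of a one-byte path is the empty path.
        let mut queue = VecDeque::new();
        for (_, child) in fsm.transitions(start) {
            fsm.set_fail_state(child, start);
            queue.push_back(child);
        }
        // Visit the remaining states in breadth-first order, so that a
        // state's failure transition is always computed before any of its
        // children's.
        while let Some(id) = queue.pop_front() {
            for (b, child) in fsm.transitions(id) {
                // Walk the failure chain of `id` until we find a state with
                // a transition on `b`. That state's `b`-successor is the
                // longest proper suffix of `child`'s path that is also a
                // path in the trie. This walk terminates because, as in the
                // search code above, the start state has a transition
                // defined for every byte.
                let mut fail = fsm.next_fail_state(id);
                let fail_child = loop {
                    match fsm.next_state(fail, b) {
                        Some(next) => break next,
                        None => fail = fsm.next_fail_state(fail),
                    }
                };
                fsm.set_fail_state(child, fail_child);
                // (A full implementation would also copy match information
                // from `fail_child` to `child` here, so that matches ending
                // at a proper suffix are reported.)
                queue.push_back(child);
            }
        }
    }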


# NFAs and DFAs

There are generally two types of finite automata: non-deterministic finite
automata (NFA) and deterministic finite automata (DFA). The difference between
them is, principally, that an NFA can be in multiple states at once. This is
typically accomplished by things called _epsilon_ transitions, where one could
move to a new state without consuming any bytes from the input. (The other
mechanism by which NFAs can be in more than one state is when the same byte in
a particular state transitions to multiple distinct states.) In contrast, a
DFA can only ever be in one state at a time. A DFA has no epsilon transitions,
and for any given state, a byte transitions to at most one other state.

By this formulation, the Aho-Corasick automaton described in the previous
section is an NFA. This is because failure transitions are, effectively,
epsilon transitions. That is, whenever the automaton is in state `S`, it is
actually in the set of states that are reachable by recursively following
failure transitions from `S`. (This means that, for example, the start state
is always active, since the start state is reachable via failure transitions
from any state in the automaton.)

NFAs have a lot of nice properties. They tend to be easier to construct, and
also tend to use less memory. However, their primary downside is that they
are typically slower to execute. For example, the code above showing how to
search with an Aho-Corasick automaton needs to potentially iterate through
many failure transitions for every byte of input. While this is a fairly
small amount of overhead, it can add up, especially if the automaton has a
lot of overlapping patterns with a lot of failure transitions.

A DFA's search code, by contrast, looks like this:

    fn contains(dfa: &DFA, haystack: &[u8]) -> bool {
        let mut state_id = dfa.start();
        // If the empty pattern is in the dfa, then state_id is a match
        // state.
        if dfa.is_match(state_id) {
            return true;
        }
        for &b in haystack.iter() {
            // An Aho-Corasick DFA *never* has a missing state that requires
            // failure transitions to be followed. One byte of input advances
            // the automaton by one state. Always.
            state_id = dfa.next_state(state_id, b);
            if dfa.is_match(state_id) {
                return true;
            }
        }
        false
    }

The search logic here is much simpler than for the NFA, and this tends to
translate into significant performance benefits as well, since there's a lot
less work being done for each byte in the haystack. How is this accomplished?
It's done by pre-following all failure transitions for all states for all
bytes in the alphabet, and then building a single state transition table.
Building this DFA can be much more costly than building the NFA, and it can
use much more memory, but the better search performance can be worth it.

Users of this crate can actually choose between using an NFA or a DFA. By
default, an NFA is used, because it typically strikes the best balance between
space usage and search performance. But the DFA option is available for cases
where a little extra memory and upfront time building the automaton is okay.
For example, the `AhoCorasick::auto_configure` and
`AhoCorasickBuilder::auto_configure` methods will enable the DFA setting if
there are a small number of patterns.
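
To make "pre-following all failure transitions" concrete, here is a minimal
sketch of how an NFA with failure transitions could be compiled down to a
dense transition table. The `NFA` and `DFA` types here are hypothetical
stand-ins for the real types in this crate, and the helpers `state_ids`,
`state_count`, `with_state_count` and `set_transition` are assumptions made
for the sketch:

    fn build_dfa(nfa: &NFA) -> DFA {
        let mut dfa = DFA::with_state_count(nfa.state_count());
        for state_id in nfa.state_ids() {
            for b in 0..=255u8 {
                // Resolve failure transitions at build time, exactly like
                // the inner loop of the NFA search code above, so that the
                // search loop never has to.
                let mut id = state_id;
                let next = loop {
                    match nfa.next_state(id, b) {
                        Some(next) => break next,
                        None => id = nfa.next_fail_state(id),
                    }
                };
                dfa.set_transition(state_id, b, next);
            }
        }
        dfa
    }

This also shows where the extra costs come from: the table has one entry per
state per byte, and filling each entry requires walking a failure chain once
at build time.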


# More DFA tricks

As described in the previous section, one of the downsides of using a DFA
is that it uses more memory and can take longer to build. One small way of
mitigating these concerns is to map the alphabet used by the automaton into
a smaller space. Typically, the alphabet of a DFA has 256 elements in it:
one element for each possible value that fits into a byte. However, in many
cases, one does not need the full alphabet. For example, if all patterns in
an Aho-Corasick automaton are ASCII letters, then this only uses up 52
distinct bytes. As far as the automaton is concerned, the remaining 204 bytes
are indistinguishable from one another: they will never discriminate between
a match and a non-match. Therefore, in cases like that, the alphabet can be
shrunk to just 53 elements: one for each ASCII letter, and then another to
serve as a placeholder for every other unused byte.

In practice, this library doesn't quite compute the optimal set of equivalence
classes, but it's close enough in most cases. The key idea is that this then
allows the transition table for the DFA to be potentially much smaller. The
downside of doing this, however, is that since the transition table is defined
in terms of this smaller alphabet space, every byte in the haystack must be
re-mapped to this smaller space. This requires an additional 256-byte table.
In practice, this can lead to a small search time hit, but it can be difficult
to measure. Moreover, it can sometimes lead to faster search times for bigger
automata, since it could be the difference between more parts of the automaton
staying in the CPU cache or not.

One other trick for DFAs employed by this crate is the notion of
premultiplying state identifiers. Specifically, the normal way to compute the
next transition in a DFA is via the following (assuming that the transition
table is laid out sequentially in memory, in row-major order, where the rows
are states):

    next_state_id = dfa.transitions[current_state_id * 256 + current_byte]

However, since the value `256` is a fixed constant, we can actually
premultiply the state identifiers in the table when we build the table
initially. Then, the next transition computation simply becomes:

    next_state_id = dfa.transitions[current_state_id + current_byte]

This doesn't seem like much, but when this is being executed for every byte of
input that you're searching, saving that extra multiplication instruction can
add up.

The same optimization works even when equivalence classes are enabled, as
described above. The only difference is that the premultiplication is by the
total number of equivalence classes instead of 256.

There isn't much downside to premultiplying state identifiers, other than the
fact that you may need to choose a bigger integer representation than you
would otherwise. For example, if you don't premultiply state identifiers,
then an automaton that uses `u8` as a state identifier can hold up to 256
states. However, if they are premultiplied, then it can only hold up to
`floor(256 / len(alphabet))` states. Thus, premultiplication impacts how
compact your DFA can be. In practice, it's pretty rare to use `u8` as a state
identifier, so premultiplication is usually a good thing to do.
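
Putting both tricks together, the hot transition lookup might look like the
following sketch. The `Dfa` struct and its field names are hypothetical,
chosen to illustrate the layout rather than to mirror this crate's actual
internals:

    struct Dfa {
        /// Maps every possible byte to its equivalence class.
        byte_classes: [u8; 256],
        /// The number of equivalence classes, i.e., the width of each row.
        alphabet_len: usize,
        /// Row-major transition table. State identifiers are premultiplied:
        /// a state's identifier is its row index times `alphabet_len`.
        transitions: Vec<u32>,
    }

    impl Dfa {
        #[inline(always)]
        fn next_state(&self, current_state_id: u32, byte: u8) -> u32 {
            // One extra lookup to shrink the alphabet...
            let class = self.byte_classes[byte as usize] as usize;
            debug_assert!(class < self.alphabet_len);
            // ...but no multiplication, since `current_state_id` is already
            // premultiplied by `alphabet_len`.
            self.transitions[current_state_id as usize + class]
        }
    }

With this layout, `alphabet_len` would be 53 in the ASCII letters example
above instead of 256, which shrinks every row of `transitions` accordingly.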

Both equivalence classes and premultiplication are tunable parameters via the
`AhoCorasickBuilder` type, and both are enabled by default.


# Match semantics

One of the more interesting things about this implementation of Aho-Corasick
that (as far as this author knows) separates it from other implementations is
that it natively supports leftmost-first and leftmost-longest match semantics.
Briefly, match semantics refer to the decision procedure by which searching
will disambiguate matches when there are multiple to choose from:

* **standard** match semantics emit matches as soon as they are detected by
  the automaton. This is typically equivalent to the textbook non-overlapping
  formulation of Aho-Corasick.
* **leftmost-first** match semantics mean that 1) the next match is the match
  starting at the leftmost position and 2) among multiple matches starting at
  the same leftmost position, the match corresponding to the pattern provided
  first by the caller is reported.
* **leftmost-longest** is like leftmost-first, except when there are multiple
  matches starting at the same leftmost position, the pattern corresponding
  to the longest match is returned.

(The crate API documentation discusses these differences, with examples, in
more depth on the `MatchKind` type.)

The reason why supporting these match semantics is important is because it
gives the user more control over the match procedure. For example,
leftmost-first permits users to implement match priority by simply putting
the higher priority patterns first. Leftmost-longest, on the other hand,
permits finding the longest possible match, which might be useful when trying
to find words matching a dictionary. Additionally, regex engines often want
to use Aho-Corasick as an optimization when searching for an alternation of
literals. In order to preserve correct match semantics, regex engines
typically can't use the standard textbook definition directly, since regex
engines will implement either leftmost-first (Perl-like) or leftmost-longest
(POSIX) match semantics.

Supporting leftmost semantics requires a couple of key changes:

* Constructing the Aho-Corasick automaton changes a bit in both how the trie
  is constructed and how failure transitions are found. Namely, only a subset
  of the failure transitions are added. Specifically, only the failure
  transitions that either do not occur after a match, or do occur after a
  match but preserve that match, are kept. (More details on this can be found
  in `src/nfa.rs`.)
* The search algorithm changes slightly. Since we are looking for the
  leftmost match, we cannot quit as soon as a match is detected. Instead,
  after a match is detected, we must keep searching until either the end of
  the input or until a dead state is seen. (Dead states are not used for
  standard match semantics. Dead states mean that searching should stop after
  a match has been found.)

Other implementations of Aho-Corasick do support leftmost match semantics,
but they do it with more overhead at search time, or even worse, with a queue
of matches and sophisticated hijinks to disambiguate the matches. While our
construction algorithm becomes a bit more complicated, the correct match
semantics fall out from the structure of the automaton itself.
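
As a concrete illustration of the difference between standard and
leftmost-first semantics, consider the following sketch, which uses the
crate's public API (method names as of this writing; treat the details as
illustrative):

    use aho_corasick::{AhoCorasickBuilder, MatchKind};

    fn main() {
        let haystack = "Samwise the brave";

        // Standard semantics report the first match the automaton detects,
        // which here is the shorter pattern `Sam`.
        let ac = AhoCorasickBuilder::new()
            .match_kind(MatchKind::Standard)
            .build(&["Samwise", "Sam"]);
        let mat = ac.find(haystack).unwrap();
        assert_eq!("Sam", &haystack[mat.start()..mat.end()]);

        // Leftmost-first semantics break the tie using pattern order, so
        // `Samwise`, which was provided first, wins.
        let ac = AhoCorasickBuilder::new()
            .match_kind(MatchKind::LeftmostFirst)
            .build(&["Samwise", "Sam"]);
        let mat = ac.find(haystack).unwrap();
        assert_eq!("Samwise", &haystack[mat.start()..mat.end()]);
    }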


# Overlapping matches

One of the nice properties of an Aho-Corasick automaton is that it can report
all possible matches, even when they overlap with one another. In this mode,
the match semantics don't matter, since all possible matches are reported.
Overlapping searches work just like regular searches, except the state
identifier at which the previous search left off is carried over to the next
search, so that it can pick up where it left off. If there are additional
matches at that state, then they are reported before resuming the search.

Enabling leftmost-first or leftmost-longest match semantics causes the
automaton to use a subset of all failure transitions, which means that
overlapping searches cannot be used. Therefore, if leftmost match semantics
are used, attempting to do an overlapping search will panic. Thus, to get
overlapping searches, the caller must use the default standard match
semantics. This behavior was chosen because there are only two alternatives,
which were deemed worse:

* Compile two automatons internally, one for standard semantics and one for
  the semantics requested by the caller (if not standard).
* Create a new type, distinct from the `AhoCorasick` type, which has different
  capabilities based on the configuration options.

The first is untenable because of the amount of memory used by the automaton.
The second increases the complexity of the API too much by adding too many
types that do similar things. It is conceptually much simpler to keep all
searching isolated to a single type. Callers may query whether the automaton
supports overlapping searches via the `AhoCorasick::supports_overlapping`
method.


# Stream searching

Since Aho-Corasick is an automaton, it is possible to do a partial search on
one piece of the haystack, and then resume that search on subsequent pieces.
This is useful when the haystack you're trying to search is not stored
contiguously in memory, or if one does not want to read the entire haystack
into memory at once.

Currently, only standard semantics are supported for stream searching. This
is some of the more complicated code in this crate, and is something I would
very much like to improve. In particular, it currently has the restriction
that it must buffer at least enough of the haystack in memory to fit the
longest possible match. The difficulty in getting stream searching right is
that the implementation choices (such as the buffer size) often impact what
the API looks like and what it's allowed to do.
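
Both kinds of searches are available on the same `AhoCorasick` type. The
following sketch shows them side by side, again using the crate's public API
as of this writing:

    use aho_corasick::AhoCorasick;

    fn main() {
        // Standard (default) semantics, as required for both overlapping
        // and stream searches.
        let ac = AhoCorasick::new(&["abcd", "cef", "bc"]);

        // Overlapping search: every match is reported, even when matches
        // overlap. Here, `bc` (1..3) and `cef` (2..5) both match `abcef`.
        for m in ac.find_overlapping_iter("abcef") {
            println!("pattern {} at {}..{}", m.pattern(), m.start(), m.end());
        }

        // Stream search: the haystack is anything that implements
        // `std::io::Read`, so it never needs to be in memory all at once.
        let reader = std::io::Cursor::new(b"abcef".to_vec());
        for result in ac.stream_find_iter(reader) {
            let m = result.expect("stream searches can fail on I/O errors");
            println!("pattern {} at {}..{}", m.pattern(), m.start(), m.end());
        }
    }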


# Prefilters

In some cases, Aho-Corasick is not the fastest way to find matches containing
multiple patterns. Sometimes, the search can be accelerated using highly
optimized SIMD routines. For example, consider searching the following
patterns:

    Sherlock
    Moriarty
    Watson

It is plausible that it would be much faster to quickly look for occurrences
of the leading bytes, `S`, `M` or `W`, before trying to start searching via
the automaton. Indeed, this is exactly what this crate will do.

When there are more than three distinct starting bytes, then this crate will
look for three distinct bytes occurring at any position in the patterns,
while preferring bytes that are heuristically determined to be rare over
others. For example:

    Abuzz
    Sanchez
    Vasquez
    Topaz
    Waltz

Here, we have more than 3 distinct starting bytes, but all of the patterns
contain `z`, which is typically a rare byte. In this case, the prefilter will
scan for `z`, back up a bit, and then execute the Aho-Corasick automaton.

If all of that fails, then a packed multiple substring algorithm will be
attempted. Currently, the only algorithm available for this is Teddy, but
more may be added in the future. Teddy is unlike the above prefilters in that
it confirms its own matches, so when Teddy is active, it might not be
necessary for Aho-Corasick to run at all. (See
`Automaton::leftmost_find_at_no_state_imp` in `src/automaton.rs`.) However,
the current Teddy implementation only works on `x86_64` and when SSSE3 or
AVX2 are available, and moreover, only works _well_ when there are a small
number of patterns (say, less than 100). Teddy also requires the haystack to
be of a certain length (more than 16-34 bytes). When the haystack is shorter
than that, Rabin-Karp is used instead. (See `src/packed/rabinkarp.rs`.)

There is a more thorough description of Teddy at
[`src/packed/teddy/README.md`](src/packed/teddy/README.md).
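
Finally, since Rabin-Karp is much simpler than Teddy, here is a minimal
sketch of the idea behind it. This is not the crate's implementation (see
`src/packed/rabinkarp.rs` for that), and for brevity it assumes that all
patterns have the same length, which the real code does not:

    /// Returns the first (pattern index, start offset) match, if any.
    fn rabin_karp(patterns: &[&[u8]], haystack: &[u8]) -> Option<(usize, usize)> {
        let len = patterns[0].len();
        if len == 0 || haystack.len() < len {
            return None;
        }
        // A simple multiplicative rolling hash.
        const BASE: u32 = 257;
        let hash = |bytes: &[u8]| {
            bytes.iter().fold(0u32, |h, &b| {
                h.wrapping_mul(BASE).wrapping_add(u32::from(b))
            })
        };
        // BASE^(len-1), used to remove the byte leaving the window.
        let pow = (1..len).fold(1u32, |p, _| p.wrapping_mul(BASE));
        let pattern_hashes: Vec<u32> =
            patterns.iter().map(|&p| hash(p)).collect();

        let mut h = hash(&haystack[..len]);
        let mut i = 0;
        loop {
            for (pid, &ph) in pattern_hashes.iter().enumerate() {
                // A hash hit is only a candidate: always confirm the match.
                if ph == h && &haystack[i..i + len] == patterns[pid] {
                    return Some((pid, i));
                }
            }
            if i + len >= haystack.len() {
                return None;
            }
            // Roll the window one byte to the right.
            h = h
                .wrapping_sub(u32::from(haystack[i]).wrapping_mul(pow))
                .wrapping_mul(BASE)
                .wrapping_add(u32::from(haystack[i + len]));
            i += 1;
        }
    }

Note that a hash collision only ever costs a false candidate, never a false
match, because every hash hit is confirmed against the actual pattern bytes.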