README.md
1aho-corasick
2============
3A library for finding occurrences of many patterns at once with SIMD
4acceleration in some cases. This library provides multiple pattern
5search principally through an implementation of the
6[Aho-Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm),
7which builds a finite state machine for executing searches in linear time.
8Features include case insensitive matching, overlapping matches, fast searching
9via SIMD and optional full DFA construction and search & replace in streams.
10
11[![Build status](https://github.com/BurntSushi/aho-corasick/workflows/ci/badge.svg)](https://github.com/BurntSushi/aho-corasick/actions)
12[![](http://meritbadge.herokuapp.com/aho-corasick)](https://crates.io/crates/aho-corasick)
13
14Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).
15
16
17### Documentation
18
19https://docs.rs/aho-corasick
20
21
22### Usage
23
24Add this to your `Cargo.toml`:
25
26```toml
27[dependencies]
28aho-corasick = "0.7"
29```
30
31and this to your crate root (if you're using Rust 2015):
32
33```rust
34extern crate aho_corasick;
35```
36
37
38### Example: basic searching
39
40This example shows how to search for occurrences of multiple patterns
41simultaneously. Each match includes the pattern that matched along with the
42byte offsets of the match.
43
44```rust
45use aho_corasick::AhoCorasick;
46
47let patterns = &["apple", "maple", "Snapple"];
48let haystack = "Nobody likes maple in their apple flavored Snapple.";
49
50let ac = AhoCorasick::new(patterns);
51let mut matches = vec![];
52for mat in ac.find_iter(haystack) {
53 matches.push((mat.pattern(), mat.start(), mat.end()));
54}
55assert_eq!(matches, vec![
56 (1, 13, 18),
57 (0, 28, 33),
58 (2, 43, 50),
59]);
60```
61
62
63### Example: case insensitivity
64
65This is like the previous example, but matches `Snapple` case insensitively
66using `AhoCorasickBuilder`:
67
68```rust
69use aho_corasick::AhoCorasickBuilder;
70
71let patterns = &["apple", "maple", "snapple"];
72let haystack = "Nobody likes maple in their apple flavored Snapple.";
73
74let ac = AhoCorasickBuilder::new()
75 .ascii_case_insensitive(true)
76 .build(patterns);
77let mut matches = vec![];
78for mat in ac.find_iter(haystack) {
79 matches.push((mat.pattern(), mat.start(), mat.end()));
80}
81assert_eq!(matches, vec![
82 (1, 13, 18),
83 (0, 28, 33),
84 (2, 43, 50),
85]);
86```
87
88
89### Example: replacing matches in a stream
90
91This example shows how to execute a search and replace on a stream without
92loading the entire stream into memory first.
93
94```rust
95use aho_corasick::AhoCorasick;
96
97let patterns = &["fox", "brown", "quick"];
98let replace_with = &["sloth", "grey", "slow"];
99
100// In a real example, these might be `std::fs::File`s instead. All you need to
101// do is supply a pair of `std::io::Read` and `std::io::Write` implementations.
102let rdr = "The quick brown fox.";
103let mut wtr = vec![];
104
105let ac = AhoCorasick::new(patterns);
106ac.stream_replace_all(rdr.as_bytes(), &mut wtr, replace_with)
107 .expect("stream_replace_all failed");
108assert_eq!(b"The slow grey sloth.".to_vec(), wtr);
109```
110
111
112### Example: finding the leftmost first match
113
114In the textbook description of Aho-Corasick, its formulation is typically
115structured such that it reports all possible matches, even when they overlap
116with another. In many cases, overlapping matches may not be desired, such as
117the case of finding all successive non-overlapping matches like you might with
118a standard regular expression.
119
120Unfortunately the "obvious" way to modify the Aho-Corasick algorithm to do
121this doesn't always work in the expected way, since it will report matches as
122soon as they are seen. For example, consider matching the regex `Samwise|Sam`
123against the text `Samwise`. Most regex engines (that are Perl-like, or
124non-POSIX) will report `Samwise` as a match, but the standard Aho-Corasick
125algorithm modified for reporting non-overlapping matches will report `Sam`.
126
127A novel contribution of this library is the ability to change the match
128semantics of Aho-Corasick (without additional search time overhead) such that
129`Samwise` is reported instead. For example, here's the standard approach:
130
131```rust
132use aho_corasick::AhoCorasick;
133
134let patterns = &["Samwise", "Sam"];
135let haystack = "Samwise";
136
137let ac = AhoCorasick::new(patterns);
138let mat = ac.find(haystack).expect("should have a match");
139assert_eq!("Sam", &haystack[mat.start()..mat.end()]);
140```
141
142And now here's the leftmost-first version, which matches how a Perl-like
143regex will work:
144
145```rust
146use aho_corasick::{AhoCorasickBuilder, MatchKind};
147
148let patterns = &["Samwise", "Sam"];
149let haystack = "Samwise";
150
151let ac = AhoCorasickBuilder::new()
152 .match_kind(MatchKind::LeftmostFirst)
153 .build(patterns);
154let mat = ac.find(haystack).expect("should have a match");
155assert_eq!("Samwise", &haystack[mat.start()..mat.end()]);
156```
157
158In addition to leftmost-first semantics, this library also supports
159leftmost-longest semantics, which match the POSIX behavior of a regular
160expression alternation. See `MatchKind` in the docs for more details.
161
162
163### Minimum Rust version policy
164
165This crate's minimum supported `rustc` version is `1.28.0`.
166
167The current policy is that the minimum Rust version required to use this crate
168can be increased in minor version updates. For example, if `crate 1.0` requires
169Rust 1.20.0, then `crate 1.0.z` for all values of `z` will also require Rust
1701.20.0 or newer. However, `crate 1.y` for `y > 0` may require a newer minimum
171version of Rust.
172
173In general, this crate will be conservative with respect to the minimum
174supported version of Rust.
175
176
177### Future work
178
179Here are some plans for the future:
180
181* Assuming the current API is sufficient, I'd like to commit to it and release
182 a `1.0` version of this crate some time in the next 6-12 months.
183* Support stream searching with leftmost match semantics. Currently, only
184 standard match semantics are supported. Getting this right seems possible,
185 but is tricky since the match state needs to be propagated through multiple
186 searches. (With standard semantics, as soon as a match is seen the search
187 ends.)
188