regex-automata
==============
A low level regular expression library that uses deterministic finite automata.
It supports a rich syntax with Unicode support, has extensive options for
configuring the best space vs time trade off for your use case and provides
support for cheap deserialization of automata for use in `no_std` environments.

[![Build status](https://github.com/BurntSushi/regex-automata/workflows/ci/badge.svg)](https://github.com/BurntSushi/regex-automata/actions)
[![](http://meritbadge.herokuapp.com/regex-automata)](https://crates.io/crates/regex-automata)

Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).


### Documentation

https://docs.rs/regex-automata


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
regex-automata = "0.1"
```

and this to your crate root (if you're using Rust 2015):

```rust
extern crate regex_automata;
```


### Example: basic regex searching

This example shows how to compile a regex using the default configuration
and then use it to find matches in a byte string:

```rust
use regex_automata::Regex;

let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
let text = b"2018-12-24 2016-10-08";
let matches: Vec<(usize, usize)> = re.find_iter(text).collect();
assert_eq!(matches, vec![(0, 10), (11, 21)]);
```

For more examples and information about the various knobs that can be turned,
please see the [docs](https://docs.rs/regex-automata).


### Support for `no_std`

This crate comes with a `std` feature that is enabled by default. When the
`std` feature is enabled, the API of this crate will include the facilities
necessary for compiling, serializing, deserializing and searching with regular
expressions. When the `std` feature is disabled, the API of this crate will
shrink such that it only includes the facilities necessary for deserializing
and searching with regular expressions.

The intended workflow for `no_std` environments is thus as follows:

* Write a program with the `std` feature that compiles and serializes a
  regular expression. Serialization should only happen after first converting
  the DFAs to use a fixed size state identifier instead of the default `usize`.
  You may also need to serialize both little and big endian versions of each
  DFA. (So that's 4 DFAs in total for each regex.)
* In your `no_std` environment, follow the examples above for deserializing
  your previously serialized DFAs into regexes. You can then search with them
  as you would any regex.

Deserialization can happen anywhere. For example, with bytes embedded into a
binary or with a file memory mapped at runtime.

Note that the
[`ucd-generate`](https://github.com/BurntSushi/ucd-generate)
tool will do the first step for you with its `dfa` or `regex` sub-commands.
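
Below is a minimal sketch of that workflow. It assumes the serialization and
deserialization APIs described in the crate documentation (`DenseDFA::to_u16`,
`to_bytes_native_endian`, `DenseDFA::from_bytes` and `Regex::from_dfas`);
double check the exact signatures against the version of the crate you use.

```rust
use regex_automata::{DenseDFA, Regex};

// Step 1 (requires the `std` feature): compile the regex, convert its DFAs
// to a fixed size state identifier and serialize them to raw bytes. Native
// endian is used here for brevity; serialize little and big endian versions
// if the bytes must be portable across targets.
fn compile_and_serialize() -> Result<(Vec<u8>, Vec<u8>), regex_automata::Error> {
    let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")?;
    let fwd = re.forward().to_u16()?.to_bytes_native_endian()?;
    let rev = re.reverse().to_u16()?.to_bytes_native_endian()?;
    Ok((fwd, rev))
}

// Step 2 (works without `std`): deserialize and search. `from_bytes` is
// `unsafe` because it does not verify that the bytes encode a valid DFA.
fn deserialize_and_search(fwd_bytes: &[u8], rev_bytes: &[u8]) -> Option<(usize, usize)> {
    let fwd: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(fwd_bytes) };
    let rev: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(rev_bytes) };
    let re = Regex::from_dfas(fwd, rev);
    re.find(b"launch date: 2018-12-24")
}
```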


### Cargo features

* `std` - **Enabled** by default. This enables the ability to compile finite
  automata. This requires the `regex-syntax` dependency. Without this feature
  enabled, finite automata can only be used for searching (using the approach
  described above).
* `transducer` - **Disabled** by default. This provides implementations of the
  `Automaton` trait found in the `fst` crate. This permits using finite
  automata generated by this crate to search finite state transducers. This
  requires the `fst` dependency.
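
For example, a `no_std` consumer that only needs deserialization and searching
might disable the default features, while the `fst` integration is opted into
explicitly. A sketch of the corresponding `Cargo.toml` entries:

```toml
[dependencies]
# Deserialization and searching only (no regex compilation), e.g. for no_std:
regex-automata = { version = "0.1", default-features = false }

# Alternatively, with the optional `fst` integration enabled:
# regex-automata = { version = "0.1", features = ["transducer"] }
```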


### Differences with the regex crate

The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a
general purpose regular expression engine. It aims to automatically balance low
compile times, fast search times and low memory usage, while also providing
a convenient API for users. In contrast, this crate provides a lower level
regular expression interface that is a bit less convenient while providing more
explicit control over memory usage and search times.

Here are some specific negative differences:

* **Compilation can take an exponential amount of time and space** in the size
  of the regex pattern. While most patterns do not exhibit worst case
  exponential time, such patterns do exist. For example, `[01]*1[01]{N}` will
  build a DFA with `2^(N+1)` states. For this reason, untrusted patterns should
  not be compiled with this library. (In the future, the API may expose an
  option to return an error if the DFA gets too big.)
* This crate does not support sub-match extraction, which can be achieved with
  the regex crate's "captures" API. This may be added in the future, but is
  unlikely.
* While the regex crate doesn't necessarily sport fast compilation times, the
  regexes in this crate are almost universally slow to compile, especially when
  they contain large Unicode character classes. For example, on my system,
  compiling `\w{3}` with byte classes enabled takes just over 1 second and
  almost 5MB of memory! (Compiling a sparse regex takes about the same time
  but only uses about 500KB of memory.) Conversely, compiling the same regex
  without Unicode support, e.g., `(?-u)\w{3}`, takes under 1 millisecond and
  less than 5KB of memory. For this reason, you should only use Unicode
  character classes if you absolutely need them! (See the sketch after this
  list.)
* This crate does not support regex sets.
* This crate does not support zero-width assertions such as `^`, `$`, `\b` or
  `\B`.
* As a lower level crate, this library does not do literal optimizations. In
  exchange, you get predictable performance regardless of input. The
  philosophy here is that literal optimizations should be applied at a higher
  level, although there is no easy support for this in the ecosystem yet.
* There is no `&str` API like in the regex crate. In this crate, all APIs
  operate on `&[u8]`. By default, match indices are guaranteed to fall on
  UTF-8 boundaries, unless `RegexBuilder::allow_invalid_utf8` is enabled.
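
The sketch below illustrates the Unicode class and `&[u8]` points above.
`RegexBuilder::allow_invalid_utf8` is the knob named in the last bullet; the
rest is an illustration to check against the crate documentation rather than a
verified benchmark.

```rust
use regex_automata::{Regex, RegexBuilder};

fn main() -> Result<(), regex_automata::Error> {
    // Unicode-aware \w: matches word characters from all of Unicode, but is
    // comparatively slow to compile and produces a large DFA.
    let unicode = Regex::new(r"\w{3}")?;
    assert!(unicode.is_match("δθπ".as_bytes()));

    // ASCII-only \w: compiles quickly and stays small, but no longer matches
    // non-ASCII word characters.
    let ascii = Regex::new(r"(?-u)\w{3}")?;
    assert!(!ascii.is_match("δθπ".as_bytes()));
    assert!(ascii.is_match(b"abc"));

    // All APIs operate on `&[u8]`. To build a pattern that can match bytes
    // that are not valid UTF-8, opt out of the UTF-8 guarantee explicitly.
    let bytes_re = RegexBuilder::new()
        .allow_invalid_utf8(true)
        .build(r"(?-u)foo.+bar")?;
    assert!(bytes_re.is_match(b"foo\xFFbar"));
    Ok(())
}
```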

With some of the downsides out of the way, here are some positive differences:

* Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply
  deserialized. Deserialization always takes constant time since searching can
  be performed directly on the raw serialized bytes of a DFA.
* This crate was specifically designed so that the searching phase of a DFA has
  minimal runtime requirements, and can therefore be used in `no_std`
  environments. While `no_std` environments cannot compile regexes, they can
  deserialize pre-compiled regexes.
* Since this crate builds DFAs ahead of time, it will generally out-perform
  the `regex` crate on equivalent tasks. The performance difference is likely
  not large. However, because of a complex set of optimizations in the regex
  crate (like literal optimizations), an accurate performance comparison may be
  difficult to do.
* Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search
  performance a small amount, but uses much less storage space. Potentially
  even less than what the regex crate uses.
* This crate exposes DFAs directly, such as `DenseDFA` and `SparseDFA`,
  which enables one to do less work in some cases. For example, if you only
  need the end of a match and not the start of a match, then you can use a DFA
  directly without building a `Regex`, which always requires a second DFA to
  find the start of a match.
* Aside from choosing between dense and sparse DFAs, there are several options
  for configuring the space usage vs search time trade off. These range from
  choosing a smaller state identifier representation to premultiplying state
  identifiers and splitting a DFA's alphabet into equivalence classes. Finally,
  DFA minimization is also provided, but can increase compilation times
  dramatically. (See the sketch after this list for these knobs and for
  searching with a DFA directly.)
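
Here is a sketch of those last two points, using the `dense::Builder` knobs and
the `DFA` trait as described in the crate documentation (`byte_classes`,
`premultiply`, `minimize`, `to_u16` and `find` are taken from those docs;
verify them against the version you depend on):

```rust
use regex_automata::{dense, DFA};

fn main() -> Result<(), regex_automata::Error> {
    // Equivalence (byte) classes and premultiplied state identifiers shrink
    // and speed up the transition table; minimization produces a smaller DFA
    // at the cost of (much) longer compile times.
    let dfa = dense::Builder::new()
        .byte_classes(true)
        .premultiply(true)
        .minimize(true)
        .build(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")?;

    // Optionally shrink the state identifier representation from the default
    // `usize` down to `u16`.
    let dfa = dfa.to_u16()?;

    // A single forward DFA only reports where a match *ends*. That is enough
    // for many uses and avoids building the second (reverse) DFA that a full
    // `Regex` needs in order to find the start of a match.
    assert_eq!(Some(10), dfa.find(b"2018-12-24 2016-10-08"));
    Ok(())
}
```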


### Future work

* Look into being smarter about generating NFA states for large Unicode
  character classes. These can create a lot of additional work for both the
  determinizer and the minimizer, and I suspect this is the key thing we'll
  want to improve if we want to make DFA compile times faster. I *believe*
  it's possible to potentially build minimal or nearly minimal NFAs for the
  special case of Unicode character classes by leveraging Daciuk's algorithms
  for building minimal automata in linear time for sets of strings. See
  https://blog.burntsushi.net/transducers/#construction for more details. The
  key adaptation I think we need to make is to modify the algorithm to operate
  on byte ranges instead of enumerating every codepoint in the set. Otherwise,
  it might not be worth doing.
* Add support for regex sets. It should be possible to do this by "simply"
  introducing more match states. I think we can also report the positions at
  each match, similar to how Aho-Corasick works. I think the long pole in the
  tent here is probably the API design work and arranging it so that we don't
  introduce extra overhead into the non-regex-set case without duplicating a
  lot of code. It seems doable.
* Stretch goal: support capturing groups by implementing "tagged" DFA
  (transducers). Laurikari's paper is the usual reference here, but Trofimovich
  has a much more thorough treatment here:
  http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf
  I've only read the paper once. I suspect it will require at least a few more
  read throughs before I understand it.
  See also: http://re2c.org/
* Possibly less ambitious goal: can we select a portion of Trofimovich's work
  to make small fixed length look-around work? It would be really nice to
  support `^`, `$` and `\b`, especially the Unicode variant of `\b` and a
  CRLF-aware `$`.
* Experiment with generating Rust code from a DFA. There is an early experiment
  in src/codegen.rs that is thoroughly bit-rotted. At the time, I was
  experimenting with whether or not codegen would significantly decrease the
  size of a DFA, since if you squint hard enough, it's kind of like a sparse
  representation. However, it didn't shrink as much as I thought it would, so
  I gave up. The other problem is that Rust doesn't support gotos, so I don't
  even know whether the "match on each state" in a loop thing will be fast
  enough. Either way, it's probably a good option to have. For one thing, it
  would be endian independent, whereas the serialization format of the DFAs in
  this crate is endian dependent (so you need two versions of every DFA, but
  you only need to compile one of them for any given arch).
* Experiment with unrolling the match loops and fill out the benchmarks.
* Add some kind of streaming API. I believe users of the library can already
  implement something for this outside of the crate, but it would be good to
  provide an official API. The key thing here is figuring out the API. I
  suspect we might want to support several variants.
* Make a decision on whether or not there is room for literal optimizations
  in this crate. My original intent was to not let this crate sink down into
  that very very very deep rabbit hole. But instead, we might want to provide
  some way for literal optimizations to hook into the match routines. The right
  path forward here is to probably build something outside of the crate and
  then see about integrating it. After all, users can implement their own
  match routines just as efficiently as what the crate provides.
* A key downside of DFAs is that they can take up a lot of memory and can be
  quite costly to build. Their worst case compilation time is O(2^n), where
  n is the number of NFA states. A paper by Yang and Prasanna (2011) actually
  seems to provide a way to characterize state blow up such that it is
  detectable. If we could know whether a regex will exhibit state explosion or
  not, then we could make an intelligent decision about whether to
  ahead-of-time compile a DFA.
  See: https://www.researchgate.net/profile/XU_Shutu/publication/229032602_Characterization_of_a_global_germplasm_collection_and_its_potential_utilization_for_analysis_of_complex_quantitative_traits_in_maize/links/02bfe50f914d04c837000000.pdf