![grex](logo.png)

<br>

[![build](https://github.com/pemistahl/grex/actions/workflows/build.yml/badge.svg)](https://github.com/pemistahl/grex/actions/workflows/build.yml)
[![dependency status](https://deps.rs/crate/grex/1.3.0/status.svg)](https://deps.rs/crate/grex/1.3.0)
[![codecov](https://codecov.io/gh/pemistahl/grex/branch/main/graph/badge.svg)](https://codecov.io/gh/pemistahl/grex)
[![lines of code](https://tokei.rs/b1/github/pemistahl/grex?category=code)](https://github.com/XAMPPRocky/tokei)
[![Downloads](https://img.shields.io/crates/d/grex.svg)](https://crates.io/crates/grex)

[![Docs.rs](https://docs.rs/grex/badge.svg)](https://docs.rs/grex)
[![Crates.io](https://img.shields.io/crates/v/grex.svg)](https://crates.io/crates/grex)
[![Lib.rs](https://img.shields.io/badge/lib.rs-v1.3.0-blue)](https://lib.rs/crates/grex)
[![license](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)

[![Linux Download](https://img.shields.io/badge/Linux%20Download-v1.3.0-blue?logo=Linux)](https://github.com/pemistahl/grex/releases/download/v1.3.0/grex-v1.3.0-x86_64-unknown-linux-musl.tar.gz)
[![MacOS Download](https://img.shields.io/badge/macOS%20Download-v1.3.0-blue?logo=Apple)](https://github.com/pemistahl/grex/releases/download/v1.3.0/grex-v1.3.0-x86_64-apple-darwin.tar.gz)
[![Windows Download](https://img.shields.io/badge/Windows%20Download-v1.3.0-blue?logo=Windows)](https://github.com/pemistahl/grex/releases/download/v1.3.0/grex-v1.3.0-x86_64-pc-windows-msvc.zip)

<br>

![grex demo](demo.gif)

<br>

## <a name="table-of-contents"></a> Table of Contents
1. [What does this tool do?](#what-does-tool-do)
2. [Do I still need to learn to write regexes then?](#learn-regex)
3. [Current features](#current-features)
4. [How to install?](#how-to-install)
  4.1 [The command-line tool](#how-to-install-cli)
  4.2 [The library](#how-to-install-library)
5. [How to use?](#how-to-use)
  5.1 [The command-line tool](#how-to-use-cli)
  5.2 [The library](#how-to-use-library)
  5.3 [Examples](#examples)
6. [How to build?](#how-to-build)
7. [How does it work?](#how-does-it-work)
8. [Contributions](#contribution)


## 1. <a name="what-does-tool-do"></a> What does this tool do? <sup>[Top ▲](#table-of-contents)</sup>

*grex* is a library as well as a command-line utility that is meant to simplify the often
complicated and tedious task of creating regular expressions. It does so by automatically
generating a single regular expression from user-provided test cases. The resulting
expression is guaranteed to match the test cases which it was generated from.

This project started as a Rust port of the JavaScript tool
[*regexgen*](https://github.com/devongovett/regexgen) written by
[Devon Govett](https://github.com/devongovett). Although a lot of further useful features
could be added to it, its development apparently ceased several years ago. The plan
is now to add these new features to *grex*, as Rust really shines when it comes to
command-line tools. *grex* offers all features that *regexgen* provides, and more.

The philosophy of this project is to generate, by default, the most specific regular expression
possible, one which matches exactly the given input and nothing else.
With the use of command-line flags (in the CLI tool) or preprocessing methods
(in the library), more generalized expressions can be created.

The produced expressions are [Perl-compatible regular expressions](https://www.pcre.org) which are also
compatible with the regular expression parser in Rust's [*regex* crate](https://lib.rs/crates/regex).
Other regular expression parsers or respective libraries from other programming languages
have not been tested so far, but they ought to be mostly compatible as well.

## 2. <a name="learn-regex"></a> Do I still need to learn to write regexes then? <sup>[Top ▲](#table-of-contents)</sup>

**Definitely, yes!** Using the standard settings, *grex* produces a regular expression that is guaranteed
to match only the test cases given as input and nothing else.
This has been verified by [property tests](https://github.com/pemistahl/grex/blob/main/tests/property_tests.rs).
However, if the conversion to shorthand character classes such as `\w` is enabled, the resulting regex matches
a much wider scope of test cases. Knowledge about the consequences of this conversion is essential for finding
a correct regular expression for your business domain.

*grex* uses an algorithm that tries to find the shortest possible regex for the given test cases.
Very often though, the resulting expression is still longer or more complex than it needs to be.
In such cases, a more compact or elegant regex can be created only by hand.
Also, every regular expression engine has different built-in optimizations. *grex* does not know anything
about those and therefore cannot optimize its regexes for a specific engine.

**So, please learn how to write regular expressions!** Currently, the best use case for *grex* is to find
an initial correct regex, which should then be inspected by hand to see whether further optimizations are possible.

## 3. <a name="current-features"></a> Current Features <sup>[Top ▲](#table-of-contents)</sup>
- literals
- character classes
- detection of common prefixes and suffixes
- detection of repeated substrings and conversion to `{min,max}` quantifier notation
- alternation using `|` operator
- optionality using `?` quantifier
- escaping of non-ASCII characters, with optional conversion of astral code points to surrogate pairs
- case-sensitive or case-insensitive matching
- capturing or non-capturing groups
- fully compliant with the newest [Unicode Standard 13.0](https://unicode.org/versions/Unicode13.0.0)
- fully compatible with [*regex* crate 1.3.5+](https://lib.rs/crates/regex)
- correctly handles graphemes consisting of multiple Unicode symbols
- reads input strings from the command-line or from a file
- produces more readable expressions indented on multiple lines using optional verbose mode
- optional syntax highlighting for nicer output in supported terminals

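To illustrate why graphemes need special care, here is a minimal sketch that uses only the Rust standard library. The helper `code_points` is a hypothetical name chosen for this illustration and is not part of *grex*. It lists the Unicode code points of a string, which shows that a single visible symbol such as `y̆` (written as `"y\u{306}"` in Rust) consists of two code points, `U+0079` and `U+0306`. A tool that works on individual code points instead of graphemes would split such a symbol apart.

```rust
/// Hypothetical helper (not part of grex): lists the Unicode code points
/// of `text` in `U+XXXX` notation.
fn code_points(text: &str) -> Vec<String> {
    text.chars()
        .map(|c| format!("U+{:04X}", c as u32))
        .collect()
}
```
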
## 4. <a name="how-to-install"></a> How to install? <sup>[Top ▲](#table-of-contents)</sup>

### 4.1 <a name="how-to-install-cli"></a> The command-line tool <sup>[Top ▲](#table-of-contents)</sup>

You can download the self-contained executable for your platform above and put it in a place of your choice.
Alternatively, pre-compiled 64-bit binaries are available within the package managers [Scoop](https://scoop.sh)
(for Windows), [Homebrew](https://brew.sh) (for macOS and Linux), [MacPorts](https://www.macports.org) (for macOS), and [Huber](https://github.com/innobead/huber) (for macOS, Linux and Windows).
[Raúl Piracés](https://github.com/piraces) has contributed a [Chocolatey Windows package](https://community.chocolatey.org/packages/grex).

*grex* is also hosted on [crates.io](https://crates.io/crates/grex),
the official Rust package registry. If you are a Rust developer and already have the Rust
toolchain installed, you can install it by compiling from source using
[*cargo*](https://doc.rust-lang.org/cargo/), the Rust package manager.
So the summary of your installation options is:

```
( brew | cargo | choco | huber | port | scoop ) install grex
```

### 4.2 <a name="how-to-install-library"></a> The library <sup>[Top ▲](#table-of-contents)</sup>

In order to use *grex* as a library, simply add it as a dependency to your `Cargo.toml` file:

```toml
[dependencies]
grex = "1.3.0"
```

## 5. <a name="how-to-use"></a> How to use? <sup>[Top ▲](#table-of-contents)</sup>

Detailed explanations of the available settings are provided in the [library section](#how-to-use-library).
All settings can be freely combined with each other.

### 5.1 <a name="how-to-use-cli"></a> The command-line tool <sup>[Top ▲](#table-of-contents)</sup>

Test cases are passed either directly (`grex a b c`) or from a file (`grex -f test_cases.txt`).
*grex* is able to receive its input from Unix pipelines as well, e.g. `cat test_cases.txt | grex -`.

The following help output lists all available flags and options:

```
$ grex -h

grex 1.3.0
© 2019-today Peter M. Stahl <pemistahl@gmail.com>
Licensed under the Apache License, Version 2.0
Downloadable from https://crates.io/crates/grex
Source code at https://github.com/pemistahl/grex

grex generates regular expressions from user-provided test cases.

USAGE:
    grex [FLAGS] [OPTIONS] <INPUT>... --file <FILE>

FLAGS:
    -d, --digits             Converts any Unicode decimal digit to \d
    -D, --non-digits         Converts any character which is not a Unicode decimal digit to \D
    -s, --spaces             Converts any Unicode whitespace character to \s
    -S, --non-spaces         Converts any character which is not a Unicode whitespace character to \S
    -w, --words              Converts any Unicode word character to \w
    -W, --non-words          Converts any character which is not a Unicode word character to \W
    -r, --repetitions        Detects repeated non-overlapping substrings and
                             converts them to {min,max} quantifier notation
    -e, --escape             Replaces all non-ASCII characters with unicode escape sequences
        --with-surrogates    Converts astral code points to surrogate pairs if --escape is set
    -i, --ignore-case        Performs case-insensitive matching, letters match both upper and lower case
    -g, --capture-groups     Replaces non-capturing groups by capturing ones
    -x, --verbose            Produces a nicer looking regular expression in verbose mode
        --no-start-anchor    Removes the caret anchor '^' from the resulting regular expression
        --no-end-anchor      Removes the dollar sign anchor '$' from the resulting regular expression
        --no-anchors         Removes the caret and dollar sign anchors from the resulting regular expression
    -c, --colorize           Provides syntax highlighting for the resulting regular expression
    -h, --help               Prints help information
    -v, --version            Prints version information

OPTIONS:
    -f, --file <FILE>                      Reads test cases on separate lines from a file
        --min-repetitions <QUANTITY>       Specifies the minimum quantity of substring repetitions
                                           to be converted if --repetitions is set [default: 1]
        --min-substring-length <LENGTH>    Specifies the minimum length a repeated substring must have
                                           in order to be converted if --repetitions is set [default: 1]

ARGS:
    <INPUT>...    One or more test cases separated by blank space
```

### 5.2 <a name="how-to-use-library"></a> The library <sup>[Top ▲](#table-of-contents)</sup>

#### 5.2.1 Default settings

Test cases are passed either from a collection via [`RegExpBuilder::from()`](https://docs.rs/grex/1.3.0/grex/struct.RegExpBuilder.html#method.from)
or from a file via [`RegExpBuilder::from_file()`](https://docs.rs/grex/1.3.0/grex/struct.RegExpBuilder.html#method.from_file).
If read from a file, each test case must be on a separate line. Lines may end with either a newline `\n` or a carriage
return followed by a line feed `\r\n`.

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["a", "aa", "aaa"]).build();
assert_eq!(regexp, "^a(?:aa?)?$");
```

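The line-ending rule for input files can be illustrated with the standard library alone. The following is only a sketch and not the file reader that *grex* uses internally; the names `split_test_cases` and `read_test_cases` are hypothetical. It relies on `str::lines()`, which accepts both `\n` and `\r\n` as line endings, so the content `"a\r\naa\naaa\n"` is split into the three test cases `a`, `aa` and `aaa`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: splits file content into one test case per line.
/// `str::lines()` treats both `\n` and `\r\n` as line endings.
fn split_test_cases(content: &str) -> Vec<String> {
    content.lines().map(String::from).collect()
}

/// Hypothetical helper: reads a whole file and splits it into test cases.
fn read_test_cases(path: &Path) -> io::Result<Vec<String>> {
    let content = fs::read_to_string(path)?;
    Ok(split_test_cases(&content))
}
```
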
#### 5.2.2 Convert to character classes

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["a", "aa", "123"])
    .with_conversion_of_digits()
    .with_conversion_of_words()
    .build();
assert_eq!(regexp, "^(\\d\\d\\d|\\w(?:\\w)?)$");
```

#### 5.2.3 Convert repeated substrings

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
    .with_conversion_of_repetitions()
    .build();
assert_eq!(regexp, "^(?:a{2}|(?:bc){2}|(?:def){3})$");
```

By default, *grex* converts every substring in this way that is at least one character long
and that is subsequently repeated at least once. You can customize these two parameters if you like.

In the following example, the test case `aa` is not converted to `a{2}` because the repeated substring
`a` has a length of 1, but the minimum substring length has been set to 2.

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
    .with_conversion_of_repetitions()
    .with_minimum_substring_length(2)
    .build();
assert_eq!(regexp, "^(?:aa|(?:bc){2}|(?:def){3})$");
```

With the minimum number of repetitions set to 2 in the next example, only the test case `defdefdef` is
converted because it is the only one whose substring is repeated twice.

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
    .with_conversion_of_repetitions()
    .with_minimum_repetitions(2)
    .build();
assert_eq!(regexp, "^(?:bcbc|aa|(?:def){3})$");
```

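How the two parameters interact can be sketched for a single string with the standard library alone. This is a simplified, greedy illustration and not the algorithm that *grex* implements: the function `collapse_repetitions` is a hypothetical name, it handles one test case at a time, and it does not escape regex metacharacters. Under these assumptions, it turns `aa` into `a{2}` with both parameters set to 1, leaves `aa` untouched with a minimum substring length of 2, and leaves `bcbc` untouched with a minimum of 2 repetitions.

```rust
/// Hypothetical, simplified sketch (not grex's implementation): collapses
/// consecutively repeated substrings of `input` into `{n}` quantifier notation.
/// A substring qualifies if it is at least `min_substring_length` characters
/// long and is repeated at least `min_repetitions` times after its first
/// occurrence. Regex metacharacters are NOT escaped here.
fn collapse_repetitions(
    input: &str,
    min_substring_length: usize,
    min_repetitions: usize,
) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::new();
    let mut i = 0;

    while i < chars.len() {
        // Best candidate at position i: (substring length, total occurrences).
        let mut best: Option<(usize, usize)> = None;
        let max_len = (chars.len() - i) / 2;

        for len in min_substring_length.max(1)..=max_len {
            let unit = &chars[i..i + len];
            let mut count = 1;
            while i + (count + 1) * len <= chars.len()
                && &chars[i + count * len..i + (count + 1) * len] == unit
            {
                count += 1;
            }
            // `count - 1` is the number of repetitions after the first occurrence.
            if count - 1 >= min_repetitions.max(1) {
                let covered = len * count;
                if best.map_or(true, |(l, c)| covered > l * c) {
                    best = Some((len, count));
                }
            }
        }

        match best {
            Some((len, count)) => {
                let unit: String = chars[i..i + len].iter().collect();
                if len == 1 {
                    out.push_str(&format!("{}{{{}}}", unit, count));
                } else {
                    out.push_str(&format!("(?:{}){{{}}}", unit, count));
                }
                i += len * count;
            }
            None => {
                out.push(chars[i]);
                i += 1;
            }
        }
    }
    out
}
```
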
#### 5.2.4 Escape non-ASCII characters

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["You smell like 💩."])
    .with_escaping_of_non_ascii_chars(false)
    .build();
assert_eq!(regexp, "^You smell like \\u{1f4a9}\\.$");
```

Old versions of JavaScript do not support Unicode escape sequences for the astral code planes
(range `U+010000` to `U+10FFFF`). In order to support these symbols in JavaScript regular
expressions, the conversion to surrogate pairs is necessary. More information on that matter
can be found [here](https://mathiasbynens.be/notes/javascript-unicode).

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["You smell like 💩."])
    .with_escaping_of_non_ascii_chars(true)
    .build();
assert_eq!(regexp, "^You smell like \\u{d83d}\\u{dca9}\\.$");
```

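Where the two surrogate values come from can be shown with the standard library's `char::encode_utf16`. The sketch below is an illustration only and not the code that *grex* uses; `escape_char` is a hypothetical helper name. For `U+1F4A9`, it produces `\u{1f4a9}` without surrogate pairs and `\u{d83d}\u{dca9}` with them, matching the two examples above. Code points up to `U+FFFF` are never split.

```rust
/// Hypothetical helper (not part of grex): renders a non-ASCII character as a
/// `\u{...}` escape sequence. If `use_surrogate_pairs` is set, astral code
/// points (above U+FFFF) are written as their two UTF-16 surrogates.
fn escape_char(c: char, use_surrogate_pairs: bool) -> String {
    if c.is_ascii() {
        return c.to_string();
    }
    if use_surrogate_pairs && (c as u32) > 0xFFFF {
        // `encode_utf16` yields the high and the low surrogate for astral code points.
        let mut buffer = [0u16; 2];
        c.encode_utf16(&mut buffer)
            .iter()
            .map(|unit| format!("\\u{{{:x}}}", unit))
            .collect()
    } else {
        format!("\\u{{{:x}}}", c as u32)
    }
}
```
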
#### 5.2.5 Case-insensitive matching

The regular expressions that *grex* generates are case-sensitive by default.
Case-insensitive matching can be enabled like so:

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["big", "BIGGER"])
    .with_case_insensitive_matching()
    .build();
assert_eq!(regexp, "(?i)^big(?:ger)?$");
```

#### 5.2.6 Capturing Groups

Non-capturing groups are used by default.
Extending the previous example, you can switch to capturing groups instead.

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["big", "BIGGER"])
    .with_case_insensitive_matching()
    .with_capturing_groups()
    .build();
assert_eq!(regexp, "(?i)^big(ger)?$");
```

#### 5.2.7 Verbose mode

If you find the generated regular expression hard to read, you can enable verbose mode.
The expression is then put on multiple lines and indented to make it more pleasant to the eyes.

```rust
use grex::RegExpBuilder;
use indoc::indoc;

let regexp = RegExpBuilder::from(&["a", "b", "bcd"])
    .with_verbose_mode()
    .build();

assert_eq!(regexp, indoc!(
    r#"
    (?x)
    ^
      (?:
        b
        (?:
          cd
        )?
        |
        a
      )
    $"#
));
```

#### 5.2.8 Disable anchors

By default, the anchors `^` and `$` are put around every generated regular expression in order
to ensure that it matches only the test cases given as input. Often enough, however, it is
desired to use the generated pattern as part of a larger one. For this purpose, the anchors
can be disabled, either separately or both of them.

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["a", "aa", "aaa"])
    .without_anchors()
    .build();
assert_eq!(regexp, "a(?:aa?)?");
```

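An anchor-free pattern can then be spliced into a larger expression with plain string formatting. The following sketch uses the standard library only; `embed_fragment` is a hypothetical helper, and the prefix and suffix are assumed to be valid regex syntax already. Wrapping the fragment in a non-capturing group is a precaution in case it contains a top-level alternation, which would otherwise change the meaning of the surrounding pattern.

```rust
/// Hypothetical helper (not part of grex): embeds an anchor-free `fragment`
/// between `prefix` and `suffix` and anchors the combined pattern.
/// The non-capturing group keeps a top-level `|` inside the fragment local.
fn embed_fragment(prefix: &str, fragment: &str, suffix: &str) -> String {
    format!("^{}(?:{}){}$", prefix, fragment, suffix)
}
```
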
#### 5.2.9 Syntax highlighting

⚠ The method `with_syntax_highlighting()` may only be used if the resulting regular expression is meant to
be printed to the console. It is mainly meant to be used for the command-line tool output.
The regex string representation returned from enabling this setting cannot be fed into the
[*regex* crate](https://crates.io/crates/regex).

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["a", "aa", "123"])
    .with_syntax_highlighting()
    .build();
```

### 5.3 <a name="examples"></a> Examples <sup>[Top ▲](#table-of-contents)</sup>

The following examples show the various supported regex syntax features:

```
$ grex a b c
^[a-c]$

$ grex a c d e f
^[ac-f]$

$ grex a b x de
^(?:de|[abx])$

$ grex abc bc
^a?bc$

$ grex a b bc
^(?:bc?|a)$

$ grex [a-z]
^\[a\-z\]$

$ grex -r b ba baa baaa
^b(?:a{1,3})?$

$ grex -r b ba baa baaaa
^b(?:a{1,2}|a{4})?$

$ grex y̆ a z
^(?:y̆|[az])$
Note:
Grapheme y̆ consists of two Unicode symbols:
U+0079 (Latin Small Letter Y)
U+0306 (Combining Breve)

$ grex "I ♥ cake" "I ♥ cookies"
^I ♥ c(?:ookies|ake)$
Note:
Input containing blank space must be
surrounded by quotation marks.
```

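The character classes in the first examples (`[a-c]`, `[ac-f]`, `[abx]`) suggest how single characters are merged into ranges. The sketch below reproduces that output shape with the standard library only. It is an assumption-laden illustration, not the code that *grex* uses: `to_char_class` is a hypothetical name, a range is written only for runs of three or more consecutive characters (as the examples above suggest), and characters with a special meaning inside a class are not escaped.

```rust
/// Hypothetical helper (not part of grex): merges single characters into a
/// regex character class, writing runs of three or more consecutive
/// characters as a range. Special characters are NOT escaped.
fn to_char_class(mut chars: Vec<char>) -> String {
    chars.sort_unstable();
    chars.dedup();

    let mut class = String::from("[");
    let mut i = 0;
    while i < chars.len() {
        let start = chars[i];
        let mut end = start;
        // Extend the run while the next character is the direct successor.
        while i + 1 < chars.len() && chars[i + 1] as u32 == end as u32 + 1 {
            i += 1;
            end = chars[i];
        }
        class.push(start);
        if end as u32 - start as u32 >= 2 {
            class.push('-');
        }
        if end != start {
            class.push(end);
        }
        i += 1;
    }
    class.push(']');
    class
}
```
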
The string `"I ♥♥♥ 36 and ٣ and 💩💩."` serves as input for the following examples using the command-line notation:

```
$ grex <INPUT>
^I ♥♥♥ 36 and ٣ and 💩💩\.$

$ grex -e <INPUT>
^I \u{2665}\u{2665}\u{2665} 36 and \u{663} and \u{1f4a9}\u{1f4a9}\.$

$ grex -e --with-surrogates <INPUT>
^I \u{2665}\u{2665}\u{2665} 36 and \u{663} and \u{d83d}\u{dca9}\u{d83d}\u{dca9}\.$

$ grex -d <INPUT>
^I ♥♥♥ \d\d and \d and 💩💩\.$

$ grex -s <INPUT>
^I\s♥♥♥\s36\sand\s٣\sand\s💩💩\.$

$ grex -w <INPUT>
^\w ♥♥♥ \w\w \w\w\w \w \w\w\w 💩💩\.$

$ grex -D <INPUT>
^\D\D\D\D\D\D36\D\D\D\D\D٣\D\D\D\D\D\D\D\D$

$ grex -S <INPUT>
^\S \S\S\S \S\S \S\S\S \S \S\S\S \S\S\S$

$ grex -dsw <INPUT>
^\w\s♥♥♥\s\d\d\s\w\w\w\s\d\s\w\w\w\s💩💩\.$

$ grex -dswW <INPUT>
^\w\s\W\W\W\s\d\d\s\w\w\w\s\d\s\w\w\w\s\W\W\W$

$ grex -r <INPUT>
^I ♥{3} 36 and ٣ and 💩{2}\.$

$ grex -er <INPUT>
^I \u{2665}{3} 36 and \u{663} and \u{1f4a9}{2}\.$

$ grex -er --with-surrogates <INPUT>
^I \u{2665}{3} 36 and \u{663} and (?:\u{d83d}\u{dca9}){2}\.$

$ grex -dgr <INPUT>
^I ♥{3} \d(\d and ){2}💩{2}\.$

$ grex -rs <INPUT>
^I\s♥{3}\s36\sand\s٣\sand\s💩{2}\.$

$ grex -rw <INPUT>
^\w ♥{3} \w(?:\w \w{3} ){2}💩{2}\.$

$ grex -Dr <INPUT>
^\D{6}36\D{5}٣\D{8}$

$ grex -rS <INPUT>
^\S \S(?:\S{2} ){2}\S{3} \S \S{3} \S{3}$

$ grex -rW <INPUT>
^I\W{5}36\Wand\W٣\Wand\W{4}$

$ grex -drsw <INPUT>
^\w\s♥{3}\s\d(?:\d\s\w{3}\s){2}💩{2}\.$

$ grex -drswW <INPUT>
^\w\s\W{3}\s\d(?:\d\s\w{3}\s){2}\W{3}$
```

## 6. <a name="how-to-build"></a> How to build? <sup>[Top ▲](#table-of-contents)</sup>

In order to build the source code yourself, you need the
[stable Rust toolchain](https://www.rust-lang.org/tools/install) installed on your machine
so that [*cargo*](https://doc.rust-lang.org/cargo/), the Rust package manager, is available.

```
git clone https://github.com/pemistahl/grex.git
cd grex
cargo build
```

The source code is accompanied by an extensive test suite consisting of unit tests, integration
tests and property tests. To run them, use:

```
cargo test
```

## 7. <a name="how-does-it-work"></a> How does it work? <sup>[Top ▲](#table-of-contents)</sup>

1. A [deterministic finite automaton](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) (DFA)
is created from the input strings.

2. The number of states and transitions between states in the DFA is reduced by applying
[Hopcroft's DFA minimization algorithm](https://en.wikipedia.org/wiki/DFA_minimization#Hopcroft.27s_algorithm).

3. The minimized DFA is expressed as a system of linear equations which are solved with
[Brzozowski's algebraic method](http://cs.stackexchange.com/questions/2016/how-to-convert-finite-automata-to-regular-expressions#2392),
resulting in the final regular expression.

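The first of the steps above can be sketched with the standard library alone. The code below is a rough illustration and not the data structure that *grex* uses internally: the types `State` and `Dfa` and their methods are hypothetical, the automaton is built as a simple prefix tree over code points, and the minimization and equation-solving steps are left out. The resulting automaton accepts exactly the given test cases and nothing else, which is the property the generated regular expression has to preserve.

```rust
use std::collections::BTreeMap;

/// Hypothetical state type: outgoing transitions plus an accepting flag.
#[derive(Default)]
struct State {
    transitions: BTreeMap<char, usize>,
    is_final: bool,
}

/// Hypothetical, unminimized DFA built as a prefix tree. State 0 is the start state.
struct Dfa {
    states: Vec<State>,
}

impl Dfa {
    /// Adds one path per test case and shares common prefixes between them.
    fn from_test_cases(cases: &[&str]) -> Self {
        let mut states = vec![State::default()];
        for case in cases {
            let mut current = 0;
            for ch in case.chars() {
                let existing = states[current].transitions.get(&ch).copied();
                current = match existing {
                    Some(next) => next,
                    None => {
                        states.push(State::default());
                        let next = states.len() - 1;
                        states[current].transitions.insert(ch, next);
                        next
                    }
                };
            }
            states[current].is_final = true;
        }
        Dfa { states }
    }

    /// Follows the transitions for `input` and checks for an accepting state.
    fn accepts(&self, input: &str) -> bool {
        let mut current = 0;
        for ch in input.chars() {
            match self.states[current].transitions.get(&ch) {
                Some(&next) => current = next,
                None => return false,
            }
        }
        self.states[current].is_final
    }
}
```
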
## 8. <a name="contribution"></a> Contributions <sup>[Top ▲](#table-of-contents)</sup>

- [Krzysztof Zawisła](https://github.com/KrzysztofZawisla) has written JavaScript bindings. Check out [grex.js](https://github.com/KrzysztofZawisla/grex.js).
- [Maciej Gryka](https://github.com/maciejgryka) has created [https://regex.help](https://regex.help) where you can try out *grex* in your browser.

In case you want to contribute something to *grex*, I encourage you to do so.
Do you have ideas for cool features? Or have you found any bugs so far?
Feel free to open an issue or send a pull request. It's very much appreciated. :-)
