![grex](logo.png)

<br>

[![build](https://github.com/pemistahl/grex/actions/workflows/build.yml/badge.svg)](https://github.com/pemistahl/grex/actions/workflows/build.yml)
[![dependency status](https://deps.rs/crate/grex/1.3.0/status.svg)](https://deps.rs/crate/grex/1.3.0)
[![codecov](https://codecov.io/gh/pemistahl/grex/branch/main/graph/badge.svg)](https://codecov.io/gh/pemistahl/grex)
[![lines of code](https://tokei.rs/b1/github/pemistahl/grex?category=code)](https://github.com/XAMPPRocky/tokei)
[![Downloads](https://img.shields.io/crates/d/grex.svg)](https://crates.io/crates/grex)

[![Docs.rs](https://docs.rs/grex/badge.svg)](https://docs.rs/grex)
[![Crates.io](https://img.shields.io/crates/v/grex.svg)](https://crates.io/crates/grex)
[![Lib.rs](https://img.shields.io/badge/lib.rs-v1.3.0-blue)](https://lib.rs/crates/grex)
[![license](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)

[![Linux Download](https://img.shields.io/badge/Linux%20Download-v1.3.0-blue?logo=Linux)](https://github.com/pemistahl/grex/releases/download/v1.3.0/grex-v1.3.0-x86_64-unknown-linux-musl.tar.gz)
[![MacOS Download](https://img.shields.io/badge/macOS%20Download-v1.3.0-blue?logo=Apple)](https://github.com/pemistahl/grex/releases/download/v1.3.0/grex-v1.3.0-x86_64-apple-darwin.tar.gz)
[![Windows Download](https://img.shields.io/badge/Windows%20Download-v1.3.0-blue?logo=Windows)](https://github.com/pemistahl/grex/releases/download/v1.3.0/grex-v1.3.0-x86_64-pc-windows-msvc.zip)

<br>

![grex demo](demo.gif)

<br>

## <a name="table-of-contents"></a> Table of Contents
1. [What does this tool do?](#what-does-tool-do)
2. [Do I still need to learn to write regexes then?](#learn-regex)
3. [Current features](#current-features)
4. [How to install?](#how-to-install)  
   4.1 [The command-line tool](#how-to-install-cli)  
   4.2 [The library](#how-to-install-library)
5. [How to use?](#how-to-use)  
   5.1 [The command-line tool](#how-to-use-cli)  
   5.2 [The library](#how-to-use-library)  
   5.3 [Examples](#examples)
6. [How to build?](#how-to-build)
7. [How does it work?](#how-does-it-work)
8. [Contributions](#contribution)


## 1. <a name="what-does-tool-do"></a> What does this tool do? <sup>[Top ▲](#table-of-contents)</sup>

*grex* is a library as well as a command-line utility that is meant to simplify the often
complicated and tedious task of creating regular expressions. It does so by automatically
generating a single regular expression from user-provided test cases. The resulting
expression is guaranteed to match the test cases from which it was generated.

This project started as a Rust port of the JavaScript tool
[*regexgen*](https://github.com/devongovett/regexgen) written by
[Devon Govett](https://github.com/devongovett). Although many more useful features
could be added to it, its development apparently ceased several years ago. The plan
is now to add these new features to *grex*, as Rust really shines when it comes to
command-line tools. *grex* offers all features that *regexgen* provides, and more.

The philosophy of this project is to generate, by default, the most specific regular expression
possible, one that matches exactly the given input and nothing else.
With the use of command-line flags (in the CLI tool) or preprocessing methods
(in the library), more generalized expressions can be created.

The produced expressions are [Perl-compatible regular expressions](https://www.pcre.org) which are also
compatible with the regular expression parser in Rust's [*regex* crate](https://lib.rs/crates/regex).
Other regular expression parsers or respective libraries from other programming languages
have not been tested so far, but they ought to be mostly compatible as well.

## 2. <a name="learn-regex"></a> Do I still need to learn to write regexes then? <sup>[Top ▲](#table-of-contents)</sup>

**Definitely, yes!** Using the standard settings, *grex* produces a regular expression that is guaranteed
to match only the test cases given as input and nothing else.
This has been verified by [property tests](https://github.com/pemistahl/grex/blob/main/tests/property_tests.rs).
However, if the conversion to shorthand character classes such as `\w` is enabled, the resulting regex matches
a much wider range of strings. Knowledge about the consequences of this conversion is essential for finding
a correct regular expression for your business domain.

*grex* uses an algorithm that tries to find the shortest possible regex for the given test cases.
Very often though, the resulting expression is still longer or more complex than it needs to be.
In such cases, a more compact or elegant regex can be created only by hand.
Also, every regular expression engine has different built-in optimizations. *grex* does not know anything
about those and therefore cannot optimize its regexes for a specific engine.

**So, please learn how to write regular expressions!** Currently, the best use case for *grex* is to find
an initial correct regex which should then be inspected by hand to see whether further optimizations are possible.

## 3. <a name="current-features"></a> Current Features <sup>[Top ▲](#table-of-contents)</sup>
- literals
- character classes
- detection of common prefixes and suffixes
- detection of repeated substrings and conversion to `{min,max}` quantifier notation
- alternation using `|` operator
- optionality using `?` quantifier
- escaping of non-ASCII characters, with optional conversion of astral code points to surrogate pairs
- case-sensitive or case-insensitive matching
- capturing or non-capturing groups
- fully compliant with the newest [Unicode Standard 13.0](https://unicode.org/versions/Unicode13.0.0)
- fully compatible with [*regex* crate 1.3.5+](https://lib.rs/crates/regex)
- correctly handles graphemes consisting of multiple Unicode symbols
- reads input strings from the command-line or from a file
- produces more readable expressions indented on multiple lines using optional verbose mode
- optional syntax highlighting for nicer output in supported terminals

## 4. <a name="how-to-install"></a> How to install? <sup>[Top ▲](#table-of-contents)</sup>

### 4.1 <a name="how-to-install-cli"></a> The command-line tool <sup>[Top ▲](#table-of-contents)</sup>

You can download the self-contained executable for your platform above and put it in a place of your choice.
Alternatively, pre-compiled 64-bit binaries are available within the package managers [Scoop](https://scoop.sh)
(for Windows), [Homebrew](https://brew.sh) (for macOS and Linux), [MacPorts](https://www.macports.org) (for macOS), and [Huber](https://github.com/innobead/huber) (for macOS, Linux and Windows).
[Raúl Piracés](https://github.com/piraces) has contributed a [Chocolatey Windows package](https://community.chocolatey.org/packages/grex).

*grex* is also hosted on [crates.io](https://crates.io/crates/grex),
the official Rust package registry.
If you are a Rust developer and already have the Rust
toolchain installed, you can install it by compiling from source using
[*cargo*](https://doc.rust-lang.org/cargo/), the Rust package manager.
So the summary of your installation options is:

```
( brew | cargo | choco | huber | port | scoop ) install grex
```

### 4.2 <a name="how-to-install-library"></a> The library <sup>[Top ▲](#table-of-contents)</sup>

In order to use *grex* as a library, simply add it as a dependency to your `Cargo.toml` file:

```toml
[dependencies]
grex = "1.3.0"
```

## 5. <a name="how-to-use"></a> How to use? <sup>[Top ▲](#table-of-contents)</sup>

Detailed explanations of the available settings are provided in the [library section](#how-to-use-library).
All settings can be freely combined with each other.

### 5.1 <a name="how-to-use-cli"></a> The command-line tool <sup>[Top ▲](#table-of-contents)</sup>

Test cases are passed either directly (`grex a b c`) or from a file (`grex -f test_cases.txt`).
*grex* is able to receive its input from Unix pipelines as well, e.g. `cat test_cases.txt | grex -`.

The following help output lists all available flags and options:

```
$ grex -h

grex 1.3.0
© 2019-today Peter M. Stahl <pemistahl@gmail.com>
Licensed under the Apache License, Version 2.0
Downloadable from https://crates.io/crates/grex
Source code at https://github.com/pemistahl/grex

grex generates regular expressions from user-provided test cases.

USAGE:
    grex [FLAGS] [OPTIONS] <INPUT>... --file <FILE>

FLAGS:
    -d, --digits             Converts any Unicode decimal digit to \d
    -D, --non-digits         Converts any character which is not a Unicode decimal digit to \D
    -s, --spaces             Converts any Unicode whitespace character to \s
    -S, --non-spaces         Converts any character which is not a Unicode whitespace character to \S
    -w, --words              Converts any Unicode word character to \w
    -W, --non-words          Converts any character which is not a Unicode word character to \W
    -r, --repetitions        Detects repeated non-overlapping substrings and
                             converts them to {min,max} quantifier notation
    -e, --escape             Replaces all non-ASCII characters with unicode escape sequences
        --with-surrogates    Converts astral code points to surrogate pairs if --escape is set
    -i, --ignore-case        Performs case-insensitive matching, letters match both upper and lower case
    -g, --capture-groups     Replaces non-capturing groups by capturing ones
    -x, --verbose            Produces a nicer looking regular expression in verbose mode
        --no-start-anchor    Removes the caret anchor '^' from the resulting regular expression
        --no-end-anchor      Removes the dollar sign anchor '$' from the resulting regular expression
        --no-anchors         Removes the caret and dollar sign anchors from the resulting regular expression
    -c, --colorize           Provides syntax highlighting for the resulting regular expression
    -h, --help               Prints help information
    -v, --version            Prints version information

OPTIONS:
    -f, --file <FILE>                      Reads test cases on separate lines from a file
        --min-repetitions <QUANTITY>       Specifies the minimum quantity of substring repetitions
                                           to be converted if --repetitions is set [default: 1]
        --min-substring-length <LENGTH>    Specifies the minimum length a repeated substring must have
                                           in order to be converted if --repetitions is set [default: 1]

ARGS:
    <INPUT>...    One or more test cases separated by blank space
```

### 5.2 <a name="how-to-use-library"></a> The library <sup>[Top ▲](#table-of-contents)</sup>

#### 5.2.1 Default settings

Test cases are passed either from a collection via [`RegExpBuilder::from()`](https://docs.rs/grex/1.3.0/grex/struct.RegExpBuilder.html#method.from)
or from a file via [`RegExpBuilder::from_file()`](https://docs.rs/grex/1.3.0/grex/struct.RegExpBuilder.html#method.from_file).
If read from a file, each test case must be on a separate line. Lines may end with either a newline `\n` or a carriage
return with a line feed `\r\n`.

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["a", "aa", "aaa"]).build();
assert_eq!(regexp, "^a(?:aa?)?$");
```

#### 5.2.2 Convert to character classes

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["a", "aa", "123"])
    .with_conversion_of_digits()
    .with_conversion_of_words()
    .build();
assert_eq!(regexp, "^(\\d\\d\\d|\\w(?:\\w)?)$");
```

#### 5.2.3 Convert repeated substrings

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
    .with_conversion_of_repetitions()
    .build();
assert_eq!(regexp, "^(?:a{2}|(?:bc){2}|(?:def){3})$");
```

By default, *grex* converts every substring that is at least a single character long
and that is subsequently repeated at least once. You can customize these two parameters if you like.

In the following example, the test case `aa` is not converted to `a{2}` because the repeated substring
`a` has a length of 1, but the minimum substring length has been set to 2.

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
    .with_conversion_of_repetitions()
    .with_minimum_substring_length(2)
    .build();
assert_eq!(regexp, "^(?:aa|(?:bc){2}|(?:def){3})$");
```

With the minimum number of repetitions set to 2 in the next example, only the test case `defdefdef` is
converted because it is the only one whose substring is repeated twice.

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
    .with_conversion_of_repetitions()
    .with_minimum_repetitions(2)
    .build();
assert_eq!(regexp, "^(?:bcbc|aa|(?:def){3})$");
```

#### 5.2.4 Escape non-ascii characters

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["You smell like 💩."])
    .with_escaping_of_non_ascii_chars(false)
    .build();
assert_eq!(regexp, "^You smell like \\u{1f4a9}\\.$");
```

Old versions of JavaScript do not support Unicode escape sequences for the astral code planes
(range `U+010000` to `U+10FFFF`). In order to support these symbols in JavaScript regular
expressions, the conversion to surrogate pairs is necessary. More information on that matter
can be found [here](https://mathiasbynens.be/notes/javascript-unicode).

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["You smell like 💩."])
    .with_escaping_of_non_ascii_chars(true)
    .build();
assert_eq!(regexp, "^You smell like \\u{d83d}\\u{dca9}\\.$");
```

#### 5.2.5 Case-insensitive matching

The regular expressions that *grex* generates are case-sensitive by default.
Case-insensitive matching can be enabled like so:

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["big", "BIGGER"])
    .with_case_insensitive_matching()
    .build();
assert_eq!(regexp, "(?i)^big(?:ger)?$");
```

#### 5.2.6 Capturing Groups

Non-capturing groups are used by default.
Extending the previous example, you can switch to capturing groups instead.

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["big", "BIGGER"])
    .with_case_insensitive_matching()
    .with_capturing_groups()
    .build();
assert_eq!(regexp, "(?i)^big(ger)?$");
```

#### 5.2.7 Verbose mode

If you find the generated regular expression hard to read, you can enable verbose mode.
The expression is then put on multiple lines and indented to make it more pleasant to the eyes.

```rust
use grex::RegExpBuilder;
use indoc::indoc;

let regexp = RegExpBuilder::from(&["a", "b", "bcd"])
    .with_verbose_mode()
    .build();

assert_eq!(regexp, indoc!(
    r#"
    (?x)
    ^
      (?:
        b
        (?:
          cd
        )?
        |
        a
      )
    $"#
));
```

#### 5.2.8 Disable anchors

By default, the anchors `^` and `$` are put around every generated regular expression in order
to ensure that it matches only the test cases given as input. Often enough, however, it is
desired to use the generated pattern as part of a larger one. For this purpose, the anchors
can be disabled, either separately or both of them.

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["a", "aa", "aaa"])
    .without_anchors()
    .build();
assert_eq!(regexp, "a(?:aa?)?");
```

#### 5.2.9 Syntax highlighting

⚠ The method `with_syntax_highlighting()` may only be used if the resulting regular expression is meant to
be printed to the console. It is mainly meant to be used for the command-line tool output.
The regex string representation returned from enabling this setting cannot be fed into the
[*regex* crate](https://crates.io/crates/regex).

```rust
use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["a", "aa", "123"])
    .with_syntax_highlighting()
    .build();
```

### 5.3 <a name="examples"></a> Examples <sup>[Top ▲](#table-of-contents)</sup>

The following examples show the various supported regex syntax features:

```
$ grex a b c
^[a-c]$

$ grex a c d e f
^[ac-f]$

$ grex a b x de
^(?:de|[abx])$

$ grex abc bc
^a?bc$

$ grex a b bc
^(?:bc?|a)$

$ grex [a-z]
^\[a\-z\]$

$ grex -r b ba baa baaa
^b(?:a{1,3})?$

$ grex -r b ba baa baaaa
^b(?:a{1,2}|a{4})?$

$ grex y̆ a z
^(?:y̆|[az])$
Note:
Grapheme y̆ consists of two Unicode symbols:
U+0079 (Latin Small Letter Y)
U+0306 (Combining Breve)

$ grex "I ♥ cake" "I ♥ cookies"
^I ♥ c(?:ookies|ake)$
Note:
Input containing blank space must be
surrounded by quotation marks.
```

The string `"I ♥♥♥ 36 and ٣ and 💩💩."` serves as input for the following examples using the command-line notation:

```
$ grex <INPUT>
^I ♥♥♥ 36 and ٣ and 💩💩\.$

$ grex -e <INPUT>
^I \u{2665}\u{2665}\u{2665} 36 and \u{663} and \u{1f4a9}\u{1f4a9}\.$

$ grex -e --with-surrogates <INPUT>
^I \u{2665}\u{2665}\u{2665} 36 and \u{663} and \u{d83d}\u{dca9}\u{d83d}\u{dca9}\.$

$ grex -d <INPUT>
^I ♥♥♥ \d\d and \d and 💩💩\.$

$ grex -s <INPUT>
^I\s♥♥♥\s36\sand\s٣\sand\s💩💩\.$

$ grex -w <INPUT>
^\w ♥♥♥ \w\w \w\w\w \w \w\w\w 💩💩\.$

$ grex -D <INPUT>
^\D\D\D\D\D\D36\D\D\D\D\D٣\D\D\D\D\D\D\D\D$

$ grex -S <INPUT>
^\S \S\S\S \S\S \S\S\S \S \S\S\S \S\S\S$

$ grex -dsw <INPUT>
^\w\s♥♥♥\s\d\d\s\w\w\w\s\d\s\w\w\w\s💩💩\.$

$ grex -dswW <INPUT>
^\w\s\W\W\W\s\d\d\s\w\w\w\s\d\s\w\w\w\s\W\W\W$

$ grex -r <INPUT>
^I ♥{3} 36 and ٣ and 💩{2}\.$

$ grex -er <INPUT>
^I \u{2665}{3} 36 and \u{663} and \u{1f4a9}{2}\.$

$ grex -er --with-surrogates <INPUT>
^I \u{2665}{3} 36 and \u{663} and (?:\u{d83d}\u{dca9}){2}\.$

$ grex -dgr <INPUT>
^I ♥{3} \d(\d and ){2}💩{2}\.$

$ grex -rs <INPUT>
^I\s♥{3}\s36\sand\s٣\sand\s💩{2}\.$

$ grex -rw <INPUT>
^\w ♥{3} \w(?:\w \w{3} ){2}💩{2}\.$

$ grex -Dr <INPUT>
^\D{6}36\D{5}٣\D{8}$

$ grex -rS <INPUT>
^\S \S(?:\S{2} ){2}\S{3} \S \S{3} \S{3}$

$ grex -rW <INPUT>
^I\W{5}36\Wand\W٣\Wand\W{4}$

$ grex -drsw <INPUT>
^\w\s♥{3}\s\d(?:\d\s\w{3}\s){2}💩{2}\.$

$ grex -drswW <INPUT>
^\w\s\W{3}\s\d(?:\d\s\w{3}\s){2}\W{3}$
```

## 6. <a name="how-to-build"></a> How to build? <sup>[Top ▲](#table-of-contents)</sup>

In order to build the source code yourself, you need the
[stable Rust toolchain](https://www.rust-lang.org/tools/install) installed on your machine
so that [*cargo*](https://doc.rust-lang.org/cargo/), the Rust package manager, is available.

```
git clone https://github.com/pemistahl/grex.git
cd grex
cargo build
```

The source code is accompanied by an extensive test suite consisting of unit tests, integration
tests and property tests. To run them, simply execute:

```
cargo test
```

## 7. <a name="how-does-it-work"></a> How does it work? <sup>[Top ▲](#table-of-contents)</sup>

1. A [deterministic finite automaton](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) (DFA)
is created from the input strings.

2. The number of states and transitions between states in the DFA is reduced by applying
[Hopcroft's DFA minimization algorithm](https://en.wikipedia.org/wiki/DFA_minimization#Hopcroft.27s_algorithm).

3. The minimized DFA is expressed as a system of linear equations which are solved with
[Brzozowski's algebraic method](http://cs.stackexchange.com/questions/2016/how-to-convert-finite-automata-to-regular-expressions#2392),
resulting in the final regular expression.

## 8. <a name="contribution"></a> Contributions <sup>[Top ▲](#table-of-contents)</sup>

- [Krzysztof Zawisła](https://github.com/KrzysztofZawisla) has written JavaScript bindings. Check out [grex.js](https://github.com/KrzysztofZawisla/grex.js).
- [Maciej Gryka](https://github.com/maciejgryka) has created [https://regex.help](https://regex.help) where you can try out *grex* in your browser.

In case you want to contribute something to *grex*, I encourage you to do so.
Do you have ideas for cool features? Or have you found any bugs so far?
Feel free to open an issue or send a pull request. It's very much appreciated. :-)
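
If you would like to contribute but first want a feel for step 1 of [How does it work?](#how-does-it-work), the following is a minimal sketch of building a tree-shaped DFA from a handful of test cases. It is not *grex*'s actual implementation: the names `State`, `Dfa`, `from_test_cases` and `accepts` are made up for this illustration, it uses only the standard library, and it walks over `char`s rather than graphemes for simplicity. Minimization and the conversion to a regular expression (steps 2 and 3) are left out.

```rust
use std::collections::BTreeMap;

/// One DFA state (hypothetical type): outgoing transitions plus an accept flag.
#[derive(Default)]
struct State {
    transitions: BTreeMap<char, usize>,
    is_final: bool,
}

/// A tree-shaped DFA that accepts exactly the test cases it was built from.
struct Dfa {
    states: Vec<State>,
}

impl Dfa {
    /// Builds the automaton by inserting each test case character by character.
    fn from_test_cases(test_cases: &[&str]) -> Self {
        // State 0 is the start state.
        let mut dfa = Dfa { states: vec![State::default()] };
        for test_case in test_cases {
            let mut current = 0;
            for c in test_case.chars() {
                // Copy the target index out first so the map is no longer borrowed.
                let existing = dfa.states[current].transitions.get(&c).copied();
                current = match existing {
                    Some(next) => next,
                    None => {
                        // No transition yet: create a new state and link to it.
                        let next = dfa.states.len();
                        dfa.states.push(State::default());
                        dfa.states[current].transitions.insert(c, next);
                        next
                    }
                };
            }
            // The state reached after the last character accepts the test case.
            dfa.states[current].is_final = true;
        }
        dfa
    }

    /// Returns true only if `input` is one of the original test cases.
    fn accepts(&self, input: &str) -> bool {
        let mut current = 0;
        for c in input.chars() {
            match self.states[current].transitions.get(&c) {
                Some(&next) => current = next,
                None => return false,
            }
        }
        self.states[current].is_final
    }
}

fn main() {
    let dfa = Dfa::from_test_cases(&["a", "aa", "aaa"]);
    assert!(dfa.accepts("aa"));
    assert!(!dfa.accepts("aaaa"));
}
```

Because shared prefixes reuse the same states, the three test cases above need only four states in this sketch. A real implementation would then merge equivalent states before deriving the expression.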