1# Fossil grep vs POSIX grep 2 3As of Fossil 2.7, there is a `grep` command which acts roughly like 4POSIX's `grep -E` over all historical versions of a single file name. 5This document explains the commonalities and divergences between [POSIX 6`grep`](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html) 7and Fossil `grep`. 8 9 10## Options 11 12Fossil `grep` supports only a small subset of the options specified for 13POSIX `grep`: 14 15| Option | Meaning 16|--------|------------------------------------------------------------- 17| `-i` | ignore case in matches 18| `-l` | list a checkin ID prefix for matching historical versions of the file 19| `-v` | print each checkin ID considered, regardless of whether it matches 20 21That leaves many divergences at the option level from POSIX `grep`: 22 23* There is no built-in way to get a count of matches, as with 24 `grep -c`. 25 26* You cannot give more than one pattern, as with `grep -e` or 27 `grep -f`. 28 29* There is no equivalent of `grep -F` to do literal fixed-string 30 matches only. 31 32* `fossil grep -l` does not do precisely the same thing as POSIX 33 `grep -l`: it lists checkin ID prefixes, not file names. 34 35* Fossil always gives the line number in its output, which is to say 36 that it acts like `grep -n`. There is no way to disable the line 37 number in `fossil grep` output. 38 39* There is no way to suppress all output, returning only a status code 40 to indicate whether the pattern matched, as with `grep -q`. 41 42* There is no way to suppress error output, as with `grep -s`. 43 44* Fossil `grep` does not accept a directory name for Fossil to 45 expand to the set of all files under that directory. This means 46 Fossil `grep` has no equivalent of the common POSIX `grep -R` 47 extension. (And if it did, it would probably have a different 48 option letter, since `-R` in Fossil has a different meaning, by 49 convention.) 50 51* You cannot invert the match, as with `grep -v`. 52 53Patches to remove those limitations will be thoughtfully considered. 54 55 56## Regular Expression Dialect 57 58Fossil contains a built-in regular expression engine implementing a 59subset of the [POSIX extended regular expression][ere] dialect: 60 61[ere]: https://en.wikipedia.org/wiki/Regular_expression#POSIX_extended 62 63| Atom | Meaning 64|---------|------------------------------------------------------------- 65| `X*` | zero or more occurrences of X 66| `X+` | one or more occurrences of X 67| `X?` | zero or one occurrences of X 68| `X{p,q}`| between p and q occurrences of X, inclusive 69| `(X)` | match X 70| <tt>X\|Y</tt>| X or Y 71| `^X` | X occurring at the beginning of a line 72| `X$` | X occurring at the end of a line 73| `.` | Match any single character 74| `\c` | Character `c` where `c` is one of <tt>{}()[]\|\*+?.\\</tt> 75| `\c` | C-language escapes for `c` in `afnrtv`. ex: `\t` or `\n` 76| `\uXXXX`| Where XXXX is exactly 4 hex digits, Unicode value XXXX 77| `\xXX` | Where XX is exactly 2 hex digits, Unicode value XX 78| `[abc]` | Any single character from the set `abc` 79| `[^abc]`| Any single character not in the set `abc` 80| `[a-z]` | Any single character in the range `a-z` 81| `[^a-z]`| Any single character not in the range `a-z` 82| `\b` | Word boundary 83| `\w` | Word character: `[A-Za-z0-9_]` 84| `\W` | Non-word character: `[^A-Za-z0-9_]` 85| `\d` | Digit: `[0-9]` 86| `\D` | Non-digit: `[^0-9]` 87| `\s` | Whitespace character: `[ \t\r\n\v\f]` 88| `\S` | Non-whitespace character: `[^ \t\r\n\v\f]` 89 90There are several restrictions in Fossil `grep` relative to a fully 91POSIX compatible regular expression engine. Among them are: 92 93* There is currently no support for POSIX character classes such as 94 `[:lower:]`. 95 96* Fossil `grep` does not currently attempt to take your operating 97 system's locale settings into account when doing this match. Since 98 Fossil has no way to mark a given file as having a particular 99 encoding, Fossil `grep` assumes the input files are in UTF-8 format. 100 101 This means Fossil `grep` will not work correctly if the files in 102 your repository are in an encoding that is not backwards-compatible 103 with ASCII, such as UTF-16. Partially compatible encodings such as 104 ISO 8859 should work with Fossil `grep` as long as you stick to 105 their ASCII-compatible subset. 106 107The Fossil `grep` language is not a strict subset of POSIX extended 108regular expressions. Some of the features documented above are 109well-understood extensions to it, such as the "word" features `\b`, `\w` 110and `\W`. 111 112Fossil `grep` is based on the Unicode engine from [SQLite's FTS5 113feature][fts5]. This means it does do things like Unicode-aware case 114folding. Therefore, it is usually a user error to attempt to substitute 115`[a-z]` for a lack of the POSIX character class `[:lower:]` if you are 116grepping over pretty much any human written language other than English. 117Use `fossil grep -i` instead, which does Unicode case folding. 118 119[fts5]: https://www.sqlite.org/fts5.html 120 121 122## Algorithm Details 123 124Fossil `grep` uses a [nondeterministic finite automaton][nfa] for 125matching, so the performance is bounded by ***O(nm)*** where ***n*** is 126the size of the regular expression and ***m*** is the size of the input 127string. 128 129[nfa]: https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton 130 131In order to avoid [ReDoS attacks][redos], the Fossil regular expression 132matcher was purposely written to avoid [implementation choices][rei] 133which have the potential to require exponential evaluation time. This 134constrains the possible feature set we can support in the Fossil `grep` 135dialect. For instance, we are unlikely to ever add support for 136[backtracking][bt]. 137 138[redos]: https://en.wikipedia.org/wiki/ReDoS 139[rei]: https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times 140[bt]: https://en.wikipedia.org/wiki/Backtracking 141 142The `X{p,q}` operator expands to `p` copies of `X` followed by `q-p` 143copies of `X?` before RE evaluation. The ***O(nm)*** performance bound 144above remains true for this case, but realize that it applies to the RE 145*after* this expansion, not to the form as given by the user. In other 146words, as `q-p` increases, so does the RE evaluation time. 147