1# Fossil grep vs POSIX grep
2
3As of Fossil 2.7, there is a `grep` command which acts roughly like
4POSIX's `grep -E` over all historical versions of a single file name.
5This document explains the commonalities and divergences between [POSIX
6`grep`](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html)
7and Fossil `grep`.
8
9
10## Options
11
12Fossil `grep` supports only a small subset of the options specified for
13POSIX `grep`:
14
15| Option | Meaning
16|--------|-------------------------------------------------------------
17| `-i`   | ignore case in matches
18| `-l`   | list a checkin ID prefix for matching historical versions of the file
19| `-v`   | print each checkin ID considered, regardless of whether it matches
20
21That leaves many divergences at the option level from POSIX `grep`:
22
23*   There is no built-in way to get a count of matches, as with
24    `grep -c`.
25
26*    You cannot give more than one pattern, as with `grep -e` or
27     `grep -f`.
28
29*   There is no equivalent of `grep -F` to do literal fixed-string
30    matches only.
31
32*   `fossil grep -l` does not do precisely the same thing as POSIX
33    `grep -l`: it lists checkin ID prefixes, not file names.
34
35*   Fossil always gives the line number in its output, which is to say
36    that it acts like `grep -n`.  There is no way to disable the line
37    number in `fossil grep` output.
38
39*   There is no way to suppress all output, returning only a status code
40    to indicate whether the pattern matched, as with `grep -q`.
41
42*   There is no way to suppress error output, as with `grep -s`.
43
44*   Fossil `grep` does not accept a directory name for Fossil to
45    expand to the set of all files under that directory. This means
46    Fossil `grep` has no equivalent of the common POSIX `grep -R`
47    extension. (And if it did, it would probably have a different
48    option letter, since `-R` in Fossil has a different meaning, by
49    convention.)
50
51*   You cannot invert the match, as with `grep -v`.
52
53Patches to remove those limitations will be thoughtfully considered.
54
55
56## Regular Expression Dialect
57
58Fossil contains a built-in regular expression engine implementing a
59subset of the [POSIX extended regular expression][ere] dialect:
60
61[ere]: https://en.wikipedia.org/wiki/Regular_expression#POSIX_extended
62
63| Atom    | Meaning
64|---------|-------------------------------------------------------------
65| `X*`    | zero or more occurrences of X
66| `X+`    | one or more occurrences of X
67| `X?`    | zero or one occurrences of X
68| `X{p,q}`| between p and q occurrences of X, inclusive
69| `(X)`   | match X
70| <tt>X\|Y</tt>| X or Y
71| `^X`    | X occurring at the beginning of a line
72| `X$`    | X occurring at the end of a line
73| `.`     | Match any single character
74| `\c`    | Character `c` where `c` is one of <tt>{}()[]\|\*+?.\\</tt>
75| `\c`    | C-language escapes for `c` in `afnrtv`.  ex: `\t` or `\n`
76| `\uXXXX`| Where XXXX is exactly 4 hex digits, Unicode value XXXX
77| `\xXX`  | Where XX is exactly 2 hex digits, Unicode value XX
78| `[abc]` | Any single character from the set `abc`
79| `[^abc]`| Any single character not in the set `abc`
80| `[a-z]` | Any single character in the range `a-z`
81| `[^a-z]`| Any single character not in the range `a-z`
82| `\b`    | Word boundary
83| `\w`    | Word character: `[A-Za-z0-9_]`
84| `\W`    | Non-word character: `[^A-Za-z0-9_]`
85| `\d`    | Digit: `[0-9]`
86| `\D`    | Non-digit: `[^0-9]`
87| `\s`    | Whitespace character: `[ \t\r\n\v\f]`
88| `\S`    | Non-whitespace character: `[^ \t\r\n\v\f]`
89
90There are several restrictions in Fossil `grep` relative to a fully
91POSIX compatible regular expression engine. Among them are:
92
93*   There is currently no support for POSIX character classes such as
94    `[:lower:]`.
95
96*   Fossil `grep` does not currently attempt to take your operating
97    system's locale settings into account when doing this match.  Since
98    Fossil has no way to mark a given file as having a particular
99    encoding, Fossil `grep` assumes the input files are in UTF-8 format.
100
101    This means Fossil `grep` will not work correctly if the files in
102    your repository are in an encoding that is not backwards-compatible
103    with ASCII, such as UTF-16.  Partially compatible encodings such as
104    ISO 8859 should work with Fossil `grep` as long as you stick to
105    their ASCII-compatible subset.
106
107The Fossil `grep` language is not a strict subset of POSIX extended
108regular expressions.  Some of the features documented above are
109well-understood extensions to it, such as the "word" features `\b`, `\w`
110and `\W`.
111
112Fossil `grep` is based on the Unicode engine from [SQLite's FTS5
113feature][fts5].  This means it does do things like Unicode-aware case
114folding. Therefore, it is usually a user error to attempt to substitute
115`[a-z]` for a lack of the POSIX character class `[:lower:]` if you are
116grepping over pretty much any human written language other than English.
117Use `fossil grep -i` instead, which does Unicode case folding.
118
119[fts5]: https://www.sqlite.org/fts5.html
120
121
122## Algorithm Details
123
124Fossil `grep` uses a [nondeterministic finite automaton][nfa] for
125matching, so the performance is bounded by ***O(nm)*** where ***n*** is
126the size of the regular expression and ***m*** is the size of the input
127string.
128
129[nfa]: https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
130
131In order to avoid [ReDoS attacks][redos], the Fossil regular expression
132matcher was purposely written to avoid [implementation choices][rei]
133which have the potential to require exponential evaluation time. This
134constrains the possible feature set we can support in the Fossil `grep`
135dialect. For instance, we are unlikely to ever add support for
136[backtracking][bt].
137
138[redos]: https://en.wikipedia.org/wiki/ReDoS
139[rei]:   https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times
140[bt]:    https://en.wikipedia.org/wiki/Backtracking
141
142The `X{p,q}` operator expands to `p` copies of `X` followed by `q-p`
143copies of `X?` before RE evaluation. The ***O(nm)*** performance bound
144above remains true for this case, but realize that it applies to the RE
145*after* this expansion, not to the form as given by the user.  In other
146words, as `q-p` increases, so does the RE evaluation time.
147