• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

COPYRIGHTH A D08-Nov-20214.3 KiB8568

MakefileH A D08-Nov-2021683 3014

READMEH A D08-Nov-202124.4 KiB441386

regc_color.cH A D08-Nov-202127.7 KiB1,187842

regc_cvec.cH A D08-Nov-20213.9 KiB13961

regc_lex.cH A D08-Nov-202121.9 KiB1,040895

regc_locale.cH A D08-Nov-202115.2 KiB772569

regc_nfa.cH A D08-Nov-202197.6 KiB3,8252,498

regc_pg_locale.cH A D08-Nov-202128.1 KiB945743

regcomp.cH A D08-Nov-202168.8 KiB2,5831,857

rege_dfa.cH A D08-Nov-202127.2 KiB1,107820

regerror.cH A D08-Nov-20213.6 KiB12173

regexec.cH A D08-Nov-202138.5 KiB1,4951,050

regexport.cH A D08-Nov-20217.7 KiB294143

regfree.cH A D08-Nov-20212.1 KiB558

regprefix.cH A D08-Nov-20217.2 KiB269133

README

1Implementation notes about Henry Spencer's regex library
2========================================================
3
4If Henry ever had any internals documentation, he didn't publish it.
5So this file is an attempt to reverse-engineer some docs.
6
7General source-file layout
8--------------------------
9
10There are six separately-compilable source files, five of which expose
11exactly one exported function apiece:
12	regcomp.c: pg_regcomp
13	regexec.c: pg_regexec
14	regerror.c: pg_regerror
15	regfree.c: pg_regfree
16	regprefix.c: pg_regprefix
17(The pg_ prefixes were added by the Postgres project to distinguish this
18library version from any similar one that might be present on a particular
19system.  They'd need to be removed or replaced in any standalone version
20of the library.)
21
22The sixth file, regexport.c, exposes multiple functions that allow extraction
23of info about a compiled regex (see regexport.h).
24
25There are additional source files regc_*.c that are #include'd in regcomp,
26and similarly additional source files rege_*.c that are #include'd in
27regexec.  This was done to avoid exposing internal symbols globally;
28all functions not meant to be part of the library API are static.
29
30(Actually the above is a lie in one respect: there are two more global
31symbols, pg_set_regex_collation and pg_reg_getcolor in regcomp.  These are
32not meant to be part of the API, but they have to be global because both
33regcomp and regexec call them.  It'd be better to get rid of
34pg_set_regex_collation, as well as the static variables it sets, in favor of
35keeping the needed locale state in the regex structs.  We have not done this
36yet for lack of a design for how to add application-specific state to the
37structs.)
38
39What's where in src/backend/regex/:
40
41regcomp.c		Top-level regex compilation code
42regc_color.c		Color map management
43regc_cvec.c		Character vector (cvec) management
44regc_lex.c		Lexer
45regc_nfa.c		NFA handling
46regc_locale.c		Application-specific locale code from Tcl project
47regc_pg_locale.c	Postgres-added application-specific locale code
48regexec.c		Top-level regex execution code
49rege_dfa.c		DFA creation and execution
50regerror.c		pg_regerror: generate text for a regex error code
51regfree.c		pg_regfree: API to free a no-longer-needed regex_t
52regexport.c		Functions for extracting info from a regex_t
53regprefix.c		Code for extracting a common prefix from a regex_t
54
55The locale-specific code is concerned primarily with case-folding and with
56expanding locale-specific character classes, such as [[:alnum:]].  It
57really needs refactoring if this is ever to become a standalone library.
58
59The header files for the library are in src/include/regex/:
60
61regcustom.h		Customizes library for particular application
62regerrs.h		Error message list
63regex.h			Exported API
64regexport.h		Exported API for regexport.c
65regguts.h		Internals declarations
66
67
68DFAs, NFAs, and all that
69------------------------
70
71This library is a hybrid DFA/NFA regex implementation.  (If you've never
72heard either of those terms, get thee to a first-year comp sci textbook.)
73It might not be clear at first glance what that really means and how it
74relates to what you'll see in the code.  Here's what really happens:
75
76* Initial parsing of a regex generates an NFA representation, with number
77of states approximately proportional to the length of the regexp.
78
79* The NFA is then optimized into a "compact NFA" representation, which is
80basically the same idea but without fields that are not going to be needed
81at runtime.  It is simplified too: the compact format only allows "plain"
82and "LACON" arc types.  The cNFA representation is what is passed from
83regcomp to regexec.
84
85* Unlike traditional NFA-based regex engines, we do not execute directly
86from the NFA representation, as that would require backtracking and so be
87very slow in some cases.  Rather, we execute a DFA, which ideally can
88process an input string in linear time (O(M) for M characters of input)
89without backtracking.  Each state of the DFA corresponds to a set of
90states of the NFA, that is all the states that the NFA might have been in
91upon reaching the current point in the input string.  Therefore, an NFA
92with N states might require as many as 2^N states in the corresponding
93DFA, which could easily require unreasonable amounts of memory.  We deal
94with this by materializing states of the DFA lazily (only when needed) and
95keeping them in a limited-size cache.  The possible need to build the same
96state of the DFA repeatedly makes this approach not truly O(M) time, but
97in the worst case as much as O(M*N).  That's still far better than the
98worst case for a backtracking NFA engine.
99
100If that were the end of it, we'd just say this is a DFA engine, with the
101use of NFAs being merely an implementation detail.  However, a DFA engine
102cannot handle some important regex features such as capturing parens and
103back-references.  If the parser finds that a regex uses these features
104(collectively called "messy cases" in the code), then we have to use
105NFA-style backtracking search after all.
106
107When using the NFA mode, the representation constructed by the parser
108consists of a tree of sub-expressions ("subre"s).  Leaf tree nodes are
109either plain regular expressions (which are executed as DFAs in the manner
110described above) or back-references (which try to match the input to some
111previous substring).  Non-leaf nodes are capture nodes (which save the
112location of the substring currently matching their child node),
113concatenation, alternation, or iteration nodes.  At execution time, the
114executor recursively scans the tree.  At concatenation, alternation, or
115iteration nodes, it considers each possible alternative way of matching the
116input string, that is each place where the string could be split for a
117concatenation or iteration, or each child node for an alternation.  It
118tries the next alternative if the match fails according to the child nodes.
119This is exactly the sort of backtracking search done by a traditional NFA
120regex engine.  If there are many tree levels it can get very slow.
121
122But all is not lost: we can still be smarter than the average pure NFA
123engine.  To do this, each subre node has an associated DFA, which
124represents what the node could possibly match insofar as a mathematically
125pure regex can describe that, which basically means "no backrefs".
126Before we perform any search of possible alternative sub-matches, we run
127the DFA to see if it thinks the proposed substring could possibly match.
128If not, we can reject the match immediately without iterating through many
129possibilities.
130
131As an example, consider the regex "(a[bc]+)\1".  The compiled
132representation will have a top-level concatenation subre node.  Its first
133child is a plain DFA node for "a[bc]+" (which is marked as being a capture
134node).  The concatenation's second child is a backref node for \1.
135The DFA associated with the concatenation node will be "a[bc]+a[bc]+",
136where the backref has been replaced by a copy of the DFA for its referent
137expression.  When executed, the concatenation node will have to search for
138a possible division of the input string that allows its two child nodes to
139each match their part of the string (and although this specific case can
140only succeed when the division is at the middle, the code does not know
141that, nor would it be true in general).  However, we can first run the DFA
142and quickly reject any input that doesn't start with an "a" and contain
143one more "a" plus some number of b's and c's.  If the DFA doesn't match,
144there is no need to recurse to the two child nodes for each possible
145string division point.  In many cases, this prefiltering makes the search
146run much faster than a pure NFA engine could do.  It is this behavior that
147justifies using the phrase "hybrid DFA/NFA engine" to describe Spencer's
148library.
149
150It's perhaps worth noting that separate capture subre nodes are a rarity:
151normally, we just mark a subre as capturing and that's it.  However, it's
152legal to write a regex like "((x))" in which the same substring has to be
153captured by multiple sets of parentheses.  Since a subre has room for only
154one "capno" field, a single subre can't handle that.  We handle such cases
155by wrapping the base subre (which captures the innermost parens) in a
156no-op capture node, or even more than one for "(((x)))" etc.  This is a
157little bit inefficient because we end up with multiple identical NFAs,
158but since the case is pointless and infrequent, it's not worth working
159harder.
160
161
162Colors and colormapping
163-----------------------
164
165In many common regex patterns, there are large numbers of characters that
166can be treated alike by the execution engine.  A simple example is the
167pattern "[[:alpha:]][[:alnum:]]*" for an identifier.  Basically the engine
168only needs to care whether an input symbol is a letter, a digit, or other.
169We could build the NFA or DFA with a separate arc for each possible letter
170and digit, but that's very wasteful of space and not so cheap to execute
171either, especially when dealing with Unicode which can have thousands of
172letters.  Instead, the parser builds a "color map" that maps each possible
173input symbol to a "color", or equivalence class.  The NFA or DFA
174representation then has arcs labeled with colors, not specific input
175symbols.  At execution, the first thing the executor does with each input
176symbol is to look up its color in the color map, and then everything else
177works from the color only.
178
179To build the colormap, we start by assigning every possible input symbol
180the color WHITE, which means "other" (that is, at the end of parsing, the
181symbols that are still WHITE are those not explicitly referenced anywhere
182in the regex).  When we see a simple literal character or a bracket
183expression in the regex, we want to assign that character, or all the
184characters represented by the bracket expression, a unique new color that
185can be used to label the NFA arc corresponding to the state transition for
186matching this character or bracket expression.  The basic idea is:
187first, change the color assigned to a character to some new value;
188second, run through all the existing arcs in the partially-built NFA,
189and for each one referencing the character's old color, add a parallel
190arc referencing its new color (this keeps the reassignment from changing
191the semantics of what we already built); and third, add a new arc with
192the character's new color to the current pair of NFA states, denoting
193that seeing this character allows the state transition to be made.
194
195This is complicated a bit by not wanting to create more colors
196(equivalence classes) than absolutely necessary.  In particular, if a
197bracket expression mentions two characters that had the same color before,
198they should still share the same color after we process the bracket, since
199there is still not a need to distinguish them.  But we do need to
200distinguish them from other characters that previously had the same color
201yet are not listed in the bracket expression.  To mechanize this, the code
202has a concept of "parent colors" and "subcolors", where a color's subcolor
203is the new color that we are giving to any characters of that color while
204parsing the current atom.  (The word "parent" is a bit unfortunate here,
205because it suggests a long-lived relationship, but a subcolor link really
206only lasts for the duration of parsing a single atom.)  In other words,
207a subcolor link means that we are in process of splitting the parent color
208into two colors (equivalence classes), depending on whether or not each
209member character should be included by the current regex atom.
210
211As an example, suppose we have the regex "a\d\wx".  Initially all possible
212character codes are labeled WHITE (color 0).  To parse the atom "a", we
213create a new color (1), update "a"'s color map entry to 1, and create an
214arc labeled 1 between the first two states of the NFA.  Now we see \d,
215which is really a bracket expression containing the digits "0"-"9".
216First we process "0", which is currently WHITE, so we create a new color
217(2), update "0"'s color map entry to 2, and create an arc labeled 2
218between the second and third states of the NFA.  We also mark color WHITE
219as having the subcolor 2, which means that future relabelings of WHITE
220characters should also select 2 as the new color.  Thus, when we process
221"1", we won't create a new color but re-use 2.  We update "1"'s color map
222entry to 2, and then find that we don't need a new arc because there is
223already one labeled 2 between the second and third states of the NFA.
224Similarly for the other 8 digits, so there will be only one arc labeled 2
225between NFA states 2 and 3 for all members of this bracket expression.
226At completion of processing of the bracket expression, we call okcolors()
227which breaks all the existing parent/subcolor links; there is no longer a
228marker saying that WHITE characters should be relabeled 2.  (Note:
229actually, we did the same creation and clearing of a subcolor link for the
230primitive atom "a", but it didn't do anything very interesting.)  Now we
231come to the "\w" bracket expression, which for simplicity assume expands
232to just "[a-z0-9]".  We process "a", but observe that it is already the
233sole member of its color 1.  This means there is no need to subdivide that
234equivalence class more finely, so we do not create any new color.  We just
235make an arc labeled 1 between the third and fourth NFA states.  Next we
236process "b", which is WHITE and far from the only WHITE character, so we
237create a new color (3), link that as WHITE's subcolor, relabel "b" as
238color 3, and make an arc labeled 3.  As we process "c" through "z", each
239is relabeled from WHITE to 3, but no new arc is needed.  Now we come to
240"0", which is not the only member of its color 2, so we suppose that a new
241color is needed and create color 4.  We link 4 as subcolor of 2, relabel
242"0" as color 4 in the map, and add an arc for color 4.  Next "1" through
243"9" are similarly relabeled as color 4, with no additional arcs needed.
244Having finished the bracket expression, we call okcolors(), which breaks
245the subcolor links.  okcolors() further observes that we have removed
246every member of color 2 (the previous color of the digit characters).
247Therefore, it runs through the partial NFA built so far and relabels arcs
248labeled 2 to color 4; in particular the arc from NFA state 2 to state 3 is
249relabeled color 4.  Then it frees up color 2, since we have no more use
250for that color.  We now have an NFA in which transitions for digits are
251consistently labeled with color 4.  Last, we come to the atom "x".
252"x" is currently labeled with color 3, and it's not the only member of
253that color, so we realize that we now need to distinguish "x" from other
254letters when we did not before.  We create a new color, which might have
255been 5 but instead we recycle the unused color 2.  "x" is relabeled 2 in
256the color map and 2 is linked as the subcolor of 3, and we add an arc for
2572 between states 4 and 5 of the NFA.  Now we call okcolors(), which breaks
258the subcolor link between colors 3 and 2 and notices that both colors are
259nonempty.  Therefore, it also runs through the existing NFA arcs and adds
260an additional arc labeled 2 wherever there is an arc labeled 3; this
261action ensures that characters of color 2 (i.e., "x") will still be
262considered as allowing any transitions they did before.  We are now done
263parsing the regex, and we have these final color assignments:
264	color 1: "a"
265	color 2: "x"
266	color 3: other letters
267	color 4: digits
268and the NFA has these arcs:
269	states 1 -> 2 on color 1 (hence, "a" only)
270	states 2 -> 3 on color 4 (digits)
271	states 3 -> 4 on colors 1, 3, 4, and 2 (covering all \w characters)
272	states 4 -> 5 on color 2 ("x" only)
273which can be seen to be a correct representation of the regex.
274
275There is one more complexity, which is how to handle ".", that is a
276match-anything atom.  We used to do that by generating a "rainbow"
277of arcs of all live colors between the two NFA states before and after
278the dot.  That's expensive in itself when there are lots of colors,
279and it also typically adds lots of follow-on arc-splitting work for the
280color splitting logic.  Now we handle this case by generating a single arc
281labeled with the special color RAINBOW, meaning all colors.  Such arcs
282never need to be split, so they help keep NFAs small in this common case.
283(Note: this optimization doesn't help in REG_NLSTOP mode, where "." is
284not supposed to match newline.  In that case we still handle "." by
285generating an almost-rainbow of all colors except newline's color.)
286
287Given this summary, we can see we need the following operations for
288colors:
289
290* A fast way to look up the current color assignment for any character
291  code.  (This is needed during both parsing and execution, while the
292  remaining operations are needed only during parsing.)
293* A way to alter the color assignment for any given character code.
294* We must track the number of characters currently assigned to each
295  color, so that we can detect empty and singleton colors.
296* We must track all existing NFA arcs of a given color, so that we
297  can relabel them at need, or add parallel arcs of a new color when
298  an existing color has to be subdivided.
299
300The last two of these are handled with the "struct colordesc" array and
301the "colorchain" links in NFA arc structs.
302
303Ideally, we'd do the first two operations using a simple linear array
304storing the current color assignment for each character code.
305Unfortunately, that's not terribly workable for large charsets such as
306Unicode.  Our solution is to divide the color map into two parts.  A simple
307linear array is used for character codes up to MAX_SIMPLE_CHR, which can be
308chosen large enough to include all popular characters (so that the
309significantly-slower code paths about to be described are seldom invoked).
310Characters above that need be considered at compile time only if they
311appear explicitly in the regex pattern.  We store each such mentioned
312character or character range as an entry in the "colormaprange" array in
313the colormap.  (Overlapping ranges are split into unique subranges, so that
314each range in the finished list needs only a single color that describes
315all its characters.)  When mapping a character above MAX_SIMPLE_CHR to a
316color at runtime, we search this list of ranges explicitly.
317
318That's still not quite enough, though, because of locale-dependent
319character classes such as [[:alpha:]].  In Unicode locales these classes
320may have thousands of entries that are above MAX_SIMPLE_CHR, and we
321certainly don't want to be searching large colormaprange arrays at runtime.
322Nor do we even want to spend the time to initialize cvec structures that
323exhaustively describe all of those characters.  Our solution is to compute
324exact per-character colors at regex compile time only up to MAX_SIMPLE_CHR.
325For characters above that, we apply the <ctype.h> or <wctype.h> lookup
326functions at runtime for each locale-dependent character class used in the
327regex pattern, constructing a bitmap that describes which classes the
328runtime character belongs to.  The per-character-range data structure
329mentioned above actually holds, for each range, a separate color entry
330for each possible combination of character class properties.  That is,
331the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
332whose rows correspond to high characters or character ranges that are
333explicitly mentioned in the regex pattern, and whose columns correspond
334to sets of the locale-dependent character classes that are used in the
335regex.
336
337As an example, given the pattern '\w\u1234[\U0001D100-\U0001D1FF]'
338(and supposing that MAX_SIMPLE_CHR is less than 0x1234), we will need
339a high color map with three rows.  One row is for the single character
340U+1234 (represented as a single-element range), one is for the range
341U+1D100..U+1D1FF, and the other row represents all remaining high
342characters.  The color map has two columns, one for characters that
343satisfy iswalnum() and one for those that don't.
344
345We build this color map in parallel with scanning the regex.  Each time
346we detect a new explicit high character (or range) or a locale-dependent
347character class, we split existing entry(s) in the high color map so that
348characters we need to be able to distinguish will have distinct entries
349that can be given separate colors.  Often, though, single entries in the
350high color map will represent very large sets of characters.
351
352If there are both explicit high characters/ranges and locale-dependent
353character classes, we may have entries in the high color map array that
354have non-WHITE colors but don't actually represent any real characters.
355(For example, in a row representing a singleton range, only one of the
356columns could possibly be a live entry; it's the one matching the actual
357locale properties for that single character.)  We don't currently make
358any effort to reclaim such colors.  In principle it could be done, but
359it's not clear that it's worth the trouble.
360
361
362Detailed semantics of an NFA
363----------------------------
364
365When trying to read dumped-out NFAs, it's helpful to know these facts:
366
367State 0 (additionally marked with "@" in dumpnfa's output) is always the
368goal state, and state 1 (additionally marked with ">") is the start state.
369(The code refers to these as the post state and pre state respectively.)
370
371The possible arc types are:
372
373    PLAIN arcs, which specify matching of any character of a given "color"
374    (see above).  These are dumped as "[color_number]->to_state".
375    In addition there can be "rainbow" PLAIN arcs, which are dumped as
376    "[*]->to_state".
377
378    EMPTY arcs, which specify a no-op transition to another state.  These
379    are dumped as "->to_state".
380
381    AHEAD constraints, which represent a "next character must be of this
382    color" constraint.  AHEAD differs from a PLAIN arc in that the input
383    character is not consumed when crossing the arc.  These are dumped as
384    ">color_number>->to_state", or possibly ">*>->to_state".
385
386    BEHIND constraints, which represent a "previous character must be of
387    this color" constraint, which likewise consumes no input.  These are
388    dumped as "<color_number<->to_state", or possibly "<*<->to_state".
389
390    '^' arcs, which specify a beginning-of-input constraint.  These are
391    dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
392    beginning-of-line constraints respectively.
393
394    '$' arcs, which specify an end-of-input constraint.  These are dumped
395    as "$0->to_state" or "$1->to_state" for end-of-string and end-of-line
396    constraints respectively.
397
398    LACON constraints, which represent "(?=re)", "(?!re)", "(?<=re)", and
399    "(?<!re)" constraints, i.e. the input starting/ending at this point must
400    match (or not match) a given sub-RE, but the matching input is not
401    consumed.  These are dumped as ":subtree_number:->to_state".
402
403If you see anything else (especially any question marks) in the display of
404an arc, it's dumpnfa() trying to tell you that there's something fishy
405about the arc; see the source code.
406
407The regex executor can only handle PLAIN and LACON transitions.  The regex
408optimize() function is responsible for transforming the parser's output
409to get rid of all the other arc types.  In particular, ^ and $ arcs that
410are not dropped as impossible will always end up adjacent to the pre or
411post state respectively, and then will be converted into PLAIN arcs that
412mention the special "colors" for BOS, BOL, EOS, or EOL.
413
414To decide whether a thus-transformed NFA matches a given substring of the
415input string, the executor essentially follows these rules:
4161. Start the NFA "looking at" the character *before* the given substring,
417or if the substring is at the start of the input, prepend an imaginary BOS
418character instead.
4192. Run the NFA until it has consumed the character *after* the given
420substring, or an imaginary following EOS character if the substring is at
421the end of the input.
4223. If the NFA is (or can be) in the goal state at this point, it matches.
423
424This definition is necessary to support regexes that begin or end with
425constraints such as \m and \M, which imply requirements on the adjacent
426character if any.  The executor implements that by checking if the
427adjacent character (or BOS/BOL/EOS/EOL pseudo-character) is of the
428right color, and it does that in the same loop that checks characters
429within the match.
430
431So one can mentally execute an untransformed NFA by taking ^ and $ as
432ordinary constraints that match at start and end of input; but plain
433arcs out of the start state should be taken as matches for the character
434before the target substring, and similarly, plain arcs leading to the
435post state are matches for the character after the target substring.
436After the optimize() transformation, there are explicit arcs mentioning
437BOS/BOL/EOS/EOL adjacent to the pre-state and post-state.  So a finished
438NFA for a pattern without anchors or adjacent-character constraints will
439have pre-state outarcs for RAINBOW (all possible character colors) as well
440as BOS and BOL, and likewise post-state inarcs for RAINBOW, EOS, and EOL.
441