1Implementation notes about Henry Spencer's regex library 2======================================================== 3 4If Henry ever had any internals documentation, he didn't publish it. 5So this file is an attempt to reverse-engineer some docs. 6 7General source-file layout 8-------------------------- 9 10There are six separately-compilable source files, five of which expose 11exactly one exported function apiece: 12 regcomp.c: pg_regcomp 13 regexec.c: pg_regexec 14 regerror.c: pg_regerror 15 regfree.c: pg_regfree 16 regprefix.c: pg_regprefix 17(The pg_ prefixes were added by the Postgres project to distinguish this 18library version from any similar one that might be present on a particular 19system. They'd need to be removed or replaced in any standalone version 20of the library.) 21 22The sixth file, regexport.c, exposes multiple functions that allow extraction 23of info about a compiled regex (see regexport.h). 24 25There are additional source files regc_*.c that are #include'd in regcomp, 26and similarly additional source files rege_*.c that are #include'd in 27regexec. This was done to avoid exposing internal symbols globally; 28all functions not meant to be part of the library API are static. 29 30(Actually the above is a lie in one respect: there are two more global 31symbols, pg_set_regex_collation and pg_reg_getcolor in regcomp. These are 32not meant to be part of the API, but they have to be global because both 33regcomp and regexec call them. It'd be better to get rid of 34pg_set_regex_collation, as well as the static variables it sets, in favor of 35keeping the needed locale state in the regex structs. We have not done this 36yet for lack of a design for how to add application-specific state to the 37structs.) 38 39What's where in src/backend/regex/: 40 41regcomp.c Top-level regex compilation code 42regc_color.c Color map management 43regc_cvec.c Character vector (cvec) management 44regc_lex.c Lexer 45regc_nfa.c NFA handling 46regc_locale.c Application-specific locale code from Tcl project 47regc_pg_locale.c Postgres-added application-specific locale code 48regexec.c Top-level regex execution code 49rege_dfa.c DFA creation and execution 50regerror.c pg_regerror: generate text for a regex error code 51regfree.c pg_regfree: API to free a no-longer-needed regex_t 52regexport.c Functions for extracting info from a regex_t 53regprefix.c Code for extracting a common prefix from a regex_t 54 55The locale-specific code is concerned primarily with case-folding and with 56expanding locale-specific character classes, such as [[:alnum:]]. It 57really needs refactoring if this is ever to become a standalone library. 58 59The header files for the library are in src/include/regex/: 60 61regcustom.h Customizes library for particular application 62regerrs.h Error message list 63regex.h Exported API 64regexport.h Exported API for regexport.c 65regguts.h Internals declarations 66 67 68DFAs, NFAs, and all that 69------------------------ 70 71This library is a hybrid DFA/NFA regex implementation. (If you've never 72heard either of those terms, get thee to a first-year comp sci textbook.) 73It might not be clear at first glance what that really means and how it 74relates to what you'll see in the code. Here's what really happens: 75 76* Initial parsing of a regex generates an NFA representation, with number 77of states approximately proportional to the length of the regexp. 78 79* The NFA is then optimized into a "compact NFA" representation, which is 80basically the same idea but without fields that are not going to be needed 81at runtime. It is simplified too: the compact format only allows "plain" 82and "LACON" arc types. The cNFA representation is what is passed from 83regcomp to regexec. 84 85* Unlike traditional NFA-based regex engines, we do not execute directly 86from the NFA representation, as that would require backtracking and so be 87very slow in some cases. Rather, we execute a DFA, which ideally can 88process an input string in linear time (O(M) for M characters of input) 89without backtracking. Each state of the DFA corresponds to a set of 90states of the NFA, that is all the states that the NFA might have been in 91upon reaching the current point in the input string. Therefore, an NFA 92with N states might require as many as 2^N states in the corresponding 93DFA, which could easily require unreasonable amounts of memory. We deal 94with this by materializing states of the DFA lazily (only when needed) and 95keeping them in a limited-size cache. The possible need to build the same 96state of the DFA repeatedly makes this approach not truly O(M) time, but 97in the worst case as much as O(M*N). That's still far better than the 98worst case for a backtracking NFA engine. 99 100If that were the end of it, we'd just say this is a DFA engine, with the 101use of NFAs being merely an implementation detail. However, a DFA engine 102cannot handle some important regex features such as capturing parens and 103back-references. If the parser finds that a regex uses these features 104(collectively called "messy cases" in the code), then we have to use 105NFA-style backtracking search after all. 106 107When using the NFA mode, the representation constructed by the parser 108consists of a tree of sub-expressions ("subre"s). Leaf tree nodes are 109either plain regular expressions (which are executed as DFAs in the manner 110described above) or back-references (which try to match the input to some 111previous substring). Non-leaf nodes are capture nodes (which save the 112location of the substring currently matching their child node), 113concatenation, alternation, or iteration nodes. At execution time, the 114executor recursively scans the tree. At concatenation, alternation, or 115iteration nodes, it considers each possible alternative way of matching the 116input string, that is each place where the string could be split for a 117concatenation or iteration, or each child node for an alternation. It 118tries the next alternative if the match fails according to the child nodes. 119This is exactly the sort of backtracking search done by a traditional NFA 120regex engine. If there are many tree levels it can get very slow. 121 122But all is not lost: we can still be smarter than the average pure NFA 123engine. To do this, each subre node has an associated DFA, which 124represents what the node could possibly match insofar as a mathematically 125pure regex can describe that, which basically means "no backrefs". 126Before we perform any search of possible alternative sub-matches, we run 127the DFA to see if it thinks the proposed substring could possibly match. 128If not, we can reject the match immediately without iterating through many 129possibilities. 130 131As an example, consider the regex "(a[bc]+)\1". The compiled 132representation will have a top-level concatenation subre node. Its left 133child is a capture node, and the child of that is a plain DFA node for 134"a[bc]+". The concatenation's right child is a backref node for \1. 135The DFA associated with the concatenation node will be "a[bc]+a[bc]+", 136where the backref has been replaced by a copy of the DFA for its referent 137expression. When executed, the concatenation node will have to search for 138a possible division of the input string that allows its two child nodes to 139each match their part of the string (and although this specific case can 140only succeed when the division is at the middle, the code does not know 141that, nor would it be true in general). However, we can first run the DFA 142and quickly reject any input that doesn't start with an "a" and contain 143one more "a" plus some number of b's and c's. If the DFA doesn't match, 144there is no need to recurse to the two child nodes for each possible 145string division point. In many cases, this prefiltering makes the search 146run much faster than a pure NFA engine could do. It is this behavior that 147justifies using the phrase "hybrid DFA/NFA engine" to describe Spencer's 148library. 149 150 151Colors and colormapping 152----------------------- 153 154In many common regex patterns, there are large numbers of characters that 155can be treated alike by the execution engine. A simple example is the 156pattern "[[:alpha:]][[:alnum:]]*" for an identifier. Basically the engine 157only needs to care whether an input symbol is a letter, a digit, or other. 158We could build the NFA or DFA with a separate arc for each possible letter 159and digit, but that's very wasteful of space and not so cheap to execute 160either, especially when dealing with Unicode which can have thousands of 161letters. Instead, the parser builds a "color map" that maps each possible 162input symbol to a "color", or equivalence class. The NFA or DFA 163representation then has arcs labeled with colors, not specific input 164symbols. At execution, the first thing the executor does with each input 165symbol is to look up its color in the color map, and then everything else 166works from the color only. 167 168To build the colormap, we start by assigning every possible input symbol 169the color WHITE, which means "other" (that is, at the end of parsing, the 170symbols that are still WHITE are those not explicitly referenced anywhere 171in the regex). When we see a simple literal character or a bracket 172expression in the regex, we want to assign that character, or all the 173characters represented by the bracket expression, a unique new color that 174can be used to label the NFA arc corresponding to the state transition for 175matching this character or bracket expression. The basic idea is: 176first, change the color assigned to a character to some new value; 177second, run through all the existing arcs in the partially-built NFA, 178and for each one referencing the character's old color, add a parallel 179arc referencing its new color (this keeps the reassignment from changing 180the semantics of what we already built); and third, add a new arc with 181the character's new color to the current pair of NFA states, denoting 182that seeing this character allows the state transition to be made. 183 184This is complicated a bit by not wanting to create more colors 185(equivalence classes) than absolutely necessary. In particular, if a 186bracket expression mentions two characters that had the same color before, 187they should still share the same color after we process the bracket, since 188there is still not a need to distinguish them. But we do need to 189distinguish them from other characters that previously had the same color 190yet are not listed in the bracket expression. To mechanize this, the code 191has a concept of "parent colors" and "subcolors", where a color's subcolor 192is the new color that we are giving to any characters of that color while 193parsing the current atom. (The word "parent" is a bit unfortunate here, 194because it suggests a long-lived relationship, but a subcolor link really 195only lasts for the duration of parsing a single atom.) In other words, 196a subcolor link means that we are in process of splitting the parent color 197into two colors (equivalence classes), depending on whether or not each 198member character should be included by the current regex atom. 199 200As an example, suppose we have the regex "a\d\wx". Initially all possible 201character codes are labeled WHITE (color 0). To parse the atom "a", we 202create a new color (1), update "a"'s color map entry to 1, and create an 203arc labeled 1 between the first two states of the NFA. Now we see \d, 204which is really a bracket expression containing the digits "0"-"9". 205First we process "0", which is currently WHITE, so we create a new color 206(2), update "0"'s color map entry to 2, and create an arc labeled 2 207between the second and third states of the NFA. We also mark color WHITE 208as having the subcolor 2, which means that future relabelings of WHITE 209characters should also select 2 as the new color. Thus, when we process 210"1", we won't create a new color but re-use 2. We update "1"'s color map 211entry to 2, and then find that we don't need a new arc because there is 212already one labeled 2 between the second and third states of the NFA. 213Similarly for the other 8 digits, so there will be only one arc labeled 2 214between NFA states 2 and 3 for all members of this bracket expression. 215At completion of processing of the bracket expression, we call okcolors() 216which breaks all the existing parent/subcolor links; there is no longer a 217marker saying that WHITE characters should be relabeled 2. (Note: 218actually, we did the same creation and clearing of a subcolor link for the 219primitive atom "a", but it didn't do anything very interesting.) Now we 220come to the "\w" bracket expression, which for simplicity assume expands 221to just "[a-z0-9]". We process "a", but observe that it is already the 222sole member of its color 1. This means there is no need to subdivide that 223equivalence class more finely, so we do not create any new color. We just 224make an arc labeled 1 between the third and fourth NFA states. Next we 225process "b", which is WHITE and far from the only WHITE character, so we 226create a new color (3), link that as WHITE's subcolor, relabel "b" as 227color 3, and make an arc labeled 3. As we process "c" through "z", each 228is relabeled from WHITE to 3, but no new arc is needed. Now we come to 229"0", which is not the only member of its color 2, so we suppose that a new 230color is needed and create color 4. We link 4 as subcolor of 2, relabel 231"0" as color 4 in the map, and add an arc for color 4. Next "1" through 232"9" are similarly relabeled as color 4, with no additional arcs needed. 233Having finished the bracket expression, we call okcolors(), which breaks 234the subcolor links. okcolors() further observes that we have removed 235every member of color 2 (the previous color of the digit characters). 236Therefore, it runs through the partial NFA built so far and relabels arcs 237labeled 2 to color 4; in particular the arc from NFA state 2 to state 3 is 238relabeled color 4. Then it frees up color 2, since we have no more use 239for that color. We now have an NFA in which transitions for digits are 240consistently labeled with color 4. Last, we come to the atom "x". 241"x" is currently labeled with color 3, and it's not the only member of 242that color, so we realize that we now need to distinguish "x" from other 243letters when we did not before. We create a new color, which might have 244been 5 but instead we recycle the unused color 2. "x" is relabeled 2 in 245the color map and 2 is linked as the subcolor of 3, and we add an arc for 2462 between states 4 and 5 of the NFA. Now we call okcolors(), which breaks 247the subcolor link between colors 3 and 2 and notices that both colors are 248nonempty. Therefore, it also runs through the existing NFA arcs and adds 249an additional arc labeled 2 wherever there is an arc labeled 3; this 250action ensures that characters of color 2 (i.e., "x") will still be 251considered as allowing any transitions they did before. We are now done 252parsing the regex, and we have these final color assignments: 253 color 1: "a" 254 color 2: "x" 255 color 3: other letters 256 color 4: digits 257and the NFA has these arcs: 258 states 1 -> 2 on color 1 (hence, "a" only) 259 states 2 -> 3 on color 4 (digits) 260 states 3 -> 4 on colors 1, 3, 4, and 2 (covering all \w characters) 261 states 4 -> 5 on color 2 ("x" only) 262which can be seen to be a correct representation of the regex. 263 264Given this summary, we can see we need the following operations for 265colors: 266 267* A fast way to look up the current color assignment for any character 268 code. (This is needed during both parsing and execution, while the 269 remaining operations are needed only during parsing.) 270* A way to alter the color assignment for any given character code. 271* We must track the number of characters currently assigned to each 272 color, so that we can detect empty and singleton colors. 273* We must track all existing NFA arcs of a given color, so that we 274 can relabel them at need, or add parallel arcs of a new color when 275 an existing color has to be subdivided. 276 277The last two of these are handled with the "struct colordesc" array and 278the "colorchain" links in NFA arc structs. 279 280Ideally, we'd do the first two operations using a simple linear array 281storing the current color assignment for each character code. 282Unfortunately, that's not terribly workable for large charsets such as 283Unicode. Our solution is to divide the color map into two parts. A simple 284linear array is used for character codes up to MAX_SIMPLE_CHR, which can be 285chosen large enough to include all popular characters (so that the 286significantly-slower code paths about to be described are seldom invoked). 287Characters above that need be considered at compile time only if they 288appear explicitly in the regex pattern. We store each such mentioned 289character or character range as an entry in the "colormaprange" array in 290the colormap. (Overlapping ranges are split into unique subranges, so that 291each range in the finished list needs only a single color that describes 292all its characters.) When mapping a character above MAX_SIMPLE_CHR to a 293color at runtime, we search this list of ranges explicitly. 294 295That's still not quite enough, though, because of locale-dependent 296character classes such as [[:alpha:]]. In Unicode locales these classes 297may have thousands of entries that are above MAX_SIMPLE_CHR, and we 298certainly don't want to be searching large colormaprange arrays at runtime. 299Nor do we even want to spend the time to initialize cvec structures that 300exhaustively describe all of those characters. Our solution is to compute 301exact per-character colors at regex compile time only up to MAX_SIMPLE_CHR. 302For characters above that, we apply the <ctype.h> or <wctype.h> lookup 303functions at runtime for each locale-dependent character class used in the 304regex pattern, constructing a bitmap that describes which classes the 305runtime character belongs to. The per-character-range data structure 306mentioned above actually holds, for each range, a separate color entry 307for each possible combination of character class properties. That is, 308the color map for characters above MAX_SIMPLE_CHR is really a 2-D array, 309whose rows correspond to high characters or character ranges that are 310explicitly mentioned in the regex pattern, and whose columns correspond 311to sets of the locale-dependent character classes that are used in the 312regex. 313 314As an example, given the pattern '\w\u1234[\U0001D100-\U0001D1FF]' 315(and supposing that MAX_SIMPLE_CHR is less than 0x1234), we will need 316a high color map with three rows. One row is for the single character 317U+1234 (represented as a single-element range), one is for the range 318U+1D100..U+1D1FF, and the other row represents all remaining high 319characters. The color map has two columns, one for characters that 320satisfy iswalnum() and one for those that don't. 321 322We build this color map in parallel with scanning the regex. Each time 323we detect a new explicit high character (or range) or a locale-dependent 324character class, we split existing entry(s) in the high color map so that 325characters we need to be able to distinguish will have distinct entries 326that can be given separate colors. Often, though, single entries in the 327high color map will represent very large sets of characters. 328 329If there are both explicit high characters/ranges and locale-dependent 330character classes, we may have entries in the high color map array that 331have non-WHITE colors but don't actually represent any real characters. 332(For example, in a row representing a singleton range, only one of the 333columns could possibly be a live entry; it's the one matching the actual 334locale properties for that single character.) We don't currently make 335any effort to reclaim such colors. In principle it could be done, but 336it's not clear that it's worth the trouble. 337 338 339Detailed semantics of an NFA 340---------------------------- 341 342When trying to read dumped-out NFAs, it's helpful to know these facts: 343 344State 0 (additionally marked with "@" in dumpnfa's output) is always the 345goal state, and state 1 (additionally marked with ">") is the start state. 346(The code refers to these as the post state and pre state respectively.) 347 348The possible arc types are: 349 350 PLAIN arcs, which specify matching of any character of a given "color" 351 (see above). These are dumped as "[color_number]->to_state". 352 353 EMPTY arcs, which specify a no-op transition to another state. These 354 are dumped as "->to_state". 355 356 AHEAD constraints, which represent a "next character must be of this 357 color" constraint. AHEAD differs from a PLAIN arc in that the input 358 character is not consumed when crossing the arc. These are dumped as 359 ">color_number>->to_state". 360 361 BEHIND constraints, which represent a "previous character must be of 362 this color" constraint, which likewise consumes no input. These are 363 dumped as "<color_number<->to_state". 364 365 '^' arcs, which specify a beginning-of-input constraint. These are 366 dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and 367 beginning-of-line constraints respectively. 368 369 '$' arcs, which specify an end-of-input constraint. These are dumped 370 as "$0->to_state" or "$1->to_state" for end-of-string and end-of-line 371 constraints respectively. 372 373 LACON constraints, which represent "(?=re)", "(?!re)", "(?<=re)", and 374 "(?<!re)" constraints, i.e. the input starting/ending at this point must 375 match (or not match) a given sub-RE, but the matching input is not 376 consumed. These are dumped as ":subtree_number:->to_state". 377 378If you see anything else (especially any question marks) in the display of 379an arc, it's dumpnfa() trying to tell you that there's something fishy 380about the arc; see the source code. 381 382The regex executor can only handle PLAIN and LACON transitions. The regex 383optimize() function is responsible for transforming the parser's output 384to get rid of all the other arc types. In particular, ^ and $ arcs that 385are not dropped as impossible will always end up adjacent to the pre or 386post state respectively, and then will be converted into PLAIN arcs that 387mention the special "colors" for BOS, BOL, EOS, or EOL. 388 389To decide whether a thus-transformed NFA matches a given substring of the 390input string, the executor essentially follows these rules: 3911. Start the NFA "looking at" the character *before* the given substring, 392or if the substring is at the start of the input, prepend an imaginary BOS 393character instead. 3942. Run the NFA until it has consumed the character *after* the given 395substring, or an imaginary following EOS character if the substring is at 396the end of the input. 3973. If the NFA is (or can be) in the goal state at this point, it matches. 398 399So one can mentally execute an untransformed NFA by taking ^ and $ as 400ordinary constraints that match at start and end of input; but plain 401arcs out of the start state should be taken as matches for the character 402before the target substring, and similarly, plain arcs leading to the 403post state are matches for the character after the target substring. 404This definition is necessary to support regexes that begin or end with 405constraints such as \m and \M, which imply requirements on the adjacent 406character if any. NFAs for simple unanchored patterns will usually have 407pre-state outarcs for all possible character colors as well as BOS and 408BOL, and post-state inarcs for all possible character colors as well as 409EOS and EOL, so that the executor's behavior will work. 410