This material is derived from The GNU Emacs Manual, 9th edition.
Copyright 1985, 1986, 1987, 1993 Free Software Foundation, Inc.
and from the GNU regex library version 0.11.
Copyright 1992 Free Software Foundation, Inc.

The Free Software Foundation gives permission for the use of this
material in e93, with different distribution terms from those stated
in the Emacs manual, because e93 is free software, and thus worthy
of cooperation.


NOTE: this has been modified to reflect e93's syntax.
1/22/00 TMS (squirest@e93.org)

Regular Expression Syntax

Common Operators

* Match-self Operator::                 Ordinary characters.
* Match-any-character Operator::        .
* Concatenation Operator::              Juxtaposition.
* Repetition Operators::                * + ? {}
* Alternation Operator::                |
* List Operators::                      [...] [^...]
* Grouping Operators::                  (...)
* Back-reference Operator::             \digit
* Anchoring Operators::                 ^ $ < >

Repetition Operators

* Match-zero-or-more Operator::         *
* Match-one-or-more Operator::          +
* Match-zero-or-one Operator::          ?
* Interval Operators::                  {}

List Operators (`[' ... `]' and `[^' ... `]')

* Range Operator::                      start-end

Anchoring Operators

* Match-beginning-of-word Operator::    <
* Match-end-of-word Operator::          >
* Match-beginning-of-line Operator::    ^
* Match-end-of-line Operator::          $


Overview
********

   A "regular expression" (or "regexp", or "pattern") is a text string
that describes some (mathematical) set of strings.  A regexp R
"matches" a string S if S is in the set of strings described by R.

   Some regular expressions match only one string, i.e., the set they
describe has only one member.  For example, the regular expression
`foo' matches the string `foo' and no others.  Other regular
expressions match more than one string, i.e., the set they describe has
more than one member.
For example, the regular expression `f*' matches
the set of strings made up of any number (including zero) of `f's.  As
you can see, some characters in regular expressions match themselves
(such as `f') and some don't (such as `*'); the ones that don't match
themselves instead let you specify patterns that describe many
different strings.

Regular Expression Syntax
*************************

   "Characters" are things you can type.  "Operators" are things in a
regular expression that match one or more characters.  You compose
regular expressions from operators, which in turn you specify using one
or more characters.

   Most characters represent what we call the match-self operator, i.e.,
they match themselves; we call these characters "ordinary".  Other
characters represent either all or parts of fancier operators; e.g.,
`.' represents what we call the match-any-character operator (which, no
surprise, matches any character except newline); we call these
characters "special".

   In the following sections, we describe these things in more detail.

The Backslash Character
=======================

   The `\' character quotes (makes ordinary, if it's special,
   or possibly special if it's ordinary) the next character.

   `\' sequences:

   \n       -- stands for new line (0x0A).
   \b       -- stands for backspace (0x08).
   \r       -- stands for return (0x0D).
   \t       -- stands for tab (0x09).
   \x##     -- allows specification of arbitrary characters in hex;
               for example, \x0A is equivalent to \n.
   \0 to \9 -- backreference to register (not valid within [])


Common Operators
****************

   You compose regular expressions from operators.  In the following
sections, we describe the regular expression operators.


* Match-self Operator::                 Ordinary characters.
* Match-any-character Operator::        .
* Concatenation Operator::              Juxtaposition.
* Repetition Operators::                * + ? {}
* Alternation Operator::                |
* List Operators::                      [...] [^...]
* Grouping Operators::                  (...)
* Back-reference Operator::             \digit
* Anchoring Operators::                 ^ $ < >


The Match-self Operator (ORDINARY CHARACTER)
============================================

   This operator matches the character itself.  All ordinary characters
represent this operator.  For example, `f' is always an ordinary character,
so the regular expression `f' matches only the string `f'.
In particular, it does *not* match the string `ff'.


The Match-any-character Operator (`.')
======================================

   This operator matches any single printing or nonprinting character
except newline (it is equivalent to `[^\n]').

   NOTE: if you wish to match absolutely anything, use `[-]', or `[^]'.

   The `.' (period) character represents this operator.  For example,
`a.b' matches any three-character string beginning with `a' and ending
with `b'.

The Concatenation Operator
==========================

   This operator concatenates two regular expressions A and B.  No
character represents this operator; you simply put B after A.  The
result is a regular expression that will match a string if A matches
its first part and B matches the rest.  For example, `xy' (two
match-self operators) matches `xy'.

Repetition Operators
====================

   Repetition operators repeat the preceding regular expression a
specified number of times.

* Match-zero-or-more Operator::         *
* Match-one-or-more Operator::          +
* Match-zero-or-one Operator::          ?
* Interval Operators::                  {}

The Match-zero-or-more Operator (`*')
-------------------------------------

   This operator repeats the smallest possible preceding regular
expression as many times as necessary (including zero) to match the
pattern.  `*' represents this operator.
For example, `o*' matches any
string made up of zero or more `o's.  Since this operator operates on
the smallest preceding regular expression, `fo*' has a repeating `o',
not a repeating `fo'.  So, `fo*' matches `f', `fo', `foo', and so on.

   Since the match-zero-or-more operator is a suffix operator, it may
not be applied when no regular expression precedes it.  This is the
case when it:

   * is first in a regular expression, or

   * follows a match-beginning-of-line, match-end-of-line, open-group,
     or alternation operator.

   The matcher processes a match-zero-or-more operator by first matching
as many repetitions of the smallest preceding regular expression as it
can.  Then it continues to match the rest of the pattern.

   If it can't match the rest of the pattern, it backtracks (as many
times as necessary), each time discarding one of the matches until it
can either match the entire pattern or be certain that it cannot get a
match.  For example, when matching `ca*ar' against `caaar', the matcher
first matches all three `a's of the string with the `a*' of the regular
expression.  However, it cannot then match the final `ar' of the
regular expression against the final `r' of the string.  So it
backtracks, discarding the match of the last `a' in the string.  It can
then match the remaining `ar'.

The Match-one-or-more Operator (`+')
------------------------------------

   This operator is similar to the match-zero-or-more operator except
that it repeats the preceding regular expression at least once; *note
Match-zero-or-more Operator::., for what it operates on, and how Regex
backtracks to match it.

   For example, `ca+r' matches, e.g., `car' and `caaaar', but not `cr'.
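
   The greedy-then-backtrack behavior described for `*' can be sketched
in a few lines of Python.  This is only an illustration, not e93's
matcher: the function name `match_here' and the `trace' list are
hypothetical, and the sketch understands nothing but ordinary characters
and `*' applied to a single preceding character.

```python
def match_here(pattern: str, text: str, trace: list) -> bool:
    """Return True if `pattern` matches all of `text`.

    Hypothetical helper: supports ordinary characters and `*` on a
    single preceding character only.
    """
    if not pattern:
        return not text
    if len(pattern) >= 2 and pattern[1] == "*":
        ch, rest = pattern[0], pattern[2:]
        # Greedy step: take as many repetitions of `ch` as possible.
        count = 0
        while count < len(text) and text[count] == ch:
            count += 1
        # Backtracking step: discard one repetition at a time until the
        # rest of the pattern matches or no repetitions are left to drop.
        while count >= 0:
            trace.append((ch, count))
            if match_here(rest, text[count:], trace):
                return True
            count -= 1
        return False
    if text and pattern[0] == text[0]:
        return match_here(pattern[1:], text[1:], trace)
    return False


trace = []
matched = match_here("ca*ar", "caaar", trace)
# `trace` records each repetition count tried for `a*`: first 3, then 2.
```

   With `ca*ar' against `caaar', the trace shows the matcher first
trying three `a's and then succeeding after giving one back, which is
the sequence described in the text.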

The Match-zero-or-one Operator (`?')
------------------------------------

   This operator is similar to the match-zero-or-more operator except
that it repeats the preceding regular expression once or not at all;
*note Match-zero-or-more Operator::., to see what it operates on, and
how Regex backtracks to match it.

   For example, `ca?r' matches both `car' and `cr', but nothing else.

Interval Operators (`{' ... `}')
--------------------------------

   The `{' and `}' characters represent the open-interval and
close-interval operators:

`{COUNT}'
     matches exactly COUNT occurrences of the preceding regular
     expression.

`{MIN,}'
     matches MIN or more occurrences of the preceding regular
     expression.

`{MIN, MAX}'
     matches at least MIN but no more than MAX occurrences of the
     preceding regular expression.

   The interval expression (but not necessarily the regular expression
that contains it) is invalid if MIN is greater than MAX.

The Alternation Operator (`|')
==============================

   Alternatives match one of a choice of regular expressions: if you put
the character(s) representing the alternation operator between any two
regular expressions A and B, the result matches the union of the
strings that A and B match.  For example, `foo|bar|quux' matches any of
`foo', `bar' or `quux'.

   The alternation operator operates on the *largest* possible
surrounding regular expressions.  (Put another way, it has the lowest
precedence of any regular expression operator.)  Thus, the only way you
can delimit its arguments is to use grouping.  For example, `fo(o|b)ar'
matches either `fooar' or `fobar'.
(`foo|bar' matches `foo' or
`bar'.)

   The matcher tries each combination of alternatives in order until it
is able to make a match.

List Operators (`[' ... `]' and `[^' ... `]')
=============================================

   "Lists", also called "bracket expressions", are sets of zero or more
items.  An "item" is a character, or a range expression.

   A "matching list" matches a single character represented by one of
the list items.  You form a matching list by enclosing one or more items
within an "open-matching-list operator" (represented by `[') and a
"close-list operator" (represented by `]').

   For example, `[ab]' matches either `a' or `b'.  `[ad]*' matches the
empty string and any string composed of just `a's and `d's in any
order.  Regex considers invalid a regular expression with a `[' but no
matching `]'.

   "Nonmatching lists" are similar to matching lists except that they
match a single character *not* represented by one of the list items.
You use an "open-nonmatching-list operator" (represented by `[^')
instead of an open-matching-list operator to start a nonmatching list.

   For example, `[^ab]' matches any character except `a' or `b'.

   Most characters lose any special meaning inside a list.  The special
characters inside a list follow.

`]'
     ends the list unless quoted by `\'.

`\'
     quotes the next character.

`-'
     represents the range operator unless quoted by `\'.

All other characters are ordinary.  For example, `[.*]' matches `.' and
`*'.

The Range Operator (`-')
------------------------

   Regex recognizes "range expressions" inside a list.  They represent
those characters that fall between two elements in the current
collating sequence.  You form a range expression by putting a "range
operator" between two characters.  `-' represents the range operator.
For example, `a-f' within a list represents all the characters from `a'
through `f' inclusively.

   Since `-' represents the range operator, if you want to make a `-'
character itself a list item, you must quote it with `\'.

   A range does not need both a start and an end.  If the start is
omitted, the range starts from the lowest character: for example,
`[-a]' matches all characters through lowercase `a'.  If the end is
omitted, the range runs through 0xFF: `[a-]' matches lowercase `a'
through 0xFF.  `[-]' matches all characters.

Grouping Operators (`(' ... `)')
================================

   A "group", also known as a "subexpression", consists of an
"open-group operator", any number of other operators, and a
"close-group operator".  Regex treats this sequence as a unit, just as
mathematics and programming languages treat a parenthesized expression
as a unit.  Groups can be empty.

   Therefore, using "groups", you can:

   * delimit the argument(s) to an alternation operator (*note
     Alternation Operator::.) or a repetition operator (*note
     Repetition Operators::.).

   * keep track of the indices of the substring that matched a given
     group.  *Note Using Registers::, for a precise explanation.  This
     lets you:

        * use the back-reference operator (*note Back-reference
          Operator::.).

        * use registers (*note Using Registers::.).

The Back-reference Operator ("\"DIGIT)
======================================

   A back reference matches a specified preceding group.
The back reference operator is represented by `\DIGIT' anywhere after
the end of a regular expression's DIGIT-th group (*note Grouping
Operators::.).

   DIGIT must be between `0' and `9'.  The matcher assigns numbers 0
through 9 to the first ten groups it encounters.  By using one of `\0'
through `\9' after the corresponding group's close-group operator, you
can match a substring identical to the one that the group does.
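
   Because groups are numbered from 0 here, the numbering is off by one
from regex dialects that count groups from 1.  The Python sketch below
makes that concrete.  It is not part of e93: Python's `re' module counts
groups from 1 and reads `\0' as a NUL character, so the hypothetical
helper `e93_backrefs_to_python' shifts each digit up by one.  It makes
no attempt to handle lists or the other differences between the two
syntaxes, and the sample patterns are chosen only for illustration.

```python
import re


def e93_backrefs_to_python(pattern: str) -> str:
    """Rewrite zero-based `\\0`..`\\9` back references as Python's
    one-based equivalents (hypothetical helper; ignores `[...]` lists)."""

    def shift(m: re.Match) -> str:
        ch = m.group(1)
        if ch in "0123456789":
            # Wrap in a non-capturing group so that a literal digit
            # following the reference is not read as part of its number.
            return "(?:\\%d)" % (int(ch) + 1)
        return m.group(0)  # any other escape is left untouched

    # Consume escapes pairwise so that `\\0` (a quoted backslash followed
    # by the digit 0) is not mistaken for a back reference.
    return re.sub(r"\\(.)", shift, pattern, flags=re.DOTALL)


doubled = e93_backrefs_to_python(r"(a)\0")
banana = e93_backrefs_to_python(r"(bana)na\0bo\0")
inner = e93_backrefs_to_python(r"(a(b))\1{3}")
```

   Under these assumptions, `(a)\0' becomes `(a)(?:\1)' in Python's
notation, and the inner group of `(a(b))', numbered 1 here, becomes
Python's group 2.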

   Back references match according to the following (in all examples
below, `(' represents the open-group, `)' the close-group, `{' the
open-interval and `}' the close-interval operator):

   * If the group matches a substring, the back reference matches an
     identical substring.  For example, `(a)\0' matches `aa' and
     `(bana)na\0bo\0' matches `bananabanabobana'.  Likewise, `(.*)\0'
     matches any string that is composed of two identical halves; the
     `(.*)' matches the first half and the `\0' matches the second half.

   * If the group matches more than once (as it might if followed by,
     e.g., a repetition operator), then the back reference matches the
     substring the group *last* matched.  For example, `((a*)b)*\0\1'
     matches `aabababa'; first group 0 (the outer one) matches `aab'
     and group 1 (the inner one) matches `aa'.  Then group 0 matches
     `ab' and group 1 matches `a'.  So, `\0' matches `ab' and `\1'
     matches `a'.

   * If the group doesn't participate in a match, i.e., it is part of an
     alternative not taken or a repetition operator allows zero
     repetitions of it, then the back reference makes the whole match
     fail.

   You can use a back reference as an argument to a repetition operator.
For example, `(a(b))\1*' matches `a' followed by one or more `b's.
Similarly, `(a(b))\1{3}' matches `abbbb'.

   If there is no preceding DIGIT-th subexpression, the regular
expression is invalid.

Anchoring Operators
===================

   These operators can appear anywhere (except lists) within a pattern
and force that point in the pattern to match only at the beginning or
end of a word or line.

* Match-beginning-of-word Operator::    <
* Match-end-of-word Operator::          >
* Match-beginning-of-line Operator::    ^
* Match-end-of-line Operator::          $


The Match-beginning-of-word Operator (`<')
------------------------------------------

   This operator can match the empty string either at the beginning of
the text or the beginning of a word.  Thus, it is said to "anchor" the
pattern to the beginning of a word.

The Match-end-of-word Operator (`>')
------------------------------------

   This operator can match the empty string either at the end of the text
or the end of a word.  Thus, it is said to "anchor" the pattern to the
end of a word.

The Match-beginning-of-line Operator (`^')
------------------------------------------

   This operator can match the empty string either at the beginning of
the text or after a newline character.  Thus, it is said to "anchor"
the pattern to the beginning of a line.

The Match-end-of-line Operator (`$')
------------------------------------

   This operator can match the empty string either at the end of the
text or before a newline character in the text.  Thus, it is said
to "anchor" the pattern to the end of a line.
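
   The four anchors can be tried out with Python's `re' module, used
here only as a stand-in and not as e93's matcher.  With `re.MULTILINE',
Python's `^' and `$' behave as described above.  Python has no `<' or
`>' operators, so the sketch approximates them with the hypothetical
constants `WORD_START' and `WORD_END', built from `\b' (which in Python
is a word boundary, not a backspace) plus a lookaround.  Python's idea
of what makes up a word may differ from e93's.

```python
import re

text = "cat concat\ncats cat"

# Line anchors: with re.MULTILINE, `^` matches at the start of the text
# or just after a newline, and `$` at the end of the text or just
# before a newline.
line_starts = [m.start() for m in re.finditer(r"^cat", text, re.MULTILINE)]
line_ends = [m.start() for m in re.finditer(r"cat$", text, re.MULTILINE)]

# Word anchors (approximations of `<` and `>`): a word boundary with a
# word character just after it, or just before it, respectively.
WORD_START = r"\b(?=\w)"
WORD_END = r"\b(?<=\w)"
word_starts = [m.start() for m in re.finditer(WORD_START + "cat", text)]
word_ends = [m.start() for m in re.finditer("cat" + WORD_END, text)]

print(line_starts)  # [0, 11]
print(line_ends)    # [7, 16]
print(word_starts)  # [0, 11, 16]
print(word_ends)    # [0, 7, 16]
```

   Each list holds the offsets at which `cat' was found.  The `cat'
inside `concat' is found by the end anchors but not the beginning ones,
and the `cat' inside `cats' the other way around.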