1\input texinfo
2@setfilename regexp.info
3@comment @newindex {fn}
4
5
6@comment From the `ed' documentation.
7
8@chapter Regular expressions
9
10Regular expressions are patterns used in selecting text.
11
In addition to specifying string literals, regular expressions can
represent classes of strings.  Strings thus represented are said to be
matched by the corresponding regular expression.  If it is possible for
a regular expression to match several strings in a line, then the
left-most longest match is the one selected.
17
18The following symbols are used in constructing regular expressions:
19
20@table @code
21
22@item @var{c}
23Any character @var{c} not listed below, including @samp{@{}, @samp{@}},
24@samp{(}, @samp{)}, @samp{<} and @samp{>}, matches itself.
25
26@item \@var{c}
Any backslash-escaped character @var{c}, other than @samp{@{},
@samp{@}}, @samp{(}, @samp{)}, @samp{<}, @samp{>}, @samp{b}, @samp{B},
@samp{w}, @samp{W}, @samp{+} and @samp{?}, matches itself.
30
31Note that @samp{\} also has special meaning in the read syntax of Lisp
32strings, and must be quoted with @samp{\}.  For
33example, the regular expression that matches the @samp{\} character is
34@samp{\\}.  To write a Lisp string that contains the characters
35@samp{\\}, Lisp syntax requires you to quote each @samp{\} with another
36@samp{\}.  Therefore, the read syntax for a regular expression matching
37@samp{\} is @code{"\\\\"}.@refill
38
39@item .
40Matches any single character.
41
42@item [@var{char-class}]
43Matches any single character in @var{char-class}.  To include a @samp{]}
44in @var{char-class}, it must be the first character.  A range of
45characters may be specified by separating the end characters of the
46range with a @samp{-}, e.g., @samp{a-z} specifies the lower case
47characters.  The following literal expressions can also be used in
48@var{char-class} to specify sets of characters:
49
50@example
51[:alnum:] [:cntrl:] [:lower:] [:space:]
52[:alpha:] [:digit:] [:print:] [:upper:]
53[:blank:] [:graph:] [:punct:] [:xdigit:]
54@end example
55
56If @samp{-} appears as the first or last character of @var{char-class},
57then it matches itself.  All other characters in @var{char-class} match
58themselves.
59
60Patterns in
61@var{char-class}
62of the form:
63@example
64[.@var{col-elm}.]
65[=@var{col-elm}=]
66@end example
67
68@noindent
69where @var{col-elm} is a @dfn{collating element} are interpreted
70according to @code{locale (5)} (not currently supported).  See
71@code{regex (3)} for an explanation of these constructs.
72
73@item [^@var{char-class}]
74Matches any single character, other than newline, not in
75@var{char-class}.  @var{char-class} is defined as above.
76
77@item ^
78If @samp{^} is the first character of a regular expression, then it
79anchors the regular expression to the beginning of a line.  Otherwise,
80it matches itself.
81
82@item $
83If @samp{$} is the last character of a regular expression, it anchors
84the regular expression to the end of a line.  Otherwise, it matches
85itself.
86
87@item \(@var{re}\)
88Defines a (possibly null) subexpression @var{re}.
89Subexpressions may be nested.  A
90subsequent backreference of the form @samp{\@var{n}}, where @var{n} is a
91number in the range [1,9], expands to the text matched by the @var{n}th
92subexpression. For example, the regular expression @samp{\(a.c\)\1} matches
93the string @samp{abcabc}, but not @samp{abcadc}.
94Subexpressions are ordered relative to their left delimiter.
95
96@item *
97Matches the single character regular expression or subexpression
98immediately preceding it zero or more times.  If @samp{*} is the first
99character of a regular expression or subexpression, then it matches
100itself.  The @samp{*} operator sometimes yields unexpected results.  For
101example, the regular expression @samp{b*} matches the beginning of the
102string @samp{abbb}, as opposed to the substring @samp{bbb}, since a
103null match is the only left-most match.
104
105@item \@{@var{n,m}\@}
106@itemx \@{@var{n,}\@}
107@itemx \@{@var{n}\@}
108Matches the single character regular expression or subexpression
109immediately preceding it at least @var{n} and at most @var{m} times.  If
110@var{m} is omitted, then it matches at least @var{n} times.  If the
111comma is also omitted, then it matches exactly @var{n} times.
112If any of these forms occurs first in a regular expression or subexpression,
113then it is interpreted literally (i.e., the regular expression @samp{\@{2\@}}
114matches the string @samp{@{2@}}, and so on).
115
116@item \<
117@itemx \>
118Anchors the single character regular expression or subexpression
119immediately following it to the beginning (in the case of @samp{\<})
120or ending (in the case of @samp{\>}) of
121a @dfn{word}, i.e., in ASCII, a maximal string of alphanumeric characters,
122including the underscore (_).
123
124@end table
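
The rules in the table above can be tried out in most regexp engines,
given some translation.  The following is a minimal sketch in Python,
not the matcher described here: Python's @code{re} module is an
assumption of this example, and it writes groups and intervals without
backslashes (@samp{(a.c)\1} rather than @samp{\(a.c\)\1}), so the
patterns are translated accordingly.  The helper names are hypothetical.

```python
import re

# Backreference: \1 must repeat the exact text matched by group 1.
# Syntax above: \(a.c\)\1      Python syntax: (a.c)\1
BACKREF = re.compile(r"(a.c)\1")

# Interval.  Syntax above: a\{2,3\}      Python syntax: a{2,3}
INTERVAL = re.compile(r"a{2,3}")


def matches_whole(pattern, text):
    """Return True if the compiled pattern matches the entire string."""
    return pattern.fullmatch(text) is not None


def first_match_span(pattern_text, text):
    """Return (start, end) of the left-most match, or None if there is none."""
    m = re.search(pattern_text, text)
    return m.span() if m else None


if __name__ == "__main__":
    print(matches_whole(BACKREF, "abcabc"))   # True
    print(matches_whole(BACKREF, "abcadc"))   # False
    # b* matches the empty string at offset 0 of "abbb", because that
    # null match is the left-most one.
    print(first_match_span(r"b*", "abbb"))    # (0, 0)
```

The empty span returned for @samp{b*} is the null match at the beginning
of @samp{abbb} that the description of @samp{*} warns about.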
125
126The following extended operators are preceded by a backslash @samp{\} to
127distinguish them from traditional @code{ed} syntax.
128
129@table @code
130
131@item  \`
132@itemx \'
133Unconditionally matches the beginning @samp{\`} or ending @samp{\'} of a line.
134
135@item \?
136Optionally matches the single character regular expression or subexpression
137immediately preceding it.  For example, the regular expression @samp{a[bd]\?c}
138matches the strings @samp{abc}, @samp{adc} and @samp{ac}.
If @samp{\?} occurs at the beginning
of a regular expression or subexpression, then it matches a literal @samp{?}.
141
142@item \+
143Matches the single character regular expression or subexpression
immediately preceding it one or more times.  So the regular expression
@samp{a\+} is shorthand for @samp{aa*}.  If @samp{\+} occurs at the
beginning of a regular expression or subexpression, then it matches a
literal @samp{+}.
148
149@item \b
150Matches the beginning or ending (null string) of a word.  Thus the regular
151expression @samp{\bhello\b} is equivalent to @samp{\<hello\>}.
152However, @samp{\b\b}
153is a valid regular expression whereas @samp{\<\>} is not.
154
155@item \B
156Matches (a null string) inside a word.
157
158@item \w
159Matches any character in a word.
160
161@item \W
162Matches any character not in a word.
163
164@end table
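
As a rough illustration of the extended operators, here is a short
Python sketch.  It assumes Python's @code{re} module, in which the
optional operator is written @samp{?} rather than @samp{\?}, while
@samp{\b} keeps the same meaning; the function names are hypothetical.

```python
import re

# Syntax above: a[bd]\?c      Python syntax: a[bd]?c
OPTIONAL = re.compile(r"a[bd]?c")

# \b is a word boundary in both syntaxes.
WHOLE_WORD = re.compile(r"\bhello\b")


def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if pattern.fullmatch(s)]


def contains_word(text):
    """Return True if "hello" occurs in text as a separate word."""
    return WHOLE_WORD.search(text) is not None


if __name__ == "__main__":
    print(accepted(OPTIONAL, ["abc", "adc", "ac", "abdc"]))
    print(contains_word("say hello again"))
    print(contains_word("hello_world"))
```

The last call reports no match because the underscore counts as a word
character, so there is no word boundary after @samp{hello}.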
165
166
167@comment From the `emacs' documentation.
168@comment The RE_SYNTAX_POSIX_BASIC syntax we use differs from the Emacs syntax
169@comment in the following bits:
170@comment RE_CHAR_CLASSES     ->  character classes are supported
171@comment RE_DOT_NEWLINE      ->  . matches newline
172@comment RE_DOT_NOT_NULL     ->  . doesn't match NUL
173@comment RE_INTERVALS        ->
174@comment RE_NO_EMPTY_RANGES  ->  [z-a] is invalid
175@comment RE_BK_PLUS_QM       ->  + and ? are operators, \+ and \? are literals
176
177@node Regular Expressions
178@chapter Regular Expressions
179@cindex regular expression
180@cindex regexp
181
182  A @dfn{regular expression} (@dfn{regexp}, for short) is a pattern that
183denotes a (possibly infinite) set of strings.  Searching for matches for
184a regexp is a very powerful operation.  This section explains how to write
185regexps; the following section says how to search for them.
186
187@menu
188* Syntax of Regexps::       Rules for writing regular expressions.
189* Regexp Examples::         Illustrates regular expression syntax.
190@end menu
191
192@node Syntax of Regexps
193@section Syntax of Regular Expressions
194
195  Regular expressions have a syntax in which a few characters are special
196constructs and the rest are @dfn{ordinary}.  An ordinary character is a
197simple regular expression which matches that character and nothing else.
198The special characters are @samp{$}, @samp{^}, @samp{.}, @samp{*},
199@samp{[}, @samp{]} and @samp{\}; no new special
200characters will be defined in the future.  Any other character appearing
201in a regular expression is ordinary, unless a @samp{\} precedes it.
202
203For example, @samp{f} is not a special character, so it is ordinary, and
204therefore @samp{f} is a regular expression that matches the string
205@samp{f} and no other string.  (It does @emph{not} match the string
206@samp{ff}.)  Likewise, @samp{o} is a regular expression that matches
207only @samp{o}.@refill
208
209Any two regular expressions @var{a} and @var{b} can be concatenated.  The
210result is a regular expression which matches a string if @var{a} matches
211some amount of the beginning of that string and @var{b} matches the rest of
212the string.@refill
213
214As a simple example, we can concatenate the regular expressions @samp{f}
215and @samp{o} to get the regular expression @samp{fo}, which matches only
216the string @samp{fo}.  Still trivial.  To do something more powerful, you
217need to use one of the special characters.  Here is a list of them:
218
219@need 1200
220@table @kbd
221@item .@: @r{(Period)}
222@cindex @samp{.} in regexp
223is a special character that matches any single character.
224Using concatenation, we can make regular expressions like @samp{a.b}, which
225matches any three-character string that begins with @samp{a} and ends with
226@samp{b}.@refill
227
228@item *
229@cindex @samp{*} in regexp
230is not a construct by itself; it is a suffix operator that means to
231repeat the preceding regular expression as many times as possible.  In
232@samp{fo*}, the @samp{*} applies to the @samp{o}, so @samp{fo*} matches
233one @samp{f} followed by any number of @samp{o}s.  The case of zero
234@samp{o}s is allowed: @samp{fo*} does match @samp{f}.@refill
235
236@samp{*} always applies to the @emph{smallest} possible preceding
237expression.  Thus, @samp{fo*} has a repeating @samp{o}, not a
238repeating @samp{fo}.@refill
239
240The matcher processes a @samp{*} construct by matching, immediately,
241as many repetitions as can be found.  Then it continues with the rest
242of the pattern.  If that fails, backtracking occurs, discarding some
243of the matches of the @samp{*}-modified construct in case that makes
244it possible to match the rest of the pattern.  For example, in matching
245@samp{ca*ar} against the string @samp{caaar}, the @samp{a*} first
246tries to match all three @samp{a}s; but the rest of the pattern is
247@samp{ar} and there is only @samp{r} left to match, so this try fails.
248The next alternative is for @samp{a*} to match only two @samp{a}s.
249With this choice, the rest of the regexp matches successfully.@refill
250
251@item [ @dots{} ]
252@cindex character set (in regexp)
253@cindex @samp{[} in regexp
254@cindex @samp{]} in regexp
255@samp{[} begins a @dfn{character set}, which is terminated by a
256@samp{]}.  In the simplest case, the characters between the two brackets
257form the set.  Thus, @samp{[ad]} matches either one @samp{a} or one
258@samp{d}, and @samp{[ad]*} matches any string composed of just @samp{a}s
259and @samp{d}s (including the empty string), from which it follows that
260@samp{c[ad]*r} matches @samp{cr}, @samp{car}, @samp{cdr},
261@samp{caddaar}, etc.@refill
262
263The usual regular expression special characters are not special inside a
264character set.  A completely different set of special characters exists
265inside character sets: @samp{]}, @samp{-} and @samp{^}.@refill
266
267@samp{-} is used for ranges of characters.  To write a range, write two
268characters with a @samp{-} between them.  Thus, @samp{[a-z]} matches any
269lower case letter.  Ranges may be intermixed freely with individual
270characters, as in @samp{[a-z$%.]}, which matches any lower case letter
271or @samp{$}, @samp{%} or a period.@refill
272
The following literal expressions can also be used inside a
character set to specify classes of characters:
275
276@example
277[:alnum:] [:cntrl:] [:lower:] [:space:]
278[:alpha:] [:digit:] [:print:] [:upper:]
279[:blank:] [:graph:] [:punct:] [:xdigit:]
280@end example
281
To include a @samp{]} in a character set, make it the first character.
For example, @samp{[]a]} matches @samp{]} or @samp{a}.  To include a
@samp{-}, write @samp{-} as the first character in the set, or put it
immediately after a range.  (You can replace one individual character
@var{c} with the range @samp{@var{c}-@var{c}} to make a place to put the
@samp{-}.)  There is no way to write a set containing just @samp{-} and
@samp{]}.
289
290To include @samp{^} in a set, put it anywhere but at the beginning of
291the set.
292
293@item [^ @dots{} ]
294@cindex @samp{^} in regexp
295@samp{[^} begins a @dfn{complement character set}, which matches any
296character except the ones specified.  Thus, @samp{[^a-z0-9A-Z]}
297matches all characters @emph{except} letters and digits.@refill
298
299@samp{^} is not special in a character set unless it is the first
300character.  The character following the @samp{^} is treated as if it
301were first (thus, @samp{-} and @samp{]} are not special there).
302
303Note that a complement character set can match a newline, unless
304newline is mentioned as one of the characters not to match.
305
306@item ^
307@cindex @samp{^} in regexp
308@cindex beginning of line in regexp
309is a special character that matches the empty string, but only at
310the beginning of a line in the text being matched.  Otherwise it fails
311to match anything.  Thus, @samp{^foo} matches a @samp{foo} which occurs
312at the beginning of a line.
313
314When matching a string, @samp{^} matches at the beginning of the string
315or after a newline character @samp{\n}.
316
317@item $
318@cindex @samp{$} in regexp
is similar to @samp{^} but matches only at the end of a line.  Thus,
@samp{x\+$} matches a string of one @samp{x} or more at the end of a line.
321
322When matching a string, @samp{$} matches at the end of the string
323or before a newline character @samp{\n}.
324
325@item \
326@cindex @samp{\} in regexp
327has two functions: it quotes the special characters (including
328@samp{\}), and it introduces additional special constructs.
329
330Because @samp{\} quotes special characters, @samp{\$} is a regular
331expression which matches only @samp{$}, and @samp{\[} is a regular
332expression which matches only @samp{[}, and so on.
333
334Note that @samp{\} also has special meaning in the read syntax of Lisp
335strings, and must be quoted with @samp{\}.  For
336example, the regular expression that matches the @samp{\} character is
337@samp{\\}.  To write a Lisp string that contains the characters
338@samp{\\}, Lisp syntax requires you to quote each @samp{\} with another
339@samp{\}.  Therefore, the read syntax for a regular expression matching
340@samp{\} is @code{"\\\\"}.@refill
341@end table
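
The special characters described above can be exercised with a small
Python sketch.  Python's @code{re} module is an assumption here, not the
matcher this chapter documents: it spells these particular constructs
the same way, except that grouping is written @samp{( )} and that
@samp{^} and @samp{$} match at line boundaries only when
@code{re.MULTILINE} is given.  Python string literals also treat
@samp{\} specially, which makes the quoting rule easy to reproduce.

```python
import re

# The group (Python syntax for \( \)) records how much a* kept after
# backtracking: a* first takes all three "a"s, then gives one back so
# that the trailing "ar" can match.
STAR = re.compile(r"c(a*)ar")

SET = re.compile(r"c[ad]*r")           # any mix of "a" and "d" between c and r
BRACKET_FIRST = re.compile(r"[]a]")    # "]" is literal when it comes first
COMPLEMENT = re.compile(r"[^a-z0-9A-Z]")

# Without re.MULTILINE, Python's "^" matches only at the start of the string.
LINE_START = re.compile(r"^foo", re.MULTILINE)

# The regexp matching one backslash is \\ ; in an ordinary (non-raw)
# string literal each of those backslashes is doubled again.
BACKSLASH = re.compile("\\\\")

if __name__ == "__main__":
    print(STAR.fullmatch("caaar").group(1))            # aa
    print(bool(SET.fullmatch("caddaar")))              # True
    print(bool(COMPLEMENT.fullmatch("\n")))            # True
    print(LINE_START.search("bar\nfoo").start())       # 4
    print(bool(BACKSLASH.fullmatch("\\")))             # True
```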
342
343For the most part, @samp{\} followed by any character matches only
344that character.  However, there are several exceptions: characters
345which, when preceded by @samp{\}, are special constructs.  Such
346characters are always ordinary when encountered on their own.  Here
347is a table of @samp{\} constructs:
348
349@table @kbd
350@item \+
351@cindex @samp{\+} in regexp
is a suffix operator similar to @samp{*} except that the preceding
expression must match at least once.  So, for example, @samp{ca\+r}
matches the strings @samp{car} and @samp{caaaar} but not the string
@samp{cr}, whereas @samp{ca*r} matches all three strings.
356
357@item \?
358@cindex @samp{\?} in regexp
is a suffix operator similar to @samp{*} except that the preceding
expression can match either once or not at all.  For example,
@samp{ca\?r} matches @samp{car} or @samp{cr}, but does not match anything
else.
363
364@item \|
365@cindex @samp{|} in regexp
366@cindex regexp alternative
367specifies an alternative.
368Two regular expressions @var{a} and @var{b} with @samp{\|} in
369between form an expression that matches anything that either @var{a} or
370@var{b} matches.@refill
371
372Thus, @samp{foo\|bar} matches either @samp{foo} or @samp{bar}
373but no other string.@refill
374
375@samp{\|} applies to the largest possible surrounding expressions.  Only a
376surrounding @samp{\( @dots{} \)} grouping can limit the grouping power of
377@samp{\|}.@refill
378
379Full backtracking capability exists to handle multiple uses of @samp{\|}.
380
381@item \( @dots{} \)
382@cindex @samp{(} in regexp
383@cindex @samp{)} in regexp
384@cindex regexp grouping
385is a grouping construct that serves three purposes:
386
387@enumerate
388@item
389To enclose a set of @samp{\|} alternatives for other operations.
390Thus, @samp{\(foo\|bar\)x} matches either @samp{foox} or @samp{barx}.
391
392@item
393To enclose an expression for a suffix operator such as @samp{*} to act
394on.  Thus, @samp{ba\(na\)*} matches @samp{bananana}, etc., with any
395(zero or more) number of @samp{na} strings.@refill
396
397@item
398To record a matched substring for future reference.
399@end enumerate
400
401This last application is not a consequence of the idea of a
402parenthetical grouping; it is a separate feature which happens to be
403assigned as a second meaning to the same @samp{\( @dots{} \)} construct
404because there is no conflict in practice between the two meanings.
405Here is an explanation of this feature:
406
407@item \@var{digit}
408matches the same text which matched the @var{digit}th occurrence of a
409@samp{\( @dots{} \)} construct.
410
In other words, after the end of a @samp{\( @dots{} \)} construct, the
matcher remembers the beginning and end of the text matched by that
413construct.  Then, later on in the regular expression, you can use
414@samp{\} followed by @var{digit} to match that same text, whatever it
415may have been.
416
417The strings matching the first nine @samp{\( @dots{} \)} constructs
418appearing in a regular expression are assigned numbers 1 through 9 in
419the order that the open parentheses appear in the regular expression.
420So you can use @samp{\1} through @samp{\9} to refer to the text matched
421by the corresponding @samp{\( @dots{} \)} constructs.
422
423For example, @samp{\(.*\)\1} matches any newline-free string that is
424composed of two identical halves.  The @samp{\(.*\)} matches the first
425half, which may be anything, but the @samp{\1} that follows must match
426the same exact text.
427
428@item \w
429@cindex @samp{\w} in regexp
430matches any word-constituent character.
431
432@item \W
433@cindex @samp{\W} in regexp
434matches any character that is not a word-constituent.
435@end table
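
The backslash constructs in the table above map onto unescaped operators
in Python's @code{re} module.  The sketch below assumes that module and
uses a hypothetical lookup table of translated patterns; it is meant to
show the behaviour of each construct, not how this matcher is invoked.

```python
import re

# Syntax above          Python syntax
#   ca\+r                 ca+r
#   ca\?r                 ca?r
#   \(foo\|bar\)x         (foo|bar)x
#   ba\(na\)*             ba(na)*
#   \(.*\)\1              (.*)\1
TRANSLATED = {
    "plus": r"ca+r",
    "optional": r"ca?r",
    "group_alt": r"(foo|bar)x",
    "group_star": r"ba(na)*",
    "halves": r"(.*)\1",
}


def whole(name, text):
    """Return True if the named pattern matches the entire string."""
    return re.fullmatch(TRANSLATED[name], text) is not None


if __name__ == "__main__":
    print(whole("plus", "caaaar"), whole("plus", "cr"))
    print(whole("optional", "cr"), whole("optional", "caar"))
    print(whole("group_alt", "barx"), whole("group_star", "bananana"))
    print(whole("halves", "abcabc"), whole("halves", "abcabd"))
```

As in the description of @samp{\(.*\)\1}, the last pattern accepts only
strings made of two identical halves, because the backreference must
match exactly the text that the group matched.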
436
437  These regular expression constructs match the empty string---that is,
438they don't use up any characters---but whether they match depends on the
439context.
440
441@table @kbd
442@item \`
443@cindex @samp{\`} in regexp
444matches the empty string, but only at the beginning
445of the buffer or string being matched against.
446
447@item \'
448@cindex @samp{\'} in regexp
449matches the empty string, but only at the end of
450the buffer or string being matched against.
451
452@item \b
453@cindex @samp{\b} in regexp
matches the empty string, but only at the beginning or
end of a word.  Thus, @samp{\bfoo\b} matches any occurrence of
@samp{foo} as a separate word.  @samp{\bballs\?\b} matches
@samp{ball} or @samp{balls} as a separate word.@refill
458
459@item \B
460@cindex @samp{\B} in regexp
461matches the empty string, but @emph{not} at the beginning or
462end of a word.
463
464@item \<
465@cindex @samp{\<} in regexp
466matches the empty string, but only at the beginning of a word.
467
468@item \>
469@cindex @samp{\>} in regexp
470matches the empty string, but only at the end of a word.
471@end table
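
The empty-string constructs can be observed by listing the offsets at
which they match.  This Python sketch makes several assumptions that are
not part of the syntax described here: Python's @code{re} module writes
@samp{\`} as @samp{\A} and @samp{\'} as @samp{\Z}, and it has no
@samp{\<} or @samp{\>}, so those two are emulated with a word boundary
plus a lookahead or lookbehind.

```python
import re

# re.MULTILINE is set to show that \A and \Z ignore line boundaries.
BUFFER_START = re.compile(r"\Afoo", re.MULTILINE)   # syntax above: \`foo
BUFFER_END = re.compile(r"foo\Z", re.MULTILINE)     # syntax above: foo\'

WORD_START = re.compile(r"\b(?=\w)")     # emulates \<
WORD_END = re.compile(r"\b(?<=\w)")      # emulates \>
INSIDE_WORD = re.compile(r"\B")          # not at a word boundary


def positions(pattern, text):
    """Return every offset at which the empty-string pattern matches."""
    return [m.start() for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(positions(WORD_START, "foo bar"))    # [0, 4]
    print(positions(WORD_END, "foo bar"))      # [3, 7]
    print(positions(INSIDE_WORD, "foo bar"))   # [1, 2, 5, 6]
    print(BUFFER_START.search("bar\nfoo"))     # None
```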
472
473@kindex invalid-regexp
474  Not every string is a valid regular expression.  For example, a string
475with unbalanced square brackets is invalid (with a few exceptions, such
as @samp{[]]}), and so is a string that ends with a single @samp{\}.  If
477an invalid regular expression is passed to any of the search functions,
478an @code{invalid-regexp} error is signaled.
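
Other regexp libraries report invalid patterns in a similar way.  As a
hedged illustration only, the Python sketch below uses the @code{re}
module, which raises @code{re.error} where the search functions here
signal @code{invalid-regexp}; the set of patterns each implementation
rejects is not guaranteed to be identical, and the function name is
hypothetical.

```python
import re


def is_valid_regexp(text):
    """Return True if Python's re module accepts text as a pattern."""
    try:
        re.compile(text)
    except re.error:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_regexp("[ab"))     # unbalanced square bracket
    print(is_valid_regexp("[]]"))     # "]" first in the set is literal
    print(is_valid_regexp("ab\\"))    # pattern ends with a single backslash
```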
479
480@node Regexp Examples
481@chapter Examples
482@section Complex Regexp Example
483
484  Here is a complicated regexp, used by Emacs to recognize the end of a
485sentence together with any whitespace that follows.  It is the value of
486the variable @code{sentence-end}.
487
488  First, we show the regexp as a string in C syntax to distinguish
489spaces from tab characters.  The string constant begins and ends with a
490double-quote.  @samp{\"} stands for a double-quote as part of the
491string, @samp{\\} for a backslash as part of the string, @samp{\t} for a
492tab and @samp{\n} for a newline.
493
494@example
495"[.?!][]\"')@}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*"
496@end example
497
498  In contrast, in Lisp, you have to type the tab as Ctrl-V Ctrl-I, producing
499the following:
500
501@example
502@group
503sentence-end
504@result{}
505"[.?!][]\"')@}]*\\($\\| $\\|  \\|  \\)[
506]*"
507@end group
508@end example
509
510@noindent
511In this output, tab and newline appear as themselves.
512
513  This regular expression contains four parts in succession and can be
514deciphered as follows:
515
516@table @code
517@item [.?!]
518The first part of the pattern consists of three characters, a period, a
519question mark and an exclamation mark, within square brackets.  The
520match must begin with one of these three characters.
521
522@item []\"')@}]*
523The second part of the pattern matches any closing braces and quotation
524marks, zero or more of them, that may follow the period, question mark
525or exclamation mark.  The @code{\"} is C or Lisp syntax for a double-quote in
526a string.  The @samp{*} at the end indicates that the immediately
527preceding regular expression (a character set, in this case) may be
528repeated zero or more times.
529
@item \\($\\|@ $\\|\t\\|@ @ \\)
The third part of the pattern matches the whitespace that follows the
end of a sentence: the end of a line (optionally preceded by a space),
or a tab, or two spaces.  The
double backslashes mark the parentheses and vertical bars as regular
expression syntax; the parentheses mark the group and the vertical bars
separate alternatives.  The dollar sign is used to match the end of a
line.
537
538@item [ \t\n]*
539Finally, the last part of the pattern matches any additional whitespace
540beyond the minimum needed to end a sentence.
541@end table
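
To see the four parts working together, the regexp can be translated
into another engine and applied to sample text.  The sketch below is an
assumption-laden illustration in Python, not how Emacs uses
@code{sentence-end}: the group and alternation lose their backslashes in
Python's @code{re} syntax, @code{re.MULTILINE} is needed for @samp{$} to
match at the end of every line, and @code{split_sentences} is a
hypothetical helper.

```python
import re

# Python translation of the C-syntax string shown above.
SENTENCE_END = re.compile(
    r"""[.?!][]"')}]*($| $|\t|  )[ \t\n]*""", re.MULTILINE
)


def split_sentences(text):
    """Split text after each match of SENTENCE_END.

    The punctuation and closing delimiters stay with the sentence; the
    whitespace matched after them is dropped.
    """
    sentences = []
    start = 0
    for m in SENTENCE_END.finditer(text):
        sentences.append(text[start:m.start()] + m.group(0).rstrip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


if __name__ == "__main__":
    sample = 'He said "Go."  Then left.\nWhy?\tBecause! Really'
    for sentence in split_sentences(sample):
        print(repr(sentence))
```

In the sample, the exclamation mark followed by a single space does not
end a sentence, since the third part of the pattern requires the end of
a line, a tab, or two spaces.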
542
543@node Common Regexps
544@section Common Regular Expressions Used in Editing
545@cindex regexps used standardly in editing
546@cindex standard regexps used in editing
547
548  This section describes some common regular expressions
549used for certain purposes in editing:
550
551Page delimiter:
552This is the regexp describing line-beginnings that separate pages.  A good
553value is @code{(string #\Page)}.
554
555Paragraph separator:
556This is the regular expression for recognizing the beginning of a line
that separates paragraphs.  A good value is (in C syntax)
@code{"^[ \t\f]*$"}, which is a line that consists entirely of spaces,
tabs, and form feeds.
560
561Paragraph start:
562This is the regular expression for recognizing the beginning of a line
563that starts @emph{or} separates paragraphs.  A good value is (in C syntax)
564@code{"^[ \t\n\f]"}, which matches a line starting with a space, tab,
565newline, or form feed.
566
567Sentence end:
568This is the regular expression describing the end of a sentence.  (All
569paragraph boundaries also end sentences, regardless.)  A good value
570is (in C syntax, again):
571
572@example
573"[.?!][]\"')@}]*\\($\\|\t\\| \\)[ \t\n]*"
574@end example
575
This means a period, question mark or exclamation mark, followed by any
number of closing brackets, quotes, parentheses or braces, followed by
the end of the line, a tab or a space, and then any further tabs, spaces
or newlines.
578
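The paragraph regexps above are written the same way in Python's
@code{re} syntax, so their effect can be sketched directly.  The
following is only an illustration under that assumption; the function
@code{split_paragraphs} and its policy of dropping separator lines are
hypothetical, not a description of how an editor applies these values.

```python
import re

PARAGRAPH_SEPARATE = re.compile(r"^[ \t\f]*$")   # blank or whitespace-only line
PARAGRAPH_START = re.compile(r"^[ \t\n\f]")      # line beginning with whitespace


def split_paragraphs(text):
    """Group the lines of text into paragraphs (lists of lines).

    A separator line ends the current paragraph and is dropped; any
    other line beginning with whitespace starts a new paragraph.
    """
    paragraphs, current = [], []
    for line in text.split("\n"):
        if PARAGRAPH_SEPARATE.match(line):
            if current:
                paragraphs.append(current)
            current = []
        elif PARAGRAPH_START.match(line):
            if current:
                paragraphs.append(current)
            current = [line]
        else:
            current.append(line)
    if current:
        paragraphs.append(current)
    return paragraphs


if __name__ == "__main__":
    sample = "one\ntwo\n\n  three\nfour\n \t\nfive\n  six"
    for paragraph in split_paragraphs(sample):
        print(paragraph)
```

The separator test comes first because a whitespace-only line also
matches the paragraph-start regexp.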
579@bye
580
581