1@c -*-texinfo-*-
2@c This is part of the GNU Guile Reference Manual.
3@c Copyright (C)  1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009, 2010, 2012
4@c   Free Software Foundation, Inc.
5@c See the file guile.texi for copying conditions.
6
7@node Regular Expressions
8@section Regular Expressions
9@tpindex Regular expressions
10
11@cindex regular expressions
12@cindex regex
13@cindex emacs regexp
14
15A @dfn{regular expression} (or @dfn{regexp}) is a pattern that
16describes a whole class of strings.  A full description of regular
17expressions and their syntax is beyond the scope of this manual.
18
19If your system does not include a POSIX regular expression library,
20and you have not linked Guile with a third-party regexp library such
21as Rx, these functions will not be available.  You can tell whether
22your Guile installation includes regular expression support by
23checking whether @code{(provided? 'regex)} returns true.
24
25The following regexp and string matching features are provided by the
26@code{(ice-9 regex)} module.  Before using the described functions,
27you should load this module by executing @code{(use-modules (ice-9
28regex))}.
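
As a minimal sketch of the two steps above, a program might check for
regexp support before loading the module.  The error message text is
only an illustration, not something Guile itself prints.

@lisp
;; Stop early if this Guile was built without regexp support.
(if (not (provided? 'regex))
    (error "regular expression support is not available"))

;; Make the procedures described in this section available.
(use-modules (ice-9 regex))
@end lisp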
29
30@menu
31* Regexp Functions::            Functions that create and match regexps.
32* Match Structures::            Finding what was matched by a regexp.
33* Backslash Escapes::           Removing the special meaning of regexp
34                                meta-characters.
35@end menu
36
37
38@node Regexp Functions
39@subsection Regexp Functions
40
By default, Guile supports POSIX extended regular expressions.  That
means that the characters @samp{(}, @samp{)}, @samp{+} and @samp{?} are
special, and must be escaped if you wish to match them literally.  It
also means that there is no support for ``non-greedy'' variants of
@samp{*}, @samp{+} or @samp{?}.
46
47This regular expression interface was modeled after that
48implemented by SCSH, the Scheme Shell.  It is intended to be
49upwardly compatible with SCSH regular expressions.
50
Zero bytes (@code{#\nul}) cannot be used in regex patterns or input
strings, since the underlying C functions treat a zero byte as the end
of the string.  If a pattern or input string contains a zero byte, an
error is thrown.
54
55Internally, patterns and input strings are converted to the current
56locale's encoding, and then passed to the C library's regular expression
57routines (@pxref{Regular Expressions,,, libc, The GNU C Library
58Reference Manual}).  The returned match structures always point to
59characters in the strings, not to individual bytes, even in the case of
60multi-byte encodings.
61
62@deffn {Scheme Procedure} string-match pattern str [start]
63Compile the string @var{pattern} into a regular expression and compare
64it with @var{str}.  The optional numeric argument @var{start} specifies
65the position of @var{str} at which to begin matching.
66
67@code{string-match} returns a @dfn{match structure} which
68describes what, if anything, was matched by the regular
69expression.  @xref{Match Structures}.  If @var{str} does not match
70@var{pattern} at all, @code{string-match} returns @code{#f}.
71@end deffn
72
Two examples follow.  In the first, the pattern matches the four digits
in the target string.  In the second, the pattern does not match at
all, so @code{string-match} returns @code{#f}.
76
77@example
78(string-match "[0-9][0-9][0-9][0-9]" "blah2002")
79@result{} #("blah2002" (4 . 8))
80
81(string-match "[A-Za-z]" "123456")
82@result{} #f
83@end example
84
85Each time @code{string-match} is called, it must compile its
86@var{pattern} argument into a regular expression structure.  This
87operation is expensive, which makes @code{string-match} inefficient if
88the same regular expression is used several times (for example, in a
89loop).  For better performance, you can compile a regular expression in
90advance and then match strings against the compiled regexp.
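
The following sketch shows that approach, using @code{make-regexp} and
@code{regexp-exec}, which are described below.  The name
@code{year-rx} and the sample strings are made up for illustration.

@lisp
(use-modules (ice-9 regex))

;; Compile the pattern once, outside the loop.
(define year-rx (make-regexp "[0-9][0-9][0-9][0-9]"))

;; Keep only the strings that contain four consecutive digits.
;; regexp-exec returns #f when there is no match.
(filter (lambda (s) (regexp-exec year-rx s))
        '("blah2002" "no digits" "x1999y"))
@result{} ("blah2002" "x1999y")
@end lisp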
91
92@deffn {Scheme Procedure} make-regexp pat flag@dots{}
93@deffnx {C Function} scm_make_regexp (pat, flaglst)
94Compile the regular expression described by @var{pat}, and
95return the compiled regexp structure.  If @var{pat} does not
96describe a legal regular expression, @code{make-regexp} throws
97a @code{regular-expression-syntax} error.
98
99The @var{flag} arguments change the behavior of the compiled
100regular expression.  The following values may be supplied:
101
102@defvar regexp/icase
103Consider uppercase and lowercase letters to be the same when
104matching.
105@end defvar
106
107@defvar regexp/newline
108If a newline appears in the target string, then permit the
109@samp{^} and @samp{$} operators to match immediately after or
110immediately before the newline, respectively.  Also, the
111@samp{.} and @samp{[^...]} operators will never match a newline
112character.  The intent of this flag is to treat the target
113string as a buffer containing many lines of text, and the
114regular expression as a pattern that may match a single one of
115those lines.
116@end defvar
117
118@defvar regexp/basic
119Compile a basic (``obsolete'') regexp instead of the extended
120(``modern'') regexps that are the default.  Basic regexps do
121not consider @samp{|}, @samp{+} or @samp{?} to be special
122characters, and require the @samp{@{...@}} and @samp{(...)}
123metacharacters to be backslash-escaped (@pxref{Backslash
124Escapes}).  There are several other differences between basic
125and extended regular expressions, but these are the most
126significant.
127@end defvar
128
129@defvar regexp/extended
130Compile an extended regular expression rather than a basic
131regexp.  This is the default behavior; this flag will not
132usually be needed.  If a call to @code{make-regexp} includes
133both @code{regexp/basic} and @code{regexp/extended} flags, the
134one which comes last will override the earlier one.
135@end defvar
136@end deffn
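
As an illustration of the compile-time flags, the sketch below uses
@code{regexp-exec} (described next) to run the compiled patterns.  The
names and sample strings are made up, and the results assume a C
library that follows POSIX behavior for these flags.

@lisp
(use-modules (ice-9 regex))

;; regexp/newline: ^ and $ also match at line boundaries.
(define line-rx (make-regexp "^[a-z]+$" regexp/newline))
(match:substring (regexp-exec line-rx "123\nabc\n456"))
@result{} "abc"

;; Without the flag, ^ and $ only match at the ends of the string.
(regexp-exec (make-regexp "^[a-z]+$") "123\nabc\n456")
@result{} #f

;; regexp/basic: + is an ordinary character.
(match:substring
  (regexp-exec (make-regexp "a+" regexp/basic) "aaa a+"))
@result{} "a+"
@end lisp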
137
138@deffn {Scheme Procedure} regexp-exec rx str [start [flags]]
139@deffnx {C Function} scm_regexp_exec (rx, str, start, flags)
Match the compiled regular expression @var{rx} against
@var{str}.  If the optional integer @var{start} argument is
142provided, begin matching from that position in the string.
143Return a match structure describing the results of the match,
144or @code{#f} if no match could be found.
145
The @var{flags} argument changes the matching behavior.  The following
flag values may be supplied; use @code{logior} (@pxref{Bitwise
Operations}) to combine them.
149
150@defvar regexp/notbol
151Consider that the @var{start} offset into @var{str} is not the
152beginning of a line and should not match operator @samp{^}.
153
154If @var{rx} was created with the @code{regexp/newline} option above,
155@samp{^} will still match after a newline in @var{str}.
156@end defvar
157
158@defvar regexp/noteol
159Consider that the end of @var{str} is not the end of a line and should
160not match operator @samp{$}.
161
162If @var{rx} was created with the @code{regexp/newline} option above,
163@samp{$} will still match before a newline in @var{str}.
164@end defvar
165@end deffn
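
A short sketch of the execution-time flags follows.  The pattern and
target strings are made up for illustration.

@lisp
(use-modules (ice-9 regex))

(define rx (make-regexp "^foo"))

(match:substring (regexp-exec rx "foobar"))
@result{} "foo"

;; With regexp/notbol, ^ no longer matches at the start offset.
(regexp-exec rx "foobar" 0 regexp/notbol)
@result{} #f

;; Flags are combined with logior.
(regexp-exec (make-regexp "^foo$") "foo" 0
             (logior regexp/notbol regexp/noteol))
@result{} #f
@end lisp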
166
167@lisp
168;; Regexp to match uppercase letters
169(define r (make-regexp "[A-Z]*"))
170
171;; Regexp to match letters, ignoring case
172(define ri (make-regexp "[A-Z]*" regexp/icase))
173
;; Match "bob" against r: no uppercase letters, so [A-Z]* can
;; only match the empty string at the start
(match:substring (regexp-exec r "bob"))
@result{} ""                  ; empty match, not a failure

;; Match "Bob" against ri
(match:substring (regexp-exec ri "Bob"))
@result{} "Bob"               ; matched case-insensitively
181@end lisp
182
183@deffn {Scheme Procedure} regexp? obj
184@deffnx {C Function} scm_regexp_p (obj)
185Return @code{#t} if @var{obj} is a compiled regular expression,
186or @code{#f} otherwise.
187@end deffn
188
189@sp 1
190@deffn {Scheme Procedure} list-matches regexp str [flags]
191Return a list of match structures which are the non-overlapping
192matches of @var{regexp} in @var{str}.  @var{regexp} can be either a
193pattern string or a compiled regexp.  The @var{flags} argument is as
194per @code{regexp-exec} above.
195
196@example
197(map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
198@result{} ("abc" "def")
@end example
200@end deffn
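
Because @code{list-matches} returns full match structures, any of the
accessors from @ref{Match Structures} can be applied to its results.
As a small sketch, the following collects the start and end positions
of each match, this time passing a compiled regexp.

@lisp
(use-modules (ice-9 regex))

(map (lambda (m) (cons (match:start m) (match:end m)))
     (list-matches (make-regexp "[a-z]+") "abc 42 def 78"))
@result{} ((0 . 3) (7 . 10))
@end lisp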
201
202@deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
203Apply @var{proc} to the non-overlapping matches of @var{regexp} in
204@var{str}, to build a result.  @var{regexp} can be either a pattern
205string or a compiled regexp.  The @var{flags} argument is as per
206@code{regexp-exec} above.
207
208@var{proc} is called as @code{(@var{proc} match prev)} where
209@var{match} is a match structure and @var{prev} is the previous return
210from @var{proc}.  For the first call @var{prev} is the given
211@var{init} parameter.  @code{fold-matches} returns the final value
212from @var{proc}.
213
214For example to count matches,
215
216@example
217(fold-matches "[a-z][0-9]" "abc x1 def y2" 0
218              (lambda (match count)
219                (1+ count)))
220@result{} 2
221@end example
222@end deffn
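
As a further sketch, @code{fold-matches} can accumulate a value computed
from each match.  The first call below sums the numbers found in a
string; the second collects the matched text with @code{cons}, which
yields the matches in reverse order because each new match is put in
front of the previous result.

@lisp
(use-modules (ice-9 regex))

(fold-matches "[0-9]+" "abc 42 def 78" 0
              (lambda (match sum)
                (+ sum (string->number (match:substring match)))))
@result{} 120

(fold-matches "[a-z]+" "abc 42 def" '()
              (lambda (match acc)
                (cons (match:substring match) acc)))
@result{} ("def" "abc")
@end lisp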
223
224@sp 1
225Regular expressions are commonly used to find patterns in one string
226and replace them with the contents of another string.  The following
227functions are convenient ways to do this.
228
229@c begin (scm-doc-string "regex.scm" "regexp-substitute")
230@deffn {Scheme Procedure} regexp-substitute port match item @dots{}
231Write to @var{port} selected parts of the match structure @var{match}.
232Or if @var{port} is @code{#f} then form a string from those parts and
233return that.
234
235Each @var{item} specifies a part to be written, and may be one of the
236following,
237
238@itemize @bullet
239@item
240A string.  String arguments are written out verbatim.
241
242@item
243An integer.  The submatch with that number is written
244(@code{match:substring}).  Zero is the entire match.
245
246@item
247The symbol @samp{pre}.  The portion of the matched string preceding
248the regexp match is written (@code{match:prefix}).
249
250@item
251The symbol @samp{post}.  The portion of the matched string following
252the regexp match is written (@code{match:suffix}).
253@end itemize
254
255For example, changing a match and retaining the text before and after,
256
257@example
258(regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
259                   'pre "37" 'post)
260@result{} "number 37 is good"
261@end example
262
263Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and
264re-ordering and hyphenating the fields.
265
266@lisp
267(define date-regex
268   "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
269(define s "Date 20020429 12am.")
270(regexp-substitute #f (string-match date-regex s)
271                   'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
272@result{} "Date 04-29-2002 12am. (20020429)"
273@end lisp
274@end deffn
275
276
@c begin (scm-doc-string "regex.scm" "regexp-substitute/global")
278@deffn {Scheme Procedure} regexp-substitute/global port regexp target item@dots{}
279@cindex search and replace
280Write to @var{port} selected parts of matches of @var{regexp} in
281@var{target}.  If @var{port} is @code{#f} then form a string from
282those parts and return that.  @var{regexp} can be a string or a
283compiled regex.
284
285This is similar to @code{regexp-substitute}, but allows global
286substitutions on @var{target}.  Each @var{item} behaves as per
287@code{regexp-substitute}, with the following differences,
288
289@itemize @bullet
290@item
291A function.  Called as @code{(@var{item} match)} with the match
292structure for the @var{regexp} match, it should return a string to be
293written to @var{port}.
294
295@item
296The symbol @samp{post}.  This doesn't output anything, but instead
297causes @code{regexp-substitute/global} to recurse on the unmatched
298portion of @var{target}.
299
300This @emph{must} be supplied to perform a global search and replace on
301@var{target}; without it @code{regexp-substitute/global} returns after
302a single match and output.
303@end itemize
304
305For example, to collapse runs of tabs and spaces to a single hyphen
306each,
307
308@example
309(regexp-substitute/global #f "[ \t]+"  "this   is   the text"
310                          'pre "-" 'post)
311@result{} "this-is-the-text"
312@end example
313
314Or using a function to reverse the letters in each word,
315
316@example
317(regexp-substitute/global #f "[a-z]+"  "to do and not-do"
318  'pre (lambda (m) (string-reverse (match:substring m))) 'post)
319@result{} "ot od dna ton-od"
320@end example
321
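Without the @code{post} symbol, just one regexp match is made.  The
following is the date example from @code{regexp-substitute} above,
without the need for the separate @code{string-match} call.  Here
@var{target} contains only one date, so the @code{post} item finds no
further match and just writes the text that follows the date.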
326
327@lisp
328(define date-regex
329   "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
330(define s "Date 20020429 12am.")
331(regexp-substitute/global #f date-regex s
332                          'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
333
334@result{} "Date 04-29-2002 12am. (20020429)"
335@end lisp
336@end deffn
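
To make the effect of @code{post} concrete, the following sketch runs
the same substitution with and without it.  The target string is made
up for illustration.

@lisp
(use-modules (ice-9 regex))

;; With post: every run of digits is replaced.
(regexp-substitute/global #f "[0-9]+" "a1b2c3" 'pre "#" 'post)
@result{} "a#b#c#"

;; Without post: output stops after the first match.
(regexp-substitute/global #f "[0-9]+" "a1b2c3" 'pre "#")
@result{} "a#"
@end lisp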
337
338
339@node Match Structures
340@subsection Match Structures
341
342@cindex match structures
343
344A @dfn{match structure} is the object returned by @code{string-match} and
345@code{regexp-exec}.  It describes which portion of a string, if any,
346matched the given regular expression.  Match structures include: a
347reference to the string that was checked for matches; the starting and
348ending positions of the regexp match; and, if the regexp included any
349parenthesized subexpressions, the starting and ending positions of each
350submatch.
351
352In each of the regexp match functions described below, the @code{match}
353argument must be a match structure returned by a previous call to
354@code{string-match} or @code{regexp-exec}.  Most of these functions
355return some information about the original target string that was
356matched against a regular expression; we will call that string
357@var{target} for easy reference.
358
359@c begin (scm-doc-string "regex.scm" "regexp-match?")
360@deffn {Scheme Procedure} regexp-match? obj
361Return @code{#t} if @var{obj} is a match structure returned by a
362previous call to @code{regexp-exec}, or @code{#f} otherwise.
363@end deffn
364
365@c begin (scm-doc-string "regex.scm" "match:substring")
366@deffn {Scheme Procedure} match:substring match [n]
367Return the portion of @var{target} matched by subexpression number
368@var{n}.  Submatch 0 (the default) represents the entire regexp match.
369If the regular expression as a whole matched, but the subexpression
370number @var{n} did not match, return @code{#f}.
371@end deffn
372
373@lisp
374(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
375(match:substring s)
376@result{} "2002"
377
378;; match starting at offset 6 in the string
379(match:substring
380  (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6))
381@result{} "7654"
382@end lisp
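
The next sketch shows the @var{n} argument, including the case of a
subexpression that did not take part in the match.  The pattern is made
up for illustration: only one side of the alternation can match.

@lisp
(use-modules (ice-9 regex))

(define m (string-match "(a)|(b)" "b"))

(match:substring m)      ; the entire match
@result{} "b"
(match:substring m 1)    ; (a) did not match
@result{} #f
(match:substring m 2)    ; (b) matched
@result{} "b"
@end lisp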
383
384@c begin (scm-doc-string "regex.scm" "match:start")
385@deffn {Scheme Procedure} match:start match [n]
386Return the starting position of submatch number @var{n}.
387@end deffn
388
389In the following example, the result is 4, since the match starts at
390character index 4:
391
392@lisp
393(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
394(match:start s)
395@result{} 4
396@end lisp
397
398@c begin (scm-doc-string "regex.scm" "match:end")
399@deffn {Scheme Procedure} match:end match [n]
400Return the ending position of submatch number @var{n}.
401@end deffn
402
In the following example, the result is 8, since the match covers
character indices 4 up to, but not including, 8 (i.e.@: the ``2002'').
405
406@lisp
407(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
408(match:end s)
409@result{} 8
410@end lisp
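
Both procedures also accept a submatch number.  As a sketch, with a
made-up pattern containing two parenthesized groups:

@lisp
(use-modules (ice-9 regex))

(define m (string-match "([0-9]+)-([0-9]+)" "on 2002-04"))

(match:start m 1)
@result{} 3
(match:end m 1)
@result{} 7
(match:start m 2)
@result{} 8
(match:end m 2)
@result{} 10
@end lisp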
411
412@c begin (scm-doc-string "regex.scm" "match:prefix")
413@deffn {Scheme Procedure} match:prefix match
414Return the unmatched portion of @var{target} preceding the regexp match.
415
416@lisp
417(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
418(match:prefix s)
419@result{} "blah"
420@end lisp
421@end deffn
422
423@c begin (scm-doc-string "regex.scm" "match:suffix")
424@deffn {Scheme Procedure} match:suffix match
425Return the unmatched portion of @var{target} following the regexp match.
426@end deffn
427
428@lisp
429(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
430(match:suffix s)
431@result{} "foo"
432@end lisp
433
434@c begin (scm-doc-string "regex.scm" "match:count")
435@deffn {Scheme Procedure} match:count match
436Return the number of parenthesized subexpressions from @var{match}.
437Note that the entire regular expression match itself counts as a
438subexpression, and failed submatches are included in the count.
439@end deffn
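
The following sketch illustrates how the count is formed.  The patterns
are made up for illustration.

@lisp
(use-modules (ice-9 regex))

;; No parentheses: only the entire match is counted.
(match:count (string-match "[0-9]+" "42"))
@result{} 1

;; Two groups plus the entire match.
(match:count (string-match "([0-9]+)-([0-9]+)" "2002-04"))
@result{} 3

;; The group (a) fails to match but is still counted.
(match:count (string-match "(a)|(b)" "b"))
@result{} 3
@end lisp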
440
441@c begin (scm-doc-string "regex.scm" "match:string")
442@deffn {Scheme Procedure} match:string match
443Return the original @var{target} string.
444@end deffn
445
446@lisp
447(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
448(match:string s)
449@result{} "blah2002foo"
450@end lisp
451
452
453@node Backslash Escapes
454@subsection Backslash Escapes
455
456Sometimes you will want a regexp to match characters like @samp{*} or
457@samp{$} exactly.  For example, to check whether a particular string
458represents a menu entry from an Info node, it would be useful to match
459it against a regexp like @samp{^* [^:]*::}.  However, this won't work;
460because the asterisk is a metacharacter, it won't match the @samp{*} at
461the beginning of the string.  In this case, we want to make the first
462asterisk un-magic.
463
464You can do this by preceding the metacharacter with a backslash
465character @samp{\}.  (This is also called @dfn{quoting} the
466metacharacter, and is known as a @dfn{backslash escape}.)  When Guile
467sees a backslash in a regular expression, it considers the following
468glyph to be an ordinary character, no matter what special meaning it
469would ordinarily have.  Therefore, we can make the above example work by
470changing the regexp to @samp{^\* [^:]*::}.  The @samp{\*} sequence tells
471the regular expression engine to match only a single asterisk in the
472target string.
473
474Since the backslash is itself a metacharacter, you may force a regexp to
475match a backslash in the target string by preceding the backslash with
476itself.  For example, to find variable references in a @TeX{} program,
477you might want to find occurrences of the string @samp{\let\} followed
478by any number of alphabetic characters.  The regular expression
479@samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the
480regexp each match a single backslash in the target string.
481
482@c begin (scm-doc-string "regex.scm" "regexp-quote")
483@deffn {Scheme Procedure} regexp-quote str
484Quote each special character found in @var{str} with a backslash, and
485return the resulting string.
486@end deffn
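
As a sketch of why this is useful, consider matching a string that
happens to contain metacharacters.  The name @code{literal} and the
sample strings are made up for illustration, and the example does not
rely on the exact form of the quoted string.

@lisp
(use-modules (ice-9 regex))

(define literal "1.5*2")

;; Unquoted, "." and "*" act as operators, so unrelated text matches.
(match:substring (string-match literal "1x555552"))
@result{} "1x555552"

;; Quoted, only the literal text matches.
(string-match (regexp-quote literal) "1x555552")
@result{} #f
(match:substring
  (string-match (regexp-quote literal) "cost: 1.5*2"))
@result{} "1.5*2"
@end lisp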
487
488@strong{Very important:} Using backslash escapes in Guile source code
489(as in Emacs Lisp or C) can be tricky, because the backslash character
490has special meaning for the Guile reader.  For example, if Guile
491encounters the character sequence @samp{\n} in the middle of a string
492while processing Scheme code, it replaces those characters with a
493newline character.  Similarly, the character sequence @samp{\t} is
494replaced by a horizontal tab.  Several of these @dfn{escape sequences}
495are processed by the Guile reader before your code is executed.
496Unrecognized escape sequences are ignored: if the characters @samp{\*}
497appear in a string, they will be translated to the single character
498@samp{*}.
499
500This translation is obviously undesirable for regular expressions, since
501we want to be able to include backslashes in a string in order to
502escape regexp metacharacters.  Therefore, to make sure that a backslash
503is preserved in a string in your Guile program, you must use @emph{two}
504consecutive backslashes:
505
506@lisp
507(define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))
508@end lisp
509
510The string in this example is preprocessed by the Guile reader before
511any code is executed.  The resulting argument to @code{make-regexp} is
512the string @samp{^\* [^:]*}, which is what we really want.
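
A quick way to see what the reader has done is to check the length of
the resulting string.  The sketch below does that, and then uses the
full menu-entry pattern from the start of this section on a made-up
menu line.

@lisp
(use-modules (ice-9 regex))

;; Five characters in the source, four after the reader has run:
;; ^, \, * and a space.
(string-length "^\\* ")
@result{} 4

(match:substring
  (string-match "^\\* [^:]*::" "* Menu Item::   description"))
@result{} "* Menu Item::"
@end lisp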
513
514This also means that in order to write a regular expression that matches
515a single backslash character, the regular expression string in the
516source code must include @emph{four} backslashes.  Each consecutive pair
517of backslashes gets translated by the Guile reader to a single
518backslash, and the resulting double-backslash is interpreted by the
519regexp engine as matching a single backslash character.  Hence:
520
521@lisp
(define tex-variable-pattern (make-regexp "\\\\let\\\\[A-Za-z]*"))
523@end lisp
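
As a sketch of the two levels of translation, the following applies the
pattern @samp{\\let\\[A-Za-z]*} to a made-up target string.  Note that
the target string and the printed result are also Scheme string
literals, so each backslash in them appears doubled.

@lisp
(use-modules (ice-9 regex))

;; Four backslashes in the source become two in the string.
(string-length "\\\\")
@result{} 2

;; The target holds the text \let\foo = 1.
(match:substring
  (string-match "\\\\let\\\\[A-Za-z]*" "\\let\\foo = 1"))
@result{} "\\let\\foo"
@end lisp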
524
525The reason for the unwieldiness of this syntax is historical.  Both
526regular expression pattern matchers and Unix string processing systems
527have traditionally used backslashes with the special meanings
528described above.  The POSIX regular expression specification and ANSI C
529standard both require these semantics.  Attempting to abandon either
530convention would cause other kinds of compatibility problems, possibly
531more severe ones.  Therefore, without extending the Scheme reader to
532support strings with different quoting conventions (an ungainly and
533confusing extension when implemented in other languages), we must adhere
534to this cumbersome escape syntax.
535