@c -*-texinfo-*-
@c This is part of the GNU Guile Reference Manual.
@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009, 2010, 2012
@c   Free Software Foundation, Inc.
@c See the file guile.texi for copying conditions.

@node Regular Expressions
@section Regular Expressions
@tpindex Regular expressions

@cindex regular expressions
@cindex regex
@cindex emacs regexp

A @dfn{regular expression} (or @dfn{regexp}) is a pattern that
describes a whole class of strings. A full description of regular
expressions and their syntax is beyond the scope of this manual.

If your system does not include a POSIX regular expression library,
and you have not linked Guile with a third-party regexp library such
as Rx, these functions will not be available. You can tell whether
your Guile installation includes regular expression support by
checking whether @code{(provided? 'regex)} returns true.

The following regexp and string matching features are provided by the
@code{(ice-9 regex)} module. Before using the described functions,
you should load this module by executing @code{(use-modules (ice-9
regex))}.

@menu
* Regexp Functions::            Functions that create and match regexps.
* Match Structures::            Finding what was matched by a regexp.
* Backslash Escapes::           Removing the special meaning of regexp
                                meta-characters.
@end menu


@node Regexp Functions
@subsection Regexp Functions

By default, Guile supports POSIX extended regular expressions. That
means that the characters @samp{(}, @samp{)}, @samp{+} and @samp{?} are
special, and must be escaped if you wish to match the literal
characters. There is no support for ``non-greedy'' variants of
@samp{*}, @samp{+} or @samp{?}.

This regular expression interface was modeled after that
implemented by SCSH, the Scheme Shell. It is intended to be
upwardly compatible with SCSH regular expressions.

Zero bytes (@code{#\nul}) cannot be used in regex patterns or input
strings, since the underlying C functions treat that as the end of
string. If there's a zero byte an error is thrown.

Internally, patterns and input strings are converted to the current
locale's encoding, and then passed to the C library's regular expression
routines (@pxref{Regular Expressions,,, libc, The GNU C Library
Reference Manual}). The returned match structures always point to
characters in the strings, not to individual bytes, even in the case of
multi-byte encodings.

@deffn {Scheme Procedure} string-match pattern str [start]
Compile the string @var{pattern} into a regular expression and compare
it with @var{str}. The optional numeric argument @var{start} specifies
the position of @var{str} at which to begin matching.

@code{string-match} returns a @dfn{match structure} which
describes what, if anything, was matched by the regular
expression. @xref{Match Structures}. If @var{str} does not match
@var{pattern} at all, @code{string-match} returns @code{#f}.
@end deffn

Two examples of a match follow. In the first example, the pattern
matches the four digits in the match string. In the second, the pattern
matches nothing.

@example
(string-match "[0-9][0-9][0-9][0-9]" "blah2002")
@result{} #("blah2002" (4 . 8))

(string-match "[A-Za-z]" "123456")
@result{} #f
@end example

Each time @code{string-match} is called, it must compile its
@var{pattern} argument into a regular expression structure. This
operation is expensive, which makes @code{string-match} inefficient if
the same regular expression is used several times (for example, in a
loop). For better performance, you can compile a regular expression in
advance and then match strings against the compiled regexp.
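
As a minimal sketch of the compile-once approach, the following
compiles a pattern a single time with @code{make-regexp} and reuses it
for every element of a list via @code{regexp-exec} (both described
below). The names @code{digits-rx} and @code{strings-with-digits} are
hypothetical, chosen only for this illustration, and @code{filter} is
taken from SRFI-1.

@lisp
(use-modules (ice-9 regex) (srfi srfi-1))

;; Compile the pattern once, outside the loop.
(define digits-rx (make-regexp "[0-9]+"))

;; Hypothetical helper: keep the strings in LST that contain a digit.
;; regexp-exec returns a match structure or #f, so it can serve
;; directly as the filter predicate.
(define (strings-with-digits lst)
  (filter (lambda (s) (regexp-exec digits-rx s)) lst))

(strings-with-digits '("blah2002" "foo" "x1"))
@result{} ("blah2002" "x1")
@end lisp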

@deffn {Scheme Procedure} make-regexp pat flag@dots{}
@deffnx {C Function} scm_make_regexp (pat, flaglst)
Compile the regular expression described by @var{pat}, and
return the compiled regexp structure. If @var{pat} does not
describe a legal regular expression, @code{make-regexp} throws
a @code{regular-expression-syntax} error.

The @var{flag} arguments change the behavior of the compiled
regular expression. The following values may be supplied:

@defvar regexp/icase
Consider uppercase and lowercase letters to be the same when
matching.
@end defvar

@defvar regexp/newline
If a newline appears in the target string, then permit the
@samp{^} and @samp{$} operators to match immediately after or
immediately before the newline, respectively. Also, the
@samp{.} and @samp{[^...]} operators will never match a newline
character. The intent of this flag is to treat the target
string as a buffer containing many lines of text, and the
regular expression as a pattern that may match a single one of
those lines.
@end defvar

@defvar regexp/basic
Compile a basic (``obsolete'') regexp instead of the extended
(``modern'') regexps that are the default. Basic regexps do
not consider @samp{|}, @samp{+} or @samp{?} to be special
characters, and require the @samp{@{...@}} and @samp{(...)}
metacharacters to be backslash-escaped (@pxref{Backslash
Escapes}). There are several other differences between basic
and extended regular expressions, but these are the most
significant.
@end defvar

@defvar regexp/extended
Compile an extended regular expression rather than a basic
regexp. This is the default behavior; this flag will not
usually be needed. If a call to @code{make-regexp} includes
both @code{regexp/basic} and @code{regexp/extended} flags, the
one which comes last will override the earlier one.
@end defvar
@end deffn

@deffn {Scheme Procedure} regexp-exec rx str [start [flags]]
@deffnx {C Function} scm_regexp_exec (rx, str, start, flags)
Match the compiled regular expression @var{rx} against
@var{str}. If the optional integer @var{start} argument is
provided, begin matching from that position in the string.
Return a match structure describing the results of the match,
or @code{#f} if no match could be found.

The @var{flags} argument changes the matching behavior. The following
flag values may be supplied; use @code{logior} (@pxref{Bitwise
Operations}) to combine them:

@defvar regexp/notbol
Consider that the @var{start} offset into @var{str} is not the
beginning of a line and should not match operator @samp{^}.

If @var{rx} was created with the @code{regexp/newline} option above,
@samp{^} will still match after a newline in @var{str}.
@end defvar

@defvar regexp/noteol
Consider that the end of @var{str} is not the end of a line and should
not match operator @samp{$}.

If @var{rx} was created with the @code{regexp/newline} option above,
@samp{$} will still match before a newline in @var{str}.
@end defvar
@end deffn

@lisp
;; Regexp to match uppercase letters
(define r (make-regexp "[A-Z]*"))

;; Regexp to match letters, ignoring case
(define ri (make-regexp "[A-Z]*" regexp/icase))

;; Search for bob using regexp r
(match:substring (regexp-exec r "bob"))
@result{} ""                  ; empty match: "b" is not uppercase

;; Search for bob using regexp ri
(match:substring (regexp-exec ri "Bob"))
@result{} "Bob"               ; matched case-insensitively
@end lisp

@deffn {Scheme Procedure} regexp? obj
@deffnx {C Function} scm_regexp_p (obj)
Return @code{#t} if @var{obj} is a compiled regular expression,
or @code{#f} otherwise.
@end deffn

@sp 1
@deffn {Scheme Procedure} list-matches regexp str [flags]
Return a list of match structures which are the non-overlapping
matches of @var{regexp} in @var{str}. @var{regexp} can be either a
pattern string or a compiled regexp. The @var{flags} argument is as
per @code{regexp-exec} above.

@example
(map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
@result{} ("abc" "def")
@end example
@end deffn

@deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
Apply @var{proc} to the non-overlapping matches of @var{regexp} in
@var{str}, to build a result. @var{regexp} can be either a pattern
string or a compiled regexp. The @var{flags} argument is as per
@code{regexp-exec} above.

@var{proc} is called as @code{(@var{proc} match prev)} where
@var{match} is a match structure and @var{prev} is the previous return
from @var{proc}. For the first call @var{prev} is the given
@var{init} parameter. @code{fold-matches} returns the final value
from @var{proc}.

For example to count matches,

@example
(fold-matches "[a-z][0-9]" "abc x1 def y2" 0
              (lambda (match count)
                (1+ count)))
@result{} 2
@end example
@end deffn

@sp 1
Regular expressions are commonly used to find patterns in one string
and replace them with the contents of another string. The following
functions are convenient ways to do this.

@c begin (scm-doc-string "regex.scm" "regexp-substitute")
@deffn {Scheme Procedure} regexp-substitute port match item @dots{}
Write to @var{port} selected parts of the match structure @var{match}.
Or if @var{port} is @code{#f} then form a string from those parts and
return that.

Each @var{item} specifies a part to be written, and may be one of the
following,

@itemize @bullet
@item
A string. String arguments are written out verbatim.

@item
An integer.
The submatch with that number is written
(@code{match:substring}). Zero is the entire match.

@item
The symbol @samp{pre}. The portion of the matched string preceding
the regexp match is written (@code{match:prefix}).

@item
The symbol @samp{post}. The portion of the matched string following
the regexp match is written (@code{match:suffix}).
@end itemize

For example, changing a match and retaining the text before and after,

@example
(regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
                   'pre "37" 'post)
@result{} "number 37 is good"
@end example

Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and
re-ordering and hyphenating the fields.

@lisp
(define date-regex
   "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
(define s "Date 20020429 12am.")
(regexp-substitute #f (string-match date-regex s)
                   'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
@result{} "Date 04-29-2002 12am. (20020429)"
@end lisp
@end deffn


@c begin (scm-doc-string "regex.scm" "regexp-substitute")
@deffn {Scheme Procedure} regexp-substitute/global port regexp target item@dots{}
@cindex search and replace
Write to @var{port} selected parts of matches of @var{regexp} in
@var{target}. If @var{port} is @code{#f} then form a string from
those parts and return that. @var{regexp} can be a string or a
compiled regex.

This is similar to @code{regexp-substitute}, but allows global
substitutions on @var{target}. Each @var{item} behaves as per
@code{regexp-substitute}, with the following differences,

@itemize @bullet
@item
A function. Called as @code{(@var{item} match)} with the match
structure for the @var{regexp} match, it should return a string to be
written to @var{port}.

@item
The symbol @samp{post}.
This doesn't output anything, but instead
causes @code{regexp-substitute/global} to recurse on the unmatched
portion of @var{target}.

This @emph{must} be supplied to perform a global search and replace on
@var{target}; without it @code{regexp-substitute/global} returns after
a single match and output.
@end itemize

For example, to collapse runs of tabs and spaces to a single hyphen
each,

@example
(regexp-substitute/global #f "[ \t]+" "this is the text"
                          'pre "-" 'post)
@result{} "this-is-the-text"
@end example

Or using a function to reverse the letters in each word,

@example
(regexp-substitute/global #f "[a-z]+" "to do and not-do"
  'pre (lambda (m) (string-reverse (match:substring m))) 'post)
@result{} "ot od dna ton-od"
@end example

When @var{target} contains only a single match,
@code{regexp-substitute/global} can stand in for
@code{regexp-substitute}. For example the following is the date
example from @code{regexp-substitute} above, without the need for the
separate @code{string-match} call.

@lisp
(define date-regex
   "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
(define s "Date 20020429 12am.")
(regexp-substitute/global #f date-regex s
                          'pre 2 "-" 3 "-" 1 'post " (" 0 ")")

@result{} "Date 04-29-2002 12am. (20020429)"
@end lisp
@end deffn


@node Match Structures
@subsection Match Structures

@cindex match structures

A @dfn{match structure} is the object returned by @code{string-match} and
@code{regexp-exec}. It describes which portion of a string, if any,
matched the given regular expression. Match structures include: a
reference to the string that was checked for matches; the starting and
ending positions of the regexp match; and, if the regexp included any
parenthesized subexpressions, the starting and ending positions of each
submatch.
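
As an illustrative sketch of what a match structure records, the
following matches a pattern with two parenthesized subexpressions and
inspects the result with the accessor procedures described in this
section. The pattern and the input string are made up for this
example.

@lisp
(use-modules (ice-9 regex))

;; Two parenthesized subexpressions: a name and a number.
(define m (string-match "([a-z]+)=([0-9]+)" "set width=80 now"))

(match:substring m)    @result{} "width=80"  ; submatch 0, the whole match
(match:substring m 1)  @result{} "width"
(match:substring m 2)  @result{} "80"
(match:start m 2)      @result{} 10
(match:end m 2)        @result{} 12
(match:prefix m)       @result{} "set "
(match:suffix m)       @result{} " now"
(match:count m)        @result{} 3           ; whole match plus two submatches
@end lisp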

In each of the regexp match functions described below, the @code{match}
argument must be a match structure returned by a previous call to
@code{string-match} or @code{regexp-exec}. Most of these functions
return some information about the original target string that was
matched against a regular expression; we will call that string
@var{target} for easy reference.

@c begin (scm-doc-string "regex.scm" "regexp-match?")
@deffn {Scheme Procedure} regexp-match? obj
Return @code{#t} if @var{obj} is a match structure returned by a
previous call to @code{regexp-exec}, or @code{#f} otherwise.
@end deffn

@c begin (scm-doc-string "regex.scm" "match:substring")
@deffn {Scheme Procedure} match:substring match [n]
Return the portion of @var{target} matched by subexpression number
@var{n}. Submatch 0 (the default) represents the entire regexp match.
If the regular expression as a whole matched, but the subexpression
number @var{n} did not match, return @code{#f}.
@end deffn

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:substring s)
@result{} "2002"

;; match starting at offset 6 in the string
(match:substring
  (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6))
@result{} "7654"
@end lisp

@c begin (scm-doc-string "regex.scm" "match:start")
@deffn {Scheme Procedure} match:start match [n]
Return the starting position of submatch number @var{n}.
@end deffn

In the following example, the result is 4, since the match starts at
character index 4:

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:start s)
@result{} 4
@end lisp

@c begin (scm-doc-string "regex.scm" "match:end")
@deffn {Scheme Procedure} match:end match [n]
Return the ending position of submatch number @var{n}.
@end deffn

In the following example, the result is 8, since the match runs between
characters 4 and 8 (i.e.@: the ``2002'').

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:end s)
@result{} 8
@end lisp

@c begin (scm-doc-string "regex.scm" "match:prefix")
@deffn {Scheme Procedure} match:prefix match
Return the unmatched portion of @var{target} preceding the regexp match.

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:prefix s)
@result{} "blah"
@end lisp
@end deffn

@c begin (scm-doc-string "regex.scm" "match:suffix")
@deffn {Scheme Procedure} match:suffix match
Return the unmatched portion of @var{target} following the regexp match.
@end deffn

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:suffix s)
@result{} "foo"
@end lisp

@c begin (scm-doc-string "regex.scm" "match:count")
@deffn {Scheme Procedure} match:count match
Return the number of parenthesized subexpressions from @var{match}.
Note that the entire regular expression match itself counts as a
subexpression, and failed submatches are included in the count.
@end deffn

@c begin (scm-doc-string "regex.scm" "match:string")
@deffn {Scheme Procedure} match:string match
Return the original @var{target} string.
@end deffn

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:string s)
@result{} "blah2002foo"
@end lisp


@node Backslash Escapes
@subsection Backslash Escapes

Sometimes you will want a regexp to match characters like @samp{*} or
@samp{$} exactly. For example, to check whether a particular string
represents a menu entry from an Info node, it would be useful to match
it against a regexp like @samp{^* [^:]*::}.
However, this won't work;
because the asterisk is a metacharacter, it won't match the @samp{*} at
the beginning of the string. In this case, we want to make the first
asterisk un-magic.

You can do this by preceding the metacharacter with a backslash
character @samp{\}. (This is also called @dfn{quoting} the
metacharacter, and is known as a @dfn{backslash escape}.) When Guile
sees a backslash in a regular expression, it considers the following
glyph to be an ordinary character, no matter what special meaning it
would ordinarily have. Therefore, we can make the above example work by
changing the regexp to @samp{^\* [^:]*::}. The @samp{\*} sequence tells
the regular expression engine to match only a single asterisk in the
target string.

Since the backslash is itself a metacharacter, you may force a regexp to
match a backslash in the target string by preceding the backslash with
itself. For example, to find variable references in a @TeX{} program,
you might want to find occurrences of the string @samp{\let\} followed
by any number of alphabetic characters. The regular expression
@samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the
regexp each match a single backslash in the target string.

@c begin (scm-doc-string "regex.scm" "regexp-quote")
@deffn {Scheme Procedure} regexp-quote str
Quote each special character found in @var{str} with a backslash, and
return the resulting string.
@end deffn

@strong{Very important:} Using backslash escapes in Guile source code
(as in Emacs Lisp or C) can be tricky, because the backslash character
has special meaning for the Guile reader. For example, if Guile
encounters the character sequence @samp{\n} in the middle of a string
while processing Scheme code, it replaces those characters with a
newline character. Similarly, the character sequence @samp{\t} is
replaced by a horizontal tab.
Several of these @dfn{escape sequences}
are processed by the Guile reader before your code is executed.
Unrecognized escape sequences are ignored: if the characters @samp{\*}
appear in a string, they will be translated to the single character
@samp{*}.

This translation is obviously undesirable for regular expressions, since
we want to be able to include backslashes in a string in order to
escape regexp metacharacters. Therefore, to make sure that a backslash
is preserved in a string in your Guile program, you must use @emph{two}
consecutive backslashes:

@lisp
(define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))
@end lisp

The string in this example is preprocessed by the Guile reader before
any code is executed. The resulting argument to @code{make-regexp} is
the string @samp{^\* [^:]*}, which is what we really want.

This also means that in order to write a regular expression that matches
a single backslash character, the regular expression string in the
source code must include @emph{four} backslashes. Each consecutive pair
of backslashes gets translated by the Guile reader to a single
backslash, and the resulting double-backslash is interpreted by the
regexp engine as matching a single backslash character. Hence:

@lisp
(define tex-variable-pattern (make-regexp "\\\\let\\\\[A-Za-z]*"))
@end lisp

The reason for the unwieldiness of this syntax is historical. Both
regular expression pattern matchers and Unix string processing systems
have traditionally used backslashes with the special meanings
described above. The POSIX regular expression specification and ANSI C
standard both require these semantics. Attempting to abandon either
convention would cause other kinds of compatibility problems, possibly
more severe ones.
Therefore, without extending the Scheme reader to
support strings with different quoting conventions (an ungainly and
confusing extension when implemented in other languages), we must adhere
to this cumbersome escape syntax.

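
The two layers of escaping can be checked at the REPL. The following
is a small sketch, not an exhaustive treatment: it shows how many
characters the reader leaves in a doubled-backslash string, matches
the Info menu pattern against a sample menu line, and uses
@code{regexp-quote} to match a string literally. The sample strings
and the name @code{menu-rx} are assumptions made for this
illustration.

@lisp
(use-modules (ice-9 regex))

;; The reader collapses each doubled backslash, leaving the four
;; characters ^ \ * and a space.
(string-length "^\\* ")
@result{} 4

;; The regexp engine then treats \* as a literal asterisk.
(define menu-rx (make-regexp "^\\* [^:]*::"))
(match:substring
  (regexp-exec menu-rx "* Regexp Functions:: Functions that match."))
@result{} "* Regexp Functions::"

;; regexp-quote does the regexp-level escaping for us, so a string
;; full of metacharacters matches itself literally.
(match:substring
  (string-match (regexp-quote "1+1=2?") "is 1+1=2? yes"))
@result{} "1+1=2?"
@end lisp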