=head1 NAME

perlreapi - Perl regular expression plugin interface

=head1 DESCRIPTION

As of Perl 5.9.5 there is a new interface for plugging and using
regular expression engines other than the default one.

Each engine is supposed to provide access to a constant structure of the
following format:

    typedef struct regexp_engine {
        REGEXP* (*comp) (pTHX_
                         const SV * const pattern, const U32 flags);
        I32     (*exec) (pTHX_
                         REGEXP * const rx,
                         char* stringarg,
                         char* strend, char* strbeg,
                         SSize_t minend, SV* sv,
                         void* data, U32 flags);
        char*   (*intuit) (pTHX_
                           REGEXP * const rx, SV *sv,
                           const char * const strbeg,
                           char *strpos, char *strend, U32 flags,
                           struct re_scream_pos_data_s *data);
        SV*     (*checkstr) (pTHX_ REGEXP * const rx);
        void    (*free) (pTHX_ REGEXP * const rx);
        void    (*numbered_buff_FETCH) (pTHX_
                                        REGEXP * const rx,
                                        const I32 paren,
                                        SV * const sv);
        void    (*numbered_buff_STORE) (pTHX_
                                        REGEXP * const rx,
                                        const I32 paren,
                                        SV const * const value);
        I32     (*numbered_buff_LENGTH) (pTHX_
                                         REGEXP * const rx,
                                         const SV * const sv,
                                         const I32 paren);
        SV*     (*named_buff) (pTHX_
                               REGEXP * const rx,
                               SV * const key,
                               SV * const value,
                               U32 flags);
        SV*     (*named_buff_iter) (pTHX_
                                    REGEXP * const rx,
                                    const SV * const lastkey,
                                    const U32 flags);
        SV*     (*qr_package)(pTHX_ REGEXP * const rx);
    #ifdef USE_ITHREADS
        void*   (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
    #endif
        REGEXP* (*op_comp) (...);
    } regexp_engine;


=for apidoc_section $regexp
=for apidoc Ay||regexp_engine

When a regexp is compiled, its C<engine> field is then set to point at
the appropriate structure, so that when it needs to be used Perl can find
the right routines to do so.

In order to install a new regexp handler, C<$^H{regcomp}> is set
to an integer which (when cast appropriately) resolves to one of these
structures.
When compiling, the C<comp> method is executed, and the
resulting C<regexp> structure's engine field is expected to point back at
the same structure.

The C<pTHX_> symbol in the definition is a macro used by Perl under threading
to provide an extra argument to the routine holding a pointer back to
the interpreter that is executing the regexp. So under threading all
routines get an extra argument.

=head1 Callbacks

=head2 comp

    REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags);

Compile the pattern stored in C<pattern> using the given C<flags> and
return a pointer to a prepared C<REGEXP> structure that can perform
the match. See L</The REGEXP structure> below for an explanation of
the individual fields in the REGEXP struct.

The C<pattern> parameter is the scalar that was used as the
pattern. Previous versions of Perl would pass two C<char*> indicating
the start and end of the stringified pattern; the following snippet can
be used to get the old parameters:

    STRLEN plen;
    char*  exp = SvPV(pattern, plen);
    char* xend = exp + plen;

Since any scalar can be passed as a pattern, it's possible to implement
an engine that does something with an array (C<< "ook" =~ [ qw/ eek
hlagh / ] >>) or with the non-stringified form of a compiled regular
expression (C<< "ook" =~ qr/eek/ >>). Perl's own engine will always
stringify everything using the snippet above, but that doesn't mean
other engines have to.

The C<flags> parameter is a bitfield which indicates which of the
C<msixpn> flags the regex was compiled with. It also contains
additional info, such as whether C<use locale> is in effect.

The C<eogc> flags are stripped out before being passed to the comp
routine. The regex engine does not need to know if any of these
are set, as those flags should only affect what Perl does with the
pattern and its match variables, not how it gets compiled and
executed.
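
As an illustration of treating C<flags> as a bitfield, here is a minimal,
self-contained sketch. The C<DEMO_PMf_*> constants and the function
C<demo_describe_flags> are hypothetical names invented for this example,
and the bit values are illustrative only. A real engine would include
Perl's headers and test the C<RXf_PMf_*> constants instead.

```c

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical stand-ins for the RXf_PMf_* bits. The values are
     * illustrative only and are NOT the ones Perl defines. */
    #define DEMO_PMf_MULTILINE  (1U << 0)  /* /m */
    #define DEMO_PMf_SINGLELINE (1U << 1)  /* /s */
    #define DEMO_PMf_FOLD       (1U << 2)  /* /i */
    #define DEMO_PMf_EXTENDED   (1U << 3)  /* /x */

    /* Write the modifier letters that are set in flags into out, which
     * must hold at least 5 bytes, and return how many were written. */
    static size_t
    demo_describe_flags(uint32_t flags, char *out)
    {
        size_t n = 0;

        if (flags & DEMO_PMf_MULTILINE)  out[n++] = 'm';
        if (flags & DEMO_PMf_SINGLELINE) out[n++] = 's';
        if (flags & DEMO_PMf_FOLD)       out[n++] = 'i';
        if (flags & DEMO_PMf_EXTENDED)   out[n++] = 'x';
        out[n] = '\0';
        return n;
    }

```

Each modifier is tested independently with a bitwise AND, so an engine can
react to the bits it understands and ignore the rest.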

By the time the comp callback is called, some of these flags have
already had effect (noted below where applicable). However most of
their effect occurs after the comp callback has run, in routines that
read the C<< rx->extflags >> field which it populates.

In general the flags should be preserved in C<< rx->extflags >> after
compilation, although the regex engine might want to add or delete
some of them to invoke or disable some special behavior in Perl. The
flags along with any special behavior they cause are documented below:

The pattern modifiers:

=over 4

=item C</m> - RXf_PMf_MULTILINE

If this is in C<< rx->extflags >> it will be passed to
C<Perl_fbm_instr> by C<pp_split> which will treat the subject string
as a multi-line string.

=for apidoc Amnh||RXf_PMf_EXTENDED
=for apidoc_item RXf_PMf_FOLD
=for apidoc_item RXf_PMf_KEEPCOPY
=for apidoc_item RXf_PMf_MULTILINE
=for apidoc_item RXf_PMf_SINGLELINE

=item C</s> - RXf_PMf_SINGLELINE

=item C</i> - RXf_PMf_FOLD

=item C</x> - RXf_PMf_EXTENDED

If present on a regex, C<"#"> comments will be handled differently by the
tokenizer in some cases.

TODO: Document those cases.


=item C</p> - RXf_PMf_KEEPCOPY

TODO: Document this

=item Character set

The character set rules are determined by an enum that is contained
in this field. This is still experimental and subject to change, but
the current interface returns the rules by use of the in-line function
C<get_regex_charset(const U32 flags)>. The only currently documented
value returned from it is REGEX_LOCALE_CHARSET, which is set if
C<use locale> is in effect. If present in C<< rx->extflags >>,
C<split> will use the locale dependent definition of whitespace
when RXf_SKIPWHITE or RXf_WHITE is in effect.
ASCII whitespace
is defined as per L<isSPACE|perlapi/isSPACE>, and by the internal
macros C<is_utf8_space> under UTF-8, and C<isSPACE_LC> under C<use
locale>.

=for apidoc Amnh||REGEX_LOCALE_CHARSET

=back

Additional flags:

=over 4

=item RXf_SPLIT

This flag was removed in perl 5.18.0. C<split ' '> is now special-cased
solely in the parser. RXf_SPLIT is still #defined, so you can test for it.
This is how it used to work:

If C<split> is invoked as C<split ' '> or with no arguments (which
really means C<split(' ', $_)>, see L<split|perlfunc/split>), Perl will
set this flag. The regex engine can then check for it and set the
SKIPWHITE and WHITE extflags. To do this, the Perl engine does:

    if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ')
        r->extflags |= (RXf_SKIPWHITE|RXf_WHITE);

=back

These flags can be set during compilation to enable optimizations in
the C<split> operator.

=for apidoc Amnh||RXf_NO_INPLACE_SUBST
=for apidoc_item RXf_NULL
=for apidoc_item RXf_SKIPWHITE
=for apidoc_item RXf_SPLIT
=for apidoc_item RXf_START_ONLY
=for apidoc_item RXf_WHITE

=over 4

=item RXf_SKIPWHITE

This flag was removed in perl 5.18.0. It is still #defined, so you can
set it, but doing so will have no effect. This is how it used to work:

If the flag is present in C<< rx->extflags >>, C<split> will delete
whitespace from the start of the subject string before it's operated
on. What is considered whitespace depends on whether the subject is a
UTF-8 string and whether the C<RXf_PMf_LOCALE> flag is set.

If RXf_WHITE is set in addition to this flag, C<split> will behave like
C<split " "> under the Perl engine.


=item RXf_START_ONLY

Tells the split operator to split the target string on newlines
(C<\n>) without invoking the regex engine.

Perl's engine sets this if the pattern is C</^/> (C<plen == 1 && *exp
== '^'>), even under C</^/s>; see L<split|perlfunc/split>. Of course a
different regex engine might want to use the same optimizations
with a different syntax.

=item RXf_WHITE

Tells the split operator to split the target string on whitespace
without invoking the regex engine. The definition of whitespace varies
depending on whether the target string is a UTF-8 string and on
whether RXf_PMf_LOCALE is set.

Perl's engine sets this flag if the pattern is C<\s+>.

=item RXf_NULL

Tells the split operator to split the target string on
characters. The definition of character varies depending on whether
the target string is a UTF-8 string.

Perl's engine sets this flag on empty patterns; this optimization
makes C<split //> much faster than it would otherwise be. It's even
faster than C<unpack>.

=item RXf_NO_INPLACE_SUBST

Added in perl 5.18.0, this flag indicates that a regular expression might
perform an operation that would interfere with inplace substitution. For
instance it might contain lookbehind, or assign to non-magical variables
(such as $REGMARK and $REGERROR) during matching. C<s///> will skip
certain optimisations when this is set.

=back

=head2 exec

    I32 exec(pTHX_ REGEXP * const rx,
             char *stringarg, char* strend, char* strbeg,
             SSize_t minend, SV* sv,
             void* data, U32 flags);

Execute a regexp. The arguments are

=over 4

=item rx

The regular expression to execute.

=item sv

This is the SV to be matched against. Note that the
actual char array to be matched against is supplied by the arguments
described below; the SV is just used to determine UTF8ness, C<pos()> etc.

=item strbeg

Pointer to the physical start of the string.

=item strend

Pointer to the character following the physical end of the string (i.e.
the C<\0>, if any).

=item stringarg

Pointer to the position in the string where matching should start; it might
not be equal to C<strbeg> (for example in a later iteration of C</.../g>).

=item minend

Minimum length of string (measured in bytes from C<stringarg>) that must
match; if the engine reaches the end of the match but hasn't reached this
position in the string, it should fail.

=item data

Optimisation data; subject to change.

=item flags

Optimisation flags; subject to change.

=back

=head2 intuit

    char* intuit(pTHX_
                REGEXP * const rx,
                SV *sv,
                const char * const strbeg,
                char *strpos,
                char *strend,
                const U32 flags,
                struct re_scream_pos_data_s *data);

Find the start position where a regex match should be attempted,
or possibly determine that the regex engine should not be run because the
pattern can't match. This is called, as appropriate, by the core,
depending on the values of the C<extflags> member of the C<regexp>
structure.

Arguments:

    rx:     the regex to match against
    sv:     the SV being matched: only used for utf8 flag; the string
            itself is accessed via the pointers below. Note that on
            something like an overloaded SV, SvPOK(sv) may be false
            and the string pointers may point to something unrelated to
            the SV itself.
    strbeg: real beginning of string
    strpos: the point in the string at which to begin matching
    strend: pointer to the byte following the last char of the string
    flags:  currently unused; set to 0
    data:   currently unused; set to NULL


=head2 checkstr

    SV* checkstr(pTHX_ REGEXP * const rx);

Return an SV containing a string that must appear in the pattern. Used
by C<split> for optimising matches.
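
The C<minend> rule described under L</exec> amounts to a pointer
comparison. The following is a minimal sketch under stated assumptions:
C<demo_match_long_enough> is a hypothetical helper invented for this
example, and C<ptrdiff_t> stands in for Perl's C<SSize_t> so that the
snippet compiles without Perl's headers.

```c

    #include <stddef.h>

    /* Return 1 if a match ending at match_end lies at least minend
     * bytes past stringarg, and 0 if the match ended too early and
     * should therefore be treated as a failure. */
    static int
    demo_match_long_enough(const char *stringarg, const char *match_end,
                           ptrdiff_t minend)
    {
        return (match_end - stringarg) >= minend;
    }

```

For example, with C<minend> set to 3, a match ending two bytes after
C<stringarg> is rejected, while one ending three or more bytes after it
is accepted.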

=head2 free

    void free(pTHX_ REGEXP * const rx);

Called by Perl when it is freeing a regexp pattern so that the engine
can release any resources pointed to by the C<pprivate> member of the
C<regexp> structure. This is only responsible for freeing private data;
Perl will handle releasing anything else contained in the C<regexp> structure.

=head2 Numbered capture callbacks

Called to get/set the value of C<$`>, C<$'>, C<$&> and their named
equivalents, ${^PREMATCH}, ${^POSTMATCH} and ${^MATCH}, as well as the
numbered capture groups (C<$1>, C<$2>, ...).

The C<paren> parameter will be C<1> for C<$1>, C<2> for C<$2> and so
forth, and have these symbolic values for the special variables:

    ${^PREMATCH}  RX_BUFF_IDX_CARET_PREMATCH
    ${^POSTMATCH} RX_BUFF_IDX_CARET_POSTMATCH
    ${^MATCH}     RX_BUFF_IDX_CARET_FULLMATCH
    $`            RX_BUFF_IDX_PREMATCH
    $'            RX_BUFF_IDX_POSTMATCH
    $&            RX_BUFF_IDX_FULLMATCH

=for apidoc Amnh||RX_BUFF_IDX_CARET_FULLMATCH
=for apidoc_item RX_BUFF_IDX_CARET_POSTMATCH
=for apidoc_item RX_BUFF_IDX_CARET_PREMATCH
=for apidoc_item RX_BUFF_IDX_FULLMATCH
=for apidoc_item RX_BUFF_IDX_POSTMATCH
=for apidoc_item RX_BUFF_IDX_PREMATCH

Note that in Perl 5.17.3 and earlier, the last three constants were also
used for the caret variants of the variables.

The names have been chosen by analogy with L<Tie::Scalar> method
names, with an additional B<LENGTH> callback for efficiency. However
named capture variables are currently not tied internally but
implemented via magic.

=head3 numbered_buff_FETCH

    void numbered_buff_FETCH(pTHX_ REGEXP * const rx, const I32 paren,
                             SV * const sv);

Fetch a specified numbered capture.
C<sv> should be set to the scalar
to return. The scalar is passed as an argument rather than being
returned from the function because Perl already has a scalar to store
the value in when the callback is called, so creating another one would
be redundant. The scalar can be set with C<sv_setsv>, C<sv_setpvn> and
friends, see L<perlapi>.

This callback is where Perl untaints its own capture variables under
taint mode (see L<perlsec>). See the C<Perl_reg_numbered_buff_fetch>
function in F<regcomp.c> for how to untaint capture variables if
that's something you'd like your engine to do as well.

=head3 numbered_buff_STORE

    void    (*numbered_buff_STORE) (pTHX_
                                    REGEXP * const rx,
                                    const I32 paren,
                                    SV const * const value);

Set the value of a numbered capture variable. C<value> is the scalar
that is to be used as the new value. It's up to the engine to make
sure this is used as the new value (or reject it).

Example:

    if ("ook" =~ /(o*)/) {
        # 'paren' will be '1' and 'value' will be 'ee'
        $1 =~ tr/o/e/;
    }

Perl's own engine will croak on any attempt to modify the capture
variables. To do this in another engine use the following callback
(copied from C<Perl_reg_numbered_buff_store>):

    void
    Example_reg_numbered_buff_store(pTHX_
                                    REGEXP * const rx,
                                    const I32 paren,
                                    SV const * const value)
    {
        PERL_UNUSED_ARG(rx);
        PERL_UNUSED_ARG(paren);
        PERL_UNUSED_ARG(value);

        if (!PL_localizing)
            Perl_croak(aTHX_ PL_no_modify);
    }

Actually Perl will not I<always> croak in a statement that looks
like it would modify a numbered capture variable. This is because the
STORE callback will not be called if Perl can determine that it
doesn't have to modify the value.
This is exactly how tied variables
behave in the same situation:

    package CaptureVar;
    use parent 'Tie::Scalar';

    sub TIESCALAR { bless [] }
    sub FETCH { undef }
    sub STORE { die "This doesn't get called" }

    package main;

    tie my $sv => "CaptureVar";
    $sv =~ y/a/b/;

Because C<$sv> is C<undef> when the C<y///> operator is applied to it,
the transliteration won't actually execute and the program won't
C<die>. This is different to how 5.8 and earlier versions behaved
since the capture variables were READONLY variables then; now they'll
just die when assigned to in the default engine.

=head3 numbered_buff_LENGTH

    I32 numbered_buff_LENGTH (pTHX_
                              REGEXP * const rx,
                              const SV * const sv,
                              const I32 paren);

Get the C<length> of a capture variable. There's a special callback
for this so that Perl doesn't have to do a FETCH and run C<length> on
the result. Since the length is (in Perl's case) known from the offsets
stored in C<< rx->offs >>, this is much more efficient:

    I32 s1  = rx->offs[paren].start;
    I32 s2  = rx->offs[paren].end;
    I32 len = s2 - s1;

This is a little bit more complex in the case of UTF-8; see what
C<Perl_reg_numbered_buff_length> does with
L<is_utf8_string_loclen|perlapi/is_utf8_string_loclen>.

=head2 Named capture callbacks

Called to get/set the value of C<%+> and C<%->, as well as by some
utility functions in L<re>.

There are two callbacks: C<named_buff> is called in all the cases the
FETCH, STORE, DELETE, CLEAR, EXISTS and SCALAR L<Tie::Hash> callbacks
would be on changes to C<%+> and C<%->, and C<named_buff_iter> in the
same cases as FIRSTKEY and NEXTKEY.

The C<flags> parameter can be used to determine which of these
operations the callbacks should respond to.
The following flags are
currently defined:

Which L<Tie::Hash> operation is being performed from the Perl level on
C<%+> or C<%->, if any:

    RXapif_FETCH
    RXapif_STORE
    RXapif_DELETE
    RXapif_CLEAR
    RXapif_EXISTS
    RXapif_SCALAR
    RXapif_FIRSTKEY
    RXapif_NEXTKEY

=for apidoc Amnh ||RXapif_ALL
=for apidoc_item RXapif_CLEAR
=for apidoc_item RXapif_DELETE
=for apidoc_item RXapif_EXISTS
=for apidoc_item RXapif_FETCH
=for apidoc_item RXapif_FIRSTKEY
=for apidoc_item RXapif_NEXTKEY
=for apidoc_item RXapif_ONE
=for apidoc_item RXapif_REGNAME
=for apidoc_item RXapif_REGNAMES
=for apidoc_item RXapif_REGNAMES_COUNT
=for apidoc_item RXapif_SCALAR
=for apidoc_item RXapif_STORE

Whether C<%+> or C<%-> is being operated on, if either:

    RXapif_ONE /* %+ */
    RXapif_ALL /* %- */

Whether this is being called as C<re::regname>, C<re::regnames> or
C<re::regnames_count>, if any. The first two will be combined with
C<RXapif_ONE> or C<RXapif_ALL>.

    RXapif_REGNAME
    RXapif_REGNAMES
    RXapif_REGNAMES_COUNT


Internally C<%+> and C<%-> are implemented with a real tied interface
via L<Tie::Hash::NamedCapture>. The methods in that package will call
back into these functions. However the usage of
L<Tie::Hash::NamedCapture> for this purpose might change in future
releases. For instance this might be implemented by magic instead
(would need an extension to mgvtbl).

=head3 named_buff

    SV*     (*named_buff) (pTHX_ REGEXP * const rx, SV * const key,
                           SV * const value, U32 flags);

=head3 named_buff_iter

    SV*     (*named_buff_iter) (pTHX_
                                REGEXP * const rx,
                                const SV * const lastkey,
                                const U32 flags);

=head2 qr_package

    SV* qr_package(pTHX_ REGEXP * const rx);

The package the qr// magic object is blessed into (as seen by C<ref
qr//>).
It is recommended that engines change this to their package
name for identification regardless of whether they implement methods
on the object.

The package this method returns should also have the internal
C<Regexp> package in its C<@ISA>. C<< qr//->isa("Regexp") >> should always
be true regardless of what engine is being used.

An example implementation might be:

    SV*
    Example_qr_package(pTHX_ REGEXP * const rx)
    {
        PERL_UNUSED_ARG(rx);
        return newSVpvs("re::engine::Example");
    }

Any method calls on an object created with C<qr//> will be dispatched to the
package as a normal object.

    use re::engine::Example;
    my $re = qr//;
    $re->meth; # dispatched to re::engine::Example::meth()

To retrieve the C<REGEXP> object from the scalar in an XS function use
the C<SvRX> macro, see L<"REGEXP Functions" in perlapi|perlapi/REGEXP
Functions>.

    void meth(SV * rv)
    PPCODE:
        REGEXP * re = SvRX(rv);

=head2 dupe

    void* dupe(pTHX_ REGEXP * const rx, CLONE_PARAMS *param);

On threaded builds a regexp may need to be duplicated so that the pattern
can be used by multiple threads. This routine is expected to handle the
duplication of any private data pointed to by the C<pprivate> member of
the C<regexp> structure. It will be called with the preconstructed new
C<regexp> structure as an argument; the C<pprivate> member will point at
the B<old> private structure, and it is this routine's responsibility to
construct a copy and return a pointer to it (which Perl will then use to
overwrite the field as passed to this routine).

This allows the engine to dupe its private data but also if necessary
modify the final structure if it really must.

On unthreaded builds this field doesn't exist.

=head2 op_comp

This is private to the Perl core and subject to change. Should be left
null.
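
The copy-and-return contract described under L</dupe> can be sketched
without Perl's headers. Everything in this example is an assumption made
for illustration: C<demo_private> is a hypothetical private structure,
C<demo_dupe_private> is a hypothetical helper, and plain C<malloc> is used
only to keep the snippet self-contained.

```c

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical private data an engine might hang off pprivate. */
    typedef struct demo_private {
        char  *program;   /* engine-owned compiled form */
        size_t len;       /* length of program in bytes */
    } demo_private;

    /* Deep-copy the old private structure and return the copy, or
     * NULL if allocation fails.  A dupe callback would return such a
     * pointer so that it can replace the old pprivate value. */
    static void *
    demo_dupe_private(const demo_private *old)
    {
        demo_private *copy = malloc(sizeof *copy);

        if (!copy)
            return NULL;
        copy->len = old->len;
        copy->program = malloc(old->len ? old->len : 1);
        if (!copy->program) {
            free(copy);
            return NULL;
        }
        if (old->len)
            memcpy(copy->program, old->program, old->len);
        return copy;
    }

```

The copy shares no memory with the original, so each thread can free its
own private data independently.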

=head1 The REGEXP structure

The REGEXP struct is defined in F<regexp.h>.
All regex engines must be able to
correctly build such a structure in their L</comp> routine.

=for apidoc Ayh||struct regexp
=for apidoc Ayh||REGEXP

The REGEXP structure contains all the data that Perl needs to be aware of
to properly work with the regular expression. It includes data about
optimisations that Perl can use to determine if the regex engine should
really be used, and various other control info that is needed to properly
execute patterns in various contexts, such as whether the pattern is
anchored in some way, or what flags were used during the compile, or if the
program contains special constructs that Perl needs to be aware of.

In addition it contains two fields that are intended for the private
use of the regex engine that compiled the pattern. These are the
C<intflags> and C<pprivate> members. C<pprivate> is a void pointer to
an arbitrary structure, whose use and management is the responsibility
of the compiling engine. Perl will never modify either of these
values.

    /* copied from: regexp.h */
    typedef struct regexp {
        /*----------------------------------------------------------------------
         * Fields required for compatibility with SV types
         */
        _XPV_HEAD;

        /*----------------------------------------------------------------------
         * Operational fields
         */
        const struct regexp_engine* engine; /* what engine created this regexp? */
        REGEXP *mother_re; /* what re is this a lightweight copy of? */
        HV *paren_names;   /* Optional hash of paren names */

        /*----------------------------------------------------------------------
         * Information about the match that the perl core uses to manage things
         */

        /* see comment in regcomp_internal.h about branch reset to understand
           the distinction between physical and logical capture buffers */
        U32 nparens;                    /* physical number of capture buffers */
        U32 logical_nparens;            /* logical number of capture buffers */
        I32 *logical_to_parno;          /* map logical parno to first physical */
        I32 *parno_to_logical;          /* map every physical parno to logical */
        I32 *parno_to_logical_next;     /* map every physical parno to the next
                                           physical with the same logical id */

        SSize_t maxlen;    /* maximum possible number of chars in string to match */
        SSize_t minlen;    /* minimum possible number of chars in string to match */
        SSize_t minlenret; /* minimum possible number of chars in $& */
        STRLEN gofs;       /* chars left of pos that we search from */
                           /* substring data about strings that must appear in
                            * the final match, used for optimisations */

        struct reg_substr_data *substrs;

        /* private engine specific data */

        void *pprivate;    /* Data private to the regex engine which
                            * created this object. */
        U32 extflags;      /* Flags used both externally and internally */
        U32 intflags;      /* Engine Specific Internal flags */

        /*----------------------------------------------------------------------
         * Data about the last/current match. These are modified during matching
         */

        U32 lastparen;           /* highest close paren matched ($+) */
        U32 lastcloseparen;      /* last close paren matched ($^N) */
        regexp_paren_pair *offs; /* Array of offsets for (@-) and (@+) */
        char **recurse_locinput; /* used to detect infinite recursion, XXX: move to internal */


        /*---------------------------------------------------------------------- */

        /* offset from wrapped to the start of precomp */
        PERL_BITFIELD32 pre_prefix:4;

        /* original flags used to compile the pattern, may differ from
         * extflags in various ways */
        PERL_BITFIELD32 compflags:9;

        /*---------------------------------------------------------------------- */

        char *subbeg;       /* saved or original string so \digit works forever. */
        SV_SAVED_COPY       /* If non-NULL, SV which is COW from original */
        SSize_t sublen;     /* Length of string pointed by subbeg */
        SSize_t suboffset;  /* byte offset of subbeg from logical start of str */
        SSize_t subcoffset; /* suboffset equiv, but in chars (for @-/@+) */

        /*----------------------------------------------------------------------
         * More Operational fields
         */

        CV *qr_anoncv;      /* the anon sub wrapped round qr/(?{..})/ */
    } regexp;

Most of the fields contained in this structure are accessed via macros
with a prefix of C<RX_> or C<RXp_>. The fields are discussed in more detail
below:

=head2 C<engine>

This field points at a C<regexp_engine> structure which contains pointers
to the subroutines that are to be used for performing a match. It
is the compiling routine's responsibility to populate this field before
returning the regexp object.

Internally this is set to C<NULL> unless a custom engine is specified in
C<$^H{regcomp}>; Perl's own set of callbacks can be accessed in the struct
pointed to by C<RE_ENGINE_PTR>.
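
The relationship between a compiled regexp and its engine table can be
modelled in a few lines of plain C. This is only a sketch: the
C<demo_engine> and C<demo_regexp> types and the C<demo_comp> function are
hypothetical, heavily simplified stand-ins for C<regexp_engine>,
C<regexp> and a real C<comp> callback.

```c

    #include <stdlib.h>

    struct demo_regexp;

    /* Hypothetical, cut-down analogue of regexp_engine. */
    typedef struct demo_engine {
        struct demo_regexp *(*comp)(const char *pattern);
    } demo_engine;

    /* Hypothetical, cut-down analogue of struct regexp. */
    typedef struct demo_regexp {
        const demo_engine *engine;   /* which engine created this */
        const char        *pattern;  /* borrowed, not copied */
    } demo_regexp;

    static demo_regexp *demo_comp(const char *pattern);

    /* The constant table of callbacks for this engine. */
    static const demo_engine demo_engine_table = { demo_comp };

    /* Build the regexp and point its engine field back at the table
     * before returning it; returns NULL on allocation failure. */
    static demo_regexp *
    demo_comp(const char *pattern)
    {
        demo_regexp *rx = malloc(sizeof *rx);

        if (!rx)
            return NULL;
        rx->pattern = pattern;
        rx->engine  = &demo_engine_table;
        return rx;
    }

```

Because every regexp carries a pointer to the table that built it, later
operations can be dispatched through C<< rx->engine >> without knowing
which engine is in use.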

=for apidoc Amnh||SV_SAVED_COPY

=head2 C<mother_re>

This is a pointer to another struct regexp which this one was derived
from. C<qr//> objects mean that the same regexp pattern can be used in
different contexts at the same time, and as long as match status
information is stored in the structure (there are plans to change this
eventually) we need to support having multiple copies of the structure
in use at the same time. The fields related to the regexp program itself
are copied from the mother_re, and owned by the mother_re, whereas the
match state variables are owned by the struct itself.

=head2 C<extflags>

This will be used by Perl to see what flags the regexp was compiled
with; this will normally be set to the value of the flags parameter by
the L<comp|/comp> callback. See the L<comp|/comp> documentation for
valid flags.

=head2 C<minlen> C<minlenret>

The minimum string length (in characters) required for the pattern to match.
This is used to
prune the search space by not bothering to match any closer to the end of a
string than would allow a match. For instance there is no point in even
starting the regex engine if the minlen is 10 but the string is only 5
characters long. There is no way that the pattern can match.

C<minlenret> is the minimum length (in characters) of the string that would
be found in $& after a match.

The difference between C<minlen> and C<minlenret> can be seen in the
following pattern:

    /ns(?=\d)/

where the C<minlen> would be 3 but C<minlenret> would only be 2 as the \d is
required to match but is not actually
included in the matched content. This
distinction is particularly important as the substitution logic uses the
C<minlenret> to tell if it can do in-place substitutions (these can
result in considerable speed-up).

=head2 C<gofs>

Left offset from pos() to start match at.
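
The pruning that C<minlen> permits, described earlier in this section, can
be sketched as two small checks. The function names
C<demo_worth_running> and C<demo_start_positions> are hypothetical, the
lengths are assumed to be in characters, and C<ptrdiff_t> stands in for
C<SSize_t> so that the snippet compiles on its own.

```c

    #include <stddef.h>

    /* Return 1 if a subject of subject_chars characters could match
     * a pattern needing at least minlen characters, else 0. */
    static int
    demo_worth_running(ptrdiff_t subject_chars, ptrdiff_t minlen)
    {
        return subject_chars >= minlen;
    }

    /* Count the start positions worth trying: none closer to the end
     * of the subject than minlen characters. */
    static ptrdiff_t
    demo_start_positions(ptrdiff_t subject_chars, ptrdiff_t minlen)
    {
        if (subject_chars < minlen)
            return 0;
        return subject_chars - minlen + 1;
    }

```

With a C<minlen> of 10, a 5 character subject is rejected outright, and a
12 character subject leaves only the first three positions as candidates.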

=head2 C<substrs>

Substring data about strings that must appear in the final match. This
is currently only used internally by Perl's engine, but might be
used in the future for all engines for optimisations.

=head2 C<nparens>, C<logical_nparens>


These fields are used to keep track of the number of physical and logical
paren capture groups there are in the pattern, which may differ if the
pattern includes the use of the branch reset construct C<(?| ... | ... )>.
For instance the pattern C</(?|(foo)|(bar))/> contains two physical capture
buffers, but only one logical capture buffer. Most internal logic in the
regex engine uses the physical capture buffer ids, but the user exposed
logic uses logical capture buffer ids. See the next section for data-structures
that allow mapping from one to the other.

=head2 C<logical_to_parno>, C<parno_to_logical>, C<parno_to_logical_next>

These fields facilitate mapping between logical and physical capture
buffer numbers. C<logical_to_parno> is an array whose Kth element
contains the lowest physical capture buffer id for the Kth logical
capture buffer. C<parno_to_logical> is an array whose Kth element
contains the logical capture buffer associated with the Kth physical
capture buffer. C<parno_to_logical_next> is an array whose Kth element
contains the next physical capture buffer with the same logical id, or 0
if there is none.

Note that all three of these arrays are ONLY populated when the pattern
includes the use of the branch reset concept. Patterns which do not use
branch-reset effectively have a 1:1 mapping between logical and
physical so there is no need for this meta-data.

The following table gives an example of how this works.

    Pattern   /(a) (?| (b) (c) (d) | (e) (f) | (g) ) (h)/
    Logical:   $1      $2  $3  $4    $2  $3    $2    $5
    Physical:   1       2   3   4     5   6     7     8
    Next:       0       5   6   0     7   0     0     0

Also note that the 0th element of any of these arrays is not used as it
represents the "entire pattern".

=head2 C<lastparen>, and C<lastcloseparen>

These fields are used to keep track of: which was the highest paren to
be closed (see L<perlvar/$+>); and which was the most recent paren to be
closed (see L<perlvar/$^N>).

=head2 C<intflags>

The engine's private copy of the flags the pattern was compiled with. Usually
this is the same as C<extflags> unless the engine chose to modify one of them.

=head2 C<pprivate>

A void* pointing to an engine-defined
data structure. The Perl engine uses the
C<regexp_internal> structure (see L<perlreguts/Base Structures>) but a custom
engine should use something else.

=head2 C<offs>

An array of C<regexp_paren_pair> structures which define offsets into the
string being matched which correspond to the C<$&> and C<$1>, C<$2> etc.
captures. The C<regexp_paren_pair> struct is defined as follows:

    typedef struct regexp_paren_pair {
        I32 start;
        I32 end;
    } regexp_paren_pair;

=for apidoc Ayh||regexp_paren_pair

If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that
capture group did not match.
C<< ->offs[0].start/end >> represents C<$&> (or
C<${^MATCH}> under C</p>) and C<< ->offs[paren].end >> matches C<$$paren> where
C<< $paren >= 1 >>.

=head2 C<RX_PRECOMP> C<RX_PRELEN>

Used for optimisations. C<RX_PRECOMP> holds a copy of the pattern that
was compiled and C<RX_PRELEN> its length.
When a new pattern is to be
compiled (such as inside a loop) the internal C<regcomp> operator
checks if the last compiled C<REGEXP>'s C<RX_PRECOMP> and C<RX_PRELEN>
are equivalent to the new one, and if so uses the old pattern instead
of compiling a new one.

In older perls these two macros were actually fields in the structure
with the names C<precomp> and C<prelen> respectively.

=head2 C<paren_names>

This is a hash used internally to track named capture groups and their
offsets. The keys are the names of the buffers; the values are dualvars,
with the IV slot holding the number of buffers with the given name and the
pv being an embedded array of I32. The values may also be contained
independently in the data array in cases where named backreferences are
used.

=head2 C<substrs>

Holds information on the longest string that must occur at a fixed
offset from the start of the pattern, and the longest string that must
occur at a floating offset from the start of the pattern. Used to do
Fast-Boyer-Moore searches on the string to find out if it's worth using
the regex engine at all, and if so where in the string to search.

=head2 C<subbeg> C<sublen> C<saved_copy> C<suboffset> C<subcoffset>

Used during the execution phase for managing search and replace patterns,
and for providing the text for C<$&>, C<$1> etc. C<subbeg> points to a
buffer (either the original string, or a copy in the case of
C<RX_MATCH_COPIED(rx_sv)>), and C<sublen> is the length of the buffer. The
C<RX_OFFS_START(rx_sv,n)> and C<RX_OFFS_END(rx_sv,n)> macros index into this
buffer, as does the data structure returned by C<RX_OFFSp(rx_sv)>, but you
should not use that directly.
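
How start and end offsets index into the C<subbeg> buffer can be shown
with a small, self-contained sketch. The C<demo_paren_pair> type and the
C<demo_fetch_capture> function are hypothetical stand-ins written for this
example. The sketch assumes that C<suboffset> is zero and that offsets are
in bytes, and it does not use the C<RX_OFFS_*> macros, which need Perl's
headers.

```c

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical stand-in for regexp_paren_pair. */
    typedef struct demo_paren_pair {
        ptrdiff_t start;
        ptrdiff_t end;
    } demo_paren_pair;

    /* Copy capture n out of the subbeg buffer into out, adding a
     * terminating NUL.  Returns the capture length, or -1 if the
     * group did not match, is out of range, or does not fit. */
    static ptrdiff_t
    demo_fetch_capture(const char *subbeg, ptrdiff_t sublen,
                       const demo_paren_pair *offs, int n,
                       char *out, size_t outsz)
    {
        ptrdiff_t s = offs[n].start;
        ptrdiff_t e = offs[n].end;
        ptrdiff_t len;

        if (s == -1 || e == -1)       /* group did not match */
            return -1;
        if (s < 0 || e < s || e > sublen)
            return -1;
        len = e - s;
        if ((size_t)len >= outsz)
            return -1;
        memcpy(out, subbeg + s, (size_t)len);
        out[len] = '\0';
        return len;
    }

```

For the subject C<"foobar"> with a first pair of 0 and 6 and a second pair
of 3 and 6, index 0 yields the whole match and index 1 yields C<"bar">.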

=for apidoc Amh||RX_MATCH_COPIED|const REGEXP * rx_sv

In the presence of the C<REXEC_COPY_STR> flag, but with the addition of
the C<REXEC_COPY_SKIP_PRE> or C<REXEC_COPY_SKIP_POST> flags, an engine
can choose not to copy the full buffer (although it must still do so in
the presence of C<RXf_PMf_KEEPCOPY> or the relevant bits being set in
C<PL_sawampersand>). In this case, it may set C<suboffset> to indicate the
number of bytes from the logical start of the buffer to the physical start
(i.e. C<subbeg>). It should also set C<subcoffset>, the number of
characters in the offset. The latter is needed to support C<@-> and C<@+>
which work in characters, not bytes.

=for apidoc Amnh ||REXEC_COPY_SKIP_POST
=for apidoc_item ||REXEC_COPY_SKIP_PRE
=for apidoc_item ||REXEC_COPY_STR

=head2 C<RX_WRAPPED> C<RX_WRAPLEN>

Macros which access the string the C<qr//> stringifies to. The Perl
engine for example stores C<(?^:eek)> in the case of C<qr/eek/>.

When using a custom engine that doesn't support the C<(?:)> construct
for inline modifiers, it's probably best to have C<qr//> stringify to
the supplied pattern. Note that this will create undesired patterns in
cases such as:

    my $x = qr/a|b/;  # "a|b"
    my $y = qr/c/i;   # "c"
    my $z = qr/$x$y/; # "a|bc"

There's no solution for this problem other than making the custom
engine understand a construct like C<(?:)>.

=head2 C<RX_REFCNT()>

The number of times the structure is referenced. When this falls to 0,
the regexp is automatically freed by a call to C<pregfree>. This should
be set to 1 in each engine's L</comp> routine. Note that in older perls
this was a member in the struct called C<refcnt> but in more modern
perls where the regexp structure was unified with the SV structure this
is an alias to C<SvREFCNT()>.

=head1 HISTORY

Originally part of L<perlreguts>.

=head1 AUTHORS

Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth>
Bjarmason.

=head1 LICENSE

Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason.

This program is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut