=head1 NAME

perlreapi - Perl regular expression plugin interface

=head1 DESCRIPTION

As of Perl 5.9.5 there is a new interface for plugging and using
regular expression engines other than the default one.

Each engine is supposed to provide access to a constant structure of the
following format:

    typedef struct regexp_engine {
        REGEXP* (*comp) (pTHX_
                         const SV * const pattern, const U32 flags);
        I32     (*exec) (pTHX_
                         REGEXP * const rx,
                         char* stringarg,
                         char* strend, char* strbeg,
                         SSize_t minend, SV* sv,
                         void* data, U32 flags);
        char*   (*intuit) (pTHX_
                           REGEXP * const rx, SV *sv,
                           const char * const strbeg,
                           char *strpos, char *strend, U32 flags,
                           struct re_scream_pos_data_s *data);
        SV*     (*checkstr) (pTHX_ REGEXP * const rx);
        void    (*free) (pTHX_ REGEXP * const rx);
        void    (*numbered_buff_FETCH) (pTHX_
                                        REGEXP * const rx,
                                        const I32 paren,
                                        SV * const sv);
        void    (*numbered_buff_STORE) (pTHX_
                                        REGEXP * const rx,
                                        const I32 paren,
                                        SV const * const value);
        I32     (*numbered_buff_LENGTH) (pTHX_
                                         REGEXP * const rx,
                                         const SV * const sv,
                                         const I32 paren);
        SV*     (*named_buff) (pTHX_
                               REGEXP * const rx,
                               SV * const key,
                               SV * const value,
                               U32 flags);
        SV*     (*named_buff_iter) (pTHX_
                                    REGEXP * const rx,
                                    const SV * const lastkey,
                                    const U32 flags);
        SV*     (*qr_package)(pTHX_ REGEXP * const rx);
    #ifdef USE_ITHREADS
        void*   (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
    #endif
        REGEXP* (*op_comp) (...);
    } regexp_engine;

=for apidoc_section $regexp
=for apidoc Ay||regexp_engine

When a regexp is compiled, its C<engine> field is then set to point at
the appropriate structure, so that when it needs to be used Perl can find
the right routines to do so.

In order to install a new regexp handler, C<$^H{regcomp}> is set
to an integer which (when cast appropriately) resolves to one of these
structures.  When compiling, the C<comp> method is executed, and the
resulting C<regexp> structure's engine field is expected to point back at
the same structure.

The C<pTHX_> symbol in the definition is a macro used by Perl under threading
to provide an extra argument to the routine holding a pointer back to
the interpreter that is executing the regexp.  So under threading all
routines get an extra argument.
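
The idea of storing an engine as an integer that resolves back to a
table of callbacks can be illustrated outside of Perl.  The following is
a minimal sketch in plain C, not code from the Perl core: C<toy_engine>
and every C<toy_*> function are hypothetical stand-ins, and C<uintptr_t>
plays the role that an IV plays in a real plugin.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for regexp_engine: a table of callbacks. */
typedef struct toy_engine {
    size_t (*comp)(const char *pattern);
    int    (*exec)(const char *pattern, const char *subject);
} toy_engine;

/* "Compile" a pattern; this toy only measures it. */
static size_t toy_comp(const char *pattern)
{
    return strlen(pattern);
}

/* Literal substring search standing in for a real matcher. */
static int toy_exec(const char *pattern, const char *subject)
{
    return strstr(subject, pattern) != NULL;
}

/* The constant structure an engine would expose. */
static const toy_engine toy_engine_table = { toy_comp, toy_exec };

/* Encode the table's address as an integer, the form in which an
 * engine would be stored in a hint such as $^H{regcomp}. */
static uintptr_t toy_engine_as_integer(void)
{
    return (uintptr_t)&toy_engine_table;
}

/* Decode the integer back into a pointer to the callback table. */
static const toy_engine *toy_engine_from_integer(uintptr_t iv)
{
    return (const toy_engine *)iv;
}
```

In a real XS plugin the conversion would normally go through Perl's own
pointer/integer macros rather than a bare cast.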

=head1 Callbacks

=head2 comp

    REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags);

Compile the pattern stored in C<pattern> using the given C<flags> and
return a pointer to a prepared C<REGEXP> structure that can perform
the match.  See L</The REGEXP structure> below for an explanation of
the individual fields in the REGEXP struct.

The C<pattern> parameter is the scalar that was used as the
pattern.  Previous versions of Perl would pass two C<char*> indicating
the start and end of the stringified pattern; the following snippet can
be used to get the old parameters:

    STRLEN plen;
    char*  exp = SvPV(pattern, plen);
    char* xend = exp + plen;

Since any scalar can be passed as a pattern, it's possible to implement
an engine that does something with an array (C<< "ook" =~ [ qw/ eek
hlagh / ] >>) or with the non-stringified form of a compiled regular
expression (C<< "ook" =~ qr/eek/ >>).  Perl's own engine will always
stringify everything using the snippet above, but that doesn't mean
other engines have to.

The C<flags> parameter is a bitfield which indicates which of the
C<msixpn> flags the regex was compiled with.  It also contains
additional info, such as whether C<use locale> is in effect.

The C<eogc> flags are stripped out before being passed to the comp
routine.  The regex engine does not need to know if any of these
are set, as those flags should only affect what Perl does with the
pattern and its match variables, not how it gets compiled and
executed.

By the time the comp callback is called, some of these flags have
already had effect (noted below where applicable).  However most of
their effect occurs after the comp callback has run, in routines that
read the C<< rx->extflags >> field which it populates.

In general the flags should be preserved in C<< rx->extflags >> after
compilation, although the regex engine might want to add or delete
some of them to invoke or disable some special behavior in Perl.  The
flags along with any special behavior they cause are documented below:

The pattern modifiers:

=over 4

=item C</m> - RXf_PMf_MULTILINE

If this is in C<< rx->extflags >> it will be passed to
C<Perl_fbm_instr> by C<pp_split> which will treat the subject string
as a multi-line string.

=for apidoc Amnh||RXf_PMf_EXTENDED
=for apidoc_item  RXf_PMf_FOLD
=for apidoc_item  RXf_PMf_KEEPCOPY
=for apidoc_item  RXf_PMf_MULTILINE
=for apidoc_item  RXf_PMf_SINGLELINE

=item C</s> - RXf_PMf_SINGLELINE

=item C</i> - RXf_PMf_FOLD

=item C</x> - RXf_PMf_EXTENDED

If present on a regex, C<"#"> comments will be handled differently by the
tokenizer in some cases.

TODO: Document those cases.

=item C</p> - RXf_PMf_KEEPCOPY

TODO: Document this

=item Character set

The character set rules are determined by an enum that is contained
in this field.  This is still experimental and subject to change, but
the current interface returns the rules by use of the in-line function
C<get_regex_charset(const U32 flags)>.  The only currently documented
value returned from it is REGEX_LOCALE_CHARSET, which is set if
C<use locale> is in effect.  If present in C<< rx->extflags >>,
C<split> will use the locale dependent definition of whitespace
when RXf_SKIPWHITE or RXf_WHITE is in effect.  ASCII whitespace
is defined as per L<isSPACE|perlapi/isSPACE>, and by the internal
macros C<is_utf8_space> under UTF-8, and C<isSPACE_LC> under C<use
locale>.

=for apidoc Amnh||REGEX_LOCALE_CHARSET

=back

Additional flags:

=over 4

=item RXf_SPLIT

This flag was removed in perl 5.18.0.  C<split ' '> is now special-cased
solely in the parser.  RXf_SPLIT is still #defined, so you can test for it.
This is how it used to work:

If C<split> is invoked as C<split ' '> or with no arguments (which
really means C<split(' ', $_)>, see L<split|perlfunc/split>), Perl will
set this flag.  The regex engine can then check for it and set the
SKIPWHITE and WHITE extflags.  To do this, the Perl engine does:

    if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ')
        r->extflags |= (RXf_SKIPWHITE|RXf_WHITE);

=back

These flags can be set during compilation to enable optimizations in
the C<split> operator.

=for apidoc Amnh||RXf_NO_INPLACE_SUBST
=for apidoc_item  RXf_NULL
=for apidoc_item  RXf_SKIPWHITE
=for apidoc_item  RXf_SPLIT
=for apidoc_item  RXf_START_ONLY
=for apidoc_item  RXf_WHITE

=over 4

=item RXf_SKIPWHITE

This flag was removed in perl 5.18.0.  It is still #defined, so you can
set it, but doing so will have no effect.  This is how it used to work:

If the flag is present in C<< rx->extflags >> C<split> will delete
whitespace from the start of the subject string before it's operated
on.  What is considered whitespace depends on whether the subject is a
UTF-8 string and whether the C<RXf_PMf_LOCALE> flag is set.

If RXf_WHITE is set in addition to this flag, C<split> will behave like
C<split " "> under the Perl engine.

=item RXf_START_ONLY

Tells the split operator to split the target string on newlines
(C<\n>) without invoking the regex engine.

Perl's engine sets this if the pattern is C</^/> (C<plen == 1 && *exp
== '^'>), even under C</^/s>; see L<split|perlfunc/split>.  Of course a
different regex engine might want to use the same optimizations
with a different syntax.

=item RXf_WHITE

Tells the split operator to split the target string on whitespace
without invoking the regex engine.  The definition of whitespace varies
depending on whether the target string is a UTF-8 string and on
whether RXf_PMf_LOCALE is set.

Perl's engine sets this flag if the pattern is C<\s+>.

=item RXf_NULL

Tells the split operator to split the target string on
characters.  The definition of character varies depending on whether
the target string is a UTF-8 string.

Perl's engine sets this flag on empty patterns.  This optimization
makes C<split //> much faster than it would otherwise be.  It's even
faster than C<unpack>.

=item RXf_NO_INPLACE_SUBST

Added in perl 5.18.0, this flag indicates that a regular expression might
perform an operation that would interfere with inplace substitution.  For
instance it might contain lookbehind, or assign to non-magical variables
(such as $REGMARK and $REGERROR) during matching.  C<s///> will skip
certain optimisations when this is set.

=back

=head2 exec

    I32 exec(pTHX_ REGEXP * const rx,
             char *stringarg, char* strend, char* strbeg,
             SSize_t minend, SV* sv,
             void* data, U32 flags);

Execute a regexp.  The arguments are

=over 4

=item rx

The regular expression to execute.

=item sv

This is the SV to be matched against.  Note that the
actual char array to be matched against is supplied by the arguments
described below; the SV is just used to determine UTF8ness, C<pos()> etc.

=item strbeg

Pointer to the physical start of the string.

=item strend

Pointer to the character following the physical end of the string (i.e.
the C<\0>, if any).

=item stringarg

Pointer to the position in the string where matching should start; it might
not be equal to C<strbeg> (for example in a later iteration of C</.../g>).

=item minend

Minimum length of string (measured in bytes from C<stringarg>) that must
match; if the engine reaches the end of the match but hasn't reached this
position in the string, it should fail.

=item data

Optimisation data; subject to change.

=item flags

Optimisation flags; subject to change.

=back
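
How C<stringarg>, C<strend> and C<minend> relate to each other may be
easier to see in a toy matcher.  The sketch below is plain C and is not
how Perl's engine is written: C<toy_exec_literal> is a hypothetical
function that only searches for a literal string, and it treats a
candidate match that ends before C<stringarg + minend> as a failure and
keeps looking.

```c
#include <stddef.h>
#include <string.h>

/* Toy matcher: look for the literal `needle` in the byte range
 * [stringarg, strend).  Returns 1 and stores the end of the match in
 * *match_end on success, 0 on failure. */
static int toy_exec_literal(const char *needle,
                            const char *stringarg, const char *strend,
                            ptrdiff_t minend, const char **match_end)
{
    size_t nlen = strlen(needle);
    const char *p;

    if (nlen == 0)
        return 0;               /* keep the toy simple: no empty needle */

    for (p = stringarg; (size_t)(strend - p) >= nlen; p++) {
        if (memcmp(p, needle, nlen) == 0) {
            const char *end = p + nlen;
            /* Reject a match that ends before stringarg + minend. */
            if (end - stringarg < minend)
                continue;
            *match_end = end;
            return 1;
        }
    }
    return 0;
}
```

Passing a C<stringarg> that is later than the start of the buffer models
a later iteration of a C</.../g> match.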

=head2 intuit

    char* intuit(pTHX_
                 REGEXP * const rx,
                 SV *sv,
                 const char * const strbeg,
                 char *strpos,
                 char *strend,
                 const U32 flags,
                 struct re_scream_pos_data_s *data);

Find the start position where a regex match should be attempted,
or determine that the regex engine should not be run because the
pattern can't match.  This is called, as appropriate, by the core,
depending on the values of the C<extflags> member of the C<regexp>
structure.

Arguments:

    rx:     the regex to match against
    sv:     the SV being matched: only used for utf8 flag; the string
            itself is accessed via the pointers below. Note that on
            something like an overloaded SV, SvPOK(sv) may be false
            and the string pointers may point to something unrelated to
            the SV itself.
    strbeg: real beginning of string
    strpos: the point in the string at which to begin matching
    strend: pointer to the byte following the last char of the string
    flags:  currently unused; set to 0
    data:   currently unused; set to NULL

=head2 checkstr

    SV* checkstr(pTHX_ REGEXP * const rx);

Return a SV containing a string that must appear in the pattern.  Used
by C<split> for optimising matches.

=head2 free

    void free(pTHX_ REGEXP * const rx);

Called by Perl when it is freeing a regexp pattern so that the engine
can release any resources pointed to by the C<pprivate> member of the
C<regexp> structure.  This is only responsible for freeing private data;
Perl will handle releasing anything else contained in the C<regexp> structure.

=head2 Numbered capture callbacks

Called to get/set the value of C<$`>, C<$'>, C<$&> and their named
equivalents, C<${^PREMATCH}>, C<${^POSTMATCH}> and C<${^MATCH}>, as well as
the numbered capture groups (C<$1>, C<$2>, ...).

The C<paren> parameter will be C<1> for C<$1>, C<2> for C<$2> and so
forth, and have these symbolic values for the special variables:

    ${^PREMATCH}  RX_BUFF_IDX_CARET_PREMATCH
    ${^POSTMATCH} RX_BUFF_IDX_CARET_POSTMATCH
    ${^MATCH}     RX_BUFF_IDX_CARET_FULLMATCH
    $`            RX_BUFF_IDX_PREMATCH
    $'            RX_BUFF_IDX_POSTMATCH
    $&            RX_BUFF_IDX_FULLMATCH

=for apidoc Amnh||RX_BUFF_IDX_CARET_FULLMATCH
=for apidoc_item  RX_BUFF_IDX_CARET_POSTMATCH
=for apidoc_item  RX_BUFF_IDX_CARET_PREMATCH
=for apidoc_item  RX_BUFF_IDX_FULLMATCH
=for apidoc_item  RX_BUFF_IDX_POSTMATCH
=for apidoc_item  RX_BUFF_IDX_PREMATCH

Note that in Perl 5.17.3 and earlier, the last three constants were also
used for the caret variants of the variables.

The names have been chosen by analogy with L<Tie::Scalar> method
names with an additional B<LENGTH> callback for efficiency.  However
named capture variables are currently not tied internally but
implemented via magic.

=head3 numbered_buff_FETCH

    void numbered_buff_FETCH(pTHX_ REGEXP * const rx, const I32 paren,
                             SV * const sv);

Fetch a specified numbered capture.  C<sv> should be set to the scalar
to return.  The scalar is passed as an argument rather than being
returned from the function because Perl already has a scalar to store
the value when the callback is called; creating another one would be
redundant.  The scalar can be set with C<sv_setsv>, C<sv_setpvn> and
friends; see L<perlapi>.

This callback is where Perl untaints its own capture variables under
taint mode (see L<perlsec>).  See the C<Perl_reg_numbered_buff_fetch>
function in F<regcomp.c> for how to untaint capture variables if
that's something you'd like your engine to do as well.

=head3 numbered_buff_STORE

    void    (*numbered_buff_STORE) (pTHX_
                                    REGEXP * const rx,
                                    const I32 paren,
                                    SV const * const value);

Set the value of a numbered capture variable.  C<value> is the scalar
that is to be used as the new value.  It's up to the engine to make
sure this is used as the new value (or reject it).

Example:

    if ("ook" =~ /(o*)/) {
        # 'paren' will be '1' and 'value' will be 'ee'
        $1 =~ tr/o/e/;
    }

Perl's own engine will croak on any attempt to modify the capture
variables.  To do this in another engine use the following callback
(copied from C<Perl_reg_numbered_buff_store>):

    void
    Example_reg_numbered_buff_store(pTHX_
                                    REGEXP * const rx,
                                    const I32 paren,
                                    SV const * const value)
    {
        PERL_UNUSED_ARG(rx);
        PERL_UNUSED_ARG(paren);
        PERL_UNUSED_ARG(value);

        if (!PL_localizing)
            Perl_croak(aTHX_ PL_no_modify);
    }

Actually Perl will not I<always> croak in a statement that looks
like it would modify a numbered capture variable.  This is because the
STORE callback will not be called if Perl can determine that it
doesn't have to modify the value.  This is exactly how tied variables
behave in the same situation:

    package CaptureVar;
    use parent 'Tie::Scalar';

    sub TIESCALAR { bless [] }
    sub FETCH { undef }
    sub STORE { die "This doesn't get called" }

    package main;

    tie my $sv => "CaptureVar";
    $sv =~ y/a/b/;

Because C<$sv> is C<undef> when the C<y///> operator is applied to it,
the transliteration won't actually execute and the program won't
C<die>.  This is different to how 5.8 and earlier versions behaved
since the capture variables were READONLY variables then; now they'll
just die when assigned to in the default engine.

=head3 numbered_buff_LENGTH

    I32 numbered_buff_LENGTH (pTHX_
                              REGEXP * const rx,
                              const SV * const sv,
                              const I32 paren);

Get the C<length> of a capture variable.  There's a special callback
for this so that Perl doesn't have to do a FETCH and run C<length> on
the result.  Since the length is (in Perl's case) known from the offsets
stored in C<< rx->offs >>, this is much more efficient:

    I32 s1  = rx->offs[paren].start;
    I32 s2  = rx->offs[paren].end;
    I32 len = s2 - s1;

This is a little bit more complex in the case of UTF-8, see what
C<Perl_reg_numbered_buff_length> does with
L<is_utf8_string_loclen|perlapi/is_utf8_string_loclen>.
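
To make the byte versus character distinction concrete, here is a
self-contained sketch in plain C.  It does not use the Perl API:
C<toy_paren_pair> is a hypothetical stand-in for the offset pairs, and
C<toy_utf8_chars> is a simplified counter that assumes well-formed UTF-8
rather than validating it the way the core does.

```c
#include <stddef.h>

/* Hypothetical stand-in for one start/end offset pair. */
typedef struct toy_paren_pair {
    long start;   /* byte offset of the capture start, or -1 if unset */
    long end;     /* byte offset just past the capture, or -1 if unset */
} toy_paren_pair;

/* Byte length of capture `paren`, or 0 if the group did not match. */
static long toy_capture_bytes(const toy_paren_pair *offs, int paren)
{
    long start = offs[paren].start;
    long end   = offs[paren].end;

    if (start < 0 || end < 0)
        return 0;
    return end - start;
}

/* Character length of a UTF-8 byte range: count every byte that is
 * not a continuation byte (bit pattern 10xxxxxx). */
static long toy_utf8_chars(const char *s, long nbytes)
{
    long i, chars = 0;

    for (i = 0; i < nbytes; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}
```

For a byte string the first function alone gives the answer; for a UTF-8
string its result has to be converted to a character count.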

=head2 Named capture callbacks

Called to get/set the value of C<%+> and C<%->, as well as by some
utility functions in L<re>.

There are two callbacks: C<named_buff> is called in all the cases where the
FETCH, STORE, DELETE, CLEAR, EXISTS and SCALAR L<Tie::Hash> callbacks
would be on changes to C<%+> and C<%->, and C<named_buff_iter> in the
same cases as FIRSTKEY and NEXTKEY.

The C<flags> parameter can be used to determine which of these
operations the callbacks should respond to.  The following flags are
currently defined:

Which L<Tie::Hash> operation is being performed from the Perl level on
C<%+> or C<%->, if any:

    RXapif_FETCH
    RXapif_STORE
    RXapif_DELETE
    RXapif_CLEAR
    RXapif_EXISTS
    RXapif_SCALAR
    RXapif_FIRSTKEY
    RXapif_NEXTKEY

=for apidoc Amnh ||RXapif_ALL
=for apidoc_item   RXapif_CLEAR
=for apidoc_item   RXapif_DELETE
=for apidoc_item   RXapif_EXISTS
=for apidoc_item   RXapif_FETCH
=for apidoc_item   RXapif_FIRSTKEY
=for apidoc_item   RXapif_NEXTKEY
=for apidoc_item   RXapif_ONE
=for apidoc_item   RXapif_REGNAME
=for apidoc_item   RXapif_REGNAMES
=for apidoc_item   RXapif_REGNAMES_COUNT
=for apidoc_item   RXapif_SCALAR
=for apidoc_item   RXapif_STORE

Whether C<%+> or C<%-> is being operated on, if any:

    RXapif_ONE /* %+ */
    RXapif_ALL /* %- */

Whether this is being called as C<re::regname>, C<re::regnames> or
C<re::regnames_count>, if any.  The first two will be combined with
C<RXapif_ONE> or C<RXapif_ALL>.

    RXapif_REGNAME
    RXapif_REGNAMES
    RXapif_REGNAMES_COUNT

Internally C<%+> and C<%-> are implemented with a real tied interface
via L<Tie::Hash::NamedCapture>.  The methods in that package will call
back into these functions.  However the usage of
L<Tie::Hash::NamedCapture> for this purpose might change in future
releases.  For instance this might be implemented by magic instead
(would need an extension to mgvtbl).
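
A callback that receives such a bitfield typically tests the operation
bit first and the C<%+>/C<%-> bit second.  The sketch below shows that
shape in plain C.  The C<TOY_*> constants and C<toy_named_buff_action>
are hypothetical; in particular their bit values are made up for the
example and are not the values of the real C<RXapif_*> constants.

```c
/* Stand-in bit values, chosen only for this example. */
enum {
    TOY_FETCH  = 1 << 0,
    TOY_STORE  = 1 << 1,
    TOY_EXISTS = 1 << 2,
    TOY_ONE    = 1 << 8,   /* operating on %+ */
    TOY_ALL    = 1 << 9    /* operating on %- */
};

/* Describe what a named_buff-style callback is being asked to do. */
static const char *toy_named_buff_action(unsigned flags)
{
    const int all = (flags & TOY_ALL) != 0;

    if (flags & TOY_FETCH)
        return all ? "fetch from %-" : "fetch from %+";
    if (flags & TOY_EXISTS)
        return all ? "exists on %-" : "exists on %+";
    if (flags & TOY_STORE)
        return "store";
    return "unhandled";
}
```

A real engine would return an SV (or NULL) from each branch instead of a
description.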

=head3 named_buff

    SV*     (*named_buff) (pTHX_ REGEXP * const rx, SV * const key,
                           SV * const value, U32 flags);

=head3 named_buff_iter

    SV*     (*named_buff_iter) (pTHX_
                                REGEXP * const rx,
                                const SV * const lastkey,
                                const U32 flags);

=head2 qr_package

    SV* qr_package(pTHX_ REGEXP * const rx);

The package the qr// magic object is blessed into (as seen by C<ref
qr//>).  It is recommended that engines change this to their package
name for identification regardless of whether they implement methods
on the object.

The package this method returns should also have the internal
C<Regexp> package in its C<@ISA>.  C<< qr//->isa("Regexp") >> should always
be true regardless of what engine is being used.

Example implementation might be:

    SV*
    Example_qr_package(pTHX_ REGEXP * const rx)
    {
        PERL_UNUSED_ARG(rx);
        return newSVpvs("re::engine::Example");
    }

Any method calls on an object created with C<qr//> will be dispatched to the
package as a normal object.

    use re::engine::Example;
    my $re = qr//;
    $re->meth; # dispatched to re::engine::Example::meth()

To retrieve the C<REGEXP> object from the scalar in an XS function use
the C<SvRX> macro, see L<"REGEXP Functions" in perlapi|perlapi/REGEXP
Functions>.

    void meth(SV * rv)
    PPCODE:
        REGEXP * re = SvRX(rv);

=head2 dupe

    void* dupe(pTHX_ REGEXP * const rx, CLONE_PARAMS *param);

On threaded builds a regexp may need to be duplicated so that the pattern
can be used by multiple threads.  This routine is expected to handle the
duplication of any private data pointed to by the C<pprivate> member of
the C<regexp> structure.  It will be called with the preconstructed new
C<regexp> structure as an argument.  The C<pprivate> member will point at
the B<old> private structure, and it is this routine's responsibility to
construct a copy and return a pointer to it (which Perl will then use to
overwrite the field as passed to this routine).

This allows the engine to dupe its private data but also if necessary
modify the final structure if it really must.

On unthreaded builds this field doesn't exist.
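
The ownership rules for C<pprivate> described under L</free> and here
can be sketched without the Perl API.  In the plain C example below,
C<toy_regexp>, C<toy_private>, C<toy_dupe> and C<toy_free> are all
hypothetical, and C<malloc>/C<free> stand in for whatever allocator a
real engine would use.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical private data an engine might hang off pprivate. */
typedef struct toy_private {
    size_t len;
    char  *program;      /* heap-allocated, owned by this struct */
} toy_private;

/* Stand-in for the one regexp field this sketch cares about. */
typedef struct toy_regexp {
    void *pprivate;
} toy_regexp;

/* dupe-style callback: rx->pprivate still points at the OLD private
 * data; build a deep copy and return it for the caller to install. */
static void *toy_dupe(const toy_regexp *rx)
{
    const toy_private *old = (const toy_private *)rx->pprivate;
    toy_private *copy = malloc(sizeof *copy);

    if (!copy)
        return NULL;
    copy->len = old->len;
    copy->program = malloc(old->len + 1);
    if (!copy->program) {
        free(copy);
        return NULL;
    }
    memcpy(copy->program, old->program, old->len + 1);
    return copy;
}

/* free-style callback: release only what pprivate points to. */
static void toy_free(toy_regexp *rx)
{
    toy_private *priv = (toy_private *)rx->pprivate;

    if (priv) {
        free(priv->program);
        free(priv);
        rx->pprivate = NULL;
    }
}
```

The copy has to be deep: if both structures shared C<program>, the second
call to the free callback would release memory that is already gone.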

=head2 op_comp

This is private to the Perl core and subject to change.  Should be left
null.

=head1 The REGEXP structure

The REGEXP struct is defined in F<regexp.h>.
All regex engines must be able to
correctly build such a structure in their L</comp> routine.

=for apidoc Ayh||struct regexp
=for apidoc Ayh||REGEXP

The REGEXP structure contains all the data that Perl needs to be aware of
to properly work with the regular expression.  It includes data about
optimisations that Perl can use to determine if the regex engine should
really be used, and various other control info that is needed to properly
execute patterns in various contexts, such as whether the pattern is
anchored in some way, or what flags were used during the compile, or
whether the program contains special constructs that Perl needs to be
aware of.

In addition it contains two fields that are intended for the private
use of the regex engine that compiled the pattern.  These are the
C<intflags> and C<pprivate> members.  C<pprivate> is a void pointer to
an arbitrary structure, whose use and management is the responsibility
of the compiling engine.  Perl will never modify either of these
values.

    /* copied from: regexp.h */
    typedef struct regexp {
        /*----------------------------------------------------------------------
         * Fields required for compatibility with SV types
         */
        _XPV_HEAD;

        /*----------------------------------------------------------------------
         * Operational fields
         */
        const struct regexp_engine* engine; /* what engine created this regexp? */
        REGEXP *mother_re; /* what re is this a lightweight copy of? */
        HV *paren_names;   /* Optional hash of paren names */

        /*----------------------------------------------------------------------
         * Information about the match that the perl core uses to manage things
         */

        /* see comment in regcomp_internal.h about branch reset to understand
           the distinction between physical and logical capture buffers */
        U32 nparens;                    /* physical number of capture buffers */
        U32 logical_nparens;            /* logical number of capture buffers */
        I32 *logical_to_parno;          /* map logical parno to first physical */
        I32 *parno_to_logical;          /* map every physical parno to logical */
        I32 *parno_to_logical_next;     /* map every physical parno to the next
                                           physical with the same logical id */

        SSize_t maxlen;    /* maximum possible number of chars in string to match */
        SSize_t minlen;    /* minimum possible number of chars in string to match */
        SSize_t minlenret; /* minimum possible number of chars in $& */
        STRLEN gofs;       /* chars left of pos that we search from */
                           /* substring data about strings that must appear in
                            * the final match, used for optimisations */

        struct reg_substr_data *substrs;

        /* private engine specific data */

        void *pprivate;    /* Data private to the regex engine which
                            * created this object. */
        U32 extflags;      /* Flags used both externally and internally */
        U32 intflags;      /* Engine Specific Internal flags */

        /*----------------------------------------------------------------------
         * Data about the last/current match. These are modified during matching
         */

        U32 lastparen;           /* highest close paren matched ($+) */
        U32 lastcloseparen;      /* last close paren matched ($^N) */
        regexp_paren_pair *offs; /* Array of offsets for (@-) and (@+) */
        char **recurse_locinput; /* used to detect infinite recursion, XXX: move to internal */

        /*---------------------------------------------------------------------- */

        /* offset from wrapped to the start of precomp */
        PERL_BITFIELD32 pre_prefix:4;

        /* original flags used to compile the pattern, may differ from
         * extflags in various ways */
        PERL_BITFIELD32 compflags:9;

        /*---------------------------------------------------------------------- */

        char *subbeg;       /* saved or original string so \digit works forever. */
        SV_SAVED_COPY       /* If non-NULL, SV which is COW from original */
        SSize_t sublen;     /* Length of string pointed by subbeg */
        SSize_t suboffset;  /* byte offset of subbeg from logical start of str */
        SSize_t subcoffset; /* suboffset equiv, but in chars (for @-/@+) */

        /*----------------------------------------------------------------------
         * More Operational fields
         */

        CV *qr_anoncv;      /* the anon sub wrapped round qr/(?{..})/ */
    } regexp;

Most of the fields contained in this structure are accessed via macros
with a prefix of C<RX_> or C<RXp_>.  The fields are discussed in more detail
below:

=head2 C<engine>

This field points at a C<regexp_engine> structure which contains pointers
to the subroutines that are to be used for performing a match.  It
is the compiling routine's responsibility to populate this field before
returning the regexp object.

Internally this is set to C<NULL> unless a custom engine is specified in
C<$^H{regcomp}>; Perl's own set of callbacks can be accessed in the struct
pointed to by C<RE_ENGINE_PTR>.

=for apidoc Amnh||SV_SAVED_COPY

=head2 C<mother_re>

This is a pointer to another struct regexp which this one was derived
from.  C<qr//> objects mean that the same regexp pattern can be used in
different contexts at the same time, and as long as match status
information is stored in the structure (there are plans to change this
eventually) we need to support having multiple copies of the structure
in use at the same time.  The fields related to the regexp program itself
are copied from the mother_re, and owned by the mother_re, whereas the
match state variables are owned by the struct itself.

=head2 C<extflags>

This will be used by Perl to see what flags the regexp was compiled
with.  It will normally be set to the value of the flags parameter by
the L<comp|/comp> callback.  See the L<comp|/comp> documentation for
valid flags.

=head2 C<minlen> C<minlenret>

C<minlen> is the minimum string length (in characters) required for the
pattern to match.  This is used to
prune the search space by not bothering to match any closer to the end of a
string than would allow a match.  For instance there is no point in even
starting the regex engine if the minlen is 10 but the string is only 5
characters long.  There is no way that the pattern can match.

C<minlenret> is the minimum length (in characters) of the string that would
be found in C<$&> after a match.

The difference between C<minlen> and C<minlenret> can be seen in the
following pattern:

    /ns(?=\d)/

where the C<minlen> would be 3 but C<minlenret> would only be 2 as the C<\d>
is required to match but is not actually
included in the matched content.  This
distinction is particularly important as the substitution logic uses the
C<minlenret> to tell if it can do in-place substitutions (these can
result in considerable speed-up).
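
The pruning that C<minlen> allows comes down to simple arithmetic on
lengths.  The following plain C sketch is an illustration only:
C<toy_can_match_at> and C<toy_last_start> are hypothetical helpers, and
lengths are counted in characters as in the text above.

```c
#include <stddef.h>

/* Can a pattern that needs at least `minlen` characters still match
 * when started at character offset `pos` of a `len`-character string? */
static int toy_can_match_at(size_t len, size_t pos, size_t minlen)
{
    return pos <= len && len - pos >= minlen;
}

/* Last start offset worth trying, or -1 if the engine need not run
 * at all because the string is shorter than minlen. */
static long toy_last_start(size_t len, size_t minlen)
{
    return len >= minlen ? (long)(len - minlen) : -1;
}
```

With a C<minlen> of 10 and a 5-character string the second helper
reports that no start position is worth trying, which is the case
described above.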
776
777=head2 C<gofs>
778
779Left offset from pos() to start match at.
780
781=head2 C<substrs>
782
783Substring data about strings that must appear in the final match.  This
784is currently only used internally by Perl's engine, but might be
785used in the future for all engines for optimisations.
786
787=head2 C<nparens>, C<logical_nparens>
788
789
790These fields are used to keep track of the number of physical and logical
791paren capture groups there are in the pattern, which may differ if the
792pattern includes the use of the branch reset construct C<(?| ... | ... )>.
793For instance the pattern C</(?|(foo)|(bar))/> contains two physical capture
794buffers, but only one logical capture buffer. Most internals logic in the
795regex engine uses the physical capture buffer ids, but the user exposed
796logic uses logical capture buffer ids. See the next section for data-structures
797that allow mapping from one to the other.

=head2 C<logical_to_parno>, C<parno_to_logical>, C<parno_to_logical_next>

These fields facilitate mapping between logical and physical capture
buffer numbers. C<logical_to_parno> is an array whose Kth element
contains the lowest physical capture buffer id for the Kth logical
capture buffer. C<parno_to_logical> is an array whose Kth element
contains the logical capture buffer associated with the Kth physical
capture buffer. C<parno_to_logical_next> is an array whose Kth element
contains the next physical capture buffer with the same logical id, or 0
if there is none.

Note that all three of these arrays are ONLY populated when the pattern
uses the branch reset construct. Patterns which do not use branch reset
effectively have a 1:1 mapping between logical and physical buffers, so
there is no need for this metadata.

The following table gives an example of how this works.

     Pattern /(a) (?| (b) (c) (d) | (e) (f) | (g) ) (h)/
     Logical: $1      $2  $3  $4    $2  $3    $2    $5
     Physical: 1       2   3   4     5   6     7     8
     Next:     0       5   6   0     7   0     0     0

Also note that the 0th element of any of these arrays is not used as it
represents the "entire pattern".
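
The table above can be walked in code.  The following C sketch transcribes
it into three plain arrays and follows the "next" chain to list every
physical buffer behind one logical group.  The arrays are local stand-ins
named after the fields, and C<physical_buffers_for> is a hypothetical helper,
not a function provided by Perl.

```c
#include <assert.h>
#include <stddef.h>

/* Arrays transcribed from the table for
 * /(a) (?| (b) (c) (d) | (e) (f) | (g) ) (h)/; index 0 is unused.
 * These are local stand-ins, not the fields of a real regexp. */
const int logical_to_parno[]      = { 0, 1, 2, 3, 4, 8 };
const int parno_to_logical[]      = { 0, 1, 2, 3, 4, 2, 3, 2, 5 };
const int parno_to_logical_next[] = { 0, 0, 5, 6, 0, 7, 0, 0, 0 };

enum { DEMO_LOGICAL_NPARENS = 5 };

/* Hypothetical helper: store every physical buffer id that shares the
 * given logical id in out[] (lowest first) and return how many ids
 * were stored. */
size_t physical_buffers_for(int logical, int *out, size_t out_size)
{
    size_t n = 0;
    int parno;

    assert(logical >= 1 && logical <= DEMO_LOGICAL_NPARENS);
    parno = logical_to_parno[logical];

    /* A "next" value of 0 terminates the chain. */
    while (parno != 0 && n < out_size) {
        out[n++] = parno;
        parno = parno_to_logical_next[parno];
    }
    return n;
}
```

For logical group 2 this yields the physical buffers 2, 5 and 7, which are
the three alternatives that can supply C<$2> in the example pattern.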

=head2 C<lastparen>, and C<lastcloseparen>

These fields are used to keep track of: which was the highest paren to
be closed (see L<perlvar/$+>); and which was the most recent paren to be
closed (see L<perlvar/$^N>).

=head2 C<intflags>

The engine's private copy of the flags the pattern was compiled with. Usually
this is the same as C<extflags> unless the engine chose to modify one of them.

=head2 C<pprivate>

A C<void*> pointing to an engine-defined data structure.  The Perl engine
uses the C<regexp_internal> structure (see L<perlreguts/Base Structures>)
but a custom engine should use something else.

=head2 C<offs>

An array of C<regexp_paren_pair> structures which define offsets into the
string being matched which correspond to the C<$&> and C<$1>, C<$2> etc.
captures.  The C<regexp_paren_pair> struct is defined as follows:

    typedef struct regexp_paren_pair {
        I32 start;
        I32 end;
    } regexp_paren_pair;

=for apidoc Ayh||regexp_paren_pair

If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that
capture group did not match.  C<< ->offs[0].start/end >> represents C<$&> (or
C<${^MATCH}> under C</p>) and C<< ->offs[paren].end >> matches C<$$paren> where
C<< $paren >= 1 >>.
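
The C<-1> convention can be sketched as follows.  This is a self-contained
illustration only: C<demo_paren_pair> is a local stand-in that mirrors the
layout shown above (with C<int> in place of C<I32>), and C<capture_length>
is a hypothetical helper rather than part of the Perl API.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in mirroring the struct shown above; real code would use
 * regexp_paren_pair from Perl's headers instead. */
typedef struct demo_paren_pair {
    int start;
    int end;
} demo_paren_pair;

/* Hypothetical helper: if capture group num took part in the match,
 * store its length in *len and return 1; otherwise return 0.
 * Index 0 describes the whole match ($&). */
int capture_length(const demo_paren_pair *offs, size_t nparens,
                   size_t num, int *len)
{
    assert(offs != NULL && len != NULL);

    if (num > nparens)
        return 0;                      /* no such group */
    if (offs[num].start == -1 || offs[num].end == -1)
        return 0;                      /* group did not match */

    *len = offs[num].end - offs[num].start;
    return 1;
}
```

A caller would check the return value before touching C<*len>, since an
unmatched group leaves it unchanged.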

=head2 C<RX_PRECOMP> C<RX_PRELEN>

Used for optimisations.  C<RX_PRECOMP> holds a copy of the pattern that
was compiled and C<RX_PRELEN> its length.  When a new pattern is to be
compiled (such as inside a loop) the internal C<regcomp> operator
checks if the last compiled C<REGEXP>'s C<RX_PRECOMP> and C<RX_PRELEN>
are equivalent to the new one, and if so uses the old pattern instead
of compiling a new one.

In older perls these two macros were actually fields in the structure
with the names C<precomp> and C<prelen> respectively.
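
The comparison described above amounts to a length check followed by a
byte comparison.  The sketch below shows that idea in isolation;
C<same_pattern_source> is a hypothetical name, and the actual C<regcomp>
operator may take further properties of the pattern into account before
reusing a compiled C<REGEXP>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a Perl API function.
 * old_precomp/old_prelen stand for what RX_PRECOMP and RX_PRELEN would
 * yield for the previously compiled pattern; new_pat/new_len describe
 * the pattern about to be compiled.  Returns 1 when the sources are
 * byte-for-byte identical, in which case recompiling could be skipped. */
int same_pattern_source(const char *old_precomp, size_t old_prelen,
                        const char *new_pat, size_t new_len)
{
    assert(old_precomp != NULL && new_pat != NULL);

    /* Compare lengths first so memcmp never reads past either buffer. */
    return old_prelen == new_len
        && memcmp(old_precomp, new_pat, new_len) == 0;
}
```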

=head2 C<paren_names>

This is a hash used internally to track named capture groups and their
offsets.  The keys are the names of the buffers; the values are dualvars,
with the IV slot holding the number of buffers with the given name and the
PV being an embedded array of I32.  The values may also be contained
independently in the data array in cases where named backreferences are
used.

=head2 C<substrs>

Holds information on the longest string that must occur at a fixed
offset from the start of the pattern, and the longest string that must
occur at a floating offset from the start of the pattern.  Used to do
Fast-Boyer-Moore searches on the string to find out if it's worth using
the regex engine at all, and if so where in the string to search.

=head2 C<subbeg> C<sublen> C<saved_copy> C<suboffset> C<subcoffset>

Used during the execution phase for managing search and replace patterns,
and for providing the text for C<$&>, C<$1> etc. C<subbeg> points to a
buffer (either the original string, or a copy in the case of
C<RX_MATCH_COPIED(rx_sv)>), and C<sublen> is the length of the buffer.  The
C<RX_OFFS_START(rx_sv,n)> and C<RX_OFFS_END(rx_sv,n)> macros index into this
buffer, as does the data structure returned by C<RX_OFFSp(rx_sv)>, but you
should not use that directly.

=for apidoc Amh||RX_MATCH_COPIED|const REGEXP * rx_sv

In the presence of the C<REXEC_COPY_STR> flag, but with the addition of
the C<REXEC_COPY_SKIP_PRE> or C<REXEC_COPY_SKIP_POST> flags, an engine
can choose not to copy the full buffer (although it must still do so in
the presence of C<RXf_PMf_KEEPCOPY> or the relevant bits being set in
C<PL_sawampersand>).  In this case, it may set C<suboffset> to indicate the
number of bytes from the logical start of the buffer to the physical start
(i.e. C<subbeg>).  It should also set C<subcoffset>, the number of
characters in the offset. The latter is needed to support C<@-> and C<@+>
which work in characters, not bytes.
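
The relationship between the logical and physical start can be sketched
as below.  This assumes that capture offsets are measured in bytes from the
logical start of the string, so C<suboffset> has to be subtracted before
indexing from C<subbeg>.  The struct C<demo_saved_buffer> and the helper
C<capture_ptr> are hypothetical stand-ins, not Perl's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant regexp fields. */
typedef struct demo_saved_buffer {
    const char *subbeg;    /* physical start of the saved buffer      */
    size_t      sublen;    /* number of bytes available at subbeg     */
    size_t      suboffset; /* bytes skipped between the logical start
                              of the string and subbeg                */
} demo_saved_buffer;

/* Hypothetical helper: translate a byte offset counted from the logical
 * start of the string into a pointer inside the saved buffer.  Returns
 * NULL when the offset falls in the part that was not copied. */
const char *capture_ptr(const demo_saved_buffer *buf, size_t logical_off)
{
    assert(buf != NULL && buf->subbeg != NULL);

    if (logical_off < buf->suboffset)
        return NULL;                   /* before the saved part */
    if (logical_off - buf->suboffset > buf->sublen)
        return NULL;                   /* past the saved part */

    return buf->subbeg + (logical_off - buf->suboffset);
}
```

An engine that skips the first four bytes when saving the string would set
C<suboffset> to 4, and an offset of 10 would then resolve to the seventh
byte of the saved buffer.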

=for apidoc Amnh ||REXEC_COPY_SKIP_POST
=for apidoc_item ||REXEC_COPY_SKIP_PRE
=for apidoc_item ||REXEC_COPY_STR

=head2 C<RX_WRAPPED> C<RX_WRAPLEN>

Macros which access the string the C<qr//> stringifies to. The Perl
engine for example stores C<(?^:eek)> in the case of C<qr/eek/>.

When using a custom engine that doesn't support the C<(?:)> construct
for inline modifiers, it's probably best to have C<qr//> stringify to
the supplied pattern.  Note that this will create undesired patterns in
cases such as:

    my $x = qr/a|b/;  # "a|b"
    my $y = qr/c/i;   # "c"
    my $z = qr/$x$y/; # "a|bc"

There's no solution for this problem other than making the custom
engine understand a construct like C<(?:)>.

=head2 C<RX_REFCNT()>

The number of times the structure is referenced. When this falls to 0,
the regexp is automatically freed by a call to C<pregfree>. This should
be set to 1 in each engine's L</comp> routine. Note that in older perls
this was a member of the struct called C<refcnt>, but in more modern
perls, where the regexp structure was unified with the SV structure, this
is an alias for C<SvREFCNT()>.

=head1 HISTORY

Originally part of L<perlreguts>.

=head1 AUTHORS

Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth>
Bjarmason.

=head1 LICENSE

Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason.

This program is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut