xref: /openbsd/gnu/usr.bin/perl/regen/regcharclass.pl (revision 771fbea0)
1#!perl
2package CharClass::Matcher;
3use strict;
4use 5.008;
5use warnings;
6use warnings FATAL => 'all';
7use Data::Dumper;
8$Data::Dumper::Useqq= 1;
9our $hex_fmt= "0x%02X";
10
11sub DEBUG () { 0 }
12$|=1 if DEBUG;
13
14require './regen/regen_lib.pl';
15require './regen/charset_translations.pl';
16require "./regen/regcharclass_multi_char_folds.pl";
17
18=head1 NAME
19
20CharClass::Matcher -- Generate C macros that match character classes efficiently
21
22=head1 SYNOPSIS
23
24    perl regen/regcharclass.pl
25
26=head1 DESCRIPTION
27
28Dynamically generates macros for detecting special charclasses
29in latin-1, utf8, and codepoint forms. Macros can be set to return
30the length (in bytes) of the matched codepoint, and/or the codepoint itself.
31
32To regenerate F<regcharclass.h>, run this script from perl-root. No arguments
33are necessary.
34
35Using WHATEVER as an example the following macros can be produced, depending
36on the input parameters (how to get each is described by internal comments at
37the C<__DATA__> line):
38
39=over 4
40
41=item C<is_WHATEVER(s,is_utf8)>
42
43=item C<is_WHATEVER_safe(s,e,is_utf8)>
44
45Do a lookup as appropriate based on the C<is_utf8> flag. When possible
46comparisons involving octet<128 are done before checking the C<is_utf8>
47flag, hopefully saving time.
48
49The version without the C<_safe> suffix should be used only when the input is
50known to be well-formed.
51
52=item C<is_WHATEVER_utf8(s)>
53
54=item C<is_WHATEVER_utf8_safe(s,e)>
55
56Do a lookup assuming the string is encoded in (normalized) UTF8.
57
58The version without the C<_safe> suffix should be used only when the input is
59known to be well-formed.
60
61=item C<is_WHATEVER_latin1(s)>
62
63=item C<is_WHATEVER_latin1_safe(s,e)>
64
65Do a lookup assuming the string is encoded in latin-1 (aka plan octets).
66
67The version without the C<_safe> suffix should be used only when it is known
68that C<s> contains at least one character.
69
70=item C<is_WHATEVER_cp(cp)>
71
72Check to see if the string matches a given codepoint (hypothetically a
73U32). The condition is constructed as to "break out" as early as
74possible if the codepoint is out of range of the condition.
75
76IOW:
77
78  (cp==X || (cp>X && (cp==Y || (cp>Y && ...))))
79
80Thus if the character is X+1 only two comparisons will be done. Making
81matching lookups slower, but non-matching faster.
82
83=item C<what_len_WHATEVER_FOO(arg1, ..., len)>
84
85A variant form of each of the macro types described above can be generated, in
86which the code point is returned by the macro, and an extra parameter (in the
87final position) is added, which is a pointer for the macro to set the byte
88length of the returned code point.
89
90These forms all have a C<what_len> prefix instead of the C<is_>, for example
91C<what_len_WHATEVER_safe(s,e,is_utf8,len)> and
92C<what_len_WHATEVER_utf8(s,len)>.
93
94These forms should not be used I<except> on small sets of mostly widely
95separated code points; otherwise the code generated is inefficient.  For these
96cases, it is best to use the C<is_> forms, and then find the code point with
97C<utf8_to_uvchr_buf>().  This program can fail with a "deep recursion"
98message on the worst of the inappropriate sets.  Examine the generated macro
99to see if it is acceptable.
100
101=item C<what_WHATEVER_FOO(arg1, ...)>
102
103A variant form of each of the C<is_> macro types described above can be generated, in
104which the code point and not the length is returned by the macro.  These have
105the same caveat as L</what_len_WHATEVER_FOO(arg1, ..., len)>, plus they should
106not be used where the set contains a NULL, as 0 is returned for two different
107cases: a) the set doesn't include the input code point; b) the set does
108include it, and it is a NULL.
109
110=back
111
112The above isn't quite complete, as for specialized purposes one can get a
113macro like C<is_WHATEVER_utf8_no_length_checks(s)>, which assumes that it is
114already known that there is enough space to hold the character starting at
115C<s>, but otherwise checks that it is well-formed.  In other words, this is
116intermediary in checking between C<is_WHATEVER_utf8(s)> and
117C<is_WHATEVER_utf8_safe(s,e)>.
118
119=head2 CODE FORMAT
120
121perltidy  -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f"
122
123
124=head1 AUTHOR
125
126Author: Yves Orton (demerphq) 2007.  Maintained by Perl5 Porters.
127
128=head1 BUGS
129
130No tests directly here (although the regex engine will fail tests
131if this code is broken). Insufficient documentation and no Getopts
132handler for using the module as a script.
133
134=head1 LICENSE
135
136You may distribute under the terms of either the GNU General Public
137License or the Artistic License, as specified in the README file.
138
139=cut
140
141# Sub naming convention:
142# __func : private subroutine, can not be called as a method
143# _func  : private method, not meant for external use
144# func   : public method.
145
146# private subs
147#-------------------------------------------------------------------------------
148#
149# ($cp,$n,$l,$u)=__uni_latin($str);
150#
151# Return a list of arrays, each of which when interpreted correctly
152# represent the string in some given encoding with specific conditions.
153#
154# $cp - list of codepoints that make up the string.
155# $n  - list of octets that make up the string if all codepoints are invariant
156#       regardless of if the string is in UTF-8 or not.
157# $l  - list of octets that make up the string in latin1 encoding if all
158#       codepoints < 256, and at least one codepoint is UTF-8 variant.
159# $u  - list of octets that make up the string in utf8 if any codepoint is
160#       UTF-8 variant
161#
162#   High CP | Defined
163#-----------+----------
164#   0 - 127 : $n            (127/128 are the values for ASCII platforms)
165# 128 - 255 : $l, $u
166# 256 - ... : $u
167#
168
169sub __uni_latin1 {
170    my $charset= shift;
171    my $str= shift;
172    my $max= 0;
173    my @cp;
174    my @cp_high;
175    my $only_has_invariants = 1;
176    my $a2n = get_a2n($charset);
177    for my $ch ( split //, $str ) {
178        my $cp= ord $ch;
179        $max= $cp if $max < $cp;
180        if ($cp > 255) {
181            push @cp, $cp;
182            push @cp_high, $cp;
183        }
184        else {
185            push @cp, $a2n->[$cp];
186        }
187    }
188    my ( $n, $l, $u );
189    $only_has_invariants = ($charset =~ /ascii/i) ? $max < 128 : $max < 160;
190    if ($only_has_invariants) {
191        $n= [@cp];
192    } else {
193        $l= [@cp] if $max && $max < 256;
194
195        my @u;
196        for my $ch ( split //, $str ) {
197            push @u, map { ord } split //, cp_2_utfbytes(ord $ch, $charset);
198        }
199        $u = \@u;
200    }
201    return ( \@cp, \@cp_high, $n, $l, $u );
202}
203
204#
205# $clean= __clean($expr);
206#
207# Cleanup a ternary expression, removing unnecessary parens and apply some
208# simplifications using regexes.
209#
210
211sub __clean {
212    my ( $expr )= @_;
213
214    #return $expr;
215
216    our $parens;
217    $parens= qr/ (?> \( (?> (?: (?> [^()]+ ) | (??{ $parens }) )* ) \) ) /x;
218
219    ## remove redundant parens
220    1 while $expr =~ s/ \( \s* ( $parens ) \s* \) /$1/gx;
221
222
223    # repeatedly simplify conditions like
224    #       ( (cond1) ? ( (cond2) ? X : Y ) : Y )
225    # into
226    #       ( ( (cond1) && (cond2) ) ? X : Y )
227    # Also similarly handles expressions like:
228    #       : (cond1) ? ( (cond2) ? X : Y ) : Y )
229    # Note the inclusion of the close paren in ([:()]) and the open paren in ([()]) is
230    # purely to ensure we have a balanced set of parens in the expression which makes
231    # it easier to understand the pattern in an editor that understands paren's, we do
232    # not expect either of these cases to actually fire. - Yves
233    1 while $expr =~ s/
234        ([:()])  \s*
235            ($parens) \s*
236            \? \s*
237                \( \s* ($parens) \s*
238                    \? \s* ($parens|[^()?:\s]+?) \s*
239                    :  \s* ($parens|[^()?:\s]+?) \s*
240                \) \s*
241            : \s* \5 \s*
242        ([()])
243    /$1 ( $2 && $3 ) ? $4 : $5 $6/gx;
244    #$expr=~s/\(\(U8\*\)s\)\[(\d+)\]/S$1/g if length $expr > 8000;
245    #$expr=~s/\s+//g if length $expr > 8000;
246
247    die "Expression too long" if length $expr > 8000;
248
249    return $expr;
250}
251
252#
253# $text= __macro(@args);
254# Join args together by newlines, and then neatly add backslashes to the end
255# of every  line as expected by the C pre-processor for #define's.
256#
257
258sub __macro {
259    my $str= join "\n", @_;
260    $str =~ s/\s*$//;
261    my @lines= map { s/\s+$//; s/\t/        /g; $_ } split /\n/, $str;
262    my $last= pop @lines;
263    $str= join "\n", ( map { sprintf "%-76s\\", $_ } @lines ), $last;
264    1 while $str =~ s/^(\t*) {8}/$1\t/gm;
265    return $str . "\n";
266}
267
268#
269# my $op=__incrdepth($op);
270#
271# take an 'op' hashref and add one to it and all its childrens depths.
272#
273
274sub __incrdepth {
275    my $op= shift;
276    return unless ref $op;
277    $op->{depth} += 1;
278    __incrdepth( $op->{yes} );
279    __incrdepth( $op->{no} );
280    return $op;
281}
282
283# join two branches of an opcode together with a condition, incrementing
284# the depth on the yes branch when we do so.
285# returns the new root opcode of the tree.
286sub __cond_join {
287    my ( $cond, $yes, $no )= @_;
288    if (ref $yes) {
289        return {
290            test  => $cond,
291            yes   => __incrdepth( $yes ),
292            no    => $no,
293            depth => 0,
294        };
295    }
296    else {
297        return {
298            test  => $cond,
299            yes   => $yes,
300            no    => __incrdepth($no),
301            depth => 0,
302        };
303    }
304}
305
306# Methods
307
308# constructor
309#
310# my $obj=CLASS->new(op=>'SOMENAME',title=>'blah',txt=>[..]);
311#
312# Create a new CharClass::Matcher object by parsing the text in
313# the txt array. Currently applies the following rules:
314#
315# Element starts with C<0x>, line is evaled the result treated as
316# a number which is passed to chr().
317#
318# Element starts with C<">, line is evaled and the result treated
319# as a string.
320#
321# Each string is then stored in the 'strs' subhash as a hash record
322# made up of the results of __uni_latin1, using the keynames
323# 'low','latin1','utf8', as well as the synthesized 'LATIN1', 'high', and
324# 'UTF8' which hold a merge of 'low' and their lowercase equivalents.
325#
326# Size data is tracked per type in the 'size' subhash.
327#
328# Return an object
329#
330sub new {
331    my $class= shift;
332    my %opt= @_;
333    for ( qw(op txt) ) {
334        die "in " . __PACKAGE__ . " constructor '$_;' is a mandatory field"
335          if !exists $opt{$_};
336    }
337
338    my $self= bless {
339        op    => $opt{op},
340        title => $opt{title} || '',
341    }, $class;
342    foreach my $txt ( @{ $opt{txt} } ) {
343        my $str= $txt;
344        if ( $str =~ /^[""]/ ) {
345            $str= eval $str;
346        } elsif ($str =~ / - /x ) { # A range:  Replace this element on the
347                                    # list with its expansion
348            my ($lower, $upper) = $str =~ / 0x (.+?) \s* - \s* 0x (.+) /x;
349            die "Format must be like '0xDEAD - 0xBEAF'; instead was '$str'" if ! defined $lower || ! defined $upper;
350            foreach my $cp (hex $lower .. hex $upper) {
351                push @{$opt{txt}}, sprintf "0x%X", $cp;
352            }
353            next;
354        } elsif ($str =~ s/ ^ N (?= 0x ) //x ) {
355            # Otherwise undocumented, a leading N means is already in the
356            # native character set; don't convert.
357            $str= chr eval $str;
358        } elsif ( $str =~ /^0x/ ) {
359            $str= eval $str;
360            $str = chr $str;
361        } elsif ( $str =~ / \s* \\p \{ ( .*? ) \} /x) {
362            my $property = $1;
363            use Unicode::UCD qw(prop_invlist);
364
365            my @invlist = prop_invlist($property, '_perl_core_internal_ok');
366            if (! @invlist) {
367
368                # An empty return could mean an unknown property, or merely
369                # that it is empty.  Call in scalar context to differentiate
370                my $count = prop_invlist($property, '_perl_core_internal_ok');
371                die "$property not found" unless defined $count;
372            }
373
374            # Replace this element on the list with the property's expansion
375            for (my $i = 0; $i < @invlist; $i += 2) {
376                foreach my $cp ($invlist[$i] .. $invlist[$i+1] - 1) {
377
378                    # prop_invlist() returns native values; add leading 'N'
379                    # to indicate that.
380                    push @{$opt{txt}}, sprintf "N0x%X", $cp;
381                }
382            }
383            next;
384        } elsif ($str =~ / ^ do \s+ ( .* ) /x) {
385            die "do '$1' failed: $!$@" if ! do $1 or $@;
386            next;
387        } elsif ($str =~ / ^ & \s* ( .* ) /x) { # user-furnished sub() call
388            my @results = eval "$1";
389            die "eval '$1' failed: $@" if $@;
390            push @{$opt{txt}}, @results;
391            next;
392        } else {
393            die "Unparsable line: $txt\n";
394        }
395        my ( $cp, $cp_high, $low, $latin1, $utf8 )= __uni_latin1( $opt{charset}, $str );
396        my $UTF8= $low   || $utf8;
397        my $LATIN1= $low || $latin1;
398        my $high = (scalar grep { $_ < 256 } @$cp) ? 0 : $utf8;
399        #die Dumper($txt,$cp,$low,$latin1,$utf8)
400        #    if $txt=~/NEL/ or $utf8 and @$utf8>3;
401
402        @{ $self->{strs}{$str} }{qw( str txt low utf8 latin1 high cp cp_high UTF8 LATIN1 )}=
403          ( $str, $txt, $low, $utf8, $latin1, $high, $cp, $cp_high, $UTF8, $LATIN1 );
404        my $rec= $self->{strs}{$str};
405        foreach my $key ( qw(low utf8 latin1 high cp cp_high UTF8 LATIN1) ) {
406            $self->{size}{$key}{ 0 + @{ $self->{strs}{$str}{$key} } }++
407              if $self->{strs}{$str}{$key};
408        }
409        $self->{has_multi} ||= @$cp > 1;
410        $self->{has_ascii} ||= $latin1 && @$latin1;
411        $self->{has_low}   ||= $low && @$low;
412        $self->{has_high}  ||= !$low && !$latin1;
413    }
414    $self->{val_fmt}= $hex_fmt;
415    $self->{count}= 0 + keys %{ $self->{strs} };
416    return $self;
417}
418
419# my $trie = make_trie($type,$maxlen);
420#
421# using the data stored in the object build a trie of a specific type,
422# and with specific maximum depth. The trie is made up the elements of
423# the given types array for each string in the object (assuming it is
424# not too long.)
425#
426# returns the trie, or undef if there was no relevant data in the object.
427#
428
429sub make_trie {
430    my ( $self, $type, $maxlen )= @_;
431
432    my $strs= $self->{strs};
433    my %trie;
434    foreach my $rec ( values %$strs ) {
435        die "panic: unknown type '$type'"
436          if !exists $rec->{$type};
437        my $dat= $rec->{$type};
438        next unless $dat;
439        next if $maxlen && @$dat > $maxlen;
440        my $node= \%trie;
441        foreach my $elem ( @$dat ) {
442            $node->{$elem} ||= {};
443            $node= $node->{$elem};
444        }
445        $node->{''}= $rec->{str};
446    }
447    return 0 + keys( %trie ) ? \%trie : undef;
448}
449
450sub pop_count ($) {
451    my $word = shift;
452
453    # This returns a list of the positions of the bits in the input word that
454    # are 1.
455
456    my @positions;
457    my $position = 0;
458    while ($word) {
459        push @positions, $position if $word & 1;
460        $position++;
461        $word >>= 1;
462    }
463    return @positions;
464}
465
466# my $optree= _optree()
467#
468# recursively convert a trie to an optree where every node represents
469# an if else branch.
470#
471#
472
473sub _optree {
474    my ( $self, $trie, $test_type, $ret_type, $else, $depth )= @_;
475    return unless defined $trie;
476    if ( $self->{has_multi} and $ret_type =~ /cp|both/ ) {
477        die "Can't do 'cp' optree from multi-codepoint strings";
478    }
479    $ret_type ||= 'len';
480    $else= 0  unless defined $else;
481    $depth= 0 unless defined $depth;
482
483    # if we have an empty string as a key it means we are in an
484    # accepting state and unless we can match further on should
485    # return the value of the '' key.
486    if (exists $trie->{''} ) {
487        # we can now update the "else" value, anything failing to match
488        # after this point should return the value from this.
489        if ( $ret_type eq 'cp' ) {
490            $else= $self->{strs}{ $trie->{''} }{cp}[0];
491            $else= sprintf "$self->{val_fmt}", $else if $else > 9;
492        } elsif ( $ret_type eq 'len' ) {
493            $else= $depth;
494        } elsif ( $ret_type eq 'both') {
495            $else= $self->{strs}{ $trie->{''} }{cp}[0];
496            $else= sprintf "$self->{val_fmt}", $else if $else > 9;
497            $else= "len=$depth, $else";
498        }
499    }
500    # extract the meaningful keys from the trie, filter out '' as
501    # it means we are an accepting state (end of sequence).
502    my @conds= sort { $a <=> $b } grep { length $_ } keys %$trie;
503
504    # if we haven't any keys there is no further we can match and we
505    # can return the "else" value.
506    return $else if !@conds;
507
508    my $test = $test_type =~ /^cp/ ? "cp" : "((const U8*)s)[$depth]";
509
510    # First we loop over the possible keys/conditions and find out what they
511    # look like; we group conditions with the same optree together.
512    my %dmp_res;
513    my @res_order;
514    local $Data::Dumper::Sortkeys=1;
515    foreach my $cond ( @conds ) {
516
517        # get the optree for this child/condition
518        my $res= $self->_optree( $trie->{$cond}, $test_type, $ret_type, $else, $depth + 1 );
519        # convert it to a string with Dumper
520        my $res_code= Dumper( $res );
521
522        push @{$dmp_res{$res_code}{vals}}, $cond;
523        if (!$dmp_res{$res_code}{optree}) {
524            $dmp_res{$res_code}{optree}= $res;
525            push @res_order, $res_code;
526        }
527    }
528
529    # now that we have deduped the optrees we construct a new optree containing the merged
530    # results.
531    my %root;
532    my $node= \%root;
533    foreach my $res_code_idx (0 .. $#res_order) {
534        my $res_code= $res_order[$res_code_idx];
535        $node->{vals}= $dmp_res{$res_code}{vals};
536        $node->{test}= $test;
537        $node->{yes}= $dmp_res{$res_code}{optree};
538        $node->{depth}= $depth;
539        if ($res_code_idx < $#res_order) {
540            $node= $node->{no}= {};
541        } else {
542            $node->{no}= $else;
543        }
544    }
545
546    # return the optree.
547    return \%root;
548}
549
550# my $optree= optree(%opts);
551#
552# Convert a trie to an optree, wrapper for _optree
553
554sub optree {
555    my $self= shift;
556    my %opt= @_;
557    my $trie= $self->make_trie( $opt{type}, $opt{max_depth} );
558    $opt{ret_type} ||= 'len';
559    my $test_type= $opt{type} =~ /^cp/ ? 'cp' : 'depth';
560    return $self->_optree( $trie, $test_type, $opt{ret_type}, $opt{else}, 0 );
561}
562
563# my $optree= generic_optree(%opts);
564#
565# build a "generic" optree out of the three 'low', 'latin1', 'utf8'
566# sets of strings, including a branch for handling the string type check.
567#
568
569sub generic_optree {
570    my $self= shift;
571    my %opt= @_;
572
573    $opt{ret_type} ||= 'len';
574    my $test_type= 'depth';
575    my $else= $opt{else} || 0;
576
577    my $latin1= $self->make_trie( 'latin1', $opt{max_depth} );
578    my $utf8= $self->make_trie( 'utf8',     $opt{max_depth} );
579
580    $_= $self->_optree( $_, $test_type, $opt{ret_type}, $else, 0 )
581      for $latin1, $utf8;
582
583    if ( $utf8 ) {
584        $else= __cond_join( "( is_utf8 )", $utf8, $latin1 || $else );
585    } elsif ( $latin1 ) {
586        $else= __cond_join( "!( is_utf8 )", $latin1, $else );
587    }
588    if ($opt{type} eq 'generic') {
589        my $low= $self->make_trie( 'low', $opt{max_depth} );
590        if ( $low ) {
591            $else= $self->_optree( $low, $test_type, $opt{ret_type}, $else, 0 );
592        }
593    }
594
595    return $else;
596}
597
598# length_optree()
599#
600# create a string length guarded optree.
601#
602
603sub length_optree {
604    my $self= shift;
605    my %opt= @_;
606    my $type= $opt{type};
607
608    die "Can't do a length_optree on type 'cp', makes no sense."
609      if $type =~ /^cp/;
610
611    my $else= ( $opt{else} ||= 0 );
612
613    return $else if $self->{count} == 0;
614
615    my $method = $type =~ /generic/ ? 'generic_optree' : 'optree';
616    if ($method eq 'optree' && scalar keys %{$self->{size}{$type}} == 1) {
617
618        # Here is non-generic output (meaning that we are only generating one
619        # type), and all things that match have the same number ('size') of
620        # bytes.  The length guard is simply that we have that number of
621        # bytes.
622        my @size = keys %{$self->{size}{$type}};
623        my $cond= "((e) - (s)) >= $size[0]";
624        my $optree = $self->$method(%opt);
625        $else= __cond_join( $cond, $optree, $else );
626    }
627    elsif ($self->{has_multi}) {
628        my @size;
629
630        # Here, there can be a match of a multiple character string.  We use
631        # the traditional method which is to have a branch for each possible
632        # size (longest first) and test for the legal values for that size.
633        my %sizes= (
634            %{ $self->{size}{low}    || {} },
635            %{ $self->{size}{latin1} || {} },
636            %{ $self->{size}{utf8}   || {} }
637        );
638        if ($method eq 'generic_optree') {
639            @size= sort { $a <=> $b } keys %sizes;
640        } else {
641            @size= sort { $a <=> $b } keys %{ $self->{size}{$type} };
642        }
643        for my $size ( @size ) {
644            my $optree= $self->$method( %opt, type => $type, max_depth => $size );
645            my $cond= "((e)-(s) > " . ( $size - 1 ).")";
646            $else= __cond_join( $cond, $optree, $else );
647        }
648    }
649    else {
650        my $utf8;
651
652        # Here, has more than one possible size, and only matches a single
653        # character.  For non-utf8, the needed length is 1; for utf8, it is
654        # found by array lookup 'UTF8SKIP'.
655
656        # If want just the code points above 255, set up to look for those;
657        # otherwise assume will be looking for all non-UTF-8-invariant code
658        # poiints.
659        my $trie_type = ($type eq 'high') ? 'high' : 'utf8';
660
661        # If we do want more than the 0-255 range, find those, and if they
662        # exist...
663        if ($opt{type} !~ /latin1/i && ($utf8 = $self->make_trie($trie_type, 0))) {
664
665            # ... get them into an optree, and set them up as the 'else' clause
666            $utf8 = $self->_optree( $utf8, 'depth', $opt{ret_type}, 0, 0 );
667
668            # We could make this
669            #   UTF8_IS_START(*s) && ((e) - (s)) >= UTF8SKIP(s))";
670            # to avoid doing the UTF8SKIP and subsequent branches for invariants
671            # that don't match.  But the current macros that get generated
672            # have only a few things that can match past this, so I (khw)
673            # don't think it is worth it.  (Even better would be to use
674            # calculate_mask(keys %$utf8) instead of UTF8_IS_START, and use it
675            # if it saves a bunch.  We assume that input text likely to be
676            # well-formed .
677            my $cond = "LIKELY(((e) - (s)) >= UTF8SKIP(s))";
678            $else = __cond_join($cond, $utf8, $else);
679
680            # For 'generic', we also will want the latin1 UTF-8 variants for
681            # the case where the input isn't UTF-8.
682            my $latin1;
683            if ($method eq 'generic_optree') {
684                $latin1 = $self->make_trie( 'latin1', 1);
685                $latin1= $self->_optree( $latin1, 'depth', $opt{ret_type}, 0, 0 );
686            }
687
688            # If we want the UTF-8 invariants, get those.
689            my $low;
690            if ($opt{type} !~ /non_low|high/
691                && ($low= $self->make_trie( 'low', 1)))
692            {
693                $low= $self->_optree( $low, 'depth', $opt{ret_type}, 0, 0 );
694
695                # Expand out the UTF-8 invariants as a string so that we
696                # can use them as the conditional
697                $low = $self->_cond_as_str( $low, 0, \%opt);
698
699                # If there are Latin1 variants, add a test for them.
700                if ($latin1) {
701                    $else = __cond_join("(! is_utf8 )", $latin1, $else);
702                }
703                elsif ($method eq 'generic_optree') {
704
705                    # Otherwise for 'generic' only we know that what
706                    # follows must be valid for just UTF-8 strings,
707                    $else->{test} = "( is_utf8 && $else->{test} )";
708                }
709
710                # If the invariants match, we are done; otherwise we have
711                # to go to the 'else' clause.
712                $else = __cond_join($low, 1, $else);
713            }
714            elsif ($latin1) {   # Here, didn't want or didn't have invariants,
715                                # but we do have latin variants
716                $else = __cond_join("(! is_utf8)", $latin1, $else);
717            }
718
719            # We need at least one byte available to start off the tests
720            $else = __cond_join("LIKELY((e) > (s))", $else, 0);
721        }
722        else {  # Here, we don't want or there aren't any variants.  A single
723                # byte available is enough.
724            my $cond= "((e) > (s))";
725            my $optree = $self->$method(%opt);
726            $else= __cond_join( $cond, $optree, $else );
727        }
728    }
729
730    return $else;
731}
732
733sub calculate_mask(@) {
734    # Look at the input list of byte values.  This routine returns an array of
735    # mask/base pairs to generate that list.
736
737    my @list = @_;
738    my $list_count = @list;
739
740    # Consider a set of byte values, A, B, C ....  If we want to determine if
741    # <c> is one of them, we can write c==A || c==B || c==C ....  If the
742    # values are consecutive, we can shorten that to inRANGE(c, 'A', 'Z'),
743    # which uses far fewer branches.  If only some of them are consecutive we
744    # can still save some branches by creating range tests for just those that
745    # are consecutive. _cond_as_str() does this work for looking for ranges.
746    #
747    # Another approach is to look at the bit patterns for A, B, C .... and see
748    # if they have some commonalities.  That's what this function does.  For
749    # example, consider a set consisting of the bytes
750    # 0x42, 0x43, 0x62, and 0x63.  We could write:
751    #   inRANGE(c, 0x42, 0x43) || inRANGE(c, 0x62, 0x63)
752    # which through the magic of casting has not 4, but 2 tests.  But the
753    # following mask/compare also works, and has just one test:
754    #   (c & 0xDE) == 0x42
755    # The reason it works is that the set consists of exactly the 4 bit
756    # patterns which have either 0 or 1 in the two bit positions that are 0 in
757    # the mask.  They have the same value in each bit position where the mask
758    # is 1.  The comparison makes sure that the result matches all bytes which
759    # match those six 1 bits exactly.  This can be applied to bytes that
760    # differ in 1 through all 8 bit positions.  In order to be a candidate for
761    # this optimization, the number of bytes in the set must be a power of 2.
762    #
763    # It may be that the bytes needing to be matched can't be done with a
764    # single mask.  But it may be possible to have two (or more) sets, each
765    # with a separate mask.  This function attempts to find some way to save
766    # some branches using the mask technique.  If not, it returns an empty
767    # list; if so, it returns a list consisting of
768    #   [ [compare1, mask1], [compare2, mask2], ...
769    #     [compare_n, undef], [compare_m, undef], ...
770    #   ]
771    # The <mask> is undef in the above for those bytes that must be tested
772    # for individually.
773    #
774    # This function does not attempt to find the optimal set.  To do so would
775    # probably require testing all possible combinations, and keeping track of
776    # the current best one.
777    #
778    # There are probably much better algorithms, but this is the one I (khw)
779    # came up with.  We start with doing a bit-wise compare of every byte in
780    # the set with every other byte.  The results are sorted into arrays of
781    # all those that differ by the same bit positions.  These are stored in a
782    # hash with the each key being the bits they differ in.  Here is the hash
783    # for the 0x53, 0x54, 0x73, 0x74 set:
784    # {
785    #    4 => {
786    #            "0,1,2,5" => [
787    #                            83,
788    #                            116,
789    #                            84,
790    #                            115
791    #                        ]
792    #        },
793    #    3 => {
794    #            "0,1,2" => [
795    #                        83,
796    #                        84,
797    #                        115,
798    #                        116
799    #                        ]
800    #        }
801    #    1 => {
802    #            5 => [
803    #                    83,
804    #                    115,
805    #                    84,
806    #                    116
807    #                ]
808    #        },
809    # }
810    #
811    # The set consisting of values which differ in the 4 bit positions 0, 1,
812    # 2, and 5 from some other value in the set consists of all 4 values.
813    # Likewise all 4 values differ from some other value in the 3 bit
814    # positions 0, 1, and 2; and all 4 values differ from some other value in
815    # the single bit position 5.  The keys at the uppermost level in the above
816    # hash, 1, 3, and 4, give the number of bit positions that each sub-key
817    # below it has.  For example, the 4 key could have as its value an array
818    # consisting of "0,1,2,5", "0,1,2,6", and "3,4,6,7", if the inputs were
819    # such.  The best optimization will group the most values into a single
820    # mask.  The most values will be the ones that differ in the most
821    # positions, the ones with the largest value for the topmost key.  These
822    # keys, are thus just for convenience of sorting by that number, and do
823    # not have any bearing on the core of the algorithm.
824    #
825    # We start with an element from largest number of differing bits.  The
826    # largest in this case is 4 bits, and there is only one situation in this
827    # set which has 4 differing bits, "0,1,2,5".  We look for any subset of
828    # this set which has 16 values that differ in these 4 bits.  There aren't
829    # any, because there are only 4 values in the entire set.  We then look at
830    # the next possible thing, which is 3 bits differing in positions "0,1,2".
831    # We look for a subset that has 8 values that differ in these 3 bits.
832    # Again there are none.  So we go to look for the next possible thing,
833    # which is a subset of 2**1 values that differ only in bit position 5.  83
834    # and 115 do, so we calculate a mask and base for those and remove them
835    # from every set.  Since there is only the one set remaining, we remove
836    # them from just this one.  We then look to see if there is another set of
837    # 2 values that differ in bit position 5.  84 and 116 do, so we calculate
838    # a mask and base for those and remove them from every set (again only
839    # this set remains in this example).  The set is now empty, and there are
840    # no more sets to look at, so we are done.
841
842    if ($list_count == 256) {   # All 256 is trivially masked
843        return (0, 0);
844    }
845
846    my %hash;
847
848    # Generate bits-differing lists for each element compared against each
849    # other element
850    for my $i (0 .. $list_count - 2) {
851        for my $j ($i + 1 .. $list_count - 1) {
852            my @bits_that_differ = pop_count($list[$i] ^ $list[$j]);
853            my $differ_count = @bits_that_differ;
854            my $key = join ",", @bits_that_differ;
855            push @{$hash{$differ_count}{$key}}, $list[$i] unless grep { $_ == $list[$i] } @{$hash{$differ_count}{$key}};
856            push @{$hash{$differ_count}{$key}}, $list[$j];
857        }
858    }
859
860    print STDERR __LINE__, ": calculate_mask() called:  List of values grouped by differing bits: ", Dumper \%hash if DEBUG;
861
862    my @final_results;
863    foreach my $count (reverse sort { $a <=> $b } keys %hash) {
864        my $need = 2 ** $count;     # Need 8 values for 3 differing bits, etc
865        foreach my $bits (sort keys $hash{$count}->%*) {
866
867            print STDERR __LINE__, ": For $count bit(s) difference ($bits), need $need; have ", scalar @{$hash{$count}{$bits}}, "\n" if DEBUG;
868
869            # Look only as long as there are at least as many elements in the
870            # subset as are needed
871            while ((my $cur_count = @{$hash{$count}{$bits}}) >= $need) {
872
873                print STDERR __LINE__, ": Looking at bit positions ($bits): ", Dumper $hash{$count}{$bits} if DEBUG;
874
875                # Start with the first element in it
876                my $try_base = $hash{$count}{$bits}[0];
877                my @subset = $try_base;
878
879                # If it succeeds, we return a mask and a base to compare
880                # against the masked value.  That base will be the AND of
881                # every element in the subset.  Initialize to the one element
882                # we have so far.
883                my $compare = $try_base;
884
885                # We are trying to find a subset of this that has <need>
886                # elements that differ in the bit positions given by the
887                # string $bits, which is comma separated.
888                my @bits = split ",", $bits;
889
890                TRY: # Look through the remainder of the list for other
891                     # elements that differ only by these bit positions.
892
893                for (my $i = 1; $i < $cur_count; $i++) {
894                    my $try_this = $hash{$count}{$bits}[$i];
895                    my @positions = pop_count($try_base ^ $try_this);
896
897                    print STDERR __LINE__, ": $try_base vs $try_this: is (", join(',', @positions), ") a subset of ($bits)?" if DEBUG;;
898
899                    foreach my $pos (@positions) {
900                        unless (grep { $pos == $_ } @bits) {
901                            print STDERR "  No\n" if DEBUG;
902                            my $remaining = $cur_count - $i - 1;
903                            if ($remaining && @subset + $remaining < $need) {
904                                print STDERR __LINE__, ": Can stop trying $try_base, because even if all the remaining $remaining values work, they wouldn't add up to the needed $need when combined with the existing ", scalar @subset, " ones\n" if DEBUG;
905                                last TRY;
906                            }
907                            next TRY;
908                        }
909                    }
910
911                    print STDERR "  Yes\n" if DEBUG;
912                    push @subset, $try_this;
913
914                    # Add this to the mask base, in case it ultimately
915                    # succeeds,
916                    $compare &= $try_this;
917                }
918
919                print STDERR __LINE__, ": subset (", join(", ", @subset), ") has ", scalar @subset, " elements; needs $need\n" if DEBUG;
920
921                if (@subset < $need) {
922                    shift @{$hash{$count}{$bits}};
923                    next;   # Try with next value
924                }
925
926                # Create the mask
927                my $mask = 0;
928                foreach my $position (@bits) {
929                    $mask |= 1 << $position;
930                }
931                $mask = ~$mask & 0xFF;
932                push @final_results, [$compare, $mask];
933
934                printf STDERR "%d: Got it: compare=%d=0x%X; mask=%X\n", __LINE__, $compare, $compare, $mask if DEBUG;
935
936                # These values are now spoken for.  Remove them from future
937                # consideration
938                foreach my $remove_count (sort keys %hash) {
939                    foreach my $bits (sort keys %{$hash{$remove_count}}) {
940                        foreach my $to_remove (@subset) {
941                            @{$hash{$remove_count}{$bits}} = grep { $_ != $to_remove } @{$hash{$remove_count}{$bits}};
942                        }
943                    }
944                }
945            }
946        }
947    }
948
949    # Any values that remain in the list are ones that have to be tested for
950    # individually.
951    my @individuals;
952    foreach my $count (reverse sort { $a <=> $b } keys %hash) {
953        foreach my $bits (sort keys $hash{$count}->%*) {
954            foreach my $remaining (@{$hash{$count}{$bits}}) {
955
956                # If we already know about this value, just ignore it.
957                next if grep { $remaining == $_ } @individuals;
958
959                # Otherwise it needs to be returned as something to match
960                # individually
961                push @final_results, [$remaining, undef];
962                push @individuals, $remaining;
963            }
964        }
965    }
966
967    # Sort by increasing numeric value
968    @final_results = sort { $a->[0] <=> $b->[0] } @final_results;
969
970    print STDERR __LINE__, ": Final return: ", Dumper \@final_results if DEBUG;
971
972    return @final_results;
973}
974
975# _cond_as_str
976# turn a list of conditions into a text expression
977# - merges ranges of conditions, and joins the result with ||
978sub _cond_as_str {
979    my ( $self, $op, $combine, $opts_ref )= @_;
980    my $cond= $op->{vals};
981    my $test= $op->{test};
982    my $is_cp_ret = $opts_ref->{ret_type} eq "cp";
983    return "( $test )" if !defined $cond;
984
985    # rangify the list.
986    my @ranges;
987    my $Update= sub {
988        # We skip this if there are optimizations that
989        # we can apply (below) to the individual ranges
990        if ( ($is_cp_ret || $combine) && @ranges && ref $ranges[-1]) {
991            if ( $ranges[-1][0] == $ranges[-1][1] ) {
992                $ranges[-1]= $ranges[-1][0];
993            } elsif ( $ranges[-1][0] + 1 == $ranges[-1][1] ) {
994                $ranges[-1]= $ranges[-1][0];
995                push @ranges, $ranges[-1] + 1;
996            }
997        }
998    };
999    for my $condition ( @$cond ) {
1000        if ( !@ranges || $condition != $ranges[-1][1] + 1 ) {
1001            $Update->();
1002            push @ranges, [ $condition, $condition ];
1003        } else {
1004            $ranges[-1][1]++;
1005        }
1006    }
1007    $Update->();
1008
1009    return $self->_combine( $test, @ranges )
1010      if $combine;
1011
1012    if ($is_cp_ret) {
1013        @ranges= map {
1014            ref $_
1015            ? sprintf(
1016                "isRANGE( $test, $self->{val_fmt}, $self->{val_fmt} )",
1017                @$_ )
1018            : sprintf( "$self->{val_fmt} == $test", $_ );
1019        } @ranges;
1020
1021        return "( " . join( " || ", @ranges ) . " )";
1022    }
1023
1024    # If the input set has certain characteristics, we can optimize tests
1025    # for it.  This doesn't apply if returning the code point, as we want
1026    # each element of the set individually.  The code above is for this
1027    # simpler case.
1028
1029    return 1 if @$cond == 256;  # If all bytes match, is trivially true
1030
1031    my @masks;
1032    if (@ranges > 1) {
1033
1034        # See if the entire set shares optimizable characteristics, and if so,
1035        # return the optimization.  There is no need to do this on sets with
1036        # just a single range, as that can be expressed with a single
1037        # conditional.
1038        @masks = calculate_mask(@$cond);
1039
1040        # Stringify the output of calculate_mask()
1041        if (@masks) {
1042            my @return;
1043            foreach my $mask_ref (@masks) {
1044                if (defined $mask_ref->[1]) {
1045                    push @return, sprintf "( ( $test & $self->{val_fmt} ) == $self->{val_fmt} )", $mask_ref->[1], $mask_ref->[0];
1046                }
1047                else {  # An undefined mask means to use the value as-is
1048                    push @return, sprintf "$test == $self->{val_fmt}", $mask_ref->[0];
1049                }
1050            }
1051
1052            # The best possible case below for specifying this set of values via
1053            # ranges is 1 branch per range.  If our mask method yielded better
1054            # results, there is no sense trying something that is bound to be
1055            # worse.
1056            if (@return < @ranges) {
1057                return "( " . join( " || ", @return ) . " )";
1058            }
1059
1060            @masks = @return;
1061        }
1062    }
1063
1064    # Here, there was no entire-class optimization that was clearly better
1065    # than doing things by ranges.  Look at each range.
1066    my $range_count_extra = 0;
1067    for (my $i = 0; $i < @ranges; $i++) {
1068        if (! ref $ranges[$i]) {    # Trivial case: no range
1069            $ranges[$i] = sprintf "$self->{val_fmt} == $test", $ranges[$i];
1070        }
1071        elsif ($ranges[$i]->[0] == $ranges[$i]->[1]) {
1072            $ranges[$i] =           # Trivial case: single element range
1073                    sprintf "$self->{val_fmt} == $test", $ranges[$i]->[0];
1074        }
1075        elsif ($ranges[$i]->[0] == 0) {
1076            # If the range matches all 256 possible bytes, it is trivially
1077            # true.
1078            return 1 if $ranges[0]->[1] == 0xFF;    # @ranges must be 1 in
1079                                                    # this case
1080            $ranges[$i] = sprintf "( $test <= $self->{val_fmt} )",
1081                                                               $ranges[$i]->[1];
1082        }
1083        elsif ($ranges[$i]->[1] == 255) {
1084
1085            # Similarly the max possible is 255, so can omit an upper bound
1086            # test if the calculated max is the max possible one.
1087            $ranges[$i] = sprintf "( $test >= $self->{val_fmt} )",
1088                                                                $ranges[0]->[0];
1089        }
1090        else {
1091            my $output = "";
1092
1093            # Well-formed UTF-8 continuation bytes on ascii platforms must be
1094            # in the range 0x80 .. 0xBF.  If we know that the input is
1095            # well-formed (indicated by not trying to be 'safe'), we can omit
1096            # tests that verify that the input is within either of these
1097            # bounds.  (No legal UTF-8 character can begin with anything in
1098            # this range, so we don't have to worry about this being a
1099            # continuation byte or not.)
1100            if ($opts_ref->{charset} =~ /ascii/i
1101                && (! $opts_ref->{safe} && ! $opts_ref->{no_length_checks})
1102                && $opts_ref->{type} =~ / ^ (?: utf8 | high ) $ /xi)
1103            {
1104                # If the range is the entire legal range, it matches any legal
1105                # byte, so we can omit both tests.  (This should happen only
1106                # if the number of ranges is 1.)
1107                if ($ranges[$i]->[0] == 0x80 && $ranges[$i]->[1] == 0xBF) {
1108                    return 1;
1109                }
1110            }
1111
1112            # Here, it isn't the full range of legal continuation bytes.  We
1113            # could just assume that there's nothing outside of the legal
1114            # bounds.  But inRANGE() allows us to have a single conditional,
1115            # so the only cost of making sure it's a legal UTF-8 continuation
1116            # byte is an extra subtraction instruction, a trivial expense.
1117            $ranges[$i] = sprintf("inRANGE($test, $self->{val_fmt},"
1118                                                . " $self->{val_fmt} )",
1119                                        $ranges[$i]->[0], $ranges[$i]->[1]);
1120        }
1121    }
1122
1123    # We have generated the list of bytes in two ways; one trying to use masks
1124    # to cut the number of branches down, and the other to look at individual
1125    # ranges (some of which could be cut down by using a mask for just it).
1126    # We return whichever method uses the fewest branches.
1127    return "( "
1128           . join( " || ", (@masks && @masks < @ranges + $range_count_extra)
1129                            ? @masks
1130                            : @ranges)
1131           . " )";
1132}
1133
1134# _combine
1135# recursively turn a list of conditions into a fast break-out condition
1136# used by _cond_as_str() for 'cp' type macros.
1137sub _combine {
1138    my ( $self, $test, @cond )= @_;
1139    return if !@cond;
1140    my $item= shift @cond;
1141    my ( $cstr, $gtv );
1142    if ( ref $item ) {  # @item should be a 2-element array giving range start
1143                        # and end
1144        if ($item->[0] == 0) {  # UV's are never negative, so skip "0 <= "
1145                                # test which could generate a compiler warning
1146                                # that test is always true
1147            $cstr= sprintf( "$test <= $self->{val_fmt}", $item->[1] );
1148        }
1149        else {
1150            $cstr=
1151          sprintf( "inRANGE($test, $self->{val_fmt}, $self->{val_fmt})",
1152                   @$item );
1153        }
1154        $gtv= sprintf "$self->{val_fmt}", $item->[1];
1155    } else {
1156        $cstr= sprintf( "$self->{val_fmt} == $test", $item );
1157        $gtv= sprintf "$self->{val_fmt}", $item;
1158    }
1159    if ( @cond ) {
1160        my $combine= $self->_combine( $test, @cond );
1161        if (@cond >1) {
1162            return "( $cstr || ( $gtv < $test &&\n"
1163                   . $combine . " ) )";
1164        } else {
1165            return "( $cstr || $combine )";
1166        }
1167    } else {
1168        return $cstr;
1169    }
1170}
1171
1172# _render()
1173# recursively convert an optree to text with reasonably neat formatting
1174sub _render {
1175    my ( $self, $op, $combine, $brace, $opts_ref, $def, $submacros )= @_;
1176    return 0 if ! defined $op;  # The set is empty
1177    if ( !ref $op ) {
1178        return $op;
1179    }
1180    my $cond= $self->_cond_as_str( $op, $combine, $opts_ref );
1181    #no warnings 'recursion';   # This would allow really really inefficient
1182                                # code to be generated.  See pod
1183    my $yes= $self->_render( $op->{yes}, $combine, 1, $opts_ref, $def, $submacros );
1184    return $yes if $cond eq '1';
1185
1186    my $no= $self->_render( $op->{no},   $combine, 0, $opts_ref, $def, $submacros );
1187    return "( $cond )" if $yes eq '1' and $no eq '0';
1188    my ( $lb, $rb )= $brace ? ( "( ", " )" ) : ( "", "" );
1189    return "$lb$cond ? $yes : $no$rb"
1190      if !ref( $op->{yes} ) && !ref( $op->{no} );
1191    my $ind1= " " x 4;
1192    my $ind= "\n" . ( $ind1 x $op->{depth} );
1193
1194    if ( ref $op->{yes} ) {
1195        $yes= $ind . $ind1 . $yes;
1196    } else {
1197        $yes= " " . $yes;
1198    }
1199
1200    my $str= "$lb$cond ?$yes$ind: $no$rb";
1201    if (length $str > 6000) {
1202        push @$submacros, sprintf "#define $def\n( %s )", "_part" . (my $yes_idx= 0+@$submacros), $yes;
1203        push @$submacros, sprintf "#define $def\n( %s )", "_part" . (my $no_idx= 0+@$submacros), $no;
1204        return sprintf "%s%s ? $def : $def%s", $lb, $cond, "_part$yes_idx", "_part$no_idx", $rb;
1205    }
1206    return $str;
1207}
1208
1209# $expr=render($op,$combine)
1210#
1211# convert an optree to text with reasonably neat formatting. If $combine
1212# is true then the condition is created using "fast breakouts" which
1213# produce uglier expressions that are more efficient for common case,
1214# longer lists such as that resulting from type 'cp' output.
1215# Currently only used for type 'cp' macros.
1216sub render {
1217    my ( $self, $op, $combine, $opts_ref, $def_fmt )= @_;
1218
1219    my @submacros;
1220    my $macro= sprintf "#define $def_fmt\n( %s )", "", $self->_render( $op, $combine, 0, $opts_ref, $def_fmt, \@submacros );
1221
1222    return join "\n\n", map { "/*** GENERATED CODE ***/\n" . __macro( __clean( $_ ) ) } @submacros, $macro;
1223}
1224
1225# make_macro
1226# make a macro of a given type.
1227# calls into make_trie and (generic_|length_)optree as needed
1228# Opts are:
1229# type             : 'cp','cp_high', 'generic','high','low','latin1','utf8','LATIN1','UTF8'
1230# ret_type         : 'cp' or 'len'
1231# safe             : don't assume is well-formed UTF-8, so don't skip any range
1232#                    checks, and add length guards to macro
1233# no_length_checks : like safe, but don't add length guards.
1234#
1235# type defaults to 'generic', and ret_type to 'len' unless type is 'cp'
1236# in which case it defaults to 'cp' as well.
1237#
1238# It is illegal to do a type 'cp' macro on a pattern with multi-codepoint
1239# sequences in it, as the generated macro will accept only a single codepoint
1240# as an argument.
1241#
1242# It is also illegal to do a non-safe macro on a pattern with multi-codepoint
1243# sequences in it, as even if it is known to be well-formed, we need to not
1244# run off the end of the buffer when, say, the buffer ends with the first two
1245# characters, but three are looked at by the macro.
1246#
1247# returns the macro.
1248
1249
1250sub make_macro {
1251    my $self= shift;
1252    my %opts= @_;
1253    my $type= $opts{type} || 'generic';
1254    if ($self->{has_multi}) {
1255        if ($type =~ /^cp/) {
1256            die "Can't do a 'cp' on multi-codepoint character class '$self->{op}'"
1257        }
1258        elsif (! $opts{safe}) {
1259            die "'safe' is required on multi-codepoint character class '$self->{op}'"
1260        }
1261    }
1262    my $ret_type= $opts{ret_type} || ( $opts{type} =~ /^cp/ ? 'cp' : 'len' );
1263    my $method;
1264    if ( $opts{safe} ) {
1265        $method= 'length_optree';
1266    } elsif ( $type =~ /generic/ ) {
1267        $method= 'generic_optree';
1268    } else {
1269        $method= 'optree';
1270    }
1271    my @args= $type =~ /^cp/ ? 'cp' : 's';
1272    push @args, "e" if $opts{safe};
1273    push @args, "is_utf8" if $type =~ /generic/;
1274    push @args, "len" if $ret_type eq 'both';
1275    my $pfx= $ret_type eq 'both'    ? 'what_len_' :
1276             $ret_type eq 'cp'      ? 'what_'     : 'is_';
1277    my $ext= $type     =~ /generic/ ? ''          : '_' . lc( $type );
1278    $ext .= '_non_low' if $type eq 'generic_non_low';
1279    $ext .= "_safe" if $opts{safe};
1280    $ext .= "_no_length_checks" if $opts{no_length_checks};
1281    my $argstr= join ",", @args;
1282    my $def_fmt="$pfx$self->{op}$ext%s($argstr)";
1283    my $optree= $self->$method( %opts, type => $type, ret_type => $ret_type );
1284    return $self->render( $optree, ($type =~ /^cp/) ? 1 : 0, \%opts, $def_fmt );
1285}
1286
1287# if we aren't being used as a module (highly likely) then process
1288# the __DATA__ below and produce macros in regcharclass.h
1289# if an argument is provided to the script then it is assumed to
1290# be the path of the file to output to, if the arg is '-' outputs
1291# to STDOUT.
1292if ( !caller ) {
1293    $|++;
1294    my $path= shift @ARGV || "regcharclass.h";
1295    my $out_fh;
1296    if ( $path eq '-' ) {
1297        $out_fh= \*STDOUT;
1298    } else {
1299	$out_fh = open_new( $path );
1300    }
1301    print $out_fh read_only_top( lang => 'C', by => $0,
1302				 file => 'regcharclass.h', style => '*',
1303				 copyright => [2007, 2011],
1304                                 final => <<EOF,
1305WARNING: These macros are for internal Perl core use only, and may be
1306changed or removed without notice.
1307EOF
1308    );
1309    print $out_fh "\n#ifndef PERL_REGCHARCLASS_H_ /* Guard against nested #includes */\n#define PERL_REGCHARCLASS_H_\n";
1310
1311    my ( $op, $title, @txt, @types, %mods );
1312    my $doit= sub ($) {
1313        return unless $op;
1314
1315        my $charset = shift;
1316
1317        # Skip if to compile on a different platform.
1318        return if delete $mods{only_ascii_platform} && $charset !~ /ascii/i;
1319        return if delete $mods{only_ebcdic_platform} && $charset !~ /ebcdic/i;
1320
1321        print $out_fh "/*\n\t$op: $title\n\n";
1322        print $out_fh join "\n", ( map { "\t$_" } @txt ), "*/", "";
1323        my $obj= __PACKAGE__->new( op => $op, title => $title, txt => \@txt, charset => $charset);
1324
1325        #die Dumper(\@types,\%mods);
1326
1327        my @mods;
1328        push @mods, 'safe' if delete $mods{safe};
1329        push @mods, 'no_length_checks' if delete $mods{no_length_checks};
1330        unshift @mods, 'fast' if delete $mods{fast} || ! @mods; # Default to 'fast'
1331                                                                # do this one
1332                                                                # first, as
1333                                                                # traditional
1334        if (%mods) {
1335            die "Unknown modifiers: ", join ", ", map { "'$_'" } sort keys %mods;
1336        }
1337
1338        foreach my $type_spec ( @types ) {
1339            my ( $type, $ret )= split /-/, $type_spec;
1340            $ret ||= 'len';
1341            foreach my $mod ( @mods ) {
1342
1343                # 'safe' is irrelevant with code point macros, so skip if
1344                # there is also a 'fast', but don't skip if this is the only
1345                # way a cp macro will get generated.  Below we convert 'safe'
1346                # to 'fast' in this instance
1347                next if $type =~ /^cp/
1348                        && ($mod eq 'safe' || $mod eq 'no_length_checks')
1349                        && grep { 'fast' =~ $_ } @mods;
1350                delete $mods{$mod};
1351                my $macro= $obj->make_macro(
1352                    type     => $type,
1353                    ret_type => $ret,
1354                    safe     => $mod eq 'safe' && $type !~ /^cp/,
1355                    charset  => $charset,
1356                    no_length_checks => $mod eq 'no_length_checks' && $type !~ /^cp/,
1357                );
1358                print $out_fh $macro, "\n";
1359            }
1360        }
1361    };
1362
1363    my @data = <DATA>;
1364    foreach my $charset (get_supported_code_pages()) {
1365        my $first_time = 1;
1366        undef $op;
1367        undef $title;
1368        undef @txt;
1369        undef @types;
1370        undef %mods;
1371        print $out_fh "\n", get_conditional_compile_line_start($charset);
1372        my @data_copy = @data;
1373        for (@data_copy) {
1374            s/^ \s* (?: \# .* ) ? $ //x;    # squeeze out comment and blanks
1375            next unless /\S/;
1376            chomp;
1377            if ( /^[A-Z]/ ) {
1378                $doit->($charset) unless $first_time;  # This starts a new
1379                                                       # definition; do the
1380                                                       # previous one
1381                $first_time = 0;
1382                ( $op, $title )= split /\s*:\s*/, $_, 2;
1383                @txt= ();
1384            } elsif ( s/^=>// ) {
1385                my ( $type, $modifier )= split /:/, $_;
1386                @types= split ' ', $type;
1387                undef %mods;
1388                map { $mods{$_} = 1 } split ' ',  $modifier;
1389            } else {
1390                push @txt, "$_";
1391            }
1392        }
1393        $doit->($charset);
1394        print $out_fh get_conditional_compile_line_end();
1395    }
1396
1397    print $out_fh "\n#endif /* PERL_REGCHARCLASS_H_ */\n";
1398
1399    if($path eq '-') {
1400	print $out_fh "/* ex: set ro: */\n";
1401    } else {
1402        # Some of the sources for these macros come from Unicode tables
1403        my $sources_list = "lib/unicore/mktables.lst";
1404        my @sources = ($0, qw(lib/unicore/mktables
1405                              lib/Unicode/UCD.pm
1406                              regen/regcharclass_multi_char_folds.pl
1407                              regen/charset_translations.pl
1408                             ));
1409        {
1410            # Depend on mktables’ own sources.  It’s a shorter list of files than
1411            # those that Unicode::UCD uses.
1412            if (! open my $mktables_list, '<', $sources_list) {
1413
1414                # This should force a rebuild once $sources_list exists
1415                push @sources, $sources_list;
1416            }
1417            else {
1418                while(<$mktables_list>) {
1419                    last if /===/;
1420                    chomp;
1421                    push @sources, "lib/unicore/$_" if /^[^#]/;
1422                }
1423            }
1424        }
1425        read_only_bottom_close_and_rename($out_fh, \@sources)
1426    }
1427}
1428
1429# The form of the input is a series of definitions to make macros for.
1430# The first line gives the base name of the macro, followed by a colon, and
1431# then text to be used in comments associated with the macro that are its
1432# title or description.  In all cases the first (perhaps only) parameter to
1433# the macro is a pointer to the first byte of the code point it is to test to
1434# see if it is in the class determined by the macro.  In the case of non-UTF8,
1435# the code point consists only of a single byte.
1436#
1437# The second line must begin with a '=>' and be followed by the types of
1438# macro(s) to be generated; these are specified below.  A colon follows the
1439# types, followed by the modifiers, also specified below.  At least one
1440# modifier is required.
1441#
1442# The subsequent lines give what code points go into the class defined by the
1443# macro.  Multiple characters may be specified via a string like "\x0D\x0A",
1444# enclosed in quotes.  Otherwise the lines consist of one of:
1445#   1)  a single Unicode code point, prefaced by 0x
1446#   2)  a single range of Unicode code points separated by a minus (and
1447#       optional space)
1448#   3)  a single Unicode property specified in the standard Perl form
1449#       "\p{...}"
1450#   4)  a line like 'do path'.  This will do a 'do' on the file given by
1451#       'path'.  It is assumed that this does nothing but load subroutines
1452#       (See item 5 below).  The reason 'require path' is not used instead is
1453#       because 'do' doesn't assume that path is in @INC.
1454#   5)  a subroutine call
1455#           &pkg::foo(arg1, ...)
1456#       where pkg::foo was loaded by a 'do' line (item 4).  The subroutine
1457#       returns an array of entries of forms like items 1-3 above.  This
1458#       allows more complex inputs than achievable from the other input types.
1459#
1460# A blank line or one whose first non-blank character is '#' is a comment.
1461# The definition of the macro is terminated by a line unlike those described.
1462#
1463# Valid types:
1464#   low         generate a macro whose name is 'is_BASE_low' and defines a
1465#               class that includes only ASCII-range chars.  (BASE is the
1466#               input macro base name.)
1467#   latin1      generate a macro whose name is 'is_BASE_latin1' and defines a
1468#               class that includes only upper-Latin1-range chars.  It is not
1469#               designed to take a UTF-8 input parameter.
1470#   high        generate a macro whose name is 'is_BASE_high' and defines a
1471#               class that includes all relevant code points that are above
1472#               the Latin1 range.  This is for very specialized uses only.
1473#               It is designed to take only an input UTF-8 parameter.
1474#   utf8        generate a macro whose name is 'is_BASE_utf8' and defines a
1475#               class that includes all relevant characters that aren't ASCII.
1476#               It is designed to take only an input UTF-8 parameter.
1477#   LATIN1      generate a macro whose name is 'is_BASE_latin1' and defines a
1478#               class that includes both ASCII and upper-Latin1-range chars.
1479#               It is not designed to take a UTF-8 input parameter.
1480#   UTF8        generate a macro whose name is 'is_BASE_utf8' and defines a
1481#               class that can include any code point, adding the 'low' ones
1482#               to what 'utf8' works on.  It is designed to take only an input
1483#               UTF-8 parameter.
1484#   generic     generate a macro whose name is 'is_BASE".  It has a 2nd,
1485#               boolean, parameter which indicates if the first one points to
1486#               a UTF-8 string or not.  Thus it works in all circumstances.
1487#   generic_non_low generate a macro whose name is 'is_BASE_non_low".  It has
1488#               a 2nd, boolean, parameter which indicates if the first one
1489#               points to a UTF-8 string or not.  It excludes any ASCII-range
1490#               matches, but otherwise it works in all circumstances.
1491#   cp          generate a macro whose name is 'is_BASE_cp' and defines a
1492#               class that returns true if the UV parameter is a member of the
1493#               class; false if not.
1494#   cp_high     like cp, but it is assumed that it is known that the UV
1495#               parameter is above Latin1.  The name of the generated macro is
1496#               'is_BASE_cp_high'.  This is different from high-cp, derived
1497#               below.
1498# A macro of the given type is generated for each type listed in the input.
1499# The default return value is the number of octets read to generate the match.
1500# Append "-cp" to the type to have it instead return the matched codepoint.
1501#               The macro name is changed to 'what_BASE...'.  See pod for
1502#               caveats
1503# Appending '-both" instead adds an extra parameter to the end of the argument
1504#               list, which is a pointer as to where to store the number of
1505#               bytes matched, while also returning the code point.  The macro
1506#               name is changed to 'what_len_BASE...'.  See pod for caveats
1507#
1508# Valid modifiers:
1509#   safe        The input string is not necessarily valid UTF-8.  In
1510#               particular an extra parameter (always the 2nd) to the macro is
1511#               required, which points to one beyond the end of the string.
1512#               The macro will make sure not to read off the end of the
1513#               string.  In the case of non-UTF8, it makes sure that the
1514#               string has at least one byte in it.  The macro name has
1515#               '_safe' appended to it.
1516#   no_length_checks  The input string is not necessarily valid UTF-8, but it
1517#               is to be assumed that the length has already been checked and
1518#               found to be valid
1519#   fast        The input string is valid UTF-8.  No bounds checking is done,
1520#               and the macro can make assumptions that lead to faster
1521#               execution.
1522#   only_ascii_platform   Skip this definition if the character set is for
1523#               a non-ASCII platform.
1524#   only_ebcdic_platform  Skip this definition if the character set is for
1525#               a non-EBCDIC platform.
1526# No modifier need be specified; fast is assumed for this case.  If both
1527# 'fast', and 'safe' are specified, two macros will be created for each
1528# 'type'.
1529#
1530# If run on a non-ASCII platform will automatically convert the Unicode input
1531# to native.  The documentation above is slightly wrong in this case.  'low'
1532# actually refers to code points whose UTF-8 representation is the same as the
1533# non-UTF-8 version (invariants); and 'latin1' refers to all the rest of the
1534# code points less than 256.
1535
15361; # in the unlikely case we are being used as a module
1537
1538__DATA__
1539# This is no longer used, but retained in case it is needed some day.
1540# TRICKYFOLD: Problematic fold case letters.  When adding to this list, also should add them to regcomp.c and fold_grind.t
1541# => generic cp generic-cp generic-both :fast safe
1542# 0x00DF	# LATIN SMALL LETTER SHARP S
1543# 0x0390	# GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1544# 0x03B0	# GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1545# 0x1E9E  # LATIN CAPITAL LETTER SHARP S, because maps to same as 00DF
1546# 0x1FD3  # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA; maps same as 0390
1547# 0x1FE3  # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA; maps same as 03B0
1548
1549LNBREAK: Line Break: \R
1550=> generic UTF8 LATIN1 : safe
1551"\x0D\x0A"      # CRLF - Network (Windows) line ending
1552\p{VertSpace}
1553
1554HORIZWS: Horizontal Whitespace: \h \H
1555=> high cp_high : fast
1556\p{HorizSpace}
1557
1558VERTWS: Vertical Whitespace: \v \V
1559=> high cp_high : fast
1560\p{VertSpace}
1561
1562XDIGIT: Hexadecimal digits
1563=> high cp_high : fast
1564\p{XDigit}
1565
1566XPERLSPACE: \p{XPerlSpace}
1567=> high cp_high : fast
1568\p{XPerlSpace}
1569
1570NONCHAR: Non character code points
1571=> UTF8 :safe
1572\p{_Perl_Nchar}
1573
1574SURROGATE: Surrogate code points
1575=> UTF8 :safe
1576\p{_Perl_Surrogate}
1577
1578QUOTEMETA: Meta-characters that \Q should quote
1579=> high :fast
1580\p{_Perl_Quotemeta}
1581
1582MULTI_CHAR_FOLD: multi-char strings that are folded to by a single character
1583=> UTF8 :safe
1584&regcharclass_multi_char_folds::multi_char_folds('u', 'a')
1585
1586MULTI_CHAR_FOLD: multi-char strings that are folded to by a single character
1587=> LATIN1 : safe
1588&regcharclass_multi_char_folds::multi_char_folds('l', 'a')
1589
1590THREE_CHAR_FOLD: A three-character multi-char fold
1591=> UTF8 :safe
1592&regcharclass_multi_char_folds::multi_char_folds('u', '3')
1593
1594THREE_CHAR_FOLD: A three-character multi-char fold
1595=> LATIN1 :safe
1596&regcharclass_multi_char_folds::multi_char_folds('l', '3')
1597
1598THREE_CHAR_FOLD_HEAD: The first two of three-character multi-char folds
1599=> UTF8 :safe
1600&regcharclass_multi_char_folds::multi_char_folds('u', 'h')
1601
1602THREE_CHAR_FOLD_HEAD: The first two of three-character multi-char folds
1603=> LATIN1 :safe
1604&regcharclass_multi_char_folds::multi_char_folds('l', 'h')
1605#
1606#THREE_CHAR_FOLD_NON_FINAL: The first or middle character of multi-char folds
1607#=> UTF8 :safe
1608#&regcharclass_multi_char_folds::multi_char_folds('u', 'fm')
1609#
1610#THREE_CHAR_FOLD_NON_FINAL: The first or middle character of multi-char folds
1611#=> LATIN1 :safe
1612#&regcharclass_multi_char_folds::multi_char_folds('l', 'fm')
1613
1614FOLDS_TO_MULTI: characters that fold to multi-char strings
1615=> UTF8 :fast
1616\p{_Perl_Folds_To_Multi_Char}
1617
1618PROBLEMATIC_LOCALE_FOLD : characters whose fold is problematic under locale
1619=> UTF8 cp :fast
1620\p{_Perl_Problematic_Locale_Folds}
1621
1622PROBLEMATIC_LOCALE_FOLDEDS_START : The first folded character of folds which are problematic under locale
1623=> UTF8 cp :fast
1624\p{_Perl_Problematic_Locale_Foldeds_Start}
1625
1626PATWS: pattern white space
1627=> generic cp : safe
1628\p{_Perl_PatWS}
1629
1630HANGUL_ED: Hangul syllables whose first character is \xED
1631=> UTF8 :only_ascii_platform safe
16320xD000 - 0xD7FF
1633