1#!perl 2package CharClass::Matcher; 3use strict; 4use 5.008; 5use warnings; 6use warnings FATAL => 'all'; 7use Data::Dumper; 8$Data::Dumper::Useqq= 1; 9our $hex_fmt= "0x%02X"; 10 11sub DEBUG () { 0 } 12$|=1 if DEBUG; 13 14require './regen/regen_lib.pl'; 15require './regen/charset_translations.pl'; 16require "./regen/regcharclass_multi_char_folds.pl"; 17 18=head1 NAME 19 20CharClass::Matcher -- Generate C macros that match character classes efficiently 21 22=head1 SYNOPSIS 23 24 perl Porting/regcharclass.pl 25 26=head1 DESCRIPTION 27 28Dynamically generates macros for detecting special charclasses 29in latin-1, utf8, and codepoint forms. Macros can be set to return 30the length (in bytes) of the matched codepoint, and/or the codepoint itself. 31 32To regenerate F<regcharclass.h>, run this script from perl-root. No arguments 33are necessary. 34 35Using WHATEVER as an example the following macros can be produced, depending 36on the input parameters (how to get each is described by internal comments at 37the C<__DATA__> line): 38 39=over 4 40 41=item C<is_WHATEVER(s,is_utf8)> 42 43=item C<is_WHATEVER_safe(s,e,is_utf8)> 44 45Do a lookup as appropriate based on the C<is_utf8> flag. When possible 46comparisons involving octet<128 are done before checking the C<is_utf8> 47flag, hopefully saving time. 48 49The version without the C<_safe> suffix should be used only when the input is 50known to be well-formed. 51 52=item C<is_WHATEVER_utf8(s)> 53 54=item C<is_WHATEVER_utf8_safe(s,e)> 55 56Do a lookup assuming the string is encoded in (normalized) UTF8. 57 58The version without the C<_safe> suffix should be used only when the input is 59known to be well-formed. 60 61=item C<is_WHATEVER_latin1(s)> 62 63=item C<is_WHATEVER_latin1_safe(s,e)> 64 65Do a lookup assuming the string is encoded in latin-1 (aka plan octets). 66 67The version without the C<_safe> suffix should be used only when it is known 68that C<s> contains at least one character. 69 70=item C<is_WHATEVER_cp(cp)> 71 72Check to see if the string matches a given codepoint (hypothetically a 73U32). The condition is constructed as to "break out" as early as 74possible if the codepoint is out of range of the condition. 75 76IOW: 77 78 (cp==X || (cp>X && (cp==Y || (cp>Y && ...)))) 79 80Thus if the character is X+1 only two comparisons will be done. Making 81matching lookups slower, but non-matching faster. 82 83=item C<what_len_WHATEVER_FOO(arg1, ..., len)> 84 85A variant form of each of the macro types described above can be generated, in 86which the code point is returned by the macro, and an extra parameter (in the 87final position) is added, which is a pointer for the macro to set the byte 88length of the returned code point. 89 90These forms all have a C<what_len> prefix instead of the C<is_>, for example 91C<what_len_WHATEVER_safe(s,e,is_utf8,len)> and 92C<what_len_WHATEVER_utf8(s,len)>. 93 94These forms should not be used I<except> on small sets of mostly widely 95separated code points; otherwise the code generated is inefficient. For these 96cases, it is best to use the C<is_> forms, and then find the code point with 97C<utf8_to_uvchr_buf>(). This program can fail with a "deep recursion" 98message on the worst of the inappropriate sets. Examine the generated macro 99to see if it is acceptable. 100 101=item C<what_WHATEVER_FOO(arg1, ...)> 102 103A variant form of each of the C<is_> macro types described above can be generated, in 104which the code point and not the length is returned by the macro. These have 105the same caveat as L</what_len_WHATEVER_FOO(arg1, ..., len)>, plus they should 106not be used where the set contains a NULL, as 0 is returned for two different 107cases: a) the set doesn't include the input code point; b) the set does 108include it, and it is a NULL. 109 110=back 111 112The above isn't quite complete, as for specialized purposes one can get a 113macro like C<is_WHATEVER_utf8_no_length_checks(s)>, which assumes that it is 114already known that there is enough space to hold the character starting at 115C<s>, but otherwise checks that it is well-formed. In other words, this is 116intermediary in checking between C<is_WHATEVER_utf8(s)> and 117C<is_WHATEVER_utf8_safe(s,e)>. 118 119=head2 CODE FORMAT 120 121perltidy -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f" 122 123 124=head1 AUTHOR 125 126Author: Yves Orton (demerphq) 2007. Maintained by Perl5 Porters. 127 128=head1 BUGS 129 130No tests directly here (although the regex engine will fail tests 131if this code is broken). Insufficient documentation and no Getopts 132handler for using the module as a script. 133 134=head1 LICENSE 135 136You may distribute under the terms of either the GNU General Public 137License or the Artistic License, as specified in the README file. 138 139=cut 140 141# Sub naming convention: 142# __func : private subroutine, can not be called as a method 143# _func : private method, not meant for external use 144# func : public method. 145 146# private subs 147#------------------------------------------------------------------------------- 148# 149# ($cp,$n,$l,$u)=__uni_latin($str); 150# 151# Return a list of arrays, each of which when interpreted correctly 152# represent the string in some given encoding with specific conditions. 153# 154# $cp - list of codepoints that make up the string. 155# $n - list of octets that make up the string if all codepoints are invariant 156# regardless of if the string is in UTF-8 or not. 157# $l - list of octets that make up the string in latin1 encoding if all 158# codepoints < 256, and at least one codepoint is UTF-8 variant. 159# $u - list of octets that make up the string in utf8 if any codepoint is 160# UTF-8 variant 161# 162# High CP | Defined 163#-----------+---------- 164# 0 - 127 : $n (127/128 are the values for ASCII platforms) 165# 128 - 255 : $l, $u 166# 256 - ... : $u 167# 168 169sub __uni_latin1 { 170 my $charset= shift; 171 my $str= shift; 172 my $max= 0; 173 my @cp; 174 my @cp_high; 175 my $only_has_invariants = 1; 176 my $a2n = get_a2n($charset); 177 for my $ch ( split //, $str ) { 178 my $cp= ord $ch; 179 $max= $cp if $max < $cp; 180 if ($cp > 255) { 181 push @cp, $cp; 182 push @cp_high, $cp; 183 } 184 else { 185 push @cp, $a2n->[$cp]; 186 } 187 } 188 my ( $n, $l, $u ); 189 $only_has_invariants = ($charset =~ /ascii/i) ? $max < 128 : $max < 160; 190 if ($only_has_invariants) { 191 $n= [@cp]; 192 } else { 193 $l= [@cp] if $max && $max < 256; 194 195 my @u; 196 for my $ch ( split //, $str ) { 197 push @u, map { ord } split //, cp_2_utfbytes(ord $ch, $charset); 198 } 199 $u = \@u; 200 } 201 return ( \@cp, \@cp_high, $n, $l, $u ); 202} 203 204# 205# $clean= __clean($expr); 206# 207# Cleanup a ternary expression, removing unnecessary parens and apply some 208# simplifications using regexes. 209# 210 211sub __clean { 212 my ( $expr )= @_; 213 214 #return $expr; 215 216 our $parens; 217 $parens= qr/ (?> \( (?> (?: (?> [^()]+ ) | (??{ $parens }) )* ) \) ) /x; 218 219 ## remove redundant parens 220 1 while $expr =~ s/ \( \s* ( $parens ) \s* \) /$1/gx; 221 222 223 # repeatedly simplify conditions like 224 # ( (cond1) ? ( (cond2) ? X : Y ) : Y ) 225 # into 226 # ( ( (cond1) && (cond2) ) ? X : Y ) 227 # Also similarly handles expressions like: 228 # : (cond1) ? ( (cond2) ? X : Y ) : Y ) 229 # Note the inclusion of the close paren in ([:()]) and the open paren in ([()]) is 230 # purely to ensure we have a balanced set of parens in the expression which makes 231 # it easier to understand the pattern in an editor that understands paren's, we do 232 # not expect either of these cases to actually fire. - Yves 233 1 while $expr =~ s/ 234 ([:()]) \s* 235 ($parens) \s* 236 \? \s* 237 \( \s* ($parens) \s* 238 \? \s* ($parens|[^()?:\s]+?) \s* 239 : \s* ($parens|[^()?:\s]+?) \s* 240 \) \s* 241 : \s* \5 \s* 242 ([()]) 243 /$1 ( $2 && $3 ) ? $4 : $5 $6/gx; 244 #$expr=~s/\(\(U8\*\)s\)\[(\d+)\]/S$1/g if length $expr > 8000; 245 #$expr=~s/\s+//g if length $expr > 8000; 246 247 die "Expression too long" if length $expr > 8000; 248 249 return $expr; 250} 251 252# 253# $text= __macro(@args); 254# Join args together by newlines, and then neatly add backslashes to the end 255# of every line as expected by the C pre-processor for #define's. 256# 257 258sub __macro { 259 my $str= join "\n", @_; 260 $str =~ s/\s*$//; 261 my @lines= map { s/\s+$//; s/\t/ /g; $_ } split /\n/, $str; 262 my $last= pop @lines; 263 $str= join "\n", ( map { sprintf "%-76s\\", $_ } @lines ), $last; 264 1 while $str =~ s/^(\t*) {8}/$1\t/gm; 265 return $str . "\n"; 266} 267 268# 269# my $op=__incrdepth($op); 270# 271# take an 'op' hashref and add one to it and all its childrens depths. 272# 273 274sub __incrdepth { 275 my $op= shift; 276 return unless ref $op; 277 $op->{depth} += 1; 278 __incrdepth( $op->{yes} ); 279 __incrdepth( $op->{no} ); 280 return $op; 281} 282 283# join two branches of an opcode together with a condition, incrementing 284# the depth on the yes branch when we do so. 285# returns the new root opcode of the tree. 286sub __cond_join { 287 my ( $cond, $yes, $no )= @_; 288 if (ref $yes) { 289 return { 290 test => $cond, 291 yes => __incrdepth( $yes ), 292 no => $no, 293 depth => 0, 294 }; 295 } 296 else { 297 return { 298 test => $cond, 299 yes => $yes, 300 no => __incrdepth($no), 301 depth => 0, 302 }; 303 } 304} 305 306# Methods 307 308# constructor 309# 310# my $obj=CLASS->new(op=>'SOMENAME',title=>'blah',txt=>[..]); 311# 312# Create a new CharClass::Matcher object by parsing the text in 313# the txt array. Currently applies the following rules: 314# 315# Element starts with C<0x>, line is evaled the result treated as 316# a number which is passed to chr(). 317# 318# Element starts with C<">, line is evaled and the result treated 319# as a string. 320# 321# Each string is then stored in the 'strs' subhash as a hash record 322# made up of the results of __uni_latin1, using the keynames 323# 'low','latin1','utf8', as well as the synthesized 'LATIN1', 'high', and 324# 'UTF8' which hold a merge of 'low' and their lowercase equivalents. 325# 326# Size data is tracked per type in the 'size' subhash. 327# 328# Return an object 329# 330sub new { 331 my $class= shift; 332 my %opt= @_; 333 for ( qw(op txt) ) { 334 die "in " . __PACKAGE__ . " constructor '$_;' is a mandatory field" 335 if !exists $opt{$_}; 336 } 337 338 my $self= bless { 339 op => $opt{op}, 340 title => $opt{title} || '', 341 }, $class; 342 foreach my $txt ( @{ $opt{txt} } ) { 343 my $str= $txt; 344 if ( $str =~ /^[""]/ ) { 345 $str= eval $str; 346 } elsif ($str =~ / - /x ) { # A range: Replace this element on the 347 # list with its expansion 348 my ($lower, $upper) = $str =~ / 0x (.+?) \s* - \s* 0x (.+) /x; 349 die "Format must be like '0xDEAD - 0xBEAF'; instead was '$str'" if ! defined $lower || ! defined $upper; 350 foreach my $cp (hex $lower .. hex $upper) { 351 push @{$opt{txt}}, sprintf "0x%X", $cp; 352 } 353 next; 354 } elsif ($str =~ s/ ^ N (?= 0x ) //x ) { 355 # Otherwise undocumented, a leading N means is already in the 356 # native character set; don't convert. 357 $str= chr eval $str; 358 } elsif ( $str =~ /^0x/ ) { 359 $str= eval $str; 360 $str = chr $str; 361 } elsif ( $str =~ / \s* \\p \{ ( .*? ) \} /x) { 362 my $property = $1; 363 use Unicode::UCD qw(prop_invlist); 364 365 my @invlist = prop_invlist($property, '_perl_core_internal_ok'); 366 if (! @invlist) { 367 368 # An empty return could mean an unknown property, or merely 369 # that it is empty. Call in scalar context to differentiate 370 my $count = prop_invlist($property, '_perl_core_internal_ok'); 371 die "$property not found" unless defined $count; 372 } 373 374 # Replace this element on the list with the property's expansion 375 for (my $i = 0; $i < @invlist; $i += 2) { 376 foreach my $cp ($invlist[$i] .. $invlist[$i+1] - 1) { 377 378 # prop_invlist() returns native values; add leading 'N' 379 # to indicate that. 380 push @{$opt{txt}}, sprintf "N0x%X", $cp; 381 } 382 } 383 next; 384 } elsif ($str =~ / ^ do \s+ ( .* ) /x) { 385 die "do '$1' failed: $!$@" if ! do $1 or $@; 386 next; 387 } elsif ($str =~ / ^ & \s* ( .* ) /x) { # user-furnished sub() call 388 my @results = eval "$1"; 389 die "eval '$1' failed: $@" if $@; 390 push @{$opt{txt}}, @results; 391 next; 392 } else { 393 die "Unparsable line: $txt\n"; 394 } 395 my ( $cp, $cp_high, $low, $latin1, $utf8 )= __uni_latin1( $opt{charset}, $str ); 396 my $UTF8= $low || $utf8; 397 my $LATIN1= $low || $latin1; 398 my $high = (scalar grep { $_ < 256 } @$cp) ? 0 : $utf8; 399 #die Dumper($txt,$cp,$low,$latin1,$utf8) 400 # if $txt=~/NEL/ or $utf8 and @$utf8>3; 401 402 @{ $self->{strs}{$str} }{qw( str txt low utf8 latin1 high cp cp_high UTF8 LATIN1 )}= 403 ( $str, $txt, $low, $utf8, $latin1, $high, $cp, $cp_high, $UTF8, $LATIN1 ); 404 my $rec= $self->{strs}{$str}; 405 foreach my $key ( qw(low utf8 latin1 high cp cp_high UTF8 LATIN1) ) { 406 $self->{size}{$key}{ 0 + @{ $self->{strs}{$str}{$key} } }++ 407 if $self->{strs}{$str}{$key}; 408 } 409 $self->{has_multi} ||= @$cp > 1; 410 $self->{has_ascii} ||= $latin1 && @$latin1; 411 $self->{has_low} ||= $low && @$low; 412 $self->{has_high} ||= !$low && !$latin1; 413 } 414 $self->{val_fmt}= $hex_fmt; 415 $self->{count}= 0 + keys %{ $self->{strs} }; 416 return $self; 417} 418 419# my $trie = make_trie($type,$maxlen); 420# 421# using the data stored in the object build a trie of a specific type, 422# and with specific maximum depth. The trie is made up the elements of 423# the given types array for each string in the object (assuming it is 424# not too long.) 425# 426# returns the trie, or undef if there was no relevant data in the object. 427# 428 429sub make_trie { 430 my ( $self, $type, $maxlen )= @_; 431 432 my $strs= $self->{strs}; 433 my %trie; 434 foreach my $rec ( values %$strs ) { 435 die "panic: unknown type '$type'" 436 if !exists $rec->{$type}; 437 my $dat= $rec->{$type}; 438 next unless $dat; 439 next if $maxlen && @$dat > $maxlen; 440 my $node= \%trie; 441 foreach my $elem ( @$dat ) { 442 $node->{$elem} ||= {}; 443 $node= $node->{$elem}; 444 } 445 $node->{''}= $rec->{str}; 446 } 447 return 0 + keys( %trie ) ? \%trie : undef; 448} 449 450sub pop_count ($) { 451 my $word = shift; 452 453 # This returns a list of the positions of the bits in the input word that 454 # are 1. 455 456 my @positions; 457 my $position = 0; 458 while ($word) { 459 push @positions, $position if $word & 1; 460 $position++; 461 $word >>= 1; 462 } 463 return @positions; 464} 465 466# my $optree= _optree() 467# 468# recursively convert a trie to an optree where every node represents 469# an if else branch. 470# 471# 472 473sub _optree { 474 my ( $self, $trie, $test_type, $ret_type, $else, $depth )= @_; 475 return unless defined $trie; 476 if ( $self->{has_multi} and $ret_type =~ /cp|both/ ) { 477 die "Can't do 'cp' optree from multi-codepoint strings"; 478 } 479 $ret_type ||= 'len'; 480 $else= 0 unless defined $else; 481 $depth= 0 unless defined $depth; 482 483 # if we have an empty string as a key it means we are in an 484 # accepting state and unless we can match further on should 485 # return the value of the '' key. 486 if (exists $trie->{''} ) { 487 # we can now update the "else" value, anything failing to match 488 # after this point should return the value from this. 489 if ( $ret_type eq 'cp' ) { 490 $else= $self->{strs}{ $trie->{''} }{cp}[0]; 491 $else= sprintf "$self->{val_fmt}", $else if $else > 9; 492 } elsif ( $ret_type eq 'len' ) { 493 $else= $depth; 494 } elsif ( $ret_type eq 'both') { 495 $else= $self->{strs}{ $trie->{''} }{cp}[0]; 496 $else= sprintf "$self->{val_fmt}", $else if $else > 9; 497 $else= "len=$depth, $else"; 498 } 499 } 500 # extract the meaningful keys from the trie, filter out '' as 501 # it means we are an accepting state (end of sequence). 502 my @conds= sort { $a <=> $b } grep { length $_ } keys %$trie; 503 504 # if we haven't any keys there is no further we can match and we 505 # can return the "else" value. 506 return $else if !@conds; 507 508 my $test = $test_type =~ /^cp/ ? "cp" : "((const U8*)s)[$depth]"; 509 510 # First we loop over the possible keys/conditions and find out what they 511 # look like; we group conditions with the same optree together. 512 my %dmp_res; 513 my @res_order; 514 local $Data::Dumper::Sortkeys=1; 515 foreach my $cond ( @conds ) { 516 517 # get the optree for this child/condition 518 my $res= $self->_optree( $trie->{$cond}, $test_type, $ret_type, $else, $depth + 1 ); 519 # convert it to a string with Dumper 520 my $res_code= Dumper( $res ); 521 522 push @{$dmp_res{$res_code}{vals}}, $cond; 523 if (!$dmp_res{$res_code}{optree}) { 524 $dmp_res{$res_code}{optree}= $res; 525 push @res_order, $res_code; 526 } 527 } 528 529 # now that we have deduped the optrees we construct a new optree containing the merged 530 # results. 531 my %root; 532 my $node= \%root; 533 foreach my $res_code_idx (0 .. $#res_order) { 534 my $res_code= $res_order[$res_code_idx]; 535 $node->{vals}= $dmp_res{$res_code}{vals}; 536 $node->{test}= $test; 537 $node->{yes}= $dmp_res{$res_code}{optree}; 538 $node->{depth}= $depth; 539 if ($res_code_idx < $#res_order) { 540 $node= $node->{no}= {}; 541 } else { 542 $node->{no}= $else; 543 } 544 } 545 546 # return the optree. 547 return \%root; 548} 549 550# my $optree= optree(%opts); 551# 552# Convert a trie to an optree, wrapper for _optree 553 554sub optree { 555 my $self= shift; 556 my %opt= @_; 557 my $trie= $self->make_trie( $opt{type}, $opt{max_depth} ); 558 $opt{ret_type} ||= 'len'; 559 my $test_type= $opt{type} =~ /^cp/ ? 'cp' : 'depth'; 560 return $self->_optree( $trie, $test_type, $opt{ret_type}, $opt{else}, 0 ); 561} 562 563# my $optree= generic_optree(%opts); 564# 565# build a "generic" optree out of the three 'low', 'latin1', 'utf8' 566# sets of strings, including a branch for handling the string type check. 567# 568 569sub generic_optree { 570 my $self= shift; 571 my %opt= @_; 572 573 $opt{ret_type} ||= 'len'; 574 my $test_type= 'depth'; 575 my $else= $opt{else} || 0; 576 577 my $latin1= $self->make_trie( 'latin1', $opt{max_depth} ); 578 my $utf8= $self->make_trie( 'utf8', $opt{max_depth} ); 579 580 $_= $self->_optree( $_, $test_type, $opt{ret_type}, $else, 0 ) 581 for $latin1, $utf8; 582 583 if ( $utf8 ) { 584 $else= __cond_join( "( is_utf8 )", $utf8, $latin1 || $else ); 585 } elsif ( $latin1 ) { 586 $else= __cond_join( "!( is_utf8 )", $latin1, $else ); 587 } 588 if ($opt{type} eq 'generic') { 589 my $low= $self->make_trie( 'low', $opt{max_depth} ); 590 if ( $low ) { 591 $else= $self->_optree( $low, $test_type, $opt{ret_type}, $else, 0 ); 592 } 593 } 594 595 return $else; 596} 597 598# length_optree() 599# 600# create a string length guarded optree. 601# 602 603sub length_optree { 604 my $self= shift; 605 my %opt= @_; 606 my $type= $opt{type}; 607 608 die "Can't do a length_optree on type 'cp', makes no sense." 609 if $type =~ /^cp/; 610 611 my $else= ( $opt{else} ||= 0 ); 612 613 return $else if $self->{count} == 0; 614 615 my $method = $type =~ /generic/ ? 'generic_optree' : 'optree'; 616 if ($method eq 'optree' && scalar keys %{$self->{size}{$type}} == 1) { 617 618 # Here is non-generic output (meaning that we are only generating one 619 # type), and all things that match have the same number ('size') of 620 # bytes. The length guard is simply that we have that number of 621 # bytes. 622 my @size = keys %{$self->{size}{$type}}; 623 my $cond= "((e) - (s)) >= $size[0]"; 624 my $optree = $self->$method(%opt); 625 $else= __cond_join( $cond, $optree, $else ); 626 } 627 elsif ($self->{has_multi}) { 628 my @size; 629 630 # Here, there can be a match of a multiple character string. We use 631 # the traditional method which is to have a branch for each possible 632 # size (longest first) and test for the legal values for that size. 633 my %sizes= ( 634 %{ $self->{size}{low} || {} }, 635 %{ $self->{size}{latin1} || {} }, 636 %{ $self->{size}{utf8} || {} } 637 ); 638 if ($method eq 'generic_optree') { 639 @size= sort { $a <=> $b } keys %sizes; 640 } else { 641 @size= sort { $a <=> $b } keys %{ $self->{size}{$type} }; 642 } 643 for my $size ( @size ) { 644 my $optree= $self->$method( %opt, type => $type, max_depth => $size ); 645 my $cond= "((e)-(s) > " . ( $size - 1 ).")"; 646 $else= __cond_join( $cond, $optree, $else ); 647 } 648 } 649 else { 650 my $utf8; 651 652 # Here, has more than one possible size, and only matches a single 653 # character. For non-utf8, the needed length is 1; for utf8, it is 654 # found by array lookup 'UTF8SKIP'. 655 656 # If want just the code points above 255, set up to look for those; 657 # otherwise assume will be looking for all non-UTF-8-invariant code 658 # poiints. 659 my $trie_type = ($type eq 'high') ? 'high' : 'utf8'; 660 661 # If we do want more than the 0-255 range, find those, and if they 662 # exist... 663 if ($opt{type} !~ /latin1/i && ($utf8 = $self->make_trie($trie_type, 0))) { 664 665 # ... get them into an optree, and set them up as the 'else' clause 666 $utf8 = $self->_optree( $utf8, 'depth', $opt{ret_type}, 0, 0 ); 667 668 # We could make this 669 # UTF8_IS_START(*s) && ((e) - (s)) >= UTF8SKIP(s))"; 670 # to avoid doing the UTF8SKIP and subsequent branches for invariants 671 # that don't match. But the current macros that get generated 672 # have only a few things that can match past this, so I (khw) 673 # don't think it is worth it. (Even better would be to use 674 # calculate_mask(keys %$utf8) instead of UTF8_IS_START, and use it 675 # if it saves a bunch. We assume that input text likely to be 676 # well-formed . 677 my $cond = "LIKELY(((e) - (s)) >= UTF8SKIP(s))"; 678 $else = __cond_join($cond, $utf8, $else); 679 680 # For 'generic', we also will want the latin1 UTF-8 variants for 681 # the case where the input isn't UTF-8. 682 my $latin1; 683 if ($method eq 'generic_optree') { 684 $latin1 = $self->make_trie( 'latin1', 1); 685 $latin1= $self->_optree( $latin1, 'depth', $opt{ret_type}, 0, 0 ); 686 } 687 688 # If we want the UTF-8 invariants, get those. 689 my $low; 690 if ($opt{type} !~ /non_low|high/ 691 && ($low= $self->make_trie( 'low', 1))) 692 { 693 $low= $self->_optree( $low, 'depth', $opt{ret_type}, 0, 0 ); 694 695 # Expand out the UTF-8 invariants as a string so that we 696 # can use them as the conditional 697 $low = $self->_cond_as_str( $low, 0, \%opt); 698 699 # If there are Latin1 variants, add a test for them. 700 if ($latin1) { 701 $else = __cond_join("(! is_utf8 )", $latin1, $else); 702 } 703 elsif ($method eq 'generic_optree') { 704 705 # Otherwise for 'generic' only we know that what 706 # follows must be valid for just UTF-8 strings, 707 $else->{test} = "( is_utf8 && $else->{test} )"; 708 } 709 710 # If the invariants match, we are done; otherwise we have 711 # to go to the 'else' clause. 712 $else = __cond_join($low, 1, $else); 713 } 714 elsif ($latin1) { # Here, didn't want or didn't have invariants, 715 # but we do have latin variants 716 $else = __cond_join("(! is_utf8)", $latin1, $else); 717 } 718 719 # We need at least one byte available to start off the tests 720 $else = __cond_join("LIKELY((e) > (s))", $else, 0); 721 } 722 else { # Here, we don't want or there aren't any variants. A single 723 # byte available is enough. 724 my $cond= "((e) > (s))"; 725 my $optree = $self->$method(%opt); 726 $else= __cond_join( $cond, $optree, $else ); 727 } 728 } 729 730 return $else; 731} 732 733sub calculate_mask(@) { 734 # Look at the input list of byte values. This routine returns an array of 735 # mask/base pairs to generate that list. 736 737 my @list = @_; 738 my $list_count = @list; 739 740 # Consider a set of byte values, A, B, C .... If we want to determine if 741 # <c> is one of them, we can write c==A || c==B || c==C .... If the 742 # values are consecutive, we can shorten that to inRANGE(c, 'A', 'Z'), 743 # which uses far fewer branches. If only some of them are consecutive we 744 # can still save some branches by creating range tests for just those that 745 # are consecutive. _cond_as_str() does this work for looking for ranges. 746 # 747 # Another approach is to look at the bit patterns for A, B, C .... and see 748 # if they have some commonalities. That's what this function does. For 749 # example, consider a set consisting of the bytes 750 # 0x42, 0x43, 0x62, and 0x63. We could write: 751 # inRANGE(c, 0x42, 0x43) || inRANGE(c, 0x62, 0x63) 752 # which through the magic of casting has not 4, but 2 tests. But the 753 # following mask/compare also works, and has just one test: 754 # (c & 0xDE) == 0x42 755 # The reason it works is that the set consists of exactly the 4 bit 756 # patterns which have either 0 or 1 in the two bit positions that are 0 in 757 # the mask. They have the same value in each bit position where the mask 758 # is 1. The comparison makes sure that the result matches all bytes which 759 # match those six 1 bits exactly. This can be applied to bytes that 760 # differ in 1 through all 8 bit positions. In order to be a candidate for 761 # this optimization, the number of bytes in the set must be a power of 2. 762 # 763 # It may be that the bytes needing to be matched can't be done with a 764 # single mask. But it may be possible to have two (or more) sets, each 765 # with a separate mask. This function attempts to find some way to save 766 # some branches using the mask technique. If not, it returns an empty 767 # list; if so, it returns a list consisting of 768 # [ [compare1, mask1], [compare2, mask2], ... 769 # [compare_n, undef], [compare_m, undef], ... 770 # ] 771 # The <mask> is undef in the above for those bytes that must be tested 772 # for individually. 773 # 774 # This function does not attempt to find the optimal set. To do so would 775 # probably require testing all possible combinations, and keeping track of 776 # the current best one. 777 # 778 # There are probably much better algorithms, but this is the one I (khw) 779 # came up with. We start with doing a bit-wise compare of every byte in 780 # the set with every other byte. The results are sorted into arrays of 781 # all those that differ by the same bit positions. These are stored in a 782 # hash with the each key being the bits they differ in. Here is the hash 783 # for the 0x53, 0x54, 0x73, 0x74 set: 784 # { 785 # 4 => { 786 # "0,1,2,5" => [ 787 # 83, 788 # 116, 789 # 84, 790 # 115 791 # ] 792 # }, 793 # 3 => { 794 # "0,1,2" => [ 795 # 83, 796 # 84, 797 # 115, 798 # 116 799 # ] 800 # } 801 # 1 => { 802 # 5 => [ 803 # 83, 804 # 115, 805 # 84, 806 # 116 807 # ] 808 # }, 809 # } 810 # 811 # The set consisting of values which differ in the 4 bit positions 0, 1, 812 # 2, and 5 from some other value in the set consists of all 4 values. 813 # Likewise all 4 values differ from some other value in the 3 bit 814 # positions 0, 1, and 2; and all 4 values differ from some other value in 815 # the single bit position 5. The keys at the uppermost level in the above 816 # hash, 1, 3, and 4, give the number of bit positions that each sub-key 817 # below it has. For example, the 4 key could have as its value an array 818 # consisting of "0,1,2,5", "0,1,2,6", and "3,4,6,7", if the inputs were 819 # such. The best optimization will group the most values into a single 820 # mask. The most values will be the ones that differ in the most 821 # positions, the ones with the largest value for the topmost key. These 822 # keys, are thus just for convenience of sorting by that number, and do 823 # not have any bearing on the core of the algorithm. 824 # 825 # We start with an element from largest number of differing bits. The 826 # largest in this case is 4 bits, and there is only one situation in this 827 # set which has 4 differing bits, "0,1,2,5". We look for any subset of 828 # this set which has 16 values that differ in these 4 bits. There aren't 829 # any, because there are only 4 values in the entire set. We then look at 830 # the next possible thing, which is 3 bits differing in positions "0,1,2". 831 # We look for a subset that has 8 values that differ in these 3 bits. 832 # Again there are none. So we go to look for the next possible thing, 833 # which is a subset of 2**1 values that differ only in bit position 5. 83 834 # and 115 do, so we calculate a mask and base for those and remove them 835 # from every set. Since there is only the one set remaining, we remove 836 # them from just this one. We then look to see if there is another set of 837 # 2 values that differ in bit position 5. 84 and 116 do, so we calculate 838 # a mask and base for those and remove them from every set (again only 839 # this set remains in this example). The set is now empty, and there are 840 # no more sets to look at, so we are done. 841 842 if ($list_count == 256) { # All 256 is trivially masked 843 return (0, 0); 844 } 845 846 my %hash; 847 848 # Generate bits-differing lists for each element compared against each 849 # other element 850 for my $i (0 .. $list_count - 2) { 851 for my $j ($i + 1 .. $list_count - 1) { 852 my @bits_that_differ = pop_count($list[$i] ^ $list[$j]); 853 my $differ_count = @bits_that_differ; 854 my $key = join ",", @bits_that_differ; 855 push @{$hash{$differ_count}{$key}}, $list[$i] unless grep { $_ == $list[$i] } @{$hash{$differ_count}{$key}}; 856 push @{$hash{$differ_count}{$key}}, $list[$j]; 857 } 858 } 859 860 print STDERR __LINE__, ": calculate_mask() called: List of values grouped by differing bits: ", Dumper \%hash if DEBUG; 861 862 my @final_results; 863 foreach my $count (reverse sort { $a <=> $b } keys %hash) { 864 my $need = 2 ** $count; # Need 8 values for 3 differing bits, etc 865 foreach my $bits (sort keys $hash{$count}->%*) { 866 867 print STDERR __LINE__, ": For $count bit(s) difference ($bits), need $need; have ", scalar @{$hash{$count}{$bits}}, "\n" if DEBUG; 868 869 # Look only as long as there are at least as many elements in the 870 # subset as are needed 871 while ((my $cur_count = @{$hash{$count}{$bits}}) >= $need) { 872 873 print STDERR __LINE__, ": Looking at bit positions ($bits): ", Dumper $hash{$count}{$bits} if DEBUG; 874 875 # Start with the first element in it 876 my $try_base = $hash{$count}{$bits}[0]; 877 my @subset = $try_base; 878 879 # If it succeeds, we return a mask and a base to compare 880 # against the masked value. That base will be the AND of 881 # every element in the subset. Initialize to the one element 882 # we have so far. 883 my $compare = $try_base; 884 885 # We are trying to find a subset of this that has <need> 886 # elements that differ in the bit positions given by the 887 # string $bits, which is comma separated. 888 my @bits = split ",", $bits; 889 890 TRY: # Look through the remainder of the list for other 891 # elements that differ only by these bit positions. 892 893 for (my $i = 1; $i < $cur_count; $i++) { 894 my $try_this = $hash{$count}{$bits}[$i]; 895 my @positions = pop_count($try_base ^ $try_this); 896 897 print STDERR __LINE__, ": $try_base vs $try_this: is (", join(',', @positions), ") a subset of ($bits)?" if DEBUG;; 898 899 foreach my $pos (@positions) { 900 unless (grep { $pos == $_ } @bits) { 901 print STDERR " No\n" if DEBUG; 902 my $remaining = $cur_count - $i - 1; 903 if ($remaining && @subset + $remaining < $need) { 904 print STDERR __LINE__, ": Can stop trying $try_base, because even if all the remaining $remaining values work, they wouldn't add up to the needed $need when combined with the existing ", scalar @subset, " ones\n" if DEBUG; 905 last TRY; 906 } 907 next TRY; 908 } 909 } 910 911 print STDERR " Yes\n" if DEBUG; 912 push @subset, $try_this; 913 914 # Add this to the mask base, in case it ultimately 915 # succeeds, 916 $compare &= $try_this; 917 } 918 919 print STDERR __LINE__, ": subset (", join(", ", @subset), ") has ", scalar @subset, " elements; needs $need\n" if DEBUG; 920 921 if (@subset < $need) { 922 shift @{$hash{$count}{$bits}}; 923 next; # Try with next value 924 } 925 926 # Create the mask 927 my $mask = 0; 928 foreach my $position (@bits) { 929 $mask |= 1 << $position; 930 } 931 $mask = ~$mask & 0xFF; 932 push @final_results, [$compare, $mask]; 933 934 printf STDERR "%d: Got it: compare=%d=0x%X; mask=%X\n", __LINE__, $compare, $compare, $mask if DEBUG; 935 936 # These values are now spoken for. Remove them from future 937 # consideration 938 foreach my $remove_count (sort keys %hash) { 939 foreach my $bits (sort keys %{$hash{$remove_count}}) { 940 foreach my $to_remove (@subset) { 941 @{$hash{$remove_count}{$bits}} = grep { $_ != $to_remove } @{$hash{$remove_count}{$bits}}; 942 } 943 } 944 } 945 } 946 } 947 } 948 949 # Any values that remain in the list are ones that have to be tested for 950 # individually. 951 my @individuals; 952 foreach my $count (reverse sort { $a <=> $b } keys %hash) { 953 foreach my $bits (sort keys $hash{$count}->%*) { 954 foreach my $remaining (@{$hash{$count}{$bits}}) { 955 956 # If we already know about this value, just ignore it. 957 next if grep { $remaining == $_ } @individuals; 958 959 # Otherwise it needs to be returned as something to match 960 # individually 961 push @final_results, [$remaining, undef]; 962 push @individuals, $remaining; 963 } 964 } 965 } 966 967 # Sort by increasing numeric value 968 @final_results = sort { $a->[0] <=> $b->[0] } @final_results; 969 970 print STDERR __LINE__, ": Final return: ", Dumper \@final_results if DEBUG; 971 972 return @final_results; 973} 974 975# _cond_as_str 976# turn a list of conditions into a text expression 977# - merges ranges of conditions, and joins the result with || 978sub _cond_as_str { 979 my ( $self, $op, $combine, $opts_ref )= @_; 980 my $cond= $op->{vals}; 981 my $test= $op->{test}; 982 my $is_cp_ret = $opts_ref->{ret_type} eq "cp"; 983 return "( $test )" if !defined $cond; 984 985 # rangify the list. 986 my @ranges; 987 my $Update= sub { 988 # We skip this if there are optimizations that 989 # we can apply (below) to the individual ranges 990 if ( ($is_cp_ret || $combine) && @ranges && ref $ranges[-1]) { 991 if ( $ranges[-1][0] == $ranges[-1][1] ) { 992 $ranges[-1]= $ranges[-1][0]; 993 } elsif ( $ranges[-1][0] + 1 == $ranges[-1][1] ) { 994 $ranges[-1]= $ranges[-1][0]; 995 push @ranges, $ranges[-1] + 1; 996 } 997 } 998 }; 999 for my $condition ( @$cond ) { 1000 if ( !@ranges || $condition != $ranges[-1][1] + 1 ) { 1001 $Update->(); 1002 push @ranges, [ $condition, $condition ]; 1003 } else { 1004 $ranges[-1][1]++; 1005 } 1006 } 1007 $Update->(); 1008 1009 return $self->_combine( $test, @ranges ) 1010 if $combine; 1011 1012 if ($is_cp_ret) { 1013 @ranges= map { 1014 ref $_ 1015 ? sprintf( 1016 "isRANGE( $test, $self->{val_fmt}, $self->{val_fmt} )", 1017 @$_ ) 1018 : sprintf( "$self->{val_fmt} == $test", $_ ); 1019 } @ranges; 1020 1021 return "( " . join( " || ", @ranges ) . " )"; 1022 } 1023 1024 # If the input set has certain characteristics, we can optimize tests 1025 # for it. This doesn't apply if returning the code point, as we want 1026 # each element of the set individually. The code above is for this 1027 # simpler case. 1028 1029 return 1 if @$cond == 256; # If all bytes match, is trivially true 1030 1031 my @masks; 1032 if (@ranges > 1) { 1033 1034 # See if the entire set shares optimizable characteristics, and if so, 1035 # return the optimization. There is no need to do this on sets with 1036 # just a single range, as that can be expressed with a single 1037 # conditional. 1038 @masks = calculate_mask(@$cond); 1039 1040 # Stringify the output of calculate_mask() 1041 if (@masks) { 1042 my @return; 1043 foreach my $mask_ref (@masks) { 1044 if (defined $mask_ref->[1]) { 1045 push @return, sprintf "( ( $test & $self->{val_fmt} ) == $self->{val_fmt} )", $mask_ref->[1], $mask_ref->[0]; 1046 } 1047 else { # An undefined mask means to use the value as-is 1048 push @return, sprintf "$test == $self->{val_fmt}", $mask_ref->[0]; 1049 } 1050 } 1051 1052 # The best possible case below for specifying this set of values via 1053 # ranges is 1 branch per range. If our mask method yielded better 1054 # results, there is no sense trying something that is bound to be 1055 # worse. 1056 if (@return < @ranges) { 1057 return "( " . join( " || ", @return ) . " )"; 1058 } 1059 1060 @masks = @return; 1061 } 1062 } 1063 1064 # Here, there was no entire-class optimization that was clearly better 1065 # than doing things by ranges. Look at each range. 1066 my $range_count_extra = 0; 1067 for (my $i = 0; $i < @ranges; $i++) { 1068 if (! ref $ranges[$i]) { # Trivial case: no range 1069 $ranges[$i] = sprintf "$self->{val_fmt} == $test", $ranges[$i]; 1070 } 1071 elsif ($ranges[$i]->[0] == $ranges[$i]->[1]) { 1072 $ranges[$i] = # Trivial case: single element range 1073 sprintf "$self->{val_fmt} == $test", $ranges[$i]->[0]; 1074 } 1075 elsif ($ranges[$i]->[0] == 0) { 1076 # If the range matches all 256 possible bytes, it is trivially 1077 # true. 1078 return 1 if $ranges[0]->[1] == 0xFF; # @ranges must be 1 in 1079 # this case 1080 $ranges[$i] = sprintf "( $test <= $self->{val_fmt} )", 1081 $ranges[$i]->[1]; 1082 } 1083 elsif ($ranges[$i]->[1] == 255) { 1084 1085 # Similarly the max possible is 255, so can omit an upper bound 1086 # test if the calculated max is the max possible one. 1087 $ranges[$i] = sprintf "( $test >= $self->{val_fmt} )", 1088 $ranges[0]->[0]; 1089 } 1090 else { 1091 my $output = ""; 1092 1093 # Well-formed UTF-8 continuation bytes on ascii platforms must be 1094 # in the range 0x80 .. 0xBF. If we know that the input is 1095 # well-formed (indicated by not trying to be 'safe'), we can omit 1096 # tests that verify that the input is within either of these 1097 # bounds. (No legal UTF-8 character can begin with anything in 1098 # this range, so we don't have to worry about this being a 1099 # continuation byte or not.) 1100 if ($opts_ref->{charset} =~ /ascii/i 1101 && (! $opts_ref->{safe} && ! $opts_ref->{no_length_checks}) 1102 && $opts_ref->{type} =~ / ^ (?: utf8 | high ) $ /xi) 1103 { 1104 # If the range is the entire legal range, it matches any legal 1105 # byte, so we can omit both tests. (This should happen only 1106 # if the number of ranges is 1.) 1107 if ($ranges[$i]->[0] == 0x80 && $ranges[$i]->[1] == 0xBF) { 1108 return 1; 1109 } 1110 } 1111 1112 # Here, it isn't the full range of legal continuation bytes. We 1113 # could just assume that there's nothing outside of the legal 1114 # bounds. But inRANGE() allows us to have a single conditional, 1115 # so the only cost of making sure it's a legal UTF-8 continuation 1116 # byte is an extra subtraction instruction, a trivial expense. 1117 $ranges[$i] = sprintf("inRANGE($test, $self->{val_fmt}," 1118 . " $self->{val_fmt} )", 1119 $ranges[$i]->[0], $ranges[$i]->[1]); 1120 } 1121 } 1122 1123 # We have generated the list of bytes in two ways; one trying to use masks 1124 # to cut the number of branches down, and the other to look at individual 1125 # ranges (some of which could be cut down by using a mask for just it). 1126 # We return whichever method uses the fewest branches. 1127 return "( " 1128 . join( " || ", (@masks && @masks < @ranges + $range_count_extra) 1129 ? @masks 1130 : @ranges) 1131 . " )"; 1132} 1133 1134# _combine 1135# recursively turn a list of conditions into a fast break-out condition 1136# used by _cond_as_str() for 'cp' type macros. 1137sub _combine { 1138 my ( $self, $test, @cond )= @_; 1139 return if !@cond; 1140 my $item= shift @cond; 1141 my ( $cstr, $gtv ); 1142 if ( ref $item ) { # @item should be a 2-element array giving range start 1143 # and end 1144 if ($item->[0] == 0) { # UV's are never negative, so skip "0 <= " 1145 # test which could generate a compiler warning 1146 # that test is always true 1147 $cstr= sprintf( "$test <= $self->{val_fmt}", $item->[1] ); 1148 } 1149 else { 1150 $cstr= 1151 sprintf( "inRANGE($test, $self->{val_fmt}, $self->{val_fmt})", 1152 @$item ); 1153 } 1154 $gtv= sprintf "$self->{val_fmt}", $item->[1]; 1155 } else { 1156 $cstr= sprintf( "$self->{val_fmt} == $test", $item ); 1157 $gtv= sprintf "$self->{val_fmt}", $item; 1158 } 1159 if ( @cond ) { 1160 my $combine= $self->_combine( $test, @cond ); 1161 if (@cond >1) { 1162 return "( $cstr || ( $gtv < $test &&\n" 1163 . $combine . " ) )"; 1164 } else { 1165 return "( $cstr || $combine )"; 1166 } 1167 } else { 1168 return $cstr; 1169 } 1170} 1171 1172# _render() 1173# recursively convert an optree to text with reasonably neat formatting 1174sub _render { 1175 my ( $self, $op, $combine, $brace, $opts_ref, $def, $submacros )= @_; 1176 return 0 if ! defined $op; # The set is empty 1177 if ( !ref $op ) { 1178 return $op; 1179 } 1180 my $cond= $self->_cond_as_str( $op, $combine, $opts_ref ); 1181 #no warnings 'recursion'; # This would allow really really inefficient 1182 # code to be generated. See pod 1183 my $yes= $self->_render( $op->{yes}, $combine, 1, $opts_ref, $def, $submacros ); 1184 return $yes if $cond eq '1'; 1185 1186 my $no= $self->_render( $op->{no}, $combine, 0, $opts_ref, $def, $submacros ); 1187 return "( $cond )" if $yes eq '1' and $no eq '0'; 1188 my ( $lb, $rb )= $brace ? ( "( ", " )" ) : ( "", "" ); 1189 return "$lb$cond ? $yes : $no$rb" 1190 if !ref( $op->{yes} ) && !ref( $op->{no} ); 1191 my $ind1= " " x 4; 1192 my $ind= "\n" . ( $ind1 x $op->{depth} ); 1193 1194 if ( ref $op->{yes} ) { 1195 $yes= $ind . $ind1 . $yes; 1196 } else { 1197 $yes= " " . $yes; 1198 } 1199 1200 my $str= "$lb$cond ?$yes$ind: $no$rb"; 1201 if (length $str > 6000) { 1202 push @$submacros, sprintf "#define $def\n( %s )", "_part" . (my $yes_idx= 0+@$submacros), $yes; 1203 push @$submacros, sprintf "#define $def\n( %s )", "_part" . (my $no_idx= 0+@$submacros), $no; 1204 return sprintf "%s%s ? $def : $def%s", $lb, $cond, "_part$yes_idx", "_part$no_idx", $rb; 1205 } 1206 return $str; 1207} 1208 1209# $expr=render($op,$combine) 1210# 1211# convert an optree to text with reasonably neat formatting. If $combine 1212# is true then the condition is created using "fast breakouts" which 1213# produce uglier expressions that are more efficient for common case, 1214# longer lists such as that resulting from type 'cp' output. 1215# Currently only used for type 'cp' macros. 1216sub render { 1217 my ( $self, $op, $combine, $opts_ref, $def_fmt )= @_; 1218 1219 my @submacros; 1220 my $macro= sprintf "#define $def_fmt\n( %s )", "", $self->_render( $op, $combine, 0, $opts_ref, $def_fmt, \@submacros ); 1221 1222 return join "\n\n", map { "/*** GENERATED CODE ***/\n" . __macro( __clean( $_ ) ) } @submacros, $macro; 1223} 1224 1225# make_macro 1226# make a macro of a given type. 1227# calls into make_trie and (generic_|length_)optree as needed 1228# Opts are: 1229# type : 'cp','cp_high', 'generic','high','low','latin1','utf8','LATIN1','UTF8' 1230# ret_type : 'cp' or 'len' 1231# safe : don't assume is well-formed UTF-8, so don't skip any range 1232# checks, and add length guards to macro 1233# no_length_checks : like safe, but don't add length guards. 1234# 1235# type defaults to 'generic', and ret_type to 'len' unless type is 'cp' 1236# in which case it defaults to 'cp' as well. 1237# 1238# It is illegal to do a type 'cp' macro on a pattern with multi-codepoint 1239# sequences in it, as the generated macro will accept only a single codepoint 1240# as an argument. 1241# 1242# It is also illegal to do a non-safe macro on a pattern with multi-codepoint 1243# sequences in it, as even if it is known to be well-formed, we need to not 1244# run off the end of the buffer when, say, the buffer ends with the first two 1245# characters, but three are looked at by the macro. 1246# 1247# returns the macro. 1248 1249 1250sub make_macro { 1251 my $self= shift; 1252 my %opts= @_; 1253 my $type= $opts{type} || 'generic'; 1254 if ($self->{has_multi}) { 1255 if ($type =~ /^cp/) { 1256 die "Can't do a 'cp' on multi-codepoint character class '$self->{op}'" 1257 } 1258 elsif (! $opts{safe}) { 1259 die "'safe' is required on multi-codepoint character class '$self->{op}'" 1260 } 1261 } 1262 my $ret_type= $opts{ret_type} || ( $opts{type} =~ /^cp/ ? 'cp' : 'len' ); 1263 my $method; 1264 if ( $opts{safe} ) { 1265 $method= 'length_optree'; 1266 } elsif ( $type =~ /generic/ ) { 1267 $method= 'generic_optree'; 1268 } else { 1269 $method= 'optree'; 1270 } 1271 my @args= $type =~ /^cp/ ? 'cp' : 's'; 1272 push @args, "e" if $opts{safe}; 1273 push @args, "is_utf8" if $type =~ /generic/; 1274 push @args, "len" if $ret_type eq 'both'; 1275 my $pfx= $ret_type eq 'both' ? 'what_len_' : 1276 $ret_type eq 'cp' ? 'what_' : 'is_'; 1277 my $ext= $type =~ /generic/ ? '' : '_' . lc( $type ); 1278 $ext .= '_non_low' if $type eq 'generic_non_low'; 1279 $ext .= "_safe" if $opts{safe}; 1280 $ext .= "_no_length_checks" if $opts{no_length_checks}; 1281 my $argstr= join ",", @args; 1282 my $def_fmt="$pfx$self->{op}$ext%s($argstr)"; 1283 my $optree= $self->$method( %opts, type => $type, ret_type => $ret_type ); 1284 return $self->render( $optree, ($type =~ /^cp/) ? 1 : 0, \%opts, $def_fmt ); 1285} 1286 1287# if we aren't being used as a module (highly likely) then process 1288# the __DATA__ below and produce macros in regcharclass.h 1289# if an argument is provided to the script then it is assumed to 1290# be the path of the file to output to, if the arg is '-' outputs 1291# to STDOUT. 1292if ( !caller ) { 1293 $|++; 1294 my $path= shift @ARGV || "regcharclass.h"; 1295 my $out_fh; 1296 if ( $path eq '-' ) { 1297 $out_fh= \*STDOUT; 1298 } else { 1299 $out_fh = open_new( $path ); 1300 } 1301 print $out_fh read_only_top( lang => 'C', by => $0, 1302 file => 'regcharclass.h', style => '*', 1303 copyright => [2007, 2011], 1304 final => <<EOF, 1305WARNING: These macros are for internal Perl core use only, and may be 1306changed or removed without notice. 1307EOF 1308 ); 1309 print $out_fh "\n#ifndef PERL_REGCHARCLASS_H_ /* Guard against nested #includes */\n#define PERL_REGCHARCLASS_H_\n"; 1310 1311 my ( $op, $title, @txt, @types, %mods ); 1312 my $doit= sub ($) { 1313 return unless $op; 1314 1315 my $charset = shift; 1316 1317 # Skip if to compile on a different platform. 1318 return if delete $mods{only_ascii_platform} && $charset !~ /ascii/i; 1319 return if delete $mods{only_ebcdic_platform} && $charset !~ /ebcdic/i; 1320 1321 print $out_fh "/*\n\t$op: $title\n\n"; 1322 print $out_fh join "\n", ( map { "\t$_" } @txt ), "*/", ""; 1323 my $obj= __PACKAGE__->new( op => $op, title => $title, txt => \@txt, charset => $charset); 1324 1325 #die Dumper(\@types,\%mods); 1326 1327 my @mods; 1328 push @mods, 'safe' if delete $mods{safe}; 1329 push @mods, 'no_length_checks' if delete $mods{no_length_checks}; 1330 unshift @mods, 'fast' if delete $mods{fast} || ! @mods; # Default to 'fast' 1331 # do this one 1332 # first, as 1333 # traditional 1334 if (%mods) { 1335 die "Unknown modifiers: ", join ", ", map { "'$_'" } sort keys %mods; 1336 } 1337 1338 foreach my $type_spec ( @types ) { 1339 my ( $type, $ret )= split /-/, $type_spec; 1340 $ret ||= 'len'; 1341 foreach my $mod ( @mods ) { 1342 1343 # 'safe' is irrelevant with code point macros, so skip if 1344 # there is also a 'fast', but don't skip if this is the only 1345 # way a cp macro will get generated. Below we convert 'safe' 1346 # to 'fast' in this instance 1347 next if $type =~ /^cp/ 1348 && ($mod eq 'safe' || $mod eq 'no_length_checks') 1349 && grep { 'fast' =~ $_ } @mods; 1350 delete $mods{$mod}; 1351 my $macro= $obj->make_macro( 1352 type => $type, 1353 ret_type => $ret, 1354 safe => $mod eq 'safe' && $type !~ /^cp/, 1355 charset => $charset, 1356 no_length_checks => $mod eq 'no_length_checks' && $type !~ /^cp/, 1357 ); 1358 print $out_fh $macro, "\n"; 1359 } 1360 } 1361 }; 1362 1363 my @data = <DATA>; 1364 foreach my $charset (get_supported_code_pages()) { 1365 my $first_time = 1; 1366 undef $op; 1367 undef $title; 1368 undef @txt; 1369 undef @types; 1370 undef %mods; 1371 print $out_fh "\n", get_conditional_compile_line_start($charset); 1372 my @data_copy = @data; 1373 for (@data_copy) { 1374 s/^ \s* (?: \# .* ) ? $ //x; # squeeze out comment and blanks 1375 next unless /\S/; 1376 chomp; 1377 if ( /^[A-Z]/ ) { 1378 $doit->($charset) unless $first_time; # This starts a new 1379 # definition; do the 1380 # previous one 1381 $first_time = 0; 1382 ( $op, $title )= split /\s*:\s*/, $_, 2; 1383 @txt= (); 1384 } elsif ( s/^=>// ) { 1385 my ( $type, $modifier )= split /:/, $_; 1386 @types= split ' ', $type; 1387 undef %mods; 1388 map { $mods{$_} = 1 } split ' ', $modifier; 1389 } else { 1390 push @txt, "$_"; 1391 } 1392 } 1393 $doit->($charset); 1394 print $out_fh get_conditional_compile_line_end(); 1395 } 1396 1397 print $out_fh "\n#endif /* PERL_REGCHARCLASS_H_ */\n"; 1398 1399 if($path eq '-') { 1400 print $out_fh "/* ex: set ro: */\n"; 1401 } else { 1402 # Some of the sources for these macros come from Unicode tables 1403 my $sources_list = "lib/unicore/mktables.lst"; 1404 my @sources = ($0, qw(lib/unicore/mktables 1405 lib/Unicode/UCD.pm 1406 regen/regcharclass_multi_char_folds.pl 1407 regen/charset_translations.pl 1408 )); 1409 { 1410 # Depend on mktables’ own sources. It’s a shorter list of files than 1411 # those that Unicode::UCD uses. 1412 if (! open my $mktables_list, '<', $sources_list) { 1413 1414 # This should force a rebuild once $sources_list exists 1415 push @sources, $sources_list; 1416 } 1417 else { 1418 while(<$mktables_list>) { 1419 last if /===/; 1420 chomp; 1421 push @sources, "lib/unicore/$_" if /^[^#]/; 1422 } 1423 } 1424 } 1425 read_only_bottom_close_and_rename($out_fh, \@sources) 1426 } 1427} 1428 1429# The form of the input is a series of definitions to make macros for. 1430# The first line gives the base name of the macro, followed by a colon, and 1431# then text to be used in comments associated with the macro that are its 1432# title or description. In all cases the first (perhaps only) parameter to 1433# the macro is a pointer to the first byte of the code point it is to test to 1434# see if it is in the class determined by the macro. In the case of non-UTF8, 1435# the code point consists only of a single byte. 1436# 1437# The second line must begin with a '=>' and be followed by the types of 1438# macro(s) to be generated; these are specified below. A colon follows the 1439# types, followed by the modifiers, also specified below. At least one 1440# modifier is required. 1441# 1442# The subsequent lines give what code points go into the class defined by the 1443# macro. Multiple characters may be specified via a string like "\x0D\x0A", 1444# enclosed in quotes. Otherwise the lines consist of one of: 1445# 1) a single Unicode code point, prefaced by 0x 1446# 2) a single range of Unicode code points separated by a minus (and 1447# optional space) 1448# 3) a single Unicode property specified in the standard Perl form 1449# "\p{...}" 1450# 4) a line like 'do path'. This will do a 'do' on the file given by 1451# 'path'. It is assumed that this does nothing but load subroutines 1452# (See item 5 below). The reason 'require path' is not used instead is 1453# because 'do' doesn't assume that path is in @INC. 1454# 5) a subroutine call 1455# &pkg::foo(arg1, ...) 1456# where pkg::foo was loaded by a 'do' line (item 4). The subroutine 1457# returns an array of entries of forms like items 1-3 above. This 1458# allows more complex inputs than achievable from the other input types. 1459# 1460# A blank line or one whose first non-blank character is '#' is a comment. 1461# The definition of the macro is terminated by a line unlike those described. 1462# 1463# Valid types: 1464# low generate a macro whose name is 'is_BASE_low' and defines a 1465# class that includes only ASCII-range chars. (BASE is the 1466# input macro base name.) 1467# latin1 generate a macro whose name is 'is_BASE_latin1' and defines a 1468# class that includes only upper-Latin1-range chars. It is not 1469# designed to take a UTF-8 input parameter. 1470# high generate a macro whose name is 'is_BASE_high' and defines a 1471# class that includes all relevant code points that are above 1472# the Latin1 range. This is for very specialized uses only. 1473# It is designed to take only an input UTF-8 parameter. 1474# utf8 generate a macro whose name is 'is_BASE_utf8' and defines a 1475# class that includes all relevant characters that aren't ASCII. 1476# It is designed to take only an input UTF-8 parameter. 1477# LATIN1 generate a macro whose name is 'is_BASE_latin1' and defines a 1478# class that includes both ASCII and upper-Latin1-range chars. 1479# It is not designed to take a UTF-8 input parameter. 1480# UTF8 generate a macro whose name is 'is_BASE_utf8' and defines a 1481# class that can include any code point, adding the 'low' ones 1482# to what 'utf8' works on. It is designed to take only an input 1483# UTF-8 parameter. 1484# generic generate a macro whose name is 'is_BASE". It has a 2nd, 1485# boolean, parameter which indicates if the first one points to 1486# a UTF-8 string or not. Thus it works in all circumstances. 1487# generic_non_low generate a macro whose name is 'is_BASE_non_low". It has 1488# a 2nd, boolean, parameter which indicates if the first one 1489# points to a UTF-8 string or not. It excludes any ASCII-range 1490# matches, but otherwise it works in all circumstances. 1491# cp generate a macro whose name is 'is_BASE_cp' and defines a 1492# class that returns true if the UV parameter is a member of the 1493# class; false if not. 1494# cp_high like cp, but it is assumed that it is known that the UV 1495# parameter is above Latin1. The name of the generated macro is 1496# 'is_BASE_cp_high'. This is different from high-cp, derived 1497# below. 1498# A macro of the given type is generated for each type listed in the input. 1499# The default return value is the number of octets read to generate the match. 1500# Append "-cp" to the type to have it instead return the matched codepoint. 1501# The macro name is changed to 'what_BASE...'. See pod for 1502# caveats 1503# Appending '-both" instead adds an extra parameter to the end of the argument 1504# list, which is a pointer as to where to store the number of 1505# bytes matched, while also returning the code point. The macro 1506# name is changed to 'what_len_BASE...'. See pod for caveats 1507# 1508# Valid modifiers: 1509# safe The input string is not necessarily valid UTF-8. In 1510# particular an extra parameter (always the 2nd) to the macro is 1511# required, which points to one beyond the end of the string. 1512# The macro will make sure not to read off the end of the 1513# string. In the case of non-UTF8, it makes sure that the 1514# string has at least one byte in it. The macro name has 1515# '_safe' appended to it. 1516# no_length_checks The input string is not necessarily valid UTF-8, but it 1517# is to be assumed that the length has already been checked and 1518# found to be valid 1519# fast The input string is valid UTF-8. No bounds checking is done, 1520# and the macro can make assumptions that lead to faster 1521# execution. 1522# only_ascii_platform Skip this definition if the character set is for 1523# a non-ASCII platform. 1524# only_ebcdic_platform Skip this definition if the character set is for 1525# a non-EBCDIC platform. 1526# No modifier need be specified; fast is assumed for this case. If both 1527# 'fast', and 'safe' are specified, two macros will be created for each 1528# 'type'. 1529# 1530# If run on a non-ASCII platform will automatically convert the Unicode input 1531# to native. The documentation above is slightly wrong in this case. 'low' 1532# actually refers to code points whose UTF-8 representation is the same as the 1533# non-UTF-8 version (invariants); and 'latin1' refers to all the rest of the 1534# code points less than 256. 1535 15361; # in the unlikely case we are being used as a module 1537 1538__DATA__ 1539# This is no longer used, but retained in case it is needed some day. 1540# TRICKYFOLD: Problematic fold case letters. When adding to this list, also should add them to regcomp.c and fold_grind.t 1541# => generic cp generic-cp generic-both :fast safe 1542# 0x00DF # LATIN SMALL LETTER SHARP S 1543# 0x0390 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS 1544# 0x03B0 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS 1545# 0x1E9E # LATIN CAPITAL LETTER SHARP S, because maps to same as 00DF 1546# 0x1FD3 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA; maps same as 0390 1547# 0x1FE3 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA; maps same as 03B0 1548 1549LNBREAK: Line Break: \R 1550=> generic UTF8 LATIN1 : safe 1551"\x0D\x0A" # CRLF - Network (Windows) line ending 1552\p{VertSpace} 1553 1554HORIZWS: Horizontal Whitespace: \h \H 1555=> high cp_high : fast 1556\p{HorizSpace} 1557 1558VERTWS: Vertical Whitespace: \v \V 1559=> high cp_high : fast 1560\p{VertSpace} 1561 1562XDIGIT: Hexadecimal digits 1563=> high cp_high : fast 1564\p{XDigit} 1565 1566XPERLSPACE: \p{XPerlSpace} 1567=> high cp_high : fast 1568\p{XPerlSpace} 1569 1570NONCHAR: Non character code points 1571=> UTF8 :safe 1572\p{_Perl_Nchar} 1573 1574SURROGATE: Surrogate code points 1575=> UTF8 :safe 1576\p{_Perl_Surrogate} 1577 1578QUOTEMETA: Meta-characters that \Q should quote 1579=> high :fast 1580\p{_Perl_Quotemeta} 1581 1582MULTI_CHAR_FOLD: multi-char strings that are folded to by a single character 1583=> UTF8 :safe 1584 1585# 1 => All folds 1586®charclass_multi_char_folds::multi_char_folds(1) 1587 1588MULTI_CHAR_FOLD: multi-char strings that are folded to by a single character 1589=> LATIN1 : safe 1590 1591®charclass_multi_char_folds::multi_char_folds(0) 1592# 0 => Latin1-only 1593 1594FOLDS_TO_MULTI: characters that fold to multi-char strings 1595=> UTF8 :fast 1596\p{_Perl_Folds_To_Multi_Char} 1597 1598PROBLEMATIC_LOCALE_FOLD : characters whose fold is problematic under locale 1599=> UTF8 cp :fast 1600\p{_Perl_Problematic_Locale_Folds} 1601 1602PROBLEMATIC_LOCALE_FOLDEDS_START : The first folded character of folds which are problematic under locale 1603=> UTF8 cp :fast 1604\p{_Perl_Problematic_Locale_Foldeds_Start} 1605 1606PATWS: pattern white space 1607=> generic cp : safe 1608\p{_Perl_PatWS} 1609 1610HANGUL_ED: Hangul syllables whose first character is \xED 1611=> UTF8 :only_ascii_platform safe 16120xD000 - 0xD7FF 1613