package bytes;

our $VERSION = '1.06';

$bytes::hint_bits = 0x00000008;

sub import {
    $^H |= $bytes::hint_bits;
}

sub unimport {
    $^H &= ~$bytes::hint_bits;
}

sub AUTOLOAD {
    require "bytes_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    require Carp;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

sub length (_);
sub chr (_);
sub ord (_);
sub substr ($$;$$);
sub index ($$;$);
sub rindex ($$;$);

1;
__END__

=head1 NAME

bytes - Perl pragma to expose the individual bytes of characters

=head1 NOTICE

Because the bytes pragma breaks encapsulation (i.e. it exposes the innards of
how the perl executable currently happens to store a string), the byte values
that result are in an unspecified encoding.

B<Use of this module for anything other than debugging purposes is
strongly discouraged.>  If you feel that the functions here within
might be useful for your application, this possibly indicates a
mismatch between your mental model of Perl Unicode and the current
reality. In that case, you may wish to read some of the perl Unicode
documentation: L<perluniintro>, L<perlunitut>, L<perlunifaq> and
L<perlunicode>.

=head1 SYNOPSIS

    use bytes;
    ... chr(...);       # or bytes::chr
    ... index(...);     # or bytes::index
    ... length(...);    # or bytes::length
    ... ord(...);       # or bytes::ord
    ... rindex(...);    # or bytes::rindex
    ... substr(...);    # or bytes::substr
    no bytes;


=head1 DESCRIPTION

Perl's characters are stored internally as sequences of one or more bytes.
This pragma allows for the examination of the individual bytes that together
comprise a character.

Originally the pragma was designed for the loftier goal of helping incorporate
Unicode into Perl, but the approach that used it was found to be defective,
and the one remaining legitimate use is for debugging when you need to
non-destructively examine characters' individual bytes.
Just insert this pragma temporarily, and remove it after the debugging is
finished.

The original usage can be accomplished by explicit (rather than this pragma's
implicit) encoding using the L<Encode> module:

    use Encode qw/encode/;

    my $utf8_byte_string   = encode "UTF8",   $string;
    my $latin1_byte_string = encode "Latin1", $string;

Or, if performance is needed and you are only interested in the UTF-8
representation:

    utf8::encode(my $utf8_byte_string = $string);

C<no bytes> can be used to reverse the effect of C<use bytes> within the
current lexical scope.

As an example, when Perl sees C<$x = chr(400)>, it encodes the character
in UTF-8 and stores it in C<$x>. Then it is marked as character data, so,
for instance, C<length $x> returns C<1>. However, in the scope of the
C<bytes> pragma, C<$x> is treated as a series of bytes - the bytes that make
up the UTF-8 encoding - and C<length $x> returns C<2>:

    $x = chr(400);
    print "Length is ", length $x, "\n";     # "Length is 1"
    printf "Contents are %vd\n", $x;         # "Contents are 400"
    {
        use bytes; # or "require bytes; bytes::length()"
        print "Length is ", length $x, "\n"; # "Length is 2"
        printf "Contents are %vd\n", $x;     # "Contents are 198.144 (on
                                             # ASCII platforms)"
    }

C<chr()>, C<ord()>, C<substr()>, C<index()> and C<rindex()> behave similarly.

For more on the implications, see L<perluniintro> and L<perlunicode>.

C<bytes::length()> is admittedly handy if you need to know the
B<byte length> of a Perl scalar. But a more modern way is:

    use Encode 'encode';
    length(encode('UTF-8', $scalar))

=head1 LIMITATIONS

C<bytes::substr()> does not work as an I<lvalue>.

=head1 SEE ALSO

L<perluniintro>, L<perlunicode>, L<utf8>, L<Encode>

=cut
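
=head1 EXAMPLE

The following is a minimal sketch, not part of the module itself, of why the
byte values seen under this pragma are in an unspecified encoding, and of how
an explicit encoding gives a dependable byte length instead. The variable
names are illustrative only, and the values noted in the comments assume an
ASCII platform.

    use strict;
    use warnings;
    use Encode qw(encode);

    my $x = chr(400);           # one character, U+0190

    # Character length versus the length of an explicit UTF-8 encoding.
    my $char_len = length $x;                       # 1
    my $utf8_len = length(encode('UTF-8', $x));     # 2

    # Byte length of the internal representation (debugging only).
    my $internal_len = do { use bytes; length $x }; # 2

    # chr(233) fits in a single byte, so perl may store it either way.
    my $e = chr(233);
    my $before = do { use bytes; length $e };       # 1
    utf8::upgrade($e);          # same string, different internal storage
    my $after  = do { use bytes; length $e };       # 2

    print "chars=$char_len utf8=$utf8_len internal=$internal_len\n";
    print "chr(233): before=$before after=$after\n";

The string in C<$e> compares equal before and after C<utf8::upgrade()>, yet
the result under C<use bytes> changes. This is the kind of dependence on
internal storage that the explicit C<encode('UTF-8', ...)> form avoids.

=cut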