=head1 NAME

IO::Compress::FAQ -- Frequently Asked Questions about IO::Compress

=head1 DESCRIPTION

Common questions answered.

=head1 GENERAL

=head2 Compatibility with Unix compress/uncompress.

Although C<Compress::Zlib> has a pair of functions called C<compress> and
C<uncompress>, they are I<not> related to the Unix programs of the same
name. The C<Compress::Zlib> module is not compatible with Unix
C<compress>.

If you have the C<uncompress> program available, you can use this to read
compressed files

    open F, "uncompress -c $filename |"
        or die "Cannot run uncompress: $!\n";
    while (<F>)
    {
        ...

Alternatively, if you have the C<gunzip> program available, you can use
this to read compressed files

    open F, "gunzip -c $filename |"
        or die "Cannot run gunzip: $!\n";
    while (<F>)
    {
        ...

and this to write compressed files, if you have the C<compress> program
available

    # compress reads standard input and writes the result to $filename
    open F, "| compress -c >$filename"
        or die "Cannot run compress: $!\n";
    print F "data";
    ...
    close F ;

=head2 Accessing .tar.Z files

The C<Archive::Tar> module can optionally use C<Compress::Zlib> (via the
C<IO::Zlib> module) to access tar files that have been compressed with
C<gzip>. Unfortunately tar files compressed with the Unix C<compress>
utility cannot be read by C<Compress::Zlib> and so cannot be directly
accessed by C<Archive::Tar>.

If the C<uncompress> or C<gunzip> programs are available, you can use one
of these workarounds to read C<.tar.Z> files from C<Archive::Tar>

Firstly with C<uncompress>

    use strict;
    use warnings;
    use Archive::Tar;

    open F, "uncompress -c $filename |"
        or die "Cannot run uncompress: $!\n";
    my $tar = Archive::Tar->new(*F);
    ...

and this with C<gunzip>

    use strict;
    use warnings;
    use Archive::Tar;

    open F, "gunzip -c $filename |"
        or die "Cannot run gunzip: $!\n";
    my $tar = Archive::Tar->new(*F);
    ...
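
The two workarounds above interpolate C<$filename> into a shell command
line, so a name containing spaces or shell metacharacters can break them.
As a minimal sketch (not taken from the C<Archive::Tar> or C<IO::Compress>
documentation), the list form of a piped C<open> starts C<gunzip> without
going through the shell. The helper name C<open_pipe_from> is hypothetical,
and the sketch assumes that a C<gunzip> program is on the C<PATH> and that
your Perl supports list-form pipe opens on your platform.

    use strict;
    use warnings;
    use Archive::Tar;

    # Hypothetical helper, not part of Archive::Tar or IO::Compress.
    # Runs an external command and returns a filehandle that reads its
    # standard output. The list form of open does not invoke a shell.
    sub open_pipe_from
    {
        my @command = @_;

        open my $fh, '-|', @command
            or die "Cannot run '$command[0]': $!\n";
        binmode $fh;

        return $fh;
    }

    # Usage: perl script.pl archive.tar.Z
    if (@ARGV)
    {
        my $filename = $ARGV[0];

        my $fh  = open_pipe_from('gunzip', '-c', $filename);
        my $tar = Archive::Tar->new($fh)
            or die "Cannot read '$filename': $Archive::Tar::error\n";

        # Print the name of every member of the archive
        print "$_\n" for $tar->list_files();

        close $fh
            or warn "Problem closing pipe from gunzip (exit status $?)\n";
    }

Because the helper takes the command as a list, the same sketch should work
with C<uncompress> by passing C<'uncompress', '-c', $filename> instead.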

Similarly, if the C<compress> program is available, you can use this to
write a C<.tar.Z> file

    use strict;
    use warnings;
    use Archive::Tar;
    use IO::File;

    my $fh = new IO::File "| compress -c >$filename"
        or die "Cannot run compress: $!\n";
    my $tar = Archive::Tar->new();
    ...
    $tar->write($fh);
    $fh->close ;

=head2 How do I recompress using a different compression?

This is easier than you might expect if you realise that all the
C<IO::Compress::*> objects are derived from C<IO::File> and that all the
C<IO::Uncompress::*> modules can read from an C<IO::File> filehandle.

So, for example, say you have a file compressed with gzip that you want to
recompress with bzip2. Here is all that is needed to carry out the
recompression.

    use IO::Uncompress::Gunzip ':all';
    use IO::Compress::Bzip2 ':all';

    my $gzipFile = "somefile.gz";
    my $bzipFile = "somefile.bz2";

    my $gunzip = new IO::Uncompress::Gunzip $gzipFile
        or die "Cannot gunzip $gzipFile: $GunzipError\n" ;

    bzip2 $gunzip => $bzipFile
        or die "Cannot bzip2 to $bzipFile: $Bzip2Error\n" ;

Note that this technique has a limitation. Some compression file
formats store extra information along with the compressed data payload. For
example, gzip can optionally store the original filename and Zip stores a
lot of information about the original file. If the original compressed file
contains any of this extra information, it will not be transferred to the
new compressed file using the technique above.

=head1 ZIP

=head2 What Compression Types do IO::Compress::Zip & IO::Uncompress::Unzip support?

The following compression formats are supported by C<IO::Compress::Zip> and
C<IO::Uncompress::Unzip>

=over 5

=item * Store (method 0)

No compression at all.

=item * Deflate (method 8)

This is the default compression used when creating a zip file with
C<IO::Compress::Zip>.

=item * Bzip2 (method 12)

Only supported if the C<IO-Compress-Bzip2> module is installed.

=item * Lzma (method 14)

Only supported if the C<IO-Compress-Lzma> module is installed.

=back

=head2 Can I Read/Write Zip files larger than 4 Gig?

Yes, both the C<IO::Compress::Zip> and C<IO::Uncompress::Unzip> modules
support the zip feature called I<Zip64>. That allows them to read/write
files/buffers larger than 4 Gig.

If you are creating a Zip file using the one-shot interface, and any of the
input files is greater than 4 Gig, a zip64 compliant zip file will be
created.

    zip "really-large-file" => "my.zip";

Similarly with the one-shot interface, if the input is a buffer larger than
4 Gig, a zip64 compliant zip file will be created.

    zip \$really_large_buffer => "my.zip";

The one-shot interface allows you to force the creation of a zip64 zip file
by including the C<Zip64> option.

    zip $filehandle => "my.zip", Zip64 => 1;

If you want to create a zip64 zip file with the OO interface you must
specify the C<Zip64> option.

    my $zip = new IO::Compress::Zip "whatever", Zip64 => 1;

C<IO::Uncompress::Unzip> automatically detects whether the zip file it is
reading is zip64.

If you intend to manipulate the Zip64 zip files created with
C<IO::Compress::Zip> using an external zip/unzip, make sure that it supports
Zip64.

In particular, if you are using Info-Zip you need to have zip version 3.x
or better to update a Zip64 archive and unzip version 6.x to read a zip64
archive.

=head2 Can I write more than 64K entries in a Zip file?

Yes. Zip64 allows this. See previous question.

=head2 Zip Resources

The primary reference for zip files is the "appnote" document available at
L<http://www.pkware.com/documents/casestudies/APPNOTE.TXT>

An alternative is the Info-Zip appnote.
This is available from
L<ftp://ftp.info-zip.org/pub/infozip/doc/>

=head1 GZIP

=head2 Gzip Resources

The primary reference for gzip files is RFC 1952
L<http://www.faqs.org/rfcs/rfc1952.html>

The primary site for gzip is L<http://www.gzip.org>.

=head2 Dealing with concatenated gzip files

If the gunzip program encounters a file containing multiple gzip files
concatenated together it will automatically uncompress them all.
The example below illustrates this behaviour

    $ echo abc | gzip -c >x.gz
    $ echo def | gzip -c >>x.gz
    $ gunzip -c x.gz
    abc
    def

By default C<IO::Uncompress::Gunzip> will I<not> behave like the gunzip
program. It will only uncompress the first gzip data stream in the file, as
shown below

    $ perl -MIO::Uncompress::Gunzip=:all -e 'gunzip "x.gz" => \*STDOUT'
    abc

To force C<IO::Uncompress::Gunzip> to uncompress all the gzip data streams,
include the C<MultiStream> option, as shown below

    $ perl -MIO::Uncompress::Gunzip=:all -e 'gunzip "x.gz" => \*STDOUT, MultiStream => 1'
    abc
    def

=head2 Reading bgzip files with IO::Uncompress::Gunzip

A C<bgzip> file consists of a series of valid gzip-compliant data streams
concatenated together. To read a file created by C<bgzip> with
C<IO::Uncompress::Gunzip> use the C<MultiStream> option as shown in the
previous section.

See the section titled "The BGZF compression format" in
L<http://samtools.github.io/hts-specs/SAMv1.pdf> for a definition of
C<bgzip>.

=head1 ZLIB

=head2 Zlib Resources

The primary site for the I<zlib> compression library is
L<http://www.zlib.org>.

=head1 Bzip2

=head2 Bzip2 Resources

The primary site for bzip2 is L<http://www.bzip.org>.

=head2 Dealing with Concatenated bzip2 files

If the bunzip2 program encounters a file containing multiple bzip2 files
concatenated together it will automatically uncompress them all.
The example below illustrates this behaviour

    $ echo abc | bzip2 -c >x.bz2
    $ echo def | bzip2 -c >>x.bz2
    $ bunzip2 -c x.bz2
    abc
    def

By default C<IO::Uncompress::Bunzip2> will I<not> behave like the bunzip2
program. It will only uncompress the first bzip2 data stream in the file, as
shown below

    $ perl -MIO::Uncompress::Bunzip2=:all -e 'bunzip2 "x.bz2" => \*STDOUT'
    abc

To force C<IO::Uncompress::Bunzip2> to uncompress all the bzip2 data streams,
include the C<MultiStream> option, as shown below

    $ perl -MIO::Uncompress::Bunzip2=:all -e 'bunzip2 "x.bz2" => \*STDOUT, MultiStream => 1'
    abc
    def

=head2 Interoperating with Pbzip2

Pbzip2 (L<http://compression.ca/pbzip2/>) is a parallel implementation of
bzip2. The output from pbzip2 consists of a series of concatenated bzip2
data streams.

By default C<IO::Uncompress::Bunzip2> will only uncompress the first bzip2
data stream in a pbzip2 file. To uncompress the complete pbzip2 file you
must include the C<MultiStream> option, like this.

    bunzip2 $input => \$output, MultiStream => 1
        or die "bunzip2 failed: $Bunzip2Error\n";

=head1 HTTP & NETWORK

=head2 Apache::GZip Revisited

Below is a mod_perl Apache compression module, called C<Apache::GZip>,
taken from
L<http://perl.apache.org/docs/tutorials/tips/mod_perl_tricks/mod_perl_tricks.html#On_the_Fly_Compression>

    package Apache::GZip;
    #File: Apache::GZip.pm

    use strict vars;
    use Apache::Constants ':common';
    use Compress::Zlib;
    use IO::File;
    use constant GZIP_MAGIC => 0x1f8b;
    use constant OS_MAGIC => 0x03;

    sub handler {
        my $r = shift;
        my ($fh,$gz);
        my $file = $r->filename;
        return DECLINED unless $fh=IO::File->new($file);
        $r->header_out('Content-Encoding'=>'gzip');
        $r->send_http_header;
        return OK if $r->header_only;

        tie *STDOUT,'Apache::GZip',$r;
        print($_) while <$fh>;
        untie *STDOUT;
        return OK;
    }

    sub TIEHANDLE {
        my($class,$r) = @_;
        # initialize a deflation stream
        my $d = deflateInit(-WindowBits=>-MAX_WBITS()) || return undef;

        # gzip header -- don't ask how I found out
        $r->print(pack("nccVcc",GZIP_MAGIC,Z_DEFLATED,0,time(),0,OS_MAGIC));

        return bless { r   => $r,
                       crc => crc32(undef),
                       d   => $d,
                       l   => 0
                     },$class;
    }

    sub PRINT {
        my $self = shift;
        foreach (@_) {
            # deflate the data
            my $data = $self->{d}->deflate($_);
            $self->{r}->print($data);
            # keep track of its length and crc
            $self->{l} += length($_);
            $self->{crc} = crc32($_,$self->{crc});
        }
    }

    sub DESTROY {
        my $self = shift;

        # flush the output buffers
        my $data = $self->{d}->flush;
        $self->{r}->print($data);

        # print the CRC and the total length (uncompressed)
        $self->{r}->print(pack("LL",@{$self}{qw/crc l/}));
    }

    1;

Here's the Apache configuration entry you'll need to make use of it.
Once set, everything in the /compressed directory will be compressed
automagically.

    <Location /compressed>
        SetHandler  perl-script
        PerlHandler Apache::GZip
    </Location>

Although at first sight there seems to be quite a lot going on in
C<Apache::GZip>, you could sum up what the code is doing as follows --
read the contents of the file in C<< $r->filename >>, compress it and write
the compressed data to standard output. That's all.

This code has to jump through a few hoops to achieve this because

=over

=item 1.

The gzip support in C<Compress::Zlib> version 1.x can only work with a real
filesystem filehandle. The filehandles used by Apache modules are not
associated with the filesystem.

=item 2.

That means all the gzip support has to be done by hand - in this case by
creating a tied filehandle to deal with creating the gzip header and
trailer.

=back

C<IO::Compress::Gzip> doesn't have that filehandle limitation (this was one
of the reasons for writing it in the first place). So if
C<IO::Compress::Gzip> is used instead of C<Compress::Zlib> the whole tied
filehandle code can be removed. Here is the rewritten code.

    package Apache::GZip;

    use strict vars;
    use Apache::Constants ':common';
    use IO::Compress::Gzip;
    use IO::File;

    sub handler {
        my $r = shift;
        my $fh;
        my $file = $r->filename;
        return DECLINED unless $fh=IO::File->new($file);
        $r->header_out('Content-Encoding'=>'gzip');
        $r->send_http_header;
        return OK if $r->header_only;

        # '-' makes IO::Compress::Gzip write to standard output
        my $gz = new IO::Compress::Gzip '-', Minimal => 1
            or return DECLINED ;

        print $gz $_ while <$fh>;

        return OK;
    }

or even more succinctly, like this, using a one-shot gzip

    package Apache::GZip;

    use strict vars;
    use Apache::Constants ':common';
    use IO::Compress::Gzip qw(gzip);

    sub handler {
        my $r = shift;
        $r->header_out('Content-Encoding'=>'gzip');
        $r->send_http_header;
        return OK if $r->header_only;

        gzip $r->filename => '-', Minimal => 1
            or return DECLINED ;

        return OK;
    }

    1;

The use of one-shot C<gzip> above just reads from C<< $r->filename >> and
writes the compressed data to standard output.

Note the use of the C<Minimal> option in the code above. When using gzip
for Content-Encoding you should I<always> use this option. In the example
above it will prevent the filename being included in the gzip header and
make the gzip data stream slightly smaller.

=head2 Compressed files and Net::FTP

The C<Net::FTP> module provides two low-level methods called C<stor> and
C<retr> that both return filehandles. These filehandles can be used with the
C<IO::Compress/Uncompress> modules to compress or uncompress files read
from or written to an FTP Server on the fly, without having to create a
temporary file.

Firstly, here is code that uses C<retr> to uncompress a file as it is
read from the FTP Server.

    use Net::FTP;
    use IO::Uncompress::Gunzip qw(:all);

    my $ftp = new Net::FTP ...

    my $retr_fh = $ftp->retr($compressed_filename);
    gunzip $retr_fh => $outFilename, AutoClose => 1
        or die "Cannot uncompress '$compressed_filename': $GunzipError\n";

and this to compress a file as it is written to the FTP Server

    use Net::FTP;
    use IO::Compress::Gzip qw(:all);

    my $stor_fh = $ftp->stor($filename);
    gzip $filename => $stor_fh, AutoClose => 1
        or die "Cannot compress '$filename': $GzipError\n";

=head1 MISC

=head2 Using C<InputLength> to uncompress data embedded in a larger file/buffer.

A fairly common use-case is where compressed data is embedded in a larger
file/buffer and you want to read both.

As an example consider the structure of a zip file. This is a well-defined
file format that mixes both compressed and uncompressed sections of data in
a single file.

For the purposes of this discussion you can think of a zip file as a sequence
of compressed data streams, each of which is prefixed by an uncompressed
local header. The local header contains information about the compressed
data stream, including the name of the compressed file and, in particular,
the length of the compressed data stream.

To illustrate how to use C<InputLength> here is a script that walks a zip
file and prints out how many lines are in each compressed file (if you
intend to write code to walk through a zip file for real see
L<IO::Uncompress::Unzip/"Walking through a zip file"> ). Also, although
this example uses the zlib-based compression, the technique can be used by
the other C<IO::Uncompress::*> modules.

    use strict;
    use warnings;

    use IO::File;
    use IO::Uncompress::RawInflate qw(:all);

    use constant ZIP_LOCAL_HDR_SIG    => 0x04034b50;
    use constant ZIP_LOCAL_HDR_LENGTH => 30;

    my $file = $ARGV[0] ;

    my $fh = new IO::File "<$file"
        or die "Cannot open '$file': $!\n";

    while (1)
    {
        my $buffer;

        $fh->read($buffer, ZIP_LOCAL_HDR_LENGTH) == ZIP_LOCAL_HDR_LENGTH
            or die "Truncated file: $!\n";

        my $signature = unpack ("V", substr($buffer, 0, 4));

        # Stop at the first record that is not a local header
        last unless $signature == ZIP_LOCAL_HDR_SIG;

        # Read Local Header
        my $gpFlag             = unpack ("v", substr($buffer, 6, 2));
        my $compressedMethod   = unpack ("v", substr($buffer, 8, 2));
        my $compressedLength   = unpack ("V", substr($buffer, 18, 4));
        my $uncompressedLength = unpack ("V", substr($buffer, 22, 4));
        my $filename_length    = unpack ("v", substr($buffer, 26, 2));
        my $extra_length       = unpack ("v", substr($buffer, 28, 2));

        my $filename ;
        $fh->read($filename, $filename_length) == $filename_length
            or die "Truncated file\n";

        $fh->read($buffer, $extra_length) == $extra_length
            or die "Truncated file\n";

        if ($compressedMethod != 8 && $compressedMethod != 0)
        {
            warn "Skipping file '$filename' - not deflated $compressedMethod\n";
            $fh->read($buffer, $compressedLength) == $compressedLength
                or die "Truncated file\n";
            next;
        }

        # Parentheses are needed because == binds tighter than &
        if ($compressedMethod == 0 && ($gpFlag & 8) == 8)
        {
            die "Streamed Stored not supported for '$filename'\n";
        }

        next if $compressedLength == 0;

        # Done reading the Local Header

        my $inf = new IO::Uncompress::RawInflate $fh,
                            Transparent => 1,
                            InputLength => $compressedLength
          or die "Cannot uncompress $file [$filename]: $RawInflateError\n" ;

        my $line_count = 0;

        while (<$inf>)
        {
            ++ $line_count;
        }

        print "$filename: $line_count\n";
    }

The majority of the code above is concerned with reading the zip local
header data. The code that I want to focus on is at the bottom.

    while (1) {

        # read local zip header data
        # get $filename
        # get $compressedLength

        my $inf = new IO::Uncompress::RawInflate $fh,
                            Transparent => 1,
                            InputLength => $compressedLength
          or die "Cannot uncompress $file [$filename]: $RawInflateError\n" ;

        my $line_count = 0;

        while (<$inf>)
        {
            ++ $line_count;
        }

        print "$filename: $line_count\n";
    }

The call to C<IO::Uncompress::RawInflate> creates a new filehandle C<$inf>
that can be used to read from the parent filehandle C<$fh>, uncompressing
it as it goes. The use of the C<InputLength> option will guarantee that
I<at most> C<$compressedLength> bytes of compressed data will be read from
the C<$fh> filehandle (the only exception is for an error case like a
truncated file or a corrupt data stream).

This means that once RawInflate is finished C<$fh> will be left at the
byte directly after the compressed data stream.

Now consider what the code looks like without C<InputLength>

    while (1) {

        # read local zip header data
        # get $filename
        # get $compressedLength

        # read all the compressed data into $data
        read($fh, $data, $compressedLength);

        my $inf = new IO::Uncompress::RawInflate \$data,
                            Transparent => 1
          or die "Cannot uncompress $file [$filename]: $RawInflateError\n" ;

        my $line_count = 0;

        while (<$inf>)
        {
            ++ $line_count;
        }

        print "$filename: $line_count\n";
    }

The difference here is the addition of the temporary variable C<$data>.
This is used to store a copy of the compressed data while it is being
uncompressed.

If you know that C<$compressedLength> isn't that big then using temporary
storage won't be a problem.
But if C<$compressedLength> is very large or
you are writing an application that other people will use, and so have no
idea how big C<$compressedLength> will be, it could be an issue.

Using C<InputLength> avoids the use of temporary storage and means the
application can cope with large compressed data streams.

One final point -- obviously C<InputLength> can only be used when you
know the length of the compressed data beforehand, like here with a zip
file.

=head1 SEE ALSO

L<Compress::Zlib>, L<IO::Compress::Gzip>, L<IO::Uncompress::Gunzip>, L<IO::Compress::Deflate>, L<IO::Uncompress::Inflate>, L<IO::Compress::RawDeflate>, L<IO::Uncompress::RawInflate>, L<IO::Compress::Bzip2>, L<IO::Uncompress::Bunzip2>, L<IO::Compress::Lzma>, L<IO::Uncompress::UnLzma>, L<IO::Compress::Xz>, L<IO::Uncompress::UnXz>, L<IO::Compress::Lzip>, L<IO::Uncompress::UnLzip>, L<IO::Compress::Lzop>, L<IO::Uncompress::UnLzop>, L<IO::Compress::Lzf>, L<IO::Uncompress::UnLzf>, L<IO::Compress::Zstd>, L<IO::Uncompress::UnZstd>, L<IO::Uncompress::AnyInflate>, L<IO::Uncompress::AnyUncompress>

L<IO::Compress::FAQ|IO::Compress::FAQ>

L<File::GlobMapper|File::GlobMapper>, L<Archive::Zip|Archive::Zip>,
L<Archive::Tar|Archive::Tar>,
L<IO::Zlib|IO::Zlib>

=head1 AUTHOR

This module was written by Paul Marquess, C<pmqs@cpan.org>.

=head1 MODIFICATION HISTORY

See the Changes file.

=head1 COPYRIGHT AND LICENSE

Copyright (c) 2005-2019 Paul Marquess. All rights reserved.

This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
