=head1 NAME

IO::Compress::FAQ -- Frequently Asked Questions about IO::Compress

=head1 DESCRIPTION

Common questions answered.

=head1 GENERAL

=head2 Compatibility with Unix compress/uncompress.

Although C<Compress::Zlib> has a pair of functions called C<compress> and
C<uncompress>, they are I<not> related to the Unix programs of the same
name. The C<Compress::Zlib> module is not compatible with Unix
C<compress>.

If you have the C<uncompress> program available, you can use this to read
compressed files

    open F, "uncompress -c $filename |"
        or die "Cannot run uncompress: $!\n";
    while (<F>)
    {
        ...
    }

Alternatively, if you have the C<gunzip> program available, you can use
this to read compressed files

    open F, "gunzip -c $filename |"
        or die "Cannot run gunzip: $!\n";
    while (<F>)
    {
        ...
    }

and this to write compressed files, if you have the C<compress> program
available

    open F, "| compress -c >$filename"
        or die "Cannot run compress: $!\n";
    print F "data";
    ...
    close F ;

=head2 Accessing .tar.Z files

The C<Archive::Tar> module can optionally use C<Compress::Zlib> (via the
C<IO::Zlib> module) to access tar files that have been compressed with
C<gzip>. Unfortunately tar files compressed with the Unix C<compress>
utility cannot be read by C<Compress::Zlib> and so cannot be directly
accessed by C<Archive::Tar>.

If the C<uncompress> or C<gunzip> programs are available, you can use one
of these workarounds to read C<.tar.Z> files from C<Archive::Tar>

Firstly with C<uncompress>

    use strict;
    use warnings;
    use Archive::Tar;

    open F, "uncompress -c $filename |"
        or die "Cannot run uncompress: $!\n";
    my $tar = Archive::Tar->new(*F);
    ...

and this with C<gunzip>

    use strict;
    use warnings;
    use Archive::Tar;

    open F, "gunzip -c $filename |"
        or die "Cannot run gunzip: $!\n";
    my $tar = Archive::Tar->new(*F);
    ...

Similarly, if the C<compress> program is available, you can use this to
write a C<.tar.Z> file

    use strict;
    use warnings;
    use Archive::Tar;
    use IO::File;

    my $fh = IO::File->new( "| compress -c >$filename" );
    my $tar = Archive::Tar->new();
    ...
    $tar->write($fh);
    $fh->close ;

=head2 How do I recompress using a different compression?

This is easier than you might expect if you realise that all the
C<IO::Compress::*> objects are derived from C<IO::File> and that all the
C<IO::Uncompress::*> modules can read from an C<IO::File> filehandle.

So, for example, say you have a file compressed with gzip that you want to
recompress with bzip2. Here is all that is needed to carry out the
recompression.

    use IO::Uncompress::Gunzip ':all';
    use IO::Compress::Bzip2 ':all';

    my $gzipFile = "somefile.gz";
    my $bzipFile = "somefile.bz2";

    my $gunzip = IO::Uncompress::Gunzip->new( $gzipFile )
        or die "Cannot gunzip $gzipFile: $GunzipError\n" ;

    bzip2 $gunzip => $bzipFile
        or die "Cannot bzip2 to $bzipFile: $Bzip2Error\n" ;

Note that this technique has a limitation. Some compression file
formats store extra information along with the compressed data payload. For
example, gzip can optionally store the original filename and Zip stores a
lot of information about the original file. If the original compressed file
contains any of this extra information, it will not be transferred to the
new compressed file using the technique above.

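The limitation above can be seen in a small experiment. The following is
only a sketch: the scratch directory, the file names and the stored name
C<original.txt> are made up for illustration. It writes a gzip file whose
header records a filename, reads that name with C<getHeaderInfo> before
recompressing, and then checks that the payload (but nothing else) arrives
in the bzip2 file.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use IO::Compress::Gzip      qw(gzip $GzipError);
use IO::Uncompress::Gunzip  qw($GunzipError);
use IO::Compress::Bzip2     qw(bzip2 $Bzip2Error);
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

# Hypothetical file names inside a scratch directory.
my $dir      = tempdir( CLEANUP => 1 );
my $gzipFile = "$dir/somefile.gz";
my $bzipFile = "$dir/somefile.bz2";

# Create a gzip file whose header records an original filename.
my $text = "hello world\n" x 3;
gzip \$text => $gzipFile, Name => 'original.txt'
    or die "Cannot gzip to $gzipFile: $GzipError\n";

my $gunzip = IO::Uncompress::Gunzip->new( $gzipFile )
    or die "Cannot gunzip $gzipFile: $GunzipError\n";

# The gzip header is only available here; bzip2 has nowhere to store it.
my $savedName = $gunzip->getHeaderInfo()->{Name};

bzip2 $gunzip => $bzipFile
    or die "Cannot bzip2 to $bzipFile: $Bzip2Error\n";

# Read the bzip2 file back to confirm the payload survived.
my $roundTrip;
bunzip2 $bzipFile => \$roundTrip
    or die "Cannot bunzip2 $bzipFile: $Bunzip2Error\n";

print "name recorded in gzip header: $savedName\n";
print "payload intact\n" if $roundTrip eq $text;
```

If the metadata matters to your application, you would need to carry
C<$savedName> yourself, for example by using it to choose the output
filename.
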
=head1 ZIP

=head2 What Compression Types do IO::Compress::Zip & IO::Uncompress::Unzip support?

The following compression formats are supported by C<IO::Compress::Zip> and
C<IO::Uncompress::Unzip>

=over 5

=item * Store (method 0)

No compression at all.

=item * Deflate (method 8)

This is the default compression used when creating a zip file with
C<IO::Compress::Zip>.

=item * Bzip2 (method 12)

Only supported if the C<IO-Compress-Bzip2> module is installed.

=item * Lzma (method 14)

Only supported if the C<IO-Compress-Lzma> module is installed.

=back

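As an illustration of choosing between the methods listed above, here is a
minimal sketch that writes the same data twice, once stored and once
deflated, using the C<Method> option and the C<ZIP_CM_*> constants exported
by the C<:zip_method> tag. The member name C<data.txt> and the sample data
are invented for the example.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw(zip $ZipError :zip_method);
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Repetitive sample data, so that Deflate has something to squeeze.
my $data = "some highly repetitive text\n" x 100;

my ($stored, $deflated);

# Method 0: the member is copied into the archive uncompressed.
zip \$data => \$stored, Name => 'data.txt', Method => ZIP_CM_STORE
    or die "zip (store) failed: $ZipError\n";

# Method 8: the default, requested explicitly here for comparison.
zip \$data => \$deflated, Name => 'data.txt', Method => ZIP_CM_DEFLATE
    or die "zip (deflate) failed: $ZipError\n";

printf "stored: %d bytes, deflated: %d bytes\n",
    length($stored), length($deflated);

# Unzip works out the method from the archive; no option is needed.
my $out;
unzip \$stored => \$out
    or die "unzip failed: $UnzipError\n";
print "round trip ok\n" if $out eq $data;
```
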
=head2 Can I Read/Write Zip files larger than 4 Gig?

Yes, both the C<IO::Compress::Zip> and C<IO::Uncompress::Unzip> modules
support the zip feature called I<Zip64>. That allows them to read/write
files/buffers larger than 4 Gig.

If you are creating a Zip file using the one-shot interface, and any of the
input files is greater than 4 Gig, a zip64 compliant zip file will be
created.

    zip "really-large-file" => "my.zip";

Similarly with the one-shot interface, if the input is a buffer larger than
4 Gig, a zip64 compliant zip file will be created.

    zip \$really_large_buffer => "my.zip";

The one-shot interface allows you to force the creation of a zip64 zip file
by including the C<Zip64> option.

    zip $filehandle => "my.zip", Zip64 => 1;

If you want to create a zip64 zip file with the OO interface you must
specify the C<Zip64> option.

    my $zip = IO::Compress::Zip->new( "whatever", Zip64 => 1 );

When uncompressing with C<IO::Uncompress::Unzip>, it will automatically
detect if the zip file is zip64.

If you intend to manipulate the Zip64 zip files created with
C<IO::Compress::Zip> using an external zip/unzip, make sure that it supports
Zip64.

In particular, if you are using Info-Zip you need to have zip version 3.x
or better to update a Zip64 archive and unzip version 6.x to read a zip64
archive.

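The OO case can be tried without a multi-gigabyte file. The sketch below
forces C<Zip64> on a tiny in-memory archive and then reads it back with the
one-shot C<unzip>, which needs no extra option. The member name
C<small.txt> and its content are assumptions made for the example.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw($ZipError);
use IO::Uncompress::Unzip qw(unzip $UnzipError);

my $archive;    # the zip archive is built in this buffer
my $content = "a very small member\n";

# Zip64 must be requested explicitly with the OO interface.
my $zip = IO::Compress::Zip->new( \$archive,
                                  Name  => 'small.txt',
                                  Zip64 => 1 )
    or die "Cannot create zip: $ZipError\n";

$zip->print($content);
$zip->close();

# Reading: the zip64 structures are detected automatically.
my $out;
unzip \$archive => \$out
    or die "unzip failed: $UnzipError\n";

print "round trip ok\n" if $out eq $content;
```
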
=head2 Can I write more than 64K entries in a Zip file?

Yes. Zip64 allows this. See previous question.

=head2 Zip Resources

The primary reference for zip files is the "appnote" document available at
L<http://www.pkware.com/documents/casestudies/APPNOTE.TXT>

An alternative is the Info-Zip appnote. This is available from
L<ftp://ftp.info-zip.org/pub/infozip/doc/>

=head1 GZIP

=head2 Gzip Resources

The primary reference for gzip files is RFC 1952
L<http://www.faqs.org/rfcs/rfc1952.html>

The primary site for gzip is L<http://www.gzip.org>.

=head2 Dealing with concatenated gzip files

If the gunzip program encounters a file containing multiple gzip files
concatenated together it will automatically uncompress them all.
The example below illustrates this behaviour

    $ echo abc | gzip -c >x.gz
    $ echo def | gzip -c >>x.gz
    $ gunzip -c x.gz
    abc
    def

By default C<IO::Uncompress::Gunzip> will I<not> behave like the gunzip
program. It will only uncompress the first gzip data stream in the file, as
shown below

    $ perl -MIO::Uncompress::Gunzip=:all -e 'gunzip "x.gz" => \*STDOUT'
    abc

To force C<IO::Uncompress::Gunzip> to uncompress all the gzip data streams,
include the C<MultiStream> option, as shown below

    $ perl -MIO::Uncompress::Gunzip=:all -e 'gunzip "x.gz" => \*STDOUT, MultiStream => 1'
    abc
    def

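The same behaviour can be reproduced entirely in memory, which is handy in
a test script. This is a sketch rather than anything taken from the module
distribution; the variable names are arbitrary. It builds the equivalent of
F<x.gz> by concatenating two gzip streams and uncompresses the result with
and without C<MultiStream>.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my ($abc, $def) = ("abc\n", "def\n");

# Two independent gzip data streams...
my ($first, $second);
gzip \$abc => \$first  or die "gzip failed: $GzipError\n";
gzip \$def => \$second or die "gzip failed: $GzipError\n";

# ...joined end to end, like "gzip -c >>x.gz" in the shell session.
my $concatenated = $first . $second;

# Default behaviour: stop after the first stream.
my $single;
gunzip \$concatenated => \$single
    or die "gunzip failed: $GunzipError\n";

# MultiStream: keep going until the input is exhausted.
my $all;
gunzip \$concatenated => \$all, MultiStream => 1
    or die "gunzip failed: $GunzipError\n";

print "default:     $single";
print "MultiStream: $_" for split /^/, $all;
```
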
=head2 Reading bgzip files with IO::Uncompress::Gunzip

A C<bgzip> file consists of a series of valid gzip-compliant data streams
concatenated together. To read a file created by C<bgzip> with
C<IO::Uncompress::Gunzip> use the C<MultiStream> option as shown in the
previous section.

See the section titled "The BGZF compression format" in
L<http://samtools.github.io/hts-specs/SAMv1.pdf> for a definition of
C<bgzip>.

=head1 ZLIB

=head2 Zlib Resources

The primary site for the I<zlib> compression library is
L<http://www.zlib.org>.

=head1 BZIP2

=head2 Bzip2 Resources

The primary site for bzip2 is L<http://www.bzip.org>.

=head2 Dealing with concatenated bzip2 files

If the bunzip2 program encounters a file containing multiple bzip2 files
concatenated together it will automatically uncompress them all.
The example below illustrates this behaviour

    $ echo abc | bzip2 -c >x.bz2
    $ echo def | bzip2 -c >>x.bz2
    $ bunzip2 -c x.bz2
    abc
    def

By default C<IO::Uncompress::Bunzip2> will I<not> behave like the bunzip2
program. It will only uncompress the first bzip2 data stream in the file, as
shown below

    $ perl -MIO::Uncompress::Bunzip2=:all -e 'bunzip2 "x.bz2" => \*STDOUT'
    abc

To force C<IO::Uncompress::Bunzip2> to uncompress all the bzip2 data streams,
include the C<MultiStream> option, as shown below

    $ perl -MIO::Uncompress::Bunzip2=:all -e 'bunzip2 "x.bz2" => \*STDOUT, MultiStream => 1'
    abc
    def

=head2 Interoperating with Pbzip2

Pbzip2 (L<http://compression.ca/pbzip2/>) is a parallel implementation of
bzip2. The output from pbzip2 consists of a series of concatenated bzip2
data streams.

By default C<IO::Uncompress::Bunzip2> will only uncompress the first bzip2
data stream in a pbzip2 file. To uncompress the complete pbzip2 file you
must include the C<MultiStream> option, like this.

    bunzip2 $input => \$output, MultiStream => 1
        or die "bunzip2 failed: $Bunzip2Error\n";

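C<MultiStream> is also accepted by the OO interface, which is useful when
you want to process the data line by line instead of uncompressing it all
in one go. The sketch below does not run pbzip2; it imitates the shape of
its output by concatenating separately compressed chunks, so the chunk
contents and variable names are purely illustrative.

```perl
use strict;
use warnings;
use IO::Compress::Bzip2     qw(bzip2 $Bzip2Error);
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);

# Stand-in for pbzip2 output: one bzip2 data stream per chunk.
my @chunks = ("alpha\nbeta\n", "gamma\n", "delta\nepsilon\n");
my $input  = '';
for my $chunk (@chunks) {
    my $part;
    bzip2 \$chunk => \$part
        or die "bzip2 failed: $Bzip2Error\n";
    $input .= $part;
}

# With MultiStream the object moves on to the next stream by itself.
my $z = IO::Uncompress::Bunzip2->new( \$input, MultiStream => 1 )
    or die "Cannot open bzip2 data: $Bunzip2Error\n";

my @lines;
while ( my $line = <$z> ) {
    chomp $line;
    push @lines, $line;
}
$z->close();

print scalar(@lines), " lines read: @lines\n";
```
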
=head1 HTTP & NETWORK

=head2 Apache::GZip Revisited

Below is a mod_perl Apache compression module, called C<Apache::GZip>,
taken from
L<http://perl.apache.org/docs/tutorials/tips/mod_perl_tricks/mod_perl_tricks.html#On_the_Fly_Compression>

  package Apache::GZip;
  #File: Apache::GZip.pm

  use strict vars;
  use Apache::Constants ':common';
  use Compress::Zlib;
  use IO::File;
  use constant GZIP_MAGIC => 0x1f8b;
  use constant OS_MAGIC => 0x03;

  sub handler {
      my $r = shift;
      my ($fh,$gz);
      my $file = $r->filename;
      return DECLINED unless $fh=IO::File->new($file);
      $r->header_out('Content-Encoding'=>'gzip');
      $r->send_http_header;
      return OK if $r->header_only;

      tie *STDOUT,'Apache::GZip',$r;
      print($_) while <$fh>;
      untie *STDOUT;
      return OK;
  }

  sub TIEHANDLE {
      my($class,$r) = @_;
      # initialize a deflation stream
      my $d = deflateInit(-WindowBits=>-MAX_WBITS()) || return undef;

      # gzip header -- don't ask how I found out
      $r->print(pack("nccVcc",GZIP_MAGIC,Z_DEFLATED,0,time(),0,OS_MAGIC));

      return bless { r   => $r,
                     crc =>  crc32(undef),
                     d   => $d,
                     l   =>  0
                   },$class;
  }

  sub PRINT {
      my $self = shift;
      foreach (@_) {
        # deflate the data
        my $data = $self->{d}->deflate($_);
        $self->{r}->print($data);
        # keep track of its length and crc
        $self->{l} += length($_);
        $self->{crc} = crc32($_,$self->{crc});
      }
  }

  sub DESTROY {
     my $self = shift;

     # flush the output buffers
     my $data = $self->{d}->flush;
     $self->{r}->print($data);

     # print the CRC and the total length (uncompressed)
     $self->{r}->print(pack("LL",@{$self}{qw/crc l/}));
  }

  1;

Here's the Apache configuration entry you'll need to make use of it. Once
set, everything in the /compressed directory will be compressed
automagically.

  <Location /compressed>
     SetHandler  perl-script
     PerlHandler Apache::GZip
  </Location>

Although at first sight there seems to be quite a lot going on in
C<Apache::GZip>, you could sum up what the code was doing as follows --
read the contents of the file in C<< $r->filename >>, compress it and write
the compressed data to standard output. That's all.

This code has to jump through a few hoops to achieve this because

=over

=item 1.

The gzip support in C<Compress::Zlib> version 1.x can only work with a real
filesystem filehandle. The filehandles used by Apache modules are not
associated with the filesystem.

=item 2.

That means all the gzip support has to be done by hand - in this case by
creating a tied filehandle to deal with creating the gzip header and
trailer.

=back

C<IO::Compress::Gzip> doesn't have that filehandle limitation (this was one
of the reasons for writing it in the first place). So if
C<IO::Compress::Gzip> is used instead of C<Compress::Zlib> the whole tied
filehandle code can be removed. Here is the rewritten code.

  package Apache::GZip;

  use strict vars;
  use Apache::Constants ':common';
  use IO::Compress::Gzip;
  use IO::File;

  sub handler {
      my $r = shift;
      my ($fh,$gz);
      my $file = $r->filename;
      return DECLINED unless $fh=IO::File->new($file);
      $r->header_out('Content-Encoding'=>'gzip');
      $r->send_http_header;
      return OK if $r->header_only;

      # $gz is already declared above, so assign rather than redeclare
      $gz = IO::Compress::Gzip->new( '-', Minimal => 1 )
          or return DECLINED ;

      print $gz $_ while <$fh>;

      return OK;
  }

  1;

or even more succinctly, like this, using a one-shot gzip

  package Apache::GZip;

  use strict vars;
  use Apache::Constants ':common';
  use IO::Compress::Gzip qw(gzip);

  sub handler {
      my $r = shift;
      $r->header_out('Content-Encoding'=>'gzip');
      $r->send_http_header;
      return OK if $r->header_only;

      gzip $r->filename => '-', Minimal => 1
        or return DECLINED ;

      return OK;
  }

  1;

The use of one-shot C<gzip> above just reads from C<< $r->filename >> and
writes the compressed data to standard output.

Note the use of the C<Minimal> option in the code above. When using gzip
for Content-Encoding you should I<always> use this option. In the example
above it will prevent the filename being included in the gzip header and
make the gzip data stream slightly smaller.

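The effect of C<Minimal> can be checked outside Apache. The following
sketch compresses the same buffer twice, once with an explicit C<Name> and
once with C<Minimal> also set, and compares the results. The page content
and the name C<index.html> are invented for the example.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

my $body = "<html><body>Hello</body></html>\n";

# A gzip stream whose header carries a filename.
my $named;
gzip \$body => \$named, Name => 'index.html'
    or die "gzip failed: $GzipError\n";

# Same request plus Minimal: the header options are ignored.
my $minimal;
gzip \$body => \$minimal, Name => 'index.html', Minimal => 1
    or die "gzip failed: $GzipError\n";

# Look at what actually ended up in the minimal header.
my $gunzip = IO::Uncompress::Gunzip->new( \$minimal )
    or die "gunzip failed: $GunzipError\n";
my $name = $gunzip->getHeaderInfo()->{Name};

printf "with name: %d bytes, minimal: %d bytes\n",
    length($named), length($minimal);
print "name in minimal header: ",
    ( defined $name ? $name : '(none)' ), "\n";
```
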
=head2 Compressed files and Net::FTP

The C<Net::FTP> module provides two low-level methods called C<stor> and
C<retr> that both return filehandles. These filehandles can be used with the
C<IO::Compress/Uncompress> modules to compress or uncompress files read
from or written to an FTP Server on the fly, without having to create a
temporary file.

Firstly, here is code that uses C<retr> to uncompress a file as it is
read from the FTP Server.

    use Net::FTP;
    use IO::Uncompress::Gunzip qw(:all);

    my $ftp = Net::FTP->new( ... );

    my $retr_fh = $ftp->retr($compressed_filename);
    gunzip $retr_fh => $outFilename, AutoClose => 1
        or die "Cannot uncompress '$compressed_filename': $GunzipError\n";

and this to compress a file as it is written to the FTP Server

    use Net::FTP;
    use IO::Compress::Gzip qw(:all);

    my $stor_fh = $ftp->stor($filename);
    gzip $filename => $stor_fh, AutoClose => 1
        or die "Cannot compress '$filename': $GzipError\n";

=head1 MISC

=head2 Using C<InputLength> to uncompress data embedded in a larger file/buffer.

A fairly common use-case is where compressed data is embedded in a larger
file/buffer and you want to read both.

As an example consider the structure of a zip file. This is a well-defined
file format that mixes both compressed and uncompressed sections of data in
a single file.

For the purposes of this discussion you can think of a zip file as a sequence
of compressed data streams, each of which is prefixed by an uncompressed
local header. The local header contains information about the compressed
data stream, including the name of the compressed file and, in particular,
the length of the compressed data stream.

To illustrate how to use C<InputLength> here is a script that walks a zip
file and prints out how many lines are in each compressed file (if you
intend to write code to walk through a zip file for real see
L<IO::Uncompress::Unzip/"Walking through a zip file"> ). Also, although
this example uses the zlib-based compression, the technique can be used by
the other C<IO::Uncompress::*> modules.

    use strict;
    use warnings;

    use IO::File;
    use IO::Uncompress::RawInflate qw(:all);

    use constant ZIP_LOCAL_HDR_SIG  => 0x04034b50;
    use constant ZIP_LOCAL_HDR_LENGTH => 30;

    my $file = $ARGV[0] ;

    my $fh = IO::File->new( "<$file" )
                or die "Cannot open '$file': $!\n";

    while (1)
    {
        my $buffer;

        $fh->read($buffer, ZIP_LOCAL_HDR_LENGTH) == ZIP_LOCAL_HDR_LENGTH
            or die "Truncated file: $!\n";

        my $signature = unpack ("V", substr($buffer, 0, 4));

        last unless $signature == ZIP_LOCAL_HDR_SIG;

        # Read Local Header
        my $gpFlag             = unpack ("v", substr($buffer, 6, 2));
        my $compressedMethod   = unpack ("v", substr($buffer, 8, 2));
        my $compressedLength   = unpack ("V", substr($buffer, 18, 4));
        my $uncompressedLength = unpack ("V", substr($buffer, 22, 4));
        my $filename_length    = unpack ("v", substr($buffer, 26, 2));
        my $extra_length       = unpack ("v", substr($buffer, 28, 2));

        my $filename ;
        $fh->read($filename, $filename_length) == $filename_length
            or die "Truncated file\n";

        $fh->read($buffer, $extra_length) == $extra_length
            or die "Truncated file\n";

        if ($compressedMethod != 8 && $compressedMethod != 0)
        {
            warn "Skipping file '$filename' - not deflated $compressedMethod\n";
            $fh->read($buffer, $compressedLength) == $compressedLength
                or die "Truncated file\n";
            next;
        }

        # Parentheses are needed: == binds more tightly than &
        if ($compressedMethod == 0 && ($gpFlag & 8) == 8)
        {
            die "Streamed Stored not supported for '$filename'\n";
        }

        next if $compressedLength == 0;

        # Done reading the Local Header

        my $inf = IO::Uncompress::RawInflate->new( $fh,
                            Transparent => 1,
                            InputLength => $compressedLength )
          or die "Cannot uncompress $file [$filename]: $RawInflateError\n"  ;

        my $line_count = 0;

        while (<$inf>)
        {
            ++ $line_count;
        }

        print "$filename: $line_count\n";
    }

The majority of the code above is concerned with reading the zip local
header data. The code that I want to focus on is at the bottom.

    while (1) {

        # read local zip header data
        # get $filename
        # get $compressedLength

        my $inf = IO::Uncompress::RawInflate->new( $fh,
                            Transparent => 1,
                            InputLength => $compressedLength )
          or die "Cannot uncompress $file [$filename]: $RawInflateError\n"  ;

        my $line_count = 0;

        while (<$inf>)
        {
            ++ $line_count;
        }

        print "$filename: $line_count\n";
    }

The call to C<IO::Uncompress::RawInflate> creates a new filehandle C<$inf>
that can be used to read from the parent filehandle C<$fh>, uncompressing
it as it goes. The use of the C<InputLength> option will guarantee that
I<at most> C<$compressedLength> bytes of compressed data will be read from
the C<$fh> filehandle (the only exception is for an error case like a
truncated file or a corrupt data stream).

This means that once RawInflate is finished C<$fh> will be left at the
byte directly after the compressed data stream.

Now consider what the code looks like without C<InputLength>

    while (1) {

        # read local zip header data
        # get $filename
        # get $compressedLength

        # read all the compressed data into $data
        my $data;
        read($fh, $data, $compressedLength);

        my $inf = IO::Uncompress::RawInflate->new( \$data,
                            Transparent => 1 )
          or die "Cannot uncompress $file [$filename]: $RawInflateError\n"  ;

        my $line_count = 0;

        while (<$inf>)
        {
            ++ $line_count;
        }

        print "$filename: $line_count\n";
    }

The difference here is the addition of the temporary variable C<$data>.
This is used to store a copy of the compressed data while it is being
uncompressed.

If you know that C<$compressedLength> isn't that big then using temporary
storage won't be a problem. But if C<$compressedLength> is very large or
you are writing an application that other people will use, and so have no
idea how big C<$compressedLength> will be, it could be an issue.

Using C<InputLength> avoids the use of temporary storage and means the
application can cope with large compressed data streams.

One final point -- obviously C<InputLength> can only be used whenever you
know the length of the compressed data beforehand, like here with a zip
file.

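The zip walker needs a zip file to run against. To try C<InputLength> in
isolation, here is a smaller sketch that uses a made-up container format
(not a real file format): a four byte tag C<HDR!>, a four byte big-endian
length, the raw deflate data, and a four byte trailer C<END!>. It shows
that after the embedded stream has been read, the parent filehandle is
positioned at the trailer.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::File;
use IO::Compress::RawDeflate   qw(rawdeflate $RawDeflateError);
use IO::Uncompress::RawInflate qw($RawInflateError);

my $payload = "line one\nline two\nline three\n";

my $compressed;
rawdeflate \$payload => \$compressed
    or die "rawdeflate failed: $RawDeflateError\n";

# Write the hypothetical container: tag, length, data, trailer.
my ($tmp, $path) = tempfile( UNLINK => 1 );
binmode $tmp;
print {$tmp} "HDR!", pack("N", length $compressed), $compressed, "END!";
close $tmp or die "Cannot close '$path': $!\n";

my $fh = IO::File->new( $path, "r" )
    or die "Cannot open '$path': $!\n";
binmode $fh;

# Uncompressed header: tag plus the length of the embedded stream.
my $header;
$fh->read($header, 8) == 8
    or die "Truncated file\n";
my ($magic, $length) = unpack "a4 N", $header;

# Read at most $length compressed bytes from the parent filehandle.
my $inf = IO::Uncompress::RawInflate->new( $fh, InputLength => $length )
    or die "Cannot uncompress: $RawInflateError\n";
my $text = do { local $/; <$inf> };

# $fh now sits directly after the compressed data.
my $trailer;
$fh->read($trailer, 4) == 4
    or die "Truncated file\n";

print "payload recovered\n" if $text eq $payload;
print "trailer: $trailer\n";
```
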
=head1 SUPPORT

General feedback/questions/bug reports should be sent to
L<https://github.com/pmqs//issues> (preferred) or
L<https://rt.cpan.org/Public/Dist/Display.html?Name=>.

=head1 SEE ALSO

L<Compress::Zlib>, L<IO::Compress::Gzip>, L<IO::Uncompress::Gunzip>, L<IO::Compress::Deflate>, L<IO::Uncompress::Inflate>, L<IO::Compress::RawDeflate>, L<IO::Uncompress::RawInflate>, L<IO::Compress::Bzip2>, L<IO::Uncompress::Bunzip2>, L<IO::Compress::Lzma>, L<IO::Uncompress::UnLzma>, L<IO::Compress::Xz>, L<IO::Uncompress::UnXz>, L<IO::Compress::Lzip>, L<IO::Uncompress::UnLzip>, L<IO::Compress::Lzop>, L<IO::Uncompress::UnLzop>, L<IO::Compress::Lzf>, L<IO::Uncompress::UnLzf>, L<IO::Compress::Zstd>, L<IO::Uncompress::UnZstd>, L<IO::Uncompress::AnyInflate>, L<IO::Uncompress::AnyUncompress>

L<IO::Compress::FAQ|IO::Compress::FAQ>

L<File::GlobMapper|File::GlobMapper>, L<Archive::Zip|Archive::Zip>,
L<Archive::Tar|Archive::Tar>,
L<IO::Zlib|IO::Zlib>

=head1 AUTHOR

This module was written by Paul Marquess, C<pmqs@cpan.org>.

=head1 MODIFICATION HISTORY

See the Changes file.

=head1 COPYRIGHT AND LICENSE

Copyright (c) 2005-2021 Paul Marquess. All rights reserved.

This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.