1\input texinfo  @c                                  -*- Texinfo -*-
2@setfilename bzip2.info
3
4@ignore
5This file documents bzip2 version 0.9.0c, and associated library
6libbzip2, written by Julian Seward (jseward@acm.org).
7
8Copyright (C) 1996-1998 Julian R Seward
9
10Permission is granted to make and distribute verbatim copies of
11this manual provided the copyright notice and this permission notice
12are preserved on all copies.
13
14Permission is granted to copy and distribute translations of this manual
15into another language, under the above conditions for verbatim copies.
16@end ignore
17
18@ifinfo
19@format
20START-INFO-DIR-ENTRY
21* Bzip2: (bzip2).		A program and library for data compression.
22END-INFO-DIR-ENTRY
23@end format
24
25@end ifinfo
26
27@iftex
28@c @finalout
29@settitle bzip2 and libbzip2
30@titlepage
31@title bzip2 and libbzip2
32@subtitle a program and library for data compression
33@subtitle copyright (C) 1996-1998 Julian Seward
34@subtitle version 0.9.0c of 18 October 1998
35@author Julian Seward
36
37@end titlepage
38@end iftex
39
40
41@parindent 0mm
42@parskip 2mm
43
44
45This program, @code{bzip2},
46and associated library @code{libbzip2}, are
47Copyright (C) 1996-1998 Julian R Seward.  All rights reserved.
48
49Redistribution and use in source and binary forms, with or without
50modification, are permitted provided that the following conditions
51are met:
52@itemize @bullet
53@item
54   Redistributions of source code must retain the above copyright
55   notice, this list of conditions and the following disclaimer.
56@item
57   The origin of this software must not be misrepresented; you must
58   not claim that you wrote the original software.  If you use this
59   software in a product, an acknowledgment in the product
60   documentation would be appreciated but is not required.
61@item
62   Altered source versions must be plainly marked as such, and must
63   not be misrepresented as being the original software.
64@item
65   The name of the author may not be used to endorse or promote
66   products derived from this software without specific prior written
67   permission.
68@end itemize
69THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS
70OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
71WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
72ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
73DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
74DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
75GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
76INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
77WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
78NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
79SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
80
81Julian Seward, Guildford, Surrey, UK.
82
83@code{jseward@@acm.org}
84
85@code{http://www.muraroa.demon.co.uk}
86
87@code{bzip2}/@code{libbzip2} version 0.9.0c of 18 October 1998.
88
89PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented
90algorithms.  However, I do not have the resources available to carry out
91a full patent search.  Therefore I cannot give any guarantee of the
92above statement.
93
94
95
96
97
98
99
100@node Overview, Implementation, Top, Top
101@chapter Introduction
102
103@code{bzip2}  compresses  files  using the Burrows-Wheeler
104block-sorting text compression algorithm,  and  Huffman  coding.
105Compression  is  generally  considerably  better than that
106achieved by more conventional LZ77/LZ78-based compressors,
107and  approaches  the performance of the PPM family of statistical compressors.
108
109@code{bzip2} is built on top of @code{libbzip2}, a flexible library
110for handling compressed data in the @code{bzip2} format.  This manual
111describes both how to use the program and
112how to work with the library interface.  Most of the
113manual is devoted to this library, not the program,
114which is good news if your interest is only in the program.
115
116Chapter 2 describes how to use @code{bzip2}; this is the only part
117you need to read if you just want to know how to operate the program.
118Chapter 3 describes the programming interfaces in detail, and
119Chapter 4 records some miscellaneous notes which I thought
120ought to be recorded somewhere.
121
122
123@chapter How to use @code{bzip2}
124
125This chapter contains a copy of the @code{bzip2} man page,
126and nothing else.
127@example
128NAME
129       bzip2, bunzip2 - a block-sorting file compressor, v0.9.0
130       bzcat - decompresses files to stdout
131       bzip2recover - recovers data from damaged bzip2 files
132
133
134SYNOPSIS
135       bzip2 [ -cdfkstvzVL123456789 ] [ filenames ...  ]
136       bunzip2 [ -fkvsVL ] [ filenames ...  ]
137       bzcat [ -s ] [ filenames ...  ]
138       bzip2recover filename
139
140
141DESCRIPTION
142       bzip2  compresses  files  using the Burrows-Wheeler block-
143       sorting text compression algorithm,  and  Huffman  coding.
144       Compression  is  generally  considerably  better than that
145       achieved by more conventional LZ77/LZ78-based compressors,
146       and  approaches  the performance of the PPM family of sta-
147       tistical compressors.
148
149       The command-line options are deliberately very similar  to
150       those of GNU Gzip, but they are not identical.
151
152       bzip2  expects  a list of file names to accompany the com-
153       mand-line flags.  Each file is replaced  by  a  compressed
154       version  of  itself,  with  the  name "original_name.bz2".
155       Each compressed file has the same  modification  date  and
156       permissions  as  the corresponding original, so that these
157       properties can  be  correctly  restored  at  decompression
158       time.  File name handling is naive in the sense that there
159       is no mechanism for preserving original file  names,  per-
160       missions  and  dates  in filesystems which lack these con-
161       cepts, or have serious file name length restrictions, such
162       as MS-DOS.
163
164       bzip2  and  bunzip2 will by default not overwrite existing
165       files; if you want this to happen, specify the -f flag.
166
167       If no file names  are  specified,  bzip2  compresses  from
168       standard  input  to  standard output.  In this case, bzip2
169       will decline to write compressed output to a terminal,  as
170       this  would  be  entirely  incomprehensible  and therefore
171       pointless.
172
173       bunzip2 (or bzip2 -d ) decompresses and restores all spec-
174       ified files whose names end in ".bz2".  Files without this
175       suffix are ignored.  Again, supplying no filenames  causes
176       decompression from standard input to standard output.
177
178       bunzip2 will correctly decompress a file which is the con-
179       catenation of two or more compressed files.  The result is
180       the concatenation of the corresponding uncompressed files.
181       Integrity testing (-t) of concatenated compressed files is
182       also supported.
183
184       You  can also compress or decompress files to the standard
185       output by giving the -c flag.  Multiple files may be  com-
186       pressed and decompressed like this.  The resulting outputs
187       are fed sequentially to stdout.  Compression  of  multiple
188       files  in this manner generates a stream containing multi-
189       ple compressed file representations.  Such a stream can be
190       decompressed  correctly  only  by  bzip2  version 0.9.0 or
191       later.  Earlier versions of bzip2 will stop  after  decom-
192       pressing the first file in the stream.
193
194       bzcat  (or bzip2 -dc ) decompresses all specified files to
195       the standard output.
196
197       Compression is always performed, even  if  the  compressed
198       file  is slightly larger than the original.  Files of less
199       than about one hundred bytes tend to get larger, since the
200       compression  mechanism  has  a  constant  overhead  in the
201       region of 50 bytes.  Random data (including the output  of
202       most  file  compressors)  is  coded at about 8.05 bits per
203       byte, giving an expansion of around 0.5%.
204
205       As a self-check for your  protection,  bzip2  uses  32-bit
206       CRCs  to make sure that the decompressed version of a file
207       is identical to the original.  This guards against corrup-
208       tion  of  the compressed data, and against undetected bugs
209       in bzip2 (hopefully very unlikely).  The chances  of  data
210       corruption  going  undetected  is  microscopic,  about one
211       chance in four billion for each file processed.  Be aware,
212       though,  that  the  check occurs upon decompression, so it
213       can only tell you that that something is wrong.  It  can't
214       help  you recover the original uncompressed data.  You can
215       use bzip2recover to  try  to  recover  data  from  damaged
216       files.
217
218       Return  values:  0  for a normal exit, 1 for environmental
219       problems (file not found, invalid flags, I/O errors,  &c),
220       2 to indicate a corrupt compressed file, 3 for an internal
221       consistency error (eg, bug) which caused bzip2 to panic.
222
223
224MEMORY MANAGEMENT
225       Bzip2 compresses large files in blocks.   The  block  size
226       affects  both  the  compression  ratio  achieved,  and the
227       amount of memory needed both for  compression  and  decom-
228       pression.   The flags -1 through -9 specify the block size
229       to be 100,000 bytes through 900,000  bytes  (the  default)
230       respectively.   At decompression-time, the block size used
231       for compression is read from the header of the  compressed
232       file, and bunzip2 then allocates itself just enough memory
233       to decompress the file.  Since block sizes are  stored  in
234       compressed  files,  it follows that the flags -1 to -9 are
235       irrelevant  to  and  so  ignored   during   decompression.
236
237       Compression  and decompression requirements, in bytes, can
238       be estimated as:
239
240             Compression:   400k + ( 7 x block size )
241
242             Decompression: 100k + ( 4 x block size ), or
243                            100k + ( 2.5 x block size )
244
245       Larger  block  sizes  give  rapidly  diminishing  marginal
246       returns;  most of the compression comes from the first two
247       or three hundred k of block size, a fact worth bearing  in
248       mind  when  using  bzip2  on  small  machines.  It is also
249       important to  appreciate  that  the  decompression  memory
250       requirement  is  set  at compression-time by the choice of
251       block size.
252
253       For files compressed with the  default  900k  block  size,
254       bunzip2  will require about 3700 kbytes to decompress.  To
255       support decompression of any file on a 4 megabyte machine,
256       bunzip2  has  an  option to decompress using approximately
257       half this amount of memory, about 2300 kbytes.  Decompres-
258       sion  speed  is also halved, so you should use this option
259       only where necessary.  The relevant flag is -s.
260
261       In general, try and use the largest block size memory con-
262       straints  allow,  since  that  maximises  the  compression
263       achieved.  Compression and decompression speed are  virtu-
264       ally unaffected by block size.
265
266       Another  significant point applies to files which fit in a
267       single block -- that  means  most  files  you'd  encounter
268       using  a  large  block  size.   The  amount of real memory
269       touched is proportional to the size of the file, since the
270       file  is smaller than a block.  For example, compressing a
271       file 20,000 bytes long with the flag  -9  will  cause  the
272       compressor  to  allocate  around 6700k of memory, but only
273       touch 400k + 20000 * 7 = 540 kbytes of it.  Similarly, the
274       decompressor  will  allocate  3700k  but only touch 100k +
275       20000 * 4 = 180 kbytes.
276
277       Here is a table which summarises the maximum memory  usage
278       for  different  block  sizes.   Also recorded is the total
279       compressed size for 14 files of the Calgary Text  Compres-
280       sion  Corpus totalling 3,141,622 bytes.  This column gives
281       some feel for how  compression  varies  with  block  size.
282       These  figures  tend to understate the advantage of larger
283       block sizes for larger files, since the  Corpus  is  domi-
284       nated by smaller files.
285
286                  Compress   Decompress   Decompress   Corpus
287           Flag     usage      usage       -s usage     Size
288
289            -1      1100k       500k         350k      914704
290            -2      1800k       900k         600k      877703
291            -3      2500k      1300k         850k      860338
292            -4      3200k      1700k        1100k      846899
293            -5      3900k      2100k        1350k      845160
294            -6      4600k      2500k        1600k      838626
295            -7      5400k      2900k        1850k      834096
296            -8      6000k      3300k        2100k      828642
297            -9      6700k      3700k        2350k      828642
298
299
300OPTIONS
301       -c --stdout
302              Compress or decompress to standard output.  -c will
303              decompress multiple files to stdout, but will  only
304              compress a single file to stdout.
305
306       -d --decompress
307              Force  decompression.  bzip2, bunzip2 and bzcat are
308              really the same program,  and  the  decision  about
309              what  actions to take is done on the basis of which
310              name is used.  This flag overrides that  mechanism,
311              and forces bzip2 to decompress.
312
313       -z --compress
314              The  complement  to -d: forces compression, regard-
315              less of the invokation name.
316
317       -t --test
318              Check integrity of the specified file(s), but don't
319              decompress  them.   This  really  performs  a trial
320              decompression and throws away the result.
321
322       -f --force
323              Force overwrite of output files.   Normally,  bzip2
324              will not overwrite existing output files.
325
326       -k --keep
327              Keep  (don't delete) input files during compression
328              or decompression.
329
330       -s --small
331              Reduce memory usage, for compression, decompression
332              and  testing.   Files  are  decompressed and tested
333              using a modified algorithm which only requires  2.5
334              bytes  per  block byte.  This means any file can be
335              decompressed in 2300k of memory,  albeit  at  about
336              half the normal speed.
337
338              During  compression,  -s  selects  a  block size of
339              200k, which limits memory use to  around  the  same
340              figure,  at  the expense of your compression ratio.
341              In short, if your  machine  is  low  on  memory  (8
342              megabytes  or  less),  use  -s for everything.  See
343              MEMORY MANAGEMENT above.
344
345       -v --verbose
346              Verbose mode -- show the compression ratio for each
347              file  processed.   Further  -v's  increase the ver-
348              bosity level, spewing out lots of information which
349              is primarily of interest for diagnostic purposes.
350
351       -L --license -V --version
352              Display  the  software  version,  license terms and
353              conditions.
354
355       -1 to -9
356              Set the block size to 100 k, 200 k ..  900  k  when
357              compressing.   Has  no  effect  when decompressing.
358              See MEMORY MANAGEMENT above.
359
360       --repetitive-fast
361              bzip2 injects some small  pseudo-random  variations
362              into  very  repetitive  blocks  to limit worst-case
363              performance during compression.   If  sorting  runs
364              into  difficulties,  the  block  is randomised, and
365              sorting is restarted.  Very roughly, bzip2 persists
366              for  three  times  as  long as a well-behaved input
367              would take before resorting to randomisation.  This
368              flag makes it give up much sooner.
369
370       --repetitive-best
371              Opposite  of  --repetitive-fast;  try  a lot harder
372              before resorting to randomisation.
373
374
375RECOVERING DATA FROM DAMAGED FILES
376       bzip2 compresses files in blocks, usually 900kbytes  long.
377       Each block is handled independently.  If a media or trans-
378       mission error causes a multi-block  .bz2  file  to  become
379       damaged,  it  may  be  possible  to  recover data from the
380       undamaged blocks in the file.
381
382       The compressed representation of each block  is  delimited
383       by  a  48-bit pattern, which makes it possible to find the
384       block boundaries with reasonable  certainty.   Each  block
385       also  carries its own 32-bit CRC, so damaged blocks can be
386       distinguished from undamaged ones.
387
388       bzip2recover is a  simple  program  whose  purpose  is  to
389       search  for blocks in .bz2 files, and write each block out
390       into its own .bz2 file.  You can then use bzip2 -t to test
391       the integrity of the resulting files, and decompress those
392       which are undamaged.
393
394       bzip2recover takes a single argument, the name of the dam-
395       aged file, and writes a number of files "rec0001file.bz2",
396       "rec0002file.bz2", etc, containing the  extracted  blocks.
397       The  output  filenames  are  designed  so  that the use of
398       wildcards in subsequent processing -- for example,  "bzip2
399       -dc  rec*file.bz2  > recovered_data" -- lists the files in
400       the "right" order.
401
402       bzip2recover should be of most use dealing with large .bz2
403       files,  as  these will contain many blocks.  It is clearly
404       futile to use it on damaged single-block  files,  since  a
405       damaged  block  cannot  be recovered.  If you wish to min-
406       imise any potential data loss through media  or  transmis-
407       sion errors, you might consider compressing with a smaller
408       block size.
409
410
411PERFORMANCE NOTES
412       The sorting phase of compression gathers together  similar
413       strings  in  the  file.  Because of this, files containing
414       very long runs of  repeated  symbols,  like  "aabaabaabaab
415       ..."   (repeated   several  hundred  times)  may  compress
416       extraordinarily slowly.  You can use the -vvvvv option  to
417       monitor progress in great detail, if you want.  Decompres-
418       sion speed is unaffected.
419
420       Such pathological cases seem rare in  practice,  appearing
421       mostly in artificially-constructed test files, and in low-
422       level disk images.  It may be inadvisable to use bzip2  to
423       compress  the  latter.   If you do get a file which causes
424       severe slowness in compression, try making the block  size
425       as small as possible, with flag -1.
426
427       bzip2  usually  allocates  several  megabytes of memory to
428       operate in, and then charges all over it in a fairly  ran-
429       dom  fashion.   This means that performance, both for com-
430       pressing and decompressing, is largely determined  by  the
431       speed  at  which  your  machine  can service cache misses.
432       Because of this, small changes to the code to  reduce  the
433       miss  rate  have  been observed to give disproportionately
434       large performance improvements.  I imagine bzip2 will per-
435       form best on machines with very large caches.
436
437
438CAVEATS
439       I/O  error  messages  are not as helpful as they could be.
440       Bzip2 tries hard to detect I/O errors  and  exit  cleanly,
441       but  the  details  of  what  the problem is sometimes seem
442       rather misleading.
443
444       This manual page pertains to version 0.9.0 of bzip2.  Com-
445       pressed  data created by this version is entirely forwards
446       and backwards compatible with the previous public release,
447       version  0.1pl2,  but  with the following exception: 0.9.0
448       can correctly decompress multiple concatenated  compressed
449       files.   0.1pl2  cannot do this; it will stop after decom-
450       pressing just the first file in the stream.
451
452       Wildcard expansion for Windows 95 and NT is flaky.
453
454       bzip2recover uses 32-bit integers to represent  bit  posi-
455       tions  in compressed files, so it cannot handle compressed
456       files more than 512 megabytes long.  This could easily  be
457       fixed.
458
459
460AUTHOR
461       Julian Seward, jseward@@acm.org.
462
463       The ideas embodied in bzip2 are due to (at least) the fol-
464       lowing people: Michael Burrows and David Wheeler (for  the
465       block  sorting  transformation), David Wheeler (again, for
466       the Huffman coder), Peter Fenwick (for the structured cod-
467       ing model in the original bzip, and many refinements), and
468       Alistair Moffat, Radford Neal  and  Ian  Witten  (for  the
469       arithmetic  coder  in  the  original  bzip).   I  am  much
470       indebted for their help, support and advice.  See the man-
471       ual  in the source distribution for pointers to sources of
472       documentation.  Christian von Roques encouraged me to look
473       for  faster sorting algorithms, so as to speed up compres-
474       sion.  Bela Lubkin encouraged me to improve the worst-case
475       compression performance.  Many people sent patches, helped
476       with portability problems, lent machines, gave advice  and
477       were generally helpful.
478@end example
479
480
481
482
483
484@chapter Programming with @code{libbzip2}
485
486This chapter describes the programming interface to @code{libbzip2}.
487
488For general background information, particularly about memory
489use and performance aspects, you'd be well advised to read Chapter 2
490as well.
491
492@section Top-level structure
493
494@code{libbzip2} is a flexible library for compressing and decompressing
495data in the @code{bzip2} data format.  Although packaged as a single
496entity, it helps to regard the library as three separate parts: the low
497level interface, and the high level interface, and some utility
498functions.
499
500The structure of @code{libbzip2}'s interfaces is similar to
501that of Jean-loup Gailly's and Mark Adler's excellent @code{zlib}
502library.
503
504@subsection Low-level summary
505
506This interface provides services for compressing and decompressing
507data in memory.  There's no provision for dealing with files, streams
508or any other I/O mechanisms, just straight memory-to-memory work.
509In fact, this part of the library can be compiled without inclusion
510of @code{stdio.h}, which may be helpful for embedded applications.
511
512The low-level part of the library has no global variables and
513is therefore thread-safe.
514
515Six routines make up the low level interface:
516@code{bzCompressInit}, @code{bzCompress}, and @* @code{bzCompressEnd}
517for compression,
518and a corresponding trio @code{bzDecompressInit}, @* @code{bzDecompress}
519and @code{bzDecompressEnd} for decompression.
520The @code{*Init} functions allocate
521memory for compression/decompression and do other
522initialisations, whilst the @code{*End} functions close down operations
523and release memory.
524
525The real work is done by @code{bzCompress} and @code{bzDecompress}.
526These compress/decompress data from a user-supplied input buffer
527to a user-supplied output buffer.  These buffers can be any size;
528arbitrary quantities of data are handled by making repeated calls
529to these functions.  This is a flexible mechanism allowing a
530consumer-pull style of activity, or producer-push, or a mixture of
531both.
532
533
534
535@subsection High-level summary
536
537This interface provides some handy wrappers around the low-level
538interface to facilitate reading and writing @code{bzip2} format
539files (@code{.bz2} files).  The routines provide hooks to facilitate
540reading files in which the @code{bzip2} data stream is embedded
541within some larger-scale file structure, or where there are
542multiple @code{bzip2} data streams concatenated end-to-end.
543
544For reading files, @code{bzReadOpen}, @code{bzRead}, @code{bzReadClose}
545and @code{bzReadGetUnused} are supplied.  For writing files,
546@code{bzWriteOpen}, @code{bzWrite} and @code{bzWriteFinish} are
547available.
548
549As with the low-level library, no global variables are used
550so the library is per se thread-safe.  However, if I/O errors
551occur whilst reading or writing the underlying compressed files,
552you may have to consult @code{errno} to determine the cause of
553the error.  In that case, you'd need a C library which correctly
554supports @code{errno} in a multithreaded environment.
555
556To make the library a little simpler and more portable,
557@code{bzReadOpen} and @code{bzWriteOpen} require you to pass them file
558handles (@code{FILE*}s) which have previously been opened for reading or
559writing respectively.  That avoids portability problems associated with
560file operations and file attributes, whilst not being much of an
561imposition on the programmer.
562
563
564
565@subsection Utility functions summary
566For very simple needs, @code{bzBuffToBuffCompress} and
567@code{bzBuffToBuffDecompress} are provided.  These compress
568data in memory from one buffer to another buffer in a single
569function call.  You should assess whether these functions
570fulfill your memory-to-memory compression/decompression
571requirements before investing effort in understanding the more
572general but more complex low-level interface.
573
574Yoshioka Tsuneo (@code{QWF00133@@niftyserve.or.jp} /
575@code{tsuneo-y@@is.aist-nara.ac.jp}) has contributed some functions to
576give better @code{zlib} compatibility.  These functions are
577@code{bzopen}, @code{bzread}, @code{bzwrite}, @code{bzflush},
578@code{bzclose},
579@code{bzerror} and @code{bzlibVersion}.  You may find these functions
580more convenient for simple file reading and writing, than those in the
581high-level interface.  These functions are not (yet) officially part of
582the library, and are not further documented here.  If they break, you
583get to keep all the pieces.  I hope to document them properly when time
584permits.
585
586Yoshioka also contributed modifications to allow the library to be
587built as a Windows DLL.
588
589
590@section Error handling
591
592The library is designed to recover cleanly in all situations, including
593the worst-case situation of decompressing random data.  I'm not
594100% sure that it can always do this, so you might want to add
595a signal handler to catch segmentation violations during decompression
596if you are feeling especially paranoid.  I would be interested in
597hearing more about the robustness of the library to corrupted
598compressed data.
599
600The file @code{bzlib.h} contains all definitions needed to use
601the library.  In particular, you should definitely not include
602@code{bzlib_private.h}.
603
604In @code{bzlib.h}, the various return values are defined.  The following
605list is not intended as an exhaustive description of the circumstances
606in which a given value may be returned -- those descriptions are given
607later.  Rather, it is intended to convey the rough meaning of each
608return value.  The first five actions are normal and not intended to
609denote an error situation.
610@table @code
611@item BZ_OK
612The requested action was completed successfully.
613@item BZ_RUN_OK
614@itemx BZ_FLUSH_OK
615@itemx BZ_FINISH_OK
616In @code{bzCompress}, the requested flush/finish/nothing-special action
617was completed successfully.
618@item BZ_STREAM_END
619Compression of data was completed, or the logical stream end was
620detected during decompression.
621@end table
622
623The following return values indicate an error of some kind.
624@table @code
625@item BZ_SEQUENCE_ERROR
626When using the library, it is important to call the functions in the
627correct sequence and with data structures (buffers etc) in the correct
628states.  @code{libbzip2} checks as much as it can to ensure this is
629happening, and returns @code{BZ_SEQUENCE_ERROR} if not.  Code which
630complies precisely with the function semantics, as detailed below,
631should never receive this value; such an event denotes buggy code
632which you should investigate.
633@item BZ_PARAM_ERROR
634Returned when a parameter to a function call is out of range
635or otherwise manifestly incorrect.  As with @code{BZ_SEQUENCE_ERROR},
636this denotes a bug in the client code.  The distinction between
637@code{BZ_PARAM_ERROR} and @code{BZ_SEQUENCE_ERROR} is a bit hazy, but still worth
638making.
639@item BZ_MEM_ERROR
640Returned when a request to allocate memory failed.  Note that the
641quantity of memory needed to decompress a stream cannot be determined
642until the stream's header has been read.  So @code{bzDecompress} and
643@code{bzRead} may return @code{BZ_MEM_ERROR} even though some of
644the compressed data has been read.  The same is not true for
645compression; once @code{bzCompressInit} or @code{bzWriteOpen} have
646successfully completed, @code{BZ_MEM_ERROR} cannot occur.
647@item BZ_DATA_ERROR
648Returned when a data integrity error is detected during decompression.
649Most importantly, this means when stored and computed CRCs for the
650data do not match.  This value is also returned upon detection of any
651other anomaly in the compressed data.
652@item BZ_DATA_ERROR_MAGIC
653As a special case of @code{BZ_DATA_ERROR}, it is sometimes useful to
654know when the compressed stream does not start with the correct
655magic bytes (@code{'B' 'Z' 'h'}).
656@item BZ_IO_ERROR
657Returned by @code{bzRead} and @code{bzRead} when there is an error
658reading or writing in the compressed file, and by @code{bzReadOpen}
659and @code{bzWriteOpen} for attempts to use a file for which the
660error indicator (viz, @code{ferror(f)}) is set.
661On receipt of @code{BZ_IO_ERROR}, the caller should consult
662@code{errno} and/or @code{perror} to acquire operating-system
663specific information about the problem.
664@item BZ_UNEXPECTED_EOF
665Returned by @code{bzRead} when the compressed file finishes
666before the logical end of stream is detected.
667@item BZ_OUTBUFF_FULL
668Returned by @code{bzBuffToBuffCompress} and
669@code{bzBuffToBuffDecompress} to indicate that the output data
670will not fit into the output buffer provided.
671@end table
672
673
674
675@section Low-level interface
676
677@subsection @code{bzCompressInit}
678@example
679typedef
680   struct @{
681      char *next_in;
682      unsigned int avail_in;
683      unsigned int total_in;
684
685      char *next_out;
686      unsigned int avail_out;
687      unsigned int total_out;
688
689      void *state;
690
691      void *(*bzalloc)(void *,int,int);
692      void (*bzfree)(void *,void *);
693      void *opaque;
694   @}
695   bz_stream;
696
697int bzCompressInit ( bz_stream *strm,
698                     int blockSize100k,
699                     int verbosity,
700                     int workFactor );
701
702@end example
703
704Prepares for compression.  The @code{bz_stream} structure
705holds all data pertaining to the compression activity.
706A @code{bz_stream} structure should be allocated and initialised
707prior to the call.
708The fields of @code{bz_stream}
709comprise the entirety of the user-visible data.  @code{state}
710is a pointer to the private data structures required for compression.
711
712Custom memory allocators are supported, via fields @code{bzalloc},
713@code{bzfree},
714and @code{opaque}.  The value
715@code{opaque} is passed to as the first argument to
716all calls to @code{bzalloc} and @code{bzfree}, but is
717otherwise ignored by the library.
718The call @code{bzalloc ( opaque, n, m )} is expected to return a
719pointer @code{p} to
720@code{n * m} bytes of memory, and @code{bzfree ( opaque, p )}
721should free
722that memory.
723
724If you don't want to use a custom memory allocator, set @code{bzalloc},
725@code{bzfree} and
726@code{opaque} to @code{NULL},
727and the library will then use the standard @code{malloc}/@code{free}
728routines.
729
730Before calling @code{bzCompressInit}, fields @code{bzalloc},
731@code{bzfree} and @code{opaque} should
732be filled appropriately, as just described.  Upon return, the internal
733state will have been allocated and initialised, and @code{total_in} and
734@code{total_out} will have been set to zero.
735These last two fields are used by the library
736to inform the caller of the total amount of data passed into and out of
737the library, respectively.  You should not try to change them.
738
739Parameter @code{blockSize100k} specifies the block size to be used for
740compression.  It should be a value between 1 and 9 inclusive, and the
741actual block size used is 100000 x this figure.  9 gives the best
742compression but takes most memory.
743
744Parameter @code{verbosity} should be set to a number between 0 and 4
745inclusive.  0 is silent, and greater numbers give increasingly verbose
746monitoring/debugging output.  If the library has been compiled with
747@code{-DBZ_NO_STDIO}, no such output will appear for any verbosity
748setting.
749
750Parameter @code{workFactor} controls how the compression phase behaves
751when presented with worst case, highly repetitive, input data.
752If compression runs into difficulties caused by repetitive data,
753some pseudo-random variations are inserted into the block, and
754compression is restarted.  Lower values of @code{workFactor}
755reduce the tolerance of compression to repetitive data.
756You should set this parameter carefully; too low, and
757compression ratio suffers, too high, and your average-to-worst
758case compression times can become very large.
759The default value of 30
760gives reasonable behaviour over a wide range of circumstances.
761
762Allowable values range from 0 to 250 inclusive.  0 is a special
763case, equivalent to using the default value of 30.
764
765Note that the randomisation process is entirely transparent.
766If the library decides to randomise and restart compression on a
767block, it does so without comment.  Randomised blocks are
768automatically de-randomised during decompression, so data
769integrity is never compromised.
770
771Possible return values:
772@display
773      @code{BZ_PARAM_ERROR}
774         if @code{strm} is @code{NULL}
775         or @code{blockSize} < 1 or @code{blockSize} > 9
776         or @code{verbosity} < 0 or @code{verbosity} > 4
777         or @code{workFactor} < 0 or @code{workFactor} > 250
778      @code{BZ_MEM_ERROR}
779         if not enough memory is available
780      @code{BZ_OK}
781         otherwise
782@end display
783Allowable next actions:
784@display
785      @code{bzCompress}
786         if @code{BZ_OK} is returned
787      no specific action needed in case of error
788@end display
789
790@subsection @code{bzCompress}
791@example
792   int bzCompress ( bz_stream *strm, int action );
793@end example
794Provides more input and/or output buffer space for the library.  The
795caller maintains input and output buffers, and calls @code{bzCompress} to
796transfer data between them.
797
798Before each call to @code{bzCompress}, @code{next_in} should point at
799the data to be compressed, and @code{avail_in} should indicate how many
800bytes the library may read.  @code{bzCompress} updates @code{next_in},
801@code{avail_in} and @code{total_in} to reflect the number of bytes it
802has read.
803
804Similarly, @code{next_out} should point to a buffer in which the
805compressed data is to be placed, with @code{avail_out} indicating how
806much output space is available.  @code{bzCompress} updates
807@code{next_out}, @code{avail_out} and @code{total_out} to reflect the
808number of bytes output.
809
810You may provide and remove as little or as much data as you like on each
811call of @code{bzCompress}.  In the limit, it is acceptable to supply and
812remove data one byte at a time, although this would be terribly
813inefficient.  You should always ensure that at least one byte of output
814space is available at each call.
815
816A second purpose of @code{bzCompress} is to request a change of mode of the
817compressed stream.
818
819Conceptually, a compressed stream can be in one of four states: IDLE,
820RUNNING, FLUSHING and FINISHING.  Before initialisation
821(@code{bzCompressInit}) and after termination (@code{bzCompressEnd}), a
822stream is regarded as IDLE.
823
824Upon initialisation (@code{bzCompressInit}), the stream is placed in the
825RUNNING state.  Subsequent calls to @code{bzCompress} should pass
826@code{BZ_RUN} as the requested action; other actions are illegal and
827will result in @code{BZ_SEQUENCE_ERROR}.
828
829At some point, the calling program will have provided all the input data
830it wants to.  It will then want to finish up -- in effect, asking the
831library to process any data it might have buffered internally.  In this
832state, @code{bzCompress} will no longer attempt to read data from
833@code{next_in}, but it will want to write data to @code{next_out}.
834Because the output buffer supplied by the user can be arbitrarily small,
835the finishing-up operation cannot necessarily be done with a single call
836of @code{bzCompress}.
837
838Instead, the calling program passes @code{BZ_FINISH} as an action to
839@code{bzCompress}.  This changes the stream's state to FINISHING.  Any
840remaining input (ie, @code{next_in[0 .. avail_in-1]}) is compressed and
841transferred to the output buffer.  To do this, @code{bzCompress} must be
842called repeatedly until all the output has been consumed.  At that
843point, @code{bzCompress} returns @code{BZ_STREAM_END}, and the stream's
844state is set back to IDLE.  @code{bzCompressEnd} should then be
845called.
846
847Just to make sure the calling program does not cheat, the library makes
848a note of @code{avail_in} at the time of the first call to
849@code{bzCompress} which has @code{BZ_FINISH} as an action (ie, at the
850time the program has announced its intention to not supply any more
851input).  By comparing this value with that of @code{avail_in} over
852subsequent calls to @code{bzCompress}, the library can detect any
853attempts to slip in more data to compress.  Any calls for which this is
854detected will return @code{BZ_SEQUENCE_ERROR}.  This indicates a
855programming mistake which should be corrected.
856
857Instead of asking to finish, the calling program may ask
858@code{bzCompress} to take all the remaining input, compress it and
859terminate the current (Burrows-Wheeler) compression block.  This could
860be useful for error control purposes.  The mechanism is analogous to
861that for finishing: call @code{bzCompress} with an action of
862@code{BZ_FLUSH}, remove output data, and persist with the
863@code{BZ_FLUSH} action until the value @code{BZ_RUN} is returned.  As
864with finishing, @code{bzCompress} detects any attempt to provide more
865input data once the flush has begun.
866
867Once the flush is complete, the stream returns to the normal RUNNING
868state.
869
870This all sounds pretty complex, but isn't really.  Here's a table
871which shows which actions are allowable in each state, what action
872will be taken, what the next state is, and what the non-error return
873values are.  Note that you can't explicitly ask what state the
874stream is in, but nor do you need to -- it can be inferred from the
875values returned by @code{bzCompress}.
876@display
877IDLE/@code{any}
878      Illegal.  IDLE state only exists after @code{bzCompressEnd} or
879      before @code{bzCompressInit}.
880      Return value = @code{BZ_SEQUENCE_ERROR}
881
882RUNNING/@code{BZ_RUN}
883      Compress from @code{next_in} to @code{next_out} as much as possible.
884      Next state = RUNNING
885      Return value = @code{BZ_RUN_OK}
886
887RUNNING/@code{BZ_FLUSH}
888      Remember current value of @code{next_in}.  Compress from @code{next_in}
889      to @code{next_out} as much as possible, but do not accept any more input.
890      Next state = FLUSHING
891      Return value = @code{BZ_FLUSH_OK}
892
893RUNNING/@code{BZ_FINISH}
894      Remember current value of @code{next_in}.  Compress from @code{next_in}
895      to @code{next_out} as much as possible, but do not accept any more input.
896      Next state = FINISHING
897      Return value = @code{BZ_FINISH_OK}
898
899FLUSHING/@code{BZ_FLUSH}
900      Compress from @code{next_in} to @code{next_out} as much as possible,
901      but do not accept any more input.
902      If all the existing input has been used up and all compressed
903      output has been removed
904         Next state = RUNNING; Return value = @code{BZ_RUN_OK}
905      else
906         Next state = FLUSHING; Return value = @code{BZ_FLUSH_OK}
907
908FLUSHING/other
909      Illegal.
910      Return value = @code{BZ_SEQUENCE_ERROR}
911
912FINISHING/@code{BZ_FINISH}
913      Compress from @code{next_in} to @code{next_out} as much as possible,
914      but to not accept any more input.
915      If all the existing input has been used up and all compressed
916      output has been removed
917         Next state = IDLE; Return value = @code{BZ_STREAM_END}
918      else
919         Next state = FINISHING; Return value = @code{BZ_FINISHING}
920
921FINISHING/other
922      Illegal.
923      Return value = @code{BZ_SEQUENCE_ERROR}
924@end display
925
926That still looks complicated?  Well, fair enough.  The usual sequence
927of calls for compressing a load of data is:
928@itemize @bullet
929@item Get started with @code{bzCompressInit}.
930@item Shovel data in and shlurp out its compressed form using zero or more
931calls of @code{bzCompress} with action = @code{BZ_RUN}.
932@item Finish up.
933Repeatedly call @code{bzCompress} with action = @code{BZ_FINISH},
934copying out the compressed output, until @code{BZ_STREAM_END} is returned.
935@item Close up and go home.  Call @code{bzCompressEnd}.
936@end itemize
937If the data you want to compress fits into your input buffer all
938at once, you can skip the calls of @code{bzCompress ( ..., BZ_RUN )} and
939just do the @code{bzCompress ( ..., BZ_FINISH )} calls.
940
941All required memory is allocated by @code{bzCompressInit}.  The
942compression library can accept any data at all (obviously).  So you
943shouldn't get any error return values from the @code{bzCompress} calls.
944If you do, they will be @code{BZ_SEQUENCE_ERROR}, and indicate a bug in
945your programming.
946
947Trivial other possible return values:
948@display
949      @code{BZ_PARAM_ERROR}
950         if @code{strm} is @code{NULL}, or @code{strm->s} is @code{NULL}
951@end display
952
953@subsection @code{bzCompressEnd}
954@example
955int bzCompressEnd ( bz_stream *strm );
956@end example
957Releases all memory associated with a compression stream.
958
959Possible return values:
960@display
961   @code{BZ_PARAM_ERROR}    if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL}
962   @code{BZ_OK}    otherwise
963@end display
964
965
966@subsection @code{bzDecompressInit}
967@example
968int bzDecompressInit ( bz_stream *strm, int verbosity, int small );
969@end example
970Prepares for decompression.  As with @code{bzCompressInit}, a
971@code{bz_stream} record should be allocated and initialised before the
972call.  Fields @code{bzalloc}, @code{bzfree} and @code{opaque} should be
973set if a custom memory allocator is required, or made @code{NULL} for
974the normal @code{malloc}/@code{free} routines.  Upon return, the internal
975state will have been initialised, and @code{total_in} and
976@code{total_out} will be zero.
977
978For the meaning of parameter @code{verbosity}, see @code{bzCompressInit}.
979
980If @code{small} is nonzero, the library will use an alternative
981decompression algorithm which uses less memory but at the cost of
982decompressing more slowly (roughly speaking, half the speed, but the
983maximum memory requirement drops to around 2300k).  See Chapter 2 for
984more information on memory management.
985
986Note that the amount of memory needed to decompress
987a stream cannot be determined until the stream's header has been read,
988so even if @code{bzDecompressInit} succeeds, a subsequent
989@code{bzDecompress} could fail with @code{BZ_MEM_ERROR}.
990
991Possible return values:
992@display
993      @code{BZ_PARAM_ERROR}
994         if @code{(small != 0 && small != 1)}
995         or @code{(verbosity < 0 || verbosity > 4)}
996      @code{BZ_MEM_ERROR}
997         if insufficient memory is available
998@end display
999
1000Allowable next actions:
1001@display
1002      @code{bzDecompress}
1003         if @code{BZ_OK} was returned
1004      no specific action required in case of error
1005@end display
1006
1007
1008
1009@subsection @code{bzDecompress}
1010@example
1011int bzDecompress ( bz_stream *strm );
1012@end example
1013Provides more input and/out output buffer space for the library.  The
1014caller maintains input and output buffers, and uses @code{bzDecompress}
1015to transfer data between them.
1016
1017Before each call to @code{bzDecompress}, @code{next_in}
1018should point at the compressed data,
1019and @code{avail_in} should indicate how many bytes the library
1020may read.  @code{bzDecompress} updates @code{next_in}, @code{avail_in}
1021and @code{total_in}
1022to reflect the number of bytes it has read.
1023
1024Similarly, @code{next_out} should point to a buffer in which the uncompressed
1025output is to be placed, with @code{avail_out} indicating how much output space
1026is available.  @code{bzCompress} updates @code{next_out},
1027@code{avail_out} and @code{total_out} to reflect
1028the number of bytes output.
1029
1030You may provide and remove as little or as much data as you like on
1031each call of @code{bzDecompress}.
1032In the limit, it is acceptable to
1033supply and remove data one byte at a time, although this would be
1034terribly inefficient.  You should always ensure that at least one
1035byte of output space is available at each call.
1036
1037Use of @code{bzDecompress} is simpler than @code{bzCompress}.
1038
1039You should provide input and remove output as described above, and
1040repeatedly call @code{bzDecompress} until @code{BZ_STREAM_END} is
1041returned.  Appearance of @code{BZ_STREAM_END} denotes that
1042@code{bzDecompress} has detected the logical end of the compressed
1043stream.  @code{bzDecompress} will not produce @code{BZ_STREAM_END} until
1044all output data has been placed into the output buffer, so once
1045@code{BZ_STREAM_END} appears, you are guaranteed to have available all
1046the decompressed output, and @code{bzDecompressEnd} can safely be
1047called.
1048
1049If case of an error return value, you should call @code{bzDecompressEnd}
1050to clean up and release memory.
1051
1052Possible return values:
1053@display
1054      @code{BZ_PARAM_ERROR}
1055         if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL}
1056         or @code{strm->avail_out < 1}
1057      @code{BZ_DATA_ERROR}
1058         if a data integrity error is detected in the compressed stream
1059      @code{BZ_DATA_ERROR_MAGIC}
1060         if the compressed stream doesn't begin with the right magic bytes
1061      @code{BZ_MEM_ERROR}
1062         if there wasn't enough memory available
1063      @code{BZ_STREAM_END}
1064         if the logical end of the data stream was detected and all
1065         output in has been consumed, eg @code{s->avail_out > 0}
1066      @code{BZ_OK}
1067         otherwise
1068@end display
1069Allowable next actions:
1070@display
1071      @code{bzDecompress}
1072         if @code{BZ_OK} was returned
1073      @code{bzDecompressEnd}
1074         otherwise
1075@end display
1076
1077
1078@subsection @code{bzDecompressEnd}
1079@example
1080int bzDecompressEnd ( bz_stream *strm );
1081@end example
1082Releases all memory associated with a decompression stream.
1083
1084Possible return values:
1085@display
1086      @code{BZ_PARAM_ERROR}
1087         if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL}
1088      @code{BZ_OK}
1089         otherwise
1090@end display
1091
1092Allowable next actions:
1093@display
1094      None.
1095@end display
1096
1097
1098@section High-level interface
1099
1100This interface provides functions for reading and writing
1101@code{bzip2} format files.  First, some general points.
1102
1103@itemize @bullet
1104@item All of the functions take an @code{int*} first argument,
1105  @code{bzerror}.
1106  After each call, @code{bzerror} should be consulted first to determine
1107  the outcome of the call.  If @code{bzerror} is @code{BZ_OK},
1108  the call completed
1109  successfully, and only then should the return value of the function
1110  (if any) be consulted.  If @code{bzerror} is @code{BZ_IO_ERROR},
1111  there was an error
1112  reading/writing the underlying compressed file, and you should
1113  then consult @code{errno}/@code{perror} to determine the
1114  cause of the difficulty.
1115  @code{bzerror} may also be set to various other values; precise details are
1116  given on a per-function basis below.
1117@item If @code{bzerror} indicates an error
1118  (ie, anything except @code{BZ_OK} and @code{BZ_STREAM_END}),
1119  you should immediately call @code{bzReadClose} (or @code{bzWriteClose},
1120  depending on whether you are attempting to read or to write)
1121  to free up all resources associated
1122  with the stream.  Once an error has been indicated, behaviour of all calls
1123  except @code{bzReadClose} (@code{bzWriteClose}) is undefined.
1124  The implication is that (1) @code{bzerror} should
1125  be checked after each call, and (2) if @code{bzerror} indicates an error,
1126  @code{bzReadClose} (@code{bzWriteClose}) should then be called to clean up.
1127@item The @code{FILE*} arguments passed to
1128   @code{bzReadOpen}/@code{bzWriteOpen}
1129  should be set to binary mode.
1130  Most Unix systems will do this by default, but other platforms,
1131  including Windows and Mac, will not.  If you omit this, you may
1132  encounter problems when moving code to new platforms.
1133@item Memory allocation requests are handled by
1134  @code{malloc}/@code{free}.
1135  At present
1136  there is no facility for user-defined memory allocators in the file I/O
1137  functions (could easily be added, though).
1138@end itemize
1139
1140
1141
1142@subsection @code{bzReadOpen}
1143@example
1144   typedef void BZFILE;
1145
1146   BZFILE *bzReadOpen ( int *bzerror, FILE *f,
1147                        int small, int verbosity,
1148                        void *unused, int nUnused );
1149@end example
1150Prepare to read compressed data from file handle @code{f}.  @code{f}
1151should refer to a file which has been opened for reading, and for which
1152the error indicator (@code{ferror(f)})is not set.  If @code{small} is 1,
1153the library will try to decompress using less memory, at the expense of
1154speed.
1155
1156For reasons explained below, @code{bzRead} will decompress the
1157@code{nUnused} bytes starting at @code{unused}, before starting to read
1158from the file @code{f}.  At most @code{BZ_MAX_UNUSED} bytes may be
1159supplied like this.  If this facility is not required, you should pass
1160@code{NULL} and @code{0} for @code{unused} and n@code{Unused}
1161respectively.
1162
1163For the meaning of parameters @code{small} and @code{verbosity},
1164see @code{bzDecompressInit}.
1165
1166The amount of memory needed to decompress a file cannot be determined
1167until the file's header has been read.  So it is possible that
1168@code{bzReadOpen} returns @code{BZ_OK} but a subsequent call of
1169@code{bzRead} will return @code{BZ_MEM_ERROR}.
1170
1171Possible assignments to @code{bzerror}:
1172@display
1173      @code{BZ_PARAM_ERROR}
1174         if @code{f} is @code{NULL}
1175         or @code{small} is neither @code{0} nor @code{1}
1176         or @code{(unused == NULL && nUnused != 0)}
1177         or @code{(unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED))}
1178      @code{BZ_IO_ERROR}
1179         if @code{ferror(f)} is nonzero
1180      @code{BZ_MEM_ERROR}
1181         if insufficient memory is available
1182      @code{BZ_OK}
1183         otherwise.
1184@end display
1185
1186Possible return values:
1187@display
1188      Pointer to an abstract @code{BZFILE}
1189         if @code{bzerror} is @code{BZ_OK}
1190      @code{NULL}
1191         otherwise
1192@end display
1193
1194Allowable next actions:
1195@display
1196      @code{bzRead}
1197         if @code{bzerror} is @code{BZ_OK}
1198      @code{bzClose}
1199         otherwise
1200@end display
1201
1202
1203@subsection @code{bzRead}
1204@example
1205   int bzRead ( int *bzerror, BZFILE *b, void *buf, int len );
1206@end example
1207Reads up to @code{len} (uncompressed) bytes from the compressed file
1208@code{b} into
1209the buffer @code{buf}.  If the read was successful,
1210@code{bzerror} is set to @code{BZ_OK}
1211and the number of bytes read is returned.  If the logical end-of-stream
1212was detected, @code{bzerror} will be set to @code{BZ_STREAM_END},
1213and the number
1214of bytes read is returned.  All other @code{bzerror} values denote an error.
1215
1216@code{bzRead} will supply @code{len} bytes,
1217unless the logical stream end is detected
1218or an error occurs.  Because of this, it is possible to detect the
1219stream end by observing when the number of bytes returned is
1220less than the number
1221requested.  Nevertheless, this is regarded as inadvisable; you should
1222instead check @code{bzerror} after every call and watch out for
1223@code{BZ_STREAM_END}.
1224
1225Internally, @code{bzRead} copies data from the compressed file in chunks
1226of size @code{BZ_MAX_UNUSED} bytes
1227before decompressing it.  If the file contains more bytes than strictly
1228needed to reach the logical end-of-stream, @code{bzRead} will almost certainly
1229read some of the trailing data before signalling @code{BZ_SEQUENCE_END}.
1230To collect the read but unused data once @code{BZ_SEQUENCE_END} has
1231appeared, call @code{bzReadGetUnused} immediately before @code{bzReadClose}.
1232
1233Possible assignments to @code{bzerror}:
1234@display
1235      @code{BZ_PARAM_ERROR}
1236         if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
1237      @code{BZ_SEQUENCE_ERROR}
1238         if @code{b} was opened with @code{bzWriteOpen}
1239      @code{BZ_IO_ERROR}
1240         if there is an error reading from the compressed file
1241      @code{BZ_UNEXPECTED_EOF}
1242         if the compressed file ended before the logical end-of-stream was detected
1243      @code{BZ_DATA_ERROR}
1244         if a data integrity error was detected in the compressed stream
1245      @code{BZ_DATA_ERROR_MAGIC}
1246         if the stream does not begin with the requisite header bytes (ie, is not
1247         a @code{bzip2} data file).  This is really a special case of @code{BZ_DATA_ERROR}.
1248      @code{BZ_MEM_ERROR}
1249         if insufficient memory was available
1250      @code{BZ_STREAM_END}
1251         if the logical end of stream was detected.
1252      @code{BZ_OK}
1253         otherwise.
1254@end display
1255
1256Possible return values:
1257@display
1258      number of bytes read
1259         if @code{bzerror} is @code{BZ_OK} or @code{BZ_STREAM_END}
1260      undefined
1261         otherwise
1262@end display
1263
1264Allowable next actions:
1265@display
1266      collect data from @code{buf}, then @code{bzRead} or @code{bzReadClose}
1267         if @code{bzerror} is @code{BZ_OK}
1268      collect data from @code{buf}, then @code{bzReadClose} or @code{bzReadGetUnused}
1269         if @code{bzerror} is @code{BZ_SEQUENCE_END}
1270      @code{bzReadClose}
1271         otherwise
1272@end display
1273
1274
1275
1276@subsection @code{bzReadGetUnused}
1277@example
1278   void bzReadGetUnused ( int* bzerror, BZFILE *b,
1279                          void** unused, int* nUnused );
1280@end example
1281Returns data which was read from the compressed file but was not needed
1282to get to the logical end-of-stream.  @code{*unused} is set to the address
1283of the data, and @code{*nUnused} to the number of bytes.  @code{*nUnused} will
1284be set to a value between @code{0} and @code{BZ_MAX_UNUSED} inclusive.
1285
1286This function may only be called once @code{bzRead} has signalled
1287@code{BZ_STREAM_END} but before @code{bzReadClose}.
1288
1289Possible assignments to @code{bzerror}:
1290@display
1291      @code{BZ_PARAM_ERROR}
1292         if @code{b} is @code{NULL}
1293         or @code{unused} is @code{NULL} or @code{nUnused} is @code{NULL}
1294      @code{BZ_SEQUENCE_ERROR}
1295         if @code{BZ_STREAM_END} has not been signalled
1296         or if @code{b} was opened with @code{bzWriteOpen}
1297     @code{BZ_OK}
1298         otherwise
1299@end display
1300
1301Allowable next actions:
1302@display
1303      @code{bzReadClose}
1304@end display
1305
1306
1307@subsection @code{bzReadClose}
1308@example
1309   void bzReadClose ( int *bzerror, BZFILE *b );
1310@end example
1311Releases all memory pertaining to the compressed file @code{b}.
1312@code{bzReadClose} does not call @code{fclose} on the underlying file
1313handle, so you should do that yourself if appropriate.
1314@code{bzReadClose} should be called to clean up after all error
1315situations.
1316
1317Possible assignments to @code{bzerror}:
1318@display
1319      @code{BZ_SEQUENCE_ERROR}
1320         if @code{b} was opened with @code{bzOpenWrite}
1321      @code{BZ_OK}
1322         otherwise
1323@end display
1324
1325Allowable next actions:
1326@display
1327      none
1328@end display
1329
1330
1331
1332@subsection @code{bzWriteOpen}
1333@example
1334   BZFILE *bzWriteOpen ( int *bzerror, FILE *f,
1335                         int blockSize100k, int verbosity,
1336                         int workFactor );
1337@end example
1338Prepare to write compressed data to file handle @code{f}.
1339@code{f} should refer to
1340a file which has been opened for writing, and for which the error
1341indicator (@code{ferror(f)})is not set.
1342
1343For the meaning of parameters @code{blockSize100k},
1344@code{verbosity} and @code{workFactor}, see
1345@* @code{bzCompressInit}.
1346
1347All required memory is allocated at this stage, so if the call
1348completes successfully, @code{BZ_MEM_ERROR} cannot be signalled by a
1349subsequent call to @code{bzWrite}.
1350
1351Possible assignments to @code{bzerror}:
1352@display
1353      @code{BZ_PARAM_ERROR}
1354         if @code{f} is @code{NULL}
1355         or @code{blockSize100k < 1} or @code{blockSize100k > 9}
1356      @code{BZ_IO_ERROR}
1357         if @code{ferror(f)} is nonzero
1358      @code{BZ_MEM_ERROR}
1359         if insufficient memory is available
1360      @code{BZ_OK}
1361         otherwise
1362@end display
1363
1364Possible return values:
1365@display
1366      Pointer to an abstract @code{BZFILE}
1367         if @code{bzerror} is @code{BZ_OK}
1368      @code{NULL}
1369         otherwise
1370@end display
1371
1372Allowable next actions:
1373@display
1374      @code{bzWrite}
1375         if @code{bzerror} is @code{BZ_OK}
1376         (you could go directly to @code{bzWriteClose}, but this would be pretty pointless)
1377      @code{bzWriteClose}
1378         otherwise
1379@end display
1380
1381
1382
1383@subsection @code{bzWrite}
1384@example
1385   void bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );
1386@end example
1387Absorbs @code{len} bytes from the buffer @code{buf}, eventually to be
1388compressed and written to the file.
1389
1390Possible assignments to @code{bzerror}:
1391@display
1392      @code{BZ_PARAM_ERROR}
1393         if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
1394      @code{BZ_SEQUENCE_ERROR}
1395         if b was opened with @code{bzReadOpen}
1396      @code{BZ_IO_ERROR}
1397         if there is an error writing the compressed file.
1398      @code{BZ_OK}
1399         otherwise
1400@end display
1401
1402
1403
1404
1405@subsection @code{bzWriteClose}
1406@example
1407   int bzWriteClose ( int *bzerror, BZFILE* f,
1408                      int abandon,
1409                      unsigned int* nbytes_in,
1410                      unsigned int* nbytes_out );
1411@end example
1412
1413Compresses and flushes to the compressed file all data so far supplied
1414by @code{bzWrite}.  The logical end-of-stream markers are also written, so
1415subsequent calls to @code{bzWrite} are illegal.  All memory associated
1416with the compressed file @code{b} is released.
1417@code{fflush} is called on the
1418compressed file, but it is not @code{fclose}'d.
1419
1420If @code{bzWriteClose} is called to clean up after an error, the only
1421action is to release the memory.  The library records the error codes
1422issued by previous calls, so this situation will be detected
1423automatically.  There is no attempt to complete the compression
1424operation, nor to @code{fflush} the compressed file.  You can force this
1425behaviour to happen even in the case of no error, by passing a nonzero
1426value to @code{abandon}.
1427
1428If @code{nbytes_in} is non-null, @code{*nbytes_in} will be set to be the
1429total volume of uncompressed data handled.  Similarly, @code{nbytes_out}
1430will be set to the total volume of compressed data written.
1431
1432Possible assignments to @code{bzerror}:
1433@display
1434      @code{BZ_SEQUENCE_ERROR}
1435         if @code{b} was opened with @code{bzReadOpen}
1436      @code{BZ_IO_ERROR}
1437         if there is an error writing the compressed file
1438      @code{BZ_OK}
1439         otherwise
1440@end display
1441
1442@subsection Handling embedded compressed data streams
1443
1444The high-level library facilitates use of
1445@code{bzip2} data streams which form some part of a surrounding, larger
1446data stream.
1447@itemize @bullet
1448@item For writing, the library takes an open file handle, writes
1449compressed data to it, @code{fflush}es it but does not @code{fclose} it.
1450The calling application can write its own data before and after the
1451compressed data stream, using that same file handle.
1452@item Reading is more complex, and the facilities are not as general
1453as they could be since generality is hard to reconcile with efficiency.
1454@code{bzRead} reads from the compressed file in blocks of size
1455@code{BZ_MAX_UNUSED} bytes, and in doing so probably will overshoot
1456the logical end of compressed stream.
1457To recover this data once decompression has
1458ended, call @code{bzReadGetUnused} after the last call of @code{bzRead}
1459(the one returning @code{BZ_STREAM_END}) but before calling
1460@code{bzReadClose}.
1461@end itemize
1462
1463This mechanism makes it easy to decompress multiple @code{bzip2}
1464streams placed end-to-end.  As the end of one stream, when @code{bzRead}
1465returns @code{BZ_STREAM_END}, call @code{bzReadGetUnused} to collect the
1466unused data (copy it into your own buffer somewhere).
1467That data forms the start of the next compressed stream.
1468To start uncompressing that next stream, call @code{bzReadOpen} again,
1469feeding in the unused data via the @code{unused}/@code{nUnused}
1470parameters.
1471Keep doing this until @code{BZ_STREAM_END} return coincides with the
1472physical end of file (@code{feof(f)}).  In this situation
1473@code{bzReadGetUnused}
1474will of course return no data.
1475
1476This should give some feel for how the high-level interface can be used.
1477If you require extra flexibility, you'll have to bite the bullet and get
1478to grips with the low-level interface.
1479
1480@subsection Standard file-reading/writing code
1481Here's how you'd write data to a compressed file:
1482@example @code
1483FILE*   f;
1484BZFILE* b;
1485int     nBuf;
1486char    buf[ /* whatever size you like */ ];
1487int     bzerror;
1488int     nWritten;
1489
1490f = fopen ( "myfile.bz2", "w" );
1491if (!f) @{
1492   /* handle error */
1493@}
1494b = bzWriteOpen ( &bzerror, f, 9 );
1495if (bzerror != BZ_OK) @{
1496   bzWriteClose ( b );
1497   /* handle error */
1498@}
1499
1500while ( /* condition */ ) @{
1501   /* get data to write into buf, and set nBuf appropriately */
1502   nWritten = bzWrite ( &bzerror, b, buf, nBuf );
1503   if (bzerror == BZ_IO_ERROR) @{
1504      bzWriteClose ( &bzerror, b );
1505      /* handle error */
1506   @}
1507@}
1508
1509bzWriteClose ( &bzerror, b );
1510if (bzerror == BZ_IO_ERROR) @{
1511   /* handle error */
1512@}
1513@end example
1514And to read from a compressed file:
1515@example
1516FILE*   f;
1517BZFILE* b;
1518int     nBuf;
1519char    buf[ /* whatever size you like */ ];
1520int     bzerror;
1521int     nWritten;
1522
1523f = fopen ( "myfile.bz2", "r" );
1524if (!f) @{
1525   /* handle error */
1526@}
1527b = bzReadOpen ( &bzerror, f, 0, NULL, 0 );
1528if (bzerror != BZ_OK) @{
1529   bzReadClose ( &bzerror, b );
1530   /* handle error */
1531@}
1532
1533bzerror = BZ_OK;
1534while (bzerror == BZ_OK && /* arbitrary other conditions */) @{
1535   nBuf = bzRead ( &bzerror, b, buf, /* size of buf */ );
1536   if (bzerror == BZ_OK) @{
1537      /* do something with buf[0 .. nBuf-1] */
1538   @}
1539@}
1540if (bzerror != BZ_STREAM_END) @{
1541   bzReadClose ( &bzerror, b );
1542   /* handle error */
1543@} else @{
1544   bzReadClose ( &bzerror );
1545@}
1546@end example
1547
1548
1549
1550@section Utility functions
1551@subsection @code{bzBuffToBuffCompress}
1552@example
1553   int bzBuffToBuffCompress( char*         dest,
1554                             unsigned int* destLen,
1555                             char*         source,
1556                             unsigned int  sourceLen,
1557                             int           blockSize100k,
1558                             int           verbosity,
1559                             int           workFactor );
1560@end example
1561Attempts to compress the data in @code{source[0 .. sourceLen-1]}
1562into the destination buffer, @code{dest[0 .. *destLen-1]}.
1563If the destination buffer is big enough, @code{*destLen} is
1564set to the size of the compressed data, and @code{BZ_OK} is
1565returned.  If the compressed data won't fit, @code{*destLen}
1566is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
1567
1568Compression in this manner is a one-shot event, done with a single call
1569to this function.  The resulting compressed data is a complete
1570@code{bzip2} format data stream.  There is no mechanism for making
1571additional calls to provide extra input data.  If you want that kind of
1572mechanism, use the low-level interface.
1573
1574For the meaning of parameters @code{blockSize100k}, @code{verbosity}
1575and @code{workFactor}, @* see @code{bzCompressInit}.
1576
1577To guarantee that the compressed data will fit in its buffer, allocate
1578an output buffer of size 1% larger than the uncompressed data, plus
1579six hundred extra bytes.
1580
1581@code{bzBuffToBuffDecompress} will not write data at or
1582beyond @code{dest[*destLen]}, even in case of buffer overflow.
1583
1584Possible return values:
1585@display
1586      @code{BZ_PARAM_ERROR}
1587         if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
1588         or @code{blockSize100k < 1} or @code{blockSize100k > 9}
1589         or @code{verbosity < 0} or @code{verbosity > 4}
1590         or @code{workFactor < 0} or @code{workFactor > 250}
1591      @code{BZ_MEM_ERROR}
1592         if insufficient memory is available
1593      @code{BZ_OUTBUFF_FULL}
1594         if the size of the compressed data exceeds @code{*destLen}
1595      @code{BZ_OK}
1596         otherwise
1597@end display
1598
1599
1600
1601@subsection @code{bzBuffToBuffDecompress}
1602@example
1603   int bzBuffToBuffDecompress ( char*         dest,
1604                                unsigned int* destLen,
1605                                char*         source,
1606                                unsigned int  sourceLen,
1607                                int           small,
1608                                int           verbosity );
1609@end example
1610Attempts to decompress the data in @code{source[0 .. sourceLen-1]}
1611into the destination buffer, @code{dest[0 .. *destLen-1]}.
1612If the destination buffer is big enough, @code{*destLen} is
1613set to the size of the uncompressed data, and @code{BZ_OK} is
1614returned.  If the compressed data won't fit, @code{*destLen}
1615is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
1616
1617@code{source} is assumed to hold a complete @code{bzip2} format
1618data stream.  @code{bzBuffToBuffDecompress} tries to decompress
1619the entirety of the stream into the output buffer.
1620
1621For the meaning of parameters @code{small} and @code{verbosity},
1622see @code{bzDecompressInit}.
1623
1624Because the compression ratio of the compressed data cannot be known in
1625advance, there is no easy way to guarantee that the output buffer will
1626be big enough.  You may of course make arrangements in your code to
1627record the size of the uncompressed data, but such a mechanism is beyond
1628the scope of this library.
1629
1630@code{bzBuffToBuffDecompress} will not write data at or
1631beyond @code{dest[*destLen]}, even in case of buffer overflow.
1632
1633Possible return values:
1634@display
1635      @code{BZ_PARAM_ERROR}
1636         if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
1637         or @code{small != 0 && small != 1}
1638         or @code{verbosity < 0} or @code{verbosity > 4}
1639      @code{BZ_MEM_ERROR}
1640         if insufficient memory is available
1641      @code{BZ_OUTBUFF_FULL}
1642         if the size of the compressed data exceeds @code{*destLen}
1643      @code{BZ_DATA_ERROR}
1644         if a data integrity error was detected in the compressed data
1645      @code{BZ_DATA_ERROR_MAGIC}
1646         if the compressed data doesn't begin with the right magic bytes
1647      @code{BZ_UNEXPECTED_EOF}
1648         if the compressed data ends unexpectedly
1649      @code{BZ_OK}
1650         otherwise
1651@end display
1652
1653
1654
1655@section Using the library in a @code{stdio}-free environment
1656
1657@subsection Getting rid of @code{stdio}
1658
1659In a deeply embedded application, you might want to use just
1660the memory-to-memory functions.  You can do this conveniently
1661by compiling the library with preprocessor symbol @code{BZ_NO_STDIO}
1662defined.  Doing this gives you a library containing only the following
1663eight functions:
1664
1665@code{bzCompressInit}, @code{bzCompress}, @code{bzCompressEnd} @*
1666@code{bzDecompressInit}, @code{bzDecompress}, @code{bzDecompressEnd} @*
1667@code{bzBuffToBuffCompress}, @code{bzBuffToBuffDecompress}
1668
1669When compiled like this, all functions will ignore @code{verbosity}
1670settings.
1671
1672@subsection Critical error handling
1673@code{libbzip2} contains a number of internal assertion checks which
1674should, needless to say, never be activated.  Nevertheless, if an
1675assertion should fail, behaviour depends on whether or not the library
1676was compiled with @code{BZ_NO_STDIO} set.
1677
1678For a normal compile, an assertion failure yields the message
1679@example
1680   bzip2/libbzip2, v0.9.0: internal error number N.
1681   This is a bug in bzip2/libbzip2, v0.9.0.  Please report
1682   it to me at: jseward@@acm.org.  If this happened when
1683   you were using some program which uses libbzip2 as a
1684   component, you should also report this bug to the author(s)
1685   of that program.  Please make an effort to report this bug;
1686   timely and accurate bug reports eventually lead to higher
1687   quality software.  Thx.  Julian Seward, 27 June 1998.
1688@end example
1689where @code{N} is some error code number.  @code{exit(3)}
1690is then called.
1691
1692For a @code{stdio}-free library, assertion failures result
1693in a call to a function declared as:
1694@example
1695   extern void bz_internal_error ( int errcode );
1696@end example
1697The relevant code is passed as a parameter.  You should supply
1698such a function.
1699
1700In either case, once an assertion failure has occurred, any
1701@code{bz_stream} records involved can be regarded as invalid.
1702You should not attempt to resume normal operation with them.
1703
1704You may, of course, change critical error handling to suit
1705your needs.  As I said above, critical errors indicate bugs
1706in the library and should not occur.  All "normal" error
1707situations are indicated via error return codes from functions,
1708and can be recovered from.
1709
1710
1711@section Making a Windows DLL
1712Everything related to Windows has been contributed by Yoshioka Tsuneo
1713@* (@code{QWF00133@@niftyserve.or.jp} /
1714@code{tsuneo-y@@is.aist-nara.ac.jp}), so you should send your queries to
1715him (but perhaps Cc: me, @code{jseward@@acm.org}).
1716
1717My vague understanding of what to do is: using Visual C++ 5.0,
1718open the project file @code{libbz2.dsp}, and build.  That's all.
1719
1720If you can't
1721open the project file for some reason, make a new one, naming these files:
1722@code{blocksort.c}, @code{bzlib.c}, @code{compress.c},
1723@code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @*
1724@code{randtable.c} and @code{libbz2.def}.  You might also need
1725to name the header files @code{bzlib.h} and @code{bzlib_private.h}.
1726
1727If you don't use VC++, you may need to define the proprocessor symbol
1728@code{_WIN32}.
1729
1730Finally, @code{dlltest.c} is a sample program using the DLL.  It has a
1731project file, @code{dlltest.dsp}.
1732
1733I haven't tried any of this stuff myself, but it all looks plausible.
1734
1735
1736
1737@chapter Miscellanea
1738
1739These are just some random thoughts of mine.  Your mileage may
1740vary.
1741
1742@section Limitations of the compressed file format
1743@code{bzip2-0.9.0} uses exactly the same file format as the previous
1744version, @code{bzip2-0.1}.  This decision was made in the interests of
1745stability.  Creating yet another incompatible compressed file format
1746would create further confusion and disruption for users.
1747
1748Nevertheless, this is not a painless decision.  Development
1749work since the release of @code{bzip2-0.1} in August 1997
1750has shown complexities in the file format which slow down
1751decompression and, in retrospect, are unnecessary.  These are:
1752@itemize @bullet
1753@item The run-length encoder, which is the first of the
1754      compression transformations, is entirely irrelevant.
1755      The original purpose was to protect the sorting algorithm
1756      from the very worst case input: a string of repeated
1757      symbols.  But algorithm steps Q6a and Q6b in the original
1758      Burrows-Wheeler technical report (SRC-124) show how
1759      repeats can be handled without difficulty in block
1760      sorting.
1761@item The randomisation mechanism doesn't really need to be
1762      there.  Udi Manber and Gene Myers published a suffix
1763      array construction algorithm a few years back, which
1764      can be employed to sort any block, no matter how
1765      repetitive, in O(N log N) time.  Subsequent work by
1766      Kunihiko Sadakane has produced a derivative O(N (log N)^2)
1767      algorithm which usually outperforms the Manber-Myers
1768      algorithm.
1769
1770      I could have changed to Sadakane's algorithm, but I find
1771      it to be slower than @code{bzip2}'s existing algorithm for
1772      most inputs, and the randomisation mechanism protects
1773      adequately against bad cases.  I didn't think it was
1774      a good tradeoff to make.  Partly this is due to the fact
1775      that I was not flooded with email complaints about
1776      @code{bzip2-0.1}'s performance on repetitive data, so
1777      perhaps it isn't a problem for real inputs.
1778
1779      Probably the best long-term solution
1780      is to use the existing sorting
1781      algorithm initially, and fall back to a O(N (log N)^2)
1782      algorithm if the standard algorithm gets into difficulties.
1783      This can be done without much difficulty; I made
1784      a prototype implementation of it some months now.
1785@item The compressed file format was never designed to be
1786      handled by a library, and I have had to jump though
1787      some hoops to produce an efficient implementation of
1788      decompression.  It's a bit hairy.  Try passing
1789      @code{decompress.c} through the C preprocessor
1790      and you'll see what I mean.  Much of this complexity
1791      could have been avoided if the compressed size of
1792      each block of data was recorded in the data stream.
1793@item An Adler-32 checksum, rather than a CRC32 checksum,
1794      would be faster to compute.
1795@end itemize
1796It would be fair to say that the @code{bzip2} format was frozen
1797before I properly and fully understood the performance
1798consequences of doing so.
1799
1800Improvements which I have been able to incorporate into
18010.9.0, despite using the same file format, are:
1802@itemize @bullet
1803@item Single array implementation of the inverse BWT.  This
1804      significantly speeds up decompression, presumably
1805      because it reduces the number of cache misses.
1806@item Faster inverse MTF transform for large MTF values.  The
1807      new implementation is based on the notion of sliding blocks
1808      of values.
1809@item @code{bzip2-0.9.0} now reads and writes files with @code{fread}
1810      and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}.
1811      Duh! I'm embarrassed at my own moronicness (moronicity?) on this
1812      one.
1813
1814@end itemize
1815Further ahead, it would be nice
1816to be able to do random access into files.  This will
1817require some careful design of compressed file formats.
1818
1819
1820
1821@section Portability issues
1822After some consideration, I have decided not to use
1823GNU @code{autoconf} to configure 0.9.0.
1824
1825@code{autoconf}, admirable and wonderful though it is,
1826mainly assists with portability problems between Unix-like
1827platforms.  But @code{bzip2} doesn't have much in the way
1828of portability problems on Unix; most of the difficulties appear
1829when porting to the Mac, or to Microsoft's operating systems.
1830@code{autoconf} doesn't help in those cases, and brings in a
1831whole load of new complexity.
1832
1833Most people should be able to compile the library and program
1834under Unix straight out-of-the-box, so to speak, especially
1835if you have a version of GNU C available.
1836
1837There are a couple of @code{__inline__} directives in the code.  GNU C
1838(@code{gcc}) should be able to handle them.  If your compiler doesn't
1839like them, just @code{#define} @code{__inline__} to be null.  One
1840easy way to do this is to compile with the flag @code{-D__inline__=},
1841which should be understood by most Unix compilers.
1842
1843If you still have difficulties, try compiling with the macro
1844@code{BZ_STRICT_ANSI} defined.  This should enable you to build the
1845library in a strictly ANSI compliant environment.  Building the program
1846itself like this is dangerous and not supported, since you remove
1847@code{bzip2}'s checks against compressing directories, symbolic links,
1848devices, and other not-really-a-file entities.  This could cause
1849filesystem corruption!
1850
1851One other thing: if you create a @code{bzip2} binary for public
1852distribution, please try and link it statically (@code{gcc -s}).  This
1853avoids all sorts of library-version issues that others may encounter
1854later on.
1855
1856
1857@section Reporting bugs
1858I tried pretty hard to make sure @code{bzip2} is
1859bug free, both by design and by testing.  Hopefully
1860you'll never need to read this section for real.
1861
1862Nevertheless, if @code{bzip2} dies with a segmentation
1863fault, a bus error or an internal assertion failure, it
1864will ask you to email me a bug report.  Experience with
1865version 0.1 shows that almost all these problems can
1866be traced to either compiler bugs or hardware problems.
1867@itemize @bullet
1868@item
1869Recompile the program with no optimisation, and see if it
1870works.  And/or try a different compiler.
1871I heard all sorts of stories about various flavours
1872of GNU C (and other compilers) generating bad code for
1873@code{bzip2}, and I've run across two such examples myself.
1874
18752.7.X versions of GNU C are known to generate bad code from
1876time to time, at high optimisation levels.
1877If you get problems, try using the flags
1878@code{-O2} @code{-fomit-frame-pointer} @code{-fno-strength-reduce}.
1879You should specifically @emph{not} use @code{-funroll-loops}.
1880
1881You may notice that the Makefile runs four tests as part of
1882the build process.  If the program passes all of these, it's
1883a pretty good (but not 100%) indication that the compiler has
1884done its job correctly.
1885@item
1886If @code{bzip2} crashes randomly, and the crashes are not
1887repeatable, you may have a flaky memory subsystem.  @code{bzip2}
1888really hammers your memory hierarchy, and if it's a bit marginal,
1889you may get these problems.  Ditto if your disk or I/O subsystem
1890is slowly failing.  Yup, this really does happen.
1891
1892Try using a different machine of the same type, and see if
1893you can repeat the problem.
1894@item This isn't really a bug, but ... If @code{bzip2} tells
1895you your file is corrupted on decompression, and you
1896obtained the file via FTP, there is a possibility that you
1897forgot to tell FTP to do a binary mode transfer.  That absolutely
1898will cause the file to be non-decompressible.  You'll have to transfer
1899it again.
1900@end itemize
1901
1902If you've incorporated @code{libbzip2} into your own program
1903and are getting problems, please, please, please, check that the
1904parameters you are passing in calls to the library, are
1905correct, and in accordance with what the documentation says
1906is allowable.  I have tried to make the library robust against
1907such problems, but I'm sure I haven't succeeded.
1908
1909Finally, if the above comments don't help, you'll have to send
1910me a bug report.  Now, it's just amazing how many people will
1911send me a bug report saying something like
1912@display
1913   bzip2 crashed with segmentation fault on my machine
1914@end display
1915and absolutely nothing else.  Needless to say, a such a report
1916is @emph{totally, utterly, completely and comprehensively 100% useless;
1917a waste of your time, my time, and net bandwidth}.
1918With no details at all, there's no way I can possibly begin
1919to figure out what the problem is.
1920
1921The rules of the game are: facts, facts, facts.  Don't omit
1922them because "oh, they won't be relevant".  At the bare
1923minimum:
1924@display
1925   Machine type.  Operating system version.
1926   Exact version of @code{bzip2} (do @code{bzip2 -V}).
1927   Exact version of the compiler used.
1928   Flags passed to the compiler.
1929@end display
1930However, the most important single thing that will help me is
1931the file that you were trying to compress or decompress at the
1932time the problem happened.  Without that, my ability to do anything
1933more than speculate about the cause, is limited.
1934
1935Please remember that I connect to the Internet with a modem, so
1936you should contact me before mailing me huge files.
1937
1938
1939@section Did you get the right package?
1940
1941@code{bzip2} is a resource hog.  It soaks up large amounts of CPU cycles
1942and memory.  Also, it gives very large latencies.  In the worst case, you
1943can feed many megabytes of uncompressed data into the library before
1944getting any compressed output, so this probably rules out applications
1945requiring interactive behaviour.
1946
1947These aren't faults of my implementation, I hope, but more
1948an intrinsic property of the Burrows-Wheeler transform (unfortunately).
1949Maybe this isn't what you want.
1950
1951If you want a compressor and/or library which is faster, uses less
1952memory but gets pretty good compression, and has minimal latency,
1953consider Jean-loup
1954Gailly's and Mark Adler's work, @code{zlib-1.1.2} and
1955@code{gzip-1.2.4}.  Look for them at
1956@code{http://www.cdrom.com/pub/infozip/zlib} and
1957@code{http://www.gzip.org} respectively.
1958
1959For something faster and lighter still, you might try Markus F X J
1960Oberhumer's @code{LZO} real-time compression/decompression library, at
1961@* @code{http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html}.
1962
1963If you want to use the @code{bzip2} algorithms to compress small blocks
1964of data, 64k bytes or smaller, for example on an on-the-fly disk
1965compressor, you'd be well advised not to use this library.  Instead,
1966I've made a special library tuned for that kind of use.  It's part of
1967@code{e2compr-0.40}, an on-the-fly disk compressor for the Linux
1968@code{ext2} filesystem.  Look at
1969@code{http://www.netspace.net.au/~reiter/e2compr}.
1970
1971
1972
1973@section Testing
1974
1975A record of the tests I've done.
1976
1977First, some data sets:
1978@itemize @bullet
1979@item B: a directory containing a 6001 files, one for every length in the
1980      range 0 to 6000 bytes.  The files contain random lowercase
1981      letters.  18.7 megabytes.
1982@item H: my home directory tree.  Documents, source code, mail files,
1983      compressed data.  H contains B, and also a directory of
1984      files designed as boundary cases for the sorting; mostly very
1985      repetitive, nasty files.  445 megabytes.
1986@item A: directory tree holding various applications built from source:
1987      @code{egcs-1.0.2}, @code{gcc-2.8.1}, KDE Beta 4, GTK, Octave, etc.
1988      827 megabytes.
1989@item P: directory tree holding large amounts of source code (@code{.tar}
1990      files) of the entire GNU distribution, plus a couple of
1991      Linux distributions.  2400 megabytes.
1992@end itemize
1993The tests conducted are as follows.  Each test means compressing
1994(a copy of) each file in the data set, decompressing it and
1995comparing it against the original.
1996
1997First, a bunch of tests with block sizes, internal buffer
1998sizes and randomisation lengths set very small,
1999to detect any problems with the
2000blocking, buffering and randomisation mechanisms.
2001This required modifying the source code so as to try to
2002break it.
2003@enumerate
2004@item Data set H, with
2005      buffer size of 1 byte, and block size of 23 bytes.
2006@item Data set B, buffer sizes 1 byte, block size 1 byte.
2007@item As (2) but small-mode decompression (first 1700 files).
2008@item As (2) with block size 2 bytes.
2009@item As (2) with block size 3 bytes.
2010@item As (2) with block size 4 bytes.
2011@item As (2) with block size 5 bytes.
2012@item As (2) with block size 6 bytes and small-mode decompression.
2013@item H with normal buffer sizes (5000 bytes), normal block
2014      size (up to 900000 bytes), but with randomisation
2015      mechanism running intensely (randomising approximately every
2016      third byte).
2017@item As (9) with small-mode decompression.
2018@end enumerate
2019Then some tests with unmodified source code.
2020@enumerate
2021@item H, all settings normal.
2022@item As (1), with small-mode decompress.
2023@item H, compress with flag @code{-1}.
2024@item H, compress with flag @code{-s}, decompress with flag @code{-s}.
2025@item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing,
2026      @code{bzip2-0.9.0} decompressing, all settings normal.
2027@item Backwards compatibility:  H, @code{bzip2-0.9.0} compressing,
2028      @code{bzip2-0.1pl2} decompressing, all settings normal.
2029@item Bigger tests: A, all settings normal.
2030@item P, all settings normal.
2031@item Misc test: about 100 megabytes of @code{.tar} files with
2032      @code{bzip2} compiled with Purify.
2033@item Misc tests to make sure it builds and runs ok on non-Linux/x86
2034      platforms.
2035@end enumerate
2036These tests were conducted on a 205 MHz Cyrix 6x86MX machine, running
2037Linux 2.0.32.  They represent nearly a week of continuous computation.
2038All tests completed successfully.
2039
2040
2041@section Further reading
2042@code{bzip2} is not research work, in the sense that it doesn't present
2043any new ideas.  Rather, it's an engineering exercise based on existing
2044ideas.
2045
2046Four documents describe essentially all the ideas behind @code{bzip2}:
2047@example
2048Michael Burrows and D. J. Wheeler:
2049  "A block-sorting lossless data compression algorithm"
2050   10th May 1994.
2051   Digital SRC Research Report 124.
2052   ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
2053   If you have trouble finding it, try searching at the
2054   New Zealand Digital Library, http://www.nzdl.org.
2055
2056Daniel S. Hirschberg and Debra A. LeLewer
2057  "Efficient Decoding of Prefix Codes"
2058   Communications of the ACM, April 1990, Vol 33, Number 4.
2059   You might be able to get an electronic copy of this
2060      from the ACM Digital Library.
2061
2062David J. Wheeler
2063   Program bred3.c and accompanying document bred3.ps.
2064   This contains the idea behind the multi-table Huffman
2065   coding scheme.
2066   ftp://ftp.cl.cam.ac.uk/pub/user/djw3/
2067
2068Jon L. Bentley and Robert Sedgewick
2069  "Fast Algorithms for Sorting and Searching Strings"
2070   Available from Sedgewick's web page,
2071   www.cs.princeton.edu/~rs
2072@end example
2073The following paper gives valuable additional insights into the
2074algorithm, but is not immediately the basis of any code
2075used in bzip2.
2076@example
2077Peter Fenwick:
2078   Block Sorting Text Compression
2079   Proceedings of the 19th Australasian Computer Science Conference,
2080     Melbourne, Australia.  Jan 31 - Feb 2, 1996.
2081   ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
2082@end example
2083Kunihiko Sadakane's sorting algorithm, mentioned above,
2084is available from:
2085@example
2086http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
2087@end example
2088The Manber-Myers suffix array construction
2089algorithm is described in a paper
2090available from:
2091@example
2092http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
2093@end example
2094
2095
2096
2097@contents
2098
2099@bye
2100
2101