1\input texinfo @c -*- Texinfo -*- 2@setfilename bzip2.info 3 4@ignore 5This file documents bzip2 version 0.9.0c, and associated library 6libbzip2, written by Julian Seward (jseward@acm.org). 7 8Copyright (C) 1996-1998 Julian R Seward 9 10Permission is granted to make and distribute verbatim copies of 11this manual provided the copyright notice and this permission notice 12are preserved on all copies. 13 14Permission is granted to copy and distribute translations of this manual 15into another language, under the above conditions for verbatim copies. 16@end ignore 17 18@ifinfo 19@format 20START-INFO-DIR-ENTRY 21* Bzip2: (bzip2). A program and library for data compression. 22END-INFO-DIR-ENTRY 23@end format 24 25@end ifinfo 26 27@iftex 28@c @finalout 29@settitle bzip2 and libbzip2 30@titlepage 31@title bzip2 and libbzip2 32@subtitle a program and library for data compression 33@subtitle copyright (C) 1996-1998 Julian Seward 34@subtitle version 0.9.0c of 18 October 1998 35@author Julian Seward 36 37@end titlepage 38@end iftex 39 40 41@parindent 0mm 42@parskip 2mm 43 44 45This program, @code{bzip2}, 46and associated library @code{libbzip2}, are 47Copyright (C) 1996-1998 Julian R Seward. All rights reserved. 48 49Redistribution and use in source and binary forms, with or without 50modification, are permitted provided that the following conditions 51are met: 52@itemize @bullet 53@item 54 Redistributions of source code must retain the above copyright 55 notice, this list of conditions and the following disclaimer. 56@item 57 The origin of this software must not be misrepresented; you must 58 not claim that you wrote the original software. If you use this 59 software in a product, an acknowledgment in the product 60 documentation would be appreciated but is not required. 61@item 62 Altered source versions must be plainly marked as such, and must 63 not be misrepresented as being the original software. 64@item 65 The name of the author may not be used to endorse or promote 66 products derived from this software without specific prior written 67 permission. 68@end itemize 69THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS 70OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 71WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 72ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY 73DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 74DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE 75GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 76INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, 77WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 78NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 79SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 80 81Julian Seward, Guildford, Surrey, UK. 82 83@code{jseward@@acm.org} 84 85@code{http://www.muraroa.demon.co.uk} 86 87@code{bzip2}/@code{libbzip2} version 0.9.0c of 18 October 1998. 88 89PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented 90algorithms. However, I do not have the resources available to carry out 91a full patent search. Therefore I cannot give any guarantee of the 92above statement. 93 94 95 96 97 98 99 100@node Overview, Implementation, Top, Top 101@chapter Introduction 102 103@code{bzip2} compresses files using the Burrows-Wheeler 104block-sorting text compression algorithm, and Huffman coding. 105Compression is generally considerably better than that 106achieved by more conventional LZ77/LZ78-based compressors, 107and approaches the performance of the PPM family of statistical compressors. 108 109@code{bzip2} is built on top of @code{libbzip2}, a flexible library 110for handling compressed data in the @code{bzip2} format. This manual 111describes both how to use the program and 112how to work with the library interface. Most of the 113manual is devoted to this library, not the program, 114which is good news if your interest is only in the program. 115 116Chapter 2 describes how to use @code{bzip2}; this is the only part 117you need to read if you just want to know how to operate the program. 118Chapter 3 describes the programming interfaces in detail, and 119Chapter 4 records some miscellaneous notes which I thought 120ought to be recorded somewhere. 121 122 123@chapter How to use @code{bzip2} 124 125This chapter contains a copy of the @code{bzip2} man page, 126and nothing else. 127@example 128NAME 129 bzip2, bunzip2 - a block-sorting file compressor, v0.9.0 130 bzcat - decompresses files to stdout 131 bzip2recover - recovers data from damaged bzip2 files 132 133 134SYNOPSIS 135 bzip2 [ -cdfkstvzVL123456789 ] [ filenames ... ] 136 bunzip2 [ -fkvsVL ] [ filenames ... ] 137 bzcat [ -s ] [ filenames ... ] 138 bzip2recover filename 139 140 141DESCRIPTION 142 bzip2 compresses files using the Burrows-Wheeler block- 143 sorting text compression algorithm, and Huffman coding. 144 Compression is generally considerably better than that 145 achieved by more conventional LZ77/LZ78-based compressors, 146 and approaches the performance of the PPM family of sta- 147 tistical compressors. 148 149 The command-line options are deliberately very similar to 150 those of GNU Gzip, but they are not identical. 151 152 bzip2 expects a list of file names to accompany the com- 153 mand-line flags. Each file is replaced by a compressed 154 version of itself, with the name "original_name.bz2". 155 Each compressed file has the same modification date and 156 permissions as the corresponding original, so that these 157 properties can be correctly restored at decompression 158 time. File name handling is naive in the sense that there 159 is no mechanism for preserving original file names, per- 160 missions and dates in filesystems which lack these con- 161 cepts, or have serious file name length restrictions, such 162 as MS-DOS. 163 164 bzip2 and bunzip2 will by default not overwrite existing 165 files; if you want this to happen, specify the -f flag. 166 167 If no file names are specified, bzip2 compresses from 168 standard input to standard output. In this case, bzip2 169 will decline to write compressed output to a terminal, as 170 this would be entirely incomprehensible and therefore 171 pointless. 172 173 bunzip2 (or bzip2 -d ) decompresses and restores all spec- 174 ified files whose names end in ".bz2". Files without this 175 suffix are ignored. Again, supplying no filenames causes 176 decompression from standard input to standard output. 177 178 bunzip2 will correctly decompress a file which is the con- 179 catenation of two or more compressed files. The result is 180 the concatenation of the corresponding uncompressed files. 181 Integrity testing (-t) of concatenated compressed files is 182 also supported. 183 184 You can also compress or decompress files to the standard 185 output by giving the -c flag. Multiple files may be com- 186 pressed and decompressed like this. The resulting outputs 187 are fed sequentially to stdout. Compression of multiple 188 files in this manner generates a stream containing multi- 189 ple compressed file representations. Such a stream can be 190 decompressed correctly only by bzip2 version 0.9.0 or 191 later. Earlier versions of bzip2 will stop after decom- 192 pressing the first file in the stream. 193 194 bzcat (or bzip2 -dc ) decompresses all specified files to 195 the standard output. 196 197 Compression is always performed, even if the compressed 198 file is slightly larger than the original. Files of less 199 than about one hundred bytes tend to get larger, since the 200 compression mechanism has a constant overhead in the 201 region of 50 bytes. Random data (including the output of 202 most file compressors) is coded at about 8.05 bits per 203 byte, giving an expansion of around 0.5%. 204 205 As a self-check for your protection, bzip2 uses 32-bit 206 CRCs to make sure that the decompressed version of a file 207 is identical to the original. This guards against corrup- 208 tion of the compressed data, and against undetected bugs 209 in bzip2 (hopefully very unlikely). The chances of data 210 corruption going undetected is microscopic, about one 211 chance in four billion for each file processed. Be aware, 212 though, that the check occurs upon decompression, so it 213 can only tell you that that something is wrong. It can't 214 help you recover the original uncompressed data. You can 215 use bzip2recover to try to recover data from damaged 216 files. 217 218 Return values: 0 for a normal exit, 1 for environmental 219 problems (file not found, invalid flags, I/O errors, &c), 220 2 to indicate a corrupt compressed file, 3 for an internal 221 consistency error (eg, bug) which caused bzip2 to panic. 222 223 224MEMORY MANAGEMENT 225 Bzip2 compresses large files in blocks. The block size 226 affects both the compression ratio achieved, and the 227 amount of memory needed both for compression and decom- 228 pression. The flags -1 through -9 specify the block size 229 to be 100,000 bytes through 900,000 bytes (the default) 230 respectively. At decompression-time, the block size used 231 for compression is read from the header of the compressed 232 file, and bunzip2 then allocates itself just enough memory 233 to decompress the file. Since block sizes are stored in 234 compressed files, it follows that the flags -1 to -9 are 235 irrelevant to and so ignored during decompression. 236 237 Compression and decompression requirements, in bytes, can 238 be estimated as: 239 240 Compression: 400k + ( 7 x block size ) 241 242 Decompression: 100k + ( 4 x block size ), or 243 100k + ( 2.5 x block size ) 244 245 Larger block sizes give rapidly diminishing marginal 246 returns; most of the compression comes from the first two 247 or three hundred k of block size, a fact worth bearing in 248 mind when using bzip2 on small machines. It is also 249 important to appreciate that the decompression memory 250 requirement is set at compression-time by the choice of 251 block size. 252 253 For files compressed with the default 900k block size, 254 bunzip2 will require about 3700 kbytes to decompress. To 255 support decompression of any file on a 4 megabyte machine, 256 bunzip2 has an option to decompress using approximately 257 half this amount of memory, about 2300 kbytes. Decompres- 258 sion speed is also halved, so you should use this option 259 only where necessary. The relevant flag is -s. 260 261 In general, try and use the largest block size memory con- 262 straints allow, since that maximises the compression 263 achieved. Compression and decompression speed are virtu- 264 ally unaffected by block size. 265 266 Another significant point applies to files which fit in a 267 single block -- that means most files you'd encounter 268 using a large block size. The amount of real memory 269 touched is proportional to the size of the file, since the 270 file is smaller than a block. For example, compressing a 271 file 20,000 bytes long with the flag -9 will cause the 272 compressor to allocate around 6700k of memory, but only 273 touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the 274 decompressor will allocate 3700k but only touch 100k + 275 20000 * 4 = 180 kbytes. 276 277 Here is a table which summarises the maximum memory usage 278 for different block sizes. Also recorded is the total 279 compressed size for 14 files of the Calgary Text Compres- 280 sion Corpus totalling 3,141,622 bytes. This column gives 281 some feel for how compression varies with block size. 282 These figures tend to understate the advantage of larger 283 block sizes for larger files, since the Corpus is domi- 284 nated by smaller files. 285 286 Compress Decompress Decompress Corpus 287 Flag usage usage -s usage Size 288 289 -1 1100k 500k 350k 914704 290 -2 1800k 900k 600k 877703 291 -3 2500k 1300k 850k 860338 292 -4 3200k 1700k 1100k 846899 293 -5 3900k 2100k 1350k 845160 294 -6 4600k 2500k 1600k 838626 295 -7 5400k 2900k 1850k 834096 296 -8 6000k 3300k 2100k 828642 297 -9 6700k 3700k 2350k 828642 298 299 300OPTIONS 301 -c --stdout 302 Compress or decompress to standard output. -c will 303 decompress multiple files to stdout, but will only 304 compress a single file to stdout. 305 306 -d --decompress 307 Force decompression. bzip2, bunzip2 and bzcat are 308 really the same program, and the decision about 309 what actions to take is done on the basis of which 310 name is used. This flag overrides that mechanism, 311 and forces bzip2 to decompress. 312 313 -z --compress 314 The complement to -d: forces compression, regard- 315 less of the invokation name. 316 317 -t --test 318 Check integrity of the specified file(s), but don't 319 decompress them. This really performs a trial 320 decompression and throws away the result. 321 322 -f --force 323 Force overwrite of output files. Normally, bzip2 324 will not overwrite existing output files. 325 326 -k --keep 327 Keep (don't delete) input files during compression 328 or decompression. 329 330 -s --small 331 Reduce memory usage, for compression, decompression 332 and testing. Files are decompressed and tested 333 using a modified algorithm which only requires 2.5 334 bytes per block byte. This means any file can be 335 decompressed in 2300k of memory, albeit at about 336 half the normal speed. 337 338 During compression, -s selects a block size of 339 200k, which limits memory use to around the same 340 figure, at the expense of your compression ratio. 341 In short, if your machine is low on memory (8 342 megabytes or less), use -s for everything. See 343 MEMORY MANAGEMENT above. 344 345 -v --verbose 346 Verbose mode -- show the compression ratio for each 347 file processed. Further -v's increase the ver- 348 bosity level, spewing out lots of information which 349 is primarily of interest for diagnostic purposes. 350 351 -L --license -V --version 352 Display the software version, license terms and 353 conditions. 354 355 -1 to -9 356 Set the block size to 100 k, 200 k .. 900 k when 357 compressing. Has no effect when decompressing. 358 See MEMORY MANAGEMENT above. 359 360 --repetitive-fast 361 bzip2 injects some small pseudo-random variations 362 into very repetitive blocks to limit worst-case 363 performance during compression. If sorting runs 364 into difficulties, the block is randomised, and 365 sorting is restarted. Very roughly, bzip2 persists 366 for three times as long as a well-behaved input 367 would take before resorting to randomisation. This 368 flag makes it give up much sooner. 369 370 --repetitive-best 371 Opposite of --repetitive-fast; try a lot harder 372 before resorting to randomisation. 373 374 375RECOVERING DATA FROM DAMAGED FILES 376 bzip2 compresses files in blocks, usually 900kbytes long. 377 Each block is handled independently. If a media or trans- 378 mission error causes a multi-block .bz2 file to become 379 damaged, it may be possible to recover data from the 380 undamaged blocks in the file. 381 382 The compressed representation of each block is delimited 383 by a 48-bit pattern, which makes it possible to find the 384 block boundaries with reasonable certainty. Each block 385 also carries its own 32-bit CRC, so damaged blocks can be 386 distinguished from undamaged ones. 387 388 bzip2recover is a simple program whose purpose is to 389 search for blocks in .bz2 files, and write each block out 390 into its own .bz2 file. You can then use bzip2 -t to test 391 the integrity of the resulting files, and decompress those 392 which are undamaged. 393 394 bzip2recover takes a single argument, the name of the dam- 395 aged file, and writes a number of files "rec0001file.bz2", 396 "rec0002file.bz2", etc, containing the extracted blocks. 397 The output filenames are designed so that the use of 398 wildcards in subsequent processing -- for example, "bzip2 399 -dc rec*file.bz2 > recovered_data" -- lists the files in 400 the "right" order. 401 402 bzip2recover should be of most use dealing with large .bz2 403 files, as these will contain many blocks. It is clearly 404 futile to use it on damaged single-block files, since a 405 damaged block cannot be recovered. If you wish to min- 406 imise any potential data loss through media or transmis- 407 sion errors, you might consider compressing with a smaller 408 block size. 409 410 411PERFORMANCE NOTES 412 The sorting phase of compression gathers together similar 413 strings in the file. Because of this, files containing 414 very long runs of repeated symbols, like "aabaabaabaab 415 ..." (repeated several hundred times) may compress 416 extraordinarily slowly. You can use the -vvvvv option to 417 monitor progress in great detail, if you want. Decompres- 418 sion speed is unaffected. 419 420 Such pathological cases seem rare in practice, appearing 421 mostly in artificially-constructed test files, and in low- 422 level disk images. It may be inadvisable to use bzip2 to 423 compress the latter. If you do get a file which causes 424 severe slowness in compression, try making the block size 425 as small as possible, with flag -1. 426 427 bzip2 usually allocates several megabytes of memory to 428 operate in, and then charges all over it in a fairly ran- 429 dom fashion. This means that performance, both for com- 430 pressing and decompressing, is largely determined by the 431 speed at which your machine can service cache misses. 432 Because of this, small changes to the code to reduce the 433 miss rate have been observed to give disproportionately 434 large performance improvements. I imagine bzip2 will per- 435 form best on machines with very large caches. 436 437 438CAVEATS 439 I/O error messages are not as helpful as they could be. 440 Bzip2 tries hard to detect I/O errors and exit cleanly, 441 but the details of what the problem is sometimes seem 442 rather misleading. 443 444 This manual page pertains to version 0.9.0 of bzip2. Com- 445 pressed data created by this version is entirely forwards 446 and backwards compatible with the previous public release, 447 version 0.1pl2, but with the following exception: 0.9.0 448 can correctly decompress multiple concatenated compressed 449 files. 0.1pl2 cannot do this; it will stop after decom- 450 pressing just the first file in the stream. 451 452 Wildcard expansion for Windows 95 and NT is flaky. 453 454 bzip2recover uses 32-bit integers to represent bit posi- 455 tions in compressed files, so it cannot handle compressed 456 files more than 512 megabytes long. This could easily be 457 fixed. 458 459 460AUTHOR 461 Julian Seward, jseward@@acm.org. 462 463 The ideas embodied in bzip2 are due to (at least) the fol- 464 lowing people: Michael Burrows and David Wheeler (for the 465 block sorting transformation), David Wheeler (again, for 466 the Huffman coder), Peter Fenwick (for the structured cod- 467 ing model in the original bzip, and many refinements), and 468 Alistair Moffat, Radford Neal and Ian Witten (for the 469 arithmetic coder in the original bzip). I am much 470 indebted for their help, support and advice. See the man- 471 ual in the source distribution for pointers to sources of 472 documentation. Christian von Roques encouraged me to look 473 for faster sorting algorithms, so as to speed up compres- 474 sion. Bela Lubkin encouraged me to improve the worst-case 475 compression performance. Many people sent patches, helped 476 with portability problems, lent machines, gave advice and 477 were generally helpful. 478@end example 479 480 481 482 483 484@chapter Programming with @code{libbzip2} 485 486This chapter describes the programming interface to @code{libbzip2}. 487 488For general background information, particularly about memory 489use and performance aspects, you'd be well advised to read Chapter 2 490as well. 491 492@section Top-level structure 493 494@code{libbzip2} is a flexible library for compressing and decompressing 495data in the @code{bzip2} data format. Although packaged as a single 496entity, it helps to regard the library as three separate parts: the low 497level interface, and the high level interface, and some utility 498functions. 499 500The structure of @code{libbzip2}'s interfaces is similar to 501that of Jean-loup Gailly's and Mark Adler's excellent @code{zlib} 502library. 503 504@subsection Low-level summary 505 506This interface provides services for compressing and decompressing 507data in memory. There's no provision for dealing with files, streams 508or any other I/O mechanisms, just straight memory-to-memory work. 509In fact, this part of the library can be compiled without inclusion 510of @code{stdio.h}, which may be helpful for embedded applications. 511 512The low-level part of the library has no global variables and 513is therefore thread-safe. 514 515Six routines make up the low level interface: 516@code{bzCompressInit}, @code{bzCompress}, and @* @code{bzCompressEnd} 517for compression, 518and a corresponding trio @code{bzDecompressInit}, @* @code{bzDecompress} 519and @code{bzDecompressEnd} for decompression. 520The @code{*Init} functions allocate 521memory for compression/decompression and do other 522initialisations, whilst the @code{*End} functions close down operations 523and release memory. 524 525The real work is done by @code{bzCompress} and @code{bzDecompress}. 526These compress/decompress data from a user-supplied input buffer 527to a user-supplied output buffer. These buffers can be any size; 528arbitrary quantities of data are handled by making repeated calls 529to these functions. This is a flexible mechanism allowing a 530consumer-pull style of activity, or producer-push, or a mixture of 531both. 532 533 534 535@subsection High-level summary 536 537This interface provides some handy wrappers around the low-level 538interface to facilitate reading and writing @code{bzip2} format 539files (@code{.bz2} files). The routines provide hooks to facilitate 540reading files in which the @code{bzip2} data stream is embedded 541within some larger-scale file structure, or where there are 542multiple @code{bzip2} data streams concatenated end-to-end. 543 544For reading files, @code{bzReadOpen}, @code{bzRead}, @code{bzReadClose} 545and @code{bzReadGetUnused} are supplied. For writing files, 546@code{bzWriteOpen}, @code{bzWrite} and @code{bzWriteFinish} are 547available. 548 549As with the low-level library, no global variables are used 550so the library is per se thread-safe. However, if I/O errors 551occur whilst reading or writing the underlying compressed files, 552you may have to consult @code{errno} to determine the cause of 553the error. In that case, you'd need a C library which correctly 554supports @code{errno} in a multithreaded environment. 555 556To make the library a little simpler and more portable, 557@code{bzReadOpen} and @code{bzWriteOpen} require you to pass them file 558handles (@code{FILE*}s) which have previously been opened for reading or 559writing respectively. That avoids portability problems associated with 560file operations and file attributes, whilst not being much of an 561imposition on the programmer. 562 563 564 565@subsection Utility functions summary 566For very simple needs, @code{bzBuffToBuffCompress} and 567@code{bzBuffToBuffDecompress} are provided. These compress 568data in memory from one buffer to another buffer in a single 569function call. You should assess whether these functions 570fulfill your memory-to-memory compression/decompression 571requirements before investing effort in understanding the more 572general but more complex low-level interface. 573 574Yoshioka Tsuneo (@code{QWF00133@@niftyserve.or.jp} / 575@code{tsuneo-y@@is.aist-nara.ac.jp}) has contributed some functions to 576give better @code{zlib} compatibility. These functions are 577@code{bzopen}, @code{bzread}, @code{bzwrite}, @code{bzflush}, 578@code{bzclose}, 579@code{bzerror} and @code{bzlibVersion}. You may find these functions 580more convenient for simple file reading and writing, than those in the 581high-level interface. These functions are not (yet) officially part of 582the library, and are not further documented here. If they break, you 583get to keep all the pieces. I hope to document them properly when time 584permits. 585 586Yoshioka also contributed modifications to allow the library to be 587built as a Windows DLL. 588 589 590@section Error handling 591 592The library is designed to recover cleanly in all situations, including 593the worst-case situation of decompressing random data. I'm not 594100% sure that it can always do this, so you might want to add 595a signal handler to catch segmentation violations during decompression 596if you are feeling especially paranoid. I would be interested in 597hearing more about the robustness of the library to corrupted 598compressed data. 599 600The file @code{bzlib.h} contains all definitions needed to use 601the library. In particular, you should definitely not include 602@code{bzlib_private.h}. 603 604In @code{bzlib.h}, the various return values are defined. The following 605list is not intended as an exhaustive description of the circumstances 606in which a given value may be returned -- those descriptions are given 607later. Rather, it is intended to convey the rough meaning of each 608return value. The first five actions are normal and not intended to 609denote an error situation. 610@table @code 611@item BZ_OK 612The requested action was completed successfully. 613@item BZ_RUN_OK 614@itemx BZ_FLUSH_OK 615@itemx BZ_FINISH_OK 616In @code{bzCompress}, the requested flush/finish/nothing-special action 617was completed successfully. 618@item BZ_STREAM_END 619Compression of data was completed, or the logical stream end was 620detected during decompression. 621@end table 622 623The following return values indicate an error of some kind. 624@table @code 625@item BZ_SEQUENCE_ERROR 626When using the library, it is important to call the functions in the 627correct sequence and with data structures (buffers etc) in the correct 628states. @code{libbzip2} checks as much as it can to ensure this is 629happening, and returns @code{BZ_SEQUENCE_ERROR} if not. Code which 630complies precisely with the function semantics, as detailed below, 631should never receive this value; such an event denotes buggy code 632which you should investigate. 633@item BZ_PARAM_ERROR 634Returned when a parameter to a function call is out of range 635or otherwise manifestly incorrect. As with @code{BZ_SEQUENCE_ERROR}, 636this denotes a bug in the client code. The distinction between 637@code{BZ_PARAM_ERROR} and @code{BZ_SEQUENCE_ERROR} is a bit hazy, but still worth 638making. 639@item BZ_MEM_ERROR 640Returned when a request to allocate memory failed. Note that the 641quantity of memory needed to decompress a stream cannot be determined 642until the stream's header has been read. So @code{bzDecompress} and 643@code{bzRead} may return @code{BZ_MEM_ERROR} even though some of 644the compressed data has been read. The same is not true for 645compression; once @code{bzCompressInit} or @code{bzWriteOpen} have 646successfully completed, @code{BZ_MEM_ERROR} cannot occur. 647@item BZ_DATA_ERROR 648Returned when a data integrity error is detected during decompression. 649Most importantly, this means when stored and computed CRCs for the 650data do not match. This value is also returned upon detection of any 651other anomaly in the compressed data. 652@item BZ_DATA_ERROR_MAGIC 653As a special case of @code{BZ_DATA_ERROR}, it is sometimes useful to 654know when the compressed stream does not start with the correct 655magic bytes (@code{'B' 'Z' 'h'}). 656@item BZ_IO_ERROR 657Returned by @code{bzRead} and @code{bzRead} when there is an error 658reading or writing in the compressed file, and by @code{bzReadOpen} 659and @code{bzWriteOpen} for attempts to use a file for which the 660error indicator (viz, @code{ferror(f)}) is set. 661On receipt of @code{BZ_IO_ERROR}, the caller should consult 662@code{errno} and/or @code{perror} to acquire operating-system 663specific information about the problem. 664@item BZ_UNEXPECTED_EOF 665Returned by @code{bzRead} when the compressed file finishes 666before the logical end of stream is detected. 667@item BZ_OUTBUFF_FULL 668Returned by @code{bzBuffToBuffCompress} and 669@code{bzBuffToBuffDecompress} to indicate that the output data 670will not fit into the output buffer provided. 671@end table 672 673 674 675@section Low-level interface 676 677@subsection @code{bzCompressInit} 678@example 679typedef 680 struct @{ 681 char *next_in; 682 unsigned int avail_in; 683 unsigned int total_in; 684 685 char *next_out; 686 unsigned int avail_out; 687 unsigned int total_out; 688 689 void *state; 690 691 void *(*bzalloc)(void *,int,int); 692 void (*bzfree)(void *,void *); 693 void *opaque; 694 @} 695 bz_stream; 696 697int bzCompressInit ( bz_stream *strm, 698 int blockSize100k, 699 int verbosity, 700 int workFactor ); 701 702@end example 703 704Prepares for compression. The @code{bz_stream} structure 705holds all data pertaining to the compression activity. 706A @code{bz_stream} structure should be allocated and initialised 707prior to the call. 708The fields of @code{bz_stream} 709comprise the entirety of the user-visible data. @code{state} 710is a pointer to the private data structures required for compression. 711 712Custom memory allocators are supported, via fields @code{bzalloc}, 713@code{bzfree}, 714and @code{opaque}. The value 715@code{opaque} is passed to as the first argument to 716all calls to @code{bzalloc} and @code{bzfree}, but is 717otherwise ignored by the library. 718The call @code{bzalloc ( opaque, n, m )} is expected to return a 719pointer @code{p} to 720@code{n * m} bytes of memory, and @code{bzfree ( opaque, p )} 721should free 722that memory. 723 724If you don't want to use a custom memory allocator, set @code{bzalloc}, 725@code{bzfree} and 726@code{opaque} to @code{NULL}, 727and the library will then use the standard @code{malloc}/@code{free} 728routines. 729 730Before calling @code{bzCompressInit}, fields @code{bzalloc}, 731@code{bzfree} and @code{opaque} should 732be filled appropriately, as just described. Upon return, the internal 733state will have been allocated and initialised, and @code{total_in} and 734@code{total_out} will have been set to zero. 735These last two fields are used by the library 736to inform the caller of the total amount of data passed into and out of 737the library, respectively. You should not try to change them. 738 739Parameter @code{blockSize100k} specifies the block size to be used for 740compression. It should be a value between 1 and 9 inclusive, and the 741actual block size used is 100000 x this figure. 9 gives the best 742compression but takes most memory. 743 744Parameter @code{verbosity} should be set to a number between 0 and 4 745inclusive. 0 is silent, and greater numbers give increasingly verbose 746monitoring/debugging output. If the library has been compiled with 747@code{-DBZ_NO_STDIO}, no such output will appear for any verbosity 748setting. 749 750Parameter @code{workFactor} controls how the compression phase behaves 751when presented with worst case, highly repetitive, input data. 752If compression runs into difficulties caused by repetitive data, 753some pseudo-random variations are inserted into the block, and 754compression is restarted. Lower values of @code{workFactor} 755reduce the tolerance of compression to repetitive data. 756You should set this parameter carefully; too low, and 757compression ratio suffers, too high, and your average-to-worst 758case compression times can become very large. 759The default value of 30 760gives reasonable behaviour over a wide range of circumstances. 761 762Allowable values range from 0 to 250 inclusive. 0 is a special 763case, equivalent to using the default value of 30. 764 765Note that the randomisation process is entirely transparent. 766If the library decides to randomise and restart compression on a 767block, it does so without comment. Randomised blocks are 768automatically de-randomised during decompression, so data 769integrity is never compromised. 770 771Possible return values: 772@display 773 @code{BZ_PARAM_ERROR} 774 if @code{strm} is @code{NULL} 775 or @code{blockSize} < 1 or @code{blockSize} > 9 776 or @code{verbosity} < 0 or @code{verbosity} > 4 777 or @code{workFactor} < 0 or @code{workFactor} > 250 778 @code{BZ_MEM_ERROR} 779 if not enough memory is available 780 @code{BZ_OK} 781 otherwise 782@end display 783Allowable next actions: 784@display 785 @code{bzCompress} 786 if @code{BZ_OK} is returned 787 no specific action needed in case of error 788@end display 789 790@subsection @code{bzCompress} 791@example 792 int bzCompress ( bz_stream *strm, int action ); 793@end example 794Provides more input and/or output buffer space for the library. The 795caller maintains input and output buffers, and calls @code{bzCompress} to 796transfer data between them. 797 798Before each call to @code{bzCompress}, @code{next_in} should point at 799the data to be compressed, and @code{avail_in} should indicate how many 800bytes the library may read. @code{bzCompress} updates @code{next_in}, 801@code{avail_in} and @code{total_in} to reflect the number of bytes it 802has read. 803 804Similarly, @code{next_out} should point to a buffer in which the 805compressed data is to be placed, with @code{avail_out} indicating how 806much output space is available. @code{bzCompress} updates 807@code{next_out}, @code{avail_out} and @code{total_out} to reflect the 808number of bytes output. 809 810You may provide and remove as little or as much data as you like on each 811call of @code{bzCompress}. In the limit, it is acceptable to supply and 812remove data one byte at a time, although this would be terribly 813inefficient. You should always ensure that at least one byte of output 814space is available at each call. 815 816A second purpose of @code{bzCompress} is to request a change of mode of the 817compressed stream. 818 819Conceptually, a compressed stream can be in one of four states: IDLE, 820RUNNING, FLUSHING and FINISHING. Before initialisation 821(@code{bzCompressInit}) and after termination (@code{bzCompressEnd}), a 822stream is regarded as IDLE. 823 824Upon initialisation (@code{bzCompressInit}), the stream is placed in the 825RUNNING state. Subsequent calls to @code{bzCompress} should pass 826@code{BZ_RUN} as the requested action; other actions are illegal and 827will result in @code{BZ_SEQUENCE_ERROR}. 828 829At some point, the calling program will have provided all the input data 830it wants to. It will then want to finish up -- in effect, asking the 831library to process any data it might have buffered internally. In this 832state, @code{bzCompress} will no longer attempt to read data from 833@code{next_in}, but it will want to write data to @code{next_out}. 834Because the output buffer supplied by the user can be arbitrarily small, 835the finishing-up operation cannot necessarily be done with a single call 836of @code{bzCompress}. 837 838Instead, the calling program passes @code{BZ_FINISH} as an action to 839@code{bzCompress}. This changes the stream's state to FINISHING. Any 840remaining input (ie, @code{next_in[0 .. avail_in-1]}) is compressed and 841transferred to the output buffer. To do this, @code{bzCompress} must be 842called repeatedly until all the output has been consumed. At that 843point, @code{bzCompress} returns @code{BZ_STREAM_END}, and the stream's 844state is set back to IDLE. @code{bzCompressEnd} should then be 845called. 846 847Just to make sure the calling program does not cheat, the library makes 848a note of @code{avail_in} at the time of the first call to 849@code{bzCompress} which has @code{BZ_FINISH} as an action (ie, at the 850time the program has announced its intention to not supply any more 851input). By comparing this value with that of @code{avail_in} over 852subsequent calls to @code{bzCompress}, the library can detect any 853attempts to slip in more data to compress. Any calls for which this is 854detected will return @code{BZ_SEQUENCE_ERROR}. This indicates a 855programming mistake which should be corrected. 856 857Instead of asking to finish, the calling program may ask 858@code{bzCompress} to take all the remaining input, compress it and 859terminate the current (Burrows-Wheeler) compression block. This could 860be useful for error control purposes. The mechanism is analogous to 861that for finishing: call @code{bzCompress} with an action of 862@code{BZ_FLUSH}, remove output data, and persist with the 863@code{BZ_FLUSH} action until the value @code{BZ_RUN} is returned. As 864with finishing, @code{bzCompress} detects any attempt to provide more 865input data once the flush has begun. 866 867Once the flush is complete, the stream returns to the normal RUNNING 868state. 869 870This all sounds pretty complex, but isn't really. Here's a table 871which shows which actions are allowable in each state, what action 872will be taken, what the next state is, and what the non-error return 873values are. Note that you can't explicitly ask what state the 874stream is in, but nor do you need to -- it can be inferred from the 875values returned by @code{bzCompress}. 876@display 877IDLE/@code{any} 878 Illegal. IDLE state only exists after @code{bzCompressEnd} or 879 before @code{bzCompressInit}. 880 Return value = @code{BZ_SEQUENCE_ERROR} 881 882RUNNING/@code{BZ_RUN} 883 Compress from @code{next_in} to @code{next_out} as much as possible. 884 Next state = RUNNING 885 Return value = @code{BZ_RUN_OK} 886 887RUNNING/@code{BZ_FLUSH} 888 Remember current value of @code{next_in}. Compress from @code{next_in} 889 to @code{next_out} as much as possible, but do not accept any more input. 890 Next state = FLUSHING 891 Return value = @code{BZ_FLUSH_OK} 892 893RUNNING/@code{BZ_FINISH} 894 Remember current value of @code{next_in}. Compress from @code{next_in} 895 to @code{next_out} as much as possible, but do not accept any more input. 896 Next state = FINISHING 897 Return value = @code{BZ_FINISH_OK} 898 899FLUSHING/@code{BZ_FLUSH} 900 Compress from @code{next_in} to @code{next_out} as much as possible, 901 but do not accept any more input. 902 If all the existing input has been used up and all compressed 903 output has been removed 904 Next state = RUNNING; Return value = @code{BZ_RUN_OK} 905 else 906 Next state = FLUSHING; Return value = @code{BZ_FLUSH_OK} 907 908FLUSHING/other 909 Illegal. 910 Return value = @code{BZ_SEQUENCE_ERROR} 911 912FINISHING/@code{BZ_FINISH} 913 Compress from @code{next_in} to @code{next_out} as much as possible, 914 but to not accept any more input. 915 If all the existing input has been used up and all compressed 916 output has been removed 917 Next state = IDLE; Return value = @code{BZ_STREAM_END} 918 else 919 Next state = FINISHING; Return value = @code{BZ_FINISHING} 920 921FINISHING/other 922 Illegal. 923 Return value = @code{BZ_SEQUENCE_ERROR} 924@end display 925 926That still looks complicated? Well, fair enough. The usual sequence 927of calls for compressing a load of data is: 928@itemize @bullet 929@item Get started with @code{bzCompressInit}. 930@item Shovel data in and shlurp out its compressed form using zero or more 931calls of @code{bzCompress} with action = @code{BZ_RUN}. 932@item Finish up. 933Repeatedly call @code{bzCompress} with action = @code{BZ_FINISH}, 934copying out the compressed output, until @code{BZ_STREAM_END} is returned. 935@item Close up and go home. Call @code{bzCompressEnd}. 936@end itemize 937If the data you want to compress fits into your input buffer all 938at once, you can skip the calls of @code{bzCompress ( ..., BZ_RUN )} and 939just do the @code{bzCompress ( ..., BZ_FINISH )} calls. 940 941All required memory is allocated by @code{bzCompressInit}. The 942compression library can accept any data at all (obviously). So you 943shouldn't get any error return values from the @code{bzCompress} calls. 944If you do, they will be @code{BZ_SEQUENCE_ERROR}, and indicate a bug in 945your programming. 946 947Trivial other possible return values: 948@display 949 @code{BZ_PARAM_ERROR} 950 if @code{strm} is @code{NULL}, or @code{strm->s} is @code{NULL} 951@end display 952 953@subsection @code{bzCompressEnd} 954@example 955int bzCompressEnd ( bz_stream *strm ); 956@end example 957Releases all memory associated with a compression stream. 958 959Possible return values: 960@display 961 @code{BZ_PARAM_ERROR} if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} 962 @code{BZ_OK} otherwise 963@end display 964 965 966@subsection @code{bzDecompressInit} 967@example 968int bzDecompressInit ( bz_stream *strm, int verbosity, int small ); 969@end example 970Prepares for decompression. As with @code{bzCompressInit}, a 971@code{bz_stream} record should be allocated and initialised before the 972call. Fields @code{bzalloc}, @code{bzfree} and @code{opaque} should be 973set if a custom memory allocator is required, or made @code{NULL} for 974the normal @code{malloc}/@code{free} routines. Upon return, the internal 975state will have been initialised, and @code{total_in} and 976@code{total_out} will be zero. 977 978For the meaning of parameter @code{verbosity}, see @code{bzCompressInit}. 979 980If @code{small} is nonzero, the library will use an alternative 981decompression algorithm which uses less memory but at the cost of 982decompressing more slowly (roughly speaking, half the speed, but the 983maximum memory requirement drops to around 2300k). See Chapter 2 for 984more information on memory management. 985 986Note that the amount of memory needed to decompress 987a stream cannot be determined until the stream's header has been read, 988so even if @code{bzDecompressInit} succeeds, a subsequent 989@code{bzDecompress} could fail with @code{BZ_MEM_ERROR}. 990 991Possible return values: 992@display 993 @code{BZ_PARAM_ERROR} 994 if @code{(small != 0 && small != 1)} 995 or @code{(verbosity < 0 || verbosity > 4)} 996 @code{BZ_MEM_ERROR} 997 if insufficient memory is available 998@end display 999 1000Allowable next actions: 1001@display 1002 @code{bzDecompress} 1003 if @code{BZ_OK} was returned 1004 no specific action required in case of error 1005@end display 1006 1007 1008 1009@subsection @code{bzDecompress} 1010@example 1011int bzDecompress ( bz_stream *strm ); 1012@end example 1013Provides more input and/out output buffer space for the library. The 1014caller maintains input and output buffers, and uses @code{bzDecompress} 1015to transfer data between them. 1016 1017Before each call to @code{bzDecompress}, @code{next_in} 1018should point at the compressed data, 1019and @code{avail_in} should indicate how many bytes the library 1020may read. @code{bzDecompress} updates @code{next_in}, @code{avail_in} 1021and @code{total_in} 1022to reflect the number of bytes it has read. 1023 1024Similarly, @code{next_out} should point to a buffer in which the uncompressed 1025output is to be placed, with @code{avail_out} indicating how much output space 1026is available. @code{bzCompress} updates @code{next_out}, 1027@code{avail_out} and @code{total_out} to reflect 1028the number of bytes output. 1029 1030You may provide and remove as little or as much data as you like on 1031each call of @code{bzDecompress}. 1032In the limit, it is acceptable to 1033supply and remove data one byte at a time, although this would be 1034terribly inefficient. You should always ensure that at least one 1035byte of output space is available at each call. 1036 1037Use of @code{bzDecompress} is simpler than @code{bzCompress}. 1038 1039You should provide input and remove output as described above, and 1040repeatedly call @code{bzDecompress} until @code{BZ_STREAM_END} is 1041returned. Appearance of @code{BZ_STREAM_END} denotes that 1042@code{bzDecompress} has detected the logical end of the compressed 1043stream. @code{bzDecompress} will not produce @code{BZ_STREAM_END} until 1044all output data has been placed into the output buffer, so once 1045@code{BZ_STREAM_END} appears, you are guaranteed to have available all 1046the decompressed output, and @code{bzDecompressEnd} can safely be 1047called. 1048 1049If case of an error return value, you should call @code{bzDecompressEnd} 1050to clean up and release memory. 1051 1052Possible return values: 1053@display 1054 @code{BZ_PARAM_ERROR} 1055 if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} 1056 or @code{strm->avail_out < 1} 1057 @code{BZ_DATA_ERROR} 1058 if a data integrity error is detected in the compressed stream 1059 @code{BZ_DATA_ERROR_MAGIC} 1060 if the compressed stream doesn't begin with the right magic bytes 1061 @code{BZ_MEM_ERROR} 1062 if there wasn't enough memory available 1063 @code{BZ_STREAM_END} 1064 if the logical end of the data stream was detected and all 1065 output in has been consumed, eg @code{s->avail_out > 0} 1066 @code{BZ_OK} 1067 otherwise 1068@end display 1069Allowable next actions: 1070@display 1071 @code{bzDecompress} 1072 if @code{BZ_OK} was returned 1073 @code{bzDecompressEnd} 1074 otherwise 1075@end display 1076 1077 1078@subsection @code{bzDecompressEnd} 1079@example 1080int bzDecompressEnd ( bz_stream *strm ); 1081@end example 1082Releases all memory associated with a decompression stream. 1083 1084Possible return values: 1085@display 1086 @code{BZ_PARAM_ERROR} 1087 if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} 1088 @code{BZ_OK} 1089 otherwise 1090@end display 1091 1092Allowable next actions: 1093@display 1094 None. 1095@end display 1096 1097 1098@section High-level interface 1099 1100This interface provides functions for reading and writing 1101@code{bzip2} format files. First, some general points. 1102 1103@itemize @bullet 1104@item All of the functions take an @code{int*} first argument, 1105 @code{bzerror}. 1106 After each call, @code{bzerror} should be consulted first to determine 1107 the outcome of the call. If @code{bzerror} is @code{BZ_OK}, 1108 the call completed 1109 successfully, and only then should the return value of the function 1110 (if any) be consulted. If @code{bzerror} is @code{BZ_IO_ERROR}, 1111 there was an error 1112 reading/writing the underlying compressed file, and you should 1113 then consult @code{errno}/@code{perror} to determine the 1114 cause of the difficulty. 1115 @code{bzerror} may also be set to various other values; precise details are 1116 given on a per-function basis below. 1117@item If @code{bzerror} indicates an error 1118 (ie, anything except @code{BZ_OK} and @code{BZ_STREAM_END}), 1119 you should immediately call @code{bzReadClose} (or @code{bzWriteClose}, 1120 depending on whether you are attempting to read or to write) 1121 to free up all resources associated 1122 with the stream. Once an error has been indicated, behaviour of all calls 1123 except @code{bzReadClose} (@code{bzWriteClose}) is undefined. 1124 The implication is that (1) @code{bzerror} should 1125 be checked after each call, and (2) if @code{bzerror} indicates an error, 1126 @code{bzReadClose} (@code{bzWriteClose}) should then be called to clean up. 1127@item The @code{FILE*} arguments passed to 1128 @code{bzReadOpen}/@code{bzWriteOpen} 1129 should be set to binary mode. 1130 Most Unix systems will do this by default, but other platforms, 1131 including Windows and Mac, will not. If you omit this, you may 1132 encounter problems when moving code to new platforms. 1133@item Memory allocation requests are handled by 1134 @code{malloc}/@code{free}. 1135 At present 1136 there is no facility for user-defined memory allocators in the file I/O 1137 functions (could easily be added, though). 1138@end itemize 1139 1140 1141 1142@subsection @code{bzReadOpen} 1143@example 1144 typedef void BZFILE; 1145 1146 BZFILE *bzReadOpen ( int *bzerror, FILE *f, 1147 int small, int verbosity, 1148 void *unused, int nUnused ); 1149@end example 1150Prepare to read compressed data from file handle @code{f}. @code{f} 1151should refer to a file which has been opened for reading, and for which 1152the error indicator (@code{ferror(f)})is not set. If @code{small} is 1, 1153the library will try to decompress using less memory, at the expense of 1154speed. 1155 1156For reasons explained below, @code{bzRead} will decompress the 1157@code{nUnused} bytes starting at @code{unused}, before starting to read 1158from the file @code{f}. At most @code{BZ_MAX_UNUSED} bytes may be 1159supplied like this. If this facility is not required, you should pass 1160@code{NULL} and @code{0} for @code{unused} and n@code{Unused} 1161respectively. 1162 1163For the meaning of parameters @code{small} and @code{verbosity}, 1164see @code{bzDecompressInit}. 1165 1166The amount of memory needed to decompress a file cannot be determined 1167until the file's header has been read. So it is possible that 1168@code{bzReadOpen} returns @code{BZ_OK} but a subsequent call of 1169@code{bzRead} will return @code{BZ_MEM_ERROR}. 1170 1171Possible assignments to @code{bzerror}: 1172@display 1173 @code{BZ_PARAM_ERROR} 1174 if @code{f} is @code{NULL} 1175 or @code{small} is neither @code{0} nor @code{1} 1176 or @code{(unused == NULL && nUnused != 0)} 1177 or @code{(unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED))} 1178 @code{BZ_IO_ERROR} 1179 if @code{ferror(f)} is nonzero 1180 @code{BZ_MEM_ERROR} 1181 if insufficient memory is available 1182 @code{BZ_OK} 1183 otherwise. 1184@end display 1185 1186Possible return values: 1187@display 1188 Pointer to an abstract @code{BZFILE} 1189 if @code{bzerror} is @code{BZ_OK} 1190 @code{NULL} 1191 otherwise 1192@end display 1193 1194Allowable next actions: 1195@display 1196 @code{bzRead} 1197 if @code{bzerror} is @code{BZ_OK} 1198 @code{bzClose} 1199 otherwise 1200@end display 1201 1202 1203@subsection @code{bzRead} 1204@example 1205 int bzRead ( int *bzerror, BZFILE *b, void *buf, int len ); 1206@end example 1207Reads up to @code{len} (uncompressed) bytes from the compressed file 1208@code{b} into 1209the buffer @code{buf}. If the read was successful, 1210@code{bzerror} is set to @code{BZ_OK} 1211and the number of bytes read is returned. If the logical end-of-stream 1212was detected, @code{bzerror} will be set to @code{BZ_STREAM_END}, 1213and the number 1214of bytes read is returned. All other @code{bzerror} values denote an error. 1215 1216@code{bzRead} will supply @code{len} bytes, 1217unless the logical stream end is detected 1218or an error occurs. Because of this, it is possible to detect the 1219stream end by observing when the number of bytes returned is 1220less than the number 1221requested. Nevertheless, this is regarded as inadvisable; you should 1222instead check @code{bzerror} after every call and watch out for 1223@code{BZ_STREAM_END}. 1224 1225Internally, @code{bzRead} copies data from the compressed file in chunks 1226of size @code{BZ_MAX_UNUSED} bytes 1227before decompressing it. If the file contains more bytes than strictly 1228needed to reach the logical end-of-stream, @code{bzRead} will almost certainly 1229read some of the trailing data before signalling @code{BZ_SEQUENCE_END}. 1230To collect the read but unused data once @code{BZ_SEQUENCE_END} has 1231appeared, call @code{bzReadGetUnused} immediately before @code{bzReadClose}. 1232 1233Possible assignments to @code{bzerror}: 1234@display 1235 @code{BZ_PARAM_ERROR} 1236 if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0} 1237 @code{BZ_SEQUENCE_ERROR} 1238 if @code{b} was opened with @code{bzWriteOpen} 1239 @code{BZ_IO_ERROR} 1240 if there is an error reading from the compressed file 1241 @code{BZ_UNEXPECTED_EOF} 1242 if the compressed file ended before the logical end-of-stream was detected 1243 @code{BZ_DATA_ERROR} 1244 if a data integrity error was detected in the compressed stream 1245 @code{BZ_DATA_ERROR_MAGIC} 1246 if the stream does not begin with the requisite header bytes (ie, is not 1247 a @code{bzip2} data file). This is really a special case of @code{BZ_DATA_ERROR}. 1248 @code{BZ_MEM_ERROR} 1249 if insufficient memory was available 1250 @code{BZ_STREAM_END} 1251 if the logical end of stream was detected. 1252 @code{BZ_OK} 1253 otherwise. 1254@end display 1255 1256Possible return values: 1257@display 1258 number of bytes read 1259 if @code{bzerror} is @code{BZ_OK} or @code{BZ_STREAM_END} 1260 undefined 1261 otherwise 1262@end display 1263 1264Allowable next actions: 1265@display 1266 collect data from @code{buf}, then @code{bzRead} or @code{bzReadClose} 1267 if @code{bzerror} is @code{BZ_OK} 1268 collect data from @code{buf}, then @code{bzReadClose} or @code{bzReadGetUnused} 1269 if @code{bzerror} is @code{BZ_SEQUENCE_END} 1270 @code{bzReadClose} 1271 otherwise 1272@end display 1273 1274 1275 1276@subsection @code{bzReadGetUnused} 1277@example 1278 void bzReadGetUnused ( int* bzerror, BZFILE *b, 1279 void** unused, int* nUnused ); 1280@end example 1281Returns data which was read from the compressed file but was not needed 1282to get to the logical end-of-stream. @code{*unused} is set to the address 1283of the data, and @code{*nUnused} to the number of bytes. @code{*nUnused} will 1284be set to a value between @code{0} and @code{BZ_MAX_UNUSED} inclusive. 1285 1286This function may only be called once @code{bzRead} has signalled 1287@code{BZ_STREAM_END} but before @code{bzReadClose}. 1288 1289Possible assignments to @code{bzerror}: 1290@display 1291 @code{BZ_PARAM_ERROR} 1292 if @code{b} is @code{NULL} 1293 or @code{unused} is @code{NULL} or @code{nUnused} is @code{NULL} 1294 @code{BZ_SEQUENCE_ERROR} 1295 if @code{BZ_STREAM_END} has not been signalled 1296 or if @code{b} was opened with @code{bzWriteOpen} 1297 @code{BZ_OK} 1298 otherwise 1299@end display 1300 1301Allowable next actions: 1302@display 1303 @code{bzReadClose} 1304@end display 1305 1306 1307@subsection @code{bzReadClose} 1308@example 1309 void bzReadClose ( int *bzerror, BZFILE *b ); 1310@end example 1311Releases all memory pertaining to the compressed file @code{b}. 1312@code{bzReadClose} does not call @code{fclose} on the underlying file 1313handle, so you should do that yourself if appropriate. 1314@code{bzReadClose} should be called to clean up after all error 1315situations. 1316 1317Possible assignments to @code{bzerror}: 1318@display 1319 @code{BZ_SEQUENCE_ERROR} 1320 if @code{b} was opened with @code{bzOpenWrite} 1321 @code{BZ_OK} 1322 otherwise 1323@end display 1324 1325Allowable next actions: 1326@display 1327 none 1328@end display 1329 1330 1331 1332@subsection @code{bzWriteOpen} 1333@example 1334 BZFILE *bzWriteOpen ( int *bzerror, FILE *f, 1335 int blockSize100k, int verbosity, 1336 int workFactor ); 1337@end example 1338Prepare to write compressed data to file handle @code{f}. 1339@code{f} should refer to 1340a file which has been opened for writing, and for which the error 1341indicator (@code{ferror(f)})is not set. 1342 1343For the meaning of parameters @code{blockSize100k}, 1344@code{verbosity} and @code{workFactor}, see 1345@* @code{bzCompressInit}. 1346 1347All required memory is allocated at this stage, so if the call 1348completes successfully, @code{BZ_MEM_ERROR} cannot be signalled by a 1349subsequent call to @code{bzWrite}. 1350 1351Possible assignments to @code{bzerror}: 1352@display 1353 @code{BZ_PARAM_ERROR} 1354 if @code{f} is @code{NULL} 1355 or @code{blockSize100k < 1} or @code{blockSize100k > 9} 1356 @code{BZ_IO_ERROR} 1357 if @code{ferror(f)} is nonzero 1358 @code{BZ_MEM_ERROR} 1359 if insufficient memory is available 1360 @code{BZ_OK} 1361 otherwise 1362@end display 1363 1364Possible return values: 1365@display 1366 Pointer to an abstract @code{BZFILE} 1367 if @code{bzerror} is @code{BZ_OK} 1368 @code{NULL} 1369 otherwise 1370@end display 1371 1372Allowable next actions: 1373@display 1374 @code{bzWrite} 1375 if @code{bzerror} is @code{BZ_OK} 1376 (you could go directly to @code{bzWriteClose}, but this would be pretty pointless) 1377 @code{bzWriteClose} 1378 otherwise 1379@end display 1380 1381 1382 1383@subsection @code{bzWrite} 1384@example 1385 void bzWrite ( int *bzerror, BZFILE *b, void *buf, int len ); 1386@end example 1387Absorbs @code{len} bytes from the buffer @code{buf}, eventually to be 1388compressed and written to the file. 1389 1390Possible assignments to @code{bzerror}: 1391@display 1392 @code{BZ_PARAM_ERROR} 1393 if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0} 1394 @code{BZ_SEQUENCE_ERROR} 1395 if b was opened with @code{bzReadOpen} 1396 @code{BZ_IO_ERROR} 1397 if there is an error writing the compressed file. 1398 @code{BZ_OK} 1399 otherwise 1400@end display 1401 1402 1403 1404 1405@subsection @code{bzWriteClose} 1406@example 1407 int bzWriteClose ( int *bzerror, BZFILE* f, 1408 int abandon, 1409 unsigned int* nbytes_in, 1410 unsigned int* nbytes_out ); 1411@end example 1412 1413Compresses and flushes to the compressed file all data so far supplied 1414by @code{bzWrite}. The logical end-of-stream markers are also written, so 1415subsequent calls to @code{bzWrite} are illegal. All memory associated 1416with the compressed file @code{b} is released. 1417@code{fflush} is called on the 1418compressed file, but it is not @code{fclose}'d. 1419 1420If @code{bzWriteClose} is called to clean up after an error, the only 1421action is to release the memory. The library records the error codes 1422issued by previous calls, so this situation will be detected 1423automatically. There is no attempt to complete the compression 1424operation, nor to @code{fflush} the compressed file. You can force this 1425behaviour to happen even in the case of no error, by passing a nonzero 1426value to @code{abandon}. 1427 1428If @code{nbytes_in} is non-null, @code{*nbytes_in} will be set to be the 1429total volume of uncompressed data handled. Similarly, @code{nbytes_out} 1430will be set to the total volume of compressed data written. 1431 1432Possible assignments to @code{bzerror}: 1433@display 1434 @code{BZ_SEQUENCE_ERROR} 1435 if @code{b} was opened with @code{bzReadOpen} 1436 @code{BZ_IO_ERROR} 1437 if there is an error writing the compressed file 1438 @code{BZ_OK} 1439 otherwise 1440@end display 1441 1442@subsection Handling embedded compressed data streams 1443 1444The high-level library facilitates use of 1445@code{bzip2} data streams which form some part of a surrounding, larger 1446data stream. 1447@itemize @bullet 1448@item For writing, the library takes an open file handle, writes 1449compressed data to it, @code{fflush}es it but does not @code{fclose} it. 1450The calling application can write its own data before and after the 1451compressed data stream, using that same file handle. 1452@item Reading is more complex, and the facilities are not as general 1453as they could be since generality is hard to reconcile with efficiency. 1454@code{bzRead} reads from the compressed file in blocks of size 1455@code{BZ_MAX_UNUSED} bytes, and in doing so probably will overshoot 1456the logical end of compressed stream. 1457To recover this data once decompression has 1458ended, call @code{bzReadGetUnused} after the last call of @code{bzRead} 1459(the one returning @code{BZ_STREAM_END}) but before calling 1460@code{bzReadClose}. 1461@end itemize 1462 1463This mechanism makes it easy to decompress multiple @code{bzip2} 1464streams placed end-to-end. As the end of one stream, when @code{bzRead} 1465returns @code{BZ_STREAM_END}, call @code{bzReadGetUnused} to collect the 1466unused data (copy it into your own buffer somewhere). 1467That data forms the start of the next compressed stream. 1468To start uncompressing that next stream, call @code{bzReadOpen} again, 1469feeding in the unused data via the @code{unused}/@code{nUnused} 1470parameters. 1471Keep doing this until @code{BZ_STREAM_END} return coincides with the 1472physical end of file (@code{feof(f)}). In this situation 1473@code{bzReadGetUnused} 1474will of course return no data. 1475 1476This should give some feel for how the high-level interface can be used. 1477If you require extra flexibility, you'll have to bite the bullet and get 1478to grips with the low-level interface. 1479 1480@subsection Standard file-reading/writing code 1481Here's how you'd write data to a compressed file: 1482@example @code 1483FILE* f; 1484BZFILE* b; 1485int nBuf; 1486char buf[ /* whatever size you like */ ]; 1487int bzerror; 1488int nWritten; 1489 1490f = fopen ( "myfile.bz2", "w" ); 1491if (!f) @{ 1492 /* handle error */ 1493@} 1494b = bzWriteOpen ( &bzerror, f, 9 ); 1495if (bzerror != BZ_OK) @{ 1496 bzWriteClose ( b ); 1497 /* handle error */ 1498@} 1499 1500while ( /* condition */ ) @{ 1501 /* get data to write into buf, and set nBuf appropriately */ 1502 nWritten = bzWrite ( &bzerror, b, buf, nBuf ); 1503 if (bzerror == BZ_IO_ERROR) @{ 1504 bzWriteClose ( &bzerror, b ); 1505 /* handle error */ 1506 @} 1507@} 1508 1509bzWriteClose ( &bzerror, b ); 1510if (bzerror == BZ_IO_ERROR) @{ 1511 /* handle error */ 1512@} 1513@end example 1514And to read from a compressed file: 1515@example 1516FILE* f; 1517BZFILE* b; 1518int nBuf; 1519char buf[ /* whatever size you like */ ]; 1520int bzerror; 1521int nWritten; 1522 1523f = fopen ( "myfile.bz2", "r" ); 1524if (!f) @{ 1525 /* handle error */ 1526@} 1527b = bzReadOpen ( &bzerror, f, 0, NULL, 0 ); 1528if (bzerror != BZ_OK) @{ 1529 bzReadClose ( &bzerror, b ); 1530 /* handle error */ 1531@} 1532 1533bzerror = BZ_OK; 1534while (bzerror == BZ_OK && /* arbitrary other conditions */) @{ 1535 nBuf = bzRead ( &bzerror, b, buf, /* size of buf */ ); 1536 if (bzerror == BZ_OK) @{ 1537 /* do something with buf[0 .. nBuf-1] */ 1538 @} 1539@} 1540if (bzerror != BZ_STREAM_END) @{ 1541 bzReadClose ( &bzerror, b ); 1542 /* handle error */ 1543@} else @{ 1544 bzReadClose ( &bzerror ); 1545@} 1546@end example 1547 1548 1549 1550@section Utility functions 1551@subsection @code{bzBuffToBuffCompress} 1552@example 1553 int bzBuffToBuffCompress( char* dest, 1554 unsigned int* destLen, 1555 char* source, 1556 unsigned int sourceLen, 1557 int blockSize100k, 1558 int verbosity, 1559 int workFactor ); 1560@end example 1561Attempts to compress the data in @code{source[0 .. sourceLen-1]} 1562into the destination buffer, @code{dest[0 .. *destLen-1]}. 1563If the destination buffer is big enough, @code{*destLen} is 1564set to the size of the compressed data, and @code{BZ_OK} is 1565returned. If the compressed data won't fit, @code{*destLen} 1566is unchanged, and @code{BZ_OUTBUFF_FULL} is returned. 1567 1568Compression in this manner is a one-shot event, done with a single call 1569to this function. The resulting compressed data is a complete 1570@code{bzip2} format data stream. There is no mechanism for making 1571additional calls to provide extra input data. If you want that kind of 1572mechanism, use the low-level interface. 1573 1574For the meaning of parameters @code{blockSize100k}, @code{verbosity} 1575and @code{workFactor}, @* see @code{bzCompressInit}. 1576 1577To guarantee that the compressed data will fit in its buffer, allocate 1578an output buffer of size 1% larger than the uncompressed data, plus 1579six hundred extra bytes. 1580 1581@code{bzBuffToBuffDecompress} will not write data at or 1582beyond @code{dest[*destLen]}, even in case of buffer overflow. 1583 1584Possible return values: 1585@display 1586 @code{BZ_PARAM_ERROR} 1587 if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL} 1588 or @code{blockSize100k < 1} or @code{blockSize100k > 9} 1589 or @code{verbosity < 0} or @code{verbosity > 4} 1590 or @code{workFactor < 0} or @code{workFactor > 250} 1591 @code{BZ_MEM_ERROR} 1592 if insufficient memory is available 1593 @code{BZ_OUTBUFF_FULL} 1594 if the size of the compressed data exceeds @code{*destLen} 1595 @code{BZ_OK} 1596 otherwise 1597@end display 1598 1599 1600 1601@subsection @code{bzBuffToBuffDecompress} 1602@example 1603 int bzBuffToBuffDecompress ( char* dest, 1604 unsigned int* destLen, 1605 char* source, 1606 unsigned int sourceLen, 1607 int small, 1608 int verbosity ); 1609@end example 1610Attempts to decompress the data in @code{source[0 .. sourceLen-1]} 1611into the destination buffer, @code{dest[0 .. *destLen-1]}. 1612If the destination buffer is big enough, @code{*destLen} is 1613set to the size of the uncompressed data, and @code{BZ_OK} is 1614returned. If the compressed data won't fit, @code{*destLen} 1615is unchanged, and @code{BZ_OUTBUFF_FULL} is returned. 1616 1617@code{source} is assumed to hold a complete @code{bzip2} format 1618data stream. @code{bzBuffToBuffDecompress} tries to decompress 1619the entirety of the stream into the output buffer. 1620 1621For the meaning of parameters @code{small} and @code{verbosity}, 1622see @code{bzDecompressInit}. 1623 1624Because the compression ratio of the compressed data cannot be known in 1625advance, there is no easy way to guarantee that the output buffer will 1626be big enough. You may of course make arrangements in your code to 1627record the size of the uncompressed data, but such a mechanism is beyond 1628the scope of this library. 1629 1630@code{bzBuffToBuffDecompress} will not write data at or 1631beyond @code{dest[*destLen]}, even in case of buffer overflow. 1632 1633Possible return values: 1634@display 1635 @code{BZ_PARAM_ERROR} 1636 if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL} 1637 or @code{small != 0 && small != 1} 1638 or @code{verbosity < 0} or @code{verbosity > 4} 1639 @code{BZ_MEM_ERROR} 1640 if insufficient memory is available 1641 @code{BZ_OUTBUFF_FULL} 1642 if the size of the compressed data exceeds @code{*destLen} 1643 @code{BZ_DATA_ERROR} 1644 if a data integrity error was detected in the compressed data 1645 @code{BZ_DATA_ERROR_MAGIC} 1646 if the compressed data doesn't begin with the right magic bytes 1647 @code{BZ_UNEXPECTED_EOF} 1648 if the compressed data ends unexpectedly 1649 @code{BZ_OK} 1650 otherwise 1651@end display 1652 1653 1654 1655@section Using the library in a @code{stdio}-free environment 1656 1657@subsection Getting rid of @code{stdio} 1658 1659In a deeply embedded application, you might want to use just 1660the memory-to-memory functions. You can do this conveniently 1661by compiling the library with preprocessor symbol @code{BZ_NO_STDIO} 1662defined. Doing this gives you a library containing only the following 1663eight functions: 1664 1665@code{bzCompressInit}, @code{bzCompress}, @code{bzCompressEnd} @* 1666@code{bzDecompressInit}, @code{bzDecompress}, @code{bzDecompressEnd} @* 1667@code{bzBuffToBuffCompress}, @code{bzBuffToBuffDecompress} 1668 1669When compiled like this, all functions will ignore @code{verbosity} 1670settings. 1671 1672@subsection Critical error handling 1673@code{libbzip2} contains a number of internal assertion checks which 1674should, needless to say, never be activated. Nevertheless, if an 1675assertion should fail, behaviour depends on whether or not the library 1676was compiled with @code{BZ_NO_STDIO} set. 1677 1678For a normal compile, an assertion failure yields the message 1679@example 1680 bzip2/libbzip2, v0.9.0: internal error number N. 1681 This is a bug in bzip2/libbzip2, v0.9.0. Please report 1682 it to me at: jseward@@acm.org. If this happened when 1683 you were using some program which uses libbzip2 as a 1684 component, you should also report this bug to the author(s) 1685 of that program. Please make an effort to report this bug; 1686 timely and accurate bug reports eventually lead to higher 1687 quality software. Thx. Julian Seward, 27 June 1998. 1688@end example 1689where @code{N} is some error code number. @code{exit(3)} 1690is then called. 1691 1692For a @code{stdio}-free library, assertion failures result 1693in a call to a function declared as: 1694@example 1695 extern void bz_internal_error ( int errcode ); 1696@end example 1697The relevant code is passed as a parameter. You should supply 1698such a function. 1699 1700In either case, once an assertion failure has occurred, any 1701@code{bz_stream} records involved can be regarded as invalid. 1702You should not attempt to resume normal operation with them. 1703 1704You may, of course, change critical error handling to suit 1705your needs. As I said above, critical errors indicate bugs 1706in the library and should not occur. All "normal" error 1707situations are indicated via error return codes from functions, 1708and can be recovered from. 1709 1710 1711@section Making a Windows DLL 1712Everything related to Windows has been contributed by Yoshioka Tsuneo 1713@* (@code{QWF00133@@niftyserve.or.jp} / 1714@code{tsuneo-y@@is.aist-nara.ac.jp}), so you should send your queries to 1715him (but perhaps Cc: me, @code{jseward@@acm.org}). 1716 1717My vague understanding of what to do is: using Visual C++ 5.0, 1718open the project file @code{libbz2.dsp}, and build. That's all. 1719 1720If you can't 1721open the project file for some reason, make a new one, naming these files: 1722@code{blocksort.c}, @code{bzlib.c}, @code{compress.c}, 1723@code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @* 1724@code{randtable.c} and @code{libbz2.def}. You might also need 1725to name the header files @code{bzlib.h} and @code{bzlib_private.h}. 1726 1727If you don't use VC++, you may need to define the proprocessor symbol 1728@code{_WIN32}. 1729 1730Finally, @code{dlltest.c} is a sample program using the DLL. It has a 1731project file, @code{dlltest.dsp}. 1732 1733I haven't tried any of this stuff myself, but it all looks plausible. 1734 1735 1736 1737@chapter Miscellanea 1738 1739These are just some random thoughts of mine. Your mileage may 1740vary. 1741 1742@section Limitations of the compressed file format 1743@code{bzip2-0.9.0} uses exactly the same file format as the previous 1744version, @code{bzip2-0.1}. This decision was made in the interests of 1745stability. Creating yet another incompatible compressed file format 1746would create further confusion and disruption for users. 1747 1748Nevertheless, this is not a painless decision. Development 1749work since the release of @code{bzip2-0.1} in August 1997 1750has shown complexities in the file format which slow down 1751decompression and, in retrospect, are unnecessary. These are: 1752@itemize @bullet 1753@item The run-length encoder, which is the first of the 1754 compression transformations, is entirely irrelevant. 1755 The original purpose was to protect the sorting algorithm 1756 from the very worst case input: a string of repeated 1757 symbols. But algorithm steps Q6a and Q6b in the original 1758 Burrows-Wheeler technical report (SRC-124) show how 1759 repeats can be handled without difficulty in block 1760 sorting. 1761@item The randomisation mechanism doesn't really need to be 1762 there. Udi Manber and Gene Myers published a suffix 1763 array construction algorithm a few years back, which 1764 can be employed to sort any block, no matter how 1765 repetitive, in O(N log N) time. Subsequent work by 1766 Kunihiko Sadakane has produced a derivative O(N (log N)^2) 1767 algorithm which usually outperforms the Manber-Myers 1768 algorithm. 1769 1770 I could have changed to Sadakane's algorithm, but I find 1771 it to be slower than @code{bzip2}'s existing algorithm for 1772 most inputs, and the randomisation mechanism protects 1773 adequately against bad cases. I didn't think it was 1774 a good tradeoff to make. Partly this is due to the fact 1775 that I was not flooded with email complaints about 1776 @code{bzip2-0.1}'s performance on repetitive data, so 1777 perhaps it isn't a problem for real inputs. 1778 1779 Probably the best long-term solution 1780 is to use the existing sorting 1781 algorithm initially, and fall back to a O(N (log N)^2) 1782 algorithm if the standard algorithm gets into difficulties. 1783 This can be done without much difficulty; I made 1784 a prototype implementation of it some months now. 1785@item The compressed file format was never designed to be 1786 handled by a library, and I have had to jump though 1787 some hoops to produce an efficient implementation of 1788 decompression. It's a bit hairy. Try passing 1789 @code{decompress.c} through the C preprocessor 1790 and you'll see what I mean. Much of this complexity 1791 could have been avoided if the compressed size of 1792 each block of data was recorded in the data stream. 1793@item An Adler-32 checksum, rather than a CRC32 checksum, 1794 would be faster to compute. 1795@end itemize 1796It would be fair to say that the @code{bzip2} format was frozen 1797before I properly and fully understood the performance 1798consequences of doing so. 1799 1800Improvements which I have been able to incorporate into 18010.9.0, despite using the same file format, are: 1802@itemize @bullet 1803@item Single array implementation of the inverse BWT. This 1804 significantly speeds up decompression, presumably 1805 because it reduces the number of cache misses. 1806@item Faster inverse MTF transform for large MTF values. The 1807 new implementation is based on the notion of sliding blocks 1808 of values. 1809@item @code{bzip2-0.9.0} now reads and writes files with @code{fread} 1810 and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}. 1811 Duh! I'm embarrassed at my own moronicness (moronicity?) on this 1812 one. 1813 1814@end itemize 1815Further ahead, it would be nice 1816to be able to do random access into files. This will 1817require some careful design of compressed file formats. 1818 1819 1820 1821@section Portability issues 1822After some consideration, I have decided not to use 1823GNU @code{autoconf} to configure 0.9.0. 1824 1825@code{autoconf}, admirable and wonderful though it is, 1826mainly assists with portability problems between Unix-like 1827platforms. But @code{bzip2} doesn't have much in the way 1828of portability problems on Unix; most of the difficulties appear 1829when porting to the Mac, or to Microsoft's operating systems. 1830@code{autoconf} doesn't help in those cases, and brings in a 1831whole load of new complexity. 1832 1833Most people should be able to compile the library and program 1834under Unix straight out-of-the-box, so to speak, especially 1835if you have a version of GNU C available. 1836 1837There are a couple of @code{__inline__} directives in the code. GNU C 1838(@code{gcc}) should be able to handle them. If your compiler doesn't 1839like them, just @code{#define} @code{__inline__} to be null. One 1840easy way to do this is to compile with the flag @code{-D__inline__=}, 1841which should be understood by most Unix compilers. 1842 1843If you still have difficulties, try compiling with the macro 1844@code{BZ_STRICT_ANSI} defined. This should enable you to build the 1845library in a strictly ANSI compliant environment. Building the program 1846itself like this is dangerous and not supported, since you remove 1847@code{bzip2}'s checks against compressing directories, symbolic links, 1848devices, and other not-really-a-file entities. This could cause 1849filesystem corruption! 1850 1851One other thing: if you create a @code{bzip2} binary for public 1852distribution, please try and link it statically (@code{gcc -s}). This 1853avoids all sorts of library-version issues that others may encounter 1854later on. 1855 1856 1857@section Reporting bugs 1858I tried pretty hard to make sure @code{bzip2} is 1859bug free, both by design and by testing. Hopefully 1860you'll never need to read this section for real. 1861 1862Nevertheless, if @code{bzip2} dies with a segmentation 1863fault, a bus error or an internal assertion failure, it 1864will ask you to email me a bug report. Experience with 1865version 0.1 shows that almost all these problems can 1866be traced to either compiler bugs or hardware problems. 1867@itemize @bullet 1868@item 1869Recompile the program with no optimisation, and see if it 1870works. And/or try a different compiler. 1871I heard all sorts of stories about various flavours 1872of GNU C (and other compilers) generating bad code for 1873@code{bzip2}, and I've run across two such examples myself. 1874 18752.7.X versions of GNU C are known to generate bad code from 1876time to time, at high optimisation levels. 1877If you get problems, try using the flags 1878@code{-O2} @code{-fomit-frame-pointer} @code{-fno-strength-reduce}. 1879You should specifically @emph{not} use @code{-funroll-loops}. 1880 1881You may notice that the Makefile runs four tests as part of 1882the build process. If the program passes all of these, it's 1883a pretty good (but not 100%) indication that the compiler has 1884done its job correctly. 1885@item 1886If @code{bzip2} crashes randomly, and the crashes are not 1887repeatable, you may have a flaky memory subsystem. @code{bzip2} 1888really hammers your memory hierarchy, and if it's a bit marginal, 1889you may get these problems. Ditto if your disk or I/O subsystem 1890is slowly failing. Yup, this really does happen. 1891 1892Try using a different machine of the same type, and see if 1893you can repeat the problem. 1894@item This isn't really a bug, but ... If @code{bzip2} tells 1895you your file is corrupted on decompression, and you 1896obtained the file via FTP, there is a possibility that you 1897forgot to tell FTP to do a binary mode transfer. That absolutely 1898will cause the file to be non-decompressible. You'll have to transfer 1899it again. 1900@end itemize 1901 1902If you've incorporated @code{libbzip2} into your own program 1903and are getting problems, please, please, please, check that the 1904parameters you are passing in calls to the library, are 1905correct, and in accordance with what the documentation says 1906is allowable. I have tried to make the library robust against 1907such problems, but I'm sure I haven't succeeded. 1908 1909Finally, if the above comments don't help, you'll have to send 1910me a bug report. Now, it's just amazing how many people will 1911send me a bug report saying something like 1912@display 1913 bzip2 crashed with segmentation fault on my machine 1914@end display 1915and absolutely nothing else. Needless to say, a such a report 1916is @emph{totally, utterly, completely and comprehensively 100% useless; 1917a waste of your time, my time, and net bandwidth}. 1918With no details at all, there's no way I can possibly begin 1919to figure out what the problem is. 1920 1921The rules of the game are: facts, facts, facts. Don't omit 1922them because "oh, they won't be relevant". At the bare 1923minimum: 1924@display 1925 Machine type. Operating system version. 1926 Exact version of @code{bzip2} (do @code{bzip2 -V}). 1927 Exact version of the compiler used. 1928 Flags passed to the compiler. 1929@end display 1930However, the most important single thing that will help me is 1931the file that you were trying to compress or decompress at the 1932time the problem happened. Without that, my ability to do anything 1933more than speculate about the cause, is limited. 1934 1935Please remember that I connect to the Internet with a modem, so 1936you should contact me before mailing me huge files. 1937 1938 1939@section Did you get the right package? 1940 1941@code{bzip2} is a resource hog. It soaks up large amounts of CPU cycles 1942and memory. Also, it gives very large latencies. In the worst case, you 1943can feed many megabytes of uncompressed data into the library before 1944getting any compressed output, so this probably rules out applications 1945requiring interactive behaviour. 1946 1947These aren't faults of my implementation, I hope, but more 1948an intrinsic property of the Burrows-Wheeler transform (unfortunately). 1949Maybe this isn't what you want. 1950 1951If you want a compressor and/or library which is faster, uses less 1952memory but gets pretty good compression, and has minimal latency, 1953consider Jean-loup 1954Gailly's and Mark Adler's work, @code{zlib-1.1.2} and 1955@code{gzip-1.2.4}. Look for them at 1956@code{http://www.cdrom.com/pub/infozip/zlib} and 1957@code{http://www.gzip.org} respectively. 1958 1959For something faster and lighter still, you might try Markus F X J 1960Oberhumer's @code{LZO} real-time compression/decompression library, at 1961@* @code{http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html}. 1962 1963If you want to use the @code{bzip2} algorithms to compress small blocks 1964of data, 64k bytes or smaller, for example on an on-the-fly disk 1965compressor, you'd be well advised not to use this library. Instead, 1966I've made a special library tuned for that kind of use. It's part of 1967@code{e2compr-0.40}, an on-the-fly disk compressor for the Linux 1968@code{ext2} filesystem. Look at 1969@code{http://www.netspace.net.au/~reiter/e2compr}. 1970 1971 1972 1973@section Testing 1974 1975A record of the tests I've done. 1976 1977First, some data sets: 1978@itemize @bullet 1979@item B: a directory containing a 6001 files, one for every length in the 1980 range 0 to 6000 bytes. The files contain random lowercase 1981 letters. 18.7 megabytes. 1982@item H: my home directory tree. Documents, source code, mail files, 1983 compressed data. H contains B, and also a directory of 1984 files designed as boundary cases for the sorting; mostly very 1985 repetitive, nasty files. 445 megabytes. 1986@item A: directory tree holding various applications built from source: 1987 @code{egcs-1.0.2}, @code{gcc-2.8.1}, KDE Beta 4, GTK, Octave, etc. 1988 827 megabytes. 1989@item P: directory tree holding large amounts of source code (@code{.tar} 1990 files) of the entire GNU distribution, plus a couple of 1991 Linux distributions. 2400 megabytes. 1992@end itemize 1993The tests conducted are as follows. Each test means compressing 1994(a copy of) each file in the data set, decompressing it and 1995comparing it against the original. 1996 1997First, a bunch of tests with block sizes, internal buffer 1998sizes and randomisation lengths set very small, 1999to detect any problems with the 2000blocking, buffering and randomisation mechanisms. 2001This required modifying the source code so as to try to 2002break it. 2003@enumerate 2004@item Data set H, with 2005 buffer size of 1 byte, and block size of 23 bytes. 2006@item Data set B, buffer sizes 1 byte, block size 1 byte. 2007@item As (2) but small-mode decompression (first 1700 files). 2008@item As (2) with block size 2 bytes. 2009@item As (2) with block size 3 bytes. 2010@item As (2) with block size 4 bytes. 2011@item As (2) with block size 5 bytes. 2012@item As (2) with block size 6 bytes and small-mode decompression. 2013@item H with normal buffer sizes (5000 bytes), normal block 2014 size (up to 900000 bytes), but with randomisation 2015 mechanism running intensely (randomising approximately every 2016 third byte). 2017@item As (9) with small-mode decompression. 2018@end enumerate 2019Then some tests with unmodified source code. 2020@enumerate 2021@item H, all settings normal. 2022@item As (1), with small-mode decompress. 2023@item H, compress with flag @code{-1}. 2024@item H, compress with flag @code{-s}, decompress with flag @code{-s}. 2025@item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing, 2026 @code{bzip2-0.9.0} decompressing, all settings normal. 2027@item Backwards compatibility: H, @code{bzip2-0.9.0} compressing, 2028 @code{bzip2-0.1pl2} decompressing, all settings normal. 2029@item Bigger tests: A, all settings normal. 2030@item P, all settings normal. 2031@item Misc test: about 100 megabytes of @code{.tar} files with 2032 @code{bzip2} compiled with Purify. 2033@item Misc tests to make sure it builds and runs ok on non-Linux/x86 2034 platforms. 2035@end enumerate 2036These tests were conducted on a 205 MHz Cyrix 6x86MX machine, running 2037Linux 2.0.32. They represent nearly a week of continuous computation. 2038All tests completed successfully. 2039 2040 2041@section Further reading 2042@code{bzip2} is not research work, in the sense that it doesn't present 2043any new ideas. Rather, it's an engineering exercise based on existing 2044ideas. 2045 2046Four documents describe essentially all the ideas behind @code{bzip2}: 2047@example 2048Michael Burrows and D. J. Wheeler: 2049 "A block-sorting lossless data compression algorithm" 2050 10th May 1994. 2051 Digital SRC Research Report 124. 2052 ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz 2053 If you have trouble finding it, try searching at the 2054 New Zealand Digital Library, http://www.nzdl.org. 2055 2056Daniel S. Hirschberg and Debra A. LeLewer 2057 "Efficient Decoding of Prefix Codes" 2058 Communications of the ACM, April 1990, Vol 33, Number 4. 2059 You might be able to get an electronic copy of this 2060 from the ACM Digital Library. 2061 2062David J. Wheeler 2063 Program bred3.c and accompanying document bred3.ps. 2064 This contains the idea behind the multi-table Huffman 2065 coding scheme. 2066 ftp://ftp.cl.cam.ac.uk/pub/user/djw3/ 2067 2068Jon L. Bentley and Robert Sedgewick 2069 "Fast Algorithms for Sorting and Searching Strings" 2070 Available from Sedgewick's web page, 2071 www.cs.princeton.edu/~rs 2072@end example 2073The following paper gives valuable additional insights into the 2074algorithm, but is not immediately the basis of any code 2075used in bzip2. 2076@example 2077Peter Fenwick: 2078 Block Sorting Text Compression 2079 Proceedings of the 19th Australasian Computer Science Conference, 2080 Melbourne, Australia. Jan 31 - Feb 2, 1996. 2081 ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps 2082@end example 2083Kunihiko Sadakane's sorting algorithm, mentioned above, 2084is available from: 2085@example 2086http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz 2087@end example 2088The Manber-Myers suffix array construction 2089algorithm is described in a paper 2090available from: 2091@example 2092http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps 2093@end example 2094 2095 2096 2097@contents 2098 2099@bye 2100 2101