1\input texinfo @c -*-texinfo-*-
2@c %**start of header
3@setfilename lziprecover.info
4@documentencoding ISO-8859-15
5@settitle Lziprecover Manual
6@finalout
7@c %**end of header
8
9@set UPDATED 2 January 2021
10@set VERSION 1.22
11
12@dircategory Data Compression
13@direntry
14* Lziprecover: (lziprecover).   Data recovery tool for the lzip format
15@end direntry
16
17
18@ifnothtml
19@titlepage
20@title Lziprecover
21@subtitle Data recovery tool for the lzip format
22@subtitle for Lziprecover version @value{VERSION}, @value{UPDATED}
23@author by Antonio Diaz Diaz
24
25@page
26@vskip 0pt plus 1filll
27@end titlepage
28
29@contents
30@end ifnothtml
31
32@ifnottex
33@node Top
34@top
35
36This manual is for Lziprecover (version @value{VERSION}, @value{UPDATED}).
37
38@menu
39* Introduction::            Purpose and features of lziprecover
40* Invoking lziprecover::    Command line interface
41* Data safety::             Protecting data from accidental loss
42* Repairing one byte::      Fixing bit flips and similar errors
43* Merging files::           Fixing several damaged copies
44* Reproducing one sector::  Fixing a missing (zeroed) sector
45* Tarlz::                   Options supporting the tar.lz format
46* File names::              Names of the files produced by lziprecover
47* File format::             Detailed format of the compressed file
48* Trailing data::           Extra data appended to the file
49* Examples::                A small tutorial with examples
50* Unzcrash::                Testing the robustness of decompressors
51* Problems::                Reporting bugs
52* Concept index::           Index of concepts
53@end menu
54
55@sp 1
56Copyright @copyright{} 2009-2021 Antonio Diaz Diaz.
57
58This manual is free documentation: you have unlimited permission to copy,
59distribute, and modify it.
60@end ifnottex
61
62
63@node Introduction
64@chapter Introduction
65@cindex introduction
66
67@uref{http://www.nongnu.org/lzip/lziprecover.html,,Lziprecover}
68is a data recovery tool and decompressor for files in the lzip
69compressed data format (.lz). Lziprecover is able to repair slightly damaged
70files, produce a correct file by merging the good parts of two or more
71damaged copies, reproduce a missing (zeroed) sector using a reference file,
72extract data from damaged files, decompress files, and test integrity of
73files.
74
75Lziprecover can remove the damaged members from multimember files, for
76example multimember tar.lz archives.
77
78Lziprecover provides random access to the data in multimember files; it only
79decompresses the members containing the desired data.
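
As a rough illustration of the random access described above, the following
Python sketch picks the members whose decompressed data overlap a requested
byte range. The function name @code{members_for_range} and the list of
per-member decompressed sizes are assumptions made for this example; they
are not part of lziprecover.

@verbatim
```python
from itertools import accumulate


def members_for_range(member_sizes, begin, end):
    """Return the 1-based numbers of the members holding bytes begin..end-1.

    member_sizes is a hypothetical list with the decompressed size of each
    member, in file order.
    """
    ends = list(accumulate(member_sizes))   # exclusive end offset of each member
    starts = [0] + ends[:-1]                # start offset of each member
    return [number
            for number, (start, stop) in enumerate(zip(starts, ends), start=1)
            if stop > start and start < end and stop > begin]


if __name__ == "__main__":
    # Three members of 100, 50, and 200 decompressed bytes.
    print(members_for_range([100, 50, 200], 120, 160))
```
@end verbatim

Only the members returned would need to be decompressed to serve the
request; the rest of the file could be skipped.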
80
81Lziprecover facilitates the management of metadata stored as trailing data
82in lzip files.
83
84Lziprecover is not a replacement for regular backups, but a last line of
85defense for the case where the backups are also damaged.
86
87The lzip file format is designed for data sharing and long-term archiving,
88taking into account both data integrity and decoder availability:
89
90@itemize @bullet
91@item
92The lzip format provides very safe integrity checking and some data
93recovery means. The program lziprecover can repair bit flip errors
94(one of the most common forms of data corruption) in lzip files, and
95provides data recovery capabilities, including error-checked merging
96of damaged copies of a file. @xref{Data safety}.
97
98@item
99The lzip format is as simple as possible (but not simpler). The lzip
100manual provides the source code of a simple decompressor along with a
detailed explanation of how it works, so that with only the help of the
102lzip manual it would be possible for a digital archaeologist to extract
103the data from a lzip file long after quantum computers eventually render
104LZMA obsolete.
105
106@item
107Additionally the lzip reference implementation is copylefted, which
108guarantees that it will remain free forever.
109@end itemize
110
A nice feature of the lzip format is that a corrupt byte is easier to repair
the nearer it is to the beginning of the file. Therefore, with the help of
113lziprecover, losing an entire archive just because of a corrupt byte near
114the beginning is a thing of the past.
115
116Compression may be good for long-term archiving. For compressible data,
117multiple compressed copies may provide redundancy in a more useful form and
118may have a better chance of surviving intact than one uncompressed copy
using the same amount of storage space. This is especially true if the format
120provides recovery capabilities like those of lziprecover, which is able to
121find and combine the good parts of several damaged copies.
122
123Lziprecover is able to recover or decompress files produced by any of the
compressors in the lzip family: lzip, plzip, minilzip/lzlib, clzip, and
125pdlzip.
126
127If the cause of file corruption is a damaged medium, the combination
128@w{GNU ddrescue + lziprecover} is the recommended option for recovering data
129from damaged lzip files. @xref{ddrescue-example}, and
130@ref{ddrescue-example2}, for examples.
131
132If a file is too damaged for lziprecover to repair it, all the recoverable
133data in all members of the file can be extracted with the following command
134(the resulting file may contain errors and some garbage data may be produced
135at the end of each member):
136
137@example
138lziprecover -cd -i file.lz > file
139@end example
140
141When recovering data, lziprecover takes as arguments the names of the
142damaged files and writes zero or more recovered files depending on the
143operation selected and whether the recovery succeeded or not. The damaged
144files themselves are kept unchanged.
145
146When decompressing or testing file integrity, lziprecover behaves like lzip
147or lunzip.
148
149LANGUAGE NOTE: Uncompressed = not compressed = plain data; it may never have
150been compressed. Decompressed is used to refer to data which have undergone
151the process of decompression.
152
153
154@node Invoking lziprecover
155@chapter Invoking lziprecover
156@cindex invoking
157@cindex options
158@cindex usage
159@cindex version
160
161The format for running lziprecover is:
162
163@example
164lziprecover [@var{options}] [@var{files}]
165@end example
166
167@noindent
168When decompressing or testing, a hyphen @samp{-} used as a @var{file}
169argument means standard input. It can be mixed with other @var{files} and is
170read just once, the first time it appears in the command line. If no file
171names are specified, lziprecover decompresses from standard input to
172standard output.
173
Lziprecover supports the following
175@uref{http://www.nongnu.org/arg-parser/manual/arg_parser_manual.html#Argument-syntax,,options}:
176@ifnothtml
177@xref{Argument syntax,,,arg_parser}.
178@end ifnothtml
179
180@table @code
181@item -h
182@itemx --help
183Print an informative help message describing the options and exit.
184
185@item -V
186@itemx --version
187Print the version number of lziprecover on the standard output and exit.
188This version number should be included in all bug reports.
189
190@anchor{--trailing-error}
191@item -a
192@itemx --trailing-error
193Exit with error status 2 if any remaining input is detected after
194decompressing the last member. Such remaining input is usually trailing
195garbage that can be safely ignored. @xref{concat-example}.
196
197@item -A
198@itemx --alone-to-lz
199Convert lzma-alone files to lzip format without recompressing, just
200adding a lzip header and trailer. The conversion minimizes the
201dictionary size of the resulting file (and therefore the amount of
202memory required to decompress it). Only streamed files with default LZMA
203properties can be converted; non-streamed lzma-alone files lack the end
204of stream marker required in lzip files.
205
206The name of the converted lzip file is derived from that of the original
207lzma-alone file as follows:
208
209@multitable {filename.lzma} {becomes} {anyothername.lz}
210@item filename.lzma @tab becomes @tab filename.lz
211@item filename.tlz  @tab becomes @tab filename.tar.lz
212@item anyothername  @tab becomes @tab anyothername.lz
213@end multitable
214
215@item -c
216@itemx --stdout
217Write decompressed data to standard output; keep input files unchanged. This
218option (or @samp{-o}) is needed when reading from a named pipe (fifo) or
219from a device. Use it also to recover as much of the decompressed data as
220possible when decompressing a corrupt file. @samp{-c} overrides @samp{-o},
221but @samp{-c} has no effect when merging, removing members, repairing,
222reproducing, splitting, testing or listing.
223
224@item -d
225@itemx --decompress
226Decompress the files specified. If a file does not exist or can't be
227opened, lziprecover continues decompressing the rest of the files. If a file
228fails to decompress, or is a terminal, lziprecover exits immediately without
229decompressing the rest of the files.
230
231@item -D @var{range}
232@itemx --range-decompress=@var{range}
233Decompress only a range of bytes starting at decompressed byte position
234@var{begin} and up to byte position @w{@var{end} - 1}. Byte positions start
235at 0. This option provides random access to the data in multimember files;
236it only decompresses the members containing the desired data. In order to
237guarantee the correctness of the data produced, all members containing any
238part of the desired data are decompressed and their integrity is verified.
239
240@anchor{range-format}
241Four formats of @var{range} are recognized, @samp{@var{begin}},
242@samp{@var{begin}-@var{end}}, @samp{@var{begin},@var{size}}, and
243@samp{,@var{size}}. If only @var{begin} is specified, @var{end} is taken as
244the end of the file. If only @var{size} is specified, @var{begin} is taken
245as the beginning of the file. The bytes produced are sent to standard output
246unless the option @samp{--output} is used.
247
248@anchor{--reproduce}
249@item -e
250@itemx --reproduce
251Try to recover a missing (zeroed) sector in @var{file} using a reference
252file and the same version of lzip that created @var{file}. If successful, a
253repaired copy is written to the file @samp{@var{file}_fixed.lz}. @var{file}
254is not modified at all. The exit status is 0 if the member containing the
255zeroed sector could be repaired, 2 otherwise. Note that
256@samp{@var{file}_fixed.lz} may still contain errors in the members following
257the one repaired. @xref{Reproducing one sector}, for a complete description
258of the reproduce mode.
259
260@item --lzip-level=@var{digit}|a|m[@var{length}]
261Try only the given compression level or match length limit when reproducing
262a zeroed sector. @samp{--lzip-level=a} tries all the compression levels
263@w{(0 to 9)}, while @samp{--lzip-level=m} tries all the match length limits
264@w{(5 to 273)}.
265
266@item --lzip-name=@var{name}
267Set the name of the lzip executable used by @samp{--reproduce}. If
268@samp{--lzip-name} is not specified, @samp{lzip} is used.
269
270@item --reference-file=@var{file}
271Set the reference file used by @samp{--reproduce}. It must contain the
272uncompressed data corresponding to the missing compressed data of the zeroed
273sector, plus some context data before and after them.
274
275@item -f
276@itemx --force
277Force overwrite of output files.
278
279@item -i
280@itemx --ignore-errors
281Make @samp{--decompress}, @samp{--test}, and @samp{--range-decompress}
282ignore format and data errors and continue decompressing the remaining
283members in the file; keep input files unchanged. For example, the commands
284@w{@samp{lziprecover -cd -i file.lz > file}} or
285@w{@samp{lziprecover -D0 -i file.lz > file}} decompress all the recoverable
286data in all members of @samp{file.lz} without having to split it first. The
287@w{@samp{-cd -i}} method resyncs to the next member header after each error,
288and is immune to some format errors that make @w{@samp{-D0 -i}} fail. The
289range decompressed may be smaller than the range requested, because of the
290errors.
291
292Make @samp{--list}, @samp{--dump}, @samp{--remove}, and @samp{--strip}
ignore format errors. The sizes of the members with errors (especially the
294last) may be wrong. The exit status is set to 0 unless other errors are
295found (I/O errors, for example).
296
297@item -k
298@itemx --keep
299Keep (don't delete) input files during decompression.
300
301@item -l
302@itemx --list
303Print the uncompressed size, compressed size, and percentage saved of the
304files specified. Trailing data are ignored. The values produced are correct
305even for multimember files. If more than one file is given, a final line
306containing the cumulative sizes is printed. With @samp{-v}, the dictionary
307size, the number of members in the file, and the amount of trailing data (if
308any) are also printed. With @samp{-vv}, the positions and sizes of each
309member in multimember files are also printed. With @samp{-i}, format errors
310are ignored, and with @samp{-ivv}, gaps between members are shown. The
311member numbers shown coincide with the file numbers produced by
312@samp{--split}.
313
314@samp{-lq} can be used to verify quickly (without decompressing) the
315structural integrity of the files specified. (Use @samp{--test} to verify
316the data integrity). @samp{-alq} additionally verifies that none of the
317files specified contain trailing data.
318
319@item -m
320@itemx --merge
321Try to produce a correct file by merging the good parts of two or more
322damaged copies. If successful, a repaired copy is written to the file
323@samp{@var{file}_fixed.lz}. The exit status is 0 if a correct file could
324be produced, 2 otherwise. @xref{Merging files}, for a complete
325description of the merge mode.
326
327@item -o @var{file}
328@itemx --output=@var{file}
329Place the output into @var{file} instead of into @samp{@var{file}_fixed.lz}.
330If splitting, the names of the files produced are in the form
331@samp{rec01@var{file}}, @samp{rec02@var{file}}, etc.
332
333If decompressing, or converting lzma-alone files, and @samp{-c} has not been
334also specified, write the decompressed or converted output to @var{file};
335keep input files unchanged. This option (or @samp{-c}) is needed when
336reading from a named pipe (fifo) or from a device. @w{@samp{-o -}} is
337equivalent to @samp{-c}. @samp{-o} has no effect when testing or listing.
338
339@item -q
340@itemx --quiet
341Quiet operation. Suppress all messages.
342
343@anchor{--repair}
344@item -R
345@itemx --repair
346Try to repair a @var{file} with small errors (up to one single-byte error
347per member). If successful, a repaired copy is written to the file
348@samp{@var{file}_fixed.lz}. @var{file} is not modified at all. The exit
349status is 0 if the file could be repaired, 2 otherwise. @xref{Repairing one
350byte}, for a complete description of the repair mode.
351
352@item -s
353@itemx --split
354Search for members in @var{file} and write each member in its own file. Gaps
355between members are detected and each gap is saved in its own file. Trailing
356data (if any) are saved alone in the last file. You can then use
357@w{@samp{lziprecover -t}} to test the integrity of the resulting files,
358decompress those which are undamaged, and try to repair or partially
359decompress those which are damaged. Gaps may contain garbage or may be
360members with corrupt headers or trailers. If other lziprecover functions
361fail to work on a multimember @var{file} because of damage in headers or
362trailers, try to split @var{file} and then work on each member individually.
363
364The names of the files produced are in the form @samp{rec01@var{file}},
365@samp{rec02@var{file}}, etc, and are designed so that the use of wildcards
366in subsequent processing, for example,
367@w{@samp{lziprecover -cd rec*@var{file} > recovered_data}}, processes the
368files in the correct order. The number of digits used in the names varies
369depending on the number of members in @var{file}.
370
371@item -t
372@itemx --test
373Check integrity of the files specified, but don't decompress them. This
374really performs a trial decompression and throws away the result. Use it
375together with @samp{-v} to see information about the files. If a file
376fails the test, does not exist, can't be opened, or is a terminal, lziprecover
377continues checking the rest of the files. A final diagnostic is shown at
378verbosity level 1 or higher if any file fails the test when testing
379multiple files.
380
381@item -v
382@itemx --verbose
383Verbose mode.@*
384When decompressing or testing, further -v's (up to 4) increase the
385verbosity level, showing status, compression ratio, dictionary size,
386trailer contents (CRC, data size, member size), and up to 6 bytes of
387trailing data (if any) both in hexadecimal and as a string of printable
388ASCII characters.@*
389Two or more @samp{-v} options show the progress of decompression.@*
390In other modes, increasing verbosity levels show final status, progress
391of operations, and extra information (for example, the failed areas).
392
393@item --loose-trailing
394When decompressing, testing, or listing, allow trailing data whose first
395bytes are so similar to the magic bytes of a lzip header that they can
be confused with a corrupt header. Use this option if a file triggers a
"corrupt header" error and the cause is not actually a corrupt header.
398
399@item --dump=[@var{member_list}][:damaged][:tdata]
400Dump the members listed, the damaged members (if any), or the trailing
401data (if any) of one or more regular multimember files to standard
402output, or to a file if the option @samp{--output} is used. If more than
403one file is given, the elements dumped from all files are concatenated.
404If a file does not exist, can't be opened, or is not regular,
405lziprecover continues processing the rest of the files. If the dump
406fails in one file, lziprecover exits immediately without processing the
407rest of the files.
408
The argument to @samp{--dump} is a colon-separated list of the following
element specifiers: a member list (1,3-6), a reverse member list
(r1,3-6), and the strings "damaged" and "tdata" (which may be shortened
to 'd' and 't' respectively). A member list selects the members (or
gaps) listed, whose numbers coincide with those shown by @samp{--list}.
A reverse member list selects the members listed counting from the last
member in the file (r1). Negated versions of both kinds of lists exist
(^1,3-6:r^1,3-6), which select all the members except those in the list.
417The strings "damaged" and "tdata" select the damaged members and the
418trailing data respectively. If the same member is selected more than
419once, for example by @samp{1:r1} in a single-member file, it is dumped
420just once. See the following examples:
421
422@multitable {@code{3,12:damaged:tdata}} {members 3, 12, damaged members, trailing data}
423@headitem @code{--dump} argument @tab Elements dumped
424@item @code{1,3-6}               @tab members 1, 3, 4, 5, 6
425@item @code{r1-3}                @tab last 3 members in file
426@item @code{^13,15}              @tab all but 13th and 15th members in file
427@item @code{r^1}                 @tab all but last member in file
428@item @code{damaged}             @tab all damaged members in file
429@item @code{tdata}               @tab trailing data
430@item @code{1-5:r1:tdata}        @tab members 1 to 5, last member, trailing data
431@item @code{damaged:tdata}       @tab damaged members, trailing data
432@item @code{3,12:damaged:tdata}  @tab members 3, 12, damaged members, trailing data
433@end multitable
434
435@item --remove=[@var{member_list}][:damaged][:tdata]
436Remove the members listed, the damaged members (if any), or the trailing
437data (if any) from regular multimember files in place. The date of each
438file is preserved if possible. If all members in a file are selected to
439be removed, the file is left unchanged and the exit status is set to 2.
440If a file does not exist, can't be opened, is not regular, or is left
441unchanged, lziprecover continues processing the rest of the files. In case
442of I/O error, lziprecover exits immediately without processing the rest of
443the files. See @samp{--dump} above for a description of the argument.
444
445This option may be dangerous even if only the trailing data is being
446removed because the file may be corrupt or the trailing data may contain
447a forbidden combination of characters. @xref{Trailing data}. It is
448advisable to make a backup before attempting the removal. At least
449verify that @w{@samp{lzip -cd file.lz | wc -c}} and the uncompressed
450size shown by @w{@samp{lzip -l file.lz}} match before attempting the
451removal of trailing data.
452
453@item --strip=[@var{member_list}][:damaged][:tdata]
454Copy one or more regular multimember files to standard output (or to a
455file if the option @samp{--output} is used), stripping the members
456listed, the damaged members (if any), or the trailing data (if any) from
457each file. If all members in a file are selected to be stripped, the
458trailing data (if any) are also stripped even if @samp{tdata} is not
459specified. If more than one file is given, the files are concatenated.
460In this case the trailing data are also stripped from all but the last
461file even if @samp{tdata} is not specified. If a file does not exist,
462can't be opened, or is not regular, lziprecover continues processing the
463rest of the files. If a file fails to copy, lziprecover exits
464immediately without processing the rest of the files. See @samp{--dump}
465above for a description of the argument.
466
467@end table
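
The four @var{range} formats accepted by @samp{--range-decompress} can be
sketched in Python as follows. This is only an illustration under stated
assumptions: the function name @code{parse_range} is hypothetical, the
numbers are plain integers without multipliers, and the error handling is
not taken from lziprecover.

@verbatim
```python
def parse_range(text, data_size):
    """Turn a range string into (begin, end), with end exclusive.

    Accepts 'begin', 'begin-end', 'begin,size', and ',size'.
    data_size is the decompressed size of the file, used when only
    begin is given.
    """
    if "," in text:
        first, size = text.split(",", 1)
        begin = int(first) if first else 0      # ',size' starts at byte 0
        end = begin + int(size)
    elif "-" in text:
        first, last = text.split("-", 1)
        begin, end = int(first), int(last)
    else:
        begin, end = int(text), data_size       # 'begin' runs to end of file
    if begin < 0 or end <= begin:
        raise ValueError("empty or invalid range: " + text)
    return begin, end


if __name__ == "__main__":
    for spec in ("100", "100-200", "100,50", ",50"):
        print(spec, parse_range(spec, 1000))
```
@end verbatim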
468
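The member selection syntax shared by @samp{--dump}, @samp{--remove}, and
@samp{--strip} may be easier to follow with a small model. The Python sketch
below expands an argument into member numbers for a file with a known number
of members. The names @code{expand_list} and @code{select_members} are
hypothetical, and ignoring member numbers beyond the end of the file is an
assumption of this example, not documented behavior.

@verbatim
```python
def expand_list(text, count):
    """Expand a list such as '1,3-6' into a set of member numbers."""
    numbers = set()
    for part in text.split(","):
        first, _, last = part.partition("-")
        low = int(first)
        high = int(last) if last else low
        # Assumption: numbers outside 1..count are silently ignored.
        numbers.update(n for n in range(low, high + 1) if 1 <= n <= count)
    return numbers


def select_members(spec, count):
    """Return (sorted member numbers, want_damaged, want_tdata)."""
    selected = set()
    damaged = tdata = False
    for element in spec.split(":"):
        if element in ("damaged", "d"):
            damaged = True
        elif element in ("tdata", "t"):
            tdata = True
        else:
            reverse = element.startswith("r")
            if reverse:
                element = element[1:]
            negate = element.startswith("^")
            if negate:
                element = element[1:]
            numbers = expand_list(element, count)
            if reverse:                     # r1 is the last member
                numbers = {count + 1 - n for n in numbers}
            if negate:                      # everything except the list
                numbers = set(range(1, count + 1)) - numbers
            selected |= numbers             # a set dumps each member once
    return sorted(selected), damaged, tdata


if __name__ == "__main__":
    print(select_members("1-5:r1:tdata", 8))
```
@end verbatim
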
469Lziprecover also supports the following debug options (for experts):
470
471@table @code
472@item -E @var{range}[,@var{sector_size}]
473@itemx --debug-reproduce=@var{range}[,@var{sector_size}]
474Load the compressed @var{file} into memory, set all bytes in the positions
475specified by @var{range} to 0, and try to reproduce a correct compressed
476file. @xref{--reproduce}. @xref{range-format}, for a description of
477@var{range}. If a @var{sector_size} is specified, set each sector to 0 in
478sequence and try to reproduce the file, printing to standard output final
479statistics of the number of sectors reproduced successfully. Exit with
480nonzero status only in case of fatal error.
481
482@item -M
483@itemx --md5sum
484Print to standard output the MD5 digests of the input @var{files} one per
485line in the same format produced by the @command{md5sum} tool. Lziprecover
486uses MD5 digests to verify the result of some operations. This option allows
487the verification of lziprecover's implementation of the MD5 algorithm.
488
489@item -S[@var{value}]
490@itemx --nrep-stats[=@var{value}]
491Compare the frequency of sequences of N repeated bytes of a given
492@var{value} in the compressed LZMA streams of the input @var{files} with the
493frequency expected for random data (1 / 2^(8N)). If @var{value} is not
494specified, print the frequency of repeated sequences of all possible byte
495values. Print cumulative data for all files followed by the name of the
496first file with the longest sequence.
497
498@item -U
499@itemx --unzcrash
500Test 1-bit errors in the LZMA stream of the input @var{file} like the
501command @w{@samp{unzcrash -b1 -p7 -s-20 'lzip -t' @var{file}}} but in
502memory, and therefore much faster. @xref{Unzcrash}. This option tests all
503the members independently in a multimember file, skipping headers and
504trailers. If a decompression succeeds, the decompressed output is compared
505with the original decompressed output of @var{file} using MD5 digests. The
506compressed @var{file} must not contain errors and must decompress correctly
507for the comparisons to work.
508
By default @samp{--unzcrash} only prints the interesting cases: CRC
510mismatches, size mismatches, unsupported marker codes, unexpected EOFs,
511apparently successful decompressions, and decoder errors detected 50_000 or
512more bytes beyond the byte being tested. At verbosity level 1 (-v) it also
513prints decoder errors detected 10_000 or more bytes beyond the byte being
514tested. At verbosity level 2 (-vv) it prints all cases.
515
516@item -W @var{position},@var{value}
517@itemx --debug-decompress=@var{position},@var{value}
518Load the compressed @var{file} into memory, set the byte at @var{position}
519to @var{value}, and decompress the modified compressed data to standard
520output.
521
522@item -X[@var{position},@var{value}]
523@itemx --show-packets[=@var{position},@var{value}]
524Load the compressed @var{file} into memory, optionally set the byte at
525@var{position} to @var{value}, decompress the modified compressed data
526(discarding the output), and print to standard output descriptions of the
527LZMA packets being decoded.
528
529@item -Y @var{range}
530@itemx --debug-delay=@var{range}
Load the compressed @var{file} into memory and then repeatedly decompress
it, incrementing each byte in the subset of the compressed data positions
specified by @var{range} 256 times, so as to test all possible one-byte
errors. For each decompression error find the error detection delay and
535print to standard output the maximum delay. The error detection delay is the
536difference between the position of the error and the position where the
537decoder realized that the data contains an error. @xref{range-format}, for a
538description of @var{range}.
539
540@item -Z @var{position},@var{value}
541@itemx --debug-repair=@var{position},@var{value}
542Load the compressed @var{file} into memory, set the byte at @var{position}
543to @var{value}, and then try to repair the error. @xref{--repair}.
544
545@end table
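
The expected frequency quoted for @samp{--nrep-stats} follows from treating
each byte of random data as independent and uniformly distributed. The
Python sketch below computes that value and counts runs in a byte string.
Both function names are hypothetical, and counting overlapping windows is an
assumption of this example; the way lziprecover itself counts sequences is
not described here.

@verbatim
```python
def expected_frequency(n):
    """Probability that n given positions of random data all hold one value."""
    return 1.0 / 2 ** (8 * n)


def count_runs(data, value, n):
    """Count the positions where n consecutive bytes all equal value.

    Overlapping windows are counted separately (an assumption).
    """
    pattern = bytes([value]) * n
    return sum(1 for i in range(len(data) - n + 1)
               if data[i:i + n] == pattern)


if __name__ == "__main__":
    sample = b"\x00\x00\x00\x01\x00"
    print(count_runs(sample, 0, 2), expected_frequency(2))
```
@end verbatim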
546
547Numbers given as arguments to options may be followed by a multiplier
548and an optional @samp{B} for "byte".
549
550Table of SI and binary prefixes (unit multipliers):
551
552@multitable {Prefix} {kilobyte  (10^3 = 1000)} {|} {Prefix} {kibibyte (2^10 = 1024)}
553@item Prefix @tab Value               @tab | @tab Prefix @tab Value
554@item k @tab kilobyte  (10^3 = 1000)  @tab | @tab Ki @tab kibibyte (2^10 = 1024)
555@item M @tab megabyte  (10^6)         @tab | @tab Mi @tab mebibyte (2^20)
556@item G @tab gigabyte  (10^9)         @tab | @tab Gi @tab gibibyte (2^30)
557@item T @tab terabyte  (10^12)        @tab | @tab Ti @tab tebibyte (2^40)
558@item P @tab petabyte  (10^15)        @tab | @tab Pi @tab pebibyte (2^50)
559@item E @tab exabyte   (10^18)        @tab | @tab Ei @tab exbibyte (2^60)
560@item Z @tab zettabyte (10^21)        @tab | @tab Zi @tab zebibyte (2^70)
561@item Y @tab yottabyte (10^24)        @tab | @tab Yi @tab yobibyte (2^80)
562@end multitable
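
As an illustration of how the multipliers in the table combine with a
number, here is a minimal Python sketch. The function name
@code{parse_number} is hypothetical, and details such as rejecting a lone
@samp{i} are assumptions of this example rather than a description of
lziprecover's own argument parser.

@verbatim
```python
def parse_number(text):
    """Convert strings such as '4Ki', '1MB', or '512' into an integer."""
    body = text[:-1] if text.endswith("B") else text   # optional 'B' for byte
    if body.endswith("i"):
        # Binary prefixes: Ki, Mi, Gi, ...
        base, letters, body = 1024, "KMGTPEZY", body[:-1]
        if not body or body[-1] not in letters:
            raise ValueError("bad binary prefix in " + text)
    else:
        # SI prefixes: k, M, G, ...
        base, letters = 1000, "kMGTPEZY"
    exponent = 0
    if body and body[-1] in letters:
        exponent = letters.index(body[-1]) + 1
        body = body[:-1]
    return int(body) * base ** exponent


if __name__ == "__main__":
    for spec in ("512", "1k", "4Ki", "1MB", "64MiB"):
        print(spec, parse_number(spec))
```
@end verbatim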
563
564@sp 1
Exit status: 0 for a normal exit, 1 for environmental problems (file not
found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt or
invalid input file, 3 for an internal consistency error (e.g., a bug) which
caused lziprecover to panic.
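
A script that calls lziprecover may want to turn these status values into
messages. The following Python sketch shows one possible way; the names
@code{describe_exit_status} and @code{check_file} are hypothetical, and
@code{check_file} assumes that an executable named @samp{lziprecover} is in
the search path.

@verbatim
```python
import subprocess

# Meanings taken from the exit status paragraph above.
EXIT_STATUS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid flags, I/O error)",
    2: "corrupt or invalid input file",
    3: "internal consistency error",
}


def describe_exit_status(code):
    """Map an exit status to a short description."""
    return EXIT_STATUS.get(code, "unexpected status %d" % code)


def check_file(path, program="lziprecover"):
    """Run 'lziprecover -t -q path' and describe the result."""
    result = subprocess.run([program, "-t", "-q", path])
    return describe_exit_status(result.returncode)


if __name__ == "__main__":
    print(describe_exit_status(2))
```
@end verbatim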
569
570
571@node Data safety
572@chapter Protecting data from accidental loss
573@cindex data safety
574
575It is a fact of life that sometimes data will become corrupt. Software has
576errors. Hardware may misbehave or fail. RAM may be struck by a cosmic ray.
This is why sufficiently safe integrity checking is needed in compressed
578formats, and the reason why a data recovery tool is sometimes needed.
579
580There are 3 main types of data corruption that may cause data loss:
581single-byte errors, multibyte errors (generally affecting a whole sector
582in a block device), and total device failure.
583
584Lziprecover protects natively against single-byte errors as long as file
585integrity is checked frequently enough that a second single-byte error does
586not develop in the same member before the first one is repaired.
587@xref{Repairing one byte}.
588
589Lziprecover also protects against multibyte errors if at least one backup
590copy of the file is made (@pxref{Merging files}), or if the error is a
591zeroed sector and the uncompressed data corresponding to the zeroed sector
592are available (@pxref{Reproducing one sector}). If you can choose between
593merging and reproducing, try merging first because it is usually faster,
594easier to use, and has a high probability of success.
595
596Lziprecover can't help in case of device failure. The only remedy for total
597device failure is storing backup copies in separate media.
598
The extraordinary safety of the lzip format allows lziprecover to exploit
the redundancy that occurs naturally when making compressed backups.
Lziprecover can recover data that would not be recoverable from files
compressed in other formats. Let's see two examples of how much better
lzip is than gzip and bzip2 with respect to data safety:
604
605@menu
606* Merging with a backup::   Recovering a file using a damaged backup
607* Reproducing a mailbox::   Recovering new messages using an old backup
608@end menu
609
610
611@node Merging with a backup
612@section Recovering a file using a damaged backup
613@cindex merging with a backup
614
615Let's suppose that you made a compressed backup of your valuable scientific
616data and stored two copies on separate media. Years later you notice that
617both copies are corrupt.
618
619If you compressed the data with gzip and both copies suffer any damage in
620the data stream, even if it is just one altered bit, the original data can
621only be recovered by an expert, if at all.
622
623If you used bzip2, and if the file is large enough to contain more than one
624compressed data block (usually larger than @w{900 kB} uncompressed), and if
625no block is damaged in both files, then the data can be manually recovered
626by splitting the files with bzip2recover, verifying every block, and then
627copying the right blocks in the right order into another file.
628
629But if you used lzip, the data can be automatically recovered with
630@w{@samp{lziprecover --merge}} as long as the damaged areas don't overlap.
631
632Note that each error in a bzip2 file makes a whole block unusable, but each
633error in a lzip file only affects the damaged bytes, making it possible to
634recover a file with thousands of errors.
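
The idea of merging copies whose damaged areas don't overlap can be shown
with a toy model in Python. This sketch is not lziprecover's implementation:
it works on plain bytes, uses a CRC-32 of the original data as a stand-in
for lziprecover's own integrity checks, and tries every combination of the
areas where the two copies differ. The names @code{differing_runs} and
@code{merge_copies} are hypothetical.

@verbatim
```python
import zlib
from itertools import product


def differing_runs(a, b):
    """Return (start, end) pairs for the areas where a and b differ."""
    runs, start = [], None
    for i in range(len(a)):
        if a[i] != b[i]:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(a)))
    return runs


def merge_copies(a, b, good_crc):
    """Try to rebuild the original from two damaged copies of equal size.

    Returns the merged bytes, or None if no combination matches good_crc.
    """
    if len(a) != len(b):
        raise ValueError("copies must have the same size")
    runs = differing_runs(a, b)
    # For each differing area, take the bytes from copy a or from copy b.
    for choice in product((a, b), repeat=len(runs)):
        candidate = bytearray(a)
        for (start, end), source in zip(runs, choice):
            candidate[start:end] = source[start:end]
        if zlib.crc32(candidate) == good_crc:
            return bytes(candidate)
    return None


if __name__ == "__main__":
    original = bytes(range(64))
    copy_a = bytearray(original)
    copy_b = bytearray(original)
    for i in range(40, 43):
        copy_a[i] ^= 0xFF               # damage near the end of copy a
    for i in range(5, 8):
        copy_b[i] ^= 0xFF               # damage near the start of copy b
    merged = merge_copies(bytes(copy_a), bytes(copy_b), zlib.crc32(original))
    print(merged == original)
```
@end verbatim

The number of combinations doubles with each differing area, so this brute
force approach only suits a model with very few damaged areas.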
635
636
637@node Reproducing a mailbox
638@section Recovering new messages using an old backup
639@cindex reproducing a mailbox
640
641Let's suppose that you make periodic backups of your email messages stored
642in one or more mailboxes. (A mailbox is a file containing a possibly large
643number of email messages). New messages are appended to the end of each
644mailbox, therefore the initial part of two consecutive backups is identical
645unless some messages have been changed or deleted in the meantime. The new
646messages added to each backup are usually a small part of the whole mailbox.
647
648@verbatim
649+========================================================+
650|  Older backup containing some messages                 |
651+========================================================+
652+========================================================+================+
653|  Newer backup containing the messages above plus some  |  new messages  |
654+========================================================+================+
655@end verbatim
656
657One day you discover that your mailbox has disappeared because you deleted
658it inadvertently or because of a bug in your email reader. Not only that.
659You need to recover a recent message, but the last backup you made of the
660mailbox (the newer backup above) has lost the data corresponding to a whole
661sector because of an I/O error in the part containing the old messages.
662
663If you compressed the mailbox with gzip, usually none of the new messages
664can be recovered even if they are intact because all the data beyond the
665missing sector can't be decoded.
666
667If you used bzip2, and if the newer backup is large enough that the new
668messages are in a different compressed data block than the one damaged
669(usually larger than @w{900 kB} uncompressed), then you can recover the new
670messages manually with bzip2recover. If the backups are identical except for
671the new messages appended, you may even recover the whole newer backup by
672combining the good blocks from both backups.
673
But if you used lzip, the whole newer backup can be automatically recovered
with @w{@samp{lziprecover --reproduce}} as long as the missing bytes can be
recovered from the older backup, even if other messages in the common part
have been changed or deleted. Mailboxes seem to be especially easy to
reproduce. The probability of reproducing a mailbox
(@pxref{performance-of-reproduce}) is almost as high as that of merging two
identical backups (@pxref{performance-of-merge}).
681
682
683@node Repairing one byte
684@chapter Repairing one byte
685@cindex repairing one byte
686
Lziprecover can perfectly repair most files with small errors (up to one
single-byte error per member) without needing any extra redundancy at
all. If the repair is successful, the repaired file will be identical bit
for bit to the original. This makes lzip files resistant to bit flips,
one of the most common forms of data corruption.
692
693The file is repaired in memory. Therefore, enough virtual memory
694@w{(RAM + swap)} to contain the largest damaged member is required.
695
696The error may be located anywhere in the file except in the first 5
697bytes of each member header or in the @samp{Member size} field of the
698trailer (last 8 bytes of each member). If the error is in the header it
699can be easily repaired with a text editor like GNU Moe (@pxref{File
700format}). If the error is in the member size, it is enough to ignore the
701message about @samp{bad member size} when decompressing.
702
A bit flip happens when one bit in the file is changed from 0 to 1 or vice
versa. It may be caused by bad RAM or even by natural radiation. I have
seen a case of bit flip in a file stored on a USB flash drive.
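
To experiment with this kind of error, a bit flip can be simulated on a
copy of a file. The shell function below is only a sketch; the name
@samp{flip_bit} is hypothetical and the function is not part of
lziprecover. It reads one byte with @samp{od}, inverts the chosen bit, and
writes the byte back in place with @samp{dd}:

@verbatim
# Flip one bit of a file in place.
# $1 = file, $2 = byte offset (from 0), $3 = bit number (0 to 7)
flip_bit() {
  old=$(od -An -v -tu1 -j "$2" -N1 "$1")     # value of the byte, in decimal
  new=$(( $old ^ (1 << $3) ))                # invert the chosen bit
  printf "\\$(printf '%03o' "$new")" |
    dd of="$1" bs=1 seek="$2" conv=notrunc 2>/dev/null
}
@end verbatim

@noindent
For example, after @samp{cp file.lz damaged.lz} and
@samp{flip_bit damaged.lz 1000 3}, the damaged copy can be passed to
@w{@samp{lziprecover -v -R damaged.lz}} to try the repair.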
706
One byte may seem small, but most file corruptions not produced by
transmission errors or I/O errors just affect one byte, or even one bit,
of the file. Also, unlike magnetic media, where errors usually affect a
whole sector, solid-state storage devices tend to produce single-byte
errors, making lzip the perfect format for data stored on such devices.
712
Repairing a file can take some time. Small files or files with the error
located near the beginning can be repaired in a few seconds. But
repairing a large file compressed with a large dictionary size and with
the error located far from the beginning may take hours.
717
On the other hand, errors located near the beginning of the file cause
much more loss of data than errors located near the end. So lziprecover
repairs the worst errors most efficiently.
721
722
723@node Merging files
724@chapter Merging files
725@cindex merging files
726
If you have several copies of a file but all of them are too damaged to
be repaired (@pxref{Repairing one byte}), lziprecover can try to produce a
correct file by merging the good parts of the damaged copies.
730
731The merge may succeed even if some copies of the file have all the
732headers and trailers damaged, as long as there is at least one copy of
733every header and trailer intact, even if they are in different copies of
734the file.
735
736The merge will fail if the damaged areas overlap (at least one byte is
737damaged in all copies), or are adjacent and the boundary can't be
738determined, or if the copies have too many damaged areas.
739
All the copies to be merged must have the same size. If any of them is
larger or smaller than it should be, either because it has been truncated
or because it got some garbage data appended at the end, it can be
brought to the correct size with the following command before merging it
with the other copies:
745
746@example
747ddrescue -s<correct_size> -x<correct_size> file.lz correct_size_file.lz
748@end example
749
750@anchor{performance-of-merge}
To give you an idea of its possibilities, when merging two copies, each of
them with one damaged area affecting 1 percent of the copy, the probability
of obtaining a correct file is about 98 percent. With three such copies the
probability rises to 99.97 percent. For large files (a few MB) with small
errors (one sector damaged per copy), the probability approaches 100 percent
even with only two copies. These figures suppose that the errors are
randomly located inside each copy.
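
The figures above can be checked with a simple model. This is a sketch
under stated assumptions, not necessarily the model used to obtain the
figures in this manual. Suppose that each copy has one damaged area covering
a fraction d of the file, that the areas are placed uniformly at random, and
that the merge fails only when all the areas share at least one byte. With n
copies and x = d / (1 - d), the probability of failure is then
n * x^(n-1) - (n-1) * x^n. The hypothetical shell function
@samp{merge_success} evaluates this with @samp{awk}:

@verbatim
# Print the estimated probability (in percent) of a successful merge.
# $1 = number of copies, $2 = damaged fraction of each copy
merge_success() {
  awk -v n="$1" -v d="$2" 'BEGIN {
    x = d / (1 - d)                       # damaged length over free range
    fail = n * x^(n-1) - (n-1) * x^n      # all damaged areas overlap
    printf "%.2f\n", 100 * (1 - fail) }'
}
merge_success 2 0.01    # two copies, 1 percent damaged each
merge_success 3 0.01    # three copies, 1 percent damaged each
@end verbatim

@noindent
The two calls print 97.99 and 99.97, in line with the figures quoted above.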
758
759Some types of solid-state device (NAND flash, for example) can produce
760bursts of scattered single-bit errors. Lziprecover is able to merge
761files with thousands of such scattered errors by grouping the errors
762into clusters and then merging the files as if each cluster were a
763single error.
764
765Here is a real case of successful merging. Two copies of the file
766@samp{icecat-3.5.3-x86.tar.lz} (compressed size @w{9 MB}) became corrupt
767while stored on the same NAND flash device. One of the copies had 76
768single-bit errors scattered in an area of 1020 bytes, and the other had
7693028 such errors in an area of 31729 bytes. Lziprecover produced a
770correct file, identical to the original, in just 5 seconds:
771
772@example
773lziprecover -vvm a/icecat-3.5.3-x86.tar.lz b/icecat-3.5.3-x86.tar.lz
774Merging member 1 of 1  (2552 errors)
775  2552 errors have been grouped in 16 clusters.
776  Trying variation 2 of 2, block 2
777Input files merged successfully.
778@end example
779
780Note that the number of errors reported by lziprecover (2552) is lower
781than the number of corrupt bytes (3104) because contiguous corrupt bytes
782are counted as a single multibyte error.
783
784@sp 1
785@anchor{ddrescue-example}
786@noindent
787Example 1: Recover a compressed backup from two copies on CD-ROM with
788error-checked merging of copies.
789@ifnothtml
790@xref{Top,GNU ddrescue manual,,ddrescue},
791@end ifnothtml
792@ifhtml
793See the
794@uref{http://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html,,ddrescue manual}
795@end ifhtml
796for details about ddrescue.
797
798@example
799ddrescue -d -r1 -b2048 /dev/cdrom cdimage1 mapfile1
800mount -t iso9660 -o loop,ro cdimage1 /mnt/cdimage
801cp /mnt/cdimage/backup.tar.lz rescued1.tar.lz
802umount /mnt/cdimage
803  (insert second copy in the CD drive)
804ddrescue -d -r1 -b2048 /dev/cdrom cdimage2 mapfile2
805mount -t iso9660 -o loop,ro cdimage2 /mnt/cdimage
806cp /mnt/cdimage/backup.tar.lz rescued2.tar.lz
807umount /mnt/cdimage
808lziprecover -m -v -o backup.tar.lz rescued1.tar.lz rescued2.tar.lz
809  Input files merged successfully.
810lziprecover -tv backup.tar.lz
811  backup.tar.lz: ok
812@end example
813
814@sp 1
815@noindent
816Example 2: Recover the first volume of those created with the command
817@w{@samp{lzip -b 32MiB -S 650MB big_db}} from two copies,
818@samp{big_db1_00001.lz} and @samp{big_db2_00001.lz}, with member 07
819damaged in the first copy, member 18 damaged in the second copy, and
820member 12 damaged in both copies. The correct file produced is saved in
821@samp{big_db_00001.lz}.
822
823@example
824lziprecover -m -v -o big_db_00001.lz big_db1_00001.lz big_db2_00001.lz
825  Input files merged successfully.
826lziprecover -tv big_db_00001.lz
827  big_db_00001.lz: ok
828@end example
829
830
831@node Reproducing one sector
832@chapter Reproducing one sector
833@cindex reproducing one sector
834
835Lziprecover can recover a zeroed sector in a lzip file by concatenating the
836decompressed contents of the file up to the beginning of the zeroed sector
837and the uncompressed data corresponding to the zeroed sector, and then
838feeding the concatenated data to the same version of lzip that created the
839file. For this to work, a reference file is required containing the
840uncompressed data corresponding to the missing compressed data of the zeroed
841sector, plus some context data before and after them. It is possible to
842recover a large file using just a few KB of reference data.
843
844The difficult part is finding a suitable reference file. It must contain the
845exact data required (possibly mixed with other data). Containing similar
846data is not enough.
847
848A zeroed sector may be caused by the incomplete recovery of a damaged
849storage device (with I/O errors) using, for example, ddrescue. The
850reproduction can't be done if the zeroed sector overlaps with the first 15
851bytes of a member, or if the zeroed sector is smaller than 8 bytes.
852
853The file is reproduced in memory. Therefore, enough virtual memory
854@w{(RAM + swap)} to contain the damaged member is required.
855
856To understand how it works, take any lzipped file, say @samp{foo.lz},
857decompress it (keeping the original), and try to reproduce an artificially
858zeroed sector in it by running the following commands:
859
860@example
861lzip -kd foo.lz
862lziprecover -vv --debug-reproduce=65536,512 --reference-file=foo foo.lz
863@end example
864
865@noindent
866which should produce an output like the following:
867
868@example
869Reproducing:    foo.lz
870Reference file: foo
871Testing sectors of size 512 at file positions 65536 to 66047
872  (master mpos = 65536, dpos = 296892)
873foo: Match found at offset 296892
874Reproduction succeeded at pos 65536
875
876       1 sectors tested
877       1 reproductions returned with zero status
878         all comparisons passed
879@end example
880
Using @samp{foo} as reference file guarantees that any zeroed sector in
@samp{foo.lz} can be reproduced because both files contain the same data. In
real use, the reference file needs to contain the data corresponding to the
zeroed sector, but the rest of the data (if any) may differ between both
files. The reference data may be obtained from the partial decompression of
the damaged file itself if it contains repeated data. This is the case, for
example, if the damaged file is a compressed tarball containing several
partially modified versions of the same file.
889
890The offset reported by lziprecover is the position in the reference file of
891the first byte that could not be decompressed. This is the first byte that
892will be compressed to reproduce the zeroed sector.
893
The reproduce mode tries to reproduce the missing compressed data originally
present in the zeroed sector. It is based on the perfect reproducibility of
lzip files (lzip produces identical compressed output from identical input).
Therefore, the same version of lzip that created the file to be reproduced
should be used to reproduce the zeroed sector. Nearby versions may also work
because the output of lzip changes infrequently. If reproducing a tar.lz
archive created with tarlz, the version of lzip, clzip, or minilzip
corresponding to the version of the lzlib library used by tarlz to create
the archive should be used.
903
When recovering a tar.lz archive and using as reference a file from the
filesystem, if the zeroed sector encodes (part of) a tar header, the archive
can't be reproduced. Therefore, the less overhead (smaller headers) a tar
archive has, the more probable it is that the zeroed sector does not include
a header, and that the archive can be reproduced. The tarlz format has
minimum overhead. It uses basic ustar headers, and only adds extended pax
headers when they are required.
911
912@anchor{performance-of-reproduce}
913@section Performance of @samp{--reproduce}
Reproduce mode is especially useful when recovering a corrupt backup (or a
corrupt source tarball) that is part of a series. Usually only a small
fraction of the data changes from one backup to the next or from one version
of a source tarball to the next. This sometimes makes it possible to
reproduce a given corrupted version using reference data from a nearby
version. The following two tables show the fraction of reproducible sectors
(reproducible sectors divided by total sectors in archive) for some
archives, using sector sizes of 512 and 4096 bytes.
@samp{mailbox-aug.tar.lz} is a backup of some of my mailboxes.
@samp{backup-feb.tar.lz} and @samp{backup-apr.tar.lz} are real backups of my
own working directory:
924
925@multitable {Reference file} {gawk-5.0.1.tar.lz} {4369 / 5844 = 74.76%}
926@headitem Reference file @tab File @tab Reproducible (512)
927@item backup-feb.tar @tab backup-apr.tar.lz @tab 3273 / 4342 = 75.38%
928@item backup-apr.tar @tab backup-feb.tar.lz @tab 3259 / 4161 = 78.32%
929@item gawk-5.0.0.tar @tab gawk-5.0.1.tar.lz @tab 4369 / 5844 = 74.76%
930@item gawk-5.0.1.tar @tab gawk-5.0.0.tar.lz @tab 4379 / 5603 = 78.15%
931@item gmp-6.1.1.tar @tab gmp-6.1.2.tar.lz @tab 2454 / 3787 = 64.8%
932@item gmp-6.1.2.tar @tab gmp-6.1.1.tar.lz @tab 2461 / 3782 = 65.07%
933@end multitable
934
935@multitable {mailbox-mar.tar} {mailbox-aug.tar.lz} {4036 / 4252 = 94.92%}
936@headitem Reference file @tab File @tab Reproducible (4096)
937@item mailbox-mar.tar @tab mailbox-aug.tar.lz @tab 4036 / 4252 = 94.92%
938@item backup-feb.tar @tab backup-apr.tar.lz @tab 264 / 542 = 48.71%
939@item backup-apr.tar @tab backup-feb.tar.lz @tab 264 / 520 = 50.77%
940@item gawk-5.0.0.tar @tab gawk-5.0.1.tar.lz @tab 327 / 730 = 44.79%
941@item gawk-5.0.1.tar @tab gawk-5.0.0.tar.lz @tab 326 / 700 = 46.57%
942@item gmp-6.1.1.tar @tab gmp-6.1.2.tar.lz @tab 175 / 473 = 37%
943@item gmp-6.1.2.tar @tab gmp-6.1.1.tar.lz @tab 181 / 472 = 38.35%
944@end multitable
945
946Note that the "performance of reproduce" is a probability, not a partial
947recovery. The data is either fully recovered (with the probability X shown
948in the last column of the tables above) or not recovered at all (with
949probability @w{1 - X}).
950
951Example 1: Recover a damaged source tarball with a zeroed sector of 512
952bytes at file position 1019904, using as reference another source tarball
953for a different version of the software.
954
955@example
956lziprecover -vv -e --reference-file=gmp-6.1.1.tar gmp-6.1.2.tar.lz
957Reproducing bad area in member 1 of 1
958  (begin = 1019904, size = 512, value = 0x00)
  (master mpos = 1019903, dpos = 6292134)
960warning: gmp-6.1.1.tar: Partial match found at offset 6277798, len 8716.
961Reference data may be mixed with other data.
962Trying level -9
963  Reproducing position 1015808
964Member reproduced successfully.
965Copy of input file reproduced successfully.
966@end example
967
968@sp 1
969@anchor{ddrescue-example2}
970@noindent
971Example 2: Recover a damaged backup with a zeroed sector of 4096 bytes at
972file position 1019904, using as reference a previous backup. The damaged
973backup comes from a damaged partition copied with ddrescue.
974
975@example
976ddrescue -b4096 -r10 /dev/sdc1 hdimage mapfile
977mount -o loop,ro hdimage /mnt/hdimage
978cp /mnt/hdimage/backup.tar.lz backup.tar.lz
979umount /mnt/hdimage
980lzip -t backup.tar.lz
981  backup.tar.lz: Decoder error at pos 1020530
982lziprecover -vv -e --reference-file=old_backup.tar backup.tar.lz
983Reproducing bad area in member 1 of 1
984  (begin = 1019904, size = 4096, value = 0x00)
985  (master mpos = 1019903, dpos = 5857954)
986warning: old_backup.tar: Partial match found at offset 5743778, len 9546.
987Reference data may be mixed with other data.
988Trying level -9
989  Reproducing position 1015808
990Member reproduced successfully.
991Copy of input file reproduced successfully.
992@end example
993
994@sp 1
995@noindent
996Example 3: Recover a damaged backup with a zeroed sector of 4096 bytes at
997file position 1019904, using as reference a file from the filesystem. (If
998the zeroed sector encodes (part of) a tar header, the tarball can't be
999reproduced).
1000
1001@example
1002# List the contents of the backup tarball to locate the damaged member.
1003tarlz -n0 -tvf backup.tar.lz
1004  [...]
1005  example.txt
1006tarlz: Skipping to next header.
1007tarlz: backup.tar.lz: Archive ends unexpectedly.
1008# Find in the filesystem the last file listed and use it as reference.
1009lziprecover -vv -e --reference-file=/somedir/example.txt backup.tar.lz
1010Reproducing bad area in member 1 of 1
1011  (begin = 1019904, size = 4096, value = 0x00)
1012  (master mpos = 1019903, dpos = 5857954)
1013/somedir/example.txt: Match found at offset 9378
1014Trying level -9
1015  Reproducing position 1015808
1016Member reproduced successfully.
1017Copy of input file reproduced successfully.
1018@end example
1019
1020If @samp{backup.tar.lz} is a multimember file with more than one member
1021damaged and lziprecover shows the message @samp{One member reproduced. Copy
1022of input file still contains errors.}, the procedure shown in the example
1023above can be repeated until all the members have been reproduced.
1024
1025@samp{tarlz --keep-damaged -n0 -xf backup.tar.lz example.txt} produces a
1026partial copy of the reference file @samp{example.txt} that may help locate a
1027complete copy in the filesystem or in another backup, even if
1028@samp{example.txt} has been renamed.
1029
1030
1031@node Tarlz
1032@chapter Options supporting the tar.lz format
1033@cindex tarlz
1034
1035@uref{http://www.nongnu.org/lzip/manual/tarlz_manual.html,,Tarlz} is a
1036massively parallel (multi-threaded) combined implementation of the tar
1037archiver and the
1038@uref{http://www.nongnu.org/lzip/manual/lzip_manual.html,,lzip} compressor.
1039
1040Tarlz creates tar archives using a simplified and safer variant of the POSIX
1041pax format compressed in lzip format, keeping the alignment between tar
1042members and lzip members. The resulting multimember tar.lz archive is fully
1043backward compatible with standard tar tools like GNU tar, which treat it
1044like any other tar.lz archive.
1045@ifnothtml
1046@xref{Top,tarlz manual,,tarlz}, and @ref{Top,lzip manual,,lzip}.
1047@end ifnothtml
1048
1049Multimember tar.lz archives have some safety advantages over solidly
1050compressed tar.lz archives. For example, in case of corruption, tarlz can
1051extract all the undamaged members from the tar.lz archive, skipping over the
1052damaged members, just like the standard (uncompressed) tar. Keeping the
1053alignment between tar members and lzip members minimizes the amount of data
1054lost in case of corruption. In this chapter we'll explain the ways in which
1055lziprecover can recover and process multimember tar.lz archives.
1056
1057@sp 1
1058@section Recovering damaged multimember tar.lz archives
1059
1060If you have several copies of the damaged archive, try merging them first
1061because merging has a high probability of success. @xref{Merging files}. If
1062the command below prints something like
1063@w{@samp{Input files merged successfully.}} you are done and
1064@samp{archive.tar.lz} now contains the recovered archive:
1065
1066@example
1067lziprecover -m -v -o archive.tar.lz a/archive.tar.lz b/archive.tar.lz
1068@end example
1069
1070If you only have one copy of the damaged archive with a zeroed block of data
1071caused by an I/O error, you may try to reproduce the archive.
1072@xref{Reproducing one sector}. If the command below prints something like
1073@w{@samp{Copy of input file reproduced successfully.}} you are done and
1074@samp{archive_fixed.tar.lz} now contains the recovered archive:
1075
1076@example
1077lziprecover -vv -e --reference-file=old_archive.tar archive.tar.lz
1078@end example
1079
1080If you only have one copy of the damaged archive, you may try to repair the
1081archive, but this has a lower probability of success. @xref{Repairing one
1082byte}. If the command below prints something like
1083@w{@samp{Copy of input file repaired successfully.}} you are done and
1084@samp{archive_fixed.tar.lz} now contains the recovered archive:
1085
1086@example
1087lziprecover -v -R archive.tar.lz
1088@end example
1089
1090If all the above fails, and the archive was created with tarlz, you may save
1091the damaged members for later and then copy the good members to another
1092archive. If the two commands below succeed, @samp{bad_members.tar.lz} will
1093contain all the damaged members and @samp{archive_cleaned.tar.lz} will
1094contain a good archive with the damaged members removed:
1095
1096@example
1097lziprecover -v --dump=damaged -o bad_members.tar.lz archive.tar.lz
1098lziprecover -v --strip=damaged -o archive_cleaned.tar.lz archive.tar.lz
1099@end example
1100
1101You can then use @samp{tarlz --keep-damaged} to recover as much data as
1102possible from each damaged member in @samp{bad_members.tar.lz}:
1103
1104@example
1105mkdir tmp
1106cd tmp
1107tarlz --keep-damaged -xvf ../bad_members.tar.lz
1108@end example
1109
1110@sp 1
1111@section Processing multimember tar.lz archives
1112
Lziprecover is able to copy a list of members from one file to another.
For example, the command
@w{@samp{lziprecover --dump=1-10:r1:tdata archive.tar.lz > subarch.tar.lz}}
creates a subset archive containing the first ten members, the end-of-file
blocks, and the trailing data (if any) of @samp{archive.tar.lz}. The
@samp{r1} part selects the last member, which in an appendable tar.lz
archive contains the end-of-file blocks.
1120
1121
1122@node File names
1123@chapter Names of the files produced by lziprecover
1124@cindex file names
1125
1126The name of the fixed file produced by @samp{--merge} and @samp{--repair} is
1127made by appending the string @samp{_fixed.lz} to the original file name. If
1128the original file name ends with one of the extensions @samp{.tar.lz},
1129@samp{.lz}, or @samp{.tlz}, the string @samp{_fixed} is inserted before the
1130extension.
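
The naming rule can be sketched as a small shell function. The name
@samp{fixed_name} is hypothetical; the function only mirrors the rule
described above and is not part of lziprecover:

@verbatim
# Print the name of the fixed file corresponding to the file name in $1.
fixed_name() {
  case $1 in
    *.tar.lz) echo "${1%.tar.lz}_fixed.tar.lz" ;;   # test before '.lz'
    *.lz)     echo "${1%.lz}_fixed.lz" ;;
    *.tlz)    echo "${1%.tlz}_fixed.tlz" ;;
    *)        echo "${1}_fixed.lz" ;;               # no known extension
  esac
}
@end verbatim

@noindent
For example, @samp{fixed_name archive.tar.lz} prints
@samp{archive_fixed.tar.lz}, and @samp{fixed_name data} prints
@samp{data_fixed.lz}.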
1131
1132
1133@node File format
1134@chapter File format
1135@cindex file format
1136
1137Perfection is reached, not when there is no longer anything to add, but
1138when there is no longer anything to take away.@*
1139--- Antoine de Saint-Exupery
1140
1141@sp 1
1142In the diagram below, a box like this:
1143
1144@verbatim
1145+---+
1146|   | <-- the vertical bars might be missing
1147+---+
1148@end verbatim
1149
1150represents one byte; a box like this:
1151
1152@verbatim
1153+==============+
1154|              |
1155+==============+
1156@end verbatim
1157
1158represents a variable number of bytes.
1159
1160@sp 1
1161A lzip file consists of a series of "members" (compressed data sets).
1162The members simply appear one after another in the file, with no
1163additional information before, between, or after them.
1164
1165Each member has the following structure:
1166
1167@verbatim
1168+--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1169| ID string | VN | DS | LZMA stream | CRC32 |   Data size   |  Member size  |
1170+--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1171@end verbatim
1172
1173All multibyte values are stored in little endian order.
1174
1175@table @samp
1176@item ID string (the "magic" bytes)
1177A four byte string, identifying the lzip format, with the value "LZIP"
1178(0x4C, 0x5A, 0x49, 0x50).
1179
1180@item VN (version number, 1 byte)
1181Just in case something needs to be modified in the future. 1 for now.
1182
1183@item DS (coded dictionary size, 1 byte)
1184The dictionary size is calculated by taking a power of 2 (the base size)
1185and subtracting from it a fraction between 0/16 and 7/16 of the base size.@*
1186Bits 4-0 contain the base 2 logarithm of the base size (12 to 29).@*
1187Bits 7-5 contain the numerator of the fraction (0 to 7) to subtract
1188from the base size to obtain the dictionary size.@*
1189Example: 0xD3 = 2^19 - 6 * 2^15 = 512 KiB - 6 * 32 KiB = 320 KiB@*
1190Valid values for dictionary size range from 4 KiB to 512 MiB.
1191
1192@item LZMA stream
1193The LZMA stream, finished by an end of stream marker. Uses default values
1194for encoder properties.
1195@ifnothtml
1196@xref{Stream format,,,lzip},
1197@end ifnothtml
1198@ifhtml
1199See
1200@uref{http://www.nongnu.org/lzip/manual/lzip_manual.html#Stream-format,,Stream format}
1201@end ifhtml
1202for a complete description.
1203
1204@item CRC32 (4 bytes)
1205Cyclic Redundancy Check (CRC) of the uncompressed original data.
1206
1207@item Data size (8 bytes)
1208Size of the uncompressed original data.
1209
1210@item Member size (8 bytes)
1211Total size of the member, including header and trailer. This field acts
1212as a distributed index, allows the verification of stream integrity, and
1213facilitates safe recovery of undamaged members from multimember files.
1214
1215@end table
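
As an illustration of the fields described above, the following shell
functions decode a DS byte and read the @samp{Member size} field of the last
member of a file. This is a minimal sketch: the function names are
hypothetical, the file is assumed to have no trailing data, and no
validation is performed.

@verbatim
# Print the dictionary size in bytes encoded by the DS byte given in $1.
dict_size() {
  log2=$(( $1 & 0x1F ))         # bits 4-0: base 2 logarithm of the base size
  num=$(( ($1 >> 5) & 7 ))      # bits 7-5: numerator of the fraction
  base=$(( 1 << log2 ))
  echo $(( base - num * (base / 16) ))
}

# Print the Member size field (last 8 bytes, little endian) of file $1.
member_size() {
  set -- $(tail -c 8 "$1" | od -An -v -tu1)    # the 8 bytes, in decimal
  lo=$(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
  hi=$(( $5 + ($6 << 8) + ($7 << 16) + ($8 << 24) ))
  echo $(( lo + (hi << 32) ))
}

dict_size 0xD3    # the DS example above; prints 327680 (320 KiB)
@end verbatim

@noindent
In a multimember file, subtracting the value printed by @samp{member_size}
from the file size gives the position where the last member begins. This is
how the field can act as a distributed index.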
1216
1217
1218@node Trailing data
1219@chapter Extra data appended to the file
1220@cindex trailing data
1221
1222Sometimes extra data are found appended to a lzip file after the last
1223member. Such trailing data may be:
1224
1225@itemize @bullet
1226@item
1227Padding added to make the file size a multiple of some block size, for
1228example when writing to a tape. It is safe to append any amount of
1229padding zero bytes to a lzip file.
1230
1231@item
Useful data added by the user, such as a cryptographically secure hash or
a description of file contents. It is safe to append any amount of
text to a lzip file as long as none of the first four bytes of the text
match the corresponding byte in the string "LZIP", and the text does not
contain any zero bytes (null characters). Nonzero bytes and zero bytes
can't be safely mixed in trailing data.
1238
1239@item
1240Garbage added by some not totally successful copy operation.
1241
1242@item
1243Malicious data added to the file in order to make its total size and
1244hash value (for a chosen hash) coincide with those of another file.
1245
1246@item
1247In rare cases, trailing data could be the corrupt header of another
1248member. In multimember or concatenated files the probability of
1249corruption happening in the magic bytes is 5 times smaller than the
1250probability of getting a false positive caused by the corruption of the
1251integrity information itself. Therefore it can be considered to be below
1252the noise level. Additionally, the test used by lziprecover to discriminate
1253trailing data from a corrupt header has a Hamming distance (HD) of 3,
1254and the 3 bit flips must happen in different magic bytes for the test to
1255fail. In any case, the option @samp{--trailing-error} guarantees that
1256any corrupt header will be detected.
1257@end itemize
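
The rule for user-added text given above can be checked before appending.
The following shell function is only a sketch of that rule; the name
@samp{safe_tdata} is hypothetical and the function is not part of
lziprecover. It succeeds if the file passed as argument contains no zero
bytes and none of its first four bytes matches the corresponding byte of
"LZIP":

@verbatim
# Return 0 if the contents of file $1 are safe to append as trailing text.
safe_tdata() {
  # Reject the file if deleting the zero bytes changes its size.
  [ "$(tr -d '\000' < "$1" | wc -c)" -eq "$(wc -c < "$1")" ] || return 1
  # Compare the first four bytes with 'L' (76), 'Z' (90), 'I' (73), 'P' (80).
  set -- $(od -An -v -tu1 -N4 "$1")
  [ "${1:-0}" -ne 76 ] && [ "${2:-0}" -ne 90 ] &&
  [ "${3:-0}" -ne 73 ] && [ "${4:-0}" -ne 80 ]
}
@end verbatim

@noindent
It can be used as in
@w{@samp{safe_tdata comment.txt && cat comment.txt >> file.lz}}.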
1258
1259Trailing data are in no way part of the lzip file format, but tools
1260reading lzip files are expected to behave as correctly and usefully as
1261possible in the presence of trailing data.
1262
1263Trailing data can be safely ignored in most cases. In some cases, like
1264that of user-added data, they are expected to be ignored. In those cases
1265where a file containing trailing data must be rejected, the option
1266@samp{--trailing-error} can be used. @xref{--trailing-error}.
1267
1268Lziprecover facilitates the management of metadata stored as trailing
1269data in lzip files. See the following examples:
1270
1271@noindent
1272Example 1: Add a comment or description to a compressed file.
1273
1274@example
1275# First append the comment as trailing data to a lzip file
1276echo 'This file contains this and that' >> file.lz
1277# This command prints the comment to standard output
1278lziprecover --dump=tdata file.lz
1279# This command outputs file.lz without the comment
1280lziprecover --strip=tdata file.lz
1281# This command removes the comment from file.lz
1282lziprecover --remove=tdata file.lz
1283@end example
1284
1285@sp 1
1286@noindent
1287Example 2: Add and verify a cryptographically secure hash. (This may be
1288convenient, but a separate copy of the hash must be kept in a safe place
1289to guarantee that both file and hash have not been maliciously replaced).
1290
1291@example
1292sha256sum < file.lz >> file.lz
1293lziprecover --strip=tdata file.lz | sha256sum -c \
1294  <(lziprecover --dump=tdata file.lz)
1295@end example
1296
1297
1298@node Examples
1299@chapter A small tutorial with examples
1300@cindex examples
1301
1302Example 1: Extract all the files from archive @samp{foo.tar.lz}.
1303
1304@example
1305  tar -xf foo.tar.lz
1306or
1307  lziprecover -cd foo.tar.lz | tar -xf -
1308@end example
1309
1310@sp 1
1311@noindent
1312Example 2: Restore a regular file from its compressed version
1313@samp{file.lz}. If the operation is successful, @samp{file.lz} is removed.
1314
1315@example
1316lziprecover -d file.lz
1317@end example
1318
1319@sp 1
1320@noindent
1321Example 3: Verify the integrity of the compressed file @samp{file.lz} and
1322show status.
1323
1324@example
1325lziprecover -tv file.lz
1326@end example
1327
1328@sp 1
1329@anchor{concat-example}
1330@noindent
1331Example 4: The right way of concatenating the decompressed output of two or
1332more compressed files. @xref{Trailing data}.
1333
1334@example
1335Don't do this
1336  cat file1.lz file2.lz file3.lz | lziprecover -d
1337Do this instead
1338  lziprecover -cd file1.lz file2.lz file3.lz
1339You may also concatenate the compressed files like this
1340  lziprecover --strip=tdata file1.lz file2.lz file3.lz > file123.lz
1341Or keeping the trailing data of the last file like this
1342  lziprecover --strip=damaged file1.lz file2.lz file3.lz > file123.lz
1343@end example
1344
1345@sp 1
1346@noindent
1347Example 5: Decompress @samp{file.lz} partially until @w{10 KiB} of
1348decompressed data are produced.
1349
1350@example
1351lziprecover -D 0,10KiB file.lz
1352@end example
1353
1354@sp 1
1355@noindent
1356Example 6: Decompress @samp{file.lz} partially from decompressed byte at
1357offset 10000 to decompressed byte at offset 14999 (5000 bytes are produced).
1358
1359@example
1360lziprecover -D 10000-15000 file.lz
1361@end example
1362
1363@sp 1
1364@noindent
1365Example 7: Repair small errors in the file @samp{file.lz}. (Indented lines
1366are abridged diagnostic messages from lziprecover).
1367
1368@example
1369lziprecover -v -R file.lz
1370  Copy of input file repaired successfully.
1371lziprecover -tv file_fixed.lz
1372  file_fixed.lz: ok
1373mv file_fixed.lz file.lz
1374@end example
1375
1376@sp 1
1377@noindent
1378Example 8: Split the multimember file @samp{file.lz} and write each member
1379in its own @samp{recXXXfile.lz} file. Then use @w{@samp{lziprecover -t}} to
1380test the integrity of the resulting files.
1381
1382@example
1383lziprecover -s file.lz
1384lziprecover -tv rec*file.lz
1385@end example
1386
1387
1388@node Unzcrash
1389@chapter Testing the robustness of decompressors
1390@cindex unzcrash
1391
1392The lziprecover package also includes unzcrash, a program written to test
1393robustness to decompression of corrupted data, inspired by unzcrash.c from
1394Julian Seward's bzip2. Type @samp{make unzcrash} in the lziprecover source
1395directory to build it.
1396
By default, unzcrash reads the file specified and then repeatedly
decompresses it, incrementing each byte of the compressed data 256 times, so
as to test all possible one-byte errors. Note that it may take years or even
centuries to test all possible one-byte errors in a large file (tens of MB).
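
To see where such figures come from, here is a rough estimate. It assumes
(arbitrarily) a compressed file of @w{10 MB} and one second per
decompression. Each byte has 255 wrong values to test:

@verbatim
bytes=10000000                  # assumed size of the compressed file
runs=$(( bytes * 255 ))         # one decompression per wrong byte value
years=$(( runs / (365 * 24 * 3600) ))    # at one decompression per second
echo "$years"                   # prints 80
@end verbatim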
1401
1402If the option @samp{--block} is given, unzcrash reads the file specified and
1403then repeatedly decompresses it, setting all bytes in each successive block
1404to the value given, so as to test all possible full sector errors.
1405
1406If the option @samp{--truncate} is given, unzcrash reads the file specified
1407and then repeatedly decompresses it, truncating the file to increasing
1408lengths, so as to test all possible truncation points.
1409
None of the three test modes described above should cause any invalid memory
accesses. If any of them does, please report it as a bug to the maintainers
of the decompressor being tested.
1413
1414Unzcrash really executes as a subprocess the shell command specified in the
1415first non-option argument, and then writes the file specified in the second
1416non-option argument to the standard input of the subprocess, modifying the
1417corresponding byte each time. Therefore unzcrash can be used to test any
1418decompressor (not only lzip), or even other decoder programs having a
1419suitable command line syntax.
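
Conceptually, a single test is roughly equivalent to the following shell
sketch, which corrupts one byte by hand and feeds the result to the
decompressor through its standard input. This is only an illustration, not
a description of how unzcrash is implemented. The offset 100, the value
0xFF, and the name @samp{corrupt.lz} are arbitrary choices, and the sketch
assumes a @samp{dd} that supports @samp{conv=notrunc}.

@example
cp file.lz corrupt.lz
printf '\377' | dd of=corrupt.lz bs=1 seek=100 conv=notrunc 2>/dev/null
lzip -t < corrupt.lz
echo "exit status: $?"
@end example

@noindent
Provided the original byte at that offset was not already 0xFF, a nonzero
exit status means that the decompressor detected the corruption.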

If the decompressor returns with zero status, unzcrash compares the output
of the decompressor for the original and corrupt files. If the outputs
differ, it means that the decompressor returned a false negative; it failed
to recognize the corruption and produced garbage output. The only exception
is when a multimember file is truncated just after the last byte of a
member, producing a shorter but valid compressed file. Except in this latter
case, please report any false negative as a bug.

In order to compare the outputs, unzcrash needs a @samp{zcmp} program able
to understand the format being tested, for example the @samp{zcmp} provided
by @uref{http://www.nongnu.org/zutils/manual/zutils_manual.html#Zcmp,,zutils}.
Use @samp{--zcmp=false} to disable comparisons.
@ifnothtml
@xref{Zcmp,,,zutils}.
@end ifnothtml

The format for running unzcrash is:

@example
unzcrash [@var{options}] 'lzip -t' @var{file}
@end example

@noindent
The compressed @var{file} must not contain errors and the decompressor being
tested must decompress it correctly for the comparisons to work.
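
For instance, the following invocations are sketches that use only the
syntax documented in this chapter; @samp{file.lz} is a placeholder name.
The first tests every one-byte error in the whole file, which may take
very long. The second tests only single-bit errors in the first 100 byte
positions.

@example
unzcrash 'lzip -t' file.lz
unzcrash --bits=1 --position=0 --size=100 'lzip -t' file.lz
@end example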

unzcrash supports the following options:

@table @code
@item -h
@itemx --help
Print an informative help message describing the options and exit.

@item -V
@itemx --version
Print the version number of unzcrash on the standard output and exit.
This version number should be included in all bug reports.

@item -b @var{range}
@itemx --bits=@var{range}
Test N-bit errors only, instead of testing all the 255 wrong values for
each byte. @samp{N-bit error} means any value differing from the
original value in N bit positions, not a value differing from the
original value in the bit position N.@*
The number of N-bit errors per byte (N = 1 to 8) is:
@w{8 28 56 70 56 28 8 1}
(the number of ways to choose N of the 8 bits; these add up to 255).

@multitable {Examples of @var{range}} {Tests errors of N-bits}
@item Examples of @var{range} @tab Tests errors of N-bits
@item 1                       @tab 1
@item 1,2,3                   @tab 1, 2, 3
@item 2-4                     @tab 2, 3, 4
@item 1,3-5,8                 @tab 1, 3, 4, 5, 8
@item 1-3,5-8                 @tab 1, 2, 3, 5, 6, 7, 8
@end multitable

@item -B[@var{size}][,@var{value}]
@itemx --block[=@var{size}][,@var{value}]
Test block errors of given @var{size}, simulating a whole sector I/O error.
@var{size} defaults to 512 bytes. @var{value} defaults to 0. By default,
only contiguous, non-overlapping blocks are tested, but this may be changed
with the option @samp{--delta}.

@item -d @var{n}
@itemx --delta=@var{n}
Test one byte, block, or truncation size every @var{n} bytes. If
@samp{--delta} is not specified, unzcrash tests all the bytes,
non-overlapping blocks, or truncation sizes. Values of @var{n} smaller than
the block size will result in overlapping blocks. This is convenient for
testing because there are usually too few non-overlapping blocks in a file.

@item -e @var{position},@var{value}
@itemx --set-byte=@var{position},@var{value}
Set byte at @var{position} to @var{value} in the internal buffer after
reading and testing @var{file} but before the first test call to the
decompressor. Byte positions start at 0. If @var{value} is preceded by
@samp{+}, it is added to the original value of the byte at @var{position}.
If @var{value} is preceded by @samp{f} (flip), it is XORed with the original
value of the byte at @var{position}. This option can be used to run tests
with a changed dictionary size, for example.

@item -n
@itemx --no-verify
Skip initial verification of @var{file} and @samp{zcmp}. This may speed
things up considerably when testing many (or large) known good files.

@item -p @var{bytes}
@itemx --position=@var{bytes}
First byte position to test in the file. Defaults to 0. Negative values
are relative to the end of the file.

@item -q
@itemx --quiet
Quiet operation. Suppress all messages.

@item -s @var{bytes}
@itemx --size=@var{bytes}
Number of byte positions to test. If not specified, the rest of the file
is tested (from @samp{--position} to end of file). Negative values are
relative to the rest of the file.

@item -t
@itemx --truncate
Test all possible truncation points in the range specified by
@samp{--position} and @samp{--size}.

@item -v
@itemx --verbose
Verbose mode.

@item -z @var{command}
@itemx --zcmp=@var{command}
Set zcmp command name and options. Defaults to @samp{zcmp}. Use
@samp{--zcmp=false} to disable comparisons. When testing a decompressor
different from the one that zcmp uses by default, unzcrash and zcmp must be
forced to use the same decompressor with a command like
@w{@samp{unzcrash --zcmp='zcmp --lz=plzip' 'plzip -t' @var{file}}}.

@end table
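
The following invocations sketch how some of these options combine. They
use only the syntax documented above; @samp{file.lz} is a placeholder, and
the particular sizes and ranges are arbitrary choices. The first tests
zeroed 512-byte blocks starting every 256 bytes, so that consecutive
blocks overlap by half. The second tests the truncation points within the
last 20 bytes of the file. The third tests only 1-bit and 2-bit errors
(8 + 28 = 36 values per byte instead of 255) with comparisons disabled.

@example
unzcrash --block=512 --delta=256 'lzip -t' file.lz
unzcrash --truncate --position=-20 'lzip -t' file.lz
unzcrash --zcmp=false --bits=1-2 'lzip -t' file.lz
@end example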

Exit status: 0 for a normal exit, 1 for environmental problems (file not
found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt or
invalid input file, 3 for an internal consistency error (e.g., a bug) that
caused unzcrash to panic.


@node Problems
@chapter Reporting bugs
@cindex bugs
@cindex getting help

There are probably bugs in lziprecover. There are certainly errors and
omissions in this manual. If you report them, they will get fixed. If
you don't, no one will ever know about them and they will remain unfixed
for all eternity, if not longer.

If you find a bug in lziprecover, please send electronic mail to
@email{lzip-bug@@nongnu.org}. Include the version number, which you can
find by running @w{@samp{lziprecover --version}}.


@node Concept index
@unnumbered Concept index

@printindex cp

@bye