.\" Copyright (c) 2003-2007 Tim Kientzle
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd January 26, 2011
.Dt LIBARCHIVE_INTERNALS 3
.Os
.Sh NAME
.Nm libarchive_internals
.Nd description of libarchive internal interfaces
.Sh OVERVIEW
The
.Nm libarchive
library provides a flexible interface for reading and writing
streaming archive files such as tar and cpio.
Internally, it follows a modular layered design that should
make it easy to add new archive and compression formats.
.Sh GENERAL ARCHITECTURE
Externally, libarchive exposes most operations through an
opaque, object-style interface.
The
.Xr archive_entry 3
objects store information about a single filesystem object.
The rest of the library provides facilities to write
.Xr archive_entry 3
objects to archive files,
read them from archive files,
and write them to disk.
(There are plans to add a facility to read
.Xr archive_entry 3
objects from disk as well.)
.Pp
The read and write APIs each have four layers: a public API
layer, a format layer that understands the archive file format,
a compression layer, and an I/O layer.
The I/O layer is completely exposed to clients, who can replace
it entirely with their own functions.
.Pp
In order to provide as much consistency as possible for clients,
some public functions are virtualized.
Eventually, it should be possible for clients to open
an archive or disk writer, and then use a single set of
code to select and write entries, regardless of the target.
.Sh READ ARCHITECTURE
From the outside, clients use the
.Xr archive_read 3
API to manipulate an
.Nm archive
object to read entries and bodies from an archive stream.
Internally, the
.Nm archive
object is cast to an
.Nm archive_read
object, which holds all read-specific data.
The API has four layers.
The lowest layer is the I/O layer.
This layer can be overridden by clients, but most clients use
the packaged I/O callbacks provided, for example, by
.Xr archive_read_open_memory 3
and
.Xr archive_read_open_fd 3 .
The compression layer calls the I/O layer to
read bytes and decompresses them for the format layer.
The format layer unpacks a stream of uncompressed bytes and
creates
.Nm archive_entry
objects from the incoming data.
The API layer tracks overall state
(for example, it prevents clients from reading data before reading a header)
and invokes the format and compression layer operations
through registered function pointers.
In particular, the API layer drives the format-detection process:
When opening the archive, it reads an initial block of data
and offers it to each registered compression handler.
The one with the highest bid is initialized with the first block.
Similarly, the format handlers are polled to see which handler
is the best for each archive.
(Prior to 2.4.0, the format bidders were invoked for each
entry, but this design hindered error recovery.)
.Ss I/O Layer and Client Callbacks
The read API goes to some lengths to be nice to clients.
As a result, there are few restrictions on the behavior of
the client callbacks.
.Pp
The client read callback is expected to provide a block
of data on each call.
A zero-length return does indicate end of file, but otherwise
blocks may be as small as one byte or as large as the entire file.
In particular, blocks may be of different sizes.
.Pp
The client skip callback returns the number of bytes actually
skipped, which may be much smaller than the skip requested.
The only requirement is that the skip not be larger.
In particular, clients are allowed to return zero for any
skip that they don't want to handle.
The skip callback must never be invoked with a negative value.
.Pp
Keep in mind that not all clients are reading from disk:
clients reading from networks may provide different-sized
blocks on every request and cannot skip at all;
advanced clients may use
.Xr mmap 2
to read the entire file into memory at once and return the
entire file to libarchive as a single block;
other clients may begin asynchronous I/O operations for the
next block on each request.
.Ss Decompression Layer
The decompression layer not only handles decompression,
it also buffers data so that the format handlers see a
much nicer I/O model.
The decompression API is a two-stage peek/consume model.
A read_ahead request specifies a minimum read amount;
the decompression layer must provide a pointer to at least
that much data.
If more data is immediately available, it should return more:
the format layer handles bulk data reads by asking for a minimum
of one byte and then copying as much data as is available.
.Pp
A subsequent call to the
.Fn consume
function advances the read pointer.
Note that data returned from a
.Fn read_ahead
call is guaranteed to remain in place until
the next call to
.Fn read_ahead .
Intervening calls to
.Fn consume
should not cause the data to move.
.Pp
Skip requests must always be handled exactly.
Decompression handlers that cannot seek forward should
not register a skip handler;
the API layer fills in a generic skip handler that reads and discards data.
.Pp
A decompression handler has a specific lifecycle:
.Bl -tag -compact -width indent
.It Registration/Configuration
When the client invokes the public support function,
the decompression handler invokes the internal
.Fn __archive_read_register_compression
function to provide bid and initialization functions.
This function returns
.Cm NULL
on error or else a pointer to a
.Cm struct decompressor_t .
This structure contains a
.Va void * config
slot that can be used for storing any customization information.
.It Bid
The bid function is invoked with a pointer and size of a block of data.
The decompressor can access its config data
through the
.Va decompressor
element of the
.Cm archive_read
object.
The bid function is otherwise stateless.
In particular, it must not perform any I/O operations.
.Pp
The value returned by the bid function indicates its suitability
for handling this data stream.
A bid of zero will ensure that this decompressor is never invoked.
Return zero if magic number checks fail.
Otherwise, your initial implementation should return the number of bits
actually checked.
For example, if you verify two full bytes and three bits of another
byte, bid 19.
Note that the initial block may be very short;
be careful to only inspect the data you are given.
(The current decompressors require two bytes for correct bidding.)
.It Initialize
The winning bidder will have its init function called.
This function should initialize the remaining slots of the
.Va struct decompressor_t
object pointed to by the
.Va decompressor
element of the
.Va archive_read
object.
In particular, it should allocate any working data it needs
in the
.Va data
slot of that structure.
The init function is called with the block of data that
was used for tasting.
At this point, the decompressor is responsible for all I/O
requests to the client callbacks.
The decompressor is free to read more data as and when
necessary.
.It Satisfy I/O requests
The format handler will invoke the
.Va read_ahead ,
.Va consume ,
and
.Va skip
functions as needed.
.It Finish
The finish method is called only once when the archive is closed.
It should release anything stored in the
.Va data
and
.Va config
slots of the
.Va decompressor
object.
It should not invoke the client close callback.
.El
.Ss Format Layer
The read formats have a similar lifecycle to the decompression handlers:
.Bl -tag -compact -width indent
.It Registration
Allocate your private data and initialize your pointers.
.It Bid
Formats bid by invoking the
.Fn read_ahead
decompression method but not calling the
.Fn consume
method.
This allows each bidder to look ahead in the input stream.
Bidders should not look further ahead than necessary, as long
look aheads put pressure on the decompression layer to buffer
lots of data.
Most formats only require a few hundred bytes of look ahead;
look aheads of a few kilobytes are reasonable.
(The ISO9660 reader sometimes looks ahead by 48k, which
should be considered an upper limit.)
.It Read header
The header read is usually the most complex part of any format.
There are a few strategies worth mentioning:
For formats such as tar or cpio, reading and parsing the header is
straightforward since headers alternate with data.
For formats that store all header data at the beginning of the file,
the first header read request may have to read all headers into
memory and store that data, sorted by the location of the file
data.
Subsequent header read requests will skip forward to the
beginning of the file data and return the corresponding header.
.It Read Data
The read data interface supports sparse files; this requires that
each call return a block of data specifying the file offset and
size.
This may require you to carefully track the location so that you
can return accurate file offsets for each read.
Remember that the decompressor will return as much data as it has.
Generally, you will want to request one byte,
examine the return value to see how much data is available, and
possibly trim that to the amount you can use.
You should invoke consume for each block just before you return it.
.It Skip All Data
The skip data call should skip over all file data and trailing padding.
This is called automatically by the API layer just before each
header read.
It is also called in response to the client calling the public
.Fn data_skip
function.
.It Cleanup
On cleanup, the format should release all of its allocated memory.
.El
.Ss API Layer
XXX to do XXX
.Sh WRITE ARCHITECTURE
The write API has a similar set of four layers:
an API layer, a format layer, a compression layer, and an I/O layer.
The registration here is much simpler because only
one format and one compression can be registered at a time.
.Ss I/O Layer and Client Callbacks
XXX To be written XXX
.Ss Compression Layer
XXX To be written XXX
.Ss Format Layer
XXX To be written XXX
.Ss API Layer
XXX To be written XXX
.Sh WRITE_DISK ARCHITECTURE
The write_disk API is intended to look just like the write API
to clients.
Since it does not handle multiple formats or compression, it
is not layered internally.
.Sh GENERAL SERVICES
The
.Nm archive_read ,
.Nm archive_write ,
and
.Nm archive_write_disk
objects all contain an initial
.Nm archive
object which provides common support for a set of standard services.
(Recall that ANSI/ISO C90 guarantees that you can cast freely between
a pointer to a structure and a pointer to the first member of that
structure.)
The
.Nm archive
object has a magic value that indicates which API this object
is associated with,
slots for storing error information,
and function pointers for virtualized API functions.
.Sh MISCELLANEOUS NOTES
Connecting existing archiving libraries into libarchive is generally
quite difficult.
In particular, many existing libraries strongly assume that you
are reading from a file; they seek forwards and backwards as necessary
to locate various pieces of information.
In contrast, libarchive never seeks backwards in its input, which
sometimes requires very different approaches.
.Pp
For example, libarchive's ISO9660 support operates very differently
from most ISO9660 readers.
The libarchive support utilizes a work-queue design that
keeps a list of known entries sorted by their location in the input.
Whenever libarchive's ISO9660 implementation is asked for the next
header, it checks this list to find the next item on the disk.
Directories are parsed when they are encountered and new
items are added to the list.
This design relies heavily on the ISO9660 image being optimized so that
directories always occur earlier on the disk than the files they
describe.
.Pp
Depending on the specific format, such approaches may not be possible.
The ZIP format specification, for example, allows archivers to store
key information only at the end of the file.
In theory, it is possible to create ZIP archives that cannot
be read without seeking.
Fortunately, such archives are very rare, and libarchive can read
most ZIP archives, though it cannot always extract as much information
as a dedicated ZIP program.
.Sh SEE ALSO
.Xr archive_entry 3 ,
.Xr archive_read 3 ,
.Xr archive_write 3 ,
.Xr archive_write_disk 3 ,
.Xr libarchive 3
.Sh HISTORY
The
.Nm libarchive
library first appeared in
.Fx 5.3 .
.Sh AUTHORS
.An -nosplit
The
.Nm libarchive
library was written by
.An Tim Kientzle Aq kientzle@acm.org .