.\" Copyright (c) 2003-2007 Tim Kientzle
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd January 26, 2011
.Dt LIBARCHIVE_INTERNALS 3
.Os
.Sh NAME
.Nm libarchive_internals
.Nd description of libarchive internal interfaces
.Sh OVERVIEW
The
.Nm libarchive
library provides a flexible interface for reading and writing
streaming archive files such as tar and cpio.
Internally, it follows a modular layered design that should
make it easy to add new archive and compression formats.
.Sh GENERAL ARCHITECTURE
Externally, libarchive exposes most operations through an
opaque, object-style interface.
The
.Xr archive_entry 3
objects store information about a single filesystem object.
The rest of the library provides facilities to write
.Xr archive_entry 3
objects to archive files,
read them from archive files,
and write them to disk.
(There are plans to add a facility to read
.Xr archive_entry 3
objects from disk as well.)
.Pp
The read and write APIs each have four layers: a public API
layer, a format layer that understands the archive file format,
a compression layer, and an I/O layer.
The I/O layer is completely exposed to clients who can replace
it entirely with their own functions.
.Pp
In order to provide as much consistency as possible for clients,
some public functions are virtualized.
Eventually, it should be possible for clients to open
an archive or disk writer, and then use a single set of
code to select and write entries, regardless of the target.
.Sh READ ARCHITECTURE
From the outside, clients use the
.Xr archive_read 3
API to manipulate an
.Nm archive
object to read entries and bodies from an archive stream.
Internally, the
.Nm archive
object is cast to an
.Nm archive_read
object, which holds all read-specific data.
The API has four layers:
The lowest layer is the I/O layer.
This layer can be overridden by clients, but most clients use
the packaged I/O callbacks provided, for example, by
.Xr archive_read_open_memory 3 ,
and
.Xr archive_read_open_fd 3 .
The compression layer calls the I/O layer to
read bytes and decompresses them for the format layer.
The format layer unpacks a stream of uncompressed bytes and
creates
.Nm archive_entry
objects from the incoming data.
The API layer tracks overall state
(for example, it prevents clients from reading data before reading a header)
and invokes the format and compression layer operations
through registered function pointers.
In particular, the API layer drives the format-detection process:
When opening the archive, it reads an initial block of data
and offers it to each registered compression handler.
The one with the highest bid is initialized with the first block.
Similarly, the format handlers are polled to see which handler
is the best for each archive.
(Prior to 2.4.0, the format bidders were invoked for each
entry, but this design hindered error recovery.)
.Ss I/O Layer and Client Callbacks
The read API goes to some lengths to be nice to clients.
As a result, there are few restrictions on the behavior of
the client callbacks.
.Pp
The client read callback is expected to provide a block
of data on each call.
A zero-length return does indicate end of file, but otherwise
blocks may be as small as one byte or as large as the entire file.
In particular, blocks may be of different sizes.
.Pp
The client skip callback returns the number of bytes actually
skipped, which may be much smaller than the skip requested.
The only requirement is that the skip not be larger.
In particular, clients are allowed to return zero for any
skip that they don't want to handle.
The skip callback must never be invoked with a negative value.
.Pp
Keep in mind that not all clients are reading from disk:
clients reading from networks may provide different-sized
blocks on every request and cannot skip at all;
advanced clients may use
.Xr mmap 2
to read the entire file into memory at once and return the
entire file to libarchive as a single block;
other clients may begin asynchronous I/O operations for the
next block on each request.
.Ss Decompression Layer
The decompression layer not only handles decompression,
it also buffers data so that the format handlers see a
much nicer I/O model.
The decompression API is a two-stage peek/consume model.
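.Pp
As a rough illustration only, a peek/consume reader layered over a
block-oriented client callback might be sketched in C as follows.
This is not libarchive's actual implementation: the names
.Va peek_reader ,
.Fn peek ,
.Fn consume ,
.Va mem_client ,
and
.Fn mem_read
are hypothetical, and the client callback signature is a simplified
assumption made for this sketch.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical client callback: stores a block pointer in *block and
 * returns its length; a return of 0 means end of file. */
typedef size_t (*client_read_fn)(void *client, const void **block);

struct peek_reader {
    client_read_fn read;
    void *client;
    unsigned char *buf;     /* unconsumed data lives at buf + start */
    size_t start, len, cap;
    int eof;
};

/* Peek: return a pointer to at least `min` bytes without consuming them,
 * or NULL if the input ends first (or memory runs out).
 * *avail receives the number of buffered bytes, which may exceed `min`. */
static const void *
peek(struct peek_reader *r, size_t min, size_t *avail)
{
    /* Compact here rather than in consume(), so that buffered data
     * only moves during a peek() call. */
    if (r->start > 0) {
        memmove(r->buf, r->buf + r->start, r->len);
        r->start = 0;
    }
    while (r->len < min && !r->eof) {
        const void *block;
        size_t n = r->read(r->client, &block);
        if (n == 0) {
            r->eof = 1;
            break;
        }
        if (r->len + n > r->cap) {
            size_t cap = (r->len + n) * 2;
            unsigned char *p = realloc(r->buf, cap);
            if (p == NULL) {
                *avail = r->len;
                return NULL;
            }
            r->buf = p;
            r->cap = cap;
        }
        memcpy(r->buf + r->len, block, n);
        r->len += n;
    }
    *avail = r->len;
    return (r->len >= min) ? r->buf : NULL;
}

/* Consume: advance the read position by n previously peeked bytes. */
static void
consume(struct peek_reader *r, size_t n)
{
    assert(n <= r->len);
    r->start += n;
    r->len -= n;
}

/* Hypothetical in-memory client that hands out at most `chunk` bytes
 * per call, to mimic a source that returns small blocks. */
struct mem_client {
    const unsigned char *data;
    size_t size, pos, chunk;
};

static size_t
mem_read(void *client, const void **block)
{
    struct mem_client *m = client;
    size_t n = m->size - m->pos;

    if (n > m->chunk)
        n = m->chunk;
    *block = m->data + m->pos;
    m->pos += n;
    return n;
}
```
.Ed
.Pp
In this sketch, buffered data is moved only inside
.Fn peek ,
so a pointer obtained from one
.Fn peek
call stays usable across calls to
.Fn consume .
A caller that wants bulk data can ask for a minimum of one byte and
then use however many bytes are reported as available.
.Pp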
A read_ahead request specifies a minimum read amount;
the decompression layer must provide a pointer to at least
that much data.
If more data is immediately available, it should return more:
the format layer handles bulk data reads by asking for a minimum
of one byte and then copying as much data as is available.
.Pp
A subsequent call to the
.Fn consume
function advances the read pointer.
Note that data returned from a
.Fn read_ahead
call is guaranteed to remain in place until
the next call to
.Fn read_ahead .
Intervening calls to
.Fn consume
should not cause the data to move.
.Pp
Skip requests must always be handled exactly.
Decompression handlers that cannot seek forward should
not register a skip handler;
the API layer fills in a generic skip handler that reads and discards data.
.Pp
A decompression handler has a specific lifecycle:
.Bl -tag -compact -width indent
.It Registration/Configuration
When the client invokes the public support function,
the decompression handler invokes the internal
.Fn __archive_read_register_compression
function to provide bid and initialization functions.
This function returns
.Cm NULL
on error or else a pointer to a
.Cm struct decompressor_t .
This structure contains a
.Va void * config
slot that can be used for storing any customization information.
.It Bid
The bid function is invoked with a pointer and size of a block of data.
The decompressor can access its config data
through the
.Va decompressor
element of the
.Cm archive_read
object.
The bid function is otherwise stateless.
In particular, it must not perform any I/O operations.
.Pp
The value returned by the bid function indicates its suitability
for handling this data stream.
A bid of zero will ensure that this decompressor is never invoked.
Return zero if magic number checks fail.
Otherwise, your initial implementation should return the number of bits
actually checked.
For example, if you verify two full bytes and three bits of another
byte, bid 19.
Note that the initial block may be very short;
be careful to only inspect the data you are given.
(The current decompressors require two bytes for correct bidding.)
.It Initialize
The winning bidder will have its init function called.
This function should initialize the remaining slots of the
.Va struct decompressor_t
object pointed to by the
.Va decompressor
element of the
.Va archive_read
object.
In particular, it should allocate any working data it needs
in the
.Va data
slot of that structure.
The init function is called with the block of data that
was used for tasting.
At this point, the decompressor is responsible for all I/O
requests to the client callbacks.
The decompressor is free to read more data as and when
necessary.
.It Satisfy I/O requests
The format handler will invoke the
.Va read_ahead ,
.Va consume ,
and
.Va skip
functions as needed.
.It Finish
The finish method is called only once when the archive is closed.
It should release anything stored in the
.Va data
and
.Va config
slots of the
.Va decompressor
object.
It should not invoke the client close callback.
.El
.Ss Format Layer
The read formats have a similar lifecycle to the decompression handlers:
.Bl -tag -compact -width indent
.It Registration
Allocate your private data and initialize your pointers.
.It Bid
Formats bid by invoking the
.Fn read_ahead
decompression method but not calling the
.Fn consume
method.
This allows each bidder to look ahead in the input stream.
Bidders should not look further ahead than necessary, as long
lookaheads put pressure on the decompression layer to buffer
lots of data.
Most formats only require a few hundred bytes of lookahead;
lookaheads of a few kilobytes are reasonable.
(The ISO9660 reader sometimes looks ahead by 48k, which
should be considered an upper limit.)
.It Read header
The header read is usually the most complex part of any format.
There are a few strategies worth mentioning:
For formats such as tar or cpio, reading and parsing the header is
straightforward since headers alternate with data.
For formats that store all header data at the beginning of the file,
the first header read request may have to read all headers into
memory and store that data, sorted by the location of the file
data.
Subsequent header read requests will skip forward to the
beginning of the file data and return the corresponding header.
.It Read Data
The read data interface supports sparse files; this requires that
each call return a block of data specifying the file offset and
size.
This may require you to carefully track the location so that you
can return accurate file offsets for each read.
Remember that the decompressor will return as much data as it has.
Generally, you will want to request one byte,
examine the return value to see how much data is available, and
possibly trim that to the amount you can use.
You should invoke consume for each block just before you return it.
.It Skip All Data
The skip data call should skip over all file data and trailing padding.
This is called automatically by the API layer just before each
header read.
It is also called in response to the client calling the public
.Fn data_skip
function.
.It Cleanup
On cleanup, the format should release all of its allocated memory.
.El
.Ss API Layer
XXX to do XXX
.Sh WRITE ARCHITECTURE
The write API has a similar set of four layers:
an API layer, a format layer, a compression layer, and an I/O layer.
The registration here is much simpler because only
one format and one compression can be registered at a time.
.Ss I/O Layer and Client Callbacks
XXX To be written XXX
.Ss Compression Layer
XXX To be written XXX
.Ss Format Layer
XXX To be written XXX
.Ss API Layer
XXX To be written XXX
.Sh WRITE_DISK ARCHITECTURE
The write_disk API is intended to look just like the write API
to clients.
Since it does not handle multiple formats or compression, it
is not layered internally.
.Sh GENERAL SERVICES
The
.Nm archive_read ,
.Nm archive_write ,
and
.Nm archive_write_disk
objects all contain an initial
.Nm archive
object which provides common support for a set of standard services.
(Recall that ANSI/ISO C90 guarantees that you can cast freely between
a pointer to a structure and a pointer to the first element of that
structure.)
The
.Nm archive
object has a magic value that indicates which API this object
is associated with,
slots for storing error information,
and function pointers for virtualized API functions.
.Sh MISCELLANEOUS NOTES
Connecting existing archiving libraries into libarchive is generally
quite difficult.
In particular, many existing libraries strongly assume that you
are reading from a file; they seek forwards and backwards as necessary
to locate various pieces of information.
In contrast, libarchive never seeks backwards in its input, which
sometimes requires very different approaches.
.Pp
For example, libarchive's ISO9660 support operates very differently
from most ISO9660 readers.
The libarchive support utilizes a work-queue design that
keeps a list of known entries sorted by their location in the input.
Whenever libarchive's ISO9660 implementation is asked for the next
header, it checks this list to find the next item on the disk.
Directories are parsed when they are encountered and new
items are added to the list.
This design relies heavily on the ISO9660 image being optimized so that
directories always occur earlier on the disk than the files they
describe.
.Pp
Depending on the specific format, such approaches may not be possible.
The ZIP format specification, for example, allows archivers to store
key information only at the end of the file.
In theory, it is possible to create ZIP archives that cannot
be read without seeking.
Fortunately, such archives are very rare, and libarchive can read
most ZIP archives, though it cannot always extract as much information
as a dedicated ZIP program.
.Sh SEE ALSO
.Xr archive_entry 3 ,
.Xr archive_read 3 ,
.Xr archive_write 3 ,
.Xr archive_write_disk 3 ,
.Xr libarchive 3
.Sh HISTORY
The
.Nm libarchive
library first appeared in
.Fx 5.3 .
.Sh AUTHORS
.An -nosplit
The
.Nm libarchive
library was written by
.An Tim Kientzle Aq kientzle@acm.org .