.\" Copyright (c) 2003-2007 Tim Kientzle
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd January 26, 2011
.Dt LIBARCHIVE_INTERNALS 3
.Os
.Sh NAME
.Nm libarchive_internals
.Nd description of libarchive internal interfaces
.Sh OVERVIEW
The
.Nm libarchive
library provides a flexible interface for reading and writing
streaming archive files such as tar and cpio.
Internally, it follows a modular layered design that should
make it easy to add new archive and compression formats.
.Sh GENERAL ARCHITECTURE
Externally, libarchive exposes most operations through an
opaque, object-style interface.
The
.Xr archive_entry 3
objects store information about a single filesystem object.
The rest of the library provides facilities to write
.Xr archive_entry 3
objects to archive files,
read them from archive files,
and write them to disk.
(There are plans to add a facility to read
.Xr archive_entry 3
objects from disk as well.)
.Pp
The read and write APIs each have four layers: a public API
layer, a format layer that understands the archive file format,
a compression layer, and an I/O layer.
The I/O layer is completely exposed to clients who can replace
it entirely with their own functions.
.Pp
In order to provide as much consistency as possible for clients,
some public functions are virtualized.
Eventually, it should be possible for clients to open
an archive or disk writer, and then use a single set of
code to select and write entries, regardless of the target.
.Sh READ ARCHITECTURE
From the outside, clients use the
.Xr archive_read 3
API to manipulate an
.Nm archive
object to read entries and bodies from an archive stream.
Internally, the
.Nm archive
object is cast to an
.Nm archive_read
object, which holds all read-specific data.
The API has four layers.
The lowest layer is the I/O layer.
This layer can be overridden by clients, but most clients use
the packaged I/O callbacks provided, for example, by
.Xr archive_read_open_memory 3
and
.Xr archive_read_open_fd 3 .
The compression layer calls the I/O layer to
read bytes and decompresses them for the format layer.
The format layer unpacks a stream of uncompressed bytes and
creates
.Nm archive_entry
objects from the incoming data.
The API layer tracks overall state
(for example, it prevents clients from reading data before reading a header)
and invokes the format and compression layer operations
through registered function pointers.
In particular, the API layer drives the format-detection process:
When opening the archive, it reads an initial block of data
and offers it to each registered compression handler.
The one with the highest bid is initialized with the first block.
Similarly, the format handlers are polled to see which handler
is the best for each archive.
(Prior to 2.4.0, the format bidders were invoked for each
entry, but this design hindered error recovery.)
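.Pp
The bidding step described above can be sketched in a few lines of C.
This is only an illustration under stated assumptions: the
.Vt struct bidder
type and the functions
.Fn pick_bidder ,
.Fn bid_passthrough ,
and
.Fn bid_gzip_magic
are hypothetical names invented for this example and are not
libarchive's internal interfaces.
.Bd -literal -offset indent
```c
#include <stddef.h>

/* Hypothetical bidder record; libarchive's real structures differ. */
struct bidder {
	const char *name;
	int (*bid)(const unsigned char *buf, size_t len);
};

/* Example bidder: accepts anything, but only with a minimal bid. */
int
bid_passthrough(const unsigned char *buf, size_t len)
{
	(void)buf;
	(void)len;
	return (1);
}

/* Example bidder: bids 16 bits when the two-byte gzip magic matches. */
int
bid_gzip_magic(const unsigned char *buf, size_t len)
{
	if (len < 2 || buf[0] != 0x1f || buf[1] != 0x8b)
		return (0);
	return (16);
}

/* Offer the block to every bidder; return the highest positive bid. */
const struct bidder *
pick_bidder(const struct bidder *list, size_t count,
    const unsigned char *buf, size_t len)
{
	const struct bidder *best = NULL;
	int best_bid = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		int b = list[i].bid(buf, len);

		if (b > best_bid) {
			best_bid = b;
			best = &list[i];
		}
	}
	return (best);	/* NULL when every bidder returned zero */
}
```
.Ed
In this sketch a bid of zero can never win, and a more specific match
outbids the catch-all handler.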
.Ss I/O Layer and Client Callbacks
The read API goes to some lengths to be nice to clients.
As a result, there are few restrictions on the behavior of
the client callbacks.
.Pp
The client read callback is expected to provide a block
of data on each call.
A zero-length return does indicate end of file, but otherwise
blocks may be as small as one byte or as large as the entire file.
In particular, blocks may be of different sizes.
.Pp
The client skip callback returns the number of bytes actually
skipped, which may be much smaller than the skip requested.
The only requirement is that the skip not be larger.
In particular, clients are allowed to return zero for any
skip that they don't want to handle.
The skip callback must never be invoked with a negative value.
.Pp
Keep in mind that not all clients are reading from disk:
clients reading from networks may provide different-sized
blocks on every request and cannot skip at all;
advanced clients may use
.Xr mmap 2
to read the entire file into memory at once and return the
entire file to libarchive as a single block;
other clients may begin asynchronous I/O operations for the
next block on each request.
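.Pp
As a rough illustration of the callback contract described above,
the following sketch serves an in-memory buffer in blocks of
arbitrary size.
The
.Vt struct mem_source
type and the
.Fn mem_read
and
.Fn mem_skip
functions are hypothetical and use simplified signatures;
the real callback prototypes are documented in
.Xr archive_read 3 .
The
.Va chunk
field is assumed to be nonzero.
.Bd -literal -offset indent
```c
#include <stddef.h>

/* Hypothetical client state: an in-memory "file" handed out in chunks. */
struct mem_source {
	const unsigned char *data;
	size_t size;	/* total bytes in data */
	size_t pos;	/* bytes already handed out or skipped */
	size_t chunk;	/* largest block returned per call (nonzero) */
};

/* Read: point *buf at the next block; return its length, 0 at EOF. */
long
mem_read(struct mem_source *src, const void **buf)
{
	size_t n = src->size - src->pos;

	if (n > src->chunk)
		n = src->chunk;
	*buf = src->data + src->pos;
	src->pos += n;
	return ((long)n);
}

/*
 * Skip: advance by at most request bytes and report the amount.
 * Returning less than requested (even zero) is fine; more is not.
 */
long
mem_skip(struct mem_source *src, long request)
{
	size_t left = src->size - src->pos;
	size_t n;

	if (request <= 0)
		return (0);
	n = (size_t)request < left ? (size_t)request : left;
	src->pos += n;
	return ((long)n);
}
```
.Ed
A network client could implement the same read contract and simply
return zero from every skip request.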
.Ss Decompression Layer
The decompression layer not only handles decompression,
it also buffers data so that the format handlers see a
much nicer I/O model.
The decompression API is a two-stage peek/consume model.
A read_ahead request specifies a minimum read amount;
the decompression layer must provide a pointer to at least
that much data.
If more data is immediately available, it should return more:
the format layer handles bulk data reads by asking for a minimum
of one byte and then copying as much data as is available.
.Pp
A subsequent call to the
.Fn consume
function advances the read pointer.
Note that data returned from a
.Fn read_ahead
call is guaranteed to remain in place until
the next call to
.Fn read_ahead .
Intervening calls to
.Fn consume
should not cause the data to move.
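.Pp
The peek/consume model can be sketched over a plain memory buffer.
The names
.Vt struct peek_buf ,
.Fn peek_read_ahead ,
and
.Fn peek_consume
are hypothetical and the signatures are simplified;
a real decompressor would also refill its buffer from the I/O layer,
which this sketch omits.
.Bd -literal -offset indent
```c
#include <stddef.h>

/* Hypothetical buffered view of already-decompressed data. */
struct peek_buf {
	const unsigned char *data;	/* next unconsumed byte */
	size_t avail;			/* bytes not yet consumed */
};

/*
 * Peek: return a pointer to at least min bytes without consuming
 * them, or NULL if fewer remain.  *avail reports everything that is
 * available, which may be much more than min.
 */
const unsigned char *
peek_read_ahead(struct peek_buf *b, size_t min, size_t *avail)
{
	*avail = b->avail;
	if (b->avail < min)
		return (NULL);
	return (b->data);
}

/* Consume: advance the read position; the bytes themselves stay put. */
void
peek_consume(struct peek_buf *b, size_t n)
{
	if (n > b->avail)
		n = b->avail;
	b->data += n;
	b->avail -= n;
}
```
.Ed
Because consuming only moves a pointer, a caller may peek, consume
part of the block, and keep using the pointer it was given until its
next peek.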
.Pp
Skip requests must always be handled exactly.
Decompression handlers that cannot seek forward should
not register a skip handler;
the API layer fills in a generic skip handler that reads and discards data.
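.Pp
A generic read-and-discard skip might look roughly like the
following.
This is a sketch, not the library's code: the
.Vt struct fwd_stream
type and the
.Fn fwd_read_ahead ,
.Fn fwd_consume ,
and
.Fn generic_skip
functions are hypothetical stand-ins for a forward-only
decompressor.
.Bd -literal -offset indent
```c
#include <stddef.h>

/* Hypothetical forward-only stream that exposes data block by block. */
struct fwd_stream {
	size_t total;	/* bytes left in the stream */
	size_t block;	/* most bytes visible per read-ahead */
};

/* Report how many bytes are visible right now (0 at end of stream). */
size_t
fwd_read_ahead(const struct fwd_stream *s)
{
	return (s->total < s->block ? s->total : s->block);
}

/* Discard n bytes that were previously reported as visible. */
void
fwd_consume(struct fwd_stream *s, size_t n)
{
	s->total -= n;
}

/*
 * Skip exactly request bytes by reading and discarding.  The result
 * is smaller than request only if the stream ends first.
 */
size_t
generic_skip(struct fwd_stream *s, size_t request)
{
	size_t skipped = 0;

	while (skipped < request) {
		size_t avail = fwd_read_ahead(s);
		size_t want = request - skipped;

		if (avail == 0)
			break;		/* end of stream */
		if (avail > want)
			avail = want;	/* never discard too much */
		fwd_consume(s, avail);
		skipped += avail;
	}
	return (skipped);
}
```
.Ed
Trimming each block to the outstanding amount is what makes the skip
exact even when the stream delivers more data than was asked for.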
.Pp
A decompression handler has a specific lifecycle:
.Bl -tag -compact -width indent
.It Registration/Configuration
When the client invokes the public support function,
the decompression handler invokes the internal
.Fn __archive_read_register_compression
function to provide bid and initialization functions.
This function returns
.Dv NULL
on error or else a pointer to a
.Vt struct decompressor_t .
This structure contains a
.Va void * config
slot that can be used for storing any customization information.
.It Bid
The bid function is invoked with a pointer to a block of data
and the size of that block.
The decompressor can access its config data
through the
.Va decompressor
element of the
.Vt archive_read
object.
The bid function is otherwise stateless.
In particular, it must not perform any I/O operations.
.Pp
The value returned by the bid function indicates its suitability
for handling this data stream.
A bid of zero will ensure that this decompressor is never invoked.
Return zero if magic number checks fail.
Otherwise, your initial implementation should return the number of bits
actually checked.
For example, if you verify two full bytes and three bits of another
byte, bid 19.
Note that the initial block may be very short;
be careful to only inspect the data you are given.
(The current decompressors require two bytes for correct bidding.)
.It Initialize
The winning bidder will have its init function called.
This function should initialize the remaining slots of the
.Vt struct decompressor_t
object pointed to by the
.Va decompressor
element of the
.Vt archive_read
object.
In particular, it should allocate any working data it needs
in the
.Va data
slot of that structure.
The init function is called with the block of data that
was used for tasting.
At this point, the decompressor is responsible for all I/O
requests to the client callbacks.
The decompressor is free to read more data as and when
necessary.
.It Satisfy I/O requests
The format handler will invoke the
.Va read_ahead ,
.Va consume ,
and
.Va skip
functions as needed.
.It Finish
The finish method is called only once when the archive is closed.
It should release anything stored in the
.Va data
and
.Va config
slots of the
.Va decompressor
object.
It should not invoke the client close callback.
.El
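.Pp
The bit-counting rule for bids can be illustrated with a made-up
format.
The sketch below assumes a purely hypothetical header that starts
with the bytes 0x44 0x4D followed by a byte whose top three bits must
be zero; neither the format nor the
.Fn demo_bid
function exists in libarchive.
.Bd -literal -offset indent
```c
#include <stddef.h>

/*
 * Bid for a hypothetical format: two magic bytes (16 bits checked)
 * plus three reserved bits that must be zero (3 more bits checked).
 */
int
demo_bid(const unsigned char *buf, size_t len)
{
	int bits = 0;

	/* The initial block may be short; never look past len. */
	if (len < 3)
		return (0);
	if (buf[0] != 0x44 || buf[1] != 0x4D)
		return (0);	/* magic number check failed */
	bits += 16;
	if ((buf[2] & 0xE0) != 0)
		return (0);	/* reserved bits are set */
	bits += 3;
	return (bits);		/* 19 bits verified */
}
```
.Ed
The function touches only the bytes it was handed and performs no
I/O, in keeping with the stateless bid contract.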
.Ss Format Layer
The read formats have a similar lifecycle to the decompression handlers:
.Bl -tag -compact -width indent
.It Registration
Allocate your private data and initialize your pointers.
.It Bid
Formats bid by invoking the
.Fn read_ahead
decompression method but not calling the
.Fn consume
method.
This allows each bidder to look ahead in the input stream.
Bidders should not look further ahead than necessary, as long
look-aheads put pressure on the decompression layer to buffer
lots of data.
Most formats only require a few hundred bytes of look-ahead;
look-aheads of a few kilobytes are reasonable.
(The ISO9660 reader sometimes looks ahead by 48k, which
should be considered an upper limit.)
.It Read header
The header read is usually the most complex part of any format.
There are a few strategies worth mentioning:
For formats such as tar or cpio, reading and parsing the header is
straightforward since headers alternate with data.
For formats that store all header data at the beginning of the file,
the first header read request may have to read all headers into
memory and store that data, sorted by the location of the file
data.
Subsequent header read requests will skip forward to the
beginning of the file data and return the corresponding header.
.It Read Data
The read data interface supports sparse files; this requires that
each call return a block of data specifying the file offset and
size.
This may require you to carefully track the location so that you
can return accurate file offsets for each read.
Remember that the decompressor will return as much data as it has.
Generally, you will want to request one byte,
examine the return value to see how much data is available, and
possibly trim that to the amount you can use.
You should invoke consume for each block just before you return it.
.It Skip All Data
The skip data call should skip over all file data and trailing padding.
This is called automatically by the API layer just before each
header read.
It is also called in response to the client calling the public
.Fn data_skip
function.
.It Cleanup
On cleanup, the format should release all of its allocated memory.
.El
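.Pp
The
.Dq Read Data
step above can be sketched as follows.
The
.Vt struct entry_reader
type and the
.Fn read_data_block
function are hypothetical; the
.Va stream
and
.Va stream_avail
fields stand in for whatever the decompression layer currently has
buffered.
.Bd -literal -offset indent
```c
#include <stddef.h>

/* Hypothetical per-entry state kept by a format handler. */
struct entry_reader {
	const unsigned char *stream;	/* buffered input */
	size_t stream_avail;		/* bytes buffered */
	size_t entry_remaining;		/* body bytes left in this entry */
	long long offset;		/* file offset of the next block */
};

/*
 * Hand out the next block of the current entry along with its size
 * and file offset.  Returns 1 if a block was produced, 0 when the
 * entry body is finished or no input is buffered.
 */
int
read_data_block(struct entry_reader *r, const void **buf,
    size_t *size, long long *offset)
{
	size_t n = r->stream_avail;	/* what a 1-byte peek reports */

	if (r->entry_remaining == 0 || n == 0)
		return (0);
	if (n > r->entry_remaining)
		n = r->entry_remaining;	/* trim to this entry */
	*buf = r->stream;
	*size = n;
	*offset = r->offset;
	/* Consume just before returning the block. */
	r->stream += n;
	r->stream_avail -= n;
	r->entry_remaining -= n;
	r->offset += (long long)n;
	return (1);
}
```
.Ed
Trimming keeps the handler from giving the client bytes that belong
to the next header, and the running offset lets each block report
where it belongs in the file.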
.Ss API Layer
XXX to do XXX
.Sh WRITE ARCHITECTURE
The write API has a similar set of four layers:
an API layer, a format layer, a compression layer, and an I/O layer.
The registration here is much simpler because only
one format and one compression can be registered at a time.
.Ss I/O Layer and Client Callbacks
XXX To be written XXX
.Ss Compression Layer
XXX To be written XXX
.Ss Format Layer
XXX To be written XXX
.Ss API Layer
XXX To be written XXX
.Sh WRITE_DISK ARCHITECTURE
The write_disk API is intended to look just like the write API
to clients.
Since it does not handle multiple formats or compression, it
is not layered internally.
.Sh GENERAL SERVICES
The
.Nm archive_read ,
.Nm archive_write ,
and
.Nm archive_write_disk
objects all contain an initial
.Nm archive
object which provides common support for a set of standard services.
(Recall that ANSI/ISO C90 guarantees that you can cast freely between
a pointer to a structure and a pointer to the first member of that
structure.)
The
.Nm archive
object has a magic value that indicates which API this object
is associated with,
slots for storing error information,
and function pointers for virtualized API functions.
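.Pp
The first-member cast and magic check can be illustrated with a
small sketch.
All of the names and magic values below, including
.Vt struct demo_archive ,
.Vt struct demo_archive_read ,
and
.Fn demo_as_read ,
are hypothetical and far simpler than the library's own structures.
.Bd -literal -offset indent
```c
#include <stddef.h>

/* Hypothetical magic values; the library defines its own. */
#define DEMO_READ_MAGIC		0x1001U
#define DEMO_WRITE_MAGIC	0x1002U

/* Common base object shared by every API. */
struct demo_archive {
	unsigned int magic;	/* identifies the owning API */
	int error_code;		/* slot for error information */
};

/* Read-specific object; the base must be the first member. */
struct demo_archive_read {
	struct demo_archive archive;
	size_t bytes_read;
};

/* Convert a base pointer to a read object after checking the magic. */
struct demo_archive_read *
demo_as_read(struct demo_archive *a)
{
	if (a == NULL || a->magic != DEMO_READ_MAGIC)
		return (NULL);
	return ((struct demo_archive_read *)a);
}
```
.Ed
Checking the magic value before the cast is what lets a public
function reject an object that belongs to a different API.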
.Sh MISCELLANEOUS NOTES
Connecting existing archiving libraries into libarchive is generally
quite difficult.
In particular, many existing libraries strongly assume that you
are reading from a file; they seek forwards and backwards as necessary
to locate various pieces of information.
In contrast, libarchive never seeks backwards in its input, which
sometimes requires very different approaches.
.Pp
For example, libarchive's ISO9660 support operates very differently
from most ISO9660 readers.
The libarchive support utilizes a work-queue design that
keeps a list of known entries sorted by their location in the input.
Whenever libarchive's ISO9660 implementation is asked for the next
header, it checks this list to find the next item on the disk.
Directories are parsed when they are encountered and new
items are added to the list.
This design relies heavily on the ISO9660 image being optimized so that
directories always occur earlier on the disk than the files they
describe.
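.Pp
A work queue ordered by input location can be sketched with a small
sorted array.
This is an illustration only: the
.Vt struct work_queue
type, the fixed
.Dv QUEUE_MAX
capacity, and the
.Fn queue_add
and
.Fn queue_next
functions are assumptions made for the example and do not reflect
how the ISO9660 reader is actually implemented.
.Bd -literal -offset indent
```c
#include <stddef.h>

#define QUEUE_MAX 64	/* arbitrary capacity for this sketch */

/* One known entry and where its data starts in the input. */
struct pending {
	unsigned long offset;
	const char *name;
};

struct work_queue {
	struct pending items[QUEUE_MAX];	/* ascending by offset */
	size_t count;
};

/* Insert an entry, keeping the array sorted.  Returns -1 if full. */
int
queue_add(struct work_queue *q, unsigned long offset, const char *name)
{
	size_t i;

	if (q->count == QUEUE_MAX)
		return (-1);
	/* Shift later entries up to open a slot at the right place. */
	for (i = q->count; i > 0 && q->items[i - 1].offset > offset; i--)
		q->items[i] = q->items[i - 1];
	q->items[i].offset = offset;
	q->items[i].name = name;
	q->count++;
	return (0);
}

/* Remove the entry nearest the start of the input.  Returns 0 if empty. */
int
queue_next(struct work_queue *q, struct pending *out)
{
	size_t i;

	if (q->count == 0)
		return (0);
	*out = q->items[0];
	for (i = 1; i < q->count; i++)
		q->items[i - 1] = q->items[i];
	q->count--;
	return (1);
}
```
.Ed
Because the next item always has the smallest offset, a reader that
processes entries in this order only ever moves forward through its
input.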
.Pp
Depending on the specific format, such approaches may not be possible.
The ZIP format specification, for example, allows archivers to store
key information only at the end of the file.
In theory, it is possible to create ZIP archives that cannot
be read without seeking.
Fortunately, such archives are very rare, and libarchive can read
most ZIP archives, though it cannot always extract as much information
as a dedicated ZIP program.
.Sh SEE ALSO
.Xr archive_entry 3 ,
.Xr archive_read 3 ,
.Xr archive_write 3 ,
.Xr archive_write_disk 3 ,
.Xr libarchive 3
.Sh HISTORY
The
.Nm libarchive
library first appeared in
.Fx 5.3 .
.Sh AUTHORS
.An -nosplit
The
.Nm libarchive
library was written by
.An Tim Kientzle Aq kientzle@acm.org .