xref: /386bsd/usr/src/kernel/kern/fs/ann/x/bio.c.me (revision a2142627)
.po 8n .po 0n $Id: bio.c.ann,v 1.3 94/08/26 13:18:49 bill Exp Locker: bill $
.he 'Annotation of kern/fs/bio.c''DRAFT' .fo 'Revision: 1.3 '% 'Copyright (C) 1994 TeleMuse .(l C .sz 14 Annotation of the File kernel/kern/fs/bio.c William and Lynne Jolitz .sz 10 Copyright (C) 1994 William and Lynne Jolitz. All Rights Reserved. .)l .uh "Introduction:" .pp The file .b "fs/bio.c" contains the machine independent code used to implement the cache management of independent filesystems. It corresponds to the middle layer of abstraction of a filesystem which maintains the exchange of file data and metadata contents read or written to the external storage comprising the physical file. .pp Because this interface is shared by many filesystems and many processes which are clients of a filesystem (even clients of the same exact file), the reference to this information is relative to the vnode describing the file and an opaque block offset (relative to the filesystem) that describes the contents that are cached. While the total resource dedicated to the filesystem is globally managed among all clients, the interpretation and management of the cached entries themselves is done relative to each filesystem and opaque to unrelated filesystems. .uh "Buffer Cache in Detail" .pp This filesystem cache consists of buffer elements, which are transactions of file data or metadata. Buffers are comprised of two parts: a fixed-sized buffer header that records details of the transaction, and a variable-sized data portion that holds the buffer contents associated with the transaction. The data portion can be thought of as a side-effect of the buffer header, since the buffer cache makes no direct use of it at all. .pp The cache works by attempting to either avoid transactions by reusing buffers and short-circuiting redundant transactions, or by anticipate new transactions before they happen so that they may be overlapped. In addition, the buffer cache provides an interface between the synchronous filesystem data transfer requests and the asynchronous mechanisms (device driver, network, and so forth) used as a basis for transferring the actual data. Finally, the scheduling of global I/O is made external to the various filesystems and managed at a central point for consistent policy. .pp The filesystem buffer cache is only called by filesystems; in turn, it only interacts with filesystems to affect its actions. It interfaces to the basic kernel by obtaining and releasing memory resource for buffer contents (see .b malloc() (***HYPERLINK WHAT IS MALLOC() IN MALLOC.C - POPUP) and .b free() (***HYPERLINK WHAT IS FREE() IN MALLOC.C - POPUP) in .b kern/malloc.c (***HYPERLINK MALLOC.C TOC - MACRO) for further information). It also interfaces to to the master VFS by performing generic operations on the control of filesystem contents (see .b vflushbuf(), .b bgetvp(), .b brelvp(), and .b vwakeup()). .pp One should note that the two write alternatives .b bawrite() and .b bdwrite () have their different operations cited at the front of the name because the decision to write occurs before the operation. This contrasts its operation from functions such as .b breada(), where in this case the operation of the second asynchronous read operation occurs after the first operation. .uh "The Life Cycle of a Buffer in the Buffer Cache" .pp The buffer cache caches I/O transfer requests (either to a device driver or to an underlying storage mechanism, such as a network protocol). As such, it performs I/O on demand when either a buffer does not contain the appropriate contents or when the contents are forced to be written out. I/O goes on independent of the requests to view the buffer cache. This can be considered not only a design feature but also a flaw, since filesystems don't have any control over when I/O might occur as all I/O is lumped together into one set of policies for all filesystems instead of using a more sensible mechanism such as a "lookaside" cache arrangement. .pp Buffers to be read are generally obtained by the .b bread() (***HYPERLINK WHAT IS BREAD() or .b breada() (***HYPERLINK WHAT IS BREADA() functions which ensure that the contents of a block exist and return the buffer containing this contents or an error indicating why an I/O operation to a buffer could not obtain the appropriate contents. In the case that contents could not be obtained, the buffer that was allocated to receive the contents is left as invalid and will be immediately recycled since it has no contents to cache. Since these functions had to allocate a buffer, they may block waiting for a usable buffer to become available. Either the buffer is the desired one that already has contents available without any I/O required (but is still in use by another process exclusively; thus, the caller must block until it can obtain exclusive access) or it is not. In this case, a free buffer must be obtained for .b bread() and .b breada() to obtain contents for it to return. .pp In the process of writing a buffer, the buffer is obtained from .b getblk() which either finds an existing buffer with the cached contents (the flag B_CACHE indicates that the buffer was found with contents intact in the cache) or allocates a buffer with no contents present covering this portion of underlying storage exclusively in the buffer cache and marks it as being invalid (B_INVAL). In the second case, if a portion of the buffer were to be modified, a read is necessary to obtain valid contents before the write of the rest of the buffer could take place. (A .b bread() would best provide this since it would ensure that the buffer was read. However, if the entire point of this operation was to ensure that there was only one entry in the buffer cache corresponding to the entry and the contents itself considered irrelevant, a .b getblk() call is sufficient. In either case, exclusive access to the portion of that vnode is now assured since all other requesters will wait until the buffer has been returned to the free pool for use by other processes. .pp Having now obtained the buffer, its contents can be loaded and passed to one of the write routines to be written to storage. The .b bwrite() (***HYPERLINK WHAT IS BWRITE() function will immediately force the operation to occur and wait for the result (suspending during the operation), while the .b bawrite() (***HYPERLINK WHAT IS BAWRITE() will issue the operation and arrange for its release to the free pool upon completion without waiting. A third variant, .b bdwrite(), (***HYPERLINK WHAT IS BDWRITE() will delay the actual write operation until either the buffer must be reused for purposes of being used elsewhere in the cache or if forced written by either a .b bawrite() or .b bwrite(). .pp Reads (when finished) return buffers to the buffer cache with a .b brelse() (***HYPERLINK WHAT IS BRELSE() function call. In the case of an asynchronous operation, .b brelse() can also be called from interrupt level by .b biodone() (***HYPERLINK WHAT IS BIODONE() to release the buffer cache buffer back to the free pool. Other operations waiting for a .b biodone() to signal completion of the operation will use the .b biowait() (***HYPERLINK WHAT IS BIOWAIT() function, waiting for the operation to complete by suspending (and also recovering the error if any occurred in performing the operation). Note that there is no functional interface externally to see if a buffer is in use, as all operations may block to gain exclusive access to a buffer, however indefinitely long that block may take. .pp Buffers may be expanded or contracted to hold different sized contents. The .b allocbuf() (***HYPERLINK WHAT IS ALLOCBUF() function expands or contracts the size of a buffer as requested. It is used by the filesystem to change the size of the entry that is being cached while preserving the contents present in the previously sized buffer (throwing away information on a reduction and retaining partial contents on an expansion). .uh "Delayed Write and Read-ahead Operations" .pp Delayed write operations are buffer transactions that are recorded as having been written when they actually haven't -- instead, they "dwell" in the buffer cache for a period of time. This mechanism allows for temporary blocks to appear to be "written" and "read" without any I/O actually occurring before invalidation occurs (as would be the case with compiler temporary files). In addition, delayed write operations allow output operations to be queued behind read operations, thus postponing their effect until the cache entry is needed by a read operation. .pp Read-ahead operations, in contrast, are the anodyne to delayed write operations, in that read-ahead operations anticipate the need for a successive block. Since most I/O on Unix systems is sequential, successive read operations can be detected and the operations pre-queued for the next successive block to achieve double buffering. .pp Note that read-ahead and "write-behind" (e.g. delayed writes) operations create additional traffic for the buffer cache because a certain amount of the buffer cache is tied up in holding the state for both reads and writes that may or may not be of use. This buffer cache memory overhead is the price paid for attempting to avoid a more expensive I/O operation. .pp One interesting question raised by this approach is that of just who has priority -- read-ahead blocks which haven't proved that the information needs to be cached, or write-behind blocks which may not ever be reused again and instead be queued for output? If priority is given to read-ahead blocks, we tend to commit delayed writes back to disk earlier and effectively perform more write operations than might have otherwise been needed. This may be especially expensive if we have to read them back in again often, as would be the case with compiler temporaries. On the other hand, if priority is given to delayed write operations, the read-ahead operation can be severely impacted by the lack of buffer space necessary to hold its contents. In this case, we will end up always synchronously reading from the disk. .pp Since the system does more reads than writes, read-ahead operations are favored over delayed write operations, resolving this contradiction, but this is a somewhat arbitrary solution. Unfortunately, since this caching operation occurs at a fairly low level of abstraction in the system, there is insufficient information available (e.g. arrival pattern, file persistence duration -- temporary versus permanent) with which to implement a sophisticated policy mechanism (such as one using queuing theory). .pp For example, when the system does a core dump of a program larger than the size of the buffer cache, the buffer cache is immediately overrun with contents bound for the disk (or network), wiping out all previously held contents in the process. (Ironically, this core dump contents may never even be read, especially if this is a system process). If this core dump is sufficiently large it may take several seconds to clear in writing to the disk. During this time, the system will appear to "pause" as the other blocks queued for transfer wait. .pp If instead of allowing the buffer cache to be overwritten we possessed the appropriate policy information, writes would only be accepted into the cache at the rate at which they were consumed by the underlying transfer mechanism, resulting in a steady-state. Since we cannot satisfy the transfer rate any faster than that allowed by the underlying device, it doesn't help if we accept any more transactions into the cache than can possibly be satisfied by the lower levels of the system. .pp One area of discussion is determining just "when" persistence in the buffer cache of information should be allowed in excess of the ability for the lower level storage mechanism to transfer the information. In essence, we must decide when we should reuse the information and that it should persist regardless (the compiler temporary case) and when we should constrain the "virtual bandwidth" by the lower level device (as in the core file case). .uh "Optimizing Buffer Transactions" .pp Read-ahead and write behind operations were used in the original Unix buffer cache. In fact, almost all timesharing operating systems that had buffered filesystem transfer caches prior to Unix, including BTSS and Multics, also used these operations to reduce the number of transactions incurred by managing a cache of them and reordering them as necessary to achieve the most optimal time and number of transactions necessary, so this concept is not unique to Unix. .pp Ironically, by allowing read-ahead operations we actually increase the number of transactions that occur. However, since the net cost of occasionally throwing away a read-ahead block is a low one, the overall system efficiency is increased anyways (since on the off chance we did use the read-ahead, an operation that would have otherwise been blocked would not need to occur). Some Unix systems defeat read-ahead and aging under the mistaken belief that simply reducing the number of operations must increase system efficiency. This is not always the case as one must examine the actual delay time reduction of the application programs as well as the number of operations to decide if the optimization is successful .pp For example, in the 386BSD Release 0.1 version of .b bio.c delayed writes and readaheads were not implemented correctly for all cases. After this was rectified, the total number of I/O operations were observed to have increased (which we found surprising). However, the time required for the average completion of a kernel build actually dropped! How could this be? We found that read-ahead operations would occasionally be successful and occasionally not -- those not successful would have to be reissued, so the number of operations increased. Why this did not degrade performance but instead actually increase it was that when the block was of immediate use, it avoided a delay in the utility getting a block of data. However, in the case where it wasn't of any value it would be reused by the delayed write operation of a compiler temporary and thus would hold onto the information anyways. As such, the proper effect of both delayed write and read-ahead operations would result. In addition, the reason why the number of aggregate read/write operations increased was because, while many of the read-ahead blocks were never used, we never had to wait for them anyways, so it did not affect the elapsed time of the running application. .uh "File I/O Clustering" .pp The effect of aggregating read/write operations for greater efficiency is possible not only at the low level of abstraction of transfers themselves but also possible at the higher levels of abstraction of whole transfers from the user process's perspective. Consider that a user process is performing I/O on an abstract file -- we can collect up all of their numerous I/O operations into a much larger aggregate of individual operations, coalesce them into a single operation and perform that. .pp Given this higher level of abstraction, a .i clustered read or write operation can be performed based not on the actual buffer transactions but on much larger scale driven by the economies of memory in a system and not by the economies of memory in the buffer cache. Obviously if the memory of a system is so constrained that the buffer cache is a significant portion of memory (as it was in earlier systems), clustered I/O would be no great advantage. However, we routinely use PCs that contain many times the amount of memory of that of the programs which they run. Clearly, the efficient use of memory as a resource might now allow us to use memory state to transfer larger aggregations of file transfers more efficiently. In other words, we exchange memory resource for I/O bandwidth to obtain a larger "virtual" bandwidth. This in a nutshell is what file I/O clustering is all about. .pp Since file I/O clustering lies between the top level view of the system (e.g. reads and writes) and the bottom level view of the system (actually performing the I/O), it is an operation that must invisibly reconcile the semantics of the top level reads and writes independent of the low level operations (in the current system, there is no such layer). Since this mechanism inherently affects the user process's address space, the virtual memory system is the logical means through which this new layer would be interposed. .pp To permit the virtual memory system to make decisions on the size and timing of transfers, additional information (queue sizes and I/O operation completion) must be available, so that the virtual memory system can properly choose how much resource to dedicate to a particular set of virtual transfers. As such, a sufficiently different architecture for attaching the lower level of the system to the higher level portion of the system is now necessary. .uh "fs/bio.c Functions" .pp Within this file are contained the following functions: .(l bufinit() incore() bread() breada() bwrite() bdwrite() bawrite() brelse() getnewbuf() getblk() geteblk() allocbuf() biowait() biodone() .)l (***HYPERLINK ALL OF THE WHAT IS TO THESE LIST OF FUNCTIONS) The names are indicative of the function they provide in the implementation of the buffer cache. .b "bread()" and .b "bwrite()" provide a virtual read/write interface to the buffer cache and the underlying storage, expressed as a filesystem. .b "breada()" and .b "bawrite()" provide asynchronous mechanisms avoiding a wait for the operation if the caller decides not to pause to see the outcome of the operation. .b "getblk()" and .b "brelse()" allow the buffer contents to be probed and released irrespective of any implicit I/O. .b "allocbuf()" allows for alteration of buffers to suit the needs of the filesystem. .b "biowait()" and .b "biodone()" provide an interface between underlying asynchronous I/O and the filesystem's more orderly synchronous world view. .b "incore()" and .b "getnewbuf()" are internal support routines used to implement the buffer cache. .b "bufinit()" is an entrypoint for the kernel to instantiate the buffer cache facility. .b geteblk() is an obsolete interface used to steal an empty block from the buffer cache for use in other mechanisms (a memory allocator, used prior to mbufs and .b "malloc()"). (***HYPERLINK WHAT IS MALLOC() IN MALLOC.C - POPUP) .uh "Antecedents to the 386BSD Buffer Cache" .pp Earlier UNIX systems used a block-oriented mechanism to buffer physical I/O transactions to a disk drive. At this time, there was a single UNIX filesystem that used the block buffer cache to interface to disk drivers directly. .pp The names of the above routines and the function of the buffer cache are directly descended from this earlier work. However, it is deceptive to assume that they are similar -- a myriad of subtle differences exist due to the fact that this buffer cache works with variable-sized buffers of independent filesystems, caching transactions of the logical contents of the files. Various idiosyncrasies of the buffer cache are due to its appendix-like role in the anatomy of UNIX systems. .uh "Evolution of the 386BSD Buffer Cache" .pp The 386BSD buffer cache was developed independently for the original Release 0.0 in 1992. The first version of this file and how it was developed is discussed in the article .b "Missing Pieces II" (***HYPERLINK TO ARTICLE TOC - MACRO) in the section .b "Block I/O Cache". (***HYPERLINK TO SECTION - POPUP) Both the strengths and weaknesses of using this buffer cache design were also discussed in this article (see .b "4.4BSD Demands" (***HYPERLINK TO SECTION - POPUP) and .b "4.4BSD Weaknesses", (***HYPERLINK TO SECTION - POPUP) respectively). .pp The reimplementation of the buffer cache code was quite distinct from the design of earlier inplementations, primarily by its use of the kernel memory allocator in place of a internal memory pool and page table entries to allocate buffer contents from. This was done to reduce the code size by relying on a common memory allocator, provide uniform memory policy for the kernel as a whole (the intent was that memory could always be reclaimable by the virtual memory system, and that unused memory (B_INVAL buffers) would always be held by the memory allocator for universal use within the kernel), and so to reduce memory fragmentation (partial page allocation of buffer contents would not require a whole, dedicated page of memory). .pp The buffer cache code had minor bugs (double allocation of space), unfinished implementation (buffer ageing and delayed writes), as well as inefficencies (buffer ageing and delayed writes, again). In addition, the design objective of returning unused buffer memory contents to the memory allocator was not completed. These problems have been eliminated from this version, as many of them had stemmed from early, abortative attempts to introduce an interim version of the page cache, the intended successor to the buffer cache. .pp This design, however, was intended from the beginning to be only an interim functional mechanism until a more modern mechanism -- the 386BSD Page Cache -- was ready for release. .uh "Future Directions: The 386BSD Page Cache" .pp While the page cache is conceptually simple, its implementation and the replacement of the buffer cache is far from simple. Both performance and operational issues dominate the such attempts. Most of these are rooted in the dependance of the rest of the kernel on artifacts of the buffer cache and its interface. .pp The buffer cache combines many different functions and features into a single mechanism. This concise arrangement was somewhat necessary given the tiny memory requirements of the original PDP-11 (and PDP-7 before it) environments it ran in. The buffer cache provides many mechanisms: .(l L * The interface for filesystems to access records (blocks) of storage. * A way to achieve I/O on a storage meduim. * A way of managing exclusive access to a portion of file contents. * A way to avoid redundant reads (transfer cache). * A way to schedule sequential read operations allowing overlap (read ahead). * A way to postpone/coalesce writes (write behind). * A way to resolve file-relative to actual physical record index. .)l .pp In addition, to manage the cache contents, it provides these additional mechanisms for the filesystem to use: .(l L * cache invalidation/ageing. * cache probing/statistics. * cache resizing. .)l .pp These mechanisms are all implemented by a series of functions that interact in an ad-hoc manner to allow these mechanisms to function. Since the policies are embedded in a buffer cache used by all filesystems, all filesystems are compelled to use the same policies. Also, since I/O is implied behind the cache, the buffer cache schedules buffer reuse/writeback based on its demands for memory rather then when the filesystem would like to see the output operation occur. Finally, even though the buffer cache can be made to relinquish memory back to the memory allocators, it still must have a notion of a fraction of memory resources dedicated to filesystem cache contents, which limits the effectiveness of memory to be used to accellerate I/O operations. Inherently, the virtual memory system and buffer cache are on a collision course, as both provide fundamentally the same operation (access to storage) for different subsystems (filesystem, address space). As both must interact (mapped files), and cache state to improve operation, each elbows the other for resources. .pp The resolution of this conflict is that the virtual memory system must be the only mediator to storage, and the only manager of memory resource. Filesystems should provide methods to perform and mediate I/O. A filesystem access to a portion of a file should be treated exactly like a program's access to a portion of a file mapped into its address space. A process consuming a large number of pages of I/O should compete alongside another process consuming a large number of pages to run its program within. Actual I/O for either should be managed efficently with the same mechanisms (buffering for overlap operation, caching to reduce redundant operations, clustering to coalesce multiple operations into a single one, reclamation to cope with finite memory resources, I/O scheduling to maximize benifit of storage transfer mechanism). .pp The page cache is this universal mechanism that makes possible this archetectural arrangement. With this mechanism, file contents are only reachable via the virtual memory system. Any memory resources used to cache contents are held by the virtual memory system. All I/O operations are initiated or scheduled to be initiated by the filesystem. The memory system may attempt to reclaim resources by writing back a cached entry, however, the filesystem may chose how, when, or if this is to be done. The virtual memory system provides access to memory based, logically ordered objects on which I/O may be applied; while the filesystem transforms these into physical transfer operations. .pp Thus, the resolution of the previously mentioned functions cannot all be present in the same place as it was in the buffer cache, but instead be spread accross both the virtual memory system and filesystem, with much of both changing considerably. In effect, the caching mechanism would be migrating in the level of abstraction layers: .(l L Is: vm filesystem(top) buffer cache filesystem (bottom) drivers | network | ... Becomes: filesystem(top) | | vm | | page cache filesystem(bottom) drivers | network | ... .)l .pp In this new arrangement, the filesystem "top" layer transforms requests for file storage into virtual memory system requests (if necessary for the filesystem semantics, the virtual memory system may be bypassed entirely for non-cachable transfers of information), which may either be found in the global page cache, or reloaded on demand by the filesystem "bottom" layer. Thus, the net effect of the change is to cache potentially mappable memory state to avoid I/O transfers instead of caching the I/O transfers themselves. .uh "What is bufinit()?" .pp .b bufinit() instantiates the data structures with sane values so that the remaining code can use the facility. .uh "What Does the Kernel Expect of bufinit()?" .pp .b bufinit() must bootstrap enough of the buffer cache so that it can commence operation. At this point, the memory allocator is active, but no disk I/O is yet allowed. The buffer cache will be needed to help initialize block devices and to open the root filesystem -- its first operations. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 80 ] void bufinit() ... .)b .lp .b bufinit() has no arguments and returns nothing. Its side-effect is to initialize the statically allocated hash table, buffer headers, and free buffer queues. .uh "How is bufinit() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 87 ] for(bh = bufhash; bh < bufhash + BUFHSZ; bh++) { bh->b_flags = 0; bh->b_forw = (struct buf *)bh; bh->b_back = (struct buf *)bh; } ... .)b .lp With nothing in the hash table, its entries point to themselves. Note that there is only one value per hash table header for both the beginning and end of the hash table, yet the sentinel itself is never allocated. One could have used a single entry, but this identical code allows for greater speed to unlink and link (one case only) a buffer on a queue. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 94 ] for(bp = bfreelist; bp < bfreelist + BQUEUES; bp++) { bp->b_flags = 0; bp->av_forw = bp; bp->av_back = bp; bp->b_forw = bp; bp->b_back = bp; } ... .)b .lp Likewise, a list of free queues exists on which inactive buffers are present. We link them the two ways they can be found (address match or availability on the .i freelist). .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 103 ] for(bp = buf; bp < buf + nbuf ; bp++) { bp->b_flags = B_HEAD | B_INVAL; /* we're just an empty header */ bp->b_dev = BLK_NODEV; bp->b_vp = 0; binstailfree(bp, bfreelist + BQ_EMPTY); binshash(bp, bfreelist + BQ_EMPTY); } ... .)b .lp With the list headers established, we place the buffer headers on the queues using macros used throughout the rest of the implementation. At this point, we take care to mark them as having no contents and matching no address. .uh "386BSD bufinit() Design Choices and Trade-Offs" .pp It is assumed in the 386BSD implementation that the buffer cache is empty to begin with and that the buffer cache code will allocate and fill-in the structures as needed, instead of allocating everything here. As such, the key design choice here was to place the burden on the rest of the implementation so that the virtual memory system, the ultimate memory manager, could reclaim space on-demand from the buffer cache. It becomes just one of many clients of memory inside the kernel. (N.B. It may eventually be desirable to allow revoking of memory as a capability on-demand.) .pp This implementation of .b bio.c still uses the buffer list and queue arrangement found in older Berkeley Unix (and traditional Unix) systems, instead of using the general purpose queue facility (see .b include/queue.h) used by the entire virtual memory system. Since the buffer cache is scheduled for replacement by the page cache (which uses these queues already), work has not been continued in this area. .uh "What is incore()?" .pp Active buffers are linked onto one list of a hash table of lists, where the hash function is determined by logical file and block number. The .b incore() function, used solely within the buffer cache, locates valid cache contents regardless of the status of the buffer. Since this function is small and can limit performance for large cache sizes, it is inlined to increase speed. .uh "What Does the Buffer Cache Expect of incore()?" .pp .b incore() attempts to locate a buffer in the buffer cache. If one is present, it returns a pointer to the buffer's header; otherwise, it returns a null buffer pointer. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 115 ] extern inline struct buf * incore(struct vnode *vp, daddr_t blkno) ... .)b .lp .b incore() has two arguments: a vnode pointer; and a logical block number. The arguments are used to find an exact match from one of the buffers holding valid data, regardless of buffer status. If it is matched, the buffer pointer is returned; otherwise, a null pointer is returned. .pp .b incore() should never block; otherwise, an allocation race could result and the wrong answer obtained. Since .b incore() is used to find buffers that can be in either a locked or unlocked state, it should never be used to itself lock or unlock buffers; otherwise, a chicken-and-egg recursion could result. .uh "How is incore() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 118 ] struct buf *bh = BUFHASH(vp, blkno), *bp; ... .)b .lp The buffer hash address is located by the hash macro (see .b include/buf.h) which folds the value of the vnode pointer and logical block number into an index inserted into a fixed-sized hash table of buffer hash headers. Buffer hash headers are actually only three words long, and overlay the first three fields of the buffer header structure .i (b_flags, .i b_forw, and .i b_back), so the definition for .i bh is a bit misleading. The hash header is the semaphore that anchors a doublely circularly linked chain of buffer headers which contains the buffer we are locating; otherwise, the buffer is not present in the buffer cache. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 121 ] for (bp = bh->b_forw; bp != bh ; bp = bp->b_forw) ... .)b .lp The hash chain is linearly searched for the desired buffer. The first buffer is pointed at by the forward pointer (although we could just as easily have walked the hash chain backwards via the back pointer). In either direction the header itself is chained as the last element, so that the empty hash header is an entry with both back and forward pointers pointing to itself (see .b bufinit()) (***HYPERLINK WHAT IS BUFINIT() - POPUP) .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 123 ] if (bp->b_lblkno == blkno && bp->b_vp == vp && (bp->b_flags & B_INVAL) == 0) return (bp); ... .)b .lp Valid matches are exact matches on the starting logical block number and vnode pointer. Note that both the size of each cached buffer and the block size associated with the block address (e.g. units of the address) are not taken into account by the .b incore() lookup, so it is not possible to locate a logical block contained within a cached buffer or locate a buffer by an alias of a different block size. The buffer cache effectively caches disk transactions of a particular arrangement by a filesystem, and does not provide a global view across all filesystems of vnode file state present in RAM. Thus, the buffer cache is relative to each of its filesystem clients. .pp This arrangement directly implies that .b bio.c is logically located at the middle layer of abstraction of a filesystem, separating the filesystem's top layer (its semantics and attachment to the virtual filesystem) from its lower layer (the logical to physical mapping, commitment to storage, and storage allocation policies). .lp A check for the validity of the buffer is made before returning the block. Normally, only valid data should be present on the hash chains; however, the block may have become invalid if a race condition occurred while removing it from the hash (by code running from interrupt level or on another processor). Since the unusual case is an invalid buffer, this is only tested prior to returning a positive match. Buffers are not locked by this function -- the information in the associated buffer header is subject to change prior to use, so other callers must take care to discard buffers as needed if they are stale. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 127 ] return(0); ... .)b .lp In the case of a buffer cache miss, .b incore() will return the value of a null buffer pointer. .uh "What is bread()?" .pp .b "bread()" is one of the top-level function entry calls for the buffer cache. The filesystem top-level code uses this function to access their file's contents by performing a virtual read of the combined cache and the lower-level of the filesystem. .uh "What Does the Filesystem Expect of bread()?" .pp .b bread() causes a buffer to be created with the contents requested. It either obtains this from the buffer cache if it is present, or obtains an empty buffer and fills it with the appropriate contents by means of the lower-level mechanisms of the filesystem memory for further use. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 135 ] int bread(struct vnode *vp, daddr_t blkno, int size, struct ucred *cred, struct buf **bpp) ... .)b .lp .b bread() has five arguments (logical file parameters): the vnode pointer; the logical block offset; size of transfer; the credentials used in performing the read; and a pointer to the buffer returned by this .b bread() operation. The credentials of the requester passed in the buffer are used to associate any lower-level I/O operations. The function will return a buffer at the location specified ("*bpp") as well as any error values encountered in performing the operation. .pp The buffer is returned locked for exclusive use, and then returned either to the buffer cache for further use by other clients of this block of the file or by the cache to recycle its buffer memory for further use. .pp Since the filesystem code is unaware of any concurrent access of buffers, .b bread() must mediate the shared free buffer pool in order to gain exclusive access to the desired buffer to its caller. .b bread() can block, since it is only called from the synchronous top-half of the kernel. However, if it does block, it must not hold any locked structures (except the buffer it allocates, of course), nor hold references to any global data structures across the block. (This is done to avoid deadlocks or circular waiting). .uh "How is bread() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 142 ] bp = getblk (vp, blkno, size); ... .)b .lp .b getblk() (***HYPERLINK WHAT IS GETBLK() - POPUP) is used to obtain an appropriate buffer for the block requested, by either finding it in the cache or freshly allocating an empty buffer. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 145 ] if ((bp->b_flags & B_CACHE) == 0 || (bp->b_flags & B_INVAL) != 0) { ... .)b .lp If the block was cached already, no further work needs to be done. However, if the block was either not cached or if the cached block is not valid, it must be filled with contents. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 146 ] bp->b_flags |= B_READ; bp->b_flags &= ~(B_DONE|B_ERROR|B_INVAL); if (cred != NOCRED) crhold(cred); bp->b_rcred = cred; if (cred != NOCRED) crhold(cred); bp->b_wcred = cred; ... .)b .lp A read operation is then selected and the associated bits are cleared prior to starting a read operation. Operations with credentials have their reference counts bumped for read and write references, which are asserted in case a .b bread() block is later written using .b bwrite(). (***HYPERLINK WHAT IS BWRITE() - POPUP) .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 154 ] #ifdef KTRACE if (curproc && KTRPOINT(curproc, KTR_BIO)) ktrbio(curproc->p_tracep, KTBIO_READ, vp, blkno); #endif ... .)b .lp If this process is traced, the read operation is recorded to the trace file. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 158 ] if (curproc) curproc->p_stats->p_ru.ru_inblock++; VOP_STRATEGY(bp); ... .)b .lp We tally the number of read requests, and then perform each request, by returning to the filesystem from which we were called and having it satisfy the request from whatever underlies the filesystem. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 161 ] rv = biowait (bp); } *bpp = bp; return (rv); ... .)b .lp The process is suspended until the read request completes. Then the status of the operation and the buffer obtained is returned to the caller. .uh "What is breada()?" .pp .b breada(), a special case of .b bread(), (***HYPERLINK WHAT IS BREAD() - POPUP) is another top-level function entry call for the buffer cache. Like .b bread(), a filesystem's top-level code uses this function to access the file's contents by performing a virtual read of the combined cache and the lower-level of the filesystem. .pp .b breada() allows for double-buffering read operations by starting consecutive operations as a side-effect. It assumes that by the time the first operation has been consumed, the second will be ready and the third will be started again. If this perfect double buffering was maintained, data would appear to be instantaneously available without delay, even if the file read was larger than the buffer cache size. More commonly, however, the buffer cache simply overlaps two read operations, reducing the synchronous wait to every other .b breada() request in the case where data is demanded faster than the drive can supply. .uh "What Does the Filesystem Expect of breada()?" .pp Like .b bread(), (***HYPERLINK WHAT IS BREAD() - POPUP) .b breada() causes a buffer to be created with the contents requested. It either obtains this from the buffer cache if it is present or obtains an empty buffer and fills it with the appropriate contents by means of the lower-level mechanisms of the filesystem memory for further use. .b breada() also has the "side-effect" of starting a read on the next buffer if it is not in the buffer cache. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 172 ] int breada(struct vnode *vp, daddr_t blkno, int size, daddr_t rablkno, int rabsize, struct ucred *cred, struct buf **bpp) ... .)b .lp .b breada() has seven arguments: the logical parameters of vnode pointer; logical block offset; size of transfer; the read-ahead block offset; the read-ahead block size; the credentials of the transfer; and the pointer to the buffer read in this operation. The credentials of the requester are passed in the buffer to associate any lower-level I/O operations. .pp Like .b bread(), this function will return a buffer at the location specified ("*bpp"), as well as any error value encountered in performing the operation. The buffer is returned locked for exclusive use, and must be returned to the buffer cache either for further use by other clients of this block of the file or for the cache to recycle its buffer memory for further use. .pp Since the units of the logical block address are unknown to the buffer cache, it is not necessarily the case that the next block is one plus the previous (it could be 2 or 4, for example) Also, since the block size can be variable in the filesystem, it is also not predictable by the buffer cache (as the read-ahead block may be the last block, usually a fragment). .pp Since the filesystem code is unaware of concurrent access of buffers, .b breada() (like .b bread()) must mediate the shared free buffer pool in order to gain exclusive access to the desired buffer to its caller. .b breada() can block, since it is only called from the synchronous top-half of the kernel. However, if this function does block, it must not hold any locked structures (except the buffer it allocates, of course) nor hold references to any global data structures across the block. .uh "How is breada() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 179 ] bp = getblk (vp, blkno, size); ... .)b .lp The .b getblk() (***HYPERLINK WHAT IS GETBLK() - POPUP) function is used to obtain an appropriate buffer for the block requested, either by finding it in the cache or by freshly allocating an empty buffer. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 182 ] if ((bp->b_flags & B_CACHE) == 0 || (bp->b_flags & B_INVAL) != 0) { ... .)b .lp If the block was cached, no further work needs to be done. However, if the block was either not cached, or if the cached block is not valid, it must be filled with contents. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 183 ] bp->b_flags |= B_READ; bp->b_flags &= ~(B_DONE|B_ERROR|B_INVAL); if (cred != NOCRED) crhold(cred); bp->b_rcred = cred; if (cred != NOCRED) crhold(cred); bp->b_wcred = cred; ... .)b .lp A read operation is selected and the associated bits are cleared prior to starting a read operation. Operations with credentials have their reference counts bumped for read and write references, which are asserted in case a .b breada() block is later written with .b bwrite() (***HYPERLINK WHAT IS BWRITE() - POPUP) .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 191 ] #ifdef KTRACE if (curproc && KTRPOINT(curproc, KTR_BIO)) ktrbio(curproc->p_tracep, KTBIO_READA1, vp, blkno); #endif ... .)b .lp If this process is traced, the read operation is recorded to the trace file. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 195 ] if (curproc) curproc->p_stats->p_ru.ru_inblock++; VOP_STRATEGY(bp); ... .)b .lp The number of read requests are next tallied. We then perform the requests, by going back to the filesystem from which we were called and having it satisfy the request from whatever underlies the filesystem. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 198 ] needwait++; } ... .)b .lp At this point the implementation of .b breada() diverges from that of .b bread(). Instead of waiting for the read operation to complete, we instead note that an operation is outstanding. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 201 ] retry: /* if not found in cache, allocate and do some I/O (overlapped with first) */ if ((rabp = incore(vp, rablkno)) == 0) { struct buf *bh; if ((rabp = getnewbuf(size)) == 0) goto retry; ... .)b .lp We then check to see if the next block is already in the cache. If it is not in the cache, we arrange to schedule an asynchronous read by obtaining a new buffer using .b getnewbuf(). (***HYPERLINK WHAT IS GETNEWBUF() - POPUP) If none was present and .b getnewbuf() was forced to wait (as conditions may be difficult), we need to go back and look in the cache again. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 208 ] rabp->b_blkno = rabp->b_lblkno = rablkno; bgetvp(vp, rabp); ... .)b .lp We then associate the buffer with this vnode and read-ahead block offset. To signal that we haven't calculated the underlying address for this block, the lower level address for the buffer is set to the logical address. (The .b bgetvp() function in .b fs/vnode.c associates this buffer with this vnode.) .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 210 ] x = splbio(); bh = BUFHASH(vp, rablkno); binshash(rabp, bh); rabp->b_flags = B_BUSY; splx(x); ... .)b .lp We insert the buffer into the hash table so that it can be found by .b incore(). (***HYPERLINK WHAT IS INCORE() - POPUP) The buffer is held busy so that if others find it, they must first wait for it to be released before using it. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 217 ] rabp->b_flags |= B_READ | B_ASYNC | B_AGE; if (cred != NOCRED) crhold(cred); rabp->b_rcred = cred; if (cred != NOCRED) crhold(cred); rabp->b_wcred = cred; #ifdef KTRACE if (curproc && KTRPOINT(curproc, KTR_BIO)) ktrbio(curproc->p_tracep, KTBIO_READA2, vp, rablkno); #endif if (curproc) curproc->p_stats->p_ru.ru_inblock++; VOP_STRATEGY(rabp); ... .)b .lp We request that an asynchronous read be performed and that the buffer be released to the "aged" free queue. Because the buffer has not yet been requested, we consider it aged and a candidate for reuse; otherwise, we handle this just like a .b bread() without a .b biowait(). (***HYPERLINK WHAT IS BIOWAIT() - POPUP) .lp Both read and write credentials for the read-ahead block are obtained. Tracing is then accounted for (if desired) and it is recorded that the transfer has occurred against this particular process. We then perform the transfer via the .i vop_strategy operation. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 234 ] if (rabp->b_flags & B_AGE) { /* put on the tail of the age queue */ if ((rabp->b_flags & B_BUSY) == 0) { x = splbio(); rabp->b_flags |= B_BUSY; bremfree(rabp); splx(x); brelse(rabp); } else rabp->b_flags &= ~B_AGE; } ... .)b .lp If the read-ahead block was in the cache, we prolong its life a bit more. If the age bit is still set and the buffer is in use we clear the age bit (so it won't be aged). However, if it is not in use, we force it to the tail of the age queue so that it won't be reused before being referenced. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 248 ] if (needwait) rv = biowait (bp); *bpp = bp; return (rv); ... .)b .lp Finally, we wait for the result if the original operation required a read operation. The buffer used for the requested read operation is returned to the supplied address of the caller and the value of this successful or failed operation is also returned. .uh "What is bwrite()?" .pp .b bwrite() takes a previously busy buffer obtained from either .b getblk(), (***HYPERLINK WHAT IS GETBLK() - POPUP) .b bread(), (***HYPERLINK WHAT IS BREAD() - POPUP) or .b breada(), (***HYPERLINK WHAT IS BREADA() - POPUP) and forces it to be immediately written to the underlying storage below the current filesystem. Like .b bread(), .b bwrite() waits for the operation to finish, and returns the status of the operation as a value. The buffer passed is then released back to the freelist for use by other clients of the buffer cache. .uh "What Does the Filesystem Expect of bwrite()?" .pp .b bwrite() forces the buffer to be immediately written and the invoked operation synchronously completed before the function returns. Delayed writes are taken into account by this function as well. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 259 ] int bwrite(register struct buf *bp) ... .)b .lp .b bwrite() has one argument: the buffer to be synchronously written. It returns the success or failure of the write operation. Since .b bwrite() will block waiting for completion of the I/O operation, it can only be called from the top half of the kernel. .uh "How is bwrite() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 264 ] if(bp->b_flags & B_INVAL) { brelse(bp); return (0); } else { ... .)b .lp If the supplied buffer's contents is invalid, no operation is performed and the buffer is immediately released for reuse on the free list. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 270 ] if(!(bp->b_flags & B_BUSY)) panic("bwrite: not busy"); wasdelayed = bp->b_flags & B_DELWRI; bp->b_flags &= ~(B_READ|B_DONE|B_ERROR|B_ASYNC|B_DELWRI); if(wasdelayed) { reassignbuf(bp, bp->b_vp); bp->b_flags |= B_AGE; } else if (curproc) curproc->p_stats->p_ru.ru_oublock++; ... .)b .lp If the buffer is not currently marked "busy", then .b bwrite() panics because it does not have exclusive use of the buffer. In general, a block has been obtained that has been already associated with a vnode for exclusive use -- thus, this is a "impossible" error. .lp If the buffer has already been virtually written by a delayed write operation, the buffer is reassociated with the vnode of the file to which it belongs and its contents are marked "aged" so that they may be reclaimed more quickly and not dwell on the free list. Otherwise, an output operation is accounted for by the process that calls .b bwrite(). Note that the output operation for a delayed write was already accounted for in the process where the delayed write had occurred. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 281 ] bp->b_flags |= B_DIRTY; bp->b_vp->v_numoutput++; ... .)b .lp The block is marked as a "dirty" block that must be cleaned before this vnode may be reused, and the number of output operations in progress on this vnode is increased. This counter will keep the vnode from being recycled or altered prior to all of its operations completing. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 288 ] VOP_STRATEGY(bp); rv = biowait(bp); brelse(bp); return (rv); } ... .)b .lp The output operation is engaged for this filesystem by calling its vnode strategy operation and the operation is waited upon until it completes. After completion, the buffer is released back into the free buffer cache for potential reuse by other processes and the value of a successful or failed operation is returned to the caller. Note that the buffer has been returned to the buffer cache and is no longer used by the caller. .uh "What is bdwrite()?" .pp Like .b "bwrite()", (***HYPERLINK WHAT IS BWRITE() - POPUP) .b bdwrite() is used to return the buffer back to the buffer cache with the contents written back to the file. However, unlike .b bwrite() (which immediately writes the buffer back and waits for it), .b bdwrite() does a .i lazy write by marking the buffer to be written before reuse, releasing it to the free list, and then returning. .uh "What Does the Filesystem Expect of bdwrite()?" .pp .b bdwrite() virtually writes the buffer into the buffer cache. The function in effect disassociates the operation from the file to which it is attached so that the buffer may "dwell" for a period of time until the buffer space needs to be reused for another purpose in the buffer cache or until the block is requested by another process to be read or written. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 306 ] void bdwrite(struct buf *bp, struct vnode *vp) ... .)b .lp .b bdwrite() has two arguments: the buffer to be written; and the vnode to which it is associated. It returns nothing. The side-effect of this function is to virtually write the buffer into the buffer cache. .lp In order for the buffer to be written, it must be exclusive to the process performing the write. .b bdwrite() occurs in the context of the writer. However, the operation for performing the write (when it actually occurs) may come in from yet another process or in no process at all (as the process may have terminated by the time the actual I/O operation is performed). .uh "How is bdwrite() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 310 ] if(!(bp->b_flags & B_BUSY)) panic("bdwrite: not busy"); if(bp->b_flags & B_INVAL) { brelse(bp); return; } if(bp->b_flags & B_TAPE) { bwrite(bp); return; } ... .)b .lp If the buffer is not marked for exclusive use (busy) .b bdwrite() panics as it has been called on a nonexclusive buffer. If the information in the buffer is invalid, the buffer is immediately released, since there is no write operation to be performed on this buffer and it is returned to the free list for reassignment. .lp If the buffer is associated with a tape device upon which a write should be performed, a synchronous write is substituted for this delayed write. Note that the return value for the write operation is thrown away as .b bdwrite() does not expect anything but success. However, the error is still recorded in the buffer's structure for later examination. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 322 ] bp->b_flags &= ~(B_READ|B_DONE); /* if first delayed write, account for it */ if ((bp->b_flags & B_DELWRI) == 0 && curproc) { curproc->p_stats->p_ru.ru_oublock++; bp->b_flags |= B_DIRTY|B_DELWRI; reassignbuf(bp, vp); } brelse(bp); ... .)b .lp The buffer's flags are changed to indicate that a write operation is present but has not yet been done for this buffer. If this is the first delayed write that has occurred, the process is accounted for as a single physical write (that may occur later) and marked as a delayed write with the dirty bit. Its buffer is then assigned to the vnode associated with this operation. Note that multiple delayed writes may occur, as a partial write to the buffer associated with the vnode may successively fill the contents of this buffer with many .b bdwrite()'s. The buffer is then released back to the free list where it may later be found associated with this vnode. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 332 ] #ifdef __DWELL /* if there are any delayed writes outstanding on this * vnode, and it isn't just us, force them out. We only * allow a single outstanding delayed write per vnode. */ if (bp->b_blockb && vp->v_dirtyblkhd && (bp != vp->v_dirtyblkhd || bp->b_blockf != 0)) /* eventually, should bawrite all of them and tsleep */ vflushbuf(vp, 0, bp); #endif /* __DWELL */ ... .)b .lp An optional portion of this .b bio.c implementation incorporates a mechanism to only allow a single outstanding delayed write buffer per every vnode. (This is an example of how to deal with the so-called core file problem discussed earlier -- see .b "Delayed Write and Read-ahead Operations" (**HYPERLINK TO SECTION). Delayed writes are controlled and kept from overrunning the contents of the buffer cache by delaying the top level application program when it attempts to write on more than one buffer. (This is not normally done because this mechanism does not discriminate for the default temporary file characteristics implemented in this file with delayed writes.) .lp The buffer is checked to see if it is the only block present on a chain of dirty blocks attached to the vnode. If there is more than one dirty block present, a .b bawrite() (with an embedded sleep operation) is forced by calling the vnode .b vflushbuf() operation (see .b fs/vnode.c) for all but the current outstanding delayed write buffer. Note that in the case where a preponderance of temporary file operations occur on the system, this operation will actually slow the system down by reducing the amount of write-behinds (delayed writes). However, it will speed up a system that had previously been dominated by cache flooding. It is the lack of information on making the decision whether or not to issue this operation that requires this conditional portion of this implementation. .uh "386BSD bdwrite() Design Choices and Trade-Offs" .pp .b bdwrite() presumes that the buffer is going to be reused and so procrastinates. This method works well when a buffer is fractionally written in succession or when the buffer cache can hold the entire contents of a temporary file that gets reused and removed prior to being rewritten. In this second case, the I/O can be entirely avoided, as the file ceases to exist before the buffers ever age out. This is the principle upon which log-based filesystems (like LFS) rely. Sequential devices like tapes cannot postpone or reorder writes, so delayed writes to such devices are turned into immediate writes. .uh "What is bawrite()?" .pp Like .b bwrite() (***HYPERLINK WHAT IS BWRITE() - POPUP) .b bawrite() takes a previously busy buffer obtained from either .b getblk(), (***HYPERLINK WHAT IS GETBLK() - POPUP) .b bread(), (***HYPERLINK WHAT IS BREAD() - POPUP) or .b breada(), (***HYPERLINK WHAT IS BREADA() - POPUP) and forces it to be immediately written to the underlying storage below the current filesystem. However, unlike .b bwrite() no wait for status is done. .uh "What Does the Filesystem Expect of bawrite()?" .pp .b bawrite() performs the write operation on the buffer and then immediately returns. No wait for the status is done. Instead, the buffer is marked to be released when the write operation completes. In this case, no status information can be conveyed back to the writer because the operation is completed from the writer's perspective, before the operation can produce a status for the completed operation. Only the next reference to a buffer can discover the status of the buffer following a .b bawrite() (or analogously the read-ahead blocks scheduled by .b breada()). .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 349 ] void bawrite(register struct buf *bp) ... .)b .lp .b bawrite() has one argument: the buffer being written. It returns nothing. .pp Note that by assigning a procedure, .b b_iodone(*)(), to be called on conclusion of the I/O operation and setting the call bit (B_CALL), a call from interrupt level may be done to a function to indicate an operation has been completed and the status may be checked on an asynchronous write. Note also that buffers remain locked from view during any I/O operations, so that even subsequent read operations will be delayed until the asynchronous write completes. .uh "How is bawrite() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 353 ] if(!(bp->b_flags & B_BUSY)) panic("bawrite: not busy"); if(bp->b_flags & B_INVAL) brelse(bp); else { ... .)b .lp The buffer is checked to see that it is exclusively accessed by checking for the busy (B_BUSY) bit; otherwise, it panics as the buffer must be obtained for exclusive access for this function to be called. If the buffer being written has an invalid flag, the buffer is immediately released without any I/O performed on it as it has no valid contents. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 361 ] wasdelayed = bp->b_flags & B_DELWRI; bp->b_flags &= ~(B_READ|B_DONE|B_ERROR|B_DELWRI); if(wasdelayed) { reassignbuf(bp, bp->b_vp); bp->b_flags |= B_AGE; } else if (curproc) curproc->p_stats->p_ru.ru_oublock++; bp->b_flags |= B_DIRTY | B_ASYNC; bp->b_vp->v_numoutput++; ... .)b .lp If the buffer was previously a delayed buffer, it is reassigned to the vnode and aged as a delayed write is finally converted into an asynchronous write buffer. In this case, since the operation of the write had previously been accounted, it is not be done so here upon the delayed write conversion; otherwise, it will be done for an initial .b bawrite(). .lp The buffer's flags are changed to be appropriate for that of a buffer about to be written which has not had any I/O done to it yet and no errors have been found. Since the buffer is being written, the dirty bit and the asynchronous flag are set for this buffer and the number of write transactions to the vnode are increased. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 372 ] #ifdef KTRACE if (curproc && KTRPOINT(curproc, KTR_BIO)) ktrbio(curproc->p_tracep, wasdelayed ? KTBIO_DTOA: KTBIO_AWRITE, bp->b_vp, bp->b_lblkno); #endif ... .)b .lp If the process is tracing .b bio.c operations, a trace record is generated to indicate that either an asynchronous write or a delayed write converted to an asynchronous write is occurring for this vnode's block. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 378 ] VOP_STRATEGY(bp); } ... .)b .lp The write operation is requested via a vnode operation on the filesystem associated with this buffer's vnode, and the function is complete. The caller can either .b biowait() (***HYPERLINK WHAT IS BIOWAIT() - POPUP) (in the case that many asynchronous operations are started before we wait for any of them) or continue requiring the block later (when necessary) to determine the status of the operation. .uh "What is brelse()?" .pp .b "brelse()" relinquishes exclusive access to a buffer and returns it to the global free buffer pool for potential use by other clients of the buffer cache. .uh "What Does the Filesystem Expect From brelse()?" .pp .b brelse() passes exclusive use of the buffer back to the kernel's free pool. Anyone waiting for this particular buffer or a free buffer in general is then notified. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 386 ] void brelse(register struct buf *bp) ... .)b .lp .b brelse() has one argument: the buffer to be released. It returns nothing. The side-effect of this function is that the buffer is placed in the free pool. .pp This function is implemented as an artifact of how exclusive access is maintained in the buffer cache. If any other clients of the buffer cache either desire this buffer or are waiting for any buffer for reuse, they will be awakened and allowed to compete for access to the desired resource (since there may be many waiting for exclusive access to a single buffer). Buffers are directed to free lists that determine the policy for reclamation and reuse with a different vnode/offset. .uh "How is brelse() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 391 ] #ifdef KTRACE if (curproc && KTRPOINT(curproc, KTR_BIO)) ktrbio(curproc->p_tracep, (bp->b_flags & B_INVAL) ? KTBIO_INVALID : ((bp->b_flags & B_DELWRI) ? KTBIO_MODIFY : KTBIO_RELEASE), bp->b_vp, bp->b_lblkno); #endif ... .)b .lp If kernel tracing is present, a trace record is written that indicates whether the buffer is being invalidated, modified, or simply released for reuse. Note that as currently implemented the tracing facility can only write records from the top level -- thus, if .b brelse() is called from interrupt level (as occurs from .b biodone() as the conclusion of an asynchronous operation), it will not be recorded because of the block that may result. This function is currently finessed in .b ktrbio() (see .b opt/ktrace/ktrace.c) as a special case as this trace package may be later changed to remove the current restriction. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 400 ] x=splbio(); if ((bfreelist + BQ_AGE)->b_flags & B_WANTED) { (bfreelist + BQ_AGE) ->b_flags &= ~B_WANTED; wakeup((caddr_t)bfreelist); } /* anyone need this very block? */ if (bp->b_flags & B_WANTED) { bp->b_flags &= ~B_WANTED; wakeup((caddr_t)bp); } if (bp->b_flags & (B_INVAL|B_ERROR|B_NOCACHE)) { if(bp->b_vp) brelvp(bp); bp->b_flags = B_INVAL; flags = B_INVAL; } ... .)b .lp To avoid contention for the busy bit, all block I/O devices are masked from interrupts (primarily because .b biodone() invokes .b brelse()). The age queue is examined to see if its associated wanted bit is set (indicating that any the buffer wanted request is present). If it is, the bit is cleared and a wakeup issued for those processes still waiting for any buffer to become available. Note that only the first of the possibly many processes waiting for a free buffer will get the newly freed buffer with the others returned to the queue to wait for another buffer to become available. Thus, there is no guarantee of when a buffer will be assigned to a particular process. .lp Similarly, if the wanted bit is set on this particular buffer header's flag, a wakeup will be issued for the buffer header and the bit cleared. If the buffer has an error, invalid contents, or has been specifically disallowed from being cached by the filesystem, its contents are rendered meaningless and it is disassociated from any vnode to which it might be attached. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 418 ] bp->b_flags &= ~(B_CACHE|B_NOCACHE|B_AGE); /* enqueue */ /* just an empty buffer head ... */ /*if(flags & B_HEAD) binsheadfree(bp, bfreelist + BQ_EMPTY)*/ /* buffers with junk contents */ /*else*/ if(flags & (B_ERROR|B_INVAL|B_NOCACHE)) binsheadfree(bp, bfreelist + BQ_AGE) /* buffers with stale but valid contents */ else if((flags & (B_CACHE|B_AGE)) == B_AGE) { binstailfree(bp, bfreelist + BQ_AGE) bp->b_flags |= B_AGE; /* buffers with valid and quite potentially reuseable contents */ } else binstailfree(bp, bfreelist + BQ_LRU) /* unlock */ bp->b_flags &= ~B_BUSY; splx(x); ... .)b .lp The buffer flags indicating cache status and aging are erased prior to enqueuing the buffer on the appropriate of the four free lists. If the buffer corresponds only to a header and has no contents whatsoever, it is passed back to the empty queue for use when assigning new storage. Otherwise, if the buffer contains no valid contents, it is inserted at the beginning of the age queue which indicates that it will be the first to be reused when a new buffer needs to be allocated. .lp If the buffer has aged but not been found in the cache, it will be installed at the tail of the age queue, where it will be reused before the rest of the contents of the buffer cache but after any invalid or empty queues that could be constituted into new buffers are used. Aged buffers are ones that have either not demonstrated their reason for holding a place in a cache or have lost their reason for being cached, as determined by the filesystem's use of the buffer. All other buffers will be placed on the LRU queue awaiting reuse. .lp In general, as buffers are found and released back to their free lists, they tend to migrate towards the head of the particular queue that they reside upon; thus, the less recently referenced entries move towards the front of the queue. No indication is kept on how frequently a buffer is reused but one is kept on the last time it was reused. Thus, we use a LRU algorithm (LIFO). .lp The busy bit is then removed from the buffer, allowing it to be claimed by another requester and the interrupt level mask is removed. .uh "What is getnewbuf()?" .pp .b getnewbuf() constitutes a new buffer by acquiring a buffer header and contents. .uh "What Does the Buffer Cache Expect From getnewbuf()?" .pp .b getnewbuf() obtains a buffer and stores in it the .i sz sized contents. These items either come from an undedicated resource (empty buffer headers and underutilized buffer contents space) or by recycling a free buffer and adjusting it to suit the desired size. If the recycled buffer is a delayed write, the write is forced before allowing reuse. .pp If a new buffer is still not available for use, the condition is signalled and the process sleeps until a free buffer is available. At this point, the function returns a null buffer pointer to signal that a sleep has occurred. If the conditions have changed and the buffer desired allocated to yet another racing process, any uses of .b incore() (***HYPERLINK WHAT IS INCORE - POPUP) may be stale. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 450 ] /*static*/ struct buf * getnewbuf(int sz) ... .)b .lp .b getnewbuf() has one argument: the size of the contents allocated with the buffer header. It returns a pointer to the allocated buffer header or zero if no buffer header could be allocated. .pp .b getnewbuf() returns a buffer that is entirely unassociated from the buffer cache. It is up to the caller to either associate the buffer with a vnode/offset so that it can be found or make temporary use of the buffer and return it to a free list. In its unassociated state, the buffer plays no role in the global view of buffer state. .uh "How is getnewbuf() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 456 ] x = splbio(); start: /* can we constitute a new buffer? */ if (freebufspace > sz && bfreelist[BQ_EMPTY].av_forw != (struct buf *)bfreelist+BQ_EMPTY) { caddr_t addr; if ((addr = malloc (sz, M_TEMP, M_NOWAIT)) == 0) goto tryfree; freebufspace -= sz; allocbufspace += sz; bp = bfreelist[BQ_EMPTY].av_forw; bp->b_flags = B_BUSY | B_INVAL; bremfree(bp); bp->b_un.b_addr = addr; bp->b_bufsize = sz; /* 20 Aug 92*/ goto fillin; } ... .)b .lp Buffers are first constituted from the empty list of buffer headers, as long as space is allocatable in the amount of memory reserved for buffer space and an empty header is available. In general, memory not used for cache contents is always returned to the primary memory allocator .b malloc() (***HYPERLINK WHAT IS MALLOC IN MALLOC.C - POPUP) (see .b kern/malloc()) (***HYPERLINK MALLOC.C TOC - MACRO) for generic use in the kernel. This is the sole purpose of the empty buffer free list. .lp Because buffers may have their status changed from exclusive to nonexclusive at interrupt level (see .b biodone()), (***HYPERLINK WHAT IS BIODONE - POPUP) any block device interrupts are masked to prevent contention for buffers at interrupt level (as .b brelse() (***HYPERLINK WHAT IS BRELSE - POPUP) may change the contents of the free lists). .lp If a new buffer can be constituted, memory is allocated from the memory allocator and assigned to a formerly empty buffer header detached from the empty free list of buffers and marked for exclusive use with invalid contents (since it has yet to be filled). Two global variables, .i allocbufspace and .i freebufspace record the amount of buffer space present in the buffer cache and the amount remaining for use by new buffers, respectively. Remaining portions of the buffer header are filled prior to passing it back to the caller. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 476 ] tryfree: if (bfreelist[BQ_AGE].av_forw != (struct buf *)bfreelist+BQ_AGE) { bp = bfreelist[BQ_AGE].av_forw; bremfree(bp); } else if (bfreelist[BQ_LRU].av_forw != (struct buf *)bfreelist+BQ_LRU) { bp = bfreelist[BQ_LRU].av_forw; bremfree(bp); } else { /* wait for a free buffer of any kind */ (bfreelist + BQ_AGE)->b_flags |= B_WANTED; tsleep((caddr_t)bfreelist, PRIBIO, "getnewbuf", 0); splx(x); return (0); } ... .)b .lp If all buffer space is in use or no empty buffer headers are available, the other free lists are checked to find a buffer for reuse. The age queue is consulted before the LRU queue. If either queue has no entry it will point back to its free queue pointer (indicating empty status); otherwise, the first buffer at the head of the queue will be chosen for use and removed from that free list. .lp If none of the buffer free queues has a buffer that can be used (all buffers are busy), the age queue's buffer queue header is marked with a wanted bit to indicate that any buffer header is requested. The process then sleeps waiting for .b brelse() to find it a new buffer. Upon being awakened, it will return to its caller to indicate that it could not find a buffer. In this case, the caller must examine the buffer cache again to see if someone else had allocated a buffer already for the vnode and block offset for which the caller was requesting a buffer. This procedure is followed rather than immediately branching back to the beginning of the search for a buffer because many of the assumptions of the caller may have changed during the sleep. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 492 ] if (bp->b_flags & B_DELWRI) { bp->b_flags |= B_BUSY; bawrite (bp); goto start; } ... .)b .lp If the buffer selected for reuse is one that has a delayed write outstanding on it, the buffer is acquired for exclusive access by asserting its busy bit and transformed into an asynchronous write by calling .b bawrite(). (***HYPERLINK WHAT IS BAWRITE() - POPUP) This forces a stale delayed write out upon reuse. .lp After starting the write operation, the function is restarted at its beginning to search for another buffer or, if all buffers are busy, suspend waiting for a free buffer until the asynchronous write completes. Now freed of its contents (written back to disk), the cleaned buffer is inserted onto the free list where it may be allocated by this routine. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 498 ] #ifdef KTRACE if (curproc && KTRPOINT(curproc, KTR_BIO)) ktrbio(curproc->p_tracep, KTBIO_RECYCLE, bp->b_vp, bp->b_lblkno); #endif ... .)b .lp If kernel tracing is enabled for this process, a recycle record is written to indicate that the buffer has been assigned to the new vnode from an old contents. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 504 ] if(bp->b_vp) brelvp(bp); /* we are not free, nor do we contain interesting data */ if (bp->b_rcred != NOCRED) crfree(bp->b_rcred); if (bp->b_wcred != NOCRED) crfree(bp->b_wcred); bp->b_flags = B_INVAL|B_BUSY; fillin: bremhash(bp); splx(x); bp->b_dev = BLK_NODEV; bp->b_vp = NULL; bp->b_blkno = bp->b_lblkno = 0; bp->b_iodone = 0; bp->b_error = 0; bp->b_wcred = bp->b_rcred = NOCRED; if (bp->b_bufsize != sz) allocbuf(bp, sz); bp->b_bcount = bp->b_bufsize = sz; bp->b_dirtyoff = bp->b_dirtyend = 0; bp->b_blockb = 0; bp->b_blockf = 0; return (bp); ... .)b .lp The buffer is prepared for reuse by disassociating it from any vnode it may have been associated with before and removing any credentials that had been attached by prior operation. Its contents are marked invalid and busy and its moved from the hash lookup queues so that it will not be found by .b incore() since it does not yet have associated with it a sensible name and address (vnode and block number). .lp The interrupt mask is removed and all remaining operations performed without holding the interrupts masked. The buffer is associated with no device and no vnode or logical block, and other fields in the buffer header are similarly default initialized to avoid unpredictable behavior by functions calling this routine which do not fill them out completely. The buffer is adjusted so that the size of its contents .i b_bufsize matches those of the requested size .i sz by the function .b allocbuf(). The buffer pointer is then returned as a successfully allocated buffer header and contents. .uh "What is getblk()?" .pp .b getblk() returns a buffer of the appropriate vnode, offset, and size to the caller. .uh "What Does the Buffer Cache Expect From getblk()?" .pp .b getblk() locates the buffer with valid contents in the buffer cache; otherwise, the function creates a buffer with empty (invalid) contents. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 539 ] struct buf * getblk(register struct vnode *vp, daddr_t blkno, int size) ... .)b .lp .b getblk() has three arguments: the vnode and logical block number of the contents sought; and the size of the buffer to be allocated. It returns a pointer to the buffer associated with this vnode and contents. .pp .b getblk() may block awaiting allocation of a buffer or waiting for a buffer that is in use by another process. The function does not read the buffer's contents from the underlying storage; .b bread() (***HYPERLINK WHAT IS BREAD- POPUP) or .b breada() (***HYPERLINK WHAT IS BREADA- POPUP) are used instead. Instead, .b getblk() is used to read the contents of the cache. .uh "How is getblk() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 545 ] for (;;) { if (bp = incore(vp, blkno)) { x = splbio(); if (bp->b_flags & B_BUSY) { ... .)b .lp The function endlessly loops searching for a buffer in the cache. It uses the inline function .b incore() (***HYPERLINK WHAT IS INCORE - POPUP) in its search, using a hash table for the requested buffer in the buffer cache associated with a vnode and logical block number. If the block is found, the processor's interrupts are masked to prevent a race with the buffer's mutual exclusion bit (B_BUSY) which is examined to determine if the buffer is in current use by another process. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 549 ] #ifdef KTRACE if (curproc && KTRPOINT(curproc, KTR_BIO)) ktrbio(curproc->p_tracep, KTBIO_BUSY, vp, blkno); #endif ... .)b .lp If kernel tracing is enabled, a trace record is written to indicate that a busy buffer has been found for this .b getblk() operation. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 554 ] bp->b_flags &= ~B_AGE; bp->b_flags |= B_WANTED | B_CACHE; tsleep ((caddr_t)bp, PRIBIO, "getblk", 0); splx(x); continue; } ... .)b .lp The age bit associated with the buffer (if present) is cleared to ensure that the buffer will not be released onto the age queue because we already know it will be reused. The wanted bit is then set for this buffer header so our process will be awakened when the buffer becomes available. We then sleep waiting for the buffer to become available. .lp Upon conclusion of the sleep, the interrupt mask is removed and the function is reentered by again checking the .b incore() status of this buffer (since it may have actually been invalidated and thus is no longer present in the cache). .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 560 ] #ifdef KTRACE if (curproc && KTRPOINT(curproc, KTR_BIO)) ktrbio(curproc->p_tracep, KTBIO_REFERENCE, vp, blkno); #endif ... .)b .lp If kernel tracing is enabled, a reference trace record is sent to indicate that the buffer has been rereferenced without it being busy. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 565 ] bp->b_flags &= ~B_AGE; bp->b_flags |= B_BUSY | B_CACHE; bremfree(bp); if (bp->b_bufsize != size) allocbuf(bp, size); } else { ... .)b .lp The buffer's age bit is removed to indicate that this buffer has proven its worth, and the buffer is then removed from the free list. Its size is adjusted according to the size required by this call to .b getblk(). Note that this may alter the contents of the buffer. All the cases dealing with a found buffer have now been dealt with. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 571 ] if ((bp = getnewbuf(size)) == 0) continue; bp->b_blkno = bp->b_lblkno = blkno; bgetvp(vp, bp); x = splbio(); bh = BUFHASH(vp, blkno); binshash(bp, bh); bp->b_flags = B_BUSY; } splx(x); ... .)b .lp Since no extant buffer has been found, a new buffer header and contents are obtained by calling .b getnewbuf(). (***HYPERLINK WHAT IS GETNEWBUF() - POPUP) If no buffer is available, .b getnewbuf() blocks and then returns zero. Since a buffer may have been created by another process at the same location during the pause, this function will restart itself from the beginning. If no buffer is found, it again calls .b getnewbuf() and the procedure repeats. .lp When a buffer is available, it is assigned to the logical block associated with this vnode. Because the physical block associated with the logical block has not yet been determined, the physical and logical block fields are identical, signalling that translation needs to be done. All block devices have their interrupts masked and the buffer is inserted into the hash table and marked for exclusive use by setting its buffer flags busy. .lp In either case of finding the block or not finding the block and having to create one, we now have a buffer in the buffer cache associated with the vnode and its offset. The interrupts are unmasked and the buffer returned, terminating the loop. .uh "What is geteblk()?" .pp .b geteblk() allocates an empty buffer from the buffer cache that will not be associated with a vnode. It is used to allocate buffers for special purposes (usually those not associated with the filesystem). .uh "What Does the Buffer Cache Expect From geteblk()?" .lp .b geteblk() isolates a buffer from the buffer cache and returns after it is allocated. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 591 ] struct buf * geteblk(int size) ... .)b .lp .b geteblk() has one argument: the size of the buffer requested. It returns a buffer when allocated. .pp In earlier Unix systems that had no general purpose memory allocator, buffers were used in their stead to gain control of a portion of memory for temporary use. In the current release of 386BSD, .b geteblk() is present here only for compatibility with older systems and to allow the .b disklabel feature of the UFS filesystem to function. .uh "How is geteblk() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 597 ] while ((bp = getnewbuf(size)) == 0) ; x = splbio(); binshash(bp, bfreelist + BQ_AGE); splx(x); return (bp); ... .)b .lp .b getnewbuf() is used to obtain a free buffer of the requested size. If one is not available, it is called until one is available. The buffer is inserted into a free list on the off-chance that the busy bit might be left off before being released, so the assumption of it already being on a free list is itself invalid. This is done in the case the buffer header is corrupted before passing it back. Since contents of the queues and hash tables may be altered by interrupt level calls to .b biodone() (which (***HYPERLINK WHAT IS BIODONE - POPUP) calls .b brelse()), (***HYPERLINK WHAT IS BRELSE - POPUP) the interrupts are masked on manipulation of the hash table. The buffer is then returned to the caller. .uh "What is allocbuf()?" .pp .b allocbuf() adjusts the buffer size to the desired size while preserving buffer contents. It is used by both the buffer cache and the filesystem clients to revise the size characteristics of the buffer after it is present in the buffer cache if it has expanded or contracted in size. .uh "What Does the Buffer Cache Expect From allocbuf()?" .pp .b allocbuf() allocates a region of memory of the appropriate size and exchanges it for the one the buffer had before, freeing the old contents. Before freeing the old contents, it copies as much of the buffer as will fit. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 616 ] void allocbuf(register struct buf *bp, int size) ... .)b .lp .b allocbuf() has two arguments: the pointer to the buffer header: and the size of the data contents that the buffer should contain. It returns nothing. The side-effect of this function is that it adjusts the size of the buffer contents. .uh "How is allocbuf() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 622 ] newcontents = (caddr_t) malloc (size, M_TEMP, M_WAITOK); /* copy the old into the new, up to the maximum that will fit */ (void) memcpy (newcontents, bp->b_un.b_addr, min(bp->b_bufsize, size)); /* return old contents to free heap */ free (bp->b_un.b_addr, M_TEMP); /* adjust buffer cache's idea of memory allocated to buffer contents */ freebufspace -= size - bp->b_bufsize; allocbufspace += size - bp->b_bufsize; /* update buffer header */ bp->b_un.b_addr = newcontents; bp->b_bcount = bp->b_bufsize = size; ... .)b .lp .b allocbuf() first calls .b malloc() (***HYPERLINK WHAT IS MALLOC() IN MALLOC.C - POPUP) to obtain a new portion of memory of the appropriate size and copies the old contents into place in the new buffer space. The function takes care to only copy as much of the old buffer contents that will fit or is present in the old buffer. It then frees the contents of the old buffer. .lp Note that .b allocbuf() may block on waiting for .b malloc() to provide it with the new buffer contents. This differs from the companion routine in .b getnewbuf(), which cannot block since .b getnewbuf() may need to constitute a buffer from a pagein operation when memory is resource-restricted and the system can not risk deadlock by blocking. .lp After obtaining the new buffer resource, .b allocbuf() adjusts for the change in buffer size against free buffer space and allocated buffer space. Note that intentionally no check is made as to overage of allocated buffer space (in fact, .i freebufspace can actually be negative), since its effect will only be felt for a short time period. As such, this is a soft limit check and not a hard limit check. .lp The buffer header is now updated to hold its new contents and size. Two fields in the buffer header, .i b_bufsize and .i b_bcount, hold the allocated size of the buffer and the currently used portion of buffer size, respectively. Filesystems may alter the latter but never the former. Note that in the case of expanding the buffer, there is no guarantee as to the contents of the expanded region of the buffer. It is left up to the caller to clear or otherwise initialize buffer contents. .uh "What is biowait()?" .pp .b biowait() waits for the buffer to finish an outstanding I/O operation and return the status of the operation. .uh "What Does the Buffer Cache Expect From biowait()?" .pp .b biowait() suspends waiting for an I/O operation to complete on the supplied buffer. After the buffer I/O operation is finished and .b biodone() (***HYPERLINK WHAT IS BIODONE() - POPUP) is called, .b biowait() will obtain the error or success and return it to the caller. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 645 ] int biowait(register struct buf *bp) ... .)b .lp .b biowait() has one argument: the pointer to the buffer header to be waited for. It returns the success value (0) or error code of the I/O operation. If no error code is present, yet an error indication was present on the buffer, the generic I/O error (EIO) will be returned. .uh "How is biowait() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 650 ] x = splbio(); while ((bp->b_flags & B_DONE) == 0) tsleep((caddr_t)bp, PRIBIO, "biowait", 0); #ifdef KTRACE if (curproc && KTRPOINT(curproc, KTR_BIO)) ktrbio(curproc->p_tracep, KTBIO_WAIT, bp->b_vp, bp->b_lblkno); #endif ... .)b .lp .b biowait() masks interrupts for its operation as it examines bits in the flags set by interrupt service routines from block devices and manipulates contents of the hash tables and state of the buffer. The function checks for the presence of the done bit to indicate that the operation has completed. If the done bit (set by .b biodone()) is not present, it sleeps on the buffer header's address waiting for I/O to complete. Note that the sleep is uninterruptible because there is no way with the current interface to block devices and inform the device driver to abort the I/O and return. .lp The significance of this uninterruptible wait is that if an I/O operation is started on a file or block device and that device either loses the interrupt or forgets to signal a .b biodone(), it is up to the device driver to timeout and abort the transfer for the block to be cleared -- otherwise, it would wait forever, as occurs in the case with taking out a floppy while reading a file off of it. .lp Some devices become ready immediately upon doing the I/O operation, so that by the time .b biowait() is reached the done bit is already set. In this case, no sleep operation need be performed and this phase is bypassed completely. .lp if the process that has called .b biowait() is being traced, a trace record is written indicating that a wait has completed for a block. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 658 ] if((bp->b_flags & B_ERROR) || bp->b_error) { if ((bp->b_flags & B_INVAL) == 0) { bp->b_flags |= B_INVAL; bremhash(bp); binshash(bp, bfreelist + BQ_AGE); } if (!bp->b_error) bp->b_error = EIO; else bp->b_flags |= B_ERROR; splx(x); return (bp->b_error); } else { splx(x); return (0); } ... .)b .lp The buffer header's flags are checked to see if an error is present or if the error field in the buffer header holds an error code. If so, the buffer header is evaluated to return an error code and make its contents consistent. If the contents of the buffer have not yet been marked as invalid, the buffer is marked as invalid and allowed to persist on the age queue where it may be reclaimed after a short dwell time. .lp If there is no error present in the error code field, a generic I/O error (EIO) is signalled in its place. Otherwise, the .i berror bit is forced present in the buffer flags so that if either the .b berror field or the BERROR flag is set by the calling code, the error will be properly signalled. In either the case of returning an error or success, the interrupt mask is removed and the value of success or error returned to the caller. .uh "What is biodone()?" .pp .b biodone() signals the completion of an I/O operation on a buffer, so that the exclusive user of the buffer can resume use of the buffer after its operation has completed. .uh "What Does the Buffer Cache Expect From biodone()?" .pp .b biodone() is called by both interrupt level and top level code. It processes completion of I/O from the perspective of the buffer cache, checking to see if an optional function should be called to do additional processing prior to either releasing a buffer that has had asynchronous I/O performed on it or informing the kernel that the I/O operation has been completed on the buffer. .lp If the operation is not being waited for (ASYNC), the buffer is released onto the free lists. If subsystems (such as swap) would like to be made aware of I/O completion (asynchronous paging wants to reclaim resources immediately), a function call is optionally processed. .lp If a dirty buffer has been cleaned, the vnode layer is informed so as to move the buffer from a dirty list to a clean list. This ensures that the file contents of the buffer cache are forced to disk upon file synchronization (fsync) operations. .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 683 ] void biodone(register struct buf *bp) ... .)b .lp .b biodone() has one argument: the pointer to the buffer header. It returns nothing. The side effect of this function is to inform the buffer cache that an I/O operation has been completed on the buffer header. .pp One possible alternative to this mechanism would be to have .b biowait() (***HYPERLINK WHAT IS BIOWAIT() - POPUP) remove B_ASYNC and let all .b bwrite()'s (***HYPERLINK WHAT IS BWRITE() - POPUP) become .b bawrite()s. (***HYPERLINK WHAT IS BAWRITE() - POPUP) .b biowait() would then reacquire them to recover the status information. .uh "How is biodone() Implemented on the 386?" .(b L ------------------------------------ ... [ File: /usr/src/kernel/kern/fs/bio.c, line: 688 ] x = splbio(); if (bp->b_flags & B_CALL) { (*bp->b_iodone)(bp); bp->b_flags &= ~B_CALL; } if ((bp->b_flags & (B_READ|B_DIRTY)) == B_DIRTY) { bp->b_flags &= ~B_DIRTY; vwakeup(bp); } if (bp->b_flags & B_ASYNC) brelse(bp); bp->b_flags &= ~B_ASYNC; bp->b_flags |= B_DONE; wakeup((caddr_t)bp); splx(x); ... .)b .lp Since .b biodone() is an interrupt level function that may affect the status of the buffer field, all other block interrupts are masked during operation. A flag is examined (B_CALL) to see if an optional I/O done field .i b_iodone is called from the buffer's header. If it is, the function will be called with the buffer header pointer provided and the flag bit cleared to indicate that the call had been performed. .lp If the buffer corresponds to a write of a dirty buffer, the dirty bit is removed from the buffer and the vnode layer is informed that the buffer has been cleaned by a write operation. If the buffer was an asynchronous operation, it is released by calling .b brelse() (***HYPERLINK WHAT IS BRELSE() - POPUP) since there is no top level routine that is waiting for it. All of the interrupt level masks and top level routines occur because of this call from potentially interrupt level. .lp The flags are adjusted to eliminate the indication of asynchronous I/O since the operation was performed (if present) and the flags are updated to indicate the I/O operation was performed successfully. Any processes sleeping for the buffer are awakened, the masked interrupts are unmasked, and the function completes. Note that the interrupt masks are frequently redundant, since interrupt routines mask the block devices anyways. The reason that this must be done for top level calls to .b biodone() is that they do not mask interrupts. .uh "Design Perspective on Bio.c" .pp Once upon a time the buffer cache in a Unix system was an exciting concept. Until a buffer cache was added, the earliest Unix systems at Bell Labs did not allow overlap of I/O operation with other processes executing. .b bio allowed multiprogramming operation in these early systems, so that when a reference was made to a block where I/O operation had not yet been performed the operation could be queued and another process run while the disk subsystem located and returned the appropriate block. .pp .b bio is an example of a kind of superdriver which allowed queued I/O requests to be processed without blocking the processor from operating. Given the resources of small machines like the PDP-11, this was a perfectly appropriate mechanism. .b bio became even more desireable with the addition of delayed writes and read-ahead operations, which allowed more decoupling of synchronous operation between disks and processes accessing the filesystem. .pp The buffer cache was especially useful to early unix systems because of their use of hierarchical filesystems (then unusual in any small system). A single pathname lookup (see .b fs/lookup.c) (***HYPERLINK LOOKUP.C TABLE OF CONTENTS - MACRO) could require many I/O operations be performed. With caching, however, many of these I/O operations would be avoided. .pp Nowadays, I/O performance enhancement requires more than the old buffer cache mechanism can provide. Besides handling name evaluation, this mechanism now must provide the ability to manage mass I/O systems so that they are always operating at peak transfer rate. It must also possess the ability to shift granularities of information back and forth that are larger than the size of the filesystem blocks themselves. .pp Because of these modern day requirements, we can no longer cache low level I/O transfers -- instead, we must cache regions of address space associated with files. In sum, we must work at a higher level of abstraction where we direct an I/O operation to be performed that far exceeds the size of the actual low level atomic transactions. .pp Why we cannot accomplish this at the lower level of filesystem blocks is simply because the information regarding the characteristics of the transfer is managed only at the top level of abstraction. While it is possible to devise an architecture which always "sends" this information down to the lower levels whenever any operation may be performed, this would still be inadequate since we may be required to manage at the top level of abstraction the timing (e.g. arrival and service times) in order to use the resource most effectively -- this requirement is not easily worked into our communications mechanism to the lower layers. .pp Perhaps even more importantly, in the future our operating system may wish to make high level changes in allocation strategies, independent of the lower levels which actually implement the transfers. Again, the burden of synchronizing lower and upper levels in a dogmatic way could easily prove too costly. .pp Another future consideration for future design is that the buffer cache mechanism provides one set of policies for all filesystems that make use of the buffer cache and does not allow for per-filesystem management of such policies. One example of this is the method by which filesystem metadata (inodes, disk indirect blocks, cylinder groups) must be synchronized when performing a file update operation. Currently in BSD systems, this is handled by forcing the association of metadata with the file's actual data on block lists and disassociating them when an I/O operation needs to be performed. This is done so that when a file is forced written to the disk (fsync), all of the related blocks can be found, including the indirect blocks which don't actually belong to the address space of the file itself. Thus, there is no general purpose way currently to make these associations other than by implementing the buffer cache to provide a callout to the vnode layer when this is necessary. .pp An alternative design is to avoid the implied I/O involved with the filesystem cache, by implementing a lookaside cache which does I/O directly to the lower layers of the filesystem. Prior to initiating the I/O, it checks to see if the block is present in the filesystem cache and then chooses to use the contents of the cache rather than doing the transfer. Such a mechanism is simpler and avoids the need for the knowledge of interrupt level operation or much in the way of semantic information regarding the contents of the buffers.