xref: /386bsd/usr/src/kernel/kern/fs/ann/bio.c.ann (revision a2142627)
include(/etc/ann/listing.m4)
2.if t \
3.po 8n
4.if n \
5.po 0n
6.ll 70n
7.\"$Id: bio.c.ann,v 1.3 94/08/26 13:18:49 bill Exp Locker: bill $
8.he 'Annotation of kern/fs/bio.c''DRAFT'
.fo 'translit(`$Revision: 1.3 $',`$') '% 'Copyright (C) 1994 TeleMuse
10.(l C
11.sz 14
12Annotation of the File kernel/kern/fs/bio.c
13William and Lynne Jolitz
14.sz 10
15Copyright (C) 1994 William and Lynne Jolitz. All Rights Reserved.
16.)l
17.uh "Introduction:"
18.pp
19The file
20.b "fs/bio.c"
21contains the machine independent code used to
22implement the cache management of independent filesystems.
It corresponds to the middle layer of abstraction of a filesystem, which
maintains the exchange of file data and metadata contents
read from or written to the external storage comprising the physical file.
26.pp
27Because this interface is shared by many filesystems
28and many processes which are clients of a filesystem (even clients
29of the same exact file), the reference to this information is
30relative to the vnode describing the file and an opaque block offset
31(relative to the filesystem) that describes the contents that are cached.
32While the total resource dedicated to the filesystem is globally
33managed among all clients, the interpretation and management of the
34cached entries themselves is done relative to each filesystem and opaque to
35unrelated filesystems.
36.uh "Buffer Cache in Detail"
37.pp
38This filesystem cache consists of buffer elements, which are transactions
39of file data or metadata. Buffers are comprised of two parts: a
40fixed-sized buffer header that records details of the transaction, and a
41variable-sized data portion that holds the buffer contents associated with the
42transaction. The data portion can be thought of as a side-effect of
43the buffer header, since the buffer cache makes no direct use of it at all.
44.pp
The cache works by attempting either to
avoid transactions, by reusing buffers and short-circuiting redundant
transactions, or to
anticipate new transactions before they happen so that they may
be overlapped. In addition, the buffer cache provides an interface
50between the synchronous filesystem data transfer requests and the
51asynchronous mechanisms (device driver, network, and so forth) used
52as a basis for transferring the actual data. Finally, the scheduling
53of global I/O is made external to the various filesystems and managed
54at a central point for consistent policy.
55.pp
The filesystem buffer cache is called only by filesystems; in turn, it
only interacts with filesystems to effect its actions. It interfaces
to the basic kernel by obtaining and releasing memory resources for buffer
59contents
60(see
61.b malloc()
62(***HYPERLINK WHAT IS MALLOC() IN MALLOC.C - POPUP)
63and
64.b free()
65(***HYPERLINK WHAT IS FREE() IN MALLOC.C - POPUP)
66in
67.b kern/malloc.c
68(***HYPERLINK MALLOC.C TOC - MACRO)
for further information). It also interfaces
to the master VFS by performing generic operations
71on the control of filesystem contents (see
72.b vflushbuf(),
73.b bgetvp(),
74.b brelvp(),
75and
76.b vwakeup()).
.pp
One should note that the two write alternatives
.b bawrite()
and
.b bdwrite()
carry their distinguishing operation at the front of the name
because the decision of how to write is made before the operation. This
contrasts with functions such as
.b breada(),
where the second, asynchronous read operation
occurs after the first.
88.uh "The Life Cycle of a Buffer in the Buffer Cache"
89.pp
90The buffer cache caches I/O transfer requests (either to a device driver
91or to an underlying storage mechanism, such as a network protocol). As such,
92it performs I/O on demand when either a buffer does not contain the appropriate
contents or when the contents are forced to be written out. I/O goes
on independently of the requests to view the buffer cache. This can be
95considered not only a design feature but also a flaw, since filesystems
96don't have any control over when I/O might occur as all I/O is lumped
97together into one set of policies for all filesystems instead of using
98a more sensible mechanism such as a "lookaside" cache arrangement.
99.pp
100Buffers to be read are generally obtained by the
101.b bread()
102(***HYPERLINK WHAT IS BREAD()
103or
104.b breada()
105(***HYPERLINK WHAT IS BREADA()
functions, which ensure that the contents of a block exist and return
the buffer containing those contents, or an error indicating why an I/O
operation to a buffer could not obtain the appropriate contents.
109In the case that contents could not be obtained, the buffer that was
110allocated to receive the contents is left as invalid and will be
immediately recycled since it has no contents to cache. Since these
functions must allocate a buffer, they may block waiting for a usable
buffer to become available. Either the buffer is the desired one that
already has contents available without any I/O required (but is still in use
exclusively by another process; thus, the caller must block until it
can obtain exclusive access) or it is not. In the latter case, a free buffer must
117be obtained for
118.b bread()
119and
120.b breada()
121to obtain contents for it to return.
122.pp
123In the process of writing a buffer, the buffer is obtained from
124.b getblk()
125which either finds an existing buffer with the cached contents (the
126flag B_CACHE indicates that the buffer was found with contents
127intact in the cache) or allocates a buffer with no contents
128present covering this portion of underlying storage exclusively
129in the buffer cache and marks it as being invalid (B_INVAL). In the
130second case, if a portion of the buffer were to be modified, a read
131is necessary to obtain valid contents before the write of
132the rest of the buffer could take place. (A
133.b bread()
would best provide this, since it would ensure that the buffer was read.
However, if the entire point of the operation was to ensure that there
was only one entry in the buffer cache corresponding to the block, with
the contents themselves considered irrelevant, a
.b getblk()
call is sufficient.)
140In either case, exclusive access to the portion of that vnode is now
141assured since all other requesters will wait until the buffer has
142been returned to the free pool for use by other processes.
143.pp
144Having now obtained the buffer, its contents can be loaded and passed to
145one of the write routines to be written to storage. The
146.b bwrite()
147(***HYPERLINK WHAT IS BWRITE()
148function will immediately force the operation to occur and wait for
149the result (suspending during the operation), while the
150.b bawrite()
151(***HYPERLINK WHAT IS BAWRITE()
152will issue the operation and arrange for its release to the free pool
153upon completion without waiting. A third variant,
154.b bdwrite(),
155(***HYPERLINK WHAT IS BDWRITE()
will delay the actual write operation until either the buffer must be
reused elsewhere in the cache or the write is forced
by a
.b bawrite()
or
.b bwrite().
162.pp
163Reads (when finished) return buffers to the buffer cache with a
164.b brelse()
165(***HYPERLINK WHAT IS BRELSE()
166function call.
167In the case of an asynchronous operation,
168.b brelse()
169can also be called from interrupt level by
170.b biodone()
171(***HYPERLINK WHAT IS BIODONE()
172to release the buffer cache buffer back to the free pool.
173Other operations waiting for a
174.b biodone()
175to signal completion of the operation will use the
176.b biowait()
177(***HYPERLINK WHAT IS BIOWAIT()
178function, waiting for the operation to complete by suspending
179(and also recovering the error if any occurred in performing the operation).
Note that there is no external functional interface to see whether a
buffer is in use, as all operations may block to gain exclusive access
to a buffer, however long that block may last.
183.pp
184Buffers may be expanded or contracted to hold different sized contents.  The
185.b allocbuf()
186(***HYPERLINK WHAT IS ALLOCBUF()
187function expands or contracts the size of a buffer as
188requested. It is used by the filesystem to change the size of the entry
189that is being cached while preserving the contents present in the previously
190sized buffer (throwing away information on a reduction and retaining partial
191contents on an expansion).
192.uh "Delayed Write and Read-ahead Operations"
193.pp
194Delayed write operations are buffer transactions that are
195recorded as having been written when they actually haven't -- instead,
196they "dwell" in the buffer cache for a period of time.
197This mechanism allows for temporary blocks to appear to be
198"written" and "read" without any I/O actually occurring before
199invalidation occurs (as would be the case with compiler
200temporary files). In addition, delayed write operations allow
201output operations to be queued behind read operations, thus postponing
202their effect until the cache entry is needed by a read operation.
203.pp
Read-ahead operations, in contrast, are the complement
of delayed write operations, in that read-ahead operations anticipate
206the need for a successive block. Since most I/O on Unix systems is sequential,
207successive read operations can be detected and the operations pre-queued
208for the next successive block to achieve double buffering.
209.pp
Note that read-ahead and "write-behind" (i.e., delayed write)
211operations create additional traffic for the buffer cache because
212a certain amount of the buffer cache is tied up in holding the state
213for both reads and writes that may or may not be of use.
214This buffer cache memory overhead
215is the price paid for attempting to avoid a more expensive I/O operation.
216.pp
217One interesting question raised by this approach is that of just who
218has priority -- read-ahead blocks which haven't
219proved that the information needs to be cached, or write-behind blocks
220which may not ever be reused again and instead be queued for output?
221If priority is given to read-ahead blocks, we tend to commit delayed
222writes back to disk earlier and effectively perform more write
223operations than might have otherwise been needed. This may be especially
224expensive if we have to read them back in again often, as would be the
225case with compiler temporaries. On the other hand, if priority is
226given to delayed write operations, the read-ahead operation can be severely
227impacted by the lack of buffer space necessary to hold its contents. In
228this case, we will end up always synchronously reading from the disk.
229.pp
230Since the system does more reads than writes, read-ahead operations are
231favored over delayed write operations, resolving this contradiction,
232but this is a somewhat arbitrary solution. Unfortunately, since this
233caching operation occurs at a fairly low level of abstraction in the system,
234there is insufficient information available (e.g. arrival pattern,
235file persistence duration -- temporary versus permanent) with which to
236implement a sophisticated policy mechanism (such as one using queuing theory).
237.pp
238For example, when the system does a core dump of a program
239larger than the size of the buffer cache, the buffer cache
240is immediately overrun with contents bound for the disk (or network),
wiping out all previously held contents in the process. (Ironically,
these core dump contents may never even be read, especially if this is
a system process.) If the core dump is sufficiently large, it may take
244several seconds to clear in writing to the disk. During this time, the
245system will appear to "pause" as the other blocks queued for transfer wait.
246.pp
247If instead of allowing the buffer cache to be
248overwritten we possessed the appropriate policy information,
249writes would only be accepted into the cache at the
250rate at which they were consumed by the underlying transfer mechanism,
251resulting in a steady-state. Since we cannot satisfy
252the transfer rate any faster than that allowed by the underlying device,
253it doesn't help if we accept any more transactions into the cache
254than can possibly be satisfied by the lower levels of the system.
255.pp
One area of discussion is determining just "when"
information should be allowed to persist in the buffer cache
in excess of the ability of the lower-level storage mechanism to transfer
it. In essence, we must decide when the
information will be reused and should persist regardless (the compiler
temporary case), and when the "virtual bandwidth" should be constrained
to that of the lower-level device (as in the core file case).
263.uh "Optimizing Buffer Transactions"
264.pp
Read-ahead and write-behind operations were used in the original
Unix buffer cache. In fact, almost all timesharing operating systems
with buffered filesystem transfer caches prior to
Unix, including BTSS and Multics, also used these operations to
reduce the number of transactions by managing a cache of them
and reordering them as necessary to minimize the time and number
of transactions required, so this concept is not unique to Unix.
272.pp
273Ironically, by allowing read-ahead operations we actually
274increase the number of transactions that occur. However, since the net
cost of occasionally throwing away a read-ahead block is low,
the overall system efficiency is increased anyway (since on the off
chance we did use the read-ahead, an operation that would otherwise
have blocked would not need to occur). Some Unix systems
defeat read-ahead and aging under the mistaken belief that
simply reducing the number of operations must increase system efficiency.
This is not always the case, as one must examine the actual
delay time reduction of the application programs as well as the
number of operations to decide if the optimization is successful.
284.pp
285For example, in the 386BSD Release 0.1 version of
286.b bio.c
delayed writes and read-aheads were not implemented correctly
for all cases. After this was rectified, the total number of I/O
operations was observed to have increased (which we found surprising).
However, the time required for the average completion of a kernel
build actually dropped! How could this be? We found that
read-ahead operations would occasionally be successful
and occasionally not -- those not successful would have to be reissued,
so the number of operations increased. The reason this did not degrade
performance but instead actually improved it was that when the block was
of immediate use, it
avoided a delay in the utility getting a block of data. In the case
where it wasn't of any value, the buffer would be reused by the delayed write
of a compiler temporary and thus would hold onto that information anyway. As
such, the proper effect of both delayed write and read-ahead operations would
result. And although the number of aggregate read/write
operations increased, many of the read-ahead blocks that were
never used were never waited for either, so they did not affect
the elapsed time of the running application.
304.uh "File I/O Clustering"
305.pp
Aggregating read/write operations for greater efficiency is
possible not only at the low level of abstraction of the
transfers themselves, but also at the higher level of abstraction
of whole transfers from the user process's perspective. Consider a user
process performing I/O on an abstract file -- we can collect
its numerous I/O operations into a much larger aggregate of individual
operations, coalesce them into a single operation, and perform that.
313.pp
314Given this higher level of abstraction, a
315.i clustered
read or write operation can be performed based not on the actual
buffer transactions but on a much larger scale, driven by
the economies of memory in the system as a whole rather than the economies of
memory in the buffer cache. Obviously, if the memory of a system is so
constrained that the buffer cache is a significant portion of memory
(as it was in earlier systems), clustered I/O would be no great advantage.
However, we routinely use PCs that contain many times the
amount of memory of the programs which they run.
324Clearly, the efficient use of memory as a resource might now
325allow us to use memory state to transfer larger aggregations of file
326transfers more efficiently. In other words, we exchange memory resource
327for I/O bandwidth to obtain a larger "virtual" bandwidth. This in a nutshell
328is what file I/O clustering is all about.
329.pp
330Since file I/O clustering lies between the top level view of the system
331(e.g. reads and writes) and the bottom level view of the system (actually
332performing the I/O), it is an operation that must invisibly
333reconcile the semantics of the top level reads and writes independent of
334the low level operations (in the current system, there is no such layer).
335Since this mechanism inherently affects the user process's
336address space, the virtual memory system is the logical means through
337which this new layer would be interposed.
338.pp
339To permit the virtual memory system to make decisions on
340the size and timing of transfers, additional information
341(queue sizes and I/O operation completion) must be available,
342so that the virtual memory system can
343properly choose how much resource to dedicate to a particular set of
344virtual transfers. As such, a sufficiently different
345architecture for attaching the lower level of the system to the higher
346level portion of the system is now necessary.
347.uh "fs/bio.c Functions"
348.pp
349Within this file are contained the following functions:
350.(l
351bufinit()
352incore()
353bread()
354breada()
355bwrite()
356bdwrite()
357bawrite()
358brelse()
359getnewbuf()
360getblk()
361geteblk()
362allocbuf()
363biowait()
364biodone()
365.)l
366(***HYPERLINK ALL OF THE WHAT IS TO THESE LIST OF FUNCTIONS)
367The names are indicative of the function they provide in the
368implementation of the buffer cache.
369.b "bread()"
370and
371.b "bwrite()"
372provide a virtual read/write interface to the
373buffer cache and the underlying storage, expressed as a filesystem.
374.b "breada()"
375and
376.b "bawrite()"
377provide asynchronous mechanisms avoiding a wait
378for the operation if the caller decides not to pause to see the outcome
379of the operation.
380.b "getblk()"
381and
382.b "brelse()"
383allow the buffer contents
384to be probed and released irrespective of any implicit I/O.
385.b "allocbuf()"
386allows for alteration of buffers to suit the needs of the filesystem.
387.b "biowait()"
388and
389.b "biodone()"
390provide an interface between underlying asynchronous I/O
391and the filesystem's more orderly synchronous world view.
392.b "incore()"
393and
394.b "getnewbuf()"
395are internal support routines used to implement the buffer cache.
396.b "bufinit()"
397is an entrypoint for the kernel to instantiate the buffer
398cache facility.
399.b geteblk()
400is an obsolete interface used to steal
401an empty block from the buffer cache for use in other mechanisms (a
402memory allocator, used prior to mbufs and
403.b "malloc()").
404(***HYPERLINK WHAT IS MALLOC() IN MALLOC.C - POPUP)
405.uh "Antecedents to the 386BSD Buffer Cache"
406.pp
407Earlier UNIX systems
408used a block-oriented mechanism to buffer physical I/O transactions
409to a disk drive. At this time, there was a single UNIX filesystem
410that used the block buffer cache to interface to disk drivers directly.
411.pp
412The names of the above routines and the function of the buffer cache
413are directly descended from this earlier work.
414However, it is deceptive to assume that they are similar -- a
415myriad of subtle differences exist due to the fact that this buffer
416cache works with variable-sized buffers of independent filesystems,
417caching transactions of the logical contents of the files. Various
418idiosyncrasies of the buffer cache are due to its appendix-like role
419in the anatomy of UNIX systems.
420.uh "Evolution of the 386BSD Buffer Cache"
421.pp
422The 386BSD buffer cache was developed independently
423for the original Release 0.0 in 1992. The first version
424of this file and how it was developed is discussed in the article
425.b "Missing Pieces II"
426(***HYPERLINK TO ARTICLE TOC - MACRO)
427in the section
428.b "Block I/O Cache".
429(***HYPERLINK TO SECTION - POPUP)
430Both the strengths and weaknesses of using this buffer cache
431design were also discussed in this
432article (see
433.b "4.4BSD Demands"
434(***HYPERLINK TO SECTION - POPUP)
435and
436.b "4.4BSD Weaknesses",
437(***HYPERLINK TO SECTION - POPUP)
438respectively).
439.pp
The reimplementation of the buffer cache code was quite distinct from the
design of earlier implementations, primarily in its use of the kernel
memory allocator
in place of an internal memory pool and page table entries from which to
allocate
buffer contents. This was done to reduce the code size by relying
on a common memory allocator, to provide a uniform memory policy for the kernel
as a whole (the intent was that memory would always be reclaimable by the
virtual memory system, and that unused memory (B_INVAL buffers) would always
be held by the memory allocator for universal use within the kernel), and
to reduce memory fragmentation (partial page allocation of buffer contents
would not require a whole, dedicated page of memory).
450.pp
The buffer cache code had minor bugs (double allocation of space), unfinished
implementation (buffer ageing and delayed writes), as well as inefficiencies
(buffer ageing and delayed writes, again). In addition, the design objective
of returning unused buffer memory contents to the memory allocator was not
completed. These problems have been
eliminated from this version; many of them had stemmed from early,
abortive attempts to introduce an interim version of the page cache, the
intended successor to the buffer cache.
459.pp
460This design, however, was intended from the beginning to be only an interim
461functional mechanism until a more modern mechanism -- the 386BSD
462Page Cache -- was ready for release.
463.uh "Future Directions: The 386BSD Page Cache"
464.pp
While the page cache is conceptually simple, its implementation and
the replacement of the buffer cache are far from simple. Both performance
and operational issues dominate such attempts. Most of these are
rooted in the dependence of the rest of the kernel on artifacts of the
buffer cache and its interface.
470.pp
471The buffer cache combines many different functions and features into
472a single mechanism. This concise arrangement was somewhat necessary given
473the tiny memory requirements of the original PDP-11 (and PDP-7 before it)
474environments it ran in. The buffer cache provides many mechanisms:
475.(l L
476* The interface for filesystems to access records (blocks) of storage.
* A way to achieve I/O on a storage medium.
478* A way of managing exclusive access to a portion of file contents.
479* A way to avoid redundant reads (transfer cache).
480* A way to schedule sequential read operations allowing overlap (read ahead).
481* A way to postpone/coalesce writes (write behind).
482* A way to resolve file-relative to actual physical record index.
483.)l
484.pp
485In addition, to manage the cache contents, it provides these additional
486mechanisms for the filesystem to use:
487.(l L
488* cache invalidation/ageing.
489* cache probing/statistics.
490* cache resizing.
491.)l
492.pp
These mechanisms are all implemented by a series of functions that interact
in an ad-hoc manner. Since the
policies are embedded in a buffer cache used by all filesystems, all
filesystems are compelled to use the same policies. Also, since I/O is
implied behind the cache, the buffer cache schedules buffer reuse/writeback
based on its demands for memory rather than when the filesystem would like
to see the output operation occur. Finally, even though the buffer cache
can be made to relinquish memory back to the memory allocators, it still
must have a notion of a fraction of memory resources dedicated to filesystem
cache contents, which limits the effectiveness of using memory to
accelerate I/O operations. Inherently, the virtual memory system and buffer
cache are on a collision course, as both provide fundamentally the same
operation (access to storage) for different subsystems (filesystem, address
space). As both must interact (mapped files) and cache state to improve
operation, each elbows the other for resources.
508.pp
509The resolution of this conflict is that the virtual memory system must
510be the only mediator to storage, and the only manager of memory resource.
511Filesystems should provide methods to perform and mediate I/O. A filesystem
512access to a portion of a file should be treated exactly like a program's
513access to a portion of a file mapped into its address space. A process
514consuming a large number of pages of I/O should compete alongside another
515process consuming a large number of pages to run its program within.
Actual I/O for either should be managed efficiently with the same mechanisms
(buffering for overlapped operation, caching to reduce redundant operations,
clustering to coalesce multiple operations into a single one,
reclamation
to cope with finite memory resources, and I/O scheduling to maximize the
benefit of the storage transfer mechanism).
522.pp
The page cache is the universal mechanism that makes this
architectural arrangement possible. With this mechanism, file contents are
only reachable via the virtual memory system. Any memory resources
used to cache contents are held by the virtual memory system. All I/O
operations are initiated, or scheduled to be initiated, by the filesystem.
The memory system may attempt to reclaim resources by writing back
a cached entry; however, the filesystem may choose how, when, or if
this is to be done. The virtual memory system provides access to
memory-based, logically ordered objects on which I/O may be applied,
while the filesystem transforms these into physical transfer operations.
533.pp
Thus, the previously mentioned functions can no longer all
be present in the same place, as they were in the buffer cache, but must
instead
be spread across both the virtual memory system and the filesystem, with
much of both changing considerably. In effect, the caching mechanism
migrates across the abstraction layers:
539.(l L
540Is:
541vm
542filesystem(top)
543buffer cache
544filesystem (bottom)
545drivers | network | ...
546
547Becomes:
548filesystem(top)
549  |	| vm
550  |	| page cache
551filesystem(bottom)
552drivers | network | ...
553.)l
554.pp
555In this new arrangement, the filesystem "top" layer transforms requests for
556file storage into virtual memory system requests (if necessary for the
557filesystem semantics, the virtual memory system may be bypassed entirely
558for non-cachable transfers of information), which may either be found
559in the global page cache, or reloaded on demand by the filesystem "bottom"
560layer. Thus, the net effect of the change is to cache potentially mappable
561memory state to avoid I/O transfers instead of caching the I/O transfers
562themselves.
563.uh "What is bufinit()?"
564.pp
565.b bufinit()
566instantiates the data structures with sane values so that the
567remaining code can use the facility.
568.uh "What Does the Kernel Expect of bufinit()?"
569.pp
570.b bufinit()
571must bootstrap enough of the buffer cache so that it can
572commence operation. At this point, the memory allocator is active,
573but no disk I/O is yet allowed. The buffer cache will be
574needed to help initialize block devices and to open the root
575filesystem -- its first operations.
576__listing(/usr/src/kernel/kern/fs/bio.c,80,81)
577.lp
578.b bufinit()
579has no arguments and returns nothing.
580Its side-effect is to initialize the statically
581allocated hash table, buffer headers, and free buffer queues.
582.uh "How is bufinit() Implemented on the 386?"
583__listing(/usr/src/kernel/kern/fs/bio.c,87,91)
584.lp
With nothing in the hash table, its entries point to themselves. Note
that there is only one value per hash table header for both the
beginning and end of the chain, yet the sentinel itself is never
allocated as a buffer. Because the empty and non-empty cases look
identical, a buffer can be linked onto or unlinked from a queue with a
single case, for greater speed.
591__listing(/usr/src/kernel/kern/fs/bio.c,94,100)
592.lp
593Likewise, a list of free queues exists on which inactive buffers are present.
594We link them the two ways they can be found (address match or
595availability on the
596.i freelist).
597__listing(/usr/src/kernel/kern/fs/bio.c,103,109)
598.lp
599With the list headers established, we place the buffer headers on the
600queues using macros used throughout the rest of the implementation.
601At this point, we take care to mark them as having no contents and
602matching no address.
603.uh "386BSD bufinit() Design Choices and Trade-Offs"
604.pp
605It is assumed in the 386BSD implementation that the buffer cache is
606empty to begin with and that the buffer cache code will allocate
and fill in the
608structures as needed, instead of allocating everything here.
609As such, the key design choice here was to
610place the burden on the rest of the implementation so that the virtual
611memory system, the ultimate memory manager, could reclaim space on-demand
612from the buffer cache. It becomes just one of many clients of memory inside
613the kernel. (N.B. It may eventually be desirable to allow revoking of memory
614as a capability on-demand.)
615.pp
616This implementation of
617.b bio.c
618still uses the buffer list and queue arrangement found in
619older Berkeley Unix (and traditional Unix) systems, instead of using
620the general purpose queue facility (see
621.b include/queue.h)
622used by the entire virtual memory system.
623Since the buffer cache is scheduled for replacement by the page
624cache (which uses these queues already), work has not been
625continued in this area.
.uh "What is incore()?"
.pp
Active buffers are linked onto one list of a hash table of lists,
where the hash function is determined by logical file and block number. The
.b incore()
function, used solely within the buffer cache,
locates valid cache contents regardless of the status of the buffer.
Since this function is small and can limit performance for large
cache sizes, it is inlined to increase speed.
.uh "What Does the Buffer Cache Expect of incore()?"
.pp
.b incore()
attempts to locate a buffer in the buffer cache.
If one is present, it returns a pointer to the
buffer's header; otherwise, it returns a null buffer pointer.
__listing(/usr/src/kernel/kern/fs/bio.c,115,116)
.lp
.b incore()
has two arguments: a vnode pointer and a logical block number.
The arguments are used to find an exact match from one of the
buffers holding valid data, regardless of buffer status.
If a match is found, the buffer pointer is returned;
otherwise, a null pointer is returned.
.pp
.b incore()
should never block; otherwise, an allocation race could result and the
wrong answer be obtained. Since
.b incore()
is used to find buffers that can be in either a locked or unlocked state,
it should never itself lock or unlock
buffers; otherwise, a chicken-and-egg recursion could result.
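.pp
The lookup contract just described can be sketched in freestanding C. Everything below (the structure layout, the hash function, and the names ending in _sketch) is a simplified assumption for illustration, not the actual kernel code:

```c
#include <stddef.h>

/* Hypothetical, simplified sketch of an incore()-style lookup:
 * buffers live on doubly linked circular hash chains keyed by
 * (vnode, logical block number), with the header as sentinel. */
#define NHASH 64

struct buf {
    struct buf *b_forw, *b_back;   /* hash chain links */
    void       *b_vp;              /* vnode the buffer belongs to */
    long        b_lblkno;          /* logical block number */
    int         b_valid;           /* stand-in for the validity check */
};

/* each hash header doubles as its chain's sentinel element */
static struct buf bhash[NHASH];

static struct buf *buf_hash(void *vp, long blkno) {
    /* fold the vnode pointer and block number into a table index */
    return &bhash[((size_t)vp + (size_t)blkno) % NHASH];
}

void bhash_init(void) {
    for (int i = 0; i < NHASH; i++)     /* empty chain: header points */
        bhash[i].b_forw = bhash[i].b_back = &bhash[i];  /* at itself */
}

void bhash_insert(struct buf *bp) {
    struct buf *bh = buf_hash(bp->b_vp, bp->b_lblkno);
    bp->b_forw = bh->b_forw;            /* link at front of chain */
    bp->b_back = bh;
    bh->b_forw->b_back = bp;
    bh->b_forw = bp;
}

/* incore()-style lookup: never blocks, never locks the buffer */
struct buf *incore_sketch(void *vp, long blkno) {
    struct buf *bh = buf_hash(vp, blkno);
    for (struct buf *bp = bh->b_forw; bp != bh; bp = bp->b_forw)
        if (bp->b_vp == vp && bp->b_lblkno == blkno && bp->b_valid)
            return bp;                  /* exact match on vnode+block */
    return NULL;                        /* cache miss */
}
```

Because the lookup never sleeps and never touches the busy bit, it can safely be used to probe for buffers that other processes currently hold locked.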
.uh "How is incore() Implemented on the 386?"
__listing1(/usr/src/kernel/kern/fs/bio.c,118)
.lp
The buffer hash address is located by the hash macro (see
.b include/buf.h)
which folds the value of the vnode pointer and logical block number
into an index into a fixed-sized hash table of buffer hash headers.
Buffer hash headers are actually only three words long, and overlay the
first three fields of the buffer header structure
.i (b_flags,
.i b_forw,
and
.i b_back),
so the definition for
.i bh
is a bit misleading. The hash header is the
sentinel that anchors a doubly linked circular chain of buffer
headers which may contain the buffer we are locating; otherwise, the
buffer is not present in the buffer cache.
__listing1(/usr/src/kernel/kern/fs/bio.c,121)
.lp
The hash chain is linearly searched for the desired buffer. The
first buffer is pointed at by the forward pointer (although we could
just as easily have walked the hash chain backwards via the back pointer). In
either direction the header itself is chained as the last element, so that
an empty hash header is an entry with both back and forward pointers
pointing to itself (see
.b bufinit()).
(***HYPERLINK WHAT IS BUFINIT() - POPUP)
__listing(/usr/src/kernel/kern/fs/bio.c,123,125)
.lp
Valid matches are exact matches on the starting logical block number and
vnode pointer. Note that neither the size of each cached buffer nor the block
size associated with the block address (i.e. the units of the address) is
taken into account by the
.b incore()
lookup, so it is not possible to locate
a logical block contained within a cached buffer, or to locate a buffer
by an alias of a different block size. The buffer cache effectively
caches disk transactions of a particular arrangement by a filesystem, and
does not provide a global view across all filesystems of vnode file
state present in RAM. Thus, the buffer cache is relative to each of its
filesystem clients.
.pp
This arrangement directly implies that
.b bio.c
is logically located at the middle layer of abstraction
of a filesystem, separating the filesystem's top layer (its semantics
and attachment to the virtual filesystem) from its lower layer (the logical
to physical mapping, commitment to storage, and storage allocation policies).
.lp
A check for the validity of the buffer is made before returning the block.
Normally, only valid data should be present on the hash chains; however,
the block may have become invalid if a race condition occurred while
removing it from the
hash chain (by code running from interrupt level or on another processor). Since
the unusual case is an invalid buffer, this is only tested prior to returning
a positive match. Buffers are not locked by this function -- the information
in the associated buffer header is subject to change prior to use, so
callers must take care to discard buffers as needed if they are stale.
__listing1(/usr/src/kernel/kern/fs/bio.c,127)
.lp
In the case of a buffer cache miss,
.b incore()
returns a null buffer pointer.
.uh "What is bread()?"
.pp
.b "bread()"
is one of the top-level function entry calls for the
buffer cache. A filesystem's top-level code uses this function to access
its file's contents by performing a virtual read of the
combined cache and the lower level of the filesystem.
.uh "What Does the Filesystem Expect of bread()?"
.pp
.b bread()
causes a buffer to be created with the contents requested.
It either obtains this from the buffer cache, if present, or
obtains an empty buffer and fills it with the appropriate
contents by means of the lower-level mechanisms of the filesystem.
__listing(/usr/src/kernel/kern/fs/bio.c,135,137)
.lp
.b bread()
has five arguments (the logical file parameters): the vnode pointer; the
logical block offset; the size of the transfer; the credentials used in
performing the read; and a pointer to the buffer returned by this
.b bread()
operation.
The credentials of the requester are
passed in the buffer to associate any lower-level
I/O operations. The function will return a buffer at
the location specified ("*bpp") as well as any error value encountered
in performing the operation.
.pp
The buffer is returned locked for exclusive
use, and must later be returned to the buffer cache, either for further use by
other clients of this block of the file or for the cache to recycle its buffer
memory for further use.
.pp
Since the filesystem code is unaware of any concurrent access of buffers,
.b bread()
must mediate the shared free buffer pool in order to gain exclusive
access to the desired buffer for its caller.
.b bread()
can block, since it is only called from the synchronous
top-half of the kernel. However, if it does
block, it must not hold any locked structures (except the buffer
it allocates, of course), nor hold references to any global data
structures across the block. (This is done to avoid deadlocks or circular
waiting.)
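.pp
The cache-or-transfer decision described above can be sketched as follows. The array-backed "disk" and "cache", and the name bread_sketch, are assumptions made purely for illustration; the real function works through buffer headers and the vnode strategy operation:

```c
#include <string.h>

/* Hypothetical, simplified bread()-style flow: consult the cache,
 * and only fall through to a real "device" read on a miss. */
enum { BLKSZ = 512, NBLK = 8 };

static char disk[NBLK][BLKSZ];     /* stand-in for the lower-level store */
static char cache[NBLK][BLKSZ];    /* stand-in for cached buffer data */
static int  cached[NBLK];          /* "was this block found in core?" */
static int  reads;                 /* tally of real read transactions */

/* returns 0 on success and fills *bpp with the buffer contents,
 * mirroring bread()'s (error, *bpp) return convention */
int bread_sketch(int blkno, char **bpp) {
    if (blkno < 0 || blkno >= NBLK)
        return -1;                 /* error propagated to the caller */
    if (!cached[blkno]) {          /* miss: perform the real transfer */
        memcpy(cache[blkno], disk[blkno], BLKSZ);
        cached[blkno] = 1;
        reads++;                   /* statistics tally */
    }
    *bpp = cache[blkno];           /* buffer handed back via *bpp */
    return 0;
}
```

The point of the shape is that a second request for the same block costs no transaction at all, which is the cache's whole reason for existing.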
.uh "How is bread() Implemented on the 386?"
__listing1(/usr/src/kernel/kern/fs/bio.c,142)
.lp
.b getblk()
(***HYPERLINK WHAT IS GETBLK() - POPUP)
is used to obtain an appropriate buffer for the block requested, by
either finding it in the cache or freshly allocating an empty buffer.
__listing1(/usr/src/kernel/kern/fs/bio.c,145)
.lp
If the block was already cached, no further work needs to be
done. However, if the block was either not cached or the cached
block is not valid, it must be filled with contents.
__listing(/usr/src/kernel/kern/fs/bio.c,146,153)
.lp
A read operation is then selected and the associated bits are cleared
prior to starting the read. Operations with credentials have
their reference counts bumped for read and write references, which are
asserted in case a
.b bread()
block is later written using
.b bwrite().
(***HYPERLINK WHAT IS BWRITE() - POPUP)
__listing(/usr/src/kernel/kern/fs/bio.c,154,157)
.lp
If this process is traced, the read operation is recorded to
the trace file.
__listing(/usr/src/kernel/kern/fs/bio.c,158,160)
.lp
We tally the number of read requests, and
then perform each request by returning
to the filesystem from which we were called
and having it satisfy the request from whatever underlies the filesystem.
__listing(/usr/src/kernel/kern/fs/bio.c,161,165)
.lp
The process is suspended until the read request completes.
Then the status of the operation
and the buffer obtained are returned to the caller.
.uh "What is breada()?"
.pp
.b breada(),
a special case of
.b bread(),
(***HYPERLINK WHAT IS BREAD() - POPUP)
is another top-level function entry call for the
buffer cache. Like
.b bread(),
a filesystem's top-level code uses this function to access
the file's contents by performing a virtual read of the
combined cache and the lower level of the filesystem.
.pp
.b breada()
allows for double-buffering
read operations by starting consecutive operations as a side-effect. It
assumes that by the time the first operation has been consumed, the
second will be ready and the third can be started.
If this perfect double buffering were maintained, data would appear
to be instantaneously available without delay, even if the file
read were larger than the buffer cache size. More commonly, however,
the buffer cache simply overlaps two
read operations, reducing the synchronous wait to every other
.b breada()
request in the case where data is demanded faster than the drive can supply it.
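.pp
The overlap can be sketched by extending the cache-or-read pattern with a second, non-waited step. As before, the flat arrays and the name breada_sketch are illustrative assumptions; the asynchronous start is simulated by completing the read-ahead immediately and only counting it separately:

```c
#include <string.h>

/* Hypothetical sketch of the breada() read-ahead policy: read the
 * demanded block synchronously; if the read-ahead block is absent,
 * start (here: merely simulate and count) an asynchronous read too. */
enum { BLKSZ = 512, NBLK = 8 };

static char disk[NBLK][BLKSZ], cache[NBLK][BLKSZ];
static int  cached[NBLK];
static int  sync_reads, async_reads;

char *breada_sketch(int blkno, int rablkno) {
    if (!cached[blkno]) {              /* demanded block: wait for it */
        memcpy(cache[blkno], disk[blkno], BLKSZ);
        cached[blkno] = 1;
        sync_reads++;
    }
    if (rablkno >= 0 && rablkno < NBLK && !cached[rablkno]) {
        /* read-ahead block: started but not waited upon (simulated
         * here by completing it at once and tallying it as async) */
        memcpy(cache[rablkno], disk[rablkno], BLKSZ);
        cached[rablkno] = 1;
        async_reads++;
    }
    return cache[blkno];               /* only the demanded block */
}
```

A sequential reader who asks for block N with read-ahead N+1 finds each subsequent demanded block already cached, so the synchronous wait disappears from the steady state.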
.uh "What Does the Filesystem Expect of breada()?"
.pp
Like
.b bread(),
(***HYPERLINK WHAT IS BREAD() - POPUP)
.b breada()
causes a buffer to be created with the contents requested.
It either obtains this from the buffer cache, if present, or
obtains an empty buffer and fills it with the appropriate
contents by means of the lower-level mechanisms of the filesystem.
.b breada()
also has the "side-effect" of starting a read on the next buffer if it is not
in the buffer cache.
__listing(/usr/src/kernel/kern/fs/bio.c,172,174)
.lp
.b breada()
has seven arguments: the logical parameters of vnode pointer; logical
block offset; size of transfer; the read-ahead block offset;
the read-ahead block size; the credentials of the transfer; and the
pointer to the buffer read in this operation.
The credentials of the requester are passed in the
buffer to associate any lower-level I/O operations.
.pp
Like
.b bread(),
this function will return a buffer at the location specified
("*bpp"), as well as
any error value encountered in performing the operation. The buffer is
returned locked for exclusive
use, and must be returned to the buffer cache either for further use by other
clients of this block of the file or for the cache to recycle its buffer
memory for further use.
.pp
Since the units of the logical block address are
unknown to the buffer cache, it is not necessarily the case that
the next block is one plus the previous (it could be 2 or 4, for example).
Also, since the block size can be variable in the filesystem, it is
not predictable by the buffer cache (as the read-ahead block may be the
last block, usually a fragment).
.pp
Since the filesystem code is unaware of concurrent access of buffers,
.b breada()
(like
.b bread())
must mediate the shared free buffer pool in order to
gain exclusive access to the desired buffer for its caller.
.b breada()
can block, since it is only called from the
synchronous top-half of the kernel. However, if this function
does block, it must not hold any locked structures (except the buffer
it allocates, of course) nor hold references to any global data
structures across the block.
.uh "How is breada() Implemented on the 386?"
__listing1(/usr/src/kernel/kern/fs/bio.c,179)
.lp
The
.b getblk()
(***HYPERLINK WHAT IS GETBLK() - POPUP)
function is used to obtain an appropriate buffer for the block requested,
either by finding it in the cache or by freshly allocating an empty buffer.
__listing1(/usr/src/kernel/kern/fs/bio.c,182)
.lp
If the block was cached, no further work needs to be
done. However, if the block was either not cached or the cached
block is not valid, it must be filled with contents.
__listing(/usr/src/kernel/kern/fs/bio.c,183,190)
.lp
A read operation is selected and the associated bits are cleared
prior to starting the read. Operations with credentials have
their reference counts bumped for read and write references, which are
asserted in case a
.b breada()
block is later written with
.b bwrite().
(***HYPERLINK WHAT IS BWRITE() - POPUP)
__listing(/usr/src/kernel/kern/fs/bio.c,191,194)
.lp
If this process is traced, the read operation is recorded to the trace file.
__listing(/usr/src/kernel/kern/fs/bio.c,195,197)
.lp
The number of read requests is next tallied.
We then perform the requests by going back to the filesystem from
which we were called
and having it satisfy the request from whatever underlies the filesystem.
__listing(/usr/src/kernel/kern/fs/bio.c,198,199)
.lp
At this point the implementation of
.b breada()
diverges from that of
.b bread().
Instead of waiting for the read
operation to complete, we simply note that an operation is outstanding.
__listing(/usr/src/kernel/kern/fs/bio.c,201,207)
.lp
We then check to see if the next block is already in the cache.
If it is not, we arrange to schedule an asynchronous read by
obtaining a new buffer using
.b getnewbuf().
(***HYPERLINK WHAT IS GETNEWBUF() - POPUP)
If none was available and
.b getnewbuf()
was forced to wait (as conditions may be difficult), we need to go
back and look in the cache again.
__listing(/usr/src/kernel/kern/fs/bio.c,208,209)
.lp
We then associate the buffer with this vnode and read-ahead block offset. To
signal that we haven't calculated the underlying address for this block,
the lower-level address for the buffer is set to the logical address. (The
.b bgetvp()
function in
.b fs/vnode.c
associates this buffer with this vnode.)
__listing(/usr/src/kernel/kern/fs/bio.c,210,214)
.lp
We insert the buffer into the hash table so that it can be found by
.b incore().
(***HYPERLINK WHAT IS INCORE() - POPUP)
The buffer is held busy so that if others find
it, they must first wait for it to be released before using it.
__listing(/usr/src/kernel/kern/fs/bio.c,217,230)
.lp
We request that an asynchronous read be performed and that the buffer be
released to the "aged" free queue.
Because the buffer has not yet been requested, we consider it aged
and a candidate for reuse; otherwise, we handle this just like a
.b bread()
without a
.b biowait().
(***HYPERLINK WHAT IS BIOWAIT() - POPUP)
.lp
Both read and write credentials for the read-ahead block are obtained.
Tracing is then accounted for (if desired) and it is recorded that the transfer
has occurred against this particular process. We then perform the transfer
via the
.i vop_strategy
operation.
__listing(/usr/src/kernel/kern/fs/bio.c,234,245)
.lp
If the read-ahead block was in the cache, we prolong its life a bit more.
If the age bit is still set and the buffer is in use, we clear
the age bit (so it won't be aged out).
However, if it is not in use, we force it to the tail of the age queue so that
it won't be reused before being referenced.
__listing(/usr/src/kernel/kern/fs/bio.c,248,252)
.lp
Finally, we wait for the result if the original operation required a
read operation.
The buffer used for the requested read operation is returned to the
supplied address of the caller, and the status of the successful
or failed operation is also returned.
.uh "What is bwrite()?"
.pp
.b bwrite()
takes a previously busy buffer obtained from
.b getblk(),
(***HYPERLINK WHAT IS GETBLK() - POPUP)
.b bread(),
(***HYPERLINK WHAT IS BREAD() - POPUP)
or
.b breada(),
(***HYPERLINK WHAT IS BREADA() - POPUP)
and forces it to be immediately written to the underlying storage
below the current filesystem. Like
.b bread(),
.b bwrite()
waits for the operation to finish, and returns the status
of the operation as a value. The buffer passed is then released
back to the free list for use by other clients of the buffer cache.
.uh "What Does the Filesystem Expect of bwrite()?"
.pp
.b bwrite()
forces the buffer to be immediately written and the invoked
operation synchronously completed before the function returns.
Delayed writes are taken into account by this function as well.
__listing(/usr/src/kernel/kern/fs/bio.c,259,260)
.lp
.b bwrite()
has one argument: the buffer to be synchronously written. It returns
the success or failure of the write operation.
Since
.b bwrite()
will block waiting for completion of the I/O operation, it can
only be called from the top half of the kernel.
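.pp
The flag handling that distinguishes a plain write from a completed delayed write can be sketched with a few illustrative bits. The flag names mirror the roles discussed in this annotation, but the structure, values, and bwrite_sketch function are assumptions for illustration only:

```c
/* Hypothetical flag bits mirroring the roles described in the text */
#define B_BUSY   0x01   /* exclusive use */
#define B_DELWRI 0x02   /* delayed write pending */
#define B_DONE   0x04   /* I/O completed */
#define B_AGE    0x08   /* candidate for early reuse */
#define B_INVAL  0x10   /* contents meaningless */

struct buf { int b_flags; int b_error; };

static int writes_done;     /* stand-in for the strategy+biowait path */

/* returns the status of the synchronous write; buffer is released */
int bwrite_sketch(struct buf *bp) {
    if (bp->b_flags & B_INVAL) {   /* nothing to write: */
        bp->b_flags &= ~B_BUSY;    /* just release the buffer */
        return 0;
    }
    if (bp->b_flags & B_DELWRI)    /* delayed write finally performed: */
        bp->b_flags = (bp->b_flags & ~B_DELWRI) | B_AGE;  /* age it */
    writes_done++;                 /* vop_strategy + biowait stand-in */
    bp->b_flags |= B_DONE;
    bp->b_flags &= ~B_BUSY;        /* brelse() stand-in */
    return bp->b_error;
}
```

Note how a buffer whose delayed write is being completed is marked aged, so that it does not dwell on the free list, exactly as the implementation section below describes.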
.uh "How is bwrite() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,264,267)
.lp
If the supplied buffer's contents are invalid, no operation is performed
and the buffer is immediately released for reuse on the free list.
__listing(/usr/src/kernel/kern/fs/bio.c,270,279)
.lp
If the buffer is not currently marked "busy", then
.b bwrite()
panics because it does not have exclusive use of the buffer. In general,
a block is obtained already associated with a vnode
for exclusive use -- thus, this is an "impossible" error.
.lp
If the buffer has already been virtually written by a delayed write
operation, the buffer is reassociated with the vnode of the file to which
it belongs and its contents are marked "aged" so that
they may be reclaimed more quickly and not dwell on the free list.
Otherwise, an output operation is accounted to the process that calls
.b bwrite().
Note that the output operation for a delayed write
was already accounted for in the process where the delayed write
occurred.
__listing(/usr/src/kernel/kern/fs/bio.c,281,282)
.lp
The block is marked as a "dirty" block that must be cleaned before
this vnode may be reused, and the number of output operations in progress
on this vnode is increased. This counter keeps the vnode from being
recycled or altered prior to all of its operations completing.
__listing(/usr/src/kernel/kern/fs/bio.c,288,292)
.lp
The output operation is engaged for this filesystem by calling its
vnode strategy operation, and the operation is waited upon until it completes.
After completion, the buffer is released back into the free buffer cache
for potential reuse by other processes, and the status of the successful
or failed operation is returned to the caller. Note that the buffer has been
returned to the buffer cache and is no longer used by the caller.
.uh "What is bdwrite()?"
.pp
Like
.b "bwrite()",
(***HYPERLINK WHAT IS BWRITE() - POPUP)
.b bdwrite()
is used to return the buffer back to the buffer
cache with the contents written back to the file. However, unlike
.b bwrite()
(which immediately writes the buffer back and waits for it),
.b bdwrite()
does a
.i lazy
write by marking the buffer to be written before reuse, releasing it to
the free list, and then returning.
.uh "What Does the Filesystem Expect of bdwrite()?"
.pp
.b bdwrite()
virtually writes the buffer into the buffer cache. The function
in effect disassociates the operation from the file to which it is attached
so that the buffer may "dwell" for a period of time until the
buffer space needs to be reused for another purpose in the buffer cache
or until the block is requested by another process to be read or written.
__listing(/usr/src/kernel/kern/fs/bio.c,306,307)
.lp
.b bdwrite()
has two arguments: the buffer to be written and the vnode with which it
is associated. It returns nothing. The side-effect of this function
is to virtually write the buffer into the buffer cache.
.lp
In order for the buffer to be written, it must be exclusive to the process
performing the write.
.b bdwrite()
occurs in the context of the writer. However,
the operation actually performing the write (when it eventually occurs) may come
from yet another process, or from no process at all (as the process may have
terminated by the time the actual I/O operation is performed).
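.pp
The essential fast path, marking the buffer dirty and releasing it without starting any I/O, can be sketched as below. The flag bits, the tape-device stand-in, and bdwrite_sketch are illustrative assumptions, not the kernel's definitions:

```c
/* Hypothetical sketch of the bdwrite() fast path: mark the buffer
 * dirty-but-unwritten and release it without doing any I/O now. */
#define B_BUSY   0x01
#define B_DELWRI 0x02   /* delayed write pending */
#define B_DONE   0x04   /* contents look valid to later cache hits */
#define B_TAPE   0x08   /* stand-in for a sequential-device check */

struct buf { int b_flags; };

static int real_writes;  /* stand-in for writes actually pushed down */

void bdwrite_sketch(struct buf *bp) {
    if (bp->b_flags & B_TAPE) {        /* tapes cannot reorder writes: */
        real_writes++;                 /* substitute a synchronous write */
        bp->b_flags &= ~(B_BUSY | B_DELWRI);
        return;
    }
    bp->b_flags |= B_DELWRI | B_DONE;  /* virtually written */
    bp->b_flags &= ~B_BUSY;            /* back to the free list */
}
```

Repeated partial writes simply set the same bits again, which is why many successive bdwrite()'s of one buffer cost at most one eventual physical write.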
.uh "How is bdwrite() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,310,320)
.lp
If the buffer is not marked for exclusive use (busy),
.b bdwrite()
panics, as it has been called on a nonexclusive buffer. If the information
in the buffer is invalid, the buffer is immediately released, since there is no
write operation to be performed on it, and it is returned to the
free list for reassignment.
.lp
If the buffer is associated with a tape device upon which a write
should be performed, a synchronous write is substituted for the delayed
write. Note that the return value of the write operation is thrown away, as
.b bdwrite()
does not expect anything but success. However, the error
is still recorded in the buffer's structure for later examination.
__listing(/usr/src/kernel/kern/fs/bio.c,322,330)
.lp
The buffer's flags are changed to indicate that a write operation
is pending but has not yet been done for this buffer. If this is the first
delayed write that has occurred, the process is accounted a single
physical write (that may occur later) and the buffer is marked as a delayed
write with the dirty bit. The buffer is then assigned to the vnode
associated with this operation. Note that multiple delayed writes may
occur, as partial writes to the buffer associated with the vnode may
successively fill the contents of this buffer with many
.b bdwrite()'s.
The buffer is then released back to the free list where it may later be found
associated with this vnode.
__listing(/usr/src/kernel/kern/fs/bio.c,332,341)
.lp
An optional portion of this
.b bio.c
implementation incorporates a mechanism to allow only a single
outstanding delayed write buffer per vnode.
(This is an example of how to deal with the so-called
core file problem discussed earlier -- see
.b "Delayed Write and Read-ahead Operations"
(**HYPERLINK TO SECTION).)
Delayed writes are controlled and kept from overrunning the contents
of the buffer cache by delaying the top-level application program
when it attempts to write on more than one buffer. (This is not normally
done because this mechanism does not discriminate in favor of the
default temporary file characteristics implemented in this
file with delayed writes.)
.lp
The buffer is checked to see if it is the only block
present on the chain of dirty blocks attached to the vnode. If there
is more than one dirty block present, a
.b bawrite()
(with an embedded sleep operation) is forced by calling the
vnode
.b vflushbuf()
operation (see
.b fs/vnode.c)
for all but the current outstanding delayed write buffer. Note that in
the case where a preponderance of temporary file operations occurs on the
system, this operation will actually slow the system down by reducing the
number of write-behinds (delayed writes). However, it will speed up a system
that had previously been dominated by cache flooding. It is the lack
of information for deciding whether or not to issue this
operation that makes this portion of the implementation conditional.
.uh "386BSD bdwrite() Design Choices and Trade-Offs"
.pp
.b bdwrite()
presumes that the buffer is going to be reused and so procrastinates.
This method works well when a buffer is
fractionally written in succession, or when the buffer cache can hold
the entire contents of a temporary file that gets reused and removed
prior to being rewritten. In this second case, the I/O can be entirely avoided,
as the file ceases to exist before the buffers ever age out.
This is the principle upon which log-based filesystems
(like LFS) rely. Sequential devices
like tapes cannot postpone or reorder writes, so delayed writes to such devices
are turned into immediate writes.
.uh "What is bawrite()?"
.pp
Like
.b bwrite(),
(***HYPERLINK WHAT IS BWRITE() - POPUP)
.b bawrite()
takes a previously busy buffer obtained from
.b getblk(),
(***HYPERLINK WHAT IS GETBLK() - POPUP)
.b bread(),
(***HYPERLINK WHAT IS BREAD() - POPUP)
or
.b breada(),
(***HYPERLINK WHAT IS BREADA() - POPUP)
and forces it to be immediately written to the underlying storage
below the current filesystem. However, unlike
.b bwrite(),
it does not wait for status.
.uh "What Does the Filesystem Expect of bawrite()?"
.pp
.b bawrite()
starts the write operation on the buffer and then
immediately returns. No wait for the status is done. Instead, the
buffer is marked to be released when the write operation completes.
In this case, no status information can be conveyed back to the
writer, because the operation is complete from the writer's perspective
before it can produce a status.
Only the next reference to the buffer can discover its status
following a
.b bawrite()
(or, analogously, the read-ahead blocks scheduled by
.b breada()).
__listing(/usr/src/kernel/kern/fs/bio.c,349,350)
.lp
.b bawrite()
has one argument: the buffer being written. It returns nothing.
.pp
Note that by assigning a procedure,
.b b_iodone(*)(),
to be called on conclusion of the I/O
operation and setting the call bit (B_CALL), a
call from interrupt level may be made to a function to indicate that an
operation has completed, and the status may then be checked on an
asynchronous write. Note also that buffers remain locked from view
during any I/O operations, so that even subsequent read operations
will be delayed until the asynchronous write completes.
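.pp
The B_CALL completion hook can be sketched as follows. The flag values, the stub names, and the counters are assumptions made for illustration; only the shape of the dispatch (callback if B_CALL, self-release if asynchronous) reflects the mechanism described above:

```c
#include <stddef.h>

/* Hypothetical sketch of the B_CALL hook: when the (simulated)
 * interrupt-level biodone() fires, it either invokes the registered
 * b_iodone procedure or releases an asynchronous buffer itself. */
#define B_BUSY  0x01
#define B_ASYNC 0x02
#define B_DONE  0x04
#define B_CALL  0x08

struct buf {
    int   b_flags;
    void (*b_iodone)(struct buf *);   /* completion procedure */
};

static int releases, callbacks;

static void brelse_stub(struct buf *bp) {
    bp->b_flags &= ~B_BUSY;           /* nobody will biowait(): free it */
    releases++;
}

static void note_done(struct buf *bp) {
    (void)bp;                         /* sample completion procedure */
    callbacks++;
}

void biodone_sketch(struct buf *bp) {
    bp->b_flags |= B_DONE;
    if (bp->b_flags & B_CALL) {
        bp->b_flags &= ~B_CALL;       /* one-shot notification */
        bp->b_iodone(bp);
    } else if (bp->b_flags & B_ASYNC) {
        brelse_stub(bp);
    }
}
```
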
.uh "How is bawrite() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,353,358)
.lp
The buffer is checked to see that it is exclusively accessed by checking
for the busy (B_BUSY) bit; otherwise, the function panics, as the buffer must
be obtained for exclusive access before this function is called.
If the buffer being written has the invalid flag set, it is immediately
released without any I/O performed, as it has no valid contents.
__listing(/usr/src/kernel/kern/fs/bio.c,361,371)
.lp
If the buffer was previously a delayed write buffer, it is reassigned to the
vnode and aged, as the delayed write is finally converted into an
asynchronous write. In this case, since the write
was previously accounted, it is not accounted again upon the delayed write
conversion; otherwise, it is accounted for an initial
.b bawrite().
.lp
The buffer's flags are changed to those appropriate for a buffer
about to be written which has had no I/O done to it yet and
no errors found. Since the buffer is being written, the dirty
bit and the asynchronous flag are set for this buffer, and the number
of write transactions to the vnode is increased.
__listing(/usr/src/kernel/kern/fs/bio.c,372,377)
.lp
If the process is tracing
.b bio.c
operations, a trace record is generated
to indicate that either an asynchronous write or a delayed write
converted to an asynchronous write is occurring for this vnode's block.
__listing(/usr/src/kernel/kern/fs/bio.c,378,379)
.lp
The write operation is requested via a vnode operation on the filesystem
associated with this buffer's vnode, and the function is complete. The
caller can either
.b biowait()
(***HYPERLINK WHAT IS BIOWAIT() - POPUP)
(in the case that many asynchronous operations
are started before we wait for any of them) or reacquire the
block later (when necessary) to determine the status of the operation.
.uh "What is brelse()?"
.pp
.b "brelse()"
relinquishes exclusive access to a buffer and returns it
to the global free buffer pool for potential use by other clients
of the buffer cache.
.uh "What Does the Filesystem Expect From brelse()?"
.pp
.b brelse()
passes exclusive use of the buffer back to the kernel's
free pool. Anyone waiting for this particular buffer, or for a free
buffer in general, is then notified.
__listing(/usr/src/kernel/kern/fs/bio.c,386,387)
.lp
.b brelse()
has one argument: the buffer to be released. It returns nothing.
The side-effect of this function is that the buffer is placed in the free pool.
.pp
This function is implemented as an artifact of
how exclusive access is maintained in the buffer cache.
If any other clients of the buffer cache either
desire this buffer or are waiting for any buffer for reuse, they
will be awakened and allowed to compete for access to the desired
resource (since there may be many waiting for exclusive access to
a single buffer). Buffers are directed to free lists that determine
the policy for reclamation and reuse with a different vnode/offset.
.uh "How is brelse() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,391,398)
.lp
If kernel tracing is present, a trace record is written that indicates
whether the buffer is being invalidated, modified, or simply released
for reuse. Note that as currently implemented the tracing facility can
only write records from the top level -- thus, if
.b brelse()
is called from interrupt level
(as occurs from
.b biodone()
at the conclusion of an asynchronous operation), it will not be recorded,
because of the block that may result. This is currently finessed in
.b ktrbio()
(see
.b opt/ktrace/ktrace.c)
as a special case, as this trace package may later be changed to remove
the current restriction.
__listing(/usr/src/kernel/kern/fs/bio.c,400,416)
.lp
To avoid contention for the busy bit, all block I/O devices are masked
from interrupts (primarily because
.b biodone()
invokes
.b brelse()).
The age queue is examined to see if its associated wanted bit is set
(indicating that a request for any buffer is pending). If it is,
the bit is cleared and a wakeup issued for those processes still waiting
for any buffer to become available. Note that only the first of the possibly
many processes waiting for a free buffer will get the newly freed buffer,
with the others returned to the queue to wait for another buffer to become
available. Thus, there is no guarantee of when a buffer will
be assigned to a particular process.
.lp
Similarly, if the wanted bit is set in this particular buffer header's flags,
a wakeup is issued for the buffer header and the bit cleared. If the
buffer has an error, invalid contents, or has been specifically disallowed
from being cached by the filesystem, its contents are rendered meaningless
and it is disassociated from any vnode to which it might be attached.
__listing(/usr/src/kernel/kern/fs/bio.c,418,437)
.lp
The buffer flags indicating cache status and aging are erased prior to
enqueuing the buffer on the appropriate one of the four free lists. If the
buffer corresponds only to a header and has no contents whatsoever, it
is passed back to the empty queue for use when assigning new storage.
Otherwise, if the buffer contains no valid contents, it is inserted
at the beginning of the age queue, which indicates that it will be the
first to be reused when a new buffer needs to be allocated.
.lp
If the buffer has aged but not been found in the cache, it is installed
at the tail of the age queue, where it will be reused before the rest of
the contents of the buffer cache, but after any invalid or empty buffers that
could be constituted into new buffers are used. Aged buffers are
ones that have either not demonstrated their reason for holding a place
in the cache or have lost their reason for being cached, as determined
by the filesystem's use of the buffer. All other buffers are placed
on the LRU queue awaiting reuse.
.lp
In general, as buffers are found and released back to their free lists,
they tend to migrate towards the head of the particular queue
they reside upon; thus, the less recently referenced entries move towards
the front of the queue. No indication is kept
of how frequently a buffer is reused, only of the last time it was
used. Thus, we use an LRU algorithm.
.lp
The busy bit is then removed from the buffer, allowing it to be claimed
by another requester, and the interrupt-level mask is removed.
1339.uh "What is getnewbuf()?"
1340.pp
1341.b getnewbuf()
1342constitutes a new buffer by acquiring a buffer header and
1343contents.
1344.uh "What Does the Buffer Cache Expect From getnewbuf()?"
1345.pp
1346.b getnewbuf()
obtains a buffer whose contents are of the requested size
.i sz .
These items either come from an
1350undedicated resource (empty buffer headers and underutilized buffer
1351contents space) or by recycling a free buffer and adjusting it
1352to suit the desired size. If the recycled buffer is a delayed write,
1353the write is forced before allowing reuse.
1354.pp
1355If a new buffer is still not available for use,
1356the condition is signalled and the
1357process sleeps until a free buffer is available. At this point, the
1358function returns a null buffer pointer to signal that a sleep has
occurred. If conditions have changed and the desired buffer has been
allocated to yet another racing process, any earlier uses of
1361.b incore()
1362(***HYPERLINK WHAT IS INCORE - POPUP)
1363may be stale.
1364__listing(/usr/src/kernel/kern/fs/bio.c,450,451)
1365.lp
1366.b getnewbuf()
1367has one argument: the size of the contents allocated with the buffer
1368header. It returns a pointer to the allocated buffer header or zero
1369if no buffer header could be allocated.
1370.pp
1371.b getnewbuf()
1372returns a buffer that is entirely unassociated
1373from the buffer cache. It is up to the caller to either associate the
1374buffer with a vnode/offset so that it can be found or make
1375temporary use of the buffer and return it to a free list. In its
1376unassociated state, the buffer plays no role in the global view of buffer state.
1377.uh "How is getnewbuf() Implemented on the 386?"
1378__listing(/usr/src/kernel/kern/fs/bio.c,456,474)
1379.lp
1380Buffers are first constituted from the empty list of buffer headers, as long
1381as space is allocatable in the amount of memory reserved for buffer space
1382and an empty header is available.
1383In general, memory not used for cache contents is always returned to the
1384primary memory allocator
1385.b malloc()
1386(***HYPERLINK WHAT IS MALLOC IN MALLOC.C - POPUP)
1387(see
1388.b kern/malloc())
1389(***HYPERLINK MALLOC.C TOC - MACRO)
1390for generic use in the kernel. This is the sole purpose of
1391the empty buffer free list.
1392.lp
1393Because buffers may have their status changed from exclusive to nonexclusive
1394at interrupt level (see
1395.b biodone()),
1396(***HYPERLINK WHAT IS BIODONE - POPUP)
1397any block device interrupts are masked
1398to prevent contention for buffers at interrupt level (as
1399.b brelse()
1400(***HYPERLINK WHAT IS BRELSE - POPUP)
1401may change the contents of the free lists).
1402.lp
1403If a new buffer can be constituted, memory is allocated from the memory
1404allocator and assigned to a formerly empty buffer header detached
1405from the empty free list of buffers and marked for exclusive use with
1406invalid contents (since it has yet to be filled). Two global variables,
1407.i allocbufspace
1408and
1409.i freebufspace
1410record the amount of buffer space present in the buffer cache
1411and the amount remaining for use by new buffers, respectively. Remaining
1412portions of the buffer header are filled prior to passing it back
1413to the caller.
1414__listing(/usr/src/kernel/kern/fs/bio.c,476,489)
1415.lp
1416If all buffer space is in use or no empty buffer headers are available,
1417the other free lists are checked to find a buffer for reuse. The age
1418queue is consulted before the LRU queue. If either queue has no entry
1419it will point back to its free queue pointer (indicating empty status);
1420otherwise, the first buffer at the head of the queue will be chosen
1421for use and removed from that free list.
1422.lp
1423If none of the buffer free queues has a buffer that can be used (all
1424buffers are busy), the age queue's buffer queue header is marked with
1425a wanted bit to indicate that any buffer header is requested. The
1426process then sleeps waiting for
1427.b brelse()
1428to find it a new buffer.  Upon being awakened,
1429it will return to its caller to indicate that it
1430could not find a buffer. In this case, the caller must examine the buffer
1431cache again to see if someone else had allocated a buffer already for the
1432vnode and block offset for which the caller was requesting a buffer. This
1433procedure is followed rather than immediately branching back to
1434the beginning of the search for a buffer because many of the assumptions
1435of the caller may have changed during the sleep.
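.lp
That retry discipline can be sketched as follows.
.b incore_stub()
and
.b getnewbuf_stub()
are hypothetical stand-ins modeling a
.b getnewbuf()
that sleeps (returns a null pointer) once before succeeding:

```c
#include <assert.h>
#include <stddef.h>

struct buf { int b_lblkno; };

static struct buf the_buf;
static int getnewbuf_calls;

/* Hypothetical stand-ins: the cache never holds the block here, and
 * getnewbuf() "sleeps" (returns NULL) once before succeeding -- the
 * situation described in the text. */
static struct buf *
incore_stub(int lblkno)
{
        (void)lblkno;
        return (NULL);
}

static struct buf *
getnewbuf_stub(int sz)
{
        (void)sz;
        return (++getnewbuf_calls == 1 ? NULL : &the_buf);
}

/* The caller's retry discipline: a NULL return from getnewbuf() means
 * a sleep occurred, so the cache is searched again from the top rather
 * than trusting any earlier incore() result. */
struct buf *
getblk_retry(int lblkno, int sz)
{
        struct buf *bp;

        for (;;) {
                if ((bp = incore_stub(lblkno)) != NULL)
                        return (bp);    /* raced: someone else made it */
                if ((bp = getnewbuf_stub(sz)) != NULL)
                        return (bp);    /* fresh buffer obtained */
                /* slept inside getnewbuf(): restart the search */
        }
}
```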
1436__listing(/usr/src/kernel/kern/fs/bio.c,492,496)
1437.lp
1438If the buffer selected for reuse is one that has a delayed write
1439outstanding on it, the buffer is acquired for exclusive access by
1440asserting its busy bit and transformed into an asynchronous write by calling
1441.b bawrite().
1442(***HYPERLINK WHAT IS BAWRITE() - POPUP)
1443This forces a stale delayed write out upon reuse.
1444.lp
After starting the write operation, the function restarts at its
beginning to search for another buffer or, if all buffers are busy, to
suspend waiting for a free buffer until the asynchronous write completes.
1448Now freed of its contents (written back to disk),
1449the cleaned buffer is inserted onto the free list where it
1450may be allocated by this routine.
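.lp
The delayed-write case can be sketched minimally as follows; the flag
values are illustrative and a stub stands in for
.b bawrite() :

```c
#include <assert.h>

/* Illustrative flag values, not the kernel's actual bit assignments. */
#define B_BUSY   0x01
#define B_ASYNC  0x02
#define B_DELWRI 0x04           /* delayed write pending */

struct buf { int b_flags; };

static int writes_started;

/* Stand-in for bawrite(): mark the write asynchronous, clear the
 * delayed-write indication, and "start" the I/O. */
static void
bawrite_stub(struct buf *bp)
{
        bp->b_flags |= B_ASYNC;
        bp->b_flags &= ~B_DELWRI;
        writes_started++;
}

/* If the candidate for reuse holds a delayed write, claim it with the
 * busy bit and force the stale data out; the caller then restarts its
 * search.  Returns 1 if a write was forced, 0 if the buffer was clean. */
int
flush_if_delwri(struct buf *bp)
{
        if (bp->b_flags & B_DELWRI) {
                bp->b_flags |= B_BUSY;
                bawrite_stub(bp);
                return (1);
        }
        return (0);
}
```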
1451__listing(/usr/src/kernel/kern/fs/bio.c,498,502)
1452.lp
1453If kernel tracing is enabled for this process, a recycle record is
1454written to indicate that the buffer has been assigned to the new vnode
1455from an old contents.
1456__listing(/usr/src/kernel/kern/fs/bio.c,504,528)
1457.lp
1458The buffer is prepared for reuse by disassociating it from any vnode
1459it may have been associated with before and removing any credentials
that had been attached by prior operations. Its contents are marked
invalid and busy, and it is removed from the hash lookup queues so that
it will not be found by
.b incore()
since it does not yet have a sensible name and address (vnode and block
number) associated with it.
1466.lp
The interrupt mask is removed, and all remaining operations are
performed without interrupts masked. The buffer is associated with
1469no device and no vnode or logical block, and other fields in the buffer
1470header are similarly default initialized to avoid unpredictable behavior
1471by functions calling this routine which do not fill them out completely.
The function
.b allocbuf()
then adjusts the buffer so that the size of its contents
.i b_bufsize
matches the requested size
.i sz .
1477The buffer pointer is then returned as a successfully allocated
1478buffer header and contents.
1479.uh "What is getblk()?"
1480.pp
1481.b getblk()
1482returns a buffer of the appropriate vnode, offset, and size
1483to the caller.
1484.uh "What Does the Buffer Cache Expect From getblk()?"
1485.pp
1486.b getblk()
1487locates the buffer with valid contents in the buffer cache; otherwise,
1488the function creates a buffer with empty (invalid) contents.
1489__listing(/usr/src/kernel/kern/fs/bio.c,539,540)
1490.lp
1491.b getblk()
1492has three arguments: the vnode and logical block number of the contents
1493sought; and the size of the buffer to be allocated. It returns a pointer
1494to the buffer associated with this vnode and contents.
1495.pp
1496.b getblk()
1497may block awaiting allocation of a buffer or waiting for
1498a buffer that is in use by another process. The function
1499does not read the buffer's contents from the underlying storage;
1500.b bread()
1501(***HYPERLINK WHAT IS BREAD- POPUP)
1502or
1503.b breada()
1504(***HYPERLINK WHAT IS BREADA- POPUP)
are used for that purpose;
.b getblk()
only accesses the contents already in the cache.
1508.uh "How is getblk() Implemented on the 386?"
1509__listing(/usr/src/kernel/kern/fs/bio.c,545,548)
1510.lp
1511The function endlessly loops searching for a buffer in the cache. It
1512uses the inline function
1513.b incore()
1514(***HYPERLINK WHAT IS INCORE - POPUP)
1515in its search, using a hash table for
1516the requested buffer in the buffer cache associated with a vnode and logical
1517block number. If the block is found, the processor's interrupts are masked
1518to prevent a race with the buffer's mutual exclusion bit (B_BUSY) which
1519is examined to determine if the buffer is in current use by another process.
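.lp
A user-level sketch of such a hash lookup follows. The hash function,
table size, and structure layout are illustrative, not those of the
kernel:

```c
#include <assert.h>
#include <stddef.h>

#define NHASH 64

struct buf {
        struct buf *b_next;     /* hash chain link */
        const void *b_vp;       /* owning vnode (opaque here) */
        long        b_lblkno;   /* logical block number */
};

static struct buf *bufhash[NHASH];

/* A simple vnode/block hash -- illustrative only. */
static unsigned
bhash(const void *vp, long lblkno)
{
        return (unsigned)(((size_t)vp >> 4) + (size_t)lblkno) % NHASH;
}

/* incore()-style lookup: walk one hash chain for a cached buffer that
 * matches this vnode and logical block; NULL means not cached. */
struct buf *
incore_lookup(const void *vp, long lblkno)
{
        struct buf *bp;

        for (bp = bufhash[bhash(vp, lblkno)]; bp != NULL; bp = bp->b_next)
                if (bp->b_vp == vp && bp->b_lblkno == lblkno)
                        return (bp);
        return (NULL);
}

/* Insert a buffer at the front of its hash chain, making it findable. */
void
hash_insert(struct buf *bp)
{
        unsigned h = bhash(bp->b_vp, bp->b_lblkno);

        bp->b_next = bufhash[h];
        bufhash[h] = bp;
}
```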
1520__listing(/usr/src/kernel/kern/fs/bio.c,549,553)
1521.lp
1522If kernel tracing is enabled, a trace record is written to indicate that
1523a busy buffer has been found for this
1524.b getblk()
1525operation.
1526__listing(/usr/src/kernel/kern/fs/bio.c,554,559)
1527.lp
1528The age bit associated with the buffer (if present) is cleared to ensure
1529that the buffer will not be released onto the age queue because we already
1530know it will be reused. The wanted bit is then set for this buffer header
1531so our process will be awakened when the buffer becomes available.
1532We then sleep waiting for the buffer to become available.
1533.lp
1534Upon conclusion of the sleep, the interrupt mask is removed and the
1535function is reentered by again checking the
1536.b incore()
1537status of this buffer (since it may have
1538actually been invalidated and thus is no longer present in the cache).
1539__listing(/usr/src/kernel/kern/fs/bio.c,560,564)
1540.lp
1541If kernel tracing is enabled, a reference trace record is sent to indicate
1542that the buffer has been rereferenced without it being busy.
1543__listing(/usr/src/kernel/kern/fs/bio.c,565,570)
1544.lp
1545The buffer's age bit is removed to indicate that this buffer has
1546proven its worth, and the buffer is then removed from the free list. Its
1547size is adjusted according to the size required by this call to
1548.b getblk().
1549Note that this may alter the contents of the buffer. All the
1550cases dealing with a found buffer have now been dealt with.
1551__listing(/usr/src/kernel/kern/fs/bio.c,571,581)
1552.lp
1553Since no extant buffer has been found,
1554a new buffer header and contents are obtained by calling
1555.b getnewbuf().
1556(***HYPERLINK WHAT IS GETNEWBUF() - POPUP)
1557If no buffer is available,
1558.b getnewbuf()
1559blocks and then returns zero.
1560Since a buffer may have been created by another process at the same location
1561during the pause, this function will restart itself from the beginning.
1562If no buffer is found, it again calls
1563.b getnewbuf()
1564and the procedure repeats.
1565.lp
1566When a buffer is available, it is assigned to the logical block
1567associated with this vnode. Because the physical block associated with
1568the logical block has not yet been determined, the physical and logical
1569block fields are identical, signalling that translation needs to be done.
1570All block devices have their interrupts masked and the buffer is
1571inserted into the hash table and marked for exclusive use by setting its
1572buffer flags busy.
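.lp
The equal-block-fields convention can be sketched as follows, with
illustrative field and flag names matching the text:

```c
#include <assert.h>

#define B_BUSY 0x01             /* illustrative flag value */

struct buf {
        long b_lblkno;          /* logical (file-relative) block */
        long b_blkno;           /* physical (device) block */
        int  b_flags;
};

/* When getblk() constitutes a fresh buffer, both block fields are set
 * to the logical block; equality of the two signals that the
 * filesystem has not yet translated logical to physical. */
void
assign_block(struct buf *bp, long lblkno)
{
        bp->b_lblkno = lblkno;
        bp->b_blkno = lblkno;   /* translation still needed */
        bp->b_flags |= B_BUSY;  /* exclusive use */
}

int
needs_translation(const struct buf *bp)
{
        return (bp->b_blkno == bp->b_lblkno);
}
```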
1573.lp
1574In either case of finding the block or not finding the block and having
1575to create one, we now have a buffer in the buffer cache associated with
1576the vnode and its offset. The interrupts are unmasked and
1577the buffer returned, terminating the loop.
1578.uh "What is geteblk()?"
1579.pp
1580.b geteblk()
1581allocates an empty buffer from the buffer cache that will
1582not be associated with a vnode. It is used to allocate
1583buffers for special purposes (usually those not associated with the
1584filesystem).
1585.uh "What Does the Buffer Cache Expect From geteblk()?"
1586.lp
1587.b geteblk()
1588isolates a buffer from the buffer cache and returns after it is allocated.
1589__listing(/usr/src/kernel/kern/fs/bio.c,591,592)
1590.lp
1591.b geteblk()
1592has one argument: the size of the buffer requested. It returns a buffer
1593when allocated.
1594.pp
1595In earlier Unix systems that had no general purpose
1596memory allocator, buffers were used in their stead
1597to gain control of a portion of memory for temporary use.
1598In the current release of 386BSD,
1599.b geteblk()
1600is present here only for compatibility with
1601older systems and to allow the
1602.b disklabel
1603feature of the UFS filesystem to function.
1604.uh "How is geteblk() Implemented on the 386?"
1605__listing(/usr/src/kernel/kern/fs/bio.c,597,603)
1606.lp
1607.b getnewbuf()
is used to obtain a free buffer of the requested size; if one is not
available, it is called repeatedly until one becomes available. The
buffer is inserted onto a free list on the off-chance that the busy bit
might be cleared before the buffer is properly released; without this,
the assumption that the buffer is already on a free list would be
invalid and the buffer header could be corrupted before being passed
back. Since the contents of the queues and
1614hash tables may be altered by interrupt level calls to
1615.b biodone() (which
1616(***HYPERLINK WHAT IS BIODONE - POPUP)
1617calls
1618.b brelse()),
1619(***HYPERLINK WHAT IS BRELSE - POPUP)
1620the interrupts are masked on manipulation of the hash table.
1621The buffer is then returned to the caller.
1622.uh "What is allocbuf()?"
1623.pp
1624.b allocbuf()
1625adjusts the buffer size to the desired size while preserving
1626buffer contents. It is used by both the buffer cache and the filesystem
1627clients to revise the size characteristics of the buffer after it is
1628present in the buffer cache if it has expanded or contracted in size.
1629.uh "What Does the Buffer Cache Expect From allocbuf()?"
1630.pp
1631.b allocbuf()
1632allocates a region of memory of the appropriate
1633size and exchanges it for the one the buffer had before, freeing the
1634old contents. Before freeing the old contents, it copies as much of the
1635buffer as will fit.
1636__listing(/usr/src/kernel/kern/fs/bio.c,616,617)
1637.lp
1638.b allocbuf()
has two arguments: the pointer to the buffer header and the size of the data
1640contents that the buffer should contain. It returns nothing. The side-effect
1641of this function is that it adjusts the size of the buffer contents.
1642.uh "How is allocbuf() Implemented on the 386?"
1643__listing(/usr/src/kernel/kern/fs/bio.c,622,636)
1644.lp
1645.b allocbuf()
1646first calls
1647.b malloc()
1648(***HYPERLINK WHAT IS MALLOC() IN MALLOC.C - POPUP)
1649to obtain a new portion of memory of the appropriate
1650size and copies the old contents into place in the new buffer space.
1651The function takes care to only copy as much of the old buffer contents
1652that will fit or is present in the old buffer. It then frees the contents
1653of the old buffer.
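.lp
This exchange can be sketched at user level, with libc
.b malloc()
and
.b free()
standing in for the kernel allocator:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct buf {
        char   *b_data;
        size_t  b_bufsize;
};

/* allocbuf()-style resize: obtain new storage, copy only as much of
 * the old contents as both sizes permit, then free the old storage.
 * Returns 0 on success, -1 if no memory is available (the kernel
 * version instead blocks inside the allocator). */
int
allocbuf_sketch(struct buf *bp, size_t sz)
{
        size_t keep;
        char *p;

        if ((p = malloc(sz)) == NULL)
                return (-1);
        keep = bp->b_bufsize < sz ? bp->b_bufsize : sz;
        memcpy(p, bp->b_data, keep);    /* preserve what fits */
        free(bp->b_data);
        bp->b_data = p;
        bp->b_bufsize = sz;
        return (0);
}
```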
1654.lp
1655Note that
1656.b allocbuf()
1657may block on waiting for
1658.b malloc()
1659to provide it with
1660the new buffer contents. This differs from the companion routine in
1661.b getnewbuf(),
1662which cannot block since
1663.b getnewbuf()
1664may need to constitute a buffer from a
pagein operation when memory is resource-restricted and the system
cannot risk deadlock by blocking.
1667.lp
1668After obtaining the new buffer resource,
1669.b allocbuf()
1670adjusts for the change
1671in buffer size against free buffer space and allocated buffer space. Note
1672that intentionally no check is made as to overage of allocated buffer space
1673(in fact,
1674.i freebufspace
1675can actually be negative), since its effect will only
1676be felt for a short time period. As such, this is a soft limit check
1677and not a hard limit check.
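.lp
The accounting adjustment can be sketched as follows; the initial free
figure is arbitrary for illustration, and note that the free figure may
legitimately go negative, since the limit is soft:

```c
#include <assert.h>

/* Global accounting in the spirit of freebufspace/allocbufspace.  The
 * limit is soft: it is consulted only when new buffers are
 * constituted, not on every resize, so freespace may transiently dip
 * below zero. */
static long freespace = 4096;   /* arbitrary illustrative figure */
static long allocspace = 0;

void
adjust_bufspace(long oldsize, long newsize)
{
        freespace  += oldsize - newsize;        /* may go negative */
        allocspace += newsize - oldsize;
}
```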
1678.lp
1679The buffer header is now updated to hold its new contents and size. Two fields
1680in the buffer header,
1681.i b_bufsize
1682and
1683.i b_bcount,
1684hold the allocated size of the buffer
1685and the currently used portion of buffer size, respectively.
1686Filesystems may alter the latter but never the former.
1687Note that in the case of expanding the buffer, there is no guarantee
1688as to the contents of the expanded region of the buffer. It is left up
1689to the caller to clear or otherwise initialize buffer contents.
1690.uh "What is biowait()?"
1691.pp
1692.b biowait()
1693waits for the buffer to finish an outstanding I/O operation and return
1694the status of the operation.
1695.uh "What Does the Buffer Cache Expect From biowait()?"
1696.pp
1697.b biowait()
1698suspends waiting for an I/O operation to complete on
1699the supplied buffer. After the buffer I/O operation is finished and
1700.b biodone()
1701(***HYPERLINK WHAT IS BIODONE() - POPUP)
1702is called,
1703.b biowait()
1704will obtain the error or success and return it to the caller.
1705__listing(/usr/src/kernel/kern/fs/bio.c,645,646)
1706.lp
1707.b biowait()
1708has one argument: the pointer to the buffer header to be waited for.
1709It returns the success value (0) or error code of the I/O
1710operation. If no error code is present, yet an error indication was
1711present on the buffer, the generic I/O error (EIO) will be returned.
1712.uh "How is biowait() Implemented on the 386?"
1713__listing(/usr/src/kernel/kern/fs/bio.c,650,657)
1714.lp
1715.b biowait()
1716masks interrupts for its operation as it examines
1717bits in the flags set by interrupt service routines from block devices
1718and manipulates contents of the hash tables and state of the buffer.
1719The function checks for the presence of the done bit to indicate
1720that the operation has completed. If the done bit (set by
1721.b biodone())
1722is not present, it sleeps on the buffer header's address waiting
1723for I/O to complete. Note that the sleep is uninterruptible because there
1724is no way with the current interface to block devices and inform the
1725device driver to abort the I/O and return.
1726.lp
1727The significance of this uninterruptible wait is that if an I/O operation
1728is started on a file or block device and that device either loses the
1729interrupt or forgets to signal a
1730.b biodone(),
1731it is up to the device driver
to time out and abort the transfer so that the block may be cleared --
otherwise, the process would wait forever, as happens when a floppy disk
is removed while a file is being read from it.
1735.lp
1736Some devices become ready immediately upon doing the I/O operation,
1737so that by the time
1738.b biowait()
1739is reached the done bit is already set. In this case, no sleep operation
1740need be performed and this phase is bypassed completely.
1741.lp
If the process that has called
1743.b biowait()
1744is being traced,
1745a trace record is written indicating that a wait has completed for a block.
1746__listing(/usr/src/kernel/kern/fs/bio.c,658,673)
1747.lp
1748The buffer header's flags are checked to see if an error is present
1749or if the error field in the buffer header holds an error code. If
1750so, the buffer header is evaluated to return an error code and make
1751its contents consistent. If the contents of the buffer have not yet
1752been marked as invalid, the buffer is marked as invalid and
1753allowed to persist on the age queue where it may be reclaimed after
1754a short dwell time.
1755.lp
If the error indication is present but no code is recorded in the
.i b_error
field, a generic I/O error (EIO) is supplied in its place. Conversely,
the B_ERROR bit is forced on in the buffer flags, so that whether the
calling code set the
.i b_error
field or the B_ERROR flag, the error will be properly signalled. In
either case of returning an
1763error or success, the interrupt mask is removed and the value of
1764success or error returned to the caller.
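.lp
The return-value rule can be sketched as follows; EIO_STUB is an
illustrative stand-in for the errno value EIO, and the flag value is
likewise illustrative:

```c
#include <assert.h>

#define B_ERROR  0x01           /* illustrative flag value */
#define EIO_STUB 5              /* stand-in for errno's EIO */

struct buf {
        int b_flags;
        int b_error;
};

/* A recorded error code is returned as-is; a bare B_ERROR flag with no
 * code yields the generic EIO; flag and code are made consistent with
 * one another.  With neither indication, success (0) is returned. */
int
biowait_status(struct buf *bp)
{
        if ((bp->b_flags & B_ERROR) || bp->b_error != 0) {
                if (bp->b_error == 0)
                        bp->b_error = EIO_STUB; /* no code: generic error */
                bp->b_flags |= B_ERROR;         /* keep both in step */
                return (bp->b_error);
        }
        return (0);
}
```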
1765.uh "What is biodone()?"
1766.pp
1767.b biodone()
1768signals the completion of an I/O operation on a buffer,
1769so that the exclusive user of the buffer can resume use of the
1770buffer after its operation has completed.
1771.uh "What Does the Buffer Cache Expect From biodone()?"
1772.pp
1773.b biodone()
1774is called by both interrupt level and top level code.
1775It processes completion of I/O from the perspective of the buffer
1776cache, checking to see if an optional function should be called to
1777do additional processing prior to either releasing a buffer that
1778has had asynchronous I/O performed on it or informing the kernel that the
1779I/O operation has been completed on the buffer.
1780.lp
1781If the operation is not being waited for (ASYNC), the buffer is
1782released onto the free lists. If subsystems (such as swap) would like
1783to be made aware
1784of I/O completion (asynchronous paging wants to reclaim resources
1785immediately), a function call is optionally processed.
1786.lp
1787If a dirty buffer has been cleaned, the vnode layer is informed so
1788as to move the buffer from a dirty list to a clean list. This
1789ensures that the file contents of the buffer cache are forced to disk upon
1790file synchronization (fsync) operations.
1791__listing(/usr/src/kernel/kern/fs/bio.c,683,684)
1792.lp
1793.b biodone()
1794has one argument: the pointer to the buffer header. It returns nothing.
1795The side effect of this function is to inform the buffer cache that
1796an I/O operation has been completed on the buffer header.
1797.pp
1798One possible alternative to this mechanism would be to have
1799.b biowait()
1800(***HYPERLINK WHAT IS BIOWAIT() - POPUP)
1801remove B_ASYNC and let all
1802.b bwrite()'s
1803(***HYPERLINK WHAT IS BWRITE() - POPUP)
1804become
1805.b bawrite()s.
1806(***HYPERLINK WHAT IS BAWRITE() - POPUP)
1807.b biowait()
1808would then reacquire them to recover the status information.
1809.uh "How is biodone() Implemented on the 386?"
1810__listing(/usr/src/kernel/kern/fs/bio.c,688,702)
1811.lp
1812Since
1813.b biodone()
1814is an interrupt level function that may affect the status
1815of the buffer field, all other block interrupts are masked during
operation. A flag (B_CALL) is examined to see if an optional I/O
completion function, recorded in the buffer header field
.i b_iodone ,
should be called.
1818If it is, the function will be called with the buffer header pointer provided
1819and the flag bit cleared to indicate that the call had been performed.
1820.lp
1821If the buffer corresponds to a write of a dirty buffer, the dirty bit
1822is removed from the buffer and the vnode layer is informed that the buffer
1823has been cleaned by a write operation. If the buffer was an asynchronous
1824operation, it is released by calling
1825.b brelse()
1826(***HYPERLINK WHAT IS BRELSE() - POPUP)
since there is no top level routine waiting for it. All of the
interrupt masking in the top level routines exists because this
call can arrive from interrupt level.
1830.lp
1831The flags are adjusted to eliminate the indication of asynchronous I/O
1832since the operation was performed (if present) and the flags are updated
1833to indicate the I/O operation was performed successfully. Any processes
1834sleeping for the buffer are awakened, the masked interrupts are unmasked,
1835and the function completes. Note that the interrupt masks are frequently
redundant, since interrupt routines mask the block devices anyway.
1837The reason that this must be done for top level calls to
1838.b biodone()
1839is that they do not mask interrupts.
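.lp
The completion logic can be sketched at user level as follows.
Stubs stand in for
.b brelse()
and the wakeup primitive, interrupt masking is omitted, and letting the
.i b_iodone
path fall through to the release/wakeup step is a simplification of
this sketch; the flag values are illustrative:

```c
#include <assert.h>

/* Illustrative flag values, not the kernel's actual bit assignments. */
#define B_DONE   0x01
#define B_ASYNC  0x02
#define B_DELWRI 0x04
#define B_CALL   0x08

struct buf {
        int   b_flags;
        void (*b_iodone)(struct buf *);
};

static int released, woken, callbacks;

static void brelse_stub(struct buf *bp) { (void)bp; released++; }
static void wakeup_stub(struct buf *bp) { (void)bp; woken++; }
static void iodone_cb(struct buf *bp)   { (void)bp; callbacks++; }

/* Completion processing as described: run the optional b_iodone hook
 * once (clearing B_CALL), treat a completed write as clean, then
 * either release an asynchronous buffer or mark it done and wake any
 * waiters. */
void
biodone_sketch(struct buf *bp)
{
        if (bp->b_flags & B_CALL) {
                bp->b_flags &= ~B_CALL;         /* call at most once */
                bp->b_iodone(bp);
        }
        bp->b_flags &= ~B_DELWRI;               /* completed write is clean */
        if (bp->b_flags & B_ASYNC) {
                bp->b_flags &= ~B_ASYNC;
                brelse_stub(bp);                /* nobody waiting: free it */
        } else {
                bp->b_flags |= B_DONE;          /* record completion */
                wakeup_stub(bp);                /* rouse biowait() sleepers */
        }
}
```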
1840.uh "Design Perspective on Bio.c"
1841.pp
1842Once upon a time the buffer cache in a Unix system was an exciting concept.
1843Until a buffer cache was added,
1844the earliest Unix systems at Bell Labs did not allow overlap of I/O operation
1845with other processes executing.
1846.b bio
1847allowed multiprogramming operation in these early systems, so
1848that when a reference was made to a block where I/O operation
1849had not yet been performed the operation could be queued and another process
1850run while the disk subsystem located and returned
1851the appropriate block.
1852.pp
1853.b bio
1854is an example of a kind of superdriver which allowed queued
1855I/O requests to be processed without blocking the processor from operating.
1856Given the resources of small machines like the PDP-11, this was a perfectly
1857appropriate mechanism.
1858.b bio
became even more desirable with the addition of
1860delayed writes and read-ahead operations, which allowed more decoupling
1861of synchronous operation between disks and processes accessing the filesystem.
1862.pp
The buffer cache was especially useful to early Unix systems because
1864of their use of hierarchical filesystems (then unusual in any small system).
1865A single pathname lookup
1866(see
1867.b fs/lookup.c)
1868(***HYPERLINK LOOKUP.C TABLE OF CONTENTS - MACRO)
1869could require many I/O operations be performed.  With caching, however,
1870many of these I/O operations would be avoided.
1871.pp
1872Nowadays, I/O performance enhancement requires more than the old
1873buffer cache mechanism can provide.
1874Besides handling name evaluation, this mechanism now must provide the
1875ability to manage mass I/O systems so that they are always operating
1876at peak transfer rate. It must also
1877possess the ability to shift granularities of information back and
1878forth that are larger than the size of the filesystem blocks themselves.
1879.pp
1880Because of these modern day requirements, we can no longer cache
1881low level I/O transfers -- instead, we must cache regions of
1882address space associated with files. In sum, we must
1883work at a higher level of abstraction where we
1884direct an I/O operation to be performed that far exceeds the size
1885of the actual low level atomic transactions.
1886.pp
1887Why we cannot accomplish this at the lower level of filesystem blocks
1888is simply because the information regarding
1889the characteristics of the transfer is managed only at the top level
1890of abstraction. While it is possible to devise an architecture which always
1891"sends" this information down to the lower levels whenever any operation
1892may be performed, this would still be inadequate since we may
1893be required to manage at the top level of abstraction the timing (e.g. arrival
1894and service times) in order to use the resource most effectively -- this
1895requirement is not easily worked into our communications mechanism
1896to the lower layers.
1897.pp
1898Perhaps even more importantly, in the future our operating system may
1899wish to make high level changes in allocation strategies, independent
1900of the lower levels which actually implement the transfers. Again,
1901the burden of synchronizing lower and upper levels in a dogmatic
1902way could easily prove too costly.
1903.pp
Another consideration for future design is that the buffer
1905cache mechanism provides one set of policies for all filesystems
1906that make use of the buffer cache and does not allow for per-filesystem
1907management of such policies. One example of this is the method by which
1908filesystem metadata (inodes, disk indirect blocks, cylinder groups)
1909must be synchronized when performing a file update operation. Currently
1910in BSD systems, this is handled by forcing the association
1911of metadata with the file's actual data on block lists and disassociating
1912them when an I/O operation needs to be performed.
1913This is done so that when a file is forced written to the disk (fsync),
1914all of the related blocks can be found, including the indirect blocks which
1915don't actually belong to the address space of the file itself.
1916Thus, there is no general purpose way currently to make these
1917associations other than by implementing the buffer cache to
1918provide a callout to the vnode layer when this is necessary.
1919.pp
1920An alternative design is to avoid the implied I/O involved with the filesystem
1921cache, by implementing a lookaside cache which
1922does I/O directly to the lower layers of the filesystem. Prior
1923to initiating the I/O, it checks to see if the
1924block is present in the filesystem cache and then chooses to use the
1925contents of the cache rather than doing the transfer. Such a mechanism
1926is simpler and avoids the need for the knowledge of interrupt level
1927operation or much in the way of semantic information regarding the contents
1928of the buffers.
1929