include(/etc/ann/listing.m4)
.if t \
.po 8n
.if n \
.po 0n
.ll 70n
.\"$Id: bio.c.ann,v 1.3 94/08/26 13:18:49 bill Exp Locker: bill $
.he 'Annotation of kern/fs/bio.c''DRAFT'
.fo 'translit(`$Revision: 1.3 $',`$') '% 'Copyright (C) 1994 TeleMuse
.(l C
.sz 14
Annotation of the File kernel/kern/fs/bio.c
William and Lynne Jolitz
.sz 10
Copyright (C) 1994 William and Lynne Jolitz. All Rights Reserved.
.)l
.uh "Introduction:"
.pp
The file
.b "fs/bio.c"
contains the machine-independent code used to
implement the cache management of independent filesystems.
It corresponds to the middle layer of abstraction of a filesystem, which
maintains the exchange of file data and metadata contents
read or written to the external storage comprising the physical file.
.pp
Because this interface is shared by many filesystems
and many processes which are clients of a filesystem (even clients
of the same file), references to this information are made
relative to the vnode describing the file and an opaque block offset
(relative to the filesystem) that describes the contents being cached.
While the total resource dedicated to the filesystem is globally
managed among all clients, the interpretation and management of the
cached entries themselves is done relative to each filesystem and is opaque to
unrelated filesystems.
.uh "Buffer Cache in Detail"
.pp
This filesystem cache consists of buffer elements, which are transactions
of file data or metadata. Buffers are comprised of two parts: a
fixed-sized buffer header that records details of the transaction, and a
variable-sized data portion that holds the buffer contents associated with the
transaction. The data portion can be thought of as a side effect of
the buffer header, since the buffer cache makes no direct use of it at all.
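.pp
The two-part arrangement just described can be pictured with a small C sketch. The structure and flag names below are illustrative stand-ins, not the actual 386BSD definitions (those live in the kernel's buf header); the sketch assumes only what the text states: a fixed-size header describing the transaction, with a separately allocated, variable-sized data portion hanging off it.

```c
#include <stdlib.h>

/* Illustrative flag bits -- names follow the annotation's usage. */
#define B_READ  0x1     /* transaction reads from storage */
#define B_CACHE 0x2     /* contents found valid in the cache */
#define B_INVAL 0x4     /* contents invalid; buffer recycled at once */

/* Fixed-size header recording the transaction: which file (vnode),
 * which logical block, how big the data portion is, and its state. */
struct buf_sketch {
    int    b_flags;
    void  *b_vp;        /* vnode the contents are relative to */
    long   b_lblkno;    /* opaque, filesystem-relative block offset */
    size_t b_bufsize;   /* size of the variable data portion */
    char  *b_data;      /* the data portion: a "side effect" of the header */
};

/* Allocate a header with an attached data portion of the given size.
 * The new buffer has no contents yet, so it starts out invalid. */
struct buf_sketch *buf_alloc_sketch(void *vp, long lblkno, size_t size)
{
    struct buf_sketch *bp = malloc(sizeof *bp);
    if (bp == NULL)
        return NULL;
    bp->b_flags = B_INVAL;
    bp->b_vp = vp;
    bp->b_lblkno = lblkno;
    bp->b_bufsize = size;
    bp->b_data = malloc(size);
    if (bp->b_data == NULL) {
        free(bp);
        return NULL;
    }
    return bp;
}
```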
.pp
The cache works by attempting either to
avoid transactions, by reusing buffers and short-circuiting redundant
transactions, or to
anticipate new transactions before they happen so that they may
be overlapped. In addition, the buffer cache provides an interface
between the synchronous filesystem data transfer requests and the
asynchronous mechanisms (device driver, network, and so forth) used
as a basis for transferring the actual data. Finally, the scheduling
of global I/O is made external to the various filesystems and managed
at a central point for consistent policy.
.pp
The filesystem buffer cache is only called by filesystems; in turn, it
only interacts with filesystems to effect its actions. It interfaces
to the basic kernel by obtaining and releasing memory resources for buffer
contents
(see
.b malloc()
(***HYPERLINK WHAT IS MALLOC() IN MALLOC.C - POPUP)
and
.b free()
(***HYPERLINK WHAT IS FREE() IN MALLOC.C - POPUP)
in
.b kern/malloc.c
(***HYPERLINK MALLOC.C TOC - MACRO)
for further information). It also interfaces
to the master VFS by performing generic operations
on the control of filesystem contents (see
.b vflushbuf(),
.b bgetvp(),
.b brelvp(),
and
.b vwakeup()).
.pp
One should note that the two write alternatives
.b bawrite()
and
.b bdwrite()
have their differing operations cited at the front of the name
because the decision to write occurs before the operation. This
contrasts with functions such as
.b breada(),
where the second, asynchronous read operation
occurs after the first operation.
.uh "The Life Cycle of a Buffer in the Buffer Cache"
.pp
The buffer cache caches I/O transfer requests (either to a device driver
or to an underlying storage mechanism, such as a network protocol).
As such,
it performs I/O on demand when either a buffer does not contain the appropriate
contents or when the contents are forced to be written out. I/O goes
on independent of the requests to view the buffer cache. This can be
considered not only a design feature but also a flaw, since filesystems
don't have any control over when I/O might occur: all I/O is lumped
together into one set of policies for all filesystems instead of using
a more sensible mechanism such as a "lookaside" cache arrangement.
.pp
Buffers to be read are generally obtained by the
.b bread()
(***HYPERLINK WHAT IS BREAD()
or
.b breada()
(***HYPERLINK WHAT IS BREADA()
functions, which ensure that the contents of a block exist and return
the buffer containing those contents, or an error indicating why an I/O
operation to a buffer could not obtain the appropriate contents.
In the case that contents could not be obtained, the buffer that was
allocated to receive the contents is left invalid and will be
immediately recycled, since it has no contents to cache. Since these
functions had to allocate a buffer, they may block waiting for a usable
buffer to become available. Either the buffer is the desired one that
already has contents available without any I/O required (but is still in
exclusive use by another process; thus, the caller must block until it
can obtain exclusive access) or it is not. In the latter case, a free buffer must
be obtained for
.b bread()
and
.b breada()
to obtain contents for it to return.
.pp
In the process of writing a buffer, the buffer is obtained from
.b getblk(),
which either finds an existing buffer with the cached contents (the
flag B_CACHE indicates that the buffer was found with contents
intact in the cache) or allocates a buffer with no contents
present, covering this portion of underlying storage exclusively
in the buffer cache, and marks it as being invalid (B_INVAL).
In the
second case, if only a portion of the buffer were to be modified, a read
would be necessary to obtain valid contents before the write of
the rest of the buffer could take place. (A
.b bread()
would best provide this, since it would ensure that the buffer was read.
However, if the entire point of the operation was to ensure that there
was only one entry in the buffer cache corresponding to the block, with
the contents themselves considered irrelevant, a
.b getblk()
call is sufficient.)
In either case, exclusive access to that portion of the vnode is now
assured, since all other requesters will wait until the buffer has
been returned to the free pool for use by other processes.
.pp
Having now obtained the buffer, its contents can be loaded and passed to
one of the write routines to be written to storage. The
.b bwrite()
(***HYPERLINK WHAT IS BWRITE()
function will immediately force the operation to occur and wait for
the result (suspending during the operation), while
.b bawrite()
(***HYPERLINK WHAT IS BAWRITE()
will issue the operation and arrange for the buffer's release to the free pool
upon completion, without waiting. A third variant,
.b bdwrite(),
(***HYPERLINK WHAT IS BDWRITE()
will delay the actual write operation until either the buffer must be reused
elsewhere in the cache or it is forced out
by either a
.b bawrite()
or
.b bwrite().
.pp
Reads (when finished) return buffers to the buffer cache with a
.b brelse()
(***HYPERLINK WHAT IS BRELSE()
function call.
In the case of an asynchronous operation,
.b brelse()
can also be called from interrupt level by
.b biodone()
(***HYPERLINK WHAT IS BIODONE()
to release the buffer cache buffer back to the free pool.
Other operations waiting for a
.b biodone()
to signal completion of the operation will use the
.b biowait()
(***HYPERLINK WHAT IS BIOWAIT()
function, waiting for the operation to complete by suspending
(and also recovering the error, if any occurred in performing the operation).
Note that there is no external functional interface to see if a
buffer is in use, as all operations may block to gain exclusive access
to a buffer, however long that block may take.
.pp
Buffers may be expanded or contracted to hold different sized contents. The
.b allocbuf()
(***HYPERLINK WHAT IS ALLOCBUF()
function expands or contracts the size of a buffer as
requested. It is used by the filesystem to change the size of the entry
that is being cached while preserving the contents present in the previously
sized buffer (throwing away information on a reduction and retaining partial
contents on an expansion).
.uh "Delayed Write and Read-ahead Operations"
.pp
Delayed write operations are buffer transactions that are
recorded as having been written when they actually haven't -- instead,
they "dwell" in the buffer cache for a period of time.
This mechanism allows temporary blocks to appear to be
"written" and "read" without any I/O actually occurring before
invalidation occurs (as would be the case with compiler
temporary files). In addition, delayed write operations allow
output operations to be queued behind read operations, thus postponing
their effect until the cache entry is needed by a read operation.
.pp
Read-ahead operations, in contrast, are the complement
of delayed write operations, in that read-ahead operations anticipate
the need for a successive block. Since most I/O on Unix systems is sequential,
successive read operations can be detected and the operations pre-queued
for the next successive block to achieve double buffering.
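.pp
The sequential-access heuristic just described can be sketched in a few lines of C. This is an illustrative model under stated assumptions, not the 386BSD code: it assumes only that the previous block number requested on a file is remembered, and that a request for the immediately following block makes its successor worth pre-queueing.

```c
/* Per-file state for detecting sequential reads. Names are
 * hypothetical; the kernel keeps equivalent hints elsewhere. */
struct ra_state {
    long last_blkno;    /* previous block requested */
    int  have_last;     /* whether last_blkno is meaningful yet */
};

/* Record a request for blkno; return the block number worth reading
 * ahead (achieving double buffering), or -1 if the access pattern
 * is not sequential. */
long readahead_sketch(struct ra_state *st, long blkno)
{
    long next = -1;
    if (st->have_last && blkno == st->last_blkno + 1)
        next = blkno + 1;   /* sequential: pre-queue the successor */
    st->last_blkno = blkno;
    st->have_last = 1;
    return next;
}
```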
.pp
Note that read-ahead and "write-behind" (i.e. delayed write)
operations create additional traffic for the buffer cache, because
a certain amount of the buffer cache is tied up in holding the state
for both reads and writes that may or may not be of use.
This buffer cache memory overhead
is the price paid for attempting to avoid a more expensive I/O operation.
.pp
One interesting question raised by this approach is that of just who
has priority -- read-ahead blocks, which haven't yet
proved that their information needs to be cached, or write-behind blocks,
which may never be reused again and could instead be queued for output?
If priority is given to read-ahead blocks, we tend to commit delayed
writes back to disk earlier and effectively perform more write
operations than might otherwise have been needed. This may be especially
expensive if we have to read them back in again often, as would be the
case with compiler temporaries. On the other hand, if priority is
given to delayed write operations, the read-ahead operation can be severely
impacted by the lack of buffer space necessary to hold its contents. In
this case, we will end up always synchronously reading from the disk.
.pp
Since the system does more reads than writes, read-ahead operations are
favored over delayed write operations, resolving this contradiction,
but this is a somewhat arbitrary solution. Unfortunately, since this
caching operation occurs at a fairly low level of abstraction in the system,
there is insufficient information available (e.g. arrival pattern,
file persistence duration -- temporary versus permanent) with which to
implement a sophisticated policy mechanism (such as one using queuing theory).
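.pp
The priority trade-off above can be made concrete with a hedged sketch of free-buffer victim selection. Nothing here is the actual getnewbuf() logic; the function and flag names are illustrative, and it simply encodes one side of the policy choice the text describes: prefer reclaiming a clean buffer (such as an unused read-ahead block), and only fall back to pushing out a delayed write, at the cost of an extra I/O, when no clean buffer remains.

```c
#define B_DELWRI 0x1    /* delayed write: dirty, I/O postponed */

struct fbuf { int b_flags; };

/* Scan a free list of n buffers and return the index of the victim
 * to reclaim, or -1 if the list is empty. Clean buffers win over
 * delayed-write buffers, whose reclamation costs a write. */
int pick_victim_sketch(const struct fbuf *list, int n)
{
    int fallback = -1;
    for (int i = 0; i < n; i++) {
        if (!(list[i].b_flags & B_DELWRI))
            return i;          /* clean: cheap to reclaim */
        if (fallback < 0)
            fallback = i;      /* dirty: remember as last resort */
    }
    return fallback;
}
```

The opposite policy (favoring delayed writes) would simply invert the preference; the text notes that 386BSD's choice is somewhat arbitrary, driven by reads outnumbering writes.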
.pp
For example, when the system does a core dump of a program
larger than the size of the buffer cache, the buffer cache
is immediately overrun with contents bound for the disk (or network),
wiping out all previously held contents in the process. (Ironically,
the core dump contents may never even be read, especially if this is
a system process.) If the core dump is sufficiently large, it may take
several seconds to clear by writing to the disk. During this time, the
system will appear to "pause" as the other blocks queued for transfer wait.
.pp
If, instead of allowing the buffer cache to be
overwritten, we possessed the appropriate policy information,
writes would only be accepted into the cache at the
rate at which they were consumed by the underlying transfer mechanism,
resulting in a steady state. Since we cannot satisfy
the transfer rate any faster than that allowed by the underlying device,
it doesn't help to accept any more transactions into the cache
than can possibly be satisfied by the lower levels of the system.
.pp
One area of discussion is determining just "when"
information should be allowed to persist in the buffer cache
in excess of the ability of the lower-level storage mechanism to transfer
the information. In essence, we must decide when we will reuse the
information, so that it should persist regardless (the compiler temporary
case), and when we should constrain the "virtual bandwidth" to that
of the lower-level device (as in the core file case).
.uh "Optimizing Buffer Transactions"
.pp
Read-ahead and write-behind operations were used in the original
Unix buffer cache.
In fact, almost all timesharing operating systems
with buffered filesystem transfer caches prior to
Unix, including BTSS and Multics, also used these operations to
reduce the number of transactions incurred, by managing a cache of them
and reordering them as necessary to achieve the optimal time and number
of transactions, so this concept is not unique to Unix.
.pp
Ironically, by allowing read-ahead operations we actually
increase the number of transactions that occur. However, since the net
cost of occasionally throwing away a read-ahead block is low,
the overall system efficiency is increased anyway (since on the off
chance we did use the read-ahead, an operation that would otherwise
have blocked need not occur). Some Unix systems
defeat read-ahead and ageing under the mistaken belief that
simply reducing the number of operations must increase system efficiency.
This is not always the case, as one must examine the actual
delay-time reduction of the application programs as well as the
number of operations to decide if the optimization is successful.
.pp
For example, in the 386BSD Release 0.1 version of
.b bio.c,
delayed writes and read-aheads were not implemented correctly
for all cases. After this was rectified, the total number of I/O
operations was observed to have increased (which we found surprising).
However, the time required for the average completion of a kernel
build actually dropped! How could this be? We found that
read-ahead operations would occasionally be successful
and occasionally not -- those not successful would have to be reissued,
so the number of operations increased. The reason this did not degrade
performance but instead actually improved it was that when the block was
of immediate use, it avoided a delay in the utility getting a block of data.
However, in the case
where it wasn't of any value, the buffer would be reused by the delayed write
operation of a compiler temporary and thus would hold onto that information
anyway. As such, the proper effect of both delayed write and read-ahead
operations would result. In addition, the reason the number of aggregate
read/write operations increased was that, while many of the read-ahead blocks
were never used, we never had to wait for them either, so they did not affect
the elapsed time of the running application.
.uh "File I/O Clustering"
.pp
The effect of aggregating read/write operations for greater efficiency is
possible not only at the low level of abstraction of the
transfers themselves but also at the higher level of abstraction
of whole transfers from the user process's perspective. Consider that a user
process is performing I/O on an abstract file -- we can collect up
its numerous I/O operations into a much larger aggregate of individual
operations, coalesce them into a single operation, and perform that.
.pp
Given this higher level of abstraction, a
.i clustered
read or write operation can be performed based not on the actual
buffer transactions but on a much larger scale, driven by
the economies of memory in the system and not by the economies of
memory in the buffer cache. Obviously, if the memory of a system is so
constrained that the buffer cache is a significant portion of memory
(as it was in earlier systems), clustered I/O would be no great advantage.
However, we routinely use PCs that contain many times the
amount of memory of the programs which they run.
Clearly, the efficient use of memory as a resource might now
allow us to use memory state to transfer larger aggregations of file
transfers more efficiently. In other words, we exchange memory resource
for I/O bandwidth to obtain a larger "virtual" bandwidth.
This, in a nutshell,
is what file I/O clustering is all about.
.pp
Since file I/O clustering lies between the top-level view of the system
(e.g. reads and writes) and the bottom-level view of the system (actually
performing the I/O), it is an operation that must invisibly
reconcile the semantics of the top-level reads and writes independent of
the low-level operations (in the current system, there is no such layer).
Since this mechanism inherently affects the user process's
address space, the virtual memory system is the logical means through
which this new layer would be interposed.
.pp
To permit the virtual memory system to make decisions on
the size and timing of transfers, additional information
(queue sizes and I/O operation completion) must be available,
so that the virtual memory system can
properly choose how much resource to dedicate to a particular set of
virtual transfers. As such, a sufficiently different
architecture for attaching the lower level of the system to the
higher-level portion of the system is now necessary.
.uh "fs/bio.c Functions"
.pp
This file contains the following functions:
.(l
bufinit()
incore()
bread()
breada()
bwrite()
bdwrite()
bawrite()
brelse()
getnewbuf()
getblk()
geteblk()
allocbuf()
biowait()
biodone()
.)l
(***HYPERLINK ALL OF THE WHAT IS TO THESE LIST OF FUNCTIONS)
The names are indicative of the function they provide in the
implementation of the buffer cache.
.b "bread()"
and
.b "bwrite()"
provide a virtual read/write interface to the
buffer cache and the underlying storage, expressed as a filesystem.
.b "breada()"
and
.b "bawrite()"
provide asynchronous mechanisms, avoiding a wait
for the operation if the caller decides not to pause to see the outcome
of the operation.
.b "getblk()"
and
.b "brelse()"
allow the buffer contents
to be probed and released irrespective of any implicit I/O.
.b "allocbuf()"
allows for alteration of buffers to suit the needs of the filesystem.
.b "biowait()"
and
.b "biodone()"
provide an interface between underlying asynchronous I/O
and the filesystem's more orderly synchronous world view.
.b "incore()"
and
.b "getnewbuf()"
are internal support routines used to implement the buffer cache.
.b "bufinit()"
is an entry point for the kernel to instantiate the buffer
cache facility.
.b geteblk()
is an obsolete interface used to steal
an empty block from the buffer cache for use in other mechanisms (a
memory allocator, used prior to mbufs and
.b "malloc()").
(***HYPERLINK WHAT IS MALLOC() IN MALLOC.C - POPUP)
.uh "Antecedents to the 386BSD Buffer Cache"
.pp
Earlier UNIX systems
used a block-oriented mechanism to buffer physical I/O transactions
to a disk drive. At that time, there was a single UNIX filesystem
that used the block buffer cache to interface to disk drivers directly.
.pp
The names of the above routines and the function of the buffer cache
are directly descended from this earlier work.
However, it is deceptive to assume that they are similar -- a
myriad of subtle differences exist due to the fact that this buffer
cache works with variable-sized buffers of independent filesystems,
caching transactions of the logical contents of the files. Various
idiosyncrasies of the buffer cache are due to its appendix-like role
in the anatomy of UNIX systems.
.uh "Evolution of the 386BSD Buffer Cache"
.pp
The 386BSD buffer cache was developed independently
for the original Release 0.0 in 1992.
The first version
of this file and how it was developed is discussed in the article
.b "Missing Pieces II"
(***HYPERLINK TO ARTICLE TOC - MACRO)
in the section
.b "Block I/O Cache".
(***HYPERLINK TO SECTION - POPUP)
Both the strengths and weaknesses of this buffer cache
design were also discussed in that
article (see
.b "4.4BSD Demands"
(***HYPERLINK TO SECTION - POPUP)
and
.b "4.4BSD Weaknesses",
(***HYPERLINK TO SECTION - POPUP)
respectively).
.pp
The reimplementation of the buffer cache code was quite distinct from the
design of earlier implementations, primarily in its use of the kernel memory
allocator, in place of an internal memory pool and page table entries, to
allocate buffer contents from. This was done to reduce the code size by
relying on a common memory allocator; to provide uniform memory policy for
the kernel as a whole (the intent was that memory could always be reclaimable
by the virtual memory system, and that unused memory (B_INVAL buffers) would
always be held by the memory allocator for universal use within the kernel);
and to reduce memory fragmentation (partial-page allocation of buffer contents
would not require a whole, dedicated page of memory).
.pp
The buffer cache code had minor bugs (double allocation of space), unfinished
implementation (buffer ageing and delayed writes), as well as inefficiencies
(buffer ageing and delayed writes, again). In addition, the design objective
of returning unused buffer memory contents to the memory allocator was not
completed. These problems have been
eliminated from this version, as many of them had stemmed from early,
abortive attempts to introduce an interim version of the page cache, the
intended successor to the buffer cache.
.pp
This design, however, was intended from the beginning to be only an interim
functional mechanism until a more modern mechanism -- the 386BSD
Page Cache -- was ready for release.
.uh "Future Directions: The 386BSD Page Cache"
.pp
While the page cache is conceptually simple, its implementation and
the replacement of the buffer cache are far from simple. Both performance
and operational issues dominate such attempts. Most of these are
rooted in the dependence of the rest of the kernel on artifacts of the
buffer cache and its interface.
.pp
The buffer cache combines many different functions and features into
a single mechanism. This concise arrangement was somewhat necessary given
the tiny memories of the original PDP-11 (and PDP-7 before it)
environments it ran in. The buffer cache provides many mechanisms:
.(l L
* The interface for filesystems to access records (blocks) of storage.
* A way to achieve I/O on a storage medium.
* A way of managing exclusive access to a portion of file contents.
* A way to avoid redundant reads (transfer cache).
* A way to schedule sequential read operations allowing overlap (read-ahead).
* A way to postpone/coalesce writes (write-behind).
* A way to resolve a file-relative index to an actual physical record index.
.)l
.pp
In addition, to manage the cache contents, it provides these additional
mechanisms for the filesystem to use:
.(l L
* Cache invalidation/ageing.
* Cache probing/statistics.
* Cache resizing.
.)l
.pp
These mechanisms are all implemented by a series of functions that interact
in an ad hoc manner. Since the
policies are embedded in a buffer cache used by all filesystems, all
filesystems are compelled to use the same policies.
Also, since I/O is
implied behind the cache, the buffer cache schedules buffer reuse/writeback
based on its demands for memory rather than when the filesystem would like
to see the output operation occur. Finally, even though the buffer cache
can be made to relinquish memory back to the memory allocators, it still
must have a notion of a fraction of memory resources dedicated to filesystem
cache contents, which limits the effectiveness of using memory to
accelerate I/O operations. Inherently, the virtual memory system and buffer
cache are on a collision course, as both provide fundamentally the same
operation (access to storage) for different subsystems (filesystem, address
space). As both must interact (mapped files) and cache state to improve
operation, each elbows the other for resources.
.pp
The resolution of this conflict is that the virtual memory system must
be the only mediator of storage, and the only manager of memory resources.
Filesystems should provide methods to perform and mediate I/O. A filesystem's
access to a portion of a file should be treated exactly like a program's
access to a portion of a file mapped into its address space. A process
consuming a large number of pages for I/O should compete alongside another
process consuming a large number of pages to run its program within.
Actual I/O for either should be managed efficiently with the same mechanisms
(buffering for overlapped operation, caching to reduce redundant operations,
clustering to coalesce multiple operations into a single one,
reclamation
to cope with finite memory resources, and I/O scheduling to maximize the
benefit of the storage transfer mechanism).
.pp
The page cache is the universal mechanism that makes possible this
architectural arrangement. With this mechanism, file contents are
only reachable via the virtual memory system.
Any memory resources
used to cache contents are held by the virtual memory system. All I/O
operations are initiated, or scheduled to be initiated, by the filesystem.
The memory system may attempt to reclaim resources by writing back
a cached entry; however, the filesystem may choose how, when, or if
this is to be done. The virtual memory system provides access to
memory-based, logically ordered objects on which I/O may be applied,
while the filesystem transforms these into physical transfer operations.
.pp
Thus, the previously mentioned functions cannot all
be present in the same place as they were in the buffer cache, but instead
must be spread across both the virtual memory system and the filesystem, with
much of both changing considerably. In effect, the caching mechanism
migrates between the levels of the abstraction layers:
.(l L
Is:
vm
filesystem (top)
buffer cache
filesystem (bottom)
drivers | network | ...

Becomes:
filesystem (top)
 | | vm
 | | page cache
filesystem (bottom)
drivers | network | ...
.)l
.pp
In this new arrangement, the filesystem "top" layer transforms requests for
file storage into virtual memory system requests (if necessary for the
filesystem's semantics, the virtual memory system may be bypassed entirely
for non-cachable transfers of information), which may either be found
in the global page cache or reloaded on demand by the filesystem "bottom"
layer. Thus, the net effect of the change is to cache potentially mappable
memory state to avoid I/O transfers, instead of caching the I/O transfers
themselves.
.uh "What is bufinit()?"
.pp
.b bufinit()
instantiates the data structures with sane values so that the
remaining code can use the facility.
.uh "What Does the Kernel Expect of bufinit()?"
.pp
.b bufinit()
must bootstrap enough of the buffer cache so that it can
commence operation.
At this point, the memory allocator is active,
but no disk I/O is yet allowed. The buffer cache will be
needed to help initialize block devices and to open the root
filesystem -- its first operations.
__listing(/usr/src/kernel/kern/fs/bio.c,80,81)
.lp
.b bufinit()
has no arguments and returns nothing.
Its side effect is to initialize the statically
allocated hash table, buffer headers, and free buffer queues.
.uh "How is bufinit() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,87,91)
.lp
With nothing in the hash table, its entries point to themselves. Note
that there is only one value per hash table header for both the
beginning and end of the hash chain, yet the sentinel itself is never
allocated. One could have used a single entry, but this
identical code allows for greater speed to unlink and link (one case only)
a buffer on a queue.
__listing(/usr/src/kernel/kern/fs/bio.c,94,100)
.lp
Likewise, a list of free queues exists on which inactive buffers are present.
We link them the two ways they can be found (address match or
availability on the
.i freelist).
__listing(/usr/src/kernel/kern/fs/bio.c,103,109)
.lp
With the list headers established, we place the buffer headers on the
queues using macros used throughout the rest of the implementation.
At this point, we take care to mark them as having no contents and
matching no address.
.uh "386BSD bufinit() Design Choices and Trade-Offs"
.pp
It is assumed in the 386BSD implementation that the buffer cache is
empty to begin with and that the buffer cache code will allocate
and fill in the
structures as needed, instead of allocating everything here.
As such, the key design choice here was to
place the burden on the rest of the implementation so that the virtual
memory system, the ultimate memory manager, could reclaim space on demand
from the buffer cache.
It becomes just one of many clients of memory inside
the kernel. (N.B. It may eventually be desirable to allow revoking of memory
as a capability on demand.)
.pp
This implementation of
.b bio.c
still uses the buffer list and queue arrangement found in
older Berkeley Unix (and traditional Unix) systems, instead of using
the general-purpose queue facility (see
.b include/queue.h)
used by the entire virtual memory system.
Since the buffer cache is scheduled for replacement by the page
cache (which uses these queues already), work has not been
continued in this area.
.uh "What is incore()?"
.pp
Active buffers are linked onto one list of a hash table of lists,
where the hash function is determined by logical file and block number. The
.b incore()
function, used solely within the buffer cache,
locates valid cache contents regardless of the status of the buffer.
Since this function is small and can limit performance for large
cache sizes, it is inlined for speed.
.uh "What Does the Buffer Cache Expect of incore()?"
.pp
.b incore()
attempts to locate a buffer in the buffer cache.
If one is present, it returns a pointer to the
buffer's header; otherwise, it returns a null buffer pointer.
__listing(/usr/src/kernel/kern/fs/bio.c,115,116)
.lp
.b incore()
has two arguments: a vnode pointer and a logical block number.
The arguments are used to find an exact match from one of the
buffers holding valid data, regardless of buffer status.
If a match is found, the buffer pointer is returned;
otherwise, a null pointer is returned.
.pp
.b incore()
should never block; otherwise, an allocation race could result and the
wrong answer be obtained. Since
.b incore()
is used to find buffers that can be in either a locked or unlocked state,
it should never be used itself to lock or unlock
buffers; otherwise, a chicken-and-egg recursion could result.
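.pp
A minimal model of the lookup just described might look like the following C sketch. The table size, the fold function, and the NULL-terminated chains are all simplifying assumptions (the kernel's table size and hash macro differ, and its chains are circular, anchored by sentinel headers); only the exact-match, never-block, status-independent behavior follows the text.

```c
#include <stddef.h>
#include <stdint.h>

#define NHASH 64    /* hypothetical table size */

/* Illustrative buffer header: just enough fields for the lookup. */
struct ibuf {
    struct ibuf *b_forw;    /* next buffer on this hash chain */
    const void  *b_vp;      /* vnode the contents belong to */
    long         b_lblkno;  /* logical block number */
    int          b_invalid; /* nonzero if contents are stale */
};

/* Fold the vnode pointer and block number into a table index. */
static unsigned bhash(const void *vp, long lblkno)
{
    return (unsigned)(((uintptr_t)vp >> 3) + (uintptr_t)lblkno) % NHASH;
}

/* Linear search of one chain for an exact (vnode, block) match.
 * Never blocks, never locks: it only reads the chain. */
struct ibuf *incore_sketch(struct ibuf **tab, const void *vp, long lblkno)
{
    for (struct ibuf *bp = tab[bhash(vp, lblkno)]; bp; bp = bp->b_forw)
        if (bp->b_vp == vp && bp->b_lblkno == lblkno && !bp->b_invalid)
            return bp;          /* exact match on vnode and block only */
    return NULL;                /* cache miss */
}
```

Note that, as the text explains, nothing here consults buffer or block sizes: a block lying inside another cached buffer, or aliased at a different block size, is simply not found.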
.uh "How is incore() Implemented on the 386?"
__listing1(/usr/src/kernel/kern/fs/bio.c,118)
.lp
The buffer hash address is located by the hash macro (see
.b include/buf.h),
which folds the value of the vnode pointer and logical block number
into an index into a fixed-sized hash table of buffer hash headers.
Buffer hash headers are actually only three words long, and overlay the
first three fields of the buffer header structure
.i (b_flags,
.i b_forw,
and
.i b_back),
so the definition for
.i bh
is a bit misleading. The hash header is the
sentinel that anchors a doubly linked circular chain of buffer
headers; either the chain contains the buffer we are locating, or the
buffer is not present in the buffer cache.
__listing1(/usr/src/kernel/kern/fs/bio.c,121)
.lp
The hash chain is linearly searched for the desired buffer. The
first buffer is pointed at by the forward pointer (although we could
just as easily have walked the hash chain backwards via the back pointer). In
either direction the header itself is chained as the last element, so that
an empty hash header is an entry with both back and forward pointers
pointing to itself (see
.b bufinit()).
(***HYPERLINK WHAT IS BUFINIT() - POPUP)
__listing(/usr/src/kernel/kern/fs/bio.c,123,125)
.lp
Valid matches are exact matches on the starting logical block number and
vnode pointer. Note that neither the size of each cached buffer nor the block
size associated with the block address (i.e. the units of the address) is
taken into account by the
.b incore()
lookup, so it is not possible to locate
a logical block contained within a cached buffer or to locate a buffer
by an alias of a different block size. The buffer cache effectively
caches disk transactions of a particular arrangement by a filesystem, and
does not provide a global view across all filesystems of vnode file
state present in RAM.
Thus, the buffer cache is relative to each of its
filesystem clients.
.pp
This arrangement directly implies that
.b bio.c
is logically located at the middle layer of abstraction
of a filesystem, separating the filesystem's top layer (its semantics
and attachment to the virtual filesystem) from its lower layer (the logical
to physical mapping, commitment to storage, and storage allocation policies).
.lp
A check for the validity of the buffer is made before returning the block.
Normally, only valid data should be present on the hash chains; however,
the block may have become invalid if a race condition occurred while
removing it from the
hash (by code running from interrupt level or on another processor). Since
the unusual case is an invalid buffer, this is only tested prior to returning
a positive match. Buffers are not locked by this function -- the information
in the associated buffer header is subject to change prior to use, so
callers must take care to discard buffers as needed if they are stale.
__listing1(/usr/src/kernel/kern/fs/bio.c,127)
.lp
In the case of a buffer cache miss,
.b incore()
returns a null buffer pointer.
.uh "What is bread()?"
.pp
.b "bread()"
is one of the top-level function entry calls for the
buffer cache. The filesystem top-level code uses this function to access
a file's contents by performing a virtual read of the
combined cache and the lower level of the filesystem.
.uh "What Does the Filesystem Expect of bread()?"
.pp
.b bread()
causes a buffer to be created with the contents requested.
It either obtains this from the buffer cache, if present, or
obtains an empty buffer and fills it with the appropriate
contents by means of the lower-level mechanisms of the filesystem.
__listing(/usr/src/kernel/kern/fs/bio.c,135,137)
.lp
.b bread()
has five arguments (logical file parameters): the vnode pointer; the
logical block offset; the size of the transfer; the credentials used in
performing the read; and a pointer to the buffer returned by the
.b bread()
operation.
The credentials of the requester,
passed in the buffer, are used to associate any lower-level
I/O operations. The function returns a buffer at
the location specified ("*bpp") as well as any error value encountered
in performing the operation.
.pp
The buffer is returned locked for exclusive
use, and must then be returned either to the buffer cache for further use by
other clients of this block of the file or to the cache to recycle its buffer
memory for further use.
.pp
Since the filesystem code is unaware of any concurrent access of buffers,
.b bread()
must mediate the shared free buffer pool in order to grant exclusive
access to the desired buffer to its caller.
.b bread()
can block, since it is only called from the synchronous
top-half of the kernel. However, if it does
block, it must not hold any locked structures (except the buffer
it allocates, of course), nor hold references to any global data
structures across the block. (This is done to avoid deadlocks or circular
waiting.)
.uh "How is bread() Implemented on the 386?"
__listing1(/usr/src/kernel/kern/fs/bio.c,142)
.lp
.b getblk()
(***HYPERLINK WHAT IS GETBLK() - POPUP)
is used to obtain an appropriate buffer for the block requested, by
either finding it in the cache or freshly allocating an empty buffer.
__listing1(/usr/src/kernel/kern/fs/bio.c,145)
.lp
If the block was cached already, no further work needs to be
done. However, if the block was either not cached or the cached
block is not valid, it must be filled with contents.
__listing(/usr/src/kernel/kern/fs/bio.c,146,153)
.lp
A read operation is then selected and the associated bits are cleared
prior to starting the read. Operations with credentials have
their reference counts bumped for read and write references, which are
asserted in case a
.b bread()
block is later written using
.b bwrite().
(***HYPERLINK WHAT IS BWRITE() - POPUP)
__listing(/usr/src/kernel/kern/fs/bio.c,154,157)
.lp
If this process is traced, the read operation is recorded to
the trace file.
__listing(/usr/src/kernel/kern/fs/bio.c,158,160)
.lp
We tally the number of read requests, and
then perform each request by returning
to the filesystem from which we were called
and having it satisfy the request from whatever underlies the filesystem.
__listing(/usr/src/kernel/kern/fs/bio.c,161,165)
.lp
The process is suspended until the read request completes.
Then the status of the operation
and the buffer obtained are returned to the caller.
.uh "What is breada()?"
.pp
.b breada(),
a special case of
.b bread(),
(***HYPERLINK WHAT IS BREAD() - POPUP)
is another top-level function entry call for the
buffer cache. Like
.b bread(),
a filesystem's top-level code uses this function to access
a file's contents by performing a virtual read of the
combined cache and the lower level of the filesystem.
.pp
.b breada()
allows for double-buffered
read operations by starting consecutive operations as a side-effect. It
assumes that by the time the first operation has been consumed, the
second will be ready and the third will have been started.
If this perfect double buffering were maintained, data would appear
to be instantaneously available without delay, even if the file
read was larger than the buffer cache size.
More commonly, however,
the buffer cache simply overlaps two
read operations, reducing the synchronous wait to every other
.b breada()
request in the case where data is demanded faster than the drive can supply it.
.uh "What Does the Filesystem Expect of breada()?"
.pp
Like
.b bread(),
(***HYPERLINK WHAT IS BREAD() - POPUP)
.b breada()
causes a buffer to be created with the contents requested.
It either obtains this from the buffer cache, if present, or
obtains an empty buffer and fills it with the appropriate
contents by means of the lower-level mechanisms of the filesystem.
.b breada()
also has the "side-effect" of starting a read on the next buffer if it is not
in the buffer cache.
__listing(/usr/src/kernel/kern/fs/bio.c,172,174)
.lp
.b breada()
has seven arguments: the logical parameters of vnode pointer; logical
block offset; size of transfer; the read-ahead block offset;
the read-ahead block size; the credentials of the transfer; and the
pointer to the buffer read in this operation.
The credentials of the requester are passed in the
buffer to associate any lower-level I/O operations.
.pp
Like
.b bread(),
this function will return a buffer at the location specified
("*bpp"), as well as
any error value encountered in performing the operation. The buffer is
returned locked for exclusive
use, and must be returned to the buffer cache either for further use by other
clients of this block of the file or for the cache to recycle its buffer
memory for further use.
.pp
Since the units of the logical block address are
unknown to the buffer cache, it is not necessarily the case that
the next block is one plus the previous (it could be 2 or 4 more, for example).
Also, since the block size can be variable in the filesystem, it is
not predictable by the buffer cache (as the read-ahead block may be the
last block, usually a fragment).
.pp
Since the filesystem code is unaware of concurrent access of buffers,
.b breada()
(like
.b bread())
must mediate the shared free buffer pool in order to
gain exclusive access to the desired buffer for its caller.
.b breada()
can block, since it is only called from the
synchronous top-half of the kernel. However, if this function
does block, it must not hold any locked structures (except the buffer
it allocates, of course) nor hold references to any global data
structures across the block.
.uh "How is breada() Implemented on the 386?"
__listing1(/usr/src/kernel/kern/fs/bio.c,179)
.lp
The
.b getblk()
(***HYPERLINK WHAT IS GETBLK() - POPUP)
function is used to obtain an appropriate buffer for the block requested,
either by finding it in the cache or by freshly allocating an empty buffer.
__listing1(/usr/src/kernel/kern/fs/bio.c,182)
.lp
If the block was cached, no further work needs to be
done. However, if the block was either not cached or the cached
block is not valid, it must be filled with contents.
__listing(/usr/src/kernel/kern/fs/bio.c,183,190)
.lp
A read operation is selected and the associated bits are cleared
prior to starting the read.
Operations with credentials have
their reference counts bumped for read and write references, which are
asserted in case a
.b breada()
block is later written with
.b bwrite().
(***HYPERLINK WHAT IS BWRITE() - POPUP)
__listing(/usr/src/kernel/kern/fs/bio.c,191,194)
.lp
If this process is traced, the read operation is recorded to the trace file.
__listing(/usr/src/kernel/kern/fs/bio.c,195,197)
.lp
The number of read requests is next tallied.
We then perform the requests by going back to the filesystem from
which we were called
and having it satisfy the request from whatever underlies the filesystem.
__listing(/usr/src/kernel/kern/fs/bio.c,198,199)
.lp
At this point the implementation of
.b breada()
diverges from that of
.b bread().
Instead of waiting for the read
operation to complete, we simply note that an operation is outstanding.
__listing(/usr/src/kernel/kern/fs/bio.c,201,207)
.lp
We then check to see if the next block is already in the cache.
If it is not, we arrange to schedule an asynchronous read by
obtaining a new buffer using
.b getnewbuf().
(***HYPERLINK WHAT IS GETNEWBUF() - POPUP)
If none was available and
.b getnewbuf()
was forced to wait (as conditions may be difficult), we need to go
back and look in the cache again.
__listing(/usr/src/kernel/kern/fs/bio.c,208,209)
.lp
We then associate the buffer with this vnode and read-ahead block offset. To
signal that we haven't calculated the underlying address for this block,
the lower-level address for the buffer is set to the logical address. (The
.b bgetvp()
function in
.b fs/vnode.c
associates this buffer with this vnode.)
__listing(/usr/src/kernel/kern/fs/bio.c,210,214)
.lp
We insert the buffer into the hash table so that it can be found by
.b incore().
(***HYPERLINK WHAT IS INCORE() - POPUP)
The buffer is held busy so that if others find
it, they must first wait for it to be released before using it.
__listing(/usr/src/kernel/kern/fs/bio.c,217,230)
.lp
We request that an asynchronous read be performed and that the buffer be
released to the "aged" free queue.
Because the buffer has not yet been requested, we consider it aged
and a candidate for reuse; otherwise, we handle this just like a
.b bread()
without a
.b biowait().
(***HYPERLINK WHAT IS BIOWAIT() - POPUP)
.lp
Both read and write credentials for the read-ahead block are obtained.
Tracing is then accounted for (if desired) and it is recorded that the transfer
has occurred against this particular process. We then perform the transfer
via the
.i vop_strategy
operation.
__listing(/usr/src/kernel/kern/fs/bio.c,234,245)
.lp
If the read-ahead block was in the cache, we prolong its life a bit more.
If the age bit is still set and the buffer is in use, we clear
the age bit (so it won't be aged).
However, if it is not in use, we force it to the tail of the age queue so that
it won't be reused before being referenced.
__listing(/usr/src/kernel/kern/fs/bio.c,248,252)
.lp
Finally, we wait for the result if the original operation required a
read operation.
The buffer used for the requested read operation is returned to the
supplied address of the caller, and the status of the successful
or failed operation is also returned.
.uh "What is bwrite()?"
.pp
.b bwrite()
takes a previously busy buffer obtained from
.b getblk(),
(***HYPERLINK WHAT IS GETBLK() - POPUP)
.b bread(),
(***HYPERLINK WHAT IS BREAD() - POPUP)
or
.b breada(),
(***HYPERLINK WHAT IS BREADA() - POPUP)
and forces it to be immediately written to the underlying storage
below the current filesystem.
Like
.b bread(),
.b bwrite()
waits for the operation to finish, and returns the status
of the operation as a value. The buffer passed is then released
back to the freelist for use by other clients of the buffer cache.
.uh "What Does the Filesystem Expect of bwrite()?"
.pp
.b bwrite()
forces the buffer to be immediately written and the invoked
operation synchronously completed before the function returns.
Delayed writes are taken into account by this function as well.
__listing(/usr/src/kernel/kern/fs/bio.c,259,260)
.lp
.b bwrite()
has one argument: the buffer to be synchronously written. It returns
the success or failure of the write operation.
Since
.b bwrite()
will block waiting for completion of the I/O operation, it can
only be called from the top half of the kernel.
.uh "How is bwrite() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,264,267)
.lp
If the supplied buffer's contents are invalid, no operation is performed
and the buffer is immediately released for reuse on the free list.
__listing(/usr/src/kernel/kern/fs/bio.c,270,279)
.lp
If the buffer is not currently marked "busy", then
.b bwrite()
panics because it does not have exclusive use of the buffer. In general,
a block obtained this way has already been associated with a vnode
for exclusive use -- thus, this is an "impossible" error.
.lp
If the buffer has already been virtually written by a delayed write
operation, the buffer is reassociated with the vnode of the file to which
it belongs and its contents are marked "aged" so that
they may be reclaimed more quickly and not dwell on the free list.
Otherwise, an output operation is accounted for by the process that calls
.b bwrite().
Note that the output operation for a delayed write
was already accounted for in the process where the delayed write
occurred.
__listing(/usr/src/kernel/kern/fs/bio.c,281,282)
.lp
The block is marked as a "dirty" block that must be cleaned before
this vnode may be reused, and the number of output operations in progress
on this vnode is increased. This counter keeps the vnode from being
recycled or altered prior to all of its operations completing.
__listing(/usr/src/kernel/kern/fs/bio.c,288,292)
.lp
The output operation is engaged for this filesystem by calling its
vnode strategy operation, and the operation is waited upon until it completes.
After completion, the buffer is released back into the free buffer cache
for potential reuse by other processes, and the status of the successful
or failed operation is returned to the caller. Note that the buffer has been
returned to the buffer cache and is no longer usable by the caller.
.uh "What is bdwrite()?"
.pp
Like
.b "bwrite()",
(***HYPERLINK WHAT IS BWRITE() - POPUP)
.b bdwrite()
is used to return the buffer back to the buffer
cache with the contents written back to the file. However, unlike
.b bwrite()
(which immediately writes the buffer back and waits for it),
.b bdwrite()
does a
.i lazy
write by marking the buffer to be written before reuse, releasing it to
the free list, and then returning.
.uh "What Does the Filesystem Expect of bdwrite()?"
.pp
.b bdwrite()
virtually writes the buffer into the buffer cache. The function
in effect disassociates the operation from the file to which it is attached
so that the buffer may "dwell" for a period of time, until the
buffer space needs to be reused for another purpose in the buffer cache
or until the block is requested by another process to be read or written.
__listing(/usr/src/kernel/kern/fs/bio.c,306,307)
.lp
.b bdwrite()
has two arguments: the buffer to be written and the vnode with which it
is associated. It returns nothing.
The side-effect of this function
is to virtually write the buffer into the buffer cache.
.lp
In order for the buffer to be written, it must be exclusive to the process
performing the write.
.b bdwrite()
occurs in the context of the writer. However,
the operation for performing the write (when it actually occurs) may come
from yet another process, or from no process at all (as the process may have
terminated by the time the actual I/O operation is performed).
.uh "How is bdwrite() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,310,320)
.lp
If the buffer is not marked for exclusive use (busy),
.b bdwrite()
panics, as it has been called on a nonexclusive buffer. If the information
in the buffer is invalid, the buffer is immediately released, since there is no
write operation to be performed on this buffer, and it is returned to the
free list for reassignment.
.lp
If the buffer is associated with a tape device upon which a write
should be performed, a synchronous write is substituted for the delayed
write. Note that the return value of the write operation is thrown away, as
.b bdwrite()
does not expect anything but success. However, the error
is still recorded in the buffer's structure for later examination.
__listing(/usr/src/kernel/kern/fs/bio.c,322,330)
.lp
The buffer's flags are changed to indicate that a write operation
is pending but has not yet been done for this buffer. If this is the first
delayed write that has occurred, the process is charged with a single
physical write (that may occur later) and the buffer is marked as a delayed
write with the dirty bit. The buffer is then assigned to the vnode
associated with this operation. Note that multiple delayed writes may
occur, as partial writes to the buffer associated with the vnode may
successively fill the contents of this buffer with many
.b bdwrite()'s.
The buffer is then released back to the free list, where it may later be found
associated with this vnode.
__listing(/usr/src/kernel/kern/fs/bio.c,332,341)
.lp
An optional portion of this
.b bio.c
implementation incorporates a mechanism to allow only a single
outstanding delayed write buffer per vnode.
(This is an example of how to deal with the so-called
core file problem discussed earlier -- see
.b "Delayed Write and Read-ahead Operations"
(**HYPERLINK TO SECTION).)
Delayed writes are controlled and kept from overrunning the contents
of the buffer cache by delaying the top-level application program
when it attempts to write on more than one buffer. (This is not normally
done because this mechanism does not discriminate in favor of the
default temporary file characteristics implemented in this
file with delayed writes.)
.lp
The buffer is checked to see if it is the only block
present on the chain of dirty blocks attached to the vnode. If there
is more than one dirty block present, a
.b bawrite()
(with an embedded sleep operation) is forced by calling the
vnode
.b vflushbuf()
operation (see
.b fs/vnode.c)
for all but the current outstanding delayed write buffer. Note that in
the case where a preponderance of temporary file operations occur on the
system, this operation will actually slow the system down by reducing the
amount of write-behind (delayed writes). However, it will speed up a system
that had previously been dominated by cache flooding. It is the lack
of information for deciding whether or not to issue this
operation that keeps this portion of the implementation conditional.
.uh "386BSD bdwrite() Design Choices and Trade-Offs"
.pp
.b bdwrite()
presumes that the buffer is going to be reused, and so procrastinates.
This method works well when a buffer is
fractionally written in succession or when the buffer cache can hold
the entire contents of a temporary file that gets reused and removed
prior to being rewritten. In the second case, the I/O can be entirely avoided,
as the file ceases to exist before the buffers ever age out.
This is the principle upon which log-based filesystems
(like LFS) rely. Sequential devices
like tapes cannot postpone or reorder writes, so delayed writes to such devices
are turned into immediate writes.
.uh "What is bawrite()?"
.pp
Like
.b bwrite(),
(***HYPERLINK WHAT IS BWRITE() - POPUP)
.b bawrite()
takes a previously busy buffer obtained from
.b getblk(),
(***HYPERLINK WHAT IS GETBLK() - POPUP)
.b bread(),
(***HYPERLINK WHAT IS BREAD() - POPUP)
or
.b breada(),
(***HYPERLINK WHAT IS BREADA() - POPUP)
and forces it to be immediately written to the underlying storage
below the current filesystem. However, unlike
.b bwrite(),
no wait for status is done.
.uh "What Does the Filesystem Expect of bawrite()?"
.pp
.b bawrite()
performs the write operation on the buffer and then
immediately returns. No wait for the status is done. Instead, the
buffer is marked to be released when the write operation completes.
In this case, no status information can be conveyed back to the
writer, because the operation is complete from the writer's perspective
before it can produce a status.
Only the next reference to the buffer can discover the status of the
buffer following a
.b bawrite()
(or, analogously, the read-ahead blocks scheduled by
.b breada()).
__listing(/usr/src/kernel/kern/fs/bio.c,349,350)
.lp
.b bawrite()
has one argument: the buffer being written. It returns nothing.
.pp
Note that by assigning a procedure,
.b b_iodone(*)(),
to be called on conclusion of the I/O
operation and setting the call bit (B_CALL), a
call from interrupt level may be made to a function to indicate that an
operation has been completed, so that the status may be checked on an
asynchronous write. Note also that buffers remain locked from view
during any I/O operations, so that even subsequent read operations
will be delayed until the asynchronous write completes.
.uh "How is bawrite() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,353,358)
.lp
The buffer is checked to see that it is exclusively accessed by checking
for the busy (B_BUSY) bit; otherwise, the function panics, as the buffer must
be obtained for exclusive access for this function to be called.
If the buffer being written has the invalid flag set, the buffer is immediately
released without any I/O performed on it, as it has no valid contents.
__listing(/usr/src/kernel/kern/fs/bio.c,361,371)
.lp
If the buffer was previously a delayed write buffer, it is reassigned to the
vnode and aged, as the delayed write is finally converted into an
asynchronous write. In this case, since the write operation
had previously been accounted for, it is not accounted for again upon the
delayed write conversion; otherwise, it is accounted for on an initial
.b bawrite().
.lp
The buffer's flags are changed to be appropriate for a buffer
about to be written which has not had any I/O done to it yet and for which
no errors have been found. Since the buffer is being written, the dirty
bit and the asynchronous flag are set for this buffer, and the number
of write transactions to the vnode is increased.
__listing(/usr/src/kernel/kern/fs/bio.c,372,377)
.lp
If the process is tracing
.b bio.c
operations, a trace record is generated
to indicate that either an asynchronous write or a delayed write
converted to an asynchronous write is occurring for this vnode's block.
__listing(/usr/src/kernel/kern/fs/bio.c,378,379)
.lp
The write operation is requested via a vnode operation on the filesystem
associated with this buffer's vnode, and the function is complete. The
caller can either
.b biowait()
(***HYPERLINK WHAT IS BIOWAIT() - POPUP)
(in the case that many asynchronous operations
are started before we wait for any of them) or reacquire the
block later (when necessary) to determine the status of the operation.
.uh "What is brelse()?"
.pp
.b "brelse()"
relinquishes exclusive access to a buffer and returns it
to the global free buffer pool for potential use by other clients
of the buffer cache.
.uh "What Does the Filesystem Expect From brelse()?"
.pp
.b brelse()
passes exclusive use of the buffer back to the kernel's
free pool. Anyone waiting for this particular buffer, or for a free
buffer in general, is then notified.
__listing(/usr/src/kernel/kern/fs/bio.c,386,387)
.lp
.b brelse()
has one argument: the buffer to be released. It returns nothing.
The side-effect of this function is that the buffer is placed in the free pool.
.pp
This function is an artifact of
how exclusive access is maintained in the buffer cache.
If any other clients of the buffer cache either
desire this buffer or are waiting for any buffer for reuse, they
will be awakened and allowed to compete for access to the desired
resource (since there may be many waiting for exclusive access to
a single buffer).
Buffers are directed to free lists that determine
the policy for reclamation and reuse with a different vnode/offset.
.uh "How is brelse() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,391,398)
.lp
If kernel tracing is present, a trace record is written that indicates
whether the buffer is being invalidated, modified, or simply released
for reuse. Note that as currently implemented, the tracing facility can
only write records from the top level -- thus, if
.b brelse()
is called from interrupt level
(as occurs from
.b biodone()
at the conclusion of an asynchronous operation), it will not be recorded,
because of the block that might result. This is currently finessed in
.b ktrbio()
(see
.b opt/ktrace/ktrace.c)
as a special case, as the trace package may later be changed to remove
the current restriction.
__listing(/usr/src/kernel/kern/fs/bio.c,400,416)
.lp
To avoid contention for the busy bit, all block I/O devices are masked
from interrupts (primarily because
.b biodone()
invokes
.b brelse()).
The age queue is examined to see if its associated wanted bit is set
(indicating that an "any buffer wanted" request is present). If it is,
the bit is cleared and a wakeup issued for those processes still waiting
for any buffer to become available. Note that only the first of the possibly
many processes waiting for a free buffer will get the newly freed buffer,
with the others returned to the queue to wait for another buffer to become
available. Thus, there is no guarantee of when a buffer will
be assigned to a particular process.
.lp
Similarly, if the wanted bit is set in this particular buffer header's flags,
a wakeup is issued for the buffer header and the bit cleared.
If the
buffer has an error, invalid contents, or has been specifically disallowed
from being cached by the filesystem, its contents are rendered meaningless
and it is disassociated from any vnode to which it might be attached.
__listing(/usr/src/kernel/kern/fs/bio.c,418,437)
.lp
The buffer flags indicating cache status and aging are erased prior to
enqueuing the buffer on the appropriate one of the four free lists. If the
buffer corresponds only to a header and has no contents whatsoever, it
is passed back to the empty queue for use when assigning new storage.
Otherwise, if the buffer contains no valid contents, it is inserted
at the beginning of the age queue, which indicates that it will be the
first to be reused when a new buffer needs to be allocated.
.lp
If the buffer has aged but not been found in the cache, it will be installed
at the tail of the age queue, where it will be reused before the rest of
the contents of the buffer cache but after any invalid or empty buffers that
could be constituted into new buffers are used. Aged buffers are
ones that have either not demonstrated a reason for holding a place
in the cache or have lost their reason for being cached, as determined
by the filesystem's use of the buffer. All other buffers are placed
on the LRU queue awaiting reuse.
.lp
In general, as buffers are found and released back to their free lists,
they tend to migrate towards the head of the particular queue
they reside upon; thus, the less recently referenced entries move towards
the front of the queue. No indication is kept
of how frequently a buffer is reused, but one is kept of the last time it was
reused. Thus, we use an LRU algorithm.
.uh "What is getnewbuf()?"
.pp
.b getnewbuf()
constitutes a new buffer by acquiring a buffer header and
contents.
.uh "What Does the Buffer Cache Expect From getnewbuf()?"
.pp
.b getnewbuf()
obtains a buffer and allocates for it contents of
.i sz
bytes. These come either from an
undedicated resource (empty buffer headers and underutilized buffer
contents space) or by recycling a free buffer and adjusting it
to suit the desired size. If the recycled buffer is a delayed write,
the write is forced before reuse is allowed.
.pp
If no buffer is available for use,
the wanted condition is signalled and the
process sleeps until a free buffer becomes available. At that point, the
function returns a null buffer pointer to signal that a sleep has
occurred. Because conditions may have changed during the sleep (the desired
buffer may have been allocated to another racing process), any prior uses of
.b incore()
(***HYPERLINK WHAT IS INCORE - POPUP)
may be stale.
__listing(/usr/src/kernel/kern/fs/bio.c,450,451)
.lp
.b getnewbuf()
has one argument: the size of the contents allocated with the buffer
header. It returns a pointer to the allocated buffer header, or zero
if no buffer header could be allocated.
.pp
.b getnewbuf()
returns a buffer that is entirely unassociated
with the buffer cache. It is up to the caller either to associate the
buffer with a vnode/offset so that it can be found, or to make
temporary use of the buffer and return it to a free list. In its
unassociated state, the buffer plays no role in the global view of buffer state.
.uh "How is getnewbuf() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,456,474)
.lp
Buffers are first constituted from the empty list of buffer headers, as long
as space is allocatable in the amount of memory reserved for buffer space
and an empty header is available.
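.lp
The constitution step just described can be sketched in C. The sketch is a
simplified model under assumed names: the freebufspace and allocbufspace
globals mirror the variables discussed below, but the structure layout and
the use of the C library's malloc() in place of the kernel allocator are
stand-ins for illustration.

```c
#include <stdlib.h>

/* Space accounting for the buffer cache: total reserved here is an
 * assumed figure for the sketch. */
static long freebufspace  = 8192;   /* space still available for buffers */
static long allocbufspace = 0;      /* space currently held by buffers */

struct buf {
    char *b_data;
    long  b_bufsize;
};

/* Model of the empty-list path in getnewbuf(): a buffer may only be
 * constituted while reserved buffer space remains; otherwise the caller
 * must recycle an existing buffer instead. */
static struct buf *constitute_from_empty(struct buf *empty_hdr, long sz)
{
    if (sz > freebufspace)          /* reserved buffer space exhausted */
        return NULL;
    empty_hdr->b_data = malloc(sz); /* kernel uses its own allocator */
    if (empty_hdr->b_data == NULL)
        return NULL;
    empty_hdr->b_bufsize = sz;
    freebufspace  -= sz;
    allocbufspace += sz;
    return empty_hdr;
}
```
.lp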
In general, memory not used for cache contents is always returned to the
primary memory allocator
.b malloc()
(***HYPERLINK WHAT IS MALLOC IN MALLOC.C - POPUP)
(see
.b kern/malloc.c)
(***HYPERLINK MALLOC.C TOC - MACRO)
for generic use in the kernel. This is the sole purpose of
the empty buffer free list.
.lp
Because buffers may have their status changed from exclusive to nonexclusive
at interrupt level (see
.b biodone()),
(***HYPERLINK WHAT IS BIODONE - POPUP)
all block device interrupts are masked
to prevent contention for buffers at interrupt level (as
.b brelse()
(***HYPERLINK WHAT IS BRELSE - POPUP)
may change the contents of the free lists).
.lp
If a new buffer can be constituted, memory is allocated from the memory
allocator and assigned to a formerly empty buffer header detached
from the empty free list of buffers and marked for exclusive use with
invalid contents (since it has yet to be filled). Two global variables,
.i allocbufspace
and
.i freebufspace
record the amount of buffer space present in the buffer cache
and the amount remaining for use by new buffers, respectively. The remaining
portions of the buffer header are filled in prior to passing it back
to the caller.
__listing(/usr/src/kernel/kern/fs/bio.c,476,489)
.lp
If all buffer space is in use or no empty buffer headers are available,
the other free lists are checked to find a buffer for reuse. The age
queue is consulted before the LRU queue. If either queue has no entries,
its head will point back to itself (indicating empty status);
otherwise, the first buffer at the head of the queue will be chosen
for use and removed from that free list.
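.lp
The free-list scan just described can be sketched in C. This is a model of
the control flow only: the circular-list layout, with an empty queue's head
pointing back to itself, follows the description above, but the structure
names are assumptions rather than the kernel's actual queue declarations.

```c
#include <stddef.h>

struct buf {
    struct buf *b_forw;     /* forward link on a circular free list */
};

struct bufq {
    struct buf head;        /* head.b_forw == &head means the queue is empty */
};

/* Model of getnewbuf()'s reuse scan: the age queue is consulted before
 * the LRU queue, and the buffer at the head of the first nonempty queue
 * is chosen.  NULL means every buffer is busy. */
static struct buf *scan_free_lists(struct bufq *age, struct bufq *lru)
{
    if (age->head.b_forw != &age->head)
        return age->head.b_forw;    /* aged buffers are reused first */
    if (lru->head.b_forw != &lru->head)
        return lru->head.b_forw;    /* then the least recently used */
    return NULL;                    /* all buffers busy */
}
```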
.lp
If none of the buffer free queues has a buffer that can be used (all
buffers are busy), the age queue's buffer queue header is marked with
a wanted bit to indicate that any buffer header is requested. The
process then sleeps waiting for
.b brelse()
to free a buffer. Upon being awakened,
the function returns to its caller to indicate that it
could not find a buffer. In this case, the caller must examine the buffer
cache again to see if another process has already allocated a buffer for the
vnode and block offset the caller was requesting. This
procedure is followed, rather than immediately branching back to
the beginning of the search for a buffer, because many of the assumptions
of the caller may have changed during the sleep.
__listing(/usr/src/kernel/kern/fs/bio.c,492,496)
.lp
If the buffer selected for reuse is one that has a delayed write
outstanding on it, the buffer is acquired for exclusive access by
asserting its busy bit, and the delayed write is transformed into an
asynchronous write by calling
.b bawrite().
(***HYPERLINK WHAT IS BAWRITE() - POPUP)
This forces a stale delayed write out upon reuse.
.lp
After starting the write operation, the function restarts at its
beginning to search for another buffer or, if all buffers are busy,
to suspend waiting for a free buffer until the asynchronous write completes.
Once freed of its contents (written back to disk),
the cleaned buffer is inserted onto a free list, where it
may be allocated by this routine.
__listing(/usr/src/kernel/kern/fs/bio.c,498,502)
.lp
If kernel tracing is enabled for this process, a recycle record is
written to indicate that the buffer has been assigned to the new vnode,
replacing its old contents.
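.lp
The delayed-write check described above can be sketched in C. This is a
simplified model, not the kernel source: bawrite() here is a counting stub,
and the flag values are assumptions for the sketch.

```c
#define B_BUSY   0x01
#define B_DELWRI 0x02   /* delayed write outstanding */
#define B_ASYNC  0x04

struct buf { int b_flags; };

static int writes_started;      /* counts asynchronous writes begun */

/* Counting stub for bawrite(): start an asynchronous write of bp. */
static void bawrite_stub(struct buf *bp)
{
    bp->b_flags |= B_ASYNC;
    writes_started++;
}

/* Model of the recycling check in getnewbuf(): a buffer with a delayed
 * write outstanding is claimed busy and written out asynchronously, and
 * the caller must restart its scan rather than reuse this buffer now.
 * Returns 1 if the scan must restart, 0 if the buffer is directly usable. */
static int force_delayed_write(struct buf *bp)
{
    if (bp->b_flags & B_DELWRI) {
        bp->b_flags |= B_BUSY;      /* acquire exclusive access */
        bp->b_flags &= ~B_DELWRI;
        bawrite_stub(bp);           /* force the stale contents out */
        return 1;                   /* restart the search */
    }
    return 0;
}
```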
__listing(/usr/src/kernel/kern/fs/bio.c,504,528)
.lp
The buffer is prepared for reuse by disassociating it from any vnode
it may have been associated with and removing any credentials
attached by prior operation. Its contents are marked
invalid and busy, and it is removed from the hash lookup queues so that it
will not be found by
.b incore(),
since it does not yet have a sensible name and address (vnode and block
number) associated with it.
.lp
The interrupt mask is removed and all remaining operations are performed
without holding the interrupts masked. The buffer is associated with
no device and no vnode or logical block, and other fields in the buffer
header are similarly initialized to defaults to avoid unpredictable behavior
in functions calling this routine that do not fill them out completely.
The buffer is adjusted so that the size of its contents
.i b_bufsize
matches the requested size
.i sz
by calling the function
.b allocbuf().
The buffer pointer is then returned as a successfully allocated
buffer header and contents.
.uh "What is getblk()?"
.pp
.b getblk()
returns a buffer of the appropriate vnode, offset, and size
to the caller.
.uh "What Does the Buffer Cache Expect From getblk()?"
.pp
.b getblk()
locates a buffer with valid contents in the buffer cache; otherwise,
the function creates a buffer with empty (invalid) contents.
__listing(/usr/src/kernel/kern/fs/bio.c,539,540)
.lp
.b getblk()
has three arguments: the vnode, the logical block number of the contents
sought, and the size of the buffer to be allocated. It returns a pointer
to the buffer associated with this vnode and contents.
.pp
.b getblk()
may block awaiting allocation of a buffer, or waiting for
a buffer that is in use by another process.
The function
does not read the buffer's contents from the underlying storage;
.b bread()
(***HYPERLINK WHAT IS BREAD- POPUP)
or
.b breada()
(***HYPERLINK WHAT IS BREADA- POPUP)
are used for that purpose.
.b getblk()
only reads the contents of the cache itself.
.uh "How is getblk() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,545,548)
.lp
The function endlessly loops searching for a buffer in the cache. It
uses the inline function
.b incore()
(***HYPERLINK WHAT IS INCORE - POPUP)
in its search, consulting a hash table to find
the buffer in the buffer cache associated with the requested vnode and logical
block number. If the block is found, the processor's interrupts are masked
to prevent a race with the buffer's mutual exclusion bit (B_BUSY), which
is examined to determine if the buffer is in current use by another process.
__listing(/usr/src/kernel/kern/fs/bio.c,549,553)
.lp
If kernel tracing is enabled, a trace record is written to indicate that
a busy buffer has been found for this
.b getblk()
operation.
__listing(/usr/src/kernel/kern/fs/bio.c,554,559)
.lp
The age bit associated with the buffer (if present) is cleared to ensure
that the buffer will not be released onto the age queue, because we already
know it will be reused. The wanted bit is then set for this buffer header
so our process will be awakened when the buffer becomes available.
We then sleep waiting for the buffer to become available.
.lp
Upon conclusion of the sleep, the interrupt mask is removed and the
function is reentered by again checking the
.b incore()
status of this buffer (since it may have
actually been invalidated and thus is no longer present in the cache).
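.lp
The overall control flow of this loop can be sketched in C. This is a model,
not the kernel source: the one-slot "hash table" and the stubs for incore(),
getnewbuf(), and the sleep are assumptions made so the flow is
self-contained, and the sleep stub simply pretends the owner released the
buffer.

```c
#include <stddef.h>

#define B_BUSY   0x01
#define B_WANTED 0x02

struct buf { int b_flags; };

/* One-slot stand-in for the buffer hash table; the real cache hashes
 * vnode/block pairs over many buffers. */
static struct buf cache_slot;
static int cache_valid;

static struct buf *incore_stub(long blkno)
{
    (void)blkno;
    return cache_valid ? &cache_slot : NULL;
}

static struct buf *getnewbuf_stub(long sz)
{
    (void)sz;
    cache_slot.b_flags = B_BUSY;    /* new buffers come back busy */
    return &cache_slot;
}

/* Stand-in for sleeping on the buffer address: pretend the owner
 * released the buffer while we slept. */
static void sleep_on(struct buf *bp)
{
    bp->b_flags &= ~(B_BUSY | B_WANTED);
}

/* The control flow of getblk() as described in the text: look in the
 * cache; if the buffer is busy, set the wanted bit, sleep, and recheck
 * (it may have been invalidated); if absent, constitute a new buffer
 * and enter it into the cache under its new name. */
static struct buf *getblk_sketch(long blkno, long sz)
{
    struct buf *bp;

    for (;;) {
        if ((bp = incore_stub(blkno)) != NULL) {
            if (bp->b_flags & B_BUSY) {
                bp->b_flags |= B_WANTED;    /* ask to be awakened */
                sleep_on(bp);
                continue;                   /* recheck from the top */
            }
            bp->b_flags |= B_BUSY;
            return bp;                      /* found in the cache */
        }
        if ((bp = getnewbuf_stub(sz)) == NULL)
            continue;   /* slept: another process may have raced us */
        cache_valid = 1;                    /* hash it in under blkno */
        return bp;
    }
}
```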
__listing(/usr/src/kernel/kern/fs/bio.c,560,564)
.lp
If kernel tracing is enabled, a reference trace record is written to indicate
that the buffer has been rereferenced without being busy.
__listing(/usr/src/kernel/kern/fs/bio.c,565,570)
.lp
The buffer's age bit is removed to indicate that this buffer has
proven its worth, and the buffer is then removed from the free list. Its
size is adjusted according to the size required by this call to
.b getblk().
Note that this may alter the contents of the buffer. All the
cases dealing with a found buffer have now been dealt with.
__listing(/usr/src/kernel/kern/fs/bio.c,571,581)
.lp
Since no extant buffer has been found,
a new buffer header and contents are obtained by calling
.b getnewbuf().
(***HYPERLINK WHAT IS GETNEWBUF() - POPUP)
If no buffer is available,
.b getnewbuf()
blocks and then returns zero.
Since a buffer may have been created at the same location by another process
during the pause, this function restarts itself from the beginning.
If no buffer is found, it again calls
.b getnewbuf()
and the procedure repeats.
.lp
When a buffer is available, it is assigned to the logical block
associated with this vnode. Because the physical block associated with
the logical block has not yet been determined, the physical and logical
block fields are set identical, signalling that translation needs to be done.
All block devices have their interrupts masked while the buffer is
inserted into the hash table and marked for exclusive use by setting
its busy flag.
.lp
Whether the block was found or had to be created,
we now have a buffer in the buffer cache associated with
the vnode and its offset. The interrupts are unmasked and
the buffer returned, terminating the loop.
.uh "What is geteblk()?"
.pp
.b geteblk()
allocates an empty buffer from the buffer cache that will
not be associated with a vnode. It is used to allocate
buffers for special purposes (usually those not associated with the
filesystem).
.uh "What Does the Buffer Cache Expect From geteblk()?"
.lp
.b geteblk()
isolates a buffer from the buffer cache and returns after it is allocated.
__listing(/usr/src/kernel/kern/fs/bio.c,591,592)
.lp
.b geteblk()
has one argument: the size of the buffer requested. It returns a buffer
when one has been allocated.
.pp
In earlier Unix systems that had no general purpose
memory allocator, buffers were used in its stead
to gain control of a portion of memory for temporary use.
In the current release of 386BSD,
.b geteblk()
is present here only for compatibility with
older systems and to allow the
.b disklabel
feature of the UFS filesystem to function.
.uh "How is geteblk() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,597,603)
.lp
.b getnewbuf()
is called repeatedly until it yields a free buffer of the requested size.
Before the buffer is passed back, it is inserted into a free list as a
defensive measure: if the caller should clear the busy bit without properly
releasing the buffer, the header would otherwise not be on any free list,
corrupting the assumption that released buffers always reside on one.
Since the contents of the queues and
hash tables may be altered by interrupt level calls to
.b biodone() (which
(***HYPERLINK WHAT IS BIODONE - POPUP)
calls
.b brelse()),
(***HYPERLINK WHAT IS BRELSE - POPUP)
the interrupts are masked during manipulation of the hash table.
The buffer is then returned to the caller.
.uh "What is allocbuf()?"
.pp
.b allocbuf()
adjusts the buffer size to the desired size while preserving
buffer contents.
It is used by both the buffer cache and filesystem
clients to revise the size of a buffer already
present in the buffer cache when its contents expand or contract.
.uh "What Does the Buffer Cache Expect From allocbuf()?"
.pp
.b allocbuf()
allocates a region of memory of the appropriate
size and exchanges it for the one the buffer had before, freeing the
old contents. Before freeing the old contents, it copies as much of the
buffer as will fit.
__listing(/usr/src/kernel/kern/fs/bio.c,616,617)
.lp
.b allocbuf()
has two arguments: the pointer to the buffer header and the size of the data
contents that the buffer should contain. It returns nothing. The side effect
of this function is that it adjusts the size of the buffer contents.
.uh "How is allocbuf() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,622,636)
.lp
.b allocbuf()
first calls
.b malloc()
(***HYPERLINK WHAT IS MALLOC() IN MALLOC.C - POPUP)
to obtain a new portion of memory of the appropriate
size, and copies the old contents into place in the new buffer space.
The function takes care to copy only as much of the old buffer contents
as will fit and is present in the old buffer. It then frees the contents
of the old buffer.
.lp
Note that
.b allocbuf()
may block waiting for
.b malloc()
to provide it with
the new buffer contents. This differs from the companion allocation in
.b getnewbuf(),
which cannot block, since
.b getnewbuf()
may need to constitute a buffer for a
pagein operation when memory is resource-restricted and the system
cannot risk deadlock by blocking.
.lp
After obtaining the new buffer resource,
.b allocbuf()
adjusts for the change
in buffer size against free buffer space and allocated buffer space.
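.lp
The allocate-copy-free-adjust sequence just described can be sketched in C.
This is a simplified model under stated assumptions: the C library's
malloc()/free() stand in for the kernel allocator, error handling is elided,
and only the fields named in the text appear in the structure.

```c
#include <stdlib.h>
#include <string.h>

static long freebufspace;   /* remaining buffer space (soft limit) */

struct buf {
    char *b_data;
    long  b_bufsize;        /* allocated size; only allocbuf changes this */
    long  b_bcount;         /* used portion; filesystems may change this */
};

/* Model of allocbuf(): obtain a region of the new size, copy as much of
 * the old contents as fits, free the old region, and charge the change
 * against the space accounting (which may go negative: a soft limit). */
static void allocbuf_sketch(struct buf *bp, long sz)
{
    char *newdata = malloc(sz);             /* allocation may block/fail;
                                             * error handling elided */
    long  keep = bp->b_bufsize < sz ? bp->b_bufsize : sz;

    memcpy(newdata, bp->b_data, keep);      /* only what fits and exists */
    free(bp->b_data);
    freebufspace -= sz - bp->b_bufsize;     /* soft limit check only */
    bp->b_data = newdata;
    bp->b_bufsize = sz;
}
```

When the buffer grows, the bytes beyond the copied region are left
uninitialized, matching the text's warning that the caller must clear any
expanded region itself.
.lp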
Note
that, intentionally, no check is made for overage of allocated buffer space
(in fact,
.i freebufspace
can actually become negative), since its effect will only
be felt for a short period of time. As such, this is a soft limit
and not a hard limit.
.lp
The buffer header is now updated to hold its new contents and size. Two fields
in the buffer header,
.i b_bufsize
and
.i b_bcount,
hold the allocated size of the buffer
and the currently used portion of the buffer, respectively.
Filesystems may alter the latter but never the former.
Note that in the case of expanding the buffer, there is no guarantee
as to the contents of the expanded region of the buffer. It is left up
to the caller to clear or otherwise initialize the buffer contents.
.uh "What is biowait()?"
.pp
.b biowait()
waits for the buffer to finish an outstanding I/O operation and returns
the status of the operation.
.uh "What Does the Buffer Cache Expect From biowait()?"
.pp
.b biowait()
suspends waiting for an I/O operation to complete on
the supplied buffer. After the buffer I/O operation is finished and
.b biodone()
(***HYPERLINK WHAT IS BIODONE() - POPUP)
is called,
.b biowait()
will obtain the error or success status and return it to the caller.
__listing(/usr/src/kernel/kern/fs/bio.c,645,646)
.lp
.b biowait()
has one argument: the pointer to the buffer header to be waited for.
It returns the success value (0) or the error code of the I/O
operation. If no error code is present, yet an error indication is
present on the buffer, the generic I/O error (EIO) will be returned.
.uh "How is biowait() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,650,657)
.lp
.b biowait()
masks interrupts for its operation, as it examines
bits in the flags set by interrupt service routines of block devices
and manipulates the contents of the hash tables and the state of the buffer.
The function checks for the presence of the done bit to indicate
that the operation has completed. If the done bit (set by
.b biodone())
is not present, it sleeps on the buffer header's address waiting
for I/O to complete. Note that the sleep is uninterruptible because there
is no way, with the current interface to block devices, to inform the
device driver to abort the I/O and return.
.lp
The significance of this uninterruptible wait is that if an I/O operation
is started on a file or block device and that device either loses the
interrupt or forgets to signal a
.b biodone(),
it is up to the device driver
to time out and abort the transfer so that the block can be cleared --
otherwise, the process would wait forever, as can occur when a floppy disk
is removed while a file is being read from it.
.lp
Some devices become ready immediately upon doing the I/O operation,
so that by the time
.b biowait()
is reached the done bit is already set. In this case, no sleep operation
need be performed and this phase is bypassed completely.
.lp
If the process that has called
.b biowait()
is being traced,
a trace record is written indicating that a wait has completed for a block.
__listing(/usr/src/kernel/kern/fs/bio.c,658,673)
.lp
The buffer header's flags are checked to see if an error is present
or if the error field in the buffer header holds an error code. If
so, the buffer header is evaluated to return an error code and make
its contents consistent.
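.lp
The wait-then-report behavior of biowait() can be sketched in C. This is a
model, not the kernel source: the wait stub simply assumes the I/O completes
on the first check (as with devices that are ready immediately), and the
flag values are assumptions for the sketch.

```c
#include <errno.h>      /* EIO */

#define B_DONE  0x01
#define B_ERROR 0x02

struct buf {
    int b_flags;
    int b_error;        /* specific error code, if any */
};

/* Stand-in for sleeping on the buffer address: here the "I/O" is
 * assumed to complete immediately. */
static void wait_for_done(struct buf *bp)
{
    bp->b_flags |= B_DONE;
}

/* Model of biowait(): wait until biodone() posts the done bit, then map
 * the error state to a return value.  If the error flag is set without a
 * specific code, the generic EIO is substituted; if a code is present,
 * the flag is forced on so flag and field stay consistent. */
static int biowait_sketch(struct buf *bp)
{
    while ((bp->b_flags & B_DONE) == 0)
        wait_for_done(bp);          /* uninterruptible sleep in the kernel */
    if ((bp->b_flags & B_ERROR) || bp->b_error) {
        if (bp->b_error == 0)
            return EIO;             /* generic error in place of a code */
        bp->b_flags |= B_ERROR;     /* keep flag and field consistent */
        return bp->b_error;
    }
    return 0;                       /* success */
}
```
.lp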
If the contents of the buffer have not yet
been marked as invalid, the buffer is marked invalid and
allowed to persist on the age queue, where it may be reclaimed after
a short dwell time.
.lp
If no error code is present in the
.i b_error
field, a generic I/O error (EIO) is signalled in its place. Otherwise, the
B_ERROR bit is forced present in the buffer flags, so that whether the
.i b_error
field or the B_ERROR flag was set by the calling code, the
error will be properly signalled. In either the case of returning an
error or success, the interrupt mask is removed and the value of
success or error is returned to the caller.
.uh "What is biodone()?"
.pp
.b biodone()
signals the completion of an I/O operation on a buffer,
so that the exclusive user of the buffer can resume use of the
buffer after its operation has completed.
.uh "What Does the Buffer Cache Expect From biodone()?"
.pp
.b biodone()
is called by both interrupt level and top level code.
It processes completion of I/O from the perspective of the buffer
cache, checking to see if an optional function should be called to
do additional processing prior to either releasing a buffer that
has had asynchronous I/O performed on it or informing the kernel that the
I/O operation has been completed on the buffer.
.lp
If the operation is not being waited for (ASYNC), the buffer is
released onto the free lists. If subsystems (such as swap) would like
to be made aware
of I/O completion (asynchronous paging wants to reclaim resources
immediately), a function call is optionally processed.
.lp
If a dirty buffer has been cleaned, the vnode layer is informed so
as to move the buffer from a dirty list to a clean list. This
ensures that the file contents of the buffer cache are forced to disk upon
file synchronization (fsync) operations.
__listing(/usr/src/kernel/kern/fs/bio.c,683,684)
.lp
.b biodone()
has one argument: the pointer to the buffer header. It returns nothing.
The side effect of this function is to inform the buffer cache that
an I/O operation has been completed on the buffer header.
.pp
One possible alternative to this mechanism would be to have
.b biowait()
(***HYPERLINK WHAT IS BIOWAIT() - POPUP)
remove B_ASYNC and let all
.b bwrite()'s
(***HYPERLINK WHAT IS BWRITE() - POPUP)
become
.b bawrite()'s.
(***HYPERLINK WHAT IS BAWRITE() - POPUP)
.b biowait()
would then reacquire the buffers to recover their status information.
.uh "How is biodone() Implemented on the 386?"
__listing(/usr/src/kernel/kern/fs/bio.c,688,702)
.lp
Since
.b biodone()
is an interrupt level function that may affect the status
of buffer fields, all other block interrupts are masked during its
operation. A flag (B_CALL) is examined to see if the optional I/O done
function
.i b_iodone
in the buffer's header should be called.
If it is set, the function will be called with the buffer header pointer
provided, and the flag bit cleared to indicate that the call has been
performed.
.lp
If the buffer corresponds to a write of a dirty buffer, the dirty bit
is removed from the buffer and the vnode layer is informed that the buffer
has been cleaned by a write operation. If the buffer was an asynchronous
operation, it is released by calling
.b brelse()
(***HYPERLINK WHAT IS BRELSE() - POPUP)
since there is no top level routine waiting for it. All of the
interrupt masking in the top level routines exists because of this
potential call from interrupt level.
.lp
The flags are adjusted to eliminate the indication of asynchronous I/O
(if present), since the operation has been performed, and are updated
to indicate that the I/O operation completed successfully.
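.lp
The completion steps just described can be sketched in C. This is a model
under stated assumptions: the released and woken fields and the
writes_cleaned counter stand in for brelse(), wakeup(), and the vnode-layer
notification, and the flag values are chosen for the sketch.

```c
#include <stddef.h>

#define B_DONE   0x01
#define B_ASYNC  0x02
#define B_CALL   0x04   /* optional completion callout requested */
#define B_DELWRI 0x08   /* this write cleaned a dirty buffer */

struct buf {
    int   b_flags;
    void (*b_iodone)(struct buf *);
    int   released;     /* stand-in: set where brelse() would run */
    int   woken;        /* stand-in: set where wakeup() would run */
};

static int writes_cleaned;  /* stand-in for vnode-layer notifications */

/* Model of biodone(): run the optional callout once, report a cleaned
 * dirty buffer to the vnode layer, mark the I/O done, then either
 * release an asynchronous buffer or wake the sleeper in biowait(). */
static void biodone_sketch(struct buf *bp)
{
    if (bp->b_flags & B_CALL) {
        bp->b_flags &= ~B_CALL;     /* callout is performed only once */
        (*bp->b_iodone)(bp);
    }
    if (bp->b_flags & B_DELWRI) {
        bp->b_flags &= ~B_DELWRI;   /* a dirty buffer has been cleaned */
        writes_cleaned++;           /* vnode layer moves it to the clean list */
    }
    bp->b_flags |= B_DONE;
    if (bp->b_flags & B_ASYNC) {
        bp->b_flags &= ~B_ASYNC;
        bp->released = 1;           /* brelse(): no one is waiting for it */
    } else {
        bp->woken = 1;              /* wakeup(bp) for biowait() */
    }
}
```
.lp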
Any processes
sleeping on the buffer are awakened, the masked interrupts are unmasked,
and the function completes. Note that the interrupt masks are frequently
redundant, since interrupt routines mask the block devices anyway.
The reason that this must be done for top level calls to
.b biodone()
is that they do not mask interrupts.
.uh "Design Perspective on Bio.c"
.pp
Once upon a time, the buffer cache in a Unix system was an exciting concept.
Until a buffer cache was added,
the earliest Unix systems at Bell Labs did not allow overlap of I/O operation
with other processes executing.
.b bio
allowed multiprogramming operation in these early systems, so
that when a reference was made to a block where the I/O operation
had not yet been performed, the operation could be queued and another process
run while the disk subsystem located and returned
the appropriate block.
.pp
.b bio
is an example of a kind of superdriver which allowed queued
I/O requests to be processed without blocking the processor from operating.
Given the resources of small machines like the PDP-11, this was a perfectly
appropriate mechanism.
.b bio
became even more desirable with the addition of
delayed writes and read-ahead operations, which allowed more decoupling
of synchronous operation between disks and processes accessing the filesystem.
.pp
The buffer cache was especially useful to early Unix systems because
of their use of hierarchical filesystems (then unusual in any small system).
A single pathname lookup
(see
.b fs/lookup.c)
(***HYPERLINK LOOKUP.C TABLE OF CONTENTS - MACRO)
could require that many I/O operations be performed. With caching, however,
many of these I/O operations could be avoided.
.pp
Nowadays, I/O performance enhancement requires more than the old
buffer cache mechanism can provide.
Besides handling name evaluation, this mechanism now must provide the
ability to manage mass I/O systems so that they are always operating
at peak transfer rate. It must also
possess the ability to shift granularities of information back and
forth that are larger than the size of the filesystem blocks themselves.
.pp
Because of these modern day requirements, we can no longer cache
low level I/O transfers -- instead, we must cache regions of
address space associated with files. In sum, we must
work at a higher level of abstraction, where we
direct an I/O operation to be performed that far exceeds the size
of the actual low level atomic transactions.
.pp
The reason we cannot accomplish this at the lower level of filesystem blocks
is that the information regarding
the characteristics of the transfer is managed only at the top level
of abstraction. While it is possible to devise an architecture which always
"sends" this information down to the lower levels whenever any operation
may be performed, this would still be inadequate, since we may
be required to manage the timing (e.g. arrival
and service times) at the top level of abstraction in order to use the
resource most effectively -- this
requirement is not easily worked into our communications mechanism
to the lower layers.
.pp
Perhaps even more importantly, in the future our operating system may
wish to make high level changes in allocation strategies, independent
of the lower levels which actually implement the transfers. Again,
the burden of synchronizing lower and upper levels in a dogmatic
way could easily prove too costly.
.pp
Another consideration for future design is that the buffer
cache mechanism provides one set of policies for all filesystems
that make use of the buffer cache, and does not allow for per-filesystem
management of such policies.
One example of this is the method by which
filesystem metadata (inodes, disk indirect blocks, cylinder groups)
must be synchronized when performing a file update operation. Currently
in BSD systems, this is handled by forcing the association
of metadata with the file's actual data on block lists and disassociating
them when an I/O operation needs to be performed.
This is done so that when a file is forcibly written to the disk (fsync),
all of the related blocks can be found, including the indirect blocks, which
don't actually belong to the address space of the file itself.
Thus, there is currently no general purpose way to make these
associations other than by having the buffer cache
provide a callout to the vnode layer when this is necessary.
.pp
An alternative design is to avoid the implied I/O involved with the filesystem
cache by implementing a lookaside cache, which
does I/O directly to the lower layers of the filesystem. Prior
to initiating the I/O, it checks to see if the
block is present in the filesystem cache, and then chooses to use the
contents of the cache rather than doing the transfer. Such a mechanism
is simpler, and avoids the need for knowledge of interrupt level
operation or much in the way of semantic information regarding the contents
of the buffers.