Revision tags: v6.2.1, v6.2.0, v6.3.0, v6.0.1

e91e64c7 | 17-May-2021 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Major refactor of pageout daemon algorithms
* Rewrite a large chunk of the pageout daemon's algorithm to significantly improve page selection for pageout on low-memory systems.
* Implement persistent markers for hold and active queue scans. Instead of moving pages within the queues, we now implement a persistent marker and just move the marker instead. This ensures 100% fair scanning of these queues.
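Roughly, such a persistent-marker scan can be sketched as below (illustrative only, not the kernel's actual code; the vm_page fields and the PG_MARKER flag are simplified assumptions, and the real scans also handle queue spinlocks and page busying):

    #include <sys/queue.h>

    struct vm_page {
        TAILQ_ENTRY(vm_page) pageq;
        int flags;
    #define PG_MARKER 0x0001        /* dummy entry, never acted upon */
        /* ... */
    };
    TAILQ_HEAD(pglist, vm_page);

    static void
    scan_queue_fair(struct pglist *q, struct vm_page *marker, int count)
    {
        struct vm_page *m;

        while (count-- > 0) {
            m = TAILQ_NEXT(marker, pageq);
            if (m == NULL) {
                /* end of queue: park the marker back at the head */
                TAILQ_REMOVE(q, marker, pageq);
                TAILQ_INSERT_HEAD(q, marker, pageq);
                break;
            }
            /* advance the marker past m; m itself never moves */
            TAILQ_REMOVE(q, marker, pageq);
            TAILQ_INSERT_AFTER(q, m, marker, pageq);
            if (m->flags & PG_MARKER)
                continue;           /* skip other scans' markers */
            /* ... examine/act on page m here ... */
        }
    }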
* The pageout state machine is now governed by the following sysctls (with some example default settings from a 32G box containing 8071042 pages):
vm.v_free_reserved:  20216
vm.v_free_min:       40419
vm.v_paging_wait:    80838
vm.v_paging_start:   121257
vm.v_paging_target1: 161676
vm.v_paging_target2: 202095
And separately
vm.v_inactive_target: 484161
The arrangement is as follows:
reserved < severe < minimum < wait < start < target1 < target2
* Paging is governed as follows: The pageout daemon is activated when FREE+CACHE falls below (v_paging_start). The daemon will free memory up until FREE+CACHE reaches (v_paging_target1), and then continue to free memory up more slowly until FREE+CACHE reaches (v_paging_target2).
If, due to memory demand, FREE+CACHE falls below (v_paging_wait), most userland processes will begin short-stalls on VM allocations and page faults, and return to normal operation once FREE+CACHE goes above (v_paging_wait) (that is, as soon as possible).
If, due to memory demand, FREE+CACHE falls below (v_free_min), most userland processes will block on VM allocations and page faults until the level returns to above (v_paging_wait).
The hysteresis between (wait) and (start) allows most processes to continue running normally during nominal paging activities.
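Illustratively, the resulting state decisions look like this sketch (the enum, function, and variable declarations are assumptions; the variables mirror the sysctls above):

    extern long v_free_min, v_paging_wait, v_paging_start;

    enum paging_action { PAGING_OK, WAKE_DAEMON, SHORT_STALL, HARD_BLOCK };

    static enum paging_action
    paging_state(long free_cache)       /* FREE+CACHE, in pages */
    {
        if (free_cache < v_free_min)
            return HARD_BLOCK;          /* block until above (wait) */
        if (free_cache < v_paging_wait)
            return SHORT_STALL;         /* short-stall allocations/faults */
        if (free_cache < v_paging_start)
            return WAKE_DAEMON;         /* daemon frees to target1, then
                                           more slowly to target2 */
        return PAGING_OK;
    }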
* The pageout daemon operates in batches and then loops as necessary. Pages will be moved from CACHE to FREE as necessary, then from INACTIVE to CACHE as necessary, then from ACTIVE to INACTIVE as necessary. Care is taken to avoid completely exhausting any given queue to ensure that the queue scan is reasonably efficient.
* The ACTIVE to INACTIVE scan has been significantly reorganized and integrated with the page_stats scan (which updates m->act_count for pages in the ACTIVE queue). Pages in the ACTIVE queue are no longer moved within the lists. Instead a persistent roving marker is employed for each queue.
The m->act_count test is made against a dynamically adjusted comparison variable called vm.pageout_stats_actcmp. When no progress is made this variable is increased, and when sufficient progress is made this variable is decreased. Thus, under very heavy memory loads, a more permissive m->act_count test allows active pages to be deactivated more quickly.
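For illustration, the test and its feedback loop might look roughly like the following sketch (the vm_page_deactivate() call site and the progress accounting are assumptions):

    struct vm_page { int act_count; /* ... */ };
    extern void vm_page_deactivate(struct vm_page *);
    extern int pageout_stats_actcmp;    /* vm.pageout_stats_actcmp */

    static int
    scan_active_pass(struct vm_page **pages, int n, int needed)
    {
        int i, deactivated = 0;

        for (i = 0; i < n; ++i) {
            /* a page qualifies when its act_count does not exceed
               the dynamic threshold */
            if (pages[i]->act_count <= pageout_stats_actcmp) {
                vm_page_deactivate(pages[i]);
                ++deactivated;
            }
        }
        /* feedback: loosen the test on poor progress, tighten it
           once enough pages are being deactivated again */
        if (deactivated < needed)
            ++pageout_stats_actcmp;
        else if (pageout_stats_actcmp > 0)
            --pageout_stats_actcmp;
        return deactivated;
    }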
* The INACTIVE to FREE+CACHE scan remains relatively unchanged. A two-pass LRU arrangement continues to be employed in order to give the system time to reclaim a deactivated page before it would otherwise get paged out.
* The vm_pageout_page_stats() scan has been almost completely rewritten. This scan is responsible for updating m->act_count on pages in the ACTIVE queue. Example sysctl settings are shown below:
vm.pageout_stats_rsecs:  300   <--- passive run time (seconds) after pageout
vm.pageout_stats_scan:   472   <--- max number of pages to scan per tick
vm.pageout_stats_ticks:  10    <--- poll rate in ticks
vm.pageout_stats_inamin: 16    <--- inactive ratio governing dynamic
vm.pageout_stats_inalim: 4096       adjustment of actcmp
vm.pageout_stats_actcmp: 2     <--- dynamically adjusted by the kernel
The page stats code polls slowly and will update m->act_count and deactivate pages until it is able to achieve (v_inactive_target) worth of pages in the inactive queue.
Once this target has been reached, the poll stops deactivating pages, but will continue to run for (pageout_stats_rsecs) seconds after the pageout daemon last ran (typically 5 minutes) and continue to passively update m->act_count during this period.
The polling resumes upon any pageout daemon activation and the cycle repeats.
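The cycle can be sketched as a loop (illustrative; inactive_shortfall(), secs_since_pageout(), scan_and_deactivate(), and scan_passive() are hypothetical helpers):

    extern int tsleep(void *, int, const char *, int);
    extern int pageout_stats_ticks, pageout_stats_scan, pageout_stats_rsecs;
    extern int inactive_shortfall(void);    /* hypothetical */
    extern int secs_since_pageout(void);    /* hypothetical */
    extern void scan_and_deactivate(int);   /* hypothetical */
    extern void scan_passive(int);          /* hypothetical */

    static void
    page_stats_poll(void)
    {
        static int pwait;

        for (;;) {
            tsleep(&pwait, 0, "pstats", pageout_stats_ticks);
            if (inactive_shortfall())
                scan_and_deactivate(pageout_stats_scan);
            else if (secs_since_pageout() < pageout_stats_rsecs)
                scan_passive(pageout_stats_scan);   /* act_count only */
            else
                break;      /* idle until the daemon activates again */
        }
    }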
* The vm_pageout_page_stats() scan is mostly responsible for selecting the correct pages to move from ACTIVE to INACTIVE. Choosing the correct pages allows the system to continue to operate smoothly while concurrent paging is in progress. The additional 5 minutes of passive operation allows it to pre-stage m->act_count for pages in the ACTIVE queue to help grease the wheels for the next pageout daemon activation.
TESTING
* On a test box with memory limited to 2GB, running chrome. Video runs smoothly despite constant paging. Active tabs appear to operate smoothly. Inactive tabs are able to page-in decently fast and resume operation.
* On a workstation with 32GB of memory and a large number of open chrome tabs, allowed to sit overnight (chrome burns up a lot of memory when tabs remain open), then video tested the next day. Paging appeared to operate well and so far there has been no stuttering.
* On a 64GB build box running dsynth 32/32 (intentionally overloaded). The full bulk starts normally. The packages tend to get larger and larger as they are built. dsynth and the pageout daemon operate reasonably well in this situation. I was mostly looking for excessive stalls due to heavy memory loads and it looks like the new code handles it quite well.
Revision tags: v6.0.0, v6.0.0rc1, v6.1.0, v5.8.3, v5.8.2, v5.8.1, v5.8.0, v5.9.0, v5.8.0rc1, v5.6.3

3f7b7260 | 23-Oct-2019 | Sascha Wildner <saw@online.de>
world/kernel: Use the rounddown2() macro in various places.
Tested-by: zrj
Revision tags: v5.6.2, v5.6.1, v5.6.0, v5.6.0rc1, v5.7.0, v5.4.3, v5.4.2

5606f9a7 | 30-Mar-2019 | Sascha Wildner <saw@online.de>
kernel: Remove two duplicate #include's.
Revision tags: v5.4.1, v5.4.0, v5.5.0, v5.4.0rc1, v5.2.2, v5.2.1, v5.2.0, v5.3.0, v5.2.0rc

c1f5cf51 | 11-Feb-2018 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - syntax
* Improve a formatting issue.
Revision tags: v5.0.2, v5.0.1

c3c895a6 | 28-Oct-2017 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Fix cluster_awrite() race
* Fix a race between cluster_awrite() and vnode destruction. We have to finish working the cluster pbuf before disposing of the component elements. b_vp in the cluster pbuf is held only by the presence of the components.
* Fixes NULL pointer indirection panic associated with heavy paging during tmpfs operations. This typically only occurs when maxvnodes is set to a relatively low value, but it can eventually occur in any modestly paging environment when tmpfs is used.
bc0aa189 | 18-Oct-2017 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - refactor vm_page busy
* Move PG_BUSY, PG_WANTED, PG_SBUSY, and PG_SWAPINPROG out of m->flags.
* Add m->busy_count with PBUSY_LOCKED, PBUSY_WANTED, PBUSY_SWAPINPROG, and PBUSY_MASK (for the soft-busy count).
* Add support for acquiring a soft-busy count without a hard-busy. This requires that there not already be a hard-busy. The purpose of this is to allow a vm_page to be 'locked' in a shared manner via the soft-busy for situations where we only intend to read from it.
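A sketch of the acquisition this enables (illustrative; the flag values and busy_count layout shown are assumptions, and the real code also handles PBUSY_WANTED wakeups):

    #include <sys/types.h>
    extern int atomic_cmpset_int(volatile u_int *, u_int, u_int);

    #define PBUSY_LOCKED     0x80000000U    /* hard-busy (exclusive) */
    #define PBUSY_WANTED     0x40000000U
    #define PBUSY_SWAPINPROG 0x20000000U
    #define PBUSY_MASK       0x1FFFFFFFU    /* soft-busy count */

    /* try to take a shared soft-busy; fails if the page is hard-busied */
    static int
    soft_busy_try(volatile u_int *busy_countp)
    {
        u_int ocount;

        for (;;) {
            ocount = *busy_countp;
            if (ocount & PBUSY_LOCKED)
                return 0;       /* exclusive holder present */
            if (atomic_cmpset_int(busy_countp, ocount, ocount + 1))
                return 1;       /* soft-busy count acquired */
            /* lost the race; reload and retry */
        }
    }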
Revision tags: v5.0.0

d32579c3 | 02-Oct-2017 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Add KVABIO API (ability to avoid global TLB syncs)
* Add KVABIO support. This works as follows:
(1) Devices can set D_KVABIO in the ops flags to specify that the device strategy routine supports the API.
The dev_dstrategy() wrapper will fully synchronize the buffer to all cpus prior to dispatch if the device flag is not set.
(2) Vnodes can set VKVABIO in v_flag to indicate that VOP_STRATEGY supports the API.
The vn_strategy() wrapper will fully synchronize the buffer to all cpus prior to dispatch if the vnode flag is not set.
(3) GETBLK_KVABIO and FINDBLK_KVABIO flags added to allow buffer cache consumers (primarily filesystem code) to indicate that they support the API. B_KVABIO flag added to struct buf.
This occurs on a per-acquisition basis. For example, a standard bread() will clear the flag, indicating no support. A bread_kvabio() will set the flag, indicating support.
* The getblk(), getcacheblk(), and cluster*() interfaces set the flag for any I/O they dispatch, and then adjust the flag as necessary upon return according to the caller's wishes.
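As an example, a filesystem opting in might read a block as sketched below (bread_kvabio() is assumed to mirror the classic bread() signature, and bkvasync() is the assumed helper for synchronizing the mapping on the current cpu):

    #include <sys/types.h>

    struct vnode;
    struct buf;
    extern int bread_kvabio(struct vnode *, off_t, int, struct buf **);
    extern void bkvasync(struct buf *);     /* name assumed */

    static int
    fs_read_block(struct vnode *vp, off_t loffset, int size, struct buf **bpp)
    {
        int error;

        /* B_KVABIO is set on the buffer: we promise not to assume
           b_data is synchronized across all cpus */
        error = bread_kvabio(vp, loffset, size, bpp);
        if (error == 0) {
            /* make the mapping usable on this cpu before
               dereferencing b_data */
            bkvasync(*bpp);
        }
        return error;
    }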
Revision tags: v5.0.0rc2

8158299a | 30-Sep-2017 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Remove B_MALLOC
* Remove B_MALLOC buffer support. All primary buffer cache buffer operations should now use pages. B_VMIO is required for all vnode-centric operations like allocbuf(), but does not have to be set for nominal I/O.
* Remove vm_hold_load_pages() and vm_hold_free_pages(). This code was used to support mapping ad-hoc data buffers into buf structures, but the only remaining use case in the CAM periph code can just use getpbuf_mem() instead. So this code is no longer used.
Revision tags: v5.1.0, v5.0.0rc1

9c93755a | 09-Sep-2017 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Expand breadnx/breadcb/cluster_readx/cluster_readcb API
* Pass B_NOTMETA flagging into breadnx(), breadcb(), cluster_readx(), and cluster_readcb().
Solve issues where data can wind up not being tagged B_NOTMETA in read-ahead and clustered buffers.
* Adjust the standard bread(), breadn(), and cluster_read() inlines to pass B_NOTMETA.
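The adjusted inline looks roughly like this sketch (the exact breadnx() parameter list and the B_NOTMETA value are assumptions):

    #include <sys/types.h>
    #include <stddef.h>

    struct vnode;
    struct buf;
    #define B_NOTMETA 0x00000004            /* value illustrative */
    extern int breadnx(struct vnode *, off_t, int, int,
                       off_t *, int *, int, struct buf **);

    static __inline int
    bread(struct vnode *vp, off_t loffset, int size, struct buf **bpp)
    {
        /* plain bread() reads file data, so tag the I/O B_NOTMETA */
        return (breadnx(vp, loffset, size, B_NOTMETA,
                        NULL, NULL, 0, bpp));
    }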
Revision tags: v4.8.1, v4.8.0, v4.6.2, v4.9.0, v4.8.0rc

cf297f2c | 07-Mar-2017 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Fix cluster_write() inefficiency
* A bug in the cluster code was causing HAMMER to write out 64KB buffers in 32KB overlapping segments, resulting in data being written to the media twice.
* This fix just about doubles HAMMER's sequential write bandwidth.
d84f6fa1 | 08-Nov-2016 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Attempt to fix cluster pbuf deadlock on recursive filesystems
* Change global pbuf count limits (used primarily for clustered I/O) to per-mount and per-device limits. The per-mount / per-device limit is set to nswbuf_kva / 10, allowing 10 different entities to obtain pbufs concurrently without interference.
* This change goes a long way towards fixing deadlocks that could occur with the old global system (a global limit of nswbuf_kva / 2) when the I/O system recurses through a virtual block device or filesystem. Two examples of virtual block devices are the 'vn' device and the crypto layer.
* We also note that even normal filesystem read and write I/O strategy calls will recurse at least once to dive the underlying block device. DFly also had issues with pbuf hogging by one mount causing unnecessary stalls in other mounts. This fix also prevents pbuf hogging.
* Remove unused internal O_MAPONREAD flag.
Reported-by: htse, multiple
Testing-by: htse, dillon
Revision tags: v4.6.1, v4.6.0, v4.6.0rc2

ca88a24a | 24-Jul-2016 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Add vfs.repurpose_enable, adjust B_HASBOGUS
* Add vfs.repurpose_enable, default disabled. If this feature is turned on the system will try to repurpose the VM pages underlying a buffer on re-use instead of allowing the VM pages to cycle into the VM page cache. Designed for high I/O-load environments.
* Use the B_HASBOGUS flag to determine if a pmap_qenter() is required, and devolve the case to a single call to pmap_qenter() instead of one for each bogus page.
Revision tags: v4.6.0rc, v4.7.0

d9a07a60 | 29-Jun-2016 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Enhance buffer flush and cluster_write linearity
* flushbufqueues() was iterating between cpus, taking only one buffer off of each cpu's queue. This forced non-linear ordering on flush, messing up sequential performance for HAMMER1 and HAMMER2. For HAMMER2 this also caused physical blocks to be allocated out of order.
Add sysctl vfs.flushperqueue to specify the number of buffers to flush per cpu before iterating the pcpu queue. Default 1024.
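Conceptually the batching works as sketched below (flush_one_dirty_buf() is a hypothetical helper):

    extern int vfs_flushperqueue;   /* sysctl vfs.flushperqueue (default 1024) */
    extern int ncpus;
    extern int flush_one_dirty_buf(int cpu);    /* hypothetical */

    static void
    flushbufqueues_batched(void)
    {
        int cpu, n;

        for (cpu = 0; cpu < ncpus; ++cpu) {
            /* drain up to the batch limit from this cpu's dirty
               queue before moving on, keeping flushes linear */
            for (n = 0; n < vfs_flushperqueue; ++n) {
                if (flush_one_dirty_buf(cpu) == 0)
                    break;          /* queue empty */
            }
        }
    }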
* cluster_write() no longer requires that a buffer be VOP_BMAP()'d successfully in order to issue writes. This affects HAMMER2, which does not assign physical device blocks until the logical buffer is actually flushed to the backend device.
* Fixes non-linearity problems for buffer daemon flushbufqueues() calls, and for cluster_write() with or without write_behind.
3b2afb67 | 11-Jun-2016 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - B_IODEBUG -> B_IOISSUED
* Rename this flag. It still operates the same way.
This flag is set by the kernel upon an actual I/O read into a buffer cache buffer and may be cleared by the filesystem code to allow the filesystem code to detect when re-reads of the block cause another I/O or not. This allows HAMMER1 and HAMMER2 to avoid calculating the check code over and over again if it has already been calculated.
cb1fa82f | 08-Jun-2016 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Fix some clustering issues
* Change B_RAM functionality. We were previously setting B_RAM on the last async buffer and doing some cruft to probe ahead.
Instead, set B_RAM in the middle and use a simple heuristic to estimate where to pick-up the read-ahead again.
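In effect (sketch only; the array-of-buffers form and the B_RAM value are illustrative):

    struct buf { int b_flags; /* ... */ };
    #define B_RAM 0x10000000        /* value illustrative */

    /* arm the next read-ahead trigger at the midpoint of the set,
       not on the last buffer */
    static void
    mark_readahead_trigger(struct buf **bufs, int nbufs)
    {
        if (nbufs > 1)
            bufs[nbufs / 2]->b_flags |= B_RAM;
    }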
* Clean-up the read-ahead. When the caller of cluster_read() asks for read-ahead, we do the read-ahead whether or not BMAP says it is contiguous. All a failed BMAP does now is prevent cluster_rbuild() from getting called (that is, it doesn't try to gang multiple buffers together).
When thinking about this, the logical buffer cache sequential heuristic is telling us that userland is going to read the data, so why stop and then have to stall on an I/O read later when userland actually reads the data?
* This will improve pipelining for both hammer1 and hammer2.
Revision tags: v4.4.3, v4.4.2, v4.4.1, v4.4.0, v4.5.0, v4.4.0rc, v4.2.4, v4.3.1, v4.2.3, v4.2.1, v4.2.0, v4.0.6, v4.3.0, v4.2.0rc

ffd3e597 | 21-May-2015 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Change bundirty() location in I/O sequence
* When doing a write BIO, do not bundirty() the buffer prior to issuing the vn_strategy(). Instead, bundirty() the buffer when the I/O is complete, primarily in bpdone().
The I/O's data buffer is protected during the operation by vfs_busy_pages(), so related VM pages cannot be modified while the write is running. And, of course, the buffer itself is locked exclusively for the duration of the operation. Thus this change should NOT introduce any redirtying races.
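Schematically (a simplified sketch, not the actual vfs_bio.c code; the structure layouts and the BUF_CMD_WRITE value are stand-ins):

    struct vnode;
    struct bio { int dummy; };              /* stand-in */
    struct buf { struct bio b_bio1; int b_cmd; };
    #define BUF_CMD_WRITE 2                 /* value illustrative */
    extern void vfs_busy_pages(struct vnode *, struct buf *);
    extern void vn_strategy(struct vnode *, struct bio *);
    extern void bundirty(struct buf *);

    /* dispatch: the buffer stays dirty while the write is in flight */
    static void
    issue_write(struct vnode *vp, struct buf *bp)
    {
        vfs_busy_pages(vp, bp);             /* pages protected */
        vn_strategy(vp, &bp->b_bio1);       /* no bundirty() here */
    }

    /* completion, e.g. from bpdone(): only now does the buffer
       leave vp->v_rbdirty_tree */
    static void
    write_done(struct buf *bp)
    {
        if (bp->b_cmd == BUF_CMD_WRITE)
            bundirty(bp);
    }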
* This change ensures that vp->v_rbdirty_tree remains non-empty until all related write I/Os have completed, removing a race condition for code which checks vp->v_rbdirty_tree to determine e.g. if a file requires synchronization or not.
This race could cause problems because the system buffer flusher might be in the midst of flushing a buffer just as a filesystem decides to sync and starts checking vp->v_rbdirty_tree.
* This should theoretically fix a long-standing but difficult-to-reproduce bug in HAMMER1 where a backend flush occurs at an inopportune time.
Revision tags: v4.0.5, v4.0.4

59b728a7 | 18-Feb-2015 | Sascha Wildner <saw@online.de>
sys/kern: Adjust some function declaration vs. definition mismatches.
All these functions are declared static already, so no functional change.
Revision tags: v4.0.3, v4.0.2, v4.0.1, v4.0.0, v4.0.0rc3, v4.0.0rc2, v4.0.0rc, v4.1.0, v3.8.2, v3.8.1, v3.6.3, v3.8.0, v3.8.0rc2, v3.9.0, v3.8.0rc, v3.6.2, v3.6.1

65ec5030 | 12-Dec-2013 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Fix rare buffer cache deadlock
* cluster_collectbufs() was improperly using blocking vfs/bio calls to find nearby buffers, which can deadlock against multi-threaded filesystems.
* Only occurs in the write path, probably only H2 is affected.
38a4b308 | 04-Dec-2013 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Fix SMP races with vnode cluster fields
* The better concurrency we have due to the recent buffer cache work has revealed an SMP race in the vfs_cluster code. Various fields used by cluster_write() can race and cause the wrong buffers to be clustered to the wrong disk offset, resulting in disk corruption.
* Rip the v_lastw, v_cstart, v_lasta, and v_clen fields out of struct vnode and replace with a global cluster state cache in vfs_cluster.c.
The cache is implemented as a 512-entry hash table, 4-way set associative, and is more advanced than the original implementation in that it allows up to four different seek zones to be tracked on each vnode, instead of only one. This should make buffered device I/O (used by filesystems) work better.
Cache elements are heuristically locked with an atomic_swap_int(). If the code is unable to instantly acquire a lock on an element it will simply not cluster that particular I/O (instead of blocking). Even though this is a global hash table, operations will have a tendency to be localized to cache elements.
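A sketch of the set lookup and element try-lock (illustrative; the structure fields, hashing, and replacement policy are assumptions):

    #include <sys/types.h>              /* uintptr_t */
    #include <stddef.h>

    #define CLU_NHASH 512               /* total elements */
    #define CLU_NWAYS 4                 /* 4-way set associative */

    struct vnode;
    struct cluster_cache {
        struct vnode    *vp;
        volatile int    locked;         /* 0 = free, 1 = held */
        /* v_lastw / v_cstart / v_lasta / v_clen equivalents */
    };
    static struct cluster_cache cluster_tab[CLU_NHASH];
    extern int atomic_swap_int(volatile int *, int);

    static struct cluster_cache *
    cluster_trylock(struct vnode *vp)
    {
        int set = (int)(((uintptr_t)vp >> 8) % (CLU_NHASH / CLU_NWAYS));
        int i;

        for (i = 0; i < CLU_NWAYS; ++i) {
            struct cluster_cache *cc = &cluster_tab[set * CLU_NWAYS + i];

            if (cc->vp != vp)           /* (fill/replacement omitted) */
                continue;
            if (atomic_swap_int(&cc->locked, 1) == 0)
                return cc;              /* element lock acquired */
            /* held elsewhere: do not block, simply skip clustering
               this particular I/O */
        }
        return NULL;
    }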
* Remove some manual clearing of fields in UFS's ffs_truncate() routine. It should have no material effect.
Revision tags: v3.6.0, v3.7.1, v3.6.0rc, v3.7.0, v3.4.3

dbb11a6e | 15-Jun-2013 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Add cluster_readcb()
* This function is similar to breadcb() in that it issues the requested buffer I/O asynchronously with a callback, but then also clusters additional asynchronous I/Os (without a callback) to improve performance.
* Used by HAMMER2 to improve performance.
dc71b7ab | 31-May-2013 | Justin C. Sherrill <justin@shiningsilence.com>
Correct BSD License clause numbering from 1-2-4 to 1-2-3.
Apparently everyone's doing it: http://svnweb.freebsd.org/base?view=revision&revision=251069
Submitted-by: "Eitan Adler" <lists at eitanadler.com>
Revision tags: v3.4.2

2702099d | 06-May-2013 | Justin C. Sherrill <justin@shiningsilence.com>
Remove advertising clause from all that isn't contrib or userland bin.
By: Eitan Adler <lists@eitanadler.com>
Revision tags: v3.4.0, v3.4.1, v3.4.0rc, v3.5.0

47269f33 | 14-Dec-2012 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Fix buffer cache mismatch assertion (hammer)
* Fix an issue where cluster_write() could instantiate buffers with the wrong buffer size. Only affects HAMMER1, which uses two different buffer sizes for files.
* Bug could cause a mismatched buffer size assertion in the kernel.
Revision tags: v3.2.2, v3.2.1, v3.2.0, v3.3.0, v3.0.3

b642a6c1 | 30-Apr-2012 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Fix degenerate cluster_write() cases
* cluster_write() should bdwrite() as a fallback, not bawrite().
Note that cluster_awrite() always bawrite()'s or equivalent. The DragonFly API split the functions out, so cluster_write() can now almost always bdwrite() for the non-clustered case.
* Solves some serious performance and real-time disk space usage issues when HAMMER1 was updated to use the cluster calls. The disk space would be recovered by the daily cleanup but the extra writes could end up being quite excessive, 25:1 unnecessary writes vs necessary writes.
Reported-by: multiple
Testing-by: tuxillo
504ea70e | 02-Apr-2012 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Reduce impact of write_behind on small/temporary files
* Do not start issuing write-behind writes until a file has grown past a certain size, otherwise we wind up issuing excessive I/O for small files and for temporary files which might be quickly deleted.
* Add vfs.write_behind_minfilesize sysctl (defaults to 10MB).
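The gate amounts to the following sketch (its placement inside cluster_write()'s write-behind path is assumed):

    #include <sys/types.h>              /* off_t */

    extern off_t write_behind_minfilesize;  /* vfs.write_behind_minfilesize */

    /* small or likely-temporary files skip write-behind entirely, so
       their dirty buffers can die quietly if the file is deleted */
    static int
    want_write_behind(off_t filesize)
    {
        return (filesize >= write_behind_minfilesize);
    }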