History log of /dragonfly/sys/kern/vfs_cluster.c (Results 1 – 25 of 86)
Revision    Date    Author    Comments
Revision tags: v6.2.1, v6.2.0, v6.3.0, v6.0.1
# e91e64c7 17-May-2021 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Major refactor of pageout daemon algorithms

* Rewrite a large chunk of the pageout daemon's algorithm to significantly
improve page selection for pageout on low-memory systems.

* Implement persistent markers for hold and active queue scans. Instead
of moving pages within the queues, we now implement a persistent marker
and just move the marker instead. This ensures 100% fair scanning of
these queues.

* The pageout state machine is now governed by the following sysctls
(with some example default settings from a 32G box containing 8071042
pages):

vm.v_free_reserved: 20216
vm.v_free_min: 40419
vm.v_paging_wait: 80838
vm.v_paging_start: 121257
vm.v_paging_target1: 161676
vm.v_paging_target2: 202095

And separately

vm.v_inactive_target: 484161

The arrangement is as follows:

reserved < severe < minimum < wait < start < target1 < target2

* Paging is governed as follows: The pageout daemon is activated when
FREE+CACHE falls below (v_paging_start). The daemon will free memory
up until FREE+CACHE reaches (v_paging_target1), and then continue to
free memory up more slowly until FREE+CACHE reaches (v_paging_target2).

If, due to memory demand, FREE+CACHE falls below (v_paging_wait), most
userland processes will begin short-stalls on VM allocations and page
faults, and return to normal operation once FREE+CACHE goes above
(v_paging_wait) (that is, as soon as possible).

If, due to memory demand, FREE+CACHE falls below (v_paging_min), most
userland processes will block on VM allocations and page faults until
the level returns to above (v_paging_wait).

The hysteresis between (wait) and (start) allows most processes to
continue running normally during nominal paging activities.
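
Illustrative sketch of that threshold logic in plain C (the struct, the
function, and the return values are hypothetical; only the sysctl names
come from the list above):

    enum paging_action { PAGING_IDLE, PAGING_RUN, PAGING_STALL_SHORT, PAGING_STALL_HARD };

    struct paging_levels {
        long v_free_min;        /* hard-stall level */
        long v_paging_wait;     /* short-stall (and recovery) level */
        long v_paging_start;    /* pageout daemon activation level */
        long v_paging_target1;  /* fast-free target */
        long v_paging_target2;  /* slow-free target */
    };

    static enum paging_action
    paging_state(const struct paging_levels *pl, long free_plus_cache)
    {
        if (free_plus_cache < pl->v_free_min)
            return PAGING_STALL_HARD;   /* processes block until level > wait */
        if (free_plus_cache < pl->v_paging_wait)
            return PAGING_STALL_SHORT;  /* short stalls on VM allocations/faults */
        if (free_plus_cache < pl->v_paging_start)
            return PAGING_RUN;          /* daemon frees toward target1, then target2 */
        return PAGING_IDLE;             /* wait..start hysteresis band or above */
    }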

* The pageout daemon operates in batches and then loops as necessary.
Pages will be moved from CACHE to FREE as necessary, then from INACTIVE
to CACHE as necessary, then from ACTIVE to INACTIVE as necessary. Care
is taken to avoid completely exhausting any given queue to ensure that
the queue scan is reasonably efficient.

* The ACTIVE to INACTIVE scan has been significantly reorganized and
integrated with the page_stats scan (which updates m->act_count for
pages in the ACTIVE queue). Pages in the ACTIVE queue are no longer
moved within the lists. Instead a persistent roving marker is employed
for each queue.

The m->act_count test is made against a dynamically adjusted comparison
variable called vm.pageout_stats_actcmp. When no progress is made this
variable is increased, and when sufficient progress is made this variable
is decreased. Thus, under very heavy memory loads, a more permissive
m->act_count test allows active pages to be deactivated more quickly.
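
Rough sketch of that feedback loop (the function name and the cap are
illustrative only; the real adjustment is governed by the
pageout_stats_inamin/inalim settings shown below):

    static int pageout_stats_actcmp = 2;    /* vm.pageout_stats_actcmp */
    #define ACTCMP_CAP 64                   /* arbitrary illustrative bound */

    static void
    adjust_actcmp(int pages_deactivated, int pages_needed)
    {
        if (pages_deactivated < pages_needed && pageout_stats_actcmp < ACTCMP_CAP)
            ++pageout_stats_actcmp;     /* little progress: be more permissive */
        else if (pageout_stats_actcmp > 0)
            --pageout_stats_actcmp;     /* enough progress: tighten back up */
    }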

* The INACTIVE to FREE+CACHE scan remains relatively unchanged. A two-pass
LRU arrangement continues to be employed in order to give the system
time to reclaim a deactivated page before it would otherwise get paged out.

* The vm_pageout_page_stats() scan has been almost completely rewritten.
This scan is responsible for updating m->act_count on pages in the
ACTIVE queue. Example sysctl settings shown below

vm.pageout_stats_rsecs: 300 <--- passive run time (seconds) after pageout
vm.pageout_stats_scan: 472 <--- max number of pages to scan per tick
vm.pageout_stats_ticks: 10 <--- poll rate in ticks
vm.pageout_stats_inamin: 16 <--- inactive ratio governing dynamic
vm.pageout_stats_inalim: 4096 adjustment of actcmp.
vm.pageout_stats_actcmp: 2 <--- dynamically adjusted by the kernel

The page stats code polls slowly and will update m->act_count and
deactivate pages until it is able to achieve (v_inactive_target) worth
of pages in the inactive queue.

Once this target has been reached, the poll stops deactivating pages, but
will continue to run for (pageout_stats_rsecs) seconds after the pageout
daemon last ran (typically 5 minutes) and continue to passively update
m->act_count during this period.

The polling resumes upon any pageout daemon activation and the cycle
repeats.

* The vm_pageout_page_stats() scan is mostly responsible for selecting
the correct pages to move from ACTIVE to INACTIVE. Choosing the correct
pages allows the system to continue to operate smoothly while concurrent
paging is in progress. The additional 5 minutes of passive operation
allows it to pre-stage m->act_count for pages in the ACTIVE queue to
help grease the wheels for the next pageout daemon activation.

TESTING

* On a test box with memory limited to 2GB, running chrome. Video runs
smoothly despite constant paging. Active tabs appear to operate smoothly.
Inactive tabs are able to page-in decently fast and resume operation.

* On a workstation with 32GB of memory and a large number of open chrome
tabs, allowed to sit overnight (chrome burns up a lot of memory when tabs
remain open), then video tested the next day. Paging appeared to operate
well and so far there has been no stuttering.

* On a 64GB build box running dsynth 32/32 (intentionally overloaded). The
full bulk starts normally. The packages tend to get larger and larger as
they are built. dsynth and the pageout daemon operate reasonably well in
this situation. I was mostly looking for excessive stalls due to heavy
memory loads and it looks like the new code handles it quite well.



Revision tags: v6.0.0, v6.0.0rc1, v6.1.0, v5.8.3, v5.8.2, v5.8.1, v5.8.0, v5.9.0, v5.8.0rc1, v5.6.3
# 3f7b7260 23-Oct-2019 Sascha Wildner <saw@online.de>

world/kernel: Use the rounddown2() macro in various places.

Tested-by: zrj


Revision tags: v5.6.2, v5.6.1, v5.6.0, v5.6.0rc1, v5.7.0, v5.4.3, v5.4.2
# 5606f9a7 30-Mar-2019 Sascha Wildner <saw@online.de>

kernel: Remove two duplicate #include's.


Revision tags: v5.4.1, v5.4.0, v5.5.0, v5.4.0rc1, v5.2.2, v5.2.1, v5.2.0, v5.3.0, v5.2.0rc
# c1f5cf51 11-Feb-2018 Matthew Dillon <dillon@apollo.backplane.com>

kernel - syntax

* Improve a formatting issue.


Revision tags: v5.0.2, v5.0.1
# c3c895a6 28-Oct-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix cluster_awrite() race

* Fix a race between cluster_awrite() and vnode destruction. We
have to finish working the cluster pbuf before disposing of the
component elements. b_vp in the cluster pbuf is held only by
the presence of the components.

* Fixes NULL pointer indirection panic associated with heavy
paging during tmpfs operations. This typically only occurs
when maxvnodes is set to a relatively low value, but it can
eventually occur in any modestly paging environment when
tmpfs is used.



# bc0aa189 18-Oct-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - refactor vm_page busy

* Move PG_BUSY, PG_WANTED, PG_SBUSY, and PG_SWAPINPROG out of m->flags.

* Add m->busy_count with PBUSY_LOCKED, PBUSY_WANTED, PBUSY_SWAPINPROG,
and PBUSY_MASK (for the soft-busy count).

* Add support for acquiring a soft-busy count without a hard-busy.
This requires that there not already be a hard-busy. The purpose
of this is to allow a vm_page to be 'locked' in a shared manner
via the soft-busy for situations where we only intend to read from
it.



Revision tags: v5.0.0
# d32579c3 02-Oct-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add KVABIO API (ability to avoid global TLB syncs)

* Add KVABIO support. This works as follows:

(1) Devices can set D_KVABIO in the ops flags to specify that the
device strategy routine supports the API.

The dev_dstrategy() wrapper will fully synchronize the buffer to
all cpus prior to dispatch if the device flag is not set.

(2) Vnodes can set VKVABIO in v_flag to indicate that VOP_STRATEGY
supports the API.

The vn_strategy() wrapper will fully synchronize the buffer to
all cpus prior to dispatch if the vnode flag is not set.

(3) GETBLK_KVABIO and FINDBLK_KVABIO flags added to allow buffer
cache consumers (primarily filesystem code) to indicate that
they support the API. B_KVABIO flag added to struct buf.

This occurs on a per-acquisition basis. For example, a standard
bread() will clear the flag, indicating no support. A bread_kvabio()
will set the flag, indicating support.

* The getblk(), getcacheblk(), and cluster*() interfaces set the flag for
any I/O they dispatch, and then adjust the flag as necessary upon return
according to the caller's wishes.
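
Conceptually, each hand-off boils down to the check sketched below; all
names here are stand-ins, not the actual DragonFly interfaces:

    #include <stdbool.h>

    struct buf_sketch {
        bool kvabio;        /* stand-in for the buffer's B_KVABIO state */
    };

    /* Stand-in for the expensive path: synchronize the buffer's kernel
     * virtual mapping to all cpus (the global TLB sync being avoided). */
    static void kva_sync_all(struct buf_sketch *bp) { (void)bp; }

    static void
    dispatch_sketch(struct buf_sketch *bp, bool next_layer_supports_kvabio)
    {
        /*
         * If the buffer is being handled with cpu-local KVA but the next
         * layer (device or VOP_STRATEGY) never declared support, fall
         * back to a full synchronization before handing it down.
         */
        if (bp->kvabio && !next_layer_supports_kvabio)
            kva_sync_all(bp);
        /* ...dispatch to the strategy routine... */
    }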



Revision tags: v5.0.0rc2
# 8158299a 30-Sep-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Remove B_MALLOC

* Remove B_MALLOC buffer support. All primary buffer cache buffer
operations should now use pages. B_VMIO is required for all
vnode-centric operations like allocbuf(), but does not have to be set
for nominal I/O.

* Remove vm_hold_load_pages() and vm_hold_free_pages(). This code was
used to support mapping ad-hoc data buffers into buf structures, but
the only remaining use case in the CAM periph code can just use
getpbuf_mem() instead. So this code is no longer used.



Revision tags: v5.1.0, v5.0.0rc1
# 9c93755a 09-Sep-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Expand breadnx/breadcb/cluster_readx/cluster_readcb API

* Pass B_NOTMETA flagging into breadnx(), breadcb(), cluster_readx(),
and cluster_readcb().

Solve issues where data can wind up not being tagged B_NOTMETA
in read-ahead and clustered buffers.

* Adjust the standard bread(), breadn(), and cluster_read() inlines
to pass B_NOTMETA.



Revision tags: v4.8.1, v4.8.0, v4.6.2, v4.9.0, v4.8.0rc
# cf297f2c 07-Mar-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix cluster_write() inefficiency

* A bug in the cluster code was causing HAMMER to write out 64KB buffers in
32KB overlapping segments, resulting in data being written to the media
twice.

* This fix just about doubles HAMMER's sequential write bandwidth.



# d84f6fa1 08-Nov-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Attempt to fix cluster pbuf deadlock on recursive filesystems

* Change global pbuf count limits (used primarily for clustered I/O) to
per-mount and per-device limits. The per-mount / per-device limit
is set to nswbuf_kva / 10, allowing 10 different entities to obtain
pbufs concurrently without interference.

* This change goes a long way towards fixing deadlocks that could occur
with the old global system (a global limit of nswbuf_kva / 2) when
the I/O system recurses through a virtual block device or filesystem.
Two examples of virtual block devices are the 'vn' device and the crypto
layer.

* We also note that even normal filesystem read and write I/O strategy calls
will recurse at least once to dive the underlying block device. DFly also
had issues with pbuf hogging by one mount causing unnecessary stalls
in other mounts. This fix also prevents pbuf hogging.

* Remove unused internal O_MAPONREAD flag.

Reported-by: htse, multiple
Testing-by: htse, dillon



Revision tags: v4.6.1, v4.6.0, v4.6.0rc2
# ca88a24a 24-Jul-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add vfs.repurpose_enable, adjust B_HASBOGUS

* Add vfs.repurpose_enable, default disabled. If this feature is turned on
the system will try to repurpose the VM pages underlying a buffer on
re-use instead of allowing the VM pages to cycle into the VM page cache.
Designed for high I/O-load environments.

* Use the B_HASBOGUS flag to determine if a pmap_qenter() is required,
and devolve the case to a single call to pmap_qenter() instead of one
for each bogus page.



Revision tags: v4.6.0rc, v4.7.0
# d9a07a60 29-Jun-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Enhance buffer flush and cluster_write linearity

* flushbufqueues() was iterating between cpus, taking only one buffer off
of each cpu's queue. This forced non-linear flushing, messing up
sequential performance for HAMMER1 and HAMMER2. For HAMMER2 this also
caused physical blocks to be allocated out of order.

Add sysctl vfs.flushperqueue to specify the number of buffers to flush
per cpu before iterating the pcpu queue. Default 1024.
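
Sketch of the batching change (flush_one_from_queue() is a hypothetical
stand-in for popping and flushing one dirty buffer); the old behavior
corresponds to flushperqueue == 1:

    #include <stdbool.h>

    static int flushperqueue = 1024;    /* sysctl vfs.flushperqueue */

    /* Hypothetical stand-in: flush one dirty buffer from this cpu's
     * queue, returning false once the queue is empty. */
    static bool
    flush_one_from_queue(int cpu)
    {
        (void)cpu;
        return false;
    }

    static void
    flushbufqueues_sketch(int ncpus)
    {
        for (int cpu = 0; cpu < ncpus; ++cpu) {
            /* drain up to flushperqueue buffers before moving on to the
             * next cpu so the flushed blocks stay roughly sequential */
            for (int n = 0; n < flushperqueue; ++n) {
                if (!flush_one_from_queue(cpu))
                    break;
            }
        }
    }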

* cluster_write() no longer requires that a buffer be VOP_BMAP()'d
successfully in order to issue writes. This affects HAMMER2, which does
not assign physical device blocks until the logical buffer is actually
flushed to the backend device.

* Fixes non-linearity problems for buffer daemon flushbufqueues() calls,
and for cluster_write() with or without write_behind.



# 3b2afb67 11-Jun-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - B_IODEBUG -> B_IOISSUED

* Rename this flag. It still operates the same way.

This flag is set by the kernel upon an actual I/O read into a buffer
cache buffer and may be cleared by the filesystem code to allow the
filesystem code to detect when re-reads of the block cause another I/O
or not. This allows HAMMER1 and HAMMER2 to avoid calculating the check
code over and over again if it has already been calculated.



# cb1fa82f 08-Jun-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix some clustering issues

* Change B_RAM functionality. We were previously setting B_RAM
on the last async buffer and doing some cruft to probe ahead.

Instead, set B_RAM in the middle and use a simple heuristic to
estimate where to pick-up the read-ahead again.

* Clean-up the read-ahead. When the caller of cluster_read() asks for
read-ahead, we do the read-ahead whether or not BMAP says it is
contiguous. All a failed BMAP does now is prevent cluster_rbuild()
from getting called (that is, it doesn't try to gang multiple buffers
together).

When thinking about this, the logical buffer cache sequential heuristic
is telling us that userland is going to read the data, so why stop and
then have to stall on an I/O read later when userland actually reads
the data?

* This will improve pipelining for both hammer1 and hammer2.



Revision tags: v4.4.3, v4.4.2, v4.4.1, v4.4.0, v4.5.0, v4.4.0rc, v4.2.4, v4.3.1, v4.2.3, v4.2.1, v4.2.0, v4.0.6, v4.3.0, v4.2.0rc
# ffd3e597 21-May-2015 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Change bundirty() location in I/O sequence

* When doing a write BIO, do not bundirty() the buffer prior to issuing
the vn_strategy(). Instead, bundirty() the buffer when the I/O
is complete, primarily in bpdone().

The I/O's data buffer is protected during the operation by vfs_busy_pages(),
so related VM pages cannot be modified while the write is running. And,
of course, the buffer itself is locked exclusively for the duration of the
operation. Thus this change should NOT introduce any redirtying races.

* This change ensures that vp->v_rbdirty_tree remains non-empty until all
related write I/Os have completed, removing a race condition for code
which checks vp->v_rbdirty_tree to determine e.g. if a file requires
synchronization or not.

This race could cause problems because the system buffer flusher might
be in the midst of flushing a buffer just as a filesystem decides to
sync and starts checking vp->v_rbdirty_tree.

* This should theoretically fix a long-standing but difficult-to-reproduce
bug in HAMMER1 where a backend flush occurs at an inopportune time.



Revision tags: v4.0.5, v4.0.4
# 59b728a7 18-Feb-2015 Sascha Wildner <saw@online.de>

sys/kern: Adjust some function declaration vs. definition mismatches.

All these functions are declared static already, so no functional change.


Revision tags: v4.0.3, v4.0.2, v4.0.1, v4.0.0, v4.0.0rc3, v4.0.0rc2, v4.0.0rc, v4.1.0, v3.8.2, v3.8.1, v3.6.3, v3.8.0, v3.8.0rc2, v3.9.0, v3.8.0rc, v3.6.2, v3.6.1
# 65ec5030 12-Dec-2013 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix rare buffer cache deadlock

* cluster_collectbufs() was improperly using blocking vfs/bio calls
to find nearby buffers, which can deadlock against multi-threaded
filesystems.

* Only occurs in the write path, probably only H2 is affected.



# 38a4b308 04-Dec-2013 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix SMP races with vnode cluster fields

* The better concurrency we have due to the recent buffer cache work has
revealed an SMP race in the vfs_cluster code. Various fields used by
cluster_write() can race and cause the wrong buffers to be clustered to
the wrong disk offset, resulting in disk corruption.

* Rip the v_lastw, v_cstart, v_lasta, and v_clen fields out of struct vnode
and replace with a global cluster state cache in vfs_cluster.c.

The cache is implemented as a 512-entry hash table, 4-way set associative,
and is more advanced than the original implementation in that it allows
up to four different seek zones to be tracked on each vnode, instead of
only one. This should make buffered device I/O (used by filesystems)
work better.

Cache elements are heuristically locked with an atomic_swap_int(). If
the code is unable to instantly acquire a lock on an element it will
simply not cluster that particular I/O (instead of blocking). Even though
this is a global hash table, operations will have a tendency to be
localized to cache elements.
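
Compact sketch of the lock-or-skip pattern, using C11 atomic_exchange()
in place of the kernel's atomic_swap_int(); the table geometry matches
the description above, everything else (names, hashing) is illustrative:

    #include <stdatomic.h>
    #include <stdint.h>

    #define CLUSTER_HSIZE 512   /* total entries */
    #define CLUSTER_ASSOC 4     /* 4-way set associative */

    struct cluster_ent {
        atomic_int lock;        /* 0 = free, 1 = held */
        uintptr_t  vp;          /* owning vnode */
        long       seek_zone;   /* per-zone clustering state lives here */
    };

    static struct cluster_ent cluster_tab[CLUSTER_HSIZE];

    /* Try to lock the matching entry in this vnode's set.  If no lock
     * can be taken instantly, return NULL and the caller simply skips
     * clustering this particular I/O instead of blocking. */
    static struct cluster_ent *
    cluster_trylock(uintptr_t vp)
    {
        unsigned set = (unsigned)(vp >> 8) % (CLUSTER_HSIZE / CLUSTER_ASSOC);
        struct cluster_ent *base = &cluster_tab[set * CLUSTER_ASSOC];

        for (int i = 0; i < CLUSTER_ASSOC; ++i) {
            if (base[i].vp == vp && atomic_exchange(&base[i].lock, 1) == 0)
                return &base[i];
        }
        return NULL;
    }

A real implementation also has to allocate or replace an entry when the
vnode misses in its set; the sketch covers only the lookup-and-trylock
path.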

* Remove some manual clearing of fields in UFS's ffs_truncate() routine.
It should have no material effect.



Revision tags: v3.6.0, v3.7.1, v3.6.0rc, v3.7.0, v3.4.3
# dbb11a6e 15-Jun-2013 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add cluster_readcb()

* This function is similar to breadcb() in that it issues the requested
buffer I/O asynchronously with a callback, but then also clusters
additional asynchronous I/Os (without a callback) to improve performance.

* Used by HAMMER2 to improve performance.



# dc71b7ab 31-May-2013 Justin C. Sherrill <justin@shiningsilence.com>

Correct BSD License clause numbering from 1-2-4 to 1-2-3.

Apparently everyone's doing it:
http://svnweb.freebsd.org/base?view=revision&revision=251069

Submitted-by: "Eitan Adler" <lists at eitanadler.com>



Revision tags: v3.4.2
# 2702099d 06-May-2013 Justin C. Sherrill <justin@shiningsilence.com>

Remove advertising clause from all that isn't contrib or userland bin.

By: Eitan Adler <lists@eitanadler.com>


Revision tags: v3.4.0, v3.4.1, v3.4.0rc, v3.5.0
# 47269f33 14-Dec-2012 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix buffer cache mismatch assertion (hammer)

* Fix an issue where cluster_write() could instantiate buffers with
the wrong buffer size. Only affects HAMMER1, which uses two different
buffer sizes for files.

* Bug could cause a mismatched buffer size assertion in the kernel.



Revision tags: v3.2.2, v3.2.1, v3.2.0, v3.3.0, v3.0.3
# b642a6c1 30-Apr-2012 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix degenerate cluster_write() cases

* cluster_write() should bdwrite() as a fallback, not bawrite().

Note that cluster_awrite() always bawrite()'s or equivalent. The
DragonFly API split the functions out, so cluster_write() can now
almost always bdwrite() for the non-clustered case.

* Solves some serious performance and real-time disk space usage issues
when HAMMER1 was updated to use the cluster calls. The disk space
would be recovered by the daily cleanup but the extra writes could
end up being quite excessive, 25:1 unnecessary writes vs necessary
writes.

Reported-by: multiple, testing by tuxillo



# 504ea70e 02-Apr-2012 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Reduce impact of write_behind on small/temporary files

* Do not start issuing write-behind writes until a file has grown past
a certain size, otherwise we wind up issuing excessive I/O for
small files and for temporary files which might be quickly deleted.

* Add vfs.write_behind_minfilesize sysctl (defaults to 10MB).
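
Minimal sketch of that gate (the helper name is hypothetical; only the
sysctl and its 10MB default come from this change):

    #include <sys/types.h>

    static off_t write_behind_minfilesize = 10 * 1024 * 1024;  /* vfs.write_behind_minfilesize */

    /* Skip write-behind entirely for small (and likely short-lived)
     * files; their dirty buffers can stay in the cache and may never
     * need to be written at all. */
    static int
    want_write_behind(off_t filesize)
    {
        return (filesize >= write_behind_minfilesize);
    }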


