#
5479a2c1 |
| 20-Oct-2023 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Clean up cache_hysteresis contention, better trigger_syncer()
* cache_hysteresis() is no longer multi-entrant, which caused unnecessary contention between cpus. Only one cpu can run it at a time; it simply returns for other cpus if it is already running.
* Add better trigger_syncer() functions. trigger_syncer_start() and trigger_syncer_stop() elide filesystem sleeps, ensuring that a filesystem waiting on a flush due to excessive dirty pages does not race the flusher and wind up twiddling its fingers while no flush is happening.
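A minimal sketch of the single-entry guard this implies (the guard variable and atomic ops are assumptions, not taken from the commit):

	static int cache_hysteresis_running;

	if (atomic_cmpset_int(&cache_hysteresis_running, 0, 1) == 0)
		return;		/* another cpu is already running it */
	/* ... perform the hysteresis work ... */
	atomic_store_rel_int(&cache_hysteresis_running, 0);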
|
Revision tags: v6.4.0, v6.4.0rc1, v6.5.0 |
|
#
a786f1c9 |
| 04-Jul-2022 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Attempt to fix broken vfs.cache.numunres tracker (2)
* The main culprit appears to be cache_allocroot() accounting for new root ncps differently than the rest of the module. So anything which mounts and umounts continuously, like dsynth, can make the numbers seriously wacky.
* Fix that and run an overnight test.
|
#
417b1086 |
| 04-Jul-2022 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - check nc_generation in nlookup path
* With nc_generation now operating in a more usable manner, we can use it in nlookup() to check for changes. When a change is detected, the related lock will be cycled and the entire nlookup() will retry up to debug.nlookup_max_retries, which currently defaults to 4.
* Add debugging via debug.nlookup_debug. Set to 3 for nc_generation debugging.
* Move "Parent directory lost" kprintfs into a debugging conditional, reported via (debug.nlookup_debug & 4).
* This fixes lookup/remove races which could sometimes cause open() and other system calls to return EINVAL or ENOTCONN. Basically what happened was that nlookup() wound up on a NCF_DESTROYED entry.
* A few minutes' worth of a dsynth bulk does not report any random generation number mismatches or retries, so the code in this commit is probably very close to correct.
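A sketch of the retry shape described above, using the generation test from the nc_generation commit below (loop structure and variable names assumed):

	int retries = 0;
	int orig_gen, curr_gen;

again:
	orig_gen = ncp->nc_generation & ~3;
	/* ... perform the (mostly unlocked) path lookup ... */
	curr_gen = ncp->nc_generation & ~3;
	if ((orig_gen - curr_gen) & ~1) {
		/* major change detected: cycle the related lock and retry */
		if (++retries < nlookup_max_retries)
			goto again;
	}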
|
#
cb96f7b7 |
| 04-Jul-2022 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Change ncp->nc_generation operation
* Change nc_generation operation. Bit 0 is reserved. The field is incremented by 2 whenever major changes are being made to the ncp (linking, unlinking, destruction, resolve, unresolve, vnode adjustment), and then incremented by 2 again when the operation is complete.
The caller can test for a major gen change using:

	curr_gen = ncp->nc_generation & ~3;
	if ((orig_gen - curr_gen) & ~1)
		(retry needed)
* Allows unlocked/relocked code to determine whether the ncp has possibly changed or not (will be used in upcoming commits).
* Adjust the kern_rename() code to use the generation numbers.
* Bit 0 will be used to check for a combination of major changes and lock cycling in the future.
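A sketch of the increment bracket around a major change (atomic usage assumed):

	atomic_add_int(&ncp->nc_generation, 2);	/* change in progress */
	/* ... link/unlink/destroy/resolve/unresolve/adjust vnode ... */
	atomic_add_int(&ncp->nc_generation, 2);	/* change complete */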
|
#
8ae75bb2 |
| 03-Jul-2022 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Attempt to fix broken vfs.cache.numunres tracker
* Try to fix a mis-count that can accumulate under heavy loads.
* In cache_setvp() and cache_setunresolved(), only adjust the unres count for namecache entries that are linked into the topology.
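A sketch of the rule (the topology test and counter update are assumptions; "unresolving" is a hypothetical flag for the direction of the transition):

	/* only entries linked into the topology affect the unres count */
	if (ncp->nc_parent != NULL) {
		if (unresolving)
			atomic_add_long(&vfscache_unres, 1);
		else
			atomic_add_long(&vfscache_unres, -1);
	}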
|
#
1201253b |
| 11-Jun-2022 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - namecache eviction fixes
* Fix several namecache eviction issues which interfere with nlookup*() functions.
There is an optimization where nlookup*() avoids locking intermediate ncp's in a path whenever possible on the assumption that the ref on the ncp will prevent eviction. This assumption fails when the machine is under a heavy namecache load.
Errors included spurious ENOTCONN and EINVAL error codes from file operations.
* Refactor the namecache code to not evict resolved namecache entries which have extra refs under normal operation. This allows nlookup*() and other functions to operate semi-lockless for intermediate elements in a path. However, they still obtain a ref which is a cache-unfriendly atomic operation.
This fixes numerous weird errors that occur during heavy dsynth bulk builds.
* Also fix a bug which evicted too many resolved namecache entries when attempting to evict unresolved entries. This should improve performance under heavy namecache loads a bit.
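A sketch of the eviction-skip test from the refactor bullet (flag and field names assumed):

	/* skip resolved entries with extra refs; nlookup*() may be
	 * holding them semi-locklessly */
	if ((ncp->nc_flag & NCF_UNRESOLVED) == 0 && ncp->nc_refs > 1)
		continue;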
|
Revision tags: v6.2.2 |
|
#
ba1dbd39 |
| 27-May-2022 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Fix lock order reversal in cache_resolve_mp()
* This function is a helper when path lookups cross mount boundaries.
* Locking order between namecache records and vnodes must be { ncp, vnode }.
* Fix a lock order reversal in cache_resolve_mp() which was doing { vnode, ncp }. This deadlock is very rare because mount points are almost never evicted from the namecache. However, dsynth can trigger this bug due to its heavy use of null mounts and high concurrent path lookup loads.
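A sketch of the corrected acquisition order (helper names assumed):

	/* correct order is { ncp, vnode } */
	cache_lock(nch);			/* namecache record first */
	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);	/* then the vnode */
	/* ... resolve the mount point ... */
	vn_unlock(vp);
	cache_unlock(nch);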
|
#
8938f217 |
| 29-Apr-2022 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - vnode recycling, intermediate fix
* Fix a condition where vnlru (the vnode recycler) can live-lock on unsuitable vnodes in the inactive list and stop making progress, causing the system to block.
First, don't deactivate vnodes which the inactive scan won't recycle. Vnodes which are in the namecache topology but not at a leaf won't be recycled by the vnlru thread. Leave these vnodes on the active queue. This prevents the inactive queue from filling up with vnodes that it can't recycle.
Second, the active scan in vnlru() will now call cache_inval_vp_quick() to attempt to make a vnode presentable so it can be deactivated. The inactive scan also does the same thing, because some leakage can happen anyway.
* The active scan should be able to make continuous progress as successful cache_inval_vp_quick() calls make more and more vnodes presentable that might have previously been internal nodes in the namecache topology. So the active scan should be able to achieve the desired balance between the active and inactive queue.
* This should also improve performance when constant recycling is happening by moving more of the work to the active->inactive transition and doing less work in the inactive->free transition.
* Add cache_inval_vp_quick(), a function which attempts to trivially disassociate a vnode from the namecache topology and will handle any direct children if the vnode is not at a leaf (but not recursively on its own). The definition of 'trivially' for the children are namecache records that can be locked non-blocking, have no additional refs, and do not record a vnode.
* Clean up cache_unlink_parent(). Have cache_zap() use this function instead of rerolling the same code. The cache_rename() code winds up being slightly more complex. And now cache_inval_vp_quick() can use the function too.
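A sketch of the 'trivial child' test in cache_inval_vp_quick() (queue macros and helper names assumed):

	struct namecache *kid, *nextkid;

	TAILQ_FOREACH_MUTABLE(kid, &ncp->nc_list, nc_entry, nextkid) {
		if (_cache_lock_nonblock(kid) != 0)
			return;		/* must lock without blocking */
		if (kid->nc_refs > 1 || kid->nc_vp != NULL) {
			_cache_unlock(kid);
			return;		/* extra refs or vnode: not trivial */
		}
		/* unlink the trivial child via cache_unlink_parent() */
		_cache_unlock(kid);
	}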
|
#
83fe95c9 |
| 29-Apr-2022 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Temporary work-around for vnode recyclement problems
* vnlru deadlocks were encountered on grok while indexing ~20 million files in deep directory trees.
* Add vfscache_unres accounting to keep track of unresolved ncp's at the leaves of the namecache tree. Start trimming the namecache when the unres leaf count exceeds 1/16 maxvnodes, in addition to the other algorithms.
* Add code in vnlru to decommission vnodes with children in the namecache when those children are trivial (e.g. unresolved, dead, or negative entries that can be easily locked).
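A sketch of the new trim trigger (variable names follow the commit text; the comparison and the call are assumptions):

	/* trim when unresolved leaf ncps exceed 1/16 of the vnode limit */
	if (vfscache_unres > maxvnodes / 16)
		cache_hysteresis(1);	/* force a trimming pass */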
|
#
62560bbb |
| 30-Jan-2022 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Fix namecache issue that can slow systems down over time
* Fix a serious dead-record issue with the namecache.
It is very common to create a blah.new file and then rename it over an existing file, say blah.jpg, in order to atomically replace the existing file. Such rename-over operations can cause up to (2 * maxvnodes) dead namecache records to build up on a single hash slot's list.
* Over time, this could result in over a million records on a single hash slot's list, which is often scanned during namecache lookups, causing the kernel to turn into a sludge-pile.
This was not a memory leak per se; the kernel still cleans excess structures (above 2 * maxvnodes) up, but that just maintains the status quo and leaves the system in a slow, poorly-responsive state.
* Fixed by proactively deleting matching dead entries during namecache lookups. The 'live' record is typically at the beginning of the list, so the namecache lookup now scans the hash slot's list backwards and attempts to dispose of dead records.
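A sketch of the backward scan (queue macros and list names assumed; dead records accumulate behind the live one near the head):

	struct namecache *ncp, *prev;

	for (ncp = TAILQ_LAST(&nchpp->list, nchash_list); ncp; ncp = prev) {
		prev = TAILQ_PREV(ncp, nchash_list, nc_hash);
		if (ncp->nc_flag & NCF_DESTROYED) {
			/* unlink and dispose of the dead record */
		}
	}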
|
Revision tags: v6.2.1, v6.2.0, v6.3.0, v6.0.1 |
|
#
aaf02314 |
| 08-Jun-2021 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Make sure nl_dvp is non-NULL in a few situations
* When NLC_REFDVP is set, nl_dvp should be returned non-NULL when the nlookup succeeds.
However, there is one case where nlookup() can succeed but nl_dvp can be NULL, and this is when the nlookup() represents a mount-point.
* Fix three instances where this case was not being checked and could lead to a NULL pointer dereference / kernel panic.
* Do the full resolve treatment for cache_resolve_dvp(). In null-mount situations where we have A/B and we null-mount B onto C, path resolutions of C via the null mount will resolve B but not resolve A.
This breaks an assumption that nlookup() and cache_dvpref() make about the parent ncp having a valid vnode. In fact, the parent ncp of B (which is A) might not, because the resolve path for B may have bypassed it due to the presence of the null mount.
* Should fix occasional 'mkdir /var/cache' calls that fail with EINVAL instead of EEXIST.
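A sketch of the guard added at the three call sites (the error code follows the NLC_REFDVP mount-point convention noted elsewhere in this log):

	error = nlookup(&nd);
	if (error == 0 && nd.nl_dvp == NULL) {
		/* nlookup() succeeded on a mount point: no parent dvp */
		error = EINVAL;
	}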
Reported-by: zach
|
Revision tags: v6.0.0, v6.0.0rc1, v6.1.0 |
|
#
e9dbfea1 |
| 21-Mar-2021 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Add kmalloc_obj subsystem step 1
* Implement per-zone memory management to kmalloc() in the form of kmalloc_obj() and friends. Currently the subsystem uses the same malloc_type structure but is otherwise distinct from the normal kmalloc(), so to avoid programming mistakes the *_obj() subsystem post-pends '_obj' to malloc_type pointers passed into it.
This mechanism will eventually replace objcache. This mechanism is designed to greatly reduce fragmentation issues on systems with long uptimes.
Eventually the feature will be better integrated and I will be able to remove the _obj stuff.
* This is an object allocator, so the zone must be dedicated to one type of object with a fixed size. All allocations out of the zone are of that object.
The allocator is not quite type-stable yet, but will be once existential locks are integrated into the freeing mechanism.
* Implement a mini-slab allocator for management. Since the zones are single-object, similar to objcache, the fixed-size mini-slabs are a lot easier to optimize and much simpler in construction than the main kernel slab allocator.
Uses a per-zone/per-cpu active/alternate slab with an ultra-optimized allocation path, and a per-zone partial/full/empty list.
Also has a globaldata-based per-cpu cache of free slabs. The mini-slab allocator frees slabs back to the same cpu they were originally allocated from in order to retain memory locality over time.
* Implement a passive cleanup poller. This currently polls kmalloc zones very slowly looking for excess full slabs to release back to the global slab cache or the system (if the global slab cache is full).
This code will ultimately also handle existential type-stable freeing.
* Fragmentation is greatly reduced due to the distinct zones. Slabs are dedicated to the zone and do not share allocation space with other zones. Also, when a zone is destroyed, all of its memory is cleanly disposed of and there will be no left-over fragmentation.
* Initially use the new interface for the following. These zones tend to or can become quite big:
	vnodes
	namecache (but not related strings)
	hammer2 chains
	hammer2 inodes
	tmpfs nodes
	tmpfs dirents (but not related strings)
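A sketch of the intended call pattern (the declaration macro and flags are assumptions; the commit only specifies the '_obj' suffix convention for malloc_type pointers):

	MALLOC_DEFINE_OBJ(M_EXAMPLE, sizeof(struct example),
			  "example", "example objects");

	struct example *ex;

	ex = kmalloc_obj(sizeof(*ex), M_EXAMPLE_obj, M_WAITOK | M_ZERO);
	/* ... use ex ... */
	kfree_obj(ex, M_EXAMPLE_obj);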
|
#
bb8026fc |
| 31-Jan-2021 |
Sascha Wildner <saw@online.de> |
kernel: Fix two typos in comments.
|
#
d6853f97 |
| 01-Nov-2020 |
Daniel Fojt <df@neosystem.org> |
kernel: fix getcwd(3) return value for non-existing directory
In case the current directory no longer exists, properly return ENOENT from getcwd(), as described in the manpage.
Issue: https://bugs.dragonflybsd.org/issues/3250
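A userland sketch of the now-correct behavior (assumes <unistd.h>, <errno.h>, and <limits.h>):

	char buf[PATH_MAX];

	if (getcwd(buf, sizeof(buf)) == NULL && errno == ENOENT) {
		/* the current directory has been removed */
	}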
|
Revision tags: v5.8.3, v5.8.2 |
|
#
ad121268 |
| 27-Aug-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Deal with VOP_NRENAME races
* VOP_NRENAME() as implemented by the kernel can race any number of ways, including deadlocking, allowing duplicate entries, and panicking tmpfs. It typically requires a heavy test load to replicate, but a dsynth build triggered the issue at least once.
Other recently reported tmpfs issues with log file handling might also be affected.
* A per-mount (semi-global) lock is now obtained whenever a directory is renamed. This helps deal with numerous MP races that can cause lock order reversals.
Loosely taken from netbsd and linux (mjg brought me up to speed on this). Renaming directories is fraught with issues and this fix, while somewhat brutish, is fine. Directories are very rarely renamed at a high rate.
* kern_rename() now proactively locks all four elements of a rename operation (source_dir, source_file, dest_dir, dest_file) instead of only two.
* The new locking function, cache_lock4_tondlocked(), takes no chances on lock order reversals and will use a (currently brute-force) non-blocking and lock cycling algorithm. Probably needs some work.
* Fix a bug in cache_nlookup() related to reusing DESTROYED entries in the hash table. This algorithm tried to reuse the entries while maintaining shared locks, since only the entries need to be manipulated to reuse them. However, this resulted in lookup races which could cause duplicate entries. The duplicate entries then triggered assertions in TMPFS.
* nlookup now tries a little harder and will retry if the parent of an element is flagged DESTROYED after its lock was released. DESTROYED elements are not necessarily temporary events as an operation can wind up running in a deleted directory and must properly fail under those conditions.
* Use krateprintf() to reduce debug output related to rename race reporting.
* Revamp nfsrv_rename() as well (requires more testing).
* Allow nfs_namei() to be called in a loop for retry purposes if desired. It now detects that the nd structure is initialized from a prior run and won't try to re-parse the mbuf (needs testing).
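A sketch of the per-mount serialization from the second bullet (the lock field and helper are hypothetical names):

	lockmgr(&mp->mnt_renlk, LK_EXCLUSIVE);	  /* per-mount rename lock */
	error = kern_rename_locked(fromnd, tond); /* hypothetical helper */
	lockmgr(&mp->mnt_renlk, LK_RELEASE);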
Reported-by: zrj, mjg
|
#
0df73a27 |
| 21-Aug-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Remove global cwd statistics counters
* These were only used for debugging purposes and interfere with MP operation. Just remove them.
|
#
80d831e1 |
| 25-Jul-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Refactor in-kernel system call API to remove bcopy()
* Change the in-kernel system call prototype to take the system call arguments as a separate pointer, and make the contents read-only.
	int sy_call_t (void *);					(old)
	int sy_call_t (struct sysmsg *sysmsg, const void *);	(new)
* System calls with 6 arguments or less no longer need to copy the arguments from the trapframe to a holding structure. Instead, we simply point into the trapframe.
The L1 cache footprint will be a bit smaller, but in simple tests the results are not noticeably faster... maybe 1ns or so (roughly 1%).
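A sketch of a syscall under the new convention (the syscall and its args struct are hypothetical; the signature follows the prototypes above):

	int
	sys_example(struct sysmsg *sysmsg, const struct example_args *uap)
	{
		/* uap points directly into the trapframe (read-only);
		 * no bcopy() into a holding structure is needed */
		return (kern_example(uap->fd, uap->flags));
	}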
|
Revision tags: v5.8.1 |
|
#
ac9f076f |
| 24-Apr-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Remove CACHE_VREF: RACED debugging output
* Remove 'CACHE_VREF: RACED' debugging output after verifying that the race can occur and is properly handled.
|
#
b1999ea8 |
| 28-Mar-2020 |
Sascha Wildner <saw@online.de> |
kernel: Remove <sys/n{amei,lookup}.h> from all files that don't need it.
|
#
c065b635 |
| 04-Mar-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Rename spinlock counter trick API
* Rename the access side of the API from spin_update_*() to spin_access_*() to avoid confusion.
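For reference, the access side follows a seqlock-style pattern roughly like this (signatures and return convention assumed, not verified against the commit):

	int seq;

	do {
		seq = spin_access_start(&obj->spin);	/* snapshot counter */
		/* ... read fields protected by the counter ... */
	} while (spin_access_end(&obj->spin, seq));	/* nonzero: retry */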
|
#
4acc6b1f |
| 04-Mar-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Refactor cache_vref() using counter trick
* Refactor cache_vref() such that it is able to validate that a vnode (whose ref count might be 0) is not in VRECLAIM, without acquiring the vnode lock. This is the normal case.
If cache_vref() is unable to do this, it backs down to the old method which was to get a vnode lock, validate that the vnode is not in VRECLAIM, then release the lock.
* NOTE: In DragonFlyBSD, holding a vref on a vnode (vref, NOT vhold) will prevent the vnode from transitioning to VRECLAIM.
* Use the new feature for nlookup's naccess() tests and for the *stat*() series of system calls.
This significantly increases performance. However, we are not entirely cache-contention free as both the namecache entry and the vnode are still referenced, requiring atomic adds.
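A sketch of the lockless validation (sequence and return convention assumed; see the spin_access_*() entry below for the counter pattern):

	int seq, flags;

	seq = spin_access_start(&vp->v_spin);
	flags = vp->v_flag;
	if (spin_access_end(&vp->v_spin, seq) == 0 &&
	    (flags & VRECLAIMED) == 0) {
		vref(vp);	/* fast path: no vnode lock needed */
	} else {
		/* slow path: vn_lock(), re-check VRECLAIMED, vn_unlock() */
	}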
|
#
fc36a10b |
| 03-Mar-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Normalize the vx_*() vnode interface
* The vx_*() vnode interface is used for initial allocations, reclaims, and terminations.
Normalize all use cases to prevent the mixing together of the vx_*() API and the vn_*() API. For example, vx_lock() should not be paired with vn_unlock(), and so forth.
* Integrate an update-counter mechanism into the vx_*() API, assert reasonability.
* Change vfs_cache.c to use an int update counter instead of a long. The vfs_cache code can't quite use the spin-lock update counter API yet.
Use proper atomics for load and store.
* Implement VOP_GETATTR_QUICK, meant to be a 'quick' version of VOP_GETATTR() that only retrieves information related to permissions and ownership. This will be fast-pathed in a later commit.
* Implement vx_downgrade() to convert an exclusive vx_lock into an exclusive vn_lock (for vnodes). Adjust all use cases in the getnewvnode() path.
* Remove unnecessary locks in tmpfs_getattr() and don't use any in tmpfs_getattr_quick().
* Remove unnecessary locks in hammer2_vop_getattr() and don't use any in hammer2_vop_getattr_quick().
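The pairing rule, stated as a sketch (flags assumed):

	/* vx pairs with vx, vn pairs with vn */
	vx_lock(vp);
	/* ... allocation/reclaim/termination work ... */
	vx_unlock(vp);

	/* the one sanctioned crossover, in the getnewvnode() path */
	vx_lock(vp);
	vx_downgrade(vp);	/* exclusive vx_lock -> exclusive vn_lock */
	vn_unlock(vp);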
|
#
377c06c2 |
| 03-Mar-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Refactor vfs_cache 4/N
* Refactor cache_findmount() to operate conflict-free and cache line bounce free for the most part. The counter trick is used to probe cache entries and combined with a local pcpu spinlock to interlock against unmounts.
The umount code is now a bit more expensive (it has to acquire all pcpu umount spinlocks before cleaning the cache out).
This code is not ideal but it performs up to 6x better on multiple cpus.
* Refactor _cache_mntref() to use a 4-way set association.
* Rewrite cache_copy() and cache_drop_and_cache()'s caching algorithm.
* Use cache_dvpref() from nlookup() instead of rolling the code twice.
* Rewrite the nlookup*() and cache_nlookup*() code to generally leave namecache records unlocked throughout, removing one layer of shared locks from cpu contention. Only the last element is locked.
* Refactor nlookup*()'s handling of absolute paths a bit more.
* Refactor nlookup*()'s handling of NLC_REFDVP to better validate that the parent directory is actually the parent directory.
This also necessitates a nlookupdata.nl_dvp check in various system calls using NLC_REFDVP to detect the mount-point case and return the proper error code (usually EINVAL, but e.g. mkdir would return EEXIST).
* Clean up _cache_lock() and friends to restore the diagnostic messages when a namecache lock stalls for too long.
* FIX: Fix bugs in nlookup*() retry code. The retry code was not properly unwinding symlink path construction during the loop and also not properly resetting the base directory when looping up. This primarily affects NFS.
* NOTE: Using iscsi_crc32() at the moment to get a good hash distribution. This is obviously expensive, but at least it is per-cpu.
* NOTE: The cache_nlookup() nchpp cache still has a shared spin-lock that will cache-line-bounce concurrent acquisitions.
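A rough sketch of the probe shape from the first two bullets (structure and field names are assumptions):

	struct mntcache *mc = &pcpu_mntcache[mycpuid];
	struct mount *mp;
	int i;

	spin_lock_shared(&mc->spin);	/* pcpu: interlocks against unmount */
	for (i = 0; i < 4; ++i) {	/* 4-way set association */
		mp = mc->mntary[(hash + i) & MNTCACHE_MASK];
		if (mp && mp->mnt_ncmounton.ncp == ncp) {
			atomic_add_int(&mp->mnt_refs, 1);
			break;
		}
	}
	spin_unlock_shared(&mc->spin);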
|
#
9460074f |
| 29-Feb-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Improve cache_fullpath(), plus cleanup
* Improve cache_fullpath(). It can use a shared lock rather than an exclusive lock, significantly improving concurrency. Important now since realpath() indirectly uses this function.
* Code cleanup. Remove the unused vfscache_rollup_all().
|
Revision tags: v5.8.0 |
|
#
03081357 |
| 28-Feb-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Refactor vfs_cache 3/N
* Leave the vnode held for each linked namecache entry, allowing us to remove all the hold/drop code for 0->1 and 1->0 lock transitions of ncps.
This significantly simplifies the cache_lock*() and cache_unlock() functions.
* Adjust the vnode recycling code to check v_auxrefs against v_namecache_count instead of against 0.
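A sketch of the adjusted recycling test (surrounding logic assumed):

	/* vnode is held only passively, by its own namecache entries */
	if (vp->v_auxrefs == vp->v_namecache_count) {
		/* eligible for deactivation/recycling */
	}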
|