d6924570 | 03-May-2019 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Fix serious bug in MAP_STACK, deprecate auto-grow semantics
* When MAP_STACK is used without MAP_TRYFIXED, the address the kernel determined for the stack was *NOT* being returned to userland. Instead, userland always got only the hint address.
* This fixes ruby MAP_STACK use cases and possibly more.
* Deprecate MAP_STACK auto-grow semantics. All user mmap() calls with MAP_STACK are now converted to normal MAP_ANON mmaps. The kernel will continue to create an auto-grow stack segment for the primary user stack in exec(), allowing older pthread libraries to continue working, but this feature is deprecated and will be removed in a future release.
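Below is a minimal userland sketch (not from the commit) illustrating the consequence of the fix: the caller must use the address mmap() returns, never the hint, and a plain MAP_ANON mapping now covers the deprecated MAP_STACK case.

    /* Hypothetical example: allocate a thread stack.  MAP_ANON is
     * used directly since user MAP_STACK requests are now converted
     * to normal anonymous mappings anyway. */
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define STACK_SIZE (1024 * 1024)

    int
    main(void)
    {
            void *base = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANON, -1, 0);
            if (base == MAP_FAILED) {
                    perror("mmap");
                    exit(1);
            }
            /* Use the returned address; before the fix, a MAP_STACK
             * request without MAP_TRYFIXED handed back only the hint. */
            printf("stack base %p, top %p\n", base,
                   (void *)((char *)base + STACK_SIZE));
            munmap(base, STACK_SIZE);
            return 0;
    }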
b396bb03 | 24-Mar-2019 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Refactor swapcache heuristic
* Refactor the swapcache inactive queue heuristic to remove a write to a global variable that is in the critical path, and to improve operation. This should reduce cpu cache ping-ponging.
* Change vpgqueues.lcnt from an int to a long, and change miscellaneous use cases in the pageout code to a long as well.
* Use __aligned(64) to 64-byte-align vm_page_queues[]. It was previously only 32-byte aligned.
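As a rough illustration of the alignment change (struct and array names are placeholders, not the kernel's), padding each queue to its own 64-byte cache line keeps per-queue counters from ping-ponging between cpus:

    #define CACHE_LINE_SIZE 64

    struct pagequeue {
            long    lcnt;           /* page count, widened from int */
            /* ... other per-queue fields ... */
    } __attribute__((__aligned__(CACHE_LINE_SIZE)));

    /* 64-byte aligned; the old array was only 32-byte aligned, so
     * two queues could straddle one cache line. */
    static struct pagequeue page_queues[256];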
e05899ce | 23-Mar-2019 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Preliminary vm_page hash lookup (2), cleanups, page wiring
* Correct a bug in vm.fault_quick operation. Soft-busied pages cannot be safely wired or unwired. This fixes a wire/unwire race related panic.
* Optimize vm_page_unhold() to normally not have to obtain any spin-locks at all, since related pages are almost never in the PQ_HOLD VM page queue. This leaves open a minor race condition where pages with a hold_count of 0 can accumulate in PQ_HOLD.
* Add vm_page_scan_hold() to the pageout daemon. Unconditionally scan PQ_HOLD very slowly to remove any pages whose hold_count is 0 (a minimal model follows after this list).
* REFACTOR PAGE WIRING. Wiring vm_page's no longer removes them from whatever paging queue they are on. Instead, proactively remove such pages from the queue only when we need to (typically in the pageout code).
* Remove unused PV_FLAG_VMOBJECT.
* Fix missing atomic-op in pc64/x86_64/efirt.c
* Do not use m->md.pv_list for pagetable pages. It is now only used for terminal pages.
* Properly initialize pv_flags to 0 when a pv_entry is allocated.
* Add debugging to detect managed pmap_enter()s without an object.
* Conditionalize the setting of PG_MAPPED and PG_WRITEABLE in the pmap code to avoid unnecessary cpu cache mastership changes.
* Move assertions in vm_pageout.c that could trigger improperly due to a race.
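The slow PQ_HOLD scan mentioned above can be modeled in userland roughly as follows (all names are illustrative): each pageout pass visits only a bounded number of entries and unlinks pages whose hold_count has already dropped to 0, so the minor race left open by the lockless vm_page_unhold() cannot let stale pages accumulate.

    #include <stdatomic.h>
    #include <stddef.h>

    struct page {
            atomic_int      hold_count;
            struct page     *next;
    };

    /* Scan at most 'limit' pages per pass, unlinking stale entries. */
    static void
    scan_hold_queue(struct page **headp, int limit)
    {
            struct page **pp = headp;

            while (*pp != NULL && limit-- > 0) {
                    struct page *m = *pp;

                    if (atomic_load(&m->hold_count) == 0) {
                            *pp = m->next;  /* hold released; unlink */
                            m->next = NULL;
                    } else {
                            pp = &m->next;  /* still held; skip */
                    }
            }
    }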
16b1cc2d | 23-Mar-2019 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Optimize vm_page_wakeup(), vm_page_hold(), vm_page_unhold()
* vm_page_wakeup() does not need to acquire the vm_page spin-lock. The caller holding the page busied is sufficient.
* vm_page_hold() does not need to acquire the vm_page spin-lock as the caller is expected to hold the page busied, soft-busied, or stabilized via an interlock (such as the vm_object interlock).
* vm_page_unhold() only needs to acquire the vm_page spin-lock on the 1->0 transition of m->hold_count.
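A minimal model of the hold/unhold fast paths, using userland atomics and pthread spin-locks in place of the kernel's (names are illustrative):

    #include <stdatomic.h>
    #include <pthread.h>

    struct page {
            atomic_int              hold_count;
            pthread_spinlock_t      spin;
            int                     on_hold_queue;
    };

    static void
    page_hold(struct page *m)
    {
            /* Caller holds the page busied or interlocked; a plain
             * atomic increment suffices, no spin-lock. */
            atomic_fetch_add(&m->hold_count, 1);
    }

    static void
    page_unhold(struct page *m)
    {
            if (atomic_fetch_sub(&m->hold_count, 1) == 1) {
                    /* 1->0 transition: only now take the spin-lock,
                     * in case the page sits on the hold queue. */
                    pthread_spin_lock(&m->spin);
                    if (m->on_hold_queue &&
                        atomic_load(&m->hold_count) == 0)
                            m->on_hold_queue = 0;   /* dequeue */
                    pthread_spin_unlock(&m->spin);
            }
    }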
70f3bb08 | 23-Mar-2019 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Preliminary vm_page hash lookup
* Add preliminary vm_page hash lookup code which avoids most locks, plus support in vm_fault. Default disabled, with debugging for now.
* This code still soft-busies the vm_page, which is an improvement over hard-busying it in that it won't contend, but we will eventually want to entirely avoid all atomic ops on the vm_page to *really* get the concurrent fault performance.
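The lookup idea can be sketched as follows (hash function, table size, and field names are assumptions, not the committed code): hash (object, pindex) to a bucket, read the candidate without taking any lock, and let the caller soft-busy and re-validate it.

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_HASH_SIZE  4096    /* power of two */

    struct page {
            void            *object;
            uintptr_t       pindex;
    };

    static struct page *page_hash[PAGE_HASH_SIZE];

    static unsigned
    page_hash_index(void *object, uintptr_t pindex)
    {
            uintptr_t h;

            h = (uintptr_t)object ^ (pindex * (uintptr_t)0x9E3779B9UL);
            return ((h >> 6) & (PAGE_HASH_SIZE - 1));
    }

    /* Lockless candidate lookup; caller must soft-busy the page and
     * re-check (object, pindex) before trusting the result. */
    static struct page *
    page_hash_lookup(void *object, uintptr_t pindex)
    {
            struct page *m = page_hash[page_hash_index(object, pindex)];

            if (m != NULL && m->object == object && m->pindex == pindex)
                    return (m);
            return (NULL);
    }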
53ddc8a1 | 26-Feb-2019 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Reduce vm_page_list_find2() stalls
* Reduce stalls in vm_page_list_find2() which can occur in low-memory situations, as well as in other situations. The problem is twofold.
First, that potentially all cpu cores can wind up waiting for a single vm_page's spin-lock to be released.
Second, that a long-held vm_page spin-lock can cause the VM system to stall unnecessarily long.
* Change vm_page_list_find() and vm_page_list_find2() to no longer unconditionally spinlock a vm_page candidate and then retry if it is found to be on the wrong queue.
Instead the code now spinlocks the queue, then iterates vm_page candidates using spin_trylock(), skipping any pages whose spinlocks cannot be immediately acquired. This is lock-order reversed but is ok because we use trylock. Also, by locking the queue first we guarantee that a successfully spinlocked vm_page will be on the correct queue (see the sketch below).
* Should also reduce IPIQ drain stalls reported to the console as shown below. The %rip sample is often found in vm_page_list_find2().
send_ipiq X->Y tgt not draining (STALL_SECONDS)
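The sketch referenced above, using pthread spin-locks to stand in for the kernel's (all names are placeholders):

    #include <pthread.h>
    #include <stddef.h>

    struct page {
            pthread_spinlock_t      spin;
            struct page             *next;
    };

    struct pagequeue {
            pthread_spinlock_t      spin;
            struct page             *head;
    };

    /* Lock the queue first, then trylock candidates.  The reversed
     * lock order is safe because trylock never blocks, and a page
     * locked this way is guaranteed to still be on this queue. */
    static struct page *
    queue_find_page(struct pagequeue *q)
    {
            struct page *m;

            pthread_spin_lock(&q->spin);
            for (m = q->head; m != NULL; m = m->next) {
                    if (pthread_spin_trylock(&m->spin) == 0)
                            break;  /* locked and on-queue */
            }
            pthread_spin_unlock(&q->spin);
            return (m);             /* NULL if all were contended */
    }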
8e5d7c42 | 15-Oct-2018 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Fix NUMA contention due to asymmetric memory
* Fix NUMA contention in situations where memory is associated with CPU cores asymmetrically. In particular, with the 2990WX, half the cores will have no memory associated with them.
* This was forcing DFly to allocate memory from queues belonging to other nearby cores, causing unnecessary SMP contention and burning extra time iterating queues.
* Fix by calculating the average number of free pages per-core, then adjusting any VM page queue with fewer pages than the average by stealing pages from queues with more than the average. We use a simple iterator to steal pages, so the CPUs with less (or zero) direct-attached memory will operate more UMA-like (just on 4K boundaries instead of 256-1024 byte boundaries). A sketch of the balancing pass follows the timings below.
* Tested with a 64-thread concurrent compile test. systat -pv 1 showed all remaining contention disappear. Literally, *ZERO* contention when we run the test with each thread in its own jail with no shared resources.
* NOTE! This fix is specific to asymmetric NUMA configurations, which are fairly rare in the wild, and will not speed up more conventional systems.
* Before and after timings on the 2990WX.
cd /tmp/src
time make -j 128 nativekernel NO_MODULES=TRUE > /dev/null
BEFORE 703.915u 167.605s 0:49.97 1744.0% 9993+749k 22188+8io 216pf+0w
       699.550u 171.148s 0:50.87 1711.5% 9994+749k 21066+8io 150pf+0w
AFTER  678.406u 108.857s 0:45.66 1724.1% 10105+757k 22188+8io 216pf+0w
       674.805u 115.256s 0:46.67 1692.8% 10077+755k 21066+8io 150pf+0w
This is a 4.2 second difference on the second run, an over-8% improvement, which is nothing to sneeze at.
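The balancing pass sketched below is a userland approximation of the fix (queue layout and names are assumptions): compute the average free count across the per-cpu queues, then iterate donors above the average to top up queues below it.

    struct freequeue {
            long    free_count;
    };

    static void
    balance_free_queues(struct freequeue *q, int n)
    {
            long total = 0, avg;
            int i, donor = 0;

            for (i = 0; i < n; i++)
                    total += q[i].free_count;
            avg = total / n;

            for (i = 0; i < n; i++) {
                    /* Top up any queue sitting below the average. */
                    while (q[i].free_count < avg) {
                            /* Simple iterator over above-average
                             * donors; one must exist while any queue
                             * remains below the average. */
                            while (q[donor].free_count <= avg)
                                    donor = (donor + 1) % n;
                            q[donor].free_count--;
                            q[i].free_count++;
                    }
            }
    }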