Revision tags: v6.2.1, v6.2.0, v6.3.0, v6.0.1 |
ab4aa0bb | 16-Jul-2021 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Improve invltlb latency warnings
* Improve kprintf()s for smp_invltlb latency warnings. Make it abundantly clear that these are mostly WARNING messages, not fatal messages.
* Tested on VM with host under load and VM running nice +5.
9bbbdb7e | 27-Jun-2021 | Aaron LI <aly@aaronly.me>

nvmm: Revamp host TLB flush mechanism
* Leverage the pmap layer to track guest pmap generation id and the host CPUs that the guest pmap is active on. This avoids the inefficient _tlb_flush() callbacks from NVMM that invalidate all TLB entries.
* Currently just add all CPUs to the backing pmap for guest physical memory as they are encountered. Do not yet try to remove any CPUs, because multiple vCPUs may wind up (temporarily) scheduled to the same physical CPU. So more sophisticated tracking is needed.
* Fix a bug in SVM's host TLB flush handling where breaking out of the loop and returning, then re-entering the loop on the same cpu, could improperly clear the machine flush request.
Credit to Matt Dillon.
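The generation-id tracking described above can be sketched as follows. This is a hypothetical illustration, not the actual NVMM/pmap code: the guest pmap carries a 64-bit generation counter bumped on every invalidation, each host CPU caches the generation it last flushed with, and a full guest TLB flush is needed only when the cached value is stale.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical sketch of the generation-id scheme: all names here are
 * illustrative stand-ins for the pmap/NVMM fields described above. */
struct guest_pmap {
    uint64_t pm_invgen;           /* bumped on each invalidation */
};

struct host_cpu {
    uint64_t flushed_gen;         /* generation at this CPU's last flush */
};

/* Invalidation path: make every CPU's cached TLB state stale. */
static inline void pmap_bump_invgen(struct guest_pmap *pm)
{
    __atomic_add_fetch(&pm->pm_invgen, 1, __ATOMIC_RELEASE);
}

/* VM entry path: flush only when this CPU has fallen behind,
 * instead of unconditionally invalidating all TLB entries. */
static inline bool need_tlb_flush(struct host_cpu *cpu, struct guest_pmap *pm)
{
    uint64_t gen = __atomic_load_n(&pm->pm_invgen, __ATOMIC_ACQUIRE);

    if (cpu->flushed_gen == gen)
        return false;             /* nothing changed since last flush */
    cpu->flushed_gen = gen;       /* record that we are now current */
    return true;
}
```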
39d0d2cb | 25-Jun-2021 | Aaron LI <aly@aaronly.me>

pmap: Change pmap->pm_invgen to uint64_t to be compatible with NVMM
Change the 'pmap->pm_invgen' member from 'long' to 'uint64_t', to be compatible with NVMM's machgen.
Update the atomic operations on 'pm_invgen' accordingly; there is no need to use the '_acq' acquire version (which includes a read barrier).
Credit to Matt Dillon.
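The point about not needing the '_acq' variant can be shown with C11 atomics (a sketch, not the kernel's atomic API): bumping a monotonic 64-bit generation counter needs no acquire ordering, since no subsequent loads depend on it for synchronization.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch: bump a 64-bit generation counter.  A relaxed increment is
 * sufficient because the counter only needs to be monotonic; the
 * acquire ("_acq") variant and its read barrier buy nothing here. */
static uint64_t invgen_bump(_Atomic uint64_t *gen)
{
    /* atomic_fetch_add_explicit returns the previous value */
    return atomic_fetch_add_explicit(gen, 1, memory_order_relaxed) + 1;
}
```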
3ecc20a0 | 06-Jun-2021 | Aaron LI <aly@aaronly.me>

nvmm: Port to DragonFly #24: pmap transform & TLB invalidation
* Port NetBSD's pmap_ept_transform() to DragonFly's. We don't make 'pmap_ept_has_ad' a global in the pmap code, so we need to pass extra flags to our pmap_ept_transform().
* Replace NetBSD's pmap_tlb_shootdown() with our pmap_inval_smp().
* Add two new fields 'pm_data' & 'pm_tlb_flush' to 'struct pmap', which are used as a callback by NVMM to handle its own TLB invalidation.
Note that pmap_enter() also calls pmap_inval_smp() on EPT/NPT pmap and requires the old PTE be returned, so we can't place the NVMM TLB callback at the beginning part of pmap_inval_smp() and return 0.
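The 'pm_data'/'pm_tlb_flush' hook can be sketched like this. The struct layout and function names below are illustrative, not the actual DragonFly definitions: the pmap carries an opaque client pointer and a flush callback, and the invalidation path hands control to the client when one is installed.

```c
#include <stddef.h>

/* Hypothetical sketch of the NVMM TLB callback hook described above. */
struct pmap;
typedef void (*pm_tlb_flush_t)(struct pmap *pm);

struct pmap {
    void          *pm_data;       /* opaque client state (e.g. the VM) */
    pm_tlb_flush_t pm_tlb_flush;  /* client invalidation callback */
};

static int generic_flushes;       /* stand-in for the normal IPI path */
static int client_flushes;        /* counts callback invocations */

static void client_flush(struct pmap *pm)
{
    (void)pm;
    client_flushes++;             /* the hypervisor flushes its own TLBs */
}

/* Simplified invalidation path: prefer the client callback when set. */
static void pmap_inval_sketch(struct pmap *pm)
{
    if (pm->pm_tlb_flush != NULL)
        pm->pm_tlb_flush(pm);
    else
        generic_flushes++;
}
```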
c713db65 | 24-May-2021 | Aaron LI <aly@aaronly.me>

vm: Change 'kernel_pmap' global to pointer type
Following the previous commits, this commit changes 'kernel_pmap' to a pointer of type 'struct pmap *'. This aligns it better with 'kernel_map' and simplifies the code a bit.
No functional changes.
Revision tags: v6.0.0, v6.0.0rc1, v6.1.0, v5.8.3, v5.8.2, v5.8.1, v5.8.0, v5.9.0, v5.8.0rc1, v5.6.3 |
0ad80e33 | 14-Sep-2019 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add needed ccfence and more error checks
* Add a cpu_ccfence() to PMAP_PAGE_BACKING_SCAN in order to prevent over-optimization of the ipte load by the compiler.
* Add machine-dependent assertion in the vm_page_free*() path to ensure that the page is not normally mapped at the time of the free.
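A cpu_ccfence()-style compiler barrier can be sketched as below (the real macro lives in DragonFly's machine headers; this is an illustrative equivalent): an empty asm with a "memory" clobber forces the compiler to re-load memory operands rather than reusing a cached register copy, without emitting any hardware fence instruction.

```c
#include <stdint.h>

/* Illustrative compiler barrier in the spirit of cpu_ccfence(). */
#define ccfence()  __asm__ __volatile__("" ::: "memory")

/* Example in the spirit of PMAP_PAGE_BACKING_SCAN: force a fresh load
 * of the PTE on every pass so the compiler cannot hoist the load out
 * of a retry loop and spin on a stale value. */
static uint64_t pte_reload(uint64_t *ptep)
{
    uint64_t ipte;

    ccfence();              /* discard cached copies; re-load *ptep */
    ipte = *ptep;
    return ipte;
}
```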
Revision tags: v5.6.2, v5.6.1, v5.6.0, v5.6.0rc1, v5.7.0 |
e3c330f0 | 19-May-2019 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - VM rework part 12 - Core pmap work, stabilize & optimize
* Add tracking for the number of PTEs mapped writeable in md_page. Change how PG_WRITEABLE and PG_MAPPED is cleared in the vm_page to avoid clear/set races. This problem occurs because we would have otherwise tried to clear the bits without hard-busying the page. This allows the bits to be set with only an atomic op.
Procedures which test these bits universally do so while holding the page hard-busied, and now call pmap_mapped_sync() beforehand to properly synchronize the bits.
* Fix bugs related to various counters: pm_stats.resident_count, wiring counts, vm_page->md.writeable_count, and vm_page->md.pmap_count.
* Fix bugs related to synchronizing removed pte's with the vm_page. Fix one case where we were improperly updating (m)'s state based on a lost race against a pte swap-to-0 (pulling the pte).
* Fix a bug related to the page soft-busying code when the m->object/m->pindex race is lost.
* Implement a heuristic version of vm_page_active() which just updates act_count unlocked if the page is already in the PQ_ACTIVE queue, or if it is fictitious.
* Allow races against the backing scan for pmap_remove_all() and pmap_page_protect(VM_PROT_READ). Callers of these routines for these cases expect full synchronization of the page dirty state. We can identify when a page has not been fully cleaned out by checking vm_page->md.pmap_count and vm_page->md.writeable_count. In the rare situation where this happens, simply retry.
* Assert that the PTE pindex is properly interlocked in pmap_enter(). We still allow PTEs to be pulled by other routines without the interlock, but multiple pmap_enter()s of the same page will be interlocked.
* Assert additional wiring count failure cases.
* (UNTESTED) Flag DEVICE pages (dev_pager_getfake()) as being PG_UNMANAGED. This essentially prevents all the various reference counters (e.g. vm_page->md.pmap_count and vm_page->md.writeable_count), PG_M, PG_A, etc from being updated.
The vm_page's aren't tracked in the pmap at all because there is no way to find them: they are 'fake', so without a pv_entry we can't track them. Instead we simply rely on the vm_map_backing scan to manipulate the PTEs.
* Optimize the new vm_map_entry_shadow() to use a shared object token instead of an exclusive one. OBJ_ONEMAPPING will be cleared with the shared token.
* Optimize single-threaded access to pmaps to avoid pmap_inval_*() complexities.
* Optimize __read_mostly for more globals.
* Optimize pmap_testbit(), pmap_clearbit(), pmap_page_protect(). Pre-check vm_page->md.writeable_count and vm_page->md.pmap_count for an easy degenerate return; before real work.
* Optimize pmap_inval_smp() and pmap_inval_smp_cmpset() for the single-threaded pmap case, when called on the same CPU the pmap is associated with. This allows us to use simple atomics and cpu_*() instructions and avoid the complexities of the pmap_inval_*() infrastructure.
* Randomize the page queue used in bio_page_alloc(). This does not appear to hurt performance (e.g. heavy tmpfs use) on large many-core NUMA machines and it makes the vm_page_alloc()'s job easier.
This change might have a downside for temporary files, but for more long-lasting files there's no point allocating pages localized to a particular cpu.
* Optimize vm_page_alloc().
(1) Refactor the _vm_page_list_find*() routines to avoid re-scanning the same array indices over and over again when trying to find a page.
(2) Add a heuristic, vpq.lastq, for each queue, which we set if a _vm_page_list_find*() operation had to go far-afield to find its page. Subsequent finds will skip to the far-afield position until the current CPUs queues have pages again.
(3) Reduce PQ_L2_SIZE from an extravagant 2048 entries per queue down to 1024. The original 2048 was meant to provide 8-way set-associativity for 256 cores but wound up reducing performance due to longer index iterations.
* Refactor the vm_page_hash[] array. This array is used to shortcut vm_object locks and locate VM pages more quickly, without locks. The new code limits the size of the array to something more reasonable, implements a 4-way set-associative replacement policy using 'ticks', and rewrites the hashing math.
* Effectively remove pmap_object_init_pt() for now. In current tests it does not actually improve performance, probably because it may map pages that are not actually used by the program.
* Remove vm_map_backing->refs. This field is no longer used.
* Remove more of the old now-stale code related to use of pv_entry's for terminal PTEs.
* Remove more of the old shared page-table-page code. This worked but could never be fully validated and was prone to bugs. So remove it. In the future we will likely use larger 2MB and 1GB pages anyway.
* Remove pmap_softwait()/pmap_softhold()/pmap_softdone().
* Remove more #if 0'd code.
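The 4-way set-associative vm_page_hash[] rework with a 'ticks'-based replacement policy can be sketched as follows. The sizes, hash math, and names below are illustrative, not the actual DragonFly implementation: lookups probe the four ways of one set without any locks, and insertion evicts the least recently stamped way.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical 4-way set-associative lookup cache in the spirit of
 * the vm_page_hash[] rework described above. */
#define HASH_SETS 256                 /* power of two, kept modest */
#define HASH_WAYS 4

struct hash_entry {
    const void *obj;                  /* key part 1 (e.g. vm_object) */
    uint64_t    pindex;               /* key part 2 (page index) */
    const void *page;                 /* cached result */
    int         ticks;                /* last-use stamp for replacement */
};

static struct hash_entry hash_tab[HASH_SETS][HASH_WAYS];

static unsigned hash_set(const void *obj, uint64_t pindex)
{
    uint64_t h = (uintptr_t)obj ^ (pindex * 0x9E3779B97F4A7C15ULL);
    return (unsigned)((h >> 16) & (HASH_SETS - 1));
}

/* Insert: reuse a matching or empty way, else evict the oldest one. */
static void hash_enter(const void *obj, uint64_t pindex,
                       const void *page, int now)
{
    struct hash_entry *set = hash_tab[hash_set(obj, pindex)];
    struct hash_entry *victim = &set[0];
    int i;

    for (i = 0; i < HASH_WAYS; i++) {
        if (set[i].page == NULL ||
            (set[i].obj == obj && set[i].pindex == pindex)) {
            victim = &set[i];
            break;
        }
        if (set[i].ticks < victim->ticks)
            victim = &set[i];         /* least recently stamped way */
    }
    victim->obj = obj;
    victim->pindex = pindex;
    victim->page = page;
    victim->ticks = now;
}

static const void *hash_lookup(const void *obj, uint64_t pindex)
{
    struct hash_entry *set = hash_tab[hash_set(obj, pindex)];
    int i;

    for (i = 0; i < HASH_WAYS; i++)
        if (set[i].obj == obj && set[i].pindex == pindex)
            return set[i].page;
    return NULL;                      /* miss: caller takes the slow path */
}
```

A miss simply falls back to the locked path, so the cache only has to be probabilistically right, never authoritative.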
Revision tags: v5.4.3, v5.4.2, v5.4.1, v5.4.0, v5.5.0, v5.4.0rc1, v5.2.2, v5.2.1, v5.2.0, v5.3.0, v5.2.0rc, v5.0.2, v5.0.1, v5.0.0 |
5b49787b | 05-Oct-2017 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor smp collision statistics
* Add an indefinite wait timing API (sys/indefinite.h, sys/indefinite2.h). This interface uses the TSC and will record lock latencies to our pcpu stats in microseconds. The systat -pv 1 display shows this under smpcoll.
Note that latencies generated by tokens, lockmgr, and mutex locks do not necessarily reflect actual lost cpu time as the kernel will schedule other threads while those are blocked, if other threads are available.
* Formalize TSC operations more, supply a type (tsc_uclock_t and tsc_sclock_t).
* Reinstrument lockmgr, mutex, token, and spinlocks to use the new indefinite timing interface.
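The TSC-to-microseconds conversion at the heart of the indefinite wait API can be sketched like this. The names are illustrative and the clock value is passed in as a plain cycle count so the arithmetic is visible without a real rdtsc; the TSC frequency is assumed to be known (the kernel calibrates it at boot).

```c
#include <stdint.h>

/* Sketch of the indefinite-wait timing idea described above. */
typedef uint64_t tsc_uclock_t;

struct indefinite_info {
    tsc_uclock_t base;            /* TSC at the start of the wait */
};

static void indefinite_begin(struct indefinite_info *info, tsc_uclock_t now)
{
    info->base = now;
}

/* Convert the elapsed cycle count to microseconds for the pcpu stats. */
static uint64_t indefinite_done(const struct indefinite_info *info,
                                tsc_uclock_t now, uint64_t tsc_frequency)
{
    return (now - info->base) * 1000000ULL / tsc_frequency;
}
```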
Revision tags: v5.0.0rc2, v5.1.0, v5.0.0rc1, v4.8.1 |
67534613 | 27-Mar-2017 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Enhance the sniff code, refactor interrupt disablement for IPIs
* Add kern.sniff_enable, default to 1. Allows the sysop to disable the feature if desired.
* Add kern.sniff_target, allows sniff IPIs to be targeted to all cpus (-1), or to a particular cpu (0...N). This feature allows the sysop to test IPI delivery to particular CPUs (typically monitoring with systat -pv 0.1) to determine that delivery is working properly.
* Bring in some additional AMD-specific setup from FreeBSD, beginnings of support for the APIC Extended space. For now just make sure the extended entries are masked.
* Change interrupt disablement expectations. The caller of apic_ipi(), selected_apic_ipi(), and related macros is now required to hard-disable interrupts rather than these functions doing so. This allows the caller to run certain operational sequences atomically.
* Use the TSC to detect IPI send stalls instead of a hard-coded loop count.
* Also set the APIC_LEVEL_ASSERT bit when issuing a directed IPI, though the spec says this is unnecessary. Do it anyway.
* Remove unnecessary critical section in selected_apic_ipi(). We are in a hard-disablement and in particular we do not want to accidentally trigger a splz() due to the crit_exit() while in the hard-disablement.
* Enhance the IPI stall detection and recovery code. Provide more information. Also enable the LOOPMASK_IN debugging tracker by default.
* Add a testing feature to machdep.all_but_self_ipi_enable. By setting this to 2, we force the smp_invltlb() to always use the ALL_BUT_SELF IPI. For testing only.
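Using the TSC to detect IPI send stalls instead of a hard-coded loop count can be sketched as below. The timing source and APIC-idle check are caller-supplied stand-ins for illustration; the point is that the timeout is expressed in cycles, independent of how fast the spin loop body happens to run.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of TSC-based stall detection replacing a spin count. */
static bool ipi_send_wait(uint64_t (*read_tsc)(void),
                          bool (*apic_idle)(void),
                          uint64_t timeout_cycles)
{
    uint64_t base = read_tsc();

    while (!apic_idle()) {
        if (read_tsc() - base > timeout_cycles)
            return false;         /* stalled: let recovery code run */
    }
    return true;                  /* previous IPI delivered */
}

/* Tiny fake environment for demonstration only. */
static uint64_t fake_tsc;
static int idle_after;
static uint64_t fake_read_tsc(void) { return fake_tsc++; }
static bool fake_apic_idle(void) { return --idle_after <= 0; }
```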
Revision tags: v4.8.0, v4.6.2, v4.9.0, v4.8.0rc |
95270b7e | 01-Feb-2017 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Many fixes for vkernel support, plus a few main kernel fixes
REAL KERNEL
* The big enchilada is that the main kernel's thread switch code has a small timing window where it clears the PM_ACTIVE bit for the cpu while switching between two threads. However, it *ALSO* checks and avoids loading the %cr3 if the two threads have the same pmap.
This results in a situation where an invalidation on the pmap in another cpu may not have visibility to the cpu doing the switch, and yet the cpu doing the switch also decides not to reload %cr3 and so does not invalidate the TLB either. The result is a stale TLB and bad things happen.
For now just unconditionally load %cr3 until I can come up with code to handle the case.
This bug is very difficult to reproduce on a normal system, it requires a multi-threaded program doing nasty things (munmap, etc) on one cpu while another thread is switching to a third thread on some other cpu.
* KNOTE after handling the vkernel trap in postsig() instead of before.
* Change the kernel's pmap_inval_smp() code to take a 64-bit npgs argument instead of a 32-bit npgs argument. This fixes situations that crop up when a process uses more than 16TB of address space.
* Add an lfence to the pmap invalidation code that I think might be needed.
* Handle some wrap/overflow cases in pmap_scan() related to the use of large address spaces.
* Fix an unnecessary invltlb in pmap_clearbit() for unmanaged PTEs.
* Test PG_RW after locking the pv_entry to handle potential races.
* Add bio_crc to struct bio. This field is only used for debugging for now but may come in useful later.
* Add some global debug variables in the pmap_inval_smp() and related paths. Refactor the npgs handling.
* Load the tsc_target field after waiting for completion of the previous invalidation op instead of before. Also add a conservative mfence() in the invalidation path before loading the info fields.
* Remove the global pmap_inval_bulk_count counter.
* Adjust swtch.s to always reload the user process %cr3, with an explanation. FIXME LATER!
* Add some test code to vm/swap_pager.c which double-checks that the page being paged out does not get corrupted during the operation. This code is #if 0'd.
* We must hold an object lock around the swp_pager_meta_ctl() call in swp_pager_async_iodone(). I think.
* Reorder when PG_SWAPINPROG is cleared. Finish the I/O before clearing the bit.
* Change the vm_map_growstack() API to pass a vm_map in instead of curproc.
* Use atomic ops for vm_object->generation counts, since objects can be locked shared.
VKERNEL
* Unconditionally save the FP state after returning from VMSPACE_CTL_RUN. This solves a severe FP corruption bug in the vkernel due to calls it makes into libc (which uses %xmm registers all over the place).
This is not a complete fix. We need a formal userspace/kernelspace FP abstraction. Right now the vkernel doesn't have a kernelspace FP abstraction so if a kernel thread switches preemptively bad things happen.
* The kernel tracks and locks pv_entry structures to interlock pte's. The vkernel never caught up, and does not really have a pv_entry or placemark mechanism. The vkernel's pmap really needs a complete re-port from the real-kernel pmap code. Until then, we use poor hacks.
* Use the vm_page's spinlock to interlock pte changes.
* Make sure that PG_WRITEABLE is set or cleared with the vm_page spinlock held.
* Have pmap_clearbit() acquire the pmobj token for the pmap in the iteration. This appears to be necessary, currently, as most of the rest of the vkernel pmap code also uses the pmobj token.
* Fix bugs in the vkernel's swapu32() and swapu64().
* Change pmap_page_lookup() and pmap_unwire_pgtable() to fully busy the page. Note however that a page table page is currently never soft-busied. Also adjust other vkernel code that busies a page table page.
* Fix some sillycode in a pmap->pm_ptphint test.
* Don't inherit e.g. PG_M from the previous pte when overwriting it with a pte of a different physical address.
* Change the vkernel's pmap_clear_modify() function to clear VPTE_RW (which also clears VPTE_M), and not just VPTE_M. Formally we want the vkernel to be notified when a page becomes modified and it won't be unless we also clear VPTE_RW and force a fault. <--- I may change this back after testing.
* Wrap pmap_replacevm() with a critical section.
* Scrap the old grow_stack() code. vm_fault() and vm_fault_page() handle it (vm_fault_page() just now got the ability).
* Properly flag VM_FAULT_USERMODE.
Revision tags: v4.6.1 |
e1bcf416 | 08-Oct-2016 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor VMX code
* Refactor the VMX code to use all three available VMM states instead of two. The three states are:
  - active and current (VMPTRLD)
  - active, not current (replaced by some other context being VMPTRLD'd)
  - inactive, not current (VMCLEAR)
In short, there is no need to VMCLEAR the current context when activating another via VMPTRLD, doing so greatly reduces performance. VMCLEAR is only really needed when a context is being destroyed or being moved to another cpu.
* Also fixes a few bugs along the way.
* Live loop in vmx_vmrun() when necessary, otherwise we wind up with serious problems synchronizing IPIs. The thread will still be subject to its process priority.
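The three-state scheme can be sketched as a small state machine. The enum and function names below are illustrative, not the actual driver code; the key property is that making a new context current does not VMCLEAR the old one, which merely becomes "active, not current" and keeps its CPU-cached VMCS state.

```c
#include <stddef.h>

/* Hypothetical sketch of the three VMCS states described above. */
enum vmcs_state {
    VMCS_INACTIVE,        /* VMCLEARed, not loaded anywhere */
    VMCS_ACTIVE,          /* loaded at some point, not current */
    VMCS_ACTIVE_CURRENT   /* the VMPTRLD'd context on this cpu */
};

/* Make 'next' current.  The previously current context is NOT
 * VMCLEARed -- it just stops being current, so its cached state
 * survives and performance is preserved. */
static void vmcs_make_current(enum vmcs_state *cur, enum vmcs_state *next)
{
    if (cur != NULL && *cur == VMCS_ACTIVE_CURRENT)
        *cur = VMCS_ACTIVE;       /* replaced, still cached by the cpu */
    *next = VMCS_ACTIVE_CURRENT;  /* VMPTRLD */
}

/* Destroying a context (or migrating it to another cpu) is the only
 * time VMCLEAR is actually required. */
static void vmcs_destroy(enum vmcs_state *st)
{
    *st = VMCS_INACTIVE;          /* VMCLEAR */
}
```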
4373ea1c | 26-Sep-2016 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Change IPI recovery watchdog
* Change the default recovery watchdog timeout for IPIs from 1/16 second to 2 seconds and the repeated timeout from 1/2 second to 2 seconds.
* Add missing initialization to pmap_inval_smp_cmpset(), without it an improper watchdog timeout/retry could occur.
* This still may not fix issues with VMs when core threads cause extreme latencies on the host.
Reported-by: zach
398af52e | 07-Sep-2016 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Deal with lost IPIs (VM related) (2)
* Fix an issue where Xinvltlb interacts badly with a drm console framebuffer, imploding the machine. The 1/16 second watchdog can trigger during certain DRM operations due to excessive interrupt disablement in the linux DRM code.
* Avoid kprintf()ing anything by default.
* Also make a minor fix to the watchdog logic to force the higher-level Xinvltlb loop to re-test.
bba35d66 | 06-Sep-2016 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Deal with lost IPIs (VM related)
* Some (all?) VMs appear to be able to lose IPIs. Hopefully the same can't be said for device interrupts! Add some recovery code for lost Xinvltlb IPIs for now.
For synchronizing invalidations we use the TSC and run a recovery attempt after 1/16 second, and every 1 second there-after, if an Xinvltlb is not responded to (smp_invltlb() and smp_invlpg()). The IPI will be re-issued.
* Some basic testing shows that a VM can stall out a cpu thread for an indefinite period of time, potentially causing the above watchdog to trigger. Even so it should not have required re-issuing the IPI, but it seems it does, so the VM appears to be losing the IPI(!) when a cpu thread stalls out on the host! At least with the VM we tested under, type unknown.
* IPIQ IPIs currently do not have any specific recovery but I think each cpu will poll for IPIQs slowly in the idle thread, so they might automatically recover anyway.
Reported-by: zach
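The recovery schedule described above (first retry after 1/16 second, then every 1 second) can be sketched as follows. The timing source and IPI hooks are caller-supplied stand-ins for illustration, not the actual kernel interfaces.

```c
#include <stdint.h>

/* Sketch of the lost-IPI watchdog: spin until the invalidation
 * completes, re-issuing the IPI at 1/16 second and then every
 * 1 second thereafter.  Returns the number of re-issues. */
static int invl_wait_sketch(uint64_t tsc_freq,
                            uint64_t (*read_tsc)(void),
                            int (*done)(void),
                            void (*reissue_ipi)(void))
{
    uint64_t base = read_tsc();
    uint64_t next = tsc_freq / 16;    /* first recovery at 1/16 second */
    int reissues = 0;

    while (!done()) {
        if (read_tsc() - base >= next) {
            reissue_ipi();            /* assume the IPI was lost */
            reissues++;
            next += tsc_freq;         /* then retry every 1 second */
        }
    }
    return reissues;
}

/* Fake environment for demonstration only. */
static uint64_t sim_tsc;
static int sim_remaining;
static int sim_ipis;
static uint64_t sim_read_tsc(void) { return sim_tsc += 1000; }
static int sim_done(void) { return --sim_remaining <= 0; }
static void sim_reissue(void) { sim_ipis++; }
```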
5dac90bc | 31-Aug-2016 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix LOOPMASK debugging for Xinvltlb
* Fix LOOPMASK debugging for Xinvltlb, the #if 1 can now be set to #if 0 to turn off the debugging.
Revision tags: v4.6.0, v4.6.0rc2 |
1a5c7e0f | 24-Jul-2016 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor Xinvltlb a little, turn off the idle-thread invltlb opt
* Turn off the idle-thread invltlb optimization. This feature can be turned on with a sysctl (default-off) machdep.optimized_invltlb. It will be turned on by default when we've life-tested that it works properly.
* Remove excess critical sections and interrupt disablements. All entries into smp_invlpg() now occur with interrupts already disabled and the thread already in a critical section. This also defers critical-section 1->0 transition handling away from smp_invlpg() and into its caller.
* Refactor the Xinvltlb APIs a bit. Have Xinvltlb enter the critical section (it didn't before). Remove the critical section from smp_inval_intr(). The critical section is now handled by the assembly, and by any other callers.
* Add additional tsc-based loop/counter debugging to try to catch problems.
* Move inner-loop handling of smp_invltlb_mask to act on invltlbs a little faster.
* Disable interrupts a little later inside pmap_inval_smp() and pmap_inval_smp_cmpset().
Revision tags: v4.6.0rc, v4.7.0 |
ccd67bf6 | 17-Jul-2016 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor Xinvltlb (3)
* Roll up invalidation operations for numerous kernel-related pmap operations, reducing the number of IPIs needed (particularly for buffer cache operations).
* Implement semi-synchronous command execution, where target cpus do not need to wait for the originating cpu to execute a command. This is used for the above rollups when the related kernel memory is known to be accessed concurrently with the pmap operations.
* Support invalidation of VA ranges.
* Support reduction of target cpu set for semi-synchronous commands, including invltlb's, by removing idle cpus from the set when possible.
79f2da03 | 15-Jul-2016 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor Xinvltlb and the pmap page & global tlb invalidation code
* Augment Xinvltlb to handle both TLB invalidation and per-page invalidation
* Remove the old lwkt_ipi-based per-page invalidation code.
* Include Xinvltlb interrupts in the V_IPI statistics counter (so they show up in systat -pv 1).
* Add loop counters to detect and log possible endless loops.
* (Fix single_apic_ipi_passive() but note that this function is currently not used. Interrupts must be hard-disabled when checking icr_lo).
* NEW INVALIDATION MECHANISM
The new invalidation mechanism is primarily enclosed in mp_machdep.c and pmap_inval.c. Supply new all-in-one rollup functions which include the *ptep contents adjustment, instead of prior piecemeal functions.
The new mechanism uses Xinvltlb for both full-tlb and per-page invalidations. This interrupt ignores critical sections (that is, will operate even if kernel code is in a critical section), which significantly improves the latency and stability of our pmap pte invalidation support functions.
For example, prior to these changes the invalidation code uses the lwkt_ipiq paths which are subject to critical sections and could result in long stalls across substantially ALL cpus when one cpu was in a long cpu-bound critical section.
* NEW SMP_INVLTLB() OPTIMIZATION
smp_invltlb() always used Xinvltlb, and it still does. However the code now avoids IPIing idle cpus, instead flagging them to issue the cpu_invltlb() call when they wake-up.
To make this work the idle code must temporarily enter a critical section so 'normal' interrupts do not run until it has a chance to check and act on the flag. This will slightly increase interrupt latency on an idle cpu.
This change significantly improves smp_invltlb() overhead by avoiding having to pull idle cpus out of their high-latency/low-power state, and thus also keeps the high wakeup latency on those cpus from holding up the whole operation.
* Remove unnecessary calls to smp_invltlb(). It is not necessary to call this function when a *ptep is transitioning from 0 to non-zero. This significantly cuts down on smp_invltlb() traffic under load.
* Remove a bunch of unused code in these paths.
* Add machdep.report_invltlb_src and machdep.report_invlpg_src, down counters which do one stack backtrace when they hit 0.
TIMING TESTS
No appreciable differences with the new code other than feeling smoother.
mount_tmpfs dummy /usr/obj
On monster (4-socket, 48-core):
    time make -j 50 buildworld
        BEFORE: 7849.697u 4693.979s 16:23.07 1275.9%
        AFTER:  7682.598u 4467.224s 15:47.87 1281.8%
    time make -j 50 nativekernel NO_MODULES=TRUE
        BEFORE: 927.608u 254.626s 1:36.01 1231.3%
        AFTER:  531.124u 204.456s 1:25.99 855.4%
On 2 x E5-2620 (2-socket, 32-core):
    time make -j 50 buildworld
        BEFORE: 5750.042u 2291.083s 10:35.62 1265.0%
        AFTER:  5694.573u 2280.078s 10:34.96 1255.9%
    time make -j 50 nativekernel NO_MODULES=TRUE
        BEFORE: 431.338u 84.458s 0:54.71 942.7%
        AFTER:  414.962u 92.312s 0:54.75 926.5%
        (time mostly spent in the mkdep line and on the final link)
Memory thread tests, 64 threads each allocating memory:
    BEFORE: 3.1M faults/sec
    AFTER:  3.1M faults/sec
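The smp_invltlb() idle-cpu optimization described above can be sketched as below. All names and the fixed CPU count are illustrative: instead of IPIing an idle cpu out of its low-power state, the originator sets a per-cpu pending flag, and the idle loop (briefly in a critical section so no 'normal' interrupt can slip in first) checks the flag on wakeup and runs cpu_invltlb() itself.

```c
#include <stdbool.h>

/* Hypothetical sketch of deferring invltlb work on idle cpus. */
#define NCPUS 4

struct sketch_cpu {
    bool idle;
    bool invltlb_pending;     /* set instead of sending an IPI */
    int  tlb_flushes;         /* counts cpu_invltlb() calls */
};

static struct sketch_cpu cpus[NCPUS];
static int ipis_sent;

static void smp_invltlb_sketch(void)
{
    int i;

    for (i = 0; i < NCPUS; i++) {
        if (cpus[i].idle)
            cpus[i].invltlb_pending = true;  /* cheap: no wakeup */
        else {
            ipis_sent++;                     /* Xinvltlb IPI */
            cpus[i].tlb_flushes++;
        }
    }
}

/* Run by a cpu as it leaves the idle loop, before normal interrupts. */
static void idle_exit_sketch(struct sketch_cpu *cpu)
{
    cpu->idle = false;
    if (cpu->invltlb_pending) {              /* deferred invalidation */
        cpu->invltlb_pending = false;
        cpu->tlb_flushes++;                  /* cpu_invltlb() */
    }
}
```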
Revision tags: v4.4.3, v4.4.2, v4.4.1, v4.4.0, v4.5.0, v4.4.0rc, v4.2.4, v4.3.1, v4.2.3, v4.2.1, v4.2.0, v4.0.6, v4.3.0, v4.2.0rc, v4.0.5, v4.0.4, v4.0.3, v4.0.2, v4.0.1, v4.0.0, v4.0.0rc3, v4.0.0rc2, v4.0.0rc, v4.1.0, v3.8.2 |
b8d5441d | 13-Jul-2014 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add two features to improve qemu emulation (64-bit only)
* Implement a tunable for machdep.cpu_idle_hlt, allowing it to be set in /boot/loader.conf. For qemu the admin might want to set the value to 4 (always use HLT) instead of the default 2.
* Implement a tunable and new sysctl, machdep.pmap_fast_kernel_cpusync, which defaults to disabled (0). Setting this to 1 in /boot/loader.conf or at anytime via sysctl tells the kernel to use a one-stage pmap invalidation for kernel_pmap updates. User pmaps are not affected and will still use two-stage invalidations.
One-stage pmap invalidations only have to spin on the originating cpu, but all other cpus will not be quiesced when updating a kernel_map pmap entry. This is untested as there might be situations where the kernel pmap is updated without an interlock (though most should be interlocked already).
This second sysctl/tunable, if enabled, greatly improves qemu performance particularly when the number of qemu cpus is greater than the number of real cpus. It probably improves real hardware system performance as well, but is not recommended for production at this time.
cc694a4a | 30-Jun-2014 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Move CPUMASK_LOCK out of the cpumask_t
* Add cpulock_t (a 32-bit integer on all platforms) and implement CPULOCK_EXCL as well as space for a counter.
* Break out CPUMASK_LOCK, add a new field to the pmap (pm_active_lock) and to the process vmm (p_vmm_cpulock), and implement the mmu interlock there.
The VMM subsystem uses additional bits in cpulock_t as a mask counter for implementing its interlock.
The PMAP subsystem just uses the CPULOCK_EXCL bit in pm_active_lock for its own interlock.
* Max cpus on 64-bit systems is now 64 instead of 63.
* cpumask_t is now just a pure cpu mask and no longer requires all-or-none atomic ops, just normal bit-for-bit atomic ops. This will allow us to hopefully extend it past the 64-cpu limit soon.
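A cpulock_t-style word can be sketched as follows. The bit layout and names are hypothetical, not the actual DragonFly definitions: one bit serves as the exclusive interlock and the remaining bits leave room for a counter, all manipulated with ordinary 32-bit atomics now that the lock no longer lives inside the cpumask itself.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

/* Illustrative 32-bit lock word: EXCL bit plus counter space. */
typedef _Atomic uint32_t cpulock_t;

#define CPULOCK_EXCL    0x80000000u   /* exclusive interlock bit */
#define CPULOCK_CNTMASK 0x7FFFFFFFu   /* counter bits */

static bool cpulock_try_excl(cpulock_t *lk)
{
    uint32_t v = atomic_load(lk);

    if (v & CPULOCK_EXCL)
        return false;
    /* set EXCL only if nobody beat us to it */
    return atomic_compare_exchange_strong(lk, &v, v | CPULOCK_EXCL);
}

static void cpulock_rel_excl(cpulock_t *lk)
{
    atomic_fetch_and(lk, ~CPULOCK_EXCL);
}

/* Bump the counter portion; returns the previous count. */
static uint32_t cpulock_count(cpulock_t *lk)
{
    return atomic_fetch_add(lk, 1) & CPULOCK_CNTMASK;
}
```

With the lock bit moved out, cpumask_t itself only ever needs plain bit-for-bit atomic ops, which is what allows all 64 bits (and eventually more) to represent cpus.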
Revision tags: v3.8.1, v3.6.3, v3.8.0, v3.8.0rc2, v3.9.0, v3.8.0rc, v3.6.2, v3.6.1, v3.6.0, v3.7.1, v3.6.0rc, v3.7.0 |
a86ce0cd | 20-Sep-2013 | Matthew Dillon <dillon@apollo.backplane.com>

hammer2 - Merge Mihai Carabas's VKERNEL/VMM GSOC project into the main tree
* This merge contains work primarily by Mihai Carabas, with some misc fixes also by Matthew Dillon.
* Special note on GSOC core
This is, needless to say, a huge amount of work compressed down into a few paragraphs of comments. Adds the pc64/vmm subdirectory and tons of stuff to support hardware virtualization in guest-user mode, plus the ability for programs (vkernels) running in this mode to make normal system calls to the host.
* Add system call infrastructure for VMM mode operations in kern/sys_vmm.c which vectors through a structure to machine-specific implementations.
    vmm_guest_ctl_args()
    vmm_guest_sync_addr_args()
vmm_guest_ctl_args() - bootstrap VMM and EPT modes. Copydown the original user stack for EPT (since EPT 'physical' addresses cannot reach that far into the backing store represented by the process's original VM space). Also installs the GUEST_CR3 for the guest using parameters supplied by the guest.
vmm_guest_sync_addr_args() - A host helper function that the vkernel can use to invalidate page tables on multiple real cpus. This is a lot more efficient than having the vkernel try to do it itself with IPI signals via cpusync*().
* Add Intel VMX support to the host infrastructure. Again, tons of work compressed down into a one paragraph commit message. Intel VMX support added. AMD SVM support is not part of this GSOC and not yet supported by DragonFly.
* Remove PG_* defines for PTE's and related mmu operations. Replace with a table lookup so the same pmap code can be used for normal page tables and also EPT tables.
* Also include X86_PG_V defines specific to normal page tables for a few situations outside the pmap code.
* Adjust DDB to disassemble SVM related (intel) instructions.
* Add infrastructure to exit1() to deal with related structures.
* Optimize pfind() and pfindn() to remove the global token when looking up the current process's PID (Matt)
* Add support for EPT (double layer page tables). This primarily required adjusting the pmap code to use a table lookup to get the PG_* bits.
Add an indirect vector for copyin, copyout, and other user address space copy operations to support manual walks when EPT is in use.
A multitude of system calls which manually looked up user addresses via the vm_map now need a VMM layer call to translate through the EPT.
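The indirect copy vector can be sketched like this (user-space toy; the real copyin() is a fault-safe assembly routine, and the EPT variant would walk the guest tables first — both names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Native path: direct copy (stands in for the real fault-safe copy). */
static int
copyin_native(const void *uaddr, void *kaddr, size_t len)
{
    memcpy(kaddr, uaddr, len);
    return 0;
}

/* A VMM/EPT host would install a variant that translates through
 * the guest tables before copying; default is the native routine. */
static int (*copyin_vec)(const void *, void *, size_t) = copyin_native;

static int
copyin(const void *uaddr, void *kaddr, size_t len)
{
    return copyin_vec(uaddr, kaddr, len);
}
```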
* Remove the MP lock from trapsignal() use cases in trap().
* (Matt) Add pthread_yield()s in most spin loops to help situations where the vkernel is running on more cpu's than the host has, and to help with scheduler edge cases on the host.
* (Matt) Add a pmap_fault_page_quick() infrastructure that vm_fault_page() uses to try to shortcut operations and avoid locks. Implement it for pc64. This function checks whether the page is already faulted in as requested by looking up the PTE. If not it returns NULL and the full blown vm_fault_page() code continues running.
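The quick-path check can be modeled like so (toy model with a flat PTE; the real pc64 routine walks the page tables and also handles wiring/busying):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PG_V  0x1ULL
#define PG_RW 0x2ULL

/* Return the page only when the PTE already satisfies the request;
 * otherwise NULL, so the full vm_fault_page() slow path runs. */
static void *
fault_page_quick(const uint64_t *pte, int want_write, void *page)
{
    if ((*pte & PG_V) == 0)
        return NULL;                    /* not faulted in yet */
    if (want_write && (*pte & PG_RW) == 0)
        return NULL;                    /* present but read-only */
    return page;
}
```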
* (Matt) Remove the MP lock from most of the vkernel's trap() code
* (Matt) Use a shared spinlock when possible for certain critical paths related to the copyin/copyout path.
|
Revision tags: v3.4.3, v3.4.2, v3.4.0, v3.4.1, v3.4.0rc, v3.5.0, v3.2.2 |
|
#
1918fc5c |
| 24-Oct-2012 |
Sascha Wildner <saw@online.de> |
kernel: Make SMP support default (and non-optional).
The 'SMP' kernel option gets removed with this commit, so it has to be removed from everybody's configs.
Reviewed-by: sjg Approved-by: many
|
Revision tags: v3.2.1, v3.2.0, v3.3.0 |
|
#
921c891e |
| 13-Sep-2012 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Implement segment pmap optimizations for x86-64
* Implement 2MB segment optimizations for x86-64. Any shared read-only or read-write VM object mapped into memory, including physical objects (so both sysv_shm and mmap), which is a multiple of the segment size and segment-aligned can be optimized.
* Enable with sysctl machdep.pmap_mmu_optimize=1
Default is off for now. This is an experimental feature.
* It works as follows: A VM object which is large enough will, when VM faults are generated, store a truncated pmap (PD, PT, and PTEs) in the VM object itself.
VM faults whose vm_map_entry's can be optimized will cause the PTE, PT, and also the PD (for now) to be stored in a pmap embedded in the VM_OBJECT, instead of in the process pmap.
The process pmap then creates a PT entry in the PD page table that points to the PT page table page stored in the VM_OBJECT's pmap.
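The sharing mechanism described above can be sketched as a toy model (no real paging, just pointer sharing): the VM object keeps one PT page, and each process PD entry points at it, so a PTE installed through one mapping is immediately visible through every other mapping.

```c
#include <assert.h>
#include <stdint.h>

#define NPTE 512

/* The PT page stored "in" the VM object's embedded pmap. */
static uint64_t object_pt[NPTE];

struct proc_pmap { uint64_t *pd_entry; };

/* Each process's PD entry points at the shared PT page; nothing
 * is copied, so there is one set of PTEs for all mappers. */
static void
map_shared_segment(struct proc_pmap *pm)
{
    pm->pd_entry = object_pt;
}
```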
* This removes nearly all page table overhead from fork()'d processes or even unrelated processes which massively share data via mmap() or sysv_shm. We still recommend using sysctl kern.ipc.shm_use_phys=1 (which is now the default), which also removes the PV entries associated with the shared pmap. However, with this optimization PV entries are no longer a big issue since they will not be replicated in each process, only in the common pmap stored in the VM_OBJECT.
* Features of this optimization:
* Number of PV entries is reduced to approximately the number of live pages and no longer multiplied by the number of processes separately mapping the shared memory.
* One process faulting in a page naturally makes the PTE available to all other processes mapping the same shared memory. The other processes do not have to fault that same page in.
* Page tables survive process exit and restart.
* Once page tables are populated and cached, any new process that maps the shared memory will take far fewer faults because each fault will bring in an ENTIRE page table. With Postgres and 64 clients, the VM fault rate at startup was observed to drop from 1M faults/sec to less than 500, and during the run the fault rate, instead of declining steadily through the hundreds of thousands, dropped almost instantly to virtually zero VM faults.
* We no longer have to depend on sysv_shm to optimize the MMU.
* CPU caches will do a better job caching page tables since most of them are now themselves shared. Even when we invltlb, more of the page tables will be in the L1, L2, and L3 caches.
* EXPERIMENTAL!!!!!
|
Revision tags: v3.0.3, v3.0.2, v3.0.1, v3.1.0, v3.0.0 |
|
#
86d7f5d3 |
| 26-Nov-2011 |
John Marino <draco@marino.st> |
Initial import of binutils 2.22 on the new vendor branch
Future versions of binutils will also reside on this branch rather than continuing to create new binutils branches for each new version.
|
#
54341a3b |
| 15-Nov-2011 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Greatly improve shared memory fault rate concurrency / shared tokens
This commit rolls up a lot of work to improve postgres database operations and the system in general. With these changes we can run pgbench -j 8 -c 40 on our 48-core opteron monster at 140000+ tps, and the shm vm_fault rate hits 3.1M pps.
* Implement shared tokens. They work as advertised, with some caveats.
It is acceptable to acquire a shared token while you already hold the same token exclusively, but you will deadlock if you acquire an exclusive token while you hold the same token shared.
Currently exclusive tokens are not given priority over shared tokens so starvation is possible under certain circumstances.
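The lock-order rule above can be stated as a tiny predicate (a model of the rule only, not a token implementation): holding exclusive and then taking shared recurses safely, while upgrading shared to exclusive on the same token self-deadlocks.

```c
#include <assert.h>

enum token_held { HELD_NONE, HELD_SHARED, HELD_EXCL };

/* Returns nonzero if acquiring the same token again in the
 * requested mode is safe given what is already held. */
static int
token_acquire_ok(enum token_held already, int want_excl)
{
    if (already == HELD_SHARED && want_excl)
        return 0;   /* shared -> exclusive upgrade: deadlock */
    return 1;       /* includes exclusive -> shared recursion */
}
```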
* Create a critical code path in vm_fault() using the new shared token feature to quickly fault-in pages which already exist in the VM cache. pmap_object_init_pt() also uses the new feature.
This increases fault-in concurrency by a ridiculously huge amount, particularly on SHM segments (say when you have a large number of postgres clients). Scaling for large numbers of clients on large numbers of cores is significantly improved.
This also increases fault-in concurrency for MAP_SHARED file maps.
* Expand the breadn() and cluster_read() APIs. Implement breadnx() and cluster_readx() which allows a getblk()'d bp to be passed. If *bpp is not NULL a bp is being passed in, otherwise the routines call getblk().
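The *bpp convention can be sketched as follows (stub types; the real breadnx()/cluster_readx() operate on struct buf and call the real getblk()):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct buf { int b_flags; };

/* Stands in for getblk(): allocate a fresh buffer. */
static struct buf *
getblk_stub(void)
{
    return calloc(1, sizeof(struct buf));
}

/* If the caller passed a bp in *bpp, use it as-is; otherwise
 * obtain one, mirroring the expanded API's contract. */
static struct buf *
breadnx_model(struct buf **bpp)
{
    if (*bpp == NULL)
        *bpp = getblk_stub();
    return *bpp;
}
```

This is what lets HAMMER call getblk() itself, check B_CACHE, and then hand the already-obtained bp to the read path.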
* Modify the HAMMER read path to use the new API. Instead of calling getcacheblk() HAMMER now calls getblk() and checks the B_CACHE flag. This gives getblk() a chance to regenerate a fully cached buffer from VM backing store without having to acquire any hammer-related locks, resulting in even faster operation.
* If kern.ipc.shm_use_phys is set to 2 the VM pages will be pre-allocated. This can take quite a while for a large map and also lock the machine up for a few seconds. Defaults to off.
* Reorder the smp_invltlb()/cpu_invltlb() combos in a few places, running cpu_invltlb() last.
* An invalidation interlock might be needed in pmap_enter() under certain circumstances; enable the code for now.
* vm_object_backing_scan_callback() was failing to properly check the validity of a vm_object after acquiring its token. Add the required check + some debugging.
* Make vm_object_set_writeable_dirty() a bit more cache friendly.
* The vmstats sysctl was scanning every process's vm_map (requiring a vm_map read lock to do so), which can stall for long periods of time when the system is paging heavily. Change the mechanic to a LWP flag which can be tested with minimal locking.
* Have the phys_pager mark the page as dirty too, to make sure nothing tries to free it.
* Remove the spinlock in pmap_prefault_ok(), since we do not delete page table pages it shouldn't be needed.
* Add a required cpu_ccfence() in pmap_inval.c. The code generated prior to this fix was still correct, and this makes sure it stays that way.
* Replace several manual wiring cases with calls to vm_page_wire().
|