History log of /dragonfly/sys/platform/pc64/x86_64/pmap_inval.c (Results 1 – 25 of 31)
Revision Date Author Comments
Revision tags: v6.2.1, v6.2.0, v6.3.0, v6.0.1
# ab4aa0bb 16-Jul-2021 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Improve invltlb latency warnings

* Improve kprintf()s for smp_invltlb latency warnings. Make
it abundantly clear that these are mostly WARNING messages,
not fatal messages.

* Tested on VM with host under load and VM running nice +5.



# 9bbbdb7e 27-Jun-2021 Aaron LI <aly@aaronly.me>

nvmm: Revamp host TLB flush mechanism

* Leverage the pmap layer to track guest pmap generation id and the host
CPUs that the guest pmap is active on. This avoids the inefficient
_tlb_flush() callbacks from NVMM that invalidate all TLB entries.

* Currently just add all CPUs to the backing pmap for guest physical
memory as they are encountered. Do not yet try to remove any CPUs,
because multiple vCPUs may wind up (temporarily) scheduled to the same
physical CPU. So more sophisticated tracking is needed.

* Fix a bug in SVM's host TLB flush handling where breaking out of the
loop and returning, then re-entering the loop on the same cpu, could
improperly clear the machine flush request.

Credit to Matt Dillon.
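
A minimal sketch of the generation-id idea described above, under the
assumption that all struct and function names below are illustrative
stand-ins, not the NVMM or pmap sources:

    struct pmap_sketch {
            uint64_t pm_invgen;             /* bumped on host invalidations */
    };

    struct vcpu_sketch {
            uint64_t seen_invgen;           /* generation at last guest flush */
    };

    /*
     * On VM entry: flush the guest TLB only if the host pmap has been
     * invalidated since this vCPU last ran, instead of flushing always.
     */
    static void
    guest_tlb_sync_sketch(struct pmap_sketch *pmap, struct vcpu_sketch *vcpu,
                          void (*flush_guest_tlb)(void))
    {
            uint64_t gen = pmap->pm_invgen;

            if (vcpu->seen_invgen != gen) {
                    flush_guest_tlb();      /* hypothetical VMCB/EPT flush */
                    vcpu->seen_invgen = gen;
            }
    }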



# 39d0d2cb 25-Jun-2021 Aaron LI <aly@aaronly.me>

pmap: Change pmap->pm_invgen to uint64_t to be compatible with NVMM

Change the 'pmap->pm_invgen' member from 'long' to 'uint64_t', to be
compatible with NVMM's machgen.

Update the atomic operation on 'pm_invgen' accordingly; there is no need to
use the '_acq' acquire version (which includes a read barrier).

Credit to Matt Dillon.
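
A minimal sketch of the intent, not the literal diff; it assumes u_long is
64 bits on x86_64 so a plain atomic_add_long() suffices:

    /* 'pm_invgen' is now uint64_t; advance it with a plain atomic add. */
    static __inline void
    pmap_bump_invgen_sketch(volatile uint64_t *pm_invgen)
    {
            /* no '_acq' (read-barrier) variant needed just to bump a counter */
            atomic_add_long((volatile u_long *)pm_invgen, 1);
    }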



# 3ecc20a0 06-Jun-2021 Aaron LI <aly@aaronly.me>

nvmm: Port to DragonFly #24: pmap transform & TLB invalidation

* Port NetBSD's pmap_ept_transform() to DragonFly's. We don't make
'pmap_ept_has_ad' a global in the pmap code, so need to pass extra
flags to our pmap_ept_transform().

* Replace NetBSD's pmap_tlb_shootdown() with our pmap_inval_smp().

* Add two new fields 'pm_data' & 'pm_tlb_flush' to 'struct pmap', which
are used as a callback by NVMM to handle its own TLB invalidation.

Note that pmap_enter() also calls pmap_inval_smp() on the EPT/NPT pmap
and requires the old PTE to be returned, so we can't simply place the NVMM
TLB callback at the beginning of pmap_inval_smp() and return 0.
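
A sketch of the callback shape, with the two field names taken from the
text above and everything else illustrative:

    struct pmap_sketch {
            void    *pm_data;                       /* opaque NVMM cookie */
            void    (*pm_tlb_flush)(struct pmap_sketch *pmap);
    };

    /*
     * Called from inside the invalidation path: let NVMM perform its own
     * guest TLB invalidation, but keep going so the old PTE contents can
     * still be read and returned to pmap_enter().
     */
    static void
    pmap_inval_guest_hook_sketch(struct pmap_sketch *pmap)
    {
            if (pmap->pm_tlb_flush != NULL)
                    pmap->pm_tlb_flush(pmap);
    }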



# c713db65 24-May-2021 Aaron LI <aly@aaronly.me>

vm: Change 'kernel_pmap' global to pointer type

Following the previous commits, this commit changes 'kernel_pmap' to a
pointer of type 'struct pmap *'. This makes it align better with
'kernel_map' and simplifies the code a bit.

No functional changes.



Revision tags: v6.0.0, v6.0.0rc1, v6.1.0, v5.8.3, v5.8.2, v5.8.1, v5.8.0, v5.9.0, v5.8.0rc1, v5.6.3
# 0ad80e33 14-Sep-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add needed ccfence and more error checks

* Add a cpu_ccfence() to PMAP_PAGE_BACKING_SCAN in order to prevent
over-optimization of the ipte load by the compiler.

* Add a machine-dependent assertion in the vm_page_free*() path to
ensure that the page is not normally mapped at the time of the
free.
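
A small sketch of why the compiler fence matters in a PTE retry loop;
cpu_ccfence(), cpu_pause(), and atomic_cmpset_long() are existing DragonFly
primitives, while the loop body itself is illustrative:

    static void
    pte_clear_rw_sketch(volatile u_long *ptep, u_long pg_rw_bit)
    {
            u_long ipte;

            for (;;) {
                    ipte = *ptep;   /* snapshot the PTE */
                    cpu_ccfence();  /* keep the compiler from caching the load */
                    if (atomic_cmpset_long(ptep, ipte, ipte & ~pg_rw_bit))
                            break;  /* PTE updated atomically */
                    cpu_pause();    /* lost the race; reload and retry */
            }
    }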



Revision tags: v5.6.2, v5.6.1, v5.6.0, v5.6.0rc1, v5.7.0
# e3c330f0 19-May-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - VM rework part 12 - Core pmap work, stabilize & optimize

* Add tracking for the number of PTEs mapped writeable in md_page.
Change how PG_WRITEABLE and PG_MAPPED are cleared in the vm_page
to avoid clear/set races. This problem occurs because we would
have otherwise tried to clear the bits without hard-busying the
page. This allows the bits to be set with only an atomic op.

Procedures which test these bits universally do so while holding
the page hard-busied, and now call pmap_mapped_sync() beforehand to
properly synchronize the bits.

* Fix bugs related to various counters: pm_stats.resident_count,
wiring counts, vm_page->md.writeable_count, and
vm_page->md.pmap_count.

* Fix bugs related to synchronizing removed pte's with the vm_page.
Fix one case where we were improperly updating (m)'s state based
on a lost race against a pte swap-to-0 (pulling the pte).

* Fix a bug related to the page soft-busying code when the
m->object/m->pindex race is lost.

* Implement a heuristic version of vm_page_active() which just
updates act_count unlocked if the page is already in the
PQ_ACTIVE queue, or if it is fictitious.

* Allow races against the backing scan for pmap_remove_all() and
pmap_page_protect(VM_PROT_READ). Callers of these routines for
these cases expect full synchronization of the page dirty state.
We can identify when a page has not been fully cleaned out by
checking vm_page->md.pmap_count and vm_page->md.writeable_count.
In the rare situation where this happens, simply retry.

* Assert that the PTE pindex is properly interlocked in pmap_enter().
We still allow PTEs to be pulled by other routines without the
interlock, but multiple pmap_enter()s of the same page will be
interlocked.

* Assert additional wiring count failure cases.

* (UNTESTED) Flag DEVICE pages (dev_pager_getfake()) as being
PG_UNMANAGED. This essentially prevents all the various
reference counters (e.g. vm_page->md.pmap_count and
vm_page->md.writeable_count), PG_M, PG_A, etc from being
updated.

The vm_pages aren't tracked in the pmap at all because there
is no way to find them; they are 'fake', so without a pv_entry,
we can't track them. Instead we simply rely on the vm_map_backing
scan to manipulate the PTEs.

* Optimize the new vm_map_entry_shadow() to use a shared object
token instead of an exclusive one. OBJ_ONEMAPPING will be cleared
with the shared token.

* Optimize single-threaded access to pmaps to avoid pmap_inval_*()
complexities.

* Optimize __read_mostly for more globals.

* Optimize pmap_testbit(), pmap_clearbit(), pmap_page_protect().
Pre-check vm_page->md.writeable_count and vm_page->md.pmap_count
for an easy degenerate return; before real work.

* Optimize pmap_inval_smp() and pmap_inval_smp_cmpset() for the
single-threaded pmap case, when called on the same CPU the pmap
is associated with. This allows us to use simple atomics and
cpu_*() instructions and avoid the complexities of the
pmap_inval_*() infrastructure. (A sketch of this fast path appears
at the end of this entry.)

* Randomize the page queue used in bio_page_alloc(). This does not
appear to hurt performance (e.g. heavy tmpfs use) on large many-core
NUMA machines and it makes the vm_page_alloc()'s job easier.

This change might have a downside for temporary files, but for more
long-lasting files there's no point allocating pages localized to a
particular cpu.

* Optimize vm_page_alloc().

(1) Refactor the _vm_page_list_find*() routines to avoid re-scanning
the same array indices over and over again when trying to find
a page.

(2) Add a heuristic, vpq.lastq, for each queue, which we set if a
_vm_page_list_find*() operation had to go far-afield to find its
page. Subsequent finds will skip to the far-afield position until
the current CPUs queues have pages again.

(3) Reduce PQ_L2_SIZE from an extravagant 2048 entries per queue down
to 1024. The original 2048 was meant to provide 8-way
set-associativity for 256 cores but wound up reducing performance
due to longer index iterations.

* Refactor the vm_page_hash[] array. This array is used to shortcut
vm_object locks and locate VM pages more quickly, without locks.
The new code limits the size of the array to something more reasonable,
implements a 4-way set-associative replacement policy using 'ticks',
and rewrites the hashing math.

* Effectively remove pmap_object_init_pt() for now. In current tests
it does not actually improve performance, probably because it may
map pages that are not actually used by the program.

* Remove vm_map_backing->refs. This field is no longer used.

* Remove more of the old now-stale code related to use of pv_entry's
for terminal PTEs.

* Remove more of the old shared page-table-page code. This worked but
could never be fully validated and was prone to bugs. So remove it.
In the future we will likely use larger 2MB and 1GB pages anyway.

* Remove pmap_softwait()/pmap_softhold()/pmap_softdone().

* Remove more #if 0'd code.
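
A hedged sketch of the single-threaded pmap_inval_smp() fast path mentioned
above: when the pmap is only active on the current cpu, a local atomic swap
plus cpu_invlpg() is enough and the cross-cpu Xinvltlb machinery can be
skipped (the surrounding "is this pmap single-threaded and local?" checks
are omitted):

    static pt_entry_t
    pmap_inval_local_sketch(pt_entry_t *ptep, pt_entry_t npte, vm_offset_t va)
    {
            pt_entry_t opte;

            opte = atomic_swap_long((volatile u_long *)ptep, npte);
            cpu_invlpg((void *)va);         /* flush only this cpu's TLB */
            return (opte);
    }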



Revision tags: v5.4.3, v5.4.2, v5.4.1, v5.4.0, v5.5.0, v5.4.0rc1, v5.2.2, v5.2.1, v5.2.0, v5.3.0, v5.2.0rc, v5.0.2, v5.0.1, v5.0.0
# 5b49787b 05-Oct-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor smp collision statistics

* Add an indefinite wait timing API (sys/indefinite.h,
sys/indefinite2.h). This interface uses the TSC and will
record lock latencies to our pcpu stats in microseconds.
The systat -pv 1 display shows this under smpcoll.

Note that latencies generated by tokens, lockmgr, and mutex
locks do not necessarily reflect actual lost cpu time as the
kernel will schedule other threads while those are blocked,
if other threads are available.

* Formalize TSC operations more, supply a type (tsc_uclock_t and
tsc_sclock_t).

* Reinstrument lockmgr, mutex, token, and spinlocks to use the new
indefinite timing interface.



Revision tags: v5.0.0rc2, v5.1.0, v5.0.0rc1, v4.8.1
# 67534613 27-Mar-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Enhance the sniff code, refactor interrupt disablement for IPIs

* Add kern.sniff_enable, default to 1. Allows the sysop to disable the
feature if desired.

* Add kern.sniff_target, allows sniff IPIs to be targeted to all cpus
(-1), or to a particular cpu (0...N). This feature allows the sysop
to test IPI delivery to particular CPUs (typically monitoring with
systat -pv 0.1) to determine that delivery is working properly.

* Bring in some additional AMD-specific setup from FreeBSD, beginnings
of support for the APIC Extended space. For now just make sure the
extended entries are masked.

* Change interrupt disablement expectations. The caller of apic_ipi(),
selected_apic_ipi(), and related macros is now required to hard-disable
interrupts rather than these functions doing so. This allows the caller
to run certain operational sequences atomically. (The new calling
convention is sketched at the end of this entry.)

* Use the TSC to detect IPI send stalls instead of a hard-coded loop count.

* Also set the APIC_LEVEL_ASSERT bit when issuing a directed IPI, though
the spec says this is unnecessary. Do it anyway.

* Remove unnecessary critical section in selected_apic_ipi(). We are in
a hard-disablement and in particular we do not want to accidentally trigger
a splz() due to the crit_exit() while in the hard-disablement.

* Enhance the IPI stall detection and recovery code. Provide more
information. Also enable the LOOPMASK_IN debugging tracker by default.

* Add a testing feature to machdep.all_but_self_ipi_enable. By setting
this to 2, we force the smp_invltlb() to always use the ALL_BUT_SELF IPI.
For testing only.
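
A sketch of the new caller-side convention described above; the IPI send
itself is abstracted behind a hypothetical function pointer since the real
apic_ipi()/selected_apic_ipi() argument lists are not shown here:

    static void
    ipi_caller_sketch(void (*send_ipi)(void))
    {
            unsigned long rflags = read_rflags();

            cpu_disable_intr();     /* caller now owns the hard disablement */
            send_ipi();             /* stand-in for apic_ipi() and friends */
            /* ...any further steps that must run atomically... */
            write_rflags(rflags);   /* restore the previous interrupt state */
    }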



Revision tags: v4.8.0, v4.6.2, v4.9.0, v4.8.0rc
# 95270b7e 01-Feb-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Many fixes for vkernel support, plus a few main kernel fixes

REAL KERNEL

* The big enchilada is that the main kernel's thread switch code has
a small timing window where it clears the PM_ACTIVE bit for the cpu
while switching between two threads. However, it *ALSO* checks and
avoids loading the %cr3 if the two threads have the same pmap.

This results in a situation where an invalidation on the pmap in another
cpu may not have visibility to the cpu doing the switch, and yet the
cpu doing the switch also decides not to reload %cr3 and so does not
invalidate the TLB either. The result is a stale TLB and bad things
happen.

For now just unconditionally load %cr3 until I can come up with code
to handle the case.

This bug is very difficult to reproduce on a normal system, it requires
a multi-threaded program doing nasty things (munmap, etc) on one cpu
while another thread is switching to a third thread on some other cpu.

* KNOTE after handling the vkernel trap in postsig() instead of before.

* Change the kernel's pmap_inval_smp() code to take a 64-bit npgs
argument instead of a 32-bit npgs argument. This fixes situations
that crop up when a process uses more than 16TB of address space.

* Add an lfence to the pmap invalidation code that I think might be
needed.

* Handle some wrap/overflow cases in pmap_scan() related to the use of
large address spaces.

* Fix an unnecessary invltlb in pmap_clearbit() for unmanaged PTEs.

* Test PG_RW after locking the pv_entry to handle potential races.

* Add bio_crc to struct bio. This field is only used for debugging for
now but may come in useful later.

* Add some global debug variables in the pmap_inval_smp() and related
paths. Refactor the npgs handling.

* Load the tsc_target field after waiting for completion of the previous
invalidation op instead of before. Also add a conservative mfence()
in the invalidation path before loading the info fields.

* Remove the global pmap_inval_bulk_count counter.

* Adjust swtch.s to always reload the user process %cr3, with an
explanation. FIXME LATER!

* Add some test code to vm/swap_pager.c which double-checks that the page
being paged out does not get corrupted during the operation. This code
is #if 0'd.

* We must hold an object lock around the swp_pager_meta_ctl() call in
swp_pager_async_iodone(). I think.

* Reorder when PG_SWAPINPROG is cleared. Finish the I/O before clearing
the bit.

* Change the vm_map_growstack() API to pass a vm_map in instead of
curproc.

* Use atomic ops for vm_object->generation counts, since objects can be
locked shared.

VKERNEL

* Unconditionally save the FP state after returning from VMSPACE_CTL_RUN.
This solves a severe FP corruption bug in the vkernel due to calls it
makes into libc (which uses %xmm registers all over the place).

This is not a complete fix. We need a formal userspace/kernelspace FP
abstraction. Right now the vkernel doesn't have a kernelspace FP
abstraction so if a kernel thread switches preemptively bad things
happen.

* The kernel tracks and locks pv_entry structures to interlock pte's.
The vkernel never caught up, and does not really have a pv_entry or
placemark mechanism. The vkernel's pmap really needs a complete
re-port from the real-kernel pmap code. Until then, we use poor hacks.

* Use the vm_page's spinlock to interlock pte changes.

* Make sure that PG_WRITEABLE is set or cleared with the vm_page
spinlock held.

* Have pmap_clearbit() acquire the pmobj token for the pmap in the
iteration. This appears to be necessary, currently, as most of the
rest of the vkernel pmap code also uses the pmobj token.

* Fix bugs in the vkernel's swapu32() and swapu64().

* Change pmap_page_lookup() and pmap_unwire_pgtable() to fully busy
the page. Note however that a page table page is currently never
soft-busied. Also adjust other vkernel code that busies a page table page.

* Fix some sillycode in a pmap->pm_ptphint test.

* Don't inherit e.g. PG_M from the previous pte when overwriting it
with a pte of a different physical address.

* Change the vkernel's pmap_clear_modify() function to clear VPTE_RW
(which also clears VPTE_M), and not just VPTE_M. Formally we want
the vkernel to be notified when a page becomes modified and it won't
be unless we also clear VPTE_RW and force a fault. <--- I may change
this back after testing.

* Wrap pmap_replacevm() with a critical section.

* Scrap the old grow_stack() code. vm_fault() and vm_fault_page() handle
it (vm_fault_page() just now got the ability).

* Properly flag VM_FAULT_USERMODE.



Revision tags: v4.6.1
# e1bcf416 08-Oct-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor VMX code

* Refactor the VMX code to use all three VMM states available to use
instead of two. The three states available are:

active and current (VMPTRLD)
active not current (replaced by some other context being VMPTRLD'd)
inactive not current (VMCLEAR)

In short, there is no need to VMCLEAR the current context when activating
another via VMPTRLD, and doing so greatly reduces performance. VMCLEAR is
only really needed when a context is being destroyed or being moved to
another cpu.

* Also fixes a few bugs along the way.

* Live loop in vmx_vmrun() when necessary, otherwise we wind up with serious
problems synchronizing IPIs. The thread will still be subject to its
process priority.



# 4373ea1c 26-Sep-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Change IPI recovery watchdog

* Change the default recovery watchdog timeout for IPIs from 1/16 second
to 2 seconds and the repeated timeout from 1/2 second to 2 seconds.

* Add missing initialization to pmap_inval_smp_cmpset(), without it an
improper watchdog timeout/retry could occur.

* This still may not fix issues with VMs when core threads cause extreme
latencies on the host.

Reported-by: zach



# 398af52e 07-Sep-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Deal with lost IPIs (VM related) (2)

* Fix an issue where Xinvltlb interacts badly with a drm console framebuffer,
imploding the machine. The 1/16 second watchdog can trigger during certain
DRM operations due to excessive interrupt disablement in the linux DRM code.

* Avoid kprintf()ing anything by default.

* Also make a minor fix to the watchdog logic to force the higher-level
Xinvltlb loop to re-test.



# bba35d66 06-Sep-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Deal with lost IPIs (VM related)

* Some (all?) VMs appear to be able to lose IPIs. Hopefully the same can't
be said for device interrupts! Add some recovery code for lost Xinvltlb
IPIs for now.

For synchronizing invalidations we use the TSC and run a recovery attempt
after 1/16 second, and every 1 second thereafter, if an Xinvltlb is not
responded to (smp_invltlb() and smp_invlpg()). The IPI will be re-issued;
the recovery loop is sketched at the end of this entry.

* Some basic testing shows that a VM can stall out a cpu thread for an
indefinite period of time, potentially causing the above watchdog to
trigger. Even so it should not have required re-issuing the IPI, but
it seems it does, so the VM appears to be losing the IPI(!) when a cpu
thread stalls out on the host! At least with the VM we tested under,
type unknown.

* IPIQ IPIs currently do not have any specific recovery but I think each
cpu will poll for IPIQs slowly in the idle thread, so they might
automatically recover anyway.

Reported-by: zach
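
The recovery loop sketched, as referenced above; the completion test and
the resend helper are hypothetical stand-ins, while rdtsc(), cpu_pause(),
and the tsc_frequency global are assumed from the DragonFly headers:

    static void
    invltlb_wait_sketch(int (*done)(void), void (*resend_ipi)(void))
    {
            uint64_t base = rdtsc();
            uint64_t next = tsc_frequency / 16;     /* first try: ~1/16 sec */

            while (!done()) {
                    cpu_pause();
                    if (rdtsc() - base >= next) {
                            resend_ipi();           /* recover a lost Xinvltlb */
                            next += tsc_frequency;  /* then every ~1 second */
                    }
            }
    }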



# 5dac90bc 31-Aug-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix LOOPMASK debugging for Xinvltlb

* Fix LOOPMASK debugging for Xinvltlb, the #if 1 can now be set to #if 0
to turn off the debugging.


Revision tags: v4.6.0, v4.6.0rc2
# 1a5c7e0f 24-Jul-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor Xinvltlb a little, turn off the idle-thread invltlb opt

* Turn off the idle-thread invltlb optimization. This feature can be
turned on with a sysctl (default-off) machdep.optimized_invltlb. It
will be turned on by default when we've life-tested that it works
properly.

* Remove excess critical sections and interrupt disablements. All entries
into smp_invlpg() now occur with interrupts already disabled and the
thread already in a critical section. This also defers critical-section
1->0 transition handling away from smp_invlpg() and into its caller.

* Refactor the Xinvltlb APIs a bit. Have Xinvltlb enter the critical
section (it didn't before). Remove the critical section from
smp_inval_intr(). The critical section is now handled by the assembly,
and by any other callers.

* Add additional tsc-based loop/counter debugging to try to catch problems.

* Move inner-loop handling of smp_invltlb_mask to act on invltlbs a little
faster.

* Disable interrupts a little later inside pmap_inval_smp() and
pmap_inval_smp_cmpset().



Revision tags: v4.6.0rc, v4.7.0
# ccd67bf6 17-Jul-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor Xinvltlb (3)

* Roll up invalidation operations for numerous kernel-related pmaps, reducing
the number of IPIs needed (particularly for buffer cache operations).

* Implement semi-synchronous command execution, where target cpus do not
need to wait for the originating cpu to execute a command. This is used
for the above rollups when the related kernel memory is known to be accessed
concurrently with the pmap operations.

* Support invalidation of VA ranges.

* Support reduction of target cpu set for semi-synchronous commands, including
invltlb's, by removing idle cpus from the set when possible.



# 79f2da03 15-Jul-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor Xinvltlb and the pmap page & global tlb invalidation code

* Augment Xinvltlb to handle both TLB invalidation and per-page invalidation

* Remove the old lwkt_ipi-based per-page invalidation code.

* Include Xinvltlb interrupts in the V_IPI statistics counter
(so they show up in systat -pv 1).

* Add loop counters to detect and log possible endless loops.

* (Fix single_apic_ipi_passive() but note that this function is currently
not used. Interrupts must be hard-disabled when checking icr_lo).

* NEW INVALIDATION MECHANISM

The new invalidation mechanism is primarily enclosed in mp_machdep.c and
pmap_inval.c. Supply new all-in-one rollup functions which include the
*ptep contents adjustment, instead of prior piecemeal functions.

The new mechanism uses Xinvltlb for both full-tlb and per-page
invalidations. This interrupt ignores critical sections (that is,
will operate even if kernel code is in a critical section), which
significantly improves the latency and stability of our pmap pte
invalidation support functions.

For example, prior to these changes the invalidation code uses the
lwkt_ipiq paths which are subject to critical sections and could result
in long stalls across substantially ALL cpus when one cpu was in a long
cpu-bound critical section.

* NEW SMP_INVLTLB() OPTIMIZATION

smp_invltlb() always used Xinvltlb, and it still does. However the
code now avoids IPIing idle cpus, instead flagging them to issue the
cpu_invltlb() call when they wake-up.

To make this work the idle code must temporarily enter a critical section
so 'normal' interrupts do not run until it has a chance to check and act
on the flag. This will slightly increase interrupt latency on an idle
cpu.

This change significantly improves smp_invltlb() overhead by avoiding
having to pull idle cpus out of their high-latency/low-power state. Thus
it also avoids having the high wakeup latency on those cpus mess things up.
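
A hedged sketch of that idle-cpu optimization; the mask names and structure
are illustrative, only cpu_invltlb() and the CPUMASK macros are assumed
from the DragonFly tree:

    static cpumask_t idle_invl_req;     /* cpus owing a deferred invltlb */

    /* Sender side: flag an idle cpu instead of including it in the IPI. */
    static void
    defer_invltlb_sketch(int cpuid)
    {
            ATOMIC_CPUMASK_ORBIT(idle_invl_req, cpuid);
    }

    /* Idle thread side, run inside the temporary critical section. */
    static void
    idle_wakeup_check_sketch(int mycpuid)
    {
            if (CPUMASK_TESTBIT(idle_invl_req, mycpuid)) {
                    ATOMIC_CPUMASK_NANDBIT(idle_invl_req, mycpuid);
                    cpu_invltlb();
            }
    }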

* Remove unnecessary calls to smp_invltlb(). It is not necessary to call
this function when a *ptep is transitioning from 0 to non-zero. This
significantly cuts down on smp_invltlb() traffic under load.

* Remove a bunch of unused code in these paths.

* Add machdep.report_invltlb_src and machdep.report_invlpg_src, down
counters which do one stack backtrace when they hit 0.

TIMING TESTS

No appreciable differences with the new code other than feeling smoother.

mount_tmpfs dummy /usr/obj

On monster (4-socket, 48-core):
time make -j 50 buildworld
BEFORE: 7849.697u 4693.979s 16:23.07 1275.9%
AFTER: 7682.598u 4467.224s 15:47.87 1281.8%

time make -j 50 nativekernel NO_MODULES=TRUE
BEFORE: 927.608u 254.626s 1:36.01 1231.3%
AFTER: 531.124u 204.456s 1:25.99 855.4%

On 2 x E5-2620 (2-socket, 32-core):
time make -j 50 buildworld
BEFORE: 5750.042u 2291.083s 10:35.62 1265.0%
AFTER: 5694.573u 2280.078s 10:34.96 1255.9%

time make -j 50 nativekernel NO_MODULES=TRUE
BEFORE: 431.338u 84.458s 0:54.71 942.7%
AFTER: 414.962u 92.312s 0:54.75 926.5%
(time mostly spent in the mkdep line and on the final link)

Memory thread tests, 64 threads each allocating memory.

BEFORE: 3.1M faults/sec
AFTER: 3.1M faults/sec.



Revision tags: v4.4.3, v4.4.2, v4.4.1, v4.4.0, v4.5.0, v4.4.0rc, v4.2.4, v4.3.1, v4.2.3, v4.2.1, v4.2.0, v4.0.6, v4.3.0, v4.2.0rc, v4.0.5, v4.0.4, v4.0.3, v4.0.2, v4.0.1, v4.0.0, v4.0.0rc3, v4.0.0rc2, v4.0.0rc, v4.1.0, v3.8.2
# b8d5441d 13-Jul-2014 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add two features to improve qemu emulation (64-bit only)

* Implement a tunable for machdep.cpu_idle_hlt, allowing it to be
set in /boot/loader.conf. For qemu the admin might want to set
the value to 4 (always use HLT) instead of the default 2.

* Implement a tunable and new sysctl, machdep.pmap_fast_kernel_cpusync,
which defaults to disabled (0). Setting this to 1 in /boot/loader.conf
or at anytime via sysctl tells the kernel to use a one-stage pmap
invalidation for kernel_pmap updates. User pmaps are not affected and
will still use two-stage invalidations.

One-stage pmap invalidations only have to spin on the originating cpu,
but all other cpus will not be quiesced when updating a kernel_map pmap
entry. This is untested as there might be situations where the kernel
pmap is updated without an interlock (though most should be interlocked
already).

This second sysctl/tunable, if enabled, greatly improves qemu performance
particularly when the number of qemu cpus is greater than the number of
real cpus. It probably improves real hardware system performance as well,
but is not recommended for production at this time.
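
For reference, a hedged example of setting both tunables named above (the
values are just the ones suggested in this entry):

    # /boot/loader.conf
    machdep.cpu_idle_hlt="4"                  # always use HLT under qemu
    machdep.pmap_fast_kernel_cpusync="1"      # one-stage kernel_pmap invalidations

    # the second knob can also be flipped at runtime:
    #   sysctl machdep.pmap_fast_kernel_cpusync=1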



# cc694a4a 30-Jun-2014 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Move CPUMASK_LOCK out of the cpumask_t

* Add cpulock_t (a 32-bit integer on all platforms) and implement
CPULOCK_EXCL as well as space for a counter.

* Break out CPUMASK_LOCK, add a new field to the pmap (pm_active_lock)
and to the process vmm (p_vmm_cpulock), and implement the mmu interlock
there.

The VMM subsystem uses additional bits in cpulock_t as a mask counter
for implementing its interlock.

The PMAP subsystem just uses the CPULOCK_EXCL bit in pm_active_lock for
its own interlock.

* Max cpus on 64-bit systems is now 64 instead of 63.

* cpumask_t is now just a pure cpu mask and no longer requires all-or-none
atomic ops, just normal bit-for-bit atomic ops. This will allow us to
hopefully extend it past the 64-cpu limit soon.
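
A hedged sketch of how the CPULOCK_EXCL bit in the 32-bit pm_active_lock
can serve as the pmap interlock; the spin helpers are illustrative, not the
kernel's actual locking path:

    static void
    pmap_interlock_sketch(volatile u_int *pm_active_lock)
    {
            u_int v;

            for (;;) {
                    v = *pm_active_lock;
                    cpu_ccfence();
                    if ((v & CPULOCK_EXCL) == 0 &&
                        atomic_cmpset_int(pm_active_lock, v, v | CPULOCK_EXCL))
                            break;          /* interlock acquired */
                    cpu_pause();            /* contended; retry */
            }
    }

    static void
    pmap_deinterlock_sketch(volatile u_int *pm_active_lock)
    {
            atomic_clear_int(pm_active_lock, CPULOCK_EXCL);
    }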



Revision tags: v3.8.1, v3.6.3, v3.8.0, v3.8.0rc2, v3.9.0, v3.8.0rc, v3.6.2, v3.6.1, v3.6.0, v3.7.1, v3.6.0rc, v3.7.0
# a86ce0cd 20-Sep-2013 Matthew Dillon <dillon@apollo.backplane.com>

hammer2 - Merge Mihai Carabas's VKERNEL/VMM GSOC project into the main tree

* This merge contains work primarily by Mihai Carabas, with some misc
fixes also by Matthew Dillon.

* Special note on GSOC core

This is, needless to say, a huge amount of work compressed down into a
few paragraphs of comments. Adds the pc64/vmm subdirectory and tons
of stuff to support hardware virtualization in guest-user mode, plus
the ability for programs (vkernels) running in this mode to make normal
system calls to the host.

* Add system call infrastructure for VMM mode operations in kern/sys_vmm.c
which vectors through a structure to machine-specific implementations.

vmm_guest_ctl_args()
vmm_guest_sync_addr_args()

vmm_guest_ctl_args() - bootstrap VMM and EPT modes. Copydown the original
user stack for EPT (since EPT 'physical' addresses cannot reach that far
into the backing store represented by the process's original VM space).
Also installs the GUEST_CR3 for the guest using parameters supplied by
the guest.

vmm_guest_sync_addr_args() - A host helper function that the vkernel can
use to invalidate page tables on multiple real cpus. This is a lot more
efficient than having the vkernel try to do it itself with IPI signals
via cpusync*().

* Add Intel VMX support to the host infrastructure. Again, tons of work
compressed down into a one paragraph commit message. Intel VMX support
added. AMD SVM support is not part of this GSOC and not yet supported
by DragonFly.

* Remove PG_* defines for PTE's and related mmu operations. Replace with
a table lookup so the same pmap code can be used for normal page tables
and also EPT tables.

* Also include X86_PG_V defines specific to normal page tables for a few
situations outside the pmap code.

* Adjust DDB to disassemble SVM related (intel) instructions.

* Add infrastructure to exit1() to deal with related structures.

* Optimize pfind() and pfindn() to remove the global token when looking
up the current process's PID (Matt)

* Add support for EPT (double layer page tables). This primarily required
adjusting the pmap code to use a table lookup to get the PG_* bits.

Add an indirect vector for copyin, copyout, and other user address space
copy operations to support manual walks when EPT is in use.

A multitude of system calls which manually looked up user addresses via
the vm_map now need a VMM layer call to translate EPT.

* Remove the MP lock from trapsignal() use cases in trap().

* (Matt) Add pthread_yield()s in most spin loops to help situations where
the vkernel is running on more cpu's than the host has, and to help with
scheduler edge cases on the host.

* (Matt) Add a pmap_fault_page_quick() infrastructure that vm_fault_page()
uses to try to shortcut operations and avoid locks. Implement it for
pc64. This function checks whether the page is already faulted in as
requested by looking up the PTE. If not it returns NULL and the full
blown vm_fault_page() code continues running.

* (Matt) Remove the MP lock from most of the vkernel's trap() code

* (Matt) Use a shared spinlock when possible for certain critical paths
related to the copyin/copyout path.



Revision tags: v3.4.3, v3.4.2, v3.4.0, v3.4.1, v3.4.0rc, v3.5.0, v3.2.2
# 1918fc5c 24-Oct-2012 Sascha Wildner <saw@online.de>

kernel: Make SMP support default (and non-optional).

The 'SMP' kernel option gets removed with this commit, so it has to
be removed from everybody's configs.

Reviewed-by: sjg
Approved-by: many


Revision tags: v3.2.1, v3.2.0, v3.3.0
# 921c891e 13-Sep-2012 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Implement segment pmap optimizations for x86-64

* Implement 2MB segment optimizations for x86-64. Any shared read-only
or read-write VM object mapped into memory, including physical objects
(so both sysv_shm and mmap), which is a multiple of the segment size
and segment-aligned can be optimized.

* Enable with sysctl machdep.pmap_mmu_optimize=1

Default is off for now. This is an experimental feature.

* It works as follows: A VM object which is large enough will, when VM
faults are generated, store a truncated pmap (PD, PT, and PTEs) in the
VM object itself.

VM faults whose vm_map_entry's can be optimized will cause the PTE, PT,
and also the PD (for now) to be stored in a pmap embedded in the VM_OBJECT,
instead of in the process pmap.

The process pmap then creates a PT entry in the PD page table that points
to the PT page table page stored in the VM_OBJECT's pmap.

* This removes nearly all page table overhead from fork()'d processes or
even unrelated process which massively share data via mmap() or sysv_shm.
We still recommend using sysctl kern.ipc.shm_use_phys=1 (which is now
the default), which also removes the PV entries associated with the
shared pmap. However, with this optimization PV entries are no longer
a big issue since they will not be replicated in each process, only in
the common pmap stored in the VM_OBJECT.

* Features of this optimization:

* Number of PV entries is reduced to approximately the number of live
pages and no longer multiplied by the number of processes separately
mapping the shared memory.

* One process faulting in a page naturally makes the PTE available to
all other processes mapping the same shared memory. The other processes
do not have to fault that same page in.

* Page tables survive process exit and restart.

* Once page tables are populated and cached, any new process that maps
the shared memory will take far fewer faults because each fault will
bring in an ENTIRE page table. Postgres w/ 64-clients, VM fault rate
was observed to drop from 1M faults/sec to less than 500 at startup,
and during the run the fault rates dropped from a steady decline into
the hundreds of thousands into an instant decline to virtually zero
VM faults.

* We no longer have to depend on sysv_shm to optimize the MMU.

* CPU caches will do a better job caching page tables since most of
them are now themselves shared. Even when we invltlb, more of the
page tables will be in the L1, L2, and L3 caches.

* EXPERIMENTAL!!!!!



Revision tags: v3.0.3, v3.0.2, v3.0.1, v3.1.0, v3.0.0
# 86d7f5d3 26-Nov-2011 John Marino <draco@marino.st>

Initial import of binutils 2.22 on the new vendor branch

Future versions of binutils will also reside on this branch rather
than continuing to create new binutils branches for each new version.


# 54341a3b 15-Nov-2011 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Greatly improve shared memory fault rate concurrency / shared tokens

This commit rolls up a lot of work to improve postgres database operations
and the system in general. With these changes we can pgbench -j 8 -c 40 on
our 48-core opteron monster at 140000+ tps, and the shm vm_fault rate
hits 3.1M pps.

* Implement shared tokens. They work as advertised, with some caveats
(the usage pattern is sketched at the end of this entry).

It is acceptable to acquire a shared token while you already hold the same
token exclusively, but you will deadlock if you acquire an exclusive token
while you hold the same token shared.

Currently exclusive tokens are not given priority over shared tokens so
starvation is possible under certain circumstances.

* Create a critical code path in vm_fault() using the new shared token
feature to quickly fault-in pages which already exist in the VM cache.
pmap_object_init_pt() also uses the new feature.

This increases fault-in concurrency by a ridiculously huge amount,
particularly on SHM segments (say when you have a large number of postgres
clients). Scaling for large numbers of clients on large numbers of
cores is significantly improved.

This also increases fault-in concurrency for MAP_SHARED file maps.

* Expand the breadn() and cluster_read() APIs. Implement breadnx() and
cluster_readx() which allows a getblk()'d bp to be passed. If *bpp is not
NULL a bp is being passed in, otherwise the routines call getblk().

* Modify the HAMMER read path to use the new API. Instead of calling
getcacheblk() HAMMER now calls getblk() and checks the B_CACHE flag.
This gives getblk() a chance to regenerate a fully cached buffer from
VM backing store without having to acquire any hammer-related locks,
resulting in even faster operation.

* If kern.ipc.shm_use_phys is set to 2 the VM pages will be pre-allocated.
This can take quite a while for a large map and also lock the machine
up for a few seconds. Defaults to off.

* Reorder the smp_invltlb()/cpu_invltlb() combos in a few places, running
cpu_invltlb() last.

* An invalidation interlock might be needed in pmap_enter() under certain
circumstances, enable the code for now.

* vm_object_backing_scan_callback() was failing to properly check the
validity of a vm_object after acquiring its token. Add the required
check + some debugging.

* Make vm_object_set_writeable_dirty() a bit more cache friendly.

* The vmstats sysctl was scanning every process's vm_map (requiring a
vm_map read lock to do so), which can stall for long periods of time
when the system is paging heavily. Change the mechanic to a LWP flag
which can be tested with minimal locking.

* Have the phys_pager mark the page as dirty too, to make sure nothing
tries to free it.

* Remove the spinlock in pmap_prefault_ok(), since we do not delete page
table pages it shouldn't be needed.

* Add a required cpu_ccfence() in pmap_inval.c. The code generated prior
to this fix was still correct, and this makes sure it stays that way.

* Replace several manual wiring cases with calls to vm_page_wire().
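
The shared-token usage pattern referenced in the first bullet, as a hedged
sketch; only lwkt_gettoken_shared()/lwkt_reltoken() are the real calls, the
token and the work done under it are illustrative:

    static void
    cached_fault_path_sketch(struct lwkt_token *tok)
    {
            lwkt_gettoken_shared(tok);      /* many cpus may hold this at once */
            /* ...look up a page that is already resident in the VM cache... */
            lwkt_reltoken(tok);

            /*
             * Caveat from the text above: acquiring the token exclusively
             * while still holding it shared deadlocks, so release it first
             * and then take it exclusively if the slow path is required.
             */
    }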


