History log of /dragonfly/sys/vm/vm_map.h (Results 1 – 25 of 94)
Revision    Date    Author    Comments
Revision tags: v6.2.1, v6.2.0, v6.3.0, v6.0.1
# 949c56f8 23-Jul-2021 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Rename vm_map_wire() and vm_map_unwire()

* These names are mutant throwbacks to an earlier age and no
longer mean what they imply.

* Rename vm_map_wire() to vm_map_kernel_wiring(). This function can
wire and unwire VM ranges in a vm_map under kernel control. Userland
has no say.

* Rename vm_map_unwire() to vm_map_user_wiring(). This function can
wire and unwire VM ranges in a vm_map under user control. Userland
can adjust the user wiring state for pages.



# 30d365ff 22-May-2021 Aaron LI <aly@aaronly.me>

nvmm: Port to DragonFly #18: kernel memory allocation

Use kmem_alloc() and kmem_free() to implement uvm_km_alloc() and
uvm_km_free() as they're used in svm_vcpu_create() and vmx_vcpu_create().
However, our kmem_alloc() may return 0 (i.e., allocation failure), so
an extra check is needed in the caller functions.

We have defined 'kmem_alloc' and 'kmem_free' macros to adapt NetBSD's
functions to our kmalloc() and kfree(), so extra parentheses are added
around 'kmem_alloc' and 'kmem_free' to suppress macro expansion and
call the original functions (see the sketch below).

In addition, change 'kmem_free()' to 'uvm_km_free()' in
vmx_vcpu_create(), aligning it with the invocation pattern and
the use case in svm_vcpu_create().
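
A minimal, stand-alone illustration of the parenthesization trick noted
above. The names 'report' and 'real_calls' are placeholders rather than
anything from the nvmm code; the only point is the C preprocessor rule
that a function-like macro expands only when its name is immediately
followed by '(', so parenthesizing the name reaches the real function.

#include <stdio.h>

static int real_calls;

/* Adapter macro, analogous to the kmem_alloc/kmem_free macros above. */
#define report(x)   printf("macro path: %d\n", (x))

/*
 * The real function.  Parenthesizing its name in the definition keeps
 * the function-like macro from expanding here as well.
 */
static void
(report)(int x)
{
    real_calls++;
    printf("function path: %d\n", x);
}

int
main(void)
{
    report(1);      /* name followed by '(': the macro expands */
    (report)(2);    /* parenthesized name: the real function is called */
    printf("real function called %d time(s)\n", real_calls);
    return 0;
}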



# 737b020b 29-May-2021 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Misc adjustments to code documentation

* Misc adjustments to bring some of the pmap related code
comments up-to-date.

Submitted-by: falsifian (James Cook)


Revision tags: v6.0.0, v6.0.0rc1, v6.1.0
# 4d4f84f5 07-Jan-2021 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Remove MAP_VPAGETABLE

* This will break vkernel support for now, but after a lot of mulling
there's just no other way forward. MAP_VPAGETABLE was basically a
software page-table feature for mmap()s that allowed the vkernel
to implement page tables without needing hardware virtualization support.

* The basic problem is that the VM system is moving to an extent-based
mechanism for tracking VM pages entered into PMAPs and is no longer
indexing individual terminal PTEs with pv_entry's.

This means that the VM system is no longer able to get an exact list of
PTEs in PMAPs that a particular vm_page is using. It just has a
flag 'this page is in at least one pmap' or 'this page is not in any
pmaps'. To track down the PTEs, the VM system must run through the
extents via the vm_map_backing structures hanging off the related
VM object.

This mechanism does not work with MAP_VPAGETABLE. Short of scanning
the entire real pmap, the kernel has no way to reverse-index a page
that might be indirected through MAP_VPAGETABLE.

* We will need actual hardware mmu virtualization to get the vkernel
working again.



Revision tags: v5.8.3, v5.8.2, v5.8.1
# c7d06799 16-Mar-2020 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix rare vm_map_entry exhaustion panic (2)

* Increase per-cpu fast-cache hysteresis from its absurdly small
value to a significantly larger value.

* Missing header file update for prior commit



Revision tags: v5.8.0, v5.9.0, v5.8.0rc1
# 01251219 14-Feb-2020 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Start work on a better burst page-fault mechanic

* The vm.fault_quick sysctl is now a burst count. It still
defaults to 1, which is the same operation as before.

Performance is roughly the same with it set anywhere from 1 to 8, as
more work needs to be done to optimize pmap_enter().
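
As a hedged illustration only (the commit names the vm.fault_quick
sysctl but not its type; an integer is assumed here), the burst count
could be inspected and raised from userland via sysctlbyname(3):

#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>

int
main(void)
{
    int burst = 8;              /* assumed integer burst count to try */
    int old = 0;
    size_t oldlen = sizeof(old);

    /* Fetch the old value and install the new one in a single call. */
    if (sysctlbyname("vm.fault_quick", &old, &oldlen,
                     &burst, sizeof(burst)) == -1) {
        perror("sysctlbyname(vm.fault_quick)");
        return 1;
    }
    printf("vm.fault_quick: %d -> %d\n", old, burst);
    return 0;
}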



Revision tags: v5.6.3
# 4aa6d05c 12-Nov-2019 Matthew Dillon <dillon@apollo.backplane.com>

libc - Implement sigblockall() and sigunblockall() (2)

* Cleanup the logic a bit. Store the lwp or proc pointer
in the vm_map_backing structure and make vm_map_fork()
and friends more aware of it.

* Rearrange lwp allocation in [v]fork() to make the pointer(s)
available to vm_fork().

* Put the thread mappings on the lwp's list immediately rather
than waiting for the first fault, which means that per-thread
mappings will be deterministically removed on thread exit
whether any faults happened or not.

* Adjust vmspace_fork*() functions to not propagate 'dead' lwp
mappings for threads that won't exist in the forked process.
Only the lwp mappings for the thread doing the [v]fork() are
retained.



# 64b5a8a5 12-Nov-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - sigblockall()/sigunblockall() support (per thread shared page)

* Implement /dev/lpmap, a per-thread RW shared page between userland
and the kernel. Each thread in the process will receive a unique
shared page for communication with the kernel when memory-mapping
/dev/lpmap and can access various variables via this map.

* The current thread's TID is retained for both fork() and vfork().
Previously it was only retained for vfork(). This avoids userland
code confusion for any bits and pieces that are indexed based on the
TID.

* Implement support for a per-thread block-all-signals feature that
does not require any system calls (see next commit to libc). The
functions will be called sigblockall() and sigunblockall().

The lpmap->blockallsigs variable prevents normal signals from being
dispatched. They will still be queued to the LWP as usual. The
behavior is not quite that of a signal mask when dealing with
global signals.

The low 31 bits represent a recursion counter, allowing recursive
use of the functions. The high bit (bit 31) is set by the kernel
if a signal was prevented from being dispatched. When userland
decrements the counter to 0 (the low 31 bits), it checks bit 31 and,
if set, clears it and makes a dummy 'real' system call to cause the
pending signals to be delivered (see the sketch after this commit
message).

Synchronous TRAPs (e.g. kernel-generated SIGFPE, SIGSEGV, etc) are not
affected by this feature and will still be dispatched synchronously.

* PThreads is expected to unmap the mapped page upon thread exit.
The kernel will force-unmap the page upon thread exit if pthreads
does not.

XXX needs work - currently if the page has not been faulted in,
the kernel has no visibility into the mapping and will not unmap it,
but neither will it get confused if the address is accessed. To be
fixed soon; otherwise, programs using LWP primitives instead of
pthreads might not realize that libc has mapped the page.

* The TID is reset to 1 on a successful exec*()

* On [v]fork(), if lpmap exists for the current thread, the kernel will
copy the lpmap->blockallsigs value to the lpmap for the new thread
in the new process. This way sigblock*() state is retained across
the [v]fork().

This feature not only reduces code confusion in userland, it also
allows [v]fork() to be implemented by the userland program in a way
that ensures no signal races in either the parent or the new child
process until it is ready for them.

* The implementation leverages our vm_map_backing extents by having
the per-thread memory mappings indexed within the lwp. This allows
the lwp to remove the mappings when it exits (since not doing so
would result in a wild pmap entry and kernel memory disclosure).

* The implementation currently delays instantiation of the mapped
page(s) and some side structures until the first fault.

XXX this will have to be changed.
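
A userland-side sketch of the blockallsigs protocol described above.
The pointer setup and the choice of dummy system call are assumptions
made for illustration; only the counter semantics (low 31 bits as the
recursion count, bit 31 meaning a signal was deferred) come from the
commit message.

#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

#define BLOCKALLSIGS_PENDING  0x80000000u   /* bit 31, set by the kernel */

/* Assumed to point at the blockallsigs word in the mapped /dev/lpmap page. */
static _Atomic uint32_t *lpmap_blockallsigs;

void
my_sigblockall(void)
{
    /* The low 31 bits form a recursion counter; no system call needed. */
    atomic_fetch_add(lpmap_blockallsigs, 1);
}

void
my_sigunblockall(void)
{
    uint32_t v = atomic_fetch_sub(lpmap_blockallsigs, 1) - 1;

    if ((v & ~BLOCKALLSIGS_PENDING) == 0 && (v & BLOCKALLSIGS_PENDING)) {
        /*
         * The counter reached zero with a deferred signal pending:
         * clear bit 31 and make any cheap 'real' system call so the
         * kernel delivers the queued signals.  getpid() is just an
         * illustrative choice of dummy call.
         */
        atomic_fetch_and(lpmap_blockallsigs, ~BLOCKALLSIGS_PENDING);
        (void)getpid();
    }
}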



Revision tags: v5.6.2, v5.6.1, v5.6.0, v5.6.0rc1, v5.7.0
# e3c330f0 19-May-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - VM rework part 12 - Core pmap work, stabilize & optimize

* Add tracking for the number of PTEs mapped writeable in md_page.
Change how PG_WRITEABLE and PG_MAPPED are cleared in the vm_page
to avoid clear/set races. This problem occurs because we would
otherwise have tried to clear the bits without hard-busying the
page. This allows the bits to be set with only an atomic op.

Procedures which test these bits universally do so while holding
the page hard-busied, and now call pmap_mapped_sync() beforehand to
properly synchronize the bits.

* Fix bugs related to various counters: pm_stats.resident_count,
wiring counts, vm_page->md.writeable_count, and
vm_page->md.pmap_count.

* Fix bugs related to synchronizing removed pte's with the vm_page.
Fix one case where we were improperly updating (m)'s state based
on a lost race against a pte swap-to-0 (pulling the pte).

* Fix a bug related to the page soft-busying code when the
m->object/m->pindex race is lost.

* Implement a heuristic version of vm_page_active() which just
updates act_count unlocked if the page is already in the
PQ_ACTIVE queue, or if it is fictitious.

* Allow races against the backing scan for pmap_remove_all() and
pmap_page_protect(VM_PROT_READ). Callers of these routines for
these cases expect full synchronization of the page dirty state.
We can identify when a page has not been fully cleaned out by
checking vm_page->md.pmap_count and vm_page->md.writeable_count.
In the rare situation where this happens, simply retry.

* Assert that the PTE pindex is properly interlocked in pmap_enter().
We still allow PTEs to be pulled by other routines without the
interlock, but multiple pmap_enter()s of the same page will be
interlocked.

* Assert additional wiring count failure cases.

* (UNTESTED) Flag DEVICE pages (dev_pager_getfake()) as being
PG_UNMANAGED. This essentially prevents all the various
reference counters (e.g. vm_page->md.pmap_count and
vm_page->md.writeable_count), PG_M, PG_A, etc from being
updated.

The vm_page's aren't tracked in the pmap at all because there
is no way to find them; they are 'fake', so without a pv_entry,
we can't track them. Instead we simply rely on the vm_map_backing
scan to manipulate the PTEs.

* Optimize the new vm_map_entry_shadow() to use a shared object
token instead of an exclusive one. OBJ_ONEMAPPING will be cleared
with the shared token.

* Optimize single-threaded access to pmaps to avoid pmap_inval_*()
complexities.

* Optimize __read_mostly for more globals.

* Optimize pmap_testbit(), pmap_clearbit(), pmap_page_protect().
Pre-check vm_page->md.writeable_count and vm_page->md.pmap_count
for an easy degenerate return; before real work.

* Optimize pmap_inval_smp() and pmap_inval_smp_cmpset() for the
single-threaded pmap case, when called on the same CPU the pmap
is associated with. This allows us to use simple atomics and
cpu_*() instructions and avoid the complexities of the
pmap_inval_*() infrastructure.

* Randomize the page queue used in bio_page_alloc(). This does not
appear to hurt performance (e.g. heavy tmpfs use) on large many-core
NUMA machines and it makes the vm_page_alloc()'s job easier.

This change might have a downside for temporary files, but for more
long-lasting files there's no point allocating pages localized to a
particular cpu.

* Optimize vm_page_alloc().

(1) Refactor the _vm_page_list_find*() routines to avoid re-scanning
the same array indices over and over again when trying to find
a page.

(2) Add a heuristic, vpq.lastq, for each queue, which we set if a
_vm_page_list_find*() operation had to go far-afield to find its
page. Subsequent finds will skip to the far-afield position until
the current CPUs queues have pages again.

(3) Reduce PQ_L2_SIZE from an extravagant 2048 entries per queue down
to 1024. The original 2048 was meant to provide 8-way
set-associativity for 256 cores but wound up reducing performance
due to longer index iterations.

* Refactor the vm_page_hash[] array. This array is used to shortcut
vm_object locks and locate VM pages more quickly, without locks.
The new code limits the size of the array to something more reasonable,
implements a 4-way set-associative replacement policy using 'ticks',
and rewrites the hashing math (a generic sketch of such a cache
follows this commit message).

* Effectively remove pmap_object_init_pt() for now. In current tests
it does not actually improve performance, probably because it may
map pages that are not actually used by the program.

* Remove vm_map_backing->refs. This field is no longer used.

* Remove more of the old now-stale code related to use of pv_entry's
for terminal PTEs.

* Remove more of the old shared page-table-page code. This worked but
could never be fully validated and was prone to bugs. So remove it.
In the future we will likely use larger 2MB and 1GB pages anyway.

* Remove pmap_softwait()/pmap_softhold()/pmap_softdone().

* Remove more #if 0'd code.
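
A generic, stand-alone sketch of a 4-way set-associative lookup cache
with a 'ticks'-based replacement policy, in the spirit of the
vm_page_hash[] rework above. The table size, hash multiplier, and
key/value types are illustrative and do not reproduce the kernel code.

#include <stddef.h>
#include <stdint.h>

#define HASH_SETS   1024        /* power of two; illustrative sizing */
#define HASH_WAYS   4

struct hash_ent {
    uintptr_t key;              /* e.g. (object, pindex) folded together */
    void     *value;            /* e.g. the cached vm_page pointer */
    uint32_t  ticks;            /* last-use time, drives replacement */
};

static struct hash_ent hash_tab[HASH_SETS][HASH_WAYS];

static struct hash_ent *
hash_set_for(uintptr_t key)
{
    return hash_tab[(key * 2654435761u) & (HASH_SETS - 1)];
}

void *
hash_lookup(uintptr_t key, uint32_t now)
{
    struct hash_ent *set = hash_set_for(key);
    int i;

    for (i = 0; i < HASH_WAYS; ++i) {
        if (set[i].key == key) {
            set[i].ticks = now;         /* refresh on hit */
            return set[i].value;
        }
    }
    return NULL;
}

void
hash_insert(uintptr_t key, void *value, uint32_t now)
{
    struct hash_ent *set = hash_set_for(key);
    struct hash_ent *victim = &set[0];
    int i;

    /* Replace the least recently used entry in the 4-way set. */
    for (i = 1; i < HASH_WAYS; ++i) {
        if ((int32_t)(set[i].ticks - victim->ticks) < 0)
            victim = &set[i];
    }
    victim->key = key;
    victim->value = value;
    victim->ticks = now;
}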



Revision tags: v5.4.3
# 67e7cb85 14-May-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - VM rework part 8 - Precursor work for terminal pv_entry removal

* Adjust structures so the pmap code can iterate backing_ba's with
just the vm_object spinlock.

Add a ba.pmap back-pointer.

Move entry->start and entry->end into the ba (ba.start, ba.end).
This replicates the base entry->ba.start and entry->ba.end,
but local modifications are locked by individual objects to allow
pmap ops to just look at backing ba's iterated via the object.

Remove the entry->map back-pointer.

Remove the ba.entry_base back-pointer.

* ba.offset is now an absolute offset and not additive. Adjust all code
that calculates and uses ba.offset (fortunately it is all concentrated
in vm_map.c and vm_fault.c).

* Refactor ba.start/offset/end modifications to be atomic with
the necessary spin-locks to allow the pmap code to safely iterate
the vm_map_backing list for a vm_object.

* Test VM system with full synth run.



# 5b329e62 11-May-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - VM rework part 7 - Initial vm_map_backing index

* Implement a TAILQ and hang vm_map_backing structures off
of the related object. This feature is still in progress
and will eventually be used to allow pmaps to manipulate
vm_page's without pv_entry's.

At the same time, remove all sharing of vm_map_backing.
For example, clips no longer share the vm_map_backing. We
can't share the structures if they are being used to
itemize areas for pmap management.

TODO - reoptimize this at some point.

TODO - not yet quite deterministic enough for pmap
searches (due to clips).

* Refactor vm_object_reference_quick() to again allow
operation on any vm_object whose ref_count is already
at least 1, or which belongs to a vnode. The ref_count
is no longer being used for complex vm_object collapse,
shadowing, or migration code.

This allows us to avoid a number of unnecessary token
grabs on objects during clips, shadowing, and forks.

* Cleanup a few fields in vm_object. Name TAILQ_ENTRY()
elements blahblah_entry instead of blahblah_list.

* Fix an issue with a.out binaries (that are still supported but
nobody uses) where the object refs on the binaries were not
being properly accounted for.



# 8492a2fe 10-May-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - VM rework part 5 - Cleanup

* Cleanup vm_map_entry_shadow()

* Remove (unused) vmspace_president_count()
Remove (barely used) struct lwkt_token typedef.

* Cleanup the vm_map_aux, vm_map_entry, vm_map, and vm_object
structures

* Adjustments to in-code documentation



# 44293a80 09-May-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - VM rework part 3 - Cleanup pass

* Cleanup various structures and code


# 9de48ead 09-May-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - VM rework part 2 - Replace backing_object with backing_ba

* Remove the vm_object based backing_object chains and all related
chaining code.

This removes an enormous number of locks from the VM system and
also removes object-to-object dependencies, which required careful
traversal code. A great deal of complex code has been removed
and replaced with far simpler code.

Ultimately the intention will be to support removal of pv_entry
tracking from vm_pages to gain lockless shared faults, but that
is far in the future. It will require hanging vm_map_backing
structures off of a list based in the object.

* Implement the vm_map_backing structure which is embedded in the
vm_map_entry and then links to additional dynamically allocated
vm_map_backing structures via entry->ba.backing_ba. This structure
contains the object and offset and essentially takes over the
functionality that object->backing_object used to have.

Backing objects are now handled via vm_map_backing. In this
commit, fork operations create a fan-in tree to share subsets
of backings via vm_map_backing; these subsets are not collapsed
in any way yet (a simplified sketch of the layout follows this
commit message).

* Remove all the vm_map_split and collapse code. Every last line
is gone. It will be reimplemented using vm_map_backing in a
later commit.

This means that as-of this commit both recursive forks and
parent-to-multiple-children forks cause an accumulation of
inefficient lists of backing objects to occur in the parent
and children. This will begin to get addressed in part 3.

* The code no longer releases the vm_map lock (typically shared)
across (get_pages) I/O. There are no longer any chaining locks to
get in the way (hopefully). This means that the code does not
have to re-check as carefully as it did before. However, some
complexity will have to be added back in once we begin to address
the accumulation of vm_map_backing structures.

* Paging performance improved by 30-40%
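
A much-simplified sketch of the layout described above, written as
stand-alone C. Only the general shape comes from the commit: an
embedded head in the map entry that chains to dynamically allocated
backing layers via backing_ba, each layer carrying an object and an
offset. The field set and types here are illustrative, not the real
DragonFly definitions.

#include <stddef.h>
#include <stdint.h>

struct vm_object;                       /* opaque for this sketch */
typedef int64_t vm_ooffset_t;           /* stand-in type */

struct vm_map_backing {
    struct vm_map_backing *backing_ba;  /* next (deeper) layer, or NULL */
    struct vm_object      *object;      /* object providing pages here */
    vm_ooffset_t           offset;      /* offset into that object */
};

struct vm_map_entry {
    /* ... other entry fields omitted ... */
    struct vm_map_backing  ba;          /* embedded head of the chain */
};

/* Walk the chain front-to-back and return the shallowest layer's object. */
struct vm_object *
first_backing_object(struct vm_map_entry *entry)
{
    struct vm_map_backing *ba;

    for (ba = &entry->ba; ba != NULL; ba = ba->backing_ba) {
        if (ba->object != NULL)
            return ba->object;
    }
    return NULL;
}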



# d6924570 03-May-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix serious bug in MAP_STACK, deprecate auto-grow semantics

* When MAP_STACK is used without MAP_TRYFIXED, the address the kernel
determined for the stack was *NOT* being returned to userland. Instead,
userland always got only the hint address.

* This fixes ruby MAP_STACK use cases and possibly more.

* Deprecate MAP_STACK auto-grow semantics. All user mmap() calls with
MAP_STACK are now converted to normal MAP_ANON mmaps. The kernel will
continue to create an auto-grow stack segment for the primary user stack
in exec(), allowing older pthread libraries to continue working, but this
feature is deprecated and will be removed in a future release.



Revision tags: v5.4.2
# 70f3bb08 23-Mar-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Preliminary vm_page hash lookup

* Add preliminary vm_page hash lookup code which avoids most
locks, plus support in vm_fault. Default disabled, with debugging
for now.

* This code still soft-busies the vm_page, which is an improvement over
hard-busying it in that it won't contend, but we will eventually want
to entirely avoid all atomic ops on the vm_page to *really* get the
concurrent fault performance.



# 47ec0953 23-Mar-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor vm_map structure 1/2

* Remove the embedded vm_map_entry 'header' from vm_map.

* Remove the prev and next fields from vm_map_entry.

* Refactor the code to iterate only via the RB tree. This is not as
optimal as the prev/next fields were, but we can improve the RB tree
code later to recover the performance.



# 2752a90b 05-Mar-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Remove vm_map min_offset and max_offset macros

* The symbols 'min_offset' and 'max_offset' are common; do not use
them as #define'd macros for vm_map's header.start and header.end.

* Fixes symbol collision related to the drm work.



# 4b566556 17-Feb-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Implement sbrk(), change low-address mmap hinting

* Change mmap()'s internal lower address bound from dmax (32GB)
to RLIMIT_DATA's current value. This allows the rlimit to be
e.g. reduced and for hinted mmap()s to then map space below
the 4GB mark. The default data rlimit is 32GB.

This change is needed to support several languages, at least
lua and probably another one or two, which use mmap hinting
under the assumption that it can map space below the 4GB
address mark. The data limit must be lowered with a limit command
too, which can be scripted or patched for such programs.

* Implement the sbrk() system call. This system call was already
present but just returned EOPNOTSUPP, and libc previously had its
own shim for sbrk() which used the ancient break() system call.
(Note that the prior implementation did not return ENOSYS or raise a signal.)

sbrk() in the kernel is thread-safe for positive increments and
is also byte-granular (the old libc sbrk() was only page-granular).

sbrk() in the kernel does not implement negative increments and
will return EOPNOTSUPP if asked to. Negative increments were
historically designed to be able to 'free' memory allocated with
sbrk(), but that case cannot be implemented in a modern VM system,
for two reasons: (1) the new mmap hinting changes make it possible for
normal mmap()s to have mapped space prior to the RLIMIT_DATA resource
limit being increased, causing intermingling of sbrk() and user mmap()d
regions; and (2) negative increments are not even remotely thread-safe.
(See the usage sketch after this commit message.)

* Note the previous commit refactored libc to use the kernel sbrk()
and fall-back to its previous emulation code on failure, so libc
supports both new and old kernels.

* Remove the brk() shim from libc. brk() is not implemented by the
kernel and its symbol has been removed. This requires testing against
ports, so we may have to add it back in, but basically there is no way
to implement brk() properly with the mmap() hinting fix.

* Adjust manual pages.
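
A minimal usage sketch of the sbrk() semantics this commit describes:
byte-granular positive increments succeed, and a negative increment is
refused with EOPNOTSUPP. This illustrates the documented behavior; it
is not a test from the source tree.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    void *old, *cur;

    old = sbrk(123);                    /* byte-granular positive increment */
    if (old == (void *)-1) {
        perror("sbrk(+123)");
        return 1;
    }
    cur = sbrk(0);                      /* query the current break */
    printf("break moved from %p to %p\n", old, cur);

    if (sbrk(-123) == (void *)-1 && errno == EOPNOTSUPP)
        printf("negative increments are refused, as documented\n");
    return 0;
}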



Revision tags: v5.4.1, v5.4.0, v5.5.0, v5.4.0rc1, v5.2.2, v5.2.1, v5.2.0, v5.3.0, v5.2.0rc, v5.0.2
# 7a45978d 09-Nov-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix bug in vm_fault_page()

* Fix a bug in vm_fault_page() and vm_fault_page_quick(). The code
is not intended to update the user pmap, but if the vm_map_lookup()
results in a COW, any existing page in the underlying pmap will no
longer match the page that should be there.

The user process will still work correctly in that it will fault the
COW'd page if/when it tries to issue a write to that address, but
userland will not have visibility to any kernel use of vm_fault_page()
that modifies the page and causes a COW if the page has already been
faulted in.

* Fixed by detecting the COW and at least removing the pte from the pmap
to force userland to re-fault it.

* This fixes gdb operation on programs. The problem did not rear its
head before because the kernel did not pre-populate as many pages in the
initial exec as it does now.

* Enhance vm_map_lookup()'s &wired argument to return wflags instead,
which includes FS_WIRED and also now has FS_DIDCOW.

Reported-by: profmakx



Revision tags: v5.0.1
# 641f3b0a 02-Nov-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor vm_fault and vm_map a bit.

* Allow the virtual copy feature to be disabled via a sysctl.
Default enabled.

* Fix a bug in the virtual copy test. Multiple elements were
not being retested after reacquiring the map lock.

* Change the auto-partitioning of vm_map_entry structures from
16MB to 32MB. Add a sysctl to allow the feature to be disabled.
Default enabled.

* Cleanup map->timestamp bumps. Basically we bump it in
vm_map_lock(), and also fix a bug where it was not being
bumped after relocking the map in the virtual copy feature.

* Fix an incorrect assertion in vm_map_split(). Refactor tests
in vm_map_split(). Also, acquire the chain lock for the VM
object in the caller to vm_map_split() instead of in vm_map_split()
itself, allowing us to include the pmap adjustment within the
locked area.

* Make sure OBJ_ONEMAPPING is cleared for nobject in vm_map_split().

* Fix a bug in a call to vm_map_transition_wait() that
double-locked the vm_map in the partitioning code.

* General cleanups in vm/vm_object.c



# 22b7a3db 17-Oct-2017 Sascha Wildner <saw@online.de>

kernel: Remove <sys/sysref{,2}.h> inclusion from files that don't need it.

Some of the headers are public in one way or another so bump
__DragonFly_version for safety.

While here, add a missing <sys/objcache.h> include to kern_exec.c which
was previously relying on it coming in via <sys/sysref.h> (which was
included by <sys/vm_map.h> prior to this commit).



# ce5d7a1c 15-Oct-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Partition large anon mappings, optimize vm_map_entry_reserve*()

* Partition large anonymous mappings in (for now) 16MB chunks.
The purpose of this is to improve concurrent VM faults for
threaded programs. Note that the pmap itself is still a
bottleneck (see the partitioning sketch after this commit message).

* Refactor vm_map_entry_reserve() and related code to remove
unnecessary critical sections.



Revision tags: v5.0.0, v5.0.0rc2, v5.1.0, v5.0.0rc1
# e6b81333 12-Aug-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix bottlenecks that develop when many processes are running

* When a large number of processes or threads are running (in the tens of
thousands or more), a number of O(n) or O(ncpus) bottlenecks can develop.
These bottlenecks do not develop when only a few thousand threads
are present.

By fixing these bottlenecks, and assuming kern.maxproc is autoconfigured
or manually set high enough, DFly can now handle hundreds of thousands
of active processes running, polling, sleeping, whatever.

Tested to around 400,000 discrete processes (no shared VM pages) on
a 32-thread dual-socket Xeon system. Each process is placed in a
1/10 second sleep loop using umtx timeouts:

baseline - (before changes), system bottlenecked starting
at around the 30,000 process mark, eating all
available cpu, high IPI rate from hash
collisions, and other unrelated user processes
bogged down due to the scheduling overhead.

200,000 processes - System settles down to 45% idle, and low IPI
rate.

220,000 processes - System 30% idle and low IPI rate

250,000 processes - System 0% idle and low IPI rate

300,000 processes - System 0% idle and low IPI rate.

400,000 processes - Scheduler begins to bottleneck again after the
350,000 mark while the process test is still in its
fork/exec loop.

Once all 400,000 processes are settled down,
system behaves fairly well. 0% idle, modest
IPI rate averaging 300 IPI/sec/cpu (due to
hash collisions in the wakeup code).

* More work will be needed to better handle processes with massively
shared VM pages.

It should also be noted that the system does a *VERY* good job
allocating and releasing kernel resources during this test using
discrete processes. It can kill 400,000 processes in a few seconds
when I ^C the test.

* Change lwkt_enqueue()'s linear td_runq scan into a double-ended scan.
This bottleneck does not arise when large numbers of processes are
running in usermode, because typically only one user process per cpu
will be scheduled to LWKT.

However, this bottleneck does arise when large numbers of threads
are woken up in-kernel. While in-kernel, a thread schedules directly
to LWKT. Round-robin operation tends to result in appends to the tail
of the queue, so this optimization saves an enormous amount of cpu
time when large numbers of threads are present.

* Limit ncallout to ~5 minutes worth of ring. The calculation code is
primarily designed to allocate less space on low-memory machines,
but will also cause an excessively-sized ring to be allocated on
large-memory machines. 512MB was observed on a 32-way box.

* Remove vm_map->hint, which had basically stopped functioning in a
useful manner. Add a new vm_map hinting mechanism that caches up to
four (size, align) start addresses for vm_map_findspace(). This cache
is used to quickly index into the linear vm_map_entry list before
entering the linear search phase.

This fixes a serious bottleneck that arises due to vm_map_findspace()'s
linear scan of the vm_map_entry list when the kernel_map becomes
fragmented, typically when the machine is managing a large number of
processes or threads (in the tens of thousands or more).

This will also reduce overheads for processes with highly fragmented
vm_maps.

* Dynamically size the action_hash[] array in vm/vm_page.c. This array
is used to record blocked umtx operations. The limited size of the
array could result in an excessive number of hash entries when a large
number of processes/threads are present in the system. Again, the
effect is noticed as the number of threads exceeds a few tens of
thousands.



Revision tags: v4.8.1, v4.8.0, v4.6.2, v4.9.0, v4.8.0rc
# fc531fbc 05-Feb-2017 Matthew Dillon <dillon@apollo.backplane.com>

vkernel - Fix more pagein/pageout corruption

* There is a race when the real kernel walks a virtual page
table (VPAGETABLE) as created by a vkernel managing various
contexts. The real kernel may complete the lookup but get
interrupted by a pmap invalidation BEFORE it enters the results
into the pmap. The result is that the pmap invalidation is not
applied to the PTE entered into the pmap, leading to data corruption.

* Fix with a bit of a hack for now. Lock the VA in vm_fault and lock
the VA in MADV_INVAL operations (which is what the vkernel uses to
invalidate the pmap). This closes the hole.

* This race has to be fixed in the real kernel but normal programs outside
of a vkernel are not affected by it because they don't use VPAGETABLE
mappings.

* buildworld -j many in an intentionally hard-paging vkernel now completes
without error.


