Revision tags: v6.2.1, v6.2.0, v6.3.0, v6.0.1, v6.0.0, v6.0.0rc1, v6.1.0, v5.8.3, v5.8.2, v5.8.1 |
|
#
53a91b8e |
| 04-Mar-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Minor optimizations
* Minor __predict and __read_mostly/frequently optimizations.
|
Revision tags: v5.8.0, v5.9.0, v5.8.0rc1 |
|
#
9cd8f4f8 |
| 11-Feb-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Warn/assert on broken ACPI MADT
* Add warnings and assertions for broken ACPI MADT tables. I encountered this trying to boot a 3990X on a motherboard with an old BIOS that didn't support it. It tried to boot anyway, but the MADT table was mangled and caused a null-pointer indirection in the kernel. Assert nicely instead.
|
Revision tags: v5.6.3 |
|
#
e2164e29 |
| 18-Oct-2019 |
zrj <rimvydas.jasinskas@gmail.com> |
<sys/slaballoc.h>: Switch to lighter <sys/_malloc.h> header.
The <sys/globaldata.h> embeds SLGlobalData that in turn embeds the "struct malloc_type". Adjust several kernel sources for missing includes where memory allocation is performed. Try to use alphabetical include order.
Now (in most cases) <sys/malloc.h> is included after <sys/objcache.h>. Once it gets cleaned up, the <sys/malloc.h> inclusion could be moved out of <sys/idr.h> to drm Linux compat layer linux/slab.h without side effects.
|
#
bce6845a |
| 18-Oct-2019 |
zrj <rimvydas.jasinskas@gmail.com> |
kernel: Minor whitespace cleanup in a few sources.
|
Revision tags: v5.6.2, v5.6.1, v5.6.0, v5.6.0rc1, v5.7.0, v5.4.3, v5.4.2, v5.4.1, v5.4.0, v5.5.0, v5.4.0rc1 |
|
#
8e5d7c42 |
| 15-Oct-2018 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Fix NUMA contention due to asymmetric memory
* Fix NUMA contention in situations where memory is associated with CPU cores asymmetrically. In particular, with the 2990WX, half the cores will have no memory associated with them.
* This was forcing DFly to allocate memory from queues belonging to other nearby cores, causing unnecessary SMP contention as well as burning extra time iterating queues.
* Fix by calculating the average number of free pages per-core, and then adjust any VM page queue with fewer pages than the average by stealing pages from queues with more than the average. We use a simple iterator to steal pages, so the CPUs with less (or zero) direct-attached memory will operate more UMA-like (just on 4K boundaries instead of 256-1024 byte boundaries).
* Tested with a 64-thread concurrent compile test. systat -pv 1 showed all remaining contention disappear. Literally, *ZERO* contention when we run the test with each thread in its own jail with no shared resources.
* NOTE! This fix is specific to asymmetric NUMA configurations, which are fairly rare in the wild, and will not speed up more conventional systems.
* Before and after timings on the 2990WX.
cd /tmp/src
time make -j 128 nativekernel NO_MODULES=TRUE > /dev/null

BEFORE  703.915u 167.605s 0:49.97 1744.0% 9993+749k 22188+8io 216pf+0w
        699.550u 171.148s 0:50.87 1711.5% 9994+749k 21066+8io 150pf+0w
AFTER   678.406u 108.857s 0:45.66 1724.1% 10105+757k 22188+8io 216pf+0w
        674.805u 115.256s 0:46.67 1692.8% 10077+755k 21066+8io 150pf+0w
This is a 4.2 second difference on the second run, an improvement of over 8%, which is nothing to sneeze at.
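The stealing pass described above can be sketched roughly as follows; `struct pgqueue` and `rebalance_queues` are illustrative stand-ins, not the kernel's actual vm_page_queue structures:

```c
#include <assert.h>

/* Illustrative stand-in for a per-cpu free-page queue. */
struct pgqueue {
    long free_count;
};

/*
 * Bring every queue up to the average free-page count by stealing pages
 * from queues above the average, so cores with little or no
 * direct-attached memory still end up with local queues to allocate from.
 */
void
rebalance_queues(struct pgqueue *q, int nqueues)
{
    long total = 0;
    long avg;
    int i, j;

    for (i = 0; i < nqueues; ++i)
        total += q[i].free_count;
    avg = total / nqueues;

    for (i = 0; i < nqueues; ++i) {
        for (j = 0; j < nqueues && q[i].free_count < avg; ++j) {
            while (q[j].free_count > avg && q[i].free_count < avg) {
                q[j].free_count--;
                q[i].free_count++;
            }
        }
    }
}
```

The real code transfers on 4K page boundaries via a simple iterator rather than one count at a time, but the invariant is the same: no queue is left below the per-core average while another sits above it.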
|
#
c70d4562 |
| 23-Aug-2018 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Update AMD topology detection, scheduler NUMA work (TR2)
* Update AMD topology detection to use the correct cpuid. It now properly detects the Threadripper 2990WX as having four nodes with 8 cores and 2 threads per core, per node. It previously detected the chip as one node with 32 cores and 2 threads per core.
* Report the basic detected topology without requiring bootverbose.
* Record information about how much memory is attached to each node. We previously just assumed that it was symmetric. This will be used by the scheduler.
* Fix instability in the scheduler when running on a large number of cores. Flag 0x08 (on by default) is needed to actively schedule overloaded threads onto other cores, but this operation was being executed on all cores simultaneously, which throws the uload/ucount metrics into an unstable state, causing threads to bounce around longer than necessary.
Fix by round-robining the operation based on something similar to sched_ticks % cpuid.
This significantly improves heavy multi-tasking performance on systems with many cores.
* Add memory-on-node weighting to the scheduler. This detects asymmetric NUMA configurations for situations where not all DIMM slots have been populated, and for CPUs which are naturally asymmetric such as the 2990WX, which only has memory directly connected to two of its four nodes.
This change will preferentially schedule threads onto nodes with greater amounts of attached memory under light loads, and dig into the less desirable cpu nodes as the load increases.
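The round-robin gating described above fits in a couple of lines; `should_rebalance` and its parameters are hypothetical names for illustration, not the actual usched_dfly identifiers:

```c
#include <assert.h>

/*
 * Gate the overload-rebalance pass so that exactly one cpu runs it on
 * any given scheduler tick, instead of all cpus running it at once and
 * destabilizing the shared uload/ucount metrics.
 */
int
should_rebalance(unsigned int ticks, int cpuid, int ncpus)
{
    return (int)(ticks % (unsigned int)ncpus) == cpuid;
}
```

Each cpu still gets its turn every `ncpus` ticks, so overloaded threads are still migrated, just not by every core simultaneously.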
|
Revision tags: v5.2.2, v5.2.1, v5.2.0, v5.3.0, v5.2.0rc, v5.0.2, v5.0.1, v5.0.0, v5.0.0rc2, v5.1.0, v5.0.0rc1, v4.8.1, v4.8.0, v4.6.2, v4.9.0, v4.8.0rc |
|
#
3a3b0c3a |
| 06-Jan-2017 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - vmm_init() must run after SMP startup
* vmm_init() must run after SMP startup (fix bug introduced by recent commits).
* cleanup.
|
#
6f2099fe |
| 06-Jan-2017 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Add NUMA awareness to vm_page_alloc() and related functions (2)
* Fix miscellaneous bugs in the recent NUMA commits.
* Add kern.numa_disable, setting this to 1 in /boot/loader.conf will disable the NUMA code. Note that NUMA is only applicable on multi-socket systems.
|
#
c7f9edd8 |
| 06-Jan-2017 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Add NUMA awareness to vm_page_alloc() and related functions
* Add NUMA awareness to the kernel memory subsystem. This first iteration will primarily affect user pages. kmalloc and objcache are not NUMA-friendly yet (and it's questionable how useful it would be to make them so).
* Tested with synth on monster (4-socket opteron / 48 cores) and a 2-socket xeon (32 threads). Appears to dole out localized pages 5:1 to 10:1.
|
Revision tags: v4.6.1, v4.6.0 |
|
#
9002b0d5 |
| 30-Jul-2016 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Refactor cpu localization for VM page allocations (2)
* Finish up the refactoring. Localize backoffs for search failures by doing a masked domain search. This avoids bleeding into non-local page queues until we've completely exhausted our local queues, regardless of the starting pg_color index.
* We try to maintain 16-way set associativity for VM page allocations even if the topology does not allow us to do it perfectly. So, for example, a 4-socket x 12-core (48-core) opteron can break the 256 queues into 4 x 64 queues, then split the 12 cores per socket into sets of 3, giving 16 queues (the minimum) to each set of 3 cores.
* Refactor the page-zeroing code to only check the localized area. This fixes a number of issues related to the zeroed pages in the queues winding up severely unbalanced. Other cpus in the local group can help replenish a particular cpu's pre-zeroed pages, but we intentionally allow a heavy user to exhaust them.
* Adjust the cpu topology code to normalize the physical package id. Some machines start at 1, some machines start at 0. Normalize everything to start at 0.
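Using the commit's own example (256 queues, a 4-socket x 12-core opteron, cores grouped in sets of 3), the queue-slice mapping might look like this; the constants and the helper name are taken from the example, not from the real code:

```c
#include <assert.h>

#define PQ_L2_SIZE        256   /* total PQ_FREE queues in the example */
#define NSOCKETS          4
#define CORES_PER_SOCKET  12
#define CORES_PER_SET     3     /* 3-core groups, 16 queues each */

/*
 * Map a cpu to the base index of its queue slice: 256/4 = 64 queues per
 * socket, 12/3 = 4 core-sets per socket, 64/4 = 16 queues per set,
 * preserving the 16-way minimum associativity.
 */
int
queue_base_for_cpu(int cpu)
{
    int socket = cpu / CORES_PER_SOCKET;
    int set = (cpu % CORES_PER_SOCKET) / CORES_PER_SET;
    int q_per_socket = PQ_L2_SIZE / NSOCKETS;
    int q_per_set = q_per_socket / (CORES_PER_SOCKET / CORES_PER_SET);

    return socket * q_per_socket + set * q_per_set;
}
```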
|
#
33ee48c4 |
| 30-Jul-2016 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Refactor cpu localization for VM page allocations
* Change how cpu localization works. The old scheme was extremely unbalanced in terms of vm_page_queue[] load.
The new scheme uses cpu topology information to break the vm_page_queue[] down into major blocks based on the physical package id, minor blocks based on the core id in each physical package, and then by 1's based on (pindex + object->pg_color).
If PQ_L2_SIZE is not big enough such that 16-way operation is attainable by physical and core id, we break the queue down only by physical id.
Note that the core id is a real core count, not a cpu thread count, so an 8-core/16-thread x 2 socket xeon system will just fit in the 16-way requirement (there are 256 PQ_FREE queues).
* When a particular queue does not have a free page, iterate nearby queues starting at +/- 1 (previously we started at +/- PQ_L2_SIZE/2), in an attempt to retain as much locality as possible. This won't be perfect, but it should be good enough.
* Also fix an issue with the idlezero counters.
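The nearest-first fallback order can be sketched as an index generator; `next_nearby_queue` is a hypothetical helper illustrating the +/- 1 expansion, not the kernel's actual iteration code:

```c
#include <assert.h>

/*
 * Enumerate fallback queues from a starting index in nearest-first
 * order: step 0 is the home queue, then +1, -1, +2, -2, ... wrapped
 * into [0, nqueues), so distant queues are probed only after nearby
 * ones are exhausted.
 */
int
next_nearby_queue(int start, int step, int nqueues)
{
    int delta = (step + 1) / 2;
    int idx = (step & 1) ? start + delta : start - delta;

    idx %= nqueues;
    if (idx < 0)
        idx += nqueues;
    return idx;
}
```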
|
Revision tags: v4.6.0rc2, v4.6.0rc, v4.7.0 |
|
#
d8f4ebf4 |
| 23-Apr-2016 |
Charlie Root <root@apollo.backplane.com> |
kernel - Reduce BSS size to fix loader initrd problem
* kernel + modules + initrd.img (unpacked) exceeded the 63MB the loader has available for load-time data.
* Top hogs are mainly in BSS. Make intr_info_ary and pcpu_sysctl kmalloc after boot instead of BSS as a temporary fix.
  335872 ifnet_threads
  335872 netisr_cpu
  339968 dummy_pcpu
  344064 bsd4_pcpu
  344064 stoppcbs
  346112 softclock_pcpu_ary
  538624 cpu_topology_nodes
  755712 dfly_pcpu
  786432 icu_irqmaps
  958464 map_entry_init
 1048576 idt_arr
 1064960 pcpu_sysctl      <---- now kmallocd
 1179648 ioapic_irqmaps   <---- (used too early, cannot be kmallocd)
 5242880 intr_info_ary    <---- now kmallocd
* Should fix loader issues when trying to use initrd.img[.gz] for now.
Reported-by: Valheru
|
Revision tags: v4.4.3, v4.4.2, v4.4.1, v4.4.0, v4.5.0, v4.4.0rc, v4.2.4, v4.3.1, v4.2.3, v4.2.1, v4.2.0, v4.0.6, v4.3.0, v4.2.0rc |
|
#
d452e98b |
| 05-Jun-2015 |
Sepherosa Ziehau <sephe@dragonflybsd.org> |
cpu_topo: Add get_cpu_node_by_chipid()
This function retrieves the cpu_node corresponding to the passed chip ID.
|
#
52a4925c |
| 12-May-2015 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Improve cpu topology text output
* Fix a bug in the cpu range display so that, e.g., cpu 3 through cpu 3 displays as cpu 3 instead of cpu 3-3.
* Makes sysctl hw.cpu_topology more readable.
|
Revision tags: v4.0.5 |
|
#
f3f3eadb |
| 12-Mar-2015 |
Sascha Wildner <saw@online.de> |
kernel: Move semicolon from the definition of SYSINIT() to its invocations.
This affected around 70 of our (more or less) 270 SYSINIT() calls.
style(9) advocates the terminating semicolon to be supplied by the invocation too, because it can make life easier for editors and other source code parsing programs.
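The convention is easy to see with a mock macro; `MOCK_SYSINIT` below only imitates the shape of the real SYSINIT() machinery:

```c
#include <assert.h>
#include <string.h>

struct mock_sysinit {
    const char *name;
    void (*func)(void);
};

/* The macro body supplies no trailing semicolon ... */
#define MOCK_SYSINIT(ident, fn) \
    struct mock_sysinit ident##_sysinit = { #ident, fn }

void init_foo(void) {}

/* ... so each invocation terminates itself, as style(9) advocates. */
MOCK_SYSINIT(foo, init_foo);
```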
|
Revision tags: v4.0.4, v4.0.3, v4.0.2, v4.0.1, v4.0.0, v4.0.0rc3, v4.0.0rc2, v4.0.0rc, v4.1.0, v3.8.2 |
|
#
399efd7f |
| 13-Jul-2014 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Reduce console spam in verbose mode when printing cpu sets
* Add helper function kprint_cpuset().
* Print cpu ranges when printing out cpu sets.
* Print cpu ranges when generating topology output for sysctl.
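A range-compressing formatter in the spirit of kprint_cpuset() might look like this; the plain uint64_t mask and the function name are simplifications, not the kernel's cpumask_t interface:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Collapse a cpu mask into comma-separated "a-b" ranges; a run of one
 * cpu prints as just "a", not "a-a".
 */
void
format_cpuset(uint64_t mask, char *buf, size_t buflen)
{
    size_t off = 0;
    int cpu = 0;
    int first = 1;

    buf[0] = '\0';
    while (cpu < 64) {
        int start, end;

        if (!(mask & ((uint64_t)1 << cpu))) {
            cpu++;
            continue;
        }
        start = cpu;
        while (cpu < 64 && (mask & ((uint64_t)1 << cpu)))
            cpu++;
        end = cpu - 1;
        if (start == end)
            off += snprintf(buf + off, buflen - off, "%s%d",
                first ? "" : ",", start);
        else
            off += snprintf(buf + off, buflen - off, "%s%d-%d",
                first ? "" : ",", start, end);
        first = 0;
    }
}
```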
|
#
c07315c4 |
| 04-Jul-2014 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Refactor cpumask_t to extend cpus past 64, part 1/2
* 64-bit systems only. 32-bit builds use the macros but cannot be expanded past 32 cpus.
* Change cpumask_t from __uint64_t to a structure. This commit implements one 64-bit sub-element (the next one will implement four for 256 cpus).
* Create a CPUMASK_*() macro API for non-atomic and atomic cpumask manipulation. These macros generally take lvalues as arguments, allowing for a fairly optimal implementation.
* Change all C code operating on cpumask's to use the newly created CPUMASK_*() macro API.
* Compile-test 32 and 64-bit. Run-test 64-bit.
* Adjust sbin/usched, usr.sbin/powerd. usched currently needs more work.
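The shape of the change can be illustrated with a one-word version of the struct and a few of the macros; the definitions below are simplified stand-ins for the kernel's CPUMASK_*() API, not its actual implementation:

```c
#include <assert.h>
#include <stdint.h>

/*
 * cpumask_t becomes a struct wrapping an array of 64-bit words (one
 * word here for 64 cpus; the follow-up commit grows it to four words
 * for 256 cpus). The macros take lvalues, allowing a fairly optimal
 * implementation.
 */
#define CPUMASK_ELEMENTS 1

typedef struct {
    uint64_t ary[CPUMASK_ELEMENTS];
} cpumask_t;

#define CPUMASK_ASSZERO(mask)    ((mask).ary[0] = 0)
#define CPUMASK_ORBIT(mask, i)   ((mask).ary[0] |= (uint64_t)1 << (i))
#define CPUMASK_NANDBIT(mask, i) ((mask).ary[0] &= ~((uint64_t)1 << (i)))
#define CPUMASK_TESTBIT(mask, i) (((mask).ary[0] >> (i)) & 1)
```

Because the mask is no longer a plain integer, code like `mask |= 1 << cpu` stops compiling, which is exactly what forces every call site through the macro API.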
|
Revision tags: v3.8.1, v3.6.3, v3.8.0, v3.8.0rc2, v3.9.0, v3.8.0rc, v3.6.2 |
|
#
9fc6d334 |
| 28-Mar-2014 |
Sascha Wildner <saw@online.de> |
kernel/cpu_topology: Fix a casting issue in the topology tree printing.
compute_unit_id is uint8_t so it's actually 0xff we want to check for. Before this commit, the "... != -1" checks were always true.
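The pitfall comes from C's integer promotions: a uint8_t compared against -1 is first promoted to int, so the comparison sees 255 vs -1 and "!= -1" is always true. A minimal demonstration (function names are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Buggy check: compute_unit_id promotes to int 0..255, which can never
 * equal the int -1, so this returns 1 for every input. */
int
looks_set_naive(uint8_t compute_unit_id)
{
    return compute_unit_id != -1;
}

/* Fixed check: compare against 0xff, the value actually stored when
 * the field is "unset". */
int
is_set_fixed(uint8_t compute_unit_id)
{
    return compute_unit_id != 0xff;
}
```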
|
Revision tags: v3.6.1 |
|
#
493a3e86 |
| 13-Feb-2014 |
Sascha Wildner <saw@online.de> |
kernel: Fix topology fallout for vkernel and i386.
* No need to fix AMD topology in vkernel.
* Expose compute_unit_id in i386 too. It stays -1 always.
Reported-by: tuxillo
|
#
0e9325d3 |
| 12-Feb-2014 |
Mihai Carabas <mihai.carabas@gmail.com> |
CPU Topology: add support for Compute Units on AMD processors
Detect shared compute units between cores on AMD processors and downgrade them to THREAD_LEVEL in the logical CPU topology used by the scheduler.
|
Revision tags: v3.6.0, v3.7.1, v3.6.0rc, v3.7.0, v3.4.3, v3.4.2, v3.4.0, v3.4.1, v3.4.0rc, v3.5.0, v3.2.2 |
|
#
1918fc5c |
| 24-Oct-2012 |
Sascha Wildner <saw@online.de> |
kernel: Make SMP support default (and non-optional).
The 'SMP' kernel option gets removed with this commit, so it has to be removed from everybody's configs.
Reviewed-by: sjg Approved-by: many
|
Revision tags: v3.2.1, v3.2.0, v3.3.0 |
|
#
e28d8b15 |
| 18-Sep-2012 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - add usched_dfly algorithm, set as default for now
* Fork usched_bsd4 for continued development.
* Rewrite the bsd4 scheduler to use per-cpu spinlocks and queues.
* Reformulate the cpu selection algorithm using the topology info. We now do a top-down iteration instead of a bottom-up iteration to calculate the best cpu node to schedule something to.
Implements both thread push to remote queue and pull from remote queue.
* Track a load factor on a per-cpu basis.
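The top-down iteration can be sketched as a descent through the topology tree toward the least-loaded leaf; the structures and helper below are illustrative, not the actual usched_dfly code:

```c
#include <assert.h>

/* Simplified topology node: interior nodes aggregate the load of all
 * cpus below them; leaves carry a cpuid. */
struct cpu_node {
    int load;                   /* aggregate load of the subtree */
    int cpuid;                  /* valid at leaves only */
    int nchildren;
    struct cpu_node *children;
};

/*
 * Top-down selection: at each level descend into the least-loaded
 * child until a leaf cpu is reached.
 */
int
pick_cpu_topdown(const struct cpu_node *node)
{
    while (node->nchildren > 0) {
        const struct cpu_node *best = &node->children[0];
        int i;

        for (i = 1; i < node->nchildren; ++i) {
            if (node->children[i].load < best->load)
                best = &node->children[i];
        }
        node = best;
    }
    return node->cpuid;
}
```

Starting from the root rather than from an individual cpu is what lets the per-cpu load factors steer both the push-to-remote and pull-from-remote cases toward the least-loaded part of the machine.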
|
#
f77c018a |
| 22-Aug-2012 |
Mihai Carabas <mihai.carabas@gmail.com> |
CPU topology support
* Part of "Add SMT/HT awareness to DragonFly BSD scheduler" GSoC project.
* Details at: http://leaf.dragonflybsd.org/mailarchive/kernel/2012-08/msg00009.html
Mentored-by: Alex Hornung (alexh@) Sponsored-by: Google Summer of Code 2012
|