# 2b3f93ea | 13-Oct-2023 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add per-process capability-based restrictions
* This new system allows userland to set capability restrictions which turn off numerous kernel features and root accesses. These restrictions are inherited by sub-processes recursively. Once set, restrictions cannot be removed.
Basic restrictions that mimic an unadorned jail can be enabled without creating a jail, but generally speaking real security also requires creating a chrooted filesystem topology, and a jail is still needed to really segregate processes from each other. If you do so, however, you can (for example) disable mount/umount and most global root-only features.
* Add new system calls and a manual page for syscap_get(2) and syscap_set(2)
* Add sys/caps.h
* Add the "setcaps" userland utility and manual page.
* Remove priv.9 and the priv_check infrastructure, replacing it with a newly designed caps infrastructure.
* The intention is to add path restriction lists and similar features to improve jail-less security in the near future, and to optimize the priv_check code.
Revision tags: v6.4.0, v6.4.0rc1, v6.5.0, v6.2.2, v6.2.1, v6.2.0, v6.3.0, v6.0.1, v6.0.0, v6.0.0rc1, v6.1.0
# 941642e8 | 17-Feb-2021 | Aaron LI <aly@aaronly.me>

fexecve(2): Return ENOENT if exec a script opened with O_CLOEXEC
If a script (i.e., interpreter file) is opened with the O_CLOEXEC flag, it would be closed by the time the interpreter is executed, and then the execution would fail. So just return ENOENT from fexecve(2). This behavior aligns with Linux's.
See Linux's fexecve(2) man page.
See also: https://bugzilla.kernel.org/show_bug.cgi?id=74481
Thanks to dillon for implementing the holdvnode2() function to obtain the file flags together with the fp from the fd.
# 337acc44 | 17-Feb-2021 | Aaron LI <aly@aaronly.me>

Implement the fexecve(2) system call
The fexecve(2) function is equivalent to execve(2), except that the file to be executed is determined by the file descriptor fd instead of a pathname.
The purpose of fexecve(2) is to enable executing a file which has been verified to be the intended file. It is possible to actively check the file by reading from the file descriptor and be sure that the file is not exchanged for another between the reading and the execution.
See https://pubs.opengroup.org/onlinepubs/9699919799/functions/fexecve.html
This work is partially based on swildner's patch and FreeBSD's implementation (revisions 177787, 182191, 238220).
XXX: We're missing O_EXEC support in open(2).
Reviewed-by: dillon
# 274a4bc4 | 17-Feb-2021 | Aaron LI <aly@aaronly.me>

kern: Clean error paths in kern_execve()
Merge the original 'exec_fail_dealloc' and 'exec_fail' to a single 'failed' error path. In addition, introduce the 'nch' variable to clean some expressions a bit. These will help the following fexecve() implementation.
While there, adjust the styles a bit.
Reviewed-by: dillon
# c09ae1e6 | 17-Feb-2021 | Aaron LI <aly@aaronly.me>

kern: Return error from exec_copyin_strings() if fname copy failed
# 08512cb0 | 17-Feb-2021 | Aaron LI <aly@aaronly.me>

kern: Staticize several functions and variables in kern_exec.c
Staticize exec_copyin_args(), exec_free_args() and print_execve_args() functions, and move the related prototypes and exec_path_segflg enum from <sys/imgact.h> here.
In addition, staticize the 'debug_execve_args' variable.
# acdf1ee6 | 15-Nov-2020 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_STATUS
* Add PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_STATUS to procctl(2).
This follows the Linux and FreeBSD semantics; however, it should be noted that since the child of a fork() clears the setting, these semantics have a fork/exit race between an exiting parent and a child which has not yet set up its death wish.
* Also fix a number of signal ranging checks.
Requested-by: zrj
# 5ebb17ad | 04-Nov-2020 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Change pager interface to pass page index 1/2
* Change the *getpage() API to include the page index as an argument. This allows us to avoid passing any vm_page_t for OBJT_MGTDEVICE VM pages.
By removing this requirement, the VM system no longer has to pre-allocate a placemarker page for DRM faults and the DRM system can directly install the page in the pmap without tracking it via a vm_page_t.
Revision tags: v5.8.3, v5.8.2
# de9bb133 | 08-Aug-2020 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor GETATTR_QUICK() -> GETATTR_LITE()
* Refactor GETATTR_QUICK() into GETATTR_LITE() and use struct vattr_lite instead of struct vattr. The original GETATTR_QUICK() just used a struct vattr.
This change ensures that users of this new VOP do not attempt to access attr fields that are not populated.
Suggested-by: mjg
# 80d831e1 | 25-Jul-2020 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor in-kernel system call API to remove bcopy()
* Change the in-kernel system call prototype to take the system call arguments as a separate pointer, and make the contents read-only.
Old: int sy_call_t(void *);
New: int sy_call_t(struct sysmsg *sysmsg, const void *);
* System calls with 6 arguments or less no longer need to copy the arguments from the trapframe to a holding structure. Instead, we simply point into the trapframe.
The L1 cache footprint will be a bit smaller, but in simple tests the results are not noticeably faster... maybe 1ns or so (roughly 1%).
Revision tags: v5.8.1, v5.8.0
# 2ff21866 | 26-Feb-2020 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Simple code path optimizations
* Add __read_mostly and __read_frequently to numerous variables as appropriate to reduce unnecessary cache line ping-ponging.
* Adjust conditionals in the syscall code with __predict_true/false to clean up the execution path.
Revision tags: v5.9.0, v5.8.0rc1, v5.6.3
# 64b5a8a5 | 12-Nov-2019 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - sigblockall()/sigunblockall() support (per thread shared page)
* Implement /dev/lpmap, a per-thread RW shared page between userland and the kernel. Each thread in the process will receive a unique shared page for communication with the kernel when memory-mapping /dev/lpmap and can access various variables via this map.
* The current thread's TID is retained for both fork() and vfork(). Previously it was only retained for vfork(). This avoids userland code confusion for any bits and pieces that are indexed based on the TID.
* Implement support for a per-thread block-all-signals feature that does not require any system calls (see next commit to libc). The functions will be called sigblockall() and sigunblockall().
The lpmap->blockallsigs variable prevents normal signals from being dispatched. They will still be queued to the LWP as per normal. The behavior is not quite that of a signal mask when dealing with global signals.
The low 31 bits represents a recursion counter, allowing recursive use of the functions. The high bit (bit 31) is set by the kernel if a signal was prevented from being dispatched. When userland decrements the counter to 0 (the low 31 bits), it can check and clear bit 31 and if found to be set userland can then make a dummy 'real' system call to cause pending signals to be delivered.
Synchronous TRAPs (e.g. kernel-generated SIGFPE, SIGSEGV, etc) are not affected by this feature and will still be dispatched synchronously.
* PThreads is expected to unmap the mapped page upon thread exit. The kernel will force-unmap the page upon thread exit if pthreads does not.
XXX needs work - currently if the page has not been faulted in, the kernel has no visibility into the mapping and will not unmap it, but neither will it get confused if the address is accessed. To be fixed soon, because if we don't, programs using LWP primitives instead of pthreads might not realize that libc has mapped the page.
* The TID is reset to 1 on a successful exec*()
* On [v]fork(), if lpmap exists for the current thread, the kernel will copy the lpmap->blockallsigs value to the lpmap for the new thread in the new process. This way sigblock*() state is retained across the [v]fork().
This feature not only reduces code confusion in userland, it also allows [v]fork() to be implemented by the userland program in a way that ensures no signal races in either the parent or the new child process until it is ready for them.
* The implementation leverages our vm_map_backing extents by having the per-thread memory mappings indexed within the lwp. This allows the lwp to remove the mappings when it exits (since not doing so would result in a wild pmap entry and kernel memory disclosure).
* The implementation currently delays instantiation of the mapped page(s) and some side structures until the first fault.
XXX this will have to be changed.
Revision tags: v5.6.2, v5.6.1, v5.6.0, v5.6.0rc1, v5.7.0
# 2a7bd4d8 | 18-May-2019 | Sascha Wildner <saw@online.de>

kernel: Don't include <sys/user.h> in kernel code.
There is really no point in doing that because its main purpose is to expose kernel structures to userland. In the majority of cases it wasn't needed at all, and the rest required only a couple of other includes.
Revision tags: v5.4.3
# d6924570 | 03-May-2019 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix serious bug in MAP_STACK, deprecate auto-grow semantics
* When MAP_STACK is used without MAP_TRYFIXED, the address the kernel determines for the stack was *NOT* being returned to userland. Instead, userland always got only the hint address.
* This fixes ruby MAP_STACK use cases and possibly more.
* Deprecate MAP_STACK auto-grow semantics. All user mmap() calls with MAP_STACK are now converted to normal MAP_ANON mmaps. The kernel will continue to create an auto-grow stack segment for the primary user stack in exec(), allowing older pthread libraries to continue working, but this feature is deprecated and will be removed in a future release.
Revision tags: v5.4.2
# 4b566556 | 17-Feb-2019 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Implement sbrk(), change low-address mmap hinting
* Change mmap()'s internal lower address bound from dmax (32GB) to RLIMIT_DATA's current value. This allows the rlimit to be e.g. reduced and for hinted mmap()s to then map space below the 4GB mark. The default data rlimit is 32GB.
This change is needed to support several languages, at least lua and probably another one or two, who use mmap hinting under the assumption that it can map space below the 4GB address mark. The data limit must be lowered with a limit command too, which can be scripted or patched for such programs.
* Implement the sbrk() system call. This system call was already present but just returned EOPNOTSUPP, and libc previously had its own shim for sbrk() which used the ancient break() system call. (Note that the prior implementation did not return ENOSYS or raise a signal.)
sbrk() in the kernel is thread-safe for positive increments and is also byte-granular (the old libc sbrk() was only page-granular).
sbrk() in the kernel does not implement negative increments and will return EOPNOTSUPP if asked to. Negative increments were historically designed to be able to 'free' memory allocated with sbrk(), but it is not possible to implement the case in a modern VM system due to the mmap changes above.
(1) Because the new mmap hinting changes make it possible for normal mmap()s to have mapped space prior to the RLIMIT_DATA resource limit being increased, causing intermingling of sbrk() and user-mmap()d regions; and (2) because negative increments are not even remotely thread-safe.
* Note the previous commit refactored libc to use the kernel sbrk() and fall-back to its previous emulation code on failure, so libc supports both new and old kernels.
* Remove the brk() shim from libc. brk() is not implemented by the kernel. Symbol removed. Requires testing against ports, so we may have to add it back in, but basically there is no way to implement brk() properly with the mmap() hinting fix.
* Adjust manual pages.
Revision tags: v5.4.1, v5.4.0, v5.5.0, v5.4.0rc1, v5.2.2, v5.2.1, v5.2.0, v5.3.0, v5.2.0rc, v5.0.2, v5.0.1
# 22b7a3db | 17-Oct-2017 | Sascha Wildner <saw@online.de>

kernel: Remove <sys/sysref{,2}.h> inclusion from files that don't need it.
Some of the headers are public in one way or another so bump __DragonFly_version for safety.
While here, add a missing <sys/objcache.h> include to kern_exec.c which was previously relying on it coming in via <sys/sysref.h> (which was included by <sys/vm_map.h> prior to this commit).
Revision tags: v5.0.0, v5.0.0rc2, v5.1.0, v5.0.0rc1
# 11ba7f73 | 10-Aug-2017 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Lower VM_MAX_USER_ADDRESS to finalize work-around for Ryzen bug
* Reduce VM_MAX_USER_ADDRESS by 2MB, effectively making the top 2MB of the user address space unmappable. The user stack now starts 2MB down from where it did before. Theoretically we only need to reduce the top of the user address space by 4KB, but doing it by 2MB may be more useful for future page table optimizations.
* As per AMD, Ryzen has an issue when the instruction pre-fetcher crosses from canonical to non-canonical address space. This can only occur at the top of the user stack.
In DragonFlyBSD, the signal trampoline resides at the top of the user stack, and an IRETQ into it can cause a Ryzen box to lock up and destabilize due to this action. The bug case was, basically, two cpu threads on the same core, one in a cpu-bound loop of some sort while the other takes a normal UNIX signal (causing the IRETQ into the signal trampoline). The IRETQ microcode freezes until the cpu-bound loop terminates, preventing the cpu thread from being able to take any interrupt or IPI whatsoever for the duration, and the cpu may destabilize afterwards as well.
* The pre-fetcher is somewhat heuristical, so just moving the trampoline down is no guarantee if the top 4KB of the user stack is mapped or mappable. It is better to make the boundary unmappable by userland.
* Bug first tracked down by myself in early 2017. AMD validated the bug and determined that unmapping the boundary page completely solves the issue.
* Also retain the code which places the signal trampoline in its own page so we can maintain separate protection settings for the code, and make it read-only (R+X).
# da1e1cb6 | 06-Aug-2017 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Move sigtramp even lower
* Attempt to work around a Ryzen cpu bug by moving sigtramp even lower than we have already.
Revision tags: v4.8.1
# 3e925ec2 | 03-Apr-2017 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Implement NX
* Implement the NX (no-execute) pmap bit.
* Shift sigtramp down to a page-bound and protect it prot|VM_PROT_EXECUTE.
* Map the rest of the user stack VM_PROT_READ|VM_PROT_WRITE without VM_PROT_EXECUTE.
# e6141a7f | 29-Mar-2017 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add KERN_PROC_SIGTRAMP
* Add a sysctl to retrieve the sigtramp address range for gdb.
Reported-by: marino
Revision tags: v4.8.0, v4.6.2, v4.9.0, v4.8.0rc
# 5947157e | 26-Jan-2017 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor suword, fuword, etc. change vmm_guest_sync_addr()
* Rename the entire family of functions to reduce confusion.
* Change how vmm_guest_sync_addr() works. Instead of loading one value into a target location we exchange the two target locations, with the first address using an atomic op. This will allow the vkernel to drop privs and query pte state atomically.
# 5a4bb8ec | 09-Jan-2017 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Incidental MPLOCK removal (non-performance)
* proc filterops.
* kernel linkerops and kld code.
* Warn if a non-MPSAFE interrupt is installed.
* Use a private token in the disk messaging core (subr_disk) instead of the mp token.
* Use a private token for sysv shm administrative calls.
* Cleanup.
Revision tags: v4.6.1, v4.6.0
# 2eca01a4 | 28-Jul-2016 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix getpid() issue in vfork() when threaded
* upmap->invfork was a 0 or 1, but in a threaded program it is possible for multiple threads to be in vfork() at the same time. Change invfork to a count.
* Fixes improper getpid() return when concurrent vfork()s are occurring in a threaded program.
Revision tags: v4.6.0rc2, v4.6.0rc, v4.7.0, v4.4.3, v4.4.2, v4.4.1, v4.4.0, v4.5.0, v4.4.0rc, v4.2.4, v4.3.1, v4.2.3, v4.2.1, v4.2.0, v4.0.6, v4.3.0, v4.2.0rc, v4.0.5, v4.0.4
# 59b728a7 | 18-Feb-2015 | Sascha Wildner <saw@online.de>

sys/kern: Adjust some function declaration vs. definition mismatches.
All these functions are declared static already, so no functional change.
Revision tags: v4.0.3, v4.0.2, v4.0.1, v4.0.0, v4.0.0rc3, v4.0.0rc2
# 95d468db | 05-Nov-2014 | Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix exec optimization race
* Fix an improper vm_page_unhold() in exec_map_page() which under heavy memory loads can cause a later assertion on m->hold_count == 0.
* Triggered every few days by bulk builds on pkgbox64.