History log of /dragonfly/sys/kern/kern_exec.c (Results 1 – 25 of 142)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# 2b3f93ea 13-Oct-2023 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add per-process capability-based restrictions

* This new system allows userland to set capability restrictions which
turns off numerous kernel features and root accesses. These restricti

kernel - Add per-process capability-based restrictions

* This new system allows userland to set capability restrictions which
turns off numerous kernel features and root accesses. These restrictions
are inherited by sub-processes recursively. Once set, restrictions cannot
be removed.

Basic restrictions that mimic an unadorned jail can be enabled without
creating a jail, but generally speaking real security also requires
creating a chrooted filesystem topology, and a jail is still needed
to really segregate processes from each other. If you do so, however,
you can (for example) disable mount/umount and most global root-only
features.

* Add new system calls and a manual page for syscap_get(2) and syscap_set(2)

* Add sys/caps.h

* Add the "setcaps" userland utility and manual page.

* Remove priv.9 and the priv_check infrastructure, replacing it with
a newly designed caps infrastructure.

* The intention is to add path restriction lists and similar features to
improve jailess security in the near future, and to optimize the
priv_check code.

show more ...


Revision tags: v6.4.0, v6.4.0rc1, v6.5.0, v6.2.2, v6.2.1, v6.2.0, v6.3.0, v6.0.1, v6.0.0, v6.0.0rc1, v6.1.0
# 941642e8 17-Feb-2021 Aaron LI <aly@aaronly.me>

fexecve(2): Return ENOENT if exec a script opened with O_CLOEXEC

If a script (i.e., interpreter file) is opened with the O_CLOEXEC flag,
it would be closed by the time the interpreter is executed, a

fexecve(2): Return ENOENT if exec a script opened with O_CLOEXEC

If a script (i.e., interpreter file) is opened with the O_CLOEXEC flag,
it would be closed by the time the interpreter is executed, and then the
executation would fail. So just return ENOENT from fexecve(2). This
behavior aligns with Linux's.

See Linux's fexecve(2) man page.

See also: https://bugzilla.kernel.org/show_bug.cgi?id=74481

Thank dillon for implementing the holdvnode2() function to obtain the
fileflags together with the fp from fd.

show more ...


# 337acc44 17-Feb-2021 Aaron LI <aly@aaronly.me>

Implement the fexecve(2) system call

The fexecve(2) function is equivalent to execve(2), except that the file
to be executed is determined by the file descriptor fd instead of a
pathname.

The purpo

Implement the fexecve(2) system call

The fexecve(2) function is equivalent to execve(2), except that the file
to be executed is determined by the file descriptor fd instead of a
pathname.

The purpose of fexecve(2) is to enable executing a file which has been
verified to be the intended file. It is possible to actively check the
file by reading from the file descriptor and be sure that the file is
not exchanged for another between the reading and the execution.

See https://pubs.opengroup.org/onlinepubs/9699919799/functions/fexecve.html

This work is partially based on swildner's patch and FreeBSD's
implementation (revisions 177787, 182191, 238220).

XXX: We're missing O_EXEC support in open(2).

Reviewed-by: dillon

show more ...


# 274a4bc4 17-Feb-2021 Aaron LI <aly@aaronly.me>

kern: Clean error paths in kern_execve()

Merge the original 'exec_fail_dealloc' and 'exec_fail' to a single
'failed' error path. In addition, introduce the 'nch' variable to
clean some expressions

kern: Clean error paths in kern_execve()

Merge the original 'exec_fail_dealloc' and 'exec_fail' to a single
'failed' error path. In addition, introduce the 'nch' variable to
clean some expressions a bit. These will help the following fexecve()
implementation.

While there, adjust the styles a bit.

Reviewed-by: dillon

show more ...


# c09ae1e6 17-Feb-2021 Aaron LI <aly@aaronly.me>

kern: Return error from exec_copyin_strings() if fname copy failed


# 08512cb0 17-Feb-2021 Aaron LI <aly@aaronly.me>

kern: Staticize several functions and variables in kern_exec.c

Staticize exec_copyin_args(), exec_free_args() and print_execve_args()
functions, and move the related prototypes and exec_path_segflg

kern: Staticize several functions and variables in kern_exec.c

Staticize exec_copyin_args(), exec_free_args() and print_execve_args()
functions, and move the related prototypes and exec_path_segflg enum
from <sys/imgact.h> here.

In addition, staticize the 'debug_execve_args' variable.

show more ...


# acdf1ee6 15-Nov-2020 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_STATUS

* Add PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_STATUS to procctl(2).

This follows the linux and freebsd semantics, however it should be note

kernel - Add PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_STATUS

* Add PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_STATUS to procctl(2).

This follows the linux and freebsd semantics, however it should be noted
that since the child of a fork() clears the setting, these semantics have
a fork/exit race between an exiting parent and a child which has not
yet setup its death wish.

* Also fix a number of signal ranging checks.

Requested-by: zrj

show more ...


# 5ebb17ad 04-Nov-2020 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Change pager interface to pass page index 1/2

* Change the *getpage() API to include the page index as
an argument. This allows us to avoid passing any vm_page_t
for OBJT_MGTDEVICE VM

kernel - Change pager interface to pass page index 1/2

* Change the *getpage() API to include the page index as
an argument. This allows us to avoid passing any vm_page_t
for OBJT_MGTDEVICE VM pages.

By removing this requirement, the VM system no longer has to
pre-allocate a placemarker page for DRM faults and the DRM
system can directly install the page in the pmap without
tracking it via a vm_page_t.

show more ...


Revision tags: v5.8.3, v5.8.2
# de9bb133 08-Aug-2020 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor GETATTR_QUICK() -> GETATTR_LITE()

* Refactor GETATTR_QUICK() into GETATTR_LITE() and use struct
vattr_lite instead of struct vattr. The original GETATTR_QUICK()
just used a st

kernel - Refactor GETATTR_QUICK() -> GETATTR_LITE()

* Refactor GETATTR_QUICK() into GETATTR_LITE() and use struct
vattr_lite instead of struct vattr. The original GETATTR_QUICK()
just used a struct vattr.

This change ensures that users of this new VOP do not attempt to
access attr fields that are not populated.

Suggested-by: mjg

show more ...


# 80d831e1 25-Jul-2020 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor in-kernel system call API to remove bcopy()

* Change the in-kernel system call prototype to take the
system call arguments as a separate pointer, and make the
contents read-onl

kernel - Refactor in-kernel system call API to remove bcopy()

* Change the in-kernel system call prototype to take the
system call arguments as a separate pointer, and make the
contents read-only.

int sy_call_t (void *);
int sy_call_t (struct sysmsg *sysmsg, const void *);

* System calls with 6 arguments or less no longer need to copy
the arguments from the trapframe to a holding structure. Instead,
we simply point into the trapframe.

The L1 cache footprint will be a bit smaller, but in simple tests
the results are not noticably faster... maybe 1ns or so
(roughly 1%).

show more ...


Revision tags: v5.8.1, v5.8.0
# 2ff21866 26-Feb-2020 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Simple code path optimizations

* Add __read_mostly and __read_frequently to numerous variables as
appropriate to reduce unnecessary cache line ping-ponging.

* Adjust conditionals in the

kernel - Simple code path optimizations

* Add __read_mostly and __read_frequently to numerous variables as
appropriate to reduce unnecessary cache line ping-ponging.

* Adjust conditionals in the syscall code with __predict_true/false
to clean up the execution path.

show more ...


Revision tags: v5.9.0, v5.8.0rc1, v5.6.3
# 64b5a8a5 12-Nov-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - sigblockall()/sigunblockall() support (per thread shared page)

* Implement /dev/lpmap, a per-thread RW shared page between userland
and the kernel. Each thread in the process will receiv

kernel - sigblockall()/sigunblockall() support (per thread shared page)

* Implement /dev/lpmap, a per-thread RW shared page between userland
and the kernel. Each thread in the process will receive a unique
shared page for communication with the kernel when memory-mapping
/dev/lpmap and can access varous variables via this map.

* The current thread's TID is retained for both fork() and vfork().
Previously it was only retained for vfork(). This avoids userland
code confusion for any bits and pieces that are indexed based on the
TID.

* Implement support for a per-thread block-all-signals feature that
does not require any system calls (see next commit to libc). The
functions will be called sigblockall() and sigunblockall().

The lpmap->blockallsigs variable prevents normal signals from being
dispatched. They will still be queued to the LWP as per normal.
The behavior is not quite that of a signal mask when dealing with
global signals.

The low 31 bits represents a recursion counter, allowing recursive
use of the functions. The high bit (bit 31) is set by the kernel
if a signal was prevented from being dispatched. When userland decrements
the counter to 0 (the low 31 bits), it can check and clear bit 31 and
if found to be set userland can then make a dummy 'real' system call
to cause pending signals to be delivered.

Synchronous TRAPs (e.g. kernel-generated SIGFPE, SIGSEGV, etc) are not
affected by this feature and will still be dispatched synchronously.

* PThreads is expected to unmap the mapped page upon thread exit.
The kernel will force-unmap the page upon thread exit if pthreads
does not.

XXX needs work - currently if the page has not been faulted in
the kernel has no visbility into the mapping and will not unmap it,
but neither will it get confused if the address is accessed. To
be fixed soon. Because if we don't, programs using LWP primitives
instead of pthreads might not realize that libc has mapped the page.

* The TID is reset to 1 on a successful exec*()

* On [v]fork(), if lpmap exists for the current thread, the kernel will
copy the lpmap->blockallsigs value to the lpmap for the new thread
in the new process. This way sigblock*() state is retained across
the [v]fork().

This feature not only reduces code confusion in userland, it also
allows [v]fork() to be implemented by the userland program in a way
that ensures no signal races in either the parent or the new child
process until it is ready for them.

* The implementation leverages our vm_map_backing extents by having
the per-thread memory mappings indexed within the lwp. This allows
the lwp to remove the mappings when it exits (since not doing so
would result in a wild pmap entry and kernel memory disclosure).

* The implementation currently delays instantiation of the mapped
page(s) and some side structures until the first fault.

XXX this will have to be changed.

show more ...


Revision tags: v5.6.2, v5.6.1, v5.6.0, v5.6.0rc1, v5.7.0
# 2a7bd4d8 18-May-2019 Sascha Wildner <saw@online.de>

kernel: Don't include <sys/user.h> in kernel code.

There is really no point in doing that because its main purpose is to
expose kernel structures to userland. The majority of cases wasn't
needed at

kernel: Don't include <sys/user.h> in kernel code.

There is really no point in doing that because its main purpose is to
expose kernel structures to userland. The majority of cases wasn't
needed at all and the rest required only a couple of other includes.

show more ...


Revision tags: v5.4.3
# d6924570 03-May-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix serious bug in MAP_STACK, deprecate auto-grow semantics

* When MAP_STACK is used without MAP_TRYFIXED, the address the kernel
determines for the stack was *NOT* being returned to user

kernel - Fix serious bug in MAP_STACK, deprecate auto-grow semantics

* When MAP_STACK is used without MAP_TRYFIXED, the address the kernel
determines for the stack was *NOT* being returned to userland. Instead,
userland always got only the hint address.

* This fixes ruby MAP_STACK use cases and possibly more.

* Deprecate MAP_STACK auto-grow semantics. All user mmap() calls with
MAP_STACK are now converted to normal MAP_ANON mmaps. The kernel will
continue to create an auto-grow stack segment for the primary user stack
in exec(), allowing older pthread libraries to continue working, but this
feature is deprecated and will be removed in a future release.

show more ...


Revision tags: v5.4.2
# 4b566556 17-Feb-2019 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Implement sbrk(), change low-address mmap hinting

* Change mmap()'s internal lower address bound from dmax (32GB)
to RLIMIT_DATA's current value. This allows the rlimit to be
e.g. redu

kernel - Implement sbrk(), change low-address mmap hinting

* Change mmap()'s internal lower address bound from dmax (32GB)
to RLIMIT_DATA's current value. This allows the rlimit to be
e.g. reduced and for hinted mmap()s to then map space below
the 4GB mark. The default data rlimit is 32GB.

This change is needed to support several languages, at least
lua and probably another one or two, who use mmap hinting
under the assumption that it can map space below the 4GB
address mark. The data limit must be lowered with a limit command
too, which can be scripted or patched for such programs.

* Implement the sbrk() system call. This system call was already
present but just returned EOPNOTSUPP and libc previously had its
own shim for sbrk() which used the ancient break() system call.
(Note that the prior implementation did not ENOSYS or signal).

sbrk() in the kernel is thread-safe for positive increments and
is also byte-granular (the old libc sbrk() was only page-granular).

sbrk() in the kernel does not implement negative increments and
will return EOPNOTSUPP if asked to. Negative increments were
historically designed to be able to 'free' memory allocated with
sbrk(), but it is not possible to implement the case in a modern
VM system due to the mmap changes above.

(1) Because the new mmap hinting changes make it possible for
normal mmap()s to have mapped space prior to the RLIMIT_DATA resource
limit being increased, causing intermingling of sbrk() and user mmap()d
regions. (2) because negative increments are not even remotely
thread-safe.

* Note the previous commit refactored libc to use the kernel sbrk()
and fall-back to its previous emulation code on failure, so libc
supports both new and old kernels.

* Remove the brk() shim from libc. brk() is not implemented by the
kernel. Symbol removed. Requires testing against ports so we may
have to add it back in but basically there is no way to implement
brk() properly with the mmap() hinting fix

* Adjust manual pages.

show more ...


Revision tags: v5.4.1, v5.4.0, v5.5.0, v5.4.0rc1, v5.2.2, v5.2.1, v5.2.0, v5.3.0, v5.2.0rc, v5.0.2, v5.0.1
# 22b7a3db 17-Oct-2017 Sascha Wildner <saw@online.de>

kernel: Remove <sys/sysref{,2}.h> inclusion from files that don't need it.

Some of the headers are public in one way or another so bump
__DragonFly_version for safety.

While here, add a missing <sy

kernel: Remove <sys/sysref{,2}.h> inclusion from files that don't need it.

Some of the headers are public in one way or another so bump
__DragonFly_version for safety.

While here, add a missing <sys/objcache.h> include to kern_exec.c which
was previously relying on it coming in via <sys/sysref.h> (which was
included by <sys/vm_map.h> prior to this commit).

show more ...


Revision tags: v5.0.0, v5.0.0rc2, v5.1.0, v5.0.0rc1
# 11ba7f73 10-Aug-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Lower VM_MAX_USER_ADDRESS to finalize work-around for Ryzen bug

* Reduce VM_MAX_USER_ADDRESS by 2MB, effectively making the top 2MB of the
user address space unmappable. The user stack n

kernel - Lower VM_MAX_USER_ADDRESS to finalize work-around for Ryzen bug

* Reduce VM_MAX_USER_ADDRESS by 2MB, effectively making the top 2MB of the
user address space unmappable. The user stack now starts 2MB down from
where it did before. Theoretically we only need to reduce the top of
the user address space by 4KB, but doing it by 2MB may be more useful for
future page table optimizations.

* As per AMD, Ryzen has an issue when the instruction pre-fetcher crosses
from canonical to non-canonical address space. This can only occur at
the top of the user stack.

In DragonFlyBSD, the signal trampoline resides at the top of the user stack
and an IRETQ into it can cause a Ryzen box to lockup and destabilize due
to this action. The bug case was, basically two cpu threads on the same
core, one in a cpu-bound loop of some sort while the other takes a normal
UNIX signal (causing the IRETQ into the signal trampoline). The IRETQ
microcode freezes until the cpu-bound loop terminates, preventing the
cpu thread from being able to take any interrupt or IPI whatsoever for
the duration, and the cpu may destabilize afterwords as well.

* The pre-fetcher is somewhat heuristical, so just moving the trampoline
down is no guarantee if the top 4KB of the user stack is mapped or mappable.
It is better to make the boundary unmappable by userland.

* Bug first tracked down by myself in early 2017. AMD validated the bug
and determined that unmapping the boundary page completely solves the
issue.

* Also retain the code which places the signal trampoline in its own page
so we can maintain separate protection settings for the code, and make it
read-only (R+X).

show more ...


# da1e1cb6 06-Aug-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Move sigtramp even lower

* Attempt to work around a Ryzen cpu bug by moving sigtramp even lower than
we have already.


Revision tags: v4.8.1
# 3e925ec2 03-Apr-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Implement NX

* Implement the NX (no-execute) pmap bit.

* Shift sigtramp down to a page-bound and protect it prot|VM_PROT_EXECUTE.

* Map the rest of the user stack VM_PROT_READ|VM_PROT_WRI

kernel - Implement NX

* Implement the NX (no-execute) pmap bit.

* Shift sigtramp down to a page-bound and protect it prot|VM_PROT_EXECUTE.

* Map the rest of the user stack VM_PROT_READ|VM_PROT_WRITE without
VM_PROT_EXECUTE.

show more ...


# e6141a7f 29-Mar-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add KERN_PROC_SIGTRAMP

* Add a sysctl to retrieve the sigtramp address range for gdb.

Reported-by: marino


Revision tags: v4.8.0, v4.6.2, v4.9.0, v4.8.0rc
# 5947157e 26-Jan-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor suword, fuword, etc. change vmm_guest_sync_addr()

* Rename the entire family of functions to reduce confusion.

* Change how vmm_guest_sync_addr() works. Instead of loading one v

kernel - Refactor suword, fuword, etc. change vmm_guest_sync_addr()

* Rename the entire family of functions to reduce confusion.

* Change how vmm_guest_sync_addr() works. Instead of loading one value
into a target location we exchange the two target locations, with the
first address using an atomic op. This will allow the vkernel to
drop privs and query pte state atomically.

show more ...


# 5a4bb8ec 09-Jan-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Incidental MPLOCK removal (non-performance)

* proc filterops.

* kernel linkerops and kld code.

* Warn if a non-MPSAFE interrupt is installed.

* Use a private token in the disk messaging

kernel - Incidental MPLOCK removal (non-performance)

* proc filterops.

* kernel linkerops and kld code.

* Warn if a non-MPSAFE interrupt is installed.

* Use a private token in the disk messaging core (subr_disk) instead of
the mp token.

* Use a private token for sysv shm adminstrative calls.

* Cleanup.

show more ...


Revision tags: v4.6.1, v4.6.0
# 2eca01a4 28-Jul-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix getpid() issue in vfork() when threaded

* upmap->invfork was a 0 or 1, but in a threaded program it is possible
for multiple threads to be in vfork() at the same time. Change invfork

kernel - Fix getpid() issue in vfork() when threaded

* upmap->invfork was a 0 or 1, but in a threaded program it is possible
for multiple threads to be in vfork() at the same time. Change invfork
to a count.

* Fixes improper getpid() return when concurrent vfork()s are occuring in
a threaded program.

show more ...


Revision tags: v4.6.0rc2, v4.6.0rc, v4.7.0, v4.4.3, v4.4.2, v4.4.1, v4.4.0, v4.5.0, v4.4.0rc, v4.2.4, v4.3.1, v4.2.3, v4.2.1, v4.2.0, v4.0.6, v4.3.0, v4.2.0rc, v4.0.5, v4.0.4
# 59b728a7 18-Feb-2015 Sascha Wildner <saw@online.de>

sys/kern: Adjust some function declaration vs. definition mismatches.

All these functions are declared static already, so no functional change.


Revision tags: v4.0.3, v4.0.2, v4.0.1, v4.0.0, v4.0.0rc3, v4.0.0rc2
# 95d468db 05-Nov-2014 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Fix exec optimization race

* Fix an improper vm_page_unhold() in exec_map_page() which
under heavy memory loads can cause a later assertion
on m->hold_count == 0.

* Triggered every few

kernel - Fix exec optimization race

* Fix an improper vm_page_unhold() in exec_map_page() which
under heavy memory loads can cause a later assertion
on m->hold_count == 0.

* Triggered every few days by bulk builds on pkgbox64.

show more ...


123456