History log of /linux/kernel/seccomp.c (Results 1 – 25 of 150)
Revision Date Author Comments
# 4e94ddfe 30-Nov-2023 Christian Brauner <brauner@kernel.org>

file: remove __receive_fd()

Honestly, there's little value in having a helper with and without that
int __user *ufd argument. It's just messy and doesn't really give us
anything. Just expose receive

file: remove __receive_fd()

Honestly, there's little value in having a helper with and without that
int __user *ufd argument. It's just messy and doesn't really give us
anything. Just expose receive_fd() with that argument and get rid of
that helper.

Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-5-e73ca6f4ea83@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>

show more ...


# 46822860 17-Aug-2023 Kees Cook <keescook@chromium.org>

seccomp: Add missing kerndoc notations

The kerndoc for some struct member and function arguments were missing.
Add them.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>

seccomp: Add missing kerndoc notations

The kerndoc for some struct member and function arguments were missing.
Add them.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308171742.AncabIG1-lkp@intel.com/
Signed-off-by: Kees Cook <keescook@chromium.org>

show more ...


# 48a1084a 08-Mar-2023 Andrei Vagin <avagin@google.com>

seccomp: add the synchronous mode for seccomp_unotify

seccomp_unotify allows more privileged processes do actions on behalf
of less privileged processes.

In many cases, the workflow is fully synchr

seccomp: add the synchronous mode for seccomp_unotify

seccomp_unotify allows more privileged processes do actions on behalf
of less privileged processes.

In many cases, the workflow is fully synchronous. It means a target
process triggers a system call and passes controls to a supervisor
process that handles the system call and returns controls to the target
process. In this context, "synchronous" means that only one process is
running and another one is waiting.

There is the WF_CURRENT_CPU flag that is used to advise the scheduler to
move the wakee to the current CPU. For such synchronous workflows, it
makes context switches a few times faster.

Right now, each interaction takes 12µs. With this patch, it takes about
3µs.

This change introduce the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that
it used to enable the sync mode.

Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-5-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>

show more ...


# 4943b66d 08-Mar-2023 Andrei Vagin <avagin@google.com>

seccomp: don't use semaphore and wait_queue together

The main reason is to use new wake_up helpers that will be added in the
following patches. But here are a few other reasons:

* if we use two dif

seccomp: don't use semaphore and wait_queue together

The main reason is to use new wake_up helpers that will be added in the
following patches. But here are a few other reasons:

* if we use two different ways, we always need to call them both. This
patch fixes seccomp_notify_recv where we forgot to call wake_up_poll
in the error path.

* If we use one primitive, we can control how many waiters are woken up
for each request. Our goal is to wake up just one that will handle a
request. Right now, wake_up_poll can wake up one waiter and
up(&match->notif->request) can wake up one more.

Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-2-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>

show more ...


# 02a6b455 02-Mar-2023 Luis Chamberlain <mcgrof@kernel.org>

seccomp: simplify sysctls with register_sysctl_init()

register_sysctl_paths() is only needed if you have childs (directories)
with entries. Just use register_sysctl_init() as it also does the
kmemle

seccomp: simplify sysctls with register_sysctl_init()

register_sysctl_paths() is only needed if you have childs (directories)
with entries. Just use register_sysctl_init() as it also does the
kmemleak check for you.

Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

show more ...


# 0fb0624b 08-Jan-2023 Randy Dunlap <rdunlap@infradead.org>

seccomp: fix kernel-doc function name warning

Move the ACTION_ONLY() macro so that it is not between the kernel-doc
notation and the function definition for seccomp_run_filters(),
eliminating a kern

seccomp: fix kernel-doc function name warning

Move the ACTION_ONLY() macro so that it is not between the kernel-doc
notation and the function definition for seccomp_run_filters(),
eliminating a kernel-doc warning:

kernel/seccomp.c:400: warning: expecting prototype for seccomp_run_filters(). Prototype was for ACTION_ONLY() instead

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230108021228.15975-1-rdunlap@infradead.org

show more ...


# c2aa2dfe 03-May-2022 Sargun Dhillon <sargun@sargun.me>

seccomp: Add wait_killable semantic to seccomp user notifier

This introduces a per-filter flag (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
that makes it so that when notifications are received by the s

seccomp: Add wait_killable semantic to seccomp user notifier

This introduces a per-filter flag (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
that makes it so that when notifications are received by the supervisor the
notifying process will transition to wait killable semantics. Although wait
killable isn't a set of semantics formally exposed to userspace, the
concept is searchable. If the notifying process is signaled prior to the
notification being received by the userspace agent, it will be handled as
normal.

One quirk about how this is handled is that the notifying process
only switches to TASK_KILLABLE if it receives a wakeup from either
an addfd or a signal. This is to avoid an unnecessary wakeup of
the notifying task.

The reasons behind switching into wait_killable only after userspace
receives the notification are:
* Avoiding unncessary work - Often, workloads will perform work that they
may abort (request racing comes to mind). This allows for syscalls to be
aborted safely prior to the notification being received by the
supervisor. In this, the supervisor doesn't end up doing work that the
workload does not want to complete anyways.
* Avoiding side effects - We don't want the syscall to be interruptible
once the supervisor starts doing work because it may not be trivial
to reverse the operation. For example, unmounting a file system may
take a long time, and it's hard to rollback, or treat that as
reentrant.
* Avoid breaking runtimes - Various runtimes do not GC when they are
during a syscall (or while running native code that subsequently
calls a syscall). If many notifications are blocked, and not picked
up by the supervisor, this can get the application into a bad state.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220503080958.20220-2-sargun@sargun.me

show more ...


# 4cbf6f62 28-Apr-2022 Sargun Dhillon <sargun@sargun.me>

seccomp: Use FIFO semantics to order notifications

Previously, the seccomp notifier used LIFO semantics, where each
notification would be added on top of the stack, and notifications
were popped off

seccomp: Use FIFO semantics to order notifications

Previously, the seccomp notifier used LIFO semantics, where each
notification would be added on top of the stack, and notifications
were popped off the top of the stack. This could result one process
that generates a large number of notifications preventing other
notifications from being handled. This patch moves from LIFO (stack)
semantics to FIFO (queue semantics).

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220428015447.13661-1-sargun@sargun.me

show more ...


# 355f841a 09-Feb-2022 Eric W. Biederman <ebiederm@xmission.com>

tracehook: Remove tracehook.h

Now that all of the definitions have moved out of tracehook.h into
ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in
tracehook.h so remove it.

Upda

tracehook: Remove tracehook.h

Now that all of the definitions have moved out of tracehook.h into
ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in
tracehook.h so remove it.

Update the few files that were depending upon tracehook.h to bring in
definitions to use the headers they need directly.

Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-13-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

show more ...


# 495ac306 08-Feb-2022 Kees Cook <keescook@chromium.org>

seccomp: Invalidate seccomp mode to catch death failures

If seccomp tries to kill a process, it should never see that process
again. To enforce this proactively, switch the mode to something
impossi

seccomp: Invalidate seccomp mode to catch death failures

If seccomp tries to kill a process, it should never see that process
again. To enforce this proactively, switch the mode to something
impossible. If encountered: WARN, reject all syscalls, and attempt to
kill the process again even harder.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Fixes: 8112c4f140fa ("seccomp: remove 2-phase API")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>

show more ...


# d21918e5 23-Jun-2021 Eric W. Biederman <ebiederm@xmission.com>

signal/seccomp: Dump core when there is only one live thread

Replace get_nr_threads with atomic_read(&current->signal->live) as
that is a more accurate number that is decremented sooner.

Acked-by:

signal/seccomp: Dump core when there is only one live thread

Replace get_nr_threads with atomic_read(&current->signal->live) as
that is a more accurate number that is decremented sooner.

Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87lf6z6qbd.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

show more ...


# 307d522f 23-Jun-2021 Eric W. Biederman <ebiederm@xmission.com>

signal/seccomp: Refactor seccomp signal and coredump generation

Factor out force_sig_seccomp from the seccomp signal generation and
place it in kernel/signal.c. The function force_sig_seccomp takes

signal/seccomp: Refactor seccomp signal and coredump generation

Factor out force_sig_seccomp from the seccomp signal generation and
place it in kernel/signal.c. The function force_sig_seccomp takes a
parameter force_coredump to indicate that the sigaction field should
be reset to SIGDFL so that a coredump will be generated when the
signal is delivered.

force_sig_seccomp is then used to replace both seccomp_send_sigsys
and seccomp_init_siginfo.

force_sig_info_to_task gains an extra parameter to force using
the default signal action.

With this change seccomp is no longer a special case and there
becomes exactly one place do_coredump is called from.

Further it no longer becomes necessary for __seccomp_filter
to call do_group_exit.

Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87r1gr6qc4.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

show more ...


# b4d8a58f 04-Mar-2021 Hsuan-Chi Kuo <hsuanchikuo@gmail.com>

seccomp: Fix setting loaded filter count during TSYNC

The desired behavior is to set the caller's filter count to thread's.
This value is reported via /proc, so this fixes the inaccurate count
expos

seccomp: Fix setting loaded filter count during TSYNC

The desired behavior is to set the caller's filter count to thread's.
This value is reported via /proc, so this fixes the inaccurate count
exposed to userspace; it is not used for reference counting, etc.

Signed-off-by: Hsuan-Chi Kuo <hsuanchikuo@gmail.com>
Link: https://lore.kernel.org/r/20210304233708.420597-1-hsuanchikuo@gmail.com
Co-developed-by: Wiktor Garbacz <wiktorg@google.com>
Signed-off-by: Wiktor Garbacz <wiktorg@google.com>
Link: https://lore.kernel.org/lkml/20210810125158.329849-1-wiktorg@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Fixes: c818c03b661c ("seccomp: Report number of loaded filters in /proc/$pid/status")

show more ...


# 0ae71c77 17-May-2021 Rodrigo Campos <rodrigo@kinvolk.io>

seccomp: Support atomic "addfd + send reply"

Alban Crequy reported a race condition userspace faces when we want to
add some fds and make the syscall return them[1] using seccomp notify.

The proble

seccomp: Support atomic "addfd + send reply"

Alban Crequy reported a race condition userspace faces when we want to
add some fds and make the syscall return them[1] using seccomp notify.

The problem is that currently two different ioctl() calls are needed by
the process handling the syscalls (agent) for another userspace process
(target): SECCOMP_IOCTL_NOTIF_ADDFD to allocate the fd and
SECCOMP_IOCTL_NOTIF_SEND to return that value. Therefore, it is possible
for the agent to do the first ioctl to add a file descriptor but the
target is interrupted (EINTR) before the agent does the second ioctl()
call.

This patch adds a flag to the ADDFD ioctl() so it adds the fd and
returns that value atomically to the target program, as suggested by
Kees Cook[2]. This is done by simply allowing
seccomp_do_user_notification() to add the fd and return it in this case.
Therefore, in this case the target wakes up from the wait in
seccomp_do_user_notification() either to interrupt the syscall or to add
the fd and return it.

This "allocate an fd and return" functionality is useful for syscalls
that return a file descriptor only, like connect(2). Other syscalls that
return a file descriptor but not as return value (or return more than
one fd), like socketpair(), pipe(), recvmsg with SCM_RIGHTs, will not
work with this flag.

This effectively combines SECCOMP_IOCTL_NOTIF_ADDFD and
SECCOMP_IOCTL_NOTIF_SEND into an atomic opteration. The notification's
return value, nor error can be set by the user. Upon successful invocation
of the SECCOMP_IOCTL_NOTIF_ADDFD ioctl with the SECCOMP_ADDFD_FLAG_SEND
flag, the notifying process's errno will be 0, and the return value will
be the file descriptor number that was installed.

[1]: https://lore.kernel.org/lkml/CADZs7q4sw71iNHmV8EOOXhUKJMORPzF7thraxZYddTZsxta-KQ@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Tycho Andersen <tycho@tycho.pizza>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210517193908.3113-4-sargun@sargun.me

show more ...


# ddc47391 17-May-2021 Sargun Dhillon <sargun@sargun.me>

seccomp: Refactor notification handler to prepare for new semantics

This refactors the user notification code to have a do / while loop around
the completion condition. This has a small change in se

seccomp: Refactor notification handler to prepare for new semantics

This refactors the user notification code to have a do / while loop around
the completion condition. This has a small change in semantic, in that
previously we ignored addfd calls upon wakeup if the notification had been
responded to, but instead with the new change we check for an outstanding
addfd calls prior to returning to userspace.

Rodrigo Campos also identified a bug that can result in addfd causing
an early return, when the supervisor didn't actually handle the
syscall [1].

[1]: https://lore.kernel.org/lkml/20210413160151.3301-1-rodrigo@kinvolk.io/

Fixes: 7cf97b125455 ("seccomp: Introduce addfd ioctl to seccomp user notifier")
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Tycho Andersen <tycho@tycho.pizza>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Rodrigo Campos <rodrigo@kinvolk.io>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210517193908.3113-3-sargun@sargun.me

show more ...


# 42eb0d54 25-Mar-2021 Christoph Hellwig <hch@lst.de>

fs: split receive_fd_replace from __receive_fd

receive_fd_replace shares almost no code with the general case, so split
it out. Also remove the "Bump the sock usage counts" comment from
both copies

fs: split receive_fd_replace from __receive_fd

receive_fd_replace shares almost no code with the general case, so split
it out. Also remove the "Bump the sock usage counts" comment from
both copies, as that is now what __receive_sock actually does.

[AV: ... and make the only user of receive_fd_replace() choose between
it and receive_fd() according to what userland had passed to it in
flags]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

show more ...


# a3fc712c 31-Mar-2021 Cui GaoSheng <cuigaosheng1@huawei.com>

seccomp: Fix "cacheable" typo in comments

Do a trivial typo fix: s/cachable/cacheable/

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Cui GaoSheng <cuigaosheng1@huawei.com>
Signed-off-b

seccomp: Fix "cacheable" typo in comments

Do a trivial typo fix: s/cachable/cacheable/

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Cui GaoSheng <cuigaosheng1@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210331030724.84419-1-cuigaosheng1@huawei.com

show more ...


# a381b70a 05-Feb-2021 wanghongzhe <wanghongzhe@huawei.com>

seccomp: Improve performace by optimizing rmb()

According to Kees's suggest, we started with the patch that just replaces
rmb() with smp_rmb() and did a performance test with UnixBench. The
results

seccomp: Improve performace by optimizing rmb()

According to Kees's suggest, we started with the patch that just replaces
rmb() with smp_rmb() and did a performance test with UnixBench. The
results showed the overhead about 2.53% in rmb() test compared to the
smp_rmb() one, in a x86-64 kernel with CONFIG_SMP enabled running inside a
qemu-kvm vm. The test is a "syscall" testcase in UnixBench, which executes
5 syscalls in a loop during a certain timeout (100 second in our test) and
counts the total number of executions of this 5-syscall sequence. We set
a seccomp filter with all allow rule for all used syscalls in this test
(which will go bitmap path) to make sure the rmb() will be executed. The
details for the test:

with rmb():
/txm # ./syscall_allow_min 100
COUNT|35861159|1|lps
/txm # ./syscall_allow_min 100
COUNT|35545501|1|lps
/txm # ./syscall_allow_min 100
COUNT|35664495|1|lps

with smp_rmb():
/txm # ./syscall_allow_min 100
COUNT|36552771|1|lps
/txm # ./syscall_allow_min 100
COUNT|36491247|1|lps
/txm # ./syscall_allow_min 100
COUNT|36504746|1|lps

For a x86-64 kernel with CONFIG_SMP enabled, the smp_rmb() is just a
compiler barrier() which have no impact in runtime, while rmb() is a
lfence which will prevent all memory access operations (not just load
according the recently claim by Intel) behind itself. We can also figure
it out in disassembly:

with rmb():
0000000000001430 <__seccomp_filter>:
1430: 41 57 push %r15
1432: 41 56 push %r14
1434: 41 55 push %r13
1436: 41 54 push %r12
1438: 55 push %rbp
1439: 53 push %rbx
143a: 48 81 ec 90 00 00 00 sub $0x90,%rsp
1441: 89 7c 24 10 mov %edi,0x10(%rsp)
1445: 89 54 24 14 mov %edx,0x14(%rsp)
1449: 65 48 8b 04 25 28 00 mov %gs:0x28,%rax
1450: 00 00
1452: 48 89 84 24 88 00 00 mov %rax,0x88(%rsp)
1459: 00
145a: 31 c0 xor %eax,%eax
* 145c: 0f ae e8 lfence
145f: 48 85 f6 test %rsi,%rsi
1462: 49 89 f4 mov %rsi,%r12
1465: 0f 84 42 03 00 00 je 17ad <__seccomp_filter+0x37d>
146b: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
1472: 00 00
1474: 48 8b 98 80 07 00 00 mov 0x780(%rax),%rbx
147b: 48 85 db test %rbx,%rbx

with smp_rmb();
0000000000001430 <__seccomp_filter>:
1430: 41 57 push %r15
1432: 41 56 push %r14
1434: 41 55 push %r13
1436: 41 54 push %r12
1438: 55 push %rbp
1439: 53 push %rbx
143a: 48 81 ec 90 00 00 00 sub $0x90,%rsp
1441: 89 7c 24 10 mov %edi,0x10(%rsp)
1445: 89 54 24 14 mov %edx,0x14(%rsp)
1449: 65 48 8b 04 25 28 00 mov %gs:0x28,%rax
1450: 00 00
1452: 48 89 84 24 88 00 00 mov %rax,0x88(%rsp)
1459: 00
145a: 31 c0 xor %eax,%eax
145c: 48 85 f6 test %rsi,%rsi
145f: 49 89 f4 mov %rsi,%r12
1462: 0f 84 42 03 00 00 je 17aa <__seccomp_filter+0x37a>
1468: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
146f: 00 00
1471: 48 8b 98 80 07 00 00 mov 0x780(%rax),%rbx
1478: 48 85 db test %rbx,%rbx

Signed-off-by: wanghongzhe <wanghongzhe@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/1612496049-32507-1-git-send-email-wanghongzhe@huawei.com

show more ...


# 04b38d01 11-Jan-2021 Paul Cercueil <paul@crapouillou.net>

seccomp: Add missing return in non-void function

We don't actually care about the value, since the kernel will panic
before that; but a value should nonetheless be returned, otherwise the
compiler w

seccomp: Add missing return in non-void function

We don't actually care about the value, since the kernel will panic
before that; but a value should nonetheless be returned, otherwise the
compiler will complain.

Fixes: 8112c4f140fa ("seccomp: remove 2-phase API")
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Paul Cercueil <paul@crapouillou.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210111172839.640914-1-paul@crapouillou.net

show more ...


# fab686eb 20-Nov-2020 Jann Horn <jannh@google.com>

seccomp: Remove bogus __user annotations

Buffers that are passed to read_actions_logged() and write_actions_logged()
are in kernel memory; the sysctl core takes care of copying from/to
userspace.

F

seccomp: Remove bogus __user annotations

Buffers that are passed to read_actions_logged() and write_actions_logged()
are in kernel memory; the sysctl core takes care of copying from/to
userspace.

Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Reviewed-by: Tyler Hicks <code@tyhicks.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201120170545.1419332-1-jannh@google.com

show more ...


# 0d8315dd 11-Nov-2020 YiFei Zhu <yifeifz2@illinois.edu>

seccomp/cache: Report cache data through /proc/pid/seccomp_cache

Currently the kernel does not provide an infrastructure to translate
architecture numbers to a human-readable name. Translating sysca

seccomp/cache: Report cache data through /proc/pid/seccomp_cache

Currently the kernel does not provide an infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through FTRACE_SYSCALL
infrastructure but it does not provide support for compat syscalls.

This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<arch name> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and filter means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:
x86_64 0 ALLOW
x86_64 1 ALLOW
x86_64 2 ALLOW
x86_64 3 ALLOW
[...]
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
x86_64 138 ALLOW
x86_64 139 FILTER
x86_64 140 ALLOW
x86_64 141 ALLOW
[...]

This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable. For
the same reason, it is also guarded by CAP_SYS_ADMIN.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/94e663fa53136f5a11f432c661794d1ee7060779.1605101222.git.yifeifz2@illinois.edu

show more ...


# 8e01b51a 11-Oct-2020 YiFei Zhu <yifeifz2@illinois.edu>

seccomp/cache: Add "emulator" to check if filter is constant allow

SECCOMP_CACHE will only operate on syscalls that do not access
any syscall arguments or instruction pointer. To facilitate
this we

seccomp/cache: Add "emulator" to check if filter is constant allow

SECCOMP_CACHE will only operate on syscalls that do not access
any syscall arguments or instruction pointer. To facilitate
this we need a static analyser to know whether a filter will
return allow regardless of syscall arguments for a given
architecture number / syscall number pair. This is implemented
here with a pseudo-emulator, and stored in a per-filter bitmap.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent from syscall arguments.

Nearly all seccomp filters are built from these cBPF instructions:

BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JSET | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K
BPF_ALU | BPF_AND | BPF_K

Each of these instructions are emulated. Any weirdness or loading
from a syscall argument will cause the emulator to bail.

The emulation is also halted if it reaches a return. In that case,
if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.

Emulator structure and comments are from Kees [1] and Jann [2].

Emulation is done at attach time. If a filter depends on more
filters, and if the dependee does not guarantee to allow the
syscall, then we skip the emulation of this syscall.

[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/71c7be2db5ee08905f41c3be5c1ad6e2601ce88f.1602431034.git.yifeifz2@illinois.edu

show more ...


# f9d480b6 11-Oct-2020 YiFei Zhu <yifeifz2@illinois.edu>

seccomp/cache: Lookup syscall allowlist bitmap for fast path

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of

seccomp/cache: Lookup syscall allowlist bitmap for fast path

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.

We observed some common filters, such as docker's [4] or
systemd's [5], will make most decisions based only on the syscall
numbers, and as past discussions considered, a bitmap where each bit
represents a syscall makes most sense for these filters.

The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.

When it can be concluded that an allow must occur for the given
architecture and syscall pair (this determination is introduced in
the next commit), seccomp will immediately allow the syscall,
bypassing further BPF execution.

Each architecture number has its own bitmap. The architecture
number in seccomp_data is checked against the defined architecture
number constant before proceeding to test the bit against the
bitmap with the syscall number as the index of the bit in the
bitmap, and if the bit is set, seccomp returns allow. The bitmaps
are all clear in this patch and will be initialized in the next
commit.

When only one architecture exists, the check against architecture
number is skipped, suggested by Kees Cook [7].

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
[7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/10f91a367ec4fcdea7fc3f086de3f5f13a4a7436.1602431034.git.yifeifz2@illinois.edu

show more ...


# fb14528e 30-Oct-2020 Mickaël Salaün <mic@linux.microsoft.com>

seccomp: Set PF_SUPERPRIV when checking capability

Replace the use of security_capable(current_cred(), ...) with
ns_capable_noaudit() which set PF_SUPERPRIV.

Since commit 98f368e9e263 ("kernel: Add

seccomp: Set PF_SUPERPRIV when checking capability

Replace the use of security_capable(current_cred(), ...) with
ns_capable_noaudit() which set PF_SUPERPRIV.

Since commit 98f368e9e263 ("kernel: Add noaudit variant of
ns_capable()"), a new ns_capable_noaudit() helper is available. Let's
use it!

Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Will Drewry <wad@chromium.org>
Cc: stable@vger.kernel.org
Fixes: e2cfabdfd075 ("seccomp: add system call filtering using BPF")
Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201030123849.770769-3-mic@digikod.net

show more ...


# 23d67a54 16-Nov-2020 Gabriel Krisman Bertazi <krisman@collabora.com>

seccomp: Migrate to use SYSCALL_WORK flag

On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes

seccomp: Migrate to use SYSCALL_WORK flag

On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.

Define SYSCALL_WORK_SECCOMP, use it in the generic entry code and convert
the code which uses the TIF specific helper functions to use the new
*_syscall_work() helpers which either resolve to the new mode for users of
the generic entry code or to the TIF based functions for the other
architectures.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-5-krisman@collabora.com

show more ...


123456