History log of /linux/kernel/user.c (Results 1 – 25 of 95)
Revision Date Author Comments
# 21ca59b3 28-Oct-2021 Christian Brauner <christian.brauner@ubuntu.com>

binfmt_misc: enable sandboxed mounts

Enable unprivileged sandboxes to create their own binfmt_misc mounts.
This is based on Laurent's work in [1] but has been significantly
reworked to fix various i

binfmt_misc: enable sandboxed mounts

Enable unprivileged sandboxes to create their own binfmt_misc mounts.
This is based on Laurent's work in [1] but has been significantly
reworked to fix various issues we identified in earlier versions.

While binfmt_misc can currently only be mounted in the initial user
namespace, binary types registered in this binfmt_misc instance are
available to all sandboxes (Either by having them installed in the
sandbox or by registering the binary type with the F flag causing the
interpreter to be opened right away). So binfmt_misc binary types are
already delegated to sandboxes implicitly.

However, while a sandbox has access to all registered binary types in
binfmt_misc a sandbox cannot currently register its own binary types
in binfmt_misc. This has prevented various use-cases some of which were
already outlined in [1] but we have a range of issues associated with
this (cf. [3]-[5] below which are just a small sample).

Extend binfmt_misc to be mountable in non-initial user namespaces.
Similar to other filesystem such as nfsd, mqueue, and sunrpc we use
keyed superblock management. The key determines whether we need to
create a new superblock or can reuse an already existing one. We use the
user namespace of the mount as key. This means a new binfmt_misc
superblock is created once per user namespace creation. Subsequent
mounts of binfmt_misc in the same user namespace will mount the same
binfmt_misc instance. We explicitly do not create a new binfmt_misc
superblock on every binfmt_misc mount as the semantics for
load_misc_binary() line up with the keying model. This also allows us to
retrieve the relevant binfmt_misc instance based on the caller's user
namespace which can be done in a simple (bounded to 32 levels) loop.

Similar to the current binfmt_misc semantics allowing access to the
binary types in the initial binfmt_misc instance we do allow sandboxes
access to their parent's binfmt_misc mounts if they do not have created
a separate binfmt_misc instance.

Overall, this will unblock the use-cases mentioned below and in general
will also allow to support and harden execution of another
architecture's binaries in tight sandboxes. For instance, using the
unshare binary it possible to start a chroot of another architecture and
configure the binfmt_misc interpreter without being root to run the
binaries in this chroot and without requiring the host to modify its
binary type handlers.

Henning had already posted a few experiments in the cover letter at [1].
But here's an additional example where an unprivileged container
registers qemu-user-static binary handlers for various binary types in
its separate binfmt_misc mount and is then seamlessly able to start
containers with a different architecture without affecting the host:

root [lxc monitor] /var/snap/lxd/common/lxd/containers f1
1000000 \_ /sbin/init
1000000 \_ /lib/systemd/systemd-journald
1000000 \_ /lib/systemd/systemd-udevd
1000100 \_ /lib/systemd/systemd-networkd
1000101 \_ /lib/systemd/systemd-resolved
1000000 \_ /usr/sbin/cron -f
1000103 \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
1000000 \_ /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
1000104 \_ /usr/sbin/rsyslogd -n -iNONE
1000000 \_ /lib/systemd/systemd-logind
1000000 \_ /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
1000107 \_ dnsmasq --conf-file=/dev/null -u lxc-dnsmasq --strict-order --bind-interfaces --pid-file=/run/lxc/dnsmasq.pid --liste
1000000 \_ [lxc monitor] /var/lib/lxc f1-s390x
1100000 \_ /usr/bin/qemu-s390x-static /sbin/init
1100000 \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-journald
1100000 \_ /usr/bin/qemu-s390x-static /usr/sbin/cron -f
1100103 \_ /usr/bin/qemu-s390x-static /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-ac
1100000 \_ /usr/bin/qemu-s390x-static /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
1100104 \_ /usr/bin/qemu-s390x-static /usr/sbin/rsyslogd -n -iNONE
1100000 \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-logind
1100000 \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
1100000 \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/0 115200,38400,9600 vt220
1100000 \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/1 115200,38400,9600 vt220
1100000 \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/2 115200,38400,9600 vt220
1100000 \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/3 115200,38400,9600 vt220
1100000 \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-udevd

[1]: https://lore.kernel.org/all/20191216091220.465626-1-laurent@vivier.eu
[2]: https://discuss.linuxcontainers.org/t/binfmt-misc-permission-denied
[3]: https://discuss.linuxcontainers.org/t/lxd-binfmt-support-for-qemu-static-interpreters
[4]: https://discuss.linuxcontainers.org/t/3-1-0-binfmt-support-service-in-unprivileged-guest-requires-write-access-on-hosts-proc-sys-fs-binfmt-misc
[5]: https://discuss.linuxcontainers.org/t/qemu-user-static-not-working-4-11

Link: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu (origin)
Link: https://lore.kernel.org/r/20211028103114.2849140-2-brauner@kernel.org (v1)
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Henning Schild <henning.schild@siemens.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Laurent Vivier <laurent@vivier.eu>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
/* v2 */
- Serge Hallyn <serge@hallyn.com>:
- Use GFP_KERNEL_ACCOUNT for userspace triggered allocations when a
new binary type handler is registered.
- Christian Brauner <christian.brauner@ubuntu.com>:
- Switch authorship to me. I refused to do that earlier even though
Laurent said I should do so because I think it's genuinely bad form.
But by now I have changed so many things that it'd be unfair to
blame Laurent for any potential bugs in here.
- Add more comments that explain what's going on.
- Rename functions while changing them to better reflect what they are
doing to make the code easier to understand.
- In the first version when a specific binary type handler was removed
either through a write to the entry's file or all binary type
handlers were removed by a write to the binfmt_misc mount's status
file all cleanup work happened during inode eviction.
That includes removal of the relevant entries from entry list. While
that works fine I disliked that model after thinking about it for a
bit. Because it means that there was a window were someone has
already removed a or all binary handlers but they could still be
safely reached from load_misc_binary() when it has managed to take
the read_lock() on the entries list while inode eviction was already
happening. Again, that perfectly benign but it's cleaner to remove
the binary handler from the list immediately meaning that ones the
write to then entry's file or the binfmt_misc status file returns
the binary type cannot be executed anymore. That gives stronger
guarantees to the user.

show more ...


# ce5a23c8 29-Nov-2022 Jason Gunthorpe <jgg@nvidia.com>

kernel/user: Allow user_struct::locked_vm to be usable for iommufd

Following the pattern of io_uring, perf, skb, and bpf, iommfd will use
user->locked_vm for accounting pinned pages. Ensure the valu

kernel/user: Allow user_struct::locked_vm to be usable for iommufd

Following the pattern of io_uring, perf, skb, and bpf, iommfd will use
user->locked_vm for accounting pinned pages. Ensure the value is included
in the struct and export free_uid() as iommufd is modular.

user->locked_vm is the good accounting to use for ulimit because it is
per-user, and the security sandboxing of locked pages is not supposed to
be per-process. Other places (vfio, vdpa and infiniband) have used
mm->pinned_vm and/or mm->locked_vm for accounting pinned pages, but this
is only per-process and inconsistent with the new FOLL_LONGTERM users in
the kernel.

Concurrent work is underway to try to put this in a cgroup, so everything
can be consistent and the kernel can provide a FOLL_LONGTERM limit that
actually provides security.

Link: https://lore.kernel.org/r/7-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

show more ...


# 1e1c1583 08-Sep-2021 Nicholas Piggin <npiggin@gmail.com>

fs/epoll: use a per-cpu counter for user's watches count

This counter tracks the number of watches a user has, to compare against
the 'max_user_watches' limit. This causes a scalability bottleneck o

fs/epoll: use a per-cpu counter for user's watches count

This counter tracks the number of watches a user has, to compare against
the 'max_user_watches' limit. This causes a scalability bottleneck on
SPECjbb2015 on large systems as there is only one user. Changing to a
per-cpu counter increases throughput of the benchmark by about 30% on a
16-socket, > 1000 thread system.

[rdunlap@infradead.org: fix build errors in kernel/user.c when CONFIG_EPOLL=n]
[npiggin@gmail.com: move ifdefs into wrapper functions, slightly improve panic message]
Link: https://lkml.kernel.org/r/1628051945.fens3r99ox.astroid@bobo.none
[akpm@linux-foundation.org: tweak user_epoll_alloc(), per Guenter]
Link: https://lkml.kernel.org/r/20210804191421.GA1900577@roeck-us.net

Link: https://lkml.kernel.org/r/20210802032013.2751916-1-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reported-by: Anton Blanchard <anton@ozlabs.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

show more ...


# d7c9e99a 22-Apr-2021 Alexey Gladkov <legion@kernel.org>

Reimplement RLIMIT_MEMLOCK on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded b

Reimplement RLIMIT_MEMLOCK on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog

v11:
* Fix issue found by lkp robot.

v8:
* Fix issues found by lkp-tests project.

v7:
* Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

show more ...


# d6469690 22-Apr-2021 Alexey Gladkov <legion@kernel.org>

Reimplement RLIMIT_SIGPENDING on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceede

Reimplement RLIMIT_SIGPENDING on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog

v11:
* Revert most of changes to fix performance issues.

v10:
* Fix memory leak on get_ucounts failure.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/df9d7764dddd50f28616b7840de74ec0f81711a8.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

show more ...


# 21d1c5e3 22-Apr-2021 Alexey Gladkov <legion@kernel.org>

Reimplement RLIMIT_NPROC on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by

Reimplement RLIMIT_NPROC on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never fork the service wants to
set RLIMIT_NPROC=1.

service-A
\- program (uid=1000, container1, rlimit_nproc=1)
\- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.

We cannot use existing inc_ucounts / dec_ucounts because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
overlimited by root or if the user has the appropriate capability.

Changelog

v11:
* Change inc_rlimit_ucounts() which now returns top value of ucounts.
* Drop inc_rlimit_ucounts_and_test() because the return code of
inc_rlimit_ucounts() can be checked.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

show more ...


# 265cbd62 03-Aug-2020 Kirill Tkhai <ktkhai@virtuozzo.com>

user: Use generic ns_common::count

Switch over user namespaces to use the newly introduced common lifetime
counter.

Currently every namespace type has its own lifetime counter which is stored
in th

user: Use generic ns_common::count

Switch over user namespaces to use the newly introduced common lifetime
counter.

Currently every namespace type has its own lifetime counter which is stored
in the specific namespace struct. The lifetime counters are used
identically for all namespaces types. Namespaces may of course have
additional unrelated counters and these are not altered.

This introduces a common lifetime counter into struct ns_common. The
ns_common struct encompasses information that all namespaces share. That
should include the lifetime counter since its common for all of them.

It also allows us to unify the type of the counters across all namespaces.
Most of them use refcount_t but one uses atomic_t and at least one uses
kref. Especially the last one doesn't make much sense since it's just a
wrapper around refcount_t since 2016 and actually complicates cleanup
operations by having to use container_of() to cast the correct namespace
struct out of struct ns_common.

Having the lifetime counter for the namespaces in one place reduces
maintenance cost. Not just because after switching all namespaces over we
will have removed more code than we added but also because the logic is
more easily understandable and we indicate to the user that the basic
lifetime requirements for all namespaces are currently identical.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/159644979754.604812.601625186726406922.stgit@localhost.localdomain
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

show more ...


# de83dbd9 04-Jun-2020 Jason Yan <yanaijie@huawei.com>

user.c: make uidhash_table static

Fix the following sparse warning:

kernel/user.c:85:19: warning: symbol 'uidhash_table' was not declared.
Should it be static?

Reported-by: Hulk Robot <hulkci@

user.c: make uidhash_table static

Fix the following sparse warning:

kernel/user.c:85:19: warning: symbol 'uidhash_table' was not declared.
Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200413082146.22737-1-yanaijie@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

show more ...


# 0f44e4d9 26-Jun-2019 David Howells <dhowells@redhat.com>

keys: Move the user and user-session keyrings to the user_namespace

Move the user and user-session keyrings to the user_namespace struct rather
than pinning them from the user_struct struct. This p

keys: Move the user and user-session keyrings to the user_namespace

Move the user and user-session keyrings to the user_namespace struct rather
than pinning them from the user_struct struct. This prevents these
keyrings from propagating across user-namespaces boundaries with regard to
the KEY_SPEC_* flags, thereby making them more useful in a containerised
environment.

The issue is that a single user_struct may be represent UIDs in several
different namespaces.

The way the patch does this is by attaching a 'register keyring' in each
user_namespace and then sticking the user and user-session keyrings into
that. It can then be searched to retrieve them.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jann Horn <jannh@google.com>

show more ...


# b206f281 26-Jun-2019 David Howells <dhowells@redhat.com>

keys: Namespace keyring names

Keyring names are held in a single global list that any process can pick
from by means of keyctl_join_session_keyring (provided the keyring grants
Search permission).

keys: Namespace keyring names

Keyring names are held in a single global list that any process can pick
from by means of keyctl_join_session_keyring (provided the keyring grants
Search permission). This isn't very container friendly, however.

Make the following changes:

(1) Make default session, process and thread keyring names begin with a
'.' instead of '_'.

(2) Keyrings whose names begin with a '.' aren't added to the list. Such
keyrings are system specials.

(3) Replace the global list with per-user_namespace lists. A keyring adds
its name to the list for the user_namespace that it is currently in.

(4) When a user_namespace is deleted, it just removes itself from the
keyring name list.

The global keyring_name_lock is retained for accessing the name lists.
This allows (4) to work.

This can be tested by:

# keyctl newring foo @s
995906392
# unshare -U
$ keyctl show
...
995906392 --alswrv 65534 65534 \_ keyring: foo
...
$ keyctl session foo
Joined session keyring: 935622349

As can be seen, a new session keyring was created.

The capability bit KEYCTL_CAPS1_NS_KEYRING_NAME is set if the kernel is
employing this feature.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric W. Biederman <ebiederm@xmission.com>

show more ...


# 457c8996 19-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Add SPDX license identifier for missed files

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have EXPORT_.*_SYMBOL_GPL inside which was use

treewide: Add SPDX license identifier for missed files

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


# 6c4e121f 14-May-2019 Rasmus Villemoes <linux@rasmusvillemoes.dk>

kernel/user.c: clean up some leftover code

The out_unlock label is misleading; no unlocking happens after it, so
just return NULL directly.

Also, nothing between the kmem_cache_zalloc() that create

kernel/user.c: clean up some leftover code

The out_unlock label is misleading; no unlocking happens after it, so
just return NULL directly.

Also, nothing between the kmem_cache_zalloc() that creates new and the
two key_put() can initialize new->uid_keyring or new->session_keyring,
so those calls are no-ops.

Link: http://lkml.kernel.org/r/20190424200404.9114-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

show more ...


# ce0a568d 22-Aug-2018 Anna-Maria Gleixner <anna-maria@linutronix.de>

userns: use irqsave variant of refcount_dec_and_lock()

The irqsave variant of refcount_dec_and_lock handles irqsave/restore when
taking/releasing the spin lock. With this variant the call of
local_

userns: use irqsave variant of refcount_dec_and_lock()

The irqsave variant of refcount_dec_and_lock handles irqsave/restore when
taking/releasing the spin lock. With this variant the call of
local_irq_save/restore is no longer required.

[bigeasy@linutronix.de: s@atomic_dec_and_lock@refcount_dec_and_lock@g]
Link: http://lkml.kernel.org/r/20180703200141.28415-7-bigeasy@linutronix.de
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

show more ...


# fc371912 22-Aug-2018 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

userns: use refcount_t for reference counting instead atomic_t

refcount_t type and corresponding API should be used instead of atomic_t
wh en the variable is used as a reference counter. This avoid

userns: use refcount_t for reference counting instead atomic_t

refcount_t type and corresponding API should be used instead of atomic_t
wh en the variable is used as a reference counter. This avoids accidental
refcounter overflows that might lead to use-after-free situations.

Link: http://lkml.kernel.org/r/20180703200141.28415-6-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

show more ...


# bef3efbe 22-Feb-2018 Luck, Tony <tony.luck@intel.com>

efivarfs: Limit the rate for non-root to read files

Each read from a file in efivarfs results in two calls to EFI
(one to get the file size, another to get the actual data).

On X86 these EFI calls

efivarfs: Limit the rate for non-root to read files

Each read from a file in efivarfs results in two calls to EFI
(one to get the file size, another to get the actual data).

On X86 these EFI calls result in broadcast system management
interrupts (SMI) which affect performance of the whole system.
A malicious user can loop performing reads from efivarfs bringing
the system to its knees.

Linus suggested per-user rate limit to solve this.

So we add a ratelimit structure to "user_struct" and initialize
it for the root user for no limit. When allocating user_struct for
other users we set the limit to 100 per second. This could be used
for other places that want to limit the rate of some detrimental
user action.

In efivarfs if the limit is exceeded when reading, we take an
interruptible nap for 50ms and check the rate limit again.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

show more ...


# aa4bf44d 24-Oct-2017 Christian Brauner <christian.brauner@ubuntu.com>

userns: use union in {g,u}idmap struct

- Add a struct containing two pointer to extents and wrap both the static extent
array and the struct into a union. This is done in preparation for bumping t

userns: use union in {g,u}idmap struct

- Add a struct containing two pointer to extents and wrap both the static extent
array and the struct into a union. This is done in preparation for bumping the
{g,u}idmap limits for user namespaces.
- Add brackets around anonymous union when using designated initializers to
initialize members in order to please gcc <= 4.4.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

show more ...


# 8703e8a4 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/user.h>

We are going to split <linux/sched/user.h> out of <linux/sched.h>, which
will have to be picked up from

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/user.h>

We are going to split <linux/sched/user.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/user.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

show more ...


# 9cc46516 02-Dec-2014 Eric W. Biederman <ebiederm@xmission.com>

userns: Add a knob to disable setgroups on a per user namespace basis

- Expose the knob to user space through a proc file /proc/<pid>/setgroups

A value of "deny" means the setgroups system call i

userns: Add a knob to disable setgroups on a per user namespace basis

- Expose the knob to user space through a proc file /proc/<pid>/setgroups

A value of "deny" means the setgroups system call is disabled in the
current processes user namespace and can not be enabled in the
future in this user namespace.

A value of "allow" means the segtoups system call is enabled.

- Descendant user namespaces inherit the value of setgroups from
their parents.

- A proc file is used (instead of a sysctl) as sysctls currently do
not allow checking the permissions at open time.

- Writing to the proc file is restricted to before the gid_map
for the user namespace is set.

This ensures that disabling setgroups at a user namespace
level will never remove the ability to call setgroups
from a process that already has that ability.

A process may opt in to the setgroups disable for itself by
creating, entering and configuring a user namespace or by calling
setns on an existing user namespace with setgroups disabled.
Processes without privileges already can not call setgroups so this
is a noop. Prodcess with privilege become processes without
privilege when entering a user namespace and as with any other path
to dropping privilege they would not have the ability to call
setgroups. So this remains within the bounds of what is possible
without a knob to disable setgroups permanently in a user namespace.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

show more ...


# 33c42940 01-Nov-2014 Al Viro <viro@zeniv.linux.org.uk>

copy address of proc_ns_ops into ns_common

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 435d5f4b 01-Nov-2014 Al Viro <viro@zeniv.linux.org.uk>

common object embedded into various struct ....ns

for now - just move corresponding ->proc_inum instances over there

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <vi

common object embedded into various struct ....ns

for now - just move corresponding ->proc_inum instances over there

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

show more ...


# b300a4ea 04-Jun-2014 Kirill A. Shutemov <kirill@shutemov.name>

kernel/user.c: drop unused field 'files' from user_struct

Nobody seems uses it for a long time. Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: And

kernel/user.c: drop unused field 'files' from user_struct

Nobody seems uses it for a long time. Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

show more ...


# c96d6660 03-Apr-2014 Paul Gortmaker <paul.gortmaker@windriver.com>

kernel: audit/fix non-modular users of module_init in core code

Code that is obj-y (always built-in) or dependent on a bool Kconfig
(built-in or absent) can never be modular. So using module_init a

kernel: audit/fix non-modular users of module_init in core code

Code that is obj-y (always built-in) or dependent on a bool Kconfig
(built-in or absent) can never be modular. So using module_init as an
alias for __initcall can be somewhat misleading.

Fix these up now, so that we can relocate module_init from init.h into
module.h in the future. If we don't do this, we'd have to add module.h
to obviously non-modular code, and that would be a worse thing.

The audit targets the following module_init users for change:
kernel/user.c obj-y
kernel/kexec.c bool KEXEC (one instance per arch)
kernel/profile.c bool PROFILING
kernel/hung_task.c bool DETECT_HUNG_TASK
kernel/sched/stats.c bool SCHEDSTATS
kernel/user_namespace.c bool USER_NS

Note that direct use of __initcall is discouraged, vs. one of the
priority categorized subgroups. As __initcall gets mapped onto
device_initcall, our use of subsys_initcall (which makes sense for these
files) will thus change this registration from level 6-device to level
4-subsys (i.e. slightly earlier). However no observable impact of that
difference has been observed during testing.

Also, two instances of missing ";" at EOL are fixed in kexec.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

show more ...


# 6bd364d8 13-Dec-2013 Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>

KEYS: fix uninitialized persistent_keyring_register_sem

We run into this bug:
[ 2736.063245] Unable to handle kernel paging request for data at address 0x00000000
[ 2736.063293] Faulting instruction

KEYS: fix uninitialized persistent_keyring_register_sem

We run into this bug:
[ 2736.063245] Unable to handle kernel paging request for data at address 0x00000000
[ 2736.063293] Faulting instruction address: 0xc00000000037efb0
[ 2736.063300] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2736.063303] SMP NR_CPUS=2048 NUMA pSeries
[ 2736.063310] Modules linked in: sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6table_security ip6table_raw ip6t_REJECT iptable_nat nf_nat_ipv4 iptable_mangle iptable_security iptable_raw ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ebtable_filter ebtables ip6table_filter iptable_filter ip_tables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6_tables ibmveth pseries_rng nx_crypto nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc xfs libcrc32c dm_service_time sd_mod crc_t10dif crct10dif_common ibmvfc scsi_transport_fc scsi_tgt dm_mirror dm_region_hash dm_log dm_multipath dm_mod
[ 2736.063383] CPU: 1 PID: 7128 Comm: ssh Not tainted 3.10.0-48.el7.ppc64 #1
[ 2736.063389] task: c000000131930120 ti: c0000001319a0000 task.ti: c0000001319a0000
[ 2736.063394] NIP: c00000000037efb0 LR: c0000000006c40f8 CTR: 0000000000000000
[ 2736.063399] REGS: c0000001319a3870 TRAP: 0300 Not tainted (3.10.0-48.el7.ppc64)
[ 2736.063403] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 28824242 XER: 20000000
[ 2736.063415] SOFTE: 0
[ 2736.063418] CFAR: c00000000000908c
[ 2736.063421] DAR: 0000000000000000, DSISR: 40000000
[ 2736.063425]
GPR00: c0000000006c40f8 c0000001319a3af0 c000000001074788 c0000001319a3bf0
GPR04: 0000000000000000 0000000000000000 0000000000000020 000000000000000a
GPR08: fffffffe00000002 00000000ffff0000 0000000080000001 c000000000924888
GPR12: 0000000028824248 c000000007e00400 00001fffffa0f998 0000000000000000
GPR16: 0000000000000022 00001fffffa0f998 0000010022e92470 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 c000000000f4a828 00003ffffe527108 0000000000000000
GPR28: c000000000f4a730 c000000000f4a828 0000000000000000 c0000001319a3bf0
[ 2736.063498] NIP [c00000000037efb0] .__list_add+0x30/0x110
[ 2736.063504] LR [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
[ 2736.063508] PACATMSCRATCH [800000000280f032]
[ 2736.063511] Call Trace:
[ 2736.063516] [c0000001319a3af0] [c0000001319a3b80] 0xc0000001319a3b80 (unreliable)
[ 2736.063523] [c0000001319a3b80] [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
[ 2736.063530] [c0000001319a3c50] [c0000000006c1bb0] .down_write+0x70/0x78
[ 2736.063536] [c0000001319a3cd0] [c0000000002e5ffc] .keyctl_get_persistent+0x20c/0x320
[ 2736.063542] [c0000001319a3dc0] [c0000000002e2388] .SyS_keyctl+0x238/0x260
[ 2736.063548] [c0000001319a3e30] [c000000000009e7c] syscall_exit+0x0/0x7c
[ 2736.063553] Instruction dump:
[ 2736.063556] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7cbd2b78 7c9e2378 7c7f1b78 f8010010
[ 2736.063566] f821ff71 e8a50008 7fa52040 40de00c0 <e8be0000> 7fbd2840 40de0094 7fbff040
[ 2736.063579] ---[ end trace 2708241785538296 ]---

It's caused by uninitialized persistent_keyring_register_sem.

The bug was introduced by commit f36f8c75, two typos are in that commit:
CONFIG_KEYS_KERBEROS_CACHE should be CONFIG_PERSISTENT_KEYRINGS and
krb_cache_register_sem should be persistent_keyring_register_sem.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: David Howells <dhowells@redhat.com>

show more ...


# f36f8c75 24-Sep-2013 David Howells <dhowells@redhat.com>

KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches

Add support for per-user_namespace registers of persistent per-UID kerberos
caches held within the kernel.

This allows

KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches

Add support for per-user_namespace registers of persistent per-UID kerberos
caches held within the kernel.

This allows the kerberos cache to be retained beyond the life of all a user's
processes so that the user's cron jobs can work.

The kerberos cache is envisioned as a keyring/key tree looking something like:

struct user_namespace
\___ .krb_cache keyring - The register
\___ _krb.0 keyring - Root's Kerberos cache
\___ _krb.5000 keyring - User 5000's Kerberos cache
\___ _krb.5001 keyring - User 5001's Kerberos cache
\___ tkt785 big_key - A ccache blob
\___ tkt12345 big_key - Another ccache blob

Or possibly:

struct user_namespace
\___ .krb_cache keyring - The register
\___ _krb.0 keyring - Root's Kerberos cache
\___ _krb.5000 keyring - User 5000's Kerberos cache
\___ _krb.5001 keyring - User 5001's Kerberos cache
\___ tkt785 keyring - A ccache
\___ krbtgt/REDHAT.COM@REDHAT.COM big_key
\___ http/REDHAT.COM@REDHAT.COM user
\___ afs/REDHAT.COM@REDHAT.COM user
\___ nfs/REDHAT.COM@REDHAT.COM user
\___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key
\___ http/KERNEL.ORG@KERNEL.ORG big_key

What goes into a particular Kerberos cache is entirely up to userspace. Kernel
support is limited to giving you the Kerberos cache keyring that you want.

The user asks for their Kerberos cache by:

krb_cache = keyctl_get_krbcache(uid, dest_keyring);

The uid is -1 or the user's own UID for the user's own cache or the uid of some
other user's cache (requires CAP_SETUID). This permits rpc.gssd or whatever to
mess with the cache.

The cache returned is a keyring named "_krb.<uid>" that the possessor can read,
search, clear, invalidate, unlink from and add links to. Active LSMs get a
chance to rule on whether the caller is permitted to make a link.

Each uid's cache keyring is created when it first accessed and is given a
timeout that is extended each time this function is called so that the keyring
goes away after a while. The timeout is configurable by sysctl but defaults to
three days.

Each user_namespace struct gets a lazily-created keyring that serves as the
register. The cache keyrings are added to it. This means that standard key
search and garbage collection facilities are available.

The user_namespace struct's register goes away when it does and anything left
in it is then automatically gc'd.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Simo Sorce <simo@redhat.com>
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>

show more ...


# e51db735 31-Mar-2013 Eric W. Biederman <ebiederm@xmission.com>

userns: Better restrictions on when proc and sysfs can be mounted

Rely on the fact that another flavor of the filesystem is already
mounted and do not rely on state in the user namespace.

Verify th

userns: Better restrictions on when proc and sysfs can be mounted

Rely on the fact that another flavor of the filesystem is already
mounted and do not rely on state in the user namespace.

Verify that the mounted filesystem is not covered in any significant
way. I would love to verify that the previously mounted filesystem
has no mounts on top but there are at least the directories
/proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
for other filesystems to mount on top of.

Refactor the test into a function named fs_fully_visible and call that
function from the mount routines of proc and sysfs. This makes this
test local to the filesystems involved and the results current of when
the mounts take place, removing a weird threading of the user
namespace, the mount namespace and the filesystems themselves.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

show more ...


1234