#
cd3cec0a |
| 15-Feb-2024 |
Roberto Sassu <roberto.sassu@huawei.com> |
ima: Move to LSM infrastructure
Move hardcoded IMA function calls (not appraisal-specific functions) from various places in the kernel to the LSM infrastructure, by introducing a new LSM named 'ima'
ima: Move to LSM infrastructure
Move hardcoded IMA function calls (not appraisal-specific functions) from various places in the kernel to the LSM infrastructure, by introducing a new LSM named 'ima' (at the end of the LSM list and always enabled like 'integrity').
Having IMA before EVM in the Makefile is sufficient to preserve the relative order of the new 'ima' LSM in respect to the upcoming 'evm' LSM, and thus the order of IMA and EVM function calls as when they were hardcoded.
Make moved functions as static (except ima_post_key_create_or_update(), which is not in ima_main.c), and register them as implementation of the respective hooks in the new function init_ima_lsm().
Select CONFIG_SECURITY_PATH, to ensure that the path-based LSM hook path_post_mknod is always available and ima_post_path_mknod() is always executed to mark files as new, as before the move.
A slight difference is that IMA and EVM functions registered for the inode_post_setattr, inode_post_removexattr, path_post_mknod, inode_post_create_tmpfile, inode_post_set_acl and inode_post_remove_acl won't be executed for private inodes. Since those inodes are supposed to be fs-internal, they should not be of interest to IMA or EVM. The S_PRIVATE flag is used for anonymous inodes, hugetlbfs, reiserfs xattrs, XFS scrub and kernel-internal tmpfs files.
Conditionally register ima_post_key_create_or_update() if CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS is enabled. Also, conditionally register ima_kernel_module_request() if CONFIG_INTEGRITY_ASYMMETRIC_KEYS is enabled.
Finally, add the LSM_ID_IMA case in lsm_list_modules_test.c.
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Acked-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
show more ...
|
#
5f0d594c |
| 02-Feb-2024 |
Tony Solomonik <tony.solomonik@gmail.com> |
Add do_ftruncate that truncates a struct file
do_sys_ftruncate receives a file descriptor, fgets the struct file, and finally actually truncates the file.
do_ftruncate allows for passing in a file
Add do_ftruncate that truncates a struct file
do_sys_ftruncate receives a file descriptor, fgets the struct file, and finally actually truncates the file.
do_ftruncate allows for passing in a file directly, with the caller already holding a reference to it.
Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20240202121724.17461-2-tony.solomonik@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
show more ...
|
#
a69ce85e |
| 31-Jan-2024 |
Jeff Layton <jlayton@kernel.org> |
filelock: split common fields into struct file_lock_core
In a future patch, we're going to split file leases into their own structure. Since a lot of the underlying machinery uses the same fields mo
filelock: split common fields into struct file_lock_core
In a future patch, we're going to split file leases into their own structure. Since a lot of the underlying machinery uses the same fields move those into a new file_lock_core, and embed that inside struct file_lock.
For now, add some macros to ensure that we can continue to build while the conversion is in progress.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240131-flsplit-v3-17-c6129007ee8d@kernel.org Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
f91a704f |
| 02-Oct-2023 |
Amir Goldstein <amir73il@gmail.com> |
fs: prepare for stackable filesystems backing file helpers
In preparation for factoring out some backing file io helpers from overlayfs, move backing_file_open() into a new file fs/backing-file.c an
fs: prepare for stackable filesystems backing file helpers
In preparation for factoring out some backing file io helpers from overlayfs, move backing_file_open() into a new file fs/backing-file.c and header.
Add a MAINTAINERS entry for stackable filesystems and add a Kconfig FS_STACK which stackable filesystems need to select.
For now, the backing_file struct, the backing_file alloc/free functions and the backing_file_real_path() accessor remain internal to file_table.c. We may change that in the future.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
show more ...
|
#
ae191417 |
| 15-Dec-2023 |
Jens Axboe <axboe@kernel.dk> |
cred: get rid of CONFIG_DEBUG_CREDENTIALS
This code is rarely (never?) enabled by distros, and it hasn't caught anything in decades. Let's kill off this legacy debug code.
Suggested-by: Linus Torva
cred: get rid of CONFIG_DEBUG_CREDENTIALS
This code is rarely (never?) enabled by distros, and it hasn't caught anything in decades. Let's kill off this legacy debug code.
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
d9e5d310 |
| 12-Dec-2023 |
Amir Goldstein <amir73il@gmail.com> |
fsnotify: optionally pass access range in file permission hooks
In preparation for pre-content permission events with file access range, move fsnotify_file_perm() hook out of security_file_permissio
fsnotify: optionally pass access range in file permission hooks
In preparation for pre-content permission events with file access range, move fsnotify_file_perm() hook out of security_file_permission() and into the callers.
Callers that have the access range information call the new hook fsnotify_file_area_perm() with the access range.
Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231212094440.250945-6-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
a88c955f |
| 30-Nov-2023 |
Christian Brauner <brauner@kernel.org> |
file: s/close_fd_get_file()/file_close_fd()/g
That really shouldn't have "get" in there as that implies we're bumping the reference count which we don't do at all. We used to but not anmore. Now we'
file: s/close_fd_get_file()/file_close_fd()/g
That really shouldn't have "get" in there as that implies we're bumping the reference count which we don't do at all. We used to but not anmore. Now we're just closing the fd and pick that file from the fdtable without bumping the reference count. Update the wrong documentation while at it.
Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-1-e73ca6f4ea83@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
d2185690 |
| 31-Oct-2023 |
Bagas Sanjaya <bagasdotme@gmail.com> |
fs: Clarify "non-RCY" in access_override_creds() comment
The term is originally intended as a joke that stands for "non-racy". This trips new contributors who mistake it for RCU typo [1].
Replace t
fs: Clarify "non-RCY" in access_override_creds() comment
The term is originally intended as a joke that stands for "non-racy". This trips new contributors who mistake it for RCU typo [1].
Replace the term with more-explicit wording.
Link: https://lore.kernel.org/r/20231030-debatten-nachrangig-f58abcdac530@brauner/ Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20231031114728.41485-1-bagasdotme@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
def3ae83 |
| 09-Oct-2023 |
Amir Goldstein <amir73il@gmail.com> |
fs: store real path instead of fake path in backing file f_path
A backing file struct stores two path's, one "real" path that is referring to f_inode and one "fake" path, which should be displayed t
fs: store real path instead of fake path in backing file f_path
A backing file struct stores two path's, one "real" path that is referring to f_inode and one "fake" path, which should be displayed to users in /proc/<pid>/maps.
There is a lot more potential code that needs to know the "real" path, then code that needs to know the "fake" path.
Instead of code having to request the "real" path with file_real_path(), store the "real" path in f_path and require code that needs to know the "fake" path request it with file_user_path(). Replace the file_real_path() helper with a simple const accessor f_path().
After this change, file_dentry() is not expected to observe any files with overlayfs f_path and real f_inode, so the call to ->d_real() should not be needed. Leave the ->d_real() call for now and add an assertion in ovl_d_real() to catch if we made wrong assumptions.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Link: https://lore.kernel.org/r/CAJfpegtt48eXhhjDFA1ojcHPNKj3Go6joryCPtEFAKpocyBsnw@mail.gmail.com/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231009153712.1566422-4-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
83bc1d29 |
| 09-Oct-2023 |
Amir Goldstein <amir73il@gmail.com> |
fs: get mnt_writers count for an open backing file's real path
A writeable mapped backing file can perform writes to the real inode. Therefore, the real path mount must be kept writable so long as t
fs: get mnt_writers count for an open backing file's real path
A writeable mapped backing file can perform writes to the real inode. Therefore, the real path mount must be kept writable so long as the writable map exists.
This may not be strictly needed for ovelrayfs private upper mount, but it is correct to take the mnt_writers count in the vfs helper.
Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231009153712.1566422-2-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
3e15dcf7 |
| 08-Sep-2023 |
Amir Goldstein <amir73il@gmail.com> |
fs: rename __mnt_{want,drop}_write*() helpers
Before exporting these helpers to modules, make their names more meaningful.
The names mnt_{get,put)_write_access*() were chosen, because they rhyme wi
fs: rename __mnt_{want,drop}_write*() helpers
Before exporting these helpers to modules, make their names more meaningful.
The names mnt_{get,put)_write_access*() were chosen, because they rhyme with the inode {get,put)_write_access() helpers, which have a very close meaning for the inode object.
Suggested-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230817-anfechtbar-ruhelosigkeit-8c6cca8443fc@brauner/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230908132900.2983519-2-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
35931eb3 |
| 18-Aug-2023 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
fs: Fix kernel-doc warnings
These have a variety of causes and a corresponding variety of solutions.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Message-Id: <20230818200824.27200
fs: Fix kernel-doc warnings
These have a variety of causes and a corresponding variety of solutions.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Message-Id: <20230818200824.2720007-1-willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
021a160a |
| 08-Aug-2023 |
Linus Torvalds <torvalds@linux-foundation.org> |
fs: use __fput_sync in close(2)
close(2) is a special case which guarantees a shallow kernel stack, making delegation to task_work machinery unnecessary. Said delegation is problematic as it involve
fs: use __fput_sync in close(2)
close(2) is a special case which guarantees a shallow kernel stack, making delegation to task_work machinery unnecessary. Said delegation is problematic as it involves atomic ops and interrupt masking trips, none of which are cheap on x86-64. Forcing close(2) to do it looks like an oversight in the original work.
Moreover presence of CONFIG_RSEQ adds an additional overhead as fput() -> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the thread returning to userspace land in resume_user_mode_work(), where rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for the thread (and it is by default with contemporary glibc).
Sample result when benchmarking open1_processes -t 1 from will-it-scale (that's an open + close loop) + tmpfs on /tmp, running on the Sapphire Rapid CPU (ops/s): stock+RSEQ: 1329857 stock-RSEQ: 1421667 (+7%) patched: 1523521 (+14.5% / +7%) (with / without rseq)
Patched result is the same regardless of rseq as the codepath is avoided.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
a0fc452a |
| 05-Aug-2023 |
Aleksa Sarai <cyphar@cyphar.com> |
open: make RESOLVE_CACHED correctly test for O_TMPFILE
O_TMPFILE is actually __O_TMPFILE|O_DIRECTORY. This means that the old fast-path check for RESOLVE_CACHED would reject all users passing O_DIRE
open: make RESOLVE_CACHED correctly test for O_TMPFILE
O_TMPFILE is actually __O_TMPFILE|O_DIRECTORY. This means that the old fast-path check for RESOLVE_CACHED would reject all users passing O_DIRECTORY with -EAGAIN, when in fact the intended test was to check for __O_TMPFILE.
Cc: stable@vger.kernel.org # v5.12+ Fixes: 99668f618062 ("fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Message-Id: <20230806-resolve_cached-o_tmpfile-v1-1-7ba16308465e@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
5daeb41a |
| 28-Jul-2023 |
Aleksa Sarai <cyphar@cyphar.com> |
fchmodat2: add support for AT_EMPTY_PATH
This allows userspace to avoid going through /proc/self/fd when dealing with all types of file descriptors for chmod(), and makes fchmodat2() a proper supers
fchmodat2: add support for AT_EMPTY_PATH
This allows userspace to avoid going through /proc/self/fd when dealing with all types of file descriptors for chmod(), and makes fchmodat2() a proper superset of all other chmod syscalls.
The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is that fchmod() doesn't operate on O_PATH file descriptors by design. To quote open(2):
> O_PATH (since Linux 2.6.39) > [...] > The file itself is not opened, and other file operations (e.g., > read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2), > mmap(2)) fail with the error EBADF.
However, procfs has allowed userspace to do this operation ever since the introduction of O_PATH through magic-links, so adding this feature is only an improvement for programs that have to mess around with /proc/self/fd/$n today to get this behaviour. In addition, fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and allows chown() operations directly on O_PATH descriptors.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Acked-by: Alexey Gladkov <legion@kernel.org> Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
09da082b |
| 11-Jul-2023 |
Alexey Gladkov <legion@kernel.org> |
fs: Add fchmodat2()
On the userspace side fchmodat(3) is implemented as a wrapper function which implements the POSIX-specified interface. This interface differs from the underlying kernel system ca
fs: Add fchmodat2()
On the userspace side fchmodat(3) is implemented as a wrapper function which implements the POSIX-specified interface. This interface differs from the underlying kernel system call, which does not have a flags argument. Most implementations require procfs [1][2].
There doesn't appear to be a good userspace workaround for this issue but the implementation in the kernel is pretty straight-forward.
The new fchmodat2() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, unlike existing fchmodat.
[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [2] https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28
Co-developed-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Message-Id: <f2a846ef495943c5d101011eebcf01179d0c7b61.1689092120.git.legion@kernel.org> [brauner: pre reviews, do flag conversion in do_fchmodat() directly] Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
62d53c4a |
| 15-Jun-2023 |
Amir Goldstein <amir73il@gmail.com> |
fs: use backing_file container for internal files with "fake" f_path
Overlayfs uses open_with_fake_path() to allocate internal kernel files, with a "fake" path - whose f_path is not on the same fs a
fs: use backing_file container for internal files with "fake" f_path
Overlayfs uses open_with_fake_path() to allocate internal kernel files, with a "fake" path - whose f_path is not on the same fs as f_inode.
Allocate a container struct backing_file for those internal files, that is used to hold the "fake" ovl path along with the real path.
backing_file_real_path() can be used to access the stored real path.
Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230615112229.2143178-5-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
cbb0b9d4 |
| 15-Jun-2023 |
Amir Goldstein <amir73il@gmail.com> |
fs: use a helper for opening kernel internal files
cachefiles uses kernel_open_tmpfile() to open kernel internal tmpfile without accounting for nr_files.
cachefiles uses open_with_fake_path() for t
fs: use a helper for opening kernel internal files
cachefiles uses kernel_open_tmpfile() to open kernel internal tmpfile without accounting for nr_files.
cachefiles uses open_with_fake_path() for the same reason without the need for a fake path.
Fork open_with_fake_path() to kernel_file_open() which only does the noaccount part and use it in cachefiles.
Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230615112229.2143178-3-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
7b8c9d7b |
| 11-Jun-2023 |
Amir Goldstein <amir73il@gmail.com> |
fsnotify: move fsnotify_open() hook into do_dentry_open()
fsnotify_open() hook is called only from high level system calls context and not called for the very many helpers to open files.
This may m
fsnotify: move fsnotify_open() hook into do_dentry_open()
fsnotify_open() hook is called only from high level system calls context and not called for the very many helpers to open files.
This may makes sense for many of the special file open cases, but it is inconsistent with fsnotify_close() hook that is called for every last fput() of on a file object with FMODE_OPENED.
As a result, it is possible to observe ACCESS, MODIFY and CLOSE events without ever observing an OPEN event.
Fix this inconsistency by replacing all the fsnotify_open() hooks with a single hook inside do_dentry_open().
If there are special cases that would like to opt-out of the possible overhead of fsnotify() call in fsnotify_open(), they would probably also want to avoid the overhead of fsnotify() call in the rest of the fsnotify hooks, so they should be opening that file with the __FMODE_NONOTIFY flag.
However, in the majority of those cases, the s_fsnotify_connectors optimization in fsnotify_parent() would be sufficient to avoid the overhead of fsnotify() call anyway.
Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230611122429.1499617-1-amir73il@gmail.com>
show more ...
|
#
cedd0bdc |
| 02-May-2023 |
Min-Hua Chen <minhuadotchen@gmail.com> |
fs: fix incorrect fmode_t casts
Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY to fixes the following sparce warnings: fs/overlayfs/file.c:48:37: sparse: warning: restricted fmode_t degrades to inte
fs: fix incorrect fmode_t casts
Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY to fixes the following sparce warnings: fs/overlayfs/file.c:48:37: sparse: warning: restricted fmode_t degrades to integer fs/overlayfs/file.c:128:13: sparse: warning: restricted fmode_t degrades to integer fs/open.c:1159:21: sparse: warning: restricted fmode_t degrades to integer
Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com> Message-Id: <20230502232210.119063-1-minhuadotchen@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
55650b2f |
| 06-May-2023 |
Anuradha Weeraman <anuradha@debian.org> |
fs/open.c: Fix W=1 kernel doc warnings
fs/open.c: In functions 'setattr_vfsuid' and 'setattr_vfsgid': warning: Function parameter or member 'attr' not described - Fix warning by removing kernel-do
fs/open.c: Fix W=1 kernel doc warnings
fs/open.c: In functions 'setattr_vfsuid' and 'setattr_vfsgid': warning: Function parameter or member 'attr' not described - Fix warning by removing kernel-doc for these as they are static inline functions and not required to be exposed via kernel-doc.
fs/open.c: warning: Excess function parameter 'opened' description in 'finish_open' warning: Excess function parameter 'cred' description in 'vfs_open' - Fix by removing the parameters from the kernel-doc as they are no longer required by the function.
Signed-off-by: Anuradha Weeraman <anuradha@debian.org> Message-Id: <20230506182928.384105-1-anuradha@debian.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
43b45063 |
| 21-Mar-2023 |
Christian Brauner <brauner@kernel.org> |
open: return EINVAL for O_DIRECTORY | O_CREAT
After a couple of years and multiple LTS releases we received a report that the behavior of O_DIRECTORY | O_CREAT changed starting with v5.7.
On kernel
open: return EINVAL for O_DIRECTORY | O_CREAT
After a couple of years and multiple LTS releases we received a report that the behavior of O_DIRECTORY | O_CREAT changed starting with v5.7.
On kernels prior to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL had the following semantics:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: create regular file * d exists and is a regular file: ENOTDIR * d exists and is a directory: EISDIR
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: create regular file * d exists and is a regular file: EEXIST * d exists and is a directory: EEXIST
(3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory
On kernels since to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL have the following semantics:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: ENOTDIR (create regular file) * d exists and is a regular file: ENOTDIR * d exists and is a directory: EISDIR
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: ENOTDIR (create regular file) * d exists and is a regular file: EEXIST * d exists and is a directory: EEXIST
(3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory
This is a fairly substantial semantic change that userspace didn't notice until Pedro took the time to deliberately figure out corner cases. Since no one noticed this breakage we can somewhat safely assume that O_DIRECTORY | O_CREAT combinations are likely unused.
The v5.7 breakage is especially weird because while ENOTDIR is returned indicating failure a regular file is actually created. This doesn't make a lot of sense.
Time was spent finding potential users of this combination. Searching on codesearch.debian.net showed that codebases often express semantical expectations about O_DIRECTORY | O_CREAT which are completely contrary to what our code has done and currently does.
The expectation often is that this particular combination would create and open a directory. This suggests users who tried to use that combination would stumble upon the counterintuitive behavior no matter if pre-v5.7 or post v5.7 and quickly realize neither semantics give them what they want. For some examples see the code examples in [1] to [3] and the discussion in [4].
There are various ways to address this issue. The lazy/simple option would be to restore the pre-v5.7 behavior and to just live with that bug forever. But since there's a real chance that the O_DIRECTORY | O_CREAT quirk isn't relied upon we should try to get away with murder(ing bad semantics) first. If we need to Frankenstein pre-v5.7 behavior later so be it.
So let's simply return EINVAL categorically for O_DIRECTORY | O_CREAT combinations. In addition to cleaning up the old bug this also opens up the possiblity to make that flag combination do something more intuitive in the future.
Starting with this commit the following semantics apply:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: EINVAL * d exists and is a regular file: EINVAL * d exists and is a directory: EINVAL
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: EINVAL * d exists and is a regular file: EINVAL * d exists and is a directory: EINVAL
(3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory
One additional note, O_TMPFILE is implemented as:
#define __O_TMPFILE 020000000 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
For older kernels it was important to return an explicit error when O_TMPFILE wasn't supported. So O_TMPFILE requires that O_DIRECTORY is raised alongside __O_TMPFILE. It also enforced that O_CREAT wasn't specified. Since O_DIRECTORY | O_CREAT could be used to create a regular allowing that combination together with __O_TMPFILE would've meant that false positives were possible, i.e., that a regular file was created instead of a O_TMPFILE. This could've been used to trick userspace into thinking it operated on a O_TMPFILE when it wasn't.
Now that we block O_DIRECTORY | O_CREAT completely the check for O_CREAT in the __O_TMPFILE branch via if ((flags & O_TMPFILE_MASK) != O_TMPFILE) can be dropped. Instead we can simply check verify that O_DIRECTORY is raised via if (!(flags & O_DIRECTORY)) and explain this in two comments.
As Aleksa pointed out O_PATH is unaffected by this change since it always returned EINVAL if O_CREAT was specified - with or without O_DIRECTORY.
Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-dirfd.c/?hl=324#L324 [1] Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/glnx-shutil.c/?hl=251#L251 [2] Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324#L324 [3] Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4] Reported-by: Pedro Falcato <pedro.falcato@gmail.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
981ee95c |
| 25-Jan-2023 |
Mateusz Guzik <mjguzik@gmail.com> |
vfs: avoid duplicating creds in faccessat if possible
access(2) remains commonly used, for example on exec: access("/etc/ld.so.preload", R_OK)
or when running gcc: strace -c gcc empty.c
% time
vfs: avoid duplicating creds in faccessat if possible
access(2) remains commonly used, for example on exec: access("/etc/ld.so.preload", R_OK)
or when running gcc: strace -c gcc empty.c
% time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 0.00 0.000000 0 42 26 access
It falls down to do_faccessat without the AT_EACCESS flag, which in turn results in allocation of new creds in order to modify fsuid/fsgid and caps. This is a very expensive process single-threaded and most notably multi-threaded, with numerous structures getting refed and unrefed on imminent new cred destruction.
Turns out for typical consumers the resulting creds would be identical and this can be checked upfront, avoiding the hard work.
An access benchmark plugged into will-it-scale running on Cascade Lake shows:
test proc before after access1 1 1310582 2908735 (+121%) # distinct files access1 24 4716491 63822173 (+1353%) # distinct files access2 24 2378041 5370335 (+125%) # same file
The above benchmarks are not integrated into will-it-scale, but can be found in a pull request:
https://github.com/antonblanchard/will-it-scale/pull/36/files
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
47d58691 |
| 16-Jan-2023 |
Jann Horn <jannh@google.com> |
fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detected
Currently, filp_close() and generic_shutdown_super() use printk() to log messages when bugs are detected. This is problematic because in
fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detected
Currently, filp_close() and generic_shutdown_super() use printk() to log messages when bugs are detected. This is problematic because infrastructure like syzkaller has no idea that this message indicates a bug. In addition, some people explicitly want their kernels to BUG() when kernel data corruption has been detected (CONFIG_BUG_ON_DATA_CORRUPTION). And finally, when generic_shutdown_super() detects remaining inodes on a system without CONFIG_BUG_ON_DATA_CORRUPTION, it would be nice if later accesses to a busy inode would at least crash somewhat cleanly rather than walking through freed memory.
To address all three, use CHECK_DATA_CORRUPTION() when kernel bugs are detected.
Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
#
4d7ca409 |
| 13-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port vfs{g,u}id helpers to mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is ju
fs: port vfs{g,u}id helpers to mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|