#
0611a640 |
| 21-Feb-2024 |
Dmitry Antipov <dmantipov@yandex.ru> |
eventpoll: prefer kfree_rcu() in __ep_remove()
In '__ep_remove()', prefer 'kfree_rcu()' over 'call_rcu()' with dummy 'epi_rcu_free()' callback. This follows commit d0089603fa7a ("fs: prefer kfree_rc
eventpoll: prefer kfree_rcu() in __ep_remove()
In '__ep_remove()', prefer 'kfree_rcu()' over 'call_rcu()' with dummy 'epi_rcu_free()' callback. This follows commit d0089603fa7a ("fs: prefer kfree_rcu() in fasync_remove_entry()") and should not be backported to stable as well.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Link: https://lore.kernel.org/r/20240221112205.48389-2-dmantipov@yandex.ru Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
18e2bf0e |
| 13-Feb-2024 |
Joe Damato <jdamato@fastly.com> |
eventpoll: Add epoll ioctl for epoll_params
Add an ioctl for getting and setting epoll_params. User programs can use this ioctl to get and set the busy poll usec time, packet budget, and prefer busy
eventpoll: Add epoll ioctl for epoll_params
Add an ioctl for getting and setting epoll_params. User programs can use this ioctl to get and set the busy poll usec time, packet budget, and prefer busy poll params for a specific epoll context.
Parameters are limited: - busy_poll_usecs is limited to <= s32_max - busy_poll_budget is limited to <= NAPI_POLL_WEIGHT by unprivileged users (!capable(CAP_NET_ADMIN)) - prefer_busy_poll must be 0 or 1 - __pad must be 0
Signed-off-by: Joe Damato <jdamato@fastly.com> Acked-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Jiri Slaby <jirislaby@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
de57a251 |
| 13-Feb-2024 |
Joe Damato <jdamato@fastly.com> |
eventpoll: Add per-epoll prefer busy poll option
When using epoll-based busy poll, the prefer_busy_poll option is hardcoded to false. Users may want to enable prefer_busy_poll to be used in conjunct
eventpoll: Add per-epoll prefer busy poll option
When using epoll-based busy poll, the prefer_busy_poll option is hardcoded to false. Users may want to enable prefer_busy_poll to be used in conjunction with gro_flush_timeout and defer_hard_irqs_count to keep device IRQs masked.
Other busy poll methods allow enabling or disabling prefer busy poll via SO_PREFER_BUSY_POLL, but epoll-based busy polling uses a hardcoded value.
Fix this edge case by adding support for a per-epoll context prefer_busy_poll option. The default is false, as it was hardcoded before this change.
Signed-off-by: Joe Damato <jdamato@fastly.com> Acked-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
c6aa2a77 |
| 13-Feb-2024 |
Joe Damato <jdamato@fastly.com> |
eventpoll: Add per-epoll busy poll packet budget
When using epoll-based busy poll, the packet budget is hardcoded to BUSY_POLL_BUDGET (8). Users may desire larger busy poll budgets, which can potent
eventpoll: Add per-epoll busy poll packet budget
When using epoll-based busy poll, the packet budget is hardcoded to BUSY_POLL_BUDGET (8). Users may desire larger busy poll budgets, which can potentially increase throughput when busy polling under high network load.
Other busy poll methods allow setting the busy poll budget via SO_BUSY_POLL_BUDGET, but epoll-based busy polling uses a hardcoded value.
Fix this edge case by adding support for a per-epoll context busy poll packet budget. If not specified, the default value (BUSY_POLL_BUDGET) is used.
Signed-off-by: Joe Damato <jdamato@fastly.com> Acked-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
85455c79 |
| 13-Feb-2024 |
Joe Damato <jdamato@fastly.com> |
eventpoll: support busy poll per epoll instance
Allow busy polling on a per-epoll context basis. The per-epoll context usec timeout value is preferred, but the pre-existing system wide sysctl value
eventpoll: support busy poll per epoll instance
Allow busy polling on a per-epoll context basis. The per-epoll context usec timeout value is preferred, but the pre-existing system wide sysctl value is still supported if it specified.
busy_poll_usecs is a u32, but in a follow up patch the ioctl provided to the user only allows setting a value from 0 to S32_MAX.
Signed-off-by: Joe Damato <jdamato@fastly.com> Acked-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
e6f79580 |
| 06-Feb-2024 |
Huang Xiaojia <huangxiaojia2@huawei.com> |
epoll: Remove ep_scan_ready_list() in comments
Since commit 443f1a042233 ("lift the calls of ep_send_events_proc() into the callers"), ep_scan_ready_list() has been removed. But there are still seve
epoll: Remove ep_scan_ready_list() in comments
Since commit 443f1a042233 ("lift the calls of ep_send_events_proc() into the callers"), ep_scan_ready_list() has been removed. But there are still several in comments. All of them should be replaced with other caller functions.
Signed-off-by: Huang Xiaojia <huangxiaojia2@huawei.com> Link: https://lore.kernel.org/r/20240206014353.4191262-1-huangxiaojia2@huawei.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
9d5b9475 |
| 21-Nov-2023 |
Joel Granados <j.granados@samsung.com> |
fs: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels
fs: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)
Remove sentinel elements ctl_table struct. Special attention was placed in making sure that an empty directory for fs/verity was created when CONFIG_FS_VERITY_BUILTIN_SIGNATURES is not defined. In this case we use the register sysctl call that expects a size.
Signed-off-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
show more ...
|
#
68279f9c |
| 11-Oct-2023 |
Alexey Dobriyan <adobriyan@gmail.com> |
treewide: mark stuff as __ro_after_init
__read_mostly predates __ro_after_init. Many variables which are marked __read_mostly should have been __ro_after_init from day 1.
Also, mark some stuff as "
treewide: mark stuff as __ro_after_init
__read_mostly predates __ro_after_init. Many variables which are marked __read_mostly should have been __ro_after_init from day 1.
Also, mark some stuff as "const" and "__init" while I'm at it.
[akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning] [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
#
05f26f86 |
| 26-Jul-2023 |
Zhen Lei <thunder.leizhen@huawei.com> |
epoll: simplify ep_alloc()
The get_current_user() does not fail, and moving it after kzalloc() can simplify the code a bit.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Message-Id: <2023072
epoll: simplify ep_alloc()
The get_current_user() does not fail, and moving it after kzalloc() can simplify the code a bit.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Message-Id: <20230726032135.933-1-thunder.leizhen@huaweicloud.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
2192bba0 |
| 30-May-2023 |
Benjamin Segall <bsegall@google.com> |
epoll: ep_autoremove_wake_function should use list_del_init_careful
autoremove_wake_function uses list_del_init_careful, so should epoll's more aggressive variant. It only doesn't because it was co
epoll: ep_autoremove_wake_function should use list_del_init_careful
autoremove_wake_function uses list_del_init_careful, so should epoll's more aggressive variant. It only doesn't because it was copied from an older wait.c rather than the most recent.
[bsegall@google.com: add comment] Link: https://lkml.kernel.org/r/xm26bki0ulsr.fsf_-_@google.com Link: https://lkml.kernel.org/r/xm26pm6hvfer.fsf@google.com Fixes: a16ceb139610 ("epoll: autoremove wakers even more aggressively") Signed-off-by: Ben Segall <bsegall@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
#
38f1755a |
| 11-May-2023 |
Min-Hua Chen <minhuadotchen@gmail.com> |
fs: use correct __poll_t type
Fix the following sparse warnings by using __poll_t instead of unsigned type.
fs/eventpoll.c:541:9: sparse: warning: restricted __poll_t degrades to integer fs/eventfd
fs: use correct __poll_t type
Fix the following sparse warnings by using __poll_t instead of unsigned type.
fs/eventpoll.c:541:9: sparse: warning: restricted __poll_t degrades to integer fs/eventfd.c:67:17: sparse: warning: restricted __poll_t degrades to integer
Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com> Message-Id: <20230511164628.336586-1-minhuadotchen@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
d4cb626d |
| 11-Apr-2023 |
Davidlohr Bueso <dave@stgolabs.net> |
epoll: rename global epmutex
As of 4f04cbaf128 ("epoll: use refcount to reduce ep_mutex contention"), this lock is now specific to nesting cases - inserting an epoll fd onto another epoll fd. Renam
epoll: rename global epmutex
As of 4f04cbaf128 ("epoll: use refcount to reduce ep_mutex contention"), this lock is now specific to nesting cases - inserting an epoll fd onto another epoll fd. Rename the lock to be less generic.
Link: https://lkml.kernel.org/r/20230411234159.20421-1-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
#
58c9b016 |
| 22-Mar-2023 |
Paolo Abeni <pabeni@redhat.com> |
epoll: use refcount to reduce ep_mutex contention
We are observing huge contention on the epmutex during an http connection/rate test:
83.17% 0.25% nginx [kernel.kallsyms] [k]
epoll: use refcount to reduce ep_mutex contention
We are observing huge contention on the epmutex during an http connection/rate test:
83.17% 0.25% nginx [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe [...] |--66.96%--__fput |--60.04%--eventpoll_release_file |--58.41%--__mutex_lock.isra.6 |--56.56%--osq_lock
The application is multi-threaded, creates a new epoll entry for each incoming connection, and does not delete it before the connection shutdown - that is, before the connection's fd close().
Many different threads compete frequently for the epmutex lock, affecting the overall performance.
To reduce the contention this patch introduces explicit reference counting for the eventpoll struct. Each registered event acquires a reference, and references are released at ep_remove() time.
The eventpoll struct is released by whoever - among EP file close() and and the monitored file close() drops its last reference.
Additionally, this introduces a new 'dying' flag to prevent races between the EP file close() and the monitored file close(). ep_eventpoll_release() marks, under f_lock spinlock, each epitem as dying before removing it, while EP file close() does not touch dying epitems.
The above is needed as both close operations could run concurrently and drop the EP reference acquired via the epitem entry. Without the above flag, the monitored file close() could reach the EP struct via the epitem list while the epitem is still listed and then try to put it after its disposal.
An alternative could be avoiding touching the references acquired via the epitems at EP file close() time, but that could leave the EP struct alive for potentially unlimited time after EP file close(), with nasty side effects.
With all the above in place, we can drop the epmutex usage at disposal time.
Overall this produces a significant performance improvement in the mentioned connection/rate scenario: the mutex operations disappear from the topmost offenders in the perf report, and the measured connections/rate grows by ~60%.
To make the change more readable this additionally renames ep_free() to ep_clear_and_put(), and moves the actual memory cleanup in a separate ep_free() helper.
Link: https://lkml.kernel.org/r/4a57788dcaf28f5eb4f8dfddcc3a8b172a7357bb.1679504153.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Xiumei Mu <xmu@redhiat.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Jacob Keller <jacob.e.keller@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
#
7059a9aa |
| 12-Mar-2023 |
Changcheng Liu <changchengx.liu@outlook.com> |
eventpoll: align comment with nested epoll limitation
fix comment in commit 02edc6fc4d5f ("epoll: comment the funky #ifdef")
Signed-off-by: Liu, Changcheng <changchengx.liu@outlook.com> Signed-off-
eventpoll: align comment with nested epoll limitation
fix comment in commit 02edc6fc4d5f ("epoll: comment the funky #ifdef")
Signed-off-by: Liu, Changcheng <changchengx.liu@outlook.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
#
063f3ed9 |
| 10-Mar-2023 |
Palmer Dabbelt <palmer@dabbelt.com> |
Move ep_take_care_of_epollwakeup() to fs/eventpoll.c
This doesn't make any sense to expose to userspace, so it's been moved to the one user. This was introduced by commit 95f19f658ce1 ("epoll: drop
Move ep_take_care_of_epollwakeup() to fs/eventpoll.c
This doesn't make any sense to expose to userspace, so it's been moved to the one user. This was introduced by commit 95f19f658ce1 ("epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled").
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com> Reviewed-by: Andrew Waterman <waterman@eecs.berkeley.edu> Reviewed-by: Albert Ou <aou@eecs.berkeley.edu> Message-Id: <1447119071-19392-7-git-send-email-palmer@dabbelt.com> [thuth: Rebased to fix contextual conflicts] Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
show more ...
|
#
caf1aeaf |
| 20-Nov-2022 |
Jens Axboe <axboe@kernel.dk> |
eventpoll: add EPOLL_URING_WAKE poll wakeup flag
We can have dependencies between epoll and io_uring. Consider an epoll context, identified by the epfd file descriptor, and an io_uring file descript
eventpoll: add EPOLL_URING_WAKE poll wakeup flag
We can have dependencies between epoll and io_uring. Consider an epoll context, identified by the epfd file descriptor, and an io_uring file descriptor identified by iofd. If we add iofd to the epfd context, and arm a multishot poll request for epfd with iofd, then the multishot poll request will repeatedly trigger and generate events until terminated by CQ ring overflow. This isn't a desired behavior.
Add EPOLL_URING so that io_uring can pass it in as part of the poll wakeup key, and io_uring can check for that to detect a potential recursive invocation.
Cc: stable@vger.kernel.org # 6.0 Signed-off-by: Jens Axboe <axboe@kernel.dk>
show more ...
|
#
693fc06e |
| 14-Jul-2022 |
Uros Bizjak <ubizjak@gmail.com> |
epoll: use try_cmpxchg in list_add_tail_lockless
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in list_add_tail_lockless. x86 CMPXCHG instruction returns success in ZF flag, so this ch
epoll: use try_cmpxchg in list_add_tail_lockless
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in list_add_tail_lockless. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg).
No functional change intended.
Link: https://lkml.kernel.org/r/20220714173255.12987-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
#
a16ceb13 |
| 15-Jun-2022 |
Benjamin Segall <bsegall@google.com> |
epoll: autoremove wakers even more aggressively
If a process is killed or otherwise exits while having active network connections and many threads waiting on epoll_wait, the threads will all be woke
epoll: autoremove wakers even more aggressively
If a process is killed or otherwise exits while having active network connections and many threads waiting on epoll_wait, the threads will all be woken immediately, but not removed from ep->wq. Then when network traffic scans ep->wq in wake_up, every wakeup attempt will fail, and will not remove the entries from the list.
This means that the cost of the wakeup attempt is far higher than usual, does not decrease, and this also competes with the dying threads trying to actually make progress and remove themselves from the wq.
Handle this by removing visited epoll wq entries unconditionally, rather than only when the wakeup succeeds - the structure of ep_poll means that the only potential loss is the timed_out->eavail heuristic, which now can race and result in a redundant ep_send_events attempt. (But only when incoming data and a timeout actually race, not on every timeout)
Shakeel added:
: We are seeing this issue in production with real workloads and it has : caused hard lockups. Particularly network heavy workloads with a lot : of threads in epoll_wait() can easily trigger this issue if they get : killed (oom-killed in our case).
Link: https://lkml.kernel.org/r/xm26fsjotqda.fsf@google.com Signed-off-by: Ben Segall <bsegall@google.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Khazhismel Kumykov <khazhy@google.com> Cc: Heiher <r@hev.cc> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
#
a8f5de89 |
| 22-Jan-2022 |
Xiaoming Ni <nixiaoming@huawei.com> |
eventpoll: simplify sysctl declaration with register_sysctl()
The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain.
To help with
eventpoll: simplify sysctl declaration with register_sysctl()
The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places where they actually belong. The proc sysctl maintainers do not want to know what sysctl knobs you wish to add for your own piece of code, we just care about the core logic.
So move the epoll_table sysctl to fs/eventpoll.c and use register_sysctl().
Link: https://lkml.kernel.org/r/20211123202422.819032-9-mcgrof@kernel.org Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Antti Palosaari <crope@iki.fi> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Clemens Ladisch <clemens@ladisch.de> Cc: David Airlie <airlied@linux.ie> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Kees Cook <keescook@chromium.org> Cc: Lukas Middendorf <kernel@tuxforce.de> Cc: Mark Fasheh <mark@fasheh.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Phillip Potter <phil@philpotter.co.uk> Cc: Qing Wang <wangqing@vivo.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Sebastian Reichel <sre@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Stephen Kitt <steve@sk2.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Douglas Gilbert <dgilbert@interlog.com> Cc: James E.J. Bottomley <jejb@linux.ibm.com> Cc: Jani Nikula <jani.nikula@intel.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
1e1c1583 |
| 08-Sep-2021 |
Nicholas Piggin <npiggin@gmail.com> |
fs/epoll: use a per-cpu counter for user's watches count
This counter tracks the number of watches a user has, to compare against the 'max_user_watches' limit. This causes a scalability bottleneck o
fs/epoll: use a per-cpu counter for user's watches count
This counter tracks the number of watches a user has, to compare against the 'max_user_watches' limit. This causes a scalability bottleneck on SPECjbb2015 on large systems as there is only one user. Changing to a per-cpu counter increases throughput of the benchmark by about 30% on a 16-socket, > 1000 thread system.
[rdunlap@infradead.org: fix build errors in kernel/user.c when CONFIG_EPOLL=n] [npiggin@gmail.com: move ifdefs into wrapper functions, slightly improve panic message] Link: https://lkml.kernel.org/r/1628051945.fens3r99ox.astroid@bobo.none [akpm@linux-foundation.org: tweak user_epoll_alloc(), per Guenter] Link: https://lkml.kernel.org/r/20210804191421.GA1900577@roeck-us.net
Link: https://lkml.kernel.org/r/20210802032013.2751916-1-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reported-by: Anton Blanchard <anton@ozlabs.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
249dbe74 |
| 11-Aug-2021 |
Arnd Bergmann <arnd@arndb.de> |
ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation
The epoll_wait() system call wrapper is one of the remaining users of the set_fs() infrasturcture for Arm. Changing it to not requir
ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation
The epoll_wait() system call wrapper is one of the remaining users of the set_fs() infrasturcture for Arm. Changing it to not require set_fs() is rather complex unfortunately.
The approach I'm taking here is to allow architectures to override the code that copies the output to user space, and let the oabi-compat implementation check whether it is getting called from an EABI or OABI system call based on the thread_info->syscall value.
The in_oabi_syscall() check here mirrors the in_compat_syscall() and in_x32_syscall() helpers for 32-bit compat implementations on other architectures.
Overall, the amount of code goes down, at least with the newly added sys_oabi_epoll_pwait() helper getting removed again. The downside is added complexity in the source code for the native implementation. There should be no difference in runtime performance except for Arm kernels with CONFIG_OABI_COMPAT enabled that now have to go through an external function call to check which of the two variants to use.
Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
show more ...
|
#
7fab29e3 |
| 07-May-2021 |
Davidlohr Bueso <dave@stgolabs.net> |
fs/epoll: restore waking from ep_done_scan()
Commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") changed the userspace visible behavior of exclusive waiters blocked on a com
fs/epoll: restore waking from ep_done_scan()
Commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") changed the userspace visible behavior of exclusive waiters blocked on a common epoll descriptor upon a single event becoming ready.
Previously, all tasks doing epoll_wait would awake, and now only one is awoken, potentially causing missed wakeups on applications that rely on this behavior, such as Apache Qpid.
While the aforementioned commit aims at having only a wakeup single path in ep_poll_callback (with the exceptions of epoll_ctl cases), we need to restore the wakeup in what was the old ep_scan_ready_list() such that the next thread can be awoken, in a cascading style, after the waker's corresponding ep_send_events().
Link: https://lkml.kernel.org/r/20210405231025.33829-3-dave@stgolabs.net Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@akamai.com> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
a6c67fee |
| 01-Mar-2021 |
Randy Dunlap <rdunlap@infradead.org> |
fs: eventpoll: fix comments & kernel-doc notation
Use the documented kernel-doc format for function Return: descriptions. Begin constant values in kernel-doc comments with '%'.
Remove kernel-doc "/
fs: eventpoll: fix comments & kernel-doc notation
Use the documented kernel-doc format for function Return: descriptions. Begin constant values in kernel-doc comments with '%'.
Remove kernel-doc "/**" from 2 functions that are not documented with kernel-doc notation.
Fix typos, punctuation, & grammar.
Also fix a few kernel-doc warnings:
../fs/eventpoll.c:1883: warning: Function parameter or member 'ep' not described in 'ep_loop_check_proc' ../fs/eventpoll.c:1883: warning: Excess function parameter 'priv' description in 'ep_loop_check_proc' ../fs/eventpoll.c:1932: warning: Function parameter or member 'ep' not described in 'ep_loop_check' ../fs/eventpoll.c:1932: warning: Excess function parameter 'from' description in 'ep_loop_check'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
show more ...
|
#
bfe3911a |
| 05-Feb-2021 |
Chris Wilson <chris@chris-wilson.co.uk> |
kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE
Userspace has discovered the functionality offered by SYS_kcmp and has started to depend upon it. In particular, Mesa uses SYS_kcmp for
kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE
Userspace has discovered the functionality offered by SYS_kcmp and has started to depend upon it. In particular, Mesa uses SYS_kcmp for os_same_file_description() in order to identify when two fd (e.g. device or dmabuf) point to the same struct file. Since they depend on it for core functionality, lift SYS_kcmp out of the non-default CONFIG_CHECKPOINT_RESTORE into the selectable syscall category.
Rasmus Villemoes also pointed out that systemd uses SYS_kcmp to deduplicate the per-service file descriptor store.
Note that some distributions such as Ubuntu are already enabling CHECKPOINT_RESTORE in their configs and so, by extension, SYS_kcmp.
References: https://gitlab.freedesktop.org/drm/intel/-/issues/3046 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: stable@vger.kernel.org Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> # DRM depends on kcmp Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> # systemd uses kcmp Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20210205220012.1983-1-chris@chris-wilson.co.uk
show more ...
|
#
58169a52 |
| 18-Dec-2020 |
Willem de Bruijn <willemb@google.com> |
epoll: add syscall epoll_pwait2
Add syscall epoll_pwait2, an epoll_wait variant with nsec resolution that replaces int timeout with struct timespec. It is equivalent otherwise.
int epoll_pwait
epoll: add syscall epoll_pwait2
Add syscall epoll_pwait2, an epoll_wait variant with nsec resolution that replaces int timeout with struct timespec. It is equivalent otherwise.
int epoll_pwait2(int fd, struct epoll_event *events, int maxevents, const struct timespec *timeout, const sigset_t *sigset);
The underlying hrtimer is already programmed with nsec resolution. pselect and ppoll also set nsec resolution timeout with timespec.
The sigset_t in epoll_pwait has a compat variant. epoll_pwait2 needs the same.
For timespec, only support this new interface on 2038 aware platforms that define __kernel_timespec_t. So no CONFIG_COMPAT_32BIT_TIME.
Link: https://lkml.kernel.org/r/20201121144401.3727659-3-willemdebruijn.kernel@gmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|