#
1e8a4237 |
| 19-Mar-2024 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename werr and err to ret in __btrfs_wait_marked_extents()
Rename the function's local return variables err and werr to ret. Also, align the variable declarations with the other declarations
btrfs: rename werr and err to ret in __btrfs_wait_marked_extents()
Rename the function's local return variables err and werr to ret. Also, align the variable declarations with the other declarations in the function for better function space alignment.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
ce875311 |
| 19-Mar-2024 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename werr and err to ret in btrfs_write_marked_extents()
Rename the function's local variable werr and err to ret.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain
btrfs: rename werr and err to ret in btrfs_write_marked_extents()
Rename the function's local variable werr and err to ret.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
9a7b68d3 |
| 16-Apr-2024 |
Anand Jain <anand.jain@oracle.com> |
btrfs: report filemap_fdata<write|wait>_range() error
In the function btrfs_write_marked_extents() and in __btrfs_wait_marked_extents() return the actual error if when filemap_fdata<write|wait>_rang
btrfs: report filemap_fdata<write|wait>_range() error
In the function btrfs_write_marked_extents() and in __btrfs_wait_marked_extents() return the actual error if when filemap_fdata<write|wait>_range() fails.
Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
e094f480 |
| 15-Apr-2024 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: change root->root_key.objectid to btrfs_root_id()
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which ma
btrfs: change root->root_key.objectid to btrfs_root_id()
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code.
The changes where made with the following Coccinelle semantic patch:
// <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 | - E->root_key.objectid + btrfs_root_id(E) ) // </smpl>
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
fdee5e55 |
| 19-Mar-2024 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename err to ret in __btrfs_end_transaction()
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.co
btrfs: rename err to ret in __btrfs_end_transaction()
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
05aa0243 |
| 07-Mar-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove pointless BUG_ON() when creating snapshot
When creating a snapshot we first check with btrfs_lookup_dir_item() if there is a name collision in the parent directory and then return an e
btrfs: remove pointless BUG_ON() when creating snapshot
When creating a snapshot we first check with btrfs_lookup_dir_item() if there is a name collision in the parent directory and then return an error if there's a collision. Then later on when trying to insert a dir item for the snapshot we BUG_ON() if the return value is -EEXIST or -EOVERFLOW:
static noinline int create_pending_snapshot(...) { (...)
/* check if there is a file/dir which has the same name. */ dir_item = btrfs_lookup_dir_item(...); (...)
ret = btrfs_insert_dir_item(...); /* We have check then name at the beginning, so it is impossible. */ BUG_ON(ret == -EEXIST || ret == -EOVERFLOW); if (ret) { btrfs_abort_transaction(trans, ret); goto fail; }
(...) }
It's impossible to get the -EEXIST because we previously checked for a potential collision with btrfs_lookup_dir_item() and we know that after that no one could have added a colliding name because at this point the transaction is in its critical section, state TRANS_STATE_COMMIT_DOING, so no one can join this transaction to add a colliding name and neither can anyone start a new transaction to do that.
As for the -EOVERFLOW, that can't happen as long as we have the extended references feature enabled, which is a mkfs default for many years now.
In either case, the BUG_ON() is excessive as we can properly deal with any error and can abort the transaction and jump to the 'fail' label, in which case we'll also get the useful stack trace (just like a BUG_ON()) from the abort if the error is either -EEXIST or -EOVERFLOW.
So remove the BUG_ON().
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
6e68de0b |
| 26-Mar-2024 |
Boris Burkov <boris@bur.io> |
btrfs: always clear PERTRANS metadata during commit
It is possible to clear a root's IN_TRANS tag from the radix tree, but not clear its PERTRANS, if there is some error in between. Eliminate that p
btrfs: always clear PERTRANS metadata during commit
It is possible to clear a root's IN_TRANS tag from the radix tree, but not clear its PERTRANS, if there is some error in between. Eliminate that possibility by moving the free up to where we clear the tag.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
211de933 |
| 21-Mar-2024 |
Boris Burkov <boris@bur.io> |
btrfs: qgroup: convert PREALLOC to PERTRANS after record_root_in_trans
The transaction is only able to free PERTRANS reservations for a root once that root has been recorded with the TRANS tag on th
btrfs: qgroup: convert PREALLOC to PERTRANS after record_root_in_trans
The transaction is only able to free PERTRANS reservations for a root once that root has been recorded with the TRANS tag on the roots radix tree. Therefore, until we are sure that this root will get tagged, it isn't safe to convert. Generally, this is not an issue as *some* transaction will likely tag the root before long and this reservation will get freed in that transaction, but technically it could stick around until unmount and result in a warning about leaked metadata reservation space.
This path is most exercised by running the generic/269 fstest with CONFIG_BTRFS_DEBUG.
Fixes: a6496849671a ("btrfs: fix start transaction qgroup rsv double free") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
ef5a05c5 |
| 24-Feb-2024 |
Chengming Zhou <zhouchengming@bytedance.com> |
btrfs: remove SLAB_MEM_SPREAD flag use
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: re
btrfs: remove SLAB_MEM_SPREAD flag use
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete to avoid confusion for users. Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
2753b4d8 |
| 20-Feb-2024 |
Kunwu Chan <chentao@kylinos.cn> |
btrfs: use KMEM_CACHE() to create btrfs_trans_handle cache
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify the creation of SLAB caches when the default values are used.
Signed
btrfs: use KMEM_CACHE() to create btrfs_trans_handle cache
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify the creation of SLAB caches when the default values are used.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
f33163ee |
| 13-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove no longer used btrfs_transaction_in_commit()
The function btrfs_transaction_in_commit() is no longer used, its last use was removed in commit 11aeb97b45ad ("btrfs: don't arbitrarily sl
btrfs: remove no longer used btrfs_transaction_in_commit()
The function btrfs_transaction_in_commit() is no longer used, its last use was removed in commit 11aeb97b45ad ("btrfs: don't arbitrarily slow down delalloc if we're committing"), so just remove it.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
2b712e3b |
| 25-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused included headers
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (w
btrfs: remove unused included headers
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
e2b54eaf |
| 23-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix double free of anonymous device after snapshot creation failure
When creating a snapshot we may do a double free of an anonymous device in case there's an error committing the transaction
btrfs: fix double free of anonymous device after snapshot creation failure
When creating a snapshot we may do a double free of an anonymous device in case there's an error committing the transaction. The second free may result in freeing an anonymous device number that was allocated by some other subsystem in the kernel or another btrfs filesystem.
The steps that lead to this:
1) At ioctl.c:create_snapshot() we allocate an anonymous device number and assign it to pending_snapshot->anon_dev;
2) Then we call btrfs_commit_transaction() and end up at transaction.c:create_pending_snapshot();
3) There we call btrfs_get_new_fs_root() and pass it the anonymous device number stored in pending_snapshot->anon_dev;
4) btrfs_get_new_fs_root() frees that anonymous device number because btrfs_lookup_fs_root() returned a root - someone else did a lookup of the new root already, which could some task doing backref walking;
5) After that some error happens in the transaction commit path, and at ioctl.c:create_snapshot() we jump to the 'fail' label, and after that we free again the same anonymous device number, which in the meanwhile may have been reallocated somewhere else, because pending_snapshot->anon_dev still has the same value as in step 1.
Recently syzbot ran into this and reported the following trace:
------------[ cut here ]------------ ida_free called for id=51 which is not allocated. WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525 Modules linked in: CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc9076 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525 Code: 10 42 80 3c 28 (...) RSP: 0018:ffffc90015a67300 EFLAGS: 00010246 RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000 RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4 R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246 R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246 FS: 00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0 Call Trace: <TASK> btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346 create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837 create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931 btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404 create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848 btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044 __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306 btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393 btrfs_ioctl+0xa74/0xd40 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7fca3e67dda9 Code: 28 00 00 00 (...) RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9 RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658 </TASK>
Where we get an explicit message where we attempt to free an anonymous device number that is not currently allocated. It happens in a different code path from the example below, at btrfs_get_root_ref(), so this change may not fix the case triggered by syzbot.
To fix at least the code path from the example above, change btrfs_get_root_ref() and its callers to receive a dev_t pointer argument for the anonymous device number, so that in case it frees the number, it also resets it to 0, so that up in the call chain we don't attempt to do the double free.
CC: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/ Fixes: e03ee2fe873e ("btrfs: do not ASSERT() if the newly created subvolume already got read") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
2f6397e4 |
| 02-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: don't refill whole delayed refs block reserve when starting transaction
Since commit 28270e25c69a ("btrfs: always reserve space for delayed refs when starting transaction") we started not onl
btrfs: don't refill whole delayed refs block reserve when starting transaction
Since commit 28270e25c69a ("btrfs: always reserve space for delayed refs when starting transaction") we started not only to reserve metadata space for the delayed refs a caller of btrfs_start_transaction() might generate but also to try to fully refill the delayed refs block reserve, because there are several case where we generate delayed refs and haven't reserved space for them, relying on the global block reserve. Relying too much on the global block reserve is not always safe, and can result in hitting -ENOSPC during transaction commits or worst, in rare cases, being unable to mount a filesystem that needs to do orphan cleanup or anything that requires modifying the filesystem during mount, and has no more unallocated space and the metadata space is nearly full. This was explained in detail in that commit's change log.
However the gap between the reserved amount and the size of the delayed refs block reserve can be huge, so attempting to reserve space for such a gap can result in allocating many metadata block groups that end up not being used. After a recent patch, with the subject:
"btrfs: add new unused block groups to the list of unused block groups"
We started to add new block groups that are unused to the list of unused block groups, to avoid having them around for a very long time in case they are never used, because a block group is only added to the list of unused block groups when we deallocate the last extent or when mounting the filesystem and the block group has 0 bytes used. This is not a problem introduced by the commit mentioned earlier, it always existed as our metadata space reservations are, most of the time, pessimistic and end up not using all the space they reserved, so we can occasionally end up with one or two unused metadata block groups for a long period. However after that commit mentioned earlier, we are just more pessimistic in the metadata space reservations when starting a transaction and therefore the issue is more likely to happen.
This however is not always enough because we might create unused metadata block groups when reserving metadata space at a high rate if there's always a gap in the delayed refs block reserve and the cleaner kthread isn't triggered often enough or is busy with other work (running delayed iputs, cleaning deleted roots, etc), not to mention the block group's allocated space is only usable for a new block group after the transaction used to remove it is committed.
A user reported that he's getting a lot of allocated metadata block groups but the usage percentage of metadata space was very low compared to the total allocated space, specially after running a series of block group relocations.
So for now stop trying to refill the gap in the delayed refs block reserve and reserve space only for the delayed refs we are expected to generate when starting a transaction.
CC: stable@vger.kernel.org # 6.7+ Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H6802ayLHUJFztzZAVzBLJAGdFx=6FHNNy87+obZXXZpQ@mail.gmail.com/ Tested-by: Ivan Shapovalov <intelfx@intelfx.name> Reported-by: Heddxh <g311571057@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAE93xANEby6RezOD=zcofENYZOT-wpYygJyauyUAZkLv6XVFOA@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
b321a52c |
| 01-Dec-2023 |
Boris Burkov <boris@bur.io> |
btrfs: free qgroup pertrans reserve on transaction abort
If we abort a transaction, we never run the code that frees the pertrans qgroup reservation. This results in warnings on unmount as that rese
btrfs: free qgroup pertrans reserve on transaction abort
If we abort a transaction, we never run the code that frees the pertrans qgroup reservation. This results in warnings on unmount as that reservation has been leaked. The leak isn't a huge issue since the fs is read-only, but it's better to clean it up when we know we can/should. Do it during the cleanup_transaction step of aborting.
CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
8049ba5d |
| 10-Nov-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: do not abort transaction if there is already an existing qgroup
[BUG] Syzbot reported a regression that after commit 6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot creation") we can
btrfs: do not abort transaction if there is already an existing qgroup
[BUG] Syzbot reported a regression that after commit 6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot creation") we can trigger transaction abort during snapshot creation:
BTRFS: Transaction aborted (error -17) WARNING: CPU: 0 PID: 5057 at fs/btrfs/transaction.c:1778 create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778 Modules linked in: CPU: 0 PID: 5057 Comm: syz-executor225 Not tainted 6.6.0-syzkaller-15365-g305230142ae0 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023 RIP: 0010:create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778 Call Trace: <TASK> create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1967 btrfs_commit_transaction+0xf1c/0x3730 fs/btrfs/transaction.c:2440 create_snapshot+0x4a5/0x7e0 fs/btrfs/ioctl.c:845 btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:995 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1041 __btrfs_ioctl_snap_create+0x344/0x460 fs/btrfs/ioctl.c:1294 btrfs_ioctl_snap_create+0x13c/0x190 fs/btrfs/ioctl.c:1321 btrfs_ioctl+0xbbf/0xd40 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b RIP: 0033:0x7f2f791127b9 </TASK>
[CAUSE] The error number is -EEXIST, which can happen for qgroup if there is already an existing qgroup and then we're trying to create a snapshot for it.
[FIX] In that case, we can continue creating the snapshot, although it may lead to qgroup inconsistency, it's not so critical to abort the current transaction.
So in this case, we can just ignore the non-critical errors, mostly -EEXIST (there is already a qgroup).
Reported-by: syzbot+4d81015bc10889fd12ea@syzkaller.appspotmail.com Fixes: 6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot creation") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
b1c38a13 |
| 04-Oct-2023 |
Jeff Layton <jlayton@kernel.org> |
btrfs: convert to new timestamp accessors
Convert to using the new inode timestamp accessor functions.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347
btrfs: convert to new timestamp accessors
Convert to using the new inode timestamp accessor functions.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-21-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
0124855f |
| 04-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add and use helpers for reading and writing last_trans_committed
Currently the last_trans_committed field of struct btrfs_fs_info is modified and read without any locking or other protection.
btrfs: add and use helpers for reading and writing last_trans_committed
Currently the last_trans_committed field of struct btrfs_fs_info is modified and read without any locking or other protection. For example early in the fsync path, skip_inode_logging() is called which reads fs_info->last_trans_committed, but at the same time we can have a transaction commit completing and updating that field.
In the case of an fsync this is harmless and any data race should be rare and at most cause an unnecessary logging of an inode.
To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the last_trans_committed field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
4a4f8fe2 |
| 04-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add and use helpers for reading and writing fs_info->generation
Currently the generation field of struct btrfs_fs_info is always modified while holding fs_info->trans_lock locked. Most reader
btrfs: add and use helpers for reading and writing fs_info->generation
Currently the generation field of struct btrfs_fs_info is always modified while holding fs_info->trans_lock locked. Most readers will access this field without taking that lock but while holding a transaction handle, which is safe to do due to the transaction life cycle.
However there are other readers that are neither holding the lock nor holding a transaction handle open:
1) When reading an inode from disk, at btrfs_read_locked_inode();
2) When reading the generation to expose it to sysfs, at btrfs_generation_show();
3) Early in the fsync path, at skip_inode_logging();
4) When creating a hole at btrfs_cont_expand(), during write paths, truncate and reflinking;
5) In the fs_info ioctl (btrfs_ioctl_fs_info());
6) While mounting the filesystem, in the open_ctree() path. In these cases it's safe to directly read fs_info->generation as no one can concurrently start a transaction and update fs_info->generation.
In case of the fsync path, races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. In the other cases it's not so clear if functional problems may arise, though in case 1 rare things like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC flag not being set on an inode and therefore result in incorrect logging later on in case a fsync call is made.
To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the generation field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers where needed.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
9ef17228 |
| 28-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: stop reserving excessive space for block group item insertions
Space for block group item insertions, necessary after allocating a new block group, is reserved in the delayed refs block reser
btrfs: stop reserving excessive space for block group item insertions
Space for block group item insertions, necessary after allocating a new block group, is reserved in the delayed refs block reserve. Currently we do this by incrementing the transaction handle's delayed_ref_updates counter and then calling btrfs_update_delayed_refs_rsv(), which will increase the size of the delayed refs block reserve by an amount that corresponds to the same amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes().
That is an excessive amount because it corresponds to the amount of space needed to insert one item in a btree (btrfs_calc_insert_metadata_size()) times 2 when the free space tree feature is enabled. All we need is an amount as given by btrfs_calc_insert_metadata_size(), since we only need to insert a block group item in the extent tree (or block group tree if this feature is enabled). By using btrfs_calc_insert_metadata_size() we will need to reserve 2 times less space when using the free space tree, putting less pressure on space reservation.
So use helpers to reserve and release space for block group item insertions that use btrfs_calc_insert_metadata_size() for calculation of the space.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
1723270f |
| 22-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: move btrfs_defrag_root() to defrag.{c,h}
The btrfs_defrag_root() function does not really belong in the transaction.{c,h} module and as we have a defrag.{c,h} nowadays, move it to there inste
btrfs: move btrfs_defrag_root() to defrag.{c,h}
The btrfs_defrag_root() function does not really belong in the transaction.{c,h} module and as we have a defrag.{c,h} nowadays, move it to there instead. This also allows to stop exporting btrfs_defrag_leaves(), so we can make it static.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename info to fs_info for consistency ] Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
0a5d0dc5 |
| 22-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove redundant root argument from btrfs_update_inode_fallback()
The root argument for btrfs_update_inode_fallback() always matches the root of the given inode, so remove the root argument a
btrfs: remove redundant root argument from btrfs_update_inode_fallback()
The root argument for btrfs_update_inode_fallback() always matches the root of the given inode, so remove the root argument and get it from the inode argument.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
5343cd93 |
| 28-Mar-2023 |
Boris Burkov <boris@bur.io> |
btrfs: qgroup: simple quota auto hierarchy for nested subvolumes
Consider the following sequence:
- enable quotas - create subvol S id 256 at dir outer/ - create a qgroup 1/100 - add 0/256 (S's aut
btrfs: qgroup: simple quota auto hierarchy for nested subvolumes
Consider the following sequence:
- enable quotas - create subvol S id 256 at dir outer/ - create a qgroup 1/100 - add 0/256 (S's auto qgroup) to 1/100 - create subvol T id 257 at dir outer/inner/
With full qgroups, there is no relationship between 0/257 and either of 0/256 or 1/100. There is an inherit feature that the creator of inner/ can use to specify it ought to be in 1/100.
Simple quotas are targeted at container isolation, where such automatic inheritance for not necessarily trusted/controlled nested subvol creation would be quite helpful. Therefore, add a new default behavior for simple quotas: when you create a nested subvol, automatically inherit as parents any parents of the qgroup of the subvol the new inode is going in.
In our example, 257/0 would also be under 1/100, allowing easy control of a total quota over an arbitrary hierarchy of subvolumes.
I think this _might_ be a generally useful behavior, so it could be interesting to put it behind a new inheritance flag that simple quotas always use while traditional quotas let the user specify, but this is a minimally intrusive change to start.
Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
6ed05643 |
| 28-Jun-2023 |
Boris Burkov <boris@bur.io> |
btrfs: create qgroup earlier in snapshot creation
Pull creating the qgroup earlier in the snapshot. This allows simple quotas qgroups to see all the metadata writes related to the snapshot being cre
btrfs: create qgroup earlier in snapshot creation
Pull creating the qgroup earlier in the snapshot. This allows simple quotas qgroups to see all the metadata writes related to the snapshot being created and to be born with the root node accounted.
Note this has an impact on transaction commit where the qgroup creation can do a lot of work, allocate memory and take locks. The change is done for correctness, potential performance issues will be fixed in the future.
Signed-off-by: Boris Burkov <boris@bur.io> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
182940f4 |
| 16-May-2023 |
Boris Burkov <boris@bur.io> |
btrfs: qgroup: add new quota mode for simple quotas
Add a new quota mode called "simple quotas". It can be enabled by the existing quota enable ioctl via a new command, and sets an incompat bit, as
btrfs: qgroup: add new quota mode for simple quotas
Add a new quota mode called "simple quotas". It can be enabled by the existing quota enable ioctl via a new command, and sets an incompat bit, as the implementation of simple quotas will make backwards incompatible changes to the disk format of the extent tree.
Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|