#
f3a5367c |
| 06-Jun-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: protect folio::private when attaching extent buffer folios
[BUG] Since v6.8 there are rare kernel crashes reported by various people, the common factor is bad page status error messages like
btrfs: protect folio::private when attaching extent buffer folios
[BUG] Since v6.8 there are rare kernel crashes reported by various people, the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840 page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c pfn:0xd6e840 aops:btree_aops ino:1 flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff) page_type: 0xffffffff() raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0 raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: non-NULL mapping
[CAUSE] Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") changes the sequence when allocating a new extent buffer.
Previously we always called grab_extent_buffer() under mapping->i_private_lock, to ensure the safety on modification on folio::private (which is a pointer to extent buffer for regular sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4 4K pages, meanwhile thread B is trying to release the page at X + 4K (the second page of the extent buffer at X).
Thread A | Thread B -----------------------------------+------------------------------------- | btree_release_folio() | | This is for the page at X + 4K, | | Not page X. | | alloc_extent_buffer() | |- release_extent_buffer() |- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs) | page at bytenr X (the first | | | | page). | | | | Which returned -EEXIST. | | | | | | | |- filemap_lock_folio() | | | | Returned the first page locked. | | | | | | | |- grab_extent_buffer() | | | | |- atomic_inc_not_zero() | | | | | Returned false | | | | |- folio_detach_private() | | |- folio_detach_private() for X | |- folio_test_private() | | |- folio_test_private() | Returned true | | | Returned true |- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount underflow of the folio X, and eventually causing the BUG_ON() on the page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb If the release is on the same first page of an eb, page lock would kick in and prevent the race.
- folio_detach_private() has a very small race window It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race, and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as filemap_release_folio() also requires the page to be locked, but forgot the filemap_release_folio() only locks one page, not all pages of an extent buffer.
[FIX] Move all the code requiring i_private_lock into attach_eb_folio_to_filemap(), so that everything is done with proper lock protection.
Furthermore to prevent future problems, add an extra lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh drop_caches () { while(true); do echo 3 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/compact_memory done }
run_tar () { while(true); do for x in `seq 1 80` ; do tar cf /dev/zero /mnt > /dev/null & done wait done }
mkfs.btrfs -f -d single -m single /dev/vda mount -o noatime /dev/vda /mnt # create 200,000 files, 1K each ./simoop -n 200000 -E -f 1k /mnt drop_caches & (run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/ Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/ Reported-by: Toralf Förster <toralf.foerster@gmx.de> Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/ Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") CC: stable@vger.kernel.org # 6.8+ CC: Chris Mason <clm@fb.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
bc00965d |
| 20-Apr-2024 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
btrfs: count super block write errors in device instead of tracking folio error state
Currently the error status of super block write is tracked in page/folio status bit Error. For that we need to k
btrfs: count super block write errors in device instead of tracking folio error state
Currently the error status of super block write is tracked in page/folio status bit Error. For that we need to keep the reference for the whole duration of write and wait.
Count the number of superblock writeback errors in the btrfs_device. That means we don't need the folio to stay around until it's waited for, and can avoid the extra call to folio_get/put.
Also remove a mention of PageError in a comment as it's the last mention of the page Error state.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
6b0a63a4 |
| 03-Apr-2024 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a cached state to extent_clear_unlock_delalloc
Now that we have the lock_extent tightly coupled with extent_clear_unlock_delalloc we can add a cached state to extent_clear_unlock_delalloc
btrfs: add a cached state to extent_clear_unlock_delalloc
Now that we have the lock_extent tightly coupled with extent_clear_unlock_delalloc we can add a cached state to extent_clear_unlock_delalloc and benefit from skipping the extra lookup when we're doing cow.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
c0707c9e |
| 12-Feb-2024 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: push the extent lock into btrfs_run_delalloc_range
We want to limit the scope of the extent lock to be around operations that can change in flight. Currently we hold the extent lock through
btrfs: push the extent lock into btrfs_run_delalloc_range
We want to limit the scope of the extent lock to be around operations that can change in flight. Currently we hold the extent lock through the entire writepage operation, which isn't really necessary.
We want to protect to make sure nobody has updated DELALLOC. In find_lock_delalloc_range we must lock the range in order to validate the contents of our io_tree. However once we've done that we're safe to unlock the range and continue, as we have the page lock already held for the range.
We are protected from all operations at this point.
* mmap() - we're holding the page lock, thus are protected. * buffered writes - again, we're protected because we take the page lock for the first and last page in our range for buffered writes so we won't create new delalloc ranges in this area. * direct IO - we invalidate pagecache before attempting to write a new area, which requires the page lock, so again are protected once we're holding the page lock on this range.
Additionally this behavior actually already exists for compressed, we unlock the range as soon as we start to process the async extents, and re-lock it during compression. So this is completely safe, and makes the locking more consistent.
Make this simple by just pushing the extent lock into btrfs_run_delalloc_range. From there followup patches will push the lock further down into its users.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
de6f14e8 |
| 16-Apr-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make try_release_extent_mapping() return a bool
Currently try_release_extent_mapping() as an int return type, but we use it as a boolean. Its only caller, the release folio callback, also ret
btrfs: make try_release_extent_mapping() return a bool
Currently try_release_extent_mapping() as an int return type, but we use it as a boolean. Its only caller, the release folio callback, also returns a boolean which corresponds to try_release_extent_mapping()'s return value. So change its return value type to bool as well as its helper try_release_extent_state().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
2e504418 |
| 16-Apr-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: be better releasing extent maps at try_release_extent_mapping()
At try_release_extent_mapping(), called during the release folio callback (btrfs_release_folio() callchain), we don't release a
btrfs: be better releasing extent maps at try_release_extent_mapping()
At try_release_extent_mapping(), called during the release folio callback (btrfs_release_folio() callchain), we don't release any extent maps in the range if the GFP flags don't allow blocking. This behaviour is exaggerated because:
1) Both searching for extent maps and removing them are not blocking operations. The only thing that it is the cond_resched() call at the end of the loop that searches for and removes extent maps;
2) We currently only operate on a single page, so for the case where block size matches the page size, we can only have one extent map, and for the case where the block size is smaller than the page size, we can have at most 16 extent maps.
So it's very unlikely the cond_resched() call will ever block even in the block size smaller than page size scenario.
So instead of not removing any extent maps at all in case the GFP glags don't allow blocking, keep removing extent maps while we don't need to reschedule. This makes it safe for the subpage case and for a future where we can process folios with a size larger than a page.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
433a3e01 |
| 16-Apr-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove i_size restriction at try_release_extent_mapping()
Currently we don't attempt to release extent maps if the inode has an i_size that is not greater than 16M. This condition was added w
btrfs: remove i_size restriction at try_release_extent_mapping()
Currently we don't attempt to release extent maps if the inode has an i_size that is not greater than 16M. This condition was added way back in 2008 by commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), without any explanation about it. A quick chat with Chris on slack revealed that the goal was probably to release the extent maps for small files only when closing the inode. This however can be harmful in case we have tons of such files being kept open for very long periods of time, since we will consume more and more pages for extent maps.
So remove the condition.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
85d28830 |
| 16-Apr-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use btrfs_get_fs_generation() at try_release_extent_mapping()
Nowadays we have the btrfs_get_fs_generation() to get the current generation of the filesystem, so there's no need anymore to loc
btrfs: use btrfs_get_fs_generation() at try_release_extent_mapping()
Nowadays we have the btrfs_get_fs_generation() to get the current generation of the filesystem, so there's no need anymore to lock the transaction spinlock to read it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
078b981a |
| 16-Apr-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rename some variables at try_release_extent_mapping()
Rename the following variables:
1) "btrfs_inode" to "inode", because it's shorter to type and clear, and we don't have a VFS inode he
btrfs: rename some variables at try_release_extent_mapping()
Rename the following variables:
1) "btrfs_inode" to "inode", because it's shorter to type and clear, and we don't have a VFS inode here as well, so there's no confusion;
2) "tree" to "io_tree", to be clear which tree we are dealing with, since we use 2 different trees in the function;
3) "map" to "extent_tree" since "map" gives the idea we are dealing with an extent map for example, but we are dealing with the inode's extent tree (the tree which stores extent maps).
These also make the next patches simpler.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
c2fbd812 |
| 21-Mar-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: pass the extent map tree's inode to remove_extent_mapping()
Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to remov
btrfs: pass the extent map tree's inode to remove_extent_mapping()
Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to remove_extent_mapping().
In order to facilitate an upcoming change that adds a shrinker for extent maps, change remove_extent_mapping() to receive the inode instead of its extent map tree.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
53e24158 |
| 14-Apr-2024 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: set start on clone before calling copy_extent_buffer_full
Our subpage testing started hanging on generic/560 and I bisected it down to 1cab1375ba6d ("btrfs: reuse cloned extent buffer during
btrfs: set start on clone before calling copy_extent_buffer_full
Our subpage testing started hanging on generic/560 and I bisected it down to 1cab1375ba6d ("btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations"). This is subtle because we use eb->start to figure out where in the folio we're copying to when we're subpage, as our ->start may refer to an area inside of the folio.
For example, assume a 16K page size machine with a 4K node size, and assume that we already have a cloned extent buffer when we cloned the previous search.
copy_extent_buffer_full() will do the following when copying the extent buffer path->nodes[0] (src) into cloned (dest):
src->start = 8k; // this is the new leaf we're cloning cloned->start = 4k; // this is left over from the previous clone
src_addr = folio_address(src->folios[0]); dest_addr = folio_address(dest->folios[0]);
memcpy(dest_addr + get_eb_offset_in_folio(dst, 0), src_addr + get_eb_offset_in_folio(src, 0), src->len);
Now get_eb_offset_in_folio() is where the problems occur, because for sub-pagesize blocksize we can have multiple eb's per folio, the code for this is as follows
size_t get_eb_offset_in_folio(eb, offset) { return (eb->start + offset & (folio_size(eb->folio[0]) - 1)); }
So in the above example we are copying into offset 4K inside the folio. However once we update cloned->start to 8K to match the src the math for get_eb_offset_in_folio() changes, and any subsequent reads (i.e. btrfs_item_key_to_cpu()) will start reading from the offset 8K instead of 4K where we copied to, giving us garbage.
Fix this by setting start before we co copy_extent_buffer_full() to make sure that we're copying into the same offset inside of the folio that we will read from later.
All other sites of copy_extent_buffer_full() are correct because we either set ->start beforehand or we simply don't change it in the case of the tree-log usage.
With this fix we now pass generic/560 on our subpage tests.
Fixes: 1cab1375ba6d ("btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations") Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
11e03f2f |
| 29-Jan-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: introduce btrfs_alloc_folio_array()
The new helper will do the same thing as btrfs_alloc_page_array(), but with folios.
One extra difference is, there is no extra helper for bulk allocation,
btrfs: introduce btrfs_alloc_folio_array()
The new helper will do the same thing as btrfs_alloc_page_array(), but with folios.
One extra difference is, there is no extra helper for bulk allocation, thus it may not be as efficient as the page version.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
f32f20e2 |
| 18-Mar-2024 |
Tavian Barnes <tavianator@tavianator.com> |
btrfs: warn if EXTENT_BUFFER_UPTODATE is set while reading
We recently tracked down a race condition that triggered a read for an extent buffer with EXTENT_BUFFER_UPTODATE already set. While this r
btrfs: warn if EXTENT_BUFFER_UPTODATE is set while reading
We recently tracked down a race condition that triggered a read for an extent buffer with EXTENT_BUFFER_UPTODATE already set. While this read was in progress, other concurrent readers would see the UPTODATE bit and return early as if the read was already complete, making accesses to the extent buffer conflict with the read operation that was overwriting it.
Add a WARN_ON() to end_bbio_meta_read() for this situation to make similar races easier to spot in the future.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Tavian Barnes <tavianator@tavianator.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
1e2d1837 |
| 18-Mar-2024 |
Tavian Barnes <tavianator@tavianator.com> |
btrfs: add helper to clear EXTENT_BUFFER_READING
We are clearing the bit and waking up any waiters in two different places. Factor that code out into a static helper function.
Reviewed-by: Qu Wenr
btrfs: add helper to clear EXTENT_BUFFER_READING
We are clearing the bit and waking up any waiters in two different places. Factor that code out into a static helper function.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Tavian Barnes <tavianator@tavianator.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
c66f2afc |
| 18-Mar-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove pointless writepages callback wrapper
There's no point in having a static writepages callback in inode.c that does nothing besides calling extent_writepages from extent_io.c. So just r
btrfs: remove pointless writepages callback wrapper
There's no point in having a static writepages callback in inode.c that does nothing besides calling extent_writepages from extent_io.c. So just remove the callback at inode.c and rename extent_writepages() to btrfs_writepages().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
7938d38b |
| 18-Mar-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove pointless readahead callback wrapper
There's no point in having a static readahead callback in inode.c that does nothing besides calling extent_readahead() from extent_io.c. So just re
btrfs: remove pointless readahead callback wrapper
There's no point in having a static readahead callback in inode.c that does nothing besides calling extent_readahead() from extent_io.c. So just remove the callback at inode.c and rename extent_readahead() to btrfs_readahead().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
1db7959a |
| 25-Mar-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: do not wait for short bulk allocation
[BUG] There is a recent report that when memory pressure is high (including cached pages), btrfs can spend most of its time on memory allocation in btrfs
btrfs: do not wait for short bulk allocation
[BUG] There is a recent report that when memory pressure is high (including cached pages), btrfs can spend most of its time on memory allocation in btrfs_alloc_page_array() for compressed read/write.
[CAUSE] For btrfs_alloc_page_array() we always go alloc_pages_bulk_array(), and even if the bulk allocation failed (fell back to single page allocation) we still retry but with extra memalloc_retry_wait().
If the bulk alloc only returned one page a time, we would spend a lot of time on the retry wait.
The behavior was introduced in commit 395cb57e8560 ("btrfs: wait between incomplete batch memory allocations").
[FIX] Although the commit mentioned that other filesystems do the wait, it's not the case at least nowadays.
All the mainlined filesystems only call memalloc_retry_wait() if they failed to allocate any page (not only for bulk allocation). If there is any progress, they won't call memalloc_retry_wait() at all.
For example, xfs_buf_alloc_pages() would only call memalloc_retry_wait() if there is no allocation progress at all, and the call is not for metadata readahead.
So I don't believe we should call memalloc_retry_wait() unconditionally for short allocation.
Call memalloc_retry_wait() if it fails to allocate any page for tree block allocation (which goes with __GFP_NOFAIL and may not need the special handling anyway), and reduce the latency for btrfs_alloc_page_array().
Reported-by: Julian Taylor <julian.taylor@1und1.de> Tested-by: Julian Taylor <julian.taylor@1und1.de> Link: https://lore.kernel.org/all/8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de/ Fixes: 395cb57e8560 ("btrfs: wait between incomplete batch memory allocations") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
073bda7a |
| 26-Mar-2024 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: add ASSERT and WARN for EXTENT_BUFFER_ZONED_ZEROOUT handling
Add an ASSERT to catch a faulty delayed reference item resulting from prematurely cleared extent buffer.
Also, add a WARN
btrfs: zoned: add ASSERT and WARN for EXTENT_BUFFER_ZONED_ZEROOUT handling
Add an ASSERT to catch a faulty delayed reference item resulting from prematurely cleared extent buffer.
Also, add a WARN to detect if we try to dirty a ZEROOUT buffer again, which is suspicious as its update will be lost.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
68879386 |
| 26-Mar-2024 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: do not flag ZEROOUT on non-dirty extent buffer
Btrfs clears the content of an extent buffer marked as EXTENT_BUFFER_ZONED_ZEROOUT before the bio submission. This mechanism is introduce
btrfs: zoned: do not flag ZEROOUT on non-dirty extent buffer
Btrfs clears the content of an extent buffer marked as EXTENT_BUFFER_ZONED_ZEROOUT before the bio submission. This mechanism is introduced to prevent a write hole of an extent buffer, which is once allocated, marked dirty, but turns out unnecessary and cleaned up within one transaction operation.
Currently, btrfs_clear_buffer_dirty() marks the extent buffer as EXTENT_BUFFER_ZONED_ZEROOUT, and skips the entry function. If this call happens while the buffer is under IO (with the WRITEBACK flag set, without the DIRTY flag), we can add the ZEROOUT flag and clear the buffer's content just before a bio submission. As a result:
1) it can lead to adding faulty delayed reference item which leads to a FS corrupted (EUCLEAN) error, and
2) it writes out cleared tree node on disk
The former issue is previously discussed in [1]. The corruption happens when it runs a delayed reference update. So, on-disk data is safe.
[1] https://lore.kernel.org/linux-btrfs/3f4f2a0ff1a6c818050434288925bdcf3cd719e5.1709124777.git.naohiro.aota@wdc.com/
The latter one can reach on-disk data. But, as that node is already processed by btrfs_clear_buffer_dirty(), that will be invalidated in the next transaction commit anyway. So, the chance of hitting the corruption is relatively small.
Anyway, we should skip flagging ZEROOUT on a non-DIRTY extent buffer, to keep the content under IO intact.
Fixes: aa6313e6ff2b ("btrfs: zoned: don't clear dirty flag of extent buffer") CC: stable@vger.kernel.org # 6.8 Link: https://lore.kernel.org/linux-btrfs/oadvdekkturysgfgi4qzuemd57zudeasynswurjxw3ocdfsef6@sjyufeugh63f/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
ef1e6823 |
| 16-Mar-2024 |
Tavian Barnes <tavianator@tavianator.com> |
btrfs: fix race in read_extent_buffer_pages()
There are reports from tree-checker that detects corrupted nodes, without any obvious pattern so possibly an overwrite in memory. After some debugging i
btrfs: fix race in read_extent_buffer_pages()
There are reports from tree-checker that detects corrupted nodes, without any obvious pattern so possibly an overwrite in memory. After some debugging it turns out there's a race when reading an extent buffer the uptodate status can be missed.
To prevent concurrent reads for the same extent buffer, read_extent_buffer_pages() performs these checks:
/* (1) */ if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) return 0;
/* (2) */ if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags)) goto done;
At this point, it seems safe to start the actual read operation. Once that completes, end_bbio_meta_read() does
/* (3) */ set_extent_buffer_uptodate(eb);
/* (4) */ clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
Normally, this is enough to ensure only one read happens, and all other callers wait for it to finish before returning. Unfortunately, there is a racey interleaving:
Thread A | Thread B | Thread C ---------+----------+--------- (1) | | | (1) | (2) | | (3) | | (4) | | | (2) | | | (1)
When this happens, thread B kicks of an unnecessary read. Worse, thread C will see UPTODATE set and return immediately, while the read from thread B is still in progress. This race could result in tree-checker errors like this as the extent buffer is concurrently modified:
BTRFS critical (device dm-0): corrupted node, root=256 block=8550954455682405139 owner mismatch, have 11858205567642294356 expect [256, 18446744073709551360]
Fix it by testing UPTODATE again after setting the READING bit, and if it's been set, skip the unnecessary read.
Fixes: d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading") Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/ Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/ Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/ CC: stable@vger.kernel.org # 6.5+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Tavian Barnes <tavianator@tavianator.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor update of changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
1cab1375 |
| 28-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations
During fiemap we may have to visit multiple leaves of the subvolume's inode tree, and each time we are freeing and allocating
btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations
During fiemap we may have to visit multiple leaves of the subvolume's inode tree, and each time we are freeing and allocating an extent buffer to use as a clone of each visited leaf. Optimize this by reusing cloned extent buffers, to avoid the freeing and re-allocation both of the extent buffer structure itself and more importantly of the pages attached to the extent buffer.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
978b63f7 |
| 28-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race when detecting delalloc ranges during fiemap
For fiemap we recently stopped locking the target extent range for the whole duration of the fiemap call, in order to avoid a deadlock in
btrfs: fix race when detecting delalloc ranges during fiemap
For fiemap we recently stopped locking the target extent range for the whole duration of the fiemap call, in order to avoid a deadlock in a scenario where the fiemap buffer happens to be a memory mapped range of the same file. This use case is very unlikely to be useful in practice but it may be triggered by fuzz testing (syzbot, etc).
This however introduced a race that makes us miss delalloc ranges for file regions that are currently holes, so the caller of fiemap will not be aware that there's data for some file regions. This can be quite serious for some use cases - for example in coreutils versions before 9.0, the cp program used fiemap to detect holes and data in the source file, copying only regions with data (extents or delalloc) from the source file to the destination file in order to preserve holes (see the documentation for its --sparse command line option). This means that if cp was used with a source file that had delalloc in a hole, the destination file could end up without that data, which is effectively a data loss issue, if it happened to hit the race described below.
The race happens like this:
1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that has delalloc in the file range [64M, 65M[, which is currently a hole;
2) Fiemap locks the inode in shared mode, then starts iterating the inode's subvolume tree searching for file extent items, without having the whole fiemap target range locked in the inode's io tree - the change introduced recently by commit b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking"). It only locks ranges in the io tree when it finds a hole or prealloc extent since that commit;
3) Note that fiemap clones each leaf before using it, and this is to avoid deadlocks when locking a file range in the inode's io tree and the fiemap buffer is memory mapped to some file, because writing to the page with btrfs_page_mkwrite() will wait on any ordered extent for the page's range and the ordered extent needs to lock the range and may need to modify the same leaf, therefore leading to a deadlock on the leaf;
4) While iterating the file extent items in the cloned leaf before finding the hole in the range [64M, 65M[, the delalloc in that range is flushed and its ordered extent completes - meaning the corresponding file extent item is in the inode's subvolume tree, but not present in the cloned leaf that fiemap is iterating over;
5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in the cloned leaf (or a file extent item with disk_bytenr == 0 in case the NO_HOLES feature is not enabled), it will lock that file range in the inode's io tree and then search for delalloc by checking for the EXTENT_DELALLOC bit in the io tree for that range and ordered extents (with btrfs_find_delalloc_in_range()). But it finds nothing since the delalloc in that range was already flushed and the ordered extent completed and is gone - as a result fiemap will not report that there's delalloc or an extent for the range [64M, 65M[, so user space will be mislead into thinking that there's a hole in that range.
This could actually be sporadically triggered with test case generic/094 from fstests, which reports a missing extent/delalloc range like this:
generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad) --- tests/generic/094.out 2020-06-10 19:29:03.830519425 +0100 +++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad 2024-02-28 11:00:00.381071525 +0000 @@ -1,3 +1,9 @@ QA output created by 094 fiemap run with sync fiemap run without sync +ERROR: couldn't find extent at 7 +map is 'HHDDHPPDPHPH' +logical: [ 5.. 6] phys: 301517.. 301518 flags: 0x800 tot: 2 +logical: [ 8.. 8] phys: 301520.. 301520 flags: 0x800 tot: 1 ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff)
So in order to fix this, while still avoiding deadlocks in the case where the fiemap buffer is memory mapped to the same file, change fiemap to work like the following:
1) Always lock the whole range in the inode's io tree before starting to iterate the inode's subvolume tree searching for file extent items, just like we did before commit b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking");
2) Now instead of writing to the fiemap buffer every time we have an extent to report, write instead to a temporary buffer (1 page), and when that buffer becomes full, stop iterating the file extent items, unlock the range in the io tree, release the search path, submit all the entries kept in that buffer to the fiemap buffer, and then resume the search for file extent items after locking again the remainder of the range in the io tree.
The buffer having a size of a page, allows for 146 entries in a system with 4K pages. This is a large enough value to have a good performance by avoiding too many restarts of the search for file extent items. In other words this preserves the huge performance gains made in the last two years to fiemap, while avoiding the deadlocks in case the fiemap buffer is memory mapped to the same file (useless in practice, but possible and exercised by fuzz testing and syzbot).
Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
ef5a05c5 |
| 24-Feb-2024 |
Chengming Zhou <zhouchengming@bytedance.com> |
btrfs: remove SLAB_MEM_SPREAD flag use
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: re
btrfs: remove SLAB_MEM_SPREAD flag use
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete to avoid confusion for users. Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
970ea374 |
| 06-Feb-2024 |
David Sterba <dsterba@suse.com> |
btrfs: pass a valid extent map cache pointer to __get_extent_map()
We can pass a valid em cache pointer down to __get_extent_map() and drop the validity check. This avoids the special case, the call
btrfs: pass a valid extent map cache pointer to __get_extent_map()
We can pass a valid em cache pointer down to __get_extent_map() and drop the validity check. This avoids the special case, the call stacks are simple:
btrfs_read_folio btrfs_do_readpage __get_extent_map
extent_readahead contiguous_readpages btrfs_do_readpage __get_extent_map
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
e84bfffc |
| 24-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: hoist fs_info out of loops in end_bbio_data_write and end_bbio_data_read
The fs_info and sectorsize remain the same during the loops, no need to set them on each iteration.
Reviewed-by: Joha
btrfs: hoist fs_info out of loops in end_bbio_data_write and end_bbio_data_read
The fs_info and sectorsize remain the same during the loops, no need to set them on each iteration.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|