#
67a841f9 |
| 30-Apr-2024 |
Christoph Hellwig <hch@lst.de> |
xfs: clean up buffer allocation in xlog_do_recovery_pass
Merge the initial xlog_alloc_buffer calls, and pass the variable designating the length that is initialized to 1 above instead of passing the
xfs: clean up buffer allocation in xlog_do_recovery_pass
Merge the initial xlog_alloc_buffer calls, and pass the variable designating the length that is initialized to 1 above instead of passing the open coded 1 directly.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
show more ...
|
#
45cf9760 |
| 30-Apr-2024 |
Christoph Hellwig <hch@lst.de> |
xfs: fix log recovery buffer allocation for the legacy h_size fixup
Commit a70f9fe52daa ("xfs: detect and handle invalid iclog size set by mkfs") added a fixup for incorrect h_size values used for t
xfs: fix log recovery buffer allocation for the legacy h_size fixup
Commit a70f9fe52daa ("xfs: detect and handle invalid iclog size set by mkfs") added a fixup for incorrect h_size values used for the initial umount record in old xfsprogs versions. Later commit 0c771b99d6c9 ("xfs: clean up calculation of LR header blocks") cleaned up the log reover buffer calculation, but stoped using the fixed up h_size value to size the log recovery buffer, which can lead to an out of bounds access when the incorrect h_size does not come from the old mkfs tool, but a fuzzer.
Fix this by open coding xlog_logrec_hblks and taking the fixed h_size into account for this calculation.
Fixes: 0c771b99d6c9 ("xfs: clean up calculation of LR header blocks") Reported-by: Sam Sun <samsun1006219@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
show more ...
|
#
14f19991 |
| 15-Apr-2024 |
Darrick J. Wong <djwong@kernel.org> |
xfs: capture inode generation numbers in the ondisk exchmaps log item
Per some very late review comments, capture the generation numbers of both inodes involved in a file content exchange operation
xfs: capture inode generation numbers in the ondisk exchmaps log item
Per some very late review comments, capture the generation numbers of both inodes involved in a file content exchange operation so that we don't accidentally target files with have been reallocated.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
6c08f434 |
| 15-Apr-2024 |
Darrick J. Wong <djwong@kernel.org> |
xfs: introduce a file mapping exchange log intent item
Introduce a new intent log item to handle exchanging mappings between the forks of two files.
Signed-off-by: Darrick J. Wong <djwong@kernel.or
xfs: introduce a file mapping exchange log intent item
Introduce a new intent log item to handle exchanging mappings between the forks of two files.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
5302a5c8 |
| 15-Apr-2024 |
Darrick J. Wong <djwong@kernel.org> |
xfs: only clear log incompat flags at clean unmount
While reviewing the online fsck patchset, someone spied the xfs_swapext_can_use_without_log_assistance function and wondered why we go through thi
xfs: only clear log incompat flags at clean unmount
While reviewing the online fsck patchset, someone spied the xfs_swapext_can_use_without_log_assistance function and wondered why we go through this inverted-bitmask dance to avoid setting the XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT feature.
(The same principles apply to the logged extended attribute update feature bit in the since-merged LARP series.)
The reason for this dance is that xfs_add_incompat_log_feature is an expensive operation -- it forces the log, pushes the AIL, and then if nobody's beaten us to it, sets the feature bit and issues a synchronous write of the primary superblock. That could be a one-time cost amortized over the life of the filesystem, but the log quiesce and cover operations call xfs_clear_incompat_log_features to remove feature bits opportunistically. On a moderately loaded filesystem this leads to us cycling those bits on and off over and over, which hurts performance.
Why do we clear the log incompat bits? Back in ~2020 I think Dave and I had a conversation on IRC[2] about what the log incompat bits represent. IIRC in that conversation we decided that the log incompat bits protect unrecovered log items so that old kernels won't try to recover them and barf. Since a clean log has no protected log items, we could clear the bits at cover/quiesce time.
As Dave Chinner pointed out in the thread, clearing log incompat bits at unmount time has positive effects for golden root disk image generator setups, since the generator could be running a newer kernel than what gets written to the golden image -- if there are log incompat fields set in the golden image that was generated by a newer kernel/OS image builder then the provisioning host cannot mount the filesystem even though the log is clean and recovery is unnecessary to mount the filesystem.
Given that it's expensive to set log incompat bits, we really only want to do that once per bit per mount. Therefore, I propose that we only clear log incompat bits as part of writing a clean unmount record. Do this by adding an operational state flag to the xfs mount that guards whether or not the feature bit clearing can actually take place.
This eliminates the l_incompat_users rwsem that we use to protect a log cleaning operation from clearing a feature bit that a frontend thread is trying to set -- this lock adds another way to fail w.r.t. locking. For the swapext series, I shard that into multiple locks just to work around the lockdep complaints, and that's fugly.
Link: https://lore.kernel.org/linux-xfs/20240131230043.GA6180@frogsfrogsfrogs/ Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
show more ...
|
#
549d3c9a |
| 15-Apr-2024 |
Darrick J. Wong <djwong@kernel.org> |
xfs: pass xfs_buf lookup flags to xfs_*read_agi
Allow callers to pass buffer lookup flags to xfs_read_agi and xfs_ialloc_read_agi. This will be used in the next patch to fix a deadlock in the onlin
xfs: pass xfs_buf lookup flags to xfs_*read_agi
Allow callers to pass buffer lookup flags to xfs_read_agi and xfs_ialloc_read_agi. This will be used in the next patch to fix a deadlock in the online fsck inode scanner.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
1e5efd72 |
| 23-Feb-2024 |
Darrick J. Wong <djwong@kernel.org> |
xfs: fix log recovery erroring out on refcount recovery failure
Per the comment in the error case of xfs_reflink_recover_cow, zero out any error (after shutting down the log) so that we actually kil
xfs: fix log recovery erroring out on refcount recovery failure
Per the comment in the error case of xfs_reflink_recover_cow, zero out any error (after shutting down the log) so that we actually kill any new intent items that might have gotten logged by later recovery steps. Discovered by xfs/434, which few people actually seem to run.
Fixes: 2c1e31ed5c88 ("xfs: place intent recovery under NOFS allocation context") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
show more ...
|
#
e4c3b72a |
| 17-Jan-2024 |
Long Li <leo.lilong@huawei.com> |
xfs: ensure submit buffers on LSN boundaries in error handlers
While performing the IO fault injection test, I caught the following data corruption report:
XFS (dm-0): Internal error ltbno + ltlen
xfs: ensure submit buffers on LSN boundaries in error handlers
While performing the IO fault injection test, I caught the following data corruption report:
XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_ag_extent+0x79c/0x1130 CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker Call Trace: <TASK> dump_stack_lvl+0x50/0x70 xfs_corruption_error+0x134/0x150 xfs_free_ag_extent+0x7d3/0x1130 __xfs_free_extent+0x201/0x3c0 xfs_trans_free_extent+0x29b/0xa10 xfs_extent_free_finish_item+0x2a/0xb0 xfs_defer_finish_noroll+0x8d1/0x1b40 xfs_defer_finish+0x21/0x200 xfs_itruncate_extents_flags+0x1cb/0x650 xfs_free_eofblocks+0x18f/0x250 xfs_inactive+0x485/0x570 xfs_inodegc_worker+0x207/0x530 process_scheduled_works+0x24a/0xe10 worker_thread+0x5ac/0xc60 kthread+0x2cd/0x3c0 ret_from_fork+0x4a/0x80 ret_from_fork_asm+0x11/0x20 </TASK> XFS (dm-0): Corruption detected. Unmount and run xfs_repair
After analyzing the disk image, it was found that the corruption was triggered by the fact that extent was recorded in both inode datafork and AGF btree blocks. After a long time of reproduction and analysis, we found that the reason of free sapce btree corruption was that the AGF btree was not recovered correctly.
Consider the following situation, Checkpoint A and Checkpoint B are in the same record and share the same start LSN1, buf items of same object (AGF btree block) is included in both Checkpoint A and Checkpoint B. If the buf item in Checkpoint A has been recovered and updates metadata LSN permanently, then the buf item in Checkpoint B cannot be recovered, because log recovery skips items with a metadata LSN >= the current LSN of the recovery item. If there is still an inode item in Checkpoint B that records the Extent X, the Extent X will be recorded in both inode datafork and AGF btree block after Checkpoint B is recovered. Such transaction can be seen when allocing enxtent for inode bmap, it record both the addition of extent to the inode extent list and the removing extent from the AGF.
|------------Record (LSN1)------------------|---Record (LSN2)---| |-------Checkpoint A----------|----------Checkpoint B-----------| | Buf Item(Extent X) | Buf Item / Inode item(Extent X) | | Extent X is freed | Extent X is allocated |
After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on LSN boundaries") was introduced, we submit buffers on lsn boundaries during log recovery. The above problem can be avoided under normal paths, but it's not guaranteed under abnormal paths. Consider the following process, if an error was encountered after recover buf item in Checkpoint A and before recover buf item in Checkpoint B, buffers that have been added to the buffer_list will still be submitted, this violates the submits rule on lsn boundaries. So buf item in Checkpoint B cannot be recovered on the next mount due to current lsn of transaction equal to metadata lsn on disk. The detailed process of the problem is as follows.
First Mount:
xlog_do_recovery_pass error = xlog_recover_process xlog_recover_process_data xlog_recover_process_ophdr xlog_recovery_process_trans ... /* recover buf item in Checkpoint A */ xlog_recover_buf_commit_pass2 xlog_recover_do_reg_buffer /* add buffer of agf btree block to buffer_list */ xfs_buf_delwri_queue(bp, buffer_list) ... ==> Encounter read IO error and return /* submit buffers regardless of error */ if (!list_empty(&buffer_list)) xfs_buf_delwri_submit(&buffer_list);
<buf items of agf btree block in Checkpoint A recovery success>
Second Mount:
xlog_do_recovery_pass error = xlog_recover_process xlog_recover_process_data xlog_recover_process_ophdr xlog_recovery_process_trans ... /* recover buf item in Checkpoint B */ xlog_recover_buf_commit_pass2 /* buffer of agf btree block wouldn't added to buffer_list due to lsn equal to current_lsn */ if (XFS_LSN_CMP(lsn, current_lsn) >= 0) goto out_release
<buf items of agf btree block in Checkpoint B wouldn't recovery>
In order to make sure that submits buffers on lsn boundaries in the abnormal paths, we need to check error status before submit buffers that have been added from the last record processed. If error status exist, buffers in the bufffer_list should not be writen to disk.
Canceling the buffers in the buffer_list directly isn't correct, unlike any other place where write list was canceled, these buffers has been initialized by xfs_buf_item_init() during recovery and held by buf item, buf items will not be released in xfs_buf_delwri_cancel(), it's not easy to solve.
If the filesystem has been shut down, then delwri list submission will error out all buffers on the list via IO submission/completion and do all the correct cleanup automatically. So shutting down the filesystem could prevents buffers in the bufffer_list from being written to disk.
Fixes: 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery") Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
show more ...
|
#
2c1e31ed |
| 15-Jan-2024 |
Dave Chinner <dchinner@redhat.com> |
xfs: place intent recovery under NOFS allocation context
When recovery starts processing intents, all of the initial intent allocations are done outside of transaction contexts. That means they need
xfs: place intent recovery under NOFS allocation context
When recovery starts processing intents, all of the initial intent allocations are done outside of transaction contexts. That means they need to specifically use GFP_NOFS as we do not want memory reclaim to attempt to run direct reclaim of filesystem objects while we have lots of objects added into deferred operations.
Rather than use GFP_NOFS for these specific allocations, just place the entire intent recovery process under NOFS context and we can then just use GFP_KERNEL for these allocations.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
show more ...
|
#
d4c75a1b |
| 15-Jan-2024 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert remaining kmem_free() to kfree()
The remaining callers of kmem_free() are freeing heap memory, so we can convert them directly to kfree() and get rid of kmem_free() altogether.
This co
xfs: convert remaining kmem_free() to kfree()
The remaining callers of kmem_free() are freeing heap memory, so we can convert them directly to kfree() and get rid of kmem_free() altogether.
This conversion was done with:
$ for f in `git grep -l kmem_free fs/xfs`; do > sed -i s/kmem_free/kfree/ $f > done $
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
show more ...
|
#
49292576 |
| 15-Jan-2024 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert kmem_free() for kvmalloc users to kvfree()
Start getting rid of kmem_free() by converting all the cases where memory can come from vmalloc interfaces to calling kvfree() directly.
Sign
xfs: convert kmem_free() for kvmalloc users to kvfree()
Start getting rid of kmem_free() by converting all the cases where memory can come from vmalloc interfaces to calling kvfree() directly.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
show more ...
|
#
f078d4ea |
| 15-Jan-2024 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert kmem_alloc() to kmalloc()
kmem_alloc() is just a thin wrapper around kmalloc() these days. Convert everything to use kmalloc() so we can get rid of the wrapper.
Note: the transaction r
xfs: convert kmem_alloc() to kmalloc()
kmem_alloc() is just a thin wrapper around kmalloc() these days. Convert everything to use kmalloc() so we can get rid of the wrapper.
Note: the transaction region allocation in xlog_add_to_transaction() can be a high order allocation. Converting it to use kmalloc(__GFP_NOFAIL) results in warnings in the page allocation code being triggered because the mm subsystem does not want us to use __GFP_NOFAIL with high order allocations like we've been doing with the kmem_alloc() wrapper for a couple of decades. Hence this specific case gets converted to xlog_kvmalloc() rather than kmalloc() to avoid this issue.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
show more ...
|
#
10634530 |
| 15-Jan-2024 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert kmem_zalloc() to kzalloc()
There's no reason to keep the kmem_zalloc() around anymore, it's just a thin wrapper around kmalloc(), so lets get rid of it.
Signed-off-by: Dave Chinner <dc
xfs: convert kmem_zalloc() to kzalloc()
There's no reason to keep the kmem_zalloc() around anymore, it's just a thin wrapper around kmalloc(), so lets get rid of it.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
show more ...
|
#
dc22af64 |
| 14-Dec-2023 |
Christoph Hellwig <hch@lst.de> |
xfs: pass the defer ops instead of type to xfs_defer_start_recovery
xfs_defer_start_recovery is only called from xlog_recover_intent_item, and the callers of that all have the actual xfs_defer_ops_t
xfs: pass the defer ops instead of type to xfs_defer_start_recovery
xfs_defer_start_recovery is only called from xlog_recover_intent_item, and the callers of that all have the actual xfs_defer_ops_type operation vector at hand. Pass that directly instead of looking it up from the defer_op_types table.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
show more ...
|
#
db7ccc0b |
| 22-Nov-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: move ->iop_recover to xfs_defer_op_type
Finish off the series by moving the intent item recovery function pointer to the xfs_defer_op_type struct, since this is really a deferred work function
xfs: move ->iop_recover to xfs_defer_op_type
Finish off the series by moving the intent item recovery function pointer to the xfs_defer_op_type struct, since this is really a deferred work function now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
e5f1a514 |
| 22-Nov-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: use xfs_defer_finish_one to finish recovered work items
Get rid of the open-coded calls to xfs_defer_finish_one. This also means that the recovery transaction takes care of cleaning up the dfp
xfs: use xfs_defer_finish_one to finish recovered work items
Get rid of the open-coded calls to xfs_defer_finish_one. This also means that the recovery transaction takes care of cleaning up the dfp, and we have solved (I hope) all the ownership issues in recovery.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
deb4cd8b |
| 22-Nov-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: transfer recovered intent item ownership in ->iop_recover
Now that we pass the xfs_defer_pending object into the intent item recovery functions, we know exactly when ownership of the sole refco
xfs: transfer recovered intent item ownership in ->iop_recover
Now that we pass the xfs_defer_pending object into the intent item recovery functions, we know exactly when ownership of the sole refcount passes from the recovery context to the intent done item. At that point, we need to null out dfp_intent so that the recovery mechanism won't release it. This should fix the UAF problem reported by Long Li.
Note that we still want to recreate the full deferred work state. That will be addressed in the next patches.
Fixes: 2e76f188fd90 ("xfs: cancel intents immediately if process_intents fails") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
a050acdf |
| 22-Nov-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: pass the xfs_defer_pending object to iop_recover
Now that log intent item recovery recreates the xfs_defer_pending state, we should pass that into the ->iop_recover routines so that the intent
xfs: pass the xfs_defer_pending object to iop_recover
Now that log intent item recovery recreates the xfs_defer_pending state, we should pass that into the ->iop_recover routines so that the intent item can finish the recreation work.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
03f7767c |
| 22-Nov-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: use xfs_defer_pending objects to recover intent items
One thing I never quite got around to doing is porting the log intent item recovery code to reconstruct the deferred pending work state. A
xfs: use xfs_defer_pending objects to recover intent items
One thing I never quite got around to doing is porting the log intent item recovery code to reconstruct the deferred pending work state. As a result, each intent item open codes xfs_defer_finish_one in its recovery method, because that's what the EFI code did before xfs_defer.c even existed.
This is a gross thing to have left unfixed -- if an EFI cannot proceed due to busy extents, we end up creating separate new EFIs for each unfinished work item, which is a change in behavior from what runtime would have done.
Worse yet, Long Li pointed out that there's a UAF in the recovery code. The ->commit_pass2 function adds the intent item to the AIL and drops the refcount. The one remaining refcount is now owned by the recovery mechanism (aka the log intent items in the AIL) with the intent of giving the refcount to the intent done item in the ->iop_recover function.
However, if something fails later in recovery, xlog_recover_finish will walk the recovered intent items in the AIL and release them. If the CIL hasn't been pushed before that point (which is possible since we don't force the log until later) then the intent done release will try to free its associated intent, which has already been freed.
This patch starts to address this mess by having the ->commit_pass2 functions recreate the xfs_defer_pending state. The next few patches will fix the recovery functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
f8f9d952 |
| 31-Jul-2023 |
Long Li <leo.lilong@huawei.com> |
xfs: abort intent items when recovery intents fail
When recovering intents, we capture newly created intent items as part of committing recovered intent items. If intent recovery fails at a later p
xfs: abort intent items when recovery intents fail
When recovering intents, we capture newly created intent items as part of committing recovered intent items. If intent recovery fails at a later point, we forget to remove those newly created intent items from the AIL and hang:
[root@localhost ~]# cat /proc/539/stack [<0>] xfs_ail_push_all_sync+0x174/0x230 [<0>] xfs_unmount_flush_inodes+0x8d/0xd0 [<0>] xfs_mountfs+0x15f7/0x1e70 [<0>] xfs_fs_fill_super+0x10ec/0x1b20 [<0>] get_tree_bdev+0x3c8/0x730 [<0>] vfs_get_tree+0x89/0x2c0 [<0>] path_mount+0xecf/0x1800 [<0>] do_mount+0xf3/0x110 [<0>] __x64_sys_mount+0x154/0x1f0 [<0>] do_syscall_64+0x39/0x80 [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
When newly created intent items fail to commit via transaction, intent recovery hasn't created done items for these newly created intent items, so the capture structure is the sole owner of the captured intent items. We must release them explicitly or else they leak:
unreferenced object 0xffff888016719108 (size 432): comm "mount", pid 529, jiffies 4294706839 (age 144.463s) hex dump (first 32 bytes): 08 91 71 16 80 88 ff ff 08 91 71 16 80 88 ff ff ..q.......q..... 18 91 71 16 80 88 ff ff 18 91 71 16 80 88 ff ff ..q.......q..... backtrace: [<ffffffff8230c68f>] xfs_efi_init+0x18f/0x1d0 [<ffffffff8230c720>] xfs_extent_free_create_intent+0x50/0x150 [<ffffffff821b671a>] xfs_defer_create_intents+0x16a/0x340 [<ffffffff821bac3e>] xfs_defer_ops_capture_and_commit+0x8e/0xad0 [<ffffffff82322bb9>] xfs_cui_item_recover+0x819/0x980 [<ffffffff823289b6>] xlog_recover_process_intents+0x246/0xb70 [<ffffffff8233249a>] xlog_recover_finish+0x8a/0x9a0 [<ffffffff822eeafb>] xfs_log_mount_finish+0x2bb/0x4a0 [<ffffffff822c0f4f>] xfs_mountfs+0x14bf/0x1e70 [<ffffffff822d1f80>] xfs_fs_fill_super+0x10d0/0x1b20 [<ffffffff81a21fa2>] get_tree_bdev+0x3d2/0x6d0 [<ffffffff81a1ee09>] vfs_get_tree+0x89/0x2c0 [<ffffffff81a9f35f>] path_mount+0xecf/0x1800 [<ffffffff81a9fd83>] do_mount+0xf3/0x110 [<ffffffff81aa00e4>] __x64_sys_mount+0x154/0x1f0 [<ffffffff83968739>] do_syscall_64+0x39/0x80
Fix the problem above by abort intent items that don't have a done item when recovery intents fail.
Fixes: e6fff81e4870 ("xfs: proper replay of deferred ops queued during log recovery") Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
show more ...
|
#
8b010acb |
| 13-Sep-2023 |
Wang Jianchao <jianchwa@outlook.com> |
xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail
In our production environment, we find that mounting a 500M /boot which is umount cleanly needs ~6s. One cause is that ffs() is used
xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail
In our production environment, we find that mounting a 500M /boot which is umount cleanly needs ~6s. One cause is that ffs() is used by xlog_write_log_records() to decide the buffer size. It can cause a lot of small IO easily when xlog_clear_stale_blocks() needs to wrap around the end of log area and log head block is not power of two. Things are similar in xlog_find_verify_cycle().
The code is able to handed bigger buffer very well, we can use roundup_pow_of_two() to replace ffs() directly to avoid small and sychronous IOs.
Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Wang Jianchao <wangjc136@midea.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
show more ...
|
#
d4d12c02 |
| 05-Jun-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: collect errors from inodegc for unlinked inode recovery
Unlinked list recovery requires errors removing the inode the from the unlinked list get fed back to the main recovery loop. Now that we
xfs: collect errors from inodegc for unlinked inode recovery
Unlinked list recovery requires errors removing the inode the from the unlinked list get fed back to the main recovery loop. Now that we offload the unlinking to the inodegc work, we don't get errors being fed back when we trip over a corruption that prevents the inode from being removed from the unlinked list.
This means we never clear the corrupt unlinked list bucket, resulting in runtime operations eventually tripping over it and shutting down.
Fix this by collecting inodegc worker errors and feed them back to the flush caller. This is largely best effort - the only context that really cares is log recovery, and it only flushes a single inode at a time so we don't need complex synchronised handling. Essentially the inodegc workers will capture the first error that occurs and the next flush will gather them and clear them. The flush itself will only report the first gathered error.
In the cases where callers can return errors, propagate the collected inodegc flush error up the error handling chain.
In the case of inode unlinked list recovery, there are several superfluous calls to flush queued unlinked inodes - xlog_recover_iunlink_bucket() guarantees that it has flushed the inodegc and collected errors before it returns. Hence nothing in the calling path needs to run a flush, even when an error is returned.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
97cf7967 |
| 17-Oct-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: avoid a UAF when log intent item recovery fails
KASAN reported a UAF bug when I was running xfs/235:
BUG: KASAN: use-after-free in xlog_recover_process_intents+0xa77/0xae0 [xfs] Read of size
xfs: avoid a UAF when log intent item recovery fails
KASAN reported a UAF bug when I was running xfs/235:
BUG: KASAN: use-after-free in xlog_recover_process_intents+0xa77/0xae0 [xfs] Read of size 8 at addr ffff88804391b360 by task mount/5680
CPU: 2 PID: 5680 Comm: mount Not tainted 6.0.0-xfsx #6.0.0 77e7b52a4943a975441e5ac90a5ad7748b7867f6 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_report.cold+0x2cc/0x682 kasan_report+0xa3/0x120 xlog_recover_process_intents+0xa77/0xae0 [xfs fb841c7180aad3f8359438576e27867f5795667e] xlog_recover_finish+0x7d/0x970 [xfs fb841c7180aad3f8359438576e27867f5795667e] xfs_log_mount_finish+0x2d7/0x5d0 [xfs fb841c7180aad3f8359438576e27867f5795667e] xfs_mountfs+0x11d4/0x1d10 [xfs fb841c7180aad3f8359438576e27867f5795667e] xfs_fs_fill_super+0x13d5/0x1a80 [xfs fb841c7180aad3f8359438576e27867f5795667e] get_tree_bdev+0x3da/0x6e0 vfs_get_tree+0x7d/0x240 path_mount+0xdd3/0x17d0 __x64_sys_mount+0x1fa/0x270 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7ff5bc069eae Code: 48 8b 0d 85 1f 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 52 1f 0f 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe433fd448 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff5bc069eae RDX: 00005575d7213290 RSI: 00005575d72132d0 RDI: 00005575d72132b0 RBP: 00005575d7212fd0 R08: 00005575d7213230 R09: 00005575d7213fe0 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00005575d7213290 R14: 00005575d72132b0 R15: 00005575d7212fd0 </TASK>
Allocated by task 5680: kasan_save_stack+0x1e/0x40 __kasan_slab_alloc+0x66/0x80 kmem_cache_alloc+0x152/0x320 xfs_rui_init+0x17a/0x1b0 [xfs] xlog_recover_rui_commit_pass2+0xb9/0x2e0 [xfs] xlog_recover_items_pass2+0xe9/0x220 [xfs] xlog_recover_commit_trans+0x673/0x900 [xfs] xlog_recovery_process_trans+0xbe/0x130 [xfs] xlog_recover_process_data+0x103/0x2a0 [xfs] xlog_do_recovery_pass+0x548/0xc60 [xfs] xlog_do_log_recovery+0x62/0xc0 [xfs] xlog_do_recover+0x73/0x480 [xfs] xlog_recover+0x229/0x460 [xfs] xfs_log_mount+0x284/0x640 [xfs] xfs_mountfs+0xf8b/0x1d10 [xfs] xfs_fs_fill_super+0x13d5/0x1a80 [xfs] get_tree_bdev+0x3da/0x6e0 vfs_get_tree+0x7d/0x240 path_mount+0xdd3/0x17d0 __x64_sys_mount+0x1fa/0x270 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0
Freed by task 5680: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_set_free_info+0x20/0x30 ____kasan_slab_free+0x144/0x1b0 slab_free_freelist_hook+0xab/0x180 kmem_cache_free+0x1f1/0x410 xfs_rud_item_release+0x33/0x80 [xfs] xfs_trans_free_items+0xc3/0x220 [xfs] xfs_trans_cancel+0x1fa/0x590 [xfs] xfs_rui_item_recover+0x913/0xd60 [xfs] xlog_recover_process_intents+0x24e/0xae0 [xfs] xlog_recover_finish+0x7d/0x970 [xfs] xfs_log_mount_finish+0x2d7/0x5d0 [xfs] xfs_mountfs+0x11d4/0x1d10 [xfs] xfs_fs_fill_super+0x13d5/0x1a80 [xfs] get_tree_bdev+0x3da/0x6e0 vfs_get_tree+0x7d/0x240 path_mount+0xdd3/0x17d0 __x64_sys_mount+0x1fa/0x270 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0
The buggy address belongs to the object at ffff88804391b300 which belongs to the cache xfs_rui_item of size 688 The buggy address is located 96 bytes inside of 688-byte region [ffff88804391b300, ffff88804391b5b0)
The buggy address belongs to the physical page: page:ffffea00010e4600 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888043919320 pfn:0x43918 head:ffffea00010e4600 order:2 compound_mapcount:0 compound_pincount:0 flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff) raw: 04fff80000010200 0000000000000000 dead000000000122 ffff88807f0eadc0 raw: ffff888043919320 0000000080140010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected
Memory state around the buggy address: ffff88804391b200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88804391b280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88804391b300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88804391b380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88804391b400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ==================================================================
The test fuzzes an rmap btree block and starts writer threads to induce a filesystem shutdown on the corrupt block. When the filesystem is remounted, recovery will try to replay the committed rmap intent item, but the corruption problem causes the recovery transaction to fail. Cancelling the transaction frees the RUD, which frees the RUI that we recovered.
When we return to xlog_recover_process_intents, @lip is now a dangling pointer, and we cannot use it to find the iop_recover method for the tracepoint. Hence we must store the item ops before calling ->iop_recover if we want to give it to the tracepoint so that the trace data will tell us exactly which intent item failed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
d03025ae |
| 14-Jul-2022 |
Bart Van Assche <bvanassche@acm.org> |
fs/xfs: Use the enum req_op and blk_opf_t types
Improve static type checking by using the enum req_op type for variables that represent a request operation and the new blk_opf_t type for the combina
fs/xfs: Use the enum req_op and blk_opf_t types
Improve static type checking by using the enum req_op type for variables that represent a request operation and the new blk_opf_t type for the combination of a request operation with request flags.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-63-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
show more ...
|
#
2fd26cc0 |
| 14-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: double link the unlinked inode list
Now we have forwards traversal via the incore inode in place, we now need to add back pointers to the incore inode to entirely replace the back reference cac
xfs: double link the unlinked inode list
Now we have forwards traversal via the incore inode in place, we now need to add back pointers to the incore inode to entirely replace the back reference cache. We use the same lookup semantics and constraints as for the forwards pointer lookups during unlinks, and so we can look up any inode in the unlinked list directly and update the list pointers, forwards or backwards, at any time.
The only wrinkle in converting the unlinked list manipulations to use in-core previous pointers is that log recovery doesn't have the incore inode state built up so it can't just read in an inode and release it to finish off the unlink. Hence we need to modify the traversal in recovery to read one inode ahead before we release the inode at the head of the list. This populates the next->prev relationship sufficient to be able to replay the unlinked list and hence greatly simplify the runtime code.
This recovery algorithm also requires that we actually remove inodes from the unlinked list one at a time as background inode inactivation will result in unlinked list removal racing with the building of the in-memory unlinked list state. We could serialise this by holding the AGI buffer lock when constructing the in memory state, but all that does is lockstep background processing with list building. It is much simpler to flush the inodegc immediately after releasing the inode so that it is unlinked immediately and there is no races present at all.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
show more ...
|