#
c034ec84 |
| 24-Feb-2024 |
Ankit Agrawal <ankita@nvidia.com> |
KVM: arm64: Introduce new flag for non-cacheable IO memory
Currently, KVM for ARM64 maps at stage 2 memory that is considered device (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this se
KVM: arm64: Introduce new flag for non-cacheable IO memory
Currently, KVM for ARM64 maps at stage 2 memory that is considered device (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting overrides (as per the ARM architecture [1]) any device MMIO mapping present at stage 1, resulting in a set-up whereby a guest operating system cannot determine device MMIO mapping memory attributes on its own but it is always overridden by the KVM stage 2 default.
This set-up does not allow guest operating systems to select device memory attributes independently from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage 2 memory type attributes"), which turns out to be an issue in that guest operating systems (e.g. Linux) may request to map devices MMIO regions with memory attributes that guarantee better performance (e.g. gathering attribute - that for some devices can generate larger PCIe memory writes TLPs) and specific operations (e.g. unaligned transactions) such as the NormalNC memory type.
The default device stage 2 mapping was chosen in KVM for ARM64 since it was considered safer (i.e. it would not allow guests to trigger uncontained failures ultimately crashing the machine) but this turned out to be asynchronous (SError) defeating the purpose.
Failures containability is a property of the platform and is independent from the memory type used for MMIO device memory mappings.
Actually, DEVICE_nGnRE memory type is even more problematic than Normal-NC memory type in terms of faults containability in that e.g. aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally, synchronous (i.e. that would imply that the processor should issue at most 1 load transaction at a time - it cannot pipeline them - otherwise the synchronous abort semantics would break the no-speculation attribute attached to DEVICE_XXX memory).
This means that regardless of the combined stage1+stage2 mappings a platform is safe if and only if device transactions cannot trigger uncontained failures and that in turn relies on platform capabilities and the device type being assigned (i.e. PCIe AER/DPC error containment and RAS architecture[3]); therefore the default KVM device stage 2 memory attributes play no role in making device assignment safer for a given platform (if the platform design adheres to design guidelines outlined in [3]) and therefore can be relaxed.
For all these reasons, relax the KVM stage 2 device memory attributes from DEVICE_nGnRE to Normal-NC.
The NormalNC was chosen over a different Normal memory type default at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping.
Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to trigger any issue on guest device reclaim use cases either (i.e. device MMIO unmap followed by a device reset) at least for PCIe devices, in that in PCIe a device reset is architected and carried out through PCI config space transactions that are naturally ordered with respect to MMIO transactions according to the PCI ordering rules.
Having Normal-NC S2 default puts guests in control (thanks to stage1+stage2 combined memory attributes rules [1]) of device MMIO regions memory mappings, according to the rules described in [1] and summarized here ([(S1) - stage1], [(S2) - stage 2]):
S1 | S2 | Result NORMAL-WB | NORMAL-NC | NORMAL-NC NORMAL-WT | NORMAL-NC | NORMAL-NC NORMAL-NC | NORMAL-NC | NORMAL-NC DEVICE<attr> | NORMAL-NC | DEVICE<attr>
It is worth noting that currently, to map devices MMIO space to user space in a device pass-through use case the VFIO framework applies memory attributes derived from pgprot_noncached() settings applied to VMAs, which result in device-nGnRnE memory attributes for the stage-1 VMM mappings.
This means that a userspace mapping for device MMIO space carried out with the current VFIO framework and a guest OS mapping for the same MMIO space may result in a mismatched alias as described in [2].
Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change anything in this respect, in that the mismatched aliases would only affect (refer to [2] for a detailed explanation) ordering between the userspace and GuestOS mappings resulting stream of transactions (i.e. it does not cause loss of property for either stream of transactions on its own), which is harmless given that the userspace and GuestOS access to the device is carried out through independent transactions streams.
A Normal-NC flag is not present today. So add a new kvm_pgtable_prot (KVM_PGTABLE_PROT_NORMAL_NC) flag for it, along with its corresponding PTE value 0x5 (0b101) determined from [1].
Lastly, adapt the stage2 PTE property setter function (stage2_set_prot_attr) to handle the NormalNC attribute.
The entire discussion leading to this patch series may be followed through the following links. Link: https://lore.kernel.org/all/20230907181459.18145-3-ankita@nvidia.com Link: https://lore.kernel.org/r/20231205033015.10044-1-ankita@nvidia.com
[1] section D8.5.5 - DDI0487J_a_a-profile_architecture_reference_manual.pdf [2] section B2.8 - DDI0487J_a_a-profile_architecture_reference_manual.pdf [3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf
Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20240224150546.368-2-ankita@nvidia.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
show more ...
|
#
0abc1b11 |
| 27-Nov-2023 |
Ryan Roberts <ryan.roberts@arm.com> |
KVM: arm64: Support up to 5 levels of translation in kvm_pgtable
FEAT_LPA2 increases the maximum levels of translation from 4 to 5 for the 4KB page case, when IA is >48 bits. While we can still use
KVM: arm64: Support up to 5 levels of translation in kvm_pgtable
FEAT_LPA2 increases the maximum levels of translation from 4 to 5 for the 4KB page case, when IA is >48 bits. While we can still use 4 levels for stage2 translation in this case (due to stage2 allowing concatenated page tables for first level lookup), the same kvm_pgtable library is used for the hyp stage1 page tables and stage1 does not support concatenation.
Therefore, modify the library to support up to 5 levels. Previous patches already laid the groundwork for this by refactoring code to work in terms of KVM_PGTABLE_FIRST_LEVEL and KVM_PGTABLE_LAST_LEVEL. So we just need to change these macros.
The hardware sometimes encodes the new level differently from the others: One such place is when reading the level from the FSC field in the ESR_EL2 register. We never expect to see the lowest level (-1) here since the stage 2 page tables always use concatenated tables for first level lookup and therefore only use 4 levels of lookup. So we get away with just adding a comment to explain why we are not being careful about decoding level -1.
For stage2 VTCR_EL2.SL2 is introduced to encode the new start level. However, since we always use concatenated page tables for first level look up at stage2 (and therefore we will never need the new extra level) we never touch this new field.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231127111737.1897081-10-ryan.roberts@arm.com
show more ...
|
#
419edf48 |
| 27-Nov-2023 |
Ryan Roberts <ryan.roberts@arm.com> |
KVM: arm64: Convert translation level parameter to s8
With the introduction of FEAT_LPA2, the Arm ARM adds a new level of translation, level -1, so levels can now be in the range [-1;3]. 3 is always
KVM: arm64: Convert translation level parameter to s8
With the introduction of FEAT_LPA2, the Arm ARM adds a new level of translation, level -1, so levels can now be in the range [-1;3]. 3 is always the last level and the first level is determined based on the number of VA bits in use.
Convert level variables to use a signed type in preparation for supporting this new level -1.
Since the last level is always anchored at 3, and the first level varies to suit the number of VA/IPA bits, take the opportunity to replace KVM_PGTABLE_MAX_LEVELS with the 2 macros KVM_PGTABLE_FIRST_LEVEL and KVM_PGTABLE_LAST_LEVEL. This removes the assumption from the code that levels run from 0 to KVM_PGTABLE_MAX_LEVELS - 1, which will soon no longer be true.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231127111737.1897081-9-ryan.roberts@arm.com
show more ...
|
#
bd412e2a |
| 27-Nov-2023 |
Ryan Roberts <ryan.roberts@arm.com> |
KVM: arm64: Use LPA2 page-tables for stage2 and hyp stage1
Implement a simple policy whereby if the HW supports FEAT_LPA2 for the page size we are using, always use LPA2-style page-tables for stage
KVM: arm64: Use LPA2 page-tables for stage2 and hyp stage1
Implement a simple policy whereby if the HW supports FEAT_LPA2 for the page size we are using, always use LPA2-style page-tables for stage 2 and hyp stage 1 (assuming an nvhe hyp), regardless of the VMM-requested IPA size or HW-implemented PA size. When in use we can now support up to 52-bit IPA and PA sizes.
We use the previously created cpu feature to track whether LPA2 is supported for deciding whether to use the LPA2 or classic pte format.
Note that FEAT_LPA2 brings support for bigger block mappings (512GB with 4KB, 64GB with 16KB). We explicitly don't enable these in the library because stage2_apply_range() works on batch sizes of the largest used block mapping, and increasing the size of the batch would lead to soft lockups. See commit 5994bc9e05c2 ("KVM: arm64: Limit stage2_apply_range() batch size to largest block").
With the addition of LPA2 support in the hypervisor, the PA size supported by the HW must be capped with a runtime decision, rather than simply using a compile-time decision based on PA_BITS. For example, on a system that advertises 52 bit PA but does not support FEAT_LPA2, A 4KB or 16KB kernel compiled with LPA2 support must still limit the PA size to 48 bits.
Therefore, move the insertion of the PS field into TCR_EL2 out of __kvm_hyp_init assembly code and instead do it in cpu_prepare_hyp_mode() where the rest of TCR_EL2 is prepared. This allows us to figure out PS with kvm_get_parange(), which has the appropriate logic to ensure the above requirement. (and the PS field of VTCR_EL2 is already populated this way).
Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231127111737.1897081-8-ryan.roberts@arm.com
show more ...
|
#
936a4ec2 |
| 27-Nov-2023 |
Ryan Roberts <ryan.roberts@arm.com> |
arm64/mm: Add lpa2_is_enabled() kvm_lpa2_is_enabled() stubs
Add stub functions which is initially always return false. These provide the hooks that we need to update the range-based TLBI routines, w
arm64/mm: Add lpa2_is_enabled() kvm_lpa2_is_enabled() stubs
Add stub functions which is initially always return false. These provide the hooks that we need to update the range-based TLBI routines, whose operands are encoded differently depending on whether lpa2 is enabled or not.
The kernel and kvm will enable the use of lpa2 asynchronously in future, and part of that enablement will involve fleshing out their respective hook to advertise when it is using lpa2.
Since the kernel's decision to use lpa2 relies on more than just whether the HW supports the feature, it can't just use the same static key as kvm. This is another reason to use separate functions. lpa2_is_enabled() is already implemented as part of Ard's kernel lpa2 series. Since kvm will make its decision solely based on HW support, kvm_lpa2_is_enabled() will be defined as system_supports_lpa2() once kvm starts using lpa2.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231127111737.1897081-3-ryan.roberts@arm.com
show more ...
|
#
117940aa |
| 11-Aug-2023 |
Raghavendra Rao Ananta <rananta@google.com> |
KVM: arm64: Define kvm_tlb_flush_vmid_range()
Implement the helper kvm_tlb_flush_vmid_range() that acts as a wrapper for range-based TLB invalidations. For the given VMID, use the range-based TLBI i
KVM: arm64: Define kvm_tlb_flush_vmid_range()
Implement the helper kvm_tlb_flush_vmid_range() that acts as a wrapper for range-based TLB invalidations. For the given VMID, use the range-based TLBI instructions to do the job or fallback to invalidating all the TLB entries.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-11-rananta@google.com
show more ...
|
#
df6556ad |
| 27-Jun-2023 |
Oliver Upton <oliver.upton@linux.dev> |
KVM: arm64: Correctly handle page aging notifiers for unaligned memslot
Userspace is allowed to select any PAGE_SIZE aligned hva to back guest memory. This is even the case with hugepages, although
KVM: arm64: Correctly handle page aging notifiers for unaligned memslot
Userspace is allowed to select any PAGE_SIZE aligned hva to back guest memory. This is even the case with hugepages, although it is a rather suboptimal configuration as PTE level mappings are used at stage-2.
The arm64 page aging handlers have an assumption that the specified range is exactly one page/block of memory, which in the aforementioned case is not necessarily true. All together this leads to the WARN() in kvm_age_gfn() firing.
However, the WARN is only part of the issue as the table walkers visit at most a single leaf PTE. For hugepage-backed memory in a memslot that isn't hugepage-aligned, page aging entirely misses accesses to the hugepage beyond the first page in the memslot.
Add a new walker dedicated to handling page aging MMU notifiers capable of walking a range of PTEs. Convert kvm(_test)_age_gfn() over to the new walker and drop the WARN that caught the issue in the first place. The implementation of this walker was inspired by the test_clear_young() implementation by Yu Zhao [*], but repurposed to address a bug in the existing aging implementation.
Cc: stable@vger.kernel.org # v5.15 Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers") Link: https://lore.kernel.org/kvmarm/20230526234435.662652-6-yuzhao@google.com/ Co-developed-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Link: https://lore.kernel.org/r/20230627235405.4069823-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
show more ...
|
#
a9f0e3d5 |
| 22-May-2023 |
Fuad Tabba <tabba@google.com> |
KVM: arm64: Reload PTE after invoking walker callback on preorder traversal
The preorder callback on the kvm_pgtable_stage2_map() path can replace a table with a block, then recursively free the det
KVM: arm64: Reload PTE after invoking walker callback on preorder traversal
The preorder callback on the kvm_pgtable_stage2_map() path can replace a table with a block, then recursively free the detached table. The higher-level walking logic stashes the old page table entry and then walks the freed table, invoking the leaf callback and potentially freeing pgtable pages prematurely.
In normal operation, the call to tear down the detached stage-2 is indirected and uses an RCU callback to trigger the freeing. RCU is not available to pKVM, which is where this bug is triggered.
Change the behavior of the walker to reload the page table entry after invoking the walker callback on preorder traversal, as it does for leaf entries.
Tested on Pixel 6.
Fixes: 5c359cca1faf ("KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make") Suggested-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230522103258.402272-1-tabba@google.com
show more ...
|
#
8f5a3eb7 |
| 26-Apr-2023 |
Ricardo Koller <ricarkol@google.com> |
KVM: arm64: Add kvm_pgtable_stage2_split()
Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a range of huge pages. This will be used for eager-splitting huge pages into PAGE_SIZE
KVM: arm64: Add kvm_pgtable_stage2_split()
Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a range of huge pages. This will be used for eager-splitting huge pages into PAGE_SIZE pages. The goal is to avoid having to split huge pages on write-protection faults, and instead use this function to do it ahead of time for large ranges (e.g., all guest memory in 1G chunks at a time).
Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-7-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
show more ...
|
#
2f440b72 |
| 26-Apr-2023 |
Ricardo Koller <ricarkol@google.com> |
KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
Add a capability for userspace to specify the eager split chunk size. The chunk size specifies how many pages to break at a time, using a single al
KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
Add a capability for userspace to specify the eager split chunk size. The chunk size specifies how many pages to break at a time, using a single allocation. Bigger the chunk size, more pages need to be allocated ahead of time.
Suggested-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-6-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
show more ...
|
#
e7c05540 |
| 26-Apr-2023 |
Ricardo Koller <ricarkol@google.com> |
KVM: arm64: Add helper for creating unlinked stage2 subtrees
Add a stage2 helper, kvm_pgtable_stage2_create_unlinked(), for creating unlinked tables (which is the opposite of kvm_pgtable_stage2_free
KVM: arm64: Add helper for creating unlinked stage2 subtrees
Add a stage2 helper, kvm_pgtable_stage2_create_unlinked(), for creating unlinked tables (which is the opposite of kvm_pgtable_stage2_free_unlinked()). Creating an unlinked table is useful for splitting level 1 and 2 entries into subtrees of PAGE_SIZE PTEs. For example, a level 1 entry can be split into PAGE_SIZE PTEs by first creating a fully populated tree, and then use it to replace the level 1 entry in a single step. This will be used in a subsequent commit for eager huge-page splitting (a dirty-logging optimization).
Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-4-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
show more ...
|
#
02f10845 |
| 26-Apr-2023 |
Ricardo Koller <ricarkol@google.com> |
KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs
Add two flags to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_SKIP_BBM_TLBI and KVM_PGTABLE_WALK_SKIP_CMO, to indicate that the walk
KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs
Add two flags to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_SKIP_BBM_TLBI and KVM_PGTABLE_WALK_SKIP_CMO, to indicate that the walk should not perform TLB invalidations (TLBIs) in break-before-make (BBM) nor cache maintenance operations (CMO). This will be used by a future commit to create unlinked tables not accessible to the HW page-table walker.
Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-3-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
show more ...
|
#
c14d08c5 |
| 26-Apr-2023 |
Ricardo Koller <ricarkol@google.com> |
KVM: arm64: Rename free_removed to free_unlinked
Normalize on referring to tables outside of an active paging structure as 'unlinked'.
A subsequent change to KVM will add support for building page
KVM: arm64: Rename free_removed to free_unlinked
Normalize on referring to tables outside of an active paging structure as 'unlinked'.
A subsequent change to KVM will add support for building page tables that are not part of an active paging structure. The existing 'removed_table' terminology is quite clunky when applied in this context.
Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-2-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
show more ...
|
#
1f0f4a2e |
| 21-Apr-2023 |
Oliver Upton <oliver.upton@linux.dev> |
KVM: arm64: Infer the PA offset from IPA in stage-2 map walker
Until now, the page table walker counted increments to the PA and IPA of a walk in two separate places. While the PA is incremented as
KVM: arm64: Infer the PA offset from IPA in stage-2 map walker
Until now, the page table walker counted increments to the PA and IPA of a walk in two separate places. While the PA is incremented as soon as a leaf PTE is installed in stage2_map_walker_try_leaf(), the IPA is actually bumped in the generic table walker context. Critically, __kvm_pgtable_visit() rereads the PTE after the LEAF callback returns to work out if a table or leaf was installed, and only bumps the IPA for a leaf PTE.
This arrangement worked fine when we handled faults behind the write lock, as the walker had exclusive access to the stage-2 page tables. However, commit 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel") started handling all stage-2 faults behind the read lock, opening up a race where a walker could increment the PA but not the IPA of a walk. Nothing good ensues, as the walker starts mapping with the incorrect IPA -> PA relationship.
For example, assume that two vCPUs took a data abort on the same IPA. One observes that dirty logging is disabled, and the other observed that it is enabled:
vCPU attempting PMD mapping vCPU attempting PTE mapping ====================================== ===================================== /* install PMD */ stage2_make_pte(ctx, leaf); data->phys += granule; /* replace PMD with a table */ stage2_try_break_pte(ctx, data->mmu); stage2_make_pte(ctx, table); /* table is observed */ ctx.old = READ_ONCE(*ptep); table = kvm_pte_table(ctx.old, level);
/* * map walk continues w/o incrementing * IPA. */ __kvm_pgtable_walk(..., level + 1);
Bring an end to the whole mess by using the IPA as the single source of truth for how far along a walk has gotten. Work out the correct PA to map by calculating the IPA offset from the beginning of the walk and add that to the starting physical address.
Cc: stable@vger.kernel.org Fixes: 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel") Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230421071606.1603916-2-oliver.upton@linux.dev
show more ...
|
#
ddcadb29 |
| 02-Dec-2022 |
Oliver Upton <oliver.upton@linux.dev> |
KVM: arm64: Ignore EAGAIN for walks outside of a fault
The page table walkers are invoked outside fault handling paths, such as write protecting a range of memory. EAGAIN is generally used by the wa
KVM: arm64: Ignore EAGAIN for walks outside of a fault
The page table walkers are invoked outside fault handling paths, such as write protecting a range of memory. EAGAIN is generally used by the walkers to retry execution due to races on a particular PTE, like taking an access fault on a PTE being invalidated from another thread.
This early return behavior is undesirable for walkers that operate outside a fault handler. Suppress EAGAIN and continue the walk if operating outside a fault handler.
Link: https://lore.kernel.org/r/20221202185156.696189-3-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
show more ...
|
#
9a7ad19a |
| 02-Dec-2022 |
Oliver Upton <oliver.upton@linux.dev> |
KVM: arm64: Use KVM's pte type/helpers in handle_access_fault()
Consistently use KVM's own pte types and helpers in handle_access_fault().
No functional change intended.
Link: https://lore.kernel.
KVM: arm64: Use KVM's pte type/helpers in handle_access_fault()
Consistently use KVM's own pte types and helpers in handle_access_fault().
No functional change intended.
Link: https://lore.kernel.org/r/20221202185156.696189-2-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
show more ...
|
#
5e806c58 |
| 18-Nov-2022 |
Oliver Upton <oliver.upton@linux.dev> |
KVM: arm64: Reject shared table walks in the hyp code
Exclusive table walks are the only supported table walk in the hyp, as there is no construct like RCU available in the hypervisor code. Reject a
KVM: arm64: Reject shared table walks in the hyp code
Exclusive table walks are the only supported table walk in the hyp, as there is no construct like RCU available in the hypervisor code. Reject any attempt to do a shared table walk by returning an error and allowing the caller to clean up the mess.
Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221118182222.3932898-4-oliver.upton@linux.dev
show more ...
|
#
b7833bf2 |
| 18-Nov-2022 |
Oliver Upton <oliver.upton@linux.dev> |
KVM: arm64: Don't acquire RCU read lock for exclusive table walks
Marek reported a BUG resulting from the recent parallel faults changes, as the hyp stage-1 map walker attempted to allocate table me
KVM: arm64: Don't acquire RCU read lock for exclusive table walks
Marek reported a BUG resulting from the recent parallel faults changes, as the hyp stage-1 map walker attempted to allocate table memory while holding the RCU read lock:
BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 2 locks held by swapper/0/1: #0: ffff80000a8a44d0 (kvm_hyp_pgd_mutex){+.+.}-{3:3}, at: __create_hyp_mappings+0x80/0xc4 #1: ffff80000a927720 (rcu_read_lock){....}-{1:2}, at: kvm_pgtable_walk+0x0/0x1f4 CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc3+ #5918 Hardware name: Raspberry Pi 3 Model B (DT) Call trace: dump_backtrace.part.0+0xe4/0xf0 show_stack+0x18/0x40 dump_stack_lvl+0x8c/0xb8 dump_stack+0x18/0x34 __might_resched+0x178/0x220 __might_sleep+0x48/0xa0 prepare_alloc_pages+0x178/0x1a0 __alloc_pages+0x9c/0x109c alloc_page_interleave+0x1c/0xc4 alloc_pages+0xec/0x160 get_zeroed_page+0x1c/0x44 kvm_hyp_zalloc_page+0x14/0x20 hyp_map_walker+0xd4/0x134 kvm_pgtable_visitor_cb.isra.0+0x38/0x5c __kvm_pgtable_walk+0x1a4/0x220 kvm_pgtable_walk+0x104/0x1f4 kvm_pgtable_hyp_map+0x80/0xc4 __create_hyp_mappings+0x9c/0xc4 kvm_mmu_init+0x144/0x1cc kvm_arch_init+0xe4/0xef4 kvm_init+0x3c/0x3d0 arm_init+0x20/0x30 do_one_initcall+0x74/0x400 kernel_init_freeable+0x2e0/0x350 kernel_init+0x24/0x130 ret_from_fork+0x10/0x20
Since the hyp stage-1 table walkers are serialized by kvm_hyp_pgd_mutex, RCU protection really doesn't add anything. Don't acquire the RCU read lock for an exclusive walk.
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221118182222.3932898-3-oliver.upton@linux.dev
show more ...
|
#
3a5154c7 |
| 18-Nov-2022 |
Oliver Upton <oliver.upton@linux.dev> |
KVM: arm64: Take a pointer to walker data in kvm_dereference_pteref()
Rather than passing through the state of the KVM_PGTABLE_WALK_SHARED flag, just take a pointer to the whole walker structure ins
KVM: arm64: Take a pointer to walker data in kvm_dereference_pteref()
Rather than passing through the state of the KVM_PGTABLE_WALK_SHARED flag, just take a pointer to the whole walker structure instead. Move around struct kvm_pgtable and the RCU indirection such that the associated ifdeffery remains in one place while ensuring the walker + flags definitions precede their use.
No functional change intended.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221118182222.3932898-2-oliver.upton@linux.dev
show more ...
|
#
aa6948f8 |
| 10-Nov-2022 |
Quentin Perret <qperret@google.com> |
KVM: arm64: Add per-cpu fixmap infrastructure at EL2
Mapping pages in a guest page-table from within the pKVM hypervisor at EL2 may require cache maintenance to ensure that the initialised page cont
KVM: arm64: Add per-cpu fixmap infrastructure at EL2
Mapping pages in a guest page-table from within the pKVM hypervisor at EL2 may require cache maintenance to ensure that the initialised page contents is visible even to non-cacheable (e.g. MMU-off) accesses from the guest.
In preparation for performing this maintenance at EL2, introduce a per-vCPU fixmap which allows the pKVM hypervisor to map guest pages temporarily into its stage-1 page-table for the purposes of cache maintenance and, in future, poisoning on the reclaim path. The use of a fixmap avoids the need for memory allocation or locking on the map() path.
Tested-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Co-developed-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110190259.26861-15-will@kernel.org
show more ...
|
#
a1ec5c70 |
| 10-Nov-2022 |
Fuad Tabba <tabba@google.com> |
KVM: arm64: Add infrastructure to create and track pKVM instances at EL2
Introduce a global table (and lock) to track pKVM instances at EL2, and provide hypercalls that can be used by the untrusted
KVM: arm64: Add infrastructure to create and track pKVM instances at EL2
Introduce a global table (and lock) to track pKVM instances at EL2, and provide hypercalls that can be used by the untrusted host to create and destroy pKVM VMs and their vCPUs. pKVM VM/vCPU state is directly accessible only by the trusted hypervisor (EL2).
Each pKVM VM is directly associated with an untrusted host KVM instance, and is referenced by the host using an opaque handle. Future patches will provide hypercalls to allow the host to initialize/set/get pKVM VM/vCPU state using the opaque handle.
Tested-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Co-developed-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> [maz: silence warning on unmap_donated_memory_noclear()] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110190259.26861-13-will@kernel.org
show more ...
|
#
1577cb58 |
| 07-Nov-2022 |
Oliver Upton <oliver.upton@linux.dev> |
KVM: arm64: Handle stage-2 faults in parallel
The stage-2 map walker has been made parallel-aware, and as such can be called while only holding the read side of the MMU lock. Rip out the conditional
KVM: arm64: Handle stage-2 faults in parallel
The stage-2 map walker has been made parallel-aware, and as such can be called while only holding the read side of the MMU lock. Rip out the conditional locking in user_mem_abort() and instead grab the read lock. Continue to take the write lock from other callsites to kvm_pgtable_stage2_map().
Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107220033.1895655-1-oliver.upton@linux.dev
show more ...
|
#
c3119ae4 |
| 07-Nov-2022 |
Oliver Upton <oliver.upton@linux.dev> |
KVM: arm64: Protect stage-2 traversal with RCU
Use RCU to safely walk the stage-2 page tables in parallel. Acquire and release the RCU read lock when traversing the page tables. Defer the freeing of
KVM: arm64: Protect stage-2 traversal with RCU
Use RCU to safely walk the stage-2 page tables in parallel. Acquire and release the RCU read lock when traversing the page tables. Defer the freeing of table memory to an RCU callback. Indirect the calls into RCU and provide stubs for hypervisor code, as RCU is not available in such a context.
The RCU protection doesn't amount to much at the moment, as readers are already protected by the read-write lock (all walkers that free table memory take the write lock). Nonetheless, a subsequent change will futher relax the locking requirements around the stage-2 MMU, thereby depending on RCU.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-9-oliver.upton@linux.dev
show more ...
|
#
5c359cca |
| 07-Nov-2022 |
Oliver Upton <oliver.upton@linux.dev> |
KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make
The break-before-make sequence is a bit annoying as it opens a window wherein memory is unmapped from the guest. KVM should rep
KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make
The break-before-make sequence is a bit annoying as it opens a window wherein memory is unmapped from the guest. KVM should replace the PTE as quickly as possible and avoid unnecessary work in between.
Presently, the stage-2 map walker tears down a removed table before installing a block mapping when coalescing a table into a block. As the removed table is no longer visible to hardware walkers after the DSB+TLBI, it is possible to move the remaining cleanup to happen after installing the new PTE.
Reshuffle the stage-2 map walker to install the new block entry in the pre-order callback. Unwire all of the teardown logic and replace it with a call to kvm_pgtable_stage2_free_removed() after fixing the PTE. The post-order visitor is now completely unnecessary, so drop it. Finally, touch up the comments to better represent the now simplified map walker.
Note that the call to tear down the unlinked stage-2 is indirected as a subsequent change will use an RCU callback to trigger tear down. RCU is not available to pKVM, so there is a need to use different implementations on pKVM and non-pKVM VMs.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-8-oliver.upton@linux.dev
show more ...
|
#
6b91b8f9 |
| 07-Nov-2022 |
Oliver Upton <oliver.upton@linux.dev> |
KVM: arm64: Use an opaque type for pteps
Use an opaque type for pteps and require visitors explicitly dereference the pointer before using. Protecting page table memory with RCU requires that KVM de
KVM: arm64: Use an opaque type for pteps
Use an opaque type for pteps and require visitors explicitly dereference the pointer before using. Protecting page table memory with RCU requires that KVM dereferences RCU-annotated pointers before using. However, RCU is not available for use in the nVHE hypervisor and the opaque type can be conditionally annotated with RCU for the stage-2 MMU.
Call the type a 'pteref' to avoid a naming collision with raw pteps. No functional change intended.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-7-oliver.upton@linux.dev
show more ...
|