a428636d | 29-Nov-2022 | Ard Biesheuvel <ardb@kernel.org>
crypto: arm64/ghash-ce - use frame_push/pop macros consistently
Use the frame_push and frame_pop macros to set up the stack frame so that return address protections will be enabled automatically when configured.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

c50d3285 | 08-Sep-2022 | Sami Tolvanen <samitolvanen@google.com>
arm64: Add types to indirect called assembly functions
With CONFIG_CFI_CLANG, assembly functions indirectly called from C code must be annotated with type identifiers to pass CFI checking. Use SYM_TYPED_FUNC_START for the indirectly called functions, and ensure we emit `bti c` also with SYM_TYPED_FUNC_START.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220908215504.3686827-10-samitolvanen@google.com
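
For illustration, here is a minimal userspace C sketch of the situation this patch addresses: an assembly routine that is only ever reached through a C function pointer. The names and prototype below are hypothetical stand-ins, not the driver's real API.

    #include <stdio.h>

    /*
     * With CONFIG_CFI_CLANG (kCFI), every indirect call site checks a type
     * hash derived from the pointer's prototype against a hash emitted next
     * to the callee.  An assembly function called like block_fn below must
     * therefore be declared with SYM_TYPED_FUNC_START so a matching hash is
     * emitted for it; a plain SYM_FUNC_START callee would trap at the call.
     */
    typedef void (*block_fn)(unsigned long long dg[2],
                             const unsigned char *src, int blocks);

    /* Stand-in for the assembly GHASH routine (dummy work only). */
    static void ghash_blocks_stub(unsigned long long dg[2],
                                  const unsigned char *src, int blocks)
    {
        (void)src;
        dg[0] ^= (unsigned long long)blocks;
        dg[1] ^= 1;
    }

    int main(void)
    {
        block_fn do_blocks = ghash_blocks_stub;   /* indirect, CFI-checked call */
        unsigned long long dg[2] = { 0, 0 };
        unsigned char buf[16] = { 0 };

        do_blocks(dg, buf, 1);
        printf("dg = %llx %llx\n", dg[0], dg[1]);
        return 0;
    }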

3ad99c22 | 10-Nov-2020 | Ard Biesheuvel <ardb@kernel.org>
crypto: arm64/gcm - move authentication tag check to SIMD domain
Instead of copying the calculated authentication tag to memory and calling crypto_memneq() to verify it, use vector bytewise compare and min across vector instructions to decide whether the tag is valid. This is more efficient, and given that the tag is only transiently held in a NEON register, it is also safer, since calculated tags for failed decryptions should be withheld.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
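
For reference, a small self-contained C sketch of the crypto_memneq()-style check that this patch moves out of memory and into NEON registers; the constant-time accumulate-all-differences idea is the same, only where the comparison happens changes.

    #include <stddef.h>

    /*
     * Accumulate the XOR of every byte pair so the running time does not
     * depend on where (or whether) the tags differ.  The NEON version
     * described above does the equivalent with a bytewise vector compare
     * and a minimum across lanes, without the tag ever touching memory.
     */
    static int tag_differs(const unsigned char *a, const unsigned char *b,
                           size_t n)
    {
        unsigned char acc = 0;
        size_t i;

        for (i = 0; i < n; i++)
            acc |= a[i] ^ b[i];

        return acc != 0;    /* nonzero: authentication failure */
    }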

2ca86c34 | 18-Feb-2020 | Mark Brown <broonie@kernel.org>
arm64: crypto: Modernize some extra assembly annotations
A couple of functions were missed in the modernisation of assembly macros; update them too.
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

0e89640b | 13-Dec-2019 | Mark Brown <broonie@kernel.org>
crypto: arm64 - Use modern annotations for assembly functions
In an effort to clarify and simplify the annotation of assembly functions in the kernel, new macros have been introduced. These replace ENTRY and ENDPROC and also add a new annotation for static functions, which previously had no ENTRY equivalent. Update the annotations in the crypto code to the new macros.
There are a small number of files imported from OpenSSL where the assembly is generated using perl programs; these are not currently annotated at all and have not been modified.
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

11031c0d | 10-Sep-2019 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
crypto: arm64/gcm-ce - implement 4 way interleave
To improve performance on cores with deep pipelines such as ThunderX2, reimplement gcm(aes) using a 4-way interleave rather than the 2-way interleave we use currently.
This comes down to a complete rewrite of the GCM part of the combined GCM/GHASH driver, and instead of interleaving two invocations of AES with the GHASH handling at the instruction level, the new version uses a more coarse grained approach where each chunk of 64 bytes is encrypted first and then ghashed (or ghashed and then decrypted in the converse case).
The core NEON routine is now able to consume inputs of any size, and tail blocks of less than 64 bytes are handled using overlapping loads and stores, and processed by the same 4-way encryption and hashing routines. This gets rid of most of the branches, and avoids having to return to the C code to handle the tail block using a stack buffer.
The table below compares the performance of the old driver and the new one on various micro-architectures and running in various modes.
        |     AES-128      |     AES-192      |     AES-256      |
 #bytes | 512 | 1500 |  4k | 512 | 1500 |  4k | 512 | 1500 |  4k |
 -------+-----+------+-----+-----+------+-----+-----+------+-----+
    TX2 | 35% |  23% | 11% | 34% |  20% |  9% | 38% |  25% | 16% |
   EMAG | 11% |   6% |  3% | 12% |   4% |  2% | 11% |   4% |  2% |
    A72 |  8% |   5% | -4% |  9% |   4% | -5% |  7% |   4% | -5% |
    A53 | 11% |   6% | -1% | 10% |   8% | -1% | 10% |   8% | -2% |
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

d2912cb1 | 04-Jun-2019 | Thomas Gleixner <tglx@linutronix.de>
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
Based on 2 normalized pattern(s):
this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation
this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation #
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 4122 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

22240df7 | 04-Aug-2018 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
crypto: arm64/ghash-ce - implement 4-way aggregation
Enhance the GHASH implementation that uses 64-bit polynomial multiplication by adding support for 4-way aggregation. This more than doubles the performance, from 2.4 cycles per byte to 1.1 cpb on Cortex-A53.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
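
For context on the aggregation claim (standard GHASH algebra, not taken from the patch itself), the per-block step and its four-way unrolled form are:

    Y_i     = (Y_{i-1} \oplus X_i) \cdot H
    Y_{i+4} = (Y_i \oplus X_{i+1}) \cdot H^4 \oplus X_{i+2} \cdot H^3
              \oplus X_{i+3} \cdot H^2 \oplus X_{i+4} \cdot H

with all arithmetic in GF(2^128). Four blocks can thus be multiplied by precomputed powers of H and the reduction modulo the field polynomial performed once per group, which is what allows the per-byte cycle count to drop as reported above.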

8e492eff | 04-Aug-2018 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
crypto: arm64/ghash-ce - replace NEON yield check with block limit
Checking the TIF_NEED_RESCHED flag is disproportionately costly on cores with fast crypto instructions and comparatively slow memory accesses.
On algorithms such as GHASH, which executes at ~1 cycle per byte on cores that implement support for 64 bit polynomial multiplication, there is really no need to check the TIF_NEED_RESCHED flag particularly often, and so we can remove the NEON yield check from the assembler routines.
However, unlike the AEAD or skcipher APIs, the shash/ahash APIs take arbitrary input lengths, and so there needs to be some sanity check to ensure that we don't hog the CPU for excessive amounts of time.
So let's simply cap the maximum input size that is processed in one go to 64 KB.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
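
A hedged sketch of the shape of that cap (hypothetical helper names, not the driver's real API): arbitrary shash/ahash input is simply fed to the SIMD routine in bounded slices, so no single SIMD section runs for too long.

    #include <stddef.h>

    #define GHASH_MAX_BYTES (64 * 1024)    /* per-call cap, as described above */

    /* Stand-in for the NEON GHASH core; in the kernel the call would be
     * bracketed by kernel_neon_begin()/kernel_neon_end(). */
    static void ghash_do_blocks(const unsigned char *src, size_t len)
    {
        (void)src;
        (void)len;
    }

    static void ghash_update_capped(const unsigned char *src, size_t len)
    {
        while (len) {
            size_t n = len > GHASH_MAX_BYTES ? GHASH_MAX_BYTES : len;

            ghash_do_blocks(src, n);       /* at most 64 KB per invocation */
            src += n;
            len -= n;
        }
    }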

30f1a9f5 | 30-Jul-2018 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
crypto: arm64/aes-ce-gcm - don't reload key schedule if avoidable
Squeeze out another 5% of performance by minimizing the number of invocations of kernel_neon_begin()/kernel_neon_end() on the common path, which also allows some reloads of the key schedule to be optimized away.
The resulting code runs at 2.3 cycles per byte on a Cortex-A53.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

e0bd888d | 30-Jul-2018 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
crypto: arm64/aes-ce-gcm - implement 2-way aggregation
Implement a faster version of the GHASH transform which amortizes the reduction modulo the characteristic polynomial across two input blocks at a time.
On a Cortex-A53, the gcm(aes) performance increases 24%, from 3.0 cycles per byte to 2.4 cpb for large input sizes.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

71e52c27 | 30-Jul-2018 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
crypto: arm64/aes-ce-gcm - operate on two input blocks at a time
Update the core AES/GCM transform and the associated plumbing to operate on 2 AES/GHASH blocks at a time. By itself, this is not expected to result in a noticeable speedup, but it paves the way for reimplementing the GHASH component using 2-way aggregation.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

f10dc56c | 29-Jul-2018 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
crypto: arm64 - revert NEON yield for fast AEAD implementations
As it turns out, checking the TIF_NEED_RESCHED flag after each iteration results in a significant performance regression (~10%) when running fast algorithms (i.e., ones that use special instructions and operate in the < 4 cycles per byte range) on in-order cores with comparatively slow memory accesses such as the Cortex-A53.
Given the speed of these ciphers, and the fact that the page based nature of the AEAD scatterwalk API guarantees that the core NEON transform is never invoked with more than a single page's worth of input, we can estimate the worst case duration of any resulting scheduling blackout: on a 1 GHz Cortex-A53 running with 64k pages, processing a page's worth of input at 4 cycles per byte results in a delay of ~250 us, which is a reasonable upper bound.
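
Spelling out the arithmetic behind that bound (own calculation, consistent with the figure quoted above):

    64\,\mathrm{KiB} \times 4\ \mathrm{cycles/byte} = 262144\ \mathrm{cycles} \approx 262\,\mu\mathrm{s}\ \text{at 1 GHz}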
So let's remove the yield checks from the fused AES-CCM and AES-GCM routines entirely.
This reverts commit 7b67ae4d5ce8e2f912377f5fbccb95811a92097f and partially reverts commit 7c50136a8aba8784f07fb66a950cc61a7f3d2ee3.
Fixes: 7c50136a8aba ("crypto: arm64/aes-ghash - yield NEON after every ...")
Fixes: 7b67ae4d5ce8 ("crypto: arm64/aes-ccm - yield NEON after every ...")
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

7c50136a | 30-Apr-2018 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
crypto: arm64/aes-ghash - yield NEON after every block of input
Avoid excessive scheduling delays under a preemptible kernel by yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

03c9a333 | 24-Jul-2017 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
crypto: arm64/ghash - add NEON accelerated fallback for 64-bit PMULL
Implement a NEON fallback for systems that do support NEON but have no support for the optional 64x64->128 polynomial multiplication instruction that is part of the ARMv8 Crypto Extensions. It is based on the paper "Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine" by Danilo Camara, Conrado Gouvea, Julio Lopez and Ricardo Dahab (https://hal.inria.fr/hal-01506572), but has been reworked extensively for the AArch64 ISA.
On a low-end core such as the Cortex-A53 found in the Raspberry Pi3, the NEON based implementation is 4x faster than the table based one, and is time invariant as well, making it less vulnerable to timing attacks. When combined with the bit-sliced NEON implementation of AES-CTR, the AES-GCM performance increases by 2x (from 58 to 29 cycles per byte).
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
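
To make the "64x64->128 polynomial multiplication" concrete, here is a small portable C model of what a single PMULL computes (a shift-and-XOR carry-less multiply; purely illustrative, not how the NEON fallback itself is structured):

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Carry-less (polynomial) multiply: each set bit i of b contributes
     * a << i, and partial products are combined with XOR instead of
     * addition.  res[0] holds the low 64 bits of the 128-bit product,
     * res[1] the high 64 bits.
     */
    static void clmul64(uint64_t a, uint64_t b, uint64_t res[2])
    {
        uint64_t lo = 0, hi = 0;
        int i;

        for (i = 0; i < 64; i++) {
            if ((b >> i) & 1) {
                lo ^= a << i;
                if (i)
                    hi ^= a >> (64 - i);
            }
        }
        res[0] = lo;
        res[1] = hi;
    }

    int main(void)
    {
        uint64_t r[2];

        clmul64(0x87, 0x131, r);    /* (x^7+x^2+x+1) * (x^8+x^5+x^4+1) */
        printf("%016llx %016llx\n",
               (unsigned long long)r[1], (unsigned long long)r[0]);
        return 0;
    }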

537c1445 | 24-Jul-2017 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
crypto: arm64/gcm - implement native driver using v8 Crypto Extensions
Currently, the AES-GCM implementation for arm64 systems that support the ARMv8 Crypto Extensions is based on the generic GCM module, which combines the AES-CTR implementation using AES instructions with the PMULL based GHASH driver. This is suboptimal, given the fact that the input data needs to be loaded twice, once for the encryption and again for the MAC calculation.
On Cortex-A57 (r1p2) and other recent cores that implement micro-op fusing for the AES instructions, AES executes at less than 1 cycle per byte, which means that any cycles wasted on loading the data twice hurt even more.
So implement a new GCM driver that combines the AES and PMULL instructions at the block level. This improves performance on Cortex-A57 by ~37% (from 3.5 cpb to 2.6 cpb)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
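
A rough C sketch of the block-level fusion being described (stub primitives and hypothetical names; counter and tag handling reduced to the bare outline): each 16-byte block is turned into keystream, XORed and fed to GHASH while it is still at hand, instead of being read once for CTR and again for the MAC.

    #include <stddef.h>

    /* Placeholder primitives; the real driver uses the AES and PMULL
     * instructions here.  Only the single pass over the data matters. */
    static void aes_encrypt_block(unsigned char out[16],
                                  const unsigned char in[16])
    {
        for (int i = 0; i < 16; i++)
            out[i] = in[i] ^ 0xAA;              /* stub, not AES */
    }

    static void ghash_block(unsigned char dg[16], const unsigned char in[16])
    {
        for (int i = 0; i < 16; i++)
            dg[i] ^= in[i];                     /* stub, not a GF(2^128) multiply */
    }

    static void gcm_encrypt_fused(unsigned char *dst, const unsigned char *src,
                                  size_t blocks, unsigned char ctr[16],
                                  unsigned char dg[16])
    {
        unsigned char ks[16];

        while (blocks--) {
            aes_encrypt_block(ks, ctr);         /* keystream for this block */
            for (int i = 0; i < 16; i++)
                dst[i] = src[i] ^ ks[i];        /* CTR encryption */
            ghash_block(dg, dst);               /* MAC over ciphertext, same pass */

            for (int i = 15; i >= 12; i--)      /* bump 32-bit big-endian counter */
                if (++ctr[i])
                    break;

            src += 16;
            dst += 16;
        }
    }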

9c433ad5 | 11-Oct-2016 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
crypto: arm64/ghash-ce - fix for big endian
The GHASH key and digest are both pairs of 64-bit quantities, but the GHASH code does not always refer to them as such, causing failures when built for big endian. So replace the 16x1 loads and stores with 2x8 ones.
Fixes: b913a6404ce2 ("arm64/crypto: improve performance of GHASH algorithm")
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

b913a640 | 16-Jun-2014 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
arm64/crypto: improve performance of GHASH algorithm
This patch modifies the GHASH secure hash implementation to switch to a faster, polynomial multiplication based reduction instead of one that uses shifts and rotates.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

fdd23894 | 26-Mar-2014 | Ard Biesheuvel <ard.biesheuvel@linaro.org>
arm64/crypto: GHASH secure hash using ARMv8 Crypto Extensions
This is a port to ARMv8 (Crypto Extensions) of the Intel implementation of the GHASH Secure Hash (used in the Galois/Counter chaining mode). It relies on the optional PMULL/PMULL2 instruction (polynomial multiply long, what Intel call carry-less multiply).
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>