History log of /linux/arch/arm64/net/bpf_jit_comp.c (Results 1 – 25 of 133)
Revision Date Author Comments
# a552e2ef 18-Oct-2024 Peter Collingbourne <pcc@google.com>

bpf, arm64: Fix address emission with tag-based KASAN enabled

When BPF_TRAMP_F_CALL_ORIG is enabled, the address of a bpf_tramp_image
struct on the stack is passed during the size calculation pass and
an address on the heap is passed during code generation. This may
cause a heap buffer overflow if the heap address is tagged because
emit_a64_mov_i64() will emit longer code than it did during the size
calculation pass. The same problem could occur without tag-based
KASAN if one of the 16-bit words of the stack address happened to
be all-ones during the size calculation pass. Fix the problem by
assuming the worst case (4 instructions) when calculating the size
of the bpf_tramp_image address emission.
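
To see why the two passes can disagree, here is a simplified model of how many MOVZ/MOVN/MOVK instructions a 64-bit immediate needs. It is an illustrative sketch, not the kernel's emit_a64_mov_i64(); the function name mov_i64_insn_count() and the example addresses are made up for this note.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model (an assumption, not kernel code): a 64-bit immediate
 * is built from four 16-bit chunks. MOVZ starts from all-zeros and MOVN
 * from all-ones, so chunks equal to that background need no MOVK.
 */
static int mov_i64_insn_count(uint64_t val)
{
	int zeros = 0, ones = 0;

	for (int shift = 0; shift < 64; shift += 16) {
		uint16_t chunk = (uint16_t)(val >> shift);

		if (chunk == 0x0000)
			zeros++;
		else if (chunk == 0xffff)
			ones++;
	}

	int skipped = zeros > ones ? zeros : ones;
	int count = 4 - skipped;

	assert(count >= 0 && count <= 4);
	return count > 0 ? count : 1;	/* at least one MOVZ/MOVN */
}
```

Under this model an address whose top chunk is 0xffff needs 3 instructions, while the same address with a tag in the top byte needs 4, which is why sizing for the worst case is safe.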

Fixes: 19d3c179a377 ("bpf, arm64: Fix trampoline for BPF_TRAMP_F_CALL_ORIG")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://linux-review.googlesource.com/id/I1496f2bc24fba7a1d492e16e2b94cf43714f2d3c
Link: https://lore.kernel.org/bpf/20241018221644.3240898-1-pcc@google.com


# ddbe9ec5 03-Sep-2024 Xu Kuohai <xukuohai@huawei.com>

bpf, arm64: Jit BPF_CALL to direct call when possible

Currently, BPF_CALL is always jited to indirect call. When target is
within the range of direct call, BPF_CALL can be jited to direct call.

For example, the following BPF_CALL

call __htab_map_lookup_elem

is always jited to indirect call:

mov x10, #0xffffffffffff18f4
movk x10, #0x821, lsl #16
movk x10, #0x8000, lsl #32
blr x10

When the address of target __htab_map_lookup_elem is within the range of
direct call, the BPF_CALL can be jited to:

bl 0xfffffffffd33bc98

This patch does such jit optimization by emitting arm64 direct calls for
BPF_CALL when possible, indirect calls otherwise.

Without this patch, the jit works as follows.

1. First pass
A. Determine jited position and size for each bpf instruction.
B. Compute the jited image size.

2. Allocate jited image with size computed in step 1.

3. Second pass
A. Adjust jump offset for jump instructions
B. Write the final image.

This works because, for a given bpf prog, regardless of where the jited
image is allocated, the jited result for each instruction is fixed. The
second pass differs from the first only in adjusting the jump offsets,
like changing "jmp imm1" to "jmp imm2", while the position and size of
the "jmp" instruction remain unchanged.

Now consider whether to jit a BPF_CALL to an arm64 direct or indirect
call instruction. The choice depends solely on the jump offset: a direct
call if the jump offset is within 128MB, an indirect call otherwise.
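
The range test can be sketched as follows. This is a minimal illustration assuming the usual BL encoding (signed 26-bit word offset); fits_direct_call() is a hypothetical helper, not a function from the JIT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_128M (128LL * 1024 * 1024)

/*
 * BL encodes a signed 26-bit word offset, so the byte offset must be a
 * multiple of 4 in the range [-128MB, 128MB).
 */
static bool fits_direct_call(uint64_t pc, uint64_t target)
{
	int64_t offset = (int64_t)(target - pc);

	assert((pc & 3) == 0);	/* A64 instructions are 4-byte aligned */
	return (offset & 3) == 0 && offset >= -SZ_128M && offset < SZ_128M;
}
```

Because the result depends on pc, the same call target can be direct at one image address and indirect at another.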

For a given BPF_CALL, the target address is known, so the jump offset is
decided by the jited address of the BPF_CALL instruction. In other words,
for a given bpf prog, the jited result for each BPF_CALL is determined
by its jited address.

The jited address for a BPF_CALL is the jited image address plus the
total jited size of all preceding instructions. For a given bpf prog,
there are clearly no BPF_CALL instructions before the first BPF_CALL
instruction. Since the jited results for all instructions other than
BPF_CALL are fixed, the total jited size preceding the first
BPF_CALL is also fixed. Therefore, once the jited image is allocated,
the jited address for the first BPF_CALL is fixed.

Now that the jited result for the first BPF_CALL is fixed, the jited
results for all instructions preceding the second BPF_CALL are fixed.
So the jited address and result for the second BPF_CALL are also fixed.

Similarly, we can conclude that the jited addresses and results for all
subsequent BPF_CALL instructions are fixed.

This means that, for a given bpf prog, once the jited image is allocated,
the jited address and result for all instructions, including all BPF_CALL
instructions, are fixed.

Based on the observation, with this patch, the jit works as follows.

1. First pass
Estimate the maximum jited image size. In this pass, all BPF_CALLs
are jited to arm64 indirect calls since the jump offsets are unknown
because the jited image is not allocated.

2. Allocate jited image with size estimated in step 1.

3. Second pass
A. Determine the jited result for each BPF_CALL.
B. Determine jited address and size for each bpf instruction.

4. Third pass
A. Adjust jump offset for jump instructions.
B. Write the final image.
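
The first two passes can be modelled with a toy layout routine. This is a sketch under stated assumptions: the instruction kinds, the byte costs and all names below are hypothetical and only mirror the reasoning above, not the JIT's real data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum insn_kind { INSN_OTHER, INSN_CALL };

struct toy_insn {
	enum insn_kind kind;
	uint64_t target;	/* call target, used for INSN_CALL only */
};

#define INDIRECT_CALL_BYTES 16	/* toy cost: 3 address moves + blr */
#define DIRECT_CALL_BYTES    4	/* toy cost: one bl */
#define OTHER_BYTES          4	/* toy cost: one instruction */

static bool in_bl_range(uint64_t pc, uint64_t target)
{
	int64_t off = (int64_t)(target - pc);

	return off >= -(128LL << 20) && off < (128LL << 20);
}

/* Pass 1: no image yet, so assume every call is indirect (upper bound). */
static size_t estimate_size(const struct toy_insn *p, size_t n)
{
	size_t size = 0;

	for (size_t i = 0; i < n; i++)
		size += p[i].kind == INSN_CALL ? INDIRECT_CALL_BYTES : OTHER_BYTES;
	return size;
}

/*
 * Pass 2: the image base is known. Walking in order fixes each call's
 * address, which in turn fixes whether it is direct or indirect.
 */
static size_t layout(const struct toy_insn *p, size_t n, uint64_t base,
		     size_t *offsets)
{
	size_t off = 0;

	for (size_t i = 0; i < n; i++) {
		offsets[i] = off;
		if (p[i].kind != INSN_CALL)
			off += OTHER_BYTES;
		else if (in_bl_range(base + off, p[i].target))
			off += DIRECT_CALL_BYTES;
		else
			off += INDIRECT_CALL_BYTES;
	}
	assert(off <= estimate_size(p, n));	/* never exceeds the estimate */
	return off;
}
```

The final assert captures the key property: the laid-out size can only shrink relative to the first-pass estimate, so the allocated image is always large enough.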

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240903094407.601107-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 5d4fa9ec 26-Aug-2024 Xu Kuohai <xukuohai@huawei.com>

bpf, arm64: Avoid blindly saving/restoring all callee-saved registers

The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result look a bit too complicated. For example, for an empty
prog, the jited result is:

0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret

Clearly, there is no need to save/restore unused callee-saved registers.
This patch makes the jited image save/restore only the callee-saved
registers it uses.

Now the jited result of empty prog is:

0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
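
The saving in prologue size can be estimated with a small helper. This is a hypothetical sketch: the bitmask representation and the name pushes_needed() are assumptions for illustration, not the JIT's actual bookkeeping.

```c
#include <assert.h>
#include <stdint.h>

/* Callee-saved general-purpose registers under the arm64 procedure call
 * standard: x19..x28. */
#define FIRST_CALLEE_SAVED 19
#define LAST_CALLEE_SAVED  28

/*
 * Given a bitmask of registers the program touches (bit n = xn), return
 * how many push instructions a prologue needs if registers are stored in
 * pairs (stp), with a single store for an odd one left over.
 */
static int pushes_needed(uint32_t used_mask)
{
	int nr_used = 0;

	for (int reg = FIRST_CALLEE_SAVED; reg <= LAST_CALLEE_SAVED; reg++)
		if (used_mask & (1u << reg))
			nr_used++;

	assert(nr_used <= 10);
	return (nr_used + 1) / 2;
}
```

A program that touches no callee-saved register needs no pushes at all under this model, while one that touches all ten needs five.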

Since a bpf prog saves/restores its own callee-saved registers as needed,
to make tailcall work correctly, the caller needs to restore its saved
registers before the tailcall, and the callee needs to save its
callee-saved registers after the tailcall. These extra restore/save
instructions add performance overhead.

[1] provides 2 benchmarks for tailcall scenarios. Below are the perf
numbers measured in an arm64 KVM guest. The results indicate that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.

- Before:

Performance counter stats for './test_progs -t tailcalls' (5 runs):

4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses

4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )

Performance counter stats for './test_progs -t flow_dissector' (5 runs):

10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses

11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )

- After:

Performance counter stats for './test_progs -t tailcalls' (5 runs):

4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses

4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )

Performance counter stats for './test_progs -t flow_dissector' (5 runs):

10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses

11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )

[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# bd737fcb 26-Aug-2024 Xu Kuohai <xukuohai@huawei.com>

bpf, arm64: Get rid of fpb

bpf prog accesses stack using BPF_FP as the base address and a negative
immediate number as offset. But arm64 ldr/str instructions only support
non-negative immediate number as offset. To simplify the jited result,
commit 5b3d19b9bd40 ("bpf, arm64: Adjust the offset of str/ldr(immediate)
to positive number") introduced FPB to represent the lowest stack address
that the bpf prog being jited may access, and with this address as the
baseline, it converts BPF_FP plus negative immediate offset number to FPB
plus non-negative immediate offset.

For a given bpf prog, the jited stack space is fixed, with A64_SP as the
lowest address and BPF_FP as the highest address. Thus we can get rid of
FPB and convert BPF_FP plus negative immediate offset to A64_SP plus
non-negative immediate offset.
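
The conversion is simple arithmetic, sketched below. The helper names are hypothetical, and the sketch assumes the BPF stack is the only region between A64_SP and BPF_FP, rounded up to 16 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round the BPF stack depth up to arm64's 16-byte stack alignment. */
static int32_t stack_size_for(int32_t stack_depth)
{
	return (stack_depth + 15) & ~15;
}

/*
 * BPF_FP sits stack_size bytes above A64_SP, so FP + off (off <= 0)
 * names the same byte as SP + (stack_size + off).
 */
static int32_t fp_off_to_sp_off(int32_t stack_size, int32_t fp_off)
{
	assert(fp_off <= 0 && -fp_off <= stack_size);
	return stack_size + fp_off;
}

/*
 * LDR/STR (unsigned immediate) takes a 12-bit offset scaled by the access
 * size, so the byte offset must be non-negative, size-aligned and at most
 * 4095 << scale.
 */
static bool fits_scaled_uimm12(int32_t off, int scale)
{
	return off >= 0 && (off & ((1 << scale) - 1)) == 0 &&
	       off <= (0xfff << scale);
}
```

For example, with a 16-byte stack an 8-byte access at BPF_FP - 8 becomes A64_SP + 8, which fits the unsigned-immediate form, whereas the original offset of -8 does not.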

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-2-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 66ff4d61 14-Jul-2024 Leon Hwang <hffilwlqm@gmail.com>

bpf, arm64: Fix tailcall hierarchy

This patch fixes a tailcall issue on arm64 caused by abusing the
tailcall-in-bpf2bpf feature, in the same way as "bpf, x64: Fix tailcall
hierarchy".

On arm64, when a tail call happens, it uses tail_call_cnt_ptr to
increment tail_call_cnt, too.

At the prologue of main prog, it has to initialize tail_call_cnt and
prepare tail_call_cnt_ptr.

At the prologue of subprog, it pushes x26 register twice, and does not
initialize tail_call_cnt.

At the epilogue, it pops x26 twice, no matter whether it is main prog or
subprog.

Fixes: d4609a5d8c70 ("bpf, arm64: Keep tail call count across bpf2bpf calls")
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20240714123902.32305-3-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>


# 19d3c179 11-Jul-2024 Puranjay Mohan <puranjay@kernel.org>

bpf, arm64: Fix trampoline for BPF_TRAMP_F_CALL_ORIG

When BPF_TRAMP_F_CALL_ORIG is set, the trampoline calls
__bpf_tramp_enter() and __bpf_tramp_exit() functions, passing them
the struct bpf_tramp_image *im pointer as an argument in R0.

The trampoline generation code uses emit_addr_mov_i64() to emit
instructions for moving the bpf_tramp_image address into R0, but
emit_addr_mov_i64() assumes the address to be in the vmalloc() space
and uses only 48 bits. Because bpf_tramp_image is allocated using
kzalloc(), its address can use more than 48 bits. In that case the
trampoline passes an invalid address to __bpf_tramp_enter/exit(),
causing a kernel crash.

Fix this by using emit_a64_mov_i64() in place of emit_addr_mov_i64(),
as it can work with addresses that are wider than 48 bits.
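
The failure mode can be modelled in a few lines. This sketch assumes a 48-bit address move fills the top 16 bits with ones; the function names and the second example address are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of a 48-bit address move: only the low three 16-bit chunks are
 * materialised and the top 16 bits are implicitly all-ones.
 */
static uint64_t materialise_48bit(uint64_t addr)
{
	return 0xffff000000000000ULL | (addr & 0x0000ffffffffffffULL);
}

/* True if the 48-bit move reproduces addr exactly. */
static bool survives_48bit_move(uint64_t addr)
{
	return materialise_48bit(addr) == addr;
}
```

An address such as 0xffff8000802b7978 survives, but any address whose top 16 bits are not all-ones is silently changed, which is the invalid pointer described above.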

Fixes: efc9909fdce0 ("bpf, arm64: Add bpf trampoline for arm64")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://lore.kernel.org/all/SJ0PR15MB461564D3F7E7A763498CA6A8CBDB2@SJ0PR15MB4615.namprd15.prod.outlook.com/
Link: https://lore.kernel.org/bpf/20240711151838.43469-1-puranjay@kernel.org


# 2bb138cb 19-Jun-2024 Puranjay Mohan <puranjay@kernel.org>

bpf, arm64: Inline bpf_get_current_task/_btf() helpers

On ARM64, the pointer to task_struct is always available in the sp_el0
register and therefore the calls to bpf_get_current_task() and
bpf_get_current_task_btf() can be inlined into a single MRS instruction.

Here is the difference before and after this change:

Before:

; struct task_struct *task = bpf_get_current_task_btf();
54: mov x10, #0xffffffffffff7978 // #-34440
58: movk x10, #0x802b, lsl #16
5c: movk x10, #0x8000, lsl #32
60: blr x10 --------------> 0xffff8000802b7978 <+0>: mrs x0, sp_el0
64: add x7, x0, #0x0 <-------------- 0xffff8000802b797c <+4>: ret

After:

; struct task_struct *task = bpf_get_current_task_btf();
54: mrs x7, sp_el0

This shows around 1% performance improvement in an artificial microbenchmark.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240619131334.4297-1-puranjay@kernel.org


# 9919c5c9 15-Jun-2024 Rafael Passos <rafael@rcpassos.me>

bpf: remove unused parameter in bpf_jit_binary_pack_finalize

Fixes a compiler warning. The bpf_jit_binary_pack_finalize function
was taking an extra bpf_prog parameter that went unused.
This removes it and updates the callers accordingly.

Signed-off-by: Rafael Passos <rafael@rcpassos.me>
Link: https://lore.kernel.org/r/20240615022641.210320-2-rafael@rcpassos.me
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# e2effa22 05-May-2024 Mike Rapoport (IBM) <rppt@kernel.org>

arm64: extend execmem_info for generated code allocations

The memory allocations for kprobes and BPF on arm64 can be placed
anywhere in vmalloc address space and currently this is implemented with
overrides of alloc_insn_page() and bpf_jit_alloc_exec() in arm64.

Define EXECMEM_KPROBES and EXECMEM_BPF ranges in arm64::execmem_info and
drop overrides of alloc_insn_page() and bpf_jit_alloc_exec().

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>


# 75fe4c0b 02-May-2024 Puranjay Mohan <puranjay@kernel.org>

bpf, arm64: inline bpf_get_smp_processor_id() helper

Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting
a read from struct thread_info. The SP_EL0 system register holds the
pointer to the task_struct and thread_info is the first member of this
struct. We can read the cpu number from the thread_info.

Here is how the ARM64 JITed assembly changes after this commit:

ARM64 JIT
===========

BEFORE
--------

int cpu = bpf_get_smp_processor_id();

mov x10, #0xfffffffffffff4d0
movk x10, #0x802b, lsl #16
movk x10, #0x8000, lsl #32
blr x10
add x7, x0, #0x0

AFTER
-------

int cpu = bpf_get_smp_processor_id();

mrs x10, sp_el0
ldr w7, [x10, #24]
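
The inlined sequence is a plain field load at a fixed offset from the task pointer. The sketch below uses a mock layout, not the kernel's struct definitions; the placeholder fields are assumptions chosen only so that cpu lands at offset 24, matching the ldr above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mock layout, NOT the kernel's definition: three 8-byte fields, then cpu. */
struct mock_thread_info {
	uint64_t flags;
	uint64_t field_a;	/* hypothetical placeholder */
	uint64_t field_b;	/* hypothetical placeholder */
	uint32_t cpu;
};

struct mock_task_struct {
	struct mock_thread_info thread_info;	/* first member, offset 0 */
	int other_state;
};

/* Equivalent of "ldr w7, [x10, #24]" with x10 holding the task pointer. */
static uint32_t read_cpu(const struct mock_task_struct *task)
{
	uint32_t cpu;

	assert(task);
	memcpy(&cpu, (const char *)task + offsetof(struct mock_thread_info, cpu),
	       sizeof(cpu));
	return cpu;
}
```

Because thread_info is the first member, the task pointer and the thread_info pointer are the same address, so no extra add is needed before the load.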

Performance improvement using benchmark[1]

./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc

+---------------+-------------------+-------------------+--------------+
| Name | Before | After | % change |
|---------------+-------------------+-------------------+--------------|
| glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% |
| arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% |
| hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% |
+---------------+-------------------+-------------------+--------------+

[1] https://github.com/anakryiko/linux/commit/8dec900975ef

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240502151854.9810-5-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 7a4c3222 02-May-2024 Puranjay Mohan <puranjay12@gmail.com>

arm64, bpf: add internal-only MOV instruction to resolve per-CPU addrs

Support an instruction for resolving absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only and
users are not allowed to use it directly. It will only be used for
internal inlining optimizations for now between BPF verifier and BPF
JITs.

Since commit 7158627686f0 ("arm64: percpu: implement optimised pcpu
access using tpidr_el1"), the per-cpu offset for the CPU is stored in
the tpidr_el1/2 register of that CPU.

To support this BPF instruction in the ARM64 JIT, the following ARM64
instructions are emitted:

mov dst, src // Move src to dst, if src != dst
mrs tmp, tpidr_el1/2 // Move per-cpu offset of the current cpu in tmp.
add dst, dst, tmp // Add the per cpu offset to the dst.
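
In C terms the three instructions amount to one addition. The sketch below is a user-space model: the offset table stands in for the value read from tpidr_el1/2, and every name and number in it is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NR_MOCK_CPUS 4

/* Hypothetical per-CPU base offsets; on arm64 the current CPU's value
 * would come from tpidr_el1/2 instead of a table. */
static const uint64_t mock_cpu_offset[NR_MOCK_CPUS] = {
	0x10000, 0x20000, 0x30000, 0x40000,
};

/* dst = src + per-CPU offset of the given CPU (the mrs + add pair). */
static uint64_t resolve_percpu_addr(uint64_t percpu_off, unsigned int cpu)
{
	assert(cpu < NR_MOCK_CPUS);
	return percpu_off + mock_cpu_offset[cpu];
}
```

The same per-CPU offset therefore resolves to a different absolute address on each CPU.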

To measure the performance improvement provided by this change, the
benchmark in [1] was used:

Before:
glob-arr-inc : 23.597 ± 0.012M/s
arr-inc : 23.173 ± 0.019M/s
hash-inc : 12.186 ± 0.028M/s

After:
glob-arr-inc : 23.819 ± 0.034M/s
arr-inc : 23.285 ± 0.017M/s
hash-inc : 12.419 ± 0.011M/s

[1] https://github.com/anakryiko/linux/commit/8dec900975ef

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240502151854.9810-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# e612b5c1 26-Apr-2024 Puranjay Mohan <puranjay@kernel.org>

bpf, arm64: Add support for lse atomics in bpf_arena

When LSE atomics are available, BPF atomic instructions are implemented
as single ARM64 atomic instructions, therefore it is easy to enable
these in bpf_arena using the currently available exception handling
setup.

LL_SC atomics use loops and therefore would need more work to enable in
bpf_arena.

Enable LSE atomics based instructions in bpf_arena and use the
bpf_jit_supports_insn() callback to reject atomics in bpf_arena if LSE
atomics are not available.

All atomics and arena_atomics selftests are passing:

[root@ip-172-31-2-216 bpf]# ./test_progs -a atomics,arena_atomics
#3/1 arena_atomics/add:OK
#3/2 arena_atomics/sub:OK
#3/3 arena_atomics/and:OK
#3/4 arena_atomics/or:OK
#3/5 arena_atomics/xor:OK
#3/6 arena_atomics/cmpxchg:OK
#3/7 arena_atomics/xchg:OK
#3 arena_atomics:OK
#10/1 atomics/add:OK
#10/2 atomics/sub:OK
#10/3 atomics/and:OK
#10/4 atomics/or:OK
#10/5 atomics/xor:OK
#10/6 atomics/cmpxchg:OK
#10/7 atomics/xchg:OK
#10 atomics:OK
Summary: 2/14 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240426161116.441-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# dc7d7447 16-Apr-2024 Xu Kuohai <xukuohai@huawei.com>

bpf, arm64: Fix incorrect runtime stats

When __bpf_prog_enter() returns zero, the arm64 register x20 that stores
prog start time is not assigned to zero, causing incorrect runtime stats.

To fix it, assign the return value of bpf_prog_enter() to x20 register
immediately upon its return.

Fixes: efc9909fdce0 ("bpf, arm64: Add bpf trampoline for arm64")
Reported-by: Ivan Babrou <ivan@cloudflare.com>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Ivan Babrou <ivan@cloudflare.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240416064208.2919073-2-xukuohai@huaweicloud.com


# 4dd31243 25-Mar-2024 Puranjay Mohan <puranjay12@gmail.com>

bpf: Add arm64 JIT support for bpf_addr_space_cast instruction.

LLVM generates bpf_addr_space_cast instruction while translating
pointers between native (zero) address space and
__attribute__((address_space(N))). The addr_space=1 is reserved as
bpf_arena address space.

rY = addr_space_cast(rX, 0, 1) is processed by the verifier and
converted to normal 32-bit move: wX = wY.

rY = addr_space_cast(rX, 1, 0) : used to convert a bpf arena pointer to
a pointer in the userspace vma. This has to be converted by the JIT.
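
One plausible reading of the arena-to-user conversion is sketched below. The semantics shown (keep the low 32 bits, take the upper 32 bits from the start of the user mapping, preserve NULL) are an assumption for illustration, and arena_to_user() and the example addresses are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed semantics of rY = addr_space_cast(rX, 1, 0): keep the low 32
 * bits of the arena pointer and graft on the upper 32 bits of the user
 * mapping's start address; a NULL arena pointer stays NULL.
 */
static uint64_t arena_to_user(uint64_t arena_ptr, uint64_t user_vm_start)
{
	uint32_t low = (uint32_t)arena_ptr;

	if (low == 0)
		return 0;
	return (user_vm_start & 0xffffffff00000000ULL) | low;
}
```

Under this model only the low 32 bits of an arena pointer carry information, which is consistent with the opposite cast reducing to a 32-bit move.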

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240325150716.4387-3-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 339af577 25-Mar-2024 Puranjay Mohan <puranjay12@gmail.com>

bpf: Add arm64 JIT support for PROBE_MEM32 pseudo instructions.

Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW]
instructions. They are similar to PROBE_MEM instructions with the
following differences:
- PROBE_MEM32 supports store.
- PROBE_MEM32 relies on the verifier to clear the upper 32 bits of the
src/dst register.
- PROBE_MEM32 adds the 64-bit kern_vm_start address (which is stored in
R28 in the prologue). Due to how bpf_arena is constructed, such an
R28 + reg + off16 access is guaranteed to be within the arena virtual
range, so no address check is needed at run-time.
- PROBE_MEM32 allows STX and ST. If they fault, the store is a nop. When
LDX faults, the destination register is zeroed.

To support these on arm64, we do tmp2 = R28 + src/dst reg and then use
tmp2 as the new src/dst register. This allows us to reuse most of the
code for normal [LDX | STX | ST].
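
The address computation can be written out as a small model. probe_mem32_addr() is a hypothetical name and the kern_vm_start value in the usage note is made up; only the arithmetic follows the description above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Address used by a PROBE_MEM32 access: the verifier has already cleared
 * the upper 32 bits of reg, so only the low word contributes.
 * tmp2 = kern_vm_start + reg, then the normal load/store uses tmp2 + off.
 */
static uint64_t probe_mem32_addr(uint64_t kern_vm_start, uint64_t reg,
				 int16_t off)
{
	assert((reg >> 32) == 0);	/* guaranteed by the verifier */

	uint64_t tmp2 = kern_vm_start + reg;

	return tmp2 + (int64_t)off;	/* off is a signed 16-bit offset */
}
```

For instance, with kern_vm_start = 0xffff800100000000, reg = 0x2000 and off = -8, the access lands at 0xffff800100001ff8.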

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240325150716.4387-2-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# a51cd6bf 21-Mar-2024 Artem Savkov <asavkov@redhat.com>

arm64: bpf: fix 32bit unconditional bswap

When is64 == 1 in emit(A64_REV32(is64, dst, dst), ctx), the
generated insn reverses byte order for both high and low 32-bit words,
resulting in an incorrect swap as indicated by the jit test:

[ 9757.262607] test_bpf: #312 BSWAP 16: 0x0123456789abcdef -> 0xefcd jited:1 8 PASS
[ 9757.264435] test_bpf: #313 BSWAP 32: 0x0123456789abcdef -> 0xefcdab89 jited:1 ret 1460850314 != -271733879 (0x5712ce8a != 0xefcdab89)FAIL (1 times)
[ 9757.266260] test_bpf: #314 BSWAP 64: 0x0123456789abcdef -> 0x67452301 jited:1 8 PASS
[ 9757.268000] test_bpf: #315 BSWAP 64: 0x0123456789abcdef >> 32 -> 0xefcdab89 jited:1 8 PASS
[ 9757.269686] test_bpf: #316 BSWAP 16: 0xfedcba9876543210 -> 0x1032 jited:1 8 PASS
[ 9757.271380] test_bpf: #317 BSWAP 32: 0xfedcba9876543210 -> 0x10325476 jited:1 ret -1460850316 != 271733878 (0xa8ed3174 != 0x10325476)FAIL (1 times)
[ 9757.273022] test_bpf: #318 BSWAP 64: 0xfedcba9876543210 -> 0x98badcfe jited:1 7 PASS
[ 9757.274721] test_bpf: #319 BSWAP 64: 0xfedcba9876543210 >> 32 -> 0x10325476 jited:1 9 PASS

Fix this by forcing 32bit variant of rev32.
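
The difference between the two variants can be reproduced in portable C. This is a model of the instruction semantics as described above, with hypothetical function names; it is not JIT code.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-reverse one 32-bit word. */
static uint32_t swap_bytes32(uint32_t w)
{
	return (w >> 24) | ((w >> 8) & 0x0000ff00u) |
	       ((w << 8) & 0x00ff0000u) | (w << 24);
}

/* 64-bit REV32: byte-reverse each 32-bit word of the register in place. */
static uint64_t rev32_64bit(uint64_t x)
{
	return ((uint64_t)swap_bytes32((uint32_t)(x >> 32)) << 32) |
	       swap_bytes32((uint32_t)x);
}

/* 32-bit variant: byte-reverse the low word; the upper word becomes zero. */
static uint64_t rev_32bit(uint64_t x)
{
	return swap_bytes32((uint32_t)x);
}
```

For the input 0x0123456789abcdef, the 64-bit form leaves the swapped high word in the upper half of the result, whereas the 32-bit form yields the 0xefcdab89 the test expects.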

Fixes: 1104247f3f979 ("bpf, arm64: Support unconditional bswap")
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Tested-by: Puranjay Mohan <puranjay12@gmail.com>
Acked-by: Puranjay Mohan <puranjay12@gmail.com>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Message-ID: <20240321081809.158803-1-asavkov@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 114b5b3b 12-Mar-2024 Puranjay Mohan <puranjay12@gmail.com>

bpf, arm64: fix bug in BPF_LDX_MEMSX

A64_LDRSW() takes three registers: Xt, Xn, Xm as arguments and it loads
and sign extends the value at address Xn + Xm into register Xt.

Currently, the offset is being directly used in place of the tmp
register which has the offset already loaded by the last emitted
instruction.

This will cause JIT failures. The easiest way to reproduce this is to
test the following code through test_bpf module:

{
"BPF_LDX_MEMSX | BPF_W",
.u.insns_int = {
BPF_LD_IMM64(R1, 0x00000000deadbeefULL),
BPF_LD_IMM64(R2, 0xffffffffdeadbeefULL),
BPF_STX_MEM(BPF_DW, R10, R1, -7),
BPF_LDX_MEMSX(BPF_W, R0, R10, -7),
BPF_JMP_REG(BPF_JNE, R0, R2, 1),
BPF_ALU64_IMM(BPF_MOV, R0, 0),
BPF_EXIT_INSN(),
},
INTERNAL,
{ },
{ { 0, 0 } },
.stack_depth = 7,
},

We need to use the offset as -7 to trigger this code path. There could
be other valid ways to trigger this from proper BPF programs as well.

This code is rejected by the JIT because -7 is passed to A64_LDRSW() but
it expects a valid register (0 - 31).
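
What the load is supposed to compute can be modelled in user space. load_s32() is a hypothetical stand-in for the register-offset form: the offset is a value held in a register, which is why passing the raw -7 where a register number is expected cannot work.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Model of a register-offset sign-extending word load: read 32 bits at
 * base + offset and sign-extend them to 64 bits. The offset is a value
 * held in a register, not a register number and not an immediate.
 */
static uint64_t load_s32(const unsigned char *base, int64_t offset_reg)
{
	int32_t word;

	assert(base);
	memcpy(&word, base + offset_reg, sizeof(word));	/* unaligned-safe */
	return (uint64_t)(int64_t)word;
}
```

Loading the word 0xdeadbeef this way yields 0xffffffffdeadbeef, the value the test case compares against in R2.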

roott@pjy:~# modprobe test_bpf test_name="BPF_LDX_MEMSX | BPF_W"
[11300.490371] test_bpf: test_bpf: set 'test_bpf' as the default test_suite.
[11300.491750] test_bpf: #345 BPF_LDX_MEMSX | BPF_W
[11300.493179] aarch64_insn_encode_register: unknown register encoding -7
[11300.494133] aarch64_insn_encode_register: unknown register encoding -7
[11300.495292] FAIL to select_runtime err=-524
[11300.496804] test_bpf: Summary: 0 PASSED, 1 FAILED, [0/0 JIT'ed]
modprobe: ERROR: could not insert 'test_bpf': Invalid argument

Applying this patch fixes the issue.

root@pjy:~# modprobe test_bpf test_name="BPF_LDX_MEMSX | BPF_W"
[ 292.837436] test_bpf: test_bpf: set 'test_bpf' as the default test_suite.
[ 292.839416] test_bpf: #345 BPF_LDX_MEMSX | BPF_W jited:1 156 PASS
[ 292.844794] test_bpf: Summary: 1 PASSED, 0 FAILED, [1/1 JIT'ed]

Fixes: cc88f540da52 ("bpf, arm64: Support sign-extension load instructions")
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Message-ID: <20240312235917.103626-1-puranjay12@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# c733239f 16-Mar-2024 Christophe Leroy <christophe.leroy@csgroup.eu>

bpf: Check return from set_memory_rox()

arch_protect_bpf_trampoline() and alloc_new_pack() call
set_memory_rox() which can fail, leading to unprotected memory.

Take into account return from set_memory_rox() function and add
__must_check flag to arch_protect_bpf_trampoline().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/fe1c163c83767fde5cab31d209a4a6be3ddb3a73.1710574353.git.christophe.leroy@csgroup.eu
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>


# e3362acd 16-Mar-2024 Christophe Leroy <christophe.leroy@csgroup.eu>

bpf: Remove arch_unprotect_bpf_trampoline()

Last user of arch_unprotect_bpf_trampoline() was removed by
commit 187e2af05abe ("bpf: struct_ops supports more than one page for
trampolines.")

Remove arch_unprotect_bpf_trampoline()

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Fixes: 187e2af05abe ("bpf: struct_ops supports more than one page for trampolines.")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/r/42c635bb54d3af91db0f9b85d724c7c290069f67.1710574353.git.christophe.leroy@csgroup.eu
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>


# 96b0f5ad 04-Mar-2024 Puranjay Mohan <puranjay12@gmail.com>

arm64, bpf: Use bpf_prog_pack for arm64 bpf trampoline

We used bpf_prog_pack to aggregate bpf programs into huge pages to
relieve the iTLB pressure on the system. This was merged for ARM64[1].
We can apply it to bpf trampoline as well. This would increase the
performance of fentry and struct_ops programs.

[1] https://lore.kernel.org/bpf/20240228141824.119877-1-puranjay12@gmail.com/

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Message-ID: <20240304202803.31400-1-puranjay12@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 1dad391d 28-Feb-2024 Puranjay Mohan <puranjay12@gmail.com>

bpf, arm64: use bpf_prog_pack for memory management

Use bpf_jit_binary_pack_alloc for memory management of JIT binaries in
ARM64 BPF JIT. The bpf_jit_binary_pack_alloc creates a pair of RW and RX
buffers. The JIT writes the program into the RW buffer. When the JIT is
done, the program is copied to the final RX buffer
with bpf_jit_binary_pack_finalize.
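
The two-buffer flow described above can be sketched in user-space C. This is only an illustration under stated assumptions: struct sketch_pack and the sketch_* functions are hypothetical, both buffers are ordinary heap memory, and a plain memcpy() stands in for the copy into the final image (in the kernel the RX buffer is not directly writable).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pair of buffers: a writable scratch copy and the final image. */
struct sketch_pack {
	uint32_t *rw;    /* the JIT writes instructions here */
	uint32_t *rx;    /* final image; read-only and executable in the kernel */
	size_t n_insns;  /* capacity in 32-bit instructions */
	size_t used;     /* instructions emitted so far */
};

/* Allocate both buffers, zero-filled. Returns 0 on success, -1 on failure. */
int sketch_pack_alloc(struct sketch_pack *p, size_t n_insns)
{
	p->rw = calloc(n_insns, sizeof(*p->rw));
	p->rx = calloc(n_insns, sizeof(*p->rx));
	if (!p->rw || !p->rx) {
		free(p->rw);
		free(p->rx);
		p->rw = p->rx = NULL;
		return -1;
	}
	p->n_insns = n_insns;
	p->used = 0;
	return 0;
}

/* Append one instruction to the RW buffer only. Returns -1 when full. */
int sketch_emit(struct sketch_pack *p, uint32_t insn)
{
	if (p->used >= p->n_insns)
		return -1;
	p->rw[p->used++] = insn;
	return 0;
}

/* Copy the finished program into the final buffer and drop the scratch copy. */
void sketch_pack_finalize(struct sketch_pack *p)
{
	memcpy(p->rx, p->rw, p->used * sizeof(*p->rw));
	free(p->rw);
	p->rw = NULL;
}

/* Release whatever is still allocated. */
void sketch_pack_free(struct sketch_pack *p)
{
	free(p->rw);
	free(p->rx);
	p->rw = p->rx = NULL;
}
```

The point of the split is that the final image is never modified while the program is being generated; it only changes in the single copy step at the end.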

Implement bpf_arch_text_copy() and bpf_arch_text_invalidate() for ARM64
JIT as these functions are required by bpf_jit_binary_pack allocator.

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240228141824.119877-3-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 22fc0e80 01-Feb-2024 Puranjay Mohan <puranjay12@gmail.com>

bpf, arm64: support exceptions

The prologue generation code has been modified to make the callback
program use the stack of the program marked as exception boundary where
callee-saved registers are already pushed.

As the bpf_throw function never returns, if it clobbers any callee-saved
registers, they would remain clobbered. So, the prologue of the
exception-boundary program is modified to push R23 and R24 as well,
which the callback will then recover in its epilogue.

The Procedure Call Standard for the Arm 64-bit Architecture[1] states
that registers r19 to r28 should be saved by the callee. BPF programs on
ARM64 already save all callee-saved registers except r23 and r24. This
patch adds an instruction in the prologue of the program to save these
two registers and another instruction in the epilogue to recover them.

These extra instructions are only added if bpf_throw() is used. Otherwise
the emitted prologue/epilogue remains unchanged.

[1] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240201125225.72796-3-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 18a45f12 19-Jan-2024 Hou Tao <houtao1@huawei.com>

bpf, arm64: Enable the inline of bpf_kptr_xchg()

ARM64 bpf jit satisfies the following two conditions:
1) it supports BPF_XCHG() on pointer-sized words.
2) the implementation of xchg is the same as atomic_xchg() on
pointer-sized words. Both functions use arch_xchg() to
implement the exchange.

So enable the inline of bpf_kptr_xchg() for arm64 bpf jit.
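
As a rough user-space analogy, an exchange on a pointer-sized word can be written with C11 atomics. This sketch does not use the kernel's arch_xchg(); sketch_kptr_xchg() is a hypothetical name, and atomic_exchange() from <stdatomic.h> stands in for the kernel primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Atomically store new_ptr into *slot and return the pointer that was
 * there before. The whole operation is a single exchange on one
 * pointer-sized word.
 */
void *sketch_kptr_xchg(_Atomic(void *) *slot, void *new_ptr)
{
	return atomic_exchange(slot, new_ptr);
}
```

Because the helper reduces to one exchange on one word, a JIT that can emit that exchange directly does not need a function call for it.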

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20240119102529.99581-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 26ef208c 06-Dec-2023 Song Liu <song@kernel.org>

bpf: Use arch_bpf_trampoline_size

Instead of blindly allocating PAGE_SIZE for each trampoline, check the size
of the trampoline with arch_bpf_trampoline_size(). This size is saved in
bpf_tramp_image->size, and used for modmem charge/uncharge. The fallback
arch_alloc_bpf_trampoline() still allocates a whole page because we need to
use set_memory_* to protect the memory.

struct_ops trampoline still uses a whole page for multiple trampolines.

With this size check at the caller (regular trampoline and struct_ops
trampoline), remove arch_bpf_trampoline_size() from
arch_prepare_bpf_trampoline() in archs.

Also, update bpf_image_ksym_add() to handle symbols of different sizes.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv
Link: https://lore.kernel.org/r/20231206224054.492250-7-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 96d1b7c0 06-Dec-2023 Song Liu <song@kernel.org>

bpf: Add arch_bpf_trampoline_size()

This helper will be used to calculate the size of the trampoline before
allocating the memory.

arch_prepare_bpf_trampoline() for arm64 and riscv64 can use
arch_bpf_trampoline_size() to check the trampoline fits in the image.

OTOH, arch_prepare_bpf_trampoline() for s390 has to call the JIT process
twice, so it cannot use arch_bpf_trampoline_size().

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv
Link: https://lore.kernel.org/r/20231206224054.492250-6-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

