Exploring BPF LSM support on aarch64 with ftrace
This blog post is an extract of an internal research we made about BPF LSM
support on aarch64
in Linux. It's very difficult to start looking inside the kernel source if you are not familiar with the codebase, so we decided to publish this post to show our approach as it can be useful to anyone who wants to explore kernel internals.
Introduction
We were already using BPF LSM
on x86_64
, while on aarch64
we are relying on Kprobes
, so we wanted to known what is missing inside the kernel to have this features on aarch64
.
We dug many times into the kernel source code, but in general we search for something that is already there to understand how it works. In this case instead we were looking for something that doesn't exists, looking for something that returns an error because is not implemented.
Remembering Steven Rostedt talks on how to start learning the Linux kernel, we started with ftrace
(and tools built on the tracing infrastructure) to understand what happens when we load an unsupported BPF
program into the kernel.
The problem
This is the output when we try to load a BPF LSM
program into a aarch64
5.15 Linux kernel using our software pulsar:
root@pine64-1:/home/exein# ./pulsar-enterprise-exec pulsard
[2023-02-16T14:52:45Z INFO pulsar::pulsard::daemon] Starting module process-monitor
[2023-02-16T14:52:45Z INFO pulsar::pulsard::daemon] Starting module file-system-monitor
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module network-monitor
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module logger
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module rules-engine
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module desktop-notifier
[2023-02-16T14:52:46Z ERROR pulsar::pulsard::module_manager] Module error in file-system-monitor: failed program attach lsm path_mknod
Caused by:
0: `bpf_raw_tracepoint_open` failed
1: No error information (os error 524)
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module anomaly-detection
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module malware-detection
[2023-02-16T14:52:46Z ERROR pulsar::pulsard::module_manager] Module error in malware-detection: /var/lib/pulsar/malware_detection/models/parameters.json not found
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module platform-connector
[2023-02-16T14:52:46Z INFO platform_connector::client] Connected to https://platform-dev-instance.exein.io:8001/
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module threat-response
[2023-02-16T14:52:46Z ERROR pulsar::pulsard::module_manager] Module error in network-monitor: failed program attach lsm socket_bind
Caused by:
0: `bpf_raw_tracepoint_open` failed
1: No error information (os error 524)
We got error 524
, or ENOTSUPP
, while pulsar
tries to load the BPF program related to path_mknod
LSM hook. Let's try to dig into this problem.
Note: At the time of this research we were not able to find a precompiledaarch64
withBPF
andBTF
enabled, so we had to compile a custom kernel. We also enabled the tracing option and thefunction_graph
plugin to use the tools below.
All the experiments were done on a Pine A64 with custom Armbian images.
This images have custom kernels with a standardUbuntu 22.04 LTS Jammy
userspace.
The tools
To investigate on this issue we used the following tools:
- bpftrace: BPF based tool to dynamically attach probes with a custom C-like language.
- trace-cmd: wrapper around
tracefs
filesystem that interacts withftrace
infrastructure.
To use this tools you need some options enabled into the Linux kernel, please check the official documentation for the full requirements.
Note: It's also possible to use other tools to do the same job, for examplefuncgraph
andkprobe
from perf-tools
Linux 5.15
Now we start using these tools to see what happens in the kernel 5.15 when we try to load our BPF program.
From this point to the end of this post we will use the probe
binary instead of pulsar
, because it is simpler. To recap how it work this is the command line help:
exein@pine64-1:~$ ./probe
Test runner for eBPF programs
Usage: probe [OPTIONS] <COMMAND>
Commands:
file-system-monitor Watch file creations
process-monitor Watch process events (fork/exec/exit)
network-monitor Watch network events
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose
-h, --help Print help
-V, --version Print version
In the examples, we will try to load the file-system-monitor
probe.
Running the following commands we can see the function graph calls of __sys_bpf
function, the BPF syscall entrypoint:
trace-cmd record -p function_graph -g __sys_bpf ./probe file-system-monitor
trace-cmd report
The output is a giant function graph, too large to paste here. Because we have an error we are interested in the last functions before the program stopped. These are the last lines of the trace-cmd report
output.
...
tokio-runtime-w-1666 [003] 1318.058019: funcgraph_entry: | bpf_trampoline_link_prog() {
tokio-runtime-w-1666 [003] 1318.058020: funcgraph_entry: 2.292 us | bpf_attach_type_to_tramp();
tokio-runtime-w-1666 [003] 1318.058024: funcgraph_entry: 1.250 us | mutex_lock();
tokio-runtime-w-1666 [003] 1318.058028: funcgraph_entry: | bpf_trampoline_update() {
tokio-runtime-w-1666 [003] 1318.058030: funcgraph_entry: | kmem_cache_alloc_trace() {
tokio-runtime-w-1666 [003] 1318.058031: funcgraph_entry: 1.167 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058036: funcgraph_exit: 6.792 us | }
tokio-runtime-w-1666 [003] 1318.058039: funcgraph_entry: | kmem_cache_alloc_trace() {
tokio-runtime-w-1666 [003] 1318.058042: funcgraph_entry: 2.750 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058046: funcgraph_exit: 6.417 us | }
tokio-runtime-w-1666 [003] 1318.058048: funcgraph_entry: 2.708 us | bpf_jit_charge_modmem();
tokio-runtime-w-1666 [003] 1318.058053: funcgraph_entry: | bpf_jit_alloc_exec_page() {
tokio-runtime-w-1666 [003] 1318.058055: funcgraph_entry: | bpf_jit_alloc_exec() {
tokio-runtime-w-1666 [003] 1318.058057: funcgraph_entry: | vmalloc() {
tokio-runtime-w-1666 [003] 1318.058059: funcgraph_entry: | __vmalloc_node() {
tokio-runtime-w-1666 [003] 1318.058061: funcgraph_entry: | __vmalloc_node_range() {
tokio-runtime-w-1666 [003] 1318.058064: funcgraph_entry: | __get_vm_area_node.constprop.64() {
tokio-runtime-w-1666 [003] 1318.058067: funcgraph_entry: | kmem_cache_alloc_node_trace() {
tokio-runtime-w-1666 [003] 1318.058069: funcgraph_entry: 1.459 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058073: funcgraph_exit: 6.292 us | }
tokio-runtime-w-1666 [003] 1318.058075: funcgraph_entry: | alloc_vmap_area() {
tokio-runtime-w-1666 [003] 1318.058077: funcgraph_entry: | kmem_cache_alloc_node() {
tokio-runtime-w-1666 [003] 1318.058079: funcgraph_entry: 1.167 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058085: funcgraph_exit: 7.625 us | }
tokio-runtime-w-1666 [003] 1318.058088: funcgraph_entry: | kmem_cache_alloc_node() {
tokio-runtime-w-1666 [003] 1318.058089: funcgraph_entry: 1.208 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058092: funcgraph_exit: 4.584 us | }
tokio-runtime-w-1666 [003] 1318.058104: funcgraph_entry: | kmem_cache_free() {
tokio-runtime-w-1666 [003] 1318.058107: funcgraph_entry: 2.084 us | __slab_free();
tokio-runtime-w-1666 [003] 1318.058110: funcgraph_exit: 5.667 us | }
tokio-runtime-w-1666 [003] 1318.058112: funcgraph_entry: 6.375 us | insert_vmap_area.constprop.74();
tokio-runtime-w-1666 [003] 1318.058119: funcgraph_exit: + 44.667 us | }
tokio-runtime-w-1666 [003] 1318.058122: funcgraph_exit: + 58.250 us | }
tokio-runtime-w-1666 [003] 1318.058124: funcgraph_entry: | __kmalloc_node() {
tokio-runtime-w-1666 [003] 1318.058125: funcgraph_entry: 1.625 us | kmalloc_slab();
tokio-runtime-w-1666 [003] 1318.058128: funcgraph_entry: 1.167 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058131: funcgraph_exit: 7.208 us | }
tokio-runtime-w-1666 [003] 1318.058133: funcgraph_entry: | alloc_pages() {
tokio-runtime-w-1666 [003] 1318.058135: funcgraph_entry: 1.583 us | get_task_policy.part.48();
tokio-runtime-w-1666 [003] 1318.058138: funcgraph_entry: 1.500 us | policy_node();
tokio-runtime-w-1666 [003] 1318.058141: funcgraph_entry: 1.209 us | policy_nodemask();
tokio-runtime-w-1666 [003] 1318.058143: funcgraph_entry: | __alloc_pages() {
tokio-runtime-w-1666 [003] 1318.058145: funcgraph_entry: 1.458 us | should_fail_alloc_page();
tokio-runtime-w-1666 [003] 1318.058147: funcgraph_entry: | get_page_from_freelist() {
tokio-runtime-w-1666 [003] 1318.058150: funcgraph_entry: 1.583 us | prep_new_page();
tokio-runtime-w-1666 [003] 1318.058153: funcgraph_exit: 5.459 us | }
tokio-runtime-w-1666 [003] 1318.058154: funcgraph_exit: + 10.542 us | }
tokio-runtime-w-1666 [003] 1318.058155: funcgraph_exit: + 22.083 us | }
tokio-runtime-w-1666 [003] 1318.058157: funcgraph_entry: | __cond_resched() {
tokio-runtime-w-1666 [003] 1318.058158: funcgraph_entry: 1.833 us | rcu_all_qs();
tokio-runtime-w-1666 [003] 1318.058161: funcgraph_exit: 4.167 us | }
tokio-runtime-w-1666 [003] 1318.058166: funcgraph_entry: 5.542 us | vmap_pages_range_noflush();
tokio-runtime-w-1666 [003] 1318.058173: funcgraph_exit: ! 112.375 us | }
tokio-runtime-w-1666 [003] 1318.058175: funcgraph_exit: ! 116.000 us | }
tokio-runtime-w-1666 [003] 1318.058176: funcgraph_exit: ! 119.292 us | }
tokio-runtime-w-1666 [003] 1318.058177: funcgraph_exit: ! 122.542 us | }
tokio-runtime-w-1666 [003] 1318.058179: funcgraph_entry: | find_vm_area() {
tokio-runtime-w-1666 [003] 1318.058180: funcgraph_entry: 1.375 us | find_vmap_area();
tokio-runtime-w-1666 [003] 1318.058183: funcgraph_exit: 4.333 us | }
tokio-runtime-w-1666 [003] 1318.058185: funcgraph_entry: | set_memory_x() {
tokio-runtime-w-1666 [003] 1318.058186: funcgraph_entry: | change_memory_common() {
tokio-runtime-w-1666 [003] 1318.058188: funcgraph_entry: | find_vm_area() {
tokio-runtime-w-1666 [003] 1318.058189: funcgraph_entry: 1.333 us | find_vmap_area();
tokio-runtime-w-1666 [003] 1318.058192: funcgraph_exit: 3.875 us | }
tokio-runtime-w-1666 [003] 1318.058193: funcgraph_entry: | vm_unmap_aliases() {
tokio-runtime-w-1666 [003] 1318.058194: funcgraph_entry: | _vm_unmap_aliases.part.58() {
tokio-runtime-w-1666 [003] 1318.058196: funcgraph_entry: 1.542 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058199: funcgraph_entry: 1.208 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058202: funcgraph_entry: 1.166 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058205: funcgraph_entry: 1.208 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058207: funcgraph_entry: 1.208 us | mutex_lock();
tokio-runtime-w-1666 [003] 1318.058210: funcgraph_entry: | purge_fragmented_blocks_allcpus() {
tokio-runtime-w-1666 [003] 1318.058212: funcgraph_entry: 1.500 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058214: funcgraph_entry: 1.500 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058217: funcgraph_entry: 1.500 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058220: funcgraph_entry: 1.167 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058222: funcgraph_exit: + 11.917 us | }
tokio-runtime-w-1666 [003] 1318.058224: funcgraph_entry: | __purge_vmap_area_lazy() {
tokio-runtime-w-1666 [003] 1318.058232: funcgraph_entry: | kmem_cache_free() {
tokio-runtime-w-1666 [003] 1318.058234: funcgraph_entry: 1.250 us | __slab_free();
tokio-runtime-w-1666 [003] 1318.058237: funcgraph_exit: 4.791 us | }
tokio-runtime-w-1666 [003] 1318.058241: funcgraph_entry: 1.209 us | __cond_resched_lock();
tokio-runtime-w-1666 [003] 1318.058244: funcgraph_exit: + 19.625 us | }
tokio-runtime-w-1666 [003] 1318.058245: funcgraph_entry: 1.167 us | mutex_unlock();
tokio-runtime-w-1666 [003] 1318.058247: funcgraph_exit: + 53.042 us | }
tokio-runtime-w-1666 [003] 1318.058248: funcgraph_exit: + 55.625 us | }
tokio-runtime-w-1666 [003] 1318.058250: funcgraph_entry: | __change_memory_common() {
tokio-runtime-w-1666 [003] 1318.058251: funcgraph_entry: | apply_to_page_range() {
tokio-runtime-w-1666 [003] 1318.058253: funcgraph_entry: | __apply_to_page_range() {
tokio-runtime-w-1666 [003] 1318.058255: funcgraph_entry: 1.250 us | pud_huge();
tokio-runtime-w-1666 [003] 1318.058258: funcgraph_entry: 1.166 us | pmd_huge();
tokio-runtime-w-1666 [003] 1318.058260: funcgraph_entry: 1.208 us | change_page_range();
tokio-runtime-w-1666 [003] 1318.058263: funcgraph_exit: 9.834 us | }
tokio-runtime-w-1666 [003] 1318.058264: funcgraph_exit: + 12.709 us | }
tokio-runtime-w-1666 [003] 1318.058266: funcgraph_exit: + 15.459 us | }
tokio-runtime-w-1666 [003] 1318.058268: funcgraph_exit: + 80.791 us | }
tokio-runtime-w-1666 [003] 1318.058270: funcgraph_exit: + 84.834 us | }
tokio-runtime-w-1666 [003] 1318.058272: funcgraph_exit: ! 218.500 us | }
tokio-runtime-w-1666 [003] 1318.058274: funcgraph_entry: | __alloc_percpu_gfp() {
tokio-runtime-w-1666 [003] 1318.058276: funcgraph_entry: | pcpu_alloc() {
tokio-runtime-w-1666 [003] 1318.058281: funcgraph_entry: 2.250 us | mutex_lock_killable();
tokio-runtime-w-1666 [003] 1318.058290: funcgraph_entry: | pcpu_find_block_fit() {
tokio-runtime-w-1666 [003] 1318.058293: funcgraph_entry: 2.833 us | pcpu_next_fit_region.constprop.38();
tokio-runtime-w-1666 [003] 1318.058299: funcgraph_exit: 9.084 us | }
tokio-runtime-w-1666 [003] 1318.058301: funcgraph_entry: | pcpu_alloc_area() {
tokio-runtime-w-1666 [003] 1318.058315: funcgraph_entry: 4.000 us | pcpu_block_update_hint_alloc();
tokio-runtime-w-1666 [003] 1318.058320: funcgraph_entry: 2.208 us | pcpu_chunk_relocate();
tokio-runtime-w-1666 [003] 1318.058324: funcgraph_exit: + 22.625 us | }
tokio-runtime-w-1666 [003] 1318.058327: funcgraph_entry: 1.208 us | mutex_unlock();
tokio-runtime-w-1666 [003] 1318.058332: funcgraph_entry: 1.584 us | pcpu_memcg_post_alloc_hook();
tokio-runtime-w-1666 [003] 1318.058335: funcgraph_exit: + 58.833 us | }
tokio-runtime-w-1666 [003] 1318.058336: funcgraph_exit: + 61.834 us | }
tokio-runtime-w-1666 [003] 1318.058338: funcgraph_entry: | kmem_cache_alloc_trace() {
tokio-runtime-w-1666 [003] 1318.058339: funcgraph_entry: 1.167 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058342: funcgraph_exit: 4.458 us | }
tokio-runtime-w-1666 [003] 1318.058359: funcgraph_entry: | bpf_image_ksym_add() {
tokio-runtime-w-1666 [003] 1318.058360: funcgraph_entry: | bpf_ksym_add() {
tokio-runtime-w-1666 [003] 1318.058363: funcgraph_entry: 1.583 us | __local_bh_enable_ip();
tokio-runtime-w-1666 [003] 1318.058366: funcgraph_exit: 5.750 us | }
tokio-runtime-w-1666 [003] 1318.058369: funcgraph_exit: 9.834 us | }
tokio-runtime-w-1666 [003] 1318.058371: funcgraph_entry: 1.250 us | arch_prepare_bpf_trampoline();
tokio-runtime-w-1666 [003] 1318.058373: funcgraph_entry: 2.292 us | kfree();
tokio-runtime-w-1666 [003] 1318.058377: funcgraph_exit: ! 348.625 us | }
tokio-runtime-w-1666 [003] 1318.058379: funcgraph_entry: 1.250 us | mutex_unlock();
tokio-runtime-w-1666 [003] 1318.058382: funcgraph_exit: ! 363.167 us | }
tokio-runtime-w-1666 [003] 1318.058384: funcgraph_entry: | bpf_link_cleanup() {
tokio-runtime-w-1666 [003] 1318.058386: funcgraph_entry: | bpf_link_free_id.part.30() {
tokio-runtime-w-1666 [003] 1318.058392: funcgraph_entry: | call_rcu() {
tokio-runtime-w-1666 [003] 1318.058396: funcgraph_entry: 1.834 us | rcu_segcblist_enqueue();
tokio-runtime-w-1666 [003] 1318.058401: funcgraph_exit: 9.333 us | }
tokio-runtime-w-1666 [003] 1318.058403: funcgraph_entry: 1.542 us | __local_bh_enable_ip();
tokio-runtime-w-1666 [003] 1318.058406: funcgraph_exit: + 19.542 us | }
tokio-runtime-w-1666 [003] 1318.058408: funcgraph_entry: | fput() {
tokio-runtime-w-1666 [003] 1318.058409: funcgraph_entry: | fput_many() {
tokio-runtime-w-1666 [003] 1318.058411: funcgraph_entry: | task_work_add() {
tokio-runtime-w-1666 [003] 1318.058414: funcgraph_entry: 1.625 us | kick_process();
tokio-runtime-w-1666 [003] 1318.058418: funcgraph_exit: 6.750 us | }
tokio-runtime-w-1666 [003] 1318.058419: funcgraph_exit: + 10.333 us | }
tokio-runtime-w-1666 [003] 1318.058420: funcgraph_exit: + 12.708 us | }
tokio-runtime-w-1666 [003] 1318.058422: funcgraph_entry: 2.250 us | put_unused_fd();
tokio-runtime-w-1666 [003] 1318.058426: funcgraph_exit: + 41.416 us | }
tokio-runtime-w-1666 [003] 1318.058428: funcgraph_entry: 1.292 us | mutex_unlock();
tokio-runtime-w-1666 [003] 1318.058430: funcgraph_entry: 1.250 us | kfree();
tokio-runtime-w-1666 [003] 1318.058433: funcgraph_exit: ! 567.458 us | }
tokio-runtime-w-1666 [003] 1318.058435: funcgraph_entry: 2.125 us | __bpf_prog_put.isra.47();
tokio-runtime-w-1666 [003] 1318.058438: funcgraph_exit: ! 602.291 us | }
tokio-runtime-w-1666 [003] 1318.058439: funcgraph_exit: ! 631.791 us | }
This is the corresponding source code in kernel/bpf/trampoline.c
for the last function executed, bpf_trampoline_update
:
static int bpf_trampoline_update(struct bpf_trampoline *tr)
{
struct bpf_tramp_image *im;
struct bpf_tramp_progs *tprogs;
u32 flags = BPF_TRAMP_F_RESTORE_REGS;
bool ip_arg = false;
int err, total;
tprogs = bpf_trampoline_get_progs(tr, &total, &ip_arg);
if (IS_ERR(tprogs))
return PTR_ERR(tprogs);
if (total == 0) {
err = unregister_fentry(tr, tr->cur_image->image);
bpf_tramp_image_put(tr->cur_image);
tr->cur_image = NULL;
tr->selector = 0;
goto out;
}
im = bpf_tramp_image_alloc(tr->key, tr->selector);
if (IS_ERR(im)) {
err = PTR_ERR(im);
goto out;
}
if (tprogs[BPF_TRAMP_FEXIT].nr_progs ||
tprogs[BPF_TRAMP_MODIFY_RETURN].nr_progs)
flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;
if (ip_arg)
flags |= BPF_TRAMP_F_IP_ARG;
err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
&tr->func.model, flags, tprogs,
tr->func.addr);
if (err < 0)
goto out;
WARN_ON(tr->cur_image && tr->selector == 0);
WARN_ON(!tr->cur_image && tr->selector);
if (tr->cur_image)
/* progs already running at this address */
err = modify_fentry(tr, tr->cur_image->image, im->image);
else
/* first time registering */
err = register_fentry(tr, im->image);
if (err)
goto out;
if (tr->cur_image)
bpf_tramp_image_put(tr->cur_image);
tr->cur_image = im;
tr->selector++;
out:
kfree(tprogs);
return err;
}
Looking at the previous output we can see:
tokio-runtime-w-1666 [003] 1318.058371: funcgraph_entry: 1.250 us | arch_prepare_bpf_trampoline();
tokio-runtime-w-1666 [003] 1318.058373: funcgraph_entry: 2.292 us | kfree();
so there are no other function calls between arch_prepare_bpf_trampoline
and kfree
functions, so probably the first function returns an error code in the err
variable. Let's verify it!
Launching bpftace
in a shell in this way, we capture the return value of arch_prepare_bpf_trampoline
function and print it to the console:
bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval link: %d\n", retval); }'
and starting probe
in another terminal we get the the following output from bpftace
:
root@pine64-1:/home/exein# bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval link: %d\n", retval); }'
Attaching 1 probe...
retval link: -524
This is because the kernel 5.15
lacks the implementation of arch_prepare_bpf_trampoline
for aarch64
and uses the default implementation placeholder:
int __weak
arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end,
const struct btf_func_model *m, u32 flags,
struct bpf_tramp_links *tlinks,
void *orig_call)
{
return -ENOTSUPP;
}
So this feature is unsupported for this kernel version. The good news is that is implemented in kernel 6.x thanks to this patch.
Let's move to the kernel 6.x.
Linux 6.1
If we try to run probe
on the kernel 6.1 we get the following output:
root@pine64:/home/exein# ./probe file-system-monitor
thread 'main' panicked at 'initialization failed: ProgramAttachError { program: "lsm path_mknod", program_error: SyscallError { call: "bpf_raw_tracepoint_open", io_error: Os { code: 524, kind: Uncategorized, message: "No error information" } } }', src/bin/probe.rs:72:43
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
the same error of the 5.15 kernel !!! Let's find an explanation.
Running bpftrace
on arch_prepare_bpf_trampoline
, this time we get this output:
root@pine64:/home/exein# bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval tp link: %d\n", retval); }'
Attaching 1 probe...
retval tp link: 284
So the problem is not here, the function doesn't return an error anymore. Let's go back to the function graph.
This time we launch trace-cmd
skipping some functions to have a cleaner output:
trace-cmd record \
-p function_graph \
-g bpf_trampoline_link_prog \
-n bpf_jit_alloc_exec \
-n kmalloc_trace \
-n arch_prepare_bpf_trampoline \
-n generic_handle_domain_irq \
-n do_interrupt_handler \
-n irq_exit_rcu \
./probe file-system-monitor
And we get this output from trace-cmd report
:
root@pine64:/home/exein# trace-cmd report
CPU 0 is empty
CPU 1 is empty
CPU 3 is empty
cpus=4
tokio-runtime-w-11886 [002] 193385.056283: funcgraph_entry: | bpf_trampoline_link_prog() {
tokio-runtime-w-11886 [002] 193385.056321: funcgraph_entry: + 15.042 us | mutex_lock();
tokio-runtime-w-11886 [002] 193385.056373: funcgraph_entry: | __bpf_trampoline_link_prog() {
tokio-runtime-w-11886 [002] 193385.056395: funcgraph_entry: + 14.833 us | bpf_attach_type_to_tramp();
tokio-runtime-w-11886 [002] 193385.056428: funcgraph_entry: | bpf_trampoline_update.isra.23() {
tokio-runtime-w-11886 [002] 193385.056459: funcgraph_entry: 2.917 us | bpf_jit_charge_modmem();
tokio-runtime-w-11886 [002] 193385.056531: funcgraph_entry: | find_vm_area() {
tokio-runtime-w-11886 [002] 193385.056540: funcgraph_entry: 3.000 us | find_vmap_area();
tokio-runtime-w-11886 [002] 193385.056547: funcgraph_exit: + 16.208 us | }
tokio-runtime-w-11886 [002] 193385.056554: funcgraph_entry: | __alloc_percpu_gfp() {
tokio-runtime-w-11886 [002] 193385.056563: funcgraph_entry: | pcpu_alloc() {
tokio-runtime-w-11886 [002] 193385.056568: funcgraph_entry: 4.875 us | mutex_lock_killable();
tokio-runtime-w-11886 [002] 193385.056591: funcgraph_entry: | pcpu_find_block_fit() {
tokio-runtime-w-11886 [002] 193385.056599: funcgraph_entry: 8.625 us | pcpu_next_fit_region.constprop.38();
tokio-runtime-w-11886 [002] 193385.056608: funcgraph_exit: + 17.166 us | }
tokio-runtime-w-11886 [002] 193385.056610: funcgraph_entry: | pcpu_alloc_area() {
tokio-runtime-w-11886 [002] 193385.056639: funcgraph_entry: 9.167 us | pcpu_block_update();
tokio-runtime-w-11886 [002] 193385.056656: funcgraph_entry: 7.667 us | pcpu_block_update_hint_alloc();
tokio-runtime-w-11886 [002] 193385.056671: funcgraph_entry: 7.750 us | pcpu_chunk_relocate();
tokio-runtime-w-11886 [002] 193385.056679: funcgraph_exit: + 69.667 us | }
tokio-runtime-w-11886 [002] 193385.056682: funcgraph_entry: 7.042 us | mutex_unlock();
tokio-runtime-w-11886 [002] 193385.056703: funcgraph_entry: 2.792 us | pcpu_memcg_post_alloc_hook();
tokio-runtime-w-11886 [002] 193385.056712: funcgraph_exit: ! 148.709 us | }
tokio-runtime-w-11886 [002] 193385.056719: funcgraph_exit: ! 165.250 us | }
tokio-runtime-w-11886 [002] 193385.056866: funcgraph_entry: | bpf_image_ksym_add() {
tokio-runtime-w-11886 [002] 193385.056873: funcgraph_entry: | bpf_ksym_add() {
tokio-runtime-w-11886 [002] 193385.056882: funcgraph_entry: 2.750 us | __local_bh_disable_ip();
tokio-runtime-w-11886 [002] 193385.056897: funcgraph_entry: 4.625 us | __local_bh_enable_ip();
tokio-runtime-w-11886 [002] 193385.056905: funcgraph_exit: + 32.459 us | }
tokio-runtime-w-11886 [002] 193385.056922: funcgraph_entry: 7.584 us | perf_event_ksymbol();
tokio-runtime-w-11886 [002] 193385.056944: funcgraph_exit: + 78.417 us | }
tokio-runtime-w-11886 [002] 193385.057492: funcgraph_entry: | set_memory_ro() {
tokio-runtime-w-11886 [002] 193385.057501: funcgraph_entry: | change_memory_common() {
tokio-runtime-w-11886 [002] 193385.057504: funcgraph_entry: | find_vm_area() {
tokio-runtime-w-11886 [002] 193385.057506: funcgraph_entry: 8.875 us | find_vmap_area();
tokio-runtime-w-11886 [002] 193385.057518: funcgraph_exit: + 14.250 us | }
tokio-runtime-w-11886 [002] 193385.057522: funcgraph_entry: | __change_memory_common() {
tokio-runtime-w-11886 [002] 193385.057531: funcgraph_entry: | apply_to_page_range() {
tokio-runtime-w-11886 [002] 193385.057538: funcgraph_entry: | __apply_to_page_range() {
tokio-runtime-w-11886 [002] 193385.057544: funcgraph_entry: + 12.791 us | pud_huge();
tokio-runtime-w-11886 [002] 193385.057559: funcgraph_entry: 2.708 us | pmd_huge();
tokio-runtime-w-11886 [002] 193385.057574: funcgraph_entry: + 15.125 us | change_page_range();
tokio-runtime-w-11886 [002] 193385.057591: funcgraph_exit: + 53.792 us | }
tokio-runtime-w-11886 [002] 193385.057597: funcgraph_exit: + 66.083 us | }
tokio-runtime-w-11886 [002] 193385.057610: funcgraph_exit: + 88.125 us | }
tokio-runtime-w-11886 [002] 193385.057619: funcgraph_entry: | vm_unmap_aliases() {
tokio-runtime-w-11886 [002] 193385.057622: funcgraph_entry: | _vm_unmap_aliases.part.77() {
tokio-runtime-w-11886 [002] 193385.057625: funcgraph_entry: 9.125 us | mutex_lock();
tokio-runtime-w-11886 [002] 193385.057637: funcgraph_entry: 3.084 us | purge_fragmented_blocks_allcpus();
tokio-runtime-w-11886 [002] 193385.057643: funcgraph_entry: | __purge_vmap_area_lazy() {
tokio-runtime-w-11886 [002] 193385.057687: funcgraph_entry: | kmem_cache_free() {
tokio-runtime-w-11886 [002] 193385.057693: funcgraph_entry: + 13.250 us | __slab_free();
tokio-runtime-w-11886 [002] 193385.057705: funcgraph_exit: + 18.750 us | }
tokio-runtime-w-11886 [002] 193385.057718: funcgraph_entry: 7.416 us | __cond_resched_lock();
tokio-runtime-w-11886 [002] 193385.057733: funcgraph_exit: + 90.042 us | }
tokio-runtime-w-11886 [002] 193385.057741: funcgraph_entry: 2.792 us | mutex_unlock();
tokio-runtime-w-11886 [002] 193385.057747: funcgraph_exit: ! 124.666 us | }
tokio-runtime-w-11886 [002] 193385.057749: funcgraph_exit: ! 130.291 us | }
tokio-runtime-w-11886 [002] 193385.057756: funcgraph_entry: | __change_memory_common() {
tokio-runtime-w-11886 [002] 193385.057759: funcgraph_entry: | apply_to_page_range() {
tokio-runtime-w-11886 [002] 193385.057765: funcgraph_entry: | __apply_to_page_range() {
tokio-runtime-w-11886 [002] 193385.057768: funcgraph_entry: 4.125 us | pud_huge();
tokio-runtime-w-11886 [002] 193385.057778: funcgraph_entry: 8.750 us | pmd_huge();
tokio-runtime-w-11886 [002] 193385.057790: funcgraph_entry: 4.625 us | change_page_range();
tokio-runtime-w-11886 [002] 193385.057797: funcgraph_exit: + 31.958 us | }
tokio-runtime-w-11886 [002] 193385.057803: funcgraph_exit: + 44.375 us | }
tokio-runtime-w-11886 [002] 193385.057817: funcgraph_exit: + 61.208 us | }
tokio-runtime-w-11886 [002] 193385.057820: funcgraph_exit: ! 319.292 us | }
tokio-runtime-w-11886 [002] 193385.057826: funcgraph_exit: ! 333.667 us | }
tokio-runtime-w-11886 [002] 193385.057840: funcgraph_entry: | set_memory_x() {
tokio-runtime-w-11886 [002] 193385.057847: funcgraph_entry: | change_memory_common() {
tokio-runtime-w-11886 [002] 193385.057855: funcgraph_entry: | find_vm_area() {
tokio-runtime-w-11886 [002] 193385.057858: funcgraph_entry: 2.917 us | find_vmap_area();
tokio-runtime-w-11886 [002] 193385.057870: funcgraph_exit: + 14.375 us | }
tokio-runtime-w-11886 [002] 193385.057876: funcgraph_entry: | vm_unmap_aliases() {
tokio-runtime-w-11886 [002] 193385.057879: funcgraph_entry: | _vm_unmap_aliases.part.77() {
tokio-runtime-w-11886 [002] 193385.057882: funcgraph_entry: 3.959 us | mutex_lock();
tokio-runtime-w-11886 [002] 193385.057893: funcgraph_entry: 3.000 us | purge_fragmented_blocks_allcpus();
tokio-runtime-w-11886 [002] 193385.057900: funcgraph_entry: 2.791 us | __purge_vmap_area_lazy();
tokio-runtime-w-11886 [002] 193385.057907: funcgraph_entry: 2.709 us | mutex_unlock();
tokio-runtime-w-11886 [002] 193385.057913: funcgraph_exit: + 33.708 us | }
tokio-runtime-w-11886 [002] 193385.057915: funcgraph_exit: + 43.000 us | }
tokio-runtime-w-11886 [002] 193385.057922: funcgraph_entry: | __change_memory_common() {
tokio-runtime-w-11886 [002] 193385.057925: funcgraph_entry: | apply_to_page_range() {
tokio-runtime-w-11886 [002] 193385.057930: funcgraph_entry: | __apply_to_page_range() {
tokio-runtime-w-11886 [002] 193385.057933: funcgraph_entry: 4.292 us | pud_huge();
tokio-runtime-w-11886 [002] 193385.057945: funcgraph_entry: 8.750 us | pmd_huge();
tokio-runtime-w-11886 [002] 193385.057956: funcgraph_entry: 3.958 us | change_page_range();
tokio-runtime-w-11886 [002] 193385.058037: funcgraph_exit: + 32.083 us | }
tokio-runtime-w-11886 [002] 193385.058089: funcgraph_entry: 7.667 us | irq_enter_rcu();
tokio-runtime-w-11886 [002] 193385.058233: funcgraph_exit: ! 308.041 us | }
tokio-runtime-w-11886 [002] 193385.058239: funcgraph_exit: ! 316.709 us | }
tokio-runtime-w-11886 [002] 193385.058247: funcgraph_exit: ! 400.417 us | }
tokio-runtime-w-11886 [002] 193385.058255: funcgraph_exit: ! 415.000 us | }
tokio-runtime-w-11886 [002] 193385.058555: funcgraph_entry: 8.250 us | irq_enter_rcu();
tokio-runtime-w-11886 [002] 193385.058958: funcgraph_entry: | kallsyms_lookup_size_offset() {
tokio-runtime-w-11886 [002] 193385.058974: funcgraph_entry: + 36.333 us | get_symbol_pos();
tokio-runtime-w-11886 [002] 193385.059017: funcgraph_exit: + 59.750 us | }
tokio-runtime-w-11886 [002] 193385.059043: funcgraph_entry: | kfree() {
tokio-runtime-w-11886 [002] 193385.059057: funcgraph_entry: 3.000 us | __kmem_cache_free();
tokio-runtime-w-11886 [002] 193385.059065: funcgraph_exit: + 22.833 us | }
tokio-runtime-w-11886 [002] 193385.059073: funcgraph_exit: # 2644.708 us | }
tokio-runtime-w-11886 [002] 193385.059079: funcgraph_exit: # 2706.292 us | }
tokio-runtime-w-11886 [002] 193385.059095: funcgraph_entry: 2.792 us | mutex_unlock();
tokio-runtime-w-11886 [002] 193385.059101: funcgraph_exit: # 2870.416 us | }
So this time the program has passed arch_prepare_bpf_trampoline
, set_memory_ro
and set_memory_x
and the last function we see is kallsyms_lookup_size_offset
.
As we can see in bpf_trampoline_update
function from kernel/bpf/trampoline.c
, there is no explicit call to kallsyms_lookup_size_offset
:
static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex)
{
// ... OTHER CODE ...
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
again:
if ((tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY) &&
(tr->flags & BPF_TRAMP_F_CALL_ORIG))
tr->flags |= BPF_TRAMP_F_ORIG_STACK;
#endif
err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
&tr->func.model, tr->flags, tlinks,
tr->func.addr);
if (err < 0)
goto out;
set_memory_ro((long)im->image, 1);
set_memory_x((long)im->image, 1);
WARN_ON(tr->cur_image && tr->selector == 0);
WARN_ON(!tr->cur_image && tr->selector);
if (tr->cur_image)
/* progs already running at this address */
err = modify_fentry(tr, tr->cur_image->image, im->image, lock_direct_mutex);
else
/* first time registering */
err = register_fentry(tr, im->image);
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
if (err == -EAGAIN) {
/* -EAGAIN from bpf_tramp_ftrace_ops_func. Now
* BPF_TRAMP_F_SHARE_IPMODIFY is set, we can generate the
* trampoline again, and retry register.
*/
/* reset fops->func and fops->trampoline for re-register */
tr->fops->func = NULL;
tr->fops->trampoline = 0;
/* reset im->image memory attr for arch_prepare_bpf_trampoline */
set_memory_nx((long)im->image, 1);
set_memory_rw((long)im->image, 1);
goto again;
}
#endif
if (err)
goto out;
if (tr->cur_image)
bpf_tramp_image_put(tr->cur_image);
tr->cur_image = im;
tr->selector++;
out:
/* If any error happens, restore previous flags */
if (err)
tr->flags = orig_flags;
kfree(tlinks);
return err;
}
Note: Implementation of bpf_trampoline_update
is a bit different from the one in the previous kernel 5.15
The kallsyms_lookup_size_offset
call is hidden inside another function. We don't see it on the function graph because the compiler inlined it.
It seems that kallsyms_lookup_size_offset
is called by ftrace_location
:
unsigned long ftrace_location(unsigned long ip)
{
struct dyn_ftrace *rec;
unsigned long offset;
unsigned long size;
rec = lookup_rec(ip, ip);
if (!rec) {
if (!kallsyms_lookup_size_offset(ip, &size, &offset))
goto out;
/* map sym+0 to __fentry__ */
if (!offset)
rec = lookup_rec(ip, ip + size - 1);
}
if (rec)
return rec->ip;
out:
return 0;
}
ftrace_location
is called by register_fentry
that contains a check after the ftrace_location
call on the field fops
of struct bpf_trampoline *tr
:
/* first time registering */
static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
{
void *ip = tr->func.addr;
unsigned long faddr;
int ret;
faddr = ftrace_location((unsigned long)ip);
if (faddr) {
if (!tr->fops)
return -ENOTSUPP;
tr->func.ftrace_managed = true;
}
if (bpf_trampoline_module_get(tr))
return -ENOENT;
if (tr->func.ftrace_managed) {
ftrace_set_filter_ip(tr->fops, (unsigned long)ip, 0, 1);
ret = register_ftrace_direct_multi(tr->fops, (long)new_addr);
} else {
ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);
}
if (ret)
bpf_trampoline_module_put(tr);
return ret;
}
Indeed, if tr->fops
is false
the function returns with error -ENOTSUPP
.
Let's find where tr->fops
is initialized.
If we are correct, the place where the trampoline is created is inside bpf_trampoline_lookup
:
static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
{
struct bpf_trampoline *tr;
struct hlist_head *head;
int i;
mutex_lock(&trampoline_mutex);
head = &trampoline_table[hash_64(key, TRAMPOLINE_HASH_BITS)];
hlist_for_each_entry(tr, head, hlist) {
if (tr->key == key) {
refcount_inc(&tr->refcnt);
goto out;
}
}
tr = kzalloc(sizeof(*tr), GFP_KERNEL);
if (!tr)
goto out;
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
tr->fops = kzalloc(sizeof(struct ftrace_ops), GFP_KERNEL);
if (!tr->fops) {
kfree(tr);
tr = NULL;
goto out;
}
tr->fops->private = tr;
tr->fops->ops_func = bpf_tramp_ftrace_ops_func;
#endif
tr->key = key;
INIT_HLIST_NODE(&tr->hlist);
hlist_add_head(&tr->hlist, head);
refcount_set(&tr->refcnt, 1);
mutex_init(&tr->mutex);
for (i = 0; i < BPF_TRAMP_MAX; i++)
INIT_HLIST_HEAD(&tr->progs_hlist[i]);
out:
mutex_unlock(&trampoline_mutex);
return tr;
}
After the allocation, the fops
field of the trampoline is filled only if CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
flag is present. This flag depends on HAVE_CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
flag and this is not present on aarch64
.
Conclusions
At this time it's impossible to use BPF LSM
on aarch64
because of the missing ftrace direct call feature. Luckily, there is a patch already merged in the current mainline
branch. This patch is going to enable LSMs (and other features) on aarch64
.
We can expect this changes to be released in the next 6.4 Linux kernel.
Domenico Salvatore - Software Engineer