Exploring BPF LSM support on aarch64 with ftrace

This blog post is an extract of an internal research we made about BPF LSM support on aarch64 in Linux. It's very difficult to start looking inside the kernel source if you are not familiar with the codebase, so we decided to publish this post to show our approach as it can be useful to anyone who wants to explore kernel internals.

Introduction

We were already using BPF LSM on x86_64, while on aarch64 we are relying on Kprobes, so we wanted to known what is missing inside the kernel to have this features on aarch64.

We dug many times into the kernel source code, but in general we search for something that is already there to understand how it works. In this case instead we were looking for something that doesn't exists, looking for something that returns an error because is not implemented.

Remembering Steven Rostedt talks on how to start learning the Linux kernel, we started with ftrace (and tools built on the tracing infrastructure) to understand what happens when we load an unsupported BPF program into the kernel.

The problem

This is the output when we try to load a BPF LSM program into a aarch64 5.15 Linux kernel using our software pulsar:

root@pine64-1:/home/exein# ./pulsar-enterprise-exec pulsard
[2023-02-16T14:52:45Z INFO  pulsar::pulsard::daemon] Starting module process-monitor
[2023-02-16T14:52:45Z INFO  pulsar::pulsard::daemon] Starting module file-system-monitor
[2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module network-monitor
[2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module logger
[2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module rules-engine
[2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module desktop-notifier
[2023-02-16T14:52:46Z ERROR pulsar::pulsard::module_manager] Module error in file-system-monitor: failed program attach lsm path_mknod
    
    Caused by:
        0: `bpf_raw_tracepoint_open` failed
        1: No error information (os error 524)
[2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module anomaly-detection
[2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module malware-detection
[2023-02-16T14:52:46Z ERROR pulsar::pulsard::module_manager] Module error in malware-detection: /var/lib/pulsar/malware_detection/models/parameters.json not found
[2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module platform-connector
[2023-02-16T14:52:46Z INFO  platform_connector::client] Connected to https://platform-dev-instance.exein.io:8001/
[2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module threat-response
[2023-02-16T14:52:46Z ERROR pulsar::pulsard::module_manager] Module error in network-monitor: failed program attach lsm socket_bind
    
    Caused by:
        0: `bpf_raw_tracepoint_open` failed
        1: No error information (os error 524)

We got error 524, or ENOTSUPP, while pulsar tries to load the BPF program related to path_mknod LSM hook. Let's try to dig into this problem.

Note: At the time of this research we were not able to find a precompiled aarch64 with BPF and BTF enabled, so we had to compile a custom kernel. We also enabled the tracing option and the function_graph plugin to use the tools below.
All the experiments were done on a Pine A64 with custom Armbian images.
This images have custom kernels with a standard Ubuntu 22.04 LTS Jammy userspace.

The tools

To investigate on this issue we used the following tools:

  • bpftrace: BPF based tool to dynamically attach probes with a custom C-like language.
  • trace-cmd: wrapper around tracefs filesystem that interacts with ftrace infrastructure.

To use this tools you need some options enabled into the Linux kernel, please check the official documentation for the full requirements.

Note: It's also possible to use other tools to do the same job, for example funcgraph and kprobe from perf-tools

Linux 5.15

Now we start using these tools to see what happens in the kernel 5.15 when we try to load our BPF program.

From this point to the end of this post we will use the probe binary instead of pulsar, because it is simpler. To recap how it work this is the command line help:

exein@pine64-1:~$ ./probe 
Test runner for eBPF programs

Usage: probe [OPTIONS] <COMMAND>

Commands:
  file-system-monitor  Watch file creations
  process-monitor      Watch process events (fork/exec/exit)
  network-monitor      Watch network events
  help                 Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose  
  -h, --help     Print help
  -V, --version  Print version

In the examples, we will try to load the file-system-monitor probe.

Running the following commands we can see the function graph calls of __sys_bpf function, the BPF syscall entrypoint:

trace-cmd record -p function_graph -g __sys_bpf ./probe file-system-monitor
trace-cmd report

The output is a giant function graph, too large to paste here. Because we have an error we are interested in the last functions before the program stopped. These are the last lines of the trace-cmd report output.

...
 tokio-runtime-w-1666  [003]  1318.058019: funcgraph_entry:                   |        bpf_trampoline_link_prog() {
 tokio-runtime-w-1666  [003]  1318.058020: funcgraph_entry:        2.292 us   |          bpf_attach_type_to_tramp();
 tokio-runtime-w-1666  [003]  1318.058024: funcgraph_entry:        1.250 us   |          mutex_lock();
 tokio-runtime-w-1666  [003]  1318.058028: funcgraph_entry:                   |          bpf_trampoline_update() {
 tokio-runtime-w-1666  [003]  1318.058030: funcgraph_entry:                   |            kmem_cache_alloc_trace() {
 tokio-runtime-w-1666  [003]  1318.058031: funcgraph_entry:        1.167 us   |              should_failslab();
 tokio-runtime-w-1666  [003]  1318.058036: funcgraph_exit:         6.792 us   |            }
 tokio-runtime-w-1666  [003]  1318.058039: funcgraph_entry:                   |            kmem_cache_alloc_trace() {
 tokio-runtime-w-1666  [003]  1318.058042: funcgraph_entry:        2.750 us   |              should_failslab();
 tokio-runtime-w-1666  [003]  1318.058046: funcgraph_exit:         6.417 us   |            }
 tokio-runtime-w-1666  [003]  1318.058048: funcgraph_entry:        2.708 us   |            bpf_jit_charge_modmem();
 tokio-runtime-w-1666  [003]  1318.058053: funcgraph_entry:                   |            bpf_jit_alloc_exec_page() {
 tokio-runtime-w-1666  [003]  1318.058055: funcgraph_entry:                   |              bpf_jit_alloc_exec() {
 tokio-runtime-w-1666  [003]  1318.058057: funcgraph_entry:                   |                vmalloc() {
 tokio-runtime-w-1666  [003]  1318.058059: funcgraph_entry:                   |                  __vmalloc_node() {
 tokio-runtime-w-1666  [003]  1318.058061: funcgraph_entry:                   |                    __vmalloc_node_range() {
 tokio-runtime-w-1666  [003]  1318.058064: funcgraph_entry:                   |                      __get_vm_area_node.constprop.64() {
 tokio-runtime-w-1666  [003]  1318.058067: funcgraph_entry:                   |                        kmem_cache_alloc_node_trace() {
 tokio-runtime-w-1666  [003]  1318.058069: funcgraph_entry:        1.459 us   |                          should_failslab();
 tokio-runtime-w-1666  [003]  1318.058073: funcgraph_exit:         6.292 us   |                        }
 tokio-runtime-w-1666  [003]  1318.058075: funcgraph_entry:                   |                        alloc_vmap_area() {
 tokio-runtime-w-1666  [003]  1318.058077: funcgraph_entry:                   |                          kmem_cache_alloc_node() {
 tokio-runtime-w-1666  [003]  1318.058079: funcgraph_entry:        1.167 us   |                            should_failslab();
 tokio-runtime-w-1666  [003]  1318.058085: funcgraph_exit:         7.625 us   |                          }
 tokio-runtime-w-1666  [003]  1318.058088: funcgraph_entry:                   |                          kmem_cache_alloc_node() {
 tokio-runtime-w-1666  [003]  1318.058089: funcgraph_entry:        1.208 us   |                            should_failslab();
 tokio-runtime-w-1666  [003]  1318.058092: funcgraph_exit:         4.584 us   |                          }
 tokio-runtime-w-1666  [003]  1318.058104: funcgraph_entry:                   |                          kmem_cache_free() {
 tokio-runtime-w-1666  [003]  1318.058107: funcgraph_entry:        2.084 us   |                            __slab_free();
 tokio-runtime-w-1666  [003]  1318.058110: funcgraph_exit:         5.667 us   |                          }
 tokio-runtime-w-1666  [003]  1318.058112: funcgraph_entry:        6.375 us   |                          insert_vmap_area.constprop.74();
 tokio-runtime-w-1666  [003]  1318.058119: funcgraph_exit:       + 44.667 us  |                        }
 tokio-runtime-w-1666  [003]  1318.058122: funcgraph_exit:       + 58.250 us  |                      }
 tokio-runtime-w-1666  [003]  1318.058124: funcgraph_entry:                   |                      __kmalloc_node() {
 tokio-runtime-w-1666  [003]  1318.058125: funcgraph_entry:        1.625 us   |                        kmalloc_slab();
 tokio-runtime-w-1666  [003]  1318.058128: funcgraph_entry:        1.167 us   |                        should_failslab();
 tokio-runtime-w-1666  [003]  1318.058131: funcgraph_exit:         7.208 us   |                      }
 tokio-runtime-w-1666  [003]  1318.058133: funcgraph_entry:                   |                      alloc_pages() {
 tokio-runtime-w-1666  [003]  1318.058135: funcgraph_entry:        1.583 us   |                        get_task_policy.part.48();
 tokio-runtime-w-1666  [003]  1318.058138: funcgraph_entry:        1.500 us   |                        policy_node();
 tokio-runtime-w-1666  [003]  1318.058141: funcgraph_entry:        1.209 us   |                        policy_nodemask();
 tokio-runtime-w-1666  [003]  1318.058143: funcgraph_entry:                   |                        __alloc_pages() {
 tokio-runtime-w-1666  [003]  1318.058145: funcgraph_entry:        1.458 us   |                          should_fail_alloc_page();
 tokio-runtime-w-1666  [003]  1318.058147: funcgraph_entry:                   |                          get_page_from_freelist() {
 tokio-runtime-w-1666  [003]  1318.058150: funcgraph_entry:        1.583 us   |                            prep_new_page();
 tokio-runtime-w-1666  [003]  1318.058153: funcgraph_exit:         5.459 us   |                          }
 tokio-runtime-w-1666  [003]  1318.058154: funcgraph_exit:       + 10.542 us  |                        }
 tokio-runtime-w-1666  [003]  1318.058155: funcgraph_exit:       + 22.083 us  |                      }
 tokio-runtime-w-1666  [003]  1318.058157: funcgraph_entry:                   |                      __cond_resched() {
 tokio-runtime-w-1666  [003]  1318.058158: funcgraph_entry:        1.833 us   |                        rcu_all_qs();
 tokio-runtime-w-1666  [003]  1318.058161: funcgraph_exit:         4.167 us   |                      }
 tokio-runtime-w-1666  [003]  1318.058166: funcgraph_entry:        5.542 us   |                      vmap_pages_range_noflush();
 tokio-runtime-w-1666  [003]  1318.058173: funcgraph_exit:       ! 112.375 us |                    }
 tokio-runtime-w-1666  [003]  1318.058175: funcgraph_exit:       ! 116.000 us |                  }
 tokio-runtime-w-1666  [003]  1318.058176: funcgraph_exit:       ! 119.292 us |                }
 tokio-runtime-w-1666  [003]  1318.058177: funcgraph_exit:       ! 122.542 us |              }
 tokio-runtime-w-1666  [003]  1318.058179: funcgraph_entry:                   |              find_vm_area() {
 tokio-runtime-w-1666  [003]  1318.058180: funcgraph_entry:        1.375 us   |                find_vmap_area();
 tokio-runtime-w-1666  [003]  1318.058183: funcgraph_exit:         4.333 us   |              }
 tokio-runtime-w-1666  [003]  1318.058185: funcgraph_entry:                   |              set_memory_x() {
 tokio-runtime-w-1666  [003]  1318.058186: funcgraph_entry:                   |                change_memory_common() {
 tokio-runtime-w-1666  [003]  1318.058188: funcgraph_entry:                   |                  find_vm_area() {
 tokio-runtime-w-1666  [003]  1318.058189: funcgraph_entry:        1.333 us   |                    find_vmap_area();
 tokio-runtime-w-1666  [003]  1318.058192: funcgraph_exit:         3.875 us   |                  }
 tokio-runtime-w-1666  [003]  1318.058193: funcgraph_entry:                   |                  vm_unmap_aliases() {
 tokio-runtime-w-1666  [003]  1318.058194: funcgraph_entry:                   |                    _vm_unmap_aliases.part.58() {
 tokio-runtime-w-1666  [003]  1318.058196: funcgraph_entry:        1.542 us   |                      rcu_read_unlock_strict();
 tokio-runtime-w-1666  [003]  1318.058199: funcgraph_entry:        1.208 us   |                      rcu_read_unlock_strict();
 tokio-runtime-w-1666  [003]  1318.058202: funcgraph_entry:        1.166 us   |                      rcu_read_unlock_strict();
 tokio-runtime-w-1666  [003]  1318.058205: funcgraph_entry:        1.208 us   |                      rcu_read_unlock_strict();
 tokio-runtime-w-1666  [003]  1318.058207: funcgraph_entry:        1.208 us   |                      mutex_lock();
 tokio-runtime-w-1666  [003]  1318.058210: funcgraph_entry:                   |                      purge_fragmented_blocks_allcpus() {
 tokio-runtime-w-1666  [003]  1318.058212: funcgraph_entry:        1.500 us   |                        rcu_read_unlock_strict();
 tokio-runtime-w-1666  [003]  1318.058214: funcgraph_entry:        1.500 us   |                        rcu_read_unlock_strict();
 tokio-runtime-w-1666  [003]  1318.058217: funcgraph_entry:        1.500 us   |                        rcu_read_unlock_strict();
 tokio-runtime-w-1666  [003]  1318.058220: funcgraph_entry:        1.167 us   |                        rcu_read_unlock_strict();
 tokio-runtime-w-1666  [003]  1318.058222: funcgraph_exit:       + 11.917 us  |                      }
 tokio-runtime-w-1666  [003]  1318.058224: funcgraph_entry:                   |                      __purge_vmap_area_lazy() {
 tokio-runtime-w-1666  [003]  1318.058232: funcgraph_entry:                   |                        kmem_cache_free() {
 tokio-runtime-w-1666  [003]  1318.058234: funcgraph_entry:        1.250 us   |                          __slab_free();
 tokio-runtime-w-1666  [003]  1318.058237: funcgraph_exit:         4.791 us   |                        }
 tokio-runtime-w-1666  [003]  1318.058241: funcgraph_entry:        1.209 us   |                        __cond_resched_lock();
 tokio-runtime-w-1666  [003]  1318.058244: funcgraph_exit:       + 19.625 us  |                      }
 tokio-runtime-w-1666  [003]  1318.058245: funcgraph_entry:        1.167 us   |                      mutex_unlock();
 tokio-runtime-w-1666  [003]  1318.058247: funcgraph_exit:       + 53.042 us  |                    }
 tokio-runtime-w-1666  [003]  1318.058248: funcgraph_exit:       + 55.625 us  |                  }
 tokio-runtime-w-1666  [003]  1318.058250: funcgraph_entry:                   |                  __change_memory_common() {
 tokio-runtime-w-1666  [003]  1318.058251: funcgraph_entry:                   |                    apply_to_page_range() {
 tokio-runtime-w-1666  [003]  1318.058253: funcgraph_entry:                   |                      __apply_to_page_range() {
 tokio-runtime-w-1666  [003]  1318.058255: funcgraph_entry:        1.250 us   |                        pud_huge();
 tokio-runtime-w-1666  [003]  1318.058258: funcgraph_entry:        1.166 us   |                        pmd_huge();
 tokio-runtime-w-1666  [003]  1318.058260: funcgraph_entry:        1.208 us   |                        change_page_range();
 tokio-runtime-w-1666  [003]  1318.058263: funcgraph_exit:         9.834 us   |                      }
 tokio-runtime-w-1666  [003]  1318.058264: funcgraph_exit:       + 12.709 us  |                    }
 tokio-runtime-w-1666  [003]  1318.058266: funcgraph_exit:       + 15.459 us  |                  }
 tokio-runtime-w-1666  [003]  1318.058268: funcgraph_exit:       + 80.791 us  |                }
 tokio-runtime-w-1666  [003]  1318.058270: funcgraph_exit:       + 84.834 us  |              }
 tokio-runtime-w-1666  [003]  1318.058272: funcgraph_exit:       ! 218.500 us |            }
 tokio-runtime-w-1666  [003]  1318.058274: funcgraph_entry:                   |            __alloc_percpu_gfp() {
 tokio-runtime-w-1666  [003]  1318.058276: funcgraph_entry:                   |              pcpu_alloc() {
 tokio-runtime-w-1666  [003]  1318.058281: funcgraph_entry:        2.250 us   |                mutex_lock_killable();
 tokio-runtime-w-1666  [003]  1318.058290: funcgraph_entry:                   |                pcpu_find_block_fit() {
 tokio-runtime-w-1666  [003]  1318.058293: funcgraph_entry:        2.833 us   |                  pcpu_next_fit_region.constprop.38();
 tokio-runtime-w-1666  [003]  1318.058299: funcgraph_exit:         9.084 us   |                }
 tokio-runtime-w-1666  [003]  1318.058301: funcgraph_entry:                   |                pcpu_alloc_area() {
 tokio-runtime-w-1666  [003]  1318.058315: funcgraph_entry:        4.000 us   |                  pcpu_block_update_hint_alloc();
 tokio-runtime-w-1666  [003]  1318.058320: funcgraph_entry:        2.208 us   |                  pcpu_chunk_relocate();
 tokio-runtime-w-1666  [003]  1318.058324: funcgraph_exit:       + 22.625 us  |                }
 tokio-runtime-w-1666  [003]  1318.058327: funcgraph_entry:        1.208 us   |                mutex_unlock();
 tokio-runtime-w-1666  [003]  1318.058332: funcgraph_entry:        1.584 us   |                pcpu_memcg_post_alloc_hook();
 tokio-runtime-w-1666  [003]  1318.058335: funcgraph_exit:       + 58.833 us  |              }
 tokio-runtime-w-1666  [003]  1318.058336: funcgraph_exit:       + 61.834 us  |            }
 tokio-runtime-w-1666  [003]  1318.058338: funcgraph_entry:                   |            kmem_cache_alloc_trace() {
 tokio-runtime-w-1666  [003]  1318.058339: funcgraph_entry:        1.167 us   |              should_failslab();
 tokio-runtime-w-1666  [003]  1318.058342: funcgraph_exit:         4.458 us   |            }
 tokio-runtime-w-1666  [003]  1318.058359: funcgraph_entry:                   |            bpf_image_ksym_add() {
 tokio-runtime-w-1666  [003]  1318.058360: funcgraph_entry:                   |              bpf_ksym_add() {
 tokio-runtime-w-1666  [003]  1318.058363: funcgraph_entry:        1.583 us   |                __local_bh_enable_ip();
 tokio-runtime-w-1666  [003]  1318.058366: funcgraph_exit:         5.750 us   |              }
 tokio-runtime-w-1666  [003]  1318.058369: funcgraph_exit:         9.834 us   |            }
 tokio-runtime-w-1666  [003]  1318.058371: funcgraph_entry:        1.250 us   |            arch_prepare_bpf_trampoline();
 tokio-runtime-w-1666  [003]  1318.058373: funcgraph_entry:        2.292 us   |            kfree();
 tokio-runtime-w-1666  [003]  1318.058377: funcgraph_exit:       ! 348.625 us |          }
 tokio-runtime-w-1666  [003]  1318.058379: funcgraph_entry:        1.250 us   |          mutex_unlock();
 tokio-runtime-w-1666  [003]  1318.058382: funcgraph_exit:       ! 363.167 us |        }
 tokio-runtime-w-1666  [003]  1318.058384: funcgraph_entry:                   |        bpf_link_cleanup() {
 tokio-runtime-w-1666  [003]  1318.058386: funcgraph_entry:                   |          bpf_link_free_id.part.30() {
 tokio-runtime-w-1666  [003]  1318.058392: funcgraph_entry:                   |            call_rcu() {
 tokio-runtime-w-1666  [003]  1318.058396: funcgraph_entry:        1.834 us   |              rcu_segcblist_enqueue();
 tokio-runtime-w-1666  [003]  1318.058401: funcgraph_exit:         9.333 us   |            }
 tokio-runtime-w-1666  [003]  1318.058403: funcgraph_entry:        1.542 us   |            __local_bh_enable_ip();
 tokio-runtime-w-1666  [003]  1318.058406: funcgraph_exit:       + 19.542 us  |          }
 tokio-runtime-w-1666  [003]  1318.058408: funcgraph_entry:                   |          fput() {
 tokio-runtime-w-1666  [003]  1318.058409: funcgraph_entry:                   |            fput_many() {
 tokio-runtime-w-1666  [003]  1318.058411: funcgraph_entry:                   |              task_work_add() {
 tokio-runtime-w-1666  [003]  1318.058414: funcgraph_entry:        1.625 us   |                kick_process();
 tokio-runtime-w-1666  [003]  1318.058418: funcgraph_exit:         6.750 us   |              }
 tokio-runtime-w-1666  [003]  1318.058419: funcgraph_exit:       + 10.333 us  |            }
 tokio-runtime-w-1666  [003]  1318.058420: funcgraph_exit:       + 12.708 us  |          }
 tokio-runtime-w-1666  [003]  1318.058422: funcgraph_entry:        2.250 us   |          put_unused_fd();
 tokio-runtime-w-1666  [003]  1318.058426: funcgraph_exit:       + 41.416 us  |        }
 tokio-runtime-w-1666  [003]  1318.058428: funcgraph_entry:        1.292 us   |        mutex_unlock();
 tokio-runtime-w-1666  [003]  1318.058430: funcgraph_entry:        1.250 us   |        kfree();
 tokio-runtime-w-1666  [003]  1318.058433: funcgraph_exit:       ! 567.458 us |      }
 tokio-runtime-w-1666  [003]  1318.058435: funcgraph_entry:        2.125 us   |      __bpf_prog_put.isra.47();
 tokio-runtime-w-1666  [003]  1318.058438: funcgraph_exit:       ! 602.291 us |    }
 tokio-runtime-w-1666  [003]  1318.058439: funcgraph_exit:       ! 631.791 us |  }

This is the corresponding source code in kernel/bpf/trampoline.c for the last function executed, bpf_trampoline_update:

static int bpf_trampoline_update(struct bpf_trampoline *tr)
{
	struct bpf_tramp_image *im;
	struct bpf_tramp_progs *tprogs;
	u32 flags = BPF_TRAMP_F_RESTORE_REGS;
	bool ip_arg = false;
	int err, total;

	tprogs = bpf_trampoline_get_progs(tr, &total, &ip_arg);
	if (IS_ERR(tprogs))
		return PTR_ERR(tprogs);

	if (total == 0) {
		err = unregister_fentry(tr, tr->cur_image->image);
		bpf_tramp_image_put(tr->cur_image);
		tr->cur_image = NULL;
		tr->selector = 0;
		goto out;
	}

	im = bpf_tramp_image_alloc(tr->key, tr->selector);
	if (IS_ERR(im)) {
		err = PTR_ERR(im);
		goto out;
	}

	if (tprogs[BPF_TRAMP_FEXIT].nr_progs ||
	    tprogs[BPF_TRAMP_MODIFY_RETURN].nr_progs)
		flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;

	if (ip_arg)
		flags |= BPF_TRAMP_F_IP_ARG;

	err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
					  &tr->func.model, flags, tprogs,
					  tr->func.addr);
	if (err < 0)
		goto out;

	WARN_ON(tr->cur_image && tr->selector == 0);
	WARN_ON(!tr->cur_image && tr->selector);
	if (tr->cur_image)
		/* progs already running at this address */
		err = modify_fentry(tr, tr->cur_image->image, im->image);
	else
		/* first time registering */
		err = register_fentry(tr, im->image);
	if (err)
		goto out;
	if (tr->cur_image)
		bpf_tramp_image_put(tr->cur_image);
	tr->cur_image = im;
	tr->selector++;
out:
	kfree(tprogs);
	return err;
}

Looking at the previous output we can see:

 tokio-runtime-w-1666  [003]  1318.058371: funcgraph_entry:        1.250 us   |            arch_prepare_bpf_trampoline();
 tokio-runtime-w-1666  [003]  1318.058373: funcgraph_entry:        2.292 us   |            kfree();

so there are no other function calls between arch_prepare_bpf_trampoline and kfree functions, so probably the first function returns an error code in the err variable. Let's verify it!

Launching bpftace in a shell in this way, we capture the return value of arch_prepare_bpf_trampoline function and print it to the console:

bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval link: %d\n", retval); }'

and starting probe in another terminal we get the the following output from bpftace:

root@pine64-1:/home/exein# bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval link: %d\n", retval); }'
Attaching 1 probe...
retval link: -524

This is because the kernel 5.15 lacks the implementation of arch_prepare_bpf_trampoline for aarch64 and uses the default implementation placeholder:

int __weak
arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end,
			    const struct btf_func_model *m, u32 flags,
			    struct bpf_tramp_links *tlinks,
			    void *orig_call)
{
	return -ENOTSUPP;
}

So this feature is unsupported for this kernel version. The good news is that is implemented in kernel 6.x thanks to this patch.

Let's move to the kernel 6.x.

Linux 6.1

If we try to run probe on the kernel 6.1 we get the following output:

root@pine64:/home/exein# ./probe file-system-monitor
thread 'main' panicked at 'initialization failed: ProgramAttachError { program: "lsm path_mknod", program_error: SyscallError { call: "bpf_raw_tracepoint_open", io_error: Os { code: 524, kind: Uncategorized, message: "No error information" } } }', src/bin/probe.rs:72:43
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

the same error of the 5.15 kernel !!! Let's find an explanation.

Running bpftrace on arch_prepare_bpf_trampoline, this time we get this output:

root@pine64:/home/exein# bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval tp link: %d\n", retval); }'
Attaching 1 probe...
retval tp link: 284

So the problem is not here, the function doesn't return an error anymore. Let's go back to the function graph.

This time we launch trace-cmd skipping some functions to have a cleaner output:

trace-cmd record \
    -p function_graph \
    -g bpf_trampoline_link_prog \
    -n bpf_jit_alloc_exec \
    -n kmalloc_trace \
    -n arch_prepare_bpf_trampoline \
    -n generic_handle_domain_irq \
    -n do_interrupt_handler \
    -n irq_exit_rcu \
    ./probe file-system-monitor

And we get this output from trace-cmd report:

root@pine64:/home/exein# trace-cmd report 
CPU 0 is empty
CPU 1 is empty
CPU 3 is empty
cpus=4
 tokio-runtime-w-11886 [002] 193385.056283: funcgraph_entry:                   |  bpf_trampoline_link_prog() {
 tokio-runtime-w-11886 [002] 193385.056321: funcgraph_entry:      + 15.042 us  |    mutex_lock();
 tokio-runtime-w-11886 [002] 193385.056373: funcgraph_entry:                   |    __bpf_trampoline_link_prog() {
 tokio-runtime-w-11886 [002] 193385.056395: funcgraph_entry:      + 14.833 us  |      bpf_attach_type_to_tramp();
 tokio-runtime-w-11886 [002] 193385.056428: funcgraph_entry:                   |      bpf_trampoline_update.isra.23() {
 tokio-runtime-w-11886 [002] 193385.056459: funcgraph_entry:        2.917 us   |        bpf_jit_charge_modmem();
 tokio-runtime-w-11886 [002] 193385.056531: funcgraph_entry:                   |        find_vm_area() {
 tokio-runtime-w-11886 [002] 193385.056540: funcgraph_entry:        3.000 us   |          find_vmap_area();
 tokio-runtime-w-11886 [002] 193385.056547: funcgraph_exit:       + 16.208 us  |        }
 tokio-runtime-w-11886 [002] 193385.056554: funcgraph_entry:                   |        __alloc_percpu_gfp() {
 tokio-runtime-w-11886 [002] 193385.056563: funcgraph_entry:                   |          pcpu_alloc() {
 tokio-runtime-w-11886 [002] 193385.056568: funcgraph_entry:        4.875 us   |            mutex_lock_killable();
 tokio-runtime-w-11886 [002] 193385.056591: funcgraph_entry:                   |            pcpu_find_block_fit() {
 tokio-runtime-w-11886 [002] 193385.056599: funcgraph_entry:        8.625 us   |              pcpu_next_fit_region.constprop.38();
 tokio-runtime-w-11886 [002] 193385.056608: funcgraph_exit:       + 17.166 us  |            }
 tokio-runtime-w-11886 [002] 193385.056610: funcgraph_entry:                   |            pcpu_alloc_area() {
 tokio-runtime-w-11886 [002] 193385.056639: funcgraph_entry:        9.167 us   |              pcpu_block_update();
 tokio-runtime-w-11886 [002] 193385.056656: funcgraph_entry:        7.667 us   |              pcpu_block_update_hint_alloc();
 tokio-runtime-w-11886 [002] 193385.056671: funcgraph_entry:        7.750 us   |              pcpu_chunk_relocate();
 tokio-runtime-w-11886 [002] 193385.056679: funcgraph_exit:       + 69.667 us  |            }
 tokio-runtime-w-11886 [002] 193385.056682: funcgraph_entry:        7.042 us   |            mutex_unlock();
 tokio-runtime-w-11886 [002] 193385.056703: funcgraph_entry:        2.792 us   |            pcpu_memcg_post_alloc_hook();
 tokio-runtime-w-11886 [002] 193385.056712: funcgraph_exit:       ! 148.709 us |          }
 tokio-runtime-w-11886 [002] 193385.056719: funcgraph_exit:       ! 165.250 us |        }
 tokio-runtime-w-11886 [002] 193385.056866: funcgraph_entry:                   |        bpf_image_ksym_add() {
 tokio-runtime-w-11886 [002] 193385.056873: funcgraph_entry:                   |          bpf_ksym_add() {
 tokio-runtime-w-11886 [002] 193385.056882: funcgraph_entry:        2.750 us   |            __local_bh_disable_ip();
 tokio-runtime-w-11886 [002] 193385.056897: funcgraph_entry:        4.625 us   |            __local_bh_enable_ip();
 tokio-runtime-w-11886 [002] 193385.056905: funcgraph_exit:       + 32.459 us  |          }
 tokio-runtime-w-11886 [002] 193385.056922: funcgraph_entry:        7.584 us   |          perf_event_ksymbol();
 tokio-runtime-w-11886 [002] 193385.056944: funcgraph_exit:       + 78.417 us  |        }
 tokio-runtime-w-11886 [002] 193385.057492: funcgraph_entry:                   |        set_memory_ro() {
 tokio-runtime-w-11886 [002] 193385.057501: funcgraph_entry:                   |          change_memory_common() {
 tokio-runtime-w-11886 [002] 193385.057504: funcgraph_entry:                   |            find_vm_area() {
 tokio-runtime-w-11886 [002] 193385.057506: funcgraph_entry:        8.875 us   |              find_vmap_area();
 tokio-runtime-w-11886 [002] 193385.057518: funcgraph_exit:       + 14.250 us  |            }
 tokio-runtime-w-11886 [002] 193385.057522: funcgraph_entry:                   |            __change_memory_common() {
 tokio-runtime-w-11886 [002] 193385.057531: funcgraph_entry:                   |              apply_to_page_range() {
 tokio-runtime-w-11886 [002] 193385.057538: funcgraph_entry:                   |                __apply_to_page_range() {
 tokio-runtime-w-11886 [002] 193385.057544: funcgraph_entry:      + 12.791 us  |                  pud_huge();
 tokio-runtime-w-11886 [002] 193385.057559: funcgraph_entry:        2.708 us   |                  pmd_huge();
 tokio-runtime-w-11886 [002] 193385.057574: funcgraph_entry:      + 15.125 us  |                  change_page_range();
 tokio-runtime-w-11886 [002] 193385.057591: funcgraph_exit:       + 53.792 us  |                }
 tokio-runtime-w-11886 [002] 193385.057597: funcgraph_exit:       + 66.083 us  |              }
 tokio-runtime-w-11886 [002] 193385.057610: funcgraph_exit:       + 88.125 us  |            }
 tokio-runtime-w-11886 [002] 193385.057619: funcgraph_entry:                   |            vm_unmap_aliases() {
 tokio-runtime-w-11886 [002] 193385.057622: funcgraph_entry:                   |              _vm_unmap_aliases.part.77() {
 tokio-runtime-w-11886 [002] 193385.057625: funcgraph_entry:        9.125 us   |                mutex_lock();
 tokio-runtime-w-11886 [002] 193385.057637: funcgraph_entry:        3.084 us   |                purge_fragmented_blocks_allcpus();
 tokio-runtime-w-11886 [002] 193385.057643: funcgraph_entry:                   |                __purge_vmap_area_lazy() {
 tokio-runtime-w-11886 [002] 193385.057687: funcgraph_entry:                   |                  kmem_cache_free() {
 tokio-runtime-w-11886 [002] 193385.057693: funcgraph_entry:      + 13.250 us  |                    __slab_free();
 tokio-runtime-w-11886 [002] 193385.057705: funcgraph_exit:       + 18.750 us  |                  }
 tokio-runtime-w-11886 [002] 193385.057718: funcgraph_entry:        7.416 us   |                  __cond_resched_lock();
 tokio-runtime-w-11886 [002] 193385.057733: funcgraph_exit:       + 90.042 us  |                }
 tokio-runtime-w-11886 [002] 193385.057741: funcgraph_entry:        2.792 us   |                mutex_unlock();
 tokio-runtime-w-11886 [002] 193385.057747: funcgraph_exit:       ! 124.666 us |              }
 tokio-runtime-w-11886 [002] 193385.057749: funcgraph_exit:       ! 130.291 us |            }
 tokio-runtime-w-11886 [002] 193385.057756: funcgraph_entry:                   |            __change_memory_common() {
 tokio-runtime-w-11886 [002] 193385.057759: funcgraph_entry:                   |              apply_to_page_range() {
 tokio-runtime-w-11886 [002] 193385.057765: funcgraph_entry:                   |                __apply_to_page_range() {
 tokio-runtime-w-11886 [002] 193385.057768: funcgraph_entry:        4.125 us   |                  pud_huge();
 tokio-runtime-w-11886 [002] 193385.057778: funcgraph_entry:        8.750 us   |                  pmd_huge();
 tokio-runtime-w-11886 [002] 193385.057790: funcgraph_entry:        4.625 us   |                  change_page_range();
 tokio-runtime-w-11886 [002] 193385.057797: funcgraph_exit:       + 31.958 us  |                }
 tokio-runtime-w-11886 [002] 193385.057803: funcgraph_exit:       + 44.375 us  |              }
 tokio-runtime-w-11886 [002] 193385.057817: funcgraph_exit:       + 61.208 us  |            }
 tokio-runtime-w-11886 [002] 193385.057820: funcgraph_exit:       ! 319.292 us |          }
 tokio-runtime-w-11886 [002] 193385.057826: funcgraph_exit:       ! 333.667 us |        }
 tokio-runtime-w-11886 [002] 193385.057840: funcgraph_entry:                   |        set_memory_x() {
 tokio-runtime-w-11886 [002] 193385.057847: funcgraph_entry:                   |          change_memory_common() {
 tokio-runtime-w-11886 [002] 193385.057855: funcgraph_entry:                   |            find_vm_area() {
 tokio-runtime-w-11886 [002] 193385.057858: funcgraph_entry:        2.917 us   |              find_vmap_area();
 tokio-runtime-w-11886 [002] 193385.057870: funcgraph_exit:       + 14.375 us  |            }
 tokio-runtime-w-11886 [002] 193385.057876: funcgraph_entry:                   |            vm_unmap_aliases() {
 tokio-runtime-w-11886 [002] 193385.057879: funcgraph_entry:                   |              _vm_unmap_aliases.part.77() {
 tokio-runtime-w-11886 [002] 193385.057882: funcgraph_entry:        3.959 us   |                mutex_lock();
 tokio-runtime-w-11886 [002] 193385.057893: funcgraph_entry:        3.000 us   |                purge_fragmented_blocks_allcpus();
 tokio-runtime-w-11886 [002] 193385.057900: funcgraph_entry:        2.791 us   |                __purge_vmap_area_lazy();
 tokio-runtime-w-11886 [002] 193385.057907: funcgraph_entry:        2.709 us   |                mutex_unlock();
 tokio-runtime-w-11886 [002] 193385.057913: funcgraph_exit:       + 33.708 us  |              }
 tokio-runtime-w-11886 [002] 193385.057915: funcgraph_exit:       + 43.000 us  |            }
 tokio-runtime-w-11886 [002] 193385.057922: funcgraph_entry:                   |            __change_memory_common() {
 tokio-runtime-w-11886 [002] 193385.057925: funcgraph_entry:                   |              apply_to_page_range() {
 tokio-runtime-w-11886 [002] 193385.057930: funcgraph_entry:                   |                __apply_to_page_range() {
 tokio-runtime-w-11886 [002] 193385.057933: funcgraph_entry:        4.292 us   |                  pud_huge();
 tokio-runtime-w-11886 [002] 193385.057945: funcgraph_entry:        8.750 us   |                  pmd_huge();
 tokio-runtime-w-11886 [002] 193385.057956: funcgraph_entry:        3.958 us   |                  change_page_range();
 tokio-runtime-w-11886 [002] 193385.058037: funcgraph_exit:       + 32.083 us  |                }
 tokio-runtime-w-11886 [002] 193385.058089: funcgraph_entry:        7.667 us   |                irq_enter_rcu();
 tokio-runtime-w-11886 [002] 193385.058233: funcgraph_exit:       ! 308.041 us |              }
 tokio-runtime-w-11886 [002] 193385.058239: funcgraph_exit:       ! 316.709 us |            }
 tokio-runtime-w-11886 [002] 193385.058247: funcgraph_exit:       ! 400.417 us |          }
 tokio-runtime-w-11886 [002] 193385.058255: funcgraph_exit:       ! 415.000 us |        }
 tokio-runtime-w-11886 [002] 193385.058555: funcgraph_entry:        8.250 us   |        irq_enter_rcu();
 tokio-runtime-w-11886 [002] 193385.058958: funcgraph_entry:                   |        kallsyms_lookup_size_offset() {
 tokio-runtime-w-11886 [002] 193385.058974: funcgraph_entry:      + 36.333 us  |          get_symbol_pos();
 tokio-runtime-w-11886 [002] 193385.059017: funcgraph_exit:       + 59.750 us  |        }
 tokio-runtime-w-11886 [002] 193385.059043: funcgraph_entry:                   |        kfree() {
 tokio-runtime-w-11886 [002] 193385.059057: funcgraph_entry:        3.000 us   |          __kmem_cache_free();
 tokio-runtime-w-11886 [002] 193385.059065: funcgraph_exit:       + 22.833 us  |        }
 tokio-runtime-w-11886 [002] 193385.059073: funcgraph_exit:       # 2644.708 us |      }
 tokio-runtime-w-11886 [002] 193385.059079: funcgraph_exit:       # 2706.292 us |    }
 tokio-runtime-w-11886 [002] 193385.059095: funcgraph_entry:        2.792 us   |    mutex_unlock();
 tokio-runtime-w-11886 [002] 193385.059101: funcgraph_exit:       # 2870.416 us |  }

So this time the program has passed arch_prepare_bpf_trampoline, set_memory_ro and set_memory_x and the last function we see is kallsyms_lookup_size_offset.

As we can see in bpf_trampoline_update function from kernel/bpf/trampoline.c, there is no explicit call to kallsyms_lookup_size_offset:

static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex)
{

// ... OTHER CODE ...

#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
again:
	if ((tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY) &&
	    (tr->flags & BPF_TRAMP_F_CALL_ORIG))
		tr->flags |= BPF_TRAMP_F_ORIG_STACK;
#endif

	err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
					  &tr->func.model, tr->flags, tlinks,
					  tr->func.addr);
	if (err < 0)
		goto out;

	set_memory_ro((long)im->image, 1);
	set_memory_x((long)im->image, 1);

	WARN_ON(tr->cur_image && tr->selector == 0);
	WARN_ON(!tr->cur_image && tr->selector);
	if (tr->cur_image)
		/* progs already running at this address */
		err = modify_fentry(tr, tr->cur_image->image, im->image, lock_direct_mutex);
	else
		/* first time registering */
		err = register_fentry(tr, im->image);

#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
	if (err == -EAGAIN) {
		/* -EAGAIN from bpf_tramp_ftrace_ops_func. Now
		 * BPF_TRAMP_F_SHARE_IPMODIFY is set, we can generate the
		 * trampoline again, and retry register.
		 */
		/* reset fops->func and fops->trampoline for re-register */
		tr->fops->func = NULL;
		tr->fops->trampoline = 0;

		/* reset im->image memory attr for arch_prepare_bpf_trampoline */
		set_memory_nx((long)im->image, 1);
		set_memory_rw((long)im->image, 1);
		goto again;
	}
#endif
	if (err)
		goto out;

	if (tr->cur_image)
		bpf_tramp_image_put(tr->cur_image);
	tr->cur_image = im;
	tr->selector++;
out:
	/* If any error happens, restore previous flags */
	if (err)
		tr->flags = orig_flags;
	kfree(tlinks);
	return err;
}
Note: Implementation of bpf_trampoline_update is a bit different from the one in the previous kernel 5.15

The kallsyms_lookup_size_offset call is hidden inside another function. We don't see it on the function graph because the compiler inlined it.

It seems that kallsyms_lookup_size_offset is called by ftrace_location:

unsigned long ftrace_location(unsigned long ip)
{
	struct dyn_ftrace *rec;
	unsigned long offset;
	unsigned long size;

	rec = lookup_rec(ip, ip);
	if (!rec) {
		if (!kallsyms_lookup_size_offset(ip, &size, &offset))
			goto out;

		/* map sym+0 to __fentry__ */
		if (!offset)
			rec = lookup_rec(ip, ip + size - 1);
	}

	if (rec)
		return rec->ip;

out:
	return 0;
}

ftrace_location is called by register_fentry that contains a check after the ftrace_location call on the field fops of struct bpf_trampoline *tr:

/* first time registering */
static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
{
	void *ip = tr->func.addr;
	unsigned long faddr;
	int ret;

	faddr = ftrace_location((unsigned long)ip);
	if (faddr) {
		if (!tr->fops)
			return -ENOTSUPP;
		tr->func.ftrace_managed = true;
	}

	if (bpf_trampoline_module_get(tr))
		return -ENOENT;

	if (tr->func.ftrace_managed) {
		ftrace_set_filter_ip(tr->fops, (unsigned long)ip, 0, 1);
		ret = register_ftrace_direct_multi(tr->fops, (long)new_addr);
	} else {
		ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);
	}

	if (ret)
		bpf_trampoline_module_put(tr);
	return ret;
}

Indeed, if tr->fops is false the function returns with error -ENOTSUPP.

Let's find where tr->fops is initialized.

If we are correct, the place where the trampoline is created is inside bpf_trampoline_lookup:

static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
{
	struct bpf_trampoline *tr;
	struct hlist_head *head;
	int i;

	mutex_lock(&trampoline_mutex);
	head = &trampoline_table[hash_64(key, TRAMPOLINE_HASH_BITS)];
	hlist_for_each_entry(tr, head, hlist) {
		if (tr->key == key) {
			refcount_inc(&tr->refcnt);
			goto out;
		}
	}
	tr = kzalloc(sizeof(*tr), GFP_KERNEL);
	if (!tr)
		goto out;
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
	tr->fops = kzalloc(sizeof(struct ftrace_ops), GFP_KERNEL);
	if (!tr->fops) {
		kfree(tr);
		tr = NULL;
		goto out;
	}
	tr->fops->private = tr;
	tr->fops->ops_func = bpf_tramp_ftrace_ops_func;
#endif

	tr->key = key;
	INIT_HLIST_NODE(&tr->hlist);
	hlist_add_head(&tr->hlist, head);
	refcount_set(&tr->refcnt, 1);
	mutex_init(&tr->mutex);
	for (i = 0; i < BPF_TRAMP_MAX; i++)
		INIT_HLIST_HEAD(&tr->progs_hlist[i]);
out:
	mutex_unlock(&trampoline_mutex);
	return tr;
}

After the allocation, the fops field of the trampoline is filled only if CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS flag is present. This flag depends on HAVE_CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS flag and this is not present on aarch64.

Conclusions

At this time it's impossible to use BPF LSM on aarch64 because of the missing ftrace direct call feature. Luckily, there is a patch already merged in the current mainline branch. This patch is going to enable LSMs (and other features) on aarch64.

We can expect this changes to be released in the next 6.4 Linux kernel.

Domenico Salvatore - Software Engineer