One year of testing eBPF programs

Let's start with an anecdote. When I started working on Pulsar at Exein in November 2021, I came across an unusual bug in one of our network monitoring modules. The DNS interceptor was working for most programs, but when I ran dig exein.io the reply didn't show up. After further research, I found out that, for some strange reason, the bug manifested only on the very last DNS response a program received. If a program made 4 DNS requests, only the first 3 would be intercepted.

It turns out our eBPF program was reading the UDP message buffer on kprobe:udp_recvmsg, but that memory is filled by the kernel inside udp_recvmsg, after our code has already executed. We were essentially reading garbage, and it accidentally worked most of the time because each subsequent call would read the previous message. The fix was simple: all we had to do was run our code on the return kprobe, which fires once the kernel function has completed and the buffer is filled.
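
Pulsar's eBPF programs are written in C, but the shape of the fix can be sketched in Rust with the aya-bpf crate; this is an illustrative skeleton, and the program names (and the choice of aya-bpf) are assumptions, not Pulsar's actual code:

#![no_std]
#![no_main]

use aya_bpf::{
    macros::{kprobe, kretprobe},
    programs::ProbeContext,
};

// Entry probe, attached to udp_recvmsg from userspace: it runs *before*
// the function body, so the message buffer is not filled yet.
#[kprobe(name = "udp_recvmsg_enter")]
pub fn udp_recvmsg_enter(_ctx: ProbeContext) -> u32 {
    // Too early: the buffer still holds the previous message (or garbage).
    0
}

// Return probe: it runs *after* udp_recvmsg completes, when the kernel
// has filled the buffer, so this is the right place to read it.
#[kretprobe(name = "udp_recvmsg_exit")]
pub fn udp_recvmsg_exit(_ctx: ProbeContext) -> u32 {
    // Safe to inspect the now-populated message here.
    0
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    unsafe { core::hint::unreachable_unchecked() }
}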

Finding the issue probably took me more hours than I'll ever admit. A test suite would have prevented this problem.

This is the first article of a two-part mini-series on testing the Pulsar eBPF code.

The Pulsar test-suite

Fast forward to 2023 and many commits later: Pulsar now has a comprehensive suite of integration tests that checks every eBPF program we use as part of our tracing solution.

Every test loads one of our eBPF programs, executes some code to trigger it, and asserts that an event with the correct fields was generated. The following example tests a hypothetical probe for file creation events.

#[cfg(feature = "test-suite")]
pub mod test_suite {
    use bpf_common::{
        event_check,
        test_runner::{TestCase, TestRunner, TestSuite},
    };

    pub fn tests() -> TestSuite {
        TestSuite {
            name: "file-system-monitor",
            tests: vec![file_name()],
        }
    }

    fn file_name() -> TestCase {
        TestCase::new("file_name", async {
            let path = std::env::temp_dir().join("file_name_1");
            // `program` loads this module's eBPF program; FsEvent is the
            // module's event type. Both come from the enclosing module.
            TestRunner::with_ebpf(program)
                .run(|| { std::fs::File::create(&path).unwrap(); })
                .await
                .expect_event(event_check!(
                    FsEvent::FileCreated,
                    (filename, path.to_str().unwrap().into(), "filename")
                ))
                .report()
        })
    }
}

First we have the tests() function, which returns the collection of all the test cases in the module. We need this manual bookkeeping instead of the usual #[test] macro because we've switched to libtest-mimic for our test harness. This allows us to build a separate binary, making it easier to run the test-suite with sudo, which is necessary since loading eBPF programs requires administrative privileges. We considered cutting this boilerplate by doing something smart with macros and dtolnay/inventory, but we decided to prefer transparency.
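
To give an idea of the harness side, a libtest-mimic entry point for suites like these could look roughly like this; the TestSuite and TestCase definitions below are simplified stand-ins, not Pulsar's actual types:

use libtest_mimic::{Arguments, Failed, Trial};

// Simplified stand-ins for the real TestSuite/TestCase types.
struct TestCase {
    name: &'static str,
    test: fn() -> Result<(), Failed>,
}
struct TestSuite {
    name: &'static str,
    tests: Vec<TestCase>,
}

fn main() {
    let args = Arguments::from_args();
    // Gather every module's suite, e.g. file_system_monitor::tests().
    let suites: Vec<TestSuite> = vec![];
    // Flatten the suites into the flat list of trials libtest-mimic runs.
    let trials: Vec<Trial> = suites
        .into_iter()
        .flat_map(|suite| {
            let suite_name = suite.name;
            suite.tests.into_iter().map(move |case| {
                Trial::test(format!("{}::{}", suite_name, case.name), case.test)
            })
        })
        .collect();
    // The result is a standalone binary we can run directly with sudo.
    libtest_mimic::run(&args, trials).exit();
}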

Next we have the file_name() function. It returns a TestCase, which holds the name used in the output and the async code that runs the actual test. This leads to one of my favourite hobbies: writing abstractions for test cases. TestRunner contains all the code shared by our domain-specific logic. It is constructed with an eBPF program, then it is given a closure to trigger it. Awaiting the future returned by the run method starts the actual test process, sketched in code right after this list:

  1. the eBPF program is loaded into the kernel and started
  2. we spawn a task for reading all generated events and sending them over a channel
  3. we run the trigger code
  4. we collect all generated events into a vector and stop the eBPF program
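
Condensed into code, those four steps could look something like the following simplified sketch; the real TestRunner wraps Pulsar's own program-loading types, and every name below is an assumption:

use tokio::sync::mpsc;

// Hypothetical condensation of the four steps; the real TestRunner
// wraps Pulsar's own program-loading types.
async fn run_test<E, Load, Trigger>(load_and_start: Load, trigger: Trigger) -> Vec<E>
where
    E: Send + 'static,
    Load: FnOnce(mpsc::UnboundedSender<E>),
    Trigger: FnOnce(),
{
    // 1 + 2: load the eBPF program and have it forward every generated
    // event over a channel.
    let (tx, mut rx) = mpsc::unbounded_channel();
    load_and_start(tx);
    // 3: run the trigger code while the probe is attached.
    trigger();
    // 4: wait briefly for in-flight events, then collect them; dropping
    // the receiver and program handles stops everything afterwards.
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
    let mut events = Vec::new();
    while let Ok(event) = rx.try_recv() {
        events.push(event);
    }
    events
}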

Finally, we call TestResult::expect_event and generate a report for pretty-printing the result. The event_check! macro is a little obscure: it generates a list of predicates that are later matched against every generated event. In the example, we're looking for events of the enum variant FsEvent::FileCreated where the field filename matches the value of path. This is one of the cases where we chose to be smart. The rationale is that we'll have a lot of tests, and making the code shorter and easier to write is worth the burden of a slightly complicated macro.
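
Conceptually, the invocation above boils down to something like the following; this is an illustrative expansion with made-up names, not the macro's actual output:

// Hypothetical event type, for illustration only.
enum FsEvent {
    FileCreated { filename: String },
}

// One named predicate per (field, expected value, label) triple; the
// enum variant itself is checked as part of the pattern.
struct Predicate {
    label: &'static str,
    check: Box<dyn Fn(&FsEvent) -> bool>,
}

fn file_created_checks(expected: String) -> Vec<Predicate> {
    vec![Predicate {
        label: "filename",
        check: Box::new(move |event| {
            matches!(event, FsEvent::FileCreated { filename } if *filename == expected)
        }),
    }]
}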

If you're curious about what a test failure looks like, you can see this issue.

Benefits of test-driven eBPF development

First, doing test-driven development for eBPF code is not only the right thing to do, it's also super convenient. On my machine, an ever-present terminal running cargo watch -cs "cargo xtask test <my_new_feature>" supervises my eBPF C code development.

Second, the Linux kernel doesn't keep backward compatibility for internal APIs. While tracepoints and LSM hook points are supposed to remain stable, kprobes and data structures may change in future releases. BTF and CO-RE provide the tools to handle this fundamental issue, but we still need to test Pulsar on every kernel release. Without a test-suite, checking so many kernel versions and configurations would be a colossal task. The next chapter of this mini-series will take a deeper look at this.

Third, and somewhat related to the previous point, a test-suite allows us to quickly validate Pulsar's compatibility with our customers' embedded boards.

Finally, a test-suite is fundamental for catching regressions when developing new code, but you probably already know this.

Conclusions

Having a good test-suite is one of the most important things for any project, even more so for an eBPF-based security solution. The upfront time investment has been more than repaid by the debugging hours saved.

In the next and final chapter, we'll take a look at how we're using Buildroot to build a collection of QEMU virtual machines: the perfect test environment. In the meantime, go give a star to Pulsar if you're interested in the project.

Matteo Nardi - Rust System Developer