Notes on BPF and eBPF (jvns.ca)
216 points by mlerner on Jan 2, 2022 | 43 comments



>eBPF programs can’t access arbitrary kernel memory. Instead the kernel provides functions to get at some restricted subset of things.

I must finally be becoming a security pessimist, because when I read those sentences the first thing I think is: these statements will not age well.


That's definitely a security professional's read :)

"Isn't supposed to be able to" is a lot longer and distracting vs the oversimplification-for-sake-of-understanding of "can't". As far as it being proven wrong though - that's already happened, eg CVE-2021-29154

https://blog.kernelcare.com/vulnerability/specially-crafted-...


That's fair. I understand that in the ideal world, it would be "can't." I guess my concern is that the wording kind of hand waves away any potential security issues, when people interested in this tech should absolutely be made aware of them.


“is not authorized to” ?


Yes, especially with Spectre. I looked into using eBPF with seccomp (which currently only supports BPF) and the consensus seemed to be that they're not going to add any more eBPF to the kernel and have basically given up on making it secure.

I wouldn't be surprised if it becomes entirely root-only by default soon, if it isn't already.


For others who are curious, https://lwn.net/Articles/857228/


I think they already don't. eBPF has become a favorite jumping point for exploit chains that can very quickly escalate some read/write gadget into full execution.


I would love to read an in depth breakdown of an exploit using eBPF as part of its execution chain. Do you happen to have an example/link?


This was the one I was thinking of: https://googleprojectzero.blogspot.com/2020/

Scroll down to "The ultimate ROP"


The BPF capability should really only be given to root. I don't think it really gives any new attack surface. All I could see is it giving black-hats an easier interface to "kernel-level-fuckery".


I'd recommend to anyone running Linux that they do exactly that - disable unprivileged eBPF. For a server you shouldn't need that, you can just drop privs for your service after setting up the filter.

They should probably put it behind its own capability like CAP_LOAD_EBPF.

Forcing signed ebpf will also help though.
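
For reference, the knob for disabling unprivileged eBPF is the kernel.unprivileged_bpf_disabled sysctl. A minimal sketch of checking it from Python, assuming a Linux host that exposes the value under /proc/sys; the helper names describe_unprivileged_bpf and read_setting are made up for illustration:

```python
from pathlib import Path

# Location of the sysctl on Linux; it may be absent on kernels built without bpf().
SYSCTL_PATH = Path("/proc/sys/kernel/unprivileged_bpf_disabled")

# Meanings as documented for kernel.unprivileged_bpf_disabled.
MEANINGS = {
    0: "unprivileged bpf() is allowed",
    1: "unprivileged bpf() is disabled until reboot",
    2: "unprivileged bpf() is disabled, but an admin can re-enable it",
}


def describe_unprivileged_bpf(raw: str) -> str:
    """Translate the raw sysctl text into a description (hypothetical helper)."""
    value = int(raw.strip())
    return MEANINGS.get(value, f"unknown value {value}")


def read_setting(path: Path = SYSCTL_PATH) -> str:
    """Read the live setting, or report that the sysctl is not available."""
    try:
        return describe_unprivileged_bpf(path.read_text())
    except OSError:
        return "sysctl not available on this system"


if __name__ == "__main__":
    print(read_setting())
```

Changing the value is done as root, for example with sysctl -w kernel.unprivileged_bpf_disabled=1.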


It’s not an easier interface. It’s much easier to write kernel modules than to mess around with eBPF.


That seems wrong. The build environment for eBPF programs is simpler (you don't even need a working kernel tree), and, much more importantly, eBPF programs are constrained and can't crash the kernel, which is easy to do accidentally when writing a freestyle C LKM. You can learn enough to get stuff done with eBPF inside of a couple days; the same is absolutely not true of kernel modules.


I find the verifier to be extremely frustrating. It's also very difficult to debug eBPF programs, especially when it comes to working out why a program isn't passing the verifier. Combine that with ridiculous restrictions from BCC like not being able to use functions in macros, or the vagaries of the bpf target in clang, and it's extremely frustrating.

I've personally never found obtaining a working kernel tree to be difficult, certainly easier than a working BCC toolchain. Or all of the various compiler flags needed for clang not to emit code incompatible with the verifier.


I don't use BCC or really understand why anyone would; I use clang, and just produce .o's.

The verifier is definitely annoying, especially at first, but I found myself sort of quickly working out the verifier's expected idiom, and a lot of it can be wrapped with macros.

All of this drama pales in comparison to writing freestyle C code in the Linux kernel without causing random panics.


Easier to get started maybe, but there's something to be said for the 'ease' of having to worry a lot less about crashing the kernel you're working on.


I've crashed multiple kernels with eBPF programs. I don't buy this argument at all.


I call bullshit! Ebpf programs cannot crash the kernel since they cannot be run if they contain bugs in the first place.


The kernel, however, can and does have bugs. I've seen several thousand hosts get taken down with a perfectly correct eBPF program due to a buggy kernel.


That assumes that the virtual machine has no bugs


Not so much (eBPF VM bugs are pretty rare, as you'd expect, since the VM is very simple) --- you're much more likely to run into bugs in the C-code helpers the kernel exports. If you're malicious, you can also hit verifier bugs that'll give your eBPF code raw pointers, but I don't think you're likely to stumble on them accidentally.


I said 'less' rather than 'at all' for a reason. I'm not really sure what to tell you if you disagree that it's a lot easier to accidentally crash your kernel with a from-scratch module than ebpf though. That's an experience pretty drastically at odds with mine.


Can you describe some of those bugs? What helpers were you using? I've been doing bonkers stuff with eBPF for the last year and, while I've definitely had bugs, none of them took my kernel down.


I no longer have access to the code, and the kernel has since been patched. Mostly doing some observability around new network connections.

We definitely hit this one at some point: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1763454

We also ran into a couple of similar but unrelated panic bugs on much newer kernels on non-Ubuntu distros.


Those would be considered kernel bugs though, and should be getting fixed when reported, whereas a crash in a kernel module is just a bug in the kernel module itself.


Could you share the ebpf program and the external dependencies that can reproduce the crash?

It'll be valuable to learn this, so that we might be able to proactively address them. Ebpf is the core of our product.


Are you saying this is factually wrong? I assume eBPF is designed to only access certain points explicitly coded in the kernel, but that exploits can access any part of the kernel. Is this what you are hinting at? Could you please clarify the statement?


There are at least two issues here.

First: eBPF code is JIT'd in the kernel; at runtime, it is simply native code running at CPL0 alongside the rest of the kernel. Running eBPF code working with pointers is... working with raw pointers. There's no interpretation layer to bounds check or otherwise provide safety.

But eBPF code is meant to be safe: you can get a handle to some kernel structure that's passed to you from trusted code, but you can't bounce from it to a random offset in kernel memory. The way eBPF does this is by verifying the CFG of your eBPF program before it's translated to amd64 or arm. eBPF programs are generally just C programs (the simplest and best way to write an eBPF program is just to write a C program and compile it with the right LLVM flags), and verifying C programs is a hard problem; eBPF gets around this by only accepting a subset of all possible programs (those where memory accesses are simple enough to prove safe, that don't jump anywhere outside of a known narrow range of program text, and that don't have unbounded loops).
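
To make the "only accept a subset of programs" idea concrete, here is a toy sketch in Python. It is not the kernel's verifier or its algorithm: the three-opcode instruction set and the toy_verify function are invented for illustration, and it only models two of the rules above (jumps must stay inside the program text, and no backward jumps, so no loops at all). The real verifier also tracks register types and memory bounds, and newer kernels can accept some bounded loops.

```python
from typing import List, Tuple

# Toy instruction: (opcode, argument). Only "jmp" uses the argument,
# as a relative offset counted from the following instruction.
Insn = Tuple[str, int]


def toy_verify(prog: List[Insn], max_insns: int = 4096) -> Tuple[bool, str]:
    """Accept only programs that provably terminate under the toy rules."""
    if not prog or len(prog) > max_insns:
        return False, "empty or too long"
    for pc, (op, arg) in enumerate(prog):
        if op == "jmp":
            target = pc + 1 + arg
            if not 0 <= target < len(prog):
                return False, f"insn {pc}: jump out of range"
            if target <= pc:
                return False, f"insn {pc}: back-edge (possible unbounded loop)"
        elif op not in ("nop", "exit"):
            return False, f"insn {pc}: unknown opcode {op!r}"
    if prog[-1][0] != "exit":
        return False, "last instruction must be exit"
    return True, "ok"


if __name__ == "__main__":
    forward_only = [("nop", 0), ("jmp", 1), ("nop", 0), ("exit", 0)]
    looping = [("nop", 0), ("jmp", -2), ("exit", 0)]
    wild_jump = [("jmp", 10), ("exit", 0)]
    for name, prog in [("forward_only", forward_only),
                       ("looping", looping),
                       ("wild_jump", wild_jump)]:
        print(name, toy_verify(prog))
```

Because every accepted jump moves strictly forward and the last instruction is an exit, any accepted toy program terminates. Programs that would in fact be harmless can still be rejected, which is the same trade-off described above.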

The tricky thing here is that the eBPF verifier is pretty complicated and lives only in the kernel. People have found bugs in it. If you find a good verifier bug, you can launder an untrusted pointer into your eBPF program (in the end, these bugs end up looking sort of like the browser Javascript RCEs that finagle a bad pointer out of some part of the browser API).

The biggest mitigating factor for these bugs is that Linux systems generally don't expose eBPF to any user other than root, so the upside to these kinds of bugs is limited (it gives you root->kernel, which is not nothing, but not the top of most people's priority list).

The other big issue is that eBPF is a huge source of in-kernel flexibility about runnable code. Modern exploit mitigations are in large part about making sure that instructions running at CPL0 are all known, so that if you manage to corrupt allocator metadata or write an arbitrary 8 byte value at an arbitrary 8 byte offset you can't easily turn that into remote code execution. But, of course, eBPF is an in-kernel JIT; it's there to run essentially random code inside the kernel. eBPF code is normally constrained, but if you have a kernel memory corruption bug, you can aim it at the eBPF subsystem and violate the kernel's assumptions.


Yes, "can't" should be replaced by "shouldn't".

If there's a physical possibility, it's just a matter of time before someone finds a way, as was proved by the CPU cache bugs leaking information.


"shouldn't" is ambiguous as it might suggest the responsibility is with the user of eBPF rather than internal to it.

Perhaps "is designed not to" but that's a mouthful.

I think we should accept "can't" and yet know the limitations of certainty.


Lots of good eBPF info from eBPF Summit: https://ebpf.io/summit-2021/ and https://ebpf.io/summit-2020/

Also videos from eBPF Day KubeCon 2021: https://www.youtube.com/playlist?list=PLj6h78yzYM2Pm5nF_GmNQ...


One cool project that uses eBPF is Cilium. It allows restricting network traffic to / from containers in Kubernetes. Many of the problems it solves, in my opinion, are better solved via user-space solutions, e.g. service-to-service traffic is better controlled via signing / encryption, but overall Cilium is a pretty cool piece of technology.


It seems nice to have both layers. mTLS is great, but you're still exposing your TLS stack to the attacker. Dropping the packet altogether seems nicer.


In a perfect world, yes. What I’ve found in practice is that network policies add a mysterious failure point that makes debugging traffic issues hard, especially when providing a service platform to teams that don’t understand the inner workings. TLS failures tend to be easier to grok for most service devs.


Many of the links under "things you can attach eBPF programs to" are broken, unfortunately.


BPF is indeed a pretty interesting technology. As the knowledge about it becomes more widespread, I anticipate that we will unlock some new capabilities in terms of tracing. Brendan Gregg's book (https://www.brendangregg.com/bpf-performance-tools-book.html) serves as a good intro to this, although you probably only need to read a small chunk of it as a lot of it is reference-book-style material.

The author mentioned that you can trace MySQL with USDT, which is a tracepoint inserted by the developer at select locations in the code. These kinds of tracepoints form a "stable interface" for tracing/performance debugging, whereas uprobes, which hook into select userspace functions, are unstable as the binary is recompiled. Unfortunately, the USDT tracepoints (via DTrace) have been removed in MySQL 8.0. This makes it significantly more difficult to trace MySQL, although it's not impossible. I've done a proof of concept of tracing MySQL with uprobe instead of USDT in this repo[1], which can kind of give you the same results (and possibly more stuff, as I can more easily read arbitrary memory addresses due to how the old USDT tracepoints are structured). This is not stable though, as any MySQL upgrade may introduce incompatibility with the trace script, as I read memory addresses based on offsets (whereas with USDT this can be kept pretty stable). My appeal to Oracle to re-add this functionality[2] has unfortunately been rejected, which I think is a mistake given the wide range of possibilities unlocked via BPF.

[1]: https://github.com/shuhaowu/mysqld-bpf

[2]: https://bugs.mysql.com/bug.php?id=105741

Another thing that I've been recently thinking of is using BPF to validate programs written for real-time Linux (via PREEMPT_RT). To my understanding, one of the main things to avoid is page faults [3]. With the proper BPF tracing scripts, I think we can validate that programs indeed avoid page faults in integration testing. I'm not sure if it is super useful yet, but as I'm trying to write a few RT programs, it's something that came to my mind.

[3]: https://lwn.net/Articles/837019/
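
As a rough illustration of the "validate that a section takes no page faults" idea, here is a sketch that swaps in a different technique from BPF tracing: it compares the process-wide fault counters from getrusage(2) before and after a critical section. It assumes a Unix host; count_page_faults and touch_fresh_memory are hypothetical helper names, and a real RT test would more likely want per-thread tracing.

```python
import resource
from typing import Callable, Tuple


def count_page_faults(section: Callable[[], object]) -> Tuple[int, int]:
    """Run section() and return the (minor, major) faults this process took meanwhile."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    section()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_minflt - before.ru_minflt,
            after.ru_majflt - before.ru_majflt)


def touch_fresh_memory() -> int:
    # Allocating and writing new pages typically triggers minor faults.
    buf = bytearray(16 * 1024 * 1024)
    for i in range(0, len(buf), 4096):
        buf[i] = 1
    return len(buf)


if __name__ == "__main__":
    minor, major = count_page_faults(touch_fresh_memory)
    print(f"minor faults: {minor}, major faults: {major}")
```

An integration test could assert that both counts are zero around the real-time loop, with the caveat that the counters cover the whole process rather than a single thread.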

In addition to tracing (so bpftrace-based/bcc-based tools), I've recently discovered that there are:

1. ebpfsnitch (https://github.com/harporoeder/ebpfsnitch): which is an application-level firewall without kernel modules.

2. ebpf-traffic-monitor (https://source.android.com/devices/tech/datausage/ebpf-traff...): which appears to be using BPF to account for traffic for different apps on Android.

3. kubectl trace (https://github.com/iovisor/kubectl-trace): Run tracing on k8s.

There are apparently also use cases in the context of security, but I'm not familiar with it.


>With the proper BPF tracing scripts, I think we can validate that programs indeed avoids page faults

Sorry, I'm a little confused why this would be necessary? Like, sure, it's a nice to have on a CI as a basic sanity check but if you just invoke mlockall you'll end up with everything wired down and you're good to go regardless?


Yeah you're right on that one. I probably over-extended the use case here as I'm still just learning about this and thus tried to apply it everywhere. I guess one thing you can maybe do is to use CI to validate that mlockall indeed has been called, haha. That said it's probably overkill as other tools can probably do this too.

While on that thought experiment, perhaps in general you can use BPF/USDT to help trace/debug RT programs? I'm thinking of being able to verify/visualize timing for better tracing. Or maybe there are already tools that exist that I don't really know how to use (like ftrace + trace compass, maybe).


mlockall() has to be called by the application under check itself, so that its memory pages stay locked in RAM.

Whereas eBPF allows instrumentation-free enforcement. Plus, app devs do not need to be aware of this fact. This facilitates separation of responsibility in code and organization.


Is the PDF link in the blog broken?


It's available at: https://files.speakerdeck.com/presentations/130bc7df16db4556... (or click the download button on the slides page)


>things you can attach eBPF programs to

>...

>seccomp / landlock security things

Landlock does not use *BPF.

Seccomp can only use BPF at this point, not eBPF (though there has been some work on it).
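
To show what "BPF, not eBPF" means in practice for seccomp: a filter is an array of classic BPF instructions (struct sock_filter), each 8 bytes. The sketch below only builds the bytes of a trivial allow-everything filter in Python; it does not install it, which would require the prctl or seccomp syscall. The constants are copied from the Linux UAPI headers, and the insn helper is a name invented for illustration.

```python
import struct

# Classic BPF instruction layout (struct sock_filter):
# u16 code, u8 jt, u8 jf, u32 k -> 8 bytes, native byte order, no padding.
SOCK_FILTER = struct.Struct("=HBBI")

# Opcode pieces from <linux/bpf_common.h>.
BPF_LD, BPF_W, BPF_ABS = 0x00, 0x00, 0x20
BPF_RET, BPF_K = 0x06, 0x00
SECCOMP_RET_ALLOW = 0x7FFF0000  # from <linux/seccomp.h>


def insn(code: int, jt: int, jf: int, k: int) -> bytes:
    """Pack one classic BPF instruction (hypothetical helper)."""
    return SOCK_FILTER.pack(code, jt, jf, k)


# Load the syscall number (offset 0 of struct seccomp_data), then allow everything.
allow_all = b"".join([
    insn(BPF_LD | BPF_W | BPF_ABS, 0, 0, 0),
    insn(BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ALLOW),
])

if __name__ == "__main__":
    print(len(allow_all), allow_all.hex())
```

Classic BPF has just two registers (an accumulator and an index register) plus a small scratch memory, which is a large part of why it is considered easier to reason about than eBPF.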


I tried clicking on the link with these words: "The BSD Packet Filter: A New Architecture for User-level Packet Capture". The link appears to point to an insecure website that my internet browser prevented me from visiting.



