That's definitely a security professional's read :)
"Isn't supposed to be able to" is a lot longer and distracting vs the oversimplification-for-sake-of-understanding of "can't". As far as it being proven wrong though - that's already happened, eg CVE-2021-29154
That's fair. I understand that in the ideal world, it would be "can't." I guess my concern is that the wording kind of hand waves away any potential security issues, when people interested in this tech should absolutely be made aware of them.
Yes, especially with Spectre. I looked into using eBPF with seccomp (which currently only supports BPF) and the consensus seemed to be that they're not going to add any more eBPF to the kernel and have basically given up on making it secure.
I wouldn't be surprised if it becomes entirely root-only by default soon, if it isn't already.
I think they already don't. eBPF has become a favorite jumping point for exploit chains that can very quickly escalate some read/write gadget into full execution.
The BPF capability should really only be given to root. I don't think it really gives any new attack surface. All I could see is it giving black-hats an easier interface to "kernel-level-fuckery".
I'd recommend to anyone running Linux that they do exactly that - disable unprivileged eBPF. For a server you shouldn't need that, you can just drop privs for your service after setting up the filter.
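If you want to see where a given box stands before changing anything, here's a rough Python sketch that reads the kernel.unprivileged_bpf_disabled sysctl and says what the value means. The helper names (describe, current_setting) are made up for this example. The meanings are the ones I understand the kernel sysctl docs to give for 0, 1 and 2, so double-check against the docs for your kernel version.

```python
from pathlib import Path

# Assumed meanings of kernel.unprivileged_bpf_disabled (check your kernel docs).
MEANINGS = {
    0: "unprivileged bpf() allowed",
    1: "unprivileged bpf() disabled (locked until reboot)",
    2: "unprivileged bpf() disabled (an admin can change it back)",
}

SYSCTL_PATH = Path("/proc/sys/kernel/unprivileged_bpf_disabled")


def describe(raw: str) -> str:
    """Map the raw sysctl text to a human-readable description."""
    try:
        value = int(raw.strip())
    except ValueError:
        return "unrecognized value: " + repr(raw)
    return MEANINGS.get(value, f"unknown setting {value}")


def current_setting() -> str:
    """Read the live value, or report that the knob is absent."""
    try:
        return describe(SYSCTL_PATH.read_text())
    except OSError:
        return "sysctl not available on this system"


if __name__ == "__main__":
    print(current_setting())
```

This only reads the setting. Actually changing it is a root-only sysctl write and is deliberately left out here.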
They should probably put it behind its own capability like CAP_LOAD_EBPF.
That seems wrong. The build environment for eBPF programs is simpler (you don't even need a working kernel tree), and, much more importantly, eBPF programs are constrained and can't crash the kernel, which is easy to do accidentally when writing a freestyle C LKM. You can learn enough to get stuff done with eBPF inside of a couple days; the same is absolutely not true of kernel modules.
I find the verifier to be extremely frustrating. It's also very difficult to debug eBPF programs, especially when it comes to figuring out why one isn't passing the verifier. Combine that with ridiculous restrictions from BCC like not being able to use functions in macros, or the vagaries of the bpf target in clang, and it's extremely frustrating.
I've personally never found obtaining a working kernel tree to be difficult, certainly easier than a working BCC toolchain. Or all of the various compiler flags needed for clang not to emit code incompatible with the verifier.
I don't use BCC or really understand why anyone would; I use clang, and just produce .o's.
The verifier is definitely annoying, especially at first, but I found myself sort of quickly working out the verifier's expected idiom, and a lot of it can be wrapped with macros.
All of this drama pales in comparison to writing freestyle C code in the Linux kernel without causing random panics.
Easier to get started maybe, but there's something to be said for the 'ease' of having to worry a lot less about crashing the kernel you're working on.
The kernel, however, can and does have bugs. I've seen several thousand hosts get taken down with a perfectly correct eBPF program due to a buggy kernel.
Not so much (eBPF VM bugs are pretty rare, as you'd expect, since the VM is very simple) --- you're much more likely to run into bugs in the C-code helpers the kernel exports. If you're malicious, you can also hit verifier bugs that'll give your eBPF code raw pointers, but I don't think you're likely to stumble on them accidentally.
I said 'less' rather than 'at all' for a reason. I'm not really sure what to tell you if you disagree that it's a lot easier to accidentally crash your kernel with a from-scratch module than ebpf though. That's an experience pretty drastically at odds with mine.
Can you describe some of those bugs? What helpers were you using? I've been doing bonkers stuff with eBPF for the last year and, while I've definitely had bugs, none of them took my kernel down.
Those would be considered kernel bugs though, and should be getting fixed when reported, whereas a crash in a kernel module is just a bug in the kernel module itself.
Are you saying this is factually wrong? I assume eBPF is designed to only access certain points explicitly coded in the kernel, but the exploits can access any part of the kernel. Is this what you are hinting at? Could you please clarify the statement?
First: eBPF code is JIT'd in the kernel; at runtime, it is simply native code running at CPL0 alongside the rest of the kernel. Running eBPF code working with pointers is... working with raw pointers. There's no interpretation layer to bounds check or otherwise provide safety.
But eBPF code is meant to be safe: you can get a handle to some kernel structure that's passed to you from trusted code, but you can't bounce from it to a random offset in kernel memory. The way eBPF does this is by verifying the CFG of your eBPF program before it's translated to amd64 or arm. eBPF programs are generally just C programs (the simplest and best way to write an eBPF program is just to write a C program and compile it with the right LLVM flags), and verifying C programs is a hard problem; eBPF gets around this by only accepting a subset of all possible programs (those where memory accesses are simple enough to prove safe, that don't jump anywhere outside of a known narrow range of program text, and that don't have unbounded loops).
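To make the "only accept a subset" idea concrete, here's a toy checker in Python. It is not the kernel verifier or anything close to it: the instruction format and the accept function are invented for illustration, and it only covers the control-flow part (jumps stay inside the program text, no loops). It's closest in spirit to the classic BPF rule of forward-only jumps. The real eBPF verifier also tracks register state and memory accesses, which this ignores completely.

```python
from typing import List, Tuple

# Toy instruction: (opcode, jump_offset). Only "jmp" uses the offset, which is
# relative to the next instruction. Every other opcode just falls through.
Insn = Tuple[str, int]


def accept(prog: List[Insn]) -> bool:
    """Accept only programs that end in "exit" and whose jumps go forward
    and land inside the program. With forward-only jumps the program
    counter can only increase, so every path must reach an exit."""
    if not prog or prog[-1][0] != "exit":
        return False
    for pc, (op, off) in enumerate(prog):
        if op != "jmp":
            continue
        if off < 0:                      # backward jump: could loop forever
            return False
        if pc + 1 + off >= len(prog):    # target outside the program text
            return False
    return True


if __name__ == "__main__":
    ok = [("mov", 0), ("jmp", 1), ("mov", 0), ("exit", 0)]
    looping = [("mov", 0), ("jmp", -2), ("exit", 0)]
    print(accept(ok), accept(looping))
```

The point of the sketch is the trade-off: plenty of perfectly safe programs get rejected (anything with a loop, here), which is the price of being able to prove the accepted ones terminate.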
The tricky thing here is that the eBPF verifier is pretty complicated and lives only in the kernel. People have found bugs in it. If you find a good verifier bug, you can launder an untrusted pointer into your eBPF program (in the end, these bugs end up looking sort of like the browser Javascript RCEs that finagle a bad pointer out of some part of the browser API).
The biggest mitigating factor for these bugs is that Linux systems generally don't expose eBPF to any user other than root, so the upside to these kinds of bugs is limited (it gives you root->kernel, which is not nothing, but not the top of most people's priority list).
The other big issue is that eBPF is a huge source of in-kernel flexibility about runnable code. Modern exploit mitigations are in large part about making sure that instructions running at CPL0 are all known, so that if you manage to corrupt allocator metadata or write an arbitrary 8 byte value at an arbitrary 8 byte offset you can't easily turn that into remote code execution. But, of course, eBPF is an in-kernel JIT; it's there to run essentially random code inside the kernel. eBPF code is normally constrained, but if you have a kernel memory corruption bug, you can aim it at the eBPF subsystem and violate the kernel's assumptions.
One cool project that uses eBPF is Cilium. It allows restricting network traffic to / from containers in Kubernetes. Many of the problems it solves, in my opinion, are better solved via user-space solutions, e.g. service-to-service traffic is better controlled via signing / encryption, but overall Cilium is a pretty cool piece of technology.
It seems nice to have both layers. mTLS is great, but you're still exposing your TLS stack to the attacker. Dropping the packet altogether seems nicer.
In a perfect world, yes. What I’ve found in practice is that network policies add a mysterious failure point that makes debugging traffic issues hard, especially when providing a service platform to teams that don’t understand the inner workings. TLS failures tend to be easier to grok for most service devs.
BPF is indeed a pretty interesting technology. As the knowledge about it becomes more widespread, I anticipate that we will unlock some new capabilities in terms of tracing. Brendan Gregg's book (https://www.brendangregg.com/bpf-performance-tools-book.html) serves as a good intro to this, although you probably only need to read a small chunk of it as a lot of it is reference-book-style material.
The author mentioned that you can trace MySQL with USDT, which is a tracepoint inserted by the developer at select locations in the code. These kinds of tracepoints form a "stable interface" for tracing/performance debugging, whereas uprobes, which hook into select userspace functions, are unstable because they can break whenever the binary is recompiled. Unfortunately, the USDT tracepoints (via DTrace) were removed in MySQL 8.0. This makes it significantly more difficult to trace MySQL, although it's not impossible. I've done a proof of concept of tracing MySQL with uprobes instead of USDT in this repo[1], which can kind of give you the same results (and possibly more stuff, as I can more easily read arbitrary memory addresses due to how the old USDT tracepoints are structured). This is not stable tho, as any MySQL upgrade may introduce incompatibility with the trace script, since I read memory addresses based on offsets (whereas with USDT this can be kept pretty stable). My appeal to Oracle to re-add this functionality[2] has unfortunately been rejected, which I think is a mistake given the wide range of possibilities unlocked via BPF.
Another thing that I've been recently thinking of is using BPF to validate programs written for real-time Linux (via PREEMPT_RT). To my understanding, one of the main things to avoid is page faults [3]. With the proper BPF tracing scripts, I think we can validate that programs indeed avoid page faults in integration testing. I'm not sure if it is super useful yet, but as I'm trying to write a few RT programs, it's something that came to my mind.
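For a sense of what "validate that a section avoids page faults" could look like in a test, here's a small Python sketch. To be clear, this is not BPF: it swaps in getrusage(), which reports the process's own minor/major fault counters, as a user-space stand-in. A BPF-based check would watch the fault path from outside the process instead. The function names are invented for the example, and the counters are process-wide, so other threads can add noise.

```python
import resource


def fault_counts():
    """Return (minor, major) page-fault totals for this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def faults_during(fn):
    """Run fn() and return how many (minor, major) faults happened meanwhile."""
    min_before, maj_before = fault_counts()
    fn()
    min_after, maj_after = fault_counts()
    return min_after - min_before, maj_after - maj_before


def touch_fresh_memory():
    # Allocate a new 16 MiB buffer and write one byte per 4 KiB page. The
    # kernel normally has to map fresh pages for this, i.e. minor faults.
    buf = bytearray(16 * 1024 * 1024)
    for offset in range(0, len(buf), 4096):
        buf[offset] = 1


if __name__ == "__main__":
    minor, major = faults_during(touch_fresh_memory)
    print(f"minor faults: {minor}, major faults: {major}")
```

In an integration test you'd wrap the real-time section instead of touch_fresh_memory and fail the test if either delta is nonzero.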
>With the proper BPF tracing scripts, I think we can validate that programs indeed avoids page faults
Sorry, I'm a little confused why this would be necessary? Like, sure, it's a nice to have on a CI as a basic sanity check but if you just invoke mlockall you'll end up with everything wired down and you're good to go regardless?
Yeah you're right on that one. I probably over-extended the use case here as I'm still just learning about this and thus tried to apply it to everywhere. I guess one thing you can maybe do is to use CI to validate that mlockall indeed has been called, haha. That said it's probably overkill as other tools can probably do this too.
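On the "use CI to validate that mlockall has been called" idea, one cheap approximation is to look at the VmLck field in /proc/<pid>/status, which reports how much of the process's memory is currently locked. A rough Python sketch, with made-up helper names. Note the assumption: a nonzero VmLck only shows that some memory is locked, not that mlockall(MCL_CURRENT | MCL_FUTURE) specifically was called.

```python
import re
from pathlib import Path


def locked_kb(status_text: str) -> int:
    """Extract the VmLck value (in kB) from /proc/<pid>/status text.
    Returns 0 if the field is missing."""
    match = re.search(r"^VmLck:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def process_locked_kb(pid: str = "self") -> int:
    """Read VmLck for a live process (Linux only)."""
    return locked_kb(Path(f"/proc/{pid}/status").read_text())


if __name__ == "__main__":
    sample = "Name:\tdemo\nVmLck:\t     128 kB\nVmPin:\t       0 kB\n"
    print(locked_kb(sample))
```

A CI job could start the service, call process_locked_kb with its pid, and flag the build if the result is 0.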
While on that thought experiment, perhaps in general you can use BPF/USDT to help trace/debug RT programs? I'm thinking of being able to verify/visualize timing for better tracing. Or maybe there are already tools that exist that I don't really know how to use (like ftrace + trace compass, maybe).
mlockall() requires the application under check to call it itself, so that its memory pages stay locked in RAM.
Whereas eBPF allows instrumentation-free enforcement. Plus, app devs do not need to be aware of this fact. This facilitates separation of responsibility in code and organization.
I tried clicking on the link with these words: "The BSD Packet Filter: A New Architecture for User-level Packet Capture". The link appears to be an insecure website that my internet browser prevented me from visiting.
I must finally be becoming a security pessimist when I read those sentences and the first thing I think is: these statements will not age well.