Hacker News new | past | comments | ask | show | jobs | submit login
DWARF-Based Stack Walking Using eBPF (polarsignals.com)
140 points by brancz on Nov 29, 2022 | hide | past | favorite | 64 comments



Are the authors here? Thanks for this! I'm always thrilled to see advances in profiling tools.

I'm curious what they have to say about complexity/necessity of interpreting all of DWARF. cargo-trace (an neat and conceptually similar but abandoned project, I think) [1] says:

> It can be empirically determined that almost all dwarf programs consist of a single instruction and use only three different instructions. rip+offset, rsp+offset or *cfa+offset, where cfa is the rsp value of the previous frame. The result of the unwinding is an array of instruction pointers.

Do you find this to be true? Is more complex interpreting of DWARF necessary?

And in the lkml thread linked from the article, Linus is extremely pessimistic about DWARF unwinding, [2] I'm sure not without justification. He's talking about kernel stacks, and I think the trade-off is different when you're trying to profile existing userspace applications and libraries compiled and implemented however, but nonetheless I'm curious to hear the authors say how applicable they think his points are.

The authors' pointer to .ctf_frame is also a very interesting development in this space. Not sure how long that will take to be widely usable (assuming it goes through at all).

[1] https://github.com/dvc94ch/cargo-trace

[2] https://lkml.org/lkml/2012/2/10/356


Author here!

This project looks interesting and as you mention it seems conceptually similar! I would agree with that quote, generally speaking, but I think it really depends on a binary to binary basis.

Some binaries outsource most of the work to dynamic libraries. Unfortunately, DWARF expressions are typically emitted for these program counter ranges, so it's desirable to at least implement a subset of expressions [0].

Even if that's not the case, we want to produce profiles that are as accurate as possible :)

You are totally right, they were discussing kernel stacks where stakes are higher as it needs to work perfectly, otherwise kernel live patching would not reliably work, among others. The kernel has now some unwind table format, that can be used in x86_64, called ORC [1].

That being said the parser for DWARF would still have to live in the kernel, and I am not sure if kernel devs would like to accept such a patch.

Ideally we would transition to an unwind-specific format for user-space (something like ours, for example) and perhaps have a suitable unwinder in the kernel, rather than having to implement it in a BPF program. This is something we are considering for the future, but it's not free from problems (increased executable size, redundant unwind information, etc). But this is exactly why we wanted to have a conversation with the communities interested in this work!

[0]: https://github.com/parca-dev/parca-agent/pull/1058/commits

[1]: https://lwn.net/Articles/728339/


> Ideally we would transition to an unwind-specific format for user-space (something like ours, for example) and perhaps have a suitable unwinder in the kernel, rather than having to implement it in a BPF program. This is something we are considering for the future, but it's not free from problems (increased executable size, redundant unwind information, etc). But this is exactly why we wanted to have a conversation with the communities interested in this work!

I'm just an interested user rather than someone who has an ownership stake in any major distribution, language, library, or application, but fwiw:

I'd love for every binary/library to embed enough information to cheaply unwind during profiling [edit: and symbolize shortly afterward], in one simple format, so that this sort of profiling tool Just Works, as well as backtraces generated by the program itself (e.g. Rust's panics). I'd prefer it support inlining well (so ideally not just turning on frame pointers everywhere but that'd still be a huge improvement over the status quo). As you know, this is all a complete mess now, and I hate that. I think a reasonable amount of space overhead for this is fine. I want to be able to profile pre-packaged applications and my own Rust applications that use some C libraries without hassle. And have profiles written in a format that I can ship to another machine without trouble reading the symbols there or concern about excessive sensitive information. All stuff you've touched on. I would consider it a genuine miracle if the parties involved could all get behind this.

Full DWARF information, e.g. Rust's debug = 2 [1] (to be clear: my understanding is this is more or less to allow debuggers to print variable contents reasonably well?) is another matter. Seems like that adds a huge amount of bloat and is more rarely used. I like split debuginfo. As long as there's an easy way to get it on demand via a debuginfod service and/or installing an additional package, it's fine.

[1] https://doc.rust-lang.org/cargo/reference/profiles.html


> I'd love for every binary/library to embed enough information to cheaply unwind during profiling

Definitely! We hope to be able to change this. There's the .ctf_frame format [0] that could be used too. While its design goals are very much aligned to ours, implementing a reader for this format in BPF will be complicated.

> and symbolize shortly afterward

Definitely, I think that C Type Format (CTF) [1] could be a candidate here. Another possibility would be to get BTF (BPF Type Format) [2] working in userspace. While it wouldn't cover everything that DWARF deals with, it might be a reasonable tradeoff for symbolization.

FWIW both in Polar Signals and Parca (our OSS offering) we resolve to async symbolization done in the server, as it's very expensive to do and we want to reduce load where the Agent runs.

[0]: https://gcc.gnu.org/wiki/cauldron2022talks

[1]: https://lwn.net/Articles/795384/#:~:text=The%20Compact%20C%2....

[2]: https://docs.kernel.org/bpf/btf.html


I wish you success!

AFAICT from some searching this morning, CTF is implemented in the GNU toolchain (compiler -> debugger) but not at all in LLVM? so seems like it'd be a while before this really happens even if everyone gets behind that format.

> FWIW both in Polar Signals and Parca (our OSS offering) we resolve to async symbolization done in the server, as it's very expensive to do and we want to reduce load where the Agent runs.

Makes total sense for your cluster-wide always-on profiling. I've worked some with GWP [1] and Google Cloud Profiler, and they similarly defer/offload the symbolization to a dedicated service. But in a small-scale setting for e.g. a Rust std::backtrace::Backtrace with an Error or panic, I generally want to print a nice symbolized version in-process shortly after capture (or never). For small-scale on-demand profiling, similarly I likely want to symbolize soon after, although probably out-of-process.

[1] https://research.google/pubs/pub36575/


No idea about LLVM's support for it, but I know Apple's XNU kernel, built with LLVM, uses it (check for the __CTF,__ctf section). I think it uses ctf_insert(1), so it probably doesn't need the toolchain itself to support it.


Thanks so much!

You are right! Hopefully these formats will become standard and implemented in more toolchains. My worry is that there seem to be several "competing" formats for a more compact DWARF debug (as opposed to unwind) information. I hope this doesn't lead to more fragmentation of the ecosystem.

Definitely! I assume you work at Google, so most of your binaries probably have some crash handlers set up already, right? I find this very useful, and wish it were more commonplace. I really like Folly's implementation of this idea [0] that just requires a single line of code: `folly::symbolizer::installFatalSignalHandler()`


> I assume you work at Google

I did for a really long time. At a startup now.

> so most of your binaries probably have some crash handlers set up already, right?

Yes, there was nice instrumentation there too. All google3 binaries did something like InitGoogle() that set off commandline flag parsing, signal handler installation, log hooks, mlock/huge page setup, etc.


I've written a DWARF expression parser and stack unwinder and its not true that those three use-cases are enough to be robust [0]. Based on my own observations, you need to implement:

- undefined - same value - offset(N) - val_offset(N) - register(N)

I've never come across expression(E), val_expression(E), or architectural. But you definitely do need the above in the common cases.

More generally, I'm kinda confused by that quote, I think the author may be conflating DWARF expressions and stack unwinding a bit?

What I think they might have been getting at is that with DWARF expression evaluation [1] in particular (i.e. not just stack unwinding), I've never even seen +80% of the opcodes I've implemented in the wild across several programming languages and compilers. I also left about 1/4 of them unimplemented and I've never had an issue. The intention of the DWARF spec writers was to provide folks with a stack machine that was Turing-complete, but in practice, yeah, nobody really takes advantage of it because there's not really a need.

[0] https://dwarfstd.org/doc/DWARF5.pdf (section 6.4) [1] https://dwarfstd.org/doc/DWARF5.pdf (section 7.7)


I've found two examples of DWARF expressions in the wild: 1. A handwritten expression in libpthread <= 2.19, which was removed in 2014 [1], and 2. the expression covering the PLT section of most binaries. [2]

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1157194#c14

[2] https://github.com/mstange/framehop/blob/6777199dd0c2c1b3768...


> More generally, I'm kinda confused by that quote, I think the author may be conflating DWARF expressions and stack unwinding a bit?

I suspect they only cared about / looked at the portion of DWARF information needed for stack unwinding. I'd read the quote with that implicit qualifier after "almost all dwarf programs".


Ah gotcha, in that case, as I said, I've never come across a DWARF expression while doing stack unwinding in the wild. Just registers and offsets :)


I find this surprising! Was this for off the shelf applications or some custom binaries?

We see DWARF expressions such as `DW_CFA_def_cfa_expression` on the regular. See the "Test Plan" section and commit messages of the PR that introduced support for this particular opcode, as well as the simple executable that shows this opcode [0] and its printed unwind table [1]

[0]: https://github.com/parca-dev/parca-agent/pull/1058

[1]: https://github.com/parca-dev/testdata/blob/7f9206570a743206f...


Yep! I've never seen anything besides DW_CFA_def_cfa, DW_CFA_offset, and DW_CFA_nop in my CIE definition instructions before. Not to say you shouldn't support them or that they don't exist - they just haven't come up for me yet on the compilers I've been digging in to. I've been examining programs I wrote myself, compiled on Linux using a variety of tools (gcc, g++, clang, clang++, rust, go, zig, and jai are the compilers I've been supporting). I've been writing a debugger from scratch for the past few months, so the set of programs I've been testing on is quite small compared to you I'm sure.

I've never encountered a DW_CFA_def_cfa_expression before in the wild, which compilers(s) are you seeing that in out of curiosity? I see Parca supports many more languages/compilers than I do. I do support that opcode (and all the other CIE opcodes) in my debugger, I just have yet to find a compiler that generates it.

Side note, just looked through the Parca marketing materials, it looks awesome! Excited to try it out on some of my stuff and see what I can find.


That's very curious! We definitely see way more DWARF opcodes in the wild, including `DW_CFA_remember_state`, `DW_CFA_restore_state`, `DW_CFA_advance_*`, among others.

Statistically speaking, a profiler has more chances of bumping into program counters (PCs) that fall in ranges defined by a larger set of DWARF opcodes. We run at 100Hz, which can quickly yield a large number of different program counters, while I assume that most debugging sessions won't hit that many different PCs.

I'm using Fedora 36 which ships `gcc (GCC) 12.2.1 20220819 (Red Hat 12.2.1-2)`

> I've been writing a debugger from scratch for the past few months

Sounds fun! Are you planning on writing about this? I am sure there would be many people interested in reading about this topic!

Either way feel free to reach out at javier @ my company's domain if you want to discuss the DWARF opcodes thing further!


Good point about profilers' stack unwinders hitting much more code than a traditional debugger does, I'm sure that's mostly to blame. I've only been debugging a few thousand lines of code with my debugger so far, so we can't really call it a serious tool yet! I also am using Fedora 36 with the gcc version that it ships with. :shrug:

I'm not sure if I plan to write about it, it doesn't feel like I'm uncovering anything that anyone can't go look up pretty easily. I've been building it mostly because I'm frustrated with all the existing tools out there, save for RemedyBG (which is awesome, but Windows-only, and I'm only on Windows every so often). However, I'm still quite a ways off from being a reasonable daily driver!


> The intention of the DWARF spec writers was to provide folks with a stack machine that was Turing-complete, but in practice, yeah, nobody really takes advantage of it because there's not really a need.

There definitely is a need for better DWARF information, be it with new opcodes or not. It's just that generating them has not been a priority for compilers.


I am going to start positing this into every discussion about this topic now: can we as an industry start not omitting frame pointers? It’s not so significant (maybe puts us back three months in hardware advances) but everybody greatly benefits from that.


They are part of the the psABI on aarch64 and POWER. Due to to the link register, they can even omit the frame pointer from leaf functions without losing the identity of the immediate caller.

Generally, the “compile with frame pointers” request is a bit underspecified on x86-64. For example, the perf trampoline that is cited as the reason why future Python versions are going to use frame pointers itself does not have a frame pointer. (It does not clobber it, either, but backtraces through it will skip a frame.) Similarly, if you just compile glibc with frame pointers without patching a handful of assembler files, then for typical workloads, 5% to 10% of your samples will have an incomplete backtrace because the immediate caller of glibc string functions is not recorded (they don't have a frame pointer, either).

I expect this x86-64 debate will be obsolete Really Soon Now because everyone will simply copy out the shadow stack on every sample. It's even faster than frame pointer traversal because it's just an array copy. We are still looking at making DWARF backtraces a bit faster (or even the more complex unwinding case), but I doubt that this work will be impactful, given the politics involved. The shadow stack will have some performance impact, too, but it can be switched off on a per-process basis if necessary.


Frame pointers are enabled by default on macOS and iOS and I can't understand why they aren't enabled by default elsewhere. It just makes it so much easier to debug and profile things.

See for instance Apple's ARM64 calling convention: https://developer.apple.com/documentation/xcode/writing-arm6...

"The frame pointer register (x29) must always address a valid frame record. Some functions — such as leaf functions or tail calls — may opt not to create an entry in this list. As a result, stack traces are always meaningful, even without debug information."


Apple relies on this heavily when doing systemwide performance profiling. It's good practice!


There is currently a Fedora proposal to add -fno-omit-frame-pointer to default compilation flags [0](mailing list discussion: [1]).

  [0]: https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer
  [1]: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/OOJDAKTJB5WGMOZRXTUX7FTPFBF3H7WE/
edit: whoops, this was already mentioned in the blog post. Should read before commenting :)


By now (but not at the time of posting), a rejected Fedora proposal.


100% this. Disabling frame pointers is not worth it in my opinion. Hindering debugging and profiling is a massive regression. It's one of these historical (due to x86 having fewer registers) and short-term thinking changes that have unfortunately stayed with us. I'm optimistic that this can change but it will take a while.

BTW, I'm a big fan of your work, thanks for Flask and Jinja!


Could not agree more! Unfortunately, I don't think we'll realistically be able to convince everyone, and even if we did convince everyone to have frame pointers tomorrow, it would still take years to take effect. So for the foreseeable future we'll still need things like this, even if I wish we didn't.

That said, we should still try to convince people!


I have a tangential question, more about BPF in general:

I watched a presentation recently by one of the core BPF contributors, in which they paint it as a safer, extended version of C:

https://www.youtube.com/watch?v=K08YCgALHDo

Is there any reason why you wouldn't want to use this to write regular, user-land applications instead of C? Could you?

(I don't know much about BPF in general)


I like to joke that BPF-verified C is the future of memory-safe systems code, not Rust. It's a joke because you can't express most programs in verified BPF. The reason BPF is safe, despite the seeming theoretical impossibility of program verification, is that the verifier rejects most "valid" programs. To give a simple example: every loaded BPF program must provably halt, so any loop in a BPF program must be bounded in a way that the verifier can prove. And bounded loops, where you can see just from the AST that the loop can only tick N times, are a recent eBPF verifier innovation.

The way you build serious systems with BPF is that you couple a tiny straight-line BPF kernel that generates events with a userland program that consumes the events and does interesting stuff.


Ah, that sounds like a pretty serious limitation. Do you think they might relax this restriction in the future?

In the talk, he mentions adding higher-level constructs to C like lists and RB-trees that are also concurrently-safe(!!) which seem just massive to me.

It'd be stellar to be able to program in a safe C subset with some higher-level affordance like this.


Having worked with eBPF for a while now, it’s actually interesting how it makes you think about useful limits, that we typically defer thinking about. For example instead of writing code that validates a string and then doesn’t care about the length anymore we just always limit it to only process a certain length in the first place. I’ve found few problems that I couldn’t find useful limits for.

That’s not to say that the verifier isn’t limiting and at time difficult to deal with, but I find it less of a limitation in practice than it seems at first. Much like rust’s borrow checker I do think it takes a change in mindset.

//edit

Turns out being thoughtful about this when writing the code is also helpful for performance reasons as things like loops can be unrolled which can have significant performance gains.


I think they'll make incremental progress at accepting more programs, but that BPF-C programs aren't going to look like normal C programs any time in the near future.

There's a bunch of data structures available to BPF programs, because they can call "helpers", which are just ordinary C functions. There are concurrent semi-durable key-value stores of varying types (eBPF "maps"). These aren't written in BPF-C; they're just kernel code that BPF gets to call.

Adding new helpers is not nontrivial; you can't just load them into your kernel and call them; they're part of the kernel API.


This is true of Rust as well. The Rust type system is just a formal verifier for memory safety which rejects most memory safe programs. Of course Rust will accept more memory safe programs than eBPF.


The only thing Rust keeps you from implementing are trees. It's at least OK with loops! :)


That's not really the point I'm trying to make. There are programs that are literally memory safe that don't compile, that may be using any data structure you can think of.


Rust is fine with trees, no? No cycles and all that. It's arbitrary graphs that are hard.


AFAIU BPF does not have loops as it is designed to be assured the execution has an end.


That's right! There's no "traditional" loops as programs have to be proved to terminate at some point.

That being said, very recently support for bounded loops landed [0]. It's very exciting and useful, and I've seen it reduce verification times significantly, but we can't use this yet as it requires kernel 5.3 or greater, and we would like to support as many users as possible!

[0]: https://lwn.net/Articles/877062/

[1]: https://github.com/iovisor/bcc/commit/38304256c49a02aecbf78f...


(this talk is great!)

I think this would be cool. Have you considered asking Alexei directly?


I did email the BPF kernel mailing list, they directed me to two user-land (user-space?) implementations of BPF but didn't say much about what "user-space BPF" meant.

Didn't want to clutter up the mailing list with noise so I didn't reply:

https://lore.kernel.org/bpf/Y3BQGgcCMaSptEFJ@google.com/T/#t


This is super awesome work and a great technical explanation of a very deep topic.

What happens in the case of JIT or FFI? I think I've only ever seen the Python profiler, scalene[0], handle these cases.

[0]: https://github.com/plasma-umass/scalene


Thanks!

FFI should work as it's typically a dynamic library that's loaded at runtime. If said library has unwind information, which most compilers add it by default as it's part of the ABI (right now we support unwind information stored in the .eh_frame ELF section).

JIT'd code is not supported yet. This is something that we are already working on, though!


Giving teams more insights into their apps at runtime without requiring changes to the apps is as close as one gets to a superpower.


Agreed! I'm very happy to see that we are getting to the point where engineers that don't work at big tech companies can get these kind of insights, too.


I clicked because I was puzzled why anyone would want to use eBPF to do this. And then the first sentence of the article said why!

Sad that I am delighted by this. It seems most articles these days obfuscate their story.



This is a great tool to have profiling data for "-fomit-frame-pointer" compiled code! Question: How hard is it to add ARM architectures?


We don't know yet, but we are planning to start this work soon. In theory the unwind tables should be very similar thanks to still following the DWARF spec.

We would have to modify the BPF code to handle saved return addresses in arm64, which as far as I am aware are not defined by the ABI, so they might be at any offset from the previous frame's stack pointer (what DWARF calls CFA).

There's (at least) these other things that we might want to look into, such as:

- split stacks

- pointer authentication

- unwind tables stored in a different ELF section


FWIW, the prodfiler.com agent does the same, and we have experimentally working ARM unwinding, including high-level language stuff (Python, Java) mixed with native, on ARM, too.


Nice post @javierhonduco, really interesting read!

As you mention that you are looking into loosening the minimum kernel requirements, what is currently the primitive(s) that is dictating the minimum required version? And how do you plan to sidestep that?


Thanks! :)

Good question! We are building atop the incredible job that the BPF community does. BPF is a restricted environment [0], among others, the stack size is very small, programs have to be proved to finish, and there's no access to arbitrary kernel function.

We've done most of the work to loosen the minimum required kernel, sorry if that wasn't clear. The minimum supported kernel is 4.18.

More context, for those interested:

In order to provide an API to interact with the rest of the kernel, we use BPF helpers, which are like a library (sorta "syscalls") functions that BPF programs can make.

Not every BPF helper is available in a given kernel. The BCC project has a comprehensive and up-to-date list of the different helpers and other features and their introductory commit [1].

The minimum kernel that we can support is then decided by the most modern helper or feature we use. In our case, the most modern features we require are:

- BPF tail calls: since 4.2

- BPF Type Format (BTF): since 4.18

[0]: https://www.kernel.org/doc/html/latest/bpf/bpf_design_QA.htm...

[1]: https://github.com/iovisor/bcc/blob/master/docs/kernel-versi...


FWIW, the prodfiler.com agent has had no-symbol stack unwinding since summer 2021, and a minimum kernel version of 4.15 :-), and happens to have a lower footprint than the solution discussed here.


Very nice work! It seems that Delve is used for inspiration and development, but Golang code is always compiled with frame pointers, I think. Is that right?


Thanks!

As @brancz mentioned, Delve uses DWARF unwind information to produce backtraces (they are stored in the .debug_frame section for Go).

You are right, Go enabled frame pointers for all architectures as of 1.17 [0]. This is enabled to allow profilers to work well, without having to use to techniques such as the one we describe in our post.

When it gets funny is that there's `gopclntab`, a 3rd option in Go to unwind stacks, used by `panic` and I believe other parts of the runtime. If you are interested in more details, Felix Geisendörfer's repo contains way more details [1]

[0]: https://go.dev/doc/go1.17

[1]: https://github.com/DataDog/go-profiler-notes/blob/main/stack...


That's right, but Delve also needs to use DWARF to walk stacks that involve CGO, where it can't rely on frame pointers being present.


Quick correction, actually Delve uses DWARF unwind information for every code location, including those from Go, even if frame pointers are present.


I'd love to see a size comparison of the DWARF encoding vs the binary search map. I have a strong suspicion that there's a neat perfect hash solution to this problem - perfect hashes can encode keys in single digit #bits, and you get faster lookups.


Let me check! I don't our approach to be significantly more compact than DWARF's. DWARF is very compact despite being so much more expressive.

Funny that you mentioned perfect hashing, I thought about this, but didn't go for this approach as it would have too many drawbacks for our use-case:

- It would further increase the complexity of the unwinder, which is already not trivial. Ideally, we would like to reduce the surface area of things that can go wrong.

- Implementing perfect hashing in BPF might be tricky for several reasons. First it might take quite a bit of precious instructions, but also, it would force us to ship a compiler to generate the BPF code in the fly rather than shipping it pre-compiled. We really want to avoid this.

- Last, it would force us to generate entries for every program counter, while thanks to using binary search we can omit redundant entries. DWARF expressions would suffer from a similar issue.



I clicked this wondering if someone had made some fantastic new contraption using Dwarf Fortress. Still a very interesting read, but slightly disappointed :)


It’s “DWARF-based”, not “Dwarf-Based”. No diminutive humanoids, fantastical or otherwise, were involved.


Fixed now, and we'll make amends with the dwarves. Thanks!


I went in expecting something implemented in Dwarf Fortress


Me too, came here looking for this comment/to post this comment


Looks like there is another active submission: https://news.ycombinator.com/item?id=33788883 (maybe they can be merged since they were submitted so close to each other?)


We merged them hither since the OP was the original post.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: