Frame pointers vs. DWARF – my verdict (rwmj.wordpress.com)
97 points by rwmj on Feb 15, 2023 | 66 comments



IMHO, perf's decision to write whole stacks directly to disk and unwind them as a post-process is really bad design. It wastes disk space and, as the author pointed out, it also has a lot of IO overhead.

As an alternative approach, https://github.com/mstange/samply processes data streamed from perf and unwinds it in real time. The unwinding overhead is surprisingly low: it only takes around 1% of a (single) CPU per CPU profiled. Solving the disk waste alone has been a tremendous improvement to the profiling experience. As a bonus, the unwinding and symbolization work reliably, whereas I frequently had post-processing that never terminated when using the perf CLI directly.


Are you saying that DWARF information should be unwound in real time, or that it should use frame pointers and debug information to trivially sample the stacks and record the symbols?

If you have frame pointers and debug information, it is both high resolution and fast. DWARF is a fallback for not having frame pointers.

If you are saying that DWARF information should be processed at the point of use and not copied and processed later, then I concur. But we should also encourage folks to compile WITH `-fno-omit-frame-pointer` and `-g`.


This could be a great Linux perf GSoC project. Projects and mentors are being sought: https://wiki.linuxfoundation.org/gsoc/2023-gsoc-perf


Parca has also done work to unwind DWARF in the kernel with eBPF: https://www.polarsignals.com/blog/posts/2022/11/29/profiling...

Edit: refer to another comment in this thread: https://news.ycombinator.com/item?id=34809265


> Frame pointers have some corner cases which they don’t handle well (certain leaf and most inlined functions aren’t collected), but these don’t matter a great deal in reality.

> DWARF unwinding can show inlined functions as if they are separate stack frames. (Opinions differ as to whether or not this is an advantage.)

This conflates unwinding and symbolization. Unwinding collects the list of frames, which by definition cannot have inlined functions (functions that don't have their own frame); the unwinding mechanism does not matter here.

DWARF can be used to resolve the "stack" of inlined functions for an instruction address, even if that was collected via frame pointers. That can be done in post-processing, possibly on a different machine, so the cost does not affect the workload being profiled.

For example, using addr2line from the LLVM distribution

    $ llvm-addr2line -pfi --demangle -e /path/to/unstripped/binary
    # Enter instruction address in the form 0x...
    # Outputs inlined stack
To summarize, unwinding via frame pointers does not miss any information that would be collected with DWARF unwinding. Everything can be recovered later at symbolization time.

As for whether reporting inlined functions is important or not: at least for C++, which relies heavily on inlining to mitigate abstraction penalties, I'd say it is crucial.


> To summarize, unwinding via frame pointers does not miss any information that would be collected with DWARF unwinding. Everything can be recovered later at symbolization time.

The real issue with DWARF-based unwinding/symbolication is that the user experience around it is really complex. We at Sentry support stack walking from minidumps, yet we often cannot unwind on Linux platforms on the server because executables and object files are typically not available.

Despite our supporting debuginfod (today we're only hitting Canonical's service), we are unable to collect executables/object files from there, making it completely impossible to produce proper stack traces if frame pointers are omitted.

In a world where DWARF unwinding is a thing and people want to use it, there has to be an ecosystem for sharing binaries too. This issue is particularly bad on Android, where there are millions of devices out there with tons of different proprietary system libraries linked in, destroying stacks.


Couldn't this be solved by uploading the binaries along with the crash dumps, if you don't already have a copy of them, as determined by checking hashes or something?


The debug info is usually stripped out of the binaries for Linux distros and has to be installed separately. debuginfod is supposed to make it possible to download the debug info in a distro-agnostic way, based on the build ID embedded in the binaries.


Exactly, except debuginfod from Canonical and some others misses the executables.


I would expect them to be there, but debuginfod for Ubuntu is very new and has not (at least currently) imported older releases (including the current LTS) or older package versions, which is maybe what you're seeing. It should work better with the most recent release only.

Do you have a specific example of which executables are missing?

There are also other ways to get all the debug symbols, but it's much fiddlier and not nearly as nice as debuginfod. The data's all there; finding it is harder. You can query the apt index for build IDs, but only for the currently released package version and not superseded ones.

There were discussions about importing all the history, and a way to do it was determined, but I'm not sure where the implementation of that has gotten to.


Uh... [Begin flashback] 2012: A change made in GCC 4.7 allowed the optimizer to reschedule and defer the push of the frame pointer that previously occurred in the function prologue whenever frame pointers were enabled. When binaries are profiled using frame pointers, incorrect call chains are derived whenever a sample is taken between the top of the function and the instruction that pushes the frame pointer. I complained, but got an immediate WONTFIX: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55667

So did they fix that eventually or is everyone just oblivious?

How I found the problem in 2012: Configuration of profiling tools for C/C++ applications under 64-bit Linux https://doi.org/10.6028/NIST.TN.1790


I cannot comment on whether “everyone” is oblivious, but yes, this is still the case: frame-pointer-based unwinding sometimes skips the caller when the IP is sampled before the callee sets up a frame.

This is also common for samples in leaf functions.

Compiler & toolchain folks tend to think (quite justifiably, IMO) that this and similar stuff is fine because DWARF allows reconstructing everything perfectly. The problem is just that the user experience of DWARF-based unwinding is poor, because the only implemented method in Linux is sampling the contents of the stack and doing the unwind in post-processing.


AFAIK, the optimization undermined the only use case that -fno-omit-frame-pointer actually had on x86_64. Is there a real use case that benefitted from allowing the frame pointer push to wander? Why why why


That is too bad; it does mention `-fno-schedule-insns2` in the comments. I'll have to see how well that works in practice.

Having high-quality, low-cost stack traces is important for continuous profiling. ref Knuth.


The pervasive lack of frame pointers is the reason why we've developed a custom format derived from DWARF unwind information, thanks to some insights: DWARF unwind information is incredibly flexible; it supports many architectures and allows restoring any arbitrary register. But we only need three: the frame pointer, the stack pointer, and (on non-x86) the return address.

While DWARF unwind info doesn't use that many bytes, reading and parsing that information is unfortunately quite expensive.

For that reason I've developed a new unwinder that uses custom unwind information derived from DWARF (https://www.polarsignals.com/blog/posts/2022/11/29/profiling..., previously discussed in https://news.ycombinator.com/item?id=33788794) that runs in BPF. This new compact representation can be binary searched easily and each unwind row has a size of 16 bytes. I am currently working on reducing it down to ~10 bytes.
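
For illustration, a row in this general spirit might look roughly like the sketch below (hypothetical names and layout for this comment only, not our actual format); the table is sorted by PC, so finding the rule for a sampled instruction pointer is a plain binary search:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical compact unwind row (illustration only): for a range of
     * program counters, describe how to recover the CFA, the saved frame
     * pointer and the return address without parsing DWARF at runtime.
     * With padding this comes out to 16 bytes. */
    struct unwind_row {
        uint64_t pc;         /* start of the PC range this rule covers     */
        uint8_t  cfa_reg;    /* 0 = CFA is rsp + offset, 1 = rbp + offset  */
        int16_t  cfa_offset; /* offset added to cfa_reg to obtain the CFA  */
        int16_t  fp_offset;  /* CFA-relative slot holding the caller's FP  */
        /* on x86-64 the return address always sits at CFA - 8 */
    };

    /* Binary search: return the last row whose pc is <= ip, or NULL. */
    static const struct unwind_row *
    find_row(const struct unwind_row *rows, size_t n, uint64_t ip)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (rows[mid].pc <= ip)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo ? &rows[lo - 1] : NULL;
    }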

All the code is fully OSS (Apache 2.0 for userspace and GPL for BPF) and part of the Parca project (https://github.com/parca-dev/parca-agent). We also gave a talk at FOSDEM this year going deeper into how we made it scale to many big processes.


Nice!

One suggestion: binary search has extremely poor cache behavior, and early versions of the ORC unwinder (IIRC) spent considerably more time binary searching the table than actually unwinding.

There are many solutions to this. ORC (IIRC) uses a flat hint table mapping PC -> offset in the main table. It's sparse, so you look up hint[(ip - base) / divisor] and its successor to find a small range of the table to search. (The divisor is set to keep the hint table compact but still limit the main search to something small.) This gives essentially linear time lookups with simple code.

You can also use a B-tree or a similar structure. B-trees are pretty straightforward if you don’t ever need to modify them.

IIRC this gave a substantial speedup.
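
For concreteness, the two-level lookup described above might look roughly like this (a sketch with made-up names, not the actual ORC code):

    #include <stdint.h>
    #include <stddef.h>

    /* Sparse "hint" index over a table of unwind rows sorted by PC.
     * hint[i] is the index of the last row whose PC is <= base + i * DIVISOR,
     * so each lookup only has to scan a small slice of the full table. */
    #define DIVISOR 256  /* tuning knob: smaller => bigger hint table, smaller slices */

    struct lookup_index {
        uint64_t        base;     /* lowest PC covered                   */
        const uint32_t *hint;     /* PC bucket -> index into the table   */
        size_t          nhints;
        const uint64_t *table_pc; /* sorted start PCs of the unwind rows */
        size_t          nrows;
    };

    /* Return the index of the row covering ip (assumes base <= ip). */
    static size_t find_unwind_row(const struct lookup_index *ix, uint64_t ip)
    {
        size_t bucket = (ip - ix->base) / DIVISOR;
        size_t lo = ix->hint[bucket];            /* last row at/below bucket start */
        size_t hi = (bucket + 1 < ix->nhints)
                        ? ix->hint[bucket + 1] + 1   /* exclusive bound from next hint */
                        : ix->nrows;

        /* Only a handful of rows lie in [lo, hi); a linear scan is cache friendly. */
        while (lo + 1 < hi && ix->table_pc[lo + 1] <= ip)
            lo++;
        return lo;
    }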


This is a very good point. Thanks for the context on how ORC does it, I was not familiar with how it evolved.

In our case this is not an issue yet. We split up the unwind tables into chunks of up to ~1MB, so we typically incur very few L2 cache misses.

This has to be done once per frame, so on my 4-year-old i7 processor there are roughly `2 L2 cache misses (+ a few other misses from reading ancillary data structures) * frames`.

We have more work to do regarding benchmarks but I collected some numbers the other day and 90 frames can be walked in less than 500ns (slide 50 https://fosdem.org/2023/schedule/event/walking_stack_without...)

There are more optimisations we have in the works to make our unwinder more efficient, mostly related to fitting more data in the CPU cache and reducing cache misses.


What everybody in this discussion seems to miss is that you don't need to unwind the DWARF data structures at profiling time; you are free to convert DWARF to a fast-lookup data structure on the machine.

DWARF needs to support every CPU under the sun. Every unwinder on the other hand is CPU-specific. For prodfiler.com's continuous in-production unwinding, we convert DWARF into something compact and fast-to-lookup that is then placed in eBPF maps.

It all works like a charm. We can have our cake (e.g. use RBP as GPR) and eat it too (e.g. use .eh_frame, converted at runtime into a fast-to-lookup format) to do reliable whole-system unwinding in production.


prodfiler clearly has a market. It would be interesting to see the approach as something standard in the kernel tree; perhaps it could be added to perf's synthesis, etc. There is already BPF-based profiling within perf to avoid file descriptor overheads. If engineering resources are the issue, then this could be a good GSoC project: https://wiki.linuxfoundation.org/gsoc/2023-gsoc-perf


This would be ideal. There's some great work by folks at Oracle in this space: SFrame (https://www.phoronix.com/news/GNU-Binutils-SFrame) née ctf_frame that I hope will be integrated in the kernel.

As this will take a few years, in the meantime I've developed a DWARF-based unwinder in BPF [0]. Some perf maintainers have shown interest in this, so thanks for bringing up the GSoC project idea; it hadn't occurred to me!

[0]: https://news.ycombinator.com/item?id=33788794


Yeah. I really like the ideas proposed by Brendan Gregg -- essentially encouraging every HLL runtime to embed an eBPF-based unwinder in its own executable. The upshot of that would be "generic, in-production unwinding of native code and HLL code", similar to what prodfiler is doing, but inside the main kernel tree...


The article points out that the kernel uses ORC instead of DWARF for unwinding. I wonder if that could ever become an option in userspace? I imagine that if all you’re interested in is stack traces, instead of debugging (which DWARF is designed for), ORC would be a very nice performance win. And it’s not as if they’re mutually exclusive, either: there’s no reason why a binary or debuginfo couldn’t just ship both.

In any case, the runtime performance hit of frame pointers is quite high, and since we do have things like DWARF (and maybe ORC someday?), I’d still argue that frame pointers aren’t necessary. It’s nice to have that extra register free!


Native ORC generation from GCC or LLVM or userspace tooling like objtool would be nifty.

FWIW, ORC was specifically designed to be efficient, but it was not designed to be future-proof against complex toolchain changes. Since all the ORC tooling is in the kernel, it can evolve together if needed.

(I was a bit involved in the design — I helped optimize the format to reduce cache misses on lookup.)


For prodfiler.com we convert .eh_frame (DWARF) unwinding format to something that is more similar to ORC and optimized for lookup in eBPF data structures.


> In any case, the runtime performance hit of frame pointers is quite high

That's not true at all. It's only significant in some very rare cases. Firefox, I think, at this point ships with frame pointers enabled, and so does every M1/M2 Mac app and all iOS applications, as it's mandatory for the calling convention on Apple platforms.


The ORC documentation [1] cites a 5-10% perf hit [2] on x86. Now, granted, we're talking about an architecture that has only 8 logical registers to begin with, so losing one is going to hurt a lot more than on x86-64 (16 logical registers) or AArch64 (32 logical registers). But 5-10% is definitely a noticeable perf hit.

Phoronix [3] did a test of "-fno-omit-frame-pointer" during the discussion on whether to enable the flag for Fedora, this time on x86-64. They found an average 14% performance hit on a wide variety of benchmarks.

[1] https://www.kernel.org/doc/html/latest/x86/orc-unwinder.html [2] https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@... [3] https://www.phoronix.com/review/fedora-frame-pointer/5


See also https://lists.fedoraproject.org/archives/list/devel@lists.fe... - the botan phoronix results with frame pointers were probably measuring debug builds.


On AArch64 I think it's cheaper because losing one register doesn't hurt as much, since you've got twice as many.


Nitpick: it's the number of physical registers that matters more here, not ISA registers.


Why wouldn't the lower ISA register count lead to more spills, regardless of the physical count? Is store forwarding really that effective?


It isn't.


Relevant discussion on profiling without frame pointers: https://twitter.com/halvarflake/status/1577644229853151233 / https://prodfiler.com/blog/introducing-prodfiler/ via the "prodfiler" tool.


Twitter: "No need to recompile with frame pointers."

Web page: "always-on profiling powered by eBPF technology."

BPF stack traces are gathered using frame pointers:

https://github.com/torvalds/linux/blob/master/kernel/bpf/sta...

https://github.com/torvalds/linux/blob/master/arch/x86/event...
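
For context, that in-kernel collection is conceptually just following the saved-frame-pointer chain. A rough user-space sketch of the idea (hypothetical names, not the kernel code, and without the pointer validation real code needs):

    #include <stdint.h>
    #include <stddef.h>

    /* With -fno-omit-frame-pointer on x86-64, each frame begins with the
     * saved caller RBP followed by the return address into the caller. */
    struct frame {
        struct frame *next; /* saved caller RBP */
        uint64_t      ret;  /* return address   */
    };

    /* Follow the RBP chain and record return addresses. A real walker
     * (kernel or BPF) must validate every pointer before dereferencing. */
    static size_t walk_fp_chain(const struct frame *fp, uint64_t *ips, size_t max)
    {
        size_t n = 0;
        while (fp && n < max) {
            ips[n++] = fp->ret;
            fp = fp->next;
        }
        return n;
    }

If the profiled code was built without frame pointers, that chain is simply broken, which is the crux of the disagreement here.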


Someone from prodfiler appears to be explaining in this thread https://news.ycombinator.com/item?id=34806693


For every running application turn DWARF data into BPF maps. Does this scale?


As surprising as it seems, it does! My colleague and I gave a talk about this at FOSDEM: https://fosdem.org/2023/schedule/event/walking_stack_without.... Let us know if you have any questions or feedback :)


There is an option far better than either suggested. In my experience, using the --call-graph=lbr option produces far more reliable call stacks than relying on frame pointers. Sadly it's only available on Intel CPUs at the moment.


AMD will have support in Zen4 and Linux 6.1 (which is LTS):

https://lore.kernel.org/lkml/Yz%2FcpNTSacRMh1FK@gmail.com/

Further, precise events are fixed in Linux 6.2:

https://lore.kernel.org/lkml/Y5eQeR2tpZ%2FBos49@gmail.com/


What do precise events in perf help with?


Not having frame pointers is a complete PITA on Android.

I built my company's in-house mobile crash reporter ~ 2013, based on experience building a similar system at an earlier startup. At the startup I used Google breakpad on both Android and iOS, doing all the unwinding and symbolication on the backend. iOS at least made this easy because Apple makes dSYMs readily available. On Android, you simply can't get system symbols. So you essentially can't unwind OS symbols on Android.

At current company I used PLCrashReporter on iOS. Unwinding occurs on device. Symbolication on the backend.

I tried everything with Android for native code crashes, starting with breakpad minidumps, then using every available unwinder option on Android: corkscrew, the Android fork of libunwind, the official libunwind, whatever custom unwinder Android eventually wrote. None of them work reliably for native code crashes. And good luck tracing from native code back into the ART frames.

In the end, what ended up being most reliable was including the last few thousand lines of logcat (grabbed upon the app restarting after a crash, since you can't reliably grab it inside the crash handler). Android's OOB crash handler for some crashes (on recent Android versions) dumps full stacks of every thread, including native code and ART frames, to logcat. So the crash SDK looks for that in the logcat output and includes it in the report. That at least provides a stack. Symbolicating anything but application frames is still impossible, though.

And this isn't even going into esoteric things Android has done over the years just for Chrome like relocation packing:

https://android.googlesource.com/platform/bionic/+/f5e0ba94d...

To this day, I don't understand why both Apple and Google make it so difficult for an application to get access to the stack traces of its own crashes. And no, the reporting built-in to Google Play and iTunes Connect (and Xcode) are not sufficient for large usage apps or companies like mine that have lots of apps with shared SDKs and need to correlate crashes across apps.


> But collecting the whole stack would consume far too much storage, so by default it only collects the first 8K. Many userspace stacks will be larger than this, in which case the data collection will simply be incomplete – it will never be possible to recover the full stack trace.

Yeah, doing an 8KB memcpy for every profile sample sounds like a lot of overhead. Is DWARF unwinding so slow that copying the stack is actually faster?
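
Back-of-the-envelope (my own assumed numbers): at, say, 1,000 samples per second per CPU, 8 KB per sample is roughly 8 MB/s per CPU of raw stack snapshots written to perf.data before any unwinding happens, so a busy 32-core machine would be pushing on the order of 250 MB/s.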

Virgil doesn't use a frame pointer, and I sort of regret it. It uses custom unwinding information that is consulted during GC or when throwing an exception (i.e. a controlled crash). I spent considerable time optimizing both the space and time of that lookup, to the point where it's only 32 bits of metadata per call site and a few dozen instructions to walk each frame. But that was a major, major pain to debug, and I found a bug in it as late as last year.

A frame pointer is also required for stack allocation of objects that aren't fully scalar-replaceable. Virgil doesn't do that yet, but aspires to someday.


> Is DWARF unwinding so slow that that is actually faster?

No. The only reason it works like this is that the upstream Linux kernel has thus far rejected in-kernel DWARF unwinders, while copying the stack is simpler and already available/implemented.


DWARF bytecode is a full VM. Do compiler writers test their DWARF output? (My experience is that they do not, especially for architectures outside the big 2 or 3.) How does the kernel access the ELF file pages containing the DWARF information when in an NMI handler? You could mlock all your debug information when a program loads, but the memory overhead wouldn't be nice. It is hard enough getting a build ID.

The elephant in the room btw is LBR call stacks, but they aren't exposed in the kernel/BPF yet. Userland perf has them and they recently became available on AMD.


It is not required to unwind the user space stack in the NMI handler. It can be done later before returning to user space in a context that can handle faults.


Allowing processes to sniff each others stacks has some fairly obvious security issues.


I don’t understand your concern - what about this would involve one process sniffing another process’s memory? The kernel would still be doing the unwinding, just not in the NMI handler.


Wouldn't all your kernel stacks then end up in whatever this handler is? Why not implement your approach and mail it to LKML :-)


Yes, this only works for user-space stacks, but that is sufficient since kernel stacks are solved with ORC (IMO), and it avoids all the issues you mentioned with trying to mlock debuginfo for all processes. The NMI handler would still unwind the kernel stack.

> Why not implement your approach and mail it to LKML :-)

because this would still be an in-kernel dwarf unwinder and I would expect an instant reject, and because I am lazy and/or don’t care enough about this problem or linux to work on it. Even if people could be persuaded, I don’t have the interest or temperance to debate this with LKML.


Why is profiling done in the kernel for userspace stacks?


because this is about PMU based sampling, which involves triggering interrupts at some interval and doing the sampling while handling the interrupt


Other than overhead, what is the advantage as opposed to handling the interrupt in the kernel and then delivering a signal to userspace? After all, isn't this the role of SIGPROF?


dwarf = brain convolution

frame pointer = 1 less register in a register set already under heavy strain (x86_64). I do envy RISC-V with all its registers: even the load-store model of RISC-V won't use enough extra registers to end up with the same amount of strain as on x86_64.

If I ever have a use for DWARF, I would build the tables manually, and only for what I wish to debug... if that is possible (without specialized assembler directives, because they significantly increase the technical cost of the assembler). That said, does a real, accurate and complete specification of DWARF exist? Because when I look at the SysV ABI or ELF, what a mess.


> frame pointer = 1 less register on an already set of registers under heavy strain (x86_64).

x86-64 is not under heavy register pressure. That idea is a legacy from x86 (plain, 32-bit).

x86-64 has the same register count as ARMv7 and few bothered disabling the frame pointer there, even though it’s a load-store architecture.


Right, x86-64 offers eight extra register names† (r8 through r15). If you choose to go from x86-without-frame-pointer to x86-64-with-frame-pointer, you still gain 7 register names, which is huge.

This makes the case where that one extra register name makes all the difference much rarer, arguably turning it from "I demand a compiler flag" to "Let's just hand-write the machine code for this one very special routine if our performance data suggests it's worth it".

† Internally a modern CPU has far more actual registers, to enable a feature called "register renaming". But we can only talk about them using their canonical names, and x86-64 adds eight more of those.


I think talk of "there are not enough registers" generally ignores pipelining and register renaming way too much. The performance loss from the frame pointer register, even on x86, is not that problematic, and on x86_64 it's completely negligible unless you're in a tight, switch-heavy interpreter loop.


Register renaming doesn’t significantly address the impact of reducing the number of architectural registers available to the compiler. With fewer register names available, the compiler will spill locals to stack more often, and register renaming doesn’t help - memory renaming is needed to really mitigate this.

But I agree the impact of preserving frame pointers is generally quite small and doesn't often actually need mitigation - on amd64 there's not much impact from losing 1 more of 16 arch registers.


Agner speaks about memory renaming back on Zen 2:

https://www.agner.org/forum/viewtopic.php?t=41

Intel Alderlake has performance events for tracking it:

https://github.com/intel/perfmon/blob/974c69919b2a9dfd8278cf...

But even before this you had store-to-load forwarding on x86. I'm not saying you have, but before inventing a performance problem it is worth spending time trying to diagnose it with thorough profiling (e.g. [1]). The Fedora frame pointer patch did a thorough performance analysis, and performance will be revisited again. Unfortunately there are a lot of armchair performance experts who haven't spent time looking into the details.

[1] https://perf.wiki.kernel.org/index.php/Top-Down_Analysis


And that's before accounting for SP, which the compiler will usually avoid touching. So for an x86 target, omitting the frame pointer means going from 6 to 7 usable registers; on x86-64 it's 14 to 15, which is a lot less necessary outside of very specific workloads.


Nice analysis.

It would be interesting to see comparison to Intel LBR.

Also would be nice to know how profiling unwinding is done on Windows (maybe someone knows how to summon Bruce Dawson).


AIUI the problems with LBR are two-fold. It only works on newish CPUs, and it only handles a limited number of stack frames (I heard 8, but maybe more on recent CPUs).


CET’s shadow stacks, on the other hand, will solve this entire problem, both exactly and extremely efficiently.


IIUC Windows implements in-kernel unwinding using FPO debug data embedded in every executable.


Big advantage for DWARF2 over frame pointers is that it actually works for unwinding on aarch64.


Apple's version of AArch64 requires that the frame pointer register point to a valid frame record. DWARF information is not necessary to unwind there. It's lovely and I applaud Apple for that decision.


Can you give more detail?

AFAIK frame pointers work fine for unwinding on aarch64. And on aarch64 the GCC default is not to omit frame pointers, and IIRC when the default was switched at some point it was treated as a bug and reverted (not sure if that is required by the ABI or just strongly preferred by the community). So IME unwinding with frame pointers on aarch64 generally works more often than on amd64, since you don't have to recompile the world.


Why don't frame pointers work on aarch64?



