eBPF-based auto-instrumentation outperforms manual instrumentation (odigos.io)
202 points by edenfed on Oct 30, 2023 | 59 comments



How do you solve the context propagation issue with eBPF based instrumentation?

E.g. if you get an RPC request coming in, and make an RPC request in order to serve the incoming RPC request, the traced program needs to track some ID for that request from the time it comes in through to the place where the outgoing HTTP request goes out. And then that ID has to get injected into a header on the wire so the next program sees the same request ID.

IME that's where most of the overhead (and value) from a manual tracing library comes from.


100%. Context propagation is _the_ key to distributed tracing, otherwise you're only seeing one side of every transaction.

I was hoping odigos was language/runtime-agnostic since it's eBPF-based, but I see it's mentioned in the repo that it only supports:

> Java, Python, .NET, Node.js, and Go

Apart from Go (which is a WIP), these are the languages already supported by OTel's (non-eBPF-based) auto-instrumentation. Apart from a win on latency (which is nice, but could in theory be combated with sampling), why else go this route?


eBPF instrumentation does not require code changes, redeployment, or restarts of running applications.

We are constantly adding more language support for eBPF instrumentation and are aiming to cover the most popular programming languages soon.

Btw, not sure that sampling is really the solution to combat overhead; after all, you probably do want that data. Trying to fix a production issue when the data you need is missing due to sampling is not fun.


All good points, thank you.

What's the limit on language support? Is it theoretically possible to support any language/runtime? Or does it come down to the protocol (HTTP, gRPC, etc) being used by the communicating processes?


We already solved compiled languages (Go, C, Rust) and JIT languages (Java, C#). Interpreted languages (Python, JS) are the only ones left; hopefully we will solve these soon as well. The big challenge is supporting all the different runtimes; once that is solved, implementing support for different protocols / open-source libraries is not as complicated.


Got to get PHP on that list :)


FWIW it's theoretically possible to support any language/runtime, but since eBPF is operating at the level it's at, there's no magic abstraction layer to plug into. Every runtime and/or protocol involves different segments of memory and certain bytes meaning certain things. It's all in service towards having no additional requirements for an end-user to install, but once you're in eBPF world everything is runtime-and-protocol-and-library-specific.


It depends on the programming language being instrumented. For Go we are assuming the context.Context object is passed around between different functions or goroutines. For Java, we are using a combination of ThreadLocal tracing and Runnable tracing to support use cases like reactive and multithreaded applications.
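
For illustration, a minimal sketch of the Go pattern that assumption relies on (not Odigos code; the downstream URL is made up): the incoming request's context.Context is handed to the outgoing call, which is what ties the two sides together.

    package example

    import "net/http"

    // The incoming request's context is passed to the outgoing call,
    // which is the propagation pattern the instrumentation assumes.
    func handler(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context() // context of the incoming request

        req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://downstream/api", nil)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        resp, err := http.DefaultClient.Do(req) // outgoing call carries the same context
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        defer resp.Body.Close()
        w.WriteHeader(http.StatusOK)
    }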


That’s a very big assumption, at least for Go based applications.


I don't think it's unreasonable: you need a Context to make a gRPC call and you get one when handling a gRPC call. It usually doesn't get lost in between.


True for gRPC, but not necessarily for HTTP - the HTTP client and server packages that ship with Go predate the Context package by quite a long while.


We are also thinking of implementing a fallback mechanism to automatically propagate context within the same goroutine if a context.Context is not passed.


Going to be rough for supporting virtual threads then?


We have a solution for virtual threads as well. Currently working on a blog post describing exactly how. Will update once released.



The eBPF programs handle passing the context through the requests by adding a field to the header, as you mentioned. The injected field follows the W3C Trace Context standard.
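
For reference, a W3C Trace Context header carries a version, a trace-id (16 bytes, hex), a parent span-id (8 bytes, hex), and flags; the value below is the spec's illustrative example, not real output:

    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01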


They don't really show any of the settings they used, but for traces, I imagine if you have a reasonable sampling rate, then you aren't going to be running any code for most requests, so it won't increase latency. (Looking at their chart, I guess they are sampling .1% of requests, since 99.9% is where latency starts increasing. I am not sure if I would trace .1% of page loads to google.com, as their table implies. Rather, I'd pick something like 1 request per second, so that latency does not increase as load increases.)

A lot of Go metrics libraries, specifically Prometheus, introduce a lot of lock contention around incrementing metrics. This was unacceptably slow for our use case at work and I ended up writing a metrics system that doesn't take any locks for most cases.

(There is the option to introduce a lock for metrics that are emitted on a timed basis; i.e. emit tx_bytes every 10s or every 1MiB instead of at every Write() call. But this lock is not global to the program; it's unique to the metric and key=value "fields" on the metric. So you can have a lot of metrics around and not contend on locks.)
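
A minimal sketch of the lock-free counter idea (illustrative, not the actual Pachyderm code), assuming the hot path is just an atomic increment:

    package example

    import "sync/atomic"

    // counter holds per-metric state, so there is no global lock to contend on;
    // increments and reads are plain atomics.
    type counter struct {
        n atomic.Int64
    }

    func (c *counter) Add(delta int64) { c.n.Add(delta) }
    func (c *counter) Load() int64     { return c.n.Load() }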

The metrics are then written to the log, which can be processed in real time to synthesize distributed traces and prometheus metrics, if you really want them: https://github.com/pachyderm/pachyderm/blob/master/src/inter... (Our software is self-hosted, and people don't have those systems set up, so we mostly consume metrics/traces in log form. When customers have problems, we prepare a debug bundle that is mostly just logs, and then we can further analyze the logs on our side to see event traces, metrics, etc.)

As for eBPF, that's something I've wanted to use to enrich logs with more system-level information, but most customers that run our software in production aren't allowed to run anything as root, and thus eBPF is unavailable to them. People will tolerate it for things like Cilium or whatever, but not for ordinary applications that users buy and request that their production team install for them. Production Linux at big companies is super locked down, it seems, much to my disappointment. (Personally, my threat model for Linux is that if you are running code on the machine, you probably have root through some yet-undiscovered kernel bug. Historically, I've been right. But that is not the big companies' security teams' mental model, it appears. They aren't paranoid enough to run each k8s pod in a hypervisor, but are paranoid enough to prevent using CAP_SYS_ADMIN or root.)


Thanks for the valuable feedback! We used a constant throughput of 10,000 rps. The exact testing setup can be found under “how we tested”.

I think the example you gave of the lock used by the Prometheus library is a great example of why generating traces/metrics is a great fit for offloading to a different process (an agent).

Pachyderm looks very interesting; however, I am not sure how you can generate distributed traces based on metrics. How do you fill in the missing context propagation?

Our way to deal with eBPF's root requirements is to be as transparent as possible. This is why we donated the code to the CNCF and are developing it as part of the OpenTelemetry community. We hope that being open will make users trust us. You can see the relevant code here: https://github.com/open-telemetry/opentelemetry-go-instrumen...


> I am not sure how you can generate distributed traces based on metrics

Every log line gets an x-request-id field, and then when you combine the logs from the various components, you can see the propagation throughout our system. The request ID is a UUIDv4, but the mandatory version nibble ('4') gets replaced with a digit that represents where the request came from: background task, web UI, CLI, etc. I didn't take the approach of creating a separate span ID to show sub-requests. Since you have all the logs, this extra piece of information isn't strictly necessary, though my coworkers have asked for it a few times because every other system has it.
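
A rough sketch of that version-nibble trick (assuming the github.com/google/uuid package and single-character origin codes; not their actual implementation):

    package example

    import "github.com/google/uuid"

    // requestID generates a UUIDv4 and overwrites the version character
    // (string index 14, normally '4') with a digit identifying the origin,
    // e.g. '1' for web UI, '2' for CLI, '3' for a background task.
    func requestID(origin byte) string {
        b := []byte(uuid.NewString()) // "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx"
        b[14] = origin
        return string(b)
    }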

Since metrics are also log lines, they get the request-id, so you can do really neat things like "show me when this particular download stalled" or "show me how much bandwidth we're using from the upstream S3 server". The aggregations can take place after the fact, since you have all the raw data in the logs.

If we were running this such that we tailed the logs and sent things to Jaeger/Prometheus, a lot of this data would have to go away for cardinality reasons. But squirreling the logs away safely, and then doing analysis after the fact when a problem is suspected ends up being pretty workable. (We still do have a Prometheus exporter not based on the logs, for customers that do want alerts. For log storage, we bundle Loki.)


In the age of supply chain attack wariness and generally skyrocketing risk, it is a bit funny seeing the many observability vendors wanting you to give them kernel-mode access. And it's sad that the apps most in need of automatic instrumentation are "frozen", rarely developed/updated apps at critical institutions like banks.

As for the original post, OpenTelemetry is forced to be relatively slow because of a huge number of semantic conventions that are meant to make data more useful. I won't go into the legitimacy of that, but while I haven't been able to verify the data this solution records, it is very unlikely to be recording as much information. Manual instrumentation would never lose to eBPF in principle, at least in a compiled language like Go, but eBPF does have great potential to perform better than OTel while recording far less data. Then comes a blog post, users giving away the keys to their kernel, and data ending up in the hands of an enemy state. I doubt that's the case this time, but it's only a matter of time.

Banking apps if you see this, please just instrument your code. Thank you.


Somewhat related: I mainly code in Kotlin. Adding OpenTelemetry was just a matter of adding an agent to the command-line args (the usual Java/JVM magic most people don't like). Then I had a project in Go and got so tired of all the steps it took (setup and ensuring each context is instrumented) that I just gave up. We still add our own manual instrumentation for customization, but auto-instrumentation made adoption much easier on day 0.


OTel autoinstrumentation is in the works, check it out: https://github.com/open-telemetry/opentelemetry-go-instrumen... (I wrote and tested that guide).


I think eBPF also has great potential to help JVM-based languages, especially around performance, even compared to the current Java agents which use bytecode manipulation.


The article mentions avoiding GC pressure and the separation between recording and processing as big performance wins for runtimes like Java, but couldn't you do the same inside Java by using a ring buffer?


Interesting idea. I think that as long as you are able to do the processing, serialization, and delivery in another process and take that work off your application runtime, you should see great performance.


Can we add manual spans (at the service level) as part of the automated traces created by this eBPF auto-instrumentation approach? Like, can we access the context in a running application, or will there be some sort of "traceparent" header present in the incoming request?


The column in the table claiming the "number of page loads that would experience the 99th %ile" is mathematically suspect. It directly contradicts what a percentile is.

By definition, at 99th percentile, if I have 100 page loads, the one with the worst latency would be over the 99th percentile. That's not 85.2%, 87.1%, 67.6%, etc. The formula shown in that column makes no sense at all.


That's not what that column is supposed to mean, afaict. The way I read it, it's showing that if the website requires hundreds of different parallel backend service calls to serve the page load, what's the probability that a page load hits the p99 instrumentation latency?

We have a similar chart at my job to illustrate the point that high p99 latency on a backend service doesn't mean only 1% of end-user page loads are affected.


Ah, I see. So, for example, if one page request results in 190 different backend requests to fulfill, then the probability that at least one of those subrequests exceeds the 99th percentile would be 85.2%. That makes a lot more sense.
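
A quick sketch of that arithmetic, assuming the subrequests are independent:

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        // P(at least one of n independent calls lands in the worst 1%)
        n := 190.0
        p := 1 - math.Pow(0.99, n)
        fmt.Printf("%.1f%%\n", 100*p) // prints 85.2%
    }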


I recommend watching Gil Tene’s talk, I think he explains the math better than I do: https://www.youtube.com/watch?v=lJ8ydIuPFeU


But what if the 100 page loads are just a sample of the population?


Disclaimer: I'm a co-founder of Coroot. We're currently benchmarking our eBPF-based agent to measure its performance impact.

Could you please elaborate on a few more details about your benchmark?

- Did you measure the CPU usage of the eBPF agent?

- How does Odigos handle eBPF's perfmap overflow, and did you measure any lost events between the kernel and the agent?


How hard is it to use Odigos without k8s? We mainly use docker compose for our deployments (because it's convenient, and we don't need scale), but I'm having trouble finding anything in the documentation that explains the mechanism for hooking into the container (and hence I have no clue how to repurpose it).


We currently support only Kubernetes environments. Docker Compose, VMs, and serverless are on our roadmap and will be ready soon.


Anyone from the dtrace community want to enlighten a n00b about how eBPF compares to what dtrace does?


They're really very different -- with very different origins and constraints. If you want to hear about my own experiences with bpftrace, I got into this a bit recently.[0] (And in fact, one of my questions about the article is how they deal with silently dropped data in eBPF -- which I found to be pretty maddening.)

[0] https://www.youtube.com/watch?v=mqvVmYhclAg#t=12m0s


I listened to this live! That's probably why I was wondering, because I remember you talking about something you used in Linux that didn't quite live up to your expectations with DTrace, but I didn't catch all of the names. Thanks!


By dropped data, do you mean exceeding the size of the allocated ring buffer/perf buffer? If so, this is configurable by the user, so you can adjust it according to the expected load.
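
As an example of what that tuning looks like with the cilium/ebpf library (a sketch under assumed names; the `events` map and buffer size are illustrative, not Odigos code), the per-CPU perf buffer size is chosen when the reader is opened, and lost samples are reported per record:

    package example

    import (
        "log"
        "os"

        "github.com/cilium/ebpf"
        "github.com/cilium/ebpf/perf"
    )

    // readEvents opens a perf reader over a PERF_EVENT_ARRAY map with a
    // larger-than-default per-CPU buffer to reduce the chance of drops.
    func readEvents(events *ebpf.Map) {
        rd, err := perf.NewReader(events, 64*os.Getpagesize())
        if err != nil {
            log.Fatal(err)
        }
        defer rd.Close()

        for {
            rec, err := rd.Read()
            if err != nil {
                return
            }
            if rec.LostSamples > 0 {
                log.Printf("dropped %d samples", rec.LostSamples)
            }
            _ = rec.RawSample // hand the payload off to the exporter pipeline
        }
    }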


eBPF can drop data silently under quite a few conditions, unfortunately. And -- most frustratingly -- it's silent, so it's not even entirely clear which condition you've fallen into. This alone is a pretty significant difference with respect to DTrace: when/where DTrace drops data, there is always an indicator as to why. And to be clear, this isn't a difference merely of implementation (though that too, certainly), but of principle: DTrace, at root, is a debugger -- and it strives to be as transparent to the user as possible as to the truth of the underlying system.


From the hot takes in this post from 2018 [0], I may be asking a contentious question.

[0] https://news.ycombinator.com/item?id=16375938


I don’t have a lot of experience using dtrace, but AFAIK the big advantage of eBPF over dtrace is that you do not need to instrument your application with static probes during coding.


DTrace (on Solaris at least) can instrument any userspace symbol or address, no need for static tracepoints in the app.

One problem that DTrace has is that the "pid" provider that you use for userspace app tracing only works on processes that are already running. So, if more processes with the executable of interest launch after you've started DTrace, its pid provider won't catch the new ones. Then you end up doing some tricks like tracking exec-s of the binary and restarting your DTrace script...


That's not exactly correct, and is merely a consequence of the fact that you are trying to use the pid provider. The issue that you're seeing is that pid probes are created on-the-fly -- and if you don't demand that they are created in a new process, they in fact won't be. USDT probes generally don't have this issue (unless they are explicitly lazily created -- and some are). So you don't actually need/want to restart your DTrace script, you just want to force probes to be created in new processes (which will necessitate some tricks, just different ones).


So how would you demand that they’d be created in a new process? I was already using pid* provider years ago when I was working on this (and wasn’t using static compiled-in tracepoints).


Of course it outperforms it, but it's basic instrumentation. How do you properly select the labels, for example? In your application you will have custom instrumentation for business logic, so what do you do? Now you have two systems instrumenting the same app?


You can enrich the spans created by eBPF by using the OpenTelemetry APIs as usual; the eBPF instrumentation is a replacement for the instrumentation SDK. The eBPF program will detect the data recorded via the APIs and add it to the final trace, combining both automatically and manually created data.
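
For example, extra attributes can be attached to whatever span is active in the request context through the standard OpenTelemetry Go API (a sketch; the attribute name is made up):

    package example

    import (
        "context"

        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/trace"
    )

    // addBusinessContext enriches the span active in ctx; per the comment
    // above, the eBPF instrumentation is expected to pick this up and merge
    // it with the automatically created span data.
    func addBusinessContext(ctx context.Context, orderID string) {
        span := trace.SpanFromContext(ctx)
        span.SetAttributes(attribute.String("app.order_id", orderID))
    }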


Website doesn't display correctly in Firefox on Android. Text bleeds off the left and right sides.


Thank you for reporting, will fix ASAP.


According to what you say, nobody should implement logs manually? I will check Odigos.


Logs are an easy and familiar API for adding additional data to your traces. They still have their place; Odigos just adds much more context.


If I have manually implemented all my logs, what do I need to do to move to Odigos?


Nothing special; if you are working on Kubernetes, it's as easy as running the `odigos install` CLI and pointing it to your current monitoring system.


How does it work with Node.js? IIRC they don't support eBPF.


This is great. Can you elaborate on how the performance is better?


Our focus was on latency. The reason we were able to cut it down is that eBPF-based automatic instrumentation separates the recording from the processing.


How did you actually reduce the latency here?


The main factor in the reduced latency is the separation between recording and processing of data. The eBPF programs are the only overhead for the instrumented process in terms of latency; they transfer the collected data to a separate process which handles all the exporting. This is in contrast to manually adding code to an application, which adds latency and memory footprint from handling the exported data.


But the processing will still cost CPU time, which takes it away from the 'main' process, unless it's transferred off the machine and processed elsewhere. Unless eBPF can do such processing much more efficiently than the application's own code, I don't see how it reduces latency differently from a properly threaded app. Of course, the fact that eBPF makes an app instrumentable without changes is a good enough reason to use it.


Compared to a multi-threaded process, there is still a big advantage in terms of latency. Handling all the exporting in the same process will greatly affect GC operations, which require stop-the-world handling.



