A Practitioner's Guide to Wide Events (jeremymorrell.dev)
91 points by dmazin 19 days ago | 56 comments



I’m quite looking forward to a future where we’ve finally accepted that all this stuff is just part of the domain and shouldn’t be treated like an ugly stepchild, and we’ve merged OLTP and OLAP with great performance for both, and the wolf also shall dwell with the lamb, and we’ll all get lots of work done.


Wide events are good, but watch out they don't become "god events". The event that every service needs to ingest, and, therefore, if there's new data that a service needs then we just add it onto the god event, because, conveniently, it's already being ingested. Before too long, the query that generates the wide event is getting so complex it's setting the db on fire. Like anything, there are trade offs; practical limits to how wide an event should reasonably become.


Maybe I’m missing something, but this doesn’t seem like what the article is talking about at all. These events are just telemetry — they’re downstream from everything, and no service is ingesting them or relying on them for actual operational data.


until you wire up alerts, and auto remediation steps driven by alerts....


i wonder if there are any semi automated approaches to finding outliers or “things worth investigating” in these traces, or is it just eyeballs all the way down?


This is possible with semi-automatic detection of anomalies over time for a preset of fields used for grouping the events (aka dimensions) and another preset of fields used in stats calculations (aka metrics). In the general case this is a hard task to solve, since it is impossible to check for anomalies across all the possible combinations of dimensions and metrics for wide events with hundreds of fields.

This is further complicated by the ability to apply various filters to the events before and after the stats calculations.
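To make the dimensions/metrics idea concrete, here is a minimal Python sketch. It assumes events are plain dicts and uses a simple z-score over per-group averages; the function names, field names and the threshold are made up for illustration, and real tools likely use more robust statistics.

```python
from collections import defaultdict
from statistics import mean, pstdev

def find_outlier_groups(events, dimension, metric, threshold=2.0):
    """Group events by `dimension`, average `metric` per group, and return
    the groups whose average lies more than `threshold` standard deviations
    from the mean of all group averages."""
    groups = defaultdict(list)
    for event in events:
        # Wide events may lack fields; skip events missing either one.
        if dimension in event and metric in event:
            groups[event[dimension]].append(event[metric])
    averages = {key: mean(values) for key, values in groups.items()}
    if len(averages) < 2:
        return {}
    center = mean(list(averages.values()))
    spread = pstdev(list(averages.values()))
    if spread == 0:
        return {}
    return {key: avg for key, avg in averages.items()
            if abs(avg - center) / spread > threshold}

def scan_presets(events, dimensions, metrics):
    """Check every (dimension, metric) pair from two preset lists.
    The number of pairs grows multiplicatively, which is why checking
    all combinations of hundreds of fields is impractical."""
    findings = {}
    for dimension in dimensions:
        for metric in metrics:
            outliers = find_outlier_groups(events, dimension, metric)
            if outliers:
                findings[(dimension, metric)] = outliers
    return findings
```

Calling `scan_presets(events, ["region", "build_id"], ["duration_ms"])` would check two pairs; filters applied before or after the grouping multiply the search space further.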


honeycomb "bubble up"


That seems like a good use case for AI: it's trivial to have it suggest some queries and test if they give interesting results.


Wide events are a great concept for the observability space! They are a superset of structured logs and traces. Wide events are basically structured logs, where every log entry contains hundreds of fields with various properties of the log entry. This allows slicing and dicing the collected events by arbitrary subsets of their fields, which opens up endless possibilities for obtaining useful analytics from the collected events.

Wide events can be stored in traditional databases. But this approach has a few drawbacks:

- Every wide event can have different sets of fields. Such fields cannot be mapped to the classical relational table columns, since the full set of potential fields, which can be seen in wide events, isn't known beforehand.

- The number of fields in wide events is usually quite big: from tens to a few hundred. If we are going to store them in a traditional relational table, this table will end up with hundreds of columns. Such tables aren't processed efficiently by traditional databases.

- Typical queries over wide events usually refer to only a few fields out of the hundreds available. Traditional databases usually store every row in a table as a contiguous chunk of data with all the values for all the fields of the row (aka row-based storage). Such a scheme is very inefficient when the query needs to process only a few fields out of hundreds, since the database needs to read all the hundreds of fields for each row and then extract the few it needs.

It is much better to use analytical databases such as ClickHouse for storing and processing big volumes of wide events. Such databases usually store the values of every field in contiguous data chunks (aka column-oriented storage). This allows reading and processing only the few fields mentioned in the query, while skipping the remaining hundreds of fields. This also allows efficiently compressing field values, which reduces storage space usage and improves performance for queries limited by disk read speed.
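A toy Python sketch of the row-based vs column-oriented difference described above. It only counts how many values a query would have to touch and shows why a column of similar values compresses well; the event fields and helper names are made up, and no real database works on Python lists like this.

```python
from itertools import groupby

def to_columns(rows):
    """Pivot a list of event dicts into one list of values per field."""
    fields = sorted({name for row in rows for name in row})
    return {name: [row.get(name) for row in rows] for name in fields}

def values_read_row_store(rows):
    """A row store reads every field of every row, even if the query needs one."""
    return sum(len(row) for row in rows)

def values_read_column_store(columns, needed_fields):
    """A column store reads only the columns named in the query."""
    return sum(len(columns[name]) for name in needed_fields)

def run_length_encode(column):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

# Hypothetical wide events (real ones would have hundreds of fields).
events = [
    {"status": 200, "duration_ms": 12, "region": "us-east", "user_id": "u1"},
    {"status": 200, "duration_ms": 48, "region": "us-east", "user_id": "u2"},
    {"status": 500, "duration_ms": 950, "region": "eu-west", "user_id": "u3"},
]
columns = to_columns(events)

# Average duration needs only one column out of four.
durations = columns["duration_ms"]
average_duration = sum(durations) / len(durations)
```

Here the row layout touches 12 values to answer the query and the column layout touches 3; with hundreds of fields per event the gap grows accordingly.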

Analytical databases don't resolve the first issue mentioned above, since they usually require creating a table with pre-defined columns before storing wide events in it. This means that you cannot store wide events with arbitrary sets of fields that are unknown at table creation time.

I'm working on a specialized open-source database for wide events, which resolves all the issues mentioned above. It doesn't need any table schemas to be created before it starts ingesting wide events with arbitrary sets of fields (i.e. it is schemaless). It automatically creates the needed columns for all the fields it sees during data ingestion. It uses column-oriented storage, so it provides query performance comparable to analytical databases. The name of this database is VictoriaLogs. A strange name for a database specialized in efficient processing of wide events :) This is because it was initially designed for storing logs, both plaintext and structured. Later it turned out that its architecture fits wide events ideally. Check it out - https://docs.victoriametrics.com/victorialogs/
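To illustrate what "schemaless plus column-oriented" can mean in principle, here is a toy Python sketch. It is not how VictoriaLogs is implemented; the class and method names are hypothetical and everything lives in memory.

```python
class SchemalessColumnStore:
    """Toy in-memory store that creates a column the first time a field is seen."""

    def __init__(self):
        self.columns = {}   # field name -> list with one value per ingested event
        self.row_count = 0

    def ingest(self, event):
        for field in event:
            if field not in self.columns:
                # New field: create its column and backfill earlier rows with None.
                self.columns[field] = [None] * self.row_count
        for field, column in self.columns.items():
            # Fields missing from this event are stored as None.
            column.append(event.get(field))
        self.row_count += 1

    def select(self, field):
        """Return one column; unknown fields read as all-None."""
        return self.columns.get(field, [None] * self.row_count)


store = SchemalessColumnStore()
store.ingest({"status": 200, "path": "/cart"})
store.ingest({"status": 500, "error": "timeout"})  # no schema change needed
```

After these two calls the store holds three columns (`status`, `path`, `error`), each with two values, and a query on `status` never touches the other two.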


Thoughts on stuff like ClickHouse with JSON column support? Less upfront knowledge of columns needed.


It is a great step, but in my testing with the new JSON type if you use beyond 255 unique json locations/types (255 max_dynamic_types in their config) you will fall back to much worse performance for certain queries and aggregations. This is quite easy to hit with some of the suggestions in this blog post, especially if you are designing for multi-tenant use.

For this clickhouse wide event lib I'm working on (not worth anyone's time atm) I am still using this schema https://www.val.town/v/maxm/wideLib#L34-39 (which is from a Boris Tane talk https://youtu.be/00gW8txIP5g?t=801) for good multi-tenant performance.

I hope clickhouse performance here can still be vastly improved, but I think it is a little awkward to get optimal performance with wide events today.


A small question on the schema, I noticed that you have only “_now” as the Order By (so should just use that for the primary key). Do you expect any cross tenant queries?

Just my feeling would be that I’d add the tenant ID before the timestamp as it should filter the parts more effectively


Yes, I think you are correct. In the video Boris/Baselime uses (_tenantId, _traceId, _timestamp). Will update that :)
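A simplified Python model of why putting the tenant ID before the timestamp in the sort key helps per-tenant queries. It treats the table as a sorted list and uses binary search; ClickHouse's actual index works on granules and parts, so this is only an illustration under that assumption, with made-up tenant names.

```python
from bisect import bisect_left, bisect_right

# Hypothetical data: 3 tenants, one event per tenant at timestamps 0..9.
rows = [(tenant, ts) for ts in range(10) for tenant in ("a", "b", "c")]

def rows_scanned_tenant_first(rows, tenant, start, end):
    """Sort key (tenant, timestamp): a tenant's time range is one contiguous slice."""
    ordered = sorted(rows)
    lo = bisect_left(ordered, (tenant, start))
    hi = bisect_right(ordered, (tenant, end))
    return hi - lo

def rows_scanned_time_first(rows, tenant, start, end):
    """Sort key (timestamp) only: all tenants are interleaved inside the
    time range, so every row in the range must be read and then filtered."""
    timestamps = sorted(ts for _, ts in rows)
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return hi - lo

scanned_tenant_first = rows_scanned_tenant_first(rows, "b", 2, 5)
scanned_time_first = rows_scanned_time_first(rows, "b", 2, 5)
```

For tenant "b" and timestamps 2 through 5, the tenant-first ordering reads 4 rows and the time-only ordering reads 12. Cross-tenant queries over a time range see the opposite trade-off, which is why the question about them matters.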


Clickhouse's revised JSON type is still quite new (in beta currently), but I'm hopeful for it. Their first attempt fell apart if the schema changed.

[1] https://clickhouse.com/blog/a-new-powerful-json-data-type-fo...


JSON column type in ClickHouse [1] looks promising, since it allows storing wide events with arbitrary sets of fields. This feature is still in beta. Let's see how it will evolve.

[1] https://clickhouse.com/docs/en/sql-reference/data-types/newj...


ClickHouse is open core too. If you care about that.


How is that a "superset" ? From what I gather, it's... just a "JSON-formatted log"? They just decide to put as much data in it as they can and decide that it should be called a "wide event", but it makes no sense... it's just a regular JSON-formatted log, with all the data inside, nothing new?


Any fully open source tool that can store "wide events"?


VictoriaLogs - it is fully open source released under Apache2 license, and it can efficiently store and query these "plain old JSONs" with hundreds of fields (aka wide events) - https://docs.victoriametrics.com/victorialogs/keyconcepts/#d...


Thanks. Yeah, I was planning to try it but I was wondering since the parent's comment was saying it's just a "JSON-formatted log".


That's what I infer from their description? They just seem to have slapped a fancy name on it and say: just add as many data points to your JSON as you can, and call that a "wide event", but I don't see how it's not "just" a JSON-formatted log?


It is “just” a JSON-formatted log (or any other format really, just a set of keys and values).

However the practice of collecting a lot of context per-transaction / unit-of-work and emitting that as one piece of data, storing it in a place that can quickly query across these and visualize is not very common across most orgs and teams. Feedback on this article has been a mix of “I’ve never heard of this before” and “we’ve been doing this for a decade, didn’t know anyone had a name for it” with not a lot of in-between.

It’s not a new idea, which I call out in the intro. It’s not even a very fancy idea. It is really, really helpful if implemented though. Modern OLAP column stores help a lot here too since they make this type of exploration cheap and quick.


Wide events are plain old structured logs with many fields. They can be represented as JSON. They also can be represented as logfmt, protobuf or any other format suitable for representing a set of (field=value) pairs.

The novelty of wide events is that it is recommended to:

- emit one event (log entry) per processed request. Previously it was OK to emit many logs per request, which could complicate debugging and analyzing such logs.

- don't be afraid to add fields to the event if these fields can help with debugging and/or analyzing the logs. Previously it was recommended to artificially limit the number of log fields to some small value, which could hinder debugging and analysis later.
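The two recommendations above can be sketched in Python roughly like this: accumulate fields while the request is processed, then emit a single event at the end. `WideEvent`, `handle_request` and the field names are hypothetical, and the logfmt encoder is deliberately simplified.

```python
import json
import time

class WideEvent:
    """Collects fields during one unit of work; emitted once at the end."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self._start = time.monotonic()

    def add(self, **fields):
        self.fields.update(fields)

    def finish(self):
        elapsed = time.monotonic() - self._start
        self.fields["duration_ms"] = round(elapsed * 1000, 3)
        return self.fields

def to_json(fields):
    return json.dumps(fields, sort_keys=True, default=str)

def to_logfmt(fields):
    """Simplified logfmt: key=value pairs, quoting values that need it."""
    parts = []
    for key, value in fields.items():
        text = str(value)
        if text == "" or any(ch in text for ch in ' "='):
            text = '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'
        parts.append(f"{key}={text}")
    return " ".join(parts)

def handle_request(request, emit=print):
    event = WideEvent(method=request["method"], path=request["path"])
    try:
        event.add(user_id=request.get("user_id"))
        # ... real work would happen here, adding fields as it goes ...
        event.add(status=200, cache_hit=False)
    except Exception as exc:
        event.add(status=500, error=repr(exc))
        raise
    finally:
        # Exactly one event per request, whether it succeeded or failed.
        emit(to_json(event.finish()))
```

The same dict can be passed to `to_logfmt` instead of `to_json`; the format matters less than having all the context in one record.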


Can you emit only 1 event per request when you use the OpenTelemetry Collector, or do you emit a bunch of spans? I think I saw that VictoriaLogs supports the otel collector.


You can emit as many events per request as you wish, and store all of them in VictoriaLogs. But it is recommended to emit a single "master" event with all the context and useful information about the request in separate fields, which can then help with debugging and analyzing the processed requests.


One event per request seems okay in some monitoring scenarios, but in debugging scenarios you might want to instrument many functions


One main event per request with a lot of context does not mean that’s the only data you emit. You can emit “normal” log lines too, or use granular log levels, etc.

However for understanding how your system or your users are behaving, querying the wide or “main” events will be far better as entry points for exploration.


Tldr; just use slog package (structured logs) to log everything and then visualize.


This works only for the Go language, which provides the slog package ( https://go.dev/blog/slog ). What about other programming languages?


Structured logging is very common in many languages. Off the top of my head, in C#/.NET Serilog has been doing structured logging for a long time. Modern .NET has structured logging in its standard lib. Rust’s tracing lib also supports proper structured logging.
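Python can do it with the standard library too. A minimal sketch, assuming a custom formatter that turns fields passed via `extra=` into JSON keys; `JsonFormatter`, `make_logger` and the logger name are made-up names for this example.

```python
import io
import json
import logging

# Attribute names every LogRecord has; anything else came in through `extra=`.
_RESERVED = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__)
_RESERVED |= {"message", "asctime"}

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object including the extra fields."""

    def format(self, record):
        payload = {"level": record.levelname, "msg": record.getMessage()}
        payload.update(
            {key: value for key, value in record.__dict__.items()
             if key not in _RESERVED}
        )
        return json.dumps(payload, sort_keys=True, default=str)

def make_logger(stream):
    logger = logging.getLogger("wide_events_example")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger

stream = io.StringIO()
logger = make_logger(stream)
logger.info("request handled", extra={"status": 200, "path": "/checkout"})
```

The line written to `stream` is a JSON object with `level`, `msg`, `path` and `status` keys, so it can be shipped to whatever store you query from.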


Practitioner of what? What is a "wide event"? In what context is this concept relevant? It took several sentences before I was even confident that this is something to do with programming.


They link to three separate articles right at the start that cover all of this. Not every article needs to start from first principles. You wouldn't expect an article about a new Postgres version to start with what databases are and why someone would need them.


>Not every article needs to start from first principles.

Sure, but it would be nice if title submissions made it feasible to predict the topic category of the article for people who are not already in the relevant niche.


Wide events are a very well known approach, especially if you do any work with observability, and articles about it have been on the HN front page too. You not knowing about something does not automatically make it a narrow niche.


From my point of view, "it has something to do with web dev" already makes it a niche. And as a rule of thumb, if you're using letter-number-letter abbreviations like "o11y" and assuming everyone knows what you're talking about, you're in a niche. (E.g.: I could parse "i18n" and "l10n" already, but I wouldn't expect random HN readers to. When I first saw "k8s" and looked it up I thought "man, really?".)


None of this is web dev specific. It applies most strongly to distributed systems, of which web systems are a subset, but in principle it can apply to any system with non-trivial requirements around logging and metrics.


I felt like I got the gist after the first two:

> Adopting Wide Event-style instrumentation has been one of the highest-leverage changes I’ve made in my engineering career. The feedback loop on all my changes tightened and debugging systems became so much easier.


That doesn’t really give an objective definition of what wide events are, just an opinion and an example from this one person’s life.

I had to look up wide events in the middle of the article, and I can’t say I can viscerally see and feel the benefits the OP was espousing. Just felt like an adderall-fueled dump of information being thrown at me.


>I felt like I got the gist after the first two:

What I get is: here's a thing that made a big improvement to how I debug systems.

Except, it turns out that the systems in question are very specific ones.

> The tl;dr is that for each unit-of-work in your system (usually, but not always an HTTP request / response) you emit one “event” with all of the information you can collect about that work.

Okay, but... as opposed to what? And why is it better this way?

>“Event” is an over-loaded term in telemetry so replace that with “log line” or “span” if you like. They are all effectively the same thing.

In the programming I do, "event" doesn't mean anything to do with logging or telemetry.


It’s about observability and strongly related to Honeycomb’s o11y 2.0 vision.


Okay, so a web search and some looking around gives me https://www.honeycomb.io/frontend-observability. I guess this is something to do with tools for sending telemetry back from web applications and then doing statistics on them and giving the user some nice reports.

"Observability" seems like a weird term for that to me, but okay.

But I don't understand why not just give the appropriate context in the submission, rather than keeping a title that only makes sense to a very specific niche audience and then not saying up front what the niche is.

The concept of an "event" is coherent in many other programming contexts, so the possibility that one could be coherently "wide" is at least plausibly interesting. But then I get there and find myself completely disoriented, and eventually figure out that it's not actually relevant to anything I do. And anyway it looks like a lot of this jargon is really just not necessary to convey the core ideas... ?


If the entire contents of the article was in the title, you’d still have to read all the words


If the title had said something like "A guide to using Wide Events in website telemetry for [insert objective here]", I wouldn't have had the original objection.


Wide events aren't limited to website analytics. They are useful for observability of any type of application - databases, services, microservices, web servers, application servers, mobile apps, industrial apps, IoT, etc.


"[An Observability] Practitioner's Guide to Wide Events"

That's how I would have titled it.


Okay, and why would people who aren't already in the field have any idea about your specific jargon meaning of "observability"? My browser's spellcheck underlines that. My understanding of ordinary English turns it into "the fact, of something which can be observed, that it can be observed" which is... supremely unenlightening.

I get that HN isn't appealing to the general population, but the world of programmers etc. is still quite broad.


You’ve made a lot of critical comments here.

You are obviously the one who is not understanding or is perhaps misunderstanding something.

Observability is a pretty standard term in software development.

Events have nothing per se to do with logging or tracing, but you can visualize/trace events with logs/spans.

From my perspective, you seem to misunderstand a lot in the article, I am not judging you for that, just observing this.

I suggest you try to understand the gist of the article instead of scolding the language used.


You're missing the point. My complaint is not about the article content. My complaint is about the fact that the submission title does not adequately prepare anyone to understand what the article will be about.


That’s a recurring theme on HN. The site prefers the original title, and not every blog post has a title that adequately prepares one for the contents, especially since many blogs have a recurring theme.


I had a very fine idea about what the article would be about from reading the title.

You’re being unreasonable about this IMO.


I don't understand why I should have had any such idea, given that I've been programming for thirty-five years and yesterday was literally the first day I even heard (well, saw) the word "observability" used this way, never mind that it isn't in the title. I also already suggested an alternative and I don't see what would be wrong with it.


Again this pattern that “you don’t understand, so others should change something”.

Again: not judging, just observing.

Consider that you are perhaps the minority ¯\_(ツ)_/¯


It seems to be the primary meaning in software: https://en.wikipedia.org/wiki/Observability_(software)


you just read an advertisement article and some people don't like you pointing that out. hence the downvotes i assume


While the article is written by an observability vendor, it contains excellent information about wide events, without annoying advertising for the vendor.


It is not written by an observability vendor, nor is it an advertisement. Source: I wrote it and do not work at an observability vendor: https://jeremymorrell.dev/about/

(I wrote it mostly so I could stop re-explaining this concept from first principles, and how to go about implementing it, over and over again)



