I’m quite looking forward to a future where we’ve finally accepted that all this stuff is just part of the domain and shouldn’t be treated like an ugly stepchild, and we’ve merged OLTP and OLAP with great performance for both, and the wolf also shall dwell with the lamb, and we’ll all get lots of work done.
Wide events are good, but watch out they don't become "god events". The event that every service needs to ingest, and, therefore, if there's new data that a service needs then we just add it onto the god event, because, conveniently, it's already being ingested. Before too long, the query that generates the wide event is getting so complex it's setting the db on fire. Like anything, there are trade offs; practical limits to how wide an event should reasonably become.
Maybe I’m missing something, but this doesn’t seem like what the article is talking about at all. These events are just telemetry — they’re downstream from everything, and no service is ingesting them or relying on them for actual operational data.
i wonder if there are any semi automated approaches to finding outliers or “things worth investigating” in these traces, or is it just eyeballs all the way down?
This is possible by semi-automatic detection of anomalies over time for some preset of fields used for grouping the events (aka dimensions) and another preset of fields used in stats calculations (aka metrics). In the general case this is a hard task to solve, since it is impossible to check for anomalies across all the possible combinations of dimensions and metrics for wide events with hundreds of fields.
This is further complicated by the ability to apply various filters to the events before and after the stats calculations.
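To make the dimensions/metrics idea concrete, here is a rough sketch for a single (dimension, metric) pair. The function name, the field names and the z-score threshold are all made up for illustration; a real system would also need time windows and a more robust statistic.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_outlier_groups(events, dimension, metric, threshold=1.5):
    """Group events by one dimension and flag groups whose mean metric
    is more than `threshold` standard deviations away from the mean of
    all group means. Hypothetical helper, not part of any tool."""
    groups = defaultdict(list)
    for event in events:
        # Wide events may lack either field, so skip incomplete ones.
        if dimension in event and metric in event:
            groups[event[dimension]].append(event[metric])

    group_means = {key: mean(values) for key, values in groups.items()}
    if len(group_means) < 2:
        return {}

    overall = mean(group_means.values())
    spread = pstdev(group_means.values())
    if spread == 0:
        return {}

    return {
        key: value
        for key, value in group_means.items()
        if abs(value - overall) / spread > threshold
    }


# Assumed sample data: four healthy regions and one slow one.
events = [
    {"region": region, "duration_ms": 900 if region == "e" else 100}
    for region in ["a", "b", "c", "d", "e"]
    for _ in range(2)
]
slow_regions = find_outlier_groups(events, "region", "duration_ms")
```

Running this for every combination of dimensions and metrics is exactly what doesn't scale, which is why the fields to check usually have to be preselected.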
Wide events are a great concept for the observability space! They are a superset of structured logs and traces. Wide events are basically structured logs where every log entry contains hundreds of fields with various properties of the log entry. This allows slicing and dicing the collected events by arbitrary subsets of their fields, which opens up endless possibilities for obtaining useful analytics from the collected events.
Wide events can be stored in traditional databases. But this approach has a few drawbacks:
- Every wide event can have a different set of fields. Such fields cannot be mapped to classical relational table columns, since the full set of potential fields that can appear in wide events isn't known beforehand.
- The number of fields in wide events is usually quite big - from tens to a few hundred. If we are going to store them in a traditional relational table, this table will end up with hundreds of columns. Such tables aren't processed efficiently by traditional databases.
- Typical queries over wide events usually refer to only a few fields out of the hundreds available. Traditional databases usually store every row in a table as a contiguous chunk of data with all the values for all the fields of the row (aka row-based storage). Such a scheme is very inefficient when the query needs to process only a few fields out of hundreds, since the database needs to read all of the hundreds of fields for each row and then extract the few that are needed.
It is much better to use analytical databases such as ClickHouse for storing and processing big volumes of wide events. Such databases usually store the values of every field in contiguous data chunks (aka column-oriented storage). This allows reading and processing only the few fields mentioned in the query, while skipping the remaining hundreds of fields. This also allows efficiently compressing field values, which reduces storage space usage and improves performance for queries limited by disk read speed.
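A toy in-memory illustration of the difference between the two layouts, in case it helps. The field names and helper functions are invented for this sketch, and real databases add on-disk blocks, compression and indexes on top of this.

```python
# Row-oriented: each event is stored as one unit with all of its fields.
rows = [
    {"status": 200, "duration_ms": 12, "user_id": "u1", "region": "eu"},
    {"status": 500, "duration_ms": 340, "user_id": "u2", "region": "us"},
    {"status": 200, "duration_ms": 18, "user_id": "u3", "region": "eu"},
]


def to_columns(rows):
    """Pivot row dicts into one list per field (column-oriented layout).
    Fields missing from a row are stored as None."""
    fields = []
    for row in rows:
        for name in row:
            if name not in fields:
                fields.append(name)
    return {name: [row.get(name) for row in rows] for name in fields}


def mean_duration_row_store(rows):
    # Models a row layout: every whole row has to be visited,
    # even though the query needs a single field.
    return sum(row["duration_ms"] for row in rows) / len(rows)


def mean_duration_column_store(columns):
    # Models a column layout: only the one needed column is touched.
    values = columns["duration_ms"]
    return sum(values) / len(values)


columns = to_columns(rows)
```

Both functions return the same answer; the point is only how much data each layout has to touch to get there.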
Analytical databases don't resolve the first issue mentioned above, since they usually require creating a table with predefined columns before storing wide events in it. This means that you cannot store wide events with arbitrary sets of fields that are unknown at table creation time.
I'm working on a specialized open-source database for wide events, which resolves all the issues mentioned above. It doesn't require creating any table schemas before ingesting wide events with arbitrary sets of fields (i.e. it is schemaless). It automatically creates the needed columns for all the fields it sees during data ingestion. It uses column-oriented storage, so it provides query performance comparable to analytical databases. The name of this database is VictoriaLogs. Strange name for a database specialized for efficient processing of wide events :) This is because initially it was designed for storing logs - both plaintext and structured. Later it turned out that its architecture ideally fits wide events. Check it out - https://docs.victoriametrics.com/victorialogs/
It is a great step, but in my testing with the new JSON type if you use beyond 255 unique json locations/types (255 max_dynamic_types in their config) you will fall back to much worse performance for certain queries and aggregations. This is quite easy to hit with some of the suggestions in this blog post, especially if you are designing for multi-tenant use.
A small question on the schema, I noticed that you have only “_now” as the Order By (so should just use that for the primary key). Do you expect any cross tenant queries?
Just my feeling would be that I’d add the tenant ID before the timestamp as it should filter the parts more effectively
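A toy model of why I'd expect that, assuming data is stored in sorted blocks and a query can skip blocks that contain no rows for the tenant. This is a simplification for illustration, not how ClickHouse actually implements its index, and all names here are made up.

```python
def granules(rows, sort_key, size):
    """Sort rows by the given key and split them into fixed-size blocks."""
    ordered = sorted(rows, key=sort_key)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def granules_touched(blocks, tenant):
    """Count the blocks that hold at least one row for the tenant,
    i.e. the blocks a tenant-filtered query could not skip."""
    return sum(
        1 for block in blocks
        if any(row["tenant"] == tenant for row in block)
    )


# Assumed data: 4 tenants writing interleaved events over time.
rows = [{"tenant": f"t{i % 4}", "ts": i} for i in range(16)]

by_time = granules(rows, lambda r: r["ts"], 4)
by_tenant_time = granules(rows, lambda r: (r["tenant"], r["ts"]), 4)

touched_time_only = granules_touched(by_time, "t1")
touched_tenant_first = granules_touched(by_tenant_time, "t1")
```

With time-only ordering every block contains rows from every tenant, so nothing can be skipped for a single-tenant query; with the tenant first, that tenant's rows sit together.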
JSON column type in ClickHouse [1] looks promising, since it allows storing wide events with arbitrary sets of fields. This feature is still in beta. Let's see how it will evolve.
How is that a "superset" ? From what I gather, it's... just a "JSON-formatted log"? They just decide to put as much data in it as they can and decide that it should be called a "wide event", but it makes no sense... it's just a regular JSON-formatted log, with all the data inside, nothing new?
that's what I infer from their description? They just seem to have slapped a fancy name on it and say: just add as many data points to your JSON as you can, and call that a "wide event", but I don't see how it's not "just" a JSON-formatted log?
It is “just” a JSON-formatted log (or any other format really, just a set of keys and values).
However the practice of collecting a lot of context per-transaction / unit-of-work and emitting that as one piece of data, storing it in a place that can quickly query across these and visualize is not very common across most orgs and teams. Feedback on this article has been a mix of “I’ve never heard of this before” and “we’ve been doing this for a decade, didn’t know anyone had a name for it” with not a lot of in-between.
It’s not a new idea, which I call out in the intro. It’s not even a very fancy idea. It is really, really helpful if implemented though. Modern OLAP column stores help a lot here too since they make this type of exploration cheap and quick.
Wide events are plain old structured logs with many fields. They can be represented as JSON. They also can be represented as logfmt, protobuf or any other format suitable for representing a set of (field=value) pairs.
The novelty of wide events is that it is recommended to:
- emit a single event (log entry) per processed request. Previously it was OK to emit many logs per request. This could complicate debugging and analyzing such logs.
- don't be afraid to add fields to the event if these fields can help with debugging and/or analyzing the logs. Previously it was recommended to artificially limit the number of log fields to some small value. This could prevent debugging and analyzing such logs in the future.
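In code, the two recommendations above could look roughly like this. It is a minimal sketch: the `WideEvent` class, `handle_request` and all field names are hypothetical, and in practice you would hand the line to your logging or telemetry pipeline instead of `print`.

```python
import json
import time


class WideEvent:
    """Collects fields during one unit of work and emits them once."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self._start = time.monotonic()

    def add(self, **fields):
        # Add as much context as is useful; there is no field limit.
        self.fields.update(fields)

    def emit(self, write=print):
        elapsed = time.monotonic() - self._start
        self.fields["duration_ms"] = round(elapsed * 1000, 3)
        line = json.dumps(self.fields, sort_keys=True)
        write(line)
        return line


def handle_request(path, user_id, write=print):
    event = WideEvent(http_path=path, user_id=user_id)
    try:
        # Placeholder for the real work; each step adds its own context.
        event.add(cache_hit=False, db_query_count=2)
        event.add(http_status=200)
    except Exception as exc:
        event.add(http_status=500, error=repr(exc))
        raise
    finally:
        # Exactly one event per request, on success or failure.
        event.emit(write)


lines = []
handle_request("/checkout", "u42", write=lines.append)
```

The `finally` block is the important part: the event goes out exactly once, including when the handler raises.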
Can you emit only 1 event per request when you use the OpenTelemetry collector, or do you emit a bunch of spans? I think I saw that VictoriaLogs supports the otel collector.
You can emit as many events per request as you wish, and store all of them in VictoriaLogs. But it is recommended to emit a single "master" event with all the context and useful information about the request in separate fields, which can then help with debugging and analyzing the processed requests.
One main event per request with a lot of context does not mean that’s the only data you emit. You can emit “normal” log lines too, or use granular log levels, etc.
However for understanding how your system or your users are behaving, querying the wide or “main” events will be far better as entry points for exploration.
Structured logging is very common in many languages. Off the top of my head, in C#/.NET Serilog has been doing structured logging for a long time. Modern .NET has structured logging in its standard lib. Rust’s tracing lib also supports proper structured logging.
Practitioner of what? What is a "wide event"? In what context is this concept relevant? It took several sentences before I was even confident that this is something to do with programming.
They link to three separate articles right at the start that cover all of this. Not every article needs to start from first principles. You wouldn't expect an article about a new Postgres version to start with what databases are and why someone would need them.
>Not every article needs to start from first principles.
Sure, but it would be nice if title submissions made it feasible to predict the topic category of the article for people who are not already in the relevant niche.
Wide events are a very well known approach, especially if you do any work with observability, and articles about it have been on the HN front page too. You not knowing about something does not automatically make it a narrow niche.
From my point of view, "it has something to do with web dev" already makes it a niche. And as a rule of thumb, if you're using letter-number-letter abbreviations like "o11y" and assuming everyone knows what you're talking about, you're in a niche. (E.g.: I could parse "i18n" and "l10n" already, but I wouldn't expect random HN readers to. When I first saw "k8s" and looked it up I thought "man, really?".)
None of this is web dev specific. It applies most strongly to distributed systems, of which web systems are a subset, but in principle it can apply to any system with non-trivial requirements around logging and metrics.
> Adopting Wide Event-style instrumentation has been one of the highest-leverage changes I’ve made in my engineering career. The feedback loop on all my changes tightened and debugging systems became so much easier.
That doesn’t really give an objective definition of what wide events are, just an opinion and example from this one person’s life.
I had to look up wide events in the middle of the article, and I can’t say I can viscerally see and feel the benefits the OP was espousing. Just felt like an adderall-fueled dump of information being thrown at me.
What I get is: here's a thing that made a big improvement to how I debug systems.
Except, it turns out that the systems in question are very specific ones.
> The tl;dr is that for each unit-of-work in your system (usually, but not always an HTTP request / response) you emit one “event” with all of the information you can collect about that work.
Okay, but... as opposed to what? And why is it better this way?
>“Event” is an over-loaded term in telemetry so replace that with “log line” or “span” if you like. They are all effectively the same thing.
In the programming I do, "event" doesn't mean anything to do with logging or telemetry.
Okay, so a web search and some looking around gives me https://www.honeycomb.io/frontend-observability. I guess this is something to do with tools for sending telemetry back from web applications and then doing statistics on them and giving the user some nice reports.
"Observability" seems like a weird term for that to me, but okay.
But I don't understand why not just give the appropriate context in the submission, rather than keeping a title that only makes sense to a very specific niche audience and then not saying up front what the niche is.
The concept of an "event" is coherent in many other programming contexts, so the possibility that one could be coherently "wide" is at least plausibly interesting. But then I get there and find myself completely disoriented, and eventually figure out that it's not actually relevant to anything I do. And anyway it looks like a lot of this jargon is really just not necessary to convey the core ideas... ?
If the title had said something like "A guide to using Wide Events in website telemetry for [insert objective here]", I wouldn't have had the original objection.
Wide events aren't limited to website analytics. They are useful for observability of any type of application - databases, services, microservices, web servers, application servers, mobile apps, industrial apps, IoT, etc.
Okay, and why would people who aren't already in the field have any idea about your specific jargon meaning of "observability"? My browser's spellcheck underlines that. My understanding of ordinary English turns it into "the fact, of something which can be observed, that it can be observed" which is... supremely unenlightening.
I get that HN isn't appealing to the general population, but the world of programmers etc. is still quite broad.
You're missing the point. My complaint is not about the article content. My complaint is about the fact that the submission title does not adequately prepare anyone to understand what the article will be about.
That’s a recurring theme on HN. The site prefers the original title, and not every blog post has a title that adequately prepares one for the contents, especially since many blogs have a recurring theme.
I don't understand why I should have had any such idea, given that I've been programming for thirty-five years and yesterday was literally the first day I even heard (well, saw) the word "observability" used this way, never mind that it isn't in the title. I also already suggested an alternative and I don't see what would be wrong with it.
While the article is written by an observability vendor, it contains excellent information about wide events, without annoying advertising for the vendor.
It is not written by an observability vendor, nor is it an advertisement. Source: I wrote it and do not work at an observability vendor: https://jeremymorrell.dev/about/
(I wrote it mostly so I could stop re-explaining this concept from first principles, and how to go about implementing it, over and over again)