I work in gaming and stream events into a self-hosted Clickhouse db without Kafk...

maccard · 2024-10-23T13:24:39 1729689879

How do you ensure retries, single entries, and not losing 100k entries if your app goes down?

It's also kind of a bummer that inserts have to be batched, when the tagline on ClickHouse's website is:

> Build real-time data products that scale

But, thanks for the clarification!

nrjames · 2024-10-23T14:14:42 1729692882

We do overlapping inserts and let ReplacingMergeTree remove the duplicates. You can use the FINAL modifier on SELECT queries if you're concerned about queries returning duplicates before ClickHouse's background merge has performed the deduplication.
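
To make the overlapping-insert idea concrete, here is a rough Python simulation of the deduplication, not ClickHouse itself. The `(event_id, version, payload)` schema and the `replacing_merge` helper are made up for illustration; the assumption is that `event_id` plays the role of the table's sorting key.

```python
from typing import Dict, List, Tuple

# Hypothetical row shape: (event_id, version, payload).
Row = Tuple[str, int, str]


def replacing_merge(rows: List[Row]) -> List[Row]:
    """Keep one row per event_id: the one with the highest version.

    On a version tie the row inserted last wins, which is roughly how
    ReplacingMergeTree treats rows that share a sorting key.
    """
    latest: Dict[str, Row] = {}
    for row in rows:
        event_id, version, _ = row
        current = latest.get(event_id)
        if current is None or version >= current[1]:
            latest[event_id] = row
    return sorted(latest.values())


# Two overlapping batches: e2 and e3 were sent twice after a retry.
batch_1 = [("e1", 1, "login"), ("e2", 1, "kill"), ("e3", 1, "logout")]
batch_2 = [("e2", 1, "kill"), ("e3", 1, "logout"), ("e4", 1, "login")]

stored = batch_1 + batch_2              # what is on disk before any merge
deduplicated = replacing_merge(stored)  # what a merge (or FINAL) would show
```

Until the merge happens a plain SELECT sees all of `stored`, duplicates included, which is the window that FINAL covers.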

maccard · 2024-10-23T14:38:33 1729694313

Ah, great. Thanks for the info!

hedora · 2024-10-23T14:27:25 1729693645

I've seen a few solutions in this space that use an RDBMS as a glorified spool file. So, append log entries to PG or MySQL or whatever over a REST endpoint (like the one Splunk exposes to writers), and then have a few workers (for fault tolerance) that read the 100K oldest entries in the table every few seconds, stick them into the "real-time" system, delete them from the DBMS and commit.
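
A minimal sketch of that spool pattern, using SQLite from the Python standard library as a stand-in for PG or MySQL. The table name `spool`, the helper names, and the `deliver` callback (whatever pushes a batch into the downstream system) are all hypothetical.

```python
import sqlite3
from typing import Callable, List


def open_spool(path: str = ":memory:") -> sqlite3.Connection:
    """Create the spool table if needed and return a connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spool ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def append(conn: sqlite3.Connection, payload: str) -> None:
    """Writer side: durably record one log entry."""
    conn.execute("INSERT INTO spool (payload) VALUES (?)", (payload,))
    conn.commit()


def drain_once(
    conn: sqlite3.Connection,
    deliver: Callable[[List[str]], None],
    batch_size: int = 100_000,
) -> int:
    """Worker side: ship the oldest entries, then delete them and commit.

    If deliver() raises, nothing is deleted, so the batch is retried later.
    """
    rows = conn.execute(
        "SELECT id, payload FROM spool ORDER BY id LIMIT ?", (batch_size,)
    ).fetchall()
    if not rows:
        return 0
    try:
        deliver([payload for _, payload in rows])
        conn.execute("DELETE FROM spool WHERE id <= ?", (rows[-1][0],))
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    return len(rows)
```

Because the delete only commits after delivery, a crash in between means the same batch is delivered again: at-least-once, so the downstream side still has to tolerate duplicates. Running several workers against one table would also need some form of row claiming, which this sketch leaves out.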

I've never understood why this isn't just done better by the downstream product though. It's not that hard to implement a performant write ahead log from scratch.

(Note that you can scale out the above arbitrarily, since there's no reason to limit yourself to one worker or one DBMS.)

aseipp · 2024-10-23T15:11:25 1729696285

Use something like https://vector.dev which can put up an HTTP endpoint you can submit entries to, and it will batch and submit them to ClickHouse on your behalf and do all the buffering and other stuff. Vector is extremely reliable in my experience but I don't know the size of your operation. Vector can also do a lot of other stuff for you.
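
For a feel of what "batch and submit on your behalf" means, here is a toy size-or-age buffer in Python. It is only a sketch of the general idea, not how Vector is implemented; `BatchBuffer`, its parameters, and the `flush` callback are invented for the example.

```python
import time
from typing import Callable, List, Optional


class BatchBuffer:
    """Collect events and flush when the batch is big enough or old enough."""

    def __init__(
        self,
        flush: Callable[[List[dict]], None],
        max_events: int = 100_000,
        max_age_secs: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_events = max_events
        self._max_age_secs = max_age_secs
        self._clock = clock
        self._events: List[dict] = []
        self._oldest: Optional[float] = None  # arrival time of first buffered event

    def add(self, event: dict) -> None:
        if not self._events:
            self._oldest = self._clock()
        self._events.append(event)
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush if a limit is hit; call this periodically from a timer too."""
        if not self._events:
            return False
        too_big = len(self._events) >= self._max_events
        too_old = self._clock() - self._oldest >= self._max_age_secs
        if not (too_big or too_old):
            return False
        # If flush raises, the events stay buffered so a later call can retry.
        self._flush(self._events)
        self._events = []
        self._oldest = None
        return True
```

Everything here lives in memory, so a crash loses whatever is buffered. That is the argument for putting the buffer in a separate, more boring process rather than inside the app.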

Realistically I think ClickHouse's features count as real-time, batching or not. The thing is, there is a cost to inserting things; it's a question of what the cost is. ClickHouse has a lot of overhead for an insert, and very little overhead for large OLAP queries, so amortizing the overhead with big writes is important. That's just a design tradeoff. Let's say you have 1 million events a second and you batch at 100k. Then your traffic grows to 10 million events a second. Does that mean you need 10x as long to see the data? No, you can just scale out the writes by standing up new nodes and scale them up by doing larger batches. In contrast, systems that excel at single point queries and transactional inserts are probably not going to handle 10x (relative) larger inserts and 10x as many writers as well -- or, they will not handle it as well, for as long, and will need more care. For reference I have done tens and hundreds of billions of rows on a commodity homeserver with ease, something Postgres isn't going to handle as well (I have pushed Postgres to about 3 billion rows.)
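
A back-of-the-envelope version of the amortization argument, in Python. The cost model (a fixed overhead per insert plus a small cost per row) and both numbers are assumptions picked for illustration, not measurements of ClickHouse.

```python
import math

# Hypothetical costs, chosen only to show the shape of the tradeoff.
INSERT_OVERHEAD_S = 0.005   # fixed cost paid once per INSERT
PER_ROW_S = 0.000001        # marginal cost of each row in the INSERT


def rows_per_second(batch_size: int) -> float:
    """Throughput of one writer that inserts batches back to back."""
    return batch_size / (INSERT_OVERHEAD_S + batch_size * PER_ROW_S)


def writers_needed(events_per_second: int, batch_size: int) -> int:
    """How many such writers it takes to keep up with the event rate."""
    return math.ceil(events_per_second / rows_per_second(batch_size))


single_row = rows_per_second(1)        # overhead dominates
big_batch = rows_per_second(100_000)   # overhead is amortized away

writers_at_1m = writers_needed(1_000_000, 100_000)
writers_at_10m = writers_needed(10_000_000, 100_000)
```

Under these made-up costs, row-at-a-time inserts manage a couple of hundred rows a second while 100k batches get close to a million, and going from 1 million to 10 million events a second means adding writers rather than waiting longer for the data.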

In this kind of setup, losing some events occasionally isn't ideal, and you should try to stop it, but it will happen. More importantly, at large scale, you'll only be able to sample subsets of the data to get answers in a reasonable time anyway, so your answers will become increasingly approximate over time. In a system of 1 trillion rows, does 100k rows missing matter when you already sample 10% of the dataset via SELECT ... FROM xyz SAMPLE 0.1? This is an important question to ask.
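
To put rough numbers on that question, here is a small Python calculation. It assumes the lost rows look like the rest of the data, treats SAMPLE 0.1 as a simple random sample (an approximation), and picks a hypothetical filter that matches 1% of rows.

```python
import math

total_rows = 1_000_000_000_000  # 1 trillion rows
missing_rows = 100_000          # one lost batch
sample_fraction = 0.1           # SELECT ... SAMPLE 0.1
match_rate = 0.01               # hypothetical: 1% of rows match the filter

# Relative error on a count caused by the lost batch, assuming the lost
# rows are representative of the whole table.
missing_relative_error = missing_rows / total_rows

# Relative standard error of the same count when it is estimated from a
# 10% simple random sample (finite population correction ignored).
sampled_rows = total_rows * sample_fraction
sampling_relative_error = (
    math.sqrt(match_rate * (1 - match_rate) / sampled_rows) / match_rate
)

ratio = sampling_relative_error / missing_relative_error
```

With these assumptions the noise from sampling is a couple of orders of magnitude larger than the effect of the lost batch. The picture changes if the lost rows are concentrated in exactly the slice you are querying.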

Most of the time you can get data through the pipeline quickly, in seconds (more than enough to spot problems) and you can use tools like ReplacingMergeTree or AggregatingMergeTree in order to scale up your write throughput in the event of multiple writers. Again, at large scale, duplicate rows (no exactly once delivery) are mostly just statistical noise, and they are ultimately idempotent because ClickHouse will merge them together anyway. Someone else already mentioned FINAL here. There are tricky parts to running any big system at scale but, yeah.

If you actually need sub-second or millisecond-level latency and you can't stand to lose even a single event, then you need to look into streaming solutions like using https://nats.io/ combined with Materialize or Feldera, which completely reframe the problem as an incremental computation problem rather than an analytical OLAP system that addresses scale through mass parallelism.

If all of the mentioned numbers here are too big for you or overkill, or something you aren't thinking about yet -- you can just stand up ClickHouse with Vector as an insert buffer, and throw shit at it all day without worry.

maccard · 2024-10-23T15:48:37 1729698517

Thanks for the response here.

> Realistically I think ClickHouse's features count as real-time, batching or not

I agree, but if you look at some of the suggestions in this thread they talk about (e.g.) writing batches to S3 and crawling it on an interval - that's not real time (even if ClickHouse itself is). If ClickHouse is real-time but can't ingest data in a sane format, it's not real time.

_That said_, I work at the scale where we have to be slightly careful with what we do, but not at the level where we'd call it a "big system at scale". We operate at the scale where we're worried about the stability of our app (i.e. batching in our app has the potential to cause data loss), but we can fit the ingress management/queue on a single instance (or a very small number of instances) so if _that_ is reliable we're happy.

> If all of the mentioned numbers here are too big for you or overkill,

They are, and Vector is exactly what I want. It took me about 20 minutes from seeing this comment to have app -> vector -> clickhouse-local up and running.