Show HN: Trench – Open-source analytics infrastructure

bosky101 · 2024-10-28T18:57:40 1730141860

1) Appreciate the single image to get started, but am particularly curious how you handle different events of a new user going to different nodes.

2) any admin interface or just the rest API?

3) a little bit on the clickhouse table and engine choices?

4) stats on ingesting and querying at the same time

5) node doesn't support the clickhouse TCP interface. This was a major bottleneck even with batching of 50k events (or 30 secs whichever comes first)

6) CH indexes?

7) how are events partitioned to a Kafka partition? By userId? Any assumptions on minimum fields

Will try porting our in-house marketing automation backend (posthog frontend compatible) to this and see how it goes (150M+ events per day)

Kudos all around. Love all 3 of your technology choices.
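
For reference, the "50k events or 30 secs, whichever comes first" batching in point 5 could be sketched roughly as below. This is only an illustration: `EventBatcher`, the `flush` callback, and the injectable `clock` are hypothetical names, not part of Trench or any ClickHouse client.

```python
import time
from typing import Callable, List, Optional


class EventBatcher:
    """Buffers events and flushes when a size limit or an age limit is hit."""

    def __init__(
        self,
        flush: Callable[[List[dict]], None],
        max_events: int = 50_000,
        max_age_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush              # hypothetical sink, e.g. one bulk INSERT
        self._max_events = max_events
        self._max_age_s = max_age_s
        self._clock = clock
        self._buffer: List[dict] = []
        self._first_at: Optional[float] = None  # arrival time of oldest buffered event

    def add(self, event: dict) -> None:
        if not self._buffer:
            self._first_at = self._clock()
        self._buffer.append(event)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Flush if the batch is full or its oldest event is too old.

        A real service would also call this from a periodic timer, so that a
        quiet stream still gets flushed once the age limit passes.
        """
        if not self._buffer:
            return
        full = len(self._buffer) >= self._max_events
        stale = self._clock() - self._first_at >= self._max_age_s
        if full or stale:
            batch, self._buffer = self._buffer, []
            self._first_at = None
            self._flush(batch)
```

The clock is injectable only so the age-based path can be exercised without sleeping.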

pancomplex · 2024-10-28T19:33:35 1730144015

Thank you!

1) All data is partitioned based on the "instanceId" of events (see `instanceId` here: https://docs.trench.dev/api-reference/events-create). Instance IDs are typically a logically meaningful way of separating users (such as by company/team/etc.) that allows for sharding the data across nodes.

2) Yes, this is the number 1 thing on our roadmap right now (if anyone is interested in helping build this, please reach out!)

3) We're using the Kafka engine in ClickHouse for throttling the ingestion of events. It's partitioned by instanceId (see #1) for scaling/fast queries over similar events.

4) My benchmarks in production showed a single EC2 instance (16 cores / 32 GB RAM) barely working at 1000+ inserts / second with roughly the same number of queries per second. Load averages were 0.91, 0.89, 0.9. This was in stark contrast to our AWS Postgres cluster, which continued to hit 90%+ CPU and low memory with 80 ACUs before we finished the migration to Trench.

5) We seemed to solve this by running individual Node processes on every core (16 in parallel). Was the limit you saw caused by ClickHouse's inbound HTTP interface?

6) Right now the system uses just a default MergeTree ordered by instanceId, userId, timestamp. This works really well for doing queries across the same user or instance, especially when generating timeseries graphs.

7) I am still trying to figure out the best Kafka partitioning scheme. userId seems to be the best for avoiding hot partitions. Curious if you have any experience with this?

Let us know how the migration goes and feel free to connect with me (christian@trench.dev).
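
As a rough analogy for point 6 (not ClickHouse internals): when rows are kept sorted by (instanceId, userId, timestamp), all rows for one user of one instance sit next to each other and already in time order, so a per-user query reads one contiguous range. The rows and the `user_range` helper below are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows as (instanceId, userId, timestamp) tuples.
rows = [
    ("acme", "u2", 30), ("acme", "u1", 10), ("beta", "u9", 5),
    ("acme", "u1", 20), ("beta", "u3", 7), ("acme", "u2", 15),
]
rows.sort()  # stands in for data being stored sorted by the ordering key


def user_range(sorted_rows, instance_id, user_id):
    """Return the contiguous slice of rows matching one (instance, user) prefix."""
    lo = bisect_left(sorted_rows, (instance_id, user_id))
    hi = bisect_right(sorted_rows, (instance_id, user_id, float("inf")))
    return sorted_rows[lo:hi]


# The slice comes back already ordered by timestamp, with no extra sort.
acme_u1 = user_range(rows, "acme", "u1")
```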

klaussilveira · 2024-10-29T13:58:35 1730210315

How do you guarantee ACID with Kafka being responsible for actually INSERT'ing into ClickHouse? Wouldn't it be less error prone to just use ClickHouse directly and their async inserts?

https://clickhouse.com/blog/asynchronous-data-inserts-in-cli...
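
For context, a minimal sketch of what "ClickHouse directly with async inserts" could look like over the HTTP interface, where settings are passed as URL parameters. `async_insert` and `wait_for_async_insert` are real ClickHouse settings; the host, the `events` table name, and the helper function are assumptions for illustration. The code only builds the URL and sends nothing.

```python
from urllib.parse import quote, urlencode


def async_insert_url(base_url: str, table: str) -> str:
    """Build an HTTP-interface URL for an async insert (no request is sent)."""
    params = {
        "query": f"INSERT INTO {table} FORMAT JSONEachRow",
        "async_insert": 1,           # let the server buffer and batch the rows
        "wait_for_async_insert": 1,  # only acknowledge once the buffer is flushed
    }
    return f"{base_url}/?{urlencode(params, quote_via=quote)}"


# Hypothetical host and table; the request body would be newline-delimited JSON rows.
url = async_insert_url("http://localhost:8123", "events")
```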

pancomplex · 2024-10-30T00:16:35 1730247395

I am thinking about setting this up as a configuration for the type of traffic that doesn't require Kafka.

That being said, Kafka has in my experience come in super handy again and again, simply because it adds an incredible extra layer of fault tolerance when running at scale, including the ability to replay events, replicate, fail over, etc. I'd be nervous about letting the amount of throughput we receive directly interface to ClickHouse (though I'd be excited to run an experiment with this).

bosky101 · 2024-10-29T03:29:27 1730172567

Not sure of the CH Kafka engine but generally I think you should partition by userId.

Because the next step would be trying to run some cron for a user or event based trigger based on the events.

And the only way to avoid multiple machines doing the same work / sending the same comms - would be to push all users events to a partition. This way with multiple workers you don't have the risk of duplicate processing.
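
A small sketch of that idea: hash the userId to pick a partition, so every event for a given user lands on the same partition and is handled by a single worker. MD5 is only a stand-in here (Kafka clients use their own key hashing), and `partition_for` and `route` are hypothetical helpers.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user to a stable partition number in [0, num_partitions)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def route(events: List[dict], num_partitions: int) -> Dict[int, List[dict]]:
    """Group events by partition, preserving arrival order within each one."""
    partitions: Dict[int, List[dict]] = defaultdict(list)
    for event in events:
        partitions[partition_for(event["userId"], num_partitions)].append(event)
    return dict(partitions)
```

Because a partition is consumed by one worker at a time within a consumer group, a per-user cron or trigger built on top of this sees each user's events in one place.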

svilen_dobrev · 2024-10-31T18:21:46 1730398906

check "partial ordering" concept. What is the minimum independent "thing"? Probably user?

example over user+invoices: i.e. there are things that have to come in exact order (e.g. activity on certain invoice), and there are things that can move around (i.e. processing those, timewise), being independent from one another (different invoices' activities, wholesale). But when same user acts on different invoices, then whole one-user-activity should be in exact order.. not just invoice-activity

hitradostava · 2024-10-28T20:11:53 1730146313

Looks interesting, we solved this problem with Kinesis Firehose, S3 and Athena. Pricing is cheap, you can run any arbitrary SQL query and there is zero infrastructure to maintain.

bosky101 · 2024-10-29T03:21:24 1730172084

Storing small events in s3 can explode costs quickly.

At 1M events/day that's $7.5/day. Decent

At 15M, $75/day

Cost for 150 million S3 PUT requests per day of 25KB each would be $750/day, assuming no extra data transfer charges.

With clickhouse you won't get charged per read/write

hitradostava · 2024-10-29T07:22:23 1730186543

Kinesis supports buffering - up to 900 seconds or 128 MB. So you are way out on your cost estimations. Over time queries can start costing more due to S3 requests, but regular Spark runs to combine small files solve that.
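
A back-of-the-envelope sketch of what that buffering does to the object count, using the 150M events/day and 25KB/event numbers from the parent comment. It assumes a single delivery stream, no compression, and that a flush happens at whichever limit is reached first; `buffered_objects_per_day` is a hypothetical helper.

```python
def buffered_objects_per_day(
    events_per_day: int,
    event_kb: int,
    buffer_mb: int = 128,
    buffer_s: int = 900,
) -> int:
    """Approximate number of S3 objects written per day with buffering."""
    total_kb = events_per_day * event_kb
    buffer_kb = buffer_mb * 1024
    by_size = -(-total_kb // buffer_kb)  # flushes forced by the size limit
    by_time = -(-86_400 // buffer_s)     # flushes forced by the time limit
    return max(by_size, by_time)


objects = buffered_objects_per_day(150_000_000, 25)
```

Under these assumptions the daily object count is in the tens of thousands rather than 150 million, which is why the per-request cost largely disappears.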

bosky101 · 2024-10-31T06:11:45 1730355105

I haven't even got to kinesis or bandwidth or storage.

Even if you compact N objects through Spark etc., your starting point would still be the large number of writes. So that doesn't change. The costs would be even larger considering the additional medium-sized PUTs that double the storage and potentially add N deletes. I have also heard that Athena, Presto, etc. charge based on rows read.

antman · 2024-10-28T20:49:46 1730148586

How does it scale? Can you spin up multiple containers? As an upcoming feature, auto-archiving old data to cloud storage would be great.

pancomplex · 2024-10-28T21:06:08 1730149568

Once you've outgrown a single physical server, you can continue to scale the Trench cluster by spinning up more Trench application servers and switching to dedicated Kafka and ClickHouse (either self-hosted or via cloud offerings). You can also shard Trench itself depending on the structure of your data (e.g. 1 Trench instance per customer, use case, etc.)

Auto-archiving to cloud for Kafka (Confluent, AWS MSK, etc.) / ClickHouse (ClickHouse Cloud, etc.) is definitely high on the roadmap.

Attummm · 2024-10-28T18:57:15 1730141835

Looks great, but what is missing for me are use cases.

Why should I use it? What are the unique selling points of your project?

pancomplex · 2024-10-28T19:12:08 1730142728

I looked around, but all the open source analytics projects I could find were bloated with all kinds of UI and unnecessary code paths. They also all seemed to use a row-based RDBMS as the data backbone (vs columnar stores like ClickHouse). I was looking for a backend-only solution that we could shape for our product use case and that could scale.

So TLDR, if you're at a smaller scale (<1M MAUs), you probably will be fine just using a table in MySQL or Postgres. If you have a lot of traffic and users, you will need something like Trench that uses Kafka and ClickHouse.

Attummm · 2024-10-28T20:23:07 1730146987

You are selling the underlying technologies (Kafka/ClickHouse).

I'm interested in what your project can do for me, my project(s), team/company. There is a reason that most of the internet still uses PHP and old technologies: they focused not on the latest tech but on solving problems for others.

The project looks cool, but tell us the use cases.

mind-blight · 2024-10-28T23:56:12 1730159772

It seems pretty clearly spelled out. If you have enough traffic that an events table is slowing down your Postgres instance, you can easily set this up as a service to offload the events table. The author says that below 1 million MAUs you probably don't need this.

It's built on tech known for handling very large amounts of traffic, which answers the how after the what.

dfltr · 2024-10-28T21:51:34 1730152294

Use case #1: You have a problem table (e.g. a high-volume events table) that grows non-linearly as your business starts to scale up. A queue + columnar store package like Trench moves the problem table out to a system better equipped to deal with it and lets your DB server handle its relational business in relative peace and quiet.

Attummm · 2024-10-29T01:01:55 1730163715

Maybe I wasn't clear enough but my questions have been rhetorical. They were not for me. If one starts stating technologies, it is akin to describing the individual ingredients of a sandwich.

The question remains: Why choose Trench over just using Kafka and Clickhouse or any other message queue and columnar database / big data base?

If the goal of the post and the landing website is to entice people to use the tool, then answering these questions is important. If what is being discussed seems obvious, then who is the target demographic? Because they already know the space, use alternatives or have built their own.

teleforce · 2024-10-29T23:37:24 1730245044

Probably it's just me, but your comment is very similar to the famous one on Dropbox:

My YC app: Dropbox - Throw away your USB drive

https://news.ycombinator.com/item?id=9224

Attummm · 2024-10-30T13:13:42 1730294022

These two comments are worlds apart.

My comment is feedback to better pitch the project with the goal of attracting more users.

The Dropbox comment, in contrast, is a mean-spirited criticism that just lists alternatives.

Jgrubb · 2024-10-29T12:13:22 1730204002

Sometimes the innovation is a new underlying technology applied to an old problem?

codegeek · 2024-10-28T21:29:04 1730150944

Looks good. I'm in the market for something like this and I just ran it locally. How do I visualize data? Is Grafana not included by default?

Also, minor issue in your docs. There is an extra comma in the sample JSON under the sample event. The fragment below:

        "properties": {
            "totalAccounts": 4,
            "country": "Denmark"
        },
    }]

I had to remove that comma at the end.
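
For anyone hitting the same thing: strict JSON parsers reject a trailing comma, which is easy to confirm. The snippet below uses Python's standard `json` module on a trimmed-down version of the fragment above; the `is_valid_json` helper is just for illustration.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as strict JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Trimmed-down versions of the docs fragment, with and without the extra comma.
with_trailing_comma = '[{"properties": {"totalAccounts": 4, "country": "Denmark"},}]'
without_trailing_comma = '[{"properties": {"totalAccounts": 4, "country": "Denmark"}}]'

print(is_valid_json(with_trailing_comma), is_valid_json(without_trailing_comma))
```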

pancomplex · 2024-10-28T21:59:00 1730152740

Thanks for flagging. Just fixed this. Grafana is intentionally not included by default -- but it takes a few minutes to set it up. We're still trying to figure out what to bundle by default in terms of UI -- for now it's API only.

codegeek · 2024-10-29T17:57:41 1730224661

No worries. I am going to test it as we are looking for a simple centralized tool for multiple customers to run reporting on events. Most tools have been too complex to set up and yours is promising.

d_watt · 2024-10-28T19:23:16 1730143396

Looks super interesting. Any positioning thoughts on this vs https://jitsu.com ?

pancomplex · 2024-10-28T19:41:20 1730144480

I think a major difference is that Jitsu depends on you having a data warehouse, whereas Trench can be spun up as a standalone system. Trench is also designed for real-time querying at high scale, which will be much slower when depending on ETL'ed data in a data warehouse.

brody_slade_ai · 2024-10-29T11:12:43 1730200363

I've been exploring open source data analytics software and it's been a game-changer. I mean the flexibility and cost savings are huge perks. I've been looking into Apache Spark and KNIME, and they both seem like great options

Incipient · 2024-10-30T03:20:46 1730258446

>LLMs are really good at writing SQL

Unfortunately not my experience. Possibly I'm not prompting well, but trying to get VS Code Copilot to generate anything involving semi-basic joins falls quite flat.

oulipo · 2024-10-28T22:58:46 1730156326

What is the advantage of this rather than using a postgres plugin for clickhouse and S3 storage of the data to build a kind of data-warehouse, which wouldn't require the bloat of Kafka?

pancomplex · 2024-10-28T23:17:56 1730157476

In my experience, at scale (~2-3k QPS), you'd run into a bottleneck ingesting so many events without Kafka. If you don't have this level of throughput, you could totally do the above and still get the advantages of ClickHouse's columnar datastore.

remram · 2024-10-29T20:10:09 1730232609

If you don't mind me asking, why the name "Trench"?

pancomplex · 2024-10-29T20:19:45 1730233185

We were inspired by datalakes and thought the name of a super deep lake could be a cool domain. Turns out 10 of the deepest spots on Earth are all trenches, and the domain was cheap, so we went with trench.dev https://www.marineinsight.com/know-more/10-deepest-parts-of-...

asdev · 2024-10-28T23:03:03 1730156583

how is this different from Posthog?

BohdanPetryshyn · 2024-10-29T03:54:44 1730174084

In addition to what pancomplex mentioned, Posthog is not fully open-source. Their free self-hosted version has limited functionality and the paid self-hosted version is no longer supported [1] which makes me feel like I'm pushed to use their cloud offering.

[1]: https://posthog.com/docs/self-host

pancomplex · 2024-10-28T23:13:52 1730157232

The stack is indeed very similar to Posthog. The biggest difference is that we don't come with all the feature bloat (Session Recordings, Feature Flags, Surveys, etc.) and instead provide a very minimal and easy to use backend + API that is applicable to a ton of use cases.

We (Frigade.com) actually use Posthog as well as Trench in production. Posthog powers all our website analytics. Trench powers our own SDK and tracking scripts we ship to our own customers.

I actually tried to spin up Posthog originally before building Trench, but there was just way too much overhead and "junk" we didn't need. I would need to strip out so many features of their Python app that it would eventually be faster to build a clean solution in TypeScript ourselves.

oulipo · 2024-10-28T22:57:24 1730156244

Could this be used to log IoT object events? or is it more for app analytics?

pancomplex · 2024-10-28T23:04:44 1730156684

Yes for sure. We intentionally designed Trench to be very unopinionated when it comes to the application. So you can use it to stream and query anything from page views, log traces to IoT object events.

biddendidden · 2024-10-29T11:57:15 1730203035

I _totally_ associate 'trench' with 'analytics'. Oh, perhaps the author associates it with 'infrastructure'? Just stupid.