Hacker News new | past | comments | ask | show | jobs | submit login
pg_timeseries: Open-source time-series extension for PostgreSQL (tembo.io)
331 points by samaysharma 8 months ago | hide | past | favorite | 82 comments



>You may already be asking: “why not just power the stack using TimescaleDB?” The Timescale License would restrict our use of features such as compression, incremental materialized views, and bottomless storage. With these missing, we felt that what remained would not provide an adequate basis for our customers’ time-series needs. Therefore, we decided to build our own PostgreSQL-licensed extension.

Have been using the free version timescaledb before to shard a 500 Million observation time series database. Worked drop-in without much hassle. Would have expected some benchmarks and comparisons in the post. I will for sure watch this...


500 million is very little however. A regular table with a covering index would probably be fine for many use cases with this number of points.


indeed. Financial timeseries I was working with over 100 million new points, _per day_. For anything serious TimescaleDB is essentially not open source. Well done tembo.io crew -- will definitely give this a whirl.


What do you mean by "for anything serious it isn't open source"? I didn't see any red flags in the apache variant of timescale, just constant pleading to try their hosted option.

https://github.com/timescale/timescaledb/blob/main/LICENSE-A...


Compression and other features use the non-Apache license:

https://github.com/timescale/timescaledb/tree/main/tsl


And as I understand that license, you are allowed to use Timescale for anything that doesn’t involve offering Timescale itself as a service. If you were using Timescale to process lots of time series transactions in your backend, it doesn’t seem to me like that would break the license.

(Which is to say that if, like Tembo, you’re offering Postgres as a service you do indeed have a problem. But for other use, should be fine)


The tricky thing with these licenses (BSL, SSPL, etc.) is that you can use them freely for internal stuff, but suddenly, if you make your product public (assuming it uses, e.g., TimescaleDB), things can get muddy. Everyone wants the flexibility to either open-source or commercialize a successful internal product in the future.

The problem is that, even if your app is not a mere frontend for TimescaleDB/Mongo/Redis, you can get sued, and you'll have to spend unnecessary time and money proving things in court. No one wants this, especially a startup owner whose money and time are tight. Also, even if your startup/company uses some of these techs, potential company buyers will be very wary of the purchase if they know they'll have to deal with this later.


I would assume TimescaleDb only sues if you money. In this case you can also afford a commercial license. If you hit big just contact them and tell there was a problem having a correct license earlier and you want to fix the situation.

There is 0% chance Timescale would sue mom’n’pop operation for breaking their license.


Your assumptions are based on good faith, and businesses are not run on that :) Imagine Oracle or IBM bought Timescale; that 0% chance suddenly increases dramatically.


The license doesn't allow you to "give access to, directly or indirectly (e.g., via a wrapper) to [SQL]".

Legally, what's a wrapper? Is a REST API a wrapper?


I imagine legally would need a lawsuit to set a precedence, and if a license owner sets an over-reaching precedence of what a wrapper is, they risk losing customer trust and companies avoiding them like the plague.

e.g. timescaledb going after a tsdb as a service company offering tsdb behind a graphql wrapper vs timescaledb going after a financial company offering timeseries data collection and viewing.

I think a good border test would be, would timescaledb allow you to offer a metrics and logging service? technically you're offering timeseries database functionality, but it's in a constrained domain, and very clearly a different product, but still effectively CRUDing timeseries data.


That’s the internal use restriction. There is also the restriction more relevant to the use cases I’m talking about on Value Added Products which is “the customer is prohibited, either contractually or technically, from defining, redefining, or modifying the database schema or other structural aspects of database objects”.

Which is, basically, saying that you can do anything that doesn’t give your customers the ability to redefine and modify the database schema as long as you are creating a product that is adding value on top of timescale. Is any of this 100% clear? Not any more that legalese generally is, and of course probably wise to talk to a lawyer if you’re concerned about it. Timescale has made their intent with the license clear in the past with blog posts and such though.


If you have 100 million points per day it’s likely you afford to pay any commercial license.


Why would the number of data points correlate to budget? Perhaps there’s a chance if the business scales with paying users, but that’s unlikely to be true in finance.


At that number of observations, I would assume depth of market data so probably HFT use case. HFT is notoriously expensive to try to compete in


Or IoT data, which is notoriously hard to make money on.


Or someone using sound recorder on Windows.


I'm no market data specialist, but that seems like order of magnitude just top-of-book for US equities? The equity options market, at least, is orders of magnitude larger.


actually, any number of crypto APIs will give you tens of millions of points per day, for free (though that was not our use case).


Money is not only diffrentiator. For example, sometimes you couldn't just buy something, because they won't sell. Or stop selling to you.


> 500 million is very little however. A regular table with a covering index would probably be fine for many use.

Totally agree. The problem here was it was some awkwardly designed geospatial measurement dabase (OGC sensor things) and we could not do much about the queries etc forms one ORM logic and alot of postgis stuff. It was great to have something as a drop-in replacement speeding up all the time series queries without much thinking. Actually what was still nagging us was all the locking going on due to the transactional semantics. Time series databases are probably much better at handling queries under constant ingress. We are total amateurs wrt database optimisation, but my guess is that many of those offerings are rather targeting the average use case.


I think you’re not talking about the same thing. There’s an expression related to time series data —- “high churn” and another “active time series”.

500 million active time series is extremely huge.

It does not have anything to do with number of data points.

Good time series databases can scale to 1M-10M writes per second without a hiccup.


I suppose it means by what is meant by an "observation". Is that an entire time series for a single property or a single point? Nevertheless, the number of points absolutely matters.

A regular Postgres database can give you 50-100K inserts per second and can scale to at least 1B rows with 100K+ individual series without much difficultly. If you know you will need less (or much less) than this, my suggestion is to use a regular table with a covering index. If you need more, use ClickHouse.


That number in itself doesn’t say anything.

What really causes a database to sweat is high cardinality.

When talking about time series, also which fields are indexed and if are you inserting out of order.


Databases are a tough business. You're just waiting for open source to eat your lunch.


AFAIK https://github.com/timescale/tsbs is based on artificial data and I would recommend running benchmarks and comparisons on real data from node_exporter, like https://github.com/VictoriaMetrics/prometheus-benchmark.


500 million observations, with 4-byte floats, is 2 GB. This is the kind of size that you can store uncompressed, in RAM, on a phone. It is hardly at the point where you require specialized time-series software at all.


Looking at their roadmap, the killer feature for me would be incremental materialised views

> Incremental view maintenance — define views which stay up-to-date with incoming data without the performance hit of a REFRESH

I wonder if they plan to incorporate something like https://github.com/sraoss/pg_ivm or write their own implementation.

(Although I'm hopeful that one day we see ivm land in postgres core)


Former Timescaler here.

It's about time that Timescale started getting what it deserves.

Sometime in early 2022, just as they raised their Series C, leadership decided that they had gotten what they wanted from the open-source community and TimescaleDB. They decided it was time to focus 100% on Timescale Cloud. Features began to become exclusive to Timescale Cloud, and the self-hosted TimescaleDB was literally treated as competition. At the same time, they managed to spoil their long-time PaaS partnership with Aiven, which was (and still is) a major source of revenue for the company. The reason? Everyone needed to use Timescale Cloud and give their money to Timescale, thus making Aiven a competitor. In short, with the raising of Series C, Timescale stopped being an OSS startup and began transitioning to a money hungry corporation.

In 2023, they conducted two rounds of layoffs, even though the company was highly profitable. Recently, Planetscale also carried out layoffs in a similarly harsh manner as Timescale, but at least Planetscale had the "courtesy" to address this with two sentences in their PR statement about company restructuring. Timescale did not even do that; they kept it all quiet. Out of 160 employees, around 65 were laid off. The first round of layoffs occurred in January, and the second in September. No warnings. No PIPs. Just an email informing you that you no longer work for them. Many of the affected employees were in the middle of ongoing projects. The CEO even mentioned in the in-house memo how they diligently worked on the September layoff throughout the summer. Interestingly, many of these employees were hired by competitors like Supabase and Neon. It’s worth emphasizing that this was not a financial issue—Timescale is far from having such problems. Instead, it was a restructuring effort to present nice financial numbers during ongoing turbulence in the tech market. (And yes, you guessed it! Timescale also hired their first CFO a couple of months before the first layoffs.)

You might say that it's just business, but as an OSS startup, I expect them to live by the values they have advertised over the years and treat their users and employees much better than they currently do. With this in mind, I welcome Tembo as a new player in the time-series market.

Footnotes: Timescale = the company. TimescaleDB = OSS time-series database developed by Timescale. Timescale Cloud = TimescaleDB managed by Timescale on AWS.


Knowing little about the company, it's almost certainly completely untrue to state the "the company was highly profitable"; you don't hire 160 people in an OSS-centric business and also turn a profit. You likely have a distorted understanding of the challenging nature of the burn rate in a changing macro environment.


Ajay, Timescale CEO and co-founder, here.

It saddens me to see that we have generated so much ill will from you. It sounds like you were affected by our layoffs last year. You have every right to be upset. If you ever want to chat about this 1:1, you know how to reach me. I’d be happy to make the time.

To anyone else reading this: Some of what this person has shared is true, but some of it is not true.

I debated whether or not to reply. But one of my personal leadership values is “transparency”, so I thought I’d take the time to respond.

Yes, we conducted two rounds of layoffs in 2023. Like many tech companies, we hired a lot in 2021 and early 2022. Then, as the tech market began to correct mid 2022, we were forced to make tough decisions, including layoffs.

I take responsibility for the over-hiring and the layoffs. It brought me no joy to do them. But I feel a moral obligation to our customers to stay on the path of financial sustainability. I also feel a fiduciary obligation to our investors, some of whom are individuals, some of whom are large funds, who have all trusted us with their money. I feel a similar responsibility to current and former Timescalers who own equity in Timescale.

Sometimes, that means making tough decisions like this. But again, it was my call (not anyone else), and I accept full responsibility.

Yes, we did not publicize this news. Frankly, we thought we were too small for others to care. Maybe we got that wrong. But that decision came from a place of humility.

This is not true: “Just an email informing you that you no longer work for them.” Every affected person – except for a handful who were not working that day – was told the news individually, on a live Zoom call, that included at least one of our executives or a member of our People team. For the few teammates who were not working that day, we made many attempts to connect with them personally. I know the team tried their best to approach these hard conversations with care and empathy.

I was glad to see that a number of the affected individuals quickly found new roles at other companies in the PostgreSQL ecosystem, including at Supabase, Neon, and Tembo. These are good, smart people. The PostgreSQL ecosystem is better off with these people continuing to work to improve PostgreSQL.

The comments questioning our belief in open source are also not true. We still believe in open source. The core of TimescaleDB is still open source. Some of the advanced features are under a free, source-available license. Our latest release – TimescaleDB 2.15 – was just two weeks ago. Unlike most (all?) of our competitors, we have never re-licensed our open source software. This is something that is true for us but not for many others, like MongoDB, Elastic, Redis, Hashicorp, Confluent, etc.

Yes, we are building a self-sustaining open source business. Yes, it is hard and sometimes we get things wrong. But we have never stopped investing in our community. Today the TimescaleDB community (open source and free) is 20x larger than our customer base. And this community has more than doubled in the past 1+ year. We are also planning significant open source contributions for the next few months.

To the author of this post: I hope this response provides some clarification. And again, I’m available to chat one-on-one if you’d like.

To our open source and free community users, and to our customers: thank you for trusting us with your workloads. We are committed to serving you.

Finally, to the Timescale team, both current and former: thank you for all your hard work making developers successful. We are here to serve developers so that they can build the future. The road won’t always be easy or smooth. But we are committed, and we will get there.


Taking responsibility for laying off a bunch of people is thin gruel. What does that mean? Nothing. They showed you loyalty, you didn't return it. The over hiring is a symptom of bad management, so taking responsibility would be to demote yourself and take a pay cut. All the executive staff should have taken one and kept more people on. It is irresponsible and cruel to overhire and then dump people. I could say more, but I am sure this is falling on deaf ears.


Over-hiring is a calculus that shifts depending on macro market conditions. To pretend otherwise just isn't an honest assessment of what it means to be a business leader. In the zero interest rate era, it was irresponsible for business executives to not invest to ensure they had a competitive foundation or else be left-behind by those doing so... The rise of interest rates changed the market's appetite to invest in non-profitable growth companies, which in turn had a series of follow-on effects that required software executives to reverse course so as to optimize for their long term viability in the new climate.


Dumb question: why can't I just insert a bunch of rows with a timestamp column and indices? Where does that fall short? At a certain # of rows or something?

What does this let me do that can't be achieved with "regular PostgreSQL without the extension"?


I'm with you, I need to read up more on where timeseries could benefit, at work we have a PostgreSQL instance with around 27 billion rows in a single partitioned table, partitioned by week. Goes back to January of 2017 and just contains tons of data coming in from sensors. It's not "fast", but also not ridiculously slow to say e.g. "Give me everything for sensor 29380 in March of 2019".

I guess depends on your needs but I do think I need to investigate timeseries more to see if it'd help us.


Now give me a plot with the average of all sensors of model=A in region=B, grouped by customer, for the past 3 months, downsampled to 500 points. Assuming 1 sensor reading per minute.

I have no doubt sql can do it without too much trouble, but for a time series this is really an instant operation, even on a small server.

A time series will first find the relevant series and then simply for-loop through all the data. It takes just a handful of milliseconds.

Sql will need to join with other tables, traverse index, load wider columns. And you better have set the correct index first, in your case you also spent extra effort on partitioning tables. Likely you are also using a beefy server.


Adding to your comment, from my perspective (exploration geophysics)

> and just contains tons of data coming in from sensors.

it's also desired to, on the fly, deal with missing sensor data, clearly bad sensor data, identify and smooth spikes in data (weird glitch or actual transient spike of interest), apply a variety of running filters; centred average is basic, parameterised Savitzky–Golay filters provide "beefed up" better than running average handling .. and there are more.

It's not just better access to sequential data that makes a dedicated time series engine desirable, it's the suite of time (and geo spatial) series operations that close the deal.


Same boat here, but with DSP/SSP bidding statistics. Generating around 1 billion rows a day and still going strong. Single table, partitioned by week. BRIN index on timestamp, normal index on one column.

Postgres is just a beast.


there are several good articles explaining this, especially on Timescable blog, but in short, without time partitioning and just index, at some given point the performance for reads and writes degrades exponencially.


Time based partitioning.


    CREATE TABLE logs (
        id SERIAL PRIMARY KEY,
        log_time TIMESTAMP NOT NULL,
        message TEXT
    ) PARTITION BY RANGE (log_time);
Why won't this work on stock PostgreSQL?


I think what's meant here is windowing (partitioning the query) not partitioning the table per se. Though even with this strategy, you must manually create new partitions all the time.

This also isn't typical time-series data, which generally stores numbers. Supposing you had a column "value INTEGER" as well, how do you do something like the following (pseudo-SQL)?

    SELECT AVG(value) AS avg_value FROM logs GROUP BY INTERVAL '5m'
Which should output rows like the following, even if the data were reported much more frequently than every 5 minutes:

    log_time             | avg_value
    2024-05-20T00:00:00Z | 10.3
    2024-05-20T00:05:00Z | 7.8
    2024-05-20T00:10:00Z | 16.1


    SELECT date_bin('5 minutes', log_time, '2000-01-01') log_time,
           AVG(value) avg_value
    FROM logs GROUP BY 1


read the docs, it's not saying that won't work. This extension along with timescale just makes some things more ergonomic.


And provides helper functions to deal with regular time series tasks


That won’t work already because your timestamp isn’t part of your primary key.


It will work just fine.


Most of the time-series queries (almost all of them) are aggregated queries. Why not leverage or build top-notch Columnarstore for the same.

Everything seems to be there and why there's not first class product like ClickHouse on PG.


The gold standard for this Druid at very large scale, or ClickhouseDB. Clickhouse has a lot of problems as far as modifying/scaling shards after the fact, while Druid handles this with ease (and the penalty of not being able to update after the fact.)


Doris?


Citus, Persona, TimescaleDB?


That was very "Klaatu, Barada, Nikto".


Victoria metrics as well, they say based on similar structures used in clickhouse


Looking at the comparison with Click Benchmark, they are almost pathetic in terms of performance. They cant even handle sub-second aggregation queries for 10M records. Compared that too even duckdb reading from parquet files.


Postgres is missing a proper columnstore implementation. It's a big gap and it's not easy to build.

One solution could be integrating duckdb in a similar way as pgvector. You need to map duckdb storage to Postgres storage and reuse duckdb query processor. I believe it's the fastest way to get Postgres to have competitive columnstores.


Olo, CEO of https://www.tablespace.io here. We've built a columnstore extension for Postgres that is faster than ClikcHouse for real-time analytics in their own ClickBench Benchmarks. Feel free to check it out - https://www.tablespace.io/blog/postgres-columnstore-index-vs...


This sounds interesting. I don't see duck db as a supported extension or mentioned anywhere in your code yet ;)

Is this foreshadowing?


Hydra?


Thank you for posting it: I followed the links and found out about trunk and https://pgt.dev/


Great to see this kind of innovation. PostgreSQL is interesting while "core" was always Open Source and using very permissive Open Source library, there have been many proprietary and source available extensions, ranging from replication to time series support.

Now we see those Proprietary extensions being disrupted by proper Open Source!


PostgreSQL licensed, good move!


Your site is very well designed and easy to read btw, and the app UI looks great from the demo photos. I might try it!


Thank you!


Interesting release, it feels that the time-series database landscape is evolving toward:

a) columnar store & built from scratch, with convergence toward open formats such as parquet & arrow: influxdb 3.0, questdb

b) Adding time-series capabilities on top of Postgres: timescale, pg_timeseries

c) platforms focused on observability around the Prometheus ecosystem: grafana, victoria metrics, chronosphere


Would this be a good extension when you want to load balancer log entries (status, response body, headers etc)?

I think a columnar database store would be more efficient than normal row-based databases? load balancer log entries could be considered something similar to analytics events.


Yes. Columnar is integrated with pg_timeseries already.


Benchmarks with respect to QuestDB, TimescaleDB?


QuestDB, TimescaleDB, and PostgreSQL are more about universal databases and don't work as well as Prometheus or VictoriaMetrics. And yes, Clickhouse will beat them both, but only by running queries on a huge amount of data. See also https://docs.victoriametrics.com/faq/#how-does-victoriametri...


How’s it different than timescaledb?


> You may already be asking: “why not just power the stack using TimescaleDB?” The Timescale License would restrict our use of features such as compression, incremental materialized views, and bottomless storage. With these missing, we felt that what remained would not provide an adequate basis for our customers’ time-series needs. Therefore, we decided to build our own PostgreSQL-licensed extension.


It's much newer for one, so it's behind on features. But, we're working on adding new ones based on customer demand.

We want to build it with a PostgreSQL license using existing community extensions as much as possible and build custom stuff (still permissively licensed) only when necessary.


The use of postgresql licensing might mean we can see this available for AWS RDS and other managed PostgreSQL providers.


Timescaledb's license is more like Redis' new license?


It's about time that postgres (and other databases) add native append-only tables. That doesn't make it timeseries, but it probably helps with the standardiziation and all the logic/access around it.


How does this stack up against something like what QuestDB offers?


Interesting, how does it compare to proper (open source) time series database like InfluxDB other than being 'Postgres' like ?


i's not Postgres-like, it _is_ Postgres


InfluxDB is a "proper" time series database?


It is in the sense that it's a purpose built time series database. Once upon a time it was definitely among the best too, but IMO their sales and tech and everything strategies have been just massive clusterfucks. They also have the best metrics agent, by far.

They're on their third serious rewrite with breaking changes for users. They pulled features from the open source version (clustering) to have a higher moat for the Enterprise version. They deleted customer data permanently in their hosted version after sending two emails to billing addresses. Their marketing and product naming also sucks:

InfluxDB InfluxDB Cloud Serverless InfluxDB Cloud Dedicated InfluxDB Clustered


Would love to use this with RDS!


Tembo CEO here - we are targeting feature parity for Tembo Cloud w/ RDS as soon as possible, would love to have you give Tembo a try sometime, give us feedback :)

Tembo Cloud is standard SaaS offering, and our new Tembo Self Hosted (https://tembo.io/docs/product/software/tembo-self-hosted/ove...) allows you to run the same software that powers our SaaS, but in your own K8s cluster.


Same here, but not holding my breath since all these neat Postgres extensions compete with other AWS DBs like Redshift, Timestream etc




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: