Databricks acquired bit.io and subsequently shut it down quite fast. AFAIK bit.io had a very small team, and the founder was a serial entrepreneur who was never going to stick around, and he didn't. I'm not sure who from bit.io is still around at Databricks.
If I'm guessing right, MotherDuck will likely be acquired by GCP, since most of the founding team is ex-BigQuery. Snowflake already purchased Modin, and Polars is still too immature to be acquisition-ready. So what does that leave us with? There's also EDB, which competes in the enterprise Postgres space.
Folks I know in the industry are not very happy with Databricks. Databricks themselves were hinting to people that they might be acquired by Azure as Azure tries to compete in the data warehouse space. But then everyone became an AI company, which left Databricks in an awkward spot. Their bizdev team is not the best, from my limited interactions with them (lots of Starbucks drinkers and "let me get back to you after a 3-month PTO"), so they don't know who should lead them through an AI pivot, or how. With cash to burn from overinvestment and the Snowflake/Databricks conferences coming up fast, they needed a big announcement, and this is that big announcement.
Should have sobered up before writing this though. But who cares.
From context in the parent, I'm reading it as the sort of person who looks more competent than they are and skates from job to job quickly enough that no one notices.
BDev can be good or bad. Bad ones tend not to follow up, and Starbucks here represents poor decision-making skills (reinforced by going on PTO for three months and not following up on commitments).
Thought the same. I mean, I don't drink it because I can make my own far cheaper, but I don't look with scorn at those who do. It says a lot more about the person making the judgment than about those who drink the coffee.
> Folks I know in the industry are not very happy with databricks
Yeah, big companies gobbling up everything does not lead to a healthy ecosystem. Congrats to the founders on the acquisition, but everyone else loses with moves like this.
I'm still sour after their Redash purchase that instantly "killed" the open source version. The Tabular acquisition was also a bit controversial, since one of the founders is the PMC Chair for Iceberg, which "competes" directly with Databricks' own Delta Lake. The mere presence of these giants (mostly Databricks and Snowflake) makes the whole data ecosystem (both closed and open source) really hostile.
An OLTP solution fixes a lot of the headaches around the traditional extract-load-transform steps.
Most OLAP work starts once the data lands in Kafka logs or on a disk of some sort.
Then you schedule a task, or keep a task polling constantly, which is always prone to small failures and delays, or big failures when the schema changes.
The "data pipeline" team exists because the data doesn't move by itself from where it is first stored to where it is ready for deep analysis.
If you can push 1-row updates transactionally to a system and feed off the backend to write a more OLAP-friendly structure, then you can hook up things like a car rental service's operational logs to a system that can compute more complex things, like forecasting availability or applying discounts to give a customer a cheap upgrade.
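To make that concrete, here is a minimal sketch of the pattern in Python, assuming a hypothetical rental_events table and made-up connection strings; the OLTP side is plain Postgres, and the drain step uses DuckDB's Postgres extension to write a columnar file:

```python
# Hypothetical sketch: 1-row transactional writes on the OLTP side,
# periodically drained into an OLAP-friendly parquet file.
import duckdb
import psycopg2

# OLTP side: a single-row update, committed transactionally.
conn = psycopg2.connect("postgresql://app@oltp-host/rentals")  # made-up DSN
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO rental_events (car_id, event, created_at) VALUES (%s, %s, now())",
        ("car-42", "returned"),
    )

# Feed off the backend: copy recent rows into a columnar layout that
# availability forecasts or discount rules can scan cheaply.
duck = duckdb.connect()
duck.execute("INSTALL postgres; LOAD postgres;")
duck.execute("ATTACH 'postgresql://app@oltp-host/rentals' AS oltp (TYPE postgres);")
duck.execute("""
    COPY (
        SELECT * FROM oltp.rental_events
        WHERE created_at > now() - INTERVAL 1 DAY
    ) TO 'rental_events.parquet' (FORMAT parquet)
""")
```

In a real setup the drain would be driven by logical replication or CDC rather than a polling query, but the shape is the same.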
Neon looks a lot better than Yugabyte technically (which also speaks the Postgres protocol), and a lot nicer in protocol compatibility than something like FoundationDB.
AlloyDB from Google feels somewhat similar, and Spanner has a Postgres interface too.
The postgres API is a great abstraction common point, even if the actual details of the implementations vary a lot.
Neon is a great product because they are run by Postgres enthusiasts. They have decent customer-friendly pricing, real serverless HTTP endpoints, and they're always on the latest version of Postgres as soon as it is stable. From what I can tell, no other provider has this positioning, driven by dedication.
I really hope they can maintain this dedication after acquisition, but Databricks will probably push them into enterprise and it will lose the spark. I wish Cloudflare bought them instead.
I've been bullish on neon for a while -- the idea hits exactly the right spot, IMO, and their execution looks good in my limited experience.
But I mean that from a technical perspective. I never have any real idea about the business -- do they have an edge that makes people want to start paying them money and keep paying them money? Heck if I know.
I guess that's going to be Databricks' problem now (maybe).
Neon goes further than just "managed Postgres". I would say one of their big features is just how fast and easily you can spin up new DBs/clusters. It's completely possible (encouraged, even) to spin up one DB per tenant, and potentially spin up and tear down thousands of databases.
It opens up some interesting ideas/concepts when creating an isolated DB is just as easy as creating a new db table.
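As a rough sketch of the provisioning pattern (in plain-Postgres terms, with made-up hostnames; on Neon you'd typically drive this through their management API instead, where creation is near-instant):

```python
# Hypothetical DB-per-tenant provisioning against plain Postgres.
import psycopg2

def create_tenant_db(tenant_id: str) -> str:
    """Provision an isolated database for one tenant; returns its DSN."""
    # tenant_id is assumed pre-validated, since it is interpolated into
    # DDL (CREATE DATABASE can't take bind parameters).
    admin = psycopg2.connect("postgresql://admin@db-host/postgres")
    admin.autocommit = True  # CREATE DATABASE cannot run inside a transaction
    with admin.cursor() as cur:
        cur.execute(f'CREATE DATABASE "tenant_{tenant_id}"')
    admin.close()
    return f"postgresql://app@db-host/tenant_{tenant_id}"

def drop_tenant_db(tenant_id: str) -> None:
    """Tear the tenant's database down when they churn."""
    admin = psycopg2.connect("postgresql://admin@db-host/postgres")
    admin.autocommit = True
    with admin.cursor() as cur:
        cur.execute(f'DROP DATABASE IF EXISTS "tenant_{tenant_id}"')
    admin.close()
```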
These serverless Postgres databases are all so overhyped. I have tried all of them, and they are all much slower than just deploying a managed database in the same datacenter as your application.
I have an application deployed on Railway with a Postgres database, and the user's latency is a consistent 150ms. The same application deployed on these serverless/edge providers is anywhere between 300-400ms, with random spikes to 800ms. The same application, same data, and same query.
The edge and serverless has to be the biggest scam in cloud industry right now.
They aren't faster, and they aren't cheaper. You could argue they are easier to scale, but that's not the case anymore since everyone provides autoscaling now.
Whatever. I was able to set up Neon Postgres in 5 mins. It's still crazy fast with my Fly services, and has replication and backups out of the box. Much easier than AWS and, from what I can tell, than getting something going with Railway. And I don't have to worry about operating it. My time is valuable.
All of that can be true. What I wonder is — if that all is true — how much of a moat is there around that? It seems like the secret sauce in that company isn’t some custom technology, it’s execution. Execution can be replicated by another competent team. Or is there some other secret sauce that I can’t see?
It's the team, they have a few Postgres committers and major contributors, and there are not that many of them. But that's a bit precarious, the team may leave after the acquisition for many reasons.
I completely agree... in my comment, the word "competent" was doing a lot of heavy lifting.
And it begs comparisons to comments about Dropbox/rsync, etc...
But, I personally think the Neon concept of branching databases with CoW storage is quite interesting. That, combined with cost management via autoscaling, does seem like at least a serviceable moat.
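For contrast, the nearest thing in plain Postgres is a template copy, which duplicates the data in full rather than sharing pages copy-on-write the way Neon branching does, so it gets slower and more expensive as the source grows; a quick sketch (names made up):

```python
# Plain-Postgres "branching": a full physical copy via TEMPLATE.
# Neon branches instead share CoW storage, so creation cost doesn't
# scale with database size.
import psycopg2

admin = psycopg2.connect("postgresql://admin@db-host/postgres")
admin.autocommit = True  # CREATE DATABASE cannot run inside a transaction
with admin.cursor() as cur:
    # The template database must have no other active connections.
    cur.execute("CREATE DATABASE feature_branch TEMPLATE main_db")
admin.close()
```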
These are features of any managed database service.
DigitalOcean, Railway, Render, and so on all offer the exact same feature except it's just pure Postgres and you can deploy them in the same data center as your application.
400ms added latency is really bad for user experience. Do a few queries and you’re going to need to add caching. Now you’re spending your precious developer time managing caching invalidation in lots of places instead of just setting up your database properly in the beginning.
I understand there are ways to deal with the problem of latency in serverless, but this is a problem I'd rather not deal with in the first place. The database IS the application, and I would not want to sacrifice speed of the database for anything. Serverless is totally not worth the trade-off for me: slightly more convenient deployments, for much higher latency to the database.
I'm a solo dev that has been installing and running my own database server with backups for decades and have never had a problem with it. It's so simple, and I have no idea why people are so allergic to managing their own server. 99% of apps can run very snappily on a single server, and the simplicity is a breath of fresh air.
That's why I'm working hard on bringing tightly integrated SQLite support to the Elixir ecosystem (via a Rust FFI bridge): because in my professional experience not many applications need something as hardcore and amazing as PostgreSQL; at least 80% of all apps I've ever witnessed would be just fine with an embedded database.
I share experiences similar to yours and others' in this thread, and to me all those operational concerns grow into unnecessary noise that distracts from the real problems we are paid to solve.
Not just cold start (another problem you have to worry about with serverless). There's the simple fact that network latency outside of the same datacenter is ALWAYS slow and randomly unpredictable, especially if you have to run multiple queries just to render a single page to your user. A database should always be over LAN in my opinion, if you need to access data over the internet, at that point it should be over an API/HTTP, not internal database access.
Neon's multi-region support isn't directly comparable to a single Postgres database in a single data center. You can set up Neon in a single data center, too, and I would expect the same performance in that case.
Meanwhile, if you tried to scale your single-Postgres to a multi-region setup, you'd expect higher latencies relative to the location of your data.
Even managed databases are a scam. You can easily get 10x cheaper pricing for the same workload, by, wait for it, installing Postgres yourself on a baremetal machine. Plus you get much better performance, no noisy neighbors, and ability to actually control and measure low level performance. I never got the hype for serverless. Why are people so allergic to setting up a server? It takes a few hours a year of investment, and the performance benefits are huge.
What is the lowdown on Databricks? Their bread and butter was hosted Spark and notebooks. As tasks done in Spark over a data lake began to be delegated wholesale to columnar-store ELT, they tried to pivot to "lakehouses", then I sort of lost track of them after I got out of Spark myself.
Did Delta Lake ever catch on? Where are they going now?
Capture enterprise AI enthusiasm by providing a 1-stop shop for data and AI, optionally hosted on your own cloud tenant. Keep deploying functionality so clients never need another supplier. Partner with SAP, OpenAI, anyone who holds market share. Buy anyone that either helps growth or might help a competitor grow.
Enterprise view: delegate AI environment to Databricks unless you’re a real player. Market is too chaotic, so rely on them to keep your innovation pipeline fed. Focus on building your own core data and AI within their environment. Nobody got fired for choosing Databricks.
You basically pay Databricks a "fee" to choose the more appropriate and modern stack for you to build on, and to keep it up to date. Never used it, but it handles lots of the administrative BS (compliance, SLAs, idk) for you so you can just ship.
That does sound, as you allude, like IBM on its long downward spiral of gobbling up products to stay relevant and touting them as an integral solution, while in-house development stuck to keeping legacy products alive for their enterprise contracts. I wonder if they'll be foolish enough to start doing consulting around them, obliterating their economies of scale in the process; so far they are going with the "consulting partners" approach.
Oh well. Databricks notebooks were hella cool back when companies were willing to spend lavishly on having engineers write cloud hosted Scala in the first place, and at premium prices to boot.
A nice UI for a data lakehouse is underrated. I use AWS Athena at work and it is just so bad, for no good reason. For example, big columns of text are expanded outwards, making the subsequent columns impossible to read.
Delta Lake is not catching on, but no worries, they bought Iceberg[0] (the competing standard).
I'm joking, but only a bit. Iceberg is open source (Apache), but a lot of the core team and the creator worked at Tabular and Databricks bought them for $1B.
It provides a central place to store and query data. A big org might have a few hundred databases for various purposes; Databricks lets data engineers set up pipelines to ETL that data into Databricks, and once the data is there it can be queried (using Spark, so there are some downsides, namely a more restrictive SQL variant, but also some advantages, like better performance across very large datasets).
Personally, I hated databricks, it caused endless pain. Our org has less than 10TB of data and so it's overkill. Good ol' Postgres or SQL Server does just fine on tables of a few hundred GB, and bigquery chomps up 1TB+ without breaking a sweat.
Everything in Databricks - everything - is clunky and slow. Booting up clusters can take 15 minutes, whereas something like BigQuery is essentially on-demand and instant. Data ETL'd into Databricks usually differs slightly from its original source in subtle but annoying ways. Your IDE (which looks like a Jupyter notebook, but is not) absolutely sucks (limited/unfamiliar keyboard shortcuts, flaky, can only be edited in the browser), and you're out of luck if you want to use your favorite IDE, vim, etc.
Almost every Databricks feature makes huge concessions on the functionality you'd get if you just used that feature outside of Databricks. For example, Databricks has its own git-like functionality (which covers the 5% of git that gets used most, with no way to do the less common git operations).
My personal take is Databricks is fine for users who'd otherwise use their laptop's compute/memory - this gets them an environment where they can access much more, at about 10x the cost of what you'd pay for the underlying infra if you just set it up yourself. Ironically, all the Databricks-specific cruft (config files, click-ops) that's required to get going will probably be difficult for that kind of user anyway, so it negates its value.
For more advanced users (i.e. those that know how to start an ec2 or anything more advanced), databricks will slow you down and be endlessly frustrating. It will basically 2-10x the time it takes to do anything, and sap the joy out of it. I almost quit my job of 12 years because the org moved to databricks. I got permission to use better, faster, cheaper, less clunky, open-source tooling, so I stayed.
My stack atm is neovim, Python/R, an EC2 and Postgres (sometimes SQL Server). Some use of Arrow and DuckDB. For queries on less than a few hundred GB this stack does great. Fast, familiar, the EC2 is running 24/7 so it's there when I need it and can easily schedule overnight jobs, with no time wasted waiting for it to boot.
You mentioned earlier how long it takes to acquire a new cluster in Databricks, but here you are comparing it to something that's always on. In a much larger environment, your setup is not really practical for a lot of people collaborating.
Note that Databricks SQL Serverless these days can be provisioned in a few seconds.
> you are comparing it here to something that's always on
That's the point. Our org was told Databricks would solve problems we just didn't have. Serverful has some wonderful advantages: simplicity, (ironically) cost (versus something running just 3-4 hours a day but costing 10x), familiarity, reliability. Serverless also has advantages, but only if it runs smoothly, doesn't take an eternity to boot, isn't prohibitively expensive, and has little friction before using it - Databricks meets 0/4 of those criteria, with the additional downside of restrictive SQL due to the Spark backend, adding unnecessary refactoring/complexity to queries.
> your setup is not really practical to have a lot of people collaborating
Hard disagree. Our methods are simple and time-tested. We use git to share code (100x improvement on databricks' version of git). We share data in a few ways, the most common are by creating a table in a database or in S3. It doesn't have to be a whole lot more complicated.
I totally understand if Databricks doesn't fit your use cases.
But you are doing a disingenuous comparison here because one can keep a "serverful" cluster up without shutting it down, and in that case, you'd never need to wait for anything to boot up. If you shut down your EC2 instances, it will also take time to boot up. Alternatively, you can use the (relatively new) serverless offering from them that gets you compute resources in seconds.
To ensure I'm not speaking incorrectly (as I was going from memory), I grep'ed my several years' of databricks notes. Oh boy.. the memories came flooding back!
We had 8 data engineers onboarding the org to Databricks, and it took 2 solid years before they got to working on serverless (it happened because users complained about the user-unfriendliness of 'nodes', and managers about cost). But then there were problems. A common pattern through my grep of Slack convos is "I'm having this esoteric error where X doesn't work on serverless Databricks, can you help".. a bunch of back and forth (sometimes over days) and screenshots, followed by "oh, unfortunately, serverless doesn't support X".
Another interesting note is someone compared serverless databricks to bigquery, and bigquery was 3x faster without the databricks-specific cruft (all bigquery needs is an authenticated user and a sql query).
Databricks isn't useless. It's just a swiss army knife that doesn't do anything well, except sales, and may improve the workflows for the least advanced data analysts/scientists at the expense of everyone else.
This matches my experiences as well. Databricks is great if 1. your data is actually big (processing 10s/100s of terabytes daily), and 2. you don't care about money.
They are competitors and are similar. Snowflake popularized the cloud data warehouse concept (after AWS fumbled it big with Redshift). Databricks is the hot new tool.
BigQuery ELT. The org I went to was rather immature in their data practice, and I sold them on getting some proper orchestration (Dataform, their preference over dbt, plus Airflow) and keeping the architecture coherent.
I'd have rather stuck with Spark just because I prefer Scala or Python to SQL (and that comes with e.g. being far easier to unit test), but life happened and that ecosystem was getting disrupted anyway.
Databricks is trying hard to get into serverless, but it seems like they refuse to allow it to actually be cheaper, which defeats the purpose of serverless.
You will all be forced to go serverless because new grads can't use the command line. Running a database is about the hardest thing you can do. If it's serverless, you don't need special skills; preventing employees from becoming valuable lowers costs across the board.
When running a service, databases are the hardest to run. K8S still doesn't handle them well (this is by design), so they are the first thing to get outsourced to a managed service.
This is me being less jaded. Support those little wins!
There are so many gotchas. I'm getting so tired of working around it, but my company is all in on serverless so the pain will continue. A lot of it is tied up with Unity Catalog shortcomings, but Serverless and UC are basically joined at the hip.
A few just off the top of my head:
* You can't .persist() DataFrames in serverless. Some of my work involves long pipelines that wind up with relatively small DFs at the end of them, but I need to do several things with that DF. Nowhere near as easy as just caching it (a rough workaround is sketched after this list).
* Handling object storage mounted to Unity Catalog can be a nightmare. If you want to support multiple types of Databricks platforms (AWS, Azure, Google, etc.), then you will have to deal with the fact that you can't mount one type's object storage with another. If you're on Azure Databricks, you can't access S3 via Unity Catalog.
* There's no API to get metrics like how much memory or CPU was consumed for a given job. If you want to handle monitoring and alerting on it yourself, you're out of luck.
* For some types of Serverless compute, startup times from cold can be 1 minute or more.
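The workaround I mean for the .persist() point, sketched with made-up table names (on Unity Catalog you'd use a three-part name): materialize the small result once, then re-read it for each consumer.

```python
# Hypothetical sketch: substitute a one-time materialization for .persist()
# on serverless compute, where caching isn't available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the end of a long pipeline that yields a small DataFrame.
df = spark.range(1000).selectExpr("id % 10 AS region", "id AS amount")

# On classic clusters this would just be: df = df.persist()
df.write.mode("overwrite").saveAsTable("tmp_small_result")
small = spark.table("tmp_small_result")

small.groupBy("region").count().show()                       # consumer #1
small.write.mode("append").saveAsTable("summary_by_region")  # consumer #2
```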
They're getting better, but Databricks is an endless progression of unpleasant surprises and being told "oh no you can't do it that way", especially compared to Snowflake, whose business Databricks has been working to chew away at for a while. Their Variant type is a great example. It's so much more limited than Snowflake's that I'm still learning new and arbitrary ways in which it's incompatible with Snowflake's implementation.
I had an interview with a senior data engineering candidate and we were talking about how expensive Databricks can get. :D I set up specific budget alerts in Azure just for Databricks resources in DEV and PROD environments.
Basically, they separate compute and storage into different components, whereas traditional PG uses both compute and storage on the same server.
Because of this separation, the compute (e.g. SQL parsing) can be scaled independently, and the storage can do the same, using for example AWS S3.
So if your SQL query is CPU-heavy, Neon can just add more "compute" nodes while the "storage" cluster remains the same.
To me, this is similar to the usual microservice setup where you have an API service and a DB; the difference is that Neon is purposely running a DB on top of that structure.
So how is this distributed Postgres still an ACID-compliant database? If you allow multiple nodes to query the same data, isn't this likely just Trino/an OLAP tool using Postgres syntax? Or did they rebuild Postgres and not upstream anything?
It's only serverless in the way it commits transactions to cloud storage, making the server instance ephemeral; otherwise it has a server process with compute and in-memory buffer pool almost identical to pg, with the same overheads.
You shouldn't be getting downvoted. Serverless is nothing more than hype meant to overcharge you, versus running it on a server you own.
That's a reductionist view of a technical aspect based on the way it's sold. Serverless is VMs that launch and shut down extremely quickly, so much so that they open up new ways of using said compute.
You can deploy serverless technologies in a self hosted setup and not get "overcharged". Is a system thread bullshit marketing over a system process?
Okay now I am concerned. We're using Neon. We can move easily at this point, but I'm sure they have huge customers storing many terabytes of data where this may be genuinely hard to do.
I went to Archive.org and figured out that in 2023, they announced they were shutting down on May 30th, all databases shutdown on June 30th, only available for downloads after that, and deleted on July 30th.
Same boat here. Not really looking to have to move but I'm incredibly thankful that I never integrated with Neon more than using Postgres. I don't depend on/need their API or other branching features.
I hate that this is what I've become, I want to try some of the cool features "postgres++" providers offer but I actively avoid most features fearing the potential future migration. I got burned using the Data API on Aurora Serverless and then leaving them and having to rewrite a bunch of code.
They aren't exactly hiding it. I kept my eye on bit.io because they looked very promising. Next day, gone. Shut down immediately. Something is fucky with the investment pipeline, because it's not "worth" that much on its own; it's a market dominance play, bad for innovation.
I've been seriously considering neon for a new application. This definitely gives me pause... maybe plain ol' Postgres is going to be the winner for me again.
Can't speak for anyone but myself, and my experience anecdotally, having used Databricks: I consider them the Oracle of the modern era. Under no circumstances would I let them get their hooks into any company where I have the power to prevent it.
Why do you think so? The Databricks notebook product I have used at a couple of companies is pretty solid. I haven't done any Google research, but they are generally known to be a very high-talent-density kind of place to work.
Serverless in the context of Postgres means to decouple storage and compute, so you could scale compute "infinitely" without setting up replica servers. This is what Neon offers, where you can just keep hitting their endpoints with your pg client and it should just take whatever load (in principle) and bill you per request.
Supabase gives you a server that runs classic Postgres in a process. Scaling in this scenario means you increase your server's capacity, with a potential downtime while the upgrade is happening.
You are confusing _managed_ Postgres for _serverless_.
I haven't studied the CLA situation enough to know if a rug pull is on the table, but OpenTofu and Valkey have shown that where there's a will, there's a way.
The whole point to you, but the whole point to me was having scale-to-zero because Aurora Serverless hurp-durp-ed on that. And I deeply enjoy the ability to fix bugs instead of contacting AWS Support with my hat in my hand asking to be put on some corporate backlog for 2073
Thankfully, you can continue to pay Databricks whatever they ask for the privilege of them hosting it for you
With later-stage companies, the potential IPO is a benefit, not a deterrent. Recruiters and hiring managers will hint at a potential IPO being not far off as an incentive to join. It minimizes risk, and they do the same for a potential target's founders, like Neon's here.
This is better than earlier-stage startups: while there you get far better multiples, it is also quite possible that you are let go somewhere into the cycle without the money to exercise the options for tax reasons, and there is a short exercise window on exit.
For this reason, companies these days offer a 5/10-year post-departure exercise window as a more favorable offer.
——
For founders, it gives them a shorter window to an exit than going it alone, and in a revenue-light, tech-heavy startup like Neon (compared to Databricks) the value risk is reduced, because the stock they get in the acquisition is backed by real revenue and growth, not by early-stage product traction, which is all Neon would have today.
There is also usually a cash component, which is enough for the core things most founders look at: buying a house in the few-million range, closing mortgages, or investing in a few early-stage projects directly or through funds.
No, basically it is a buyback of employee options and stock.
Many companies raise money only to give liquidity to founders/employees and some early investors, even if they don't need money for operations at all.
While Databricks is large, there are much bigger companies that would have IPOed at smaller sizes in the past but are delaying today (and may never go public). Stripe and SpaceX are the biggest examples: both have healthy positive cash flows but don't see the value in going public. Buying back shares and options is the only route to keeping early-stage employees happy if you have no IPO plans.
Well this isn't great news. I quite enjoy using Neon but I doubt it's going to continue to cater to people like me if it's bought by Databricks (from the little I know about them and from looking at their website).
Thankfully, I just need "Postgres", I wasn't depending on any other features so I can migrate easily if things start going south.
Neon is an interesting product, and they've got some great Postgres engineers. Having said that, 1-second cold starts are still quite painful for a website/web app.
I hope the $19 plans are there to stay - but I somewhat doubt it.
Cold starts are 500ms on average, and that only applies to the first call that wakes the DB from hibernation. People still seem to think this latency happens on every call (see other threads here), but once the service has woken up (cold start over), you're back to regular (sub-10ms) latency timings, and the service continues to run that way. You'll only hit a cold start again if your service goes idle for > 5 min (and you have this option turned on). You can turn scale-to-zero off and you'll run 24/7, with zero cold starts.
$19 plan is going away, will launch a better $5 plan soon.
I use neon quite a bit, profiling seems to show ~600-980ms of extra latency. This is in the AWS London region, on postgres 15/16.
Regardless, if I've got a website that's used a couple of times an hour, every hour, then the practical reality is that almost all users see an extra second of latency or so.
I'm not complaining, it's a great product that I'll continue to use, but it's the biggest pain point.
Congrats to the Neon team - they make an awesome product. That’s about all the good I can say here. I don’t blame them for selling out. It’s always felt like a “when” not an “if”. I would be surprised if you can make money selling cloud databases - especially when funded by VCs.
What’s with all these Postgres hosting services being worth so much now?
Someone at AWS probably thought about this (easy-to-provision serverless Postgres), and they just didn't build it.
I’m still looking for something that can generate types and spit it out in a solid sdk.
It's amazing this isn't a solved problem. A long, long time ago, I was part of a team trying to sort this out. I'm tempted to hit up my old CEO and ask him what he thinks.
The company is long gone…
If anything we tried to do way too much with a fraction of the funding.
In a hypothetical almost movie like situation I wouldn’t hesitate to rejoin my old colleagues.
The issue then, as it is today, is that applications need backends. But building backends is boring, tedious, and difficult.
Maybe a NoSql DB that “understands” the Postgres API?
Building backends is easy. It is sort of weird: in 2003 no one would bat an eyelid at building an entire app and chucking it on a server. I guess front-end complexity has made that a specialism, so with all that dev energy drained, they have no time for the backend. The backend is substantially easier, though!
These high-value startups timed it well to capture vibe coding (previously known as building an MVP), front-end culture, and the sheer volume of internet use and developers.
Django on Render (and presumably Heroku) just works.
It's still much more work than just dropping in a Firebase URL. Firebase can lead to poor design choices and come back to bite you, but hopefully by then you've already raised a few VC rounds and you're rolling in dough.
DSQL is genuinely serverless (much more so than "Aurora Serverless"), but it's a very long way from vanilla Postgres. Think of it more like a SQL version of DynamoDB.
Supabase is not just hosted Postgres; it's a full(ish) backend stack built on open source components, comparable with something like Firebase. But being Postgres, it encourages sane data modeling (and provides an escape hatch). Their type generation and SDK are quite good, too. It's one of my favorite services and powers two projects of mine, soon to be three.
Firebase lets you write functions in normal Node.js and Python.
Supabase only supports Deno. The quirkiness is in my own server-side logic, tbf. I've tried to build this project at least 4 times and I might need to take a step back.
"Easy to provision" is mostly a strategic feature for acquiring new users/customers. The more difficult parts of building a database platform are reliability and performance, and it can take a long time to establish a reputation for having these qualities. There's a reason why most large enterprises stick to the hyperscalers for their mission-critical workloads.
That reason also includes SOC2, FedRAMP, data at rest jurisdiction, availability zones etc. And if large enough you can negotiate the standard pricing.
For sure. And oftentimes these less sexy features or certifications are much more cumbersome to implement/acquire than the flashy stuff these startups lead with
1. An acquihire (if you're a Neon customer this would probably be a bad outcome for you).
2. A growth play. Neon will be positioned as an 'application layer' product offered cheap to bring SaaS startups into the ecosystem. As those startups grow and need more services, sell them everything else.
I am fairly new to all these data pipeline services (Databricks, Snowflake, etc.).
Say right now I have an e-commerce site with 20K MAU. All metrics go to Amplitude, and we can use that to see DAU, retention, and purchase volume. At what point in my startup's lifecycle do we need to enlist these services?
A non-trivial portion of my consulting work over the past 10 years has been working on data pipelines at various big corporations that move absurdly small amounts of data around using big data tools like spark. I would not worry about purchasing services from Databricks, but I would definitely try to poach their sales people if you can.
Just curious: what would you consider "absurdly small amounts of data" for big data tools like Spark, and what do you recommend instead?
I recently worked on some data pipelines with Databricks notebooks a la Azure Fabric. I'm currently using ~30% of our capacity and starting to get pushback to run things less frequently to reduce the load.
I'm not convinced I actually need Fabric here, but the value for me has been that it's the first time the company has been able to provision a platform that can handle the data at all. I have a small portion of it running into a database as well, which has drawn constant complaints about volume.
At this point I can't tell if we just have unrealistic expectations about the costs of having this data that everyone wants, or if our data engineers are just completely out of touch with the current state of the industry, so Fabric is just the cost we have to pay to keep up.
One financial services company has hundreds of Glue jobs that are using pyspark to read and write less than 4GB of data per run. These jobs run every day.
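For jobs in that size range, a single process usually does fine; here is a minimal sketch with DuckDB (bucket paths and columns are made up), which covers a surprising share of these workloads:

```python
# Hypothetical sketch: a daily aggregate over a few GB of parquet on S3,
# done in one process instead of a Spark cluster.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # assumes S3 credentials are configured

con.execute("""
    COPY (
        SELECT customer_id, sum(amount) AS total
        FROM read_parquet('s3://example-bucket/trades/2025-05-*.parquet')
        GROUP BY customer_id
    ) TO 's3://example-bucket/aggregates/daily_totals.parquet' (FORMAT parquet)
""")
```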
Of all the billion-scale investment and acquisition news of the last 24 hours, this is the only one that makes sense. Especially after the record-breaking $15B round that Databricks closed last year.
More like "single process application's database".
There are interesting use cases for DB-per-user which can be server or client side, or litestream's continuous backup/sync that can extend it beyond this use case a bit too.
You _can_ use SQLite as your service's sole database, if you vertically scale it up and the load isn't too much. It'll handle a reasonable amount of traffic. Once you hit that ceiling though, you'll have to rethink your architecture, and undergo some kind of migration.
The common argument for SQLite is deferring complexity of hosting until you've actually reached the type of load you have to use a more complex stack for.
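In that spirit, a minimal sketch of the single-file setup (Python's stdlib here, but the same pragmas apply from any language): WAL mode lets readers proceed while a write is in flight, which is what carries most small apps a long way.

```python
# Minimal embedded-database setup: one SQLite file, WAL mode.
import sqlite3

conn = sqlite3.connect("app.db")
conn.execute("PRAGMA journal_mode=WAL;")    # readers don't block the writer
conn.execute("PRAGMA synchronous=NORMAL;")  # common durability/speed trade-off
conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")

with conn:  # one transaction
    conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

print(conn.execute("SELECT count(*) FROM users").fetchone()[0])
```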
Enterprises have lots of data. They store it somewhere, and there are multiple vendors that provide such "credible" infrastructure for this type of storage. Think of it like, your dad says he's willing to get a dog, but only trusts these-five-animal-shelters and nothing else. That doesn't mean that's correct (that those are the only places to get a dog), it just means that's what he trusts. Databricks is most likely a unicorn because they have successfully sold the idea that they are one of those trusted vendors, like Snowflake.
The truth of the 2010s up until now is that every startup was a massive sales con job. The wealth of this industry is not truly built on incredible tech, but on the audacity of salesmanship. It's a billion-dollar con job. That's one of the reasons I take every ridiculous startup that launches quite seriously, because you have no idea just how audacious their sales people are. They can sell anything.
Your question is very fundamental, and the answer is just as raw and fundamental too. I would love it if some of these sales people actually reform and write tell-alls about how they conned so many large companies in their years of working. This content has got to be out there somewhere.
So, I'm not sure if this is less cynical or more cynical, but.. have you ever talked to the decision-makers who buy something like databricks?
They can't build it themselves, and it's highly dubious that they'd be able to hire and supervise someone to build it. Databricks may be selling "nothing special", but it's needed, and the buyers can't build it themselves.
The thing is, it's actually a very difficult engineering/research/infra problem to run complicated queries on enormous data lakes. All the obvious ways to do it are prohibitively slow and expensive. Every bit of performance you can squeeze out of this, you unlock the ability for people to work with their data more easily. So there is huge value in having some centralized companies sink lots of R&D into trying to solve these problems well.
I can tell you the company I work at (4000 people, legacy banking IT) has 4 people running our Datalake. We likely have more people buying/"evaluating" Databricks currently (from overhearing calls in open-plan offices), so I guess they have a point. A very sad point...
My mental model is that there are a few big money-printing industries, and the major players in them will pay just about anything for a slight advantage. It's not really about additive revenue; it's about protecting market share.