Thank you, it was FoundationDB I was thinking of, post-mortem analysis from the ...

voidmain · on Oct 19, 2017

Ah, yes, a completely speculative "post-mortem" by someone who knew nothing about the technology, but wanted some free press for his own database out of the news, knowing that the FDB team wouldn't be able to respond.

And I still can't respond to equally uninformed speculation about the FDB transaction, because that was confidential.

But I can say that the article's technical thesis about the design space is wrong. A layered design (and, really, all databases are layered, so we just mean a design where distributed concurrency control is below the layer of the data model) is good for performance, exactly because it makes it practical in engineering terms to use the right data model for the job, often getting asymptotic wins rather than tiny constant factor ones. Row and column stores, hierarchical document or "table family" stores, conventional indexes, compressed bitmaps, spatial indexes, etc can all be reasonably efficiently mapped to K/V pairs and then take advantage of the same fast distributed storage and concurrency control. You can use them for different indexes in the same database or the same table. You can, just to throw out an example, store recently changed rows in a row store while transactionally moving them into a column representation in the background, and combining both seamlessly for reads. And you can have things that aren't a good fit for the relational model, like queues, graphs, and full text search indexes, in the same database and operate on them transactionally. You can still co-locate compute with the data, and you can actually scale indexes because they don't have to be co-located with the data as most databases wind up doing. And let's not forget that you have at least a chance to actually get the hard distributed systems part right, as so few products do, because it is not hopelessly entangled with your data model and execution engine.

To address some specific speculations from the article: you don't have to read metadata every transaction; you can just cache it in memory and just ask the distributed store to make sure it hasn't changed in conflict with the transaction (via a tiny read of a serial number, for example, or a usually-empty range read of a log of metadata changes). A key/value store designed for layering needs to be able to go full speed with real, cross-node, multi-key transactions rather than focus on a single node "fast path" (like, if I recall correctly, VoltDB does). It's also highly recommended to be able to do concurrency control on arbitrary ranges rather than just individual keys, both for performance and serializability reasons. Indexes, as I've already mentioned, are vastly more scalable in this type of design than most others. And you get to decide whether you want to read from them serializably or not. One way you can do push down (which in the modern world of 10GBE networking is way more important for OLAP workloads than OLTP, but let's be perfectionist) is to co-locate SQL and K/V store nodes and to let the higher layer see where data is located in the lower layer so that it can partition queries (at least approximately) to one of its own co-located nodes. You keep all the nice abstractions and the statelessness of the SQL nodes, and you can read data locally (a nice optimization would be to have a super fast shared memory path for doing this in the database driver).

There is some cost for every abstraction. But because human ingenuity is finite, successful abstractions on net greatly increase what we can do. We don't write everything in machine code, or build new custom networking hardware and protocol stacks for every piece of data we need to send, even though there might be some microoptimization opportunities if we did.

The real performance problem that any new SQL database is going to face is that its query planner isn't exactly the same as the one people are trying to switch from. So if the customer's random web app has 1000 queries, and 995 of them are faster on the new database and 5 of them are worse, guess what happens? The 1000 queries by selection bias ran acceptably on the old database, but the 5 worse ones can be arbitrarily worse, because the badness of query planning in such a rich query model isn't really bounded. So the application is now brokenly slow. So there is going to be a long, long game of whack-a-mole, and users' experience of the performance is going to tend to be negative no matter how good or bad it really is. In my view, the ideal database interface for scalable production "OLTP" applications would be as powerful as SQL, but do less magic in the query planner - it would make it more explicit in the query how challenging the execution plan is supposed to be, so that you don't have queries that explode into unscalability. Pushing a little more work onto developers in order to save the skins of the ops people when the cardinality of some table changes.

Sorry, /rant. I don't know anything about the engineering of TiDB or TiKV and can't comment on them. But I really strongly believe that a layered architecture is the right one both for databases and for lots of things that people don't think of as databases but that wind up having to solve all the same concurrency control and fault tolerance problems.

jhugg · on Oct 23, 2017

As I mentioned in a previous comment, you can fix some of the problems; caching metadata is a great example.

The real issue is not whether abstractions are good, but at what layer and how pure/leaky they should be.

If I build a SQL engine on top of RocksDB, I still need a way to scan a bunch of tuples and apply a predicate. It's probably faster if RocksDB lets me hand over a predicate and returns an iterator of matching tuples than if I have to iterate on top of rocks DB. Maybe this difference is large -- maybe not. It depends on a lot of details. Certainly a custom storage layer turned to apply predicates fast is substantially faster.

If I build a SQL engine on top of a distributed KV store, then I really want to push the predicate scan down to the individual nodes, and I probably still want to push the predicate down even lower. For most queries, I also want to have understanding of how data is partitioned.

You can do all of this, but the abstraction gets leakier and leakier as you start to get reasonable performance. At the time, the FDB SQL layer didn't seem to do any of this. Maybe not at Apple it is much smarter and more intertwined.

The planner issue you mention is real, but I'm slightly more optimistic that engineers are willing to identify slow queries and figure out how to adapt them to the new system if the rewards are clear.

N.B. If you're using SQL for KV gets/puts, or if you're joining one row to a handful of others by primary key (e.g. lookup order items in order), then this stuff doesn't matter much. But if you give someone a SQL layer, odds are they'll want to run a query sooner or later, even an OLTP-ish one.

To address the "completely speculative "post-mortem" by someone who knew nothing about the technology" bit: I was only talking about the FDB SQL layer performance and design, much of which was public at the time of the acquisition.

slashdev · on Oct 19, 2017

Thank you for such a thoughtful and detailed response. I'd like to apologize for contributing to the spread of rumors around FDB's acquisition, as you guys could never tell the real story, rumors are all we had.

Regarding FDB's architecture, I'm going to agree with you wholeheartedly that the world is built on layers of abstractions. It's how we make any progress at anything. It's useful, and it can save time that can be focused on more important or beneficial areas. I would hope nobody in CS would disagree with your premise there.

However, there's a reason nobody writes a high performing database in Python. Or just uses the filesystem as a KV store. Abstractions are obviously useful, but it's equally trite to say they're always beneficial.

So the question with FDB: Ddoes basing everything on a generic KV abstraction yield enough benefits to make up for the losses in performance? I'm not qualified to say. Clearly the developers of FDB think so, and they'd be the ones most likely to know.

My opinion, as a database developer with many years experience is that any performance losses from building over a generic KV-store are probably solvable or mostly solvable. They also likely pale into insignificance next to the performance losses from trying to run a relational database over a distributed cluster. I'd question the sanity of doing that when the vast majority of people can fit their entire OLTP dataset in memory on a single server. I know I'm in the minority with that opinion though.

wwilson · on Oct 19, 2017

Your understanding of what occurred in the acquisition is incorrect. (Source: I am a former FoundationDB employee.) Unfortunately I can’t say much more than that.

The VoltDB post is also wrong in many of its particulars, but I’ve gotten into that argument on this site way too many times by now (and last I checked John agreed that some of its claims were overly hasty).

If there’s a particular question about FoundationDB’s performance characteristics that you’re curious about, I can probably give you a straight answer provided that it was once public information.

jhugg · on Oct 23, 2017

To be pedantic, I don't think I agreed that any of my claims were hasty, just that some of them could be addressed more easily than others. Caching metadata, for example.

To be clear, I have a lot of respect for FDB and what their team did; it was only the SQL layer put on top that seemed poorly thought through.

slashdev · on Oct 19, 2017

I'm happy to hear that I'm wrong. It obviously looked a lot like an aqui-hire from the outside, and that was the speculation on HN at the time. I hope the FoundationDB team ended up doing great things at Apple.