Generally interesting article, so I was struck by the ending:
> Also, the confusion does not only come from the overlaps in the wording, but in the necessity to go into the implementation details to understand the difference in a real-life concurrent application. In practice, this is one of the difficulties that had led many people (me included) to look at the ‘NoSQL’ world.
Consistency in database systems is a hard topic because maintaining consistency in any concurrent system is hard. Traditional systems provide you with a set of tools (isolation levels, foreign keys) that make a decent programmer capable of building a safe concurrent application.
Throwing away those tools and replacing them with nothing does not make life easier. The tool is easier to understand, but the problem is harder to solve.
This is not to say that NoSQL systems do not have a place - of course they do - but I feel like a lot of people adopting them talk about 'eventual consistency' while what they maintain is actually inconsistency.
Maintaining consistency in NoSQL systems when the application is nontrivial is really hard - and if the developer is not up to understanding the locking model of a traditional DB, I'd be pretty surprised if they were up to working with an eventually consistent system.
Thanks for the feedback.
> if the developer is not up to understanding the locking model of a traditional DB I'd be pretty surprised if they were up to working with an eventually consistent system
I agree. As well, most 'NoSQL' systems don't throw everything away. Typically, some of them are strongly consistent. The ones based on Dynamo claim to offer 'tunable consistency', i.e. the choice between strong and weak consistency is left to the user.
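The Dynamo-style 'tunable consistency' boils down to a quorum-overlap rule: with N replicas, a write acknowledged by W nodes and a read that contacts R nodes can only be guaranteed to see the latest write when R + W > N, so the read set and write set must intersect. A minimal sketch of just that rule (illustrative only; real systems complicate this with sloppy quorums, hinted handoff, and so on):

```python
# Dynamo-style quorum rule: with N replicas, a read of R nodes is
# guaranteed to overlap a write acknowledged by W nodes iff R + W > N.

def read_sees_latest_write(n: int, w: int, r: int) -> bool:
    """True when any R-node read set must intersect any W-node write set."""
    return r + w > n

# N=3 with quorum reads and writes (R=W=2): the read must hit a replica
# that took the write.
print(read_sees_latest_write(3, 2, 2))   # True
# N=3 with W=1, R=1: fast, but a read may miss the latest write entirely.
print(read_sees_latest_write(3, 1, 1))   # False
```

Tuning 'up' to strong consistency is therefore a per-operation choice of R and W, with latency and availability as the price.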
We have a lot of "ACID simply works" and "NoSQL is available". The blog is basically about saying things are not that simple, and that includes "isolation in ACID isn't that simple".
First, I've made one sinful over-simplification in my own post, in conflating NoSQL systems with eventual consistency. While that's usually the case, it's absolutely not an intrinsic property: my apologies!
True SERIALIZABLE-level ACID does pretty much just work - and if you're using Postgres the performance hit really isn't too bad. Of course you're chucking away replication then, so whether it's suitable for your needs may vary rather a lot!
Dynamo-based systems have 'tunable consistency', but it's almost always over one key: multi-key operations are usually inconsistent. That being the case, they're pretty much only 'easy to use' for applications with a very simple data model: my experience is that most applications of any real complexity will at some point want to do some kind of multi-key operation. If so, you're probably on the hook for a pretty expensive programmer.
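To see why per-key guarantees aren't enough, consider a hypothetical store that is perfectly consistent on each individual key but has no multi-key transactions. A 'transfer' between two keys is then two independent single-key writes, and anyone reading between them observes a state that violates the application invariant. A deterministic toy sketch (the dict stands in for the store; nothing here is a real database API):

```python
# Hypothetical per-key-atomic store with no multi-key transactions.
# Invariant the application cares about: store["a"] + store["b"] == 100.

store = {"a": 100, "b": 0}

def transfer(amount):
    # Step 1: debit key 'a' (atomic on its own).
    store["a"] -= amount
    # A reader arriving *here* sees the invariant violated:
    snapshot = dict(store)
    # Step 2: credit key 'b' (also atomic on its own).
    store["b"] += amount
    return snapshot

seen = transfer(30)
print(seen["a"] + seen["b"])     # 70 -- money 'in flight', invariant broken
print(store["a"] + store["b"])   # 100 -- restored only after both writes
```

An ACID database hides the intermediate state behind a transaction; without that, the application has to tolerate or repair such states itself.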
I'm vaguely aware that this doesn't strictly apply to Cassandra, which has a limited notion of transactions - last I checked they didn't work very well at all (https://aphyr.com/posts/294-call-me-maybe-cassandra/), but that may well have changed.
I do appreciate your blog post in general - I think there's an awful lot of oversimplifying of this stuff out there. Part of the problem is that high-speed, concurrent, distributed data storage is a topic that is, at its heart, pretty damn complicated.
Well, I've never used Postgres, but from the v9.4 documentation it does not look that different from the other DB engines: 'Read Committed is the default isolation'; 'applications using this level must be prepared to retry transactions due to serialization failures' (plus the one you mentioned already: 'Serializable transaction isolation level has not yet been added to Hot Standby replication targets').
Not exactly what I like to call 'simply works' ;-)
But I didn't want to say that a NoSQL database is always better than a traditional one. Just that isolation is complex on traditional systems when dealing with volume & concurrency. And, typically, transactions across tables or even rows are difficult/impossible for a distributed database, as these rows can be on different nodes (the 'multi-key operations' you mentioned).
Postgres is quite different in one respect: it has truly serializable snapshot isolation, at an acceptable performance cost (single-digit percentage, generally). Other DB systems are either not truly serializable, or have lock-based implementations that are sometimes more difficult to work with for web apps.
> 'applications using this level must be prepared to retry transactions due to serialization failures'
True of any serializable system that supports concurrent access, AFAIK - not quite sure that's a fair criticism :-).
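The practical consequence is that any client of a truly serializable database ends up wrapping its transactions in a retry loop: on a serialization failure (SQLSTATE 40001 in Postgres), roll back and re-run the whole transaction from the top. A sketch under stated assumptions - the exception class and the flaky 'transaction' below are stand-ins for illustration, not a real driver API:

```python
# Generic retry loop for serializable transactions. SerializationFailure
# is a stand-in for the driver error raised on SQLSTATE 40001.

class SerializationFailure(Exception):
    """Stand-in for the driver's serialization-failure error."""

def run_transaction(txn, max_attempts=5):
    """Run txn(), retrying from scratch on serialization failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise
            # A real client would ROLLBACK here, and possibly back off.

attempts = 0
def flaky_txn():
    # Simulated transaction: conflicts with a concurrent writer twice,
    # then commits cleanly on the third attempt.
    global attempts
    attempts += 1
    if attempts < 3:
        raise SerializationFailure()
    return "committed"

result = run_transaction(flaky_txn)
print(result)  # committed
```

The key discipline is that the transaction body must be safe to re-execute in full - side effects outside the database (emails, payments) need to live outside the retried block.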
------
> And, typically, transactions between tables or even rows are difficult/impossible for a distributed database
Depends what you mean by 'distributed', really - Oracle RAC is very much distributed, and supports normal transactional behaviour. On the other hand you won't get that working across a large geographical area.
I accept that understanding the impact of isolation levels can be complex - I'm just very much of the opinion that you'll take a lot more pain trying to maintain consistency in a typical NoSQL system.
I can agree with these facts.
I just give them a different weight than you do: I don't like the uncertainty of those 'serialization failures': their frequency depends on a workload that can be difficult to predict, especially if you're a software vendor. YMMV :-)
Thanks for the constructive feedback.
There's only one type of consistency in the CAP theorem. The OP is probably just inexperienced with NoSQL databases, since Cassandra, Riak and HBase all support strongly consistent operations.
...over a single key, which is an extremely limited guarantee compared to what a lot of applications need. I'm aware that Cassandra offers a limited notion of linearisable transactions, but last I checked they weren't all that great in terms of functionality or correctness - that may well have changed in the last year or so, of course.
To be fair, most traditional DBs are not really ACID by default. This can have implications for traditional DB application development. NoSQL is at least more honest in its documentation: you know that you are not getting ACID, or that you are getting limited ACID, as with ArangoDB.
Bad implementations are indeed rife! It can be particularly problematic in home-grown ORMs. ACID DBs do provide you with tools to work around this pretty easily, but it's fair to say a lot of mistakes get made.
It's one reason I'm wary of anti-ORM sentiment I sometimes see around. I'm quite often more productive when I eschew a heavyweight ORM in favour of something closer to the 'metal', but part of that is that I have a lot of background in database systems, so I'm confident that I will generally avoid concurrency mistakes. For a typical programmer, in a developer culture that doesn't want to understand the complexities of data storage, my experience is that they're often better off using an ORM.
Very good post. The two meanings of "consistent", and the generally different ways database literature and distributed systems literature approach the world, are a really common source of confusion.
One way out of the confusion is to use different words. For example, using "linearizability", "serializability" and "strict serializability" may cut down confusion. These terms have complex-sounding names, but generally very approachable definitions. Aphyr's blog post on different models is a good place to start: https://aphyr.com/posts/313-strong-consistency-models
Hallelujah, somebody wrote this article. I write a distributed database (http://github.com/amark/gun) and have repeatedly gotten into discussions where everybody is confused about ACID. This article actually takes the time and effort to distinguish between some of these ideas. Thank you so much, I'll be quoting this in the future.
>Available: “every request received by a non-failing node in the system must result in a response” [C2]
This is also confusing to people. It means a successful response: you can't, for example, respond with an error and claim the system is still available. I've seen some NoSQL databases claim they were still available because the user was getting an error message back.
There is a kind of opposition between 'CAP as a theorem' and 'CAP as a tool to think about distributed systems'. The theorem does not leave much room for anything other than black and white. But in the (great) paper you mention, there is a lot about "what is the future for distributed systems" - it's more CAP-as-a-tool.
In the post (and in the blog) I stick to the theorem. It's not 'right' or 'wrong', it's a choice. I made this choice because a lot of people are deciding their trade-offs with CAP-as-a-theorem, while actually CAP-as-a-theorem cannot be applied to the problem they're working on.
This article also makes it even more complicated by calling the C in CAP "atomic consistency". While technically correct - it means the same thing as linearizability - that term is rarely used by comparison. When you're already juggling the overlapping terms in CAP and ACID, it just seems excessive.
'Atomic Consistency' is used in the CAP proof. YMMV, but I've seen it more used than 'linearizability' (it's easier to pronounce...). Agreed, there is no confusion when you use linearizability.