Hacker News new | past | comments | ask | show | jobs | submit login
The confusing CAP and ACID wording (thislongrun.blogspot.com)
71 points by nkeywal on March 29, 2015 | hide | past | favorite | 27 comments



Generally interesting article, so I was struck by the ending:

> Also, the confusion does not only come from the overlaps in the wording, but in the necessity to go into the implementation details to understand the difference in a real-life concurrent application. In practice, this is one of the difficulties that had led many people (me included) to look at the ‘NoSQL’ world.

Consistency in database systems is a hard topic because maintaining consistency in any concurrent system is hard. Traditional systems provide you with a set of tools (isolation levels, foreign keys) that make a decent programmer capable of building a safe concurrent application.

Throwing away those tools and replacing them with nothing does not make life easier. The tool is easier to understand, but the problem is harder to solve.

This is not to say that NoSQL systems do not have a place - of course they do - but I feel like a lot of people adopting them talk about 'eventual consistency' while what they maintain is actually inconsistency.

Maintaining consistency in NoSQL systems when the application is nontrivial is really hard - and if the developer is not up to understanding the locking model of a traditional DB, I'd be pretty surprised if they were up to working with an eventually consistent system.


Thanks for the feedback. > if the developer is not up to understanding the locking model of a traditional DB I'd be pretty surprised if they were up to working with an eventually consistent system I agree. As well, most 'NoSQL' systems don't throw everything. Typically, some of them are strongly consistent. The ones based on Dynamo claim to have 'tunable consistency', i.e. the choice between strong & weak consistency is up to the user.

We have a lot of "acid simply works" and "NoSQL is available". The blog is basically about saying things are not that simple, and it includes this "isolation in acid isn't that simple".


First, I've made one sinful over-simplification in my own post, in conflating NoSQL systems with eventual consistency. While that's usually the case, it's absolutely not an intrinsic property: my apologies!

True SERIALIZABLE-level ACID does pretty much simply just work - and if you're using Postgres the performance hit really isn't too bad. Of course you're chucking away replication then, so whether it's suitable for your needs may vary rather!

Dynamo-based systems have 'tunable consistency' but that's almost always over one key: multi-key operations are usually inconsistent. That being the case, they're pretty much only 'easy to use' for applications with a very simple data model: my experience is that most applications of any real complexity will at some point want to do some kind of multi-key operation. That being the case, you're probably on the hook for a pretty expensive programmer.

I'm vaguely aware that this doesn't strictly apply to Cassandra, and it has a limited notion of transactions - last I checked they didn't work very well at all ( https://aphyr.com/posts/294-call-me-maybe-cassandra/ ), but that may well have changed

I do appreciate your blog post in general - I think there's an awful lot of oversimplifying of this stuff out there. Part of the problem is that high speed, concurrent, distributed data storage is a topic that is, at its heart, pretty damn complicated. Unfortunately,


Well, I've never used postgres, but from the documentation v9.4 it does not look that different from the others db engine: 'Read Committed is the default isolation'; 'applications using this level must be prepared to retry transactions due to serialization failures' (plus the one you mentioned already: 'Serializable transaction isolation level has not yet been added to Hot Standby replication targets')

Not exactly what I like to call 'simply works' ;-)

But I didn't want to say that a NoSQL database is always better than an traditional one. Just that Isolation is complex on traditional systems when dealing with volumes & concurrency. And, typically, transactions between tables or even rows are difficult/impossible for a distributed database, as these rows can be on different nodes (the 'multi-key operations' you mentioned)


Postgres is quite different in one respect: it has a truly serializable snapshot isolation, at an acceptable performance cost (single digit percentage generally). Other DB systems are either not truly serializable, or have lock-based systems that are sometimes more difficult to work with for web apps.

> 'applications using this level must be prepared to retry transactions due to serialization failures'

True of any serializable system that supports concurrent access, AFAIK - not quite sure that's a fair criticism :-).

------

> And, typically, transactions between tables or even rows are difficult/impossible for a distributed database

Depends what you mean by 'distributed', really - Oracle RAC is very much distributed, and supports normal transactional behaviour. On the other hand you won't get that working across a large geographical area.

I accept that understanding the impact of isolation levels can be complex - I'm just very much of the opinion that you'll take a lot more pain trying to maintain consistency in a typical NoSQL system.


I can agree with these facts. I just give them a different weight than you do: I don't like the incertitude of the 'serialization failures': it depends on a workload that can be difficult to predict, especially if you're a software vendor. YMMV :-) Thanks for the constructive feedback.


Thanks for a good read/conversation!


Maintaining strong consistency in some NoSQL databases e.g Cassandra is merely a config parameter away. Really isn't that hard.


Are you able to do full atomic transactions in Cassandra now? Like the ones described in the article:

    begin
        val1 = read(account1)
        val2 = read(account2)
        newVal1 = val1 - 100
        newVal2 = val2 + 100
        write(account1, newVal1)
        write(account2, newVal2)
    commit
Last time I checked this was not possible, only light compare-and-swap operations.


Are you talking about the same kind of consistency as the parent post?


There's only one type of consistency in the CAP theorem. The OP is probably just inexperienced with NoSQL databases since Cassandra, Riak and HBase all support strongly consistent operations.


...over a single key, which is an extremely limited guarantee compared to what a lot of applications need. I'm aware that Cassandra offers a limited notion of linearisable transactions, but last I checked they weren't all that great in terms of functionality or correctness - that may well have changed in the last year or so, of course.


But there is a second consistency he could be talking about, ie ACID consistency. Thus the article...


Let me clarify then. Strong consistency = ACID consistency.

If I write the data and then immediately read it I will get the new data under all circumstances.


And ACID DB solves a tiny fraction of consistency issues.

Most of the really difficult problems show up when you have multiple users and still need consistency.

Granted, ACID databases make solving most of these far simpler.


To be fair most traditional DBs are not really ACID. This could have implications in tradtional DB application development. NoSQL at least is more honest in its documentation. You know that you are not getting ACID, or if you are getting limited ACID like with ArangoDB.


Can you elaborate please? Are you saying that systems like Postgres, Oracle, and SQLServer are not ACID?


I am guessing he means the applications built on top of them.

Say for example you have two users on a normal CRUD app. User 1 opens an object in a web form. User two opens the same objects a few seconds later.

User 1 updates field A. Saves the object back to the database.

User 2 updates field B. (field A is still in the web form at its initial state). Saves the object.

Field B is at the correct state, but field A is not.


Bad implementations are indeed rife! It can be particularly problematic in home-grown ORMs. ACID DBs do provide you with tools to work around this pretty easily, but it's fair to say a lot of mistakes get made.

It's one reason I'm wary of anti-ORM sentiment I sometimes see around. I'm quite often more productive when I eschew a heavyweight ORM in favour of something closer to the 'metal', but part of that is that I have a lot of background in database systems, so I'm confident that I will generally avoid concurrency mistakes. For a typical programmer, in a developer culture that doesn't want to understand the complexities of data storage, my experience is that they're often better off using an ORM.


Very good post. The two meanings of "consistent", and the generally different ways database literature and distributed systems literature approach the world, are a really common source of confusion.

One way out of the confusion is to use different words. For example, using "linearizability", "serializability" and "strict serializability" may cut down confusion. These terms have complex-sounding names, but generally very approachable definitions. Aphyr's blog post on different models is a good place to start: https://aphyr.com/posts/313-strong-consistency-models

Another set of terms that approaches a different part of this problem is Harvest and Yield, which can help explain real-world availability and consistency (http://codahale.com/you-cant-sacrifice-partition-tolerance/). Unfortunately, the original paper just seems to add to the term confusion (http://brooker.co.za/blog/2014/10/12/harvest-yield.html).


Hallelujah somebody wrote this article. I write a distributed database (http://github.com/amark/gun) and have repeatedly gotten into discussions where everybody is confused about ACID. This article actually takes the time and effort to attempt to delineate between some of the ideas. Thank you so much, I'll be quoting this in the future.


>Available: “every request received by a non-failing node in the system must result in a response” [C2]

This is also confusing to people. It means successful response, you can't for example respond with an error and claim the system is still available. I've seen some NoSQL databases claim they were still available because the user was getting an error message back.


Yes, I agree. It's not 'eventually consistent' but it's nearly available-in-CAP.

There are some elements on the CAP definition traps in the first post: http://thislongrun.blogspot.com/2015/03/comparing-eventually.... I also plan to do another post on this subject.


Great article, although I am not an expert in both CAP and ACID, it looks very reliable to me.

I was surprised this was not mentioned yet: http://www.infoq.com/articles/cap-twelve-years-later-how-the...

(Eric Brewer revisits the CAP theorem and explains why it's not always a "black or white" case of "choose only two" triangle...)


There is a kind of opposition between 'CAP as a theorem' and 'CAP as a tool to think about distributed systems'. The theorem does not leave much room for something else than black and white. But in the (great) paper you mention, there is a lot about "what is the future for distributed systems", it's more CAP-as-a-tool.

In the post (and in the blog) I stick to the theorem. It's not 'right' or 'wrong', it's a choice. I made this choice because a lot of people are deciding their trade-offs with CAP-as-a-theorem, while actually CAP-as-a-theorem cannot be applied to the problem they're working on.


This article also makes it even more complicated by calling the C in CAP "atomic consistency". While technically correct as that means the same thing as linearizability, that term is rarely used in comparison. When comparing the overlapping terms in CAP and ACID it just seems excessive.


'Atomic Consistency' is used in the CAP proof. YMMV, but I've seen it more used than 'linearizability' (it's easier to pronounce...). Agreed, there is no confusion when you use linearizability.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: