Hacker News
Why use Paxos instead of Raft? (neon.tech)
131 points by davidgomes on Aug 15, 2022 | hide | past | favorite | 45 comments


I used to work in the orbit of a distinguished eng at AWS who was famous for saying something to the effect of, "At the bottom of any scaled distributed system is either Paxos, or a bug."


When I was at AWS I heard the same phrase from a DE, probably the same guy, and once heard him also say "Raft is just a special case of Paxos to try to simplify it, but regular Paxos isn't actually that hard, just use Paxos".

I was fairly junior at the time, and Raft seemed much more approachable, but after really forcing myself to read and understand the Paxos paper, I see what he meant. I am pretty sure most of the love for Raft was that the original whitepaper was just a better presentation. The actual Paxos algorithm is quite simple.

If you go into Raft already having mastered Paxos (as this DE was doing), it's clear that Raft is basically isomorphic to a special case of Paxos.


https://arxiv.org/abs/2004.05074

This paper argues that Raft has a different leader election mechanism than Paxos, but that if you tweak some terminology and make a few relatively reasonable implementation choices for Paxos, they are otherwise pretty much equivalent.

It even gives a Raft-style single-page description of Paxos (using Raft-style terminology) and shows how little it differs from the equivalent single-page summary of Raft.

The main implementation choices they make are:

- combine the roles into a single server role
- enforce that log entries are decided in sequence (largely to avoid having to specify how a newly elected leader proposes operations for the gaps, possibly as no-ops)
- use a numeric ballot number rather than a lexicographical pair (but this changes nothing except making the summary slightly easier to express)
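For anyone who hasn't seen it, the acceptor side of single-decree Paxos really is small. Here's a rough Python sketch (class and method names are mine, not from the paper; persistence and networking omitted):

```python
class Acceptor:
    """Single-decree Paxos acceptor: the entire persistent state is
    three values (promised ballot, accepted ballot, accepted value)."""

    def __init__(self):
        self.promised = -1          # highest ballot we promised not to undercut
        self.accepted_ballot = -1   # ballot of the value we last accepted
        self.accepted_value = None

    def prepare(self, ballot):
        # Phase 1: promise to ignore lower ballots; report any prior accept.
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted_ballot, self.accepted_value)
        return ("nack", self.promised, None)

    def accept(self, ballot, value):
        # Phase 2: accept unless we promised a higher ballot.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return ("accepted", ballot)
        return ("nack", self.promised)
```

All the safety lives in the proposer-side rule this implies: a proposer that collects promises from a majority must re-propose the value from the highest reported accepted ballot, if any, instead of its own.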


For some reason, people considering PAXOS/RAFT don't tend to consider CRDT/OT synchronization. I think this is a big oversight.

We should start considering CRDT/OT/VCS/Diffsync approaches to distributed systems as well. They present a very nice alternative approach: whereas PAXOS/RAFT implement a consistent "distributed state machine", a CRDT, OT, VCS, or Diffsync system implements consistent "distributed state", upon which one can build a machine as a function of the state.

This latter approach is actually simpler, IMO, because it encapsulates all the challenge of distributed consistency within a smaller subset of the problem — state synchronization. This makes it more generally re-usable. When you create a system, you can just use an off-the-shelf library & algorithm to synchronize your data over a network, and then write synchronous functions on top of that to represent the system you want, however you want, without having to understand PAXOS/RAFT.


Would this approach be resistant to a rogue actor? If one actor sent bad data, would all the other actors still be able to reach consensus?

Paxos is complicated, but it’s well studied and proven.


Neither OT/CRDT nor Paxos/Raft are meant to resist rogue actors. They are not Byzantine fault tolerant.


Assuming we're thinking of the same person, I believe it went:

"There are three types of consistent distributed systems: paxos, broken protocols, and single points of failures."


i don't understand why people won't just say alv. he won't hurt you.


Really sucks the mystery out of the room though, doesn't it?

It's almost more impressive that people know who it is without saying the name.


You're assuming the name matters more than the content.


i guess he never explained a false dichotomy to you :).

knowing who said this allows readers to be able to look up more of the (unfortunately scant) content publicly available from him.


I just watched a talk on this the other day (assuming it's the same person, since he's saying the same thing): https://youtu.be/QVvFVwyElLY?t=2370


It's a fun quote, it reminds me of the "every sufficiently advanced program contains a bug-ridden implementation of half of common lisp", or something along those lines. But there really is a wide world of distributed consensus systems out there (although Paxos is easily the most elegant).



There's a variation of this on the subject of distributed systems re: Erlang/BEAM VM


Yes, from Robert Virding, one of Erlang's co-creators, from the Erlang mailing-list, from 2008 [0]:

After reading the blogs about how good Erlang's concurrency model is and how we just made a super implementation of it in XXX I have been led to formulate Virding's First Rule of Programming:

Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.

This is, of course, a mild travesty of Greenspun (*) but I think it is fundamental enough to be my first rule, not the tenth.

[0] http://erlang.org/pipermail/erlang-questions/2008-January/03...


There's also ZooKeeper with its own consensus protocol.


“…but I repeat myself.”


Interesting reasoning! I had a similar thought about why raft doesn’t allow observers the other day.

I used observers with Gluster previously and went from annoying split brain scenarios to flawless clusters just by adding a few, and their resource usage was basically nothing.


The author is here, happy to answer questions if any.


Besides storage itself, the Postgres compute layer has a good amount of (transient) state that doesn't lend itself to either compute nodes or clients springing in and out of existence in a serverless environment. For instance, a fresh compute node with an unfilled cache can perform horribly, and Postgres client connections don't scale well with transient clients. Both of these problems, and others in the same category, were very noticeable for Aurora Serverless. My understanding is that AWS mitigates these two by an elaborate cache-filling service for new nodes, and a pgbouncer-style proxy pooling connections and hiding compute nodes being rescheduled from clients.

What's Neon's point of view about transient state in nodes? Is there a world where serverless client connections are stateless, or is the set up overhead not expected to be worth the cost?


Right. Our design guideline is to get as much serverless behavior as possible while keeping full Postgres compatibility (in terms of features and expected performance). Single-node Postgres can give you hundreds of thousands of small RW queries per second, so competing connections should be a few compare-and-swap instructions away from the shared state to provide this performance. So for the primary, it means it should be just Postgres in a container or VM, and we have to deal with the consequences (cache pre-warming, handling cross-node migrations, etc.).

However, read-only nodes require less coordination, and we have way more freedom there, so read-only Postgres as a function seems to be a more feasible concept.


This is a very good question. We are working on it and will be publishing a blog post on autoscaling very soon. We are experimenting with VM migration technology that would allow us to transfer state between compute nodes and fail over traffic.

We have some encouraging early results, but haven't committed to a particular technology (like cloud hypervisor) yet.


Thank you for publishing the TLA+ model, that dramatically increases my level of trust in Neon. (I'm on the early adopter list, got my invite a week or two ago but haven't been able to give it a spin yet.)

> Right now, such a change requires humans to be in the loop to ensure that the old safekeeper is actually down. It is on our roadmap to automate this procedure.

If you do implement this (which I don't recommend), be certain to also model it with TLA+. This level of automation, IMO, requires a human in the loop + a ton of visibility tracking on when it is happening.

A good way to roll it out is "semi-automation"—implement the automation but use it to ask a human to approve. The human will then do the normal (manual) verification. After you've run that successfully for a year, and your TLA+ model passes, you can then decide to fully automate without a human in the loop.

Otherwise, you're asking for an outage (caused by bad failover) IMO, and possibly data loss.


Do you attempt to guarantee linearizability of read-only operations? The scenario I'm concerned about is when a partitioned compute node is processing a read-only transaction from a partitioned client, and neither has noticed the partitioned compute node has been replaced in a later term. Do you use a lease system for this that relies on the partitioned compute nodes being able to accurately measure the passage of time (not wall clock time), or do you have the compute nodes contact a quorum of acceptors before replying to read-only queries as well?


Good catch! Currently, we don't, and we rely on k8s to stop the old node. Technically speaking, if k8s and our control plane are always good at stopping the old primary, we don't need consensus at all. So it is more a question of what set of problems we could see if there is a bug in our orchestration code. Split-brain seemed unacceptable, but for stale reads we decided that we can rely on k8s alone, without double-checking on our side.


if storage and compute are separated - how is storage mounted onto the compute? Generally you can attach a volume to only one server at a time


We changed Postgres to send WAL to safekeepers and read from page servers: https://neon.tech/blog/architecture-decisions-in-neon/


Safekeeper-to-pageserver transfer takes some time.

What happens when the compute server issues a read for something that has made it to the WAL servers but not the page servers?


Postgres tracks maximal LSN among the evicted pages and passes it to the pageserver in the page request. If the pageserver hasn't received that LSN, it will wait for it to arrive.
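In other words, the read path is a barrier on the WAL stream. A hypothetical sketch of that wait logic in Python (all names here are invented for illustration; Neon's actual pageserver is written differently):

```python
import threading

class PageService:
    """Blocks page requests until WAL up to the requested LSN has been ingested."""

    def __init__(self):
        self.last_ingested_lsn = 0
        self._cond = threading.Condition()

    def ingest_wal(self, lsn):
        # Called as WAL records arrive from the safekeepers.
        with self._cond:
            self.last_ingested_lsn = max(self.last_ingested_lsn, lsn)
            self._cond.notify_all()

    def get_page(self, page_id, request_lsn, timeout=None):
        # Wait until WAL ingest has caught up to the caller's LSN.
        with self._cond:
            caught_up = self._cond.wait_for(
                lambda: self.last_ingested_lsn >= request_lsn, timeout)
            if not caught_up:
                raise TimeoutError("WAL ingest lagging behind request LSN")
        return self._read_page_at(page_id, request_lsn)

    def _read_page_at(self, page_id, lsn):
        # Placeholder for actual page reconstruction from base image + WAL.
        return (page_id, lsn)
```

The key property is that the compute node names the LSN it needs, so the pageserver never has to guess whether it is "caught up enough" for a given read.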


What is Neon? What is Paxos? What is Raft?


Great write up! I don't have much practical experience with Paxos/Raft other than coursework. I'm curious, if I wanted to insert a row into a table, what is the overhead of all these extra nodes in the insert operation now compared to a single table?

I realize the answer depends on how big the cluster is, what state it is in at any given moment etc, but I'm happy to accept back of the envelope calculations/estimations!


So would using Neon negate the need for something like Citus for scaling out a postgres database?


It's a different approach: PlanetScale and Citus do sharding that is intended to be mostly transparent. It's not 100% transparent, but both get pretty close.

Neon is more of an Aurora approach, detaching storage from compute. You could scale up to more replicas, and it could enable other functionality, though Postgres can already handle a pretty high replica count, so you can scale out reads that way.


That's right. The important observation is that OLTP queries are mostly small and can fit on one node. Neon's architecture allows you to scale read throughput by spinning up read replicas (or read endpoints to the same storage, in our case).

Citus (CockroachDB, Yugabyte) has distributed compute, which allows engaging multiple nodes per query. This helps with analytical queries and with scaling writes. But you lose out on compatibility and predictability of performance: shared-nothing systems are no longer Postgres.


Is there a difference between "observer" and "witness"? I'm used to the witness terminology from old literature and also from the Megastore paper, in which a witness votes in the Paxos election and stores only the WAL.


"Witness" comes from Frugal Paxos [1] (AFAIK, and not cited in the Megastore paper directly from what I saw while skimming it a few minutes ago) and indeed means an acceptor who does not contain a state machine replica, but does store the log and participate in elections.

"Observer" is not as well specified of a term [2], but from what I can find observer means non-voting replica which stores the log and a state machine replica.

Both of these make sense depending on the goals of the system. Observers make it easier to add new replicas without changing the size of the quorum, and witnesses make it cheaper to increase fault tolerance.

[1]: https://lamport.azurewebsites.net/pubs/web-dsn-submission.pd... [2]: https://cse.buffalo.edu/tech-reports/2016-02.orig.pdf
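The cost difference is easy to see with simple quorum arithmetic (plain majority quorums assumed; this is a sketch, not from either paper):

```python
def voters_needed(f):
    """Voting replicas required to tolerate f failures with majority quorums."""
    return 2 * f + 1

def quorum_size(f):
    """Majority of the voting replicas."""
    return voters_needed(f) // 2 + 1

# Going from f=1 to f=2 means adding two voters (3 -> 5). With witnesses,
# those extra voters store only the log, not a full state machine replica,
# so extra fault tolerance is cheap. Observers sit outside this math
# entirely: they add read capacity without growing the quorum at all.
```

So witnesses shrink the per-voter cost of the 2f+1 voters, while observers add replicas that never count toward 2f+1 in the first place.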


Thanks! I believe the place I originally saw the term was "Voting with Witnesses" http://www2.cs.uh.edu/~paris/MYPAPERS/Icdcs86.pdf


I'm not as familiar with the literature on replication for file systems as I am with state machine replication, so perhaps the usage of those terms have diverged since then. Regardless, I think my analysis is correct for state machine replication. Thanks for the link!


I'm curious how many nodes end up in the consensus group, presumably you don't want more than 3 because throughput scales 1/N, unless their implementation can alleviate that significantly.


Currently, we deploy three safekeepers, one in each AZ. We need to collect more stats on failure rates, and maybe we will go to 6 (3 AZs with two nodes in each) as Aurora does.


This is a great writeup, thanks. It would be very useful if y'all could add a post comparing Neon with CitusDB in terms of functionality, performance, and operations.


Let us work on it! Short answer: Neon is most similar to AWS Aurora - 100% compatibility with Postgres, with all the innovation in storage and serverless. The use case is a core database for apps.

And Citus is a shared-nothing architecture plus columnstores. This means the use case is analytics or mixed workloads. Citus people should comment on this, of course.


Agreed. This would make a wonderful follow up post!


Every time I see Neon I think this is about KDE.



