High Availability for PostgreSQL, Batteries Not Included (compose.io)
132 points by rfks on July 17, 2015 | 62 comments



Where I am we have a similar setup for leader election and failover (using etcd and haproxy) but we add an additional step: a standby instance that does not participate in master election, and always follows the elected master.

Then we turn on confirmed writes on the master so that the non-participating standby (called the "seed") has to receive and confirm your write before the transaction can commit.
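A minimal sketch of what that looks like on the PostgreSQL side, assuming the seed connects with application_name 'seed' (the name is illustrative, not necessarily what we use in production):

    # postgresql.conf on the elected master
    synchronous_standby_names = 'seed'  # must match the seed's application_name in its primary_conninfo
    synchronous_commit = on             # a COMMIT now waits until the seed confirms the WAL flush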

This has the bonus of preventing split brain... If the wrong instance thinks it's master, writes will block indefinitely because the seed isn't confirming them. If the seed is following the wrong machine, same thing. And if clients and the seed and the master are all "wrong", then that's ok because at least they all "consistently" disagree with etcd.

The seed instance can run anywhere, and is responsible for receiving WAL snapshots from the master and archiving them (to shared storage) so it can crash too and be brought up elsewhere and catch up fine. The writes just block until this converges.

It's worked quite well for us for a few months on a hundred or so Postgres clusters; we haven't seen an issue yet. I'd love for somebody knowledgeable about this stuff to point out any flaws.


That's interesting. We do something pretty similar in the Manatee component that I mentioned elsewhere in this thread, except that the designated synchronous standby can take over if the primary goes away. But it can only do so when another peer is around to become the new synchronous standby, so we maintain the write-blocking behavior that avoids split-brain.


Like a lot of designs that use Raft/Zookeeper/Paxos/whatever as a building block, the full system doesn't inherit all of the safety properties of the underlying consensus algorithm. I don't think that makes this code useless by any means, but I think it's important to be aware of the edge cases.

Consensus algorithms are popular because they're supposed to solve the difficult problem of guaranteeing consistency while attempting to provide liveness, in the presence of arbitrary node or connection failures. Etcd itself can provide this for operations on its own datastore, but that doesn't mean it can be used as a perfect failure detector for another system (which is impossible in the general case). In particular, if the database master becomes partitioned from the etcd leader for more than 30 seconds but is still accessible to clients, boom -- split brain.

(You can attempt to mitigate this with timeouts, but that's not foolproof if your system can experience clock skew or swapping/GC delays. Exactly this kind of faulty assumption has caused critical bugs in e.g. HBase in the past, turning what would otherwise be a temporary period of unavailability into data loss.)

EDIT: If I'm reading the code correctly, compose.io doesn't make any attempt to mitigate this failure scenario. If the Postgresql master can't contact etcd, it continues acting as a master indefinitely, even after 30 seconds have expired and another server might have taken over. This appears to be what happens in the "no action. not healthy enough to do anything." case in ha.py. I'd be happy to be corrected if there's something I'm missing.
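The mitigation I'd expect is some kind of self-demotion watchdog: if the master hasn't managed to refresh its leader key within the TTL, it should stop accepting writes rather than keep going. A rough sketch of the idea (hypothetical, not governor's actual code; the TTL value and data directory are assumptions):

    import subprocess
    import time

    LEADER_TTL = 30  # must match the TTL on the etcd leader key

    def maybe_self_demote(last_leader_key_refresh):
        """If the leader key couldn't be refreshed within the TTL, another node
        may have taken over, so stop serving writes instead of risking split brain."""
        if time.time() - last_leader_key_refresh > LEADER_TTL:
            subprocess.check_call(
                ["pg_ctl", "stop", "-D", "/var/lib/postgresql/data", "-m", "fast"])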


If the PostgreSQL leader doesn't reset the leader key, it's no longer leader.


The rest of the cluster doesn't think it's the leader, but the problem is that it still accepts database connections as if it were.

If a client sees a stale value of the leader key (which is possible, either through network hiccups or etcd's normal behavior of allowing reads from followers) then it could contact the old leader and perform updates which won't be visible on the new leader.
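(etcd does give clients a way to avoid the stale-follower-read half of this: a quorum read, which is served through the Raft leader at the cost of extra latency. It doesn't help with the old master still accepting connections, though. A sketch, with an illustrative key path:)

    import requests

    # quorum=true forces the read through the etcd leader rather than a
    # possibly stale follower (etcd v2 API).
    resp = requests.get(
        "http://127.0.0.1:2379/v2/keys/service/pgcluster/leader",
        params={"quorum": "true"},
    )
    current_leader = resp.json()["node"]["value"]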


Those updates don't count though, since they weren't sent to the leader. Might as well pipe them to /dev/null. There is no need to track them.

A client shouldn't use a stale value. If a DB does not hold a valid key, it shouldn't accept new connections, or signal that data was committed.


> Those updates don't count though, since they weren't sent to the leader.

But that's exactly the problem! If you were to run this whole system under a tool like Jepsen, this would show up as "acknowledged but lost writes". It's not generally considered acceptable to connect to your database, issue an UPDATE and a COMMIT, and have everything appear to work successfully, only for the data to disappear into the aether because it got sent to the wrong replica.

> If a DB does not hold a valid key, it shouldn't accept new connections, or signal that data was committed.

Exactly, and the problem with this implementation (again, unless I'm missing something) is that it may violate this contract.


Fencing isn't quite that simple, unfortunately.

I've been doing database, and specifically PostgreSQL, administration and HA setups for a long time now. This stuff is a lot harder than people think it is. People who roll their own solutions, thinking "Oh, this will totes be good enough!" tend to find themselves very painfully surprised that it isn't.


I've seen multiple deployments (not necessarily specific to PostgreSQL) engineer themselves into a corner with what people feel will be a highly available roll-your-own solution, complete with convincing-sounding blog posts.

In every case, at some point, there were implementation/software bug related issues that ultimately caused more unplanned outages than I've ever seen a single, well run server experience.


Based on experience, is there a common bug or scenario that you see overlooked often? Like, say, what happens during the transition between leaders, or handling multiple failures (multiple netsplits...)?


I can't really identify a common problem. Things I've seen include:

* After a complete, planned shutdown, neither server is happy to start until it sees the other one online. In the end, neither ends up booting.

* A failover occurs, at which point you find out the hard way that there is state being stored in a non-replicated file. I've seen this with several different Asterisk HA solutions in particular.

* A failover occurs, and non-database-aware storage snapshots leave the redundant server with a non-mountable mirror of the database.


I'm just wading into the HA waters with Postgres. I somewhat understand the tradeoffs between simplicity and robustness, but what would be your recommendation on how to proceed for someone who is a newbie?


This is going to sound cynical and self-serving (even though I'm not actually available for hire right now), but find someone who knows what they're doing and buy their time. It probably won't be cheap, but it will almost without doubt be cheaper than what you'll do to yourself if you try to hand-roll database HA.


Isn’t the major problem knowing whether the expert actually knows what they are doing, and doesn't just think they know what they are doing? Any tips on how you can separate the true experts from the deluded?


For someone who doesn't already have a background in and depth of understanding of this stuff, I'd first probably look for any relevant blog posts or articles written by the folks with whom you're considering a consulting arrangement. If they do exist, you can do some research on the things they're talking about and hopefully get at least a first-pass approximation of their full of shit factor.

Any consultant worth their day (let alone week) rate should also be able to refer you to previous clients, from whom you can hopefully get some sense of how satisfied people are with the candidate's work.

And, ultimately, there's an intuition factor at work here. It's been my consistent experience that if a candidate gives you some kind of hinky vibe, don't use them. When folks I've worked with haven't followed that, the results have pretty reliably been poor, at best.


Thanks for this. As you rightly point out, setting up these systems can involve knowing all the rare edge cases. I am sure I could put together a system that functioned, that is, until it ran straight into an edge case I had not thought of. Getting this right is not easy.


I'll second this. Getting your data store right is way too important to leave to the inexperienced (and I say that being, if I had to judge, the semi-experienced; I've refused gigs centered around HA data stores because I won't put my name on something I am not 100% sure will work for them without issues).


I haven't looked at the code, but the failover should ensure that the HAproxy isolates the failed master ("fencing" in HA terminology).


I won't pretend to understand all the details after only a quick glance at the code, but it looks to me like HAproxy is invoking a script[1] that checks each PostgreSQL instance directly (bypassing the governor process) and enables it as an endpoint iff the instance is a replication master, as opposed to a follower. In which case, if multiple instances are in master mode, the proxy might forward connections to either one. It has no way of knowing which one is considered to be the "real" one by etcd; and even if it tried to check, that check would itself be subject to race conditions.

[1]: https://github.com/compose/governor/blob/master/haproxy_stat...
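The usual way to make that master-vs-follower distinction is pg_is_in_recovery(); a sketch of what such a check can boil down to (illustrative, not the linked script, and the connection parameters are made up):

    import sys
    import psycopg2

    # Exit 0 if this instance is a writable master, 1 if it is a standby.
    conn = psycopg2.connect(host="127.0.0.1", port=5432, dbname="postgres",
                            user="postgres", connect_timeout=3)
    cur = conn.cursor()
    cur.execute("SELECT pg_is_in_recovery()")
    in_recovery, = cur.fetchone()
    sys.exit(1 if in_recovery else 0)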


Personally, I would try to go for a simpler solution. A failover event is already complicated in itself and happens at a point in time where stuff is already going wrong (there would be no failover otherwise); do you really want to have all this additional infrastructure, with etcd and haproxy as a dependency?

If you can live with a few minutes of downtime, I would recommend triggering your failover through human intervention, once you have ascertained that the failover would actually help (you never, ever want to fail over if the master doesn't respond in time due to high load; at that point, failing over will only make things worse due to cold caches).

See https://github.com/blog/1261-github-availability-this-week for a nice story of automated DB failover going wrong.

In our case, we're running keepalived to share the IP address of the postgres master, but we don't actually automatically act on PG availability changes.

In a situation that actually warrants the failover, a human will kill the master node by shutting it down and keepalived will select another master and trigger the failover (which is then automated using `trigger_file` in `recovery.conf`).
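The moving parts are small. Roughly (paths, addresses and interface names are illustrative, not our exact config):

    # recovery.conf on the standby
    standby_mode = 'on'
    primary_conninfo = 'host=10.0.0.10 user=replicator'
    trigger_file = '/var/lib/postgresql/failover.trigger'

    # keepalived.conf: when this node becomes MASTER for the shared IP,
    # touch the trigger file so Postgres promotes itself
    vrrp_instance PG_VIP {
        state BACKUP
        interface eth0
        virtual_router_id 51
        priority 100
        virtual_ipaddress {
            10.0.0.50
        }
        notify_master "/usr/bin/touch /var/lib/postgresql/failover.trigger"
    }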

In this case we have only one additional piece of infrastructure (keepalived) and we can be sure that we don't accidentally make our lives miserable with automated failovers.

The cost is, of course, potential additional downtime while somebody checks the situation, does minimal emergency root cause analysis and then shuts down the failed master.

In the even rarer case of hardware failure, keepalived would of course fail over automatically, but let's be honest: most failures are caused by application or devops issues, and in those cases it pays off to be diligent instead of panicking.


VMS clusters on VAXen were doing fail-overs perfectly in the 80's. All kinds of products and software (even FOSS) do it today. You're telling me that, in 2015, you are doing manual failovers despite tons of free tools to automate it reliably?

Better to do an assessment of each thing that can fail, how to isolate/detect it, how to recover from it, how to implement that with available tools, and implement it. Test it in a number of situations on same hardware, network, and apps you'll use in production. Once it's solid, put them into production. Then, never worry about that stuff again past monitoring and maintenance.

Btw, Netflix employs Monkeys to do this. Open-sources their tools with blog writeups on their use, too. I'm sure you Humans will be able to handle it. ;)


If you are running in Microsoft Azure, you need two VM instances to get any form of availability SLA. Microsoft can reboot/migrate single instances whenever they feel like it. With manual failover you would only have a few minutes of downtime if someone is there to trigger it. That honestly sounds like a crappy solution in 2015...


What? In the case of Microsoft rebooting/migrating an instance causing a failure, Keepalived will automatically failover.

The manual failover is in case of something going horribly wrong (outside of hardware failure), in which case a human steps in, looks at the situation, determines the best solution... and if it's failover, they initiate the failover.

I've personally used this procedure in the past and it worked 100% of the time there was a failure in a production environment. The tricky part is then notifying the hell out of everyone who needs to be notified that something really bad has happened, a failover occurred, everything is OK, but it needs some attention ASAP.

In PGSQL world, there are even a handful of tools to help you turn the old (failed) master into a slave, and correctly escalate the old (promoted) slave into a master; all in a single command on each side (which can be kicked off through keepalived).


And you can do this 24/7, since you are awake every day and night?


"The tricky part is then notifying the hell out of everyone who needs to be notified that something really bad has happened, a failover occurred, everything is OK, but it needs some attention ASAP."

If there is a need for multiple 9's of uptime, there should be an escalation process for these kinds of events, which will probably include 24/7 on-call rotations.

Even if the problem is entirely self-resolving, it should still be looked at by more than one system. It should be noted, observed, documented, and confirmed it's truly resolved. That system is usually a human, but it doesn't necessarily have to be.


Can keepalived automatically float MAC addresses nowadays? Last time I checked, that didn't work and clients needed an arp flush to use the new master.


Shouldn't sending a gratuitous ARP work to update clients?


keepalived uses VRRP and will issue a gratuitous ARP


Since most of the comments are critical, I'll say: thank you for the awesome writeup! I agree this is more complex than HA PG setups I've done in the past, but I'm thrilled to have another perspective. Also doing a thorough writeup like this takes time, and a lot of people would rather jump back into building the next thing. It's a great contribution!

I agree with pilif that you almost always want to fail over the db manually.

I agree with teraflop that just because etcd gives strong guarantees, that doesn't mean your application logic built on top of etcd primitives shares them. So you have to be careful about your reasoning there.

I'm curious if you're doing anything to mitigate haproxy being a single point of failure?

One thing I've had to fix in other people's HA PG setups is ease of getting back to HA after a failover. You lose the master and promote the slave, and now you've just got a master. Ideally it should be easy to just launch another db instance and everyone keeps going. I think this setup achieves that, and that's great!


Agree, and that's a great point about human failover. It can become a challenge for distributed databases running on a large number of instances (like Bigtable), but if we're talking only about master HA, then yes, that can still be done with human intervention, though automation is still preferable. For smaller DB setups, it's much easier to just let a human/DBA intervene.


Smaller DB setups rarely have the ops/DBA support required to do manual failover. I think having an as-consistent-as-feasible, automatic failover is something of a default expectation for databases these days, at any size.


You need a larger team to do automatic failover because getting it right is a massive PITA. Either that or pay someone to do it right for you, e.g. RDS, managed solutions.

Manual failover is often a lot safer, automatic systems have a nasty habit of not doing what you expect them to and trashing your database / losing data.


I should have clarified that I meant small in the context of less scaled out and more vertically scaled like the traditional rdbms running on big iron.


Take a look at the code we have open sourced: https://github.com/compose/governor


Great write-up, thanks for sharing!

I'm curious about HAProxy being a single point of failure as well. What happens when it fails?


Thanks for writing this up!

At Joyent, we built a similar system for automated postgresql failover called Manatee. I'm sure today we would have used a Raft-based system, but that was not available when we did this work, so we used ZooKeeper. We haven't spent much time polishing Manatee for general consumption, but there's a write-up on how it maintains consistency[1]. The actual component is available here[2], and it's also been ported to Go as part of Flynn[3].

Edit: Manatee uses synchronous replication, not async, so it does not lose data on failover.

[1] https://github.com/joyent/manatee-state-machine

[2] https://github.com/joyent/manatee

[3] https://github.com/flynn/flynn


This seems robust, but feels like more moving parts than are necessary.

I feel like HAProxy with PostgreSQL + Bucardo (multi-master + at least one slave) would achieve this, and net you fewer moving parts. Under what circumstances does this fail where the etcd-dependent solution succeeds?


Bucardo with multi-master is fantastic when a DBA can configure the multi-master and manage future changes. Bucardo requires each table to have proper Bucardo configuration and each table on each host to have the proper schema, since Bucardo does not replicate schema changes.

Compared to streaming replication, during high load, Bucardo sync is also quite expensive for a replication mechanism.

As a service, Bucardo's requirements did not scale for us. It created too many caveats. The limitations of Bucardo for our service became obvious quickly.


Can I ask why HAProxy seems to be a more popular choice than the very, very simple (and robust) pgbouncer?


We tested with PGPool and PgBouncer in various iterations.

PGPool failed at basic failover. It worked fine while the leader remained leader, and it would fail over to the follower who became leader, but after the first failover it would stall on connections. We worked through various settings and attempts at making it more stable, but in the end we were not happy with the stability.

PGBouncer requires a connection to a single database and requires a user store associated at the PGBouncer level. One of our internal requirements for our Postgres service is to give customers full access to Postgres capabilities. PGBouncer would either limit customer functionality or require us to build more tools for customers to use Postgres's complete functionality. For instance, if a customer ran `CREATE USER foo WITH LOGIN …` from the Postgres connection, the customer would not be able to authenticate as the foo user, because PGBouncer would not have immediate knowledge of the new user.

In the end, HAProxy offered the stability and enabled the base functionality of Postgres we wanted. In tests, it failed over quickly and reliably. The only caveat with HAProxy + Postgres is that you have to rely on TCP passthrough with SSL termination at Postgres. We'd have preferred SSL termination at HAProxy, but Postgres engineered its own connection procedure to listen for standard and SSL connections on the same port. SSL termination at HAProxy was causing issues for drivers that were built to use that procedure and cannot use a standard SSL connection.
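Roughly, the HAProxy side ends up looking like the following, assuming each node exposes a small HTTP health endpoint that only answers 200 on the master (the addresses and port are made up, not our actual configuration):

    listen postgres
        bind *:5432
        mode tcp                  # pure TCP passthrough; SSL terminates at Postgres
        option httpchk OPTIONS /  # health endpoint returns 200 only on the master
        server pg1 10.0.0.11:5432 check port 8008
        server pg2 10.0.0.12:5432 check port 8008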


Interesting. I got around this by just recreating the users file for pgbouncer and issuing a reload every 15 minutes or so. This was only needed for one setup, as the rest were just static, or the user was very low volume so I asked them to connect directly. pgbouncer handled reloads very well (on par with nginx), so it worked for me. Thanks for the great insight with your answer.


This is great supplemental data, much appreciated!


Lots of people are already using it, so it has familiarity (more than pgbouncer, I was thinking, though that's pure conjecture).

I think if you wanted to have the simplest possible solution, pgbouncer and postgresql-specific replication mechanism would be perfect. This is along the same lines as my question - I don't really see how these alternate solutions could be construed as lacking...


"If no one has the leader key it runs health checks and takes over as leader."

I'm no expert at all on this stuff, but I do smell either a race condition (if another node comes alive and 'goes to see who owns the leader key in etcd' before the first node 'takes over as leader') or a longer-than-needed time without a leader (where the new node knows it wants to become the leader, but is still running health checks).


The code relies on functionality in etcd to prevent a race condition. Using `prevExist=false` on acquiring the leader key, the set will fail if another node wins the race.

The functionality in the code is here: https://github.com/compose/governor/blob/master/helpers/etcd...

The documentation for etcd is here: https://coreos.com/etcd/docs/latest/api.html#atomic-compare-...
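In HTTP terms the acquisition looks roughly like this (the key path, node name and TTL are illustrative):

    import requests

    # Create the leader key only if nobody else holds it. etcd answers 201 if
    # we won the race and 412 Precondition Failed if another node already did.
    resp = requests.put(
        "http://127.0.0.1:2379/v2/keys/service/pgcluster/leader",
        params={"prevExist": "false"},
        data={"value": "node-a", "ttl": 30},
    )
    if resp.status_code == 201:
        pass  # we are the leader; keep refreshing the key before the TTL expires
    elif resp.status_code == 412:
        pass  # someone else won the race; follow them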


But then, isn't it not

"If no one has the leader key it runs health checks and takes over as leader."

but

"If no one has the leader key it takes over as leader, runs health checks, and starts functioning as leader."

? If so, I would do the health checks and then try to become the leader. Or do the 'health checks' involve other nodes?


It simply relies on the voting feature of etcd (Raft). Locking with etcd is really simple to use, and etcd is really, really stable. However, it would be easier to install etcd on every Postgres node and just write a golang library that sets the Postgres master to the etcd master (etcd also has a leader), with systemd keeping the overall system healthy. (That's what we at envisia do.) It just repeatedly checks whether the machine is the leader and, if yes, sets the URL of the currently running machine in an etcd key. So overall we need 3 Postgres machines; 1 could fail and we would still have voting. That's just for a single master where we don't need to read from the slaves, but it's easily extendable.

Oh and here is the Compare and Swap (Atomic) functionality of etcd that he described: https://github.com/coreos/etcd/blob/master/Documentation/api...


The problem with etcd members on every Postgres node is that etcd clusters expect a fixed set of nodes or members. etcd doesn't function well in an environment where you could tear down / build up new nodes. Most of our Postgres service runs on AWS, and thus we must expect that any single node may vanish and our system must replace that node. We tried running etcd alongside Postgres in an early prototype, but ran into issues with etcd cluster stability when destroying and recreating nodes. Thus, we opt for a standalone etcd cluster distinct from the Postgres cluster.


You can set up a local etcd proxy to mitigate this. You'd run the proxy listening on localhost, and then have it connected to the stable etcd cluster elsewhere.

The proxy can find the cluster manually or use SRV records. Autoscale the Postgres machines as much as you want after that while leaving etcd on stable machines.
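Concretely, something along these lines (hostnames are placeholders):

    # run a stateless etcd proxy on each Postgres box, pointed at the real cluster
    etcd -proxy on \
         -listen-client-urls http://127.0.0.1:2379 \
         -initial-cluster etcd0=http://etcd0.internal:2380,etcd1=http://etcd1.internal:2380,etcd2=http://etcd2.internal:2380

    # or let it discover the members via DNS SRV records
    etcd -proxy on -listen-client-urls http://127.0.0.1:2379 -discovery-srv example.internal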


That's basically what we're trying to do in the future, but it's really hard if you want to have a running etcd cluster with 5 nodes all the time. You would need to check whether an etcd node died, and then either promote a proxy to a full etcd member or spin up a new machine (the latter is only possible in clouds or virtual environments).


You can do that trivially with Mesos and have it always ensure 5 instances are running. Bonus points that it will run identically on bare metal and cross cloud which means less vendor lock in for you.


I really wonder why people go for implementing their own HA stack when there's Pacemaker or rgmanager available?

First: There's a functional resource agent available to handle PostgreSQL. It handles single instances or multiple ones.

Second: The whole cluster thing can be very complex. You can have a LOT of failure scenarios, and I wouldn't recommend that anyone try to catch them all.


Slightly off-topic:

If you're considering MySQL for HA, a project called Vitess jibes well with Kubernetes + CoreOS, and has been in production use at YouTube for a while now:

http://vitess.io/


What does "Batteries Not Included" mean for a highly available database? Is it a good thing?


"Batteries Not Included" is a phrase found on the box of children's toys, so that parents know they will need to buy batteries before giving it to their kids. It has become an expression for something being incomplete and will require effort before it will work properly.

In this case the author is stating that Postgres doesn't come with a high availability capability. He then goes on to explain the high availability setup he put together.


"Batteries Not Included" in this article refers to the fact that PostgreSQL doesn't offer any built-in solution yet to do HA (automatic fail-over) and you have to depend of 3rd party solution.


How does this compare to RDS - both in terms of HA, and generally speaking?


And with this amazing design you can easily lose committed data and have all sorts of other fun problems.


It's relatively easy to adjust this setup to use synchronous replication. In fact, it's something that I expect we'll be offering to customers in the future. Most people we talk to aren't willing to make the write-availability or performance sacrifices that synchronous replication requires. We mostly try to give them the tools to do it right, and then educate them on the tradeoffs.

(disclaimer, I work for Compose)


Is it even possible to guarantee that you won't lose commits with postgresql replication? For many applications, consistency is more important than not losing any data ever. For the other kind of application, you'll need something else.


With postgresql synchronous replication, in order to lose writes that have been acknowledged to the postgresql client, you'd have to lose filesystem data on both the primary and the synchronous standby. (I believe the way postgresql uses the term "committed", you can lose data that's "committed", but not once postgresql has acknowledged it to the client.)

For many applications, consistency includes not losing acknowledged data. If I PUT data into an application and fetch it back and it's not there, that's not consistent.


> For many applications, consistency includes not losing acknowledged data. If I PUT data into an application and fetch it back and it's not there, that's not consistent.

Durability and consistency are two separate concepts.



