
(I help maintain SmartStack)

I think it's really interesting that "what we've already got set up" is such a big driver in which systems we pick. For example, in 2013 Yelp already had hardened ZooKeeper setups and Consul didn't exist ... and when it did exist Consul was the new "oh gosh they implemented their own consensus protocol" kid on the block, so we opted for what we felt was the safer option. I was also pretty worried about the Ruby ZK library, but to be honest it's been relatively well behaved, aside from the whole sched_yield bug [1] occasionally causing Nerves to loop forever while shutting down. We fixed that with a heartbeat and a watchdog, so not too bad. Which technologies are available at which times really drives large technical choices like this.
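
To make the heartbeat-and-watchdog idea concrete, here is a minimal Python sketch. It is not the actual fix described above (the comment does not say how it was implemented, and the real component is written in Ruby); the `Heartbeat` class, the `watchdog_should_kill` helper, and the 30 second threshold in the usage note are all hypothetical names and values chosen for illustration.

```python
import threading
import time


class Heartbeat:
    """Timestamp that a worker loop refreshes to show it is still making progress.

    Hypothetical helper for illustration; not part of Nerve or any library.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so the logic can be tested
        self._lock = threading.Lock()
        self._last_beat = clock()

    def beat(self):
        """Called by the worker on every healthy iteration of its main loop."""
        with self._lock:
            self._last_beat = self._clock()

    def age(self):
        """Seconds elapsed since the worker last reported progress."""
        with self._lock:
            return self._clock() - self._last_beat


def watchdog_should_kill(heartbeat, max_age_seconds):
    """Return True when the heartbeat is older than the allowed age."""
    return heartbeat.age() > max_age_seconds
```

A separate watchdog thread or process would poll `watchdog_should_kill(hb, 30)` and, when it returns True, kill the stuck process so that a supervisor can restart it. This mitigates the hang without identifying its cause.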

Consul Template is undeniably useful, especially when you start integrating it with other HashiCorp products like Vault for rolling the SSL creds on all your distributed HAProxies in real time. And I think that the whole HashiCorp ecosystem together is a really powerful set of free off-the-shelf tools that are really easy to get going with. I do think, however, that Synapse has some important benefits, specifically around managing dynamic HAProxy configs that have to run on every host in your infra. For example, Synapse can remove dead servers ASAP through the HAProxy stats socket after getting realtime ZK push notifications rather than relying on healthchecks (in production <~10s across the fleet, which is crucial because if HAProxy healthchecked every 2s we'd kill our backend services with healthcheck storms ... because we've totally done that ...). Synapse can also try to remember old servers so that temporary flakes don't result in HAProxy reloads, and it can try to spread and jitter HAProxy restarts so that the healthcheck storms have less impact, all while keeping flexibility in the registration backend (Synapse supports any service registry that can implement the interface [2]). However, there are some pretty cool alternative proxies to HAProxy out there, and one area where Consul is doing really well is supporting arbitrary manifestations of service registration data using Consul Template; SmartStack is still playing catch-up there, supporting only HAProxy and JSON files (with arbitrary outputs on their way in [3]).
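
The three behaviours described above (disabling dead servers over the stats socket, remembering old servers to avoid reloads, and jittering restarts) can be sketched roughly as follows. This is an illustrative Python sketch, not Synapse's code (Synapse is written in Ruby): `plan_update` and `jittered_delay` are hypothetical names, and the backend and server names in the usage note are made up. The only real interface assumed is HAProxy's `disable server <backend>/<server>` and `enable server <backend>/<server>` socket commands.

```python
import random


def plan_update(backend, configured, live):
    """Compare servers already in the HAProxy config with the registry's live set.

    configured: set of server names present in the current HAProxy config
    live: set of server names the registry currently reports as up
    Returns (socket_commands, needs_reload).
    """
    commands = []
    # Servers that vanished stay in the config ("remembered") but are
    # disabled immediately over the stats socket, without waiting for
    # a healthcheck to notice.
    for name in sorted(configured - live):
        commands.append(f"disable server {backend}/{name}")
    # Servers that are live and already in the config only need enabling,
    # so a temporary flake never forces a reload.
    for name in sorted(configured & live):
        commands.append(f"enable server {backend}/{name}")
    # Only servers HAProxy has never seen require regenerating the config
    # and reloading.
    needs_reload = bool(live - configured)
    return commands, needs_reload


def jittered_delay(base_seconds, max_jitter_seconds, rng=random):
    """Delay before a reload, randomized so hosts do not all restart at once."""
    return base_seconds + rng.uniform(0, max_jitter_seconds)
```

For example, `plan_update("api", {"a", "b"}, {"b"})` yields one `disable` command for `a`, one `enable` command for `b`, and no reload. A reload is requested only when the live set contains a server missing from the config, and `jittered_delay` would then spread that reload across the fleet.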

I enjoyed the article, and thank you to the Stripe engineers for taking the time to share your production experiences! I'm excited to see folks talking about these kinds of real world production issues that you have to deal with to build reliable service discovery.

[1] https://github.com/zk-ruby/zk/issues/50

[2] https://github.com/airbnb/synapse/blob/master/lib/synapse/se...

[3] https://github.com/airbnb/synapse/pull/203



> We fixed that with a heartbeat and a watchdog, so not too bad.

I disagree. That's a band-aid solution, good for a short time while you figure out the root cause and solve it for real.


I respectfully disagree. I'm all for root cause analysis and taking the time to fix things upstream, but I also think that it's easy to say that and hard to actually do it.

Yelp doesn't make more money and our infra isn't particularly more maintainable when I invest a few weeks debugging Ruby interpreter/library bugs, especially not when there are thousands of other higher priority bugs I could be determining the root cause of and fixing.

For context, we spent a few days trying to get a reproducible test case for a proper report upstream, but the issue was so infrequent and hard to reproduce that we made the call not to pursue it further and just mitigate it. I do believe that mitigating rather than finding the root cause is sometimes the right engineering tradeoff.


A bug like that is something that you want to squash because the cause might have other unintended consequences that you are currently unaware of. To assume that there are no other consequences is the error, and the only way to make sure there are none is to identify the cause. This sort of sweeping things under the carpet is what comes back to bite you a long time after, either with corrupted data or some other consequence.

Now, given the context it doesn't matter whether or not the company or the product dies, so I can see where you're coming from, but in any serious enterprise that would not be tolerated. When your code base already has 'thousands of other higher priority bugs' it's a lost cause, point taken. But at some level you have to wonder whether you have 'thousands of higher priority bugs' because there is such a cavalier attitude to fixing them in the first place.


> in any serious enterprise that would not be tolerated

I think that's a bit of a "no true Scotsman" fallacy. We use a lot of software we didn't write, and a lot of it has bugs. The languages that we write code in have bugs (e.g. Python has a fun bug where the interpreter hangs forever on import [1]; we've hit this in production many times). Software we write has bugs and scalability issues as well. We try really hard to squash them. We have design reviews, code reviews, and strive to have good unit, integration, and acceptance tests. There are still bugs.

I'm glad that there are some pieces of software that are written to such a high standard that bugs are extremely rare (I think that HAProxy is a great example of such a project), but I know of very few in the real world.

[1] https://bugs.python.org/issue14903



