Good Retry, Bad Retry (medium.com/yandex)
240 points by misonic 5 months ago | 66 comments



AWS also say they do something interesting:

> When adding jitter to scheduled work, we do not select the jitter on each host randomly. Instead, we use a consistent method that produces the same number every time on the same host. This way, if there is a service being overloaded, or a race condition, it happens the same way in a pattern. We humans are good at identifying patterns, and we're more likely to determine the root cause. Using a random method ensures that if a resource is being overwhelmed, it only happens - well, at random. This makes troubleshooting much more difficult.

https://aws.amazon.com/builders-library/timeouts-retries-and...
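
A minimal sketch of that idea in Python (the hash-of-hostname seeding is my illustration of "a consistent method", not AWS's actual implementation):

    import hashlib
    import socket

    def per_host_jitter(base_interval_s: float, max_jitter_s: float) -> float:
        """Return the same jitter every time on a given host.

        The jitter is derived from a hash of the hostname, so a host's
        scheduled work always lands at the same offset, which makes
        overload patterns reproducible and easier to debug.
        """
        hostname = socket.gethostname().encode("utf-8")
        digest = hashlib.sha256(hostname).digest()
        # Map the first 8 bytes of the digest onto [0, 1), then scale.
        fraction = int.from_bytes(digest[:8], "big") / 2**64
        return base_interval_s + fraction * max_jitter_s

    # Example: an hourly job spread over a 10-minute window, always at the
    # same offset on this particular host.
    print(per_host_jitter(3600.0, 600.0))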


I’ve read a suggestion to use prime numbers for retry timers to reduce the chance of multiple timers synchronizing if they have common factors. I don’t know if that’s a real concern, but it wouldn’t hurt to pick a random prime number instead of some other random number.
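
If you wanted to try it, a sketch could look like this (picking a random prime inside a window is just one way to illustrate the suggestion):

    import random

    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        return all(n % d for d in range(2, int(n ** 0.5) + 1))

    def random_prime_interval(lo_s: int, hi_s: int) -> int:
        """Pick a random prime number of seconds in [lo_s, hi_s]."""
        primes = [n for n in range(lo_s, hi_s + 1) if is_prime(n)]
        return random.choice(primes)

    # Example: a retry timer somewhere between 30 and 60 seconds.
    print(random_prime_interval(30, 60))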


IIS picked 29 hours for its worker-process recycling interval, as the smallest prime number greater than 24.

https://serverfault.com/questions/348493/why-does-the-iis-wo...


I never get this desire for microservices. Your IDE can help if there are 500 functions, but nothing will help you if you have 500 microservices. Almost no one fully understands such a system. It is hard to tell which parts of the code are unused. And large-scale refactoring is impossible.

The upside seems to be some mythical infinite scalability which will collapse under such positive feedback loops.


The point of microservices is not technical, it's so that the deployment- and repository ownership structure matches your organization structure, and that clear lines are drawn between responsibilities.


It's also easier to find devs who have the skills to create and maintain thin services than a large, complicated monolith, despite the difficulties of having to debug a constellation of microservices during a crisis.


For the folks who downvoted this - why? I hire developers and this is the absolute truth of the matter.

You can get away with hiring devs able to only debug their little micro empire so long as you can retain some super senior rockstar level folks able to see the big picture when it inevitably breaks down in production under load. These skills are becoming rarer by the day, when they used to be nearly table stakes for a “senior” dev.

Microservices have their place, but many times you can see that it’s simply developers saying “not my problem” to the actual hard business case things.


You need those senior folks who can see the big picture, whether you use monoliths or microservices.

The real benefit of a microservice is that it's easier to see the interactions, because you can't call into some random and unexpected part of the codebase...or at least it's much harder to do something that's not noticeable like that.


At the cost of network boundaries everywhere, and all that entails


If there are network problems everything fails anyway, so it's not really an issue in production.

In the end, it depends on your skill sets. Most developers can't deal with a lot of complexity, and a monolith is the simplest way to program. They also can't really deal with scale, and the cost of learning how to build a real distributed system is high...and the chances you'll hit scale are low.

So instead people scale horizontally or vertically, with ridiculously complicated tools like k8s. K8s basically exists outside of Google because developers can't write scalable apps, whether monolithic or microservice-based.


It's so funny we always use technical solutions to solve social problems, while confusing which parts are what. :)


My interpretation of Conway's Law is that social problems (in development organizations) are isomorphic to (gross) technical problems, and that the leverage works in both directions.


> retain some super senior rockstar level folks able to see the big picture

This is the critical piece that many organisations miss.

Microservices are the bricks; but the customer needs those assembled into a house.


Btw, important factor: you can only see the big picture properly if you co-created the setup. Hiring senior rockstars as a reaction to problems will satisfy some short-term goals but not solve the problems overall


"I see you have a poorly structured monolith. Would you like me to convert it into a poorly structured set of microservices?"


It's easier to find people who are confident that they understand a microservice, but the fact is that it interacts with the system as a whole and much of that interaction is dark matter. It's unknown unknowns that lead to Dunning-Kruger. People looking at a large system have more known unknowns and are less likely to be overconfident to the same degree.

Also, we need about 5x as many people graduating with formal classes in distributed computing as we have now (or have had for the last several decades); it's ridiculous how many people have to learn this stuff on their own. Distributed debugging is really hard when you don't understand the fundamental problems.


The reality is, the organizational structure is likely to change over time, so why would anyone want to mirror it in the repo structure?


Not likely, no.


I prefer mid-scale services. For a given app, there shouldn’t be more than 20-30 of them (and preferably around 10). Each will still have clean ownership and single responsibility, but the chaotic quadratic network effect will hopefully not get out of control. Cleanly defined protocols become a necessity though.


Except it never does. Have you ever heard of a company with a thousand teams?

Or for that matter, how many default alive companies have more teams than customers?


But, uh, both Google and Yandex use a monorepo style of development and a microservices style of deployment, yes. Go figure.


I think the dream is that you can reason locally. I'm not convinced that it actually helps any, but the dream is that by having everything as services, complete with external boundaries and enforced constraints, you're able to more accurately reason about the orchestration of services. It's hard to reason about your order flow if half of it depends on some implicit procedure that's part of your shopping cart.

The business I'm part of isn't really after "scalable" technology, so that might color my opinion, but a lot of the arguments for microservices I hear from my colleagues are actually benefits of modular programs. Those two have just become synonyms in their minds.


> […] the dream is that having everything as services, […], you're able to more accurately reason about the orchestration of services.

Well.. I mean that’s an entirely circular point. Maybe you mean something else? That you can individually deploy and roll back different functionality that belong to a team? There’s some appeal for operations yeah.

> but a lot of the arguments for microservices I hear from my colleagues are actually benefits of modular programs

Yes, I mean from a development perspective a library call is far, far superior to an HTTP call. It is much more performant and orders of magnitude easier to reason about, since the caller and callee are running the same version of the code. That means a breaking change is a refactor and a single commit, whereas with a service boundary you need a whole migration.

You can’t avoid services altogether; think of external services like a payment portal run by a completely different company. But to deliberately create more of these expensive boundaries for no reason, within the same small org or team, is madness, imo.


> That means that breaking changes is a refactor and single commit, whereas with a service boundary you need a whole migration.

This decoupling-of-updates-across-a-call-boundary is one of the key reasons why I _prefer_ microservices. Monoliths _force_ you to update your caller and callee at the same time, which appears attractive when they are 1-1 but becomes prohibitively difficult when there are multiple callers of the same logic - changes take longer and longer to be approved, and you drift further from CD. Microservices allow you to gradually roll out a change across the company at an appropriate rate - the new logic can be provided at a different endpoint for early adopters, and other consumers can gradually migrate to it as they are encouraged or compelled to do so.

Similarly with updates to cross-cutting concerns. Say there's a breaking change to your logging or testing framework, or an encryption library, or something like that. You can force all your teams to down tools and to synchronize in collaborating on one monster commit to The Monolith that will update everything at once - or you can tell everyone to update their own microservices, at their own pace (but by a given deadline, if InfoSec so demands), without blocking each other. Making _and testing and deploying_ one large commit containing lots of changes is, counter-intuitively, much harder than making lots of small commits containing the same quantity of actual change - your IDE can find-and-replace easily across the monorepo, but most updates due to breaking changes require human intervention and cannot be scripted. The ability for different microservices within the same company to consume different versions of the same utility library at the same time (as they are gradually, independently, updated) is a _benefit_, not a drawback.

> a library call[...]is much more performant [...than] these expensive boundaries

I mean, no argument here - but latency tends to be excessively sought by developers, beyond the point of actual experience improvement. If it's your limiting factor, then by all means look for ways to improve it - but designing for fast development and deployment has paid far greater dividends, in my experience, than overly optimizing for latency.


> Monoliths _force_ you to update your caller and callee at the same time

It's possible to migrate method calls incrementally (create a new method or add a parameter). In large codebases, it's necessary to migrate incrementally. The techniques overlap those of changing an RPC method.
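
A tiny sketch of that incremental pattern in a monolith (the function names are made up for illustration): add the new method, turn the old one into a shim, migrate call sites, then delete the shim.

    # Step 1: add the new method alongside the old one.
    def charge_v2(order_id: str, amount_cents: int, *, idempotency_key: str) -> None:
        ...  # new behaviour, with the extra parameter

    # Step 2: the old method becomes a thin shim; call sites migrate at their own pace.
    def charge(order_id: str, amount_cents: int) -> None:
        charge_v2(order_id, amount_cents, idempotency_key=order_id)

    # Step 3: once no call sites remain (easy to verify with grep/IDE in a monolith),
    # delete charge() and, if desired, rename charge_v2 back to charge.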


That's fair! The point about shared cross-cutting concerns still applies, but yeah, that's a fair point - thank you for pointing that out!


You can absolutely reason locally with libraries. A library has an API that defines its boundary. And you can enforce that only the API can be called from the main application.


>The upside seems to be some mythical infinite scalability which will collapse under such positive feedback loops.

Unless I misunderstand something here, they say pretty early in the article that they didn't have autoscaling configured for the service in question and there is no indication they scaled up the number of replicas manually after the downtime to account for the accumulated backlog of requests. So, in my mind, of course there can be no infinite, or really any, scalability if the service isn't allowed to scale...


I’ve seen monumental engineering effort go into managing systems because for one reason or another people refused to use (or properly configure) autoscaling.


> You IDE can help if there are 500 functions, but nothing would help you if you have 500 micro services.

Using micro-services doesn't mean you're using individual repositories and projects for each one. The best approach I've seen is one repo, with inter-linked packages/assemblies (lingo can vary depending on the language).


Agree 100%

A monolith with N libraries (instead of N microservices) works so much better in my experience. You avoid the networking overhead, and the complexity of reasoning about all the possible ways N microservices will behave when one or more of them crash.


What you are describing, where 1 function = 1 service, is a serverless architecture. The "ideal" with any service (micro or macro) is to get it so that it maximises richness of functionality over the scale of its API.

But I agree one should do monolith by default.


The concepts here apply to any client-server networking setup. Monoliths could still have web clients, native apps, IOT sensors, third party APIs, databases, etc.


The real reason is that it's impossible to safely upgrade a dependency in Python. And by the time you realise this you're probably already committed to building your system in Python (for mostly good reasons). So the only way to get access to new functionality is to break off parts of your system into new deployables that can have new versions of your dependencies, and you keep doing this forever.


> but nothing would help you if you have 500 micro services.

Have you pondered the likelihood that your IDE sucks?


Meanwhile, if you use an Erlang-family language, you can scale horizontally later.


I just learned quite a bit about retries. I really liked this tour of one area of the domain in the form of a narrative. When written by someone who clearly knows the area and also has skill at writing it, that's a great way to learn more techniques.

Would love to read more things like this in different areas.


To counter the avalanche of retries across different layers, I have also seen a custom header added to all requests that are retries. Upon receiving a request with this header, the microservice turns off its own retry logic for that request.
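
A rough sketch of that pattern (the header name and the HTTP client are my assumptions, not something from the article):

    import requests

    RETRY_MARKER = "X-Request-Is-Retry"  # hypothetical header name

    def call_downstream(url, incoming_headers, max_attempts=3):
        """Call a downstream service, but never retry a request that is itself
        already a retry coming from an upstream caller."""
        attempts = 1 if incoming_headers.get(RETRY_MARKER) == "1" else max_attempts
        last_exc = None
        for attempt in range(attempts):
            headers = dict(incoming_headers)
            if attempt > 0:
                headers[RETRY_MARKER] = "1"  # mark our own retries for services below us
            try:
                resp = requests.get(url, headers=headers, timeout=1.0)
                if resp.status_code < 500:
                    return resp
            except requests.RequestException as exc:
                last_exc = exc
        raise RuntimeError("downstream call failed") from last_exc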


Yeah. Instead of blind retries, I have the server respond with a “try after timestamp” header. This way it can tell everybody to back off. If there's no response, then welp.


For the "no response" case, e.g. clients ignoring the retry-after header (or in the particular situation where I learned this trick, retrying aggressively on 401) one can implement an "err200" response in the load balancer which makes them all go away ;).


Sending HTTP 429 is also a good way to signal that the error returned is not due to a generic downstream failure, but to excess traffic.

Having clients suspend retries altogether allows the service to come back up. Manual retries triggered from user action would be fresh requests.
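
A sketch of a client honoring that contract, assuming the server sends 429 with a standard Retry-After header expressed in seconds:

    import time
    import requests

    def get_with_backoff(url: str, max_attempts: int = 3) -> requests.Response:
        """Retry on 429, waiting as long as the server asks via Retry-After."""
        for attempt in range(max_attempts):
            resp = requests.get(url, timeout=2.0)
            if resp.status_code != 429:
                return resp
            # Server is shedding load: back off for exactly as long as it asks.
            delay = float(resp.headers.get("Retry-After", 1))
            time.sleep(delay)
        return resp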


This is the kind of well written, in depth technical narrative I visit HN for. I definitely learned from it. Thanks for posting!


I agree. What a treat. One of the best submissions gracing HN in months.


It's worth noting that the logic in the article only applies to idempotent requests. See this article (by the same author) for the non-idempotent counter-part: https://habr.com/ru/companies/yandex/articles/442762/ (unfortunately, in Russian). I am sure somebody posted a human-written English translation back then, but I cannot find it. So here is a Google-translated version (scroll past the internal error, the text is below):

https://habr-com.translate.goog/ru/companies/yandex/articles...


Ideally you only retry error codes where it is guaranteed that no backend logic has executed yet. This prevents retry amplification. It also has the benefit that you can retry all types of RPCs, including non-idempotent ones. One example is when the server reports that it is overloaded and can't serve requests right now (load shedding).

Without retry amplification you can do retries ASAP, which has much better latency. No exponential backoff required.

Retrying deadline exceeded errors seems dangerous. You are amplifying the most expensive requests, so even if you only retry 20% of all RPCs, you could still 10x server load. Ideally you can start loadshedding before the server grinds to a halt (which we can retry without risk of amplification). Having longer RPC deadlines helps the server process the backlog without timeouts. That said, deadline handling is a complex topic and YMMV depending on the service in question.
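
As a sketch of that policy (the error-code names and the retryable set are illustrative, loosely modeled on gRPC-style statuses):

    # Error classes where the server guarantees no backend work has started yet,
    # so an immediate retry cannot amplify load (set is illustrative).
    RETRYABLE_NO_WORK_DONE = {"OVERLOADED", "CONNECTION_REFUSED"}

    def call_with_fast_retry(rpc, max_attempts: int = 3):
        """Retry immediately (no backoff needed), but only for errors that are
        guaranteed to have consumed no backend work; anything else, including
        DEADLINE_EXCEEDED, is surfaced to the caller instead of retried."""
        error_code = None
        for _ in range(max_attempts):
            result, error_code = rpc()
            if error_code is None:
                return result
            if error_code not in RETRYABLE_NO_WORK_DONE:
                break
        raise RuntimeError(f"rpc failed with {error_code}")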


This is probably the most detailed analysis of retry techniques that I've seen. I really appreciated the circuit breaker and deadline propagation sections.

But this is why I've pretty much abandoned all connection-oriented logic in favor of declarative programming:

https://en.wikipedia.org/wiki/Declarative_programming

Loosely, that means that instead of thinking of communication as client-server or peer-to-peer remote procedure calls (RPC), I think of it as state transfer. Specifically, I've moved away from REST towards things like Firebase that encapsulate retry logic. Under this model, failure is never indicated; apps just hang until communication is reestablished.

I actually think that apps can never really achieve 100% reliability, because there's no way to ever guarantee communication:

https://bravenewgeek.com/you-cannot-have-exactly-once-delive...

https://en.wikipedia.org/wiki/Byzantine_fault

https://en.wikipedia.org/wiki/Two_Generals%27_Problem

Although deadline propagation neatly models the human experience of feeling connected to the internet or not.

Also this is why I think that microservices without declarative programming are an evolutionary dead end. So I recommend against starting any new work with them in this era of imperative programming where so many developer hours are lost to managing mutable state. A better way is to use append-only databases like CouchDB which work similarly to Redux to provide the last-known state as the reduction of all previous states.


I missed it on the first read-through but there is a link to the code used to run the simulations in the first appendix.

Homegrown python code (i.e. not a library), very nicely laid out. And would form a good basis for more experiments for anyone interested. I think I'll have a play around later and try and train my intuition.


Really good article about retries, their consequences, and how load amplification works. Loved it.


Good reading.

In my last job, the service mesh was responsible for doing retries. It was a startup and the system was changing every day.

After a while, we suspected that some services were not reliable enough and that retries were hiding this fact. Turning off retries exposed that quality had, in fact, gone down.

In the end, we put retries in just some services.

I never tested either retry budgets or deadline propagation. I will suggest them in the future.


Why not just add telemetry to see when requests are retried?


I don't understand why load shedding wasn't considered here. My experience has been that it's so effective at making a system recover properly that everything else feels like it's handling an edge case.


It was considered:

> Then Mary and Ben discussed whether load shedding on the server would help, but in the end, they decided that a thick client in this case was acceptable.

Just guessing, but it seems that they share the client code between many teams and microservices, so possibly it was just easier to embed the logic in the shared client code. But it seems clear that they consider load shedding to be another way to handle this problem.


Yeah, sorry, I meant that I was confused by the brief mention where it was removed from consideration. I would have thought that if the team were planning to implement deadline propagation on the server anyway, they'd reach for load shedding first.


Would load shedding in this case just mean that the server replies with “nope” at a much earlier and cheaper stage?


Yes, though there's no "just" about it in many cases. If you kick the client out at an early enough stage in processing that the shed requests are only 1% as costly, then 3x retries aren't going to be able to cause a significant CPU spike. I would have liked to see how the simulation behaved, though, because I don't have empirical evidence for exactly what the traffic would look like.

You can also configure your client to stop retrying if it sees a load shedding error. That won't protect you against traffic that overwhelms the load shedder, but it's pretty effective against an innocently misbehaving client.
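
A bare-bones sketch of shedding before any expensive work happens (the concurrency limit and response shape are made up for illustration):

    import threading

    MAX_IN_FLIGHT = 100  # tune to what one instance can actually process
    _in_flight = 0
    _lock = threading.Lock()

    def handle(process_request):
        """Shed load before doing any expensive work if we're over capacity."""
        global _in_flight
        with _lock:
            if _in_flight >= MAX_IN_FLIGHT:
                # Rejecting here costs a tiny fraction of full processing,
                # so even a 3x retry storm barely moves the CPU needle.
                return 429, "overloaded"
            _in_flight += 1
        try:
            return 200, process_request()
        finally:
            with _lock:
                _in_flight -= 1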


Very nice read with lots of interesting points, examples, and examination; very thorough, imo. I'm not a microservices guy, but it gives a lot of general concepts that are also applicable outside of that domain. Very good, thanks!


Agree! Generic use of systems thinking principles and usage of that terminology allows extending lessons to our real world as well.


Strange architecture. They clearly have a queue, but instead of checking the previous request, they create a new one. It's like they managed to get the worst of both pub/sub and task queues.


No, not that kind of queue. I believe the queue in question is simply the in-memory queue of HTTP requests being processed on each instance, it's neither persisted nor shared between instances.

If an instance is stuck and not replying in time, you make a retry and the new request hopefully will be dispatched to another instance.


Great food for thought! I’m currently on an endeavor at work to stabilize some pre-existing rest service integration tests executed in parallel.


Reading this excellent article put me in the mind of wondering if job interviews for developer positions include enough questions about queue management.

"Ben" developed retries without exponential back-off, and only learned about that concept in code review. Exponential back-off should be part of any basic developer curriculum (except if that curriculum does not mention networks of any sort at all).


If you have too many deeper questions, you rule out a lot of eager juniors who can learn and grow on the job. It's a fine balance, though. Looking at the article, Ben is taking his lessons and growing; that's more important, I think, than having someone who's a guru from the get-go. Everyone has things they are better or worse at, and it's really a team effort to do everything right. Presumably someone reviewed and accepted his code, and that person also didn't catch it... there's no developer who knows everything and makes all perfect code and design. It's a well-balanced team that can help go in that direction.


I wholeheartedly agree, and realize my comment was not really clear.

Any training curriculum needs to include exponential back-off as a core concept of any system-to-system interaction.

Ben was let out of school without proper training. Kudos to the employer for finishing up the training that was missed earlier on.


However, backoff without circuit breakers is bad juju. The author asserts that circuit breakers are a necessary but insufficient counter to retry storms. I'm not sure if I agree, but I don't have a solid counterargument at the moment.

I think it also depends on how you think about reverse proxies. Are they a given or do you need to explicitly mention that you can cut servers or processes out of the cluster that are lagging and timing out?
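
For context, a bare-bones circuit breaker is something like this (the thresholds and the half-open handling are illustrative):

    import time

    class CircuitBreaker:
        """Open the circuit after consecutive failures; after a cooldown,
        allow a trial request through and close again on success."""

        def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout_s = reset_timeout_s
            self.failures = 0
            self.opened_at = None

        def call(self, fn):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout_s:
                    raise RuntimeError("circuit open, failing fast")
                # Cooldown elapsed: half-open, let one trial request through.
            try:
                result = fn()
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            self.opened_at = None
            return result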


Can't imagine working at a company with so many competent team members.

Fun narrative though!


Interesting reading. I think the article kind of misses the point. The problem was the queuing of requests where nobody was waiting for the response anymore. The same problem would manifest in a monolith with this kind of queuing. If the time to generate the response plus the maximum queue time were shorter than the timeout on the client side, the request amplification would not have happened. The first thing I do on HTTP-based backends is to massively decrease the queue size. This fixes most of these problems. An even better solution would be to be able to purge old requests from the queue, but most frameworks do not allow that, probably due to the Unix socket interface.



