I think the important thing about Erlang (the system as a whole) is that you really have to understand what it does first. It's not one of those platforms you just jump into and start toying around and then maybe things work 'good enough.' Once you do understand what it does, though, it is exceptionally good at that.
This post does a great job of explaining what Erlang, as a whole, does and why it does it.
A lighter form of let-it-crash is the circuit breaker. It's used quite frequently in the JVM world because... well, the JVM has a really shitty startup time, and even restarting thread pools can be expensive.
I get the whole let-it-crash idea, but I really would like more tools for feedback control and backpressure handling (i.e. what's the right number of threads to allocate, how many failures/timeouts you should allow, etc.). Even monitoring is a pain (i.e. too many alarms). I don't know if Erlang provides libraries for this, but it's a hard problem (see https://github.com/Netflix/Hystrix/issues/131).
'Let it crash' is a philosophy geared toward handling errors.
Circuit breakers are geared toward handling resources that may become unavailable.
While they seem similar, they're conceptually very, very different. Let it crash is mostly for things where one's own code, one's own state, may end up faulty, and where recovering in a known good state will solve the issue. And it turns out this is really effective for most 'bugs'.
A circuit breaker is where -external- state, environmental state if you will, may become faulty. This is really effective not for 'bugs', but for predictable periodic issues such as one's network going down, a database becoming inaccessible, etc.
Everyone who writes a reasonably complex system in Erlang that interfaces with external systems learns the shortcomings of applying 'let it crash' to those interfaces (a network hiccup overloads your supervisor's restart threshold with crashes, taking down part, or all, of your system), and goes looking for (and hopefully finding) the circuit breaker pattern.
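To make that failure mode concrete: an OTP supervisor's restart intensity bounds how many child crashes it tolerates within a time window, and a flapping external resource can exhaust that budget. A minimal sketch, where the child module `db_conn` and the intensity/period numbers are assumptions for illustration:

```erlang
%% Minimal supervisor sketch. The child module db_conn and the
%% intensity/period values are illustrative assumptions.
-module(db_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% More than 5 restarts within 10 seconds kills the supervisor
    %% itself -- a network hiccup crashing db_conn in a loop does
    %% exactly that, taking this whole subtree down with it.
    SupFlags = #{strategy => one_for_one,
                 intensity => 5,
                 period => 10},
    Child = #{id => db_conn,
              start => {db_conn, start_link, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.
```

Bumping `intensity`/`period` only delays the problem; a breaker in front of the flaky resource removes the crash loop instead.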
Sadly, they are not mentioned much in books or other documentation, despite being a potentially extremely useful piece of infrastructure for some kinds of projects.
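A hand-rolled sketch of the pattern, to show the shape (this is purely illustrative; the function names, thresholds, and state layout are assumptions, and real libraries are more careful about half-open probing):

```erlang
%% Minimal, purely functional circuit breaker sketch.
-module(breaker).
-export([new/2, call/2]).

%% Breaker: {MaxFailures, CooldownMs, {closed, FailCount} | {open, RetryAt}}.
new(MaxFailures, CooldownMs) ->
    {MaxFailures, CooldownMs, {closed, 0}}.

%% Run Fun through the breaker; returns {Result, NewBreaker}.
call(Fun, {Max, Cooldown, State0}) ->
    Now = erlang:monotonic_time(millisecond),
    case State0 of
        {open, RetryAt} when Now < RetryAt ->
            %% Open: fail fast instead of hammering the dead resource.
            {{error, circuit_open}, {Max, Cooldown, State0}};
        _ ->
            %% Closed, or open past its cooldown (i.e. half-open).
            Fails = case State0 of
                        {closed, N} -> N;
                        {open, _}   -> Max - 1  % one more failure re-opens
                    end,
            try Fun() of
                Result ->
                    {{ok, Result}, {Max, Cooldown, {closed, 0}}}
            catch
                _:_ ->
                    NewState = case Fails + 1 >= Max of
                                   true  -> {open, Now + Cooldown};
                                   false -> {closed, Fails + 1}
                               end,
                    {{error, unavailable}, {Max, Cooldown, NewState}}
            end
    end.
```

The key property is visible in the open branch: while the cooldown runs, callers get an instant error instead of a crash, so the supervisor's restart budget is never touched.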
What we do to make the ideas of load regulation (see https://github.com/jlouis/safetyvalve or https://github.com/uwiger/jobs ) and circuit breakers trustworthy is "prove" them correct by extensive use of property-based testing. That is, it is highly unlikely that these tools have errors in production runs, because the corner cases tested for them are far more complex than what a normal program would exercise.
The reason it is nice to have circuit breakers is what Fred touched on in another thread here: you want to gracefully degrade a system even if parts of it are temporarily down, whether due to error or to maintenance. You can thus keep up the processes that are proxying for the underlying cascading dependency, and turn faults into terms of the form `{error, system_unavailable}`, which lets you turn an implicit crash into an explicit error path.
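A sketch of such a proxy call site, where an exit from a dead or absent backend becomes the explicit error term (the registered name `db_backend` and the timeout are assumptions for illustration):

```erlang
%% Turn an implicit crash (exit signal) into an explicit error path.
-module(db_proxy).
-export([query/1]).

query(Request) ->
    try
        gen_server:call(db_backend, Request, 5000)
    catch
        %% The backend process is not running at all.
        exit:{noproc, _} -> {error, system_unavailable};
        %% The backend exists but did not answer in time.
        exit:{timeout, _} -> {error, system_unavailable}
    end.
```

Callers can now pattern-match on `{error, system_unavailable}` and degrade gracefully instead of being killed by the propagated exit.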
Chapter 3 of Erlang in Anger (http://www.erlang-in-anger.com/) does mention them among other strategies in handling overload (3.2.2). I tried to put as much concise production experience as I could into that manual. Hopefully it proves helpful!
The JVM does not have a shitty startup time. Starting up a JVM takes 50-80ms. What takes time is HotSpot's warmup -- getting to peak performance. Erlang doesn't have this problem simply because it never gets anywhere near HotSpot's performance.
As for thread pools, that's an apples-to-oranges comparison. Erlang's processes should be compared to Java tasks or fibers, not to Java's heavyweight threads.
I agree with you, and I probably should have made that statement more specific (i.e. the extreme class loading that typically happens in most Java apps, and what exactly counts as a fully started-up app). A typical Clojure app, for example, is well above the 50-80ms mark before it's ready to receive requests.
The same goes for threads. I agree that ideally that should be the case, but in practice there are so many libraries that boot up their own thread pool (for isolation reasons, or because they are using blocking IO... RabbitMQ, for example).
BTW I'm a big fan of all your concurrency work and I too agree that subscribers are sort of hard to get right in reactive-streams and could be easier (I think that was you) :)
Here's a devil-in-the-details question that you might consider adding to your excellent article:
You have a web server in there, and also a storage system. What happens when the errors propagate up and the storage system dies? Does it force the entire node to reboot? Shouldn't the web server stay up to keep users informed that there is a serious problem, rather than simply going away? What's the best way to accomplish that?
Author here. This is a challenging one, because it is intimately related to what is acceptable or not to your users.
By default you could say that if the storage mechanism must be up and available and it isn't, then the front-end shouldn't be responsive and it should crash.
You could also say that you want the front-end app to be available if the storage layer is offline. This has two possible consequences:
a) you decouple the front-end and the back-end so that they do not depend on each other. This can be done either through application strategies (you can define the storage app with the 'temporary' start type so it can fail without shutting down the rest of the system) or by putting the front-end on a different Erlang node.
The latter means that your dependency on the storage back-end is not as direct as it seems.
b) this is my preferred solution, and it requires you to rework what you think of as 'depends on'. If you expect the storage layer to fail and that you must be able to service the front-end anyway, then the architecture demoed in the presentation needs an asterisk.
The reason for this is that the dependency as described crashes if the database is not available, because the storage subtree acts as a proxy for 'the database'. The OTP structure encodes 'my database is available'.
I can rework that requirement to mean 'the storage layer is up and ready to talk to a database'. This is a huge change, because it no longer promises the DB is available; it promises that something whose job it is to talk to the DB is available.
In a nutshell, the difference in both initialization and supervision approaches is that in the one described in b), the client's callers make the decision about how much failure they can tolerate, not the client itself. The client making the decisions is what is described in the presentation.
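A sketch of approach b), where the proxy's being 'up' means "ready to talk to the DB" rather than "the DB is available" (the `db_client` module and the retry interval are assumptions for illustration):

```erlang
%% Storage proxy that starts successfully even when the DB is down.
-module(store_proxy).
-behaviour(gen_server).
-export([start_link/0, query/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

query(Pid, Request) ->
    gen_server:call(Pid, Request).

init([]) ->
    %% Crucially, do NOT connect in init/1: startup succeeds even if
    %% the database is unreachable; we connect in the background.
    self() ! connect,
    {ok, #{conn => undefined}}.

handle_call(_Request, _From, #{conn := undefined} = State) ->
    %% No connection yet: hand the caller an explicit error and let
    %% it decide how much failure it can tolerate.
    {reply, {error, system_unavailable}, State};
handle_call(Request, _From, #{conn := Conn} = State) ->
    {reply, db_client:query(Conn, Request), State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(connect, State) ->
    case catch db_client:connect() of
        {ok, Conn} ->
            {noreply, State#{conn := Conn}};
        _NotYet ->
            %% Still down: retry in a second instead of crashing.
            erlang:send_after(1000, self(), connect),
            {noreply, State}
    end;
handle_info(_Other, State) ->
    {noreply, State}.
```

Note how this encodes the reworked requirement: the supervision tree stays healthy while the database is gone, and failure surfaces only at the call sites that actually need the data.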
Sadly I could not fit all of that and the compromise of supervision structures within the hour I had allocated for my presentation, so this comment and the side-blog post ought to do (I've also put that material in Erlang in Anger, if you happen to grab that free ebook).
I wish more people would talk about this kind of thing in the Erlang world. Supervision trees are nice, but there are real-world examples like the above where it's not quite so cut-and-dried, and some additional design is required. Each of your proposed solutions involves compromises, costs, and benefits of their own that may not be obvious to someone new to Erlang.
The insight of people such as yourself who have already run into these problems is very valuable to those of us with less experience.
I think a lot of these things are experience-related, or usually cemented within a specific implementation. A lot of people may apply these principles correctly because that's what they find works best, without necessarily bringing it to a conscious level, or to a level of explicitness that makes it easy to teach or use.
Garrett Smith is starting to get at that with http://www.erlangpatterns.org/ and trying to broadcast that kind of information to the rest of the community, but I'm guessing participation hasn't been strong enough to help (I know I haven't contributed enough to that website personally).
> I wrote a very tiny booklet about writing highly scalable, fault tolerant, distributed system.
I am sure you already know about Jim Gray's (Tandem Computers) article "Why Do Computers Stop and What Can Be Done About It?". It is pretty good, imho, despite being approximately 20 years old (how many millennia is that in internet time?).
Good job. I think it is way too introductory, tbh. A few examples of working distributed systems, alongside discussion of why they are the way they are, might be useful. Also, unless you have plans to update it in the future, you might get more readers interested if you publish it as a blog.
When I wrote ctadvisor[0], I continually ran into issues with certificates in the chain that weren't encoded the way I expected. Sometimes it was legitimate (it took me a week to realise I was occasionally hitting an email certificate, which looks quite different), and sometimes it was just because some CAs generate unusual certs.
Every time such a thing happened, it would crash and just plow on. I never actively planned for that. It's incredibly powerful.
A really excellent piece. Having no experience whatsoever with Erlang, I feel like I have a very strong idea of its purpose and approach after reading this. Not only that, but it's convinced me that a system for prioritizing and restarting pieces of code is essential to all projects. It seems dumb to not have a system like this. Thank you for taking the time to write this up.
"blow it up" was an example of a thing that "could not make sense" for rocket science as a quote. I also did not know (and currently do not know) of significant rocket explosions or failures that didn't result in the loss of human life, sadly.
Looking at the list at https://en.wikipedia.org/wiki/List_of_spaceflight-related_ac... I'm guessing Soyuz 33, STS-1, and a few others would have worked, but any of those would have brought back similar images. Whether the space shuttle image showed an intact shuttle or the Challenger explosion, a failure in rocket science reminds you of whichever such failures you have seen; car crashes and airplane crashes likely work the same way.
Then again, it's possible the whole slide is in bad taste. I wanted to convey what the 'let it crash' stuff felt to me the first time I heard it, and Challenger's disaster felt both higher profile and more distant in our collective memory than any random disasters I could have used.
I could probably have avoided discussing the topic entirely, but I hoped that the context around it where I think it would obviously be a bad idea to have 'blow it up' as a rocket science motto would save it. It possibly failed.
> I wanted to convey what the 'let it crash' stuff felt to me the first time I heard it, and Challenger's disaster felt both higher profile and more distant in our collective memory than any random disasters I could have used.
This was a good choice.
> ...I hoped that the context around it where I think it would obviously be a bad idea to have 'blow it up' as a rocket science motto would save it.
Given enough people, someone will inevitably take offense to anything you write. If someone is insufficiently capable of considering the context in which a reminder of a thirty-year-old high-profile disaster [0] is presented, they're gonna be unreasonably kerfluffled.
[0] A disaster that was caused by a serious failure to remember and stay within the safety margins of a very complex and hazardous system... which makes the choice of this particular disaster even more apt.
I'm glad that you're signalling that you didn't carefully read the prose. From TFA, right below the offending photo:
"In some ways it would be as funny to use 'Let it Crash' for Erlang as it would be to use 'Blow it up' for rocket science. 'Blow it up' is probably the last thing you want in rocket science — the Challenger disaster is a stark reminder of that. Then again, if you look at it differently, rockets and their whole propulsion mechanism is about handling dangerous combustibles that can and will explode (and that's the risky bit), but doing it in such a controlled manner that they can be used to power space travel, or to send payloads in orbit.
The point here is really about control; you can try and see rocket science as a way to properly harness explosions — or at least their force — to do what we want with them. Let it crash can therefore be seen under the same light: it's all about fault tolerance. The idea is not to have uncontrolled failures everywhere, it's to instead transform failures, exceptions, and crashes into tools we can use."
Thanks. I ended up going with Cygnus CRS Orb-3, which was a rocket failure, unmanned, and also has high-res reusable photos. The text is unchanged, which I believe is fine in this context.
Sure, but people died in forest fires, from bee stings, mountain climbing/cross country skiing, in aircraft. It's just you're not personally sensitive to those images.
The Challenger explosion was a heavily publicized event. Many people in my age group watched the launch and the explosion in class as primary school students. Trying to dismiss something like that as simply a personal sensitivity is a rather personal insensitivity.