
You don't need to consider 'what if a meteor hit the data centre and also it was made of cocaine'. You do need to think through "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."



In a company the size of Facebook, "everything is turned off" has never happened in the 17 years since the company was founded. This makes it very hard to be sure you can bring it all back online! Every time you try it, additional issues crop up, and even when you think you've found them all, a new team you've never heard of has wedged itself into the data-center boot-up flow.

The meteor isn't made of cocaine, but four of them hitting at exactly the same time is freakishly improbable. There are other, bigger fish to fry, so we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.


>we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.

I think that suggests that there were not bigger fish to fry :)

I take your point on priorities, but in a company the size of Facebook, perhaps a team dedicated to understanding the challenges of kickstarting the infrastructure from scratch could be funded as part of the BCP planning - this is a good time to have a binder with, if not perfectly up-to-date data, pretty damned good indications of a process to get things working.


>> we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.

> I think that suggests that there were not bigger fish to fry :)

I can see this problem arising in two ways:

(1) Faulty assumptions about failure probabilities: You might presume that meteors are independent, so simultaneous impacts are exponentially unlikely. But really they are somehow correlated (meteor clusters?), so simultaneous failures suddenly become much more likely.

(2) Growth of failure probabilities with system size: A meteor hitting a datacenter on Earth is extremely rare. But in the future there might be datacenters across the whole galaxy, so one of them gets hit every month or so.

In real, active infrastructure there are probably even more pitfalls, because estimating small probabilities is really hard.
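
A back-of-the-envelope sketch of both effects (the numbers are made up purely for illustration, not real failure rates):

  # Toy numbers, purely for illustration.
  P_HIT = 1e-4      # assumed chance a given datacenter is hit in a given month

  # (1) Under an independence assumption, four simultaneous hits is P_HIT**4,
  # i.e. 1e-16 -- effectively "impossible". But if hits are correlated (one
  # "meteor cluster" event tends to take out several sites at once), the joint
  # probability is set by the cluster rate instead, which can be many orders
  # of magnitude larger.
  P_CLUSTER = 1e-6  # assumed chance of a cluster event hitting 4+ sites at once
  print(f"independent: {P_HIT**4:.0e}, correlated: {P_CLUSTER:.0e}")

  # (2) With N datacenters, the chance that at least one is hit this month is
  # 1 - (1 - P_HIT)**N, which climbs toward certainty as N grows.
  for n in (1, 1_000, 100_000, 1_000_000):
      print(f"N={n:>9}: P(at least one hit) = {1 - (1 - P_HIT) ** n:.4f}")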


> "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."

The electricity people have a name for that: black start (https://en.wikipedia.org/wiki/Black_start). It's something they actively plan for, regularly test, and once in a while, have to use in anger.


It's a process I'm familiar with gaming out. For our infrastructure, we need to discuss and update our plan for this from time to time, from 'getting the generator up and running' through to 'accessing credentials when the secret server is not online' and 'configuring network equipment from scratch'.
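
One way to keep that kind of plan from going stale is to write the boot-order dependencies down explicitly and generate the checklist from them. A minimal Python sketch, with entirely hypothetical step names:

  from graphlib import TopologicalSorter

  # Hypothetical black-start dependencies: each step maps to the steps that
  # must already be up before it can be brought online.
  steps = {
      "generator power":       [],
      "core network switches": ["generator power"],
      "out-of-band console":   ["generator power"],
      "secrets server":        ["core network switches"],
      "internal DNS":          ["core network switches"],
      "config management":     ["secrets server", "internal DNS"],
      "application servers":   ["config management"],
  }

  # static_order() yields the steps so that every prerequisite comes first --
  # the order the runbook should walk through from a cold start.
  for step in TopologicalSorter(steps).static_order():
      print(step)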


I love that when you had to think of a random improbable event, you thought of a cocaine meteor. But ... hell YES!



