
Can you say more about what "bringing up" from complete failure entails? Do you have an HA backup DC that you fail over to or something else?



Sure. We just moved our build configuration documentation out of our on-site Confluence into plaintext markdown so we still have it in the event of a full network breach. Tomorrow's exercise is to discuss how to leverage those docs for an exhaustive rebuild of the company, under the assumption that a full data breach has occurred and we can't trust the existing servers or backup system.
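If anyone wants to do something similar, roughly the kind of script involved looks like the sketch below. The base URL, space key, credentials, and the html2text conversion are placeholders for illustration rather than our exact tooling:

    # Hypothetical sketch: export a Confluence space to plaintext markdown files.
    # Space key, credentials, and html2text are assumptions, not our actual setup.
    import pathlib

    import html2text   # converts Confluence storage-format XHTML to markdown-ish text
    import requests

    BASE_URL = "https://example.atlassian.net/wiki"    # placeholder Confluence Cloud site
    SPACE_KEY = "BUILD"                                 # placeholder space holding build docs
    AUTH = ("user@example.com", "api-token")            # basic auth with an API token

    out_dir = pathlib.Path("runbooks")
    out_dir.mkdir(exist_ok=True)

    start = 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/rest/api/content",
            params={"spaceKey": SPACE_KEY, "expand": "body.storage",
                    "limit": 50, "start": start},
            auth=AUTH,
            timeout=30,
        )
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:
            break
        for page in results:
            title = page["title"].replace("/", "-")
            xhtml = page["body"]["storage"]["value"]
            markdown = html2text.html2text(xhtml)       # XHTML -> markdown
            (out_dir / f"{title}.md").write_text(markdown)
        start += len(results)

The point is just to get the docs into flat files you can stash somewhere offline, so they're readable even when Confluence itself is part of what you've lost.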

Like I mentioned, it's sort of an "Armageddon", or worst-case scenario. It's meant to help us identify issues with our docs, work out the order of operations, and expose any new hires to the full scope of all the moving pieces.

We already perform DR exercises between production and DR sites, including testing our online backup restores, but this tabletop takes it further.

Hope that answers your question.


Wow, that's great to hear. I wish more companies would do this. I'm curious whether you also made similar provisions for Git as you did for the Confluence content? I would love to see a writeup or blog post on these types of tabletop exercises, as I feel most companies just run on IaaS/PaaS/SaaS and hope for the best.


Not OP, but at a previous company we were given access to a new AWS environment with nothing in it and were timed on how quickly we could get all of our services operational. Fleet teams went first, then networking, then T1 services (identity, platform, etc.), and so on.


So roughly how long did it take? I've participated in tabletop exercises of that sort and couldn't convincingly get it under two weeks. And not two fun weeks.


I suppose it all depends on how much infra needs to be stood up for the absolute necessities of the business to operate. Does the company need that internal ticketing system in place to process external client transactions? Probably not, but it'll need it eventually (so maybe that moves to a second-tier restore process?). My company's RTO is 24 hours to resume processing new client transactions. Restoring old ones will definitely take longer, but at least new ones can proceed.
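To make the tiering concrete, here's a toy sketch of what that prioritization can look like on paper. The service names and RTO targets are invented for illustration, not our actual plan:

    # Hypothetical tiered restore plan: services tagged by tier, with an RTO
    # target per tier. Names and numbers are made up for illustration.
    from dataclasses import dataclass

    @dataclass
    class Service:
        name: str
        tier: int   # 1 = needed to process new client transactions

    RTO_HOURS = {1: 24, 2: 72, 3: 168}   # assumed targets per tier

    SERVICES = [
        Service("identity", 1),
        Service("payments-api", 1),
        Service("client-portal", 1),
        Service("internal-ticketing", 2),    # needed eventually, not for new transactions
        Service("historical-reporting", 3),
    ]

    # Restore order: lowest tier first, so the 24h-RTO services come up before anything else.
    for svc in sorted(SERVICES, key=lambda s: s.tier):
        print(f"tier {svc.tier} (target <= {RTO_HOURS[svc.tier]}h): restore {svc.name}")

Writing it down like this forces the argument about which services really belong in tier 1, which is half the value of the exercise.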

If your own company's RTO is two weeks, that sounds like a lot needs to be in place. Part of business continuity/disaster recovery planning is getting management to sign off on those types of numbers, big or small. Make sure they're realistic.

You're right that this type of recovery is not fun. Bryan Cantrill gives a great presentation about managing an outage (https://www.youtube.com/watch?v=30jNsCVLpAE). One of my biggest takeaways: if you're looking at a sweeping outage and a long haul of a recovery, do sleep management ASAP with your team. Dead-tired people are more likely to make brain-dead decisions.


What a great link, thanks for sharing.

>"Dead tired people are more likely to make brain dead decisions"

Indeed. Reading the post-mortem on the recent multi-day Roblox outage, it's hard not to imagine that some bad decisions were made which only made the problems worse, and that these were likely on account of people just being fried from lack of sleep:

https://blog.roblox.com/2022/01/roblox-return-to-service-10-...


The PM in charge of it spent probably 6 months to a year planning it with every team prior to our first gameday. The first gameday took 2-5 days to complete entirely, but we made it a bi-annual exercise and eventually got it down to ~12 hours before I left the company.


I'm curious why the fleet teams went before networking. Shouldn't the dependency there require it to be the other way around?


This was a couple of years ago, so you may be right. Sorry for the confusion.


What is a fleet team? I'm not familiar with that term.


The company was in the middle of an on-prem -> cloud transition. The fleet team was responsible for the on-prem hardware as well as for provisioning VMs across cloud instances.



