
Can you say more about what "bringing up" from complete failure entails? Do you have an HA backup DC that you fail over to or something else?



Sure. We just moved our build configuration documentation out of our on-site Confluence into plaintext markdown so we still have it in the event of a full network breach. Tomorrow's exercise is to discuss how to leverage those docs for an exhaustive rebuild of the company, under the assumption that a full data breach has occurred and we can't trust the existing servers or backup system.
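If anyone wants to do something similar, roughly the kind of script involved looks like the sketch below. The base URL, space key, credentials, and the html2text conversion are placeholders for illustration rather than our exact tooling:

    # Hypothetical sketch: export a Confluence space to plaintext markdown files.
    # Space key, credentials, and html2text are assumptions, not our actual setup.
    import pathlib

    import html2text   # converts Confluence storage-format XHTML to markdown-ish text
    import requests

    BASE_URL = "https://example.atlassian.net/wiki"    # placeholder Confluence Cloud site
    SPACE_KEY = "BUILD"                                 # placeholder space holding build docs
    AUTH = ("user@example.com", "api-token")            # basic auth with an API token

    out_dir = pathlib.Path("runbooks")
    out_dir.mkdir(exist_ok=True)

    start = 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/rest/api/content",
            params={"spaceKey": SPACE_KEY, "expand": "body.storage",
                    "limit": 50, "start": start},
            auth=AUTH,
            timeout=30,
        )
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:
            break
        for page in results:
            title = page["title"].replace("/", "-")
            xhtml = page["body"]["storage"]["value"]
            markdown = html2text.html2text(xhtml)       # XHTML -> markdown
            (out_dir / f"{title}.md").write_text(markdown)
        start += len(results)

The point is just to get the docs into flat files you can stash somewhere offline, so they're readable even when Confluence itself is part of what you've lost.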

Like I mentioned, it's sort of an "Armageddon", or worst-case scenario. It's meant to help us identify issues with our docs, work out the order of operations, and expose any new hires to the full scope of all the moving pieces.

We already perform DR exercises between production and DR sites, including testing our online backup restores, but this tabletop takes it further.

Hope that answers your question.


Wow, that's great to hear. I wish more companies would do this. I'm curious whether you also made similar provisions for Git as you did for the Confluence content? I would love to see a writeup or blog post on these types of tabletop exercises, as I feel most companies just run on IaaS/PaaS/SaaS and hope for the best.


Not OP, but at a previous company we were given access to a new AWS environment with nothing in it and were timed on how quickly we could get all of our services operational. Fleet teams went first, then networking, then T1 services (identity, platform, etc.), and so on.


So roughly how long did it take? I've participated in tabletop exercises of that sort and couldn't convincingly get it under two weeks. And not two fun weeks.


I suppose it all depends on how much infra needs to be stood up for the absolute necessities of the business to operate. Does the company need that internal ticketing system in place to process external client transactions? Probably not, but it'll need it eventually (so maybe that moves to a second-tier restore process?). My company's RTO is 24 hours to resume processing new client transactions. Restoring old ones will definitely take longer, but at least new ones can proceed.
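To make the tiering concrete, here's a toy sketch of what that prioritization can look like on paper. The service names and RTO targets are invented for illustration, not our actual plan:

    # Hypothetical tiered restore plan: services tagged by tier, with an RTO
    # target per tier. Names and numbers are made up for illustration.
    from dataclasses import dataclass

    @dataclass
    class Service:
        name: str
        tier: int   # 1 = needed to process new client transactions

    RTO_HOURS = {1: 24, 2: 72, 3: 168}   # assumed targets per tier

    SERVICES = [
        Service("identity", 1),
        Service("payments-api", 1),
        Service("client-portal", 1),
        Service("internal-ticketing", 2),    # needed eventually, not for new transactions
        Service("historical-reporting", 3),
    ]

    # Restore order: lowest tier first, so the 24h-RTO services come up before anything else.
    for svc in sorted(SERVICES, key=lambda s: s.tier):
        print(f"tier {svc.tier} (target <= {RTO_HOURS[svc.tier]}h): restore {svc.name}")

Writing it down like this forces the argument about which services really belong in tier 1, which is half the value of the exercise.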

If your own company's RTO is two weeks, that sounds like a lot needs to be in place. Part of business continuity/disaster recovery planning is getting management to sign off on those types of numbers, big or small. Make sure they're realistic.

You're right that this type of recovery is not fun. Bryan Cantrill gives a great presentation about managing an outage (https://www.youtube.com/watch?v=30jNsCVLpAE). One of my biggest takeaways: if you're looking at a sweeping outage and a long haul of a recovery, do sleep management ASAP with your team. Dead-tired people are more likely to make brain-dead decisions.


What a great link, thanks for sharing.

>"Dead tired people are more likely to make brain dead decisions"

Indeed. Reading the post-mortem on the recent multi-day Roblox outage, it's hard not to imagine that some bad decisions were made which only made the problems worse, and that these were likely on account of people just being fried from lack of sleep:

https://blog.roblox.com/2022/01/roblox-return-to-service-10-...


The PM in charge of it spent probably 6 months to a year planning it with every team prior to our first gameday. The first gameday took 2-5 days to complete entirely, but we made it a bi-annual exercise and eventually got it down to ~12 hours before I left the company.


I'm curious why the fleet teams went before networking. Shouldn't the dependency there require it to be the other way around?


This was a couple of years ago, so you may be right. Sorry for the confusion.


What is a fleet team? I'm not familiar with that term.


The company was in the middle of an on-prem -> cloud transition. The fleet team was responsible for the on-prem hardware as well as for provisioning VMs across cloud instances.



