1. They provide a service to people around the world, yet they don't ensure that someone is available as an emergency contact on Sunday evening, when the first post-deployment usage happens.
2. They don't have a universal list of "this breaks, contact that guy".
3. They don't have a known instant rollback procedure for a release.
4. They don't have cross-component integration tests and they don't do them manually either.
5. They decide that since they can't do a release that doesn't break stuff and can't organise themselves to resolve it quickly during the weekend when it affects only a small number of people, they'll do releases in the middle of the day now, so that they hear customers complaining right away.
Is that for real? Is he serious? Here's what I would get out of that issue (even if it's basically reiterating the "wrong" things above):
They need to do more integration testing before a release. They need to know who to contact and have to make sure the person is on call and ready for action. The person handling the issue needs to have a simple, quick way to reverse the release without manual intervention (tweaking the code). Again, this specific issue should get regression tests right away. And the most important thing - NEVER treat your customers as a test suite.
Of course I'm aware not everyone can afford operating like that. But at least this could be their goal. "Let's make breakage affect more people, so we know about it earlier and when we're at work" is a really silly conclusion.
EDIT: I posted a rundown of our deployment process, including where and how tests happen, and why they failed to catch this bug, at http://news.ycombinator.com/item?id=2301680 .
While I'm sure there's a lot of stuff we could improve, the situation's not exactly as you describe.
Responding to a few of your points:
1 & 2. We do have a list of "if this breaks, contact this guy." What we don't have (in response to your first point) is a demand that those people be available Sunday night.
3. We have a known rollback procedure. It does not work if we do an irreversible schema change and the problem isn't caught until 20 hours later. We couldn't just throw out 20 hours of data.
4. We actually do a lot of testing. Beginning on Wednesday, we deploy to our early leak accounts. We steadily increase that through the week. The problem with this particular bug is that you could use Kiln lightly (most of our test accounts are not large accounts) without hitting this problem at all. Even the full QA test suite did not trigger the problem. That happened because Kiln was designed to keep working in the case of a FogBugz communication failure until it couldn't; how quickly it hit that point was directly proportional to how much you used Kiln. The real problem here, which has been fixed, is that Kiln should not attempt to hide a problem communicating with FogBugz.
5. We don't do releases in the middle of the day. We do them at 10 PM. I have no idea where you got that.
There's a lot we can improve. We need to make sure that when Kiln can't talk to FogBugz (a condition that can bring Kiln down), it hard-fails instead of trying to continue. We need to make sure that all hands are on deck when people are going to work, as you noted, which is vastly easier to do midweek than Sunday night. And we probably ought to add more automated testing to the integration points. But I think you're painting a somewhat unfair picture of the current situation.
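Roughly what "hard-fail instead of trying to continue" means in practice, as a sketch. The names here (call_fogbugz, FogBugzUnavailable, the 15-minute grace period) are made up for illustration, not actual Kiln code:

    import time

    class FogBugzUnavailable(Exception):
        """Raised once a FogBugz communication failure can no longer be hidden."""

    MAX_STALE_SECONDS = 15 * 60   # hypothetical grace period for serving cached data
    _cache = {}                   # key -> (value, fetched_at)

    def call_fogbugz(key, fetch):
        """fetch() is the real API call; any exception means FogBugz is unreachable."""
        try:
            value = fetch()
            _cache[key] = (value, time.time())
            return value
        except Exception as exc:
            cached = _cache.get(key)
            if cached and time.time() - cached[1] < MAX_STALE_SECONDS:
                # Tolerate a brief outage, but leave a trail in the logs.
                print("WARNING: FogBugz call %r failed, serving cached value: %s" % (key, exc))
                return cached[0]
            # Past the grace period: fail loudly now, not 20 hours from now.
            raise FogBugzUnavailable("FogBugz call %r keeps failing: %s" % (key, exc))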
1. Tim, our sysadmin, was the emergency contact and was on top of things ASAP, but didn't have the particular knowledge to fix it himself. That required a developer.
2. He called me first, since the problem appeared to be in Kiln. I'm a Kiln dev; I can and have fixed things on a Sunday evening after a deploy. I missed his first call, but got back to him within 3 minutes (the 30 minutes mentioned in the article was Ben's guess). I started diagnosing the problem and realized it wasn't Kiln specifically that was the problem, but something in the communication between the two. That meant we needed a FogBugz dev, which we got quickly, and possibly a deploy.
3. That led us to investigate rolling the specific account back to the previous version. 98 percent of our updates are reversible, but as Ben mentioned, this particular release included not one, but two irreversible database migrations, and since the upgrade step had run successfully, going back would not be an option.
4. All tests passed (both automated and manual). Ben has updated the article to make it clear that all but one API call between Kiln and FogBugz was working, and the one call that was broken, the one that led to this crash, is called very infrequently (on the order of months for some accounts). Yes, integration tests should have covered and will cover that one API call, but missing one corner case is very different from not doing integration testing.
5. Given the situation we were in, the problem will always be solved more quickly when we're in the office than when we're at home. We take every possible precaution to avoid outages, but they will still happen, and moving to mid-week deploys is just another precaution to decrease the impact of these outages if and when they occur in the future.
So in short, yes, this is the very definition of real software, and we take this very seriously. Your bullet list of armchair-quarterback suggestions grossly oversimplifies the situation. The goal is to have problems affect fewer users, and that depends directly on our response time.
It is so refreshing for any person or company to come out and explain, in detail, how they screwed up and what they are doing to ensure it doesn't happen again.
The tone of your post encourages people to cover up their mistakes for fear of ridicule, and I am against that. Some of your points are worthy of debate though.
You have large numbers of paying customers to whom you're delivering a mission-critical system (source control isn't exactly optional), and your releases involve neither automated production monitoring/continuous deployment nor formal release procedures?
I think your problem is more than just weekend deployments!
Back to the deploy process, nine minutes have elapsed and a commit has been greenlit for the website. The programmer runs the imvu_push script. The code is rsync’d out to the hundreds of machines in our cluster. Load average, CPU usage, PHP errors and dies and more are sampled by the push script, as a baseline. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handful of shell scripts.
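For anyone curious what that push-sample-rollback loop looks like in code, here's a loose sketch. The host lists, commands, thresholds, and sample_metrics() are placeholders, not IMVU's actual tooling (theirs is shell):

    import time

    CANARY_HOSTS = ["web01", "web02"]                    # small subset of the cluster
    ALL_HOSTS = ["web%02d" % i for i in range(1, 101)]   # the whole cluster

    def run(cmd):
        # Dry-run for the sketch; a real script would shell out (e.g. subprocess.check_call).
        print("+", cmd)

    def sample_metrics(hosts):
        # In reality: load average, CPU usage, PHP errors and dies, sampled per host;
        # collapsed here to a single error-rate number.
        return 0.0

    def regressed(baseline, current, tolerance=0.05):
        return current > baseline + tolerance

    def push(revision):
        baseline = sample_metrics(ALL_HOSTS)             # sample before touching anything
        run("rsync-and-switch-symlink %s %s" % (revision, ",".join(CANARY_HOSTS)))
        time.sleep(60)
        if regressed(baseline, sample_metrics(CANARY_HOSTS)):
            run("rollback %s" % ",".join(CANARY_HOSTS))
            raise SystemExit("statistically significant regression on canaries; rolled back")
        run("rsync-and-switch-symlink %s %s" % (revision, ",".join(ALL_HOSTS)))
        time.sleep(300)                                  # keep watching for five minutes
        if regressed(baseline, sample_metrics(ALL_HOSTS)):
            run("rollback %s" % ",".join(ALL_HOSTS))

    push("r12345")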
In fairness, that's a description of what a routine and successful build "should" go like. I bet if IMVU were to post a blow-by-blow account of their hairiest deployment screwup ever, it would be a good bit more colorful than that.
There are some headscratchers in the description of the FogBugz problem, but kudos to them for explaining how and why things broke.
The releases are both automated (except for one component, as noted, which we are now automating) and fully vetted.
Here is the old release process:
1. Monday morning, the version to be used for the next release is automatically built for the QA team, who begins running their test suites on it and doing soft checks.
2. By no later than Wednesday, the new version is leaked to testing and alpha accounts on Fog Creek On Demand. Tests are re-run at this point.
3. The leak is increased later in the week if the QA results look good, or the weekend release is canceled, depending on how testing goes.
4. Provided everything has been good, on Saturday night, the leak is increased to 100% of customers. This step does not have a full QA rundown, because the code has already been vetted several times by QA at this point. The sanity checks are truly sanity checks.
5. At the same time, we check that our monitoring system (Nagios) agrees that all accounts are online and that there are no major problems, such as massive CPU spikes.
So far, so good. The issue with this release is we had a bug that did not manifest for a while, because Kiln had been deliberately designed to ignore the failure condition "as long as possible", which ended up just being too damn long. Once we started having failures, we noticed--that's why our sysadmin called us in--but those failures started happening 20 hours after the 100% release, and several days after testing and alpha accounts were upgraded.
I am not arguing our system is perfect, but I'm nonplussed as to where the your-deployment-system-totally-sucks stuff is coming from. I'll ask our build manager to post an even more detailed rundown.
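As an aside, the "all accounts are online" part of step 5 doesn't need to be elaborate; a post-release check can be a few lines along these lines (the status URL pattern, account list, and alert() hook are invented here, not Fog Creek's actual Nagios setup):

    import urllib.request

    ACCOUNTS = ["acme", "globex", "initech"]              # in reality, pulled from the accounts DB

    def healthy(account):
        url = "https://%s.example.com/status" % account   # hypothetical status endpoint
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status == 200
        except Exception:
            return False

    def alert(message):
        print("ALERT:", message)                          # wire to email/pager/Nagios in real life

    down = [a for a in ACCOUNTS if not healthy(a)]
    if down:
        alert("accounts offline after release: %s" % ", ".join(down))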
Sincere question: how do you leak irreversible schema changes to a subset of accounts? Isn't the point of the leak that you're not confident and might need to reverse it? Or are you willing to let those accounts get hosed?
Perhaps have Kiln send notifications on the failure conditions even if it doesn't throw an error? Better a few false positives than no indication at all.
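As a sketch of that idea, a wrapper can keep returning the last good value while still pinging ops on every failure. notify() stands in for whatever alerting hook you have; none of these names are Kiln's:

    import functools

    def notify(message):
        print("NOTIFY OPS:", message)        # placeholder: email, pager, metrics counter...

    def notify_on_failure(fn):
        """Serve the last good result on failure, but never do so silently."""
        last_good = {}
        @functools.wraps(fn)
        def wrapper(*args):
            try:
                last_good[args] = fn(*args)
            except Exception as exc:
                notify("%s%r failed: %s" % (fn.__name__, args, exc))
                if args not in last_good:
                    raise                    # nothing cached: this one can't be hidden
            return last_good[args]
        return wrapper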
I agree. My other thought was 'isn't there a staging server in there somewhere?' Something that is near identical to production, with fake production data, etc, that could surface the problem before a customer sees it.
btw, props to Fog Creek and OP for airing their dirty laundry. They take some heat, but in the end we all learn from it.
It's stunning how easy it is to spot a specific lack of "automated production monitoring" after something fails. Hey idiot, you should've been testing that thing!
I've seen all of Fog Creek's automated production monitoring courtesy of their sysadmins and devs as it was months ago, and it was very solid. I'm sure it's only gotten better.
This is a case of a specific deployment failure slipping through the cracks and being honestly explained, apologized for, and rectified. I'm obviously biased due to my history (and probably-justified guilt for this particular failure), but shotgun criticism about formal release procedures is very misguided.
Two better approaches come to mind to resolve this:
1. Continuous deployment with automated production monitoring and verification, so that a bad release is caught and rolled back without waiting for users to complain.
2. Full-on, properly managed releases like they do in large IT corporations, such as banks, where a "release" is not something you kick off from home via SSH on a Saturday night, but a properly planned effort that involves critical members of the dev team as well as the QA team being present and ready to both test the production system thoroughly and fix any issues that may occur.
What you describe in #2 here sounds like a complete anti-pattern when compared with the idea of continuous deployment and automated verification. This 2nd approach sounds like a huge manual effort.
It absolutely is, and I'd be surprised to see this kind of effort from any but the most paranoid corporations (like, as I mentioned, banks). Automation and continuous deployment are definitely the way forward.
But even this gargantuan effort is a better option than just "let's deploy and wait for our users to tell us if anything has gone wrong".
To be fair, it sounds from the original article like they did do some verification that things were working after the deployment. However, for some reason their verification tests didn't reveal the presence of a real bug.
Even in a more gargantuan system, it's possible for tests to pass and give a false positive even though a real bug is present.
Everyone will screw up releases at some point, the key is to be able to learn from them and get better.
If you're making a big change, you first cut a CR (change request) and get approval from any teams involved. At change time, everyone knows they need to be on call if something breaks, preferably in a live chatroom.
The rest of the time, devs should just deploy when they think the code is ready and have tested it on a production-like box. They then manually verify the change worked. You use automated monitoring to ensure that when something does break, you're notified immediately.
Their release procedures didn't cover the case, and they're fixing it ("modifying the communication ... [to] fail early and loudly during our initial tests", according to the "with details" post on their status blog[1]).
But I still find their lack of monitors... disturbing.
I agree except I don't think continuous deployment necessarily means automatic deployment. Every deploy should be done by a person and tested right after; none of this "push out all commits at X time" or "push as soon as it's committed" as both are risky.
During the day is usually preferred, and never at 4:59 PM on a Friday or right before everyone goes to lunch (ever had to clean up a downed cluster when some jerk pushed bad code and the whole team went to Sweet Tomatoes? yeah).
To help troubleshoot breaks, have a mailing list with changelogs showing who made a change, time/date, files touched. Also have your deploy tools mail it when there's a code push, rollback, server restart, etc. Have a simple tool someone can run to revert changes back to a time of day so if something breaks just "revert back to 6 hours ago" and debug while your old app is running (nice to take one broken box offline first to test on).
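For the revert tool, if your deploys come straight out of git, the core of it is just "find the last revision before that time and push it." A rough sketch, where deploy() is a stand-in for whatever push script you already have:

    import subprocess, sys

    def sh(cmd):
        return subprocess.check_output(cmd, shell=True, text=True).strip()

    def deploy():
        print("(re)deploying checked-out tree...")     # call your real push script here

    def revert_to(hours_ago):
        rev = sh('git rev-list -1 --before="%d hours ago" origin/master' % hours_ago)
        if not rev:
            sys.exit("no revision found that far back")
        sh("git checkout %s" % rev)
        deploy()
        print("reverted to %s (%d hours ago)" % (rev[:8], hours_ago))

    if __name__ == "__main__":
        revert_to(int(sys.argv[1]))                    # e.g. `python revert.py 6`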
I work for web dev agencies and it surprises me just how often they launch on a Friday afternoon, despite every single developer pleading with them that it's an absolutely awful idea.
Golden rule: never launch on a Friday.
Personally, I've found it easy to persuade clients to do this once you say it'll cost an extra ten grand just for the privilege of a Friday launch.
I feel for you. On the plus side, process improvements to prevent it from happening next time are exactly how you should respond to things like this.
One which has saved my bacon numerous times is investing a few hours into tweaking monitoring and alert systems. I hear PagerDuty exists to help with this. I use a bunch of scripts and bubblegum, and even that caught 10 of the last 12 big problems. Queuing systems dying has hosed me many times over the years, for example, and a borked deploy which causes that would have my phone ringing before I got my laptop closed.
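The "scripts and bubblegum" version of the queue check can be as dumb as a cron job like this sketch; queue_depth() and page() stand in for your actual queue client and alerting hook:

    import json, os

    STATE_FILE = "/tmp/queue_watch.json"
    MAX_DEPTH = 10000            # tune to normal traffic
    MAX_STUCK_RUNS = 5           # cron runs (minutes) of zero progress before paging

    def queue_depth():
        return 0                 # ask RabbitMQ/Redis/your DB here

    def page(msg):
        print("PAGE:", msg)      # PagerDuty, email, SMS -- whatever wakes you up

    depth = queue_depth()
    state = {"last_depth": 0, "stuck_runs": 0}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)

    if depth > MAX_DEPTH:
        page("queue depth %d exceeds %d" % (depth, MAX_DEPTH))
    elif depth > 0 and depth >= state["last_depth"]:
        state["stuck_runs"] += 1
        if state["stuck_runs"] >= MAX_STUCK_RUNS:
            page("queue has not drained for %d checks" % state["stuck_runs"])
    else:
        state["stuck_runs"] = 0

    state["last_depth"] = depth
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)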
I've tried convincing many companies I've worked for that weekend deployments are a bad idea over the years.
Even with continuous integration tests, rolling deployments, and all the precautions in the world things can still happen.
You need live people available to handle a deployment.
Personally, I don't like working on weekends. I've worked for companies that refused to believe that this was a bad idea. I learned pretty fast that life is too short to work on a weekend.
If something does go wrong, it's better to have people on hand to correct the error and get back on track. It's much easier to schedule those people during the work week. It's not rocket science.
Probably? I'm welcome to be schooled here. 90% of the time, we can roll back instantly, because there were no database changes. 5% of the time, we can roll back with slightly more pain, because the database migrations were reversible. In this case, the database migration was not reversible. If we'd noticed immediately, we could still have just activated snapshots, but we didn't notice until 20 hours later. What do others do in this situation?
Wait. What blew up that it took someone 20 hours to realize? The first thing you take from that is: don't do anything without double-checking your change to make sure it worked.
In terms of rollback, just don't do anything which isn't reversible. Taking chances with your changes is taking chances with your business. If you don't know how to rollback whatever you're doing, ask someone who does (there is always a way to roll back or add redundancy).
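One concrete way to keep schema changes reversible is the expand/contract pattern: ship the additive half alongside the risky code, and ship the destructive half in a later release once the new code has proven itself. A sketch with made-up table and column names (the DROP COLUMN step needs SQLite 3.35+):

    import sqlite3

    def expand(db):
        # Release N: add the new column and backfill it. Old code keeps working
        # because nothing it reads was removed, so rolling back is just
        # redeploying the old code; the extra column is harmless.
        db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
        db.execute("UPDATE users SET display_name = fullname")

    def contract(db):
        # Release N+1, days later, once nothing reads the old column: the only
        # irreversible step ships on its own, long after the risky code change.
        db.execute("ALTER TABLE users DROP COLUMN fullname")

    if __name__ == "__main__":
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
        db.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")
        expand(db)
        # ...new code reads display_name for a release or two...
        contract(db)
        print(db.execute("SELECT display_name FROM users").fetchall())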
"The failed API call turns out to be one that’s trivially cached for a very long time, and so is one that Kiln would allow to fail without actually dying."
I was taught that Thursdays are best for deployment, because you've got Friday to fix the stupid things and then the weekend to fix the terrible things. By Monday everything is working anyway.
And best of all, on Friday people are generally happy (it's the last day of the week), whereas on Monday you can expect grumpy users.
Users seem more comfortable with predictable maintenance than arbitrary outages. Weekend deploys are just bad all around.
When I began in my current role (managing QA/DBAs and app deploys) one of the first things I killed was the late Friday/weekend deploys. They are spirit-crushing and if they go south, they usually go south in a terminal-velocity nose dive.
We set up early Fridays for maintenance, to give us enough time in case something goes south. Aggressive Change Control Requests mean the people impacted get a heads-up (including Account Managers, who in turn inform clients) if there are any user-facing impacts, and we avoid trying to pack too much in at once.
Having QA, Engineering, and the SOC team on hand is... helpful. Maybe it's paranoid, but it's been very solid so far. When things have gone south, I think the events, since everyone is "on deck", have actually helped build some camaraderie in the teams themselves.
I seem to accrete job roles, and one time I was so far behind on my testing that I went to the boss to ask for a Wed-Sunday work week so I could get some actual work done without interruptions.
The first fortnight was great, got through a lot of backlog on sat/sun.
The second fortnight sucked as I was blocked on stupidly trivial issues. That ended the experiment.
I guess the moral of the story is to pick and choose your 'out of hours' work wisely.
In my experience, there are two kinds of deployment -- ones without DB changes and ones accompanied by DB changes.
The deployments that do not require DB changes are easy -- mirror the prod box (non-DB) onto a smaller box, then deploy upgrades/updates to the prod box. If things go wrong, put the mirror box online with a DNS/proxy change while apologizing to the customers who complain about slower performance.
When DB changes are involved, you need to have your DBAs do a dry run of backing out the changes -- after all, practice makes perfect. Communicate the scheduled outage to customers, back up the DB, and mirror your production box. Roll out the update -- if things go wrong, restore the DB and bring the mirror box online.
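In script form, that backup/roll-out/restore procedure is roughly the following sketch; the commands assume Postgres and a migrate.sql file, so substitute your own DB tooling:

    import subprocess, sys, time

    DB = "appdb"
    BACKUP = "/backups/%s-%d.dump" % (DB, int(time.time()))

    def sh(cmd):
        subprocess.check_call(cmd, shell=True)

    def smoke_test():
        # Replace with real checks: key pages load, row counts look sane, etc.
        return True

    sh("pg_dump -Fc %s -f %s" % (DB, BACKUP))         # 1. back up first
    try:
        sh("psql %s -f migrate.sql" % DB)             # 2. apply the schema change
        if not smoke_test():
            raise RuntimeError("smoke test failed")
    except Exception as exc:
        print("rolling back:", exc)
        sh("dropdb %s" % DB)                          # 3. restore from the backup
        sh("createdb %s" % DB)
        sh("pg_restore -d %s %s" % (DB, BACKUP))
        sys.exit(1)
    print("migration applied; backup kept at", BACKUP)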
I have always focused more on the DB aspect -- loss of data integrity can cause customers to go looking for your replacement.
But I am not sure that weekly upgrades of a production environment with paying customers are advisable.
I'm lucky to run a system that is small enough that an entire deploy consists of around 2 seconds of downtime for the server to restart and start the new instance of the application.
We deploy new versions side by side, and on restart the webserver points at the new application.
The only time it takes any longer is when there are sweeping database changes (schedule the downtime, inspect snapshots in case of issues, etc.).
We use PagerDuty (http://pagerduty.com) at HipChat and while I absolutely loathe being woken up by it, it's helped us identify issues during off-peak hours much more quickly.
But no matter what systems you have in place or how many hundreds of deploys you've done, there's always a new way for things to break.
Oh god. I release at 5 PM Pacific every week because users "can't have a single second" of downtime. We manually test an ever-growing checklist of functionality. There is always, always an issue. The angry emails start to roll in around 5:15 Pacific.
Their problem isn't their deployment process, it's their monitoring.
Blindly ignoring errors is a recipe for failure. You should always look at a situation like that and ask, "How can we monitor this weak point?" Logging plus a service like Splunk works great.
You should also always have a solid on-call rotation. We have two rotations: an ops one, which is first line, and a dev one in case deeper code changes or more eyes are needed.