> The script doing the backup did not catch the error. That may seem sloppy, but this sort of thing happens in the Real World all the time.
I disagree. I mean, I agree those things happen, but the system administrator's job is to anticipate those Real World risks and manage them with tools like quality assurance, redundancy, plain old focus and effort, and many others.
The fundamental practice of backups is testing the restore, which would have caught the problem described. It's so basic that skipping it is a well-known rookie error and the source of jokes like, 'My backup was perfect; it was the restore that failed.' What good is a backup that can't be restored?
Also, in managing those Real World risks, the system administrator has to prioritize data by value. The company's internal newsletter gets one level of care, HR and payroll another. The company's most valuable asset and work product, worth hundreds of millions of dollars? That becomes a personal mission where no mistakes are permitted: check and recheck, hire someone from outside, create redundant systems, etc. It's also a failure of the CIO, who should have been absolutely sure of the data's safety even if he/she had to personally test the restore, and of the CEO too.
I don't know or recall the details well enough to be sure, but it's possible that they were, in fact, testing the backups but had never before exceeded the 2GB limit. Knowing that your test cases cover all possible circumstances, including ones that haven't actually occurred in the real world yet, is non-trivial.
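One way to guard against that class of gap is to make the boundary itself a test parameter instead of a constant you discover in production. Below is a minimal sketch; the `backup`/`restore` hooks and the in-memory store are hypothetical stand-ins for a real backup tool, and the demo uses a tiny size so the test is cheap (in a real suite you'd point `size_bytes` at the tool's actual documented limit, e.g. 2 GiB):

```python
import hashlib
import os
import tempfile

def sha256_of(path: str) -> str:
    """Checksum a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def restore_roundtrip_ok(backup, restore, size_bytes: int) -> bool:
    """Write a file one byte over `size_bytes`, back it up, restore it,
    and verify the restored copy is bit-identical to the original."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "asset.bin")
        dst = os.path.join(d, "restored.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(size_bytes + 1))  # just past the limit
        backup(src)
        restore(dst)
        return sha256_of(src) == sha256_of(dst)

# Demo with trivial copy-based backup/restore stand-ins and a small
# "limit" so the roundtrip is fast.
_store = {}
ok = restore_roundtrip_ok(
    backup=lambda p: _store.update(data=open(p, "rb").read()),
    restore=lambda p: open(p, "wb").write(_store["data"]),
    size_bytes=4096,
)
print(ok)  # → True
```

The point is that the size boundary appears exactly once, as an argument, so adding a "just over the limit" case is a one-line change when the limit changes.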
Your post is valid from a technical and idealistic standpoint. However, when you realize the size of the data sets turned over in the film/TV world on a daily basis, restoring, hashing and verifying files during production schedules is akin to painting the Forth Bridge - only the bridge has doubled in size by the time you get halfway through, and the river keeps rising...
There are lots of companies doing very well in this industry with targeted data management solutions to help alleviate these problems (I'm not sure that IT 'solutions' exist), however these backups aren't your typical database and document dumps. In today's UHD/HDR space you are looking at potentially petabytes of data for a single production - solely getting the data to tape for archive is a full time job for many in the industry, let alone administration of the systems themselves, which often need overhauling and reconfiguring between projects.
Please don't take this as me trying to detract from your post in any way - I agree with you on a great number of points, and we should all strive for ideals in day to day operations as it makes all our respective industries better. As a fairly crude analogy however, the tactician's view of the battlefield is often very different to that of the man in the trenches, and I've been on both sides of the coin. The film and TV space is incredibly dynamic, both in terms of hardware and software evolution, to the point where standardization is having a very hard time keeping up. It's this dynamism which keeps me coming back to work every day, but also contributes quite significantly to my rapidly receding hairline!
> Your post is valid from a technical and idealistic standpoint
You seem to have direct experience in that particular industry, but I disagree that I'm being "idealistic" (often used as a condescending pejorative by people who want to lower standards). I'm managing the risk based on the value of the asset, the risk to it, and the cost of protecting it. In this case, given the extremely high value of the asset, the cost and difficulty of verifying the backup appears worthwhile. The internal company newsletter in my example above is not worth much cost.
> solely getting the data to tape for archive is a full time job for many in the industry, let alone administration of the systems themselves, which often need overhauling and reconfiguring between projects.
Why not hire more personnel? $100K/yr seems like cheap insurance for this asset.
> restoring, hashing and verifying files during production schedules is akin to painting the Forth Bridge - only the bridge has doubled in size by the time you get halfway through, and the river keeps rising...
> you are looking at potentially petabytes of data for a single production
I agree that not all situations allow you to perform a full restore as a test; Amazon, for example, probably can't test a complete restore of all systems. But I'm not talking about this level of safety for all systems; Amazon may test its most valuable, core asset, and regardless there are other ways to verify backups. In this case it seems like they could have restored the data, based on the little I know. If the verification runs days behind live data or doesn't test every backup, that's no reason to omit it; it still verifies the system, provides feedback on bugs, and reduces the maximum data loss to a period shorter than infinity.
> I disagree that I'm being "idealistic" (often used as a condescending pejorative by people who want to lower standards)
A poor word choice on my part. It was certainly not meant to come across that way, so apologies there! Agreed that a cost vs risk analysis should be one of the first items on anyone's list, especially given the perceived value of the digital assets in this instance.
This particular case is one that's hard to test - you'd restore the backup, look at it, and it would look fine; all the files are there, almost all of them have perfect content, and even the broken files are "mostly" ok.
As the linked article states, they restored the backup seemingly successfully, and it took two days of normal work until someone noticed that the restored backup was actually not complete. How would you notice that during backup testing, which (presumably) shouldn't take thousands of man-hours to do?
Good points. High-assurance can be very expensive in almost any area of IT. Speaking generally, when the asset is that valuable, the IT team should take responsibility for anticipating those problems - difficult, but not impossible. Sometimes you just have to roll up your sleeves and dig into the hard problems.
Speaking specifically, based on what you describe (neither of us is fully informed, of course), the solutions are easy and cheap: verify the number of bytes restored, the number of files and directories restored, and verify checksums (or something similar) for individual files.
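Those checks are a few dozen lines of script, nothing like a full end-to-end restore drill. A rough sketch of the idea (directory names and the deliberate "truncated file" in the demo are purely illustrative):

```python
import hashlib
import os
import shutil
import tempfile

def tree_manifest(root: str) -> dict:
    """Map each relative file path under `root` to (size, sha256)."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            h = hashlib.sha256()
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            manifest[rel] = (os.path.getsize(full), h.hexdigest())
    return manifest

def verify_restore(source: str, restored: str) -> list:
    """Return a list of discrepancies; an empty list means the
    restore matches the source byte-for-byte."""
    src, dst = tree_manifest(source), tree_manifest(restored)
    problems = []
    for rel in sorted(set(src) | set(dst)):
        if rel not in dst:
            problems.append(f"missing: {rel}")
        elif rel not in src:
            problems.append(f"unexpected: {rel}")
        elif src[rel] != dst[rel]:
            problems.append(f"corrupt: {rel}")
    return problems

# Demo: copy a tiny tree, silently truncate one restored file,
# and confirm the verification catches it.
with tempfile.TemporaryDirectory() as d:
    live, rest = os.path.join(d, "live"), os.path.join(d, "restore")
    os.makedirs(live)
    for i in range(3):
        with open(os.path.join(live, f"shot{i}.dat"), "wb") as f:
            f.write(os.urandom(256))
    shutil.copytree(live, rest)
    with open(os.path.join(rest, "shot1.dat"), "wb") as f:
        f.write(b"")  # simulate a silently truncated file
    print(verify_restore(live, rest))  # → ['corrupt: shot1.dat']
```

A "mostly fine looking" restore with a handful of truncated files, like the one described above, fails this check immediately, with no human eyeballing required.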
The impression I got from the descriptions of that incident, and especially the followup, was that their main weakness was not technical but organizational - their core business consisted of making, versioning and using a very, very large number of data assets that were very important to them, but they apparently didn't have any good process for (a) inventorying what assets they had or should have had, and (b) assigning who is responsible for each (set of) assets. Instead, the assets "simply" were there when everything was going okay, and just as simply weren't there, without anyone noticing, when it wasn't.
If they had even the most rudimentary tracking or inventory of those assets/artifacts, the same technical problems would have allowed a much simpler and faster business recovery; instead, circumstances forced them to inventory something that they (a) possibly didn't have, (b) didn't know needed to exist in the first place, and (c) had to catalog in a hurry, without preparation or adequate tools or people for the job.
IT couldn't and cannot fix that - implementing a process may need some support from IT for tooling or a bit of automation, but most of the fix would be by and for the non-IT owners/keepers of that data.
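The "bit of automation" here could be as small as a script that reconciles an asset register, maintained by the owning teams, against what actually exists. A toy sketch; the asset names and team names are invented for illustration:

```python
# A toy asset register: each known asset maps to the team responsible
# for it. The register itself is the organizational artifact; the
# script just reconciles it against reality.
register = {
    "characters/woody.model": "character-dept",
    "characters/buzz.model": "character-dept",
    "sets/andys_room.scene": "sets-dept",
}

def reconcile(register: dict, present: set) -> dict:
    """Group missing assets by responsible owner, so each team can
    confirm whether the asset should still exist."""
    missing = {}
    for asset, owner in register.items():
        if asset not in present:
            missing.setdefault(owner, []).append(asset)
    return missing

on_disk = {"characters/woody.model", "sets/andys_room.scene"}
print(reconcile(register, on_disk))
# → {'character-dept': ['characters/buzz.model']}
```

The hard part, as you say, isn't the script - it's getting the non-IT owners to keep the register current and to answer for the gaps it reports.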