I used to work in the data protection industry, doing backup software integration. Customers would ask me stupid questions like "what digital tape will last 99 years?"
They have a valid business need, and the question isn't even entirely stupid, but it's Wrong with a capital W.
The entire point of digital information vs analog is the ability to create lossless copies ad infinitum. This frees you from the need to reduce noise, chase fidelity, and rely on "expensive media" such as archival-grade paper, positive transparency slides, or whatever.
You can keep digital data forever using media that last just a few years. All you have to do is embrace its nature, and utilise this benefit.
1. Take a cryptographic hash of the content. This is essential for telling good copies from corrupt ones later, especially for the low bit-error rates that accumulate over time. Merkle trees are ideal, as used in BitTorrent. In fact, that is the best approach: create torrent files of your data and keep them as a side-car (a minimal manifest-and-verify sketch follows this list).
2. Every few years, copy the data to new, fresh media. Verify using the checksums created above. Because of the exponentially increasing storage density of digital media, all of your "old stuff" combined will sit in a corner of your new copy, leaving plenty of space for the "new stuff". This is actually better than accumulating tons of low-density storage such as ancient tape formats. This also ensures that you're keeping your data on media that can be read on "current-gen" gear.
3. Distribute at least three copies to at least three physical locations. This is what S3 and similar blob stores do. Two copies/locations might sound like enough, but temporary failures are expected over a long enough time period, leaving you in the expected scenario of "no redundancy".
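To make step 1 concrete, here's a minimal manifest-and-verify sketch in Python, assuming you just want per-file SHA-256 digests in a JSON side-car. The script and file names are made up for illustration; a .torrent file gives you roughly the same thing, plus Merkle trees for free.

```python
# Minimal sketch: build a SHA-256 manifest for an archive directory and
# verify a later copy against it. Paths/names here are illustrative only.
import hashlib
import json
import sys
from pathlib import Path

def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its digest."""
    return {
        str(p.relative_to(root)): hash_file(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def verify(root: Path, manifest: dict) -> bool:
    """Re-hash a copy and report any missing or corrupt files."""
    ok = True
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file():
            print(f"MISSING  {rel}")
            ok = False
        elif hash_file(p) != expected:
            print(f"CORRUPT  {rel}")
            ok = False
    return ok

if __name__ == "__main__":
    cmd, root = sys.argv[1], Path(sys.argv[2])
    if cmd == "create":
        Path("manifest.json").write_text(json.dumps(build_manifest(root), indent=2))
    elif cmd == "verify":
        manifest = json.loads(Path("manifest.json").read_text())
        sys.exit(0 if verify(root, manifest) else 1)
```

Run it with `create` when you write an archive, then with `verify` against each fresh copy before you retire the old media.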
... or just pay Amazon to do it and dump everything into an S3 bucket?
S3 eliminates the risk of a disk becoming unreadable, or losing data in a fire. And it's overwhelmingly likely S3 will still exist in an easily readable form in 30 years time.
But it doesn't provide protection against you forgetting to pay AWS, you losing your credentials, your account getting hacked, or your account getting locked by some overzealous automation.
> And it's overwhelmingly likely S3 will still exist in an easily readable form in 30 years time.
There is no indication that this statement holds true. Not even remotely.
Businesses fold all the time. How many services still exist today that existed 30 years ago? Not in some archive, but still operational?
In addition to that problem, tech half-life continues to decrease. 30 years in the future is likely more comparable to 60 years in the past. Hello punch-cards.
> There is no indication that this statement holds true. Not even remotely.
Well, over 30 years I'd bet on S3 over blu-ray or magnetic tape at the very least :)
For one thing, S3 itself is already 16 years old - and Ceph, B2, GCS and Azure all offer extremely similar products, indicating there's solid demand for this kind of service.
Second, it's not clear to me that 'tech half-life continues to decrease' - granted, there's huge churn in javascript web frameworks, but for PCs and laptops? Very little has changed in the last 5 years.
And thirdly, some technologies stay around for absolutely ages. Right now, you can buy a brand new data projector with a 15-pin analog VGA port - and a motherboard with a VGA output. You can get a motherboard for a 64-core ThreadRipper processor... which has a PS/2 connector.
To add to this, the core of the S3 argument (specifically with AWS and Azure) rests on the premise that AWS/Azure (AA from now on) are too big to fail or to go anywhere, but that says nothing about whether the specific services will continue as-is. They're probably fine for storage, but like any service you need to keep an eye on them, no matter what they tell you now. It's impossible to predict what changes they may introduce even next year (or next month, for that matter) that make AA S3 storage no longer feasible or usable.
Furthermore, keep in mind that once your data is in AA, it's no longer really your data; it's AA's. Sure, you can pay to retrieve it, but that's the catch -- you gotta pay the toll. Unexpected bills, hidden fees, or changes to fees may make retrieval simply not fiscally possible if your account goes delinquent. (I've seen this with _many_ clients; they tried to squeeze every coin out of the budget, had no flexibility with AA, and ended up with a delinquent account and lost access: https://aws.amazon.com/premiumsupport/knowledge-center/react...)
Now, of course this is considered "your responsibility", and that can be true of anything, but if the data size is that low, I think manually managing it is probably a safer and perhaps cheaper bet. S3 as a service is mostly fine, but a lot of people underestimate its expense and never bother to test recovery scenarios, and it ends up being a real surprise. (At least a few customers let their AWS bill go delinquent until the next fiscal year let them pay, only to find their data deleted under the policy mentioned above.)
Basically, the idea of "set and forget" backups is a pipe dream; if the data is important, you need to maintain it, and that can basically become a second job.
>30 years in the future is likely more comparable to 60 years in the past. Hello punch-cards.
This part is a pro for S3, not a con. In this analogy the OP is trying to store the information held by the punch-card, not the punch-card itself. So handing the information over to a business means they will preserve your binary data by moving it from punch-cards to HDD to SSD, etc - they handle the hardware changes and redundancy.
For S3 specifically you want to use Glacier. It's made for long term storage and is very, very cheap to store in.
Be warned though that restoration takes special procedures, time, and can be expensive. So Glacier is most definitely a place for storing stuff you hope you'll never need, not just a cheap file repository.
The Glacier fees for retrieving data in minutes are incredibly awful, so take that into account. Count on waiting 12 hours to get your stuff for cheap.
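For reference, restoring from Glacier or Deep Archive is a two-step process: you file a restore request, wait hours for the object to be thawed, then download the temporary copy. A rough boto3 sketch, with placeholder bucket and key names:

```python
import boto3

s3 = boto3.client("s3")

# Step 1: ask S3 to thaw the archived object. "Bulk" is the cheap, slow tier;
# expect many hours before the temporary copy becomes available.
s3.restore_object(
    Bucket="my-archive-bucket",        # placeholder bucket name
    Key="backups/photos-2020.tar",     # placeholder object key
    RestoreRequest={
        "Days": 7,                     # keep the thawed copy around for a week
        "GlacierJobParameters": {"Tier": "Bulk"},
    },
)

# Step 2 (later): check whether the restore has completed, then download as usual.
head = s3.head_object(Bucket="my-archive-bucket", Key="backups/photos-2020.tar")
print(head.get("Restore"))  # e.g. 'ongoing-request="false", expiry-date=...'
```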
Glacier Deep Archive is the cheapest option. The catch is a 180-day minimum storage duration. Last time I checked, the cost for us-east-1 was roughly like this:
At $0.00099/GB/month, it would cost ~$12/year to store 1TB.
Retrieval cost is $0.0025/GB and bandwidth down is $0.09/GB (exorbitant! But you get 100GB/mo free)
So, retrieving 1TB (924GB chargeable) once will run ~$85. I've also excluded their HTTP request pricing, which shouldn't matter much unless you have millions of objects.
For the same amount of data, Backblaze costs ~$60/year to store but only $10 to retrieve (at $0.01/GB).
I suppose an important factor to consider in archival storage is the expected number of retrievals, and whether you can handle the cost.
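As a back-of-the-envelope check on the numbers above, here's the arithmetic using the per-GB figures quoted in this thread, plus an assumed ~$0.005/GB/month for Backblaze storage (which matches the ~$60/year figure); prices change, so treat the constants as a snapshot rather than a reference:

```python
# Rough archive-cost comparison for ~1 TiB, using the per-GB prices quoted above.
TIB_GB = 1024

# S3 Glacier Deep Archive (us-east-1 figures from the comment above)
deep_store_year = 0.00099 * TIB_GB * 12              # ~$12/year per TiB stored
deep_retrieve   = 0.0025 * TIB_GB                    # ~$2.56 retrieval fee
deep_egress     = 0.09 * max(TIB_GB - 100, 0)        # ~$83 bandwidth after 100 GB free
print(f"Deep Archive: store ${deep_store_year:.2f}/yr, "
      f"one full retrieval ${deep_retrieve + deep_egress:.2f}")

# Backblaze B2 (storage price assumed; retrieval price from the comment above)
b2_store_year = 0.005 * TIB_GB * 12                  # ~$60/year per TiB stored
b2_retrieve   = 0.01 * TIB_GB                        # ~$10 per full retrieval
print(f"Backblaze B2: store ${b2_store_year:.2f}/yr, "
      f"one full retrieval ${b2_retrieve:.2f}")
```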
Sounds like points 1 and 2 can be elegantly combined using "next-gen" filesystems like ZFS or btrfs. The hashing and scrubbing (automatic repair) happen in the background, and the swap to new, fresh media is automatic through replacing failing hard drives. Plus, the two are open and widely adopted standards.
I always thought a, say, ZFS pool with 2-disk redundancy is not only redundant (RAID) but also serves as a backup (through snapshots). The 3-2-1 rule is good, but I feel like ZFS is powerful enough to change that. A pool with scrubbing, some hardware redundancy, and snapshots could/should no longer require two backups, just a single, offsite one.
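For illustration, the "happens in the background" part can be as small as a scheduled job that scrubs the pool and takes a dated snapshot; a rough Python sketch, with placeholder pool and dataset names:

```python
# Rough sketch of a periodic ZFS maintenance job: start a scrub and take a
# dated snapshot. Pool/dataset names ("tank", "tank/archive") are placeholders.
import subprocess
from datetime import date

POOL = "tank"
DATASET = "tank/archive"

# Kick off a scrub; ZFS verifies every block against its checksum and repairs
# from redundancy where possible. Progress is visible in `zpool status`.
subprocess.run(["zpool", "scrub", POOL], check=True)

# Take a snapshot named after today's date, e.g. tank/archive@2024-01-01.
snapshot = f"{DATASET}@{date.today().isoformat()}"
subprocess.run(["zfs", "snapshot", snapshot], check=True)

# Replicating snapshots off-site (the usual `zfs send | zfs recv` pattern)
# is what turns this into the single offsite backup mentioned above.
```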
If it's code, as the OP seems to indicate, publish it on GitHub; many services mirror source code from GitHub, and GitHub itself once put all public code into cold storage for posterity: https://archiveprogram.github.com/arctic-vault/
> Each was packaged as a single TAR file.
> For greater data density and integrity, most data was stored QR-encoded, and compressed.
> A human-readable index and guide found on every reel explains how to recover the data
> The 02/02/2020 snapshot, consisting of 21TB of data, was archived to 186 reels of film by our archive partners Piql and then transported to the Arctic Code Vault, where it resides today.
And then pay somebody to guard it. Aka the "pay someone else to do it" option that your sibling comment talks about (and of which there are many different flavors, S3 being another one).