I used to work in the data protection industry, doing backup software integration. Customers would ask me stupid questions like "what digital tape will last 99 years?"
They have a valid business need, and the question isn't even entirely stupid, but it's Wrong with a capital W.
The entire point of digital information vs analog is the ability to create lossless copies ad infinitum. This frees you from the need to reduce noise, chase fidelity, and rely on "expensive media" such as archival-grade paper, positive transparency slides, or whatever.
You can keep digital data forever using media that last just a few years. All you have to do is embrace its nature, and utilise this benefit.
1. Take a cryptographic hash of the content. This is essential for telling good copies from corrupt ones later, especially for the low bit-error rates that accumulate over time. Merkle trees are ideal, as used in BitTorrent. In fact, that is the best approach: create torrent files of your data and keep them as a side-car (a minimal manifest-and-verify sketch follows this list).
2. Every few years, copy the data to new, fresh media. Verify using the checksums created above. Because of the exponentially increasing storage density of digital media, all of your "old stuff" combined will sit in a corner of your new copy, leaving plenty of space for the "new stuff". This is actually better than accumulating tons of low-density storage such as ancient tape formats. This also ensures that you're keeping your data on media that can be read on "current-gen" gear.
3. Distribute at least three copies to at least three physical locations. This is what S3 and similar blob stores do. Two copies/locations might sound like enough, but temporary failures are expected over a long enough time period, leaving you in the expected scenario of "no redundancy".
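To make step 1 concrete, here's a minimal manifest-and-verify sketch in Python, assuming you just want per-file SHA-256 digests in a JSON side-car. The script and file names are made up for illustration; a .torrent file gives you roughly the same thing, plus Merkle trees for free.

```python
# Minimal sketch: build a SHA-256 manifest for an archive directory and
# verify a later copy against it. Paths/names here are illustrative only.
import hashlib
import json
import sys
from pathlib import Path

def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its digest."""
    return {
        str(p.relative_to(root)): hash_file(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def verify(root: Path, manifest: dict) -> bool:
    """Re-hash a copy and report any missing or corrupt files."""
    ok = True
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file():
            print(f"MISSING  {rel}")
            ok = False
        elif hash_file(p) != expected:
            print(f"CORRUPT  {rel}")
            ok = False
    return ok

if __name__ == "__main__":
    cmd, root = sys.argv[1], Path(sys.argv[2])
    if cmd == "create":
        Path("manifest.json").write_text(json.dumps(build_manifest(root), indent=2))
    elif cmd == "verify":
        manifest = json.loads(Path("manifest.json").read_text())
        sys.exit(0 if verify(root, manifest) else 1)
```

Run it with `create` when you write an archive, then with `verify` against each fresh copy before you retire the old media.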
... or just pay Amazon to do it and dump everything into an S3 bucket?
S3 eliminates the risk of a disk becoming unreadable, or losing data in a fire. And it's overwhelmingly likely S3 will still exist in an easily readable form in 30 years time.
But it doesn't provide protection against you forgetting to pay AWS, you losing your credentials, your account getting hacked, or your account getting locked by some overzealous automation.
> And it's overwhelmingly likely S3 will still exist in an easily readable form in 30 years time.
There is no indication that this statement holds true. Not even remotely.
Businesses fold all the time. How many services still exist today that existed 30 years ago? Not in some archive, but still operational?
In addition to that problem, tech half-life continues to decrease. 30 years in the future is likely more comparable to 60 years in the past. Hello punch-cards.
> There is no indication that this statement holds true. Not even remotely.
Well, over 30 years I'd bet on S3 over blu-ray or magnetic tape at the very least :)
For one thing, S3 itself is already 16 years old - and Ceph, B2, GCS and Azure all offer extremely similar products, indicating there's solid demand for this kind of service.
Second, it's not clear to me that 'tech half-life continues to decrease' - granted, there's huge churn in javascript web frameworks, but for PCs and laptops? Very little has changed in the last 5 years.
And thirdly, some technologies stay around for absolutely ages. Right now, you can buy a brand new data projector with a 15-pin analog VGA port - and a motherboard with a VGA output. You can get a motherboard for a 64-core ThreadRipper processor... which has a PS/2 connector.
To add to this, the core of the S3 argument (specifically with AWS and Azure) rests on the premise that AWS/Azure (AA from now on) are too big to fail or to go anywhere, but that says nothing about whether the specific services will continue as-is. They're probably fine for storage, but like any service you need to keep an eye on them, no matter what they tell you now. It's impossible to predict what changes they may introduce even next year (or next month, for that matter) that make AA S3 storage no longer feasible or usable.
Furthermore, keep in mind that once your data is in AA, it's no longer really your data; it's AA's. Sure, you can pay to retrieve it, but that's the catch -- you gotta pay the toll. Unexpected bills, hidden fees, or changes to fees may make retrieval simply not fiscally possible if your account goes delinquent. (I've seen this with _many_ clients; they tried to squeeze every coin out of the budget, had no flexibility with AA, and ended up with a delinquent account and lost access: https://aws.amazon.com/premiumsupport/knowledge-center/react...)
Now, of course this is considered "your responsibility", and that can be true of anything, but if the data size is that low, I think manually managing it is probably a safer and perhaps cheaper bet. S3 as a service is mostly fine, but a lot of people underestimate its expense and never bother to test recovery scenarios, and it ends up being a real surprise. (At least a few customers let their AWS bill go delinquent until the next fiscal year let them pay, only to find their data deleted under the policy mentioned above.)
Basically, the idea of "set and forget" backups is a pipe dream; if the data is important, you need to maintain it, and that can basically become a second job.
>30 years in the future is likely more comparable to 60 years in the past. Hello punch-cards.
This part is a pro for S3, not a con. In this analogy the OP is trying to store the information held by the punch-card, not the punch-card itself. So handing the information over to a business means they will preserve your binary data by moving it from punch-cards to HDD to SSD, etc - they handle the hardware changes and redundancy.
For S3 specifically you want to use Glacier. It's made for long term storage and is very, very cheap to store in.
Be warned though that restoration takes special procedures, time, and can be expensive. So Glacier is most definitely a place for storing stuff you hope you'll never need, not just a cheap file repository.
The Glacier fees for retrieving data in minutes are incredibly awful, so take that into account. Count on waiting 12 hours to get your stuff for cheap.
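For reference, restoring from Glacier or Deep Archive is a two-step process: you file a restore request, wait hours for the object to be thawed, then download the temporary copy. A rough boto3 sketch, with placeholder bucket and key names:

```python
import boto3

s3 = boto3.client("s3")

# Step 1: ask S3 to thaw the archived object. "Bulk" is the cheap, slow tier;
# expect many hours before the temporary copy becomes available.
s3.restore_object(
    Bucket="my-archive-bucket",        # placeholder bucket name
    Key="backups/photos-2020.tar",     # placeholder object key
    RestoreRequest={
        "Days": 7,                     # keep the thawed copy around for a week
        "GlacierJobParameters": {"Tier": "Bulk"},
    },
)

# Step 2 (later): check whether the restore has completed, then download as usual.
head = s3.head_object(Bucket="my-archive-bucket", Key="backups/photos-2020.tar")
print(head.get("Restore"))  # e.g. 'ongoing-request="false", expiry-date=...'
```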
Glacier Deep Archive is the cheapest option. The catch is a 180-day minimum storage duration. Last time I checked, the cost for us-east-1 was roughly like this:
At $0.00099/GB/month, it would cost ~$12/year to store 1TB.
Retrieval cost is $0.0025/GB and bandwidth down is $0.09/GB (exorbitant! But you get 100GB/mo free)
So, retrieving 1TB (924GB chargeable) once will run ~$85. I've also excluded their HTTP request pricing, which shouldn't matter much unless you have millions of objects.
For the same amount of data, Backblaze costs ~$60/year to store but only $10 to retrieve (at $0.01/GB).
I suppose an important factor to consider in archival storage is the expected number of retrievals, and whether you can handle the cost.
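As a back-of-the-envelope check on the numbers above, here's the arithmetic using the per-GB figures quoted in this thread, plus an assumed ~$0.005/GB/month for Backblaze storage (which matches the ~$60/year figure); prices change, so treat the constants as a snapshot rather than a reference:

```python
# Rough archive-cost comparison for ~1 TiB, using the per-GB prices quoted above.
TIB_GB = 1024

# S3 Glacier Deep Archive (us-east-1 figures from the comment above)
deep_store_year = 0.00099 * TIB_GB * 12              # ~$12/year per TiB stored
deep_retrieve   = 0.0025 * TIB_GB                    # ~$2.56 retrieval fee
deep_egress     = 0.09 * max(TIB_GB - 100, 0)        # ~$83 bandwidth after 100 GB free
print(f"Deep Archive: store ${deep_store_year:.2f}/yr, "
      f"one full retrieval ${deep_retrieve + deep_egress:.2f}")

# Backblaze B2 (storage price assumed; retrieval price from the comment above)
b2_store_year = 0.005 * TIB_GB * 12                  # ~$60/year per TiB stored
b2_retrieve   = 0.01 * TIB_GB                        # ~$10 per full retrieval
print(f"Backblaze B2: store ${b2_store_year:.2f}/yr, "
      f"one full retrieval ${b2_retrieve:.2f}")
```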
Sounds like points 1 and 2 can be elegantly combined using "next-gen" filesystems like ZFS or btrfs. The hashing and scrubbing (automatic repair) happen in the background, and the swap to new, fresh media is automatic through replacing failing hard drives. Plus, the two are open and widely adopted standards.
I always thought a, say, ZFS pool with 2-disk redundancy is not only redundant (RAID) but also serves as a backup (through snapshots). The 3-2-1 rule is good, but I feel like ZFS is powerful enough to change that. A pool with scrubbing, some hardware redundancy, and snapshots could/should no longer require two backups, just a single, offsite one.
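For illustration, the "happens in the background" part can be as small as a scheduled job that scrubs the pool and takes a dated snapshot; a rough Python sketch, with placeholder pool and dataset names:

```python
# Rough sketch of a periodic ZFS maintenance job: start a scrub and take a
# dated snapshot. Pool/dataset names ("tank", "tank/archive") are placeholders.
import subprocess
from datetime import date

POOL = "tank"
DATASET = "tank/archive"

# Kick off a scrub; ZFS verifies every block against its checksum and repairs
# from redundancy where possible. Progress is visible in `zpool status`.
subprocess.run(["zpool", "scrub", POOL], check=True)

# Take a snapshot named after today's date, e.g. tank/archive@2024-01-01.
snapshot = f"{DATASET}@{date.today().isoformat()}"
subprocess.run(["zfs", "snapshot", snapshot], check=True)

# Replicating snapshots off-site (the usual `zfs send | zfs recv` pattern)
# is what turns this into the single offsite backup mentioned above.
```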
If it's code, as the OP seems to indicate, publish it on GitHub; many services mirror source code from GitHub, and GitHub itself once put all public code into cold storage for posterity: https://archiveprogram.github.com/arctic-vault/
> Each was packaged as a single TAR file.
> For greater data density and integrity, most data was stored QR-encoded, and compressed.
> A human-readable index and guide found on every reel explains how to recover the data
> The 02/02/2020 snapshot, consisting of 21TB of data, was archived to 186 reels of film by our archive partners Piql and then transported to the Arctic Code Vault, where it resides today.
And then pay somebody to guard it. Aka the "pay someone else to do it" option that your sibling comment talks about (and of which there are many different flavors, S3 being another one).