The Backing Up of the Internet Archive Continues (textfiles.com)
175 points by bane on May 26, 2015 | 65 comments



Jason Scott here. Just wanted to address the questions that always come up when this project gets some attention. (Also: Come volunteer to be a client! The more the merrier.)

* We are only backing up public-facing data. (Roughly 12 PB)
* We are only backing up curated sets of data. (So less than that.)
* We are stepping carefully to learn more about the whole process as we go, documenting, etc.
* The hope is this will produce some real-world lessons and code that other sites can use.
* This project uses non-Internet Archive infrastructure, and is not an Internet Archive project.

It's going well, and the more people who join up, the better. Oh, and support the Internet Archive with a donation - it's a meaningful non-profit making a real difference in the world. http://archive.org/donate


Thanks for doing this. Hopefully, this will help reduce the problem of the Web of Alexandria that Bret Victor talked about:

60% of my fav links from 10 yrs ago are 404. I wonder if Library of Congress expects 60% of their collection to go up in smoke every decade.

---

For someone who's thinking about a library in every desk, going on the web today might feel like visiting the Library of Alexandria. Things didn't work out so well with the Library of Alexandria.

It's interesting that life itself chose Bush's approach. Every cell of every organism has a full copy of the genome. That works pretty well -- DNA gets damaged, cells die, organisms die, the genome lives on. It's been working pretty well for about 4 billion years.

We, as a species, are currently putting together a universal repository of knowledge and ideas, unprecedented in scope and scale. Which information-handling technology should we model it on? The one that's worked for 4 billion years and is responsible for our existence? Or the one that's led to the greatest intellectual tragedies in history?

https://twitter.com/worrydream/status/478087637031325697

http://worrydream.com/TheWebOfAlexandria/


Every cell has a full copy of the current operating plan, not the entire history of all preceding operating plans. Storing the entire commit history of our DNA would be much more space intensive.


How much more?


Well, poking around the web, it looks like the average bacterial time-between-generations is ~0.1-10 hr, and there are ~0.1-100 mutations/generation. So, if life began 4 billion years ago, I get between 10^11 and 10^16 mutations that need tracking.

The article states that there are 10^9 bases in the human genome.


While that's a pretty significant range, it seems that we could store the entire commit log in the same amount of DNA that 100 cells normally store, or 10,000,000 cells on the high end. Which is still only a few milligrams of cells. Impractical, of course, but interesting.
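The estimate above can be rechecked with a few lines of Python. This is only a rough sketch: it takes the generation times and mutation rates quoted in the thread at face value, follows a single lineage, and assumes (as the thread implicitly does) that logging one mutation costs about one base of storage. The function and constant names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365.25
YEARS_OF_LIFE = 4e9      # "4 billion years", as quoted in the thread
GENOME_BASES = 1e9       # genome size quoted in the thread

def mutation_range(gen_hours=(0.1, 10.0), mutations_per_gen=(0.1, 100.0)):
    """Return (low, high) total mutations along one lineage."""
    total_hours = YEARS_OF_LIFE * HOURS_PER_YEAR
    # Slow generations and few mutations give the low bound,
    # fast generations and many mutations give the high bound.
    fewest_generations = total_hours / max(gen_hours)
    most_generations = total_hours / min(gen_hours)
    return (fewest_generations * min(mutations_per_gen),
            most_generations * max(mutations_per_gen))

low, high = mutation_range()
print(f"mutations to track: {low:.1e} to {high:.1e}")
# Assumption: one logged mutation takes roughly one base of storage.
print(f"genome-sized copies needed: {low / GENOME_BASES:.0f} to {high / GENOME_BASES:.0f}")
```

The bounds land in the same orders of magnitude as the 10^11 and 10^16 figures given above.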


> It's interesting that life itself chose Bush's approach. Every cell of every organism has a full copy of the genome.

Ehhh... every cell has a full copy, but that's more of a coincidence than anything. They're not capable of using their full copy. And only the germ cells make any contribution to the genome of the next generation.


Are you guys concerned with decay at all? How do you know the file stored 2 years ago is still the same file? What happens if it's become corrupted?


Currently, the system requires you to check in (using the git-annex facility) on a regular basis. If you don't check in (it's a script run by cron) within 2 weeks, you're considered decayed, and after 30 days, your contributions fall back into the pool and are taken in elsewhere.
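The check-in rule described above could be modelled roughly as follows. This is a hypothetical sketch, not the project's actual code: the function name and status labels are invented, and only the two thresholds (2 weeks and 30 days) come from the comment.

```python
from datetime import datetime, timedelta

DECAYED_AFTER = timedelta(weeks=2)   # threshold from the comment above
EXPIRED_AFTER = timedelta(days=30)   # threshold from the comment above

def client_status(last_checkin: datetime, now: datetime) -> str:
    """Classify a client by how long ago it last checked in."""
    age = now - last_checkin
    if age > EXPIRED_AFTER:
        # Contributions would fall back into the pool for other clients.
        return "expired"
    if age > DECAYED_AFTER:
        return "decayed"
    return "active"

now = datetime(2015, 5, 26)
for days in (3, 20, 45):
    print(days, client_status(now - timedelta(days=days), now))
```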


I'm talking more about the checksums of files over time, and error correction should they deviate.

I recently read about Facebook's cold storage and it got me thinking about it (https://code.facebook.com/posts/1433093613662262/-under-the-...), though they just want one copy total.


git-annex is currently the engine behind this project, and a lot of bugfixes, feature additions and other work have happened as a result.

https://git-annex.branchable.com/


I would be interested to hear about protections against active misinformation attacks, where someone is attempting to change archived content maliciously.


The checksums we have from the IA are md5sums, which are not ideal, so a collision attack is possible, but AFAICS you could only use it if you're generating the original file that is stored in the IA, and are planning to replace that with a colliding version in the future. Otherwise, we can detect falsified files.

There are some potential attacks that put false information into the git-annex repositories that we use for tracking which clients are storing which files. We'll eventually need post-receive hooks to validate that pushes only change information about the client making the push. For now, if someone attempts this attack, we can revert their malicious changes after the fact.
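As an illustration of the checksum check discussed above, here is a minimal Python sketch that compares a file's contents against a recorded md5sum and computes a SHA-256 digest in the same pass. It is not the project's or git-annex's code; the function names are hypothetical, and it uses only the standard `hashlib` module.

```python
import hashlib
import io

def stream_digests(stream, chunk_size=1 << 20):
    """Hash a binary stream in one pass, returning (md5_hex, sha256_hex)."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        md5.update(chunk)
        sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()

def matches_recorded_md5(stream, recorded_md5):
    """True if the stream's MD5 equals the checksum recorded earlier."""
    actual_md5, _ = stream_digests(stream)
    return actual_md5 == recorded_md5.strip().lower()

# Demo with in-memory data standing in for an archived file.
sample = b"example item contents"
recorded = hashlib.md5(sample).hexdigest()
print(matches_recorded_md5(io.BytesIO(sample), recorded))         # unchanged copy
print(matches_recorded_md5(io.BytesIO(sample + b"!"), recorded))  # altered copy
```

For a file on disk, the same functions can be called with `open(path, "rb")`. A check like this catches accidental corruption; it does not address the MD5 weaknesses raised in the replies.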


> ...if you're generating the original file...

Maybe for the time being, with published attacks.

But MD5's strength against attacks has been sufficiently dubious since early 1996 – before the founding of the Internet Archive – that experts have recommended against using it in new applications.

By 2009, the CMU Software Engineering Institute (responsible for CERT security notices) wrote [1]:

> Software developers, Certification Authorities, website owners, and users should avoid using the MD5 algorithm in any capacity. As previous research has demonstrated, it should be considered cryptographically broken and unsuitable for further use.

So they're worse than 'not ideal'. They're dangerously obsolete for the purpose of content-authenticity, and any reliance upon them (under hand-wavy conditions) serves to block the migration to available hashes that would actually work.

If you're making the backup to be sent through a wormhole to 1999, use MD5. If you're creating the backup for trust and availability through to end-of-year 2015, or 2025, or 2115, ditch MD5 ASAP.

[1] http://www.kb.cert.org/vuls/id/836068


This isn't a lot of data in the general scheme of things.

Why doesn't AWS or Google offer to pop it in glacier/S3 or GCS for free for the PR? They both have a huge multiple of that much unused disk - it would cost them effectively nothing beyond inbound bandwidth.


It kind of is a lot of data... trying to archive internet websites whole and keeping multiple snapshots is pretty close to what Google and other search engines do, only search engines need to generate distributed indexes as well as some information regarding versioning.

Not to mention that, historically speaking, you cannot trust Google to keep this information preserved, or public. Look at what happened to any number of other tools Google once offered. TBH, I would like to see funding via a grant from the Library of Congress towards archive.org.

I'm curious as to how many copies of a piece of data are needed for it to be "safe" in an environment as flexible and unknown as end-user/volunteer storage. It's one thing for compute items that can be re-queued for work in a day or two if abandoned... it's another where every copy of a record happens to walk away. Let alone the communications protocol... this goes way beyond most Bigtable implementations.
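One way to get a feel for the "how many copies" question is a toy model. The sketch below assumes each copy disappears independently with the same probability over some period, which is a strong simplification for volunteer storage (losses can be correlated, and lost copies can be re-replicated). The probabilities are placeholders, not measurements from the project.

```python
def probability_all_copies_lost(per_copy_loss: float, copies: int) -> float:
    """Chance that every copy is gone, assuming independent losses."""
    return per_copy_loss ** copies

def copies_needed(per_copy_loss: float, target: float) -> int:
    """Smallest number of copies whose joint loss chance is <= target."""
    if not 0 < per_copy_loss < 1 or target <= 0:
        raise ValueError("need 0 < per_copy_loss < 1 and target > 0")
    copies = 1
    while probability_all_copies_lost(per_copy_loss, copies) > target:
        copies += 1
    return copies

# Placeholder numbers: half of all volunteers walk away in a given period.
print(probability_all_copies_lost(0.5, 3))
print(copies_needed(0.5, 0.1))
```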


On a purely numerical comparison, the article mentions 27TB which really isn't much in terms of size, especially compared to what some of the companies using AWS produce daily.

EDIT: according to the comments below, it looks like it's about 20+ petabytes, which is actually a fairly large amount.


The article mentions 27TB so far at a point where they appear to still be focused on making their tools better. 27TB is a tiny proportion of IA.


Cite: https://en.wikipedia.org/wiki/Internet_Archive

> As of October 2012, its collection topped 10 petabytes.

I'd be curious what it's grown to since, but Google didn't immediately tell me.



Not for Google or S3.


Seeing as how the out of the door price for 12PB of data hosted on RRD S3 storage @ 99.999999999% durability is roughly $300k/month, I'd hardly call it trivial.


A few things:

1) They have it sitting unused presently. The nominal cost of providing it would be zero.

2) RRS is fine. So would Glacier be. This reduces costs further.

3) There is significant PR benefit to such a move.


I'm not sure how you've figured they have 12PB of disk (or possibly more like 36PB, due to replication for 11 9s of durability) just lying around. The whole meme that EC2 came about due to it just being extra capacity is incorrect. Same goes for all of the other services. Running a business whose MO is to keep lowering costs and make a profit on razor-thin margins doesn't lend itself to lots of unused infrastructure. They aren't going to sink costs into infrastructure without realizing a return as soon as possible.

Server amortization costs, future cost calculation planning, depreciation costs, power consumption, etc are all closely calculated and factored into budgets. Just thinking they can support that much data for free and at no cost, or minimal cost because it feels good is naive.


Google, and Amazon could very well run within the archive.org network once it is up... and could very well offload a significant amount of data. The nominal cost is anything but near zero... just on the wear of hard drives alone, it will be costly.

My point was you can't rely on them to keep said data available. Not that they couldn't participate.. but I wouldn't trust anything less than a "Lifetime" (of the company) or a 25 year minimum commitment as anything but transient.


> Not to mention, that historically speaking, you cannot trust google to keep this information preserved, or public. Look at what happened to any number of other tools google once offered.

Google is hardly going to abandon cloud storage anytime soon. Google's willingness to release and kill small projects has no rational impact on the reliability of their core services.

Historically speaking, Google helped preserve Usenet archives via Google Groups and digitized library archives via Google Books. Both projects have seen their share of ups and downs, but they're still publicly available.


It's odd that you'd use the Deja News archive. The way Google handled that was pretty terrible.

Arguably, moving from the Google Groups search page to the general search page with the site:groups.google.com term is better, but now it's really hard to find stuff and really hard to search by different parameters.


Yes, and Google Books was hampered by backlash from authors. That's why I noted both projects had a history of ups and downs. However, both are still going.

The point is that, if we're talking about Google's history, they have shown an interest in preserving data. And in regards to the Internet Archive, they'd only be serving as a backup storage provider, not in a frontend role.

This is not to say that Google (or Amazon or Microsoft) should be the only backup provider for the Internet Archive, for example, but it would hardly be a bad thing, as tracker1 suggested, if they cared to donate the resources.


> Google is hardly going to abandon cloud storage anytime soon.

I wonder what "soon" means in this context. If I wanted to archive anything I would think in the timescale of decades, and that's only for my personal photos and videos.


It'd be great if they would, as an extra copy! However, reliance on charity from any single commercial entity would be somewhat against the spirit of the project. Companies (and their PR budgets) come and go.

Archive.org has about 22 petabytes of archived material – and growing. Even at AWS Glacier's cold-storage rate of $0.01/GiB per month, that's $220,000/month, $2.64MM a year.
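The cost figures above can be reproduced with integer arithmetic. Note that the quoted totals correspond to decimal units (1 PB = 1,000,000 GB); the sketch below makes that assumption explicit. The 22 PB size and the one-cent rate are the numbers from the comment, not current pricing, and the function name is made up.

```python
GB_PER_PB = 1_000_000  # decimal units, which is what the quoted totals imply

def monthly_cost_dollars(petabytes: int, cents_per_gb_month: int) -> int:
    """Monthly storage cost in whole dollars, computed in integer cents."""
    total_cents = petabytes * GB_PER_PB * cents_per_gb_month
    return total_cents // 100

monthly = monthly_cost_dollars(22, 1)   # 22 PB at $0.01 per GB per month
print(f"${monthly:,} per month, ${monthly * 12:,} per year")
```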


yeah, definitely not a trivial amount.


There's a huge amount of copyrighted material on the Internet Archive. They can get away with it, being a library and all, but I don't think anyone like Google or Amazon would want this liability.

Plus it works the other way around too. I don't think the Internet Archive would want the Amazon liability. Amazon might decide to delete all or some of the data someday for whatever reason.


Yeah, I'm sure Google holding copies of vast archives of other people's published copyrighted data is something they want to avoid.

Oh, wait.


They could just act as a dumb host behind the scenes. The hosting provider is not liable for its customers.


That would seem to lose at least some of the benefits of creating tools and getting people involved to create properly distributed backup not just in terms of physically distributing the copies but making it independent of any single entity.


It wouldn't be bad to have another option, though. The volunteer effort currently covers only a fraction of the total Internet Archive. Better a single backup than no backup at all.


I wonder if anyone else also thought of Backblaze's "unlimited" storage for $5/month...


One of the comments there brought up an interesting point:

Is this pure archival, meaning a onetime download and no uploads?

Imagine a distributed, peer-to-peer style Internet Archive. That would be awesome. It reminds me of a decade ago, when the rapid rise of P2P file sharing (mostly torrents) made it possible to find literally anything that existed and someone felt neighbourly enough to share. Multimedia, software, books, anything that was present in digital form. It was probably the closest we've ever been to a "global library", but too bad antipiracy groups/commercial interests/security focus mostly killed it off and replaced it with locked-down content silos...


Since it uses git-annex, it actually can fetch files in a P2P fashion, using Bittorrent: http://git-annex.branchable.com/special_remotes/bittorrent/


One thing to note, IA distributes many of their items via Torrent (as a download option).

I've kind of found it interesting TPB doesn't mirror all of them.


I had a crack at running this, and it was quite interesting. I'm certainly up for having some of my spare space for such a worthy cause.

I stopped, however, mostly due to the speed. I have an 80Mb connection but couldn't pull more than ~0.5-1Mb down. At that rate, even filling 500GB would take about two months. Perhaps this was due to the size of the files (downloading music).
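The "about two months" estimate above is easy to sanity-check. This sketch assumes the quoted speeds are megabits per second and uses decimal units (1 GB = 10^9 bytes); the function name is hypothetical.

```python
def days_to_download(gigabytes: float, megabits_per_second: float) -> float:
    """Days needed to fetch `gigabytes` at a sustained line rate."""
    bits = gigabytes * 1e9 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 86_400

for rate in (0.5, 1.0, 80.0):
    print(f"500 GB at {rate} Mb/s: {days_to_download(500, rate):.1f} days")
```

At the observed 0.5 to 1 Mb/s this works out to somewhere between one and a half and three months, while the full 80 Mb/s line would need well under a day.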

For anyone looking to try it out, please do, and see what you get. One thing to be aware of is it asked for the amount of space to leave free not how much to use, and got the size of my disk wrong by about a factor of 10 so I had to be a bit cautious about what I selected.


> One thing to be aware of is it asked for the amount of space to leave free not how much to use, and got the size of my disk wrong by about a factor of 10 so I had to be a bit cautious about what I selected.

Well, that's a deal-breaker for me.

I can easily dedicate, say, a terabyte or two to the project, but I can't do "Free space minus this number". I need predictability.


Perhaps I worded that poorly, it's asking for the amount of space to not use. As in, on a 3TB drive, leave me 2TB for my own purposes and it'll then use a max of 1TB. It's not based on how utilised the drive currently is.

My problem was it thought the drive was 16TB when in reality it was 2TB. That's a minor usability problem though, one I've been meaning to go and submit a PR for.

Edit - As I understand it anyway.


I'd be interested in that bug report!


Isn't it more predictable to know that you'll have at least (say) 500MB free, instead of gradually filling up your drive until one day you don't have any more space?

Edit: to be clear, it will actually remove backup data from the drive over time to maintain the amount of free space you set. It updates the central server to tell it which files you still have.
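The "keep at least this much free" behaviour described above can be pictured with a short sketch. This is not the client's actual logic, just an illustration of the reading in this comment: the space available to the backup is whatever is free beyond the reserve, and it shrinks to zero as the owner's own data grows. The function names are made up; `shutil.disk_usage` is the standard-library call for free space.

```python
import shutil

TB = 10**12

def archive_budget_bytes(free_bytes: int, reserve_bytes: int) -> int:
    """Bytes the backup may still consume while leaving the reserve free."""
    return max(0, free_bytes - reserve_bytes)

def current_budget(path: str, reserve_bytes: int) -> int:
    """Same calculation, using the real free space of the disk at `path`."""
    return archive_budget_bytes(shutil.disk_usage(path).free, reserve_bytes)

print(archive_budget_bytes(3 * TB, 2 * TB))   # empty 3 TB drive, 2 TB reserve
print(archive_budget_bytes(1 * TB, 2 * TB))   # already below the reserve
print(current_budget(".", 500 * 10**6))       # this machine, 500 MB reserve
```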


I have a 24TB disk array, configured as RAIDZ2; 12TB of usable space. When I run out of space, I add more disks. (Or, more usually, replace old ones.) I use Prometheus to keep track of its state, and have an email alert set up when disk space drops below a certain point.

The problem with a program that tries to maintain a certain amount of free space is that I won't know how much space there really is. I'd need to subtract the amount the archive uses, then.. add back in the amount I actually want it to use.. and it all ends up as a bit of a mess.

It'd be much easier for me if I could just set it to use 2TB, and be done with it. I could probably rely on it actually using the space.


I don't know if git-annex does that. Could you make a subvolume of the right size? Or create a new user and set its disk quota to the size you want?


You will probably be able to increase the bandwidth by running concurrent downloads.

echo "-J4" > IA.BAK/ANNEXGETOPTS


Perhaps approaching this with LOCKSS (Lots Of Copies Keep Stuff Safe)[1] or a LOCKSS-like system would be a good idea.

[1] http://www.lockss.org


"We’ve intentionally and unintentionally punched clients in the gut"

Oh come now, you can't drop that in a blog post without elaboration. :) Share your pain!


The Internet Archive will still archive your site even if you use their recommended robots.txt rules to keep them from archiving it. I had several domains where I set up the robots.txt as one of the first things on my sites. While I owned the sites and ran them with the robots.txt, archive.org would not show any archive of them. However, once the sites went down, changed domains, etc., the archives of them appeared on archive.org. So all the robots.txt does is hide the archive for as long as it stays in place, not forever. You'd have to own the domain indefinitely to keep it from being archived by them.


What exactly is the objective here? Just making sure there are multiple copies of the Internet Archive?

Nifty stuff in any case


There are multiple objectives here: education, research, practice, and awareness. As time goes on, it's obvious that distributed sharing of data is the only real vaccine against the inherent problems of the modern Internet, and learning more about the different ways this might be done benefits all.

Plus it's nice to have some of this data scattered about the world.


In the sphere of research & practice, are there any long-term plans for distributed processing applied to the data, akin to the Common Crawl in AWS allowing one to run map-reduce jobs against it?

There is a long history of sandboxing the JVM, which means in theory it should be safe to bring the code to the data without running the risk of having your local machine p0wned.


I'm curious about how archive.org feel about this. Presumably they are paying for the outgoing bandwidth.


With respect to pure speed, at some point it would be cheaper/faster to show up at the facility with a trunkload of drives and plug right in. Superficially, this is a game for the locals, not someone like me. Then again, looking at how little it costs to ship me a couple terabytes of drives, maybe this isn't as crazy as it sounds.

I can get a cheap empty prosumer NAS box for a couple hundred bucks and a couple TB of drives for a couple hundred more. If I gave someone (presumably the Internet Archive) a kilobuck and told them to keep the change (which would be a small but nice revenue stream for them), they could assemble a couple-TB NAS and ship it to me, and I could plug it in and give it a public IPv6 address. Aside from the kilobuck, it would cost me a couple hundred bucks a year in electricity and presumably bandwidth, but at least it wouldn't have any labor cost. This might be going a bit far for individual citizens, but you could probably guilt-trip corporations into hosting your custom-made NAS boxes for you. Just an idea to think about.

Another intriguing financial model might be to partner with public libraries and schools and double charge private companies in order to give public libraries and schools free backup NAS.

The world of non-profits opens up strange new ideas.


This is being done by Jason Scott and the rest of the team that runs archive.org. Presumably they're ok with using their own bandwidth!


It's being overseen by Jason Scott (me) but I am not utilizing other employees of Internet Archive - they're quite busy enough as it is.


Don't forget that the purpose of the Internet Archive is primarily to be an open archive. In that sense, I'd imagine that bandwidth use is not really a factor, and anything that contributes to the longevity of the archive is welcome.

"It's costing us bandwidth" is only a relevant factor if your goal is to turn a profit, and the bandwidth use isn't contributing to that. An archive doesn't have the same considerations.


I'm very aware that IA is an open, non-profit archive and I'm sure that they don't mind that other copies of the archive exist. My question was about the cost to IA of transferring 12 petabytes (according to Jason Scott [1]). Outbound bandwidth ultimately costs money. IA is a non-profit and "It's costing us bandwidth" is very relevant.

For instance, here [2] are the charges for outgoing bandwidth on Rackspace cloud servers. The lowest quoted rate is £0.06/GB for up to about half a petabyte. IA won't be paying that much as they have their own infrastructure. But even if bandwidth only costs the IA $0.01/GB, 12PB would cost around $125k.

12PB is a lot of data. I don't know what IA pay for outbound bandwidth but it won't be zero. That's all I was asking about.
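For reference, the one-cent-per-GB figure works out as follows. The sketch shows both decimal and binary readings of "12PB", since the result shifts by a few thousand dollars depending on which is meant. The $0.01/GB rate is the hypothetical one from the comment, not a known IA cost.

```python
def transfer_cost_dollars(gigabytes: int, cents_per_gb: int) -> float:
    """One-off cost of sending `gigabytes` out at a flat per-GB rate."""
    return gigabytes * cents_per_gb / 100

DECIMAL_GB = 12 * 1000**2   # 12 PB counted as 12,000,000 GB
BINARY_GB = 12 * 1024**2    # 12 PiB counted as 12,582,912 GiB

print(f"decimal: ${transfer_cost_dollars(DECIMAL_GB, 1):,.0f}")
print(f"binary:  ${transfer_cost_dollars(BINARY_GB, 1):,.0f}")
```

The binary reading is the one that matches the "around $125k" estimate.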

[1] https://news.ycombinator.com/item?id=9602868

[2] http://www.rackspace.com/cloud/public-pricing#bandwidth


Keep in mind, we're trying to make 3 copies of all this data - so multiply that bandwidth by 3.


The link into 300 Funston Ave is 10Gbit. There is no shortage of bandwidth.


Your intel is old, edward ;) they're past 40Gbit now.


Even saturating a 40Gbit link at 5GB/s would require nearly 50 days to transfer 20PB.
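A quick check of the link-saturation figure, using decimal units and assuming the link carries nothing else. The function name is made up; the 20 PB, 40 Gbit and 10 Gbit numbers come from the thread.

```python
def transfer_days(petabytes: float, gigabits_per_second: float) -> float:
    """Days to move `petabytes` over a fully saturated link."""
    total_bytes = petabytes * 1e15
    bytes_per_second = gigabits_per_second * 1e9 / 8
    return total_bytes / bytes_per_second / 86_400

print(f"20 PB over 40 Gbit/s: {transfer_days(20, 40):.1f} days")
print(f"20 PB over 10 Gbit/s: {transfer_days(20, 10):.1f} days")
```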

Realistically, I'd estimate mirroring the IA to take years... providing you can keep up with its own growth.





