How to crawl a quarter billion webpages in 40 hours (2012)

secondtimeuse · on Jan 8, 2016

This was written in 2012, Its even easier these days by using SQS and Cloud Formation. 250 Million is a small number you are better of first going through Common Crawl and then use data from crawls to build a better seed list.

Common Crawl now contains repeated crawls conducted every few months and also urls donated by blekko.

https://groups.google.com/forum/m/#!msg/common-crawl/zexccXg...

frik · on Jan 8, 2016

IBM bought "Blekko" (semantic search engine) a few months ago to build a knowledge base (for Watson AI) and took them offline. (similar to Microsoft that bought "Powerset" for Bing and Cortana AI a few years ago)

ClayM · on Jan 8, 2016

Do you have any recommendations on crawling / detecting changes on a large set of data? For example, the RSS feeds for every podcast on the planet?

drakenot · on Jan 8, 2016

This has been my personal project for the past few months. There are around 240k podcasts in the iTunes index and it is fairly trivial to scrape their feed urls.

Most of the podcast apps that have feed crawler backends (Pocket Casts, Overcast, etc) poll all 240k podcast feeds fairly frequently. More popular podcasts are polled on the order of every 2-3 minutes while less popular podcasts may only get polled every 10-15 minutes. This comes out to around 1.5 - 2 billion web requests per month.

It is important when you are making your feed requests that you set your last-modified and etag headers. These will speed up your requests significantly by having the servers send you a Not Modified (304) response if nothing has changed since your last poll. Something like 60% of the feeds support this.

You'll also want to keep a hash of the feed content. That way, when you get back a 200 response with the feed contents you can do a quick check to see if the feed content has changed since your last poll (for those servers that don't support etag). This will even further reduce the number of feeds you need to actually parse.

For those that returned a 200, and had a different hash, you now need to parse the feeds. There are a large number of podcasts which insert dynamic data into their feed. Some insert dynamic tracking query items into feed items. Or they make the some of the RSS feed dates the current time stamp (which is incorrect). These feeds with dynamic data will have to be fully parsed every time, which is a bummer. I've considered a future enhancement to my crawler that detects the feeds that do this and flip a bozo bit on them so I poll them less frequently.

The majority of podcast feeds are RSS 2.0. I'd have to check, but I think < 2% of podcasts in my database used Atom feeds. This was something that surprised me when I started the project. I spent a lot of time worrying about Atom feeds, or older RSS feeds but you could almost ignore them entirely and still capture most of the podcasts.

Parsing these feeds robustly is a whole topic unto itself. Many RSS/XML parsers are very strict. However, for this use case you don't want strictness. You want to extract the info out of the maximum number of feeds possible, even if some of them are malformed in some way. Perhaps the user didn't properly specify an XML namespace they are using. Or they are missing a closing tag for an element, etc.

Because the RSS spec doesn't require a GUID for items in the feed, you have to come up with your own algorithm for matching items with your new feed response. Many articles will tell you to use GUID if available, and if not, use Link. Or some combination of the above. However, for podcasts, you can almost always be assured that a podcast will have a url to the media file. So, I suggest using that as part of your matching algorithm in the absence of a GUID.

I plan on writing a more detailed article on this project as I get closer to finishing my crawler and submitting it to HN. As a further constraint, I'm attempting to get the monthly hosting costs for my distributed crawler to around $100/mo and it be capable of updating every podcast feed every 5 minutes.

mynewtb · on Jan 8, 2016

Please please please donate your invaluable collection to archive.org!

mei0Iesh · on Jan 8, 2016

Common Crawl contains the HTML?

I wonder how this is legal and considered acceptable. I wish I knew how even Google and others get away with scraping content, saving it, and utilizing it for profit without sharing any revenue with the original webmasters.

I know people can opt out of crawling, for those that actually respect that. But still, am I the only one who feels like this is wrong?

I guess I have this view that your domain is yours, and you invite the public in like an open house. It's my house, my property, and the door is open, where people can come in and look around at my stuff. But the expectation is that only locals will arrive, in small number, and they'll be good guests. If someone is breaking the lock on the bedroom door and going through the private drawers, that's wrong. If someone is taking photographs of everything, to then create a virtual tour of my house they charge for, that's wrong. The expectation is you're being nice by providing free and open access to information you created and own, and people should behave courteous to that.

Then if you as the webmaster choose, you can provide an API, or database dumps for people to download, along with the licensing terms. That is when it feels right for people to do things like this with the data, because you intentionally provided it through a non-personal interface.

To me the web is still a personal interface. I expect humans to use it, in an ordinary human-like way where it is somewhat ephemeral and courteous. I feel like Google cheated their way to success, and Common Crawl is stealing to rise their position in an unfair similar manner.

These all seem like parasites to me. They didn't create anything, they just steal it en masse.

There's so many businesses like that, such as Domain Tools that gets rich by hoarding everyone's contact details from WHOIS: http://whois.domaintools.com/commoncrawl.org

They have a screenshot history they won't ever delete even if you ask nicely. Here is a picture of Common Crawl from 2011: http://thumbnails.domaintools.com/domaintools/2016-01-08T19:...

jahewson · on Jan 8, 2016

> I wonder how this is legal and considered acceptable. I wish I knew how Google and others gets away with scraping content [...]

Well, here's the answer: "transformative" reuse of content is explicitly permitted under copyright law. Simply reproducing the content and charging for it would not fall under this provision, but building an archive of publicly available information is - quite appropriately, permissible.

There was recently a very large court case regarding this principle and its application to Google Books. Google won, by demonstrating that their search index is not equivalent to and does not affect the market for the original work - a "transformative" use.

Sharing is good. Publicly available works achieve their aims only by being consumed by others - anyone who publishes a work free of charge should expect it to be, and remain, publicly accessible.

gabriele · on Jan 8, 2016

I believe the Internet Archive is much less clear in this regard than public search engines. IA doesn't even have a clear takedown policy and no webmaster tools in place to give owners control on the archived content. Their crowler does obey to robots.txt rules but if you want content to be removed permanently you have to ask politely by email and in my experience they simply block the site urls from being searched but they don't make clear at all if content was actually removed from their servers.

pronoiac · on Jan 9, 2016

I think the idea of intentionally deleting content is pretty foreign to the Internet Archive. They're more likely to say "welcome to oblivion!" and set a timer for 70 years to show the content again.

mei0Iesh · on Jan 8, 2016

I don't think "Sharing is good" is true in the real world. If you apply that as a blanket statement, you'll end up in trouble.

What is legal is not always ethical. I think there's an interesting story there about how Google is legal, if someone doesn't automatically assume it should be just because it is.

The text online isn't always similar to a published text of the past. There is a personal overlap today that changes the rules. Such as this text I'm publishing right now. Forgetting about all the legalities and technicalities, I still feel like it is different than a page published in a book. I still feel like I should have the power to edit or delete it whenever I want in the future, yet Hacker News disagrees and removes my right to modify it, forever capturing it as if it owns it, not me. I still feel like this text is more transitory, where its relevance is mostly right now, and if it were deleted in a month it would be fine, because it's mostly just chit chat.

Certainly we could live in a world where everyone has microphones transcribing everything they ever say, which is transmitted to Google, and provided to researchers, where all kinds of uses could emerge. But that's a different world than the one where we've developed rules for today. Right now, I feel like most things I say are in passing, and should not only disappear, but won't spread where someone is capturing and propagating it beyond my control.

What control do I have over my text that is in this Common Crawler database? What if it captured information that was considered to be ephemeral in the website's context, and ripped it out of its home where it's now part of this collective publication, where anyone can use it for anything?

Sharing could be good in a world where people are not selfish and malicious. But in this one, many people will use whatever data they can get their hands on for selfish and malicious purposes, that do not benefit you, the author, in any way. I bet a large percentage of use for that Common Crawler database was harmful to society, such as for helping spammers generate fake content.

effie · on Jan 8, 2016

> "These all seem like parasites to me."

Your impression is wrong. Search engines and other services based on web data provide great value to society. They don't create documents they link to, but they deliver relevant links to people's queries. That's a great service. Without the search engine service, people may not even find the web page. That's why large portion of website owners and webmasters are glad search engine crawlers visit them and even expect indexing to databases to be fast and smooth.

If you publish anything on your web, you're facilitating free use and duplication of it in the whole world. If this was not your intention, but you still published your stuff on your web, you misunderstood the original intent and reality of the Web for sharing information.

There is a widely known standard of communication between robots and web sites called robots.txt standard. It is a file where you can state your intent to restrict crawler downloads. There is also html tag <meta name="robots" content="noindex,nofollow"> that signalizes to crawlers your wish that the page should not appear in search engine results. If you want to prevent people from accessing and using your documents, use these. Both Google and Common Crawl seem to obey them. If you want to _make_sure_ nobody accesses and uses your documents, don't publish them on the Web.

There is no practical way to achieve your documents are accessible only for some limited period you want. If you release them to the world, you always lose control over their distribution and use.

7952 · on Jan 8, 2016

A lot of this kind of thing in the "real world" is managed by social convention. People understand that there is a difference based on context, that can not fully be captured by the law. For example it may be perfectly legal to take photos of strangers at the beach, but we all understand why that is creepy.

The thing is that on the world wide web the social convention is strongly in favour of being able to slurp up data, at least as long as it does not cause technical problems. Mostly people get this and understand it.

Other apps have emerged that follow different social conventions. For example if you share something on SnapChat you are suggesting that the information should be ephemeral. But you can't expect people/crawlers to infer the context without having that strong hint.

eli · on Jan 8, 2016

Perhaps this is helpful? https://www.eff.org/deeplinks/2006/01/google-cache-ruled-fai...

worried_citizen · on Jan 8, 2016

Web crawling is just like most things: 80% of the results for 20% of the work. It's always the last mile that takes the most significant cost and engineering effort.

as your index and scale grow you bump into the really difficult problems:

1. How do you handle so many DNS requests/sec without overloading upstream servers?

2. How do you discover and determine the quality of new links? It's only a matter of time until your crawler hits a treasure trove of spam domains.

3. How do you store, update, and access an index that's exponentially growing?

Just some ideas.

betolink · on Jan 8, 2016

I've been working on something similar and I have ran into some of the issues you mention. As you correctly pointed out, quality and post processing is also relevant to not crawl irrelevant/spam sites, which can be HUGE! The work presented here is cool but it does not address the whole picture. Having a crawler that takes quality and user feedback into account is the hard part. Not to mention if you are being polite with the requests... we need to scale but not ignoring Robots.txt

So crawling a billion of links in X number of hours is not trivial but not that hard specially with cloud infrastructure like AWS, it's just a matter of a good enough implementation and how much money one wants to spend on it.

supername · on Jan 8, 2016

No one ever talks about a particular topic though when it comes to web crawling etc. How do you avoid all the "bad" sites as in, really bad shit? The stuff that your ISP could use as evidence against you when in fact it was just your code running and it happened to come across one of these sorts of sites. How do you deal with all that? That is the only thing stopping me from experimenting around with web crawling.

johnward · on Jan 8, 2016

I used to do some crawling on comcast using what is now IBM Watson Explorer. I got a ton of phone calls from them. It sounded like those calls would go away if I just paid a little bit more for business class service.

pella · on Jan 8, 2016

old HN comments ( 3 years ago ) https://news.ycombinator.com/item?id=4367933

tegansnyder · on Jan 8, 2016

I feel like more companies are building their businesses around web crawling and parsing data. There are lots of players in the eCommerce space that monitor pricing, search relevance, and product integrity. Each one of these companies has to build some sort of a templating system for defining crawl jobs, a definition of parsing rules to extract the data, and monitoring system to alert when the underlining HTML of a site has changed from their predefined rules. I'm interested in these aspects. Building a distributed crawler is easier than ever.

jdrock · on Jan 8, 2016

This isn't particularly difficult anymore. The most interesting challenges in web crawling around turning a diaspora of web content into usable data. E.g., how to get prices from 10 million product listings from 1,000 different e-retailers?

packersville · on Jan 8, 2016

I don't understand his hesitancy in releasing his crawler code. I imagine there are plenty for people to access and alter for malicious use if they desired, so why does releasing his such a big deal?

pbreit · on Jan 8, 2016

Is "quarter billion" used to make it sound like a bigger number? Even "half" is aggressive, imo.

johnnymonster · on Jan 8, 2016

I was going to ask this very question. Seems like the appropriate way would be to say 250 million. I mean you wouldn't say I got a quarter hundred dollars or quarter thousand...