Crawling a quarter billion webpages in 40 hours (2012) (michaelnielsen.org)
208 points by swyx on June 15, 2023 | 63 comments



This is a good overview, but being from 2012 it's missing commentary on one now-common area: using a real browser for crawling/scraping.

You will often now hear recommendations to use a real browser, in headless mode, for crawling. There are two reasons for this:

- SPAs with front-end-only rendering are hard to scrape with a traditional HTTP library.

- Anti-bot/scraping technology fingerprints browsers and looks at request patterns and browser behaviour to try to detect and block bots.

Using a "real browser" is often advised as a way around these issues.

However, from my experience you should avoid headless browser crawling unless it is absolutely necessary. I have found:

- Headless browser scraping is between 10x and 100x more resource intensive, even if you carefully block requests and cache resources.

- Most SPAs now have some level of server side rendering, and often that includes a handy JSON blob in the returned document containing the data you actually want.

- Advanced browser fingerprinting is vanishingly rare. At most I have seen user agent strings checked against the other HTTP headers and their order. If you make your HTTP lib look like a current browser, you are 99.9% of the way there (rough sketch below).
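As a rough illustration of those last two points (not the parent's code; the URL, the header values, and the __NEXT_DATA__ tag are assumptions that vary by site and framework), something like this is often all it takes:

    # Sketch: browser-like headers via plain requests, then pulling the JSON blob
    # that many SSR frameworks embed in the returned document.
    import json
    import requests
    from bs4 import BeautifulSoup

    HEADERS = {
        # Mirror a current browser's headers, in roughly the order it sends them.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/114.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

    resp = requests.get("https://example.com/product/123", headers=HEADERS, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Next.js-style sites ship page data in a <script id="__NEXT_DATA__"> tag;
    # other frameworks embed it differently.
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is not None:
        data = json.loads(tag.string)
        print(data["props"]["pageProps"])  # the data you actually want, as JSON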


> Headless browser scraping is between 10x and 100x more resource intensive, even if you carefully block requests and cache resources.

Instead of setting up some kind of partnership with our vendors, where they just send us information or provide an API, we scrape their websites.

The old version ran in an hour, using one thread on one machine. It downloaded PDFs and extracted the values.

The new version is Selenium based, uses 20 cores, 300GB of memory, and takes all night to run. It does the same thing, but from inside of a browser.

As a bonus, the 'web scrapers' were blamed for every performance issue we had for a long time.


On the topic of JSON and whatnot, that reminded me of this awesome post[1] about looking within heap snapshots for entire data structures: they may hold a lot of nicely structured data that isn't readily available from more direct URL calls.

[1] https://www.adriancooney.ie/blog/web-scraping-via-javascript...

(Previous comments about it from 2022 here: https://news.ycombinator.com/item?id=31205139)


It also depends on what the goal is. If you need to extract some data from a specific site (as opposed to obtaining faithful representations of pages from arbitrary sites like a search engine might need to), then SPAs might be the easiest targets of all. Just do some inspection beforehand to find out where the SPA is loading the data from; very often the easiest thing is to query the API directly.

Sometimes the API will need some form of session tokens found in HTTP meta tags, form inputs, or cookies. Sometimes they have some CORS checks in place that are very easy to bypass from a server-side environment by spoofing the Origin header.
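For illustration, a minimal sketch of that flow (the URLs, the meta tag name, and the header names here are all hypothetical; inspect the site's network tab for the real ones):

    # Fetch the page once to pick up a session cookie and a CSRF token from a
    # meta tag, then call the SPA's JSON API directly with a spoofed Origin.
    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    page = session.get("https://example.com/products", timeout=30)

    soup = BeautifulSoup(page.text, "html.parser")
    token = soup.find("meta", attrs={"name": "csrf-token"})["content"]

    api = session.get(
        "https://example.com/api/v1/products",
        headers={
            "X-CSRF-Token": token,            # token the front end would send
            "Origin": "https://example.com",  # CORS is enforced by browsers, not servers
            "Referer": "https://example.com/products",
        },
        timeout=30,
    )
    print(api.json())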


I do a lot of scraping at my day job, and I agree the times we need to use a headless browser are very rare. The vast majority of the time you can either find an API or JSON endpoint, or scrape the JS contained in the returned SSR document to get what you want.


Headless browsers are expensive, but if you needed them, I suspect there are easy wins to be had on the performance front.

For example, you can probably skip all layout/rendering work and simply return faked values whenever javascript tries to read back a canvas it just drew into, or tries to read the computed width of some element. The vast majority of those things won't prevent the page loading.
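The canvas/layout faking described above would mean patching the browser itself, but a more accessible win in the same spirit is blocking resource types you don't need. A sketch using Playwright (my choice of tool, not something from the thread):

    # Use request interception to drop images, fonts, stylesheets and media that
    # the scrape doesn't need; everything else goes through normally.
    from playwright.sync_api import sync_playwright

    SKIP = {"image", "font", "stylesheet", "media"}

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Abort requests for resource types we don't care about, pass the rest.
        page.route("**/*", lambda route: route.abort()
                   if route.request.resource_type in SKIP else route.continue_())

        page.goto("https://example.com", wait_until="domcontentloaded")
        print(page.title())
        browser.close()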


Which way to go will depend on your use-case and what websites you want to scrape.

For occasional web crawling, a headless browser is great as it's easy to set up and you're almost guaranteed to get the data you need.

For frequent or large-scale crawling, it's a different story. Surely you can implement something with just an HTTP library, but you'll need to test it thoroughly and do some research beforehand. That said, most scraped, content-heavy websites use either static HTML or SSR, in which case you can use HTTP no problem.


Depends on what you want with the data, I guess. From a search engine point of view, the SPA case isn't that relevant, since SPAs can't reliably be linked to, in general tend not to be very stable, and it's difficult to figure out how to enumerate and traverse their views.

I think a good middle ground might be to do a first pass with a "stupid" crawler, and then re-visit the sites where you were blocked or that just contained a bunch of javascript with a headless browser.


The SPAs we create can be reliably linked to (the current URL changes as the user moves around, even though the page hasn't reloaded) and they are "stable" because our business would go bankrupt if Google couldn't crawl our content.

If Google can crawl it, then you can too. And while Google doesn't use a headless browser (or at least I assume they don't), they absolutely do execute javascript before loading the content of the page. They execute the click event handlers on every link/button, and when we use "history.pushState()" to change the URL, Google considers that a new page.

You're just going to get a loading spinner with no content if you do a dumb crawl. (I disagree with that approach and think we should be running a headless browser server side to execute javascript and generate the initial page content for all our pages... but so far management hasn't prioritised that change... instead they just keep telling us to make our client side javascript run faster... imagine if there was no javascript to execute at all? At least none before first contentful paint.)


> The SPAs we create can be reliably linked to (the current URL changes as the user moves around, even though the page hasn't reloaded) and they are "stable" because our business would go bankrupt if Google couldn't crawl our content.

This is true for some SPAs, but not all SPAs, and there's not really any way of telling which is which.

I don't personally attempt to crawl SPAs because it's not the sort of content I want to index.


I have a pet theory that there are two forms of the web: the document web and the application web. SPAs have some very attractive properties for the application web but complicate/break the document web.

That being said, with sites like HN, Reddit, LinkedIn, Twitter, news outlets, etc. the lines between “document” and “application” get blurred. In some ways they’ve built a micro-application that hosts documents. Content can be user submitted in-browser. Content can be “engaged with” in browser. Some handle this blurring better than others. HN is an example IMO of getting it right where nearly everything that should be addressable (like comments) can be linked to. Others not so much.

(As an aside, I love marginalia!)


For application websites like the ones you listed, you'd typically end up building a special integration for crawling against their API or data dumps. This is also true for github, stackoverflow, and even document-y websites like wikipedia.

It's simply not feasible to treat them as any other website if you wanna index their data.


Site owners adding browser fingerprinting explicitly is vanishingly rare; however, site owners sitting behind Cloudflare, which very likely fingerprints browsers, are very common.


The issues you list can probably be split into two main issues:

- The host has a tendency to block unknown user agents, or user agents that claim to be a browser but are not

- Anything that requires client side rendering

I'd suppose both problems are more pertinent in 2023 than they were in 2012.

At web scale, the issue becomes: at what point would you be required to use a headless browser when you weren't using one in the first place, e.g. if React is included/referenced? Perhaps some simple fingerprinting of JS files would do (rough sketch below), but IMO in reality the line is very blurry, so in practice you either do or you don't.
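As a sketch of what such fingerprinting might look like (every threshold and pattern here is a guess, not an established rule):

    # Fetch the raw HTML; if the body has almost no text but references a SPA
    # framework bundle, queue the URL for a headless-browser pass.
    import re
    import requests
    from bs4 import BeautifulSoup

    FRAMEWORK_HINTS = re.compile(r"react|vue|angular|__NEXT_DATA__|data-reactroot", re.I)

    def needs_headless(url: str) -> bool:
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        body_text = soup.body.get_text(strip=True) if soup.body else ""
        # Nearly empty body + framework fingerprints => likely client-side rendered.
        return len(body_text) < 200 and bool(FRAMEWORK_HINTS.search(html))

    print(needs_headless("https://example.com"))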


I've always wondered why not use request interceptors, get the html/json/xml/whatever URL, and then just call it directly.

If you need cookies/headers, you can always open the browser, log in and then make the same requests in the console, instead of waiting for the browser to load and scrape the UI (by xpath, etc.)

It sounds weird to go in circles:

- the SPA calls some URL

- the SPA uses the response data to populate the UI

- you scrape the UI

instead of just calling the URL inside the browser? Am I missing something?
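For completeness, a sketch of that interception idea (Playwright is an assumption here, not something from the thread): load the page once, record which URLs returned JSON, then hit those URLs directly with a plain HTTP client afterwards.

    # Record the JSON endpoints a SPA calls while it renders one page.
    from playwright.sync_api import sync_playwright

    api_urls = []

    def on_response(response):
        if "application/json" in response.headers.get("content-type", ""):
            api_urls.append(response.url)

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", on_response)
        page.goto("https://example.com/app", wait_until="networkidle")
        browser.close()

    print(api_urls)  # candidates to call directly with a plain HTTP client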


Is there something like a custom, performance-focused headless browser?

How are they scaled?


A note of caution: never scrape the web from your local/residential network. A few months back I wanted to verify a set of around 200k URLs from a large data set that included a set of URL references for each object, and naively wrote a simple Python script that used ten concurrent threads to ping each URL and record the HTTP status code that came back. I let this run for some time and was happy with the results, only to find out later that a large CDN provider had identified me as a spammy client with their client reputation score and blocked my IP address on all of the websites they serve.

Changing your IP address with AT&T is a pain (even though the claim is that your IP is dynamic, in practice it hardly ever changes) so I opted to contact the CDN vendor by filling out a form and luckily my ban was lifted in a day or two. Nevertheless, it was annoying that suddenly a quarter of the websites I normally visit were not accessible to me since the CDN covers a large swath of the Internet.


I run a search engine crawler from my residential network. I get this too sometimes, but a lot of the time the IP shit-listing is temporary. It also seems to happen more often if you don't use a high enough crawl delay, ignore robots.txt, do deep crawls ignoring HTTP 429 errors and so on. You know, overall being a bad bot.

Overall, it's not as bad as it seems. I doubt anyone would accidentally damage their IP reputation doing otherwise above-board stuff.
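A minimal sketch of those "good bot" basics (delays and retry values are arbitrary examples, and in practice you'd cache the parsed robots.txt per domain rather than refetch it for every URL):

    # Honour robots.txt, keep a per-domain delay, and back off on HTTP 429.
    import time
    import urllib.robotparser
    from urllib.parse import urlparse
    import requests

    UA = "MyCrawler/0.1 (+https://example.com/bot)"
    last_hit = {}  # domain -> timestamp of the last request to it

    def polite_get(url, delay=2.0):
        domain = urlparse(url).netloc

        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"https://{domain}/robots.txt")
        rp.read()
        if not rp.can_fetch(UA, url):
            return None  # disallowed by robots.txt

        # Per-domain crawl delay.
        wait = delay - (time.time() - last_hit.get(domain, 0))
        if wait > 0:
            time.sleep(wait)
        last_hit[domain] = time.time()

        resp = requests.get(url, headers={"User-Agent": UA}, timeout=30)
        if resp.status_code == 429:
            # Back off; assumes Retry-After is given in seconds.
            time.sleep(int(resp.headers.get("Retry-After", 60)))
            resp = requests.get(url, headers={"User-Agent": UA}, timeout=30)
        return resp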


I’ve learned a bunch of stuff about batch processing in the last few years that I would have sworn I already knew.

We had a periodic script that had all of these caveats about checking telemetry on the affected systems before running it, and even when it was happy it took gobs of hardware and ran for over 30 minutes.

There were all sorts of traffic-shaping mistakes that made it very bursty, like batching instead of rate limiting, so the settings were determined by trial and error, essentially tuned to the 95th percentile of the worst case (which is to say, occasionally you'd get unlucky and knock things over). It also had to gather data from three services to feed a fourth, and it was very spammy about that as well.

I reworked the whole thing with actual rate limiting, some different async blocks to interleave traffic to different services, and some composite rate limiting so we would call service C no faster than Service D could retire requests.

At one point I cut the cluster core count by 70% and the run time down to 8 minutes. Around a 12x speed up. Doing exactly the same amount of work, but doing it smarter.
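A minimal sketch of that composite-limiting idea (service names and numbers are hypothetical, not the parent's system): hold a concurrency slot until the downstream service has retired the result, so the upstream is never called faster than the downstream can absorb.

    import asyncio

    async def fetch_from_c(item):      # stand-in for the upstream call
        await asyncio.sleep(0.05)
        return item

    async def write_to_d(result):      # stand-in for the slower downstream call
        await asyncio.sleep(0.2)

    async def process(item, slots: asyncio.Semaphore):
        async with slots:              # slot is held until D has retired the request
            result = await fetch_from_c(item)
            await write_to_d(result)

    async def main():
        slots = asyncio.Semaphore(10)  # at most 10 C->D pipelines in flight
        await asyncio.gather(*(process(i, slots) for i in range(100)))

    asyncio.run(main())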

CDNs and SaaS companies are in a weird spot where typical spider etiquette falls down. Good spiders limit themselves to N simultaneous requests per domain, trying to balance their burden across the entire internet. But they are capable of M*N total simultaneous requests, and so if you have a narrow domain or get unlucky they can spider twenty of your sites at the same time. Depending on how your cluster works (ie, cache expiry) that may actually cause more stress on the cluster than just blowing up one Host at a time.

People can get quite grumpy about this behind closed doors, and punishing the miscreants definitely gets discussed.


It makes very little difference what IP you scrape from, unless you're from a very dodgy subnet.

The major content providers tend to take a whitelist-only approach: you're either a human-like visitor or you're facing their anti-scraping methodologies.


I think the emphasis is on "never scrape from YOUR local/residential network".


Cloud-based scraping services most probably weren't available in 2012. Now there are services like scraperapi and others that don't need you to install anything at your end. You pay them and use their cloud infra, infinite proxies, and even headless browsers. Shameless plug: I wrote about it a few years ago in a blog post [1]

[1] https://blog.adnansiddiqi.me/scraping-dynamic-websites-using...


>(even though the claim is that your IP is dynamic, in practice it hardly ever changes)

Every ISP just uses DHCP for router IPs. It's dynamic, you just have to let the lease time expire to renew it.

Or have your own configurable router instead of the ISP's, so that you can actually send a DHCP release command, though they don't all support this. Changing the MAC address will work otherwise.


When the lease expires, the same IP is prioritized for renewal. Leases are generally for a week or two, but I've noticed dynamic IPs staying for 3 months or more. Swapping modems is really the best way to get a new external IP.


Not sure how it is in Python, but what about using something like arti-client? Would it already be blocked?


To save you a click: Use 20 machines.


250M / 40hrs / 60min / 60s ~= 1,737 rps. That over 20 machines is ~87 rps per machine.

Depending on a few factors, I rough out my backend Go stuff to handle between 1-5k rps per machine before we have real numbers.


You didn't write your Go stuff 11 years ago though...


We started with Go version 1.2 which was released over ten years ago. Pretty darn close to 11 years.

https://go.dev/doc/go1.2


Wow crazy how time flies. I did not realize Go has been in production for over 10 years now.


How to spend $580 in 40 hours. More could be done for much less, even in 2012.


Yes. I used libcurl's multi interface on one $40/mo server around that time. Indeed, at any scale rate limiting becomes the main bottleneck, mainly because a lot of sites are concentrated on certain hosts. Speed isn't the problem, and multiple servers aren't really needed.


For all the people who say this is easy: try it! It's not easy at all; I've tried it and spent a few weeks getting to similar performance. Receiving thousands of requests is not the same as making thousands of requests: you can saturate your network, get swamped by the latency of random websites, hit sites that never time out, parse multi-megabyte malformed HTML, and run into infinite redirections.

My fastest implementation in Python actually used threads and was much faster than any async variant.
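Not the parent's code, but a sketch of that thread-based shape using the standard library's ThreadPoolExecutor, with hard timeouts so one slow or never-timing-out site can't stall the run:

    from concurrent.futures import ThreadPoolExecutor, as_completed
    import requests

    URLS = ["https://example.com", "https://example.org"]  # placeholder list

    def fetch(url):
        try:
            resp = requests.get(url, timeout=(5, 15))  # connect/read timeouts
            return url, resp.status_code, len(resp.content)
        except requests.RequestException as exc:
            return url, None, str(exc)

    with ThreadPoolExecutor(max_workers=50) as pool:
        futures = [pool.submit(fetch, u) for u in URLS]
        for future in as_completed(futures):
            print(future.result())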


Great to see this again. This was the article that introduced me to Redis (and more broadly the NoSQL rollercoaster) all those years ago!


Just use 500 android phones running Tmux they said. It'll be easy they said.


Imagine a Beowulf cluster of android phones running tmux!


Be careful with that reference, it's an antique.


Netcraft now confirms: Beowulf cluster memes are dead.


A Beowulf cluster running screen


Archive link: https://archive.is/yUWjh

Also kinda wish the author paid any sort of attention to the fact that doing this incorrectly may create a flood of DNS queries. At least have the decency to set up a bind cache or something.


what is a bind cache? always assume most of us have terrible knowledge of networking


I mean running bind[1] locally configured to act as a DNS cache.

The operating system does some DNS caching as well, but it's not really tuned for crawling, and as a result it's very easy to end up spamming innocent DNS servers with an insane amount of lookup requests.

[1] https://www.isc.org/bind/
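For reference, a minimal caching-only named.conf sketch (paths and ACLs vary by distro; treat this as a starting point, not a tuned config):

    options {
        directory "/var/cache/bind";
        recursion yes;              // act as a recursive, caching resolver
        listen-on { 127.0.0.1; };   // only serve the local machine
        allow-query { 127.0.0.1; };
        max-cache-size 512m;        // give the cache room for a crawl
    };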


ok but in my understanding isn't DNS also cached on the nearest wifi/ISP routers? the whole DNS system is just layer after layer of caches, right? i.e. does caching on the local machine actually matter? (real question, i don't know)


Yeah sure, but most of those caching layers (including possibly on the ISP level) aren't really configured for the DNS flood a crawling operation may result in.

If you're going to do way more DNS lookups than are expected from your connection, it's good practice to provide your own caching layer that's scaled accordingly.

Not doing so probably won't break anything, but it risks degrading the DNS performance of other people using the same resolvers.


gotcha. thanks for indulging my curiosity! hopefully others will learn good dns hygiene from this as well.


I've been recently reading up on multithreading and multiprocessing in Python. You mention that you've taken a multi-threaded approach since the processes are I/O bound. Is this the same as running the script with asyncio's async/await?


At a 10,000-foot view (pedants will take offense), you should look to use multiprocessing for tasks which are CPU-bound, and asyncio/threads (but really asyncio if you can) for problems which are IO-bound.

This is a massive simplification but most useful for a beginner.

Additionally, asyncio is not the same as multithreading, because typically asyncio is powered by a single-threaded event loop and a mechanism like select/kqueue/IOCP/epoll.
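A sketch of that single-threaded, event-loop shape for IO-bound fetching (aiohttp is my choice here, not something from the thread):

    # One thread, one event loop, many sockets in flight, with a semaphore
    # capping concurrency.
    import asyncio
    import aiohttp

    URLS = ["https://example.com", "https://example.org"]  # placeholder list

    async def fetch(session, url, limit):
        async with limit:  # cap concurrent requests
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
                body = await resp.read()
                return url, resp.status, len(body)

    async def main():
        limit = asyncio.Semaphore(100)
        async with aiohttp.ClientSession() as session:
            results = await asyncio.gather(*(fetch(session, u, limit) for u in URLS))
            for r in results:
                print(r)

    asyncio.run(main())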


Oops, I meant to ask the author but realised that the author is not the same as the op. Hah!


With async in any language of choice this should need just one server: 250M / 40 hours ≈ 1.7K requests/second.

You can probably do it on a single core in a low level language like Rust/Java.


Threads aren't the only bottleneck in crawling.

Assuming you're crawling at a civilized rate of about 1 request/second, you're only so many network hiccups away from consuming the entire ephemeral port range with connections in TIME_WAIT or CLOSE_WAIT.


Crank up the ulimit. And what is this 1 req/second nonsense? 1 req/sec/domain, maybe. I have to agree; my first thought was "why is this not a single node in Go?"


Oh yeah, I mean 1 req/sec per domain of course.

It's very easy to end up with tens of thousands of not-quite-closed connections while crawling, even with SO_LINGER(0), proper closure, tweaking the TCP settings, and doing all things by the book.

It's a different situation from e.g. having a bunch of incoming connections to a node, the traffic patterns are not at all similar.


a bit of context - posted this because i found it via https://twitter.com/willdepue/status/1669074758208749572

if they replicate it in 2023 it would be pretty interesting to me. i can think of a few times a year i need a good scraper.

but also thought it a good look into 2012 Michael Nielsen, and into thinking about performance.


What makes this post more interesting is that Reddit might now be ushering in a new era of the crawler arms race.


How do you get the top million sites list today? Alexa has shifted focus recently.


Check out Tranco [1], which uses Cisco Umbrella, Majestic and also now a list sourced from Farsight passive DNS [2]. They're "working on" adding Chrome UX and Cloudflare Radar.

There's also a list from Netcraft [3].

[1] https://tranco-list.eu/

[2] https://www.domaintools.com/resources/blog/mirror-mirror-on-...

[3] https://trends.netcraft.com/topsites




Here is an example of how to obtain a list of the top six million domains from Tranco and analyze their content with ClickHouse: https://github.com/ClickHouse/ClickHouse/issues/18842


[flagged]


No point being a douchebag.


He doesn't want to release the source code because of moral reservations about how people might use it. OK, upstanding citizen.


Just like AI people


Nothing to stop you writing a similar project and releasing it if you feel so strongly that it should be out there, instead of being incensed at not being given the sweat off someone else's brow.




