> 1 server to index the whole internet? I don't care how big your server is -- that's not going to happen.
Well let's do the math!
1 billion bytes is 1 GB. The average text size of a document is ~5 KB of uncompressed HTML, so 1 billion documents comes out to about 5 TB of raw HTML. In practice the index would be smaller, say 1 TB, because you're not indexing the raw HTML but a fixed-width representation of the words in it, typically with some form of compression as well. (In general, the index is significantly smaller than the raw data unless you're doing something strange.)
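A rough sketch of that arithmetic, where the ~5 KB average and the index-to-raw ratio are assumptions for illustration, not measurements:

```python
# Back-of-envelope sizing for a 1-billion-document index.
# The 5 KB average document size and the index-to-raw ratio
# are assumptions for illustration only.

NUM_DOCS = 1_000_000_000      # 1 billion documents
AVG_HTML_BYTES = 5_000        # ~5 KB of uncompressed HTML per document

raw_html_bytes = NUM_DOCS * AVG_HTML_BYTES
print(f"Raw HTML: {raw_html_bytes / 1e12:.1f} TB")        # ~5 TB

# The index stores fixed-width term IDs plus metadata rather than
# the HTML itself, and is typically compressed, so assume it ends
# up around a fifth of the raw size.
INDEX_TO_RAW_RATIO = 0.2      # assumed ratio
index_bytes = raw_html_bytes * INDEX_TO_RAW_RATIO
print(f"Estimated index: {index_bytes / 1e12:.1f} TB")    # ~1 TB
```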
You could buy a thumb drive that would hold a search index for 1 billion documents. A real server uses enterprise SSDs for this stuff, but even then you don't have to dip into storage-server specs to outfit a single server with hundreds of terabytes of SSD.
> And of course all of this is based on "standard" algorithms. But it doesn't change the fact that implementing them all to build a functional performant web crawler+database+search is a project that is larger than any one person could build from scratch.
I'm still not seeing why this is impossible.
> It would only be feasible to build on top of existing libraries, which is the whole point here.
I don't think there are existing libraries for most of this stuff. Most off-the-shelf stuff in the search space is for indexing smaller corpora; it generally isn't built to scale up to corpus sizes in the billions (e.g. it uses 32-bit keys, etc.)
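To make the 32-bit limit concrete (a rough illustration, not a reference to any particular library's key layout):

```python
# A 32-bit document ID can only distinguish 2**32 ~= 4.3 billion
# documents, and implementations often pack flags or shard bits into
# the same word, lowering that ceiling further. Illustrative only.

MAX_32BIT_IDS = 2**32              # ~4.29 billion distinct IDs
corpus_size = 1_000_000_000        # 1 billion documents

print(f"32-bit ID space used: {corpus_size / MAX_32BIT_IDS:.0%}")      # ~23%
print(f"IDs left before overflow: {MAX_32BIT_IDS - corpus_size:,}")    # ~3.3 billion
```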