
1 server to index the whole internet? I don't care how big your server is -- that's not going to happen.

And of course all of this is based on "standard" algorithms. But it doesn't change the fact that implementing them all to build a functional, performant web crawler+database+search is a project that is larger than any one person could build from scratch.

It would only be feasible to build on top of existing libraries, which is the whole point here.




> It would only be feasible to build on top of existing libraries, which is the whole point here.

I don't think you've made a single point that substantiates that claim, and neither has anyone else, really. The person you're arguing with has (by their own estimation) built more of the actual search engine core from scratch than anything else in their own project (https://marginalia.nu/).

Honestly, it seems more like the people arguing that a search engine is somehow "off limits" from scratch do so because they imagine it's less feasible than it is, probably because they simply don't know how, or are in general pretty bad at implementing things from scratch, period.

To a large degree there is also a wide difference in actual project scope. Remember that we are comparing this to implementing Space Invaders, one of the simplest games you can make, and one that is doable in an evening even if you write your own renderer.

To talk about "webscale" (a term that is so deliciously dissonant with most of the web world, which runs several hundred times slower and less efficiently than it should) is like suddenly talking about implementing deferred shading for the Space Invaders game. Even if it's something you'd want to explore later, it's just not something you're doing right now, because this is a project you're using to learn as you go.


> 1 server to index the whole internet? I don't care how big your server is -- that's not going to happen.

Well let's do the math!

1 billion bytes is 1 GB. The average text size of a document is ~5 KB; that's the HTML, uncompressed. 1 billion documents is thus about 5 TB of raw HTML. In practice the index would be smaller, say 1 TB, because you're not indexing the raw HTML but a fixed-width representation of the words therein, typically with some form of compression as well. (In general, the index is significantly smaller than the raw data unless you're doing something strange.)
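A quick back-of-envelope sanity check of those numbers (the ~5 KB average and the roughly 5:1 index reduction are assumptions, not measurements):

  # Back-of-envelope sizing for a 1-billion-document index.
  docs = 1_000_000_000
  avg_html_bytes = 5 * 1024            # ~5 KB of uncompressed HTML per document (assumed)

  raw_html_tb = docs * avg_html_bytes / 1e12
  print(f"raw HTML: ~{raw_html_tb:.1f} TB")   # ~5 TB

  # The index stores fixed-width word/posting entries, not the HTML itself,
  # and is typically compressed; assume roughly a 5:1 reduction overall.
  index_tb = raw_html_tb / 5
  print(f"index:    ~{index_tb:.1f} TB")      # ~1 TB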

You could buy a thumb drive that would hold a search index for 1 billion documents. A real server uses enterprise SSDs for this stuff, but even so, you don't even have to dip into storage-server specs to outfit a single server with hundreds of terabytes of SSD storage.

> And of course all of this is based on "standard" algorithms. But it doesn't change the fact that implementing them all to build a functional, performant web crawler+database+search is a project that is larger than any one person could build from scratch.

I'm still not seeing why this is impossible.

> It would only be feasible to build on top of existing libraries, which is the whole point here.

I don't think there are existing libraries for most of this stuff. Most off-the-shelf stuff in the search space is for indexing smaller corpuses; it generally isn't built to scale up to corpus sizes in the billions (e.g. it often uses 32-bit keys).
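To make the key-width point concrete: a 32-bit document ID caps you at roughly 4.3 billion documents, and the key width also multiplies through the size of every posting entry. A minimal sketch with a hypothetical fixed-width posting layout (not taken from any particular library):

  import struct

  MAX_DOC_ID_32 = 2**32 - 1   # ~4.29 billion docs: already tight for a web-scale corpus
  MAX_DOC_ID_64 = 2**64 - 1   # effectively unlimited, but costs more per posting

  # (doc_id, term_frequency) packed as fixed-width records
  entry_32 = struct.pack("<IH", 123_456_789, 7)        # 6 bytes per posting
  entry_64 = struct.pack("<QH", 123_456_789_012, 7)    # 10 bytes per posting

  postings = 100_000_000_000   # e.g. 1B docs * ~100 indexed terms each (assumed)
  print(f"32-bit keys: ~{postings * len(entry_32) / 1e12:.1f} TB of postings")
  print(f"64-bit keys: ~{postings * len(entry_64) / 1e12:.1f} TB of postings")

That's before compression, which is roughly where the ~1 TB ballpark above comes from.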



