
Assuming you get the TCP/IP stack for free, you still need to build fully-featured HTTPS and a "webscale" multi-server database for document storage from scratch. The crawler is easy and so is something like PageRank, but then building the sharded keyword text search engine itself that operates at webscale is a whole other project...

The point is that it's too much work for a single person to build all the parts themselves. It's only feasible if you rely on pre-existing HTTPS libraries and database and text search technologies.




Simple methods of search like exact matching are very fast using textbook algorithms. There are well-known algorithms, like suffix trees, that can search millions of documents in milliseconds.
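
Roughly, the idea looks like this (a toy sketch using a suffix array, the array-shaped cousin of a suffix tree; real engines use linear-time construction and keep the index on disk, this is just the concept):

    import bisect  # bisect's key= argument needs Python 3.10+

    def build_suffix_array(text: str) -> list[int]:
        # Toy construction: sort suffix start positions lexicographically.
        # (Real construction is linear-time; this is quadratic-ish and in-memory.)
        return sorted(range(len(text)), key=lambda i: text[i:])

    def find_all(text: str, sa: list[int], pattern: str) -> list[int]:
        # Binary search for the contiguous block of suffixes starting with `pattern`.
        key = lambda i: text[i:i + len(pattern)]
        lo = bisect.bisect_left(sa, pattern, key=key)
        hi = bisect.bisect_right(sa, pattern, key=key)
        return sorted(sa[lo:hi])  # start offsets of every exact match

    corpus = "the quick brown fox jumps over the lazy dog"
    sa = build_suffix_array(corpus)
    print(find_all(corpus, sa, "the"))  # -> [0, 31]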


That's not enough. It needs to be sharded and handle things like 10 search terms, each of which matches a million documents, where you're trying to find the intersection across sharded results from 20 servers. Quickly.

That's not a textbook algorithm.


Intersecting posting lists is a solved problem. You can do it in sublinear time with standard search engine algorithms.
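
For reference, the standard trick looks something like this (a toy sketch of "galloping" intersection over in-memory sorted doc ID lists, not anybody's production code):

    from bisect import bisect_left

    def intersect_galloping(small: list[int], large: list[int]) -> list[int]:
        # Both posting lists are sorted doc IDs. For each ID in the small list,
        # gallop forward through the large list (doubling the step), then binary
        # search inside the overshot window. Cost is roughly
        # O(|small| * log(|large| / |small|)), i.e. sublinear in |large|.
        result, pos = [], 0
        for doc in small:
            step, hi = 1, pos
            while hi < len(large) and large[hi] < doc:
                hi = pos + step
                step *= 2
            pos = bisect_left(large, doc, pos, min(hi + 1, len(large)))
            if pos < len(large) and large[pos] == doc:
                result.append(doc)
        return result

    print(intersect_galloping([3, 9, 27], list(range(0, 1000, 3))))  # [3, 9, 27]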

The problem space is embarrassingly parallel, so sharding is no problem. Although, realistically, you probably only need 1 server to cope with the load and storage needs. This isn't 2004. Servers are big and fast now, as long as you don't try to use cloud compute.
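
To make "embarrassingly parallel" concrete: each shard holds a disjoint slice of the documents (say, routed by hashing the doc ID), every query fans out to all shards independently, and the coordinator just merges the per-shard top-k results. A toy sketch, with the shard count and scoring invented for the example:

    import heapq

    NUM_SHARDS = 4  # arbitrary for the example

    def shard_for(doc_id: int) -> int:
        # Used at index time to decide which shard stores a document.
        return doc_id % NUM_SHARDS

    # Each shard: term -> list of (score, doc_id) postings for its own documents.
    def search_shard(shard: dict, term: str, k: int) -> list[tuple[float, int]]:
        return heapq.nlargest(k, shard.get(term, []))

    def search(shards: list[dict], term: str, k: int = 10) -> list[tuple[float, int]]:
        # Fan out (in real life, in parallel over the network), then merge top-k.
        hits = (hit for shard in shards for hit in search_shard(shard, term, k))
        return heapq.nlargest(k, hits)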


1 server to index the whole internet? I don't care how big your server is -- that's not going to happen.

And of course all of this is based on "standard" algorithms. But it doesn't change the fact that implementing them all to build a functional, performant web crawler + database + search engine is a larger project than any one person could build from scratch.

It would only be feasible to build on top of existing libraries, which is the whole point here.


> It would only be feasible to build on top of existing libraries, which is the whole point here.

I don't think you've made a single point that substantiates that claim, and neither has anyone else, really. The person you're arguing with has (by their own estimation) built more of the actual search engine core from scratch than any other part of their project (https://marginalia.nu/).

Honestly, it seems more like the people arguing that a search engine is somehow "off limits" from scratch are doing so because they imagine it's less feasible, probably because they simply don't know how, or are in general pretty bad at implementing things from scratch, period.

To a large degree there is also a wide difference in actual project scope. Remember that we are comparing this to implementing Space Invaders, one of the simplest games you can make, and one that is doable in an evening even if you write your own renderer.

To talk about "webscale" (a term that is deliciously dissonant with most of the web world, which runs at a scale several hundred times slower and less efficient than it should) is like suddenly talking about implementing deferred shading for the Space Invaders game. Even if that were something you'd want to explore later, it's not something you're doing right now, because this is a project you're using to learn as you go.


> 1 server to index the whole internet? I don't care how big your server is -- that's going not to happen.

Well let's do the math!

1 billion bytes is 1 GB. The average text size of a document is ~5 KB of uncompressed HTML, so 1 billion documents is about 5 TB of raw HTML. In practice the index would be smaller, say 1 TB, because you're not indexing the raw HTML but a fixed-width representation of the words therein, typically with some form of compression as well. (In general, the index is significantly smaller than the raw data unless you're doing something strange.)
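
Spelled out as a back-of-envelope calculation (the 5 KB/document average and the roughly 5:1 raw-to-index ratio are assumptions, not measurements):

    docs = 1_000_000_000      # 1 billion documents
    avg_html_bytes = 5_000    # ~5 KB of uncompressed HTML each (assumed average)
    raw_html = docs * avg_html_bytes       # 5e12 bytes = 5 TB of raw HTML
    index_ratio = 0.2                      # index ~1/5 of the raw text (assumption)
    index_bytes = raw_html * index_ratio   # ~1 TB of index

    TB = 10 ** 12
    print(f"raw HTML ~{raw_html / TB:.0f} TB, index ~{index_bytes / TB:.0f} TB")
    # -> raw HTML ~5 TB, index ~1 TB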

You could buy a thumb drive that would hold a search index for 1 billion documents. A real server uses enterprise SSDs for this stuff, but even so, you don't have to dip into storage-server specs to outfit a single server with hundreds of terabytes of SSD storage.

> And of course all of this is based on "standard" algorithms. But it doesn't change the fact that implementing them all to build a functional performant web crawler+database+search is a project that is larger than any one person could build from scratch.

I'm still not seeing why this is impossible.

> It would only be feasible to build on top of existing libraries, which is the whole point here.

I don't think there are existing libraries for most of this stuff. Most off-the-shelf tools in the search space are for indexing smaller corpora; they generally aren't built to scale up to corpus sizes in the billions (e.g. they use 32-bit document keys, etc.)


No, it doesn't require that. If you take all public web text, it's only something like 40 trillion characters [1], which could fit on a single hard disk uncompressed, including the suffix tree structure.

[1]: https://arxiv.org/pdf/2306.01116.pdf


Yeah, it is textbook. It was textbook a decade and a half ago already.

Add doc IDs to sorted update lists; zipper merge and arithmetic coding once the lists get large. That is the textbook baseline from back then.

Tack on a skip list, and you have fast intersections.

This is well trodden ground.
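
For the curious, a bare-bones sketch of that baseline (plain delta encoding stands in for the arithmetic coder, and the skip interval is arbitrary):

    SKIP_EVERY = 64  # arbitrary skip interval; tuned in practice

    def encode(doc_ids: list[int]):
        # Sorted doc IDs stored as gaps (small numbers compress well), plus
        # sparse skip pointers: (doc id, index into deltas).
        deltas, skips, prev = [], [], 0
        for i, doc in enumerate(doc_ids):
            deltas.append(doc - prev)
            prev = doc
            if i % SKIP_EVERY == 0:
                skips.append((doc, i))
        return deltas, skips

    def advance_to(deltas, skips, target):
        # First doc id >= target, or None. The skip list lets us start decoding
        # near the target instead of from the beginning of the list.
        doc, pos = 0, 0
        for skip_doc, skip_pos in skips:
            if skip_doc > target:
                break
            doc, pos = skip_doc, skip_pos + 1  # doc already includes deltas[skip_pos]
        while doc < target and pos < len(deltas):
            doc += deltas[pos]
            pos += 1
        return doc if doc >= target else None

    # Zipper intersection: walk one list, advance in the other via the skips.
    sevens, thirteens = list(range(0, 10_000, 7)), list(range(0, 10_000, 13))
    a = encode(sevens)
    common = [d for d in thirteens if advance_to(*a, d) == d]
    print(common[:5])  # -> [0, 91, 182, 273, 364]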



