> Literally all of the listed projects, text editors, compilers, operating systems, and ray tracers, can exercise the exact same activities.
In the linked article, these projects are all explicitly described as opportunities to learn about low-level stuff like how to efficiently store editable text. The difference with a web search engine is that nobody today can build such a thing completely from scratch, so it forces you to give up the toxic NIH mentality, which in my experience is usually driven by elitism and ego (on full display in this comment thread), not by some lofty desire to learn something new.
If you want to build a compiler from scratch, you must first invent the universe. Peeling back abstractions to see how things could or should work is perfectly fine, even for professionals.
Case in point: I've spent a year excising bloated frameworks from my stack at work and replacing the few corners of them we actually needed with, e.g., 50 lines of curl calls. The C compiles instantly, is tailored to our tiny use case, delivered the one related feature we wanted much faster, and removed chains of dependencies.
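For a sense of what "50 lines of curl calls" means, here's a rough sketch; the URL and the bare-bones error handling are placeholders, not our actual code:

```c
#include <stdio.h>
#include <curl/curl.h>

/* Minimal sketch: fetch a URL and write the response body to stdout.
 * Real code would add timeouts, headers, and a proper write callback,
 * but the shape is the same: a handful of libcurl calls, no framework. */
static int fetch(const char *url)
{
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, stdout);

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        fprintf(stderr, "curl: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    return res == CURLE_OK ? 0 : 1;
}

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    int rc = fetch("https://example.com/");   /* placeholder endpoint */
    curl_global_cleanup();
    return rc;
}
```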
Being reliant on far-too-abstract libraries and frameworks to do simple jobs is also a curse. But nobody at work had the experience to know that curl was sitting right there on our image, available for our use. And nobody had that experience because nobody had taken the time to build something from low-level libraries. Now we know how, and we can make an intelligent decision without defaulting in either direction because we were afraid to try.
20,000 fewer lines of code, faster compile/deploy times, significantly less interfacing and translation between "their" types and "their" APIs, and much better control over types and structure across our codebase, because we no longer needed "their" types and structure anywhere. They had crept everywhere.
It's just faster, cleaner, and easier in a few cases to do precisely what you need right now, rather than anticipate a million things you might need and refactor your code to adopt a given "solution".
Yeah, the problem / benefit is never the one framework you added or removed; it's the mindset of reaching for another dependency by default, which leads to a behemoth no one enjoys working with.
I also think a person who has had to strip out or replace dependencies is more likely to be careful when choosing them in the future. This is definitely the case for me, at least.
You know those car guys who will rebuild their engine just because? Or those retro computer guys who will recap an ancient board rather than buy a modern PC?
For you it's a job; for me it's a hobby. I have no interest in making something 'professional'; I want to take it apart to understand how it works. Want to understand how a text editor works? Write one.
What component of that is ego?
It seems to me you're the elitist: you're not far off saying mere users shouldn't be allowed to modify their own software, or to install software that hasn't been okayed by the people who know what they're doing.
Your lack of constraints on personal exploration in software is interpreted as hubris by people who came to software seeking high-paying, regimented recipe-following.
Assuming you get the TCP/IP stack for free, you still need to build fully-featured HTTPS and a "webscale" multi-server database for document storage from scratch. The crawler is easy and so is something like PageRank, but then building the sharded keyword text search engine itself that operates at webscale is a whole other project...
The point is that it's too much work for a single person to build all the parts themselves. It's only feasible if you rely on pre-existing HTTPS libraries, databases, and text search technologies.
Simple methods of search like exact matching are very fast using textbook algorithms. There are well-known structures, like suffix trees, that can search millions of documents in milliseconds.
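To make that concrete, here's a toy sketch of the idea using a suffix array instead of a suffix tree (same principle, fits in a few lines; the naive construction below is fine for a demo, not for a web-sized corpus):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy exact-match search: sort all suffixes of the text once, then answer
 * queries by binary-searching for the pattern as a prefix of some suffix.
 * Construction here is naive (qsort of suffix offsets); real systems use
 * linear-time builders and compressed indexes. */

static const char *text;

static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

static int *build_suffix_array(const char *t, int n)
{
    int *sa = malloc(n * sizeof *sa);
    text = t;
    for (int i = 0; i < n; i++)
        sa[i] = i;
    qsort(sa, n, sizeof *sa, cmp_suffix);
    return sa;
}

/* Returns the offset of one occurrence of pat in t, or -1 if absent. */
static int find(const char *t, const int *sa, int n, const char *pat)
{
    int lo = 0, hi = n - 1, m = strlen(pat);
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int c = strncmp(pat, t + sa[mid], m);
        if (c == 0)
            return sa[mid];
        if (c < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}

int main(void)
{
    const char *t = "she sells sea shells by the sea shore";
    int n = strlen(t);
    int *sa = build_suffix_array(t, n);
    printf("'shells' found at offset %d\n", find(t, sa, n, "shells"));
    free(sa);
    return 0;
}
```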
That's not enough. It needs to be sharded and handle things like 10 search terms, each of which matches a million documents, while you're trying to find the intersection. Across sharded results from 20 servers. Quickly.
Intersecting posting lists is a solved problem. You can do it in sublinear time with standard search engine algorithms.
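For what it's worth, the core of it is small. Here's a sketch of the classic gallop-then-binary-search intersection of two sorted posting lists; skip pointers and compression are what real engines layer on top, and the doc ids below are made up:

```c
#include <stdio.h>

/* Smallest index i in [start, n) with a[i] >= key; returns n if none exists. */
static int gallop(const int *a, int n, int start, int key)
{
    if (start >= n || a[start] >= key)
        return start;
    /* Exponentially widen the probe window until it brackets the answer... */
    int step = 1, lo = start, hi = start + 1;
    while (hi < n && a[hi] < key) {
        lo = hi;
        step *= 2;
        hi = lo + step;
    }
    if (hi > n)
        hi = n;
    /* ...then binary search inside (lo, hi]. */
    lo = lo + 1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Intersect two sorted doc id lists, writing matches to out; returns the count.
 * Cost is roughly O(|small| * log |big|), which is what makes intersecting a
 * rare term with a very common one cheap. */
static int intersect(const int *small, int ns, const int *big, int nb, int *out)
{
    int j = 0, count = 0;
    for (int i = 0; i < ns && j < nb; i++) {
        j = gallop(big, nb, j, small[i]);
        if (j < nb && big[j] == small[i])
            out[count++] = small[i];
    }
    return count;
}

int main(void)
{
    /* Made-up posting lists of doc ids for a rare term and a common term. */
    int rare[]   = { 3, 9, 14, 27, 90 };
    int common[] = { 1, 2, 3, 5, 8, 9, 12, 14, 20, 27, 33, 40, 55, 90, 91 };
    int out[5];
    int n = intersect(rare, 5, common, 15, out);
    for (int i = 0; i < n; i++)
        printf("%d ", out[i]);       /* prints: 3 9 14 27 90 */
    printf("\n");
    return 0;
}
```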
The problem space is embarrassingly parallel, so sharding is no problem. Although, realistically, you probably only need one server to cope with the load and storage needs. This isn't 2004. Servers are big and fast now, as long as you don't try to use cloud compute.
1 server to index the whole internet? I don't care how big your server is -- that's not going to happen.
And of course all of this is based on "standard" algorithms. But it doesn't change the fact that implementing them all to build a functional performant web crawler+database+search is a project that is larger than any one person could build from scratch.
It would only be feasible to build on top of existing libraries, which is the whole point here.
> It would only be feasible to build on top of existing libraries, which is the whole point here.
I don't think you've made a single point that substantiates that claim and neither has anyone else, really. The person you're arguing with has built (as per their own estimation) more from scratch when it comes to the actual search engine core than anything else in their own project (https://marginalia.nu/).
Honestly, it seems more like the people arguing a search engine is somehow "off limits" from scratch are doing so because they imagine it's less feasible, probably because they simply don't know or are in general pretty bad at implementing things from scratch, period.
To a large degree there is also a wide difference in actual project scope. Remember that we are comparing this to implementing Space Invaders, one of the simplest games you can make, and one that is doable in an evening even if you write your own renderer.
To talk about "webscale" (a term that is deliciously dissonant with most of the web world, which runs at a scale several hundred times slower and less efficient than it should) is like suddenly talking about implementing deferred shading for the Space Invaders game. Even if it were something you'd want to explore later, it's just not something you're doing right now, because this is a project you're using to learn as you go.
> 1 server to index the whole internet? I don't care how big your server is -- that's not going to happen.
Well let's do the math!
1 billion bytes is 1 GB. The average text size of a document is ~5 KB; that's the HTML, uncompressed. 1 billion documents is thus about 5 TB of raw HTML. In practice the index would be smaller, say 1 TB, because you're not indexing the raw HTML but a fixed-width representation of the words therein, typically with some form of compression as well. (In general, the index is significantly smaller than the raw data unless you're doing something strange.)
You could buy a thumb drive that would hold a search index for 1 billion documents. A real server uses enterprise SSDs for this stuff, but even so, you don't even have to dip into storage-server specs to outfit a single server with hundreds of terabytes of SSD storage.
> And of course all of this is based on "standard" algorithms. But it doesn't change the fact that implementing them all to build a functional performant web crawler+database+search is a project that is larger than any one person could build from scratch.
I'm still not seeing why this is impossible.
> It would only be feasible to build on top of existing libraries, which is the whole point here.
I don't think there are existing libraries for most of this stuff. Most off-the-shelf tools in the search space are built for indexing smaller corpora; they generally aren't designed to scale up to corpus sizes in the billions (e.g. they use 32-bit keys, etc.)
No, it doesn't require that. If you take all public web text, it's only something like 40 trillion characters, which could fit on a single hard disk uncompressed, including the suffix tree structure.
It's a joke based on needing to define "from scratch", similar to "How do you bake a cake from scratch? First you must create the universe." OP is gathering the raw ingredients to start fabricating his chips.
Back when online learn-to-code courses like Codecademy and Udemy were a fad, I remember that one of them (unfortunately I don't remember which, and Google, ironically, turns up nothing) taught how to build a search engine in Python from scratch as a first project for complete beginners. I thought it had a reasonable level of complexity for the task.
You can still find search-engine-from-scratch courses on Udemy, complete with all the necessary algorithms [1].
While NIH ultimately depends on multiple different factors, I would argue that, ironically, you're more likely to end up with an NIH-qualifying product with your approach.
Because when you write things from scratch, you actually have the space to innovate. But when you build a product out of pre-existing puzzle pieces, there is much less room to make the kind of useful changes that would make your product actually stand out from existing alternatives (which is what NIH is all about).
But a text editor, a Space Invaders clone, a tiny BASIC compiler, or a mini operating system is an afternoon project. A spreadsheet or a video game console emulator is maybe comparable to implementing Datalog.
>> The difference with a web search engine is that nobody today can build such a thing completely from scratch
Really? I think there are a couple out there. The main issue today is scale, but you could constrain that by limiting your crawler to a fixed set of sites.
I enjoy reading your discussion and just wanted to add that some people do write big-scale software from scratch nowadays: for instance, Marginalia for web search, and Andreas Kling and team for an operating system and web browser.
Marginalia, which I'm a big fan of, is not "written from scratch" (it would be stupid to do so). Check the project on GitHub; it has lots of third-party dependencies.
I definitely lean more toward NIH than what's conventionally considered wise, but most of the time it's not NIH for the sake of NIH.
I do pull in a lot of libraries, but an enormous amount of what the search engine does is very much built from scratch. The libraries generally deal with parsing common formats, compression, serialization, and various service glue like dependency injection. I think the number of explicit dependencies is a bit inflated by the choice not to use a framework like Spring Boot, which would pull in many of the same (or equivalent) dependencies implicitly.
What makes the search engine a search engine, the indexing software (all the way down to database primitives like B-trees), a large chunk of the language processing, and so forth: that's all bespoke. I think it needs to be. A lot of existing code just doesn't scale, or has too many customizations that would add unnecessary glue and complexity to my own code.
I'm going to echo SerenityOS Andreas and suggest that it's a skill like any other. If you shy away from building custom solutions to hard problems, you will never be good at it; and it will become a self-fulfilling prophecy that these NIH solutions are too hard to build.
At the same time, there's a time and a place, and you should indeed be judicious as to when to roll your own solutions; but maybe that time and place is exactly a hobby project like the ones suggested in this thread (which is how my search engine started out: a place to dick around with difficult problems).
I'd also add that being able to tackle problems yourself, rather than needing a library to do all the heavy lifting at all times, is a great enabler. Sometimes there is no adequate library, but that doesn't mean the conclusion has to be "welp, I guess we can't do that yet..."
Well, if I’m going to build a web search engine from scratch don’t I first need to write a compiler from scratch? Which means I first need to write an editor from scratch…
I encountered some pushback on this in my "code editor from the ground up" post [1]. I think the only reasonable definition of "from scratch" is:
Does not have domain-specific dependencies.
So a code editor based on ACE or CodeMirror would not be from scratch, obviously, but one that involves writing all of the domain-specific logic would be. Using generic libraries doesn't stop something being from scratch. (In my case Tree-sitter is arguably domain-specific, but an early version did use a hand-coded JavaScript tokeniser in its place.)
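To give an idea of what "hand-coded tokeniser" means in practice, here's a toy sketch in C (not the actual JavaScript code, which handled far more: comments, escapes, multi-character operators, and so on):

```c
#include <ctype.h>
#include <stdio.h>

/* Toy lexer: split a line of code into identifiers, numbers, strings,
 * and single-character punctuation. A real tokeniser handles comments,
 * escapes, and multi-character operators, but the loop looks like this. */
static void tokenize(const char *s)
{
    while (*s) {
        if (isspace((unsigned char)*s)) {
            s++;
        } else if (isalpha((unsigned char)*s) || *s == '_') {
            const char *start = s;
            while (isalnum((unsigned char)*s) || *s == '_')
                s++;
            printf("IDENT  %.*s\n", (int)(s - start), start);
        } else if (isdigit((unsigned char)*s)) {
            const char *start = s;
            while (isdigit((unsigned char)*s))
                s++;
            printf("NUMBER %.*s\n", (int)(s - start), start);
        } else if (*s == '"' || *s == '\'') {
            char quote = *s;
            const char *start = s++;
            while (*s && *s != quote)
                s++;
            if (*s)
                s++;                 /* consume the closing quote */
            printf("STRING %.*s\n", (int)(s - start), start);
        } else {
            printf("PUNCT  %c\n", *s);
            s++;
        }
    }
}

int main(void)
{
    tokenize("let x = greet('world') + 42;");
    return 0;
}
```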