> Literally all of the listed projects, text editors, compilers, operating systems, and ray tracers, can exercise the exact same activities.
In the linked article, these projects are all explicitly described as opportunities to learn about low-level stuff like how to efficiently store editable text. The difference with a web search engine is that nobody today can build such a thing completely from scratch, so it forces you to give up the toxic NIH mentality, which in my experience is usually driven by elitism and ego (on full display in this comment thread), not by some lofty desire to learn something new.
If you want to build a compiler from scratch, you must first invent the universe. Peeling back abstractions to see how things could or should work is perfectly fine, even for professionals.
Case in point: I've spent a year excising bloated frameworks from my stack at work and replacing the few corners of them we actually needed with, e.g., 50 lines of curl calls. The C compiles instantly, is tailored to our tiny use case, delivered the one related feature we wanted much faster, and removed chains of dependencies.
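For a sense of what "50 lines of curl calls" means, here's a rough sketch; the URL and the bare-bones error handling are placeholders, not our actual code:

```c
#include <stdio.h>
#include <curl/curl.h>

/* Minimal sketch: fetch a URL and write the response body to stdout.
 * Real code would add timeouts, headers, and a proper write callback,
 * but the shape is the same: a handful of libcurl calls, no framework. */
static int fetch(const char *url)
{
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, stdout);

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        fprintf(stderr, "curl: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    return res == CURLE_OK ? 0 : 1;
}

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    int rc = fetch("https://example.com/");   /* placeholder endpoint */
    curl_global_cleanup();
    return rc;
}
```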
Being reliant on far-too-abstract libraries and frameworks to do simple jobs is also a curse. But nobody at work had the experience to know that curl was sitting right there on our image, available for our use. And nobody had that experience because nobody had taken the time to build something from low-level libraries. Now we know how, and we can make an intelligent decision without defaulting in either direction because we were afraid to try.
20,000 fewer lines of code, faster compile/deploy times, significantly less interfacing and translation between "their" types and "their" APIs, and much better control over types and structure across our codebase, because we no longer needed "their" types and structure anywhere. They had crept everywhere.
It's just faster, cleaner, and easier in a few cases to do precisely what you need right now, rather than anticipate a million things you might need and refactor your code to adopt a given "solution".
Yeah, the problem / benefit is never the one framework you added or removed; it's the mindset of reaching for another dependency by default, which leads to a behemoth no one enjoys working with.
I also think a person who has had to strip out or replace dependencies is more likely to be careful when choosing them in the future. This is definitely the case for me, at least.
You know those car guys who will rebuild their engine just because? Or those retro computer guys who will recap an ancient board rather than buy a modern PC?
For you it's a job; for me it's a hobby. I have no interest in making something 'professional'; I want to take it apart to understand how it works. Want to understand how a text editor works? Write one.
What component of that is ego?
It seems to me you're the elitist: you're not far off saying mere users shouldn't be allowed to modify their own software, or to install software that hasn't been okayed by the people who know what they're doing.
Your lack of constraints on personal exploration in software is interpreted as hubris by people who came to software seeking high-paying, regimented recipe-following.
Assuming you get the TCP/IP stack for free, you still need to build fully-featured HTTPS and a "webscale" multi-server database for document storage from scratch. The crawler is easy and so is something like PageRank, but then building the sharded keyword text search engine itself that operates at webscale is a whole other project...
The point is that it's too much work for a single person to build all the parts themselves. It's only feasible if you rely on pre-existing HTTPS libraries, databases, and text search technologies.
Simple methods of search like exact matching are very fast using textbook algorithms. There are well-known structures, like suffix trees, that can search millions of documents in milliseconds.
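To make that concrete, here's a toy sketch of the idea using a suffix array instead of a suffix tree (same principle, fits in a few lines; the naive construction below is fine for a demo, not for a web-sized corpus):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy exact-match search: sort all suffixes of the text once, then answer
 * queries by binary-searching for the pattern as a prefix of some suffix.
 * Construction here is naive (qsort of suffix offsets); real systems use
 * linear-time builders and compressed indexes. */

static const char *text;

static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

static int *build_suffix_array(const char *t, int n)
{
    int *sa = malloc(n * sizeof *sa);
    text = t;
    for (int i = 0; i < n; i++)
        sa[i] = i;
    qsort(sa, n, sizeof *sa, cmp_suffix);
    return sa;
}

/* Returns the offset of one occurrence of pat in t, or -1 if absent. */
static int find(const char *t, const int *sa, int n, const char *pat)
{
    int lo = 0, hi = n - 1, m = strlen(pat);
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int c = strncmp(pat, t + sa[mid], m);
        if (c == 0)
            return sa[mid];
        if (c < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}

int main(void)
{
    const char *t = "she sells sea shells by the sea shore";
    int n = strlen(t);
    int *sa = build_suffix_array(t, n);
    printf("'shells' found at offset %d\n", find(t, sa, n, "shells"));
    free(sa);
    return 0;
}
```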
That's not enough. It needs to be sharded and handle things like 10 search terms, each of which matches a million documents, while you're trying to find the intersection. Across sharded results from 20 servers. Quickly.
Intersecting posting lists is a solved problem. You can do it in sublinear time with standard search engine algorithms.
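For what it's worth, the core of it is small. Here's a sketch of the classic gallop-then-binary-search intersection of two sorted posting lists; skip pointers and compression are what real engines layer on top, and the doc ids below are made up:

```c
#include <stdio.h>

/* Smallest index i in [start, n) with a[i] >= key; returns n if none exists. */
static int gallop(const int *a, int n, int start, int key)
{
    if (start >= n || a[start] >= key)
        return start;
    /* Exponentially widen the probe window until it brackets the answer... */
    int step = 1, lo = start, hi = start + 1;
    while (hi < n && a[hi] < key) {
        lo = hi;
        step *= 2;
        hi = lo + step;
    }
    if (hi > n)
        hi = n;
    /* ...then binary search inside (lo, hi]. */
    lo = lo + 1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Intersect two sorted doc id lists, writing matches to out; returns the count.
 * Cost is roughly O(|small| * log |big|), which is what makes intersecting a
 * rare term with a very common one cheap. */
static int intersect(const int *small, int ns, const int *big, int nb, int *out)
{
    int j = 0, count = 0;
    for (int i = 0; i < ns && j < nb; i++) {
        j = gallop(big, nb, j, small[i]);
        if (j < nb && big[j] == small[i])
            out[count++] = small[i];
    }
    return count;
}

int main(void)
{
    /* Made-up posting lists of doc ids for a rare term and a common term. */
    int rare[]   = { 3, 9, 14, 27, 90 };
    int common[] = { 1, 2, 3, 5, 8, 9, 12, 14, 20, 27, 33, 40, 55, 90, 91 };
    int out[5];
    int n = intersect(rare, 5, common, 15, out);
    for (int i = 0; i < n; i++)
        printf("%d ", out[i]);       /* prints: 3 9 14 27 90 */
    printf("\n");
    return 0;
}
```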
The problem space is embarrassingly parallel, so sharding is no problem. Although, realistically, you probably only need one server to cope with the load and storage needs. This isn't 2004. Servers are big and fast now, as long as you don't try to use cloud compute.
1 server to index the whole internet? I don't care how big your server is -- that's not going to happen.
And of course all of this is based on "standard" algorithms. But it doesn't change the fact that implementing them all to build a functional performant web crawler+database+search is a project that is larger than any one person could build from scratch.
It would only be feasible to build on top of existing libraries, which is the whole point here.
> It would only be feasible to build on top of existing libraries, which is the whole point here.
I don't think you've made a single point that substantiates that claim and neither has anyone else, really. The person you're arguing with has built (as per their own estimation) more from scratch when it comes to the actual search engine core than anything else in their own project (https://marginalia.nu/).
Honestly, it seems more like the people arguing a search engine is somehow "off limits" from scratch are doing so because they imagine it's less feasible, probably because they simply don't know or are in general pretty bad at implementing things from scratch, period.
To a large degree there is also a wide difference in actual project scope. Remember that we are comparing this to implementing Space Invaders, one of the simplest games you can make, and one that is doable in an evening even if you write your own renderer.
To talk about "webscale" (a term that is deliciously dissonant with most of the web world, which runs at a scale several hundred times slower and less efficient than it should) is like suddenly talking about implementing deferred shading for the Space Invaders game. Even if it were something you'd want to explore later, it's just not something you're doing right now, because this is a project you're using to learn as you go.
> 1 server to index the whole internet? I don't care how big your server is -- that's not going to happen.
Well let's do the math!
1 billion bytes is 1 GB. The average text size of a document is ~5 KB; that's the HTML, uncompressed. 1 billion documents is thus about 5 TB of raw HTML. In practice the index would be smaller, say 1 TB, because you're not indexing the raw HTML but a fixed-width representation of the words therein, typically with some form of compression as well. (In general, the index is significantly smaller than the raw data unless you're doing something strange.)
You could buy a thumb drive that would hold a search index for 1 billion documents. A real server uses enterprise SSDs for this stuff, but even so, you don't even have to dip into storage-server specs to outfit a single server with hundreds of terabytes of SSD storage.
> And of course all of this is based on "standard" algorithms. But it doesn't change the fact that implementing them all to build a functional performant web crawler+database+search is a project that is larger than any one person could build from scratch.
I'm still not seeing why this is impossible.
> It would only be feasible to build on top of existing libraries, which is the whole point here.
I don't think there are existing libraries for most of this stuff. Most off-the-shelf tools in the search space are built for indexing smaller corpora; they generally aren't designed to scale up to corpus sizes in the billions (e.g. they use 32-bit keys, etc.)
No, it doesn't require that. If you take all public web text, it's only something like 40 trillion characters, which could fit on a single hard disk uncompressed, including the suffix tree structure.
It's a joke based on needing to define "from scratch", similar to "How do you bake a cake from scratch? First you must create the universe." OP is gathering the raw ingredients to start fabricating his chips.
Back when online learn-to-code courses like Codecademy and Udemy were a fad, I remember that one of them (unfortunately I don't remember which, and Google, ironically, turns up nothing) taught how to build a search engine in Python from scratch as a first project for complete beginners. I thought it had a reasonable level of complexity for the task.
You can still find search-engine-from-scratch courses on Udemy, complete with all the necessary algorithms [1].
While NIH ultimately depends on multiple different factors, I would argue that, ironically, you're more likely to end up with an NIH-qualifying product with your approach.
Because when you write things from scratch, you actually have the space to innovate. But when you build a product out of pre-existing puzzle pieces, there is much less room to make the kind of useful changes that would make your product actually stand out from existing alternatives (which is what NIH is all about).
But a text editor, a Space Invaders clone, a tiny BASIC compiler, or a mini operating system is an afternoon project. A spreadsheet or a video game console emulator is maybe comparable to implementing Datalog.
>> The difference with a web search engine is that nobody today can build such a thing completely from scratch
Really? I think there are a couple out there. The main issue today is scale, but you could constrain that by limiting your crawler to a fixed set of sites.
I enjoy reading your discussion and just wanted to add that some people do write big-scale software from scratch nowadays: for instance, Marginalia for web search, and Andreas Kling and team for an operating system and web browser.
Marginalia, which I'm a big fan of, is not "written from scratch" (it would be stupid to do so). Check the project on GitHub; it has lots of third-party dependencies.
I definitely lean more toward NIH than what's conventionally considered wise, but most of the time it's not NIH for the sake of NIH.
I do pull in a lot of libraries, but an enormous amount of what the search engine does is very much built from scratch. The libraries generally deal with parsing common formats, compression, serialization, and various service glue like dependency injection. I think the number of explicit dependencies is a bit inflated by the choice not to use a framework like Spring Boot, which would pull in many of the same (or equivalent) dependencies implicitly.
What makes the search engine a search engine, the indexing software (all the way down to database primitives like B-trees), a large chunk of the language processing, and so forth: that's all bespoke. I think it needs to be. A lot of existing code just doesn't scale, or has too many customizations that would add unnecessary glue and complexity to my own code.
I'm going to echo SerenityOS Andreas and suggest that it's a skill like any other. If you shy away from building custom solutions to hard problems, you will never be good at it; and it will become a self-fulfilling prophecy that these NIH solutions are too hard to build.
At the same time, there's a time and a place, and you should indeed be judicious as to when to roll your own solutions; but maybe that time and place is exactly a hobby project like the ones suggested in this thread (which is how my search engine started out: a place to dick around with difficult problems).
I'd also add that being able to tackle problems yourself, rather than needing a library to do all the heavy lifting at all times, is a great enabler. Sometimes there is no adequate library, but that doesn't mean the conclusion has to be "welp, I guess we can't do that yet..."
Well, if I’m going to build a web search engine from scratch don’t I first need to write a compiler from scratch? Which means I first need to write an editor from scratch…
I encountered some pushback on this in my "code editor from the ground up" post [1]. I think the only reasonable definition of "from scratch" is:
Does not have domain-specific dependencies.
So a code editor based on ACE or CodeMirror would not be from scratch, obviously, but one that involves writing all of the domain-specific logic would be. Using generic libraries doesn't stop something being from scratch. (In my case Tree-sitter is arguably domain-specific, but an early version did use a hand-coded JavaScript tokeniser in its place.)
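To give an idea of what "hand-coded tokeniser" means in practice, here's a toy sketch in C (not the actual JavaScript code, which handled far more: comments, escapes, multi-character operators, and so on):

```c
#include <ctype.h>
#include <stdio.h>

/* Toy lexer: split a line of code into identifiers, numbers, strings,
 * and single-character punctuation. A real tokeniser handles comments,
 * escapes, and multi-character operators, but the loop looks like this. */
static void tokenize(const char *s)
{
    while (*s) {
        if (isspace((unsigned char)*s)) {
            s++;
        } else if (isalpha((unsigned char)*s) || *s == '_') {
            const char *start = s;
            while (isalnum((unsigned char)*s) || *s == '_')
                s++;
            printf("IDENT  %.*s\n", (int)(s - start), start);
        } else if (isdigit((unsigned char)*s)) {
            const char *start = s;
            while (isdigit((unsigned char)*s))
                s++;
            printf("NUMBER %.*s\n", (int)(s - start), start);
        } else if (*s == '"' || *s == '\'') {
            char quote = *s;
            const char *start = s++;
            while (*s && *s != quote)
                s++;
            if (*s)
                s++;                 /* consume the closing quote */
            printf("STRING %.*s\n", (int)(s - start), start);
        } else {
            printf("PUNCT  %c\n", *s);
            s++;
        }
    }
}

int main(void)
{
    tokenize("let x = greet('world') + 42;");
    return 0;
}
```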