Why Writing Your Own Search Engine Is Hard

iamelgringo · on Dec 25, 2010

Search is a gold mine, and I don't understand why there aren't more people diving in to building niche search engines. Sure you can't really compete with Google on size, but there's a lot of nooks and crannies online where you can pick up valuable search traffic around the edges.

At least, that's why I'm working on a search engine for financial news at http://Newsley.com/search. (We're focused on building the crawlers and the index right now. Search is _very_ alpha).

After reading this article, I feel validated for a bunch of the decisions that I've been making. I've been running on EC2, but their disk IO is slow as molasses. So, I'm starting to build servers and throw them in my garage. I'll be migrating to garage servers in the next few months. Pretty much everyone I talk to thinks running servers in your garage is a terrible idea, but I can't think of any way else to do this cheaper and still have control over my hardware. It's nice to read that I'm not crazy for thinking this.

It was also great to read that on early search engines, the bulk of the work is done by small teams. Being the only dev, at times I think I'm a bit crazy for trying to bootstrap a search startup. Again, it was nice to read that it's not all that crazy to try and do it on my own.

ljlolel · on Dec 25, 2010

She specifically says to go with slow disk I/O (not in-memory indices) because the most important thing is not to have to deal with failing servers.

This was written in 2004, and the equivalent statement would be that you should not deal with maintaining physical servers but have the cloud handle it so you can focus on algorithms/parallelism/runtime.

Also, unrelated to what you wrote, this is 2004, and statements like "people search for words not phrases" are frankly no longer true. Average query length is way up (even before Google Suggest and Instant), and people have been searching for phrases more and more since 2004.

She's completely right, but if she were to rewrite this today, in 2010, it would be an even longer article. A single-word search engine would not return acceptable results.
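To make the words-versus-phrases point concrete, here is a minimal sketch of a positional inverted index, which is one common way to support phrase queries. It is not taken from the article; the function names (`build_index`, `phrase_search`) and the sample documents are made up for illustration, and tokenization is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions]} so phrases can be checked."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted doc ids where the words of `phrase` appear consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Candidate documents must contain every word of the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in candidates:
        # A start position matches if word i sits at position start + i.
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

# Hypothetical sample documents: both contain "interest" and "rates",
# but only document 1 contains them as a phrase.
docs = {
    1: "the federal reserve raised interest rates",
    2: "interest in reserve rates raised by the federal bank",
}
index = build_index(docs)
print(phrase_search(index, "interest rates"))
```

A plain word-level index would return both documents for this query; storing positions is what lets the engine tell them apart, at the cost of a larger index.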

iamelgringo · on Dec 25, 2010

re: Cloud Storage vs Garage data center

Elastic Block Store is network-attached storage. This is why Elastic Block storage is two to three orders of magnitude slower than an SSD attached to a server in your garage. 1 EC2 compute unit is analogous to a 1.6 GHz 2005 Opteron, or performance-wise it's similar to a fast Atom processor. I can build a 4-core server with 8 GB of RAM, 120 GB of SSD and 2 TB of spinning disk for around $600. SSD latencies are low enough that they can be thought of as slow local memory.

So, for the cost of 3 to 4 months of medium/large EC2 instances, you can have roughly 8 times the processing power and 2 orders of magnitude more local memory/SSD.

By setting up a couple of cheap servers in my garage, I can get as much processing power as 20 to 30 EC2 instances. That means I can spend a lot more time worrying about coding and less worrying about SysOps.
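The break-even arithmetic behind that comparison can be sketched as below. The $600 server cost is the estimate above; the $150 and $200 monthly bills are assumptions picked to match the "3 to 4 months" figure, not quoted EC2 prices, and the calculation ignores power, bandwidth and admin time.

```python
def breakeven_months(hardware_cost, monthly_cloud_cost):
    """Months of cloud spend that equal a one-time hardware purchase."""
    if monthly_cloud_cost <= 0:
        raise ValueError("monthly_cloud_cost must be positive")
    return hardware_cost / monthly_cloud_cost

SERVER_COST = 600  # dollars, the garage server estimate above

# Assumed monthly instance bills (hypothetical, not real EC2 pricing).
for monthly in (150, 200):
    print(monthly, breakeven_months(SERVER_COST, monthly))
```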

re: single word search

You are absolutely right. As I said, our search is very much in alpha right now, and I really haven't spent much time at all on it.

Yrlec · on Dec 25, 2010

Have you looked into 80legs.com? It seems like a good service to use if you want to start your own niche search engine (I'm not affiliated with them, just curious).

iamelgringo · on Dec 25, 2010

I have. They provide crawlers. We have a curated set of news feeds that we scrape and index. It doesn't really fit what we want to do.

JusticeJones · on Dec 25, 2010

I don't mean to knock your project; I think it's cool. However, if you look at the improvements made to the existing search giants, it's all about solutions to social problems. Stuff like checking the latest score of a game, finding movie times, the weather, etc. Which to me suggests that the goldmine for modern search development is in creating glanceable results.

Just to be clear when I say 'social' I am referring to a general community, so social problems are really just annoyances that the majority of community members have to deal with. Sorry if that was a bit confusing; I'm not the most articulate person in the world.

bradleyland · on Dec 24, 2010

This is from 2004. A lot of the paper still applies, in principle, but I'd argue that there are far fewer people chomping at the bit to get into the search business these days. Now it's all "social" or "game" related.

mmaunder · on Dec 25, 2010

Agree, and I think there are a few big opportunities, e.g. a better job search engine. Search in some form is what most of us spend a huge amount of time doing every day. Even searching within web pages for the relevant content we're looking for.

Also, Google is barely keeping its head above the web spam, and filtering out spam is a very hard problem that may be better solved by blending in human intelligence. Before you disagree with me, consider that there are vast farms of humans creating SEO content and the content-gen business is growing extremely fast. It's the thing that scares Google the most in search.

gtani · on Dec 24, 2010

I'd say if you can provide a better vertical search, say for Java/Scala programmers, that's a worthwhile endeavor. Also, many folks think Solr / Lucene / Sphinx is the only game in town, but Riak and IndexTank certainly don't buy that.

brianobush · on Dec 25, 2010

SCSI disks are a dead giveaway to a date long past. As soon as I read that, I paged up to find the publish date.

rwmj · on Dec 24, 2010

I wonder what happened to the Internet Archive search tool she wrote (recall.archive.org)?

gojomo · on Dec 25, 2010

'Recall' was at the promising prototype quality level, with results more Cuil-like, centered around linguistic concepts detected during indexing, than Google-like. So for example the index was compact but there was no phrase-search, and even after receiving a page as a result to one query, a followup query with words on that page might return zero results if that exact set of words hadn't been concept-extracted.

It had nice graphs of concepts over time. But its cleverness was probably overkill for such visualizations, except to the extent (IIRC) it disambiguated similar concepts from context. As the recent Google Books-based word frequency tool has shown, simple word and n-gram counts provide plenty of similar value.
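For a sense of how little machinery plain n-gram counting needs, here is a minimal sketch. It is not Recall's code or the Google Books tool; `ngram_counts` is a made-up name and tokenization is just lowercasing and splitting on whitespace.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive words in `text`."""
    words = text.lower().split()
    # range() is empty when the text has fewer than n words.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

sample = "search is hard and search is fun"
bigrams = ngram_counts(sample, 2)
print(bigrams[("search", "is")])
```

Run per year of crawled text, counts like these are enough to plot a term's frequency over time, without any concept extraction.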

Ultimately, though, as a bit of advanced, non-open-source Lisp code by a single expert, there was no one appropriate to productionize, maintain, and improve it after she joined Google.

(I work at the Internet Archive.)

iwwr · on Dec 24, 2010

In other words, avoid spending money, refine your algorithms first. Faster machines may be tempting, but that makes scaling horribly expensive down the road.

jamesaguilar · on Dec 24, 2010

Following our own advice is hard. http://en.wikipedia.org/wiki/Cuil

ot · on Dec 25, 2010

From another article in the home page today (http://edition.cnn.com/2010/TECH/web/12/24/ex.google.employe...):

  But abandoning Google hasn't always proved to be a wise or
  permanent move. Anna Patterson left the company in 2007 to
  start a rival search engine called Cuil. When that tanked, she
  returned to Google in September.

puredemo · on Dec 25, 2010

... I can't believe they hired her back.

korussian · on Dec 25, 2010

Why wouldn't they? She proved she's awesome at difficult stuff. Ultimate failure didn't unprove that.

iwwr · on Dec 24, 2010

Wow, the same Anna Patterson. I don't understand how Cuil became such a disaster. Even early Google seemed alright.

jseliger · on Dec 24, 2010

I don't either, and I'd love to know who does and whether they've written about it.

p01nd3xt3r · on Dec 25, 2010

I think that Cuil became a disaster because Google existed. When Google came out it was easily 10x better than what was currently on the market. The same can't be said for Cuil.

korussian · on Dec 25, 2010

I think for most people writing a search engine is overkill when there are existing options out there.

If you want to search a subset of sites, then Google CSE is really all you need + whatever bells & whistles you'd like to add around it. I've done that here: http://searchESLCafe.com, adding "recent searches", search via wildcard subdomain (i.e. foo.searchESLCafe.com or bar.searchESLCafe.com or foo_bar.searchESLCafe.com, etc), and customizing the heck out of Google CSE's options.
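As a rough idea of how search via wildcard subdomain can work, the sketch below turns the Host header into a query string. This is not the site's actual code: `query_from_host` is a hypothetical helper, and treating underscores as spaces and ignoring `www` are assumptions.

```python
def query_from_host(host, base_domain="searcheslcafe.com"):
    """Extract a search query from a wildcard subdomain, or None if absent."""
    # Hostnames are case-insensitive; drop any port and trailing dot.
    host = host.lower().split(":")[0].rstrip(".")
    suffix = "." + base_domain.lower()
    if not host.endswith(suffix):
        return None
    label = host[: -len(suffix)]
    if not label or label == "www":
        return None
    # Assumption: underscores in the subdomain stand for spaces.
    return label.replace("_", " ")

print(query_from_host("foo_bar.searchESLCafe.com"))
```

The result would then be passed to the CSE as the query; a wildcard DNS record (`*.searchESLCafe.com`) is needed so that every subdomain resolves to the same server.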

Is there a demand out there for the search engine to parse the results into something informative at-a-glance? I'm not so sure it's the user's first priority. Or, to put it another way, there's plenty of hard-to-reach info out there that you can hand users via a customized Google CSE, and they don't mind doing the leg-work of clicking on the query results and finding their own answers.

It's a lot more important to have an accurate search algorithm than drill-down-related bells & whistles.

Google does a great job of returning solid results for any subset of sites, so why not let Google handle it, and concentrate on the other stuff?

joshbaptiste · on Dec 25, 2010

Heh.. wonder what yegg of DuckDuckGo thinks of this article.

known · on Dec 25, 2010

We can roll out our own Google search engine via http://aspseek.org

mixmax · on Dec 24, 2010

Application server is busy. Either there are too many concurrent requests or the server still is starting up

Apparently scaling is hard too.