Search is a gold mine, and I don't understand why there aren't more people diving into building niche search engines. Sure, you can't really compete with Google on size, but there are a lot of nooks and crannies online where you can pick up valuable search traffic around the edges.
At least, that's why I'm working on a search engine for financial news at http://Newsley.com/search. (We're focused on building the crawlers and the index right now. Search is _very_ alpha).
After reading this article, I feel validated for a bunch of the decisions that I've been making. I've been running on EC2, but their disk IO is slow as molasses. So, I'm starting to build servers and throw them in my garage. I'll be migrating to garage servers in the next few months. Pretty much everyone I talk to thinks running servers in your garage is a terrible idea, but I can't think of any other way to do this cheaper and still have control over my hardware. It's nice to read that I'm not crazy for thinking this.
It was also great to read that on early search engines, the bulk of the work is done by small teams. Being the only dev, at times I think I'm a bit crazy for trying to bootstrap a search startup. Again, it was nice to read that it's not all that crazy to try and do it on my own.
She specifically says to go with slow disk I/O (not in-memory indices) because the most important thing is not to have to deal with failing servers.
This was written in 2004, and the equivalent statement would be that you should not deal with maintaining physical servers but have the cloud handle it so you can focus on algorithms/parallelism/runtime.
Also, unrelated to what you wrote, this is 2004, and statements like "people search for words not phrases" are frankly no longer true. Average query length is way up (even before Google Suggest and Instant) and people have been searching for phrases more and more since 2004.
She's completely right, but if she were to rewrite this today, in 2010, it would be an even longer article. A single-word search engine would not return acceptable results.
Elastic Block storage is NAS. This is why Elastic Block storage is two to three orders of magnitude slower than SSD attached to a server in your garage. 1 EC2 compute unit is analogous to a 1.6 GHz 2005 Opteron, or performance-wise it's similar to a fast Atom processor. I can build a 4 core server with 8 Gigs of RAM and 120GB of SSD and 2 TB of spinning disk for around $600. SSD latencies are low enough that they can be thought of as slow local memory.
So, for the cost of 3 to 4 months of medium or large EC2 instances, you can have roughly 8 times the processing power and 2 orders of magnitude more local memory/SSD.
By setting up a couple of cheap servers in my garage, I can get as much processing power as 20 to 30 EC2 instances. That means I can spend a lot more time worrying about coding and less worrying about SysOps.
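To make the break-even claim above concrete, here is a rough back-of-the-envelope sketch. It uses only the numbers quoted in this comment ($600 server, 3 to 4 months of EC2 spend) and works backwards to the hourly EC2 rate that claim implies. The 30-day, always-on month is my assumption, and no actual EC2 price list is used.

```python
# Figures come from the comment above; nothing here is a quoted EC2 price.
SERVER_COST_USD = 600        # 4 cores, 8 GB RAM, 120 GB SSD, 2 TB spinning disk
BREAK_EVEN_MONTHS = (3, 4)   # "3 to 4 months" of EC2 spend
HOURS_PER_MONTH = 24 * 30    # assumption: 30-day month, instance running 24/7


def implied_hourly_rate(server_cost, months, hours_per_month=HOURS_PER_MONTH):
    """Hourly EC2 spend at which `months` of rental equals the server cost."""
    return server_cost / (months * hours_per_month)


def break_even_months(server_cost, hourly_rate, hours_per_month=HOURS_PER_MONTH):
    """Months of EC2 rental at `hourly_rate` that add up to the server cost."""
    return server_cost / (hourly_rate * hours_per_month)


if __name__ == "__main__":
    for months in BREAK_EVEN_MONTHS:
        rate = implied_hourly_rate(SERVER_COST_USD, months)
        print(f"break-even at {months} months implies about ${rate:.2f}/hour")
```

This ignores power, bandwidth, and replacement hardware for the garage box, so treat it as a lower bound on the real break-even time rather than a full cost model.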
re: single word search
You are absolutely right. As I said, our search is very much in alpha right now, and I
Have you looked into 80legs.com? It seems like a good service to use if you want to start your own niche search engine (I'm not affiliated with them, just curious).
I don't mean to knock your project, I think it's cool. However, if you look at the improvements made to the existing search giants, it's all about solutions to social problems. Stuff like checking the latest score of a game, finding movie times, the weather, etc. Which to me suggests that the goldmine for modern search development is in creating glanceable results.
Just to be clear when I say 'social' I am referring to a general community, so social problems are really just annoyances that the majority of community members have to deal with. Sorry if that was a bit confusing; I'm not the most articulate person in the world.
This is from 2004. A lot of the paper still applies, in principle, but I'd argue that there are far fewer people chomping at the bit to get into the search business these days. Now it's all "social" or "game" related.
Agree, and I think there are a few big opportunities, e.g. a better job search engine. Search in some form is what most of us spend a huge amount of time doing every day. Even searching within web pages for the relevant content we're looking for.
Also, Google are barely keeping their head above the web spam, and filtering out spam is a very hard problem which may be better solved with a blend of human intelligence. Before you disagree with me, consider that there are vast farms of humans creating SEO content and the content-gen business is growing extremely fast. It's the thing that scares Google the most in search.
I'd say if you can provide a better vertical search, say for Java/Scala programmers, that's a worthwhile endeavor. Also, many folks think Solr / Lucene / Sphinx is the only game in town, but Riak and IndexTank certainly don't buy that.
'Recall' was at the promising prototype quality level, with results more Cuil-like, centered around linguistic concepts detected during indexing, than Google-like. So for example the index was compact but there was no phrase-search, and even after receiving a page as a result to one query, a followup query with words on that page might return zero results if that exact set of words hadn't been concept-extracted.
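For contrast with the concept-based index described above, here is a minimal sketch of the conventional alternative: a positional inverted index, which is what makes phrase search possible. This is not how 'Recall' worked. The function names and the two sample documents are made up for illustration, and tokenization is just lowercase whitespace splitting.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return the ids of documents containing the words of `phrase` adjacently."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    # Only documents that contain every word can contain the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = set()
    for doc_id in candidates:
        # The phrase matches if word i sits exactly i positions after the first word.
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


if __name__ == "__main__":
    docs = {  # hypothetical sample documents
        1: "financial news search engine",
        2: "search for financial engine news",
    }
    index = build_index(docs)
    print(phrase_search(index, "financial news"))
    print(phrase_search(index, "engine news"))
```

Storing positions is what costs the extra index space. An index that keeps only extracted concepts can be much more compact, at the price of the zero-result follow-up queries described above.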
It had nice graphs of concepts over time. But, its cleverness was probably overkill for such visualizations, except to the extent (IIRC) it disambiguated similar concepts from context. As the recent Google Books-based word frequency tool has shown, simple word and n-gram counts provide plenty of similar value.
Ultimately, though, as a bit of advanced, non-open-source Lisp code by a single expert, there was no one appropriate to productionize, maintain, and improve it after she joined Google.
In other words, avoid spending money, refine your algorithms first. Faster machines may be tempting, but that makes scaling horribly expensive down the road.
But abandoning Google hasn't always proved to be a wise or
permanent move. Anna Patterson left the company in 2007 to
start a rival search engine called Cuil. When that tanked, she
returned to Google in September.
I think that Cuil became a disaster because Google existed. When Google came out it was easily 10x better than what was currently on the market. The same can't be said for Cuil.
I think for most people writing a search engine is overkill when there are existing options out there.
If you want to search a subset of sites, then Google CSE is really all you need + whatever bells & whistles you'd like to add around it. I've done that here: http://searchESLCafe.com, adding "recent searches", search via wildcard subdomain (i.e. foo.searchESLCafe.com or bar.searchESLCafe.com or foo_bar.searchESLCafe.com, etc), and customizing the heck out of Google CSE's options.
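The "search via wildcard subdomain" trick could look roughly like the sketch below: with a wildcard DNS entry pointing every subdomain at the same app, the app reads the Host header and turns the subdomain into the query it hands to the search box. This is a guess at the idea, not the actual searchESLCafe.com code. The function name is hypothetical, and treating an underscore as a space (foo_bar becomes "foo bar") is my assumption.

```python
def query_from_host(host, base_domain="searcheslcafe.com"):
    """Turn 'foo_bar.searchESLCafe.com' into the search query 'foo bar'.

    Returns None when the host is the bare domain, 'www', a nested
    subdomain, or a different site entirely.
    """
    host = host.lower().split(":")[0].rstrip(".")  # drop any port and trailing dot
    suffix = "." + base_domain.lower()
    if not host.endswith(suffix):
        return None
    label = host[: -len(suffix)]
    if not label or label == "www" or "." in label:
        return None
    # Assumption: underscores in the subdomain stand in for spaces.
    return label.replace("_", " ")


if __name__ == "__main__":
    for host in ("foo.searchESLCafe.com", "foo_bar.searchESLCafe.com", "searchESLCafe.com"):
        print(host, "->", query_from_host(host))
```

The returned string would then be passed to whatever search backend sits behind the page, Google CSE in this case.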
Is there a demand out there for the search engine to parse the results into something informative at-a-glance? I'm not so sure it's the user's first priority. Or, to put it another way, there's plenty of hard-to-reach info out there that you can hand users via a customized Google CSE, and they don't mind doing the leg-work of clicking on the query results and finding their own answers.
It's a lot more important to have an accurate search algorithm than drill-down-related bells & whistles.
Google does a great job of returning solid results for any subset of sites, so why not let Google handle it, and concentrate on the other stuff?