IBM Watson Acquires Technology from Blekko (asmarterplanet.com)
106 points by boyter on March 29, 2015 | 69 comments



IBM needs an up-to-date, web-scale ontology for Watson (Blekko has one). Watson has used Freebase (among others), the largest open collaborative knowledge base, which is closing down on March 31, 2015: http://www.freebase.com , http://en.wikipedia.org/wiki/Freebase

Google acquired Metaweb, the company behind Freebase, in 2010, and Freebase data powers their internal Knowledge Graph. Microsoft bought Powerset in 2008.

Freebase has 2,751,614,700 facts; Wikidata has 13,788,746. Wikidata may import some data from Freebase, but due to its stricter guidelines (notability guideline...) many facts about minor entities will be lost or never migrated. A Freebase dump won't age well; in a lot of cases, up-to-date facts from the real world are required.

Maybe some community project can rescue Freebase before it is too late?

@downvotes: ?


I totally agree; I believe many people vastly underestimate the usefulness of Freebase as an open-source ontology. Starting tomorrow there simply is no real alternative to Freebase, and so many great applications are no longer possible :(


What do you mean by "ontology"?


We need more alternatives to Google, not fewer. On the one hand, great for Blekko to have an exit of sorts; on the other, a huge loss for the web. Now it's more or less down to DuckDuckGo and the likes of gigablast.com (which is now open source).


True, but there are some systemic issues that make that pretty challenging. Google is now paying about $4B a year to people to host (or white label) a search box that goes back to Google. Then if you want to offer advertising on your search you generally have to get it from Google too, but that isn't as valuable as it once was and that is a problem for Google.

But perhaps the most interesting thing I've learned from my 5 years at Blekko doing search is that "phase 2 or 3" of the Internet is here. Crawling everything is a waste of resources because 95% of the "new" stuff coming online on the Internet is not information, it is just spam. Combine that with advertiser burnout from getting scammed again and again on advertising that claims to generate leads or sales but leads only to click farms. The whole ecosystem of the web is being rocked, along with media distribution. The world is approaching some sort of climactic shift of orientation.

Blekko's key mission has always been to try to find the needles in this exponentially growing pile of hay. And it is something that the folks at Watson really liked about our technology when we first met at their outreach program to connect with startups. That is what led to their asking us to join them, and no, they weren't particularly interested in the stuff we had done to provide more topical advertising signals. So as a technologist this has been both a validation of our work on finding the real information in the web and making it useful to people, and, when I'm honest with myself, a welcome step away from working on still more advertising technology.

Having the resources to pursue that, and an engine (Watson) that can put it to use, seems pretty exciting.


The web turned out that way because of Google; until Google came along, the worst we had was on-page keyword spam. Google assigned a value to the links that make up the web, and as a result those links were spammed. Hopefully Watson will be able to create a search feature that cannot be spammed, but so far anything that ends up being a metric that web pages are rated by ends up being spammed.

Best of luck! At least you're fighting the good fight.


Poor Bing. Hundreds of millions spent on advertising and still less mindshare on Hacker News than Gigablast.


Gigablast is alive and well at Diffbot! (founder here)

https://gigaom.com/2013/09/10/diffbot-brings-big-time-search...


Neat! Thanks for all the hard work over the years.


and gigaom is dead :(


Which is odd, because this is the first time I've heard of Gigablast. If that isn't a name straight out of the 90's...


Gigablast is a testimony to how productive one driven programmer can be. Matt did an amazing job of building and running it.


You don't get people to use your search engine by spending on advertising.


What's Bing?


Agreed. There are only a few players with a large web index now. Google, Bing, Yandex, Gigablast, Baidu, Naver.

Others that I can think of are IXQuick (not sure if they use their own index or not) and Yioop (smallish index). There really is a lack of large players indexing the web.

It's also possible that we will see the rise of niche search engines, such as iconfinder.


There are claims that Google indexes 150B pages today. Even if you settle for 1/3rd of that and take into account that the average page size for HTML alone is 200KB, it is evident that you need tens of millions of dollars for hardware alone. Remember that fetching pages, storing them, processing them, etc. all cost lots of money, and building even a semi-scalable index with in-memory and SSD-based layers is super expensive. And we haven't even gotten to the major cost factor, i.e. human employees.
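
As a rough back-of-envelope sketch of that arithmetic (the 1/3-of-150B and 200KB figures come from the paragraph above; the per-TB storage price is an assumed, illustrative number):

    # Back-of-envelope storage estimate for raw HTML only.
    pages = 150e9 / 3           # settle for 1/3 of the rumored 150B-page index
    avg_page_bytes = 200e3      # ~200KB of HTML per page

    raw_html_pb = pages * avg_page_bytes / 1e15
    print(f"raw HTML alone: ~{raw_html_pb:.0f} PB")             # ~10 PB, before any index structures

    cost_per_tb_month = 30      # assumed $/TB/month for replicated, served storage
    monthly_millions = raw_html_pb * 1000 * cost_per_tb_month / 1e6
    print(f"storage alone: ~${monthly_millions:.1f}M/month")    # ~$0.3M/month, before compute or people

And that is just holding the pages; it says nothing about serving queries over them.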

Today's search engines are not just indexes of web pages. They give direct answers for "when did Lincoln die", they show detailed street and satellite maps with Street View in search results, they act as a business directory, they act as a people finder, they have elaborate image search (again, super expensive to do), they recognize objects in images, they have the freshest news search across thousands of sources, they can do video search, they have scans of hundreds of thousands of books, they catalog millions of products, and so on. Even getting some of this high-quality data, such as satellite maps, books, business listings, product catalogues, etc., would cost ~$100 million in licensing deals.

Even if you got the world's most productive programmers, you would need at least 300 people by my most minimal estimate, and more than 2-4 years, before you could build anything with a non-negligible chance of competing. That's about $75M of cost per year right there at the ultra-minimal end. Of course, this assumes you already have a bigger breakthrough than PageRank and that you can beat state-of-the-art machine learning and natural language processing techniques. Hopefully you can now see that, for all intents and purposes, the search business is closed to attack from startups.

I can imagine Facebook and Apple will sooner or later get into this business, but it would be an uphill battle for them, primarily because of the lack of talent the search engine business requires and, more importantly, the lack of data that only Google has for billions of queries and the users making them. You can build an AirBnB with smart college hires, but building search engines needs the true cream of the crop: a rare blend of computer scientist and part-time mathematician who is also an exceptionally productive applied programmer. Google has been working for years to sweep up pretty much all the talent in this area, and it would take most competitors significant effort to build that kind of army.

PS: DuckDuckGo is not a "real" search engine. It has very little of its own index, and for most queries it just rearranges results fetched mostly from Bing while inserting links from its own little index. If Bing shuts them out for leeching off of them, they would be toast.


A serious contender to Google isn't going to be a better Google (or even a comparable one), exactly for the reasons you mention. It's going to be something else altogether.

I can't tell you what the answer is, but I am pretty sure that it's outside the box you've built yourself into.


Great write-up and breakdown of why search is hard and what a unique position Google is in. That said, searching the entire web as a business isn't very feasible for most companies now, and hence most of the other giants competing with Google instead use the asymmetric approach of bringing content into their own garden and building a specialized search engine around it. Since the data in their own garden is structured, the amount of work to build an acceptable-quality search engine is relatively small. See Apple's App Store, Facebook search, Twitter's search, Amazon's A9, etc.


A good reason to use and support Common Crawl - a good resource that is updated (I think) every few months.


Cuil (ha!) donated most of their crawled data to the Internet Archive - so with that plus Common Crawl, it's pretty easy for anyone to have a sizeable index.

What neither of those gives you is an up-to-date index, which is why small search engines still need Yahoo's and Yandex's APIs. I'm not sure any free resource can match the speed at which big companies can index the web.


Common Crawl is more interesting for research purposes than for a real search engine business. It refreshes only 4 times a year and has only 2 billion pages. Anything under ~10B pages is most likely to be called a "research" or "toy" index. You have almost zero chance of building a competitive index if you are below ~15B. For reference, Google is rumored to crawl about 150B pages. There are an estimated 1.2T pages out there that do not get crawled for various reasons.


I don't think you need that many pages - most of the 150B pages will never be displayed on a SERP.


True, you just need a subset. Now how do you identify that subset without indexing the pages to find out whether each page is in the subset you need?

IIRC Google used to scan different pages at very different frequencies. Quite possibly because it assigns pages to subsets every time it indexes.


Perhaps. The problem is that crawling the web isn't that hard. Foreach(links) { get(link)} is all you really need. Doing it multiple times is harder though.
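
A minimal sketch of that naive loop (Python, purely illustrative; the seed URL and page limit are placeholders):

    from collections import deque
    from html.parser import HTMLParser
    import urllib.request

    class LinkParser(HTMLParser):
        """Collects absolute http(s) links from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs
                               if k == "href" and v and v.startswith("http")]

    def crawl(seed, limit=100):
        """foreach(links) { get(link) } -- breadth-first, with a visited set."""
        seen, queue = set(), deque([seed])
        while queue and len(seen) < limit:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                page = urllib.request.urlopen(url, timeout=5)
                html = page.read().decode("utf-8", errors="ignore")
            except Exception:
                continue            # dead links, timeouts, non-HTML, etc.
            parser = LinkParser()
            parser.feed(html)
            queue.extend(parser.links)
        return seen

The loop itself is trivial; doing it continuously, at scale, and without drowning in junk is where the real cost is.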

The real problem imho is building a distributed index and fighting spam. Both are incredibly hard problems to do well and very expensive. Hence so few are trying to do it.


If you crawled the web using Foreach(links) { get(link)}, you would end up crawling useless automatically generated webpages.


You say that like it doesn't happen all the time. googlebot included.


I think providing quality results is the challenge. It takes certain algorithms and optimizations to do so in a timely fashion. This is where AI projects like Watson really come into play, as they are all about retrieving the correct information (a difficult problem on its own) in a timely fashion.


Also true, however I think a basic ranking algorithm should still be fine for most cases if you have 100% spam eradication. Good content left over should bubble to the top.

I tried building a small search engine with a very basic algorithm and it worked very well for 90% of searches.

I agree with you in part, but I also believe that if you got the other parts 100% right you would have a respectable engine in itself.


Ixquick is sort of like a proxy for Google search results. It's run by the same people who run StartPage.com


They claim not to track users; any idea how they make money?


They show keyword-based ads. Same as DuckDuckGo and other search sites that claim to not track users.

A search engine doesn't need to track people to make money, they just make much more money if they target ads better by tracking.


You don't have to track users to make money. It's just a way of making more money, hence why it happens.

DuckDuckGo makes a business of not tracking users, and I believe it made over $1.2 million last year, judging by its donations.


Where did you manage to find their donation numbers from last year? I thought they'd stopped disclosing their donation amounts a few years ago (even then they were in the 6 figures for revenue).


Cannot remember, sorry. The numbers were published as 6 donations of $25,000 to different organisations. I got the link from Twitter. I think one of the groups was a women's interest group, if that helps.


Ah right no worries. They used to post the donation breakdown on https://duck.co/ but the last one I could find was in 2011.


Perhaps I should try building a search engine. It can't be that hard.


It really isn't (been there, done that). The hard part is operating it. You're going to need quite a bit of hardware (figure about 10 large machines minimum to store an index of any appreciable size) and a ton of bandwidth to do the crawling. Then you're going to have to somehow get an audience large enough to earn enough money to pay for it all. Your runway will be in the tens of millions of bucks at the size of today's internet.


Yeah, the reason I said that was because if a guy can read books and learn how to build rockets then I should be able to figure out search.


Blekko was the only search engine I knew of that provided a no-BS unlimited search API. You just had to add '/json' to any request, and you got their results in JSON.
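
For reference, a call probably looked something like the sketch below; the '/ws/' path and 'q' parameter are reconstructed from memory and should be treated as assumptions, only the '/json' slashtag comes from the comment above.

    import json
    import urllib.parse
    import urllib.request

    # Hypothetical reconstruction of a Blekko '/json' API call.
    # The '/ws/' path and 'q' parameter are assumptions; only appending
    # the '/json' slashtag to the query comes from the comment above.
    query = urllib.parse.quote("ibm watson /json")
    url = "http://blekko.com/ws/?q=" + query
    results = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))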


I remember having an email conversation a long time ago with someone who wrote their own search engine in C++. I think that system was Blekko. Congratulations to Blekko for a (hopefully) nice exit.

IBM Watson absorbed AlchemyAPI a few weeks ago.

I helped a friend's company integrate IBM Watson into their product, and I have mixed feelings about IBM Watson: plenty of potential, but some rough edges.


C++? No way, Blekko's done in Perl. See the YAPC slides: http://www.pbm.com/~lindahl/blekko-yapc-na-2013-2.pdf


We have a lot of C/C++ in the guts of our software, it's a lot more than just the million lines of Perl!


Any reason for the use of Perl both in this and in DDG? I'm a recent Comp Sci graduate and haven't seen a lot of jobs for Perl, or use cases for it over, say, Python/Ruby, which seem to be more popular.


I guess the people who made them were already comfortable with Perl. Perl is also very good at text processing.


Wow - I'm a little surprised that blekko would be acquired by IBM, but I suppose IBM must see Watson as a huge business now. I think blekko was starting to struggle to stay competitive in consumer search (their backend technology was always pretty good, it's just they struggled to show those results well).


Freebase is a database of things. Blekko is a company that classifies things. Watson interprets the relationships between classifications. This stuff is way beyond a search engine.

People talking about needing to build a better search engine had better get to work, because I think that space is almost abandoned.

A recommendation, an opinion, and a result are a hell of a lot different.


Welcome ChuckMcM!


Saw the headline and wondered about his landing. Glad to know he's headed to another interesting project.


Thanks!


I don't think IBM is gonna convince many developers to build applications on top of Watson, as open source has never been at the top of their list.

My gut tells me IBM just needs some independent developers to develop some apps and then be acquired (as at that point they'd have no choice).

I may be wrong, but I'd rather not take that chance as I work on my projects.


Interesting. IBM also acquired a startup I worked for (Vivisimo) because of the search technology. We never really focused on web crawling though; we dealt mostly with enterprise systems. Now we're a part of Watson too. AlchemyAPI, another recent IBM acquisition, also seems to touch on parts of what we do.


With Blekko gone there is a very limited number of alternative search engines if you don't want to use Google. Have a look, though, at Mojeek, a privately owned UK crawler-based search engine producing its own results without any tracking. It is certainly one to watch for the future.


Now that's a surprise. I didn't think Blekko was doing much more than reselling Yandex and adding some social features. Yandex has a $30M equity stake in Blekko - did IBM buy that out?


I was in touch with Blekko about a year ago regarding using their api to access their index for search results for my own meta-search engine. They might have used Yandex to augment their search results, but Blekko definitely had its own index as well.


Blekko had a sizeable index of their own. Most successful meta-search engines (like DuckDuckGo) keep their own index because there's not a lot they can get from a Yandex api call.


My understanding is actually the opposite. While DuckDuckGo does have a small index of their own, they mainly rely on api calls to Yandex and Yahoo BOSS!


They do rely upon the API calls, but their spam elimination requires full web pages, not just what you get from an API call.


So much for a graceful shutdown. Search capabilities gone immediately. If you created a slashtag, it's gone forever.


Congrats to Greg and the team.


Url changed from http://blekko.com/, which redirects (after a while) to this.

Should the title just say "IBM Acquires Blekko"? The article seems to stop short of saying so.


Here's what it used to look like, for those who had never heard of this product: https://i.imgur.com/uIHcEN4.png


So as a result of the acquisition, IBM has disabled the service? Seems odd. Their only interest was supplementing/enhancing Watson?


at least when Google acquires an existing service, it waits a few years before shutting it down


Oh Crap. IBM's AI is now buying companies?!?!


You Bitcoin people keep going on about DACs. Now witness the firepower of this fully armed and operational M&A machine.


Don't bin me in there, much of that stuff is somewhere between hopelessly naive and outright nonsense. :)


There was a funny comment that a lot of IBM's meetings are held on conference calls because they are so large and the organization is so dispersed. How hard would it be to have someone on that call who never seemed to be in any office? ... :-)


It would be kind of a stretch to label Watson as AI, at least currently.


Supervised learning still falls under the AI umbrella.



