Hacker News
Better Search Doesn't Mean Beating Google (nytimes.com)
22 points by limeade on March 9, 2009 | 17 comments



The most fundamental problem with natural language search engines is that the "natural language" part is more a limitation than a feature to me. Natural language is meant for people to communicate with other people and not with computers. I believe that a well designed keyword/tag based search combined with factual auto suggestions extracted from formal/semantic sources (similar to wikipedia) could be far more efficient for people to use and computers to run.


I am in law school and as a law student I spend lots of time searching through past cases. My searching is done almost exclusively through Westlaw, the online database of Thomson West products.

They have many different types of searches, but the two applicable to this discussion are (as they are written on the site) "terms & connectors" and "natural language." T&C works well using your standard OR/AND/etc. However, natural language works so much better even though you type in the exact same words.

The natural language search returns cases more on point and has one awesome feature: the most relevant text is in red type, set apart from the rest of the case. From the natural language search West is better able to determine what the legal researcher wants and shows it to him.

I spent almost 2 years of law school searching using terms and connectors because I thought the same thing as you do. But I recently converted when I realized West returns better results from their natural language search.
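
To make the distinction above concrete, here is a minimal sketch of the two styles of search. It is not how Westlaw works internally (that is not described here); the toy corpus, the function names, and the TF-IDF scoring are all assumptions chosen for illustration. The boolean function mimics a "terms & connectors" AND query, while the ranked function scores documents and still returns partial matches, ordered by relevance.

```python
import math
from collections import Counter

# Toy corpus; the texts are invented for illustration only.
DOCS = {
    "case_a": "mother sought custody modification after relocation",
    "case_b": "custody order entered no modification requested",
    "case_c": "contract dispute over pipe used in house construction",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def boolean_and(query, docs):
    """Terms-and-connectors style: keep only documents containing every term."""
    terms = set(tokenize(query))
    return sorted(
        name for name, text in docs.items() if terms <= set(tokenize(text))
    )


def ranked(query, docs):
    """Ranked style: order documents by the summed TF-IDF of the query terms."""
    n = len(docs)
    tokens = {name: tokenize(text) for name, text in docs.items()}
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for toks in tokens.values() for t in set(toks))
    scores = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = sum(
            tf[t] * math.log(1 + n / df[t])
            for t in set(tokenize(query))
            if t in tf
        )
        if score > 0:
            scores[name] = score
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores, key=lambda name: (-scores[name], name))


if __name__ == "__main__":
    query = "custody modification relocation"
    print(boolean_and(query, DOCS))
    print(ranked(query, DOCS))
```

With the same words typed into both, the boolean query drops any document missing a single term, whereas the ranked query keeps it and simply places it lower.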


@micks56 - re your legal search using natural language -

Are you able to compare, say, West's results to those of a pure Google text search on the keyword terms?

[ To do that you'd need some example of large legal texts fully online and thus indexed by Google - I don't know if that exists ]

It's sometimes hard to discern the value of the tech versus the quality of the implementation + usability factors - but your observations are interesting. I wonder how search on medical information compares...

gord.


I am trying to think of a test that can be run on my West search engine and Google. West's legal resources dwarf Google's. Google might as well be considered non-existent in the area compared to West or LexisNexis. That is just what those two companies do. They have people that enter cases into databases as they become available. Google just doesn't do that.

I haven't thought of a fair test to run yet. The two engines do different things. My West can search case decisions, statutes, administrative codes, briefs filed to the court, secondary sources (sort of the research paper of the legal field) and the news.

So I tried doing a search on the news only. I searched "ycombinator" and the results returned are news articles only, whereas on Google someone probably wants and gets the YC home page, this site, or the actual function. None of those show up on the West site.

Then I ran a search of these terms on each (I didn't enter quotes on the actual searches): "massachusetts custody modification"

On Westlaw, I get cases, and statutes on point. With extra terms I will easily get to cases that deal with my specific issue. On Google, the first link is a divorce resource site and the rest are for lawyers.

Searching statutes might work. But the main reason statutes search well on Google is the Cornell Law site. The quality of results for statutes is probably a bigger testament to them and their cataloging efforts.

I would say both search engines hit their target markets well. Most people searching "massachusetts custody modification" don't want 20 decisions of the Mass SJC. And people searching the same on Westlaw don't want attorneys. Google is much much faster though. It returns in a fraction of a second. Westlaw took about 12 seconds to return 10,000 hits. First three hits were decided yesterday, which is pretty cool.

There is a group of people creating an open legal database. I can't remember its name. I think they are based in the San Fran area. I think it was started by some hacker that worked on opening up some other government data and is now on the court system. I have the bookmark buried somewhere and of course can't find it. Does anyone know which one I am talking about? We could maybe test that database versus the commercial West one.


thanks for the write up.. interesting to see how things develop in the real world outside your own domain.

I'm surprised the big G hasn't just paid some money to get that data, given their plan to scan all the world's books.

I wonder what percent of all text is legal or medical.


I doubt West, LexisNexis, or any other legal aggregator will sell the information to Google. Those companies make a lot of money selling it to lawyers on a monthly subscription basis. They also do some value-add to the materials. What I see on West or LexisNexis is more than just the publicly available decision. West and Lexis employ lawyers to create summaries and other helpful things for the legal researcher.

There certainly is a lot of legal text. Lawyers certainly are good at creating volumes of paper. For example, the Supreme Court just decided a case, Wyeth v. Levine. It will be recorded in volume 555. So to date the Supreme Court decisions have filled 554 volumes of 1000 pages each. And that is just one court. Every state court, state appeals court, and state supreme court, federal court, land court, etc has similar volumes and page counts.

And all of this is just the primary sources. Once you add secondary sources, aka books and papers written by learned scholars on individual topics or cases, the number of books and pages increases by orders of magnitude. And we still haven't archived any statutes (those go on forever, for each state) or any administrative law. And each one of those has comment sections that go on for pages whereas the actual rule is only a paragraph.

I wonder what percentage this is, too. I bet it is still extremely small compared to what the rest of the world has produced. There are so few law writers when compared to all other writers.


That's a lot of text. The few patents I've read strike me as quite verbose. I was quite amazed at what was patentable, and how loosely described {ephemeral!} the descriptions were. I'm not suggesting all legal text is as sparse in information.

We could certainly do with a better text search for patents.. but I wonder if that's possible unless a form of restricted prose is used that makes the text less obtuse/verbose.

Maybe an algorithm can reduce the common legal motifs and replace them with shorter versions thus refactoring legal-speak into human-readable prose on which text search can be effective.
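
A rough sketch of that idea, assuming a hand-made phrase table: the mappings below are hypothetical examples and not a vetted legal glossary, and `simplify` is an illustrative name. It only does literal phrase substitution, so it says nothing about whether the shortened text keeps the same legal meaning.

```python
import re

# Hypothetical phrase table; these mappings are illustrative only.
PLAIN = {
    "hereinafter referred to as": "called",
    "in the event that": "if",
    "prior to": "before",
}

# Try longer phrases first so a long motif wins over a shorter overlapping one.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(p) for p in sorted(PLAIN, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def simplify(text):
    """Replace each known legal motif with its shorter plain-English form."""
    return _PATTERN.sub(lambda m: PLAIN[m.group(0).lower()], text)


if __name__ == "__main__":
    print(simplify("In the event that payment is late prior to closing"))
```

A search index could then be built over the simplified text while still displaying the original wording to the reader.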

[ For some reason this reminds me of the law student drama series 'the paper chase'. ]

How well is the information hyper-linked? Presumably one paper references many previous rulings, and you'd jump around a lot in researching issues.


Thank you for reminding me of patents. I forgot to mention those. A patent is a completely different entity compared to case law. Case law and case briefs/motions written by lawyers have to be short, concise, to the point, and logical. The judge will quickly (in a matter of a few seconds) ignore your argument if he has to spend any time figuring out what you have to say.

That leads all of our law professors to drill into our heads brevity, clarity, and conciseness in everything we write. But as you mentioned, patents have a completely different audience and goal.

I am amazed at what is patentable too. I wrote a research paper arguing against software patents. The professor that graded my paper disagreed with the position very much. I wrote mine a few days before the Court of Appeals for the Federal Circuit heard the Bilski case. When the decision was rendered this past fall they made some law that is similar to what I argued. I should go show the professor the paper he marked down and the Bilski decision. But I digress...

Patents are a land grab. The goal is to get the vaguest, broadest patent possible and protect the most space. And the legal-speak is there because those words have been litigated time and time again and they have a known meaning to the courts. As soon as you write a new phrase you open yourself to debate in front of the court. A macro to convert legal-speak to human-readable prose should be used at the researcher's own peril.

We are told time and time again: read the case for yourself. Do not read anyone else's summary. And don't paraphrase words unless you know to stay away from the special ones.

Example: There was a contracts case where the contract says "only use pipe made by Company A to build my house." The builder uses pipe from Company B. The court ruled against the home buyer because "only use pipe by Company A" doesn't actually mean that! It means use pipe similar to the quality of Company A! So translating the legal-speak required to really get a builder to use pipe from Company A into "only use pipe from Company A" will result in failure.

The information is hyper-linked very well. I wish I could show you, but I can't. My student access to the site is restricted to school use only. I am pretty sure I will be violating the TOS by posting any of the information.

But every case cited is linked. Those are the most important. Judges are linked to other decisions. Same with arguing lawyers. Statutes are linked. Footnotes are linked. Obscure terms are linked. For example, a medication will have a link but a legal term of art will not.

I just wish the search and the site overall were faster. Sometimes the navigation is quirky, too.


Hi micks56, I would like to ask someone who has a grounding in both law and technology some questions not directly related to this discussion but to software for lawyers in general. Your profile doesn't have an email ID. Care to email me at heuristix at gmail or reply back with your email id?


I just sent you an email. I am happy to answer any questions that you or anyone else has.


Thank you! Got the mail. I'll write up my question(s) this evening and email you.


I think I just said what you just said.. but then I read your post.

Is this something you'd enjoy hacking on?


The article argues that having a successful technology is one thing and having a successful business is another. This is, of course, true - but there's a significant correlation between the two. Relational databases, Google's search algorithm and the light bulb were all influential technologies that founded hugely successful companies.

There's a rule of thumb saying that your solution to a problem has to be 200-300% better than the existing state of the art for it to be adopted successfully without artificial help (marketing $, monopolies, etc.) and Google certainly lived up to that when it went live. It remains to be seen whether Wolfram Alpha will.


It won't.

It'll probably be another case of Powerset or Cuil, lots of hype by the company in question that is impossible to live up to.


You can’t beat Google only by developing a superior technology. You’ll still be left with the Herculean task of drilling your search engine deep into the minds of hundreds of millions of users around the world. Google the brand is far mightier than Google the technology. It’s Google’s redoubtable omnipresence and visibility that makes it tick.


Microsoft tried, and is still trying, with their Live Search. It's the default search in Internet Explorer, and the Microsoft brand is very well known. But you know what the problem is? The search results just aren't as good!

I tried out MS Live Search, thinking that a company with as much money to throw at things as Microsoft would probably produce something of similar quality to Google. I quickly got frustrated when most of the search results were nothing like what I was looking for. Meanwhile Google consistently gave exactly what I wanted near the top of their results page.

What I'm saying is that technology does matter. There are other brand-name titans in the world.


No. Better search does mean beating Google.

But you don't have to play within Google's rules.. there's lots of territory between text search [=cool] and Natural Language [=sucks, to a first order approximation].

For example, just treat data as a graph of tagged pieces of text.. and give a good web interface to that. Bypass all the RDF, semantic web hype and just make something workable, usable. A wiki for data.
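
As a minimal sketch of what "a graph of tagged pieces of text" might look like in code: the `TagGraph` class, its method names, and the sample nodes are all assumptions made for illustration, not a description of any existing system. Each node is a piece of text, tags index the nodes, and undirected links connect related pieces.

```python
from collections import defaultdict


class TagGraph:
    """Pieces of text as nodes, indexed by tag and joined by undirected links."""

    def __init__(self):
        self.text = {}                  # node id -> text
        self.tags = defaultdict(set)    # tag -> set of node ids
        self.links = defaultdict(set)   # node id -> set of linked node ids

    def add(self, node_id, text, tags=()):
        """Store a piece of text under node_id and index it by each tag."""
        self.text[node_id] = text
        for tag in tags:
            self.tags[tag].add(node_id)

    def link(self, a, b):
        """Connect two nodes in both directions."""
        self.links[a].add(b)
        self.links[b].add(a)

    def find(self, *tags):
        """Return the ids of nodes that carry every one of the given tags."""
        if not tags:
            return []
        sets = [self.tags.get(tag, set()) for tag in tags]
        return sorted(set.intersection(*sets))

    def neighbors(self, node_id):
        """Return the ids of nodes directly linked to node_id."""
        return sorted(self.links.get(node_id, set()))


if __name__ == "__main__":
    g = TagGraph()
    g.add("n1", "Wyeth v. Levine", tags=["case", "supreme-court"])
    g.add("n2", "Bilski decision", tags=["case", "patent"])
    g.link("n1", "n2")
    print(g.find("case", "patent"))
    print(g.neighbors("n1"))
```

Querying by tag intersection plus following links covers a lot of ground without any RDF tooling, which seems to be the spirit of the suggestion.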

Anyone looking for a co-founder? I'm working on this.



