Data Science Toolkit: free-as-in-speech Natural Language Processing & Geo tools (datasciencetoolkit.org)
84 points by nosignal on Jan 15, 2012 | 3 comments



Before you start building something too elaborate on top of this, you should play with the tools to get a sense of their limitations. The Geocoder::US module, in particular, has some problems with real-world input (e.g. geocoding "151 Third St San Francisco" doesn't work, but geocoding "151 Third St San Francisco CA" does). Similar caveats apply to the heuristics used by the other tools.

There are a lot of these corner cases, and they're hard to solve. As always, nothing replaces experience when you're doing this kind of work.
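One way to get a feel for those limitations is to run progressively less complete spellings of an address you already know through whatever geocoder you are evaluating. A minimal Python sketch follows. `address_variants`, `probe`, and `stub_geocode` are hypothetical names invented for illustration, not part of Geocoder::US or the Data Science Toolkit. The stub only imitates the failure described above by requiring a trailing two-letter state, and its coordinates are a placeholder.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

LatLon = Tuple[float, float]


def address_variants(street: str, city: str, state: str) -> List[str]:
    """Return spellings of one address, from most to least complete."""
    return [
        f"{street}, {city}, {state}",
        f"{street} {city} {state}",
        f"{street} {city}",
    ]


def probe(geocode: Callable[[str], Optional[LatLon]],
          variants: List[str]) -> Dict[str, bool]:
    """Record which variants the geocoder resolves (True) or rejects (False)."""
    return {v: geocode(v) is not None for v in variants}


def stub_geocode(address: str) -> Optional[LatLon]:
    """Hypothetical stand-in for a real geocoder call.

    Assumption: it succeeds only when the input ends in a two-letter
    uppercase state code, mimicking the behaviour reported in the comment.
    """
    if re.search(r"\b[A-Z]{2}$", address):
        return (0.0, 0.0)  # placeholder, not a real coordinate
    return None


if __name__ == "__main__":
    variants = address_variants("151 Third St", "San Francisco", "CA")
    for text, ok in probe(stub_geocode, variants).items():
        print("ok  " if ok else "FAIL", text)
```

Swapping `stub_geocode` for a function that calls the real service turns this into a small regression check you can rerun whenever the input format changes.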



Yeah... parsing addresses is non-trivial. It quickly turns into a probabilistic/NLP problem if you want to deal with any sort of real-life input. I can believe that Wolfram Alpha hasn't put a lot of effort into it.

The DB lookup part of the problem (the part that Geocoder::US solves) is the straightforward, mechanical part -- but geocoder precision/recall is almost entirely determined by the quality of the parsing. That's a lot harder to do well.
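To make the split concrete, here is a toy Python sketch with the two stages kept separate. Everything in it is an assumption made for illustration: the regular expression, the `parse` and `lookup` functions, and the one-row `TOY_DB` with a placeholder coordinate. It is not how Geocoder::US works internally. It only shows that once parsing fails, the lookup never gets a chance to run.

```python
import re
from typing import Dict, Optional, Tuple

# Deliberately rigid pattern: number, street ending in a known suffix,
# city, then a two-letter state. Real input breaks this constantly.
ADDRESS_RE = re.compile(
    r"^(?P<number>\d+)\s+"
    r"(?P<street>.+?\b(?:St|Ave|Blvd|Rd))\s+"
    r"(?P<city>.+?)\s+"
    r"(?P<state>[A-Z]{2})$"
)

# Hypothetical one-row "database"; the coordinate is a placeholder.
TOY_DB: Dict[Tuple[str, str, str, str], Tuple[float, float]] = {
    ("151", "Third St", "San Francisco", "CA"): (0.0, 0.0),
}


def parse(address: str) -> Optional[Dict[str, str]]:
    """The hard part: turn free text into structured fields, or give up."""
    match = ADDRESS_RE.match(address.strip())
    return match.groupdict() if match else None


def lookup(address: str) -> Optional[Tuple[float, float]]:
    """The mechanical part: an exact-key lookup on the parsed fields."""
    fields = parse(address)
    if fields is None:
        return None  # recall is lost here, before the DB is consulted
    key = (fields["number"], fields["street"], fields["city"], fields["state"])
    return TOY_DB.get(key)


if __name__ == "__main__":
    print(lookup("151 Third St San Francisco CA"))
    print(lookup("151 Third St San Francisco"))
```

The lookup stays a few lines long no matter how much data sits behind it. All the difficulty is in `parse`, which is where a probabilistic approach would replace the single regular expression.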



