Hey! I am Sanjaya, the lead developer of wink-lemmatizer and the rest of the wink family. We are trying to break down NLP and ML into atomic building blocks. I am happy to answer any questions and gather feedback about wink, which lives at https://github.com/winkjs :)
The stem for 'ponies' in the earlier figure was 'poni'. That means that 'pony' has to be stemmed as 'poni' in order to correctly retrieve 'pony' when searching for 'ponies' and vice versa.
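To make that concrete, here is a minimal sketch in Python. The `toy_stem` function is a made-up helper that implements only two Porter-style rules ("ies" to "i", and a trailing "y" to "i" when the rest of the word contains a vowel). It is not the stemmer from the figure or from any library. The index and `search` helper are also just for illustration.

```python
from collections import defaultdict

VOWELS = set("aeiou")


def toy_stem(word):
    """Toy stemmer (assumption: only two Porter-style rules, for illustration)."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-3] + "i"  # ponies -> poni
    if word.endswith("y") and any(c in VOWELS for c in word[:-1]):
        return word[:-1] + "i"  # pony -> poni
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query with the same function used at indexing time."""
    return index.get(toy_stem(query), set())


docs = {1: "a pony ate hay", 2: "two ponies ran"}
index = build_index(docs)
```

Because both the documents and the query go through the same stemmer, `search(index, "ponies")` and `search(index, "pony")` return the same set of documents. The stem itself does not have to be a real word.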
I disagree. For use as a tool, for some use-cases, if it "works as advertised," all we really need to know is whether it's efficient enough for our purposes.
One (horrible) use for something like this is in a framework like Rails, where there's a cultural acceptance of making method and field names by pluralising and conjugating etc.
In Rails, if your `House` model has a one-to-many relationship with your `Mouse` model, a `house` object probably automatically gets a `.mice` field. That sort of thing can be done with extensive rule sets, or it can be done with a black-box library.
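As a rough idea of what the "extensive rule sets" route looks like, here is a small sketch in Python rather than Ruby. It is not how Rails itself does it. The `IRREGULAR` table, the `RULES` list, and both function names are assumptions made up for this example, and the rules cover only a handful of English patterns.

```python
import re

# Irregular nouns are looked up first (tiny illustrative table).
IRREGULAR = {"mouse": "mice", "person": "people", "child": "children"}

# Ordered (pattern, replacement) pairs; the first matching rule wins.
RULES = [
    (re.compile(r"(s|x|z|ch|sh)$"), r"\1es"),  # box -> boxes
    (re.compile(r"([^aeiou])y$"), r"\1ies"),   # pony -> ponies
    (re.compile(r"$"), "s"),                   # fallback: house -> houses
]


def pluralize(noun):
    """Pluralize an English noun with a lookup table plus suffix rules."""
    noun = noun.lower()
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    for pattern, replacement in RULES:
        if pattern.search(noun):
            return pattern.sub(replacement, noun, count=1)
    return noun


def association_name(model_name):
    """Hypothetical helper: derive a one-to-many field name from a model name."""
    return pluralize(model_name)
```

With this, `association_name("Mouse")` gives `mice` and `association_name("House")` gives `houses`. Every word the rules get wrong needs another table entry or another rule, which is why the rule set tends to grow.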
Of course it's a horrible use-case, and you'll probably always need to deal with ambiguity and context, but for that sort of thing the implementation details aren't nearly as relevant as "the dimensions of the black box" -- how quickly does it run, how quickly does it start up, how much memory does it use, how good is it at its job, and which languages does it support or can it be made to support?
Good luck with Turkish. A very large lookup table may work for 90% of cases, but it will still fail on the rest. Even a basic "true" lemmatizer requires either a complex graph with rules or an FST (finite-state transducer) generated from it. The input is searched through the graph, and a morphological disambiguator must then be applied to the result to pick the correct lemma.
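To show the lookup-table problem, here is a minimal sketch in Python. The three-entry table and both function names are made up for illustration. A real table would be far larger, but in an agglutinative language new suffix combinations keep appearing, so it has the same kind of gap.

```python
# Tiny illustrative table of inflected form -> lemma (assumption: hand-picked entries).
LEMMA_TABLE = {
    "evler": "ev",        # houses -> house
    "evde": "ev",         # in the house -> house
    "kitaplar": "kitap",  # books -> book
}


def lookup_lemma(word, table=LEMMA_TABLE):
    """Return the lemma if the exact surface form is in the table, else None."""
    return table.get(word)


def coverage(words, table=LEMMA_TABLE):
    """Fraction of the given words that the table can lemmatize."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if lookup_lemma(w, table) is not None)
    return hits / len(words)
```

Here `lookup_lemma("evler")` finds `ev`, but `lookup_lemma("evlerimizden")` ("from our houses") returns `None`, because that suffix chain was never listed. This sketch does not attempt the graph search or the disambiguation step. It only shows where the table stops.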
I can personally suggest FreeLing [nlp.lsi.upc.edu/freeling]. It supports most languages of the Iberian Peninsula (Spanish, Portuguese, Catalan (including multiple dialects), Galician), and it also includes English, Italian, French, German, Russian, Croatian and Slovene. It's also more flexible than other tools if you need to go outside those languages and customize something.
As of now, wink-tokenizer (https://github.com/winkjs/wink-tokenizer) supports multiple scripts, so it can also tokenize sentences in languages such as Hindi, Marathi, French and German. We are working on extending multilingual support to other components, including this lemmatizer.
Very cool! It looks like this is only for English at the moment, but it would be cool to see this for other languages as well.