Wink-lemmatizer

vecplane · on June 24, 2018

Here's a link to the GitHub repo (AGPL 3.0): https://github.com/winkjs/wink-lemmatizer

Very cool! It looks like this is only for English at the moment, but it would be cool to see this for other languages as well.

raldi · on June 24, 2018

Link was originally https://nlp.stanford.edu/IR-book/html/htmledition/stemming-a... ... I don't mind that the mods changed it; I just wanted there to be a record, because otherwise some of the comments here don't make sense.

sanjayaksaxena1 · on June 25, 2018

Hey! I am Sanjaya, the lead developer of wink-lemmatizer, and the rest of the wink family. We are trying to break down NLP and ML into atomic building blocks. I am happy to answer any questions and gather feedback about wink located at https://github.com/winkjs :)

amelius · on June 24, 2018

Nice. Any plans for the following features?

- instead of calling the API using "noun", "verb", etc., let the library figure out the word's part of speech.

- let the library return information about the word, e.g. "coolest" -> "cool"+superlative.

moarcoffee · on June 24, 2018

“cool” is both a noun and a verb. How would it handle that?

amelius · on June 24, 2018

Return a list of results?
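
To make that concrete, here is a rough sketch of an analyzer that returns every candidate reading instead of a single lemma. Everything in it is hypothetical: the `LEXICON` and `RULES` tables are toy data and `analyze` is a made-up name, not part of wink-lemmatizer's API.

```javascript
// Toy lexicon (hypothetical): base form -> parts of speech it can take.
const LEXICON = new Map([
  ['cool', ['adjective', 'noun', 'verb']],
  ['pony', ['noun']],
]);

// Toy suffix rules (hypothetical): [part of speech, suffix, replacement, feature].
const RULES = [
  ['adjective', 'est', '', 'superlative'],
  ['adjective', 'er', '', 'comparative'],
  ['noun', 'ies', 'y', 'plural'],
  ['noun', 's', '', 'plural'],
  ['verb', 'ed', '', 'past'],
  ['verb', 'ing', '', 'progressive'],
  ['verb', 's', '', 'third-person singular'],
];

// Return every (lemma, pos, feature) reading the toy tables allow.
function analyze(word) {
  const results = [];
  // The word may already be a base form, possibly for several parts of speech.
  for (const pos of LEXICON.get(word) || []) {
    results.push({ lemma: word, pos, feature: 'base' });
  }
  // Try each suffix rule and keep candidates the lexicon knows for that part of speech.
  for (const [pos, suffix, replacement, feature] of RULES) {
    if (!word.endsWith(suffix)) continue;
    const candidate = word.slice(0, word.length - suffix.length) + replacement;
    if ((LEXICON.get(candidate) || []).includes(pos)) {
      results.push({ lemma: candidate, pos, feature });
    }
  }
  return results;
}

console.log(analyze('coolest')); // one reading: cool + superlative
console.log(analyze('cools'));   // two readings: plural noun and third-person verb
```

With an ambiguous input such as "cools", the caller gets both readings and can pick one using context, which is roughly what a part-of-speech tagger would do upstream.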

raldi · on June 24, 2018

From the Exercises section:

> The stemming for ponies and pony might seem strange.

What's that all about?

djur · on June 24, 2018

The stem for 'ponies' in the earlier figure was 'poni'. That means that 'pony' has to be stemmed as 'poni' in order to correctly retrieve 'pony' when searching for 'ponies' and vice versa.
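
A minimal sketch of why both words land on 'poni', assuming the book's figure used the original Porter algorithm. It covers only steps 1a and 1c; step 1b and everything after it are left out, and the vowel check is simplified. The function names are my own.

```javascript
// Step 1a of the original Porter algorithm: plural suffixes.
function step1a(word) {
  if (word.endsWith('sses')) return word.slice(0, -2); // caresses -> caress
  if (word.endsWith('ies')) return word.slice(0, -2);  // ponies -> poni
  if (word.endsWith('ss')) return word;                // caress -> caress
  if (word.endsWith('s')) return word.slice(0, -1);    // cats -> cat
  return word;
}

// Simplified vowel test: a, e, i, o, u, or a "y" that follows a consonant.
function hasVowel(stem) {
  return /[aeiou]|[^aeiou]y/.test(stem);
}

// Step 1c: a final "y" becomes "i" when the rest of the word contains a vowel.
function step1c(word) {
  if (!word.endsWith('y')) return word;
  const stem = word.slice(0, -1);
  return hasVowel(stem) ? stem + 'i' : word;
}

// Partial stemmer: applies only steps 1a and 1c, skipping step 1b.
function stemSketch(word) {
  return step1c(step1a(word));
}

console.log(stemSketch('ponies')); // poni
console.log(stemSketch('pony'));   // poni
console.log(stemSketch('sky'));    // sky (no vowel before the final y)
```

The stem does not have to be a real word. It only has to be the same string for every form you want to match at query time.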

raldi · on June 24, 2018

The next question was:

> Does it have a deleterious effect on retrieval? Why or why not?

Any thoughts?

ccleve · on June 24, 2018

Ok, so how does it work? Which stemming algorithm?

It's not helpful unless we know what it's doing under the covers.

repsilat · on June 24, 2018

I disagree. For use as a tool, for some use-cases, if it "works as advertised," all we really need to know is whether it's efficient enough for our purposes.

One (horrible) use for something like this is in a framework like Rails, where there's a cultural acceptance of making method and field names by pluralising and conjugating etc.

In Rails if your `House` model has a one-to-many relationship with your `Mouse` model, a `house` object probably automatically gets a `.mice` field. That sort of thing can be done with extensive rule sets, or it can be done with a black-box library.

Of course it's a horrible use-case, and you'll probably always need to deal with ambiguity and context, but for that sort of thing the implementation details aren't nearly as relevant as "the dimensions of the black box" -- how quickly does it run, how quickly does it start up, how much memory does it use, how good at its job is it, and which languages does/can it be made to support?

sanjayaksaxena1 · on June 25, 2018

The lemmatizer is an adaptation of https://wordnet.princeton.edu/documentation/morphy7wn. It reduces verbs and adjectives to their base form and converts plural nouns to their singular form. The wink-pos-tagger (https://github.com/winkjs/wink-pos-tagger) leverages the lemmatizer to automatically find the lemma of each word according to its part of speech.
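
For readers who want the gist of the Morphy approach, here is a rough sketch: check an exception list first, then try suffix "detachment" rules and accept the first candidate found in the dictionary. This is not wink-lemmatizer's actual code. The `EXCEPTIONS` and `DICTIONARY` tables are tiny hypothetical stand-ins for WordNet's data, the function name `lemmatizeSketch` is made up, and the rule list is reproduced as I recall it from the morphy(7WN) page.

```javascript
// Hypothetical toy data standing in for WordNet's exception lists and index.
const EXCEPTIONS = {
  noun: new Map([['mice', 'mouse']]),
  verb: new Map([['went', 'go']]),
  adjective: new Map([['best', 'good']]),
};
const DICTIONARY = {
  noun: new Set(['mouse', 'pony', 'box']),
  verb: new Set(['go', 'hope', 'cool']),
  adjective: new Set(['good', 'cool', 'nice']),
};

// Rules of detachment: [suffix to remove, ending to add], tried in order.
const DETACHMENT = {
  noun: [['s', ''], ['ses', 's'], ['xes', 'x'], ['zes', 'z'],
         ['ches', 'ch'], ['shes', 'sh'], ['men', 'man'], ['ies', 'y']],
  verb: [['s', ''], ['ies', 'y'], ['es', 'e'], ['es', ''],
         ['ed', 'e'], ['ed', ''], ['ing', 'e'], ['ing', '']],
  adjective: [['er', ''], ['est', ''], ['er', 'e'], ['est', 'e']],
};

function lemmatizeSketch(word, pos) {
  // 1. Irregular forms come straight from the exception list.
  const exception = EXCEPTIONS[pos].get(word);
  if (exception !== undefined) return exception;
  // 2. A word already in the dictionary is its own lemma.
  if (DICTIONARY[pos].has(word)) return word;
  // 3. Otherwise try each detachment rule and keep the first known candidate.
  for (const [suffix, ending] of DETACHMENT[pos]) {
    if (!word.endsWith(suffix)) continue;
    const candidate = word.slice(0, word.length - suffix.length) + ending;
    if (DICTIONARY[pos].has(candidate)) return candidate;
  }
  // 4. Unknown words are returned unchanged.
  return word;
}

console.log(lemmatizeSketch('ponies', 'noun'));      // pony
console.log(lemmatizeSketch('hoped', 'verb'));       // hope
console.log(lemmatizeSketch('nicest', 'adjective')); // nice
```

The dictionary check is what separates this from a stemmer: a candidate is only accepted if it is a real word for the requested part of speech.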

For stemming, we have wink-porter2-stemmer (https://github.com/winkjs/wink-porter2-stemmer) that uses Porter Stemmer Algorithm V2 by Dr Martin F Porter.

saagarjha · on June 24, 2018

It would be really cool if they had a live demo to try it out on their site.

tomjakubowski · on June 24, 2018

Here you go. This works for many npm packages. https://npm.runkit.com/wink-lemmatizer

slx26 · on June 24, 2018

Any plans for multi-language support?

EmilStenstrom · on June 24, 2018

This is really needed in the NLP world. Most things are English-only.

_r5wf · on June 24, 2018

Good luck with Turkish. A very large lookup table may work for 90% of cases, but it will still fail. A basic "true" lemmatizer requires either a complex graph with rules or an FST generated from it. The input is searched through the graph, and a morphological disambiguator must be applied to the result to pick the correct lemma.

DFHippie · on June 24, 2018

Or Welsh. You want a canonical form for "wnaethpwyd"? Try "gwneud"! Some regexes and an exceptions list aren't going to cut it.

The more a language needs a lemmatizer for NLP the harder it is to write it.

tomjakubowski · on June 24, 2018

Could someone explain what's hard in particular about canonical forms in Welsh?

wyldfire · on June 24, 2018

spaCy excels here. English, German, Portuguese, and more.

slx26 · on June 24, 2018

I can personally suggest FreeLing [nlp.lsi.upc.edu/freeling]. It supports most languages of the Iberian Peninsula (Spanish, Portuguese, Catalan (including multiple dialects), Galician), and it also includes English, Italian, French, German, Russian, Croatian and Slovene. It's also more flexible than other tools if you need to go outside those languages and customize something.

sanjayaksaxena1 · on June 25, 2018

As of now, wink-tokenizer (https://github.com/winkjs/wink-tokenizer) supports multiple scripts, so it can also tokenize sentences in languages like Hindi, Marathi, French, German, etc. We are working on extending multilingual support to other components, including this lemmatizer.