Hacker News
Wink-lemmatizer (winkjs.org)
51 points by petethomas on June 24, 2018 | 22 comments



Here's a link to the GitHub repo (AGPL 3.0): https://github.com/winkjs/wink-lemmatizer

Very cool! It looks like this is only for English at the moment, but it would be cool to see this for other languages as well.


Link was originally https://nlp.stanford.edu/IR-book/html/htmledition/stemming-a... ... I don't mind that the mods changed it; I just wanted there to be a record, because otherwise some of the comments here don't make sense.


Hey! I am Sanjaya, the lead developer of wink-lemmatizer and the rest of the wink family. We are trying to break NLP and ML down into atomic building blocks. I am happy to answer any questions and gather feedback about wink, which lives at https://github.com/winkjs :)


Nice. Any plans for the following features?

- Instead of calling the API with "noun", "verb", etc., let the library figure out the part of speech.

- Let the library return information about the word, e.g. "coolest" -> "cool" + superlative.


“cool” is both a noun and a verb. How would it handle that?


Return a list of results?
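
A minimal sketch of what "return a list of results" could look like, assuming a toy lexicon and a handful of suffix rules. The names `LEXICON`, `SUFFIXES`, and `analyze` are hypothetical and are not part of the wink-lemmatizer API; the point is only that an ambiguous word yields one reading per part of speech, each tagged with a grammatical feature.

```python
# Toy lookup tables; a real lemmatizer would use a full lexicon such as WordNet.
# All names here are hypothetical and are not part of wink-lemmatizer.
LEXICON = {
    "noun": {"cool", "mouse", "house"},
    "verb": {"cool", "house"},
    "adjective": {"cool"},
}

# (suffix, grammatical feature) pairs tried for each part of speech.
SUFFIXES = {
    "noun": [("s", "plural")],
    "verb": [("s", "third person singular"), ("ed", "past"),
             ("ing", "present participle")],
    "adjective": [("er", "comparative"), ("est", "superlative")],
}


def analyze(word):
    """Return every (lemma, part of speech, feature) reading the toy lexicon allows."""
    readings = []
    for pos, lemmas in LEXICON.items():
        # The word may already be a base form for this part of speech.
        if word in lemmas:
            readings.append((word, pos, "base"))
        # Otherwise try stripping each known suffix and looking up the remainder.
        for suffix, feature in SUFFIXES[pos]:
            if word.endswith(suffix) and word[: -len(suffix)] in lemmas:
                readings.append((word[: -len(suffix)], pos, feature))
    return readings
```

Under these assumptions, `analyze("coolest")` gives a single adjective reading, while `analyze("cool")` gives three, leaving the caller (or a part-of-speech tagger) to pick one.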


From the Exercises section:

> The stemming for ponies and pony might seem strange.

What's that all about?


The stem for 'ponies' in the earlier figure was 'poni'. That means that 'pony' has to be stemmed as 'poni' in order to correctly retrieve 'pony' when searching for 'ponies' and vice versa.
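
To make that concrete, here is a rough sketch of just the two rules from the original Porter stemmer that matter for this pair: step 1a (plural endings) and step 1c (final "y"). It is not a full stemmer, the function name is made up, and the vowel test is simplified (it never treats "y" as a vowel), so take it as an illustration only.

```python
VOWELS = set("aeiou")


def stem_plural_and_y(word):
    """Apply only the Porter rules relevant to 'ponies' and 'pony'."""
    # Step 1a: strip plural endings; the first matching rule wins.
    if word.endswith("sses"):
        word = word[:-2]        # caresses -> caress
    elif word.endswith("ies"):
        word = word[:-2]        # ponies -> poni
    elif word.endswith("ss"):
        pass                    # caress -> caress
    elif word.endswith("s"):
        word = word[:-1]        # cats -> cat
    # Step 1c: a final 'y' becomes 'i' when the rest of the word contains a vowel.
    # Simplification: 'y' itself is never counted as a vowel here.
    if word.endswith("y") and any(ch in VOWELS for ch in word[:-1]):
        word = word[:-1] + "i"  # pony -> poni
    return word
```

Both `stem_plural_and_y("ponies")` and `stem_plural_and_y("pony")` come out as `"poni"`, so a query for either form matches documents containing the other, even though "poni" is not a word.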


The next question was:

> Does it have a deleterious effect on retrieval? Why or why not?

Any thoughts?


Ok, so how does it work? Which stemming algorithm?

It's not helpful unless we know what it's doing under the covers.


I disagree. For use as a tool, for some use-cases, if it "works as advertised," all we really need to know is whether it's efficient enough for our purposes.

One (horrible) use for something like this is in a framework like Rails, where there's a cultural acceptance of making method and field names by pluralising and conjugating etc.

In Rails if your `House` model has a one-to-many relationship with your `Mouse` model, a `house` object probably automatically gets a `.mice` field. That sort of thing can be done with extensive rule sets, or it can be done with a black-box library.

Of course it's a horrible use-case, and you'll probably always need to deal with ambiguity and context, but for that sort of thing the implementation details aren't nearly as relevant as "the dimensions of the black box" -- how quickly does it run, how quickly does it start up, how much memory does it use, how good at its job is it, and which languages does/can it be made to support?
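
For what it's worth, the "extensive rule sets" approach usually boils down to an irregulars table plus ordered regex rules. A tiny sketch in Python (not Rails' actual inflector; the tables and the `pluralize` name here are illustrative assumptions and far from complete):

```python
import re

# Irregular forms are checked first; these few entries are illustrative only.
IRREGULAR_PLURALS = {"mouse": "mice", "person": "people", "child": "children"}

# Ordered (pattern, replacement) rules; the first match wins.
PLURAL_RULES = [
    (re.compile(r"(s|x|z|ch|sh)$"), r"\1es"),   # box -> boxes
    (re.compile(r"([^aeiou])y$"), r"\1ies"),    # pony -> ponies
    (re.compile(r"$"), "s"),                    # house -> houses
]


def pluralize(word):
    """Pluralize an English noun using the irregulars table, then the rules."""
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word]
    for pattern, replacement in PLURAL_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word, count=1)
    return word
```

From the caller's side this is exactly the black box described above: `pluralize("mouse")` returns `"mice"` and nobody using it has to care whether a table or a rule produced the answer.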


The lemmatizer is an adaptation of https://wordnet.princeton.edu/documentation/morphy7wn. It reduces inflected verbs and adjectives to their base form and converts plural nouns to their singular form. The wink-pos-tagger (https://github.com/winkjs/wink-pos-tagger) uses the lemmatizer to automatically find the lemma of each word according to its part of speech.

For stemming, we have wink-porter2-stemmer (https://github.com/winkjs/wink-porter2-stemmer) that uses Porter Stemmer Algorithm V2 by Dr Martin F Porter.
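
For readers unfamiliar with morphy, the general idea can be sketched roughly as follows: check a per-part-of-speech exception list first, then try suffix "detachment" rules and accept the first candidate found in the lexicon. The rule tables below follow the WordNet morphy documentation, but the tiny `EXCEPTIONS` and `LEXICON` sets and the `lemmatize` function are stand-ins written for this illustration in Python. This is not the wink-lemmatizer source or API, and returning unknown words unchanged is an assumption of the sketch.

```python
# Tiny stand-ins for WordNet's exception lists and lexicon (illustrative only).
EXCEPTIONS = {
    "noun": {"feet": "foot", "mice": "mouse"},
    "verb": {"went": "go", "ran": "run"},
    "adjective": {"better": "good", "best": "good"},
}
LEXICON = {
    "noun": {"pony", "box", "house", "foot", "mouse"},
    "verb": {"hope", "walk", "go", "run", "try"},
    "adjective": {"cool", "large", "good"},
}
# Detachment rules: (suffix to remove, ending to add), tried in order.
RULES = {
    "noun": [("s", ""), ("ses", "s"), ("xes", "x"), ("zes", "z"),
             ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y")],
    "verb": [("s", ""), ("ies", "y"), ("es", "e"), ("es", ""),
             ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")],
    "adjective": [("er", ""), ("est", ""), ("er", "e"), ("est", "e")],
}


def lemmatize(word, pos):
    """Return the lemma of `word` for the given part of speech."""
    # 1. Irregular forms come straight from the exception list.
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    # 2. A word that is already a base form is returned as is.
    if word in LEXICON[pos]:
        return word
    # 3. Try each detachment rule; accept the first candidate in the lexicon.
    for suffix, ending in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + ending
            if candidate in LEXICON[pos]:
                return candidate
    # 4. Assumption of this sketch: unknown words are returned unchanged.
    return word
```

Unlike a stemmer, every result here is a real dictionary word: `lemmatize("ponies", "noun")` returns `"pony"`, and `lemmatize("hoping", "verb")` returns `"hope"`. The cost is that the caller must supply the part of speech, and coverage is bounded by the lexicon.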


It would be really cool if they had a live demo to try it out on their site.


Here you go. This works for many npm packages. https://npm.runkit.com/wink-lemmatizer


Any plans for multi-language support?


This is really needed in the NLP world. Most things are English-only.


Good luck with Turkish. A very large lookup table may work for 90% of cases, but it will still fail. A basic "true" lemmatizer requires either a complex graph with rules or an FST generated from it. The input is searched through the graph, and a morphological disambiguator must be applied to the result to pick the correct lemma.


Or Welsh. You want a canonical form for "wnaethpwyd"? Try "gwneud"! Some regexes and an exceptions list aren't going to cut it.

The more a language needs a lemmatizer for NLP, the harder it is to write one.


Could someone explain what's hard in particular about canonical forms in Welsh?


spaCy excels here. English, German, Portuguese, and more.


I can personally suggest FreeLing [nlp.lsi.upc.edu/freeling]. It supports most languages of the Iberian Peninsula (Spanish, Portuguese, Catalan (including multiple dialects), and Galician), as well as English, Italian, French, German, Russian, Croatian, and Slovene. It's also more flexible than other tools if you need to go outside those languages and customize something.


As of now, wink-tokenizer (https://github.com/winkjs/wink-tokenizer) supports multiple scripts, so it can also tokenize sentences in languages like Hindi, Marathi, French, German, etc. We are working on extending multilingual support to other components, including this lemmatizer.



