
It doesn't sound like the approaches are incompatible. You can use minhash LSH to search a large set and get a top-k candidate list for any individual, then use a weighted average with penalty rules to decide which of those candidates qualify as dupes. Weighted minhash can also be used to efficiently emulate repeating elements, which gives some features additional weight.
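
To make that concrete, here's a rough sketch of the two-stage idea in plain Python (stdlib only). Everything specific in it is my own assumption for illustration: the field names, the weights, the birth-year penalty, the 0.75 threshold, and the 64-hash / 16-band split. It uses ordinary minhash rather than weighted minhash, so treat it as one possible shape, not a reference implementation.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64             # signature length (assumed)
BANDS = 16                # LSH bands (assumed)
ROWS = NUM_PERM // BANDS  # rows per band

def shingles(text, k=3):
    """Set of lowercase character k-grams."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def _h(seed, token):
    """Stable 64-bit hash of (seed, token)."""
    data = f"{seed}:{token}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash(tokens):
    """One minimum per seeded hash function."""
    return [min(_h(seed, t) for t in tokens) for seed in range(NUM_PERM)]

def est_jaccard(a, b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

class LSHIndex:
    def __init__(self):
        self.buckets = defaultdict(set)
        self.sigs = {}

    def _bands(self, sig):
        for b in range(BANDS):
            yield (b, tuple(sig[b * ROWS:(b + 1) * ROWS]))

    def add(self, key, sig):
        self.sigs[key] = sig
        for band in self._bands(sig):
            self.buckets[band].add(key)

    def top_k(self, sig, k=5):
        # Stage 1: anything sharing at least one band bucket is a candidate.
        cands = set()
        for band in self._bands(sig):
            cands |= self.buckets.get(band, set())
        return sorted(cands, key=lambda c: est_jaccard(sig, self.sigs[c]),
                      reverse=True)[:k]

# Stage 2: weighted average over fields plus a penalty rule (all hypothetical).
WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}
THRESHOLD = 0.75

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def dupe_score(r1, r2):
    score = sum(w * jaccard(r1[f], r2[f]) for f, w in WEIGHTS.items())
    # Penalty rule: both birth years known but different.
    if r1.get("born") and r2.get("born") and r1["born"] != r2["born"]:
        score -= 0.5
    return score

def record_text(r):
    return " ".join(r[f] for f in WEIGHTS)

records = {
    "a": {"name": "Jonathan Smith", "email": "jsmith@example.com",
          "city": "Boston", "born": 1980},
    "b": {"name": "Jonathon Smith", "email": "jsmith@example.com",
          "city": "Boston", "born": 1980},
    "c": {"name": "Maria Garcia", "email": "mgarcia@example.org",
          "city": "Madrid", "born": 1975},
}

index = LSHIndex()
for key, rec in records.items():
    index.add(key, minhash(shingles(record_text(rec))))

query = records["a"]
candidates = index.top_k(minhash(shingles(record_text(query))))
dupes = [c for c in candidates
         if c != "a" and dupe_score(query, records[c]) >= THRESHOLD]
print(candidates, dupes)
```

The LSH stage only has to be cheap and high-recall; the scoring stage is where the domain rules live, so it can be as fussy as you like because it only runs on the short candidate list.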

