
Yep, that's kind of the main problem :) Hence the need for some manual curation (e.g. ClearURLs seems to do it here: https://github.com/ClearURLs/Rules/blob/master/data.json). For 80% of sites just throwing away the query parameters works; for the rest, sadly, more sophisticated normalization is necessary.
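
To make that concrete, here's roughly the shape of it as a Python sketch. The per-site allow-list here is made up (it's not ClearURLs' actual schema): keep a handful of known-meaningful params per domain, drop everything else by default.

  from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

  # Hypothetical per-site allow-lists, loosely inspired by (but not matching)
  # ClearURLs' data.json schema: keep these params, drop everything else.
  KEEP_PARAMS = {
      "youtube.com": {"v", "t", "list"},
      "google.com": {"q"},
  }

  def normalize(url):
      parts = urlsplit(url)
      host = parts.netloc.lower()
      if host.startswith("www."):
          host = host[4:]
      keep = KEEP_PARAMS.get(host, set())  # default: drop all query params
      query = [(k, v) for k, v in parse_qsl(parts.query) if k in keep]
      return urlunsplit((parts.scheme, parts.netloc, parts.path,
                         urlencode(query), parts.fragment))

  print(normalize("https://www.youtube.com/watch?v=abc123&utm_source=newsletter"))
  # -> https://www.youtube.com/watch?v=abc123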

I'm also thinking it might be possible with some simple machine learning over the corpus of existing URLs. E.g. a human looking at a corpus of different URLs would more or less be able to guess which parts are useful and which are tracking garbage, so perhaps that could be automated with high accuracy?
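
Even before 'real' ML, a dumb corpus heuristic might get surprisingly far: flag a parameter if it shows up across lots of unrelated domains (utm_source / fbclid style), or if its values look like random identifiers. Rough sketch, with completely made-up thresholds:

  import math
  from collections import defaultdict
  from urllib.parse import urlsplit, parse_qsl

  def shannon_entropy(s):
      if not s:
          return 0.0
      counts = defaultdict(int)
      for ch in s:
          counts[ch] += 1
      return -sum(c / len(s) * math.log2(c / len(s)) for c in counts.values())

  def suspect_params(urls):
      """Flag query params that look like tracking garbage: they appear on
      many unrelated domains, or their values look like random identifiers."""
      domains_per_param = defaultdict(set)
      values_per_param = defaultdict(list)
      for url in urls:
          parts = urlsplit(url)
          for k, v in parse_qsl(parts.query):
              domains_per_param[k].add(parts.netloc)
              values_per_param[k].append(v)
      flagged = set()
      for param, domains in domains_per_param.items():
          values = values_per_param[param]
          avg_entropy = sum(map(shannon_entropy, values)) / len(values)
          # Made-up thresholds: seen on 5+ domains, or values look random.
          if len(domains) >= 5 or avg_entropy > 3.5:
              flagged.add(param)
      return flagged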

Then, I also feel that if it's paired with some UI that lets the user 'fix' the entity extraction algorithm (e.g. by pointing at the 'relevant' parts of the URL), it would already be good enough -- the user would fix the sites that are the worst offenders for them. These fixes could then optionally be contributed back and merged into the upstream 'rules database'.
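
If the rules are just a domain -> allow-list mapping (format hypothetical), merging the user's fixes with the upstream database is a trivial dict merge, and the local overrides file is exactly what they'd contribute back:

  import json

  # Shared rules shipped with the tool (hypothetical format).
  upstream_rules = {"youtube.com": ["v", "t", "list"]}

  # Fixes the user made through the UI for their worst-offending sites.
  local_overrides = {"example-shop.com": ["product_id"], "youtube.com": ["v", "list"]}

  def merge_rules(upstream, overrides):
      """Local overrides win; everything else falls through to upstream."""
      merged = dict(upstream)
      merged.update(overrides)
      return merged

  # The overrides alone are what would be proposed upstream,
  # e.g. as a PR against the shared rules file.
  print(json.dumps(local_overrides, indent=2))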



