The file has 2+ billion lines. A set or hash table won't work very well if there are, say, 500 million unique strings.


Hence the split. It's still likely to beat the crap out of sorting and it might not be needed, depending on the data. I do wonder what the dataset is that makes for such tiny, short lines.
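For anyone following along, here is a rough sketch of one reading of "the split": hash each line into one of N bucket files so that duplicates always land in the same bucket, then dedupe each bucket with an ordinary in-memory set. The function name, the bucket count, and the use of CRC32 as the partitioning hash are my own assumptions for illustration, not anything stated in the thread. Output order is not preserved.

```python
import os
import tempfile
import zlib


def dedupe_by_partition(src_path, dst_path, num_buckets=64):
    """Write the unique lines of src_path to dst_path (order not preserved).

    Hypothetical helper: num_buckets must be large enough that each
    bucket's unique lines fit in memory.
    """
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}") for i in range(num_buckets)]
        buckets = [open(p, "wb") for p in paths]
        try:
            # Pass 1: route each line to a bucket chosen by its hash.
            # Identical lines hash identically, so they share a bucket.
            with open(src_path, "rb") as src:
                for line in src:
                    key = line.rstrip(b"\n")
                    buckets[zlib.crc32(key) % num_buckets].write(key + b"\n")
        finally:
            for b in buckets:
                b.close()

        # Pass 2: dedupe one bucket at a time with a small in-memory set.
        with open(dst_path, "wb") as dst:
            for p in paths:
                seen = set()
                with open(p, "rb") as f:
                    for line in f:
                        if line not in seen:
                            seen.add(line)
                            dst.write(line)
```

If the data turns out to have few unique strings, a single set over the whole file would do and the first pass is wasted work, which is presumably the "might not be needed" part.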


Given the numbers, the need to dedupe, the poster's strong motivation and relative inexperience...

I'd guess it's a gigantic spam list.


Hah, self-duh. Total failure of technical cynicism on my part.


An in-memory set won't work. But indexes don't have to be in memory all the time to provide log N lookup; just think about how B-trees work.

...and this is a problem most databases have addressed.
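As a minimal sketch of that idea, assuming SQLite as the database (the thread names none): a primary-key table is backed by an on-disk B-tree, so duplicate checks do not require holding every string in RAM. The function, table, and column names below are made up for illustration.

```python
import sqlite3


def dedupe_with_sqlite(lines, db_path):
    """Insert lines into an on-disk table and return the unique count.

    Hypothetical helper: the PRIMARY KEY gives a B-tree index, and
    INSERT OR IGNORE silently skips lines that are already present.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (line TEXT PRIMARY KEY) WITHOUT ROWID"
        )
        with conn:  # one transaction for the whole batch, committed on exit
            conn.executemany(
                "INSERT OR IGNORE INTO seen (line) VALUES (?)",
                ((s,) for s in lines),
            )
        return conn.execute("SELECT COUNT(*) FROM seen").fetchone()[0]
    finally:
        conn.close()
```

Whether this beats a sort or a hash split on 2+ billion lines would depend on the data and the disk; it only shows that the lookup structure can live outside memory.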


It might well work; it depends a great deal on the data. In any event, I was talking about in-memory hashing, and the responder was correct to assume so.



