The file has 2+ billion lines. A set or hash table won't work very well if there a...

pvg · on Aug 14, 2010

Hence the split. It's still likely to beat the crap out of sorting and it might not be needed, depending on the data. I do wonder what the dataset is that makes for such tiny, short lines.
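For what it's worth, here is a rough sketch of the split idea in Python. It assumes "split" means partitioning lines into bucket files by hash, so that identical lines always land in the same bucket and each bucket fits in an in-memory set. The function name `dedupe_by_partition`, the `num_buckets` default, and the use of CRC32 as the hash are all illustrative choices, not anything specified in this thread.

```python
import os
import tempfile
import zlib


def dedupe_by_partition(src_path, dst_path, num_buckets=64):
    """Write the unique lines of src_path to dst_path (order is not preserved).

    Hypothetical helper: num_buckets must be large enough that each
    bucket's distinct lines fit in memory.
    """
    tmp_dir = tempfile.mkdtemp(prefix="dedupe_")
    bucket_paths = [os.path.join(tmp_dir, f"bucket_{i}") for i in range(num_buckets)]
    buckets = [open(p, "wb") for p in bucket_paths]
    try:
        # Pass 1: identical lines hash to the same bucket file.
        with open(src_path, "rb") as src:
            for line in src:
                line = line.rstrip(b"\r\n")
                buckets[zlib.crc32(line) % num_buckets].write(line + b"\n")
    finally:
        for b in buckets:
            b.close()

    # Pass 2: dedupe each bucket on its own with an ordinary in-memory set.
    with open(dst_path, "wb") as dst:
        for p in bucket_paths:
            seen = set()
            with open(p, "rb") as bucket:
                for line in bucket:
                    if line not in seen:
                        seen.add(line)
                        dst.write(line)
            os.remove(p)
    os.rmdir(tmp_dir)
```

Each pass reads its input sequentially and nothing gets sorted. Whether a single in-memory set would have been enough without the split still depends on how many distinct lines there are.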

aristus · on Aug 14, 2010

Given the numbers, the need to dedupe, the poster's strong motivation and relative inexperience...

I'd guess it's a gigantic spam list.

pvg · on Aug 14, 2010

Hah, self-duh. Total failure of technical cynicism on my part.

keefe · on Aug 14, 2010

An in-memory set won't work. Indexes don't have to be in memory all the time to provide O(log N) lookup; just think about how B-trees work.

...and this is a problem most databases have addressed.
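As a minimal sketch of leaning on a database for this, assuming SQLite via Python's standard `sqlite3` module (the thread doesn't name a database): the `PRIMARY KEY` on the column is backed by an on-disk B-tree, and `INSERT OR IGNORE` skips any line that is already present. The function name `dedupe_with_index` and the table name `seen` are made up for illustration.

```python
import sqlite3


def dedupe_with_index(lines, db_path):
    """Insert lines into an on-disk table and return the number of unique lines.

    Hypothetical helper: the PRIMARY KEY index does the duplicate check,
    so no in-memory set of all lines is needed.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS seen (line TEXT PRIMARY KEY)")
        with conn:  # commit the whole batch as a single transaction
            conn.executemany(
                "INSERT OR IGNORE INTO seen (line) VALUES (?)",
                ((line,) for line in lines),
            )
        return conn.execute("SELECT COUNT(*) FROM seen").fetchone()[0]
    finally:
        conn.close()
```

Passing an open file of text lines as `lines` streams them in one at a time. How this compares in speed to hashing or sorting on a 2+ billion line file is not something this sketch measures.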

pvg · on Aug 15, 2010

It might well work; it depends a great deal on the data. In any event, I was talking about in-memory hashing, so the responder's assumption on that point was correct.