A Bloom filter may give you false positives (i.e., you think you've seen it before, but you haven't). Or do you mean: use a Bloom filter (~2-3MB, so that it fits in the L3 cache) and look in the mmapped hash table only if the Bloom filter indicates that you may have seen the line already?
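To make sure we're talking about the same thing, here is a rough Python sketch of what I understand the two-level lookup to be. The names (`BloomFilter`, `dedupe`), the BLAKE2-based double hashing, and the filter parameters are all my own assumptions, and a plain `set` stands in for the mmapped hash table:

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one 128-bit digest (double hashing).
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def dedupe(lines, bloom, seen):
    """Yield each distinct line once; `seen` stands in for the mmapped table."""
    for line in lines:
        # Consult the big table only when the filter says "maybe seen".
        if line in bloom and line in seen:
            continue  # confirmed duplicate
        bloom.add(line)
        seen.add(line)
        yield line


if __name__ == "__main__":
    bloom = BloomFilter(num_bits=1 << 16, num_hashes=4)
    print(list(dedupe([b"a", b"b", b"a", b"c", b"b"], bloom, set())))
```

Since every "maybe" from the filter is confirmed against the exact table, the output stays exact even when the filter is wrong. Note that this sketch still writes every new line into the table; only the lookups are skipped.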

That sounds at least halfway feasible to me: if you assume 1/3 duplicates and 20 bytes per line, you'd get ~8GB worth of hash table entries. If you want the hash table only 70% full to limit the number of collisions, you'd need about 12GB of virtual memory to back it, but you'd only rarely access it since the Bloom filter screens most lookups.
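For the sizing, the 12GB figure is just the ~8GB of entries divided by the 70% target fill, rounded up. As a quick check (the helper name `backing_size_gb` is made up for illustration):

```python
def backing_size_gb(entries_gb, load_factor):
    """Memory needed to hold `entries_gb` of entries at the given fill ratio."""
    return entries_gb / load_factor


if __name__ == "__main__":
    # 8GB of entries at 70% fill: a bit over 11GB, so call it 12GB.
    print(f"{backing_size_gb(8, 0.70):.1f} GB")
```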

(To the person who downvoted it: can you say why you don't like the idea?)