Collisions aren't a major risk with MD5 when you also give someone the file size...

EthanHeilman · on Sept 16, 2014

>Finding a collision in MD5 is costly; finding a collision in MD5 which is within ±10% of the actual size is extremely costly (technically possible, but maybe not in your lifetime).

MD5 collisions within 10% of the size of the file can be found in seconds on an old laptop. I've done it; we assign it as homework in class.

Read this http://www.mathstat.dal.ca/~selinger/md5collision/

Notice that the two colliding executables are exactly the same file size. These attacks have only gotten better.

>Zip is an extremely good format for crafting fake files which match a checksum. Really any format which can take arbitrary metadata (which is MOST) is pretty easy.

The example I gave uses Windows and Linux executables. No zip files in sight. These attacks are from 2009.

Someone1234 · on Sept 16, 2014

> Notice that the two colliding executables are exactly the same file size. These attacks have only gotten better.

They're also 6 KB, not 200+ KB. They have been specially crafted to be as small as possible, to make the problem set as easy as possible.

> The example I gave uses Windows and Linux executables. No zip files in sight. These attacks are from 2009.

That's a really strange reply. What is it you think I said? To quote you quoting me: "Really any format which can take arbitrary metadata (which is MOST) is pretty easy."

So why you felt the need to point out that it is an executable, not a zip file, is strange to say the least.

EthanHeilman · on Sept 16, 2014

>They're also 6 KB, not 200+ KB. They have been specially crafted to be as small as possible, to make the problem set as easy as possible.

That is not how it works. MD5 is vulnerable to length extension attacks[0]. Once you collide part of an MD5 hash, if everything that follows that collision is the same, it can be as long as you want. Colliding large files is just as easy as colliding small files. You could perform the same exercise with 1GB executables.

[0]: http://en.wikipedia.org/wiki/Length_extension_attack
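The "same suffix" point can be illustrated without real MD5 collision blocks. It follows from MD5's iterated (Merkle-Damgård) structure: once two equal-length inputs reach the same internal state, feeding the same remaining bytes to both keeps the digests equal. Below is a minimal sketch in Python, assuming a deliberately weak toy hash (`toy_hash`, `compress`, the 16-bit state and the 8-byte block size are all made up for illustration, not MD5 itself) so that the initial collision can be brute-forced almost instantly:

```python
import hashlib
import struct

BLOCK = 8            # toy block size in bytes (real MD5 uses 64)
IV = b"\x00\x00"     # toy 16-bit initial state (hypothetical)

def compress(state: bytes, block: bytes) -> bytes:
    # Toy compression function: keep only 16 bits so collisions are cheap.
    return hashlib.md5(state + block).digest()[:2]

def toy_hash(msg: bytes) -> bytes:
    # Merkle-Damgard padding: 0x80, zeros up to a block boundary,
    # then one full block holding the message length in bits.
    padded = msg + b"\x80"
    padded += b"\x00" * (-len(padded) % BLOCK)
    padded += struct.pack("<Q", len(msg) * 8)
    state = IV
    for i in range(0, len(padded), BLOCK):
        state = compress(state, padded[i:i + BLOCK])
    return state

def find_block_collision():
    # Brute-force two different first blocks that reach the same state.
    # With a 16-bit state this must succeed within 65537 tries.
    seen = {}
    i = 0
    while True:
        block = struct.pack("<Q", i)
        state = compress(IV, block)
        if state in seen:
            return seen[state], block
        seen[state] = block
        i += 1

a, b = find_block_collision()
suffix = b"the rest of the file, identical on both sides " * 200

print(a != b, len(a) == len(b))
print(toy_hash(a + suffix) == toy_hash(b + suffix))
```

The cost of the search is paid once, on the first block; the length of `suffix` only affects how long the final hashing takes, not how hard the collision is to find. Both inputs have the same total size, so the length field in the padding is identical too.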

Someone1234 · on Sept 16, 2014

> Once you collide part of an MD5 hash, if everything that follows that collision is the same, it can be as long as you want. Colliding large files is just as easy as colliding small files.

I've read that three times and still don't follow what you're getting at. That isn't how length extension attacks work or can be utilised.

Please go ahead and generate a file that collides with any of the linked files and is the same file size. The content doesn't have to be valid or readable, junk/binary is fine. If you can do this in a reasonable period of time (e.g. 24 hrs) then your point would have been proven.

The smallest is 224K with a hash of 180caf23dd71383921e368128fb6db52.
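For anyone taking up the challenge, checking a candidate comes down to comparing both the digest and the byte count. A minimal sketch in Python, assuming you know the target's exact size in bytes (the thread only gives "224K", so the size is left as a parameter); `md5_and_size` and `matches` are names made up for this example, and the demo runs against a throwaway temp file rather than any of the linked files:

```python
import hashlib
import os
import tempfile

def md5_and_size(path, chunk_size=1 << 20):
    # Stream the file so large inputs do not need to fit in memory.
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size

def matches(path, expected_md5, expected_size):
    # True only if BOTH the MD5 hex digest and the byte count agree.
    actual_md5, actual_size = md5_and_size(path)
    return actual_md5 == expected_md5.lower() and actual_size == expected_size

# Demo on a throwaway file (not one of the files discussed in the thread).
data = b"junk/binary is fine" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    demo_path = tmp.name

ok = matches(demo_path, hashlib.md5(data).hexdigest(), len(data))
wrong_size = matches(demo_path, hashlib.md5(data).hexdigest(), len(data) + 1)
os.remove(demo_path)
print(ok, wrong_size)
```

For the challenge itself, `expected_md5` would be `180caf23dd71383921e368128fb6db52` and `expected_size` the exact byte count of the original file.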

codeflo · on Sept 16, 2014

That's not what a collision attack[1] is. You're probably thinking of a pre-image attack[2].

[1] http://en.wikipedia.org/wiki/Collision_attack

[2] http://en.wikipedia.org/wiki/Preimage_attack

Someone1234 · on Sept 16, 2014

I never used the expression "collision attack" in this thread. I did quote someone else who used that term, but the context of the whole discussion is clearly related to preimage attacks, not collision attacks.

sp332 · on Sept 16, 2014

That's if you're generating both sides. If someone has a file and I want to generate a new file with a matching MD5, that's a lot harder.
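The gap between the two cases can be seen with a scaled-down experiment. This is a rough sketch in Python, assuming a toy 16-bit digest (`toy_digest`, just the first two bytes of MD5, chosen so brute force finishes quickly); the message formats `mine-N` and `forged-N` are made up for illustration. It uses generic brute force only and says nothing about the structural attacks on real MD5:

```python
import hashlib
from itertools import count

BITS = 16  # toy digest width; real MD5 is 128 bits

def toy_digest(data: bytes) -> int:
    # First 16 bits of MD5, as an integer.
    return int.from_bytes(hashlib.md5(data).digest()[:2], "big")

def find_collision():
    # The attacker generates BOTH sides: any two inputs that agree will do.
    seen = {}
    for tries in count(1):
        msg = b"mine-%d" % (tries,)
        d = toy_digest(msg)
        if d in seen:
            return seen[d], msg, tries
        seen[d] = msg

def find_second_preimage(target: bytes):
    # One side is fixed: the forged input must hit the target's digest.
    wanted = toy_digest(target)
    for tries in count(1):
        msg = b"forged-%d" % (tries,)
        if toy_digest(msg) == wanted:
            return msg, tries

m1, m2, collision_tries = find_collision()
original = b"a file the attacker did not choose"
forged, preimage_tries = find_second_preimage(original)
print("collision after", collision_tries, "tries")
print("second preimage after", preimage_tries, "tries")
```

For an n-bit digest, generic brute force expects on the order of 2^(n/2) tries for a collision and 2^n for a second preimage. The fast MD5 collision tools beat the generic collision bound by a huge margin; as far as I know, nothing comparable has been published for matching a file you did not create.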

EthanHeilman · on Sept 16, 2014

MD5 is vulnerable to both collision attacks and targeted collision attacks. We can imagine both in the WikiLeaks case. You are correct that targeted collision attacks are more difficult, but they have been demonstrated in research for many years now[0] (2006), and they are showing up in the wild as well[1] (2012).

[0]: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140....

[1]: http://blogs.technet.com/b/srd/archive/2012/06/06/more-infor...

sp332 · on Sept 16, 2014

Those are both chosen-prefix attacks. They're impressive, but not relevant to this case where one file is completely out of the attacker's control.

userbinator · on Sept 16, 2014

> Finding a collision in MD5 is costly

Not at all; look up "md5coll" and "fastcoll". Released nearly 10 years ago, they could generate a pair of colliding blocks in under an hour. Testing them now on my machine (which is already a few years old), they generated a pair in under a second(!)

This has been used to create executables that behave differently, but that works because they can inspect themselves. On the other hand, I think generating two .zip files with the same hash but different (valid) contents would be rather more difficult, though it's probably still quite feasible today.

Someone1234 · on Sept 16, 2014

You're ignoring half of my post (on purpose?): now generate the collisions with matching file sizes. Even that "under a second" concept relies on tiny files.

As files get larger, matching both the MD5 and the file size becomes more costly.

emmelaich · on Sept 16, 2014

That's a point I've always wondered about. Given that most (all?) MD5 collisions consist of appending or prepending data, how much more difficult would it be if you encode the size as well?

Surely the difficulty is much greater. And then add the fact that it has to be semantically/syntactically similar enough to fool whatever ingests it...
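One detail relevant to that question: MD5 already encodes the size. Per RFC 1321, the padding appended before hashing ends with the message length in bits as a 64-bit little-endian integer. A small sketch in Python of that padding rule (`md5_padding` is a helper name made up for this example; it builds only the padding, not the hash):

```python
import struct

def md5_padding(message_len: int) -> bytes:
    # Padding MD5 appends to a message of `message_len` bytes (RFC 1321):
    # a single 0x80 byte, zero bytes until the length is 56 mod 64,
    # then the message length in bits as a 64-bit little-endian integer.
    zeros = (55 - message_len) % 64
    bit_len = (message_len * 8) & 0xFFFFFFFFFFFFFFFF
    return b"\x80" + b"\x00" * zeros + struct.pack("<Q", bit_len)

for n in (0, 3, 55, 56, 1000):
    pad = md5_padding(n)
    print(n, len(pad), (n + len(pad)) % 64, pad[-8:].hex())
```

Since the colliding pairs discussed in this thread are the same length to begin with, the length field is identical on both sides and does not get in the way of those collisions. Whether publishing the size separately helps against someone matching a file they did not create is a different question.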