It's my understanding that Git is pretty bad for storing binary files that chang...

peff · on Dec 11, 2014

It's not exactly plaintext versus binary. Git "stores" a copy of every file, no matter the content. But it also does "delta compression" between objects when you run "git gc"; it tries to find binary diffs between objects that are similar. So the poor performance comes from files that don't "delta" well.

These tend to be things which are compressed or encrypted, where a small semantic change can cascade into a lot of bytes changing. Of course, binary formats often do both of those things. And the larger the files, the more painful it is when it happens.
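
To make the cascade concrete, here is a rough Python sketch. It is not how git computes deltas: `changed_span` is a made-up helper that only trims the common prefix and suffix, a crude stand-in for a real delta, and the sample data is invented. It swaps one byte in some generated text and compares how much of the raw version and of the zlib-compressed version would be left to store.

```python
import zlib


def changed_span(old: bytes, new: bytes) -> int:
    """Bytes of `new` left after trimming the prefix and suffix it shares with `old`."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    # Never let the suffix overlap the prefix already counted.
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return len(new) - prefix - suffix


# Hypothetical sample data: 2000 short text rows.
original = "".join(f"row {i}: value {i * i}\n" for i in range(2000)).encode()
# Replace a single byte (the "0" in "row 0") with "X".
edited = original[:4] + b"X" + original[5:]

raw_span = changed_span(original, edited)
packed_old = zlib.compress(original)
packed_new = zlib.compress(edited)
packed_span = changed_span(packed_old, packed_new)

print(f"raw:  {raw_span} of {len(edited)} bytes outside the shared prefix/suffix")
print(f"zlib: {packed_span} of {len(packed_new)} bytes outside the shared prefix/suffix")
```

The raw versions differ in exactly one byte, while the compressed streams diverge almost from the start and never line up again at the end, so a delta between the two compressed files has very little to reuse.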

e12e · on Dec 13, 2014

Incidentally, the new office formats (both the "open" and the open one) are zip files containing (among other things) XML documents. I saw someone a while back recommending storing them uncompressed when working with a VCS -- it gives pretty useful diffs with no extra work (I forget whether they recommended extracting, or just storing without compression).
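
For the "store without compression" variant, something like the following Python sketch should do it, using only the standard `zipfile` module. `repack_stored` is a name I made up, the sample archive is a stand-in rather than a real office file, and I have not checked whether every office application is happy with the repacked result. Member order is kept, but timestamps are not.

```python
import io
import zipfile


def repack_stored(src: bytes) -> bytes:
    """Return a copy of the ZIP archive `src` with every member stored uncompressed."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src)) as zin, zipfile.ZipFile(out, "w") as zout:
        for info in zin.infolist():
            # Same name and content, but written with ZIP_STORED (no compression).
            zout.writestr(info.filename, zin.read(info),
                          compress_type=zipfile.ZIP_STORED)
    return out.getvalue()


# Hypothetical stand-in for an office document: one deflated XML member.
xml = b"<doc>" + b"<p>hello</p>" * 50 + b"</doc>"
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as z:
    z.writestr("content.xml", xml)
deflated = buf.getvalue()

stored = repack_stored(deflated)
with zipfile.ZipFile(io.BytesIO(stored)) as z:
    is_stored = z.getinfo("content.xml").compress_type == zipfile.ZIP_STORED
    same_content = z.read("content.xml") == xml

# The XML now sits verbatim inside the archive bytes, so a delta between
# two revisions of the archive can line up on the unchanged XML.
xml_visible = xml in stored
print(is_stored, same_content, xml_visible)
```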

agumonkey · on Dec 11, 2014

I remember naively trying git and svn (around 2008, so...) on a webdev project involving many images, including some xMB PSD files. Surprisingly, the git repo grew sublinearly, while the svn one grew superlinearly.

With git packing, the full repo with a bit of history (mostly adding some PHP and a few images) ended up smaller than the original native Windows folder. Version control and compression; nice :)

hardwaresofton · on Dec 11, 2014

Very good point - however I think this is just an implementation detail of Git as it stands today.

Also, if the person is working with SVG (for example), then it's less of a problem.

Also, given the cheapness of disk, I don't think it would be a limiting cost. And since git is open source, if I were to actually make this thing, it would definitely incentivize me (or others) to make git less bad for storing binary files :)

vtemian · on Dec 11, 2014

You are right about large binary files and gitfs currently has an option limiting the maximum file size.

jimktrains2 · on Dec 11, 2014

> (whereas with plaintext it can just store the diffs)

git stores full blobs, not deltas.

rakoo · on Dec 11, 2014

I guess it depends how deep you dive... It is true that, on the surface, git gives you, the user, access to full blobs only, and calculates differences whenever you ask for them. But when you look inside packfiles, the content is actually stored as diffs, because that compresses so well.

In the context of the discussion, since we're interested in the on-disk format, it's more accurate to say that git will try to diff binary blobs, fail at that, and so store the full content of those blobs.

jimktrains2 · on Dec 11, 2014

Not everything is in packfiles right away.
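
For reference, a freshly added file starts out as a "loose" object: the whole content behind a short header, zlib-compressed and named by its hash. A minimal Python sketch, assuming a SHA-1 repository; `loose_object` is a name made up for illustration, and git's own compression settings may produce different compressed bytes for the same payload.

```python
import hashlib
import zlib


def loose_object(content: bytes) -> tuple:
    """Return (object id, compressed bytes) for a blob written as a loose object."""
    # Header is "blob <size in bytes>" followed by a NUL, then the full content.
    payload = b"blob %d\x00" % len(content) + content
    return hashlib.sha1(payload).hexdigest(), zlib.compress(payload)


oid, on_disk = loose_object(b"hello\n")
print(oid)  # ce013625030ba8dba906f756967f9e9ca394464a, as `git hash-object` reports
# Git would keep this under .git/objects/ce/013625... until it gets packed.
```

No delta is involved at this stage; deltas only appear once objects are packed.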