Happy new year and goodbye bzip2

jaytaylor · on Dec 28, 2013

Good riddance; for text files, gzip is mediocre compression and fast, lzma (xz) is high compression and slow. Bzip2 is medium-high compression and _really_ slow [0].

[0] http://tukaani.org/lzma/benchmarks.html

pedrocr · on Dec 28, 2013

Your link shows bzip2 having a more balanced compression time than lzma that is often 3-4 times slower. So for a tarball xz sounds indeed better as you compress once and uncompress many times. For compressing HTTP or things of the sort bzip2 might be better, but maybe for those cases gzip is already best.

jaytaylor · on Dec 28, 2013

You are right, I was speaking to the compression ratios and decompression times. For cases where something is compressed once and decompressed many times I don't care much how long it takes to do the initial compression (within reason, of course).

fiffig · on Dec 28, 2013

Here is another with lz4, lzo as well as lzma1, although they're not very relevant for the tarball test. Gzip -9 is on par with bzip2 -1 while almost twice as fast. And uses practically no memory(!).

http://pokecraft.first-world.info/wiki/Quick_Benchmark:_Gzip...

acdha · on Dec 28, 2013

-1 is an edge-case: if you're memory constrained, bzip2 isn't a good fit. Otherwise, you pay the extremely low percentage increase in CPU time to use -9 where bz2 will massively outperform gzip on most workloads.

Scaevolus · on Dec 28, 2013

Note that LZMA decompressed at 20MB/s in that benchmark, while bzip2 managed about 6.

LZMA tends to be faster and more efficient than bzip2 in all respects.

Buge · on Dec 28, 2013

pedrocr said that LZMA decompression is faster, but that sometimes LZMA compression is slower.

ketralnis · on Dec 28, 2013

I don't know how well lzma parallelises, but bzip2 is plenty fast for me with a multi-core version like pbzip2, as measured in wall-time on an otherwise idle system. Different considerations in different environments and all that.

rwmj · on Dec 28, 2013

Unfortunately there's no multithreaded option for the standard xz uncompression tools (xzcat, unxz). However I wrote (as a quick hack) a parallel xzcat:

http://git.annexia.org/?p=pxzcat.git;a=summary

I also wrote a random access program for xz files:

https://github.com/libguestfs/nbdkit

(in the plugins/xz subdirectory)

langarto · on Dec 28, 2013

The version of the tools that I have installed (5.1.2alpha in Fedora) do support a -T option for parallel compression which I use regularly.

However, it is true that that same option does not seem to work everywhere. At least, it didn't seem to work a few weeks ago in a different system (Ubuntu). The option is accepted, but does not seem to have any effect.

vasi · on Dec 28, 2013

lzma parallelizes very well! My implementation of parallel xz ( https://github.com/vasi/pixz ) works quite well. There are others as well, including pxz and the alpha version of the standard xz. All these tools produce compressed files that conform to the xz format and can be decompressed by any xz utility.

sliverstorm · on Dec 28, 2013

I'll throw out a vote for lbzip2. It threads strongly, and doesn't cannibalize your memory like threaded xz. (You can still use gzip where every CPU cycle counts)

itry · on Dec 28, 2013

I always was fascinated by compression. It feels like no big progress was ever made in lossless compression. How would todays best compression algos compare to the ones used during the times of the C64? And compared to the first zip algorithm implemented in PKZIP?

Sharlin · on Dec 28, 2013

Well, there are clear information theoretic limits on how much you can compress something, so any improvements are necessarily going to be incremental and converging to the theoretical limit. The more special-purpose the compression algorithm, the more assumptions it can make, the better it can perform on the sort of data it's designed for, at the expense of behaving poorly in the general case. Indeed, it can be shown that any possible lossless compression algorithm, on average, will yield an output larger than its input; it's just that most "interesting" data contains plenty of redundancy that's reliably compressible.

Someone · on Dec 29, 2013

Also, it often is better to measure an improvement as relative to the maximum improvement possible, rather than as a percentage of the file size.

For compression, let's say that the best possible compressor compresses a file to 30% of its size, and the current compressor reaches 50%. Then, an improvement to 45% should not be seen as 'only 5%', or as '10% smaller', but as '25% of the maximum possible improvement'. A follow-on step that gets you to 40% would be a larger improvement of 33%.

That, IMO, is a reasonable way to somewhat compensate for the fact that the low hanging fruit gets plucked by those who come first.

And yes, there is a problem there. That 'best possible compressor', theoretically, can produce extremely small files. Maybe your Wikipedia dump happens to be equal to the binary expansion of sin(1/e + 34/sqrt(PI)) to a few billion digits, but how are you going to find out? So, for most files, we don't really know what that best compression is.

wolf550e · on Dec 28, 2013

Quite a bit. For excellent introduction to the topic, read this: http://mattmahoney.net/dc/dce.html

For fast lossless compressors, see this article using an Apple II to compare the best of today's (LZ4) with what was available when the hardware was popular/new.

hornetblack · on Dec 28, 2013

One new thing is that patents on arithmetic compression expired. Libjpeg will use now, but too much software won't understand it.

wmf · on Dec 28, 2013

The low-hanging fruit was picked long ago, but there has been some interesting recent work like Snappy, PAQ, and Intel's improvements to gzip ( http://mail.madler.net/pipermail/zlib-devel_madler.net/2013-... ).

ye · on Dec 28, 2013

Compression algorithms are improving. There are compression programs out there that will beat pretty much anything famous like xz, but 99% of the time at the expense of either compression/decompression time or memory.

PAQ8PX is a classic example of this - amazing compression ratios, but ridiculously slow.

http://www.maximumcompression.com/data/summary_mf.php

http://www.maximumcompression.com/data/summary_mf3.php

There are programs that can shrink JPG losslessly by 24%:

http://www.maximumcompression.com/data/jpg.php

NanoZip keeps a very good balance of speed and compression ratio, but it's so unpopular, it's not very practical.

tlhunter · on Dec 28, 2013

With typical files, XZ Utils create 30 % smaller output than gzip and 15 % smaller output than bzip2.

http://tukaani.org/xz/

Myrth · on Dec 28, 2013

Next up: lrzip?

http://www.techradar.com/us/news/software/applications/best-...

bcrack · on Dec 28, 2013

The most appropriate compression algorithm to use is highly case-dependent (data structure, compression vs decompression cpu/wall-clock time, compression ratio, system memory,bandwidth, etc.). I'm sure people at kernel.org have their reasons to convert to xz.

The following link has some interesting (albeit non-exhaustive) benchmarks, comparing various compression algorithms (both serial and parallel implementations). http://vbtechsupport.com/1614/

xuhu · on Dec 28, 2013

Why ditch bzip2 and not gzip ? It was clearly for storage space.

clarry · on Dec 28, 2013

For compatibility and speed, gzip is a winner. It's fast and it's everywhere (plus it compresses text well enough that it makes sense to use in place of non-compressed files).

For strong compression, there are better options than bzip2.

In a sense, bzip2 is "worse of both worlds".

ksec · on Dec 28, 2013

Tl;dr, Thx for the summary

citrin_ru · on Dec 28, 2013

Try to extract xz compressed linux kernel on host with 128Mb RAM and 500MHz CPU. Then try to extract bzip2 - it will be very slow. IMHO gzip is the best option for old hardware or small virtual machines.

thaJeztah · on Dec 28, 2013

XZ sounds interesting. Anybody know if work is in progress to add XZ support in webbrowsers as an alternative to gz compression? Any webserver already supporting xz?

slashdotaccount · on Dec 28, 2013

I used to create WinRAR backup archives with recovery record or recovery volumes for DVD backups. Are there other (free and open) archive formats with such technologies? Or maybe stand-alone tools?

happyhappy · on Dec 28, 2013

If I understand your question correctly, you should look into PAR2

d0ugie · on Dec 28, 2013

did someone say WebP? :)

manuw · on Dec 28, 2013

Did the bzip maintainer steal someones girlfriend or anything else?