Funny this comes up again so soon after I needed it! I recently did a proof-of-concept related to bioinformatics (gene assembly, etc.), and one quirk of that space is that they work with enormous text files. Think tens of gigabytes being a "normal" size. Just compressing and copying these around is a pain.
One trick I discovered is that tools like pigz can be used to both accelerate the compression step and also copy to cloud storage in parallel! E.g.:
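Something along these lines (the bucket and file names are placeholders I made up; -p sets the pigz thread count):

```shell
# Compress with all cores and stream straight to cloud storage,
# no temporary file: pigz writes to stdout, aws s3 cp reads stdin.
pigz -p "$(nproc)" -c reads.fastq | aws s3 cp - s3://my-bucket/reads.fastq.gz
```

Because the compressor and the uploader run concurrently, the network transfer overlaps with the CPU work instead of happening after it.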
There is a similar pipeline available for s3cmd as well with the same benefit of overlapping the compression and the copy.
However, if your tools support zstd, then it's more efficient to use that instead. Try the "zstd -T0" option or the "pzstd" tool for even higher throughput, but with some minor caveats.
PS: In case anyone here is working on the above tools, I have a small request! What would be awesome is to automatically tune the compression ratio to match the available output bandwidth. With the '-c' output option, this is easy: just keep increasing the compression level by one notch whenever the output buffer is full, and reduce it by one level whenever the output buffer is empty. This will automatically tune the system to get the maximum total throughput given the available CPU performance and network bandwidth.
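The proposal amounts to a tiny feedback controller. A minimal sketch in Python (the function name and thresholds are mine, not any tool's actual API):

```python
def adapt_level(level, buffer_fill, lo=1, hi=19):
    """One tick of the proposed controller.

    buffer_fill is the output buffer occupancy in [0, 1]. A full buffer
    means the network is the bottleneck, so spend more CPU compressing
    harder; an empty buffer means the CPU is the bottleneck, so back off.
    """
    if buffer_fill >= 1.0:
        return min(level + 1, hi)   # output backed up: compress harder
    if buffer_fill <= 0.0:
        return max(level - 1, lo)   # output starved: compress lighter
    return level                    # in between: hold steady
```

Called once per output-buffer check, this converges toward the highest level the CPU can sustain at the available bandwidth.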
--adapt[=min=#,max=#]
    zstd will dynamically adapt compression level to perceived I/O conditions. Compression level adaptation can be observed live by using command -v. Adaptation can be constrained between supplied min and max levels. The feature works when combined with multi-threading and --long mode. It does not work with --single-thread. It sets window size to 8 MB by default (can be changed manually, see wlog). Due to the chaotic nature of dynamic adaptation, compressed result is not reproducible.
I really should have read the documentation! That feature looks awesome, but in a quick test it could only use about 50% of the available output bandwidth. My upload speed is 50 Mbps, but zstd could only send about 25 Mbps.
Similarly, on a local speed test (SSD -> SSD), using a fixed compression level was much faster than --adapt.
"" note : at the time of this writing, --adapt can remain stuck at low speed when combined with multiple worker threads (>=2). ""
There are some tunables under the ADVANCED COMPRESSION OPTIONS section (--zstd=...) that might help.
Leave wlog alone unless you're willing to store the value out of band and pass it in again during decompression.
hashLog: a bigger number uses more memory to compress, but is often faster.
chainLog: a smaller number compresses faster, but with a worse ratio.
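For example (the values here are just illustrative starting points, not recommendations; valid ranges depend on your zstd build):

```shell
# Pass tunables through --zstd=...; -T0 uses all cores.
# Output goes to input.fastq.zst via -o.
zstd -T0 --zstd=hashLog=24,chainLog=26 input.fastq -o input.fastq.zst
```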
In your use case, monitoring general system utilization to identify bottlenecks might also help. My gut instinct is that you may already have hit the platform's memory bandwidth limit, at which point REDUCING hashLog until it fits within your intended performance budget might yield better throughput. Reducing chainLog might have the same effect.
If you're running your test over the internet (fluctuating latency, some packet loss), try enabling the BBR [1] TCP congestion control algorithm on the sender side to utilize the available bandwidth more efficiently.
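On Linux, switching to BBR is two sysctls (root required; the fq qdisc is the usual pairing, and the settings shown are not persistent across reboots):

```shell
# Check that bbr is available, then make it the default.
sysctl net.ipv4.tcp_available_congestion_control
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
```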
Bioinformatician here. Consider using bgzf instead. It's fully backwards compatible with gzip (it is a subset of gzip), but de/compression can be implemented to be much faster, and it's also much easier to parallelize. Compression ratios are slightly lower, i.e. it creates larger files.
The bgzf format was invented for bioinformatics, and BAM files are bgzipped by default.
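The trick that makes it easy to parallelize is that a BGZF file is just a series of independently compressed gzip members. A toy illustration with the Python stdlib (real BGZF additionally records each block's compressed size in a gzip extra field so readers can seek; this sketch only shows why the result is still valid gzip):

```python
import gzip

def bgzf_like(data, block_size=65280):
    # Compress each block as its own gzip member. Blocks are
    # independent, so they can be handed to parallel workers.
    return b"".join(
        gzip.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    )

blob = bgzf_like(b"ACGT" * 50000)
# Concatenated gzip members still form one valid gzip stream:
assert gzip.decompress(blob) == b"ACGT" * 50000
```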
Hasn't that always been the case? When I lived in the Bay Area and networked, and was exposed to so much of how people did things, I was able to land many more contracts because I could come up with simple existing solutions. Once I moved away and wasn't exposed to cool people doing cool problem-solving things, I had to switch to more bog-standard consulting.
It blew people's minds when we could implement huge projects, ones they could never get off the ground for years, in just a matter of months because of 'this one cool trick'.
Man, I miss Santa Cruz ಥ﹏ಥ. Worst mistake of my life to leave there.
bgzf has a clear advantage when it comes to sorted BED/VCF/GTF formats, especially if one indexes them.
But frankly I have no idea whether it improves IO times for fastq files when read by, say, bwa or another mapper. Do you have any experience with that?
I have seen some mapping-time improvements using fastqs with clustered reads (via clumpify).
Yes, bgzf can be much faster in practice. Both because the underlying gzip implementation can be simplified (see e.g. libdeflate), and because it can be more effectively parallelised. It doesn't matter if it's IO-bound of course, but compression rarely is.
While the topic of compressing FASTQ comes up, you might be interested to know that kmer sorting FASTQ files can lead to around an 8% improvement in compression depending on the diversity.
I believe this is because it puts similar strings together in the tree gzip uses. clumpify.sh from bbtools is one example.
It was probably fasta or fastq format, which is pretty much line-separated strings of the DNA letters. BSON won't help with that. You could try to squeeze each of the 4 letters into 2 bits, but they use a few more letters to indicate "unknown", so it's not that easy. And even the simplest compression algorithms just do that for you.
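A sketch of the 2-bit idea, including the "unknown letter" wrinkle (function names are mine, purely for illustration):

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    """Pack a DNA string into an int, 2 bits per base.

    Non-ACGT letters (N, ambiguity codes) are recorded separately,
    which is exactly what makes naive 2-bit packing not quite enough.
    """
    bits = 0
    exceptions = []
    for i, base in enumerate(seq):
        if base in CODE:
            bits = (bits << 2) | CODE[base]
        else:
            bits <<= 2                      # placeholder slot
            exceptions.append((i, base))    # stored out of band
    return bits, exceptions

def unpack(bits, length, exceptions):
    out = ["ACGT"[(bits >> (2 * (length - 1 - i))) & 3]
           for i in range(length)]
    for i, base in exceptions:              # restore the unknowns
        out[i] = base
    return "".join(out)
```

A general-purpose compressor effectively discovers the same 2-bit alphabet on its own, which is why plain gzip/zstd on the text already captures most of this win.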
And text is easier to work with in any hacked up script.
There are some binary formats (BAM, I think), but people often prefer the text format anyway. When compressed, the size is pretty much the same.
Because scientists are woefully undereducated on the finer points of computer science. Whether it (eventually) produces publishable papers is all they care about.