Funny this comes up again so soon after I needed it! I recently did a proof-of-concept related to bioinformatics (gene assembly, etc.), and one quirk of that space is that they work with enormous text files. Think tens of gigabytes being a "normal" size. Just compressing and copying these around is a pain.
One trick I discovered is that tools like pigz can be used to both accelerate the compression step and also copy to cloud storage in parallel! E.g.:
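Something along these lines (the bucket and file names are placeholders I made up; -p sets the pigz thread count):

```shell
# Compress with all cores and stream straight to cloud storage,
# no temporary file: pigz writes to stdout, aws s3 cp reads stdin.
pigz -p "$(nproc)" -c reads.fastq | aws s3 cp - s3://my-bucket/reads.fastq.gz
```

Because the compressor and the uploader run concurrently, the network transfer overlaps with the CPU work instead of happening after it.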
There is a similar pipeline available for s3cmd as well with the same benefit of overlapping the compression and the copy.
However, if your tools support zstd, then it's more efficient to use that instead. Try the "zstd -T0" option or the "pzstd" tool for even higher throughput, but with some minor caveats.
PS: In case anyone here is working on the above tools, I have a small request! What would be awesome is to automatically tune the compression ratio to match the available output bandwidth. With the '-c' output option, this is easy: just keep increasing the compression level by one notch whenever the output buffer is full, and reduce it by one level whenever the output buffer is empty. This will automatically tune the system to get the maximum total throughput given the available CPU performance and network bandwidth.
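The proposal amounts to a tiny feedback controller. A minimal sketch in Python (the function name and thresholds are mine, not any tool's actual API):

```python
def adapt_level(level, buffer_fill, lo=1, hi=19):
    """One tick of the proposed controller.

    buffer_fill is the output buffer occupancy in [0, 1]. A full buffer
    means the network is the bottleneck, so spend more CPU compressing
    harder; an empty buffer means the CPU is the bottleneck, so back off.
    """
    if buffer_fill >= 1.0:
        return min(level + 1, hi)   # output backed up: compress harder
    if buffer_fill <= 0.0:
        return max(level - 1, lo)   # output starved: compress lighter
    return level                    # in between: hold steady
```

Called once per output-buffer check, this converges toward the highest level the CPU can sustain at the available bandwidth.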
--adapt[=min=#,max=#]
    zstd will dynamically adapt compression level to perceived I/O conditions. Compression level adaptation can be observed live by using command -v. Adaptation can be constrained between supplied min and max levels. The feature works when combined with multi-threading and --long mode. It does not work with --single-thread. It sets window size to 8 MB by default (can be changed manually, see wlog). Due to the chaotic nature of dynamic adaptation, compressed result is not reproducible.
I really should have read the documentation! That feature looks awesome, but in a quick test it could only use about 50% of the available output bandwidth. My upload speed is 50 Mbps, but zstd could only send about 25 Mbps.
Similarly, on a local speed test (SSD -> SSD), using a fixed compression level was much faster than --adapt.
"" note : at the time of this writing, --adapt can remain stuck at low speed when combined with multiple worker threads (>=2). ""
There are some tunables under the ADVANCED COMPRESSION OPTIONS section (--zstd=...) that might help.
Leave wlog alone unless you're willing to store the value out of band and pass it in again during decompression.
hashLog: a bigger number uses more memory to compress, but is often faster.
chainLog: a smaller number compresses faster, but with a worse ratio.
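For example (the values here are just illustrative starting points, not recommendations; valid ranges depend on your zstd build):

```shell
# Pass tunables through --zstd=...; -T0 uses all cores.
# Output goes to input.fastq.zst via -o.
zstd -T0 --zstd=hashLog=24,chainLog=26 input.fastq -o input.fastq.zst
```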
In your use case, monitoring general system utilization to identify bottlenecks might also help. My gut instinct is that you may already have hit the platform's memory bandwidth limit, at which point REDUCING hashLog until it fits within your intended performance budget might yield better throughput. Reducing chainLog might have the same effect.
If you're running your test over the internet (fluctuating latency, some packet loss), try enabling the BBR [1] TCP congestion control algorithm on the sender side to utilize the available bandwidth more efficiently.
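On Linux, switching to BBR is two sysctls (root required; the fq qdisc is the usual pairing, and the settings shown are not persistent across reboots):

```shell
# Check that bbr is available, then make it the default.
sysctl net.ipv4.tcp_available_congestion_control
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
```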
Bioinformatician here. Consider using bgzf instead. It's fully backwards compatible with gzip (it is a subset of gzip), but de/compression can be implemented to be much faster, and it's also much easier to parallelize. Compression ratios are slightly lower, i.e. it creates larger files.
The bgzf format was invented for bioinformatics, and BAM files are bgzipped by default.
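The trick that makes it easy to parallelize is that a BGZF file is just a series of independently compressed gzip members. A toy illustration with the Python stdlib (real BGZF additionally records each block's compressed size in a gzip extra field so readers can seek; this sketch only shows why the result is still valid gzip):

```python
import gzip

def bgzf_like(data, block_size=65280):
    # Compress each block as its own gzip member. Blocks are
    # independent, so they can be handed to parallel workers.
    return b"".join(
        gzip.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    )

blob = bgzf_like(b"ACGT" * 50000)
# Concatenated gzip members still form one valid gzip stream:
assert gzip.decompress(blob) == b"ACGT" * 50000
```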
Hasn't that always been the case? When I lived in the Bay Area and networked, and was exposed to so much of how people did things, I was able to land many more contracts because I could come up with simple existing solutions. Once I moved away and wasn't exposed to cool people doing cool problem-solving things, I had to switch to more bog-standard consulting.
It blew people's minds when we could implement huge projects, ones they could never get off the ground for years, in just a matter of months because of 'this one cool trick'.
Man, I miss Santa Cruz ಥ﹏ಥ. Worst mistake of my life to leave there.
bgzf has a clear advantage when it comes to sorted BED/VCF/GTF formats, especially if one indexes them.
But frankly I have no idea whether it improves IO times for fastq files when read by, say, bwa or another mapper. Do you have any experience with that?
I have seen some mapping-time improvements using fastqs with clustered reads (via clumpify).
Yes, bgzf can be much faster in practice. Both because the underlying gzip implementation can be simplified (see e.g. libdeflate), and because it can be more effectively parallelised. It doesn't matter if it's IO-bound of course, but compression rarely is.
While the topic of compressing FASTQ comes up, you might be interested to know that kmer sorting FASTQ files can lead to around an 8% improvement in compression depending on the diversity.
I believe this is because it puts similar strings together in the tree gzip uses. clumpify.sh from bbtools is one example.
It was probably fasta or fastq format, which is pretty much line-separated strings of the DNA letters. BSON won't help with that. You could try to squeeze each of the 4 letters into 2 bits, but they use a few more letters to indicate "unknown", so it's not that easy. And even the simplest compression algorithms just do that for you.
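A sketch of the 2-bit idea, including the "unknown letter" wrinkle (function names are mine, purely for illustration):

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    """Pack a DNA string into an int, 2 bits per base.

    Non-ACGT letters (N, ambiguity codes) are recorded separately,
    which is exactly what makes naive 2-bit packing not quite enough.
    """
    bits = 0
    exceptions = []
    for i, base in enumerate(seq):
        if base in CODE:
            bits = (bits << 2) | CODE[base]
        else:
            bits <<= 2                      # placeholder slot
            exceptions.append((i, base))    # stored out of band
    return bits, exceptions

def unpack(bits, length, exceptions):
    out = ["ACGT"[(bits >> (2 * (length - 1 - i))) & 3]
           for i in range(length)]
    for i, base in exceptions:              # restore the unknowns
        out[i] = base
    return "".join(out)
```

A general-purpose compressor effectively discovers the same 2-bit alphabet on its own, which is why plain gzip/zstd on the text already captures most of this win.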
And text is easier to work with in any hacked up script.
There are some binary formats (BAM, I think), but people often prefer the text format anyway. When compressed, the size is pretty much the same.
Because scientists are woefully undereducated on the finer points of computer science. Whether it (eventually) produces publishable papers is all they care about.