Which compression format (.xz, .tar.gz, .tar.bz2, and so on) is recommended for compressing a dataset of FASTA nucleotide sequences?

What would be the recommended compression mechanisms for such data?

  1. Dictionary based compression
  2. Adaptive dictionary based compression
  3. LZW algorithm based compression
Timur Shtatland
Allan K
  • Use gzip because everyone uses gzip. Even if you can squeeze a bit more compression out of another method, more bioinformatics tools will read gzipped files. – CJR Oct 30 '21 at 17:46
  • Certainly not LZW. That's obsolete technology. A great deal of attention has been paid to the compression of sequencing data. For fasta, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3866555/ – Mark Adler Oct 30 '21 at 21:36

1 Answer

gzip is the format I have seen used most often, so I recommend it, as CJR said in the comments. It is the option most likely to work for your collaborators and their tools, even though it is not the most efficient (depending on how you define efficiency).
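To illustrate why gzip is convenient, here is a minimal sketch of reading a FASTA file that may or may not be gzipped, using only the Python standard library. The helper names `open_maybe_gzip` and `read_fasta` are made up for this example, and it assumes the file name ends in `.gz` when the file is compressed.

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path):
    """Open a file as text, decompressing on the fly if the name ends in .gz."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzipped FASTA file."""
    header, chunks = None, []
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Emit the last record in the file.
    if header is not None:
        yield header, "".join(chunks)
```

Calling `read_fasta("sequences.fasta.gz")` (a hypothetical file name) iterates over the records without writing a decompressed copy to disk.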

If you and your collaborators can all install specialized compression tools, more efficient options may be worth a look. For example, this paper compares many of them across several metrics (see especially Figure 1):

Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi, Sequence Compression Benchmark (SCB) database—A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences, GigaScience, Volume 9, Issue 7, July 2020, giaa072, https://doi.org/10.1093/gigascience/giaa072 : https://academic.oup.com/gigascience/article/9/7/giaa072/5867695
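For a quick check on your own data before trying specialized tools, a rough sketch like the one below compares the general-purpose codecs that ship with Python (gzip, bzip2, xz). It does not cover the sequence-specific compressors evaluated in the paper, it measures only compressed size and compression time at each library's default settings, and the function name `compare_codecs` is made up for this example.

```python
import bz2
import gzip
import lzma
import time


def compare_codecs(data: bytes) -> dict:
    """Return {codec: (compressed_size_in_bytes, seconds)} for in-memory data."""
    codecs = {
        "gzip": gzip.compress,
        "bz2": bz2.compress,
        "xz": lzma.compress,  # lzma.compress writes the .xz container by default
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(compressed), elapsed)
    return results


if __name__ == "__main__":
    import sys

    # Usage: python compare_codecs.py my_sequences.fasta
    with open(sys.argv[1], "rb") as handle:
        raw = handle.read()
    print(f"raw\t{len(raw)} bytes")
    for name, (size, seconds) in compare_codecs(raw).items():
        print(f"{name}\t{size} bytes\t{seconds:.2f} s")
```

This reads the whole file into memory, so it is only suitable for a sample of the dataset, not for very large files.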

Timur Shtatland