While looking for a good way to store large amounts of data (mostly from numerical computations) long-term, I settled on the xz archive format (tar.xz). The default LZMA compression there gives significantly smaller archives (for my type of data) than the more common tar.gz (both with reasonable compression options).
However, the first Google result on the safety of long-term usage of xz was a web page (written by one of the developers of lzip) titled "Xz format inadequate for long-term archiving". It lists several reasons, including:
- xz being a container format as opposed to simple compressed data preceded by a necessary header
- xz format fragmentation
- unreasonable extensibility
- poor header design and lack of field length protection
- 4-byte alignment and use of padding all over the place
- inability to add trailing data to an already created archive
- multiple issues with xz error detection
- no options for data recovery
While some of the concerns seem a bit artificial, I wonder whether there is any solid justification for not using xz as a format for long-term archiving.
What should I be concerned about if I choose xz as a file format?
(I guess access to an xz program itself should not be an issue even 30 years from now.)
A couple of notes:
- The data stored are results of numerical computations, some of which are published in different conferences and journals. And while storing results does not necessarily imply research reproducibility, it is an important component.
- While using the more standard tar.gz or even plain zip might be a more obvious choice, the ability to cut about 30% of the archive size is very appealing to me.
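
For anyone who wants to check the size difference on their own data, here is a minimal sketch using only Python's standard library. The helper name `compressed_sizes` and the synthetic sample (a stream of doubles) are hypothetical stand-ins, not my actual data, and the resulting ratio depends heavily on what is being compressed.

```python
import gzip
import lzma
import math
import struct


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of `data` raw, as gzip, and as xz."""
    gz = gzip.compress(data, compresslevel=9)
    xz = lzma.compress(data, format=lzma.FORMAT_XZ, preset=6)
    # Round-trip both streams so a size is only reported for data
    # that decompresses back to the original bytes.
    assert gzip.decompress(gz) == data
    assert lzma.decompress(xz) == data
    return {"raw": len(data), "gz": len(gz), "xz": len(xz)}


# Synthetic stand-in for numerical results: 50,000 little-endian doubles.
sample = b"".join(struct.pack("<d", math.sin(i * 0.001)) for i in range(50_000))

sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

To test real data, replace `sample` with the bytes of an actual (uncompressed) tar file.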