I am searching for the best possible solution for storing large dense 2D matrices of floating-point data, generally float32 data.
The goal would be to share scientific data more easily from websites like the Internet Archive and make such data FAIR.
My current approaches (listed below) fall short of the desired goal, so I thought I might ask here, hoping to find something new, such as a better data structure. Even though my examples are in Python, the solution does not need to be in Python. Any good solution will do, even one in COBOL!
CSV-based
One approach I tried is to store the values as compressed CSVs using pandas, but this is excruciatingly slow, and the resulting compression is not great (generally around a 50% reduction from the plain CSV on the data I tried it on, which is not bad but not enough to make this viable). In this example I am using gzip. I have also tried LZMA (sketched after the snippet below), but it is generally much slower and, at least on the data I tried it on, does not yield a significantly better result.
import pandas as pd
my_data: pd.DataFrame = create_my_data_doing_something()
# pandas infers gzip compression from the .gz extension
my_data.to_csv("my_data_saved.csv.gz")
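For reference, the LZMA variant I mentioned looks roughly like this; pandas can also infer the xz codec from the file extension, but here it is passed explicitly (a minimal sketch, assuming the same my_data frame):
import pandas as pd
my_data: pd.DataFrame = create_my_data_doing_something()
# LZMA compression via the xz codec; slower than gzip in my tests
my_data.to_csv("my_data_saved.csv.xz", compression="xz")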
NumPy-based
Another solution is to store the data in a NumPy array and then compress the saved file on disk.
import numpy as np
my_data: np.ndarray = create_my_data_doing_something()
# np.save writes an uncompressed binary .npy file
np.save("my_data_saved.npy", my_data)
and afterwards
gzip -k my_data_saved.npy
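If you prefer to stay within Python (for instance, on machines without the gzip CLI), the same compression step can be done with the standard library. A minimal sketch, equivalent to the command above:
import gzip
import shutil
# Compress the saved .npy with gzip, keeping the original file
# (equivalent to `gzip -k my_data_saved.npy`)
with open("my_data_saved.npy", "rb") as src, gzip.open("my_data_saved.npy.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
NumPy also ships np.savez_compressed, which writes a zlib-compressed .npz archive in a single step, if you want to skip the separate gzip pass.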
NumPy with HDF5
Another possible format is HDF5, but as shown in the benchmark below it does not do better than plain NumPy.
import h5py
with h5py.File("my_data_saved.h5", 'w') as hf:
    hf.create_dataset("my_data_saved", data=my_data)
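Note that the snippet above stores the dataset uncompressed; h5py also supports per-dataset compression filters such as gzip, which would be the fairer comparison. A minimal sketch, assuming the same my_data array:
import h5py
with h5py.File("my_data_saved_compressed.h5", 'w') as hf:
    # gzip filter; compression_opts ranges 0-9 and trades speed for size
    hf.create_dataset("my_data_saved", data=my_data,
                      compression="gzip", compression_opts=9)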
Issues in using the NumPy format
While this may be a good solution, it limits the usability of the data to people who can use Python. Of course, this is a vast pool of people; still, in my circles, many biologists and mathematicians abhor Python and stick to MATLAB and R (and therefore would not know what to do with a .npy.gz file).
Pickle
Another solution, as correctly pointed out by @lojza, is to store the data in a pickle object, which may also be compressed on disk. In my benchmarks (see below), Pickle achieves a compression ratio comparable to what is obtained with NumPy.
import pickle
import compress_pickle
import numpy as np
my_data: np.ndarray = create_my_data_doing_something()
# Uncompressed pickle
with open("my_data_saved.pkl", "wb") as f:
    pickle.dump(my_data, f)
# Compressed version (codec inferred from the .gz extension)
compress_pickle.dump(my_data, "my_data_saved.pkl.gz")
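For completeness, loading the compressed pickle back is symmetric; compress_pickle infers the codec from the extension:
# Load the gzip-compressed pickle back into memory
my_data_restored = compress_pickle.load("my_data_saved.pkl.gz")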
Issues in using the Pickle format
The issue with using Pickle is twofold: first, there is the same Python-dependency issue discussed above. Secondly, there is a significant security issue: the Pickle format can be used for arbitrary code execution exploits, so people should be wary of downloading random pickle files from the internet (and here the goal is precisely to get people to share datasets on the internet).
import pickle
# Build the exploit
command = b"""cat flag.txt"""
x = b"c__builtin__\ngetattr\nc__builtin__\n__import__\nS'os'\n\x85RS'system'\n\x86RS'%s'\n\x85R." % command
# Test it (pickle.loads takes bytes; pickle.load expects a file object)
pickle.loads(x)
Benchmarks
I have executed benchmarks to provide a baseline to improve on. Here we can see that, generally, NumPy is the best-performing choice, and to my knowledge its format should not pose any security risk. It follows that a compressed NumPy array is currently the best contender; please, let's find a better one!
Examples
To share some actual use cases of this relatively simple task, I am sharing a couple of graph embeddings of the complete OBO Foundry graph. If there were a way to make these files smaller, sharing them would be significantly easier, allowing for more reproducible experiments and accelerating research on bio-ontologies (for this specific data) and in other fields.
What other approaches may I try?