I have a file with 25,000 comma-delimited floats per row, and there are about 100K such rows. A row of the file looks something like:
1689.97,-9643.39,-82082.1,9776.09,-33974.84,-67247.38,32997.34,72811.53,31642.87,-949.6,9340.68,-85854.48,-17705.36,187.74,-3002.6,-35812.21,37382.32,22770.78,40893.09,45743.99,-6500.92,26243.85,13975.95,0,56669.47,-25865.36,-17066.78,26788.57,0,-36554.86,-3687.19,18933.93
I have a two-part question.
- Is there a way (in Java or Python) to compress the data efficiently without affecting read performance much? The compression would be done once per day, but the data has to be read quite often.
- Can the data be manipulated in its compressed form? For example, I would like to aggregate the first 10 columns of the first 10 rows without decompressing, so I don't have to worry about frequent reads of the compressed data. One of the challenges would be converting 25,000 strings to floats for the addition (see the baseline sketch after this list).
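For context, the straightforward baseline I'd be comparing against is something like the following Python sketch: gzip the CSV once a day, then stream rows back out and parse only the fields that are needed (file names like `data.csv` are placeholders). The catch is that gzip is a stream format, so everything before the rows I want still has to be decompressed:

```python
import gzip

import numpy as np

RAW_PATH = "data.csv"    # placeholder name for the raw file
GZ_PATH = "data.csv.gz"  # placeholder name for the compressed copy

# Once-a-day compression step: stream the text file through gzip.
with open(RAW_PATH, "rb") as src, gzip.open(GZ_PATH, "wb") as dst:
    dst.writelines(src)

# Read side: aggregate the first 10 columns of the first 10 rows.
total = np.zeros(10)
with gzip.open(GZ_PATH, "rt") as f:
    for i, line in enumerate(f):
        if i == 10:
            break
        # Parse only the first 10 fields instead of all 25,000;
        # this is the string-to-float cost mentioned above.
        fields = line.split(",", 10)[:10]
        total += np.array(fields, dtype=np.float64)

print(total)
```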
I have looked at gzip and zcat, and they are good options, but I wanted to find a compression or serialization scheme that lets me store the data through Java/Python and perform reads without decompressing the whole file.
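To make that concrete, here is a rough sketch of the kind of scheme I'm imagining (the file names `data.bin`/`data.idx` and the function names are made up): store each row as zlib-compressed binary floats, keep a small offset index on the side, and decompress only the rows a read actually touches:

```python
import struct
import zlib

import numpy as np

ROW_LEN = 25_000  # floats per row

# Write side (once per day): store each row as zlib-compressed binary
# float32 and record its (offset, length) so rows can be fetched later.
# float32 halves the storage; use float64 if full precision matters.
def compress_csv(src_path: str, dst_path: str, idx_path: str) -> None:
    offsets = []
    pos = 0
    with open(src_path, "r") as src, open(dst_path, "wb") as dst:
        for line in src:
            row = np.array(line.rstrip("\n").split(","), dtype=np.float32)
            blob = zlib.compress(row.tobytes(), 6)
            offsets.append((pos, len(blob)))
            dst.write(blob)
            pos += len(blob)
    with open(idx_path, "wb") as idx:
        for off, n in offsets:
            idx.write(struct.pack("<QI", off, n))

# Read side: seek via the index and decompress only the requested row.
def read_row(dst_path: str, idx_path: str, row_no: int) -> np.ndarray:
    rec = struct.calcsize("<QI")
    with open(idx_path, "rb") as idx:
        idx.seek(row_no * rec)
        off, n = struct.unpack("<QI", idx.read(rec))
    with open(dst_path, "rb") as dst:
        dst.seek(off)
        return np.frombuffer(zlib.decompress(dst.read(n)), dtype=np.float32)

# After compress_csv("data.csv", "data.bin", "data.idx"), aggregating
# the first 10 columns of the first 10 rows touches only 10 blocks:
total = sum(read_row("data.bin", "data.idx", r)[:10] for r in range(10))
```

The trade-off with per-row blocks like this is a somewhat worse compression ratio than one big gzip stream, since each row is compressed independently, but reads no longer have to decompress everything that comes before the rows of interest. Is there an existing library or format (in Java or Python) that does this better than rolling it by hand?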