The setup: I have eight large CSV files (32 GB each), each compressed with Zip down to roughly 8 GB. I cannot work with the uncompressed data because I want to save disk space and do not have 8 × 32 GB of free space left. I also cannot load a single file with, e.g., pandas, because it does not fit into memory.
I thought Dask would be a reasonable choice for the task, but feel free to suggest a different tool if you think it is better suited.
Is it possible to process one 8 GB compressed file with Dask by reading multiple chunks of the compressed file in parallel, processing each chunk, and saving the results to disk?
The first problem is that Dask does not support the .zip format. This issue proposes using dask.delayed as a workaround, but it would also be possible for me to change the format to .xz or something else.
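To make the question concrete, here is roughly what I can already do today without Dask: stream one zip member chunk-wise with zipfile and pandas and process each chunk in turn (the file names, chunk size, and groupby column below are just placeholders):

import zipfile

import pandas as pd

# Placeholder names -- adjust to the actual archive layout.
ZIP_PATH = "file_1.zip"
MEMBER = "file_1.csv"

with zipfile.ZipFile(ZIP_PATH) as zf:
    with zf.open(MEMBER) as member:
        # pandas streams the compressed member; only one chunk is held in memory at a time.
        for i, chunk in enumerate(pd.read_csv(member, chunksize=1_000_000)):
            result = chunk.groupby("some_column").size()  # stand-in for the real processing
            result.to_csv(f"result_part_{i}.csv")

Since the decompression here is a single sequential stream, I do not see how the per-chunk processing could be handed to dask.delayed tasks in parallel, which is essentially what I am asking.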
Second, and probably related to the choice of compression format: is it even possible to access only parts of a compressed file in parallel?
Or is it better to split each uncompressed CSV file into smaller parts that fit into memory, recompress them, and then process the smaller parts with something like this:
import dask.dataframe as dd
df = dd.read_csv('files_*.csv.xz', compression='xz')
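For completeness, a minimal sketch of how I imagine producing those smaller .xz parts by streaming straight out of the zip archive, so the full 32 GB CSV never has to be written to disk uncompressed (again, file names and chunk size are made up):

import zipfile

import pandas as pd

ZIP_PATH = "file_1.zip"   # placeholder names
MEMBER = "file_1.csv"

with zipfile.ZipFile(ZIP_PATH) as zf:
    with zf.open(MEMBER) as member:
        # Re-chunk the large member into smaller .xz-compressed CSV parts.
        for i, chunk in enumerate(pd.read_csv(member, chunksize=1_000_000)):
            chunk.to_csv(f"files_{i:04d}.csv.xz", index=False, compression="xz")

The downside is that this splitting pass is itself a long sequential run over each file, which is part of why the first approach seemed more attractive.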
For now, I would prefer something along the lines of the first solution, which seems leaner, but I might be totally mistaken, as this domain is new to me.
Thanks for your help!