The setup: I have eight large CSV files (32 GB each), each compressed with Zip down to roughly 8 GB. I cannot work with the uncompressed data because I want to save disk space and do not have 8 × 32 GB of free space left. I also cannot load a single file with, e.g., pandas, because it does not fit into memory.
I thought Dask would be a reasonable choice for the task, but feel free to suggest a different tool if you think it is better suited.
Is it possible to process one 8 GB compressed file with Dask by reading multiple chunks of the compressed file in parallel, processing each chunk, and saving the results to disk?
The first problem is that Dask does not support the .zip format. This issue proposes using dask.delayed as a workaround, but it would also be possible for me to change the format to .xz or something else.
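To make the question concrete, here is roughly what I can already do today without Dask: stream one zip member chunk-wise with zipfile and pandas and process each chunk in turn (the file names, chunk size, and groupby column below are just placeholders):

import zipfile

import pandas as pd

# Placeholder names -- adjust to the actual archive layout.
ZIP_PATH = "file_1.zip"
MEMBER = "file_1.csv"

with zipfile.ZipFile(ZIP_PATH) as zf:
    with zf.open(MEMBER) as member:
        # pandas streams the compressed member; only one chunk is held in memory at a time.
        for i, chunk in enumerate(pd.read_csv(member, chunksize=1_000_000)):
            result = chunk.groupby("some_column").size()  # stand-in for the real processing
            result.to_csv(f"result_part_{i}.csv")

Since the decompression here is a single sequential stream, I do not see how the per-chunk processing could be handed to dask.delayed tasks in parallel, which is essentially what I am asking.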
Second, and probably related to the choice of compression format: is it even possible to access only parts of a compressed file in parallel?
Or is it better to split each uncompressed CSV file into smaller parts that fit into memory, recompress them, and then process the smaller parts with something like this:
import dask.dataframe as dd
df = dd.read_csv('files_*.csv.xz', compression='xz')
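For completeness, a minimal sketch of how I imagine producing those smaller .xz parts by streaming straight out of the zip archive, so the full 32 GB CSV never has to be written to disk uncompressed (again, file names and chunk size are made up):

import zipfile

import pandas as pd

ZIP_PATH = "file_1.zip"   # placeholder names
MEMBER = "file_1.csv"

with zipfile.ZipFile(ZIP_PATH) as zf:
    with zf.open(MEMBER) as member:
        # Re-chunk the large member into smaller .xz-compressed CSV parts.
        for i, chunk in enumerate(pd.read_csv(member, chunksize=1_000_000)):
            chunk.to_csv(f"files_{i:04d}.csv.xz", index=False, compression="xz")

The downside is that this splitting pass is itself a long sequential run over each file, which is part of why the first approach seemed more attractive.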
For now, I would prefer something along the lines of the first solution, which seems leaner, but I might be totally mistaken, as this domain is new to me.
Thanks for your help!