
I am trying to load a .csv.gz file into Dask. Reading it this way loads the file successfully, but into only one partition:

import dask.dataframe as dd
df = dd.read_csv(fp, compression="gzip")

My workaround right now is to decompress the file with gzip, load the uncompressed CSV into Dask, and delete it when I am finished. Is there a better way?
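
For context, the current workaround looks roughly like this (file names are placeholders):

```
import gzip
import os
import shutil

import dask.dataframe as dd

# Decompress the .csv.gz to a plain .csv on disk.
with gzip.open("data.csv.gz", "rb") as f_in, open("data.csv", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# The uncompressed file can be split into many partitions.
df = dd.read_csv("data.csv", blocksize="64MB")

# ... work with df ...

# Remove the temporary uncompressed copy afterwards.
os.remove("data.csv")
```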

Aissaoui A
  • Why does using unzip give you a better result? I wouldn't think it would matter how you're extracting your data from the file, but then again, I don't know Dask. Maybe you can show us that code and that would answer my question. – CryptoFool Feb 28 '21 at 22:42
  • Does this answer your question? [Handling large, compressed csv files with Dask](https://stackoverflow.com/questions/50741918/handling-large-compressed-csv-files-with-dask) – SultanOrazbayev Mar 01 '21 at 04:02

1 Answer


There is a very similar question here: [Handling large, compressed csv files with Dask](https://stackoverflow.com/questions/50741918/handling-large-compressed-csv-files-with-dask).

The fundamental reason is that formats like bz2, gz, or zip do not allow random access; the only way to read the data is sequentially from the start of the file.
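
To illustrate, a gzipped CSV necessarily ends up as a single partition (file name is a placeholder):

```
import dask.dataframe as dd

# Gzip cannot be read from an arbitrary offset, so the whole file
# becomes one partition; blocksize=None silences the related warning.
df = dd.read_csv("data.csv.gz", compression="gzip", blocksize=None)
print(df.npartitions)  # 1
```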

The recommendation there is:

The easiest solution is certainly to stream each of your large files into several smaller compressed files (remember to end each file on a newline!), and then load those with Dask as you suggest. Each smaller file will become one dataframe partition in memory, so as long as the files are small enough, you will not run out of memory as you process the data with Dask.
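
A rough sketch of that splitting approach (file names and the chunk size are assumptions to adapt):

```
import gzip

import dask.dataframe as dd

# Split one large .csv.gz into several smaller .csv.gz files,
# ending each chunk on a newline so no row is cut in half.
lines_per_chunk = 1_000_000  # assumption: tune so each file fits in memory
with gzip.open("data.csv.gz", "rt") as f_in:
    header = f_in.readline()
    chunk, part = [], 0
    for line in f_in:
        chunk.append(line)
        if len(chunk) >= lines_per_chunk:
            with gzip.open(f"part-{part:04d}.csv.gz", "wt") as f_out:
                f_out.write(header)
                f_out.writelines(chunk)
            chunk, part = [], part + 1
    if chunk:  # write the remainder
        with gzip.open(f"part-{part:04d}.csv.gz", "wt") as f_out:
            f_out.write(header)
            f_out.writelines(chunk)

# Each smaller file becomes one in-memory partition.
df = dd.read_csv("part-*.csv.gz", compression="gzip", blocksize=None)
```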

As an alternative option, if disk space is a consideration, you can use .to_parquet instead of .to_csv upstream, since Parquet data is compressed by default.
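
A rough sketch of the Parquet route (file names and the partition count are assumptions; a Parquet engine such as pyarrow needs to be installed):

```
import dask.dataframe as dd

# One-off conversion: read the gzipped CSV (a single partition),
# repartition, and write Parquet, which is compressed and splittable.
df = dd.read_csv("data.csv.gz", compression="gzip", blocksize=None)
df = df.repartition(npartitions=20)  # assumption: pick a sensible count
df.to_parquet("data_parquet/")

# Later reads get one partition per Parquet piece, with no manual decompression step.
df = dd.read_parquet("data_parquet/")
```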

SultanOrazbayev