There is a very similar question here:
The fundamental reason is that formats like bz2, gz or zip do not allow random access; the only way to read the data is from the start.
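You can see the consequence directly in Dask: because a gzipped file is not splittable, you have to read it with `blocksize=None`, and the whole file ends up as a single partition. A minimal sketch (the file name `big.csv.gz` is illustrative):

    import dask.dataframe as dd

    # gzip cannot be split, so Dask must read the entire file sequentially
    df = dd.read_csv("big.csv.gz", compression="gzip", blocksize=None)
    print(df.npartitions)  # 1 -- the whole file is one partition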
The recommendation there is:
The easiest solution is certainly to stream your large files into several compressed files each (remember to end each file on a newline!), and then load those with Dask as you suggest. Each smaller file will become one dataframe partition in memory, so as long as the files are small enough, you will not run out of memory as you process the data with Dask.
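A rough sketch of that approach, assuming you can stream the source file with pandas (the file names and chunk size here are illustrative, not from the original answer):

    import pandas as pd
    import dask.dataframe as dd

    # stream the large CSV in manageable pieces and write each piece
    # as its own compressed file; to_csv writes complete rows, so each
    # file ends on a newline
    for i, chunk in enumerate(pd.read_csv("big.csv", chunksize=1_000_000)):
        chunk.to_csv(f"part-{i:04d}.csv.gz", index=False, compression="gzip")

    # each smaller file becomes one Dask partition in memory
    df = dd.read_csv("part-*.csv.gz", compression="gzip", blocksize=None)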
As an alternative option, if disk space is a consideration, you can use `.to_parquet` instead of `.to_csv` upstream, since parquet data is compressed by default.
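A sketch of that alternative, assuming you control the upstream step that writes the data (the file names and sample data are illustrative):

    import pandas as pd
    import dask.dataframe as dd

    # wherever the data is produced upstream, write Parquet instead of CSV;
    # Parquet is compressed by default, so no separate gzip step is needed
    source = pd.DataFrame({"x": range(1_000), "y": range(1_000)})
    source.to_parquet("data.parquet")

    # unlike a gzipped CSV, Parquet can be read in parallel
    df = dd.read_parquet("data.parquet")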