I have been trying to read multiple CSVs from a zip archive using dask, following this answer. However, I get a long error message that I cannot make sense of. I think the important line is this one:
msgpack.exceptions.ExtraData: unpack(b) received extra data.
The data is publicly available.
import numpy as np
import pandas as pd
import dask.dataframe as dd
# read data, the dask way: glob the CSVs inside the zip archive
df = dd.read_csv(
    'zip://BACI*.csv',
    sep=",",
    dtype={"k": str, "i": int, "j": int, "t": int},
    storage_options={'fo': '../input/baci_hs92.zip'},
)
df.head()
I believe this kind of on-the-fly extraction should work in dask, and I would rather not extract all the files into a directory first, as other answers have suggested.
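To make clear what I mean by reading the files straight from the archive without extracting them, here is a minimal sketch that uses the standard-library zipfile module with pandas instead of dask. The helper name read_zipped_csvs is hypothetical, and I am assuming the members of interest match BACI*.csv. It loads everything eagerly into memory, which is exactly what I was hoping dask would let me avoid, so it is a fallback rather than a solution.

```python
import fnmatch
import zipfile

import pandas as pd


def read_zipped_csvs(zip_path, pattern="BACI*.csv", **read_csv_kwargs):
    """Read every archive member matching `pattern` straight from the zip.

    Hypothetical helper: nothing is extracted to disk, but all matching
    files are loaded eagerly into one in-memory pandas DataFrame.
    """
    frames = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(archive.namelist()):
            # Match on the base name so members inside subfolders are found too.
            if fnmatch.fnmatch(name.rsplit("/", 1)[-1], pattern):
                with archive.open(name) as member:
                    frames.append(pd.read_csv(member, **read_csv_kwargs))
    if not frames:
        raise FileNotFoundError(f"no member of {zip_path} matches {pattern}")
    return pd.concat(frames, ignore_index=True)
```

With my archive this would be called as `read_zipped_csvs('../input/baci_hs92.zip', dtype={"k": str, "i": int, "j": int, "t": int})`; extra keyword arguments are passed through to `pd.read_csv`.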