I am trying to load a large zipped data set into Python. The archive has the following structure:
- year.zip
  - year
    - month
      - a lot of .csv files
    - month
  - year
So far I have used the `ZipFile` class from Python's `zipfile` module to iterate through the CSV files and load each one with pandas:
```python
from zipfile import ZipFile
import pandas as pd

zf = ZipFile("year.zip")
frames = []
for file in zf.namelist():  # namelist() is a method, note the parentheses
    try:
        frames.append(pd.read_csv(zf.open(file)))
    except (pd.errors.ParserError, pd.errors.EmptyDataError):
        continue  # skip directory entries and malformed files
```
It takes ages, and I am looking into optimizing the code. One option I ran into is the dask library, but I can't figure out how best to use it to load at least a whole month of CSV files in one command (see the sketch below for the kind of thing I have in mind). Any suggestions? I am also open to other optimization approaches.
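For reference, this is roughly what I am imagining, based on dask's `dask.delayed` and `dask.dataframe.from_delayed` APIs. The month prefix `"2020/01/"` is a made-up example path inside the archive, and I haven't verified this is the fastest approach:

```python
from zipfile import ZipFile
import pandas as pd
import dask.dataframe as dd
from dask import delayed

@delayed
def read_member(zip_path, member):
    # Each task opens its own handle, since a shared ZipFile
    # is not safe to read from concurrently.
    with ZipFile(zip_path) as zf:
        with zf.open(member) as f:
            return pd.read_csv(f)

zip_path = "year.zip"
month_prefix = "2020/01/"  # hypothetical path inside the archive

with ZipFile(zip_path) as zf:
    members = [n for n in zf.namelist()
               if n.startswith(month_prefix) and n.endswith(".csv")]

# Build one lazy dask DataFrame from all CSVs of the month;
# nothing is actually read until .compute() is called.
df = dd.from_delayed([read_member(zip_path, m) for m in members])
result = df.compute()
```

If extracting the archive to disk first is acceptable, something like `dd.read_csv("year/2020/01/*.csv")` should also work in one command, since `dask.dataframe.read_csv` accepts glob patterns.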