
I am trying to load a large zipped data set into Python. It has the following structure:

  • year.zip
    • year
      • month
        • a lot of .csv files

So far I have used the ZipFile class from Python's zipfile module to iterate over the CSV files and load each one with pandas:

from zipfile import ZipFile
import pandas as pd

zf = ZipFile("year.zip")

for file in zf.namelist():
    try:
        df = pd.read_csv(zf.open(file))
    except Exception:
        continue  # skip entries that are not readable CSVs (e.g. directory entries)
It takes ages, and I am looking into optimizing the code. One option I ran into is the dask library. However, I can't figure out how best to use it to read at least a whole month of CSV files in one command. Any suggestions? I'm also open to other optimization approaches.

Vlad

1 Answer


There are a few ways to do this. The most similar to your suggestion would be something like:

zf = ZipFile("year.zip")
files = list(zf.namelist)
parts = [dask.delayed(pandas.read_csv)(f) for f in files)]
df = dd.from_delayed(parts)

This works because a zipfile has an offset listing, so the component files can be read independently; however, performance may depend on how the archive was created, and remember: you only have one storage device, so throughput from that device may be your bottleneck anyway.
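As a rough usage sketch (assuming the CSVs share a schema and you stay on the default threaded scheduler, so the single open ZipFile handle can be shared across tasks), nothing is read until you compute:

# df is lazy at this point; dask has only built a task graph from the delayed parts
print(df.npartitions)  # one partition per CSV file in the archive

# Trigger the parallel read with the threaded scheduler, which can reuse the
# shared ZipFile handle (process-based schedulers generally cannot pickle it).
result = df.compute(scheduler="threads")
print(result.shape)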

Perhaps a more daskian way to do this is as follows, taking advantage of fsspec, the file-system abstraction used by dask:

df = dd.read_csv('zip://*.csv', storage_options={'fo': 'year.zip'})

(Of course, pick the glob pattern appropriate for your files; you could also pass a list of files here, as long as you prepend "zip://" to each.)
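For the layout described in the question (year/month/*.csv inside year.zip), a hedged sketch might look like the following; the glob and the archive path are assumptions about how your archive is actually laid out:

import dask.dataframe as dd

# Glob is relative to the archive root; assumes the year/month/*.csv layout
# from the question and that all CSVs share the same columns.
df = dd.read_csv(
    "zip://year/*/*.csv",
    storage_options={"fo": "year.zip"},  # "fo" points fsspec at the zip archive on disk
)

result = df.compute()  # reads and concatenates all matching CSVs in parallel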

mdurant