import dask
import dask.dataframe as dd
from dask.delayed import delayed
import pandas as pd

I'm using dask's `delayed` and `from_delayed` to do this because it works, and it works fast. Here is my conundrum...

dfc = [delayed(pd.read_csv)(u)[['UserID', 'ConversionDate']] for u in conversions]
dfs = [delayed(pd.read_csv)(u)[['UserID', 'EventDate']] for u in standard]

This works fine. I then do this...

df = dd.from_delayed(dfc)

and it gives me a dask dataframe of length ~8 million. OK, great. But then I do this...

ds = dd.from_delayed(dfs)

And I get the following error...

ValueError: ('Multiple files found in compressed zip file %s', "['MM_CLD_Standard_Agency_142087_Daily_191101_00.csv', 'MM_CLD_Standard_Agency_142087_Daily_191101_01.csv', 'MM_CLD_Standard_Agency_142087_Daily_191101_02.csv', 'MM_CLD_Standard_Agency_142087_Daily_191101_03.csv', 'MM_CLD_Standard_Agency_142087_Daily_191101_04.csv']")

So, as you can see, there are multiple CSVs in that zip file. I want to read all of those CSVs in as easily as the first set went. There's going to be a lot more data, but dask should be able to handle it. How do I go about doing this?

Also, after that, I need to left join `df` and `ds` on `'UserID'` and reset the index.

Please help! Thank you!

  • Have you looked at this? I think it would work for dask, too. https://stackoverflow.com/questions/44575251/reading-multiple-files-contained-in-a-zip-file-with-pandas – Evan Dec 04 '19 at 02:45
  • If you can write it in code and it works I will mark it as an answer but as it is it doesn't answer my question. – Ravaal Dec 04 '19 at 03:01

1 Answer


Okay, I had to make some data to play with, so I used this dataset.

import pandas as pd

cols = ["mpg", "cylinders", "displacement", 
        "horsepower", "weight", "acceleration", 
        "model_year", "origin", "car_name"]
df = pd.read_csv("auto-mpg.data", sep=r"\s+", 
                 header=None, names=cols)

df[:100].to_csv("auto_1.csv")
df[100:200].to_csv("auto_2.csv")
df[200:300].to_csv("auto_3.csv")
df[300:].to_csv("auto_4.csv")

I then compressed the files into a zip archive (right click -> compress; this can also be done with `zipfile`, as sketched below).
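
For reference, a minimal sketch of that compression step with `zipfile` (untested beyond this toy case; the file names match the CSVs written above):

from zipfile import ZipFile

# bundle the four CSVs into one archive, mirroring the
# right click -> compress step
with ZipFile("auto.zip", "w") as zf:
    for name in ["auto_1.csv", "auto_2.csv", "auto_3.csv", "auto_4.csv"]:
        zf.write(name)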

Next, read each archive, extract its members, and build a dask dataframe from the extracted CSVs.

from zipfile import ZipFile
import dask.dataframe as dd
import os

wd = '/path/to/zip/files'          # directory holding the zip archives
file_list = os.listdir(wd)
destdir = '/extracted/destination/'

parts = []                         # one dask dataframe per archive

for f in file_list:
    # name it "zf", not "zip", to avoid shadowing the built-in
    with ZipFile(os.path.join(wd, f), "r") as zf:
        print(zf.namelist())
        zf.extractall(destdir)
        # read the extracted CSVs from destdir, not the working directory
        csvs = [os.path.join(destdir, name) for name in zf.namelist()]
        parts.append(dd.read_csv(csvs,
                                 usecols=['Enter', 'Columns', 'Here'],
                                 parse_dates=['Date']))

ddf = dd.concat(parts)
ddf.compute()

Output:

['auto_4.csv', 'auto_3.csv', 'auto_2.csv', 'auto_1.csv']
    Unnamed: 0   mpg  cylinders  displacement horsepower  weight  \
0          300  23.9          8         260.0      90.00  3420.0   
1          301  34.2          4         105.0      70.00  2200.0   
2          302  34.5          4         105.0      70.00  2150.0   
3          303  31.8          4          85.0      65.00  2020.0   
4          304  37.3          4          91.0      69.00  2130.0   
5          305  28.4          4         151.0      90.00  2670.0   

As you can see, the `Unnamed: 0` column is the original index, which is now out of order. You can drop it, sort the ddf, and so on.
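
For example, tying it back to the question (a sketch; `df` and `ds` are the dataframes built with `dd.from_delayed` in the question, and I haven't run this against the real data):

# drop the stray index column carried over from the per-file CSVs
ddf = ddf.drop("Unnamed: 0", axis=1)

# the left join asked for in the question, with the index reset
merged = df.merge(ds, on="UserID", how="left").reset_index(drop=True)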

If there are other files in the archive, you can filter with `glob`, or with a list comprehension like

print([name for name in zf.namelist() if "auto" in name])
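
As the comments below note, extracting a large archive first can consume a lot of disk space. One possible way around that, sketched here and untested on the real files (it uses the toy `auto.zip` from above), is to wrap `pandas.read_csv` in `dask.delayed` and read each member through `ZipFile.open`, which hands pandas a file-like object so nothing is written to disk:

import pandas as pd
import dask.dataframe as dd
from dask import delayed
from zipfile import ZipFile

def read_member(zip_path, member):
    # pandas reads straight from the file-like object
    # returned by ZipFile.open -- no extraction needed
    with ZipFile(zip_path) as zf:
        with zf.open(member) as fh:
            return pd.read_csv(fh)

with ZipFile("auto.zip") as zf:
    members = zf.namelist()

ddf = dd.from_delayed([delayed(read_member)("auto.zip", m) for m in members])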

  • Hey Evan. It looks like you got it figured out but I'm still having issues... I'm getting a `file not found` error. I'm working on it now because it may not be something that you can answer but I'll edit once I figure it out. – Ravaal Dec 04 '19 at 14:49
  • Thank you Evan. It was a tremendous help. I added some code that I had to come up with in order to make it work. – Ravaal Dec 04 '19 at 17:26
  • Glad it worked. Did you have to extract the files, then read them? I had hoped it could be done without that (i.e., extracting a 20 GB zip archive could really consume drive space, an issue if on a laptop). – Evan Dec 04 '19 at 18:16
  • I had to extract them. I liked your method of pulling all of the files into the dataframe simultaneously. The problem I ran into is that `example.csv` cannot be found in `different directory` which I think is the pwd. But how do you specify a directory where you can find the files and do the simultaneous dataframe thing? I don't know. I'd like to know. – Ravaal Dec 04 '19 at 19:23
  • Please help me with extracting the data directly from the zip file. I'm taking up too much space and can't complete my algorithm... – Ravaal Dec 05 '19 at 01:42
  • Can you clarify what issue you are having when you try to create a `ddf` from the zip? – Evan Dec 05 '19 at 03:37
  • It's very efficient when making a ddf out of ALL files in the zipfile. The problem I'm getting is that it changes the directory from `//path/to/zip` to `C:/User/User/file.csv` and it says that there's no file located in that location. Make sense? – Ravaal Dec 05 '19 at 04:01
  • Unfortunately, I do not understand - I'm not sure why `zipfile` would switch the directories. I'm not seeing `C:/User/User/file.csv` mentioned anywhere in your code, so it's not clear why that would pop up. – Evan Dec 05 '19 at 15:49
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/203681/discussion-between-evan-and-ravaal). – Evan Dec 05 '19 at 16:18