import dask
import dask.dataframe as dd
from dask.delayed import delayed
import pandas as pd
I'm using dask's delayed and from_delayed to do this because it works, and it works fast. Here is my conundrum...
# conversions and standard are lists of paths to zipped CSV files
dfc = [delayed(pd.read_csv)(u)[['UserID', 'ConversionDate']] for u in conversions]
dfs = [delayed(pd.read_csv)(u)[['UserID', 'EventDate']] for u in standard]
This works fine. I then do this...
df = dd.from_delayed(dfc)
and it gives me a Dask DataFrame with ~8 million rows. OK, great. But then I do this...
ds = dd.from_delayed(dfs)
And I get the following error...
ValueError: ('Multiple files found in compressed zip file %s', "['MM_CLD_Standard_Agency_142087_Daily_191101_00.csv', 'MM_CLD_Standard_Agency_142087_Daily_191101_01.csv', 'MM_CLD_Standard_Agency_142087_Daily_191101_02.csv', 'MM_CLD_Standard_Agency_142087_Daily_191101_03.csv', 'MM_CLD_Standard_Agency_142087_Daily_191101_04.csv']")
So as you can see, there are multiple CSVs in that zip file. pandas will read a zip containing a single CSV directly, but it refuses one with several members. I want to read all of those CSVs as easily as the first batch loads. There's going to be a lot more data, but Dask should be able to handle it. How do I go about doing this? I've sketched one idea below.
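One approach I'm considering (untested): a small helper that opens each archive with Python's standard zipfile module, reads every CSV member, and concatenates them into a single pandas DataFrame before handing it to delayed. The helper name read_zipped_csvs is my own invention; it relies on the pd, delayed, and dd imports at the top.

import zipfile

def read_zipped_csvs(path, cols):
    # Read every CSV inside a zip archive and concatenate them.
    frames = []
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.endswith('.csv'):
                # pd.read_csv accepts a file-like object directly
                with zf.open(name) as f:
                    frames.append(pd.read_csv(f)[cols])
    return pd.concat(frames, ignore_index=True)

# one delayed task per archive, same shape as before
dfs = [delayed(read_zipped_csvs)(u, ['UserID', 'EventDate']) for u in standard]
ds = dd.from_delayed(dfs)

That keeps one delayed task per archive, so Dask can still parallelize across files. Is that the right direction?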
Also, after that, I need to left join df and ds on 'UserID' and reset the index. My current guess is below.
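Something like this is what I have in mind, assuming Dask's pandas-style merge and reset_index apply here (as far as I know, Dask's reset_index restarts the count at 0 within each partition rather than producing one monotonic index):

# left join the conversions frame against the standard-events frame
merged = df.merge(ds, on='UserID', how='left')
merged = merged.reset_index(drop=True)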
Please help! Thank you!