I have a parquet file that is 800K rows x 8.7K columns. I loaded it into a Dask DataFrame:
import dask.dataframe as dd
dask_train_df = dd.read_parquet('train.parquet')
dask_train_df.info()
This yields:
<class 'dask.dataframe.core.DataFrame'>
Columns: 8712 entries, 0 to 8711
dtypes: int8(8712)
When I try to do simple operations like dask_train_df.head() or dask_train_df.loc[2:4].compute(), I get memory errors, even with 17+ GB of RAM.
However, if I do:
import pandas as pd
train = pd.read_parquet('../input/train.parquet')
train.info()
this yields:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Columns: 8712 entries, 0 to 8711
dtypes: int8(8712)
memory usage: 6.5 GB
and I can run train.head() and train.loc[2:4] with no problems, since everything is in memory already.
1) So my question is: why do these simple operations blow up the memory usage with a Dask DataFrame, but work fine when I load everything into memory with a Pandas DataFrame?
I notice that npartitions=1, and I see in the documentation that read_parquet "reads a directory of Parquet data into a Dask.dataframe, one file per partition". In my case, it sounds like I'm losing out on all of the parallelization power of having multiple partitions, but then shouldn't the Dask DataFrame's memory usage be capped by the memory of the single Pandas DataFrame?
2) Also, a side question: if I wanted to parallelize this single parquet file by partitioning it in a Dask DataFrame, how would I do so? I don't see a blocksize parameter in the dd.read_parquet signature. I also tried using the repartition function, but I believe that partitions along the rows, and for a parquet file I would want to partition along the columns?