Once I have a dask DataFrame, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe; how can I create an N x m dataframe, where m << M and the columns are arbitrary?
from sklearn.datasets import load_iris
import pandas as pd
import dask.dataframe as dd

d = load_iris()
df = pd.DataFrame(d.data)
ddf = dd.from_pandas(df, chunksize=100)
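For context, a quick check (no computation is triggered) shows the dask frame keeps the pandas column labels, which here are the default integers 0-3:

# inspect the dask frame's metadata without calling compute()
print(ddf.columns)      # Index of column labels carried over from pandas
print(ddf.npartitions)  # number of partitions created by from_pandas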
What I would like to do:
in_memory = ddf.iloc[:,2:4].compute()
What I have been able to do:
ddf.map_partitions(lambda x: x.iloc[:,2:4]).compute()
map_partitions works, but it was quite slow on a file that wasn't very large. I hope I am missing something very obvious.
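For comparison, here is a minimal sketch of label-based selection (assuming the default integer column names 0-3 from the frame above), which dask supports directly:

# pick the labels of the columns I want, then select by label and materialize
cols = list(ddf.columns[2:4])     # labels 2 and 3 in this example
in_memory = ddf[cols].compute()   # pandas DataFrame containing only those columns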