
I'm new to Dask (imported as dd) and am trying to convert some pandas (imported as pd) code.

The goal of the following lines is to slice the data down to those columns whose values fulfill a calculated requirement, in Dask.

The input is a table given as a CSV file. The original pandas code reads

inputdata = pd.read_csv("inputfile.csv")
pseudoa = inputdata.quantile([.035, .965])   # 3.5% and 96.5% quantiles per column
pseudob = pseudoa.diff().loc[.965]           # spread between the two quantiles
inputdata = inputdata.loc[:, inputdata.columns[pseudob.values > 0]]  # keep columns with spread > 0
inputdata.describe()

and it works fine. My simple idea for the conversion was to substitute the first line with

inputdata = dd.read_csv("inputfile.csv")

but that resulted in the strange error message IndexError: too many indices for array. Even when switching to already computed data for inputdata and pseudob, the error remains.
Maybe the question specifically concerns the idea of calculated boolean slicing of Dask columns.

I just found a (maybe suboptimal) workaround (not a real solution). Changing the fourth line (the .loc column selection) to the following

inputdata = inputdata.loc[:, inputdata.columns[(pseudob.values > 0).compute()[0]]]

seems to work.
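
Written out step by step (a sketch assuming the variables above; the intermediate names mask2d and keep are mine), the workaround does this:

# (pseudob.values > 0) is still lazy; .compute() materialises it,
# apparently as a 2-D array with a single row for the .965 label,
# so [0] pulls out that row as a flat boolean mask over the columns.
mask2d = (pseudob.values > 0).compute()
keep = inputdata.columns[mask2d[0]]
inputdata = inputdata.loc[:, keep]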

Bastian Ebeling

1 Answer


Yes, Dask.dataframe's .loc accessor only works if it gets concrete indexing values. Otherwise it doesn't know which partitions to ask for the data. Computing your lazy dask result to a concrete Pandas result is one sensible solution to this problem, especially if your indices fit in memory.
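
For example, a minimal sketch along those lines, assuming the question's setup (note that Dask computes quantiles approximately by default):

import dask.dataframe as dd

inputdata = dd.read_csv("inputfile.csv")
pseudoa = inputdata.quantile([.035, .965])    # lazy (approximate) quantiles
pseudob = pseudoa.diff().loc[.965].compute()  # small, concrete pandas result

# pseudob is now in memory; ravel() flattens it in case .loc[.965]
# returned a one-row frame rather than a Series, giving a flat boolean
# mask over the columns. .loc then receives concrete column labels.
keep = inputdata.columns[pseudob.values.ravel() > 0]
inputdata = inputdata.loc[:, keep]
inputdata.describe().compute()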

MRocklin