I have a dask Series from which I need to drop both infs and nans. .dropna() only drops the nans. In numpy/pandas, I would do something like result = result[np.isfinite(result)]. What's the recommended equivalent in dask-land? Indexing the dask object with a boolean array gives an error. Is there some way to tell dask that inf or -inf should be considered null values, for example?

Tim Morton

2 Answers

Avoid calling NumPy functions directly on dask objects. They can trigger computation, and later dask.dataframe operations may not work smoothly with the results.

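Instead, use the equivalent dask.array function. Here is a minimal example that filters out inf values with da.isinf; note that this mask keeps NaN, so use da.isfinite if NaN should be dropped as well.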

In [1]: import numpy as np
   ...: import pandas as pd
   ...: import dask.dataframe as dd
   ...: import dask.array as da
   ...: df = pd.DataFrame({'x': [0, 1, 2], 'y': [0, np.inf, 5]})
   ...: df
   ...: 
Out[1]: 
   x         y
0  0  0.000000
1  1       inf
2  2  5.000000

In [2]: ddf = dd.from_pandas(df, npartitions=2)
   ...: ddf[~da.isinf(ddf.y)].compute()
   ...: 
Out[2]: 
   x    y
0  0  0.0
2  2  5.0

MRocklin

OK, I just discovered that I can do the following:

import dask.array as da
result = result[da.isfinite(result)]

In general, it looks like using the dask.array (da.*) operations is the missing piece I've been looking for.

Tim Morton