How do you drop infs from dask dataframe/series?

Question

I have a dask Series from which I need to drop both infs and nans. .dropna() only drops the nans. In numpy/pandas, I would do something like result = result[np.isfinite(result)]. What's the recommended equivalent in dask-land? Indexing the dask object with a boolean array gives an error. Is there some way to tell dask that inf or -inf should be considered null values, for example?

score 2 · Accepted Answer · answered Sep 12 '17 at 15:03

You should avoid applying NumPy functions directly to dask objects. They can trigger computation, and later dask.dataframe operations may not work cleanly with the results.

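Instead, use the equivalent dask.array function. Here is a minimal example that filters out infinite values with da.isinf (note that da.isinf is False for NaN, so NaN rows are kept; da.isfinite excludes them as well).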

In [1]: import numpy as np
   ...: import pandas as pd
   ...: import dask.dataframe as dd
   ...: import dask.array as da
   ...: df = pd.DataFrame({'x': [0, 1, 2], 'y': [0, np.inf, 5]})
   ...: df
   ...: 
Out[1]: 
   x         y
0  0  0.000000
1  1       inf
2  2  5.000000

In [2]: ddf = dd.from_pandas(df, npartitions=2)
   ...: ddf[~da.isinf(ddf.y)].compute()
   ...: 
Out[2]: 
   x    y
0  0  0.0
2  2  5.0

score 0 · Answer 2 · answered Sep 12 '17 at 15:04

OK, I just discovered that I can do the following:

import dask.array as da
result = result[da.isfinite(result)]

In general, it looks like using dask.array (da) operations was the missing piece I had been looking for.

Tim Morton
