I have a dask Series from which I need to drop both infs and nans. .dropna() only drops the nans. In numpy/pandas, I would do something like result = result[np.isfinite(result)]. What's the recommended equivalent in dask-land? Indexing the dask object with a boolean array gives an error. Is there some way to tell dask that inf or -inf should be considered null values, for example?

Tim Morton
2 Answers
You should avoid using NumPy functions. These will trigger computation and future dask.dataframe operations will be hesitant about using those results.
Instead, use the equivalent dask.array function. Here is a minimal example.
In [1]: import numpy as np
   ...: import pandas as pd
   ...: import dask.dataframe as dd
   ...: import dask.array as da
   ...: df = pd.DataFrame({'x': [0, 1, 2], 'y': [0, np.inf, 5]})
   ...: df
   ...:
Out[1]:
   x         y
0  0  0.000000
1  1       inf
2  2  5.000000

In [2]: ddf = dd.from_pandas(df, npartitions=2)
   ...: ddf[~da.isinf(ddf.y)].compute()
   ...:
Out[2]:
   x    y
0  0  0.0
2  2  5.0

MRocklin
OK, I just discovered that I can do the following:
import dask.array as da
result = result[da.isfinite(result)]
In general, it looks like using dask.array (da) operations in place of their NumPy counterparts is the missing piece I've been looking for.

Tim Morton