I am able to discretize a pandas DataFrame column by column with this code:
import numpy as np
import pandas as pd
def discretize(X, n_scale=1):
    for c in X.columns:
        loc = X[c].median()
        # median absolute deviation of the column
        scale = mad(X[c])
        bins = [-np.inf, loc - (scale * n_scale),
                loc + (scale * n_scale), np.inf]
        X[c] = pd.cut(X[c], bins, labels=[-1, 0, 1])
    return X
I want to discretize each column using loc (the median of the column) and scale (the median absolute deviation of the column) as parameters.
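For reference, since `mad` is not defined above, a minimal median-absolute-deviation helper could look like this (a library implementation such as `statsmodels.robust.mad` would also work, though it rescales by a normalization constant by default):

def mad(series):
    # Median absolute deviation: the median of the absolute
    # deviations from the series median.
    return (series - series.median()).abs().median()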
With small dataframes the time required is acceptable (since it is a single-threaded solution).
However, with larger dataframes I want to exploit more threads (or processes) to speed up the computation.
I am no expert on Dask, but it seems to be the right tool for this problem. I imagine the discretization should be feasible with code along these lines:
import dask.dataframe as dd
import numpy as np
import pandas as pd
def discretize(X, n_scale=1):
    # I'm using only 2 partitions for this example
    X_dask = dd.from_pandas(X, npartitions=2)
    # FIXME:
    # how can I define bins to compute loc and scale
    # for each column?
    bins = [-np.inf, loc - (scale * n_scale),
            loc + (scale * n_scale), np.inf]
    X = X_dask.apply(pd.cut, axis=1, args=(bins,), labels=[-1, 0, 1]).compute()
    return X
The problem here is that `loc` and `scale` depend on the column values, so they have to be computed for each column, either before or during the apply.
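One direction that seems possible is to precompute loc and scale per column with plain pandas and then cut each column partition-wise. A minimal sketch (assuming the statistics may be computed eagerly before handing the data to Dask, and reusing the `mad` helper above; `discretize_precomputed` is just a name I made up) would be:

import dask.dataframe as dd
import numpy as np
import pandas as pd

def discretize_precomputed(X, n_scale=1):
    X_dask = dd.from_pandas(X, npartitions=2)
    for c in X.columns:
        # the statistics are computed eagerly with pandas, not in parallel
        loc = X[c].median()
        scale = mad(X[c])
        bins = [-np.inf, loc - (scale * n_scale),
                loc + (scale * n_scale), np.inf]
        # pd.cut is applied independently to each partition of the column
        X_dask[c] = X_dask[c].map_partitions(pd.cut, bins, labels=[-1, 0, 1])
    return X_dask.compute()

However, this only parallelizes the pd.cut step; the per-column statistics are still computed serially, and that is exactly the part I do not know how to express in Dask.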
How can this be done?