
I am able to discretize a Pandas dataframe by columns with this code:

import numpy as np
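import pandas as pd

# NOTE: mad() is neither defined nor imported in this snippet; it is assumed
# to be a helper that returns the median absolute deviation of a Series.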

def discretize(X, n_scale=1):

    for c in X.columns:
        loc = X[c].median()

        # median absolute deviation of the column
        scale = mad(X[c])

        bins = [-np.inf, loc - (scale * n_scale),
                loc + (scale * n_scale), np.inf]
        X[c] = pd.cut(X[c], bins, labels=[-1, 0, 1])

    return X

I want to discretize each column using two parameters: loc (the median of the column) and scale (the median absolute deviation of the column).
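To make the binning rule concrete, here is a minimal sketch for a single column. The `mad` helper below is an assumption: the code in this post never shows where `mad` comes from, so this version computes the raw (unnormalized) median absolute deviation, which may differ from the helper actually used by a constant factor.

```python
import numpy as np
import pandas as pd

def mad(x):
    """Hypothetical helper: raw median absolute deviation, median(|x - median(x)|)."""
    x = pd.Series(x)
    return (x - x.median()).abs().median()

col = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

loc = col.median()    # 3.0
scale = mad(col)      # absolute deviations are 2, 1, 0, 1, 97, so the median is 1.0

# With n_scale=1 the inner bin edges are loc - scale = 2.0 and loc + scale = 4.0.
bins = [-np.inf, loc - scale, loc + scale, np.inf]

# pd.cut uses right-closed intervals by default: (-inf, 2], (2, 4], (4, inf]
labels = pd.cut(col, bins, labels=[-1, 0, 1])
print(labels.tolist())  # [-1, -1, 0, 0, 1]
```

One thing to watch for: if a column's MAD is zero, the two inner edges coincide, and `pd.cut` raises an error on duplicate bin edges unless told how to handle them.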

With small dataframes the running time is acceptable, even though this is a single-threaded solution.

However, with larger dataframes I want to exploit more threads (or processes) to speed up the computation.

I am no expert in Dask, which should be able to solve this problem.

In my case, I would expect the discretization to be feasible with code like this:

import dask.dataframe as dd
import numpy as np
import pandas as pd

def discretize(X, n_scale=1):

    # I'm using only 2 partitions for this example
    X_dask = dd.from_pandas(X, npartitions=2)

    # FIXME: 
    # how can I define bins to compute loc and scale
    # for each column?
    bins = [-np.inf, loc - (scale * n_scale),
            loc + (scale * n_scale), np.inf]

    X = X_dask.apply(pd.cut, axis=1, args=(bins,), labels=[-1, 0, 1]).compute()

    return X

but the problem here is that loc and scale are dependent on column values, so they should be computed for each column, either before or during the apply.

How can it be done?

gc5

1 Answer


I've never used dask, but I guess you can define a new function to be used in apply.

import dask.dataframe as dd
import multiprocessing as mp
import numpy as np
import pandas as pd

def discretize(X, n_scale=1):

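    # Transpose so that each original column becomes a row; the row-wise
    # apply (axis=1) then processes one original column per call.
    X_dask = dd.from_pandas(X.T, npartitions=mp.cpu_count()+1)
    X = X_dask.apply(_discretize_series,
                     axis=1, args=(n_scale,),
                     columns=X.columns).compute().T  # transpose back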

    return X

def _discretize_series(x, n_scale=1):

    loc = x.median()
    # mad() is assumed to be defined elsewhere (median absolute deviation)
    scale = mad(x)
    bins = [-np.inf, loc - (scale * n_scale),
            loc + (scale * n_scale), np.inf]
    x = pd.cut(x, bins, labels=[-1, 0, 1])

    return x
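
For comparison, here is a sketch of the "compute loc and scale before the apply" idea from the question, done without Dask. This is a different technique from the code above, not a Dask solution: it computes the per-column median and raw median absolute deviation in one pass with pandas and then compares the whole frame against the per-column edges. The name `discretize_vectorized` is hypothetical, and two behaviors differ from `pd.cut`: the result holds plain integers rather than categoricals, and missing values map to 0 instead of NaN.

```python
import numpy as np
import pandas as pd

def discretize_vectorized(X, n_scale=1):
    """Hypothetical sketch: per-column median/MAD binning without a Python loop."""
    loc = X.median()                     # one median per column
    scale = (X - loc).abs().median()     # raw MAD per column (assumed definition)
    lower = loc - scale * n_scale
    upper = loc + scale * n_scale

    # Same interval convention as pd.cut's default (right-closed):
    # x <= lower -> -1, lower < x <= upper -> 0, x > upper -> 1
    below = X.le(lower, axis=1).to_numpy()
    above = X.gt(upper, axis=1).to_numpy()
    codes = np.where(below, -1, np.where(above, 1, 0))

    # Returns a new frame; the input X is left unmodified.
    return pd.DataFrame(codes, index=X.index, columns=X.columns)

X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 100.0],
                  "b": [10.0, 20.0, 30.0, 40.0, 50.0]})
out = discretize_vectorized(X)
print(out)
```

Whether this is faster than a partitioned apply depends on the shape of the data, so it is worth timing both on a representative frame.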
gc5
Happy001
  • Thanks. I edited your question with the working solution. Please accept it if you think it suffices. – gc5 Aug 09 '16 at 08:18