
I tried using dask's DummyEncoder for one-hot encoding my data, but the results are not what I expected.

dask's DummyEncoder Example:

from dask_ml.preprocessing import DummyEncoder
import pandas as pd

data = pd.DataFrame({
    'B': ['a', 'a', 'a', 'b', 'c']
})
de = DummyEncoder()
de = de.fit(data)
testD = pd.DataFrame({'B': ['a','a']})
trans = de.transform(testD)
print(trans)

Output:

   B_a
0    1
1    1

Why doesn't it show B_b and B_c? But when I change testD like this:

testD = pd.DataFrame({'B': ['a','a', 'b', 'c']})

Result is:

   B_a  B_b  B_c
0    1    0    0
1    1    0    0
2    0    1    0
3    0    0    1

sklearn's OneHotEncoder Example (After LabelEncoding):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data = pd.DataFrame({
    'B': [1, 1, 1, 2, 3]
})
encoder = OneHotEncoder()
encoder = encoder.fit(data)
testdf = pd.DataFrame({'B': [2, 2]})
trans = encoder.transform(testdf).toarray()
pd.DataFrame(trans, columns=encoder.active_features_)

Output:

     1    2    3
0  0.0  1.0  0.0
1  0.0  1.0  0.0

How do I achieve the same result? The reason I want it this way is that I will be encoding a subset of the columns, concatenating the resulting encoded_df to the main df, and dropping the encoded column from the main df.

So something like below (main df):

   A  B   C
0  M  1  10
1  F  2  20
2  T  3  30
3  M  4  40
4  F  5  50
5  F  6  60

Expected output:

   A_F  A_M  A_T  B   C
0    0    1    0  1  10
1    1    0    0  2  20
2    0    0    1  3  30
3    0    1    0  4  40
4    1    0    0  5  50
5    1    0    0  6  60
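For this encode-concatenate-drop workflow, plain pandas get_dummies can handle the subset of columns and drop the originals in one step; a minimal sketch (note that the dummy columns are appended after the untouched columns, so the column order differs from the table above):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['M', 'F', 'T', 'M', 'F', 'F'],
    'B': [1, 2, 3, 4, 5, 6],
    'C': [10, 20, 30, 40, 50, 60],
})

# Encode only column A; the original 'A' column is dropped automatically
out = pd.get_dummies(df, columns=['A'])
print(out)
```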

EDIT:

Since dask internally uses pandas, I believe it uses get_dummies, which would explain how DummyEncoder is behaving. If someone could point out a way to do the same in pandas, that would also be appreciated.
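One pandas-only way to get stable dummy columns for unseen data (my own suggestion, not from the dask docs): cast the column to a categorical dtype with a fixed category set before calling get_dummies, since get_dummies then emits one column per category rather than per observed value:

```python
import pandas as pd

# Fix the category set up front so frames missing some values still get every column
cats = pd.CategoricalDtype(categories=['a', 'b', 'c'])

testD = pd.DataFrame({'B': ['a', 'a']}).astype({'B': cats})

# One dummy column per category, even though only 'a' is present
trans = pd.get_dummies(testD)
print(trans)
```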

Asif Ali

1 Answer


From dask-ml's documentation for DummyEncoder's columns parameter:

The columns to dummy encode. Must be categorical dtype.
Dummy encodes all categorical dtype columns by default.

The dask-ml documentation also says that you must always use a Categorizer before some encoders (DummyEncoder included).

A correct way to do this:

from dask_ml.preprocessing import Categorizer, DummyEncoder
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    Categorizer(), DummyEncoder())

pipe.fit(data)

pipe.transform(testD)

Which will output:

   B_a  B_b  B_c
0    1    0    0
1    1    0    0
Qusai Alothman