I tried using dask-ml's DummyEncoder to one-hot encode my data, but the results are not what I expected.
dask's DummyEncoder Example:
from dask_ml.preprocessing import DummyEncoder
import pandas as pd
data = pd.DataFrame({
'B': ['a', 'a', 'a', 'b','c']
})
de = DummyEncoder()
de = de.fit(data)
testD = pd.DataFrame({'B': ['a','a']})
trans = de.transform(testD)
print(trans)
Output:
B_a
0 1
1 1
Why doesn't it show B_b and B_c? But when I change testD to this:
testD = pd.DataFrame({'B': ['a','a', 'b', 'c']})
Result is:
B_a B_b B_c
0 1 0 0
1 1 0 0
2 0 1 0
3 0 0 1
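If it helps to see the underlying behavior, plain pandas get_dummies does the same thing: it only creates columns for the values actually present in the frame being encoded, which is why the first testD produces only B_a.

```python
import pandas as pd

# get_dummies only creates columns for values actually present in the frame
testD = pd.DataFrame({'B': ['a', 'a']})
dummies = pd.get_dummies(testD)
print(dummies.columns.tolist())  # ['B_a']
```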
sklearn's OneHotEncoder Example (After LabelEncoding):
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
data = pd.DataFrame({
'B': [1, 1, 1, 2, 3]
})
encoder = OneHotEncoder()
encoder = encoder.fit(data)
testdf = pd.DataFrame({'B': [2, 2]})
trans = encoder.transform(testdf).toarray()
pd.DataFrame(trans, columns=encoder.active_features_)
Output:
1 2 3
0 0.0 1.0 0.0
1 0.0 1.0 0.0
How do I achieve the same results? The reason I want it this way is that I will be encoding a subset of the columns and then concatenating the resulting encoded_df to the main df, while dropping the encoded column from the main df.
So something like below (main df):
A B C
0 M 1 10
1 F 2 20
2 T 3 30
3 M 4 40
4 F 5 50
5 F 6 60
Expected output:
A_F A_M A_T B C
0 0 1 0 1 10
1 1 0 0 2 20
2 0 0 1 3 30
3 0 1 0 4 40
4 1 0 0 5 50
5 1 0 0 6 60
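For reference, a sketch of that encode-concat-drop step in plain pandas (passing dtype=int so the dummies come out as 0/1 rather than booleans):

```python
import pandas as pd

df = pd.DataFrame({'A': ['M', 'F', 'T', 'M', 'F', 'F'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [10, 20, 30, 40, 50, 60]})

# One-hot encode column A, then replace it with the encoded columns
encoded = pd.get_dummies(df['A'], prefix='A', dtype=int)
result = pd.concat([encoded, df.drop(columns='A')], axis=1)
print(result.columns.tolist())  # ['A_F', 'A_M', 'A_T', 'B', 'C']
```

This reproduces the expected output above, but only for the categories present in df itself; fixing the full category set for unseen test data is the remaining problem.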
EDIT:
Since dask internally uses pandas, I believe it uses get_dummies, which is how DummyEncoder behaves. If someone could point out a way to do the same in pandas, that would also be appreciated.
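One way to get the sklearn-style behavior from get_dummies is to declare the column as categorical with the full set of categories remembered from the training data; get_dummies then emits a column for every category, whether or not it appears in the test frame. A minimal sketch (the "fit" step here is just remembering the categories; variable names are mine):

```python
import pandas as pd

# "Fit": remember the full category set from the training data
train = pd.DataFrame({'B': ['a', 'a', 'a', 'b', 'c']})
categories = sorted(train['B'].unique())

# "Transform": cast the test column to categorical with those categories,
# so get_dummies produces a column for every category, present or not
testD = pd.DataFrame({'B': ['a', 'a']})
testD['B'] = pd.Categorical(testD['B'], categories=categories)
trans = pd.get_dummies(testD, dtype=int)
print(trans.columns.tolist())  # ['B_a', 'B_b', 'B_c']
```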