Solving pd.get_dummies dysfunction in python

Question

I have

 a={0: ['I3925'], 1: ['I3925'], 2: ['I3925'], 3: ['I2355'], 4: ['I2355'], 5: ['I2355'], 6: ['I111'], 7: ['I111'], 8: ['I111'], 9: ['I405'], 10: ['I405'], 11: ['I3878', 'I2864'], 12: ['I3878'], 13: ['I534'], 14: ['I534'], 15: ['I134', 'I2276'], 16: ['I107'], 17: ['I107'], 18: ['I2864']}

in which a few keys (11 and 15) hold a second I number in addition to the first.

import pandas as pd

b = pd.Series(a, name="a")  # the dict keys become the index
pd.get_dummies(b.apply(pd.Series))

get_dummies then does not do what I expect: it creates separate, prefixed columns (1_I2276 and 1_I2864) to store the match with the second I number, instead of putting it in the same column as the first. I don't understand why.

    0_I107  0_I111  0_I134  0_I2355  0_I2864  0_I3878  0_I3925  0_I405  0_I534  1_I2276  1_I2864
0        0       0       0        0        0        0        1       0       0        0        0
1        0       0       0        0        0        0        1       0       0        0        0
2        0       0       0        0        0        0        1       0       0        0        0
3        0       0       0        1        0        0        0       0       0        0        0
4        0       0       0        1        0        0        0       0       0        0        0
5        0       0       0        1        0        0        0       0       0        0        0
6        0       1       0        0        0        0        0       0       0        0        0
7        0       1       0        0        0        0        0       0       0        0        0
8        0       1       0        0        0        0        0       0       0        0        0
9        0       0       0        0        0        0        0       1       0        0        0
10       0       0       0        0        0        0        0       1       0        0        0
11       0       0       0        0        0        1        0       0       0        0        1
12       0       0       0        0        0        1        0       0       0        0        0
13       0       0       0        0        0        0        0       0       1        0        0
14       0       0       0        0        0        0        0       0       1        0        0
15       0       0       1        0        0        0        0       0       0        1        0
16       1       0       0        0        0        0        0       0       0        0        0
17       1       0       0        0        0        0        0       0       0        0        0
18       0       0       0        0        1        0        0       0       0        0        0

Could someone please explain what I am doing wrong?

Because `b.apply(pd.Series)` generates a DataFrame with two columns. The output will have columns named `columnname_dummyvalue` — Zero, Sep 15 '17 at 17:55
You may be looking for `pd.Series([v for x in b for v in x]).str.get_dummies()`? — Zero, Sep 15 '17 at 17:57
@JohnGalt OK, thanks. So the problem comes not from get_dummies but from the conversion to Series; the question then is why? It should merely convert each list into a Series. I only perform the conversion so that get_dummies accepts the input, since it doesn't accept lists. I will try the expression you propose. I actually just want the "cells" of this column to contain Series instead of lists, since that is the only thing that prevents get_dummies from working. Alexander, you're right, I'll edit the OP (but that doesn't change the problem). Scott: I would like the 1s of the duplicated column to end up in the same column as the others. — Ando Jurai, Sep 16 '17 at 11:27
@JohnGalt The expression you gave works oddly: it gives two 1s for I3878 at indices 11 and 13 when it should be 11 and 12, and it doesn't give two entries for index 15 — Ando Jurai, Sep 16 '17 at 12:02
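
To make the comments above concrete, here is a minimal sketch on a shortened version of the data (the names `wide` and `flat` are made up for illustration). It shows that `b.apply(pd.Series)` yields a two-column frame, that `pd.get_dummies` encodes each column on its own and prefixes the dummies with the column label, and that flattening the lists replaces the original keys with a fresh index, which would account for the shifted rows reported in the last comment.

```python
import pandas as pd

# Shortened version of the question's data: key 1 holds two labels.
b = pd.Series({0: ['I3925'], 1: ['I3878', 'I2864'], 2: ['I3878']}, name="a")

# Each list is expanded into positional columns 0 and 1
# (NaN where a list has only one element).
wide = b.apply(pd.Series)
print(wide.shape)

# get_dummies on a DataFrame encodes every column separately and
# prefixes each dummy with the label of the column it came from.
print(list(pd.get_dummies(wide).columns))

# Flattening gives one row per label with a new 0..n-1 index, so the
# rows no longer line up with the original keys 0, 1, 2.
flat = pd.Series([v for x in b for v in x])
print(list(flat.index))
```

In this sketch the second label of key 1 sits at position 2 of `flat`, pushing the label of key 2 to position 3, the same kind of shift as the one seen at indices 11 to 13 in the full data.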

score 3 · Answer 1 · answered Sep 15 '17 at 18:10

Option 1

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(b), b.index, mlb.classes_)

    I107  I111  I134  I2276  I2355  I2864  I3878  I3925  I405  I534
0      0     0     0      0      0      0      0      1     0     0
1      0     0     0      0      0      0      0      1     0     0
2      0     0     0      0      0      0      0      1     0     0
3      0     0     0      0      1      0      0      0     0     0
4      0     0     0      0      1      0      0      0     0     0
5      0     0     0      0      1      0      0      0     0     0
6      0     1     0      0      0      0      0      0     0     0
7      0     1     0      0      0      0      0      0     0     0
8      0     1     0      0      0      0      0      0     0     0
9      0     0     0      0      0      0      0      0     1     0
10     0     0     0      0      0      0      0      0     1     0
11     0     0     0      0      0      1      1      0     0     0
12     0     0     0      0      0      0      1      0     0     0
13     0     0     0      0      0      0      0      0     0     1
14     0     0     0      0      0      0      0      0     0     1
15     0     0     1      1      0      0      0      0     0     0
16     1     0     0      0      0      0      0      0     0     0
17     1     0     0      0      0      0      0      0     0     0
18     0     0     0      0      0      1      0      0     0     0

Option 2

b.str.join('|').str.get_dummies()

    I107  I111  I134  I2276  I2355  I2864  I3878  I3925  I405  I534
0      0     0     0      0      0      0      0      1     0     0
1      0     0     0      0      0      0      0      1     0     0
2      0     0     0      0      0      0      0      1     0     0
3      0     0     0      0      1      0      0      0     0     0
4      0     0     0      0      1      0      0      0     0     0
5      0     0     0      0      1      0      0      0     0     0
6      0     1     0      0      0      0      0      0     0     0
7      0     1     0      0      0      0      0      0     0     0
8      0     1     0      0      0      0      0      0     0     0
9      0     0     0      0      0      0      0      0     1     0
10     0     0     0      0      0      0      0      0     1     0
11     0     0     0      0      0      1      1      0     0     0
12     0     0     0      0      0      0      1      0     0     0
13     0     0     0      0      0      0      0      0     0     1
14     0     0     0      0      0      0      0      0     0     1
15     0     0     1      1      0      0      0      0     0     0
16     1     0     0      0      0      0      0      0     0     0
17     1     0     0      0      0      0      0      0     0     0
18     0     0     0      0      0      1      0      0     0     0
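
Option 2 works because `Series.str.get_dummies` is a different method from `pd.get_dummies`: it splits every string on a separator (the `sep` parameter, which defaults to `'|'`) and builds one indicator column per distinct token. A minimal sketch on a shortened version of the data, with the separator spelled out (the names `joined` and `dummies` are illustrative):

```python
import pandas as pd

# Shortened version of the question's data: key 1 holds two labels.
b = pd.Series({0: ['I3925'], 1: ['I3878', 'I2864'], 2: ['I3878']})

# Join each list into a single pipe-delimited string, e.g. 'I3878|I2864'.
joined = b.str.join('|')

# Split each string on `sep` and create one 0/1 column per distinct token.
dummies = joined.str.get_dummies(sep='|')
print(dummies)
```

The split is a plain separator split rather than a regular expression match, so any separator works as long as it cannot occur inside a label.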

Thanks, I didn't know the first one; so it basically does the same as get_dummies (but works better? ;)). As for the second, if I understand correctly, it joins the elements of each list into a single string separated by pipes. But I don't get how that makes get_dummies work better. Since there is now only one string per cell, how does the method "get" that it has to split the string? Does it work like a regular expression in this case? I wasn't aware of this behavior. — Ando Jurai, Sep 16 '17 at 11:29
It is bad practice to modify an object when passing it as an argument and then to use the modified object's attribute in the same function call. Also, I would name the parameters for the DataFrame constructor. — Ted Petrou, Sep 17 '17 at 10:06
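
One way to act on that last comment, sketched on a shortened version of the data: fit and transform in a separate statement, then pass `index` and `columns` to the constructor by name. The variable names `encoded` and `result` are illustrative, and this is only one possible arrangement of Option 1.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Shortened version of the question's data: key 1 holds two labels.
b = pd.Series({0: ['I3925'], 1: ['I3878', 'I2864'], 2: ['I3878']})

mlb = MultiLabelBinarizer()

# Fit and transform first, so that `classes_` is already populated
# by the time it is read on the next line.
encoded = mlb.fit_transform(b)

# Name the constructor arguments instead of relying on their position.
result = pd.DataFrame(encoded, index=b.index, columns=mlb.classes_)
print(result)
```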

score 2 · Answer 2 · answered Sep 15 '17 at 17:59

Something like this?

pd.get_dummies(b.apply(pd.Series).stack()).groupby(level=0).sum()

Output:

    I107  I111  I134  I2276  I2355  I2864  I3878  I3925  I405  I534
0      0     0     0      0      0      0      0      1     0     0
1      0     0     0      0      0      0      0      1     0     0
2      0     0     0      0      0      0      0      1     0     0
3      0     0     0      0      1      0      0      0     0     0
4      0     0     0      0      1      0      0      0     0     0
5      0     0     0      0      1      0      0      0     0     0
6      0     1     0      0      0      0      0      0     0     0
7      0     1     0      0      0      0      0      0     0     0
8      0     1     0      0      0      0      0      0     0     0
9      0     0     0      0      0      0      0      0     1     0
10     0     0     0      0      0      0      0      0     1     0
11     0     0     0      0      0      1      1      0     0     0
12     0     0     0      0      0      0      1      0     0     0
13     0     0     0      0      0      0      0      0     0     1
14     0     0     0      0      0      0      0      0     0     1
15     0     0     1      1      0      0      0      0     0     0
16     1     0     0      0      0      0      0      0     0     0
17     1     0     0      0      0      0      0      0     0     0
18     0     0     0      0      0      1      0      0     0     0
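
The one-liner in this answer can be broken into steps, sketched here on a shortened version of the data. The intermediate names are made up for illustration, and the aggregation is written as `groupby(level=0).sum()` because the `level` argument of `sum` is not available in recent pandas releases. The point to notice is that `get_dummies` receives a Series, not a DataFrame, so no column prefixes appear.

```python
import pandas as pd

# Shortened version of the question's data: key 1 holds two labels.
b = pd.Series({0: ['I3925'], 1: ['I3878', 'I2864'], 2: ['I3878']})

# stack() reshapes the wide frame into one long Series indexed by
# (original key, position). Any NaN padding is either dropped here or
# ignored by get_dummies below, depending on the pandas version.
stacked = b.apply(pd.Series).stack()

# get_dummies on a Series names the columns after the values only.
per_label = pd.get_dummies(stacked)

# Collapse the (key, position) rows back to one row per original key.
result = per_label.groupby(level=0).sum()
print(result)
```

If prefixed columns still show up, it may be worth checking that `get_dummies` is being called on the stacked Series and not on the two-column frame.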

That's strange, I tried this but stack failed with the same values, they didn't get separated and stayed in the same row... — Ando Jurai, Sep 16 '17 at 11:34
Actually I also did this without sum and the output was the same, only labels in different orders... but columns are the same and all have prefixes. — Ando Jurai, Sep 16 '17 at 11:49
