Create Binarized Rows in Python Pandas Dataframe

Question

Suppose the data frame df is built as follows:

import pandas as pd
d = { 'Title': ['Elden Ring', 'Starcraft 2', 'Terraforming Mars'], 'Genre' : [ 'Fantasy;Videogame', 'Videogame', 'Fantasy;Boardgame'] }
df = pd.DataFrame(data=d, index=None)

so that it contains:

Elden Ring          Fantasy;Videogame
Starcraft 2         Videogame
Terraforming Mars   Fantasy;Boardgame

My goal is to end with a dataframe that looks like this:

Title               Genres                 Fantasy     Videogame   Boardgame
Elden Ring          [Fantasy, Videogame]      1            1            0
Starcraft 2         [Videogame]              0            1            0
Terraforming Mars   [Fantasy, Boardgame]      1            0            1

What is the best way to go about this? I tried doing

from sklearn.preprocessing import MultiLabelBinarizer
df = pd.DataFrame(data=d, index=None)
df.Genre = df.Genre.str.split(';')
binar = MultiLabelBinarizer()
genre_labels = binar.fit_transform( df.Genre )
df[ binar.classes_ ] = genre_labels

This gives me a dataframe:

Title             Genre                 Boardgame   Fantasy     Videogame
Elden Ring        [Fantasy, Videogame]  0             1             1
Starcraft 2       [Videogame]           0             0             1
Terraforming Mars [Fantasy, Boardgame]  1             1             0

This gives me what I want, but it feels convoluted. Is there a cleaner way to do this?

Are the values of `Genre` real lists or just strings that look like lists? — , Feb 21 '22 at 17:45
Your data is a bit confusing. From your code, it appears that the `Genre` values are actually semicolon-separated lists, but in your sample data, they appear to be comma-separated...? — , Feb 21 '22 at 17:54
Yes, because of this line ( df.Genre = df.Genre.str.split(';') ) they become that. I see that I laid out my original (first) example wrong, though. — Jibril, Feb 21 '22 at 17:55

score 1 · Answer 1 · edited Feb 21 '22 at 19:00

.str.get_dummies was designed specifically for this:

df = pd.concat([df, df['Genre'].str.get_dummies(';')], axis=1)

Output:

>>> df
               Title              Genre  Boardgame  Fantasy  Videogame
0         Elden Ring  Fantasy;Videogame          0        1          1
1        Starcraft 2          Videogame          0        0          1
2  Terraforming Mars  Fantasy;Boardgame          1        1          0
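
For reference, here is a minimal end-to-end sketch of the same approach on the sample data from the question. It assumes `Genre` still holds the raw semicolon-separated strings (that is, before any `.str.split(';')`). The column reordering is optional and is only there to match the layout requested in the question, since `get_dummies` returns the indicator columns in alphabetical order.

```python
import pandas as pd

# Sample data from the question; Genre is a ';'-separated string (assumption).
d = {
    'Title': ['Elden Ring', 'Starcraft 2', 'Terraforming Mars'],
    'Genre': ['Fantasy;Videogame', 'Videogame', 'Fantasy;Boardgame'],
}
df = pd.DataFrame(data=d)

# One 0/1 indicator column per distinct genre, sorted alphabetically.
dummies = df['Genre'].str.get_dummies(sep=';')

# Optional: reorder the indicator columns to match the question's layout.
dummies = dummies[['Fantasy', 'Videogame', 'Boardgame']]

# Append the indicator columns next to the original ones.
result = pd.concat([df, dummies], axis=1)
print(result)
```

If the column order does not matter, the reordering line can simply be dropped.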

Community

answered Feb 21 '22 at 17:48

I'm getting ;F;a;n;t;a;s;y;,; ;V;i;d;e;o;g;a;m;e; with the join(";") – Richard K Yu Feb 21 '22 at 17:51
@Richard yes, assuming that the sample dataframe the OP provided is actually what they're using, which I suspect not. As I pointed out in my comment on the question, `Genre` in the OP's sample data is a list of strings, not a list of lists. However, the data is actually quite confusing. – Feb 21 '22 at 17:53

Psidom · Answer 2 · 2022-02-21T17:56:25.940

Or use Series.str.get_dummies:

df.Genre.str.strip('[]').str.get_dummies(sep=', ')
   Boardgame  Fantasy  Videogame
0          0        1          1
1          0        0          1
2          1        1          0

To append to dataframe:

pd.concat([df, df.Genre.str.strip('[]').str.get_dummies(sep=', ')], axis=1)

               Title                 Genre  Boardgame  Fantasy  Videogame
0         Elden Ring  [Fantasy, Videogame]          0        1          1
1        Starcraft 2           [Videogame]          0        0          1
2  Terraforming Mars  [Fantasy, Boardgame]          1        1          0

If Genre starts out as a list type:

df.Genre = df.Genre.str.join(';')
pd.concat([df, df.Genre.str.get_dummies(sep=';')], axis=1)

               Title              Genre  Boardgame  Fantasy  Videogame
0         Elden Ring  Fantasy;Videogame          0        1          1
1        Starcraft 2          Videogame          0        0          1
2  Terraforming Mars  Fantasy;Boardgame          1        1          0
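
If the list column should be kept as lists (as in the question's desired output), a possible variation is to join into a temporary string instead of overwriting `Genre`. This is only a sketch: it assumes every value in `Genre` is a real Python list of strings and that no genre name contains `;`.

```python
import pandas as pd

# Genre already holds real Python lists here (assumption).
df = pd.DataFrame({
    'Title': ['Elden Ring', 'Starcraft 2', 'Terraforming Mars'],
    'Genre': [['Fantasy', 'Videogame'], ['Videogame'], ['Fantasy', 'Boardgame']],
})

# Join each list into a temporary ';'-separated string, then binarize it.
# The original column is not modified.
dummies = df['Genre'].str.join(';').str.get_dummies(sep=';')

# DataFrame.join aligns on the index, so rows stay matched to their titles.
result = df.join(dummies)
print(result)
```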
