Suppose the data frame df is built like this:

import pandas as pd

d = {'Title': ['Elden Ring', 'Starcraft 2', 'Terraforming Mars'], 'Genre': ['Fantasy;Videogame', 'Videogame', 'Fantasy;Boardgame']}
df = pd.DataFrame(data=d, index=None)

so that it looks like this:

Title               Genre
Elden Ring          Fantasy;Videogame
Starcraft 2         Videogame
Terraforming Mars   Fantasy;Boardgame

My goal is to end with a dataframe that looks like this:

Title               Genres                 Fantasy   Videogame   Boardgame
Elden Ring          [Fantasy, Videogame]   1         1           0
Starcraft 2         [Videogame]            0         1           0
Terraforming Mars   [Fantasy, Boardgame]   1         0           1

What is the best way to go about this? I tried:

from sklearn.preprocessing import MultiLabelBinarizer
df = pd.DataFrame(data=d, index=None)
df.Genre = df.Genre.str.split(';')
binar = MultiLabelBinarizer()
genre_labels = binar.fit_transform( df.Genre )
df[ binar.classes_ ] = genre_labels

This gives me a dataframe:

Title             Genre                 Boardgame   Fantasy     Videogame
Elden Ring        [Fantasy, Videogame]  0             1             1
Starcraft 2       [Videogame]           0             0             1
Terraforming Mars [Fantasy, Boardgame]  1             1             0

This gives me what I want, but it feels convoluted. Is there a cleaner way to do this?

Jibril
  • Are the values of `Genre` real lists or just strings that look like lists? –  Feb 21 '22 at 17:45
  • Your data is a bit confusing. From your code, it appears that the `Genre` values are actually semicolon-separated lists, but in your sample data, they appear to be comma-separated...? –  Feb 21 '22 at 17:54
  • Yes because of this line ( df.Genre = df.Genre.str.split(';') ) they become that. I see my original (first example) I laid it out wrong though. – Jibril Feb 21 '22 at 17:55
  • check my answer now. –  Feb 21 '22 at 18:45

2 Answers

1

.str.get_dummies was designed specifically for this:

df = pd.concat([df, df['Genre'].str.get_dummies(';')], axis=1)

Output:

>>> df
               Title              Genre  Boardgame  Fantasy  Videogame
0         Elden Ring  Fantasy;Videogame          0        1          1
1        Starcraft 2          Videogame          0        0          1
2  Terraforming Mars  Fantasy;Boardgame          1        1          0
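
As a self-contained sketch of the above, assuming `Genre` holds the semicolon-separated strings from the question: `str.get_dummies` returns the indicator columns in alphabetical order, so the optional reordering step arranges them by first appearance, as in the layout the question asks for. The names `dummies`, `order` and `result` are illustrative, not from the original post.

```python
import pandas as pd

d = {
    'Title': ['Elden Ring', 'Starcraft 2', 'Terraforming Mars'],
    'Genre': ['Fantasy;Videogame', 'Videogame', 'Fantasy;Boardgame'],
}
df = pd.DataFrame(data=d)

# One 0/1 indicator column per distinct genre; columns come back sorted alphabetically.
dummies = df['Genre'].str.get_dummies(sep=';')

# Optional: order the genre columns by first appearance in the data
# (illustrative step, only needed if column order matters).
order = df['Genre'].str.split(';').explode().drop_duplicates().tolist()

result = pd.concat([df, dummies[order]], axis=1)
print(result)
```

Dropping the reordering step and concatenating `dummies` directly gives the alphabetical column order shown in the output above.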
Community
  • I'm getting ;F;a;n;t;a;s;y;,; ;V;i;d;e;o;g;a;m;e; with the join(";") – Richard K Yu Feb 21 '22 at 17:51
  • @Richard yes, assuming that the sample dataframe the OP provided is actually what they're using, which I suspect not. As I pointed out in my comment on the question, `Genre` in the OP's sample data is a list of strings, not a list of lists. However, the data is actually quite confusing. –  Feb 21 '22 at 17:53
1

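If Genre holds strings that merely look like lists (e.g. '[Fantasy, Videogame]'), strip the brackets and use Series.str.get_dummies: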

df.Genre.str.strip('[]').str.get_dummies(sep=', ')
   Boardgame  Fantasy  Videogame
0          0        1          1
1          0        0          1
2          1        1          0

To append the result to the dataframe:

pd.concat([df, df.Genre.str.strip('[]').str.get_dummies(sep=', ')], axis=1)

               Title                 Genre  Boardgame  Fantasy  Videogame
0         Elden Ring  [Fantasy, Videogame]          0        1          1
1        Starcraft 2           [Videogame]          0        0          1
2  Terraforming Mars  [Fantasy, Boardgame]          1        1          0

If Genre starts out as a list type:

df.Genre = df.Genre.str.join(';')
pd.concat([df, df.Genre.str.get_dummies(sep=';')], axis=1)

               Title              Genre  Boardgame  Fantasy  Videogame
0         Elden Ring  Fantasy;Videogame          0        1          1
1        Starcraft 2          Videogame          0        0          1
2  Terraforming Mars  Fantasy;Boardgame          1        1          0
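
An alternative sketch that leaves the list column untouched, assuming `Genre` already holds real Python lists and that pandas is 0.25 or newer (needed for `Series.explode`): explode to one genre per row, one-hot encode, then collapse back to one row per original index. The names `exploded`, `dummies` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Title': ['Elden Ring', 'Starcraft 2', 'Terraforming Mars'],
    'Genre': [['Fantasy', 'Videogame'], ['Videogame'], ['Fantasy', 'Boardgame']],
})

# One row per (title, genre) pair; the original index is repeated for each genre.
exploded = df['Genre'].explode()

# One-hot encode each genre, then take the max per original index so that
# every title ends up with a single row of 0/1 indicators.
dummies = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

# Join on the index; the Genre column keeps its list values.
result = df.join(dummies)
print(result)
```

Unlike the `str.join` route, this does not overwrite `Genre`, which may matter if the lists are needed later.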
Psidom