1

Source data is from the book Python_for_Data_Analysis, chp 2. The data for movies is as follows and can also be found here:

movies.head(n=10)
Out[3]: 
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller

The following code has trouble when I use iloc:

import pandas as pd
import numpy as np
from pandas import Series,DataFrame

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table( 'movies.dat', sep='::', 
                       engine='python', header=None, names=mnames)
movies.head(n=10)
genre_iter = (set(x.split('|')) for x in movies['genres'])
genres = sorted(set.union(*genre_iter))
dummies = DataFrame(np.zeros((len(movies), len(genres))), columns=genres)

for i, gen in enumerate(movies['genres']):
# the following code report error
# TypeError: '['Animation', "Children's", 'Comedy']' is an invalid key
    dummies.iloc[i,dummies.columns.get_loc(gen.split('|'))] = 1
# while loc can run successfully
    dummies.loc[dummies.index[[i]],gen.split('|')] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]

I have some understanding of why Children's is error, but why Animation,Comedy are error? I have tried:

dummies.columns.get_loc('Animation')

and the result is 2.

Renke
  • 452
  • 6
  • 22
  • Just saw this answer https://stackoverflow.com/questions/45017074/pandas-crosstabulation-and-counting which might be even better. Here, it would be something like `str.get_dummies('|')` – JohnE Jul 10 '17 at 17:34

2 Answers2

1

This is a pretty simple (and fast) answer using string matching that should work fine here and in any case where the genres names don't overlap. E.g. if you had categories "crime" and "crime thriller" then a crime thriller would be categorized under both crime AND crime thriller rather than just crime thriller. (But see note below for how you could generalize this.)

for g in genres:
    movies[g] = movies.genres.str.contains(g).astype(np.int8)

(Note using np.int8 rather than int will save a lot of memory as int defaults to 64 bits rather than 8)

Results for movies.head(2):

   movie_id             title                        genres  Action  \
0         1  Toy Story (1995)   Animation|Children's|Comedy       0   
1         2    Jumanji (1995)  Adventure|Children's|Fantasy       0   

   Adventure  Animation  Children's  Comedy  Crime  Documentary   ...     \
0          0          1           1       1      0            0   ...      
1          1          0           1       0      0            0   ...      

   Fantasy  Film-Noir  Horror  Musical  Mystery  Romance  Sci-Fi  Thriller  \
0        0          0       0        0        0        0       0         0   
1        1          0       0        0        0        0       0         0   

   War  Western  
0    0        0  
1    0        0  

The following generalizaion of the above code may be overkill but gives you a more general approach that should avoid potential double counting of genre categories (e.g. equating Crime and Crime Thriller):

# add '|' delimiter to beginning and end of the genres column
movies['genres2'] = '|' + movies['genres'] + '|'

# search for '|Crime|' rather than 'Crime' which is much safer b/c
# we don't match a category which merely contains 'Crime', we 
# only match 'Crime' exactly
for g in genres:
    movies[g+'2'] movies.genres2.str.contains('\|'+g+'\|').astype(np.int8)

(If you're better with regular expressions than me you wouldn't need to add the '|' at the beginning and end ;-)

JohnE
  • 29,156
  • 8
  • 79
  • 109
0

Try

dummies = movies.genres.str.get_dummies()
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • @Renke then something is wrong with your data. You have a row where genre is this string ['Animation', "Children's", 'Comedy'] and not the list of strings. – piRSquared Jul 09 '17 at 13:57
  • it seems that dummies.columns.get_loc(gen.split('|')) will trigger problem when there is " '" in the gen. is there any way to solve this problem? – Renke Jul 09 '17 at 14:03
  • get_dummies can take a separator: `str.get_dummies('|')` https://stackoverflow.com/questions/45017074/pandas-crosstabulation-and-counting – JohnE Jul 24 '17 at 18:04