The source data is from the book Python for Data Analysis, chapter 2. The movies data looks as follows and can also be found here:
movies.head(n=10)
Out[3]:
   movie_id  title                               genres
0  1         Toy Story (1995)                    Animation|Children's|Comedy
1  2         Jumanji (1995)                      Adventure|Children's|Fantasy
2  3         Grumpier Old Men (1995)             Comedy|Romance
3  4         Waiting to Exhale (1995)            Comedy|Drama
4  5         Father of the Bride Part II (1995)  Comedy
5  6         Heat (1995)                         Action|Crime|Thriller
6  7         Sabrina (1995)                      Comedy|Romance
7  8         Tom and Huck (1995)                 Adventure|Children's
8  9         Sudden Death (1995)                 Action
9  10        GoldenEye (1995)                    Action|Adventure|Thriller
The following code raises an error when I use iloc:
import pandas as pd
import numpy as np
from pandas import Series,DataFrame
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::',
                       engine='python', header=None, names=mnames)
movies.head(n=10)
genre_iter = (set(x.split('|')) for x in movies['genres'])
genres = sorted(set.union(*genre_iter))
dummies = DataFrame(np.zeros((len(movies), len(genres))), columns=genres)
for i, gen in enumerate(movies['genres']):
    # the following line raises:
    # TypeError: '['Animation', "Children's", 'Comedy']' is an invalid key
    dummies.iloc[i, dummies.columns.get_loc(gen.split('|'))] = 1
    # while the loc version runs successfully
    dummies.loc[dummies.index[[i]], gen.split('|')] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
I have some idea why Children's might cause an error, but why do Animation and Comedy fail as well? I have tried:
dummies.columns.get_loc('Animation')
and the result is 2.
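For reference, here is a minimal sketch that reproduces the behaviour on a hypothetical three-row sample (not the real movies.dat; the sample rows and the variable names genres_col, single_pos, and list_key_failed are assumptions for illustration). It contrasts get_loc with a single label, get_loc with a list, and Index.get_indexer, which as far as I can tell accepts a list of labels and returns their integer positions, so it may be one way to make the iloc version work.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for movies['genres']
genres_col = pd.Series([
    "Animation|Children's|Comedy",
    "Adventure|Children's|Fantasy",
    "Comedy|Romance",
])

genres = sorted(set.union(*(set(x.split('|')) for x in genres_col)))
dummies = pd.DataFrame(np.zeros((len(genres_col), len(genres))), columns=genres)

# get_loc with ONE label returns that label's integer position
single_pos = dummies.columns.get_loc('Animation')

# get_loc with a LIST fails: the whole list is treated as one (invalid) key.
# The exact exception type seems to vary between pandas versions,
# so this sketch catches Exception broadly.
try:
    dummies.columns.get_loc(['Animation', 'Comedy'])
    list_key_failed = False
except Exception:
    list_key_failed = True

# get_indexer takes a list of labels and returns an array of positions,
# which iloc can use directly
for i, gen in enumerate(genres_col):
    cols = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, cols] = 1

print(single_pos, list_key_failed)
print(dummies)
```

If this reading is right, the error in my loop comes from the list argument as a whole rather than from any individual genre name.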