Using IMDB data with text-valued feature variables in scikit-learn regression models

Question

I have a CSV file containing IMDB movie ratings data. The file has 27 features and 1 target variable. I have attached SampleData, and the full data set can be downloaded from KaggleData. I have learned that Python's sklearn package requires all data to be numeric. How can I use this data for a regression analysis? So far I have used the code below, but it fails with an error saying that "Some director name" can't be converted to float.

import pandas as pd
from sklearn.linear_model import LinearRegression
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'D:\Machine Learning\Final\movie_metadata.csv')
feature_cols = [
                 "director_facebook_likes", 
                 "cast_total_facebook_likes",
                 "movie_facebook_likes",
                 "facenumber_in_poster",
                 "gross",
                 "num_critic_for_reviews",
                 "num_voted_users",
                 "num_user_for_reviews",
                 "duration",
                 "title_year",
                 "content_rating",
                 "budget",
                 "director_name"]
X = df[feature_cols]
y = df.imdb_score
lm = LinearRegression()
# Fails here: content_rating and director_name hold text, not numbers
lm.fit(X, y)
print(lm.intercept_)
print(lm.coef_)

score 0 · Answer 1 · answered Nov 05 '16 at 20:50

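The simplest option is pd.get_dummies(), which replaces each text column with a set of 0/1 indicator columns, one per distinct value. You may also see this technique called one-hot encoding.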

simon
