I have a CSV file containing IMDB movie ratings data. The file has 27 features and 1 target variable. I have attached SampleData, and the full data set can be downloaded from KaggleData. I have learned that Python's sklearn package requires all input data to be numeric. How do I use this data for a regression analysis? Right now I am using the code below, but it fails with an error saying that "Some director name" can't be converted to float.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Raw string so the backslashes in the Windows path are taken literally
df = pd.read_csv(r'D:\Machine Learning\Final\movie_metadata.csv')

feature_cols = [
    "director_facebook_likes",
    "cast_total_facebook_likes",
    "movie_facebook_likes",
    "facenumber_in_poster",
    "gross",
    "num_critic_for_reviews",
    "num_voted_users",
    "num_user_for_reviews",
    "duration",
    "title_year",
    "content_rating",   # text column
    "budget",
    "director_name",    # text column
]
X = df[feature_cols]
y = df.imdb_score

lm = LinearRegression()
lm.fit(X, y)  # fails here: the text columns can't be converted to float
print(lm.intercept_)
print(lm.coef_)
aks_Nin

1 Answer

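The simplest option is pd.get_dummies(), which replaces each categorical (text) column with a set of 0/1 indicator columns, one per distinct value. You may also come across this technique under the name one-hot encoding.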

simon