
I have trained a multi-label classifier using SVM, Logistic Regression and Naive Bayes. My question is: how do I pass unseen data to the classifier? Here's my full code:

# Bring all the important libraries

%matplotlib inline

import re
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
#from nltk.corpus import stopwords
#stop_words = set(stopwords.words('english'))
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import seaborn as sns


df = pd.read_csv("movies_genres_en.csv", delimiter='\t')
df.drop('plot_lang', axis=1, inplace=True)
df.rename(columns={'plot':'plot_text'}, inplace=True)
df.info()

# use a for loop to count the movies in each genre
df_genres = df.drop(['plot_text', 'title'], axis=1)
counts = []
categories = list(df_genres.columns.values)
for i in categories:
    counts.append((i, df_genres[i].sum()))
df_stats = pd.DataFrame(counts, columns=['genre', '#movies'])
df_stats

# Create a function to clean the text

def clean_text(text):
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    # handle 'scuse before the generic 's rule, which would otherwise consume it
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    # raw strings avoid invalid-escape warnings for \W and \s
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip(' ')
    return text

# clean up the text in plot
df['plot_text'] = df['plot_text'].map(lambda com : clean_text(com))

# define genre
genres =   ['Action','Adult','Adventure','Animation','Biography','Comedy','Crime','Documentary','Drama','Family','Fantasy','Game-Show','History','Horror','Music','Musical','Mystery','News','Reality-TV','Romance','Sci-Fi','Short','Sport','Talk-Show','Thriller','War','Western']   

Split the data into train and test sets:

train, test = train_test_split(df, random_state=42, test_size = 0.33, shuffle=True)
x_train = train.plot_text
x_test = test.plot_text

Train the classifiers and measure accuracy using SVM:

SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
for genre in genres:
    print('... Processing {}'.format(genre))
    # train the model on the plot text and this genre's labels
    SVC_pipeline.fit(x_train, train[genre])
    # compute the testing accuracy
    prediction = SVC_pipeline.predict(x_test)
    print('Test accuracy is {}'.format(accuracy_score(test[genre], prediction)))

After doing this I get the accuracy scores, and I have decided to use the SVM classifier to label unseen data. How do I pass in the unseen data? It's a dataset with two columns: the movie title and the plot. Can someone please help?

1 Answer


Just convert your unseen dataset into a new dataframe with the same feature column names as your training dataframe. For example:

from sklearn.svm import LinearSVC
import pandas as pd
model=LinearSVC()
train=pd.DataFrame({'a':[1,2,3,4,5],'b':[21,22,23,24,25],'c':['c1','c0','c2','c1','c0']})

    a   b   c
0   1   21  c1
1   2   22  c0
2   3   23  c2
3   4   24  c1
4   5   25  c0

model.fit(train[['a','b']],train['c'])
unseen=pd.DataFrame({'a':[1,2,1,3,4],'b':[22,21,22,23,25]})

    a   b
0   1   22
1   2   21
2   1   22
3   3   23
4   4   25

model.predict(unseen)

The output is

array(['c1', 'c1', 'c1', 'c0', 'c0'], dtype=object)

Then use pd.get_dummies(model.predict(unseen)) to get the following (on pandas 2.0 and later, pass dtype=int to get 0/1 instead of True/False):

    c0  c1
0   0   1
1   0   1
2   0   1
3   1   0
4   1   0
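Note that pd.get_dummies only creates columns for labels that actually occur in the predictions, which is why c2 is missing above. A minimal sketch of one way to keep a column for every known class, assuming the class list is available (it is hard-coded here for illustration; with a fitted scikit-learn model it could come from model.classes_):

```python
import pandas as pd

# Hypothetical inputs: the known classes and a set of predictions
# in which 'c2' happens never to be predicted.
classes = ['c0', 'c1', 'c2']
predictions = pd.Series(['c1', 'c1', 'c1', 'c0', 'c0'])

# dtype=int forces 0/1 columns; reindex adds an all-zero column
# for every known class that did not appear in the predictions.
indicator = pd.get_dummies(predictions, dtype=int).reindex(columns=classes, fill_value=0)
print(indicator)
```

This keeps the matrix the same width no matter which labels show up in a particular batch of unseen rows.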

I'm not sure if this is what you want...
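Applied to the text pipeline in the question, one thing to watch is that the loop refits the same SVC_pipeline for every genre, so after the loop it only holds the model for the last genre. A rough sketch of one way around that, assuming a separate fitted pipeline is kept per genre: the toy rows, the two-genre list and the models dictionary below are made up for illustration, and the real unseen plots should first go through the same clean_text function used for training.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the training dataframe (hypothetical rows, two genres only).
train = pd.DataFrame({
    'plot_text': [
        'a cop chases armed robbers through the city',
        'soldiers fight a fierce battle with explosions',
        'two friends tell jokes and prank their neighbours',
        'a clumsy waiter causes funny chaos at a wedding',
        'a funny cop duo chases a thief and cracks jokes',
        'a quiet film about baking bread in a village',
    ],
    'Action': [1, 1, 0, 0, 1, 0],
    'Comedy': [0, 0, 1, 1, 1, 0],
})
genres = ['Action', 'Comedy']

# Fit one pipeline per genre and keep each fitted pipeline for later use.
models = {}
for genre in genres:
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english')),
        ('clf', LinearSVC()),
    ])
    models[genre] = pipeline.fit(train['plot_text'], train[genre])

# Unseen data with the two columns described in the question.
unseen = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'plot': ['robbers and a cop in a fierce battle', 'friends tell funny jokes'],
})

# One 0/1 column per genre, one row per unseen movie.
predicted = pd.DataFrame(
    {genre: models[genre].predict(unseen['plot']) for genre in genres},
    index=unseen['title'],
)
print(predicted)
```

Because each pipeline contains its own fitted TfidfVectorizer, the raw plot strings can be passed straight to predict; there is no need to vectorize the unseen text separately.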

antonioACR1
  • I brought in one dataset, then split it: 33% for test and 67% for training, using train, test = train_test_split(df, random_state=42, test_size = 0.33, shuffle=True). Then I call SVC_pipeline.predict(df_p). My question is: how do I turn the multi-label predictions for the unseen data into a matrix, as in the image above? – Chat Peters May 22 '18 at 01:42
  • It's not very clear what you're asking for. Based on the image, I think what you need is `pd.get_dummies()`. Check the edit in my answer. Also please make your question more concise and clear and simplify it with a reproducible example, otherwise it becomes difficult for us to help – antonioACR1 May 22 '18 at 15:39
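For the prediction matrix asked about in the comments, another option is to fit OneVsRestClassifier once on all genre columns together, in which case predict returns one 0/1 column per genre directly. This is only a sketch under assumptions: the toy rows and the names multi_label_pipeline and label_matrix are invented for illustration, and every genre column needs both positive and negative examples in the training data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the training dataframe (hypothetical rows, two genres only).
train = pd.DataFrame({
    'plot_text': [
        'a cop chases armed robbers through the city',
        'soldiers fight a fierce battle with explosions',
        'two friends tell jokes and prank their neighbours',
        'a clumsy waiter causes funny chaos at a wedding',
        'a funny cop duo chases a thief and cracks jokes',
        'a quiet film about baking bread in a village',
    ],
    'Action': [1, 1, 0, 0, 1, 0],
    'Comedy': [0, 0, 1, 1, 1, 0],
})
genres = ['Action', 'Comedy']

multi_label_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', OneVsRestClassifier(LinearSVC())),
])

# Passing the whole 0/1 genre matrix as y trains one binary classifier per column.
multi_label_pipeline.fit(train['plot_text'], train[genres].to_numpy())

# Hypothetical unseen plots; predict returns an array of shape (n_movies, n_genres).
unseen_plots = pd.Series(['robbers and a cop in a fierce battle', 'friends tell funny jokes'])
label_matrix = pd.DataFrame(multi_label_pipeline.predict(unseen_plots), columns=genres)
print(label_matrix)
```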