I am using scikit-learn to tackle the exercise of predicting movie review ratings. I have read about Cohen's kappa (frankly, I do not fully understand it) and its usefulness as a metric comparing observed and expected accuracy. I have proceeded as usual, applying a machine learning algorithm to my corpus with a bag-of-words model. I have read that Cohen's kappa is a good way to measure the performance of a classifier.
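From what I understand so far, kappa is computed as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between the two raters (plain accuracy) and p_e is the agreement expected by chance given each rater's label frequencies, but I am not sure how that maps to my setup.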
How do I adapt this concept to my prediction problem using sklearn? The sklearn documentation is not really explicit about how to proceed with a document-term matrix (if that is even the right way to do it):
sklearn.metrics.cohen_kappa_score(y1, y2, labels=None, weights=None)
This is the example found on the sklearn website:
from sklearn.metrics import cohen_kappa_score
y_true = [2, 0, 2, 2, 0, 1]  # labels from the first "rater"
y_pred = [0, 0, 2, 2, 0, 2]  # labels from the second "rater"
cohen_kappa_score(y_true, y_pred)  # returns 0.4285...
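Since my ratings are ordinal, I also noticed the weights parameter in the signature above; as far as I understand, 'linear' or 'quadratic' weighting makes large disagreements count for more than near-misses. Here is a small sketch I put together with made-up ratings (the data is hypothetical, just for illustration):

from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings on a 1-5 scale, invented for illustration
y_true = [5, 3, 4, 1, 2, 5]
y_pred = [4, 3, 5, 1, 1, 5]

# Unweighted kappa treats a 1-point and a 4-point disagreement the same;
# 'quadratic' weights penalize larger disagreements more heavily.
print(cohen_kappa_score(y_true, y_pred))                       # unweighted
print(cohen_kappa_score(y_true, y_pred, weights='quadratic'))  # weighted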
Is the kappa score calculation even applicable here, among the people who annotated the reviews in my corpus? How would I write it? Since all the movie reviews come from different annotators, are there still two annotators to consider when evaluating Cohen's kappa? What should I do? Here is the example I am trying:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
xlsx1 = pd.ExcelFile('App-Music/reviews.xlsx')
'''
reviews are stored in two columns, one for the review text and one for the rating
'''
X = pd.read_excel(xlsx1,'Sheet1').Review
Y = pd.read_excel(xlsx1,'Sheet1').Rating
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y)
new_vect= TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
X_train_dtm = new_vect.fit_transform(X_train.values.astype('U'))
X_test_dtm = new_vect.transform(X_test.values.astype('U'))  # transform only: reuse the training vocabulary
new_model = MultinomialNB()
new_model.fit(X_train_dtm, Y_train)
new_model.score(X_test_dtm, Y_test)  # mean accuracy on the test set
'''
this is the part where I want to calculate the Cohen's kappa score for comparison
'''
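My best guess is something like the following, treating the human ratings in the test set and my model's predictions as the two "raters" being compared (I am not sure this interpretation is correct):

from sklearn.metrics import cohen_kappa_score

# My attempt: compare the human test-set ratings with the model's predictions
Y_pred = new_model.predict(X_test_dtm)
print(cohen_kappa_score(Y_test, Y_pred))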
I might be completely wrong about this idea, but I read it on this page concerning sentiment analysis:
Ultimately, a tool’s accuracy is merely the percentage of times that human judgment agrees with the tool’s judgment. This degree of agreement among humans is also known as human concordance. There have been various studies run by various people and companies, and they concluded that the rate of human concordance is between 70% and 79%.