I am trying to build an automatic sentiment detector that is supposed to assign a sentiment (positive, negative, neutral, etc.) to a piece of target context text. Somehow I am getting a training accuracy of 100% while the dev accuracy is terrible. I have been trying to figure this out for so long that I feel like I am missing something fairly obvious. What am I doing wrong?
I am really sorry if this is badly formatted. I joined this site a few days ago and I am desperate and in a hurry with my Uni deadline amidst all this chaos, so please forgive me.
import pandas as pd

# The TSV has no header row, so the columns are just numbered 0-5.
df_train = pd.read_csv("combined-sentiment-judgments.tsv", sep='\t', header=None)
df_train.head()
df_train.shape
Result is this: (980, 6)
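For reference, the judgments are in column 1 (the same column the loop below reads), so a quick way to peek at how many distinct label strings there are is something like:

# Distribution of the raw label strings in column 1.
print(df_train[1].value_counts().head(10))
print("distinct label strings:", df_train[1].nunique())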
import sys
import csv
import random

csv.field_size_limit(sys.maxsize)

with open("combined-sentiment-judgments.tsv") as data:
    trainData = [row for row in csv.reader(data, delimiter='\t')]

random.shuffle(trainData)

# Columns 3-5 hold the text pieces; concatenate them into one context string per row.
context = ["".join(row[3:6]) for row in trainData]

# Column 1 holds the sentiment judgment(s) for that row.
labels = [row[1] for row in trainData]

for label, text in list(zip(labels, context))[:20]:
    print(label, text[:50] + "...")
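(I realise the same context and labels lists could probably be built straight from df_train instead of re-reading the file; something roughly like the sketch below, minus the shuffling, though I have not verified it against my actual TSV.)

# Rough pandas equivalent of the loop above (no shuffling here).
context = (df_train[3].astype(str) + df_train[4].astype(str) + df_train[5].astype(str)).tolist()
labels = df_train[1].astype(str).tolist()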
Here are a few of the lines the print loop above produces; the sentiment labels are in English but the sentences are in Finnish:
neutral Kyseinen auto taitaa olla Mursu . Niissä on nykyää...
neutral,positive,neutral,unclear Tällä hetkellä Liptonin vihreä sitrushedelmätee . ...
negative,neutral,negative,negative mutta eikös Windows 8 ole ihan paska ? en nyt muis...
positive,mixed,neutral,mixed Tarkoitus olisi ostaa b230ft koneellinen volvo , k...
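As shown above, some rows have a single label and others have a comma-joined string of several judgments. One way to count how many distinct label strings end up in labels:

from collections import Counter

# Frequency of each raw label string.
label_counts = Counter(labels)
print("distinct label strings:", len(label_counts))
print(label_counts.most_common(10))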
from sklearn.feature_extraction.text import CountVectorizer

# Binary bag-of-words features over the whole data, unigrams only.
vectorizer = CountVectorizer(max_features=100000, binary=True, ngram_range=(1, 1))
feature_matrix = vectorizer.fit_transform(context)
print("shape=", feature_matrix.shape)
Results: shape= (980, 10861)
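A few of the learned vocabulary entries can be checked like this (get_feature_names_out is the newer scikit-learn name; older versions call it get_feature_names):

# Peek at some of the vocabulary the vectorizer learned.
print(vectorizer.get_feature_names_out()[:10])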
from sklearn.model_selection import train_test_split

# 80/20 split of the raw texts and labels.
train_texts, dev_texts, train_labels, dev_labels = train_test_split(context, labels, test_size=0.2)

# Refit the vectorizer on the training texts only, with unigrams and bigrams this time.
vectorizer = CountVectorizer(max_features=100000, binary=True, ngram_range=(1, 2))
feature_matrix_train = vectorizer.fit_transform(train_texts)
feature_matrix_dev = vectorizer.transform(dev_texts)
print(feature_matrix_train.shape)
print(feature_matrix_dev.shape)
Results again:
(784, 27827)
(196, 27827)
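To check whether the split leaves the rarer label strings represented on both sides, the label distributions of the two parts could be compared, e.g.:

from collections import Counter

# Most common label strings in the train and dev splits.
print("train:", Counter(train_labels).most_common(5))
print("dev:  ", Counter(dev_labels).most_common(5))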
import sklearn.svm

# Linear SVM; a small C means strong regularization.
classifier = sklearn.svm.LinearSVC(C=0.009, verbose=1)
classifier.fit(feature_matrix_train, train_labels)
print("DEV", classifier.score(feature_matrix_dev, dev_labels))
print("TRAIN", classifier.score(feature_matrix_train, train_labels))
And this is what I get:
DEV 0.08673469387755102
TRAIN 1.0
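If it helps, this is how I would get a per-label breakdown of the dev predictions (the zero_division argument needs a reasonably recent scikit-learn):

from sklearn.metrics import classification_report

# Per-label precision/recall on the dev set, to see which label strings ever get predicted.
dev_predictions = classifier.predict(feature_matrix_dev)
print(classification_report(dev_labels, dev_predictions, zero_division=0))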