
I am trying to create an automatic sentiment detector which is supposed to assign a sentiment (positive, negative, neutral, etc.) to a text based on its target context. But somehow I am getting a training accuracy of 100%. I have been trying to figure this out for too long, and I feel like I am missing something fairly obvious now. What am I doing wrong?

I am really sorry if this is badly formatted. I joined this site a few days ago and I am desperate and in a hurry with my Uni deadline amidst all this chaos, so please forgive me.

import pandas as pd
import os

df_train = pd.read_csv("combined-sentiment-judgments.tsv", sep='\t', header=None)
df_train.head()

df_train.shape

Result is this: (980, 6)
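To show the layout I am working with, here is a minimal sketch with made-up rows in the same six-column shape (column 1 carries the judgment string and columns 3-5 the text pieces, as I use further down):

```python
import pandas as pd

# Made-up rows mimicking my TSV layout: column 1 = sentiment
# judgment(s), columns 3-5 = the text pieces I join later.
rows = [
    ["id1", "neutral", "x", "Kyseinen auto", " taitaa", " olla"],
    ["id2", "neutral,positive,neutral,unclear", "x", "Tällä", " hetkellä", " ..."],
    ["id3", "negative,neutral,negative,negative", "x", "mutta", " eikös", " ..."],
]
df = pd.DataFrame(rows)

print(df.shape)              # (3, 6)
print(df[1].value_counts())  # each distinct judgment string is its own value
```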

import sys
import csv
import random

csv.field_size_limit(sys.maxsize)

with open("combined-sentiment-judgments.tsv") as data:
  trainData = [row for row in csv.reader(data, delimiter='\t')]
random.shuffle(trainData)

context = []
for row in trainData:
  # columns 3-5 hold the text pieces; join them into one string
  context.append("".join(row[3:6]))

# column 1 holds the sentiment judgment(s)
labels = [row[1] for row in trainData]


for label, text in list(zip(labels, context))[:20]:
  print(label, text[:50] + "...")

These are a few lines that I printed with that; the sentiments are in English but the sentences are in Finnish:

neutral Kyseinen auto taitaa olla Mursu . Niissä on nykyää...

neutral,positive,neutral,unclear Tällä hetkellä Liptonin vihreä sitrushedelmätee . ...

negative,neutral,negative,negative mutta eikös Windows 8 ole ihan paska ? en nyt muis...

positive,mixed,neutral,mixed Tarkoitus olisi ostaa b230ft koneellinen volvo , k...
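I noticed that many rows carry several comma-separated judgments like the ones above. As far as I can tell, each distinct combination string ends up as its own class for the classifier. A small sketch using the label strings printed above:

```python
from collections import Counter

# The label strings printed above: a comma-joined set of annotator
# judgments is one single class string from the classifier's view.
labels = [
    "neutral",
    "neutral,positive,neutral,unclear",
    "negative,neutral,negative,negative",
    "positive,mixed,neutral,mixed",
]

counts = Counter(labels)
print(len(counts), "distinct classes")  # 4 distinct classes
print(counts.most_common(1))
```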


import sklearn
from sklearn.feature_extraction.text import CountVectorizer

vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,1))
feature_matrix=vectorizer.fit_transform(context)
print("shape=",feature_matrix.shape)

Results: shape= (980, 10861)
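For context, this is what fit_transform builds on a tiny made-up corpus (the Finnish words here are invented examples, not from my data): one row per document, one binary column per distinct token.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented three-document corpus just to illustrate the shape.
docs = ["hyvä auto", "huono auto", "hyvä tee"]
vec = CountVectorizer(binary=True, ngram_range=(1, 1))
X = vec.fit_transform(docs)

print(X.shape)                  # (3, 4): 3 docs, 4 distinct tokens
print(sorted(vec.vocabulary_))  # ['auto', 'huono', 'hyvä', 'tee']
```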

from sklearn.model_selection import train_test_split

train_texts, dev_texts, train_labels, dev_labels=train_test_split(context,labels,test_size=0.2)
vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,2))
feature_matrix_train=vectorizer.fit_transform(train_texts)
feature_matrix_dev=vectorizer.transform(dev_texts)

print(feature_matrix_train.shape)
print(feature_matrix_dev.shape)

Results again:

(784, 27827)

(196, 27827)

import sklearn.svm
classifier=sklearn.svm.LinearSVC(C=0.009,verbose=1)
classifier.fit(feature_matrix_train, train_labels)

print("DEV",classifier.score(feature_matrix_dev, dev_labels))
print("TRAIN",classifier.score(feature_matrix_train, train_labels))

And this is what I get:

DEV 0.08673469387755102

TRAIN 1.0
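To illustrate the pattern I am seeing, here is a minimal sketch on entirely synthetic data: each training row has a feature no other row shares (like a rare n-gram that occurs in only one sentence), so the model can memorize the training set perfectly, while dev rows share no features with it and score near chance.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Each training row gets its own unique indicator feature, so a
# linear model can memorize every row individually.
n = 30
X_train = np.eye(n)
y_train = np.arange(n) % 10  # 10 made-up classes, 3 rows each

clf = LinearSVC(C=10.0).fit(X_train, y_train)
print("TRAIN", clf.score(X_train, y_train))  # memorized: (near) 1.0

# Dev rows share no features with training rows -> nothing to go on.
X_dev = np.zeros((5, n))
y_dev = np.arange(5) % 10
print("DEV", clf.score(X_dev, y_dev))  # near or at chance level
```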

  • Very high training accuracy could suggest that your labels are in your features. – jkr Apr 19 '20 at 15:11
  • Also, OP is a beginner, so he/she could simply be overtraining on the data. This would mean that the model starts 'memorizing' the data, typically because of setting a high learning rate. It won't matter unless you show us what the testing accuracy is (assuming you have split your dataset); then we can certainly give fixes.... – neel g Apr 19 '20 at 15:25
  • `@jakub` What should I fix if my labels are mixed in there? – somerandomdude Apr 19 '20 at 15:42
  • `@neel g` You are very correct, I am a beginner, so this could very well be the case. Also, what I have posted here is literally all the code I have written for this. So does that mean I have not split the dataset? I somehow figured that my training accuracy is the above-mentioned 1.0. – somerandomdude Apr 19 '20 at 15:43

0 Answers