
I have downloaded a .txt file containing thousands of words, each assigned a label indicating a positive or negative sentiment value. The lower the value, the more negative the sentiment it represents. It looks like this:

bad,-1
sucks,-2
too good,2
amazing,3
terrible,-2
...

I have named the first column word and the second column label. I am training on it using:

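from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# test_df holds the 'word' and 'label' columns; stop_words is assumed to be defined earlier
vectorizer = TfidfVectorizer(use_idf=True, lowercase=False, strip_accents='ascii', stop_words=stop_words)
y = test_df['label']
X = vectorizer.fit_transform(test_df['word'])
X_train, X_test, y_train, y_test = train_test_split(X, y)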

Now, the problem is that each word appears only once, so it makes no sense to predict the label of a word in the test split: a word in the test split has no relation to the words in the training split. As expected, I am getting quite low accuracy. So how are you supposed to use a predefined dictionary of words for sentiment analysis?
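
To make the issue concrete, here is a minimal sketch in plain Python (no scikit-learn) using only the five sample rows shown above. The 3/2 split and the fixed seed are arbitrary choices for illustration, not part of my actual setup. It checks how many tokens the training and test entries have in common, which is what a bag-of-words or TF-IDF feature space would have to rely on.

```python
import random

# The five sample rows from the lexicon file shown above.
lexicon = [("bad", -1), ("sucks", -2), ("too good", 2),
           ("amazing", 3), ("terrible", -2)]

# Arbitrary split for illustration: 3 entries for training, 2 for testing.
random.seed(0)
shuffled = lexicon[:]
random.shuffle(shuffled)
train, test = shuffled[:3], shuffled[3:]

# Collect the tokens that would become features on each side of the split.
train_vocab = {tok for word, _ in train for tok in word.split()}
test_vocab = {tok for word, _ in test for tok in word.split()}

# Because every lexicon entry is unique, the two vocabularies never overlap,
# so a test entry has no non-zero feature the model saw during training.
shared = train_vocab & test_vocab
print("shared tokens:", shared)
```

Whatever the split, the overlap is empty for these rows, so a classifier has nothing to generalize from.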

Devansh Singh
  • You would need an algorithm that measures how one word is related to another. This is a good read: https://stackoverflow.com/questions/21979970/how-to-use-word2vec-to-calculate-the-similarity-distance-by-giving-2-words – jose_bacoy Apr 21 '18 at 14:25
  • Can you please elaborate? How would word2vec help in relating a word in the training set to a word in the test set? – Devansh Singh Apr 21 '18 at 15:32
  • Word distance gives you a numeric value for how closely a word is related to a given word. The number is close to 1 if the words are closely related and close to zero if not. You can then find the word in your dictionary that is most closely related to each word in your untrained data. – jose_bacoy Apr 21 '18 at 18:36
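
The lookup described in the last comment could be sketched roughly as follows. This is only an illustration under loud assumptions: the three-dimensional vectors are made up by hand and stand in for real word2vec embeddings, the unseen word "awful" is hypothetical, and `cosine` and `predict_label` are helper names invented for this sketch rather than functions from any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Labels taken from the sample rows in the question.
labels = {"bad": -1, "sucks": -2, "amazing": 3, "terrible": -2}

# Hand-made toy vectors (assumption): real embeddings would come from a
# trained word2vec model and have far more dimensions.
toy_vectors = {
    "bad": [0.9, 0.1, 0.0],
    "sucks": [0.8, 0.0, 0.3],
    "amazing": [0.0, 0.9, 0.2],
    "terrible": [1.0, 0.2, 0.1],
}

def predict_label(vec):
    """Return the most similar lexicon word and its label."""
    best = max(labels, key=lambda w: cosine(vec, toy_vectors[w]))
    return best, labels[best]

# Hypothetical vector for an unseen word such as "awful".
unseen = [0.95, 0.15, 0.05]
nearest_word, predicted = predict_label(unseen)
print(nearest_word, predicted)
```

With real embeddings the similarity can also be negative, so "close to zero" is better read as "unrelated" than as the lowest possible score.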

0 Answers