7

I have some text data for which I need to do sentiment classification. I don't have positive or negative labels on this data (it is unlabelled). I want to use the Gensim word2vec model for sentiment classification.
Is it possible to do this? So far I couldn't find anything that does it. Every blog and article uses some kind of labelled dataset (such as the IMDB dataset) to train and test the word2vec model. No one goes further and predicts on their own unlabelled data.

Can someone tell me the possibility of this (at least theoretically)?

Thanks in Advance!

Piyush Ghasiya
  • Why don't you have any labeled data? Can you create some? – gojomo Apr 13 '20 at 18:35
  • Is there any rulebook or guide to label data manually? My data consists of news articles and I am finding it is very difficult to label them. – Piyush Ghasiya Apr 14 '20 at 01:32
  • Almost certainly, your news articles will have some sort of canonical identifier - perhaps just their ordinal position in your original dataset. So the most basic strategy is: look at article #0, then mark in some data structure that text "0" has sentiment "WHATEVER" - and repeat, for a random set of the texts. (If your texts were in a plain-text file, with one text per line, you might even put the annotations, or lack of annotation, as a token at the start of each line.) – gojomo Apr 14 '20 at 04:26
  • It's certainly possible to build custom UIs/etc for this, but for a simple classification project, just hand-editing simple files is often sufficient – either the original source/database/file, or some adjunct file with correlated IDs. – gojomo Apr 14 '20 at 04:27
  • You mention that text "0" has sentiment "WHATEVER" but how to decide this "WHATEVER"? Unlike tweets, News articles are complex text where it is difficult to understand the sentiments. That is why my question is is there any annotation guide that we need to follow to annotate each news article? – Piyush Ghasiya Apr 14 '20 at 05:56
  • You read the text & apply your human judgement, based on the goals of your project. If you were recruiting others to help, you might write up a guide summarizing what you intend 'positive' to mean, in context of your goals. Even the idea of 'sentiment' is slightly different if you're talking about products, or friends, or political candidates, or news-about-a-company, or the general mood of a population, or other possible subjects-of-that-sentiment. So the correct answer is, "it depends on the domain/project-goals" – & that's why labeled data consistent with your specific needs is important. – gojomo Apr 14 '20 at 17:48

4 Answers

3

Yes. As with any machine-learning problem, there are two main approaches to sentiment analysis: supervised and unsupervised. For supervised sentiment analysis you definitely need a labelled dataset; with one, you can use a simple logistic regression or a deep-learning model such as an LSTM. For unsupervised sentiment analysis you don't need any labelled data; instead you can use a clustering algorithm, with K-Means being a popular choice. The following Medium article contains a worked example of this approach:

https://towardsdatascience.com/unsupervised-sentiment-analysis-a38bf1906483

To add to your question: word embeddings such as word2vec or fastText have nothing to do with whether the sentiment analysis is supervised or unsupervised. They are simply powerful ways to represent the features of your dataset. BTW, fastText has been more accurate than word2vec in my experience.

Lahiru
  • Thank you for the reply and for suggesting other methods. I am aware of supervised and unsupervised methods, but my question is specific to the word2vec model: if I use the IMDB dataset to train and test my word2vec model and then use the same model to predict on my own unlabelled data, is this possible? – Piyush Ghasiya Apr 14 '20 at 01:28
3

If it is simple text (and you are not tied to word2vec), it can be classified with the VADER model without any labels. You just need to pass the text to the analyzer.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon

si = SentimentIntensityAnalyzer()  # instantiate the analyzer before use
a = 'This is a good movie'
si.polarity_scores(a)

which returns the result below:

{'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compound': 0.4404}
Santosh K
0

Essentially, no - you can't perform sentiment analysis without some labeled data.

Without labels of some sort, you have no way of evaluating whether you're getting anything right. So you could just use this sentiment-analysis function:

import random

def get_sentiment(text):
    return random.choice(['positive', 'negative'])

Woohoo! You've got a 'sentiment' for every text!

What's that? You object that for some text, it's giving the "wrong" answer?

Well, how do you know what's wrong? Do you have a desired correct answer – a label – for that text?

OK, now you have some hope, but you also have at least one label. And if you have one, you can get more – even if it's just hand-annotating some texts that are representative of what you want your code to classify.

Another answer shares an article which purports to do unsupervised sentiment analysis. That article's meandering grab-bag of techniques sneaks in supervision via the coder's labeling of his two word-clusters as positive and negative. And he's only able to claim success based on target labels for some of the data. And the data appears to be about 635,000 'positive' texts and just 9,800 'negative' texts – where you could get roughly 98.5% accuracy just by answering 'positive' for every text. So its techniques may not be very generalizable.

But the article does do one thing that could be re-used elsewhere, in a very crude approach, if you've really just got word-vectors and nothing else: labeling every word as positive or negative. It does this by forcing all words into 2 clusters, then hand-reviewing the clusters to choose one as positive and one as negative. (This might only work well with certain kinds of review texts with strong underlying positive/negative patterns.) Then, it gives every other word a score based on closeness to those cluster centroids.

You could repeat that for another language. Or, just create a hand-curated list of a few dozen known 'positive' or 'negative' words, then assign every other word a positive or negative value based on relative closeness to your 'anchor' words. You're no longer strictly 'unsupervised' at this point, as you've injected your own labeling of individual words.

I'd guess this could work even better than the just-2-centroids approach of the article. (All 'positive' or 'negative' words, in a real semantic space, could be spread across wildly-shaped coordinate-regions that aren't reducible to a single centroid summary point.)

But again, the only way to check if this is working would be to compare against a lot of labeled data, with preferred "correct" answers, to see if tallying a net-positive/net-negative score for texts, based on their individual words, performs satisfactorily. And once you have that labeled data for scoring, you could use a far more diverse & powerful set of text-classification methods than a simple tallying-of-word-valences.

gojomo
0

As Lahiru mentions, any data we download doesn't come with labels, so we need to label it ourselves: either manually, one item at a time, with a person verifying the labels, or by using another library such as SentiWordNet or TextBlob to label it automatically.