
I want to classify tweets into predefined categories (for example sports, health, and 10 more). If I had labeled data, I could do the classification by training a Naive Bayes or SVM classifier, as described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf

But I cannot figure out how to do this with unlabeled data. One possibility would be to use Expectation-Maximization to generate clusters and then label those clusters. But as I said, I have a predefined set of classes, so clustering may not work as well.

Can anyone suggest which techniques I should use? I appreciate any help.

dvlper
  • Have you implemented any of the suggested methods? Which showed up as good for you? – Bob Jul 18 '17 at 19:19

3 Answers

4

As far as I understand, there are multiple ways to approach this. There will be trade-offs, and the accuracy may vary, because of a well-known observation:

Each single tweet is distinct!

(unless you are extracting data from the Twitter streaming API based on tags and other keywords). Please describe the source of your data and how you are extracting it. I am assuming you are just collecting general tweets, which can be about anything.

What you can do is build a dictionary for each class you have (e.g. Music => pop, jazz, rap, instruments, ...) containing words relevant to that class. You can use NLTK for Python or Stanford NLP for other languages.

You can start by extracting:

  • Synonyms
  • Hyponyms
  • Hypernyms
  • Meronyms
  • Holonyms

Have a look at these NLP lexical semantics slides. They should clear up some of the concepts.

Once you have a dictionary for each class, compare the dictionaries against your tweets. Label each tweet with the class whose dictionary it matches best (you can rank the classes by the number of occurrences of their dictionary words in the tweet). This gives you labeled tweets, just like a labeled data set. The open question is accuracy, which depends on the data and on how varied your classes are. This may be overkill, but it may come close to what you want.
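A minimal sketch of the dictionary matching described above, using only the Python standard library. The `CLASS_WORDS` dictionaries, their contents, and the function names are hypothetical placeholders; in practice you would fill the dictionaries from a lexical resource plus hand-picked words (including proper nouns). Returning no label on a tie or on zero matches is my assumption, not part of the answer.

```python
import re
from collections import Counter

# Hypothetical seed dictionaries; the words are illustrative only.
CLASS_WORDS = {
    "music": {"pop", "jazz", "rap", "guitar", "album"},
    "technology": {"jquery", "iphone", "node", "apple", "google"},
    "coffee": {"coffee", "starbucks", "espresso", "latte"},
}

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_tweet(text, class_words=CLASS_WORDS):
    """For each class, count how many tokens of the tweet are in its dictionary."""
    counts = Counter(tokenize(text))
    return {label: sum(counts[w] for w in words)
            for label, words in class_words.items()}

def label_tweet(text, class_words=CLASS_WORDS):
    """Return the best-scoring class, or None if nothing matches or the top score is tied."""
    ranked = sorted(score_tweet(text, class_words).items(),
                    key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    if best_score == 0:
        return None  # no dictionary word occurs in the tweet
    if len(ranked) > 1 and ranked[1][1] == best_score:
        return None  # ambiguous: two classes match equally well
    return best_label
```

Tweets that come back as `None` are the ones the dictionaries cannot decide on; those are candidates for extending the dictionaries or for a similarity-based second pass.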

Furthermore, you can label a subset of tweets this way and use cosine similarity to identify the classes of the other tweets. This will help with the optimization part. But then again, it is up to you, since you know which trade-offs you can bear.
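One way to read the cosine-similarity step is sketched below: each tweet becomes a bag-of-words count vector, and an unlabeled tweet takes the label of its most similar labeled tweet. The function names and the `threshold` value are assumptions for illustration; a real setup would likely use TF-IDF weights and a tuned threshold.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Turn a tweet into a token-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def propagate_label(tweet, labeled, threshold=0.3):
    """Return the label of the most similar (text, label) pair in `labeled`,
    or None if no pair reaches the (assumed) similarity threshold."""
    vec = bag_of_words(tweet)
    best_label, best_sim = None, 0.0
    for text, label in labeled:
        sim = cosine_similarity(vec, bag_of_words(text))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else None
```

For example, with `seeds = [("great jazz concert tonight", "music")]`, calling `propagate_label("jazz concert was great", seeds)` compares the two count vectors and returns the seed's label only if they overlap enough.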

The real struggle will be the machine learning part and how you manage that.

Qaisar Rajput
  • I am sure this will work for dictionary words. But how can I make this technique work for the everyday terms we use? For example, technology might include jQuery, iPhone, Node, Apple, Google, etc., and coffee might include Starbucks, etc. – dvlper Apr 27 '16 at 11:10
  • These kinds of words are proper nouns, which none of the WordNets I have seen contain. But that is the beauty of dictionaries: you can add any words you think are relevant to a class, even proper nouns. What works in your favor here is that you have predefined categories. @dvlper – Qaisar Rajput Apr 28 '16 at 07:15
2

Actually, this looks like a typical use case for semi-supervised learning. There are plenty of applicable methods, including clustering with constraints (where you force the model to cluster samples from the same class together) and transductive learning (where you try to extrapolate a model from the labeled samples onto the distribution of the unlabeled ones).
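To make the semi-supervised idea concrete, here is a small sketch of self-training. Note that self-training is a different, simpler semi-supervised technique than the constrained clustering and transductive methods named above; I use it only because it fits in a few lines. A handful of labeled seed tweets label their nearest unlabeled neighbours, and the newly labeled tweets then act as seeds in the next round. All names, the cosine measure, and the `threshold` are assumptions.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words count vector for a tweet."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity of two count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def self_train(seeds, unlabeled, threshold=0.5, max_rounds=10):
    """seeds: list of (text, label); unlabeled: list of texts.
    Returns one label (or None) per unlabeled text."""
    labeled = [(vectorize(t), y) for t, y in seeds]
    pending = {i: vectorize(t) for i, t in enumerate(unlabeled)}
    result = [None] * len(unlabeled)
    if not labeled:
        return result
    for _ in range(max_rounds):
        confident = []
        for i, vec in pending.items():
            # Nearest already-labeled example and its similarity.
            sim, label = max(((cosine(vec, v), y) for v, y in labeled),
                             key=lambda pair: pair[0])
            if sim >= threshold:
                confident.append((i, vec, label))
        if not confident:
            break  # nothing new could be labeled confidently
        for i, vec, label in confident:
            result[i] = label
            labeled.append((vec, label))  # newly labeled tweets become seeds
            del pending[i]
    return result
```

The design trade-off is the usual one for self-training: a low threshold labels more tweets but lets early mistakes spread, while a high threshold leaves more tweets unlabeled.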

You could also simply cluster the data, as @Shoaib suggested, but then you will have to come up with a heuristic for dealing with clusters that have mixed labels. Furthermore, solving an optimization problem that is not related to the task (labeling) will obviously not work as well as actually using this knowledge.

lejlot
0

You can use clustering for this task. First, label some examples for each class. Then, using these labeled examples, you can easily identify the class of each cluster.
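A rough sketch of that last step, assuming the clustering itself has already been done by whatever algorithm you prefer and that you have hand-labeled a few tweets. Each cluster takes the majority label of the labeled tweets that fall into it. The `min_purity` cut-off for mixed clusters and all names are assumptions, not part of the answer.

```python
from collections import Counter, defaultdict

def label_clusters(cluster_ids, seed_labels, min_purity=0.6):
    """cluster_ids[i] is the cluster of tweet i.
    seed_labels maps the index of each hand-labeled tweet to its label.
    Returns {cluster id: majority label, or None if unlabeled or too mixed}."""
    votes = defaultdict(Counter)
    for idx, label in seed_labels.items():
        votes[cluster_ids[idx]][label] += 1

    cluster_label = {}
    for cluster in set(cluster_ids):
        counts = votes.get(cluster)
        if not counts:
            cluster_label[cluster] = None  # no labeled tweet landed here
            continue
        label, top = counts.most_common(1)[0]
        purity = top / sum(counts.values())
        # Assumed heuristic: refuse to name a cluster whose seeds disagree too much.
        cluster_label[cluster] = label if purity >= min_purity else None
    return cluster_label

def label_tweets(cluster_ids, seed_labels, min_purity=0.6):
    """Give every tweet the label of its cluster."""
    mapping = label_clusters(cluster_ids, seed_labels, min_purity)
    return [mapping[c] for c in cluster_ids]
```

Clusters that come back as `None` are the mixed or unseeded ones; they need more labeled examples or a finer clustering before they can be named.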

AlBlue
  • When you answer a question, no need to add 'hope it helps' or other commentary on the post - you can just give the answer. Thanks for answering the question! – AlBlue Apr 26 '16 at 21:45
  • @Shoaib, I have completely unlabeled data. If I label some examples, won't that introduce bias? – dvlper Apr 26 '16 at 22:35