Honestly, it's a huge problem you are tackling.
A very basic method to start (it is going to yield poor results, but it's better than nothing): manually classify 1,000 tweets. It will help you get a feel for what you are going to classify.
Then, make a database of the 1,000 most popular words in your 2 million tweets. Manually edit this database (remove words that are useless for your problem, such as "the" or "is"). Try to make a database of "good" words (like, love, amazing), a database of "bad" words (bad, sucks, ...), and a database of "suggestion" words (suggest, errr I don't have anything else). The goal is to reduce your database to the words most useful for your problem (say, only 100 words in the end).
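A minimal sketch of the word-counting step, using only the standard library (the sample tweets and stopword list are made up; on the real corpus you would stream the 2 million tweets from disk and prune the result by hand):

```python
from collections import Counter

# Hypothetical sample tweets; in practice, stream your 2M tweets from a file.
tweets = [
    "I love this app, it is amazing",
    "this app sucks, it is bad",
    "I suggest adding a dark mode",
]

# Tiny hand-made stopword list; extend it as you inspect the counts.
stopwords = {"the", "is", "a", "it", "this", "i"}

counts = Counter()
for tweet in tweets:
    for word in tweet.lower().split():
        word = word.strip(".,!?")  # crude punctuation stripping
        if word and word not in stopwords:
            counts[word] += 1

# Keep the most frequent words as the candidate vocabulary
# (use 1000 on the real corpus, then prune by hand to ~100).
vocab = [word for word, _ in counts.most_common(1000)]
print(vocab[0])  # → app  (it appears in two tweets)
```

The hand-pruning afterwards matters more than the counting: frequent words are not necessarily discriminative ones.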
Each tweet then becomes a vector of size 100. Apply whatever technique you want to those vectors (naive Bayes, SVM, etc.).
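The vectorization itself is simple; here is a sketch with a hypothetical 6-word vocabulary standing in for your curated ~100 words (one binary feature per word):

```python
# Hypothetical hand-pruned vocabulary; yours would have ~100 entries.
vocab = ["love", "amazing", "like", "bad", "sucks", "suggest"]
word_index = {w: i for i, w in enumerate(vocab)}

def tweet_to_vector(tweet):
    """Binary bag-of-words: vec[i] is 1 if vocab word i appears in the tweet."""
    vec = [0] * len(vocab)
    for word in tweet.lower().split():
        word = word.strip(".,!?")
        if word in word_index:
            vec[word_index[word]] = 1
    return vec

print(tweet_to_vector("I love it, it is amazing!"))  # → [1, 1, 0, 0, 0, 0]
```

These fixed-length vectors are exactly what naive Bayes or an SVM expects as input; word counts instead of 0/1 flags would work too.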
This whole process is the outline of what I did for a course a while ago, for spam classification. It worked super well (98% recognition rate?). Then, our real project was to classify hate mail on forums (messages such as "go die"). I think we got an 80% recognition rate, which was pretty poor. But better than nothing.
Because your 2 million tweets are not classified, you will be hard-pressed to check your results with this method. You will only be able to do cross-validation on your 1,000 labeled samples. Just a warning.
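The cross-validation itself can be done with the standard library alone; a minimal k-fold splitter over the 1,000 labeled indices might look like this (the fold count of 10 is just a common choice, not something from the original setup):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split n sample indices into k shuffled, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(1000, 10)
# Train on 9 folds, test on the held-out one, rotate, and average the
# accuracy: that average is the only honest estimate you get without
# labeling more tweets.
print(len(folds), len(folds[0]))  # → 10 100
```

Remember that this estimates accuracy on tweets *like your 1,000 labeled ones*; if they are not a representative sample of the 2 million, the estimate will be optimistic.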