Choosing Features to Identify Twitter Questions as "Useful"

Question

I collect questions from Twitter's stream using a regular expression that picks out any tweet containing text that starts with a question word (who, what, when, where, etc.) and ends with a question mark.

As a result, I end up with several non-useful questions in my database, such as 'who cares?' and 'what's this?', alongside useful ones such as 'How often is there a basketball fight?' and 'How much does a polar bear weigh?'.

However, I am only interested in useful questions.

I have about 3000 questions that I have labeled manually: ~2000 of them are not useful and ~1000 are useful. I am trying to use the naive Bayes classifier that comes with NLTK to classify questions automatically, so that I don't have to pick out the useful questions by hand.

As a start, I tried using the first three words of a question as a feature, but this doesn't help very much. Out of 100 questions, the classifier got only around 10%-15% of the useful questions right. It also missed useful questions among the ones it predicted as not useful.

I have tried other features, such as including all the words and including the length of the question, but the results did not change significantly.

Any suggestions on how I should choose the features or carry on?

Thanks.

Can you give examples of questions you tagged useful or not useful? — Suzana, Jan 14 '13 at 16:57
This is more a Machine Learning question than programming. You can try asking it in CrossValidated to get a few suggestions for feature selection — Ram Narasimhan, Jan 14 '13 at 18:14
@Suzana_K: not useful: 'who cares?', 'what's this?' and useful: 'How often is there a basketball fight?', 'How much does a polar bear weigh?' — bili, Jan 14 '13 at 21:23

score 13 · Accepted Answer · edited Apr 13 '17 at 12:44

Some random suggestions.

Add a pre-processing step that removes stop words such as `this`, `a`, `of`, `and`, etc. For example, start with:

  How often is there a basketball fight

After removing the stop words, you get:

  how often basketball fight
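
A minimal sketch of this pre-processing step. The stop-word set here is a tiny hand-picked assumption for illustration; in practice you would probably use a fuller list. Be aware that fuller lists often include question words like `how`, which you may want to keep for this task.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "there", "this"}

def remove_stop_words(tweet):
    """Lowercase the tweet, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", tweet.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("How often is there a basketball fight?"))
# ['how', 'often', 'basketball', 'fight']
```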

Calculate a tf-idf score for each word. Treat each tweet as a document; you need the whole corpus in order to get the document frequencies.

For a sentence like the one above, you calculate the tf-idf score of each word:

  tf-idf(how)
  tf-idf(often)
  tf-idf(basketball)
  tf-idf(fight)

This might be useful.
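
As a rough sketch of how these scores could be computed: there are several tf-idf variants, and the one below (term count divided by tweet length, times an unsmoothed `log(N / df)`) is my assumption rather than the only option. The function name `tf_idf_scores` and the three-tweet corpus are made up for the example.

```python
import math
from collections import Counter

def tf_idf_scores(docs):
    """Return one {word: tf-idf} dict per tokenized document.

    tf  = count of the word in the document / document length
    idf = log(number of documents / number of documents containing the word)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document

    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Each tweet is one document, already tokenized and stop-word filtered.
corpus = [
    ["how", "often", "basketball", "fight"],
    ["how", "much", "polar", "bear", "weigh"],
    ["who", "cares"],
]
scores = tf_idf_scores(corpus)
print(scores[0])
```

In this toy corpus `how` appears in two of the three tweets, so it scores lower than `basketball`, which appears in only one.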

Try the following additional features for your classifier:

average tf-idf score
median tf-idf score
max tf-idf score
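
These three summaries could be turned into a feature dict along the following lines. The feature names and the example scores are hypothetical. If you stay with NLTK's naive Bayes classifier, which treats feature values as discrete labels, you would probably want to bin these numbers (e.g. low/medium/high) before training.

```python
from statistics import mean, median

def tfidf_summary_features(word_scores):
    """Collapse per-word tf-idf scores into three fixed-size features."""
    values = list(word_scores.values())
    if not values:  # e.g. every token was removed as a stop word
        return {"tfidf_mean": 0.0, "tfidf_median": 0.0, "tfidf_max": 0.0}
    return {
        "tfidf_mean": mean(values),
        "tfidf_median": median(values),
        "tfidf_max": max(values),
    }

# Made-up scores for illustration, not computed from a real corpus.
example_scores = {"how": 0.1, "often": 0.3, "basketball": 0.5, "fight": 0.3}
features = tfidf_summary_features(example_scores)
print(features)
```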

Furthermore, try a POS tagger to generate a tagged sentence for each tweet.

>>> import nltk
>>> text = nltk.word_tokenize(" How often is there a basketball fight")
>>> nltk.pos_tag(text)
[('How', 'WRB'), ('often', 'RB'), ('is', 'VBZ'), ('there', 'EX'), ('a', 'DT'), ('basketball', 'NN'), ('fight', 'NN')]

This gives you additional features to try that are based on the POS tags.
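
For instance, here is a sketch of a few features you might derive from the tagger output. Which tag-based features actually help is something you would have to test; the feature names below are my own invention. The function takes the `(word, tag)` pairs that `nltk.pos_tag` returns, so it does not call NLTK itself.

```python
from collections import Counter

def pos_features(tagged_tokens):
    """Build features from (word, tag) pairs as returned by nltk.pos_tag."""
    tags = [tag for _, tag in tagged_tokens]
    counts = Counter(tags)
    return {
        "tag_sequence": " ".join(tags),
        "first_tag": tags[0] if tags else "",
        # Penn Treebank noun tags start with NN, verb tags with VB.
        "noun_count": sum(n for tag, n in counts.items() if tag.startswith("NN")),
        "verb_count": sum(n for tag, n in counts.items() if tag.startswith("VB")),
        "has_proper_noun": any(tag.startswith("NNP") for tag in tags),
    }

tagged = [('How', 'WRB'), ('often', 'RB'), ('is', 'VBZ'), ('there', 'EX'),
          ('a', 'DT'), ('basketball', 'NN'), ('fight', 'NN')]
features = pos_features(tagged)
print(features)
```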

Some other features might be useful; see the qtweet paper (a paper on identifying questions in tweets) for details:

whether the tweet contains any URL
whether the tweet contains any email address or phone number
whether a marker of strong feeling, such as `!`, follows the question
whether particular unigram words are present in the context of the tweet
whether the tweet mentions another user's name
whether the tweet is a retweet
whether the tweet contains any hashtag `#`
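
Most of these can be approximated with simple pattern checks on the tweet text. The sketch below is my own rough take, not the paper's implementation: the regular expressions are simplistic assumptions, the retweet check relies on the textual "RT @user" convention, and phone numbers are left out because their formats vary too much for a one-line pattern.

```python
import re

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")  # "@user" not preceded by a word character
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
RETWEET_RE = re.compile(r"^\s*RT\b")     # assumes the textual "RT @user" convention
EXCLAIM_RE = re.compile(r"\?\s*!")       # "!" directly following a "?"

def surface_features(tweet):
    """Return boolean surface features for one tweet's text."""
    return {
        "has_url": bool(URL_RE.search(tweet)),
        "has_email": bool(EMAIL_RE.search(tweet)),
        "has_mention": bool(MENTION_RE.search(tweet)),
        "has_hashtag": bool(HASHTAG_RE.search(tweet)),
        "is_retweet": bool(RETWEET_RE.search(tweet)),
        "exclaim_after_question": bool(EXCLAIM_RE.search(tweet)),
    }

example = "RT @bob: How much does a polar bear weigh?! http://example.com #bears"
print(surface_features(example))
```

Boolean features like these fit directly into the feature dicts that NLTK classifiers expect.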

FYI, the author of qtweet attempted 4 different classifiers, namely, Random Forest, SVM, J48 and Logistic regression. Random forest performed best among them.

Hope this helps.

Thank you for your suggestions. I will try them out and get back to you. — bili, Jan 15 '13 at 22:53

score 1 · Answer 2 · answered Jan 15 '13 at 20:30

A feature that is most likely very powerful, if you can build it (I'm not sure it's possible), is whether there is a reply to the tweet in question.

Steve

Yes, I only apply the regular expression to tweets that have one or more replies. – bili Jan 15 '13 at 22:55