11

I collect questions from Twitter's stream by using a regular expression to pick out any tweet containing text that starts with a question word (who, what, when, where, etc.) and ends with a question mark.

As a result, my database contains many non-useful questions such as 'who cares?' and 'what's this?', alongside useful ones such as 'How often is there a basketball fight?' and 'How much does a polar bear weigh?'.

However, I am only interested in useful questions.

I have about 3000 questions, all manually labeled: ~2000 are not useful and ~1000 are useful. I am trying to use the naive Bayes classifier that comes with NLTK to classify questions automatically, so that I don't have to pick out the useful ones by hand.

As a start, I tried using the first three words of a question as a feature, but this doesn't help very much. On a test of 100 questions, the classifier was correct for only around 10%-15% of the useful questions. It also missed useful questions among the ones it predicted as not useful.

I have tried other features, such as all the words in the question and the length of the question, but the results did not change significantly.

Any suggestions on how I should choose the features or carry on?

Thanks.

Ram Narasimhan
bili
  • Can you give examples of questions you tagged useful or not useful? – Suzana Jan 14 '13 at 16:57
  • This is more of a machine learning question than a programming one. You can try asking it on CrossValidated to get a few suggestions for feature selection – Ram Narasimhan Jan 14 '13 at 18:14
  • @Suzana_K: not useful: 'who cares?', 'what's this?' and useful: 'How often is there a basketball fight?', 'How much does a polar bear weigh?' – bili Jan 14 '13 at 21:23
  • @RamNarasimhan: thanks I will try CrossValidated. – bili Jan 14 '13 at 21:23

2 Answers

13

Some random suggestions.

Add a pre-processing step that removes stop words such as "this", "a", "of", and "and". For example, take the question

  How often is there a basketball fight

After removing the stop words, you get

  how often basketball fight 
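A minimal sketch of that pre-processing step, assuming a tiny hand-picked stop-word list (the list and the function name `remove_stop_words` are only illustrative). A fuller list such as NLTK's English stop words could be swapped in, but check it first: standard lists often contain the question words themselves (how, what, who), which you probably want to keep here.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "there", "this"}

def remove_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(" ".join(remove_stop_words("How often is there a basketball fight?")))
```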

Calculate a tf-idf score for each word, treating each tweet as a document. (You need the whole corpus to compute the document frequencies.)

For a sentence like the one above, you calculate a tf-idf score for each word:

  tf-idf(how)
  tf-idf(often)
  tf-idf(basketball)
  tf-idf(fight)

These scores indicate how distinctive each word is within the corpus, which might be useful.

Try the following additional features for your classifier:

  • average tf-idf score
  • median tf-idf score
  • max tf-idf score
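As a rough sketch of how those three features could be computed, assuming the tweets are already tokenized and using the plain tf × log(N/df) weighting (other tf-idf variants exist; the function name `tfidf_features` and the feature keys are made up for illustration):

```python
import math
from collections import Counter
from statistics import mean, median

def tfidf_features(docs):
    """docs: list of token lists, one per tweet. Returns one feature dict per tweet."""
    n_docs = len(docs)
    # Document frequency: number of tweets that contain each word at least once.
    df = Counter(word for doc in docs for word in set(doc))
    features = []
    for doc in docs:
        tf = Counter(doc)
        # One score per distinct word: relative term frequency times log(N / df).
        scores = [tf[w] / len(doc) * math.log(n_docs / df[w]) for w in tf]
        if not scores:  # empty tweet after pre-processing
            scores = [0.0]
        features.append({
            "tfidf_mean": mean(scores),
            "tfidf_median": median(scores),
            "tfidf_max": max(scores),
        })
    return features

corpus = [
    ["how", "often", "basketball", "fight"],
    ["who", "cares"],
    ["how", "much", "polar", "bear", "weigh"],
]
for tokens, feats in zip(corpus, tfidf_features(corpus)):
    print(tokens, feats)
```

NLTK's naive Bayes classifier treats each feature value as a discrete symbol, so continuous scores like these would probably need to be bucketed (for example into low/medium/high) before being used there.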

Furthermore, try a POS tagger and generate a tagged ("categorized") sentence for each tweet.

>>> import nltk
>>> text = nltk.word_tokenize(" How often is there a basketball fight")
>>> nltk.pos_tag(text)
[('How', 'WRB'), ('often', 'RB'), ('is', 'VBZ'), ('there', 'EX'), ('a', 'DT'), ('basketball', 'NN'), ('fight', 'NN')]

This gives you further features to try that are based on the POS tags.
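For instance, here is a small sketch that turns the tagger output into a feature dictionary. The particular features (first tag, full tag sequence, noun and verb counts) and the name `pos_features` are just assumptions about what might be worth trying, not a tested recipe:

```python
from collections import Counter

def pos_features(tagged):
    """tagged: list of (word, tag) pairs, e.g. the output of nltk.pos_tag."""
    tags = [tag for _, tag in tagged]
    counts = Counter(tags)
    return {
        "first_tag": tags[0] if tags else "NONE",
        # The whole tag sequence; likely sparse, so shorter n-grams may work better.
        "tag_sequence": " ".join(tags),
        # Penn Treebank noun tags start with NN, verb tags with VB.
        "num_nouns": sum(n for t, n in counts.items() if t.startswith("NN")),
        "num_verbs": sum(n for t, n in counts.items() if t.startswith("VB")),
    }

tagged = [('How', 'WRB'), ('often', 'RB'), ('is', 'VBZ'), ('there', 'EX'),
          ('a', 'DT'), ('basketball', 'NN'), ('fight', 'NN')]
print(pos_features(tagged))
```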

Some other features that might be useful are listed below; see the qtweet paper (a paper on identifying questions in tweets) for details.

  • whether the tweet contains any url
  • whether the tweet contains any email or phone number
  • whether a sign of strong feeling, such as !, follows the question
  • which unigrams (single words) are present in the context of the tweet
  • whether the tweet mentions another user's name
  • whether the tweet is a retweet
  • whether the tweet contains any hashtag #
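Most of these can be approximated with simple regular expressions. The sketch below is only a starting point: the patterns are rough assumptions rather than the ones used in the paper, phone numbers and the unigram feature are left out, and the "RT @user" convention is assumed for detecting retweets.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION_RE = re.compile(r"(?<!\w)@\w+")        # @user, but not the @ inside an email
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
RETWEET_RE = re.compile(r"(?:^|\s)RT\s+@\w+")  # assumes the "RT @user" convention
EXCLAIM_AFTER_Q_RE = re.compile(r"\?\s*!")     # "?!" or "? !"

def surface_features(tweet):
    """Return boolean features describing the surface form of a tweet."""
    return {
        "has_url": bool(URL_RE.search(tweet)),
        "has_email": bool(EMAIL_RE.search(tweet)),
        "has_mention": bool(MENTION_RE.search(tweet)),
        "has_hashtag": bool(HASHTAG_RE.search(tweet)),
        "is_retweet": bool(RETWEET_RE.search(tweet)),
        "exclaim_after_question": bool(EXCLAIM_AFTER_Q_RE.search(tweet)),
    }

print(surface_features("RT @bob: How much does a polar bear weigh?! http://t.co/abc #animals"))
```

Because the values are booleans, the resulting dictionary can be passed as a feature set to NLTK's naive Bayes classifier as it is.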

FYI, the author of qtweet tried four different classifiers: Random Forest, SVM, J48, and logistic regression. Random Forest performed best among them.

Hope this helps.

Community
greeness
1

A feature that is most likely very powerful, if you can build it (I'm not sure it's possible), is whether there is a reply to the tweet in question.

Steve
  • Yes, I only apply the regular expression to pick out questions from tweets that have one or more replies. – bili Jan 15 '13 at 22:55