I collect questions from Twitter's stream by using a regular expression to pick out any tweet containing text that starts with a question word (who, what, when, where, etc.) and ends with a question mark.
As a result, I end up with many non-useful questions in my database, like 'who cares?' and 'what's this?', alongside some useful ones, like 'How often is there a basketball fight?' and 'How much does a polar bear weigh?'.
However, I am only interested in useful questions.
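For reference, the collection filter described above can be sketched roughly as follows. The exact pattern and the list of question words are assumptions for illustration, not the expression actually used, and `extract_questions` is a hypothetical helper name.

```python
import re

# Hypothetical pattern: a question word, then any text without sentence-ending
# punctuation, then a question mark. The word list is an assumption.
QUESTION_RE = re.compile(
    r"\b(?:who|what|when|where|why|how)\b[^.!?]*\?",
    re.IGNORECASE,
)

def extract_questions(tweet):
    """Return every span that starts with a question word and ends with '?'."""
    return [m.group(0).strip() for m in QUESTION_RE.finditer(tweet)]

if __name__ == "__main__":
    for tweet in ["lol who cares? not me",
                  "How much does a polar bear weigh?",
                  "I like pizza."]:
        print(tweet, "->", extract_questions(tweet))
```

A filter like this only looks at the shape of the text, which is why 'who cares?' passes just as easily as the polar bear question.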
I have about 3000 questions that I have labelled manually: ~2000 of them are not useful and ~1000 are useful. I am attempting to use the naive Bayes classifier that comes with NLTK to classify questions automatically, so that I don't have to pick out the useful questions by hand.
As a start, I tried using the first three words of a question as features, but this doesn't help very much. Out of 100 test questions, the classifier's predictions for useful questions were correct only around 10%-15% of the time. It also missed useful questions among the ones it predicted as not useful.
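A minimal sketch of this first attempt, assuming whitespace tokenization and the label names 'useful' and 'not_useful' (both are assumptions, as are the helper names). It uses `nltk.NaiveBayesClassifier.train`, which takes a list of (feature dict, label) pairs, and reports precision and recall for the useful class separately, since the two kinds of error described above are different things.

```python
def first_three_words(question):
    """Feature dict built from the first three lowercase tokens."""
    tokens = question.lower().split()[:3]
    padded = tokens + ["<none>"] * (3 - len(tokens))  # pad short questions
    return {"word_%d" % i: tok for i, tok in enumerate(padded, start=1)}

def precision_recall(gold, predicted, label="useful"):
    """Precision and recall for one label, given parallel lists of labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    predicted_pos = sum(1 for p in predicted if p == label)
    actual_pos = sum(1 for g in gold if g == label)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall

def train_and_score(train, test):
    """train and test are lists of (question, label) pairs."""
    import nltk  # imported here so the helpers above work without NLTK
    classifier = nltk.NaiveBayesClassifier.train(
        [(first_three_words(q), label) for q, label in train]
    )
    predicted = [classifier.classify(first_three_words(q)) for q, _ in test]
    gold = [label for _, label in test]
    return precision_recall(gold, predicted)
```

Precision here is the fraction of questions predicted useful that really are useful; recall is the fraction of truly useful questions that the classifier found.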
I have also tried other features, such as including all the words and including the length of the question, but the results did not change significantly.
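These other features look roughly like the sketch below. The function name, the tokenization, and the length thresholds are assumptions for illustration. The length is bucketed because NLTK's naive Bayes classifier treats every feature value as a discrete category, so a raw word count would be one separate value per length.

```python
def all_words_and_length(question):
    """Bag-of-words features plus a coarse length bucket."""
    tokens = question.lower().rstrip("?").split()
    features = {"contains(%s)" % tok: True for tok in tokens}
    n = len(tokens)
    # Thresholds are arbitrary assumptions, not tuned values.
    features["length"] = "short" if n <= 3 else "medium" if n <= 8 else "long"
    return features

if __name__ == "__main__":
    print(all_words_and_length("who cares?"))
    print(all_words_and_length("How much does a polar bear weigh?"))
```

The returned dict has the same shape as any other NLTK feature set, so it can be passed to the classifier in place of the first-three-words features.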
Any suggestions on how I should choose the features, or on how else to proceed?
Thanks.