I have a question regarding a project in which I have to classify text. In this project I have several thousand questions (strings) that should be put into the categories tech, sports, politics, history, science and geography. My training data (already labeled) is 200 questions in size (I can easily expand on that). I tried TextBlob (which uses NLTK) with the Naive Bayes classifier, but that only achieved an accuracy of 28%. I am currently looking for other approaches (k-NN, SVM, ...) to improve accuracy.
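For reference, my baseline roughly looks like the following minimal sketch (the `train`/`test` data shown here are just placeholders for my labeled questions):

```python
from textblob.classifiers import NaiveBayesClassifier

# Placeholder data: each entry is a (question, category) pair.
train = [
    ("Who won the FIFA World Cup in 1998?", "sports"),
    ("What is the chemical symbol for gold?", "science"),
    ("Who was the first president of the USA?", "history"),
    # ... ~200 labeled questions in total
]
test = [
    ("Which river is the longest in the world?", "geography"),
]

classifier = NaiveBayesClassifier(train)                    # train the NB classifier
print(classifier.classify("Who invented the telephone?"))   # predicted category
print(classifier.accuracy(test))                            # ~0.28 on my held-out set
```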

Do you have any suggestions on what I should use to categorize these questions?

Sincerely, James

  • You can try deep learning (CNNs, RNNs) if you have enough data, or use pre-trained models and fine-tune them on your dataset. Here is one implementation in TensorFlow: https://github.com/yuhui-lin/text-classification – Blackberry Aug 12 '17 at 13:10
  • Thank you very much! What do you think of SVMs for this task? – James No Aug 12 '17 at 15:37
  • Since this is not a question about programmatic implementation, it would be better suited for [Data Science](https://datascience.stackexchange.com/). Incidentally, I do not agree that deep NNs are a good idea here. An SVM should do just fine, given the right features. I would probably start with `tf-idf`, perhaps including `bi-grams`/`tri-grams` (or go for `word embeddings` if all else fails). However, the key issue is probably that the questions are short and therefore contain little information on their own, so I'd look into `query expansion`, e.g. using the `synsets` of `WordNet`; a sketch of this pipeline appears after these comments. – WhoIsJack Aug 12 '17 at 21:06
  • SVMs were the previous state of the art for text classification, as far as I know. The current best approaches are deep-learning-based. If you need the highest possible accuracy, deep learning is the way to go. Otherwise, SVMs should also be fine. – Blackberry Aug 13 '17 at 09:09
  • @Blackberry DL comes at the cost of significantly complicating the model compared to an SVM, which among other consequences drastically increases the amount of training data needed. It's certainly true that complicated tasks with large data sets of very high-dimensional samples are currently best solved by DL, but in this case I would be surprised to find that an SVM performs substantially worse (whilst being much easier to handle). But again, no matter the learning approach chosen, my guess would be that enrichment of the data using additional resources will be the key to success here. – WhoIsJack Aug 14 '17 at 15:53
  • Thank you for your replies! An article about DL that I found especially interesting is [this](https://machinelearnings.co/text-classification-using-neural-networks-f5cd7b8765c6), where you already have an example. I am currently trying to expand that basic model with the WordNet feature so that it becomes more accurate. Also, please don't close this thread, because I might have questions regarding the coding and it would make sense to post them here. Sincerely, James – James No Aug 14 '17 at 21:02
  • The article you link to is about classical (not deep) learning and presents an ANN model that is essentially equivalent to an SVM. As far as implementation questions are concerned, I am not sure that posting them here is the correct way of using Stack Overflow. Asking a new question (after you are certain that you cannot figure out the problem using existing resources/questions) would probably be better. – WhoIsJack Aug 15 '17 at 11:03
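
A minimal sketch of the pipeline suggested in the comments (`tf-idf` over uni-/bi-grams feeding a linear SVM, plus WordNet-based query expansion to enrich the short questions), assuming scikit-learn and NLTK are available; the `questions`/`labels` values are placeholders for the real data:

```python
from nltk.corpus import wordnet  # requires: import nltk; nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def expand_with_synonyms(text):
    """Query expansion: append WordNet synonyms of each token to the question."""
    tokens = text.lower().split()
    extra = []
    for tok in tokens:
        for syn in wordnet.synsets(tok)[:2]:               # cap synsets per token
            extra.extend(lemma.name().replace("_", " ")
                         for lemma in syn.lemmas()[:3])    # cap synonyms per synset
    return " ".join(tokens + extra)

# Placeholder data: in practice, the ~200 labeled questions.
questions = [
    "Who won the FIFA World Cup in 1998?",
    "What is the capital of Peru?",
    "Who was the first chancellor of Germany?",
]
labels = ["sports", "geography", "history"]

model = make_pipeline(
    TfidfVectorizer(preprocessor=expand_with_synonyms,  # plug in the expansion
                    ngram_range=(1, 2),                  # uni- and bi-grams
                    sublinear_tf=True),
    LinearSVC(),
)
model.fit(questions, labels)
print(model.predict(["Which river is the longest in the world?"]))
```

With only ~200 labeled examples, a cross-validated estimate (e.g. `sklearn.model_selection.cross_val_score(model, questions, labels, cv=5)`) would give a more reliable accuracy figure than a single train/test split.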
