I have a training dataset of 100,000+ documents categorised into around 100 categories. I am trying to predict the category of a text using the Deeplearning4j library, with code based on the ParagraphVectorsClassifierExample example. Each document is a single short line of text.

I am splitting the available data into training (80%) and test (20%) sets. Even after much parameter tuning, I get at most 20% correct predictions on the test data. I understand that a lot depends on the input data itself, but I wanted to check whether the accuracy can be improved further. A comment in the example code says "This example could be improved by using learning cascade for higher accuracy". Any hints or advice on improving prediction accuracy would be highly appreciated.
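For reference, the 80/20 split and the "correct predictions" figure described above could be computed along these lines. This is only a minimal plain-Java sketch, not code from the Deeplearning4j example: the `SplitAndScore`, `Doc` and `Split` names are hypothetical, and splitting per category (stratified) with a fixed seed is an assumption made so that every category shows up in both sets and runs are repeatable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, independent of Deeplearning4j.
final class SplitAndScore {

    /** One labelled document: a category name and a single line of text. */
    static final class Doc {
        final String category;
        final String text;

        Doc(String category, String text) {
            this.category = category;
            this.text = text;
        }
    }

    /** Holds the two halves of a split. */
    static final class Split {
        final List<Doc> train = new ArrayList<>();
        final List<Doc> test = new ArrayList<>();
    }

    /** Splits each category separately so both sets keep the category mix. */
    static Split stratifiedSplit(List<Doc> docs, double trainFraction, long seed) {
        Map<String, List<Doc>> byCategory = new LinkedHashMap<>();
        for (Doc d : docs) {
            byCategory.computeIfAbsent(d.category, k -> new ArrayList<>()).add(d);
        }
        Random rng = new Random(seed); // fixed seed: the same split on every run
        Split split = new Split();
        for (List<Doc> group : byCategory.values()) {
            List<Doc> shuffled = new ArrayList<>(group);
            Collections.shuffle(shuffled, rng);
            int cut = (int) Math.round(shuffled.size() * trainFraction);
            split.train.addAll(shuffled.subList(0, cut));
            split.test.addAll(shuffled.subList(cut, shuffled.size()));
        }
        return split;
    }

    /** Fraction of positions where the predicted label equals the actual label. */
    static double accuracy(List<String> predicted, List<String> actual) {
        if (predicted.size() != actual.size()) {
            throw new IllegalArgumentException("predicted and actual differ in size");
        }
        if (actual.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (predicted.get(i).equals(actual.get(i))) {
                correct++;
            }
        }
        return (double) correct / actual.size();
    }
}
```

The predicted labels would come from whatever classifier is being evaluated; the sketch only covers the bookkeeping around it.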

Gopi
  • You should go through all of your data and check whether the same content appears in more than one class, because that will lower the accuracy. For example, if you have 100 categories and 2 of them contain the same document content, accuracy will drop. – JRowan May 01 '17 at 00:26
  • Thanks for the input, @JRowan. Will check for this. – Gopi May 01 '17 at 13:33
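
The check suggested in the comments could be sketched roughly as follows. It is plain Java with no Deeplearning4j calls; the `CrossCategoryDuplicates` class, its method names, and the `{category, text}` array layout are all hypothetical, and treating texts as "the same" after trimming, lower-casing and collapsing whitespace is an assumption that may be too strict or too loose for a given dataset.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for spotting identical texts filed under different categories.
final class CrossCategoryDuplicates {

    /** Assumed notion of "same content": trim, lower-case, collapse whitespace. */
    static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    /**
     * Takes {category, text} pairs and returns every normalized text that
     * appears under two or more categories, with the categories it appears in.
     */
    static Map<String, Set<String>> findConflicts(List<String[]> labelled) {
        Map<String, Set<String>> categoriesByText = new TreeMap<>();
        for (String[] pair : labelled) {
            String category = pair[0];
            String text = normalize(pair[1]);
            categoriesByText.computeIfAbsent(text, k -> new TreeSet<>()).add(category);
        }
        // Keep only texts that are labelled with more than one category.
        categoriesByText.values().removeIf(categories -> categories.size() < 2);
        return categoriesByText;
    }
}
```

Any text reported here has conflicting labels, so a classifier cannot get every copy right; such rows could be relabelled, merged, or dropped before training.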

0 Answers