
I have a few thousand .txt documents stored in 8 different folders, each tagged with a topic category (actually, they are just classes 1, 2, 3, ...). I also have another 80 .txt documents that don't yet have categories, and I'm trying to find the best way to classify them.

I have already done the word segmentation and removed the English letters (the texts are Chinese). What should I do next?

I can extract the words with the highest TF-IDF values, but I don't know what to do next. It seems I should turn these texts into vectors and train a classifier, but I don't know how.
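For concreteness, this is roughly the pipeline I have in mind, a minimal sketch assuming scikit-learn; `train_texts`, `train_labels`, and `new_texts` are placeholders for my own segmented, space-joined documents:

```python
# Minimal sketch, assuming scikit-learn. train_texts / train_labels are the
# labelled documents (classes 1..8) and new_texts are the 80 unlabelled ones;
# each document is already segmented, with words joined by spaces.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Turn the segmented texts into TF-IDF vectors. The relaxed token_pattern
# keeps single-character Chinese words, which the default regex discards.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X_train = vectorizer.fit_transform(train_texts)

# Train a simple classifier on the labelled documents.
clf = MultinomialNB().fit(X_train, train_labels)

# Vectorize the unlabelled documents with the same fitted vectorizer,
# then predict their classes.
X_new = vectorizer.transform(new_texts)
predictions = clf.predict(X_new)
```

Is something like this the right direction?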

Andy Zhao
    Consider taking a look at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html – DJanssens Nov 07 '16 at 10:37

1 Answer


Instead of implementing your own bag-of-words model, you could use, for example, doc2vec from gensim. It offers strong performance that will be difficult to match with your own implementation, and you can choose between hierarchical softmax and negative sampling.
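A minimal sketch of what that could look like, assuming gensim 4.x plus scikit-learn for the classifier; `train_tokens`, `train_labels`, and `new_tokens` are placeholders for your own segmented documents:

```python
# Minimal sketch, assuming gensim 4.x and scikit-learn. train_tokens and
# new_tokens are lists of token lists (already segmented); train_labels
# holds the class tags 1..8.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Each labelled document gets a unique tag for training.
tagged = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(train_tokens)]

# hs=1 selects hierarchical softmax; hs=0 together with negative=5
# would use negative sampling instead.
model = Doc2Vec(vector_size=100, window=5, min_count=2,
                epochs=20, hs=1, workers=4)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Train a classifier on the learned document vectors.
X_train = [model.dv[i] for i in range(len(train_tokens))]
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Vectors for unseen documents are inferred, not looked up.
X_new = [model.infer_vector(tokens) for tokens in new_tokens]
predictions = clf.predict(X_new)
```

With only a few thousand training documents the doc2vec vectors may be noisy, so it is worth comparing the result against a plain TF-IDF baseline.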

Lukasz Tracewski