
I have a few thousand .txt documents stored in 8 different folders, each tagged with a topic category (actually, they are just classes 1, 2, 3, ...). I also have another 80 .txt documents that don't yet have categories, and I'm trying to find the best way to classify them.

I have already done the word segmentation and removed the English letters (the texts are Chinese). What should I do next?

I can extract the words with the highest TF-IDF values, but I don't know what to do next. It seems I should turn these texts into vectors and train a classifier, but I don't know how.
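For concreteness, this is roughly the pipeline I have in mind, a minimal sketch assuming scikit-learn; `train_texts`, `train_labels`, and `new_texts` are placeholders for my own segmented, space-joined documents:

```python
# Minimal sketch, assuming scikit-learn. train_texts / train_labels are the
# labelled documents (classes 1..8) and new_texts are the 80 unlabelled ones;
# each document is already segmented, with words joined by spaces.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Turn the segmented texts into TF-IDF vectors. The relaxed token_pattern
# keeps single-character Chinese words, which the default regex discards.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X_train = vectorizer.fit_transform(train_texts)

# Train a simple classifier on the labelled documents.
clf = MultinomialNB().fit(X_train, train_labels)

# Vectorize the unlabelled documents with the same fitted vectorizer,
# then predict their classes.
X_new = vectorizer.transform(new_texts)
predictions = clf.predict(X_new)
```

Is something like this the right direction?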

Andy Zhao
    Consider taking a look at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html – DJanssens Nov 07 '16 at 10:37

1 Answer


Instead of implementing your own bag-of-words model, you could use, for example, doc2vec from gensim. It offers strong performance that will be difficult to match with your own implementation, and you can choose between hierarchical softmax and negative sampling.
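A minimal sketch of what that could look like, assuming gensim 4.x plus scikit-learn for the classifier; `train_tokens`, `train_labels`, and `new_tokens` are placeholders for your own segmented documents:

```python
# Minimal sketch, assuming gensim 4.x and scikit-learn. train_tokens and
# new_tokens are lists of token lists (already segmented); train_labels
# holds the class tags 1..8.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Each labelled document gets a unique tag for training.
tagged = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(train_tokens)]

# hs=1 selects hierarchical softmax; hs=0 together with negative=5
# would use negative sampling instead.
model = Doc2Vec(vector_size=100, window=5, min_count=2,
                epochs=20, hs=1, workers=4)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Train a classifier on the learned document vectors.
X_train = [model.dv[i] for i in range(len(train_tokens))]
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Vectors for unseen documents are inferred, not looked up.
X_new = [model.infer_vector(tokens) for tokens in new_tokens]
predictions = clf.predict(X_new)
```

With only a few thousand training documents the doc2vec vectors may be noisy, so it is worth comparing the result against a plain TF-IDF baseline.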

Lukasz Tracewski