I'm trying to figure out how to apply text classification with RTextTools to a corpus I downloaded from LexisNexis.

I succeeded both in parsing N LexisNexis HTML files into a document-feature matrix (dfm) using the quanteda package and in classifying the texts in those files with RTextTools.

However, I want to classify these N texts not only at the document level but also at the sentence level. I can't think of a way to parse these N documents into a dfm consisting of X sentences.

Moreover, I expect that most sentences in my training data will be irrelevant and hence left unclassified. How does RTextTools handle irrelevant sentences in my test data?

  • In **quanteda**, you can reshape your corpus from documents to sentences using `corpus_reshape()`, and then use `corpus_subset()` to filter out the "irrelevant" sentences (but this would need to be based on your own criteria). Then you could create a dfm and classify them using whatever method you want, e.g. `textmodel_NB()` for Naive Bayes, or something in another package. – Ken Benoit Sep 20 '17 at 09:16
  • Thank you very much, Ken. I didn't know classification was also possible within the quanteda package. Maybe you should incorporate text classification into the ME414 methods course, which I took this summer :) – fritsvegters Sep 20 '17 at 09:23
  • In addition to my first question: say I have a total of 2000 documents. I use 200 documents, coded at the document and sentence level (0 or 1), as training input. Some sentences and documents were marked irrelevant and will be left out. Will the classifier then also label some of the remaining 1800 documents and their sentences as irrelevant, alongside 0 or 1? – fritsvegters Sep 20 '17 at 10:20
  • You can train a multinomial Naive Bayes classifier for the three categories, but only if you train on three codes rather than two. If the irrelevant category is omitted from the training data, the classifier cannot predict it. – Ken Benoit Sep 20 '17 at 15:52
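
The workflow suggested in the comments can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tested recipe for the LexisNexis data: the three toy documents and the hand-assigned codes are made up, the docvar name `code` is hypothetical, and it assumes a recent quanteda installation where the Naive Bayes function is `textmodel_nb()` from the separate quanteda.textmodels package (the comment above refers to it as `textmodel_NB()`). "Irrelevant" is kept as a third training code so the model can predict it.

```r
library(quanteda)
library(quanteda.textmodels)  # assumed home of textmodel_nb() in recent versions

# Toy stand-ins for the parsed LexisNexis documents (made-up texts).
docs <- c(
  d1 = "Taxes will rise next year. The minister wore a blue tie. Taxes will fall for families.",
  d2 = "The weather was sunny in the capital. Parliament voted to raise taxes. Parliament voted to cut taxes.",
  d3 = "The new budget will cut taxes. The reporter arrived late."
)
corp <- corpus(docs)

# Reshape so that each sentence becomes its own "document".
sents <- corpus_reshape(corp, to = "sentences")
stopifnot(ndoc(sents) == 8)  # guard: the hand codes below assume 8 sentences

# Hand codes for the sentences of d1 and d2; d3 is uncoded (NA) and
# serves as the test data. "irrelevant" is a third code, not a missing value.
docvars(sents, "code") <- c("1", "irrelevant", "0",
                            "irrelevant", "1", "0",
                            NA, NA)

train <- corpus_subset(sents, !is.na(code))
test  <- corpus_subset(sents, is.na(code))

# Sentence-level dfms; the test dfm is matched to the training features.
dfm_train <- dfm(tokens(train, remove_punct = TRUE))
dfm_test  <- dfm(tokens(test, remove_punct = TRUE))
dfm_test  <- dfm_match(dfm_test, features = featnames(dfm_train))

# Multinomial Naive Bayes over the three codes.
nb <- textmodel_nb(dfm_train, y = docvars(train, "code"),
                   distribution = "multinomial")
pred <- predict(nb, newdata = dfm_test)
print(pred)
```

With only six training sentences the predictions themselves are not meaningful; the point is the shape of the pipeline. If the irrelevant sentences were dropped with `corpus_subset()` before training instead, the model would only ever return 0 or 1 for the uncoded sentences.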
