I'm trying to figure out how to apply text classification with RTextTools to a corpus I downloaded from LexisNexis.
I have succeeded both in parsing the N LexisNexis HTML files into a document-feature matrix (dfm) using the quanteda package and in classifying the texts in those files with RTextTools.
However, I want to classify these N texts not only at the document level but also at the sentence level. I can't think of a way to parse these N documents into a dfm consisting of X sentences.
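
To make the target concrete, here is a rough sketch of the sentence-level structure I have in mind, using toy text rather than my actual LexisNexis files. It assumes that quanteda's `corpus_reshape()` is the right tool for splitting documents into sentences; the example texts and object names are made up for illustration, and I have not checked whether the resulting dfm can be passed to RTextTools.

```r
library(quanteda)

# Toy stand-in for the parsed LexisNexis articles (hypothetical text).
docs <- c(
  doc1 = "The council approved the budget. Critics called the vote rushed.",
  doc2 = "Markets fell on Monday. Analysts expect a recovery. Trading was thin."
)

corp <- corpus(docs)

# Split every document so that each sentence becomes its own "document".
sent_corp <- corpus_reshape(corp, to = "sentences")

# Build a dfm with one row per sentence instead of one row per article.
sent_dfm <- dfm(tokens(sent_corp, remove_punct = TRUE))

# Number of rows, i.e. the X sentences obtained from the N documents.
ndoc(sent_dfm)
```

This only covers the reshaping step on plain character vectors; it does not include the HTML parsing or the hand-off to RTextTools.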
Moreover, I expect that most sentences in my training data will be irrelevant and therefore left unclassified. How does RTextTools handle irrelevant sentences in my test data?