I am trying to implement a text classification program in R that classifies input text (passed as arguments) into 3 different classes. I have successfully tested the sample program by splitting the input data into training and test sets.

I would now like to build something that lets me classify custom text. My input data has the following structure:

So if I enter the custom text "games studies time", I would like to get a matrix that looks like the following:

What is the best way to do this?

hrbrmstr

1 Answer

This sounds a lot like applying a "dictionary" to text after that text has been tokenized. The matrix you show as the desired result in your question, however, makes no use of the categories in the input data.

So here are two solutions: the first produces the matrix you say you want, and the second produces a matrix that counts the tokens of the input text under each category that your input data maps them to.

This uses the quanteda package in R.

library(quanteda)
mymap <- dictionary(list(school = c("time", "games", "studies"),
                         college = c("time", "games"),
                         office = c("work")))
dfm("games studies time", verbose = FALSE)
## Document-feature matrix of: 1 document, 3 features.
## 1 x 3 sparse Matrix of class "dfmSparse"
##        features
## docs    games studies time
##   text1     1       1    1
dfm("games studies time", dictionary = mymap, verbose = FALSE)
## Document-feature matrix of: 1 document, 3 features.
## 1 x 3 sparse Matrix of class "dfmSparse"
##        features
## docs    school college office
##   text1      3       2      0
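
To make the second result easier to follow, here is a minimal base R sketch of the counting that a dictionary lookup performs. It is only an illustration under simple assumptions (whitespace tokenization, exact matching), not how quanteda implements it; the names `tokens`, `counts`, and `result` are made up for this example.

```r
# Category map with the same entries as the dictionary above
mymap <- list(school  = c("time", "games", "studies"),
              college = c("time", "games"),
              office  = c("work"))

# Assumption: tokens are separated by whitespace only
tokens <- strsplit("games studies time", "\\s+")[[1]]

# For each category, count how many tokens appear in its word list
counts <- sapply(mymap, function(words) sum(tokens %in% words))

# Arrange the counts as a 1 x 3 matrix with one row per document
result <- matrix(counts, nrow = 1,
                 dimnames = list("text1", names(mymap)))
print(result)
```

Each token is counted once for every category whose word list contains it, which is why "time" and "games" contribute to both school and college.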
Ken Benoit