I use NGramTokenizer() to do 1-3 gram segmentation, but it seems to ignore punctuation: the punctuation is simply removed, so n-grams can span across commas and colons. As a result, the segmentation isn't ideal for me (for example, it produces terms like "oxidant amino", "oxidant amino acid", "pellet oxidant", and so on).
Is there a segmentation method that keeps the punctuation? (I think I could then use POS tagging to filter out the strings that contain punctuation after segmentation.)
Or is there another way to take punctuation into account during word segmentation? That would be perfect for me.
library(tm)
library(RWeka)

text <- "the slurry includes: attrition pellet, oxidant, amino acid and water."
corpus_text <- VCorpus(VectorSource(text))
content(corpus_text[[1]])

# 1- to 3-gram tokenizer (despite the name, it produces unigrams to trigrams)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
dtm <- DocumentTermMatrix(corpus_text, control = list(tokenize = BigramTokenizer))
mat <- as.matrix(dtm)
colnames(mat)
[1] "acid" "acid and" "acid and water"
[4] "amino" "amino acid" "amino acid and"
[7] "and" "and water" "attrition"
[10] "attrition pellet" "attrition pellet oxidant" "includes"
[13] "includes attrition" "includes attrition pellet" "oxidant"
[16] "oxidant amino" "oxidant amino acid" "pellet"
[19] "pellet oxidant" "pellet oxidant amino" "slurry"
[22] "slurry includes" "slurry includes attrition" "the"
[25] "the slurry" "the slurry includes" "water"
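One way to stop n-grams from crossing punctuation is to split each document on punctuation first, and only then build the 1-3 grams within each clause. A minimal sketch of that idea (the `PunctSplitTokenizer` name and the `[[:punct:]]+` split pattern are my own choices for illustration, not part of tm or RWeka; note the pattern also splits hyphenated words):

```r
library(tm)
library(RWeka)

# Split on runs of punctuation, then n-gram each clause separately,
# so terms like "pellet oxidant" or "oxidant amino" never appear.
PunctSplitTokenizer <- function(x) {
  clauses <- unlist(strsplit(as.character(x), "[[:punct:]]+"))
  clauses <- trimws(clauses)
  clauses <- clauses[nchar(clauses) > 0]
  unlist(lapply(clauses, function(cl)
    NGramTokenizer(cl, Weka_control(min = 1, max = 3))))
}

dtm2 <- DocumentTermMatrix(corpus_text,
                           control = list(tokenize = PunctSplitTokenizer))
colnames(as.matrix(dtm2))
```

With this approach the cross-punctuation terms are filtered out up front, so no POS-tagging pass over the results should be needed.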