
I use NGramTokenizer() to do 1- to 3-gram segmentation, but it doesn't seem to consider punctuation; it simply removes it.

So the resulting segments aren't ideal for me.

(For example, results like "oxidant amino", "oxidant amino acid", "pellet oxidant", and so on, which span commas in the original text.)

Is there a segmentation method that retains punctuation? (I think I could then use POS tagging to filter out the strings that contain punctuation after segmentation.)

Or is there another way to take punctuation into account during word segmentation? That would be even better for me.

library(tm)
library(RWeka)

text <- "the slurry includes: attrition pellet, oxidant, amino acid and water."

corpus_text <- VCorpus(VectorSource(text))
content(corpus_text[[1]])

# 1- to 3-gram tokenizer built on RWeka's NGramTokenizer
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
dtm <- DocumentTermMatrix(corpus_text, control = list(tokenize = BigramTokenizer))
mat <- as.matrix(dtm)
colnames(mat)

 [1] "acid"                      "acid and"                  "acid and water"           
 [4] "amino"                     "amino acid"                "amino acid and"           
 [7] "and"                       "and water"                 "attrition"                
[10] "attrition pellet"          "attrition pellet oxidant"  "includes"                 
[13] "includes attrition"        "includes attrition pellet" "oxidant"                  
[16] "oxidant amino"             "oxidant amino acid"        "pellet"                   
[19] "pellet oxidant"            "pellet oxidant amino"      "slurry"                   
[22] "slurry includes"           "slurry includes attrition" "the"                      
[25] "the slurry"                "the slurry includes"       "water"    
Eva
  • If keeping punctuation is all that's required, maybe tokenizing based on punctuation symbols (regex-based) could be done. Will that work? – amrrs Sep 21 '17 at 05:13

2 Answers

You can use the tokens() function of the quanteda package as follows:

library(quanteda)
text <- "some text, with commas, and semicolons; and even fullstop. to be tokenized"
# keep punctuation marks as tokens and build 1- to 3-grams in one call
tokens(text, what = "word", remove_punct = FALSE, ngrams = 1:3)

Output:

tokens from 1 document.
text1 :
 [1] "some"              "text"              ","                 "with"             
 [5] "commas"            ","                 "and"               "semicolons"       
 [9] ";"                 "and"               "even"              "fullstop"         
[13] "."                 "to"                "be"                "toekinzed"        
[17] "some text"         "text ,"            ", with"            "with commas"      
[21] "commas ,"          ", and"             "and semicolons"    "semicolons ;"     
[25] "; and"             "and even"          "even fullstop"     "fullstop ."       
[29] ". to"              "to be"             "be toekinzed"      "some text ,"      
[33] "text , with"       ", with commas"     "with commas ,"     "commas , and"     
[37] ", and semicolons"  "and semicolons ;"  "semicolons ; and"  "; and even"       
[41] "and even fullstop" "even fullstop ."   "fullstop . to"     ". to be"          
[45] "to be tokeinzed"  

For more information on what each argument to the function does, see the quanteda documentation.
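
If you then want to drop the n-grams that contain punctuation (the filtering step mentioned in the question), one possible sketch uses tokens_ngrams() and tokens_remove(); note that newer quanteda versions build n-grams this way rather than through the ngrams argument:

toks <- tokens(text, remove_punct = FALSE)
grams <- tokens_ngrams(toks, n = 1:3, concatenator = " ")
# remove every n-gram that contains a punctuation character
tokens_remove(grams, "[[:punct:]]", valuetype = "regex")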

Update: for document-term frequencies, look at Constructing a document-frequency matrix.

As an example, try the following.

For bigrams (note that you don't need to tokenize first):

# build a dfm of bigrams directly from the text, keeping punctuation
dfm(text, remove_punct = FALSE, ngrams = 2, concatenator = " ")
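
If you ultimately need a tm-style DocumentTermMatrix rather than a quanteda dfm (see the comments below), convert() can translate between the two; a minimal sketch:

# build the dfm, then hand it to tm as a DocumentTermMatrix
my_dfm <- dfm(text, remove_punct = FALSE, ngrams = 2, concatenator = " ")
convert(my_dfm, to = "tm")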
Ken Benoit
Imran Ali
  • It seems like a good way to achieve the segmentation I want, but I need to convert these strings to a DTM after segmentation. Is it possible to build the DTM without using a corpus? – Eva Sep 21 '17 at 08:29
  • @Eva I have updated the answer to address the need for document-term frequencies; I hope it helps you – Imran Ali Sep 21 '17 at 11:42

You can probably pass the corpus through tm_map() before building the DTM, something like:

library(tm)
library(RWeka)

text <- "the slurry includes: attrition pellet, oxidant, amino acid and water."

corpus_text <- VCorpus(VectorSource(text))
content(corpus_text[[1]])


clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)  # strips all punctuation
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "and"))  # also drop "and"
  return(corpus)
}

corpus_text <- clean_corpus(corpus_text)
content(corpus_text[[1]])
#" slurry includes attrition pellet oxidant amino acid water"

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
dtm <- DocumentTermMatrix(corpus_text, control = list(tokenize = BigramTokenizer))
mat <- as.matrix(dtm)
colnames(mat)
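
Alternatively, following the regex idea from the comments on the question: if you split the text on punctuation before building the corpus, each clause becomes its own document, so NGramTokenizer can never produce an n-gram that crosses a punctuation boundary. A rough sketch:

# split into clause-level segments at punctuation marks, then tokenize each
segments <- unlist(strsplit(text, "[[:punct:]]+"))
segments <- trimws(segments[trimws(segments) != ""])
corpus_seg <- VCorpus(VectorSource(segments))
dtm_seg <- DocumentTermMatrix(corpus_seg, control = list(tokenize = BigramTokenizer))
colnames(as.matrix(dtm_seg))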
MingH