
I have started working on a project that requires Natural Language Processing and building a Support Vector Machine (SVM) model in R.

I’d like to generate a Term Document Matrix with all the tokens.

Example:

library(NLP)
library(openNLP)
library(tm)

testset <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.", "M6 is 13 days out of the visit window")
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
test_annotations <- annotate(testset, list(sent_ann, word_ann))
test_doc <- AnnotatedPlainTextDocument(testset, test_annotations)
sents(test_doc)

[[1]]
 [1] "From"       "month"      "2"          "the"        "AST"        "and"        "total"     
 [8] "bilirubine" "were"       "not"        "measured"   "."         

[[2]]
 [1] "16:OTHER"                         "-"                               
 [3] "COMMENT"                          "REQUIRED"                        
 [5] "IN"                               "COMMENT"                         
 [7] "COLUMN;07/02/2004/GENOTYPING;SF-" "genotyping"                      
 [9] "consent"                          "not"                             
[11] "offered"                          "until"                           
[13] "T4"                               "."                               

[[3]]
[1] "M6"     "is"     "13"     "days"   "out"    "of"     "the"    "visit"  "window" 

And then I generated a TDM:

tdm <- TermDocumentMatrix(as.VCorpus(list(test_doc)))
inspect(tdm)
<<TermDocumentMatrix (terms: 22, documents: 1)>>
Non-/sparse entries: 22/0
Sparsity           : 0%
Maximal term length: 32
Weighting          : term frequency (tf)

                                  Docs
Terms                              NULL
  16:other                            1
  and                                 1
  ast                                 1
  bilirubine                          1
  column;07/02/2004/genotyping;sf-    1
  comment                             2
  consent                             1
  days                                1
  from                                1
  genotyping                          1
  measured                            1
  month                               1
  not                                 2
  offered                             1
  out                                 1
  required                            1
  the                                 2
  total                               1
  until                               1
  visit                               1
  were                                1
  window                              1

I actually have three documents in the dataset (the three strings in testset above), so the TDM should have shown three document columns. But only one column appears here.

Could anyone please give me some advice on this?

sessionInfo()
    R version 3.3.0 (2016-05-03)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tm_0.6-2       openxlsx_3.0.0 magrittr_1.5   RWeka_0.4-28   openNLP_0.2-6  NLP_0.1-9     
[7] rJava_0.9-8   

2 Answers


I think what you are trying to do is take a list of three strings and turn it into a corpus. I am not sure whether three different strings in one list count as three different documents.

I took your data, put it into three .txt files, and ran this:

text_name <- file.path("C:", "texts")
dir(text_name)

[1] "text1.txt" "text2.txt" "text3.txt"

If you don't want to do any cleaning, you can convert it directly to a corpus:

docs <- Corpus(DirSource(text_name)) 
summary(docs)
          Length Class             Mode
text1.txt 2      PlainTextDocument list
text2.txt 2      PlainTextDocument list
text3.txt 2      PlainTextDocument list

dtm <- DocumentTermMatrix(docs)   
dtm

<<DocumentTermMatrix (documents: 3, terms: 22)>>
Non-/sparse entries: 24/42
Sparsity           : 64%
Maximal term length: 32
Weighting          : term frequency (tf)

tdm <- TermDocumentMatrix(docs) 
tdm
<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity           : 64%
Maximal term length: 32
Weighting          : term frequency (tf)

inspect(tdm)


<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity           : 64%
Maximal term length: 32
Weighting          : term frequency (tf)

                              Docs
Terms                              text1.txt text2.txt text3.txt
16:other                                 0         1         0
and                                      1         0         0
ast                                      1         0         0
bilirubine                               1         0         0
column;07/02/2004/genotyping;sf-         0         1         0
comment                                  0         2         0
consent                                  0         1         0
days                                     0         0         1
from                                     1         0         0
genotyping                               0         1         0
measured.                                1         0         0
month                                    1         0         0
not                                      1         1         0
offered                                  0         1         0
out                                      0         0         1
required                                 0         1         0
the                                      1         0         1
total                                    1         0         0
until                                    0         1         0
visit                                    0         0         1
were                                     1         0         0
window                                   0         0         1

I think you might want to create three separate documents and then convert them into a corpus; a sketch of that idea follows. Let me know if this helps.
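Something along these lines (a sketch, assuming the testset, sent_ann, and word_ann objects from the question are in scope) keeps the openNLP tokenization but still yields one document per string; the id meta tag is an assumption about how tm names the TDM columns:

docs_list <- lapply(seq_along(testset), function(i) {
  s <- as.String(testset[i])
  a <- annotate(s, list(sent_ann, word_ann))
  # id is assumed to become the column name in the TDM
  AnnotatedPlainTextDocument(s, a, meta = list(id = paste0("doc", i)))
})
tdm <- TermDocumentMatrix(as.VCorpus(docs_list))
inspect(tdm)  # should report documents: 3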

Shweta Kamble
  • Hi, Shweta. Thanks for the response. But in my case I have 846,000 observations in one of the columns of the dataset, so there is no way I can put them all into separate text files. I also need them in one dataset so I can run the SVM later. I think the main bottleneck is that I need to tokenize the words with NLP first, so I have to use tdm <- TermDocumentMatrix(as.VCorpus(list(test_doc))). – Chih-Ching Yeh Jun 15 '16 at 18:37
  • Do you want each row as a document? – Shweta Kamble Jun 15 '16 at 18:45

So, given that you want each row in your column of text as a document, convert the list to a data frame:

install.packages("tm")  # only needed once
library(tm)
df <- data.frame(testset)
docs <- Corpus(VectorSource(df$testset))
summary(docs)
  Length Class             Mode
1 2      PlainTextDocument list
2 2      PlainTextDocument list
3 2      PlainTextDocument list

Follow the steps from the previous answer after this to get your TDM; this should solve your problem.
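Concretely, the last step would be (same tm calls as in the previous answer):

tdm <- TermDocumentMatrix(docs)
inspect(tdm)  # one column per row of testset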

Shweta Kamble
  • Thanks Shweta! I do want each row in my column of text as a document. But the problem is that the input to the text mining is now the output of the NLP step (test_doc above), not testset anymore. That's why I need TermDocumentMatrix(as.VCorpus(list(test_doc))). – Chih-Ching Yeh Jun 16 '16 at 16:56
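One way to reconcile the two requirements, as a sketch: assuming each input string became exactly one sentence in test_doc (as in the output above), rebuild one plain string per sentence from the NLP tokens and hand those to tm, so every row stays its own document:

tokens_per_sent <- sents(test_doc)   # one vector of openNLP tokens per input string
docs <- Corpus(VectorSource(sapply(tokens_per_sent, paste, collapse = " ")))
tdm <- TermDocumentMatrix(docs)
inspect(tdm)  # documents: 3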