
I am trying to build a term-document matrix from the text of a single PDF. When I inspect the term-document matrix, I get this:

<<TermDocumentMatrix (terms: 7245, documents:342)>>

The number of documents should be 1, not 342; 342 is the number of pages in the PDF file. This is the R code I used:

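library(pdftools)  # pdf_text()
library(tm)        # Corpus(), VectorSource(), TermDocumentMatrix()

pdf_file <- file.path("Lat/web", "textpdf.pdf")
text <- pdf_text(pdf_file)  # character vector, one element per page
myCorpus <- Corpus(VectorSource(text))

# stopwords_en is assumed to be a character vector of stop words defined elsewhere
mytdm <- TermDocumentMatrix(myCorpus, control = list(
  removeNumbers = TRUE,
  removePunctuation = TRUE,
  stopwords = stopwords_en,
  stemming = TRUE
))
inspect(mytdm)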
Hilfit19

1 Answer


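pdf_text() returns a character vector with one element per page, and VectorSource() treats each element as a separate document, which is why you get 342 documents. Use the following code to collapse the PDF pages into one document.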

pdf_file <- file.path("Lat/web", "textpdf.pdf")
text <- pdf_text(pdf_file)
# collapse the PDF pages into a single string; a space separator
# keeps words at page boundaries from running together
text <- paste(text, collapse = " ")
# ..... rest of code
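
To see the effect without a PDF at hand, here is a minimal sketch that uses a small made-up character vector (`pages`, a hypothetical stand-in for the output of pdf_text()) and compares the document counts before and after collapsing. It only assumes the tm package is installed.

```r
library(tm)

# Hypothetical stand-in for pdf_text() output: one string per page
pages <- c("first page text", "second page words", "third page text")

# One vector element per page, so each page becomes its own document
per_page <- TermDocumentMatrix(Corpus(VectorSource(pages)))

# Collapse all pages into a single string, so there is one document
whole <- paste(pages, collapse = " ")
single <- TermDocumentMatrix(Corpus(VectorSource(whole)))

nDocs(per_page)  # 3
nDocs(single)    # 1
```

The same pattern should carry over to the real file: the number of documents in the matrix follows the length of the vector passed to VectorSource().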
phiver