
I am trying to mine a PDF of an article with rich PDF encodings and graphs. I noticed that when I mine some PDF documents, the high-frequency words turn out to be phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but with others I get these random Greek letters. Is this a problem with character encoding? (Btw, all the documents are in English.) Any suggestions?

# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                   language = "en",
                                                   id = "id1")
  content(pdf)[1:4]
}
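
To check whether the odd tokens already come out of the extraction step itself (rather than from the later tm transformations), the raw pdftotext output can be dumped directly. A small diagnostic sketch, not part of the original code, assuming pdftotext is on the PATH:

# Diagnostic sketch: run pdftotext by hand; "-" writes the extracted
# text to stdout so R can capture it as a character vector.
raw <- system2("pdftotext", args = c("-layout", uri, "-"), stdout = TRUE)
head(raw, 20)  # if phi/sigma/gamma etc. already appear here, extraction is at fault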


docs <- Corpus(URISource(uri, mode = ""),
               readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)  
docs <- tm_map(docs, tolower) 
docs <- tm_map(docs, removeWords, stopwords("english")) 

library(SnowballC)   
docs <- tm_map(docs, stemDocument)  
docs <- tm_map(docs, stripWhitespace) 
docs <- tm_map(docs, PlainTextDocument)  

dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs) 
freq <- colSums(as.matrix(dtm))   
length(freq)  
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)] 
freq[tail(ord)]
user3570187
  • I made the changes but I am still getting the Greek words delta, toe, etc. as high-frequency terms – user3570187 Sep 09 '15 at 12:53
  • As a workaround, you can remove the undesired words with `my_stopwords <- c("delta", "sigma", "gamma")`, followed by `docs <- tm_map(docs, removeWords, my_stopwords)`. Not a real solution though, as it remains unclear where these words come from. – RHertel Sep 09 '15 at 13:09
  • That is the problem. Even the low-frequency words are like aaa, zutng, zwu, zwzuz, zxanug! So we really need to figure out how the PDF is getting read by the package. – user3570187 Sep 09 '15 at 13:11
  • I'm somewhat surprised by your use of `engine="ghostscript"`. The first lines suggest that, if available, you are using the standard `xpdf` engines, `pdftotext` and `pdfinfo`. Why Ghostscript afterwards...? The variable `pdf` does not seem to be used again. I would probably have used something like `docs <- Corpus(VectorSource(pdf$content))` after the initial `readPDF` command. – RHertel Sep 09 '15 at 13:27
  • Test: copy the text with Acrobat Reader and paste it into a plain text editor. Does the same random Greek appear, or does it come in as correct text? Acrobat Reader's text encoding/decoding gives somewhere around the best possible result, and if it can't handle this PDF, the chance that your software can succeed is minuscule. – Jongware Sep 09 '15 at 15:42
  • When I copy-paste I get the normal text and no Greek terms appear! – user3570187 Sep 09 '15 at 15:49
  • I can search for words in Acrobat, but when the tm package reads this file through R it throws up random Greek words and fuzzy text. – user3570187 Sep 09 '15 at 15:56
  • I am guessing it's a problem with how Ghostscript reads/interprets the PDF. @RHertel You are in the right direction for solving this puzzle! (A side-by-side check of the two engines is sketched after these comments.) – user3570187 Sep 09 '15 at 18:08
  • I'd have to see your PDF file (at least one causing a problem) to comment. I don't speak R, so I have no idea what your code does, nor how it's invoking Ghostscript to get the text. Ghostscript *does* have a text extraction device, and it has several operating modes; I'd need to know the command line being sent to GS before I could help. Also the version of Ghostscript being used. – KenS Sep 09 '15 at 18:45
  • How do I share the PDF? – user3570187 Sep 09 '15 at 20:50
  • Here is the link to the pdf: [link](http://sciencedirect.com/science/article/pii/S0164121212000532) – user3570187 Sep 10 '15 at 00:11
  • You still haven't told me how Ghostscript is being used :-) Which page do you see a problem on? The only 'problems' I see are the EM dash and the ligatures. Of course, I'm using up-to-date software and a device designed to properly extract text. – KenS Sep 10 '15 at 18:32
  • I am using Ghostscript as an engine to read the PDF in R, and I have Ghostscript version 9.5 on a Mac. As per my understanding, Ghostscript is used by the text mining package (tm) in R to read and interpret the PDF. I am using this tm package to transform the PDF into readable text for processing and computing word frequencies. ;) – user3570187 Sep 10 '15 at 19:44
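
Following up on the comments above, here is a side-by-side check of the two engines; a sketch, assuming both xpdf's pdftotext and Ghostscript are installed, with `uri` taken from the question:

# Sketch: read the same PDF once with each engine and compare the raw text.
xpdf_reader <- readPDF(engine = "xpdf", control = list(text = "-layout"))
gs_reader   <- readPDF(engine = "ghostscript")
xpdf_doc <- xpdf_reader(elem = list(uri = uri), language = "en", id = "id1")
gs_doc   <- gs_reader(elem = list(uri = uri), language = "en", id = "id1")
head(content(xpdf_doc))  # expected: normal English text
head(content(gs_doc))    # reportedly where the phi/sigma/gamma tokens show up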

1 Answer


I think that Ghostscript is creating all the trouble here. Assuming that pdfinfo and pdftotext are properly installed, this code works without generating the weird words that you mentioned:

library(tm)
uri <- c("2012.pdf")
# read the PDF with the default xpdf engine (pdftotext)
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                 language = "en",
                                                 id = "id1")
# build the corpus from the already-extracted text instead of re-reading the file
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
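
As a quick sanity check (not part of the original answer), the most frequent stems can be listed directly; `findFreqTerms` is part of tm:

head(sort(freq, decreasing = TRUE), 10)  # top ten stems by count
findFreqTerms(dtm, lowfreq = 25)         # all stems occurring at least 25 times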

We can visualize the most frequently used words in your PDF file with a word cloud:

library(wordcloud)
wordcloud(docs, max.words=80, random.order=FALSE, scale= c(3, 0.5), colors=brewer.pal(8,"Dark2"))

[word cloud of the most frequent stemmed terms]
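
Equivalently, the frequency vector computed above can be passed to `wordcloud()` directly; a minor variation, not from the original answer, with a fixed seed to make the layout reproducible:

set.seed(42)  # word placement is randomized; fix the seed for reproducibility
wordcloud(names(freq), freq, max.words = 80, random.order = FALSE,
          scale = c(3, 0.5), colors = brewer.pal(8, "Dark2"))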

Obviously this result is not perfect, mostly because word stemming hardly ever achieves a 100% reliable result (e.g., we still have "issues" and "issue" as separate words, or "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.
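
One partial remedy is stem completion: keep an unstemmed copy of the text and map the stems back to the most frequent full form found there. A minimal sketch using tm's `stemCompletion()`; the example stems below are illustrative:

dict <- Corpus(VectorSource(pdf$content))  # unstemmed copy used as a dictionary
# type = "prevalent" picks the most frequent matching full word in the dictionary
stemCompletion(c("issu", "method"), dictionary = dict, type = "prevalent")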

RHertel