
I am trying to mine a PDF of an article with rich PDF encodings and graphs. I noticed that when I mine some PDF documents, the high-frequency words turn out to be phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but with others I get these random Greek letters. Is this a problem with character encoding? (Btw, all the documents are in English.) Any suggestions?

# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                   language = "en",
                                                   id = "id1")
  content(pdf)[1:4]
}
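
To check whether the odd tokens already come out of the extraction step itself (rather than from the later tm transformations), the raw pdftotext output can be dumped directly. A small diagnostic sketch, not part of the original code, assuming pdftotext is on the PATH:

# Diagnostic sketch: run pdftotext by hand; "-" writes the extracted
# text to stdout so R can capture it as a character vector.
raw <- system2("pdftotext", args = c("-layout", uri, "-"), stdout = TRUE)
head(raw, 20)  # if phi/sigma/gamma etc. already appear here, extraction is at fault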


docs <- Corpus(URISource(uri, mode = ""),
               readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)  
docs <- tm_map(docs, tolower) 
docs <- tm_map(docs, removeWords, stopwords("english")) 

library(SnowballC)   
docs <- tm_map(docs, stemDocument)  
docs <- tm_map(docs, stripWhitespace) 
docs <- tm_map(docs, PlainTextDocument)  

dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs) 
freq <- colSums(as.matrix(dtm))   
length(freq)  
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)] 
freq[tail(ord)]
user3570187
  • I made the changes but I am still getting the Greek words delta, toe, etc. as high-frequency terms – user3570187 Sep 09 '15 at 12:53
  • As a workaround, you can remove the undesired words with `my_stopwords <- c("delta", "sigma", "gamma")`, followed by `docs <- tm_map(docs, removeWords, my_stopwords)`. Not a real solution though, as it remains unclear where these words come from. – RHertel Sep 09 '15 at 13:09
  • That is the problem. Even the low-frequency words are like aaa, zutng, zwu, zwzuz, zxanug! So we really need to figure out how the PDF is getting read by the package. – user3570187 Sep 09 '15 at 13:11
  • I'm somewhat surprised by your use of `engine="ghostscript"`. The first lines suggest that, if available, you are using the standard `xpdf` engines, `pdftotext` and `pdfinfo`. Why Ghostscript afterwards...? The variable `pdf` does not seem to be used again. I would probably have used something like `docs <- Corpus(VectorSource(pdf$content))` after the initial `readPDF` command. – RHertel Sep 09 '15 at 13:27
  • Test: copy the text with Acrobat Reader and paste it into a plain text editor. Does the same random Greek appear, or does it come in as correct text? Acrobat Reader's text encoding/decoding gives somewhere around the best possible result, and if it can't handle this PDF, the chance that your software can succeed is minuscule. – Jongware Sep 09 '15 at 15:42
  • When I copy-paste I get the normal text and no Greek terms appear! – user3570187 Sep 09 '15 at 15:49
  • I can search for words in Acrobat, but when the tm package reads this file through R it throws up random Greek words and fuzzy text. – user3570187 Sep 09 '15 at 15:56
  • I am guessing it's a problem with how Ghostscript reads/interprets the PDF. @RHertel You are in the right direction for solving this puzzle! (A side-by-side check of the two engines is sketched after these comments.) – user3570187 Sep 09 '15 at 18:08
  • I'd have to see your PDF file (at least one causing a problem) to comment. I don't speak R, so I have no idea what your code does, nor how it's invoking Ghostscript to get the text. Ghostscript *does* have a text extraction device, and it has several operating modes; I'd need to know the command line being sent to GS before I could help. Also the version of Ghostscript being used. – KenS Sep 09 '15 at 18:45
  • How do I share the PDF? – user3570187 Sep 09 '15 at 20:50
  • Here is the link to the pdf: [link](http://sciencedirect.com/science/article/pii/S0164121212000532) – user3570187 Sep 10 '15 at 00:11
  • You still haven't told me how Ghostscript is being used :-) Which page do you see a problem on? The only 'problems' I see are the EM dash and the ligatures. Of course, I'm using up-to-date software and a device designed to properly extract text. – KenS Sep 10 '15 at 18:32
  • I am using Ghostscript as an engine to read the PDF in R, and I have Ghostscript version 9.5 on a Mac. As per my understanding, Ghostscript is used by the text mining package (tm) in R to read and interpret the PDF. I am using this tm package to transform the PDF into readable text for processing and computing word frequencies. ;) – user3570187 Sep 10 '15 at 19:44
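
Following up on the comments above, here is a side-by-side check of the two engines; a sketch, assuming both xpdf's pdftotext and Ghostscript are installed, with `uri` taken from the question:

# Sketch: read the same PDF once with each engine and compare the raw text.
xpdf_reader <- readPDF(engine = "xpdf", control = list(text = "-layout"))
gs_reader   <- readPDF(engine = "ghostscript")
xpdf_doc <- xpdf_reader(elem = list(uri = uri), language = "en", id = "id1")
gs_doc   <- gs_reader(elem = list(uri = uri), language = "en", id = "id1")
head(content(xpdf_doc))  # expected: normal English text
head(content(gs_doc))    # reportedly where the phi/sigma/gamma tokens show up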

1 Answer


I think that Ghostscript is creating all the trouble here. Assuming that pdfinfo and pdftotext are properly installed, this code works without generating the weird words that you mentioned:

library(tm)
uri <- c("2012.pdf")
# read the PDF with the default xpdf engine (pdftotext)
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                 language = "en",
                                                 id = "id1")
# build the corpus from the already-extracted text instead of re-reading the file
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
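
As a quick sanity check (not part of the original answer), the most frequent stems can be listed directly; `findFreqTerms` is part of tm:

head(sort(freq, decreasing = TRUE), 10)  # top ten stems by count
findFreqTerms(dtm, lowfreq = 25)         # all stems occurring at least 25 times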

We can visualize the most frequently used words in your PDF file with a word cloud:

library(wordcloud)
wordcloud(docs, max.words=80, random.order=FALSE, scale= c(3, 0.5), colors=brewer.pal(8,"Dark2"))

[word cloud of the most frequent stemmed terms]
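
Equivalently, the frequency vector computed above can be passed to `wordcloud()` directly; a minor variation, not from the original answer, with a fixed seed to make the layout reproducible:

set.seed(42)  # word placement is randomized; fix the seed for reproducibility
wordcloud(names(freq), freq, max.words = 80, random.order = FALSE,
          scale = c(3, 0.5), colors = brewer.pal(8, "Dark2"))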

Obviously this result is not perfect, mostly because word stemming hardly ever achieves a 100% reliable result (e.g., we still have "issues" and "issue" as separate words, or "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.
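
One partial remedy is stem completion: keep an unstemmed copy of the text and map the stems back to the most frequent full form found there. A minimal sketch using tm's `stemCompletion()`; the example stems below are illustrative:

dict <- Corpus(VectorSource(pdf$content))  # unstemmed copy used as a dictionary
# type = "prevalent" picks the most frequent matching full word in the dictionary
stemCompletion(c("issu", "method"), dictionary = dict, type = "prevalent")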

RHertel