
Out of curiosity, is there a way to read PDF metadata -- such as the information shown below -- from R?

I could not find anything about that by searching for [r] pdf metadata in the current question base. Any pointers very welcome!

[Image: screenshot of the PDF metadata in question]

Fr.
  • Take a look at `readPDF` from the tm package. – Jilber Urbina Oct 26 '13 at 11:51
  • Thanks -- looks brilliant but not that obvious to use, I'll report back when I manage to write the code that extracts, e.g., the content producer. – Fr. Oct 26 '13 at 11:53
  • `file.info()` will get you some of that info – GSee Oct 26 '13 at 16:47
  • @GSee: sorry, I should have mentioned that I am not interested in file size, just in PDF producer and the like. – Fr. Oct 27 '13 at 09:52
  • @Jilber I've also taken a good look at what you suggested, but it involves installing extra libraries and some coding magic—unfortunately, I'm no magician (esp. when you have to compile the stuff before use). The answer seems, so far, that there is no easy/pure-R way of doing this. – Fr. Oct 27 '13 at 09:52
  • @Fr., Did my answer help get you started in the right direction? If there's more you're looking for, let me know and I can try to look into it further. – A5C1D2H2I1M1N2O1R2T1 Nov 10 '13 at 10:10
  • @AnandaMahto, your answer is probably the best way to do it indeed. I'm not sure PDFtk would work for me (Mac), but there's probably some equivalent out there. Thanks! – Fr. Nov 23 '13 at 14:25
  • @Fr., not sure I follow. PDFtk is available for the Mac too. See [here](http://www.pdflabs.com/tools/pdftk-server/). – A5C1D2H2I1M1N2O1R2T1 Nov 23 '13 at 14:32
  • @AnandaMahto I was suspicious that the Mac version would have trouble with file encodings, but that was too pessimistic, the script missed only 5 PDF files out of 142. Thanks again. – Fr. Dec 03 '13 at 10:09

1 Answer


I can't think of a pure R way to do this, but you can probably install your favorite PDF command-line tool (for example, the PDF toolkit, PDFtk) and use that to get at least some of the data you are looking for.

The following is a basic example using PDFtk. It assumes that pdftk is accessible in your path.

```r
x <- getwd() ## I'll run this example in a tempdir to keep things clean
setwd(tempdir())
list.files(pattern = "\\.(txt|pdf)$")
# character(0)

## Create a small PDF to inspect
pdf(file = "SomeOutputFile.pdf")
plot(rnorm(100))
dev.off()

## Dump the PDF's metadata to a text file (pdftk must be on your path)
system("pdftk SomeOutputFile.pdf dump_data output SomeOutputFile.txt")
list.files(pattern = "\\.(txt|pdf)$")
# [1] "SomeOutputFile.pdf" "SomeOutputFile.txt"

readLines("SomeOutputFile.txt")
#  [1] "InfoBegin"                    "InfoKey: Creator"            
#  [3] "InfoValue: R"                 "InfoBegin"                   
#  [5] "InfoKey: Title"               "InfoValue: R Graphics Output"
#  [7] "InfoBegin"                    "InfoKey: Producer"           
#  [9] "InfoValue: R 3.0.1"           "InfoBegin"                   
# [11] "InfoKey: ModDate"             "InfoValue: D:20131102170720" 
# [13] "InfoBegin"                    "InfoKey: CreationDate"       
# [15] "InfoValue: D:20131102170720"  "NumberOfPages: 1"            
# [17] "PageMediaBegin"               "PageMediaNumber: 1"          
# [19] "PageMediaRotation: 0"         "PageMediaRect: 0 0 504 504"  
# [21] "PageMediaDimensions: 504 504"

setwd(x) ## Restore the original working directory
```

I'd look into what other options there are to specify what metadata gets extracted, and see if there's a convenient way to parse this information into a form that is more useful for you.
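As one possible starting point for the parsing, here is a minimal sketch that pairs up the `InfoKey` and `InfoValue` lines into a named character vector. The helper name `parse_pdftk_info` is hypothetical (it is not part of any package), and the sketch assumes every `InfoKey` line has a matching `InfoValue` line in the same order, which holds for the output above but may not hold for every PDF. The sample lines are copied from that output so the snippet runs without PDFtk installed.

```r
# Sample lines copied from the pdftk output above; in practice this would be
# readLines("SomeOutputFile.txt").
dump <- c(
  "InfoBegin", "InfoKey: Creator",      "InfoValue: R",
  "InfoBegin", "InfoKey: Title",        "InfoValue: R Graphics Output",
  "InfoBegin", "InfoKey: Producer",     "InfoValue: R 3.0.1",
  "InfoBegin", "InfoKey: ModDate",      "InfoValue: D:20131102170720",
  "InfoBegin", "InfoKey: CreationDate", "InfoValue: D:20131102170720",
  "NumberOfPages: 1"
)

# Hypothetical helper: turn InfoKey/InfoValue pairs into a named vector.
parse_pdftk_info <- function(lines) {
  keys   <- sub("^InfoKey: ",   "", grep("^InfoKey: ",   lines, value = TRUE))
  values <- sub("^InfoValue: ", "", grep("^InfoValue: ", lines, value = TRUE))
  # Assumption: keys and values come in matching pairs, in the same order.
  stopifnot(length(keys) == length(values))
  setNames(values, keys)
}

info <- parse_pdftk_info(dump)
info[["Producer"]]
# [1] "R 3.0.1"
```

With many files, the same helper could be applied to each dump file in turn, for example with `lapply()`, to collect the producer of every PDF.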

A5C1D2H2I1M1N2O1R2T1