I have been practicing with the tabulizer package in R and have run into the following problem. Unfortunately I can't offer a reproducible example, as the PDF is my firm's property, but I will describe the problem in detail.

I'm trying to read a PDF that has a start and end date in the upper-right corner. When I open the PDF, they look normal:

Start: 01-Mar-2018
  End: 31-Mar-2018

Now the fun part. When I highlight them and copy them with Ctrl+C, this is the result when pasted into R:

:tttt: 11-rrr-8118
tt:: 11-rrr-8118

This is exactly the same kind of nonsense that extract_text(path, pages=1) gives: a lot of t::ttttt:ttt... My question is: is there some kind of security on this PDF, do I just need to figure out the correct encoding, or, because this PDF is created automatically by a system, is there some weird notation applied to everything?

zx8754
Hakki

1 Answer

I figured it out. It turns out this PDF is built mainly from metadata (which I didn't know), and a great tool in R for accessing PDF metadata is pdftools.

library(pdftools)

# path.pdf holds the path to the PDF file
pdf_info(path.pdf)

and from its output you can wrangle out all the important metadata bits.
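
For anyone wanting something to start from, here is a minimal sketch of pulling individual pieces out of the pdf_info() result. The helper name summarise_pdf_info and the throwaway demo PDF are my own assumptions for illustration (the real file can't be shared); the list elements used (encrypted, pages, keys, created, modified) are ones pdf_info() returns. The encrypted flag is also a quick way to check the "is there some security on this PDF" part of the question.

```r
library(pdftools)

# Hypothetical helper (not part of pdftools): collect the pdf_info() fields
# that are most useful for checking security settings and document dates.
summarise_pdf_info <- function(path) {
  info <- pdf_info(path)
  list(
    encrypted = info$encrypted,  # TRUE if the PDF is encrypted
    pages     = info$pages,      # number of pages
    keys      = info$keys,       # entries from the document info dictionary
    created   = info$created,    # creation date
    modified  = info$modified    # last modification date
  )
}

# Demo on a throwaway PDF written by base R, standing in for the real file.
demo_path <- tempfile(fileext = ".pdf")
grDevices::pdf(demo_path, title = "demo report")
plot(1:3)
invisible(dev.off())

demo <- summarise_pdf_info(demo_path)
str(demo)
```

With a real file, replace demo_path with your own path. Whether the dates you need end up in keys, created or modified depends on how the generating system wrote the file, so it is worth inspecting the whole list first.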

Hakki