
My PDF has boxes containing data. I want to extract all the data from these boxes in R, without using OCR.

[Image: snapshot of the boxes in the PDF]

I have tried the `tabulizer` package, but its output is so disorganized that I cannot extract the data from it.

library(tabulizer)
report <- extract_tables("C:\\Users\\672158\\Desktop\\example1.pdf", encoding = "UTF-8")

  • PDF extraction is always a bit tricky. If the results it gives you keep all the data but "look" disorganised, I would keep the output and clean up the data afterwards. Could you show us what the output of `report` looks like? Preferably via `dput(report)` – Steve Jul 26 '19 at 09:05
  • I tried `dput(report)`, but the data comes out jumbled. For this image it comes out correctly, but for my PDF it is still jumbled: data from one column gets mixed into another. I have to extract the same box details from various files, but the alignment in the `dput(report)` output is different for each file. – Dinesh Mandal Jul 26 '19 at 09:47
  • Indeed, it is expected that they will be jumbled - PDF table extraction is rarely exact. Take a look at my response [here](https://stackoverflow.com/a/52188564/7856717) for a similar situation, again using `tabulizer`. That's why I asked for your output regardless of whether it's jumbled, so we could perhaps identify the problematic columns and figure out individual tricks/workarounds to clean up the tables. Also try the different extraction methods (`'stream'` or `'lattice'`) when extracting. – Steve Jul 26 '19 at 10:08
  • Could you provide a link to the PDF? – Emmanuel Hamel Sep 15 '22 at 21:40
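
The suggestions in the comments above (try both extraction methods, then clean up whatever comes back) could be sketched roughly as follows. This is only a minimal sketch, not a tested solution for the asker's file: `pdf_file` is a placeholder path, `clean_table` is a hypothetical helper introduced here for illustration, and the code assumes that `tabulizer::extract_tables()` is called with its `method` and `output` arguments. The real clean-up will likely need per-file tweaks once the jumbled columns are identified.

```r
# Placeholder path (assumption): point this at your own PDF.
pdf_file <- "example1.pdf"

# Hypothetical helper: tidy one character matrix, the shape that
# extract_tables() returns with output = "matrix".
clean_table <- function(m) {
  m[is.na(m)] <- ""                    # treat missing cells as empty
  m[] <- trimws(m)                     # strip stray whitespace, keep dimensions
  keep_rows <- rowSums(m != "") > 0    # drop rows with no content
  keep_cols <- colSums(m != "") > 0    # drop columns with no content
  as.data.frame(m[keep_rows, keep_cols, drop = FALSE],
                stringsAsFactors = FALSE)
}

# Only run the extraction if tabulizer is installed and the file exists.
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists(pdf_file)) {
  methods <- c("lattice", "stream")
  results <- lapply(methods, function(m) {
    tables <- tabulizer::extract_tables(pdf_file, method = m,
                                        output = "matrix", encoding = "UTF-8")
    lapply(tables, clean_table)
  })
  names(results) <- methods
  str(results, max.level = 2)  # compare what each method produced
}
```

Comparing `results$lattice` with `results$stream` for a few files may show which method keeps the box contents in separate columns more reliably; `"lattice"` is generally the better fit when the boxes have ruled borders, `"stream"` when the layout relies on whitespace.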
