
original pdf files

I am trying to use extract_tables from the tabulizer package.

library(tabulizer)
setwd("directory")
pdf_file <- "filenames.pdf"
cle <- extract_tables(pdf_file, pages = 47, method = "stream", encoding = "UTF-8")

This code is all I needed to call the extract_tables function.

However, there is a critical problem: it automatically merges some columns. (Images: capture of the PDF table; capture of the R output.)

The two images show the situation: columns 6 and 7 of the PDF table are merged in the R output.

Instead of

0.9000 | -

0.6450 | -

0.7470 | -

the two columns are merged like this:

0.9000-

0.6450-

0.7470-

So I am looking for a general method that keeps these columns separate.

Therefore I tried passing a different input to the function, like this.
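Until a general fix turns up, one fallback would be to split the merged cells after extraction. This is only a minimal sketch in base R, assuming the merged cells always look like a number immediately followed by a dash, as in the three values above. The name split_merged is a hypothetical helper, not part of tabulizer.

```r
# Hypothetical helper: split cells such as "0.9000-" into the numeric part
# and the trailing dash. Cells that do not match the pattern are returned
# unchanged, with NA in the second column.
split_merged <- function(x) {
  pattern <- "^(-?[0-9]+(\\.[0-9]+)?)(-)$"
  hit <- grepl(pattern, x)
  value <- ifelse(hit, sub(pattern, "\\1", x), x)
  mark <- ifelse(hit, "-", NA_character_)
  data.frame(value = value, mark = mark, stringsAsFactors = FALSE)
}

merged <- c("0.9000-", "0.6450-", "0.7470-")
split_merged(merged)
```

This repairs the output rather than the extraction itself, so it only helps when the merged pattern is this predictable.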

library(pdftools)
library(tabulizer)
files <- list.files(pattern = "pdf$")

opinions <- lapply(files, pdf_text)

cle <- extract_tables(opinions[[2]][47],method="stream", encoding="UTF-8")

*!Error in normalizePath(path.expand(path), winslash, mustWork) :*
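As far as I can tell, the error appears because extract_tables expects the path to a PDF file as its first argument, while pdf_text returns the page text itself, so the text ends up being treated as a file name. A sketch of keeping the file path and choosing the page through pages instead is given here. The columns values are hypothetical x-coordinates of the column separators and would have to be measured on the real page; my understanding of the tabulizer documentation is that columns is only used together with guess = FALSE.

```r
pdf_file <- "filenames.pdf"  # path to the PDF, not its extracted text

# Hypothetical separator positions for page 47. These numbers are
# placeholders, not measurements from the actual document.
column_breaks <- list(c(120, 200, 280, 360, 440, 500, 560))

# Only run when tabulizer is installed and the file is present.
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists(pdf_file)) {
  cle <- tabulizer::extract_tables(
    pdf_file,
    pages = 47,
    guess = FALSE,            # needed for columns to take effect
    columns = column_breaks,
    method = "stream",
    encoding = "UTF-8"
  )
}
```

Whether fixed separators count as a general method depends on how similar the table layouts are across documents.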

Please leave a solution if you know how to fix this. Thanks.

zx8754
user13232877
  • Did you try pdftools? If you could add the pdf, we could check and suggest. – Mohanasundaram Apr 22 '20 at 04:29
  • Yes, pdftools and tabulizer are what I used. I attached a link in the first line so that you can see the original pdf files. If you can't see the files, try this link: https://drive.google.com/file/d/139CuCBgwzJSRyj4WDX3axjh2ZPvLpBs7/view?usp=sharing If you cannot access this link either, please leave a reply. – user13232877 Apr 22 '20 at 14:12
  • Try this: library("tesseract"); tesseract_download("kor"); df <- ocr("filename.pdf", engine = "kor") – Mohanasundaram Apr 22 '20 at 23:39
  • Every time I run ocr("filename.pdf", engine = "kor"), a pop-up appears: 'R Session Aborted. R encountered a fatal error. The session was terminated.' I guess it is related to my computing power, or to the PDF having so many pages. – user13232877 Apr 23 '20 at 01:18
  • Maybe... Did you try with a single page pdf? – Mohanasundaram Apr 23 '20 at 01:44
  • No, because I have to deal with investment-related documents, and every document has more than 30 pages. Now I'm trying to pass only the specific PDF page to the ocr function to save time and memory. – user13232877 Apr 23 '20 at 03:28
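
Restricting OCR to a single page, as described in the last comment, could look roughly like this. It is a sketch that assumes the pdftools and tesseract packages are installed and that the Korean training data has already been fetched with tesseract_download("kor"). The file name, page number, and dpi value are placeholders.

```r
pdf_file <- "filename.pdf"  # placeholder file name
page_no <- 47               # placeholder: the page that holds the table

have_pkgs <- requireNamespace("pdftools", quietly = TRUE) &&
  requireNamespace("tesseract", quietly = TRUE)

if (have_pkgs && file.exists(pdf_file)) {
  # Render only the requested page to a PNG instead of the whole document.
  png_file <- pdftools::pdf_convert(
    pdf_file, format = "png", pages = page_no, dpi = 300
  )

  # Run OCR on that single image with the Korean engine.
  kor <- tesseract::tesseract("kor")
  page_text <- tesseract::ocr(png_file, engine = kor)
  cat(page_text)
}
```

Rendering one page at a time should keep memory use well below that of a 30-page document, though it may not be what caused the crash.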
