Is there a different way to use the extract_tables function in R so that columns are not merged?

Question

I am trying to use extract_tables from the tabulizer package.

library(tabulizer)
setwd("directory")
pdf_file <- "filenames.pdf"
cle <- extract_tables(pdf_file, pages=47 ,method="stream", encoding="UTF-8")

This is all the code I need to call extract_tables.

However, there is a critical problem: it automatically merges some columns.

The two images show the situation. Columns 6 and 7 of the table in the PDF are merged in the output.

Instead of

0.9000 | -

0.6450 | -

0.7470 | -

the two columns are merged like this:

0.9000-

0.6450-

0.7470-

So I am looking for a general method that does not produce a table like this.
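One possible workaround, which cleans up the result rather than changing the extraction, is to split the merged cells afterwards. The sketch below uses only base R and assumes the merged cells always look like a number immediately followed by a dash, as in the values above. The function name split_merged_cell and the column names value and marker are made up for illustration.

```r
# Split cells such as "0.9000-" into the numeric part and the trailing dash.
# Cells that do not match the pattern are returned unchanged, with NA as marker.
split_merged_cell <- function(x) {
  pattern <- "^\\s*([0-9]+\\.?[0-9]*)\\s*(-)\\s*$"
  merged <- grepl(pattern, x)
  value <- ifelse(merged, sub(pattern, "\\1", x), x)
  marker <- ifelse(merged, "-", NA_character_)
  data.frame(value = value, marker = marker, stringsAsFactors = FALSE)
}

# Example with the merged values shown above
merged_col <- c("0.9000-", "0.6450-", "0.7470-")
split_merged_cell(merged_col)
```

This only helps when the merged column has a predictable pattern, so it is not a fully general solution.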

So I tried passing a different input to the function, like this:

library(pdftools)
library(tabulizer)
files <- list.files(pattern = "pdf$")

opinions <- lapply(files, pdf_text)

cle <- extract_tables(opinions[[2]][47],method="stream", encoding="UTF-8")

*!Error in normalizePath(path.expand(path), winslash, mustWork) :*
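The error is probably caused by the first argument: as far as I can tell, extract_tables expects the path to a PDF file, while pdf_text returns the text of each page as a character vector, so the page text ends up being treated as a path. A minimal sketch of passing the file path and selecting the page with the pages argument instead is shown below. The helper name extract_page_tables is hypothetical, and page 47 is only assumed to exist in every file.

```r
# Hypothetical helper: extract the tables on one page of one PDF file.
# The file path is passed directly; the page is chosen with `pages`.
extract_page_tables <- function(path, page) {
  tabulizer::extract_tables(path, pages = page,
                            method = "stream", encoding = "UTF-8")
}

# Apply the helper to every PDF in the working directory
files <- list.files(pattern = "pdf$")
tables <- lapply(files, extract_page_tables, page = 47)
```

This removes the normalizePath error but does not by itself fix the merged columns.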

If you know how to solve this, please leave an answer. Thanks.

Did you try pdftools? If you could add the PDF, we could check and suggest something. - Mohanasundaram, Apr 22 '20 at 04:29
Yes, pdftools and tabulizer are what I used. I attached a link in the first line so that you can see the original PDF files. If you can't see the files, try this link: https://drive.google.com/file/d/139CuCBgwzJSRyj4WDX3axjh2ZPvLpBs7/view?usp=sharing If you still can't access it, please leave a reply. - user13232877, Apr 22 '20 at 14:12
Try this: library("tesseract"); tesseract_download("kor"); df <- ocr("filename.pdf", engine = "kor") - Mohanasundaram, Apr 22 '20 at 23:39
Every time I run ocr("filename.pdf", engine = "kor"), a pop-up appears: 'R Session Aborted. R encountered a fatal error. The session was terminated.' I guess it is related to my computing power, or to the PDF having so many pages. - user13232877, Apr 23 '20 at 01:18
No, because I have to deal with investment-related documents, and every document has more than 30 pages. Now I'm trying to pass only the specific PDF page, instead of "filename.pdf", to the ocr function to save time and memory. - user13232877, Apr 23 '20 at 03:28
