I'm trying to extract the text of a table from a PDF file that is written in Korean. I used the tabulizer library to extract the text.
My code is:
library(pdftools)
library(tidytext)
library(dplyr)
library(janeaustenr)
library(rJava)
library(tabulizer)
library(tidyverse)
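setwd("C:/Users/user/Desktop/Test")  # directory that contains the PDF files
files <- list.files(pattern = "pdf$")  # names of the PDF files in that directory
f2 <- files[6]  # the PDF I am working with
e <- extract_text(f2, pages = 25, encoding = "UTF-8")  # text of page 25 as a single string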
The problem is that the table in the PDF does not come out in the layout I want. I want to extract the data vertically (column by column), but the extract_text function returns the strings horizontally (row by row).
Here is the output of extract_text:
나. 집합투자기구에 부과되는 보수 및 비용 \r\n구분 \r\n지급비율(연간, %) \r\n지급시기 \r\nC(수수료미징\r\n구-오프라인) \r\nW(수수료미징\r\n구-오프라인-\r\n랩) \r\ne(수수료미징구\r\n-온라인) \r\nI(수수료미징구-\r\n오프라인-기관) \r\nC-P(수수료미\r\n징구-오프라인-\r\n개인연금) \r\nC-P2(수수료미\r\n징구-오프라인-\r\n퇴직연금) \r\n집합투자업자 보수 0.46 0.46 0.46 0.46 0.46 0.46 \r\n매 3개월 \r\n판매회사 보수 1.00 0.00 0.98 0.03 0.95 0.85 \r\n수탁회사 보수 0.025 0.025 0.025 0.025 0.025 0.025 \r\n일반사무관리회사 보\r\n수 \r\n0.005 0.005 0.005 0.005 0.005 0.005 \r\n총 보수 1.49 0.49 1.47 0.52 1.44 1.34 - \r\n기타비용 0.002 0.002 0.002 0.002 0.002 0.002 사유 발생 시 \r\n총 보수․비용 1.492 0.492 1.472 0.522 1.442 1.342 - \r\n(동종유형 총 보수) 1.59 - 1.22 - - - - \r\n총 보수․비용 \r\n(피투자 집합투자기구 보수 포함) 1.493 0.493 1.473 0.523 1.443 1.343 - \r\n증권거래비용 0.107 0.108 0.105 0.108 0.106 0.104 사유 발생 시 \r\n구분 지급비율(연간, %) 지급시기 \r\n"
To be more specific, I have attached a screenshot.
Again, what I want to extract is the vertical selection (red circle), but extract_text organizes the text horizontally (blue circle).
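To make the goal concrete, here is a minimal base R sketch of one way to get from the horizontal string to a vertical selection. It is only an assumption about how the parsing could be done, not a tabulizer feature: it uses a shortened copy of the output above, keeps only the lines that contain exactly six numeric values, and treats the six values as the classes C, W, e, I, C-P and C-P2 from the table header. The names class_names, is_num, values and row_labels are made up for this illustration. Rows whose label is wrapped onto a separate line, or that contain "-" in place of a number, would need extra handling.

```r
# Shortened copy of the extract_text() output shown above (sample input).
e <- paste0(
  "집합투자업자 보수 0.46 0.46 0.46 0.46 0.46 0.46 \r\n",
  "매 3개월 \r\n",
  "판매회사 보수 1.00 0.00 0.98 0.03 0.95 0.85 \r\n",
  "총 보수 1.49 0.49 1.47 0.52 1.44 1.34 - \r\n",
  "기타비용 0.002 0.002 0.002 0.002 0.002 0.002 사유 발생 시 \r\n"
)

# Fund classes taken from the table header, in the order the values appear.
class_names <- c("C", "W", "e", "I", "C-P", "C-P2")

# One element per extracted line, then one token vector per line.
lines  <- trimws(strsplit(e, "\r\n", fixed = TRUE)[[1]])
tokens <- strsplit(lines, "\\s+")

# TRUE for tokens that look like plain numbers such as 0.46 or 1.
is_num <- function(x) grepl("^[0-9]+(\\.[0-9]+)?$", x)

# Keep only the lines that carry exactly one value per class.
keep <- vapply(tokens, function(tk) sum(is_num(tk)) == length(class_names), logical(1))
rows <- tokens[keep]

# Numeric matrix: one row per fee type, one column per class.
values <- t(vapply(rows, function(tk) as.numeric(tk[is_num(tk)]),
                   numeric(length(class_names))))

# Row label = everything before the first numeric token.
row_labels <- vapply(rows, function(tk) {
  first_num <- which(is_num(tk))[1]
  paste(tk[seq_len(first_num - 1)], collapse = " ")
}, character(1))
dimnames(values) <- list(row_labels, class_names)

# The vertical read: every fee for class C, stored in a variable.
fees_class_c <- values[, "C"]
```

With the values in a matrix, a whole column such as values[, "W"] or a single cell such as values["판매회사 보수", "e"] can be looked up later without re-reading the PDF.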
Also, if you know how to organize the text the way cat(e, sep="\n") prints it, please leave a comment. The cat function only prints its output, so I cannot store the result in a variable. I want the text clearly organized in a variable, so that whenever I need some information I can go to that variable and look it up. That is what I need.
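For the second point, a minimal sketch of what storing the lines could look like, assuming e is the single string returned by extract_text and that lines are separated by "\r\n" as in the output above. The short sample string and the variable name text_lines are illustrative only.

```r
# Short sample in the same shape as the extract_text() output above.
e <- "구분 \r\n지급비율(연간, %) \r\n지급시기 \r\n"

# Split on the line separator instead of printing with cat();
# the result is a character vector with one element per line.
text_lines <- strsplit(e, "\r\n", fixed = TRUE)[[1]]

# Drop the trailing spaces that the extraction leaves on each line.
text_lines <- trimws(text_lines)

# Individual lines can now be looked up by position at any time.
second_line <- text_lines[2]

# Printing the stored lines gives the same view as cat(e, sep = "\n").
writeLines(text_lines)
```

Unlike cat, which returns NULL and only writes to the console, strsplit returns the pieces, so they stay available in text_lines for later use.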