I'm trying to extract a table from a PDF with the R tabulizer package. The functions work fine, but it can't get all the data from the entire table.
Below are my codes
library(tabulizer)
library(tidyverse)
library(abjutils)
D_path = "https://github.com/financebr/files/raw/master/Compacto09-08-2019.pdf"
out <- extract_tables(D_path,encoding = 'UTF-8')
arrumar_nomes <- function(x) {
x %>%
tolower() %>%
str_trim() %>%
str_replace_all('[[:space:]]+', '_') %>%
str_replace_all('%', 'p') %>%
str_replace_all('r\\$', '') %>%
abjutils::rm_accent()
}
tab_tidy <- out %>%
map(as_tibble) %>%
bind_rows() %>%
set_names(arrumar_nomes(.[1,])) %>%
slice(-1) %>%
mutate_all(funs(str_replace_all(., '[[:space:]]+', ' '))) %>%
mutate_all(str_trim)
Comparing the PDF table (D_path
) with the tab_tidy
database you can see that some information was missing. All first columns, which are merged, are not found during extract_tables()
. Also, all lines that contain “Boi Gordo” and “Boi Magro” information are not found by the function either.
The rest is in perfect condition. Would you know why and how to solve it? The questions here in the forum dealing with this do not have much answer.