I'm trying to extract a table from a PDF with the R tabulizer package. The functions run without errors, but they do not capture all the data in the table.

Here is my code:

library(tabulizer)
library(tidyverse)
library(abjutils)

D_path = "https://github.com/financebr/files/raw/master/Compacto09-08-2019.pdf"

out <- extract_tables(D_path,encoding = 'UTF-8')

arrumar_nomes <- function(x) {
  x %>% 
    tolower() %>% 
    str_trim() %>% 
    str_replace_all('[[:space:]]+', '_') %>% 
    str_replace_all('%', 'p') %>% 
    str_replace_all('r\\$', '') %>% 
    abjutils::rm_accent()
}

tab_tidy <- out %>%
  map(as_tibble) %>% 
  bind_rows() %>% 
  set_names(arrumar_nomes(.[1,])) %>%
  slice(-1) %>% 
  mutate_all(~ str_replace_all(.x, '[[:space:]]+', ' ')) %>% 
  mutate_all(str_trim)

Comparing the PDF table (D_path) with the tab_tidy data frame shows that some information is missing. The first column of each table, whose cells are merged, is not picked up by extract_tables(). The rows containing “Boi Gordo” and “Boi Magro” are not picked up either.

The rest is extracted correctly. Do you know why this happens and how to fix it? The existing questions on this topic here have few answers.

zx8754
bubble
  • The answer is, tabulizer just doesn't work that well sometimes (rarely perfectly). It is related to the underlying Tabula software, not the R implementation. It depends on how the table is created in the PDF, where the characters are, etc. Tabula just interprets that underlying structure, and it doesn't always come out perfect. – moman822 Aug 12 '19 at 19:41
  • Though you could try `tabulizer::extract_areas()`, which lets you select the boundaries of the table interactively. In my experience, `extract_tables()` and `extract_areas()` sometimes give slightly different output, and one can be better than the other. Can't say why. – moman822 Aug 12 '19 at 19:59
  • @moman822 I had already used `extract_areas()` and it worked. However, I need something more automatic; I can't keep selecting the area with the mouse every time. – bubble Aug 12 '19 at 20:50
  • Well, `extract_tables()` has an optional `area` argument where you can specify the region (as you do when clicking via `extract_areas()`), so if the table sits in the same area on a number of pages you could specify it that way and loop over your pages/docs. I don't know what scale the coordinates use for that argument. If the tables differ from page to page, this may not work. – moman822 Aug 13 '19 at 14:46
  • The link to the PDF is not valid anymore. Is it possible to provide a new link? – Emmanuel Hamel Sep 16 '22 at 20:17
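
The fixed-area suggestion from the comments could be sketched roughly as follows. This is only an illustration and has not been run against the original PDF, whose link is dead. The coordinates in `table_area` are made-up placeholders, and `areas_for_pages()` and `extract_fixed_area()` are hypothetical helper names. It assumes the `area` argument of `extract_tables()` takes one `c(top, left, bottom, right)` vector per page (to my understanding in PDF points) and that `guess = FALSE` is required for `area` to take effect. The real coordinates can be captured once with `tabulizer::locate_areas()` and then reused without the mouse.

```r
# Hypothetical coordinates, c(top, left, bottom, right); replace them with
# values obtained once from tabulizer::locate_areas() for your own PDF.
table_area <- c(90, 20, 560, 820)

# Build one area entry per page, the shape assumed for the `area` argument.
areas_for_pages <- function(area, pages) {
  rep(list(area), length(pages))
}

# Extract the same fixed region from each requested page.
# guess = FALSE because `area` is ignored when guess = TRUE.
extract_fixed_area <- function(path, pages, area = table_area) {
  tabulizer::extract_tables(
    path,
    pages  = pages,
    area   = areas_for_pages(area, pages),
    guess  = FALSE,
    output = "matrix"
  )
}

# Usage (not run here; D_path is the path from the question):
# out <- extract_fixed_area(D_path, pages = 1:2)
```

The result has the same list-of-matrices shape as the `out` object in the question, so the rest of the pipeline should be able to stay as it is. If the table position changes from page to page, a single fixed area will not be enough.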
