Scraping PDF tables based on title

Question

I am trying to extract one table from each of 31 PDFs. The table titles all start the same way, but the ending varies by region.

For one document the title is "Table 13.1: Total Number of Households Engaged in Agriculture by District, Rural and Urban Residence During 2011/12 Agriculture Year; Arusha Region, 2012 Census". Another would be "Table 13.1: Total Number of Households Engaged in Agriculture by District, Rural and Urban Residence During 2011/12 Agriculture Year; Dodoma Region, 2012 Census."

I used tabulizer to scrape the first table manually, based on the specific text lines I need, but given the similar naming conventions I was hoping to automate the process.

```
library(pdftools)  # pdf_text()
library(stringr)   # str_squish(), str_replace_all(); also provides %>%

PATH2 <- "Regions/02. Arusha Regional Profile.pdf"

txt2 <- pdf_text(PATH2) %>%
  readr::read_lines()

specific_lines2 <- txt2[4621:4639] %>%
  str_squish() %>%
  str_replace_all(",", "") %>%
  strsplit(split = " ")
```

score 1 · Answer 1 · answered Aug 27 '20 at 20:48

What: You can find the page that contains the common part of the title in each file and extract the data from that page (this assumes the title occurs only once per file).

How: Write a function that gets the table from one PDF, then use lapply to run it on every PDF.

Example:

First, define a function that finds the page containing the title and extracts the text from it.

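get_page_text <- function(url, word_find) {
  txt <- pdftools::pdf_text(url)                     # one string per page
  p <- grep(word_find, txt, ignore.case = TRUE)[1]   # first page containing the title
  if (is.na(p)) stop("Title not found in ", url)     # fail clearly instead of passing NA on
  L <- tabulizer::extract_text(url, pages = p)       # text of that page
  i <- which.max(lengths(L))
  data.frame(L[[i]])
}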

Second, get file names.

setwd("C:/Users/xyz/Regions")
files <- list.files(pattern = "\\.pdf$", ignore.case = TRUE) # Get the PDF file names in the Regions folder.

Then, the "loop" (lapply) to run the function for each pdf.

reports <- lapply(files,
                  get_page_text,
                  word_find = "Table 13.1: Total Number of Households Engaged in Agriculture by District, Rural and Urban Residence During 2011/12 Agriculture Year")

The result is a list with one data.frame per PDF. The next step is cleaning up the data.
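
For the cleanup, here is a minimal sketch in base R that follows the same squish, strip-commas, split-on-spaces steps as the code in the question. The helper name `parse_table_lines`, the `n_values` argument, and the sample lines are all made up for illustration; it assumes every data row is a text label followed by a fixed number of numeric columns, which may not hold for your tables.

```r
# Hypothetical helper: turn raw table lines into a data.frame.
# Assumes each data row is a label followed by `n_values` numeric columns.
parse_table_lines <- function(lines, n_values) {
  lines <- gsub("\\s+", " ", trimws(lines))    # collapse runs of whitespace
  lines <- gsub(",", "", lines, fixed = TRUE)  # drop thousands separators
  lines <- lines[nzchar(lines)]                # discard empty lines
  rows <- lapply(strsplit(lines, " ", fixed = TRUE), function(tokens) {
    k <- length(tokens)
    if (k <= n_values) return(NULL)            # too short to be a data row
    values <- suppressWarnings(as.numeric(tokens[(k - n_values + 1):k]))
    if (anyNA(values)) return(NULL)            # header or footnote line
    label <- paste(tokens[seq_len(k - n_values)], collapse = " ")
    value_cols <- setNames(as.list(values), paste0("v", seq_len(n_values)))
    as.data.frame(c(list(label = label), value_cols), stringsAsFactors = FALSE)
  })
  do.call(rbind, Filter(Negate(is.null), rows))
}

# Invented sample lines, not taken from any real report.
sample_lines <- c("  District   Total  Rural  Urban",
                  "District A   1,200   1,000   200",
                  "",
                  "District B 3,450 3,000 450")
cleaned <- parse_table_lines(sample_lines, n_values = 3)
```

Rows whose trailing tokens are not all numeric (headers, footnotes) are dropped, so check the row count against the PDF before trusting the result.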

The function may need to change considerably depending on how your PDFs are laid out. Finding the page worked for me; you will need to find what fits your files best.
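
As one possible variation, the sketch below works on lines rather than pages: it finds the first line containing the title and returns the lines that follow, which would replace hard-coded indices such as `txt2[4621:4639]`. The helper name `lines_after_title` and its arguments are hypothetical, and it assumes the title text sits on a single line; a long title may wrap in the extracted text, so matching on a short prefix is probably safer.

```r
# Hypothetical helper: return up to `n_lines` lines that follow a table title.
# `page_text` is a character vector with one element per page,
# the shape returned by pdftools::pdf_text().
lines_after_title <- function(page_text, title, n_lines) {
  lines <- unlist(strsplit(page_text, "\n", fixed = TRUE))
  hit <- grep(title, lines, fixed = TRUE)[1]  # first line containing the title
  if (is.na(hit)) return(character(0))        # title not found in this file
  last <- min(hit + n_lines, length(lines))   # do not run past the end
  lines[hit + seq_len(last - hit)]
}

# Invented two-page example standing in for pdf_text() output.
pages <- c("Intro text\nMore text",
           "Table 13.1: Example title\nrow one\nrow two\nrow three")
lines_after_title(pages, "Table 13.1:", n_lines = 2)
```

Because only the first match is used, a table of contents that repeats the title would be picked up before the table itself; in that case the match would need to be narrowed, for example by searching only the later pages.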
