
I am using tidyverse, tidytext, and pdftools to parse words from a directory of 70 PDF files. The code below works, but it grabs every page of each file instead of the subset I want: I need to skip the first two pages and keep page 3 through the end of each PDF.

directory <- "Student_Artifacts/"
pdfs <- paste(directory, "/", list.files(directory, pattern = "*.pdf"), sep = "")
pdf_names <- list.files(directory, pattern = "*.pdf")
pdfs_text <- map(pdfs, (pdf_text))
my_data <- data_frame(document = pdf_names, text = pdfs_text)

I figured out that by appending [3:12] like this I can grab the 3rd through 12th documents:

pdfs_text <- map(pdfs, (pdf_text))[3:12]

That's not what I want, though. How do I use the [3:12] specification to pull the pages I want from each PDF file?


1 Answer


First off, you could index out the 3rd-to-12th page from each PDF within the mapping of pdf_text, with just some very small changes:

pdfs_text <- map(pdfs, ~ pdf_text(.x)[3:12])
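
If your PDFs vary in length and you simply want everything from page 3 onward, a negative index is a small variant of the line above (a sketch, assuming the same pdfs vector):

# keep page 3 through the last page, however long each file is
pdfs_text <- map(pdfs, ~ pdf_text(.x)[-(1:2)])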

But the hard-coded [3:12] assumes that all 70 of your PDFs are at least 12 pages long; any shorter ones come back padded with NA "pages". Either way, reading 70 files one at a time might be slow, especially if some of them are really big. Try something like this instead (I used R's PDF documentation to demo with):

library(furrr)
#> Loading required package: future
library(pdftools)
library(tidyverse)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract

plan(multiprocess)

directory <- file.path(R.home("doc"), "manual")
pdf_names <- list.files(directory, pattern = "\\.pdf$", full.names = TRUE)
# Drop the full reference manual since it's so big
pdf_names %<>% str_subset("fullrefman.pdf", negate = TRUE)
pdfs_text <- future_map(pdf_names, pdf_text, .progress = TRUE)
#> Progress: ----------------------------------------------------------------------------------- 100%

my_data   <- tibble(
  document = basename(pdf_names), 
  text     = map_chr(pdfs_text, ~ {
    str_c("Page ", seq_along(.x), ": ", str_squish(.x)) %>% 
      tail(-2) %>% 
      str_c(collapse = "; ")
  })
)

my_data
#> # A tibble: 6 x 2
#>   document    text                                                         
#>   <chr>       <chr>                                                        
#> 1 R-admin.pdf "Page 3: i Table of Contents 1 Obtaining R . . . . . . . . .~
#> 2 R-data.pdf  "Page 3: i Table of Contents Acknowledgements . . . . . . . ~
#> 3 R-exts.pdf  "Page 3: i Table of Contents Acknowledgements . . . . . . . ~
#> 4 R-intro.pdf "Page 3: i Table of Contents Preface . . . . . . . . . . . .~
#> 5 R-ints.pdf  "Page 3: i Table of Contents 1 R Internal Structures . . . .~
#> 6 R-lang.pdf  "Page 3: i Table of Contents 1 Introduction . . . . . . . . ~

Created on 2019-10-19 by the reprex package (v0.3.0)

The main points:

  1. The tail(-2) is doing the work you're most concerned with: dropping the first two pages. You usually use tail() to grab the last n elements, but it's also handy for grabbing everything except the first n: just pass a negative n.
  2. The plan() and future_map() are parallelizing the PDF-reading, with each of your virtual cores reading one PDF at a time. Also, progress bar!
  3. I'm doing some fancy string concatenation when constructing text here, since it appears that you ultimately want the full text of each document's pages in one cell of your final table. I'm inserting "Page [n]: " labels and joining the pages with "; " so that information isn't lost, and I'm also squishing out the extra whitespace, since there's usually tons. That also leaves text in a handy shape for tokenizing with tidytext (see the sketch below).
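
Since the end goal is parsing words with tidytext, here's a minimal sketch of tokenizing the my_data tibble built above (the stop-word filter and the bigram step are optional illustrations, not something your question requires):

library(tidytext)

# one row per word per document; note that the "Page n:" labels added
# above will show up as tokens too, so filter them out if that matters
words <- my_data %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words, by = "word")

# bigrams work the same way
bigrams <- my_data %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
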
DHW
  • Wow! Ok this looks incredibly promising. Thanks for your efforts. Once I can clear out my teaching tedium I will be diving back into this analysis. – Craig Byron Oct 21 '19 at 15:26
  • Well, I have managed to at least quickly test your first solution and it works great. It did exactly what I wanted (i.e., pdfs_text <- map(pdfs, ~ pdf_text(.x)[3:12])). When I truly get back to this work soon I will try out your other solution. I was able to rapidly get to unnesting word tokens with what you helped me produce. I am still having issues with ngrams. I will try to add some reproducible code later. – Craig Byron Oct 22 '19 at 15:26
  • Your first solution did what I wanted (i.e., pdfs_text <- map(pdfs, ~ pdf_text(.x)[3:12])), and I was able to parse unigrams, but there were NAs that I still don't quite understand and that prevented parsing into ngrams. The second solution worked and I could parse into ngrams. – Craig Byron Oct 23 '19 at 15:16