First off, you could index out the 3rd through 12th page of each PDF right inside the call to `pdf_text()`, with just a very small change:
pdfs_text <- map(pdfs, ~ pdf_text(.x)[3:12])
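On a toy vector standing in for `pdf_text()`'s one-string-per-page output, that subscript works like this:

```r
pages <- paste("page", 1:13)   # stand-in for a 13-page PDF
pages[3:12]                    # keeps pages 3 through 12; returns NA for pages a shorter PDF doesn't have
```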
But this assumes that all 70 of your PDFs are exactly 13 pages long. It might also be slow, especially if some of the files are quite large. Try something like this instead (I used R's own PDF documentation as a demo):
library(furrr)
#> Loading required package: future
library(pdftools)
library(tidyverse)
library(magrittr)
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#>
#> set_names
#> The following object is masked from 'package:tidyr':
#>
#> extract
plan(multisession)  # "multiprocess" is deprecated in recent versions of future
directory <- file.path(R.home("doc"), "manual")
pdf_names <- list.files(directory, pattern = "\\.pdf$", full.names = TRUE)
# Drop the full reference manual since it's so big
pdf_names %<>% str_subset("fullrefman.pdf", negate = TRUE)
pdfs_text <- future_map(pdf_names, pdf_text, .progress = TRUE)
#> Progress: ----------------------------------------------------------------------------------- 100%
my_data <- tibble(
document = basename(pdf_names),
text = map_chr(pdfs_text, ~ {
str_c("Page ", seq_along(.x), ": ", str_squish(.x)) %>%
tail(-2) %>%
str_c(collapse = "; ")
})
)
my_data
#> # A tibble: 6 x 2
#> document text
#> <chr> <chr>
#> 1 R-admin.pdf "Page 3: i Table of Contents 1 Obtaining R . . . . . . . . .~
#> 2 R-data.pdf "Page 3: i Table of Contents Acknowledgements . . . . . . . ~
#> 3 R-exts.pdf "Page 3: i Table of Contents Acknowledgements . . . . . . . ~
#> 4 R-intro.pdf "Page 3: i Table of Contents Preface . . . . . . . . . . . .~
#> 5 R-ints.pdf "Page 3: i Table of Contents 1 R Internal Structures . . . .~
#> 6 R-lang.pdf "Page 3: i Table of Contents 1 Introduction . . . . . . . . ~
Created on 2019-10-19 by the reprex package (v0.3.0)
The main points:

- The `tail(-2)` is doing the work you're most concerned with: dropping the first two pages. You usually use `tail()` to grab the last *n* elements, but it's also ideal for grabbing everything *but* the first *n* — just make *n* negative.
- `plan()` and `future_map()` parallelize the PDF reading, with each of your virtual cores reading one PDF at a time. Also: progress bar!
- I'm doing some fancy string concatenation when constructing `text`, since it appears you ultimately want each document's full text in a single cell of your final table. I prepend `"Page [n]: "` to each page and join the pages with `"; "` so that no information is lost, and I also squeeze out the extra whitespace that PDF text usually has tons of (that's the `str_squish()`).
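If the negative-`n` behavior of `tail()` is new to you, it's easy to check on a plain vector:

```r
pages <- paste("page", 1:5)
tail(pages, -2)  # drop the first two elements, keep the rest
#> [1] "page 3" "page 4" "page 5"
```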