
I'm trying to extract a line of text from the first page of each multi-page PDF file in a list of PDFs. I want to get the text into a dataframe so I can extract the author of each PDF; the author appears on the first page, and the same word precedes the author in every single document.

I found the resource below by Packt Publishing that gets very close to what I'm trying to do, but when I run the for loop (I copied and pasted it and plugged in my object names), R throws the error shown below the loop.

For loop:

text_df <- data.frame(matrix(ncol=2, nrow=0))
colnames(text_df) <- c("pdf title", "text")

for (i in 1:length(vector)){
  print(i)
  pdf_text(paste("folder/", vector[i],sep = "")) %>% 
    strsplit("\n")-> document_text
  data.frame("pdf title" = gsub(x =vector[i],pattern = ".pdf", replacement = ""), 
             "text" = document_text, stringsAsFactors = FALSE) -> document
  colnames(document) <- c("pdf title", "text")
  text_df <- rbind(text_df,document) 
}

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 50, 60, 11

Could someone help me understand what this error means? Could someone direct me to other resources that accomplish what I'm trying to do? Thank you in advance!

Resource: https://www.r-bloggers.com/2018/01/how-to-extract-data-from-a-pdf-file-with-r/

  • The included example works only with 1-page PDF files; from the error I'd guess it failed on a document with 3 pages of 50, 60 & 11 lines each. `document_text` is a list of vectors (each page is a list item, each vector element is a line on that page). `data.frame()` attempts to create a column for each page, but fails because the line count differs between pages. – margusl Jun 04 '23 at 11:42
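To see concretely what that comment describes, here's a minimal sketch, with made-up page contents standing in for strsplit(pdf_text(...), "\n") on a 3-page PDF:

# stand-in for document_text on a 3-page PDF: one character vector of
# lines per page, with 50, 60 and 11 lines respectively
document_text <- list(rep("a line", 50), rep("a line", 60), rep("a line", 11))

# data.frame() turns every list element into its own column; the column
# lengths disagree, which reproduces the error from the question:
#   arguments imply differing number of rows: 50, 60, 11
data.frame("pdf title" = "some title", "text" = document_text, stringsAsFactors = FALSE)

# since only the first page is needed, keeping document_text[[1]] gives a
# single vector of lines; the one-element title is recycled and it works
document <- data.frame("pdf title" = "some title",
                       "text" = document_text[[1]],
                       stringsAsFactors = FALSE)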

1 Answer


Here's an example based on a few PDFs from arXiv; these are also used in the pdftools intro. The "keyword" for finding the author here is \n\n, the two line breaks between the title and the author:

# segment from the beginning of first page, pdf_text() output
"                                              The jsonlite Package: A Practical and Consistent Mapping\n                                                                   Between JSON Data and R Objects\n\n                                                                                    Jeroen Ooms\narXiv:1403.2805v1 [stat.CO] 12 Mar 2014\n\n\n\n\n

Searching for a string preceding "arXiv" would have worked too. When working with pdf_text() output, watch out for all the whitespace and line breaks in the resulting strings.
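For example, a rough sketch of that alternative, capturing the whole line that sits right before "arXiv" and squishing the layout whitespace out of it (str_match()'s second column is the captured group):

library(pdftools)
library(stringr)

first_page <- pdf_text("https://arxiv.org/pdf/1403.2805.pdf")[1]
# "([^\n]+)\narXiv" captures the full line immediately preceding "arXiv";
# for the jsonlite paper quoted above this should give "Jeroen Ooms"
author <- str_squish(str_match(first_page, "([^\n]+)\narXiv")[, 2])

The full approach, keyed on the two newlines instead: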

library(pdftools)
#> Using poppler version 22.04.0
library(stringr)
library(purrr)
library(tibble)
pdfs <- c("https://arxiv.org/pdf/1403.2805.pdf", 
          "https://arxiv.org/pdf/1406.4806.pdf")

# purrr::map cycles through all pdf files / URLs and calls a 
# \(pdf_file){...} function on each; it returns a list of one-row tibbles 
# that are bound into a single data.frame / tibble with list_rbind()
titles_authors <- pdfs %>% 
  map(\(pdf_file){
    first_page <- pdf_text(pdf_file)[1]
    tibble_row(
      file = pdf_file, 
      # title is the first block of text; it can span multiple lines but
      # ends with 2 newlines.
      # "^"        : match from the beginning of the string
      # "[\\s\\S]" : match any character, including newline (\s - whitespace, \S - anything but whitespace)
      # "*?"       : match as few characters as possible
      # "\n\n"     : up to and including the two newlines
      title  = first_page %>% str_extract("^[\\s\\S]*?\n\n") %>% str_squish(),
      # author is preceded by 2 newlines; assume it fits on a single line
      # "\n\n.*\n" : string preceded by two newlines and followed by one
      author = first_page %>% str_extract("\n\n.*\n") %>% str_trim()
      )
    }) %>% 
  list_rbind()
titles_authors
#> # A tibble: 2 × 3
#>   file                                title                               author
#>   <chr>                               <chr>                               <chr> 
#> 1 https://arxiv.org/pdf/1403.2805.pdf The jsonlite Package: A Practical … Jeroe…
#> 2 https://arxiv.org/pdf/1406.4806.pdf The OpenCPU System: Towards a Univ… Jeroe…

Created on 2023-06-04 with reprex v2.0.2
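Adapting the same idea to your local setup might look roughly like the sketch below; the file names come from your `vector` object under "folder/" as in your loop, and "Author:" is only a placeholder for whatever word actually precedes the author in your documents:

library(pdftools)
library(stringr)
library(purrr)
library(tibble)

pdf_files <- file.path("folder", vector)

authors <- pdf_files %>% 
  map(\(pdf_file){
    # only the first page is needed; pdf_text() returns one string per page
    first_page <- pdf_text(pdf_file)[1]
    tibble_row(
      `pdf title` = str_remove(basename(pdf_file), "\\.pdf$"),
      # capture the rest of the line after the placeholder keyword,
      # then squish out the layout whitespace
      author      = str_squish(str_match(first_page, "Author:([^\n]*)")[, 2])
    )
  }) %>% 
  list_rbind()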

(I would not use that linked R-bloggers post as a base.)

margusl