
I need to extract specific sections from a large corpus of PDF documents. The PDFs are large, messy reports containing numeric, alphabetic, and other information. The files vary in length but share the same sections and content structure. Each document has a Table of Contents listing the section names. For example:

Table of Contents:

Item 1. Business                                                                            1
Item 1A. Risk Factors                                                                       2
Item 1B. Unresolved Staff Comments                                                          5
Item 2. Properties                                                                          10
Item N........

..........text I do not care about...........

Item 1A. Risk Factors 

.....text I am interested in getting.......

(section ends)

Item 1B. Unresolved Staff Comments

..........text I do not care about...........

I have no problem reading them in and analyzing them as a whole, but I need to pull out only the text between "Item 1A. Risk Factors" and "Item 1B. Unresolved Staff Comments". I use the pdftools, tm, quanteda, and readtext packages. This is the part of the code I use to read in my docs: I created a directory called "PDF" where I placed my PDFs, and another directory where R will place the converted ".txt" files.

pdf_directory <- paste0(getwd(), "/PDF")
txt_directory <- paste0(getwd(), "/Texts")

Then I create a list of files using the list.files() function.

# escape the dot and anchor at the end so only .pdf files match
files <- list.files(pdf_directory, pattern = "\\.pdf$", recursive = FALSE,
                    full.names = TRUE)
files

After that, I create a function that extracts the text from each file and writes it to a .txt file named after the PDF.

extract <- function(filename) {
  print(filename)
  try({
    text <- pdf_text(filename)
    # strip the directory and the ".pdf" extension to get the bare file name
    f <- gsub("(.*)/([^/]*)\\.pdf", "\\2", filename)
    write(text, file.path(txt_directory, paste0(f, ".txt")))
  })
}

for (file in files) {
  extract(file)
}

After this step, I get stuck and do not know how to proceed. I am not sure whether I should try to extract the section of interest while reading the data in; in that case, I suppose, I would have to wrestle with the line in the function where I build the file name, f <- gsub("(.*)/([^/]*)\\.pdf", "\\2", filename)? I apologize for such questions, but I am teaching myself. I also tried the following code on just one file instead of the corpus:

start <- grep("^\\*\\*\\* ITEM 1A. RISK FACTORS", text_df$text) + 1
stop <- grep("^ITEM 1B. UNRESOLVED STAFF COMMENTS", text_df$text) - 1
lines <- raw[start:stop]

scd <- paste0(".*", start, "(.*)", "\n", stop, ".*")
gsub(scd, "\\1", name_of_file)

but it did not help me in any way.
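For a single file, the idea behind that attempt can be made to work by splitting the extracted text into lines and subsetting between the two heading matches. A minimal sketch, assuming each heading occurs exactly once per document; the file name "report.pdf" and the variable names here are illustrative:

library(pdftools)

# one element per line of text across all pages
pdf_lines <- unlist(strsplit(pdf_text("report.pdf"), "\n"))

# match the headings; \\. escapes the dot, \\s+ absorbs the variable spacing
start <- grep("ITEM 1A\\.\\s+RISK FACTORS", pdf_lines, ignore.case = TRUE)
stop  <- grep("ITEM 1B\\.\\s+UNRESOLVED STAFF COMMENTS", pdf_lines, ignore.case = TRUE)

# keep only the lines between the two headings
if (length(start) == 1 && length(stop) == 1) {
  section <- pdf_lines[(start + 1):(stop - 1)]
}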

    Would you be able to share at least one of the pdf files? Would make it easier to write an answer covering the whole process. – JBGruber Aug 06 '20 at 18:26
  • Yes, absolutely. It is public info (https://corporate.exxonmobil.com/-/media/Global/Files/investor-relations/investor-relations-publications-archive/ExxonMobil-2016-Form-10-K.pdf) And thank you very much. – Niki Aug 06 '20 at 18:44

1 Answer


I don't really see why you would write the files out as txt first, so I did it all in one go.

What threw me off a little is that your patterns contain runs of extra spaces. You can match those with the regular expression \\s+.
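As a quick illustration (the heading strings below are made up), a single pattern with \\s+ matches the heading no matter how many spaces pad it:

library(stringr)

headings <- c("ITEM 1A.           RISK FACTORS",
              "ITEM 1A. RISK FACTORS")
str_detect(headings, "ITEM 1A\\.\\s+RISK FACTORS")
#> [1] TRUE TRUE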

library(stringr)
files <- c("https://corporate.exxonmobil.com/-/media/Global/Files/investor-relations/investor-relations-publications-archive/ExxonMobil-2016-Form-10-K.pdf",
           "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf")


relevant_l <- lapply(files, function(file) {
  
  # print status message
  message("processing: ", basename(file))
  
  lines <- unlist(stringr::str_split(pdftools::pdf_text(file), "\n"))
  start <- stringr::str_which(lines, "ITEM 1A.\\s+RISK FACTORS")
  end <- stringr::str_which(lines, "ITEM 1B.\\s+UNRESOLVED STAFF COMMENTS")
  
  # cover a few different outcomes depending on what was found
  if (length(start) == 1 && length(end) == 1) {
    relevant <- lines[start:end]
  } else if (length(start) == 0 || length(end) == 0) {
    relevant <- "Pattern not found"
  } else {
    relevant <- "Problems found"
  }
  
  return(relevant)
})
#> processing: ExxonMobil-2016-Form-10-K.pdf
#> processing: dummy.pdf

names(relevant_l) <- basename(files)
sapply(relevant_l, head)
#> $`ExxonMobil-2016-Form-10-K.pdf`
#> [1] "ITEM 1A.           RISK FACTORS\r"                                                                                                   
#> [2] "ExxonMobil’s financial and operating results are subject to a variety of risks inherent in the global oil, gas, and petrochemical\r" 
#> [3] "businesses. Many of these risk factors are not within the Company’s control and could adversely affect our business, our financial\r"
#> [4] "and operating results, or our financial condition. These risk factors include:\r"                                                    
#> [5] "Supply and Demand\r"                                                                                                                 
#> [6] "The oil, gas, and petrochemical businesses are fundamentally commodity businesses. This means ExxonMobil’s operations and\r"         
#> 
#> $dummy.pdf
#> [1] "Pattern not found"

I would return the results as a list and then use the original file names to name the list elements. Let me know if you have questions. I use the stringr package since it's fast and consistent in dealing with strings, but str_which() and grep() work pretty much the same way.
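If you still want the .txt files from your original workflow, each extracted section can be collapsed into one string and written out. A minimal sketch building on relevant_l from above; it assumes a txt_directory like the one in your question already exists:

# collapse each section to a single string and write it to disk
for (nm in names(relevant_l)) {
  txt <- paste(relevant_l[[nm]], collapse = "\n")
  writeLines(txt, file.path(txt_directory, sub("\\.pdf$", ".txt", nm)))
}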

  • Thank you so much for your help, @JBGruber. I apologize for bothering you, but I ran into a problem when running that code on multiple PDFs. What I do is replace the links with local paths via `files <- list.files("/Volumes/GoogleDrive/My Drive/R/Projects/Work package 2/Exxon/PDF/", pattern = "*.pdf$", full.names = TRUE)` – Niki Aug 07 '20 at 17:58
  • I get an error: `Error in file(con, "rb") : invalid 'description' argument` – Niki Aug 07 '20 at 17:59
  • Hello. Thank you for your time and efforts to help me sort this out. However, I still cannot run that code on a list of files. It works only on one file. I get `Error in start:end : argument of length 0` – Niki Aug 09 '20 at 13:52
  • I think that means your pattern wasn't found. You can cover a few different outcomes of the search using if statements. Check my updated answer. – JBGruber Aug 10 '20 at 15:22
  • @JBGruber I think this is a great solution, although I don't understand why it won't work with my set of PDFs. There are no errors; I load six PDF files and, running the lines separately, I confirm that the `str_which()` commands return valid info. But when I run the whole `lapply` function, I get "Problems found" --- why? – Ben Sep 21 '21 at 15:26
  • That could have a multitude of reasons and is impossible to debug without the data. The first thing I would check is if it works for one file. Then check for which files it fails and look into the failing ones. If that doesn't help, you can post a new question with the PDF that fails. – JBGruber Sep 22 '21 at 10:33
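For instance, a quick way to spot the failing files, given the list structure from the answer above, is to look for the elements that came back with one of the fallback strings:

# names of the files where the section could not be extracted
failed <- sapply(relevant_l, function(x) x[1] %in% c("Pattern not found", "Problems found"))
names(relevant_l)[failed]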