Extracting multiple phrases from multiple PDFs simultaneously using R

Question

I have a table containing a list of PDF file paths, and I am trying to repeat the commands below for every PDF listed. For each file, I convert only the first page to text and then use keyword_search to look for certain phrases on that page. This works for one file at a time, but I have 281 files. What am I missing?

ONE PDF FILE

    my.file<-"//.../cover-letter.pdf"
    my.page<-pdf_text(my.file)[1] %>% as.character()
    my.result<-keyword_search(my.page, keyword = c('reason','not being marketed', 'available for sale', 'withdrawn from sale', 'commercial distribution', 'target date'), ignore_case = TRUE)
    my.result$Cover_Letter<-my.file
    
    my.result<-select(my.result, -5)
    result<-merge(TotNoMark_clean, my.result, by = "Cover_Letter", all.x = TRUE)

MULTIPLE PDF FILES: FAILED ATTEMPT


    DF<-as.data.frame(TotNoMark_clean)
    file.names<-DF$Cover_Letter

    for(i in 1:length(file.names)){
      {pdf_pages<-pdf_text(file.names[i])[1]
      pdf_result<-keyword_search(pdf_pages, keyword = c('reason','not being marketed', 'available for sale', 'withdrawn from sale', 'commercial distribution', 'target date'))
      pdf_result$Cover_Letter<-file.names[i]
      if (!nrow(pdf_result)) {next}
      }
      Result<<-pdf_result
    }
    Result<-select(Result, -5)
    Result<-merge(DF, Result, by = "Cover_Letter", all.x = TRUE)

This is the error message I get:

    "Error in `$<-.data.frame`(`*tmp*`, "Cover_Letter", value = "//cover-letters/***.pdf") : 
      replacement has 1 row, data has 0"

Parfait · Answer 1 · 2019-10-07T16:23:23.217

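Currently, your Result never retains past iterations, only the very last item, even with the scoping operator <<-, because you neither store results in a list nor grow the object in the loop (the latter is ill-advised anyway). In fact, you do not need <<- at all: a top-level for loop does not create a local scope, so plain <- already assigns to global objects. The error itself arises because Cover_Letter is assigned before the zero-row check: when a search finds nothing, pdf_result has no rows, and a length-one value cannot be added as a column.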
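As a minimal sketch of the overwrite problem (the `fake_search()` helper and the file names are hypothetical stand-ins for `keyword_search()` and the real paths, not part of the original code):

```r
# Hypothetical stand-in for keyword_search(): returns one row per "hit".
fake_search <- function(path) {
  hits <- switch(path, "a.pdf" = 2, "b.pdf" = 0, "c.pdf" = 1)
  data.frame(keyword = rep("reason", hits), stringsAsFactors = FALSE)
}

paths <- c("a.pdf", "b.pdf", "c.pdf")

# Overwriting: after the loop, only the last file's result survives.
# Plain <- is enough here; a top-level loop already works on global objects.
for (p in paths) {
  last_only <- fake_search(p)
}
nrow(last_only)            # 1 row, from "c.pdf" only

# Collecting: one list slot per file keeps every result.
all_results <- vector("list", length(paths))
for (i in seq_along(paths)) {
  all_results[[i]] <- fake_search(paths[i])
}
sapply(all_results, nrow)  # 2 0 1
```

The list has one element per file, regardless of how many rows each element holds.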

Consider building a list of data frames and then running bind_rows once, outside the loop, for the final output:

    library(pdftools)   # pdf_text()
    library(pdfsearch)  # keyword_search()

    DF <- as.data.frame(TotNoMark_clean)
    # INITIALIZE EMPTY LIST (ONE SLOT PER FILE)
    Result_dfs <- vector(mode="list", length=nrow(DF))

    for(i in seq_along(DF$Cover_Letter)) {
      pdf_pages <- pdf_text(DF$Cover_Letter[i])[1]
      pdf_result <- keyword_search(pdf_pages, 
                                   keyword = c('reason','not being marketed', 'available for sale', 
                                               'withdrawn from sale', 'commercial distribution', 
                                               'target date'))
      # rep() KEEPS THIS SAFE WHEN A FILE HAS ZERO MATCHES
      pdf_result$Cover_Letter <- rep(DF$Cover_Letter[i], nrow(pdf_result))

      # SAVE TO LIST REGARDLESS OF NROWs ([[ ]] STORES THE WHOLE DATA FRAME)
      Result_dfs[[i]] <- pdf_result
    }

    # BIND ALL DFs TOGETHER AND DROP THE FIFTH COLUMN
    Result <- dplyr::select(dplyr::bind_rows(Result_dfs), -5)

    # MERGE TO ORIGINAL
    Result <- merge(DF, Result, by = "Cover_Letter", all.x = TRUE)
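Two base R behaviours are worth knowing when adapting this loop: a length-one value cannot be assigned as a column of a zero-row data frame, and a whole data frame goes into a single list slot with `[[ ]]`. The sketch below uses a made-up empty data frame as a stand-in for a search with no matches; whether `keyword_search()` returns exactly this shape is an assumption.

```r
# A made-up zero-row result, as a search with no matches might return.
empty <- data.frame(keyword = character(0), stringsAsFactors = FALSE)

# Assigning a length-1 value to a zero-row data frame fails ...
msg <- tryCatch({
  empty$Cover_Letter <- "a.pdf"
  "no error"
}, error = function(e) conditionMessage(e))
msg  # the "replacement has 1 row, data has 0" message from the question

# ... while a value repeated nrow() times works for any row count.
empty$Cover_Letter <- rep("a.pdf", nrow(empty))
names(empty)  # "keyword" "Cover_Letter"

# Use [[ ]] to store a whole data frame in one list slot.
slots <- vector("list", 2)
slots[[1]] <- data.frame(x = 1:3, y = letters[1:3])
class(slots[[1]])  # "data.frame"
```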

Alternatively, use lapply to avoid the bookkeeping of initializing a list and assigning its items:

    DF <- as.data.frame(TotNoMark_clean)

    Result_dfs <- lapply(DF$Cover_Letter, function(f) {
        pdf_pages <- pdf_text(f)[1]
        pdf_result <- keyword_search(pdf_pages, 
                                     keyword = c('reason','not being marketed', 'available for sale', 
                                                 'withdrawn from sale', 'commercial distribution', 
                                                 'target date'))
        # rep() KEEPS THIS SAFE WHEN A FILE HAS ZERO MATCHES
        pdf_result$Cover_Letter <- rep(f, nrow(pdf_result))
        return(pdf_result)
    })

    # BIND ALL DFs TOGETHER AND DROP THE FIFTH COLUMN
    Result <- dplyr::select(dplyr::bind_rows(Result_dfs), -5)

    # LEFT JOIN TO ORIGINAL
    Result <- dplyr::left_join(DF, Result, by="Cover_Letter")
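To see what the final bind and join do with files that have no matches, here is a small base R sketch on made-up data. It swaps in `do.call(rbind, ...)` for `bind_rows` only so that it runs without dplyr; the file names and keywords are illustrative.

```r
# Made-up per-file results; the second file had no matches.
results <- list(
  data.frame(keyword = c("reason", "target date"), Cover_Letter = "a.pdf",
             stringsAsFactors = FALSE),
  data.frame(keyword = character(0), Cover_Letter = character(0),
             stringsAsFactors = FALSE)
)
files <- data.frame(Cover_Letter = c("a.pdf", "b.pdf"),
                    stringsAsFactors = FALSE)

# Bind once, outside any loop; zero-row pieces contribute nothing.
combined <- do.call(rbind, results)

# all.x = TRUE keeps files without matches, filling keyword with NA.
joined <- merge(files, combined, by = "Cover_Letter", all.x = TRUE)
joined
```

The joined table has two rows for `a.pdf` and one row for `b.pdf` with `NA` in `keyword`, so files without hits stay visible in the output.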

I ended up figuring it out with a friend of mine. But yes, that was the issue. Thank you for taking the time to respond! — Siren, Oct 08 '19 at 19:30
Why didn't this solution work? Please advise on any errors. It follows your single PDF file process. — Parfait, Oct 08 '19 at 19:31
I ended up solving it on my own just yesterday, but you were correct to advise me to use the bind_rows function. That was the main issue. I didn't get a chance to try your code since I saw it just now. I'll test it out when I can, and I'll let you know! — Siren, Oct 08 '19 at 19:42
So in your first solution, it only recognizes the first 3 items in the for loop, so the length of i is coming out as 3. Also, when you initialize the empty list in this line, `Result_dfs <- vector(mode="list", length=nrow(DF))`, the length of the list will not actually be equal to the number of rows in DF. Since multiple phrases can be found in the same document, there was really no way for me to anticipate the length of the result. In your second solution, the for loop only recognizes the first item in the list of PDFs. — Siren, Oct 09 '19 at 13:37
Hmmmm...there is no cutoff at 3: `seq_along(DF$Cover_Letter)` is the same as your `1:length(file.names)`, which is really `1:length(DF$Cover_Letter)`. As for the last concern, I think you are confusing the *length* of a list with the *nrow* of a data frame. In either solution here, `pdf_result` is a data frame with *any* number of rows, but exactly **one** data frame per iteration is assigned to the list. Be sure you are running the same data as in your working solution, which looks very similar but uses different keywords and has no `next` or growing object in the loop. — Parfait, Oct 09 '19 at 14:29

score 0 · Answer 2 · answered Oct 08 '19 at 19:31

After I checked to make sure that the appropriate fields were in the correct class, here's what I ended up doing, and this worked:

    PhrasePull<-function(){
      DF<-as.data.frame(TotNoMark_clean)
      file.names<-DF$Cover_Letter
      Result<-data.frame()
      for(i in 1:length(file.names)){
        {pdf_pages<-pdf_text(file.names[i])[1]
        pdf_result<-keyword_search(pdf_pages,
          keyword = c('reason','not being marketed', 'has not marketed', 'will be able to market',
                      'will market', 'is not marketing', 'available for sale', 'withdrawn from sale',
                      'commercial marketing', 'commercial distribution', 'target date',
                      'will be available', 'marketing of this product has been started',
                      'commercially marketed', 'discontinued', 'launch.', 'not currently marketed',
                      'unable to market', 'listed in the active section of the Orange Book',
                      'not currently being manufactured or marketed'),
          ignore_case = TRUE)
        if (!nrow(pdf_result)) {next}
        pdf_result$Cover_Letter<-file.names[i]
        }
        Result <- bind_rows(Result, pdf_result)
      }
      output<<-merge(DF, Result, by = "Cover_Letter", all.x = TRUE)
    }

You did what I mentioned was ill-advised in R: growing an object in a loop, here with `bind_rows`. Ideally, you run `bind_rows` **once** outside the loop on a list of tibbles, as I proposed. Otherwise, it causes inefficient, excessive memory copying. Also, avoid the scoping operator `<<-`, which can be hard to debug because it implicitly modifies the global environment. In fact, you do not need it inside a function: return the result instead. — Parfait, Oct 08 '19 at 20:09
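A rough sketch of that advice, with a function that returns its result rather than writing to the global environment. The `search_one()` helper, its fake hit counts, and the `phrase_pull()` name are hypothetical; in real use `search_one()` would wrap the `pdf_text()` and `keyword_search()` calls.

```r
# Hypothetical stand-in for the pdf_text() + keyword_search() step.
search_one <- function(path) {
  fake_hits <- c("a.pdf" = 2, "b.pdf" = 0, "c.pdf" = 1)  # made-up counts
  n_hits <- fake_hits[[path]]
  data.frame(keyword = rep("reason", n_hits),
             Cover_Letter = rep(path, n_hits),
             stringsAsFactors = FALSE)
}

# Returns the combined result; nothing is assigned with <<-.
phrase_pull <- function(paths) {
  pieces <- lapply(paths, search_one)  # one data frame per file
  do.call(rbind, pieces)               # bind once, after the loop
}

# The caller decides where the result goes.
output <- phrase_pull(c("a.pdf", "b.pdf", "c.pdf"))
nrow(output)  # 3
```

Because the binding happens once on the full list, no intermediate copies of a growing data frame are made, and `pieces` never leaks outside the function.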
