
I am trying to extract a certain table from multiple PDF files, but not all of the files contain that table. How can I use `tryCatch` (or something similar) to skip a file and proceed to the next one when the table is missing?

library(pdftools)
library(tidyverse)

url <- c("https://www.computershare.com/News/Annual%20Report%202019.pdf?2",
         "https://www.annualreports.com/HostedData/AnnualReportArchive/a/LSE_ASOS_2018.PDF")

raw_text <- map(url, pdf_text)

clean_table1 <- function(raw) {
  
  raw <- map(raw, ~ str_split(.x, "\\n") %>% unlist())
  raw <- reduce(raw, c)
  
  table_start <- stringr::str_which(tolower(raw), "twenty largest shareholders")
  table_end <- stringr::str_which(tolower(raw), "total")
  table_end <- table_end[min(which(table_end > table_start))]
  
  table <- raw[(table_start + 3 ):(table_start + 25)]
  table <- str_replace_all(table, "\\s{2,}", "|")
  text_con <- textConnection(table)
  data_table <- read.csv(text_con, sep = "|")
  #colnames(data_table) <- c("Name", "Number of Shares", "Percentage")
  data_table
}

shares <- map_df(raw_text, clean_table1) 

I got the following error when I ran it:

Error in (table_start + 3):(table_start + 25) : argument of length 0
In addition: Warning message:
In min(which(table_end > table_start)) :
  no non-missing arguments to min; returning Inf
Jane
  • Where is the error occurring? – r2evans Oct 15 '20 at 00:38
  • @r2evans during the table extraction `Error in (table_start + 3):(table_start + 25) : argument of length 0 In addition: Warning message: In min(which(table_end > table_start)) : no non-missing arguments to min; returning Inf` – Jane Oct 15 '20 at 01:00
  • `if (!length(table_start) && !length(table_end)) return();` immediately before `table <- ...` would preempt that error. I'm not sure why it's happening, but at least you can step out before it happens. Realize that you're returning `NULL` in this case (which I see as no problem, though the calling environment will have to deal with that). – r2evans Oct 15 '20 at 01:11

1 Answer


You can check the length of `table_start` and return `NULL` if it is 0. `map_df` drops `NULL` results automatically when it row-binds, so you still end up with one combined dataframe.

library(tidyverse)

clean_table1 <- function(raw) {
  
  raw <- map(raw, ~ str_split(.x, "\\n") %>% unlist())
  raw <- reduce(raw, c)
  
  table_start <- stringr::str_which(tolower(raw), "twenty largest shareholders")
  if(!length(table_start)) return(NULL)
  table_end <- stringr::str_which(tolower(raw), "total")
  table_end <- table_end[min(which(table_end > table_start))]
  
  table <- raw[(table_start + 3 ):(table_start + 25)]
  table <- str_replace_all(table, "\\s{2,}", "|")
  text_con <- textConnection(table)
  data_table <- read.csv(text_con, sep = "|")
  #colnames(data_table) <- c("Name", "Number of Shares", "Percentage")
  data_table
}

shares <- map_df(raw_text, clean_table1)
Ronak Shah
  • this seems to work but when I tried it on other files, I got this error `Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names`. I presume some of the files have tables with more columns? How do I have an error handling for this? – Jane Oct 16 '20 at 00:19
  • Difficult to tell without looking at the data. However, I have two guesses that we can consider. Read only 3 columns `data_table <- read.csv(text_con, sep = "|")[1:3]`. Or return `NULL` if more than 3 columns. `if(ncol(data_table) > 3) return(NULL)`. – Ronak Shah Oct 16 '20 at 00:41
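Since the question asks about `tryCatch` specifically, another option is to wrap the call so that *any* failure (missing table, unexpected column count in `read.csv`, etc.) yields `NULL` instead of aborting the whole loop. A minimal sketch, assuming `raw_text` and `clean_table1` are defined as above; `purrr::possibly()` is just a convenience wrapper around the same idea:

```r
library(purrr)

# possibly() returns a modified version of clean_table1 that returns
# `otherwise` (here NULL) on error instead of throwing, so one bad
# file cannot stop the map.
safe_clean <- possibly(clean_table1, otherwise = NULL)

# NULL results are silently dropped when map_df row-binds the pieces.
shares <- map_df(raw_text, safe_clean)

# The equivalent with base R tryCatch:
shares <- map_df(raw_text, function(raw) {
  tryCatch(clean_table1(raw), error = function(e) NULL)
})
```

Unlike the `if(!length(table_start))` guard, this catches every kind of error, which also means it can hide genuine bugs; logging the error message inside the `error` handler (e.g. `message(conditionMessage(e))`) is a reasonable middle ground.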