0

I have couple of pdfs and I wish to extract the shareholders table. How can I specify such that only table appearing after the string 'TWENTY LARGEST SHAREHOLDERS' is extracted?

I tried but was not quite sure of the function part.

library("pdftools")
library("tidyverse")

url <- c("https://www.computershare.com/News/Annual%20Report%202019.pdf?2")

raw_text <- map(url, pdf_text)


clean_table <- function(table){
  table <- str_split(table, "\n", simplify = TRUE)
  table_start <- stringr::str_which(table, "TWENTY LARGEST SHAREHOLDERS")
  table <- table[1, (table_start +1 ):(table_end - 1)]
  table <- str_replace_all(table, "\\s{2,}", "|")
  text_con <- textConnection(table)
  data_table <- read.csv(text_con, sep = "|")
  colnames(data_table) <- c("Name", "Number of Shares", "Percentage")
}

shares <- map_df(raw_text, clean_table) 

Jane
  • 385
  • 4
  • 11

1 Answers1

1

Try this. Besides some minor issues the main change is that I first get the page which contains the desired table. BTW: You have to search for "Twenty Largest Shareholders" and not "TWENTY LARGEST SHAREHOLDERS".

library(pdftools)
library(tidyverse)

# download pdf
url <- c("https://www.computershare.com/News/Annual%20Report%202019.pdf?2")

raw_text <- map(url, pdf_text)

clean_table1 <- function(raw) {
  
  # Split the single pages
  raw <- map(raw, ~ str_split(.x, "\\n") %>% unlist())
  # Concatenate the splitted pages
  raw <- reduce(raw, c)
  
  table_start <- stringr::str_which(tolower(raw), "twenty largest shareholders")
  table_end <- stringr::str_which(tolower(raw), "total")
  table_end <- table_end[min(which(table_end > table_start))]
  
  table <- raw[(table_start + 3 ):(table_end - 1)]
  table <- str_replace_all(table, "\\s{2,}", "|")
  text_con <- textConnection(table)
  data_table <- read.csv(text_con, sep = "|")
  colnames(data_table) <- c("Name", "Number of Shares", "Percentage")
  data_table
}

shares <- map_df(raw_text, clean_table1) 
head(shares)
#>                                                    Name Number of Shares
#> 1             J P Morgan Nominees Australia Pty Limited      109,500,852
#> 2                         Citicorp Nominees Pty Limited       57,714,777
#> 3                                       Mr Chris Morris       32,231,000
#> 4                             National Nominees Limited       19,355,892
#> 5                                         Welas Pty Ltd       18,950,000
#> 6 BNP Paribas Nominees Pty Ltd <Agency Lending DRP A/C>       11,520,882
#>   Percentage
#> 1      20.17
#> 2      10.63
#> 3       5.94
#> 4       3.56
#> 5       3.49
#> 6       2.12
stefan
  • 90,330
  • 6
  • 25
  • 51
  • Possible to use ignore case here? Or include the option to have more than one table start point? – Jane Oct 11 '20 at 22:37
  • What does this error meant: `Error in (table_start + 3):(table_end - 1) : argument of length 0` – Jane Oct 12 '20 at 05:02
  • Yes. Of course can you do that. See my edit. Addionally, instead of looking for just one page I made an edit which searches in the whole document, which is more robust e.g. if a table is split over multiple pages. And of course could you check for multiple starting points. The error means that one of the indices is an element of length 0, i.e. at least there was no match for the start or the end of the table. – stefan Oct 12 '20 at 06:44
  • can i skip it (without error) if there was no match? – Jane Oct 12 '20 at 09:44