0

I've written a function in R using pdftools to read a table from a pdf. The function gets the job done, but unfortunately the table contains a column for notes, which is only partially filled. As a result the data in the resulting table is shifted by one column in the row containing a note.

Here's the table.

enter image description here

And here's the code:

# load library
library(pdftools)

# link to report
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"

# read data through pdftool
data <- pdf_text(url)

# create a function to read the pdfs
scrape_pdf <- function(list_of_tables,
                       table_number,
                       number_columns, 
                       column_names,
                       first_row,
                       last_row) {
  data <- list_of_tables[table_number]
  data <- trimws(data)
  data <- strsplit(data, "\n")
  data <- data[[1]]
  data <- data[min(grep(first_row, data)):
                  max(grep(last_row, data))]
  data <- str_split_fixed(data, " {2,}", number_columns)
  data <- data.frame(data)
  names(data) <- column_names
  
  return(data)
}
names <- c("","6m 30-9-2022","6m 30-9-2021","12m 30-3-2022")
output <- scrape_pdf(rym22Q3fs,3,5,names,"Care fees","Basic and diluted")

And the output.

                                        6m 30-9-2022 6m 30-9-2021 12m 30-3-2022      NA
1                             Care fees      210,187      194,603       398,206        
2                       Management fees       59,746       50,959       105,552        
3                     Interest received          364           42            41        
4                          Other income        3,942        2,260         4,998        
5                         Total revenue      274,239      247,864       508,797        
6                                                                                      
7                Fair-value movement of                                                
8                 investment properties            3      261,346       285,143 745,885
9                          Total income      535,585      533,007     1,254,682        
10                                                                                     
11                   Operating expenses    (265,148)    (225,380)     (466,238)        
12                     Depreciation and                                                
13                amortisation expenses     (22,996)     (17,854)      (35,698)        
14                        Finance costs     (19,355)     (15,250)      (30,664)        
15                      Impairment loss            2     (10,784)             -       -
16                       Total expenses    (318,283)    (258,484)     (532,600)        
17                                                                                     
18             Profit before income tax      217,302      274,523       722,082        
19        Income tax (expense) / credit     (23,316)        6,944      (29,209)        
20                Profit for the period      193,986      281,467       692,873        
21                                                                                     
22                   Earnings per share                                                
23  Basic and diluted (cents per share)         38.8         56.3         138.6      

How can I best circumvent this issue?

Many thanks in advance!

  • I suspect you can unambiguously identify all rows where column 5 is meaningful (not NA, not empty), then shift all other rows' columns 2-4 over one. That should be a relatively straight forward matrix/frame operation, right? – r2evans Nov 19 '22 at 13:15

1 Answers1

2

While readr::read_fwf() is for handling fixed width files, it performs pretty well on text from pdftools too once header / footer rows are removed. Even if it has to guess column widths, though those can be specified too.

library(pdftools)
library(dplyr, warn.conflicts = F)

url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
data <- pdf_text(url)

scrape_pdf <- function(pdf_text_item, first_row_str, last_row_str){
  lines <- unlist(strsplit(pdf_text_item, "\n"))
  # remove 0-length lines
  lines <- lines[nchar(lines) > 0]
  lines <- lines[min(grep(first_row_str, lines)):
                 max(grep(last_row_str , lines))]
  # paste lines back into single string for read_fwf()
  paste(lines, collapse = "\n") %>% 
    readr::read_fwf() %>% 
    # re-connect strings in first colum if values were split between rows 
    mutate(X1 = if_else(!is.na(lag(X1)) & is.na(lag(X3)), paste(lag(X1), X1), X1)) %>% 
    filter(!is.na(X3))
}

output <- scrape_pdf(data[3], "Care fees","Basic and diluted" )

Result:

output %>% 
  mutate(X1 = stringr::str_trunc(X1, 35))
#> # A tibble: 16 × 5
#>    X1                                     X2 X3        X4        X5       
#>    <chr>                               <dbl> <chr>     <chr>     <chr>    
#>  1 Care fees                              NA 210,187   194,603   398,206  
#>  2 Management fees                        NA 59,746    50,959    105,552  
#>  3 Interest received                      NA 364       42        41       
#>  4 Other income                           NA 3,942     2,260     4,998    
#>  5 Total revenue                          NA 274,239   247,864   508,797  
#>  6 Fair-value movement of investmen...     3 261,346   285,143   745,885  
#>  7 Total income                           NA 535,585   533,007   1,254,682
#>  8 Operating expenses                     NA (265,148) (225,380) (466,238)
#>  9 Depreciation and amortisation ex...    NA (22,996)  (17,854)  (35,698) 
#> 10 Finance costs                          NA (19,355)  (15,250)  (30,664) 
#> 11 Impairment loss                         2 (10,784)  -         -        
#> 12 Total expenses                         NA (318,283) (258,484) (532,600)
#> 13 Profit before income tax               NA 217,302   274,523   722,082  
#> 14 Income tax (expense) / credit          NA (23,316)  6,944     (29,209) 
#> 15 Profit for the period                  NA 193,986   281,467   692,873  
#> 16 Earnings per share Basic and dil...    NA 38.8      56.3      138.6

Created on 2022-11-19 with reprex v2.0.2

margusl
  • 7,804
  • 2
  • 16
  • 20