PDF scraping: get company and subsidiaries tables

Question

I am trying to scrape this PDF containing information about company subsidiaries. I have seen many posts using the R package Tabulizer but this, unfortunately, doesn't work on my Mac for some reasons. As Tabulizer uses Java dependencies, I tried installing different versions of Java (6-13) and then reinstalling the packages, still no luck in getting this to work (what happens is when I run extract_tables the R session aborts).

I need to scrape the whole pdf from page 19 onwards and construct a table showing company names and their subsidiaries. In the pdf, names start with any letters/number/symbol, whereas subsidiaries start with either a single or double dot.

So I tried with pdftools and pdftables packages. The code below provides a table similar to the one on page 19:

library(pdftools)
library(pdftables)
library(tidyverse)

tt = pdf_text("~/DATA/978-1-912036-41-7-Who Owns Whom UK-Ireland-Volume-1.pdf")

df <- tt[19]
df2 <- strsplit(df, '  ')
 
df3 <-as.data.frame(do.call(cbind, df2)) %>% 
 filter(V1!="") %>% 
 mutate(V2=str_split_fixed(V1, "England . ", 2)) %>% 
 mutate(V3=str_split_fixed(V1, "England", 2)) %>% 
 select(V2,V3,V1) %>% 
 mutate(V1=ifelse(V1==V3,"",V1),V3=ifelse(V3==V2,"",V3)) %>% 
 select(V3,V2,V1) %>% 
 mutate_at(c("V1"), funs(lead), n = 1 ) %>% 
 mutate_at(c("V3"), funs(lag), n = 1 ) %>% 
 unite(V4,V1, V2, V3, sep = "", remove = FALSE)

I am sure there is a more sophisticated function to do this more neatly. For example by using '\n' or '\r' with strsplit:

 df2 <- strsplit(df, '\n') 
 df3 <- do.call(cbind.data.frame, df2)

Can anyone with more experience than me advise me on how to scrape this table?

Oh! Oddly I enjoy doing PDF scrapes, but it can take time. I'll look into this, in the mean time check out my one write-up: https://medium.com/swlh/the-adventure-of-pdf-to-data-frame-in-r-f90609035600?sk=2b36be4b1c2e49a784db67181685579c it was performed on a mac as well. But you're definitely going to probably need more code than what you provided is my impression from just scanning the pdf. — Justin Cocco, May 22 '21 at 13:06

JBGruber · Accepted Answer · 2021-05-26T06:59:00.840

Like @Justin Coco hinted, this was a lot of fun. The code ended up a bit more complex than I anticipated, but I think the result should be what you imagined.

I used pdf_data instead of pdf_text so I can work with the position of words.

library(pdftools)
#> Using poppler version 0.86.1
library(tidyverse)
pdf_location <- "/location/of/pdf"
pdf_raw <- pdf_data(pdf_location)

I then wrote a function which can process a page from the PDF:

get_table <- function(x, page) {
  x[[page]] %>% # select page, I use this variable again below, which is why I'm not simply looping through the whole object

    filter(y > 25, y < 833) %>% # above and below these positions is the pdf header which we are not interested in
    mutate(column = case_when( # I check the x-positions where the columns start an end and transformed them into column numbers
      x >= 36 & x < 220 ~ 1L,
      x >= 220 & x < 403 ~ 2L,
      x >= 403 ~ 3L,
    )) %>% 
    mutate(newrow = case_when( # check if this is a new line
      column == 1L & x == 36  ~ TRUE, 
      column == 2L & x == 220 ~ TRUE,
      column == 3L & x == 403 ~ TRUE,
      TRUE ~ FALSE
    ),
    row = cumsum(newrow), # get the row number
    subsidiary = newrow & text == ".") %>% # as you say, subsidiary names start with "."
    group_by(row, column) %>% # grouping and summarising moves the text into one 'cell'
    summarise(text = paste(text, collapse = " "), 
              subsidiary = sum(subsidiary) > 0,
              .groups = "drop") %>% 
    mutate(headline = !str_detect(text, "\\s")) %>% # the category headlines (@, A, B, C, etc.) are still in there but can be identified easily since they lack whitespace
    mutate(row = ifelse(row > 1 & !subsidiary & !lag(subsidiary) & !lag(headline), lag(row), row),
           row = ifelse(row > 1 & !subsidiary & !lag(subsidiary) & !lag(headline), lag(row), row)) %>% # some company names stretch over up to three lines but lines are not indented
    group_by(row, column) %>% 
    summarise(text = paste(text, collapse = " "), 
              subsidiary = sum(subsidiary) > 0,
              headline = head(headline, 1),
              .groups = "drop")  %>% 
    
    mutate(page = page, .before = row) # finally add the page number to keep track
}

You can test this on one page or loop through all of them at once:

pdf_df <- map_df(19:1428, ~get_table(pdf_raw, page = .x))

I already like the df, but you requested that the table should be "showing company names and their subsidiaries". So let's do some more wrangling on the pdf_df object.

pdf_df %>% 
  filter(!headline) %>% 
  mutate(company_nr = cumsum(!subsidiary)) %>% 
  group_by(company_nr) %>% 
  mutate(company = text[!subsidiary & !headline]) %>% 
  filter(subsidiary) %>% 
  select(company_nr, company, subsidiary = text)
#> # A tibble: 303,380 x 3
#> # Groups:   company_nr [115,477]
#>    company_nr company                             subsidiary                    
#>         <int> <chr>                               <chr>                         
#>  1          1 ?WHAT IF! HOLDINGS LIMITED The Gla… . ?What If! China Holdings Li…
#>  2          1 ?WHAT IF! HOLDINGS LIMITED The Gla… . . ?What If! Innovation Sing…
#>  3          1 ?WHAT IF! HOLDINGS LIMITED The Gla… . ?What If! Joint Ventures Li…
#>  4          1 ?WHAT IF! HOLDINGS LIMITED The Gla… . ?What If! Limited England   
#>  5          1 ?WHAT IF! HOLDINGS LIMITED The Gla… . . ? What If ! Inventors Lim…
#>  6          1 ?WHAT IF! HOLDINGS LIMITED The Gla… . . ? What If ! Training Limi…
#>  7          1 ?WHAT IF! HOLDINGS LIMITED The Gla… . Nobby Styles Limited Englan…
#>  8          2 @A COMPANY LIMITED Premier Suite 4… . Aviva Holdings Limited Engl…
#>  9          2 @A COMPANY LIMITED Premier Suite 4… . Copper Mountain Networks Li…
#> 10          2 @A COMPANY LIMITED Premier Suite 4… . Just Ties Limited England   
#> # … with 303,370 more rows

^{Created on 2021-05-23 by the reprex package (v2.0.0)}

Let me know in a comment if there are problems. I obviously didn't go through all pages to check if the script has some quirks with specific company names etc. but the first pages look fine to me.

You are right. First lines in a new column or on a new page were not picked up as `newline`. I added a test for this. See the updated answer. — JBGruber, May 25 '21 at 13:36
Hi JBGruber, thank you for your amazing work. This is truly neat and concise. The code seems to be working fine although there are a few issues. I fixed one issue regarding some subsidiaries being counted as companies. I noticed this happens at the beginning of a new column or page at this point `newrow = ifelse(is.na(newrow), FALSE, newrow)` so I added a further line of code `subsidiary = ifelse(text=="."&newrow==F&subsidiary==F,T,subsidiary)`. Also, there is an issue arising whenever subsidiaries/companies text is divided in more lines. This happens the first time at page 20-Limited England. — Amleto, May 25 '21 at 14:44
can you tell me the company? Multiple lines should be detected. — JBGruber, May 25 '21 at 16:02
In `row 420` Company: `A A HOMES AND HOUSING LTD Coombe Farm Oaks Road, Croydon, Surrey CR0 5HL 6519 6519` has subsidy `. . A A Homes and Housing (Metro Point) Property` but its ending: `Limited England` seems to be starting a new line and is detected as a company — Amleto, May 25 '21 at 17:40
I gave it another shot, addressing all the special cases so far. I found 115,477 unique company names now. Is there maybe a way to find out how many there are supposed to be? — JBGruber, May 26 '21 at 07:00

PDF scraping: get company and subsidiaries tables

1 Answers1

Linked