
RSelenium

I often need to scrape and analyze public data on health-care contracts, and I have partially automated this in VBA. I am a beginner at web scraping, so I probably deserve a couple of minuses, but I spent last night trying to set up RSelenium and succeeded in firing up the server and running some examples that copy single tables to data frames.

I am working with a dynamically generated site: https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false

I deal with three levels of pages:

Level 1

My top pages have the following structure (column A contains links; at the bottom there is pagination):

========
A, B, C
link_A,15,10
link_B,23,12
link_C,21,12
link_D,32,12
========

    1,2,3,4,5,6,7,8,9,...
======================

I have just learned to use SelectorGadget, which indicates:

Table:

.table-striped

Pagination (1, 2, 3, 4, 5, 6, 7):

.pagination-container
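
For orientation, here is a minimal rvest sketch of reading the two elements SelectorGadget found (assuming the page can be fetched despite the SSL issue mentioned below):

library(rvest)

pg <- read_html("https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=false")
html_table(html_node(pg, ".table-striped"))        # the providers table
html_text(html_node(pg, ".pagination-container"))  # the pager: 1, 2, 3, ...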

Level 2

Under each link (link_A, link_B) in the table there is a subpage that contains a table. Example: https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20799&OW=15&OrthopedicSupply=False&Code=150000009

============
F, G, H
link_agreements,34,23
link_agreements,23,23
link_agreements,24,24
============

SelectorGadget indicates

.table-striped
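
In other words, each level-2 URL is a template with the ProviderId and Code from the level-1 row plugged in; a sketch using the example values from the URL above:

url2 <- sprintf(
  "https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=%s&OW=15&OrthopedicSupply=False&Code=%s",
  "20799", "150000009")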

Level 3

Again, under each link (link_agreements) there is another sub-subpage with the data that I want to collect: https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId=20799&OW=15&OrthopedicSupply=False&Code=150000009&AgreementTechnicalCode=761176

============
X,Y,Z
orthopedics, 231,323
traumatology, 323,248
hematology, 323,122
============

Again, SelectorGadget indicates

.table-striped

I would like to iteratively collect all the subpages into a data frame that would look like this:

Info from top page; info from sub-subpages

link_A (from the top page); 15 (the value from column B); orthopedics, 231, 323
link_A (from the top page); 15 (the value from column B); traumatology, 323, 248
link_A (from the top page); 15 (the value from column B); hematology, 323, 122
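
Sketched as a tibble with the example values above (the column names are placeholders):

library(tibble)

target <- tribble(
  ~top_link, ~top_value, ~specialty,     ~y,  ~z,
  "link_A",  15,         "orthopedics",  231, 323,
  "link_A",  15,         "traumatology", 323, 248,
  "link_A",  15,         "hematology",   323, 122
)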

Is there a cookbook, or are there good examples for RSelenium or rvest, showing how to iterate through the links in the tables and get the data on the sub(sub)pages into a data frame? I would appreciate any info, an example, any hints, or a book showing how to do this with RSelenium or any other scraping package.

P.S. Warning: I am also encountering invalid-SSL-certificate issues with this page while working with the Firefox Selenium driver, so each time I need to skip the warning manually; that is a topic for another question.

P.P.S. Here is the code I tried so far; I found it to be a dead end.

install.packages("RSelenium")
install.packages("wdman")
library(RSelenium)   

library(wdman) library(XML)

Next I started Selenium. I immediately ran into "Java 8 present, Java 7 needed" issues, which I solved by removing all java*.exe files from Windows/System32 and SysWOW64.


selServ <- selenium(verbose = TRUE)  # downloads and starts a Selenium server
selServ$process

remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4567,
                      browserName = "firefox")

remDr$open(silent = FALSE)

remDr$navigate("https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId=17480&OW=13&OrthopedicSupply=False&Code=130000111&AgreementTechnicalCode=773979")

# grab the data table and parse it into a data frame
webElem <- remDr$findElement(using = "class name", value = "table-striped")
webElemtxt <- webElem$getElementAttribute("outerHTML")[[1]]
table <- readHTMLTable(webElemtxt, header = FALSE, as.data.frame = TRUE)[[1]]

# my attempt to reach the next link via the keyboard
webElem$clickElement()
webElem$sendKeysToElement(list(key = "tab", key = "enter"))

Here my struggle with RSelenium ended. I could not send keys to Chrome, and I could not work with Firefox because it demanded valid SSL certificates, which I could not effectively bypass.
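
(For the record: a minimal sketch of one possible workaround for the certificate warning, assuming a Selenium/geckodriver combination that honors the W3C acceptInsecureCerts capability; I have not verified it against this setup.)

remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4567,
                      browserName = "firefox",
                      extraCapabilities = list(acceptInsecureCerts = TRUE))
remDr$open()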

Jacek Kotowski
  • Please read [ask]. Please provide the code you have tried and the execution result including any error messages, etc. Also provide a link to the page and/or the relevant HTML. – JeffC Apr 10 '17 at 18:11
  • Dear JeffC, I struggled with RSelenium and tried to move around with keys, only to learn that the Chrome driver does not respond to these commands. Then I changed the RSelenium code to drive Firefox, only to learn that it does not accept any problems with the SSL certificate. My trials will be irrelevant; the answer below got me started, and I will improve on it and post the solution I arrived at. I have met most of How to Ask. – Jacek Kotowski Apr 11 '17 at 13:37
  • The main (and probably most important) part of How to Ask is that you supply the code you have tried and any resulting error messages. – JeffC Apr 11 '17 at 14:45
  • Dear JeffC, I am happy to do so. I started with RSelenium and had no idea there was an alternative. My code is a dead end from the start because it should "automate" browsers, and 1) in Chrome I cannot simulate keyboard operations, and 2) Firefox demands a valid SSL certificate, which cannot be bypassed for future sessions. So it is a bad solution from the start. Thank God people proposed a library I had no idea could do the job. Effectively I do not have code to paste in; I promise to post my code here once it works. I deserve a minus, so I got it. I am sorry. The proposal by Bharath got me started. – Jacek Kotowski Apr 11 '17 at 17:08

2 Answers

table<-0
library(rvest)
# PRIMARY TABLE EXTRACTION
for (i in 1:10){
  url<-paste0("https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=False&page=",i)
  page<-html_session(url)
  table[i]<-html_table(page)
}

library(data.table)
primary_table<-rbindlist(table,fill=TRUE)

# DATA CLEANING REQUIRED IN PRIMARY TABLE to clean the the variable 
# `Kod Sortuj według kodu świadczeniodawcy`
# Clean and store it in the primary_Table_column only then secondary table extraction will work
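# One possible cleaning step (a sketch based on the comment thread below;
# the 9-character code length is an assumption):
primary_table[[2]] <- substring(primary_table[[2]], 1, 9)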
# SECONDARY TABLE EXTRACTION
for (i in 1:10) {
  url <- paste0("https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20795&OW=15&OrthopedicSupply=False&Code=", primary_table[[2]][i])
  page <- html_session(url)
  tables[[i]] <- html_table(page)[[1]]
  # This is the key by which you can identify whose secondary table this is.
  tables[[i]][1, 1] <- primary_table[[2]][i]
}
secondary_table <- rbindlist(tables, fill = TRUE)
Bharath
  • Code only answers are not acceptable on SO. Please explain what your code does and how it answers the question. – JeffC Apr 10 '17 at 18:10
  • Thanks, Bharath. It partly works, but I am stuck on the second part. The "PRIMARY TABLE EXTRACTION" is OK; then I clean the data with `primary_table[,2] <- sapply(primary_table[,2],substring,1,9)`. But the SECONDARY EXTRACTION needs two parameters: besides the Code index, a second parameter in the URL must be iterated through: https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03& **ProviderId=20795&** OW=15&OrthopedicSupply=False&Code=150000001. This parameter is in the URLs of column 2 of the primary table. Is it possible to scrape tables complete with the link URLs in the columns instead of the text? – Jacek Kotowski Apr 11 '17 at 10:16

Here is the answer I developed based on hrbrmstr's aid in rvest: extract tables with url's instead of text. Practically all tribute goes to him. I modified his code to deal with the subpages. I am also grateful to Bharath. My code works, but it may be very untidy. I hope it will be adaptable for others. Feel free to simplify the code or propose changes.

library(rvest)
library(tidyverse)
library(stringr)

# error: Peer certificate cannot be authenticated with given CA certificates 
# https://stackoverflow.com/questions/40397932/r-peer-certificate-cannot-be-authenticated-with-given-ca-certificates-windows

library(httr)
set_config(config(ssl_verifypeer = 0L))

# Helpers
# First based on https://stackoverflow.com/questions/35947123/r-stringr-extract-number-after-specific-string
# str_extract(myStr, "(?i)(?<=ProviderID\\D)\\d+")

get_id <-
  function (x, myString) {
    require(stringr)
    str_extract(x, paste0("(?i)(?<=", myString, "\\D)\\d+"))
  }
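
# A quick check of get_id on a sample link fragment (hypothetical):
# get_id("GetAgreements?ROK=2017&ProviderId=20799&OW=15", "ProviderId")
# #> "20799"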


rm_extra <- function(x) { gsub("\r.*$", "", x) }

mk_gd_col_names <- function(x) {
  # lower-case the column names and replace runs of spaces with underscores
  tolower(x) %>%
    gsub(" +", "_", .)
}

URL <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=False&page=%d"

get_table <- function(page_num = 1) {

  pg <- read_html(httr::GET(sprintf(URL, page_num)))

  tab <- html_nodes(pg, "table")

  html_table(tab)[[1]][,-c(1,11)] %>%
    set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
    mutate_all(funs(rm_extra)) %>%
    mutate(link = html_nodes(tab, xpath=".//td[2]/a") %>% html_attr("href")) %>%
    mutate(provider_id=get_id(link,"ProviderID"))  %>%
    as_tibble()
}

pb <- progress_estimated(10)

map_df(1:10, function(i) {
  pb$tick()$print()
    get_table(page_num = i)
}) -> full_df
#===========level 2===============
# %26 escapes "&"

URL2a <- "https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId="
URL2b <- "&OW=15&OrthopedicSupply=False&Code="

# sanity-check the first constructed level-2 URL
paste0(URL2a, full_df[1,11], URL2b, full_df[1,1])


get_table2 <- function(page_num = 1) {

  pg <- read_html(httr::GET(paste0(URL2a,full_df[page_num,11],URL2b,full_df[page_num,1])))

  tab <- html_nodes(pg, "table")

  html_table(tab)[[1]][,-c(1,8)] %>%
    set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
    mutate_all(funs(rm_extra)) %>%
    mutate(link = html_nodes(tab, xpath=".//td[2]/a") %>% html_attr("href")) %>%
    mutate(provider_id=get_id(link,"ProviderID"))  %>%
    mutate(technical_code=get_id(link,"AgreementTechnicalCode"))  %>%
    as_tibble()
}

pb <- progress_estimated(nrow(full_df))

map_df(1:nrow(full_df), function(i) {
  pb$tick()$print()
  get_table2(page_num = i)
}) -> full_df2

#===========level 3===============

URL3a <- "https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId="
URL3b <- "&OW=15&OrthopedicSupply=False&Code=150000001&AgreementTechnicalCode="
# sanity-check the first constructed level-3 URL
paste0(URL3a, full_df2[1,8], URL3b, full_df2[1,9])

get_table3 <- function(page_num = 1) {

  pg <- read_html(httr::GET(paste0(URL3a, full_df2[page_num,8], URL3b, full_df2[page_num,9])))

  tab <- html_nodes(pg, "table")

  provider <- as.numeric(full_df2[page_num,8])

  html_table(tab)[[1]][,-c(1,8)] %>%
    set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
    mutate_all(funs(rm_extra)) %>%
    mutate(provider_id=provider)  %>%
    as_tibble()
}

pb <- progress_estimated(nrow(full_df2)+1)

map_df(1:nrow(full_df2), function(i)  {
  pb$tick()$print()
  get_table3(page_num = i)
} ) -> full_df3
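
To get the combined frame described in the question, here is a rough sketch of the final merge (assumptions: provider_id is the join key; full_df stores it as character while get_table3 stores it as numeric; carrying AgreementTechnicalCode into full_df3 as well would make the join unique):

final_df <- full_df3 %>%
  mutate(provider_id = as.character(provider_id)) %>%  # align key types before joining
  left_join(full_df, by = "provider_id")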
Jacek Kotowski