
I am struggling to scrape data from a table which spans several pages. The pages are linked via JavaScript.

The data I am interested in is based on the website's search function:

url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"

I am able to download the first page with the rvest package:

library(rvest)
library(tidyverse)

NI <- read_html(url)

NI.res <- NI %>%
  html_nodes("table") %>% 
  html_table(fill=TRUE)

NI.res <- NI.res[[1]][c(1:10),c(1:5)]

So far so good.

As far as I understand, the RSelenium package is the way forward for navigating websites with JavaScript, i.e. when scraping via changing URLs is not possible. I installed the package and ran it in combination with Docker Toolbox (all working fine):

library(RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-chrome')
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", 
                      port = 4445L,
                      browserName = "chrome")

remDr$open()

My hope was that by triggering the JavaScript I could navigate to the next page, repeat the rvest commands, and obtain the data on the 2nd, 3rd, etc. page (eventually this should be part of a loop or a purrr::map call).

Navigate to the table with search results (1st page):

remDr$navigate("http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/1989&td=01/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0")

Trigger the JavaScript. The content of the call is taken from hovering with the mouse over the page index below the table on the website. The call below triggers the JavaScript leading to page 2:

remDr$executeScript("__doPostBack('ctl00$MainContentPlaceHolder$SearchResultsGridView','Page$2');", args=list("dummy"))

Repeat the scraping with rvest

url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"
NI <- read_html(url)

NI.res <- NI %>%
  html_nodes("table") %>% 
  html_table(fill=TRUE)
NI.res2 <- NI.res[[1]][c(1:10),c(1:5)]

Unfortunately, triggering the JavaScript appears not to work: the scraped results are again those from page 1, not page 2. I might be missing something rather basic here, but I can't figure out what.
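One likely culprit: the second `read_html(url)` fires a fresh GET request, which returns page 1 again and discards whatever the postback did in the browser. A sketch of scraping from the driver's live DOM instead (untested against this site; the `Sys.sleep()` wait is a crude assumption about load time):

```r
remDr$executeScript("__doPostBack('ctl00$MainContentPlaceHolder$SearchResultsGridView','Page$2');",
                    args = list("dummy"))
Sys.sleep(2)  # give the postback time to complete

# Read the table from the browser's current DOM, not from a fresh GET
NI2 <- read_html(remDr$getPageSource()[[1]])
NI.res2 <- NI2 %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
NI.res2 <- NI.res2[[1]][1:10, 1:5]
```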

My attempt is partly informed by SO posts here, here and here. I also saw this post.

Context: In further steps, I will eventually have to trigger a click on each row which shows up across the pages and also scrape the information behind each entry. Hence, as far as I understand, RSelenium will be the main tool here.
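For that later step, a minimal RSelenium sketch of clicking through each result link on the current page. The CSS selector is an assumption (it would need checking against the rendered grid), and the fixed waits are crude placeholders:

```r
# Sketch: click each link inside the results grid (selector is a guess --
# inspect the page source to confirm it).
sel <- "#ctl00_MainContentPlaceHolder_SearchResultsGridView a"
links <- remDr$findElements(using = "css selector", sel)

for (j in seq_along(links)) {
  # Re-find the elements each iteration: a postback re-renders the page
  links <- remDr$findElements(using = "css selector", sel)
  links[[j]]$clickElement()
  Sys.sleep(2)                                  # crude wait for the detail page
  detail <- read_html(remDr$getPageSource()[[1]])
  # ... scrape the detail page here ...
  remDr$goBack()
  Sys.sleep(2)
}
```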

Grateful for any hint!

UPDATE

I made 'some' progress following the approach suggested here. It a) still doesn't do everything I intend and b) is very likely not the most elegant way to do it. But maybe it's of some help to others / opens up a way forward. Note that this approach does not require RSelenium.

I basically created a loop over the JavaScript page indexes, each leading to another page of the table I want to scrape. The crucial detail is the __EVENTARGUMENT field, to which I assign the respective page number (my knowledge of JS is basically zero).

# Run once before the loop: open a session and pull out the ASP.NET form fields
pgsession <- html_session(url)
pgform    <- html_form(pgsession)[[1]]
page.list <- list()

for (i in 2:15) {
  target <- paste0("Page$", i)

  page <- rvest:::request_POST(pgsession, "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0",
                               body = list(
                                 `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
                                 `__EVENTTARGET` = "ctl00$MainContentPlaceHolder$SearchResultsGridView",
                                 `__EVENTARGUMENT` = target,
                                 `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
                                 `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                                 `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value
                               ),
                               encode = "form")

  x <- read_html(page) %>%
    html_nodes(css = "#ctl00_MainContentPlaceHolder_SearchResultsGridView") %>%
    html_table(fill = TRUE) %>%
    as.data.frame()

  # Drop the two pager rows at the bottom and keep the first five columns
  d <- x[1:(nrow(x) - 2), 1:5]
  page.list[[i]] <- d
}

However, this code cannot trigger the JavaScript for the pages which are not visible in the page index below the table when the site first opens (pages 1 to 11). Only pages 2 to 11 can be scraped with this loop. Since the postback targets for page 12 and onward are not visible, they can't be triggered.
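A plausible workaround, not verified against this site: ASP.NET's event validation only accepts postback arguments present in the page that produced the hidden form fields, so the viewstate taken from page 1 cannot address pages beyond 11. Re-reading the hidden fields from each response, and first posting `Page$Last` so the later page links exist, might unlock the remaining pages. A sketch (the `post_page` helper is hypothetical and reuses `url`, `pgsession`, and `pgform` from above):

```r
# Hypothetical helper: one postback using the fields of the given form
post_page <- function(session, form, target) {
  rvest:::request_POST(session, url,
                       body = list(
                         `__VIEWSTATE` = form$fields$`__VIEWSTATE`$value,
                         `__EVENTTARGET` = "ctl00$MainContentPlaceHolder$SearchResultsGridView",
                         `__EVENTARGUMENT` = target,
                         `__VIEWSTATEGENERATOR` = form$fields$`__VIEWSTATEGENERATOR`$value,
                         `__VIEWSTATEENCRYPTED` = form$fields$`__VIEWSTATEENCRYPTED`$value,
                         `__EVENTVALIDATION` = form$fields$`__EVENTVALIDATION`$value
                       ),
                       encode = "form")
}

# Jump to the last page first, so 'Page$12' etc. become valid targets,
# then refresh the form fields from every response before the next POST
page   <- post_page(pgsession, pgform, "Page$Last")
pgform <- html_form(read_html(page))[[1]]

for (i in 12:14) {
  page   <- post_page(pgsession, pgform, paste0("Page$", i))
  pgform <- html_form(read_html(page))[[1]]
  # ... extract the table from `page` as before ...
}
```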

  • The problem might be with the way you are scraping the second page, as you are just passing the original URL to `rvest`, which probably undoes the javascript. It might be better to use `NI <- read_html(remDr$getPageSource()[[1]])` after executing the javascript. – Andrew Gustar Apr 11 '18 at 09:12
  • @AndrewGustar many thanks - tried it, but for reasons beyond my understanding the page's source remains also unchanged (= page 1 to 11); even when manually navigating to page 12 and following, the source of the page makes only reference to pages 1 to 11. – zoowalk Apr 15 '18 at 19:02
  • How about navigating to `Page$Last` (the double arrow at the end of the page bar) and then trying `Page$12` etc, as these at least appear as links on that last page? Just a thought. – Andrew Gustar Apr 15 '18 at 21:38
  • Many thx. Thought so too - remDr$executeScript("__doPostBack('ctl00$MainContentPlaceHolder$SearchResultsGridView','Page$Last');", args=list("dummy")) brings me to the last page and makes the indexes for pages 12 to 14 visible. But for some reason these pages can't be scraped with the same command I use for the previous pages. – zoowalk Apr 15 '18 at 21:46
  • Hi Zoowalk I know it's been a while but have you figure out the solution yet? I have the same problem with a page written in .aspx. It is a page with table, I want to navigate to next page and found the button invoke javascript doPostBack.... All I want is just to click on it and wait for it to load another table (next page). – Gabriel Jan 18 '20 at 03:19
  • @Gabriel unfortunately, no. – zoowalk Jan 21 '20 at 12:33
  • Hi Zoowalk, I found out switching to Firefox works. – Gabriel Jan 22 '20 at 18:30

0 Answers