
I am trying to scrape information from the Swiss Federal Administrative Court for university research.

The URL is https://jurispub.admin.ch/publiws/pub/search.jsf, and I am interested in the data listed in the table that appears after a search is performed.

Unfortunately there is no robots.txt file. However, all decisions on that website are open to the public.

I have some experience with HTML scraping and I reviewed the following resources: http://www.rladiesnyc.org/post/scraping-javascript-websites-in-r/

https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/

Scraping website that include JS/jquery code with R

My approach

I think using PhantomJS to download an HTML version of the page and rvest to scrape the downloaded page is a good approach.

My problems

However, I do not know how to get the URL of the page that appears when an "empty" search is performed on https://jurispub.admin.ch/publiws/ (by clicking "suchen" without entering anything in the search mask), which returns 57,294 results. I thought about something like:

GET(url = "https://jurispub.admin.ch/publiws/",
      query=list(searchQuery="")) 

However, this does not work.

Moreover, I do not know how to make PhantomJS "click" the small arrow button to load the next page of results.

captcoma

2 Answers


Adding external dependencies is fine but should really be a last resort (IMO).

If you aren't familiar with the Developer Tools view in browsers please do some research on that before working through this answer. You need to have it up in a fresh browser session before you go to the search page to really see the flow.

GET wasn't working because it's an HTML form and <form> elements use POST requests (which show up as XHR requests in most Developer Tools Network panes). However, this is a poorly crafted site that is far too complex for its own good (almost worse than a Microsoft SharePoint site), and there is some initial state that gets set up when you go to the search start page and is maintained throughout the rest of the flow.

I used curlconverter to triage the POST XHR requests. The TL;DR of doing that is: right-click any POST XHR request, find the "Copy as cURL" menu item, and select it. Then, with that still on the clipboard, follow the instructions in curlconverter's README and manual pages to get actual httr functions back. I can't really promise to walk you through this part or answer curlconverter questions here.

Anyway, to get httr/curl to maintain cookies for you, and to get a key session variable that you'll need to pass with each call, we need to start with a fresh R session and "prime" the scraping process with a GET to the main search URL:

library(stringi) # I prefer this for extracting matched strings
library(xml2)    # xml_find_first()/xml_text() used below come from here
library(rvest)
library(httr)

primer <- httr::GET("https://jurispub.admin.ch/publiws/pub/search.jsf")

Now we need to extract a session string that's embedded in the JavaScript on that page:

httr::content(primer, as="text") %>%
  stri_match_first_regex("session: '([[:alnum:]]+)'") %>% 
  .[,2] -> ice_session

Now, we pretend we're submitting a form. All of these hidden variables may not be needed, but it's what the browser sent. I usually try to pare them down to only what's needed, but this is your project, so have fun with that if you want:

httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id64",
    ice.event.captured = "form:_id63first",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "51", 
    ice.event.y = "336",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "form", 
    icefacesCssUpdates = "",
    `form:_id63` = "first",
    `form:_idcl` = "form:_id63first",
    ice.session = ice_session,
    ice.view = "1", 
    ice.focus = "form:_id63first",
    rand = "0.38654987905551663\\n\\n"
  ),
  encode = "form"
) -> first_pg

Now that we have the first page, we need the data from it. I'm not going to solve this fully, but you should be able to extrapolate from what's below. The POST request returns XML that the JavaScript on the page turns into a terrible-looking table. We're going to extract that table:

httr::content(first_pg) %>% 
  xml_find_first("//updates/update/content") %>% 
  xml_text() %>% 
  read_html() -> pg_tbl

data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")

However, it's a terrible use of HTML (the programmers had no REDACTED clue how to do web stuff properly) and you can't just use html_table() on it (and you wouldn't want to anyway, since you likely want the links to the PDFs or whatnot). So, we can pull columns out at will:

html_nodes(data_tbl, xpath=".//td[1]/a") %>% 
  html_text()
## [1] "A-3930/2013" "D-7885/2009" "E-5869/2012" "C-651/2011"  "F-2439/2017" "D-7416/2009"
## [7] "D-838/2011"  "C-859/2011"  "E-1927/2017" "E-2606/2011"

html_nodes(data_tbl, xpath=".//td[2]/a") %>% 
  html_attr("href")
##  [1] "/publiws/download?decisionId=0002b1f8-ea53-40bb-8e38-402d9f3fdfa9"
##  [2] "/publiws/download?decisionId=0002da8f-306e-4395-8eed-0b168df8634b"
##  [3] "/publiws/download?decisionId=0003ec45-50be-45b2-8a56-5c0d866c2603"
##  [4] "/publiws/download?decisionId=000508c2-c852-4aef-bc32-3385ddbbe88a"
##  [5] "/publiws/download?decisionId=0006fbb9-228a-4bdc-ac8c-52db67df3b34"
##  [6] "/publiws/download?decisionId=0008a971-6795-434d-90d4-7aeb1961606b"
##  [7] "/publiws/download?decisionId=00099619-519c-4c8f-9cea-a16ed9ab9fd8"
##  [8] "/publiws/download?decisionId=0009ac38-f2b0-4733-b379-05682473b5d9"
##  [9] "/publiws/download?decisionId=000a4e0f-b2a2-483b-a49f-6ad12f4b7849"
## [10] "/publiws/download?decisionId=000be307-37b1-4d46-b651-223ceec9e533"

Lather, rinse, repeat for any other columns, but you may need to do some extra work to get them out as cleanly, and that's an exercise left to you (i.e. I won't answer questions about it).
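
For example, here's a minimal sketch of pulling a further column the same way (which <td> holds which field is an assumption on my part; verify it in the Developer Tools view first):

html_nodes(data_tbl, xpath=".//td[3]") %>%  # e.g. a third column, whatever it holds
  html_text(trim = TRUE)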

And, you'll want to know where you are in the scraping process so we'll need to grab that line at the bottom of the table:

html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>% 
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 1 bis 10. Seite 1 von 5,730. Resultat sortiert nach: Relevanz"

Parsing that into the number of results and which page you're on is an exercise left to the reader.
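
If you want a head start, here's one possible way (a sketch that assumes the German wording of that status line stays exactly as shown above):

html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>% 
  html_text() %>% 
  stri_match_first_regex("([0-9,]+) Entscheide gefunden.*Seite ([0-9,]+) von ([0-9,]+)") %>% 
  .[, 2:4] %>%                          # total results, current page, total pages
  stri_replace_all_fixed(",", "") %>% 
  as.numeric()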

Now, we need to programmatically click "next page" until done. I'm going to do two manual iterations to prove it works and to head off "it doesn't work" comments. You should write an iterator or loop to go through all the next pages and save the data however you want.

Next page (first iteration):

httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id67",
    ice.event.captured = "form:_id63next",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "330", 
    ice.event.y = "559",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "", 
    icefacesCssUpdates = "",
    `form:_id63` = "next",
    `form:_idcl` = "form:_id63next",
    iceTooltipInfo = "tooltip_id=form:resultTable:7:tt_ps; tooltip_src_id=form:resultTable:7:_id57; tooltip_state=hide; tooltip_x=846; tooltip_y=433; cntxValue=",
    ice.session =  ice_session,
    ice.view = "1", 
    ice.focus = "form:_id63next",
    rand = "0.17641832791084566\\n\\n"
  ),
  encode = "form"
) -> next_pg

httr::content(next_pg) %>% 
  xml_find_first("//updates/update/content") %>% 
  xml_text() %>% 
  read_html() -> pg_tbl

data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")

html_nodes(data_tbl, xpath=".//td[1]/a") %>% 
  html_text()
##  [1] "D-4059/2011" "D-4389/2006" "E-4019/2006" "D-4291/2008" "E-5642/2012" "E-7752/2010"
##  [7] "D-7010/2014" "D-1551/2013" "C-7715/2010" "E-3187/2013"

html_nodes(data_tbl, xpath=".//td[2]/a") %>% 
  html_attr("href")
##  [1] "/publiws/download?decisionId=000bfd02-4da5-4bb2-a5d0-e9977bf8e464"
##  [2] "/publiws/download?decisionId=000e2be1-6da8-47ff-b707-4a3537320a82"
##  [3] "/publiws/download?decisionId=000fa961-ecb4-47d2-8ca3-72e8824c2c6b"
##  [4] "/publiws/download?decisionId=0010a089-4f19-433e-b106-6d75833fae9a"
##  [5] "/publiws/download?decisionId=00111bfc-3522-4a32-9e7a-fa2d9f171427"
##  [6] "/publiws/download?decisionId=00126b65-b345-4988-826b-b213080caa45"
##  [7] "/publiws/download?decisionId=00127944-5c88-43f6-9ef1-3c822288b0c7"
##  [8] "/publiws/download?decisionId=00135a17-f1eb-4b61-9171-ac1d27fd3910"
##  [9] "/publiws/download?decisionId=0014c6ea-c229-4129-bbe0-7411d34d9743"
## [10] "/publiws/download?decisionId=00167998-54d2-40a5-b02b-0c4546ac4760"

html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>% 
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 11 bis 20. Seite 2 von 5,730. Resultat sortiert nach: Relevanz"

Notice that the column values and the progress text are different. Also note that we got lucky: the incompetent programmers on the site actually exposed a "next" event rather than forcing us to figure out pagination numbers and X/Y coordinates.

Next page (second and last example iteration):

httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id67",
    ice.event.captured = "form:_id63next",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "330", 
    ice.event.y = "559",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "", 
    icefacesCssUpdates = "",
    `form:_id63` = "next",
    `form:_idcl` = "form:_id63next",
    iceTooltipInfo = "tooltip_id=form:resultTable:7:tt_ps; tooltip_src_id=form:resultTable:7:_id57; tooltip_state=hide; tooltip_x=846; tooltip_y=433; cntxValue=",
    ice.session =  ice_session,
    ice.view = "1", 
    ice.focus = "form:_id63next",
    rand = "0.17641832791084566\\n\\n"
  ),
  encode = "form"
) -> next_pg

httr::content(next_pg) %>% 
  xml_find_first("//updates/update/content") %>% 
  xml_text() %>% 
  read_html() -> pg_tbl

data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")

html_nodes(data_tbl, xpath=".//td[1]/a") %>% 
  html_text()
##  [1] "D-3974/2010" "D-5847/2009" "D-4241/2015" "E-3043/2010" "D-602/2016"  "C-2065/2008"
##  [7] "D-2753/2007" "E-2446/2010" "C-1124/2015" "B-7400/2006"

html_nodes(data_tbl, xpath=".//td[2]/a") %>% 
  html_attr("href")
##  [1] "/publiws/download?decisionId=00173ef1-2900-49d4-b7d3-39246e552a70"
##  [2] "/publiws/download?decisionId=001a344c-86b7-4f32-97f7-94d30669a583"
##  [3] "/publiws/download?decisionId=001ae810-300d-4291-8fd0-35de720a6678"
##  [4] "/publiws/download?decisionId=001c2025-57dd-4bc6-8bd6-eedbd719a6e3"
##  [5] "/publiws/download?decisionId=001c44ba-e605-455d-9609-ed7dffb17adc"
##  [6] "/publiws/download?decisionId=001c6040-4b81-4137-a6ee-bad5a5019e71"
##  [7] "/publiws/download?decisionId=001d0811-a5c2-4856-aef3-51a44f7f2b0e"
##  [8] "/publiws/download?decisionId=001dbf61-b1b8-468d-936e-30b174a8bec9"
##  [9] "/publiws/download?decisionId=001ea85a-0765-4a1f-9b81-3cecb9f36b31"
## [10] "/publiws/download?decisionId=001f2e34-9718-4ef7-a60c-e6bbe208003b"

html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>% 
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 21 bis 30. Seite 3 von 5,730. Resultat sortiert nach: Relevanz"

Ideally, you'd wrap the POST in a function you can call that returns a data frame, then rbind or bind_rows the results into one big data frame.
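
For instance, here's a rough sketch of that idea. It's untested; it assumes the pared-down "next" body below still satisfies the server (if it doesn't, send the full field set from the POSTs above) and that page 1 has already been fetched into pg_tbl as shown earlier:

scrape_page <- function(doc) {
  tbl <- html_node(doc, xpath=".//table[contains(., 'Dossiernummer')]")
  data.frame(
    dossier = html_nodes(tbl, xpath=".//td[1]/a") %>% html_text(),
    pdf_url = html_nodes(tbl, xpath=".//td[2]/a") %>% html_attr("href"),
    stringsAsFactors = FALSE
  )
}

next_page <- function() {
  httr::POST(
    url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
    body = list(
      `$ice.submit.partial` = "true",
      ice.event.target = "form:_id67",
      ice.event.captured = "form:_id63next",
      ice.event.type = "onclick",
      form = "",
      icefacesCssUpdates = "",
      `form:_id63` = "next",
      `form:_idcl` = "form:_id63next",
      ice.session = ice_session,
      ice.view = "1",
      ice.focus = "form:_id63next"
    ),
    encode = "form"
  ) %>%
    httr::content() %>%
    xml_find_first("//updates/update/content") %>%
    xml_text() %>%
    read_html()
}

n_pages <- 5730                     # total pages from the status line above
pages <- vector("list", n_pages)
pages[[1]] <- scrape_page(pg_tbl)   # page 1 came from the "first" POST
for (i in 2:n_pages) {
  pages[[i]] <- scrape_page(next_page())
  Sys.sleep(1)                      # be gentle with the server
}
all_results <- do.call(rbind, pages)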

If you made it this far, an alternative is to use RSelenium to orchestrate the "next page" clicks and retrieve the HTML back (the table will still be horribad, and you'll need the column targeting or some other HTML-selector magic to get useful info out of it, thanks to the aforementioned inept programmers). RSelenium introduces an external dependency that, as you'll see if you search SO, many R users have trouble getting working, especially on the equally wretched legacy operating system known as Windows. If you can get Selenium running and RSelenium working with it, it might be easier in the long run if all of the above seems daunting. You're still going to have to grok Developer Tools at some point (so the above might be worth the pain anyway), and you'll need the HTML selector targets for the various buttons for Selenium too.

I'd seriously avoid PhantomJS, as it's now in "best effort" maintenance mode, and you'd have to figure out how to do all of the above in JavaScript rather than R.

hrbrmstr
  • Thank you for your solution. I ran into a problem that I could not solve over the last few days: when I try to extract the table from the XML from the POST `httr::content(first_pg) %>% xml_find_first("//updates/update/content") %>% xml_text() %>% read_html() -> pg_tbl` I get the error: `Error: 'Ice.autoPosition.stop('form:_id34');Ice.autoCentre.stop('form:_id34');Ice.iFrameFix.start('form:_id34','/publiws/xmlhttp/blank');Ice.Focus.setFocus('form:_id63first');//-1890007849' does not exist in current working directory (...)`. Am I missing a URL? – captcoma Oct 12 '18 at 07:18
  • 1
    Did you inspect the output of the first `xml_text()` to make sure it's still HTML? – hrbrmstr Oct 12 '18 at 10:55
  • I did not. Unfortunately I cannot do it, since I cannot get a session string anymore (I only get NAs) and `..;\ncontainer.bridge = new Ice.Community.Application({blockUI: false,session: '-XS3ibgyGDCrJZJSDjSYNw',view: 1,synchronous: true,connectionLostRedirectURI: null,sessionExpiredRedirectURI: null,serverErrorRetryTimeouts:..` in the primer. I restarted everything; however, I still do not get a session – captcoma Oct 12 '18 at 11:27
  • Just to get a better understanding: why is there that problem with the session? Is it caused by the poorly crafted site, or is there some technique implemented to hinder scraping? – captcoma Oct 12 '18 at 20:52
  • 1
    Good q. They seem to use a SharePoint-esque back end framework for the site and those frameworks maintain _alot_ of state in cookies and hidden query string parameters. Said state may change over the course of an session and it can be difficult to fully emulate a browser with `GET` and `POST` requests without knowing all the nuances. SharePoint sites are _the_ _worst_ tho. – hrbrmstr Oct 12 '18 at 21:07

Getting Selenium working may be easier (in the long run) than trying to figure out the nuances necessary to get and maintain sessions:

library(wdman)         # for managing the Selenium server download
library(RSelenium)     # for getting a connection to the Selenium server
library(seleniumPipes) # for better navigation & scraping idioms
library(rvest)         # for html_node()/html_attr()/html_text() on the page source

This should install the jar and start the server:

selServ <- selenium() 

We need the port number, so run this and look for the port in the messages:

selServ$log()$stderr 
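
Alternatively, if you'd rather not fish the port out of the log, you can pin it yourself when starting the server (selenium() accepts a port argument):

selServ <- selenium(port = 4567L)  # start the server on a known port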

Now we need to connect to it, and we need to use the port number from ^^. It was 4567 in my case:

sel <- remoteDr(browserName = "chrome", port = 4567) 

Now, go to the main URL:

sel %>% 
  go("https://jurispub.admin.ch/publiws/pub/search.jsf")

Start the scraping process by hitting the initial submit button

sel %>% 
  findElement("name", "form:searchSubmitButton") %>%  # find the submit button 
  elementClick() # click it

We're on the next page now, so grab the columns as in the other answer's example:

sel %>% 
  getPageSource() %>% # like read_html()
  html_node("table.iceDatTbl") -> dtbl  # this is the data table

html_nodes(dtbl, xpath=".//td[@class='iceDatTblCol1']/a") %>% # get doc ids
  html_text()

html_nodes(dtbl, xpath=".//td[@class='iceDatTblCol2']/a[contains(@href, 'publiws')]") %>% 
  html_attr("href") # get pdf links

Etc… for the other columns like in the other answer

Now get the pagination info like in the other answer:

sel %>% 
  getPageSource() %>% 
  html_node("span.iceOutFrmt") %>% 
  html_text() # the total items / pagination info

Find the next-page button and click it to go to the next page:

sel %>%
  findElement("xpath", ".//img[contains(@src, 'arrow-next')]/../../a") %>% 
  elementClick() # go to next page

Repeat the table-grabbing from above. You should put the whole thing in a for loop driven by the total items/pagination info, as suggested in the other answer.
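
Here's a rough, untested sketch of what that loop could look like; it reuses the selectors above and assumes the "Seite X von Y" status-line wording shown in the other answer:

# parse the total number of pages from the status line
n_pages <- sel %>%
  getPageSource() %>%
  html_node("span.iceOutFrmt") %>%
  html_text() %>%
  sub(".*Seite [0-9,]+ von ([0-9,]+).*", "\\1", .) %>%
  gsub(",", "", ., fixed = TRUE) %>%
  as.integer()

results <- vector("list", n_pages)

for (i in seq_len(n_pages)) {
  dtbl <- sel %>% getPageSource() %>% html_node("table.iceDatTbl")
  results[[i]] <- data.frame(
    dossier = html_nodes(dtbl, xpath=".//td[@class='iceDatTblCol1']/a") %>%
      html_text(),
    pdf_url = html_nodes(dtbl, xpath=".//td[@class='iceDatTblCol2']/a[contains(@href, 'publiws')]") %>%
      html_attr("href"),
    stringsAsFactors = FALSE
  )
  if (i < n_pages) {
    sel %>%
      findElement("xpath", ".//img[contains(@src, 'arrow-next')]/../../a") %>%
      elementClick()
    Sys.sleep(1)  # give the AJAX table refresh a moment before re-reading
  }
}

all_results <- do.call(rbind, results)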

When you're all done, don't forget to call:

selServ$stop()
hrbrmstr