Adding external dependencies is fine but should really be a last resort (IMO).
If you aren't familiar with the Developer Tools view in browsers, please do some research on that before working through this answer. You need to have it open in a fresh browser session before you go to the search page to really see the flow.
GET wasn't working because the search page is driven by an HTML form, and that form submits via POST (those requests show up as XHR requests in most Developer Tools Network panes). However, this is a poorly crafted site that is far too complex for its own good (almost worse than a Microsoft SharePoint site), and there is some initial state that gets set up when you go to the start search page and is maintained throughout the rest of the flow.
I used curlconverter to triage the POST XHR requests. The TL;DR on doing that is: right-click on any POST XHR request, find the "Copy as cURL" menu item and select it. Then, with that still on the clipboard, follow the instructions in the README and manual pages of curlconverter to get actual httr functions back. I can't really promise to walk you through this part or answer curlconverter questions here.
Anyway, to get httr/curl to maintain some cookies for you, and to get a key session variable you'll need to pass with each call, we need to start with a fresh R session and "prime" the scraping process with a GET to the main search URL:
library(stringi) # I prefer this for extracting matched strings
library(xml2)    # for xml_find_first()/xml_text() on the XML responses
library(rvest)
library(httr)
primer <- httr::GET("https://jurispub.admin.ch/publiws/pub/search.jsf")
Now, we need to extract a session string that's in javascript on that page:
httr::content(primer, as="text") %>%
  stri_match_first_regex("session: '([[:alnum:]]+)'") %>%
  .[,2] -> ice_session
Now, we pretend we're submitting a form. Not all of these hidden variables may be needed, but it's what the browser sent. I usually try to pare them down to only what's needed, but this is your project, so have fun with that if you want:
httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id64",
    ice.event.captured = "form:_id63first",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "51",
    ice.event.y = "336",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "form",
    icefacesCssUpdates = "",
    `form:_id63` = "first",
    `form:_idcl` = "form:_id63first",
    ice.session = ice_session, # the session string we extracted above
    ice.view = "1",
    ice.focus = "form:_id63first",
    rand = "0.38654987905551663\\n\\n"
  ),
  encode = "form"
) -> first_pg
Now that we have the first page, we need the data from it. I'm not going to solve this fully, but you should be able to extrapolate from what is below. The POST request returns XML that the javascript on the page turns into a terrible-looking table. We're going to extract that table:
httr::content(first_pg) %>%
  xml_find_first("//updates/update/content") %>%
  xml_text() %>%
  read_html() -> pg_tbl
data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")
However, it's a terrible use of HTML (the programmers had no clue how to do web stuff properly) and you can't just use html_table() on it (and you wouldn't want to anyway, since you likely want links to the PDFs or what not). So, we can pull columns out at will:
html_nodes(data_tbl, xpath=".//td[1]/a") %>%
  html_text()
## [1] "A-3930/2013" "D-7885/2009" "E-5869/2012" "C-651/2011" "F-2439/2017" "D-7416/2009"
## [7] "D-838/2011" "C-859/2011" "E-1927/2017" "E-2606/2011"
html_nodes(data_tbl, xpath=".//td[2]/a") %>%
  html_attr("href")
## [1] "/publiws/download?decisionId=0002b1f8-ea53-40bb-8e38-402d9f3fdfa9"
## [2] "/publiws/download?decisionId=0002da8f-306e-4395-8eed-0b168df8634b"
## [3] "/publiws/download?decisionId=0003ec45-50be-45b2-8a56-5c0d866c2603"
## [4] "/publiws/download?decisionId=000508c2-c852-4aef-bc32-3385ddbbe88a"
## [5] "/publiws/download?decisionId=0006fbb9-228a-4bdc-ac8c-52db67df3b34"
## [6] "/publiws/download?decisionId=0008a971-6795-434d-90d4-7aeb1961606b"
## [7] "/publiws/download?decisionId=00099619-519c-4c8f-9cea-a16ed9ab9fd8"
## [8] "/publiws/download?decisionId=0009ac38-f2b0-4733-b379-05682473b5d9"
## [9] "/publiws/download?decisionId=000a4e0f-b2a2-483b-a49f-6ad12f4b7849"
## [10] "/publiws/download?decisionId=000be307-37b1-4d46-b651-223ceec9e533"
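As an aside, since those hrefs are relative, grabbing one of the PDFs would look something like this. Just a sketch: the output filename is arbitrary, and I haven't verified whether the download endpoint needs the session cookies (httr will re-use the ones it already has in this R session anyway).

# sketch: download a single decision PDF (the filename here is arbitrary)
pdf_href <- "/publiws/download?decisionId=0002b1f8-ea53-40bb-8e38-402d9f3fdfa9"
httr::GET(
  url = paste0("https://jurispub.admin.ch", pdf_href),
  httr::write_disk("A-3930-2013.pdf", overwrite = TRUE)
)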
Lather, rinse, repeat for any other columns, but you may need to do some extra work to get them out as cleanly, and that's an exercise left to you (i.e. I won't answer questions about it).
And, you'll want to know where you are in the scraping process, so we'll need to grab that line at the bottom of the table:
html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>%
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 1 bis 10. Seite 1 von 5,730. Resultat sortiert nach: Relevanz"
Parsing that into the number of results and which page you're on is an exercise left to the reader.
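If you want a head start on that, something along these lines should work. It's a rough sketch that assumes the progress string keeps the German wording and the comma thousands separators shown above:

# rough sketch: parse the progress line into numbers
html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>%
  html_text() %>%
  stri_match_first_regex("zeige ([\\d,]+) bis ([\\d,]+). Seite ([\\d,]+) von ([\\d,]+)") %>%
  .[, 2:5] %>%
  stri_replace_all_fixed(",", "") %>%
  as.numeric() -> progress # from, to, current page, total pages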
Now, we need to programmatically click "next page" until we're done. I'm going to do two manual iterations to prove it works and (hopefully) head off "it doesn't work" comments. You should write an iterator or loop to go through all the "next" pages and save the data however you want.
Next page (first iteration):
httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id67",
    ice.event.captured = "form:_id63next",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "330",
    ice.event.y = "559",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "",
    icefacesCssUpdates = "",
    `form:_id63` = "next",
    `form:_idcl` = "form:_id63next",
    iceTooltipInfo = "tooltip_id=form:resultTable:7:tt_ps; tooltip_src_id=form:resultTable:7:_id57; tooltip_state=hide; tooltip_x=846; tooltip_y=433; cntxValue=",
    ice.session = ice_session,
    ice.view = "1",
    ice.focus = "form:_id63next",
    rand = "0.17641832791084566\\n\\n"
  ),
  encode = "form"
) -> next_pg
httr::content(next_pg) %>%
  xml_find_first("//updates/update/content") %>%
  xml_text() %>%
  read_html() -> pg_tbl
data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")
html_nodes(data_tbl, xpath=".//td[1]/a") %>%
  html_text()
## [1] "D-4059/2011" "D-4389/2006" "E-4019/2006" "D-4291/2008" "E-5642/2012" "E-7752/2010"
## [7] "D-7010/2014" "D-1551/2013" "C-7715/2010" "E-3187/2013"
html_nodes(data_tbl, xpath=".//td[2]/a") %>%
  html_attr("href")
## [1] "/publiws/download?decisionId=000bfd02-4da5-4bb2-a5d0-e9977bf8e464"
## [2] "/publiws/download?decisionId=000e2be1-6da8-47ff-b707-4a3537320a82"
## [3] "/publiws/download?decisionId=000fa961-ecb4-47d2-8ca3-72e8824c2c6b"
## [4] "/publiws/download?decisionId=0010a089-4f19-433e-b106-6d75833fae9a"
## [5] "/publiws/download?decisionId=00111bfc-3522-4a32-9e7a-fa2d9f171427"
## [6] "/publiws/download?decisionId=00126b65-b345-4988-826b-b213080caa45"
## [7] "/publiws/download?decisionId=00127944-5c88-43f6-9ef1-3c822288b0c7"
## [8] "/publiws/download?decisionId=00135a17-f1eb-4b61-9171-ac1d27fd3910"
## [9] "/publiws/download?decisionId=0014c6ea-c229-4129-bbe0-7411d34d9743"
## [10] "/publiws/download?decisionId=00167998-54d2-40a5-b02b-0c4546ac4760"
html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>%
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 11 bis 20. Seite 2 von 5,730. Resultat sortiert nach: Relevanz"
Notice that the column values are different and the progress text is different. Also note that we got lucky: the (incompetent) programmers on the site actually wired up a "next" event instead of forcing us to figure out pagination numbers and X/Y coordinates.
Next page (second and last example iteration):
httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id67",
    ice.event.captured = "form:_id63next",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "330",
    ice.event.y = "559",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "",
    icefacesCssUpdates = "",
    `form:_id63` = "next",
    `form:_idcl` = "form:_id63next",
    iceTooltipInfo = "tooltip_id=form:resultTable:7:tt_ps; tooltip_src_id=form:resultTable:7:_id57; tooltip_state=hide; tooltip_x=846; tooltip_y=433; cntxValue=",
    ice.session = ice_session,
    ice.view = "1",
    ice.focus = "form:_id63next",
    rand = "0.17641832791084566\\n\\n"
  ),
  encode = "form"
) -> next_pg
httr::content(next_pg) %>%
  xml_find_first("//updates/update/content") %>%
  xml_text() %>%
  read_html() -> pg_tbl
data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")
html_nodes(data_tbl, xpath=".//td[1]/a") %>%
  html_text()
## [1] "D-3974/2010" "D-5847/2009" "D-4241/2015" "E-3043/2010" "D-602/2016" "C-2065/2008"
## [7] "D-2753/2007" "E-2446/2010" "C-1124/2015" "B-7400/2006"
html_nodes(data_tbl, xpath=".//td[2]/a") %>%
  html_attr("href")
## [1] "/publiws/download?decisionId=00173ef1-2900-49d4-b7d3-39246e552a70"
## [2] "/publiws/download?decisionId=001a344c-86b7-4f32-97f7-94d30669a583"
## [3] "/publiws/download?decisionId=001ae810-300d-4291-8fd0-35de720a6678"
## [4] "/publiws/download?decisionId=001c2025-57dd-4bc6-8bd6-eedbd719a6e3"
## [5] "/publiws/download?decisionId=001c44ba-e605-455d-9609-ed7dffb17adc"
## [6] "/publiws/download?decisionId=001c6040-4b81-4137-a6ee-bad5a5019e71"
## [7] "/publiws/download?decisionId=001d0811-a5c2-4856-aef3-51a44f7f2b0e"
## [8] "/publiws/download?decisionId=001dbf61-b1b8-468d-936e-30b174a8bec9"
## [9] "/publiws/download?decisionId=001ea85a-0765-4a1f-9b81-3cecb9f36b31"
## [10] "/publiws/download?decisionId=001f2e34-9718-4ef7-a60c-e6bbe208003b"
html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>%
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 21 bis 30. Seite 3 von 5,730. Resultat sortiert nach: Relevanz"
Ideally, you'd wrap the POST in a function you can call that returns a data frame, so you can rbind or bind_rows the per-page results into one big data frame.
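For what it's worth, a pager might look roughly like the following. It's only a sketch: get_next_page() and parse_page() are names I made up, the POST body is pared down from the one above (you may need to send the full set of hidden fields), and you'll want to bump the page count and add error handling yourself.

# sketch of a pager -- helper names are made up, POST body is pared down
get_next_page <- function(ice_session) {
  httr::POST(
    url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
    body = list(
      `$ice.submit.partial` = "true",
      ice.event.target = "form:_id67",
      ice.event.captured = "form:_id63next",
      ice.event.type = "onclick",
      `form:_id63` = "next",
      `form:_idcl` = "form:_id63next",
      ice.session = ice_session,
      ice.view = "1",
      ice.focus = "form:_id63next"
    ),
    encode = "form"
  )
}

parse_page <- function(resp) {
  httr::content(resp) %>%
    xml_find_first("//updates/update/content") %>%
    xml_text() %>%
    read_html() -> pg
  tbl <- html_node(pg, xpath=".//table[contains(., 'Dossiernummer')]")
  data.frame(
    dossier = html_nodes(tbl, xpath=".//td[1]/a") %>% html_text(),
    pdf_url = html_nodes(tbl, xpath=".//td[2]/a") %>% html_attr("href"),
    stringsAsFactors = FALSE
  )
}

pages <- vector("list", 3) # bump 3 up to the real page count when you're ready
pages[[1]] <- parse_page(first_pg)
for (i in 2:length(pages)) {
  pages[[i]] <- parse_page(get_next_page(ice_session))
  Sys.sleep(5) # be kind to the server
}
all_results <- do.call(rbind, pages)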
If you made it this far, an alternative is to use RSelenium to orchestrate the "next page" clicks in a real browser and pull the HTML back (the table will still be horribad, and you'll need the column targeting above or some other HTML selector magic to get useful info out of it, thanks to the aforementioned inept programmers). RSelenium introduces an external dependency that, as you'll see if you do a search on SO, many R users have trouble getting working, especially on the equally wretched legacy operating system known as Windows. If you can get Selenium running and RSelenium working with it, it might be easier in the long run if all of the above seems daunting. You're still going to have to grok Developer Tools at some point, so the above might be worth the pain anyway, and you'll need the HTML selector targets for the various buttons for Selenium too.
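If you go that route, the skeleton might look something like the following. Treat it as a sketch: the "next" button id is a guess based on the ice.event.captured value above ("form:_id63next"), so verify it in Developer Tools first.

# rough RSelenium sketch -- the 'next' button id below is a guess, check it first
library(RSelenium)
library(rvest)

drv <- rsDriver(browser = "firefox") # starts a Selenium server + browser
remDr <- drv$client

remDr$navigate("https://jurispub.admin.ch/publiws/pub/search.jsf")
# ...run the search in the browser window, then page through:
next_btn <- remDr$findElement(using = "xpath", value = "//*[@id='form:_id63next']")
next_btn$clickElement()
Sys.sleep(2) # give the XHR time to refresh the table

pg <- read_html(remDr$getPageSource()[[1]])
data_tbl <- html_node(pg, xpath=".//table[contains(., 'Dossiernummer')]")
# ...then the same column extraction as above

remDr$close(); drv$server$stop() # clean up when done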
I'd seriously avoid phantomjs as it's now in a "best effort" maintenance state and you'll have to figure out how to do the above with JavaScript vs R.