I would like to scrape the title and authors of journal articles from all staff members' official web pages, e.g.:

https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah

The specific part I'm trying to access is the list of journal articles on that page (shown in a screenshot in the original post).

I'm following this guide: https://www.datacamp.com/community/tutorials/r-web-scraping-rvest but it refers to HTML tags which this site doesn't have. Can anyone point me in the right direction, please?

HCAI
  • Ohhhh, that's interesting! To see the publications you have to click on the "Journal articles" tab at the bottom. – HCAI Nov 04 '21 at 09:00
  • @stevec there is a publications database called "symplectic" that Leeds uses that holds all the metadata about the articles of each staff member so I'm guessing somehow it's linked to that. – HCAI Nov 04 '21 at 09:07
  • @stevec yes it can be done using direct http requests (see below). I don't think this is any more intelligent though, and may be difficult to generalize. – Allan Cameron Nov 04 '21 at 09:19
  • @AllanCameron you did well to find those requests! I couldn't spot them at all. Did you use the chrome devtools network tab? Or is there another tool I'm not familiar with? – stevec Nov 04 '21 at 09:19
  • @stevec I used the Firefox developer panel with all the XHR requests displayed. I have to do this kind of thing a lot so I'm used to homing in on the correct request. – Allan Cameron Nov 04 '21 at 09:21

2 Answers

The page loads these citations dynamically using an XHR call that returns a JSON object. In this case, we can replicate the query and parse the JSON ourselves to get the publication list:

library(httr)
library(rvest)
library(jsonlite)

url <- paste0("https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?",
       "uniqueid=00970757",
       "&tries=0", 
       "&hash=f6a214dc99686895d6bf3de25507356f", 
       "&citationStyle=1")

GET(url) %>% 
  content("text") %>%
  fromJSON() %>%
  `[[`("publications") %>%
  `[[`("journal_article") %>%
  lapply(function(x) paste(x$authors, x$title, x$journal, sep = " ; ")) %>%
  unlist() %>%
  as.character()
#> [1] "Adu-Amankwah S, Zajac M, Skocek J, Nemecek J, Haha MB, Black L ; Combined influence of carbonation and leaching on freeze-thaw resistance of limestone ternary cement concrete ; Construction and Building Materials"                        
#> [2] "Wang H, Hou P, Li Q, Adu-Amankwah S, Chen H, Xie N, Zhao P, Huang Y, Wang S, Cheng X ; Synergistic effects of supplementary cementitious materials in limestone and calcined clay-replaced slag cement ; Construction and Building Materials"
#> [3] "Shamaki M, Adu-Amankwah S, Black L ; Reuse of UK alum water treatment sludge in cement-based materials ; Construction and Building Materials"                                                                                                
#> [4] "Adu-Amankwah S, Bernal Lopez S, Black L ; Influence of component fineness on hydration and strength development in ternary slag-limestone cements ; RILEM Technical Letters"                                                                 
#> [5] "Adu-Amankwah S, Zajac M, Skocek J, Ben Haha M, Black L ; Relationship between cement composition and the freeze-thaw resistance of concretes ; Advances in Cement Research"                                                                  
#> [6] "Zajac M, Skocek J, Adu-Amankwah S, Black L, Ben Haha M ; Impact of microstructure on the performance of composite cements: Why higher total porosity can result in higher strength ; Cement and Concrete Composites"                         
#> [7] "Adu-Amankwah S, Black L, Skocek J, Ben Haha M, Zajac M ; Effect of sulfate additions on hydration and performance of ternary slag-limestone composite cements ; Construction and Building Materials"                                         
#> [8] "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L ; Influence of limestone on the hydration of ternary slag cement ; Cement and Concrete Research"                                                                                   
#> [9] "Adu-Amankwah S, Khatib JM, Searle DE, Black L ; Effect of synthesis parameters on the performance of alkali-activated non-conformant EN 450 pulverised fuel ash ; Construction and Building Materials"

Update

It is possible to get the JSON URL from the HTML of the faculty member's homepage with a bit of text parsing:

get_json_request <- function(url)
{
   carveout <- function(string, start, end)
   {
      string %>% strsplit(start) %>% `[[`(1) %>% `[`(2) %>%
                 strsplit(end)   %>% `[[`(1) %>% `[`(1)
   }
   
   params <- GET(url) %>% 
      content("text") %>% 
      carveout("var dataGetQuery = ", ";")
   
   id <- carveout(params, "uniqueid: '", "'")
   tries <- carveout(params, "tries: ", ",")
   hash <- carveout(params, "hash: '", "'")
   citationStyle <- carveout(params, "citationStyle: ", "\n")

   paste0("https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?",
          "uniqueid=", id,
          "&tries=", tries, 
          "&hash=", hash,
          "&citationStyle=", citationStyle)
}

Which allows:

url <- "https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah"

get_json_request(url)
#> [1] "https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?uniqueid=00970757&tries=0&hash=f7266eb42b24715cfdf2851f24b229c6&citationStyle=1"
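
To see what the text parsing does without hitting the site, the carve-out step can be tried on a small string. This is a sketch under assumptions: the `<script>` snippet below is a guess at what the page embeds, reconstructed from the search strings used above, and this `carveout()` is a base-R variant with fixed (non-regex) matching rather than the exact function from the answer.

```r
# Base-R variant of carveout(): text between the first `start` and the next `end`
carveout <- function(string, start, end) {
  after_start <- strsplit(string, start, fixed = TRUE)[[1]][2]
  strsplit(after_start, end, fixed = TRUE)[[1]][1]
}

# Hypothetical page fragment; the real page may be laid out differently
page_snippet <- paste0(
  "<script>\n",
  "var dataGetQuery = {\n",
  "  uniqueid: '00970757',\n",
  "  tries: 0,\n",
  "  hash: 'f6a214dc99686895d6bf3de25507356f',\n",
  "  citationStyle: 1\n",
  "};\n",
  "</script>"
)

params        <- carveout(page_snippet, "var dataGetQuery = ", ";")
id            <- carveout(params, "uniqueid: '", "'")
tries         <- carveout(params, "tries: ", ",")
hash          <- carveout(params, "hash: '", "'")
citationStyle <- carveout(params, "citationStyle: ", "\n")

query <- paste0("uniqueid=", id, "&tries=", tries,
                "&hash=", hash, "&citationStyle=", citationStyle)
print(query)
```

Fixed matching avoids surprises if a search string ever contains regex metacharacters.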

And, if you want to be able to just lapply a vector of homepage urls to get the final publication list:

publications_from_homepage <- function(url)
{
   get_json_request(url) %>%
   GET() %>% 
     content("text") %>%
     fromJSON() %>%
     `[[`("publications") %>%
     `[[`("journal_article") %>%
     lapply(function(x) paste(x$authors, x$title, x$journal, sep = " ; ")) %>%
     unlist() %>%
     as.character()
}

So you have:

publications_from_homepage(url)
#> [1] "Adu-Amankwah S, Zajac M, Skocek J, Nemecek J, Haha MB, Black L ; Combined influence of carbonation and leaching on freeze-thaw resistance of limestone ternary cement concrete ; Construction and Building Materials"                        
#> [2] "Wang H, Hou P, Li Q, Adu-Amankwah S, Chen H, Xie N, Zhao P, Huang Y, Wang S, Cheng X ; Synergistic effects of supplementary cementitious materials in limestone and calcined clay-replaced slag cement ; Construction and Building Materials"
#> [3] "Shamaki M, Adu-Amankwah S, Black L ; Reuse of UK alum water treatment sludge in cement-based materials ; Construction and Building Materials"                                                                                                
#> [4] "Adu-Amankwah S, Bernal Lopez S, Black L ; Influence of component fineness on hydration and strength development in ternary slag-limestone cements ; RILEM Technical Letters"                                                                 
#> [5] "Adu-Amankwah S, Zajac M, Skocek J, Ben Haha M, Black L ; Relationship between cement composition and the freeze-thaw resistance of concretes ; Advances in Cement Research"                                                                  
#> [6] "Zajac M, Skocek J, Adu-Amankwah S, Black L, Ben Haha M ; Impact of microstructure on the performance of composite cements: Why higher total porosity can result in higher strength ; Cement and Concrete Composites"                         
#> [7] "Adu-Amankwah S, Black L, Skocek J, Ben Haha M, Zajac M ; Effect of sulfate additions on hydration and performance of ternary slag-limestone composite cements ; Construction and Building Materials"                                         
#> [8] "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L ; Influence of limestone on the hydration of ternary slag cement ; Cement and Concrete Research"                                                                                   
#> [9] "Adu-Amankwah S, Khatib JM, Searle DE, Black L ; Effect of synthesis parameters on the performance of alkali-activated non-conformant EN 450 pulverised fuel ash ; Construction and Building Materials"
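
When looping over many homepages, it may help to pause between requests and to keep going if one page fails. A minimal sketch, assuming a hypothetical helper `scrape_many()` (not part of any package) that takes the per-page function as an argument, so it can be tried here with a stand-in instead of real requests:

```r
# Hypothetical helper: apply `fetch` to each URL, pausing between calls and
# returning NA for pages that raise an error instead of stopping the loop.
scrape_many <- function(urls, fetch, delay = 1) {
  results <- lapply(urls, function(u) {
    Sys.sleep(delay)  # pause between requests to be gentle on the server
    tryCatch(fetch(u), error = function(e) {
      message("Failed for ", u, ": ", conditionMessage(e))
      NA_character_
    })
  })
  names(results) <- urls
  results
}

# Stand-in for a real per-page function, used only for this demonstration
fake_fetch <- function(u) {
  if (grepl("bad", u, fixed = TRUE)) stop("page could not be parsed")
  paste("publications for", u)
}

res <- scrape_many(c("https://example.org/good", "https://example.org/bad"),
                   fake_fetch, delay = 0)
print(res)
```

In real use, `publications_from_homepage` would be passed as `fetch`, with a non-zero `delay`.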

Created on 2021-11-04 by the reprex package (v2.0.0)

Allan Cameron
  • Thank you very much for such a quick reply, it works a treat! How did you find all the parts to the url? I have a long list of different researchers (https://eps.leeds.ac.uk/civil-engineering/stafflist) so am going to put it into a list and use lapply. – HCAI Nov 04 '21 at 10:50
  • @HCAI see my update so you can build the json url without having to find it in the developer tab or use Selenium. This should allow you to rapidly `lapply` a vector of the faculty's homepage urls which should be easy to `rvest` from https://eps.leeds.ac.uk/civil-engineering/stafflist – Allan Cameron Nov 04 '21 at 12:51
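
Collecting the homepage URLs from the staff list could look roughly like the sketch below. It is run on a mock HTML fragment, and it assumes that profile links follow the pattern seen in the question (`/civil-engineering/staff/<number>/<name>`); the real staff list markup has not been checked here. `extract_staff_links()` is a hypothetical helper name, and the second profile link is invented.

```r
library(rvest)

# Hypothetical helper: pull profile links out of a staff list page.
# `html` can be an HTML string or anything else read_html() accepts.
extract_staff_links <- function(html, base_url = "https://eps.leeds.ac.uk") {
  hrefs <- read_html(html) %>%
    html_elements("a") %>%
    html_attr("href")
  hrefs <- hrefs[!is.na(hrefs)]
  # Keep only links that look like .../civil-engineering/staff/<number>/<name>
  hrefs <- grep("/civil-engineering/staff/[0-9]+/", hrefs, value = TRUE)
  # Resolve relative links against the site root and drop duplicates
  unique(xml2::url_absolute(hrefs, base_url))
}

# Mock fragment; the second profile link is made up for illustration
mock_stafflist <- '<html><body>
  <a href="/civil-engineering/staff/581/samuel-adu-amankwah">Profile</a>
  <a href="/civil-engineering/staff/999/example-person">Profile</a>
  <a href="/civil-engineering/stafflist">Staff list</a>
  <a>No link target</a>
</body></html>'

staff_urls <- extract_staff_links(mock_stafflist)
print(staff_urls)
```

The resulting vector is what would be handed to `lapply` together with the per-page function.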

Here is an RSelenium approach:

library(RSelenium)
library(rvest)
library(xml2)

#setup driver, client and server
driver <- rsDriver( browser = "firefox", port = 4545L, verbose = FALSE ) 
server <- driver$server
browser <- driver$client

#goto url in browser
browser$navigate("https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah")

#get all relevant titles
doc <- xml2::read_html(browser$getPageSource()[[1]])
df <- data.frame( title = 
                    xml2::xml_find_all(doc, '//span[@class="title-with-parent"]') %>%
                    xml2::xml_text() )

#close everything down properly
browser$close()
server$stop()
# needed on Windows (taskkill is a Windows command), else port 4545 stays occupied by the java process
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
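
The extraction step on its own can be tried without a browser. A small sketch on a mock fragment: the class name `title-with-parent` comes from the code above, while the surrounding markup and the titles are made up.

```r
library(xml2)

# Mock rendered page source; only the span class is taken from the answer
mock_page <- '<html><body>
  <div><span class="title-with-parent">First mock article title</span></div>
  <div><span class="title-with-parent">Second mock article title</span></div>
  <span class="other">Not a title</span>
</body></html>'

doc <- read_html(mock_page)

# Select every span with the matching class and keep its text
titles <- xml_text(xml_find_all(doc, '//span[@class="title-with-parent"]'))
df <- data.frame(title = titles, stringsAsFactors = FALSE)
print(df)
```

Against the live site, `mock_page` would be the string returned by `browser$getPageSource()[[1]]`.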

Wimpel
  • Thank you very much indeed for your help with this! I have accepted Allan's answer because, being very new to scraping, I found it a little easier to read, but having said that I could easily implement your method with a long list of urls. When it becomes available, I would like to give you some bounty for your help. – HCAI Nov 04 '21 at 10:52
  • `driver <- rsDriver( browser = "firefox", port = 4545L, verbose = FALSE )` gives the error: `Error in if (file.access(phantompath, 1) < 0) { : argument is of length zero` – HCAI Nov 04 '21 at 10:54
  • https://stackoverflow.com/a/46325575/6356278 – Wimpel Nov 04 '21 at 11:01