I'm trying to build an academic network using information available from Google Scholar. Part of this involves scraping data from a pop-up window that appears when I click on an article title on an individual scholar's page. I'm not sure what kind of window it is; it doesn't seem to be a regular window or an iframe.
I've been using RSelenium to perform this task. Below is the code I've developed so far for interacting with Google Scholar.
#Libraries----
library(RSelenium)
#Functions----
#Convenience function for simplifying data generated from .$findElements()
unPack <- function(x, opt = "text"){
  unlist(sapply(x, function(x){x$getElementAttribute(opt)}))
}
#Analysis----
#Start up the server for Chrome.
rD <- rsDriver(browser = "chrome")
#Start Chrome.
remDr <- rD[["client"]]
#Add a test URL.
siteAdd <- "http://scholar.google.com/citations?user=sc3TX6oAAAAJ&hl=en&oi=ao"
#Open the site.
remDr$navigate(siteAdd)
#Create a list of all the article titles
cite100Elem <- remDr$findElements(using = "css selector", value = "a.gsc_a_at")
cite100 <- unPack(cite100Elem)
#Start scraping the first article. I will create some kind of loop for all
# articles later.
#This opens the pop-up window with additional data I'm interested in.
citeTitle <- cite100[1]
citeElem <- remDr$findElement(using = 'link text', value = citeTitle)
citeElem$clickElement()
Here's where I get stuck. Looking at the underlying page with Chrome's Developer Tools, I can see that the first piece of information I'm interested in, the authors of the article, is held in the following HTML:
<div class="gsc_vcd_value">TR Moore, NT Roulet, JM Waddington</div>
This suggests that I should be able to do something like:
#Extract all the information about the article.
articleElem <- remDr$findElements(using = "xpath", value = '//*[@class="gsc_vcd_title"]')
articleInfo <- unPack(articleElem)
However, this doesn't work; articleInfo comes back as NULL.
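One thing I haven't ruled out is timing: the pop-up may be filled in after the click, so the elements might not exist yet when findElements() runs. Below is a minimal sketch of how I could poll for them, assuming the div.gsc_vcd_value class from the HTML above is the right target. pollUntil and getPopupValues are names I made up for illustration, not RSelenium functions. I've also used getElementText() here rather than my unPack() helper, since I'm not sure a "text" attribute means anything for a div.

```r
# Hypothetical helper (not part of RSelenium): call finder() repeatedly until
# it returns a non-empty result or timeout seconds have passed.
pollUntil <- function(finder, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat lookup errors as "not there yet" and keep polling.
    found <- tryCatch(finder(), error = function(e) list())
    if (length(found) > 0) return(found)
    if (Sys.time() >= deadline) return(list())
    Sys.sleep(interval)
  }
}

# Sketch of using it on the pop-up; remDr is the client created earlier.
# The selector is an assumption based on the HTML I saw in the Developer Tools.
getPopupValues <- function(remDr, timeout = 10) {
  elems <- pollUntil(function() {
    remDr$findElements(using = "css selector", value = "div.gsc_vcd_value")
  }, timeout = timeout)
  # getElementText() returns a list holding the visible text of each element.
  unlist(lapply(elems, function(el) el$getElementText()))
}
```

After citeElem$clickElement(), I would call getPopupValues(remDr) and expect a character vector if the elements appear within the timeout, or NULL if they never do.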
I'm hoping that someone out there has an R-based solution, because I know very little about JavaScript.
Lastly, if I search the output of the following code (which parses the page I'm currently on):
htmlOut <- XML::htmlParse(remDr$getPageSource()[[1]])
htmlOut
I can't find the CSS class "gsc_vcd_title" anywhere, which suggests to me that the page I'm interested in has a more complicated structure that I haven't quite figured out yet.
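To check this more systematically, something like the following could list every class name in the page source that starts with a given prefix. It is only a rough sketch: listClasses is a hypothetical helper, it uses a crude regular expression rather than a real HTML parser, and it assumes class attributes are written with double quotes.

```r
# Hypothetical helper: list the distinct class names in an HTML string that
# start with a given prefix. pageSource must be a single character string.
listClasses <- function(pageSource, prefix = "gsc_vcd") {
  # Pull out every class="..." attribute with a regular expression.
  attrs <- regmatches(pageSource, gregexpr('class="[^"]*"', pageSource))[[1]]
  # Strip the surrounding class=" and closing quote to leave the raw values.
  values <- sub('^class="', "", sub('"$', "", attrs))
  # An element can carry several classes separated by whitespace.
  classes <- as.character(unique(unlist(strsplit(values, "\\s+"))))
  sort(classes[startsWith(classes, prefix)])
}
```

I could call it as listClasses(remDr$getPageSource()[[1]]) right after the click and again a few seconds later, to see whether the gsc_vcd classes only show up after a delay.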
Any insights you have would be very welcome. Thanks!