
I'm trying to develop an academic network using information available from Google Scholar. Part of this involves scraping data from a pop-up window (I'm not actually sure what kind of window it is; it doesn't seem to be a regular window or an iframe) that is produced by clicking on an article title on an individual scholar's page.

I've been using RSelenium to perform this task. Below is the code I've developed so far for interacting with Google Scholar.

#Libraries----    
library(RSelenium)


#Functions----
#Convenience function for simplifying the data generated by .$findElements().
# Takes a list of webElements and returns a character vector of the requested
# attribute for each one.
unPack <- function(x, opt = "text"){
  unlist(sapply(x, function(el){el$getElementAttribute(opt)}))
}


#Analysis----
#Start up the server for Chrome.
rD <- rsDriver(browser = "chrome")
#Start Chrome.
remDr <- rD[["client"]]
#Add a test URL.
siteAdd <- "http://scholar.google.com/citations?user=sc3TX6oAAAAJ&hl=en&oi=ao"
#Open the site.
remDr$navigate(siteAdd)

#Create a list of all the article titles
cite100Elem <- remDr$findElements(using = "css selector", value = "a.gsc_a_at")
cite100 <- unPack(cite100Elem)

#Start scraping the first article. I will create some kind of loop for all
# articles later.
#This opens the pop-up window with additional data I'm interested in.
citeTitle <- cite100[1]
citeElem <- remDr$findElement(using = 'link text', value = citeTitle)
citeElem$clickElement()
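For reference, the loop I have in mind would look something like the sketch below. It's untested; I'm guessing that each pop-up has to be dismissed before the next title can be clicked, and sending an Escape keypress is only my guess at how to do that.

```r
#Rough, untested sketch of the eventual loop over all article titles.
allArticleInfo <- vector("list", length(cite100))
for(i in seq_along(cite100)){
  citeElem <- remDr$findElement(using = "link text", value = cite100[i])
  citeElem$clickElement()
  #...extract the pop-up data here (the part I'm stuck on below)...
  #Guess: close the pop-up so the next title is clickable.
  remDr$sendKeysToActiveElement(list(key = "escape"))
}
```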

Here's where I get stuck. Looking at the underlying webpage using Chrome's Developer tools, I can see that the first piece of information I'm interested in, the authors of the article, is associated with the following HTML:

<div class="gsc_vcd_value">TR Moore, NT Roulet, JM Waddington</div>

This suggests that I should be able to do something like:

#Extract all the information about the article.
articleElem <- remDr$findElements(value = '//*[@class="gsc_vcd_title"]')
articleInfo <- unPack(articleElem)

However, this doesn't work; it returns NULL.
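One thought I had is that the pop-up content simply hasn't rendered yet when I call `findElements`, so I tried polling for it. `waitFor` below is a hypothetical helper of my own (not part of RSelenium), and it assumes the pop-up is injected into the main DOM rather than a separate frame:

```r
#Hypothetical helper: poll until an XPath expression matches something,
# or give up after a timeout (in seconds).
waitFor <- function(remDr, xpath, timeout = 5, interval = 0.25){
  deadline <- Sys.time() + timeout
  repeat{
    elems <- remDr$findElements(using = "xpath", value = xpath)
    if(length(elems) > 0) return(elems)
    if(Sys.time() > deadline) stop("Timed out waiting for: ", xpath)
    Sys.sleep(interval)
  }
}

#"gsc_vcd_value" is the class shown in the HTML snippet above.
articleElem <- waitFor(remDr, '//*[@class="gsc_vcd_value"]')
articleInfo <- unPack(articleElem)
```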

I'm hoping that someone out there has an R-based solution, because I know very little about JavaScript.

Last, if I search the text that results from the following code (which parses the page I'm currently on):

htmlOut <- XML::htmlParse(remDr$getPageSource()[[1]])
htmlOut

I can't find the CSS class "gsc_vcd_title" anywhere, which suggests to me that the page I'm interested in has a more complicated structure that I haven't quite figured out yet.
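For completeness, a quicker way to run that same check (assuming the pop-up modifies the main document rather than living in a separate frame) would be to re-fetch the source after the click and search it directly:

```r
#Re-fetch the page source *after* clicking the article title, then look
# for the pop-up's classes as plain strings.
pageSrc <- remDr$getPageSource()[[1]]
grepl("gsc_vcd_title", pageSrc, fixed = TRUE)
grepl("gsc_vcd_value", pageSrc, fixed = TRUE)
```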

Any insights you have would be very welcome. Thanks!

  • There's an R package, [scholar](https://cran.r-project.org/web/packages/scholar/index.html), which may be a better solution than web scraping. [Here's an introduction](http://tuxette.nathalievilla.org/?p=1682) which may be a little outdated. You'd use, for example, `get_publications()` to fetch article information for a user. – neilfws Feb 08 '18 at 02:12
  • Once you click, the js loads a `div.gs_citt` at the top of your html, which is the CSS path for the pop-up table – Dambo Feb 08 '18 at 02:36
  • @neilfws I appreciate the suggestion! This is really just me goofing around, trying to come up with my own solutions. But now that you've pointed me to that package, it will be really interesting to compare and contrast. – Jason Mercer Feb 08 '18 at 03:38
  • @Dambo, where are you seeing `div.gs_citt`? I can't find it anywhere and my various attempts at using it in an xpath expression with either "id" or "class" are yielding no results. Thoughts? – Jason Mercer Feb 08 '18 at 03:40
  • You're being unethical if you violate https://scholar.google.com/robots.txt at all (and risking jail time/fines) – hrbrmstr Feb 08 '18 at 03:49
  • @hrbrmstr, I'm not really sure what you posted means. Does that also imply that the scholars package is being unethical? I literally don't know. Like I said, I'm treating this as a fun project to learn how to web scrape. – Jason Mercer Feb 08 '18 at 04:03
  • @JasonMercer use code inspector, click on the icon for accessing the citations table. When you do so, your html changes, and you should see a `div.gs_citt` somewhere at the top of your html. That's the div you want to use. – Dambo Feb 08 '18 at 04:44
  • I'm growing really, really weary of the academic "I'm just learning" excuses to behave unethically. Read up on the ethics and legal ramifications (to you and the encouragement of other to aid you which makes them accomplices) before starting down the tech path. "Just because I can" !== "should". – hrbrmstr Feb 08 '18 at 15:13

0 Answers