4

I have some R experience, but not with website coding, and I believe I was not able to select the correct CSS nodes to parse.

library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)

url <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
webpage <- read_html(url)

# Attempt to select a result's title link by its id (this is the selector that fails)
title_html <- html_nodes(webpage, 'a#rh06x-YUUvEJ')
title <- html_text(title_html)
head(title)

Ultimately, if I could scrape all the Scholar results and split them into a CSV file with headers like 'Title', 'Author', 'Year', and 'Journal', that would be great. Any help would be much appreciated! Thanks

zx8754
  • **Possible**? Yes. **Legal**? Tricky question: yes at best, a gray area at worst (depends on the jurisdiction). **Allowed**? Probably not; the sites being scraped do not want you to scrape them. – user101 Oct 01 '19 at 20:30
  • This is definitely disallowed by GS's terms of service: https://academia.stackexchange.com/questions/34970/how-to-get-permission-from-google-to-use-google-scholar-data-if-needed : "[don't] try to access [our services] using a method other than the interface and the instructions that we provide." – Ben Bolker Oct 01 '19 at 22:01
  • @AakashUpraity I wonder if you worked out how to scrape `Journal`? – Jeremy K. May 30 '20 at 17:22

1 Answer

5

Concerning your code, you almost had it: you just did not select the proper elements. You selected by id, whereas `html_nodes` works best here when selecting by class. The classes you are looking for are `gs_rt` and `gs_a`.

With regex you can then process the data into the desired format by extracting the authors and years.

url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, stringsAsFactors = FALSE)
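If you then want the CSV you mentioned, here is a minimal sketch that also pulls the journal out of the same `gs_a` strings. It assumes the usual layout "Authors - Journal, Year - Publisher", which book and citation entries do not always follow, so inspect the output; `scholar_results.csv` is just an example file name.

# Sketch: extract the journal from the '.gs_a' strings, assuming the common
# "Authors - Journal, Year - Publisher" layout (entries that deviate are
# left unchanged by gsub, so check the result)
journals <- gsub('^.*?\\W+-\\W+([^,]+),.*', '\\1', authors_years, perl = TRUE)
df$journals <- journals
# Write the table out; 'scholar_results.csv' is an example file name
write.csv(df, 'scholar_results.csv', row.names = FALSE)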
niko
  • Apart from scraping the titles, authors and years, I wonder if there's a way to scrape the journals that the papers are published in? – Jeremy K. May 30 '20 at 17:22