
I am trying to use rvest to scrape one page of Google Scholar search results into a dataframe of author, paper title, year, and journal title.

The simplified, reproducible example below is code that searches Google Scholar for the example terms "apex predator conservation".

Note: to stay within the Terms of Service, I only want to process the first page of search results that I would get from a manual search. I am not asking about automation to scrape additional pages.

The following code already works to extract:

  • author
  • paper title
  • year

but it does not have:

  • journal title

I would like to extract the journal title and add it to the output.

library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)

url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, stringsAsFactors = FALSE)

df

source: https://stackoverflow.com/a/58192323/8742237

So the output of that code looks like this:

#>                                                                                                                                                   titles
#> 1                                                                                    [HTML][HTML] Saving large carnivores, but losing the apex predator?
#> 2                               Site fidelity and sex-specific migration in a mobile apex predator: implications for conservation and ecosystem dynamics
#> 3                  Effects of tourism-related provisioning on the trophic signatures and movement patterns of an apex predator, the Caribbean reef shark

#>                                           authors years
#> 1                  A Ordiz, R Bischof, JE Swenson  2013
#> 2  A Barnett, KG Abrantes, JD Stevens, JM Semmens  2011

Two questions:

  1. How can I add a column that has the journal title extracted from the raw data?
  2. Is there a reference where I can learn how to work out the extraction of other fields for myself, so I don't have to ask here?
Jeremy K.

1 Answer


One way to add the journal titles is this:

library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)
library(purrr)  # provides map(), map_chr() used below

url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)


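# Strip the authors and the year from each string,
# leaving roughly " - journal, - publisher"
leftovers <- authors_years %>% 
  str_remove_all(authors) %>% 
  str_remove_all(years)


# Split on "-", take the second piece (the journal part),
# keep only runs of letters, and paste them back together
journals <- str_split(leftovers, "-") %>% 
            map_chr(2) %>% 
            str_extract_all("[:alpha:]*") %>% 
            map(function(x) x[x != ""]) %>% 
            map(~paste(., collapse = " ")) %>% 
            unlist()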

# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, journals = journals, stringsAsFactors = FALSE)
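
To try the idea without hitting Google Scholar, here is a minimal base-R sketch that runs on hard-coded strings. The sample strings are made up, and the "authors - journal, year - publisher" layout is an assumption about what the '.gs_a' text looks like; real results can deviate (no journal, truncated names), so treat this as a starting point rather than a robust parser.

```r
# Made-up sample strings in the assumed '.gs_a' layout:
# "authors - journal, year - publisher"
authors_years <- c(
  "A Author, B Writer - Journal of Examples, 2013 - examplepublisher.com",
  "C Person - Example Ecology Letters, 2011 - Example Press"
)

# Split on a hyphen surrounded by non-word characters,
# the same separator pattern the authors regex relies on
parts <- strsplit(authors_years, "\\W+-\\W+", perl = TRUE)

# The second field is assumed to hold "journal, year"
middle <- vapply(parts, function(p) p[2], character(1))

# Drop the trailing ", yyyy" to leave the journal title
journals <- trimws(sub(",?\\s*\\d{4}\\s*$", "", middle, perl = TRUE))

journals
```

Unlike stripping out every non-letter, this keeps digits and punctuation inside a journal name, but it returns NA for any entry that has no second field.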

For your second question: the SelectorGadget Chrome extension is handy for finding the CSS selectors of the elements you want. In your case, though, the authors, journal, and year all sit in the same element with the same CSS class ('.gs_a'), so a selector cannot separate them and the only way to disentangle them is with regex. So I guess learn a bit about CSS selectors and regex :)
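
A small sketch of what a selector can and cannot do here, run on an inline HTML fragment rather than the live page. The fragment is hypothetical: only the class names '.gs_rt' and '.gs_a' come from the code above, and the text inside them is made up.

```r
library(rvest)
library(xml2)

# Hypothetical fragment imitating one search result
page <- xml2::read_html('
<div>
  <h3 class="gs_rt">An example paper title</h3>
  <div class="gs_a">A Author, B Writer - Journal of Examples, 2013 - examplepublisher.com</div>
</div>')

# The title has its own class, so a selector isolates it cleanly
title <- rvest::html_text(rvest::html_nodes(page, ".gs_rt"))

# Authors, journal, year and publisher come back as ONE string;
# no selector can split them further, which is where regex comes in
byline <- rvest::html_text(rvest::html_nodes(page, ".gs_a"))

title
byline
```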

Ahorn