
I am a beginner at web scraping with R. As a first attempt, I tried a simple scrape. This is what I have done so far:

  1. Extract the staff member details from this website (https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff). This is the code I used:
library(rvest)
url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")
url %>% html_nodes(".sppb-addon-content") %>% html_text()

The code above works and returns all of the extracted text.
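
A small aside: html_text() also takes a trim argument, which strips the leading and trailing whitespace these nodes tend to carry:

url %>% html_nodes(".sppb-addon-content") %>% html_text(trim = TRUE)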

  2. When you click on each staff member, you can see further details such as Research Interests, Areas of Specialization, Profile, etc. How can I scrape those details and attach them to the data set above for each staff member?
Rishan.RM

1 Answer


The code below will get you all the links to each professor's page. From there, you can map each link to another set of rvest calls using purrr's map_df or map functions.

Most importantly, giving credit where it's due, @hrbrmstr: R web scraping across multiple pages

The linked answer is subtly different in that it maps across a set of page numbers, as opposed to mapping across a vector of URLs as in the code below.
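
To make the difference concrete, here is a minimal sketch of the two styles; the example.com URLs, the ?page= parameter, and the .some-node selector are placeholders, not the real site's structure:

library(rvest)
library(purrr)

#style 1 (the linked answer): map over page numbers and build each URL
pages <- map_df(1:3, function(i) {
  page <- read_html(sprintf("https://example.com/staff?page=%d", i))
  data.frame(text = page %>% html_nodes(".some-node") %>% html_text())
})

#style 2 (this answer): map over a vector of complete URLs
urls <- c("https://example.com/a", "https://example.com/b")
pages <- map_df(urls, function(u) {
  page <- read_html(u)
  data.frame(text = page %>% html_nodes(".some-node") %>% html_text())
})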

library(rvest)
library(purrr)
library(stringr)
library(dplyr)

url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")

names <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_nodes("strong") %>%
  html_text()
#extract the names

names <- names[-c(3,4)]
#drop the head of department and blank space

names <- names %>%
  tolower() %>%
  str_extract_all("[:alnum:]+") %>%
  sapply(paste, collapse = "-")
#create a list of names separated by dashes, should be identical to link names

content <- url %>% 
  html_nodes(".sppb-addon-content") %>%
  html_text()

content <- content[! content %in% "+"]
#drop the "+" from the content

content_names <- data.frame(prof_name = names, content = content)
#make a df with the content and the names, note the prof_name column is the same as below
#this allows for joining later on

links <- url %>% 
  html_nodes(".sppb-addon-content") %>%
  html_nodes("strong") %>% 
  html_nodes("a") %>%
  html_attr("href")
#create a vector of href links

url_base <- "https://science.kln.ac.lk%s"
urls <- sprintf(url_base, links)
#create a vector of urls for the professor's pages


prof_info <- map_df(urls, function(x) {
  #create an anonymous function to pull the data

  prof_name <- gsub("https://science.kln.ac.lk/depts/im/index.php/", "", x)
  #extract the prof's name from the url

  page <- read_html(x)
  #read each page in the urls vector

  sections <- page %>%
    html_nodes(".sppb-panel-title") %>%
    html_text()
  #extract the section title

  info <- page %>%
    html_nodes(".sppb-panel-body") %>%
    html_nodes(".sppb-addon-content") %>%
    html_text()
  #extract the info from each section

  data.frame(sections = sections, info = info, prof_name = prof_name)
  #build a long-format data frame: one row per section, holding the section
  #title, its text, and the professor's name

}) 
#note this returns a single data frame. Change map_df to map if you want a
#list of data frames (one per page) instead

prof_info <- inner_join(content_names, prof_info, by = "prof_name")
#joining the content from the first page to all the individual pages

Not sure this is the cleanest or most efficient way to do this, but I think this is what you're after.
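
One optional refinement: prof_info comes back in long format, one row per section per professor. If you would rather have one column per section title, a tidyr pivot is a reasonable sketch, assuming the section titles are consistent across pages:

library(tidyr)

prof_info_wide <- prof_info %>%
  pivot_wider(names_from = sections, values_from = info)
#one row per professor; if a professor repeats a section title this will
#produce list-columns, so check for duplicates first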

setty
  • First, thank you very much for the answer. I tried this code and ran into two problems: 1. you have to load the tibble library, otherwise tibble() isn't recognized by the console; 2. this only gets the Profile, Research Interests, Specializations, etc. I also wanted the other data on the main page (room, phone, fax, email) and to combine both data sets, and that isn't happening in this code, right? Anyway, thank you again for the answer. – Rishan.RM Jun 18 '20 at 08:38
  • You can change the tibble() function to data.frame() if you want to stay in base R. I tend to work in the Tidyverse so default to those. To answer your second question, yes, this is only pulling their research interests, etc. Add another set of rvest calls within the anonymous map function to pull that info as well. – setty Jun 18 '20 at 15:44
  • Looking at this further, it will be tricky to do this in the map call. You may end up with differing column lengths. You could pull only the prof's name and let R recycle their name for each row. Then, using their name, join with the room info, etc. (a sketch of this approach follows below). – setty Jun 18 '20 at 16:19
  • Joined the content from the first page with the content from all the individual pages. – setty Jun 18 '20 at 19:12
  • Yes, using this code I can get the output exactly the way I wanted. Thank you for the support! – Rishan.RM Jun 19 '20 at 19:02
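
Following up on the comment thread: a rough sketch of the join approach setty describes, parsing the room/phone/email fields out of the content column that was already scraped from the main page. The regular expressions here are guesses at how those fields appear in the text and would need adjusting after inspecting the actual content:

library(dplyr)
library(stringr)

contact_info <- content_names %>%
  mutate(
    #these patterns are assumptions about the text layout, not the real format
    email = str_extract(content, "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+"),
    phone = str_extract(content, "\\+?[0-9][0-9 ()-]{6,}")
  ) %>%
  select(prof_name, email, phone)

prof_info_full <- inner_join(prof_info, contact_info, by = "prof_name")
#the contact fields are recycled onto every section row for that professor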