
I'm trying to get the range of the numbers at the end of this link: https://schedule.sxsw.com/2019/speakers/2008434.

Each link ends in a number, such as 2008434, and points to the bio of a speaker at the upcoming South by Southwest festival. I know there are 3729 speakers in total, but that does not tell me how the speakers and their pages are numbered.

I'm trying to do some simple web-scraping using a lapply function, but my function does not work when I can't specify a range. For example, I used:

number_range <- seq(1:3000000)

and got a lot of Error in open.connection(x, "rb") : HTTP error 404.

Clicking around the links reveals no pattern in how they are numbered.

Is there an easy way to get this range / make this function work? Code below:

library(rvest)
library(tidyverse)

# List for bios
sxsw_bios <- list()

# Creating vector of numbers
number_range <- seq(1:3000000)

# Scraping bios with names
sxsw_bios <- lapply(number_range, function(y) {

  # Getting speaker name
  Name <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/", y)) %>% 
    html_nodes(".speaker-name") %>% 
    html_text()

  Name
})
papelr

1 Answer


You can scrape the list of IDs from the alphabetical speaker listing pages:

library(rvest)

ids <- lapply( letters, function(x) {
  speakers <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/alpha/", x)) %>%
    rvest::html_nodes(xpath = "//*[@class='favorite-click absolute']/@data-item-id")

  speakers <- gsub(' data-item-id="|"',"",speakers)
  speakers
})
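
As a possible alternative to cleaning the attribute text with gsub(), rvest's html_attr() returns the attribute values directly. The sketch below runs against a small inline HTML string rather than the live site; the markup and the second ID are made up for illustration, and the real listing pages may be structured differently.

```r
library(rvest)

# Hypothetical stand-in for one alphabetical listing page.
# Only 2008434 comes from the question; 2008435 is invented for the example.
sample_page <- read_html(
  '<html><body>
     <a class="favorite-click absolute" data-item-id="2008434"></a>
     <a class="favorite-click absolute" data-item-id="2008435"></a>
     <a class="other" data-item-id="999"></a>
   </body></html>'
)

# Select the same nodes as above, then read the attribute value from each.
# html_attr() returns a plain character vector, so no gsub() clean-up is needed.
sample_ids <- sample_page %>%
  html_nodes(xpath = "//*[@class='favorite-click absolute']") %>%
  html_attr("data-item-id")

sample_ids
```

Swapping the inline string for the paste0(".../speakers/alpha/", x) URL inside the lapply() should give the same kind of vector per letter, assuming the class names on the live pages match.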

Then use these IDs in your code (I'm only doing the first 5 in this example):

ids <- unlist(ids)

# Scraping bios with names
sxsw_bios <- lapply(ids[1:5], function(y) {

  doc <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/", y))

  # Getting speaker name
  Name <- doc %>% 
    html_nodes(".speaker-name") %>% 
    html_text()

  bio <- doc %>%
    html_nodes(xpath = "//*[@class='row speaker-bio']") %>%
    html_text()
  list(name= Name, bio = bio)
})

sxsw_bios[[1]]

# $name
# [1] "A$AP Rocky"
# 
# $bio
# [1] "A$AP Rocky is a cultural beacon that continues to ... <etc>

# ------------

sxsw_bios[[5]]

# $name
# [1] "Ken Abdo"
# 
# $bio
# [1] "Ken Abdo is a partner at the national law firm of Fox Rothschild...<etc>
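
If some pages still fail to load (for example with the HTTP 404 errors from the question), one option is to wrap read_html() in tryCatch() so a single bad page does not stop the whole lapply(). This is a minimal sketch, not a tested scraper for the live site; read_html_or_null and scrape_bio are hypothetical helper names introduced here.

```r
library(rvest)

# Hypothetical helper: returns NULL instead of stopping when a page cannot be read.
read_html_or_null <- function(url) {
  tryCatch(
    read_html(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Hypothetical wrapper around the name/bio extraction used in the answer.
scrape_bio <- function(id) {
  doc <- read_html_or_null(paste0("https://schedule.sxsw.com/2019/speakers/", id))
  if (is.null(doc)) return(NULL)
  list(
    name = doc %>% html_nodes(".speaker-name") %>% html_text(),
    bio  = doc %>% html_nodes(xpath = "//*[@class='row speaker-bio']") %>% html_text()
  )
}
```

It could then be used as sxsw_bios <- Filter(Negate(is.null), lapply(ids, scrape_bio)), which drops the pages that could not be read.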
SymbolixAU