0

I am scraping multiple webpages with the goal of getting the data from each webpage into one encompassing dataframe.

Problem: The r-script works for the most part, but as I expand the amount of webpages through expand.grid I get the error prompt below. I believe the error occurs when the scrape runs into an empty table along the way.

Error: Can't combine ..1$tables$player_name and ..146$tables$player_name .

Going Forward: When I remove the empty tables the code works fine. Is there an efficient way to solve this issue?

#scrape select stats from tournaments
df <- expand.grid(
  tournament_id = c("t060", "t475", "t045", "t021"),
  stat_id = c("02564", "101", "102", "336", "340"),
  year_id = c("2004", "2005", "2006")
) %>% 
  mutate(
    links = paste0(
      'https://www.pgatour.com/stats/stat.',
      stat_id,
      '.y',
      year_id,
      '.eon.',
      tournament_id,
      '.html'
    )
  ) %>% 
  as_tibble()

# Function to get the data combined into a table
get_info <- function(link, tournament_id) {
  data <- link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    clean_names() %>% 
    select(-rank_last_week) %>% 
    mutate(rank_this_week = as.integer(str_extract(rank_this_week, "\\d+")))
  try(data <- mutate(data, avg = as.character(avg)), silent = TRUE)
  try(data <- mutate(data, gir_rank = as.character(gir_rank)), silent = TRUE)
  try(data <- mutate(data, total_distance_feet = as.character(total_distance_feet)), silent = TRUE)
  data
}

test101 <- df %>%
  mutate(tables = map2(links, tournament_id, get_info)) %>%
  tidyr::unnest(everything())
DonnyDolio
  • 89
  • 7

1 Answers1

2

This can be solved with purrr::possibly() Minor adjustment to your code, player_name is always converted to character.

get_info <- function(link, tournament_id) {
  data <- link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    janitor::clean_names() %>% 
    select(-rank_last_week) %>% 
    mutate(rank_this_week = as.integer(str_extract(rank_this_week, "\\d+")), 
           player_name = as.character(player_name))
  try(data <- mutate(data, avg = as.character(avg)), silent = TRUE)
  try(data <- mutate(data, gir_rank = as.character(gir_rank)), silent = TRUE)
  try(data <- mutate(data, total_distance_feet = as.character(total_distance_feet)), silent = TRUE)
  return(data)
}

Wrap get_info with possibly to return empty tibble if no table is available.

test101 <- df %>%
  mutate(tables = map2(links, tournament_id, possibly(get_info, otherwise = tibble())))

# A tibble: 60 x 5
   tournament_id stat_id year_id links                                                        tables           
   <fct>         <fct>   <fct>   <chr>                                                        <list>           
 1 t060          02564   2004    https://www.pgatour.com/stats/stat.02564.y2004.eon.t060.html <tibble [29 x 6]>
 2 t475          02564   2004    https://www.pgatour.com/stats/stat.02564.y2004.eon.t475.html <tibble [72 x 6]>
 3 t045          02564   2004    https://www.pgatour.com/stats/stat.02564.y2004.eon.t045.html <tibble [86 x 6]>
 4 t021          02564   2004    https://www.pgatour.com/stats/stat.02564.y2004.eon.t021.html <tibble [0 x 2]> 
 5 t060          101     2004    https://www.pgatour.com/stats/stat.101.y2004.eon.t060.html   <tibble [29 x 6]>
 6 t475          101     2004    https://www.pgatour.com/stats/stat.101.y2004.eon.t475.html   <tibble [72 x 6]>
 7 t045          101     2004    https://www.pgatour.com/stats/stat.101.y2004.eon.t045.html   <tibble [86 x 6]>
 8 t021          101     2004    https://www.pgatour.com/stats/stat.101.y2004.eon.t021.html   <tibble [0 x 2]> 
 9 t060          102     2004    https://www.pgatour.com/stats/stat.102.y2004.eon.t060.html   <tibble [29 x 6]>
10 t475          102     2004    https://www.pgatour.com/stats/stat.102.y2004.eon.t475.html   <tibble [72 x 6]>
11 t045          102     2004    https://www.pgatour.com/stats/stat.102.y2004.eon.t045.html   <tibble [86 x 6]>
12 t021          102     2004    https://www.pgatour.com/stats/stat.102.y2004.eon.t021.html   <tibble [0 x 2]> 
13 t060          336     2004    https://www.pgatour.com/stats/stat.336.y2004.eon.t060.html   <tibble [29 x 7]>
14 t475          336     2004    https://www.pgatour.com/stats/stat.336.y2004.eon.t475.html   <tibble [72 x 7]>
15 t045          336     2004    https://www.pgatour.com/stats/stat.336.y2004.eon.t045.html   <tibble [86 x 7]>
# ... with 45 more rows
# i Use `print(n = ...)` to see more rows

The column can then be unnest()

test101 %>% 
  unnest(everything())
# A tibble: 3,399 x 19
   tournament_id stat_id year_id links  rank_~1 playe~2 rounds average total~3 measu~4 avg   total~5 total~6 percent fairw~7 possi~8 total~9 numbe~*
   <fct>         <fct>   <fct>   <chr>    <int> <chr>    <int>   <dbl>   <dbl>   <int> <chr> <chr>     <int>   <dbl>   <int>   <int> <chr>     <int>
 1 t060          02564   2004    https~       1 Stephe~      4   1.77     7.09       4 NA    NA           NA      NA      NA      NA NA           NA
 2 t060          02564   2004    https~       2 Mike W~      4   1.56     6.22       4 NA    NA           NA      NA      NA      NA NA           NA
 3 t060          02564   2004    https~       3 Tiger ~      4   1.20     4.79       4 NA    NA           NA      NA      NA      NA NA           NA
 4 t060          02564   2004    https~       4 John D~      4   0.93     3.72       4 NA    NA           NA      NA      NA      NA NA           NA
 5 t060          02564   2004    https~       5 Mark H~      4   0.918    3.67       4 NA    NA           NA      NA      NA      NA NA           NA
 6 t060          02564   2004    https~       6 Zach J~      4   0.893    3.57       4 NA    NA           NA      NA      NA      NA NA           NA
 7 t060          02564   2004    https~       7 Scott ~      4   0.695    2.78       4 NA    NA           NA      NA      NA      NA NA           NA
 8 t060          02564   2004    https~       8 Jerry ~      4   0.661    2.64       4 NA    NA           NA      NA      NA      NA NA           NA
 9 t060          02564   2004    https~       9 Stuart~      4   0.624    2.50       4 NA    NA           NA      NA      NA      NA NA           NA
10 t060          02564   2004    https~      10 Retief~      4   0.612    2.45       4 NA    NA           NA      NA      NA      NA NA           NA
# ... with 3,389 more rows, 1 more variable: relative_to_par <chr>, and abbreviated variable names 1: rank_this_week, 2: player_name,
#   3: total_sg_putting, 4: measured_rounds, 5: total_distance, 6: total_drives, 7: fairways_hit, 8: possible_fairways, 9: total_distance_feet,
#   *: number_of_attempts
# i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Chamkrai
  • 5,912
  • 1
  • 4
  • 14
  • This is so awesome @Tom Hoel - Thank you! Does the possibly() call essentially allow the loop to bypass empty tables? Also, thanks for the note on setting player_name to as.character always. For learning purposes, is the try(data... call just used in cases where the variables don't match? Or is it more to silence any potential snags in the loop? – DonnyDolio Nov 11 '22 at 13:45
  • 1
    Not bypass, it is just an error handler. If the code runs into an error, like not finding a table in this instance it will do the `otherwise = tibble()` – Chamkrai Nov 13 '22 at 09:21