
I would like to scrape a series of tables from a website whose URL does not change when I click through the tables in my browser. Each table corresponds to a unique date. The default table is that which corresponds to today's date. I can scroll through past dates in my browser, but can't seem to find a way to do so in R.

Using library(rvest), this code reliably downloads the table that corresponds to today's date (I'm only interested in the first of the three tables).

library(rvest)

webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
  read_html()  %>%
  html_table()
off <- off[[1]]

How can I download the table that corresponds to, say "2022-10-04", to "2022-10-06", or to yesterday?

I've tried to work through it by identifying the node under which the table lies, in the hopes that I could manipulate it to reflect a prior date. However, the following reproduces the same table as above:

webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
  read_html() %>%
  html_nodes("#main > div > section:nth-child(1) > article > div > div.dayContent > div > table") %>%
  html_table()
off <- off[[1]]

Scrolling through past dates in my browser, I've identified various places in the HTML that reference the prior date, but I can't seem to change them from R, let alone get the table I download to reflect the change:

webad %>%
  read_html() %>%
  html_nodes("#main > div > section:nth-child(1) > article > header > div")

I've also experimented with html_form(), follow_link(), and set_values(), but to no avail.

Is there a good way to navigate this particular URL in R?

DataProphets
  • `r <- jsonlite::read_json('https://official.nba.com/wp-json/api/v1/get-game-officials?&date=2022-10-06', simplifyVector = TRUE); r$nba$Table$rows` – QHarr Oct 09 '22 at 01:42
  • Generate the dates, apply a custom function to each one (e.g. with map), then bind the results. – QHarr Oct 09 '22 at 01:43
  • @QHarr could you please explain how you knew to do it this way? Thanks! – kybazzi Oct 09 '22 at 02:00
  • I selected a different date and monitored the change in web traffic in the dev tools (F12) network tab. I could see the request made there for the data. An example blog [here](https://scrapecrow.com/reverse-engineering-intro.html) – QHarr Oct 09 '22 at 02:01
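
The two suggestions in the comments above (call the JSON endpoint for one date, then generate dates, map, and bind) can be combined roughly as follows. This is only a sketch: the endpoint URL and the `nba$Table$rows` path come from the comment, while the function names `build_officials_url`, `fetch_officials`, and `officials_for_range` are made up for illustration, and it assumes every date returns rows with the same columns.

```r
# Endpoint copied from the comment above; everything else here is illustrative.
api_base <- "https://official.nba.com/wp-json/api/v1/get-game-officials"

# Build the request URL for one date (a Date or a "YYYY-MM-DD" string).
build_officials_url <- function(date) {
  paste0(api_base, "?&date=", format(as.Date(date), "%Y-%m-%d"))
}

# Fetch one date's NBA rows; returns NULL when the response has none.
# Assumes the JSON keeps the nba$Table$rows layout used in the comment.
fetch_officials <- function(date) {
  r <- jsonlite::read_json(build_officials_url(date), simplifyVector = TRUE)
  rows <- r$nba$Table$rows
  if (NROW(rows) == 0) return(NULL)
  rows <- as.data.frame(rows)
  rows$date <- format(as.Date(date), "%Y-%m-%d")
  rows
}

# Apply `fetch` to every day in [from, to] and stack the non-empty results.
officials_for_range <- function(from, to, fetch = fetch_officials) {
  dates <- format(seq(as.Date(from), as.Date(to), by = "day"), "%Y-%m-%d")
  tables <- Filter(Negate(is.null), lapply(dates, fetch))
  do.call(rbind, tables)
}

# Example calls (these need network access and the jsonlite package):
# officials_for_range("2022-10-04", "2022-10-06")
# fetch_officials(Sys.Date() - 1)  # yesterday
```

Passing the fetcher as an argument keeps the date loop separate from the network call, so the loop can be tried out with a stub before hitting the site.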

3 Answers


You can consider the following approach:

library(RSelenium)
library(rvest)

# Start a Selenium server and a Chrome client on a random port
# (chromever has to be one of the ChromeDriver versions available locally)
port <- as.integer(4444L + rpois(lambda = 1000, 1))
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
remDr <- rd$client
remDr$open()

url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)

# Click the filter button to reach the date input
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()

# Replace the content of the date input with the desired date
web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()

# Click the filter button a second time
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()

# Submit the date filter
web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()

# Parse the rendered page and extract every table on it
html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()

[[1]]
# A tibble: 5 x 5
  Game                     `Official 1`            `Official 2`         `Official 3`          Alternate
  <chr>                    <chr>                   <chr>                <chr>                 <lgl>    
1 Indiana @ Charlotte      John Goble (#10)        Lauren Holtkamp (#7) Phenizee Ransom (#70) NA       
2 Cleveland @ Philadelphia Marc Davis (#8)         Jacyn Goble (#68)    Tyler Mirkovich (#97) NA       
3 Toronto @ Boston         Josh Tiven (#58)        Matt Boland (#18)    Intae hwang (#96)     NA       
4 Dallas @ Oklahoma City   Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91)   NA       
5 Phoenix @ L.A. Lakers    Bill Kennedy (#55)      Rodney Mott (#71)    Jenna Reneau (#93)    NA       

[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names

[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names

[[4]]
# A tibble: 6 x 7
      S     M     T     W     T     F     S
  <int> <int> <int> <int> <int> <int> <int>
1    NA    NA    NA    NA    NA    NA     1
2     2     3     4     5     6     7     8
3     9    10    11    12    13    14    15
4    16    17    18    19    20    21    22
5    23    24    25    26    27    28    29
6    30    31    NA    NA    NA    NA    NA
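
In the output above, only the first element holds referee assignments; the second and third are empty and the fourth is the calendar widget. A small filter can pick out the assignment table without relying on its position. This is a sketch in base R; `keep_assignment_tables` is a made-up helper name, and it assumes the assignment tables are the only non-empty ones with a `Game` column, as in the output shown.

```r
# Keep only tables that have a "Game" column and at least one row.
# `tables` is a list of data frames, e.g. the result of html_table().
keep_assignment_tables <- function(tables) {
  Filter(function(tbl) "Game" %in% names(tbl) && nrow(tbl) > 0, tables)
}

# Possible use with the page source fetched above:
# tables <- read_html(html_Content) %>% html_table()
# off <- keep_assignment_tables(tables)[[1]]
```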
Emmanuel Hamel
  • I'm getting the following error with the RSelenium solution: `Error in chrome_ver(chromecheck[["platform"]], chromever) : version requested doesnt match versions available = 106.0.5249.21,106.0.5249.61,107.0.5304.18`. If I set `chromever = "106.0.5249.21"` I get: `Selenium message:session not created: This version of ChromeDriver only supports Chrome version 106 Current browser version is 91.0.4472.101`. Eventually, after `remDr$navigate(url)` I get: `Error in checkError(res) : Undefined error in httr call. httr output: length(url) == 1 is not TRUE` Is my software outdated? – DataProphets Oct 09 '22 at 18:42
  • You have to install ChromeDriver: https://chromedriver.chromium.org/getting-started – Emmanuel Hamel Oct 09 '22 at 19:07
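
The error messages quoted above say that the ChromeDriver in use supports Chrome 106 while the installed browser is version 91, so no available driver shares the browser's major version. A quick way to check this before calling rsDriver() is sketched below in base R; `major_version` and `matching_drivers` are hypothetical helper names, and the version strings are the ones from the error messages.

```r
# Extract the major version number, e.g. "106.0.5249.21" -> 106.
major_version <- function(version) {
  as.integer(sub("\\..*$", "", version))
}

# Return the driver versions whose major version equals the browser's.
matching_drivers <- function(browser_version, available) {
  available[major_version(available) == major_version(browser_version)]
}

available <- c("106.0.5249.21", "106.0.5249.61", "107.0.5304.18")
matching_drivers("91.0.4472.101", available)  # empty: no driver for Chrome 91
```

An empty result means either the browser or the driver needs updating before a `chromever` value can work.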

Here is another approach that can be considered:

library(RDCOMClient)
library(rvest)

# Launch Internet Explorer through COM and load the page
url <- "https://official.nba.com/referee-assignments/"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()

# Build a synthetic click event that can be dispatched to page elements
clickEvent <- doc$createEvent("MouseEvent")
clickEvent$initEvent("click", TRUE, FALSE)

# Click the filter button to reach the date input
web_Obj_Date <- doc$querySelector("#ref-filters-menu > li > div > button")
web_Obj_Date$dispatchEvent(clickEvent)

# Set the desired date directly on the date input
web_Obj_Date_Input <- doc$GetElementById('ref-date')
web_Obj_Date_Input[["Value"]] <- "2022-10-05"

# Click the date filter button
web_Obj_Go_Button <- doc$querySelector("#date-filter")
web_Obj_Go_Button$dispatchEvent(clickEvent)

# Parse the rendered page body and extract every table in it
html_Content <- doc$Body()$innerHTML()
read_html(html_Content) %>% html_table()

[[1]]
# A tibble: 5 x 5
  Game                     `Official 1`            `Official 2`         `Official 3`          Alternate
  <chr>                    <chr>                   <chr>                <chr>                 <lgl>    
1 Indiana @ Charlotte      John Goble (#10)        Lauren Holtkamp (#7) Phenizee Ransom (#70) NA       
2 Cleveland @ Philadelphia Marc Davis (#8)         Jacyn Goble (#68)    Tyler Mirkovich (#97) NA       
3 Toronto @ Boston         Josh Tiven (#58)        Matt Boland (#18)    Intae hwang (#96)     NA       
4 Dallas @ Oklahoma City   Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91)   NA       
5 Phoenix @ L.A. Lakers    Bill Kennedy (#55)      Rodney Mott (#71)    Jenna Reneau (#93)    NA       

[[2]]
# A tibble: 8 x 7
  Game   `Official 1` `Official 2` `Official 3` Alternate   ``    ``   
  <chr>  <chr>        <chr>        <chr>        <chr>       <chr> <chr>
1 "Game" "Official 1" "Official 2" "Official 3" "Alternate"  NA    NA  
2 "S"    "M"          "T"          "W"          "T"         "F"   "S"  
3 ""     ""           ""           ""           ""          ""    "1"  
4 "2"    "3"          "4"          "5"          "6"         "7"   "8"  
5 "9"    "10"         "11"         "12"         "13"        "14"  "15" 
6 "16"   "17"         "18"         "19"         "20"        "21"  "22" 
7 "23"   "24"         "25"         "26"         "27"        "28"  "29" 
8 "30"   "31"         ""           ""           ""          ""    ""   

[[3]]
# A tibble: 7 x 7
  Game  `Official 1` `Official 2` `Official 3` Alternate ``    ``   
  <chr> <chr>        <chr>        <chr>        <chr>     <chr> <chr>
1 "S"   "M"          "T"          "W"          "T"       "F"   "S"  
2 ""    ""           ""           ""           ""        ""    "1"  
3 "2"   "3"          "4"          "5"          "6"       "7"   "8"  
4 "9"   "10"         "11"         "12"         "13"      "14"  "15" 
5 "16"  "17"         "18"         "19"         "20"      "21"  "22" 
6 "23"  "24"         "25"         "26"         "27"      "28"  "29" 
7 "30"  "31"         ""           ""           ""        ""    ""   

[[4]]
# A tibble: 6 x 7
      S     M     T     W     T     F     S
  <int> <int> <int> <int> <int> <int> <int>
1    NA    NA    NA    NA    NA    NA     1
2     2     3     4     5     6     7     8
3     9    10    11    12    13    14    15
4    16    17    18    19    20    21    22
5    23    24    25    26    27    28    29
6    30    31    NA    NA    NA    NA    NA
Emmanuel Hamel
  • I'm having trouble using RDCOMClient. When I install it, I get: `Warning in install.packages : package ‘RDCOMClient’ is not available (for R version 3.6.3)`. I read (https://stackoverflow.com/questions/35509029/installation-error-with-rdcomclient-in-rstudio) that I can install it from GitHub using `library("devtools") install_github('omegahat/RDCOMClient')`, but even after running `install.packages("devtools")` and restarting my R session, I get: `Error in library(devtools) : there is no package called ‘devtools’` when I load `library(devtools)`. Do I need a newer version of RStudio? – DataProphets Oct 09 '22 at 18:51
  • You have to install the R package devtools. What is the error that you get when you install the R package devtools? – Emmanuel Hamel Oct 09 '22 at 19:10
  • Also, I suggest upgrading your version of R, for example to R 4.2.1. With a recent version of R, you will have fewer problems installing R packages. – Emmanuel Hamel Oct 09 '22 at 19:21

If you install Docker (see https://docs.docker.com/engine/install/), you can consider the following approach with Firefox:

library(RSelenium)
library(rvest)

# Start a standalone Selenium Firefox container, mapped to local port 4445
# (system() is used instead of shell(), which only exists on Windows)
system('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)

# Click the filter button to reach the date input
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()

# Replace the content of the date input with the desired date
web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()

# Click the filter button a second time
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()

# Submit the date filter
web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()

# Parse the rendered page and extract every table on it
html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()

[[1]]
# A tibble: 5 x 5
  Game                     `Official 1`            `Official 2`         `Official 3`          Alternate
  <chr>                    <chr>                   <chr>                <chr>                 <lgl>    
1 Indiana @ Charlotte      John Goble (#10)        Lauren Holtkamp (#7) Phenizee Ransom (#70) NA       
2 Cleveland @ Philadelphia Marc Davis (#8)         Jacyn Goble (#68)    Tyler Mirkovich (#97) NA       
3 Toronto @ Boston         Josh Tiven (#58)        Matt Boland (#18)    Intae hwang (#96)     NA       
4 Dallas @ Oklahoma City   Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91)   NA       
5 Phoenix @ L.A. Lakers    Bill Kennedy (#55)      Rodney Mott (#71)    Jenna Reneau (#93)    NA       

[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names

[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names

[[4]]
# A tibble: 6 x 7
      S     M     T     W     T     F     S
  <int> <int> <int> <int> <int> <int> <int>
1    NA    NA    NA    NA    NA    NA     1
2     2     3     4     5     6     7     8
3     9    10    11    12    13    14    15
4    16    17    18    19    20    21    22
5    23    24    25    26    27    28    29
6    30    31    NA    NA    NA    NA    NA
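
One practical point with the Docker approach: the container may need a few seconds before it accepts connections, so the first remDr$open() can fail if it runs immediately after `docker run`. A generic retry wrapper is one way to handle that. This is a sketch in base R; `retry` is a made-up helper, and it assumes the failed call signals an ordinary R error.

```r
# Call f() until it succeeds or max_tries is reached, pausing between attempts.
retry <- function(f, max_tries = 10, wait = 1) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (i < max_tries) Sys.sleep(wait)
  }
  stop("still failing after ", max_tries, " attempts: ", conditionMessage(result))
}

# Possible use with the remote driver created above:
# retry(function() remDr$open(silent = TRUE), max_tries = 10, wait = 2)
```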
Emmanuel Hamel