
I am attempting to scrape MLB lineups from the following URL:

https://www.baseballpress.com/lineups/

However, when a lineup is unreleased, the nodes for the individual players do not exist on the page. A scrape like the following therefore silently skips that game, so the resulting vectors have different lengths and the lineups no longer match up when I bind the values together.

vbatter1 = page %>% html_nodes(".col--min:nth-child(1) .player:nth-child(1) .player-link") %>% html_attr("data-mlb")
vbatter2 = page %>% html_nodes(".col--min:nth-child(1) .player:nth-child(2) .player-link") %>% html_attr("data-mlb")
...

hbatter1 = page %>% html_nodes(".col--min+ .col--min .player:nth-child(1) .player-link") %>% html_attr("data-mlb")
hbatter2 = page %>% html_nodes(".col--min+ .col--min .player:nth-child(2) .player-link") %>% html_attr("data-mlb")
...

df <- do.call(rbind, Map(data.frame, GameTime=time, VisTm=VisTm, HmTm=HmTm, VisStPchID=vSP, HmStPchID=hSP, VisBat1ID=vbatter1, VisBat2ID=vbatter2, VisBat3ID=vbatter3, VisBat4ID=vbatter4, VisBat5ID=vbatter5, VisBat6ID=vbatter6, VisBat7ID=vbatter7, VisBat8ID=vbatter8, VisBat9ID=vbatter9, HmBat1ID=hbatter1, HmBat2ID=hbatter2, HmBat3ID=hbatter3, HmBat4ID=hbatter4, HmBat5ID=hbatter5, HmBat6ID=hbatter6, HmBat7ID=hbatter7, HmBat8ID=hbatter8, HmBat9ID=hbatter9))

Is there any way to return NULL values for the missing nodes when the lineup card looks like the following?

[Image: lineup card for a game whose lineup has not been released]

as21

1 Answer

This is slightly tricky, since the structure of the page differs depending on whether a lineup has been posted.
Overall, the strategy is to find the parent node for each lineup card and then parse out both lineups, checking whether either one contains the "No Lineup Released" message. Depending on the result, either retrieve the player list (assuming it contains exactly 9 entries) or substitute a default text.
See the comments in the code for more detail.

library(rvest)
page <- read_html("https://www.baseballpress.com/lineups?q=%2Flineups%2F")

#find all of the lineup cards for the day
lineupcards <- page %>% html_elements("div.lineup-card-body")

#optional check: indices of the lineup columns that have been released (not used below)
lineupreturn <- lineupcards %>% html_elements("div.col") %>% html_text2()
validlineup <- which(lineupreturn != "No Lineup Released")

#loop through the list of lineup cards
dfs <- lapply(lineupcards, function(node){
   #find the node containing the individual line-ups
   status <- node %>% html_elements("div.col") 
   
   #check whether the text is "No Lineup Released"; if not, get the players, else add default text
   if (html_text2(status[1]) !="No Lineup Released") {
      visitplayers <- status[1] %>% html_elements("div.player") %>% html_text()
   }
   else
   {
      visitplayers <- rep("No player", 9)
   }
   #repeat for home team
   if (html_text2(status[2]) !="No Lineup Released") {
      homeplayers <- status[2] %>% html_elements("div.player") %>% html_text()
   }
   else
   {
      homeplayers <- rep("No player", 9)
   }
   
   #make a data frame for the return
   data.frame(homeplayers, visitplayers)
})

#dfs is a list of data frames with the lineups
dfs
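
If the goal is the player IDs from the question (the data-mlb attribute) with missing values for unreleased lineups, the same per-card idea can be adapted. The following is only a minimal sketch: the small inline page built with minimal_html() is a made-up stand-in for the real site, its markup is an assumption modelled on the selectors above, and game_row is a hypothetical helper name. Each lineup is padded to 9 entries with NA, so every game yields exactly one row.

library(rvest)

# Hypothetical stand-in for the real page (assumed markup):
# game 1 has a partial visiting lineup and no home lineup, game 2 has neither
page <- minimal_html('
  <div class="lineup-card-body">
    <div class="col col--min">
      <div class="player"><a class="player-link" data-mlb="111">Player A</a></div>
      <div class="player"><a class="player-link" data-mlb="112">Player B</a></div>
    </div>
    <div class="col col--min">No Lineup Released</div>
  </div>
  <div class="lineup-card-body">
    <div class="col col--min">No Lineup Released</div>
    <div class="col col--min">No Lineup Released</div>
  </div>')

# build one row per lineup card: n visiting IDs followed by n home IDs
game_row <- function(card, n = 9) {
  cols <- html_elements(card, "div.col")
  ids <- lapply(1:2, function(i) {
    # column missing entirely: return all NA
    if (length(cols) < i) return(rep(NA_character_, n))
    x <- cols[[i]] %>% html_elements(".player-link") %>% html_attr("data-mlb")
    length(x) <- n   # pads with NA (or truncates) to exactly n entries
    x
  })
  row <- as.data.frame(t(c(ids[[1]], ids[[2]])), stringsAsFactors = FALSE)
  names(row) <- c(paste0("VisBat", 1:n, "ID"), paste0("HmBat", 1:n, "ID"))
  row
}

# one row per game, 18 ID columns, NA wherever no player node exists
lineups <- do.call(rbind, lapply(html_elements(page, "div.lineup-card-body"), game_row))
lineups

Because every card contributes a row regardless of its status, the result can be bound column-wise to the game time, team and starting pitcher vectors, provided those are also extracted once per card.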
Dave2e