You picked a challenging problem for your first steps with the magrittr pipe and the map functions! I'll do my best to give you a helpful answer, but I would also recommend that you find some easier data to work with as you practice. A good place to learn about the pipe %>% is the "Pipes" chapter in Hadley Wickham's book. The chapter on iteration also offers a good intro to the map_* functions. You can return to more complex problems once you have a firmer conceptual understanding. I think Hadley explains these tools better than I ever could, so I won't go into great detail about them here, and will instead focus on explaining why your code doesn't work, and why mine does.
An analysis of your code
Map functions allow a couple of useful shortcuts, one of which you've already discovered - namely, if you pass in vectors or lists as the function argument, they are automatically converted into extractor functions. So, you're on the right track!
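To make the extractor shortcut concrete, here is a minimal sketch on made-up data (the list plays and its fields are hypothetical, not taken from the game feed). It suggests that passing a string to map behaves like indexing each element by that name, with NULL returned where the name is missing:

```r
library(purrr)

# Hypothetical toy data, not taken from the game feed
plays <- list(
  a = list(id = 1, hit = "single"),
  b = list(id = 2)               # no "hit" element here
)

# Passing a string is shorthand for extracting by name ...
by_name <- map(plays, "hit")

# ... and is roughly equivalent to writing the extractor by hand
by_hand <- map(plays, function(x) x[["hit"]])

identical(by_name, by_hand)
```

Note that element b still appears in the result; it simply holds NULL.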
The thing to remember is that map functions return a vector that is the same length, and has the same names, as the input vector. Your input vector is jsonData, which has 5 elements named "copyright", "allPlays", "currentPlay", "scoringPlays" and "playsByInning". When you run jsonData %>% map("playEvents") %>% map("hitData"), data is being extracted, but R still returns a vector with five elements and the same names as the original vector. If you take a look at the following code, you'll see that your code is, indeed, peeling away the uppermost layers, but the length remains the same, which isn't very helpful:
> unlist(map(jsonData, class))
  copyright     allPlays currentPlay scoringPlays playsByInning
"character" "data.frame"      "list"    "integer"  "data.frame"
> unlist(map(jsonData %>% map("playEvents"), class))
copyright allPlays  currentPlay scoringPlays playsByInning
   "NULL"   "list" "data.frame"       "NULL"        "NULL"
> unlist(map(jsonData %>% map("playEvents") %>% map("hitData"), class))
copyright allPlays  currentPlay scoringPlays playsByInning
   "NULL"   "NULL" "data.frame"       "NULL"        "NULL"
The final output, which is what you are trying to combine with your call to bind_rows above, is this:
> jsonData %>% map("playEvents") %>% map("hitData")
$copyright
NULL
$allPlays
NULL
$currentPlay
  launchSpeed launchAngle totalDistance trajectory hardness location coordinates.coordX coordinates.coordY
1          NA          NA            NA       <NA>     <NA>     <NA>                 NA                 NA
2        81.3       61.92         187.5      popup   medium        6              75.78             167.97
$scoringPlays
NULL
$playsByInning
NULL
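The same length-preserving behavior can be reproduced on a small made-up list. This is only a sketch: toy and its contents are hypothetical stand-ins for the top level of jsonData.

```r
library(purrr)
library(magrittr)

# Hypothetical stand-in for the top level of jsonData
toy <- list(
  copyright = "some text",
  allPlays  = list(playEvents = list(hitData = 1:3)),
  scoring   = c(4L, 7L)
)

# Each map() call returns one slot per input element, found or not
out <- toy %>% map("playEvents") %>% map("hitData")

length(out) == length(toy)
names(out)
```

Only the allPlays slot ends up holding data; the other slots are kept and filled with NULL. The real output above has the same shape: five slots, most of them NULL.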
Obviously that's not what you want. After some tinkering I came up with the following solution.
My own strategy
The libraries:
library(jsonlite)
library(purrr)
library(dplyr)
library(readr)
library(stringr)
library(magrittr)
I use a slightly different method to download and parse the JSON because I need to see the structure. I'll include it just in case you might find it useful:
url <- paste0("http://statsapi-prod-alt-968618993.us-east-1.elb.amazonaws",
".com/api/v1/game/565711/playByPlay")
url %>% read_file() %>% prettify() %>% write_file("bball.json")
jsonData <- fromJSON("bball.json")
I first extract and clean the hitData dataframes. I know they can all be found in playEvents, so I can skip a few steps by using the $ syntax. The first call to map extracts hitData from each element of the list playEvents. The hitData dataframes are nested (they contain other dataframes), so the second call to map flattens them with jsonlite::flatten. Wrapping that function in safely ensures that R doesn't throw an error when something other than a dataframe is encountered (only 46 elements contain hitData). Many of the hitData dataframes contain rows full of NAs, so the third call to map uses an anonymous function (again wrapped in safely) that keeps only the complete rows via complete.cases. Because its input is the output of the previous safely call, it reads the dataframe from .$result. The fourth call to map then extracts the dataframe from each element's result variable, which was created by safely (along with an error variable that we don't need):
hitdata_list <- jsonData$allPlays$playEvents %>%
  map("hitData") %>%
  map(safely(jsonlite::flatten)) %>%
  map(safely(~.$result[complete.cases(.$result),])) %>%
  map("result")
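If safely is new to you, this small sketch may help. It wraps log purely as an illustration (safe_log and the inputs are made up) and shows the result/error pair that the pipeline above relies on:

```r
library(purrr)
library(magrittr)

# Wrap a function so that errors are captured instead of thrown
safe_log <- safely(log)

ok  <- safe_log(10)    # succeeds
bad <- safe_log("a")   # log() of a string raises an error, which is captured

ok$result              # the computed value
ok$error               # NULL, because nothing went wrong
bad$result             # NULL, the default value on failure
class(bad$error)       # a condition object describing the error

# Mapping a safe function and then extracting "result" keeps one slot per
# input, with NULL where the call failed
results <- list(10, "a") %>% map(safe_log) %>% map("result")
```

This is why the last step of the pipeline is map("result"): every element is a two-item list, and only the result part is of interest.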
Now I have a list of hitData dataframes. As I mentioned above, only 46 of 80 entries contain hitData, so I need a way to get the corresponding values from atBatIndex. I can do that by generating a logical vector with TRUE when an element in hitdata_list contains a dataframe, and FALSE otherwise. I use map_lgl to return a logical vector instead of a list:
lgl_index <- map_lgl(hitdata_list, ~ !is.null(.))
atbatindex_vec <- jsonData$allPlays$atBatIndex[lgl_index]
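The same indexing idea on a tiny made-up example (results and ids are hypothetical names, standing in for hitdata_list and atBatIndex):

```r
library(purrr)

# Hypothetical list: some entries hold a data frame, others are NULL
results <- list(data.frame(x = 1), NULL, data.frame(x = 2), NULL)
ids     <- c(10L, 11L, 12L, 13L)

# TRUE where an entry holds something, FALSE where it is NULL
has_data <- map_lgl(results, ~ !is.null(.x))

# Keep only the ids whose entry is not NULL
ids[has_data]
```

Here the first and third ids survive, mirroring how only the at-bats with hitData are kept.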
I then use a stringr function to get game_pk from the URL. I'm not sure if it would work with every URL, but it works fine in this case:
game_pk_vec <- str_match(url, "/(\\d+)/")[2] %>%
  as.integer()
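To see why the [2] is there, here is a short sketch with a made-up URL that has the same path layout (example_url is an assumption, not the real endpoint). str_match returns a character matrix with the full match in the first column and the capture group in the second:

```r
library(stringr)

# Hypothetical URL with the same path layout as the one above
example_url <- "http://example.com/api/v1/game/565711/playByPlay"

# The pattern matches a run of digits enclosed by slashes
m <- str_match(example_url, "/(\\d+)/")

dim(m)                # one row, two columns
m[1, 1]               # the full match, slashes included
m[1, 2]               # the captured digits, still a string
as.integer(m[1, 2])   # converted to an integer
```

The segment /v1/ is not matched because it contains a letter, which is also why this approach could break on URLs that have another all-digit path segment earlier on.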
Finally, I combine atBatIndex and game_pk in a tibble, then combine that tibble with the hitData data using bind_cols. The hitData dataframes are still in a list, so I'll need to combine those first with bind_rows. The set_colnames function is from the magrittr package and does just what it says. I need to set the column names because some compound names were created when I flattened the hitData dataframes:
hitdata_df <- tibble(game_pk = game_pk_vec, atBatIndex = atbatindex_vec) %>%
  bind_cols(bind_rows(hitdata_list)) %>%
  set_colnames(str_extract(names(.), "\\w+$"))
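The renaming step can be tried in isolation. This sketch uses a made-up one-row data frame (flat is hypothetical) with the kind of compound names that flattening produces. The pattern "\\w+$" keeps only the trailing run of word characters, so everything up to and including the last dot is dropped:

```r
library(stringr)
library(magrittr)

# Hypothetical flattened data frame with compound column names
flat <- data.frame(launchSpeed = 81.3,
                   coordinates.coordX = 75.78,
                   coordinates.coordY = 167.97)

# Keep only the part of each name after the last dot
short <- str_extract(names(flat), "\\w+$")

renamed <- flat %>% set_colnames(short)
names(renamed)
```

Names without a dot, such as launchSpeed, pass through unchanged.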
The only thing I didn't do was extract pitchNumber. Calling jsonData$allPlays$playEvents %>% map("pitchNumber") returns a list of sequences 1 through n, where each vector has length > 1. I assume you only need the final number in each sequence, but I'm not sure, so I'll spare myself the effort. You can do what I did with atBatIndex to get the relevant elements, and then extract what you need. Here's the final dataframe:
# A tibble: 46 x 10
   game_pk atBatIndex launchSpeed launchAngle totalDistance trajectory  hardness location coordX coordY
   <chr>        <int>       <dbl>       <dbl>         <dbl> <chr>       <chr>    <chr>     <dbl>  <dbl>
 1 565711           4        76.6        2.74          188. ground_ball medium   9          178.   145.
 2 565711           5        101.        15.4          328. line_drive  hard     8          145.   62.2
 3 565711           6        103.        29.4          382. line_drive  medium   9          237.   79.4
 4 565711           8        109.        15.6          319. line_drive  hard     9          181.   102.
 5 565711           9        75.8        47.8          239. fly_ball    medium   7          99.8   103.
 6 565711          10        91.6        44.1          311. fly_ball    medium   8          140.   69.3
 7 565711          12        79.1        23.4          246. line_drive  medium   7          52.3   126.
 8 565711          13        67.3       -21.3          124. ground_ball medium   6          108.   156.
 9 565711          14        89.9       -21.6          7.41 ground_ball medium   6          108.   152.
10 565711          15        110.        27.7          420. fly_ball    medium   9          250.   69.0
# … with 36 more rows
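If it does turn out that you only need the final pitch number of each at-bat, something along these lines might work. It is only a sketch on made-up sequences (pitch_seqs is hypothetical), and it rests on my assumption above that the last value is the one you want:

```r
library(purrr)

# Hypothetical pitch-number sequences, one integer vector per at-bat
pitch_seqs <- list(1:4, 1:2, 1:6)

# Take the last value of each sequence and return an integer vector
last_pitch <- map_int(pitch_seqs, ~ tail(.x, 1))
last_pitch
```

On the real data you would first subset the list with the same logical index that was used for atBatIndex, so that the lengths line up with the 46 rows.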