extract nested dataframes from complex JSON efficiently using purrr functions and %>%

Question

I'm trying to build a table similar to this (this is just a couple rows, but I'm trying to get all of the hits from a list of games):

game_pk   atBatIndex  pitchNumber   hardness launchAngle  launchSpeed  location  totalDistance  trajectory   coordX   coordY
565711    4           3             medium   2.74         76.62        9         188.03         ground_ball  177.88   145.11
565711    5           3             hard     15.42        101.26       8         328.08         line_drive   144.79   62.25

Most of what I want to extract can be found in hitData, which is in some, but not all, of 80 elements in the list playEvents, which is itself in the dataframe allPlays. You can use jsonData$allPlays$playEvents[[80]]$hitData to see an example.

Here is the code I'm using:

library(jsonlite)
library(purrr)
library(dplyr)

url <- "http://statsapi-prod-alt-968618993.us-east-1.elb.amazonaws.com/api/v1/game/565711/playByPlay"

jsonData <- fromJSON(url)

hitDataDF <- data.frame(jsonData %>%
                       map("playEvents") %>%
                       map("hitData") %>%
                       map_df(bind_rows))

Unfortunately, it returns the error:

Error: Argument 7 can't be a list containing data frames

I'm having difficulty coming up with way to to deal with the collections of lists, dataframes, nested dataframes, and vectors that are found in the JSON.

Beyond hitData, I also want data from atBatIndex, a numeric vector found in jsonData$allPlays$about (also in jsonData$allPlays), and pitchNumber, which can be found at the same level as hitData.

I'm grabbing the game_pk number 565711 from the URL and adding it to the data frame using this code:

hitDataDF$game_pk = str_match(url, '([^/]+)(?:/[^/]+){1}$')[,2]

I'm newish to R would like to write the code using %>% and map. This is my first try at at it and I'm not sure I fully understand this method. If you have a solution, could you please try to explain so I can better understand what is happening and hopefully apply it to other code if I have pulling similar data?

Any help is greatly appreciated!!

Thanks!!

`map` expects a vector and a function. Could you elaborate further what you would like to do? Maybe show the expected result? — NelsonGon, May 11 '19 at 14:23
@NelsonGon I did show an expected result, it's the first code box. I guess I wasn't clear enough there, I apologize. — Don Hessey, May 11 '19 at 16:59
The problem is that it's difficult to connect the JSON to the desired result. Some of the variables in the example dataframe aren't to be found anywhere in the JSON. You also don't really point us in the direction you want to go - which is of utmost importance as the JSON contains fairly deeply nested lists and dataframes. It seems like you haven't yet made the effort to understand the structure of the JSON. — , May 11 '19 at 18:34
The only variable not in the code is game_pk and I showed where I would get that from. The other two, I'm assuming you're talking about atbatindex and pitchnumber, are located in other area and I listed where they were in the third paragraph under my code. _atbatindex is in the about section, pitchnumber is below the hit data in the details section of the playevents code_ I agree it's hard to connect the json data to the desired result, that's why I came here to ask. I do understand the structure of json, please don't assume that. — Don Hessey, May 11 '19 at 22:24
@DonHessey I opted to delete my previous criticism and just edit your question instead. I've also written a longish answer below. I hope it helps! — , May 12 '19 at 19:52

score 4 · Accepted Answer · answered May 12 '19 at 18:44

You picked a challenging problem for your first steps with the magrittr pipe and the map functions! I'll do my best to give you a helpful answer, but I would also recommend that you find some easier data to work with as you practice. A good place to learn about the pipe %>% is with the "Pipes" chapter in Hadley Wickham's book. The chapter on iteration also offers a good intro into the map_* functions. You can return to more complex problems once you have a firmer conceptual understanding. I think Hadley explains these tools better than I ever could, so I won't go into great detail about them here, and instead focus on explaining why your code doesn't work, and why mine does.

An analysis of your code

Map functions allow a couple of useful shortcuts, one of which you've already discovered - namely, if you pass in vectors or lists as the function argument, they are automatically converted into extractor functions. So, you're on the right track!

The thing to remember is that map functions return a vector that is the same length, and has the same names, as the input vector. Your input vector is jsonData, which has 5 elements with names [1] "copyright" "allPlays" "currentPlay" "scoringPlays" "playsByInning". When you run jsonData %>% map("playEvents") %>% map("hitData"), data is being extracted, but R still returns a vector with five elements and the same names as the original vector. If you take a look at the following code, you'll see that your code is, indeed, peeling away the uppermost layers, but the length remains the same, which isn't very helpful:

> unlist(map(jsonData, class))
    copyright      allPlays   currentPlay  scoringPlays playsByInning 
  "character"  "data.frame"        "list"     "integer"  "data.frame" 

> unlist(map(jsonData %>% map("playEvents"), class))
    copyright      allPlays   currentPlay  scoringPlays playsByInning 
       "NULL"        "list"  "data.frame"        "NULL"        "NULL" 

> unlist(map(jsonData %>% map("playEvents") %>% map("hitData"), class))
    copyright      allPlays   currentPlay  scoringPlays playsByInning 
       "NULL"        "NULL"  "data.frame"        "NULL"        "NULL"

The final output, and what you are trying to combine with your call to bind_rows above, is this:

> jsonData %>% map("playEvents") %>% map("hitData")
$copyright
NULL

$allPlays
NULL

$currentPlay
  launchSpeed launchAngle totalDistance trajectory hardness location coordinates.coordX coordinates.coordY
1          NA          NA            NA       <NA>     <NA>     <NA>                 NA                 NA
2        81.3       61.92         187.5      popup   medium        6              75.78             167.97

$scoringPlays
NULL

$playsByInning
NULL

Obviously that's not what you want. After some tinkering I came up with the following solution.

My own strategy

The libraries:

library(jsonlite)
library(purrr)
library(dplyr)
library(readr)
library(stringr)
library(magrittr)

I use a slightly different method to download and parse the JSON because I need to see the structure. I'll include it just in case you might find it useful:

url <- paste0("http://statsapi-prod-alt-968618993.us-east-1.elb.amazonaws",
              ".com/api/v1/game/565711/playByPlay")

url %>% read_file() %>% prettify() %>% write_file("bball.json")

jsonData <- fromJSON("bball.json")

I first extract and clean the hitData dataframes. I know they can all be found in playEvents, so I can skip a few steps by using the $ syntax. The first call to map extracts hitData from each element of the list playEvents. The hitData dataframes are nested (they contain other dataframes), so the second call to map with jsonlite::flatten flattens them. The function safely ensures that R doesn't throw an error when something other than a dataframe is encountered (only 46 elements contain hitData). Many of the hitData dataframes contain rows full of NAs, so the third call to map uses an anonymous function (again in safely) to get rid of those. The fourth call to map then extracts the dataframe from each element's result variable, which was created by safely (along with an error variable that we don't need):

hitdata_list <- jsonData$allPlays$playEvents %>% 
    map("hitData") %>% 
    map(safely(jsonlite::flatten)) %>% 
    map(safely(~.$result[complete.cases(.$result),])) %>% 
    map("result")

Now I have a list of hitData dataframes. As I mentioned above, only 46 of 80 entries contain hitData, so I need a way to get the corresponding values from atBatIndex. I can do that by generating a logical vector with TRUE when an element in hitdata_list contains a dataframe, and FALSE otherwise. I use map_lgl to return a logical vector instead of a list:

lgl_index <- map_lgl(hitdata_list, ~ !is.null(.))
atbatindex_vec <- jsonData$allPlays$atBatIndex[lgl_index]

I then use a stringr function to get game_pk from the URL. I'm not sure if it would work with every URL, but it works fine in this case:

game_pk_vec <- str_match(url, "/(\\d+)/")[2] %>%
    as.integer()

Finally, I combine atBatIndex and game_pk in a tibble, then combine that tibble with with the hitData data using bind_cols. The hitData dataframes are still in a list, so I'll need to combine those first with bind_rows. The set_colnames function is from the magrittr package and does just what it says. I need to set the column names because some compound names were created when I flattened the hitData dataframes:

hitdata_df <- tibble(game_pk = game_pk_vec, atBatIndex = atbatindex_vec) %>% 
    bind_cols(bind_rows(hitdata_list)) %>% 
    set_colnames(str_extract(names(.), "\\w+$"))

The only thing I didn't do was extract pitchNumber. Calling jsonData$allPlays$playEvents %>% map("pitchNumber") returns a list of sequences 1 through n, where each vector has length > 1. I assume you only need the final number in each sequence, but I'm not sure so I'll spare myself the effort. You can do what I did with atBatIndex to get the relevant elements, and then extract what you need. Here's the final dataframe:

# A tibble: 46 x 10
   game_pk atBatIndex launchSpeed launchAngle totalDistance trajectory  hardness location coordX coordY
   <chr>        <int>       <dbl>       <dbl>         <dbl> <chr>       <chr>    <chr>     <dbl>  <dbl>
 1 565711           4        76.6        2.74        188.   ground_ball medium   9         178.   145. 
 2 565711           5       101.        15.4         328.   line_drive  hard     8         145.    62.2
 3 565711           6       103.        29.4         382.   line_drive  medium   9         237.    79.4
 4 565711           8       109.        15.6         319.   line_drive  hard     9         181.   102. 
 5 565711           9        75.8       47.8         239.   fly_ball    medium   7          99.8  103. 
 6 565711          10        91.6       44.1         311.   fly_ball    medium   8         140.    69.3
 7 565711          12        79.1       23.4         246.   line_drive  medium   7          52.3  126. 
 8 565711          13        67.3      -21.3         124.   ground_ball medium   6         108.   156. 
 9 565711          14        89.9      -21.6           7.41 ground_ball medium   6         108.   152. 
10 565711          15       110.        27.7         420.   fly_ball    medium   9         250.    69.0
# … with 36 more rows

This code/explanation is incredible @gersht!! It's thought out and ridiculously thorough!! I really really appreciate you taking the time to write out all the explanations what each step does. I've been pulling this data for a couple years now and slowly pairing down my code to be faster an more efficient, 10 minutes to do 100 games, to 2-3 minutes currently. Your code is around 35 seconds and that is insane to me... You've shown me a couple new functions I can use for other parts of the baseball data I'm pulling. Again, thank you for you answer and time, I don't take it for granted!! — Don Hessey, May 13 '19 at 13:37

score 0 · Answer 2 · answered May 11 '19 at 14:46

0

Try a cheecky bit of 'unlist'. I've managed to get a nameless data frame - getting the names out of a list seems complicated. Hope this helps:

hitData = jsonData %>%
      map("playEvents") %>%
      map("hitData") %>%
      unlist(recursive = F)

numRows = lapply(hitData,length) %>% unique %>% unlist

hitDataFrame = unlist(hitData) %>% matrix(nrow = numRows) %>% as.data.frame

answered May 11 '19 at 14:46

Captain Hat

2,444
1
14
31

It doesn't really achieve what the OP is trying to do. – May 12 '19 at 18:46

extract nested dataframes from complex JSON efficiently using purrr functions and %>%

2 Answers2

An analysis of your code

My own strategy