
I have a pretty good understanding of R but am new to JSON files and best practices for parsing them. I'm having difficulty building a data frame from a raw JSON file. The JSON file (data below) contains repeated-measures data with multiple observations per user.

When the raw file is read into R with

 jdata <- jsonlite::read_json("./raw.json")

it comes in as a "List of 1", with that list holding the user_ids. Within each user_id are further nested lists, like so:

jdata$user_id$`sjohnson`$date$`2020-09-25`$city

The very last position actually splits into two options - $city or $zip. At the highest level, there are about 89 users in the complete file.
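
For example, the two leaves under a single observation are reached like this (a sketch; the values in the comments assume the sample data shown below):

 jdata$user_id$`sjohnson`$date$`2020-09-25`$city  # "Denver"
 jdata$user_id$`sjohnson`$date$`2020-09-25`$zip   # "80014"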

My goal would be to end up with a rectangular data frame, or multiple data frames that I can merge together, like the example below, where I don't actually need the zip code.

(example table image: desired output with columns user_id, date, and city, one row per user and date)

I've tried jsonlite along with the tidyverse, and the farthest I seem to get is a data frame with one variable at the lowest level (cities and zip codes alternating rows), using this:

df <- as.data.frame(matrix(unlist(jdata), nrow = length(unlist(jdata["user_id"]))))

Any help/suggestions to get closer to the table above would be much appreciated. I have a feeling I'm failing at looping it back through the different levels.

Here is an example of the raw json file structure:

 {
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
              "city": "Denver",
              "zip": "80014"
            },
            "2020-10-01": {
              "city": "Atlanta",
              "zip": "30301"
            },
            "2020-11-04": {
              "city": "Jacksonville",
              "zip": "14001"
            }
         },
    "asmith: {
      "date": {
        "2020-10-16": {
              "city": "Cleavland",
              "zip": "34321"
        },
        "2020-11-10": {
              "City": "Elmhurst",
              "zip": "00013
            },
            "2020-11-10 08:49:36": {
              "location": null,
              "timestamp": 1605016176013
            }
          }
 
Sescro
  • Hi, and welcome! Are you sure your JSON is correct? There seem to be a few missing braces `}` to demarcate the ends of the user objects. **Also**, can I ask how this JSON is generated? It seems a rather convoluted structure. – Greg Oct 27 '21 at 17:32
  • Can you please check the code example? Several closing brackets are missing. – deschen Oct 27 '21 at 17:37
  • There's also a missing quotation mark after `asmith`… and after the last zip code. – deschen Oct 27 '21 at 17:40
  • Maybe the rjson package can help: https://cran.r-project.org/web/packages/rjson/rjson.pdf – user12256545 Oct 27 '21 at 18:41
  • Hi Sescro! I just developed a custom function `unnestable()`, that gives a recursive solution to your problem. **WARNING:** Carefully inspect the columns in the tabular output from `unnestable()`! Since you apparently have **some typos in the JSON** (ex. `"City"` rather than `"city"`), some data (ex. like the city names) *will* be present in the initial table (ex. under the `city` column and an extra `City` column), if not where you *expect*. However, subsequent formatting of that table might overlook the columns (like `City`) you don't expect. – Greg Oct 27 '21 at 21:16
  • Thank you so much for the initial comments! Yes, the JSON is very convoluted. Unfortunately, I have very little control/say in its structure or upkeep. It comes from a Firestore web app and is converted from a LevelDB file. Missing brackets or errors in the code are likely from me removing and copying from an "example" of the multi-nested JSON. – Sescro Oct 27 '21 at 21:28

3 Answers


Another (straightforward) solution that does the heavy lifting with rrapply() from the rrapply package:

library(rrapply)
library(dplyr)

rrapply(jdata, how = "melt") %>%                 # melt the nested list: one row per leaf, with level columns L1, L2, ...
  filter(L5 == "city") %>%                       # keep only the 'city' leaves
  select(user_id = L2, date = L4, city = value)  # L2 holds the user names, L4 the dates

#>    user_id       date         city
#> 1 sjohnson 2020-09-25       Denver
#> 2 sjohnson 2020-10-01      Atlanta
#> 3 sjohnson 2020-11-04 Jacksonville
#> 4   asmith 2020-10-16    Cleavland
#> 5   asmith 2020-11-10     Elmhurst
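
If the zip codes are wanted after all, here is a minimal sketch (not part of the answer above; it assumes the same jdata and libraries as above, plus tidyr) that widens the melted leaves into one column per leaf name:

rrapply(jdata, how = "melt") %>%
  filter(L5 %in% c("city", "zip")) %>%                          # keep both leaf types
  mutate(value = unlist(value)) %>%                             # ensure plain character values (value may be a list column)
  select(user_id = L2, date = L4, field = L5, value) %>%
  tidyr::pivot_wider(names_from = field, values_from = value)   # columns: user_id, date, city, zip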

Data

jdata <- jsonlite::fromJSON('{
   "user_id": {
    "sjohnson": {
       "date": {
        "2020-09-25": {
           "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
         },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
       }
    },
    "asmith": {
       "date": {
         "2020-10-16": {
           "city": "Cleavland",
           "zip": "34321"
         },
        "2020-11-10": {
           "city": "Elmhurst",
           "zip": "00013"
         },
         "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
       }
     }
   }
}')
Joris C.
  • Oh nice. What an awesome package. This will simplify tons of scripts I have for accessing APIs and reshaping their content into some useful data structures. – deschen Oct 28 '21 at 19:33
  • Very concise! I ([and others](https://stackoverflow.com/q/69759096)) am wondering if there's a way to `pivot_longer()`, such that the value (ex. `"user_id"`) in each odd-numbered column (`1:5`) becomes a column name (`user_id`), containing the corresponding values from the subsequent even-numbered column (`"sjohnson"` and `"asmith"`). I know it can be done iteratively, but I'd love to do it in one fell swoop: a single pivot `field_1_names | field_1_values | field_2_names | field_2_values | ... | field_n_names | field_n_values` would become `field_1 | field_2 | ... | field_n`. – Greg Oct 28 '21 at 19:55

We can build our desired structure step by step:

library(jsonlite)
library(tidyverse)

df <- fromJSON('{
   "user_id": {
    "sjohnson": {
       "date": {
        "2020-09-25": {
           "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
         },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
       }
    },
    "asmith": {
       "date": {
         "2020-10-16": {
           "city": "Cleavland",
           "zip": "34321"
         },
        "2020-11-10": {
           "city": "Elmhurst",
           "zip": "00013"
         },
         "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
       }
     }
   }
}')

df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(city = unlist(value)) %>%
  filter(var == 'city') %>%
  select(-var, -value)

which gives:

# A tibble: 5 x 3
  user_id  date       city        
  <chr>    <chr>      <chr>       
1 sjohnson 2020-09-25 Denver      
2 sjohnson 2020-10-01 Atlanta     
3 sjohnson 2020-11-04 Jacksonville
4 asmith   2020-10-16 Cleavland   
5 asmith   2020-11-10 Elmhurst

Alternative solution inspired by @Greg where we change the final steps of the pipeline:

df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(value = unlist(value)) %>%
  pivot_wider(names_from = "var") %>%
  select(user_id, date, city)

This gives almost the same results with the exception of one additional case where city is NA:

# A tibble: 6 x 3
  user_id  date                city        
  <chr>    <chr>               <chr>       
1 sjohnson 2020-09-25          Denver      
2 sjohnson 2020-10-01          Atlanta     
3 sjohnson 2020-11-04          Jacksonville
4 asmith   2020-10-16          Cleavland   
5 asmith   2020-11-10          Elmhurst    
6 asmith   2020-11-10 08:49:36 NA    
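
If that extra NA row isn't wanted, it can simply be dropped afterwards. A minimal sketch, assuming the result of the pipeline above has been stored in an object named res (a hypothetical name):

res %>%
  filter(!is.na(city))  # keep only rows where a city was recorded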
deschen
  • Oh wow, that's a very nifty solution! I have some experience with **`tidyr`**, but I wasn't fully aware of the `unnest_*()` functions. I ended up writing my own, which was quite laborious in this case...though it does have the advantage of working for arbitrary nestings, without having to code by hand an extra `unnest_longer()` for *each and every* further level that is unnested. **NOTE:** You *could* revise the `mutate()` step to `mutate(value = unlist(value))`, and then just `%>% pivot_wider(values_from = value, names_from = var)` and finally `select(user_id, date, city)`. – Greg Oct 27 '21 at 22:51
  • Oh yes, that's not a bad idea, although it will keep this one NA value (which might or might not be preferable for the OP). – deschen Oct 28 '21 at 05:49
  • Nice update! I always default to preserving more information (like `NA` rows), which can be `filter`ed out in the end, at the user's discretion; so I'd incline toward your second solution. – Greg Oct 28 '21 at 13:36

Here's a solution in the tidyverse: a custom function unnestable() designed to recursively unnest into a table the contents of a list like you describe. See Details for particulars regarding the format of such a list and its table.

Solution

First ensure the necessary libraries are present:

library(jsonlite)
library(tidyverse)

Then define the unnestable() function as follows:

unnestable <- function(v) {
  # If we've reached the bottommost list, simply treat it as a table...
  if(all(sapply(
    X = v,
    # Check that each element is a single value (or NULL).
    FUN = function(x) {
      is.null(x) || purrr::is_scalar_atomic(x)
    },
    simplify = TRUE
  ))) {
    v %>%
      # Replace any NULLs with NAs to preserve blank fields...
      sapply(
        FUN = function(x) {
          if(is.null(x))
            NA
          else
            x
        },
        simplify = FALSE
      ) %>%
      # ...and convert this bottommost list into a table.
      tidyr::as_tibble()
  }
  # ...but if this list contains another nested list, then recursively unnest its
  # contents and combine their tabular results.
  else if(purrr::is_scalar_list(v)) {
    # Take the contents within the nested list...
    v[[1]] %>%
      # ...apply this 'unnestable()' function to them recursively...
      sapply(
        FUN = unnestable,
        simplify = FALSE,
        USE.NAMES = TRUE
      ) %>%
      # ...and stack their results.
      dplyr::bind_rows(.id = names(v)[1])
  }
  # Otherwise, the format is unrecognized and yields no results.
  else {
    NULL
  }
}

Finally, process the JSON data as follows:

# Read the JSON file into an R list.
jdata <- jsonlite::read_json("./raw.json")


# Flatten the R list into a table, via 'unnestable()'
flat_data <- unnestable(jdata)


# View the raw table.
flat_data

Naturally, you can reformat this table however you desire:

library(lubridate)

flat_data <- flat_data %>%
  dplyr::transmute(
    user_id = as.character(user_id),
    date = lubridate::as_datetime(date),
    city = as.character(city)
  ) %>%
  dplyr::distinct()


# View the reformatted table.
flat_data

Results

Given a raw.json file like that sampled here

{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}

then unnestable() will yield a tibble like this

# A tibble: 6 x 6
  user_id  date                city         zip   location     timestamp
  <chr>    <chr>               <chr>        <chr> <lgl>            <dbl>
1 sjohnson 2020-09-25          Denver       80014 NA                  NA
2 sjohnson 2020-10-01          Atlanta      30301 NA                  NA
3 sjohnson 2020-11-04          Jacksonville 14001 NA                  NA
4 asmith   2020-10-16          Cleavland    34321 NA                  NA
5 asmith   2020-11-10          Elmhurst     00013 NA                  NA
6 asmith   2020-11-10 08:49:36 NA           NA    NA       1605016176013

which dplyr will format into the result below:

# A tibble: 6 x 3
  user_id  date                city        
  <chr>    <dttm>              <chr>       
1 sjohnson 2020-09-25 00:00:00 Denver      
2 sjohnson 2020-10-01 00:00:00 Atlanta     
3 sjohnson 2020-11-04 00:00:00 Jacksonville
4 asmith   2020-10-16 00:00:00 Cleavland   
5 asmith   2020-11-10 00:00:00 Elmhurst    
6 asmith   2020-11-10 08:49:36 NA          

Details

List Format

To be precise, the list represents nested groupings by the fields {group_1, group_2, ..., group_n}, and it must be of the form:

list(
  group_1 = list(
    "value_1" = list(
      group_2 = list(
        "value_1.1" = list(
          # .
          #  .
          #   .
               group_n = list(
                 "value_1.1.….n.1" = list(
                   field_a =    1,
                   field_b = TRUE
                 ),
                 "value_1.1.….n.2" = list(
                   field_a =   2,
                   field_c = "2"
                 )
                 # ...
               )
        ),
        "value_1.2" = list(
          # .
          #  .
          #   .
        )
        # ...
      )
    ),
    "value_2" = list(
      group_2 = list(
        "value_2.1" = list(
          # .
          #  .
          #   .
               group_n = list(
                 "value_2.1.….n.1" = list(
                   field_a =   3,
                   field_d = 3.0
                 )
                 # ...
               )
        ),
        "value_2.2" = list(
          # .
          #  .
          #   .
        )
        # ...
      )
    )
    # ...
  )
)

Table Format

Given a list of this form, unnestable() will flatten it into a table of the following form:

# A tibble: … x …
  group_1 group_2   ... group_n         field_a field_b field_c field_d
  <chr>   <chr>     ... <chr>             <dbl> <lgl>   <chr>     <dbl>
1 value_1 value_1.1 ... value_1.1.….n.1       1 TRUE    NA           NA
2 value_1 value_1.1 ... value_1.1.….n.2       2 NA      2            NA
3 value_1 value_1.2 ... value_1.2.….n.1     ... ...     ...         ...
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
j value_2 value_2.1 ... value_2.1.….n.1       3 NA      NA            3
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
k value_2 value_2.2 ... value_2.2.….n.1     ... ...     ...         ...
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
Greg