What's the tidy way of having tibble
columns of class tibble
(instead of class list
or data.frame
)?
It's clearly possible to have columns of class data.frame
in tibble
s (see
example below), but none of the "tidy ways of data manipulation" (i.e.
dplyr::mutate()
or purrr::map*_df()
) seem to work for me when trying to cast the columns to tibble
instead of data.frame
Current ouput of jsonlite::fromJSON()
# 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
Desired result
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
Why having data.frame
columns can be very misleading
https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/
Related
- Recursively ensuring tibbles instead of data frames when parsing/manipulating nested JSON
- Ensure that data frames become tibbles when reading MongoDB data with {mongolite}
Example
Example data
library(magrittr)
json <- '[
{
"labels": ["label-a", "label-b"],
"levelOne": {
"levelTwo": {
"levelThree": [
{
"x": "A",
"y": 1,
"z": true
},
{
"x": "B",
"y": 2,
"z": false
}
]
}
},
"schema": "0.0.1"
},
{
"labels": ["label-a", "label-b"],
"levelOne": {
"levelTwo": {
"levelThree": [
{
"x": "A",
"y": 10,
"z": false
},
{
"x": "B",
"y": 20,
"z": true
}
]
}
},
"schema": "0.0.1"
}
]'
When visualizing this, you'll see that there's a subtle but important distinction between objects (which map to data.frame
s) and array (which map to list
s):
Parsing JSON and converting to tibble
x <- json %>%
jsonlite::fromJSON() %>%
tibble::as_tibble()
x %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
So it's clearly possible to have columns that are of class data.frame
.
Casting data.frame
to tibble
columns: "the bad way"
But I'd like tibbles instead of data frames, so let's try the only thing I got to work: explicit re-assigning the respective list levels, or data frame/tibble columns, to be more precise:
# Make a copy so we don't mess with the initial state of `x`
y <- x
y$levelOne <- y$levelOne %>%
tibble::as_tibble()
y$levelOne$levelTwo <- y$levelOne$levelTwo %>%
tibble::as_tibble()
y$levelOne$levelTwo$levelThree <- y$levelOne$levelTwo$levelThree %>%
purrr::map(tibble::as_tibble)
x %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
That works, but is not in line with "tidy data manipulation pipes".
Casting data.frame
to tibble
columns: "the better way" (trying and failing)
# Yet another copy so we can compare:
z <- x
# Just to check that this works
z$levelOne %>%
tibble::as_tibble()
# # A tibble: 2 x 1
# levelTwo$levelThree
# <list>
# 1 <df[,3] [2 × 3]>
# 2 <df[,3] [2 × 3]>
# Trying to get this to work with `dplzr::mutate()` fails:
z %>%
dplyr::mutate(levelOne = levelOne %>%
tibble::as_tibble()
)
# Error: Column `levelOne` is of unsupported class data.frame
z %>%
dplyr::transmute(levelOne = levelOne %>%
tibble::as_tibble()
)
# Error: Column `levelOne` is of unsupported class data.frame
# Same goes for `{purrr}`:
z %>%
dplyr::mutate(levelOne = levelOne %>%
purrr::map_df(tibble::as_tibble)
)
# Error: Column `levelOne` is of unsupported class data.frame
z %>%
tibble::add_column(levelOne = z$levelOne %>% tibble::as_tibble())
# Error: Can't add duplicate columns with `add_column()`:
# * Column `levelOne` already exists in `.data`.
# Works, but not what I want:
z %>%
tibble::add_column(test = z$levelOne %>% tibble::as_tibble()) %>%
str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 4 variables:
# [...]
# $ test :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
The only thing that worked (is not what we want)
Wrapping tibble::as_tibble()
by purrr::map()
seems to work, but the result is clearly not what we want as we duplicate everything below levelOne
(compare to desired output above)
# Works, but not what I want:
z_new <- z %>%
dplyr::mutate(levelOne = levelOne %>%
purrr::map(tibble::as_tibble)
)
z_new %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:List of 2
# ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
EDIT (follow-up investigation)
Got it to work with Hendrik's help!
Still, IMO this topic raises some interesting follow-up questions regarding
whether or not one should - or even could - do it any other way if the primary
goal is to end up with tidy nested tibbles that play nicely with
tidyr::unnset()
and tidyr::nest()
(see comments in Hendrik's answer below).
As to the proposed approach in https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/: I might be overlooking something obvious, but I think it only works for JSON docs with a single document.
First, let's modify df_to_tibble()
(see Hendrik's answer below) to only turn
"leaf" data frames into tibbles while turning "branch" data frames into lists:
leaf_df_to_tibble <- function(x) {
if (is.data.frame(x)) {
if (!any(purrr::map_lgl(x, is.list))) {
# Only captures "leaf" DFs:
tibble::as_tibble(x)
} else {
as.list(x)
}
} else {
x
}
}
This would give us results that are in line with the proposed way in the blog post, but only for "single object" JSON docs as illustrated below
df <- json %>% jsonlite::fromJSON()
# Only take the first object from the parsed JSON:
df_subset <- df[1, ]
Transforming df_subset
:
df_subset_tibble <- purrr::reduce(
0:purrr::vec_depth(df_subset),
function(x, depth) {
purrr::modify_depth(x, depth, leaf_df_to_tibble, .ragged = TRUE)
},
.init = df_subset
) %>%
tibble::as_tibble()
df_subset_tibble %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 3 variables:
# $ labels :List of 1
# ..$ : chr "label-a" "label-b"
# $ levelOne:List of 1
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 1
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# $ schema : chr "0.0.1"
Transforming df
:
df_tibble <- purrr::reduce(
0:purrr::vec_depth(df),
function(x, depth) {
purrr::modify_depth(x, depth, leaf_df_to_tibble, .ragged = TRUE)
},
.init = df
) %>%
tibble::as_tibble()
df_tibble %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:List of 2
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
As we see, "listifying" nested JSON structures actually may results in copying
the "leafs". It just doesn't jump at you as long as n = 1
(number of JSON
docs), but strikes you as soon as n > 1
.