My data input is comes from JSON data and names in the list are the keys from the key/value pairs of the JSON. So it looks like this:
# Dummy data
doc1 <- list(type = "HTML",
garbage = "blahblah",
`1 - 28` = list(food = "pizza",
birthdate = "12-31-89",
name = "Jill"),
`3 - 36` = list(pet = "gerbil",
gender = "female"))
doc2 <- list(type = "XLS",
`2 - 2` = list(hour = "now",
profession = "Engineer"),
`3 - 36` = list(name = "Fred",
age = "36"))
input <- list(doc1 = doc1, doc2 = doc2)
I would like to "rectangle" the data for easier analysis, so it looks like this:
# A tibble: 9 x 5
doc type location column value
<chr> <chr> <chr> <chr> <chr>
1 doc1 HTML 1 - 28 food pizza
2 doc1 HTML 1 - 28 birthdate 12-31-89
3 doc1 HTML 1 - 28 name Jill
4 doc1 HTML 3 - 36 pet gerbil
5 doc1 HTML 3 - 36 gender female
6 doc2 XLS 2 - 2 hour now
7 doc2 XLS 2 - 2 profession Engineer
8 doc2 XLS 3 - 36 name Fred
9 doc2 XLS 3 - 36 age 36
The complications are that
- The nested terms have different indexes and sometimes there are columns I don't want at all (e.g.,
garbage
) - The lower nested all have different names
I have a for-loop
that loops through the docs and pull out all the appropriate values, but this takes quite a while on large files. I found the map
function from the purrr
package to be useful in extracting the certain columns (see this tutorial). But I can't seem to get map
to work when I don't know the column name or index.
# Work so far
input %>% {
tibble(
doc = names(.),
type = map(., "type")
)
} %>%
unnest(cols = c(type))
I feel like this vignette holds a key.