
My data input comes from JSON, and the names in the list are the keys from the key/value pairs of the JSON. So it looks like this:

# Dummy data
doc1 <- list(type = "HTML",
             garbage = "blahblah",
             `1 - 28` = list(food = "pizza",
                             birthdate = "12-31-89",
                             name = "Jill"),
             `3 - 36` = list(pet = "gerbil",
                             gender = "female"))
doc2 <- list(type = "XLS",
             `2 - 2` = list(hour = "now",
                             profession = "Engineer"),
             `3 - 36` = list(name = "Fred",
                             age = "36"))
input <- list(doc1 = doc1, doc2 = doc2)

I would like to "rectangle" the data for easier analysis, so it looks like this:

# A tibble: 9 x 5
  doc   type  location column     value   
  <chr> <chr> <chr>    <chr>      <chr>   
1 doc1  HTML  1 - 28   food       pizza   
2 doc1  HTML  1 - 28   birthdate  12-31-89
3 doc1  HTML  1 - 28   name       Jill    
4 doc1  HTML  3 - 36   pet        gerbil  
5 doc1  HTML  3 - 36   gender     female  
6 doc2  XLS   2 - 2    hour       now     
7 doc2  XLS   2 - 2    profession Engineer
8 doc2  XLS   3 - 36   name       Fred    
9 doc2  XLS   3 - 36   age        36   

The complications are that

  1. The nested lists have different indexes, and sometimes there are columns I don't want at all (e.g., garbage)
  2. The lower nested lists all have different names

I have a for-loop that loops through the docs and pulls out all the appropriate values, but it takes quite a while on large files. I found the map function from the purrr package useful for extracting certain columns (see this tutorial), but I can't get map to work when I don't know the column name or index.

# Work so far
input %>% {
  tibble(
    doc = names(.),
    type = map(., "type")
  )
} %>%
  unnest(cols = c(type))

I feel like this vignette holds a key.

Jeff Parker

2 Answers


It is not the most generalizable solution, but maybe it will give you some ideas for your full dataset. The steps I used are:

  • data.frame on a single doc to create a messy dataframe
  • remove unwanted columns
  • pivot_longer, splitting on the last period
  • mutate to change the ... string back to a -

library(tidyverse)

# Function to clean a single doc
doc_unnest <- function(doc, unwanted_cols){
  data.frame(doc) %>% 
    select(-contains(unwanted_cols)) %>%
    pivot_longer(-c(type),
                 names_prefix = 'X',
                 names_sep =  "\\.(?=[^\\.]+$)",
                 names_to = c('location', 'column')) %>%
    mutate(location = str_replace(location, '\\.\\.\\.', ' - '))
}


# Apply to both docs in input
input %>%
  map_dfr(doc_unnest, .id = 'doc', unwanted_cols = 'garbage')


#------------------
# A tibble: 9 x 5
  doc   type  location column     value   
  <chr> <chr> <chr>    <chr>      <chr>   
1 doc1  HTML  1 - 28   food       pizza   
2 doc1  HTML  1 - 28   birthdate  12-31-89
3 doc1  HTML  1 - 28   name       Jill    
4 doc1  HTML  3 - 36   pet        gerbil  
5 doc1  HTML  3 - 36   gender     female  
6 doc2  XLS   2 - 2    hour       now     
7 doc2  XLS   2 - 2    profession Engineer
8 doc2  XLS   3 - 36   name       Fred    
9 doc2  XLS   3 - 36   age        36  
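
If the structure of the dummy data holds in your full files (every list-valued element of a doc is a location, and everything atomic besides type can be dropped), a purrr-only sketch of the same idea sidesteps the name mangling that data.frame() introduces, so you never need to know the location names up front. rectangle_doc is just an illustrative name, not anything from the question:

```r
library(tidyverse)

# Dummy data from the question
doc1 <- list(type = "HTML", garbage = "blahblah",
             `1 - 28` = list(food = "pizza", birthdate = "12-31-89", name = "Jill"),
             `3 - 36` = list(pet = "gerbil", gender = "female"))
doc2 <- list(type = "XLS",
             `2 - 2` = list(hour = "now", profession = "Engineer"),
             `3 - 36` = list(name = "Fred", age = "36"))
input <- list(doc1 = doc1, doc2 = doc2)

# Treat every list-valued element as a location; atomic elements
# other than `type` (e.g. garbage) are dropped automatically.
rectangle_doc <- function(doc) {
  imap_dfr(keep(doc, is.list), function(loc, loc_name) {
    tibble(location = loc_name,
           column   = names(loc),
           value    = unlist(loc, use.names = FALSE))
  }) %>%
    mutate(type = doc$type, .before = 1)
}

result <- map_dfr(input, rectangle_doc, .id = "doc")
result
```

Keeping only the list elements with keep(doc, is.list) is what makes the unknown names a non-issue, and imap_dfr supplies each location's name alongside its contents.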
nniloc
library(dplyr)
library(tibblify)
library(tidyr)

input %>%
  tibblify %>%
  Reduce(unnest_longer, 3:7, .) %>%
  pivot_longer(3:7, names_to = "location", values_to = "column")

giving:

# A tibble: 50 x 5
   type  garbage  `2 - 2_id` location  column
   <chr> <chr>    <chr>      <chr>     <chr> 
 1 HTML  blahblah <NA>       1 - 28    pizza 
 2 HTML  blahblah <NA>       1 - 28_id food  
 3 HTML  blahblah <NA>       3 - 36    gerbil
 4 HTML  blahblah <NA>       3 - 36_id pet   
 5 HTML  blahblah <NA>       2 - 2     <NA>  
 6 HTML  blahblah <NA>       1 - 28    pizza 
 7 HTML  blahblah <NA>       1 - 28_id food  
 8 HTML  blahblah <NA>       3 - 36    female
 9 HTML  blahblah <NA>       3 - 36_id gender
10 HTML  blahblah <NA>       2 - 2     <NA>  
# ... with 40 more rows
G. Grothendieck