When you read in data using the read_*
functions, you get Arrow Tables, which are held in memory. Arrow is designed around zero-copy operations, so if you can manipulate Arrow objects directly instead of pulling them into R, you avoid creating intermediate copies of objects and blowing up your R session when working with larger data.
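To make that concrete, here's a minimal sketch (using a toy table built with arrow_table(), not the JSON data below) of keeping the work inside Arrow and only collecting at the end:

```r
library(arrow)
library(dplyr)

# an in-memory Arrow Table
tbl <- arrow_table(x = 1:5, y = c(10, 20, 30, 40, 50))

# filter() and mutate() are evaluated by Arrow's engine; no R copies
# of the data are made until collect() pulls the final result into a tibble
result <- tbl |>
  filter(x > 2) |>
  mutate(z = y * 2) |>
  collect()
```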
I have a potential solution that works with Arrow objects until the last moment, when you pull the data into R, though it's not the most elegant.
# Bring in libraries
suppressMessages(library(arrow))
# Make data
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
{ "hello": 3.5, "world": false, "yo":{"param1":"duck1","param2":"duck2"} }
{ "hello": 3.25, "world": null, "yo":{"param1":"duck3","param2":"duck4"} }
{ "hello": 0.0, "world": true, "yo":{"param1":"duck5","param2":"duck6"} }
', tf, useBytes = TRUE)
# read in the JSON table as an Arrow Table
my_tbl <- read_json_arrow(tf, col_select = c("hello", "world"), as_data_frame = FALSE)
complex_cols <- read_json_arrow(tf, col_select = "yo", as_data_frame = FALSE)
# subselect the "yo" column - this is an Arrow ChunkedArray object
# containing a Struct at position 0
yo_col <- complex_cols[["yo"]]
yo_col
#> ChunkedArray
#> <struct<param1: string, param2: string>>
#> [
#> -- is_valid: all not null
#> -- child 0 type: string
#> [
#> "duck1",
#> "duck3",
#> "duck5"
#> ]
#> -- child 1 type: string
#> [
#> "duck2",
#> "duck4",
#> "duck6"
#> ]
#> ]
# extract the Struct by passing in the chunk number
sa <- yo_col$chunk(0)
sa
#> StructArray
#> <struct<param1: string, param2: string>>
#> -- is_valid: all not null
#> -- child 0 type: string
#> [
#> "duck1",
#> "duck3",
#> "duck5"
#> ]
#> -- child 1 type: string
#> [
#> "duck2",
#> "duck4",
#> "duck6"
#> ]
# extract the "param1" column from the Struct
param1_col <- sa[["param1"]]
param1_col
#> Array
#> <string>
#> [
#> "duck1",
#> "duck3",
#> "duck5"
#> ]
# Add the param1 column to the original Table
my_tbl[["param1"]] <- param1_col
my_tbl
#> Table
#> 3 rows x 3 columns
#> $hello <double>
#> $world <bool>
#> $param1 <string>
# now pull the table into R
dplyr::collect(my_tbl)
#> # A tibble: 3 × 3
#> hello world param1
#> <dbl> <lgl> <chr>
#> 1 3.5 FALSE duck1
#> 2 3.25 NA duck3
#> 3 0 TRUE duck5
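If you need to pull out several struct fields, the same [[ extraction and column assignment extend naturally with a loop. A self-contained sketch with a toy struct column; the names struct_col and tbl here are just stand-ins for the objects above:

```r
library(arrow)

# build a small StructArray directly (a stand-in for the "yo" column above);
# converting a data.frame gives a struct<param1: string, param2: string>
struct_col <- Array$create(
  data.frame(param1 = c("a", "b"), param2 = c("c", "d"))
)
tbl <- arrow_table(hello = c(1, 2))

# copy each struct field into the Table, still without leaving Arrow
for (nm in c("param1", "param2")) {
  tbl[[nm]] <- struct_col[[nm]]
}
```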
I was looking at how this is done directly in the tidyverse (we've modelled a lot of the arrow package design on tidyverse design), but many of the solutions I've seen involve running purrr::map() inside dplyr::select(), a workflow which isn't currently implemented in arrow, and I don't know whether it's even possible. Feel free to open a ticket on the repo if you do want to make a feature request.
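For completeness, once data has been collected into R, the usual tidyverse tool for unpacking nested list-columns is tidyr::unnest_wider(); note that this runs on R objects, not Arrow objects, so you lose the memory benefits described above. A sketch with toy data:

```r
library(tibble)
library(tidyr)

# a list-column holding named lists, as you'd get from nested JSON
df <- tibble(
  hello = c(3.5, 3.25),
  yo = list(
    list(param1 = "duck1", param2 = "duck2"),
    list(param1 = "duck3", param2 = "duck4")
  )
)

# spread each named element of "yo" into its own column
out <- unnest_wider(df, yo)
```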
A final note: this probably doesn't make much difference to the memory footprint in the example above, but if you have lots of nested items to pull out and reassemble into a table, you'll likely see more of a benefit.