When you read in data using the read_*
functions, you get Arrow Tables, which are held in memory. Arrow is designed around zero-copy operations, so if you can manipulate Arrow objects directly instead of pulling them into R, you avoid creating intermediate copies of objects and blowing up your R session when working with larger data.
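To make that concrete, here's a minimal sketch (using a toy table built with arrow_table(), not the JSON data below) of keeping the work inside Arrow and only collecting at the end:

```r
library(arrow)
library(dplyr)

# an in-memory Arrow Table
tbl <- arrow_table(x = 1:5, y = c(10, 20, 30, 40, 50))

# filter() and mutate() are evaluated by Arrow's engine; no R copies
# of the data are made until collect() pulls the final result into a tibble
result <- tbl |>
  filter(x > 2) |>
  mutate(z = y * 2) |>
  collect()
```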
I have a potential solution that works with Arrow objects until the last moment, when you pull the data into R, though it's not the most elegant.
# Bring in libraries
suppressMessages(library(arrow))
# Make data
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
{ "hello": 3.5, "world": false, "yo":{"param1":"duck1","param2":"duck2"} }
{ "hello": 3.25, "world": null, "yo":{"param1":"duck3","param2":"duck4"} }
{ "hello": 0.0, "world": true, "yo":{"param1":"duck5","param2":"duck6"} }
', tf, useBytes = TRUE)
# read in the JSON table as an Arrow Table
my_tbl <- read_json_arrow(tf, col_select = c("hello", "world"), as_data_frame = FALSE)
complex_cols <- read_json_arrow(tf, col_select = "yo", as_data_frame = FALSE)
# subselect the "yo" column - this is an Arrow ChunkedArray object
# containing a Struct at position 0
yo_col <- complex_cols[["yo"]]
yo_col
#> ChunkedArray
#> <struct<param1: string, param2: string>>
#> [
#> -- is_valid: all not null
#> -- child 0 type: string
#> [
#> "duck1",
#> "duck3",
#> "duck5"
#> ]
#> -- child 1 type: string
#> [
#> "duck2",
#> "duck4",
#> "duck6"
#> ]
#> ]
# extract the Struct by passing in the chunk number
sa <- yo_col$chunk(0)
sa
#> StructArray
#> <struct<param1: string, param2: string>>
#> -- is_valid: all not null
#> -- child 0 type: string
#> [
#> "duck1",
#> "duck3",
#> "duck5"
#> ]
#> -- child 1 type: string
#> [
#> "duck2",
#> "duck4",
#> "duck6"
#> ]
# extract the "param1" column from the Struct
param1_col <- sa[["param1"]]
param1_col
#> Array
#> <string>
#> [
#> "duck1",
#> "duck3",
#> "duck5"
#> ]
# Add the param1 column to the original Table
my_tbl[["param1"]] <- param1_col
my_tbl
#> Table
#> 3 rows x 3 columns
#> $hello <double>
#> $world <bool>
#> $param1 <string>
# now pull the table into R
dplyr::collect(my_tbl)
#> # A tibble: 3 × 3
#> hello world param1
#> <dbl> <lgl> <chr>
#> 1 3.5 FALSE duck1
#> 2 3.25 NA duck3
#> 3 0 TRUE duck5
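If you need to pull out several struct fields, the same [[ extraction and column assignment extend naturally with a loop. A self-contained sketch with a toy struct column; the names struct_col and tbl here are just stand-ins for the objects above:

```r
library(arrow)

# build a small StructArray directly (a stand-in for the "yo" column above);
# converting a data.frame gives a struct<param1: string, param2: string>
struct_col <- Array$create(
  data.frame(param1 = c("a", "b"), param2 = c("c", "d"))
)
tbl <- arrow_table(hello = c(1, 2))

# copy each struct field into the Table, still without leaving Arrow
for (nm in c("param1", "param2")) {
  tbl[[nm]] <- struct_col[[nm]]
}
```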
I was looking at how this is done directly in the tidyverse (we've modelled a lot of the arrow package design on tidyverse design), but many of the solutions I've seen involve running purrr::map() inside dplyr::select(), a workflow which isn't currently implemented in arrow, and I don't know whether it's even possible. Feel free to open a ticket on the repo if you do want to make a feature request.
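For completeness, once data has been collected into R, the usual tidyverse tool for unpacking nested list-columns is tidyr::unnest_wider(); note that this runs on R objects, not Arrow objects, so you lose the memory benefits described above. A sketch with toy data:

```r
library(tibble)
library(tidyr)

# a list-column holding named lists, as you'd get from nested JSON
df <- tibble(
  hello = c(3.5, 3.25),
  yo = list(
    list(param1 = "duck1", param2 = "duck2"),
    list(param1 = "duck3", param2 = "duck4")
  )
)

# spread each named element of "yo" into its own column
out <- unnest_wider(df, yo)
```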
A final note: this probably doesn't make much difference to the memory footprint in the example above, but if you have lots of nested items to pull out and reassemble into a table, you'll likely see more of a benefit.