
The most popular example I have seen of using stream_in with a custom handler uses stream_out to write the processed JSON to a file connection. It is not clear to me how to write a custom handler that stores every page it processes and binds them into a single data frame to be returned, as the default handler does.

The following example returns NULL:

library(jsonlite)

handler <- function(df){
  # process df and store in result
  ...
  return(result)
}
x <- stream_in(file_connection, simplifyVector = FALSE, handler = handler)
# x is NULL

Is there a way to bind the result from multiple handler calls without writing intermediate results to disk?

user12397
  • `<<-` assignments inside the function will let the handler mutate a global variable. It's an expensive operation, since multiple copies will be made as you (I'm assuming) rbind them. I think another benefit of writing to a file (in my personal use) is that you get a thinner data file for re-use. But rbind to a global variable if you want to stay in memory. – hrbrmstr Nov 06 '17 at 20:15
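
A minimal sketch of the `<<-` idea from the comment above, under some assumptions: the input is a small in-memory NDJSON vector (hypothetical data standing in for the real file connection), and the handler appends each page to a global list so that the pages are bound once at the end instead of calling rbind on every page. The names `ndjson`, `pages` and `result` are illustrative only.

```r
library(jsonlite)

# Hypothetical in-memory NDJSON standing in for the real file connection
ndjson <- c('{"id":1,"value":10}', '{"id":2,"value":25}', '{"id":3,"value":5}')

pages <- list()  # global accumulator that the handler mutates

handler <- function(df) {
  # `<<-` assigns to `pages` in the enclosing (here: global) environment
  pages[[length(pages) + 1]] <<- df
}

con <- textConnection(ndjson)
stream_in(con, handler = handler, pagesize = 2, verbose = FALSE)
close(con)

# Bind once at the end rather than growing a data frame on every page
result <- do.call(rbind, pages)
```

Appending to a list and binding once sidesteps most of the copying cost that repeated rbind calls would incur, at the price of holding all pages in memory until the end.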

1 Answer


The first answer here: https://stackoverflow.com/a/46646268/8440355

Example:

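    new_df <- new.env()
    new_df$data <- data.frame()
    stream_in(file("file_name.json"), handler = function(df){
      # append the filtered page to the data frame held in the environment
      new_df$data <- rbind(new_df$data, dplyr::filter(df, col_name_1 < 20))
    }, pagesize = 5000)
    # new_df$data now holds the combined result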

This requires that each page of your JSON converts cleanly to a data frame (heavy nesting may cause errors in rbind). I hope the "new.env" logic is clear: the handler accumulates its results in an environment in memory, which avoids creating temp files.
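
For the nesting caveat, one possible workaround is to flatten each page before storing it. The sketch below is not part of the linked answer. It assumes hypothetical records with a single nested `stats` object, and uses `jsonlite::flatten()` to turn the nested data frame column into plain columns before the page goes into the environment. The names `store`, `stats` and `score` are made up for illustration.

```r
library(jsonlite)

# Hypothetical NDJSON with one nested object per record
ndjson <- c(
  '{"id":1,"stats":{"score":10,"level":"a"}}',
  '{"id":2,"stats":{"score":25,"level":"b"}}',
  '{"id":3,"stats":{"score":5,"level":"a"}}'
)

store <- new.env()
store$pages <- list()

handler <- function(df) {
  # flatten() expands the nested `stats` column into stats.score, stats.level
  flat <- jsonlite::flatten(df)
  # keep only the rows of interest, then stash the page in the environment
  keep <- flat[flat$stats.score < 20, , drop = FALSE]
  store$pages[[length(store$pages) + 1]] <- keep
}

con <- textConnection(ndjson)
stream_in(con, handler = handler, pagesize = 2, verbose = FALSE)
close(con)

# One bind at the end; all pages share the same flat columns
result <- do.call(rbind, store$pages)
```

If pages can end up with different column sets, base rbind will still fail, and something like dplyr::bind_rows (which fills missing columns with NA) may be a better fit for the final step.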