
I'm trying to save a correctly formatted JSON file to AWS S3.

I can save a regular data frame to s3 with e.g.

library(tidyverse)
library(aws.s3)
s3save(mtcars, bucket = "s3://ourco-emr/", object = "tables/adhoc.db/mtcars/mtcars")

But I need to get mtcars into JSON format, specifically ndjson (newline-delimited JSON).

I am able to create a correctly formatted JSON file with, e.g.:

predictions_file <- file("mtcars.json")
jsonlite::stream_out(mtcars, predictions_file)

This saves a file to my directory called mtcars.json.
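Each line of the resulting file should be a standalone JSON object; a quick sanity check (assuming the file was written to the working directory as above) is to parse the first line on its own:

```r
library(jsonlite)

# Read the ndjson file back as plain text, one element per line
lines <- readLines("mtcars.json")

# Each line must parse independently as a complete JSON object
fromJSON(lines[1])
```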

However, with the aws.s3 function s3save(), I need to send an object that's in memory, not a file.

Tried:

predictions_file <- file("mtcars.json")
s3write_using(mtcars, 
              FUN = jsonlite::stream_out,
              con = predictions_file,
              "s3://ourco-emr/", 
              object = "tables/adhoc.db/mtcars/mtcars")

Gives:

Error in if (verbose) message("opening ", is(con), " output connection.") : argument is not interpretable as logical

I tried the same code block but left out the con = predictions_file line; that just gave:

Argument con must be a connection.

If the function jsonlite::stream_out() creates a correctly formatted JSON file, how can I then write that file to S3?

Edit: The desired JSON output would look like this:

{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3,"wt":2,"qsec":16,"vs":0,"am":1,"gear":4,"carb":4,"year":"2020","month":"03","day":"05"}
{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3,"wt":2,"qsec":17,"vs":0,"am":1,"gear":4,"carb":4,"year":"2020","month":"03","day":"05"}
{"mpg":22,"cyl":4,"disp":108,"hp":93,"drat":35,"wt":2,"qsec":18,"vs":1,"am":1,"gear":4,"carb":1,"year":"2020","month":"03","day":"05"}
{"mpg":21,"cyl":6,"disp":258,"hp":110,"drat":8,"wt":3,"qsec":19,"vs":1,"am":0,"gear":3,"carb":1,"year":"2020","month":"03","day":"05"}
{"mpg":18,"cyl":8,"disp":360,"hp":175,"drat":3,"wt":3,"qsec":17,"vs":0,"am":0,"gear":3,"carb":2,"year":"2020","month":"03","day":"05"}

When attempting with readChar():

mtcars_string <- readChar("mtcars.json", 1e6)
s3save(mtcars_string, bucket = "s3://ourco-emr/", object = "tables/adhoc.db/mtcars/2020/03/06/mtcars")

If I then download and open the resulting json file, it looks like this:

5244 5833 0a58 0a00 0000 0300 0306 0000
0305 0000 0000 0555 5446 2d38 0000 0402
0000 0001 0004 0009 0000 000d 6d74 6361
7273 5f73 7472 696e 6700 0000 1000 0000
0100 0400 0900 0012 347b 226d 7067 223a
3231 2c22 6379 6c22 3a36 2c22 6469 7370

So it looks like a serialized R binary has been sent to AWS S3, as opposed to JSON.
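This is expected behaviour: s3save() serializes objects with save(), and the leading bytes of the hex dump are the magic header of that format. Decoding them confirms it:

```r
# The first four bytes of the downloaded file (52 44 58 33), decoded as ASCII
rawToChar(as.raw(c(0x52, 0x44, 0x58, 0x33)))
#> [1] "RDX3"
```

"RDX3" marks an R workspace file written by save(), so s3save() uploaded a binary serialization of the character vector rather than the JSON text itself.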


1 Answer

I had the same problem: I needed to write and upload JSON lines (ndjson) to S3, and as far as I know only stream_out() from the jsonlite package writes JSON lines.

stream_out() only accepts connection objects as a destination. s3write_using(), however, writes to a temporary file tmp and passes the path to that file to FUN as a string. stream_out() then throws the error:

Argument con must be a connection.

A tentative fix is to modify s3write_using() so that it passes a connection to FUN instead of a file path string:

  1. Run trace(s3write_using, edit = TRUE), which opens an editor.

  2. Change line 5:
    value <- FUN(x, tmp, ...)

    To this:
    value <- FUN(x, file(tmp), ...)

You can then upload the data using stream_out():

s3write_using(x = data, 
              FUN = stream_out,
              bucket = 'mybucket',
              object = 'my/object.json',
              opts = list(acl = "private", multipart = FALSE, verbose = TRUE, show_progress = TRUE))

The edit persists for the whole session, or until you run untrace(s3write_using).
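An alternative that avoids trace() altogether is to write the ndjson to a local temporary file yourself and upload it with aws.s3::put_object(), which sends the file bytes as-is, with no save() serialization. A minimal sketch (the bucket and object names are placeholders):

```r
library(aws.s3)
library(jsonlite)

# Write ndjson to a temporary file; stream_out() opens and closes
# an unopened connection itself, one JSON object per line
tmp <- tempfile(fileext = ".json")
stream_out(mtcars, file(tmp))

# Upload the raw file bytes; put_object() does not re-serialize
put_object(file = tmp,
           object = "my/object.json",
           bucket = "mybucket")
```

This keeps s3write_using() untouched, at the cost of managing the temporary file yourself.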

One should probably file a feature request on the cloudyr/aws.s3 GitHub repository, as this seems to be a common use case.
