The Question:

I am pulling a dataset from an S3 bucket, where it is stored as a gzip-compressed CSV file. The S3 API returns the data as a raw vector, which I currently write to a file and then read back in from that file.

Is there a way that I can just read the data directly from the raw vector without having to write these temporary files?

Current code:

# Import packages -------------------------------------------------------------
library(paws)

# Set up S3 access ------------------------------------------------------------
s3 <- paws::s3()
aws_s3_bucket <- Sys.getenv("AWS_S3_BUCKET")

# Fetch dataset from S3 which returns a raw vector ----------------------------
s3_object <- s3$get_object(Bucket = aws_s3_bucket, Key = "data.csv.gz")
s3_object_body <- s3_object$Body

# Write the raw vector to a temporary file (This is what I want to remove) ----
file_name <- "s3_files/mydataset.csv.gz"
if (file.exists(file_name)) { unlink(file_name) }
writeBin(s3_object_body, con = file_name)

# Finally, read the data from the file ----------------------------------------
data <- data.table::fread(file_name)

Some attempts that haven't worked:

Trying readr::read_csv(s3_object_body) results in the following error:

Error in vroom_(file, delim = delim %||% col_types$delim, col_names = col_names,  : 
  embedded nul in string: '\037<U+008B>\b\b<f1>7\aa\002<ff>data.csv'

Trying iotools::read.csv.raw(s3_object_body) results in the following error:

Error in isOpen(con, "rb") : unimplemented type 'raw' in 'asInteger'
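For reference, a minimal sketch of the kind of thing I am after, using only base R. The first error above contains the gzip magic bytes (\037, <U+008B>), which suggests the raw vector was parsed without being decompressed first. The idea here is to wrap the raw vector in rawConnection(), decompress it through gzcon(), and parse the resulting lines. The helper name read_gz_csv_raw is hypothetical, and the demo bytes are a stand-in for s3_object$Body (a temp file is used only to manufacture those stand-in bytes). I have not confirmed this against a real S3 object or a large file.

```r
# Hypothetical helper: parse a gzip-compressed CSV held in a raw vector.
read_gz_csv_raw <- function(body) {
  # rawConnection() exposes the bytes as a connection;
  # gzcon() decompresses what is read through it (gzip headers assumed).
  con <- gzcon(rawConnection(body))
  # Closing the gzcon wrapper also closes the underlying raw connection.
  on.exit(close(con))
  lines <- readLines(con)
  # read.csv(text = ...) parses the lines via a text connection.
  utils::read.csv(text = lines, stringsAsFactors = FALSE)
}

# Stand-in for s3_object$Body: build gzip bytes locally for the demo only.
tmp <- tempfile(fileext = ".csv.gz")
gz <- gzfile(tmp, "w")
writeLines(c("id,value", "1,a", "2,b"), gz)
close(gz)
body <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

df <- read_gz_csv_raw(body)
print(df)
```

With the real object this would be called as read_gz_csv_raw(s3_object_body). If data.table is preferred, data.table::fread(text = lines) should be able to replace the read.csv() call, since fread accepts a character vector of lines through its text argument.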
