The Question:
I am pulling a dataset from an S3 bucket, where it is stored as a gzip-compressed CSV file. The S3 API returns the data as a raw vector, which I currently write to a file and then read back in from that file.
Is there a way to read the data directly from the raw vector, without writing a temporary file?
Current code:
# Import packages -------------------------------------------------------------
library(paws)
# Set up S3 access ------------------------------------------------------------
s3 <- paws::s3()
aws_s3_bucket <- Sys.getenv("AWS_S3_BUCKET")
# Fetch dataset from S3 which returns a raw vector ----------------------------
s3_object <- s3$get_object(Bucket = aws_s3_bucket, Key = "data.csv.gz")
s3_object_body <- s3_object$Body
# Write the raw vector to a temporary file (This is what I want to remove) ----
file_name <- "s3_files/mydataset.csv.gz"
if (file.exists(file_name)) { unlink(file_name) }
writeBin(s3_object_body, con = file_name)
# Finally, read the data from the file ----------------------------------------
data <- data.table::fread(file_name)
Some attempts that haven't worked:
Trying readr::read_csv(s3_object_body)
results in the following error:
Error in vroom_(file, delim = delim %||% col_types$delim, col_names = col_names, :
embedded nul in string: '\037<U+008B>\b\b<f1>7\aa\002<ff>data.csv'
Trying iotools::read.csv.raw(s3_object_body)
results in the following error:
Error in isOpen(con, "rb") : unimplemented type 'raw' in 'asInteger'
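A note on the errors above: the bytes `\037` and `<U+008B>` at the start of the readr message are 0x1f 0x8b, the gzip magic number, which suggests the raw vector is still compressed when it reaches the parser. Below is a minimal sketch of one way to decompress it in memory with base R, wrapping the raw vector in rawConnection() and gzcon(). The small CSV and the tempfile used to build it are hypothetical stand-ins for the real s3_object$Body, included only so the sketch runs without S3 access; I have not tested it against an actual paws response.

```r
# Stand-in for s3_object$Body: gzip-compressed CSV bytes held in a raw vector.
# (The temporary file is used only to fabricate demo bytes; it is an assumption
# for illustration and not part of the approach itself.)
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, "wb")
writeLines(c("id,value", "1,a", "2,b"), con)
close(con)
s3_object_body <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

# The first two bytes are the gzip magic number, so the body is still compressed.
stopifnot(identical(s3_object_body[1:2], as.raw(c(0x1f, 0x8b))))

# Decompress in memory: expose the raw vector as a connection, then let
# gzcon() gunzip the stream as it is read.
gz_con <- gzcon(rawConnection(s3_object_body))
csv_lines <- readLines(gz_con)
close(gz_con)  # also closes the underlying raw connection

# Parse the decompressed lines; read.csv() accepts them via `text =`.
data <- read.csv(text = csv_lines)
```

With the real object, only the last two steps would be needed, starting from s3_object_body. data.table::fread() also has a `text` argument that should accept csv_lines in place of read.csv(), though that is worth checking against the installed data.table version.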