
I am using RStudio on AWS to read a 35 GB CSV file from S3 and perform analyses. I chose an m4.4xlarge machine with 62 GB of memory, but I keep getting the following message when reading the data, before any analyses have been performed: "Error: cannot allocate vector of size 33.0 Gb". The code I used is:

library("aws.s3")
Sys.setenv("AWS_ACCESS_KEY_ID" = "xxxxxxx",
       "AWS_SECRET_ACCESS_KEY" = "yyyyyyy")
obj <-get_object("s3://xxx/yyy.csv")  

When I used the following code,

aws.s3::s3read_using(read.csv, object = "s3://xxx/yyyy.csv")

The error message becomes:

Error in curl::curl_fetch_disk(url, x$path, handle = handle) : 
Failed writing body (4400 != 16360)

I am not familiar with Linux, and I used Louis Aslett's AMI (http://www.louisaslett.com/RStudio_AMI/). Is there any setting I should change? Thank you!

I suspect my question is related to the following two questions, but no clear answer has been posted for either:

Reading large JSON files from S3 in RStudio EC2 instance (Louis Aslett's AMI)

Trouble Uploading Large Files to RStudio using Louis Aslett's AMI on EC2

Claire Cui
  • Please add the relevant parts of your code to your question to make it easier for us to "diagnose". My first impression is that the R code (yours or that of the package you are using) is using too much RAM internally due to modifying the data. What exactly are you doing with the data? If you are **not** using a `data.table` and `fread()` you will have problems due to lack of memory (each small modification of a `data.frame` makes a full copy, and the size of your data is bigger than half the RAM...) – R Yoda Jun 09 '18 at 22:07
  • Thanks @RYoda! I have updated the question with the R code I used. My problem starts with reading the CSV file; no analyses have been performed yet. – Claire Cui Jun 10 '18 at 00:16
  • OK, the first step is to read the CSV file using `data.table::fread` instead of base R's `read.csv`. This is faster and has lower memory overhead (a sketch follows after this comment thread). Another strange thing: `curl::curl_fetch_disk` is called internally and cannot write (!). I am not sure if curl is writing only virtually ("piping") or physically (temp storage), so: do you have enough storage? Last thing: what is the encoding of the CSV file? If it has a non-standard character encoding you have to specify the `fileEncoding` argument (in `read.csv`) or the `encoding` argument (in `fread`), otherwise reading may be cut off before the EOF. – R Yoda Jun 10 '18 at 07:26
  • @RYoda `fread` works. After performing some analyses, I am having problems saving the file from RStudio to S3. I am using `aws.s3::s3write_using`: it works on a small dataset but not on the big dataset (35 GB). After running the `s3write_using` code for an hour, it is still running and the data has not been saved. Any idea how to efficiently save data from RStudio to S3, so that I do not need to run the code again the next time I start the instance? Thank you! – Claire Cui Jun 24 '18 at 14:02
  • Sorry, I have no experience with S3; perhaps open a new question, since this is different from the original question. Please describe how/where RStudio is running in your client/server setup, since the network is usually the limiting factor when transporting data... – R Yoda Jun 24 '18 at 15:28
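
A minimal sketch of the approach suggested in the comments above, assuming the placeholder bucket/key and credentials from the question; the local path /mnt/data/yyyy.csv is hypothetical and only stands in for a location with enough free space (see the answer below):

library("aws.s3")
library("data.table")

Sys.setenv("AWS_ACCESS_KEY_ID" = "xxxxxxx",
           "AWS_SECRET_ACCESS_KEY" = "yyyyyyy")

# Let aws.s3 download the object to a temporary file and pass that file to
# fread, which reads large CSVs faster and with less memory overhead than
# read.csv
dt <- s3read_using(fread, object = "s3://xxx/yyyy.csv")

# Or download the file to disk first (to a volume with enough free space),
# then read it; the encoding argument ("UTF-8" or "Latin-1") matters if the
# file is not plain ASCII
save_object(object = "s3://xxx/yyyy.csv", file = "/mnt/data/yyyy.csv")
dt <- fread("/mnt/data/yyyy.csv", encoding = "UTF-8")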

1 Answer


I have overcome a very similar problem in R with the same AMI. In my case the issue was that the default home directory is only about 8-10 GB, regardless of the size of your instance. Because the data was being read into the home directory, there was not enough room. From personal experience of the same error message when reading in data with this AMI, it sounds like the same problem.

This can be solved by writing to a different drive on the instance. Because the Louis Aslett RStudio AMI lives in this 8-10 GB space, you have to set your working directory outside it, i.e. outside the home directory. This is not intuitively apparent from the RStudio Server interface.

I believe the solution to your issue has nothing to do with the method used to read the data. The issue is that the home directory is less than 10 GB and you are trying to read into it (this is even more likely, in my opinion, if you are a Windows user, as you would not expect a machine with over 60 GB of RAM to have only 10 GB in the default directory). I would suggest having a look at the other directories (e.g. go up a few levels above home in the RStudio file-selection box on the right-hand side, or run the `df` command on the Linux command line). Then `setwd()` into another directory (e.g. one on a larger `xvd*` volume, or wherever has enough room) and try reading the data in again.
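
A rough sketch of that workflow, assuming a hypothetical mount point /mnt/data (substitute whichever volume `df -h` reports as having enough free space) and the placeholder S3 path from the question:

system("df -h")      # list the mounted volumes and their free space
setwd("/mnt/data")   # work outside the ~10 GB home directory

# download the CSV onto the larger volume, then read it from there
aws.s3::save_object(object = "s3://xxx/yyyy.csv", file = "yyyy.csv")
dt <- data.table::fread("yyyy.csv")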

I may not have written this answer in the way the moderators like, but I have overcome a similar problem, and this question has not had an answer for a year, so I hope it helps. (Oh, and I don't have enough "points" to write a comment, so it has to be an answer.)

Joey