I got the error "Error in curl::curl_fetch_memory(url, handle = handle) : Empty reply from server" for some operations in RStudio (Watson Studio) when I tried to do data manipulation on Spark data frames.
Background:
The data is stored on IBM Cloud Object Storage (COS). The full dataset will be several 10 GB files, but currently I'm testing on only the first subset (10 GB).
The workflow, as I envision it, is: in RStudio (Watson Studio), connect to Spark (free plan) using sparklyr, read the file as a Spark data frame through sparklyr::spark_read_csv(), then apply feature transformations (e.g., split one column into two, compute the difference between two columns, remove unwanted columns, filter out unwanted rows, etc.). After the preprocessing, save the cleaned data back to COS through sparklyr::spark_write_csv().
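In code, the intended pipeline is roughly the following (a sketch only; the connection details, bucket paths, and column names col_a, col_b, unwanted_col are placeholders, not my real ones):

    library(sparklyr)
    library(dplyr)

    # In Watson Studio the connection details for the Spark service come
    # from the environment; "local" below is only a placeholder.
    sc <- spark_connect(master = "local")

    # Read the 10 GB subset from COS into a Spark data frame
    # (bucket and file names are placeholders)
    raw <- spark_read_csv(sc, name = "raw_data",
                          path = "cos://my-bucket.my-service/first_subset.csv")

    # Example feature transformations (col_a, col_b, unwanted_col are made up)
    cleaned <- raw %>%
      mutate(diff = col_a - col_b) %>%   # difference between two columns
      select(-unwanted_col) %>%          # drop an unwanted column
      filter(diff > 0)                   # filter out unwanted rows

    # Save the cleaned data back to COS
    spark_write_csv(cleaned, path = "cos://my-bucket.my-service/cleaned.csv")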
To work with Spark I added two Spark services to the project (it seems that any Spark service under the account can be used by RStudio; is RStudio not limited to a project?). I may need R notebooks for data exploration (to show the plots in a nice way), which is why I created the project in the first place. In previous testing I found that R notebooks and RStudio cannot use the same Spark service at the same time, so I created two Spark services: the first for R notebooks (call it spark-1) and the second for RStudio (call it spark-2).
As I personally prefer sparklyr (pre-installed in RStudio only) over SparkR (pre-installed in R notebooks only), I have spent almost the whole week developing and testing code in RStudio using spark-2.
I'm not very familiar with Spark, and it currently behaves in ways I don't really understand. It would be very helpful if anyone could give suggestions on any of these issues:
1) failure to load data (occasionally)
It worked quite stably until yesterday, when I started to encounter issues loading data using exactly the same code. The error itself tells me nothing; R simply fails to fetch the data (Error in curl::curl_fetch_memory(url, handle = handle) : Empty reply from server). What I have observed several times is that after getting this error, if I run the import code again (a single line), the data loads successfully.
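In other words, simply retrying the same call succeeds. A sketch of what I effectively do by hand (the path is a placeholder):

    # Rerunning the same import after the curl error usually succeeds,
    # so this is essentially a manual retry loop (sketch).
    load_with_retry <- function(sc, path, attempts = 3) {
      for (i in seq_len(attempts)) {
        out <- tryCatch(
          spark_read_csv(sc, name = "raw_data", path = path, overwrite = TRUE),
          error = function(e) e
        )
        if (!inherits(out, "error")) return(out)
        message("attempt ", i, " failed: ", conditionMessage(out))
        Sys.sleep(5)  # short pause before retrying
      }
      stop("all attempts failed")
    }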
2) failure to apply a (possibly) large number of transformations (always, regardless of data size)
To check whether the data is transformed correctly, I print out the first several rows of the variables of interest after each transformation step (most steps are independent of each other, i.e., their order doesn't matter). I have read a little about how sparklyr translates operations: essentially, sparklyr doesn't actually apply the transformations to the data until you preview or print some of the transformed data. After a set of transformations, if I run some more and then try to print the first several rows, I get an error (the same unhelpful error as in issue 1). I'm sure the code is correct, because if I run those additional steps right after loading the data, I can print and preview the first several rows just fine.
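My understanding of that lazy translation, as a minimal sketch (raw is the table from the pipeline above; column names are made up):

    # dplyr verbs on a Spark data frame only build a SQL query;
    # nothing runs on Spark yet.
    pipeline <- raw %>%
      mutate(diff = col_a - col_b) %>%
      filter(diff > 0)

    show_query(pipeline)  # inspect the generated SQL without executing it
    head(pipeline)        # this is what triggers execution (and the error)

    # Intermediate results can also be forced to materialize:
    tmp <- compute(pipeline, name = "pipeline_tmp")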
3) failure to collect data (always for the first subset)
By collecting data I mean pulling the Spark data frame down to the local machine, here RStudio in Watson Studio. After applying the same set of transformations, I can collect the cleaned version of a sample dataset (originally 1,000 rows x 158 cols, about 1,000 rows x 90 cols after preprocessing), but I fail on the first 10 GB subset (originally 25,000,000 rows x 158 cols, at most 50,000 rows x 90 cols after preprocessing). In my estimate the collected result should not exceed 200 MB, which means it should fit into either Spark memory (1210 MB) or RStudio's RAM. But it just fails (again with that unhelpful error).
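For reference, the collect step itself is just the following (sketch); checking object.size() on the sample that does work is how I'd sanity-check the ~200 MB extrapolation:

    # Pull the preprocessed result (at most ~50,000 x 90) into local R memory
    local_df <- collect(cleaned)

    # On the 1,000-row sample this succeeds; its in-memory size, scaled up,
    # is the basis for the ~200 MB estimate.
    print(object.size(local_df), units = "Mb")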
4) failure to save out data (always, regardless of data size)
The same error happens every time I try to write the data back to COS. I suppose this has something to do with the transformations; maybe something goes wrong when Spark has received too many transformation requests?
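If the length of the pending transformation chain is indeed the problem, one thing to test (a sketch; I have not confirmed it helps) would be to materialize the result first, so the write doesn't replay the whole chain:

    # Hypothesis check (sketch): force the transformations to run and cache
    # the result as a table, then write the materialized table.
    cleaned_tbl <- compute(cleaned, name = "cleaned_materialized")
    spark_write_csv(cleaned_tbl,
                    path = "cos://my-bucket.my-service/cleaned.csv",  # placeholder
                    mode = "overwrite")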
5) failure to initialize Spark (some kind of pattern found)
Starting from this afternoon, I cannot initialize spark-2, which I had been using for about a week; I get the same unhelpful error message. However, I am still able to connect to spark-1.
I checked the Spark instance information on IBM Cloud: it's odd that spark-2 shows 67 active tasks, given that my previous operations all ended in error messages. Also, I'm not sure why the "input" figures for both Spark instances are so large.
Does anyone know what happened and why it happened? Thank you!