I am relatively new as an active user of this forum, but first I have to thank you all for your contributions, because I have been finding answers here for years.

Today I have a question that nobody seems to have solved, or at least I have not been able to find a solution.

I am trying to read files in parallel from S3 (AWS) into Spark (on my local computer) as part of a test system. I have used mclapply(), but when I set more than 1 core, it fails...

Example (the same code works when using one core, but fails when using 2):

new_rdd_global <- mclapply(seq(file_paths), function(i) {
  spark_read_parquet(sc, name = paste0("rdd_", i), path = file_paths[i])
}, mc.cores = 1)

new_rdd_global <- mclapply(seq(file_paths), function(i) {
  spark_read_parquet(sc, name = paste0("rdd_", i), path = file_paths[i])
}, mc.cores = 2)

Warning message:
In mclapply(seq(file_paths), function(i) { :
  all scheduled cores encountered errors in user code

Any suggestions?

Thanks in advance.

  • Can you please clarify the question? It's not at all obvious to me what you're asking. You might also want to add some short explanation as to what you're trying to achieve. – Oldřich Spáčil Oct 27 '17 at 11:41
  • I am trying to parallelize reads from an s3a bucket that has lots of parquet files stored in different directories. In this case, "file_paths" is a variable holding a list of full paths, and there is nothing more to it... It's conceptually simple, but I don't know whether I can read files in parallel or not. – José Ángel Fernández Segovia Oct 27 '17 at 11:59

1 Answer

Just read everything into one table via a single spark_read_parquet() call; that way Spark handles the parallelization for you. If you need separate tables, you can split them afterwards, assuming there is a column that tells you which file the data came from. In general you shouldn't need mclapply() when using Spark with R (the forked worker processes most likely cannot safely share the single sc connection, which is why the mc.cores = 2 run fails).
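
As a minimal sketch (the s3a path and the source_file column below are hypothetical, and this assumes all the parquet files share the same schema):

library(sparklyr)
library(dplyr)

# One spark_read_parquet() call reads every parquet file under the
# prefix; Spark distributes the read across its workers.
all_tbl <- spark_read_parquet(
  sc,
  name = "rdd_all",
  path = "s3a://my-bucket/data/*"  # hypothetical bucket and prefix
)

# If separate tables are needed afterwards, split on a column that
# identifies the source file (assumed to exist in the data).
first_file_tbl <- all_tbl %>% filter(source_file == "file_1")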

kevinykuo
  • I tried to do that, but spark_read_parquet() is throwing a warning when trying to pass more than one path at a time: spark_read_parquet(sc, "rdd_new", as.list(file_paths) ) In addition: Warning message: In if (grepl("[a-zA-Z]+://", path)) { : the condition has length > 1 and only the first element will be used – José Ángel Fernández Segovia Oct 30 '17 at 08:42
  • I could read multiple folders by specifying them with wildcard patterns, but it seems that spark_read_parquet() does not allow passing a list of paths. Is that possible? – José Ángel Fernández Segovia Oct 30 '17 at 11:17
  • Is the data in your folders of exactly the same structure? If so, you can use wildcard characters to point to the folders you need and read all the data in one call. – Oldřich Spáčil Oct 31 '17 at 10:28
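
Along the lines of that last comment, a wildcard read might look like this (a sketch only; the bucket, the parent_dir prefix, and the dir_a/dir_b folder names are hypothetical, and Hadoop-style globs are assumed to match your directory layout):

# Point one spark_read_parquet() call at the folders you need using
# Hadoop glob syntax; Spark parallelizes the scan itself.
all_tbl <- spark_read_parquet(
  sc,
  name = "rdd_all",
  path = "s3a://my-bucket/parent_dir/{dir_a,dir_b}"  # glob alternation
)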