
I'd like to read a selected subset of multiple files with sparklyr. I have multiple CSV files (e.g. a1.csv, a2.csv, a3.csv, a4.csv, a5.csv) in a folder, and I'd like to read the a2.csv, a3.csv, and a4.csv files at once if possible.

I know I can read a single CSV file with spark_read_csv(sc, "cash", "/dir1/folder1/a2"), so I tried:

# start from an empty two-column Spark table to bind onto
a_all <- data.frame(col1 = integer(), col2 = integer())
a_all <- sdf_copy_to(sc, a_all, "a_all")

for (i in 2:4) {
  # use a distinct temp-table name per file; reusing "tmp1" would let each
  # iteration overwrite the data that the earlier lazy binds refer to
  tmp <- spark_read_csv(sc = sc, name = paste0("tmp", i),
                        path = paste0("/dir1/folder1/a", i))
  a_all <- sdf_bind_rows(a_all, tmp)
}

As a result I get a spark_tbl that binds the a2.csv, a3.csv, and a4.csv files together, i.e. rbind(a2, a3, a4).

I think there is an easier way to do this (maybe without a for loop) using the path= argument, but I am not sure how to select only a few CSV files in a folder. Please help!
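
For comparison, here is a loop-free sketch of the same bind (untested; it assumes the files really end in .csv and all share the same schema):

paths <- paste0("/dir1/folder1/a", 2:4, ".csv")
# read each file into its own Spark table, then union them all at once
tbls <- lapply(seq_along(paths), function(i) {
  spark_read_csv(sc, name = paste0("a", i + 1), path = paths[i])
})
a_all <- do.call(sdf_bind_rows, tbls)

But this still reads the files one at a time, which is why I am asking about path=.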

Kate
  • Unfortunately `sparklyr` doesn't support multiple paths in a single `spark_read_source` call (see https://stackoverflow.com/q/49586714/6910411). In such a simple case you can use `spark_read_csv(sc=sc, name="tmp1", "/dir1/folder1/a[234].csv")`, but glob patterns have rather limited applications. You could try to drop down to SQL, but I doubt it is worth all the fuss. In other words, your code looks just fine. – zero323 Jan 24 '19 at 15:22
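
For reference, the glob approach from the comment above would look like this (a minimal sketch; the directory and file names come from the question, and it assumes the files actually end in .csv):

library(sparklyr)

# Read a2.csv, a3.csv and a4.csv in one call using a Hadoop glob pattern
# in the path; [234] matches exactly one of the characters 2, 3 or 4.
a_all <- spark_read_csv(
  sc,
  name = "a_all",
  path = "/dir1/folder1/a[234].csv"
)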

0 Answers