What is the options parameter of spark_write_csv dplyr function?

Question

I am looking for a way to make spark_write_csv upload only a single file to S3, because I want to save a regression result there. Does options have a parameter that defines the number of partitions? I could not find one anywhere in the documentation. Or is there another efficient way to upload the resulting table to S3?

Any help is appreciated!

zero323 · Answer 1 · 2019-02-01T13:34:05.623

The options argument is equivalent to the options call on the DataFrameWriter (see the DataFrameWriter.csv documentation for the full list of options specific to the CSV source), and it cannot be used to control the number of output partitions.
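As a rough illustration of what options is for, the sketch below passes a named list through to the writer. The option values, the helper name write_with_options, and the S3 path are assumptions made for the example; compression and dateFormat are among the CSV writer options listed in the DataFrameWriter.csv documentation.

```r
# Hypothetical option values; each name/value pair is forwarded to the
# DataFrameWriter as a CSV source option.
csv_options <- list(
  compression = "gzip",
  dateFormat = "yyyy-MM-dd"
)

# Hypothetical helper. Calling it requires sparklyr, a live Spark connection,
# and a Spark DataFrame `df`; the default path is only a placeholder.
write_with_options <- function(df, path = "s3a://my-bucket/regression-result") {
  sparklyr::spark_write_csv(df, path = path, header = TRUE, options = csv_options)
}
```

None of these options changes how many part files are written; that is determined by the partitioning of the data itself.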

While it is not recommended in general (a single partition means the whole result is written by a single task), you can use the Spark API to coalesce the data and convert it back to a sparklyr tbl:

df %>% 
  spark_dataframe() %>% 
  invoke("coalesce", 1L) %>% 
  invoke("createOrReplaceTempView", "_coalesced")

tbl(sc, "_coalesced") %>% spark_write_csv(...)

Or, in recent versions, use sparklyr::sdf_coalesce:

df %>% sparklyr::sdf_coalesce(1)
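Putting the two steps together, a minimal sketch might look like the following. The helper name write_single_part_csv is hypothetical, and it assumes a sparklyr version that provides sdf_coalesce. Note that Spark still writes a directory at the given path; after coalescing, that directory contains a single part file rather than several.

```r
# Hypothetical helper: coalesce to one partition, then write as CSV.
# Calling it requires sparklyr and a Spark DataFrame `df`.
write_single_part_csv <- function(df, path, ...) {
  # Reduce to a single partition so that only one part file is produced.
  coalesced <- sparklyr::sdf_coalesce(df, partitions = 1L)
  # Extra arguments (header, options, mode, ...) are passed to spark_write_csv.
  sparklyr::spark_write_csv(coalesced, path = path, ...)
}
```

With a connection in place, a call could look like write_single_part_csv(df, "s3a://my-bucket/regression-result", mode = "overwrite"), where the bucket path is a placeholder.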
