
I want to upload a DataFrame to a server as a Gzip-encoded CSV file without saving it to disk.

It is easy to write a Gzip-encoded CSV file using the spark-csv library:

df.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
    .save("result.csv.gz")

But I have no idea how to get an Array[Byte] representing my DataFrame that I can upload via HTTP.
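For reference, one way to produce such bytes is to build the CSV text in memory from collected rows and compress it with java.util.zip.GZIPOutputStream. This is only a sketch, not from the original post: the helper name toGzippedCsvBytes is hypothetical, and it only works for DataFrames small enough to collect to the driver (e.g. via df.collect()):

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// Sketch: turn a header and collected rows into gzipped CSV bytes.
// Assumes the data fits in driver memory; no quoting/escaping of fields.
def toGzippedCsvBytes(header: Seq[String], rows: Seq[Seq[Any]]): Array[Byte] = {
  val csv = (header +: rows.map(_.map(_.toString)))
    .map(_.mkString(","))
    .mkString("\n")
  val bos  = new ByteArrayOutputStream()
  val gzip = new GZIPOutputStream(bos)
  gzip.write(csv.getBytes("UTF-8"))
  gzip.close() // must finish the gzip stream before reading the bytes
  bos.toByteArray
}
```

The resulting Array[Byte] can then be sent as the body of an HTTP request with Content-Encoding: gzip.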

Makrushin Evgenii

1 Answer


You could write to your remote server as if it were a remote HDFS server. You'd need HDFS installed on that machine, but after that you should be able to do something like:

df.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
    .save("hdfs://your_remote_server_hostname_or_ip/result.csv.gz")
randal25