
I am writing a DataFrame as a TSV file to the Databricks File System (DBFS). The data is large (30 GB to 1 TB). I am currently using the code below:

df.coalesce(1).write.format("csv") \
    .option("delimiter", "\t") \
    .option("nullValue", None) \
    .option("header", inheader) \
    .mode("overwrite") \
    .save(tsvPathtemp)

For 100 GB it takes about an hour to write the file. I tried removing the coalesce(1); that wrote multiple part files, but I want a single TSV file as the output.

Can anyone suggest a better approach or code for writing the file?
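For context, one approach I am considering (a sketch, not yet tested at this scale) is to let Spark write the data in parallel as multiple part files and then concatenate them into a single TSV through the DBFS FUSE mount at /dbfs. Here tsv_parts_path and tsv_final_path are placeholder DBFS paths without the "dbfs:" scheme (e.g. "/tmp/out_parts"), and inheader is assumed to be the same boolean flag as above:

# 1) Write in parallel (no coalesce), so the write is not bottlenecked on one task.
df.write.format("csv") \
    .option("delimiter", "\t") \
    .option("header", inheader) \
    .mode("overwrite") \
    .save(tsv_parts_path)

# 2) Concatenate the part files into one TSV via the local /dbfs mount,
#    keeping the header only from the first part file.
import glob
import shutil

part_files = sorted(glob.glob("/dbfs" + tsv_parts_path + "/part-*"))
with open("/dbfs" + tsv_final_path, "wb") as out:
    for i, part in enumerate(part_files):
        with open(part, "rb") as f:
            if i > 0 and inheader:
                f.readline()  # skip the duplicated header line in later parts
            shutil.copyfileobj(f, out)

The concatenation still runs single-threaded on the driver, so it is not free, but it avoids shuffling the whole dataset into one task the way coalesce(1) does.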

Also, how can I import the Hadoop file system classes in a Databricks notebook? I want to use the imports below, as in the question linked underneath:

import org.apache.hadoop.fs.FileUtil
import org.apache.hadoop.fs.FileSystem

Merge Spark output CSV files with a single header
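Those Scala imports work as-is in a Scala cell (%scala). From a Python notebook, one way I have seen to reach the same Hadoop classes is through PySpark's JVM gateway. This is only a sketch: _jvm and _jsc are internal SparkContext attributes rather than public API, and tsv_parts_path is a placeholder path:

# Access the Hadoop FileSystem API from PySpark via the py4j JVM gateway.
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

# Example: list the part files Spark wrote to the output directory.
src_dir = jvm.org.apache.hadoop.fs.Path(tsv_parts_path)
for status in fs.listStatus(src_dir):
    print(status.getPath().toString())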
