
I am writing a DataFrame as a TSV file to the Databricks File System (DBFS). The data is large (30 GB to 1 TB). I am currently using the code below:

df.coalesce(1).write.format("csv") \
    .option("delimiter", "\t") \
    .option("nullValue", None) \
    .option("header", inheader) \
    .mode("overwrite") \
    .save(tsvPathtemp)

For 100 GB it takes about an hour to write the file. I tried removing the coalesce(1); that wrote multiple part files, but I want a single TSV file as the output.

Can anyone suggest a better approach or code for writing the file?
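For context, one approach I am considering (a sketch, not yet tested at this scale) is to let Spark write the data in parallel as multiple part files and then concatenate them into a single TSV through the DBFS FUSE mount at /dbfs. Here tsv_parts_path and tsv_final_path are placeholder DBFS paths without the "dbfs:" scheme (e.g. "/tmp/out_parts"), and inheader is assumed to be the same boolean flag as above:

# 1) Write in parallel (no coalesce), so the write is not bottlenecked on one task.
df.write.format("csv") \
    .option("delimiter", "\t") \
    .option("header", inheader) \
    .mode("overwrite") \
    .save(tsv_parts_path)

# 2) Concatenate the part files into one TSV via the local /dbfs mount,
#    keeping the header only from the first part file.
import glob
import shutil

part_files = sorted(glob.glob("/dbfs" + tsv_parts_path + "/part-*"))
with open("/dbfs" + tsv_final_path, "wb") as out:
    for i, part in enumerate(part_files):
        with open(part, "rb") as f:
            if i > 0 and inheader:
                f.readline()  # skip the duplicated header line in later parts
            shutil.copyfileobj(f, out)

The concatenation still runs single-threaded on the driver, so it is not free, but it avoids shuffling the whole dataset into one task the way coalesce(1) does.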

Also, how can I import the Hadoop file system classes in a Databricks notebook? I want to use the imports below, as in the question linked underneath:

import org.apache.hadoop.fs.FileUtil
import org.apache.hadoop.fs.FileSystem

Merge Spark output CSV files with a single header
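Those Scala imports work as-is in a Scala cell (%scala). From a Python notebook, one way I have seen to reach the same Hadoop classes is through PySpark's JVM gateway. This is only a sketch: _jvm and _jsc are internal SparkContext attributes rather than public API, and tsv_parts_path is a placeholder path:

# Access the Hadoop FileSystem API from PySpark via the py4j JVM gateway.
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

# Example: list the part files Spark wrote to the output directory.
src_dir = jvm.org.apache.hadoop.fs.Path(tsv_parts_path)
for status in fs.listStatus(src_dir):
    print(status.getPath().toString())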
