I am trying to merge multiple part files into a single file. In the staging folder it iterates over all the files, and the schema is the same for all of them. We convert the part files to .Tab files. The files are generated based on salesorgcode, e.g. 7001, 600, 8002; every country has a different salesorgcode, but the schema is the same. Can anyone suggest an approach? Note: the files are kept in a blob container.
First, mount the source container to Databricks and collect the Parquet files that have "part" in the file name into a list.
Then read them as a PySpark dataframe. To get a single output file, convert it into a pandas dataframe and write it to the output folder using the mount point.
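If the container is not already mounted, it can be mounted first. Below is a minimal sketch assuming a hypothetical storage account mystorage, container data, and an account key stored in a secret scope (all of these names are placeholders to replace with your own):
# Hypothetical values: substitute your storage account, container, and secret scope
dbutils.fs.mount(
    source="wasbs://data@mystorage.blob.core.windows.net",
    mount_point="/mnt/data",
    extra_configs={
        "fs.azure.account.key.mystorage.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-key")
    }
)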
# List the Parquet part files under the mount point
paths = [x.path for x in dbutils.fs.ls("/mnt/data/folder1/") if "part" in x.path]

# Read all part files into a single PySpark dataframe
df1 = spark.read.parquet(*paths)
df1.show()

# Convert the PySpark dataframe to a pandas dataframe
pandas_converted = df1.toPandas()

# Write the pandas dataframe to a .tab file in blob storage via the mount point
pandas_converted.to_csv('/dbfs/mnt/data/file1.tab', sep='\t', index=False)
Adjust the header and separator arguments of to_csv as per your requirement.
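Since the question mentions one set of part files per salesorgcode (e.g. 7001, 600, 8002), the same pattern can be repeated per code to produce one merged .tab file each. A sketch, assuming one subfolder per salesorgcode under the mount point (the folder layout and code list are assumptions):
# Hypothetical: one subfolder per salesorgcode under /mnt/data/folder1/
salesorg_codes = ["7001", "600", "8002"]
for code in salesorg_codes:
    # List the part files for this salesorgcode (path layout is an assumption)
    paths = [x.path for x in dbutils.fs.ls(f"/mnt/data/folder1/{code}/") if "part" in x.path]
    if not paths:
        continue
    # Merge all part files for this code and write a single .tab file
    df = spark.read.parquet(*paths)
    df.toPandas().to_csv(f'/dbfs/mnt/data/{code}.tab', sep='\t', index=False)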