
For example, I have over 300 files in a nested folder and I have to combine all of them using PySpark or Python pandas.

File1: Date, channel, spend, clicks
File2: date, channel, clicks, spend
File3: no header
File4: some extra columns in addition to the mandatory ones
Etc.

I am expecting a single output file combining all the files in the folder, despite their different structures.

1 Answer


You can enforce a schema object to take care of files with no headers and unify the structure using spark.read.schema(SchemaObject).csv(FilesPath).
You can use coalesce(1) before writing to fit all records into one file: df.coalesce(1).write.csv(DestinationPath)
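
A minimal sketch of that approach, assuming the mandatory columns are date, channel, spend, and clicks, and using gs://my-bucket/... as hypothetical input and output paths:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, IntegerType, DateType)

spark = SparkSession.builder.appName("combine_csvs").getOrCreate()

# One schema enforced on every file, so headerless files still get named, typed columns.
schema = StructType([
    StructField("date", DateType(), True),
    StructField("channel", StringType(), True),
    StructField("spend", DoubleType(), True),
    StructField("clicks", IntegerType(), True),
])

# Wildcards in the path cover the nested folder layout (hypothetical bucket/prefix).
df = spark.read.schema(schema).csv("gs://my-bucket/data/*/*.csv")

# Files that do contain a header row produce one record whose date fails the cast
# and comes back null under the enforced schema; filter those stray lines out
# (note this also drops data rows whose date is empty).
df = df.filter(F.col("date").isNotNull())

# coalesce(1) merges all partitions so the output folder holds a single part file.
(df.coalesce(1)
   .write.mode("overwrite")
   .option("header", "true")
   .csv("gs://my-bucket/output/combined"))
```

Note that an enforced schema reads CSV columns positionally, so files whose columns are ordered differently or carry extras may need their own schema or a separate read followed by a union.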

ARCrow
  • Thank you, but how do I loop through the files, reading from GCP folders, and display the columns in each file? – Anvaith O9999 Jan 06 '23 at 05:36
  • You can use regular expressions in your file read path. Check this out: https://stackoverflow.com/questions/32233575/read-all-files-in-a-nested-folder-in-spark – ARCrow Jan 06 '23 at 15:58
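
A hedged sketch of what that last comment describes: gs://my-bucket/data/ is a hypothetical GCS prefix, and recursiveFileLookup (Spark 3.0+) is used here in place of spelling out a wildcard path like the one in the linked answer.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("inspect_nested_csvs").getOrCreate()

# recursiveFileLookup walks arbitrarily deep folders under the (hypothetical) prefix.
df = (spark.read
      .option("header", "true")
      .option("recursiveFileLookup", "true")
      .csv("gs://my-bucket/data/"))

# input_file_name() tags each record with the file it came from.
df = df.withColumn("source_file", input_file_name())

# List the files found, then read each one on its own to show its own columns.
# Files without a header row will show their first data line here instead.
files = [row.source_file for row in df.select("source_file").distinct().collect()]
for path in files:
    cols = spark.read.option("header", "true").csv(path).columns
    print(path, cols)
```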