
For example, I have over 300 files in a nested folder and I have to combine all of them using PySpark or Python pandas.

File1: Date, channel, spend, clicks
File2: date, channel, clicks, spend
File3: no header
File4: some extra columns in addition to the mandatory ones
Etc.

I am expecting a single output file combining all the files in the folder, despite their different structures.

1 Answer


You can enforce a schema object to take care of files with no headers and unify the structure using spark.read.schema(SchemaObject).csv(FilesPath).
You can use coalesce(1) before writing to fit all records into one file: df.coalesce(1).write.csv(DestinationPath)
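
A minimal sketch of that approach, assuming the mandatory columns are date, channel, spend, and clicks, and using gs://my-bucket/... as hypothetical input and output paths:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, IntegerType, DateType)

spark = SparkSession.builder.appName("combine_csvs").getOrCreate()

# One schema enforced on every file, so headerless files still get named, typed columns.
schema = StructType([
    StructField("date", DateType(), True),
    StructField("channel", StringType(), True),
    StructField("spend", DoubleType(), True),
    StructField("clicks", IntegerType(), True),
])

# Wildcards in the path cover the nested folder layout (hypothetical bucket/prefix).
df = spark.read.schema(schema).csv("gs://my-bucket/data/*/*.csv")

# Files that do contain a header row produce one record whose date fails the cast
# and comes back null under the enforced schema; filter those stray lines out
# (note this also drops data rows whose date is empty).
df = df.filter(F.col("date").isNotNull())

# coalesce(1) merges all partitions so the output folder holds a single part file.
(df.coalesce(1)
   .write.mode("overwrite")
   .option("header", "true")
   .csv("gs://my-bucket/output/combined"))
```

Note that an enforced schema reads CSV columns positionally, so files whose columns are ordered differently or carry extras may need their own schema or a separate read followed by a union.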

ARCrow
  • Thank you, but how do I loop through the files, reading from GCP folders, and display the columns in each file? – Anvaith O9999 Jan 06 '23 at 05:36
  • You can use regular expressions in your file read path. Check this out: https://stackoverflow.com/questions/32233575/read-all-files-in-a-nested-folder-in-spark – ARCrow Jan 06 '23 at 15:58
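
A hedged sketch of what that last comment describes: gs://my-bucket/data/ is a hypothetical GCS prefix, and recursiveFileLookup (Spark 3.0+) is used here in place of spelling out a wildcard path like the one in the linked answer.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("inspect_nested_csvs").getOrCreate()

# recursiveFileLookup walks arbitrarily deep folders under the (hypothetical) prefix.
df = (spark.read
      .option("header", "true")
      .option("recursiveFileLookup", "true")
      .csv("gs://my-bucket/data/"))

# input_file_name() tags each record with the file it came from.
df = df.withColumn("source_file", input_file_name())

# List the files found, then read each one on its own to show its own columns.
# Files without a header row will show their first data line here instead.
files = [row.source_file for row in df.select("source_file").distinct().collect()]
for path in files:
    cols = spark.read.option("header", "true").csv(path).columns
    print(path, cols)
```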