I have a large number of CSV files (around 20,000) on S3, and most of them have around 4,000 columns; roughly 10% have slightly fewer or more columns. I want to load these files into Spark, infer the schema from the CSV files, and merge the schemas to handle the mixed-schema files, then write the result back to S3 as Parquet after reducing the number of partitions.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("mergeSchema", "true")
  .option("inferSchema", "true")
  .load(<s3-in-path>)
df.coalesce(2).write.mode("overwrite").parquet(<s3-out-path>)
But this takes a few hours to complete, even when I threw 100 CPU cores at it.
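One variant I have been considering, but have not benchmarked, is to infer the schema once from a small sample of the files and pass it explicitly to the full read, so inference does not have to scan all 20,000 files; the sample glob below is an illustrative placeholder (same style as the paths above), and a sampled schema could obviously miss columns that only appear in unsampled files:

// Infer the schema from a small sample of the input files
// (<s3-in-sample-glob> is a placeholder for a glob matching a few files)
val sampleSchema = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(<s3-in-sample-glob>)
  .schema

// Reuse that schema for the full read, so Spark skips the inference pass over every file
val fullDf = spark.read
  .format("csv")
  .option("header", "true")
  .schema(sampleSchema)
  .load(<s3-in-path>)

fullDf.coalesce(2).write.mode("overwrite").parquet(<s3-out-path>)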
Any leads on how to deal with this sort of data?