
I have a large number of CSV files (around 20,000), and most of these files have around 4,000 columns; about 10% of the files have slightly more or fewer columns. I want to load these files from S3 into Spark, infer the schema from the CSV files, and merge the schemas to handle the files with mixed schemas. Then I want to write the result back to S3 as Parquet after reducing the number of partitions.

    val df = spark.read.format("csv")
      .option("header", "true")
      .option("mergeSchema", "true")
      .option("inferSchema", "true")
      .load(<s3-in-path>)
    df.coalesce(2).write.mode("overwrite").parquet(<s3-out-path>)

But this is taking a few hours to complete, even after I threw 100 CPU cores at it.

Any leads on how to deal with this sort of data?

Sudheer Palyam

1 Answer

  • Spark CSV schema inference is a full read of every file. Don't do that: work out the schemas in advance (see the sketch after this list).
  • That inference probably happens during the partitioning phase of the query, rather than the parallelized part.
  • S3 throttles I/O on object GET/HEAD calls, and you are issuing a lot of them here.
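
A minimal sketch of the first point, assuming a spark-shell session where `spark` is the SparkSession; `<s3-sample-path>` is a hypothetical prefix covering a handful of representative files, and the other placeholders mirror the ones in the question:

    // Infer the schema once from a small sample of files, then reuse it for the
    // full load so Spark never has to re-read all 20k CSVs just to infer types.
    import org.apache.spark.sql.types.StructType

    // 1. Infer on a few representative files (cheap compared to the full dataset).
    val sampleSchema: StructType = spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(<s3-sample-path>)
      .schema

    // 2. Load everything with the explicit schema: no inference pass over S3.
    val df = spark.read
      .format("csv")
      .option("header", "true")
      .schema(sampleSchema)
      .load(<s3-in-path>)

    df.coalesce(2).write.mode("overwrite").parquet(<s3-out-path>)

Since roughly 10% of your files have a different column set, the sample would need to include some of those variants, or you'd have to merge the sampled schemas yourself before doing the full load.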

If this is a one-off job, and if the dataset is under a terabyte or two, have you considered just downloading it locally and doing it that way? That's one single read of all the source files, and then you can play with Spark CSV loading for as long as you like without worrying about throttling or AWS charges.

stevel