
I'm new to Iceberg and Spark. I created an Iceberg table and want to write my historical data into it. The data is a set of large Parquet files (about 500 GB per day; each Parquet file has 100 fields).

Writing these files to Iceberg is very slow. Here is the code:

    spark.read().parquet(path)
            .repartition(500)
            .write().format("iceberg")
            .mode(SaveMode.Append)
            .option("mergeSchema", "true")
            .saveAsTable(table);

    // I also tried the equivalent DataFrameWriterV2 call:
    spark.read().parquet(path)
            .repartition(500)
            .writeTo(table)
            .append();

I found that when I run the code above, no matter how I change the Spark configuration, there is still only one task writing to Iceberg, and it is very slow. These are my spark-submit settings:

    --num-executors 15 \
    --driver-memory 4g \
    --executor-memory 16g \
    --executor-cores 4 \
    --conf spark.memory.fraction=0.6 \
    --conf spark.sql.shuffle.partitions=500 \
    --conf spark.shuffle.io.maxRetries=10 \
    --conf spark.shuffle.io.retryWait=10s \
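
This is how I check that the 500 partitions actually reach the write, and what write settings the table has (a minimal sketch, reusing `spark`, `path`, and `table` from above; Iceberg's `write.distribution-mode` and sort order can add an extra shuffle/sort before the write):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Confirm the repartition takes effect before the write.
    Dataset<Row> df = spark.read().parquet(path).repartition(500);
    System.out.println("partitions before write: " + df.rdd().getNumPartitions());

    // List the table properties; write.distribution-mode and the sort order
    // control whether Iceberg re-shuffles rows at write time.
    spark.sql("SHOW TBLPROPERTIES " + table).show(100, false);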

How can I make the write faster?

Versions: Spark 3.2.0, Iceberg 1.2.1.
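
One thing I am considering, though I have not confirmed it is the cause: if the table's `write.distribution-mode` shuffles all rows of a single daily partition into one task, then either disabling the write-side distribution or enabling fanout writes might restore parallelism. A sketch of both options (the assumption about the cause is mine, not verified):

    // Assumption: the single write task comes from Iceberg clustering rows
    // by partition value (write.distribution-mode=hash) while this backfill
    // only touches one daily partition.

    // Option 1: disable the write-side distribution so the 500 upstream
    // partitions are written in parallel.
    spark.sql("ALTER TABLE " + table
            + " SET TBLPROPERTIES ('write.distribution-mode'='none')");

    // Option 2: keep the distribution but let each task write to multiple
    // table partitions without sorting first (costs more memory per task).
    spark.read().parquet(path)
            .repartition(500)
            .writeTo(table)
            .option("fanout-enabled", "true")
            .append();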
