Create 1GB partitions Spark SQL

Question

I'm trying to split my data in 1GB when writing in S3 using spark. The approach I tried was to calculate the size of the DeltaTable in GB (the define_coalesce function), round, and using that number to write in S3:

# Vaccum to leave 1 week of history

deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")

deltaTable.vacuum(168)

deltaTable.generate("symlink_format_manifest")

# Reading delta table and rewriting with coalesce to reach 1GB per file

df = spark.read.format('delta').load(f"s3a://{delta_table}")

coalesce_number = define_coalesce(delta_table) < this function calculates the size of the delta in GB

df.coalesce(coalesce_number).write.format("delta").mode('overwrite').option('overwriteSchema', 'true').save(f"s3a://{delta_table}")

deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")

deltaTable.generate("symlink_format_manifest")

I'm trying this way cause our Delta is the opensource one and we don't have the optimize method built in.

I did some searching and found the spark.sql.files.maxPartitionBytes configuration in Spark, but some people said that it was not solving their problems, and that this config partitions when reading and not writing.

Any suggestions?

Are you able to upgrade your Delta Lake version? 2.0.0 contains the optimise feature . — Robert Kossendey, Sep 20 '22 at 13:43
I tried upgrading, but the OSS version doesn't have the optimize feature... — Michel Senra, Sep 20 '22 at 13:50
It does since 2.0.0. Databricks has open sourced Delta Lake completely. — Robert Kossendey, Sep 20 '22 at 14:17
Well, I installed in my cluster in jupyter but i couldn't run the tests because it was saying that the DeltaTable didn't have the optimize method... Probably it was missing some configs. The thing is, same testing and aproving the optimize method, our project and sponsors have a huge aversion when it comes to changing versions in the live infra. So, I think that i'll probably have to keep trying doing this way. — Michel Senra, Sep 20 '22 at 14:22

score 0 · Answer 1 · answered Sep 20 '22 at 17:57

I understand your problem, and what you are trying to do but i am not sure what is the output of your current solution. If partitions are still not equal to 1 gb you may try to replace coalesce with repartition. Coalesce does not guarantee that after this operation partitions are equal so your formula may not work. If you know how many partition you need on output use repartition(coalesce_number) and it should create equal partitions with round robin

If the problem is with function which is calculating dataset size (so number of partitions) i know two solutions:

You can cache dataset and then take its size from statistics. Of course this may be problematic and you have to spend some resource to due that. Something similar is done here in first answer: How spark get the size of a dataframe for broadcast?
You can calculate count and divide it by number of records you want to have in single partition. Size of single record depends on your schema, it may be tricky to estimate it but it is viable option to try

Thank you a lot for the answer and the possible solutions. That's exactly what I need. My difficulty was to calculate the size of the DeltaTable as the version we use we don't have the "detail" method. Since we do a lot of writing in the s3, there are a lot of historical files of delta there. Same doing 'vacuum' to keep only one week of history, the leftover files add to the size count and makes it really far from the expected result. — Michel Senra, Sep 21 '22 at 02:46

score 0 · Answer 2 · answered Sep 21 '22 at 13:34

Finally solved my problem. Since we are using Delta, I had the idea of trying to read the manifest files to find all the parquet names. After that, I get the sum of the list of parquets on manifest connecting in S3 with boto3:

def define_repartition(delta_table_path):
    conn = S3Connection()
    bk = conn.get_bucket(bucket)
    manifest = spark.read.text(f's3a://{delta_table_path}_symlink_format_manifest/manifest')
    parquets = [data[0].replace(f's3a://{bucket}/','') for data in manifest.select('value').collect()]
    size = 0
    for parquet in parquets:
        key = bk.lookup(parquet)
        size = size + key.size

    return round(size/1073741824)

Thank you all for the help.Regards from Brazil. :)

Create 1GB partitions Spark SQL

2 Answers2