
I'm trying to use Spark as part of a full-refresh/oh-shit data replication pipeline to grab data, union it, and stick it in our analytics database. Our source/raw data warehouse is set up on a single-tenant model, with a single client/customer per database, in Azure SQL. The target data warehouse is set up with all customers in a multi-tenant database.

I have a working example that runs in series -- have to redact parts of the code for security reasons but the basic structure is like this:

from functools import reduce

dfs = []
for d in databases:
    # Read this tenant's table from its single-tenant database
    # (connection/format options redacted)
    st_df = spark.read \
        .option('table', [TABLENAME]) \
        .load()

    dfs.append(st_df)

# Combine all single-tenant frames into one multi-tenant frame
mt_df = reduce(lambda df1, df2: df1.unionByName(df2), dfs)

mt_df.write \
    .format([TARGET_DB]) \
    .save()

How do I get this to the point where the `for d in databases` part is parallelized and the queries run concurrently? We need to improve the run time -- there are upwards of 400 databases on the source side.

  • Legs of a `union` should be running in parallel as-is, so I don't think you need to do anything (except for adding more resources maybe :)) – mazaneicha Aug 11 '23 at 16:13
  • From what I can tell the limiting factor is not the `union`, it's loading the data into memory from Azure SQL. – walkrflocka Aug 11 '23 at 17:42
  • My point -- in the execution _plan_, "..`for d in databases` part.." is already parallelized. You need more resources (executor cores) to actually _execute_ it in parallel. – mazaneicha Aug 11 '23 at 18:32

1 Answer


Your code is already structured to parallelize correctly, with one exception: how the source databases are read. As it stands, each tenant will end up as a single partition, so you need a way to break those reads up. Per mazaneicha, the tenants themselves should already be read in parallel; to actually cut the run time you need extra executor cores, and to read each individual tenant in parallel as well.
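On the extra-cores point, a minimal sketch of where that knob lives when building the session, assuming a self-managed deployment (the app name and the instance/core counts are illustrative assumptions, not from the question; on a managed cluster you would size the cluster itself instead):

from pyspark.sql import SparkSession

# Illustrative sizing only -- tune to your cluster and workload.
spark = (SparkSession.builder
    .appName('tenant-replication')             # hypothetical app name
    .config('spark.executor.instances', '20')  # more executors...
    .config('spark.executor.cores', '4')       # ...with more cores each
    .getOrCreate())

With 400 source databases and one partition per tenant, any total core count well below 400 means the union legs queue up rather than run all at once.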

Breaking up each tenant's read can be done via a predicate array, or via the partitionColumn / lowerBound / upperBound (plus numPartitions) JDBC options, as sketched below.
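A minimal sketch of both approaches for a single tenant, assuming a plain Spark JDBC read; jdbc_url, the id and created_date columns, the bounds, and the credentials are illustrative placeholders, not taken from the question:

# Option 1: partitionColumn/lowerBound/upperBound/numPartitions.
# Spark splits the numeric column's range into numPartitions slices
# and issues one query per slice, so a single tenant is read in parallel.
st_df = (spark.read
    .format('jdbc')
    .option('url', jdbc_url)                  # placeholder connection string
    .option('dbtable', '[TABLENAME]')         # same placeholder as above
    .option('partitionColumn', 'id')          # any evenly distributed numeric column
    .option('lowerBound', 1)
    .option('upperBound', 1_000_000)
    .option('numPartitions', 8)
    .load())

# Option 2: an explicit predicate array via spark.read.jdbc.
# Each predicate becomes the WHERE clause of one partition's query.
predicates = [
    "created_date >= '2023-01-01' AND created_date < '2023-04-01'",
    "created_date >= '2023-04-01' AND created_date < '2023-07-01'",
]
st_df = spark.read.jdbc(
    url=jdbc_url,
    table='[TABLENAME]',
    predicates=predicates,
    properties={'user': db_user, 'password': db_password},
)

The predicate route is handy when there is no convenient numeric column; just make sure the predicates are non-overlapping and cover all rows, or you will duplicate or drop data.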

– Chris