I'm trying to use Spark as part of a full-refresh/oh-shit data replication pipeline: grab data from every source database, union it, and load it into our analytics database. The source/raw data warehouse lives in Azure SQL on a single-tenant model, one client/customer per database. The target data warehouse holds all customers in a single multi-tenant database.
I have a working example that runs in series. I have to redact parts of the code for security reasons, but the basic structure is like this:
from functools import reduce

dfs = []
for d in databases:
    # connection options for database d redacted
    st_df = spark.read \
        .option('table', [TABLENAME]) \
        .load()
    dfs.append(st_df)

# stack the per-database frames into one multi-tenant frame
mt_df = reduce(lambda df1, df2: df1.unionByName(df2), dfs)

mt_df.write \
    .format([TARGET_DB]) \
    .save()
How do I get this to the point where I can parallelize the for d in databases part and have the source queries run concurrently? We need to improve run time; the number of databases on the source side is upwards of 400.
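For what it's worth, the direction I've been wondering about is a driver-side thread pool: SparkSession can be used from multiple threads, and for JDBC-style sources load() does per-database setup (connect, schema lookup) that currently happens in series. A sketch of that idea, reusing the hypothetical read_one helper from above (the worker count is a guess to tune, not a recommendation):

from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Build the per-database DataFrames from a pool of driver-side threads
# so the per-database setup overlaps instead of running one at a time.
with ThreadPoolExecutor(max_workers=16) as pool:
    dfs = list(pool.map(lambda d: read_one(spark, d, table_name), databases))

mt_df = reduce(lambda df1, df2: df1.unionByName(df2), dfs)

Is that the right approach here, or is there a more idiomatic Spark way to fan out 400 reads?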