I want to be able to force Spark to execute my code in the order I want.

In the example below, the foo and bar functions do data manipulation, while send_request is just a web trigger that is unaffected by those functions. When Spark executes the code below, it runs send_request first and foo and bar later. That does not work for me, because by the time foo and bar have completed, my request has timed out. If the request ran after foo, its result would be ready at about the same time bar finishes.

How can I achieve this in Spark? I could have a separate script for each step, but the cluster start-up times add up, so I would rather be able to control the execution order within one job. I am using Databricks on Azure, if that helps.
import os
import base64
import requests
import pyspark

sc.addFile("dbfs:/bar.py")
from bar import bar
sc.addFile("dbfs:/foo.py")
from foo import foo

if __name__ == '__main__':
    foo()
    response = send_request(request=request_json)
    bar()
The contents of foo, bar and send_request are as follows:
from pyspark.sql import functions as F

def foo():
    df = spark.read.parquet(file_1_path)
    df = df.filter(F.col('IDType') == 'E') \
           .select(F.col('col1'), F.col('col2')).distinct()
    df.repartition(10).write.parquet(file_1_new_path)
    logger.info('1 foo is done')
and
from pyspark.sql import functions as F

def bar():
    df = spark.read.parquet(file_2_path)
    df = df.filter(F.col('IDType') == 'M') \
           .select(F.col('col1'), F.col('col2')).distinct()
    df.repartition(10).write.parquet(file_2_new_path)
    logger.info('3 bar is done')
and
def send_request(request):
    # body abbreviated; http_response is the result of a requests call against the web trigger
    response_json = http_response.json()
    logger.info('2 request is sent')
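Written out a bit more fully, send_request does something along these lines (the endpoint URL is just a placeholder here, and the POST is only an illustration, not the real call):

def send_request(request):
    # placeholder endpoint; the real URL and auth details are omitted
    http_response = requests.post('https://example.com/trigger', json=request)
    response_json = http_response.json()
    logger.info('2 request is sent')
    return response_json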
To be clearer: when I run the above code in Spark, the output I get is as follows:
2 request is sent
1 foo is done
3 bar is done
But I want it to be in this order:
1 foo is done
2 request is sent
3 bar is done