
I want to be able to force Spark to execute my code in the order I want.

In the example below, the foo and bar functions do data manipulation, but send_request is just a web trigger unaffected by those functions. When Spark executes the code below, it runs send_request first and foo and bar later.

That does not work for me because, by the time foo and bar have completed, my request has timed out. If the request ran after foo, its result would be ready at about the same time bar ends. How can I achieve this in Spark?

I could have a separate script for each step, but the cluster start-up times add up, hence I would like to be able to modify the execution order. I am using Databricks on Azure, if that helps.

import os
import base64
import requests
import pyspark

sc.addFile("dbfs:/bar.py")
from bar import bar 

sc.addFile("dbfs:/foo.py")
from foo import foo
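
# send_request and request_json are defined elsewhere in the real script and
# omitted here; only the call order matters for this question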

if __name__ == '__main__':

    foo()
    response = send_request(request=request_json)
    bar()

The contents of foo, bar and send_request are as follows:

def foo():
    df = spark.read.parquet(file_1_path)
    df = df.filter(F.col('IDType') == 'E') \
      .select(F.col('col1'),F.col('col2')).distinct()
    df.repartition(10).write.parquet(file_1_new_path)
    logger.info('1 foo is done') 

and

def bar():
    df = spark.read.parquet(file_2_path)
    df = df.filter(F.col('IDType') == 'M') \
      .select(F.col('col1'),F.col('col2')).distinct()
    df.repartition(10).write.parquet(file_2_new_path)
    logger.info('3 bar is done')

and

def send_request(request):
    # the actual HTTP call (e.g. with requests) that produces http_response is omitted here
    response_json = http_response.json()
    logger.info('2 request is sent')
    return response_json

I will try to be clearer. When I run the code above in Spark, the output I get is as follows:

 2 request is sent
 1 foo is done
 3 bar is done

But I want it to be in this order

 1 foo is done
 2 request is sent
 3 bar is done 
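
To make it concrete, something like the sketch below is the behavior I am after (this is not my actual code, just an illustration of the timing): since send_request is plain Python rather than a Spark job, it could in principle be fired right after foo and its response collected after bar, so the request is in flight while bar runs on the cluster.

from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':

    foo()                                                    # 1: data manipulation finishes first
    with ThreadPoolExecutor(max_workers=1) as pool:
        # 2: the request is only fired once foo has completed
        pending = pool.submit(send_request, request=request_json)
        bar()                                                # 3: Spark job runs while the request is in flight
        response = pending.result()                          # ready by (roughly) the time bar ends

If there is a more idiomatic way to enforce this order in Spark / Databricks, that would work for me as well.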
