
I am facing a performance issue with a PySpark UDF that posts data to a REST API (which uses Cosmos DB to store the data).

# The Spark DataFrame (df) contains roughly 30-40k rows.
# I am using a Python UDF to post it to the REST API:

    Ex. final_df = df.withColumn('status', save_data('A', 'B', 'C'))
# udf function:
    import json
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    @udf(returnType=IntegerType())
    def save_data(A, B, C):
        # Build a single-row payload for the API
        post_data = [{'A': A, 'B': B, 'C': C}]

        # Retry the POST up to 5 times on the listed status codes
        retry_strategy = Retry(
            total=5,
            status_forcelist=[400, 500, 502, 503, 504],
            method_whitelist=['POST'],  # renamed to allowed_methods in urllib3 >= 1.26
            backoff_factor=0.1
        )

        adapter = HTTPAdapter(max_retries=retry_strategy)
        s = requests.Session()
        s.mount('https://', adapter)
        s.mount('http://', adapter)
        s.keep_alive = False  # note: Session has no keep_alive attribute, so this has no effect

        headers = {
            'Authorization': 'Bearer ' + api_token,
            'Content-Type': 'application/json'
        }

        try:
            response = s.post(url=rest_api_url, headers=headers, data=json.dumps(post_data))
            return response.status_code
        except Exception:
            # Fallback: retry once with a plain request if the session call fails
            response = requests.post(url=rest_api_url, headers=headers, data=json.dumps(post_data))
            return response.status_code

# Issue: The Databricks job hangs indefinitely at the REST API call (save_data()) and never succeeds.
# When checked from the API end, the service is hitting maximum resource utilization (100%).

To me it looks like the Python UDF is firing requests for a large volume of data in a short time, which overwhelms the API service at some point and it stops responding.

What is the best way to avoid this bulk posting overwhelming the API? Should we split the DataFrame into multiple chunks and send them one by one, or convert it to a pandas DataFrame and send it row by row?


1 Answer


I eventually found the best approach to follow in this case, and that is batch (bulk) loading with asynchronous REST API calls. With a UDF it is always a row-by-row call, and that is what hurts performance here.

We can instead prepare a payload with a batch of rows, e.g. [{'a': '1', 'b': '2'}, {'c': '3', 'd': '4'}], and send the whole batch to the API, which greatly reduces the number of requests (see the sketch below).
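
Here is a minimal sketch of the batching idea, assuming rest_api_url and api_token from the question are defined and reachable from the workers, and using a hypothetical BATCH_SIZE you would tune to what the API can handle. It groups rows inside each partition and posts them in chunks instead of one request per row:

    import json
    import requests

    BATCH_SIZE = 500  # hypothetical batch size; tune to what the API can handle

    def post_partition(rows):
        # Reuse one session (and connection) for all batches in this partition
        session = requests.Session()
        headers = {
            'Authorization': 'Bearer ' + api_token,
            'Content-Type': 'application/json',
        }
        batch = []
        for row in rows:
            batch.append({'A': row['A'], 'B': row['B'], 'C': row['C']})
            if len(batch) >= BATCH_SIZE:
                session.post(rest_api_url, headers=headers, data=json.dumps(batch))
                batch = []
        if batch:  # flush the remaining rows
            session.post(rest_api_url, headers=headers, data=json.dumps(batch))

    df.foreachPartition(post_partition)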

Asynchronous calls make it more robust by not blocking other requests at the same time.
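
And here is a minimal sketch of the asynchronous variant, assuming the aiohttp package is available and that batches is a list of row-dict batches built beforehand (e.g. on the driver); the semaphore limit is a hypothetical cap so the API is not overwhelmed again:

    import asyncio
    import json
    import aiohttp

    MAX_CONCURRENCY = 10  # hypothetical limit on in-flight requests

    async def post_batches(batches):
        headers = {
            'Authorization': 'Bearer ' + api_token,
            'Content-Type': 'application/json',
        }
        semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

        async with aiohttp.ClientSession(headers=headers) as session:

            async def post_one(batch):
                async with semaphore:  # cap the number of concurrent requests
                    async with session.post(rest_api_url, data=json.dumps(batch)) as resp:
                        return resp.status

            return await asyncio.gather(*(post_one(b) for b in batches))

    # batches = [[{'A': ..., 'B': ..., 'C': ...}, ...], ...]
    status_codes = asyncio.run(post_batches(batches))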

Hope this information will help others...Thanks :)