I am facing a performance issue with a PySpark UDF that posts data to a REST API (the API uses Cosmos DB to store the data).
# The Spark DataFrame (df) contains roughly 30-40k rows.
# I am using a Python UDF to post it over the REST API:
Ex. final_df = df.withColumn('status', save_data('A', 'B', 'C'))
# udf function:
import json
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def save_data(A, B, C):
    # Build a single-element payload for this row
    post_data = [{
        'A': A,
        'B': B,
        'C': C,
    }]

    # Retry failed POSTs up to 5 times with a small backoff
    retry_strategy = Retry(
        total=5,
        status_forcelist=[400, 500, 502, 503, 504],
        method_whitelist=['POST'],
        backoff_factor=0.1
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    s = requests.Session()
    s.mount('https://', adapter)
    s.mount('http://', adapter)
    s.keep_alive = False

    try:
        response = s.post(
            url=rest_api_url,
            headers={'Authorization': 'Bearer ' + api_token,
                     'Content-Type': 'application/json'},
            data=json.dumps(post_data)
        )
        return response.status_code
    except Exception:
        # Fall back to a plain POST without the retry-enabled session
        response = requests.post(
            url=rest_api_url,
            headers={'Authorization': 'Bearer ' + api_token,
                     'Content-Type': 'application/json'},
            data=json.dumps(post_data)
        )
        return response.status_code
# Issue: the Databricks job hangs indefinitely at the REST API call (save_data()) and never succeeds.
# When checked from the API end, the service is hitting maximum resource utilization (100%).
To me it looks like the Python UDF is firing requests for all rows at once, which overwhelms the API service at some point and it stops responding.
What is the best way to overcome this bulk posting of data? Should we split the DataFrame into multiple chunks and send them out one by one, or convert it to a pandas DataFrame and send the rows out one by one? (A rough sketch of the chunking idea is included below.)
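To make the first option concrete, this is a minimal sketch of chunked posting via mapPartitions, reusing one session per partition and posting rows in batches instead of one request per row. The batch size of 500 is an assumption to be tuned, and rest_api_url / api_token are assumed to be plain driver-side variables that get captured in the closure; this is not tested against the actual job.

import json
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

BATCH_SIZE = 500  # assumption: tune to what the API can absorb


def post_partition(rows):
    # Runs once per partition: build one retry-enabled session and reuse it
    retry_strategy = Retry(
        total=5,
        status_forcelist=[500, 502, 503, 504],
        method_whitelist=['POST'],  # allowed_methods on newer urllib3
        backoff_factor=0.1
    )
    session = requests.Session()
    session.mount('https://', HTTPAdapter(max_retries=retry_strategy))
    session.mount('http://', HTTPAdapter(max_retries=retry_strategy))

    def post_batch(batch):
        response = session.post(
            url=rest_api_url,
            headers={'Authorization': 'Bearer ' + api_token,
                     'Content-Type': 'application/json'},
            data=json.dumps(batch)
        )
        return response.status_code

    batch = []
    for row in rows:
        batch.append({'A': row['A'], 'B': row['B'], 'C': row['C']})
        if len(batch) == BATCH_SIZE:
            yield post_batch(batch)
            batch = []
    if batch:
        yield post_batch(batch)


# One status code per batch instead of one per row
statuses = df.rdd.mapPartitions(post_partition).collect()

This keeps the payload-building and retry logic from the UDF, but the number of HTTP calls drops from one per row to one per batch, which is the kind of load reduction I am after.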