We are running PySpark on an EMR cluster and have ~50 million records in a DataFrame. Each needs a field added to it from an API that accepts 100 records at a time (so ~500k total requests). We are able to split the records up and make the API calls successfully; however, we occasionally get rate limited. When that happens, the process keeps sending requests, all of which return the same rate-limited response. So when this happens, we want to completely stop all requests from all slave nodes and kill the job.

We have scaled back our cluster size to help avoid this issue, but since response times are not always consistent, we need a way to bail out and stop sending requests once we're already rate limited.

We are using mapPartitions() on the DataFrame's underlying RDD and calling the API from within the partition function; a simplified sketch is below.
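
For concreteness, here is a stripped-down version of the partition function. The endpoint URL, the payload shape, and the `api_field` name are placeholders, not our real API:

```python
import requests

BATCH_SIZE = 100
API_URL = "https://api.example.com/enrich"  # placeholder endpoint

def enrich_partition(rows):
    """Group this partition's rows into batches of 100 and call the API."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            yield from call_api(batch)
            batch = []
    if batch:  # trailing partial batch
        yield from call_api(batch)

def call_api(batch):
    resp = requests.post(API_URL, json=[r.asDict() for r in batch])
    resp.raise_for_status()
    # Attach the API's value to each record (field name is illustrative)
    for row, extra in zip(batch, resp.json()):
        yield {**row.asDict(), "api_field": extra}

enriched = df.rdd.mapPartitions(enrich_partition)
```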

I am looking for a way, from within the function passed to mapPartitions(), to stop all processes on all slave nodes, so that as soon as we first notice we're being rate limited, all API calls stop.
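
To illustrate the gap: a naive fail-fast sketch like the one below (RateLimitedError is a made-up name, and it assumes task retries are disabled with `--conf spark.task.maxFailures=1` so a single task failure aborts the job) does eventually kill the job, but each executor keeps sending requests until its own task fails or gets cancelled:

```python
import requests

API_URL = "https://api.example.com/enrich"  # placeholder, as above

class RateLimitedError(Exception):
    """Made-up exception type; any uncaught exception fails the task."""

def call_api(batch):
    resp = requests.post(API_URL, json=[r.asDict() for r in batch])
    if resp.status_code == 429:
        # With spark.task.maxFailures=1, one failed task aborts the
        # whole job instead of being retried on another executor.
        raise RateLimitedError("got 429 from the API; failing the task")
    resp.raise_for_status()
    for row, extra in zip(batch, resp.json()):
        yield {**row.asDict(), "api_field": extra}
```

Even then, the other executors only stop once Spark cancels their tasks, so a burst of doomed requests can still go out in the meantime; that is the window we want to eliminate.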
