I am using an Azure Durable Function to orchestrate a scraper of mine. I want to scrape the web in batches of 100 pages at a time in each activity, and have set up the following orchestration:
import ujson
import azure.durable_functions as df


def orchestrator_function(context: df.DurableOrchestrationContext):
    scraper = 'scraper_name'
    url = 'start_url'

    yield context.call_activity('ScrapeActivity', ujson.dumps({
        'spider': scraper,
        'start_index': 1,
        'end_index': 100,
        'url': url,
    }))
    yield context.call_activity('ScrapeActivity', ujson.dumps({
        'spider': scraper,
        'start_index': 100,
        'end_index': 200,
        'url': url,
    }))
    yield context.call_activity('ScrapeActivity', ujson.dumps({
        'spider': scraper,
        'start_index': 200,
        'end_index': 300,
        'url': url,
    }))
This works. However, it is not very practical, and should be done more cleanly. I was thinking of setting it up as below, using a generator to produce the start and end indices and starting an activity for each pair. This does not seem to work: the scrapers do not wait for one another when set up like this, and the start of the next scraper cancels the previous one. To me this seemed like the correct way to use yield from, so I am not really sure what is going wrong. Does someone have an idea?
def start_scraper_in_batches(
    context: df.DurableOrchestrationContext,
    scraper,
    total_pages,
    url
):
    # index_generator yields (start_index, end_index) pairs covering total_pages
    for start_index, end_index in index_generator(int(total_pages)):
        yield context.call_activity('ScrapeActivity', ujson.dumps({
            'spider': scraper,
            'start_index': start_index,
            'end_index': end_index,
            'url': url,
        }))


def orchestrator_function(context: df.DurableOrchestrationContext):
    scraper = 'scraper_name'
    url = 'start_url'
    total_pages = 566

    yield from start_scraper_in_batches(
        context, scraper, total_pages, url)
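
For completeness, index_generator itself is not shown above. A minimal sketch of what it could look like, assuming it yields (start_index, end_index) pairs matching the hard-coded batches in the first version, is:

def index_generator(total_pages, batch_size=100):
    # Hypothetical stand-in for the helper referenced above: yields
    # (start_index, end_index) pairs in batch_size steps, mirroring the
    # hard-coded batches (1, 100), (100, 200), (200, 300), ...
    start, end = 1, batch_size
    while start < total_pages:
        yield start, min(end, total_pages)
        start, end = end, end + batch_size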