
I have a script for collecting data for different social media hashtags. The script currently makes a series of sequential HTTP requests, formats the data into a pandas DataFrame, and saves it to a CSV. For very popular hashtags, it takes hours to run.

I need to run this program for 1000+ individual hashtags. To save time, I'd like to run many instances concurrently, say, 50-100 instances at a time, each collecting different hashtags.

Assuming I change the CSV portion to use a cloud storage service instead, what else do I need to do to accomplish what I'm describing? If I have a list of all the hashtags I need, how do I set up AWS Lambda or Google Cloud Functions to execute these concurrently, so that 50-100 instances are always running until all the data is collected?

2 Answers


In AWS I would use Step Functions with Dynamic Parallelism to achieve that.

A first Lambda function emits the list of hashtags that you want to crawl.
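A minimal sketch of what that emitter could look like, assuming a Python runtime; the hard-coded list and the `{"hashtag": ...}` item shape are placeholders for illustration:

```python
# Hypothetical first Lambda: emits the hashtags the Map state will fan out over.
def lambda_handler(event, context):
    hashtags = ["#python", "#aws", "#data"]  # replace with your real 1000+ tags
    # Each item becomes the input of one parallel iteration of the Map state.
    return {"hashtags": [{"hashtag": h} for h in hashtags]}
```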

Then a second Lambda is invoked many times in parallel by the Step Functions state machine to process each hashtag.
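A sketch of that per-hashtag worker, assuming your existing crawl logic is dropped into `crawl_hashtag()` and results land in S3 (the bucket name is hypothetical):

```python
# Hypothetical second Lambda: crawls one hashtag and writes the result to S3.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

def crawl_hashtag(hashtag):
    # Placeholder for your existing sequential HTTP-request logic.
    return pd.DataFrame({"hashtag": [hashtag], "posts": [0]})

def lambda_handler(event, context):
    hashtag = event["hashtag"]          # one item from the Map state
    df = crawl_hashtag(hashtag)
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    s3.put_object(
        Bucket="my-hashtag-data",       # assumed bucket name
        Key=f"results/{hashtag.lstrip('#')}.csv",
        Body=buf.getvalue(),
    )
    return {"hashtag": hashtag, "status": "done"}
```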

The configs (e.g. hashtags) are passed around as JSON objects.
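For illustration, a possible state-machine definition in Amazon States Language; the function ARNs are placeholders, and `MaxConcurrency: 50` caps how many crawler Lambdas run at once, matching the 50-100 instances in the question:

```json
{
  "Comment": "Sketch: fan one crawler Lambda out per hashtag, 50 at a time",
  "StartAt": "ListHashtags",
  "States": {
    "ListHashtags": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:list-hashtags",
      "Next": "CrawlHashtags"
    },
    "CrawlHashtags": {
      "Type": "Map",
      "ItemsPath": "$.hashtags",
      "MaxConcurrency": 50,
      "Iterator": {
        "StartAt": "CrawlOne",
        "States": {
          "CrawlOne": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:crawl-hashtag",
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
```

The Map state keeps launching iterations as earlier ones finish, so the concurrency cap stays saturated until the whole list is processed.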

Hope that helps :)

MLu

If your script can take hours to run, I think Cloud Functions (GCP) is not an option for you. A Cloud Function can run for a maximum of 9 minutes (the default timeout is 60 seconds); after that, the function is shut down.

If you want to keep an instance running for hours, as you mentioned, a better option could be Compute Engine or App Engine Standard with basic scaling, which allows up to 24 hours per HTTP request.
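If you go the App Engine route, basic scaling is configured in the service's app.yaml. A minimal sketch; the runtime, service name, and instance cap are assumptions:

```yaml
# Sketch of an app.yaml for App Engine Standard with basic scaling.
runtime: python39
service: hashtag-crawler   # hypothetical service name
basic_scaling:
  max_instances: 100       # cap on concurrent crawler instances
  idle_timeout: 10m        # shut idle instances down after 10 minutes
```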