
I have a school project for a class on web mining where I need to collect a lot of data from certain social media sites. I need data from a large number of individual hashtags on the site. I have a Python script that successfully grabs all the data I need for a single hashtag by making sequential HTTP requests until it captures all the records for the specified time range, and saves them to a large CSV file. I need to run this program a couple thousand times for different hashtags. For some very popular hashtags, the program takes a few hours to run, though many of the hashtags will be much faster. I wrote a bash script that runs the Python program for each hashtag sequentially, but collecting everything that way will take a very long time.

I want to use a cloud computing service such as Google Compute Engine, AWS, or Azure to run multiple instances of this program in parallel, so I can collect the data for many hashtags at once. Perhaps I could have a large number of cloud machines, each running the program for a different hashtag at the same time. The only goal is to collect all the data I need faster.

I'm not very experienced with cloud computing, apart from a few times I've used Google Compute Engine for simple programs I only needed to run once. I tried reading about instance groups, but I'm still not sure how I would use them for this purpose. I'm even less familiar with the AWS and Azure offerings.

What's the best way to go about this?

Mark Henderson
  • Voting to close: Requests for product, service, or learning material recommendations are off-topic because they attract low quality, opinionated and spam answers, and the answers become obsolete quickly. Instead, describe the business problem you are working on, the research you have done, and the steps taken so far to solve it. – TomTom Jun 03 '20 at 21:54
  • I’m not really asking for a recommendation; I’m just trying to find out what types of cloud computing services best suit this problem. – Conor James Thomas Warford Hen Jun 03 '20 at 23:53
  • So, you do ask for a recommendation, you just try to weasel around the definition. No recommendation, just tell me what is best... – TomTom Jun 04 '20 at 00:10
  • I mean, it’s not like I’m asking for your recommendation between AWS, Compute Engine, and Azure. I have a specific task and I’m asking what specific service will allow me to do what I need to do, because I’m not very familiar with them. – Conor James Thomas Warford Hen Jun 04 '20 at 00:11

1 Answer


Without knowing more about your exact script, you probably want something that can run lambda-style (serverless) functions.

No VMs to worry about, you pay per second per gigabyte of memory, and once you're done there is no infrastructure to remember to tear down. It will just run your script in its own environment and tear that environment down when it's done.

Might get a bit more pricey for the longer running scripts, but should be very cheap for the fast ones.
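
As a rough sketch of the idea in Python, assuming your script can be wrapped in a function that takes one hashtag: the `handler(event, context)` shape below is the one AWS Lambda uses for Python functions, while `collect_hashtag`, `fan_out`, the `"hashtag"` event key, and the row fields are hypothetical names made up for illustration. `fan_out` only simulates "one invocation per hashtag" locally with threads; on a real platform you would trigger one function invocation per hashtag instead.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def collect_hashtag(hashtag):
    """Hypothetical stand-in for the existing scraper: returns rows for one hashtag."""
    # The real script would make its sequential HTTP requests here.
    return [{"hashtag": hashtag, "record_id": i} for i in range(3)]


def handler(event, context=None):
    """Entry point in the (event, context) shape used by AWS Lambda for Python."""
    hashtag = event["hashtag"]  # assumed event field, one hashtag per invocation
    rows = collect_hashtag(hashtag)

    # Build the CSV in memory; a deployed function would typically upload it
    # to object storage rather than rely on local disk.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["hashtag", "record_id"])
    writer.writeheader()
    writer.writerows(rows)
    return {"hashtag": hashtag, "row_count": len(rows), "csv": buffer.getvalue()}


def fan_out(hashtags, max_workers=8):
    """Local simulation only: run one handler call per hashtag concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input hashtags.
        return list(pool.map(lambda tag: handler({"hashtag": tag}), hashtags))


if __name__ == "__main__":
    for result in fan_out(["python", "datascience", "webmining"]):
        print(result["hashtag"], result["row_count"])
```

One caveat: serverless platforms cap how long a single invocation may run (AWS Lambda's limit is 15 minutes at the time of writing), so the hashtags that take hours would likely need to be split into smaller time ranges, or run on a regular VM or container instead.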

Mark Henderson