
Originally this was a small project: I wrote a Selenium program in Python to perform a small task on 150 accounts. It ran on one computer and took about 5 hours. Now, however, I am looking to scale to 1000 accounts. The task needs to run once per day, and at that rate it would take about 30-35 hours on one machine, so a single computer is not an option. I want to use more than one machine, with the option to easily scale to three, four, or more.

I have moved the data for all the accounts into an Amazon cloud database and can easily connect to it from my Python program. However, as I mentioned earlier, I want this project to be easily scalable. I do not want to hardcode values, i.e. have one computer do accounts 1-500 and the other do 501-1000. (What if I added 500 more accounts and 2 machines? I would want each machine to do 1500/4.) I'm thinking of a master-slave approach, where each machine runs a program that can be called with an array of accounts, and a master program on my machine sends out a command once every 24 hours with the accounts each machine is supposed to process.
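To make the "1500/4" idea concrete, here is a minimal sketch of splitting a list of account IDs evenly across however many machines are available. The function name `split_accounts` and the use of plain integer IDs are assumptions for illustration, not part of the existing program.

```python
def split_accounts(account_ids, machine_count):
    """Split account_ids into machine_count contiguous, nearly equal chunks."""
    if machine_count < 1:
        raise ValueError("machine_count must be at least 1")
    # base = minimum chunk size; the first `extra` chunks get one more item
    base, extra = divmod(len(account_ids), machine_count)
    chunks = []
    start = 0
    for i in range(machine_count):
        size = base + (1 if i < extra else 0)
        chunks.append(account_ids[start:start + size])
        start += size
    return chunks


# Hypothetical example: 1500 accounts spread over 4 machines
chunks = split_accounts(list(range(1, 1501)), 4)
print([len(c) for c in chunks])
```

Nothing is hardcoded here: changing the number of accounts or machines only changes the arguments.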

Then I want each slave to return its data to me; once every slave has finished, the master program would combine the data and update the table accordingly. Alternatively, each slave could update the table independently, but I am not sure this is possible because of table locks (if anyone could comment on this, it would be helpful as well).
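For the first option, the combining step on the master might look roughly like the sketch below. It assumes each slave returns a dictionary mapping account ID to its result; that shape and the name `merge_results` are hypothetical.

```python
def merge_results(per_worker_results):
    """Combine {worker_name: {account_id: result}} into one {account_id: result}."""
    merged = {}
    for worker, results in per_worker_results.items():
        for account_id, value in results.items():
            # Each account should be handled by exactly one worker
            if account_id in merged:
                raise ValueError(
                    f"account {account_id} was reported by more than one worker"
                )
            merged[account_id] = value
    return merged


# Hypothetical results returned by two slaves
combined = merge_results({
    "machine-a": {1: "ok", 2: "ok"},
    "machine-b": {3: "failed"},
})
print(combined)
```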

Thanks for reading!

Edit: If you think this is too broad, I'm not looking for an exact answer, just trying to find someone who has done something like this before. Even naming a technology or method that I can research would help me a lot.

k9b
  • Might be worth checking out [CoreOS](https://coreos.com/). It's trivial to allocate another container with Fleet, and etcd could be used to keep global info like where's the database, who's the master (if you go with the master/slave approach). Makes maintaining/managing a distributed system a heck of a lot easier. – willnx Feb 27 '16 at 23:32
  • Thank you very much! I'll definitely start researching that now – k9b Feb 27 '16 at 23:36

1 Answer


I've done a similar thing before and ended up using a master-slave design.

I had a master with the database of "jobs" and the slaves queried it to get their tasks.

In my case the process was something like this:

  1. Slave queries the master for jobs
  2. Master sends 50 jobs and updates their status in the DB with the slave's name
  3. Slave finishes the jobs and tells the master
  4. Master changes the status in the DB to complete and sends new tasks
  5. Repeat until the whole queue is completed
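The steps above can be sketched roughly as follows. This is not my original code: it is a minimal illustration that uses an in-memory SQLite database as a stand-in for the master's job database, and the table layout, status names, and function names are all assumptions. With several slaves hitting a real database at once, the claim step would also need whatever row-locking mechanism that database provides.

```python
import sqlite3


def create_queue(conn, account_ids):
    """Create a hypothetical jobs table with one pending job per account."""
    conn.execute(
        "CREATE TABLE jobs (account_id INTEGER PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'pending', worker TEXT)"
    )
    conn.executemany(
        "INSERT INTO jobs (account_id) VALUES (?)", [(a,) for a in account_ids]
    )
    conn.commit()


def claim_jobs(conn, worker, batch_size=50):
    """Steps 1-2: hand out up to batch_size pending jobs and record the worker."""
    with conn:  # commits on success, rolls back on error
        rows = conn.execute(
            "SELECT account_id FROM jobs WHERE status = 'pending' "
            "ORDER BY account_id LIMIT ?",
            (batch_size,),
        ).fetchall()
        ids = [row[0] for row in rows]
        conn.executemany(
            "UPDATE jobs SET status = 'in_progress', worker = ? "
            "WHERE account_id = ?",
            [(worker, i) for i in ids],
        )
    return ids


def complete_jobs(conn, worker, ids):
    """Steps 3-4: mark the jobs a worker reports as finished."""
    with conn:
        conn.executemany(
            "UPDATE jobs SET status = 'complete' "
            "WHERE account_id = ? AND worker = ?",
            [(i, worker) for i in ids],
        )


def unfinished_count(conn):
    """Number of jobs that are not yet complete."""
    return conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE status != 'complete'"
    ).fetchone()[0]


# Step 5: repeat until the queue is empty (two simulated workers, 120 jobs)
conn = sqlite3.connect(":memory:")
create_queue(conn, range(1, 121))
while unfinished_count(conn) > 0:
    for worker in ("slave-1", "slave-2"):
        batch = claim_jobs(conn, worker)
        # ... the real Selenium work for each account would happen here ...
        complete_jobs(conn, worker, batch)
```

Because slaves pull work in batches rather than being assigned a fixed range up front, a faster machine simply comes back for more jobs sooner.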

This way I could add more slaves as the job queue grew, and the slaves did not need to perform equally. Some of my slaves completed three times as many jobs as the slowest ones, depending on internet connection and page loading times.

Mr. H