
I have a Python scraping program that takes a long time to run. To speed it up, I modified the code so that it can run in parallel on different machines. I also created a Docker image and pushed it to Docker Hub.

I tried to use Airflow and KubernetesPodOperator to create 10 Kubernetes pods to achieve this, but I haven't had any success so far, and the documentation is lacking in this regard. Is there another way I can achieve this? How about GCP, Spark, and Airflow? Or just GCE machines somehow orchestrated by Airflow? Any other options?

  • Hello, have you checked [this documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/jobs#managing_parallelism)? You should be able to run in parallel using the "parallelism" field; see the sketch after these comments. – aemon4 Sep 10 '20 at 14:37
  • @aemon4, I am not able to start the pods themselves; they're getting timed out. I would like to see if there are other options, since I find Kubernetes too complex. – vettipayyan Sep 10 '20 at 15:31
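For reference, here is a minimal sketch of the parallelism approach mentioned above, using the official kubernetes Python client (pip install kubernetes). The job name, image, and namespace are placeholders, not details from the question:

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="scraper"),  # hypothetical job name
    spec=client.V1JobSpec(
        parallelism=10,   # run up to 10 pods at the same time
        completions=10,   # the job is done after 10 pods finish successfully
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="scraper",
                    image="yourname/scraper:latest",  # hypothetical Docker Hub image
                )],
            )
        ),
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

Each pod would still need a way to claim its own slice of the work, typically via a shared work queue, as the linked page describes.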

1 Answer


I suggest you have a look at this thread; jug or Ray seem like easier options.
And here you will find a pretty complete list of parallel processing (cluster computing) solutions.
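As an illustration, here is a minimal jug sketch (pip install jug); scrape_url and the URL list are placeholders standing in for your own scraping code:

from jug import TaskGenerator

@TaskGenerator
def scrape_url(url):
    # Stand-in for the real scraping work.
    return len(url)

urls = ['https://example.com/page/%d' % i for i in range(100)]
results = [scrape_url(u) for u in urls]

You would save this as, say, scraper.py and run "jug execute scraper.py" on each machine; the processes coordinate through a shared store (a common filesystem directory or a Redis backend), which is what makes jug appealing for multi-machine runs.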
Here is a Ray example:

import ray
ray.init()  # with no arguments, this starts Ray locally using all local cores

@ray.remote
def mapping_function(x):
    return x + 1

# Launch 100 remote tasks and block until every result is available.
results = ray.get([mapping_function.remote(i) for i in range(100)])
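The snippet above runs on a single machine by default. To spread the tasks over several machines, a sketch of the usual pattern (assuming you have started a head node with "ray start --head" and joined workers with "ray start --address=<head-ip>:6379") is to connect to the existing cluster instead:

import ray

# Connect to an already-running Ray cluster instead of starting a local one.
ray.init(address="auto")

@ray.remote
def mapping_function(x):
    return x + 1

# The same code as before, but tasks are now scheduled across all machines.
results = ray.get([mapping_function.remote(i) for i in range(100)])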

Or, if you are already using Python multiprocessing, you can scale it to a cluster by importing Pool from ray.util.multiprocessing.pool instead of from multiprocessing.pool.
Check out this post for details.

Example code you could run (Monte Carlo Pi Estimation):

import math
import random
import time

def sample(num_samples):
    # Count how many random points fall inside the unit circle.
    num_inside = 0
    for _ in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if math.hypot(x, y) <= 1:
            num_inside += 1
    return num_inside

def approximate_pi_distributed(num_samples):
    from ray.util.multiprocessing.pool import Pool  # NOTE: Only the import statement is changed.
    pool = Pool()

    start = time.time()
    num_inside = 0
    sample_batch_size = 100000
    # Each pool task processes one batch of samples.
    for result in pool.map(sample, [sample_batch_size for _ in range(num_samples // sample_batch_size)]):
        num_inside += result

    print("pi ~= {}".format((4 * num_inside) / num_samples))
    print("Finished in: {:.2f}s".format(time.time() - start))

if __name__ == "__main__":
    approximate_pi_distributed(10_000_000)

Note: I am trying to achieve the same thing for one of my projects. I haven't tried these yet, but I will soon, and I'll post any interesting updates I have. Don't hesitate to do the same from your side.

Ksign