
I have a script that scrapes data from a list of URLs. The script runs in a Docker container, and I would like to run it in multiple instances, for example 20. To do that, I wanted to use docker-compose scale worker=20 and pass an INDEX to each instance so that the script knows which URLs it should scrape.

Example.

ID URL
0 https://example.org/sdga2
1 https://example.org/fsdh34
2 https://example.org/fs4h35
3 https://example.org/f1h36
4 https://example.org/fs4h37
...

If there are 3 instances, the 1st instance of the script should process the URLs whose IDs are 0, 3, 6, 9, ..., i.e. ID = INDEX + INSTANCES_NUM * k.
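A minimal sketch of that selection logic, assuming a Python script that receives INDEX and INSTANCES_NUM through environment variables (the file and variable names are just for illustration):

import os

# INDEX and INSTANCES_NUM are assumed to be injected into the container's environment.
index = int(os.environ["INDEX"])
instances_num = int(os.environ["INSTANCES_NUM"])

# Hypothetical input file: one "ID URL" pair per line, no header.
with open("urls.txt") as f:
    rows = [line.split() for line in f if line.strip()]

# Keep only the rows assigned to this instance: ID = INDEX + INSTANCES_NUM * k
my_urls = [url for row_id, url in rows if int(row_id) % instances_num == index]

for url in my_urls:
    print("scraping", url)  # placeholder for the real scraping logic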

I don't know how to pass INDEX to the script running in a Docker container. Of course, I could duplicate the service in docker-compose.yml with a different INDEX in the environment vars, but if the number of instances is greater than 10, or even 50, that becomes a very bad solution.

Does anyone know how to do this?

Daler

2 Answers


With docker-compose, I don't believe there's any support for this. However, with swarm mode, which can use a similar compose file, you can pass {{.Task.Slot}} as an environment variable using service templates. E.g.

version: '3'
services:
  test:
    image: busybox
    command: /bin/sh -c "echo My task number is $$task_id && tail -f /dev/null"
    environment:
      task_id: "{{.Task.Slot}}"
    deploy:
      replicas: 5

Instead of docker-compose up, I deploy with docker stack deploy -c docker-compose.yml test. My local swarm cluster is just a single node created with docker swarm init.

Then, reviewing each of these running containers:

$ docker ps --filter label=com.docker.swarm.service.name=test_test
CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS              PORTS               NAMES
ccd0dbebbcbe        busybox:latest      "/bin/sh -c 'echo My…"   About a minute ago   Up About a minute                       test_test.3.i3jg6qrg09wjmntq1q17690q4
bfaa22fa3342        busybox:latest      "/bin/sh -c 'echo My…"   About a minute ago   Up About a minute                       test_test.5.iur5kg6o3hn5wpmudmbx3gvy1
a372c0ce39a2        busybox:latest      "/bin/sh -c 'echo My…"   About a minute ago   Up About a minute                       test_test.4.rzmhyjnjk00qfs0ljpfyyjz73
0b47d19224f6        busybox:latest      "/bin/sh -c 'echo My…"   About a minute ago   Up About a minute                       test_test.1.tm97lz6dqmhl80dam6bsuvc8j
c968cb5dbb5f        busybox:latest      "/bin/sh -c 'echo My…"   About a minute ago   Up About a minute                       test_test.2.757e8evknx745120ih5lmhk34

$ docker ps --filter label=com.docker.swarm.service.name=test_test -q | xargs -n 1 docker logs
My task number is 3
My task number is 5
My task number is 4
My task number is 1
My task number is 2
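Tying this back to the question: Task.Slot is 1-based, so the scraping script can subtract one to recover the 0-based INDEX. A minimal sketch, assuming the same task_id environment variable as above and an INSTANCES_NUM variable set to the replica count:

import os

# Task.Slot is 1-based, so shift it to get the 0-based INDEX from the question.
index = int(os.environ["task_id"]) - 1
instances_num = int(os.environ["INSTANCES_NUM"])  # assumed to be set to the replica count

# A URL with a given ID then belongs to this instance when ID % INSTANCES_NUM == INDEX.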
BMitch

Why don't you use a database, such as MySQL or Redis?

Each container can fetch URLs from the database, and you can mark fetched URLs as complete, so every container always fetches only the URLs that are not yet completed. This can scale.
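A rough sketch of that idea using a Redis list as a shared work queue (redis-py, the service hostname, and the key names here are assumptions for illustration, not part of the original answer):

import redis

# Hypothetical Redis service reachable as "redis" from docker-compose.yml.
r = redis.Redis(host="redis")

# Each worker atomically pops the next pending URL, so no INDEX is needed at all.
while True:
    url = r.lpop("urls:pending")
    if url is None:
        break  # queue is empty
    url = url.decode()
    print("scraping", url)    # placeholder for the real scraping logic
    r.sadd("urls:done", url)  # mark the URL as completed

Note that, as the comment below points out, a worker that dies after popping a URL leaves it unprocessed, so a real setup would need some retry or re-queue mechanism.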

  • Look at BMitch's solution. – Daler May 08 '19 at 12:44
  • I’d lean towards a dedicated job queue like RabbitMQ, but same idea. – David Maze May 08 '19 at 13:24
  • There are a lot of answers as to "why not", depending on the code itself. This is a relatively generic requirement and can be mapped onto a lot of different problems. The primary issue with this suggestion is that failed jobs make a big mess. You must handle the situation of starting a new task when you're not *certain* whether the old one is really dead or just busy. On the other hand, docker / swarm / kubernetes etc. are all aware when a job is completely gone (the process is dead) and can smoothly handle restart logic without the risk of running two at a time. – Philip Couling Jan 26 '21 at 15:22