
I have a script that scrapes data from a list of URLs. The script runs in a Docker container, and I would like to run it in multiple instances, for example 20. To do that, I wanted to use docker-compose scale worker=20 and pass an INDEX to each instance so that the script knows which URLs it should scrape.

Example.

ID URL
0 https://example.org/sdga2
1 https://example.org/fsdh34
2 https://example.org/fs4h35
3 https://example.org/f1h36
4 https://example.org/fs4h37
...

If there are 3 instances, the 1st instance of the script should process the URLs whose IDs are 0, 3, 6, 9, ..., i.e. ID = INDEX + INSTANCES_NUM * k.
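A minimal sketch of that selection logic, assuming a Python script that receives INDEX and INSTANCES_NUM through environment variables (the file and variable names are just for illustration):

import os

# INDEX and INSTANCES_NUM are assumed to be injected into the container's environment.
index = int(os.environ["INDEX"])
instances_num = int(os.environ["INSTANCES_NUM"])

# Hypothetical input file: one "ID URL" pair per line, no header.
with open("urls.txt") as f:
    rows = [line.split() for line in f if line.strip()]

# Keep only the rows assigned to this instance: ID = INDEX + INSTANCES_NUM * k
my_urls = [url for row_id, url in rows if int(row_id) % instances_num == index]

for url in my_urls:
    print("scraping", url)  # placeholder for the real scraping logic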

I don't know how to pass INDEX to the script running in a Docker container. Of course, I could duplicate the service in docker-compose.yml with a different INDEX in the environment vars, but if the number of instances is greater than 10, or even 50, that becomes a very bad solution.

Does anyone know how to do this?

Daler

2 Answers


With docker-compose, I don't believe there's any support for this. However, with swarm mode, which can use a similar compose file, you can pass {{.Task.Slot}} as an environment variable using service templates. E.g.

version: '3'
services:
  test:
    image: busybox
    command: /bin/sh -c "echo My task number is $$task_id && tail -f /dev/null"
    environment:
      task_id: "{{.Task.Slot}}"
    deploy:
      replicas: 5

Instead of docker-compose up, I deploy with docker stack deploy -c docker-compose.yml test. My local swarm cluster is just a single node created with docker swarm init.

Then, reviewing each of these running containers:

$ docker ps --filter label=com.docker.swarm.service.name=test_test
CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS              PORTS               NAMES
ccd0dbebbcbe        busybox:latest      "/bin/sh -c 'echo My…"   About a minute ago   Up About a minute                       test_test.3.i3jg6qrg09wjmntq1q17690q4
bfaa22fa3342        busybox:latest      "/bin/sh -c 'echo My…"   About a minute ago   Up About a minute                       test_test.5.iur5kg6o3hn5wpmudmbx3gvy1
a372c0ce39a2        busybox:latest      "/bin/sh -c 'echo My…"   About a minute ago   Up About a minute                       test_test.4.rzmhyjnjk00qfs0ljpfyyjz73
0b47d19224f6        busybox:latest      "/bin/sh -c 'echo My…"   About a minute ago   Up About a minute                       test_test.1.tm97lz6dqmhl80dam6bsuvc8j
c968cb5dbb5f        busybox:latest      "/bin/sh -c 'echo My…"   About a minute ago   Up About a minute                       test_test.2.757e8evknx745120ih5lmhk34

$ docker ps --filter label=com.docker.swarm.service.name=test_test -q | xargs -n 1 docker logs
My task number is 3
My task number is 5
My task number is 4
My task number is 1
My task number is 2
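Tying this back to the question: Task.Slot is 1-based, so the scraping script can subtract one to recover the 0-based INDEX. A minimal sketch, assuming the same task_id environment variable as above and an INSTANCES_NUM variable set to the replica count:

import os

# Task.Slot is 1-based, so shift it to get the 0-based INDEX from the question.
index = int(os.environ["task_id"]) - 1
instances_num = int(os.environ["INSTANCES_NUM"])  # assumed to be set to the replica count

# A URL with a given ID then belongs to this instance when ID % INSTANCES_NUM == INDEX.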
BMitch

Why don't you use a database, such as MySQL or Redis?

Each container can fetch URLs from the database, and you can mark fetched URLs as complete, so every container always fetches only the URLs that are not yet completed. This can scale.
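A rough sketch of that idea using a Redis list as a shared work queue (redis-py, the service hostname, and the key names here are assumptions for illustration, not part of the original answer):

import redis

# Hypothetical Redis service reachable as "redis" from docker-compose.yml.
r = redis.Redis(host="redis")

# Each worker atomically pops the next pending URL, so no INDEX is needed at all.
while True:
    url = r.lpop("urls:pending")
    if url is None:
        break  # queue is empty
    url = url.decode()
    print("scraping", url)    # placeholder for the real scraping logic
    r.sadd("urls:done", url)  # mark the URL as completed

Note that, as the comment below points out, a worker that dies after popping a URL leaves it unprocessed, so a real setup would need some retry or re-queue mechanism.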

  • Look at BMitch's solution. – Daler May 08 '19 at 12:44
  • I’d lean towards a dedicated job queue like RabbitMQ, but same idea. – David Maze May 08 '19 at 13:24
  • There are a lot of answers as to "why not", depending on the code itself. This is a relatively generic requirement and can be mapped onto a lot of different problems. The primary issue with this suggestion is that failed jobs make a big mess. You must handle the situation of starting a new task when you're not *certain* whether the old one is really dead or just busy. On the other hand, docker / swarm / kubernetes etc. are all aware when a job is completely gone (the process is dead) and can smoothly handle restart logic without the risk of running two at a time. – Philip Couling Jan 26 '21 at 15:22