
I want to run 100 networking jobs (not CPU-intensive) in parallel and want to understand the best approach.

Specifically, is it possible to run 100+ jobs using xargs, and what are the drawbacks?

I understand that there is a point where more time is spent on context switching than on actual packet processing. How can I find out where that point is, and what is the best way to minimise it?

For example, are there better tools to use than xargs?

Darthtrader

1 Answer

Better will often be a matter of taste.

Using GNU Parallel you can do something like this to fetch 1000 images, 100 at a time:

seq 1000 | parallel -j100 wget https://foo.bar/image{}.jpg

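If you want data from 100 servers and each of them sends complete lines, you can run as many jobs as possible at once (-j0) and pass the output through line by line (--line-buffer):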

parallel -a servers.txt -j0 --line-buffer my_connect {}

Or, with --tag to prefix every output line with the server it came from:

parallel -a servers.txt -j0 --line-buffer --tag my_connect {}
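For comparison, since the question asks about xargs: xargs can also cap the number of simultaneous jobs with -P. The following is a minimal sketch, not a recommendation. The `sleep 0.1` and the `job N done` message are placeholders standing in for a real network-bound command. Unlike the GNU Parallel examples above, xargs does not keep the output of concurrent jobs apart, so longer output from different jobs can end up interleaved.

```shell
# Run 20 placeholder jobs with at most 10 running at the same time.
# "sleep 0.1" is a stand-in for a network-bound command such as wget.
# The job number is passed to the inner shell as $1.
out=$(seq 20 | xargs -P 10 -I{} sh -c 'sleep 0.1; echo "job $1 done"' _ {})

# Every job prints one short line; the order depends on which job finishes first.
printf '%s\n' "$out"
```

With GNU xargs, `-P 0` runs as many processes as possible at a time, which is roughly what `-j0` does in the examples above.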

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

[Figure: simple scheduling]

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

[Figure: GNU Parallel scheduling]
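The difference between the two approaches can be illustrated with a small simulation. This is only a sketch of the idea, not GNU Parallel's actual scheduler: the 32 job durations are made-up numbers, and the script (which needs bash for its arrays) just adds up how long the busiest of 4 workers would take under each strategy.

```shell
#!/usr/bin/env bash
# Hypothetical durations in seconds for 32 jobs: 8 slow ones followed by 24 fast ones.
durations=(9 7 8 9 6 8 7 9 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1)
workers=4
per_worker=$(( ${#durations[@]} / workers ))

# Static split: worker w runs jobs w*8 .. w*8+7 one after another.
static_max=0
for ((w = 0; w < workers; w++)); do
  sum=0
  for ((i = w * per_worker; i < (w + 1) * per_worker; i++)); do
    sum=$(( sum + durations[i] ))
  done
  if (( sum > static_max )); then static_max=$sum; fi
done

# Dynamic: every job starts on whichever worker becomes free first.
busy=(0 0 0 0)
for d in "${durations[@]}"; do
  min=0
  for ((w = 1; w < workers; w++)); do
    if (( busy[w] < busy[min] )); then min=$w; fi
  done
  busy[min]=$(( busy[min] + d ))
done
dynamic_max=0
for b in "${busy[@]}"; do
  if (( b > dynamic_max )); then dynamic_max=$b; fi
done

echo "static split finishes after ${static_max}s, dynamic after ${dynamic_max}s"
```

With these invented durations the static split leaves three workers idle while the first one works through all the slow jobs. When all jobs take about the same time, the two strategies finish at roughly the same point.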

Installation

For security reasons you should install GNU Parallel with your package manager, but if GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Ole Tange
  • Thank you for your response, Ole. One question I have for you: does this work well for long-running jobs too, that is, ones with an undefined lifespan (e.g. a webserver)? – Darthtrader Sep 25 '17 at 15:43
  • You probably need to be more specific. Are you going to run 100 webservers in parallel? – Ole Tange Sep 25 '17 at 15:47
  • I want to consume data from approximately 100 streams. About 80 of them have 1 event every 2 seconds, and the remaining 20 have about 1 event per second. The data stream is continuous, though, and never ends. – Darthtrader Sep 25 '17 at 16:00