Launching parallel network tasks using xargs whilst minimising context switching overhead

Asked Sep 25 '17 at 06:40

Active Sep 25 '17 at 18:07

I want to run 100 networking jobs (not CPU-intensive) in parallel and want to understand the best approach.

Specifically, is it possible to run 100+ jobs using xargs, and what are the drawbacks?

I understand that there is a point where more time is spent on context switching than on actual packet processing. How can I determine where that point is, and what is the best way to minimise the overhead?

For example, are there better tools to use than xargs?

edited Sep 25 '17 at 12:17

asked Sep 25 '17 at 06:40

Darthtrader

Just loop over the file and call the function for each? It's just a loop. – Martijn Pieters Sep 25 '17 at 06:41
@MartijnPieters I've updated the post to make the objective much clearer. The main concern is the viability of launching 100+ network jobs and minimising context switching overhead at the same time. – Darthtrader Sep 25 '17 at 12:23
I kinda feel that this might be too broad, but I reopened anyway. – Martijn Pieters Sep 25 '17 at 12:55

1 Answer

"Better" is often a matter of taste.

Using GNU Parallel you can do something like this to fetch 1000 images, 100 at a time:

seq 1000 | parallel -j100 wget https://foo.bar/image{}.jpg
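
Since the question asks about xargs specifically, here is a rough equivalent as a sketch. It assumes an xargs that supports -P (GNU and BSD xargs do; it is not in POSIX), and it is written as a dry run: echo prints each wget command instead of executing it, so nothing is fetched from the placeholder host foo.bar.

```shell
#!/bin/sh
# -P 100: keep up to 100 processes running at once.
# -I{}:   run one command per input line, substituting the line for {}.
# "echo" makes this a dry run; remove it to actually call wget.
cmds=$(seq 1000 | xargs -P 100 -I{} echo wget "https://foo.bar/image{}.jpg")

# Count the generated commands (their order is not guaranteed with -P).
printf '%s\n' "$cmds" | wc -l
```

One drawback to keep in mind: xargs does not group the output of concurrent jobs, so longer output from different jobs can end up interleaved.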

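If you want data from 100 servers and each server sends complete lines of output, -j0 runs as many jobs in parallel as possible and --line-buffer passes each line through as soon as it is complete: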

parallel -a servers.txt -j0 --line-buffer my_connect {}

Or, with --tag to prefix each output line with the server it came from:

parallel -a servers.txt -j0 --line-buffer --tag my_connect {}

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

[Figure: simple scheduling]

GNU Parallel instead spawns a new process as soon as one finishes, keeping the CPUs busy and thus saving time:

[Figure: GNU Parallel scheduling]
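
The "start the next job as soon as a slot frees up" behaviour can be tried locally with a small sketch. It uses xargs -P instead of GNU Parallel only so that it runs without installing anything; xargs -P also keeps a fixed number of slots filled. The job durations are made up (0.0 to 0.3 seconds, derived from the job number) and stand in for jobs of uneven length.

```shell
#!/bin/sh
# 32 dummy jobs, at most 4 running at any time.
# Each job sleeps for a made-up duration, then reports that it finished.
start=$(date +%s)
finished=$(seq 32 \
  | xargs -P 4 -n 1 sh -c 'sleep "0.$(( $1 % 4 ))"; echo "job $1 done"' _ \
  | wc -l | tr -d ' ')
end=$(date +%s)

echo "$finished jobs finished in about $(( end - start )) s"
```

Because a free slot is refilled immediately, a slot that happens to get short jobs simply processes more of them, rather than sitting idle while the others finish a fixed batch.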

Installation

For security reasons you should install GNU Parallel with your package manager, but if GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

edited Sep 25 '17 at 18:07

answered Sep 25 '17 at 14:49

Ole Tange


Thank you for your response Ole. One question I have for you is whether this works well for long-running jobs too, that is, ones with an undefined lifespan (e.g. a webserver). – Darthtrader Sep 25 '17 at 15:43
You probably need to be more specific. Are you going to run 100 webservers in parallel? – Ole Tange Sep 25 '17 at 15:47
I want to consume data from approximately 100 streams. About 80 of them have 1 event every 2 seconds, and the remaining 20 have about one event per second. The data streams are continuous, though, and never end. – Darthtrader Sep 25 '17 at 16:00
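
For the streams described in the last comment, the aggregate rate works out to 80 × 0.5 + 20 × 1 = 60 events per second. The sketch below puts that number next to a measured context-switch rate. It is Linux-only and rests on one assumption worth stating: it reads the ctxt counter in /proc/stat, which is system-wide and not limited to your own jobs, so a reading taken while the jobs run should be compared against an idle baseline.

```shell
#!/bin/sh
# Aggregate event rate: 80 streams at 1 event per 2 s, 20 streams at 1 event per s.
events_per_sec=$(( 80 / 2 + 20 * 1 ))
echo "expected events/s: $events_per_sec"

# Linux only: the "ctxt" line in /proc/stat counts context switches since boot
# for the whole system. Sampling it twice, one second apart, gives a rate.
if [ -r /proc/stat ]; then
  before=$(awk '/^ctxt/ {print $2}' /proc/stat)
  sleep 1
  after=$(awk '/^ctxt/ {print $2}' /proc/stat)
  echo "context switches/s (whole system): $(( ${after:-0} - ${before:-0} ))"
fi
```

If the rate with all 100 jobs running is not noticeably above the baseline, context switching is unlikely to be the bottleneck for this workload.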