
How it is now

I currently have a script running under Windows that frequently retrieves recursive file listings from a list of servers.

I use an AutoIt (job manager) script to run 30 parallel instances of lftp (still on Windows), each doing this:

lftp -e "find .; exit" <serveraddr>

The file used as input for the job manager is a plain text file and each line is formatted like this:

<serveraddr>|...

where "..." is unimportant data. I need to run multiple instances of lftp to achieve maximum throughput, because single-instance performance is limited by the response time of the server.

Each lftp.exe instance redirects its output to a file named

<serveraddr>.txt

How it needs to be

Now I need to port this whole thing over to a dedicated Linux server (Ubuntu, with lftp installed). From my previous, very(!) limited experience with Linux, I guess this will be quite simple.

What do I need to write, and with what? For example, do I still need a job manager script, or can this be done in a single script? How do I read from the file (I guess this will be the easy part), and how do I keep a maximum of 30 instances running (maybe even with a timeout, because extremely unresponsive servers can clog the queue)?

Thanks!

turbo

1 Answer


Parallel processing

I'd use GNU parallel. It isn't installed by default, but it's available from the default package repositories of most Linux distributions. It works like this:
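On Ubuntu (which you mentioned), installing it should be a one-liner:

```shell
# GNU parallel is in the default Debian/Ubuntu repositories:
sudo apt-get install parallel
```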

parallel echo ::: arg1 arg2

will execute echo arg1 and echo arg2 in parallel.

So the easiest approach is to create a script that synchronizes a single server - in bash/perl/python, whatever suits your fancy - and execute it like this:

parallel ./script ::: server1 server2

The script could look like this:

#!/bin/sh
#$0 holds program name, $1 holds first argument.
#$1 will get passed from GNU/parallel. we save it to a variable.
server="$1"
lftp -e "find .; exit" "$server" >"$server-files.txt"
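You also asked about a timeout for unresponsive servers. A sketch using coreutils' timeout (the 300-second limit is an arbitrary assumption - adjust it to your servers):

```shell
#!/bin/sh
# Same script as above, but abort any server that takes too long.
# timeout sends SIGTERM after the limit and exits with status 124,
# so a clogged slot is freed for the next server in the queue.
server="$1"
timeout 300 lftp -e "find .; exit" "$server" >"$server-files.txt"
```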

lftp seems to be available for Linux as well, so you don't need to change the FTP client.

To run at most 30 instances at a time, pass -j30, like this: parallel -j30 echo ::: 1 2 3

Reading the file list

Now, how do you transform the specification file containing <server>|... entries into GNU parallel arguments? Easy - first, filter the file so it contains just host names:

sed 's/|.*$//' server-list.txt

sed replaces text using regular expressions (among other things). This command strips everything (.*) after the first | up to the end of the line ($). (While | normally acts as the alternation operator in regular expressions, in sed's default syntax it needs to be escaped to work that way; unescaped, it matches a literal |.)
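For a quick sanity check, with a made-up entry in your format:

```shell
# Hypothetical line in the <serveraddr>|... format:
echo 'ftp.example.com|unimportant data' | sed 's/|.*$//'
# prints: ftp.example.com
```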

So now you have a list of servers. How do you pass them to your script? With xargs! xargs appends each line of its input as an additional argument to the given command. For example

echo -e "1\n2"|xargs echo fixed_argument

will run

echo fixed_argument 1 2

So in your case you should do

sed 's/|.*$//' server-list.txt | xargs parallel -j30 ./script :::
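As an aside, GNU parallel can also read its arguments from standard input (one per line), much like xargs, so the pipeline can arguably be shortened to:

```shell
# Without ::: arguments, parallel takes one job argument per stdin line:
sed 's/|.*$//' server-list.txt | parallel -j30 ./script
```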

Caveats

Be sure not to save the results to the same file in each parallel task, otherwise the file will get corrupted - the standard tools are simple and don't implement any locking unless you add it yourself. That's why the script redirects its output to $server-files.txt rather than a shared files.txt.

rr-
  • Thank you for this elaborate answer! I've created a script, `job.sh` with the content you provided. When executing with `sh` it works (example: `sh ./job.sh ftp.microsoft.com`). However, executing the whole thing, `sed s/|.*$//' end_unique.txt | xargs parallel -j30 ./job.sh :::`, just prints `>` and idles. No tasks are being created. Does it matter that end_unique's lines terminate with CRLF? – turbo Jun 13 '15 at 18:09
  • I missed apostrophe in the final example. If an apostrophe is missing, bash understands the command you typed isn't yet finished and prompts you to type the rest using `>` prompt. I updated the example that should work. – rr- Jun 13 '15 at 19:08
  • Stupid me, should have spotted that. So, now it runs, but xargs doesn't seem to parse the argument the way I want. Executing `sed 's/|.*$//' end_unique.txt | xargs parallel -j30 ./job.sh :::` now just throws an error from bash, which looks like: `/bin/bash: : command not found.`. – turbo Jun 13 '15 at 19:22
  • Hmm. Does `sed 's/|.*$//' end_unique.txt` return what you want, meaning, only entries such as `ftp.microsoft.com`? – rr- Jun 13 '15 at 19:51
  • Yes, absolutely. can be an IP or a hostname and `sed 's/|.*$//' end_unique.txt` gets them all. – turbo Jun 13 '15 at 19:56
  • Also, one more question: Does `parallel` wait until all of the 30 instances finished? Previously, I replaced empty slots with new instances when one was finished. – turbo Jun 13 '15 at 19:58
  • It tries to maintain 30 processes at the same time. Try adding `#!/bin/sh` at the top of `job.sh` - it's different to run `sh job.sh` and `./job.sh`; in the latter case Linux needs an instruction (= shebang) telling it which executable should be used to run the script. – rr- Jun 13 '15 at 20:18
  • Did that, same result. This is how my script looks like: [pastebin](http://pastebin.com/AzszpwTa). And this is my command: `desktop@***:~$ sed 's/|.*$//' end_unique.txt | xargs parallel -j20 ./job.sh :::`. – turbo Jun 13 '15 at 20:26
  • Uhm, I just tested it and it works on my end. Do you have `lftp` installed? Are you sure `./job.sh ftp.microsoft.com` works? – rr- Jun 13 '15 at 21:10
  • Yes, I needed to `chmod +x` the script file, but running it works fine. `lftp` is installed, so is `parallel`. Still the same error is displayed, it just prints every single server address in the file as an error. – turbo Jun 13 '15 at 21:25
  • Frankly, I believe this deserves separate question. Perhaps you've got different version of GNU/parallel. I'm not sure how can I help you since I can't reproduce the above message... – rr- Jun 13 '15 at 21:27
  • OK, I will create a separate question. Thanks. – turbo Jun 13 '15 at 21:33
  • If you remove `xargs` and `:::`, then the behaviour will be the same, no? Since parallel is designed as a drop-in replacement for xargs, you should really very rarely need to use both in the same command. – rici Jun 13 '15 at 22:03