
Context

I need to optimize deduplication using 'sort -u'. My Linux machine has an old implementation of the 'sort' command (5.97) that lacks the '--parallel' option. Although 'sort' implements parallelizable algorithms (e.g. merge sort), I need to make that parallelization explicit. Therefore, I do it by hand via the 'xargs' command, which outperforms the single 'sort -u' approach by ~2.5X ... when it works fine.

Here is the intuition of what I am doing.

I am running a bash script that splits an input file (e.g. file.txt) into several parts (e.g. file.txt.part1, file.txt.part2, file.txt.part3, file.txt.part4). The resulting parts are passed to the 'xargs' command, which performs parallel deduplication via the sortu.sh script (details at the end). sortu.sh wraps the invocation of 'sort -u' and prints the resulting file name (e.g. "sortu.sh file.txt.part1" outputs "file.txt.part1.sorted"). The resulting sorted parts are then passed to 'sort --merge -u', which merges/deduplicates them under the assumption that each part is already sorted.
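
The final merge step, once all the sorted parts exist, looks roughly like this (a sketch using the four example part names; the real script builds the list dynamically):

 sort --merge -u file.txt.part1.sorted file.txt.part2.sorted \
      file.txt.part3.sorted file.txt.part4.sorted > file.txt.sorted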

The problem I am experiencing is in the parallelization via 'xargs'. Here is a simplified version of my code:

 AVAILABLE_CORES=4
 PARTS="file.txt.part1
 file.txt.part2
 file.txt.part3
 file.txt.part4"

 SORTED_PARTS=$(echo "$PARTS" | xargs --max-args=1 \
                                      --max-procs=$AVAILABLE_CORES \
                                      bash sortu.sh \
               )
 ...
 #More code for merging the resulting parts $SORTED_PARTS
 ...

The expected result is a list of the sorted parts in the variable SORTED_PARTS:

 echo "$SORTED_PARTS"
 file.txt.part1.sorted
 file.txt.part2.sorted
 file.txt.part3.sorted
 file.txt.part4.sorted

Symptom

Nevertheless, sometimes a sorted part is missing. For instance, file.txt.part2.sorted:

 echo "$SORTED_PARTS"
 file.txt.part1.sorted
 file.txt.part3.sorted
 file.txt.part4.sorted

This symptom is non-deterministic both in its occurrence (i.e. an execution for the same file.txt succeeds one time and fails another) and in which file is missing (i.e. it is not always the same sorted part).

Problem

I have a race condition where all the sortu.sh instances finish and 'xargs' sends EOF before their stdout is flushed.
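
Since sortu.sh names its output deterministically, one workaround I can sketch (untested) is to stop capturing the children's stdout altogether and rebuild the list only after 'xargs' returns; 'xargs' does not exit until every child has finished, so every '.sorted' file already exists on disk at that point:

 #Sketch: discard the children's stdout; derive the names afterwards
 echo "$PARTS" | xargs --max-args=1 \
                       --max-procs=$AVAILABLE_CORES \
                       bash sortu.sh > /dev/null
 SORTED_PARTS=$(echo "$PARTS" | sed 's/$/.sorted/')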

Question

Is there a way to ensure stdout is flushed before 'xargs' sends EOF?

Constraints

I am able to use neither the 'parallel' command nor the '--parallel' option of the 'sort' command.

sortu.sh code

 #!/bin/bash

 SORTED="$1.sorted"
 sort -u "$1" > "$SORTED"
 echo "$SORTED"
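
For example:

 $ bash sortu.sh file.txt.part1
 file.txt.part1.sorted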
  • Do you get any errors on stderr when this happens? – that other guy Aug 10 '15 at 19:19
  • I think that you are seeing a race condition resulting from the fact that while the command substitution is free to complete as soon as `xargs` completes, `xargs` itself produces *no* output; only its children write to the file inherited from `xargs`. Since that output is buffered, there is a chance the shell reads from that file before the output from all the children is flushed to the file. – chepner Aug 10 '15 at 19:36
  • What's with the use of multi-line strings for lists of filenames instead of proper arrays? – Charles Duffy Aug 10 '15 at 19:44
  • Also, unless your script is doing something fancy, it's probably redundant -- GNU sort supports splitting files into pieces and sorting those pieces individually (followed by a merge sort to combine them to a single stream) out-of-the-box; it's unnecessary to implement that yourself. – Charles Duffy Aug 10 '15 at 19:48
  • @that other guy There are no errors – Manolo Aug 10 '15 at 20:37
  • @chepner Thanks for your comment. I already had that suspicion, but after your post I think "sync" could help. – Manolo Aug 10 '15 at 20:39
  • @CharlesDuffy I know sort can implicitly parallelize the task. It is also explicit via the --parallel option (which I do not have). My interest here is to make the parallelization explicit. Maybe my script looks fancy, but it outperforms the single 'sort -u' by ~2.5X ... when it works fine :) – Manolo Aug 10 '15 at 20:50
  • Have you tried passing `--parallel` to `sort` in that comparison, or are you comparing against a single-threaded invocation? See https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html – Charles Duffy Aug 10 '15 at 20:51
  • Ahh; as-edited, yes, if you're trying to replace a GNU sort that predates parallelization support, that makes sense. OTOH, frankly, I'm inclined to say that this would be a lot more interesting of a question (ie. one I'd be more tempted to spend time on) if you actually included enough context to allow answers to try to optimize the `sort` parallelization itself, rather than only focusing on a tactical issue you're hitting with your current implementation attempt. – Charles Duffy Aug 10 '15 at 23:26
  • Thanks for your comments @CharlesDuffy. I just added more context, and I am keeping my solution as it is performing well. I hope someone motivated can help or propose a better approach. – Manolo Aug 11 '15 at 14:27
  • Quick questions: Does your sort have `-m` / `--merge` support? Can you provide source for the code doing the initial split? (I ask because there are tradeoffs between correctness and performance in that split process; I'm curious which way you're going on some of them). – Charles Duffy Aug 11 '15 at 15:19
  • Yes, it has. As I explained, I use sort --merge once xargs finishes. The split is done beforehand, but it is irrelevant to the issue, as it is correct and complete. – Manolo Aug 11 '15 at 15:24
  • I'm building an end-to-end solution which does the split in-line, not ahead-of-time. This makes it more space-efficient -- fewer temporary files used -- and faster -- fewer reads and writes to disk. Which is to say -- I'm bringing it back into play. – Charles Duffy Aug 11 '15 at 15:25
  • Another question -- what's the oldest version of bash this needs to support? – Charles Duffy Aug 11 '15 at 15:29
  • It is 3.2.25. The inline version will be great. At the moment I need to measure the performance of the separate stages (i.e. split, parallel sort, merge), regardless of memory usage. Also, I have to be sure that this solution is correct and complete through large tests; that is how I found the race condition I have to solve. BTW, after revisiting pipelining and fork concepts in bash, I think I got it. I'm running tests right now. – Manolo Aug 11 '15 at 15:34
  • Backport to 3.2 took a bit more effort than anticipated -- dealing with some bugs caused by unnecessary subshells took a bit -- but is now done. – Charles Duffy Aug 11 '15 at 16:46
  • Is it really worth making performance worse to be able to measure it? Moreover, doing things serially has potential to move the bottlenecks, and thus to not even give you an accurate measurement of how a parallel version would perform. I'd consider using a tool like sysdig to enable measurement of the parallel version in flight. – Charles Duffy Aug 11 '15 at 16:52
  • BTW, do you have GNU awk available? I'm tempted to try to optimize the split. (This will also have the effect of making it easier to measure performance of the split separate from everything else -- just check the CPU accounting info for the awk command). – Charles Duffy Aug 11 '15 at 16:54
  • Done: Split logic is now performed by awk; should both be much faster, and easier to individually time. – Charles Duffy Aug 11 '15 at 17:01
  • Is the reason why you cannot use GNU Parallel covered by http://oletange.blogspot.com/2013/04/why-not-install-gnu-parallel.html ? If not: What is the reason? – Ole Tange Aug 15 '15 at 00:41
  • @OleTange we have to assume that it is not available because it is not part of the default setup of our servers – Manolo Aug 15 '15 at 01:58
  • Is your bash script part of the default setup? If not: How is installing your bash script any different from installing GNU Parallel as a normal user or distributing it with your script? – Ole Tange Aug 15 '15 at 23:47
  • @OleTange the issue in this question is the race condition. We will consider additional implementations of the parallel sort in the future (maybe one using GNU Parallel). – Manolo Aug 16 '15 at 00:09

1 Answer


The below doesn't write contents out to disk at all, and parallelizes the split process, the sort processes, and the merge, performing all of these at once.

This version has been backported to bash 3.2; a version built for newer releases of bash wouldn't need eval.

#!/bin/bash

nprocs=5  # maybe call the nproc command instead?
fd_min=10 # on bash 4.1, can use automatic FD allocation instead

# create a temporary directory; delete on exit
tempdir=$(mktemp -d "${TMPDIR:-/tmp}/psort.XXXXXX")
trap 'rm -rf "$tempdir"' 0

# close extra FDs and clear traps, before optionally executing another tool.
#
# Doing this in subshells ensures that only the main process holds write handles on the
# individual sorts, so that they exit when those handles are closed.
cloexec() {
    local fifo_fd
    for ((fifo_fd=fd_min; fifo_fd < (fd_min+nprocs); fifo_fd++)); do
        : "Closing fd $fifo_fd"
        # in modern bash; just: exec {fifo_fd}>&-
        eval "exec ${fifo_fd}>&-"
    done
    if (( $# )); then
        trap - 0
        exec "$@"
    fi
}

# For each parallel process:
# - Run a sort -u invocation reading from an FD and writing to a FIFO
# - Add the FIFO's name to a merge sort command
merge_cmd=(sort --merge -u)
for ((i=0; i<nprocs; i++)); do
  mkfifo "$tempdir/fifo.$i"               # create FIFO
  merge_cmd+=( "$tempdir/fifo.$i" )       # add to sort command line
  fifo_fd=$((fd_min+i))
  : "Opening FD $fifo_fd for sort to $tempdir/fifo.$i"
  # in modern bash: exec {fifo_fd}> >(cloexec sort -u >"$tempdir/fifo.$i")
  printf -v exec_str 'exec %q> >(cloexec; exec sort -u >%q)' "$fifo_fd" "$tempdir/fifo.$i"
  eval "$exec_str"
done

# Run the big merge sort recombining output from all the FIFOs
cloexec "${merge_cmd[@]}" &
merge_pid=$!

# Split input stream out to all the individual sort processes...
awk -v "nprocs=$nprocs" \
    -v "fd_min=$fd_min" \
  '{ print $0 >("/dev/fd/" (fd_min + (NR % nprocs))) }'

# ...when done, close handles on the FIFOs, so their sort invocations exit
cloexec

# ...and wait for the merge sort to exit
wait "$merge_pid"
Charles Duffy
  • Thanks @CharlesDuffy for your sophisticated answer. I will check it and compare it with mine. – Manolo Aug 11 '15 at 17:09
  • The one place where this could possibly be slower is if the input is already nearly-sorted. A more sophisticated split algorithm is called for in that case -- but since many of those more-sophisticated approaches require knowing the total number of lines up-front, they add IO expense. Thus, the tiny little awk script here does the simple, stupid thing. Splitting by larger batches would be a simpler way to reduce that cost without the IO: Just make `(NR % nprocs)` instead be `(int(NR / 100) % nprocs)`, adjusting the `100` as appropriate for the batch size chosen. – Charles Duffy Aug 11 '15 at 17:13
  • ...using too large a batch size means your other `sort` instances are sitting around waiting for the splitter to give them input, so you don't want to go too far in that direction either; would want to test with your actual data to see what was optimal. – Charles Duffy Aug 11 '15 at 17:16
  • I solved the race condition issue. As you pointed out, an in-line version avoiding disk writes will perform better. I will study your interesting script (I will learn a lot from it) and race it against my solution. I will tell you later how it goes. Thanks again! – Manolo Aug 11 '15 at 23:04
  • You might add the minimal race condition fix as your own answer, for folks who are interested. I look forward to hearing how this performs when benchmarked! – Charles Duffy Aug 11 '15 at 23:09