optimize parallelization of line-by-line bash function

Question

I have dozens of files, each 2-10 GB in size, that need reformatting, and my current bash script takes more than a day for a single file. Redirecting the file to file descriptor 3 (3<) seemed to add overhead, so I commented that approach out. Several other functions in this script generate the ${KEYS[@]} array, but I really need advice on how to optimize the function below for speed.

The purpose is to reformat every line of a specified $INPUT_FILE that doesn't start with a comment character. I'm trying to find the quickest way to parallelize this function, and I expect it to process 8 lines at a time (8 CPUs).

#!/bin/bash

function reformat {
    #while read -u 3 -r line; do
    while read -r line; do
        semicolVals=$(echo $line | awk '{print $12}')    
        IFS=';' read -r -a SCV -d '' < <(printf '%s' "${semicolVals[@]}")
        VALS=()
        for k in "${KEYS[@]}"; do
            if grep -q "$k=" <<< "${SCV[@]}"; then
                str="${SCV[@]}"
                VAL=${str##*$k=}
                VAL=${VAL%%[[:space:]]*}
                VALS+=("$VAL")
            else
                VAL='0'
                VALS+=("$VAL")
            fi
        done

        outKEY=$(printf ":%s" "${KEYS[@]}")
        outKEY=${outKEY:1}
        outVAL=$(printf ":%s" "${VALS[@]}")
        outVAL=${outVAL:1}
        ColA=$(echo $line | awk '{print $1}')
        ColB=$(echo $line | awk '{print $2}')
        ColC=$(echo $line | awk '{print $3}')
        ColD=$(echo $line | awk '{print $4}')
        ColE=$(echo $line | awk '{print $5}')
        ColF=$(echo $line | awk '{print $6}')
        ColG=$(echo $line | awk '{print $7}')

        if [[ "$RETAIN" -eq 1 ]]; then
            ColH="$semicolVals"
        else
            ColH='0'
        fi

        printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" \
        "$ColA" "$ColB" "$ColC" "$ColD" "$ColE" "$ColF" "$ColG" "$ColH" "${outKEY[@]}" "${outVAL[@]}"
    done
}

export -f reformat
#exec 3< <(grep -v -e '^#' "$INPUT_FILE")
#parallel -j 8 ::: reformat
#exec 3<&- 
grep -v -e '^#' "$INPUT_FILE" | parallel -j 8 ::: reformat

many many subprocesses. You should be able to eliminate most/all by rewriting in `awk`. Good luck. — shellter, Jan 05 '16 at 02:31
You could get at least an order of magnitude boost from sticking to one process, probably another from using a compiled language. — Mad Physicist, Jan 05 '16 at 02:37
This is a great example of what @shellter and I are referring to: `ColA=$(echo $line | awk '{print $1}')...` — Mad Physicist, Jan 05 '16 at 02:39
As @shellter writes: Rewrite in another language. If you choose Perl the change is not that big. — Ole Tange, Jan 05 '16 at 07:25
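
To make the suggestion in the comments concrete, here is a minimal sketch of a single-process awk rewrite. It is not the asker's code and has not been run on the real data. It assumes that field 12 looks like k1=v1;k2=v2, that the key names contain no whitespace or backslashes, and that keys should match exactly (the original grep also matches substrings). The function name reformat_awk and its two arguments are made up for illustration.

```shell
#!/bin/bash
# Sketch: one awk process replaces the per-line echo/awk/grep subprocesses.
# Assumptions (not from the question): keys are plain names, and field 12
# has the form "k1=v1;k2=v2".

reformat_awk() {
    # $1 = colon-separated key list, $2 = RETAIN flag (1 keeps field 12)
    awk -v keylist="$1" -v retain="$2" '
    BEGIN {
        OFS = "\t"
        nkeys = split(keylist, keys, ":")
    }
    /^#/ { next }                       # skip comment lines
    {
        # Index the key=value pairs of field 12 by key.
        split("", kv)                   # portable way to clear the array
        n = split($12, pairs, ";")
        for (i = 1; i <= n; i++) {
            eq = index(pairs[i], "=")
            if (eq > 0)
                kv[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
        }
        # Collect values in key order; missing keys become 0.
        vals = ""
        for (i = 1; i <= nkeys; i++) {
            v = (keys[i] in kv) ? kv[keys[i]] : "0"
            vals = vals (i > 1 ? ":" : "") v
        }
        print $1, $2, $3, $4, $5, $6, $7, (retain == 1 ? $12 : "0"), keylist, vals
    }'
}
```

With the script's existing variables, it could be called along the lines of keylist=$(IFS=:; printf '%s' "${KEYS[*]}") followed by reformat_awk "$keylist" "$RETAIN" < "$INPUT_FILE". The comment filter is folded into awk, so the separate grep -v is no longer needed. If one process is still too slow, GNU parallel has a --pipe mode that splits stdin into chunks for several workers, but it is worth measuring the single awk pass first.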
