I have dozens of files, each 2-10 GB in size, that need reformatting, and my current bash script takes more than a day for a single file. Reading the file through file descriptor 3 (3<) seemed to add overhead, so I commented that approach out. Several other functions in this script generate the ${KEYS[@]} array, but what I really need is advice on how to optimize the function below for speed.
The purpose is to reformat every line of a specified $INPUT_FILE that doesn't start with a comment character (#). I'm trying to find the quickest way to parallelize this function, and I expect it to process 8 lines at a time (8 CPUs).
#!/bin/bash
function reformat {
    #while read -u 3 -r line; do
    while read -r line; do
        # Column 12 holds the semicolon-separated key=value pairs
        semicolVals=$(echo "$line" | awk '{print $12}')
        IFS=';' read -r -a SCV -d '' < <(printf '%s' "$semicolVals")
        VALS=()
        # Look up each key; use 0 when the key is absent from this line
        for k in "${KEYS[@]}"; do
            if grep -q "$k=" <<< "${SCV[*]}"; then
                str="${SCV[*]}"
                VAL=${str##*$k=}
                VAL=${VAL%%[[:space:]]*}
                VALS+=("$VAL")
            else
                VAL='0'
                VALS+=("$VAL")
            fi
        done
        # Join keys and values with ':' and drop the leading ':'
        outKEY=$(printf ":%s" "${KEYS[@]}")
        outKEY=${outKEY:1}
        outVAL=$(printf ":%s" "${VALS[@]}")
        outVAL=${outVAL:1}
        ColA=$(echo "$line" | awk '{print $1}')
        ColB=$(echo "$line" | awk '{print $2}')
        ColC=$(echo "$line" | awk '{print $3}')
        ColD=$(echo "$line" | awk '{print $4}')
        ColE=$(echo "$line" | awk '{print $5}')
        ColF=$(echo "$line" | awk '{print $6}')
        ColG=$(echo "$line" | awk '{print $7}')
        if [[ "$RETAIN" -eq 1 ]]; then
            ColH="$semicolVals"
        else
            ColH='0'
        fi
        printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" \
            "$ColA" "$ColB" "$ColC" "$ColD" "$ColE" "$ColF" "$ColG" "$ColH" "$outKEY" "$outVAL"
    done
}
export -f reformat
#exec 3< <(grep -v -e '^#' "$INPUT_FILE")
#parallel -j 8 ::: reformat
#exec 3<&-
grep -v -e '^#' "$INPUT_FILE" | parallel -j 8 ::: reformat
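
For comparison, here is a minimal sketch of the same per-line work done in a single awk process, so that no subshell, awk, or grep is started per line. The function name reformat_awk is hypothetical, and the sketch assumes that KEYS and RETAIN are set by the rest of the script, that keys contain no ':' characters, and that exact key matching is acceptable (the grep in the loop above also matches substrings, so a key DP would match ADP=). It is untested against the real data.

```shell
#!/bin/bash
# Hypothetical single-pass variant of reformat: one awk process for the whole stream.
# Assumes KEYS (array) and RETAIN (0 or 1) are defined elsewhere in the script.
reformat_awk() {
    local keys
    # Join the keys with ':' (the IFS change is confined to the command substitution)
    keys=$(IFS=:; printf '%s' "${KEYS[*]}")
    awk -v keys="$keys" -v retain="${RETAIN:-0}" '
        BEGIN { OFS = "\t"; n = split(keys, K, ":") }
        /^#/ { next }                          # skip comment lines
        {
            split("", kv)                      # clear the per-line key/value map
            m = split($12, pairs, ";")         # column 12: key=value;key=value;...
            for (i = 1; i <= m; i++) {
                eq = index(pairs[i], "=")
                if (eq > 0)
                    kv[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
            }
            vals = ""
            for (i = 1; i <= n; i++) {
                v = (K[i] in kv) ? kv[K[i]] : "0"   # 0 when the key is absent
                vals = vals (i > 1 ? ":" : "") v
            }
            print $1, $2, $3, $4, $5, $6, $7, (retain == 1 ? $12 : "0"), keys, vals
        }
    '
}
```

It would be called as reformat_awk < "$INPUT_FILE" > "$OUTPUT_FILE" (OUTPUT_FILE being a placeholder name); the separate grep stage is not needed because the awk program skips lines starting with #. If one pass per file is still too slow, GNU parallel's --pipe mode, which splits stdin into blocks instead of treating the ::: arguments as jobs, is probably the relevant option, keeping in mind that export -f does not carry the KEYS array into the child shells.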