
Given a file with a list of tables to Sqoop, this script launches a Sqoop import command with a list of options. The intelligence here is in the "scheduler", which I borrowed from here: I want the script to launch no more than a maximum number of subprocesses (defined in a variable), watch over them, and as soon as one of them completes, launch another to fill up the queue. This continues until the end of the list of tables to Sqoop.

The script and the scheduler work correctly, except that the script ends before the subshells have completed their job.

I tried inserting `wait` at the end of the script, but that way it waits for me to press ENTER.

I can't disclose the full script, I'm sorry. Hope you understand it anyway.

Thanks for your help.

#!/bin/bash

# Script to offload RDB tables to Hive in parallel via Sqoop

confFile=$1
listOfTables=$2
# Source configuration values
. "$confFile"
# This file contains various configuration options, as well as "parallels",
#  which is the number of concurrent jobs I want to launch

# Some nice functions.
usage () {
  ...
}

doSqoop() {

  # This function launches a Sqoop command built with information extracted
  # in the while loop. It also writes 2 log files and checks the Sqoop RC.
  ...

}

queue() {
    queue="$queue $1"
    num=$(($num+1))
}

regeneratequeue() {
    oldrequeue=$queue
    queue=""
    num=0
    for PID in $oldrequeue
    do
        if [ -d /proc/"$PID"  ] ; then
            queue="$queue $PID"
            num=$(($num+1))
        fi
    done
}

checkqueue() {
    oldchqueue=$queue
    for PID in $oldchqueue
    do
        if [ ! -d /proc/"$PID" ] ; then
            regeneratequeue # at least one PID has finished
            break
        fi
    done
}

# Check for mandatory values.
 ...

#### HeavyLifting ####

# Since I have a file containing the list of tables to Sqoop along with other
# useful arguments like sourceDB, sourceTable, hiveDB, HiveTable, number of parallels,
# etc, all in the same line, I use awk to grab them and then
# I pass them to the function doSqoop().

# So, here I:
# 1. create a temp folder
# 2. grab values from line with awk
# 3. launch doSqoop() as below:
# 4. Monitor spawned jobs 

awk '!/^($|#)/' < "$listOfTables" | { while read -r line; 
do

  # look for the folder or create it
  # .....

  # extract values from line with awk
  # ....

  # launch doSqoop() with this line:
  (doSqoop) &

  PID=$!
  queue $PID

  while [[ "$num" -ge "$parallels" ]]; do
    checkqueue
    sleep 0.5
  done

done; }
# Here I tried to put wait, without success.
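For what it's worth, the `wait` fails here because the `awk ... | { while ...; }` pipeline runs the loop in a subshell: the background jobs are children of that subshell, so a `wait` placed after the pipeline in the parent shell has nothing to wait on. Below is a minimal, self-contained sketch of one possible fix (dummy `doSqoop`, made-up table names), feeding the loop via process substitution so the loop and its jobs stay in the main shell:

```shell
#!/bin/bash
# Minimal self-contained sketch, NOT the real script: doSqoop is a dummy
# stub and the table list is inlined.  The point is structural -- the
# while loop runs in the main shell (fed by process substitution rather
# than a pipeline), so its background children are ours to wait for.
doSqoop() { sleep 0.2; echo "processed: $1"; }

parallels=2
num=0
queue=""

queue_add() { queue="$queue $1"; num=$((num+1)); }

checkqueue() {
    local old=$queue
    queue=""
    num=0
    for PID in $old; do
        # keep only PIDs still alive (kill -0 is a liveness probe)
        if kill -0 "$PID" 2>/dev/null; then
            queue="$queue $PID"
            num=$((num+1))
        fi
    done
}

# stand-in for "$listOfTables"
tables=$'table_a\ntable_b\n# a comment\ntable_c'

while read -r line; do
    doSqoop "$line" &              # forked from the main shell this time
    queue_add $!
    while [ "$num" -ge "$parallels" ]; do
        checkqueue
        sleep 0.1
    done
done < <(printf '%s\n' "$tables" | awk '!/^($|#)/')

wait    # now this genuinely blocks until every job has finished
echo "all imports done"
```

The only structural change from the script above is `done < <(awk ...)` in place of `awk ... | { while ...; }`; the scheduler logic is unchanged.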

EDIT (2)

OK, so I managed to implement what DeeBee suggested, and as far as I can tell it is correct. I did not implement what Duffy says, because I did not understand it quite well and I don't have time ATM.

Now the problem is that I moved some code inside the doSqoop function, and it is not able to create the /tmp folder needed for the logs.
I don't understand what's wrong. Here's the code, followed by the error. Please consider that the query argument is very long and contains spaces.

Script

#!/bin/bash

# Script to download a lot of tables in parallel with Sqoop and write them to Hive

confFile=$1
listOfTables=$2
# Source configuration values
. "$confFile"
# TODO: delete sqoop tmp directory after jobs ends #

doSqoop() {

  local origSchema="$1"
  local origTable="$2"
  local hiveSchema="$3"
  local hiveTable="$4"
  local splitColumn="$5"
  local sqoopParallels="$6"
  local query="$7"
  databaseBaseDir="$baseDir"/"$origSchema"-"$hiveSchema"
  local logFileSummary="$databaseBaseDir"/"$hiveTable"-summary.log
  local logFileRaw="$databaseBaseDir"/"$hiveTable"-raw.log

  [ -d "$databaseBaseDir" ] || mkdir -p "$databaseBaseDir"
  if [[ $? -ne 0 ]]; then
    echo -e "Unable to complete the process. \n
    Cannot create logs folder $databaseBaseDir"
    exit 1
  fi

  echo "#### [$(date +%Y-%m-%dT%T)] Creating Hive table $hiveSchema.$hiveTable from source table $origSchema.$origTable ####" | tee -a "$logFileSummary" "$logFileRaw"
  echo -e "\n\n"

  quote="'"

  sqoop import -Dmapred.job.queuename="$yarnQueue" -Dmapred.job.name="$jobName" \
  --connect "$origServer" \
  --username SQOOP --password-file file:///"$passwordFile" \
  --delete-target-dir \
  --target-dir "$targetTmpHdfsDir"/"$hiveTable" \
  --outdir "$dirJavaCode" \
  --hive-import \
  --hive-database "$hiveSchema" \
  --hive-table "$hiveTable" \
  --hive-partition-key "$hivePartitionName" --hive-partition-value "$hivePartitionValue" \
  --query "$quote $query where \$CONDITIONS $quote" \
  --null-string '' --null-non-string '' \
  --num-mappers 1 \
  --fetch-size 2000000 \
  --as-textfile \
  -z --compression-codec org.apache.hadoop.io.compress.SnappyCodec |& tee -a "$logFileRaw"

  sqoopRc=${PIPESTATUS[0]}  # exit status of sqoop itself, not of tee
  if [[ $sqoopRc -ne 0 ]]; then 
    echo "[$(date +%Y-%m-%dT%T)] Error importing $hiveSchema.$hiveTable !" | tee -a "$logFileSummary" "$logFileRaw"
    echo "$hiveSchema.$hiveTable" >> "$databaseBaseDir"/failed_imports.txt
  fi

  echo "Tail of : $logFileRaw" >> "$logFileSummary"
  tail -10 "$logFileRaw"  >> "$logFileSummary"
}
export -f doSqoop

# Check for mandatory values.
if [[ ! -f "$confFile" ]]; then
  echo -e "   $confFile does not appear to be a valid file.\n"
  usage
fi

if [[ ! -f "$listOfTables" ]]; then
  echo -e "   $listOfTables does not appear to be a valid file.\n"
  usage
fi

if [[ -z "${username+x}" ]]; then
  echo -e "   A valid username is required to access the Source.\n"
  usage
fi
if [[ ! -f "$passwordFile" ]]; then
  echo -e "   Password File $passwordFile does not appear to be a valid file.\n"
  usage
fi

if [[ -z "${origServer+x}" ]]; then
  echo -e "   Sqoop connection string is required.\n"
  usage
fi

#### HeavyLifting ####
awk -F"|" '!/^($|#)/ {print $1 $2 $3 $4 $5 $6 $7}' < "$listOfTables" | xargs -n7 -P$parallels bash -c "doSqoop {}"

Error

mkdir: cannot create directory `/{}-'mkdir: : Permission deniedcannot create directory `/{}-'
mkdir: : Permission denied
cannot create directory `/{}-': Permission denied
Unable to complete the process.

    Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.

    Cannot create logs folder /{}-
Unable to complete the process.

    Cannot create logs folder /{}-
Unable to complete the process.

    Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.

    Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.

    Cannot create logs folder /{}-
mkdir: mkdir: cannot create directory `/{}-'cannot create directory `/{}-': Permission denied: Permission denied

Unable to complete the process.

    Cannot create logs folder /{}-
Unable to complete the process.

    Cannot create logs folder /{}-
Unable to complete the process.

    Cannot create logs folder /{}-
Omar
  • Why are you using a subshell? `doSqoop &` should work fine. – chepner May 01 '17 at 11:22
  • Is `doSqoop` *also* forking a command? It may be returning before the command it starts finishes. – chepner May 01 '17 at 11:23
  • @chepner, I am using doSqoop &, actually. Sorry. doSqoop is executing the actual Sqoop command, but it is not forked. – Omar May 01 '17 at 12:02
  • The answer you have is a fine approach in general -- but to have an authoritative explanation of why your original code was failing, we'd need a [mcve] -- that is, code tested to reproduce the same problem if run exactly as-is, with elements unessential to duplicating that problem removed. – Charles Duffy May 01 '17 at 15:01
  • BTW, storing a list in a string isn't really ideal. If you want to store an array of PIDs, consider using an actual *array* for the purpose -- in that case, it could be as simple `queue+=( "$!" )`. Even better, with bash 4, you could use an associative array, and associate each PID with the line that it's processing, so you could check the exit status PID-by-PID and report which lines processing failed for. – Charles Duffy May 01 '17 at 15:04
  • (Also, why the extra subshell wrapping `(doSqoop) &`, instead of just `doSqoop &`? This makes your signal handling significantly more complicated if you want to signal a process to stop early). – Charles Duffy May 01 '17 at 15:06
  • (To be clear, it's not the "whole" script I was asking for, but a [mcve] -- that is, the *smallest runnable script that demonstrates the same problem*. If someone doesn't have `sqoop`, your "whole script" isn't runnable -- and it's certainly not "smallest" under any circumstances). – Charles Duffy May 01 '17 at 19:46
  • Your whole line of `awk`s could be replaced with `IFS='|' read -r origSchema origTable hiveSchema hiveTable splitColumn sqoopParallels query <<<"$line"` -- just one built-in command vs a whole bunch of external ones. – Charles Duffy May 01 '17 at 19:48
  • @CharlesDuffy your suggestion with built-in is not working for me. Bash 4 – Omar May 01 '17 at 20:45
  • Which suggestion, specifically? `read`? There's nothing new, modern, or even *interesting* about that -- the line I gave above works all the way back through bash 2.x. – Charles Duffy May 01 '17 at 20:55
  • `line='foo|bar|baz'; IFS='|' read -r one two three <<<"$line"; echo "Parsed $line into pieces -- first is $one, second is $two, third is $three"` if you want to try it in isolation. – Charles Duffy May 01 '17 at 20:56
  • ...that's part of the problem here, though -- what you have isn't a specific isolated question that we can provide a tested response to, but a big sprawling script nobody but you can run. – Charles Duffy May 01 '17 at 20:58
  • That said -- the reason you're getting the errors shown in your edit is that you're using `bash -c 'doSqoop {}'` without `-I{}`. I strongly advise *against* the `{}` approach, but if you're going to have it, you need to use both pieces together -- they don't make any sense individually. – Charles Duffy May 01 '17 at 20:59

3 Answers


Since you're pushing doSqoop to a background job with `&`, the only thing limiting script execution time is the `sleep 0.5` and however long it takes `checkqueue` to run.

Have you considered using xargs to run the function in parallel?

Example of what I think is approximating your use case:

$ cat sqoop.bash
#!/bin/bash
doSqoop(){
  local arg="${1}"
  sleep $(shuf -i 1-10 -n 1)  # random between 1 and 10 seconds
  echo -e "${arg}\t$(date +'%H:%M:%S')"
}
export -f doSqoop  # so xargs can use it

threads=$(nproc)  # number of cpu cores
awk '{print}' < tables.list | xargs -n1 -P${threads} -I {} bash -c "doSqoop {}"

$ seq 1 15 > tables.list

Result:

$ ./sqoop.bash
3   11:29:14
4   11:29:14
8   11:29:14
9   11:29:15
11  11:29:15
1   11:29:20
2   11:29:20
6   11:29:21
14  11:29:22
7   11:29:23
5   11:29:23
13  11:29:23
15  11:29:24
10  11:29:24
12  11:29:24

Sometimes it's nice to let xargs do the work for you.

Edit:

Example passing 3 args into the function, up to 8 operations in parallel:

$ cat sqoop.bash
#!/bin/bash
doSqoop(){
  a="${1}"; b="${2}"; c="${3}"
  sleep $(shuf -i 1-10 -n 1)  # do some work
  echo -e "$(date +'%H:%M:%S') $a $b $c"
}
export -f doSqoop

awk '{print $1,$3,$5}' tables.list | xargs -n3 -P8 -I {} bash -c "doSqoop {}"

$ cat tables.list
1a 1b 1c 1d 1e
2a 2b 2c 2d 2e
3a 3b 3c 3d 3e
4a 4b 4c 4d 4e
5a 5b 5c 5d 5e
6a 6b 6c 6d 6e
7a 7b 7c 7d 7e

$ ./sqoop.bash
09:46:57 1a 1c 1e
09:46:57 7a 7c 7e
09:47:05 3a 3c 3e
09:47:06 4a 4c 4e
09:47:06 2a 2c 2e
09:47:09 5a 5c 5e
09:47:09 6a 6c 6e
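A follow-up from the comments below: Charles Duffy points out that substituting `{}` into the `bash -c` code string is prone to shell injection. Here is a hedged sketch of the safer form he suggests, passing the fields as positional arguments out-of-band (the `doSqoop` stub and sample values are made up for illustration):

```shell
#!/bin/bash
# Safer variant: arguments arrive as "$@" (data), never spliced into the
# code string, so a field like '$(rm -rf ~)' cannot be executed as code.
doSqoop(){
  printf '%s %s %s\n' "$1" "$2" "$3"   # stub: just echo the three fields
}
export -f doSqoop

out=$(printf '%s\n' 1a 1b 1c 2a 2b 2c |
        xargs -n3 -P2 bash -c 'doSqoop "$@"' _ |
        sort)   # sort because -P2 output order is nondeterministic
echo "$out"
```

The trailing `_` fills `$0` of the inner bash, so the three xargs-supplied fields land in `$1`..`$3`; note there is no `-I` here at all.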
DeeBee
  • Thanks, I already stumbled upon the use of xargs, but couldn't actually understand it. What does `... -I {} bash -c "doSqoop {}"` do? Is it taking arguments and passing them to the doSqoop function? However, I don't know how to use it because I am using awk to extract 7 fields... maybe it's my fault but I can't understand how to use it – Omar May 01 '17 at 11:57
  • Sure thing! Yes, it's passing args to the function. If you need more fields per call, `-n` can do that! I'll edit my answer above with some more info. – DeeBee May 01 '17 at 14:34
  • Thanks again, but I still can't understand how to use it. Outside of the doSqoop function, I'm calling it after a `while read...` loop, because prior to launching the function I'm doing some stuff. How am I supposed to pass it to xargs??? I'm going nuts with this... next time Python! – Omar May 01 '17 at 14:45
  • Glad to help :) xargs would replace the while read loop entirely, and handle all the parallelization for you. you'd just have to get the arguments you need into the `doSqoop` function. sorry it's difficult to explain in this little box - hopefully my post edit is helpful. I admit I was curious why you were doing it in bash and not python or something else. python's `multiprocessing` module is quite nice, just takes a bit of finagling. – DeeBee May 01 '17 at 14:53
  • Re: `echo -e` -- see the APPLICATION USAGE section of [the POSIX spec for `echo`](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/echo.html), advising `printf` be used instead. In this case, that would be `printf '%s\t%s\n' "$arg" "$(date ...)"` -- if you don't have a new enough bash to use `%()T` format strings to have the shell do date-formatting internally. – Charles Duffy May 01 '17 at 14:57
  • Also, `bash -c "doSqoop {}"` is prone to shell injection attacks (if you had an argument that evaluated to `>/etc/passwd` or `$(rm -rf ~)` or such, that would be parsed as code). Consider `xargs -n3 -P8 bash -c 'doSqoop "$@"' _` instead (with no `-I`), so you're passing your arguments as data out-of-band from the code. – Charles Duffy May 01 '17 at 14:58
  • @CharlesDuffy I was just using `echo` for a simple illustration, but I do always appreciate more in-depth bash commentary :) – DeeBee May 01 '17 at 14:59
  • @DeeBee I just added the full script. I admit that I did not try hard to guess how to implement your suggestions since this has been a difficult day. – Omar May 01 '17 at 19:01
  • @DeeBee, understood. To be clear, though, the security feedback (on shell injection attacks via a malicious query) is considerably more important than that on `echo`. – Charles Duffy May 01 '17 at 19:49

Using GNU Parallel you can probably do:

export -f doSqoop
grep -Ev '^#' "$listOfTables" |
  parallel -r --colsep '\|' -P$parallels doSqoop {}

If you just want one process per CPU core:

  ... | parallel -r --colsep '\|' doSqoop {}
Ole Tange
  • Thanks Ole, I just started using Parallel and so far it's OK! Great work. Just a question: if I have a long string of arguments (possibly) containing double quotes, single quotes and semicolons, and I want to pass it as a whole line, how can I be sure it won't be interpreted by my old friend bash? – Omar Jun 10 '17 at 16:37
  • `echo "My brother's 12\" records cost > 10$" | parallel 'touch {}; echo {}'` – Ole Tange Jun 10 '17 at 20:39

After some time I now have a few moments to answer my own question, since I really don't want anyone else to fall into this kind of issue.

I experienced more than one issue, related both to bugs in my code and to the use of xargs. In hindsight, and based on my experience, I can definitely suggest not using xargs for this kind of work. Bash is not the most suitable language for it, but if you are forced to use it, consider GNU Parallel. I'll move my script to it soon.

Regarding the issues:

  • I had problems passing arguments to the function. In the first place because they contained special chars I hadn't noticed, and then because I was not using `-I args`. I solved this by cleaning the input lines of the newlines in between them, and by using xargs' options `-l1 -I args`. This way it treats each whole line as a single argument, passing it to the function (where I parse the fields with awk).
  • The scheduler I was trying to implement just did not work. I ended up using xargs to parallelize the execution, plus custom code inside the function that writes some control files, which helped me understand (at the end of the script) what went wrong and what worked.
  • xargs doesn't provide a facility to collect output from the separate jobs. It just dumps it on stdout. I work with Hadoop, I have a lot of output, and it's just a mess.
  • Again, xargs is fine if you use it with other shell commands like find, cat, zip, etc. Don't use it if you have my use case. Just don't, or you'll end up with white hair. Instead, spend some time learning GNU Parallel, or better, use a fully featured language (if you can).
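For completeness, the whole-line pattern described in the first bullet can be sketched like this (table names and fields are made up; with GNU xargs, `-I` already implies one line per command, so `-l1` is redundant; and note Duffy's injection caveat above still applies if a line can contain quotes):

```shell
#!/bin/bash
# Each whole pipe-delimited line is passed as ONE argument; the function
# then splits out the fields it needs with awk.
doSqoop() {
  local line="$1"
  local origSchema hiveTable
  origSchema=$(awk -F'|' '{print $1}' <<<"$line")
  hiveTable=$(awk -F'|' '{print $4}' <<<"$line")
  echo "import $origSchema -> $hiveTable"
}
export -f doSqoop

out=$(printf '%s\n' 'srcdb|t1|hivedb|ht1|id|4|select * from t1' \
                    'srcdb|t2|hivedb|ht2|id|4|select * from t2' |
        xargs -P2 -I args bash -c 'doSqoop "args"' |
        sort)   # sort: parallel output order is nondeterministic
echo "$out"
```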
Omar