Given a file listing the tables to Sqoop, this script launches a Sqoop import command with a set of options for each table. The interesting part is the "scheduler", which I borrowed from here: the script should launch no more than a maximum number of subprocesses (defined in a variable), watch over them, and as soon as one of them completes, launch another to fill up the queue. This continues until every table in the list has been processed.
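For reference, the throttling idea described above can be sketched with the wait -n builtin, which blocks until any one background job exits. This is only a minimal sketch and not the script in question: it assumes bash 4.3 or later, and fake_job, max_jobs and the table names are hypothetical stand-ins for the real Sqoop job, the "parallels" setting and the list file.

```bash
#!/bin/bash
# Minimal throttling sketch. Assumes bash >= 4.3 (for "wait -n").
# fake_job, max_jobs and the table names are hypothetical stand-ins.

max_jobs=2                 # plays the role of "parallels"
out_file=$(mktemp)         # each finished job appends one line here
running=0                  # number of background jobs not yet reaped

fake_job() {
    sleep 0.2              # pretend to do some work
    echo "$1" >> "$out_file"
}

for table in t1 t2 t3 t4 t5; do
    if (( running >= max_jobs )); then
        wait -n            # block until any one background job exits
        running=$((running - 1))
    fi
    fake_job "$table" &
    running=$((running + 1))
done

wait                       # block until the remaining jobs have finished
finished=$(wc -l < "$out_file")
echo "finished jobs: $finished"
rm -f "$out_file"
```

Compared with polling /proc every half second, this lets the shell itself report when a slot frees up, at the cost of requiring a reasonably recent bash.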
The script and the scheduler work correctly, except that the script ends before the subshells have completed their jobs.
I tried inserting wait at the end of the script, but this way it waits for me to press ENTER.
I can't disclose the full script, I'm sorry. Hope you understand it anyway.
Thanks for your help.
#!/bin/bash
# Script to offload RDB tables to Hive in parallel via Sqoop
confFile=$1
listOfTables=$2
# Source configuration values
. "$confFile"
# This file contains various configuration options, including "parallels",
# which is the number of concurrent jobs I want to launch
# Some nice functions.
usage () {
...
}
doSqoop() {
# This function launches a Sqoop command built from information extracted
# in the while loop. It also writes 2 log files and checks the Sqoop RC.
}
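# Add a PID to the queue and bump the counter of running jobs.
queue() {
    queue="$queue $1"
    num=$((num + 1))
}
# Rebuild the queue, keeping only the PIDs that are still alive.
regeneratequeue() {
    oldrequeue=$queue
    queue=""
    num=0
    for PID in $oldrequeue
    do
        if [ -d /proc/"$PID" ] ; then
            queue="$queue $PID"
            num=$((num + 1))
        fi
    done
}
# Rebuild the queue as soon as one of its PIDs is found to be gone.
checkqueue() {
    oldchqueue=$queue
    for PID in $oldchqueue
    do
        if [ ! -d /proc/"$PID" ] ; then
            regeneratequeue # at least one PID has finished
            break
        fi
    done
}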
# Check for mandatory values.
...
#### HeavyLifting ####
# Since I have a file containing the list of tables to Sqoop along with other
# useful arguments like sourceDB, sourceTable, hiveDB, HiveTable, number of parallels,
# etc, all in the same line, I use awk to grab them and then
# I pass them to the function doSqoop().
# So, here I:
# 1. create a temp folder
# 2. grab values from line with awk
# 3. launch doSqoop() as below:
# 4. Monitor spawned jobs
awk '!/^($|#)/' < "$listOfTables" | { while read -r line;
do
    # look for the folder or create it
    # .....
    # extract values from line with awk
    # ....
    # launch doSqoop() with this line:
    (doSqoop) &
    PID=$!
    queue $PID
    # block here while the queue is full
    while [[ "$num" -ge "$parallels" ]]; do
        checkqueue
        sleep 0.5
    done
done; }
# Here I tried to put wait, without success.
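One possible explanation, offered as a sketch rather than a confirmed diagnosis: in bash, each side of a pipeline normally runs in its own subshell, so the jobs started inside the while loop are children of that subshell and not of the main script. A wait placed after the closing brace then has no children to wait for. The example below puts wait inside the braces instead; the three input items and the sleep are hypothetical stand-ins for the table list and for (doSqoop) &.

```bash
#!/bin/bash
# Sketch: "wait" placed inside the brace group that started the jobs.
# The input items and the sleep are hypothetical stand-ins.

marker=$(mktemp)           # each finished job appends one line here

printf '%s\n' a b c | {
    while read -r item; do
        # Stand-in for "(doSqoop) &"
        ( sleep 0.3; echo "$item" >> "$marker" ) &
    done
    # This wait runs in the same subshell that started the jobs,
    # so it blocks until all of them have exited.
    wait
}

# The pipeline only returns once the brace group (including its wait) is done.
done_count=$(wc -l < "$marker")
echo "jobs finished before the script continued: $done_count"
rm -f "$marker"
```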
EDIT (2)
OK, so I managed to implement what DeeBee suggested, and to the best of my knowledge it is correct. I did not implement what Duffy suggested, because I did not fully understand it and I don't have time at the moment.
Now the problem is that I moved some code inside the doSqoop function, and it can no longer create the /tmp folder needed for logs.
I don't understand what's wrong. Here's the code, followed by the error.
Please note that the query argument is very long and contains spaces.
Script
#!/bin/bash
# Script to download a lot of tables in parallel with Sqoop and write them to Hive
confFile=$1
listOfTables=$2
# Source configuration values
. "$confFile"
# TODO: delete sqoop tmp directory after jobs end #
doSqoop() {
local origSchema="$1"
local origTable="$2"
local hiveSchema="$3"
local hiveTable="$4"
local splitColumn="$5"
local sqoopParallels="$6"
local query="$7"
# databaseBaseDir must be set before the log file paths that are built from it
databaseBaseDir="$baseDir/$origSchema-$hiveSchema"
local logFileSummary="$databaseBaseDir/$hiveTable-summary.log"
local logFileRaw="$databaseBaseDir/$hiveTable-raw.log"
[ -d "$databaseBaseDir" ] || mkdir -p "$databaseBaseDir"
if [[ $? -ne 0 ]]; then
echo -e "Unable to complete the process. \n
Cannot create logs folder $databaseBaseDir"
exit 1
fi
echo "#### [$(date +%Y-%m-%dT%T)] Creating Hive table $hiveSchema.$hiveTable from source table $origSchema.$origTable ####" | tee -a "$logFileSummary" "$logFileRaw"
echo -e "\n\n"
quote="'"
sqoop import -Dmapred.job.queuename="$yarnQueue" -Dmapred.job.name="$jobName" \
--connect "$origServer" \
--username SQOOP --password-file file:///"$passwordFile" \
--delete-target-dir \
--target-dir "$targetTmpHdfsDir"/"$hiveTable" \
--outdir "$dirJavaCode" \
--hive-import \
--hive-database "$hiveSchema" \
--hive-table "$hiveTable" \
--hive-partition-key "$hivePartitionName" --hive-partition-value "$hivePartitionValue" \
--query "$quote $query where \$CONDITIONS $quote" \
--null-string '' --null-non-string '' \
--num-mappers 1 \
--fetch-size 2000000 \
--as-textfile \
-z --compression-codec org.apache.hadoop.io.compress.SnappyCodec |& tee -a "$logFileRaw"
# $? would be the exit status of tee; PIPESTATUS[0] is the exit status of sqoop
sqoopRc=${PIPESTATUS[0]}
if [[ $sqoopRc -ne 0 ]]; then
echo "[$(date +%Y-%m-%dT%T)] Error importing $hiveSchema.$hiveTable !" | tee -a "$logFileSummary" "$logFileRaw"
echo "$hiveSchema.$hiveTable" >> "$databaseBaseDir/failed_imports.txt"
fi
echo "Tail of : $logFileRaw" >> "$logFileSummary"
tail -10 "$logFileRaw" >> "$logFileSummary"
}
export -f doSqoop
# Check for mandatory values.
if [[ ! -f "$confFile" ]]; then
echo -e " $confFile does not appear to be a valid file.\n"
usage
fi
if [[ ! -f "$listOfTables" ]]; then
echo -e " $listOfTables does not appear to be a valid file.\n"
usage
fi
if [[ -z "${username+x}" ]]; then
echo -e " A valid username is required to access the Source.\n"
usage
fi
if [[ ! -f "$passwordFile" ]]; then
echo -e " Password File $passwordFile does not appear to be a valid file.\n"
usage
fi
if [[ -z "${origServer+x}" ]]; then
echo -e " Sqoop connection string is required.\n"
usage
fi
#### HeavyLifting ####
awk -F"|" '!/^($|#)/ {print $1 $2 $3 $4 $5 $6 $7}' < "$listOfTables" | xargs -n7 -P$parallels bash -c "doSqoop {}"
Error
mkdir: cannot create directory `/{}-'mkdir: : Permission deniedcannot create directory `/{}-'
mkdir: : Permission denied
cannot create directory `/{}-': Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
Unable to complete the process.
Cannot create logs folder /{}-
Unable to complete the process.
Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
mkdir: mkdir: cannot create directory `/{}-'cannot create directory `/{}-': Permission denied: Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
Unable to complete the process.
Cannot create logs folder /{}-
Unable to complete the process.
Cannot create logs folder /{}-
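A note on the literal {} in these messages, with a hedged sketch: xargs only substitutes {} when -I is given, and with -n7 it appends the arguments after the command instead, so doSqoop receives the string {} as its first argument. The example below is one way the arguments could be passed through, not a confirmed fix for the script above. showArgs and the two sample records are hypothetical stand-ins for doSqoop and the list file, and it assumes an xargs that supports -0 and -P (as GNU xargs does). Variables set by the sourced configuration file, such as baseDir, would also need to be exported to be visible in the child bash.

```bash
#!/bin/bash
# Sketch: pass 7 pipe-delimited fields per record to an exported function.
# showArgs and the sample records are hypothetical stand-ins.

showArgs() {
    echo "argc=$# schema=$1 query=$7"
}
export -f showArgs

# One record per line; the 7th field contains spaces.
records='SRC|T1|HIVE|T1|ID|4|select * from T1
# a comment line that must be skipped
SRC|T2|HIVE|T2|ID|4|select * from T2'

# grep drops blank and comment lines, tr turns every field separator and
# newline into a NUL byte, and xargs -0 splits on NUL only, so spaces inside
# a field survive. "$@" hands the 7 fields to the function; "_" fills $0.
result=$(printf '%s\n' "$records" |
    grep -Ev '^($|#)' |
    tr '|\n' '\0\0' |
    xargs -0 -n 7 -P 2 bash -c 'showArgs "$@"' _ |
    sort)

echo "$result"
```

This relies on no field containing a literal | or newline; if the real list file allows those, a different delimiter scheme would be needed.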