New to Nextflow here, and struggling with some basic concepts. I'm in the process of converting a set of bash scripts from a previous publication into a Nextflow workflow.
I'm converting a simple bash script (included below for convenience) that did some basic prep work and submitted a new job to the cluster scheduler for each iteration.
Ultimate question: What is the most Nextflow-like way to incorporate this script into a Nextflow workflow (preferably using the new DSL2 syntax)?
Possible subquestion: Is it possible to emit a list of lists based on bash variables? I've seen ways to pass lists from workflows into processes, but not out of processes. I could print each set of parameters to a file and then emit that file, but that doesn't seem very Nextflow-like.
I would really appreciate any guidance on how to incorporate the following bash script into a Nextflow workflow. I have added comments and indicated the four variables that I need to emit as a set of parameters.
Thanks!
# Input variables. I know how to take these in.
GVCF_DIR=$1
GATK_bed=$2
RESULT_DIR=$3
CAMO_MASK_REF_PREFIX=$4
GATK_JAR=$5
# For each directory
for dir in ${GVCF_DIR}/*
do
# Do some basic prep work: define
# variables and set up the results directory
ploidy=$(basename "$dir")
repeat=$((${ploidy##*_} / 2))
result_dir="${RESULT_DIR}/genotyped_by_region/${ploidy}" # Needs to be emitted
mkdir -p "$result_dir"
# Create a new file with a list of files. This file
# will be used as input to the downstream NextFlow process
gvcf_list="${ploidy}.gvcfs.list" # Needs to be emitted
find "$dir" -name "*.g.vcf" > "$gvcf_list"
REF="${CAMO_MASK_REF_PREFIX}.${ploidy}.fa" # Needs to be emitted
# For each line in the $GATK_bed file where
# column 5 == repeat (defined above), submit
# a new job to the scheduler with that region.
awk -v r="$repeat" '$5 == r {print $1":"$2"-"$3}' "$GATK_bed" | \
while read -r region # Needs to be emitted
do
qsub combine_and_genotype.ogs \
    "$gvcf_list" \
    "$region" \
    "$result_dir" \
    "$REF" \
    "$GATK_JAR"
done
done
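For anyone less familiar with the two shell constructs the loop depends on, here is a self-contained demo with made-up sample data (the real inputs are the ploidy-named directories and the GATK bed file above):

```shell
# 1) Deriving `repeat` from a directory name like "ploidy_10":
#    "${ploidy##*_}" strips everything through the last "_", leaving "10".
ploidy="ploidy_10"
repeat=$((${ploidy##*_} / 2))
echo "repeat=$repeat"   # prints: repeat=5

# 2) Filtering a bed-like file on column 5 and formatting chr:start-end
#    regions, using sample rows in place of the real GATK bed file:
printf 'chr1\t100\t200\tx\t5\nchr2\t300\t400\tx\t7\n' > demo.bed
awk -v r="$repeat" '$5 == r {print $1":"$2"-"$3}' demo.bed
# prints: chr1:100-200
```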