Bash loops vs. parallel processes

Question

I wrote a simple script using cat, a pipe, and parallel in bash, but with a large number of input files (>200) my computer crashes. It works well with only a few files (2). I was advised to use a "for" or "foreach" loop instead to avoid the crash, but I am struggling to convert my script into a loop.

Input files in DATADIR:

FAO21783_pass_c04106c7_0.fastq

FAO21783_pass_c04106c7_1.fastq

FAO21783_pass_c04106c7_2.fastq

FAO21783_pass_c04106c7_3.fastq

FAO21783_pass_c04106c7_4.fastq

etc...

Original script (using parallel), which works well:

    #!/bin/zsh -x

    DATADIR=shimbok_data/SB1_F2_data/fastq_pass

    DATAOUT=shimbok_data/SB1_F2_data/output

    DATABASEDIR=kaijudb

    DATABASE=kaijudb/refseq/kaiju_db_refseq.fmi

runinfo.txt contains the list of files in DATADIR

    cat shimbok_data/SB1_F2_data/runinfo.txt | parallel kaiju -t ${DATABASEDIR}/nodes.dmp -f ${DATABASE} -i ${DATADIR} -o ${DATAOUT}/{}.out

I am trying to convert it into a loop and I am having trouble with the output file names. I want them to be named like the input files but with an added .out extension (for example, FAO21783_pass_c04106c7_0.fastq.out).

Here is what I have so far:

    for file in shimbok_data/SB1_F2_data/fastq_pass
      do kaiju -t ${DATABASEDIR}/nodes.dmp -f ${DATABASE} -i ${file} -o ${DATAOUT}/${file}.out
    done

The output that it writes is wrong: shimbok_data/SB1_F2_data/output/shimbok_data/SB1_F2_data/fastq_pass.out

I have tried several other ways, but this seems to me the closest to the right one. Any help, please?

Thanks in advance

UPDATE:

I followed the suggestion I got in the comments and it seemed to work fine, but I then realized that the parallel approach itself is not working for me: the output files that the script produces are all empty.

With the "parallel" command, Kaiju uses the runinfo.txt list, but to work properly it needs to read the actual fastq files inside DATADIR.

In the meantime, I have found a loop that works well for my case:

    # csh/tcsh syntax: count from 0 to 265 and run kaiju once per file
    set num = 0
    set num_e = 266

    while ( $num < $num_e )
      set xx = `printf ${num}`
      echo $xx

      kaiju -t ${DATABASEDIR}/nodes.dmp -f ${DATABASE} \
        -i ${DATADIR}/FAO21783_pass_c04106c7_${xx}.fastq \
        -o ${DATAOUT}/FAO21783_pass_c04106c7_${xx}.out

      @ num++
    end

Is there a way to do the same iteration with GNU parallel? Or are there other loops that would work well for this kind of problem?

Thanks in advance

score 0 · Answer 1 · answered Feb 06 '21 at 15:24

How about just asking GNU Parallel to run a single job "in parallel":

    cat shimbok_data/SB1_F2_data/runinfo.txt |
      parallel -j1 kaiju -t ${DATABASEDIR}/nodes.dmp -f ${DATABASE} -i ${DATADIR} -o ${DATAOUT}/{}.out

or 2:

    cat shimbok_data/SB1_F2_data/runinfo.txt |
      parallel -j2 kaiju -t ${DATABASEDIR}/nodes.dmp -f ${DATABASE} -i ${DATADIR} -o ${DATAOUT}/{}.out
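
The update mentions empty output files. A likely cause, though I cannot verify it from here, is that "-i ${DATADIR}" hands kaiju the directory rather than one fastq file per job. Below is a minimal sketch of a per-file variant. It assumes runinfo.txt holds one bare file name per line (the temporary list is only a stand-in for it, with two names copied from the question). The --dry-run option makes GNU Parallel print the commands instead of running them, so the sketch does not need kaiju installed, and the version check is there because another, unrelated program called parallel exists on some systems.

```shell
DATADIR=shimbok_data/SB1_F2_data/fastq_pass
DATAOUT=shimbok_data/SB1_F2_data/output
DATABASEDIR=kaijudb
DATABASE=kaijudb/refseq/kaiju_db_refseq.fmi

# Stand-in for runinfo.txt: one bare file name per line (an assumption).
RUNINFO=$(mktemp)
printf '%s\n' FAO21783_pass_c04106c7_0.fastq FAO21783_pass_c04106c7_1.fastq > "$RUNINFO"

# Only continue if the parallel on PATH really is GNU Parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # --dry-run prints each command without running it; -k keeps input order.
  # {} is replaced by one line of the list, so -i receives a single file.
  CMDS=$(parallel --dry-run -k -j2 \
    kaiju -t "${DATABASEDIR}/nodes.dmp" -f "${DATABASE}" \
      -i "${DATADIR}/{}" -o "${DATAOUT}/{}.out" < "$RUNINFO")
  printf '%s\n' "$CMDS"
fi
```

Once the printed commands look right, drop --dry-run and point the redirection at the real runinfo.txt.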

score 0 · Answer 2 · answered Feb 12 '21 at 01:37

Here is a simple example showing how to write a readable for loop. Prepare some "fake" files like this:

    mkdir my-input-dir
    cd my-input-dir
    touch file1.txt  file2.txt  file3.txt  file4.tmp
    cd ..
    mkdir my-out-dir

Your directory structure should look like this (I deliberately created a .tmp file to show how you can filter the loop):

    $ : tree .
    ├── my-input-dir
    │   ├── file1.txt
    │   ├── file2.txt
    │   ├── file3.txt
    │   └── file4.tmp
    └── my-out-dir

The touch command creates an empty file, which makes it useful for demonstration.

Now, to mimic what you need to do, I create a script that, based on the input files, creates output files with the same name and the .out extension (e.g. file1.txt -> file1.out).

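    INPUT_DIR=./my-input-dir
    OUTPUT_DIR=./my-out-dir
    # Glob directly instead of parsing ls, so names with spaces survive
    for file in "$INPUT_DIR"/*.txt
    do
      # Strip the directory and the .txt suffix: file1.txt -> file1
      BASENAME=$(basename "$file" .txt)
      OUTFILE="$OUTPUT_DIR/$BASENAME.out"
      touch "$OUTFILE"
    done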

Then you can find the produced files in my-out-dir:

    $ : ls $OUTPUT_DIR
    file1.out  file2.out  file3.out
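
If you would rather keep the full input name and only append .out (FAO21783_pass_c04106c7_0.fastq.out, as asked in the question), call basename without the suffix argument. Here is a minimal sketch under a few assumptions: it runs in a throwaway directory made with mktemp, the two fastq names are copied from the question as empty stand-ins, and touch again takes the place of the real command. The kaiju line in the comment is taken from the question and is not something I have run.

```shell
# Demo in a throwaway directory so nothing real is overwritten.
WORKDIR=$(mktemp -d)
INPUT_DIR="$WORKDIR/my-input-dir"
OUTPUT_DIR="$WORKDIR/my-out-dir"
mkdir -p "$INPUT_DIR" "$OUTPUT_DIR"
touch "$INPUT_DIR/FAO21783_pass_c04106c7_0.fastq" \
      "$INPUT_DIR/FAO21783_pass_c04106c7_1.fastq"

for file in "$INPUT_DIR"/*.fastq
do
  # Without a suffix argument, basename strips only the directory part.
  NAME=$(basename "$file")
  OUTFILE="$OUTPUT_DIR/$NAME.out"
  # touch stands in for the real command, for example:
  # kaiju -t "$DATABASEDIR/nodes.dmp" -f "$DATABASE" -i "$file" -o "$OUTFILE"
  touch "$OUTFILE"
done

ls "$OUTPUT_DIR"
```

Because the loop runs one file at a time, only one process is active at any moment, which is the point of replacing the parallel call with a plain loop.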
