I wrote a simple script using cat, a pipe, and GNU parallel in bash, but with a large number of input files (>200) my computer crashes. It runs fine with only a few files (2). I was advised to use a "for" or "foreach" loop instead to avoid the crash, but I am struggling to convert my script into a loop.
Input files in DATADIR:
FAO21783_pass_c04106c7_0.fastq
FAO21783_pass_c04106c7_1.fastq
FAO21783_pass_c04106c7_2.fastq
FAO21783_pass_c04106c7_3.fastq
FAO21783_pass_c04106c7_4.fastq
etc...
Original script (using parallel), which works well:
#!/bin/zsh -x
DATADIR=shimbok_data/SB1_F2_data/fastq_pass
DATAOUT=shimbok_data/SB1_F2_data/output
DATABASEDIR=kaijudb
DATABASE=kaijudb/refseq/kaiju_db_refseq.fmi
# runinfo.txt contains the list of files in DATADIR
cat shimbok_data/SB1_F2_data/runinfo.txt | parallel kaiju -t ${DATABASEDIR}/nodes.dmp -f ${DATABASE} -i ${DATADIR} -o ${DATAOUT}/{}.out
I am trying to convert it into a loop and I am having trouble with the output file names. I want them to be named like the input files but with the .out extension appended (e.g. FAO21783_pass_c04106c7_0.fastq.out).
Here is what I have so far:
for file in shimbok_data/SB1_F2_data/fastq_pass
do kaiju -t ${DATABASEDIR}/nodes.dmp -f ${DATABASE} -i ${file} -o ${DATAOUT}/${file}.out
done
The output that it writes is wrong: shimbok_data/SB1_F2_data/output/shimbok_data/SB1_F2_data/fastq_pass.out
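The loop above iterates over a single word, the directory path itself, so `${file}` holds the whole path and is pasted after `${DATAOUT}/`. A minimal sketch of one way around this, assuming bash and that every input ends in `.fastq`: glob the files inside the directory and strip the leading directories with `${file##*/}`. The function name `kaiju_loop` and the `KAIJU` variable are made up for illustration and are not part of Kaiju.

```shell
#!/bin/bash
DATADIR=shimbok_data/SB1_F2_data/fastq_pass
DATAOUT=shimbok_data/SB1_F2_data/output
DATABASEDIR=kaijudb
DATABASE=kaijudb/refseq/kaiju_db_refseq.fmi

# kaiju_loop is a hypothetical helper name used only in this sketch.
kaiju_loop() {
    local file name
    for file in "$DATADIR"/*.fastq; do
        # If nothing matches, bash keeps the literal pattern; skip it.
        [ -e "$file" ] || continue
        # Strip everything up to the last "/" to get the bare file name.
        name=${file##*/}
        # KAIJU is an assumed override; it defaults to the real kaiju binary.
        "${KAIJU:-kaiju}" -t "$DATABASEDIR/nodes.dmp" -f "$DATABASE" \
            -i "$file" -o "$DATAOUT/$name.out"
    done
}

# Dry run (prints the arguments instead of running Kaiju):
#   KAIJU=echo kaiju_loop
# Real run:
#   mkdir -p "$DATAOUT" && kaiju_loop
```

This runs one file at a time, so it should be gentler on the machine than starting many jobs at once, at the cost of a longer total run.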
I have tried several other approaches, but this one seems closest to correct. Any help, please?
Thanks in advance
UPDATE:
I followed the suggestion from the comments and it seemed to work, but I then realized that the parallel run itself is not working for me: the output files the script produces are all empty.
With the "parallel" command, Kaiju only sees the names listed in runinfo.txt, but to work properly it needs to read the actual fastq files inside DATADIR.
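The empty outputs are consistent with the original command passing `-i ${DATADIR}`, the directory, to every job, while `{}` (the line read from runinfo.txt) appears only in the `-o` argument. A hedged sketch of a possible fix, assuming GNU parallel and that runinfo.txt holds one bare file name per line: put `{}` into the `-i` path as well. The names `RUNINFO` and `kaiju_from_list` are invented here for illustration.

```shell
#!/bin/bash
DATADIR=shimbok_data/SB1_F2_data/fastq_pass
DATAOUT=shimbok_data/SB1_F2_data/output
DATABASEDIR=kaijudb
DATABASE=kaijudb/refseq/kaiju_db_refseq.fmi
# RUNINFO is a hypothetical variable name for the list of input file names.
RUNINFO=shimbok_data/SB1_F2_data/runinfo.txt

# kaiju_from_list is a hypothetical helper; extra arguments such as
# --dry-run are passed straight through to GNU parallel.
kaiju_from_list() {
    # {} is replaced by one line of $RUNINFO per job, in both -i and -o.
    parallel "$@" \
        kaiju -t "$DATABASEDIR/nodes.dmp" -f "$DATABASE" \
        -i "$DATADIR/{}" -o "$DATAOUT/{}.out" \
        < "$RUNINFO"
}

# Preview the generated commands without running Kaiju:
#   kaiju_from_list --dry-run
# Real run:
#   kaiju_from_list
```

Checking the `--dry-run` output first shows whether each job gets its own input file before any long computation starts.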
In the meantime, I have found a loop that works well for my case:
set num = 0
set num_e = 266
while ( $num < $num_e )
set xx = `printf ${num}`
echo $xx
kaiju -t ${DATABASEDIR}/nodes.dmp -f ${DATABASE} -i ${DATADIR}/FAO21783_pass_c04106c7_${xx}.fastq -o ${DATAOUT}/FAO21783_pass_c04106c7_${xx}.out
@ num++
end
Is there a way to do the same iteration with GNU parallel? Or is there another kind of loop that would work well for this problem?
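One possible GNU parallel version of the same iteration, as a sketch only: feed the fastq files with `:::` and a glob instead of a counter, use the `{/}` replacement string (the input's base name) for the output, and cap the number of simultaneous jobs with `-j`. Whether the cap prevents the crash is an assumption, since the post does not say what resource ran out. The names `kaiju_glob` and `JOBS` are hypothetical.

```shell
#!/bin/bash
DATADIR=shimbok_data/SB1_F2_data/fastq_pass
DATAOUT=shimbok_data/SB1_F2_data/output
DATABASEDIR=kaijudb
DATABASE=kaijudb/refseq/kaiju_db_refseq.fmi

# kaiju_glob is a hypothetical helper; JOBS is an assumed variable that
# sets how many Kaiju processes run at once (default 2 in this sketch).
kaiju_glob() {
    # {}  is the full path of one input file,
    # {/} is the same path with the leading directories removed.
    parallel "$@" -j "${JOBS:-2}" \
        kaiju -t "$DATABASEDIR/nodes.dmp" -f "$DATABASE" \
        -i {} -o "$DATAOUT/{/}.out" \
        ::: "$DATADIR"/*.fastq
}

# Preview:  kaiju_glob --dry-run
# Real run: mkdir -p "$DATAOUT" && JOBS=2 kaiju_glob
```

Starting with a small `JOBS` value and raising it while watching memory use is one cautious way to find a setting the machine can handle.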
Thanks in advance