I'm here again! I would like to optimise my bash script in order to lower the time spent on each loop. Basically, what it does is:
- getting a read ID and a position from a TSV file
- using that information to look up the matching line in another file with awk
- printing the line and exporting it
My issues are: 1) the files are 60 GB compressed files: I need software to uncompress them (I'm actually trying to uncompress one right now, not sure I'll have enough space) 2) searching through them is slow anyway.
My ideas to improve it:
- 0) as said, if possible I'll decompress the file (a quick sketch of what I mean is after this list)
- 1) using GNU parallel with
parallel -j 0 ./extract_awk_reads_in_bam.sh ::: reads_id_and_pos.tsv
but I'm unsure it works as expected: it only cuts the time per search from 36 min down to 16 min, so a factor of only about 2.25 even though I have 16 cores. I was also thinking (but it may be redundant with GNU parallel?) of splitting my list of IDs into several files and launching them in parallel; see the sketch after this list.
- 2) sorting the BAM file by read name, and making awk exit after it has found 2 matches (there can't be more than 2); see the sketch after my script below.
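For idea 0), this is roughly what I had in mind: decompressing the BAM once to plain SAM so the loop doesn't have to decode it every time (the .sam output path is only an example, and I know it will need a lot of disk space):
# one-off conversion of the BAM to uncompressed SAM, keeping the header
samtools view -h -o /data/bismark2/aligned_on_nDNA/bamfile.sam /data/bismark2/aligned_on_nDNA/bamfile.bam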
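And for the splitting idea, I mean something like this sketch, where the chunk_ prefix and the 16 chunks (one per core) are just made-up example values:
# cut the TSV into 16 chunks without breaking lines, then run the script once per chunk
split -n l/16 reads_id_and_pos.tsv chunk_
parallel -j 16 ./extract_awk_reads_in_bam.sh ::: chunk_*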
Here is the rest of my bash script. I'm really open to ideas to improve it, but I'm not sure I'm a superstar in programming, so maybe keeping it simple would help? :)
My bash script:
#!/bin/bash
# read one (read ID, hotspot position) pair per line from the TSV passed as $1
while IFS=$'\t' read -r READ_ID_WH POS_HOTSPOT; do
    echo "$(date -Iseconds) read id is : ${READ_ID_WH} with position ${POS_HOTSPOT}" >> /data/bismark2/reads_done_so_far.txt
    echo "$(date -Iseconds) read id is : ${READ_ID_WH} with position ${POS_HOTSPOT}"
    # stream the BAM as SAM and keep the (at most 2) alignments whose name matches the read ID
    samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam \
        | awk -v read_id="$READ_ID_WH" -v pos_hotspot="$POS_HOTSPOT" '$1==read_id {printf "%s\t%s\twh_genome\n", $0, pos_hotspot}' \
        | head -n 2 >> /data/bismark2/export_reads_mapped.tsv
done <"$1"
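For idea 2), I imagine the inner command would become something like this, where bamfile_namesorted.bam is hypothetical (the same BAM after samtools sort -n) and n just counts the matches printed so far:
# same lookup, but awk quits as soon as it has printed the 2 possible matches
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile_namesorted.bam \
    | awk -v read_id="$READ_ID_WH" -v pos_hotspot="$POS_HOTSPOT" \
        '$1==read_id {printf "%s\t%s\twh_genome\n", $0, pos_hotspot; if (++n == 2) exit}' \
        >> /data/bismark2/export_reads_mapped.tsv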
My TSV file has a format like:
READ_ABCDEF\t1200
Thank you a lot ++