
I'm here again! I would like to optimise my bash script to lower the time spent on each loop. Basically, what it does is:

  • getting a piece of information from a tsv file
  • using that information to look up a line in another file with awk
  • printing the line and exporting it

My issues are: 1) the files are 60 GB compressed files: I need software to uncompress them (I'm actually trying to uncompress one now, not sure I'll have enough space); 2) searching through them is slow anyway.

My ideas to improve it :

  1. as said, if possible I'll decompress the file
  2. using GNU parallel with parallel -j 0 ./extract_awk_reads_in_bam.sh ::: reads_id_and_pos.tsv, but I'm unsure it works as expected: it only cuts the time per search from 36 min to 16 min, so a factor of about 2.25 (I have 16 cores)
  3. I was thinking (but it may be redundant with GNU parallel?) of splitting my list of reads to look up into several files and launching them in parallel (a sketch of this idea follows the list)
  4. sorting the bam file by read name, and exiting awk after finding 2 matches (there can't be more than 2)
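
For ideas 2 and 3, here is a minimal sketch of what a chunked run could look like. This is only an assumption about how GNU parallel could be used here: it relies on --pipepart/--cat being available (recent GNU parallel versions) and on extract_awk_reads_in_bam.sh being the script shown further down, which takes the tsv file as its first argument.

# split reads_id_and_pos.tsv into one chunk per job slot (--block -1) and run
# the script on each chunk; --cat stores each chunk in a temp file passed as {}
parallel -j 16 --pipepart -a reads_id_and_pos.tsv --block -1 --cat \
    ./extract_awk_reads_in_bam.sh {}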

Here is the rest of my bash script. I'm really open to ideas to improve it, but I'm not a superstar in programming, so maybe keeping it simple would help? :)

My bash script :

#!/bin/bash
while IFS=$'\t' read -r READ_ID_WH POS_HOTSPOT; do
    echo "$(date -Iseconds) read id is : ${READ_ID_WH} with position ${POS_HOTSPOT}" >> /data/bismark2/reads_done_so_far.txt
    echo "$(date -Iseconds) read id is : ${READ_ID_WH} with position ${POS_HOTSPOT}"
    samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v read_id="$READ_ID_WH" -v pos_hotspot="$POS_HOTSPOT" '$1==read_id {printf "%s\t%s\twh_genome\n", $0, pos_hotspot}' | head -2 >> /data/bismark2/export_reads_mapped.tsv
done <"$1"

My tsv file has a format like:

READ_ABCDEF\t1200

Thank you a lot ++

H. Vilbert
  • Have you considered re-writing the parallel parts of your script in Python? I know you can achieve them in bash but this might make your life simpler... – Cristian Ramon-Cortes Feb 21 '20 at 11:26

1 Answer

TL;DR

Your new script will be:

#!/bin/bash
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
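
A hedged usage sketch, assuming the one-liner above is saved as a script (the name extract_reads.sh is just a placeholder):

# the tsv from the question is passed as the first argument ($1)
./extract_reads.sh reads_id_and_pos.tsv > /data/bismark2/export_reads_mapped.tsv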

You are reading the entire file for each of the inputs. Better to look for all of them at the same time: start by extracting the interesting reads and then, on this subset, apply the second transformation.

samtools view -@ 2 "$bam" | grep -f <(awk -F$'\t' '{print $1}' "$1") > "$sam"

Here you are getting all the reads with samtools and searching for all the terms that appear in the -f parameter of grep. That parameter is a file containing the first column of the search input file. The output is a sam file with only the reads that are listed in the search input file.

awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {print $0, st_array[$1], "wh_genome"}' "$sam"

Finally, use awk for adding the extra information:

  1. Open the search input file with awk at the beginning and read its contents into an array (st_array)
  2. Set the output field separator (OFS) to a tab
  3. Traverse the sam file and add the extra information from the pre-populated array.

I'm proposing this scheme because I feel like grep is faster than awk for doing the search, but the same result can be obtained with awk alone:

samtools view -@ 2 "$bam" | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'

In this case, you only need to add a conditional to identify the interesting reads and get rid of the grep.

In any case, you don't need to re-read the file more than once or to decompress it before working with it.

Poshi
  • Hi, it seems very elegant; I have some issues understanding everything in there, however :) 1. So I should quit my bash approach? I understand the first part, which is only subsetting the dataset to the lines that will match. Then "$1" is actually the READ_ID_WH in my script, and I should make a loop for this part with the while read / do / done. 2. It outputs $sam, which is given as a parameter to awk. Here I don't quite understand: we should loop again, because I indeed need the full information of the read, but also the position at which we found it :) Shall I implement two loops? Thanks +++ ! – H. Vilbert Feb 21 '20 at 18:33
  • 1. I think you should quit your bash approach, yes. It is terribly inefficient. Better go with my bash approach, which will be faster and need fewer resources. `$1` is the same `$1` you had in your script. In fact, my command line should be the substitute for your whole set of commands in your question. You don't need any extra bash loop. The only loop needed is inside the awk script; it is used to load the contents of the tsv file (`$1`) into memory. – Poshi Feb 21 '20 at 22:29
  • 2. You don't have to loop again. The `$sam` file contains all the lines which are of interest, and the awk script reads the whole tsv file into memory to expand the output information as it traverses the subsetted file. – Poshi Feb 21 '20 at 22:31
  • Ok, I understand it better now :) I'm just facing the issue of passing the $1 as the file to be read; awk tells me it cannot be found. I tried to use -v but it doesn't work, I'm digging! – H. Vilbert Feb 21 '20 at 22:42
  • If you are executing this as a script, just use the name of the file as the script parameter. If you are running the line interactively, put the name of the file in the `$1` place (`-v st="$1"` -> `-v st="myFile.tsv"`) – Poshi Feb 21 '20 at 22:50
  • Thank you, it's brilliant. I'm eager to know as much as you do. Thanks again +++ – H. Vilbert Feb 21 '20 at 23:06
  • Hi, I'm sorry to ask you for more after some months... my file now has 4 fields and not only two (previously just the read_id and the chromosome). I tried to adapt the script to paste the two other fields but I can't manage to make it work. I tried `{st_array[$2]=$3}}` for instance. $1 is referring to the first field of the file passed as a variable with -v, right? – H. Vilbert Nov 26 '20 at 10:42
  • Yes, inside an awk script, `$N` refers to the N-th field. Probably you are using the first field as the key for the line, so you could add the other fields to other arrays: `second_field[$1]=$3; third_field[$1]=$4` and then output them in the print command in the desired position. – Poshi Nov 26 '20 at 14:28
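
As an illustration only (the column layout of the four-field file is an assumption: read_id, position, plus two extra fields; $bam stands for the bam file path and lookup4.tsv is a placeholder name), the awk one-liner could be extended roughly like this:

samtools view -@ 2 "$bam" | awk -v st="lookup4.tsv" '
    BEGIN {
        OFS = "\t"
        # hypothetical 4-column tsv: read_id, pos, extra1, extra2
        while ((getline line < st) > 0) {
            split(line, f, "\t")
            pos[f[1]]    = f[2]
            extra1[f[1]] = f[3]
            extra2[f[1]] = f[4]
        }
    }
    $1 in pos { print $0, pos[$1], extra1[$1], extra2[$1], "wh_genome" }'
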
  • Sorry to ask again, now I have a little different mission to achieve. I have a bam file of 130GB and I need to lookup for a very big number of reads (~ 1 billion reads) So far I applied the same approach and it is looking up for it for more than 200h now on 10 threads. I guess it's slightly not adapted. Would you change something on the approach knowing these new parameters? Thanks a lot ! – H. Vilbert Dec 14 '20 at 12:08
  • Well, `awk` is fast and easy to develop with, but not that fast in terms of computing speed. There are several ways you can approach the bigger problem. You can split the input bam into several intervals and run them in parallel on different nodes, finally joining the results back. You can also split the input lookup table into chunks and run them in parallel on different nodes, finally joining and sorting the results as you need. You can even do both splits at the same time and juggle with the outputs to generate a final single file. – Poshi Dec 20 '20 at 13:31
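
A rough sketch of the first option (splitting the bam by interval), assuming a coordinate-sorted, indexed bam and a lookup file named lookup.tsv (placeholder name):

# one job per chromosome; each job streams only its region of the bam
extract_chr() {
    samtools view /data/bismark2/aligned_on_nDNA/bamfile.bam "$1" |
    awk -v st="lookup.tsv" 'BEGIN {OFS="\t"; while (getline < st) st_array[$1]=$2}
                            $1 in st_array {print $0, st_array[$1], "wh_genome"}'
}
export -f extract_chr
# list the reference names from the bam index and run one job per name
samtools idxstats /data/bismark2/aligned_on_nDNA/bamfile.bam | cut -f1 |
    grep -v '^\*$' |
    parallel -j 10 extract_chr {} > export_reads_mapped.tsv
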
  • But as the problem is getting bigger, it will be easier and faster to develop a specialized tool for this task. You should try Python, as suggested in the comments; using the `pysam` library should make the parsing easy. Or even go for a C approach and rely on the `htslib` library. The input is big and I think this will be the fastest way. – Poshi Dec 20 '20 at 13:33
  • Another idea in the bash scripting world would be to use the `join` command. If I'm not wrong, you are extracting reads by ID. You can sort the sam output and the lookup table and then perform a join on the first field, keeping only the results of the sam input. Finally, with the reads selected, apply the remaining transformations to them. – Poshi Dec 20 '20 at 13:35
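
A minimal sketch of that join idea (again with lookup.tsv as a placeholder name; sorting the full sam stream needs a large temporary directory, here /data/tmp as a placeholder):

# sort both inputs on the read ID (field 1), then join them on that field
sort -t $'\t' -k1,1 lookup.tsv > lookup.sorted.tsv
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam |
    sort -t $'\t' -k1,1 -S 4G -T /data/tmp |
    join -t $'\t' -j 1 - lookup.sorted.tsv > export_reads_mapped.tsv
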
  • In any case, you will have to do some new development, as the current one is already at its limits. BTW, bear in mind that bam files can be even bigger... once I had to work with a bam file of almost 1TiB :-O – Poshi Dec 20 '20 at 13:36