Wrong plotting data (insert size vs base quality means from BAM file)

Question

I'am using PySam to calculate the insert reads and the mean base quality for the reads of a BAM as follows:

import math
import pysam
import numpy as np
import matplotlib.pyplot as plt

bam_file = pysam.AlignmentFile("mybam.bam", "rb")

mean_base_qualities = []
insert_sizes = []
zipped = []
for read in bam_file:
    if read.is_paired and read.is_proper_pair:
        # length
        mean_bq = np.mean(read.query_qualities)
        read_length = read.query_alignment_length
        tlen = read.template_length
        insert_size = int(math.sqrt(math.pow(tlen, 2))-read_length)
        if insert_size > 0:
            zipped.append((mean_bq, insert_size))
            mean_base_qualities.append(mean_bq)
            insert_sizes.append(insert_size)


plt.plot(mean_base_qualities,insert_sizes)
plt.show()

Finally, I've got two huge lists representing the insert sizes and the mean base quality for a BAM reads, like:

insert_sizes = [291, 188, 165, 212, 186, 127, 240, 166, 120, 303, 255, ...

mean_base_qualities = [36.24324324324324, 34.73831775700935, 35.3448275862069, 30.64788732394366, 36.467289719626166, 22.673267326732674, 36.54216867469879, 31.20754716981132, 25.085714285714285, 34.65625,  ...

When I try to plot them I am getting the following graph but it is obviously wrong:

I also tried to sort the lists independently getting:

And also tried zipping the lists and different ways of plotting but with no succeed. Could you help me, please? What I am missing? Thank you in advance!

----- UPDATE -----

Accordingly to Tom Morris comment, I've made the aggregate mean of mean qualities:

import math
import pysam
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline

bam_file = pysam.AlignmentFile("mybam.bam", "rb")

mean_base_qualities = []
insert_sizes = []
zipped = []

i = 0
# No longer needed to use read.query_alignment_length for substracting to TLEN value
# TLEN represents the insert size properly
for read in bam_file:
    # Only processing paired proper reads
    if read.is_paired and read.is_proper_pair and read.is_read1:
        if read.template_length != 0:
            i+=1
            mean_bq = np.mean(read.query_qualities)
            i_size = int(math.sqrt(math.pow(read.template_length, 2)))
            zipped.append((i_size, mean_bq))
            mean_base_qualities.append(mean_bq)
            insert_sizes.append(i_size)

res = [(key, statistics.mean(map(itemgetter(1), ele)))
       for key, ele in groupby(sorted(zipped, key = itemgetter(0)),
                                                key = itemgetter(0))]

tuples = zip(*res)
list1, list2 = [list(tuple) for tuple in  tuples]  

xnew = np.linspace(min(list1), max(list1), len(list1))
gfg = make_interp_spline(list1, list2, k=3)
y_new = gfg(xnew)
plt.plot(xnew, y_new)
plt.show()

But graph still does not make any sense:

Because there are no negatives values for mean bases :/

print(max(list1)) # Max insert size: 25381844
print(min(list1)) # Min insert size: 2
print(max(list2)) # Max mean base quality: 37.0
print(min(list2)) # Min mean base quality. 18.677419354838708

Could you help me, please??

Your first plot looks like a line plot, not a scatter plot, but do you *really* want to plot a point for every read in your file? Many/most of them are going to be overlapping and you won't have any way of distinguishing which ones are most common. Perhaps you mean to aggregate mean quality for all reads with a given insert size? — Tom Morris, Oct 28 '22 at 17:27
Hello Tom! This is an interesting approach! I've made what you say, mean of means of base qualities for each insert size value but the graph still does not make much sense. My post has been updated with that! — P. Solar, Oct 31 '22 at 08:27

Wrong plotting data (insert size vs base quality means from BAM file)

0 Answers0