Detect large deletions from CIGAR string

Question

I am new to sequence analysis and I'm doing some exercises to help me learn to use pysam and samtools for WGS data analysis. One thing I would like to do is to detect (rather large) deletions from 2D Oxford nanopore data (large reads). For this purpose, I have extracted the first 10kb, as well as sequencing reads covering the region, from an E.coli genome. Call the original genome A. I then create a genome A' by inserting 1kb of sequence in the middle of A and align the reads for A using A' as a reference to imitate a deletion in the sequence. I would now like to write a program which detects the location of my 'deletion'. My problem is that the CIGAR strings of my reads don't fit my expectations, which I assume must be wrong. I find the information in the sam documentation rather incomplete and would appreciate an explanation of the following.

Suppose I have a sequence ....GTTGCA ---1kb deletion--- GAACGT... and reads aligned to that sequence. I make the following assumptions:

Case 1. Reads to the left of the deletion and not overlapping with the deletion can start with aHbS (a and b being constants, a,b >=0) followed by a sequence of Ms, Is, Ds and before ending with cSdH. I wouldn't expect to find large segments of Is and Ds in these reads.

Case 2. The reads partially overlapping the deletion from the left should start the same as the reads in (1) but should end with rS, with the size of the constant r being determined by how much the read overlaps the deletion.

Case 3. Reads overlapping the deletion completely (remember, I have very long reads so such reads exist) should start the same as the reads in (1) but then I would expect to see 1000D or something like that in my CIGAR string and then the read should end the same as the reads in (1). This is what I don't observe in my data. My 'deletion' starts at 5kb but reads that have 4500 < POS < 5000 and are more than 2kb long actually just seem to contain the same sequence of Ms, Is and Ds as if they had been aligned to the reference.

My questions, which I hope are not off topic as I'm rather asking about data format than actual programming, is i). Which of my assumptions above are wrong ii). What should the cigar string of a read partially overlapping a deletion look like? iii). What should the cigar string of a read completely overlapping (that is to say, whose ends map on either side of the deletion) a deletion look like?

I'm attaching a figure which hopefully will help to illustrate my three cases.

you should ask https://www.biostars.org – Pierre Apr 14 '17 at 12:15 — Pierre, Apr 14 '17 at 12:15

score 0 · Answer 1 · answered Jun 29 '18 at 10:10

I'm not really an expert on this topic, but my bet is that cases 1 and 2 are as you say. Case 3 could vary depending on the aligner used. One way to represent that kind of alignment is as you expect: xHxSxM1000DxMxSxH, but there is another way that an split mapper can give it to you: xHxSxM1000SxH plus xH1000SxMxSxH in two different entries, one of them marked as primary alignment and the other marked as supplementary alignment (some old aligners can mark it as secondary, as supplementary appeared later in the standard). The aligner plays a crucial role in how this is represented.

Have you checked that your fully overlapping read is only represented once in your sam/bam file?

score 0 · Answer 2 · answered Nov 21 '19 at 19:01

If I understand your question correctly, I think your assumptions sounds good. What you're describing (the deletion not being represented in the CIGAR strings) makes me think that while you have changed your reference genome (presumably a *.fasta file?), you may have not re-run the alignment of your reads to that reference.

Your BAM/SAM file, from which you're presumably getting your CIGAR strings, is the result of an alignment and does not change in itself when you only change the reference genome. After you change the reference genome, you need to now re-do the alignment to get a new BAM/SAM file and only then look at the CIGAR strings from that, which should now reflect the simulated deletion. Please, let me know if this was a correct assessment.

(I know this would have been more of a comment-worthy post but I don't have the privileges to write comments yet.)

Detect large deletions from CIGAR string

2 Answers2