As part of a much longer and more complex query, I am trying to keep only one entry from each set of overlapping intervals, plus all entries that do not overlap. Here is a minimal example:
create table protein (
seqid varchar(100),
start SMALLINT(5),
`end` SMALLINT(5),
cutoff FLOAT(5,4),
seq_region TEXT
);
insert into protein (seqid, start, `end`, cutoff, seq_region) values ("A0MZ66", 280, 290, 0.75, "RIQHQQKVKEL");
insert into protein (seqid, start, `end`, cutoff, seq_region) values ("A0MZ66", 314, 556, 0.75, "EEDKKELELKYQNSEEKARNLKHSVDELQKRVNQSENSVPPPPPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLKRQAVEEMMDRIKKGVHLRPVNQTARPKTKPESSKGCESAVDELKGILGTLNKSTSSRSLKSLDPENSETELERILRRRKVTAEADSSSPTGILATSESKSMPVLGSVSSVTKTALNKKTLEAEFNSPSPPTPEPGEGPRKLEGCTSSKVT");
insert into protein (seqid, start, `end`, cutoff, seq_region) values ("A0MZ66", 356, 406, 1.0, "PPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLK");
SELECT * from protein;
A0MZ66|280|290|0.75|CCCCCC
A0MZ66|314|556|0.75|ABCDEFG
A0MZ66|356|406|1.0|ABCD
Entries 2 and 3 have the same seqid and overlapping ranges (the start and end of one are contained in the other), but different cutoff and seq_region values. Entry #3 is in fact a substring of entry #2. What I can't express in SQL is the condition:
- if two ranges from the same seqid overlap, select the one with cutoff == 0.75 (or the longest seq_region, since these attributes are tied together)
Desired output should be entries #1 and #2:
A0MZ66|280|290|0.75|RIQHQQKVKEL
A0MZ66|314|556|0.75|EEDKKELELKYQNSEEKARNLKHSVDELQKRVNQSENSVPPPPPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLKRQAVEEMMDRIKKGVHLRPVNQTARPKTKPESSKGCESAVDELKGILGTLNKSTSSRSLKSLDPENSETELERILRRRKVTAEADSSSPTGILATSESKSMPVLGSVSSVTKTALNKKTLEAEFNSPSPPTPEPGEGPRKLEGCTSSKVT
How can I express this as an SQL query? The overlap condition can assume that one interval is always fully contained in the other (the start or the end may coincide). If it matters, this is an SQLite3 database.
I think I need some sort of inner self-join or a GROUP BY operation for this, but I can't get it quite right. I would very much appreciate your input.
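To make the condition concrete, here is a minimal sketch of one possible shape for such a query, not a definitive solution. It assumes that "overlap" always means full containment (as stated above) and that the longer interval is the one to keep. It is wrapped in Python's standard `sqlite3` module only so that it runs on its own against an in-memory database. The simplified column types, the short placeholder seq_region values (taken from the abbreviated SELECT output above) and the names `QUERY` and `kept` are illustrative assumptions.

```python
import sqlite3

# In-memory SQLite database with a simplified version of the table above.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE protein ('
    'seqid TEXT, start INTEGER, "end" INTEGER, cutoff REAL, seq_region TEXT)'
)

# Same intervals as in the question; seq_region uses short placeholders.
rows = [
    ("A0MZ66", 280, 290, 0.75, "CCCCCC"),
    ("A0MZ66", 314, 556, 0.75, "ABCDEFG"),
    ("A0MZ66", 356, 406, 1.0, "ABCD"),
]
conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?, ?)", rows)

# Keep a row only if no strictly longer interval with the same seqid
# fully contains it. Assumption: overlap always means containment.
QUERY = """
SELECT p.seqid, p.start, p."end", p.cutoff, p.seq_region
FROM protein AS p
WHERE NOT EXISTS (
    SELECT 1
    FROM protein AS q
    WHERE q.seqid = p.seqid
      AND q.start <= p.start
      AND q."end" >= p."end"
      AND (q."end" - q.start) > (p."end" - p.start)
)
ORDER BY p.seqid, p.start
"""

kept = conn.execute(QUERY).fetchall()
for row in kept:
    print(row)
```

On the sample data this should keep entries #1 and #2 and drop entry #3. Two caveats of this sketch: rows with identical start and end would both survive, because neither is strictly longer than the other, and partially overlapping intervals (neither containing the other) are not handled at all.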