-1

As part of a much longer and complex query, I am trying to keep only one entry for overlapping intervals, and all entries which do not overlap. Here is a minimal example:

create table protein (
    seqid varchar(100),
    start SMALLINT(5),
    `end` SMALLINT(5),
    cutoff FLOAT(5,4),
    seq_region TEXT
);

insert into protein (seqid, start, `end`, cutoff, seq_region) values ("A0MZ66", 280, 290, 0.75, "RIQHQQKVKEL");
insert into protein (seqid, start, `end`, cutoff, seq_region) values ("A0MZ66", 314, 556, 0.75, "EEDKKELELKYQNSEEKARNLKHSVDELQKRVNQSENSVPPPPPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLKRQAVEEMMDRIKKGVHLRPVNQTARPKTKPESSKGCESAVDELKGILGTLNKSTSSRSLKSLDPENSETELERILRRRKVTAEADSSSPTGILATSESKSMPVLGSVSSVTKTALNKKTLEAEFNSPSPPTPEPGEGPRKLEGCTSSKVT");
insert into protein (seqid, start, `end`, cutoff, seq_region) values ("A0MZ66", 356, 406, 1.0, "PPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLK");

SELECT *  from protein;
A0MZ66|280|290|0.75|CCCCCC
A0MZ66|314|556|0.75|ABCDEFG
A0MZ66|356|406|1.0|ABCD

Entry 2 and 3 have the same id and overlapping ranges (start and end from one is contained in the other), but different cutoff and seq_region. Entry #3 is in fact a sub-string of entry #2. What I can't put into sql is the condition:

  • if two ranges from the same seqid overlap, select the one with the score == 0.75 (or longest seq_region, since these attributes are tied together)

Desired output should be entries #1 and #2:

A0MZ66|280|290|0.75|RIQHQQKVKEL
A0MZ66|314|556|0.75|EEDKKELELKYQNSEEKARNLKHSVDELQKRVNQSENSVPPPPPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLKRQAVEEMMDRIKKGVHLRPVNQTARPKTKPESSKGCESAVDELKGILGTLNKSTSSRSLKSLDPENSETELERILRRRKVTAEADSSSPTGILATSESKSMPVLGSVSSVTKTALNKKTLEAEFNSPSPPTPEPGEGPRKLEGCTSSKVT

How to put this as an SQL query? The overlap condition can assume that one interval is always contained in the other (start or end can be same). If it matters, it is an SQLite3 database.

I think I need to do some sort of self inner join for this, or group by operation but I can't get it quite right. I would appreciate your input very much.

fridaymeetssunday
  • 1,118
  • 1
  • 21
  • 31
  • I don't fully follow what you are asking. Do you need logic to determine if two things overlap? Or do you just want to select one of the three rows? – Gordon Linoff Jul 31 '20 at 13:33
  • Kind of both. I need a logic to: determine if two things overlap, and if yes, select only one of those based on some criteria. The output should be first and second entry only. I will add this to the OP. – fridaymeetssunday Jul 31 '20 at 13:40
  • What about seqid? Does it have to be the same to apply these conditions? – forpas Jul 31 '20 at 13:47
  • @forpas, yes. I will also add that to the question. – fridaymeetssunday Jul 31 '20 at 13:48
  • Do you seriously think including strings like the following aids clarity? 'EEDKKELELKYQNSEEKARNLKHSVDELQKRVNQSENSVPPPPPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLKRQAVEEMMDRIKKGVHLRPVNQTARPKTKPESSKGCESAVDELKGILGTLNKSTSSRSLKSLDPENSETELERILRRRKVTAEADSSSPTGILATSESKSMPVLGSVSSVTKTALNKKTLEAEFNSPSPPTPEPGEGPRKLEGCTSSKVT` – Strawberry Jul 31 '20 at 13:58
  • 1
    Fair point @Strawberry - edited now - though that is what the data looks like. – fridaymeetssunday Jul 31 '20 at 14:00

2 Answers2

1

You can use NOT EXISTS:

select p.* from protein p
where not exists (
  select 1 from protein
  where seqid = p.seqid and cutoff <> p.cutoff and seq_region <> p.seq_region
  and seq_region like '%' || p.seq_region || '%'
)

See the demo.

Or if you want to use the columns start and end to get the overlapping intervals:

select p.* from protein p
where not exists (
  select 1 from protein
  where seqid = p.seqid and cutoff <> p.cutoff and seq_region <> p.seq_region
  and start <= p.start and end >= p.end and (end - start) > (p.end - p.start)
)

See the demo.

Results:

| seqid  | start | end | cutoff | seq_region                                                                                                                                                                                                                                          |
| ------ | ----- | --- | ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| A0MZ66 | 280   | 290 | 0.75   | RIQHQQKVKEL                                                                                                                                                                                                                                         |
| A0MZ66 | 314   | 556 | 0.75   | EEDKKELELKYQNSEEKARNLKHSVDELQKRVNQSENSVPPPPPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLKRQAVEEMMDRIKKGVHLRPVNQTARPKTKPESSKGCESAVDELKGILGTLNKSTSSRSLKSLDPENSETELERILRRRKVTAEADSSSPTGILATSESKSMPVLGSVSSVTKTALNKKTLEAEFNSPSPPTPEPGEGPRKLEGCTSSKVT |
forpas
  • 160,666
  • 10
  • 38
  • 76
0

This is a "Gaps & Islands" problem. First, you'll need to identify rows from the same group, and then pick one from each one, according to your criteria. For example, you can resolve this query as shown below:

with
y as (
  select *,
    sum(st) over(partition by seqid order by start, end) as grp
  from (
    select *,
      case when start > 
             max(`end`) 
               over(partition by seqid 
               order by start, end
               rows between unbounded preceding and 1 preceding) 
           then 1 else 0 end as st
    from protein
  ) x
),
z as (
  select *,
    row_number() over(partition by seqid, grp 
      order by case when cutoff = 0.75 then 1 else 2 end,
               length(seq_region) desc) as rn
  from y
)
select * from z where rn = 1

Result:

seqid   start  end  cutoff  seq_region    st  grp  rn 
------- ------ ---- ------- ------------- --- ---- -- 
A0MZ66  280    290  0.75    RIQHQQKVKEL   0   0    1  
A0MZ66  314    556  0.75    EEDKKELELK... 1   1    1  

See running example at DB Fiddle.

The Impaler
  • 45,731
  • 9
  • 39
  • 76