I have a table with columns like this;
site, chromosome, strand.
The pair of site and chromosome should be unique while they can have more than one strand. Before loading the data, I found that some of the sites have more than one chromosome, which is obviously an error. I was trying to identify the errors, which is sites with more than 1 chromosome. I've thought about it and couldn't come up with proper SQL. So I divided the problem. First I create a table selecting distinct records by;
create table distinct_pair
as select distinct site, chromosome
from original_table;
Then I could find the sites that have more than one chromosome by this;
select site
from distinct_pair
group by site
having count(site)>1;
It worked fine. Then trying to see the whole information of the errors from the original table, I did this;
select * from original_table
where site
in (select name from distinct_pair
group by site
having count(site)>1);
Then this subquery was way too slow even though the columns were all indexed.
I tried to rewrite the query as a join but the having makes it difficult. Please help me.
===================
Thanks all of you who answered this question. My data look like this.
Site | Chromosome | Strand
N111 | 2L | +
N111 | 2L | -
N112 | 2L | +
N112 | 2L | -
N112 | 3L | +
N112 | 3L | -
....
In this case, N111 is fine but N112 is an error because it has two chromosome data. The subquery of the second answer picked N111 as well as N112 because of the strand, which was the same problem I had. The group by function with multi column worked different from what I guessed. However, the suggested answer gave me a clue how group by works so that I could modify it slightly to make it works. The two answers give the same results. Thanks again, you guys.
Site