0

I have a table with columns like this;

site, chromosome, strand.

The pair of site and chromosome should be unique while they can have more than one strand. Before loading the data, I found that some of the sites have more than one chromosome, which is obviously an error. I was trying to identify the errors, which is sites with more than 1 chromosome. I've thought about it and couldn't come up with proper SQL. So I divided the problem. First I create a table selecting distinct records by;

create table distinct_pair 
  as select distinct site, chromosome 
  from original_table;

Then I could find the sites that have more than one chromosome by this;

select site 
  from distinct_pair 
  group by site 
  having count(site)>1;

It worked fine. Then trying to see the whole information of the errors from the original table, I did this;

select * from original_table 
  where site 
  in (select name from distinct_pair
        group by site 
        having count(site)>1);

Then this subquery was way too slow even though the columns were all indexed.

I tried to rewrite the query as a join but the having makes it difficult. Please help me.

===================

Thanks all of you who answered this question. My data look like this.

Site | Chromosome | Strand
N111 | 2L         | +
N111 | 2L         | -
N112 | 2L         | +
N112 | 2L         | -
N112 | 3L         | +
N112 | 3L         | -
....

In this case, N111 is fine but N112 is an error because it has two chromosome data. The subquery of the second answer picked N111 as well as N112 because of the strand, which was the same problem I had. The group by function with multi column worked different from what I guessed. However, the suggested answer gave me a clue how group by works so that I could modify it slightly to make it works. The two answers give the same results. Thanks again, you guys.

Site

microbe
  • 3
  • 2

2 Answers2

0

Looks like you want something like this :

     SELECT site, chromosome, strand
       FROM original_table O
INNER JOIN (SELECT site, chromosome
            FROM original_table
            GROUP BY site, chromosome
            HAVING COUNT(*) > 1) T
         ON USING (site)
        AND USING (chromosome)

The subquery selects the site and chromosome pairs that are repeated more than once, then you join it to the the big table. Since it's an INNER JOIN, it only returns the rows who has a match in the subquery.

Vincent Savard
  • 34,979
  • 10
  • 68
  • 73
  • Thanks. It works, too with a little bit of modification. I should've said the data more clearly. Here is the modified MySQL version. select name, chromosome from (select name, chromosome from nature06929_s2 group by name, chromosome ) as T group by name having count(name) > 1 ; – microbe Nov 26 '10 at 14:59
  • I tried to explain a bit more but I'm not familiar how to format my comments nicely. sorry... – microbe Nov 26 '10 at 15:12
0

You could just find the one with different chromosome for a given site:

SELECT DISTINCT t1.site, t1.chromosome, t2.chromosome
FROM original_table t1
    INNER JOIN original_table t2 USING (site)
WHERE t1.chromosome <&gt t2.chromosome
Danosaure
  • 3,578
  • 4
  • 26
  • 41
  • Thanks, it works. Here is MySQL version of your answer. (I don't know how to format in comments, though). select distinct t1.name, t1.chromosome, t2.chromosome from nature06929_s2 as t1 inner join nature06929_s2 as t2 using (name) where t1.chromosome <> t2.chromosome; – microbe Nov 26 '10 at 14:56
  • @microbe: No formatting in comments. I used `site` and `original_table` because they were used in your OP. – Danosaure Nov 26 '10 at 22:33