0

I have two tables in MySQL that I'm comparing with the following attributes:

tbl_fac : facility_id, chemical_id, criteria
             10      , 25         , 50
             10      , 26         , 60
             10      , 27         , 60
             11      , 25         , 30
             11      , 27         , 31 
              etc...

tbl_samp: sample_id, chemical_id, result
            5     ,    25         , 51
            5     ,    26         , 61
            6     ,    25         , 51
            6     ,    26         , 61
            6     ,    27         , 500

              etc.... 

These tables are joined by chemical_id (many-to-many---- ugh), and there are several thousand facility_id's, and several hundred chemical_id's for each facility_id. There are also several thousand sample_id's, each with several hundred chemical_id's for each sample_id. All-in-all, there are around 500,000 records in tbl_fac, and 1,000,000+ records in tbl_samp.

I'm trying to extract three groups of sample_id's from this dataset:

Group 1: any sample_id where tbl_samp.result > tbl_fac.criteria (i.e., result exceeds criteria)

Group 2: any sample_id where tbl_samp.result < tbl_fac.criteria, AND all tbl_fac.chemical_id's are present for that sample_id (i.e., result is less than criteria, and everything is there)

Group 3: any sample_id where tbl_samp.result < tbl_fac.criteria, BUT one or more tbl_fac.chemical_id's are missing in the sample_id (i.e., result is less than criteria, but something is missing)

Here's the Question: How do I get all three Groups efficiently in one query?

I've tried:

select * 
from tbl_fac 
left join tbl_samp 
    on tbl_fac.chemical_id = tbl_samp.chemical_id

But this only yields values that are missing for the entire dataset (not the individual samples). I have a hackish query working that uses a third table to join tbl_fac and tbl_samp, but it is so ugly I'm actually embarrassed to post it....

As always, many thanks in advance for your thoughts on this one!

Cheers,

Josh

EDIT: Ideally, I would like the sample_id and Group returned -- with just one Group per sample ID (my knowledge of the data indicates that they will always fall into one of the three categories above).

Josh
  • 177
  • 1
  • 11

2 Answers2

1

This answer makes the assumption that there is a unique constraint on facility_id and chemical_id in tbl_fac and a unique constraint on sample_id and chemical_id in tbl_samp. What I did was build up the query one step at a time. Whether this is efficient remains to be seen.

Group 1: any sample_id where tbl_samp.result > tbl_fac.criteria (i.e., result exceeds criteria)

SELECT tbl_samp.sample_id,
       'ResultsGreaterThanCriteria' AS samplegroup
FROM   tbl_fac
       INNER JOIN tbl_samp
         ON tbl_fac.chemical_id = tbl_samp.chemical_id
WHERE  tbl_samp.result > tbl_fac.criteria
GROUP  BY tbl_samp.sample_id

Group 2: any sample_id where tbl_samp.result < tbl_fac.criteria, AND all tbl_fac.chemical_id's are present for that sample_id (i.e., result is less than criteria, and everything is there)

SELECT tbl_samp.sample_id,
       'ResultLessThanCriteriaAndAllChems' AS samplegroup
FROM   tbl_fac
       INNER JOIN tbl_samp
         ON tbl_fac.chemical_id = tbl_samp.chemical_id
WHERE  tbl_samp.result < tbl_fac.criteria
       AND NOT EXISTS (SELECT *
                       FROM   tbl_fac tf
                              LEFT JOIN tbl_samp ts
                                ON tf.chemical_id = ts.chemical_id
                       WHERE  ts.chemical_id IS NULL
                              AND tbl_samp.sample_id = ts.sample_id)
GROUP  BY tbl_samp.sample_id

Group 3: any sample_id where tbl_samp.result < tbl_fac.criteria, BUT one or more tbl_fac.chemical_id's are missing in the sample_id (i.e., result is less than criteria, but something is missing)

SELECT tbl_samp.sample_id,
       'ResultsLessThanCriteriaWithMissingChems' AS samplegroup
FROM   tbl_fac
       INNER JOIN tbl_samp
         ON tbl_fac.chemical_id = tbl_samp.chemical_id
WHERE  tbl_samp.result < tbl_fac.criteria
       AND EXISTS (SELECT *
                   FROM   tbl_fac tf
                          LEFT JOIN tbl_samp ts
                            ON tf.chemical_id = ts.chemical_id
                   WHERE  ts.chemical_id IS NULL
                          AND tbl_samp.sample_id = ts.sample_id)
GROUP  BY tbl_samp.sample_id 

And finally, you union all three queries together and get:

SELECT tbl_samp.sample_id,
       'ResultsGreaterThanCriteria' AS samplegroup
FROM   tbl_fac
       INNER JOIN tbl_samp
         ON tbl_fac.chemical_id = tbl_samp.chemical_id
WHERE  tbl_samp.result > tbl_fac.criteria
GROUP  BY tbl_samp.sample_id
UNION ALL
SELECT tbl_samp.sample_id,
       'ResultLessThanCriteriaAndAllChems' AS samplegroup
FROM   tbl_fac
       INNER JOIN tbl_samp
         ON tbl_fac.chemical_id = tbl_samp.chemical_id
WHERE  tbl_samp.result < tbl_fac.criteria
       AND NOT EXISTS (SELECT *
                       FROM   tbl_fac tf
                              LEFT JOIN tbl_samp ts
                                ON tf.chemical_id = ts.chemical_id
                       WHERE  ts.chemical_id IS NULL
                              AND tbl_samp.sample_id = ts.sample_id)
GROUP  BY tbl_samp.sample_id
UNION ALL
SELECT tbl_samp.sample_id,
       'ResultsLessThanCriteriaWithMissingChems' AS samplegroup
FROM   tbl_fac
       INNER JOIN tbl_samp
         ON tbl_fac.chemical_id = tbl_samp.chemical_id
WHERE  tbl_samp.result < tbl_fac.criteria
       AND EXISTS (SELECT *
                   FROM   tbl_fac tf
                          LEFT JOIN tbl_samp ts
                            ON tf.chemical_id = ts.chemical_id
                   WHERE  ts.chemical_id IS NULL
                          AND tbl_samp.sample_id = ts.sample_id)
GROUP  BY tbl_samp.sample_id 
gangreen
  • 849
  • 7
  • 9
  • I **might** have a couple instances where there are duplicates of chemical_id for a give facility_id or sample_id -- but I see where you're going with this... taking a look at it now. Thanks for the input! Looks great! – Josh Mar 06 '12 at 12:53
  • This is very, very close. Just so I understand, the "ResultLessThanCriteriaAndAllChems" group first links the tbl_fac and tbl_samp by the chemical_id and confirms that the result < criteria. Then, with the "NOT EXISTS" nested query, it checks whether the chemical is missing. Correct? One thing I'm not clear on -- why do you link "AND tbl_fac.facility_id = tf.facility_id" -- is that to make sure only the chemical_id's for the correct facility are checked in the nested query? – Josh Mar 06 '12 at 17:28
  • I don't ever seem to get any "ResultLessThanCriteriaAndAllChems" results. Any thoughts? – Josh Mar 06 '12 at 19:22
  • To be honest, your data model confuses me a bit :-/ so I joined the subquery to the wrong field. Normally, M:M relationships have a junction table. With only 2 tables, I half expect a 1:M or 1:1 relationship somewhere. What is the primary key on these tables? And yes, the idea of the nested query is to see if a chemical is missing from the sample that (currently) has a result less than the criteria. I edited the SQL to link the subqueries back to the main query by the sample_id. – gangreen Mar 07 '12 at 01:09
  • This actually works perfectly now. There isn't a 1:1 relationship here (thought I mentioned that this was many-to-many - which is very annoying but I don't have control over it), but based on your work, I was able to take a huge leap forward and get it working. Thanks so much! – Josh Mar 07 '12 at 15:47
  • As I was trying to solve the problem, I came up with imaginary business rules that would require your data model (to make it easier to understand for me). I wasn't saying that I thought the relationship was 1:1, just that I temporarily (and incorrectly) imagined it to be so. Anyway, glad to have helped :) – gangreen Mar 08 '12 at 01:37
1
SELECT
    sample_id,
    IF(result = criteria, -1,  /* unspecified behavior */
     IF(result > criteria, 1,
      IF(nb_chemicals = total_nb_chemicals, 2, 3))) AS grp

FROM (
    SELECT s.result, s.sample_id, f.criteria, f.chemical_id,
        COUNT(DISTINCT f.chemical_id) AS nb_chemicals
    FROM tbl_fac f JOIN tbl_samp s
        ON f.chemical_id = s.chemical_id
    GROUP BY s.sample_id
) t 

CROSS JOIN (
    SELECT COUNT(DISTINCT chemical_id) AS total_nb_chemicals
    FROM tbl_fac
) u

New solution:

SELECT
    s.sample_id,
    IF(s.result = f.criteria, -1,  /* unspecified behavior */
     IF(s.result > f.criteria, 1,
      IF(sample_nb_chemicals = total_nb_chemicals, 2, 3))) AS grp

FROM
    tbl_fac f JOIN tbl_samp s
    ON f.chemical_id = s.chemical_id

    JOIN (
        SELECT s.sample_id, 
               COUNT(DISTINCT f.chemical_id) AS sample_nb_chemicals
        FROM tbl_fac f JOIN tbl_samp s
             ON f.chemical_id = s.chemical_id
        GROUP BY s.sample_id
    ) u
       ON s.sample_id = u.sample_id

    CROSS JOIN (
        SELECT COUNT(DISTINCT chemical_id) AS total_nb_chemicals
        FROM tbl_fac
    ) v

GROUP BY sample_id, grp
mwhite
  • 2,041
  • 1
  • 16
  • 21
  • Thanks for the response! Going through it now. – Josh Mar 06 '12 at 12:50
  • I think the nested query you have in the center of this query only returns one result for each sample_id. I actually need to compare every single result to each criteria to check whether, for a given sample_id, the sample_id has Group_1, Group_2, or Group_3 attributes. That said -- your query is very fast, but it only analyzes the first result (whatever that may be) returned from your nested query group function. This is exactly the problem I was having when I was trying to figure this out! Any thoughts? – Josh Mar 06 '12 at 16:55
  • Yes, you're right. I thought your three groups of sample_ids were known to be mutually exclusive, which would mean you only need to look at one chemical/result/criteria combo for each sample_id. I'll think about how to make it work if that's not the case. – mwhite Mar 06 '12 at 21:29
  • (by the way, I fixed it to reflect that.) – mwhite Mar 07 '12 at 20:43
  • Mike - this turned out to be a great option. Thanks! – Josh Mar 19 '12 at 22:20