3

Consider two tables:

Foo:
  id INT,
  name VARCHAR

Bar:
  id INT,
  foo_id INT REFERENCES Foo(id),
  event_type VARCHAR DEFAULT NULL,
  event_duration INT DEFAULT NULL

Each Foo item can have multiple Bar events. How to query for Foo items which do not have any Bar events that meet either of the following conditions:

  1. event_type that is not one of the following values: 'miss', 'scratch', 'scrape'
  2. event_duration that is not null

For instance, consider:

Foo id=1:
    event_type: hit       | event_duration: NULL
    event_type: poke      | event_duration: NULL
    event_type: capture   | event_duration: NULL

Foo id=2:
    event_type: hit       | event_duration: 2
    event_type: poke      | event_duration: NULL
    event_type: capture   | event_duration: NULL

Foo id=3:
    event_type: miss      | event_duration: NULL
    event_type: poke      | event_duration: NULL
    event_type: capture   | event_duration: NULL

Foo id=4:
    event_type: strike    | event_duration: NULL
    event_type: hit       | event_duration: NULL
    event_type: land      | event_duration: NULL

Only the Foo items with id=1 and id=4 should be returned. The item with id=2 should not be returned as one of it's event_duration is not NULL. The item with id=3 should not be returned as one of it's event_type is miss (which is on the list of forbidden event_types).

I have tried various ideas from this terrific answer, which responds to a generalization of the situation which I had hoped to learn enough from in order to build this query. Alas, I have been unable to generalize the answer enough to resolve this issue. This is one example of an unworking query, there were quite a few other failed attempts:

SELECT
    f.name
FROM
    Foo f JOIN Bar b ON f.id = b.foo_id
GROUP BY
    b.event_type, b.event_duration
HAVING
    b.event_type not in ('miss', 'scratch', 'scrape')
  AND
    b.event_duration not null

Here is another unworking query:

SELECT
    f.name
FROM
    (
    SELECT
        f.name, b.event_duration
    FROM
        Foo f JOIN Bar b ON f.id = b.foo_id
    GROUP BY
        b.event_type
    HAVING
        b.event_type not in ('miss', 'scratch', 'scrape')
    )
GROUP BY
    b.event_duration
HAVING
    b.event_duration not null

There were lots of other unworking queries with several ideas about JOINs and subqueries. Note that the Foo table has almost 5 million rows, and the Bar table has almost 2 million rows. The tables are indexed on the relevant fields, but O(n^2) are simply impossible on these large tables.

Community
  • 1
  • 1
dotancohen
  • 30,064
  • 36
  • 138
  • 197

4 Answers4

0

You can use NOT EXISTS to get the results you wanted.

SELECT f.name
FROM foo f
WHERE NOT EXISTS (SELECT 1 FROM bar b
                  WHERE b.foo_id = f.id
                  AND (b.event_type IN ('miss','scratch','scrape')
                       OR b.event_duration IS NOT NULL)
                  )
Tin Tran
  • 6,194
  • 3
  • 19
  • 34
  • Thank you. This query does express exactly what I need, but it does so in the absolute least efficient way. You are building a temporary table with all the results that are _not_ to be returned, then comparing each row in the extant database to ensure that they don't match. That means that the 5 million rows of the extant database are each compared to 5 million unmatching rows to find just a few that do match. That is an `O(n^2)` operation. – dotancohen Jan 26 '14 at 18:36
0

Try this, even though it sounds simple, I think it's OK for your case.

select f1.id, f1.name from
Foo f1 
left join 

(
     select distinct f.id 
     from Foo f
     join Bar b on f.id = b.foo_id
     where
     ( b.event_type IN ('miss','scratch','scrape') ) OR ( b.event_duration IS NOT NULL )
) f2 on f1.id = f2.id 

WHERE
(f2.id is null)
peter.petrov
  • 38,363
  • 16
  • 94
  • 159
  • Thank you, but this query cannot work on tables of the size specified (millions of rows). You are building a temporary table with all the results that are not to be returned, then comparing each row in the extant database to ensure that they don't match. That means that the 5 million rows of the extant database are each compared to 5 million unmatching rows to find just a few that do match. That is an O(n^2) operation. – dotancohen Jan 26 '14 at 18:39
  • I am not sure how you can do much better than inspecting all rows from Bar anyway. Then I wouldn't say the matching from f1 to f2 is O(n^2) because you match by id which is PK and is indexed. So it is not really as taking each row from f2 and matching it against each row from f1. I would roughly/intuitively say it is O(max(n,m)) where n and m are the sizes of Foo and Bar. But I am not sure. So well, OK, you might wait for an answer from a bigger expert here. Did you try my query btw? You might be surprised as I think your estimates are not really that close to how the DB does it. – peter.petrov Jan 26 '14 at 19:08
0

You could create a "counter cache" field in the Foo table that will just hold the count of the associated Bar items.

I think your problem will be faster solved with two queries:

  1. query to update the counter cache for all Foo items. Since this will look only at indexed columns it should be rather fast.

  2. second query that will NOT do any joins but will just query the Foo table for the criteria you want and having the "counter cache" value of 0.

With a smart management of the "counter cache" column you may have to run the first query only once. (Of course your program would have to then maintain the "counter cache" column in sync).

Ilie Pandia
  • 1,829
  • 14
  • 16
  • 1
    Thank you Ilie. This would work if I were searching for the same Bar event_types or durations each time. However in reality each query will be for a different list of event_types or durations. – dotancohen Jan 26 '14 at 19:25
  • I still think that a segmentation of the problem based of knowledge about how much data is available for various event types and durations will be the fastest. So you would have to generate some stats first. And based on those stats to find a strategy to query the database. A catch all query that does not take into account *the data distribution*, I think will be very slow in most cases. So I'd find a way to segment my data so that O(n^2) is not that bad. Create multiple "counter cache fields" if you have to... just find a way to reduce the number of rows that will enter the O(n^2) join. – Ilie Pandia Jan 26 '14 at 19:43
0

I would try this one

SELECT DISTINCT f.Id
FROM Foo f
WHERE NOT EXIST (
                 SELECT DISTINCT b.foo_id
                 WHERE b.foo_id = f.Id
                       AND   (b.event_type IN ('miss','scratch','scrape')                     
                               OR b.event_duration IS NOT NULL)
                 )

You can also use Merge like this :

  1. Create list of distinct Bar Ids that either have 'miss','scratch','scrape' in the event_type OR event_duration is not null
  2. Merge Foo and Bar
  3. Use the WHEN NOT MATCHED to find your result
Mzf
  • 5,210
  • 2
  • 24
  • 37
  • Thank you, but this is in MySQL, not SQL server. In any case, the `merge` suggestion is the same as the (untenable) subquery suggestion, just that the table is stored on disk. – dotancohen Jan 26 '14 at 19:27