
I'm working on a reporting system that allows the user to arbitrarily query a set of fact tables, constraining on multiple dimension tables for each fact table. I've written a query-builder class that automatically assembles all the correct joins and subqueries based on the constraint parameters, and everything works as designed.

But I have a feeling that I'm not generating the most efficient queries. On a set of tables with a few million records, these queries take about 10 seconds to run, and I'd like to get them down into the less-than-one-second range. I have a feeling that, if I could get rid of the subqueries, the result would be much more efficient.

Rather than show you my actual schema (which is much more complicated), I'll show you an analogous example that illustrates the point without having to explain my whole application and data model.

Imagine that I have a database of concert information, with artists and venues. Users can arbitrarily tag the artists and the venues. So the schema looks like this:

concert
  id
  artist_id
  venue_id
  date

artist
  id
  name

venue
  id
  name

tag
  id
  name

artist_tag
  artist_id
  tag_id

venue_tag
  venue_id
  tag_id

Pretty simple.

Now let's say I want to query the database for all concerts happening within one month of today, by artists with both the 'techno' and 'trombone' tags, at venues with both the 'cheap-beer' and 'great-mosh-pits' tags.

The best query I've been able to come up with looks like this:

SELECT
  concert.id AS concert_id,
  concert.date AS concert_date,
  artist.id AS artist_id,
  artist.name AS artist_name,
  venue.id AS venue_id,
  venue.name AS venue_name
FROM
  concert
INNER JOIN artist
  ON artist.id = concert.artist_id
INNER JOIN venue
  ON venue.id = concert.venue_id
WHERE (
  artist.id IN (
    SELECT artist_id
    FROM artist_tag
    INNER JOIN tag AS a on (
      a.id = artist_tag.tag_id
      AND
      a.name = 'techno'
    ) INNER JOIN tag AS b on (
      b.id = artist_tag.tag_id
      AND
      b.name = 'trombone'
    )
  )
  AND
  venue.id IN (
    SELECT venue_id
    FROM venue_tag
    INNER JOIN tag AS a on (
      a.id = venue_tag.tag_id
      AND
      a.name = 'cheap-beer'
    ) INNER JOIN tag AS b on (
      b.id = venue_tag.tag_id
      AND
      b.name = 'great-mosh-pits'
    )
  )
  AND
  concert.date BETWEEN NOW() AND (NOW() + INTERVAL 1 MONTH)
)

The query works, but I really don't like having those multiple subqueries. If I could accomplish the same logic purely using JOIN logic, I have a feeling the performance would drastically improve.
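For this all-of-these-tags (AND) shape of constraint, one subquery-free rewrite is relational division by repeated joins: join the tag bridge table once per required tag. A runnable sketch of just the artist half, using SQLite via Python (schema follows the example above; the artist names and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE artist_tag (artist_id INTEGER, tag_id INTEGER);
INSERT INTO artist VALUES (1, 'Brassinated'), (2, 'Bach Street Boys');
INSERT INTO tag VALUES (1, 'techno'), (2, 'trombone');
-- Artist 1 has both tags; artist 2 has only 'techno'.
INSERT INTO artist_tag VALUES (1, 1), (1, 2), (2, 1);
""")

# One join per required tag: an artist survives only if a matching
# artist_tag row exists for *each* required tag name.
rows = conn.execute("""
SELECT a.id, a.name
FROM artist AS a
JOIN artist_tag AS at1 ON at1.artist_id = a.id
JOIN tag AS t1 ON t1.id = at1.tag_id AND t1.name = 'techno'
JOIN artist_tag AS at2 ON at2.artist_id = a.id
JOIN tag AS t2 ON t2.id = at2.tag_id AND t2.name = 'trombone'
""").fetchall()
print(rows)  # only artist 1 carries both tags
```

The drawback is that the number of joins grows with the number of required tags, which is exactly what the query builder would have to generate dynamically.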

In a perfect world, I'd be using a real OLAP server. But my customers will be deploying to MySQL or MSSQL or Postgres, and I can't guarantee that a compatible OLAP engine will be available. So I'm stuck using an ordinary RDBMS with a star schema.

Don't get too hung up on the details of this example (my real application has nothing to do with music, but it has multiple fact tables with an analogous relationship to the ones I've shown here). In this model, the 'artist_tag' and 'venue_tag' tables function as fact tables, and everything else is a dimension.

It's important to note, in this example, that the queries are much simpler to write if I only allow the user to constrain against a single artist_tag or venue_tag value. It only gets really tricky when I allow the queries to include AND logic, requiring multiple distinct tags.

So, my question is: what are the best techniques that you know of for writing efficient queries against multiple fact tables?

benjismith
  • I feel that the crux of the matter here is really the AND nature of the query, rather than the "multiple fact tables". (Although they do compound each other.) The answer I give below addresses this by performing the AND component of the query in a HAVING clause, instead of necessitating joins to the same fact tables multiple times. – MatBailie Apr 18 '09 at 17:26
  • Time to mark as solved/close/... :) – adolf garlic Aug 11 '10 at 13:05

3 Answers


My approach is a bit more generic, putting the filter parameters in tables and then using GROUP BY, HAVING and COUNT to filter the results. I've used this basic approach several times for some very sophisticated 'searching' and it works very well (for me grin).

I also don't join on the Artist and Venue dimension tables initially. I'd get the results as id's (just needing artist_tag and venue_tag) then join the results on the artist and venue tables to get those dimension values. (Basically, search for the entity id's in a sub query, then in an outer query get the dimension values you need. Keeping them separate should improve things...)

DECLARE @artist_filter TABLE (
  tag_id INT
)

DECLARE @venue_filter TABLE (
  tag_id INT
)

INSERT INTO @artist_filter
SELECT id FROM tag
WHERE name IN ('techno','trombone')

INSERT INTO @venue_filter
SELECT id FROM tag
WHERE name IN ('cheap-beer','great-mosh-pits')


SELECT
  concert.id AS concert_id,
  concert.date AS concert_date,
  artist_tag.artist_id AS artist_id,
  venue_tag.venue_id AS venue_id
FROM
  concert
INNER JOIN
  artist_tag
    ON artist_tag.artist_id = concert.artist_id
INNER JOIN
  @artist_filter AS [artist_filter]
    ON [artist_filter].tag_id = artist_tag.tag_id
INNER JOIN
  venue_tag
    ON venue_tag.venue_id = concert.venue_id
INNER JOIN
  @venue_filter AS [venue_filter]
    ON [venue_filter].tag_id = venue_tag.tag_id
WHERE
  concert.date BETWEEN NOW() AND (NOW() + INTERVAL 1 MONTH)
GROUP BY
  concert.id,
  concert.date,
  artist_tag.artist_id,
  venue_tag.venue_id
HAVING
  COUNT(DISTINCT [artist_filter].tag_id) = (SELECT COUNT(*) FROM @artist_filter)
  AND
  COUNT(DISTINCT [venue_filter].tag_id)  = (SELECT COUNT(*) FROM @venue_filter)

(I'm on a netbook and suffering for it, so I'll leave out the outer query getting the artist and venue names from the artist and venue tables grin)
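The table-variable syntax above is SQL Server specific (while NOW()/INTERVAL is MySQL); the same pattern ports to an ordinary temporary table on any of the target engines. A runnable sketch of just the artist half, using SQLite via Python (invented data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE artist_tag (artist_id INTEGER, tag_id INTEGER);
CREATE TEMP TABLE artist_filter (tag_id INTEGER);

INSERT INTO tag VALUES (1, 'techno'), (2, 'trombone'), (3, 'polka');
-- Artist 1 has both wanted tags, artist 2 only one, artist 3 an unrelated tag.
INSERT INTO artist_tag VALUES (1, 1), (1, 2), (2, 1), (3, 3);
INSERT INTO artist_filter SELECT id FROM tag WHERE name IN ('techno', 'trombone');
""")

# Group by artist and demand that the number of distinct matched filter
# tags equals the size of the filter set (relational division).
rows = conn.execute("""
SELECT artist_tag.artist_id
FROM artist_tag
JOIN artist_filter ON artist_filter.tag_id = artist_tag.tag_id
GROUP BY artist_tag.artist_id
HAVING COUNT(DISTINCT artist_tag.tag_id) = (SELECT COUNT(*) FROM artist_filter)
""").fetchall()
print(rows)  # only artist 1 matches every filter tag
```

Because the tag list lives in a table, the same query text handles any number of required tags.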

EDIT
Note:

Another option would be to filter the artist_tag and venue_tag tables in sub-queries/derived-tables. Whether this is worth it depends on how influential the join on the Concert table is. My assumption here is that there are MANY artists and venues, but once filtered on the concert table (itself filtered by the dates) the number of artists/venues decreases dramatically.

Also, there is often a need/desire to deal with the case where NO artist_tags and/or venue_tags are specified. From experience it is better to deal with this programmatically. That is, use IF statements and queries specially suited to those cases. A single SQL query CAN be written to handle it, but is much slower than the programmatic alternative. Equally, writing similar queries several times may look messy and degrade maintainability, but the increase in complexity needed to make this a single query is often harder to maintain.

EDIT

Another similar layout could be...
- Filter concert by artist as sub_query/derived_table
- Filter results by venue as sub_query/derived_table
- Join results on dimension tables to get names, etc

(Cascaded filtering)

SELECT
   <blah>
FROM
  (
    SELECT
      <blah>
    FROM
      (
        SELECT
          <blah>
        FROM
          concert
        INNER JOIN
          artist_tag
        INNER JOIN
          artist_filter
        WHERE
        GROUP BY
        HAVING
      )
    INNER JOIN
      venue_tag
    INNER JOIN
      venue_filter
    GROUP BY
    HAVING
  )
INNER JOIN
  artist
INNER JOIN
  venue

By cascading the filtering, each subsequent filter has a reduced set to work on. This MAY reduce the work done by the GROUP BY - HAVING section of the query. For two levels of filtering I would guess this is unlikely to be dramatic.

The original may still be more performant, as it benefits from additional filtering in a different manner. In your example:
- There may be many artists in your date range, but few that meet at least one of the criteria
- There may be many venues in your date range, but few that meet at least one of the criteria
- Before the GROUP BY, however, all concerts are eliminated where...
---> the artist(s) meet NONE of the criteria
---> AND/OR the venue meets NONE of the criteria

Where you are searching by many criteria this filtering degrades. Also where venues and/or artists share a lot of tags, the filtering also degrades.

So when would I use the original, or when would I use the Cascaded version?
- Original : Few search criteria and venues/artists are dis-similar from each other
- Cascaded : Lots of search criteria or venues/artists tend to be similar
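A runnable sketch of the cascaded shape, using SQLite via Python (invented data; the inner derived table applies the artist filter, the outer query applies the venue filter):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concert (id INTEGER, artist_id INTEGER, venue_id INTEGER);
CREATE TABLE artist_tag (artist_id INTEGER, tag_id INTEGER);
CREATE TABLE venue_tag (venue_id INTEGER, tag_id INTEGER);
CREATE TEMP TABLE artist_filter (tag_id INTEGER);
CREATE TEMP TABLE venue_filter (tag_id INTEGER);

-- Concert 1: artist 1 (both artist tags) at venue 1 (both venue tags).
-- Concert 2: artist 2 (only one artist tag) at the same venue.
INSERT INTO concert VALUES (1, 1, 1), (2, 2, 1);
INSERT INTO artist_tag VALUES (1, 10), (1, 11), (2, 10);
INSERT INTO venue_tag VALUES (1, 20), (1, 21);
INSERT INTO artist_filter VALUES (10), (11);
INSERT INTO venue_filter VALUES (20), (21);
""")

# The derived table keeps only concerts whose artist matches every
# artist_filter tag; the outer query then repeats the pattern for venues.
rows = conn.execute("""
SELECT c.id
FROM (
    SELECT concert.id, concert.venue_id
    FROM concert
    JOIN artist_tag ON artist_tag.artist_id = concert.artist_id
    JOIN artist_filter ON artist_filter.tag_id = artist_tag.tag_id
    GROUP BY concert.id, concert.venue_id
    HAVING COUNT(DISTINCT artist_tag.tag_id) = (SELECT COUNT(*) FROM artist_filter)
) AS c
JOIN venue_tag ON venue_tag.venue_id = c.venue_id
JOIN venue_filter ON venue_filter.tag_id = venue_tag.tag_id
GROUP BY c.id
HAVING COUNT(DISTINCT venue_tag.tag_id) = (SELECT COUNT(*) FROM venue_filter)
""").fetchall()
print(rows)  # only concert 1 passes both filter stages
```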

MatBailie
  • I didn't use the "tag_artist_user" table as it didn't impact the results in your example – MatBailie Apr 18 '09 at 16:28
  • Oops. The "tag_artist_user" table was an artifact of a previous draft of the query. Just edited the original post to remove it. – benjismith Apr 18 '09 at 17:52
  • I like the approach of using filter tables, but not using table variables. You have no indexes on these. It is possible to have an index on a table variable but for fair reasons there are no statistics. Your solution is SQL Server specific too. If you use a table variable, SQL Server will generate an execution plan that assumes the table variable has a single row (due to no statistics). It might perform OK if there aren't many rows in the table variable, but when there are more then performance will go downhill. – Davos Mar 09 '15 at 00:31
  • @davos - This is essentially querying an EAV structure, which is already slow, there is no way to avoid that; this is about maintainability and flexibility of the code. If there are going to be a "large" number of rows in the filter tables, it's going to be slow no matter what. It should also be noted that the filter tables are always going to be subject to a full table scan. In this case the pertinent indexes are those that are on the persistent tables, not the filters. Finally, `BETWEEN NOW() AND (NOW() + INTERVAL 1 MONTH)` is hardly MS SQL Server specific ;) – MatBailie Mar 09 '15 at 19:47
  • Sorry, MySQL Specific. I agree that it will always do a full scan on the filter tables, but it could be an index scan rather than a table scan, which if the tables are large can provide some shortcuts like being able to use a merge or hash algorithm in the join and estimating rows (and memory usage) accurately. A unique index provides guarantees to the query engine. – Davos Mar 12 '15 at 02:58

Denormalize the model. Include the tag name in venue and artist tables. This way, you avoid a many to many relationship and you have a simple star schema.

With this denormalization, the WHERE clause only needs to check the additional tag_name field in the two tables (artist and venue).
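A minimal sketch of what this might look like, using SQLite via Python (invented data). Note that the AND-of-tags case still needs a GROUP BY, because each tag lives on a different copy of the artist row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized: the tag name lives directly on the artist row,
-- so an artist with N tags becomes N rows.
CREATE TABLE artist (id INTEGER, name TEXT, tag_name TEXT);
INSERT INTO artist VALUES
  (1, 'Brassinated', 'techno'),
  (1, 'Brassinated', 'trombone'),
  (2, 'Bach Street Boys', 'techno');
""")

# Requiring BOTH tags means requiring one row per wanted tag,
# hence the COUNT(DISTINCT ...) over the grouped duplicates.
rows = conn.execute("""
SELECT id, name
FROM artist
WHERE tag_name IN ('techno', 'trombone')
GROUP BY id, name
HAVING COUNT(DISTINCT tag_name) = 2
""").fetchall()
print(rows)
```

The duplicate rows are also the source of the aggregation concern raised in the comments below.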

Cătălin Pitiș
  • But if I denormalize, how do I allow an artist or venue to have multiple tags? The thing is, I really CAN'T eliminate the many-to-many relationship without totally crippling the model. – benjismith Apr 18 '09 at 16:22
  • You'll have multiple records for the same artist, but with different tags. It is usual practice in data warehousing to have denormalized data, to improve query performance. This is one of the reasons why ETL jobs (Extract-Transform-Load data) are employed: to convert from a normalized relational model to a data warehouse specific model (dimensional or star model). – Cătălin Pitiș Apr 18 '09 at 17:04
  • Agreed, on a couple of assumptions. This can cause a dramatic increase in data size, is the space available? (Come on, drives are cheap...) With changable data, refreshing the denormalised data be costly in terms of cpu, etc. Is the data relatively static, and/or can it be ETLed over night, etc? If so, such denormalising (to flat file format, for example) Can be Extremely beneficial to reporting. – MatBailie Apr 18 '09 at 17:16
  • I was thinking about doing it that way. But I'm going to be running a lot of aggregate queries (count, sum, avg, stddev, etc) and the duplicate records would screw up the calculations. For counting, I could use DISTINCT, but how would I eliminate the discrepancies for other aggregate functions? – benjismith Apr 19 '09 at 02:22
  • That completely depends on the exact queries you want to write and the exact schema. My experience has been that "data-warehousing" doesn't involve storing the data in One new format, but several. Each optimised for different reporting needs. It would be in that "ETL" that you would arrange the data to be suitable for different Aggregates, OR actually do the aggregates there. Either way, you may find that you need to solve your original question first... – MatBailie Apr 19 '09 at 07:35

This situation is not technically multiple fact tables; you have a many-to-many relationship between venues & tags, and another between artists & tags.

I think MatBailie provides some interesting examples above, but I feel this can be much simpler if you handle the parameters in your application in a helpful way.

Apart from the user generated query on the fact table, you need two static queries to provide the parameter options to the user in the first place. One of them is a list of Venue-appropriate tags, the other is for Artist-appropriate tags.

Venue appropriate tags:

SELECT DISTINCT tag_id, tag.name as VenueTagName
FROM venue_tag 
INNER JOIN tag 
ON venue_tag.tag_id = tag.id

Artist appropriate tags:

SELECT DISTINCT tag_id, tag.name as ArtistTagName
FROM artist_tag 
INNER JOIN tag 
ON artist_tag.tag_id = tag.id

Those two queries drive some drop-down or other parameter-selection controls. In a reporting system, you should try to avoid passing string values around. In your application, you present the string name to the user, but pass the integer ID back to the database.
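As an illustration of that ID-passing (SQLite placeholder syntax via Python; exact placeholder style varies by driver), the application keeps the integer IDs behind the displayed names and binds them as parameters rather than interpolating strings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO tag VALUES (1, 'cheap-beer'), (2, 'great-mosh-pits'), (3, 'stadium');
""")

# The UI shows tag names but remembers the integer IDs; the query is
# built with one placeholder per chosen ID, never interpolated strings.
chosen_ids = [1, 2]
placeholders = ", ".join("?" for _ in chosen_ids)
rows = conn.execute(
    f"SELECT id, name FROM tag WHERE id IN ({placeholders}) ORDER BY id",
    chosen_ids,
).fetchall()
print(rows)
```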

e.g. When the user chooses the tags, you take the tag.id values and provide them to your query (where I have the (1,2) and the (100,200) bit below) :

 SELECT
  concert.id AS concert_id,
  concert.date AS concert_date,
  artist.id AS artist_id,
  artist.name AS artist_name,
  venue.id AS venue_id,
  venue.name AS venue_name
FROM 
concert
INNER JOIN artist 
    ON artist.id = concert.artist_id
INNER JOIN artist_tag
    ON artist.id = artist_tag.artist_id
INNER JOIN venue 
    ON venue.id = concert.venue_id
INNER JOIN venue_tag
    ON venue.id = venue_tag.venue_id
WHERE venue_tag.tag_id IN (1, 2)      -- assumes IDs 1 and 2 map to 'cheap-beer' and 'great-mosh-pits'
AND   artist_tag.tag_id IN (100, 200) -- assumes IDs 100 and 200 map to 'techno' and 'trombone'. Sounds like a wild night of drunken moshing to brass-band techno!
AND concert.date BETWEEN NOW() AND (NOW() + INTERVAL 1 MONTH)
Davos
  • Note that `WHERE venue_tag.tag_id in ( 1,2 )` doesn't meet the OP's requirements. That gives venues with `cheap-beer` OR `great-moshpits` but the OP wants to get venues that have `cheap-beer` AND `great-moshpits`. That involves checking multiple rows *(one row having `cheap-beer` and one row having `great-moshpits` and then requiring that both rows must exist for the same venue)*. Also, SQL is notoriously poor at parameterising lists. What if the OP requires `cheap-beer` AND `great-moshpits` AND `free-entry`? This answer doesn't provide a generalised approach for proving `n` attributes. – MatBailie Mar 09 '15 at 19:56
  • @MatBailie Yes I see you're right, I hadn't considered the AND requirement for tags. My example only handles the OR example. I think my point about parameter handling is still valid, but I see why you are comparing the counts of tags in the HAVING clause in your first example, which is indeed generalised so +1. – Davos Mar 12 '15 at 02:52