
As a first note: I only have read access to my server. Just FYI, as it seems to come up a lot...

Server: DB2 (6.1) for i (IBM)

I have a query I'm running on a table that has 19 million rows in it (I don't design the tables, I just query them). I've been limiting my return data to 10 rows (*) until I get this query sorted out, so that return times are a bit more reasonable.

The basic design is that I need to get data about the categories of products we sell on a week-by-week basis, using the columns WEEK_ID and CATEGORY. Here's example code (with some identifying bits ####'d out):

SELECT WEEK_ID, CATEGORY
FROM DWQ####.SLSCATW
INNER JOIN DW####.CATEGORY
ON DWQ####.SLSCATW.CATEGORY_NUMBER = DW####.CATEGORY.CATEGORY_NUMBER
WHERE WEEK_ID BETWEEN 200952 AND 201230 --Format is year/week
GROUP BY WEEK_ID, CATEGORY

If I comment out that last line I can get back 100 rows in 254 ms. If I put that line back in, the query takes longer than I've had the patience to wait for :-). (The longest I've waited is 10 minutes.)

This question has two parts. The first is quite rudimentary: is this normal? There are roughly 50 categories and about 140 weeks that I'm trying to condense down to. I realize that's a lot of information to condense from 19 million rows, but I was hoping that limiting my query to 10 returned rows would keep the run time down.

And, if I'm not just a complete n00b, and this in fact should not take several minutes, what exactly is wrong with my SQL?

I've Googled WHERE-clause optimization and can't seem to find anything useful. All links and explanations are more than welcome.

Apologies for such a newbie post... we all have to start somewhere, right?

(*) Using SQLExplorer, my IDE, an Eclipse-based implementation of Squirrel SQL.

Jay Carr

2 Answers


I'm not sure how the server handles GROUP BY when there are no aggregate functions in the query. Based on your answers in the comments, I'd just try adding those:

SELECT
    ...,
    SUM(SalesCost) as SalesCost,
    SUM(SalesDollars) as SalesDollars
FROM
    ...

Leave the rest of the query as is.
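For concreteness, the whole query with the aggregates added might look roughly like this. I'm assuming SalesCost and SalesDollars live on the SLSCATW table, since I don't know your schema, so adjust the names as needed; the table aliases are just for readability:

SELECT
    W.WEEK_ID,
    C.CATEGORY,
    SUM(W.SalesCost) AS SalesCost,       -- assumed to be columns on SLSCATW
    SUM(W.SalesDollars) AS SalesDollars
FROM DWQ####.SLSCATW AS W
INNER JOIN DW####.CATEGORY AS C
    ON W.CATEGORY_NUMBER = C.CATEGORY_NUMBER
WHERE W.WEEK_ID BETWEEN 200952 AND 201230 --Format is year/week
GROUP BY W.WEEK_ID, C.CATEGORY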

If that doesn't solve the problem, you might have missing indexes. I would try to find out if there's an index where WEEK_ID is the only column, or where it is the first column. You could also check whether you have another temporal column (e.g. TransactionDate or something similar) on the same table that is already indexed. If so, you could use that in the WHERE clause instead.
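On DB2 for i you can usually see for yourself which indexes exist, even with read-only access, by querying the QSYS2 catalog views. Something along these lines should work, though view and column names can vary a little between releases, so treat it as a sketch:

SELECT K.INDEX_NAME, K.COLUMN_NAME, K.ORDINAL_POSITION
FROM QSYS2.SYSKEYS K
INNER JOIN QSYS2.SYSINDEXES I
    ON K.INDEX_SCHEMA = I.INDEX_SCHEMA
    AND K.INDEX_NAME = I.INDEX_NAME
WHERE I.TABLE_SCHEMA = 'DWQ####'   -- the schema/table from your query
    AND I.TABLE_NAME = 'SLSCATW'
ORDER BY K.INDEX_NAME, K.ORDINAL_POSITION

An index that lists WEEK_ID at position 1 is the one you're hoping to see.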

Without the right indexes, the database server is forced to do a complete table scan, and that could explain your performance issues. Tens of millions of rows take a not insignificant amount of time to read from disk.

Also check that the data type of WEEK_ID is int or similar, just to avoid unnecessary casting in your query.
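That's also something you can check from the catalog, e.g. (same caveat as above about read-only access and exact view names):

SELECT COLUMN_NAME, DATA_TYPE, LENGTH
FROM QSYS2.SYSCOLUMNS
WHERE TABLE_SCHEMA = 'DWQ####'
    AND TABLE_NAME = 'SLSCATW'
    AND COLUMN_NAME = 'WEEK_ID'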

To avoid a table scan on the Category table, you need to make sure that Category_Number is indexed as well. (It probably already is, since I assume it is a key to that table.)

Christoffer Lette
  • I added the aggregates and was able to pull down roughly 1,000,000 records in 26 seconds. The entire database has 36Mil records (not the 19Mil previously quoted). So, in theory, the whole thing should run in 15 minutes (I'm testing my theory right now.) It still seems awful slow... Or is this normal? – Jay Carr Dec 12 '12 at 19:56
  • I wouldn't say "normal"... Your `where` clause restricts the amount of data that needs to be processed, but that only works if WEEK_ID is indexed - otherwise it would still require a full table scan. Is the inner join necessary? (I only ask because I don't know where you get the SalesCost/SalesDollars columns from.) – Christoffer Lette Dec 12 '12 at 20:34
  • One more thing. You said that you pulled a million rows. I thought that you were expecting a maximum of about 7000 rows (140 weeks x 50 categories). What am I missing? – Christoffer Lette Dec 12 '12 at 20:37
  • Pardon, pulling from 34 million rows. The result was 49 rows' worth of data. – Jay Carr Dec 12 '12 at 20:57
  • Also, the reason for the join is that the category names are stored in a different table. So I have an index number (CATEGORY_NUMBER) that I join on to get a Category column from the second table (DW####.CATEGORY.CATEGORY). That's the point of the join; can I make it more efficient, though? Also, I just ran the whole query and it took 35 minutes to complete... I'm guessing your comment on indexing will turn out to be true. How is that fixed? What do I tell IT? – Jay Carr Dec 12 '12 at 21:00
  • (The 49 rows was for 1 week, sorry, wasn't clear about that either.) – Jay Carr Dec 12 '12 at 21:08
  • When determining the need for indexes, I usually start looking at what columns are used in the `where` and `order by` clauses (if any) and any columns used for matching up any tables involved in a `join`. I'll put the details in the answer. – Christoffer Lette Dec 12 '12 at 21:09
  • Btw, you use an `inner join`. That could be an issue. If the Category_Number is a key to the Category table, then maybe a `left join` would be more appropriate. – Christoffer Lette Dec 12 '12 at 21:12
  • Re: my last comment: That could potentially introduce `null` Categories in the result. Be aware of that. – Christoffer Lette Dec 12 '12 at 21:16
  • Every Category Number will have an associated Category Name, so it should be fine. I'll give it a shot as well. I'm also reading up on indexing as I know little about it. From what I understand so far, if it's not being done, that would explain part of the problem. – Jay Carr Dec 12 '12 at 21:20
  • Turns out I was using a deprecated table which was not properly indexed. So, in a sense, this is the correct answer, because I did in fact need to get at a properly indexed table (plus the new table I was joining to was much smaller, so it speeds things up for that reason as well.) – Jay Carr Dec 27 '12 at 19:50

Indices on WEEK_ID and CATEGORY (and possibly CATEGORY_NUMBER) are the only way to make it really fast, so you need to convince the DBO to introduce them.
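If it helps when talking to IT, the request could look roughly like the following. The index names are made up, and the DBA will have their own conventions and may well choose different key orders after looking at an access plan:

-- Supports the WHERE ... BETWEEN on WEEK_ID and the grouping
CREATE INDEX DWQ####.SLSCATW_WEEKID_IX
    ON DWQ####.SLSCATW (WEEK_ID, CATEGORY_NUMBER);

-- Supports the join to the category-name table (may already be covered by its key)
CREATE INDEX DW####.CATEGORY_NUMBER_IX
    ON DW####.CATEGORY (CATEGORY_NUMBER);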

Morten