
I'm building a few custom reports against a SQL Server 2005 database. The database belongs to a 3rd party management application we run. The data I'm pulling isn't among the primary purposes of the site, so it's largely non-indexed except for the timestamp column. For now, there is only one table involved -- a table of some 700 million rows. So when I run a query against it that should return only 50 rows, it has to scan all 700 million.

I'm looking to speed this up, but don't want to index every column that I'm adding to the WHERE clause -- I don't know that adding that many indexes will end up improving the speed much (or am I wrong?). So I'm curious what the best practice would be if I couldn't add any new indexes to the table!

A stored procedure doesn't seem like the best fit. Would an indexed view be the best idea? Thoughts?

Here's the table schema:

DeviceGuid (PK, uniqueidentifier, not null)
DeviceID (int, not null)
WindowsEventID (PK, int, not null) (indexed)
EventLog (varchar(64), not null)
EventSource (varchar(64), not null)
EventID (int, not null)
Severity (int, not null)
Description (nvarchar(max), not null)
TimeOfEvent (PK, datetime, not null) (indexed)
OccurrenceNbr (int, not null)

Here's a sample query:

SELECT COUNT(*) AS NumOcc, EventID, EventLog, EventSource, Severity, Description
FROM WindowsEvent
WHERE DeviceID='34818'
    AND Severity=1
    AND TimeOfEvent >= DATEADD(hh, DATEDIFF(hh, GETDATE(), GETUTCDATE()), '2010/10/27 12:00:00 AM')
    AND TimeOfEvent <= DATEADD(hh, DATEDIFF(hh, GETDATE(), GETUTCDATE()), '2010/11/3 12:00:00 AM')
    AND EventID<>34113
    AND EventID<>34114
    AND EventID<>34112
    AND EventID<>57755
    AND EventSource<>'AutoImportSvc.exe'
    AND EventLog='Application'
GROUP BY EventID, EventLog, EventSource, Severity, Description
ORDER BY NumOcc DESC

Maybe the query sucks ... it returns 53 rows in 4.5 minutes.
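
For reference, a rough sketch of what the indexed-view idea above might look like: a schema-bound view that pre-aggregates counts per device, log, source, event, severity and day. The view name, the dbo schema, the day-bucketing expression and the omission of Description are all assumptions, and materializing the view still means creating an index on it -- so it would have to be weighed against the "no new indexes" constraint and checked against SQL Server 2005's indexed-view restrictions.

-- Sketch only: pre-aggregate counts per day so reports don't have to scan the base table.
-- Description is omitted because an nvarchar(max) column can't be part of the
-- view's clustered index key.
CREATE VIEW dbo.vw_WindowsEventDaily
WITH SCHEMABINDING
AS
SELECT  DeviceID,
        EventLog,
        EventSource,
        EventID,
        Severity,
        DATEADD(dd, DATEDIFF(dd, 0, TimeOfEvent), 0) AS EventDay,
        COUNT_BIG(*) AS NumOcc
FROM    dbo.WindowsEvent
GROUP BY DeviceID, EventLog, EventSource, EventID, Severity,
        DATEADD(dd, DATEDIFF(dd, 0, TimeOfEvent), 0)
GO

-- The view is only materialized once a unique clustered index exists on it.
CREATE UNIQUE CLUSTERED INDEX IX_vw_WindowsEventDaily
ON dbo.vw_WindowsEventDaily (DeviceID, EventLog, EventSource, EventID, Severity, EventDay)
GO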

Nathan Loding
  • Does your data for your query need to be up to date always, or is it okay to work on a snapshot of the data? You could try to create a sub job that parses all the data nightly and writes it to a report structure where you have filtered/aggregated the data, so that the number of rows you need to query is lower. That table could also have the necessary indexes. – Knubo Nov 09 '10 at 22:28
  • @Knubo -- this is something we are considering. We are typically looking at the previous week's worth of data, so we could archive it weekly to another database (a rough sketch of that idea follows these comments). – Nathan Loding Nov 09 '10 at 22:34
  • 700 million rows, say 100 bytes each, means 70GB worth of data to do a linear search on. It is pretty good to actually get 4.5 minutes out of that. Though I guess that then the database is locked and almost nothing else goes through. Depending on what kind of searches you are looking for, you could consider using Lucene or some other indexing software to index your data. Some former colleagues of mine did that with success on some complex MySQL databases. – Knubo Nov 09 '10 at 22:47
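
A minimal sketch of that snapshot idea, assuming a hypothetical dbo.WindowsEventReport summary table refreshed by a scheduled SQL Agent job; the table name, the day bucketing and the one-week window are illustrative only, and Description is left out of the summary:

-- Hypothetical summary table: small enough to index freely, and it lives outside
-- the 3rd-party schema so adding indexes here isn't a problem.
CREATE TABLE dbo.WindowsEventReport
(
    DeviceID    int         NOT NULL,
    EventLog    varchar(64) NOT NULL,
    EventSource varchar(64) NOT NULL,
    EventID     int         NOT NULL,
    Severity    int         NOT NULL,
    EventDay    datetime    NOT NULL,
    NumOcc      int         NOT NULL
)

CREATE CLUSTERED INDEX IX_WindowsEventReport_Day
ON dbo.WindowsEventReport (EventDay, DeviceID)

-- Scheduled step: aggregate the previous week once, so the reports read this
-- small table instead of the 700-million-row source.
INSERT INTO dbo.WindowsEventReport
        (DeviceID, EventLog, EventSource, EventID, Severity, EventDay, NumOcc)
SELECT  DeviceID, EventLog, EventSource, EventID, Severity,
        DATEADD(dd, DATEDIFF(dd, 0, TimeOfEvent), 0),
        COUNT(*)
FROM    WindowsEvent
WHERE   TimeOfEvent >= DATEADD(dd, -7, GETUTCDATE())
GROUP BY DeviceID, EventLog, EventSource, EventID, Severity,
        DATEADD(dd, DATEDIFF(dd, 0, TimeOfEvent), 0)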

4 Answers


This is pretty simpleminded, but I'd try the indexed value as the first test in the WHERE clause.

user1032402

If your query isn't using any indexes, it'll perform really badly. You don't need an index on every column, but you'll want one on the right column. TimeOfEvent is already indexed, but it may not be a great index for your needs.

The right column is going to depend on the distribution of your data. The best index will probably be the one that provides the highest selectivity (i.e., when you know a key value for the index, it returns the fewest rows). If you know which column provides the best selectivity, you can try an index on it.

To help determine the best index, you can use the Display Estimated Execution Plan in SSMS. This will help you see which index will be used. After adding an index, you can run your query and evaluate the results with the execution plan. And, of course, observing the elapsed time will help too.
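
If adding even a single index becomes an option, here is a hedged sketch of what a covering index tailored to the sample query could look like; the index name and key order are assumptions, and as noted above the right leading column depends on which predicate is most selective:

-- Sketch only: key on the equality predicates first and the TimeOfEvent range last,
-- with the remaining output columns INCLUDEd so the query can be answered from the
-- index alone.
CREATE NONCLUSTERED INDEX IX_WindowsEvent_Device_Log_Severity_Time
ON WindowsEvent (DeviceID, EventLog, Severity, TimeOfEvent)
INCLUDE (EventID, EventSource, Description)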

bobs
  • I agree that adding indexes will help, but what if I *couldn't* add indexes? Are there any workarounds that would increase the speed? – Nathan Loding Nov 09 '10 at 22:34
  • Ahh, I misunderstood. I thought indexing was an option. Sounds like the discussion topic with @Knubo is going to be most promising. – bobs Nov 09 '10 at 22:41

Try this method, using the double ROW_NUMBER trick:

SELECT  RN_Desc as NumOcc, *
FROM    (
        SELECT  row_number() Over(partition by EventId order by EventLog, EventSource, Severity, Description) as RN_Asc,
                row_number() Over(partition by EventId order by EventLog desc, EventSource desc, Severity desc, Description desc) as RN_Desc,
                *
        FROM    WindowsEvent 
        WHERE   DeviceID='34818' 
                AND Severity=1 
                AND TimeOfEvent >= DATEADD(hh, DATEDIFF(hh, GETDATE(), GETUTCDATE()), '2010/10/27 12:00:00 AM') 
                AND TimeOfEvent <= DATEADD(hh, DATEDIFF(hh, GETDATE(), GETUTCDATE()), '2010/11/3 12:00:00 AM') 
                AND EventID<>34113 
                AND EventID<>34114 
                AND EventID<>34112 
                AND EventID<>57755 
                AND EventSource<>'AutoImportSvc.exe' 
                AND EventLog='Application' 
        ) t
WHERE   RN_Asc = 1 
ORDER BY NumOcc DESC 

With this, the engine doesn't need to do any aggregation -- just a single pass through the table. If it doesn't work, try playing with the ORDER BY and PARTITION BY parts of the ROW_NUMBER to get correct groupings.

Mladen Prajdic
  • It saved about a minute and cut the rows down significantly, but still not nearly fast enough for what we need. I'm thinking that building a separate database and exporting the data to it is the answer. – Nathan Loding Nov 11 '10 at 16:06

The final solution here was to run a query against the indexed fields, then filter the results within the application running the query. Two fields ended up containing similar enough information that I could query against one index and get a very close approximation of the data I wanted. I then looped back through and removed any non-matching entries from the result list. It took MUCH less time!
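
Roughly the shape of that approach, assuming the indexed TimeOfEvent range is what the narrowed query leans on (the exact split between what stays in SQL and what is filtered in code isn't spelled out above):

-- Sketch: pull only the indexed date range; everything else is filtered
-- afterwards in the reporting application.
SELECT  DeviceID, EventID, EventLog, EventSource, Severity, TimeOfEvent, Description
FROM    WindowsEvent
WHERE   TimeOfEvent >= DATEADD(hh, DATEDIFF(hh, GETDATE(), GETUTCDATE()), '2010/10/27 12:00:00 AM')
    AND TimeOfEvent <= DATEADD(hh, DATEDIFF(hh, GETDATE(), GETUTCDATE()), '2010/11/3 12:00:00 AM')
-- Application side (not shown): drop rows that don't match DeviceID, Severity,
-- EventID, EventSource and EventLog, then count occurrences per group.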

Nathan Loding