Optimize my Azure SQL PaaS Table and/or query to increase performance

Question

I am designing a table with a very specialized use pattern. The table will be logged to continously with limited traffic - ~25 records per second, and then every night I run one single big query to extract a lot of data.

My table creation script currently looks like this:

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF NOT EXISTS (select * from sysobjects where name='records' and xtype='U')
CREATE TABLE [dbo].[records](
    [TripID] varchar(255) NOT NULL,
    [RecordTimeUTC] datetime2(0) NOT NULL,
    [TimeOfDaySeconds] [int] NOT NULL,
    [T0Latitude] [float] NOT NULL,
    [T0Longitude] [float] NOT NULL,
    [T1Latitude] [float] NULL,
    [T1Longitude] [float] NULL,
    [T2Latitude] [float] NULL,
    [T2Longitude] [float] NULL,
    [T3Latitude] [float] NULL,
    [T3Longitude] [float] NULL,
    [T4Latitude] [float] NULL,
    [T4Longitude] [float] NULL,
    [T5Latitude] [float] NULL,
    [T5Longitude] [float] NULL,
    [VehicleID] [int] NULL,
    [ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY
) ON [PRIMARY]
GO

IF NOT EXISTS (select * from sys.indexes where name='TripIDRecordTimeIndex' and object_id = OBJECT_ID('dbo.records'))
    CREATE INDEX TripIDRecordTimeIndex ON records (TripID, RecordTimeUTC desc)
GO

IF NOT EXISTS (select * from sys.indexes where name='TripIDIndex' and object_id = OBJECT_ID('dbo.records'))
    CREATE INDEX TripIDIndex ON records (TripID)
GO

IF NOT EXISTS (select * from sys.indexes where name='RecordTimeUTCIndex' and object_id = OBJECT_ID('dbo.records'))
    CREATE INDEX RecordTimeUTCIndex ON records (RecordTimeUTC desc)
GO

IF NOT EXISTS (select  * from sys.objects where name like 'UniqueConstraint2' and parent_object_id = OBJECT_ID('dbo.records'))
    ALTER TABLE [dbo].[records] ADD CONSTRAINT UniqueConstraint2 UNIQUE(VehicleID, RecordTimeUTC desc);
GO

IF NOT EXISTS (select * from sys.indexes where name='VehicleIDIndex' and object_id = OBJECT_ID('dbo.records'))
    CREATE INDEX VehicleIDIndex ON records (VehicleID)
GO

I currently have around 60 million records in my table and it is under 50 GB in size.

The query to extract data is very time consuming. It currently takes over an hour. I am not sure if it is my table design or the query design that is the root cause (though probably both).

I need to extract the latest X amount of items for each TripID in a set of TripIDs that I specify. There are around 10k different IDs and I typically want to query about half of them. X varies between them also, so my best way to query is currently to generate a script that looks a bit like this:

SELECT rs.* FROM (SELECT *, ROW_NUMBER() over (Partition BY TripID ORDER BY RecordTimeUTC DESC ) AS Rank FROM records where TripID in (20141000,20441000,30011022,30011021,30011008,30012029,30012028,30012027,30011007,30011019,30011018,30012026,30012025,30012024,30011017,30011016,30012023,30012022,30011015,30011014,30012021,30012020,30011013,30011012,30013000,30013001,30013019,30013009,30011011,30011010,30011009,30013008,30013007,30012010,30012009,30013005,30013004,30013003,30012014,30012019,30013021,30013020,30011006,30011004,30012018,30012017,30012016,30013006,30011003,30011002,30012015,30012013,30013013,30013002,30011001,30011000,30011020,30012012,30012011,30011005,30011030,30012001,30012008,30012007,30011029,30011028,30012006,30012005,30011031,30011027,30012004,30012003,30011026,30011025,30011024,30012002,30012000,30012031,30011023,30012030,30015005,30016006,30016013,30016012,30014020,30014019,30014018,30016011,30016010,30014017,30014016,30016009,30016008,30014015,30014013,30014012,30016005,30016004,30016003,30014010,30014009,30016002,30016001,30014008,30014007,30016000,30016007,30014006,30014005,30014004,30014003,30014002,30014001,30014000,30014023,30014014,30015012,30015004,30015003,30013018,30013017,30015002,30015001,30013016,30013015,30013014,30015000,30015013,30015011,30013012,30013011,30015010,30015009,30013010,30014011,30015008,30015007,30014022,30014021,30015006,33651001,33661006)) rs WHERE Rank <= 690
 UNION 
SELECT rs.* FROM (SELECT *, ROW_NUMBER() over (Partition BY TripID ORDER BY RecordTimeUTC DESC ) AS Rank FROM records where TripID in (20431003,20431002,20431001,20432003,20432002,20432001,30221001,33861002,33861003)) rs WHERE Rank <= 855
 UNION 
SELECT rs.* FROM (SELECT *, ROW_NUMBER() over (Partition BY TripID ORDER BY RecordTimeUTC DESC ) AS Rank FROM records where TripID in (20171029,20171030,20002002,26122001)) rs WHERE Rank <= 45
 UNION 
...

(The above query returns 690 instances of each trip in the first list, 855 instances of each in the second, 45 in the third and so on. The query is much bigger than this - this is just a snippet of it. In total I extract 10-15 million rows)

As mentioned my performance is terrible. Is it a cloud thing? Is it a design thing? Should I use a clustered index? (Tried that for TripID, but that was even worse). Could I improve my query somehow? Extract the same amount of instances for each ID for instance, then filter afterwards?

I notice I have a couple of extra indexes that might not be used in my query. I just tried adding more as the insert performance isn't a problem. The plan was for the TripIDRecordTimeIndex to be used in my query.

Even when scaling up the data plan in Azure SQL to S7 (800 DPU) I don't get this to run fast. Any feedback is appreciated.

Edit: I recently changed the TripID from int to varchar(255) - would this break my performance?

Edit2: Execution plan:

Download link to full execution plan

Edit3: Figured out that adding quotes ('') around the TripIDs that I query on improved the performance hugely!

Edit4: I added the index as proposed by TheGameiswar - and the difference is night and day! Thank you! New execution plan attached.

New Execution plan

This is a very broad question. For starters, have you looked at the SQL Execution Plan for your queries? — Rob Reagan, Nov 20 '17 at 13:02
This is an area I know very little about. I have added a copy of the execution plan for a subset of the execution plan at the bottom of the question. I realize it is a broad question, but the reason is that I do not know where to look. Thank you for your feedback. — WPFUser, Nov 20 '17 at 13:39
A picture of a plan is not a query plan. You need to link to the file: https://meta.stackexchange.com/questions/47689/how-can-i-attach-a-file-to-a-stack-overflow-post — Grant Fritchey, Nov 20 '17 at 14:09
The general idea when examining an execution plan is to see the steps that are taking the most time in your query. In your case, there are two clustered index scans. Since your table has 60M rows, this is going to hurt. You can read about index seeks vs scans here: https://blog.sqlauthority.com/2007/03/30/sql-server-index-seek-vs-index-scan-table-scan/ — Rob Reagan, Nov 20 '17 at 14:09
You're scanning your tables in their entirety and then filtering later. Performance will only ever get worse. You need to be able to filter data before you retrieve it. However, you're saying the requirements need to access half of the data. That's always going to lead to scans. This is a design issue, not a cloud issue. — Grant Fritchey, Nov 20 '17 at 14:14
And one other point, if the data is an integer, store it as an integer, not as a VARCHAR(255). That's for strings, like this sentence, not numbers. — Grant Fritchey, Nov 20 '17 at 14:14
Link to full plan added (though it's 25 mb..). I will read more about seeks vs scans. Thanks for the link @RobReagan. — WPFUser, Nov 20 '17 at 14:28
@GrantFritchey Actually I don't need to access half the data. Only the last ~1k entries for each of roughly half of all the unique "TripID"s in my table. — WPFUser, Nov 20 '17 at 14:30
@GrantFritchey The data was first thought to be integer, but I recently found data that may possibly be varchar, so I prefer to handle both cases. If it has a huge performance impact I can redesign it though. — WPFUser, Nov 20 '17 at 14:35

TheGameiswar · Accepted Answer · 2017-11-21T17:12:37.220

SELECT rs.* FROM (SELECT *, 
ROW_NUMBER() over (Partition BY TripID ORDER BY RecordTimeUTC DESC ) 
AS Rank FROM records where TripID in (20141000,20441000,30011022,30011021,30011008,30012029,30012028,30012027,30011007,30011019,30011018, 30012026,30012025.....)) rs WHERE Rank <= 690

The Indexes you have are not usefull for below part of query...

SELECT *, 
ROW_NUMBER() over (Partition BY TripID ORDER BY RecordTimeUTC DESC ) 
AS Rank FROM records where TripID in

I would create an index like below

create index nci_sometst on table (tripid,recorddatetime)
include(<remaining columsn you are selecting>)

The above query can help in getting records for IN part of tripID,but then you are calculating rank using derived table,this may not be of much help ,if your result set from inner query is large..

I would probably put this into a temp table and create an index on rank,so this is helpfull for other union queries as well.

Also had a look at your execution plan and i can see you are scanning the same table multiple times and reading many rows each time

Even without data type conversion warnings, your query will not use any of the indexes you have effectively

That absolutely did the trick! It took a while to create, but once there all queries are near instant! Thank you! PS: I only query on each TripID once, so I don't think the temp table would have helped. — WPFUser, Nov 21 '17 at 09:08
Glad it helped you,if you can provide more details that would be helpfull in future as well ..`questions seeking performance help should include DDL,DML Of the tables involved along with test data..if your test data is large,try scripting out schema and stats for the table(right click database->generate scripts->select specific database objects->in next screen select advanced and choose Script statistics) and paste it in question..With this info any one repro the same issue you are facing.Otherwise it becomes very difficult to answer your question .Pasting server version also helps` — TheGameiswar, Nov 21 '17 at 09:13

Optimize my Azure SQL PaaS Table and/or query to increase performance

1 Answers1