I am designing a table with a very specialized use pattern. The table will be logged to continously with limited traffic - ~25 records per second, and then every night I run one single big query to extract a lot of data.
My table creation script currently looks like this:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF NOT EXISTS (select * from sysobjects where name='records' and xtype='U')
CREATE TABLE [dbo].[records](
[TripID] varchar(255) NOT NULL,
[RecordTimeUTC] datetime2(0) NOT NULL,
[TimeOfDaySeconds] [int] NOT NULL,
[T0Latitude] [float] NOT NULL,
[T0Longitude] [float] NOT NULL,
[T1Latitude] [float] NULL,
[T1Longitude] [float] NULL,
[T2Latitude] [float] NULL,
[T2Longitude] [float] NULL,
[T3Latitude] [float] NULL,
[T3Longitude] [float] NULL,
[T4Latitude] [float] NULL,
[T4Longitude] [float] NULL,
[T5Latitude] [float] NULL,
[T5Longitude] [float] NULL,
[VehicleID] [int] NULL,
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY
) ON [PRIMARY]
GO
IF NOT EXISTS (select * from sys.indexes where name='TripIDRecordTimeIndex' and object_id = OBJECT_ID('dbo.records'))
CREATE INDEX TripIDRecordTimeIndex ON records (TripID, RecordTimeUTC desc)
GO
IF NOT EXISTS (select * from sys.indexes where name='TripIDIndex' and object_id = OBJECT_ID('dbo.records'))
CREATE INDEX TripIDIndex ON records (TripID)
GO
IF NOT EXISTS (select * from sys.indexes where name='RecordTimeUTCIndex' and object_id = OBJECT_ID('dbo.records'))
CREATE INDEX RecordTimeUTCIndex ON records (RecordTimeUTC desc)
GO
IF NOT EXISTS (select * from sys.objects where name like 'UniqueConstraint2' and parent_object_id = OBJECT_ID('dbo.records'))
ALTER TABLE [dbo].[records] ADD CONSTRAINT UniqueConstraint2 UNIQUE(VehicleID, RecordTimeUTC desc);
GO
IF NOT EXISTS (select * from sys.indexes where name='VehicleIDIndex' and object_id = OBJECT_ID('dbo.records'))
CREATE INDEX VehicleIDIndex ON records (VehicleID)
GO
I currently have around 60 million records in my table and it is under 50 GB in size.
The query to extract data is very time consuming. It currently takes over an hour. I am not sure if it is my table design or the query design that is the root cause (though probably both).
I need to extract the latest X amount of items for each TripID in a set of TripIDs that I specify. There are around 10k different IDs and I typically want to query about half of them. X varies between them also, so my best way to query is currently to generate a script that looks a bit like this:
SELECT rs.* FROM (SELECT *, ROW_NUMBER() over (Partition BY TripID ORDER BY RecordTimeUTC DESC ) AS Rank FROM records where TripID in (20141000,20441000,30011022,30011021,30011008,30012029,30012028,30012027,30011007,30011019,30011018,30012026,30012025,30012024,30011017,30011016,30012023,30012022,30011015,30011014,30012021,30012020,30011013,30011012,30013000,30013001,30013019,30013009,30011011,30011010,30011009,30013008,30013007,30012010,30012009,30013005,30013004,30013003,30012014,30012019,30013021,30013020,30011006,30011004,30012018,30012017,30012016,30013006,30011003,30011002,30012015,30012013,30013013,30013002,30011001,30011000,30011020,30012012,30012011,30011005,30011030,30012001,30012008,30012007,30011029,30011028,30012006,30012005,30011031,30011027,30012004,30012003,30011026,30011025,30011024,30012002,30012000,30012031,30011023,30012030,30015005,30016006,30016013,30016012,30014020,30014019,30014018,30016011,30016010,30014017,30014016,30016009,30016008,30014015,30014013,30014012,30016005,30016004,30016003,30014010,30014009,30016002,30016001,30014008,30014007,30016000,30016007,30014006,30014005,30014004,30014003,30014002,30014001,30014000,30014023,30014014,30015012,30015004,30015003,30013018,30013017,30015002,30015001,30013016,30013015,30013014,30015000,30015013,30015011,30013012,30013011,30015010,30015009,30013010,30014011,30015008,30015007,30014022,30014021,30015006,33651001,33661006)) rs WHERE Rank <= 690
UNION
SELECT rs.* FROM (SELECT *, ROW_NUMBER() over (Partition BY TripID ORDER BY RecordTimeUTC DESC ) AS Rank FROM records where TripID in (20431003,20431002,20431001,20432003,20432002,20432001,30221001,33861002,33861003)) rs WHERE Rank <= 855
UNION
SELECT rs.* FROM (SELECT *, ROW_NUMBER() over (Partition BY TripID ORDER BY RecordTimeUTC DESC ) AS Rank FROM records where TripID in (20171029,20171030,20002002,26122001)) rs WHERE Rank <= 45
UNION
...
(The above query returns 690 instances of each trip in the first list, 855 instances of each in the second, 45 in the third and so on. The query is much bigger than this - this is just a snippet of it. In total I extract 10-15 million rows)
As mentioned my performance is terrible. Is it a cloud thing? Is it a design thing? Should I use a clustered index? (Tried that for TripID, but that was even worse). Could I improve my query somehow? Extract the same amount of instances for each ID for instance, then filter afterwards?
I notice I have a couple of extra indexes that might not be used in my query. I just tried adding more as the insert performance isn't a problem. The plan was for the TripIDRecordTimeIndex to be used in my query.
Even when scaling up the data plan in Azure SQL to S7 (800 DPU) I don't get this to run fast. Any feedback is appreciated.
Edit: I recently changed the TripID from int to varchar(255) - would this break my performance?
Download link to full execution plan
Edit3: Figured out that adding quotes ('') around the TripIDs that I query on improved the performance hugely!
Edit4: I added the index as proposed by TheGameiswar - and the difference is night and day! Thank you! New execution plan attached.