A SQL Server table with >200 million records is divided into ~100 partitions (not true SQL Server partitions - it's not running on a compatible edition of SQL Server) by adding a column PartitionID. PartitionID is the first half of the table's clustered index definition; the other half is a unique auto-incrementing integer ID. PartitionID is also a foreign key into the Partition table. No record from Example is ever accessed without knowing its PartitionID; records are usually accessed in ranges associated with a single PartitionID (or a small number of PartitionIDs).
CREATE TABLE Example (
ID BIGINT IDENTITY(1, 1) NOT NULL,
PartitionID DECIMAL(18, 0) NOT NULL,
-- Other columns omitted for brevity
CONSTRAINT PK_Example PRIMARY KEY NONCLUSTERED (ID),
CONSTRAINT FK_Example_Partition FOREIGN KEY (PartitionID) REFERENCES Partition (ID)
)
CREATE UNIQUE CLUSTERED INDEX IX_Example ON Example(PartitionID, ID)
Partition rows are kept indefinitely, but Example rows are frequently purged by issuing a DELETE statement against a range with the same PartitionID. Over time, this leads to Partition rows that are not referenced by any Example rows. This is not the problem; the problem is identifying the Partition rows that are still referenced.
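For illustration, a purge of this kind might look like the sketch below. The variable names and values are hypothetical, and it assumes the purged range is a range of ID values within one partition; the real purge criteria may differ.

```sql
-- Hypothetical values, for illustration only.
DECLARE @PartitionID DECIMAL(18, 0) = 42;
DECLARE @FromID BIGINT = 1000000;
DECLARE @ToID BIGINT = 2000000;

-- Both predicates match the clustered index key order (PartitionID, ID),
-- so the delete can seek to the range instead of scanning the table.
DELETE FROM Example
WHERE PartitionID = @PartitionID
  AND ID BETWEEN @FromID AND @ToID;
```

Once the last Example row for a given PartitionID has been deleted this way, nothing records that the matching Partition row has become unreferenced.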
Without resorting to user-level management techniques like adding and manually maintaining a ReferenceCount field in the Partition table, or adding and manually maintaining a list of in-use PartitionIDs, is there a system-level technique we could use to discover the set of PartitionIDs that are still in use - without scanning all the rows in table Example?
SELECT DISTINCT PartitionID FROM Example
The above query takes tens of seconds to return 100 values because it scans hundreds of millions of rows in the clustered index. Adding another very narrow index on PartitionID alone might reduce the I/O and halve the time, but essentially SQL Server is still scanning that index too.
CREATE NONCLUSTERED INDEX IX_Example_PartitionID ON Example(PartitionID)
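One way to check the "halve the time" estimate is to run the query against each index with I/O and timing statistics switched on. This is only a sketch: the index hints are there to force the comparison and assume both indexes exist under the names used above.

```sql
-- Report logical reads and CPU/elapsed time in the Messages output.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Scan of the wide clustered index.
SELECT DISTINCT PartitionID FROM Example WITH (INDEX(IX_Example));

-- Scan of the narrow nonclustered index.
SELECT DISTINCT PartitionID FROM Example WITH (INDEX(IX_Example_PartitionID));

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Either way, the logical reads reported for each statement should still grow with the number of rows in Example, not with the number of distinct PartitionIDs.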
I should probably also avoid joining Partition with Example (performing one clustered index seek per Partition row instead of an index scan), because Partition rows are never removed: the number of seeks will keep growing, and performance will keep degrading, over time.
SELECT p.ID AS PartitionID FROM Partition p WHERE EXISTS (
SELECT 1 FROM Example e WHERE e.PartitionID = p.ID
)