How to optimize SQL Server Columnstore Alignment

Question

I have a Clustered Columnstore Index Table for our IOT metrics (timeseries data). It contains more than 1 billion rows and structured like this:

CREATE TABLE [dbo].[Data](
[DeviceId] [bigint] NOT NULL,
[MetricId] [smallint] NOT NULL,
[TimeStamp] [datetime2](2) NOT NULL,
[Value] [real] NOT NULL
)

CREATE CLUSTERED INDEX [PK_Data] ON [dbo].[Data] ([TimeStamp],[DeviceId],[MetricId]) --WITH (DROP_EXISTING = ON)
CREATE CLUSTERED COLUMNSTORE INDEX [PK_Data] ON [dbo].[Data] WITH (DROP_EXISTING = ON, MAXDOP = 1, DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)

There are some 10,000 distinct DeviceId values and TimeStamps range from 2008 till now. A typical query against this table looks like this:

SET STATISTICS TIME, IO ON
SELECT
    [DeviceId]
    ,[MetricId]
    ,DATEADD(hh, DATEDIFF(day, '2005-01-01', [TimeStamp]), '2005-01-01') As [Date]
    ,MIN([Value]) as [Min]
    ,MAX([Value]) as [Max]
    ,AVG([Value]) as [Avg]
    ,SUM([Value]) as [Sum]
    ,COUNT([Value]) as [Count]
FROM
    [dbo].[Data]
WHERE
    [DeviceId] = 6077129891325167032
    AND [MetricId] = 1000
    AND [TimeStamp] BETWEEN '2017-07-01' AND '2017-07-30'
GROUP BY
    [DeviceId]
    ,[MetricId]
    ,DATEDIFF(day, '2005-01-01', [TimeStamp])
ORDER BY
    [DeviceId]
    ,[MetricId]
    ,DATEDIFF(day, '2005-01-01', [TimeStamp])

When I execute this query, I get this for performance metrics:

Because at the moment a query like stated above does too many Segment reads I believe:

Table 'Data'. Scan count 2, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 5257, lob physical reads 9, lob read-ahead reads 4000.
Table 'Data'. Segment reads 11, segment skipped 764.

Query plan:

This is not well optimized I believe as there were 11 segments read to retrieve only 212 out of 1 billion source rows (before grouping/aggregation)

So then I ran Niko Neugebauer's great scripts to validate our setup and the Columnstore Alignment https://github.com/NikoNeugebauer/CISL/blob/master/Azure/alignment.sql, I get this result after rebuilding the Columnstore Clustered Index:

MetricId and TimeStamp columns have optimal alignment score of 100%. How can we ensure that the DeviceId column is also well aligned? I played with the column order in the initial Clustered (Rowstore) index, is that where things can be optimized?

if you are using sql server 2016,try to use dbcc clonedb and share the db,so that others can repro the exact scenario you are facing.if you are not using 2016, you can script our table schema,indexes,stats and try sharing the script — TheGameiswar, Aug 05 '17 at 16:57
@TheGameiswar [DBCC CLONEDATABASE](https://support.microsoft.com/en-gb/help/3177838/how-to-use-dbcc-clonedatabase-to-generate-a-schema-and-statistics-only) is available from SQL Server 2014 SP2 onwards : ) — wBob, Aug 05 '17 at 18:01
@TheGameiswar please find the query plan XML here: https://pastebin.com/y0fVDqe2 — Ted van der Veen, Aug 05 '17 at 20:45
@TheGameiswar adding the missing index suggested by query plan sort of defeats the whole purpose of the clustered columnstore index we seek to use also to save on storage space. I did a quick test and created the index on a test table with same structure (but with 3% of the rows). As you can see, the storage savings are evaporated (50 MB for the columnstore data versus 1,2 GB for the added index) https://imagebin.ca/v/3VwKKEezWGFa — Ted van der Veen, Aug 06 '17 at 11:20
am not entirely sure about internals of column store indexes,but in general ,if you have an index , that doesn't satisfy a query,you will have to make trade off..do I need this query to be faster or is it like this is one off a query,so I am fine with it's current speed — TheGameiswar, Aug 06 '17 at 11:35
Also , having a clustered index doesn't mean ,it should satisfy all queries — TheGameiswar, Aug 06 '17 at 11:36
Have you done any maintenance on the index at all? In my local trials, I've loaded the table with over 16 million rows of random data, and my alignment is healthy. Also, the index wouldn't defeat the purpose. The index has an include on it to bring in the `Value` column to speed the query up. You might want to switch your `COLUMNSTORE` index to be non-clustered and include the `Value` column. — tj-cappelletti, Aug 06 '17 at 19:31
@virusstorm did you compare the storage requirements between the clustered Columnstore version of your test table versus a non-clustered one? The latter will require at least 2000% more disk space. This is what we are looking to avoid as the table will eventually contain some 10 billion records. — Ted van der Veen, Aug 07 '17 at 04:23
Just change your clustered index to `(DeviceId, MetricId, TimeStamp)`. That should resolve it. — Evaldas Buinauskas, Aug 07 '17 at 10:59

score 9 · Answer 1 · answered Aug 07 '17 at 10:24

9

the key solution to align your table by DeviceId is to build a clustered rowstore index on your table and then build over it a clustered Columnstore Index with MAXDOP = 1 (in order not to introduce any overlaps which take place when index build runs with multiple cores). So the possible code will look something like this:

CREATE CLUSTERED INDEX [PK_Data] ON [dbo].[Data] ([DeviceId],[TimeStamp],[MetricId]) --WITH (DROP_EXISTING = ON)
CREATE CLUSTERED COLUMNSTORE INDEX [PK_Data] ON [dbo].[Data] WITH (DROP_EXISTING = ON, MAXDOP = 1, DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)

Another possibility would be to do it all within CISL, by preparing and then executing the alignment funcitonality:

insert into dbo.cstore_Clustering( TableName, Partition, ColumnName )
    VALUES ('[dbo].[Data]', 1, 'DeviceId' );

This is though just for 1 partition, but you should consider partitioning your table anyway, once you get into the numbers you are using. After setting up, you can start executing dbo.cstore_doAlignment, which will automatically re-align and optimise your table. (You will have some parameters to configure the threshold of the optimisations, if you like)

Best regards, Niko

answered Aug 07 '17 at 10:24

Niko Neugebuer

351
1
6

Thanks @Niko. So do I understand your answer correctly, for our use-case where we have an "append only"/logging table for our IOT metrics we should partition the table by TimeStamp column (per month?) and than align this table by DeviceId? In our aggregation we always query for one DeviceId, possible multiple different MetricId's and a TimeStamp ranging either 1 month or 12 months (grouping by hour and day respectively). – Ted van der Veen Aug 07 '17 at 11:32
Ted, without all the little details (that might change my perception), I agree that it looks like partitioning by TimeStamp and aligning by DeviceId looks like a fine option. Best regards, Niko – Niko Neugebuer Aug 10 '17 at 11:13
Thank you very much again, @Niko! I have implemented partitioning per month on TimeStamp plus aligning by DeviceId. Results are promising but do require some additional validation because most (75%) queries run in < 100ms, but 25% take 2,000ms. Not much in between. It all runs on a Premium Azure SQL elastic pool. So this is same query, executed once every 3 secs, so no throttle margin is hit. What would be your suggestions for Partitioning section rows volume? I choose month now, but maybe that leads to a too big number of partitions? – Ted van der Veen Aug 11 '17 at 20:47
Thank you @NikoNeugebuer for existing! :) Your blog is amazing! I have a similar scenario as mentioned above, and when I try to apply segment alignment on DeviceID, and my table is partitioned to month on TimeStamp, the column is still misaligned. It aligns it on TimeStamp. DeviceID is 0% aligned. (on Azure SQL) Create index on (DeviceID) on MonthlyPartScheme (TimeStamp). How do I apply segment alignment within partitions? – datamodel Mar 28 '19 at 04:03

score 1 · Answer 2 · answered Jul 31 '19 at 06:23

When Max dop set as 1 for creating clustered column store will sort your records effectively, but for a table with 1 billion rows this max dop 1 will not help u. It is better to partition the table with any of the date columns and then create a clustered column store index with max dop any thing higher than 1 or 0. But in that case sorting will not be guaranteed but clustered column store index will do the segment elimination effectively. One should clearly noted, do not drop or create clustered column store index by keeping any other non clustered index in your table, this will impact your clustered column store index creation/drop performance. If you drop a clustered column store index by keeping other index SQL server will do lot of works on the other indexes.

How to optimize SQL Server Columnstore Alignment

2 Answers2