I have a table containing ~5,000,000 rows of SCADA data, described by the following:

create table data (o int, m money)
Where:
- o is the PK with a clustered index on it; its fill factor is close to 100%. o represents the date of the meter reading and can be thought of as the X axis.
- m is a decimal value lying within the 1..500 range and is the actual meter reading; it can be thought of as the Y axis.
I need to find out about certain patterns, i.e. when, how often, and for how long they occur.
Example: looking for all occurrences of m changing by an amount in the 500 to 510 range within 5 units (well, 1 to 5) of o, I run the following query:
select d0.o as tFrom, d1.o as tTo, d1.m - d0.m as dValue
from data d0
inner join data d1
on (d1.o = d0.o + 1 or d1.o = d0.o + 2 or d1.o = d0.o + 3 or d1.o = d0.o + 4)
and (d1.m - d0.m) between 500 and 510
The query takes 23 seconds to execute.
The previous version took 30 minutes (about 80 times slower); I managed to optimize it with a naive approach by replacing
on (d1.o - d0.o) between 1 and 4
with
on (d0.o = d1.o - 1 or d0.o = d1.o - 2 or d0.o = d1.o - 3 or d0.o = d1.o - 4)
It's clear to me why this is faster: on the one hand, an indexed column scan should work fast enough; on the other, I can afford it because the dates are discrete (and I always give a 5-minute grace period to any o region, so for 120 minutes the region is 115..120). I can't use the same approach with the m values, though, as they are integral.
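For reference, the earlier, slower formulation (reconstructed from the predicate above, so only the join condition differs) looked like this:

select d0.o as tFrom, d1.o as tTo, d1.m - d0.m as dValue
from data d0
inner join data d1
on (d1.o - d0.o) between 1 and 4
and (d1.m - d0.m) between 500 and 510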
Things I've tried so far:
- Soft sharding by applying
  where o between @oRegionStart and @oRegionEnd
  at the bottom of my script and running it within a loop, fetching the results into a temp table (see the sketch after this list). Execution time: 25 seconds.
- Hard sharding by splitting the data into a number of physical tables. The result is 2 minutes, never mind the maintenance hassle.
- Using some precooked data structures, like:
  create table data_change_matrix (o int, dM5Min money, dM5Max money, dM10Min money, dM10Max money ... dM{N}Min money, dM{N}Max money)
  where N is the max depth for which I run the analysis. With such a table I could easily write a query:
  select * from data_change_matrix where dM5Min between 500 and 510
  The result: it went nowhere due to the tremendous size requirements (5M x ~250) and the maintenance costs; I would need to keep that matrix up to date in close to real time.
- SQL CLR - don't even ask me what went wrong, it just didn't work out.
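For reference, the soft-sharding loop from the first bullet looks roughly like this; the region width and temp table name here are simplified placeholders, not my exact script:

declare @oRegionStart int = 1;
declare @oRegionEnd int;
declare @step int = 100000;  -- region width, placeholder value
declare @oMax int = (select max(o) from data);

create table #results (tFrom int, tTo int, dValue money);

while @oRegionStart <= @oMax
begin
    set @oRegionEnd = @oRegionStart + @step - 1;

    insert into #results (tFrom, tTo, dValue)
    select d0.o, d1.o, d1.m - d0.m
    from data d0
    inner join data d1
        on (d1.o = d0.o + 1 or d1.o = d0.o + 2 or d1.o = d0.o + 3 or d1.o = d0.o + 4)
        and (d1.m - d0.m) between 500 and 510
    where d0.o between @oRegionStart and @oRegionEnd;  -- only d0 is restricted, so pairs crossing a region boundary are still found

    set @oRegionStart = @oRegionStart + @step;
end

select * from #results;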
Right now I'm out of inspiration and looking for help.
All in all: is it possible to get close-to-instant response times when running this type of query on large volumes of data?
Everything runs on MS SQL Server 2012. I didn't try it on MS SQL Server 2014, but I'm happy to if it would make sense.
Update - execution plan: http://pastebin.com/PkSSGHvH.
Update 2 - While I really love the LAG function suggested by usr, I wonder if there's a LAGS function allowing for
select o, MIN(LAGS(o, 4)) over(...)
or what's its shortest implementation in T-SQL?
I tried something very similar using SQL CLR and got it working but the performance was awful.
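To illustrate what I mean by LAGS: I suspect an aggregate over a ROWS window frame (rather than multiple LAG calls) would express it. This is only a sketch, assuming o values are contiguous (the frame counts rows, not o units), and I haven't benchmarked it:

-- min of m over the previous 4 rows, i.e. what something like MIN(LAGS(m, 4)) would mean;
-- a frame of "1 following and 4 following" would mirror the original join's direction instead
with w as (
    select o,
           m,
           min(m) over (order by o rows between 4 preceding and 1 preceding) as mMinPrev4
    from data
)
select o as tTo, m - mMinPrev4 as dValue
from w
where m - mMinPrev4 between 500 and 510;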