2

I have a table containing ~5,000,000 rows of SCADA data, described by the following: create table data (o int, m money).

Where:

  - o is the PK with a clustered index on it; its fill factor is close to 100%. o represents the date of the meter reading and can be thought of as the X axis.
  - m is a decimal value lying within the 1..500 range and is the actual meter reading; it can be thought of as the Y axis.
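
For completeness, the table roughly corresponds to the DDL below (the constraint name and the explicit fill factor are illustrative, not the exact definition):

create table data
(
    o int not null,   -- reading time, discrete, the X axis
    m money not null, -- meter reading, 1..500, the Y axis
    constraint PK_data primary key clustered (o) with (fillfactor = 100)
);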

I need to find certain patterns, i.e. when, how often, and for how long they occurred.

Example: looking for all occurrences of m changing by 500 to 510 within 5 units of o (well, from 1 to 5), I run the following query:

select d0.o as tFrom, d1.o as tTo, d1.m - d0.m as dValue
from data d0
    inner join data d1 
        on (d1.o = d0.o + 1 or d1.o = d0.o + 2 or d1.o = d0.o + 3 or d1.o = d0.o + 4)
            and (d1.m - d0.m) between 500 and 510

The query takes 23 seconds to execute.

The previous version took 30 minutes (almost 80 times slower); I managed to optimize it using a naive approach, replacing on (d1.o - d0.o) between 1 and 4 with on (d0.o = d1.o - 1 or d0.o = d1.o - 2 or d0.o = d1.o - 3 or d0.o = d1.o - 4).

It's clear to me why it's faster: on the one hand, an indexed column scan should work fast enough; on the other, I can afford this because the dates are discrete (and I always give 5 minutes of grace time to any o region, so for 120 minutes it's the 115..120 region). I can't use the same approach with the m values, though, as they are decimal rather than discrete.

Things I've tried so far:

  1. Soft sharding: applying where o between @oRegionStart and @oRegionEnd at the bottom of my script and running it within a loop, fetching results into a temp table (roughly as sketched after this list). Execution time: 25 seconds.

  2. Hard sharding: splitting the data into a number of physical tables. The result is 2 minutes, never mind the maintenance hassle.

  3. Using some precooked data structures, like:

create table data_change_matrix (o int, dM5Min money, dM5Max money, dM10Min money, dM10Max money ... dM{N}Min money, dM{N}Max money)

where N is the maximum depth for which I run the analysis. Having such a table, I could easily write a query:

select * from data_change_matrix where dM5Min between 500 and 510

The result: it went nowhere due to the tremendous size requirements (5M x ~250) and the maintenance-related costs; I would need to keep that matrix up to date in close to real time.

  4. SQL CLR: don't even ask me what went wrong; it just didn't work out.
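
For reference, here is roughly what the soft-sharding loop from item 1 looks like (the chunk size, temp table, and variable names are illustrative rather than my exact script):

declare @oMin int, @oMax int, @oStep int = 500000;
select @oMin = min(o), @oMax = max(o) from data;

create table #results (tFrom int, tTo int, dValue money);

declare @oRegionStart int = @oMin, @oRegionEnd int;
while @oRegionStart <= @oMax
begin
    set @oRegionEnd = @oRegionStart + @oStep - 1;

    insert into #results (tFrom, tTo, dValue)
    select d0.o, d1.o, d1.m - d0.m
    from data d0
        inner join data d1
            on (d1.o = d0.o + 1 or d1.o = d0.o + 2 or d1.o = d0.o + 3 or d1.o = d0.o + 4)
                and (d1.m - d0.m) between 500 and 510
    where d0.o between @oRegionStart and @oRegionEnd;

    set @oRegionStart = @oRegionEnd + 1;
end

select * from #results;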

Right now I'm out of inspiration and looking for help.

All in all: is it possible to get a close-to-instant response time running this type of query on large volumes of data?

Everything runs on MS SQL Server 2012. I didn't try it on MS SQL Server 2014, but I'm happy to do so if it makes sense.

Update - execution plan: http://pastebin.com/PkSSGHvH.

Update 2 - While I really love the LAG function suggested by usr, I wonder if there's a LAGS function allowing for

select o, MIN(LAGS(o, 4)) over(...) - or what's its shortest implementation in T-SQL?

I tried something very similar using SQL CLR and got it working, but the performance was awful.

user3455395

3 Answers

2

I assume you meant to write "on (d1.o = ..." and not "on (d.o = ...". Anyway, I got pretty drastic improvements just by simplifying the statement (making it easy for the query optimizer to pick a better plan I guess):

select d0.o as tFrom, d1.o as tTo, d1.m - d0.m as dValue
from data d0
    inner join data d1 
        on d1.o between d0.o + 1 and d0.o + 4
            and (d1.m - d0.m) between 500 and 510

Good luck with your query!

Fredrik Johansson
  • between is about 100 times slower than combination of ors. – user3455395 Aug 04 '14 at 10:37
  • Odd, I got a massive improvement with "my" version of the between-syntax (the one in my answer), but terrible performance with "your" ver that uses subtraction... – Fredrik Johansson Aug 04 '14 at 12:07
  • ...since the subtraction op can't use the index. – Fredrik Johansson Aug 04 '14 at 12:09
  • Possibly. All PCs I tried it on demonstrated worse perf using between. If you look at the plan - you'll see why it happens. – user3455395 Aug 04 '14 at 15:50
  • +1 On a table `create table data (o int primary key, m money)` with 5,000,000 rows of data with sequential `o` values from `1` to `5,000,000` this was fastest of code so far on my machine. Lag = CPU time = 39609 ms, elapsed time = 39656 ms, OP's original was CPU time = 144738 ms, elapsed time = 40478 ms and yours CPU time = 20936 ms, elapsed time = 21037 ms – Martin Smith Aug 04 '14 at 20:30
2

You say you've already tried CLR but don't give any code.

It was fastest in my test for my sample data.

CREATE TABLE data
  (
     o INT PRIMARY KEY,
     m MONEY
  );

-- Generate 5,000,000 rows: sequential o values and pseudo-random m values in the 1..500 range
INSERT INTO data
SELECT TOP 5000000 ROW_NUMBER() OVER (ORDER BY @@SPID),
                   1 + ABS(CAST(CRYPT_GEN_RANDOM(4) AS INT) %500)
FROM   master..spt_values v1,
       master..spt_values v2 

None of the versions actually return any results (it is impossible for m to be a decimal value lying within 1..500 and simultaneously for two m values to have a difference of 500 or more), but, disregarding this, the typical timings I got for the code submitted so far are:

+-----------------+--------------------+
|                 | Duration (seconds) |
+-----------------+--------------------+
| Lag/Lead        | 39.656             |
| Original code   | 40.478             |
| Between version | 21.037             |
| CLR             | 13.728             |
+-----------------+--------------------+

The CLR code I used was based on that here

To call it use

EXEC    [dbo].[WindowTest]
        @WindowSize = 5,
        @LowerBound = 500,
        @UpperBound = 510

Full code listing

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class StoredProcedures
{
    public struct DataRow
    {
        public int o;
        public decimal m;
    }

    [Microsoft.SqlServer.Server.SqlProcedure]
    public static void WindowTest(SqlInt32 WindowSize, SqlInt32 LowerBound, SqlInt32 UpperBound)
    {
        int windowSize = (int)WindowSize;
        int lowerBound = (int)LowerBound;
        int upperBound = (int)UpperBound;

        DataRow[] window = new DataRow[windowSize];

        using (SqlConnection conn = new SqlConnection("context connection=true;"))
        {
            SqlCommand comm = new SqlCommand();
            comm.Connection = conn;
            comm.CommandText = @"
                                SELECT o,m
                                FROM data
                                ORDER BY o";

            SqlMetaData[] columns = new SqlMetaData[3];
            columns[0] = new SqlMetaData("tFrom", SqlDbType.Int);
            columns[1] = new SqlMetaData("tTo", SqlDbType.Int);
            columns[2] = new SqlMetaData("dValue", SqlDbType.Money);


            SqlDataRecord record = new SqlDataRecord(columns);

            SqlContext.Pipe.SendResultsStart(record);

            conn.Open();

            SqlDataReader reader = comm.ExecuteReader();

            int counter = 0;
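            // Stream the rows in o order, comparing each against the previous rows kept in the 'window' circular buffer.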

            while (reader.Read())
            {
                DataRow thisRow = new DataRow() { o = (int)reader[0], m = (decimal)reader[1] };

                int i = 0;

                while (i < windowSize && i < counter)
                {
                    DataRow previousRow = window[i];

                    var diff = thisRow.m - previousRow.m;

                    // Previous row lies within the o window and the difference falls in the requested band.
                    if (((thisRow.o - previousRow.o) <= windowSize - 1) && (diff >= lowerBound) && (diff <= upperBound))
                    {
                        record.SetInt32(0, previousRow.o);
                        record.SetInt32(1, thisRow.o);
                        record.SetDecimal(2, diff);

                        SqlContext.Pipe.SendResultsRow(record);
                    }

                    i++;
                }

                // Overwrite the oldest slot in the circular buffer with the current row.
                window[counter % windowSize] = thisRow;

                counter++;
            }

            SqlContext.Pipe.SendResultsEnd();
        }
    }
}
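
To deploy it, something along these lines should work (the assembly name and dll path below are placeholders):

CREATE ASSEMBLY WindowTestAssembly              -- placeholder assembly name
FROM 'C:\path\to\WindowTest.dll'                -- placeholder path to the compiled dll
WITH PERMISSION_SET = SAFE;
GO

CREATE PROCEDURE dbo.WindowTest
    @WindowSize INT,
    @LowerBound INT,
    @UpperBound INT
AS EXTERNAL NAME WindowTestAssembly.StoredProcedures.WindowTest;
GO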
Martin Smith
  • And maybe at some point I'll look into [parellelizing this](http://channel9.msdn.com/Events/TechEd/NorthAmerica/2013/DBI-B404#fbid=). Unlikely to happen imminently though. – Martin Smith Aug 04 '14 at 23:00
  • Loving the performance difference. I used far more complex code - pregenerated min-O-max for each o + custom code selecting values from the 'candidate' ranges. Thanks a lot for this option! – user3455395 Aug 05 '14 at 06:15
  • @user for larger window sizes than 5 maybe it's worth looking at alternative collections but for 5 using a fixed size array in a circular fashion and nested loops seemed fine. – Martin Smith Aug 05 '14 at 06:22
  • Regarding parallelism, I wasn't that taken with the methods in the presentation. Probably the simplest way would be to get the `min(o)` and `max(o)` and split up into good-size ranges yourself, then run it on concurrent connections. The procedure would need to be altered to accept from and to params, the query changed to return rows `between @from and (@to + @windowSize)` (with another check that `previousRow.o <= from` so duplicates are not returned in different batches). I had a go at doing this with a TVF and `QUERYTRACEON 8649` but parallelism disappeared as soon as I added `DataAccessKind.Read`. – Martin Smith Aug 06 '14 at 22:31
1

This looks like a great case for windowed aggregate functions or LAG. Here is a version using LAG:

select *
from (
 select o
      , lag(m, 4) over (order by o) as m4
      , lag(m, 3) over (order by o) as m3
      , lag(m, 2) over (order by o) as m2
      , lag(m, 1) over (order by o) as m1
      , m as m0
 from data
) x
where 0=1
 or (m1 - m0) between 500 and 510
 or (m2 - m0) between 500 and 510
 or (m3 - m0) between 500 and 510
 or (m4 - m0) between 500 and 510

Using a windowed aggregate function you should be able to remove the manual expansion of those LAG calls.

SQL Server implements these things using a special execution plan operator called Window Spool. That makes it quite efficient.
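
For example, a windowed MIN over the preceding rows can act as a cheap prefilter (a sketch; like the LAG version it counts rows rather than o distance, and MIN alone is only a necessary condition, so the exact per-offset 500..510 check is still needed on the candidates):

select o, m, minPrev
from (
 select o
      , m
      , min(m) over (order by o rows between 4 preceding and 1 preceding) as minPrev
 from data
) x
where m - minPrev >= 500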

usr
  • Getting: 'The function 'lag' must have an OVER clause.' error – user3455395 Jul 31 '14 at 14:23
  • @user3455395 edited. Don't be too shy to look into the documentation. – usr Jul 31 '14 at 14:27
  • I never am! The result is 27 seconds, which is close to the record. Could we enhance it further by any sort of indexes (Columnstore ones? - never used them wonder if they can help)? – user3455395 Jul 31 '14 at 14:31
  • Post the actual execution plan for this query as XML (if too big, upload on pastebin). Problems will become apparent, if any. – usr Jul 31 '14 at 14:32
  • Execution plan is added. – user3455395 Jul 31 '14 at 14:34
  • One thing I was thinking of - I can get rid of decimals by multiplying m values by 10K. Is it worth doing at all? – user3455395 Jul 31 '14 at 14:36
  • No, because processing a few million decimals cannot possibly take 20 seconds. It can't be a substantial part of the query. The plan shows that each `LAG` results in another window spool operator. We can probably speed it up by 2-4x but we will never hit your performance goal of instant execution. – usr Jul 31 '14 at 14:38
  • Does historical data change? If not, you probably have to pre-generate some data. – usr Jul 31 '14 at 14:39
  • I'm out of ideas for now. Pregenerate :) – usr Jul 31 '14 at 14:41
  • No, once it's in the DB it never changes. Any sort of pre-generated data bloats the database: if I need to analyze up to depth 250 o-wise, we're talking 5M x 250 values, and populating it is just killing my PC. – user3455395 Jul 31 '14 at 14:41