Make statistics out of a SQL table

Question

I have a table in my database where I register readings from several sensors this way:

CREATE TABLE [test].[readings] (
    [timestamp_utc] DATETIME2(0) NOT NULL, -- 48bits
    [sensor_id] INT NOT NULL, -- 32 bits
    [site_id] INT NOT NULL, -- 32 bits
    [reading] REAL NOT NULL, -- 64 bits
    PRIMARY KEY([timestamp_utc], [sensor_id], [site_id])
)

CREATE TABLE [test].[sensors] (
    [sensor_id] int NOT NULL ,
    [measurement_type_id] int NOT NULL,
    [site_id] int NOT NULL ,
    [description] varchar(255) NULL ,
    PRIMARY KEY ([sensor_id], [site_id])
)

And I want to easily make statistics out of all these readings.

Some queries I would like to do:

Get me all readings for site_id = X between date_hour1 and date_hour2

Get me all readings for site_id = X and sensor_id in <list> between date_hour1 and date_hour2

Get me all readings for site_id = X and sensor measurement type = Z between date_hour1 and date_hour2

Get me all readings for site_id = X, aggregated (average) by DAY between date_hour1 and date_hour2

Get me all readings for site_id = X, aggregated (average) by DAY between date_hour1 and date_hour2 but in UTC+3 (this should give a different result than previous query because now the beginning and ending of days are shifted by 3h)

Get me min, max, std, mean for all readings for site_id = X between date_hour1 and date_hour2

So far I have been using Java to query the database and perform all this processing locally. But this ends up a bit slow and the code stays a mess to write and maintain (too much cicles, generic functions to perform repeated tasks, large/verbose code base, etc)...

To make things worse, table readings is huge (hence the importance of the primary key, which is also a performance index), and maybe I should be using a TimeSeries database for this (are there any good ones?). I am using SQL Server.

What is the best way to do this? I feel I am reinventing the wheel because all of this is kinda of an analytics app...

I know these queries sound simple, but when you try to parametrize all this you can end up with a monster like this:

-- Sums all device readings, returns timestamps in localtime according to utcOffset (if utcOffset = 00:00, then timestamps are in UTC)
CREATE PROCEDURE upranking.getSumOfReadingsForDevices
    @facilityId int,
    @deviceIds varchar(MAX),
    @beginTS datetime2,
    @endTS datetime2,
    @utcOffset varchar(6),
    @resolution varchar(6) -- NO, HOURS, DAYS, MONTHS, YEARS
AS BEGIN
    SET NOCOUNT ON -- http://stackoverflow.com/questions/24428928/jdbc-sql-error-statement-did-not-return-a-result-set
    DECLARE @deviceIdsList TABLE (
            id int NOT NULL
    );

    DECLARE @beginBoundary datetime2,
            @endBoundary datetime2;

    SELECT @beginBoundary = DATEADD(day, -1, @beginTS);
    SELECT @endBoundary = DATEADD(day, 1, @endTS);

    -- We shift sign from the offset because we are going to convert the zone for the entire table and not beginTS endTS themselves
    SELECT @utcOffset = CASE WHEN LEFT(@utcOffset, 1) = '+' THEN STUFF(@utcOffset, 1, 1, '-') ELSE STUFF(@utcOffset, 1, 1, '+') END

    INSERT INTO @deviceIdsList
    SELECT convert(int, value) FROM string_split(@deviceIds, ',');

    SELECT SUM(reading) as reading,
           timestamp_local
    FROM (
            SELECT reading,
                   upranking.add_timeoffset_to_datetime2(timestamp_utc, @utcOffset, @resolution) as timestamp_local
            FROM upranking.readings
            WHERE
                device_id IN (SELECT id FROM @deviceIdsList)
                AND facility_id = @facilityId
                AND timestamp_utc BETWEEN @beginBoundary AND @endBoundary
         ) as innertbl
    WHERE timestamp_local BETWEEN @beginTS AND @endTS
    GROUP BY timestamp_local
    ORDER BY timestamp_local
END
GO

This is a query that receives the site id (facilityId in this case), the list of sensor ids (the deviceIds in this case), the beginning and the ending timestamps, followed by their UTC offset in a string like "+xx:xx" or "-xx:xx", terminating with the resolution which will basically say how the result will be aggregated by SUM (taking the UTC offset into consideration).

And since I am using Java, at first glance I could use Hibernate or something, but I feel Hibernate wasn't made for these type of queries.

this has nothing to do with Java, please consider to remove the Java tag — fantaghirocco, Sep 04 '17 at 22:42
Like your first query is quite literally `SELECT * FROM Sensors s JOIN Readings r on r.sensor_id = s.sensor_id WHERE s.site_id = X AND r.timestamp_utc > early_timestamp AND r.timestamp_utc < late_timestamp` - is thsi the kind of thing you're looking for? — corsiKa, Sep 04 '17 at 22:43
No, those quesries were just basic examples. I am asking if there is any better way than this: http://prntscr.com/gh4w44 — PedroD, Sep 04 '17 at 22:45
@corsiKa that query does not consider UTC offset, granularity/resolution for the aggregation, aggregation type, multiple sensor ids, sensor by type, can't make statistics out of it (min, max, std, mean, median, etc.) etc. I need more generic queries (like the procedure I've posted in the OP). But again, there must be already some framework capable of doing this heavy lifting, I am just reiventing the wheel — PedroD, Sep 04 '17 at 23:00
and what if I want to aggregate by the same hour for all selected dates? — PedroD, Sep 04 '17 at 23:07
I dunno, I think you need to think about what your actual question is. I will say I think you're better off writing individual queries with parameterized statements than database procedures. None of this seems like it's complicated stuff - every one of those queries is pretty straghtforward out of the box SQL. But as it stands, it really looks like you're asking SO to write the queries for you which is unlikely to happen. There's a difference between helping someone and doing free consulting work. — corsiKa, Sep 04 '17 at 23:32
@corsiKa That's nice and all, but I was just asking if there is some framework that already provides these types of queries out of the box, not free consulting. Let me know if you still don't get the difference, I can give you free consulting on this. — PedroD, Sep 05 '17 at 13:47

Rafa · Accepted Answer · 2017-09-06T14:37:32.910

Your structure looks good at first glance but looking at your queries it makes me think that there are tweaks you may want to try. Performance is never an easy subject and it is not easy to find an "one size fits all answer". Here a some considerations:

Do you want better read or write performance? If you want better read performance you need to reconsider your indexes. Sure you have a primary key but most of your queries don't make use of it (all three fields). Try creating an index for [sensor_id], [site_id].
Can you use cache? If some searches are recurrent and your app is the single point of entry to your database, then evaluate if your use cases would benefit from caching.
If the table readings is huge, then consider using some sort of partitioning strategy. Check out MSSQL documentation
If you don't need real time data, then try some sort of search engine such as Elastic Search

Make statistics out of a SQL table

1 Answers1