1

I have a large table with time-series data for a whole year, with 24 values for each day (one per hour), for multiple customers.

Customer-ID Date Value
001 2020-01-01 00:00:00 xx
001 2020-01-01 00:01:00 xx
001 2020-01-01 00:02:00 xx
... ... ...
002 2020-01-01 00:00:00 xx
002 2020-01-01 00:01:00 xx
002 2020-01-01 00:02:00 xx
... ... ...

I am currently storing the whole dataset in a single sqlite table, which I then query from an python application in which you can select different forms of visualization (by month which shows 12 values per year or by day, which shows 365 values per year).

CREATE TABLE "data" (
    "index" INTEGER NOT NULL,
    "customer_id"   INTEGER NOT NULL,
    "date"  DATETIME NOT NULL,
    "value" FLOAT NOT NULL,
    "year"  INTEGER NOT NULL,
    "month" INTEGER NOT NULL,
    "day"   INTEGER NOT NULL,
    PRIMARY KEY("index")
);

CREATE INDEX "idx_data_ym" ON "data" (
    "year",
    "month"
);

CREATE INDEX "idx_data_ymd" ON "data" (
    "year",
    "month",
    "day"
);

One naive query I could use for showing the monthly data, would be

SELECT date, sum(value) FROM data GROUP BY CAST(STRFTIME('%Y', date) AS INTEGER), CAST(STRFTIME('%m', date) AS INTEGER)

which is rather slow on a large dataset and from my understanding can't use indexes, which is why I am storing year, month and day as an extra column, so that I can use

SELECT date, sum(value) from data GROUP BY year, month

Result:

Date Sum
2017-01-01 00:00:00 yy
2017-01-02 00:00:00 yy
.. ..
2018-01-01 00:00:00 yy
2018-01-02 00:00:00 yy
.. ..

For a small test dataset (100 customers, 24 hours per value, two years = 100 * 24 * 365 * 2 = 1.752.000 records), the first query takes about 20s while the second one takes only 1.8s.

Checking the queries with EXPLAIN QUERY PLAN, the second query is using the idx_data_ym index, which is what I wanted, while the first query does not use an index.

Output EXPLAIN QUERY PLAN for the first query:

id parent notused detail
6 0 0 SCAN TABLE data
8 0 0 USE TEMP B-TREE FOR GROUP BY

Output for the second query:

id parent notused detail
7 0 0 SCAN TABLE data USING INDEX idx_data_ym

Now I'm wondering, 1.8s may be okay, but the dataset in production will be much larger which will slow down the application tremendously. Grouping by year, month, day to get the sum values for each day is even slower.

Is there a way to improve my query performance? Am I doing something complety wrong with understanding how to aggregate the data?

Thanks for your help!

forpas
  • 160,666
  • 10
  • 38
  • 76
HealYouDown
  • 158
  • 3
  • 12
  • 1
    Have you done some other performance checks. How Long would it take to simply select the whole Table. Is ist faster if you try it without the sum Function? In my Opinion it looks good, it is a minimalistic Query with IMHO no imporvments to make. On which hardware do you run the Query? Another option would be to precalculate this Values and save it to antoher Table. If I get you right this Values will not change after written in the Table, right? – fisi-pjm Jan 26 '22 at 12:04
  • I'm running it on a normal notebook given out by my company. Not the best specs though. Running it on my home computer does increase the query speed a bit, but still not as much as I would have liked. Summing the whole test table takes about 1s, so I know it's rather hard to do below that. Calculating the values on insert is my plan B, right. The data does not get changed and is imported for a whole year each, so that's no problem. I was just curious whether there is a way to optimize the query. – HealYouDown Jan 26 '22 at 12:08

1 Answers1

1

There is no need for the additional columns year, month and day.

You can set Indexes On Expressions:

CREATE TABLE "data" (
    "index" INTEGER NOT NULL,
    "customer_id" INTEGER NOT NULL,
    "date" TEXT NOT NULL, -- there is no DATETIME data type in SQLite
    "value" FLOAT NOT NULL,
    PRIMARY KEY("index")
);

CREATE INDEX "idx_data_ym" ON "data"(strftime('%Y-%m', date));

CREATE INDEX "idx_data_ymd" ON "data"(date(date)); -- equivalent of strftime('%Y-%m-%d', date)

For these queries:

SELECT STRFTIME('%Y-%m', date) AS year_month, 
       SUM(value) AS total
FROM data 
GROUP BY year_month;

SELECT date(date) AS year_month_day, 
       SUM(value) AS total
FROM data 
GROUP BY year_month_day; 

the proper index will be used.

See the demo.

Indexes are the best that you can do to speed up the operation.

forpas
  • 160,666
  • 10
  • 38
  • 76