4

I have a few billions rows in table for 4k mutable params and I need to get last values for 500 of them My table is partitioned by day and order by param ids so i need just to find last record with needed id

SELECT max(time)
FROM obj_ntgres.param_values_history
PREWHERE param_id = 4171

work slow: Elapsed: 0.437 sec. Processed 2.56 million rows, 5.21 MB (5.87 million rows/s., 11.92 MB/s.)

SELECT *
FROM obj_ntgres.param_values_history
PREWHERE param_id = 4171
ORDER BY time DESC
LIMIT 1

slower: 1 rows in set. Elapsed: 3.413 sec. Processed 2.56 million rows, 5.45 MB (751.21 thousand rows/s., 1.60 MB/s.)

Table

CREATE TABLE obj_ntgres.param_values_history (

  time DateTime,

  param_id UInt16,

  param_value Float32,

  param_value_quality Decimal(1, 0),

  msec Decimal(3, 0)

) ENGINE = MergeTree PARTITION BY toStartOfDay(time)

ORDER BY

  param_id SETTINGS index_granularity = 8192

Maybe you have some ideas how to make it faster?

I mean: find last element without using max() on all table

Artem Gusev
  • 43
  • 1
  • 1
  • 7

2 Answers2

7

Actually the reason it that it still need to scan quite a lot of data with same param_id.

There are few approaches possible. In all cases at the beginning you need to add time column to table sorting key:

    CREATE TABLE param_values_history (
      time DateTime,
      param_id UInt16,
      param_value Float32,
      param_value_quality Decimal(1, 0),
      msec Decimal(3, 0)
    ) ENGINE = MergeTree PARTITION BY toStartOfDay(time)
    ORDER BY
      (param_id,time) SETTINGS index_granularity = 8192

After that - if your data is time aligned, i.e. if you know exactly that for all 500 parameters you had some value in last few seconds / minutes, you can just add a filter like AND time > now() - INTERVAL 10 MINUTES, and it will work really fast (no need to scan a lot of rows).

A bit worse if some of your parameters has no regular activity.

In that case the fastest way would be to cache the last time for each parameter by Materialized view, or even the whole last row. Something like that:

CREATE MATERIALIZED VIEW last_positions
 Engine=ReplacingMergeTree(max_time)
ORDER BY param_id
PARTITION BY tuple()
AS SELECT param_id, max(time) as max_time
FROM param_values_history
GROUP BY param_id;

SELECT * FROM param_values_history PREWHERE (param_id,time) IN (SELECT param_id, max(max_time) FROM last_positions GROUP BY param_id);

Or: whole last row collected in MV

CREATE MATERIALIZED VIEW last_positions
 Engine=ReplacingMergeTree(max_time)
ORDER BY param_id
PARTITION BY tuple()
AS SELECT param_id,
   argMax(param_value, time) as _param_value, 
   argMax(param_value_quality, time) as _param_value_quality, 
   argMax(param_value, msec) as _msec, 
   max(time) as max_time
FROM param_values_history
GROUP BY param_id;

SELECT * FROM last_positions FINAL;
filimonov
  • 1,666
  • 1
  • 9
  • 20
2

I don't understand what you mean by "work bad". But if the question is to

select last record with specific where

you can try this (modify to your needs):

SELECT 
    max((time, param_value, param_value_quality, msec)) AS result,
    result.2 AS param_value,
    result.3 AS param_value_quality
FROM obj_ntgres.param_values_history
PREWHERE param_id = 4171
simPod
  • 11,498
  • 17
  • 86
  • 139
  • OK, then my variant will probably work at the same speed. It will just allow you to get the values from the max row as a bonus – simPod Apr 10 '19 at 14:06