
I have the following MYSQL table:

measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)

The table contains more than 1 billion entries. I want to be able to visualize any time window, ranging in size from one day to many years. There is a measurement value roughly every minute in the DB.

So the number of entries for a time window can vary widely, from a few hundred to several thousand or even millions.

Those values are meant to be visualized in a chart on a webpage.

If the chart is, let's say, 800px wide, it does not make sense to fetch thousands of rows from the database when the time window is large. I cannot show more than 800 values on that chart anyway.

So, is there a way to reduce the result set directly on the DB side? I know aggregate functions like "average" and "sum", but how can I, for example, aggregate 100k rows from a big time window down to, let's say, 800 final rows?

Just fetching those 100k rows and letting the chart do the magic is not the preferred option; transfer size is one reason why.

Isn't there something on the DB side I can use? Something like avg() to shrink X rows down to Y averaged rows? Or some simple trick to just skip every #th row to shrink X to Y?

update: Although I'm using MySQL right now, I'm not tied to it. If PostgreSQL, for instance, provides a feature that could solve the issue, I'm willing to switch databases.

update2: I may have found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html (see the section "Data aggregation").

The key is not to use a Unix timestamp but a date, truncate ("trunc") it, average the values, and group by the truncated date. This could work for me, but it would require a rework of my table structure. Hmm... maybe there's more... still researching...
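As a minimal sketch of what that could look like in PostgreSQL, assuming a hypothetical schema where the timestamp is stored as a timestamptz column named ts (not my current schema):

-- hypothetical Postgres schema: measuredata(ts timestamptz, entityid bigint, value double precision)
SELECT date_trunc('day', ts) AS day,
       avg(value)            AS avg_value
FROM   measuredata
WHERE  entityid = 38
  AND  ts >= '2019-01-25'
GROUP  BY day
ORDER  BY day;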

update3: Inspired by update 2, I came up with this query:

SELECT (`timestamp` - (`timestamp` % 86400)) AS aggtimestamp, `entityid`, `value` FROM `measuredata` WHERE `entityid` = 38 AND `timestamp` > UNIX_TIMESTAMP('2019-01-25') GROUP BY aggtimestamp

It works, but my DB/index/structure doesn't seem optimized for this: a query over the last year took ~75 sec (slow test machine) and only returns one value per day. This can be combined with avg(value), but that further increases the query time (~82 sec). I will see whether it's possible to optimize this further. But I now have an idea of how "downsampling" data works, especially aggregation in combination with GROUP BY.
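For reference, a sketch of that avg() variant with a fixed one-day bucket (86400 s); the date bounds are example values, and to land near 800 output rows the bucket size would instead be roughly (window length in seconds) / 800:

-- 86400 = one-day buckets; something like FLOOR((windowEnd - windowStart) / 800)
-- would give roughly 800 buckets instead (windowStart/windowEnd are hypothetical names)
SELECT (`timestamp` - (`timestamp` % 86400)) AS aggtimestamp,
       AVG(`value`) AS avg_value
FROM `measuredata`
WHERE `entityid` = 38
  AND `timestamp` >= UNIX_TIMESTAMP('2019-01-25')
  AND `timestamp` <  UNIX_TIMESTAMP('2020-01-25')
GROUP BY aggtimestamp
ORDER BY aggtimestamp;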

Alex
  • `width_bucket()` comes to mind in Postgres: https://www.postgresql.org/docs/current/functions-math.html#FUNCTIONS-MATH-FUNC-TABLE –  Jan 27 '20 at 20:17
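Picking up the width_bucket() suggestion from the comment above, a minimal Postgres sketch, assuming the timestamp column stores Unix epoch seconds; the bounds (2019-01-25 to 2020-01-25) and the 800 bucket count are example values:

-- split the requested window into 800 equal-width buckets and average each bucket
SELECT width_bucket("timestamp", 1548374400, 1579910400, 800) AS bucket,
       min("timestamp") AS bucket_start,
       avg(value)       AS avg_value
FROM measuredata
WHERE entityid = 38
  AND "timestamp" >= 1548374400   -- 2019-01-25 (example lower bound)
  AND "timestamp" <  1579910400   -- 2020-01-25 (example upper bound)
GROUP BY bucket
ORDER BY bucket;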

1 Answer


There is probably no efficient way to do this. But, if you want, you can break the rows into equal-sized groups and then fetch, say, the first row from each group. Here is one method:

select md.*
from (select md.*,
             -- number the rows within each bucket
             row_number() over (partition by tile order by timestamp) as seqnum
      from (-- ntile(800) splits the filtered rows into 800 roughly equal buckets
            select md.*, ntile(800) over (order by timestamp) as tile
            from measuredata md
            where . . .   -- your filtering conditions here
           ) md
     ) md
where seqnum = 1;   -- keep only the first row of each bucket
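As a possible variant (a sketch, not part of the answer above): instead of keeping the first row per bucket, each bucket could be aggregated with avg(), which matches the averaging idea in the question. Note that ntile() and row_number() are window functions, so this requires MySQL 8.0+ (or PostgreSQL).

-- average the values inside each of the 800 buckets
select min(timestamp) as bucket_start,
       avg(value)     as avg_value
from (select md.*, ntile(800) over (order by timestamp) as tile
      from measuredata md
      where . . .   -- your filtering conditions here
     ) md
group by tile
order by bucket_start;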
Gordon Linoff
  • Hmm, that's kind of triple-selecting the data. This will certainly reduce the data that has to be transferred, but I'm not sure about the query effort/DB load... I really hoped there was some aggregate function I missed during my research. Does anyone know whether PostgreSQL offers more than this? – Alex Jan 27 '20 at 19:35
  • @Alex . . . Why are you asking about Postgres in your comment, when your question is clearly about MySQL? If you have a question about Postgres, you should ask that as a *question*. – Gordon Linoff Jan 27 '20 at 19:40
  • Because I'm not tied to MySQL? The described DB structure is so simple that the actual database system shouldn't matter. That's why. I'll add this detail to the question... – Alex Jan 27 '20 at 20:00