
Consider this table that stores the value of two stock variables A and B at each point in time:

         A   B
day 1   10   0 
day 2    0  10
day 3    7   7
day 4    7   7

We want to answer questions like:

  • What was the maximum value achieved by variable A in a given range of days?

  • What was the maximum value achieved by the sum of variables A and B in a given range of days?

However, the actual table might have billions of rows and many variables. To get answers faster, we plan to precompute a summary table with a lower time granularity.

The problem is that naively computing the maximum of A and of B separately at the coarser granularity is not enough to answer the second question. For example:

         Max-A  Max-B
day 1&2     10     10
day 3&4      7      7

We have lost the fact that the maximum of A + B is achieved across days 3 & 4.

We can add a new Max-(A+B) column to the summary table. But if there are many different variables, we will face a combinatorial explosion. The summary table might end up being bigger than the original one!
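
To make the loss concrete, here is a minimal sketch in plain Python (the names are made up for illustration) that reproduces the tables above: from the per-variable summary alone, the best we can recover for the maximum of A + B is an upper bound, not the true value.

    # Daily values from the example above: day -> (A, B).
    rows = {1: (10, 0), 2: (0, 10), 3: (7, 7), 4: (7, 7)}

    # Naive summary at two-day granularity: keep only per-variable maxima.
    summary = {}
    for bucket in [(1, 2), (3, 4)]:
        summary[bucket] = (
            max(rows[d][0] for d in bucket),  # Max-A
            max(rows[d][1] for d in bucket),  # Max-B
        )

    # True maximum of A + B over all days.
    true_max = max(a + b for a, b in rows.values())            # 14 (days 3 and 4)

    # Best estimate reconstructible from the summary: only an upper bound.
    upper_bound = max(ma + mb for ma, mb in summary.values())  # 20 (from days 1 & 2)

    print(true_max, upper_bound)  # prints: 14 20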

Is there an algorithm / data structure for efficiently storing these kinds of precomputed maximums, in a way that lets us ask questions about arbitrary combinations of variables while avoiding a combinatorial explosion? My guess is that it could assume some regularities in the data and try to exploit them, at the cost of some generality.

danidiaz
  • What about computing sparse tables for each of the stock variables? Then for each time range, compute the maximum of A and B to get the maximum of A + B. – xrisk May 04 '19 at 07:08
  • https://cp-algorithms.com/data_structures/sparse-table.html – xrisk May 04 '19 at 07:08
  • How much space can you use? If it's essentially infinite that implies different answers than if it's very limited. – Richard May 04 '19 at 07:23
  • You should look for Range Maximum Query problems – here's a nice broad treatment of the problem https://www.topcoder.com/community/competitive-programming/tutorials/range-minimum-query-and-lowest-common-ancestor/ – xrisk May 04 '19 at 07:26
  • @Richard Ideally, the space used should be no bigger than the one taken by the original table. – danidiaz May 04 '19 at 07:33
  • It makes sense to consider an OLAP system like [clickhouse](https://clickhouse.yandex/) with a schema like *table{date, param_id, value}* and an index built on the pair *{date, param_id}*. Using [AggregatingMergeTree](https://clickhouse.yandex/docs/en/operations/table_engines/aggregatingmergetree/) you can precalculate aggregates for each param per date and get the required results on the fly. – vladimir May 04 '19 at 07:34
  • Well, you can either use a sparse table, which takes O(n log n) space and answers max queries in O(1), or a segment tree, which uses O(4*n) space and answers max queries in O(log n) time (see the sparse-table sketch after these comments). – Photon May 04 '19 at 15:15
  • @danidiaz Can you give examples of some queries and expected output? – nice_dev May 05 '19 at 17:32
  • @vivek_23 "What was the *maximum* value achieved by the sum of values A and B across all recorded history?" The answer is 14 (7 + 7 in both day 3 and day 4) but that information can't be deduced from the lower-granularity table, unless one explicitly adds a pre-computed column for the sum. I have come to the conclusion that I'm asking for too much: if I want to lower the granularity of the table, I'll have to bite the bullet and add columns for each combination of variables (A+B, B+C, A+B+C, A+D...) in which I'm interested. – danidiaz May 05 '19 at 18:03
  • It seems that downsampling (reducing the granularity of received data) https://docs.influxdata.com/influxdb/v1.7/guides/downsampling_and_retention/ and pre-computing aggregates http://druid.io/blog/2011/04/30/introducing-druid.html are things commonly done in time-series databases that have features for analytics, like Druid. – danidiaz May 05 '19 at 18:11
  • @Rishav: "compute the maximum of A and B to get the maximum of A + B": That's not how that works. Did you read the question? – ruakh May 05 '19 at 18:11
  • Related: https://stackoverflow.com/questions/38513382/data-aggregation-and-caching-how-to-quickly-graph-large-time-series-datasets-by – danidiaz May 05 '19 at 18:21
  • @ruakh yeah, it seems I misunderstood the problem. My bad! – xrisk May 05 '19 at 18:25
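
As a reference for the sparse-table suggestion in the comments above, here is a minimal sketch (plain Python, class and variable names made up). Note that it answers range-maximum queries over a single precomputed series, e.g. the per-day values of A + B, so a separate table is still needed for every combination of variables you want to query; by itself it does not avoid the combinatorial explosion described in the question.

    class SparseTableMax:
        """Static range-maximum queries: O(n log n) preprocessing, O(1) per query."""

        def __init__(self, values):
            n = len(values)
            # log[i] = floor(log2(i)), precomputed iteratively.
            self.log = [0] * (n + 1)
            for i in range(2, n + 1):
                self.log[i] = self.log[i // 2] + 1
            # table[j][i] = max of values[i .. i + 2**j - 1]
            self.table = [list(values)]
            for j in range(1, self.log[n] + 1 if n else 1):
                prev = self.table[j - 1]
                half = 1 << (j - 1)
                self.table.append(
                    [max(prev[i], prev[i + half]) for i in range(n - (1 << j) + 1)]
                )

        def query(self, lo, hi):
            """Maximum of values[lo..hi], both ends inclusive, 0-based."""
            j = self.log[hi - lo + 1]
            return max(self.table[j][lo], self.table[j][hi - (1 << j) + 1])

    # Example: maximum of A + B over days 2..4 of the table in the question.
    a = [10, 0, 7, 7]
    b = [0, 10, 7, 7]
    st = SparseTableMax([x + y for x, y in zip(a, b)])
    print(st.query(1, 3))  # prints: 14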

1 Answer


There is no really good data structure for everything you want... but you know there are only 365 days in a year, i.e., your table will not have billions of rows.

The table will only have a few thousand rows at most, so it's not going to take any significant amount of time just to iterate over it to calculate whatever statistics you like.
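
For reference, the direct scan suggested here might look like the following minimal sketch (plain Python; the row layout and names are made up):

    # Maximum of an arbitrary sum of variables over a day range, by plain iteration.
    def range_max_of_sum(rows, variables, lo, hi):
        """Maximum of sum(variables) over rows[lo..hi], both ends inclusive."""
        return max(sum(row[v] for v in variables) for row in rows[lo:hi + 1])

    daily = [
        {"A": 10, "B": 0},   # day 1
        {"A": 0, "B": 10},   # day 2
        {"A": 7, "B": 7},    # day 3
        {"A": 7, "B": 7},    # day 4
    ]
    print(range_max_of_sum(daily, ["A", "B"], 0, 3))  # prints: 14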

Matt Timmermans