
Consider this table that stores the value of two stock variables A and B at each point in time:

         A   B
day 1   10   0 
day 2    0  10
day 3    7   7
day 4    7   7

We want to answer questions like:

  • What was the maximum value achieved by variable A in a given range of days?

  • What was the maximum value achieved by the sum of variables A and B in a given range of days?

However, the actual table might have billions of rows and many variables. To get answers faster, we plan to precompute a summary table with a lower time granularity.

The problem is that naively computing the maximum of A and of B separately at the coarser granularity is not enough to answer the second question. For example:

         Max-A  Max-B
day 1&2     10     10
day 3&4      7      7

We have lost the fact that the maximum of A + B is achieved across days 3 & 4.

We can add a new Max-(A+B) column to the summary table. But if there are many different variables, we will face a combinatorial explosion. The summary table might end up being bigger than the original one!
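
To make the loss concrete, here is a minimal sketch in plain Python (the names are made up for illustration) that reproduces the tables above: from the per-variable summary alone, the best we can recover for the maximum of A + B is an upper bound, not the true value.

    # Daily values from the example above: day -> (A, B).
    rows = {1: (10, 0), 2: (0, 10), 3: (7, 7), 4: (7, 7)}

    # Naive summary at two-day granularity: keep only per-variable maxima.
    summary = {}
    for bucket in [(1, 2), (3, 4)]:
        summary[bucket] = (
            max(rows[d][0] for d in bucket),  # Max-A
            max(rows[d][1] for d in bucket),  # Max-B
        )

    # True maximum of A + B over all days.
    true_max = max(a + b for a, b in rows.values())            # 14 (days 3 and 4)

    # Best estimate reconstructible from the summary: only an upper bound.
    upper_bound = max(ma + mb for ma, mb in summary.values())  # 20 (from days 1 & 2)

    print(true_max, upper_bound)  # prints: 14 20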

Is there an algorithm / data structure for efficiently storing these kinds of precomputed maximums, in a way that lets us ask questions about arbitrary combinations of variables while avoiding a combinatorial explosion? My guess is that it could assume some regularities in the data and try to exploit them, at the cost of some generality.

danidiaz
  • What about computing sparse tables for each of the stock variables? Then for each time range, compute the maximum of A and B to get the maximum of A + B. – xrisk May 04 '19 at 07:08
  • https://cp-algorithms.com/data_structures/sparse-table.html – xrisk May 04 '19 at 07:08
  • How much space can you use? If it's essentially infinite that implies different answers than if it's very limited. – Richard May 04 '19 at 07:23
  • You should look for Range Maximum Query problems – here's a nice broad treatment of the problem https://www.topcoder.com/community/competitive-programming/tutorials/range-minimum-query-and-lowest-common-ancestor/ – xrisk May 04 '19 at 07:26
  • @Richard Ideally, the space used should be no bigger than the one taken by the original table. – danidiaz May 04 '19 at 07:33
  • It makes sense to consider an OLAP system like [clickhouse](https://clickhouse.yandex/) with a schema like *table{date, param_id, value}* and an index built on the pair *{date, param_id}*. Using [AggregatingMergeTree](https://clickhouse.yandex/docs/en/operations/table_engines/aggregatingmergetree/) you can precalculate aggregates for each param per date and get the required results on the fly. – vladimir May 04 '19 at 07:34
  • Well, you can either use a sparse table, which takes O(n log n) space and answers max queries in O(1), or a segment tree, which uses O(4*n) space and answers max queries in O(log n) time (see the sparse-table sketch after these comments). – Photon May 04 '19 at 15:15
  • @danidiaz Can you give examples of some queries and expected output? – nice_dev May 05 '19 at 17:32
  • @vivek_23 "What was the *maximum* value achieved by the sum of values A and B across all recorded history?" The answer is 14 (7 + 7 in both day 3 and day 4) but that information can't be deduced from the lower-granularity table, unless one explicitly adds a pre-computed column for the sum. I have come to the conclusion that I'm asking for too much: if I want to lower the granularity of the table, I'll have to bite the bullet and add columns for each combination of variables (A+B, B+C, A+B+C, A+D...) in which I'm interested. – danidiaz May 05 '19 at 18:03
  • It seems that downsampling (reducing the granularity of received data) https://docs.influxdata.com/influxdb/v1.7/guides/downsampling_and_retention/ and pre-computing aggregates http://druid.io/blog/2011/04/30/introducing-druid.html are things commonly done in time-series databases that have features for analytics, like Druid. – danidiaz May 05 '19 at 18:11
  • @Rishav: "compute the maximum of A and B to get the maximum of A + B": That's not how that works. Did you read the question? – ruakh May 05 '19 at 18:11
  • Related: https://stackoverflow.com/questions/38513382/data-aggregation-and-caching-how-to-quickly-graph-large-time-series-datasets-by – danidiaz May 05 '19 at 18:21
  • @ruakh yeah, it seems I misunderstood the problem. My bad! – xrisk May 05 '19 at 18:25
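
As a reference for the sparse-table suggestion in the comments above, here is a minimal sketch (plain Python, class and variable names made up). Note that it answers range-maximum queries over a single precomputed series, e.g. the per-day values of A + B, so a separate table is still needed for every combination of variables you want to query; by itself it does not avoid the combinatorial explosion described in the question.

    class SparseTableMax:
        """Static range-maximum queries: O(n log n) preprocessing, O(1) per query."""

        def __init__(self, values):
            n = len(values)
            # log[i] = floor(log2(i)), precomputed iteratively.
            self.log = [0] * (n + 1)
            for i in range(2, n + 1):
                self.log[i] = self.log[i // 2] + 1
            # table[j][i] = max of values[i .. i + 2**j - 1]
            self.table = [list(values)]
            for j in range(1, self.log[n] + 1 if n else 1):
                prev = self.table[j - 1]
                half = 1 << (j - 1)
                self.table.append(
                    [max(prev[i], prev[i + half]) for i in range(n - (1 << j) + 1)]
                )

        def query(self, lo, hi):
            """Maximum of values[lo..hi], both ends inclusive, 0-based."""
            j = self.log[hi - lo + 1]
            return max(self.table[j][lo], self.table[j][hi - (1 << j) + 1])

    # Example: maximum of A + B over days 2..4 of the table in the question.
    a = [10, 0, 7, 7]
    b = [0, 10, 7, 7]
    st = SparseTableMax([x + y for x, y in zip(a, b)])
    print(st.query(1, 3))  # prints: 14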

1 Answer


There is no really good data structure for everything you want... but you know there are only 365 days in a year, i.e., your table will not have billions of rows.

The table will only have a few thousand rows at most, so it's not going to take any significant amount of time just to iterate over it to calculate whatever statistics you like.
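
For reference, the direct scan suggested here might look like the following minimal sketch (plain Python; the row layout and names are made up):

    # Maximum of an arbitrary sum of variables over a day range, by plain iteration.
    def range_max_of_sum(rows, variables, lo, hi):
        """Maximum of sum(variables) over rows[lo..hi], both ends inclusive."""
        return max(sum(row[v] for v in variables) for row in rows[lo:hi + 1])

    daily = [
        {"A": 10, "B": 0},   # day 1
        {"A": 0, "B": 10},   # day 2
        {"A": 7, "B": 7},    # day 3
        {"A": 7, "B": 7},    # day 4
    ]
    print(range_max_of_sum(daily, ["A", "B"], 0, 3))  # prints: 14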

Matt Timmermans