
I'd like to ask fellow SO'ers for their opinions on best-of-breed data structures for indexing time-series data (a.k.a. column-wise or flat linear data).

Two basic types of time-series exist based on the sampling/discretisation characteristic:

  1. Regular discretisation (every sample is taken at a common frequency)

  2. Irregular discretisation (samples are taken at arbitrary time points)

Queries that will be required (a rough sketch of these query shapes follows the list):

  1. All values in the time range [t0,t1]

  2. All values in the time range [t0,t1] that are greater/less than v0

  3. All values in the time range [t0, t1] that are in the value range [v0, v1]
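
For concreteness, the sample layout and query shapes could look roughly like this in C++ (names are purely illustrative, not an existing API):

    #include <cstdint>
    #include <limits>

    // Illustrative only: one observation of a (possibly multivariate) series.
    struct Sample {
        std::int64_t t;   // timestamp, e.g. nanoseconds since epoch
        double       v;   // observed value
    };

    // Query 1: all samples with t0 <= t <= t1.
    // Query 2: additionally v greater (or less) than some v0.
    // Query 3: additionally v0 <= v <= v1.
    struct RangeQuery {
        std::int64_t t0, t1;
        double v0 = -std::numeric_limits<double>::infinity();
        double v1 =  std::numeric_limits<double>::infinity();
    };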

The data sets consist of summarised time-series (which sidesteps the irregular discretisation to some extent) and multivariate time-series. The data set(s) in question are about 15-20 TB in size, so processing is performed in a distributed manner, because some of the queries described above will produce result sets larger than the physical memory available on any one system.

Distributed processing in this context also means dispatching the data-specific computation along with the time-series query, so that the computation occurs as close to the data as possible and node-to-node communication is reduced (somewhat similar to the map/reduce paradigm). In short, proximity of computation and data is critical.

Another issue the index should cope with is that the overwhelming majority of the data is static/historic (99.999...%), yet new data is added daily; think "in-the-field sensors" or "market data". The idea/requirement is to be able to update any running calculations (averages, GARCH models, etc.) with as low a latency as possible; some of these running calculations require historical data, and some of that history will be more than can reasonably be cached.
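
As a toy illustration of what a low-latency update to a running calculation might look like (a Welford-style incremental mean/variance; purely a sketch, not tied to any particular library):

    #include <cstdint>

    // Incremental mean/variance that absorbs each newly arriving sample in O(1),
    // standing in for the kind of running statistic that must be kept up to date.
    struct RunningStats {
        std::uint64_t n = 0;
        double mean = 0.0;
        double m2 = 0.0;          // sum of squared deviations from the mean

        void update(double x) {
            ++n;
            const double delta = x - mean;
            mean += delta / static_cast<double>(n);
            m2 += delta * (x - mean);
        }

        double variance() const { return n > 1 ? m2 / static_cast<double>(n - 1) : 0.0; }
    };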

I've already considered HDF5; it works well/efficiently for smaller datasets but starts to drag as the datasets grow larger, and it also lacks native parallel-processing capabilities from the front end.

Looking for suggestions, links, further reading, etc. (C or C++ solutions and libraries).

Xander Tulip
  • Queries of types 1–3 are often referred to as “orthogonal range reporting”. – oldboy Apr 11 '12 at 15:08
  • http://dba.stackexchange.com/questions/16583/using-an-rdbms-for-querying-tenth-of-terabytes-of-time-series-data – Martin Ba Apr 16 '12 at 20:16
  • @Martin: Thanks for that, but the problem with only having a hammer is that everything looks like a nail - posing such a question on a highly DB/DBA-oriented Q&A site will result in answers with a slight bias. – Xander Tulip Apr 17 '12 at 05:25
  • @Xander: No worries - There was a reason I didn't put any comment here and just linked to the DBA question. I was just wondering how/if your problem could be tackled in a traditional RDBMS setup. Not saying it will be the best solution. – Martin Ba Apr 17 '12 at 11:24

3 Answers


You would probably want to use some type of large, balanced tree. As Tobias mentioned, B-trees would be the standard choice for solving the first problem. If you also care about fast insertions and updates, there is a lot of new work being done at places like MIT and CMU on "cache-oblivious B-trees". For some discussion of how these are implemented, look up Tokutek DB; they have a number of good presentations, such as the following:

http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf
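
As a toy illustration of the ordered-index idea for query 1 (std::map is just an in-memory stand-in here; at 15-20 TB you would want a paged, disk-aware B-tree/fractal-tree as in the material above):

    #include <cstdint>
    #include <map>
    #include <vector>

    // Time-ordered index: key = timestamp, value = offset of the sample in the
    // underlying column store / file. A real index would live (mostly) on disk.
    using TimeIndex = std::map<std::int64_t, std::uint64_t>;

    // Query 1: offsets of all samples with t0 <= t <= t1.
    std::vector<std::uint64_t> timeRange(const TimeIndex& idx,
                                         std::int64_t t0, std::int64_t t1) {
        std::vector<std::uint64_t> out;
        for (auto it = idx.lower_bound(t0); it != idx.end() && it->first <= t1; ++it)
            out.push_back(it->second);
        return out;
    }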

Questions 2 and 3 are in general a lot harder, since they involve higher-dimensional range searching. The standard data structure for this is the range tree (which gives O(log^{d-1}(n)) query time, at the cost of O(n log^{d-1}(n)) storage). You generally would not want to use a k-d tree for something like this. While it is true that k-d trees have optimal, O(n), storage costs, you can't evaluate range queries any faster than O(n^{(d-1)/d}) if you only use O(n) storage. For d=2 this is O(sqrt(n)) time complexity, and frankly that isn't going to cut it if you have 10^10 data points (who wants to wait for O(10^5) disk reads to complete on a simple range query?).

Fortunately, it sounds like in your situation you really don't need to worry too much about the general case. Because all of your data comes from a time series, you only ever have at most one value per time coordinate. Hypothetically, what you could do is just use a range query to pull some interval of points, then, as a post-process, go through and apply the v constraints pointwise. This would be the first thing I would try (after getting a good database implementation), and if it works, then you are done! It really only makes sense to try optimizing the latter two queries if you keep running into situations where the number of points in [t0, t1] x [-infty, +infty] is orders of magnitude larger than the number of points in [t0, t1] x [v0, v1].
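
A minimal sketch of that strategy, assuming the samples are stored sorted by time (struct and function names are made up for illustration):

    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <vector>

    struct Sample { std::int64_t t; double v; };   // illustrative layout

    // Queries 2/3: binary-search the time interval [t0, t1], then filter on value.
    std::vector<Sample> query(const std::vector<Sample>& sortedByTime,
                              std::int64_t t0, std::int64_t t1,
                              double v0, double v1) {
        auto lo = std::lower_bound(sortedByTime.begin(), sortedByTime.end(), t0,
                                   [](const Sample& s, std::int64_t t) { return s.t < t; });
        auto hi = std::upper_bound(sortedByTime.begin(), sortedByTime.end(), t1,
                                   [](std::int64_t t, const Sample& s) { return t < s.t; });
        std::vector<Sample> out;
        std::copy_if(lo, hi, std::back_inserter(out),
                     [&](const Sample& s) { return s.v >= v0 && s.v <= v1; });
        return out;
    }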

Mikola
  • On the other hand, using an extra log factor in the storage means (assuming no big-O constants) going from $2,000 worth of hard disks (20TB * approximately $100/TB for today's prices) to $80,000. At less than one year of programmer cost, this might be worth it, but good luck getting one's manager to see things that way. – oldboy Apr 14 '12 at 14:24
  • @mikola: very interesting indeed! Any time-series indexing structure that takes advantage of the inherent structure of the values being modeled is worth a look. – Xander Tulip Apr 16 '12 at 02:29

It is going to be really time-consuming and complicated to implement this yourself. I recommend you use Cassandra. Cassandra gives you horizontal scalability and redundancy, and lets you run complicated map/reduce functions in the future. To learn how to store time series in Cassandra, take a look at http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra and http://www.youtube.com/watch?v=OzBJrQZjge0.


General ideas:

Problem 1 is fairly common: create an index that fits into your RAM and has links to the data on secondary storage (data structure: B-tree family). Problems 2/3 are quite complicated since your data is so large. You could partition your data into time ranges and calculate the min/max for each range. Using that information, you can filter out time ranges (e.g. if the max value for a range is 50 and you search for v0 > 60, the interval is out). The rest needs to be searched by going through the data. The effectiveness greatly depends on how quickly the data changes.

You can also build multiple levels of indices by combining the time ranges of lower levels, to make the filtering faster.
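
A rough sketch of the min/max partition pruning described above (all names are made up for illustration; the per-partition summaries are assumed to fit in RAM while the raw samples stay on secondary storage):

    #include <cstdint>
    #include <vector>

    // Per-partition summary kept in RAM; the raw samples live on secondary storage.
    struct PartitionSummary {
        std::int64_t tMin, tMax;   // time range covered by this partition
        double vMin, vMax;         // min/max value observed inside it
        std::uint64_t blockId;     // where to find the raw data
    };

    // Return the blocks that *might* contain samples with t in [t0, t1] and
    // v in [v0, v1]; everything else is pruned using the precomputed summaries.
    std::vector<std::uint64_t> candidateBlocks(const std::vector<PartitionSummary>& parts,
                                               std::int64_t t0, std::int64_t t1,
                                               double v0, double v1) {
        std::vector<std::uint64_t> out;
        for (const auto& p : parts)
            if (p.tMax >= t0 && p.tMin <= t1 &&   // time ranges overlap
                p.vMax >= v0 && p.vMin <= v1)     // value ranges overlap
                out.push_back(p.blockId);
        return out;
    }

Only the candidate blocks then need to be scanned; stacking coarser summaries on top of finer ones gives the multi-level filtering described above.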

Tobias Langner
  • The problem with using B-tree structures with time-series is that most time-series model 'continuous' values in a discrete sense. E.g. the temperature of a room at 30 degrees will need to drop to 25 before it can get to 20; B-trees don't use such insights, hence are inefficient for indexing time-series. – Xander Tulip Apr 16 '12 at 02:28
  • For problem 1, your comment doesn't make sense to me. If you want to search all points in time where the temperature was 30 degrees, you'd have to index that however you obtained the data. Regarding problems 2 and 3 - I don't see a contradiction. It actually assumes that the data is continuous - otherwise, working with min/max values to determine that the data was in between doesn't work. – Tobias Langner Apr 17 '12 at 11:57
  • Please reread my original comment to you. It should make sense if you've worked with similar data in the past. – Xander Tulip Apr 18 '12 at 06:20