4

I will try to describe my challenge and operation: I need to calculate stock price indices over a historical period. For example, I will take 100 stocks and calculate their aggregated average price each second (or even more often) for the last year. I need to create many different indices like this, where the stocks are picked dynamically out of ~30,000 different instruments.

The main consideration is speed. I need to output a few months of this kind of index as fast as I can.

For that reason, I think traditional RDBMSs are too slow, so I am looking for a more sophisticated and original solution.

Here is something I had in mind, using a NoSQL or column-oriented approach: store every stock as key-value pairs of time:price, with matching time rows across all of them. Then use some sort of map-reduce pattern to select only the required stocks and aggregate their prices while reading them line by line.
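
Conceptually, something like the following sketch (plain Python just to illustrate the idea; the in-memory `store` and all the names are placeholders for whatever NoSQL/column store would actually hold the data):

    from typing import Dict, Iterable

    Series = Dict[int, float]  # epoch second -> price

    def build_index(store: Dict[str, Series],
                    picked: Iterable[str],
                    timestamps: Iterable[int]) -> Series:
        """Average the prices of the picked stocks at every time row."""
        picked = list(picked)
        index: Series = {}
        for t in timestamps:                                  # read row by row
            prices = [store[symbol][t] for symbol in picked]  # "map": select stocks
            index[t] = sum(prices) / len(prices)              # "reduce": aggregate
        return index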

I would like some feedback on my approach, suggestions for tools and use cases, or a suggestion for a completely different design pattern. My guidelines for the solution are price (I would like to use open source), the ability to handle huge amounts of data and, again, fast lookup (I don't care about insert speed since the data is loaded only once and never changes).

Update: by fast lookup I don't mean real time, but a reasonably quick operation. Currently it takes me a few minutes to process each day of data, which translates to a few hours per yearly calculation. I want to get this down to minutes or so.

AltControl
  • 101
  • 4
  • Is the data only historical, or is it updated very frequently and you always require the latest for your calculations? – Nitin Midha Apr 23 '11 at 17:19
  • How many prices do you have per instrument and year? Does your system have sufficient memory to hold everything in memory, or is disk access required? – Codo Apr 23 '11 at 17:22
  • @nitin only historical. For real time I use CEP engine. – AltControl Apr 23 '11 at 17:22
  • @codo I intend to reduce the quotes to second- or millisecond-based rows. So just calculate how many seconds there are in a year (for an 8-hour trading day) – AltControl Apr 23 '11 at 17:25
  • Sorry, I had a typo in my remark. I'd like to know how many prices per instrument have to be processed for a certain period of time (not how many will result). And what's the typical period of time of each time series that's processed? (You mention something about processing a day of data...) – Codo Apr 23 '11 at 17:32
  • @Codo I meant to say that I am reducing prices to be time-based, so at each second I will use the latest price available. So basically, I have 25,000 prices a day, over 10,000,000 a year – AltControl Apr 23 '11 at 17:53
  • +1 for a great question. – Sergey Akopov Apr 23 '11 at 18:43
  • @Sergey thanks, but not much participation so far... – AltControl Apr 23 '11 at 19:33
  • What data format do you get the market data in? Are you willing to convert it and/or re-organize it? – hifier Apr 28 '11 at 04:29

3 Answers

1

In the past, I've worked on several projects that involved the storage and processing of time series using different storage techniques (files, RDBMS, NoSQL databases). In all these projects, the essential point was to ensure that the time series samples are stored sequentially on disk. This made reading several thousand consecutive samples quick.

Since you seem to have a moderate number of time series (approx. 30,000) each having a large number of samples (1 price a second), a simple yet effective approach could be to write each time series into a separate file. Within the file, the prices are ordered by time.

You then need an index for each file so that you can quickly find certain points of time within the file and don't need to read the file from the start when you just need a certain period of time.
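
For illustration, here is a rough sketch of that layout in Python, assuming fixed-width (timestamp, price) records and a small sparse index of (timestamp, byte offset) pairs; the names and the record format are made up for the example, and a real implementation would read in larger blocks:

    import struct
    from bisect import bisect_right

    RECORD = struct.Struct("<qd")   # 8-byte epoch second + 8-byte price = 16 bytes

    def write_series(path, samples):
        """samples: (epoch_second, price) pairs, already sorted by time.
        Returns a sparse index of (timestamp, byte_offset) entries."""
        index = []
        with open(path, "wb") as f:
            for i, (t, price) in enumerate(samples):
                if i % 4096 == 0:            # one index entry every 4096 records
                    index.append((t, f.tell()))
                f.write(RECORD.pack(t, price))
        return index

    def read_period(path, index, start, end):
        """Yield (timestamp, price) pairs with start <= timestamp < end."""
        times = [t for t, _ in index]
        pos = max(bisect_right(times, start) - 1, 0)   # last entry at or before start
        with open(path, "rb") as f:
            f.seek(index[pos][1] if index else 0)
            while chunk := f.read(RECORD.size):        # sequential scan from there
                t, price = RECORD.unpack(chunk)
                if t >= end:
                    break
                if t >= start:
                    yield t, price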

With this approach you can take full advantage of today's operating systems which have a large file cache and are optimized for sequential reads (usually reading ahead in the file when they detect a sequential pattern).

Aggregating several time series involves reading a certain period from each of these files into memory, computing the aggregated numbers and writing them somewhere. To fully leverage the operating system, read the full required period of each time series one by one and don't try to read them in parallel. If you need to compute a long period, don't break it into smaller periods: one sequential pass over each series keeps the disk access pattern ideal.
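
The aggregation loop itself is then trivial; roughly like this, where each `reader` could be something like `functools.partial(read_period, path, index)` from the sketch above, or any other callable that yields one series for the requested period:

    from collections import defaultdict

    def aggregate(readers, start, end):
        """Average price per timestamp over the selected instruments.
        Each reader is called as reader(start, end) and yields (t, price)."""
        total = defaultdict(float)
        count = defaultdict(int)
        for reader in readers:          # one series at a time, strictly sequential I/O
            for t, price in reader(start, end):
                total[t] += price
                count[t] += 1
        return {t: total[t] / count[t] for t in sorted(total)}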

You mention that you have 25,000 prices a day when you reduce them to a single one per second. It seems to me that in such a time series, many consecutive prices would be the same, as few instruments are traded (or even priced) more than once a second (unless you only process S&P 500 stocks and their derivatives). So an additional optimization could be to further condense your time series by only storing a new sample when the price has actually changed.
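
That condensing step is just a small filter; a sketch, assuming the per-second series is an iterable of (timestamp, price) pairs:

    def condense(samples):
        """Keep only the samples where the price actually changed."""
        last = None
        for t, price in samples:
            if price != last:
                yield t, price
                last = price

When reading such a condensed series back, the price at any given second is simply the last stored sample at or before it.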

On a lower level, the time series files could be organized as binary files consisting of sample runs. Each run starts with the time stamp of the first price and the length of the run. After that, the prices for the consecutive seconds follow. The file offset of each run could be stored in the index, which could be implemented with a relational DBMS (such as MySQL). This database would also contain all the meta data for each time series.
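
One possible encoding of such a run (the header layout here is just an assumption for the example: an int64 first timestamp, an int32 run length, then that many doubles):

    import struct

    RUN_HEADER = struct.Struct("<qi")   # first timestamp (int64) + run length (int32)

    def append_run(f, first_timestamp, prices):
        """Append one run of consecutive per-second prices; returns the offset
        to store in the index table (e.g. one MySQL row per run)."""
        offset = f.tell()
        f.write(RUN_HEADER.pack(first_timestamp, len(prices)))
        f.write(struct.pack("<%dd" % len(prices), *prices))
        return offset

    def read_run(f, offset):
        """Read back the run that starts at the given file offset."""
        f.seek(offset)
        first_timestamp, n = RUN_HEADER.unpack(f.read(RUN_HEADER.size))
        prices = struct.unpack("<%dd" % n, f.read(8 * n))
        return first_timestamp, list(prices)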

(Do stay away from memory mapped files. They're slower because they aren’t optimized for sequential access.)

Codo
  • 75,595
  • 17
  • 168
  • 206
  • You are suggesting a rather low-level, file-based approach. Why not leverage the power of modern systems like Hadoop? – AltControl Apr 23 '11 at 18:33
  • Yes, it's a low-level approach, designed for a single system with inexpensive storage. I don't know much about Hadoop, but its main purpose is computation and storage in a cluster. You didn't mention you have a cluster of computers. My experience with time series processing is that the limiting factor is IO, and the key to high performance is a good amount of control over the physical layout of the data. – Codo Apr 23 '11 at 21:46
  • If speed is your primary concern, I think the suggestions outlined here will go a long way. Trying to get too fancy will likely just add overhead. – hifier Apr 28 '11 at 04:34
0

If the scenario you described is the ONLY requirement, then there are "low tech" simple solutions which are cheaper and easier to implement. The first that comes to mind is LogParser. In case you haven't heard of it, it is a tool which runs SQL queries on simple CSV files. It is unbelievably fast - typically around 500K rows/sec, depending on row size and the IO throughput of the HDs.

Dump the raw data into CSVs, run a simple aggregate SQL query via the command line, and you are done. Hard to believe it can be that simple, but it is.
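
Just for a sense of scale, this is all the logic the aggregate query has to express, written out in plain Python over a hypothetical time,symbol,price CSV (with LogParser it becomes a single SQL statement on the command line instead):

    import csv
    from collections import defaultdict

    def average_per_second(csv_path, wanted_symbols):
        """Average price per timestamp over the selected symbols.
        Column names time/symbol/price are assumptions about the dump format."""
        total = defaultdict(float)
        count = defaultdict(int)
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                if row["symbol"] in wanted_symbols:
                    total[row["time"]] += float(row["price"])
                    count[row["time"]] += 1
        return {t: total[t] / count[t] for t in sorted(total)}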

More info about LogParser: it is a free tool from Microsoft (Log Parser 2.2).

Addys
  • 2,461
  • 15
  • 24
  • I like simple and straight forward approaches like the one you describe. But what's missing in your description is how the period to process is selected. AltControl's description seems to indicate that the time series are very long but only a small period of time is processed at once. So efficiently selecting the range to process is probably essential. And LogParser - if I understand it correctly - will read the whole file from the start to its end. – Codo Apr 24 '11 at 21:25
  • Partition (and name) the input files by date/time and run the query only over the relevant files... standard practice for any logfile processing – Addys Apr 25 '11 at 13:46
  • Wouldn't binary data stores be more efficient? For numeric processing, the string processing is going to add a lot of overhead. – hifier Apr 28 '11 at 04:26
0

What you really need is a relational database with built-in time series functionality. IBM released one very recently: Informix 11.7 (note that it must be 11.7 to get this feature). Even better news: for what you are doing, the free version, Informix Innovator-C, will be more than adequate. http://www.freeinformix.com/time-series-presentation-technical.html