4

I've got multiple sensors feeding data to my web app. Each channel produces 5 samples per second, and the data gets uploaded bundled into 1-minute JSON messages (containing 300 samples each). The data will be graphed using flot at multiple zoom levels, from 1 day down to 1 minute.

I'm using Amazon SimpleDB and I'm currently storing the data in the 1-minute chunks I receive it in. This works well for high zoom levels, but for full days there will simply be too many rows to retrieve.

The idea I've currently got is that every hour I can crawl through the data, condense the last hour into 300 samples, and store them in another table, essentially down-sampling the data.
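To make that concrete, here's a rough sketch (Python) of the hourly job I have in mind; fetch_minute_chunks and store_hourly_row are just placeholders for whatever SimpleDB access code ends up being used:

    from statistics import mean

    SAMPLES_PER_HOUR = 5 * 60 * 60                # 18000 raw samples per channel per hour
    TARGET_SAMPLES = 300                          # what the hourly table should hold
    BUCKET = SAMPLES_PER_HOUR // TARGET_SAMPLES   # 60 raw samples per stored point

    def downsample_hour(raw_samples):
        """Reduce one hour of raw samples (18000) to 300 averaged points."""
        assert len(raw_samples) == SAMPLES_PER_HOUR
        return [mean(raw_samples[i:i + BUCKET])
                for i in range(0, SAMPLES_PER_HOUR, BUCKET)]

    def hourly_job(fetch_minute_chunks, store_hourly_row, channel, hour_start):
        # fetch_minute_chunks() and store_hourly_row() stand in for the real
        # SimpleDB calls; each minute chunk is the 300-sample list exactly as
        # it was uploaded.
        raw = []
        for chunk in fetch_minute_chunks(channel, hour_start):
            raw.extend(chunk)
        store_hourly_row(channel, hour_start, downsample_hour(raw))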

Does this sound like a reasonable solution? How have others implemented the same sort of systems?

Tim
  • Can you change to an aggregated view when zoomed out that has less data points? – Yzmir Ramirez Jan 16 '11 at 21:27
  • That's what I'm planning to do. It's the database storage I'm struggling with. Currently, to show the day overview, I would have to select every minute from the database (1440 per day...), then remove the excess data points. – Tim Jan 17 '11 at 12:39

4 Answers

5

Storing downsampled data is a perfectly fine approach. Check out how Munin stores its graphs - daily, monthly, yearly and intraday graphs are stored separately there.

You may store data for each minute, each 5 minutes, each hour, each 4 hours and each day in different tables. The overhead is very small in comparison to just storing every minute, with the big benefit that you don't transmit what you don't need to.
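For illustration, a minimal sketch (Python) of how those resolution levels could be filled, assuming each bucket keeps min/avg/max; the table names and the put_row helper are made up:

    from collections import defaultdict

    # Bucket width in seconds for each resolution table (names are illustrative).
    RESOLUTIONS = {
        "samples_1min": 60,
        "samples_5min": 5 * 60,
        "samples_1hour": 60 * 60,
        "samples_4hour": 4 * 60 * 60,
        "samples_1day": 24 * 60 * 60,
    }

    def aggregate(samples, bucket_seconds):
        """Group (unix_timestamp, value) pairs into buckets, keep min/avg/max."""
        buckets = defaultdict(list)
        for ts, value in samples:
            buckets[ts - ts % bucket_seconds].append(value)
        return {start: (min(vals), sum(vals) / len(vals), max(vals))
                for start, vals in buckets.items()}

    def store_all_resolutions(samples, put_row):
        # put_row(table, bucket_start, (lo, avg, hi)) stands in for the real write.
        for table, width in RESOLUTIONS.items():
            for start, agg in aggregate(samples, width).items():
                put_row(table, start, agg)

That way the full-day view reads a handful of rows from the hourly table instead of thousands of raw rows.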

BarsMonster
  • I'm trying to decide whether I should calculate and update these downsampled entries on the fly (ie. every minute) or should I have a background task compiling the data at predetermined intervals? – Tim Jan 21 '11 at 14:37
  • Realtime calculation is better as long as you can do it in time. But periodic batch processing is faster. – BarsMonster Jan 21 '11 at 15:41
  • And there are also trickier methods: when the user requests item 12, it triggers the calculation of 11 and 13, so when he/she jumps to a neighbouring item, it is ready to show. Maybe the summarization is quick enough that there's no need for any pre-calculation, or maybe it is so slow that every sum must be pre-calculated. – ern0 Jan 25 '11 at 11:00
2

Speed up the database by using the direct organization model. It's the fastest method to store/retrieve data in files. The implementation is so simple that you don't need any framework or library.

The method is:

  1. you have to create an algorithm which converts the key to a continuous record number (0..max. number of records),
  2. you have to use a fixed record size,
  3. the data is stored in flat files, where the record's position within the file is the record number (based on the key, as described in 1.) multiplied by the record size (see 2.).

Native data

You may create one data file per day for easier maintenance. Then your key is the number of the sample within the day. So, your daily file will be 18000 * 24 * record size bytes. You should pre-create that file filled with 0s in order to make the operating system's life easier (maybe it does not help much; it depends on the underlying filesystem/caching mechanism).

So, when a sample arrives, calculate the file position and write the record to its place.
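A rough sketch of that in Python, under my own assumptions about the layout (one 8-byte float per sample, one file per day):

    import struct

    RECORD = struct.Struct("<d")            # assumed layout: one 8-byte float per sample
    SAMPLES_PER_DAY = 5 * 60 * 60 * 24      # 432000 records in a daily file

    def create_daily_file(path):
        """Pre-create the file filled with zeros so every slot already exists."""
        with open(path, "wb") as f:
            f.truncate(SAMPLES_PER_DAY * RECORD.size)

    def write_sample(path, second_of_day, subsample, value):
        """Key -> record number -> byte offset; no index, no search."""
        record_no = second_of_day * 5 + subsample      # 0 .. SAMPLES_PER_DAY - 1
        with open(path, "r+b") as f:
            f.seek(record_no * RECORD.size)
            f.write(RECORD.pack(value))

    def read_sample(path, second_of_day, subsample):
        record_no = second_of_day * 5 + subsample
        with open(path, "rb") as f:
            f.seek(record_no * RECORD.size)
            return RECORD.unpack(f.read(RECORD.size))[0]

Reading a whole time window is then just one seek plus one sequential read of (number of records * record size) bytes.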

Summarized data

You should store summarized data in direct files, too. These files will be much smaller ones: in the case of 1-minute summarized values, there are only 24*60 = 1440 records per day.

There are some decisions which you have to take:

  • the stepping of the zoom levels,
  • the stepping of the summarized data (it is probably not worth collecting summarized data for every zoom step),
  • the organization of the summarized databases (the native data may be stored in daily files, but the summarized data should be stored in monthly files).

Another thing to think about is when the summarized data gets created. While native data should be stored as soon as it arrives, summarized data may be calculated at any time:

  • as the native data arrives (in this case a 1-minute summary record gets updated 300 times, which is not optimal to write to disk immediately; the summing should be done in memory - see the sketch after this list);
  • a background job can process the native data periodically;
  • the summarized data can be created the lazy way, on demand.
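For the first option, a minimal sketch of how the in-memory summing could look, assuming one accumulator per channel that flushes a (min, avg, max) record once per minute:

    class MinuteAccumulator:
        """Collect raw samples in memory, flush one summary record per minute."""

        def __init__(self, flush):
            self.flush = flush          # callback: flush(minute_start, lo, avg, hi)
            self.minute = None
            self.values = []

        def add(self, timestamp, value):
            minute = timestamp - timestamp % 60
            if self.minute is not None and minute != self.minute:
                self._emit()
            self.minute = minute
            self.values.append(value)

        def _emit(self):
            vals = self.values
            self.flush(self.minute, min(vals), sum(vals) / len(vals), max(vals))
            self.values = []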

Don't forget, not too many years ago these issues were exactly the issues of database design. I can promise one thing: it will be fast, faster than anything else (except storing the data in memory).

ern0
1

I implemented this some time ago with downsampling on the fly for some graphs. The drawback is that older data loses resolution, but I believe this is acceptable for you. And if you are interested in peaks, you could provide max, avg and min values.

The algorithm isn't too hard either. If you have 5 samples per second and want to hold this granularity for maybe an hour, you have to store 5*60*60 = 18000 samples for that hour.

For the day you might go down to 1 sample every 5 seconds, reducing the amount by a factor of 25. The algorithm would then run every 5 seconds and calculate the median, min and max of the 5 seconds that passed 24 hours ago (sketched below). That results in 12*60*23 = 16560 more samples per day, and if you store

Further back I recommend a sample every minute, reducing the amount by another factor of 12, for maybe two weeks, so you have 60*24*13 = 18720 more samples for two weeks of data.
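A sketch of that 5-second reduction step, assuming a hypothetical fetch_raw(start, end) that returns the raw samples of the slice that just aged out of the full-resolution window:

    from statistics import median

    RAW_WINDOW = 24 * 60 * 60   # keep the full 5 Hz resolution for the last 24 hours
    STEP = 5                    # condense one 5-second slice each run

    def reduce_aged_slice(now, fetch_raw, store_5s_row):
        """Run every 5 seconds: turn the slice that just fell out of the
        full-resolution window into one (median, min, max) row."""
        end = now - RAW_WINDOW
        start = end - STEP
        values = fetch_raw(start, end)      # the ~25 raw samples of that slice
        if values:
            store_5s_row(start, median(values), min(values), max(values))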

Special consideration should be taken for storing the data in the DB. To get maximum performance you should ensure that the data of one sensor is stored in one block in the database. If you use e.g. PostgreSQL, you know that one block is 8192 bytes long and that no record is split across two blocks. Assuming one sample is 4 bytes long, and considering the overhead per row, I could fit 2048 minus a few samples into one block. At the highest resolution this is roughly 2040 / 5 / 60 ≈ 6.8 minutes of data. It MIGHT be a good idea to always add about 6 minutes at once, maybe 5 of them just dummies to be updated in the later minutes, so the queries can fetch blocks of a single sensor faster.

In any case, I would use different tables for the different sensor granularities.

Daniel
  • I'm using SimpleDB, so I don't have any control over how the data is stored. +1 for getting me to think about how long the data is stored. I think I'll have a chargeable option to archive into S3. – Tim Jan 25 '11 at 14:53
0

Since a few days ago, Amazon CloudWatch allows you to use custom metrics as well. If monitoring and alarming is your main concern, this may be helpful.
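Publishing a reading as a custom metric is a single API call; a sketch with the boto3 SDK (namespace and metric name are made up):

    import datetime

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Push one sensor reading as a custom metric; CloudWatch then handles
    # the per-minute aggregation and alarming for you.
    cloudwatch.put_metric_data(
        Namespace="SensorApp",                     # illustrative namespace
        MetricData=[{
            "MetricName": "channel1",              # illustrative metric name
            "Timestamp": datetime.datetime.utcnow(),
            "Value": 23.5,
            "Unit": "None",
        }],
    )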

Leen Toelen