10

I am trying to identify possible methods for storing 100 channels of 25 Hz floating-point data. This will result in 78,840,000,000 data points per year.
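For scale, assuming 4-byte (32-bit) samples, the raw volume works out to roughly 294 GB per year before compression; a quick back-of-envelope check (using bigint to avoid integer overflow):

    -- 100 channels * 25 Hz * seconds per year, at 4 bytes per sample
    SELECT CAST(100 AS bigint) * 25 * 60 * 60 * 24 * 365        AS points_per_year,   -- 78,840,000,000
           CAST(100 AS bigint) * 25 * 60 * 60 * 24 * 365 * 4.0
               / (1024 * 1024 * 1024)                           AS raw_gb_per_year;   -- ~294 GB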

Ideally all this data would be efficiently available to websites and tools such as SQL Server Reporting Services. We are aware that relational databases are poor at handling time-series of this scale, but have yet to identify a convincing time-series-specific database.

Key issues are compression for efficient storage while still offering easy and efficient querying, reporting and data mining.

  • How would you handle this data?

  • Are there features or table designs in SQL Server that could handle such a quantity of time-series data?

  • If not, are there any third-party extensions for SQL Server to efficiently handle mammoth time-series?

  • If not, are there time-series databases that specialise in handling such data yet provide natural access through SQL, .NET, and SQL Server Reporting Services?

thanks!

Joel Coehoorn
Duncan
  • How large are the data points? – cgp Jun 04 '09 at 16:47
  • What's the datatype of the samples? Is the sample/feed rate fixed or varying? Do you need to store the sample time or can it be inferred? And what kind of data is being sampled? – RBarryYoung Jun 04 '09 at 16:55
  • Assume 32-bit floating point. There could be some optimisation but it would be perfect if we didn't have to handle that headache. – Duncan Jun 04 '09 at 17:26
  • It's virtually impossible to deal with the compression issues of FP data unless we know what kind of data it is: is it real-world physical measurements or synthetic/artifical/generated data? Is the data frequency-based, periodic or aperiodic? Are the data points monotonic accumulators (never goes down), or are they fluctuating levels or are they rate measurements? Are they mostly flat with occasional spikes (events) or highly variable and chaotic? And how accurate is the data and how much noise is in it? Finally, is lossy compression acceptable? – RBarryYoung Jun 12 '09 at 07:17

8 Answers

1

I'd partition the table by, say, date, to split the data into tiny bits of 216,000,000 rows (one day's worth) each.

Provided that you don't need whole-year statistics, this is easily served by indexes.

Say, a query like "give me the average for a given hour" will be a matter of seconds.
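Roughly, a sketch of what this could look like with SQL Server partitioning (hypothetical names pfDaily, psDaily and ChannelSample; the daily boundary list is abridged and would normally be generated):

    CREATE PARTITION FUNCTION pfDaily (datetime)
        AS RANGE RIGHT FOR VALUES ('20090101', '20090102', '20090103' /* ...one boundary per day... */);

    CREATE PARTITION SCHEME psDaily
        AS PARTITION pfDaily ALL TO ([PRIMARY]);

    CREATE TABLE dbo.ChannelSample
    (
        ChannelId  smallint NOT NULL,
        SampleTime datetime NOT NULL,            -- partitioning column
        Value      real     NOT NULL,
        CONSTRAINT PK_ChannelSample
            PRIMARY KEY CLUSTERED (ChannelId, SampleTime)
    ) ON psDaily (SampleTime);

    -- "Average for a given hour" then only touches one daily partition:
    SELECT AVG(Value)
    FROM   dbo.ChannelSample
    WHERE  ChannelId = 7
      AND  SampleTime >= '20090604 16:00' AND SampleTime < '20090604 17:00';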

Quassnoi
  • Thanks for the answer. Using SQL Server partitioning or just multiple tables? If it's multiple tables, are there any design patterns for easily handling queries across tables? – Duncan Jun 09 '09 at 06:34
  • @Duncan: SQL Server partitioning would be the best, but I don't know how many partitions it will allow. With multiple tables, you can run a scheduler to recreate views as SELECT * FROM table_20090609 UNION ALL SELECT * FROM table_20090608 etc. Don't forget to include a partitioning column in the tables. – Quassnoi Jun 09 '09 at 07:41
1

I suppose you need random access to the data series. An approach I have already used for a rainfall data table is to subdivide the entire dataset into smaller parts, creating an entry for every few minutes, or even every minute. You can then pull this still-large array from the database and jump straight to the part you need, because there is a direct correspondence between time offset and byte offset.
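As a rough sketch of that idea (hypothetical table ChannelBlock; one row holds one minute of one channel, i.e. 60 s x 25 Hz = 1,500 four-byte floats, so a sample's byte offset follows directly from its time offset):

    CREATE TABLE dbo.ChannelBlock
    (
        ChannelId  smallint       NOT NULL,
        BlockStart datetime       NOT NULL,   -- start of the one-minute block
        Samples    varbinary(max) NOT NULL,   -- 1500 * 4 = 6,000 bytes of raw 32-bit floats
        CONSTRAINT PK_ChannelBlock PRIMARY KEY (ChannelId, BlockStart)
    );

    -- Fetch the 4 bytes for the sample N whole seconds into a block (SUBSTRING is 1-based);
    -- decoding the bytes to a float would happen client-side, e.g. BitConverter.ToSingle in .NET.
    DECLARE @secondsIntoBlock int = 42;
    SELECT SUBSTRING(Samples, @secondsIntoBlock * 25 * 4 + 1, 4) AS SampleBytes
    FROM   dbo.ChannelBlock
    WHERE  ChannelId = 7
      AND  BlockStart = '20090604 16:47';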

Lopoc
  • Thanks for the answer. Using a blob for the big array? Are there any design approaches for making a blob easily queryable e.g. views? – Duncan Jun 09 '09 at 06:35
1

The feature set you're describing is that of an Analysis Cube. Check out Analysis Services from Microsoft if you're in that part of the tech world:

http://msdn.microsoft.com/en-us/library/ms175609(SQL.90).aspx

As far as the model you're describing goes, you need to implement a Kimball model (the standard data warehousing model) with a Time dimension. I ran into this problem storing media log files a while back.
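A minimal sketch of what that could look like (hypothetical names DimDate, DimChannel and FactSample; a real design would flesh out the dimensions much further):

    CREATE TABLE dbo.DimDate
    (
        DateKey  int      NOT NULL PRIMARY KEY,   -- e.g. 20090604
        [Date]   datetime NOT NULL,
        [Year]   smallint NOT NULL,
        [Month]  tinyint  NOT NULL,
        [Day]    tinyint  NOT NULL
    );

    CREATE TABLE dbo.DimChannel
    (
        ChannelKey  smallint     NOT NULL PRIMARY KEY,
        ChannelName nvarchar(50) NOT NULL
    );

    CREATE TABLE dbo.FactSample
    (
        DateKey     int      NOT NULL REFERENCES dbo.DimDate (DateKey),
        SecondOfDay int      NOT NULL,            -- 0..86399; finer grain could be a Time dimension
        ChannelKey  smallint NOT NULL REFERENCES dbo.DimChannel (ChannelKey),
        Value       real     NOT NULL
    );

Analysis Services cubes then aggregate FactSample over the DimDate and DimChannel dimensions.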

Good luck.

Chris B. Behrens
  • Thanks for the answer. It's difficult to know where to get started with data-warehousing. I have googled and read around your link but would benefit from something like a sample project that tackles a similar problem. Are you aware of anything like this? – Duncan Jun 09 '09 at 06:37
  • You're right that Data Warehousing is hard to get started with. The best process I can recommend (keeping in mind that I'm talking Microsoft Visual Studio and SQL Server 2005/2008 here) is to download the sample warehouse of the AdventureWorks db: http://www.microsoft.com/downloads/details.aspx?FamilyId=E719ECF7-9F46-4312-AF89-6AD8702E4E6E&displaylang=en and then check out some of the hands-on labs for SQL Server: http://www.microsoft.com/sqlserver/2008/en/us/virtual-labs.aspx Lastly, I recommend Kimball's book to get started: http://www.ralphkimball.com/html/booksDWT2.html Good luck! – Chris B. Behrens Jun 09 '09 at 22:23
1

You could check out Infobright Community or Enterprise Edition. It is column-oriented storage designed for analytics and big data (existing installations are reportedly up to 30 TB now) with a good compression rate.

The data loader is also pretty fast, and connectors exist for ETL tools (Talend, Kettle and so on).

The Community edition is available for free under GNU GPL terms, but it only allows adding data via the native loader. The Enterprise edition supports single-row inserts and updates via DML.

Another benefit is that you can use it with any tool that supports MySQL connections.

Column orientation allows you, for example, to add columns for the date components at every aggregation level you need (I use date, week number, month and quarter) for better performance, but it works well without them too.
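For example, a table of that shape in the MySQL-compatible SQL that Infobright speaks (hypothetical table fact_sample; loading would go through the native loader in ICE):

    CREATE TABLE fact_sample
    (
        sample_date  DATE     NOT NULL,
        sample_week  TINYINT  NOT NULL,   -- pre-computed week number
        sample_month TINYINT  NOT NULL,   -- pre-computed month
        sample_qtr   TINYINT  NOT NULL,   -- pre-computed quarter
        channel_id   SMALLINT NOT NULL,
        value        FLOAT    NOT NULL
    );

    -- Aggregations can then group on the pre-computed columns
    -- instead of calling MONTH(sample_date):
    SELECT sample_month, channel_id, AVG(value)
    FROM   fact_sample
    GROUP BY sample_month, channel_id;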

I use it for a (so far) relatively small amount of business transaction data for analytics, with R as the data analysis tool via the MySQL interface, and Python (numpy) scripts as a kind of ETL.

Cons: lack of official UTF-8 support, and aggregation by function values (e.g. SELECT MONTH(date) FROM ...) is not implemented yet (planned for July 2009, AFAIK), but I use the ETL step for this.

Link: http://www.infobright.org/Download/ICE/

dotsbyname
  • please feel free to share your experience after exploring ICE :) I'm working on the architecture of our small analysis/reporting app with R, Infobright and Django as the reports viewer, and I'm interested in new thoughts about storing/representing large data :) – dotsbyname Jun 18 '09 at 12:53
0

Have you considered HBase or OpenTSDB? You can also have a look at Cassandra.

tom
0

If it's floating-point data only, TSDBs will offer you far better performance. Time-series compression algorithms are specialised for this kind of data, so you get better storage and query rates.

0

Have you considered a time-series database, like http://opentsdb.net?

Brady Moritz
0

You have

A. 365 x 24 x 100 = 876,000 hourly signals (all channels) per year

B. each signal comprising 3600 x 25 = 90,000 data points

How about storing the data as one row per signal, with columns holding summary/query stats for the currently supported use cases, and a blob of the compressed signal for future ones?
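A minimal sketch of that layout (hypothetical table HourlySignal; the stats columns are just examples of "currently supported use cases"):

    CREATE TABLE dbo.HourlySignal
    (
        ChannelId  smallint       NOT NULL,
        HourStart  datetime       NOT NULL,
        AvgValue   real           NOT NULL,
        MinValue   real           NOT NULL,
        MaxValue   real           NOT NULL,
        RawSamples varbinary(max) NOT NULL,   -- the 90,000 samples, compressed, for future use cases
        CONSTRAINT PK_HourlySignal PRIMARY KEY (ChannelId, HourStart)
    );

Queries and reports hit the summary columns; the blob only gets pulled when a new use case needs the raw signal.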

bubaker
  • Thanks for the answer. I may not fully understand the suggestion. Is the suggestion for each row to look like (signalId, timeperiod, float ave, float min, float max, blob raw)? Are there any examples of making blob data easily queryable, e.g. views? – Duncan Jun 09 '09 at 06:42
  • Something along those lines, but I'm not sure about making blob data queryable at all.. My thinking was to limit queries to additional stats columns as needed. – bubaker Jun 09 '09 at 14:46