1

I was looking for advice on how to store data in a database for historical data mining purposes. If I can get the state of an entity at a given time, what is the best way to store it so that I can historically mine that data, predicting what the state is likely to be based on how it has been in the past?

For a more concrete example, I can get the up/down state of a server, and that server's current load. I can get this on a periodic schedule. I'd like to store this data such that I could easily query for the up/down state or the load for a specific timeframe, or get the entire history of that server. I don't have much experience, if any, in database design.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
s73v3r
  • 1,751
  • 2
  • 22
  • 48

2 Answers2

1

This is why Ralph Kimball (and others) invented the Data Warehouse.

You have star schema with dimensions like Server and Time. You have a fact table that records state changes (Up and Down) and a fact table that records Load at a given point in time.

Step 1. Find a good tutorial on star schema design. If necessary buy a book or two. It's worth every moment you spend learning how to do star schema data modeling.

Step 2. Build a prototype schema as cheaply and quickly as you can. Get data loaded so you can write some queries and see how things work. Make mistakes. Fix them.

Step 3. After you get something to work, then write down a good design.

Step 4. Build your "for real" data mart.

Step 5. Build your "production" loads.

Step 6. Query.

S.Lott
  • 384,516
  • 81
  • 508
  • 779
  • So it sounds like I'd have 2 fact tables (Up or Down and Load) for each server, correct? – s73v3r Jan 18 '12 at 00:53
  • Now, when it comes to storing data, should I add a record when something changes, or for every data point I get? For example, lets say I get my server data every 15 minutes. At 1pm, the server transitions from High load to Medium load, and stays that way for an hour before transitioning to another state. So I've got 4 data points which all say Medium load. Should I store this as 4 entries, all with the time of measurement and Medium load, or should I have one entry with a Medium load, a start time of 13:00 and an end time of 14:00? – s73v3r Jan 18 '12 at 02:20
  • Step 1. Get Kimball's book. Store data at specific time intervals so you don't have to interpolate. Gather a lot of data. Use flat files and load the RDBMS last -- and only when you have a subset you want to query with SQL. Most processing can be done by organizing CSV files to be easy-to-load. – S.Lott Jan 18 '12 at 10:50
0

This is why there is a new version of the standard, SQL:2011.

Just add the appropriate start- and end-date attributes to your entities, and/or the appropriate columns to your tables, and basically you're done.

The new SQL will do quite a bit (but not all alas) of the otherwise very pesky work for you, if you have an engine that supports the new features, of course.

Erwin Smout
  • 18,113
  • 4
  • 33
  • 52