1

I am interested in storing scientific data from a chart (plot) of say x against y in a data warehouse, where both x and y are real numbers.

Each chart will be generated for a fixed set of descriptive dimensions (e.g. time, date, location, equipment) which can be modelled in a traditional star/snowflake schema.

An example would be say angle vs response of a detector, where angle is the independent variable and response is the dependant. Angle here could be any real number between 0 and 360 degrees.

My current thought is to use the real value as a dimension, potentially prepopulating an angle_dimension table with values from 0 to 360 at a suitable scale (e.g. 3dp) and round the measured results where necessary, although this results in a loss of precision.

I am wondering if there are any more effective ways to store this data for later use in an OLAP cube? The type of query I'd be looking to do is to compare chart data at different time points to look for changes or to look at the average response in a given range (say 0-15 degrees) at different locations or for different equipment.

adam
  • 23
  • 4
  • Do you want to see detail level data or summary data? It seems like you want to analyse detail data - cubes aren't very good at this. Do you know how many records you expect? How many other attributes are there against angle? It doesn't really seem suitable for dimensional modelling.How is the data stored now and what problems do you have with it? – Nick.Mc Feb 19 '17 at 13:01
  • I really want both available, the aim is to look at average values, and then be able to drill down into the detail to find out why things are outside of tolerance. Currently I'm storing different parameters from equipment (about 20 but could extend to a couple of hundred) produced as charts against sampling time (in 0.25s) intervals, however these could also vary as a function of angle. We have several similar pieces of equipment, each with 2 to 3 modes of operation. The equipment produces the charts in XML. – adam Feb 19 '17 at 13:40
  • I was hoping to put the data in a form where I could use an off the shelf OLAP tool (Pentaho community) to allow users to manipulate the data rather than writing custom code. It is currently more a proof of principle, I have a simple cube with about 7 million records. – adam Feb 19 '17 at 13:43
  • In terms of storage for any/all analysis, I'd suggest a fact table at the 'chart plot point' grain, i.e. one row per X on a graph. You'd get two real numbers and some dimensionality. However, that doesn't really lend itself to easy analysis, so you might build on this by making a more easily analysed star schema/cube. Your idea of having an angle dimension is good, because it allows people to slice/dice the data, but it does assume that all your measurements will always be angles and responses: it'd be a pain if you always needed to create new dimensions for each type of analysis. – Rich Feb 20 '17 at 08:50
  • Your point about range is also interesting: each angle_dimension point could have an 'angle-range' column that is precalcualted to 0-15,16-30, etc, which is then easy to use in bar charts. – Rich Feb 20 '17 at 08:51

1 Answers1

0

Your last paragraph gives I think a good hint about how you want to store it for analysis: by time, by angle range, by location, by equipment- all of which would be dimensions.

One way of modelling this could be to have the grain of the fact as 'one row per plot point' with the 2 real figures in the fact, losing no precision.

You could then add supplementary dimensions as you say to categorise the figures. In your angle example, you could also have 'angle range' as a column showing 0-15,16-30, etc.

You may have to have a more complex/generic design if you have more than angle and response to contend with, with a generic dimension of 'X Axis type' including the range but with an additional 'X Axis type' column which is 'angle', 'response' etc.

I think your broad idea here is sound and off the shelf tools should be fine with both details and summaries. The key is to model something that reflects both the essential nature of the thing you're measuring (i.e. a reading in a machine) and how people want to analyse it. You would want to use the cube's capabilities to provide the calculations for averages rather than having the underlying dimensional model deal with that.

Rich
  • 2,207
  • 1
  • 23
  • 27
  • Hi Rich, thanks for the answer and comments. I had considered the approach of row per point as an alternative to resolve duplicate key issues with my prototype schema (I have a composite pkey of each dimension pkey for each fact, including an "x_value_dimension". duplicate x values, e.g. 0->360->0 led to violations). I thought this approach however would make it harder to re-visualise the original graphs for fault finding. Adding an additional dimension, as you suggest, for grouping (probably each integer to be more flexible) would resolve this problem in practice. – adam Feb 20 '17 at 21:15
  • For the more generic (interesting?) problem, I could potentially borrow from floats and have significand and exponent grouping dimensions, as for similar "x axis types" we'd generally be expecting a similar scale, and we'd probably bin data probably only be binning data to 1sf (e.g. 0.1 or 0.05). – adam Feb 20 '17 at 21:30
  • The reason for a primary key is to define and enforce the level of detail in the table. You shouldn't automatically just set the PK to all the dimension keys. If you have some kind of sample number from your source sytem, that should be in the fact and should have some kind of unique constraint – Nick.Mc Feb 21 '17 at 00:50
  • @adam, grouping could be both for each integer (i.e. your dimension has one row per integer) plus in ranges, in that the dimension table could have an additional column for the ranges, i.e. each integer in the dimension sits within a range. – Rich Feb 21 '17 at 08:27
  • Thanks both for your comments, this is my first foray into dimensional modelling for data warehousing, so I've made a fair few mistakes with the prototype schema, which I'm hoping to iron out in a revision. luckily nothing is nailed down yet and the raw data is readily available! I was thinking one or the other with regards to sample vs "x_value" dimensions, I can happily do both really. @Rich, I will need to decided on appropriate hierarchies for the x_value dimension. Also as angle is a special case (Polar coords instead of Cartesian) this probably needs to live somewhere as an axis type. – adam Feb 21 '17 at 21:58