What's the best practice to store performance data in CrateDB?

Question

My application stores performance time series data in CrateDB, and in order to get the setup right, I have a few questions, since it's going to be around 72M records a day and it should scale properly :). My goal is to visualize the resulting data with Grafana and currently I have the following structure in mind:

CREATE TABLE metrics (
  ts TIMESTAMP,
  hostname STRING,
  servicename STRING, 
  perfdata OBJECT(DYNAMIC)
)

// for example
{
    "hostname": "localhost",
    "servicename": "ping",
    "timestamp": 1483699527,
    "perfdata": {
        "rta": {
            "current": 0.5,
            "unit": "ms",
            "warn": 100,
            "critical": 200
        },
        "pl": {
            "current": 0,
            "unit": "%",
            "warn": 10,
            "crit": 20
        }
    }
}

The important bits are the host-/servicename, the metric's name and values, and the timestamp. Which would also be the alternative schema:

 CREATE TABLE metrics (
   ts TIMESTAMP,
   hostname STRING,
   servicename STRING, 
   metric OBJECT(DYNAMIC) AS (
      unit STRING,
      name STRING,
      value DOUBLE,
   )       
 )

So which one would be the preferred way to store the data? Do I also need partitioning? My aggregations usually show the last 24h and rarely the last month...

Thanks!

score 0 · Accepted Answer · edited Jan 20 '17 at 16:26

In general, I would suggest going with the second table schema, since its simpler and it lets you capture the raw data (instead of something pre-assembled) and do the aggregation for what you actually need.

However, this question is quite tricky since it depends a lot on the actual requirements. So essentially the combined schema will make updates tricky (objects can only be replaced and not updated) and therefore requires the data to be sent together (and maybe collected at the same pace?).

On top of that, setting up partitioning may be useful. This can speed up queries and allows to change the number of shards for future partitions (and allows you to scale better). Common practice is to partition by month or week, in some cases also by day. But has to take care there to prevent the number of shards from exploding as shards also require system resources.

CrateDB works very well with Grafana, thanks to the datasource plugin :)

score 0 · Answer 2 · answered Mar 09 '17 at 08:39

just saw this one now and I've been playing around with perdata (nagios) and cratedb. What we tested (based on a mongodb timeseries demo) was storing per hour and let perfdata[ts]={perf object} (we did actually test perfdata['minute']['second']={})

A 24 hour schema would be perfdata['hour']['minute']

So by defining a primary key on host/service/ you do 'on duplicate key update..'

So a query by host/service/hour is just one query :-)

What's the best practice to store performance data in CrateDB?

2 Answers2