
I am currently studying the documentation of InfluxDB 2.0; however, I don't understand the logic between buckets, measurements & retention policies entirely yet.

The documentation says that databases and retention policies were replaced with buckets. By definition, a bucket is:

"a named location where time series data is stored in InfluxDB 2.0"

In my understanding

A bucket contains shard groups => Shard groups store the data of a certain time interval in a particular folder; for example, a shard group could always save the data of a four-hour interval in a single folder.

A shard group contains shards => Shards are the single rows/points of the time-series table.

Moreover, Influx writes in the documentation that one bucket has one retention policy.

This means that "a bucket" stores only one time-series and not several ones; otherwise, a bucket could have several retention policies.

In case my understanding is correct, does this mean that you can only include measurements in the same bucket when all of them have the same retention policy? Because if there are two measurements with different retention policies in the same bucket, one retention policy could delete data from the other measurement. Please correct me if I confuse things here.

However, in case I am right, how does this influence hardware requirements?

Influx says that the number of series affects hardware requirements.

Does that mean that every bucket/retention policy raises the number of series and thereby the hardware requirements?

For example, does it make a difference whether I store 60,000 series in one bucket, or 20,000 series in bucket A, another 20,000 series in bucket B, and the final 20,000 series in bucket C?

I am looking forward to your feedback!

1 Answer

Alvaro -

The most important feature of a bucket is that it defines the retention policy for all data in it. A bucket only has one retention policy. If you have data that needs two different time horizons, you will need two buckets. Often this is done through downsampling. For example, I keep high-fidelity 1/s data for a week, and then I keep a lower-resolution 1/min version of the data for a month. I would use two buckets here.
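To make the downsampling idea concrete, here is a minimal pure-Python sketch of turning 1/s points into 1/min averages. It does not talk to InfluxDB at all; the function name `downsample_mean` and the sample data are made up for illustration, and in a real setup this aggregation would typically be done by a scheduled task that reads from the short-retention bucket and writes to the long-retention one.

```python
from collections import defaultdict
from statistics import mean


def downsample_mean(points, window_s=60):
    """Average (timestamp_seconds, value) points into fixed-size windows.

    Hypothetical helper for illustration only, not an InfluxDB API.
    """
    windows = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its window.
        windows[ts - ts % window_s].append(value)
    return [(start, mean(values)) for start, values in sorted(windows.items())]


# Two minutes of 1/s data, as it might sit in the one-week bucket.
raw = [(t, float(t % 60)) for t in range(120)]

# One point per minute, as it might be written to the one-month bucket.
per_minute = downsample_mean(raw)
```

The raw points and the downsampled points need different retention periods, which is why they end up in two different buckets.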

For InfluxDB, "a time series" is defined by its "series key", which is the measurement, tag set, and field key. So a bucket can contain many different time series. You can put many measurements into a single bucket. It seems you're familiar with InfluxDB 1.x, so I think you know about measurements, tags, and fields already.

"Series Cardinality" is the number of time series you have in total. The same series key in different buckets is treated as separate series. So for a contrived example, if you duplicated writing your data into two different buckets, but it is identical otherwise, you have doubled your cardinality. It makes sense that in this situation the hardware requirements would be higher - you've doubled your data under management.
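As a rough illustration of that counting rule, the sketch below models a series as the combination of bucket, measurement, tag set, and field key, and counts the distinct combinations. This is only a mental model under that assumption, not how InfluxDB computes cardinality internally; `series_cardinality` and the sample points are hypothetical.

```python
def series_cardinality(points):
    """Count distinct (bucket, measurement, tag set, field key) combinations.

    Hypothetical helper for illustration only, not an InfluxDB API.
    """
    return len({
        (bucket, measurement, frozenset(tags.items()), field_key)
        for bucket, measurement, tags, field_key in points
    })


# Three writes into bucket "a"; the first and last belong to the same series.
one_bucket = [
    ("a", "cpu", {"host": "h1"}, "usage"),
    ("a", "cpu", {"host": "h2"}, "usage"),
    ("a", "cpu", {"host": "h1"}, "usage"),
]

# The same data written a second time into bucket "b".
duplicated = one_bucket + [("b", m, t, f) for _, m, t, f in one_bucket]

print(series_cardinality(one_bucket))
print(series_cardinality(duplicated))
```

Under this model, the duplicated data set has twice the cardinality of the original, whereas splitting distinct series across several buckets leaves the total unchanged.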

This blog post gives a great overview of these concepts: "Data Layout and Schema Design Best Practices for InfluxDB". If you have follow-up questions, please ask them. There is also an InfluxDB community Slack if you want to ask the community there.

Phil
  • So what I get is that storing 60k series in one bucket or storing 20k series in each of 3 buckets makes no difference to the hardware requirements for InfluxDB or to performance? I'm trying to see how we're going to store billions of series in InfluxDB, and I need to optimize the data and RAM consumption. – Danielle Paquette-Harvey Apr 26 '21 at 15:49
  • @DaniellePaquette-Harvey - for InfluxDB 2.x, I would expect little performance difference (and little difference in hardware requirements) between 60k series in one bucket and 20k series in each of 3 different buckets with the same retention policy. Were you able to come to a solution? – Phil Jan 19 '22 at 23:48