
I am modelling data for the database CrateDB.
I have an average of 400 customers, and they produce varying amounts of time-series data every day (between 5K and 500K records; ~15K on average).

Later I need to be able to query per customer_year_month and per customer_year_calendar_week. That means I will only query for the intervals (a sketch of such queries follows this list):

  • week
  • month
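
To make the access pattern concrete, the rollups would look roughly like this (a hedged sketch; the table name metrics and the columns customer_id and ts are placeholders, not an existing schema):

SELECT customer_id,
       date_trunc('month', ts) AS month,   -- monthly bucket
       count(*) AS events
FROM metrics                               -- placeholder table name
WHERE customer_id = 42                     -- placeholder customer
  AND ts >= '2016-12-01'
  AND ts <  '2017-01-01'
GROUP BY customer_id, date_trunc('month', ts);

The weekly variant would swap 'month' for 'week' in date_trunc() and narrow the time range accordingly.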

Now I am asking myself: how should I partition this table?

I would partition per customer and year.
Does this make sense?
Or would it be better to partition by customer, year and month?

  • Which kind of partitioning are you thinking about? Hash, range, list? Which RDBMS? Will you query single values or intervals? Do you aim to parallelize big queries or to minimize disk access for small queries? This depends on your hardware and/or business requirements. –  Dec 13 '16 at 15:51
  • I am thinking of the [partitioning of CrateDB](https://crate.io/docs/reference/sql/partitioned_tables.html). I will query only the mentioned intervals. – duichwer Dec 13 '16 at 15:58
  • So sorry, I do not know CrateDB; I thought it was a mistype for Create! Theoretically, I would choose a hash-range partition/subpartition scheme on customer-date (day level). The day level will handle the month and week cases, but you will have to carefully transform month/week intervals into day intervals in the WHERE clause. But again, I did not know this DBMS (thanks to you I do now). –  Dec 13 '16 at 16:42
  • @ydarma Thanks. Google often has the same problem ("Did you mean _Create_ DB"). Well, I wasn't sure how expensive it would be to run queries across more than one partition, and I didn't find anything about that on the CrateDB website. – duichwer Dec 14 '16 at 09:07

1 Answer


So, the question of how to partition a table is quite complex and should take a number of things into account, among others:

  • What queries should be run?
  • The way the data is inserted
  • Available hardware resources
  • Cluster size

Essentially, each partition also creates overhead by multiplying the shard count (a partition can be thought of as a "sub-table" selected by a column value), which, if chosen improperly, can hinder performance a lot. In your case, 15K inserts a day overall is not too much; however, the skewed distribution of inserts might cause problems: a customer partition growing by 500K rows a day will run into performance problems much earlier than one growing by 5K. As a consequence, I would use weekly partitioning only, without partitioning by customer.

create table "customer-logging" (
    customer_id long,
    log string,
    ts timestamp,
    week as date_trunc('week', ts)   -- generated column, used as partition key
) clustered into 8 shards
partitioned by (week);

Please only use 8 shards if you have an appropriate number of CPU cores ;)
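
With the generated week column in place, a calendar-week query is then just a range filter on ts (a sketch assuming the schema above; customer_id = 42 is a placeholder). If I read the CrateDB docs correctly, the optimizer can infer the matching partitions from a filter on ts because week is generated from it, so only the relevant weekly partition is touched; a month query is the same pattern with a wider range spanning the four or five weekly partitions of that month:

SELECT customer_id, count(*) AS events
FROM "customer-logging"
WHERE customer_id = 42
  AND ts >= '2016-12-12'          -- start of the calendar week
  AND ts <  '2016-12-19'          -- exclusive upper bound
GROUP BY customer_id;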

Docs: date_trunc(), partitioned tables (https://crate.io/docs/reference/sql/partitioned_tables.html)

Ideally, you'd try out a few different combinations and see what works best for you. Insights into shard sizes and locations are provided by our sys tables, so you can check whether there's a particularly fat shard overloading a node ;)
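
For example, a query along these lines against sys.shards should surface oversized shards (a sketch; the column names are as I recall them from the CrateDB docs, so verify them against your version):

SELECT table_name, partition_ident, node['name'] AS node_name, size
FROM sys.shards                            -- built-in shard metadata table
WHERE table_name = 'customer-logging'
ORDER BY size DESC                         -- fattest shards first
LIMIT 10;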

Cheers, Claus
