
I am modelling data for the database CrateDB.
I have an average of 400 customers, and they produce varying amounts of time-series data every day (between 5K and 500K records; ~15K on average).

Later I need to be able to query per customer_year_month and per customer_year_calendar_week. That means I will only query for the intervals (a sketch of such queries follows this list):

  • week
  • month
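
To make the access pattern concrete, the rollups would look roughly like this (a hedged sketch; the table name metrics and the columns customer_id and ts are placeholders, not an existing schema):

SELECT customer_id,
       date_trunc('month', ts) AS month,   -- monthly bucket
       count(*) AS events
FROM metrics                               -- placeholder table name
WHERE customer_id = 42                     -- placeholder customer
  AND ts >= '2016-12-01'
  AND ts <  '2017-01-01'
GROUP BY customer_id, date_trunc('month', ts);

The weekly variant would swap 'month' for 'week' in date_trunc() and narrow the time range accordingly.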

Now I am asking myself: how should I partition this table?

I would partition per customer and year.
Does this make sense?
Or would it be better to partition by customer, year and month?

  • Which kind of partitioning are you thinking about? Hash, range, list? Which RDBMS? Will you query single values or intervals? Do you aim to parallelize big queries or to minimize disk access for small queries? This depends on your hardware and/or business requirements. –  Dec 13 '16 at 15:51
  • I am thinking of the [partitioning of CrateDB](https://crate.io/docs/reference/sql/partitioned_tables.html). I will query only the mentioned intervals. – duichwer Dec 13 '16 at 15:58
  • So sorry, I do not know CrateDB; I thought it was a mistype for Create! Theoretically, I would choose a hash-range partition/subpartition scheme on customer-date (day level). The day level will handle the month and week cases, but you will have to carefully transform month/week intervals into day intervals in the WHERE clause. But again, I did not know this DBMS (thanks to you I do now). –  Dec 13 '16 at 16:42
  • @ydarma Thanks. Google often has the same problem ("Did you mean _Create_ DB"). Well, I wasn't sure how expensive it would be to run queries across more than one partition, and I didn't find anything about that on the CrateDB website. – duichwer Dec 14 '16 at 09:07

1 Answer


So, the question of how to partition a table is quite complex and should take a number of things into account, among others:

  • What queries should be run?
  • The way the data is inserted
  • Available hardware resources
  • Cluster size

Essentially, each partition also creates overhead by multiplying the shard count (a partition can be thought of as a "sub-table" selected by a column value), which, if chosen improperly, can hinder performance a lot. In your case, 15K inserts a day overall is not too much; however, the skewed distribution of inserts might cause problems: a customer partition growing by 500K rows a day will run into performance problems much earlier than one growing by 5K. As a consequence, I would use weekly partitioning only, without partitioning by customer.

create table "customer-logging" (
    customer_id long,
    log string,
    ts timestamp,
    week as date_trunc('week', ts)   -- generated column, used as partition key
) clustered into 8 shards
partitioned by (week);

Please only use 8 shards if you have an appropriate number of CPU cores ;)
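
With the generated week column in place, a calendar-week query is then just a range filter on ts (a sketch assuming the schema above; customer_id = 42 is a placeholder). If I read the CrateDB docs correctly, the optimizer can infer the matching partitions from a filter on ts because week is generated from it, so only the relevant weekly partition is touched; a month query is the same pattern with a wider range spanning the four or five weekly partitions of that month:

SELECT customer_id, count(*) AS events
FROM "customer-logging"
WHERE customer_id = 42
  AND ts >= '2016-12-12'          -- start of the calendar week
  AND ts <  '2016-12-19'          -- exclusive upper bound
GROUP BY customer_id;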

Docs: date_trunc(), partitioned tables (https://crate.io/docs/reference/sql/partitioned_tables.html)

Ideally, you'd try out a few different combinations and see what works best for you. Insights into shard sizes and locations are provided by our sys tables, so you can check whether there's a particularly fat shard overloading a node ;)
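
For example, a query along these lines against sys.shards should surface oversized shards (a sketch; the column names are as I recall them from the CrateDB docs, so verify them against your version):

SELECT table_name, partition_ident, node['name'] AS node_name, size
FROM sys.shards                            -- built-in shard metadata table
WHERE table_name = 'customer-logging'
ORDER BY size DESC                         -- fattest shards first
LIMIT 10;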

Cheers, Claus
