1

Lets say I have a table

    id     col1    col2   col3
    1      ABC     DEF     XYZ
    2      XXX     YYY     ZZZ

Most frequent queries is going to be

SELECT * from XYZ where col1='abc' and col2='def'
SELECT * from XYZ where col1='abc' and col2='def' and col3='xyz'

As per VoltDB docs:-

  1. Partition should be done on column on which most searches are going to be performed.
  2. Partitioning should be done on one column

I couldn't find any example where search is performed on multiple columns.

I wonder what is the best way to partition table for multi column searches?

EDIT:-

Or what if my query is like:-

SELECT * from XYZ where col1 IN ('abc', ..., ...) and col2 IN ('def', ...) and col3 IN ('xyz', ...).
Jack Daniel's
  • 2,583
  • 22
  • 28
  • Why are you partitioning in the first place? There are valid reasons but also very common invalid reasons. – usr Apr 24 '14 at 08:27
  • Because table will have many records.. – Jack Daniel's Apr 24 '14 at 08:32
  • Would appreciate if you can let me know both valid and invalid reasons – Jack Daniel's Apr 24 '14 at 08:33
  • "Many records" is not a valid reason by itself because partitioning does not magically make performance better. Why would it? You should probably research in what cases it helps and in what cases it hurts. – usr Apr 24 '14 at 08:38
  • Exactly this is what i am looking for.. thats why posted this question. – Jack Daniel's Apr 24 '14 at 09:23
  • Stack Overflow is not a good resource for general tutorial style content. – usr Apr 24 '14 at 10:31
  • 1
    To respond to usr, VoltDB partitioning can absolutely make performance better, as it can distribute work among many cores and many hosts in a cluster. VoltDB partitioning is a different animal than partitioning in some non-clustered, traditional RDBMSs. – John Hugg Apr 24 '14 at 14:17

2 Answers2

4

Guidelines for picking a column:

First off, you should pick a partitioning column that has many different values. To illustrate, picking a Male/Female column partitions poorly if you have more than two partitions (common).

It's also a bad idea to pick a column with a few values that dominate other values. If 20% of your values are NULL, then more than 20% of your rows will partition to the same place. Distributions don't have to be even, but if you have "hot" values, it's helpful to at least have a lot more "hot" values than partitions.

Picking a timestamp can also be tricky if the timestamp advances slower than the rate of ingestion. In this case your load will round-robin the partitions one-at-a-time when the timestamp advances. Though in practice a single partition can often handle 10-50k inserts per second, so this actually works for non-exteme use cases.

So if you partition on a column with lots of values that are pretty evenly distributed, your inserts will partition nicely and you will be able to ingest some serious load.

Picking a column to optimize queries:

Now the question becomes, given a set of candidate columns, can you pick one to make your queries run faster?

Any query that matches on an equality test to the partition column can be sent to a single partition. In your example above, if you partitioned on col1 or col2, then both queries would be single partition. If you partitioned on col3, only the second query would be single partitioned.

A lot of times the partitioning column will be obvious, perhaps a customer id or ticket symbol. But even if it's obvious, and especially if it's not, you're going to want to run queries that don't partition. The good news is that VoltDB 4.0 has made read-only cross-partition queries dramatically faster than in previous versions. Our internal benchmarks show that tens of thousands of queries per second are possible.

This level of cross-partition read performance is often better than the read performance of non-partitioned RDBMSs. So in VoltDB 4.0, it's now more important to partition for write operations than for reads. This makes partitioning a bit simpler.

John Hugg
  • 380
  • 1
  • 2
  • thanks for the explanation. As of now I am not worried about insertions as there are going to be one time bulk insertions. I am looking for fastest method to search. As you suggested that I can partition on col1. What if my queries can also be like :- SELECT * from XYZ where col1 IN ('abc', ..., ...) and col2 IN ('def', ...) and col3 IN ('xyz', ...). For these type of queries, I am not sure weather partitioning will make sense or not. IMHO, I feel it may slow down the performance... What do you suggest? – Jack Daniel's Apr 25 '14 at 03:33
0

Here are some criteria that may help in selecting a partition column:

Base considerations:

  • Should have values with sufficient cardinality so that it uses all of the partitions

  • Ideally, the values should hash evenly so that the distribution to the partitions is even.

That may leave you with several choices. Any would be fine if the workload was mostly inserts, because inserts will always provide the partitioning column value, so inserts will always be executed in a single partition and will therefore scale very well. To decide which column is best, you might consider:

  • For the queries and other transactions, which column is most commonly provided as an input parameter?

  • If there are transactions that involve multiple tables, which column is shared by all of the relevant tables?

  • If you need to join the table with another partitioned table, you must partition on one of the join keys.

Hopefully that will make it clear what is the best choice. There can be trade-offs, so sometimes it is worthwhile to test different approaches. Sometimes denormalizing slightly to provide a common partitioning key for related tables is something you might consider to result in a higher percentage of single-partition transactions, or to enable more joins. Also, it is perfectly ok to have queries that run as multi-partition transactions. These can scale to thousands per second, in some cases many thousands per second. So while you do want to maximize the percentage of the workload that is single-partition, you can still have a percentage that isn't.

Indexing is also very important. In your example, if you chose col1 or col2, then both queries would be executed as a single-partition transaction, but within a single-partition there may be many records with different partition key values. Defining a column as the partition key does not automatically create an index on that column. You still want to define indexes to support the queries you need to perform quickly and frequently. VoltDB is a row store, so many of the same considerations you would use in creating indexes on a traditional RDBMS will apply. Based on the example queries, an index on (col1,col2,col3) would support both queries. If you have a lot of different search queries that need to run frequently, it may help to create multiple indexes.

When designing the indexes it helps to examine the explain plans for your queries. You can do that in VoltDB's SQL interface using the following commands: https://voltdb.com/docs/UsingVoltDB/sysprocexplain.php https://voltdb.com/docs/UsingVoltDB/...xplainproc.php

You an also see these explain plans in the html catalog report that is output when you run "voltdb compile". The catalog report is also available through the web interface on port 8080.

The plan will show if the query execution would involve a table scan, or if it will use an index.

BenjaminBallard
  • 1,482
  • 12
  • 11