
Is there any performance tuning to do for a write-bound workload in YugabyteDB? We thought that by simply adding nodes to our YugabyteDB cluster, without further tuning, we would see a noticeable increase in write throughput, but this is not the case. The schema can be found below.

   Column   |            Type             | Collation | Nullable |           Default            | Storage  | Stats target | Description
------------+-----------------------------+-----------+----------+------------------------------+----------+--------------+-------------
 update_id  | character varying(255)      |           | not null |                              | extended |              |
 node_id    | character varying(255)      |           | not null |                              | extended |              |
 data       | character varying           |           | not null |                              | extended |              |
 created_at | timestamp without time zone |           |          | timezone('utc'::text, now()) | plain    |              |
Indexes:
   "test_pkey" PRIMARY KEY, lsm (update_id HASH)
   "test_crat" lsm (created_at DESC)

The table's tablets are spread across all tservers with RF=3. created_at is an ever-advancing timestamp: the table currently holds no more than two days of data, and every new insert acquires a fresh timestamp.
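For reference, DDL along these lines should reproduce the schema (the table name test is inferred from the test_pkey index name, so treat it as an assumption):

create table test (
    update_id  varchar(255) not null,
    node_id    varchar(255) not null,
    data       varchar      not null,
    created_at timestamp without time zone default timezone('utc'::text, now()),
    primary key (update_id hash)
);
-- Range-sharded secondary index on the timestamp, matching test_crat above.
create index test_crat on test (created_at desc);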


1 Answer


In the case of the schema called out above, the test_crat index is limited to 1 tablet because it is range-sharded. Since created_at takes only recent, ever-increasing values, new rows land on the same shard/tablet even with tablet splitting, meaning that all inserts go to 1 shard. As explained in this Google Spanner documentation (YugabyteDB bases its sharding, replication, and transactions architecture on Spanner's), this is an antipattern for scalability. As mentioned in that documentation:

If you need a global (cross node) timestamp ordered table, and you need to support higher write rates to that table than a single node is capable of, use application-level sharding. Sharding a table means partitioning it into some number N of roughly equal divisions called shards. This is typically done by prefixing the original primary key with an additional ShardId column holding integer values between [0, N). The ShardId for a given write is typically selected either at random, or by hashing a part of the base key. Hashing is often preferred because it can be used to ensure all records of a given type go into the same shard, improving performance of retrieval. Either way, the goal is to ensure that, over time, writes are distributed across all shards equally. This approach sometimes means that reads need to scan all shards to reconstruct the original total ordering of writes.

What that would mean is: to get recent changes, you would have to query each of the shards. Suppose you have 32 shards:

select * from raw3 where shard_id = 0  and created_at > now() - INTERVAL 'xxx';
..
select * from raw3 where shard_id = 31  and created_at > now() - INTERVAL 'xxx';
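If you would rather issue a single statement, you could also express the fan-out with an IN list (a sketch; verify with explain that the planner still drives each shard value through the index rather than falling back to a full scan):

select * from raw3
where shard_id in (0, 1, 2, /* ..., */ 31)
  and created_at > now() - INTERVAL 'xxx'
order by created_at desc;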

On insert, every row could simply be given a random value from 0..31 for your shard_id column. And your index would change from:

(created_at DESC)

to

(shard_id HASH, created_at DESC)
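Putting it together, a sketch of the resharded table could look like the following (names and the shard count of 32 are carried over from the example above; the random default is just one way to assign shards):

create table raw3 (
    update_id  varchar(255) not null,
    node_id    varchar(255) not null,
    data       varchar      not null,
    -- Random shard assignment in [0, 32), per the Spanner guidance above.
    shard_id   int          not null default floor(random() * 32)::int,
    created_at timestamp without time zone default timezone('utc', now()),
    primary key (update_id hash)
);
-- HASH on shard_id spreads writes across tablets; created_at DESC keeps
-- the per-shard time ordering for recent-changes queries.
create index raw3_crat on raw3 (shard_id hash, created_at desc);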

Another approach, which may be less intuitive but more effective, is to use a partial index for each shard you want.

Here is a simple example using 4 shards:

create index partial_0 ON raw3(created_at DESC) where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4=0;

The partial index above only includes rows where the created_at epoch in milliseconds, taken modulo 4, equals 0. Repeat for the other 3 shards:

create index partial_1 ON raw3(created_at DESC) where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4 = 1;
create index partial_2 ON raw3(created_at DESC) where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4 = 2;
create index partial_3 ON raw3(created_at DESC) where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4 = 3;
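As a convenience, if you're in psql you could generate these statements instead of typing them out (a sketch using format() and \gexec; adjust the 4s for a different shard count):

select format(
  $f$create index partial_%s on raw3 (created_at desc)
     where (extract(epoch from timezone('utc', created_at)) * 1000)::bigint %% 4 = %s$f$,
  i, i)
from generate_series(0, 3) as i
\gexec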

And then, when you query, PostgreSQL is smart enough to pick the right index:

yugabyte=# explain analyze select * from raw3 where (extract(epoch from timezone('utc',created_at)) * 1000)::bigint % 4 = 3 AND created_at < now();
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Index Scan using partial_3 on raw3  (cost=0.00..5.10 rows=10 width=16) (actual time=1.429..1.429 rows=0 loops=1)
   Index Cond: (created_at < now())
 Planning Time: 0.210 ms
 Execution Time: 1.502 ms
(4 rows)

No need for a new shard_id column in the base table or in the index. If you want to reshard down the road, you can create new partial indexes with a different shard count and drop the old ones.
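The read side then mirrors the shard_id fan-out shown earlier, with the partial-index predicate standing in for the column, so each query can be served by its own partial index:

select * from raw3
where (extract(epoch from timezone('utc', created_at)) * 1000)::bigint % 4 = 0
  and created_at > now() - INTERVAL 'xxx';
-- ... repeat for % 4 = 1, 2, 3 and merge the results in the application.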

More information about the DocDB sharding layer within YugabyteDB can be found here. If you are interested in the different sharding strategies we evaluated, and why we decided on consistent hash sharding as the default sharding strategy, take a look at this blog written by our Co-Founder and CTO Karthik Ranganathan.
