1

Pre Requirement : I need to find all the matching result in one query which is more than 40K results.

Requirement : Two tables - product and product_category. I am trying to fetch all the products with matching category from product_category table.

Table Structure :

CREATE TABLE catalog.product (
    product_id string PRIMARY KEY index using plain,
    name string,
    sku string,
) clustered by (product_id) into 4 shards;

create table catalog.product_category (
    category_id string primary key index using plain,
    product_id  string primary key index using plain,
    INDEX product_category_index using plain(product_id, category_id)
    active boolean,
    parent_category_id integer,
    updated_at timestamp
);

Join Query :

select p.product_id from catalog.product_category pc join catalog.product p on p.product_id=pc.product_id limit 40000;

Tried multiple things - indexing product_id (both as integer and string) etc.

Result : To showup 35K result it is taking more than 90 seconds everytime.

Problem : What can I do to optimize the query response time?

Some Other Information : - CPU Core -4 - Tried with one or multiple nodes - Default Sharding - Total number of products - 35K and product_category has 35K enteries only.

Usecase : I am trying to use crateDB as persistent cache but with the given query response time we can't really do that. So we will move to some in-memory database like REDIS or Memcache. The reason to choose crateDB was the querying ability on your persistent data.

Sanjay Kumar
  • 1,474
  • 14
  • 22

1 Answers1

0

Joining on STRING type columns is very expensive comparing to joining on NUMERIC types (equality check on string values is much more expensive than on numeric values). If you don't have a special reason for doing that I'd recommend to change them to a NUMERIC type (e.g. INTEGER, LONG, ...), this would improve the query speed by a multiple factor.

Btw. index using plain is the default index setting for all columns, so you could just leave it out. Also the compound index product_category_index will not help improving the join query, it's just important if you would filter on that by using a WHERE clause including this index column.

Updated

Another improvement you can do is adding an ORDER BY p.product_id, pc.product_id clause. By this, the join algorithm could stop when it reached your applied LIMIT.

Sebastian Utz
  • 719
  • 3
  • 9
  • @SanjayKumar Updated my answer with an `ORDER BY` suggestion. – Sebastian Utz Mar 28 '17 at 09:45
  • I tried what you said, the only way the performance got blazingly fast was when I put everything in a single table. Though it is slower than in memory access but comparable. Thanks @Sabastian. – Sanjay Kumar Mar 30 '17 at 18:36