0

I have an application where I have multiple selection dropdowns. According to that dropdown I need to get all values from the table.

CREATE TABLE generic_keyspace.cust_table (
    account_executive text,
    certification text,
    customer_category text,
    customer_name text,
    engine_model text,
    target_cost_final text,
    target_price_final text,
    PRIMARY KEY (account_executive, certification, customer_category, customer_name, engine_model)
) WITH CLUSTERING ORDER BY (certification ASC, customer_category ASC, customer_name ASC, engine_model ASC)

this is my table

SELECT * from cust_table
  WHERE customer_name IN ('cust1','cust2')
  AND customer_category IN ('cat1','cat2')
  ALLOW FILTERING;

while trying to execute this query I am getting an error

InvalidRequest: Error from server: code=2200 [Invalid query] \
  message="IN restrictions are not supported on indexed columns"

I tried removing the columns from primary key then I am getting another error

InvalidRequest: Error from server: code=2200 [Invalid query] \
  message="IN predicates on non-primary-key columns (customer_name) is not yet supported"
Erick Ramirez
  • 13,964
  • 1
  • 18
  • 23

2 Answers2

0

ALLOW FILTERING is a band-aid, and is not something you should ever do in a production Cassandra deployment. The only recommended usage of ALLOW FILTERING is when the query is limited to a single partition. There is some serious data modeling considerations that need to happen here.

General idea is for you to gather all the required read queries and then define the data model of the table(s) and then store the data on it.

I'd strong recommend you to go through the following free browser based courses,

  1. Fundamentals
  2. Data modeling by examples

Since your table's primary key is defined as follows as compound primary key,

PRIMARY KEY (account_executive, certification, customer_category, customer_name, engine_model)

wherein,

  • account_executive is your table's partition key.
  • certification, customer_category, customer_name, and engine_model are the clustering keys.

within the clustering keys, one cannot skip the preceeding key and use the other. For example, you cannot use a query like SELECT ... FROM ... WHERE account_executive = ? AND customer_category = ?; since certification column appears prior to customer_category. You could skip the other clustering key columns on the right side though for instance.

If you want the fastest response on your read queries, you'll use the full primary key as a whole,

SELECT ... FROM ... WHERE account_executive = ? AND certification = ? AND customer_category = ? AND customer_name = ? AND engine_model = ?;

Querying to match a non-primary key column is an anti-pattern, as querying should always result in a continuous slice of data retrieved from the table.

A custom secondary index can be created on non-primary key columns to help facilitate some queriying flexibility. This technique, however, does not guarantee trouble-free indexing, so know when and when not to use an index.

Based on how you've designed your table model, following are the queries that are supported,

  • SELECT ... FROM cust_table WHERE account_executive = ? AND certification = ? AND customer_category = ? AND customer_name = ? AND engine_model = ?; -- fastest & more efficient
  • SELECT ... FROM cust_table WHERE account_executive = ? AND certification = ?;
  • SELECT ... FROM cust_table WHERE account_executive = ? AND certification = ? AND customer_category = ?;
  • SELECT ... FROM cust_table WHERE account_executive = ? AND certification = ? AND customer_category = ? AND customer_name = ?;
  • SELECT ... FROM cust_table WHERE account_executive = ?;

You could also do other conditions within the WHERE clause and I'd recommend you read this CQL WHERE clause understanding blog which is a bit old but still has a lot of relevance with more recent versions of Cassandra®.

Madhavan
  • 758
  • 4
  • 8
0

The IN() operator is primarily designed for filtering on the partition key to retrieve multiple partitions.

Additionally, you can only use IN to filter on the last column of a compound primary key provided all the preceding columns are specified with the equality (=) operator. As you already discovered, the IN operator cannot be used on indexed columns.

To illustrate, I'll use this table as an example:

CREATE TABLE sample_table (
    pk int,
    ck1 int,
    ck2 int,
    ck3 int,
    some_column text,
    another_column text,
    PRIMARY KEY (pk, ck1, ck2, ck3)
)

The valid uses of the IN() operator for this table are:

SELECT * FROM sample_table WHERE pk IN (...)
SELECT * FROM sample_table WHERE pk = ? AND ck1 IN (...)
SELECT * FROM sample_table WHERE pk = ? AND ck1 = ? AND ck2 IN (...)
SELECT * FROM sample_table WHERE pk = ? AND ck1 = ? AND ck2 = ? AND ck3 IN (...)

Notice that the IN operator is used to filter just last column -- the IN operator cannot be used on any preceding column of the WHERE clause.

As a side note, it looks like your application has multiple queries it needs to execute to retrieve the required data and attempting to do that on a shared table. Cassandra is the DB of choice when you have a scale problem, meaning you need to retrieve data at super speeds. You can only achieve this when the tables are optimised for reads by designing a table for each application query.

So if you have an application query that filters on the customer name, you need to model your data so the table is partitioned by that column. For example:

CREATE TABLE customers_by_name (
    customer_name text,
    ...
    PRIMARY KEY (customer_name)
)

If you need to query by category, then design a table that is partitioned by category:

CREATE TABLE customers_by_category (
    customer_category text,
    ...
    PRIMARY KEY (customer_category)
)

As a general recommendation, the use of the IN operator on partition keys is discouraged because the coordinator needs to fire a separate read request for each item in the list. If you must, limit its use to 2 to 3 items for optimum performance.

Using the IN operator on a clustering column is not as problematic since the query is restricted on a single partition so it is only filtering against rows within a partition. Cheers!

Erick Ramirez
  • 13,964
  • 1
  • 18
  • 23