How do I use GROUP BY in Cassandra?

Question

I have a dataset like

("group_1" , uuid , other, columns),
("group_1" , uuid , other, columns),
("group_1" , uuid , other, columns),
("group_2" , uuid , other, columns),
("group_2" , uuid , other, columns),
("group_3" , uuid , other, columns),
("group_3" , uuid , other, columns),

That is within a table declared like:

CREATE TABLE sample(
 group TEXT,
 id TEXT,
 Other,
 columns,
 PRIMARY KEY( group , id)
);

What I want to do is fetching a list of list of data, that all data with same group name will be in the same row. What do I mean is:

[ 
 [("group_1" , uuid , other, columns),
  ("group_1" , uuid , other, columns),
  ("group_1" , uuid , other, columns)],
 [("group_2" , uuid , other, columns),
  ("group_2" , uuid , other, columns)],
 [("group_3" , uuid , other, columns),
  ("group_3" , uuid , other, columns)],
]

That is a list of list of rows with the same group name.

I don't know if is it possible to reach such a result using database built-in functions like GROUP BY or not, because when I try it, it only returns the first row of same group name rows in the following query:

SELECT * FROM sample GROUP BY group;

I am also interested to know if there is a better way of achiving this result?

PS: The reason why don't I use group keys to fetch data like SELECT * FROM sample WHERE group = 'group_1'; is because I don't know the group names( there are thousands of them) and I only want fetch a bunch of data with different group names and put them in same group names and make a list of lists with same group name of them.

score 0 · Answer 1 · answered Jul 22 '23 at 07:40

0

Cassandra is not designed to select a significant bunch of data using SELECT. It's very important to define partition key in WHERE clause to select from specific node and clustering key to further limit the selected number of rows.

answered Jul 22 '23 at 07:40

Sergey Olontsev

1,084
7
19

score 0 · Answer 2 · answered Jul 27 '23 at 03:44

The GROUP BY clause summarises the column values into groups.

To illustrate, let's consider this dataset of vehicles:

 Brand  | Model         | Body type
--------+---------------+-----------
 Kia    | Sportage      | SUV
 Kia    | Sorento       | SUV
 Kia    | Seltos        | SUV
 Kia    | Stonic        | SUV
 Kia    | Niro          | EV
 Kia    | EV6           | EV
 Kia    | EV9           | EV
 Kia    | Cerato        | Car
 Kia    | Picanto       | Car
 Toyota | Yaris         | Car
 Toyota | Corolla       | Car
 Toyota | Camry         | Car
 Toyota | Yaris Cross   | SUV
 Toyota | C-HR          | SUV
 Toyota | Corolla Cross | SUV
 Toyota | RAV4          | SUV
 Toyota | Kluger        | SUV
 Toyota | Fortuner      | SUV
 Toyota | Prado         | SUV
 Toyota | Landcruiser   | SUV

In this table, we store the data by brand clustered by body type:

CREATE TABLE vehicles_by_brand_body_type (
    brand text,
    body_type text,
    model text,
    PRIMARY KEY (brand, body_type, model)
)

Here's an example of how to use the GROUP BY clause to get the list of body types sold by Kia:

cqlsh> SELECT body_type FROM vehicles_by_brand_body_type
           WHERE brand = 'Kia'
           GROUP BY body_type;

 body_type
-----------
       Car
        EV
       SUV

But in your case, the issue isn't really around grouping the data but the fact that you're not querying data by partition key. Performing an unbounded query (no WHERE clause or filter) is expensive as it requires a full table scan. It may work in non-production environments or when you have a small cluster but it doesn't scale so is not recommended in production.

You need to model your data correctly so that they are grouped/clustered in a partition and you are requesting data from the cluster by partition key. Cheers!

How do I use GROUP BY in Cassandra?

2 Answers2