1

I have uploaded my csv file in my Datastax Astra DB. The thing is why the data when it is uploaded, the rows gets unorganized or random. Say if in my csv the row id is like 1, 2, 3, ... Then when I am uploading this to my DB, it gets uploaded scrambled/rearranged say like row id 2, 1, 4, 5, 3 ... Due to this reason when I am requesting the data via http_methods in python, the data is also coming as it is stored i.e. unorganized way. Why is this happening ? And is there a way to solve this ?

Pritam Sinha
  • 309
  • 6
  • 11

2 Answers2

2

What it comes down to is how Cassandra stores your data. I’m assuming that you’re only defining a partition key within the primary key. The partition key is responsible for how the data is stored on the nodes in your cluster. If you want your data in a particular order then you’ll need to define the second part of the primary key, the clustering key.

I’d suggest taking a look at this answer which gives an excellent explanation of ordering https://stackoverflow.com/a/35708544/13078421

dwettlaufer
  • 378
  • 1
  • 5
  • Hi @dwettlaufer. Thanks for the reply. I went through the link and the idea of primary key consisting of partition key and clustering key. I see in the link they uses 'dept' as partition key and 'empno.' as clustering key. Correct me if I am wrong. But for my problem, I want to set say 'Id' column as my partition key and also I want to order the row by 'Id' itself. I have columns ['Id', 'Sepallength', 'Sepalwidth', 'Petallength', 'Petalwidth', 'Species']. How can I do this ? Can I use same key as both, if not what is the workaround ? I am using astra.datastax load data to upload. – Pritam Sinha Oct 08 '21 at 15:09
2

Why is this happening ?

dwettlaufer's answer is spot-on, but I'll provide a visual representation to help with understanding just what's happening here:

> SELECT id,token(id),name FROM numbers ;

 id | system.token(id)     | name
----+----------------------+-------
  5 | -7509452495886106294 |  five
  1 | -4069959284402364209 |   one
  2 | -3248873570005575792 |   two
  4 | -2729420104000364805 |  four
  3 |  9010454139840013625 | three

(5 rows)

Data in Astra DB is stored in order by the hashed token value of the partition key. Because of the hash, numeric values are quite likely to be arranged in a different order.

And is there a way to solve this ?

As was mentioned, sort order can only be enforced within a partition key. So if a different column is used as the partition key, id can be used to sort data within it.

Consider a table that looks like this:

CREATE TABLE dept_numbers (
    dept int,
    id int,
    name TEXT,
PRIMARY KEY(dept,id));

Then something like this would work:

> SELECT * FROM dept_numbers WHERE dept=1;

 dept | id | name
------+----+-------
    1 |  1 |   one
    1 |  2 |   two
    1 |  3 | three
    1 |  4 |  four
    1 |  5 |  five

(5 rows)

Note that all queries with Astra DB should have a WHERE clause. Without it, an expensive table scan will result.

Aaron
  • 55,518
  • 11
  • 116
  • 132