
I have the following code:

cluster := gocql.NewCluster("our-cass")
cass, err := cluster.CreateSession()
if err != nil {
    // handle error
}
defer cass.Close()

iter := cass.Query(`SELECT * FROM cmuser.users LIMIT 9999999999;`).Iter()
c := iter.Columns()
scanArgs := make([]interface{}, len(c))

for i := 0; i < len(scanArgs); i++ {
    scanArgs[i] = makeType(c[i])
}

for iter.Scan(scanArgs...) { ... }

The problem is that we have way too many rows in that table, but I need to read all of them to migrate the data to another database. Is there a way to stream the data from Cassandra? Unfortunately, we don't have a sequence for the table's primary key; we use a UUID for the PK. That means we can't use the simple technique of two for loops, one incrementing a counter and walking through all the rows that way.


2 Answers


For a pure Cassandra way to do this, you could run range queries over the token ranges as they are broken up per node.

First, find the token ranges:

$ nodetool ring

Datacenter: dc1
==========
Address   Rack       Status State   Load         Owns                Token
                                                                      8961648479018332584
10.1.4.3  rack3      Up     Normal  1.34 GB      100.00%             -9023369133424793632
10.1.4.1  rack2      Up     Normal  1.56 GB      100.00%             -7946127339777435347
10.1.4.3  rack3      Up     Normal  1.34 GB      100.00%             -7847456805881540087
...

etc... (this output might be long, depending on the number of nodes and tokens on each node)

Then adjust your query to use the token() function on your partition key. As I don't know what your PRIMARY KEY definition is, I'm going to guess and use users_id as the partition key:

SELECT * FROM cmuser.users
WHERE token(users_id) > 8961648479018332584
  AND token(users_id) <= -9023369133424793632;

Once that completes, move to the next token range:

SELECT * FROM cmuser.users
WHERE token(users_id) > -9023369133424793632
  AND token(users_id) <= -7946127339777435347;

Breaking your query down like this helps to ensure that it only reads from a single node at a time. That should allow the query to read the data sequentially from the cluster (and from disk), without worrying about timeouts.
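To tie this back to the Go code in the question, here is a minimal sketch of driving those range queries with gocql. The partition key users_id and the token pairs are assumptions; the pairs would come from your own nodetool ring output, with one (start, end] entry per consecutive pair of tokens in the ring:

package main

import (
    "fmt"
    "log"

    "github.com/gocql/gocql"
)

func main() {
    cluster := gocql.NewCluster("our-cass")
    cass, err := cluster.CreateSession()
    if err != nil {
        log.Fatal(err)
    }
    defer cass.Close()

    // (start, end] token pairs copied from `nodetool ring` (assumed values);
    // a real run would cover every consecutive pair in the ring.
    ranges := [][2]int64{
        {-9023369133424793632, -7946127339777435347},
        {-7946127339777435347, -7847456805881540087},
        // ...
    }

    for _, r := range ranges {
        iter := cass.Query(
            `SELECT * FROM cmuser.users WHERE token(users_id) > ? AND token(users_id) <= ?`,
            r[0], r[1],
        ).Iter()

        row := map[string]interface{}{}
        for iter.MapScan(row) {
            fmt.Println(row) // migrate the row to the target database here
            row = map[string]interface{}{} // use a fresh map per row so values don't carry over
        }
        if err := iter.Close(); err != nil {
            log.Printf("range (%d, %d]: %v", r[0], r[1], err)
        }
    }
}

Note that this sketch treats the ranges as simple (start, end] pairs; as the comment below points out, the wraparound range at the ends of the ring needs special handling.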

Aaron
  • An important condition for this is explicit routing of the query, as not all drivers are able to automatically route it to the right replica. Another common issue is that you need to handle the beginning/end of the token range correctly, as the ranges don't start exactly at the MIN token value... I have an example of how to do it correctly in Java (the same way Spark & DSBulk work): https://github.com/alexott/dse-playground/blob/master/driver-1.x/src/main/java/com/datastax/alexott/demos/TokenRangesScan.java – Alex Ott Jun 05 '20 at 07:10

Gocql has some options for paging (assuming your Cassandra version is at least version 2).

Gocql's Session has a method, SetPageSize.

And Gocql's Query has a similar method, PageSize.

This may help you break up your query. Here's what the code would look like:

cluster := gocql.NewCluster("our-cass")
cass, err := cluster.CreateSession()
if err != nil {
    // handle error
}
defer cass.Close()

iter := cass.Query(`SELECT * FROM cmuser.users;`).PageSize(5000).Iter()

// use the iter as usual to iterate over all results;
// gocql will send additional CQL queries when it needs to fetch new pages
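For completeness, here is a sketch of what the full read loop might look like with paging, reusing the makeType helper from the question (its definition isn't shown, so treat that part as an assumption):

iter := cass.Query(`SELECT * FROM cmuser.users;`).PageSize(5000).Iter()

cols := iter.Columns()
scanArgs := make([]interface{}, len(cols))
for i := range scanArgs {
    scanArgs[i] = makeType(cols[i])
}

// Scan fetches the next page behind the scenes whenever the current one is exhausted.
for iter.Scan(scanArgs...) {
    // write the current row to the target database here
}
if err := iter.Close(); err != nil {
    // the query can still fail mid-stream; handle the error here
}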
phonaputer
  • please don't do `select * from ...`; it's a very inefficient way to fetch data – Alex Ott Jun 04 '20 at 09:41
  • Do you mean due to the lack of a `WHERE` clause, or because the specific fields are not named? For the field names, he can add them. For the `WHERE` clause, he claimed in his post that he couldn't use one. – phonaputer Jun 04 '20 at 22:46
  • Yes, a query without WHERE harms performance. See the answer from @Aaron – Alex Ott Jun 05 '20 at 04:35