
I am using the cassandra-driver Node package (4.6.3). I am currently trying to migrate data from Cassandra to SQL. There are around 450,000 records. I tried ETL tools like Alteryx, but they don't work because of unsupported data types like map and list, so I am trying to do the migration with a Node.js scheduler. The problem is that the Cassandra driver does not stream through all the records. Below is the codebase.

let output = [], params = []
let query = `select * from users`
// Count every row streamed by the driver; rows are fetched in pages of 1000.
cassandraDriver.client.eachRow(query, params, { prepare: false, fetchSize: 1000 },
    function (n, row) {
        output.push(1)
    },
    function (err, result) {
        if (err) {
            // Without this check a paging error goes unnoticed and result is undefined.
            console.error('Streaming failed:', err)
        } else if (result.nextPage) {
            // More pages remain; ask the driver to fetch the next one.
            result.nextPage()
        } else if (output.length > 0) {
            console.log('Total size : ', output.length)
        }
    })

When I check the Cassandra DB with the query (select count(*) from users), the count is different from the output value I get in the above case, so there is a mismatch in the numbers. It looks like the driver does not stream all the rows that are present in Cassandra. Any idea why this is happening? Is it a problem with the package? I would love to have an alternative for the same. I'm really breaking my head trying to figure this out.

No idea why the Node driver gives out a random count. Multiples of the 1000 fetch size work fine, but the last batch, which is smaller than 1000, gets messed up. For example, if I have 9602 records in Cassandra, the records I get streamed through the Node Cassandra driver may be around 9588. Not sure why the last 14 records are not considered.


1 Answer


The problem may lie in the way you are counting the records. An unbounded SELECT COUNT(*) performs a full table scan and isn't necessarily going to be accurate as I've explained in this article -- Why COUNT() is bad in Cassandra.

The DataStax Bulk Loader (DSBulk) has a count command that efficiently counts the records in a table. For details, see the Counting with DSBulk blog post from Brian Hess.
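For reference, a count run with DSBulk looks something like this (the keyspace and table names here are just placeholders for your own):

dsbulk count -k my_keyspace -t users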

In addition, since you are migrating the data to SQL, DSBulk lets you export the data from a table to CSV or JSON format with the unload command. It is free to use for open-source Apache Cassandra users, and the DSBulk documentation includes examples to help you get started quickly.
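A minimal unload to CSV would look something like the following; the keyspace, table, and output directory are placeholders, and passing -c json switches the output format to JSON:

dsbulk unload -k my_keyspace -t users -url /path/to/export_dir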

Cheers!

Erick Ramirez
  • Any idea why the Node driver gives out a random count? Multiples of the 1000 fetch size work fine, but the last batch, which is smaller than 1000, gets messed up. For example, if I have 9602 records in Cassandra, the records I get streamed using the Node Cassandra driver may be around 9588. Not sure why the last 14 records are not considered. – Abhishek Aug 03 '21 at 07:15
  • As I've explained in my answer, it won't be accurate for the reasons I've outlined in the article I've linked. Cheers! – Erick Ramirez Aug 03 '21 at 12:20
  • 1
    I agree with you. But my problem is in a table with just 142 records (which I counted after migrating it to SQL) but when I change the fetch size from 50 to 25 I am able to stream through only 138 records. That is I can only migrate 138 records instead of 142 records, clearly a problem with the node cassandra driver. The output value is 138 instead of 142. – Abhishek Aug 03 '21 at 14:11