
I have a table in Cassandra that is not very large: only 50k rows. I need to stream all the rows from this table and index them in Elasticsearch.

I wrote a simple script in Node.js using the following method:

var myStream = CassandraService.cassandra_client.stream("select * from my_table");

Then I listened to data events, collected rows into a bulk request of 1000, paused the stream, indexed the batch into Elasticsearch, and resumed the stream.
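Roughly, the logic looks like this (a simplified sketch of my script; esClient, my_index, and row.id are placeholders for my actual Elasticsearch client, index name, and row key):

var BATCH_SIZE = 1000;
var batch = [];

var myStream = CassandraService.cassandra_client.stream("select * from my_table");

myStream.on("data", function (row) {
  batch.push(row);
  if (batch.length >= BATCH_SIZE) {
    myStream.pause();                    // stop reading while the batch is indexed
    indexBatch(batch).then(function () {
      batch = [];
      myStream.resume();                 // carry on streaming rows
    }).catch(function (err) {
      myStream.emit("error", err);
    });
  }
});

myStream.on("end", function () {
  if (batch.length > 0) indexBatch(batch);   // flush the last partial batch
});

myStream.on("error", function (err) {
  console.error("Streaming failed:", err);
});

// Build one Elasticsearch bulk request: an action line followed by the
// document source for every row.
function indexBatch(rows) {
  var body = [];
  rows.forEach(function (row) {
    body.push({ index: { _index: "my_index", _id: row.id } });
    body.push(row);
  });
  return esClient.bulk({ body: body });
}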

This was working fine for 1000-2000 rows. But now that the table has grown to 50,000 rows, I get a query timeout error while fetching from Cassandra:

Unhandled rejection ResponseError: Operation timed out - received only 0 responses.

So the process does not even start. What would be the recommended way to solve this problem?


1 Answer


If each row is quite large and you need to stream large volumes of data from Cassandra, it is better to reduce the page size (fetchSize). In the options argument, along with autoPage, pass fetchSize with a small value. For example:

{autoPage: true, fetchSize: 100}

By default, fetchSize is 5000, and this is what was causing the problem in my case. Since each row contained a lot of data, each page was too heavy and led to timeouts. Setting fetchSize to 100 solved the problem.
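Applied to the call from the question, it would look roughly like this (the empty array is the bind-parameters argument, which has to be present so the options object lands in the third position):

var myStream = CassandraService.cassandra_client.stream(
  "select * from my_table",
  [],                                   // no bind parameters
  { autoPage: true, fetchSize: 100 }    // fetch only 100 rows per page
);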
