1

I have to read 3 TB of production data from a Cassandra database.

I have implemented paging using the Java driver, but this technique uses an offset value, which means I am traversing the data all over again just to reach a particular row, and that traversal consumes heap memory, which is not good practice. I want to read the data without using lots of heap memory.

Ideally I want to fetch 10,000 rows in a batch and then read the next 10,000 without re-reading the first ten thousand rows.

High read latency is fine; my only problem is reading the data without consuming lots of heap memory...

Here is the relevant part of my code:

Statement select = QueryBuilder.select().all().from("demo", "emp");

And this is how I am paging:

List<Row> secondPageRows = cassandraPaging.fetchRowsWithPage(select, 100001, 25000);
printUser(secondPageRows);

Here 100001 is the starting row from which I want output and 25000 is the page size, so the code first has to walk past the first 100000 rows before it can print the 100001st. That walk is what causes the heap problem, and on top of that I don't want to have to read to the end of one page just to get the first record of the next page.

hooknc
  • Your question needs more details and better code formatting. You should include your code here to show what you have tried. – AmerllicA Oct 11 '18 at 12:17
  • @AngelHotxxx I have edited my question with the details. I hope you can help now. – surbhi bohra Oct 11 '18 at 12:33
  • Why not just do a session.execute on the `select` statement and iterate through the results? If you build your app to take an Iterator instead of a List, there will be no memory problems. You can still break it up however you want on your side (e.g. fill a list with 10,000 results for processing at a time), and by default the driver will fetch the rows in pages of 5,000. You can then test throughput changes by increasing that to 10,000, but that might actually end up hurting more than helping. Something like `session.execute(select).forEach(this::printUser)` (a sketch of this approach follows the comments below). – Chris Lohfink Oct 11 '18 at 13:38
  • @ChrisLohfink How can I implement it? – surbhi bohra Oct 12 '18 at 06:13
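
A minimal sketch of the approach from the comment above, assuming the DataStax Java driver 3.x (the contact point and processBatch are illustrative stand-ins; the keyspace "demo", table "emp", and the 10,000-row batch come from the question). Setting the fetch size makes the driver page through the result set transparently, so only the current page plus the working batch are ever referenced on the heap:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.querybuilder.QueryBuilder;
import java.util.ArrayList;
import java.util.List;

public class StreamingRead {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            Statement select = QueryBuilder.select().all().from("demo", "emp");
            select.setFetchSize(10000); // rows fetched per network round trip

            List<Row> batch = new ArrayList<>(10000);
            for (Row row : session.execute(select)) { // iterating fetches pages on demand
                batch.add(row);
                if (batch.size() == 10000) {
                    processBatch(batch); // hypothetical hook, stands in for printUser
                    batch.clear();       // earlier rows become garbage-collectable
                }
            }
            if (!batch.isEmpty()) {
                processBatch(batch); // last, partial batch
            }
        }
    }

    private static void processBatch(List<Row> rows) {
        rows.forEach(System.out::println);
    }
}

Because nothing holds a reference to earlier pages, heap use stays bounded no matter how big the table is.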

2 Answers

0

I can think of two possible solutions for this:

1) You need a better data model to handle this query. Remodel your table so that it serves such queries directly.

2) Use a Spark job to handle such requests. For this you should have a separate data center to serve these queries, so that you don't have to worry about heap memory on the operational cluster; a sketch follows below.
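
As a hedged sketch of option 2, assuming the DataStax spark-cassandra-connector's Java API (the Spark setup and contact point are assumptions; keyspace "demo" and table "emp" come from the question):

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class EmpScan {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("emp-scan")
                .set("spark.cassandra.connection.host", "127.0.0.1"); // assumed contact point
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // The connector splits the scan by token range across executors,
            // so no single JVM has to hold the whole 3 TB table in its heap.
            javaFunctions(sc)
                    .cassandraTable("demo", "emp")
                    .foreach(row -> System.out.println(row));
        }
    }
}

Pointing this job at the separate analytics data center, as suggested above, keeps the full scan from competing with production traffic.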

Mehul Gupta
  • Is batch processing possible for a read query? I don't have sequential data, so I can't use the partition key in a WHERE clause. Paging the data is not solving my problem, and this is a huge amount of data, so I can't remodel the production table. – surbhi bohra Oct 11 '18 at 11:37
  • I think partitioning is also available at the driver level, but I am not very sure about that; it may help with the heap memory. – Mehul Gupta Oct 11 '18 at 12:34
  • I can't find it on the internet. Can you suggest a link? – surbhi bohra Oct 11 '18 at 12:55
0

FYI, the document below could help, although I have never tried it myself.

https://docs.datastax.com/en/developer/java-driver/3.6/manual/paging/

Here the driver will take care of pagination.
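
In particular, the saved paging state described on that page addresses exactly the "read the next 10,000 without re-reading the first 10,000" requirement. A minimal sketch of that pattern, assuming driver 3.x ("demo" and "emp" come from the question): the driver hands back a PagingState with each page, and passing it into an equivalent statement resumes the scan where the previous page ended.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.querybuilder.QueryBuilder;

public class PagingStateDemo {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            Statement select = QueryBuilder.select().all().from("demo", "emp");
            select.setFetchSize(10000);

            // Read exactly one page, then remember where it ended.
            ResultSet rs = session.execute(select);
            PagingState state = rs.getExecutionInfo().getPagingState(); // null on the last page
            int remaining = rs.getAvailableWithoutFetching();
            for (Row row : rs) {
                System.out.println(row);
                if (--remaining == 0) break; // stop at the page boundary
            }

            // Later, even from a different process if the state string is
            // stored (state.toString() / PagingState.fromString), resume the
            // scan without touching the first page's rows again.
            Statement next = QueryBuilder.select().all().from("demo", "emp");
            next.setFetchSize(10000);
            next.setPagingState(state);
            for (Row row : session.execute(next)) {
                System.out.println(row);
            }
        }
    }
}

Note that a paging state is only valid for the exact same statement (same query string and parameters), and it cannot be used to jump to an arbitrary offset, which is why the offset-queries section of the same page has to emulate offsets client-side.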

Mehul Gupta
  • Read the `https://docs.datastax.com/en/developer/java-driver/3.6/manual/paging/#offset-queries` section on the same page. It says it doesn't support random jumps; it is only good for paging through the data sequentially. The random jump is exactly what is causing my heap problem. – surbhi bohra Oct 12 '18 at 05:57