I maintain a Cassandra cluster with two data centers, and I am going to add a new data center to that existing cluster. After rebuilding the data, how can I verify the consistency of the data in the new data center?
-
What do you mean by that? As in, whether the data is replicated correctly? – raam86 Aug 03 '17 at 12:56
-
How can I ensure that the data in the old DC has been replicated completely to the new DC? – Rishikesan Varudharajan Aug 03 '17 at 13:20
1 Answer
Reading with LOCAL_QUORUM from each DC and comparing the results would be the most straightforward approach.
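As a rough illustration of the read-and-compare approach, here is a minimal sketch in Python. It assumes the DataStax `cassandra-driver` package, and the names `read_dc`, `diff_rows`, the contact points, DC names, and table in the usage note are all hypothetical. It also loads whole result sets into memory, so it only suits small tables or sampled key ranges.

```python
def index_by_key(rows, key_columns):
    """Map each row's primary-key tuple to the full row (as a dict)."""
    return {tuple(row[c] for c in key_columns): dict(row) for row in rows}


def diff_rows(rows_a, rows_b, key_columns):
    """Compare two result sets and report keys that are missing or differ."""
    a = index_by_key(rows_a, key_columns)
    b = index_by_key(rows_b, key_columns)
    return {
        "missing_in_b": sorted(a.keys() - b.keys()),
        "missing_in_a": sorted(b.keys() - a.keys()),
        "mismatched": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


def read_dc(contact_points, local_dc, query):
    """Run `query` at LOCAL_QUORUM against coordinators in `local_dc`.

    Assumption: cassandra-driver is installed. It is imported lazily so the
    comparison helpers above can be used and tested without it.
    """
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import DCAwareRoundRobinPolicy
    from cassandra.query import dict_factory

    # LOCAL_QUORUM is relative to the coordinator's DC, so pin the
    # load-balancing policy to the DC being checked.
    profile = ExecutionProfile(
        load_balancing_policy=DCAwareRoundRobinPolicy(local_dc=local_dc),
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
        row_factory=dict_factory,
    )
    cluster = Cluster(contact_points,
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    try:
        session = cluster.connect()
        return list(session.execute(query))
    finally:
        cluster.shutdown()
```

Usage would look something like `diff_rows(read_dc(["10.0.1.1"], "DC1", q), read_dc(["10.0.3.1"], "DC3", q), ["id"])`, where the addresses, DC names, and query `q` are placeholders for your own cluster.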
A repair builds hashes of partitions from the SSTables in a compaction task and compares them range by range, which is more efficient than reading the data individually. You could pull that part out of the code to build a tool that does the same thing, or you can just run a full (not incremental) repair, which logs the differences it finds.
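To show why comparing hashes of ranges is cheaper than comparing rows, here is a simplified sketch in Python. It is not Cassandra's repair code: real repair builds Merkle trees over token ranges, while this uses a flat set of hash buckets, and the names `bucket_digests` and `differing_buckets` are made up for illustration.

```python
import hashlib
from collections import defaultdict


def bucket_digests(rows, key_column, buckets=16):
    """Group rows into buckets by a hash of their key and digest each bucket.

    Each side only needs to ship `buckets` digests instead of every row.
    """
    groups = defaultdict(list)
    for row in rows:
        key_hash = hashlib.sha256(repr(row[key_column]).encode()).digest()
        bucket = int.from_bytes(key_hash[:8], "big") % buckets
        # Canonical encoding of the row: columns sorted by name.
        groups[bucket].append(repr(sorted(row.items())))

    digests = {}
    for bucket, encoded_rows in groups.items():
        h = hashlib.sha256()
        for item in sorted(encoded_rows):  # order-independent within a bucket
            h.update(item.encode())
            h.update(b"\0")  # separator so row boundaries stay unambiguous
        digests[bucket] = h.hexdigest()
    return digests


def differing_buckets(digests_a, digests_b):
    """Return the buckets whose digests differ or exist on only one side."""
    all_buckets = digests_a.keys() | digests_b.keys()
    return sorted(b for b in all_buckets
                  if digests_a.get(b) != digests_b.get(b))
```

Only the buckets reported by `differing_buckets` would then need a row-by-row comparison. If you would rather let Cassandra do this itself, a full repair is requested with `nodetool repair -full`; check the documentation for your version before running it on a production cluster.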

Chris Lohfink
-
Both suggestions are interesting. I guess the first one depends on the size of the data set; the second one sounds like a fun project. – raam86 Aug 03 '17 at 14:45
-
Running a full repair will be an IO-intensive task. Any other suggestions? I have heard we could run a Spark job to do this. Any idea about that? – Rishikesan Varudharajan Aug 04 '17 at 10:34
-
A Spark job would read all the data as well. The difference is that after reading all the data, the repair job only sends a Merkle tree (hashes) of the data to be compared, while Spark streams all the data over to be compared. But if you want to know the specifics, a Spark job or a script that reads at LOCAL_QUORUM will give you more detail. – Chris Lohfink Aug 04 '17 at 16:16