cassandra stress testing distribution of writes

Question

How do I build a test that tells me which Cassandra nodes are being written to? I want to specify the number of nodes and the replication factor, and get back which nodes are affected by each attempted insert. This will tell me how evenly the data would be distributed at runtime. I have test data, so what I really need is a way to call a mock Cassandra, configured the way I would run it in production, that returns which nodes are affected by each write. I don't see a way to do that with the Cassandra stress tool, unless I am completely missing it...

score 1 · Answer 1 · answered Jul 15 '15 at 15:28

Since you are interested in knowing all nodes that were impacted by a query, I would recommend looking into tracing.

Here are a few approaches you could take:

Use cassandra-stress and enable tracing by running nodetool settraceprobability on each of your C* nodes with a low value like .01. This enables tracing on 1% of your queries, and you can then inspect the trace results in the system_traces.events and system_traces.sessions tables (see this article for more information on how to use these tables). The trace includes information such as which node acted as the coordinator, which other nodes were used as replicas for reads/writes, and how long each individual step took. Note that the way your application ends up querying data may differ slightly from cassandra-stress, since which nodes are queried is influenced by your Cluster configuration. cassandra-stress uses JavaDriverClient#connect. You will want to compare your configuration with what JavaDriverClient is doing and understand the differences. You could also modify JavaDriverClient to match your application.
You may also want to write a test against your own application that uses Cassandra. The java-driver has an API for enabling tracing and observing the trace data, which I've documented in a video here. Additionally, when you get a ResultSet back, its getExecutionInfo() method provides information such as which hosts were tried, but this only covers nodes that were used as a coordinator, not all the replicas.
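Once trace rows are available, checking how evenly nodes are hit comes down to a tally over the session_id and source columns of system_traces.events. The following is a minimal sketch of that tally in plain Java, not something taken from cassandra-stress or the driver. The TraceTally and TraceEvent names and the sample rows are hypothetical, and it assumes you have already read the rows out of the table yourself.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: tallies how many traced requests touched each node.
class TraceTally {

    // One row of system_traces.events, reduced to the two columns used here.
    static final class TraceEvent {
        final String sessionId; // session_id: one value per traced request
        final String source;    // source: address of the node that logged the event

        TraceEvent(String sessionId, String source) {
            this.sessionId = sessionId;
            this.source = source;
        }
    }

    // For each node, counts the distinct trace sessions in which that node
    // logged at least one event. A node that logs many events for the same
    // request is still counted once for that request.
    static Map<String, Integer> sessionsPerNode(List<TraceEvent> events) {
        Map<String, Set<String>> sessionsBySource = new TreeMap<>();
        for (TraceEvent e : events) {
            sessionsBySource
                .computeIfAbsent(e.source, k -> new HashSet<>())
                .add(e.sessionId);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : sessionsBySource.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for a query against system_traces.events.
        List<TraceEvent> events = new ArrayList<>();
        events.add(new TraceEvent("s1", "10.0.0.1"));
        events.add(new TraceEvent("s1", "10.0.0.2"));
        events.add(new TraceEvent("s2", "10.0.0.2"));
        events.add(new TraceEvent("s2", "10.0.0.3"));
        events.add(new TraceEvent("s3", "10.0.0.1"));
        events.add(new TraceEvent("s3", "10.0.0.3"));

        // Print one line per node, sorted by address.
        sessionsPerNode(events).forEach(
            (node, n) -> System.out.println(node + " took part in " + n + " traced requests"));
    }
}
```

This counts every node that logged an event for a request, the coordinator included, so it measures participation rather than replica writes alone. To isolate replica writes you would likely need to filter on the activity column as well.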

The issue I am having with the stress tool is that I don't understand how to feed it my own dataset. Thank you for posting the video, by the way; it's very helpful. I think tracing will go a long way toward figuring this out. Thanks again! — Alex, Jul 15 '15 at 17:57
Thanks! The new stress tool introduced in 2.1 has a lot of capabilities for running stress against custom schemas. I think some of the formatting has changed a bit since this blog post, but it would be a good starting point if you haven't seen it: http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema — Andy Tolbert, Jul 15 '15 at 18:29
