
I have a Cassandra cluster that only uses one node (because I only have one server and want to do a comparison). I have a time-series table that is 43 GB, and every query I run is very slow. My question is: why is 43 GB too much for one node in a cluster with only one node, when 43 GB on one node in a cluster with more nodes would be OK?

Does Cassandra use the RAM and CPU of every server in the cluster, even when a query only needs one node? That's my idea, but I am not sure...

I hope somebody is able to help here,

Thank you!

Edit: My table:

CREATE TABLE table(
  num int,
  part_key int,
  val1 int, val2 float, val3 text, ...,
  PRIMARY KEY((part_key), num)
);

num is the number of the record. There are 300-400 values per record and around 10,000,000 records. Right now the database is ca. 60 GB (the 43 GB figure was from yesterday) and even the INSERT queries time out. If I set the time-out higher, the server service crashes.

Friedrich
  • I think I loaded that much into one cluster, but it wasn't anything production-grade, more an internal utility. It really depends on your schema and queries. Can you add them too? – Aleksandar Stojadinovic Jan 21 '15 at 08:02
  • possible duplicate of [Cassandra Database overwhelmed?](http://stackoverflow.com/questions/28024187/cassandra-database-overwhelmed) – RussS Jan 21 '15 at 16:40
  • @Friedrich please avoid publishing the same question twice! If you're worried about exposure, there are other ways to increase the exposure of your posted questions, like adding more relevant tags (the C* version, for example), updating your question with more relevant information, etc. – Nir Alfasi Jan 21 '15 at 18:03
  • I am kinda new here and did not know how people would react if I asked new questions in an existing post, even if it is the same situation. But thanks for your effort! – Friedrich Jan 22 '15 at 07:43

1 Answer


why is 43 GB too much for one node in a cluster with only one node

43 GB is not much for one node in a C* cluster (even if the cluster contains only one node). As an example, we have clusters at Netflix that contain nodes with 800 GB (per node) or even more!

There is probably another reason for the slowness of your queries, and one guess would be that you have one (or more) very large rows, which is an Achilles' heel for Cassandra. Another thing that you should check is the read/write pattern that you're using: since C* is eventually consistent, if you try to perform read-modify-write you'll get poor results.
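To illustrate the large-rows point: with PRIMARY KEY((part_key), num), every record that shares a part_key lands in the same partition, so a low-cardinality part_key can produce enormous partitions. A common way around this is to add a range-based bucket to the partition key. The sketch below uses the DataStax Java driver; the table name, keyspace, bucket column and bucket size are made up for the example, not taken from the original posts:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class BucketedWrites {
    // Hypothetical bucket size: small enough that no single partition grows
    // into the multi-hundred-MB range that causes heap/GC pain.
    private static final int BUCKET_SIZE = 100000;

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo_ks"); // keyspace name is assumed

        // Same shape as the question's table, plus a bucket in the partition key,
        // so a single part_key no longer maps to one giant partition.
        session.execute(
            "CREATE TABLE IF NOT EXISTS timeseries_bucketed (" +
            "  part_key int, bucket int, num int," +
            "  val1 int, val2 float, val3 text," +
            "  PRIMARY KEY ((part_key, bucket), num))");

        int partKey = 1;
        int num = 4200000;
        int bucket = num / BUCKET_SIZE; // derive the bucket from the record number

        session.execute(
            "INSERT INTO timeseries_bucketed (part_key, bucket, num, val1, val2, val3) " +
            "VALUES (?, ?, ?, ?, ?, ?)",
            partKey, bucket, num, 1, 2.5f, "example");

        cluster.close();
    }
}

The trade-off is that reads then need to specify both part_key and bucket (or iterate over buckets), which is the usual price of this pattern.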

Further, you should make sure the C* heap size is tuned to your application's requirements.

Another option is that you're running into the following performance issue (the author also published that question here on SO, and it's a good use of your time to read the answers).

There could be other options as well, but in order to dig further you should provide more details about what you're doing: C* version, CF structure, how you insert (code), etc.

Does Cassandra use RAM and CPU of every server in the cluster, even when a query only needs one node?

CPU and RAM are not shared across the cluster. Assuming that all the data required to execute your query exists on one node, the query will pass through (at most) two nodes: the coordinator (the node that received the query), which will forward it in one hop to the node that holds the data. If you use a token-aware strategy, your query will go directly to the node that holds the data. You can read more about it in the DataStax documentation.
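For completeness, here is a minimal sketch of enabling token-aware routing with the DataStax Java driver (2.x-era API); the contact point and the query are placeholders, and with a single node it changes nothing, but it shows where the policy is plugged in:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareClient {
    public static void main(String[] args) {
        // Wrapping the default policy in TokenAwarePolicy lets the driver route
        // each statement directly to a replica owning the partition key, skipping
        // the extra coordinator hop described above.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
                .build();
        Session session = cluster.connect();

        // Placeholder query; token-aware routing only kicks in when the partition
        // key is known to the driver (e.g. bound via a prepared statement).
        session.execute("SELECT release_version FROM system.local");

        cluster.close();
    }
}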

Nir Alfasi
  • I really do have large rows... How come that is a problem in Cassandra? When testing I only execute read queries. After I do these queries I read again to prepare the table for the next test. I guess that should not be a problem. I would like to hear your answer to the first question. – Friedrich Jan 21 '15 at 08:33
  • CPU and RAM are not shared across the cluster. See update to my answer. – Nir Alfasi Jan 21 '15 at 08:48
  • Ok, that answers my question about RAM and CPU. I will read the article about the token-aware strategy, but I only have one node. Isn't the coordinator node the same as the node that holds the data? With the first question I meant: why are large rows a problem? Can you please answer that? Thank you for your help! – Friedrich Jan 21 '15 at 08:59
  • Wide rows have the following two major cons: 1. (not relevant to your case) in a cluster that contains a few nodes, a query to fetch a wide row will keep hitting the same node, causing heap pressure. 2. Though a row can be wider than fits in memory, dealing with large rows results in more GC ("stop the world" events), which might introduce slowness into your application – Nir Alfasi Jan 21 '15 at 16:36