1

Here is the situation:

I am trying to fetch around 10k keys from a CF.
Size of cluster: 10 nodes
Data per node: 250 GB
Heap allotted: 12 GB
Snitch used: PropertyFileSnitch with 2 racks in the same data center
No. of SSTables for the CF per node: around 8 to 10

I am using a supercolumn approach. Each row contains around 300 supercolumns, each of which contains 5-10 columns. I am firing a multiget with 10k row keys and 1 supercolumn.

When I fire the call the 1st time, it takes around 30 to 50 secs to return the result. After that, Cassandra serves the data from the key cache and returns the result in 2-4 secs.

So Cassandra read performance is hampering our project. I am using phpcassa. Is there any way I can tweak the Cassandra servers so that I can get results faster?

Does the supercolumn approach affect read performance?

MANISH ZOPE
  • Can you state your data model, and the read and write pattern against it? Because it would make more sense to comment then. – Tamil Jul 28 '12 at 07:19

3 Answers

1

Use of super columns is best suited for use cases where the number of sub-columns is a relatively small number. Read more here: http://www.datastax.com/docs/0.8/ddl/column_family

divaka
  • Supercolumns in our case contain just 5 columns. I think that's really a small number. – MANISH ZOPE May 25 '12 at 10:48
  • I think your problem is that you are trying to get a lot of data - 10k rows at once. And 10k rows x 300 supercolumns x 10 columns is not a small number at all. What I would propose is to fetch 10 batches of 1000 rows each and see if that is faster. – divaka May 25 '12 at 11:21
0

Just in case you haven't done this already, since you're using phpcassa library, make sure that you've compiled the Thrift library. Per the "INSTALLING" text file in the phpcassa library folder:

Using the C Extension

The C extension is crucial for phpcassa's performance.

You need to run configure and make to be able to use the C extension:

cd thrift/ext/thrift_protocol
phpize
./configure
make
sudo make install

Add the following line to your php.ini file:

extension=thrift_protocol.so
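Once installed, you can verify that PHP actually picked up the extension (a quick environment check, assuming `php` is on your PATH):

```shell
# List loaded extensions and look for the Thrift one
php -m | grep thrift_protocol

# Or check programmatically from the CLI
php -r 'var_dump(extension_loaded("thrift_protocol"));'
```

If the second command prints `bool(false)`, phpcassa will silently fall back to the pure-PHP Thrift implementation, which is much slower.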
codemonkey
0

After doing much R&D on this stuff, we figured there is no way you can get this working optimally. When Cassandra fetches these 10k rows the 1st time, it is going to take time, and there is no way to optimize that cold read.

1) In practice, however, the probability of people accessing the same records is high, so we take maximum advantage of the key cache. The default setting for the key cache is 2 MB, so we can afford to increase it to 128 MB with no memory problems. After loading data, run the expected queries to warm up the key cache.
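As an illustration of where those knobs live (assuming Cassandra 1.1+, where the key cache is a global setting in `cassandra.yaml`; earlier versions configured `keys_cached` per CF instead):

```shell
# cassandra.yaml fragment (illustrative, not a drop-in config):
#   key_cache_size_in_mb: 128
#   key_cache_save_period: 14400

# After warming the cache with your expected queries,
# check the key cache size and hit rate:
nodetool -h localhost info
```

A hit rate well below 1.0 after warm-up suggests the cache is still too small for your working set.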

2) The JVM works optimally with an 8-10 GB heap. (We don't have numbers to prove it; it's just an observation.)

3) Most important: if you are using physical machines (not cloud or virtual machines), check which disk scheduler you are using. Set it to NOOP, which is good for Cassandra as it reads all keys from one section, reducing disk head movement.
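On Linux the scheduler can be inspected and switched per block device (a sketch; `sda` is an assumption for whichever disk holds your Cassandra data directory, and the change below lasts only until reboot):

```shell
# Show available schedulers; the active one is in [brackets]
cat /sys/block/sda/queue/scheduler

# Switch to noop for this boot (requires root)
echo noop | sudo tee /sys/block/sda/queue/scheduler
```

To make it permanent, add `elevator=noop` to the kernel boot parameters.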

The above changes helped bring the query time down within acceptable limits.

Along with the above changes, if you have CFs which are small in size but frequently accessed, enable row caching for them.
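For example, via cassandra-cli (syntax for the Cassandra 1.1 era; `MyKeyspace` and `MyCF` are placeholder names):

```shell
# Cache whole rows for a small, hot CF
cassandra-cli -h localhost -e "use MyKeyspace; update column family MyCF with caching = 'rows_only';"
```

Only do this for CFs whose rows are small; caching wide rows can evict more useful entries and pressure the heap.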

Hope the above info is useful.

MANISH ZOPE