
I need a table on Cassandra2 with 56K columns of 1 byte each, for benchmark purposes.

I'm trying to create a "usertable" with this requirement like this:

create table usertable (
    y_id varchar primary key,
    field0 varchar,
    field1 varchar,
    field2 varchar,
     ...
     ...
    field55999 varchar,
    field56000 varchar);

When I try to execute this from a file using cqlsh, it runs forever without responding and allocates a lot of memory.

Is there a better way to accomplish that?

vschettino

1 Answer


Try putting your CREATE TABLE statement in a flat file (schema.cql, for example) and then executing `cqlsh -f schema.cql`.
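
Since typing 56k column definitions by hand isn't practical, a minimal sketch for generating that file could look like this (plain Python; the file name schema.cql matches the command above, and the field0 … field56000 range is taken from your question):

# generate_schema.py -- writes the CREATE TABLE statement for
# field0 .. field56000 into schema.cql, so you don't type it by hand.
# Afterwards run: cqlsh -f schema.cql
NUM_FIELDS = 56001  # field0 .. field56000, as in the question

with open("schema.cql", "w") as f:
    f.write("create table usertable (\n")
    f.write("    y_id varchar primary key,\n")
    columns = ",\n".join("    field{} varchar".format(i) for i in range(NUM_FIELDS))
    f.write(columns + ");\n")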

By the way, 56k columns is HUGE, and no sane developer would ever create a table with more than 1k columns... What are you trying to test and assert with this scenario?

--- Answer to 1st comment ---

Schema is all about metadata, because the raw data are written as byte[] on disk anyway. The more columns you have in a table, the bigger the metadata will be in memory.

> So while retrieving, I will pass the specific column name in the select query (keeping performance in mind), so it won't retrieve all the columns

It's not that simple. All 56k columns are stored contiguously on disk. When reading data, Cassandra has index structures to skip over partition keys and clustering columns. For regular columns, as in your case, there is no index to reach the exact column requested by the client. So, for example, if you're doing a `SELECT field1293 FROM usertable WHERE y_id = xxx`, Cassandra will need to scan the whole block from field1 up to field56000 in memory before picking the right column, which is horribly inefficient.

--- Answer to Nth comment ---

> I do agree it would become very slow/inefficient, but I need to achieve this scenario to simulate genotype data.

I recommend trying and testing this schema:

create table usertable (
    y_id varchar,
    field_index int,
    field_value varchar,
    PRIMARY KEY(y_id, field_index)
);

// INSERT/UPDATE data into field N
INSERT INTO usertable(y_id, field_index, field_value)
VALUES('xxx', N, 'fieldN value');

// DELETE field N
DELETE FROM usertable WHERE y_id='xxx' AND field_index=N;

// Read EXACTLY field N
SELECT field_value FROM usertable WHERE y_id='xxx' AND field_index=N;

// Read fields N to M, N <= M
SELECT field_value FROM usertable WHERE y_id='xxx'
AND field_index >= N
AND field_index <= M;

You'll see that it works way better.
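
If you'd rather drive this from code than from cqlsh, here is a minimal sketch assuming the DataStax Python driver (cassandra-driver); the contact point 127.0.0.1, the keyspace name bench and the user id user1 are placeholders for your own setup:

# wide_row_demo.py -- minimal sketch exercising the (y_id, field_index) schema.
# Assumes a local Cassandra node and an existing keyspace named "bench";
# both are placeholders, adjust them to your environment.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("bench")

insert = session.prepare(
    "INSERT INTO usertable (y_id, field_index, field_value) VALUES (?, ?, ?)")
select_one = session.prepare(
    "SELECT field_value FROM usertable WHERE y_id = ? AND field_index = ?")
select_range = session.prepare(
    "SELECT field_index, field_value FROM usertable "
    "WHERE y_id = ? AND field_index >= ? AND field_index <= ?")

# Write 56001 one-byte values for a single user (one clustering row per field)
for i in range(56001):
    session.execute(insert, ("user1", i, "x"))

# Read exactly one field -- Cassandra can seek to it via the clustering column
print(session.execute(select_one, ("user1", 1293)).one().field_value)

# Read a slice of fields
for row in session.execute(select_range, ("user1", 100, 110)):
    print(row.field_index, row.field_value)

cluster.shutdown()

Each field becomes its own clustering row here, so Cassandra can seek straight to field_index = N through the partition index instead of scanning the whole partition.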

doanduyhai
  • What are the penalties if I do so, since 2 billion cells per partition are supported by Cassandra? While retrieving, I will pass the specific column name in the select query (keeping performance in mind), so it won't retrieve all the columns. So what will be the impact or symptoms of this type of ugly schema? In other words, what are the impacts of a too-wide row? – Jaya Ananthram Apr 05 '16 at 09:57
  • So let us consider the query `SELECT field56000 FROM usertable WHERE y_id = xxx`. If I execute this query, then 56000 columns will be loaded into memory and a sequential scan will be performed until it reaches the field56000 column name. Is this exactly what you meant? Correct me if I am wrong. – Jaya Ananthram Apr 05 '16 at 10:34
  • Yes it is. To be more accurate, Cassandra will fetch your CQL row data in blocks of 64kb into memory and iterate over all the successive blocks until `field56000` is found. In your example, it must scan the whole partition. If `field00001` is requested, it will be much faster – doanduyhai Apr 05 '16 at 10:50
  • If you were using **clustering columns** in your schema, Cassandra would leverage the partition index to skip blocks of data, reach the closest block to the requested column, and then start scanning sequentially from that block, which is much more optimized – doanduyhai Apr 05 '16 at 10:51
  • I tried `cqlsh -f schema.cql`, but got the same result. I do agree it would become very slow/inefficient, but I need to achieve this scenario to simulate genotype data. – vschettino Apr 05 '16 at 14:36
  • @doanduyhai, this approach wouldn't allow me to use YCSB. – vschettino Apr 06 '16 at 13:13
  • **It's YCSB's problem that it cannot work with such a basic data model**, not a Cassandra data modeling issue. You don't force a data model so it can work with a tool. You take the right data model for the job, and if the tool can't manage it, change the tool – doanduyhai Apr 06 '16 at 13:16
  • If you want a better benchmarking tool that works with all Cassandra data models, I recommend Gatling (http://gatling.io/#/) with its free CQL plugin (https://github.com/Mishail/GatlingCql) – doanduyhai Apr 06 '16 at 13:20