
I have about 20-30 columns that I would need to store in my column family in total. However, my data comes in different variations: I have different objects that belong together logically but do not share the same fields (fields as in key names). Sometimes 5 fields are provided, sometimes 7, and so on. All of them do share a set of fields that is always provided, though.

A row I insert into this column family will never have all of the columns filled. With a Map, I could add key/value pairs depending on the object type and would not have the possible overhead introduced by the other model.
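To make this concrete, here is a rough sketch of the two layouts I am considering (all table and column names are made up):

    -- Option A: one wide table; most of the optional columns stay empty in any given row
    CREATE TABLE objects_wide (
        id        uuid PRIMARY KEY,
        obj_type  text,    -- always provided
        name      text,    -- always provided
        field_a   text,    -- only for some object types
        field_b   float,   -- only for some object types
        field_c   int      -- only for some object types
        -- ... up to ~20-30 optional columns
    );

    -- Option B: the shared columns plus one map for the type-specific fields
    CREATE TABLE objects_map (
        id        uuid PRIMARY KEY,
        obj_type  text,
        name      text,
        attrs     map<text, text>   -- all values stored as text?
    );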

I am concerned about having a lot of empty columns in each row.

A possible downside of using a Map is that you can't have an index on the map keys and an index on the map values at the same time.

My questions:

  1. Would you suggest using a Map, or should I just add all of the columns I may need to my column family?
  2. I assume that querying the data based on keys/values in the Map is much slower than accessing them "directly" as columns. Is this correct?
  3. What downsides are there when I have a lot of empty columns in each row? Overhead?
  4. Is it possible to have a "generic" value type when using a Map? I want to store different kinds of data, mostly Strings but also Floats and Integers. Do I need to use a map<text,text> and cast the values within my application (see the sketch below)?
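For question 4, this is roughly what I mean: in the map variant, every value would have to be stored as text and converted back in the application (again, made-up names and values):

    -- Option A: only the columns that apply to this object type are set
    INSERT INTO objects_wide (id, obj_type, name, field_b)
    VALUES (uuid(), 'sensor', 'kitchen', 21.5);

    -- Option B: floats/ints have to be stored as strings and cast in the application
    INSERT INTO objects_map (id, obj_type, name, attrs)
    VALUES (uuid(), 'sensor', 'kitchen', {'field_b': '21.5'});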

I am using Cassandra 3.0.8 | CQL spec 3.4.0 | Native protocol v4

Thanks

j9dy

2 Answers


I think that having sparse column values is totally fine, since that's one of the reasons BigTable and all the related solutions implementing the same sparse-map data model were created.

I would be more concerned about the limitations on the use of CQL collections instead, as pointed out in another S.O. answer here.

Regarding your specific questions:

  • I would personally use plain columns.
  • It depends on the access pattern. Do you need all the columns in the map? If not, be aware that Cassandra will retrieve the collection as a whole, so you will get all the data even if you don't need it (see the sketch after this list).
  • I don't see any overhead here: data is stored contiguously, skipping the empty columns.
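A rough sketch of what I mean about the access pattern, reusing the made-up tables from the question (the ? is just a bind marker):

    -- Plain columns: you can read exactly the fields you need
    SELECT name, field_b FROM objects_wide WHERE id = ?;

    -- Map: the whole collection is always returned
    SELECT attrs FROM objects_map WHERE id = ?;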

Anyway, you can find some info about Cassandra's limits here. It's an old page, but I assume you can use those numbers as lower bounds for the current values.

Hope it helps.

riccamini
  • Using plain columns would force me into a fixed set of columns, right? With the collection approach, I'd be more flexible when new fields are added. Is it even possible to add more columns to a schema when it already holds xx GB of data? – j9dy Aug 08 '16 at 09:20
  • Using plain columns, I think you should define your schema with the maximum number of columns you may have. There is no downside to this: if a row does not have a particular column, it won't waste any space. And you can still add new columns to the schema later if you need to. As pointed out in the documentation [here](https://docs.datastax.com/en/cql/3.1/cql/cql_reference/alter_table_r.html), adding a column to the schema does not validate existing data. Using a collection you will keep your schema shorter, but you will have the downside pointed out in the answer. – riccamini Aug 16 '16 at 06:34
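For reference, the kind of schema change discussed in the comment above looks like this (using the made-up table name from the question's sketch):

    -- adds a new, initially empty column; existing rows are not rewritten
    ALTER TABLE objects_wide ADD field_d text;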

Actually, Map, Set and List are just CQL syntax on top of the old Cassandra data structures, and a map is stored as a regular wide row.
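As a rough illustration (reusing the made-up objects_map table from the question), a row such as

    INSERT INTO objects_map (id, obj_type, name, attrs)
    VALUES (uuid(), 'sensor', 'kitchen', {'field_a': 'x', 'field_b': '21.5'});

ends up as a single wide partition in which each map entry becomes its own internal cell addressed by the map key (conceptually attrs:'field_a', attrs:'field_b'), i.e. the same kind of wide row you could build by hand.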

Here are several slides about mapping CQL types.

Mikhail Baksheev