What is the purpose of dividing rows into column families if rows can have different numbers/types of columns anyway?

Question

Given that a column family can hold rows with arbitrary structure, we could store all rows in a single "store" (avoiding the name 'columnfamily/table' on purpose). What is the purpose of column families, then?

The partitioner not only assigns tokens to row keys but also to nodes. You may want to have a look at consistent hashing. – manuzhang Jan 22 '13 at 06:53

NG Algo · Answer 1 · 2013-01-19T00:36:09.243

The simplest reason is evident in the name itself: "Column Family". A Column Family groups a set of related columns together. You could think of it as a namespace containing related columns.

For example, the Column "Name" by itself lacks context, which can be provided by Column Families like "Employees" or "Cities". Otherwise, each Column would need to carry all of its context by itself, with no concept of related Columns.
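
As a rough illustration of the namespace idea, here is a toy in-memory sketch in Python. It is not Cassandra's API; the `keyspace` dictionary, the `get_column` helper, and the sample row keys and values are all hypothetical and exist only to show how the same column name can mean different things in different column families.

```python
# Toy in-memory model (an assumption for illustration, not Cassandra's API):
# column family name -> row key -> {column name: value}.
keyspace = {
    "Employees": {"emp42": {"Name": "Alice", "Dept": "Ops"}},
    "Cities": {"city7": {"Name": "Oslo"}},
}


def get_column(cf_name, row_key, column):
    """Look up one column value inside a named column family."""
    return keyspace[cf_name][row_key][column]


# The column "Name" gets its meaning from the column family it lives in.
employee_name = get_column("Employees", "emp42", "Name")
city_name = get_column("Cities", "city7", "Name")
```

Without the outer level, the two "Name" columns would need distinct names (or distinct row-key prefixes) to tell them apart.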

edited Jan 19 '13 at 00:36

answered Jan 18 '13 at 22:54

NG Algo


I could argue that the context of the column is the row rather than a column family. In relational databases, collections of tuples with the same structure are grouped into tables. Here each row can have its own structure, so from the storage point of view there isn't much difference whether I have one column family or many (as long as I can insert anything in a column family). Timeseries data is just a very long row, with each column representing a time event. I don't see the reason why I can't put that into a single "store". – Eugen Jan 18 '13 at 23:47
Having said that, I understand that for each given application there is a data model, so having 'name scopes' to represent different concepts is useful for organizing your model. My interest is to find out what implications the concept has for physical storage. – Eugen Jan 18 '13 at 23:52
I think the logical groupings of related Columns in a CF to represent an Entity is pretty significant even from a storage perspective. Data represented in the logical entity (CF) is frequently accessed together. This allows you to physically store frequently accessed entities on faster storage and less frequently accessed entities on slower physical storage. – NG Algo Jan 19 '13 at 00:33
Storage placement depends on the machines that constitute your cluster and partitioning scheme, not on your CFs. – Eugen Jan 19 '13 at 10:01
Actually, per the Cassandra documentation, within a single node in a cluster, data for different CFs is stored in separate files. I assume that was the intent of the question of why CFs play a role in data storage. – NG Algo Jan 21 '13 at 22:32
You got my point. Since there is no schema for rows, the purpose of dividing them into CFs is less clear than in RDBs. Having a separate file for each CF is one reason to justify their existence -> CFs as a speed-up mechanism for accessing related data by storing it in one file. The other comments so far don't get the point of my question. I wonder why the question is confusing for other readers. – Eugen Jan 22 '13 at 08:57

score 1 · Answer 2 · answered Jan 20 '13 at 20:44

Atomicity

In Cassandra 1.1 and below, the only atomic guarantee you have is that writes to the same row (i.e. with the same key) will be atomic.

Thus, you need to think very carefully about what you want in your columns, and which row those columns should be in, so that your application behaves appropriately if a write fails.

Sarge


Although your answer is valid, it doesn't answer the question of why one would divide rows into multiple column families, which is the point of this question. – Eugen Jan 21 '13 at 07:32

Eugen · Accepted Answer · 2013-07-20T21:57:00.607

Reasons:

To have a different sort order for the columns within a row. The comparator is specified at column family creation time and can't be changed afterwards. So if you have some rows whose columns must be sorted alphabetically and others whose columns must be sorted numerically, you have to create different column families.
To customize the storage options that can be set on a per column family basis, e.g. caching of rows, compaction, deletion of expired columns, etc. Per column family storage options can be found here
You can't mix counter and non-counter columns in the same column family.
As mentioned in other answers, logical cohesion: columns represent attributes of some entity identified by the row id.
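
The first reason can be sketched with a toy in-memory model in Python. This is not Cassandra's API: the `ColumnFamily` class, its `comparator` argument (modelled here as a plain sort-key function), and the sample column names are assumptions made purely for illustration. The point is only that the ordering is fixed once per column family, so rows needing a different column order must live in a different column family.

```python
class ColumnFamily:
    """Toy stand-in for a column family (hypothetical, not Cassandra's API)."""

    def __init__(self, name, comparator):
        self.name = name
        # Sort-key function for column names, fixed at creation time.
        self._comparator = comparator
        self._rows = {}

    def insert(self, row_key, column_name, value):
        # Rows are schemaless: any column name can be added to any row.
        self._rows.setdefault(row_key, {})[column_name] = value

    def get_row(self, row_key):
        # Columns always come back in this column family's own order.
        row = self._rows.get(row_key, {})
        return [(name, row[name]) for name in sorted(row, key=self._comparator)]


# Two column families holding the same column names, ordered differently.
by_text = ColumnFamily("ByText", comparator=str)      # alphabetical order
by_number = ColumnFamily("ByNumber", comparator=int)  # numeric order

for cf in (by_text, by_number):
    for col in ("10", "9", "100"):
        cf.insert("row1", col, "x")

text_order = [name for name, _ in by_text.get_row("row1")]
number_order = [name for name, _ in by_number.get_row("row1")]
print(text_order)    # ['10', '100', '9']
print(number_order)  # ['9', '10', '100']
```

In this sketch there is deliberately no way to change the comparator after construction, mirroring the restriction described above.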
