Like Ryan mentioned, how you access your data has a lot to do with it. Since Ryan covered the MongoDB side (which I know little about), I'll give the HBase side of things a shot.
For starters, I suggest you read the BigTable paper, since HBase was heavily influenced by its design. This video also has some good details on HBase's design elements. And if you're more interested in ZooKeeper, try reading the Chubby paper too.
Things to consider for HBase:
Indexing rows: The way rows are "indexed" in HBase (or in Cassandra with the ordered partitioner) is both its blessing and its curse. I believe MongoDB uses a B+Tree for its indexes (correct me if I'm wrong), whereas HBase simply stores rows in sorted key order. This approach is good for MapReduce jobs and sequential reads. For MapReduce jobs it gives you data locality on the region servers that run the tasks. It helps sequential reads because the disk controllers can read sequential sectors on disk while doing a "scan" over keys. The curse is that the data is stored in order... so if you don't design your row keys well, you end up with "hot" nodes. For example, if you simply used a timestamp as a row key, you could end up with one node taking all the writes while your other nodes sit idle. Designing your row keys in HBase is therefore very important. This video on OpenTSDB goes into some good detail about how they use HBase.
Another advantage of column-oriented databases is that they can use column compression instead of row compression. Since the entropy of a column is normally much lower than that of a row, compression is more effective. For example, if your columns store user agents, URLs, or keywords, they will compress really well.
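To make that concrete for HBase: compression is configured per column family when you create (or alter) a table. Here's a minimal sketch with the older-style Java admin API, assuming a placeholder table "visitor_metrics" and family "m" (exact package names vary a bit between HBase versions, and Snappy has to be available on the cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CreateMetricsTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor table = new HTableDescriptor("visitor_metrics"); // placeholder table name
        HColumnDescriptor family = new HColumnDescriptor("m");            // placeholder column family
        family.setCompressionType(Compression.Algorithm.SNAPPY);          // compress this family's store files
        table.addFamily(family);

        admin.createTable(table);
        admin.close();
    }
}
```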
Example HBase solution:
Let's say you want to build a solution for tracking visitor data on your e-commerce site, with a requirement to support aggregates over any date range. Because HBase stores keys sequentially on disk, well-designed keys may give you better performance for real-time sequential scans.
For this example, let's assume we store lots of metrics about visitors under the following key schema:

{product-category}.{sub-category}.{metric}.{timestamp-rounded-to-the-minute}

For example, a single page visit might write to the following keys:
shoes.running.search-terms.1362818100,
shoes.running.user-agents.1362818100,
shoes.running.visitors-country.1362818100, ...

Side note: all of these keys are basically sequential and would most likely be written to a single region server, so you may want these writes distributed across more than one machine. One solution would be to replace the {product-category}.{sub-category} part of the key with HashOf({product-category}.{sub-category}); another is to use a key lookup table the way OpenTSDB does.
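To make that side note concrete, here's a minimal write sketch with the Java client, using the same placeholder table/family as above and a short hash of the {product-category}.{sub-category} prefix. The column layout (one column per search term, value = count) is just an assumption for this answer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

public class MetricWriter {
    // builds {hash-of-category-prefix}.{metric}.{timestamp-rounded-to-the-minute}
    static String rowKey(String category, String subCategory, String metric, long minuteTs) {
        String prefix = MD5Hash.getMD5AsHex(Bytes.toBytes(category + "." + subCategory)).substring(0, 8);
        return prefix + "." + metric + "." + minuteTs;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "visitor_metrics");   // placeholder table name

        long minuteTs = 1362818100L;                          // timestamp rounded to the minute
        Put put = new Put(Bytes.toBytes(rowKey("shoes", "running", "search-terms", minuteTs)));
        // one column per search term in the "m" family; the value here is just a count
        put.add(Bytes.toBytes("m"), Bytes.toBytes("blue trail shoes"), Bytes.toBytes(1L));
        table.put(put);

        table.close();
    }
}
```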
With this key design, it becomes fast to run ad-hoc/real-time queries over these metrics. For example, to get all the search terms used between 1331666259 (Tue, 13 Mar 2012) and 1334344659 (Fri, 13 Apr 2012), you would issue a scan from shoes.running.search-terms.1331666259 to shoes.running.search-terms.1334344659.
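As a rough sketch with the Java client (again using the placeholder table and family from above; if you went with the hashed prefix, you would compute the same hash for the start and stop keys):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SearchTermScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "visitor_metrics");   // placeholder table name

        // start row is inclusive, stop row is exclusive
        Scan scan = new Scan(
                Bytes.toBytes("shoes.running.search-terms.1331666259"),
                Bytes.toBytes("shoes.running.search-terms.1334344659"));
        scan.addFamily(Bytes.toBytes("m"));                   // placeholder column family

        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            // each row holds one minute's worth of search-term columns; aggregate as needed
            int terms = row.getFamilyMap(Bytes.toBytes("m")).size();
            System.out.println(Bytes.toString(row.getRow()) + " -> " + terms + " terms");
        }
        scanner.close();
        table.close();
    }
}
```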