21

I have been reading up on Hadoop and HBase lately, and came across this term:

HBase is an open-source, distributed, sparse, column-oriented store...

What do they mean by sparse? Does it have something to do with a sparse matrix? I am guessing it is a property of the type of data it can store efficiently, and hence, would like to know more about it.

David
Jai

5 Answers

27

In a regular database, rows are sparse but columns are not. When a row is created, storage is allocated for every column, irrespective of whether a value exists for that field (a field being the storage allocated for the intersection of a row and a column).

This allows fixed-length rows, which greatly improves read and write times. Variable-length data types are handled with an analogue of pointers.
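
To illustrate why fixed-length rows make lookups cheap, here is a minimal sketch. It assumes a made-up row layout (a 4-byte id, an 8-byte name, a 4-byte quantity); `ROW`, `pack_rows`, and `read_row` are hypothetical names for this example and do not correspond to any particular database engine. Because every row has the same size, the n-th row can be located with plain arithmetic, and an empty field still occupies its slot.

```python
import struct

# Hypothetical fixed-width layout: int32 id, 8-byte name, int32 quantity.
# "<" disables alignment padding, so every row is exactly 16 bytes.
ROW = struct.Struct("<i8si")


def pack_rows(rows):
    """Pack (id, name, quantity) tuples into one contiguous byte string."""
    return b"".join(ROW.pack(i, name.encode("ascii"), q) for i, name, q in rows)


def read_row(buf, n):
    """Read the n-th row directly: its offset is simply n * ROW.size."""
    i, name, q = ROW.unpack_from(buf, n * ROW.size)
    return i, name.rstrip(b"\0").decode("ascii"), q


# The middle row has an empty name, yet it still takes the full 16 bytes.
table = pack_rows([(1234, "Mike", 1), (5678, "", 0), (3454, "Jole", 3)])
```

Calling `read_row(table, 2)` jumps straight to byte offset 32 without scanning the earlier rows, which is the benefit being described; the cost is that the empty name in the second row is stored anyway.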

Sparse columns incur a performance penalty and are unlikely to save you much disk space: the space required to indicate NULL is smaller than the 64-bit pointer required by the linked-list style of chained pointers typically used to implement very large non-contiguous storage.

Storage is cheap. Performance isn't.

Peter Wone
  • 2
    In some cases, the sparsity property improves performance in HBase. If you are doing a summary over a particular column family, it doesn't have to check to see if a particular value is Null to see if it should include it. – Donald Miner Jul 05 '11 at 23:22
  • I generally agree with your sentiment, though. You shouldn't use HBase because of its sparseness... it feels more like a nice side effect of storing the data in a columnar fashion. – Donald Miner Jul 05 '11 at 23:35
  • Interesting, so in an RDBMS the rows are sparse, as fields can be defined as null, and in HBase the columns are sparse, since you do not need to define column data for each row. @orangeoctopus, how is it a performance hit in HBase too at times? – Jai Jul 06 '11 at 07:26
  • 2
    HBase does not use a "linked-list style of chained pointer architecture." Its architecture is completely different (see David's link in the other answer). HBase also doesn't store pointers to cell values held elsewhere in the filesystem unless you explicitly tell it to. A table may have hundreds or thousands of columns (or more), and column values may be relatively large (indexes, for example). In such a context, sparsity is basically the only option. – ajduff574 Jul 11 '11 at 18:07
  • Perhaps it doesn't use pointer chaining but when the column data isn't in a predictable relative position, somewhere *something* will explicitly record the storage address and the thrust of my argument stands. If I'm wrong about this I would be absolutely fascinated to learn how it was done without pointers. – Peter Wone Jul 12 '11 at 01:35
  • 1
    @Peter Wone See the link in David's answer. HBase basically stores sorted tuples of the form (key, column family, column name, timestamp, value). If a column has no value for a given row, there is no tuple stored. There isn't a pointer to every tuple, so some scanning is often involved if you only need to look up one column. There are certainly disadvantages to this sort of structure, but it allows each row to have many sparse columns (with columns added easily) and permits versioning as well. – ajduff574 Jul 14 '11 at 14:42
4

Sparse in respect to HBase is indeed used in the same context as a sparse matrix. It basically means that fields that are null are free to store (in terms of space).
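
As a rough illustration of "null fields are free to store", here is a toy sketch of a store that keeps only the cells that exist, as sorted tuples of (row key, column family, column name, timestamp, value), the shape described in the comments on the first answer. `SparseStore` and its methods are hypothetical names for this example, not HBase's API, and the toy keeps timestamps in ascending order for simplicity.

```python
import bisect


class SparseStore:
    """Toy model: a sorted list of (row, family, qualifier, timestamp, value)."""

    def __init__(self):
        self._cells = []

    def put(self, row, family, qualifier, timestamp, value):
        # Only cells that are written take up space; absent columns store nothing.
        bisect.insort(self._cells, (row, family, qualifier, timestamp, value))

    def get_row(self, row):
        """Return the latest value of each column that exists for this row."""
        start = bisect.bisect_left(self._cells, (row,))
        latest = {}
        for r, family, qualifier, _ts, value in self._cells[start:]:
            if r != row:
                break
            latest[(family, qualifier)] = value  # later timestamps overwrite earlier
        return latest

    def cell_count(self):
        return len(self._cells)


store = SparseStore()
store.put("row1", "info", "name", 1, "Mike")
store.put("row1", "info", "product", 1, "Hbase")
store.put("row2", "info", "name", 1, "Jole")
store.put("row1", "info", "name", 2, "Michael")  # a newer version of one cell
```

Here `row2` has no `product` column, and nothing at all is stored for it: no NULL marker and no placeholder. Reading a row means scanning its run of tuples, which is the trade-off noted in the comments above.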

I found a couple of blog posts that touch on this subject in a bit more detail:

http://blog.rapleaf.com/dev/2008/03/11/matching-impedance-when-to-use-hbase/

http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

Donald Miner
4

At the storage level, all data is stored as key-value pairs. Each storage file contains an index so that it knows where each key-value starts and how long it is.

As a consequence of this, if you have very long keys (e.g. a full URL), and a lot of columns associated with that key, you could be wasting some space. This is ameliorated somewhat by turning compression on.
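
The effect of long keys can be sketched with a toy serialization in which every cell carries its own copy of the row key. This is an assumption-laden illustration: the tab-separated format, the function names, and the example URL are invented here and are not HBase's on-disk file format, and `zlib` merely stands in for whichever compression codec is enabled.

```python
import zlib


def key_overhead_bytes(row_key, columns):
    """Bytes spent on row-key copies when each cell repeats the full key."""
    return len(row_key.encode("utf-8")) * len(columns)


def serialize_cells(row_key, columns):
    """Toy layout: one 'key<TAB>column<TAB>value' line per stored cell."""
    lines = [f"{row_key}\t{name}\t{value}" for name, value in sorted(columns.items())]
    return "\n".join(lines).encode("utf-8")


# Hypothetical row: a long URL as the key, with 20 small columns.
row_key = "http://www.example.com/a/fairly/long/path/to/some/page.html"
columns = {f"col{i:02d}": str(i) for i in range(20)}

raw = serialize_cells(row_key, columns)
packed = zlib.compress(raw)
```

Comparing `len(raw)` with `len(packed)` shows how much of the repeated key a general-purpose compressor can squeeze out, while `key_overhead_bytes(row_key, columns)` gives the uncompressed cost of the key copies alone.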

For more information on HBase storage, see: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

David
1

This is the best article I have seen on the subject; it explains many database terms as well.

> http://jimbojw.com/#understanding%20hbase

Xiaoxia Lin
Kh.Taheri
1

There are two ways data can be stored in a table: as sparse data or as dense data. Here is an example of sparse data.

Suppose we run a query on a table containing sales transactions per employee for the period from Jan 2015 to Nov 2015. The query returns the data that satisfies this date condition; if an employee did not make any transactions, that employee's whole row comes back blank.

e.g.

 EMPNo  Name  Product  Date        Quantity
 1234   Mike  Hbase    2014/12/01  1
 5678
 3454   Jole  Flume    2015/09/12  3

The row with EMPNo 5678 has no data, while the rest of the rows contain data. If we consider the whole table, with both blank and populated rows, we can term it sparse data.

If we take only the populated rows, it is termed dense data.

Ashish Singh