Database to store sparse matrix

Question

I have a very large and very sparse matrix, composed of only 0s and 1s. I then basically handle (row-column) pairs. I have at most 10k pairs per row/column.

My needs are the following:

Parallel insertion of (row-column) pairs
Quick retrieval of an entire row or column
Quick querying the existence of a (row-column) pair
A Ruby client if possible

Are there existing databases adapted for these kind of constraints?

If not, what would get me the best performance :

A SQL database, with a table like this:

row(indexed) | column(indexed) (but the indexes would have to be constantly refreshed)

A NoSQL key-value store, with two tables like this:

row => columns ordered list

column => rows ordered list

(but with parallel insertion of elements to the lists)

Something else

Thanks for your help!

I'm not sure whether the Math.SE site would know an answer to this or not... — Blender, Dec 01 '11 at 04:00

score 6 · Accepted Answer · answered Dec 21 '11 at 00:02

A sparse 0/1 matrix sounds to me like an adjacency matrix, which is used to represent a graph. Based on that, it is possible that you are trying to solve some graph problem and a graph database would suit your needs.

Graph databases, like Neo4J, are very good for fast traversal of the graph, because retrieving the neighbors of an vertex takes O(number of neighbors of a given vertex), so it is not related to the number of vertices in the whole graph. Neo4J is also transactional, so parallel insertion is not a problem. You can use the REST API wrapper in MRI Ruby, or a JRuby library for more seamless integration.

On the other hand, if you are trying to analyze the connections in the graph, and it would be enough to do that analysis once in a while and just make the results available, you could try your luck with a framework for graph processing based on Google Pregel. It's a little bit like Map-Reduce, but aimed toward graph processing. There are already several open source implementations of that paper.

However, if a graph database, or graph processing framework does not suit your needs, I recommend taking a look at HBase, which is an open-source, column-oriented data store based on Google BigTable. It's data model is in fact very similar to what you described (a sparse matrix), it has row-level transactions, and does not require you to retrieve the whole row, just to check if a certain pair exists. There are some Ruby libraries for that database, but I imagine that it would be safer to use JRuby instead of MRI for interacting with it.

ConcernedOfTunbridgeWells · Answer 2 · 2011-12-27T14:11:33.293

If your matrix is really sparse (i.e. the nodes only have a few interconnections) then you would get reasonably efficient storage from a RDBMS such as Oracle, PostgreSQL or SQL Server. Essentially you would have a table with two fields (row, col) and an index or key each way.

Set up the primary key one way round (depending on whether you mostly query by row or column) and make another index on the fields the other way round. This will only store data where a connection exists, and it will be proportional to the number ot edges in the graph.

The indexes will allow you to efficiently retrieve either a row or column, and will always be in sync.

If you have 10,000 nodes and 10 connections per node the database will only have 100,000 entries. 100 ednges per node will have 1,000,000 entries and so on. For sparse connectivity this should be fairly efficient.

A back-of-fag-packet estimate

This table will essentially have a row and column field. If the clustered index goes (row, column, value) then the other covering index would go (column, row, value). If the additions and deletions were random (i.e. not batched by row or column), the I/O would be approximatley double that for just the table.

If you batched the inserts by row or column then you would get less I/O on one of the indexes as the records are physically located together in one of the indexes. If the matrix really is sparse then this adjacency list representation is by far the most compact way to store it, which will be much faster than storing it as a 2D array.

A 10,000 x 10,000 matrix with a 64 bit value would take 800MB plus the row index. Updating one value would require a write of at least 80k for each write (writing out the whole row). You could optimise writes by rows if your data can be grouped by rows on inserts. If the inserts are realtime and random, then you will write out an 80k row for each insert.

In practice, these writes would have some efficiency because the would all be written out in a mostly contiguous area, depending on how your NoSQL platform physically stored its data.

I don't know how sparse your connectivity is, but if each node had an average of 100 connections, then you would have 1,000,000 records. This would be approximately 16 bytes per row (Int4 row, Int4 column, Double value) plus a few bytes overhead for both the clustered table and covering index. This structure would take around 32MB + a little overhead to store.

Updating a single record on a row or column would cause two single disk block writes (8k, in practice a segment) for random access, assuming the inserts aren't row or column ordered.

Adding 1 million randomly ordered entries to the array representation would result in approximately 80GB of writes + a little overhead. Adding 1m entries to the adjacency list representation would result in approximately 32MB of writes (16GB in practice because the whole block will be written for each index leaf node), plus a little overhead.

For that level of connectivity (10,000 nodes, 100 edges per node) the adjacency list will be more efficient in storage space, and probably in I/O as well. You will get some optimisation from the platform, so some sort of benchmark might be appropriate to see which is faster in practice.

If my matrix has a high insertion rate, wouldn't the constant indexing be too costly? — MrRuru, Dec 27 '11 at 13:05

Database to store sparse matrix

2 Answers2