I'm assuming you're thinking of sparse matrices from mathematical context:
http://en.wikipedia.org/wiki/Sparse_matrix (The storing techniques described there are for memory storage (fast arithmetic operation), not persistent storage (low disk usage).)
Since one usually do operate on this matrices on client side rather than on server side a SQL-ARRAY[] is the best choice!
The question is how to take advantage of the sparsity of the matrix? Here the results from some investigations.
Setup:
- Postgres 8.4
- Matrices w/ 400*400 elements in double precision (8 Bytes) --> 1.28MiB raw size per matrix
- 33% non-zero elements --> 427kiB effective size per matrix
- averaged using ~1000 different random populated matrices
Competing methods:
- Rely on the automatic server side compression of columns with SET STORAGE MAIN or EXTENDED.
- Only store the non-zero elements plus a bitmap (
bit varying(xx)
) describing where to locate the non-zero elements in the matrix. (One double precision is 64 times bigger than one bit. In theory (ignoring overheads) this method should be an improvement if <=98% are non-zero ;-).) Server side compression is activated.
- Replace the zeros in the matrix with NULL. (The RDBMSs are very effective in storing NULLs.) Server side compression is activated.
(Indexing of non-zero elements using a 2nd index-ARRAY[] is not very promising and therefor not tested.)
Results:
- Automatic compression
- no extra implementation efforts
- no reduced network traffic
- minimal compression overhead
- persistent storage = 39% of the raw size
- Bitmap
- acceptable implementation effort
- network traffic slightly decreased; dependent on sparsity
- persistent storage = 33.9% of the raw size
- Replace zeros with NULLs
- some implementation effort (API needs to know where and how to set the NULLs in the ARRAY[] while constructing the INSERT query)
- no change in network traffic
- persistent storage = 35% of the raw size
Conclusion:
Start with the EXTENDED/MAIN storage parameter. If you have some free time investigate your data and use my test setup with your sparsity level. But the effect may be lower than you expect.
I suggest always to use the matrix serialization (e.g. Row-major order) plus two integer columns for the matrix dimensions NxM. Since most APIs use textual SQL you are saving a lot of network traffic and client memory for nested "ARRAY[ARRAY[..], ARRAY[..], ARRAY[..], ARRAY[..], ..]" !!!
Tebas
CREATE TABLE _testschema.matrix_dense
(
matdata double precision[]
);
ALTER TABLE _testschema.matrix_dense ALTER COLUMN matdata SET STORAGE EXTERN;
CREATE TABLE _testschema.matrix_sparse_autocompressed
(
matdata double precision[]
);
CREATE TABLE _testschema.matrix_sparse_bitmap
(
matdata double precision[]
bitmap bit varying(8000000)
);
Insert the same matrices into all tables. The concrete data depends on the certain table.
Do not change the data on server side due to unused but allocated pages. Or do a VACUUM.
SELECT
pg_total_relation_size('_testschema.matrix_dense') AS dense,
pg_total_relation_size('_testschema.matrix_sparse_autocompressed') AS autocompressed,
pg_total_relation_size('_testschema.matrix_sparse_bitmap') AS bitmap;