I have a strong use case for mixing scientific data (i.e. double matrices and vectors) with relational data and using this as the data source for a distributed computation, e.g. MapReduce, Hadoop, etc. Up to now I have been storing my scientific data in HDF5 files with custom HDF5 schemas and the relational data in Postgres, but since this setup does not scale very well, I was wondering whether there is a more NoSQL/hybrid approach that could support the heterogeneity of this data.
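Roughly, my current layout looks like this (a simplified sketch; the file names, dataset paths and schema are just illustrative):

```python
import h5py
import numpy as np

# Scientific data: dense double matrices/vectors in an HDF5 file under a custom schema
prices = np.random.rand(252, 100)          # e.g. one year of daily closes for 100 symbols
with h5py.File("timeseries_2013.h5", "w") as f:
    ds = f.create_dataset("/prices/close", data=prices, compression="gzip")
    ds.attrs["start_date"] = "2013-01-01"

# Relational/static data (symbol information, expiry, maturity dates, ...) lives in
# ordinary Postgres tables, completely separate from the HDF5 files.
```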
For example, my use case would be to distribute a complex process that involves:
- loading gigabytes of data from a time series database provider
- linking the time series to static data, e.g. symbol information, expiry and maturity dates, etc.
- launching a series of scientific computations, e.g. covariance matrices, distribution fitting, Monte Carlo simulations
- distributing the computations across many separate HPC nodes and storing the intermediate results for traceability (see the sketch after this list).
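To make the last two steps concrete, here is a rough sketch of what one worker would do with my current tooling (numpy/h5py; the file layout, dataset names and attribute keys are made up for illustration):

```python
import h5py
import numpy as np

def run_chunk(returns_path, node_id):
    """One HPC worker: load a slice of the time series, compute a covariance
    matrix and persist it as a traceable intermediate result."""
    with h5py.File(returns_path, "r") as f:
        returns = f["/returns/daily"][...]   # (days, symbols) double matrix

    cov = np.cov(returns, rowvar=False)      # one of the scientific steps

    # Intermediate result written to an HDF5 file on the node, tagged for traceability
    out_path = "intermediate_node%03d.h5" % node_id
    with h5py.File(out_path, "w") as f:
        ds = f.create_dataset("/covariance", data=cov)
        ds.attrs["source"] = returns_path
        ds.attrs["node_id"] = node_id
    return out_path
```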
These steps require a distributed database that can handle both relational and scientific data. One possibility would be to store the scientific data in HDF5 and put it into BLOB columns of a relational database, but that feels like a misuse. Another would be to keep the HDF5 results on disk and have the relational database link to them, but then we lose self-containment. Moreover, neither of these approaches accounts for distributing the data for direct access on the HPC nodes: the data would need to be pulled from a central node, which is not ideal.
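For reference, the second approach (relational database linking to HDF5 files on disk) would look roughly like this; the table, column names and connection string are purely illustrative:

```python
import psycopg2  # placeholder connection details below

# The relational side keeps the static data plus a pointer to the HDF5 artifact on disk;
# the matrices themselves stay in the .h5 files, which is why self-containment is lost.
ddl = """
CREATE TABLE IF NOT EXISTS computation_results (
    result_id    serial PRIMARY KEY,
    symbol       text,
    run_ts       timestamptz NOT NULL,
    hdf5_path    text NOT NULL,      -- file on node-local or shared storage
    hdf5_dataset text NOT NULL       -- e.g. '/covariance'
);
"""

conn = psycopg2.connect("dbname=research")   # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(ddl)
    cur.execute(
        "INSERT INTO computation_results (symbol, run_ts, hdf5_path, hdf5_dataset) "
        "VALUES (%s, now(), %s, %s)",
        ("SPX", "intermediate_node001.h5", "/covariance"),
    )
```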