I have a strong use case for mixing scientific data (i.e. double matrices and vectors) with relational data and using this as the data source for a distributed computation, e.g. MapReduce, Hadoop, etc. Up to now I have been storing my scientific data in HDF5 files with custom HDF5 schemas and the relational data in Postgres, but since this setup does not scale very well, I was wondering whether there is a more NoSQL/hybrid approach that could support the heterogeneity of this data.
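Roughly, my current layout looks like this (a simplified sketch; the file names, dataset paths and schema are just illustrative):

```python
import h5py
import numpy as np

# Scientific data: dense double matrices/vectors in an HDF5 file under a custom schema
prices = np.random.rand(252, 100)          # e.g. one year of daily closes for 100 symbols
with h5py.File("timeseries_2013.h5", "w") as f:
    ds = f.create_dataset("/prices/close", data=prices, compression="gzip")
    ds.attrs["start_date"] = "2013-01-01"

# Relational/static data (symbol information, expiry, maturity dates, ...) lives in
# ordinary Postgres tables, completely separate from the HDF5 files.
```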
For example, my use case would be to distribute a complex process that involves:
- loading gigabytes of data from a time series database provider
- linking the time series to static data, e.g. symbol information, expiry and maturity dates, etc.
- launching a series of scientific computations, e.g. covariance matrices, distribution fitting, Monte Carlo simulations
- distributing the computations across many separate HPC nodes and storing the intermediate results for traceability (see the sketch after this list).
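To make the last two steps concrete, here is a rough sketch of what one worker would do with my current tooling (numpy/h5py; the file layout, dataset names and attribute keys are made up for illustration):

```python
import h5py
import numpy as np

def run_chunk(returns_path, node_id):
    """One HPC worker: load a slice of the time series, compute a covariance
    matrix and persist it as a traceable intermediate result."""
    with h5py.File(returns_path, "r") as f:
        returns = f["/returns/daily"][...]   # (days, symbols) double matrix

    cov = np.cov(returns, rowvar=False)      # one of the scientific steps

    # Intermediate result written to an HDF5 file on the node, tagged for traceability
    out_path = "intermediate_node%03d.h5" % node_id
    with h5py.File(out_path, "w") as f:
        ds = f.create_dataset("/covariance", data=cov)
        ds.attrs["source"] = returns_path
        ds.attrs["node_id"] = node_id
    return out_path
```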
These steps require a distributed database that can handle both relational and scientific data. One possibility would be to store the scientific data in HDF5 and put it into BLOB columns of a relational database, but that feels like a misuse. Another would be to keep the HDF5 results on disk and have the relational database link to them, but then we lose self-containment. Moreover, neither of these approaches accounts for distributing the data for direct access on the HPC nodes: the data would need to be pulled from a central node, which is not ideal.
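For reference, the second approach (relational database linking to HDF5 files on disk) would look roughly like this; the table, column names and connection string are purely illustrative:

```python
import psycopg2  # placeholder connection details below

# The relational side keeps the static data plus a pointer to the HDF5 artifact on disk;
# the matrices themselves stay in the .h5 files, which is why self-containment is lost.
ddl = """
CREATE TABLE IF NOT EXISTS computation_results (
    result_id    serial PRIMARY KEY,
    symbol       text,
    run_ts       timestamptz NOT NULL,
    hdf5_path    text NOT NULL,      -- file on node-local or shared storage
    hdf5_dataset text NOT NULL       -- e.g. '/covariance'
);
"""

conn = psycopg2.connect("dbname=research")   # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(ddl)
    cur.execute(
        "INSERT INTO computation_results (symbol, run_ts, hdf5_path, hdf5_dataset) "
        "VALUES (%s, now(), %s, %s)",
        ("SPX", "intermediate_node001.h5", "/covariance"),
    )
```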