
Context: I'm currently leading a project to integrate our application (a model that works with high-resolution scientific data - .NET, WinForms) with that of another provider in my company's field (a similar model - .NET, cloud architecture). I'll be implementing interfaces defined by the collaborator's application - at runtime, instances of these classes will be passed to the collaborator's cloud-based application to provide analysis detail. The cloud application will distribute these instances across processing nodes, coordinating the analysis as a whole.

The specific question I'd like to ask is: What might be a good store for the model data used to feed our elements of the application?

Our data is complex and structured, in that our current database schema is reasonably highly normalised (the database platform is enterprise-grade and relational). The current input format for our application is comma-separated text files whose format mirrors the database schema. The data to be used by the elements of the collaborator's application we implement can be held on disk at a location of their choosing, and each processing node will have access to that location. Each node will need access to only a very small proportion of all data (say, 0.001% - 0.01% of it on average). We have the following requirements:

Must Have

  • There must be no process associated with data access other than that of the application
  • Support for .NET
  • Will hold and work successfully with 100GB - 1TB of data.
  • Fast for selections (there will be no insertion, updating or deletion at runtime).

Desirable

  • Free and, if free, permissively licensed (e.g. BSD / Public Domain)
  • Fast for insertions - less critical than selections, because database population will be performed prior to analysis
  • Support for visual schema design
  • Well respected / proven.

We have considered the following options so far:

  • Development of our own indexed file format - I'm not familiar with how to do this. I've considered dividing data along the axis of parallelisation (so that each processing node would access only one partition), still holding data in the same flat-file format as we currently use (partitions would just be sub-folders inside a root folder). I'm then thinking of reading the subsets of data into standard .NET collections, but would need to devise a sensible way to perform inter-collection lookups.
  • SQLite - I've read of people using it successfully for databases of 100GB+, which surprised me - apparently it's not as lightweight as it sounds. My benchmarking work so far demonstrates that insert / select performance is fine on tables of up to 10 million records, but we will have billions of records in some of our tables.
  • NoSQL - I am unfamiliar with NoSQL technologies and had understood they are designed to solve very different problems to ours (working well with loosely structured data, where horizontal scalability is a concern, which sounds like the opposite of what we need). However, I briefly tried MongoDB (ineligible here because there is no in-process mode) and selection and insertion performance both seem to be many times better than that for the relational databases I've used. Eligible NoSQL databases include Redis and DensoDB and I plan to evaluate these next - there may be others, I'm just not sure whether this line of enquiry is actually sensible.
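To make the first option concrete, here is a minimal sketch of what per-node partition loading and inter-collection lookup might look like. The file names (`sites.csv`, `samples.csv`), columns, and record types are purely illustrative placeholders, not our actual schema; the point is that loading each flat file into a `Dictionary` keyed on its ID column turns inter-collection lookups into O(1) hash probes:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// Illustrative record types - the real ones would mirror our database tables.
public sealed record Site(int Id, string Name);
public sealed record Sample(int Id, int SiteId, double Value);

public static class PartitionLoader
{
    // Load one partition's flat files. Keying the "parent" table by ID
    // gives cheap lookups from any row that references it.
    public static (Dictionary<int, Site> Sites, List<Sample> Samples)
        Load(string partitionDir)
    {
        var sites = File.ReadLines(Path.Combine(partitionDir, "sites.csv"))
            .Skip(1) // header row
            .Select(line => line.Split(','))
            .Select(f => new Site(int.Parse(f[0]), f[1]))
            .ToDictionary(s => s.Id);

        var samples = File.ReadLines(Path.Combine(partitionDir, "samples.csv"))
            .Skip(1)
            .Select(line => line.Split(','))
            .Select(f => new Sample(
                int.Parse(f[0]),
                int.Parse(f[1]),
                double.Parse(f[2], CultureInfo.InvariantCulture)))
            .ToList();

        return (sites, samples);
    }
}
```

With this shape, a cross-collection "join" is just `sites[sample.SiteId].Name`. Whether plain dictionaries scale to our per-node data volumes is exactly what I'd need to benchmark.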

If you've read this far, thanks, and if you are able to evaluate the validity of any of the options mentioned above, or suggest something more fitting then I'll be very grateful. I look forward to hearing from you!

pauld
  • Lot of background - thanks! I do have a question though. You're not junking your database, right? Is that database the source of the data and you're just looking for an intermediate store where you can put data for your partner to pick it up? – simon at rcl Jan 08 '14 at 17:08
  • That's right - we plan to keep the current database and offer the new solution as an alternative, both to our partner's application and ours. – pauld Jan 09 '14 at 08:36

0 Answers