I am starting a project and have tried to abstract out the challenges it faces. I come from an RDBMS background and am looking to make a sensible decision on storage technology(ies) for my next project. I know how I would approach these problems if an RDBMS were the only option open to me, but I am interested to understand what the industry would go with. Ideally someone sees this and says something like: ‘I did exactly this, and used ###, it worked perfectly but we had to employ ### to deal with spikes in usage’. And I'm not scared to crawl back into my RDBMS cave if that's the best option for the business.
So the problem:
[object A] – type: person
{
/*some fields that every person has*/
name: “A”
email: “a@example.com”
age: 22
/*some fields that can be added dynamically*/
my_custom_user_property : 332 /* or maybe a struct of some type */
/*some relations (fixed)*/
groups: member of C; administrator of C; member of F; reader of G
/*some more arbitrary relations*/
mother_of: B
}
[list of groups]
That is to say – each customer may want to add their own ‘columns’ to the database, and then later on search against them.
My expectation is that the data isn’t fast changing (high read-to-write ratio) and I could happily run some work asynchronously [e.g. the generation of reports]. But simple criteria-based fetches would need to be fast, including those against the custom fields.
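To make the custom-field requirement concrete, here is a rough in-memory sketch in Python (chosen only for brevity). The names `Person`, `custom` and `find` are made up for this question and are not any particular database's API; a real store would answer this from an index rather than a scan.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class Person:
    # Fixed fields that every person has.
    name: str
    email: str
    age: int
    # Customer-defined "columns" (hypothetical name), added at runtime.
    custom: dict = field(default_factory=dict)


# Names of the fixed fields, so criteria can target either kind of field.
FIXED = {f.name for f in fields(Person)} - {"custom"}


def matches(person: Person, criteria: dict) -> bool:
    """True if every criterion equals the person's fixed or custom value."""
    for key, expected in criteria.items():
        if key in FIXED:
            actual: Any = getattr(person, key)
        elif key in person.custom:
            actual = person.custom[key]
        else:
            return False  # the person does not have this custom field at all
        if actual != expected:
            return False
    return True


def find(people: list, **criteria: Any) -> list:
    """Linear scan standing in for an indexed lookup in a real store."""
    return [p for p in people if matches(p, criteria)]
```

The point is that `find(people, my_custom_user_property=332)` and `find(people, age=22)` should be equally cheap from the caller's point of view, even though one field was only defined by a customer after deployment.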
On top of the DB is some functionality to limit what can be seen at column level, e.g. only members of F can view the email of members of G. These rules, again, would need to be dynamic (let’s say that my custom user property is sensitive and I have some means of setting business rules around it). Depending on the technology, I suppose this could live purely in the application (fetch whole objects, then limit based on rules) or in a more complicated query-builder type system.
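The "purely in the application" variant might look something like the sketch below. The `Rule` shape and the `redact` function are hypothetical, invented here just to pin down what I mean by fetching the whole object and then limiting it based on rules.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    field: str         # column being protected
    target_group: str  # rule applies to records that are members of this group
    viewer_group: str  # viewer must be a member of this group to see the field


def redact(record: dict, record_groups: set, viewer_groups: set, rules: list) -> dict:
    """Return a copy of record without the fields the viewer may not see."""
    hidden = {
        rule.field
        for rule in rules
        if rule.target_group in record_groups
        and rule.viewer_group not in viewer_groups
    }
    return {key: value for key, value in record.items() if key not in hidden}
```

With `Rule("email", "G", "F")`, a viewer who is not a member of F gets a member of G back without the `email` key, while everything else passes through untouched. The obvious downside is that sensitive values still leave the database before being dropped, which is why I am also considering the query-builder route.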
Next is a ‘graph’ type search. I currently can’t see this going beyond a couple of degrees of separation, but I need to be able to find e.g. users with a 2nd-degree connection to groups through several different routes (some connection types may not be fixed at development time). As above, this might be something that can be processed asynchronously.
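As an illustration of the kind of traversal I have in mind, here is a depth-limited breadth-first search over (source, relation, target) triples. This is only a sketch: the triple format and the name `within_degrees` are assumptions for this question, and relations are treated as undirected for simplicity.

```python
from collections import deque


def within_degrees(edges, start, max_depth=2):
    """Return {node: hops} for every node reachable from start within max_depth hops."""
    # Build an undirected adjacency map; the relation name is ignored here,
    # which is what lets arbitrary, customer-defined relations take part.
    adjacency = {}
    for source, _relation, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)

    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the requested number of hops
        for neighbour in adjacency.get(node, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)

    del depth[start]  # the start node is not its own connection
    return depth
```

For example, if A is `mother_of` B and B is `member_of` F, then F shows up as a 2nd-degree connection of A even though A has no direct relation to it.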
I am looking for something that will handle, for now, 10M users, 1M groups, 100K daily active users, and 5K users able to administrate (e.g. add columns). (And yes, this is totally achievable with MySQL or similar, but with a reasonable amount of engineering on top.)
As far as practical development/infrastructure goes:
- I don’t want to tear my hair out with undocumented configurations/gotchas and the like (that said I am TOTALLY happy learning things, as long as it's not going to take me a degree in the thing just to get off the ground)
- Something that can be set up for high availability and robustness – e.g. decent cluster management and reporting available (or not that expensive with the help of an expert)
- Preferably something that will deploy out of the box relatively quickly
- I may have a module for financial transactions (unconfirmed) so ACID a plus
- A mature library that will play nicely with the Spring framework
- Of course, good documentation/examples: enough info to get a grasp of the conceptual model as well as practical how-to type stuff
- Open source
I have read [lots] about the offerings out there, but would like to whittle this down to 2 sensible options that I can spike out. After reading about MongoDB, Cassandra, Couchbase, CouchDB, Neo4j (and lots more), I sort of settled on Couchbase. But I’m also aware of the amount of marketing material out there designed to hook people like me on a particular idea.
So, to sum up, I have three questions: Are there any approaches that won't work? Are there any approaches that have been proven to work? Is there a clear best option at this point?