
I am starting a project and have tried to abstract out the challenges it faces. I come from an RDBMS background and am looking to make a sensible decision on the storage technology (or technologies) for my next project. I know how I would approach these problems if an RDBMS were the only option available to me, but I am interested to understand what the industry would go with – ideally someone sees this and says something like: ‘I did exactly this, and used ###; it worked perfectly, but we had to employ ### to deal with spikes in usage’. And I'm not scared to crawl back into my RDBMS cave if that's the best option for the business.

So the problem:

[object A] – type: person
{
  /*some fields that every person has*/
  name: "A"
  email: "a@example.com"
  age: 22

  /*some fields that can be added dynamically*/
  my_custom_user_property : 332 /* or maybe a struct of some type */

  /*some relations (fixed)*/
  groups: member of C; administrator of C; member of F; reader of G

  /*some more arbitrary relations*/
  mother_of: B
}

[list of groups]

That is to say – each customer may want to add their own ‘columns’ to the database, and then later on search against them.

My expectation is that the data isn't fast-changing (a high read-to-write ratio) and I could happily handle some things asynchronously [e.g. the generation of reports]. But simple criteria-based fetches would need to be fast, including against the custom fields.
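
To make that concrete, here's a rough sketch of what I mean, assuming a document store were the pick (MongoDB purely as an illustration – the connection string, database/collection names and index are placeholders I've made up):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Indexes;
    import org.bson.Document;

    public class CustomFieldSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> people = client.getDatabase("app").getCollection("people");

                // A person with both fixed fields and a per-customer ("dynamic") field.
                people.insertOne(new Document("name", "A")
                        .append("email", "a@example.com")
                        .append("age", 22)
                        .append("my_custom_user_property", 332));

                // Index the custom field so criteria-based fetches against it stay fast.
                people.createIndex(Indexes.ascending("my_custom_user_property"));

                // Simple criteria-based fetch against the custom field.
                for (Document d : people.find(Filters.eq("my_custom_user_property", 332))) {
                    System.out.println(d.toJson());
                }
            }
        }
    }

The appeal is that the custom field looks no different from the fixed ones when storing and querying; the open question for me is whether indexing arbitrary per-customer fields like this holds up at the scale below.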

On top of the DB is some functionality to limit what can be seen at column level – e.g. only members of F can view email on members of G. These rules, again, would need to be dynamic (let's say my custom user property is sensitive and I have some means of setting business rules around it). Depending on the technology, this could, I suppose, exist purely in the application (fetch whole objects, then limit based on rules) or as a more complicated query-builder type system.
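
For the "purely in the application" variant, I'm picturing something along these lines (just a sketch – the rule set and group names are invented):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class FieldVisibilityFilter {

        // Hypothetical rule set: field name -> groups whose members may see it.
        // Fields with no entry are treated as public.
        private final Map<String, Set<String>> visibleToGroups = Map.of(
                "email", Set.of("F"),
                "my_custom_user_property", Set.of("administrators"));

        /** Return a copy of a fetched object with the fields the viewer may not see removed. */
        public Map<String, Object> filter(Map<String, Object> person, Set<String> viewerGroups) {
            Map<String, Object> visible = new HashMap<>();
            person.forEach((field, value) -> {
                Set<String> allowed = visibleToGroups.get(field);
                if (allowed == null || viewerGroups.stream().anyMatch(allowed::contains)) {
                    visible.put(field, value);
                }
            });
            return visible;
        }
    }

That keeps the rules out of the query entirely, at the cost of always fetching whole objects; the query-builder alternative would push the same rules down into the fetch itself.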

Next is a ‘graph’-type search – I currently can't see this going beyond a couple of degrees of separation, but I need to be able to find e.g. users with a 2nd-degree connection to groups through several different routes (some connections may not be fixed at development time). As above, this might be something that can be processed asynchronously.
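
If something graph-shaped (say Neo4j, again just as an example) got spiked for this, I imagine the 2nd-degree fetch looking roughly like the following – the connection URI, credentials, labels and relationship pattern are all assumptions on my part:

    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Record;
    import org.neo4j.driver.Result;
    import org.neo4j.driver.Session;
    import org.neo4j.driver.Values;

    public class SecondDegreeSketch {
        public static void main(String[] args) {
            try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                    AuthTokens.basic("neo4j", "secret"));
                 Session session = driver.session()) {

                // Find people reachable from group G over exactly two relationships,
                // regardless of relationship type (member_of, administrator_of, mother_of, ...).
                String cypher = "MATCH (u:Person)-[*2..2]-(g:Group {name: $groupName}) "
                        + "RETURN DISTINCT u.name AS name";

                Result result = session.run(cypher, Values.parameters("groupName", "G"));
                for (Record record : result.list()) {
                    System.out.println(record.get("name").asString());
                }
            }
        }
    }

Even if I didn't go with a graph store, that's the shape of query I'd otherwise be building by hand (self-joins or application-side traversal).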

I want to look for something that will handle, for now, 10M users, 1M groups, 100K daily active users, 5K users able to administrate (e.g. add columns). (And yes, totally achievable with MySQL or similar, but with a reasonable amount of engineering on top)

As far as practical development/infrastructure goes:

  • I don’t want to tear my hair out with undocumented configurations/gotchas and the like (that said I am TOTALLY happy learning things, as long as it's not going to take me a degree in the thing just to get off the ground)
  • Something that can be set up for high availability and robustness – e.g. decent cluster management and reporting available (or not that expensive with the help of an expert)
  • Preferably something that will deploy out of the box relatively quickly
  • I may have a module for financial transactions (unconfirmed), so ACID is a plus
  • A mature library that will play nicely with the Spring framework
  • Of course, good documentation/examples – enough info to get a grasp of the conceptual model as well as practical how-to type stuff
  • Open source

I have read [lots] about the offerings out there, but would like to whittle this down to 2 sensible options that I can spike out. When reading about MongoDB, Cassandra, Couchbase, CouchDB, Neo4j (and lots more), I sort of settled on Couchbase. But I'm also aware of the amount of marketing material out there designed to hook people like me on a particular idea.

So it comes down to three questions: Are there any approaches that won't work? Are there any approaches that have been proven to work? Is there a clear best option at this point?

  • Go with a nosql or serialize your schemaless stuff and dump them into a column. –  Mar 13 '15 at 14:49
  • @Will Just any old NoSQL-labelled DB? – button Mar 13 '15 at 14:58
  • Depends on your requirements. That's why you should grab a bunch, prototype against them, and see which suits you best. But document dbs are designed to handle this kind of thing. –  Mar 13 '15 at 15:02
  • @Will So you think start with any one of MongoDB, Couch*? And there's no further distinction to be made based on the requirements laid out above? – button Mar 13 '15 at 15:14
  • I don't have a lot of experience with nosqls except for Azure's DocumentDb. So I can't answer your question. Also, we don't do recommendations. Best bet is to try out a couple with some prototypes. That's what I always end up doing when trying to find something that meets my needs. –  Mar 13 '15 at 15:15
  • So that is exactly what my question is about -- the inclusion or exclusion of options based on experience. I'm not looking for recommendations any more than I'd be looking for recommendations on the fastest sort algorithm for some data set; the only difference here being that the best answer is perhaps not as clear cut. As such, we do provide recommendations but for solutions to problems with specific scope. Perhaps there are simply some specific things to look out for when navigating this strange new world, if speaking about a specific solution is not an option. – button Mar 13 '15 at 15:34
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/72939/discussion-between-button-and-will). – button Mar 13 '15 at 15:36

1 Answer

Here's some info regarding your questions. The following is from my exposure to user databases backed by Cassandra, MongoDB, MySQL, and Oracle.

Approaches That Won't Work

Some approaches just won't work, such as those based on simple key-value stores or caches that hold a fixed amount of data, while others are just not as ideal. An interesting example is Cassandra, which is nice for its multi-datacenter support. Cassandra experts have told me that document stores like MongoDB are a more appropriate solution for some user-database use cases, but there are organizations that use Cassandra all the same, including the Apache Usergrid project, which is used in production by organizations managing millions of users. Usergrid is, of course, looking to add Elasticsearch to get better query capabilities, but adding full-text search can be desirable for just about any database.

Approaches Proven to Work

There are some known production systems that allow customers to add their own columns. One of the most famous is Salesforce, which runs on Oracle and which I've used at a few firms. Each customer can define their own custom fields / columns and run SOQL (Salesforce Object Query Language) against those columns. Since every customer can create their own columns, this could result in a multi-tenant table much wider than Oracle allows, so they use "flex columns", where a single physical column can hold values of many different kinds (and even different data types) depending on the tenant. Another approach proven to work is MongoDB, which is used by another vendor I've worked with that supports per-customer fields in a multi-tenant data store. In both solutions, custom fields are first-class fields that can be stored and searched against.
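
To illustrate the flex-column idea, here's a minimal sketch of how a search against a customer-defined field might be mapped onto a generic physical column. The table layout, column names and metadata map are all invented for illustration; a real system would keep that metadata in its own per-tenant tables:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.Map;

    public class FlexColumnSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical per-tenant metadata: custom field name -> generic physical column.
            Map<String, String> flexColumnFor = Map.of("my_custom_user_property", "flex_col_3");

            // The physical column name comes from our own metadata, never from user input,
            // so interpolating it into the SQL text is acceptable here.
            String physicalColumn = flexColumnFor.get("my_custom_user_property");
            String sql = "SELECT name, email FROM person_flex "
                    + "WHERE tenant_id = ? AND " + physicalColumn + " = ?";

            try (Connection conn = DriverManager.getConnection(
                        "jdbc:mysql://localhost/appdb", "app", "secret");
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, 42L);        // tenant id
                ps.setString(2, "332");    // flex columns are typically stored as strings/variants
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name"));
                    }
                }
            }
        }
    }

The trade-off is that typing and indexing of flex columns takes extra work, which document stores like MongoDB sidestep by making custom fields first-class.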

Clear Best Option

I don't think there's a clear best option at the moment. In addition to functionality, there are requirements such as multi-datacenter support, no single points of failure (SPOFs), and search that are supported better in some solutions than in others. The "best" solution will depend on the full range of requirements you have across a variety of factors, and may involve more than one solution, such as the Cassandra / Elasticsearch combination mentioned above.

Options and CAP Theorem

To choose a database system, it is useful to consider the CAP Theorem which I described here:

Big Data or relational database (like Mysql cluster)?

In this model, you can choose any 2 of the 3: Consistency, Availability and Partition Tolerance, but not all 3. If you want consistency, fast reads, availability and fast writes, in that order, consider what that means with CAP. Consistency and Availability are obvious. Fast reads and writes depend on how much scale you have, which drives your Partition Tolerance requirements. If you are talking about 10 million users, that isn't very large in the greater scheme of things and you may be able to get by with less Partition Tolerance. Once you decide whether you want a CA system or a CP system, look along the sides of the triangle in the diagram to see which solutions are suitable.

Couchbase is a CP system so this appears to fit your priority requirements for Consistency and Partition Tolerance (scaling). However, since you only have 10 million users, you should consider how much Partition Tolerance you actually need.

  • Thanks for the response! It sounds to me like the suggestion here is to cross each bridge as I get to it. What I was keen on was a (or two) starting point(s) that would cover all the above. I guess for some it's cheaper to have a technically less suitable solution in favour of working with a known technology for cost reasons. I don't have that restriction right now. Normally for business systems that I work on it's consistency, fast-read, availability, fast-write (in order of priority). I think I'm at the point of drawing a few names at random and just getting on with it! – button Mar 30 '15 at 17:05
  • I just updated my answer with respect to your priorities of consistency, fast-read, availability, and fast-write and how they map onto the CAP Theorem, which can help with your choice. Check out the CAP theorem, how it maps to your requirements, and the solutions on the diagram in the linked answer. – Grokify Mar 30 '15 at 19:43