How to handle massive storage of records in database for user authorization purposes?

Question

I am using Ruby on Rails 3.2.2 and MySQL. I would like to know if it is "advisable" / "desirable" to store, in the database table of one class, a record for each "combination" of instances of two other classes.

That is, I have User and Article models. In order to store all user-article authorization objects, I would like to implement an ArticleUserAuthorization model so that given N users and M articles there are N*M ArticleUserAuthorization records.

Doing so, I can declare and use ActiveRecord associations as follows:

class Article < ActiveRecord::Base
  has_many :user_authorizations, :class_name => 'ArticleUserAuthorization'
  has_many :users, :through => :user_authorizations
end

class User < ActiveRecord::Base
  has_many :article_authorizations, :class_name => 'ArticleUserAuthorization'
  has_many :articles, :through => :article_authorizations
end

However, the above approach of storing all combinations will result in a big database table containing billions billions billions of rows!!! Furthermore, ideally speaking, I am planning to create all authorization records when a User or an Article object is created (that is, I am planning to create all previously mentioned "combinations" at once or, better, in "delayed" batches... either way, this process creates other billions billions of database table rows!!!) and to do the reverse when destroying (by deleting billions billions of database table rows!!!). Furthermore, I am planning to read and update those rows at once when a User or Article object is updated.

So, my doubts are:

Is this approach "advisable" / "desirable"? For example, what kind of performance problems may occur? Or, is it a bad "way" / "prescription" to admin / manage databases with very large database tables?
How may / could / should I proceed in my case (maybe by "re-thinking" entirely how to handle user authorizations in a better way)?

Note: I would use this approach because, in order to retrieve only "authorized objects" when retrieving User or Article objects, I think I need "atomic" user authorization rules (that is, one user authorization record for each user and article object) since the system is not based on user groups like "admin", "registered" and so on. So, I thought that having an ArticleUserAuthorization table would avoid running methods related to user authorizations on each retrieved object (note: those methods involve some MySQL querying that could worsen performance - see my previous question for a sample "authorization" method implementation); instead, I could "simply" access / join the ArticleUserAuthorization table so as to retrieve only "user authorized" objects.

It's hard to give you a possibly better option for your case when we don't know what your use case is. Esp. what is an ArticleUserAuthorization supposed to be? How does it relate to the other objects and how is it generated? — Andrew Marshall, Jun 20 '12 at 06:57
@Andrew Marshall - The question *doesn't focus* on the specific use case related to the `ArticleUserAuthorization` class (the `ArticleUserAuthorization` class is a sample class stated just to help readers better understand my "doubts"); the question *focuses* more on "common" / "general" performance issues and database handling choices. — Backo, Jun 20 '12 at 07:01
The performance of your database depends on how you plan to use it! @AndrewMarshall's comment is absolutely relevant to the question. To be able to advise you on performance issues and database handling choices, the purpose of the data and use cases of the data contribute to the overall solution. — sparrow, Jun 20 '12 at 07:19
@Andrew Marshall @sparrow - So, hoping that this information helps clarify the question, `ArticleUserAuthorization` is supposed to be used to keep authorization information related to all associations between articles and users. The database table related to the `ArticleUserAuthorization` model has Boolean columns. — Backo, Jun 20 '12 at 07:28
Are you saying that for each User there will be one ArticleUserAuthorization record for every Article? I.e. given N users and M articles there are N*M ArticleUserAuthorization records? — Andrew Marshall, Jun 20 '12 at 07:55
Do you *need* individual users to have individual article access? — Matthew, Jun 22 '12 at 17:27
Many moons ago I implemented something similar by having two sorts of access: access as part of a group and access as an individual. The vast majority of access was granted at the group level (e.g. this article is public, this article is admin only) with a very small number of per user overrides, so my authorisations table was of the same order of magnitude as the articles table. — Frederick Cheung, Jun 22 '12 at 17:51
@Frederick Cheung - Do you mean that you implemented authorizations as I "explain" / "would make" in the question content? If so, for example, how did you manage those authorizations when a user or an article was created / destroyed? — Backo, Jun 22 '12 at 18:05
No, not quite as you describe - I used group level permissions to avoid creating the vast majority of the n*m rows you describe — Frederick Cheung, Jun 22 '12 at 21:01
@Rudi Verago - I read the cancan gem documentation but, in my case, it doesn't "work" / "support" what I am looking for. Precisely, if I use the cancan gem, I am in the *same* "critical" situation as I am now (note: the situation is "critical" because many SQL statements must run in order to retrieve user authorizations - see [this question for more information](http://stackoverflow.com/questions/11074008/what-is-a-common-approach-to-scope-records-by-those-that-an-user-can-read)). — Backo, Jun 23 '12 at 17:48
@Rudi Verago - So, I can not fetch "authorized" articles (for example in the controller index action) as described in the "Fetching Records" section of the [Official Documentation](https://github.com/ryanb/cancan/wiki/Defining-Abilities-with-Blocks): "A block's conditions are only executable through Ruby. If you are Fetching Records using accessible_by it will raise an exception. [...]" — Backo, Jun 23 '12 at 17:48
Are you really planning to store every single record as to a user's permission on an article in the table? Would it not be better to only record a value when a user is given permission to the article - I expect that would be smaller than a xref of every single record. In addition, creating a security model that includes "roles" or "groups" would alleviate much of this pain, as long as you can assign a user to a role or group, you can still keep the individual article access as well. — N West, Jun 27 '12 at 13:30

Matthew · Answer 1 · 2012-06-22T15:34:45.877

6

The fact of the matter is that if you want article-level permissions per user then you need a way to relate Users to the Articles they can access. This necessitates a minimum of N*A records (where N is the number of users and A is the number of uniquely permissioned articles).

The 3NF approach to this would be, as you suggested, to have a UsersArticles set... which would be a very large table (as you noted).

Consider that this table would be accessed a whole lot... This seems to me like one of the situations in which a slightly denormalized approach (or even noSQL) is more appropriate.

Consider the model that Twitter uses for their user follower tables:

Jeff Atwood on the subject

And High Scalability Blog

A sample from those pieces is a lesson learned at Twitter: querying followers from a normalized table puts tremendous stress on the Users table. Their solution was to denormalize followers so that a user's followers are stored on their individual user settings.

Denormalize a lot. Single handedly saved them. For example, they store all a user IDs friend IDs together, which prevented a lot of costly joins. - Avoid complex joins. - Avoid scanning large sets of data.

I imagine a similar approach could be used to serve article permissions and avoid a tremendously stressed UsersArticles single table.

edited Jun 22 '12 at 15:34

answered Jun 22 '12 at 15:29

Matthew


I read the linked posts, but given that I am not an expert on these matters, do you mean that I should create / add a NoSQL database (for example, MongoDB) to my Rails application in order to store user authorization data? If not, how could I "slightly denormalize" the SQL database in my case, since the linked posts are not enough to make me understand that? – Backo Jun 22 '12 at 16:19
Also, what does "stored on their individual settings" mean in the sentence "[...] Their solution was to denormalize followers so that a user's follower's are stored on their individual user settings"? If I understood correctly, each user-follower association is stored in a "user_settings" database table column for each user... assuming that the column's data type is TEXT and it contains all follower ids, could there be so many followers that a SQL error is generated because the TEXT data type runs out of space? – Backo Jun 22 '12 at 17:09
I was at a session the twitter guys gave around the time of some of those posts (railsconf europe 2008 ?) and at that point in time they had a follower_ids serialised array column on the users table with a user's followers. That was a while ago - before people like stephen fry had 4 million followers – Frederick Cheung Jun 22 '12 at 17:55
@Backo you have to accept that if you **need** article level permissions then you cannot get away from storing "billions and billions" of rows. If you need to uniquely connect user "ABC" to article "123" then you must have a data point indicating this... your design goals should be to isolate and optimize these lookups. – Matthew Jun 26 '12 at 14:20
@Backo, in the "traditional" db sense I mean "a row"... If you **need** to correlate **single** articles to **individual** users then you will need to have some indicator for each... this could be anything you want... but you should think of it as a row if you're developing a data model. – Matthew Jun 26 '12 at 23:05

score 4 · Answer 2 · answered Jun 22 '12 at 21:24

If there really is the prospect of "a big database table containing billions billions billions of rows" then perhaps you should craft a solution for your specific needs around a (relatively) sparsely populated table.

Large database tables pose a significant performance challenge in how quickly the system can locate the relevant row or rows. Indexes and primary keys are really needed here; however, they add to the storage requirements and also require CPU cycles to be maintained as records are added, updated, and deleted. Even so, heavy-duty database systems also have partitioning features (see http://en.wikipedia.org/wiki/Partition_(database) ) that address such row location performance issues.

A sparsely populated table can probably serve the purpose assuming some (computable or constant) default can be used whenever no rows are returned. Insert rows only wherever something other than the default is required. A sparsely populated table will require much less storage space and the system will be able to locate rows more quickly. (The use of user-defined functions or views may help keep the querying straightforward.)

If you really cannot make a sparsely populated table work for you, then you are quite stuck. Perhaps you can make that huge table into a collection of smaller tables, though I doubt that's of any help if your database system supports partitioning. Besides, a collection of smaller tables makes for messier querying.

So let's say you have millions or billions of Users who may or may not have certain privileges regarding the millions or billions of Articles in your system. What, then, at the business level determines what a User is privileged to do with a given Article? Must the User be a (paying) subscriber? Or may he or she be a guest? Does the User apply (and pay) for a package of certain Articles? Might a User be accorded the privilege of editing certain Articles? And so on and so forth.

So let's say a certain User wants to do something with a certain Article. In the case of a sparsely populated table, a SELECT on that grand table UsersArticles will either return 1 row or none. If it returns a row, then one immediately knows the ArticleUserAuthorization, and can proceed with the rest of the operation.

If no row, then maybe it's enough to say the User cannot do anything with this Article. Or maybe the User is a member of some UserGroup that is entitled to certain privileges to any Article that has some ArticleAttribute (which this Article has or has not). Or maybe the Article has a default ArticleUserAuthorization (stored in some other table) for any User that does not have such a record already in UsersArticles. Or whatever...
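The lookup just described can be sketched in plain Ruby. This is only an illustration under stated assumptions: the `grants` hash stands in for the sparsely populated UsersArticles table, and the method name, the permission symbols, and the `:read` default are all hypothetical.

```ruby
# A minimal sketch of a sparse authorization lookup.
# `grants` stands in for the sparse table: it holds only the
# [user_id, article_id] pairs that differ from the default.
# All names and values here are hypothetical.
def permission_for(grants, user_id, article_id, default = :read)
  # A stored row wins; otherwise fall back to the default.
  grants.fetch([user_id, article_id], default)
end

grants = {
  [1, 10] => :edit, # user 1 was explicitly granted edit rights
  [2, 10] => :none  # user 2 was explicitly locked out
}

puts permission_for(grants, 1, 10)
puts permission_for(grants, 3, 10) # no row stored, so the default applies
```

In a real application the default could itself be computed (from a user group or an article attribute) rather than being a constant.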

The point is that many situations have a structure and a regularity that can be used to help reduce the resources needed by a system. Human beings, for instance, can add two numbers with up to 6 digits each without consulting a table of over half a trillion entries; that's taking advantage of structure. As for regularity, most folks have heard of the Pareto principle (the "80-20" rule - see http://en.wikipedia.org/wiki/Pareto_principle ). Do you really need to have "billions billions billions of rows"? Or would it be truer to say that about 80% of the Users will each only have (special) privileges for maybe hundreds or thousands of the Articles - in which case, why waste the other "billions billions billions" (rounded :-P).
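To put rough numbers on that argument, here is a back-of-the-envelope comparison. Every figure is an assumed example (one million users, one million articles, a thousand special grants per user), not data from the question.

```ruby
# Row-count comparison between a dense N*M table and a sparse one.
# All figures below are assumptions chosen for illustration.
def dense_rows(users, articles)
  users * articles
end

def sparse_rows(users, grants_per_user)
  users * grants_per_user
end

users = 1_000_000
articles = 1_000_000
grants_per_user = 1_000

puts dense_rows(users, articles)         # 1000000000000
puts sparse_rows(users, grants_per_user) # 1000000000
```

Under these assumptions the sparse table is a thousand times smaller than the full cross product.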

With "sparsely populated table" do you mean that I should populate (create or, if a record doesn't exist, update) database table rows *only* when needed? That is, for example, to populate that table *only* when a related user authorization *method* runs (that is, practically speaking, when in my controllers I run a `readable_by_user?`, `editable_by_user?`, ... method)? But, by making that, how could / should I retrieve "*authorized*" records (in my case, records are article objects) if *no* user authorization table row (in my case, the table is `article_user_authorizations`) exists yet? — Backo, Jun 23 '12 at 03:09
(1) "to populate that table only when a related user authorization method runs": You can populate on the fly during the method call if that is preferable; otherwise, no, there is no need to - the method call could instead compute and return the default. The only method that really needs to populate is one that _grants_ authorization, in particular an authorization not already granted by the default. — rskar, Jun 23 '12 at 15:32
(2) "how could / should I retrieve 'authorized' records": Most database systems (such as Oracle, DB2, and SQL Server) provide a means to make this sort of thing possible, such as: views, stored procedures, user-defined functions, etc. They can be used as part of a query. You can even introduce a computed column such as IS_DEFAULT in the row set so you can know the source. The only tricky part I can imagine may be in the granting or removing of privileges, but even then that sort of thing could be handled via a stored procedure. — rskar, Jun 23 '12 at 15:32

score 4 · Answer 3 · answered Jun 25 '12 at 19:01

You don't have to re-invent the wheel. ACL (Access Control List) frameworks have dealt with this kind of problem for ages now, and most efficiently if you ask me. You have resources (Article) or, even better, resource groups (Article Category/Tag/etc.). On the other hand you have users (User) and user groups. Then you would have a relatively small table which maps resource groups to user groups, and another relatively small table which holds exceptions to this general mapping. Alternatively, you can have rule sets that must be satisfied to access an article. You can even have dynamic groups like authors_friends, depending on your user-user relations.
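A minimal sketch of that two-table idea in plain Ruby, not taken from any particular ACL framework. The `group_rules` set plays the role of the group-to-group mapping table, `exceptions` plays the role of the per-user override table, and all names and data shapes are assumptions.

```ruby
require 'set'

# group_rules: Set of [user_group, resource_group] pairs that grant access.
# exceptions:  Hash of [user_id, article_id] => true/false overrides.
# Both structures and all names are hypothetical stand-ins for tables.
def allowed?(user, article, group_rules, exceptions)
  key = [user[:id], article[:id]]
  # A per-user exception overrides the general group mapping.
  return exceptions[key] if exceptions.key?(key)

  user[:groups].any? do |user_group|
    article[:groups].any? do |resource_group|
      group_rules.include?([user_group, resource_group])
    end
  end
end

group_rules = Set[[:subscribers, :premium], [:guests, :public]]
exceptions  = { [2, 10] => true }

subscriber = { id: 1, groups: [:subscribers] }
guest      = { id: 2, groups: [:guests] }
article    = { id: 10, groups: [:premium] }

puts allowed?(subscriber, article, group_rules, exceptions)
puts allowed?(guest, article, group_rules, exceptions)
```

Both tables stay small because their size depends on the number of groups and exceptions, not on users times articles.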

Just take a look at any decent ACL framework and you would have an idea how to handle this kind of problem.

score 1 · Answer 4 · answered Jun 28 '12 at 12:49

You should look at hierarchical role-based access control (RBAC) solutions. You should also consider sensible defaults.

Are all users allowed to read an article by default? Then store the deny exceptions.
Are all users not allowed to read an article by default? Then store the allow exceptions.
Does it depend on the article whether the default is allow or deny? Then store that in the article, and store both allow and deny exceptions.
Are articles put into issues, and issues collected into journals, and journals collected into fields of knowledge? Then store authorizations between users and those objects.
What if a User is allowed to read a Journal but is denied a specific Article? Then store User-Journal:allow, User-Article:deny and the most specific instruction (in this case the article) takes precedence over the more general (in this case the default, and the journal).
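The "most specific instruction wins" rule from the list above can be sketched as follows. This assumes a two-level hierarchy (article inside journal) and a per-article default; the rule keys, field names, and `:allow` / `:deny` symbols are hypothetical.

```ruby
# rules:   Hash of [user_id, scope, scope_id] => :allow or :deny
# article: Hash with :id, :journal_id and a :default (:allow or :deny)
# All names here are hypothetical; only two hierarchy levels are modeled.
def can_read?(rules, user_id, article)
  # Ordered from most specific to most general.
  lookups = [
    [user_id, :article, article[:id]],
    [user_id, :journal, article[:journal_id]]
  ]
  lookups.each do |key|
    return rules[key] == :allow if rules.key?(key)
  end
  # No stored rule at any level: use the article's own default.
  article[:default] == :allow
end

rules = {
  [1, :journal, 5]  => :allow, # user 1 may read journal 5...
  [1, :article, 10] => :deny   # ...except article 10
}

puts can_read?(rules, 1, { id: 10, journal_id: 5, default: :deny })
puts can_read?(rules, 1, { id: 11, journal_id: 5, default: :deny })
```

Deeper hierarchies (issues, fields of knowledge) would add more entries to the lookup list in the same order.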

deepak · Answer 5 · 2012-06-24T18:21:43.187

Shard the ArticleUserAuthorization table by user_id. The principle is to reduce the effective dataset size on the access path. Some data will be accessed more frequently than other data, and it will be accessed in a particular way. On that path the size of the result set should be small. Here we do that by having a shard. Also, optimize that path further, for example with an index if it is a read workload, with caching, etc.

This particular shard is useful if you want all the articles authorized by a user.
If you want to query by article as well, then duplicate the table and shard by article_id as well. With this second sharding scheme, we have denormalized the data. The data is now duplicated and the application would need to do extra work to maintain data consistency. Writes will also be slower, so use a queue for writes.
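As a rough sketch of the two sharding schemes, the Ruby below routes a write to one shard per scheme. The answer does not say how shards are chosen; simple modulo routing, the shard count of 8, and the table names are all assumptions made for illustration.

```ruby
# Modulo routing is an assumption; the answer does not specify a
# shard function. Table names are hypothetical.
def shard_for(id, shard_count)
  id % shard_count
end

# With both schemes in place, every grant must be written twice:
# once to the user-sharded copy and once to the article-sharded copy.
def tables_for_write(user_id, article_id, shard_count)
  [
    "authorizations_by_user_#{shard_for(user_id, shard_count)}",
    "authorizations_by_article_#{shard_for(article_id, shard_count)}"
  ]
end

puts tables_for_write(42, 7, 8).inspect
```

Reading "all articles for user 42" then touches a single user shard, while the double write is the consistency cost described above.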

The problem with sharding is that queries across shards are ineffective; you will need a separate reporting database. Pick a sharding scheme and think about recomputing shards.

For truly massive databases, you would want to split it across physical machines. eg. one or more machines per user's articles.

Some NoSQL suggestions are:

1. Relationships are graphs, so look at graph databases, particularly https://github.com/twitter/flockdb
2. Redis, by storing the relationship in a list.
3. A column-oriented database like HBase, which can be treated like a sparse nested hash.

All this depends on the size of your database and the types of queries.

EDIT: modified answer. The question previously had 'has_one' relationships. Also added NoSQL suggestions 1 & 2.

I updated the question because I was wrong on `ActiveRecord::Associations` statements. — Backo, Jun 23 '12 at 18:01

score 0 · Answer 6 · answered Jun 25 '12 at 15:54


First of all, it is good to think about default values and behaviors and not store them in the database. For example, if by default, a user cannot read an article unless specified, then, it does not have to be stored as false in the database.

My second thought is that you could have a users_authorizations column in your articles table and an articles_authorizations column in your users table. Those 2 columns would store user ids and article ids in the form 3,7,65,78,29,78. For the articles table, for example, this would mean users with ids 3,7,65,78,29,78 can access the article. Then you would have to modify your queries to retrieve users this way:

@article = Article.find(34)
# users_authorizations holds a comma-separated list of user ids
@users = User.find(@article.users_authorizations.split(','))

Each time an article or a user is saved or destroyed, you would need callbacks to update the authorization columns.

class User < ActiveRecord::Base
  # Keep the denormalized authorization columns in sync after each save
  after_save :update_articles_authorizations

  def update_articles_authorizations
    #...
  end
end

Do the same for the Article model.

Last thing: if you have different types of authorizations, don't hesitate to create more columns like user_edit_authorization.

With these combined techniques, the quantity of data and hits to the DB are minimal.

answered Jun 25 '12 at 15:54

Wawa Loo


As I said in a previous comment: assuming that the `user_authorization` column's data type is TEXT and it contains all user / article ids, could there be so many ids that a SQL error is generated because the TEXT data type runs out of space? – Backo Jun 25 '12 at 17:17
Possibly, but you could always use LONGTEXT columns. How many would be manually/explicitly authorized for a certain article? – Wawa Loo Jun 25 '12 at 18:09
Doesn't the LONGTEXT data type also have a limit on the maximum number of chars? – Backo Jun 25 '12 at 19:02
It has, but it's pretty long, take a look at that: http://stackoverflow.com/questions/4443477/rails-3-migration-with-longtext. Could you answer my question? – Wawa Loo Jun 25 '12 at 19:10
"How many would be manually/explicitly authorized for a certain article?". Theoretically speaking, it could be billions of billions so also the LONGTEXT Data Type is not enough. – Backo Jun 26 '12 at 07:04

score 0 · Answer 7 · answered Jun 28 '12 at 06:21

Reading through all the comments and the question, I still doubt the validity of storing all the combinations. Think about the question in another way: who will populate that table? The author of the article, a moderator, or someone else? And based on what rule? You can imagine how difficult that is. It's impossible to populate all the combinations.

Facebook has a similar feature. When you write a post, you can choose who you want to share it with. You can select 'Friends', 'Friends of Friends', 'Everyone' or a custom list. The custom list allows you to define who will be included and excluded. In the same way, you only need to store the special cases, like 'include' and 'exclude', and all the remaining combinations fall into the default case. By doing this, N*M could be reduced significantly.
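A minimal sketch of that include/exclude idea in plain Ruby. This is not Facebook's actual implementation; the field names, the audience symbols, and the choice to let an exclusion override everything else are assumptions.

```ruby
require 'set'

# post: Hash with :audience (:everyone or :friends) plus :include and
# :exclude Sets of user ids. Only the special cases are stored per post.
# All names, and the rule that exclusion wins, are assumptions.
def visible_to?(post, viewer_id, friends_of_author)
  return false if post[:exclude].include?(viewer_id)
  return true if post[:include].include?(viewer_id)

  # Nobody listed explicitly: fall back to the default audience.
  case post[:audience]
  when :everyone then true
  when :friends  then friends_of_author.include?(viewer_id)
  else false
  end
end

post    = { audience: :friends, include: Set[7], exclude: Set[3] }
friends = Set[3, 4]

puts visible_to?(post, 3, friends) # a friend, but explicitly excluded
puts visible_to?(post, 7, friends) # not a friend, but explicitly included
```

Each post stores only its two short lists, so storage grows with the number of exceptions rather than with the number of users.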
