
I am currently building a website that will contain lots of active and inactive content. When a content item is published, it is active, and after some time it becomes inactive. I want to track all users who opened a content item during its active period. What is the best way to store this information in the database?

I need the database to answer the following questions:

1. Which users opened a specific content item? This is queried rarely, maybe two or three times per content item by an admin.
2. Which active content items were opened by a specific user?
3. Which inactive content items (items that were active when they were opened but have since become inactive) were opened by a specific user?

Queries 2 and 3 run more often than the first one (maybe every two or three weeks). Users can check in their profile which content they have opened.

At first I thought about a database table with three columns: "user id", "content id", and "status", where "status" is true for active content and false for inactive content. "User id" and "content id" together form the primary key, and "status" is a foreign key from the content table. But then I thought about the size of this table. In a few years there might (hopefully) be 2 million nodes and 250k users, so in the worst case there could be 500 billion rows... I think that would hurt performance badly.
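
The worst-case figure comes from multiplying the two counts. As a rough sketch, the calculation below also shows a much smaller figure for comparison; the average of 200 opens per user is purely an illustrative assumption and does not come from the question.

```python
# Back-of-the-envelope row counts for a (user_id, content_id) tracking table.
CONTENT_ITEMS = 2_000_000   # from the question
USERS = 250_000             # from the question

# Worst case: every user opens every content item.
worst_case_rows = CONTENT_ITEMS * USERS

# Hypothetical average: each user opens 200 items (assumed, not from the question).
ASSUMED_OPENS_PER_USER = 200
assumed_rows = USERS * ASSUMED_OPENS_PER_USER

print(f"worst case:      {worst_case_rows:,} rows")
print(f"assumed average: {assumed_rows:,} rows")
```

Only pairs that actually occur are stored, so the table grows with real usage rather than with the product of both counts.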

Therefore I thought about another way: I create two database tables, one called "content_opened_by_users" and the other called "user_opened_content". The former contains two columns: "content id" and "users", where "users" is a serialized array containing the ids of all users who opened the content item in that row. The second table contains three columns: "user id", "active content", and "inactive content", where "active content" and "inactive content" are also serialized arrays, containing the ids of the active and inactive content items the corresponding user opened.

Whenever a user opens a content item, the server loads the corresponding "users" array from the "content_opened_by_users" table and the corresponding "active content" array from the "user_opened_content" table. If the user id is not yet in the "users" array, it is added, and if the content id is not yet in the "active content" array, it is added as well. Then both arrays are written back to the database.

Now, if I query all users who opened a certain content item, I get an array from the "content_opened_by_users" table. And if I query all active/inactive nodes opened by a specific user, I get the "active content" and "inactive content" arrays from the "user_opened_content" table. Then I check whether the "active content" array contains content ids that have become inactive in the meantime, move them to the "inactive content" array, and write both arrays back to the database.

I know that I create redundant data in these two tables, but I hope this will improve performance.

Would this be a sensible way to implement the user tracking? Or is there another, more efficient way?

I would welcome any advice! Thank you very much. Daniel

Daniel

1 Answer


First, you might want to read about database normalization.

In order to track the information you described, I would use separate tables for users, content, and access like this:

table    | columns
-------------------
users    | id, login, ...
content  | id, title, active, ...
access   | id, user_id, content_id, timestamp, ...

You can then store users and content items irrespective of who opened which item. The access table contains entries about which user opened which content item at which time. This allows you to track numerous statistics without having to store any information about a user not opening an item.
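
As a minimal sketch of that layout, the three tables could be created as follows. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server; the column types, the `NOT NULL`/`UNIQUE` choices and the sample rows are assumptions, not part of the proposal above.

```python
import sqlite3

# Schema following the table sketch above; types and constraints are assumed.
SCHEMA = """
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    login TEXT NOT NULL UNIQUE
);
CREATE TABLE content (
    id     INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    active INTEGER NOT NULL DEFAULT 1   -- 1 = active, 0 = inactive
);
CREATE TABLE access (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(id),
    content_id INTEGER NOT NULL REFERENCES content(id),
    timestamp  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""

def create_db():
    """Create an in-memory database with the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(SCHEMA)
    return conn

conn = create_db()
conn.execute("INSERT INTO users (id, login) VALUES (1, 'testuser')")
conn.execute("INSERT INTO content (id, title) VALUES (10, 'test article')")
conn.execute("INSERT INTO access (user_id, content_id) VALUES (1, 10)")

# An access row pointing at a non-existent content item is rejected.
try:
    conn.execute("INSERT INTO access (user_id, content_id) VALUES (1, 999)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print("dangling access row rejected:", fk_rejected)
```

The `access` table only ever gains a row when a user actually opens an item, which is what keeps it far below the users × content worst case.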

The main queries would be:

  1. Which users opened an item

    SELECT DISTINCT users.login FROM access JOIN users ON access.user_id = users.id JOIN content ON access.content_id = content.id WHERE content.title LIKE '%test%'
    
  2. Active content opened by user

    SELECT DISTINCT content.title FROM access JOIN users ON access.user_id = users.id JOIN content ON access.content_id = content.id WHERE content.active = true AND users.login = 'testuser'
    
  3. Inactive content opened by a user

    SELECT DISTINCT content.title FROM access JOIN users ON access.user_id = users.id JOIN content ON access.content_id = content.id WHERE content.active = false AND users.login = 'testuser'
    

    This assumes that content can only be opened while it is active. To track when an item's status changed, add another table that records the timestamp of the last change, and compare that timestamp with the access entries.
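
The three queries above can be tried out with a small, self-contained sketch. It again uses Python's `sqlite3` module instead of MySQL, the sample rows are made up, and query 1 filters by content id rather than by title, which is a variation on the query shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (id INTEGER PRIMARY KEY, login TEXT NOT NULL);
CREATE TABLE content (id INTEGER PRIMARY KEY, title TEXT NOT NULL, active INTEGER NOT NULL);
CREATE TABLE access  (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL,
                      content_id INTEGER NOT NULL, timestamp TEXT);
INSERT INTO users   VALUES (1, 'testuser'), (2, 'other');
INSERT INTO content VALUES (10, 'test article', 1), (11, 'old news', 0);
-- testuser opened both items (one of them twice); other opened only the active one.
INSERT INTO access (user_id, content_id) VALUES (1, 10), (1, 10), (1, 11), (2, 10);
""")

def users_who_opened(conn, content_id):
    """Query 1: logins of all users who opened the given content item."""
    rows = conn.execute(
        "SELECT DISTINCT users.login FROM access "
        "JOIN users ON access.user_id = users.id "
        "WHERE access.content_id = ? ORDER BY users.login",
        (content_id,))
    return [row[0] for row in rows]

def content_opened_by(conn, login, active):
    """Queries 2 and 3: titles opened by a user, filtered by current status."""
    rows = conn.execute(
        "SELECT DISTINCT content.title FROM access "
        "JOIN users ON access.user_id = users.id "
        "JOIN content ON access.content_id = content.id "
        "WHERE content.active = ? AND users.login = ? ORDER BY content.title",
        (1 if active else 0, login))
    return [row[0] for row in rows]

print(users_who_opened(conn, 10))
print(content_opened_by(conn, "testuser", active=True))
print(content_opened_by(conn, "testuser", active=False))
```

Because the status lives only in the `content` table, an item that becomes inactive moves from the second result to the third without any change to the `access` rows.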

Note that these queries are not optimized at all. You should also add the corresponding FOREIGN KEY constraints to ensure data consistency, and consider adding indexes to improve performance.
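
To illustrate the point about indexes, here is a small sketch that adds one index per lookup direction and asks the database which access path it would pick. It runs against SQLite through Python's `sqlite3` module, so the plan output differs from MySQL's `EXPLAIN`; the index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE access (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    content_id INTEGER NOT NULL,
    timestamp  TEXT
);
-- Hypothetical index names; one index per lookup direction.
CREATE INDEX idx_access_user    ON access (user_id);
CREATE INDEX idx_access_content ON access (content_id);
""")

def query_plan(conn, sql, params=()):
    """Return the human-readable detail strings of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description

by_user = query_plan(conn, "SELECT content_id FROM access WHERE user_id = ?", (1,))
by_content = query_plan(conn, "SELECT user_id FROM access WHERE content_id = ?", (10,))

print(by_user)
print(by_content)
```

If a plan line mentions the index name, the lookup does not have to scan the whole table, which matters more as `access` grows.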

andy
  • Thank you for your fast response! Sorry, I didn't mention that I already have two separate tables for the users and the content, so the user/content management in general is already working fine. Your proposal with the access table seems similar to my first idea. As I mentioned, I worry about the performance. Let's say there are millions of items and thousands of users in a few years. If every user opened every item (OK, that's completely unrealistic, but I just want to be prepared), there would be several billion entries in the access table... – Daniel Nov 02 '14 at 17:46
  • ... and I guess, this would decrease the performance heavily, right? – Daniel Nov 02 '14 at 17:48
  • That depends... The maximum number of rows for MySQL is *really* high. So thousands of users with hundreds of accesses per day are not a problem, as long as you add the right indexes. – andy Nov 02 '14 at 17:54
  • Do I really need indexes in the access table? I would set user_id and content_id as the primary key, so I don't need the access_id field. And the primary key is always indexed... Or should I index user_id and content_id separately? – Daniel Nov 02 '14 at 18:07
  • The columns in the access table should have [foreign key constraints](http://dev.mysql.com/doc/refman/5.6/en/create-table-foreign-keys.html), which require indexes to work. MySQL uses them to match the data from different tables. However, changing indexes later on is not a big issue. BTW: Frameworks like [Doctrine](http://www.doctrine-project.org/projects/orm.html) will generate these constraints automatically. – andy Nov 02 '14 at 18:16
  • Again, thanks for the fast response! Can't I use two foreign keys together as the primary key? – Daniel Nov 02 '14 at 18:54
  • Sure you can, but if a user looks at a document more than once, you would have trouble storing that in your table. – andy Nov 02 '14 at 19:14
  • That would be no problem; I only need the first time a user opened an item. So I think I can skip the access_id column, right? Well, if user_id (as a foreign key) and content_id (as a foreign key) together form the primary key of the access table: should I index them separately? Or should I just index the second column of the primary key, because according to http://stackoverflow.com/a/3048260/3083978 performance will decrease if I query a single column of a multi-column index? – Daniel Nov 02 '14 at 23:41
  • This is not a discussion forum and your original question has been answered already. See also http://stackoverflow.com/help/dont-ask – andy Nov 03 '14 at 12:01
  • I'd invite you to [chat](http://chat.stackoverflow.com/faq) to discuss this, but that requires 20 rep. Mark this question as resolved, pose another one, and you will soon be there! :) – andy Nov 03 '14 at 16:58
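
The composite-key variant discussed in the comments can be sketched as follows, assuming only the first open per user and item needs to be kept. The example uses Python's `sqlite3` module, where the statement is spelled `INSERT OR IGNORE`; MySQL's counterpart is `INSERT IGNORE`. The function name and the timestamps are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE access (
    user_id    INTEGER NOT NULL,
    content_id INTEGER NOT NULL,
    timestamp  TEXT NOT NULL,
    PRIMARY KEY (user_id, content_id)   -- one row per (user, content) pair
);
""")

def record_first_open(conn, user_id, content_id, timestamp):
    """Store the open only if this (user, content) pair is not recorded yet."""
    # SQLite spelling; MySQL's counterpart is INSERT IGNORE.
    conn.execute(
        "INSERT OR IGNORE INTO access (user_id, content_id, timestamp) "
        "VALUES (?, ?, ?)",
        (user_id, content_id, timestamp))

record_first_open(conn, 1, 10, "2014-11-02 17:46:00")
record_first_open(conn, 1, 10, "2014-11-03 12:01:00")  # repeat open, ignored

rows = conn.execute("SELECT user_id, content_id, timestamp FROM access").fetchall()
print(rows)
```

In MySQL, a composite primary key on (user_id, content_id) can serve lookups by user_id alone through its leftmost prefix, while lookups by content_id alone would typically want a separate index on that column.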