
I got a new requirement from the business about "real time" data/statistics. They want to show how our system is performing in real time.

I am not sure how to do it but here are my thoughts:

I don't think it's possible to get the data every second, because cron jobs run at most once per minute. So, without telling them that, I said YES, it's possible.

Now my problem is this: how can I run a cron job that gets statistics on our sites (sales, impressions, CPC, etc.)?

Example:

From 9:01 AM to 9:02 AM I have:

  • 41 views on product 1
  • 1 order
  • 8 referral clicks from clients
  • 2 added to wish list

From 9:02 AM to 9:03 AM I have:

  • 57 views on product 1
  • 0 orders
  • 13 referral clicks from clients
  • 0 added to wish list

Total:

  • 98 views on product 1
  • 1 order
  • 21 referral clicks from clients
  • 2 added to wish list

How can I make sure I won't count duplicates if, for some reason, the database is slow and does not process the information on time?

Thanks

Edit: The company has 200 employees in 3 different states, consisting of sales, business analysts, tech, accounting, and executive staff, and these people can read these reports.

In the last year we hired 20 employees, so it will grow a little. For traffic data, it is hard to tell exactly how much data we get per minute. The estimate is approx. 2.5k to 10k per minute.

We just ordered 3 PowerEdge R510 (Intel® Xeon® E5503, 2.0Ghz, 4M Cache, 12GB Memory (3x4GB), 1333MHz Dual Rank, 4 x 300GB 15K RPM Serial-Attach SCSI 6Gbps RAID 5).

  • how many users do you have and will that grow? how much data do you get per minute? What kind of servers do you have (how many, speed, HDD, memory)? – Book Of Zeus Dec 02 '11 at 00:31
  • are these servers only used for reporting, or used for other purposes? – Book Of Zeus Dec 02 '11 at 01:23
  • no, my idea is to use server 1 for reporting (and use the current one for backup), and server 2 for the new client system I am building, load balanced (and use the old one for backup) – Gino Sullivan Dec 02 '11 at 01:29

4 Answers


Here's what I recommend based on your servers, employees, and data volume. Because you are using 1 server (and 1 backup), the capacity of your drives should be enough for a while, unless you want to archive complete data on this server. Data can grow rapidly, and I would plan to increase the capacity or archive the data somewhere else.

Now, because you have a lot of people who can request reporting data, the main idea is to retrieve data as fast as possible to make sure you don't lock records (especially if you are using MyISAM tables, which have table-level locking, vs InnoDB, which has row-level locking).

Use your indexes (unique ones where you need them) wisely, and store your data as efficiently as possible, keyed by timestamp.

What you can also do is summarize your data, which can simplify your queries. Although this is not common practice in databases, since it does not respect the normal forms, you can get great performance; it's just a pain to maintain.

To be honest, a cron that runs every minute is fine, since you have the time when you save each record, but it is possible to get data every second. I recommend that when you process a record, you mark it as "processed" (or some other status) so that you don't count the same record twice.
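For instance, the cron job can claim a batch of raw rows before summarizing them, so a slow or overlapping run never counts the same rows twice. A minimal sketch, assuming a raw `events` table with `status` and `batch_id` columns (these names are illustrative, not from your schema):

```sql
-- 1. Claim all unprocessed rows under a batch id chosen by this run.
UPDATE events
SET status = 'processing', batch_id = 42
WHERE status = 'new';

-- 2. Summarize only the rows this run claimed.
SELECT COUNT(*) FROM events WHERE batch_id = 42;

-- 3. Mark the batch done once the summary row has been written.
UPDATE events
SET status = 'processed'
WHERE batch_id = 42;
```

Because each run only ever aggregates the rows it claimed in step 1, rows that arrive while the run is in progress simply wait for the next run.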

Now, when you summarize your data, make sure you optimize your queries; you can also check what EXPLAIN outputs and then make a decision.


EDIT: Summarizing data (which does not respect database normalization) will get you great performance, since you only query records without using aggregate functions or joined tables, with a minimal WHERE clause.

Example:

98 views on product 1
1 order
21 referral click from clients
2 added to wishlist

can be:

SELECT
 views, orders, referral, wishlist
FROM
 summarize_stats_20111201 /* daily table for example */
WHERE
 `time` between 1322791200 /*2011-12-01 21:00:00*/ AND 1322791260 /*2011-12-01 21:01:00*/;

views has the total number of views, in this example 98

orders has the total number of orders, in this example 1

referral has the total number of referrals, in this example 21

wishlist has the total number of wishlist additions, in this example 2

These are pre-calculated data stored in a summary table (this is why I said it "does not respect database normalization", because you normally never store calculated data in an RDBMS), but if you need the data instantly, this is a way you can do it.


EDIT 2: Here's an example of maintaining this solution:

You have a cron job that maintains the tables. Its job is to create the table for the next day, or whatever you need.

// in PHP
$date = date('Ymd', strtotime('+1 day')); // for a daily table
$sql = 'CREATE TABLE IF NOT EXISTS the_database.summarize_stats_' . $date . ' (/* same columns every day */)';

So when you insert, make sure you use the right table name and use ON DUPLICATE KEY UPDATE:

// in PHP
$sql = 'INSERT INTO summarize_stats_20111201 SET /* all the fields you need */ ON DUPLICATE KEY UPDATE views = views + 1';

for example, if you want to increment the view count.
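Note that ON DUPLICATE KEY UPDATE only fires when the new row collides with a PRIMARY KEY or UNIQUE index, so the daily table needs one. A minimal sketch of such a table (the column names are assumptions for illustration; MyISAM is used so the daily tables can later be combined into a MERGE table):

```sql
CREATE TABLE IF NOT EXISTS summarize_stats_20111201 (
    `time`     INT UNSIGNED NOT NULL,   -- unix timestamp of the minute
    product_id INT UNSIGNED NOT NULL,
    views      INT UNSIGNED NOT NULL DEFAULT 0,
    orders     INT UNSIGNED NOT NULL DEFAULT 0,
    referral   INT UNSIGNED NOT NULL DEFAULT 0,
    wishlist   INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (`time`, product_id)    -- the key the upsert collides on
) ENGINE=MyISAM;
```

With this key, one row exists per minute and product, and repeated inserts for the same minute just bump the counters.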

What I also forgot to mention: if you need to query 1 week of data, you will have to create a MERGE table. This way you can do something like:

SELECT
 views, orders, referral, wishlist
FROM
 summarize_stats_2011 /* yearly merge table for example */
WHERE
 `time` between 1322272800 /*2011-11-25 21:00:00*/ AND 1322791260 /*2011-12-01 21:01:00*/;

This way you don't have to UNION ALL tons of queries.
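A sketch of how such a merge table could be declared (the MERGE engine requires identical underlying MyISAM tables; table and column names here are illustrative):

```sql
CREATE TABLE summarize_stats_2011 (
    `time`     INT UNSIGNED NOT NULL,
    product_id INT UNSIGNED NOT NULL,
    views      INT UNSIGNED NOT NULL DEFAULT 0,
    orders     INT UNSIGNED NOT NULL DEFAULT 0,
    referral   INT UNSIGNED NOT NULL DEFAULT 0,
    wishlist   INT UNSIGNED NOT NULL DEFAULT 0,
    INDEX (`time`)
) ENGINE=MERGE
  UNION = (summarize_stats_20111130, summarize_stats_20111201)
  INSERT_METHOD = NO;
```

The nightly maintenance cron can then run `ALTER TABLE summarize_stats_2011 UNION = (...)` to append each new daily table to the list.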


Save the timestamp of the records in the database, and evaluate the data according to it (for MySQL: http://dev.mysql.com/doc/refman/5.0/en/timestamp.html)
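For example (a sketch assuming a raw `events` table with a `created_at` TIMESTAMP and an `event_type` column; the names are chosen for illustration), per-minute stats then become a plain GROUP BY, with no risk of double counting:

```sql
SELECT
    DATE_FORMAT(created_at, '%Y-%m-%d %H:%i') AS minute,
    SUM(event_type = 'view')     AS views,
    SUM(event_type = 'order')    AS orders,
    SUM(event_type = 'referral') AS referral,
    SUM(event_type = 'wishlist') AS wishlist
FROM events
WHERE created_at >= '2011-12-01 09:01:00'
  AND created_at <  '2011-12-01 09:03:00'
GROUP BY minute;
```

However late the cron runs, the timestamps decide which minute each event belongs to, so a slow run changes nothing in the totals.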

  • You can use fat-controller.sourceforge.net to run jobs every second or continually repeat a job once it finishes, but I think in this case it would be much, much better to do as de3 says and timestamp your data and then just group and aggregate it using SQL. – SlappyTheFish Nov 29 '11 at 14:52
  • This. You must have a timestamp of the data's last state change. Otherwise, any delay between the data state change and the time you fetch it will make your results inaccurate. How you achieve this really depends on your capabilities and your DB's current schema. Do you have any auditing which you could leverage for this? – Steven Dec 02 '11 at 00:34

Gino, you can do something like cron jobs if you have access to php.ini on the server. If you can set the value of max_execution_time to zero, you can make your own "cron job" that runs every second. What you need is something like this in your php.ini:

max_execution_time = 0

or set it as a runtime script in your PHP codes:

ini_set("max_execution_time",0); 

The next thing you need is PHP's sleep() function. This function puts a delay between your operations. You can use it like sleep(10);. For more information about this function, take a look at this link.
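Put together, such a long-running worker is just a loop. A sketch (run it from the command line, not through the web server; collect_stats() is a placeholder for whatever aggregation logic you want to run each second):

```php
<?php
ini_set('max_execution_time', 0); // never time out

while (true) {
    collect_stats(); // placeholder: your own aggregation logic goes here
    sleep(1);        // pause one second between runs
}
```

You would still want a watchdog (a real cron entry, or a process supervisor) to restart this script if it dies.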

  • I don't think this is a good idea. sleep(10)? On a high traffic site? – aki Dec 02 '11 at 14:03
  • Aki, this suggestion is just for applications that are running on a private server. Otherwise, you cannot use max_execution_time either. – Mohammad Saberi Dec 02 '11 at 19:50

This might sound strange, but why don't you use Google Analytics for this kind of tracking task? With the new "live beta" you could retrieve the data using its API and build all the fancy reports that management may require.

Plus, the JS and Google will handle most of the load.

edit: My real point is: why don't you use JS (Woopra or your own) to gather the clicks and events on the front end, and store all the stats in another database? I don't believe mixing OLAP and OLTP is a good idea on any production server. Hope that makes sense.

  • I believe not; the live beta has not yet made it to the API, unless Google didn't update its documentation, so the minimum filtering option is still at the hour level. I also updated my answer. – Will Dec 02 '11 at 04:19