1

I'm working on a sales ranking widget for our site, which will display products ordered by their current position in a "top charts" or "best seller" list (so to speak).

After some reading, it looks like a good method of implementing this would simply be a rolling sales average algorithm where the more recent the sale, the higher it is weighted.

Example:

$rolling_avg = ((4*$d1)+(3*$d2)+(2*$d3)+$d4+$d5+$d6+$d7)/13;

Where:

  • $d1 = Number of sales in the last 24 hours.
  • $d2 = Number of sales in the last 24-48 hours.
  • $d3 = Number of sales in the last 48-72 hours.
  • $d4 = Number of sales in the last 72-96 hours.

and so on...

Currently, I'm trying to run this on a products dataset of around 500k records, inserting the calculated rank back into the products table so that this can be queried against later. If possible, I'd like to be able to create a script to re-calculate rankings and run this on a cron every 12 or 24 hours.

Current Implementation:

My current implementation takes far too long to execute, and I feel a lot more of the processing needs to be done at the SQL level (with a lot fewer SELECT queries) but I'm unsure how to go about starting this.

$products = mysql_query("SELECT * FROM products ORDER BY id DESC"); // <-- Est 450-500k rows.

while($product = mysql_fetch_array($products)) {
    $product_id = $product['id'];

    $d1 = mysql_query("SELECT COUNT(*) FROM orders WHERE (product_id = '$product_id') AND (sale_completed BETWEEN (NOW()-INTERVAL 24 HOUR) AND NOW())") or die(mysql_error);
    $d1 = mysql_fetch_array($d1);

    $d2 = mysql_query("SELECT COUNT(*) FROM orders WHERE (product_id = '$product_id') AND (sale_completed BETWEEN (NOW()-INTERVAL 48 HOUR) AND (NOW()-INTERVAL 24 HOUR))");
    $d2 = mysql_fetch_array($d2);

    $d3 = mysql_query("SELECT COUNT(*) FROM orders WHERE (product_id = '$product_id') AND (sale_completed BETWEEN (NOW()-INTERVAL 72 HOUR) AND (NOW()-INTERVAL 48 HOUR))");
    $d3 = mysql_fetch_array($d3);

    $d4 = mysql_query("SELECT COUNT(*) FROM orders WHERE (product_id = '$product_id') AND (sale_completed BETWEEN (NOW()-INTERVAL 96 HOUR) AND (NOW()-INTERVAL 72 HOUR))");
    $d4 = mysql_fetch_array($d4);

    $d5 = mysql_query("SELECT COUNT(*) FROM orders WHERE (product_id = '$product_id') AND (sale_completed BETWEEN (NOW()-INTERVAL 120 HOUR) AND (NOW()-INTERVAL 96 HOUR))");
    $d5 = mysql_fetch_array($d5);

    $d6 = mysql_query("SELECT COUNT(*) FROM orders WHERE (product_id = '$product_id') AND (sale_completed BETWEEN (NOW()-INTERVAL 144 HOUR) AND (NOW()-INTERVAL 120 HOUR))");
    $d6 = mysql_fetch_array($d6);

    $d7 = mysql_query("SELECT COUNT(*) FROM orders WHERE (product_id = '$product_id') AND (sale_completed BETWEEN (NOW()-INTERVAL 168 HOUR) AND (NOW()-INTERVAL 144 HOUR))");
    $d7 = mysql_fetch_array($d7);

    $d1 = $d1[0];
    $d2 = $d2[0];
    $d3 = $d3[0];
    $d4 = $d4[0];
    $d5 = $d5[0];
    $d6 = $d6[0];
    $d7 = $d7[0];       

    $rolling_avg = ((4*$d1)+(3*$d2)+(2*$d3)+$d4+$d5+$d6+$d7)/13;

    mysql_query("UPDATE products SET rolling_sales = '$rolling_avg' WHERE id = '$product_id'");
}

Not sure how to optimize/ progress from here. But it definitely needs a lot of work.

Before it's mentioned, I understand mysql_* functions are depreciated, and I will move this over to PDO before it moves into a production environment.

Christian Owens
  • 1,086
  • 1
  • 10
  • 16
  • 2
    There is a weighted average algorithm that generally gives better results and is easier to calculate: it uses exponential decay instead of your ad-hoc function. See my answer to a different question: http://stackoverflow.com/questions/8211558/data-structure-algorithm-to-efficiently-save-weighted-moving-average?answertab=active#tab-top – Rotsor Dec 18 '12 at 02:04

1 Answers1

0

This is a function that calculates the rolling sales using a single query.

function get_rolling_sales($product_id) {

    $query = <<<EOF
SELECT (
    4 * (

   SELECT COUNT(*) FROM orders WHERE (product_id = $product_id) 
   AND (sale_completed BETWEEN (NOW()-INTERVAL 24 HOUR) AND NOW())

) + 3 * (

  SELECT COUNT(*) FROM orders WHERE (product_id = $product_id) 
  AND (sale_completed BETWEEN (NOW()-INTERVAL 48 HOUR) AND (NOW()-INTERVAL 24 HOUR))

) + 2 * (

  SELECT COUNT(*) FROM orders WHERE (product_id = $product_id) 
  AND (sale_completed BETWEEN (NOW()-INTERVAL 72 HOUR) AND (NOW()-INTERVAL 48 HOUR))

) + (

  SELECT COUNT(*) FROM orders WHERE (product_id = $product_id) 
  AND (sale_completed BETWEEN (NOW()-INTERVAL 96 HOUR) AND (NOW()-INTERVAL 72 HOUR))

) + (

  SELECT COUNT(*) FROM orders WHERE (product_id = $product_id) 
  AND (sale_completed BETWEEN (NOW()-INTERVAL 120 HOUR) AND (NOW()-INTERVAL 96 HOUR))

) + (

  SELECT COUNT(*) FROM orders WHERE (product_id = $product_id) 
  AND (sale_completed BETWEEN (NOW()-INTERVAL 144 HOUR) AND (NOW()-INTERVAL 120 HOUR))

) + (

  SELECT COUNT(*) FROM orders WHERE (product_id = $product_id) 
  AND (sale_completed BETWEEN (NOW()-INTERVAL 168 HOUR) AND (NOW()-INTERVAL 144 HOUR))

)

) / 13 AS rolling_sales
EOF;

$result = mysql_query($query);
$row = mysql_fetch_assoc($result);
return $row['rolling_sales'];
}

However iterating over all 500.000 product records will still take very long. Do you really need all this information at one time (e.g. for calculations) or is it planned to be displayed in paged table views? If you just want to display the data it should be ok the calculate the rolling_sales on demand.

hek2mgl
  • 152,036
  • 28
  • 249
  • 266