12

So I've read the two related questions for calculating a trend line for a graph, but I'm still lost.

I have an array of xy coordinates, and I want to come up with another array of xy coordinates (can be fewer coordinates) that represent a logarithmic trend line using PHP.

I'm passing these arrays to JavaScript to plot graphs on the client side.

Geoff
Stephen

3 Answers

29

Logarithmic Least Squares

Since we can convert a logarithmic function into a line by taking the log of the x values, we can perform a linear least squares fit on the transformed data. In fact, the work has been done for us and a solution is presented at MathWorld.

In brief, we're given $X and $Y values that are assumed to follow a model like y = a + b * log(x). The least squares method gives values aFit and bFit that minimize the sum of squared vertical distances between the fitted curve and the given data points.
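
For reference, the closed-form estimates that the PHP code in this answer computes can be written out as follows. Here n is the number of points and ln is the natural logarithm, which is what PHP's log() returns when called with one argument; the symbols x_i and y_i are just notation for the entries of $X and $Y.

```latex
b = \frac{n \sum_{i=1}^{n} y_i \ln x_i \;-\; \sum_{i=1}^{n} y_i \, \sum_{i=1}^{n} \ln x_i}
         {n \sum_{i=1}^{n} (\ln x_i)^2 \;-\; \left( \sum_{i=1}^{n} \ln x_i \right)^2},
\qquad
a = \frac{\sum_{i=1}^{n} y_i \;-\; b \sum_{i=1}^{n} \ln x_i}{n}
```

These are the ordinary linear least squares formulas for slope and intercept with ln x substituted for x, so every x value must be strictly positive.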

Here is an example implementation in PHP:

First I'll generate some random data with a known underlying distribution given by $a and $b:

  // True parameter values
  $a = 10;
  $b = 5;

  // Range of x values to generate
  $x_min = 1;
  $x_max = 10;
  $nPoints = 50;

  // Generate some random points near y = a + b * log(x)
  $X = array();
  $Y = array();
  for($p = 0; $p < $nPoints; $p++){
    $x = $p / $nPoints * ($x_max - $x_min) + $x_min;
    $y = $a + $b * log($x);

    $X[] = $x + rand(0, 200) / ($nPoints * $x_max);
    $Y[] = $y + rand(0, 200) / ($nPoints * $x_max);

  }

Now, here's how to use the equations given to estimate $a and $b.

  // Now convert to log-scale for X
  $logX = array_map('log', $X);

  // Now estimate $a and $b using equations from Math World
  $n = count($X);
  $square = create_function('$x', 'return pow($x,2);');
  $x_squared = array_sum(array_map($square, $logX));
  $xy = array_sum(array_map(create_function('$x,$y', 'return $x*$y;'), $logX, $Y));

  $bFit = ($n * $xy - array_sum($Y) * array_sum($logX)) /
          ($n * $x_squared - pow(array_sum($logX), 2));

  $aFit = (array_sum($Y) - $bFit * array_sum($logX)) / $n;
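
Note that create_function() was deprecated in PHP 7.2 and removed in PHP 8.0, so the snippet above will not run on current PHP versions. The same calculation can be sketched with anonymous functions instead. The function name fitLogTrend, its return format, and the input checks are illustrative assumptions, not part of the original code, and the type declarations assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper (not from the original answer): fits y = a + b * log(x)
// by ordinary least squares on the pairs (log(x), y).
function fitLogTrend(array $X, array $Y): array
{
    $n = count($X);
    if ($n < 2 || $n !== count($Y)) {
        throw new InvalidArgumentException('Need at least two (x, y) pairs of equal length.');
    }
    foreach ($X as $x) {
        if ($x <= 0) {
            throw new InvalidArgumentException('log(x) requires every x to be greater than 0.');
        }
    }

    // Transform x to log scale, then accumulate the sums used by the formulas.
    $logX = array_map('log', $X);
    $sumLogX = array_sum($logX);
    $sumY = array_sum($Y);
    $sumLogX2 = array_sum(array_map(function ($lx) {
        return $lx * $lx;
    }, $logX));
    $sumLogXY = array_sum(array_map(function ($lx, $y) {
        return $lx * $y;
    }, $logX, $Y));

    // The denominator is zero when all x values are identical.
    $denominator = $n * $sumLogX2 - $sumLogX * $sumLogX;
    if ($denominator == 0.0) {
        throw new InvalidArgumentException('The x values must not all be the same.');
    }

    $b = ($n * $sumLogXY - $sumY * $sumLogX) / $denominator;
    $a = ($sumY - $b * $sumLogX) / $n;

    return array('a' => $a, 'b' => $b);
}
```

Called as fitLogTrend($X, $Y), it returns an array whose 'a' and 'b' entries play the role of $aFit and $bFit.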

You may then generate points for your JavaScript as densely as you like. Here the fit is simply evaluated at the original x values:

  $Yfit = array();
  foreach($X as $x) {
    $Yfit[] = $aFit + $bFit * log($x);
  }
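
To make the trend line denser or sparser than the input data, the fitted curve can be evaluated on an evenly spaced grid instead. The following is a minimal sketch: the coefficient values are the estimates quoted in this answer, while the range, the point count, and the variable names ($xMin, $xMax, $nTrend, $trend) are assumptions chosen for illustration.

```php
<?php
// Example coefficients: the estimates quoted in this answer.
$aFit = 9.7;
$bFit = 5.17;

// Hypothetical plotting range and resolution; x must stay above 0 for log().
$xMin = 1;
$xMax = 10;
$nTrend = 100;

// Evaluate the fitted curve at $nTrend evenly spaced x values.
$trend = array();
for ($i = 0; $i < $nTrend; $i++) {
    $x = $xMin + ($xMax - $xMin) * $i / ($nTrend - 1);
    $trend[] = array($x, $aFit + $bFit * log($x));
}

// JSON is one convenient way to hand the points to client-side JavaScript.
$json = json_encode($trend);
```

Lowering $nTrend gives fewer points than the input, which is usually enough for a smooth curve like this one.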

In this case, the code estimates bFit = 5.17 and aFit = 9.7, which is quite close for only 50 data points.

For the example data given in the comment below, a logarithmic function does not fit well.

The least squares solution is y = -514.734835478 + 2180.51562281 * log(x), which is essentially a straight line over this domain.

Community
Geoff
  • Alright, I'm off to the Google races. I'll get back to you with what I find. – Stephen May 04 '10 at 21:31
  • In theory, your updated comment makes sense. In practice? I'm a dunce at math. I looked at the two equations you mentioned and almost fainted. – Stephen May 04 '10 at 21:54
  • Okay. Well, I did some more research into the problem and have written and tested some code that does this for you, and re-written my answer. Let me know if you have any questions. – Geoff May 05 '10 at 15:06
  • That's awesome. This is exactly what I was looking for. I'm going to break down your code and try to learn what exactly is going on. I appreciate it. – Stephen May 05 '10 at 17:03
  • I tried to implement your solution, but my y values are coming out way wrong. The y values in my array of points range anywhere from 0 to 100, but the trendline I'm coming up with has y values in the negative thousands. Here's the code: – Stephen May 05 '10 at 18:05
  • Stephen, maybe you can use `pastebin` to share your data? Is it possible it's not really a logarithmic distribution? – Geoff May 05 '10 at 18:51
  • Sure thing. Here is the array in JavaScript after I've pulled it from the database via PHP: http://pastebin.com/JTLQdRhg – Stephen May 05 '10 at 20:44
  • These data look more like an exponential distribution. http://mathworld.wolfram.com/LeastSquaresFittingExponential.html – Geoff May 05 '10 at 21:08
  • I didn't mention it because I didn't think it would matter, but: The x values are numeric representations of dates. – Stephen May 05 '10 at 22:34
  • Sorry! I had `aFit` and `bFit` confused! I have updated the post to reflect this. – Geoff May 06 '10 at 14:23
  • YES! As soon as I swapped the aFit and bFit everything works perfectly now. I've been struggling with this for days. I really appreciate it. And bonus: I studied those equations in detail, so I've learned a lot from this hurdle. – Stephen May 06 '10 at 19:52
  • Great. Will you either correct, or remove (probably better) the code you put in your updated question to avoid any confusion a future reader may have? – Geoff May 06 '10 at 20:18
  • You calculate y-squared but don't use it. Does that have any purpose? – Blizz Mar 22 '14 at 20:24
  • @Blizz - Good point. No; not needed. I was probably just playing around graphing stuff. – Geoff Mar 24 '14 at 11:51
  • This is a very useful answer. Thank you! – Nicolas Castro Mar 16 '17 at 02:56
4

I would recommend using this library: http://www.drque.net/Projects/PolynomialRegression/

It is available via Composer: https://packagist.org/packages/dr-que/polynomial-regression.

Piotr Borek
1

In case anyone is having problems with the create_function, here is how I edited it. (Though I wasn't using logs, so I did take those out.)

I also reduced the number of calculations and added an R2 (coefficient of determination) calculation. It seems to work so far.

function lsq(){
    $X = array(1,2,3,4,5);
    $Y = array(.3,.2,.7,.9,.8);

    // Now estimate $a and $b using equations from Math World
    $n = count($X);

    $mult_elem = function($x,$y){   //anon function mult array elements 
        $output=$x*$y;              //will be called on each element
        return $output;
    };

    $sumX2 = array_sum(array_map($mult_elem, $X, $X));

    $sumXY = array_sum(array_map($mult_elem, $X, $Y));
    $sumY = array_sum($Y);
    $sumX = array_sum($X);

    $bFit = ($n * $sumXY - $sumY * $sumX) /
    ($n * $sumX2 - pow($sumX, 2));
    $aFit = ($sumY - $bFit * $sumX) / $n;
    echo ' intercept ',$aFit,'    ';
    echo ' slope ',$bFit,'   ' ;    

    //r2
    $sumY2 = array_sum(array_map($mult_elem, $Y, $Y));
    $top=($n*$sumXY-$sumY*$sumX);
    $bottom=($n*$sumX2-$sumX*$sumX)*($n*$sumY2-$sumY*$sumY);
    $r2=pow($top/sqrt($bottom),2);
    echo '  r2  ',$r2;
}
Dr. Pierce