-1

I am trying to convert the following one-pass standard deviation code in C to Python:

double std_dev(double a[], int n) {
    if(n == 0)
        return 0.0;
    int i = 0;
    double meanSum = a[0];
    double stdDevSum = 0.0;
    for(i = 1; i < n; ++i) {
        double stepSum = a[i] - meanSum;
        double stepMean = ((i - 1) * stepSum) / i;
        meanSum += stepMean;
        stdDevSum += stepMean * stepSum;
    }
    // for poulation variance: return sqrt(stdDevSum / n);
    return sqrt(stdDevSum / (n));

Here is what I have in Python so far:

def std_dev(a,n):
    if n == 0:
        return 0.0
    i = 0
    meanSum = float(a[0])
    stdDevSum = float(0.0)
    for i in range(1,n,1):
        stepSum = float(float(a[i]) - meanSum)
        stepMean = float(((i - 1)*stepSum)/i)
        meanSum += stepMean
        stdDevSum += stepMean*stepSum
        print(stdDevSum)
    value = float(sqrt(stdDevSum/(n)))
    print(value)

However, I am not arriving at the correct result for Standard Deviation of the population. For instance, the program returns the standard deviation of the set [10,20,500,40,50] as 175.33, whereas an online calculator or by-hand calculation returns 188.53. How can we account for the difference?

Thanks!

C Algorithm Source: https://www.strchr.com/standard_deviation_in_one_pass

  • What is the fewest number of numbers for which you have found this does not return correct results? What are they? Before posting here, did you print the values of each variable and see where they differed between the C and Python implementations? In which iteration does the first deviation occur, and in which variable? Edit the question to provide a [mre]. – Eric Postpischil Aug 08 '21 at 19:59
  • Why does the Python code have `sqrt(stdDevSum/(n)` where the C code has `sqrt(stdDevSum / (n - 1)`? – Eric Postpischil Aug 08 '21 at 20:01
  • @Eric - I also asked that but then saw the comment near the end of the C code. – Adrian Mole Aug 08 '21 at 20:02
  • @AdrianMole: Nonetheless, the code should be analogous. C code with a comment that a different result could be produced is not equivalent to Python code that produces a different result. – Eric Postpischil Aug 08 '21 at 20:03
  • @Eric I agree - OP should edit to not have the "sample sigma" version and **only** the "population sigma" code. – Adrian Mole Aug 08 '21 at 20:04
  • Anyway, my answer does actually address the problem you first presented. You have applied my solution to your question, and now pose a **new** question. Your edit invalidates my answer, and that is not how Stack Overflow works. Consider rolling back your edit(s) (i.e. removing my suggested answer from your post) and asking a separate question as to why the abbreviated one-pass algorithm loses accuracy. – Adrian Mole Aug 09 '21 at 06:14

1 Answers1

1

The syntax of your range() expression is wrong – the third argument should be the actual increment to apply (i.e. 1) not an expression involving the loop index. [ Reference ]

Also, you should be using math.sqrt() and you need to import math:

import math 

def std_dev(a,n):
    if n == 0:
        return 0.0
#   i = 0 # This line is unnecessary
    meanSum = float(a[0])
    stdDevSum = float(0.0)
    for i in range(1,n,1): # Note the third argument!
#   for i in range(1, n):  # Alternative version - the default for "step" is 1
        stepSum = float(float(a[i]) - meanSum)
        stepMean = float(((i - 1)*stepSum)/i)
        meanSum += stepMean
        stdDevSum += stepMean*stepSum
        print(stdDevSum)
    value = float(math.sqrt(stdDevSum/(n)))
    print(value)

test = [1., 2., 3., 2., 1.]
std_dev(test, 5)

Output:

0.0
2.0
2.0
2.75
0.7416198487095663

Using the same test data with your C code (or a version of it using the n divisor rather than (n - 1)), as follows:

int main()
{
    double test[] = { 1., 2., 3., 2., 1. };
    double answer = std_dev(test, 5);
    printf("%lg\n", answer);
    return 0;
}

gives 0.74162 as the output.

Adrian Mole
  • 49,934
  • 160
  • 51
  • 83
  • Sorry for the confusion regarding the sample and population. We want the population SD, not the sample. Furthermore, I have fixed the range() problem. The true question is why this version does not return a fully accurate SD. In your example [1,2,3,2,1], the SD was small enough that the result was reasonable. However, take the set [10,20,500,40,50], and the program's calculated SD becomes very different from what you would calculate by hand or using an online calculator. How can we account for this inaccuracy in the program? – Lance Gibson Aug 09 '21 at 04:30
  • Hmm. But both the C and the Python code give the same answer (175.339). – Adrian Mole Aug 09 '21 at 05:45
  • Interesting, it could be a good issue to raise with the maker of the algorithm themselves. – Lance Gibson Aug 09 '21 at 05:52