2

I'm using the Statistics::Descriptive library in Perl to calculate frequency distributions and coming up against a floating point rounding error problem.

I pass in two values, 0.205 and 0.205, (taken from other numbers and sprintf'd to those) to the stats module and ask it to calculate the frequency distribution but it's getting stuck in an infinite loop.

Stepping through with a debugger I can see that it's doing:

my $interval = $self->{sample_range}/$partitions;

my $iter = $self->{min};

while (($iter += $interval) <  $self->{max}) {

  $bins{$iter} = 0;

  push @k, $iter;  ##Keep the "keys" unstringified

}

$self->sample_range (The range is max-min)is returning 2.77555756156289e-17 rather than 0 as I'd expect. This means that the loop ((min+=range) < max)) enters a (for all intents and purposes) infinite loop.

DB<8> print $self->{max};
0.205
DB<9> print $self->{min};
0.205
DB<10> print $self->{max}-$self->{min};
2.77555756156289e-17

So this looks like a rounding problem. I can't think how to fix this on my side though, and I'm not sure editing the library is a good idea. I'm looking for suggestions of a workaround or alternative.

Cheers, Neil

brian d foy
  • 129,424
  • 31
  • 207
  • 592
NeilInglis
  • 3,431
  • 4
  • 30
  • 31

3 Answers3

5

I am the Statistics::Descriptive maintainer. Due to its numeric nature, many rounding problems have been reported. I believe this particular one was fixed in a later version to the one you were using that I released recently, by using multiplication for the divisions instead of +=.

Please use the most up-to-date version from the CPAN, and it should be better.

brian d foy
  • 129,424
  • 31
  • 207
  • 592
Shlomi Fish
  • 4,380
  • 3
  • 23
  • 27
  • Hi, Shlomi! Glad you noticed this question; you've saved me from having to email you a link to it. I see the new version is still using numbers as hash keys like $bins{$self->max()} = 0; to avoid having this round the values, you may want to use pack "F" (requires 5.8.0+) and unpack whenever you use the key. – ysth Jun 04 '09 at 06:54
  • Excellent, thanks! I should've checked for a new version, my fault. Very impressed by this response to my first Stack Overflow question. Thanks again to all who responded. – NeilInglis Jun 04 '09 at 08:31
3

Not exactly a rounding problem; you can see the more precise values with something like

printf("%.18g %.18g", $self->{max}, $self->{min});

Looks to me like there's a flaw in the module where it assumes the sample range can be divided up into $partitions pieces; because floating point doesn't have infinite precision, this isn't always possible. In your case, the min and max values are exactly adjacent representable values, so there can't be more than one partition. I don't know what exactly the module is using the partitions for, so I'm not sure what the impact of this may be. Another possible problem in the module is that it is using numbers as hash keys, which implicitly stringifies them which slightly rounds the value.

You may have some success in laundering your data through stringization before feeding it to the module:

$data = 0+"$data";

This will at least ensure that two numbers that (with the default printing precision) appear equal are actually equal.

ysth
  • 96,171
  • 6
  • 121
  • 214
  • Yup, thanks. Max is actually 0.20500000000000002 and min is 0.20499999999999999 so that explains why it's going wrong. I'll try some workarounds. – NeilInglis Jun 03 '09 at 13:46
-1

That shouldn't cause an infinite loop. What would cause that loop to be infinite would be if $self->{sample_range}/$partitions is 0.

Chas. Owens
  • 64,182
  • 22
  • 135
  • 226
  • Yeah, I didn't think so either but DB<12> p $iter; 0.205 DB<13> p $interval; 3.46944695195361e-18 DB<14> p $iter+=$interval 0.205 DB<15> p $self->{max} 0.205 DB<16> p ($iter += $interval) < $self->{max} 1 so (( 0.205 + 3.46944695195361e-18) < 0.205 ) evaluates to true. Of course, it's been a long day so I could be off the ball... – NeilInglis Jun 03 '09 at 13:45
  • Nope; take for instance the numbers 1 and 1+2**-52. They differ by 2**-52. Assuming you want 4 partitions, that gives an interval of 2**-54 (which is clearly non-zero), but if you try to add that to 1, you leave the 1 unchanged (on most platforms), since the nearest representable value to 1+2**-54 is 1. The loop is assuming that if you increment a number by a non-zero value, that will increase the number, and that's not true in this case, leading to an endless loop. – ysth Jun 03 '09 at 14:20