
So I'm trying to compare two implementations of a function with Hypothesis, to check whether they behave the same way across a huge variety of inputs that I might not think of myself.

I tried using numpy.testing.assert_allclose to compare the outputs, but Hypothesis just repeatedly outsmarts it: the more I widen the acceptable tolerance, the larger the values Hypothesis throws at the test until it fails again, even though the outputs are similar enough to be considered the same.
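For reference, the failing test has roughly this shape (a minimal sketch; implementation_a and implementation_b are placeholders for the two resample implementations being compared, which aren't shown here):

```python
import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra.numpy import arrays


@given(
    a=arrays(
        dtype=np.float32,
        shape=st.integers(min_value=2, max_value=200),
        # NaN/inf are excluded, but the default float strategy still
        # produces arbitrarily large finite float32 values.
        elements=st.floats(allow_nan=False, allow_infinity=False, width=32),
    ),
    num=st.integers(min_value=2, max_value=200),
)
def test_resample_1d_consistency(a, num):
    np.testing.assert_allclose(
        implementation_a(a, num),  # placeholder: existing implementation
        implementation_b(a, num),  # placeholder: candidate implementation
        rtol=0.1,
        atol=0.001,
    )
```

Typical failures: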

E   Not equal to tolerance rtol=0.1, atol=0.001
...
Falsifying example: test_resample_1d_consistency(a=array([7.696582e+12, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00], dtype=float32), num=11)

 

E   Not equal to tolerance rtol=0.1, atol=0.01
...
Falsifying example: test_resample_1d_consistency(a=array([7.366831e+13, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00], dtype=float32), num=11)

 

E   Not equal to tolerance rtol=1000, atol=1000
...
Falsifying example: test_resample_1d_consistency(a=array([8.360933e+18, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00], dtype=float32), num=186)

etc.

So I guess I need either a different "good enough" test of similarity, or some way to limit the range of input values. But I'm not sure how to do either without missing genuinely wrong answers. Any advice?

endolith

1 Answer


It looks to me like rfft does give very different results in extreme cases - so you'll need to decide whether this is a bug or not. Maybe Hypothesis has actually shown that it's not a suitable optimization!

Put another way, determining an appropriate error tolerance for a given magnitude of input is actually the hardest part of testing! (In the literature this is known as "the oracle problem": how to distinguish good behavior from bad.)

Once you have a bound, though (say rtol=0.1, atol=0.001 for all arrays with elements in [-1000., 1000.]), you can pass the elements argument to the arrays() strategy to constrain the generated values for each test, or try a range of magnitude/tolerance combinations. A sketch follows below.
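For example, here is a sketch of such a constrained test, using the same hypothetical implementation_a/implementation_b placeholders as in the question; the [-1000., 1000.] bound and the rtol/atol values are just the illustrative numbers above:

```python
import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra.numpy import arrays


@given(
    a=arrays(
        dtype=np.float32,
        shape=st.integers(min_value=2, max_value=200),
        # Constrain element magnitudes so a single rtol/atol pair is a
        # meaningful "good enough" bound for every generated array.
        elements=st.floats(
            min_value=-1000.0, max_value=1000.0, allow_nan=False, width=32
        ),
    ),
    num=st.integers(min_value=2, max_value=200),
)
def test_resample_1d_consistency_bounded(a, num):
    np.testing.assert_allclose(
        implementation_a(a, num),
        implementation_b(a, num),
        rtol=0.1,
        atol=0.001,
    )
```

To cover a range of magnitude/tolerance combinations, you could repeat this test for several (bound, rtol, atol) triples, for example with pytest.mark.parametrize.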

Zac Hatfield-Dodds
  • `the problem of determining an appropriate error tolerance for a given magnitude of input is actually the hardest part of testing!` Yes, that's why I couldn't figure it out on my own and asked an SO question about how to decide it! :D – endolith Apr 24 '19 at 03:28