
We know from this article that ending an A/B test early due to "significant" results is a mistake.

But what about when a test runs for the planned time period and shows non-significant results – is it fine to prolong it? What are the risks?

It would be great to see a simple mathematical example of the risks, similar to the example in that linked article.

I have only a basic knowledge of probability theory and maths, so I would appreciate an answer I can understand with that knowledge.

My intuition is that prolonging could be problematic: the experiment was set up with a calculated reliability (it will show false positives in X% and false negatives in Y% of such experiments), but by extending it you're effectively waiting indefinitely for the first significant result, whether it's a true positive or a false positive.

So I would expect more false positives than were accounted for when setting up the original experiment. But presumably the likelihood of false positives also decreases as more data comes in. I would love to get specific numbers on this, if it's true at all.
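For concreteness, here is a rough simulation sketch of the scenario I have in mind (the conversion rate, planned sample size, number of extra looks, and the use of a two-proportion z-test are all just illustrative assumptions on my part):

```python
# Rough Monte Carlo sketch (illustrative assumptions, not from the linked article).
# Under the null hypothesis (A and B truly identical), we run a two-proportion
# z-test at a planned sample size; if it is not significant, we keep adding
# observations and re-testing. We then count how often the test *ever* reaches
# significance, compared with testing once at the planned size.

import numpy as np

rng = np.random.default_rng(0)

BASE_RATE = 0.10        # true conversion rate for both A and B (no real effect)
PLANNED_N = 5_000       # planned observations per variant
EXTRA_LOOKS = 20        # how many times we prolong and re-test
STEP = 1_000            # additional observations per variant at each extra look
SIMULATIONS = 2_000     # number of simulated experiments
Z_CRIT = 1.96           # two-sided critical value for alpha = 0.05


def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic with a pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (p_b - p_a) / se


false_pos_at_planned_n = 0
false_pos_ever = 0

for _ in range(SIMULATIONS):
    max_n = PLANNED_N + EXTRA_LOOKS * STEP
    a = rng.random(max_n) < BASE_RATE   # simulated conversions, variant A
    b = rng.random(max_n) < BASE_RATE   # simulated conversions, variant B

    # One look at the planned sample size.
    n = PLANNED_N
    if abs(z_stat(a[:n].sum(), n, b[:n].sum(), n)) > Z_CRIT:
        false_pos_at_planned_n += 1
        false_pos_ever += 1
        continue

    # Prolong: keep adding data and re-testing until significant or out of looks.
    for _ in range(EXTRA_LOOKS):
        n += STEP
        if abs(z_stat(a[:n].sum(), n, b[:n].sum(), n)) > Z_CRIT:
            false_pos_ever += 1
            break

print(f"False positives at the planned N only:   {false_pos_at_planned_n / SIMULATIONS:.3f}")
print(f"False positives if prolonged and re-tested: {false_pos_ever / SIMULATIONS:.3f}")
```

If I understand the problem correctly, the single look should come out near the nominal 5%, while the "prolong and keep looking" strategy should flag a noticeably larger share of these no-difference experiments as significant – which is exactly the kind of number I'd like to see worked out properly.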

Henrik N
  • (Cross-posted from http://stats.stackexchange.com/questions/269557/is-it-ok-to-prolong-a-non-significant-a-b-test due to a lack of activity there. Happy to delete that question if someone feels that this is unacceptable otherwise.) – Henrik N Apr 03 '17 at 13:01
  • The cross-post was marked as a duplicate of this: https://stats.stackexchange.com/questions/310119/why-does-collecting-data-until-obtaining-a-significant-result-increase-the-type – Henrik N Nov 14 '17 at 08:07

1 Answer


This is an area of current research. We've done some modeling and advise our customers to follow these two rules:

• If the experiment reaches statistical significance, i.e. the CI ribbon rises entirely above zero or falls entirely below it, and it remains significant for 50% more observations than it took to reach significance (for 0.10-level tests; 65% more observations for 0.05-level tests), the experiment is called by accepting the alternative hypothesis – in other words, the treatment wins.

• If the experiment does not reach statistical significance, but the CI ribbon has narrowed to where its width represents a difference between the treatment and the control that is not consequential for the application, the experiment is called by rejecting the research hypothesis – in other words, the treatment fails to win and we stick with the control. (A rough code sketch of both rules follows below.)
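
Here is a rough sketch of how these two rules might look in code; the normal-approximation confidence interval, the practical-significance width, and the 1.5× confirmation multiplier below are illustrative placeholders rather than the exact choices described in the white paper:

```python
# Illustrative sketch of the two stopping rules above, applied to cumulative
# per-variant conversion counts captured at each "look" at the data.
# The CI construction and thresholds are assumptions made for this example.

import math


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.645):
    """CI for (treatment rate - control rate), unpooled normal approximation (90% CI)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


def call_experiment(observations, practical_width=0.01, confirm_multiplier=1.5):
    """
    observations: list of cumulative tuples (conv_control, n_control, conv_treatment, n_treatment),
    one entry per look at the data, in chronological order.

    Returns "treatment wins", "control stays", or "keep running".
    """
    first_significant_n = None  # observations per variant when the CI first cleared zero

    for conv_a, n_a, conv_b, n_b in observations:
        lo, hi = diff_confidence_interval(conv_a, n_a, conv_b, n_b)
        significant = lo > 0 or hi < 0

        if significant and first_significant_n is None:
            first_significant_n = n_b

        if not significant:
            first_significant_n = None  # ribbon touched zero again; reset the clock
            # Rule 2: CI has narrowed below the practically meaningful width.
            if hi - lo < practical_width:
                return "control stays"
        elif n_b >= confirm_multiplier * first_significant_n:
            # Rule 1: still significant after 50% more observations (0.10-level test).
            return "treatment wins"

    return "keep running"
```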

For more, here's the White Paper.

Igor Urisman