What is the most conclusive way to evaluate an n-way split test where n > 2?

Question

I have plenty of experience designing, running and evaluating two-way split tests (A/B Tests). Those are by far the most common in digital marketing, where I do most of my work.

However, I'm wondering if anything about the methodology needs to change when more variants are introduced into an experiment (creating, say, a 3-way test (A/B/C Test)).

My instinct tells me I should just run n-1 evaluations against the control group.

If I run a 3-way split test, for example, instinct says I should find significance and power twice:

Treatment A vs Control
Treatment B vs Control

So, in that case, I'm finding out which, if any, treatment performed better than the control (1-tailed test, alt: treatment - control > 0, the basic marketing hypothesis).

But, I'm doubting my instinct. It's occurred to me that running a third test contrasting Treatment A with Treatment B could yield confusing results.

For example, what if there's not enough evidence to reject a null that treatment B = treatment A?

That would lead to a goofy conclusion like this:

Treatment A = Control
Treatment B > Control
Treatment B = Treatment A

If treatments A and B are likely only different due to random chance, how could only one of them outperform the control?

And that's making me wonder if there's a more statistically sound way to evaluate split tests with more than one treatment. Is there?

[ANOVA](https://en.wikipedia.org/wiki/Analysis_of_variance#For_multiple_factors) comes to mind, but the lot over on [stats.SE] are way better at this than us here on SO :). — Nelewout, Jun 01 '21 at 21:17

score 3 · Accepted Answer · answered Aug 03 '21 at 20:16

Your instincts are correct, and you can feel less goofy by rewording your statements:

We could find no statistically significant difference between Treatment A and Control.
Treatment B is significantly better than Control.
However, it remains inconclusive whether Treatment B is better than Treatment A.

This would be enough to declare Treatment B the winner, with a possible follow-up retest of A vs B. Depending on your specific situation, though, you may have a business need to confirm that Treatment B really is better than Treatment A before moving forward, and your current data cannot support that decision. In that case you must gather more data and/or run a new test.
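As a rough illustration of how those three statements can all hold at once, here is a minimal sketch using a pooled two-proportion z-test with a one-tailed alternative (first rate minus second rate > 0). The conversion counts, the 0.05 threshold, and the helper name `one_sided_p_value` are all made up for the example, and no adjustment is made for running several comparisons.

```python
from math import erfc, sqrt


def one_sided_p_value(conv_x, n_x, conv_y, n_y):
    """P-value for H1: rate_x - rate_y > 0 (pooled two-proportion z-test)."""
    rate_x, rate_y = conv_x / n_x, conv_y / n_y
    pooled = (conv_x + conv_y) / (n_x + n_y)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_x + 1 / n_y))
    z = (rate_x - rate_y) / std_err
    return 0.5 * erfc(z / sqrt(2))  # upper-tail probability P(Z > z)


ALPHA = 0.05  # assumed significance threshold

# Hypothetical (conversions, visitors) for each arm of the test
control = (1000, 10000)
treatment_a = (1050, 10000)
treatment_b = (1100, 10000)

p_a_vs_control = one_sided_p_value(*treatment_a, *control)
p_b_vs_control = one_sided_p_value(*treatment_b, *control)
p_b_vs_a = one_sided_p_value(*treatment_b, *treatment_a)

for label, p in [
    ("A vs Control", p_a_vs_control),
    ("B vs Control", p_b_vs_control),
    ("B vs A", p_b_vs_a),
]:
    verdict = "significant" if p < ALPHA else "inconclusive"
    print(f"{label}: p = {p:.3f} ({verdict})")
```

With these invented counts, only B vs Control clears the threshold. A sits between Control and B, close enough to each that neither gap can be distinguished from noise, which is why the result is not a contradiction.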

In my experience, a far more common scenario is that Treatment A and Treatment B both soundly beat the control (they're often closely related and based on related hypotheses), but there is no statistically significant difference between Treatment A and Treatment B. This is an interesting scenario: if you are required to pick a winner, it's okay to throw significance out the window and pick the one with the strongest observed effect. The reason is that the significance threshold (e.g. 95% confidence) is set to avoid false positives and unnecessary changes; it assumes there are switching costs. In this case you must pick A or B and throw out the control either way, so in my opinion it's okay to pick the best performer until you have more data.
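That decision rule could be sketched roughly as follows: keep only the treatments that significantly beat the control, then choose among them by observed conversion rate alone. The counts, the 0.05 threshold, and the names `p_value_greater`, `beat_control`, and `winner` are assumptions made for the illustration, not part of any particular testing tool.

```python
from math import erfc, sqrt


def p_value_greater(conv_x, n_x, conv_y, n_y):
    """One-tailed p-value for rate_x > rate_y (pooled two-proportion z-test)."""
    pooled = (conv_x + conv_y) / (n_x + n_y)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_x + 1 / n_y))
    z = (conv_x / n_x - conv_y / n_y) / std_err
    return 0.5 * erfc(z / sqrt(2))


ALPHA = 0.05  # assumed significance threshold

# Hypothetical (conversions, visitors) per arm
control = (1000, 10000)
treatments = {"A": (1150, 10000), "B": (1180, 10000)}

# Step 1: significance decides whether a treatment replaces the control.
beat_control = {
    name: counts
    for name, counts in treatments.items()
    if p_value_greater(*counts, *control) < ALPHA
}


def observed_rate(name):
    conversions, visitors = beat_control[name]
    return conversions / visitors


# Step 2: among the survivors, pick the highest observed rate,
# even if the survivors are not significantly different from each other.
winner = max(beat_control, key=observed_rate) if beat_control else None
print(f"Beat control: {sorted(beat_control)}; provisional winner: {winner}")
```

Here the A vs B comparison is deliberately left out of the choice: since the control is being replaced anyway, a wrong pick between A and B costs little, and the choice can be revisited once more data comes in.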
