I am trying to implement an A/B test (online validation) for an ML model that has a highly imbalanced positive event rate. For example, the model detects spam and only 1 out of 1000 samples is spam, or the baseline click-through rate is very low (<0.1%).
I know one issue is that I will need very large samples in both the control and treatment cohorts. Are there other issues that I need to be aware of? Will the statistical properties break down? What are the ways to counter them?
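To give a sense of the scale involved, here is a rough sketch of the per-cohort sample size calculation, assuming a two-sided two-proportion z-test with the usual normal approximation. The function name `samples_per_arm`, the 10% relative lift, and the defaults of 5% significance and 80% power are illustrative assumptions, not values from the actual setup.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_control, relative_lift, alpha=0.05, power=0.8):
    """Approximate samples needed in EACH cohort to detect a relative lift.

    Assumes a two-sided two-proportion z-test (normal approximation).
    All parameter names and defaults here are illustrative assumptions.
    """
    p_treat = p_control * (1 + relative_lift)

    # Critical values for the chosen significance level and power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Pooled rate under the null, separate variances under the alternative.
    p_bar = (p_control + p_treat) / 2
    null_term = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_beta * math.sqrt(
        p_control * (1 - p_control) + p_treat * (1 - p_treat)
    )

    return math.ceil((null_term + alt_term) ** 2 / (p_treat - p_control) ** 2)


if __name__ == "__main__":
    # Baseline rate of 0.1% (1 in 1000), hypothetical 10% relative lift.
    print(samples_per_arm(p_control=0.001, relative_lift=0.10))
```

Under these assumptions, a 0.1% baseline with a 10% relative lift works out to more than a million samples per cohort, and the requirement grows roughly in inverse proportion to the baseline rate.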
Thanks.