So we can run a K-S test to assess whether two datasets differ in distribution, as outlined here.
So let's take the following data:
set.seed(123)
N <- 1000
var1 <- runif(N, min=0, max=0.5)
var2 <- runif(N, min=0.3, max=0.7)
var3 <- rbinom(n=N, size=1, prob = 0.45)  # binary grouping variable
df <- data.frame(var1, var2, var3)
We can then separate the data based on the var3 outcome:
df.1 <- subset(df, var3 == 1)
df.2 <- subset(df, var3 == 0)
Now we can run a Kolmogorov–Smirnov test on each individual variable to check for differences in its distribution between the two groups.
ks.test(jitter(df.1$var1), jitter(df.2$var1))  # jitter() only to avoid possible ties warnings
ks.test(jitter(df.1$var2), jitter(df.2$var2))
And, not surprisingly, we do not find a difference, so we can assume the two subsets have been drawn from the same distribution. This can be visualised through:
plot(ecdf(df.1$var1), col=2)   # ECDF of var1 for var3 == 1
lines(ecdf(df.2$var1))         # ECDF of var1 for var3 == 0
plot(ecdf(df.1$var2), col=3)   # ECDF of var2 for var3 == 1
lines(ecdf(df.2$var2), col=4)  # ECDF of var2 for var3 == 0
But now we want to consider whether the distributions of var1 and var2, taken together, differ between var3 == 0 and var3 == 1.
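As a visual aid only (this plot is my addition and not part of any formal test), the joint behaviour of var1 and var2 by group can be seen with:
# Scatter of var1 against var2, coloured by var3 group (black = 0, red = 1)
plot(df$var1, df$var2, col = df$var3 + 1, xlab = "var1", ylab = "var2")
legend("topright", legend = c("var3 == 0", "var3 == 1"), col = 1:2, pch = 1)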
Is there an R package to run such a test when we have multiple predictors?
A similar question was posed here, but has not received any answers.
There appears to be some literature: Example 1, Example 2
But nothing appears to be implemented in R.
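For concreteness, this is the kind of call I am imagining; the energy package's eqdist.etest (a permutation-based energy test of equal distributions) came up in my searching, but I am not sure it is the appropriate tool here:
library(energy)
# Pool the (var1, var2) observations from both groups into one matrix
x <- as.matrix(rbind(df.1[, c("var1", "var2")], df.2[, c("var1", "var2")]))
# Multivariate two-sample test of equal distributions, 999 permutation replicates
eqdist.etest(x, sizes = c(nrow(df.1), nrow(df.2)), R = 999)
If this is the right direction, or if there is a more standard package for multivariate two-sample testing, pointers would be appreciated.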