I have one dataset with 261 data points and another with 373 data points. Here is the data:
dataset_1 = data.frame(dataset_name = rep("dataset_1", 261),
value = seq(40, 10000, length.out = 261))
dataset_2 = data.frame(dataset_name = rep("dataset_2", 373),
value = seq(50, 5000, length.out = 373))
dataset <- rbind(dataset_1, dataset_2)
The KS test:
ks.test(dataset$value[dataset$dataset_name=="dataset_1"],
dataset$value[dataset$dataset_name=="dataset_2"],
alternative = c("less")) -> test_result
Plotting the ECDFs:
library(ggplot2)
library(magrittr)  # provides the %>% pipe, which ggplot2 does not export
dataset %>%
ggplot(aes(x = value, group = dataset_name, color = dataset_name)) +
stat_ecdf(size = 2)
Now I need to measure the horizontal distance between the two ECDFs at each probability level. For example, at 0.25 the value is roughly 2500 for dataset_1 and roughly 1250 for dataset_2, so the distance is roughly 1250. Because dataset_1 has 261 points and dataset_2 has 373, the two step functions do not jump at the same probability levels. How can I generate a data frame that shows these distances?
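To make the target concrete, here is a minimal sketch of the kind of data frame I have in mind. It assumes the horizontal position at a given probability is read off the inverse of the empirical CDF, which `quantile()` returns with `type = 1` (no interpolation). The grid `probs` and the name `horizontal_distances` are my own choices for illustration.

```r
# Recreate the two samples from the question.
dataset_1 <- data.frame(dataset_name = "dataset_1",
                        value = seq(40, 10000, length.out = 261))
dataset_2 <- data.frame(dataset_name = "dataset_2",
                        value = seq(50, 5000, length.out = 373))

# Probability levels at which to measure the horizontal gap (arbitrary grid).
probs <- seq(0.01, 1, by = 0.01)

# type = 1 is the inverse of the empirical CDF, so every returned value is an
# observed data point and nothing is interpolated.
q1 <- quantile(dataset_1$value, probs = probs, type = 1, names = FALSE)
q2 <- quantile(dataset_2$value, probs = probs, type = 1, names = FALSE)

# One row per probability level, with the horizontal distance between ECDFs.
horizontal_distances <- data.frame(probability = probs,
                                   dataset_1   = q1,
                                   dataset_2   = q2,
                                   distance    = q1 - q2)
head(horizontal_distances)
```

Because both quantile vectors are evaluated on the same grid, the differing sample sizes do not matter here, but I am not sure whether this is the right definition of the distance.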
I resampled dataset_1 to 373 data points using linear interpolation (approx) and then checked the results.
interpolated_dataset_1 <- approx(dataset_1$value, n = 373)
# creating the dataframe
interpolated_dataset_1_dataframe <- data.frame(dataset_name =
"modified_dataset_1", value = interpolated_dataset_1$y)
# combining the data
modified_dataset <- rbind(dataset,interpolated_dataset_1_dataframe)
# the ks test
ks.test(modified_dataset$value[modified_dataset$dataset_name==
"modified_dataset_1"],
modified_dataset$value[modified_dataset$dataset_name=="dataset_2"],
alternative = c("less")) -> modified_test_result
# the ecdfs
library(ggplot2)
library(magrittr)  # provides the %>% pipe, which ggplot2 does not export
modified_dataset %>%
ggplot(aes(x = value, group = dataset_name, color = dataset_name)) +
stat_ecdf(size = 2)
The D statistic is almost the same but not identical, although the result is still significant.
Is there a better way to do this using step functions, so that I get exactly the same test statistic?
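For reference, this is a rough sketch of how I think the statistic relates to the step functions, without resampling either dataset. It assumes that, for `alternative = "less"`, `ks.test()` reports the largest amount by which the ECDF of the second sample lies above that of the first. That is my reading of the documentation, not something I have confirmed beyond this example. The names `F1`, `F2`, `pooled`, and `d_less` are mine.

```r
# Recreate the two samples from the question.
x <- seq(40, 10000, length.out = 261)   # dataset_1
y <- seq(50, 5000, length.out = 373)    # dataset_2

# ecdf() returns the empirical CDF as a step function.
F1 <- ecdf(x)
F2 <- ecdf(y)

# Step functions only change at observed values, so evaluating both ECDFs
# on the pooled sample is enough to find the largest vertical gap.
pooled <- sort(unique(c(x, y)))

# Assumed definition of the one-sided statistic for alternative = "less".
d_less <- max(F2(pooled) - F1(pooled))

# Compare against the value reported by ks.test() on the original data.
test_result <- ks.test(x, y, alternative = "less")
all.equal(d_less, unname(test_result$statistic))
```

If this reading is right, the statistic can be reproduced from the original 261 and 373 points directly, and the interpolation step is what changes it.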