I have one dataset with 261 data points and another with 373 data points. Here is the data:

dataset_1 = data.frame(dataset_name = rep("dataset_1", 261), 
                       value = seq(40, 10000, length.out = 261))
dataset_2 = data.frame(dataset_name = rep("dataset_2", 373), 
                       value = seq(50, 5000, length.out = 373))

dataset <- rbind(dataset_1, dataset_2)

The KS test:

ks.test(dataset$value[dataset$dataset_name=="dataset_1"],
        dataset$value[dataset$dataset_name=="dataset_2"],
        alternative = c("less")) -> test_result

Plotting the ecdfs

library(ggplot2)
# pass the data frame directly, so the %>% pipe (and magrittr/dplyr) is not needed
ggplot(dataset, aes(x = value, group = dataset_name, color = dataset_name)) +
  stat_ecdf(size = 2)

[Figure: ECDFs of the two datasets]

Now I need to measure the horizontal distance between the two ECDFs at each probability level. For example, at 0.25 we have about 2500 from dataset_1 and about 1250 from dataset_2, so the distance is about 1250. Because dataset_1 has 261 points and dataset_2 has 373, the steps of the two ECDFs do not line up. How can I generate a data frame that shows these distances?

I tried resampling dataset_1 to 373 data points by linear interpolation and then checked the results:

interpolated_dataset_1  <- approx(dataset_1$value, n = 373)

# creating the dataframe
interpolated_dataset_1_dataframe <- data.frame(dataset_name = 
              "modified_dataset_1", value = interpolated_dataset_1$y)

# combining the data
modified_dataset <- rbind(dataset,interpolated_dataset_1_dataframe)

# the ks test
ks.test(modified_dataset$value[modified_dataset$dataset_name==
                               "modified_dataset_1"],
        modified_dataset$value[modified_dataset$dataset_name=="dataset_2"],
        alternative = c("less")) -> modified_test_result
# the ecdfs (ggplot2 is already loaded above; no pipe needed)
ggplot(modified_dataset, aes(x = value, group = dataset_name, color = dataset_name)) +
  stat_ecdf(size = 2)

The D statistic is almost the same but not identical, although the result is still significant.

Is there a better way to do this using the step functions directly, so that I get exactly the same test statistic?

ThomasIsCoding
mra343
  • Perhaps relevant, I asked a similar question last year: https://stackoverflow.com/a/74907126/6851825 – Jon Spring May 17 '23 at 21:20
  • btw, small nit you might want to either replace `%>%` with the base pipe `|>` or else load `dplyr` or `magrittr`, or take it out by using `ggplot(dataset, aes(...` – Jon Spring May 17 '23 at 22:35

1 Answer

Update

I think quantile() is what you need here. It is more efficient than my original solution (ecdf + uniroot), which is kept further down.

dstat2 <- function(p, df1 = dataset_1, df2 = dataset_2) {
    abs(quantile(df1$value, p) - quantile(df2$value, p))
}

such that

> dstat2(0.25)
   25%
1242.5

> dstat2(0.5)
 50%
2495

> dstat2(0.75)
   75%
3747.5
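
Since the question asks for a data frame of distances, the same idea can be applied to a whole grid of probabilities at once, because quantile() is vectorised over its probs argument. This is only a minimal sketch: the grid probs and the column names prob, q_dataset_1, q_dataset_2 and distance are my own choices for illustration, not anything from the question.

```r
# Self-contained copy of the data from the question
dataset_1 <- data.frame(dataset_name = "dataset_1",
                        value = seq(40, 10000, length.out = 261))
dataset_2 <- data.frame(dataset_name = "dataset_2",
                        value = seq(50, 5000, length.out = 373))

# Hypothetical probability grid; pick whatever resolution you need
probs <- seq(0, 1, by = 0.05)

# One row per probability level, with the quantile of each dataset
distances <- data.frame(
  prob        = probs,
  q_dataset_1 = quantile(dataset_1$value, probs, names = FALSE),
  q_dataset_2 = quantile(dataset_2$value, probs, names = FALSE)
)

# Horizontal distance between the two quantile curves at each level
distances$distance <- abs(distances$q_dataset_1 - distances$q_dataset_2)

head(distances)
```

Note that quantile() defaults to type = 7, which interpolates linearly between neighbouring data points, so the two sample sizes do not need to match.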

Here is the original solution, using ecdf + uniroot:

dstat <- function(p, df1 = dataset_1, df2 = dataset_2) {
    abs(
        diff(
            sapply(
                list(df1, df2),
                \(v) {
                    with(
                        v,
                        uniroot(
                            \(x) ecdf(value)(x) - p,
                            range(value)
                        )$root
                    )
                }
            )
        )
    )
}

and we can obtain

> dstat(0.25)
[1] 1242.5

> dstat(0.5)
[1] 2495

> dstat(0.75)
[1] 3747.5
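
The ecdf + uniroot approach above is effectively inverting the step function numerically. If you want the distance between the step functions themselves rather than between interpolated quantiles, quantile() with type = 1 may be a simpler route: that type is documented as the inverse of the empirical distribution function. A minimal sketch, where dstat_step, x1 and x2 are hypothetical names I introduce for illustration:

```r
# Self-contained copy of the values from the question
x1 <- seq(40, 10000, length.out = 261)  # dataset_1$value
x2 <- seq(50, 5000, length.out = 373)   # dataset_2$value

# Horizontal distance between the two ECDF step functions at level p.
# type = 1 returns an actual data point (the inverse of the ECDF),
# while the default type = 7 interpolates between data points.
dstat_step <- function(p, x1, x2) {
  abs(quantile(x1, p, type = 1, names = FALSE) -
      quantile(x2, p, type = 1, names = FALSE))
}

dstat_step(c(0.25, 0.5, 0.75), x1, x2)

# Compare the two quantile definitions at a level that falls between steps
quantile(x1, 0.33, type = 1, names = FALSE)
quantile(x1, 0.33, type = 7, names = FALSE)
```

The two types can differ at levels that fall between steps, so it is worth deciding which definition of "horizontal distance" you actually want before comparing numbers.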
ThomasIsCoding