8

My simple question is: How do you do a ks.test between two data frames column by column?

E.g., we have two data frames:

D1 <- data.frame(D$Ag, D$Al, D$As, D$Ba, D$Be, D$Ca, D$Cd, D$Co, D$Cu, D$Cr)
D2 <- data.frame(S$Ag, S$Al, S$As, S$Ba, S$Be, S$Ca, S$Cd, S$Co, S$Cu, S$Cr)

Note: this is just an example. The real case would include many more columns, each containing the concentrations of a certain element at a specific location.

Now I would like to run a ks.test between the two data frames:

ks.test(D$Ag, S$Ag)
ks.test(D$Al, S$Al)
ks.test(D$As, S$As)

etc. How is that done without writing out every call by hand?

When I ran shapiro.test on one data frame, I simply used:

lshap1 <- lapply(D1, shapiro.test)
lres1 <- sapply(lshap1, `[`, c("statistic","p.value"))

I have read something about loops, aggregate, and mapply, and tried different things like:

apply(D1, 2, function(D2) ks.test(D2,D1[,1])$p.value)

but then I get a lot of p-values equal to 0, which is not the case when I run the tests manually.

EDIT: 09.10.2017 I import the data as two data frames and then extract some of the data into "smaller" data frames for analysis, e.g. in this case looking at toxic elements and excluding the others.

Sample data: dput(head(D1)) and dput(head(D2)).

## Output dput(head(D1)):
structure(list(DF.As = c(-0.154868225169351, -0.291459578010276,
0.0355227595866723, 0.0892191549433623, 0.189115121672669,
-0.365222418641706
), DF.Cd = c(1.28810277421719, 1.45844987179892, 0.642331353138319,
0.673164023466527, 0.131548822144598, 0.146964746525726), DF.Cu =
c(8.01131080231879,
6.52606822875086, 2.93449454196807, 4.08720148249298, 1.55494291704341,
1.73663851851503), DF.Cr = c(0.164849379809527, 0.196759436988158,
0.307645386162046, 0.302917612808149, 0.187202322026229, 0.25358922601195
), DF.Ni = c(0.362592459542858, 0.527078409257359, 0.477116357433909,
0.469287608844157, 0.225865184678244, 0.355321456594576), DF.Pb =
c(0.414448963979605,
0.616598678960665, -0.0531899082482045, 0.47477978516042,
0.422106471495816,
0.0326241032568164), DF.Zn = c(74.7657982668, 74.2978919524635,
36.6575117549406, 47.8440365300156, 21.4962811912273, 23.3823413091772
)), .Names = c("DF.As", "DF.Cd", "DF.Cu", "DF.Cr", "DF.Ni", "DF.Pb",
"DF.Zn"), row.names = c(NA, 6L), class = "data.frame")

## Output dput(head(D2)):
structure(list(DO.As = c(0.0150158517208966, -0.0477743050574027,
-0.121541780066373, -0.0376195600535572, 0.115393920133327,
0.265450918075612), DO.Cd = c(0.367936811743133, 0.445545318262818,
0.350071986298948, 
0.331513644782201, 0.603874629105229, 0.598527030667747), DO.Cu =
c(1.65127139067621,
1.90306634226191, 1.08280240161368, 1.12130376047927, 1.23137174481965,
1.16618813144813), DO.Cr = c(0.162996340978278, 0.493799568371693,
0.18441814919492, 0.179883906525139, 0.128058190333676, 0.030406737049484
), DO.Ni = c(0.290717040452464, 0.331891307317008, 0.387987078391917,
0.36147470695146, 0.774910299821917, 0.323259411199816), DO.Pb =
c(-0.0584055598838365,
0.377799120780818, -0.0741768575020139, 0.511278669452117,
0.320822577941608, 0.250377389869303), DO.Zn = c(16.5625482436821,
14.5084409384572, 16.571001044493, 18.4509635406253, 15.6876446591721,
12.7649440587945)), .Names = c("DO.As", "DO.Cd", "DO.Cu", "DO.Cr", "DO.Ni",
"DO.Pb", "DO.Zn"), row.names = c(NA, 6L), class = "data.frame")

I am posting this because I still get an error:

## This is code for execution:
col.names = colnames(D1)
lapply(col.names, function(t, d1, d2){ks.test(d1[, t], d2[, t])}, D1, D2)

## Output:
 Error in `[.data.frame`(d2, , t) : undefined columns chosen

(traceback button shows):

6.stop("undefined columns selected") 
5.`[.data.frame`(d2, , t) 
4.d2[, t] 
3.ks.test(d1[, t], d2[, t]) 
2.FUN(X[[i]], ...) 
1.lapply(col.names, function(t, d1, d2) {ks.test(d1[, t], d2[, t])}, D1, D2) 
Carl
Ib Nemer
  • Note: My main goal is to compare the distributions of the two data sets using ks.test, comparing column 1 with column 1, 2 with 2, 3 with 3, and so on. – Ib Nemer Oct 06 '17 at 11:14

2 Answers

5

I created two data.frames, D1 and D2, with some random numbers and the same column names.

set.seed(12)
D1 = data.frame(A=rnorm(n = 30,mean = 5,sd = 2.5),B=rnorm(n = 30,mean = 4.5,sd = 2.2),C=rnorm(n = 30,mean = 2.5,sd = 12))
D2 = data.frame(A=rnorm(n = 30,mean = 5,sd = 2.49),B=rnorm(n = 30,mean = 4.4,sd = 2.2),C=rnorm(n = 30,mean = 2,sd = 12))

Now we can loop over the column names and use each name to subset both D1 and D2, performing the ks.test on the corresponding columns of the two data.frames.

col.names = colnames(D1)
lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},D1,D2)

#[[1]]

#Two-sample Kolmogorov-Smirnov test

#data:  d1[, t] and d2[, t]
#D = 0.167, p-value = 0.81
#alternative hypothesis: two-sided


#[[2]]

#Two-sample Kolmogorov-Smirnov test

#data:  d1[, t] and d2[, t]
#D = 0.233, p-value = 0.39
#alternative hypothesis: two-sided


#[[3]]

#Two-sample Kolmogorov-Smirnov test

#data:  d1[, t] and d2[, t]
#D = 0.2, p-value = 0.59
#alternative hypothesis: two-sided

In the notation you have used in the question description, ideally the following code should work:

col.names =colnames(S)
lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},D,S)
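
If the two data frames hold the same variables in the same order but under different column names, pairing the columns by position avoids the name lookup altogether. The following is only a sketch: it assumes both data frames have the same number of columns in matching order, and the toy data and prefixed column names are made up for illustration.

```r
set.seed(12)
# Toy data: same variables, but differently prefixed column names (hypothetical)
D1 <- data.frame(DF.As = rnorm(30, 5, 2.5), DF.Cd = rnorm(30, 4.5, 2.2))
D2 <- data.frame(DO.As = rnorm(30, 5, 2.5), DO.Cd = rnorm(30, 4.4, 2.2))

# Guard against silently pairing the wrong columns
stopifnot(ncol(D1) == ncol(D2))

# Map() walks both data frames column by column, in parallel;
# each x and y is a plain numeric vector
res <- Map(function(x, y) ks.test(x, y), D1, D2)

# One p-value per column pair, named after D1's columns
p_values <- sapply(res, function(r) r$p.value)
p_values
```

Because the pairing is purely positional, it is worth checking the column order of both data frames before trusting the result.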
tushaR
  • Could you please explain what this line does: lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},D1,D2)? – Ib Nemer Oct 06 '17 at 17:40
  • I get an error: "undefined columns chosen", but when I ask for D1[,T] I get the column list... mysterious. – Ib Nemer Oct 06 '17 at 17:41
  • @Ib Nemer: First, 't' is in lower case (D1[,t]). `col.names` holds the names of all the columns (which are the same in both D1 and D2). So you loop through col.names, just like a for loop, subset D1 and D2 by column name as `D1[,t]` and `D2[,t]`, and pass them to the `ks.test` function. I am not getting any error, as I have reproduced the example here with complete code. Maybe you should check that the column names are the same in both of the data.frames you are using. – tushaR Oct 06 '17 at 17:53
  • Aaah okay! The column names are different in my case, which is why it was failing. I will see if I can figure out a solution for different column names, as I am going to use this on a wide variety of data sets and thus data frames. Have a nice evening! – Ib Nemer Oct 06 '17 at 21:06
  • I have not been able to run the test. As you said, your code works fine, but mine does not, and I was wondering whether it has something to do with the origin of my data (an Excel file). My code: library(readxl) DW <- read_excel("~/R/wood.xlsx") library(readxl) DS <- read_excel("~/R/soil.xlsx") D1 <-data.frame(DW$Ag,DW$Al,DW$As,DW$Ba,DW$Be,DW$Ca) D2 <- data.frame(DS$Ag,DS$Al,DS$As,DS$Ba,DS$Be,DS$Ca) col.names = colnames(DS) lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},DW,DS I get errors like "unexpected symbol", or I simply get a + (Missing constant) – Ib Nemer Oct 09 '17 at 09:39
  • As you are reading the data from Excel, it will already be in a `data.frame` object. Why are you creating `D1` and `D2` again? If you want to run the test for only the particular columns of `DS` and `DW` shown in your code, then you need to use the `colnames` of `D1`. Use the following code: `col.names = colnames(D1) lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},D1,D2)` – tushaR Oct 09 '17 at 10:05
  • If you still face an issue, please share the output of `dput(head(DS))` and `dput(head(DW))` in the question description. – tushaR Oct 09 '17 at 10:23
  • **The `colnames` are the same in `DF` and `DO` but not in `D1` and `D2`.** The name of the source data.frame (`DF` or `DO`) gets prepended to each column name, and that is what causes the issue. Please check that. It happens because of the way you are creating `D1` and `D2`. Create `D1` and `D2` like this instead: `D1 = DF[,c('As','Cd','Cu','Cr','Ni','Pb','Zn')] D2 = DO[,c('As','Cd','Cu','Cr','Ni','Pb','Zn')] `. Hope this helps. – tushaR Oct 10 '17 at 03:11
  • Now it says: Error: Can't use matrix or array for column indexing – Ib Nemer Oct 10 '17 at 06:31
  • D1 = DW[,c('Ag','Al','As','Ba','Be','Ca','Cd','Co','Cu','Cr','Fe','K', 'La','Mg','Mn','Na','Ni','P','Pb','S','Se','Sr','Zn')] D2 = DS[,c('Ag','Al','As','Ba','Be','Ca','Cd','Co','Cu','Cr','Fe','K', 'La','Mg','Mn','Na','Ni','P','Pb','S','Se','Sr','Zn')] col.names =colnames(D1) lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},D1,D2) – Ib Nemer Oct 10 '17 at 06:31
  • Are `DW` and `DS` matrices? How are you reading the data into `DW` and `DS`, and why do you keep changing the variable names every time? Please stay consistent. – tushaR Oct 10 '17 at 06:50
  • So far I have not faced any error. Please check that the `data frames` you pass to the `function` have the same column names as the values in `col.names`. If they match, everything should work as expected. – tushaR Oct 10 '17 at 06:59
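
The name mismatch discussed in this thread can be reproduced with a small example. This is a sketch with made-up data (the `DF` and `DO` values are hypothetical stand-ins for the real measurements). It shows that `data.frame(DF$As, ...)` prefixes the column names, whereas subsetting by name keeps them. It also extracts columns with `[[`, which returns a plain vector both for base data frames and for the tibbles that `read_excel` returns. That may be relevant to the "Can't use matrix or array for column indexing" error, although this is an assumption about its cause.

```r
set.seed(1)
# Hypothetical stand-ins for the imported data
DF <- data.frame(As = rnorm(20), Cd = rnorm(20), Zn = rnorm(20))
DO <- data.frame(As = rnorm(20), Cd = rnorm(20), Zn = rnorm(20))

# Building from `$` expressions bakes the source name into each column name
names(data.frame(DF$As, DF$Cd))   # "DF.As" "DF.Cd"
names(data.frame(DO$As, DO$Cd))   # "DO.As" "DO.Cd"

# Subsetting by name keeps the original, matching names
wanted <- c("As", "Cd")
D1 <- DF[, wanted]
D2 <- DO[, wanted]
stopifnot(identical(names(D1), names(D2)))

# `[[` pulls each column out as a plain vector
res <- lapply(setNames(wanted, wanted),
              function(nm) ks.test(D1[[nm]], D2[[nm]]))
sapply(res, function(r) r$p.value)
```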
4

A tidyverse solution using the map function from the purrr package together with the tidy function from the broom package:

library(purrr)
library(broom)

# Data posted by @tushaR
set.seed(12)
D1 <- data.frame(A = rnorm(n = 30, mean = 5, sd = 2.5), 
                 B = rnorm(n = 30, mean = 4.5, sd = 2.2), 
                 C = rnorm(n = 30, mean = 2.5, sd = 12))
D2 <- data.frame(A = rnorm(n = 30, mean = 5, sd = 2.49), 
                 B = rnorm(n = 30, mean = 4.4, sd = 2.2), 
                 C = rnorm(n = 30, mean = 2, sd = 12))

# Loop through each column
result <- colnames(D1) %>%
  set_names() %>% 
  # apply `ks.test` function for each column pair
  map(~ ks.test(D1[, .x], D2[, .x])) %>%
  # extract test results using `tidy` then bind them together by rows
  map_dfr(., broom::tidy, .id = "parameter")
result

#> # A tibble: 3 x 5
#>   parameter statistic p.value method                           alternative
#>   <chr>         <dbl>   <dbl> <chr>                            <chr>      
#> 1 A             0.167   0.808 Two-sample Kolmogorov-Smirnov t~ two-sided  
#> 2 B             0.2     0.594 Two-sample Kolmogorov-Smirnov t~ two-sided  
#> 3 C             0.233   0.393 Two-sample Kolmogorov-Smirnov t~ two-sided

Created on 2018-08-24 by the reprex package (v0.2.0.9000).
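
For comparison, a similar summary table can be built in base R alone. This is a sketch rather than part of the answer above; it reuses the same simulated `D1` and `D2`, and `ks_row` is a hypothetical helper name.

```r
set.seed(12)
D1 <- data.frame(A = rnorm(30, 5, 2.5), B = rnorm(30, 4.5, 2.2), C = rnorm(30, 2.5, 12))
D2 <- data.frame(A = rnorm(30, 5, 2.49), B = rnorm(30, 4.4, 2.2), C = rnorm(30, 2, 12))

# Run ks.test on one column pair and keep only the numbers of interest
ks_row <- function(nm) {
  r <- ks.test(D1[[nm]], D2[[nm]])
  data.frame(parameter = nm,
             statistic = unname(r$statistic),
             p.value   = r$p.value)
}

# Only test columns that are present in both data frames
common <- intersect(names(D1), names(D2))
result <- do.call(rbind, lapply(common, ks_row))
result
```

Restricting the loop to `intersect(names(D1), names(D2))` means a column missing from one data frame is skipped instead of raising "undefined columns selected".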

Tung
  • I have a very similar question here; may I ask you to have a look at it? https://stackoverflow.com/questions/73743765/running-ks-test-on-multiple-groups-in-r – PesKchan Sep 16 '22 at 19:56