2

Sorry for the confusing title, this one's a bit hard to describe. Basically, I've got two data tables which look similar to this:

df1 <- data.frame(SNP=c("W", "X", "Y", "Z"),
                  Gene.ID=c("A", "B", "C", "B"), pval=NA)
df2 <- data.frame(W=c(1, 0, 1), X=c(1, 1, 0), Y=c(0, 0, 1), Z=c(1, 0, 1),
                  A=c(3.5, 2.5, 3.5), C=c(4.5, 2.5, 1.5), B=c(1.5, 2.5, 1.5))

So all the entries in df1 correspond to column names in df2. My goal is to fill df1$pval with the p-values from a t-test. For every row in df1, I want to do a t-test comparing the df2 column that matches the value of df1$SNP, and compares that against the df2 column that matches the value of df1$Gene.ID.

For example, for the first row in df1, I would want to compare df2$W vs. df2$A, and then return the resulting p-value inside of df1[1, 3]. For the second row, I would compare df2$X vs. df2$B and return the p-value in df1[2, 3]. In other words, something like this:

for (i in 1:nrow(df1)){
  test <- t.test(df2[,which(colnames(df2)==df1[i, 1]] ~ df2[,which(colnames(df2)==df1[i, 2]])
  df1[i, 3] <- test$p.value
}

But this does not work because you can only select multiple column names using the colnames function, not just a single column name. Suggestions for how to get around this would be greatly appreciated--or if you have a simpler method in mind, that would be great too.

2 Answers2

1

I don't understand why you think this wouldn't work - I think your code just has syntax errors in it. The following code seems to work fine (note the change to use sapply, which is slightly more conventional in R):

df1[, 3] <- sapply(seq_len(nrow(df1)), 
  function(i) {
    test <- t.test(
      df2[, which(colnames(df2) == df1[i, 1])],
      df2[, which(colnames(df2) == df1[i, 2])])
    test$p.value
  })
alan ocallaghan
  • 3,116
  • 17
  • 37
  • Maybe I'm just doing something wrong, but I can't seem to get colnames to work with just a single column. For example, if I were to type `colnames(df[,1])` the output is NULL, where as `colnames(df[,1:2]` returns SNP and Gene.ID. – Maya Gosztyla Apr 18 '18 at 21:08
  • 1
    Look at the output of `df[, 1]` vs `df[, 1:2]` or `df[, 1, drop=FALSE]`, and see the following sections (8.1.44 and 45): http://www.burns-stat.com/pages/Tutor/R_inferno.pdf#page=68 – alan ocallaghan Apr 18 '18 at 21:10
  • 1
    btw I don't mean to sound short - it is a common mistake, and you are in fact merely expecting logical and consistent behaviour from R - something I'm sure you'll stop doing with time :) – alan ocallaghan Apr 18 '18 at 21:12
1

The use of which(colnames(df2)...) may not be the best choice here, since all you want to do is to select the columns of df2 that have df1[i,1] or df1[i,2] as names.

In R, one way to select a column by its name is to use the double brackets: e.g. df2[["A"]] will retrieve the column A of df2, which seems to be what you want, and less cumbersome than df2[, which(colnames(df2) == "A")].

With that in mind, you could rewrite your code like this:

for (i in 1:nrow(df1)){
  test <- t.test(df2[[df1[i, 2]]] ~ df2[[df1[i, 1]]])
  df1[i, 3] <- test$p.value
} 

Notice that I switched df1[i, 1] and df1[i, 2] since the documentation of t.test states that the binary variable has to be on the right hand side.

a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups

kluu
  • 2,848
  • 3
  • 15
  • 35
  • That double bracket trick is really handy! I didn't realize you could do this without the $ operator. Thanks for the tip! – Maya Gosztyla Apr 19 '18 at 00:32
  • Yeah, I was a big fan of the `$`, it's simpler to use, but `[[...]]` offers a little bit more. – kluu Apr 19 '18 at 05:47