How can I compare two data frames (test and control) of unequal length and remove a row from test when all three of the following criteria hold: i) test$chr == control$chr; ii) test$start and test$end lie within the range of control$start and control$end; iii) test$CNA and control$CNA are the same?

    test =
        R_level   logp    chr   start   end    CNA    Gene
        2         7.079   11    1159    1360   gain   Recl,Bcl
        11        2.4     12    6335    6345   loss   Pekg
        3         19      13    7180    7229   loss   Sox1

    control =
        R_level   logp    chr   start   end    CNA    Gene
        2         5.9     11    1100    1400   gain   Recl,Bcl
        2         3.46    11    1002    1345   gain   Trp1
        2         6.4     12    6705    6845   gain   Pekg
        4         7       13    6480    8129   loss   Sox1

The result should look something like this

    result =
        R_level   logp    chr   start   end    CNA    Gene
        11        2.4     12    6335    6345   loss   Pekg
beginner

2 Answers

Here's one way using foverlaps() from data.table.

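library(data.table) # v1.9.4+
dt1 <- as.data.table(test)
dt2 <- as.data.table(control)
# Key control on chr and CNA so overlaps are only sought within matching groups;
# the last two key columns (start, end) define the interval
setkey(dt2, chr, CNA, start, end)

# type="within": the test interval must lie entirely inside a control interval
olaps = foverlaps(dt1, dt2, nomatch=0L, which=TRUE, type="within")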
#    xid yid
# 1:   1   2
# 2:   3   4

dt1[!olaps$xid]
#    R_level logp chr start  end  CNA Gene
# 1:      11  2.4  12  6335 6345 loss Pekg

Read ?foverlaps and see the examples section for more info.

Alternatively, you can use the GenomicRanges package. However, you might have to filter on CNA after merging by overlapping regions (as far as I can tell).
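
For readers who want to avoid an extra package, the same filter can be sketched in base R with merge(). This is an illustrative alternative, not the approach used above: the helper column row_id is introduced here only to track test rows, the data frames repeat the question's example with assumed column types, and the interval bounds are treated as inclusive.

```r
# Example data from the question (column types assumed)
test <- data.frame(
  R_level = c(2L, 11L, 3L), logp = c(7.079, 2.4, 19),
  chr = c(11L, 12L, 13L),
  start = c(1159L, 6335L, 7180L), end = c(1360L, 6345L, 7229L),
  CNA = c("gain", "loss", "loss"),
  Gene = c("Recl,Bcl", "Pekg", "Sox1"),
  stringsAsFactors = FALSE
)
control <- data.frame(
  R_level = c(2L, 2L, 2L, 4L), logp = c(5.9, 3.46, 6.4, 7),
  chr = c(11L, 11L, 12L, 13L),
  start = c(1100L, 1002L, 6705L, 6480L), end = c(1400L, 1345L, 6845L, 8129L),
  CNA = c("gain", "gain", "gain", "loss"),
  Gene = c("Recl,Bcl", "Trp1", "Pekg", "Sox1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper column so each test row can be identified after merging
test$row_id <- seq_len(nrow(test))

# Pair every test row with every control row that shares chr and CNA
m <- merge(test, control, by = c("chr", "CNA"),
           suffixes = c(".test", ".control"))

# TRUE where the test interval lies inside the paired control interval
contained <- m$start.test >= m$start.control & m$end.test <= m$end.control

# Drop test rows with at least one containing control row, then drop the helper
drop_ids <- unique(m$row_id[contained])
result <- test[!test$row_id %in% drop_ids, names(test) != "row_id"]
result
```

Because merge() builds every chr/CNA pairing before filtering, this can use a lot of memory on large inputs, which is one reason an interval join such as foverlaps() may be preferable there.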

Arun

When you say "exclude the variable", I assume you mean you want to remove the rows that satisfy those criteria.

If so, you are nearly there. The following should work:

# Element-wise: row i of data1 is compared with row i of data2
exclude_bool <- data1[,3] == data2[,3] &
    data1[,4] > data2[,5] &
    data1[,5] < data2[,4] &
    data1[,6] == data2[,6]

data1 <- data1[!exclude_bool , ]
dwcoder
  • Thank you, dwcoder. But it seems the two data frames do not have the same length. Warning messages: 1: In data1[, 3] == data2[, 3] : longer object length is not a multiple of shorter object length 2: In data1[, 4] > data2[, 5] : longer object length is not a multiple of shorter object length 3: In data1[, 5] < data2[, 4] : longer object length is not a multiple of shorter object length 4: In daat1[, 6] == daat2[, 6] : longer object length is not a multiple of shorter object length – beginner Jan 23 '15 at 12:47
  • If the lengths are not the same, the comparisons don't make sense. What do you mean when you say `data1[,3]==data2[,3]`? In R, this syntax compares the values element-wise. The same is true for `data1[,4] > data2[,5]`: how do you want the comparisons to happen? Should the max value of `data1[,5]` be larger than that of `data2[,4]`? – dwcoder Jan 23 '15 at 13:06
  • data1 is test and data2 is control. I want to filter out a row in test if a) data1[,3] is equal to data2[,3], b) the start and end of test (data1) are within the range of the start and end of control (data2), and c) data1[,6] is also equal to data2[,6]. – beginner Jan 23 '15 at 13:14
  • Then they will still need to have the same number of rows. I think you will have to update your question to reflect exactly what you are trying to do. – dwcoder Jan 23 '15 at 13:27
  • But what if the numbers of rows are not equal? How can we force R to perform the comparison? In my case the number of rows in test and control is not equal. – beginner Jan 23 '15 at 13:37
  • You can't force R to compare two objects of different lengths, just like you can't force R to compare apples with pears. You have to tell it how to compare the objects. If you update your question with an example of how you want to compare, perhaps we can use a different function. – dwcoder Jan 23 '15 at 13:41
  • What do you mean when you say `data1[,6]` is equal to `data2[,6]`? These are two vectors of different lengths. How do you compare them? That is, how would you compare `c(1,2,3)` with `c(3,2)`? – dwcoder Jan 23 '15 at 13:58
  • Sorry for not being clear. The sixth column of control and test contains two values, gain and loss. If the values in test and control are the same, then remove the row; if test contains gain and control contains loss, or vice versa, print it. – beginner Jan 23 '15 at 14:07
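
The comparison this thread is working toward can be sketched in base R by checking each test row against all control rows, rather than pairing row i with row i. This is a minimal sketch under assumptions: the data frames repeat the question's example with assumed column types, inside_control is a hypothetical helper name, and the range test treats the bounds as inclusive.

```r
# With unequal lengths R recycles the shorter vector (and warns):
# c(1, 2, 3) is compared against c(3, 2, 3)
recycled <- suppressWarnings(c(1, 2, 3) == c(3, 2))

# Example data from the question (column types assumed)
test <- data.frame(
  R_level = c(2L, 11L, 3L), logp = c(7.079, 2.4, 19),
  chr = c(11L, 12L, 13L),
  start = c(1159L, 6335L, 7180L), end = c(1360L, 6345L, 7229L),
  CNA = c("gain", "loss", "loss"),
  Gene = c("Recl,Bcl", "Pekg", "Sox1"),
  stringsAsFactors = FALSE
)
control <- data.frame(
  R_level = c(2L, 2L, 2L, 4L), logp = c(5.9, 3.46, 6.4, 7),
  chr = c(11L, 11L, 12L, 13L),
  start = c(1100L, 1002L, 6705L, 6480L), end = c(1400L, 1345L, 6845L, 8129L),
  CNA = c("gain", "gain", "gain", "loss"),
  Gene = c("Recl,Bcl", "Trp1", "Pekg", "Sox1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if test row i matches any control row on
# chr and CNA and its interval lies within that control row's interval
inside_control <- function(i) {
  any(control$chr == test$chr[i] &
      control$CNA == test$CNA[i] &
      control$start <= test$start[i] &
      control$end >= test$end[i])
}

# One logical per test row, so the two data frames may differ in length
drop <- vapply(seq_len(nrow(test)), inside_control, logical(1))
result <- test[!drop, ]
result
```

Each call compares one scalar from test against a whole column of control, so no recycling between mismatched vectors takes place and the length warnings do not arise.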