I have a very large data set that contains multiple groups. They all contain the same information; however, occasionally and inconsistently, this information is misordered. In the example below, the GROUP1_A1 and GROUP2_A1 columns do not match (rows 3 and 4 are flipped), so the rest of the information in those rows is not comparable. To correct this, GROUP1_BETA should be multiplied by -1 wherever the A1 columns of the two groups do not match; where they do match, the Beta should remain as it is.

MARKER  GROUP1_A1  GROUP1_A2  GROUP1_BETA  GROUP1_SE  GROUP2_A1  GROUP2_A2  GROUP2_BETA GROUP2_SE
rs10        A         C         -0.055      0.003        A         C           0.056      0.200
rs1000      A         G          0.208      0.100        A         G           0.208      0.001
rs10000     G         C         -0.134      0.009        C         G          -0.8624     0.010
rs10001     C         A          0.229      0.012        A         C           0.775      0.003

When dealing with frequencies falling between 0-1, I was using:

data$GROUP1_oppositeFrequency <- abs( (as.character(data$Group2_A1) != 
                            as.character(data$Group1_A1)) -                   
                          as.numeric(data$Group1_Frequency) )

However, because Beta values can be negative, this approach is not going to work for them. Can anyone point me in the right direction?

Reproducible Data:

data <- textConnection("SNP,GROUP1_A1,GROUP1_A2,GROUP1_Beta,GROUP1_SE,GROUP2_A1,GROUP2_A2,GROUP2_Beta,GROUP2_SE,GROUP3_A1,GROUP3_A2,GROUP3_Beta,GROUP3_SE
rs1050,C,T,0.0462,0.0035,T,C,0.007,0.0039,C,T,-0.007,0.009
rs1073,A,G,-0.0209,0.0035,A,G,0.0004,0.0031,A,G,-0.009,0.013
rs1075,C,T,-0.001,0.0039,T,C,-0.0013,0.0028,C,T,0.004,0.011
rs1085,C,G,-0.0001,0.0068,C,G,-0.0027,0.0032,C,G,-0.049,0.026
rs1127,C,T,0.0015,0.0044,T,C,0.0002,0.0029,C,T,-0.017,0.009
rs1312,A,G,-0.0014,0.0039,A,G,-0.0025,0.0029,A,G,0,0.01")
test_data <- read.csv(data, header = TRUE, sep = ",")
Jilber Urbina
mfk534
  • could you try to post a reproducible dataset, maybe by using `dput(head(yourDataFrame,100))`. – Matt Bannert Aug 29 '12 at 15:14
  • Also, your "very large" could be someone else's "tiny". How many rows? Then we know whether `data.table` is required, or not. – Matt Dowle Aug 29 '12 at 15:29
  • 10 cols (say) x 500k rows is under 40MB so that's quite small. Unless you have a lot of groups (say 100k groups of 5 rows) then probably no speed worries then. – Matt Dowle Aug 29 '12 at 22:54

1 Answer

Assuming the only possibility is a "flip", all you need to do is check whether the value in group1_a1 is identical to the value in group2_a1. Hence:

mydata$group1_beta <- mydata$group1_beta * (-1)^((mydata$group1_a1 == mydata$group2_a1) + 1)
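
As a rough sketch of how the line above might be applied to the question's reproducible data, the block below uses the first three rows of that sample and only the GROUP1 and GROUP2 columns. The names `csv_text`, `same_a1`, and `GROUP1_Beta_aligned` are made up for this illustration, and wrapping the allele columns in `as.character()` is an added precaution (not part of the one-liner above) in case they were read in as factors.

```r
# Subset of the question's sample data (first three SNPs, groups 1 and 2 only)
csv_text <- "SNP,GROUP1_A1,GROUP1_A2,GROUP1_Beta,GROUP1_SE,GROUP2_A1,GROUP2_A2,GROUP2_Beta,GROUP2_SE
rs1050,C,T,0.0462,0.0035,T,C,0.007,0.0039
rs1073,A,G,-0.0209,0.0035,A,G,0.0004,0.0031
rs1075,C,T,-0.001,0.0039,T,C,-0.0013,0.0028"
test_data <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# TRUE where the A1 values agree; as.character() guards against factor columns
same_a1 <- as.character(test_data$GROUP1_A1) == as.character(test_data$GROUP2_A1)

# Keep Beta where A1 agrees, negate it where A1 is flipped.
# Written to a new (hypothetical) column so the original Beta is preserved.
test_data$GROUP1_Beta_aligned <- test_data$GROUP1_Beta * (-1)^(same_a1 + 1)

print(test_data[, c("SNP", "GROUP1_A1", "GROUP2_A1",
                    "GROUP1_Beta", "GROUP1_Beta_aligned")])
```

Writing the result to a separate column rather than overwriting `GROUP1_Beta` is a design choice for this sketch: rerunning the line on an already-corrected column would flip the signs back.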

Update: here's an example, showing it works (at least, the way I meant it to work :-) ).

Rgames> mydat<-data.frame(A=c('a','b','d','c'),B=c('a','b','c','d'),one=1:4,two=1:4)
Rgames> mydat
  A B one two
1 a a   1   1
2 b b   2   2
3 d c   3   3
4 c d   4   4
Rgames> mydat$two<-mydat$two*(-1)^((mydat$A==mydat$B)+1)
Rgames> mydat
  A B one two
1 a a   1   1
2 b b   2   2
3 d c   3  -3
4 c d   4  -4
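
The exponent trick in the example above relies on R coercing logicals to numbers: `TRUE + 1` is 2 and `FALSE + 1` is 1, so `(-1)^(...)` yields 1 for a match and -1 for a mismatch. The short sketch below spells that out and compares it with an `ifelse()` form that some readers may find easier to follow; the vectors `same` and `beta` are invented for illustration.

```r
# Hypothetical inputs: one matching row, one flipped row
same <- c(TRUE, FALSE)
beta <- c(0.208, -0.134)

# Logical values are coerced to numeric: TRUE + 1 is 2, FALSE + 1 is 1
exponent <- same + 1

# (-1)^2 is 1 (keep the sign), (-1)^1 is -1 (flip the sign)
via_power <- beta * (-1)^exponent

# Equivalent, more explicit formulation
via_ifelse <- ifelse(same, beta, -beta)

# Both approaches should agree
stopifnot(isTRUE(all.equal(via_power, via_ifelse)))
print(via_ifelse)
```
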
Carl Witthoft
  • This line is giving me the following error message: `Error in mydata$GROUP2_A1 + 1 : non-numeric argument to binary operator` – mfk534 Aug 29 '12 at 16:00
  • @mfk534 At a glance, it looks like operator precedence is choosing the `+` before the `==`. Wrap `mydata$GROUP1_A1 == mydata$GROUP2_A1` in parentheses. – Blue Magister Aug 29 '12 at 17:17
  • I tried that, but it gave me: `Error in `$<-.data.frame`(`*tmp*`, "Cogent_WBC_Beta_new", value = numeric(0)) : replacement has 0 rows, data has 542303` – mfk534 Aug 29 '12 at 17:42
  • I may have misread the inputs: can you report what is returned from `mydata$Group2_A1` and `mydata$Group1_A1`, and, assuming you get "A" "G" and so on, whether these are of type "character" and not "factor"? – Carl Witthoft Aug 29 '12 at 18:38
  • @mfk534 Ping for the last comment (which I think is needed now there are 3 people in the comment thread) – Blue Magister Aug 29 '12 at 18:44
  • Thanks for your patience! It was a factor/character issue :] – mfk534 Aug 29 '12 at 21:40