I have a data file that is several million lines long, and contains information from many groups. Below is an abbreviated section:

MARKER   GROUP1_A1  GROUP1_A2  GROUP1_FREQ  GROUP1_N  GROUP2_A1  GROUP2_A2  GROUP2_FREQ  GROUP2_N
rs10     A          C          0.055        1232      A          C          0.055        3221
rs1000   A          G          0.208        1232      A          G          0.208        3221
rs10000  G          C          0.134        1232      C          G          0.8624       3221
rs10001  C          A          0.229        1232      A          C          0.775        3221

I would like to create a weighted average of the frequency (FREQ) variable, which in itself is straightforward. However, in this case some of the rows are mismatched (rows 3 and 4). If the letters do not line up, then the frequency of the second group needs to be subtracted from 1 before the weighted mean of that marker is calculated.

I would like to set up a simple IF statement, but I am unsure of the syntax of such a task.

Any insight or direction is appreciated!

mfk534

1 Answer

Say you've read your data in a data frame called mydata. Then do the following:

mydata$GROUP2_FREQ <- mydata$GROUP2_FREQ - (mydata$GROUP1_A1 != mydata$GROUP2_A1)

It works because R treats TRUE values as 1 and FALSE values as 0.
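As a quick illustration of that coercion, here is a minimal sketch using the allele letters from the four sample rows in the question. The vector names `g1_a1`, `g2_a1`, and `mismatch` are only illustrative and are not part of the original data.

```r
# Allele letters copied from the GROUP1_A1 and GROUP2_A1 columns of the sample rows
g1_a1 <- c("A", "A", "G", "C")
g2_a1 <- c("A", "A", "C", "A")

# Element-wise comparison gives a logical vector: TRUE where the letters differ
mismatch <- g1_a1 != g2_a1
print(mismatch)       # FALSE FALSE  TRUE  TRUE

# In arithmetic, TRUE is coerced to 1 and FALSE to 0
print(mismatch + 0)   # 0 0 1 1
print(sum(mismatch))  # 2 mismatched rows
```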

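EDIT: The line above subtracts 1 from the frequency, whereas the frequency should be subtracted from 1, and it can fail when the columns are read in as factors. Try the following instead: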

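# as.character() first, so that a factor column yields its values rather than its level codes
mydata$GROUP2_FREQ <- abs( (as.character(mydata$GROUP1_A1) !=
                            as.character(mydata$GROUP2_A1)) -
                          as.numeric(as.character(mydata$GROUP2_FREQ)) )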
Edward
  • Answer edited; I misunderstood your question the first time I read it. – Edward Jul 27 '12 at 19:53
  • Hi Edward, thanks. I'm not sure this is quite what I'm looking for. It's given me an error: "Error in `$<-.data.frame`(`*tmp*`, "GROUP2_FREQ", value = numeric(0)) : replacement has 0 rows, data has 5" so I can't be certain, but the frequency should be subtracted from 1, ONLY if the A1 and A2 between groups don't match. Does that make sense? – mfk534 Jul 27 '12 at 20:20
  • It does. I see you want to subtract the frequency from 1, not the other way round (which I thought from your question). So my answer should be changed to reflect that... as for your error, is there any way you can give a small reproducible sample to check against? – Edward Jul 27 '12 at 20:29
  • I found the problem - the data.frame saved everything as factors. Try wrapping my solution in `as.numeric` (I'll post an update to my solution now) – Edward Jul 27 '12 at 20:35
  • No problem. Just one thing with the `abs` - my solution will only work under the assumption that frequencies are between 0 and 1. If you ever have to deal with a larger range, you'll have to do the checking in a more elegant way than I have it here – Edward Jul 27 '12 at 20:47
  • Good to know! In this case the range will always be between 0-1 (or something has screwed up). Thanks again! – mfk534 Jul 27 '12 at 21:45
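
For readers who want the explicit IF form asked about in the question, the following is a minimal sketch using `ifelse`, which does not rely on the frequencies lying between 0 and 1. It rebuilds the four sample rows by hand. Weighting each group by its N column is an assumption, since the thread does not say which weights are intended, and the names `flipped`, `g2_freq`, and `WEIGHTED_FREQ` are hypothetical.

```r
# Sample rows from the question; stringsAsFactors = FALSE keeps the alleles as character
mydata <- data.frame(
  MARKER      = c("rs10", "rs1000", "rs10000", "rs10001"),
  GROUP1_A1   = c("A", "A", "G", "C"),
  GROUP1_A2   = c("C", "G", "C", "A"),
  GROUP1_FREQ = c(0.055, 0.208, 0.134, 0.229),
  GROUP1_N    = c(1232, 1232, 1232, 1232),
  GROUP2_A1   = c("A", "A", "C", "A"),
  GROUP2_A2   = c("C", "G", "G", "C"),
  GROUP2_FREQ = c(0.055, 0.208, 0.8624, 0.775),
  GROUP2_N    = c(3221, 3221, 3221, 3221),
  stringsAsFactors = FALSE
)

# TRUE where the first allele differs between the two groups
flipped <- mydata$GROUP1_A1 != mydata$GROUP2_A1

# Vectorised IF: use 1 - FREQ for mismatched rows, FREQ unchanged otherwise
g2_freq <- ifelse(flipped, 1 - mydata$GROUP2_FREQ, mydata$GROUP2_FREQ)

# Weighted mean per marker, weighting by sample size (assumption: weights are the N columns)
mydata$WEIGHTED_FREQ <-
  (mydata$GROUP1_FREQ * mydata$GROUP1_N + g2_freq * mydata$GROUP2_N) /
  (mydata$GROUP1_N + mydata$GROUP2_N)

print(mydata[, c("MARKER", "WEIGHTED_FREQ")])
```

Because `ifelse` works on whole columns at once, no explicit loop is needed, which matters for a file with several million rows.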