I have a convoluted problem, and I hope I can explain it easily...
I have the following data:
CHROM POS REF SNP INDEL
5 290 A --|T|--|-- 0
5 890 A A|T|--|G 0
7 672 A A|--|C|-- +C,+CC
9 459 G A|T|--|G -C
I want to create an ALT variable so I can eventually run this through VCFtools. However, I'm not entirely sure how to create a variable by continually adding to it if and only if a certain statement is satisfied.
For instance:
The first row is easy: the ALT is only T. However, I only want to paste T into the ALT column, without adding the "|" or "--". The second row is slightly different: I don't want to add the A to the ALT variable just because it's seen under the SNP entry (it matches REF), but I do want to add the T and the G, separated by a comma.
So in essence, I want to add each letter to the ALT variable only if it does not equal the REF variable and it's not equal to "--".
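To pin the rule down, here is a minimal sketch of what I mean for the SNP-only rows. The helper name `snp_alt` is made up for illustration, the two rows are copied from my example data, and it deliberately ignores the INDEL column, so treat it as a rough starting point rather than a working solution:

```r
# Hypothetical helper (not from any package): build ALT for SNP-only rows.
snp_alt <- function(ref, snp) {
  # Split each "A|T|--|G" string on the literal pipe character.
  alleles <- strsplit(as.character(snp), "|", fixed = TRUE)
  # Keep alleles that are neither the "--" placeholder nor the REF base,
  # then join whatever is left with commas.
  mapply(function(a, r) paste(a[a != "--" & a != r], collapse = ","),
         alleles, as.character(ref), USE.NAMES = FALSE)
}

# Two rows from the example data above.
mito <- data.frame(
  CHROM = c(5, 5),
  POS   = c(290, 890),
  REF   = c("A", "A"),
  SNP   = c("--|T|--|--", "A|T|--|G"),
  stringsAsFactors = FALSE
)
mito$ALT <- snp_alt(mito$REF, mito$SNP)
```

For these two rows that should give "T" and "T,G", which matches the ALT values I am after.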
I've gone ahead and split the SNP column up as follows:
# Split each SNP string on "|" and keep one position per new column.
mito$A <- sapply(strsplit(as.character(mito$SNP),"\\|"),function(x) x[1])
mito$T <- sapply(strsplit(as.character(mito$SNP),"\\|"),function(x) x[2])
mito$C <- sapply(strsplit(as.character(mito$SNP),"\\|"),function(x) x[3])
mito$G <- sapply(strsplit(as.character(mito$SNP),"\\|"),function(x) x[4])
But I'm kinda stuck from here. I also have a problem with "+C,+CC" and "-C": for these, the letters in the SNP column are ignored, and the REF and ALT become "A" and "AC,ACC" for the first, and "GC" and "G" for the second. I've also split this column up:
# Split INDEL on "," and keep the first and second entries (NA if absent).
mito$indel1 <- sapply(strsplit(as.character(mito$INDEL),","),function(x) x[1])
mito$indel2 <- sapply(strsplit(as.character(mito$INDEL),","),function(x) x[2])
If this doesn't really make sense, here is what I would like the output to look like for the different cases:
CHROM POS REF SNP INDEL ALT
5 290 A --|T|--|-- 0 T
5 890 A A|T|--|G 0 T,G
7 672 A A|--|C|-- +C,+CC AC,ACC
9 459 GC A|T|--|G -C G
I've only included the above examples, but there are many different combinations of these in the file. Can this be done in R, or is this going to get very complicated?
Thanks in advance...
Note 1:
First, apologies if this wasn't clear to begin with in my query above, and thanks to those who have helped so far. As requested: the ALT variable will change for INDELs depending on whether there is a "-" sign or a "+" sign in front of the INDEL (i.e. these rows won't follow the same rule as the SNPs, which make up most of the rows).
For example:
For "-C" (or wherever there is a "-" sign), as stated above, REF needs to become REF+INDEL and ALT becomes the original REF (separated by a comma if need be):
CHROM POS REF SNP INDEL ALT
9 459 GC A|T|--|G -C G
If there is a "+" sign (whether it be +C,+CC or +GGG or something else), REF stays the same, but ALT becomes REF+INDEL (separated by a comma if need be):
CHROM POS REF SNP INDEL ALT
7 672 A A|--|C|-- +C,+CC AC,ACC
9 987 T --|T|C|-- +GGG TGGG
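To make the two INDEL rules concrete, here is a rough sketch of how I picture them for a single row. The function name `indel_ref_alt` is hypothetical, and I am assuming a "-" row carries only one deletion, because I am not sure how several deletions in one row should be combined:

```r
# Hypothetical helper: apply the INDEL rules to one row.
# Returns a named character vector with the new REF and ALT.
indel_ref_alt <- function(ref, indel) {
  parts <- strsplit(indel, ",", fixed = TRUE)[[1]]
  sign  <- substr(parts, 1, 1)   # leading "+" or "-"
  bases <- substring(parts, 2)   # the inserted or deleted bases
  if (all(sign == "+")) {
    # Insertion: REF stays the same, ALT is REF followed by each insertion.
    c(REF = ref, ALT = paste(paste0(ref, bases), collapse = ","))
  } else if (length(parts) == 1 && sign == "-") {
    # Deletion (assumption: one per row): REF becomes REF+INDEL, ALT is the old REF.
    c(REF = paste0(ref, bases), ALT = ref)
  } else {
    # Anything else (e.g. "0" or mixed signs) is not covered by this sketch.
    stop("unsupported INDEL value: ", indel)
  }
}

indel_ref_alt("A", "+C,+CC")
indel_ref_alt("G", "-C")
indel_ref_alt("T", "+GGG")
```

Rows where INDEL is 0 would fall through to the error branch here, so they would still need the SNP rule applied separately.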