I have a convoluted problem, and I hope I can explain it easily...

I have the following data:

CHROM   POS      REF    SNP          INDEL  
5       290      A      --|T|--|--   0  
5       890      A      A|T|--|G     0  
7       672      A      A|--|C|--    +C,+CC     
9       459      G      A|T|--|G     -C     

I want to create an ALT variable so I can eventually run this through VCFtools. However, I'm not entirely sure how to create a variable by continually adding to it if and only if a certain statement is satisfied.

For instance:

The first row is easy: the ALT is only T. However, I only want to paste T in the ALT column, without adding the "|" or "--". The second row is slightly different: I don't want to add the A to the ALT variable just because it is seen under the SNP entry (it matches REF), but I do want to add the T and the G, separated by a comma.

So in essence, I want to add each letter to the ALT variable only if it does not equal the REF variable and it's not equal to "--".

I've gone ahead and split the SNP column up as follows:

mito$A <- sapply(strsplit(as.character(mito$SNP),"\\|"),function(x) x[1])
mito$T <- sapply(strsplit(as.character(mito$SNP),"\\|"),function(x) x[2])
mito$C <- sapply(strsplit(as.character(mito$SNP),"\\|"),function(x) x[3])
mito$G <- sapply(strsplit(as.character(mito$SNP),"\\|"),function(x) x[4])

But I'm kind of stuck from here. I also have the problem of "+C,+CC" and "-C". For these rows the letters in the SNP column are ignored: for "+C,+CC", REF stays "A" and ALT becomes "AC,ACC"; for "-C", REF becomes "GC" and ALT becomes "G". I've also split this column up:

mito$indel1 <- sapply(strsplit(as.character(mito$INDEL),","),function(x) x[1])
mito$indel2 <- sapply(strsplit(as.character(mito$INDEL),","),function(x) x[2])

If this doesn't really make sense, here is what I would like the different options to be:

CHROM   POS      REF    SNP          INDEL    ALT
5       290      A      --|T|--|--   0        T 
5       890      A      A|T|--|G     0        T,G   
7       672      A      A|--|C|--    +C,+CC   AC,ACC    
9       459      GC     A|T|--|G     -C       G

I've only included the above examples, but there are many different combinations of these in the file. Can this be done in R, or is this going to get very complicated?

Thanks in advance...

Note 1:

First, apologies if this wasn't clear to begin with in my query above, and thanks to those who have helped so far. As requested: the ALT variable will change for INDELs depending on whether there is a "-" sign or a "+" sign in front of the INDEL (i.e. this won't follow the same rule as the SNPs, which will be most of the rows).

For example:

  1. "-C" (or wherever there is a "-" sign): as stated above, REF needs to become REF+INDEL and ALT becomes the original REF (separated by a comma if need be):

    CHROM   POS      REF    SNP          INDEL    ALT   
    9       459      GC     A|T|--|G     -C       G
    
  2. If there is a "+" sign (whether it be +C,+CC or +GGG or something else), REF stays the same, but ALT becomes REF+INDEL (separated by a comma if need be):

    CHROM   POS      REF    SNP          INDEL    ALT
    7       672      A      A|--|C|--    +C,+CC   AC,ACC    
    9       987      T      --|T|C|--    +GGG     TGGG
    
zx8754
user2726449
  • that's an unusual looking vcf, what program made that? Maybe you should just use something like GATK which outputs the VCF in the format that you want by default. – JeremyS Feb 19 '14 at 01:22
  • This is not a VCF... we want to convert a TXT to VCF (which can be done) – user2726449 Feb 19 '14 at 14:51
  • Sorry if I missed something, but are there also `INDEL` values with two `-`'s or both `+` and `-`? I.e. like `+C,-CC` or `-C,-G`. If so, what happens in those cases to `REF` and `ALT`? – alexis_laz Feb 19 '14 at 18:10
  • Generally there shouldn't be a deletion and an insertion at the same place. But yeah - it can be multiple `-`'s and multiple `+`'s (as well as single `-` or `+`) but there can't be a `+` AND a `-` – user2726449 Feb 19 '14 at 18:15

3 Answers

2

I tried something a bit naive (and probably not so efficient) on an arbitrary extrapolation of your data frame. Of course, if I'm mistaken anywhere, it can be changed. Anyway, I hope it's helpful.

#> DF
#   CHROM POS REF        SNP  INDEL
#1      5 290   A --|T|--|--      0
#2      5 890   A   A|T|--|G      0
#3      7 672   A  A|--|C|-- +C,+CC
#4      9 459   G   A|T|--|G     -C
#5      3 554   T   A|T|--|G -GG,-A
#6      9 987   T  --|T|C|--   +GGG
#7     21 214   G   A|T|--|G      0
#8      1 145   G  G|--|--|G      0
#9      3 554 T,C   A|T|--|G -GG,-A
#10     7 672 A,T  A|--|C|-- +C,+CC

And what I thought would be the solution:

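ff = function(xrow) {
   # xrow is one row of DF, passed in by apply() as a character vector
   ref   = as.character(xrow[3])
   snp   = as.character(xrow[4])
   indel = as.character(xrow[5])

   if(indel == "0") {
      # SNP row: strip the REF allele, the "|" separators and the "--" placeholders
      alt = gsub(paste(ref, "|\\||--", sep = ""), "", snp)
      # more than one allele left: separate them with commas
      if(nchar(alt) > 1) 
             alt = paste(strsplit(alt, "", fixed = TRUE)[[1]], collapse = ",")
   }
   else {
      indels = strsplit(indel, ",", fixed = TRUE)[[1]]

      # deletion: ALT is the old REF, and REF becomes REF + deleted bases
      if(grepl("-", indels[1], fixed = TRUE)) {
        alt = ref
        ref = paste0(strsplit(ref, ",", fixed = TRUE)[[1]], 
                        gsub("-", "", indels, fixed = TRUE), collapse = ",")
      }
      # insertion: REF is unchanged, and ALT becomes REF + inserted bases
      if(grepl("+", indels[1], fixed = TRUE)) {
        alt = paste0(strsplit(ref, ",", fixed = TRUE)[[1]], 
                        gsub("+", "", indels, fixed = TRUE), collapse = ",")
      }
  }    

   return(cbind(CHROM = xrow[1], POS = xrow[2], REF = ref, 
                SNP = snp, INDEL = indel, ALT = alt))
}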
as.data.frame(t(apply(DF, 1, ff)))
#   V1  V2     V3         V4     V5     V6
#1   5 290      A --|T|--|--      0      T
#2   5 890      A   A|T|--|G      0    T,G
#3   7 672      A  A|--|C|-- +C,+CC AC,ACC
#4   9 459     GC   A|T|--|G     -C      G
#5   3 554 TGG,TA   A|T|--|G -GG,-A      T
#6   9 987      T  --|T|C|--   +GGG   TGGG
#7  21 214      G   A|T|--|G      0    A,T
#8   1 145      G  G|--|--|G      0       
#9   3 554 TGG,CA   A|T|--|G -GG,-A    T,C
#10  7 672    A,T  A|--|C|-- +C,+CC AC,TCC

DF :

structure(list(CHROM = c("5", "5", "7", "9", "3", "9", "21", 
"1", "3", "7"), POS = c("290", "890", "672", "459", "554", "987", 
"214", "145", "554", "672"), REF = c("A", "A", "A", "G", "T", 
"T", "G", "G", "T,C", "A,T"), SNP = c("--|T|--|--", "A|T|--|G", 
"A|--|C|--", "A|T|--|G", "A|T|--|G", "--|T|C|--", "A|T|--|G", 
"G|--|--|G", "A|T|--|G", "A|--|C|--"), INDEL = c("0", "0", "+C,+CC", 
"-C", "-GG,-A", "+GGG", "0", "0", "-GG,-A", "+C,+CC")), .Names = c("CHROM", 
"POS", "REF", "SNP", "INDEL"), row.names = c(NA, 10L), class = "data.frame")
alexis_laz
  • This is brilliant. I didn't know it could be written in such an interpretable way (I assume if there was a more efficient way to write it I wouldn't have understood it). This does exactly what I need. THANKS – user2726449 Feb 19 '14 at 22:02
  • @user2726449 : You're welcome! Glad I could help! I feared I would have misunderstood something.. – alexis_laz Feb 19 '14 at 22:45
1

For the first part of the logic, something like this works:

mito <- read.table(text="CHROM   POS      REF    SNP          INDEL  
5       290      A      --|T|--|--   0
5       890      A      A|T|--|G     0
7       672      A      A|--|C|--    +C,+CC
9       459      G      A|T|--|G     -C",header=TRUE,stringsAsFactors=FALSE)

"add each letter to the ALT variable only if it does not equal the REF variable and its not equal to '--'".

SNPlist <- strsplit(mito$SNP,"\\|")
output <- Map(function(x,y) x[x!=y & x!="--"] , SNPlist, mito$REF)
mito$alt <- sapply(output,paste,collapse=",")
mito

#  CHROM POS REF        SNP  INDEL alt
#1     5 290   A --|T|--|--      0   T
#2     5 890   A   A|T|--|G      0 T,G
#3     7 672   A  A|--|C|-- +C,+CC   C
#4     9 459   G   A|T|--|G     -C A,T
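
The INDEL rows could be handled along the same lines. What follows is only a minimal sketch of the rules described in the question's Note 1, not a tested solution for the full file: `ref_alt` is a hypothetical helper name, and it assumes that REF holds a single allele and that all entries of one INDEL value share the same sign.

```r
# Same four rows as the example data in the question.
mito <- data.frame(
  CHROM = c(5, 5, 7, 9),
  POS   = c(290, 890, 672, 459),
  REF   = c("A", "A", "A", "G"),
  SNP   = c("--|T|--|--", "A|T|--|G", "A|--|C|--", "A|T|--|G"),
  INDEL = c("0", "0", "+C,+CC", "-C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns c(REF = ..., ALT = ...) for one row.
# Assumes a single REF allele and one sign per INDEL value.
ref_alt <- function(ref, snp, indel) {
  if (indel == "0") {
    # SNP row: keep alleles that are neither REF nor the "--" placeholder.
    alleles <- strsplit(snp, "|", fixed = TRUE)[[1]]
    alt <- alleles[alleles != ref & alleles != "--"]
    return(c(REF = ref, ALT = paste(alt, collapse = ",")))
  }
  parts <- strsplit(indel, ",", fixed = TRUE)[[1]]
  bases <- substring(parts, 2)  # drop the leading "+" or "-"
  if (startsWith(parts[1], "-")) {
    # Deletion: REF becomes REF + deleted bases, ALT becomes the old REF.
    c(REF = paste0(ref, bases, collapse = ","), ALT = ref)
  } else {
    # Insertion: REF is unchanged, ALT becomes REF + inserted bases.
    c(REF = ref, ALT = paste0(ref, bases, collapse = ","))
  }
}

# One column per row of mito, so transpose to get REF and ALT columns.
res <- t(mapply(ref_alt, mito$REF, mito$SNP, mito$INDEL, USE.NAMES = FALSE))
mito$REF <- res[, "REF"]
mito$ALT <- res[, "ALT"]
mito
```

On these four rows this gives ALT values of T, then T,G, then AC,ACC, then G, with REF changed to GC on the last row, which matches the desired table in the question.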
thelatemail
0

You may try this:

library(stringr)
snp <- str_extract_all(string = df$SNP, pattern = "[[:alpha:]]")

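# SIMPLIFY = FALSE keeps a list even when every row has the same number of alleles
snp2 <- mapply(FUN = function(x, y) x[x != y], x = snp, y = df$REF, SIMPLIFY = FALSE)

# sapply returns a character vector, so ALT is a plain column rather than a list column
df$ALT <- sapply(snp2, function(x) paste(x, collapse = ","))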

df$ALT[df$INDEL == "+C,+CC"] <- "AC,ACC"
df$ALT[df$INDEL == "-C"] <- "G"

df

#   CHROM POS REF        SNP  INDEL    ALT
# 1     5 290   A --|T|--|--      0      T
# 2     5 890   A   A|T|--|G      0    T,G
# 3     7 672   A  A|--|C|-- +C,+CC AC,ACC
# 4     9 459   G   A|T|--|G     -C      G
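
Rather than hard-coding each INDEL value, the rows could be flagged by the sign of their entries. This is a minimal sketch under assumptions: the `indel` vector below holds illustrative values taken from this thread, and the names `has_ins`, `has_del` and `kind` are made up for the example. It also checks that no value mixes "+" and "-", since insertions and deletions are not expected at the same position.

```r
# Illustrative INDEL values taken from the examples in this thread.
indel <- c("0", "0", "+C,+CC", "-C", "-GG,-A", "+GGG")

# Split each value on commas, then test the sign of every entry.
parts   <- strsplit(indel, ",", fixed = TRUE)
has_ins <- vapply(parts, function(p) any(startsWith(p, "+")), logical(1))
has_del <- vapply(parts, function(p) any(startsWith(p, "-")), logical(1))

# A value mixing "+" and "-" is not expected, so stop if one turns up.
stopifnot(!any(has_ins & has_del))

# Label each row; anything without a sign is treated as a plain SNP row.
kind <- ifelse(has_ins, "insertion", ifelse(has_del, "deletion", "snp"))
kind
```

The resulting flags can then be used to index the rows whose REF and ALT need the insertion or deletion rule, instead of matching exact strings such as "+C,+CC".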
Henrik
  • This works great... but if this is a large file, all indels will have to be done separately. In addition, #4's REF will need to become GC. I'm assuming another pattern (or two) will need to be established to flag those that have a "-" or a "+"... does my logic make sense? P.S. no need to do this for me, but I just want to know if I'm on the right path... and if so, I can probably figure this out (or not). Thanks again. – user2726449 Feb 19 '14 at 15:11
  • I think you need to clarify the 'INDEL' issue in your question. As the question stands, it seems that there are 'only' two 'INDEL' cases that need to be handled (for rows where INDEL is "+C,+CC" or "-C", change ALT to "AC,ACC" and "G"). It is no problem to also change the 'REF' variable according to the 'INDEL' variable, although my impression was that it was the resulting 'ALT' variable that is the main focus. – Henrik Feb 19 '14 at 15:18
  • True... but when there are INDELs, the ALT (and REF) variables will need to be changed accordingly. E.G. for "-C" (or wherever there is a "-" sign), as stated above, REF needs to become GC and ALT becomes G, but if there is a "+" sign (whether it be +C,+CC or +GGG), REF stays the same, but ALT becomes REF+the indel. This is just a short example above, but these files can get big (with 250K+ rows)... so being able to write this in a way to do this automatically will be ideal. – user2726449 Feb 19 '14 at 16:00