2

I would like to extract the characters around a symbol using R and sub. I have tried many regular expressions, but I'm not getting what I want.

My vector:

c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")

I only need one character before and after the >.

My best try was:

sub("(.*?)>", ">", aa, perl = TRUE)
zx8754

3 Answers

9

You need to use capture groups in your regex:

vec <- c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")
> sub(".*(.)>(.).*","\\1\\2",vec)
 [1] "GG" "TA" "GA" "GA" "AT" "TC" "TC" "TC" "AT" "TC" "TA" "AG" "AC" "CT" "TA"
[16] "TC" "TG" "GC" "TG" "TA" "GA"

In words: the regex matches any characters zero or more times (.*), captures the next character with (.), matches the greater-than sign >, captures the following character with a second (.), and matches whatever remains with a final .*. The whole match is then replaced by the two captured characters, \\1\\2.
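If the > should be kept in the result, the same pattern can be reused with a different replacement. This is only a small sketch on a shortened vector; the last element ("no symbol") is made up here to show that sub leaves strings without a match unchanged.

```r
# Shortened example vector; "no symbol" is a made-up element without ">"
vec <- c("G>GA", "CT>C", "T>A", "no symbol")

# Same pattern as above, but the replacement keeps the ">" between the
# two captured characters
res <- sub(".*(.)>(.).*", "\\1>\\2", vec)
print(res)
```

Strings that do not contain a character on both sides of > do not match the pattern, so they pass through untouched and may need separate handling.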

James
5

Provide a reproducible example

> x = c("A>G", "AT>GC")

Find the index of the symbol you're interested in (use fixed=TRUE because you're not actually looking for a regular expression).

> i = regexpr(">", x, fixed=TRUE)

Then extract the preceding and/or following character

> substr(x, i-1, i-1)
[1] "A" "T"
> substr(x, i+1, i+1)
[1] "G" "G"

or get the sequence

> substr(x, i-1, i+1)
[1] "A>G" "T>G"

Maybe your reproducible example includes edge cases

> x = c("A>G", "AT>GC", "", ">G", "A>", ">", NA)

and then more processing is needed?
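One possible way to do that extra processing, as a sketch: mark a flank as NA whenever the symbol is absent or there is no character on that side. Treating those cases as NA is an assumption made here, not something stated in the question.

```r
x <- c("A>G", "AT>GC", "", ">G", "A>", ">", NA)

# Position of ">" in each element: -1 when absent, NA for NA input
i <- regexpr(">", x, fixed = TRUE)

before <- substr(x, i - 1, i - 1)
after  <- substr(x, i + 1, i + 1)

# TRUE only where ">" was actually found
found <- !is.na(i) & i > 0

# Assumption: a missing symbol or an empty flank is reported as NA
before[!found | !nzchar(before)] <- NA
after[!found | !nzchar(after)] <- NA

print(data.frame(x, before, after))
```

Whether NA, an empty string, or an error is the right answer for these inputs depends on what the data mean, so the replacement rule above should be adapted accordingly.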

Martin Morgan
0

It looks like you are trying to get the reference and alternate alleles? Only looking for one character suggests you are only interested in SNPs? You could use strsplit to generate a data frame of ref and alt alleles.

test <- c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")
# Split each variant on ">" into one row of reference and alternate allele
Alleles <- data.frame(do.call(rbind, strsplit(test, split = ">", fixed = TRUE)), stringsAsFactors = FALSE)
colnames(Alleles) <- c("Ref", "Alt")
# Total number of bases across both alleles; a SNP has exactly 2
Alleles$bases <- nchar(Alleles$Ref) + nchar(Alleles$Alt)
SNPs <- Alleles[Alleles$bases == 2, ]

Just taking a single base on either side of the > symbol is going to give you wrong genetic information. The variant "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C" would get reduced to "A>C". That looks like a simple SNP, but the variant is really a deletion of the last 37 bases: "CGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>-".
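To make the distinction concrete, here is a rough sketch that labels each variant by comparing allele lengths. The labels and the rule (one base on each side is a SNP, a longer reference is a deletion, a longer alternate is an insertion) are simplifying assumptions for illustration, not a full variant classification.

```r
variants <- c("G>GA", "T>A", "CT>C")

# Split on the literal ">" and pull out the two sides
parts <- strsplit(variants, ">", fixed = TRUE)
ref <- vapply(parts, `[`, character(1), 1)
alt <- vapply(parts, `[`, character(1), 2)

# Assumed, simplified labelling based only on allele lengths
type <- ifelse(nchar(ref) == 1 & nchar(alt) == 1, "SNP",
        ifelse(nchar(ref) > nchar(alt), "deletion",
        ifelse(nchar(ref) < nchar(alt), "insertion", "other")))

print(data.frame(ref, alt, type))
```

With a column like this, the indels can be filtered out or handled separately before any single-base extraction is attempted.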

Is this what you were after?

JeremyS