Extract the character around a symbol in R

Question

I would like to extract the characters around a symbol using R and sub. I have tried many regular expressions, but I'm not getting what I want.

My vector:

c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")

I only need one character before and after the >.

My best try was:

sub("(.*?)>", ">", aa, perl = TRUE)

My vector: c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A") – user3186183 Jan 13 '14 at 16:00

score 9 · Answer 1 · answered Jan 13 '14 at 15:59

You need to use capture groups in your regex:

> vec <- c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")
> sub(".*(.)>(.).*", "\\1\\2", vec)
 [1] "GG" "TA" "GA" "GA" "AT" "TC" "TC" "TC" "AT" "TC" "TA" "AG" "AC" "CT" "TA"
[16] "TC" "TG" "GC" "TG" "TA" "GA"

In words: the regex matches any characters zero or more times (`.*`), captures the next character (`(.)`), matches the greater-than sign (`>`), captures the character after it (`(.)`), and finally matches whatever remains (`.*`). The whole match is then replaced with the two captured characters, `\\1\\2`.
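As a small illustration (a sketch on a shortened vector, not part of the original answer), this compares the attempt from the question with the capture-group pattern, and shows `regmatches()` with `regexpr()` as one alternative that extracts the match directly. The variable names `attempt`, `captured` and `extracted` are only there for the example.

```r
aa <- c("G>GA", "CT>C", "T>A")

# Attempt from the question: the lazy group and the ">" are replaced by ">",
# so everything before the symbol is dropped and everything after it is kept.
attempt <- sub("(.*?)>", ">", aa, perl = TRUE)
attempt
# [1] ">GA" ">C"  ">A"

# Capture groups: keep one character from each side of ">", drop the rest.
captured <- sub(".*(.)>(.).*", "\\1\\2", aa)
captured
# [1] "GG" "TC" "TA"

# Alternative: extract the matching substring instead of replacing around it.
extracted <- regmatches(aa, regexpr(".>.", aa))
extracted
# [1] "G>G" "T>C" "T>A"
```

Note that `regmatches()` silently drops elements that do not match, so the result can be shorter than the input.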

James

@user3186183 Oh, I misunderstood what you wanted. This will suffice for that then: `sub(".*(.>.).*","\\1",vec)`. – James Jan 13 '14 at 16:27

Maybe narrow `.` down to `[A-Z]` since all the strings use only capital letters. – tenub Jan 13 '14 at 16:28

@tenub Well, `[ACGT]` as it's genetic data. – James Jan 13 '14 at 16:39
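A quick sketch of what the `[ACGT]` restriction changes in practice. The input `x` is made up for illustration, and turning non-matches into `NA` is an assumption about what you would want, not something from the answer. When the pattern does not match, `sub()` returns the input unchanged, so it is worth flagging those elements explicitly.

```r
x <- c("CT>C", "G>GA", "N>A", "TA")

# Restrict both flanking characters to the four bases.
res <- sub(".*([ACGT])>([ACGT]).*", "\\1\\2", x)
res
# [1] "TC"  "GG"  "N>A" "TA"

# sub() leaves non-matching strings untouched, so mark them as missing.
matched <- grepl("[ACGT]>[ACGT]", x)
res[!matched] <- NA
res
# [1] "TC" "GG" NA   NA
```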

score 5 · Answer 2 · answered Jan 13 '14 at 16:07

Provide a reproducible example

> x = c("A>G", "AT>GC")

Find the index of the symbol you're interested in (use fixed=TRUE because you're not actually looking for a regular expression).

> i = regexpr(">", x, fixed=TRUE)

Then extract the preceding and / or following character

> substr(x, i-1, i-1)
[1] "A" "T"
> substr(x, i+1, i+1)
[1] "G" "G"

or get the sequence

> substr(x, i-1, i+1)
[1] "A>G" "T>G"

Maybe your reproducible example includes edge cases

> x = c("A>G", "AT>GC", "", ">G", "A>", ">", NA)

and then more processing is needed?
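One possible way to do that extra processing, sketched under the assumption that a missing symbol or a missing neighbour should give `NA`. The helper name `flank` is hypothetical and not part of base R.

```r
# Hypothetical helper: one character before and after the first `symbol`,
# with NA where the symbol or the neighbouring character is absent.
flank <- function(x, symbol = ">") {
  i <- as.integer(regexpr(symbol, x, fixed = TRUE))  # -1 if absent, NA for NA input
  found <- !is.na(i) & i > 0
  before <- substr(x, i - 1, i - 1)
  after <- substr(x, i + 1, i + 1)
  # substr() returns "" when the requested position is outside the string
  before[!found | before == ""] <- NA
  after[!found | after == ""] <- NA
  data.frame(before = before, after = after, stringsAsFactors = FALSE)
}

x <- c("A>G", "AT>GC", "", ">G", "A>", ">", NA)
res <- flank(x)
res$before
# [1] "A" "T" NA  NA  "A" NA  NA
res$after
# [1] "G" "G" NA  "G" NA  NA  NA
```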

score 0 · Answer 3 · answered Jan 14 '14 at 02:35

It looks like you are trying to get the reference and alternate alleles? Only looking for one character suggests you are only interested in SNPs? You could use strsplit to generate a data frame of ref and alt alleles.

test <- c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")
# Split each variant at ">" into its reference and alternate allele
parts <- strsplit(test, split = ">", fixed = TRUE)
Alleles <- data.frame(Ref = sapply(parts, `[`, 1), Alt = sapply(parts, `[`, 2), stringsAsFactors = FALSE)
# Total number of bases across both alleles; a SNP has exactly 2
Alleles$bases <- nchar(Alleles$Ref) + nchar(Alleles$Alt)
SNPs <- Alleles[Alleles$bases == 2, ]

Just taking a single base on either side of the substitution symbol (>) is going to give you wrong genetic information. The variant "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C" would be reduced to "A>C". That looks like a simple SNP, but the variant is actually a deletion of the last 37 bases: "CGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>-".
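To make the distinction concrete, here is a rough sketch that labels each variant by comparing allele lengths instead of trimming to one base. The labels and the simple length-based rule are my own simplification for illustration, not a standard variant classification.

```r
variants <- c("T>A", "G>GA", "CT>C", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C")

parts <- strsplit(variants, ">", fixed = TRUE)
ref <- vapply(parts, `[`, character(1), 1)
alt <- vapply(parts, `[`, character(1), 2)

# Net change in length from reference to alternate allele
size_change <- nchar(alt) - nchar(ref)

# Simplified labelling (an assumption for this sketch)
type <- ifelse(size_change == 0 & nchar(ref) == 1, "SNP",
        ifelse(size_change > 0, "insertion",
        ifelse(size_change < 0, "deletion", "other")))
type
# [1] "SNP"       "insertion" "deletion"  "deletion"
```

The long variant is flagged as a deletion here, whereas the one-character-each-side approach would have reported it as "A>C".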

Is this what you were after?
