I have a string (fasta format), something like this:

a = ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"

and would like to split it at the > character, remove the newlines, and put the three substrings delimited by > into a vector or list with three elements:

>atttaggaccttaattgtcggta >ccattnnnncccatt >ttaggccta

I tried strsplit:

unlist(strsplit(a, "(?<=>)", perl=T))

but this puts the delimiter > at the end of each string.

I found related questions here and here, but I can't really get it to work without building a complicated construct.

Is there a simple solution to do this in one go?

user1981275
  • _"on one go"_ can also == _"unreadable for your future self or for others you share code with"_. Code is meant for humans. – hrbrmstr Aug 25 '16 at 11:18

2 Answers

Your regex contains only a lookbehind, which matches any empty location right after a > (see your regex demo). The engine processes the string from left to right, checks whether there is a > immediately to the left of the current location, and returns a valid empty-string match if a > is found.

You may use (?<=[^>])(?=>) regex:

> res <- unlist(strsplit(a, "(?<=[^>])(?=>)", perl=T))
> res
[1] ">atttaggacctta\nattgtcggta\n" ">ccattnnnn\ncccatt\n"        
[3] ">ttaggccta"  
> gsub("\n", "", res, fixed=TRUE)
[1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt"        
[3] ">ttaggccta"  

The pattern matches a location that is preceded by a non-> character and followed by a > character.

Note that using a lookbehind-only pattern with strsplit often leads to unexpected behavior. See Why does strsplit use positive lookahead and lookbehind assertion matches differently?
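To make the difference between the two patterns concrete, here is a minimal sketch that runs both on the string from the question. The variable names behind_only, combined and records are only illustrative and are not part of the original answer.

```r
a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"

# Lookbehind only: every split lands just after a ">", so the ">"
# stays attached to the end of the preceding piece.
behind_only <- unlist(strsplit(a, "(?<=>)", perl = TRUE))

# Lookbehind plus lookahead: every split lands just before a ">" that
# has some other character in front of it, so each record keeps its
# leading ">".
combined <- unlist(strsplit(a, "(?<=[^>])(?=>)", perl = TRUE))

# Remove the literal newlines inside each record.
records <- gsub("\n", "", combined, fixed = TRUE)

print(behind_only)
print(records)
```

Printing both vectors side by side shows where each pattern places the split points.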

Wiktor Stribiżew
  • I think they also want to remove the line breaks within each vector – talat Aug 25 '16 at 11:07
  • Ah, I see. I have added a `gsub` as the second step to remove newline symbols. – Wiktor Stribiżew Aug 25 '16 at 11:07
  • Thanks, this works. So `(?=>)` is a lookbehind for character `>`? And what exactly does `(?<=[^>])` do? – user1981275 Aug 25 '16 at 11:26
  • `(?=>)` matches any position/location (empty string) that *is followed with `>` char*. `(?<=[^>])` matches (and requires) a non-`>` char before the current empty position/location (so it will never match at the beginning of the string). If you need to also match at the start of the string, use a negative lookbehind, `(?<!>)`, but I doubt you need an empty item in the resulting vector. – Wiktor Stribiżew Aug 25 '16 at 11:29
library(stringi)
library(magrittr)

a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"

stri_replace_all_regex(a, "\\n", "") %>% 
  stri_extract_all_regex("(>[[:alpha:]]+)") %>% 
  unlist()
## [1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt"         ">ttaggccta"              

If one must use base R only:

a <- gsub("\\n", "", a)
unlist(regmatches(a, gregexpr("(>[[:alpha:]]+)", a)))
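Note that [[:alpha:]]+ matches letters only, so a record containing any other character (a gap symbol such as -, for instance) would be cut off at that character. As a hedged variation on the base R approach, the sketch below matches everything up to the next > instead. The function name split_records is hypothetical and chosen only for illustration.

```r
# Hypothetical helper (not from the original answer): drop the newlines,
# then extract every run that starts with ">" and extends to just before
# the next ">".
split_records <- function(x) {
  flat <- gsub("\n", "", x, fixed = TRUE)
  unlist(regmatches(flat, gregexpr(">[^>]+", flat)))
}

a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"
records <- split_records(a)
print(records)
```

For the string from the question this should give the same three elements as the versions above. It only behaves differently when a record contains non-letter characters.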
hrbrmstr