Split string on special character

Question

I have a string (fasta format), something like this:

a = ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"

and would like to separate it at the character >, filter out the newlines, and put the three substrings separated by > into a vector or list with three elements:

>atttaggaccttaattgtcggta >ccattnnnncccatt >ttaggccta

I tried strsplit:

unlist(strsplit(a, "(?<=>)", perl=T))

but this puts the delimiter > at the end of each string.

I found related questions here and here, but I can't really get it to work without making a complicated construct.

Is there a simple solution to do this in one go?

_"on one go"_ can also == _"unreadable for your future self or for others you share code with"_. Code is meant for humans. — hrbrmstr, Aug 25 '16 at 11:18

score 2 · Accepted Answer · edited May 23 '17 at 12:22


Your regex only contains a lookbehind, which matches any empty location right after a > (see your regex demo). The engine processes the string from left to right, checks whether there is a > immediately to the left of the current location, and returns a valid empty-string match if a > is found.

You may use (?<=[^>])(?=>) regex:

> res <- unlist(strsplit(a, "(?<=[^>])(?=>)", perl=T))
> res
[1] ">atttaggacctta\nattgtcggta\n" ">ccattnnnn\ncccatt\n"        
[3] ">ttaggccta"  
> gsub("\n", "", res, fixed=TRUE)
[1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt"        
[3] ">ttaggccta"

The pattern matches a location that is preceded by a non-> char and followed by a > char.

Note that using a lookbehind pattern only with strsplit often leads to unexpected behavior. See Why does strsplit use positive lookahead and lookbehind assertion matches differently?
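The two steps above (split, then strip newlines) can be wrapped in a small helper. This is only a sketch: the function name `split_fasta_records` is made up for illustration, and it assumes the whole input is one string in which > appears only at the start of a record.

```r
# Hypothetical helper (not from the answer): split a FASTA-like string
# into one element per record, keeping the leading ">".
split_fasta_records <- function(x) {
  # Split at the empty position between a non-">" character and a ">".
  parts <- unlist(strsplit(x, "(?<=[^>])(?=>)", perl = TRUE))
  # Drop the newlines inside each record.
  gsub("\n", "", parts, fixed = TRUE)
}

a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"
res <- split_fasta_records(a)
print(res)
```

For the string from the question this gives the same three elements as the console session above.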


answered Aug 25 '16 at 10:52

Wiktor Stribiżew


I think they also want to remove the line breaks within each vector – talat Aug 25 '16 at 11:07
Ah, I see. I have added a `gsub` as the second step to remove newline symbols. – Wiktor Stribiżew Aug 25 '16 at 11:07
Thanks, this works. So `(?=>)` is a lookbehind for character `>`? And what exactly does `(?<=[^>])` do? – user1981275 Aug 25 '16 at 11:26
`(?=>)` matches any position/location (empty string) that *is followed with `>` char*. `(?<=[^>])` matches (and requires) a non-`>` char before the current empty position/location (so it will never match at the beginning of the string). If you need to also match at the start of the string, use a negative lookbehind, `(?<!>)`, but I doubt you need an empty item in the resulting vector. – Wiktor Stribiżew Aug 25 '16 at 11:29
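To make the difference between the two lookbehinds in the last comment concrete, here is a small sketch using `regexpr`, which reports the position of the first match. The test string `x` is invented for illustration and is not from the thread.

```r
x <- ">a>b"  # made-up example string

# Positive lookbehind needs a non-">" character before the position,
# so the very start of the string cannot match.
pos_positive <- as.integer(regexpr("(?<=[^>])(?=>)", x, perl = TRUE))

# Negative lookbehind only forbids a ">" before the position,
# so the start of the string qualifies as well.
pos_negative <- as.integer(regexpr("(?<!>)(?=>)", x, perl = TRUE))

print(c(positive = pos_positive, negative = pos_negative))
```

The positive lookbehind first matches at position 3 (just before the second >), while the negative lookbehind already matches at position 1.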

score 1 · Answer 2 · answered Aug 25 '16 at 11:11

library(stringi)
library(magrittr)

a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"

stri_replace_all_regex(a, "\\n", "") %>% 
  stri_extract_all_regex("(>[[:alpha:]]+)") %>% 
  unlist()
## [1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt"         ">ttaggccta"

If one must use base only:

a <- gsub("\\n", "", a)
unlist(regmatches(a, gregexpr("(>[[:alpha:]]+)", a)))
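As a quick check of the base-only variant, the sketch below runs it on the string from the question. It also tries a looser pattern, `>[^>]+`, which is an assumption added here and not part of the answer: `[[:alpha:]]` matches letters only, so a record containing digits or other symbols would be cut short by the original pattern.

```r
a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"
flat <- gsub("\n", "", a, fixed = TRUE)

# Pattern from the answer: ">" followed by letters only.
by_alpha <- unlist(regmatches(flat, gregexpr(">[[:alpha:]]+", flat)))

# Assumed alternative: ">" followed by anything up to the next ">".
by_not_gt <- unlist(regmatches(flat, gregexpr(">[^>]+", flat)))

print(by_alpha)
print(identical(by_alpha, by_not_gt))
```

On this input both patterns return the same three elements; they would only differ for records that contain non-letter characters.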
