Split a string by two delimiters only in the first occurrence

Question

I have read many examples here and on other forums and tried things myself, but I still can't do what I want:

I have a string like this:

myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1", "ENSG00000287339.1|RP11-575F12.4")

And I want to split it into columns at the first dot and at the vertical bar, so it looks like this:

data.frame(c("ENSG00000185561", "ENSG00000124785", "ENSG00000287339"), c("TLCD2","NRN1","RP11-575F12.4")) %>% set_colnames(c("col1","col2"))

The biggest problem here is the dot that is sometimes present to the right of the vertical bar (e.g. third row), at which I don't want to split.

Among others, what I tried was:

data.frame(do.call(rbind, strsplit(myString,"(\\.)|(\\|)")))

but this also creates a fourth column when it splits after the second dot.

I tried to tell it to only split once for the dot:

data.frame(do.call(rbind, strsplit(myString,"(\\.{1})|(\\|)")))

but I got the same result.

Then I tried to tell it that the dot could not be preceded by a vertical bar:

data.frame(do.call(rbind, strsplit(myString,"([^\\|]\\.)|(\\|)")))
data.frame(do.call(rbind, strsplit(myString,"([[:alnum:]][^\\|]\\.)|(\\|)")))

but in both cases it splits by both dots.

I tried various combinations with reshape2::colsplit as well, similar results; either it splits in both dots, or it splits on the first dot but not on the slash:

reshape2::colsplit(myString, "([^\\|]\\.)|(\\|)", c("col1", "col2"))

Does anyone have an idea on how to solve this?

It is totally ok if it creates 3 columns instead of 2, I can then select the ones of interest. E.g.

data.frame(c("ENSG00000185561", "ENSG00000124785", "ENSG00000287339"), c("10","9","1"), c("TLCD2","NRN1","RP11-575F12.4")) %>% set_colnames(c("col1","col2", "col3"))

"ENSG00000185561.10|TLCD2": this is from GENCODE, I believe! — PesKchan, Oct 21 '21 at 10:40
It's a StringTie result, but yes, it's about genes; I am trying to separate the Ensembl ID from the gene symbol — Paula, Oct 21 '21 at 12:03

score 4 · Answer 1 · answered Oct 21 '21 at 10:18

library(stringr)
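# myString is the character vector from the question; split at "." or "|"
# into at most 3 pieces, so later dots stay in the last piece
str_split_fixed(myString, "[.|]", 3)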

output:

    [,1]              [,2] [,3]           
[1,] "ENSG00000185561" "10" "TLCD2"        
[2,] "ENSG00000124785" "9"  "NRN1"         
[3,] "ENSG00000287339" "1"  "RP11-575F12.4"
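
A small sketch of why the third argument matters, assuming stringr is installed (the variable names `x`, `three` and `four` are only for illustration): the pattern matches every dot and bar, and it is `n = 3` that stops the splitting after two delimiters, so the rest of the string stays in the last piece.

```r
library(stringr)

# Third row of the question's data: the gene name itself contains a dot
x <- "ENSG00000287339.1|RP11-575F12.4"

# n = 3: at most three pieces, the remainder is kept whole
three <- str_split_fixed(x, "[.|]", 3)

# n = 4: the dot inside the gene name is used as a delimiter too
four <- str_split_fixed(x, "[.|]", 4)

three[1, 3]    # "RP11-575F12.4"
four[1, 3:4]   # "RP11-575F12" "4"
```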

TarJae


score 3 · Answer 2 · answered Oct 21 '21 at 10:00

This should work. The secret sauce is the option extra = "merge", which means that any extra separated parts get added back onto the last column.

library(tidyr)

tibble(string = c(
  "ENSG00000185561.10|TLCD2", 
  "ENSG00000124785.9|NRN1", 
  "ENSG00000287339.1|RP11-575F12.4"
)) %>% 
  separate(
    string, into = c("c1", "c2", "c3"), sep = "[.]|[|]", extra = "merge"
  )
#> # A tibble: 3 x 3
#>   c1              c2    c3           
#>   <chr>           <chr> <chr>        
#> 1 ENSG00000185561 10    TLCD2        
#> 2 ENSG00000124785 9     NRN1         
#> 3 ENSG00000287339 1     RP11-575F12.4

Created on 2021-10-21 by the reprex package (v2.0.0)
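
To make the effect of extra = "merge" visible, here is a minimal comparison against extra = "drop" on the one row that has a dot in the gene name. It assumes tidyr is installed; the names `ids`, `merged` and `dropped` are just for this sketch.

```r
library(tidyr)

ids <- tibble(string = "ENSG00000287339.1|RP11-575F12.4")

# "merge": split at most length(into) times, the rest stays in c3
merged <- separate(
  ids, string, into = c("c1", "c2", "c3"), sep = "[.]|[|]", extra = "merge"
)

# "drop": split at every match and silently discard the extra piece
dropped <- separate(
  ids, string, into = c("c1", "c2", "c3"), sep = "[.]|[|]", extra = "drop"
)

merged$c3   # "RP11-575F12.4"
dropped$c3  # "RP11-575F12"
```

With the default extra = "warn" you get the same values as with "drop", plus a warning about the discarded pieces.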

NB, reshape2 is superseded by tidyr. You should make the switch ASAP!

nice one, thanks! good tip, I just came across reshape2 now when looking for similar questions — Paula, Oct 21 '21 at 12:01

Konrad Rudolph · Answer 3 · 2021-10-21T10:13:49.683

I would suggest using matching instead of splitting (i.e. write a regex that specifies the parts that should be matched, rather than the splitter):

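library(tidyr)

# myString is the character vector from the question
df = tibble(ID = myString)
# group 1: everything before the first dot; group 2: everything after the "|"
df %>% extract(ID, into = c('ID', 'Name'), '([^.]+).*\\|(.+)')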

# A tibble: 3 × 2
  ID              Name
  <chr>           <chr>
1 ENSG00000185561 TLCD2
2 ENSG00000124785 NRN1
3 ENSG00000287339 RP11-575F12.4

Just like the other answer, this is using ‘tidyr’ (which supersedes ‘reshape2’).
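
For anyone who wants to see how the regex works piece by piece, here is a rough base R sketch of the same matching idea using sub() with backreferences. The anchors `^` and `$` and the variable names `pattern` and `res` are my additions, not part of the answer above.

```r
myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1",
              "ENSG00000287339.1|RP11-575F12.4")

# ([^.]+)  group 1: one or more characters that are not a dot (the ID)
# .*       whatever sits between the ID and the bar (".10", ".9", ...)
# \\|      the literal vertical bar
# (.+)     group 2: everything after the bar, dots included (the name)
pattern <- "^([^.]+).*\\|(.+)$"

res <- data.frame(
  ID   = sub(pattern, "\\1", myString),
  Name = sub(pattern, "\\2", myString),
  stringsAsFactors = FALSE
)
res
```

Because the second group only starts after the bar, a dot in the gene symbol is never looked at as a delimiter.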

edited Oct 21 '21 at 10:13

answered Oct 21 '21 at 10:06


I think this is a better solution. Matching the parts you want to keep gives a slightly longer regex but seems a bit more robust. – wurli Oct 21 '21 at 10:15
@Konrad your thesis is really a great help for analysis – PesKchan Oct 21 '21 at 10:41
thanks Konrad :) ; it took me some time to understand how it works, but probably more applicable to similar though not identical situations – Paula Oct 21 '21 at 12:24

score 1 · Answer 4 · answered Oct 21 '21 at 13:59

This could also help in base R:

as.data.frame(do.call(rbind, strsplit(myString, "\\.\\d+.+?", perl = TRUE)))

               V1            V2
1 ENSG00000185561         TLCD2
2 ENSG00000124785          NRN1
3 ENSG00000287339 RP11-575F12.4
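
The pattern above treats a dot, some digits and one further character as the delimiter, which in this data is always the version number plus the bar. A slightly more explicit variant (my assumption, not part of the original answer) names the bar directly, so a dot followed by digits inside the gene symbol cannot be picked up by accident:

```r
myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1",
              "ENSG00000287339.1|RP11-575F12.4")

# Delimiter = dot, one or more digits, then a literal "|"
# This assumes every ID carries a numeric version suffix before the bar
parts <- do.call(rbind, strsplit(myString, "\\.\\d+\\|"))
out <- as.data.frame(parts, stringsAsFactors = FALSE)
out
```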

Anoushiravan R


Good to see you! – TarJae Oct 21 '21 at 19:24
Thank you very much. These 3 months have been so full of university stuff that I don't think I can spend much time here. – Anoushiravan R Oct 21 '21 at 19:42
Good to have you here! Mr. Map! :-) – TarJae Oct 21 '21 at 19:44

Chris Ruehlemann · Answer 5 · 2021-10-21T10:38:11.807

You can use str_extract and lookahead (?=\\|) and, respectively, lookbehind (?<=\\|) to assert the | as demarcation point:

library(stringr)
df <- data.frame(
  col1 = str_extract(myString, ".*?(?=\\|)"),
  col2 = str_extract(myString, "(?<=\\|).*$")
)
df
                col1          col2
1 ENSG00000185561.10         TLCD2
2  ENSG00000124785.9          NRN1
3  ENSG00000287339.1 RP11-575F12.4

EDIT:

If you want three columns:

df <- data.frame(
  col1 = str_extract(myString, ".*?(?=\\.)"),
  col2 = str_extract(myString, "(?<=\\.)\\d+(?=\\|)"),
  col3 = str_extract(myString, "(?<=\\|).*$")
)
df
             col1 col2          col3
1 ENSG00000185561   10         TLCD2
2 ENSG00000124785    9          NRN1
3 ENSG00000287339    1 RP11-575F12.4

score 0 · Answer 6 · answered Oct 21 '21 at 10:43

It seems to me that you are trying to cram two operations into a single command. First split at | and create two columns, then remove the dot suffix from the first column. I think this is simpler and there is no need for external packages either:

myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1", "ENSG00000287339.1|RP11-575F12.4")

df <- do.call(rbind, strsplit(myString, '\\|'))
df[,1] <- sub('\\..*', '', df[,1])

df
     [,1]              [,2]           
[1,] "ENSG00000185561" "TLCD2"        
[2,] "ENSG00000124785" "NRN1"         
[3,] "ENSG00000287339" "RP11-575F12.4"

or am I missing something...?

dariober


exactly, I wanted to do it in one line. I could do both separately and join them afterwards with cbind() but I could not figure out how to do it all together – Paula Oct 21 '21 at 12:34
@Paula Is there a particular reason for doing it in one line (other than the challenge)? You could always wrap a two-line solution like mine into a function that becomes a one-liner when used. In my opinion, the other answers so far are clever but quite obscure. (I don't see why you would need `cbind`). – dariober Oct 21 '21 at 13:01
Because it keeps the overall script shorter and simpler. Especially when you change something upstream, you have to make sure you re-run more than one line of code so that the change is applied to your final (in this case) data frame. I used cbind because I created both columns separately and then joined them: `col2 <- data.frame(do.call(rbind, strsplit(myString,"\\|"))) %>% select(2)` `col1 <- data.frame(do.call(rbind, strsplit(myString,"\\."))) %>% select(1)` `df <- cbind(col1, col2)` – Paula Oct 22 '21 at 07:20
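
The suggestion in the comments of wrapping the two steps into a function could look roughly like this. The function name `split_gene_id` and the column names are hypothetical, and the sketch assumes every string contains exactly one "|":

```r
# Hypothetical helper: two steps inside, one line at the call site
split_gene_id <- function(x) {
  # Step 1: split at the literal bar (fixed = TRUE, so no regex escaping)
  parts <- do.call(rbind, strsplit(x, "|", fixed = TRUE))
  # Step 2: drop everything from the first dot onwards in the ID part
  data.frame(
    col1 = sub("\\..*", "", parts[, 1]),
    col2 = parts[, 2],
    stringsAsFactors = FALSE
  )
}

myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1",
              "ENSG00000287339.1|RP11-575F12.4")

df <- split_gene_id(myString)
df
```

When something changes upstream, only the single call `split_gene_id(myString)` has to be re-run.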
