3

I have read many examples here and on other forums and tried things myself, but I still can't do what I want:

I have a string like this:

myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1", "ENSG00000287339.1|RP11-575F12.4")

And I want to split it into columns at the first dot and at the vertical bar, so that it looks like this:

data.frame(c("ENSG00000185561", "ENSG00000124785", "ENSG00000287339"), c("TLCD2","NRN1","RP11-575F12.4")) %>% set_colnames(c("col1","col2"))

The biggest problem is the dot that is sometimes present to the right of the vertical bar (e.g. in the third element), where I don't want to split.

Among others, what I tried was:

data.frame(do.call(rbind, strsplit(myString,"(\\.)|(\\|)")))

but this also creates a fourth column when it splits after the second dot.

I tried to tell it to only split once for the dot:

data.frame(do.call(rbind, strsplit(myString,"(\\.{1})|(\\|)")))

but I got the same result.

Then I tried to tell it that the dot must not be preceded by a vertical bar:

data.frame(do.call(rbind, strsplit(myString,"([^\\|]\\.)|(\\|)")))
data.frame(do.call(rbind, strsplit(myString,"([[:alnum:]][^\\|]\\.)|(\\|)")))

but in both cases it splits by both dots.

I tried various combinations with reshape2::colsplit as well, with similar results: either it splits on both dots, or it splits on the first dot but not on the vertical bar:

reshape2::colsplit(myString, "([^\\|]\\.)|(\\|)", c("col1", "col2"))

Does anyone have an idea on how to solve this?

It is totally ok if it creates 3 columns instead of 2, I can then select the ones of interest. E.g.

data.frame(c("ENSG00000185561", "ENSG00000124785", "ENSG00000287339"), c("10","9","1"), c("TLCD2","NRN1","RP11-575F12.4")) %>% set_colnames(c("col1","col2", "col3"))
Konrad Rudolph
Paula

6 Answers

4
library(stringr)
# myString is the character vector from the question; split on "." or "|" into at most 3 pieces
str_split_fixed(myString, "[.|]", 3)

output:

    [,1]              [,2] [,3]           
[1,] "ENSG00000185561" "10" "TLCD2"        
[2,] "ENSG00000124785" "9"  "NRN1"         
[3,] "ENSG00000287339" "1"  "RP11-575F12.4"
TarJae
3

This should work. The secret sauce is the option extra = "merge", which means that any extra separated parts get added back onto the last column.

library(tidyr)

tibble(string = c(
  "ENSG00000185561.10|TLCD2", 
  "ENSG00000124785.9|NRN1", 
  "ENSG00000287339.1|RP11-575F12.4"
)) %>% 
  separate(
    string, into = c("c1", "c2", "c3"), sep = "[.]|[|]", extra = "merge"
  )
#> # A tibble: 3 x 3
#>   c1              c2    c3           
#>   <chr>           <chr> <chr>        
#> 1 ENSG00000185561 10    TLCD2        
#> 2 ENSG00000124785 9     NRN1         
#> 3 ENSG00000287339 1     RP11-575F12.4

Created on 2021-10-21 by the reprex package (v2.0.0)

NB, reshape2 is superseded by tidyr. You should make the switch ASAP!

wurli
  • nice one, thanks! good tip, I just came across reshape2 now when looking for similar questions – Paula Oct 21 '21 at 12:01
3

I would suggest using matching instead of splitting (i.e. write a regex that specifies the parts that should be matched, rather than the splitter):

library(tidyr)
df = tibble(ID = myString)
# Group 1: everything before the first dot; group 2: everything after the "|"
df %>% extract(ID, into = c('ID', 'Name'), '([^.]+).*\\|(.+)')
# A tibble: 3 × 2
  ID              Name
  <chr>           <chr>
1 ENSG00000185561 TLCD2
2 ENSG00000124785 NRN1
3 ENSG00000287339 RP11-575F12.4

Just like the other answer, this is using ‘tidyr’ (which supersedes ‘reshape2’).
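For readers who prefer to avoid extra packages, the same matching idea can be sketched in base R with regexec() and regmatches(). This is only an illustration under the assumption that every element contains at least one dot before a single vertical bar; the names pattern, parts and result are made up for this example.

```r
myString <- c("ENSG00000185561.10|TLCD2",
              "ENSG00000124785.9|NRN1",
              "ENSG00000287339.1|RP11-575F12.4")

# Same idea as the extract() pattern, but anchored:
# group 1 = everything before the first dot,
# group 2 = everything after the vertical bar.
pattern <- "^([^.]+).*\\|(.+)$"

# regexec() locates the full match and each capture group;
# regmatches() returns the matched substrings, one character vector per input.
parts <- regmatches(myString, regexec(pattern, myString))

# Drop the full match (first element) and bind the two groups row-wise.
groups <- lapply(parts, function(p) p[-1])
result <- as.data.frame(do.call(rbind, groups), stringsAsFactors = FALSE)
names(result) <- c("ID", "Name")
result
```

An element that does not match the pattern would yield an empty vector here, so it is worth checking lengths(parts) before binding if the input is not guaranteed to be clean.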

Konrad Rudolph
  • I think this is a better solution. Matching the parts you want to keep gives a slightly longer regex but seems a bit more robust. – wurli Oct 21 '21 at 10:15
  • @Konrad your thesis is really a great help for analysis – PesKchan Oct 21 '21 at 10:41
  • thanks Konrad :) ; it took me some time to understand how it works, but probably more applicable to similar though not identical situations – Paula Oct 21 '21 at 12:24
1

This could also help in base R:

as.data.frame(do.call(rbind, strsplit(myString, "\\.\\d+.+?", perl = TRUE)))

               V1            V2
1 ENSG00000185561         TLCD2
2 ENSG00000124785          NRN1
3 ENSG00000287339 RP11-575F12.4
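A slightly stricter variant of this idea, offered as a sketch rather than a drop-in replacement: the pattern above matches a dot, digits, and then any one character, so it could also fire inside a name on the right-hand side (for instance a hypothetical name like "X.12A"). Assuming the version suffix is always digits immediately followed by the vertical bar, the bar can be made part of the pattern instead:

```r
myString <- c("ENSG00000185561.10|TLCD2",
              "ENSG00000124785.9|NRN1",
              "ENSG00000287339.1|RP11-575F12.4")

# Split only where a dot and one or more digits are directly followed by "|",
# so dots inside the gene name are never treated as separators.
strict_pattern <- "\\.\\d+\\|"

parts <- strsplit(myString, strict_pattern, perl = TRUE)
result <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
result

# Hypothetical input with a "dot + digits + character" run in the name:
edge_case <- strsplit("ENSG1.1|X.12A", strict_pattern, perl = TRUE)[[1]]
edge_case
```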
Anoushiravan R
0

You can use str_extract and lookahead (?=\\|) and, respectively, lookbehind (?<=\\|) to assert the | as demarcation point:

library(stringr)
df <- data.frame(
  col1 = str_extract(myString, ".*?(?=\\|)"),
  col2 = str_extract(myString, "(?<=\\|).*$")
)
df
                col1          col2
1 ENSG00000185561.10         TLCD2
2  ENSG00000124785.9          NRN1
3  ENSG00000287339.1 RP11-575F12.4

EDIT:

If you want three columns:

df <- data.frame(
  col1 = str_extract(myString, ".*?(?=\\.)"),
  col2 = str_extract(myString, "(?<=\\.)\\d+(?=\\|)"),
  col3 = str_extract(myString, "(?<=\\|).*$")
)
df
             col1 col2          col3
1 ENSG00000185561   10         TLCD2
2 ENSG00000124785    9          NRN1
3 ENSG00000287339    1 RP11-575F12.4
Chris Ruehlemann
0

It seems to me that you are trying to cram two operations into a single command. First split at | to create two columns, then remove the dot suffix from the first column. I think this is simpler, and there is no need for external packages either:

myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1", "ENSG00000287339.1|RP11-575F12.4")

df <- do.call(rbind, strsplit(myString, '\\|'))
df[,1] <- sub('\\..*', '', df[,1])

df
     [,1]              [,2]           
[1,] "ENSG00000185561" "TLCD2"        
[2,] "ENSG00000124785" "NRN1"         
[3,] "ENSG00000287339" "RP11-575F12.4"

or am I missing something...?
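If a single call is preferred, the two steps can be wrapped in a small function. The sketch below assumes every element contains exactly one vertical bar; split_gene_label is a hypothetical name chosen for illustration, not a function from any package.

```r
# Hypothetical helper: splits "ID.version|name" strings into two columns.
# Assumes exactly one "|" per element.
split_gene_label <- function(x) {
  # Step 1: split at the literal vertical bar (fixed = TRUE avoids regex escaping).
  parts <- do.call(rbind, strsplit(x, "|", fixed = TRUE))
  # Step 2: drop everything from the first dot onwards in the left-hand part.
  data.frame(
    col1 = sub("\\..*", "", parts[, 1]),
    col2 = parts[, 2],
    stringsAsFactors = FALSE
  )
}

myString <- c("ENSG00000185561.10|TLCD2",
              "ENSG00000124785.9|NRN1",
              "ENSG00000287339.1|RP11-575F12.4")

df <- split_gene_label(myString)
df
```

Used this way, the call site is a one-liner, while the two operations stay separate and readable inside the function.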

dariober
  • exactly, I wanted to do it in one line. I could do both separately and join them afterwards with cbind() but I could not figure out how to do it all together – Paula Oct 21 '21 at 12:34
  • @Paula Is there a particular reason for doing it in one line (other than the challenge)? You could always wrap a two-line solution like mine into a function that becomes a one-liner when used. In my opinion, the other answers so far are clever but quite obscure. (I don't see why you would need `cbind`). – dariober Oct 21 '21 at 13:01
  • because it keeps the overall script shorter and simpler. Especially when you want to change something upstream, you have to make sure you re-run more than one line of code so the change is applied to your final (in this case) data frame. I used cbind because I created both columns separately and then joined them: `col2 <- data.frame(do.call(rbind, strsplit(myString,"\\|"))) %>% select(2)` `col1 <- data.frame(do.call(rbind, strsplit(myString,"\\."))) %>% select(1)` `df <- cbind(col1, col2)` – Paula Oct 22 '21 at 07:20