Efficient string splitting on first match using data.table

Question

I have data of the following structure:

require(data.table)
dt = data.table(c('string1: val1', 'gnistr2: val2', 'ingstr3: :::!val3', 'gtrins4: val4'))

> dt
                  V1
1:     string1: val1
2:     gnistr2: val2
3: ingstr3: :::!val3
4:     gtrins4: val4

My goal is to separate the column V1 using only the first delimiter match (:), hence resulting in a Nx2 table. Had the third entry not featured the delimiter in its "value"-part, I know perfectly well that data.table's built-in tstrsplit() function would work like a charm. I do not seek to regex out additional delimiter occasions, please assume that :::!val3 is a valid entry for line 3.

The stringi library contains a neat function stri_split_fixed() that allows limiting the split into n components. I do have a working solution that capitalises on this (see the stringi call using n=2):

require(stringi)
dt1 = dt[, .(val = unlist(stri_split_fixed(V1, ':', n=2))), .(r = seq_along(V1))]
dt1[, var := paste0('COL', 1:.N), r]
dt1 = dcast(dt1, r ~ var, value.var = 'val')[, .(COL1, COL2)]

> dt1 (this is my preferred output)
      COL1      COL2
1: string1      val1
2: gnistr2      val2
3: ingstr3  :::!val3
4: gtrins4      val4

I do, however feel, that the whole dcast call introduces unnecessary overhead, and I suspect that I can reduce my runtime. For example, I believe that there must be a way to achieve the same solution by omitting unlist and working from list columns:

dt2 = dt[, .(listcol = stri_split_fixed(V1, ':', n=2)), .(r = seq_along(V1))][, .(listcol)]
> dt2
             listcol
1:     string1, val1
2:     gnistr2, val2
3: ingstr3, :::!val3
4:     gtrins4, val4

For an easier overview of the structure:

> dt2[[1]]
[[1]]
[1] "string1" " val1"  

[[2]]
[1] "gnistr2" " val2"  

[[3]]
[1] "ingstr3"   " :::!val3"

[[4]]
[1] "gtrins4" " val4"

Is there a way to efficiently merge dt2 into the output presented by the first solution? My experience in working with list columns is limited. Perhaps there are even other, even quicker approaches that do not even involve list columns?

Thank you.

Not sure if this covers all scenarios, but maybe something like `dt[, tstrsplit(V1, "(?<=[^:]):(?=[^:])", perl=TRUE)]` ? — thelatemail, Mar 15 '23 at 21:16

score 3 · Answer 1 · answered Mar 15 '23 at 21:58

Try

library(data.table)
setnames(dt[, tstrsplit(V1, ":\\s+")], c("COL1", "COL2"))[]

-output

      COL1     COL2
1: string1     val1
2: gnistr2     val2
3: ingstr3 :::!val3
4: gtrins4     val4

Or with tidyr

library(tidyr)
separate_wider_delim(dt, V1, delim = ": ", 
    names = c("COL1", "COL2"), too_few = "align_end")

-output

# A tibble: 4 × 2
  COL1    COL2    
  <chr>   <chr>   
1 string1 val1    
2 gnistr2 val2    
3 ingstr3 :::!val3
4 gtrins4 val4

langtang · Answer 2 · 2023-03-16T00:43:35.870

1

Just wrap a single call to stri_split_fixed(), using simplify=T, in data.table();

data.table(dt[, stri_split_fixed(V1,":",2,simplify=T)])

Output:

        V1        V2
1: string1      val1
2: gnistr2      val2
3: ingstr3  :::!val3
4: gtrins4      val4

edited Mar 16 '23 at 00:43

answered Mar 15 '23 at 23:22

langtang

22,248
1
12
27

Efficient string splitting on first match using data.table

2 Answers2