I have data of the following structure:
require(data.table)
dt = data.table(c('string1: val1', 'gnistr2: val2', 'ingstr3: :::!val3', 'gtrins4: val4'))
> dt
V1
1: string1: val1
2: gnistr2: val2
3: ingstr3: :::!val3
4: gtrins4: val4
My goal is to separate the column V1 using only the first delimiter match (:
), hence resulting in a Nx2 table. Had the third entry not featured the delimiter in its "value"-part, I know perfectly well that data.table
's built-in tstrsplit()
function would work like a charm. I do not seek to regex out additional delimiter occasions, please assume that :::!val3
is a valid entry for line 3.
The stringi
library contains a neat function stri_split_fixed()
that allows limiting the split into n
components. I do have a working solution that capitalises on this (see the stringi
call using n=2
):
require(stringi)
dt1 = dt[, .(val = unlist(stri_split_fixed(V1, ':', n=2))), .(r = seq_along(V1))]
dt1[, var := paste0('COL', 1:.N), r]
dt1 = dcast(dt1, r ~ var, value.var = 'val')[, .(COL1, COL2)]
> dt1 (this is my preferred output)
COL1 COL2
1: string1 val1
2: gnistr2 val2
3: ingstr3 :::!val3
4: gtrins4 val4
I do, however feel, that the whole dcast
call introduces unnecessary overhead, and I suspect that I can reduce my runtime. For example, I believe that there must be a way to achieve the same solution by omitting unlist
and working from list columns:
dt2 = dt[, .(listcol = stri_split_fixed(V1, ':', n=2)), .(r = seq_along(V1))][, .(listcol)]
> dt2
listcol
1: string1, val1
2: gnistr2, val2
3: ingstr3, :::!val3
4: gtrins4, val4
For an easier overview of the structure:
> dt2[[1]]
[[1]]
[1] "string1" " val1"
[[2]]
[1] "gnistr2" " val2"
[[3]]
[1] "ingstr3" " :::!val3"
[[4]]
[1] "gtrins4" " val4"
Is there a way to efficiently merge dt2
into the output presented by the first solution? My experience in working with list columns is limited. Perhaps there are even other, even quicker approaches that do not even involve list columns?
Thank you.