
I have a dataframe in the style of

id choice
----------
1  "(a,b,c)"
2  "(c)"
3  "(a,c)"
4  "(d)"

i.e.

DF = data.frame(id=c(1,2,3,4), choice=c("(a,b,c)","(c)","(a,c)","(d)"))

and I would like to obtain a dataframe with boolean columns

id a b c d
----------
1  T T T F
2  F F T F
3  T F T F
4  F F F T

Based on "Split comma-separated strings into boolean columns" I tried using

library(splitstackshape)
cSplit_e(DF, "choice", sep = "[,()]", mode = "binary", 
         type = "character", fill = 0, drop = TRUE, fixed = FALSE)

however, this runs out of memory with my actual dataframe (I have ca. 10 million rows and ca. 1.5 million unique choice values, a number not known in advance), and in general it is also quite slow.

I was therefore wondering if there is a faster and more memory-efficient way to do this, e.g. using data.table, dplyr or multidplyr, perhaps also storing the boolean matrix in a sparse format?

In my actual dataset, choice is a set of possible amino acid mutations, so instead of a,b,c, ... it looks e.g. like

(NSP5_P132H,Spike_H69del,Spike_T95I,Spike_A67V,Spike_N969K,Spike_H655Y,Spike_N856K,N_R203K,Spike_G142D,NSP3_A1892T,Spike_Q954H,N_P13L,NSP3_L1266I,N_R32del,M_Q19E,NSP4_T492I,NSP6_L105del,Spike_N679K,Spike_N764K,Spike_L212I,NSP6_G107del,NSP6_I189V,Spike_T547K,M_D3G,Spike_D796Y,N_G204R,Spike_V143del,M_A63T,Spike_K417N)

and I don't have an a priori list of the possible mutations occurring in the dataset - so I don't know my choices in advance (there are ca. 1.5 million unique ones)...

Here are the first 100 rows of my actual dataset:

library(readr)
DF = read_csv("https://www.dropbox.com/s/v1dhbzmlyudxvyi/DF.csv?dl=1")
DF$id = 1:nrow(DF)

EDIT2: the data.table solution below would in principle have worked, were it not for the fact that I would end up with a data.table with 1.5 million columns and 10 million rows, which no longer fits in memory on my machine. I guess storing it as a sparse matrix (see the sketch below) would really be necessary in the end?
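
For reference, a minimal sketch of what that sparse representation could look like, using Matrix::sparseMatrix() (assuming DF as defined above; the choice levels are discovered from the data):

# Hedged sketch: build a sparse logical indicator matrix instead of a dense
# data frame; assumes DF$choice is formatted as "(a,b,c)" as in the example.
library(Matrix)
lst <- strsplit(gsub("[()]", "", DF$choice), ",", fixed = TRUE)
choices <- unique(unlist(lst))             # choice levels discovered from the data
m <- sparseMatrix(
  i = rep(seq_along(lst), lengths(lst)),   # one row index per token
  j = match(unlist(lst), choices),         # one column index per token
  x = TRUE,                                # gives a logical sparse matrix
  dims = c(length(lst), length(choices)),
  dimnames = list(as.character(DF$id), choices)
)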

Tom Wenseleers
  • Completely uninformed, likely hare-brained, but perhaps treating your otherwise unwieldy, oversized 1.5e6 × 1e7 matrix (is it a file?) as a `terra::rast`. Bookending with completely uninformed and likely hare-brained. – Chris Sep 18 '22 at 03:37
  • @Chris I was hoping I could store it in a sparse matrix (Matrix::Matrix(x, sparse=TRUE)). But in the end I'm veering now towards keeping it as a list column instead, using DF %>% transform(choice = strsplit(choice, ",")). I was hoping that having a nice matrix would facilitate later selections of rows, but it seems it's more trouble than it's worth, and with list columns I can apparently also do fast subsetting using %in%. – Tom Wenseleers Sep 18 '22 at 11:57
  • How much RAM are you overwhelming? I have 8G, for instance. For future readers with 'run out of memory' problems. – Chris Sep 18 '22 at 13:03
  • 64Gb on my machine... – Tom Wenseleers Sep 18 '22 at 19:03
  • In your DF the max of lengths(strsplit(DF$choice, ",")) is 79L; is the max choice known or knowable? – Chris Sep 19 '22 at 15:08
  • The max choice is not known in advance in my case, no - I would just have to determine it empirically, using choices <- unique(do.call('c', strsplit(DF$choice, ",", fixed = TRUE))) (though that call in itself is already pretty slow for my full 10 million row dataset - doing this in chunks, potentially parallelized, could speed it up a bit; see the sketch after these comments) – Tom Wenseleers Sep 21 '22 at 08:44
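
A minimal sketch of that chunked approach (chunk size and core count are arbitrary, and parallel::mclapply only forks on Unix-alikes):

# Hedged sketch: find the unique choices chunk-wise, optionally in parallel.
library(parallel)
split_choices <- function(x) strsplit(gsub("[()]", "", x), ",", fixed = TRUE)
chunks <- split(DF$choice, ceiling(seq_along(DF$choice) / 1e6))  # ~1M rows per chunk
choices <- unique(unlist(
  mclapply(chunks, function(ch) unique(unlist(split_choices(ch))), mc.cores = 4L)
))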

4 Answers


You can try dcast + strsplit like below

dcast(
  setDT(DF)[
    ,
    .(choice = Filter(nzchar, unlist(strsplit(choice, "\\W+")))),
    id
  ],
  id ~ choice,
  fun.aggregate = function(x) length(x) > 0
)

which gives

   id     a     b     c     d
1:  1  TRUE  TRUE  TRUE FALSE
2:  2 FALSE FALSE  TRUE FALSE
3:  3  TRUE FALSE  TRUE FALSE
4:  4 FALSE FALSE FALSE  TRUE
ThomasIsCoding
  • gsub will be faster using fixed=T, AFAIR – jangorecki Sep 17 '22 at 21:29
  • @jangorecki Yes, you are right, `fixed=T` speeds up. I updated my solution with another approach, please check it out. – ThomasIsCoding Sep 17 '22 at 21:47
  • Many thanks! On my small example this works, but for some reason it doesn't on my actual data - I then get Error in as.vector(x, "list") : cannot coerce type 'closure' to vector of type 'list'. I edited my question to include a bit of my data: library(readr); DF = read_csv("https://www.dropbox.com/s/v1dhbzmlyudxvyi/DF.csv?dl=1"); DF$id = 1:nrow(DF). Do you see where things are going wrong? – Tom Wenseleers Sep 18 '22 at 12:39

Using fastDummies

library(fastDummies)
library(dplyr)
library(stringr)
DF %>%
  mutate(choice = str_remove_all(choice, "\\(|\\)")) %>% 
  dummy_cols("choice", split = ",") %>%
  transmute(id, across(starts_with('choice'), as.logical, 
     .names = "{str_remove(.col, 'choice_')}")) %>%
  select(-choice)

Output:

  id     a     b     c     d
1  1  TRUE  TRUE  TRUE FALSE
2  2 FALSE FALSE  TRUE FALSE
3  3  TRUE FALSE  TRUE FALSE
4  4 FALSE FALSE FALSE  TRUE

With base R, strsplit + table should be fast

lst1 <- strsplit(gsub("[()]", "", DF$choice), ",")
table(rep(seq_along(lst1), lengths(lst1)), unlist(lst1)) > 0
   
        a     b     c     d
  1  TRUE  TRUE  TRUE FALSE
  2 FALSE FALSE  TRUE FALSE
  3  TRUE FALSE  TRUE FALSE
  4 FALSE FALSE FALSE  TRUE
akrun
  • Many thanks! The fastDummies solution is very slow, but the base R solution has good speed. Only trouble is that I have 1.5 million choices and 10 million rows & so I run out of memory. Maybe I'll have to work in chunks, convert each chunk to a sparse matrix (Matrix::Matrix(x, sparse=T)) and then bind_rows those? – Tom Wenseleers Sep 18 '22 at 12:34
  • Though in the end I'm veering now towards keeping it as a list column instead, using DF %>% transform(choice = strsplit(choice, ",")). I was hoping that having a nice matrix would facilitate later selections of rows, but it seems it's more trouble than it's worth, and with list columns I can apparently also do fast subsetting using %in% (see the sketch below). – Tom Wenseleers Sep 18 '22 at 12:35
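
A small sketch of that list-column approach (the mutation name is just an example from the question's data; assumes the parentheses have already been stripped from choice):

# Hedged sketch: keep choice as a list column and subset rows with %in%.
DF2 <- transform(DF, choice = strsplit(choice, ",", fixed = TRUE))
# e.g. keep all rows whose choice set contains a given mutation:
hit <- vapply(DF2$choice, function(x) "Spike_H69del" %in% x, logical(1))
DF2[hit, ]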

Try this

library(tidyverse)
x %>%
  mutate(choice = str_extract_all(choice, "\\w")) %>%
  unnest(choice) %>%
  pivot_wider(
    names_from = choice,
    values_fill = FALSE,
    values_from = choice,
    values_fn = ~ !is.na(.)
  )

assuming x is your dataset (i.e. DF)

Mossa
  • Thanks - for my example above that works, but in my dataset I have columns that look e.g. like "(NSP5_P132H,Spike_H69del,Spike_T95I,Spike_A67V,Spike_N969K,Spike_H655Y,Spike_N856K,N_R203K,Spike_G142D,NSP3_A1892T,Spike_Q954H,N_P13L,NSP3_L1266I,N_R32del,M_Q19E,NSP4_T492I,NSP6_L105del,Spike_N679K,Spike_N764K,Spike_L212I,NSP6_G107del,NSP6_I189V,Spike_T547K,M_D3G,Spike_D796Y,N_G204R,Spike_V143del,M_A63T,Spike_K417N)" and then I get error Applying `values_fn` to `choice` must result in a single summary value per key. & Applying `values_fn` resulted in a value with length 30. Any thoughts? – Tom Wenseleers Sep 17 '22 at 20:57
  • Guess the "\\w" needs changing perhaps? – Tom Wenseleers Sep 17 '22 at 20:58
  • I know it's late in the day, but changing \w to \w+ will get this to work (\w+ = one or more word characters). – RobS Jul 17 '23 at 09:15
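
For completeness, the answer with that fix applied (a sketch, assuming DF as in the question):

library(tidyverse)
DF %>%
  mutate(choice = str_extract_all(choice, "\\w+")) %>%  # one or more word characters
  unnest(choice) %>%
  pivot_wider(
    names_from  = choice,
    values_from = choice,
    values_fill = FALSE,
    values_fn   = ~ !is.na(.)
  )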

Here is an example with data.table:

library(data.table)
DF = data.table(id=c(1,2,3,4), choice=c("(a,b,c)","(c)","(a,c)","(d)"))
DF$choice <- gsub("\\(|\\)", "", DF$choice)
choices <- unique(do.call('c', strsplit(DF$choice, ",", fixed = TRUE)))

DF[, `colnames<-`(as.data.frame(t(seq_along(choices) %in% as.numeric(
  factor(strsplit(choice, ",", fixed = TRUE)[[1]], choices)
))), choices), by = "id"]
#>    id     a     b     c     d
#> 1:  1  TRUE  TRUE  TRUE FALSE
#> 2:  2 FALSE FALSE  TRUE FALSE
#> 3:  3  TRUE FALSE  TRUE FALSE
#> 4:  4 FALSE FALSE FALSE  TRUE
Andrew Brown
  • Many thanks! Only trouble I have here is that I don't know my choices beforehand... Is there a workaround for that? – Tom Wenseleers Sep 17 '22 at 21:01
  • In my dataset my choices are possible amino acid mutations, e.g. "(NSP5_P132H,Spike_H69del,Spike_T95I,Spike_A67V,Spike_N969K,Spike_H655Y,Spike_N856K,N_R203K,Spike_G142D,NSP3_A1892T,Spike_Q954H,N_P13L,NSP3_L1266I,N_R32del,M_Q19E,NSP4_T492I,NSP6_L105del,Spike_N679K,Spike_N764K,Spike_L212I,NSP6_G107del,NSP6_I189V,Spike_T547K,M_D3G,Spike_D796Y,N_G204R,Spike_V143del,M_A63T,Spike_K417N)" – Tom Wenseleers Sep 17 '22 at 21:02
  • I updated the example to remove parentheses and calculate unique levels from the source data, hope that helps – Andrew Brown Sep 17 '22 at 21:06
  • Ha many thanks - it was worse than I thought - I have 1.5 million unique choices / AA mutations & 10 million rows. So it had to crunch quite hard even to just get the unique AA mutations. But unfortunately after that it runs out of memory for me here... Which is maybe not surprising with 1.5 million columns & 10 million rows... Guess I really have to store that as a sparse matrix then to be able to fit this in memory? – Tom Wenseleers Sep 17 '22 at 21:28
  • So I presume I would have to work in chunks, convert each chunk to a sparse matrix & then bind_rows those? – Tom Wenseleers Sep 17 '22 at 21:34
  • Just a detail: stringr::str_sub(DF$choice, 2, -2) was ca. 10 times faster than gsub("\\(|\\)", "", DF$choice) at removing the leading & trailing parentheses... – Tom Wenseleers Sep 17 '22 at 21:41
  • Yes, that makes sense. The stringi/stringr methods should be considerably more efficient for that regex replacement. And that operation would not scale well to things that don't fit in memory, but it should not be too bad to do as you said (process chunks of `DF`, appending to a file). – Andrew Brown Sep 17 '22 at 21:48