I have a dataframe in the style of
id choice
----------
1 "(a,b,c)"
2 "(c)"
3 "(a,c)"
4 "(d)"
i.e.
DF = data.frame(id=c(1,2,3,4), choice=c("(a,b,c)","(c)","(a,c)","(d)"))
and I would like to obtain a dataframe with boolean columns
id a b c d
----------
1 T T T F
2 F F T F
3 T F T F
4 F F F T
Based on Split comma-separated strings into boolean columns, I tried using
library(splitstackshape)
cSplit_e(DF, "choice", sep = "[,()]", mode = "binary",
         type = "character", fill = 0, drop = TRUE, fixed = FALSE)
however, this runs out of memory on my actual dataframe (ca. 10 million rows and ca. 1.5 million unique choice values, with the number of values not known in advance), and it is also quite slow in general.
I was therefore wondering whether there is a faster and more memory-efficient way to do this, e.g. using data.table, dplyr, or multidplyr, and maybe also storing the boolean matrix in a sparse format?
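To illustrate the kind of data.table route I mean: I imagine first melting the choice strings into a long (id, token) table, roughly like this (just a rough sketch, assuming the choice strings are always wrapped in parentheses as above; I haven't tested it at full scale):
library(data.table)
DT = as.data.table(DF)
# one row per (id, individual choice): strip the parentheses, then split on commas
long = DT[, .(token = unlist(strsplit(gsub("[()]", "", choice), ",", fixed = TRUE))), by = id]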
In my actual dataset, choice contains sets of possible amino acid mutations, so instead of a,b,c, ... the values look like e.g.
(NSP5_P132H,Spike_H69del,Spike_T95I,Spike_A67V,Spike_N969K,Spike_H655Y,Spike_N856K,N_R203K,Spike_G142D,NSP3_A1892T,Spike_Q954H,N_P13L,NSP3_L1266I,N_R32del,M_Q19E,NSP4_T492I,NSP6_L105del,Spike_N679K,Spike_N764K,Spike_L212I,NSP6_G107del,NSP6_I189V,Spike_T547K,M_D3G,Spike_D796Y,N_G204R,Spike_V143del,M_A63T,Spike_K417N)
and I don't have an a priori list of the possible mutations occurring in the dataset, so I don't know my choices in advance (there are ca. 1.5 million unique ones)...
Here are the first 100 rows of my actual dataset:
library(readr)
DF = read_csv("https://www.dropbox.com/s/v1dhbzmlyudxvyi/DF.csv?dl=1")
DF$id = 1:nrow(DF)
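Just to confirm that the set of unique choices can at least be enumerated on the fly (since I don't know them in advance), something along these lines works on the 100-row sample, again assuming the same parenthesised, comma-separated format:
# distinct mutations occurring anywhere in choice (not known a priori)
tokens = sort(unique(unlist(strsplit(gsub("[()]", "", DF$choice), ",", fixed = TRUE))))
length(tokens)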
EDIT2: the data.table solution below would in principle have worked, were it not for the fact that I would end up with a data table with 1.5 million columns and 10 million rows, which no longer fits in memory on my machine. I guess storing the result as a sparse matrix really would be necessary in the end?
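For reference, this is the sort of sparse construction I have in mind, building the indicator matrix directly with Matrix::sparseMatrix from the long (id, token) table sketched above (again only a sketch, not verified at the full 10-million-row scale):
library(Matrix)
library(data.table)
DT = as.data.table(DF)
long = DT[, .(token = unlist(strsplit(gsub("[()]", "", choice), ",", fixed = TRUE))), by = id]
# encode ids and tokens as factors so they can serve as row/column indices
rows = factor(long$id, levels = unique(DT$id))
cols = factor(long$token)
# sparse logical matrix: one row per id, one column per distinct choice/mutation
M = sparseMatrix(i = as.integer(rows), j = as.integer(cols), x = TRUE,
                 dims = c(nlevels(rows), nlevels(cols)),
                 dimnames = list(levels(rows), levels(cols)))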