I have a large transaction dataset (around 5 million rows), and I need to split all transactions by ID (around 1 million unique IDs). The expected result is, for each unique ID, a list of its items.
I tried the simplest and most direct way to split the transaction dataset (referring to Why is split inefficient on large data frames with many groups?), and I know that converting the data frame into a data.table might be more efficient.
Sample source df
set.seed(123)
n <- 500000  # number of sample rows (500k as a trial)
x <- data.frame(ID = paste(LETTERS[1:8], sample(1:round(n/3), n, replace = TRUE), sep = ""),
                Item = sample(c('apple', 'orange', 'lemon', 'tea', 'rice'), n, replace = TRUE))
Convert ID to character and Item to factor
x$ID <- as.character(x$ID)
x$Item <- as.factor(x$Item)
Convert df into dt, then split dt into lists
library(data.table)
x <- as.data.table(x)
system.time(
  xx <- split(x$Item, x$ID)
)
Expected results in lists
head(xx, 2)
#$A100
#[1] tea orange
#Levels: apple lemon orange rice tea
#$A101
#[1] rice
#Levels: apple lemon orange rice tea
Problem: After running for 2 hours on my machine (4 cores, 16 GB RAM, Win10, R 3.4.3), it was still going and never completed. I checked the CPU usage while it was running: it only consumed 35-40% of the CPU.
My idea:
I'm wondering whether there is any way to fully utilize the computational power of my machine (run the "split" in parallel), using only detectCores() - 1 = 3 cores:
1st: Split the large transaction dataset by ID into 3 smaller partitions (smaller datasets).
2nd: Use a foreach loop to split the 3 partitions into lists in parallel, then append (combine) the resulting list from each iteration until the end (a rough sketch follows this list).
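A rough sketch of what these two steps might look like with foreach plus doParallel (the cl/ids/id_chunks names, the way I chunk the IDs into 3 groups, and the c() combine are my own assumptions, and I have not tested this at the full data size):

library(data.table)
library(foreach)
library(doParallel)

cl <- makeCluster(3)   # detectCores() - 1 on this machine
registerDoParallel(cl)

# 1st: partition the unique IDs into 3 roughly equal groups
ids <- unique(x$ID)
id_chunks <- split(ids, cut(seq_along(ids), 3, labels = FALSE))

# 2nd: split each partition into a list on its own worker, then concatenate the lists
xx <- foreach(chunk = id_chunks, .combine = c, .packages = "data.table") %dopar% {
  part <- x[x$ID %in% chunk, ]
  split(part$Item, part$ID)
}

stopCluster(cl)

One caveat I can already see: foreach exports the whole x to every worker, so peak memory use roughly triples; subsetting the data into the 3 partitions up front and sending each worker only its own partition would avoid that.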
Question: Is my idea practical? I did read about mclapply and its mc.cores argument, but it seems like mc.cores = 1 is the only option on Windows, so it won't help in my case.
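For what it is worth, the cluster-based functions in the parallel package (parLapply over a PSOCK cluster from makeCluster()) do accept more than one worker on Windows, unlike mclapply with mc.cores > 1. A minimal sketch of applying them to the same id_chunks as in the foreach sketch above (again untested at the full data size; xx_parts is just an illustrative name):

library(parallel)

cl <- makeCluster(3)                   # PSOCK cluster: works on Windows
clusterEvalQ(cl, library(data.table))  # workers need data.table loaded to subset x
clusterExport(cl, "x")                 # ship the data to each worker

xx_parts <- parLapply(cl, id_chunks, function(chunk) {
  part <- x[x$ID %in% chunk, ]
  split(part$Item, part$ID)
})
stopCluster(cl)

# Concatenate the three partial results into one named list
xx <- do.call(c, unname(xx_parts))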
Is there any better and more efficient way to do the split for a large dataset? Any comment is welcome, thanks!