
I have a large transaction dataset (around 5 million rows) and I need to split the transactions by ID (around 1 million unique IDs). The expected result is a list of items per unique ID.

I tried the simplest and most direct way to split the transaction dataset (referring to Why is split inefficient on large data frames with many groups?), and I know that converting the data frame into a data.table might be more efficient.

Sample source df

set.seed(123)
n = 500000 #number of sample data (500k as trial)
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""), 
                Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE) 
                )

Convert ID to character and Item to factor

x$ID <- as.character(x$ID)
x$Item <- as.factor(x$Item)

Convert df into dt, then split dt into lists

library(data.table)
x <- as.data.table(x)
system.time(
  xx <- split(x$Item, x$ID)
)
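
(Side note: as written, the as.data.table() conversion does not change the code path here, because split() still operates on two plain vectors. A data.table-idiomatic sketch that builds the per-ID lists through grouping instead would be something like the following; the column name Items is just an illustrative choice.)

agg <- x[, .(Items = list(Item)), by = ID]   # one row per ID, items collected in a list column
xx_dt <- setNames(agg$Items, agg$ID)         # extract as a named list, like split() would return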

Expected results in lists

head(xx, 2)
#$A100
#[1] tea    orange
#Levels: apple lemon orange rice tea

#$A101
#[1] rice
#Levels: apple lemon orange rice tea

Problem: After running for 2 hours on my machine (4 cores, 16 GB RAM, Win10, R 3.4.3), it was still running and never completed. I checked CPU usage while it was running: it consumed only 35-40% of the CPU.

My idea:

I'm wondering whether there is any way to fully utilize the computational power of my machine (run the split in parallel), using detectCores() - 1 = 3 cores.

1st: Split the large transaction dataset by ID into 3 smaller partitions (smaller datasets).

2nd: Use a foreach loop to split the 3 partitions (smaller datasets) into lists in parallel, then append (combine) the lists from each iteration at the end (see the sketch below).
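
A minimal sketch of that two-step plan, assuming a 3-worker PSOCK cluster via doParallel (the cluster type Windows supports); note the caveat in the comments below that copying x to each worker may cost more than it saves:

library(foreach)
library(doParallel)

cl <- makeCluster(3)        # detectCores() - 1 on a 4-core machine
registerDoParallel(cl)

# Hash each ID into one of 3 partitions so all rows of an ID stay together
part <- as.integer(factor(x$ID)) %% 3L

# Split each partition in parallel, then concatenate the named lists;
# names cannot clash because each ID lives in exactly one partition
xx <- foreach(p = 0:2, .combine = c) %dopar% {
  keep <- part == p
  split(x$Item[keep], x$ID[keep])
}

stopCluster(cl)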

Question: Is my idea practical? I read about mclapply and its mc.cores argument, but it seems mc.cores = 1 is the only option on Windows, so it won't help in my case. Is there a better and more efficient way to do the split for a large dataset? Any comment is welcome, thanks!

yc.koong
  • One problem with splitting an object that is in memory with a parallel mechanism is that most (not all) of these mechanisms expect it to be already in a list. You might split them evenly and have each process split by `ID`, then you'd need to concatenate the individual id's lists. – r2evans Jan 22 '19 at 16:20
  • What is the purpose of splitting? Can you not just work with the whole data frame? – Onyambu Jan 22 '19 at 17:12
  • Hi @Onyambu, my ultimate goal is to generate the list and then convert it into a transactions object `basket <- as(xx, "transactions")` for market basket analysis using `library(arules)`. I know there are several ways of loading a raw transaction dataset into a transactions object, but I'm trying to explore the current approach :) – yc.koong Jan 23 '19 at 03:38
  • In this case the problem with parallel on Windows, as well, is that it requires copying the data to each parallel process, which would probably take longer than splitting without factors. – chasemc Jan 23 '19 at 13:52

2 Answers


Surprisingly and interestingly, consider by (the object-oriented wrapper to tapply), which operates similarly to split on data frames with the added feature of passing each split into a function call. The equivalent of split would be to return the argument unchanged, i.e. to call identity.

by(x$Item, x$ID, function(x) x)

by(x$Item, x$ID, identity)

Do note, the return value of by is a by-class object, which is essentially a list with additional attributes.
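
A quick check of that structure (a small illustration using the question's x):

xx2 <- by(x$Item, x$ID, identity)
class(xx2)              # "by"
is.list(unclass(xx2))   # TRUE: underneath, a list carrying dimnames and call attributes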

Using your random data frame example, base::split did not finish after 1 hour, but base::by finished in well under 5 minutes on my machine with 64 GB of RAM! I had always assumed by would have more overhead, being a sibling of the apply family, but my opinion may soon change.

50K ROW EXAMPLE

set.seed(123)
n = 50000 #number of sample data (50k as trial)
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""), 
                Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE) 
)

system.time( xx <- split(x$Item, x$ID) )
#   user  system elapsed 
#  20.09    0.00   20.09 

system.time( xx2 <- by(x$Item, x$ID, identity) )
#   user  system elapsed 
#   1.55    0.00    1.55 

all.equal(unlist(xx), unlist(xx2))
# [1] TRUE

identical(unlist(xx), unlist(xx2))
# [1] TRUE

500K ROW EXAMPLE

set.seed(123)
n = 500000 #number of sample data (500k as trial)
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""), 
                Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE) 
)

system.time( xx <- split(x$Item, x$ID) )
# DID NOT FINISH AFTER 1 HOUR

system.time( xx2 <- by(x$Item, x$ID, identity) )
#   user  system elapsed 
#  23.00    0.06   23.09 

The source code reveals that split.default does more processing at the R level (unlike C or Fortran), running a for loop across the factor levels:

getAnywhere(split.default)

function (x, f, drop = FALSE, sep = ".", lex.order = FALSE, ...) 
{
    if (!missing(...)) 
        .NotYetUsed(deparse(...), error = FALSE)
    if (is.list(f)) 
        f <- interaction(f, drop = drop, sep = sep, lex.order = lex.order)
    else if (!is.factor(f)) 
        f <- as.factor(f)
    else if (drop) 
        f <- factor(f)
    storage.mode(f) <- "integer"
    if (is.null(attr(x, "class"))) 
        return(.Internal(split(x, f)))
    lf <- levels(f)
    y <- vector("list", length(lf))
    names(y) <- lf
    ind <- .Internal(split(seq_along(x), f))
    for (k in lf) y[[k]] <- x[ind[[k]]]
    y
}
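
The branch matters: an unclassed vector returns early through the fast .Internal path, while a classed vector such as a factor falls through to the R-level loop. A small synthetic sketch of the difference (timings will vary):

v_chr <- sample(letters, 1e5, replace = TRUE)   # plain character: no class attribute
v_fac <- factor(v_chr)                          # class "factor"
g <- factor(sample(2e4, 1e5, replace = TRUE))   # ~20k groups

system.time(split(v_chr, g))   # .Internal(split(x, f)): fast C path
system.time(split(v_fac, g))   # for loop with [.factor over ~20k levels: much slower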

Conversely, the source code for by.data.frame reveals a call to tapply, which itself is a wrapper around lapply:

getAnywhere(by.data.frame)

function (data, INDICES, FUN, ..., simplify = TRUE) 
{
    if (!is.list(INDICES)) {
        IND <- vector("list", 1L)
        IND[[1L]] <- INDICES
        names(IND) <- deparse(substitute(INDICES))[1L]
    }
    else IND <- INDICES
    FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
    nd <- nrow(data)
    structure(eval(substitute(tapply(seq_len(nd), IND, FUNx, 
        simplify = simplify)), data), call = match.call(), class = "by")
}
Parfait
  • Hi @Parfait, thanks for providing very good insight by comparing `split` and `by`. I tried to split 5 million rows using `xx2 <- by(x$Item, x$ID, identity)` and the run time is super fast (it completed in around 1 min)! I have another question: is it possible to convert a `by` object to a `list` object? My ultimate goal is to get a `list` of transactions and then convert it into a `transactions` object `basket <- as(xx, "transactions")` using `library(arules)`, but it won't work with a `by` object, although the end results for `by` and `list` look the same. – yc.koong Jan 23 '19 at 04:33
  • @yc.koong just use `uclass(results)`. This should give you a list in case you have a `by` object – Onyambu Jan 23 '19 at 04:41
  • @Onyambu I couldn't find any function named `uclass()`. Do you mean `unclass(results)`? – yc.koong Jan 23 '19 at 04:58
  • @yc.koong if it's not working, brute force it to class list, i.e. `as(results,'list')`; that will force it to class `list` – Onyambu Jan 23 '19 at 05:00
  • @Onyambu converting the `by` object to a `list` object using `as(results, 'list')` works very well, thanks! – yc.koong Jan 23 '19 at 05:56
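
Putting the comment thread together, a hedged sketch of the full path from the by result to an arules transactions object (assumes arules is installed; the unique(as.character(...)) step is a precaution, since the coercion expects character item labels and may reject baskets with duplicated items):

library(arules)

xx2 <- by(x$Item, x$ID, identity)

# Coerce the "by" object to a plain list (as discussed above), then make
# each basket a duplicate-free character vector of item labels
baskets <- lapply(as(xx2, "list"), function(v) unique(as.character(v)))

basket <- as(baskets, "transactions")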

The factors seem to be the key here. I don't have 64 GB of RAM, but maybe you can try again with stringsAsFactors = FALSE. My results for a smaller test are below, and split is considerably faster when not using factors.


n <- 50000

x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""), 
                Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE),
                stringsAsFactors = T
)


x2 <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""), 
                Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE),
                stringsAsFactors = F)


splitFactor <- function() split(x$Item, x$ID)
byFactor <- function() by(x$Item, x$ID, identity)


splitNotFactor <- function() split(x2$Item, x2$ID)
byNotFactor <- function() by(x2$Item, x2$ID, identity)

a <- microbenchmark::microbenchmark(splitFactor(),
                                    byFactor(),
                                    splitNotFactor(),
                                    byNotFactor(),
                                    times = 3
)

Unit: milliseconds
             expr        min         lq       mean     median         uq        max neval  cld
    splitFactor() 51743.1633 51936.7261 52025.1205 52130.2889 52166.0990 52201.9091     3    d
       byFactor()  1963.0673  1987.7360  2030.5779  2012.4048  2064.3332  2116.2616     3  b  
 splitNotFactor()   399.7618   401.6796   412.4632   403.5973   418.8139   434.0306     3 a   
    byNotFactor()  2410.3804  2518.3651  2578.3501  2626.3499  2662.3349  2698.3199     3   c 

[Plot: Benchmarks]

splitNotFactor() should also result in an object with a much smaller memory footprint than the other functions.
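
If the data is already sitting in a data frame with a factor column, the same gain can be had by converting back to character before splitting (a small sketch):

# Drop the factor class so split() takes the fast internal C path
# instead of the R-level loop over levels
xFast <- x
xFast$Item <- as.character(xFast$Item)
xx <- split(xFast$Item, xFast$ID)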

chasemc
  • Hi @Chase Clark, thanks for providing another great insight with a detailed comparison :) split without factors is shown to have better performance! – yc.koong Jan 23 '19 at 04:05