
I am working on large data sets, for which I have written code that performs a row-by-row operation on a data frame. The process is sequential and slow, so I am trying to use parallel processing to speed it up.

Here is the code:

library(geometry)

# Data set - a
data_a   = structure(c(10.4515034409741, 15.6780890052356, 12.5581992918563, 
                       9.19067944250871, 14.4459166666667, 11.414, 17.65325, 12.468, 
                       11.273, 15.5945), .Dim = c(5L, 2L), .Dimnames = list(c("1", "2", 
                       "3", "4", "5"), c("a", "b")))

# Data set - b
data_b   = structure(c(10.4515034409741, 15.6780890052356, 12.5581992918563, 
                       9.19067944250871, 14.4459166666667, 11.3318076923077, 13.132273830156, 
                       6.16003995082975, 11.59114820435, 10.9573192090395, 11.414, 17.65325, 
                       12.468, 11.273, 15.5945, 11.5245, 12.0249, 6.3186, 13.744, 11.0921), .Dim = c(10L, 
                       2L), .Dimnames = list(c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"), c("a", 
                       "b")))


conv_hull_1    <- convhulln( data_a, options = "FA")                        # Draw Convex Hull


test = c()

# Test each row of data_b against the hull, one row at a time
for (i in 1:nrow(data_b)){

  df = c()

  # inhulln returns TRUE if the point lies inside the convex hull
  con_hull_all        <- inhulln(conv_hull_1,  matrix(data_b[i,], ncol = 2))

  # flag = 0 if the point is inside the hull, 1 if it is outside
  df$flag             <- ifelse(con_hull_all[1] == TRUE , 0 , ifelse(con_hull_all[1] == FALSE , 1, 2))

  test                <- as.data.frame(rbind(test, df))

  print(i)

}

test

Is there any way to parallelize the row-wise computation?

As you can observe, for small datasets the computational time is really low, but as soon as I increase the data size, the computation time increases drastically.

Can you provide a solution with code? Thanks in advance.

Ankit Sagar

1 Answer


You could take advantage of the points parameter of the inhulln function, which accepts a matrix with more than one row, so all points can be tested in a single call.

I've tried the code below on a 320,000-row matrix built from the original data, and it's quick.

library(geometry)
library(dplyr)
# Data set - a
data_a   = structure(c(10.4515034409741, 15.6780890052356, 12.5581992918563,
                       9.19067944250871, 14.4459166666667, 11.414, 17.65325, 12.468,
                       11.273, 15.5945),
                     .Dim = c(5L, 2L),
                     .Dimnames = list(c("1", "2", "3", "4", "5"), c("a", "b")))

# Data set - b
data_b   = structure(c(10.4515034409741, 15.6780890052356, 12.5581992918563,
                       9.19067944250871, 14.4459166666667, 11.3318076923077, 13.132273830156,
                       6.16003995082975, 11.59114820435, 10.9573192090395, 11.414, 17.65325,
                       12.468, 11.273, 15.5945, 11.5245, 12.0249, 6.3186, 13.744, 11.0921),
                     .Dim = c(10L, 2L),
                     .Dimnames = list(c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"),
                                      c("a", "b")))

conv_hull_1    <- convhulln( data_a, options = "FA")                        # Draw Convex Hull

# Make a big data_b (doubling 15 times gives roughly 320,000 rows)
for (i in 1:15) {
    data_b = rbind(data_b, data_b)
}

# Test every row of data_b against the hull in a single call
In_Or_Out <- inhulln(conv_hull_1, data_b)
result <- data.frame(data_b) %>% bind_cols(InOrOut = In_Or_Out)

I use dplyr::bind_cols to bind the in/out result to a data frame version of the original data, so you might need some changes for your specific environment.
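
If you still want explicit parallelism on top of the vectorised call (the original question asked about parallelising the row-wise work), a minimal sketch using the base parallel package is below. The worker count and chunking scheme are illustrative assumptions, not tuned or benchmarked values.

library(parallel)

n_workers <- 4                                    # assumed core count; adjust to your machine
row_idx   <- seq_len(nrow(data_b))
chunks    <- split(row_idx, cut(row_idx, n_workers, labels = FALSE))

cl <- makeCluster(n_workers)
clusterExport(cl, varlist = c("data_b", "conv_hull_1"))
clusterEvalQ(cl, library(geometry))

# Each worker tests one chunk of rows against the pre-computed hull
in_or_out_list <- parLapply(cl, chunks, function(idx) {
    inhulln(conv_hull_1, data_b[idx, , drop = FALSE])
})
stopCluster(cl)

In_Or_Out <- unlist(in_or_out_list)
result    <- data.frame(data_b) %>% bind_cols(InOrOut = In_Or_Out)

On small inputs the overhead of starting the workers can outweigh any gain, so this is mainly worth trying when a single inhulln call over all rows already takes a long time.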

Andrew Chisholm
  • I am well aware of this process. I asked this question because as soon as I increase the dimensionality of the data (10 columns instead of 2), the process is too slow. I wanted to test the potential of the inhulln function using parallel computation in my way. @andrew – Ankit Sagar Mar 05 '21 at 14:24
  • Please can you post some representative data that allows the slowdown to be observed? Without this, responders will make assumptions about the dimensions. I assumed the number of rows; your new information reveals it's the number of columns that is important. – Andrew Chisholm Mar 05 '21 at 14:55
  • The inhulln function works fast for 5 columns and 6 million rows (computation time is ~2 minutes). As soon as I increase to 10 columns and 6 million rows, the computation time is too high (~3-4 days; my code has been running for the last 3 days). @andrew – Ankit Sagar Mar 05 '21 at 15:21
  • How long does 10 columns and 100,000 rows take? – Andrew Chisholm Mar 05 '21 at 15:38
  • It is also long (~3 hours). I think it has something to do with the number of columns. I don't know how to get around this. @andrew – Ankit Sagar Mar 05 '21 at 17:35
  • How many CPUs are used during these 3 hours to process one chunk of 100,000? If it's all of them, then the best you will ever get is 3 hours x 60 chunks = 180 hours. If 1/6 of your CPUs are being used, this might reduce to 30 hours. Of course, these are back-of-the-envelope calculations. It's worth thinking about this first before embarking on a solution that may never give a good result. – Andrew Chisholm Mar 05 '21 at 21:19