I recently switched to an AWS EC2 instance running RStudio on Ubuntu for my data analytics, since it should give me more computing power than my MacBook. I am looking for ways to make better use of that computing power (even on the t2.micro instance, which is in the free tier).
Goal: an exhaustive search for the optimal parameter combination. For example, I have four parameters, Var1-Var4, plus a field for groups. Say I have 100 groups; then I fit a model for each group, and each group's optimal model can take a different combination of Var1-Var4 values. Each of Var1-Var4 can take 10 possible values, so in theory I have to perform the same set of subsetting/calculations for 100 * 10 * 10 * 10 * 10 = 1,000,000 scenarios.
Here is the outline of my script:
library(doParallel)
library(foreach)
# register a parallel backend; without this, %dopar% falls back to running sequentially
registerDoParallel(cores = parallel::detectCores())
group <- 1:100
Var1 <- 1:10
Var2 <- 1:10
Var3 <- 1:10
Var4 <- 1:10
# full grid of all scenario combinations
TestScript <- expand.grid(group = group, Var1 = Var1, Var2 = Var2, Var3 = Var3, Var4 = Var4)
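As a quick sanity check, the grid should have exactly the 1,000,000 rows I expect:

nrow(TestScript)   # 100 * 10 * 10 * 10 * 10 = 1,000,000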
Then I divide the testing scenarios into smaller lists so the apply functions can work on them:
Test <- list()
n <- 50
Test$Var1 <- with(TestScript, split(Var1, ceiling(seq_along(Var1)/n)))
Test$Var2 <- with(TestScript, split(Var2, ceiling(seq_along(Var2)/n)))
Test$Var3 <- with(TestScript, split(Var3, ceiling(seq_along(Var3)/n)))
Test$Var4 <- with(TestScript, split(Var4, ceiling(seq_along(Var4)/n)))
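A possible variation I am considering (a sketch, not what I currently run): split the row indices of TestScript instead of each column separately, so that group and Var1-Var4 stay aligned within every chunk.

chunks <- split(seq_len(nrow(TestScript)), ceiling(seq_len(nrow(TestScript)) / n))
length(chunks)   # 1,000,000 / 50 = 20,000 chunks of row indices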
Using the apply function inside the doParallel loop:
result.table <- foreach(i = seq_along(Test$Var1), .combine = "rbind", .inorder = FALSE) %dopar% {
  # --- subset the dataset and run the nested apply functions for chunk i ---
}
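To make the question concrete, here is a runnable toy version of the loop, using the row-index chunks from above; dat and the calculation inside apply() are placeholders standing in for my real 50k x 20 dataset and model, not the actual code:

# fake data and a dummy calculation, just to illustrate the pattern
dat <- data.frame(group = sample(1:100, 50000, replace = TRUE),
                  x = rnorm(50000))
chunks <- split(seq_len(nrow(TestScript)), ceiling(seq_len(nrow(TestScript)) / n))

# run only the first 20 chunks here, just to test the pattern quickly
result.toy <- foreach(idx = chunks[1:20], .combine = "rbind", .inorder = FALSE) %dopar% {
  scenario <- TestScript[idx, ]
  # dummy score standing in for the real subsetting / nested apply work
  scenario$score <- apply(scenario, 1, function(s) {
    sub <- dat[dat$group == s["group"], ]
    sum(sub$x) * s["Var1"] / s["Var2"]
  })
  scenario
}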
I understand that the doParallel package uses the available CPU threads (cores) on the machine. However, within each worker, is there any way I can make it work harder?
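For reference, the number of workers that %dopar% will actually use can be checked with foreach's getDoParWorkers():

getDoParWorkers()   # workers registered for %dopar%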
The CPU utilization shown in the EC2 monitoring dashboard is only around 10% (versus the 80%+ I expected). I am not sure whether it is related to I/O speed. I don't think it is a memory issue, since my dataset is only about 3 MB (50k rows x 20 columns). Is there any way I can check where the bottleneck is?
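One thing I could do (a rough sketch using base R's system.time() and Rprof(); the bodies below are just placeholders for my real per-chunk work) is to time and profile a single chunk outside of foreach, to see whether the time is going into computation or elsewhere:

system.time({
  # run the subsetting / nested apply work for one chunk here, sequentially
})

Rprof("profile.out")
# ... run a few chunks sequentially ...
Rprof(NULL)
summaryRprof("profile.out")   # shows where the time is spent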