
I recently switched to AWS EC2, running RStudio on Ubuntu, for my data analytics. I believe it will give me more computing power than my MacBook. I am looking for ways to make better use of that computing power (even on the t2.micro instance, which is in the free tier).

Goal - an exhaustive search for the optimal parameter combination. For example, I have four parameters (Var1-Var4) plus a field for groups. Say I have 100 groups; then I fit a model for each group, and each model takes different Var1-Var4 values as its optimum. Each of Var1-Var4 can take 10 possible values, so in theory I have to perform the same set of subsetting/calculations for 100 * 10 * 10 * 10 * 10 = 1,000,000 scenarios.

Here is the outline of my script:

library(doParallel)
library(foreach)

# register a backend first; otherwise %dopar% runs sequentially with a warning
registerDoParallel(cores = detectCores())

# plain integer vectors are enough; expand.grid() builds the full scenario grid
group <- 1:100
Var1 <- 1:10
Var2 <- 1:10
Var3 <- 1:10
Var4 <- 1:10
TestScript <- expand.grid(group = group, Var1 = Var1, Var2 = Var2,
                          Var3 = Var3, Var4 = Var4)
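
As a sanity check, the grid size matches the arithmetic above:

nrow(TestScript)   # 1,000,000 = 100 * 10 * 10 * 10 * 10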

Next, I divide the test scenarios into small chunks for the apply functions to work on:

Test <- list()
n <- 50  # scenarios per chunk
Test$Var1 <- with(TestScript, split(Var1, ceiling(seq_along(Var1)/n)))
Test$Var2 <- with(TestScript, split(Var2, ceiling(seq_along(Var2)/n)))
Test$Var3 <- with(TestScript, split(Var3, ceiling(seq_along(Var3)/n)))
Test$Var4 <- with(TestScript, split(Var4, ceiling(seq_along(Var4)/n)))
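
Since all four columns are split by row position, chunk k of Var1 stays aligned with chunk k of Var2-Var4. A quick check on the chunking:

length(Test$Var1)        # 20,000 chunks
length(Test$Var1[[1]])   # 50 scenarios per chunk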

Using the apply function inside the doParallel loop:

result.table <- foreach(i = seq_along(Test$Var1), .combine = "rbind", .inorder = FALSE) %dopar% {
    # --- subsetting the dataset and using nested apply functions ---
}
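
For context, here is a hypothetical sketch of what such a loop body could look like; `dat` (my 50k-row dataset) and `fit_model()` are placeholder names for illustration, not my actual code:

result.table <- foreach(i = seq_along(Test$Var1), .combine = "rbind",
                        .inorder = FALSE) %dopar% {
    # rebuild the 50 parameter combinations in chunk i
    chunk <- data.frame(Var1 = Test$Var1[[i]], Var2 = Test$Var2[[i]],
                        Var3 = Test$Var3[[i]], Var4 = Test$Var4[[i]])
    # score each combination: subset the data, then fit and evaluate
    scores <- apply(chunk, 1, function(p) {
        sub <- dat[dat$Var1 <= p["Var1"], ]  # placeholder subsetting rule
        fit_model(sub, p)                    # placeholder scoring function
    })
    cbind(chunk, score = scores)
}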

I understand that the doParallel package spreads work across the machine's CPU threads, one worker per registered core. However, within each thread, is there any way I can make it work harder?
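
My understanding of the doParallel API (nothing here is specific to my script) is that the worker count is fixed when the backend is registered, and each worker is a separate single-threaded R process, so a 1-vCPU t2.micro can only ever keep one worker busy:

cl <- makeCluster(2)   # one single-threaded R process per worker
registerDoParallel(cl)
getDoParWorkers()      # confirms the number of registered workers
stopCluster(cl)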

The CPU utilization shown in the EC2 monitoring dashboard is around 10% (compared with the 80%+ I expected). I am not sure whether it is related to I/O speed. I don't think it is a memory issue, since my dataset is only about 3 MB (50k rows x 20 columns). Is there any way I can check where the bottleneck is?
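
The only check I have thought of so far (a sketch with base-R tools; the commented lines stand in for the real loop body above) is to time and profile a single chunk serially:

# time one chunk without %dopar%: if 50 scenarios finish almost instantly,
# the workers are starved for work rather than compute-bound
system.time({
    # ... run the loop body for i = 1 serially ...
})

# Rprof() samples the call stack and shows where the time actually goes
Rprof("profile.out")
# ... run a few chunks serially ...
Rprof(NULL)
summaryRprof("profile.out")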

  • You can change the number of cores and register different backends: https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf. There are also `doSnow` and `doMC` packages that may be worth checking out – Jonny Phelps Jun 05 '18 at 07:46
  • Are you sure that is not an artifact of how much compute power Amazon is giving you for free? – Ralf Stubner Jun 05 '18 at 08:36
  • Yes, I understand there is a limitation on vCPU usage for the t2.micro instance. I actually made a few changes to my script trying to fully utilize the CPU. I just wonder: if I change n to a larger number (more calculations performed in each apply call), will it make the CPU work harder? – Bosco Lam Jun 05 '18 at 08:45
  • @BoscoLam the CPU burns through as much work as you're giving it, as fast as it can. If you can't keep it busy, that is the only time it will be < 100%. Use `top` and observe the percent totals. %wa is time spent blocking on I/O waits, and a high %st indicates that your t2 is out of CPU credits. – Michael - sqlbot Jun 05 '18 at 10:14

0 Answers