Akka - worse performance with more actors

Question

I'm trying out some parallel programming with Scala and Akka, which I'm new to. I've got a pretty simple Monte Carlo Pi application (it approximates pi by counting random points that fall inside a circle) which I've built in several languages. However, the performance of the version I've built in Akka is puzzling me.

I have a sequential version written in pure Scala that tends to take roughly 400ms to complete.

In comparison, with 1 worker actor the Akka version takes around 300-350ms. However, as I increase the number of actors, that time increases dramatically. With 4 actors the time can be anywhere from 500ms up to 1200ms or higher.

The iterations are divided up between the worker actors, so ideally performance should improve as more of them are added. Currently it gets significantly worse.

My code is

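import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool
import java.io.FileWriter

object MCpi {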
  //Declare initial values
  val numWorkers = 2
  val numIterations = 10000000

  //Declare messages that will be sent to actors
  sealed trait PiMessage
  case object Calculate extends PiMessage
  case class Work(iterations: Int) extends PiMessage
  case class Result(value: Int) extends PiMessage
  case class PiApprox(pi: Double, duration: Double)

  //Main method
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("MCpi_System") //Create Akka system
    val master = system.actorOf(Props(new MCpi_Master(numWorkers, numIterations))) //Create Master Actor
    println("Starting Master")

    master ! Calculate //Run calculation
  }
}

//Master
class MCpi_Master(numWorkers: Int, numIterations: Int) extends Actor{

  var pi: Double = _ // Store pi
  var quadSum: Int = _ //the total number of points inside the quadrant
  var numResults: Int = _ //number of results returned
  val startTime: Double = System.currentTimeMillis() //calculation start time

  //Create a group of worker actors
  val workerRouter = context.actorOf(
    Props[MCpi_Worker].withRouter(RoundRobinPool(numWorkers)), name = "workerRouter")
  val listener = context.actorOf(Props[MCpi_Listener], name = "listener")

  def receive = {
    //Tell workers to start the calculation
    //For each worker a message is sent with the number of iterations it is to perform,
    //iterations are split up between the number of workers.
    case Calculate =>
      for (i <- 0 until numWorkers) workerRouter ! Work(numIterations / numWorkers)

    //Receive the results from the workers
    case Result(value) =>
      //Add up the total number of points in the circle from each worker
      quadSum += value
      //Total up the number of results which have been received, this should be 1 for each worker
      numResults += 1

      if (numResults == numWorkers) { //Once all results have been collected
        //Calculate pi
        pi = (4.0 * quadSum) / numIterations
        //Send the results to the listener to output
        listener ! PiApprox(pi, duration = System.currentTimeMillis - startTime)
        context.stop(self)
      }
  }
}
//Worker
class MCpi_Worker extends Actor {
  //Performs the calculation
  def calculatePi(iterations: Int): Int = {

    val r = scala.util.Random // Create random number generator
    var inQuadrant: Int = 0 //Store number of points within circle

    for(i <- 0 until iterations){ //"until" rather than "to", so exactly "iterations" points are generated
      //Generate random point
      val X = r.nextFloat()
      val Y = r.nextFloat()

      //Determine whether or not the point is within the circle
      if(((X * X) + (Y * Y)) < 1.0)
        inQuadrant += 1
    }
    inQuadrant //return the number of points within the circle
  }

  def receive = {
    //Starts the calculation then returns the result
    case Work(iterations) => sender ! Result(calculatePi(iterations))
  }
}

//Listener
class MCpi_Listener extends Actor { //Receives and prints the final result
  def receive = {
    case PiApprox(pi, duration) =>
      //Print the results
      println("\n\tPi approximation: \t\t%s\n\tCalculation time: \t%s".format(pi, duration))
      //Print to a CSV file
      val pw: FileWriter = new FileWriter("../../../..//Results/Scala_Results.csv", true)
      pw.append(duration.toString())
      pw.append("\n")
      pw.close()
      context.system.terminate()
  }
}

The plain Scala sequential version is

import java.io.FileWriter

object MCpi {
    def main(args: Array[String]): Unit = {
        //Define the number of iterations to perform
        val iterations = args(0).toInt
        val resultsPath = args(1)

        //Get the current time
        val start = System.currentTimeMillis()


        // Create random number generator
        val r = scala.util.Random
        //Store number of points within circle
        var inQuadrant: Int = 0

        for(i <- 0 until iterations){ //"until" rather than "to", so exactly "iterations" points are generated
            //Generate random point
            val X = r.nextFloat()
            val Y = r.nextFloat()

            //Determine whether or not the point is within the circle
            if(((X * X) + (Y * Y)) < 1.0)
                inQuadrant += 1
        }
        //Calculate pi
        val pi = (4.0 * inQuadrant) / iterations
        //Get the total time
        val time = System.currentTimeMillis() - start
        //Output values
        println("Number of Iterations: " + iterations)
        println("Pi has been calculated as: " + pi)
        println("Total time taken: " + time + " (Milliseconds)")

        //Print to a CSV file
        val pw: FileWriter = new FileWriter(resultsPath + "/Scala_Results.csv", true)
        pw.append(time.toString())
        pw.append("\n")
        pw.close()
    }
}

Any suggestions as to why this is happening or how I can improve performance would be very welcome.

Edit: I'd like to thank all of you for your answers. This is my first question on this site and all the answers are extremely helpful. I have plenty to look into now :)

In this case some information about the kind of processor(s) you are running this on is probably helpful. — Jasper-M, Jan 19 '17 at 15:05
1) Please post your code on SO. Format it before posting. 2) What do you even expect from actor implementation when you're executing `calculatePi` method multiple times which is from what I can see, an equivalent to your sequential implementation? And from what I see, you're just calculating PI multiple times (number of calculations is equivalent to the number of worker actors which is probably the explanation for slowdown)? Correct me if I'm wrong. 3) Did you consider that you might not gain anything by using an actor model in this case? — Branislav Lazic, Jan 19 '17 at 15:07
@Jasper-M Processor is an Intel i7-4510U quad core @ 3.1GHz @Branislav 1) Okay, I'll try update the post with the code when I'm free later. 2) `calculatePi` is run by each worker, it generates many random points and measures whether those points are within a "circle" of a particular size (in this case 1.0), then returns how many points were in the circle (quadSum), once the results are back from each worker the calculation is done once, to work out what Pi is (in the master actor). 3) I assumed that I'd get some sort of a performance increase splitting the work over multiple actors. — Cipher478, Jan 19 '17 at 15:25
This is a CPU bound task and actors share a thread pool so adding more actors without configuring the pool to host more threads will decrease performance. Keep in mind that the actor pattern is a tool for concurrent communication not for parallel computing — Mustafa Simav, Jan 20 '17 at 10:49
@MustafaSimav still 2 actors instead of 1 on a quadcore are likely to show a speed-up. actors can be used for parallel computing *as well as* concurrent communication. — Stefano Bonetti, Jan 20 '17 at 11:53

Stefano Bonetti · Accepted Answer · score 8 · 2017-01-20T10:09:55.327

You have a synchronisation issue around the Random instance you're using.

More specifically, this line

val r = scala.util.Random // Create random number generator

actually doesn't "create a random number generator", but picks up the singleton object that scala.util conveniently offers you. This means that all threads share one generator and contend on its seed, which java.util.Random updates atomically on every call (see the code of java.util.Random.nextFloat, and of the next method it calls, for more info).

Simply by changing that line to

val r = new scala.util.Random // Create random number generator

you should get some parallelisation speed-up. As stated in the comments, the speed-up will depend on your architecture and other factors, but at least it will no longer be dominated by contention on the shared generator.

Note that java.util.Random derives the seed of a newly created instance from System.nanoTime (in OpenJDK, mixed with a value that changes on every construction), so you need not worry about randomisation issues.
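
To see the effect in isolation, here is a minimal sketch that takes Akka out of the picture and runs the same loop on plain Scala Futures. That swap is an assumption made only to keep the example dependency-free, and the names RandomContentionSketch, countInQuadrant, parallelCount and timed are made up for illustration. It runs the work once with the shared scala.util.Random singleton and once with a fresh instance per task. The timings are rough (no JIT warm-up, single run), so treat them as indicative only.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Random

object RandomContentionSketch {
  // Counts random points that fall inside the unit quadrant, using the supplied generator.
  def countInQuadrant(iterations: Int, r: Random): Int = {
    var inQuadrant = 0
    var i = 0
    while (i < iterations) {
      val x = r.nextFloat()
      val y = r.nextFloat()
      if (x * x + y * y < 1.0f) inQuadrant += 1
      i += 1
    }
    inQuadrant
  }

  // Splits the work across `workers` tasks; `mkRandom` decides which generator each task uses.
  def parallelCount(total: Int, workers: Int, mkRandom: () => Random): Int = {
    val perWorker = total / workers // any remainder is dropped, as in the question's code
    val tasks = (1 to workers).map(_ => Future(countInQuadrant(perWorker, mkRandom())))
    Await.result(Future.sequence(tasks), 5.minutes).sum
  }

  // Returns the result of `body` together with the elapsed time in milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000)
  }

  def main(args: Array[String]): Unit = {
    val total = 10000000
    val workers = 4
    // `Random` with no `new` is the shared singleton object from scala.util
    val (shared, sharedMs) = timed(parallelCount(total, workers, () => Random))
    // `new Random` gives every task its own generator
    val (own, ownMs) = timed(parallelCount(total, workers, () => new Random))
    println(s"shared singleton:  pi ~ ${4.0 * shared / total} in $sharedMs ms")
    println(s"instance per task: pi ~ ${4.0 * own / total} in $ownMs ms")
  }
}
```

On a multi-core machine the second line would be expected to report the shorter time, although the exact gap depends on the hardware and the JVM.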

edited Jan 20 '17 at 10:09

answered Jan 20 '17 at 09:29

Stefano Bonetti


Thank you, I wasn't aware of this, that's really helpful to know. – Cipher478 Jan 21 '17 at 12:17
Wasn't aware either, I found out while investigating your example :) good learning! Were you able to see the speed-up? – Stefano Bonetti Jan 21 '17 at 12:47
Just got the chance to implement your suggestion and wow, dramatic speed up. Results are now much more in line with what I had expected. Thank you very much. – Cipher478 Jan 21 '17 at 15:09

score 4 · Answer 2 · answered Jan 21 '17 at 00:17

I think it's a great question worth digging into. Since the Akka actor system comes with some overhead of its own, I would expect a performance gain to show up only when the workload is large enough. I test-ran your two versions (non-Akka vs Akka) with minimal code change. At 1 million or 10 million hits, as expected, there is hardly any performance difference between Akka and non-Akka, or between different numbers of workers. But at 100 million hits, you can see a consistent performance difference.

Besides scaling up the total hits to 100 million, the only code change I made was replacing scala.util.Random with java.util.concurrent.ThreadLocalRandom:

//val r = scala.util.Random // Create random number generator
def r = ThreadLocalRandom.current
...
  //Generate random point
  //val X = r.nextFloat()
  //val Y = r.nextFloat()
  val X = r.nextDouble(0.0, 1.0)
  val Y = r.nextDouble(0.0, 1.0)

This was all done on an old MacBook Pro with a 2GHz quadcore CPU and 8GB of memory. Here are the test-run results at 100 million total hits:

Non-Akka app takes ~1720 ms
Akka app with 2 workers takes ~770 ms
Akka app with 4 workers takes ~430 ms
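
For anyone who wants to try the ThreadLocalRandom change without setting up Akka, the following is a rough, self-contained sketch of the same loop. It uses plain Scala Futures in place of worker actors, which is an assumption made purely to keep it dependency-free, and the names ThreadLocalPiSketch, countInQuadrant and estimatePi are hypothetical. The one detail that matters is that ThreadLocalRandom.current() is looked up on the thread that actually runs the loop, not stored in a field shared between threads.

```scala
import java.util.concurrent.ThreadLocalRandom
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ThreadLocalPiSketch {
  // Counts points inside the unit quadrant using the calling thread's own generator.
  def countInQuadrant(iterations: Int): Int = {
    val r = ThreadLocalRandom.current() // resolved on the thread executing this method
    var inQuadrant = 0
    var i = 0
    while (i < iterations) {
      val x = r.nextDouble(0.0, 1.0)
      val y = r.nextDouble(0.0, 1.0)
      if (x * x + y * y < 1.0) inQuadrant += 1
      i += 1
    }
    inQuadrant
  }

  // Splits `total` points across `workers` tasks and turns the hit count into a pi estimate.
  def estimatePi(total: Int, workers: Int): Double = {
    val perWorker = total / workers
    val tasks = (1 to workers).map(_ => Future(countInQuadrant(perWorker)))
    val hits = Await.result(Future.sequence(tasks), 5.minutes).sum
    // Divide by the number of points actually sampled, in case `total` is not a multiple of `workers`
    4.0 * hits / (perWorker.toLong * workers)
  }

  def main(args: Array[String]): Unit = {
    val start = System.nanoTime()
    val pi = estimatePi(total = 100000000, workers = 4)
    val elapsedMs = (System.nanoTime() - start) / 1000000
    println(s"pi ~ $pi in $elapsedMs ms")
  }
}
```

Because each task obtains its generator inside countInQuadrant, no generator state is shared between the pool threads.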

Individual test-runs below ...

Non-Akka