
This is my processor: a 2.3 GHz Intel Core i7 (so 4 cores with hyperthreading) on macOS Sierra.

This is my program:

package nn

import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.factory.Nd4j.randn
import org.nd4j.linalg.ops.transforms.Transforms._
import org.nd4s.Implicits._

object PerfTest extends App {

  // Network layout: 784 inputs, one hidden layer of 30 units, 10 outputs
  val topology = List(784, 30, 10)

  // One bias column vector per non-input layer
  val biases: List[INDArray] =
    topology.tail.map(size => randn(size, 1))

  // One (next layer x current layer) weight matrix per layer transition
  val weights: List[INDArray] =
    topology.sliding(2).map(t => randn(t(1), t.head)).toList

  // Push 100,000 random inputs through the network (forward pass only)
  (1 to 100000).foreach { i =>
    val x = randn(784, 1)
    biases.zip(weights).foldLeft(List(x)) {
      case (as, (b, w)) =>
        val z = (w dot as.last) + b // weighted input of the current layer
        val a = sigmoid(z)          // activation of the current layer
        as :+ a
    }
  }
}

When I run the above program with the default threading (which for nd4j on this processor is 4 threads), it takes around 28 seconds.

When I run it with a single thread (`export OMP_NUM_THREADS=1`), it takes 18 seconds.
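
The numbers above are wall-clock timings of the loop; for reference, here is a minimal way such a measurement can be taken (the timing wrapper below is only a sketch and not part of the program above):

// Sketch: wall-clock timing around the loop in PerfTest
// (drop this in place of the bare loop; biases, weights, randn, sigmoid as above).
val start = System.nanoTime()

(1 to 100000).foreach { _ =>
  val x = randn(784, 1)
  biases.zip(weights).foldLeft(List(x)) {
    case (as, (b, w)) => as :+ sigmoid((w dot as.last) + b)
  }
}

println(f"forward passes took ${(System.nanoTime() - start) / 1e9}%.1f s")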

Any idea why this is? Thank you.

botkop
  • Just a guess - maybe your data is (relatively) small, so there is no benefit from parallelization (as I understand it, automatic concurrency applies only to operations on vectors/matrices/tensors like dot product, sum). – dk14 Jun 07 '17 at 10:36
  • This is just an extract of a program which trains a neural network of 784x30x10 on the MNIST data set with 50,000 vectors of 784 elements. That should be enough to feed 4 threads, I think. Training a single epoch on 4 threads takes 48 seconds, and on 1 thread 22 seconds. – botkop Jun 07 '17 at 10:44
  • @botkop I just did - thanks. Removed my redundant comment. – diginoise Jun 07 '17 at 12:41
  • Did you try setting `OMP_NUM_THREADS=2`? – diginoise Jun 07 '17 at 12:43
  • @diginoise 21 seconds. One thread is still faster. – botkop Jun 07 '17 at 12:58

1 Answer


I wasn't able to find an explanation for this, so I migrated to Breeze instead, which is about 6 times faster, without much ado.
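
For reference, a minimal sketch of what an equivalent forward-pass loop could look like in Breeze (illustrative only; the object name PerfTestBreeze and the use of Rand.gaussian are choices made for this sketch, not necessarily the migrated code):

package nn

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.sigmoid
import breeze.stats.distributions.Rand

object PerfTestBreeze extends App {

  val topology = List(784, 30, 10)

  // One bias vector per non-input layer
  val biases: List[DenseVector[Double]] =
    topology.tail.map(size => DenseVector.rand(size, Rand.gaussian))

  // One (next layer x current layer) weight matrix per layer transition
  val weights: List[DenseMatrix[Double]] =
    topology.sliding(2).map(t => DenseMatrix.rand(t(1), t.head, Rand.gaussian)).toList

  (1 to 100000).foreach { _ =>
    val x = DenseVector.rand(784, Rand.gaussian)
    biases.zip(weights).foldLeft(List(x)) {
      case (as, (b, w)) =>
        val z = (w * as.last) + b // matrix-vector product plus bias
        as :+ sigmoid(z)
    }
  }
}

The structure maps one-to-one: randn becomes DenseVector.rand / DenseMatrix.rand with a Gaussian, and dot becomes *.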

botkop