14

I've compared the Scala version

(BigInt(1) to BigInt(50000)).reduce(_ * _)

to the Python version

reduce(lambda x, y: x * y, range(1, 50001))

and it turns out that the Scala version takes about 10 times longer than the Python version.

I'm guessing a big difference is that Python can use its native long type instead of creating a new BigInt object for each number. But is there a workaround in Scala?

PSchwede
  • How long is the Scala version taking? It's ~7 seconds on my machine – Pablo Fernandez Oct 23 '11 at 01:38
  • I mean, I wrote this in plain Java and it takes about 6 seconds. According to your statement, Python should be an order of magnitude faster than Java? – Pablo Fernandez Oct 23 '11 at 01:47
  • I measured around 23 s by running the Scala version in sbt, and 2.8 s by running the Python version in the REPL using time.time() differences. I may well have made a mistake somewhere, but the difference is obvious. – PSchwede Oct 23 '11 at 08:36

4 Answers

16

The fact that your Scala code creates 50,000 BigInt objects is unlikely to be making much of a difference here. A bigger issue is the multiplication algorithm—Python's long uses Karatsuba multiplication and Java's BigInteger (which BigInt just wraps) doesn't.
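
For what it's worth, the wrapping is easy to see: BigInt exposes the java.math.BigInteger it wraps, and _ * _ bottoms out in BigInteger.multiply. A minimal illustration:

import java.math.BigInteger

// BigInt is a thin wrapper: its arithmetic delegates to the wrapped
// java.math.BigInteger, so BigInteger's multiplication algorithm is
// what actually determines the speed here
val wrapped: BigInteger = BigInt(12345).bigInteger
val product = wrapped.multiply(wrapped) // the code path behind BigInt's _ * _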

The easiest workaround is probably to switch to a better arbitrary precision math library, like JScience's:

import org.jscience.mathematics.number.LargeInteger

(1 to 50000).foldLeft(LargeInteger.ONE)(_ times _)

This is faster than the Python solution on my machine.


Update: I've written some quick benchmarking code using Caliper in response to Luigi Plinge's answer, which gives the following results on my (quad core) machine:

              benchmark   ms linear runtime
         BigIntFoldLeft 4774 ==============================
             BigIntFold 4739 =============================
           BigIntReduce 4769 =============================
      BigIntFoldLeftPar 4642 =============================
          BigIntFoldPar  500 ===
        BigIntReducePar  499 ===
   LargeIntegerFoldLeft 3042 ===================
       LargeIntegerFold 3003 ==================
     LargeIntegerReduce 3018 ==================
LargeIntegerFoldLeftPar 3038 ===================
    LargeIntegerFoldPar  246 =
  LargeIntegerReducePar  260 =

I don't see the difference between reduce and fold that he does, but the moral is clear: if you can use Scala 2.9's parallel collections, they'll give you a huge improvement, but switching to LargeInteger helps as well.
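
For reference, a Caliper benchmark for this looks roughly like the sketch below (written against the old Caliper SimpleBenchmark API; the method bodies here are illustrative, not my exact code):

import com.google.caliper.SimpleBenchmark

// Caliper calls each timeXxx method with a repetition count; returning the
// last result keeps the JIT from eliminating the computation as dead code
class FactorialBenchmark extends SimpleBenchmark {
  def timeBigIntReduce(reps: Int): Any = {
    var result: BigInt = null
    for (_ <- 1 to reps) result = (BigInt(1) to BigInt(50000)).reduce(_ * _)
    result
  }

  def timeBigIntReducePar(reps: Int): Any = {
    var result: BigInt = null
    for (_ <- 1 to reps) result = (BigInt(1) to BigInt(50000)).par.reduce(_ * _)
    result
  }
}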

Travis Brown
  • Hmm. How can `LargeIntegerReduce` take 11 times longer than `LargeIntegerReducePar` on a quad core? I mean, sure, scaling a bit better than linear is totally possible in practice with cache effects and whatnot, but getting a speedup of 11.6 on 4 cores seems fishy - or am I missing something? – Voo Oct 23 '11 at 17:48
  • @Voo: It seemed odd to me as well, but it (at least conceivably) makes sense that we'd see better than linear scaling, since we're multiplying fewer enormous numbers by splitting the sequence, taking the product of the subsequences, and multiplying the results. – Travis Brown Oct 23 '11 at 18:00
  • That could be true; it's still a tremendous improvement, but your benchmarking code seems fine too. The last segment (assuming a simple split) would be (50k*3/4)! smaller, which is a gigantic number in its own right. So that's the best working hypothesis I can come up with too. Assuming that's true also opens an avenue for single-threaded improvement - interesting idea ;-) – Voo Oct 23 '11 at 18:10
  • @Voo: Yes, even something as simple as `(BigInt(1) to BigInt(50000)).grouped(100).map(_.product).grouped(100).map(_.product).product` gets you close to the performance of parallel collections (550 ms, in this case). – Travis Brown Oct 23 '11 at 18:31
  • @Travis Interesting... I used this to knock off another 40% using `zipAll`, which is faster than `grouped`. See my answer. – Luigi Plinge Oct 24 '11 at 23:44
9

Python on my machine:

import time

def func():
    start = time.clock()
    reduce(lambda x, y: x * y, range(1, 50001))  # Python 2: reduce is a builtin
    end = time.clock()
    t = (end - start) * 1000
    print t

gives 1219 ms

Scala:

def timed[T](f: => T) = {
  val t0 = System.currentTimeMillis
  val r = f
  val t1 = System.currentTimeMillis
  println("Took: "+(t1 - t0)+" ms")
  r
}

timed { (BigInt(1) to BigInt(50000)).reduce(_ * _) }
4251 ms

timed { (BigInt(1) to BigInt(50000)).fold(BigInt(1))(_ * _) }
4224 ms

timed { (BigInt(1) to BigInt(50000)).par.reduce(_ * _) }
2083 ms

timed { (BigInt(1) to BigInt(50000)).par.fold(BigInt(1))(_ * _) }
689 ms

// using org.jscience.mathematics.number.LargeInteger from Travis's answer
timed { val a = (1 to 50000).foldLeft(LargeInteger.ONE)(_ times _) }
3327 ms

timed { val a = (1 to 50000).map(LargeInteger.valueOf(_)).par
                            .fold(LargeInteger.ONE)(_ times _) }
361 ms

The 689 ms and 361 ms figures were after a few warmup runs. Both started at about 1000 ms, but they seem to warm up by different amounts: the parallel versions improved significantly, while the non-parallel operations stayed close to their first-run times.
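
(A minimal warmup sketch, using the timed helper above: run the expression a few times before measuring, since HotSpot only compiles the hot paths after repeated execution.)

// discard the first few runs so the JIT has compiled the hot paths
for (_ <- 1 to 5) (BigInt(1) to BigInt(50000)).par.fold(BigInt(1))(_ * _)
timed { (BigInt(1) to BigInt(50000)).par.fold(BigInt(1))(_ * _) }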

Using .par (i.e. parallel collections) seemed to speed up fold more than reduce. I only have 2 cores; a machine with more cores should see a bigger performance gain.

So, experimentally, the way to optimize this function is:

a) Use fold rather than reduce

b) Use parallel collections

Update: Inspired by the observation that breaking the calculation into smaller chunks speeds things up, I managed to get the following to run in 215 ms on my machine, a 40% improvement on the standard parallelized algorithm. (Using BigInt, it takes 615 ms.) Also, it doesn't use parallel collections, yet somehow uses 90% CPU (unlike the BigInt version).

import org.jscience.mathematics.number.LargeInteger

def fact(n: Int) = {
  // repeatedly halve the sequence, multiplying elements pairwise, so the
  // operands stay balanced in size instead of one huge accumulator being
  // multiplied by one small number at a time
  def loop(seq: Seq[LargeInteger]): LargeInteger = seq.length match {
    case 0 => throw new IllegalArgumentException
    case 1 => seq.head
    case _ => loop {
      val (a, b) = seq.splitAt(seq.length / 2)
      a.zipAll(b, LargeInteger.ONE, LargeInteger.ONE).map(i => i._1 times i._2)
    }
  }
  loop((1 to n).map(LargeInteger.valueOf(_)).toIndexedSeq)
}
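
A quick sanity check and timing, using the timed helper above:

assert(fact(5) == LargeInteger.valueOf(120)) // 5! = 120
timed { fact(50000) }                        // ~215 ms on my machine, per above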
Luigi Plinge
  • What about that suggests that Scala's syntax is misleading? – Ricky Clarkson Oct 23 '11 at 09:00
  • @PeterSchwede, I don't think it's misleading; it just shows that Java's BigInteger class is a bit slow compared to Python's, and that there are faster algorithms for computing the factorial than the obvious one. The fact that Scala allows you to tune your code so you eventually get something 6 times faster than Python should be viewed as positive. Now if there were a built-in factorial function that were dog-slow, that would be a cause for concern. – Luigi Plinge Oct 29 '11 at 02:13
1

Another trick here could be to try both reduceLeft and reduceRight to see which is fastest. On your example I get a much faster execution with reduceRight:

scala> timed { (BigInt(1) to BigInt(50000)).reduceLeft(_ * _) }
Took: 4605 ms

scala> timed { (BigInt(1) to BigInt(50000)).reduceRight(_ * _) }
Took: 2004 ms

I see the same difference between foldLeft and foldRight. I guess it matters which side of the tree you start reducing from :)
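
To visualize the difference in association, here's a small sketch using strings instead of numbers to print the grouping each variant produces:

// reduceLeft folds from the left, reduceRight from the right
(1 to 5).map(_.toString).reduceLeft((a, b) => "(" + a + "*" + b + ")")
// ((((1*2)*3)*4)*5)
(1 to 5).map(_.toString).reduceRight((a, b) => "(" + a + "*" + b + ")")
// (1*(2*(3*(4*5))))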

eivindw
0

The most efficient way to calculate a factorial in Scala is to use a divide-and-conquer strategy:

def fact(n: Int): BigInt = rangeProduct(1, n)

// computes n1 * (n1 + 1) * ... * n2 by splitting the range in half, so the
// intermediate products stay balanced in size; ranges of up to 4 numbers
// are multiplied with plain Long arithmetic before boxing into BigInt
private def rangeProduct(n1: Long, n2: Long): BigInt = n2 - n1 match {
  case 0 => BigInt(n1)
  case 1 => BigInt(n1 * n2)
  case 2 => BigInt(n1 * (n1 + 1)) * n2
  case 3 => BigInt(n1 * (n1 + 1)) * ((n2 - 1) * n2)
  case _ =>
    val nm = (n1 + n2) >> 1
    rangeProduct(n1, nm) * rangeProduct(nm + 1, n2)
}
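
A quick sanity check against small factorials:

assert(fact(5) == BigInt(120))      // 5!  = 120
assert(fact(10) == BigInt(3628800)) // 10! = 3628800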

Also, to get more speed, use the latest version of the JDK and the following JVM options:

-server -XX:+TieredCompilation

Below are results for an Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz (max 3.50GHz), 12 GB DDR3-1333 RAM, Windows 7 SP1, Oracle JDK 1.8.0_25-b18 64-bit:

(BigInt(1) to BigInt(100000)).product took: 3,806 ms with 26.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduce(_ * _) took: 3,728 ms with 25.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceLeft(_ * _) took: 3,510 ms with 25.1 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceRight(_ * _) took: 4,056 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).fold(BigInt(1))(_ * _) took: 3,697 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.product took: 406 ms with 66.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduce(_ * _) took: 296 ms with 71.1 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceLeft(_ * _) took: 3,495 ms with 25.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceRight(_ * _) took: 3,900 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.fold(BigInt(1))(_ * _) took: 327 ms with 56.1 % of CPU usage
fact(100000) took: 203 ms with 28.3 % of CPU usage

BTW, to improve the efficiency of factorial calculation for numbers greater than 20000, use an implementation of the Schönhage-Strassen algorithm for the large multiplications, or wait until it is merged into JDK 9 and Scala can use it.

Andriy Plokhotnyuk