16

I want to randomly sample from a Scala list or array (not an RDD). The sample size can be much larger than the length of the list or array. How can I do this efficiently? The sample size can be very big, and the sampling (on different lists/arrays) needs to be done a large number of times.

I know that for a Spark RDD we can use takeSample() to do it. Is there an equivalent for a Scala list/array?

Thank you very much.

Carter
  • 1
    Random number generators are stateful, so it doesn't make sense for Lists to have such a function. You would have to implement one yourself (also, it would be a linear time operation). For arrays, you can get a random integer from the "Random" objects like so: 'Random.nextInt(myArray.length)' and index into the array. – Felix Oct 04 '15 at 10:06
  • Ahh, nvm. I read too quickly xD – Felix Oct 04 '15 at 10:12

7 Answers

29

An easy-to-understand version would look like this:

import scala.util.Random

Random.shuffle(list).take(n)
Random.shuffle(array.toList).take(n)

// Seeded version
val r = new Random(seed)
r.shuffle(...)
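
Shuffling and then taking n elements samples without replacement, so the result can never be longer than the input. The following is a minimal sketch of that behaviour, not part of the original answer; the object name `ShuffleTakeDemo` and the helper `sampleWithoutReplacement` are made up for illustration.

```scala
import scala.util.Random

object ShuffleTakeDemo {
  // Hypothetical helper: shuffle the whole sequence, then keep the first n.
  // Every element appears at most once, so at most xs.length items come back.
  def sampleWithoutReplacement[T](xs: Seq[T], n: Int, rng: Random): Seq[T] =
    rng.shuffle(xs).take(n)
}
```

For a sample larger than the collection, as asked in the question, sampling with replacement (drawing a random index per sample) is needed instead.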
Marius Soutier
  • 2
    "the sample size can be longer than the length of the list or array," – Felix Oct 04 '15 at 12:37
  • I know how take works, but don't you think he means that it should also give a sample bigger than the sequence in that case? – Felix Oct 05 '15 at 05:30
  • Btw, why do you convert to list? Isn't the complexity of shuffle on lists quite bad (I don't know the implementation). – Felix Oct 05 '15 at 07:47
  • Ahh, it's linear time. No worries then : https://github.com/scala/scala/blob/v2.11.7/src/library/scala/util/Random.scala#L107-L122 – Felix Oct 05 '15 at 07:49
  • 2
    Thanks guys. Yes, I need sampling with replacement, and the sample size is always much larger than the length of the array/list, e.g., I may need 10,000 samples from a list of length 50. – Carter Oct 05 '15 at 10:47
4

For arrays:

import scala.util.Random
import scala.reflect.ClassTag

def takeSample[T: ClassTag](a: Array[T], n: Int, seed: Long): Array[T] = {
  val rnd = new Random(seed)
  // Draw n elements with replacement by picking a random index each time
  Array.fill(n)(a(rnd.nextInt(a.size)))
}

Make a random number generator (rnd) based on your seed. Then fill a new array of length n: each element is produced by drawing a random index from 0 until the size of your input array and reading the input array at that index.

Using it in the REPL could look as follows:

scala> val myArray = Array(1,3,5,7,8,9,10)
myArray: Array[Int] = Array(1, 3, 5, 7, 8, 9, 10)

scala> takeSample(myArray,20,System.currentTimeMillis)
res0: Array[Int] = Array(7, 8, 7, 3, 8, 3, 9, 1, 7, 10, 7, 10,
1, 1, 3, 1, 7, 1, 3, 7)

For lists, I would simply convert the list to an Array and use the same function. I doubt you can get much more efficient for lists anyway.

It is important to note that the same function applied directly to a list would take O(n·m) time for n samples from a list of length m, because indexing into a list is linear, whereas converting the list to an array first takes O(m + n) time.
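
As a rough sketch of the convert-once idea for lists (an illustration under assumptions, not the answer's own code): the list is copied into a `Vector`, which has effectively constant-time indexing, and then sampled with replacement. Using `Vector` rather than `Array` avoids the `ClassTag` requirement. The names `ListSampling` and `sampleWithReplacement` are hypothetical.

```scala
import scala.util.Random

object ListSampling {
  // Hypothetical helper: one linear-time copy, then n cheap indexed draws.
  def sampleWithReplacement[T](xs: List[T], n: Int, rng: Random): Vector[T] = {
    require(xs.nonEmpty || n <= 0, "cannot sample from an empty list")
    val indexed = xs.toVector // linear in the length of xs, done once
    // Each draw picks a uniformly random index into the copied Vector
    Vector.fill(n)(indexed(rng.nextInt(indexed.length)))
  }
}
```

Passing the `Random` instance in explicitly lets the caller reuse one generator across many sampling calls, or fix a seed for reproducible results.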

Felix
  • 1
    Your `takeSample` method is unnecessarily creating the array containing the indices and then mapping that. You should probably instead do something like `Array.fill(n)(a(rng.nextInt(a.size)))` – Jason Scott Lenderman Oct 04 '15 at 18:39
  • Yeah, that doesn't compile though. It's unable to find the required manifest. Probably you can just add the explicit parameter and it will work. – Felix Oct 05 '15 at 05:34
  • When I run the code above I get the following. What am I doing wrong? scala> takeSample(myArray,20,System.currentTimeMillis) res0: Array[() => Int] = Array(, , , , , , , , , , , , , , , , , , , ) – Max Oct 25 '16 at 17:14
  • Please try again. I changed it from `() => a(rnd.nextInt(a.size))` to `a(rnd.nextInt(a.size))` and added the classtag of `T` in order for the construction of the array to work. Try it now :) Sorry for the inconvenience – Felix Oct 27 '16 at 12:02
2

If you want to sample without replacement: zip with randoms, sort (O(n*log(n))), discard the randoms, and take.

import scala.util.Random
val l = Seq("a", "b", "c", "d", "e")
val ran = l.map(x => (Random.nextFloat(), x))
  .sortBy(_._1)
  .map(_._2)
  .take(3)
KevinKatz
1

Using a for comprehension, for a given array xs as follows,

for (i <- 1 to sampleSize; r = (Math.random * xs.size).toInt) yield xs(r)

Note the random generator here produces values within the unit interval, which are scaled to range over the size of the array, and converted to Int for indexing over the array.
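
The same for comprehension can also be written against an explicit `scala.util.Random` instance, so the generator can be seeded and is not shared global state. This is only a sketch; the names `ForComprehensionSampling`, `sample`, and `rng` are assumptions introduced for the example.

```scala
import scala.util.Random

object ForComprehensionSampling {
  // Hypothetical helper: nextInt(bound) yields an index in [0, bound) directly,
  // so no scaling of a unit-interval value and no toInt conversion is needed.
  def sample[T](xs: IndexedSeq[T], sampleSize: Int, rng: Random): IndexedSeq[T] =
    for (_ <- 1 to sampleSize) yield xs(rng.nextInt(xs.length))
}
```

For example, `ForComprehensionSampling.sample(Vector("a", "b", "c"), 10, new Random(7L))` returns ten elements drawn with replacement, and the same seed gives the same sequence again.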

Note: for a purely functional random generator, consider for instance the State Monad approach from Functional Programming in Scala, discussed here.

Note: consider also NICTA, another purely functional random value generator; its use is illustrated for instance here.

Community
elm
  • Isn't Math.random bad practice? This is quite literally static global state. – Felix Oct 05 '15 at 07:46
  • in my mind there is a huge difference between local and global state. One is bad, the other is horrible. – Felix Oct 05 '15 at 09:20
1

Using classical recursion.

import scala.util.Random

def takeSample[T](a: List[T], n: Int): List[T] = {
    n match {
      case n: Int if n <= 0 => List.empty[T]
      case n: Int => a(Random.nextInt(a.size)) :: takeSample(a, n - 1)
    }
}
Thomas Pocreau
  • `takeSample(List(1,2,3),10000)` try this, it'll blow up because it's not tail-recursive. – Felix Oct 27 '16 at 12:12
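
One possible way around the stack-depth problem raised in this comment is an accumulator-based helper marked `@tailrec`. The sketch below is an assumption-laden variant rather than the answer author's code: it also copies the list into a `Vector` first so each lookup does not walk the list, and the names `TailRecSampling`, `loop`, and `rng` are made up.

```scala
import scala.annotation.tailrec
import scala.util.Random

object TailRecSampling {
  // Hypothetical variant of takeSample that runs in constant stack space.
  // Throws IllegalArgumentException (from nextInt) if a is empty and n > 0.
  def takeSample[T](a: List[T], n: Int, rng: Random = new Random()): List[T] = {
    val indexed = a.toVector // one-time copy for fast indexed access

    @tailrec
    def loop(remaining: Int, acc: List[T]): List[T] =
      if (remaining <= 0) acc
      else loop(remaining - 1, indexed(rng.nextInt(indexed.length)) :: acc)

    // Prepending builds the result in reverse draw order, which does not
    // matter here because the draws are independent.
    loop(n, Nil)
  }
}
```

With this shape, a call such as `TailRecSampling.takeSample(List(1, 2, 3), 100000)` should complete without exhausting the stack, because the compiler turns the `@tailrec` helper into a loop.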
0
package your.pkg

import your.pkg.SeqHelpers.SampleOps

import scala.collection.generic.CanBuildFrom
import scala.collection.mutable
import scala.language.{higherKinds, implicitConversions}
import scala.util.Random

trait SeqHelpers {

  implicit def withSampleOps[E, CC[_] <: Seq[_]](cc: CC[E]): SampleOps[E, CC] = SampleOps(cc)
}

object SeqHelpers extends SeqHelpers {

  case class SampleOps[E, CC[_] <: Seq[_]](cc: CC[_]) {

    private def recurse(n: Int, builder: mutable.Builder[E, CC[E]]): CC[E] = n match {
      case 0 => builder.result
      case _ =>
        val element = cc(Random.nextInt(cc.size)).asInstanceOf[E]
        recurse(n - 1, builder += element)
    }

    def sample(n: Int)(implicit cbf: CanBuildFrom[CC[_], E, CC[E]]): CC[E] = {
      require(n >= 0, "Cannot take less than 0 samples")
      recurse(n, cbf.apply)
    }
  }
}

Either:

  • Mix in SeqHelpers, for example into a ScalaTest spec
  • Include import your.pkg.SeqHelpers._

Then the following should work:

Seq(1 to 100: _*) sample 10 foreach { println }

Edits to remove the cast are welcome.

Also if there is a way to create an empty instance of the collection for the accumulator, without knowing the concrete type ahead of time, please comment. That said, the builder is probably more efficient.

Darren Bishop
0

I did not test this for performance, but the following code is a simple and elegant way to do the sampling, and I believe it can help many who come here just to get some sampling code. Just change the "range" according to the size of your end sample. If the pseudo-randomness is not enough for your needs, you can use take(1) on the inner list and increase the range.

Random.shuffle((1 to 100).toList.flatMap(x => (Random.shuffle(yourList))))

ruhsuzbaykus