3

I have a scala.collection.immutable.HashSet that I want to randomly select an element from.

I could solve the problem with an extension method like this:

implicit class HashSetExtensions[T](h: HashSet[T]) {
  def nextRandomElement (): Option[T] = {
    val list = h.toList
    list match {
      case null | Nil => None
      case _ => Some (list (Random.nextInt (list.length)))
    }
  }
}

...but converting to a list will be slow. What would be the most efficient solution?

sungiant
  • 3,152
  • 5
  • 32
  • 49
  • Is this `mutable.HashSet` or `immutable.HashSet` you are using? – Odomontois Apr 22 '15 at 15:45
  • I suspect using an Iterator on the Set directly, and advancing it a random number of times between 0 and the size of the set may be better than converting to a List, but I don't know what the implementation of size or iterator are on HashSet so I'm not sure. – Angelo Genovese Apr 22 '15 at 15:46
  • actually the solution what I proposed with `Random.shuffle(h).headOption` is completely wrong, it returns always the same result – mutantacule Apr 22 '15 at 15:50

2 Answers2

2

WARNING This answer is for experimental use only. For real project you probably should use your own collection types.

So i did some research in the HashSet source and i think there is little opportunity to someway extract the inner structure of most valuable class HashTrieSet without package violation.

I did come up with this code, which is extended Ben Reich's solution:

package scala.collection

import scala.collection.immutable.HashSet
import scala.util.Random

package object random {
  implicit class HashSetRandom[T](set: HashSet[T]) {
    def randomElem: Option[T] = set match {
      case trie: HashSet.HashTrieSet[T] => {
        trie.elems(Random.nextInt(trie.elems.length)).randomElem
      }
      case _ => Some(set.size) collect {
        case size if size > 0 => set.iterator.drop(Random.nextInt(size)).next
      }
    }
  }
}

file should be created somewhere in the src/scala/collection/random folder

note the scala.collection package - this thing makes the elems part of HashTrieSet visible. This is only solution i could think, which could run better than O(n). Current version should have complexity O(ln(n)) as any of immutable.HashSet's operation s.

Another warning - private structure of HashSet is not part of scala's standard library API, so it could change any version making this code erroneous (though it's didn't changed since 2.8)

Community
  • 1
  • 1
Odomontois
  • 15,918
  • 2
  • 36
  • 71
  • Probably over the top, but clever. You could probably do something similar for mutable `HashSet` as well (might even be able to get that one in constant time if you can expose the inner hash table array). Also, if we're going over the top anyway, you can probably reconfigure some things here to make your method here tail recursive! – Ben Reich Apr 22 '15 at 17:27
  • @BenReich There is no need to provide tail recursion here since it would not be called more than `2*log_32(n)` times. We could try to optimize it to call `Random.nextInt` only one time, since multiple call not only expensive, but also provides non-uniform distribution over leaves of not-perfectly balanced tree. – Odomontois Apr 22 '15 at 19:47
1

Since size is O(1) on HashSet, and iterator is as lazy as possible, I think this solution would be relatively efficient:

implicit class RichHashSet[T](val h: HashSet[T]) extends AnyVal {
    def nextRandom: Option[T] = Some(h.size) collect {
        case size if size > 0 => h.iterator.drop(Random.nextInt(size)).next
    }
}

And if you're trying to get every ounce of efficiency you could use match here instead of the more concise Some/collect idiom used here.

You can look at the mutable HashSet implementation to see the size method. The iterator method defined there basically just calls iterator on FlatHashTable. The same basic efficiencies of these methods apply to immutable HashSet if that's what you're working with. As a comparison, you can see the toList implementation on HashSet is all the way up the type hierarchy at TraversableOnce and uses far more primitive elements which are probably less efficient and (of course) the entire collection must be iterated to generate the List. If you were going to convert the entire set to a Traversable collection, you should use Array or Vector which have constant-time lookup.

You might also note that there is nothing special about HashSet in the above methods, and you could enrich Set[T] instead, if you so chose (although there would be no guarantee that this would be as efficient on other Set implementations, of course).

As a side note, when implementing enriched classes for extension methods, you should always consider making an implicit, user-defined value class by extending AnyVal. You can read about some of the advantages and limitations in the docs, and on this answer.

Community
  • 1
  • 1
Ben Reich
  • 16,222
  • 2
  • 38
  • 59
  • `Iterator.drop` would iterate through whole dropped part of set. It's complexity is really O(n) – Odomontois Apr 22 '15 at 16:26
  • Right – I didn't mean to say the whole method was `O(1)`, just that `size` was `O(1)`. This is still `O(n)`, but with a better average- and worst-case. – Ben Reich Apr 22 '15 at 16:27
  • It's definitely better than build new collection but i steel looking it the HashSet.scala to find how to hack-out parts of HashTrieSet to make this O(log(n)) – Odomontois Apr 22 '15 at 16:30
  • if you extended `HashSet`, you could expose the inner hash table (which is an `Array`), and then pick a random element in constant time! Although that's probably unwise. – Ben Reich Apr 22 '15 at 16:31
  • Trouble is that we could not expose. That is private to `scala.immutable.collection` not protected. So only thing we could do - define new class inside the `collection` package which is totally crap – Odomontois Apr 22 '15 at 16:36
  • And second issue - we could not force `HashSet` factory and other HashSet subclasses to use our new class instead of original HashTrieSet – Odomontois Apr 22 '15 at 16:38