SCALA: Which data structures are optimal in which situations when using ".contains()" or ".exists()"?

Question

I would like to know in which situations which data structures are optimal for using "contains" or "exists" checks.

I ask because I come from a Python background and am used to using `if x in something:` expressions for everything. For example, which of the following expressions are evaluated the quickest?

val m = Map(1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4)
                                          //> m  : scala.collection.immutable.Map[Int,Int] = Map(1 -> 1, 2 -> 2, 3 -> 3, 4
                                          //|  -> 4)
val l = List(1,2,3,4)                     //> l  : List[Int] = List(1, 2, 3, 4)
val v = Vector(1,2,3,4)                   //> v  : scala.collection.immutable.Vector[Int] = Vector(1, 2, 3, 4)

m.exists(_._1 == 3)                       //> res0: Boolean = true
m.contains(3)                             //> res1: Boolean = true
l.exists(_ == 3)                          //> res2: Boolean = true
l.contains(3)                             //> res3: Boolean = true
v.exists(_ == 3)                          //> res4: Boolean = true
v.contains(3)                             //> res5: Boolean = true

Intuitively, I would assume that vectors should be the quickest for random checks, and lists would be quickest if one knows that the value checked is near the beginning of the list and there is a lot of data. However, a confirmation or correction would be most welcome. Also, please feel free to expand to other data structures.

Note: Please let me know if you feel this question is too vague as I'm not sure I am phrasing it correctly.

FYI http://www.scala-lang.org/docu/files/collections-api/collections_40.html — om-nom-nom, May 08 '13 at 14:25
In Python, as in every other language, the abstract data type of choice for when you primarily need membership checks is a **set**, not a sequence or mapping. — , May 08 '13 at 14:27
Checking for a specific element is **not** a random check; for vectors/lists/arrays it is rather a short-circuiting full scan: *take the first element, compare, if not equal, take the second, compare, ...* . On the other hand, `contains` on sets and maps is meant to be constant-time (unlike `exists`, which has to apply some predicate first and thus, I think, is linear too) — om-nom-nom, May 08 '13 at 14:28

Rex Kerr · Accepted Answer · 2015-09-25T13:15:18.973

22

Set and Map (with a default hash table implementation) are by far the fastest at contains since they compute the hash value and jump to the right location immediately. For example, if you want to find an arbitrary string out of a list of a thousand, contains on a set is about 100x faster than contains on List or Vector or Array.
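To make the difference concrete, here is a minimal sketch (not a benchmark) of the same membership check against a List, a Vector and a Set. The object name `ContainsDemo` and the sample strings are made up for illustration; only the standard collection methods are real.

```scala
object ContainsDemo {
  // Hypothetical sample data: a thousand distinct strings.
  val words: List[String] = (1 to 1000).map(i => s"word$i").toList
  val wordVector: Vector[String] = words.toVector
  val wordSet: Set[String] = words.toSet // hash-based for a set of this size

  // Linear scan: compares element by element until a match is found.
  val inList: Boolean = words.contains("word742")
  val inVector: Boolean = wordVector.contains("word742")

  // Hash lookup: hashes the key and goes straight to where it would be stored.
  val inSet: Boolean = wordSet.contains("word742")
  val missing: Boolean = wordSet.contains("word1001")

  def main(args: Array[String]): Unit =
    println(s"$inList $inVector $inSet $missing") // prints "true true true false"
}
```

All three calls return the same answer; what differs is how much work is done to get it. If the data arrives as a List and will be queried many times, converting it once with `toSet` is usually the cheaper option overall.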

With exists, you really just care about how fast the collection is to traverse, since in the worst case you have to traverse everything anyway. There, List is usually the champ (unless you want to traverse an array by hand), and only Set and the like are particularly slow (e.g. exists on List is ~8x faster than on a Set when each has 1000 elements). The others are within about 2.5x of List (usually 1.5x, but Vector has an underlying tree structure which is not all that fast to traverse).
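A small sketch of what that traversal looks like in practice: the hypothetical helper `probes` below counts how often the predicate is called before `exists` returns. It stops at the first match, and a miss costs one call per element.

```scala
object ExistsDemo {
  // Hypothetical helper: runs exists and reports how many times the
  // predicate was evaluated before it returned.
  def probes[A](xs: Iterable[A])(p: A => Boolean): (Boolean, Int) = {
    var calls = 0
    val found = xs.exists { x => calls += 1; p(x) }
    (found, calls)
  }

  val l = List(1, 2, 3, 4)

  val early = probes(l)(_ == 1) // (true, 1): match on the first element
  val late  = probes(l)(_ == 4) // (true, 4): match on the last element
  val miss  = probes(l)(_ == 9) // (false, 4): every element was tested
}
```

The counts are only predictable for ordered collections such as List or Vector. For a hash-based Set the iteration order is unspecified, so the number of predicate calls before a hit can vary.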

edited Sep 25 '15 at 13:15

answered May 08 '13 at 16:39

Rex Kerr


This won't be relevant to most use cases, but if you're willing to give your `exists` predicate more structure than an opaque function, it'll be possible to implement it more efficiently. If you define a simple AST for possible relations, then hash-based structures will be good at equality predicates, whereas ordered structures (`TreeMap`, `IntMap` with gotchas) will be good at equality predicates _and_ ordering predicates. Tries might be good at prefix match predicates, and you can get arbitrarily complex DAWGs and such. Add variable binders in your predicate DSL and it'll be even fancier! – Mysterious Dan May 08 '13 at 18:30
@MysteriousDan - `exists` is the name of the structure-free test in the collections library. Of course there are transforms to `contains` for equality and something equivalent to `range` for trees. But that isn't `exists`, that's something else. (Did you mean to reply to gzm0, who claimed that there's nothing faster than O(n) for `exists`? Your reply would make more sense in that context.) – Rex Kerr May 08 '13 at 18:46
Err, yep, sorry :) but I do know that `exists` is an API method :P I'm just saying that if we're willing to break from the opaque function model, we can do better (yet still have a way to embed arbitrary functions with an implicit conversion, so people can pretend the smarter interface doesn't exist) – Mysterious Dan May 08 '13 at 19:24
@MysteriousDan - Indeed, and the bytecode is not opaque to the compiler or JVM, so smartness can come in at that level even with the same interface. – Rex Kerr May 08 '13 at 20:03
Isn't Set implemented as a red-black tree, which finds an element in O(log n)? – Zvi Mints Mar 26 '23 at 15:16
@ZviMints - `SortedSet` is a `TreeSet`, which is a red-black tree. Regular (immutable) `Set` is a hash trie set. – Rex Kerr Apr 05 '23 at 05:35

score 1 · Answer 2 · answered May 08 '13 at 15:35

If you want to use contains extensively, you should use a Set (or a Map).

AFAIK there is no data structure that implements an efficient (i.e. faster than O(n)) exists, since the closure you pass in may not even be related to the elements inside.
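As a rough illustration of that last point (the object name `PredicateDemo` and the sample values are made up), an equality test can be phrased either way, but a general predicate gives the collection nothing to hash on:

```scala
object PredicateDemo {
  val fruits = Set("apple", "banana", "cherry")

  // Equality check: the set can hash the argument and look it up directly.
  val viaContains: Boolean = fruits.contains("banana")

  // Same answer, but written as an opaque function, so elements are
  // tested one by one until one matches.
  val viaExists: Boolean = fruits.exists(_ == "banana")

  // A predicate that is not an equality test at all; there is no key to
  // look up, so the collection can only try its elements in turn.
  val hasLongName: Boolean = fruits.exists(_.length > 5)
}
```

So when the check really is "is this exact value present", `contains` on a Set (or on a Map's keys) is the form that lets the data structure help.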

gzm0
