9

I was trying the Facebook Hacker Cup 2013 Qualification Problems in Scala, and for the 3rd problem I needed an ordered Multiset but could not find one in Scala's (2.10) collections. Is this data structure missing from Scala's collections? Is it going to be implemented in a future version? Is a Multiset not really necessary if you already have a set implemented?

redoacs

5 Answers

8

A multiset can be pretty useful sometimes. I often find myself coding a Map[K, List[V]] manually. There is a great multiset implementation called Bag, by Nicolas Stucki, which is released on Maven.

Announced here:

https://groups.google.com/forum/#!topic/scala-internals/ceaEAiQPgK4

Code here:

https://github.com/nicolasstucki/multisets

Maven:

https://github.com/nicolasstucki/multisets/blob/master/MavenRepository.md
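For comparison, the hand-rolled `Map[K, List[V]]` approach mentioned above can be sketched with the standard library alone. This is only an illustration and not the Bag library's API: the `Token` class, the `GroupedBag` object and the case-insensitive key are hypothetical names and choices made up for the example.

```scala
object GroupedBag {
  // Hypothetical element type: tokens are "equivalent" when their text
  // matches ignoring case, but each one keeps its own position.
  final case class Token(text: String, position: Int)

  // Group equivalent tokens under one key, keeping every occurrence.
  def group(tokens: Seq[Token]): Map[String, List[Token]] =
    tokens.toList.groupBy(_.text.toLowerCase)

  // Multiplicity of a key: the number of occurrences stored under it.
  def count(bag: Map[String, List[Token]], key: String): Int =
    bag.get(key).map(_.size).getOrElse(0)
}
```

Keeping the whole list, rather than just a count, is what preserves the extra information carried by elements that compare as equivalent.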

axel22
  • 2
    I'm wondering why you implement a `Map[K, List[V]]`. The `Map[T, Int]` mentioned by @Steve in a comment on the accepted answer makes more sense to me. – lex82 Sep 22 '16 at 14:41
  • 2
    Depends on what your use-case is - sometimes your elements are equal according to `equals`, but have extra information that you want to keep around. – axel22 Sep 22 '16 at 17:06
  • 1
    The code to import in an SBT project is `libraryDependencies ++= Seq("io.github.nicolasstucki" %% "multisets" % "0.4")` – Mikaël Mayer Mar 23 '17 at 10:19
  • 1
    For what it's worth, a `Map[K, List[V]]` is really a `MultiMap`, and Scala does have a trait for that: https://www.scala-lang.org/api/2.12.3/scala/collection/mutable/MultiMap.html – Brian McCutchon Aug 11 '18 at 18:41
2

A multiset is a rather peculiar and uncommon data structure. It is not, for instance, part of Java's standard library either. Guava does have one, and so does Boost, but Boost has basically everything.

If all you want is to count the number of occurrences of the elements, you could resort to a SortedMap from element to count instead. If, on the other hand, you need the elements to be distinct and retrievable, yet equivalent under the sorting rules, you could use a SortedMap from element (it does not matter which one) to a Set of distinguished elements.
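A minimal sketch of the first suggestion, an ordered multiset kept as a `SortedMap` from element to count, might look like the following. The object name `SortedCounts` and its helper methods are invented for illustration; only `SortedMap` itself comes from the standard library.

```scala
import scala.collection.immutable.SortedMap

object SortedCounts {
  // Build the multiset: each distinct element maps to its occurrence count.
  def fromSeq[A: Ordering](xs: Seq[A]): SortedMap[A, Int] =
    xs.foldLeft(SortedMap.empty[A, Int]) { (counts, x) =>
      counts.updated(x, counts.getOrElse(x, 0) + 1)
    }

  // Remove a single occurrence; the key disappears when its count hits zero.
  def removeOne[A](counts: SortedMap[A, Int], x: A): SortedMap[A, Int] =
    counts.get(x) match {
      case Some(n) if n > 1 => counts.updated(x, n - 1)
      case Some(_)          => counts - x
      case None             => counts
    }

  // Expand back into a sorted sequence with repeats.
  def toSortedSeq[A](counts: SortedMap[A, Int]): Seq[A] =
    counts.toSeq.flatMap { case (x, n) => Seq.fill(n)(x) }
}
```

With this representation, `counts.head` and `counts.last` give the smallest and largest element together with their multiplicities, which is often what an ordered multiset is wanted for.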

Daniel C. Sobral
  • Thanks, I did not realize a SortedMap was useful in this situation. – redoacs Jan 31 '13 at 22:20
  • Now I know what Multisets are like. Thank you. – idonnie Feb 01 '13 at 21:47
  • 30
    I disagree with "A multiset is a rather peculiar and uncommon data structure". A multiset is an extremely common data structure if you're doing things like counting words or making histograms of things. – Steve Feb 22 '13 at 13:07
  • @Steve There's no mention of it in Mehta & Sahni's Handbook of Data Structures and Applications nor in Peter Brass's Advanced Data Structures. It's not part of Java's, Ruby's or Python's standard libraries, and in fact, it seems only C++ has it as part of its standard data structures. So I stand by my words. – Daniel C. Sobral Feb 22 '13 at 15:16
  • 6
    There *are* multisets in [Python's standard library](http://docs.python.org/2/library/collections.html#collections.Counter), and in [C++'s standard library](http://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.3/a01117.html), and in two very common extensions of the Java collections library, [Apache commons](http://commons.apache.org/collections/apidocs/org/apache/commons/collections/Bag.html) and [Guava](http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Multiset.html). I just don't see how you can claim that it's "peculiar and uncommon" given that. – Steve Feb 23 '13 at 04:21
  • @Steve No, what Python has is a *map*, which serves about the same purpose as multisets -- as I described in my answer. As the docs say, "A Counter is a _dict subclass_ for counting hashable objects". Apache Commons and Guava are libraries full of odds and ends -- being a part of them doesn't make anything not peculiar or common. _Only_ C++ has them in the standard library. If not a single other language has them in the standard library, they are not common, period. And they are peculiar because they are just a special case of a map. – Daniel C. Sobral Feb 23 '13 at 04:57
  • 1
    Straight from the Python Counter class documentation: "The Counter class is similar to bags or multisets in other languages". – Steve Feb 23 '13 at 14:56
  • @Steve Yes. Similar means not the same. – Daniel C. Sobral Feb 23 '13 at 20:41
  • 2
    Maybe you could comment on what you think the difference is? Based on the [formal definition](http://en.wikipedia.org/wiki/Multiset#Formal_definition), a multiset is essentially a `Map[T, Int]`, and that's exactly what Python is providing. – Steve Feb 24 '13 at 11:29
  • @Steve That's a mathematical definition. The [computer science](http://en.wikipedia.org/wiki/Multiset_(abstract_data_type)#Multiset) is more relevant. The two considerations are API -- whether it is a map/dictionary api or a set api -- and the data structure itself -- whether it is a map/dictionary data structure or a specialized structure. While Counter does seem to have a specialized data structure -- given that it is considered high performance -- it does not have a Set api, but a Dict api. – Daniel C. Sobral Feb 25 '13 at 00:04
  • 1
    I see, so while [C++](http://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.3/a01117.html), [Python](http://docs.python.org/2/library/collections.html#collections.Counter) and [Smalltalk](http://www.gnu.org/software/smalltalk/manual-base/html_node/Bag.html) all provide built-in multisets/bag classes, you only count the C++ one because it's the only one that also implements the language's set API. I'm more of a practicality-beats-purity kind of guy - if a class provides the basic multiset/bag operations (is_element_of, iterate, count, etc.) then I count it as a multiset/bag. – Steve Feb 25 '13 at 08:31
  • By that logic, any map **is** a set as you just need to use its keys and ignore the values. That rationale does not feel right to me. You could also say that any integer type is a boolean type as we can use 1 for true and any other value for false (as is done in pre-99 C, for example), but I'm pretty sure you see how different it is to have a proper boolean type. @Daniel: I'm sure I am missing something, but how can a SortedMap to Sets emulate an ordered MultiSet? Surely you would lose the insertion order? – Régis Jean-Gilles Feb 25 '13 at 11:24
  • @RégisJean-Gilles Actually, I think I wasn't thinking straight. I was probably thinking of `SortedMap`'s ability to provide a custom equality through its `Ordering` so you can ignore the value, but a `Map` already works like a `Set` for its keys. It might be useful, but it is not necessary. – Daniel C. Sobral Feb 25 '13 at 14:11
  • @Régis No, a map is not a set, since it doesn't support the basic add/update operation `Map[K, V] + K`. In contrast, all of the multiset/bag implementations I linked to (including Python's) do provide the basic add/update operation `MultiSet[K] + K` in some form or another, though most of them also provide the additional add/update operation `MultiSet[K] + (K, Int)`. Of course, you can **create** a `Set` from a `Map` using `.keySet`, but that's not the same as a `Map` **being** a `Set`. – Steve Feb 25 '13 at 18:57
  • When talking mathematics, I think of a multiset (over some base set A) as a function from A to the natural numbers (including 0), exactly as was suggested: map each element to its occurrence count. Then intersection is something like `Map(k -> min(v1, v2) for k in union(keys(multiset1), keys(multiset2)) letting v1 = multiset1[k] letting v2 = multiset2[k])` and union is the same with sum instead of min. Note that there is no sensible complement, unless you have a maximum occurrence count for each element (at which point it's no longer a multiset but rather a multiset-ish.) – Jonas Kölker Apr 02 '16 at 18:59
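The pseudocode in the last comment translates fairly directly to Scala if a multiset is taken to be a plain `Map[A, Int]`. The following is a sketch under that assumption; `CountMapOps`, `Bag`, `combine`, `intersect` and `sum` are names made up here, and dropping zero counts is a design choice rather than something the comment specifies.

```scala
object CountMapOps {
  // Assumption: a multiset is represented as element -> occurrence count.
  type Bag[A] = Map[A, Int]

  // Combine counts key by key over the union of both key sets,
  // treating a missing key as count 0 and dropping zero results.
  private def combine[A](x: Bag[A], y: Bag[A])(f: (Int, Int) => Int): Bag[A] =
    (x.keySet ++ y.keySet).iterator
      .map(k => k -> f(x.getOrElse(k, 0), y.getOrElse(k, 0)))
      .filter { case (_, n) => n > 0 }
      .toMap

  // Intersection: the minimum of the two counts for each element.
  def intersect[A](x: Bag[A], y: Bag[A]): Bag[A] = combine(x, y)(_ min _)

  // The operation the comment describes with sum instead of min.
  def sum[A](x: Bag[A], y: Bag[A]): Bag[A] = combine(x, y)(_ + _)
}
```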
1

The Seq trait has diff, intersect and even union. That should help you with a lot of your multiset problems. http://www.scala-lang.org/api/2.11.7/index.html#scala.collection.Seq@diff(that:Seq[A]):Seq[A]
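A short illustration of how these `Seq` methods respect multiplicities. The object and value names are invented for the example; `++` is used for the union because on a `Seq` the union is plain concatenation.

```scala
object SeqAsMultiset {
  val a = Seq(1, 1, 2, 3)
  val b = Seq(1, 2, 2)

  // diff removes one occurrence from `a` per matching occurrence in `b`.
  val difference = a.diff(b)   // Seq(1, 3)

  // intersect keeps each element as many times as it occurs in both.
  val common = a.intersect(b)  // Seq(1, 2)

  // Concatenation adds the multiplicities together.
  val combined = a ++ b

  // Order-insensitive equality: nothing is left over in either direction.
  def sameElementsIgnoringOrder[A](xs: Seq[A], ys: Seq[A]): Boolean =
    xs.diff(ys).isEmpty && ys.diff(xs).isEmpty
}
```

These operations rely on `equals`/`hashCode` of the elements and leave the result unsorted, so they cover the "bag" part of an ordered multiset but not the ordering.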

dividebyzero
0

If all you need is equality and you don't care too much about performance, you can just use sorted lists.
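As a sketch of that idea, two collections hold the same multiset exactly when their sorted forms are equal. The names below are hypothetical, and the check assumes the `Ordering` is consistent with `equals` (elements that compare as 0 are also equal).

```scala
object SortedListEquality {
  // Sort both sides and compare element by element: O(n log n) per check.
  def sameMultiset[A: Ordering](xs: Seq[A], ys: Seq[A]): Boolean =
    xs.sorted == ys.sorted
}
```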

boggle
0

Both mutable and immutable Multisets are provided in https://github.com/scala/scala-collection-contrib

Voivoid