Good Clojure representation for unordered pairs?

Question

I'm creating unordered pairs of data elements. A comment by @Chouser on this question says that hash-sets are implemented with 32 children per node, while sorted-sets are implemented with 2 children per node. Does this mean that my pairs will take up less space if I implement them with sorted-sets rather than hash-sets (assuming that the data elements are Comparable, i.e. can be sorted)? (I doubt it matters for me in practice. I'll only have hundreds of these pairs, and lookup in a two-element data structure, even sequential lookup in a vector or list, should be fast. But I'm curious.)
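For context, here is a minimal sketch of why sets suit unordered pairs at all: Clojure sets compare by contents, so neither element order nor the concrete set type affects equality. The keywords and the names `p1`, `p2`, `p3` are placeholders for illustration.

```clojure
;; Sets compare equal by contents, regardless of the order of the arguments.
(def p1 (hash-set :a :b))
(def p2 (hash-set :b :a))
(= p1 p2)

;; A sorted-set holding the same elements is equal to the hash-set too.
(def p3 (sorted-set :b :a))
(= p1 p3)

;; Equal sets hash alike, so either kind can be used to look up the other
;; when pairs serve as map keys.
(get {p1 :found} p3)
```

So the choice between the two set types here only affects space and lookup cost, not how the pairs behave.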

score 1 · Answer 1 · answered Nov 05 '13 at 22:12

Comparing an explicit check of the first two elements of a list against Clojure's built-in sets, I don't see a significant difference over ten million lookups:

user> (defn my-lookup [key pair] 
         (condp = key 
               (first pair) true 
               (second pair) true false))
#'user/my-lookup

user> (time (let [data '(1 2)] 
              (dotimes [x 10000000] (my-lookup (rand-nth [1 2]) data ))))
"Elapsed time: 906.408176 msecs"
nil

user> (time (let [data #{1 2}] 
               (dotimes [x 10000000] (contains? data (rand-nth [1 2])))))
"Elapsed time: 1125.992105 msecs"
nil

Of course, micro-benchmarks like this are inherently flawed and hard to do well, so don't use this one to argue that either approach is better than the other. I only intend to demonstrate that they are very similar.

Thanks very much for going through the trouble of setting up the micro-benchmark. That's about what I would have expected. I imagine lists take up less space than hash-sets, and that vectors take up even less space. I got similar results using a vector to represent the set, replacing `(first pair)` with `(pair 0)` and `(second pair)` with `(pair 1)`. — Mars, Nov 05 '13 at 23:53

score 1 · Answer 2 · edited May 23 '17 at 11:57

If I'm doing something with unordered pairs, I usually like to use a map since that makes it easy to look up the other element. E.g., if my pair is [2 7], then I'll use {2 7, 7 2}, and I can do ({2 7, 7 2} 2), which gives me 7.

As for space, the PersistentArrayMap implementation is actually very space conscious. If you look at the source code (see previous link), you'll see that it allocates an Object[] of the exact size needed to hold all the key/value pairs. I think this is used as the default map type for all maps with no more than 8 key/value pairs.

The only catch here is that you need to be careful about duplicate keys. {2 2, 2 2} will cause an exception. You could get around this problem by doing something like this: (merge {2 2} {2 2}), i.e. (merge {a b} {b a}) where it's possible that a and b have the same value.
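A minimal sketch of the two-way map described above. The helper names `unordered-pair` and `other-element` are hypothetical, not from any library; `merge` is used so that a pair whose two elements are equal does not trigger the duplicate-key error.

```clojure
;; Hypothetical helper: build a two-way lookup map for an unordered pair.
;; Merging two single-entry maps avoids the duplicate-key exception that
;; a literal like {2 2, 2 2} would throw.
(defn unordered-pair [a b]
  (merge {a b} {b a}))

;; Hypothetical helper: given one element of the pair, return the other.
(defn other-element [pair x]
  (get pair x))

(def p (unordered-pair 2 7))
(other-element p 2)   ; the partner of 2
(other-element p 7)   ; the partner of 7

;; Argument order does not matter: both calls produce equal maps.
(= (unordered-pair 2 7) (unordered-pair 7 2))

;; A pair of equal elements collapses to a single entry.
(unordered-pair 2 2)
```

Note that `other-element` returns nil for a value that is not in the pair, so this sketch assumes nil is never one of the elements.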

Here's a little snippet from my repl:

user=> (def a (array-map 1 2 3 4))
#'user/a
user=> (type a)
clojure.lang.PersistentArrayMap
user=> (.count a) ; count simply returns array.length/2 of the internal Object[]
2

Note that I called array-map explicitly above. This relates to a question I asked a while ago about map literals and def in the REPL: Why does binding affect the type of my map?

Very nice suggestion. The two-way lookup idea is useful, and I shouldn't have to worry about duplicate keys in my application. And thanks for the reference to the weird behavior of curly braces in the REPL. – Mars, Nov 05 '13 at 23:57
@Mars - If you don't need to be able to look up the other item, then just storing `{a a, b b}` will make it behave just like a set, but since it's backed by an array it'll take up less space. It will probably also perform better than the set, since lookup is just a linear search through the 2 keys in the array: at most 2 equality checks, and then it knows whether the key is in the map. (That's the reason the default is an array-map and not a hash-map for map literals with just a few keys.) – DaoWen, Nov 06 '13 at 01:20
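A small sketch of the set-like map from the comment above, assuming a hypothetical helper named `pair-set`. It starts from an explicit `array-map`, as the answer does, so the result stays array-backed.

```clojure
;; Hypothetical helper: store each element as both key and value, so the
;; map behaves like a two-element set when queried with contains?.
(defn pair-set [a b]
  (assoc (array-map a a) b b))

(def p (pair-set :x :y))

(contains? p :x)   ; membership test for an element of the pair
(contains? p :z)   ; membership test for something outside the pair
(type p)           ; the concrete map class

;; Argument order does not affect equality.
(= (pair-set :x :y) (pair-set :y :x))
```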

score 1 · Answer 3 · answered Nov 06 '13 at 17:07

This should be a comment, but I'm too short on reputation and too eager to share information. If you are concerned about performance, clj-tuple by Zachary Tellman may be 2-3 times faster than ordinary lists/vectors, as claimed here: ztellman/clj-tuple.

icamts


No, that deserves to be an answer. I'm not really worried too much about performance on lookup in a pair, but I will take a look at clj-tuple anyway. – Mars Nov 06 '13 at 17:18

Mars · Answer 4 · 2013-11-06T18:44:25.063

I wasn't planning to benchmark different pair representations now, but @ArthurUlfeldt's and @DaoWen's answers led me to do so. Here are my results using criterium's bench macro; the source code is below. To summarize, as expected, there are no large differences between the six representations I tested. However, there is a gap between the times for the two fastest, array-map and hash-map, and the others. This is consistent with DaoWen's and Arthur Ulfeldt's remarks.

Average execution time in seconds, in order from fastest to slowest (MacBook Pro, 2.3GHz Intel Core i7):

array-map: 5.602099

hash-map: 5.787275

vector: 6.605547

sorted-set: 6.657676

hash-set: 6.746504

list: 6.948222

Edit: I added a run of test-control below, which does only what is common to all of the other tests. test-control took, on average, 5.571284 seconds. It appears that there is a bigger difference between the -map representations and the others than I had thought: access to a hash-map or an array-map of two entries is essentially instantaneous (on my computer, OS, Java, etc.), whereas the other representations take about a second more for 10 million iterations. Given that it's 10M iterations, those operations are still almost instantaneous. (test-arraymap came within about 0.03 seconds of test-control; a difference that small could be noise from other things happening in the background on the computer, or idiosyncrasies of compilation.)
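To make the comparison with test-control concrete, one way to read these numbers is to subtract the control time from each mean. This is only arithmetic on the figures reported above, and it assumes the loop overhead is the same in every test; the names `control`, `mean-times`, `overhead`, and `ns-per-lookup` are made up for illustration.

```clojure
;; Mean times in seconds, copied from the results above.
(def control 5.571284)

(def mean-times
  {:array-map  5.602099
   :hash-map   5.787275
   :vector     6.605547
   :sorted-set 6.657676
   :hash-set   6.746504
   :list       6.948222})

;; Seconds spent per 10 million lookups beyond the control loop.
(def overhead
  (into {} (for [[k t] mean-times] [k (- t control)])))

;; Approximate nanoseconds per single lookup:
;; seconds * 1e9 ns/s / 1e7 lookups = seconds * 100.
(def ns-per-lookup
  (into {} (for [[k s] overhead] [k (* s 100.0)])))

(doseq [[k v] (sort-by val ns-per-lookup)]
  (println k v))
```

On these figures the two map representations add roughly 0.03 to 0.2 seconds per 10 million lookups and the others roughly 1.0 to 1.4 seconds, which works out to well under 150 nanoseconds per lookup in every case.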

(A caveat: I forgot to mention that I'm getting a warning from criterium: "JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active." As I understand it, Leiningen starts Java with this option by default to speed up startup, and it restricts the JIT to the faster-compiling C1 (client) compiler, so the more aggressive C2 (server) optimizations are not applied. Running with C2 enabled might change the times given above.)
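If the warning matters for your measurements, Leiningen lets a project override its default JVM options. Below is a sketch of a project.clj; the project name and the version numbers are assumptions chosen only for illustration, and I have not rerun the benchmark with it.

```clojure
;; project.clj (hypothetical project name and illustrative versions)
(defproject pair-bench "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [criterium "0.4.2"]]
  ;; ^:replace discards Leiningen's default :jvm-opts (which include
  ;; TieredStopAtLevel=1) instead of appending to them.
  :jvm-opts ^:replace ["-server"])
```

With the defaults replaced, the criterium warning should no longer apply, and the JIT is free to use its full set of optimizations.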

(use 'criterium.core)

;; based on Arthur Ulfeldt's answer:
(defn pairlist-contains? [key pair] 
  (condp = key 
    (first pair) true 
    (second pair) true 
    false))

(defn pairvec-contains? [key pair] 
  (condp = key 
    (pair 0) true 
    (pair 1) true
    false))

(def ntimes 10000000)

;; Test how long it takes to do what's common to all of the other tests
(defn test-control []
  (print "=============================\ntest-control:\n")
  (bench
    (dotimes [_ ntimes]
      (def _ (rand-nth [:a :b])))))

(defn test-list []
  (let [data '(:a :b)] 
    (print "=============================\ntest-list:\n")
    (bench
      (dotimes [_ ntimes]
        (def _ (pairlist-contains? (rand-nth [:a :b]) data))))))

(defn test-vec []
  (let [data [:a :b]] 
    (print "=============================\ntest-vec:\n")
    (bench
      (dotimes [_ ntimes]
        (def _ (pairvec-contains? (rand-nth [:a :b]) data))))))

(defn test-hashset []
  (let [data (hash-set :a :b)]
    (print "=============================\ntest-hashset:\n")
    (bench
      (dotimes [_ ntimes]
        (def _ (contains? data (rand-nth [:a :b])))))))

(defn test-sortedset []
  (let [data (sorted-set :a :b)]
    (print "=============================\ntest-sortedset:\n")
    (bench
      (dotimes [_ ntimes]
        (def _ (contains? data (rand-nth [:a :b])))))))

(defn test-hashmap []
  (let [data (hash-map :a :a :b :b)]
    (print "=============================\ntest-hashmap:\n")
    (bench
      (dotimes [_ ntimes]
        (def _ (contains? data (rand-nth [:a :b])))))))

(defn test-arraymap []
  (let [data (array-map :a :a :b :b)]
    (print "=============================\ntest-arraymap:\n")
    (bench
      (dotimes [_ ntimes]
        (def _ (contains? data (rand-nth [:a :b])))))))

(defn test-all []
  (test-control)
  (test-list)
  (test-vec)
  (test-hashset)
  (test-sortedset)
  (test-hashmap)
  (test-arraymap))
