
Given two (unordered) identical sets, is (first <set>) guaranteed to always return the same element?

If this is documented somewhere, where is it?

Why I care

The main reason I want this right at the moment is for searching graphs that are produced by random processes, yielding results to which more random processes are applied. uber/nodes and loom/nodes appear to return sets. Some of the specifics of the traversal order don't matter, but some do. What matters is that each time I run the program with the same random-number seed, I should get the same results.

I'd prefer not to impose an ordering on the nodes or edges of the graph. That seems to impose computational overhead for no real benefit—except determinism.

I've run into this same need on other projects, not involving graphs. Often with genetic algorithms, I've got lots of sets that I chug through (but no duplicates!). The sequence in which the program goes through the set doesn't matter except that it affects how many times the random-number generator is called before a given element is processed. So, more generally, you could say that my question is about whether (seq <set>) is deterministic.

Redefining the problem?

If the answer is no, or if it's yes but a sin against Clojure, here's another way to state the problem: What's the Clojurely way to work through a collection item-by-item, where you never want duplicates and where you don't care about the order—except that you need the items to come in the same order each time you run the program? (More precisely, every time you call the function with the same random-number seed.)

That may be hopeless, though, unless I'm up for writing another graph library.

Ben Kovitz
  • Even if it was, I'd never want to use it in code - it obscures the intention of your data structure (i.e. being ordered or not). Understood that you might not be asking because you want to use it, though. :) – matt_t_gregg Jun 18 '17 at 16:32
  • @matt_t_gregg I definitely want to use it, specifically for walking graphs. I'll add this to the question. – Ben Kovitz Jun 18 '17 at 16:41
  • I'd reconsider. The very notion of getting the first element out of a set is nonsensical (I mean that literally). It's like saying: I want the element that first appears according to the order of the container in a container that has no order. Imo you should definitely add the walking graphs part to your question because I really don't think you should use `first` with a set. – MasterMastic Jun 18 '17 at 16:57
  • @MasterMastic You must be one o' them [Axiom of Choice](https://en.wikipedia.org/wiki/Axiom_of_choice)-haters. Seriously, I've just got a need to go item-by-item through collections where I don't want duplicates, where the order doesn't matter, but where the order needs to be the same in multiple runs of the program. Maybe there's a whole other approach to this that I haven't thought of. (I'll add this to the question, too.) – Ben Kovitz Jun 18 '17 at 17:13
  • Are you certain that the computational cost of using sorted-set is significant? For practical purposes, you're probably safe - i.e. same machine, same version of Java/Clojure/libraries, same random seed things will run the same - but it's not something that I'd like to launch a satellite with. – matt_t_gregg Jun 18 '17 at 17:41
  • @matt_t_gregg I'm not certain, and indeed this may be a time to heed the wisdom in Knuth's exhortation against premature optimization. This program is very much a CPU-hog, though, and I'd like to avoid unnecessary stuff in inner loops wherever it's not too difficult. More generally, another reason for avoiding a sort is that a sort _also_ obscures the intention. Since the order is meaningless, putting in code to force a specific order is somewhat misleading. – Ben Kovitz Jun 18 '17 at 17:51
  • @BenKovitz - Good point - sounds like you're looking at something more like a queue than a set/sorted set? – matt_t_gregg Jun 18 '17 at 17:54
  • @matt_t_gregg Yes, that's a good way of putting it: it's an "unordered queue" (with no duplicates). – Ben Kovitz Jun 18 '17 at 23:59

3 Answers

3

It is not guaranteed anywhere that I know of. In fact, it is highly likely to change for different releases of Clojure if the implementation and/or hash function changes.

If you want it in a certain order, use either sorted-set or sorted-set-by. For example:

user> (sorted-set-by > 3 5 8 2 1)
#{8 5 3 2 1}
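
As a small additional sketch (the names `a` and `b` are only for illustration), a sorted set built with the default comparator iterates in the same order no matter how its elements were inserted, which is the property the question is after:

```clojure
;; Two sorted sets with the same elements, inserted in different orders.
(def a (into (sorted-set) [3 5 8 2 1]))
(def b (into (sorted-set) [1 2 8 5 3]))

;; Iteration order comes from the comparator, not from insertion history.
(println (seq a))
(println (= (seq a) (seq b)))
(println (= (first a) (first b)))
```

The price is a comparison on every insertion and lookup, and the elements must be mutually comparable.
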
Alan Thompson
  • Good point that even if it happens to work, it still can't be depended on. Is there a way to get an arbitrary but consistent order in a collection that never allows duplicates in it, without adding the extra overhead of a comparison function? – Ben Kovitz Jun 18 '17 at 17:31
3

Someone will probably provide an authoritative answer later; in the meantime, I did a quick generative test.

(require '[clojure.spec.alpha :as s]
         '[clojure.spec.test.alpha :as stest])

(defn rebuild-set [set]
  (into #{} (shuffle (vec set))))

(s/fdef rebuild-set
  :args (s/cat :set set?)
  :ret set?
  :fn #(= (-> % :ret first)
          (-> % :args :set first)))

(stest/check `rebuild-set
             {:clojure.spec.test.check/opts {:num-tests 100}})

Surprisingly to me, this seemed to succeed at first.

Then an edge case turned up as a counterexample:

(= #{0 -0.0} #{-0.0 0})  ; => true
(first #{0 -0.0})  ; => 0
(first #{-0.0 0})  ; => -0.0

So we can state very generally that there are equivalent sets that don’t return the same element with first.
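
A second way to probe this, offered as a hedged sketch rather than an authoritative result: elements whose hash codes collide give the implementation no hash-based way to order them, so insertion history is a plausible tiebreaker. The strings "Aa" and "BB" are a well-known pair with the same Java `hashCode`; the names `s1` and `s2` are only for illustration.

```clojure
;; "Aa" and "BB" have the same Java hashCode, so they collide in a hash set.
(def s1 (conj #{} "Aa" "BB"))
(def s2 (conj #{} "BB" "Aa"))

(println (= (.hashCode "Aa") (.hashCode "BB"))) ; the hash codes are equal
(println (= s1 s2))                             ; the sets are equal

;; Whether these two agree is an implementation detail. Compare them on
;; your own Clojure version instead of relying on either outcome.
(println (first s1) (first s2))
```

If the last line prints two different strings, that is another pair of equal sets whose `first` differs, this time without involving floating-point zeros.
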

glts
  • Apparently this specific case will be changed in Clojure 1.9, see https://dev.clojure.org/jira/browse/CLJ-1860. – glts Jun 18 '17 at 17:24
  • Thanks for the counterexample—and for showing me how to make a quick generative test! So far, I've just barely looked into spec. This is a perfect example to learn from, being directly related to what I'm doing at the moment. – Ben Kovitz Jun 18 '17 at 17:33
3

There is another alternative that may be worth mentioning: linked hash sets. Linked hash sets preserve insertion order. The element inserted first is the first when iterating over the values in the set.

If the same random seed guarantees that the elements are added to the sets in the same order on every run of the application, then (first my-linked-set) is deterministic. However, there are many situations where the same random seed does not guarantee the same insertion order, for example when multiple threads update the same set via an atom.

According to The Clojure Toolbox, a library called linked provides the linked hash set and map implementations in Clojure.
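
To make the insertion-order idea concrete without depending on any library, here is a minimal sketch of the concept. It is not the API of the linked library; `empty-oset` and `ordered-conj` are hypothetical names invented for this illustration. It pairs a vector (for order) with a plain set (for duplicate checks).

```clojure
;; Hypothetical insertion-ordered set: a vector for order, a set for membership.
(def empty-oset {:items [] :seen #{}})

(defn ordered-conj
  "Adds x unless it is already present; the first insertion fixes its position."
  [{:keys [items seen] :as oset} x]
  (if (contains? seen x)
    oset
    {:items (conj items x) :seen (conj seen x)}))

(def s (reduce ordered-conj empty-oset [:c :a :b :a :c]))

(println (:items s))         ; elements in first-insertion order, no duplicates
(println (first (:items s))) ; the element that was inserted first
```

Iterating over `(:items s)` visits elements in the order they were first added, so the traversal is reproducible whenever the insertion sequence is.
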

ez121sl
  • Thanks! I just experimented and it's looking good, but there's still one obstacle for my immediate application: [`uber/nodes`](http://engelberg.github.io/ubergraph/ubergraph.core.html#var-nodes) and [`loom/nodes`](https://github.com/aysylu/loom) appear to return sets. Time to update the question again… – Ben Kovitz Jun 19 '17 at 17:07
  • Thanks also for reminding me that when I modify the program to run on multiple cores, there may be no good way at all to keep determinism. (Ecch!) – Ben Kovitz Jun 19 '17 at 17:09