
While preparing an answer to "Count how many different values a list takes in Mathematica", I came across an instability (for lack of a better term) in both DeleteDuplicates and Tally that I do not understand.

Consider first:

a = {2.2000000000000005, 2.2, 2.1999999999999999};

a // InputForm
DeleteDuplicates@a // InputForm
Union@a // InputForm
Tally@a // InputForm
   {2.2000000000000006`, 2.2, 2.1999999999999997`}
   {2.2000000000000006`, 2.2, 2.1999999999999997`}
   {2.1999999999999997`, 2.2, 2.2000000000000006`}
   {{2.2000000000000006`, 3}}

This behavior is as I expected in each case. Tally compensates for the slight numerical differences and sees each element as being equivalent. Union and DeleteDuplicates see all elements as unique. (This behavior of Tally is not documented to my knowledge, but I have made use of it before.)

Now, consider this complication:

a = {11/5, 2.2000000000000005, 2.2, 2.1999999999999997};

a // InputForm
DeleteDuplicates@a // InputForm
Union@a // InputForm
Tally@a // InputForm
   {11/5, 2.2000000000000006, 2.2, 2.1999999999999997}
   {11/5, 2.2000000000000006, 2.2}
   {2.1999999999999997, 2.2, 11/5, 2.2000000000000006}
   {{11/5, 1}, {2.2000000000000006, 1}, {2.2, 2}}

The output of Union is as anticipated, but the results from both DeleteDuplicates and Tally are surprising.

  • Why does DeleteDuplicates suddenly see 2.1999999999999997 as a duplicate to be eliminated?

  • Why does Tally suddenly see 2.2000000000000006 and 2.2 as distinct, when it did not before?


As a related point, it can be seen that packed arrays affect Tally:

a = {2.2000000000000005, 2.2, 2.1999999999999999};
a // InputForm
Tally@a // InputForm
   {2.2000000000000006, 2.2, 2.1999999999999997}
   {{2.2000000000000006`, 3}}
a = Developer`ToPackedArray@a;
a // InputForm
Tally@a // InputForm
   {2.2000000000000006, 2.2, 2.1999999999999997}
   {{2.2000000000000006`, 1}, {2.2, 2}}
Mr.Wizard

2 Answers


The exhibited behaviour appears to result from the usual woes of floating-point arithmetic, coupled with some questionable behaviour in some of the functions under discussion.

SameQ Is Not An Equivalence Relation

First on the slate: consider that SameQ is not an equivalence relation because it is not transitive:

In[1]:= $a = {11/5, 2.2000000000000005, 2.2, 2.1999999999999997};

In[2]:= SameQ[$a[[2]], $a[[3]]]
Out[2]= True

In[3]:= SameQ[$a[[3]], $a[[4]]]
Out[3]= True

In[4]:= SameQ[$a[[2]], $a[[4]]]
Out[4]= False                     (* !!! *)

So right out the gate, we are faced with erratic behaviour even before turning to the other functions.

This behaviour is due to the documented rule for SameQ that says that two real numbers are treated as "equal" if they "differ in their last binary digit":

In[5]:= {# // InputForm, Short@RealDigits[#, 2][[1, -10;;]]} & /@ $a[[2;;4]] // TableForm
(* showing only the last ten binary digits for each *)
Out[5]//TableForm= 2.2000000000000006  {0,1,1,0,0,1,1,0,1,1}
                   2.2                 {0,1,1,0,0,1,1,0,1,0}
                   2.1999999999999997  {0,1,1,0,0,1,1,0,0,1}

Note that, strictly speaking, $a[[3]] and $a[[4]] differ in the last two binary digits, but the magnitude of the difference is one bit of the lowest order.
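
To make "one bit of the lowest order" concrete, the gap between two machine reals can be measured in units in the last place (ULPs). The following is a minimal sketch, not a built-in facility: `ulp` and `ulpDistance` are hypothetical helper names, and the sketch assumes IEEE double precision (a 53-bit significand) and finite, nonzero inputs.

```mathematica
(* Hypothetical helpers, assuming IEEE doubles with a 53-bit significand. *)

(* Size of one unit in the last place at the magnitude of x.
   MantissaExponent[x, 2] returns {m, e} with x == m 2^e and 1/2 <= |m| < 1. *)
ulp[x_Real] := 2.^(Last[MantissaExponent[x, 2]] - 53)

(* Distance between x and y, measured in ULPs of x. *)
ulpDistance[x_Real, y_Real] := Abs[x - y]/ulp[x]

mid = 2.2;
up = mid + ulp[mid];     (* next machine real above 2.2 *)
down = mid - ulp[mid];   (* next machine real below 2.2 *)

ulpDistance[up, mid]     (* one step apart *)
ulpDistance[mid, down]   (* one step apart *)
ulpDistance[up, down]    (* two steps apart *)
```

In these terms, In[2] to In[4] above report that SameQ accepts the two pairs that are one step apart and rejects the pair that is two steps apart.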

DeleteDuplicates Does Not Really Use SameQ

Next, consider that the documentation states that DeleteDuplicates[...] is equivalent to DeleteDuplicates[..., SameQ]. Well, that is strictly true -- but probably not in the sense that you might expect:

In[6]:= DeleteDuplicates[$a] // InputForm
Out[6]//InputForm= {11/5, 2.2000000000000006, 2.2}

In[7]:= DeleteDuplicates[$a, SameQ] // InputForm
Out[7]//InputForm= {11/5, 2.2000000000000006, 2.2}

The same, as documented... but what about this:

In[8]:= DeleteDuplicates[$a, SameQ[#1, #2]&] // InputForm
Out[8]//InputForm= {11/5, 2.2000000000000006, 2.1999999999999997}

It appears that DeleteDuplicates goes through a different branch of logic when the comparison function is manifestly SameQ as opposed to a function whose behaviour is identical to SameQ.
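
For comparison, a naive "first come, first kept" pass with an explicit test can be sketched as below. This only illustrates pairwise deduplication; it is not the actual implementation of DeleteDuplicates, which is not documented. `pairwiseDedupe` is a hypothetical name, and `AnyTrue` assumes version 10 or later.

```mathematica
(* Hypothetical sketch: keep an element only if the test fails against
   every element kept so far. Quadratic in the length of the list. *)
pairwiseDedupe[list_List, test_] :=
 Module[{kept = {}},
  Do[
   If[! AnyTrue[kept, test[#, elem] &], AppendTo[kept, elem]],
   {elem, list}];
  kept]

pairwiseDedupe[{1, 2, 1, 3, 2}, SameQ]   (* {1, 2, 3} *)

$a = {11/5, 2.2000000000000005, 2.2, 2.1999999999999997};
pairwiseDedupe[$a, SameQ] // InputForm
```

Given the SameQ results in In[2] to In[4], such a pass would drop 2.2 (it matches the already-kept 2.2000000000000006) but keep 2.1999999999999997, which agrees with Out[8] rather than with Out[6].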

Tally is... Confused

Tally shows similar, but not identical, erratic behaviour:

In[9]:= Tally[$a] // InputForm
Out[9]//InputForm=  {{11/5, 1}, {2.2000000000000006, 1}, {2.2, 2}}

In[10]:= Tally[$a, SameQ] // InputForm
Out[10]//InputForm= {{11/5, 1}, {2.2000000000000006, 1}, {2.2, 2}}

In[11]:= Tally[$a, SameQ[#1, #2]&] // InputForm
Out[11]//InputForm= {{11/5, 1}, {2.2000000000000006, 1}, {2.2000000000000006, 2}}

That last is particularly baffling, since the same number appears twice in the list with different counts.

Equal Suffers Similar Problems

Now, back to the problem of floating point equality. Equal fares a little bit better than SameQ -- but emphasis on "little". Equal looks at the last seven binary digits instead of the last one. That doesn't fix the problem, though... troublesome cases can always be found:

In[12]:= $x1 = 0.19999999999999823;
         $x2 = 0.2;
         $x3 = 0.2000000000000018;

In[15]:= Equal[$x1, $x2]
Out[15]= True

In[16]:= Equal[$x2, $x3]
Out[16]= True

In[17]:= Equal[$x1, $x3]
Out[17]= False             (* Oops *)
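
The failure of transitivity is not specific to Equal: any comparison of the form "within some tolerance" can be chained across its own threshold. A minimal sketch with exact numbers, where `closeQ` is a hypothetical helper and the tolerance of 1 is arbitrary:

```mathematica
(* Hypothetical tolerance-based comparison. *)
closeQ[x_, y_, tol_] := Abs[x - y] <= tol

closeQ[0, 6/10, 1]       (* True *)
closeQ[6/10, 12/10, 1]   (* True *)
closeQ[0, 12/10, 1]      (* False: the two small steps add up *)
```

Widening the tolerance only moves the troublesome cases further apart; it does not remove them.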

The Villain Unmasked

The main culprit in all of this discussion is the floating-point real number format. It is simply not possible to represent arbitrary real numbers in full fidelity using a finite format. This is why Mathematica stresses symbolic form and makes every possible attempt to work with expressions in symbolic form for as long as possible. If one finds numeric forms to be unavoidable, then one must wade into that swamp called numerical analysis to sort out all of the corner cases involving equality and inequality.

Poor SameQ, Equal, DeleteDuplicates, Tally and all of their friends never stood a chance.

WReach
  • +1 - Very nice discussion, thanks! From the practical viewpoint, I'd still bet that `Equal` will handle many more cases than `SameQ` rather satisfactorily. A better-defined problem would be to define equivalence classes based on some rigid grid (set of bins) and consider two numbers equal if they end up in the same "bin". This is ad hoc, but well-defined, and perhaps not that unreasonable for many problems. – Leonid Shifrin May 29 '11 at 20:33
  • @Leonid and WReach, is `Order[#, #2]===0 &` the best option for SameQ-And-I-Really-Mean-It-This-Time? – Mr.Wizard May 30 '11 at 09:09
  • @Mr.Wizard I don't think `Order` used in the way you suggest solves any problem of the type discussed by @WReach, which is inherent to `SameQ`. Both are for symbolic comparisons. The significance of `Order` is that it is a comparison function, rather than an equivalence relation. This is a stronger condition: for example, `Union` given only `SameQ` has no choice but to be of quadratic complexity (pairwise comparisons), while with `Order` it can in principle be `n log n`, since `Order` can be used to sort the list (in practice built-in `Union` will always be quadratic in `n` with an explicit test, alas). – Leonid Shifrin May 30 '11 at 09:21
  • @Mr.Wizard The documentation for `Order` is silent on the matter, but experiments suggest that `Order` is doing a bitwise comparison of floating point numbers. In particular, it does not appear to perform the same "1 bit fudge" that `SameQ` is using. So, I would agree that `Order` appears to be a good substitute for `ReallySameFloatQ` (in the sense of the same bits) -- at least until WRI decides to change it! To be really safe, one could implement `MySameQ` that converts floats to some canonical representation of the bits before comparing them. – WReach May 30 '11 at 13:26
  • @Mr.Wizard I am actually surprised that `SameQ` *has* the 1-bit fudge -- I would have expected it to do a simple bitwise comparison, leaving the "fudging" to `Equal`. – WReach May 30 '11 at 13:26
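
The "canonical representation of the bits" idea from the comments above could be sketched with RealDigits, which Out[5] already uses to expose the differing digits. This is a hedged illustration only: `sameBitsQ` is a hypothetical name, and the sketch assumes finite machine reals.

```mathematica
(* Hypothetical helper: two machine reals count as the same only if their
   sign, binary digits, and binary exponent all agree exactly.
   RealDigits ignores the sign, so it is compared separately. *)
sameBitsQ[x_Real, y_Real] :=
 Sign[x] === Sign[y] && RealDigits[x, 2] === RealDigits[y, 2]

(* Anything that is not a pair of reals falls back to SameQ. *)
sameBitsQ[x_, y_] := x === y

up = 2.2 + 2^-51;     (* the machine real one step above 2.2 *)

sameBitsQ[2.2, 2.2]   (* True *)
sameBitsQ[2.2, up]    (* False: the last binary digit differs *)
```

Because the comparison is made on exact integer lists, it is reflexive, symmetric, and transitive, unlike the fudged comparisons discussed above.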

In my opinion, relying on any particular behaviour of Tally or DeleteDuplicates with the default (SameQ-like) comparison function on numerical values means relying on implementation details, because SameQ does not have well-defined semantics on numerical values. What you see is what is normally called "undefined behavior" in other languages. To get robust results, one should use

DeleteDuplicates[a,Equal]

or

Tally[a,Equal]

and similarly for Union (although I would not use Union, since an explicit test leads to quadratic complexity for it). On the other hand, if your desire is to understand the internal implementation details because you want to make use of them, I cannot say much except to warn that this may cause more harm than good, particularly because these implementations are subject to change from version to version, even assuming that you get all their details right for some particular version.
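
Another way to get well-defined semantics, along the lines of the fixed-grid binning idea raised in the comments on the other answer, is to compare integer bin indices instead of the floats themselves. A minimal sketch, in which `binIndex` is a hypothetical helper, the bin width `delta` is an arbitrary choice, and `DeleteDuplicatesBy` assumes version 10 or later:

```mathematica
(* Hypothetical sketch: map each number to the integer index of its bin. *)
binIndex[x_, delta_] := Round[x/delta]

delta = 10^-9;   (* arbitrary bin width, chosen for illustration *)
a = {11/5, 2.2 + 2^-51, 2.2, 2.2 - 2^-51};

Tally[binIndex[#, delta] & /@ a]              (* {{2200000000, 4}} *)
DeleteDuplicatesBy[a, binIndex[#, delta] &]   (* {11/5} *)
```

Integer comparison is exact and transitive, so the result does not depend on the order of the elements. The price is that two values lying very close together but on opposite sides of a bin boundary are treated as different.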

Leonid Shifrin
  • Off-Topic: Not one visit to this site so far has been without learning something new about mma. – nilo de roock May 29 '11 at 10:20
  • I guess "undefined behavior" is reasonable, but somehow I expected more consistency. I suppose that is the meaning of "undefined" however. I guess it's just luck that I have gotten away with using the default `Tally` the way I have. – Mr.Wizard May 29 '11 at 10:27
  • `DeleteDuplicates[a, Equal]` also seems to fall back to the default behaviour rather than giving the same result as `DeleteDuplicates[a, Equal[##]&]` – Rojo Mar 20 '13 at 18:35
  • @Rojo This is strange. Sounds like a bug. – Leonid Shifrin Mar 20 '13 at 19:22