-1

The usual saying is that string comparison must be done in constant time when checking things like password or hashes, and thus, it is recommended to avoid a == b. However, I run the follow script and the results don't support the hypothesis that a==b short circuit on the first non-identical character.

from time import perf_counter_ns
import random

def timed_cmp(a, b):
    start = perf_counter_ns()
    a == b
    end = perf_counter_ns()
    return end - start

def n_timed_cmp(n, a, b):
    "average time for a==b done n times"
    ts = [timed_cmp(a, b) for _ in range(n)]
    return sum(ts) / len(ts)

def check_cmp_time():
    random.seed(123)
    # generate a random string of n characters
    n = 2 ** 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])

    # generate a list of strings, which all differs from the original string
    # by one character, at a different position
    # only do that for the first 50 char, it's enough to get data
    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]

    timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
    sorted_timed = sorted(timed, key=lambda t: t[1])

    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))

    print("---")
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))

    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))

if __name__ == "__main__":
    check_cmp_time()

Here is the result of a run, re-running the script gives slightly different results, but nothing satisfactory.

# ran with cpython 3.8.3

6   78.051700
1   78.203200
15  78.222700
14  78.384800
11  78.396300
12  78.441800
9   78.476900
13  78.519000
8   78.586200
3   78.631500
---
0   80.691100
1   78.203200

I would've expected that the fastest comparison would be where the first differing character is at the beginning of the string, but it's not what I get. Any idea what's going on ???

Geekingfrog
  • 191
  • 1
  • 8
  • i don't think strings are compared character-wise, but idk i'm not a Dr – SuperStew Jul 02 '20 at 14:12
  • Why don't you just try comparing random strings with random strings ? What if you increase the number of iterations ? – Wise Man Jul 02 '20 at 14:18
  • "Any idea?" is not a real question. What's your real question? If `str == str` short circuits? Then I have to tell you that it is not defined in the Python language and a matter of implementation in the interpreter. It might change with any version. – Klaus D. Jul 02 '20 at 14:30
  • 1
    please check the answer by @JulienPalard and unmark mine from being accepted. I was misinformed and my answer is not correct. – Chase Jul 03 '20 at 09:25

2 Answers2

5

There's a difference, you just don't see it on such small strings. Here's a small patch to apply to your code, so I use longer strings, and I do 10 checks by putting the A at a place, evenly spaced in the original string, from the beginning to the end, I mean, like this:

A_______________________________________________________________
______A_________________________________________________________
____________A___________________________________________________
__________________A_____________________________________________
________________________A_______________________________________
______________________________A_________________________________
____________________________________A___________________________
__________________________________________A_____________________
________________________________________________A_______________
______________________________________________________A_________
____________________________________________________________A___
@@ -15,13 +15,13 @@ def n_timed_cmp(n, a, b):
 def check_cmp_time():
     random.seed(123)
     # generate a random string of n characters
-    n = 2 ** 8
+    n = 2 ** 16
     s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])

     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in range(0, n, n // 10)]

     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])

and you'll get:

0   122.621000
1   213.465700
2   380.214100
3   460.422000
5   694.278700
4   722.010000
7   894.630300
6   1020.722100
9   1149.473000
8   1341.754500
---
0   122.621000
1   213.465700

Note that with your example, with only 2**8 characters, it's already noticable, apply this patch:

@@ -21,7 +21,7 @@ def check_cmp_time():
     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in [0, n - 1]]
 
     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])

to only keep the two extreme cases (first letter change vs last letter change) and you'll get:

$ python3 cmp.py
0   124.131800
1   135.566000

Numbers may vary, but most of the time test 0 is a tad faster that test 1.

To isolate more precisely which caracter is modified, it's possible as long as the memcmp does it character by character, so as long as it does not use integer comparisons, typically on the last character if they get misaligned, or on really short strings, like 8 char string, as I demo here:

from time import perf_counter_ns
from statistics import median
import random


def check_cmp_time():
    random.seed(123)
    # generate a random string of n characters
    n = 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])

    # generate a list of strings, which all differs from the original string
    # by one character, at a different position
    # only do that for the first 50 char, it's enough to get data
    diffs = [s[:i] + "A" + s[i + 1 :] for i in range(n)]

    values = {x: [] for x in range(n)}
    for _ in range(10_000_000):
        for i, diff in enumerate(diffs):
            start = perf_counter_ns()
            s == diff
            values[i].append(perf_counter_ns() - start)

    timed = [[k, median(v)] for k, v in values.items()]
    sorted_timed = sorted(timed, key=lambda t: t[1])

    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))

    print("---")
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))

    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))


if __name__ == "__main__":
    check_cmp_time()

Which gives me:

1   221.000000
2   222.000000
3   223.000000
4   223.000000
5   223.000000
6   223.000000
7   223.000000
0   241.000000

The differences are so small, Python and perf_counter_ns may no longer be the right tools here.

Julien Palard
  • 8,736
  • 2
  • 37
  • 44
0

See, to know why it doesn't short circuit, you'll have to do some digging. The simple answer is, of course, it doesn't short circuit because the standard doesn't specify so. But you might think, "Why wouldn't the implementations choose to short circuit? Surely, It must be faster!". Not quite.

Let's take a look at cpython, for obvious reasons. Look at the code for unicode_compare_eq function defined in unicodeobject.c

static int
unicode_compare_eq(PyObject *str1, PyObject *str2)
{
    int kind;
    void *data1, *data2;
    Py_ssize_t len;
    int cmp;

    len = PyUnicode_GET_LENGTH(str1);
    if (PyUnicode_GET_LENGTH(str2) != len)
        return 0;
    kind = PyUnicode_KIND(str1);
    if (PyUnicode_KIND(str2) != kind)
        return 0;
    data1 = PyUnicode_DATA(str1);
    data2 = PyUnicode_DATA(str2);

    cmp = memcmp(data1, data2, len * kind);
    return (cmp == 0);
}

(Note: This function is actually called after deducing that str1 and str2 are not the same object - if they are - well that's just a simple True immediately)

Focus on this line specifically-

cmp = memcmp(data1, data2, len * kind);

Ahh, we're back at another cross road. Does memcmp short circuit? The C standard does not specify such a requirement. As seen in the opengroup docs and also in Section 7.24.4.1 of the C Standard Draft

7.24.4.1 The memcmp function

Synopsis

#include <string.h>
int memcmp(const void *s1, const void *s2, size_t n);

Description

The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2.

Returns

The memcmp function returns an integer greater than, equal to, or less than zero, accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2.

Most Some C implementations (including glibc) choose to not short circuit. But why? are we missing something, why would you not short circuit?

Because the comparison they use isn't might not be as naive as a byte by byte by check. The standard does not require the objects to be compared byte by byte. Therein lies the chance of optimization.

What glibc does, is that it compares elements of type unsigned long int instead of just singular bytes represented by unsigned char. Check out the implementation

There's a lot more going under the hood - a discussion far outside the scope of this question, after all this isn't even tagged as a C question ;). Though I found that this answer may be worth a look. But just know, the optimization is there, just in a much different form than the approach that may come in mind at first glance.

Edit: Fixed wrong function link

Edit: As @Konrad Rudolph has stated, glibc memcmp does apparently short circuit. I've been misinformed.

Chase
  • 5,315
  • 2
  • 15
  • 41
  • You are discussing `memcmp` but you are showing the source code of *`memcpy`*. And [`memcmp` *of course* short-circuits](https://github.molgen.mpg.de/git-mirror/glibc/blob/20003c49884422da7ffbc459cdeee768a6fee07b/string/memcmp.c#L324-L337). Not doing so would be phenomenally wasteful. – Konrad Rudolph Jul 02 '20 at 16:17
  • 2
    Unfortunately your edit didn’t fix the answer, and it still wrongly claiming that `memcmp` does not short-circuit, when it does. – Konrad Rudolph Jul 02 '20 at 16:52
  • 2
    Prooved it actually does in my answer. – Julien Palard Jul 03 '20 at 08:48
  • I agree with you @KonradRudolph. I was misinformed and I apologize for it. OP, please check out the other answer and unmark mine as accepted. It's factually false. – Chase Jul 03 '20 at 09:24