11

This function was found here. It's an implementation of strcmp:

int strcmp(const char* s1, const char* s2)
{
    while (*s1 && (*s1 == *s2))
        s1++, s2++;
    return *(const unsigned char*)s1 - *(const unsigned char*)s2;
}

I understand everything except the last line. In short, what is going on there?

chqrlie
Cody Smith

5 Answers

5
return *(const unsigned char*)s1-*(const unsigned char*)s2;

OP: in short what is going on in the last line?

A: The last line compares the characters at the first position where the strings may differ. Both characters are read as unsigned char, as the C spec requires for strcmp. The two values are then promoted to int and their difference is returned.
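
To make those steps explicit, here is a minimal sketch that splits the last line into its parts. The name byte_difference is hypothetical (it is not in the original code), and the sketch assumes the usual case where unsigned char is narrower than int, so the promoted values always fit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the last line of the strcmp above, one step at a time. */
int byte_difference(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);

    /* Step 1: view each byte as unsigned char, so its value is never negative. */
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    /* Step 2: integer promotion. Storing into int makes explicit what the
       compiler does implicitly to both operands of '-'. */
    int a = *p1;
    int b = *p2;

    /* Step 3: subtract as int, so the result can be negative. */
    return a - b;
}
```

With this, byte_difference("a", "b") is negative, byte_difference("b", "a") is positive, and byte_difference("", "") is 0 because both pointers sit on the terminating null byte.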


Notes:

1 The return value's sign (<0, 0, >0) is the most meaningful part. It is the only part that is specified by the C spec.

2 On some systems char is signed (the more common case). On others, char is unsigned. Fixing the signedness of the final comparison promotes portability. Note that fgetc() also obtains characters as unsigned char.

3 Apart from the requirement that a string ends with \0, the character encoding in use (ASCII being the most common) makes no difference at the binary level. If the first chars that differ in 2 strings have values 65 and 97, the first string will be less than the second, even if the character encoding is non-ASCII. On the other hand, strcmp("A", "a") returns a negative number under ASCII but may return a positive number under a different character encoding, because C does not define the underlying values or order of these characters.
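
To illustrate note 2, here is a small sketch that computes the same difference twice, once reading the bytes as unsigned char and once as signed char. The function names are hypothetical, and the sketch assumes 8-bit bytes and a two's complement signed char, which holds on essentially all current platforms.

```c
#include <assert.h>
#include <stddef.h>

/* Difference of the first bytes read as unsigned char (what strcmp specifies). */
int diff_unsigned(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    return *(const unsigned char *)s1 - *(const unsigned char *)s2;
}

/* Same difference, but with the bytes forced to be read as signed char. */
int diff_signed(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    return *(const signed char *)s1 - *(const signed char *)s2;
}
```

For the bytes 0xE9 and 0x65, diff_unsigned("\xE9", "\x65") computes 233 - 101 and is positive, while diff_signed("\xE9", "\x65") computes -23 - 101 and is negative. Without the cast, the ordering of strings containing bytes above 127 would depend on the platform's choice for char.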

Marco Bonelli
chux - Reinstate Monica
2

This implementation is definitely not an optimization of the built-in strcmp; it is simply another implementation, and I believe it will most likely perform worse than the built-in version.

A comparison function is supposed to return 0 if the values being compared are equal, any negative number if the first value is smaller and any positive number if the first value is greater. And that is what happens on the last line.
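
That sign convention is what lets a comparison function plug into sorting routines. As a minimal sketch, assuming the standard qsort and the library strcmp (the helper names compare_strings and sort_strings are made up for illustration):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort hands the comparator pointers to the array elements,
   which here are themselves pointers to char. */
static int compare_strings(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    /* Only the sign of the result matters to qsort. */
    return strcmp(*sa, *sb);
}

/* Hypothetical helper: sort an array of n C strings in strcmp order. */
void sort_strings(const char **strings, size_t n)
{
    assert(strings != NULL || n == 0);
    if (n > 0)
        qsort(strings, n, sizeof strings[0], compare_strings);
}
```

Calling sort_strings on the array {"pear", "apple", "fig"} with n = 3 leaves it as {"apple", "fig", "pear"}.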

The idea of the last line is to cast the characters to unsigned char, and I believe the author meant this to sort non-standard characters after the standard ones (ASCII codes 0-127).

EDIT: there is no bug in the code. It can and will return a negative value if the value pointed to by s1 is smaller than the value pointed to by s2, and it orders standard characters before characters with code 128 and above.

Ivaylo Strandjev
  • Yeah I was wondering why the cast was there. So what is the real implementation of strcmp? – Cody Smith Nov 15 '13 at 15:31
  • @el.pescado the cast to int will happen after the value is computed. The value will already have underflown 0. – Ivaylo Strandjev Nov 15 '13 at 15:32
  • 1
    The real implementation is [here](https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcmp.c), but most architectures actually override this with a better, more specific version like [this](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/strcmp.S). – ams Nov 15 '13 at 15:36
  • 2
    I can't subscribe to this answer. http://ideone.com/crYyX7 What am I misunderstanding? – Charlie Burns Nov 15 '13 at 15:40
  • @CharlieBurns I get what you mean and I have edited my answer. Still I am not sure how this works. – Ivaylo Strandjev Nov 15 '13 at 15:43
  • So if both chars are above 128 the bug will be displayed? – Charlie Burns Nov 15 '13 at 15:54
  • @CharlieBurns there is no bug, it seems. Now, thanks to Blagovest Buyukliev's comment under the accepted answer, I know this happens due to integer promotion. – Ivaylo Strandjev Nov 15 '13 at 16:01
  • I was looking at the standard conversions, but that doesn't apply. Neither does ranking. I guess it's integer promotion http://www.idryman.org/blog/2012/11/21/integer-promotion/ . Whoops just saw your comment. Want me to delete all my comments or leave them? – Charlie Burns Nov 15 '13 at 16:05
  • All comments in this thread(except this one) are useful in the sense they add value to the question. There is no need to remove them. I will remove this comment in a while. – Ivaylo Strandjev Nov 15 '13 at 16:11
  • The standard specifically says that the comparison is done interpreting both characters as unsigned char, so the casts are not a foible of the author of the code. – Paul Hankin Jun 10 '20 at 11:29
2

I prefer this code:

int strcmp(const char *str1, const char *str2)
{
    int s1;
    int s2;
    do {
        s1 = *str1++;
        s2 = *str2++;
        if (s1 == 0)
            break;
    } while (s1 == s2);
    return (s1 < s2) ? -1 : (s1 > s2);
}

For ARMv4 it compiles to:

strcmp:
    ldrsb   r3, [r0], #1 ;r3 = *r0++
    ldrsb   r2, [r1], #1 ;r2 = *r1++
    cmp     r3, #0       ;compare r3 and 0
    beq     @1           ;if r3 == 0 goto @1
    cmp     r3, r2       ;compare r3 and r2
    beq     strcmp       ;if r3 == r2 goto strcmp
;loop is ended
@1:
    cmp     r3, r2     ;compare r3 and r2
    blt     @2         ;if r3 < r2 goto @2
    movgt   r0, #1     ;if r3 > r2 r0 = 1
    movle   r0, #0     ;if r3 <= r2 r0 = 0
    bx      lr         ;return r0
@2:
    mov     r0, #-1    ;r0 = -1
    bx      lr         ;return r0

As you can see, there are only 6 instructions in the loop plus at most 5 instructions at the end, so the cost is roughly 6 * (strlen + 1) + 5 instructions.

Moving (s1 == 0) to the while condition causes worse machine code for ARM (I do not know why).
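
Note that the version above reads each byte through plain char, so where char is signed, bytes above 127 compare as negative values. A sketch of the same loop with the bytes converted to unsigned char, which is what strcmp specifies, could look like this. The name strcmp_unsigned is hypothetical and only avoids a clash with the library function; I have not checked what machine code it produces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of the loop above that compares bytes as unsigned char. */
int strcmp_unsigned(const char *str1, const char *str2)
{
    int c1;
    int c2;

    assert(str1 != NULL && str2 != NULL);
    do {
        /* Convert to unsigned char first so the int value is in 0..UCHAR_MAX. */
        c1 = (unsigned char)*str1++;
        c2 = (unsigned char)*str2++;
        if (c1 == 0)
            break;
    } while (c1 == c2);
    /* Normalize the result to -1, 0 or 1. */
    return (c1 < c2) ? -1 : (c1 > c2);
}
```

For example, strcmp_unsigned("\xff", "\x01") returns 1, because 255 is greater than 1 once both bytes are treated as unsigned.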

  • 1
    You should cast the characters as unsigned char: `s1 = (unsigned char)*str1++;` to implement the exact semantics of `strcmp()`. – chqrlie Jul 14 '17 at 23:22
1

This implementation can be further optimized, shaving off some comparisons:

int strcmp(const char *s1, const char *s2) {
    unsigned char c1, c2;
    while ((c1 = *s1++) == (c2 = *s2++)) {
        if (c1 == '\0')
            return 0;
    }
    return c1 - c2;
}

The return value is 0 if the strings are identical up to and including the terminating null byte. The sign of the return value is that of the difference between the first differing characters, converted to unsigned char as per the C Standard.

  • If char is smaller than int, which is true on all but some rare embedded systems, this difference can be computed with a simple subtraction: both c1 and c2 are promoted to int, and the difference is guaranteed to fit in the range of type int.

  • On systems where sizeof(int) == 1, the return value should be computed this way:

    return (c1 < c2) ? -1 : 1;
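
If a single source has to cover both cases, one option is to avoid the subtraction entirely. The following is only a sketch under that assumption, with the hypothetical name strcmp_portable; it returns -1, 0 or 1, which satisfies the sign requirement on any platform.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant that never subtracts, so the result cannot overflow
   even where unsigned char is as wide as int. */
int strcmp_portable(const char *s1, const char *s2)
{
    unsigned char c1, c2;

    assert(s1 != NULL && s2 != NULL);
    while ((c1 = (unsigned char)*s1++) == (c2 = (unsigned char)*s2++)) {
        if (c1 == '\0')
            return 0;   /* both strings ended at the same position */
    }
    /* Each comparison yields 0 or 1, so the result is -1 or 1 here. */
    return (c1 > c2) - (c1 < c2);
}
```

For example, strcmp_portable("ab", "abc") returns -1 and strcmp_portable("abd", "abc") returns 1.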
    
chqrlie
0

strcmp reports which string is greater than the other, not just whether they are equal.

The last line subtracts the first non-matching character to see which is larger. If the whole string matches then it will be 0-0=0 which gives the "equal" result.

This implementation is not really well optimized, as that would take assembly code and knowledge of cache-lines, load sizes etc.

ams