Understanding the strcmp function of gnu libc

Question

Here is the strcmp function that i found in the glibc:

int
STRCMP (const char *p1, const char *p2)
{
  const unsigned char *s1 = (const unsigned char *) p1;
  const unsigned char *s2 = (const unsigned char *) p2;
  unsigned char c1, c2;

  do
    {
      c1 = (unsigned char) *s1++;
      c2 = (unsigned char) *s2++;
      if (c1 == '\0')
        return c1 - c2;
    }
  while (c1 == c2);

  return c1 - c2;
}

This is a pretty simple function where the body of while initiates c1 and c2 with the value of *s1 and *s2 and continues till either c1 is nul or the values of c1 and c2 are equal, then returns the difference between c1 and c2.

What i didn't understand is the use of s1 and s2 variables. I mean other than the fact that they are unsigned char they are also const like the 2 arguments p1 and p2, so why not just use the p1 and p2 inside the body of while and cast them ? Does in this case using those 2 extra variables make the function somehow more optimized? because here is the same function for FreeBSD I found on github:

int
strcmp(const char *s1, const char *s2)
{
    while (*s1 == *s2++)
        if (*s1++ == '\0')
            return (0);
    return (*(const unsigned char *)s1 - *(const unsigned char *)(s2 - 1));
}

In their version they didn't even bother using any extra variables.

Thanks in advance for your answers.

PS: I did search on the internet about this specific fact before asking here, but i didn't got anything.

I would also like to know if there are any particular reason why glibc used those extra variables instead of casting the parameters p1 and p2 directly inside while.

would https://stackoverflow.com/questions/1356741/strcmp-and-signed-unsigned-chars?rq=1 give you an answer ? — OznOg, Oct 29 '18 at 06:30
I din't think that it is so easy to treat glibc. There are many versions of depending on architecture etc. Just look at https://sourceware.org/git/?p=glibc.git;a=tree;f=sysdeps/x86_64/multiarch;hb=921595d151ee1661cc5476bb019483e12b7b47f6 — muradm, Oct 29 '18 at 06:31
I'm curious. Why did you undo the changes I made to your question? I fixed two spelling errors, fixed one indentation error (probably caused by tabs -> spaces conversion), and removed the irrelevant first paragraph, which has nothing to do with your question. — melpomene, Oct 29 '18 at 06:50
I thought it cut the fact that i wanted to optimize the function, but i saw that i do talk about optimization again later, do you think the first part is confusing? should i redo your changes? — localhost, Oct 29 '18 at 06:53
If your main question is how to optimize the function, you're looking at the wrong code. As muradm said, check out stuff like https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/strcmp-sse42.S;h=5a0c6668a7e795a39acfd8ef2037d0720271bfff;hb=921595d151ee1661cc5476bb019483e12b7b47f6. If you're wondering why the plain C fallback version is written the way it is, optimization is not (directly) relevant. — melpomene, Oct 29 '18 at 06:59
@melpomene: Plus, GCC replaces calls to functions like `strcmp()` (full list [here](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html)) with its own built-ins, unless `-fno-builtins` option is used. — Nominal Animal, Oct 29 '18 at 08:07
Ok so I got confused here by the different chapter numbers between C99 and C17. https://stackoverflow.com/questions/1356741/strcmp-and-signed-unsigned-chars?rq=1 is really a duplicate. Some other gold badger please make the call to close, I'll step away from this. — Lundin, Oct 29 '18 at 12:31
@Lundin: OP didn't ask anything about the casts. You seem to have decided it is part of the question at hand, while a plain reading of the question shows it is not. Mis-interpreting a question to make it a duplicate of another is pretty obnoxious behaviour from a gold badger. — Nominal Animal, Oct 29 '18 at 13:32

Antti Haapala -- Слава Україні · Answer 1 · 2018-10-29T07:04:37.113

You're correct of course. One of the casts should be sufficient. Especially if the pointer is cast, casting the retrieved value is a no-op.

Here is the x86-64 compiled with gcc -O3 for the one with unnecessary cast:

STRCMP:
.L4:
        addq    $1, %rdi
        movzbl  -1(%rdi), %eax
        addq    $1, %rsi
        movzbl  -1(%rsi), %edx
        testb   %al, %al
        je      .L7
        cmpb    %dl, %al
        je      .L4
        subl    %edx, %eax
        ret
.L7:
        movzbl  %dl, %eax
        negl    %eax
        ret

and here's the one without the unnecessary cast:

STRCMP:
.L4:
        addq    $1, %rdi
        movzbl  -1(%rdi), %eax
        addq    $1, %rsi
        movzbl  -1(%rsi), %edx
        testb   %al, %al
        je      .L7
        cmpb    %dl, %al
        je      .L4
        subl    %edx, %eax
        ret
.L7:
        movzbl  %dl, %eax
        negl    %eax
        ret

They're identical

However there is one gotcha, that is now mostly of historical interest. If char is signed and the signed representation is not two's complement,

*(const unsigned char *)p1

and

(unsigned char)*p1

are not equivalent. The former reinterprets the bit pattern, while the latter converts the value using modulo arithmetic. This is of only historical interest since not even GCC supports any architecture that doesn't have 2's complement signed representation. And it is the compiler with most ports.

The former is apparently not compliant. See C17 7.24.4 "The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared." This appears to be the reason why they force a conversion to unsigned char, as per your second example. — Lundin, Oct 29 '18 at 12:26
Yeah it seems a bit confused. They could just have written `c1 = (unsigned char) *p1++;`. — Lundin, Oct 29 '18 at 12:51

Nominal Animal · Accepted Answer · 2018-10-29T13:28:59.533

2

What i didn't understand is the use of s1 and s2 variables. I mean other than the fact that they are unsigned char they are also const like the 2 arguments p1 and p2, so why not just use the p1 and p2 inside the body of while and cast them ?

For readability; to make it easier for us humans to maintain the code.

If you look at glibc sources, the code tends to readability rather than concise expressions. It seems to be a good policy, because it has kept it relevant and vibrant (actively maintained) for over 30 years now.

Does in this case using those 2 extra variables make the function somehow more optimized?

No, not at all.

I would also like to know if there are any particular reason why glibc used those extra variables instead of casting the parameters p1 and p2 directly inside while.

For readability only.

The authors know that the C compiler used should be able to optimize this code just fine. (And it is easy to prove that is the case, just by looking at the code compiler generateds. For GCC, you can use the -S option, or you can use the binutils' objdump -d to examine an object file or a binary executable.)

Note that the casts to unsigned char are required for the exact same reasons as they are for isspace(), isalpha() et cetera: the character codes compared must be treated as unsigned char for correct results.

edited Oct 29 '18 at 13:28

answered Oct 29 '18 at 08:14

Nominal Animal

38,216
5
59
86

2

How does adding a bunch of extra casts improve readability? It's rather the other way around. – Lundin Oct 29 '18 at 11:33
1

@Lundin: It is the opinion of the GNU C library developers that it does; it is part of their style. They do use that same style quite uniformly. It is not my preference, and I personally believe there are better styles, but that is the reasoning. (The longevity of the project indicates they're not *utterly* wrong.) How should I reword my answer to make this clearer? – Nominal Animal Oct 29 '18 at 11:36
That's just subjective hogwash... do you have any source for this? – Lundin Oct 29 '18 at 12:12
Nevermind. The rationale is C17 7.24.4 "The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared." So your answer is incorrect and this had nothing to do with readability, only standard compliance. – Lundin Oct 29 '18 at 12:27
2

@Lundin: Do you realize that OP did **not** ask if the casts were necessary, or why they are used? Your "subjective hogwash" is completely in your own mind, you even invented your own question. Please reread the question, then re-evaluate your actions. Thanks for the downvote. – Nominal Animal Oct 29 '18 at 13:30
1

The OP asked about the presence of the 2 extra variables. They are completely superfluous and only serves one purpose: to make the code harder to read. A more readable version would have been written as `c1 = *p1++;`. – Lundin Oct 29 '18 at 15:26
@Lundin: That I agree with. I only claim that their *intended purpose* is, from glibc developers' perspective, to make the code easier to read/maintain. A patch to actually clean it up might even get accepted in a year or two. – Nominal Animal Oct 30 '18 at 09:07
But where do you get this readability argument from? Do they have a coding standard, source code comments or other documentation? – Lundin Oct 30 '18 at 11:50
@Lundin: From the [history](https://sourceware.org/git/?p=glibc.git;a=history;f=string/strcmp.c) of the file; it includes minor cleanups by Will Newton, for example. The original version is (by Roland McGrath, I assume) quite [similar](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/generic/strcmp.c;hb=2e0367958d92be7c63116795991cff404cf1f010). Although the [Style guide](https://sourceware.org/glibc/wiki/Style_and_Conventions) does not explicitly say which style one should use, the same style (temporary variables preferred over casting) is maintained; ... – Nominal Animal Oct 30 '18 at 13:38
... for patches to be accepted. Compare to e.g. the [strcspn()](https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcspn.c;hb=HEAD) implementation. A better reference is, I believe, the mailing list discussions on [coding style](https://sourceware.org/cgi-bin/search.cgi?wm=wrd&form=extended&m=all&s=D&ul=%2Fml%2Flibc-alpha%2F%25&q=coding%20style) in submitted patches on the libc-alpha mailing list. (Their [Contribution checklist](https://sourceware.org/glibc/wiki/Contribution%20checklist) etc. is at sourceware.org.) – Nominal Animal Oct 30 '18 at 13:42
Will Newton only cleaned up the ancient "K&R style", he didn't add the extra variables. They seem to have been there since the dawn of time. The fact that the file contained K&R style until year 2014 probably says a lot about the general quality of the code base... – Lundin Oct 30 '18 at 14:11
@Lundin: No, I meant that that code was deemed acceptable: only minor cleanup patches were needed/accepted. The glibc developers in general are a very slow, careful bunch in their response to patches, but their track record (see [bugzilla](https://sourceware.org/bugzilla/buglist.cgi?product=glibc&component=libc&resolution=---)) is better than one might infer from the coding style. (I think I have a few patches of my own there, but only the gettext ones were painless. I've even thought of writing my own syscall wrappers to avoid the c library.) – Nominal Animal Oct 30 '18 at 14:40

Understanding the strcmp function of gnu libc

2 Answers2