Why is strlen() about 20 times faster than manually looping to check for null-terminated character?

Question

The original question was badly received and got many downvotes, so I have revised it to make it easier to read and, hopefully, more useful to anyone who finds it. The original question was why strlen() was 20 times faster than manually looping through the string to find the '\0' character. I thought the question was well founded, since everything I had read said that strlen() finds the string length essentially by looping until it reaches the null terminator '\0'. This is a common criticism of C strings, for more reasons than one. As many people pointed out, the functions in the C library are written by smart programmers to maximise performance.

Thanks to ilent2, who linked me to a VERY clever way of using bitwise operators to check 8 bytes at once, I managed to get something that runs faster than strlen() on strings longer than about 8 to 15 characters, and many times faster when the string is considerably longer. Strangely, the time strlen() takes seems to grow linearly with the length of the string. The custom version, on the other hand, takes pretty much the same amount of time no matter the string length (I tested up to a couple of hundred characters). My results are rather surprising; I ran the tests with optimisation turned OFF, and I don't know how valid they are. Thanks a lot to ilent2 for the link, and to John Zwinck. Interestingly, John Zwinck suggested SIMD as a possible reason why strlen() might be faster, but I don't know anything about that.
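For readers wondering what "checking 8 bytes at once" can look like, here is a minimal sketch in C. It is not the code from the question (which was never posted) and not the glibc implementation; the names has_zero_byte and strlen_words are made up for illustration. It assumes a platform where reading a whole aligned 8-byte word that contains the terminator is harmless, which is what library implementations rely on but which standard C does not strictly guarantee when the word extends past the end of the object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the 8 bytes in w is 0x00 (the well-known "has zero byte" bit trick). */
static int has_zero_byte(uint64_t w)
{
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

/* Hypothetical name: a sketch of a word-at-a-time strlen. */
size_t strlen_words(const char *s)
{
    assert(s != NULL);
    const char *p = s;

    /* Walk byte by byte until p is 8-byte aligned, so that every later
       8-byte load stays inside a single aligned word. */
    while (((uintptr_t)p & 7u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    /* Scan one aligned 8-byte word per iteration. */
    for (;;) {
        uint64_t w;
        memcpy(&w, p, sizeof w);   /* avoids aliasing problems; usually becomes one load */
        if (has_zero_byte(w))
            break;
        p += sizeof w;
    }

    /* The terminator is somewhere in this word; find its exact position. */
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}
```

Note that this still takes time proportional to the string length, just with roughly one loop iteration per 8 characters instead of one per character.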

Your implementation for counting character uses two additions per loop. I can think of a way using only 1 addition. — ilent2, May 22 '16 at 02:56
One is a library call to an optimized library, the other is some lump of unoptimized assembly on top of a poorly optimized algorithm? This is like asking "why does using the oven cook eggs faster than putting them in a bag next to the fridge?" — Yakk - Adam Nevraumont, May 22 '16 at 02:58
I would also expect that with optimisations turned on that the compiler might be able to identify that calculating the string length can be done at compile time, so this isn't the best example. A better example might be to first load strings of different lengths into memory (from a file or somewhere else) and then determine their length. — ilent2, May 22 '16 at 03:00
If you are curious, take a look at a glibc implementation: https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strlen.c;h=5f22ce95097d4090c6c32fc7cf6c2ef9cf6e86a8;hb=24c0bf7a76ecec65aca0dbce1f7ebb8f68425dc2 — ilent2, May 22 '16 at 03:09
@Yakk Thanks for your constructive help. Everywhere I've read says that strlen uses loop counting to find the null terminator. Shaving off one addition still can't account for it being about 1800% slower. John Zwinck's answer was helpful in that he suggested SIMD operations, i.e., processing more than one char at once. As you know, there is NO official documentation of how strlen() works. Information on this usually involves people digging into the assembly to find out. Your comment is nothing other than a jab at me for not knowing as much as you. I thought that's what SO was for — Zebrafish, May 22 '16 at 07:39
@ilent2 Thanks for your help. I was wondering how you could do it with one addition. I did come up with a one-addition answer, that is, arrayCharCount = 0; while (*arrayStr) arrayStr++; arrayCharCount = arrayStr - startOfString; arrayStr = startOfString; But it gave me the same result — Zebrafish, May 22 '16 at 08:07
Every single "why is X faster" question on Stack Overflow in C++ points out that testing for "faster" without optimization is pointless. SO is about answering questions of people who first search and see if it was already answered: the point is to build Q&A for the *next* person, not you. "Speed is meaningless when not optimized" has been answered 1000s of times, which makes this question not useful, hence your flurry of downvotes. I tried to explain why your "but I am special, not optimizing makes sense here" is not an exception. — Yakk - Adam Nevraumont, May 22 '16 at 10:04
@Yakk If you want to measure the time it takes a computer to do a calculation over and over again, how are you going to get accurate measurement results if you allow the compiler to CHANGE your code so that your code runs one tenth the number of times you intended, or not at all? — Zebrafish, May 22 '16 at 10:29
Note: the quote is slightly misquoted https://en.wikisource.org/wiki/Gettysburg_Address — chux - Reinstate Monica, May 22 '16 at 13:16
I down-voted because the question does not even show the code snippets being compared. — Ben Jones, Oct 26 '22 at 20:55

score 6 · Accepted Answer · answered May 22 '16 at 03:00

strlen() is a very heavily used function, and you can bet that several very bright people have spent days and months optimizing it. Once you get your algorithm right, the next question is: can you check multiple bytes at once? The answer, of course, is that you can, using SIMD (SSE) or other tricks. If your processor can operate on 128 bits at a time, that's 16 characters per operation instead of 1.
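To make the SIMD idea concrete, here is a minimal sketch using SSE2 intrinsics from <emmintrin.h>. It is only an illustration of comparing 16 bytes per step, not how any particular C library implements strlen(); the name strlen_sse2 and the HAVE_SSE2 macro are made up. On targets without SSE2 it falls back to a plain byte loop. As with any approach that reads in aligned blocks, it can read a few bytes past the terminator within the same 16-byte block, which is harmless in practice but not something standard C promises.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical name: a sketch of a 16-bytes-at-a-time strlen. */
size_t strlen_sse2(const char *s)
{
    assert(s != NULL);
    const char *p = s;

#if HAVE_SSE2
    /* Advance byte by byte to a 16-byte boundary, so the aligned loads
       below never straddle two blocks. */
    while (((uintptr_t)p & 15u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p); /* load 16 bytes */
        __m128i eq    = _mm_cmpeq_epi8(chunk, zero);         /* 0xFF where a byte is 0 */
        int mask      = _mm_movemask_epi8(eq);               /* one bit per byte */
        if (mask != 0)
            break;                                           /* terminator is in this block */
        p += 16;
    }
#endif

    /* Either SSE2 is unavailable, or the terminator is within the next 16 bytes. */
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}
```

The final byte loop could be replaced by a count-trailing-zeros on the mask, but that needs a compiler-specific builtin, so the sketch keeps the portable version.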

John Zwinck

I wrote something based on the link that ilent2 gave. It checks 8 bytes at once, and in my testing so far it's 4 TIMES faster than strlen(). I'm so happy, but there's more work to do. It'd be interesting to use SIMD, but I wouldn't have a clue how to use it. – Zebrafish May 22 '16 at 17:39

score 0 · Answer 2 · answered Aug 14 '23 at 10:13

Here is how strlen() works in MSVC:

; Function compile flags: /Ogtpy
; File D:\P\MT\prftst.cpp
;   COMDAT ?testR@@YAXXZ
_TEXT   SEGMENT
len$ = 8
?testR@@YAXXZ PROC                  ; testR, COMDAT

; 44   :    volatile ui64 len = strlen(str);

  00000 48 8d 0d 00 00
    00 00        lea     rcx, OFFSET FLAT:?str@@3PADA ; str
  00007 48 c7 c0 ff ff
    ff ff        mov     rax, -1
  0000e 66 90        npad    2  ; >>> xchg  ax,ax 
$LL3@testR:
  00010 48 ff c0     inc     rax
  00013 80 3c 01 00  cmp     BYTE PTR [rcx+rax], 0
  00017 75 f7        jne     SHORT $LL3@testR
  00019 48 89 44 24 08   mov     QWORD PTR len$[rsp], rax

; 45   : }

  0001e c3       ret     0
?testR@@YAXXZ ENDP                  ; testR
_TEXT   ENDS

You don't need to be fluent in assembly to follow it. The algorithm is very simple: it loops over the characters one at a time and tests each one against 0. I think the compiler actually inlines this function every time it sees it. In this build the call does not go to a library routine; the code is generated by the compiler itself.
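For anyone who prefers C to assembly, the loop in the listing corresponds roughly to the following. This is my own transcription, not compiler output, and the name strlen_bytewise is made up; the comments map each statement to the instruction it mirrors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name: a C rendering of the byte-at-a-time loop in the listing. */
size_t strlen_bytewise(const char *str)
{
    assert(str != NULL);
    size_t n = (size_t)-1;      /* mov rax, -1 */
    do {
        ++n;                    /* inc rax */
    } while (str[n] != '\0');   /* cmp BYTE PTR [rcx+rax], 0 ; jne */
    return n;                   /* rax holds the length */
}
```

Starting the counter at -1 lets the loop increment before the comparison, so there is a single add and a single compare per character.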

Also note that if you pass strlen() a string whose contents are known at compile time (for example, a const char * initialised from a literal), the compiler will cheat and do this:

; Function compile flags: /Ogtpy
; File D:\P\MT\prftst.cpp
;   COMDAT ?testR@@YAXXZ
_TEXT   SEGMENT
len$ = 8
?testR@@YAXXZ PROC                  ; testR, COMDAT

; 60   :    volatile ui64 len = strlen(str200);

  00000 48 c7 44 24 08
    c8 00 00 00  mov     QWORD PTR len$[rsp], 200 ; 000000c8H

; 61   : }

  00009 c3       ret     0
?testR@@YAXXZ ENDP                  ; testR
_TEXT   ENDS

Yep. It simply pasted in the length of the string literal, which was known at compile time!

I think this might actually be the reason your test results are so weird. Always test strlen() on a char[] array that is not initialised with a literal. memset() the array in main() (leaving a terminating '\0' at the end); that way the compiler does not know the length of the string and is forced to count it at runtime.

Also, always store the strlen() result in a volatile variable; this forces the compiler to actually compute the length.

Use #pragma optimize( "", off ) and #pragma optimize( "", on ) around your loop function, and from it call wrapper functions that contain the actual code you are testing. These wrapper functions must have the __declspec(noinline) specifier.
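Putting that advice together, a benchmark skeleton might look something like the sketch below. All names (NOINLINE, BUF_SIZE, fill_buffer, measured_strlen, time_strlen) are made up for illustration. Two things go beyond what is described above and are my own assumptions: the GCC/Clang __attribute__((noinline)) branch, so the sketch also builds outside MSVC, and passing the pointer through a volatile variable, so the compiler cannot prove the argument is the same on every iteration and hoist the call out of the loop. The #pragma optimize lines are left out because they are MSVC-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))   /* assumption: GCC or Clang */
#endif

enum { BUF_SIZE = 4096 };
static char buffer[BUF_SIZE];

/* Fill the buffer at run time so its length is not a compile-time constant. */
static void fill_buffer(size_t len)
{
    assert(len < BUF_SIZE);
    memset(buffer, 'x', len);
    buffer[len] = '\0';
}

/* Wrapper around the code under test; kept out of line on purpose. */
NOINLINE size_t measured_strlen(const char *s)
{
    return strlen(s);
}

/* Returns elapsed CPU seconds for `iterations` calls on a string of
   length `len`; the last measured length is written to *out_len. */
double time_strlen(size_t len, long iterations, size_t *out_len)
{
    volatile size_t sink = 0;            /* volatile: every store must happen */
    const char *volatile src = buffer;   /* volatile: pointer is re-read each iteration */

    fill_buffer(len);
    clock_t start = clock();
    for (long i = 0; i < iterations; ++i)
        sink = measured_strlen(src);
    clock_t end = clock();

    *out_len = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A caller would run time_strlen() for several lengths (say 16, 200 and 4000) with enough iterations for the clock() resolution to matter, and compare the times against the same harness wrapped around the hand-written loop.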
