Fast strlen with bit operations

Question

I found this code

int strlen_my(const char *s)
{
    int len = 0;
    for(;;)
    {
        unsigned x = *(unsigned*)s;
        if((x & 0xFF) == 0) return len;
        if((x & 0xFF00) == 0) return len + 1;
        if((x & 0xFF0000) == 0) return len + 2;
        if((x & 0xFF000000) == 0) return len + 3;
        s += 4, len += 4;
    }
}

I'd like to understand how it works. Can anyone explain it?

It trades undefined behaviour for a very questionable speedup (it is very possibly even slower). And is not standard-compliant, because it returns `int` instead of `size_t` — too honest for this site, Sep 06 '15 at 00:14
Yeah, doesn't this cause problems if the int type becomes larger than 4 bytes or if the machine is not little-endian? — Millie Smith, Sep 06 '15 at 00:14
@MillieSmith: That is the least of the problems, as most 64-bit systems are I32LP64 (POSIX). The problems are unaligned access and endianness (as you stated). Even if unaligned accesses are allowed on the platform, they can be much slower than aligned accesses. Not to mention the multiple mask and conditional operations. — too honest for this site, Sep 06 '15 at 00:16
This is probably from [here](http://www.strchr.com/optimized_strlen_function) and it does mention a lot of trade-offs with this code although it does not mention it is undefined behavior. It is usually helpful to link the source of the code. — Shafik Yaghmour, Sep 06 '15 at 00:26
@ShafikYaghmour: They just mention "it may crash ..." The article does not sound very reliable to me. Until proof, I'd say: hands off. — too honest for this site, Sep 06 '15 at 00:33
@ShafikYaghmour: It is interesting that the test code does not attempt to force misaligned accesses (AFAICT), and hence doesn't test the behaviour as thoroughly as it should. — Jonathan Leffler, Sep 06 '15 at 02:05
It doesn't work because it invokes undefined behavior (reading past the end of the string). It can also raise alignment errors depending on architecture. — Joshua, Sep 06 '15 at 02:08
It's worth noting that glibc uses a nifty bithack which can test four or eight bytes at a time (depending on the size of a long) using a single conditional to check whether none of the bytes are 0. (https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strlen.c;h=b7fd645429d2732f0793467fb3f4efc424a5e9dc;hb=HEAD#l80) On particular architectures, it can use SSE (or equivalent) or other bithacks to do even better. Moral: use the standard library. — rici, Sep 06 '15 at 03:22
A good implementation will consume odd bytes first to align the memory access and use much faster ways to check for zero bytes (like SSE4 or http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord) — phuclv, Sep 06 '15 at 06:23
If you downvoted my answer, you should at least be so kind to leave a comment why. The comment you left initially is now deleted for no reason and without providing an answer for my request for clarification. This is quite unfriendly - to say the least. — too honest for this site, Sep 09 '15 at 10:45

too honest for this site · Answer 1 · 2015-09-06T11:00:41.640

3

It accepts undefined behaviour (unaligned accesses, and a 75% probability of reading beyond the end of the array) in exchange for a very questionable speedup; it may well be slower. It is also not a drop-in replacement for the standard function, because it returns int instead of size_t. Even if unaligned accesses are allowed on the platform, they can be much slower than aligned accesses.

It also does not work on big-endian systems, or if unsigned is not 32 bits. Not to mention the multiple mask and conditional operations.

That said:

It tests four 8-bit bytes at a time by loading an unsigned (which is not even guaranteed to have more than 16 bits). Once any of the bytes contains the '\0' terminator, it returns the current length plus the position of that byte. Otherwise it increments the current length by the number of bytes tested in parallel (4) and loads the next unsigned.

My advice: this is a bad example of optimization with too many uncertainties and pitfalls. It is likely not even faster; just profile it against a plain byte-by-byte version:

size_t strlen(const char *s)
{
    size_t l = 0;
    while ( *s++ )
        l++;
    return l;
}

There might be a way to use special vector instructions, but unless you can prove this is a critical function, you should leave this to the compiler; some may unroll or speed up such loops much better.
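The profiling suggested above could be set up roughly as follows. This is only a sketch: `time_strlen`, `simple_strlen` and the `strlen_fn` typedef are names made up for illustration, and `clock()` gives coarse CPU time, so the numbers are only meaningful with a large repetition count on your own target and compiler flags.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper type: any function with strlen's shape. */
typedef size_t (*strlen_fn)(const char *);

/* Plain byte-by-byte reference implementation. */
static size_t simple_strlen(const char *s)
{
    size_t n = 0;
    while (*s++)
        n++;
    return n;
}

/* Calls fn(s) reps times and returns the elapsed CPU time in seconds.
   The last result is stored in *last_len so the caller can check it. */
static double time_strlen(strlen_fn fn, const char *s, long reps,
                          size_t *last_len)
{
    assert(fn != NULL && s != NULL && last_len != NULL);
    volatile size_t sink = 0;   /* keeps the calls from being optimized out */
    clock_t start = clock();
    for (long i = 0; i < reps; i++)
        sink = fn(s);
    clock_t end = clock();
    *last_len = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_strlen` once with `simple_strlen` and once with the candidate, on the same long string, gives two timings to compare; checking that both report the same length guards against measuring a broken implementation.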

edited Sep 06 '15 at 11:00

answered Sep 06 '15 at 00:22

too honest for this site


+1 on noting how bad this code is. One addition: most compilers will optimize the standard strlen into machine-specific assembly, which will be faster by using SSE and other extensions. – Tomer W Sep 06 '15 at 08:38

@TomerW: Thanks. As for the addition: that is implied by the last paragraph. But you should not forget that most CPUs do not have such extensions, or have ones that are of little use here. (Embedded MCUs are by far the most common CPUs, with ARM Cortex-M and similar (ColdFire, embedded PPC) already being the largest group.) – too honest for this site Sep 06 '15 at 10:53
@Kevin: I do not understand what you mean. – too honest for this site Sep 06 '15 at 15:33
@kevin: If you downvoted my answer, you should at least be so kind to leave a comment why. The comment you left is now deleted for no reason and without providing an answer for my request for clarification. This is quite unfriendly - to say the least. – too honest for this site Sep 09 '15 at 00:10

score 3 · Accepted Answer · answered Sep 06 '15 at 01:07

A bitwise AND with all ones returns the bit pattern of the other operand: 10101 & 11111 = 10101. If the result of that bitwise AND is 0, then we know the other operand was 0. A result of 0 when ANDing a single byte with 0xFF (all ones) therefore indicates a null byte.

The code itself checks each byte of the char array in four-byte partitions. NOTE: This code isn't portable; on another machine or compiler, an unsigned int could be more than 4 bytes. It would probably be better to use the uint32_t data type to ensure 32-bit unsigned integers.

The first thing to note is that on a little-endian machine, the bytes making up the character array will be read into an unsigned data type in reverse order; that is, if the four bytes at the current address are the bit pattern corresponding to abcd, then the unsigned variable will contain the bit pattern corresponding to dcba.
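To make the byte order concrete, here is a small sketch that builds the value a little-endian 32-bit load would produce, using shifts instead of a pointer cast so that it gives the same result on any machine. The helper name `load_le32` is made up for this illustration, and the expected value assumes an ASCII character set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble four bytes the way a little-endian 32-bit load would:
   the byte at the lowest address ends up in the low 8 bits. */
static uint32_t load_le32(const char *p)
{
    assert(p != NULL);
    const unsigned char *b = (const unsigned char *)p;
    return (uint32_t)b[0]
         | (uint32_t)b[1] << 8
         | (uint32_t)b[2] << 16
         | (uint32_t)b[3] << 24;
}
```

With ASCII, `load_le32("abcd")` yields 0x64636261: 'a' (0x61) sits in the lowest byte and 'd' (0x64) in the highest, which is the "dcba" pattern described above.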

The second is that a hexadecimal number constant in C results in an int-sized number with the specified bytes at the little-end of the bit pattern. Meaning, 0xFF is actually 0x000000FF when compiling with 4-byte ints. 0xFF00 is 0x0000FF00. And so on.

So the program is basically looking for the null character in each of the four possible positions. If there is no null character in the current partition, it advances to the next four-byte slot.

Take the char array abcdef for an example. In C, string constants will always have null terminators at the end, so there's a 0x00 byte at the end of that string.

It'll work as follows:

Read "abcd" into unsigned int x:

x: 0x64636261 [ASCII representations for "dcba"]

Check each byte for a null terminator:

  0x64636261
& 0x000000FF
  0x00000061 != 0,

  0x64636261
& 0x0000FF00
  0x00006200 != 0,

And check the other two positions; there are no null terminators in this 4-byte partition, so advance to the next partition.

Read "ef" into unsigned int x:

x: 0xBF006665 [ASCII representations for "fe"]

Note the 0xBF byte: it lies past the end of the string, so we are reading whatever happens to follow the string in memory. It could be anything, and reading it is undefined behavior. On a machine that doesn't allow unaligned accesses, the load can also fault whenever the address being read is not suitably aligned for an unsigned int. If there were just one character left in the string, we'd be reading two bytes past the terminator.

Check each byte for a null terminator:

  0xBF006665
& 0x000000FF
  0x00000065 != 0,

  0xBF006665
& 0x0000FF00
  0x00006600 != 0,

  0xBF006665
& 0x00FF0000
  0x00000000 == 0 !!!

So we return len + 2; len was 4 since we incremented it once by 4, so we return 6, which is indeed the length of the string.
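The four mask tests from the walkthrough can be isolated into one small function and replayed on the two example words. This is a sketch for illustration only: `first_zero_byte` and `check_walkthrough` are made-up names, and the function takes an already loaded 32-bit value, so it sidesteps the alignment and out-of-bounds problems of the original load.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the index (0..3, lowest byte first) of the first zero byte
   in x, or -1 if none of the four bytes is zero. */
static int first_zero_byte(uint32_t x)
{
    if ((x & 0x000000FFu) == 0) return 0;
    if ((x & 0x0000FF00u) == 0) return 1;
    if ((x & 0x00FF0000u) == 0) return 2;
    if ((x & 0xFF000000u) == 0) return 3;
    return -1;
}

/* Replays the two words from the walkthrough above. */
static void check_walkthrough(void)
{
    assert(first_zero_byte(0x64636261u) == -1); /* "abcd": keep going */
    assert(first_zero_byte(0xBF006665u) == 2);  /* "ef", terminator, stray byte */
}
```

The second word gives index 2, and with len already at 4 that is the returned length of 6.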

I accepted this answer because it helped me understand how the code works — Kevin, Sep 06 '15 at 17:02

chux - Reinstate Monica · Answer 3 · 2015-09-06T06:56:44.033

Code "works" by attempting to read 4 bytes at a time by assuming the string is laid out and accessible like an array of int. Code reads the first int and then each byte in turn, testing if it is the null character. In theory, code working with int will run faster than 4 individual char operations.

But there are problems:

Alignment is an issue: e.g. *(unsigned*)s may seg-fault.

Endianness is an issue: if((x & 0xFF) == 0) might not test the byte at address s.

s += 4 is a problem as sizeof(int) may differ from 4.

Array types may exceed int range, better to use size_t.

An attempt to right these difficulties.

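#include <limits.h>  // UINT_MAX, UCHAR_MAX, CHAR_BIT
#include <stddef.h>  // size_t, max_align_t (C11)
#include <stdint.h>  // uintptr_t

// Returns nonzero when s is aligned strictly enough for any type,
// which is at least as strict as what unsigned needs.
static inline int aligned_as_int(const char *s) {
  max_align_t mat; // C11
  uintptr_t i = (uintptr_t) s;
  return i % sizeof mat == 0;
}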

size_t strlen_my(const char *s) {
  size_t len = 0;
  // align
  while (!aligned_as_int(s)) {
    if (*s == 0) return len;
    s++;
    len++;
  }
  for (;;) {
    unsigned x = *(unsigned*) s;
    #if UINT_MAX >> CHAR_BIT == UCHAR_MAX
      if(!(x & 0xFF) || !(x & 0xFF00)) break;
      s += 2, len += 2;
    #elif UINT_MAX >> CHAR_BIT*3 == UCHAR_MAX
      if (!(x & 0xFF) || !(x & 0xFF00) || !(x & 0xFF0000) || !(x & 0xFF000000)) break;
      s += 4, len += 4;
    #elif UINT_MAX >> CHAR_BIT*7 == UCHAR_MAX
      if (   !(x & 0xFF) || !(x & 0xFF00)
          || !(x & 0xFF0000) || !(x & 0xFF000000)
          || !(x & 0xFF00000000) || !(x & 0xFF0000000000)
          || !(x & 0xFF000000000000) || !(x & 0xFF00000000000000)) break;
      s += 8, len += 8;
    #else
      #error TBD code
    #endif
  }
  while (*s++) {
    len++;
  }
  return len;
}

What is the use of *max_align_t mat;* in *aligned_as_int*, and what exactly does *aligned_as_int* do? — Kevin, Sep 06 '15 at 15:19
@Kevin Various platforms have alignment requirements. For example, some require all `int` variable addresses to be a multiple of 4. Before C11, determining this requirement was not portably possible. With C11, `max_align_t` is a type with the alignment requirement of the larger types. So code should go byte-by-byte until `s` is on an `int`-aligned address. Then the higher-speed `int` loop may begin. Whether all this effort is worth it remains an open question. Profiling this solution vs. `strlen()` would answer that, but the result is still platform/compiler dependent. — chux - Reinstate Monica, Sep 06 '15 at 16:02
i.e. a four-byte move from an address that is not a multiple of four can cause an alignment error, but this depends on the machine, true? — Kevin, Sep 06 '15 at 16:40
@Kevin Yes, per the example given. Another example: a machine may have an 8-byte alignment requirement yet with `int` as 4 bytes. Another: a machine may have a 1-byte alignment requirement (that is: no special requirement), but work _fastest_ with 4-byte aligned `int`. That is why this post uses a made-up function `aligned_as_int()`, as the details of alignment requirements and optimal performance are a sub-question unto itself. — chux - Reinstate Monica, Sep 06 '15 at 16:47
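As a variation on the alignment check discussed in these comments, C11 also offers `_Alignof`, which asks for the requirement of `unsigned` itself rather than that of `max_align_t`. The sketch below uses a made-up name, `is_aligned_for_unsigned`, and assumes that converting a pointer to `uintptr_t` yields the numeric address, which is common but implementation-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero when p meets the alignment requirement of unsigned.
   Assumes the uintptr_t conversion exposes the address as a number. */
static int is_aligned_for_unsigned(const void *p)
{
    assert(p != NULL);
    return (uintptr_t)p % _Alignof(unsigned) == 0;
}
```

This tells you only whether the access is permitted, not whether it is the fastest choice; as noted above, a platform may allow any address yet still prefer 4-byte aligned loads.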

score 2 · Answer 4 · answered Sep 06 '15 at 11:57

All these proposals are slower than a simple strlen().

The reason is that they do not reduce the number of comparisons and only one deals with alignment.

Look up the strlen() implementation by Torbjorn Granlund (tege@sics.se) and Dan Sahlin (dan@sics.se) on the net. If you are on a 64-bit platform, it really helps to speed things up.
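The point about reducing the number of comparisons can be illustrated with the zero-in-word test from the Bit Twiddling Hacks page linked in the comments. It answers "is any of the four bytes zero?" with a single comparison. This is a sketch of that trick only, under the assumption that it is in the same spirit as the implementation mentioned above; the name `has_zero_byte` is made up here.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if any of the four bytes of v is zero.
   Subtracting 1 from each byte borrows into its high bit only when the
   byte was zero (or already >= 0x80); "& ~v" discards the bytes whose
   high bit was set to begin with. */
static int has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

/* The two words from the accepted answer's walkthrough. */
static void check_examples(void)
{
    assert(!has_zero_byte(0x64636261u)); /* "abcd": no terminator */
    assert(has_zero_byte(0xBF006665u));  /* "ef" followed by a zero byte */
}
```

A loop built on this would run one test per word while scanning and locate the exact byte only once the test fires. The concerns about alignment and reading past the terminator apply to such a loop just as they do to the original code.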

score 1 · Answer 5 · answered Sep 06 '15 at 00:13

It detects if any bits are set at a specific byte on a little-endian machine. Since we're only checking a single byte (since all the nibbles, 0 or 0xF, are doubled up) and it happens to be the last byte position (since the machine is little-endian and the byte pattern for the numbers is therefore reversed) we can immediately know which byte contains NUL.

score 1 · Answer 6 · answered Sep 06 '15 at 00:19


The loop is taking 4 bytes of the char array for each iteration. The four if statements are used to determine if the string is over, using bitmask with AND operator to read the status of i-th element of the substring selected.

answered Sep 06 '15 at 00:19

gpicchiarelli

