fast padded strcpy for a single word

Question

I'm trying to write a very cheap C++ code snippet to do the following operation on a short null terminated string.

The input is a string like "ABC". It is null-terminated and has a maximum length of 4 (or 5 with the null terminator).

The output goes to a char[4], which is not null-terminated and should be space-padded on the right. So in this case it would be {'A','B','C',' '}.

It is OK to assume that the input string is properly null-terminated, so there's no need to read a second word of the input to make sure; 4 bytes is the longest it can be.

So the code around it looks like this:

const char* input = "AB";
char output[4];
// code snippet goes here
// afterward output will be populated with {'A','B',' ',' '}

How cheaply can this be done? If it matters, I'm working on:

Linux 2.6.32-358.11.1.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux

Lastly, the input is word aligned.

Please be aware that fast and cheap is not the same as hacky, even if hacking may be necessary to get optimal speed. Don't forget to benchmark different solutions, it's sometimes surprising what compilers and modern processors can do ;) — Antoine, Nov 26 '13 at 18:52
Thanks very much Antoine. I'm going to do a bunch of benchmarking on these. I accepted your answer because it was the one I was looking for. I had tried a few more obvious solutions with loops but figured I'd come to stackoverflow to see if anyone could produce the branchless solution as you have. It may not actually be the one I end up using. — E_G, Nov 26 '13 at 18:55

Antoine · Accepted Answer · 2013-11-27T07:57:36.180

How about something like this:

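typedef unsigned int word;

// Assumes little-endian byte order (x86-64): the first character of the
// string is the least significant byte of the word.
word spacePad(word input) {
    static const word spaces = 0x20202020;

    // mask selects the bytes of input to keep (those before the first '\0').
    // Testing the bytes in string order also ignores anything after the '\0'.
    word mask =
       !(input & 0x000000ff) ? 0 :
       !(input & 0x0000ff00) ? 0xff :
       !(input & 0x00ff0000) ? 0xffff :
       !(input & 0xff000000) ? 0xffffff :
                               0xffffffff;
    // or without branches, assuming every byte after the '\0' is also 0:
    // the number of non-zero bytes is the length n, and the low n bytes are
    // kept. The shift is done in 64 bits because shifting a 32-bit value
    // by 32 (n == 4) is undefined behaviour.
    word branchless_mask = word((1ull << (8 * (
         bool(input & 0xff000000) +
         bool(input & 0x00ff0000) +
         bool(input & 0x0000ff00) +
         bool(input & 0x000000ff)
       ))) - 1);

    // either mask can be used here
    return (input & mask) | (spaces & ~mask);
}

For example, on little-endian x86-64 the string "AB" loads as 0x00004241, and spacePad(0x00004241) is 0x20204241, i.e. the bytes 'A', 'B', ' ', ' '.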

Instead of computing and-masks, you could use SSE intrinsics, which would probably be faster since you'd get the mask in a couple of instructions and a masked move would do the rest. However, the compiler would probably move your variables around between SSE and general-purpose registers, which could outweigh the slight gain. It all depends on how much data you need to process, how it's packed in memory, etc.
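A minimal SSE2 sketch of that idea, assuming (as the branchless mask does) that every byte after the terminator is zero and that four bytes are readable at `in`. The name `spacePadSse2` is hypothetical. Under that assumption no masked move is needed: a byte compare against zero yields the padding mask, and OR-ing spaces into those positions fills them, because the zero bytes contribute nothing to the value.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics, always available on x86-64
#include <cstdint>
#include <cstring>

// Hypothetical helper. Assumes 4 readable bytes at `in` and that every byte
// after the '\0' terminator is also '\0' (no garbage).
void spacePadSse2(const char* in, char out[4]) {
    std::int32_t word;
    std::memcpy(&word, in, 4);  // a single 32-bit load

    __m128i v = _mm_cvtsi32_si128(word);  // word in the low 4 bytes, rest 0
    // 0xff in every byte position that is zero, 0x00 elsewhere
    __m128i isZero = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    // Zero bytes contribute nothing to v, so OR-ing in spaces under the
    // mask replaces exactly the padding bytes.
    __m128i padded = _mm_or_si128(v, _mm_and_si128(isZero, _mm_set1_epi8(' ')));

    std::int32_t result = _mm_cvtsi128_si32(padded);  // low 4 bytes back out
    std::memcpy(out, &result, 4);
}
```

Whether this beats the scalar mask depends on the surrounding code, since values have to travel between general-purpose and SSE registers.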

If the input is a char* rather than an int, additional code would normally be necessary, since a cast could read into unallocated memory. But since you mention all strings are word-aligned, a cast is enough: even if a few bytes are unallocated, they are in the same word as at least one allocated byte. Since you are only reading, there's no risk of memory corruption, and on all architectures I know of, hardware memory protection has a granularity larger than a word. For instance, on x86 a memory page is typically 4 KiB in size and 4 KiB-aligned.
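A sketch of the load described above. It uses `std::memcpy` into a 32-bit word instead of a pointer cast, because a cast like `*(unsigned int*)input` technically violates C++'s strict-aliasing rule, while compilers typically turn a 4-byte `memcpy` into the same single load. `loadWord` and `storeWord` are hypothetical names, and the caller is assumed to guarantee the word-aligned, 4-readable-bytes property the question states; the bytes past the terminator still rely on the platform argument in the paragraph above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads the 4 bytes at `p` as one word. The caller must
// guarantee `p` is 4-byte aligned and that the whole word is readable, as
// the question states for its inputs.
inline std::uint32_t loadWord(const char* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % alignof(std::uint32_t) == 0);
    std::uint32_t w;
    std::memcpy(&w, p, sizeof w);  // typically compiles to one aligned load
    return w;
}

// Counterpart for the output: writes the 4 bytes of `w` to `p`, no terminator.
inline void storeWord(char* p, std::uint32_t w) {
    std::memcpy(p, &w, sizeof w);
}
```

On x86-64, which is little-endian, `loadWord` on the bytes 'A', 'B', '\0', '\0' returns 0x00004241: the first character is the least significant byte, so a spacePad-style mask has to be built from the low end of the word.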

Now, that's all nice and hacky, but before selecting a solution, benchmark it; that's the only way to know which is best for you (apart from the warm fuzzy feeling of writing code like this ^^).
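A rough micro-benchmark harness along those lines, as a sketch only: `padLoop` and `nsPerCall` are hypothetical names, the candidate shown is a plain byte loop, and the per-call figure includes the loop and checksum overhead. Timings this small are noisy, so compile with your production flags (e.g. `-O2`), repeat runs, and compare candidates on the same machine.

```cpp
#include <chrono>
#include <cstdint>
#include <cstring>

// Hypothetical candidate: a plain byte loop.
void padLoop(const char* in, char out[4]) {
    for (int i = 0; i < 4; ++i) {
        out[i] = *in ? *in++ : ' ';
    }
}

// Rough timing harness: calls fn(input, out) on every string length
// `rounds` times and returns nanoseconds per call. The checksum is written
// to `sink` so the compiler cannot discard the work.
template <class Fn>
double nsPerCall(Fn fn, long rounds, std::uint32_t& sink) {
    static const char inputs[5][4] = {
        {0, 0, 0, 0}, {'A', 0, 0, 0}, {'A', 'B', 0, 0},
        {'A', 'B', 'C', 0}, {'A', 'B', 'C', 'D'}};
    char out[4];
    std::uint32_t sum = 0;
    auto start = std::chrono::steady_clock::now();
    for (long r = 0; r < rounds; ++r) {
        for (const auto& row : inputs) {
            // Reading the pointer through a volatile hides the constant
            // inputs from the optimizer.
            const char* volatile in = row;
            fn(in, out);
            std::uint32_t w;
            std::memcpy(&w, out, 4);
            sum += w;
        }
    }
    auto stop = std::chrono::steady_clock::now();
    sink = sum;
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / (rounds * 5.0);
}
```

For example, `nsPerCall(padLoop, 10000000, sink)` returns the average time per call; swapping in another candidate with the same `(const char*, char*)` signature gives a like-for-like comparison.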

edited Nov 27 '13 at 07:57

answered Nov 26 '13 at 18:27

Antoine

Thanks. I think this is as close as I can get. I had been hoping for something completely branchless but that may not be possible. Much appreciated. I'm going to hold off for a little bit before accepting in case somebody comes up with something branchless. – E_G Nov 26 '13 at 18:31
are you sure that the last byte (after the `'\0'`) will be 0? – Glenn Teitelbaum Nov 26 '13 at 18:33
@E_G: added a branch-less version – Antoine Nov 26 '13 at 18:45
I'm having trouble following the branch-less version :) With the original version, what happens if say `input[2] == 0` and the last byte is garbage? – Jonathan Potter Nov 26 '13 at 18:51
I assume that there is no garbage, but with a little more fiddling you can handle those cases. – Antoine Nov 26 '13 at 18:54
You probably want to use an `unsigned int` instead of an `int` for the constant, return value, and the function argument. – Zac Howland Nov 26 '13 at 19:04
Neither version is really branchless. You would have to convert the `char*` to an `int` (or rather, an `unsigned int`) and have it at the proper alignment, somehow. If your `input` is `"AB"`, you would want it to be aligned as `"AB "`. All that manipulation is going to have to happen somewhere (you're just moving it outside the function, presently). – Zac Howland Nov 26 '13 at 19:10
No: a simple cast would suffice, and there's no alignment issue since the OP said the char* is word-aligned. Also, since we are word-aligned, the cast might read a couple of unallocated bytes in an allocated word, which is not nice, but it won't corrupt memory since it's a read, and it won't segfault since it's in an allocated word. – Antoine Nov 27 '13 at 07:47
@ZacHowland: changed to unsigned int, and clarified the `char*` conversion issue. Thanks. – Antoine Nov 27 '13 at 08:02
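Antoine's comment says the garbage-after-terminator case can be handled "with a little more fiddling". One branchless way to do that (a sketch, not from the thread) uses the well-known zero-byte bit trick: `(v - 0x01010101) & ~v & 0x80808080` has its lowest set bit in the first zero byte of `v`. It assumes little-endian byte order (x86-64) and GCC or Clang for `__builtin_ctzll`; `spacePadAnyTail` is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical helper: space-pads a little-endian word, ignoring whatever
// bytes follow the first '\0'.
std::uint32_t spacePadAnyTail(std::uint32_t input) {
    // Bit 7 of a byte of t is set for the first zero byte of input. Bytes
    // above it may give false positives, but only the lowest set bit is used.
    std::uint32_t t = (input - 0x01010101u) & ~input & 0x80808080u;
    // Index of the first zero byte, or 4 if there is none: OR-ing in bit 32
    // guarantees __builtin_ctzll (GCC/Clang) never sees 0.
    unsigned n = __builtin_ctzll(std::uint64_t(t) | (std::uint64_t(1) << 32)) / 8;
    // Keep the low n bytes; the 64-bit shift makes n == 4 well defined.
    std::uint32_t keep = std::uint32_t((std::uint64_t(1) << (8 * n)) - 1);
    return (input & keep) | (0x20202020u & ~keep);
}
```

On the bytes 'A', '\0', 'Z', 'Q' (the word 0x515A0041) it returns 0x20202041, i.e. "A   ", so the trailing garbage does not leak into the output.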

score 1 · Answer 2 · answered Nov 26 '13 at 18:25

If speed is your issue, use brute force.

This does not access input outside its bounds, nor does it modify it.

 const char* input = TBD();
 char output[4] = {' ', ' ', ' ', ' '};  // {' '} alone would leave [1..3] as '\0'
 if (input[0]) {
   output[0] = input[0];
   if (input[1]) {
     output[1] = input[1];
     if (input[2]) {
       output[2] = input[2];
       if (input[3]) {
         output[3] = input[3];
       }
     }
   }
 }

score 1 · Answer 3 · answered Nov 26 '13 at 18:27

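const char* input = "AB";
char output[4];

// Copy the current character (or a space once '\0' is reached), and advance
// only while the terminator has not been reached, so spaces inside the
// input are copied correctly too.
output[0] = *input ? *input : ' ';  input += *input != '\0';
output[1] = *input ? *input : ' ';  input += *input != '\0';
output[2] = *input ? *input : ' ';  input += *input != '\0';
output[3] = *input ? *input : ' ';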

Note that this advances the input pointer, so make a copy of it if you need to preserve the original.

score 1 · Answer 4 · answered Nov 26 '13 at 18:32

For short strings like this, I don't think you can do much better than the trivial implementation:

char buffer[4];

const char * input = "AB";
const char * in = input;
char * out = buffer;
char * end = buffer + sizeof buffer;

while (out < end)
{
    *out = *in != 0 ? *in++ : ' ';
    out++;
}

score 0 · Answer 5 · answered Nov 26 '13 at 18:07

If your input is null-terminated, a simple strcpy will suffice. memcpy is faster but will copy whatever garbage it finds after the null character.

answered Nov 26 '13 at 18:07

fernando.reyes

I could certainly use strcpy or memcpy, but it should be possible to make this faster because they're going to look at the next word in the case of a length-4 input. Furthermore, the output needs to get space padded, which could incur an extra cost. strncpy with a max length of 4 would avoid looking at the next word but still doesn't take care of the space padding. – E_G Nov 26 '13 at 18:09
No, strcpy will not pad the buffer like he's asking. – harald Nov 26 '13 at 18:11
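E_G's comment notes that `strncpy` with a length of 4 never reads past the fourth byte but pads with `'\0'` rather than spaces. A minimal sketch of patching that up afterwards (the name `padWithStrncpy` is hypothetical):

```cpp
#include <cstring>

// Copies at most 4 characters; strncpy fills the rest of `out` with '\0'
// and writes no terminator when the input is exactly 4 characters long.
void padWithStrncpy(const char* input, char out[4]) {
    std::strncpy(out, input, 4);
    for (int i = 0; i < 4; ++i) {
        if (out[i] == '\0') {
            out[i] = ' ';  // turn strncpy's '\0' padding into spaces
        }
    }
}
```

The fix-up loop is a fixed four iterations, which compilers can unroll, but it is still a second pass over the output compared with the word-mask approaches.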

Zac Howland · Answer 6 · 2013-11-27T14:02:18.313

You are looking for memcpy:

const char* input = "AB\0\0";
char output[4];
memcpy(output, input, 4);

If your input is variable, you'll need to calculate the size first:

const char* input = "AB";
std::size_t len = strlen(input);
char output[4] = {' ', ' ', ' ', ' '};
memcpy(output, input, std::min<std::size_t>(4, len));  // both arguments must be std::size_t

edited Nov 27 '13 at 14:02

answered Nov 26 '13 at 18:17

Zac Howland

Thanks. I'm familiar with memcpy, but I'm trying to produce something faster than this, because a) strlen might unnecessarily look at another word of input and b) this doesn't space pad the output. – E_G Nov 26 '13 at 18:19
The only way to do this faster would be to know the length ahead of time, or write your own version of `strlen` that takes a maximum length (there is a non-standard extension `strnlen` that does exactly that). – Zac Howland Nov 26 '13 at 18:57
Oh, and for b): if you want the output to end in spaces, just initialize it to spaces. `memcpy` will only overwrite the characters it needs to and leave the spaces. – Zac Howland Nov 27 '13 at 14:02
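Combining the two suggestions in these comments (a length bounded by `strnlen` and an output pre-filled with spaces) gives something like the following sketch. `strnlen` is POSIX (declared in `<string.h>` on Linux/glibc) rather than standard C++, and `padWithStrnlen` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstring>
#include <string.h>  // strnlen (POSIX, not ISO C++)

void padWithStrnlen(const char* input, char out[4]) {
    std::size_t len = strnlen(input, 4);  // never examines more than 4 bytes
    std::memset(out, ' ', 4);             // pre-fill with spaces
    std::memcpy(out, input, len);         // overwrite the leading characters
}
```

Unlike the `strlen` version, this never looks beyond the fourth byte, which addresses point a) of E_G's objection.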