2

I need to process a lot of these in (more or less) real-time. The method I'm currently using is not cutting it anymore.

std::string parse_ipv4_address( const std::vector<unsigned char> & data, int start )
{
    char ip_addr[16];
    snprintf( ip_addr, sizeof(ip_addr), "%d.%d.%d.%d", 
        data[start + 0], data[start + 1], data[start + 2], data[start + 3] );
    return std::string( ip_addr ); 
}

// used like this
std::vector<unsigned char> ip = { 0xc0, 0xa8, 0x20, 0x0c };
std::string ip_address = parse_ipv4_address( ip, 0 );

std::cout << ip_address << std::endl; // not actually printed in real code 
// produces 192.168.32.12

Is there a faster way to do it? And how?

NOTE! Performance is the issue here so this issue is not a duplicate.

Community
  • 1
  • 1
Ragnar
  • 1,387
  • 4
  • 20
  • 29
  • This is not C , for sure.... – Sourav Ghosh Dec 17 '15 at 12:10
  • My code is not C for sure ... but a solution in C is acceptable as well ... You may switch the vector to char array and std::cout to printf. – Ragnar Dec 17 '15 at 12:13
  • 2
    you could build a vector of 4-byte strings (with extra trailing '\0's to fill the 4 bytes when necessary), and index into that with each address byte * 4 – Walter Tross Dec 17 '15 at 12:14
  • 5
    Are you absolutely sure the bottleneck is in that method? It doesn't look like you can squeeze much more performance out of it. – Frédéric Hamidi Dec 17 '15 at 12:14
  • 2
    Possible duplicate of [convert string to ipaddress](http://stackoverflow.com/questions/5328070/convert-string-to-ipaddress) – Danh Dec 17 '15 at 12:15
  • @WalterTross as example might help here :) – Ragnar Dec 17 '15 at 12:15
  • @FrédéricHamidi Yes, I am ... it is most executed method and smallest performance gain here would have biggest net effect – Ragnar Dec 17 '15 at 12:16
  • @Danh He is converting IP address to string, not string to IP address. – erip Dec 17 '15 at 12:16
  • I think it's trivial, but why are you writing `data[start + 0]` instead of just `data[start]`? I don't think it'll help much, if at all due to compiler optimizations. – erip Dec 17 '15 at 12:17
  • 1
    @FrédéricHamidi: `snprintf` is fairly slow-ish. And constructing a `std::string` object (that needs to allocate heap memory) isn't cheap either. Not passing an explicit length argument will not favorably affect performance either. In this case, using a lookup table instead of `snprintf` may be faster. – IInspectable Dec 17 '15 at 12:17
  • @erip in that question, they asked for 2 ways conversion. And they answered 2 ways conversion – Danh Dec 17 '15 at 12:18
  • @Danh performance doesn't seem to be he's problem ... I already have a solution but its not fast enough. – Ragnar Dec 17 '15 at 12:20
  • @erip the `data[start + 0]` is copy-paste mistake ... its not in real code. – Ragnar Dec 17 '15 at 12:21
  • I think os implementation should be the best implementation – Danh Dec 17 '15 at 12:22
  • 1
    Don't use `vector` to represent the 4-bytes of an IP address. Just use char ip[4], or even a 32-bit integer. – Roddy Dec 17 '15 at 12:24
  • @ragnar How often are you calling it every second, and how many microseconds does your function take? (just `parse_ipv4_address`, not the printing or vector setup...) – Roddy Dec 17 '15 at 12:29
  • @Roddy ... the requirement is about 10000 times a second and currently it takes 3 ms. – Ragnar Dec 17 '15 at 12:34
  • 1
    @Ragnar. Your code on ideone. Runs 1million calls in 0.62 seconds. = 0.6 microseconds per call. How come yours is 5000 times slower? http://ideone.com/lbOJRI – Roddy Dec 17 '15 at 12:41
  • @Roddy I suspect the compiler figures out that You're using the same IP address here and 0.62 seconds is actually spent on std::cout ... which I don't use in real code. – Ragnar Dec 17 '15 at 13:25
  • 1
    @Ragnar, No, that's really the sprintf time. I'd go and recheck your measurements, because unless you're running on an early 1980's 8-bit micro, 3 milliseconds makes no sense. – Roddy Dec 17 '15 at 13:36

5 Answers5

5

Here are the potential candidates that impact performance:

  • snprintf needs to parse the format string, and performs error handling. Either costs time, to implement features you don't need.
  • Constructing a std::string object on return is costly. It stores the controlled sequence in freestore memory (usually implemented as heap memory), that is somewhat costly to allocate in C++ (and C).
  • Use of a std::vector to store a 4-byte value needlessly burns resources. It, too, allocates memory in the freestore. Replace that with char[4], or a 32-bit integer (uint32_t).

Since you don't need the versatility of the printf-family of functions, you might drop that altogether, and use a lookup-table instead. The lookup table consists of 256 entries, each of which holding the visual representation of the byte values 0 through 255. To optimize this, have the LUT contain a trailing . character as well. (Special care needs to be taken, to address endianness. I'm assuming little-endian here.)

A solution might look like this1):

const uint32_t mapping[] = { 0x2E303030, // "000."
    0x2E313030, // "001."
    // ...
    0x2E343532, // "254."
    0x2E353532  // "255."
};

alignas(uint32_t) char ip_addr[16];
uint32_t* p = reinterpret_cast<uint32_t*>(&ip_addr[0]);
p[0] = mapping[data[0]];
p[1] = mapping[data[1]];
p[2] = mapping[data[2]];
p[3] = mapping[data[3]];

// Zero-terminate string (overwriting the superfluous trailing .-character)
ip_addr[15] = '\0';

// As a final step, waste all the hard earned savings by constructing a std::string.
// (as an ironic twist, let's pick the c'tor with the best performance)
return std::string(&ip_addr[0], &ip_addr[15]);

// A more serious approach would either return the array (ip_addr), or have the caller
// pass in a pre-allocated array for output.
return ip_addr;


1) Disclaimer: Casting from char* to uint32_t* technically exhibits undefined behavior. Don't use, unless your platform (compiler and OS) provide additional guarantees to make this well defined.
IInspectable
  • 46,945
  • 8
  • 85
  • 181
  • Lookup table might be a way to go but is solves only half the problem ... optimal string construction is another since numbers vary in length from 1 to 3. – Ragnar Dec 17 '15 at 12:29
  • 1
    Why don't you just use string literals in your mapping array? – MikeMB Dec 17 '15 at 12:41
  • @MikeMB: Using string literals deals with endianness issues. Constructing objects of type `const char[4]` (so that they can be copied as a 4-byte integer) from a string literal like `"128."` requires more conversions, one way or another. Since this is a one-off array that (presumably) will not get modified, readability concerns may not matter all that much. – IInspectable Dec 17 '15 at 12:57
  • @Ragnar: What is this second half you are talking about, that you feel this answer should address? – IInspectable Dec 17 '15 at 13:01
  • I don't quite get your point. The OP wants to have a string in the end anyway, you can just copy the bytes - granted the syntax might be a little less pretty. But I'm pretty sure, that the way you wrote the example, your reinterpret_cast invokes undefined behavior. In reality on x86 its probably no problem, but you should at least make sure that `ip_addr` is properly aligned. I think what Ragnar meant is that e.g. `0` is not supposed to be printed as `"000"`. Which is another reason, why I'd use chars instead, so you can copy different number of bytes. – MikeMB Dec 17 '15 at 13:10
  • @MikeMB: You have to strike a balance here. Since I read the question to mean, that performance is the paramount goal, I took the freedom to produce a semantically identical string representation, that does deviate (cosmetically) from the code in the question. Using char arrays to implement variable sized portions requires comparisons, that will eventually lead to branching. A missed branch prediction generally trashes performance. – IInspectable Dec 17 '15 at 13:15
  • 1
    Use `memcpy` to not break the strict-aliasing rule. – edmz Dec 17 '15 at 13:21
  • Even if your platform guarantees that this is well defined, unaligned access is usually rather slow - you can just add `alignas(4)` to circumvent that problem – MikeMB Dec 17 '15 at 13:21
  • @black: Then there is literally not reason to first cast to uint32_t*, just to cast it to void* (and internally treat it as unsigned char*). – MikeMB Dec 17 '15 at 13:25
  • @IInspectable: You are right, that this is faster, but the way I read Ragnar's comment (its not a requirement stated in the question), it would also be wrong. – MikeMB Dec 17 '15 at 13:27
  • @IInspectable. I did a small test http://ideone.com/2YlV4k (unfortunately it runs out of time in ideone) and run both parse_ipv4_address1 (the old one) and parse_ipv4_address2 (Yours) in Visual Studio diagnostics 3 times. Performance gain from Your solution was about 5 times :) Thank You for great solution! – Ragnar Dec 17 '15 at 14:52
  • I added an answer as an extension of this one, that does away with casting and endian issues. – fret Oct 16 '18 at 23:26
  • `p[0] = mapping[data[0]];` causes undefined behaviour by violating the strict aliasing rule. This is a separate issue to the one you mention in your footnote (i.e. whether the cast is well-defined) – M.M Oct 17 '18 at 00:43
  • @m.m: Would `uint32_t ip_addr[4];` and removing the following `reinterpret_cast` fix this? The cast on return (`return reinterpret_cast(&ip_addr[0]);`) is well defined, as far as I understand. – IInspectable Oct 17 '18 at 12:36
2

Three Four answers for the price of one.

First, make really, really sure that you're optimizing the right part. Both std::vector and std::string creation involve memory allocations, and cout << could involve file access, graphics, etc!

Second: Don't use vector to represent the 4-bytes of an IP address. Just use char ip[4], or even a 32-bit integer

Third: I'm guessing that you're not dealing with totally random IP addresses. Probably a few hundred or thousand different addresses? In which case, use a std::map<INT32, std::string> as a cache, and just pull the required ones from the cache as required, writing new ones in as needed. If the cache gets too big, just empty it and start over...


Fourth: Consider writing the IP address in Hexadecimal dotted quad notation. This is still accepted by calls like inet_addr() and has several advantages: All fields are fixed width, there are only 8 'characters' to update, and the binary to Hex conversion is usually faster than binary to decimal. https://en.wikipedia.org/wiki/IPv4#Address_representations
Roddy
  • 66,617
  • 42
  • 165
  • 277
  • 2
    +1, but I'm not sure that caching would do any good, once you optimized away `std::vector` with a 32 bit integer, `std::string` by writing directly in the output buffer and killed the `sprintf` with ad-hoc code (or maybe a 256-strings lookup table if it turns out to be faster) the whole thing may be faster (and have a better locality) than a map lookup. – Matteo Italia Dec 17 '15 at 12:46
  • @MatteoItalia. Agreed. But I like people to consider both pre-calculation and caching as optimization techniques early on. – Roddy Dec 17 '15 at 13:09
  • 1
    Yep, but it's important to know also of its dark sides, namely: how to decide when the cache is too big? is caching actually doing any good? and my personal favorite: how to deal with threads? – Matteo Italia Dec 17 '15 at 13:20
1

Lookup table could be of use here (initialized on program start). I guess you already have profiling configured so I didn't profile solution and wonder what would be the results so please share when you get some.

char LOOKUP_TABLE[256][4];

void init_lookup_table() {
    char digits[] = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};

    for (int i = 0; i < 10; ++i) {
        LOOKUP_TABLE[i][0] = digits[i % 10];
        LOOKUP_TABLE[i][1] = '\0';
        LOOKUP_TABLE[i][2] = '\0';
        LOOKUP_TABLE[i][3] = '\0';
    }

    for (int i = 10; i < 100; ++i) {
        LOOKUP_TABLE[i][0] = digits[(i / 10) % 10];
        LOOKUP_TABLE[i][1] = digits[i % 10];
        LOOKUP_TABLE[i][2] = '\0';
        LOOKUP_TABLE[i][3] = '\0';
    }
    for (int i = 100; i < 256; ++i) {
        LOOKUP_TABLE[i][0] = digits[(i / 100) % 10];
        LOOKUP_TABLE[i][1] = digits[(i / 10) % 10];
        LOOKUP_TABLE[i][2] = digits[i % 10];
        LOOKUP_TABLE[i][3] = '\0';
    }
}

void append_octet(char **buf, unsigned char value, char terminator) {
    char *src = LOOKUP_TABLE[value];
    if (value < 10) {
        (*buf)[0] = src[0];
        (*buf)[1] = terminator;
        (*buf) += 2;
    }
    else if (value < 100) {
        (*buf)[0] = src[0];
        (*buf)[1] = src[1];
        (*buf)[2] = terminator;
        (*buf) += 3;
    }
    else {
        (*buf)[0] = src[0];
        (*buf)[1] = src[1];
        (*buf)[2] = src[2];
        (*buf)[3] = terminator;
        (*buf) += 4;
    }
}

std::string parse_ipv4_address( const std::vector<unsigned char> & data, int start ) {
    char ip_addr[16];
    char *dst = ip_addr;
    append_octet(&dst, data[start + 0], '.');
    append_octet(&dst, data[start + 1], '.');
    append_octet(&dst, data[start + 2], '.');
    append_octet(&dst, data[start + 3], '\0');
    return std::string( ip_addr ); 
}


int main() {
    init_lookup_table();

    std::vector<unsigned char> ip = { 0xc0, 0x8, 0x20, 0x0c };
    std::cout << parse_ipv4_address( ip, 0 ) << std::endl;
}

Other way to improve performance would be to replace string with specialized object. In that case you will be able to implement required I/O methods (my guess is that you need string to print it somewhere) and will be freed from copying on string construction.

UPD on second thought I guess in my code lookup table is out of use so one could just copy code used to build lookup table to append_octet directly making digits global.

Updated code (thanks to MikeMB and Matteo Italia) which also looks very cache friendly

inline void append_octet(char **buf, unsigned char value, char terminator) {
    if (value < 10) {
        (*buf)[0] = '0' + (value % 10);
        (*buf)[1] = terminator;
        (*buf) += 2;
    }
    else if (value < 100) {
        (*buf)[0] = '0' + ((value / 10) % 10);
        (*buf)[1] = '0' + (value % 10);
        (*buf)[2] = terminator;
        (*buf) += 3;
    }
    else {
        (*buf)[0] = '0' + ((value / 100) % 10);
        (*buf)[1] = '0' + ((value / 10) % 10);
        (*buf)[2] = '0' + (value % 10);
        (*buf)[3] = terminator;
        (*buf) += 4;
    }
}

std::string parse_ipv4_address( const std::vector<unsigned char> & data, int start ) {
    char ip_addr[16];
    char *dst = ip_addr;
    append_octet(&dst, data[start + 0], '.');
    append_octet(&dst, data[start + 1], '.');
    append_octet(&dst, data[start + 2], '.');
    append_octet(&dst, data[start + 3], '\0');
    return std::string( ip_addr ); 
}


int main() {
    std::vector<unsigned char> ip = { 0xc0, 0x8, 0x20, 0x0c };
    std::cout << parse_ipv4_address( ip, 0 ) << std::endl;
}

UPD 2 I guess I found a way to avoid extra copy (altough there's still extra copy on return). Here's versions with look up table and w/o it

#include <string>
#include <iostream>
#include <vector>

std::string LUT[256];

void init_lookup_table() {
    for (int i = 0; i < 10; ++i) {
        LUT[i].reserve(2);
        LUT[i].push_back('0' + i);
        LUT[i].push_back('.');
    }
    for (int i = 10; i < 100; ++i) {
        LUT[i].reserve(3);
        LUT[i].push_back('0' + (i/10));
        LUT[i].push_back('0' + (i%10));
        LUT[i].push_back('.');
    }
    for (int i = 100; i < 256; ++i) {
        LUT[i].reserve(4);
        LUT[i].push_back('0' + (i/100));
        LUT[i].push_back('0' + ((i/10)%10));
        LUT[i].push_back('0' + (i%10));
        LUT[i].push_back('.');
    }
}

std::string parse_ipv4_address_lut( const std::vector<unsigned char> & data, int start ) {
    std::string res;
    res.reserve(16);
    res.append(LUT[data[start + 0]]);
    res.append(LUT[data[start + 1]]);
    res.append(LUT[data[start + 2]]);
    res.append(LUT[data[start + 3]]);
    res.pop_back();
    return res; 
}

inline void append_octet_calc(std::string *str, unsigned char value, char terminator) {
    if (value < 10) {
        str->push_back('0' + (value % 10));
        str->push_back(terminator);
    }
    else if (value < 100) {
        str->push_back('0' + ((value / 10) % 10));
        str->push_back('0' + (value % 10));
        str->push_back(terminator);
    }
    else {
        str->push_back('0' + ((value / 100) % 10));
        str->push_back('0' + ((value / 10) % 10));
        str->push_back('0' + (value % 10));
        str->push_back(terminator);
    }
}

std::string parse_ipv4_address_calc( const std::vector<unsigned char> & data, int start ) {
    std::string res;
    res.reserve(16);
    append_octet_calc(&res, data[start + 0], '.');
    append_octet_calc(&res, data[start + 1], '.');
    append_octet_calc(&res, data[start + 2], '.');
    append_octet_calc(&res, data[start + 3], '\0');
    return res; 
}


int main() {
    init_lookup_table();

    std::vector<unsigned char> ip = { 0xc0, 0x8, 0x20, 0x0c };
    std::cout << parse_ipv4_address_calc( ip, 0 ) << std::endl;
    std::cout << parse_ipv4_address_lut( ip, 0 ) << std::endl;
}

UPD 3 I made some measurements (1 000 000 repeats)

clang++ -O3
orig...done in 5053 ms // original implementation by OP
c_lut...done in 2083 ms // lookup table -> char[] -> std::string
c_calc...done in 2245 ms // calculate -> char[] -> std::string
cpp_lut...done in 2597 ms // lookup table + std::string::reserve + append
cpp_calc...done in 2632 ms // calculate -> std::string::reserve + push_back
hardcore...done in 1512 ms // reinterpret_cast solution by @IInspectable

g++ -O3
orig...done in 5598 ms // original implementation by OP
c_lut...done in 2285 ms // lookup table -> char[] -> std::string
c_calc...done in 2307 ms // calculate -> char[] -> std::string
cpp_lut...done in 2622 ms // lookup table + std::string::reserve + append
cpp_calc...done in 2601 ms // calculate -> std::string::reserve + push_back
hardcore...done in 1576 ms // reinterpret_cast solution by @IInspectable

Note that 'hardcore' solution doesn't equivalent because of leading zeroes.

Ilya Denisov
  • 868
  • 8
  • 25
  • Probably a matter of taste, but you know, that you can get character corresponding to a digit, by just adding 48 to it? No need for an array holding the digits. Also, the way you are composing the final string, there is no ned for separate initialization loops for 0-9,10-99 and 100 -255 – MikeMB Dec 17 '15 at 12:56
  • 3
    @MikeMB: no need to even write `digit+48`, just write `digit+'0'`. – Matteo Italia Dec 17 '15 at 13:02
  • Just to make this clear. I was not against using a lookup table. What I wanted to say is that - as you only copy the necessary characters anyway - there is no reason, not to fill the lookuptable with entries like `"000."` so you can use one loop to initialize all entires. Whether your new solution is faster one would have to investigate - divisions are usually rather slow, but so is memory access if you have to go beyond l1 cache. @Matteo: Yes, thank you for the correction. That is of course the better solution. – MikeMB Dec 17 '15 at 13:51
0

you can use a lookup table which holds a string for numbers from 0 to 255. if speed is very important, you can also use inline keyword or maybe a macro for the function. also you can check sse instructions.

By the way, usually the more primitive your code the faster it is. I would use unsigned char array instead of vector, char array instead of string, memcpy(or even byte by byte copy directly) instead of sprintf.

seleciii44
  • 1,529
  • 3
  • 13
  • 26
  • Ostensibly he's calling the function many times if his code is too slow. Adding `inline` would make OP's binary *huge*. – erip Dec 17 '15 at 12:20
  • @erip Maybe, maybe not... Depends how he is calling it. – ilent2 Dec 17 '15 at 12:22
  • If I use -O2 for compiling then ... doesn't the compiler do already all sorts of magic regardless whether I use inline keyword? But he lookup table idea sounds promising :) – Ragnar Dec 17 '15 at 12:24
  • Lookup table sounds good, `inline` will probably not help much. – Jabberwocky Dec 17 '15 at 12:28
  • for highest optimization for speed you should use -O3. also i'm not sure if it's going to auto inline or not. i wouldn't risk it :) i still recommend to look for sse instructions. – seleciii44 Dec 17 '15 at 12:29
  • Careful about lookup tables, for simple stuff on current processors they may be do more bad than good (in general, processor is fast, memory is slow). – Matteo Italia Dec 17 '15 at 12:37
  • @MatteoItalia: The total lookup table would be about 1KB. That would easily fit into l2 cache and the hot part (if not all entries are used equally) should even stay in L1, so it should not be a big problem (of course one would have to verify this for the program in question). – MikeMB Dec 17 '15 at 12:47
  • @MikeMB: indeed, 1 KB is actually quite small. – Matteo Italia Dec 17 '15 at 13:03
  • @seleciii44: Actually I don't think the inline keyword itself has a big impact on the inlining decision of a compiler nowerdays. However, it is required if you want to put your function body in the header which (I believe) is still important for inlining. – MikeMB Dec 17 '15 at 13:33
  • @MikeMB you are right, it's up to compiler. But unless I check the generated assembly code i would use the keyword. – seleciii44 Dec 17 '15 at 13:45
0

Here you go...

    std::string IP_parse(unsigned char data[4])
    {
            std::string parsedString = "";
            snprintf((char*)parsedString.c_str(), sizeof(char[15]), "%d.%d.%d.%d", data[0], data[1], data[2], data[3]);
            return parsedString;
    }