Is this a correct and portable way of checking if 2 c-strings overlap in memory?

Question

Might not be the most efficient way, but is it correct and portable?

#include <string.h>

int are_overlapping(const char *a, const char *b) {
  return (a + strlen(a) == b + strlen(b));
}

To clarify: what I'm looking for is overlap in memory, not in the actual content. For example:

const char a[] = "string";
const char b[] = "another string";
are_overlapping(a, b); // should return 0
are_overlapping(a, a + 3); // should return 1

no, it will not work. You are comparing only the end of every string. — Cacho Santa, Jul 09 '13 at 23:29
I think you didn't get the idea behind this cacho. He wants to check if the two strings are overlapping in memory, if they are so their end will be the same. — Havenard, Jul 09 '13 at 23:31
(If you don't, I will - this question sitting here with two wrong answers isn't helping anyone.) — Carl Norum, Jul 09 '13 at 23:45
So in a nutshell, if two C strings overlap, they must end at the same point... — Kerrek SB, Jul 09 '13 at 23:53
thanks @CarlNorum. What I don't understand is why cacho's comment keeps being the most upvoted response, while it provides no explanation whatsoever. — infokiller, Jul 09 '13 at 23:54
Your answer is correct and even optimal. @chacho is thinking of C++ -like strings where length is a separate field. He's wrong for C strings with null termination. — Gene, Jul 10 '13 at 00:04
@Gene I think there is a more efficient way described here: http://stackoverflow.com/questions/4023320/how-to-implement-memmove-in-standard-c-without-an-intermediate-copy?answertab=votes#tab-top — infokiller, Jul 10 '13 at 00:07
@JohnnyW Your SO article isn't really relevant because it assumes equal length blocks and that their length is known in advance. Yes with elaborate tests inside the loops you might avoid touching part of one string on average. However those loops are going to have more comparisons than simple `strlen`, which may well eat any theoretical advantage. — Gene, Jul 10 '13 at 01:28
I added an answer that cuts out the second strlen. No need for it. If the second string starts between (inclusive) the first, you're done. — xaxxon, Jul 10 '13 at 01:39
@Gene I think the code linked can be modified to work without knowing the length in advance. Regarding efficiency: note that the code in the link has only 1 loop, compared to the 2 loops hidden in the 2 strlen calls, and each loop adds control overhead. Overall, in the worst case my code has 2 loops with a comparison on each iteration (strlen), while the other code has 1 loop with 3 additions and 2 comparisons. — infokiller, Jul 10 '13 at 10:00

Carl Norum · Accepted Answer · 2013-07-10T17:48:01.387

33

Yes, your code is correct. If two strings end at the same place, they overlap by definition: they share the same null terminator. Either both strings are identical, or one is a suffix (a trailing substring) of the other.

Everything about your program is perfectly well-defined behaviour, so assuming standards-compliant compilers, it should be perfectly portable.
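As a minimal sketch of the point above, the function from the question can be checked against the question's own examples. The helper name `check_question_examples` is hypothetical and only there for illustration. Note that `a + strlen(a)` points at the terminator, which is still an element of the array, so no out-of-bounds pointer is ever formed.

```c
#include <assert.h>
#include <string.h>

/* The function from the question, unchanged apart from formatting. */
int are_overlapping(const char *a, const char *b) {
    return a + strlen(a) == b + strlen(b);
}

/* Hypothetical helper (not in the original post) exercising the question's cases. */
void check_question_examples(void) {
    const char a[] = "string";
    const char b[] = "another string";

    assert(are_overlapping(a, b) == 0);     /* two distinct arrays */
    assert(are_overlapping(a, a + 3) == 1); /* "ing" is a suffix of "string" */
    assert(are_overlapping(a, a) == 1);     /* identical pointers */
}
```

Calling `check_question_examples()` from `main` should complete without tripping an assertion. Only `==` is applied to the pointers, which is why unrelated arrays are safe to pass.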

The relevant bit in the standard is from 6.5.9 Equality operators (emphasis mine):

Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.

edited Jul 10 '13 at 17:48

answered Jul 09 '13 at 23:47


It depends, if he means overlap or fully overlap. There is a huge difference. – dtech Jul 09 '13 at 23:49
No, it doesn't matter at all. Try some examples. – Carl Norum Jul 09 '13 at 23:49
Forget that, I completely ignored that the strings are null terminated. – dtech Jul 09 '13 at 23:52
NP, I think a lot of people did that on this question. – Carl Norum Jul 09 '13 at 23:52
This answer would be more useful if it explained why `==` is supported even though it is not known that `a` and `b` are pointers into the same object (or one byte beyond it). – Eric Postpischil Jul 10 '13 at 01:41
@EricPostpischil - done, but you have plenty of rep to have added it yourself. Please feel free in the future - I trust you! =) – Carl Norum Jul 10 '13 at 02:31

score 12 · Answer 2 · answered Jul 09 '13 at 23:47

Thinking about zdan's comments on my previous post (which will probably shortly be deleted), I've come to the conclusion that checking endpoints is sufficient.

If there's any overlap, the null terminator will make the two strings not be distinct. Let's look at some possibilities.

If you start with

a 0x10000000 "Hello" and somehow add
b 0x10000004 "World",

you'll have a single string: HellWorld, since the W overwrites the o and the remaining characters overwrite the \0 and the bytes after it. Both strings would end at the same endpoint.

If somehow you write to the same starting point:

a 0x10000000 "Hello" and
b 0x10000000 "Jupiter"

You'll have the word Jupiter, and have the same endpoint.

Is there a case where you can have the same endpoint and not have overlap? Kind of.

a = 0x1000000 "Four" and
b = 0x1000004 "".

That will give an overlap as well.

I can't think of any time you'll have overlap where you won't have matching endpoints - assuming that you're writing null terminated strings into memory.

So, the short answer: Yes, your check is sufficient.
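The cases walked through above can be reproduced in a single buffer. This is only a sketch: the buffer `buf`, its size, and the helper name `check_shared_terminator_cases` are assumptions made for illustration, and offsets stand in for the example addresses.

```c
#include <assert.h>
#include <string.h>

int are_overlapping(const char *a, const char *b) {
    return a + strlen(a) == b + strlen(b);
}

/* Hypothetical helper: replays the answer's three scenarios using offsets. */
void check_shared_terminator_cases(void) {
    char buf[16] = "Hello";

    /* "World" written at offset 4 replaces the 'o' and the old terminator. */
    strcpy(buf + 4, "World");
    assert(strcmp(buf, "HellWorld") == 0);
    assert(are_overlapping(buf, buf + 4) == 1);

    /* Writing to the same starting point: one string, one endpoint. */
    strcpy(buf, "Jupiter");
    assert(are_overlapping(buf, buf) == 1);

    /* "Four" at offset 0 and the empty string at offset 4 share a terminator. */
    strcpy(buf, "Four");
    assert(buf[4] == '\0');
    assert(are_overlapping(buf, buf + 4) == 1);
}
```

In every scenario the two strings share one terminator, which is exactly what the endpoint comparison detects.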

Thanks for the explanation -- a lot of people didn't get this at first! — Ernest Friedman-Hill, Jul 09 '13 at 23:54
@ErnestFriedman-Hill That includes me. I actually had 3 upvotes on saying "No, you have to check the whole string" (and 2 downvotes. :-) ) — Scott Mermelstein, Jul 09 '13 at 23:55

jxh · Answer 3 · 2013-07-10T03:46:22.567

It is probably not relevant to your use case, as your question is specifically about C-strings, but the code will not work in the case that the data has embedded NUL bytes in the strings.

char a[] = "abcd\0ABCD";
char *b = a + 5;
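To make the failure concrete, here is a small sketch using the same data. The helper name `check_embedded_nul` is hypothetical. Both pointers refer to bytes of the same array, yet the check reports no overlap, because each pointer is treated as a separate C string ending at its own terminator.

```c
#include <assert.h>
#include <string.h>

int are_overlapping(const char *a, const char *b) {
    return a + strlen(a) == b + strlen(b);
}

/* Hypothetical helper showing the embedded-NUL case from the answer. */
void check_embedded_nul(void) {
    char a[] = "abcd\0ABCD"; /* 10 bytes: two C strings stored back to back */
    char *b = a + 5;         /* points at "ABCD", inside the array a */

    assert(sizeof a == 10);
    assert(strlen(a) == 4);  /* strlen stops at the embedded NUL */
    assert(strlen(b) == 4);

    /* The ends differ (a + 4 versus a + 9), so the result is 0. */
    assert(are_overlapping(a, b) == 0);
}
```

Whether this counts as a wrong answer depends on the definition: as C strings, "abcd" and "ABCD" really do occupy disjoint bytes.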

Other than that, your solution is straightforward and correct. It works because you are only using == for the pointer comparison, and according to the standard (from C11 6.5.9/6):

Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.

However, the relational operators are more strict (from C11 6.5.8/5):

When two pointers are compared, the result depends on the relative locations in the address space of the objects pointed to. If two pointers to object types both point to the same object, or both point one past the last element of the same array object, they compare equal. If the objects pointed to are members of the same aggregate object, pointers to structure members declared later compare greater than pointers to members declared earlier in the structure, and pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values. All pointers to members of the same union object compare equal. If the expression P points to an element of an array object and the expression Q points to the last element of the same array object, the pointer expression Q+1 compares greater than P. In all other cases, the behavior is undefined.

The last sentence is the kicker.

Some have taken exception to the fact that your code may traverse the overlapping portion twice, and have tried to avoid that. However, the savings are offset by an extra pointer comparison per iteration, or else the alternative relies on undefined or implementation-defined behavior. Assuming you want a portable and compliant solution, the actual average gain is likely nil and not worth the effort.

xaxxon · Answer 4 · 2013-07-10T02:52:37.207

1

This solution has the same worst-case performance, but is optimized for hits: when the strings overlap, you don't have to scan both of them to the end.

const char *temp_a = a;
const char *temp_b = b;

/* walk a; if b starts anywhere inside it, the strings overlap */
while (*temp_a != '\0') {
    if (temp_a++ == b)
        return 1;
}

/* check for b being an empty string at a's terminator */
if (temp_a == b) return 1;

/* but a might start inside b, so we aren't done; try from b now */
while (*temp_b != '\0') {
    if (temp_b++ == a)
        return 1;
}

/* same check in the other direction: a may be an empty string at b's terminator */
if (temp_b == a) return 1;

return 0;

Apparently, only pointer equality (`==`, `!=`) is portable for arbitrary pointers in C; relational comparison (`<`, `>`, and so on) is not. So the following solutions aren't portable -- everything below is from before I knew that.

Your solution is valid, but why calculate strlen on the second string? You know the start and end point of one string, so just see if the other starts between them (inclusive). That saves you a pass through the second string: O(M+N) becomes O(M).

/* non-portable: a < b is undefined unless both point into the same array */
const char *lower_addr_string = a < b ? a : b;
const char *higher_addr_string = a > b ? a : b;
size_t length = strlen(lower_addr_string);
return higher_addr_string >= lower_addr_string && higher_addr_string <= lower_addr_string + length;

Alternatively, do the string scanning yourself:

const char *lower_addr_string = a < b ? a : b;
const char *higher_addr_string = a > b ? a : b;
while (*lower_addr_string != '\0') {
    if (lower_addr_string == higher_addr_string)
        return 1;
    ++lower_addr_string;
}
/* check the terminator position as well */
if (lower_addr_string == higher_addr_string)
    return 1;
return 0;

edited Jul 10 '13 at 02:52

answered Jul 10 '13 at 01:38


You must locate the endpoints and compare them for equality because the C standard supports comparing any pointers for equality (`==`) but does not support comparing any pointers for order (`<`, `<=`, et cetera). Only pointers within the same object as each other (including a sentinel byte at the end) may be compared for order. – Eric Postpischil Jul 10 '13 at 01:40
No. The question asks for a “correct and portable” way of comparing. This method can fail in some C implementations, such as those that use segment-and-offset addressing. In such C implementations, the order comparisons such as `<` may be implemented by comparing only offsets, so the segment base addresses are neglected. – Eric Postpischil Jul 10 '13 at 01:42
@EricPostpischil ok I'll take another shot – xaxxon Jul 10 '13 at 01:45
@EricPostpischil better? I even kept the "lesson learned" part for others to see. – xaxxon Jul 10 '13 at 01:49
The first set of code has some merit, since it might produce the result more quickly than the OP’s code, depending on circumstances. Other alternatives may also be useful, depending on what is known or likely about `a` and `b`. – Eric Postpischil Jul 10 '13 at 01:54

AnT stands with Russia · Answer 5 · 2013-07-10T07:19:39.137

Yes, your check is correct, but it is certainly not the most efficient (if by "efficiency" you mean the computational efficiency). The obvious intuitive inefficiency in your implementation is based on the fact that when the strings actually overlap, the strlen calls will iterate over their common portion twice.

For the sake of formal efficiency, one might use a slightly different approach

int are_overlapping(const char *a, const char *b) 
{
  if (a > b) /* or `(uintptr_t) a > (uintptr_t) b`, see note below! */
  {
    const char *t = a; 
    a = b; 
    b = t;
  }

  while (a != b && *a != '\0')
    ++a;

  return a == b;
}

An important note about this version is that it performs relational comparison of two pointers that are not guaranteed to point to the same array, which formally leads to undefined behavior. It will work in practice on a system with flat memory model, but might draw criticism from a pedantic code reviewer. To formally work around this issue one might convert the pointers to uintptr_t before performing relational comparisons. That way the undefined behavior gets converted to implementation-defined behavior with proper semantics for our purposes in most (if not all) traditional implementations with flat memory model.
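A sketch of the `uintptr_t` variant described in the note might look like the following. The function name `are_overlapping_uintptr` is made up for illustration. It assumes the implementation provides `uintptr_t` (the type is optional in the standard) and that the converted integers are ordered like addresses, which is typical of flat memory models but not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variant: orders the pointers via their integer conversions.
 * The result of the conversion is implementation-defined. */
int are_overlapping_uintptr(const char *a, const char *b) {
    if ((uintptr_t)(const void *)a > (uintptr_t)(const void *)b) {
        const char *t = a;
        a = b;
        b = t;
    }

    /* Walk the "earlier" string until it reaches b or its own terminator. */
    while (a != b && *a != '\0')
        ++a;

    return a == b;
}
```

Inside the loop only `!=` and `==` are applied to the pointers, so the implementation-defined part is confined to the initial ordering step.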

This approach is free from the "double counting" problem: it only analyzes the non-overlapping portion of the string that is located "earlier" in memory. Of course, in practice the benefits of this approach might prove to be non-existent. It will depend on both the quality of your strlen implementation and on the properties of the actual input.

For example, in this situation

const char *str = "Very very very long string, say 64K characters long......";

are_overlapping(str, str + 1);

my version will detect the overlap much faster than yours. My version will do it in 1 iteration of the loop, while your version will spend about 2 * 64K iterations (assuming a naive implementation of strlen).
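The iteration counts claimed above can be checked with an instrumented sketch. Everything here is hypothetical scaffolding: `counted_strlen` stands in for a naive byte-by-byte `strlen`, and the two `steps_*` functions count loop iterations instead of returning the overlap result. Both pointers in the demo refer to the same array, so the `a > b` comparison is well-defined in this particular case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical naive strlen that also counts its loop iterations. */
static size_t counted_strlen(const char *s, size_t *steps) {
    size_t n = 0;
    while (s[n] != '\0') {
        ++n;
        ++*steps;
    }
    return n;
}

/* Iterations spent by the two-strlen approach from the question. */
size_t steps_two_strlen(const char *a, const char *b) {
    size_t steps = 0;
    (void)counted_strlen(a, &steps);
    (void)counted_strlen(b, &steps);
    return steps;
}

/* Iterations spent by the early-exit loop from this answer. */
size_t steps_early_exit(const char *a, const char *b) {
    size_t steps = 0;
    if (a > b) {
        const char *t = a;
        a = b;
        b = t;
    }
    while (a != b && *a != '\0') {
        ++a;
        ++steps;
    }
    return steps;
}

/* Hypothetical demo with a 64K-character string. */
void check_step_counts(void) {
    enum { N = 65536 };
    static char str[N + 1];
    memset(str, 'x', N);
    str[N] = '\0';

    assert(steps_early_exit(str, str + 1) == 1);
    assert(steps_two_strlen(str, str + 1) == (size_t)N + (N - 1));
}
```

A real `strlen` is usually heavily optimized, so these counts say little about wall-clock time; they only illustrate the difference in the amount of memory examined.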

If you decide to dive into the realm of questionable pointer comparisons, the above idea can also be reimplemented as

int are_overlapping(const char *a, const char *b) 
{
  if (a > b)
  {
    const char *t = a; 
    a = b; 
    b = t;
  }

  return b <= a + strlen(a);
}

This implementation does not perform an extra pointer comparison on each iteration. The price we pay for that is that it always iterates to the end of one of the strings instead of terminating early. Yet it is still more efficient than your implementation, since it calls strlen only once.

Try this test: const char b[] = "123456789"; const char * a = &b[5]; printf("%i\n", are_overlapping(a,b)); // should print 1, but actually prints 0 — Jeremy Friesner, Jul 10 '13 at 02:14
@Jeremy Friesner: Yes, I don't know why I switched to incorrect implementation. I reverted to the previous version of my answer, which was correct. Thank you for pointing this out. — AnT stands with Russia, Jul 10 '13 at 02:19
