72

Given two equal-length strings, is there an elegant way to get the offset of the first different character?

The obvious solution would be:

for ($offset = 0; $offset < $length; ++$offset) {
    if ($str1[$offset] !== $str2[$offset]) {
        return $offset;
    }
}

But that doesn't look quite right, for such a simple task.

NikiC
  • 2
    Related: [Tetris'ing an array](http://stackoverflow.com/q/3275258) – Pekka Sep 19 '11 at 18:19
  • 8
    Looks simple enough to me. – Lightness Races in Orbit Sep 19 '11 at 18:20
  • There are more efficient ways to do this, but possibly more complicated to read. Will this bit of code be called lots of times? I.e. Does it matter if it's efficient? – Robert Martin Sep 19 '11 at 18:23
  • 2
    @Robert: How could it be done more efficiently? This is `O(n)` and you _will_ have to examine up to `n` characters. – Lightness Races in Orbit Sep 19 '11 at 18:24
  • @Tomalak You're right that it's O(n), but a byte-wise compare written in PHP will be much slower than a built-in function that utilizes C. For example, code strcmp in PHP and use built-in, run each 10000 times for a decently long string, and see how badly it loses. – Robert Martin Sep 19 '11 at 18:28
  • 4
    !BE AWARE! This might result in a wrong offset when dealing with Unicode characters. If you want to do it this way, better use [mb_substr()](http://de.php.net/manual/en/function.mb-substr.php) – breiti Oct 05 '11 at 15:14

4 Answers

178

You can use a nice property of bitwise XOR (^) to achieve this: Basically, when you xor two strings together, the characters that are the same will become null bytes ("\0"). So if we xor the two strings, we just need to find the position of the first non-null byte using strspn:

$position = strspn($string1 ^ $string2, "\0");

That's all there is to it. So let's look at an example:

$string1 = 'foobarbaz';
$string2 = 'foobarbiz';
$pos = strspn($string1 ^ $string2, "\0");

printf(
    'First difference at position %d: "%s" vs "%s"',
    $pos, $string1[$pos], $string2[$pos]
);

That will output:

First difference at position 7: "a" vs "i"

So that should do it. It's very efficient since it only uses built-in functions implemented in C, and it needs just one additional copy of the string in memory (the XOR result).
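One caveat worth sketching: PHP's `^` on two strings yields a result only as long as the shorter operand, so the one-liner above cannot tell identical strings apart from strings where one is a prefix of the other. The following is a minimal sketch, not part of the original answer; the helper name `firstByteDifference` and its `null`-for-identical convention are assumptions made for illustration.

```php
<?php
// Hypothetical helper (name and return convention are illustrative).
// Returns the byte offset of the first difference, or null if the
// two strings are identical.
function firstByteDifference(string $a, string $b): ?int
{
    // XOR yields "\0" wherever the bytes match; the result is only as
    // long as the shorter of the two strings.
    $pos = strspn($a ^ $b, "\0");

    // No mismatch found within the length of the shorter string.
    if ($pos === min(strlen($a), strlen($b))) {
        // Equal lengths mean identical strings; otherwise the longer
        // string continues at $pos, which counts as the difference.
        return strlen($a) === strlen($b) ? null : $pos;
    }

    return $pos;
}

var_dump(firstByteDifference('foobarbaz', 'foobarbiz')); // int(7)
var_dump(firstByteDifference('foo', 'foo'));             // NULL
var_dump(firstByteDifference('foo', 'foobar'));          // int(3)
```

Like the one-liner, this works on bytes, so the offset it returns is a byte offset rather than a character offset.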

Edit: A MultiByte Solution Along The Same Lines:

function getCharacterOffsetOfDifference($str1, $str2, $encoding = 'UTF-8') {
    return mb_strlen(
        mb_strcut(
            $str1,
            0, strspn($str1 ^ $str2, "\0"),
            $encoding
        ),
        $encoding
    );
}

First the difference at the byte level is found using the above method and then the offset is mapped to the character level. This is done using the mb_strcut function, which is basically substr but honoring multibyte character boundaries.

var_dump(getCharacterOffsetOfDifference('foo', 'foa')); // 2
var_dump(getCharacterOffsetOfDifference('©oo', 'foa')); // 0
var_dump(getCharacterOffsetOfDifference('f©o', 'fªa')); // 1

It's not as elegant as the first solution, but it's still a one-liner (and if you use the default encoding a little bit simpler):

return mb_strlen(mb_strcut($str1, 0, strspn($str1 ^ $str2, "\0")));
ircmaxell
  • 10
    Are you a ringer? How did NikiC _know_ you were planning on posting this? – Robert Martin Sep 19 '11 at 18:29
  • 12
    @Robert Martin, visit our courses of telepathy [here](http://chat.stackoverflow.com/rooms/11/php). – OZ_ Sep 19 '11 at 18:32
  • 5
    @Robert: Yes, I am. We had discussed this yesterday and Nikic had asked me to post this solution here now to give a baseline to see if there are any other (potentially better) solutions than this one. And to get other's comments on it as well... – ircmaxell Sep 19 '11 at 18:35
  • 2
    Out of curiosity, why the downvote? Is there something that can be improved or expanded upon (and as such perhaps should be discussed)? – ircmaxell Sep 19 '11 at 19:09
  • 1
    I guess it is related to the difference in upvotes on comment #1 and comment #2 (unfortunately). – JK. Sep 26 '11 at 21:40
  • @jk I assume as much. But if the answer could be improved, that is much more important than 2 measly rep points... – ircmaxell Sep 26 '11 at 22:59
  • Very Clever code! BTW I notice there always seem to be random down votes when a bounty is involved. Been stung myself. – Charlie Sep 27 '11 at 18:36
  • Nice. But the downside of this code is that it doesn't work with multibyte strings. – Karolis Sep 28 '11 at 13:32
  • @Karolis: Correct. It will find the byte offset of the difference though (which could be in the middle of a multi-byte character)... – ircmaxell Sep 28 '11 at 17:29
  • 1
    +1 The XOR idea is brilliant. I'll have to remember that in the future. I've always used the method OP posted. – Steve Buzonas Oct 04 '11 at 15:59
  • Doesn't [strcspn](http://www.php.net/manual/en/function.strcspn.php) do this out of the box? – Lance Kidwell Jan 21 '12 at 02:42
  • 1
    @Lance: nope. `strcspn('abcd', 'abcd') === 0` ([on codepad](http://codepad.viper-7.com/qEzUiE)), which is obviously not the same answer to this question (which in this case would be `4`)... So no, it doesn't do that out of the box... – ircmaxell Jan 21 '12 at 03:03
  • @ircmaxell can you start peer reviewing my tickets? – Lance Kidwell Jan 21 '12 at 03:27
  • How do you find all the different characters with this approach? – Baba Apr 05 '13 at 15:37
  • @Baba you save the XOR first and wrap `strspn()` in a loop and use it with the `$start` parameter – CSᵠ Apr 12 '13 at 00:40
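The loop suggested in the last comment could look roughly like this. It is a sketch rather than anyone's posted code: the function name `allByteDifferences` is hypothetical, and it reports byte offsets, so the multibyte caveat raised in the comments still applies.

```php
<?php
// Hypothetical helper: collects every byte offset at which the two
// strings differ, within the length of the shorter string.
function allByteDifferences(string $a, string $b): array
{
    $xor = $a ^ $b;          // compute the XOR once and reuse it
    $len = strlen($xor);
    $offsets = [];
    $pos = 0;

    while ($pos < $len) {
        // Skip the run of matching bytes ("\0") starting at $pos.
        $pos += strspn($xor, "\0", $pos);
        if ($pos >= $len) {
            break;           // only matching bytes were left
        }
        $offsets[] = $pos;   // $pos now points at a differing byte
        $pos++;
    }

    return $offsets;
}

echo implode(',', allByteDifferences('foobarbaz', 'foobarbiz')), "\n"; // 7
echo implode(',', allByteDifferences('abcd', 'axcy')), "\n";           // 1,3
```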
16

If you convert each string to an array of single-byte characters, you can use the array comparison functions to compare the strings.

You can achieve a similar result to the XOR method with the following.

$string1 = 'foobarbaz';
$string2 = 'foobarbiz';

$array1 = str_split($string1);
$array2 = str_split($string2);

$result = array_diff_assoc($array1, $array2);

$num_diff = count($result);
$first_diff = key($result);

echo "There are " . $num_diff . " differences between the two strings. <br />";
echo "The first difference between the strings is at position " . $first_diff . ". (Zero Index) '$string1[$first_diff]' vs '$string2[$first_diff]'.";

Edit: Multibyte Solution

$string1 = 'foobarbaz';
$string2 = 'foobarbiz';

// The third argument of preg_split() is the limit; the flags go fourth.
$array1 = preg_split('((.))u', $string1, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$array2 = preg_split('((.))u', $string2, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

$result = array_diff_assoc($array1, $array2);

$num_diff = count($result);
$first_diff = key($result);

echo "There are " . $num_diff . " differences between the two strings.\n";
echo "The first difference between the strings is at position " . $first_diff . ". (Zero Index) '$array1[$first_diff]' vs '$array2[$first_diff]'.\n";
Steve Buzonas
  • I'm not too familiar with working with multibyte encoding. If someone could give more insight as to how this would hold up/how str_split works with mb it would be greatly appreciated. – Steve Buzonas Oct 07 '11 at 00:56
  • 1
    It won't work with multibyte encodings. If you wanted that, you'd pretty much have to use something like this: `$array = preg_split('((.))u', $string, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);` Basically, it will split into individual UTF-8 characters... – ircmaxell Oct 07 '11 at 13:52
  • Thanks for the `preg_split` tip, added it to the answer. – Steve Buzonas Jan 12 '14 at 06:24
4

I wanted to add this as a comment to the best answer, but I do not have enough points.

$string1 = 'foobarbaz';
$string2 = 'foobarbiz';
$pos = strspn($string1 ^ $string2, "\0");

if ($pos < min(strlen($string1), strlen($string2))) {
    printf(
        'First difference at position %d: "%s" vs "%s"',
        $pos, $string1[$pos], $string2[$pos]
    );
} else if ($pos < strlen($string1)) {
    print 'String1 continues with ' . substr($string1, $pos);
} else if ($pos < strlen($string2)) {
    print 'String2 continues with ' . substr($string2, $pos);
} else {
    print 'String1 and String2 are equal';
}
Bradley Slavik
-5
string strpbrk ( string $haystack , string $char_list )

strpbrk() searches the haystack string for a char_list.

The return value is the substring of $haystack which begins at the first matched character. As an API function it should be zippy. Then loop through once, looking for offset zero of the returned string to obtain your offset.

Sinthia V
  • What about when you have a string "foobarr" being compared to a string "foobaar". There is no difference in character set, just the counts and positioning. – Steve Buzonas Oct 07 '11 at 00:53
  • Not applicable here. For example, if haystack is `abcdef` and char_list is `fedcba` it would return the entire string (since `a` is in the char list). So while this function would work for a very limited subset of possible inputs, it won't work in a generic way, so it's not a good answer to the question. – ircmaxell Oct 07 '11 at 13:55
  • @NikiC asked for "an elegant way to get the offset of the first different character". The first character in your example is the correct answer, ircmaxell. While Steve has a better point. I love the xor approach, however unicode is the fly in that ointment. Hmmmm.... – Sinthia V Oct 07 '11 at 16:35
  • @Sinthia: correct, however it would also return `abcdef` when the char_list is `abcdef` as well. So it's only "accidental" that it returns the correct answer. – ircmaxell Oct 07 '11 at 19:46
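To make the objection in the comments concrete, here is a small sketch (added for illustration, using the strings from the first comment) showing that `strpbrk()` matches on character-set membership rather than on position. The variable names are chosen for this example only.

```php
<?php
$str1 = 'foobarr';
$str2 = 'foobaar';

// strpbrk() returns the rest of $str1 starting at the first character
// that occurs anywhere in $str2. 'f' qualifies, so we get all of $str1.
$rest = strpbrk($str1, $str2);
$strpbrkOffset = strlen($str1) - strlen($rest);

// Byte-wise comparison via XOR, as in the top-voted answer.
$xorOffset = strspn($str1 ^ $str2, "\0");

var_dump($rest);          // string(7) "foobarr"
var_dump($strpbrkOffset); // int(0)
var_dump($xorOffset);     // int(5)
```

The two strings first differ at position 5 ("r" vs "a"), which `strpbrk()` has no way of reporting because it never compares characters position by position.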