PHP multibyte preg_split() with PREG_SPLIT_OFFSET_CAPTURE

Question

I want to use preg_split() with its PREG_SPLIT_OFFSET_CAPTURE option to capture both the word and the index where it begins in the original string.

However, my string contains multibyte characters, which throw off the counts. There doesn't seem to be an mb_ equivalent of this. What are my options?

Example:

$text = "Hello world — goodbye";

$words = preg_split("/(\w+)/x",
                    $text,
                    -1,
                    PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);

foreach($words as $word) {
    print("$word[0]: $word[1]<br>");
}

This outputs:

Hello: 0
: 5
world: 6
— : 11
goodbye: 16

Because the dash is an em-dash rather than a standard hyphen, it's a multibyte character (three bytes in UTF-8), so the offset of "goodbye" comes out as 16 instead of 14.
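To show where the extra two come from, here is a minimal sketch comparing byte counts with character counts. It assumes the file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Assumes this file is saved as UTF-8, so the em-dash is the 3-byte sequence E2 80 94.
$text = "Hello world — goodbye";

// Everything that precedes "goodbye" in the string.
$prefix = "Hello world — ";

$byte_offset = strlen($prefix);              // counts bytes
$char_offset = mb_strlen($prefix, "UTF-8");  // counts characters

echo "bytes: $byte_offset, characters: $char_offset\n"; // bytes: 16, characters: 14
```

preg_split() reports the first number, whereas I want the second.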

This thread seems related https://stackoverflow.com/questions/30605173/php-mb-split-capturing-delimiters — user3783243, Mar 13 '20 at 19:02
Thanks! However, if I try using the `mb_explode()` function from that question I get exactly the same result as `preg_split()`. — Phil Gyford, Mar 13 '20 at 19:13
Do you think you might utilize [this](https://www.php.net/manual/en/function.mb-split.php#117588) and loop the array looking for spaces? — El_Vanja, Mar 13 '20 at 19:35
A bit of a hack, but you could replace the mb character with a non multi-byte character, then run the preg_split() function on the string. — dale landry, Mar 13 '20 at 22:07
@El_Vanja Thanks but I’m not sure how looking for spaces would help - spaces aren’t a problem. — Phil Gyford, Mar 13 '20 at 22:20
@dale landry I could, but it’s a big text with a load of such characters, and also the punctuation is important - I don’t want to change it. — Phil Gyford, Mar 13 '20 at 22:21
I meant looping the array to find spaces in order to determine where the next word starts. If there's a space in the array and next index is not a space, that means a new word has started. — El_Vanja, Mar 13 '20 at 22:22

dale landry · Answer 1 · 2020-03-13T22:41:21.170

This is kind of a hack, but seems to work. Use str_replace() to replace the multi-byte character with a non-multi-byte character and then run the preg_split() on the string.

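$text = 'Hello world — goodbye';
$mb = '—';
$rplmnt = "X";

function chkPlc($text, $mb, $rplmnt){
    // Swap the multibyte character for a single-byte one so that byte
    // offsets and character offsets coincide. If $mb does not occur in
    // $text, str_replace() simply returns an unchanged copy.
    $rpl = str_replace($mb, $rplmnt, $text);
    $words = preg_split("/(\w+)/x",
                    $rpl,
                    -1,
                    PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);

    // Print each piece with its offset.
    foreach($words as $word) {
        print("$word[0]: $word[1]<br>");
    }

    // Return a description of the modified string for the caller to use.
    return 'New string with the mb char replaced by a non-mb char: '.$rpl.'<br>';
}

chkPlc($text, $mb, $rplmnt);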

OUTPUTS:

Hello: 0
: 5
world: 6
: 11
X: 12
: 13
goodbye: 14

A more in-depth function could first check that the chosen single-byte replacement character does not already occur in the string, and only then use it to replace the multibyte character. Again, kind of a hack, but it works.
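A rough sketch of that more in-depth version might look like the following. The helper names pick_placeholder() and mask_multibyte() and the candidate list are made up for illustration, and it assumes the text is valid UTF-8. It masks every non-ASCII character rather than one named character, and reads the original pieces back with mb_substr() so the text itself is left unchanged.

```php
<?php
// Hypothetical helpers, not part of the answer above.

// Returns the first candidate that does not occur in $text, or null if all do.
function pick_placeholder(string $text, array $candidates = ["~", "#", "\x01"]): ?string {
    foreach ($candidates as $candidate) {
        if (strpos($text, $candidate) === false) {
            return $candidate;
        }
    }
    return null;
}

// Replaces every non-ASCII character with the one-byte placeholder, so the
// masked string has exactly one byte per character of the original.
function mask_multibyte(string $text, string $placeholder): string {
    $masked = preg_replace('/[^\x00-\x7F]/u', $placeholder, $text);
    if ($masked === null) {
        throw new InvalidArgumentException("Text is not valid UTF-8.");
    }
    return $masked;
}

$text = "Hello world — goodbye";

$placeholder = pick_placeholder($text);   // "~" for this text
if ($placeholder === null) {
    throw new RuntimeException("No unused placeholder character available.");
}
$masked = mask_multibyte($text, $placeholder);

$parts = preg_split('/(\w+)/', $masked, -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_OFFSET_CAPTURE);

foreach ($parts as $part) {
    // Byte offsets in $masked equal character offsets in $text, so the
    // original piece can be read back with mb_substr().
    $original = mb_substr($text, $part[1], strlen($part[0]), "UTF-8");
    echo "$original: $part[1]\n";
}
```

One caveat: accented letters such as é are masked too, so a word containing them would be split at the placeholder. This only behaves well when the multibyte characters are punctuation.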

Thanks @dale landry. I found another solution, posted here, using mb_convert_encoding() that I guess has a similar effect. — Phil Gyford, Mar 14 '20 at 12:32

score 0 · Answer 2 · answered Mar 14 '20 at 12:29

Here's another not-ideal solution: convert the text to a single-byte encoding such as ISO-8859-1 using mb_convert_encoding(), which gets rid of the multibyte characters. Each one is turned into either its single-byte equivalent, where the target encoding has one, or a question mark.

So transforming $text before doing the preg_split() using this:

$text = mb_convert_encoding($text, "ISO-8859-1", "UTF-8");

Results in:

Hello: 0
: 5
world: 6
? : 11
goodbye: 14

Although it makes a mess of the text, you can still keep a copy of the original of course.

I found it via this comment about the iconv() function.

score 0 · Accepted Answer · answered Jul 06 '21 at 10:32

Over a year later I was revisiting this and came up with a function to do this better. The good thing is it handles multibyte strings without having to ditch the multibyte characters entirely. The bad thing is that it can't use a regular expression like preg_split() does.

/**
 * Splits a piece of text into individual words and the words' position within
 * the text.
 *
 * @param string $text The text to split.
 * @return array Each element is an array, of the word and its 0-based position.
 */
function split_offset_capture($text) {
    $words = array();

    // We split into words based on these characters:
    $non_word_chars = array(
        " ", "-", "–", "—", ".", ",", ";" ,":", "(", ")", "/",
        "\\", "?", "!", "*", "'", "’", "\n", "\r", "\t",
    );

    // To keep track within the loop:
    $word_started = FALSE;
    $current_word = "";
    $current_word_position = 0;

    // mb_str_split() requires PHP 7.4 or later.
    $characters = mb_str_split($text);

    foreach($characters as $i => $letter) {
        if ( ! in_array($letter, $non_word_chars)) {
            // A character in a word.
            if ( ! $word_started) {
                // We're starting a brand new word.
                if ($current_word != "") {
                    // Save the previous, now complete, word's info.
                    $words[] = array($current_word, $current_word_position);
                }
                $current_word_position = $i;
                $word_started = TRUE;
                $current_word = "";
            }
            $current_word .= $letter;
        } else {
            $word_started = FALSE;
        }
    }

    // Add on the final word, if there is one (there isn't for empty text).
    if ($current_word != "") {
        $words[] = array($current_word, $current_word_position);
    }

    return $words;
}

Doing this:

$text = "Héllo world — goodbye";

$words = split_offset_capture($text);

Ends up with $words containing this:

array(
    array("Héllo", 0),
    array("world", 6),
    array("goodbye", 14),
);

You might need to add further characters to $non_word_chars.

For real-world texts one awkward thing is handling punctuation that immediately follows words (e.g. Russ' or Russ’), or within words (e.g. Bob's, Bob’s or new-found). To cope with this I came up with this altered function that has three arrays of characters to look for. So it perhaps does more than preg_split() but, again, doesn't use regular expressions:

/**
 * Splits a piece of text into individual words and the words' position within
 * the text.
 *
 * @param string $text The text to split.
 * @return array Each element is an array, of the word and its 0-based position.
 */
function split_offset_capture_2($text) {
    $words = array();

    // We split into words based on these characters:
    $non_word_chars = array(
        " ", "-", "–", "—", ".", ",", ";" ,":", "(", ")", "/",
        "\\", "?", "!", "*", "'", "’", "\n", "\r", "\t"
    );

    // EXCEPT, these characters are allowed to be WITHIN a word:
    // e.g. "up-end", "Bob's", "O'Brien"
    $in_word_chars = array("-", "'", "’");

    // AND, these characters are allowed to END a word:
    // e.g. "Russ'"
    $end_word_chars = array("'", "’");

    // To keep track within the loop:
    $word_started = FALSE;
    $current_word = "";
    $current_word_position = 0;

    // mb_str_split() requires PHP 7.4 or later.
    $characters = mb_str_split($text);

    foreach($characters as $i => $letter) {
        // Neighbouring characters. Treat the edges of the text as spaces so
        // that we never read an undefined array index.
        $prev = $characters[$i-1] ?? " ";
        $next = $characters[$i+1] ?? " ";

        if ( ! in_array($letter, $non_word_chars)
            ||
            (
                // It's a non-word-char that's allowed within a word.
                in_array($letter, $in_word_chars)
                &&
                ! in_array($prev, $non_word_chars)
                &&
                ! in_array($next, $non_word_chars)
            )
            ||
            (
                // It's a non-word-char that's allowed at the end of a word.
                in_array($letter, $end_word_chars)
                &&
                ! in_array($prev, $non_word_chars)
            )
        ) {
            // A character in a word.
            if ( ! $word_started) {
                // We're starting a brand new word.
                if ($current_word != "") {
                    // Save the previous, now complete, word's info.
                    $words[] = array($current_word, $current_word_position);
                }
                $current_word_position = $i;
                $word_started = TRUE;
                $current_word = "";
            }
            $current_word .= $letter;
        } else {
            $word_started = FALSE;
        }
    }

    // Add on the final word, if there is one (there isn't for empty text).
    if ($current_word != "") {
        $words[] = array($current_word, $current_word_position);
    }

    return $words;
}

So if we have:

$text = "Héllo Bob's and Russ’ new-found folks — goodbye";

then the first function (split_offset_capture()) gives us:

array(
    array("Héllo", 0),
    array("Bob", 6),
    array("s", 10),
    array("and", 12),
    array("Russ", 16),
    array("new", 22),
    array("found", 26),
    array("folks", 32),
    array("goodbye", 40),
);

While the second function (split_offset_capture_2()) gets us:

array(
    array("Héllo", 0),
    array("Bob's", 6),
    array("and", 12),
    array("Russ’", 16),
    array("new-found", 22),
    array("folks", 32),
    array("goodbye", 40),
);
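If a regular expression is still wanted, another possibility is to keep preg_split() and convert each byte offset it reports into a character offset afterwards. This is only a sketch under a few assumptions: the text is valid UTF-8, the mbstring extension is loaded, and preg_split_char_offsets() is a made-up helper name rather than a built-in function. The pattern uses \p{L} and \p{N} with the u modifier so that accented letters count as part of a word.

```php
<?php
// Hypothetical helper: runs preg_split() with PREG_SPLIT_OFFSET_CAPTURE and
// converts every byte offset into a character offset.
function preg_split_char_offsets(string $pattern, string $text, int $flags = 0): array {
    $parts = preg_split($pattern, $text, -1, $flags | PREG_SPLIT_OFFSET_CAPTURE);
    if ($parts === false) {
        return array();
    }

    foreach ($parts as $i => $part) {
        // substr() cuts the text at the byte offset; mb_strlen() then counts
        // how many characters come before that point.
        $parts[$i][1] = mb_strlen(substr($text, 0, $part[1]), "UTF-8");
    }

    return $parts;
}

$text = "Héllo world — goodbye";

$words = preg_split_char_offsets(
    '/([\p{L}\p{N}_]+)/u',
    $text,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

foreach ($words as $word) {
    echo "$word[0]: $word[1]\n";
}
// "goodbye" is reported at 14 rather than at its byte offset of 17.
```

Re-measuring the prefix for every piece is simple but gets slow on very long texts; keeping a running character count between pieces would avoid the repeated work.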
