Revisiting this more than a year later, I came up with a function that does this better. The good thing is that it handles multibyte strings without having to strip out the multibyte characters entirely. The bad thing is that it can't split on a regular expression the way preg_split() does. It relies on mb_str_split(), so it needs PHP 7.4 or later.
/**
 * Splits a piece of text into individual words and the words' position within
 * the text.
 *
 * @param string $text The text to split.
 * @return array Each element is an array, of the word and its 0-based position.
 */
function split_offset_capture($text) {
    $words = array();
    // We split into words based on these characters:
    $non_word_chars = array(
        " ", "-", "–", "—", ".", ",", ";", ":", "(", ")", "/",
        "\\", "?", "!", "*", "'", "’", "\n", "\r", "\t",
    );
    // To keep track within the loop:
    $word_started = FALSE;
    $current_word = "";
    $current_word_position = 0;
    // Split into characters (not bytes), so $i is a character position.
    $characters = mb_str_split($text);
    foreach ($characters as $i => $letter) {
        if ( ! in_array($letter, $non_word_chars)) {
            // A character in a word.
            if ( ! $word_started) {
                // We're starting a brand new word.
                if ($current_word !== "") {
                    // Save the previous, now complete, word's info.
                    $words[] = array($current_word, $current_word_position);
                }
                $current_word_position = $i;
                $word_started = TRUE;
                $current_word = "";
            }
            $current_word .= $letter;
        } else {
            $word_started = FALSE;
        }
    }
    // Add on the final word, unless the text contained no words at all.
    if ($current_word !== "") {
        $words[] = array($current_word, $current_word_position);
    }
    return $words;
}
Calling it like this:
$text = "Héllo world — goodbye";
$words = split_offset_capture($text);
leaves $words containing this:
array(
array("Héllo", 0),
array("world", 6),
array("goodbye", 14),
);
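For comparison, here is a small sketch of what preg_split() reports for the same text. With PREG_SPLIT_OFFSET_CAPTURE its offsets count bytes rather than characters, so they drift after the first multibyte character. The pattern and variable names are only illustrative, and the byte-to-character conversion assumes UTF-8 input; none of this is part of the function above.

```php
<?php
$text = "Héllo world — goodbye";

// Split on runs of whitespace and em dashes (illustrative pattern only).
$parts = preg_split(
    '/[\s—]+/u',
    $text,
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE
);
// $parts holds byte offsets: "world" is at 7 and "goodbye" at 17 in UTF-8,
// because "é" takes 2 bytes and "—" takes 3.

// One way to turn a byte offset into a character offset: count the
// characters in the bytes that precede it.
$converted = array();
foreach ($parts as $part) {
    list($word, $byte_offset) = $part;
    $char_offset = mb_strlen(substr($text, 0, $byte_offset), 'UTF-8');
    $converted[] = array($word, $char_offset);
}
// $converted now holds character offsets: 0, 6 and 14.
```

The converted offsets match the positions that split_offset_capture() returns for this text.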
You might need to add further characters to $non_word_chars, depending on your text. Double quotes and square brackets, for example, aren't in the list above.
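If the list of separators changes from text to text, one option is to pass it in as an argument. The following is a minimal sketch of that idea, not a drop-in replacement: the name split_offset_capture_custom() and its second parameter are hypothetical, and the loop is a slightly condensed rewrite of the one above.

```php
<?php
/**
 * Hypothetical variant of split_offset_capture() that takes the list of
 * non-word characters as an argument. Needs PHP 7.4+ for mb_str_split().
 */
function split_offset_capture_custom($text, array $non_word_chars) {
    $words = array();
    // array(word, position) for the word being built, or NULL between words.
    $current = NULL;
    foreach (mb_str_split($text, 1, 'UTF-8') as $i => $letter) {
        if (in_array($letter, $non_word_chars, TRUE)) {
            // A separator ends the word in progress, if there is one.
            if ($current !== NULL) {
                $words[] = $current;
                $current = NULL;
            }
        } elseif ($current === NULL) {
            // First character of a new word: remember where it starts.
            $current = array($letter, $i);
        } else {
            $current[0] .= $letter;
        }
    }
    // Save the final word, if the text didn't end on a separator.
    if ($current !== NULL) {
        $words[] = $current;
    }
    return $words;
}

// Treat spaces, double quotes and square brackets as separators.
$words = split_offset_capture_custom('say "hi" [now]', array(" ", '"', "[", "]"));
// $words is array(array("say", 0), array("hi", 5), array("now", 10)).
```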
For real-world texts, one awkward thing is handling punctuation that immediately follows a word (e.g. Russ' or Russ’) or sits within a word (e.g. Bob's, Bob’s or new-found). To cope with this I wrote an altered function that has three arrays of characters to look for. So it perhaps does more than preg_split() would but, again, it doesn't use regular expressions:
/**
 * Splits a piece of text into individual words and the words' position within
 * the text.
 *
 * @param string $text The text to split.
 * @return array Each element is an array, of the word and its 0-based position.
 */
function split_offset_capture_2($text) {
    $words = array();
    // We split into words based on these characters:
    $non_word_chars = array(
        " ", "-", "–", "—", ".", ",", ";", ":", "(", ")", "/",
        "\\", "?", "!", "*", "'", "’", "\n", "\r", "\t"
    );
    // EXCEPT, these characters are allowed to be WITHIN a word:
    // e.g. "up-end", "Bob's", "O'Brien"
    $in_word_chars = array("-", "'", "’");
    // AND, these characters are allowed to END a word:
    // e.g. "Russ'"
    $end_word_chars = array("'", "’");
    // To keep track within the loop:
    $word_started = FALSE;
    $current_word = "";
    $current_word_position = 0;
    $characters = mb_str_split($text);
    foreach ($characters as $i => $letter) {
        // Check the neighbours exist before looking at them, so the first
        // and last characters don't read outside the array.
        $prev_is_word_char = $i > 0
            && ! in_array($characters[$i - 1], $non_word_chars);
        $next_is_word_char = isset($characters[$i + 1])
            && ! in_array($characters[$i + 1], $non_word_chars);
        if ( ! in_array($letter, $non_word_chars)
            ||
            (
                // It's a non-word-char that's allowed within a word.
                in_array($letter, $in_word_chars)
                && $prev_is_word_char
                && $next_is_word_char
            )
            ||
            (
                // It's a non-word-char that's allowed at the end of a word.
                in_array($letter, $end_word_chars)
                && $prev_is_word_char
            )
        ) {
            // A character in a word.
            if ( ! $word_started) {
                // We're starting a brand new word.
                if ($current_word !== "") {
                    // Save the previous, now complete, word's info.
                    $words[] = array($current_word, $current_word_position);
                }
                $current_word_position = $i;
                $word_started = TRUE;
                $current_word = "";
            }
            $current_word .= $letter;
        } else {
            $word_started = FALSE;
        }
    }
    // Add on the final word, unless the text contained no words at all.
    if ($current_word !== "") {
        $words[] = array($current_word, $current_word_position);
    }
    return $words;
}
So if we have:
$text = "Héllo Bob's and Russ’ new-found folks — goodbye";
then the first function (split_offset_capture()) gives us:
array(
array("Héllo", 0),
array("Bob", 6),
array("s", 10),
array("and", 12),
array("Russ", 16),
array("new", 22),
array("found", 26),
array("folks", 32),
array("goodbye", 40),
);
While the second function (split_offset_capture_2()) gets us:
array(
array("Héllo", 0),
array("Bob's", 6),
array("and", 12),
array("Russ’", 16),
array("new-found", 22),
array("folks", 32),
array("goodbye", 40),
);
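Because the positions count characters rather than bytes, they pair with the mb_* string functions and not with plain substr(). As a rough check, this sketch copies the result above in by hand and cuts each word back out of the text with mb_substr(); the variable names are only illustrative and UTF-8 input is assumed.

```php
<?php
$text = "Héllo Bob's and Russ’ new-found folks — goodbye";

// The result shown above for split_offset_capture_2(), copied in by hand.
$words = array(
    array("Héllo", 0),
    array("Bob's", 6),
    array("and", 12),
    array("Russ’", 16),
    array("new-found", 22),
    array("folks", 32),
    array("goodbye", 40),
);

// Cut each word back out of the text using its character position.
$mismatches = array();
foreach ($words as $entry) {
    list($word, $position) = $entry;
    $length = mb_strlen($word, 'UTF-8');
    $slice = mb_substr($text, $position, $length, 'UTF-8');
    if ($slice !== $word) {
        $mismatches[] = $word;
    }
}
// $mismatches stays empty: every position points at its word.

// Byte-based substr() with the same numbers is off by one here, because
// the "é" before it takes two bytes in UTF-8.
$byte_slice = substr($text, 6, 5);
// $byte_slice is " Bob'" rather than "Bob's".
```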