
preg_split has an optional PREG_SPLIT_DELIM_CAPTURE flag, which also includes the captured (parenthesized) delimiters in the returned array. mb_split has no such option.

Is there any way to split a multibyte string (not just UTF-8, but all kinds) and capture the delimiters?

I'm trying to make a multibyte-safe linebreak splitter that keeps the linebreaks, but I would prefer a more generically usable solution.

Solution: Thanks to user Casimir et Hippolyte, I built a solution and posted it on GitHub (https://github.com/vanderlee/PHP-multibyte-functions/blob/master/functions/mb_explode.php). It supports all the preg_split flags:

/**
 * A cross between mb_split and preg_split, adding the preg_split flags
 * to mb_split.
 * @param string $pattern
 * @param string $string
 * @param int $limit
 * @param int $flags
 * @return array
 */
function mb_explode($pattern, $string, $limit = -1, $flags = 0) {       
    $strlen = strlen($string);      // bytes!   
    mb_ereg_search_init($string);

    $lengths = array();
    $position = 0;
    while (($array = mb_ereg_search_pos($pattern)) !== false) {
        // capture split
        $lengths[] = array($array[0] - $position, false, null);

        // move position
        $position = $array[0] + $array[1];

        // capture delimiter
        $regs = mb_ereg_search_getregs();           
        $lengths[] = array($array[1], true, isset($regs[1]) && $regs[1]);

        // Continue on?
        if ($position >= $strlen) {
            break;
        }           
    }

    // Add last bit, if not ending with split
    $lengths[] = array($strlen - $position, false, null);

    // Substrings
    $parts = array();
    $position = 0;      
    $count = 1;
    foreach ($lengths as $length) {
        $is_delimiter   = $length[1];
        $is_captured    = $length[2];

        if ($limit > 0 && !$is_delimiter && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY) && ++$count > $limit) {
            if ($length[0] > 0 || ~$flags & PREG_SPLIT_NO_EMPTY) {          
                $parts[]    = $flags & PREG_SPLIT_OFFSET_CAPTURE
                            ? array(mb_strcut($string, $position), $position)
                            : mb_strcut($string, $position);                
            }
            break;
        } elseif ((!$is_delimiter || ($flags & PREG_SPLIT_DELIM_CAPTURE && $is_captured))
               && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY)) {
            $parts[]    = $flags & PREG_SPLIT_OFFSET_CAPTURE
                        ? array(mb_strcut($string, $position, $length[0]), $position)
                        : mb_strcut($string, $position, $length[0]);
        }

        $position += $length[0];
    }

    return $parts;
}
Martijn

1 Answer


Capturing delimiters is only possible with preg_split and is not available in other functions.

So three possibilities:

1) Convert your string to UTF-8, use preg_split with PREG_SPLIT_DELIM_CAPTURE, and use array_map to convert each item back to the original encoding.

This is the simplest approach; the second one takes more work. (Note that in general it is simpler to always work in UTF-8 than to deal with exotic encodings.)
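A minimal sketch of option 1, assuming the original encoding is known and supported by mbstring. The helper name `split_keep_delimiters` and the ISO-8859-1 sample text are only illustrative and not part of any library.

```php
<?php
// Hypothetical helper (the name is made up for illustration): split a string
// in any mbstring-supported encoding and keep the captured delimiters,
// by round-tripping through UTF-8.
function split_keep_delimiters($pattern, $string, $encoding)
{
    // Convert to UTF-8 so that preg_split can run with the /u modifier.
    $utf8 = mb_convert_encoding($string, 'UTF-8', $encoding);

    // The delimiter must be wrapped in a capturing group in $pattern,
    // otherwise PREG_SPLIT_DELIM_CAPTURE has nothing to return.
    $parts = preg_split($pattern, $utf8, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Convert every piece (text and delimiter alike) back to the source encoding.
    return array_map(function ($part) use ($encoding) {
        return mb_convert_encoding($part, $encoding, 'UTF-8');
    }, $parts);
}

// Sample input: "één\r\ntwee\ndrie", written as UTF-8 bytes and then
// converted to ISO-8859-1 to stand in for a non-UTF-8 source string.
$text  = mb_convert_encoding("\xC3\xA9\xC3\xA9n\r\ntwee\ndrie", 'ISO-8859-1', 'UTF-8');
$lines = split_keep_delimiters('/(\r\n|\r|\n)/u', $text, 'ISO-8859-1');
```

Because `\r\n` comes first in the alternation, a Windows linebreak is returned as a single delimiter rather than as two separate ones.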

2) Instead of a split-like function, use a matching function such as mb_ereg_search_regs to collect the matched parts one by one, and build the pattern like this:

delimiter|all_that_is_not_the_delimiter

(Note that the two branches of the alternation must be mutually exclusive, and take care to write them so that no gaps are possible between results: the first part must start at the beginning of the string, the last part must end at the end of the string, and each part must be contiguous with the previous one.)
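As a rough sketch of option 2 applied to linebreaks, assuming the string's encoding is one that `mb_regex_encoding` accepts (EUC-JP is used here purely as an example). The helper name `mb_split_keep_linebreaks` is hypothetical.

```php
<?php
// Hypothetical helper (illustrative name): return the text runs and the
// linebreaks of $string as consecutive array items, using the mb_ereg_search_*
// functions instead of a split function.
function mb_split_keep_linebreaks($string, $encoding)
{
    if ($string === '') {
        return array();
    }

    // Tell the multibyte regex engine which encoding the subject uses.
    mb_regex_encoding($encoding);

    // Two mutually exclusive branches that together cover every character:
    // a run of non-linebreak characters, or exactly one linebreak.
    // Double quotes make PHP insert real CR/LF bytes into the pattern.
    $pattern = "[^\r\n]+|\r?\n|\r";

    $parts = array();
    mb_ereg_search_init($string, $pattern);

    // Each call continues where the previous match ended;
    // false means no further match was found.
    while (($regs = mb_ereg_search_regs()) !== false) {
        $parts[] = $regs[0];
    }

    return $parts;
}

// Sample input: two Japanese words separated by "\r\n" and followed by "\n",
// written as UTF-8 bytes and converted to EUC-JP.
$text  = mb_convert_encoding("\xE6\x97\xA5\xE6\x9C\xAC\r\n\xE8\xAA\x9E\n", 'EUC-JP', 'UTF-8');
$parts = mb_split_keep_linebreaks($text, 'EUC-JP');
```

Since the branches leave no gaps, `implode('', $parts)` gives back the original string, which is a convenient sanity check for any pattern built this way.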

3) Use mb_split with lookarounds. By definition, lookarounds are zero-width assertions: they don't match any characters, only positions in the string. So you can use this kind of pattern, which matches the positions just before or just after the delimiter:

(?=delimiter)|(?<=delimiter)

(The limitation of this approach is that the subpattern in the lookbehind can't have a variable length; in other words, you can't use a quantifier inside it. It can, however, be an alternation of fixed-length subpatterns: (?<=subpat1|subpat2|subpat3) )

Casimir et Hippolyte
  • I wanted to use it to split lines on linebreaks. Method 3 turned out to work well: `mb_split('(?=\r\n|\r|\n)|(?<=\r\n|\r|\n)', $text);`. Thanks! – Martijn Jun 03 '15 at 18:38
  • @Martijn: This will not work if the newline sequence is `\r\n`, because the pattern will split both at the `\r` and at the `\n`. So you will obtain: `line`, `\r`, `\n`, `line`. Method 2 is more appropriate in this case, since you can simply use this pattern: `[^\r\n]+|\r?\n|\r` – Casimir et Hippolyte Jun 03 '15 at 21:55
  • Well, it seems to work in my tests, but there's also the problem that PHP 5.2 and 5.3 throw an error because they think the pattern is empty. I'll look into your solution 2 next. – Martijn Jun 04 '15 at 05:10
  • I think I found a solution, using something inspired by your method 2 (but using `mb_ereg_search_pos` instead). Pastebin here: http://pastebin.com/arJPucV4 It's not thoroughly tested, but preliminary tests seem to work fine; it supports all `preg_split` flags and the limit. – Martijn Jun 04 '15 at 08:15