How to split text to match double quotes plus trailing text to dot?

Question

How can I split text into sentences on dots while keeping a double-quoted passage intact, even when the quoted passage itself contains a dot?

Example document like this:

“Chess helps us overcome difficulties and sufferings,” said Unnikrishnan, taking my queen. “On a chess board you are fighting. as we are also fighting the hardships in our daily life.” he said.

I want to get output like this:

Array
(
    [0] => "Chess helps us overcome difficulties and sufferings," said Unnikrishnan, taking my queen.
    [1] => "On a chess board you are fighting. as we are also fighting the hardships in our daily life," he said.
)

My code still splits on every dot:

function sample($string)
{
    $data=array();
    $break=explode(".", $string);
    array_push($data, $break);

    print_r($data);
}

I'm still confused about how to split on two delimiters, the double quote and the dot, because the text inside the double quotes is a sentence that itself contains the dot delimiter.

Jan · Answer 1 · 2017-05-20T06:35:49.833

A perfect example for (*SKIP)(*FAIL):

“[^“”]+”(*SKIP)(*FAIL)|\.\s*
# first alternative: matches a string in curly double quotes,
# then discards it, so dots inside quotes can never be split points
# second alternative: matches a literal dot, optionally followed by whitespace

In PHP:

$regex = '~“[^“”]+”(*SKIP)(*FAIL)|\.\s*~';
$parts = preg_split($regex, $your_string_here);

This yields

Array
(
    [0] => “Chess helps us overcome difficulties and sufferings,” said Unnikrishnan, taking my queen
    [1] => “On a chess board you are fighting. as we are also fighting the hardships in our daily life.”
)

See a demo on regex101.com as well as a demo on ideone.com.
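
For reference, here is a minimal runnable sketch of the same approach applied to the full sample text from the question, including the trailing "he said." The u modifier and the PREG_SPLIT_NO_EMPTY flag are additions that are not part of the snippet above: the first makes the character class operate on whole UTF-8 characters, the second drops the empty element left after the final dot. The variable names are only illustrative.

```php
<?php
// Sample text from the question, including the trailing "he said."
$text = '“Chess helps us overcome difficulties and sufferings,” said Unnikrishnan, taking my queen. '
      . '“On a chess board you are fighting. as we are also fighting the hardships in our daily life.” he said.';

// First alternative: match a curly-quoted passage, then skip past it and fail,
// so nothing inside the quotes can become a split point.
// Second alternative: a literal dot followed by optional whitespace.
// The u modifier (an addition here) treats the pattern and subject as UTF-8.
$regex = '~“[^“”]+”(*SKIP)(*FAIL)|\.\s*~u';

// PREG_SPLIT_NO_EMPTY (also an addition) removes the empty string after the last dot.
$parts = preg_split($regex, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

Note that preg_split() consumes the delimiter, so the dots that end each sentence are not part of the resulting elements.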

edited May 20 '17 at 06:35

answered May 20 '17 at 06:30


Can you tell me what the `~` character means in your regex syntax? I'm trying to learn regex, but I couldn't find `~` among the regex characters. Or can you point me to a reference for learning regex? Thanks. – Rachmad May 26 '17 at 10:30
@Rachmad: These are delimiters, like `/` or `#`, and they are needed on both sides of the regex string. – Jan May 26 '17 at 19:49
Oh, so if I change `~` to `/`, it's no problem? @Jan – Rachmad May 26 '17 at 20:44
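
To illustrate the delimiter point from the comments above, the following sketch runs one pattern wrapped in `~`, `/` and `#`. The sample string and variable names are hypothetical. Because the pattern contains none of those three characters, nothing needs escaping, and all three calls should behave identically.

```php
<?php
// Hypothetical sample string, used only for illustration.
$text = 'One. Two. Three';

// The same pattern (a dot plus optional whitespace) with three different delimiters.
$byTilde = preg_split('~\.\s*~', $text);
$bySlash = preg_split('/\.\s*/', $text);
$byHash  = preg_split('#\.\s*#', $text);

// All three arrays hold the same elements.
var_dump($byTilde === $bySlash && $bySlash === $byHash);
```

If the pattern itself contained the delimiter character (for example a `/` in a path), that character would have to be escaped or a different delimiter chosen.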

mickmackusa · Accepted Answer · 2021-07-15T12:22:57.730

Here is a simpler pattern used by preg_split() followed by preg_replace() to fix the left and right double quotes up (Demo):

$in = '“Chess helps us overcome difficulties and sufferings,” said Unnikrishnan, taking my queen. “On a chess board you are fighting. as we are also fighting the hardships in our daily life.” he said.';

$out = preg_split('/ (?=“)/', $in, 0, PREG_SPLIT_NO_EMPTY);
//$out = preg_match_all('/“.+?(?= “|$)/', $in, $out) ? $out[0] : null;

$find = '/[“”]/u';  // unicode flag is essential
$replace = '"';
$out = preg_replace($find, $replace, $out);  // replace curly quotes with standard double quotes

var_export($out);

Output:

array (
  0 => '"Chess helps us overcome difficulties and sufferings," said Unnikrishnan, taking my queen.',
  1 => '"On a chess board you are fighting. as we are also fighting the hardships in our daily life." he said.',
)

preg_split() matches the space followed by a “ (LEFT DOUBLE QUOTE).

The preg_replace() step requires a pattern with the u modifier to make sure the left and right double quotes in the character class are identified. Using '/“|”/' means you can remove the u modifier, but it doubles the steps that the regex engine has to perform (for this case, my character class uses just 189 steps versus the piped characters using 372 steps).
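
A small sketch of why the u modifier matters for the character class, assuming the source file is saved as UTF-8. The sample string and variable names are made up for illustration. Without u, the class is interpreted as a set of single bytes, so each three-byte curly quote is replaced three times.

```php
<?php
// Hypothetical sample: one left and one right curly quote around a word.
$text = '“quoted”';

// With u, each curly quote counts as one character in the class.
$withU = preg_replace('/[“”]/u', '"', $text);

// Without u, the class holds individual bytes, so every byte of each
// three-byte quote is replaced separately.
$withoutU = preg_replace('/[“”]/', '"', $text);

// Alternation compares whole byte sequences, so it works without u.
$piped = preg_replace('/“|”/', '"', $text);

var_dump($withU, $withoutU, $piped);
```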

As for the choice between preg_split() and preg_match_all(): preg_split() fits because the objective is merely to split the string on the space that is followed by a left double quote. preg_match_all() would be the more practical choice if the objective were to omit substrings that do not neighbor the delimiting space character.

Despite my logic, if you still want to use preg_match_all(), my preg_split() line can be replaced with:

$out = preg_match_all('/“.+?(?= “|$)/', $in, $out) ? $out[0] : null;

Oh, I found my problem: I just had to edit .htaccess and add the special-character settings `AddDefaultCharset UTF-8` and `AddCharset UTF-8 .php`. Thanks too, @mickmackusa — Rachmad, May 20 '17 at 07:52

Mi-Creativity · Answer 3 · 2017-05-20T09:00:03.677

Alternatively:

regex101 (16 steps)

“.[^”]+”(?:.[^“]+)?

“.[^”]+” matches everything between “ and ”.
(?:.[^“]+)? optionally matches (hence the trailing ?) everything that follows, up to the next opening “. The ?: makes the group non-capturing.

PHP (PHPfiddle), updated to replace “ and ” with ":

<?php
$str = '“Chess helps us overcome difficulties and sufferings,” said Unnikrishnan, taking my queen. “On a chess board you are fighting. as we are also fighting the hardships in our daily life.”';

if (preg_match_all('/“.[^”]+”(?:.[^“]+)?/', $str, $matches)) {
    echo '<pre>';
    // Replace curly quotes with straight double quotes.
    // '/“|”/' behaves the same as the earlier '[“|”]', which used [ and ] as delimiters.
    print_r(preg_replace('/“|”/', '"', $matches[0]));
    echo '</pre>';
}
?>

output:

Array
(
    [0] => "Chess helps us overcome difficulties and sufferings," said Unnikrishnan, taking my queen. 
    [1] => "On a chess board you are fighting. as we are also fighting the hardships in our daily life."
)
