I'm trying to split a UTF-8 string on a quote character (") with delimiter capture, except where that quote is followed by a second quote (""), so that, for example,

"A ""B"" C" & "D ""E"" F"

will split into three elements

"A ""B"" C"
&
"D ""E"" F"

I've been attempting to use:

$string = '"A ""B"" C" & "D ""E"" F"';
$temp = preg_split(
    '/"[^"]/mui',
    $string,
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

but without success as it gives me

array(7) {
  [0]=>
  string(2) " ""
  [1]=>
  string(1) """
  [2]=>
  string(1) "C"
  [3]=>
  string(2) "& "
  [4]=>
  string(2) " ""
  [5]=>
  string(1) """
  [6]=>
  string(2) "F""
}

So it's losing any character that immediately follows a quote, unless that character is also a quote.

In this example there's a quote as the first and last characters in the string, though that may not always be the case, e.g.

{ "A ""B"" C" & "D ""E"" F" }

needs to split into five elements

{
"A ""B"" C"
&
"D ""E"" F"
}

Can anybody help me get this working?

Mark Baker
  • If you split on quotes not followed by another quote (assuming that the quote is not consumed), you will get `{ `, `"A ""B"" C`, `" & `, `"D ""E"" F`, `" }`. I don't think that's the proper way to identify where to split... Do you have more examples? – Jerry Aug 11 '13 at 16:09
  • I can live with the starting/ending quote being consumed if the split is correct, as the quoted strings would be identifiable as alternating entries in the resulting array – Mark Baker Aug 11 '13 at 16:13
  • Tricky when there's a number of answers that give me what I need, especially as one seems fractionally faster, but uses fractionally more memory; while the other is fractionally slower, but uses fractionally less memory - will run some more extensive tests with more complex strings before I accept an answer.... but thanks guys, as always you've come to my rescue – Mark Baker Aug 11 '13 at 16:33
  • 1
    @MarkBaker: I have no idea what the situation is, but a couple more things to note: mine doesn’t work for mismatched quotes, and Jerry’s doesn’t work for leading or trailing escapes. Assuming the former takes more memory and the latter takes more time, you can possibly reduce memory by using `(?:[^"]|"")` instead of `([^"]|"")` and reduce execution time by removing the optional spaces. – Ry- Aug 11 '13 at 16:38
  • [What is this :)](https://eval.in/42544) – HamZa Aug 11 '13 at 18:27
  • 1
    @HamZa: I don’t know. What is that? [It doesn’t work any better.](https://eval.in/42545) – Ry- Aug 11 '13 at 18:54

2 Answers

Since you said that you don't mind the quotes being consumed by the split, you can use the expression:

(?<!")\s?"\s?(?!")

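It uses two negative lookarounds: (?<!") asserts that the match is not preceded by a quote, and (?!") asserts that it is not followed by one, so the doubled quotes in your sample are not treated as delimiters. The output on your sample will be: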

{ 
A ""B"" C
&
D ""E"" F
}

[I added the \s? on each side to consume a whitespace character adjacent to the quote; remove them if you want to keep that whitespace.]
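
As a minimal sketch of how the expression above could be plugged into preg_split: the variable names are only illustrative, and the `u` modifier and the `PREG_SPLIT_NO_EMPTY` flag are assumptions carried over from the question rather than something the expression itself requires.

```php
<?php
// Sample input from the question (the variant wrapped in braces).
$string = '{ "A ""B"" C" & "D ""E"" F" }';

// The expression from this answer, with /u added on the assumption
// that the input is UTF-8, as stated in the question.
$pattern = '/(?<!")\s?"\s?(?!")/u';

// -1 means "no limit"; PREG_SPLIT_NO_EMPTY drops empty pieces, for
// example when the string starts or ends with a quote.
$parts = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
// $parts is: ['{', 'A ""B"" C', '&', 'D ""E"" F', '}']
```

The delimiting quotes are consumed, so the quoted strings have to be identified by their position in the result rather than by their surrounding quotes.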

Jerry

I think it would probably be easier to use preg_match_all:

preg_match_all('/"([^"]|"")+"|[^"]+/', $string, $matches);

The regular expression matches either a quoted string (in which "" counts as an escaped quote) or a run of characters that contains no quotes. If the last part doesn’t have a closing quote, that unmatched quote is skipped rather than matched; that might need changing, depending on your situation.
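
A short sketch of the preg_match_all approach applied to the second sample from the question. It uses the non-capturing form of the group mentioned in the comments; the `u` modifier and the `trim` step are assumptions added here to handle UTF-8 input and to strip the spaces around the unquoted pieces, and the variable names are illustrative.

```php
<?php
// Sample input from the question (the variant wrapped in braces).
$string = '{ "A ""B"" C" & "D ""E"" F" }';

// Either a quoted string (with "" as an escaped quote) or a run of
// non-quote characters. (?:...) avoids storing a capture per match.
preg_match_all('/"(?:[^"]|"")+"|[^"]+/u', $string, $matches);

// $matches[0] holds the full matches; the unquoted pieces still carry
// their surrounding spaces ('{ ', ' & ', ' }'), so trim them here.
$tokens = array_map('trim', $matches[0]);

print_r($tokens);
// $tokens is: ['{', '"A ""B"" C"', '&', '"D ""E"" F"', '}']
```

Unlike the split-based approach, this keeps the delimiting quotes on each quoted string, which matches the five elements asked for in the question.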

Ry-