I am trying to split a text with preg_split, but I can't work out the regex for it.

example:

I search 1, regex to:  no. Or... yes!

should get:

Array
(
    [0] => I
    [1] => search
    [2] => 1
    [3] => ,
    [4] => regex
    [5] => to
    [6] => :
    [7] => no
    [8] => .
    [9] => Or
    [10] => ...
    [11] => yes
    [12] => !
)

I tried the following code:

preg_split("/([\s]+)/", "I search 1, regex to:  no. Or... yes!")

which results in:

Array
(
    [0] => I
    [1] => search
    [2] => 1,
    [3] => regex
    [4] => to:
    [5] => no.
    [6] => Or...
    [7] => yes!
)

EDIT: Ok, the original question was solved, but I forgot something in my example:

new example:

I search 1, regex (regular expression) to: That's it is! Und über den Wolken müssen wir...

should get:

array (
  0 => 'I',
  1 => 'search',
  2 => '1',
  3 => ',',
  4 => 'regex',
  5 => '(',
  6 => 'regular',
  7 => 'expression',
  8 => ')',
  9 => 'to',
  10 => ':',
  11 => 'That',
  12 => '\'s',
  13 => 'it',
  14 => 'is',
  15 => '!',
  16 => 'Und',
  17 => 'über',
  18 => 'den',
  19 => 'Wolken',
  20 => 'müssen',
  21 => 'wir',
  22 => '...',
)

One problem is that the opening ( is not split off by the first solution. Another is that the German characters ÄÖÜäöüß inside a word are not matched as word characters.

I hope it's OK to update the question (rather than open a new one).

My last attempt was the following, which doesn't match:

\s+|(?<!(A-Za-z1-0ÄÖÜäöüß)+)(?=(A-Za-z1-0ÄÖÜäöüß)+)
Thomas

2 Answers

You can use this lookaround-based regex:

$str = 'I search 1, regex to: no. Or... yes!';
$tok = preg_split('/\h+|(?<!\W)(?=\W)/', $str);

print_r($tok);

Array
(
    [0] => I
    [1] => search
    [2] => 1
    [3] => ,
    [4] => regex
    [5] => to
    [6] => :
    [7] => no
    [8] => .
    [9] => Or
    [10] => ...
    [11] => yes
    [12] => !
)

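/\h+|(?<!\W)(?=\W)/ is an alternation-based regex. It splits either on one or more horizontal whitespace characters, or at a position where the previous character is not a non-word character and the next character is a non-word character.

The right-hand side of the alternation is (?<!\W)(?=\W). Here (?<!\W) is a negative lookbehind, which asserts that the previous character is not a non-word character, and (?=\W) is a positive lookahead, which asserts that the next character is a non-word character. Both assertions are zero-width, so no characters are consumed at that split point.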
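
To see where the pattern does and does not split, here is a minimal sketch. The helper name tokenize is hypothetical and only wraps the pattern from this answer. The second call illustrates the case the asker raises later: an opening ( is preceded by whitespace, not by a word character, so the lookbehind fails there and the parenthesis stays attached to the following word.

```php
<?php
// Hypothetical helper; the pattern is the one given in this answer.
function tokenize(string $text): array
{
    // Left alternative: consume a run of horizontal whitespace.
    // Right alternative: zero-width split point where a non-word
    // character follows a word character.
    return preg_split('/\h+|(?<!\W)(?=\W)/', $text);
}

print_r(tokenize('no. Or... yes!'));
// Tokens: no | . | Or | ... | yes | !

print_r(tokenize('regex (regular expression)'));
// Tokens: regex | (regular | expression | )
```

Trailing punctuation is split off because a word character precedes it; leading punctuation is not, which is why the updated requirement needs an extra alternative.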

anubhava
  • Maybe you want to explain a bit more to OP what your regex does – Rizier123 Mar 15 '15 at 10:06
  • Yes, I've just added it to my answer. – anubhava Mar 15 '15 at 10:07
  • You're welcome. I have added a bit more details about lookaheads. – anubhava Mar 15 '15 at 10:10
  • OK, one thing I forgot about: a text with ( and ). I'll try to find the solution myself, and if I can't, I'll add a new comment :) – Thomas Mar 15 '15 at 10:13
  • OK, I don't get it :( I hope someone can have a look at my **updated** question. – Thomas Mar 15 '15 at 13:10
  • As a fair practice please don't update a question and change requirements on an accepted answer. If you have a new problem then don't hesitate to post a new question. I also note that you have removed accepted mark which was based on your original requirement and solved your problem fully. – anubhava Mar 15 '15 at 13:43
  • OK, sorry. I would undo it, but a new answer has now been received. Next time I'll post a new question. – Thomas Mar 15 '15 at 13:51
  • Looking at your edited output, it seems `preg_split('/\h+|(?<=[()])|(?<!\W)(?=\W)/u', $str)` will give you the exact same output. – anubhava Mar 16 '15 at 10:44
Apart from the 's bit, which you seem to want as one piece (that doesn't make much sense to me, since you want other punctuation characters such as ! or , as individual parts), I think you could do it by simply splitting at any whitespace or word boundary:

preg_split(
  '#\s|\b#u',
  "I search 1, regex (regular expression) to: That's it is! Und über den Wolken müssen wir...",
  -1,
  PREG_SPLIT_NO_EMPTY
);
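
As a rough comparison, the sketch below runs this word-boundary split next to the lookaround pattern suggested in the comments on the other answer. The sample text and the variable names are made up for illustration. It assumes the source file is saved as UTF-8 and that the u modifier makes \w and \b treat ü as a word character.

```php
<?php
// Illustrative sample text, not taken from the question verbatim.
$text = "That's (über) wir...";

// Word-boundary split from this answer: every \b is a split point,
// so the apostrophe becomes a token of its own.
$byBoundary = preg_split('#\s|\b#u', $text, -1, PREG_SPLIT_NO_EMPTY);

// Lookaround pattern quoted from the comments on the other answer:
// it keeps 's together and splits right after ( or ).
$byLookaround = preg_split('/\h+|(?<=[()])|(?<!\W)(?=\W)/u', $text);

print_r($byBoundary);
// Tokens: That | ' | s | ( | über | ) | wir | ...

print_r($byLookaround);
// Tokens: That | 's | ( | über | ) | wir | ...
```

The only difference on this input is the apostrophe: \b sees a boundary on both sides of it, while the lookaround pattern splits only where punctuation follows a word character.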
CBroe
  • Thanks, it works now. I can live with the 's :) [link](http://thomas.creutz.info/split.php?text=%5Cs%7C%5Cb) – Thomas Mar 15 '15 at 15:24