RegExp to get lines with linebreaks

Question

I'm trying to extract comment lines from our database. They are stored as a single string, separated by '\n'. Unfortunately, some of the comments themselves contain text with '\n', so I can't separate them correctly.

An example comment looks like:

27.11.2012 13:19 (MB): test123
27.11.2012 13:20 (MB): test456
27.11.2012 13:21 (JA): test789
lalala
lululu
27.11.2012 13:22 (JA): test10

Now I tried to separate them using a regular expression and preg_split():

#(\d{2}\.\d{2}\.20[0123]{2} \d{2}:\d{2} \([A-Z]{2,3}\): .*)#
(PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE)

but I get

Array
(
    [0] => 27.11.2012 13:19 (MB): test123
    [1] => 
    [2] => 27.11.2012 13:20 (MB): test456
    [3] => 
    [4] => 27.11.2012 13:21 (JA): test789
    [5] => 
lalala
lululu
    [6] => 27.11.2012 13:22 (JA): test10
)

How do I keep each comment together with its continuation lines?

alexis · Accepted Answer · 2012-11-27T23:38:19.643

By default, a dot in a regexp doesn't match a newline, so your .* stops at the end of the line; the seemingly empty rows contain the newlines. So drop the .* from your split pattern, and use the rest with PREG_SPLIT_DELIM_CAPTURE.

(\d{2}\.\d{2}\.20[0123]{2} \d{2}:\d{2} \([A-Z]{2,3}\):)

Each comment will be split into two parts just after the colon: the date/author header and the text that follows. You can then join your strings in pairs to get the original row (or save yourself the trouble of splitting them in the next step in your program, when you'll need to separate the fields).
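A minimal sketch of the split-and-rejoin approach, run against the sample from the question. The variable names are illustrative, and the pairing loop assumes the string starts with a date header and that every header is followed by non-empty text (otherwise PREG_SPLIT_NO_EMPTY would drop a piece and the pairs would shift).

```php
<?php
// Sample input copied from the question.
$comment = "27.11.2012 13:19 (MB): test123\n"
         . "27.11.2012 13:20 (MB): test456\n"
         . "27.11.2012 13:21 (JA): test789\nlalala\nlululu\n"
         . "27.11.2012 13:22 (JA): test10";

// Header pattern without the trailing .*; the capture group makes
// PREG_SPLIT_DELIM_CAPTURE return the header itself as a piece.
$headerPattern = '#(\d{2}\.\d{2}\.20[0123]{2} \d{2}:\d{2} \([A-Z]{2,3}\):)#';

$parts = preg_split(
    $headerPattern,
    $comment,
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

// $parts alternates: header, text, header, text, ...
// Join each pair and drop the newline that separated the comments.
$entries = [];
for ($i = 0; $i + 1 < count($parts); $i += 2) {
    $entries[] = $parts[$i] . rtrim($parts[$i + 1], "\n");
}

print_r($entries);
```

If the next step needs the date and author separately anyway, the pairs can be kept apart instead of being concatenated.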

If you really hate the idea of splitting your input rows:

Use preg_match_all instead of splitting.
Add the PCRE_DOTALL (s) flag to modify the meaning of ., so that it also matches newlines.
That would make the first .* match all the way to the end of the file, so make it non-greedy: .*?.

Now, you need to match everything until the next date pattern, but stop just before it. You can express this by ending the regex with a lookahead expression. Since it will separate your matched groups, you no longer need to put it explicitly in the matched pattern.

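In other words, try this pattern (I've added the s flag as a suffix, but of course you can pass it separately). The \z alternative lets the last comment match too, since no date follows it, and .+? keeps the pattern from producing an empty match at the end of the string:

/(.+?)(?:\n(?=\d{2}\.\d{2}\.20[0123]{2} \d{2}:\d{2} \([A-Z]{2,3}\):)|\z)/s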
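A minimal sketch of the preg_match_all variant, run against the sample from the question. The pattern here uses .+? and an end-of-string alternative (\z) so that the final comment, which has no date line after it, is captured as well; the variable names are illustrative.

```php
<?php
// Sample input copied from the question.
$comment = "27.11.2012 13:19 (MB): test123\n"
         . "27.11.2012 13:20 (MB): test456\n"
         . "27.11.2012 13:21 (JA): test789\nlalala\nlululu\n"
         . "27.11.2012 13:22 (JA): test10";

// Date/author header, reused inside the lookahead.
$date = '\d{2}\.\d{2}\.20[0123]{2} \d{2}:\d{2} \([A-Z]{2,3}\):';

// With the s flag the dot also matches newlines. The lazy .+? stops at
// the first newline that is followed by a header, or at the end of the
// string (\z) for the last comment.
$pattern = '/(.+?)(?:\n(?=' . $date . ')|\z)/s';

preg_match_all($pattern, $comment, $matches);

// Group 1 holds each full comment, continuation lines included.
$entries = $matches[1];

print_r($entries);
```

If the stored string can end with a trailing newline, the last entry will include it; trimming each entry afterwards is one way to handle that.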

Comment: I avoid lookaheads/lookbehinds as much as possible, and you can probably see why. I find the two-part solution simpler and more maintainable, but the lookahead makes sense here.

PS. If changing the storage format is still an option, consider converting to CSV and reading it with fgetcsv or something similar.
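To illustrate why CSV would help: quoted CSV fields may contain newlines, and fgetcsv reads them back as a single field. The sketch below round-trips two rows through an in-memory stream. The column layout (timestamp, author, text) is hypothetical, and passing an empty string as the escape character assumes PHP 7.4 or later.

```php
<?php
// Hypothetical layout: timestamp, author initials, comment text.
$rows = [
    ['27.11.2012 13:20', 'MB', 'test456'],
    ['27.11.2012 13:21', 'JA', "test789\nlalala\nlululu"],
];

// Write the rows to an in-memory stream instead of a real file.
$handle = fopen('php://memory', 'w+');
foreach ($rows as $row) {
    // Empty escape string disables PHP's non-standard escape handling.
    fputcsv($handle, $row, ',', '"', '');
}
rewind($handle);

// Read them back; the multi-line text arrives as a single field.
$read = [];
while (($fields = fgetcsv($handle, 0, ',', '"', '')) !== false) {
    $read[] = $fields;
}
fclose($handle);

print_r($read);
```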

Thank you for that answer. I tried to avoid getting two parts per line, but I'll deal with it... — manuxi, Nov 27 '12 at 22:12
If you really hate that, try the second variant I just added. — alexis, Nov 27 '12 at 22:31
I just handled it, but was curious whether your second regex gives better results. However, preg_match_all(self::$sRegExpDateTimex,$sComment,$aMatches); gives me the lines only up to the colon... — manuxi, Nov 27 '12 at 22:53
Oops, you're right, I was being stupid; `.*?` stops as soon as possible, which is immediately since there's no reason to go on. Revising it... — alexis, Nov 27 '12 at 23:28
PS. If my answer helped you, you should "accept" it by clicking on the check box next to it. (Feel free to upvote it as well :-)) — alexis, Nov 27 '12 at 23:41
