RegEx for "fixing" e-mail headers, making them a single line

Question

Possible Duplicate:
How to do unfolding RFC 822
Parsing e-mail-like headers (similar to RFC822)

I have some input data that is similar to e-mail data, in that long lines are wrapped to the next line. For example:

robot-useragent: ABCdatos BotLink/1.0.2 (test links)
robot-language: basic
robot-description: This robot is used to verify availability of the ABCdatos
                   directory entries (http://www.abcdatos.com), checking
                   HTTP HEAD. Robot runs twice a week. Under HTTP 5xx
                   error responses or unable to connect, it repeats
                   verification some hours later, verifiying if that was a
                   temporary situation.

The robot-description field is "too long" for one line, so it is wrapped onto the following lines. To make this data easier to parse, I would like to come up with a RegEx for use with preg_replace() that replaces line breaks under the following conditions:

Replace new line characters that are followed by whitespace
Do not replace new line characters that are followed by additional new line characters

Example output:

robot-description: This robot is used to verify availability of the ABCdatos directory entries (http://www.abcdatos.com), checking HTTP HEAD. Robot runs twice a week. Under HTTP 5xx error responses or unable to connect, it repeats verification some hours later, verifiying if that was a temporary situation.

I am new to RegEx. How can I build such an expression? If you choose to answer, please include a brief explanation of the components in the expression. I'd really like to learn how to do these.

I've started with this: \n([^\S])* It is close. http://codepad.org/iMObpgFX
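For reference, here is a minimal sketch of why this first attempt over-matches. It is written in Python purely for illustration (`re.sub` plays the role of `preg_replace()` here), and the two-field input is made up. `[^\S]` is a double negative that means the same as `\s`, and `*` allows zero repetitions, so every `\n` matches, including the ones that separate two headers.

```python
import re

# The first attempt from the question, with a single space as the replacement.
FIRST_ATTEMPT = re.compile(r"\n([^\S])*")

# Hypothetical input: field "b" has one continuation line.
text = "a: 1\nb: 2\n   cont\n"

# Every "\n" is replaced, not only the one before the indented line.
result = FIRST_ATTEMPT.sub(" ", text)
print(repr(result))  # 'a: 1 b: 2 cont '
```

Requiring at least one whitespace character after the line break (`+` instead of `*`) is what keeps the break between two headers intact.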

@MarcB, this isn't a duplicate. In my other question, I am asking how to handle the headers as a whole, in a manner similar to the built-in IMAP functions. In this question, I am specifically asking about a RegEx to re-join the lines. In my view, these are entirely separate questions. While they relate to the same goal for me, I would like to know a solution for both. If you disagree, please let me know. – Brad, Oct 09 '12 at 17:57

score 1 · Answer 1 · answered Oct 09 '12 at 17:57

Maybe you could try:

(\r|\n)\s+

(\r|\n) # matches either a carriage return or a newline
\s+     # one or more whitespace characters (tabs, spaces, new lines)

Try it!
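A minimal sketch of a stricter variant, in case it helps. It is written in Python for illustration; the pattern itself should carry over to `preg_replace()`, but that is an assumption and not something tested in this thread. The names `UNFOLD` and `unfold` and the `robot-id` field are made up. Because `\s` also matches `\r` and `\n`, the pattern above joins lines across a blank line, and with `\r\n` line endings the `\r` alone is followed by whitespace, so every line gets joined. Limiting the second part to spaces and tabs avoids both cases.

```python
import re

# Assumed variant of the pattern above (not from the original answer):
#   \r?\n   one line break, with an optional carriage return before it
#   [ \t]+  one or more spaces or tabs that start the continuation line
UNFOLD = re.compile(r"\r?\n[ \t]+")

def unfold(text: str) -> str:
    """Join indented continuation lines onto the header line before them."""
    return UNFOLD.sub(" ", text)

sample = (
    "robot-language: basic\n"
    "robot-description: This robot is used to verify availability of the ABCdatos\n"
    "                   directory entries (http://www.abcdatos.com), checking\n"
    "                   HTTP HEAD.\n"
    "\n"
    "robot-id: hypothetical-next-record\n"
)

# The blank line and the break between headers survive; only wrapped lines are joined.
print(unfold(sample))

# For comparison, the pattern from the answer also swallows the blank line.
print(re.sub(r"(\r|\n)\s+", " ", sample))
```

Replacing with a single space keeps the words of the wrapped field separated, which matches the example output in the question.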

Boris Guéry

Thank you for your answer. Unfortunately, this seems to put everything all on one line. http://codepad.org/zGRAzqhM – Brad Oct 09 '12 at 17:58
Boris, I see that it is working in your example. Any idea why the codepad version isn't? Are there some additional options that I need to specify? – Brad Oct 09 '12 at 18:05
It works on codepad too; check the page source or use `nl2br()` (new lines are not rendered in HTML) – Boris Guéry Oct 09 '12 at 18:07
Codepad shows the raw output with line numbers, not HTML. You can see this example with your expression, with a new line and test text after it: http://codepad.org/zfWgK80b – Brad Oct 09 '12 at 18:29
@Brad, I suspect there is something wrong with codepad. I tested both on my local env and on http://ideone.com/d4WMT and the result is what it is expected to be. Note that your answer is doing exactly the same (except it doesn't allow new line as whitespace). `[ \t\n]` = `\s` – Boris Guéry Oct 10 '12 at 08:13
