Matching double line breaks using Regex

Question

I am writing a regex to extract various pieces of information from an EDIFACT UN code list. As there are tens of thousands of codes, I do not wish to type them all in, so I have decided to use a regex to parse the text file and extract the bits that I need. The text file is structured in a way that lets me easily identify the bits that I want.

I created the following regex and tested it in Regex Hero, but I just cannot get the codeComment group to match everything up to a double line break. I have tried using the character class [^\n\n], but this still won't match up to a double line break.

Note: I have selected the Multiline option on Regex Hero.

(?<element>\d+)\s\s(?<elementName>.*)\[[B|C|I]\]\s+Desc: (?<desc>[^\n]*\s*[^\n]*)
^\s*Repr: (?<type>a(?:n)?)..(?<length>\d+)
^\s*(?<code>\d+)\s*(?<codeName>[^\n]*)
^\s{14}(?<codeComment>[^\n]*)

This is the example text I am using to match.

----------------------------------------------------------------------

1073 Document line action code [B]

Desc: Code indicating an action associated with a line of a
    document.

Repr: an..3

1 Included in document/transaction
    The document line is included in the
    document/transaction.
    should capture this as well.

2 Excluded from document/transaction
    The document line is excluded from the
    document/transaction.

What I want is for codeComment to contain the following:

The document line is included in the
          document/transaction.
          should capture this as well.

but it is only extracting the first line:

The document line is included in the

score 5 · Accepted Answer · answered Oct 25 '12 at 09:31

In a character class, every character counts once, no matter how often you write it. So a character class can't be used to check for consecutive linebreaks. But you can use a lookahead assertion:

^\s{14}(?<codeComment>(?s)(?:(?!\n\n).)*)

(?s) switches on singleline mode (to allow the dot to match newlines).

(?!\n\n) asserts that there are no two consecutive linebreaks at the current position.
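
As a minimal sketch of the same idea outside .NET, here is a Python port. It is a translation under assumptions, not the original pattern: Python spells named groups `(?P<name>...)` and expects global inline flags such as `(?s)` at the start of a pattern, so the flags are passed as `re.MULTILINE | re.DOTALL` instead. The sample text and its 4-space indentation are made up for illustration (the pattern above expects 14 whitespace characters).

```python
import re

# Hypothetical sample: two code entries separated by a blank line.
sample = (
    "1 Included in document/transaction\n"
    "    The document line is included in the\n"
    "    document/transaction.\n"
    "    should capture this as well.\n"
    "\n"
    "2 Excluded from document/transaction\n"
    "    The document line is excluded from the\n"
    "    document/transaction.\n"
)

# (?:(?!\n\n).)* consumes one character at a time, but only while the
# current position is NOT the start of two consecutive newlines.
pattern = re.compile(
    r"^ {4}(?P<codeComment>(?:(?!\n\n).)*)",
    re.MULTILINE | re.DOTALL,  # ^ matches at line starts, . matches newlines
)

match = pattern.search(sample)
comment = match.group("codeComment")
print(comment)
```

The captured text keeps the indentation of the continuation lines, so it may need to be normalized afterwards.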

Tim Pietzcker

Your answer is spot on, but I am having trouble getting the amended regex to pick up the "2 Excluded from document/transaction The document line is excluded from the document/transaction." lines as well. – Intrepid Oct 25 '12 at 10:58
@MikeClarke: But those follow after a double linebreak, so I thought you did *not* want to pick them up? If you do, what *is* the correct delimiter? – Tim Pietzcker Oct 25 '12 at 11:47
Like I said, your answer correctly picked up all lines of the comment for code 1, but I also need it to pick up the other code blocks. The amended regex just stops on code 1 and doesn't carry on picking up further codes. – Intrepid Oct 25 '12 at 12:02
OK, but if it shouldn't stop at a double linebreak (which is what you wrote in your question), where should it stop then? – Tim Pietzcker Oct 25 '12 at 12:03
You're misunderstanding me. The amended regex is now correctly picking up all lines for the comment but ONLY for code 1. For some reason the entire regex is not correctly continuing to pick up code 2 and its comments, etc. I think it's because code 1 follows 'Repr:' and code 2 follows code 1, etc., so I probably need to change the regex to allow for this. – Intrepid Oct 25 '12 at 12:21
There will be one or more code sections. For example take a look at [this](http://www.unece.org/fileadmin/DAM/trade/untdid/d12a/tred/tred1001.htm). – Intrepid Oct 25 '12 at 12:35
In that link, after the 998th "code section", there are more than 2 newlines for the first time. Then just use that as your stopping criterion: `^\s{14}(?<codeComment>(?s)(?:(?!\n\n\n).)*)` – Tim Pietzcker Oct 25 '12 at 12:54
I think I have sorted it, but not 100% sure it's correct. I didn't need to worry about the regex in your last comment; your original regex was sufficient to allow me to grab all comment lines PER code. I have come up with this: ^(?:\*\s{4}|^\s{6})(?<element>\d+)\s\s(?<elementName>.*)\[[B|C|I]\]$ \s{5,}Desc:\s(?<desc>(?s)(?:(?!\n\n).)*)$ \s{5,}Repr:\s(?<type>a(?:n)?)..(?<length>\d(?s)(?:(?!\n\n).)*)$ \s{5,}(?<code>\d+)\s*(?<codeName>(?s)(?:(?!\n).)*)$ \s{14,}(?<codeComment>(?s)(?:(?!\n\n).)*)$ |\s{5,}(?<code>\d+)\s*(?<codeName>(?s)(?:(?!\n).)*)$ \s{14,}(?<codeComment>(?s)(?:(?!\n\n).)*)$ – Intrepid Oct 25 '12 at 13:05
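
One way to read the exchange above: the comment group correctly stops at the blank line, and picking up code 2 is a matter of applying the per-code part of the pattern repeatedly rather than once. The following is a rough Python sketch of that idea, not the regex from the comments. It assumes each code entry is a `number name` line followed by indented comment lines and ends at a blank line; the sample text, the 4-space indent, and the `normalize` helper are hypothetical.

```python
import re

# Hypothetical sample with two code entries separated by a blank line.
sample = (
    "1 Included in document/transaction\n"
    "    The document line is included in the\n"
    "    document/transaction.\n"
    "\n"
    "2 Excluded from document/transaction\n"
    "    The document line is excluded from the\n"
    "    document/transaction.\n"
)

# One code entry: "<code> <codeName>" on one line, then an indented
# comment that runs until the next double line break (or end of text).
code_entry = re.compile(
    r"^(?P<code>\d+) +(?P<codeName>[^\n]*)\n"
    r" {4}(?P<codeComment>(?:(?!\n\n).)*)",
    re.MULTILINE | re.DOTALL,
)

def normalize(comment):
    """Collapse line breaks and indentation into single spaces."""
    return " ".join(comment.split())

# finditer resumes after each match, so every entry is visited in turn.
entries = [
    (m.group("code"), m.group("codeName"), normalize(m.group("codeComment")))
    for m in code_entry.finditer(sample)
]

for code, name, comment in entries:
    print(code, "|", name, "|", comment)
```

Splitting the work this way (one pattern for the element header, another applied repeatedly for the codes) avoids the alternation with duplicated group names, at the cost of two passes over the text.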

score 3 · Answer 2 · answered Jun 13 '19 at 03:52

Try:

    [\r\n]{2,}

This matches double line breaks.

I have used it in DWR to remove double/bloated line breaks (left over from unzipping files, for some reason).

More info: How to remove unwanted "extra line breaks" that appear in PHP/CSS/JS files after unzip?

Christian Žagarskas

This would match `\r\n` which is a single line break. Maybe `(\r?\n){2,}` – PHPirate Jan 31 '22 at 07:59
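
The difference pointed out in this comment can be checked with a small Python sketch. The variable names and sample strings are made up for illustration; the non-capturing form `(?:\r?\n){2,}` is used in place of `(\r?\n){2,}` so that `split` does not return the captured separators.

```python
import re

loose = re.compile(r"[\r\n]{2,}")       # any two or more CR/LF characters
strict = re.compile(r"(?:\r?\n){2,}")   # two or more complete line breaks

single_crlf = "line one\r\nline two"      # one Windows line break
double_crlf = "para one\r\n\r\npara two"  # blank line, Windows style
double_lf = "para one\n\npara two"        # blank line, Unix style

# The character class counts \r and \n separately, so a single CRLF
# already satisfies {2,}.
print(loose.search(single_crlf) is not None)

# The grouped form needs two full line breaks in a row.
print(strict.search(single_crlf) is None)
print(strict.split(double_crlf))
print(strict.split(double_lf))
```

Neither pattern handles old Mac-style lone `\r` line breaks as complete line breaks in the strict form; that case is left out of this sketch.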

score 0 · Answer 3 · answered Apr 18 '20 at 18:25

This one is simple and works best for me:

/[\r]?\n[\r]?\n/g

habibhassani
