
I am writing a regex to extract various pieces of information from an EDIFACT UN code list. There are tens of thousands of codes, so rather than type them all in I want to parse the text file with a regex and pull out the parts I need. The file is structured consistently, so those parts are easy to identify.

I built the following regex and tested it in Regex Hero, but I cannot get the codeComment group to match everything up to a double line break. I tried the character class [^\n\n], but the match still stops at the first line break.

Note: I have selected the Multiline option on Regex Hero.

(?<element>\d+)\s\s(?<elementName>.*)\[[B|C|I]\]\s+Desc: (?<desc>[^\n]*\s*[^\n]*)
^\s*Repr: (?<type>a(?:n)?)..(?<length>\d+)
^\s*(?<code>\d+)\s*(?<codeName>[^\n]*)
^\s{14}(?<codeComment>[^\n]*)

This is the example text I am using to match.

----------------------------------------------------------------------

  • 1073 Document line action code [B]

    Desc: Code indicating an action associated with a line of a
        document.

    Repr: an..3

    1 Included in document/transaction
        The document line is included in the
        document/transaction.
        should capture this as well.

    2 Excluded from document/transaction
        The document line is excluded from the
        document/transaction.

What I want is for codeComment to contain the following:

The document line is included in the
          document/transaction.
          should capture this as well.

but it is only extracting the first line:

The document line is included in the
Intrepid

3 Answers

In a character class, every character counts once, no matter how often you write it. So a character class can't be used to check for consecutive linebreaks. But you can use a lookahead assertion:

^\s{14}(?<codeComment>(?s)(?:(?!\n\n).)*)

(?s) switches on singleline mode (to allow the dot to match newlines).

(?!\n\n) asserts that there are no two consecutive linebreaks at the current position.
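
A minimal Python sketch of the same idea, in case it helps to see it run. It is not the .NET pattern verbatim: Python spells named groups `(?P<name>...)` and wants the single-line flag scoped as `(?s:...)` rather than switched on mid-pattern. The 8-space indent is an assumption taken from the sample as it appears on this page (the question's regex uses `\s{14}`), and the variable names are only illustrative.

```python
import re

# Sample laid out as in the question (8-space comment indent is an assumption).
sample = (
    "    1 Included in document/transaction\n"
    "        The document line is included in the\n"
    "        document/transaction.\n"
    "        should capture this as well.\n"
    "\n"
    "    2 Excluded from document/transaction\n"
    "        The document line is excluded from the\n"
    "        document/transaction.\n"
)

# Writing \n twice in a class changes nothing: both classes match the same runs.
same_runs = re.findall(r"[^\n\n]+", "a\n\nb") == re.findall(r"[^\n]+", "a\n\nb")

# Consume any character (newlines included) as long as the current
# position is not the start of two consecutive line breaks.
pattern = re.compile(
    r"^[ ]{8}(?P<codeComment>(?s:(?:(?!\n\n).)*))",
    re.MULTILINE,
)

match = pattern.search(sample)
comment_lines = [line.strip() for line in match.group("codeComment").splitlines()]
print(comment_lines)
```

The capture stops right before the blank line, so all three comment lines of code 1 end up in the group.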

Tim Pietzcker
  • Your answer is spot on, but I am having trouble getting the amended regex to pick up the "2 Excluded from document/transaction The document line is excluded from the document/transaction." lines as well. – Intrepid Oct 25 '12 at 10:58
  • @MikeClarke: But those follow after a double linebreak, so I thought you did *not* want to pick them up? If you do, what *is* the correct delimiter? – Tim Pietzcker Oct 25 '12 at 11:47
  • Like I said, your answer correctly picked up all lines of the comment for code 1, but I also need it to pick up the other code blocks. The amended regex just stops on code 1 and doesn't carry on picking up further codes. – Intrepid Oct 25 '12 at 12:02
  • OK, but if it shouldn't stop at a double linebreak (which is what you wrote in your question), where should it stop then? – Tim Pietzcker Oct 25 '12 at 12:03
  • You're misunderstanding me. The amended regex is now correctly picking up all lines for the comment but ONLY for code 1. For some reason the entire regex is not correctly continuing to pick up code 2 and its comments, etc. I think it's because code 1 follows 'Repr:' and code 2 follows code 1, etc., so I probably need to change the regex to allow for this. – Intrepid Oct 25 '12 at 12:21
  • There will be one or more code sections. For example take a look at [this](http://www.unece.org/fileadmin/DAM/trade/untdid/d12a/tred/tred1001.htm). – Intrepid Oct 25 '12 at 12:35
  • In that link, after the 998th "code section", there are more than 2 newlines for the first time. Then just use that as your stopping criterion: `^\s{14}(?<codeComment>(?s)(?:(?!\n\n\n).)*)` – Tim Pietzcker Oct 25 '12 at 12:54
  • I think I have sorted it, but not 100% sure it's correct. I didn't need to worry about the regex in your last comment; your original regex was sufficient to allow me to grab all comment lines PER code. I have come up with this: `^(?:\*\s{4}|^\s{6})(?<element>\d+)\s\s(?<elementName>.*)\[[B|C|I]\]$ \s{5,}Desc:\s(?<desc>(?s)(?:(?!\n\n).)*)$ \s{5,}Repr:\s(?<type>a(?:n)?)..(?<length>\d(?s)(?:(?!\n\n).)*)$ \s{5,}(?<code>\d+)\s*(?<codeName>(?s)(?:(?!\n).)*)$ \s{14,}(?<codeComment>(?s)(?:(?!\n\n).)*)$ |\s{5,}(?<code>\d+)\s*(?<codeName>(?s)(?:(?!\n).)*)$ \s{14,}(?<codeComment>(?s)(?:(?!\n\n).)*)$` – Intrepid Oct 25 '12 at 13:05
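
For the "one match per code" point raised in these comments, here is a rough Python sketch rather than the .NET regex itself. It assumes every code line is followed by at least one indented comment line and that blocks are separated by a blank line, as in the sample; `listing`, `entry` and `codes` are illustrative names, and the group syntax is Python's `(?P<name>...)`.

```python
import re

# Excerpt with the layout shown in the question (assumed indentation).
listing = """\
    Repr: an..3

    1 Included in document/transaction
        The document line is included in the
        document/transaction.
        should capture this as well.

    2 Excluded from document/transaction
        The document line is excluded from the
        document/transaction.
"""

# One match per code entry: the code line, then a comment that runs
# until the next double line break (or the end of the text).
entry = re.compile(
    r"^[ ]*(?P<code>\d+)[ ]+(?P<codeName>[^\n]*)\n"
    r"[ ]+(?P<codeComment>(?s:(?:(?!\n\n).)*))",
    re.MULTILINE,
)

codes = [
    # Collapse the wrapped comment lines into a single line of text.
    (m["code"], m["codeName"], " ".join(m["codeComment"].split()))
    for m in entry.finditer(listing)
]

for code, name, comment in codes:
    print(code, "|", name, "|", comment)
```

Because the code part does not depend on `Repr:` coming first, `finditer` keeps going after code 1 and yields code 2 as well.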

Try

    [\r\n]{2,}

to match double line breaks.

I used this in DWR to remove doubled/bloated line breaks (left over from unzipping files, for some reason).

More info: How to remove unwanted "extra line breaks" that appear in PHP/CSS/JS files after unzip?
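
A short Python sketch of how this pattern behaves (the sample strings are made up). One thing to watch: the class counts characters, not line breaks, so a single Windows `\r\n` already satisfies `{2,}`.

```python
import re

run_of_breaks = re.compile(r"[\r\n]{2,}")

# Unix line endings: runs of two or more breaks collapse to one.
unix_text = "first\n\n\nsecond\nthird"
collapsed = run_of_breaks.sub("\n", unix_text)

# Windows line endings: a single \r\n is two characters from the class,
# so it matches even though there is no blank line.
windows_text = "first\r\nsecond"
hits = run_of_breaks.findall(windows_text)

print(repr(collapsed))
print(hits)
```

That is fine for stripping bloated breaks, but it cannot tell a blank line from an ordinary CRLF line ending.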

Christian Žagarskas

This one is simple and works best for me:

/[\r]?\n[\r]?\n/g
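
The same pattern in Python, as a small sketch with made-up input: `\r?\n\r?\n` matches exactly two consecutive line breaks, each with an optional carriage return, so it works as a block separator for both Unix and Windows files.

```python
import re

double_break = re.compile(r"\r?\n\r?\n")

# Mixed line endings on purpose: one CRLF blank line, one LF blank line.
text = "block one\r\n\r\nblock two\n\nblock three"
blocks = double_break.split(text)

print(blocks)
```
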
habibhassani