I have used a regex to successfully extract the text right after "Abc 123" on the same line, but it doesn't extract anything from the following lines.

Is there any way I can use regex to extract the following:

"Abc 123 def
ghi
jkl"
"Abc 123 def ghi jkl mno"
"Abc 123 def ghi jkl 
mno"

I am using Regex in Talend.

Lighteden
  • You seem to be asking about code that doesn't work, but you forgot to post that code. Hint: spend less time creating screenshots; just post the text you are working with. That makes things much easier for everybody. – GhostCat Dec 12 '16 at 07:41
  • Try `/^(\w+)\s(\d+)(.*(?:\r?\n(?!\w+\s\d).*)*)/gm`. I'm not sure you indicated the correct expected output. – Wiktor Stribiżew Dec 12 '16 at 07:46
  • @WiktorStribiżew what desired output did you assume? – xenteros Dec 12 '16 at 07:49
  • @xenteros: Like [this one](https://regex101.com/r/2QWWa2/1). Lighteden, you removed the Java tag, please confirm the environment where you use the regex and how. – Wiktor Stribiżew Dec 12 '16 at 07:52
  • @WiktorStribiżew I believe your regex is correct; this example is a simplified version of what I am working on. I am trying to understand your regex right now. Thank you anyway. – Lighteden Dec 12 '16 at 07:56
  • @light it seems like you are trying to *split* the input, is that right? And in what specific product/feature within Talend are you using regex? – Bohemian Dec 12 '16 at 08:16
  • @Bohemian's comment is right: why not split the string at [`\n(?=\w+ \d)`](https://regex101.com/r/ZKQV4p/1), or is this not possible in your environment? – bobble bubble Dec 12 '16 at 10:31

2 Answers


I think you want to extract substrings that start at the beginning of a line with 1+ word chars, then a whitespace, then 1 or more digits, and that span multiple lines up to the next occurrence of the same pattern.

You may use the following regex (note the flags and notation may differ depending on the language you are using):

/^(\w+)\s(\d+)(.*(?:\r?\n(?!\w+\s\d).*)*)/gm

See the regex demo.

Details:

  • ^ - start of a line
  • (\w+) - Group 1: one or more word chars
  • \s - 1 whitespace
  • (\d+) - Group 2: one or more digits
  • (.*(?:\r?\n(?!\w+\s\d).*)*) - Group 3:
    • .* - any 0+ chars other than line break chars
    • (?:\r?\n(?!\w+\s\d).*)* - zero or more sequences of:
      • \r?\n - a line break...
      • (?!\w+\s\d) - that is not followed with 1+ word chars, a whitespace, and a digit
      • .* - any 0+ chars other than line break chars
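
Since Talend runs on Java, here is a minimal sketch of how the pattern above might be applied there. It assumes the regex is used from plain Java code (for example inside a routine); the class name `RecordExtractor` and the method `extract` are hypothetical names chosen for illustration. The `/gm` flags have no direct Java notation, so the sketch uses `Pattern.MULTILINE` for `m` and a `find()` loop for `g`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordExtractor {
    // Same pattern as above, with backslashes doubled for a Java string literal.
    // Pattern.MULTILINE makes ^ match at the start of every line.
    private static final Pattern RECORD = Pattern.compile(
            "^(\\w+)\\s(\\d+)(.*(?:\\r?\\n(?!\\w+\\s\\d).*)*)",
            Pattern.MULTILINE);

    // Returns every full match; each one may span several lines.
    static List<String> extract(String input) {
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            // group(1), group(2) and group(3) hold the word, the digits and the rest.
            records.add(m.group());
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "Abc 123 def\nghi\njkl\n"
                + "Abc 123 def ghi jkl mno\n"
                + "Abc 123 def ghi jkl \nmno";
        for (String record : extract(input)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

With the sample input from the question, each record starts at a line beginning with `Abc 123` and runs up to, but not including, the line break before the next such line.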
Wiktor Stribiżew

(\w+)\s(\d+)((.|\R)+) is what you want, so after escaping it for a Java string it'll be: (\\w+)\\s(\\d+)((.|\\R)+).
\R is a linebreak matcher available in Java regex since Java 8. It matches any line break sequence, including \r\n, \n, and \r.

If you only allow a single line break:
(\w+)\s(\d+)((.+)(\R.+){0,1})

I think you should specify your desired output more precisely, but from this answer you can learn how to include either all following lines or at most two lines.
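
To make the difference between the two patterns concrete, here is a small sketch under a few assumptions: the class name `LinebreakMatcherDemo` and the helper `firstMatch` are hypothetical, and the patterns are lightly adapted from the ones in this answer (`(\w+)` so that group 1 holds the whole word, and non-capturing inner groups). It requires Java 8 or later because of `\R`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinebreakMatcherDemo {
    // Greedy variant: (?:.|\R)+ crosses every line break and runs to the end of the input.
    static final Pattern ALL_LINES =
            Pattern.compile("(\\w+)\\s(\\d+)((?:.|\\R)+)");

    // Bounded variant: the rest of the first line plus at most one more line.
    static final Pattern UP_TO_TWO_LINES =
            Pattern.compile("(\\w+)\\s(\\d+)(.+(?:\\R.+)?)");

    // Returns the first full match of the pattern, or null if there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "Abc 123 def\nghi\njkl";
        System.out.println("[" + firstMatch(ALL_LINES, input) + "]");
        System.out.println("[" + firstMatch(UP_TO_TWO_LINES, input) + "]");
    }
}
```

Note that the greedy variant does not stop at the next `Abc 123` line: if the input holds several records, the first match swallows all of them, which is why the desired output matters here.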

xenteros