
I am trying to parse data from a line like this

"Lorem ipsum dolor sit amet, IP: 111.111.111.111, 222.222.222.222, 333.333.333.333\r\n adipiscing elit, sed do eiusmod\r\n tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud"

I am trying to capture the values like this:

  • message: "Lorem ipsum dolor sit amet, IP: 111.111.111.111, 222.222.222.222, 333.333.333.333\r\n adipiscing elit, sed do eiusmod\r\n tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud"
  • ip: "111.111.111.111, 222.222.222.222, 333.333.333.333"

There can be arbitrarily many IPs, including zero.

I am using fluent-bit with a single regex. This is an example of a fluent-bit parser definition:

[PARSER]
Name syslog-rfc3164
Format regex
Regex /^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$/
Time_Key    time
Time_Format %b %d %H:%M:%S
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep   On

Thanks to Cary and Aleksei here is the solution:

\A(?<whole>.*?((?<=IP: )(?<ip>(?<four_threes>\d{1,3}(?:\.\d{1,3}){3})(?:, \g<four_threes>)*)).*?)\z

https://rubular.com/r/Kgh5EXMCA0lkew

EDIT

I realized that some strings don't have the "IP:..." pattern in them, which gives me a parsing error.

string1: "Lorem ipsum dolor sit amet, IP: 111.111.111.111, 222.222.222.222, 333.333.333.333\r\n adipiscing elit, sed do eiusmod\r\n tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud"

string2: "Lorem ipsum dolor sit amet, \r\n adipiscing elit, sed do eiusmod\r\n tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud"

I tried applying * (0 or more) to the ip named group, but I was not able to make it work. Any idea how I can do this?

nomad
  • The initial string you have is not **a line**, it's **lines**. – sawa Mar 15 '19 at 10:08
  • @sawa OK, so let's say it's a string. I want to capture that whole string and also capture the IPs within that string, but I need them to be captured in 2 different values/tags. Does that make more sense? – nomad Mar 15 '19 at 15:37
  • off-topic: @nomad, I have some thoughts about https://stackoverflow.com/questions/61514390/how-to-prevent-echo-command-to-execute-commands-in-bash-script but the question is deleted and I can't comment. – glenn jackman Apr 30 '20 at 17:13

2 Answers

2
str = 'Lorem, IP: 111.111.111.111, 222.222.222.222, 333.333.333.333\r\n adipiscing'

r = /
    \A                     # match the beginning of the string
    (?<whole>              # begin named group 'whole' 
      .*?                  # match >= 0 characters, lazily
      (?<ip>               # begin named group 'ip'
        (?<four_threes>    # begin a named group 'four_threes'
          \d{1,3}          # match 1-3 digits
          (?:              # begin a non-capture group
            \.             # match a period
            \d{1,3}        # match 1-3 digits
          ){3}             # close non-capture group and execute same 3 times
        )                  # close capture group 'four_threes'
        (?:                # begin a non-capture group
          ,\p{Space}       # match ', '
          \g<four_threes>  # execute subexpression named 'four_threes'
        )*                 # close non-capture group and execute same >= 0 times
      )                    # close capture group 'ip'
      .*                   # match >= 0 characters
    )                      # close capture group 'whole'
    /x                     # free-spacing regex definition mode

m = str.match(r)
m[:whole] 
  #=> "Lorem, IP: 111.111.111.111, 222.222.222.222, 333.333.333.333\\r\\n adipiscing" 
m[:ip]
  #=> "111.111.111.111, 222.222.222.222, 333.333.333.333" 

The regex is conventionally written:

/\A(?<whole>.*?(?<ip>(?<four_threes>\d{1,3}(?:\.\d{1,3}){3})(?:, \g<four_threes>)*).*)/

When defining a regex in free-spacing mode spaces must be protected in some way, else they will be removed before the expression is parsed. I have used \p{Space}, but [[:space:]], \s and [ ] (a space within a character class) could be used as well. (All but the last match a whitespace character.) When the regex is written in the conventional way a space can be used, as shown above.
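
As a small illustration of that point, the sketch below compares the same pattern in conventional mode, in free-spacing mode with a bare space, and in free-spacing mode with the space protected by a character class. The sample string and variable names are made up for this example.

```ruby
sample = "111, 222"

plain    = /, \d+/      # conventional mode: the space is matched literally
stripped = /, \d+/x     # free-spacing mode: the space is dropped, same as /,\d+/
guarded  = /,[ ]\d+/x   # free-spacing mode: the space is protected by [ ]

plain_hit    = sample[plain]     # substring matched by the pattern, or nil
stripped_hit = sample[stripped]
guarded_hit  = sample[guarded]

puts plain_hit.inspect
puts stripped_hit.inspect
puts guarded_hit.inspect
```

Only the unprotected free-spacing version fails to match here, because the comma in the sample is followed by a space rather than a digit.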

\g<four_threes> is a subexpression call (search "Subexpression Calls"). Their use saves typing and reduces the chance of errors. If this, the third named capture, is not wanted, it can of course be substituted out.
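
For strings that contain no "IP: ..." section at all (the case raised in the question's edit), one possible approach is to wrap the prefix and the ip group in an optional non-capture group, so the overall match still succeeds and ip is simply left unset. This is a minimal sketch in Ruby, not something verified against fluent-bit: the optional wrapper, the "IP: " prefix requirement, the /m flag (so that . also matches real line breaks), and the sample strings are all assumptions added for illustration.

```ruby
# Sketch only: the optional wrapper and the /m flag are assumptions,
# not part of the regex given above.
OPTIONAL_IP = /
  \A
  (?<whole>
    (?:                          # optional part: prefix that ends in the IP list
      .*?IP:[ ]                  # lazily skip ahead to the literal 'IP: '
      (?<ip>
        (?<four_threes>\d{1,3}(?:\.\d{1,3}){3})
        (?:,[ ]\g<four_threes>)* # further IPs separated by ', '
      )
    )?                           # the whole IP part may be absent
    .*                           # rest of the string
  )
  \z
/mx

with_ips    = "Lorem, IP: 111.111.111.111, 222.222.222.222\r\n adipiscing"
without_ips = "Lorem ipsum,\r\n adipiscing"

m1 = with_ips.match(OPTIONAL_IP)
m2 = without_ips.match(OPTIONAL_IP)

puts m1[:ip].inspect   # the comma-separated IP list
puts m2[:ip].inspect   # nil, because the optional part was skipped
```

In both cases m[:whole] holds the entire string. Whether a parser treats an unset ip group as a missing key or as an error depends on the tool consuming the regex.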

Cary Swoveland
  • Thanks Cary, that covers the multiple IPs part, but I also need to capture the whole message in another tag. Is there a way to do this with a subexpression call? – nomad Mar 15 '19 at 02:33
  • I don't follow. You do have the string in a variable. Is that not straightforward? – Cary Swoveland Mar 15 '19 at 02:40
  • I cannot use a string variable... I updated the question; maybe it will clear things up a little. message is the whole text, but I need to extract the IPs. – nomad Mar 15 '19 at 03:02
  • `\d{3}` → `\d{1,3}` if you indeed want to match IPs. – Aleksei Matiushkin Mar 15 '19 at 06:06
  • @nomad Put the above in parentheses inside `/\A(?<whole>.*?(⇑).*?)\z/` and you are all set. – Aleksei Matiushkin Mar 15 '19 at 06:07
  • @AlekseiMatiushkin Thanks, thats exactly what I meant!! I still have an issue with getting all IPs in IP but i am working on it. Thanks a bunch! – nomad Mar 15 '19 at 16:38
  • @AlekseiMatiushkin, good point about `\d{1,3}`. I've been waiting for the OP to confirm your suspicion, but as that has not happened I've made the change. – Cary Swoveland Mar 15 '19 at 18:03
  • Thanks a lot for your time and effort, I added a link to rubular so you can check the results i get. I still dont get all the IPs unfortunately. – nomad Mar 15 '19 at 18:09
  • nomad and @AlekseiMatiushkin, I didn't really understand what was wanted until I looked at your (nomad's) revised regex. I've edited my answer to provide what I now understand is the requirement. – Cary Swoveland Mar 15 '19 at 19:08
  • I fixed that, by just removing the positive lookbehind. As a result I had to make the preceding `.*` non-greedy by changing it to `.*?`. I noticed the end-of-string anchor (`\z`) was not needed (as the final `.*` is greedy), so I removed that. – Cary Swoveland Mar 16 '19 at 07:00
  • I realized I have strings without the 'IP: ...' pattern, and I get parsing errors for them because the regex doesn't match. Is there a way to do something like `(IP: ?(?<ip>(?<four_threes>\d{1,3}(?:\.\d{1,3}){3})(?:, \g<four_threes>)*) )*`, so that it works whether there are zero or more occurrences of the pattern? – nomad Mar 16 '19 at 16:29
  • The final `*` would mean a match could be made if there were no IP's in the string. – Cary Swoveland Mar 30 '19 at 05:03
0

You can use /\d{1,3}(?:\.\d{1,3}){3}/ as a very basic regexp (there are much better IPv4 regexps out there).

Then by using .scan(...) on your string you will get the results as an array.
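
A minimal sketch of that approach in Ruby follows. The pattern used here is an assumed, deliberately simple IPv4-shaped expression (it does not check that each octet is at most 255), and the constant and variable names are made up for this example.

```ruby
str = 'Lorem, IP: 111.111.111.111, 222.222.222.222, 333.333.333.333\r\n adipiscing'

# Assumed basic pattern: four groups of 1-3 digits separated by periods.
IPV4_LIKE = /\d{1,3}(?:\.\d{1,3}){3}/

# With no capture groups in the pattern, scan returns an array of the matched strings.
ips = str.scan(IPV4_LIKE)

# Join them back into a single comma-separated value if one field is wanted.
joined = ips.join(', ')

puts ips.inspect
puts joined
```

If the string contains no IPs, scan returns an empty array and joined is an empty string.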

localhostdotdev
  • Thanks for the answer, localhostdotdev, but I forgot to mention that I am not coding this, so I can't call functions. I use this regex in fluent-bit, so I am limited to the regex itself. – nomad Mar 15 '19 at 01:47