IP Address regex vs Numbered List

Question

I am using the Trellix DLP solution and have an IP address classification that blocks outgoing documents containing IP addresses.

My regex is \b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b

However, this also blocks documents that contain four-level numbered lists, like:

 1.blah
    1.1 blah blah
           1.1.1 blah blah blah
                 1.1.1.1 blah blah blah blah (DLP thinks this is an IP address and blocks the document)

Is there any way to avoid this?

`1.1.1.1` is a legal IP address. So tell us the difference between the places where IP addresses occur and this text (with examples). — Poul Bak, Sep 13 '22 at 12:06
"On the **1.** January the profit rate was **3.5** %, and by article **3.1.4.** in the agreement we sent that information to ***127.0.0.1***" this is a silly example just to illustrate another point: if you are to discard numbered lists, then how would you still match the IP address in that sentence? It has a structure very similar to a numbered list with four sections of numbers, yet it's not one. — VLAZ, Sep 13 '22 at 12:18

Julio · Accepted Answer · 2022-09-13T14:08:02.630

Regexes sometimes feel like magic, but unfortunately they are not. A regex cannot distinguish an IP address from a numbered footnote or article.

You can try to add some sort of intelligence (so to speak) to the regex, but you'll always end up with false positives or false negatives. This intelligence comes from inspecting the characters before or after the match.

If you go this way, start with a regular expression that matches only valid IP addresses (your regex can match 300.1.2.3, which is not valid).
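As a minimal sketch of what "only valid IP addresses" could look like, here is a Python version that restricts each octet to 0-255. The names `OCTET`, `VALID_IPV4` and `find_valid_ipv4` are made up for this illustration, and the sketch assumes dotted-decimal IPv4 without leading zeros; your DLP tool's regex dialect may need small changes.

```python
import re

# One octet in the range 0-255; leading zeros such as "01" are not accepted.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Four octets separated by dots, delimited by word boundaries.
VALID_IPV4 = re.compile(rf"\b{OCTET}(?:\.{OCTET}){{3}}\b")


def find_valid_ipv4(text: str) -> list[str]:
    """Return every substring of text that looks like a valid IPv4 address."""
    return VALID_IPV4.findall(text)


print(find_valid_ipv4("ping 127.0.0.1, not 300.1.2.3"))
```

Note that this still matches `1.1.1.1` in a numbered heading: validating the octets removes impossible addresses, not list numbers.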

Also determine which IP addresses you are trying to block. If you only care about private IP addresses, you are less likely to get a false positive if you craft a regex that matches only private IP addresses.
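A rough sketch of a private-only pattern, assuming "private" means the RFC 1918 ranges 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16. The names `PRIVATE_IPV4` and `find_private_ipv4` are hypothetical, and this is Python syntax, not necessarily what your DLP tool accepts.

```python
import re

# One octet in the range 0-255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# RFC 1918 prefixes followed by the remaining octets:
#   10.x.x.x, 172.16.x.x - 172.31.x.x, 192.168.x.x
PRIVATE_IPV4 = re.compile(
    rf"\b(?:10\.{OCTET}|172\.(?:1[6-9]|2[0-9]|3[01])|192\.168)"
    rf"(?:\.{OCTET}){{2}}\b"
)


def find_private_ipv4(text: str) -> list[str]:
    """Return the private IPv4 addresses found in text."""
    return PRIVATE_IPV4.findall(text)


print(find_private_ipv4("1.1.1.1 heading, server at 192.168.1.10"))
```

A list number such as `1.1.1.1` no longer matches, but a heading numbered `10.2.3.4` still would, so dropping the `10\.` branch is a trade-off you may want to consider.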

If you want to catch any IP address, then try to skip matches that have 4 or more spaces before them (or fewer than 4 spaces directly after the beginning of a line). This is an attempt to skip numbered headings.

(?<!^\h)(?<!^\h\h)(?<!^\h\h\h)(?<!\h\h\h\h)\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b

Note: use the m (multiline) modifier. If you cannot specify flags, try the regex like this:

(?m)(?<!^\h)(?<!^\h\h)(?<!^\h\h\h)(?<!\h\h\h\h)\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b

Note: if your tool does not support \h, replace it with [\t\p{Zs}] or [ \t].
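To see how the lookbehinds behave, here is a small Python sketch. Python's `re` module has no `\h`, so it uses `[ \t]` instead, and the multiline flag is passed as `re.MULTILINE`. The names `GUARDED_IP`, `SAMPLE` and `find_guarded` are invented for the example.

```python
import re

# Same idea as the regex above: reject a match preceded by 1-3 blanks at the
# start of a line, or by 4 or more blanks anywhere.
GUARDED_IP = re.compile(
    r"(?<!^[ \t])(?<!^[ \t]{2})(?<!^[ \t]{3})(?<![ \t]{4})"
    r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b",
    re.MULTILINE,
)

SAMPLE = (
    " 1.blah\n"
    "    1.1 blah blah\n"
    "           1.1.1 blah blah blah\n"
    "                 1.1.1.1 blah blah blah blah\n"
    "The gateway is 192.168.0.1 today.\n"
)


def find_guarded(text: str) -> list[str]:
    """Return IP-like strings that are not indented like a list heading."""
    return [m.group(0) for m in GUARDED_IP.finditer(text)]


print(find_guarded(SAMPLE))
```

With this sample only the address in the last line is reported. A four-level heading that starts in column 0 would still match, which is one of the false positives this approach cannot remove.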

You have a very basic demo here. Please keep reading before using it in production :-)

Of course, since a negative lookbehind usually cannot be variable length (except in some specific programming languages/tools), each number of leading spaces needs its own lookbehind. The more cases you add with extra spaces, the more likely you are to skip those numbered articles and avoid a false positive.

Also the tool must support negative lookbehinds, of course.

You could even combine both cases: a regex that matches either 172.x.x.x and 192.x.x.x private addresses (strictly, 172.16.0.0/12 and 192.168.0.0/16; 10.x.x.x private addresses are left out because their numbers are low enough to be confused with list numbering) without any extra constraints, or any other valid IP address with extra constraints (like the spaces).

Are there any more false positives that you detected? Try to establish similar rules for them. For example, consider that you could match footnotes like these: <<See 1.2.3.4>> or *1.2.3.4. Try to add exceptions for IP-address-like strings that start with * or end with >>, for example.
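Those two exceptions could be sketched like this in Python, using one negative lookbehind for the leading `*` and one negative lookahead for the trailing `>>`. The names `IP_WITH_EXCEPTIONS` and `find_ips` are hypothetical, and the tool must support lookarounds for this to carry over.

```python
import re

# Reject candidates that are immediately preceded by "*" or followed by ">>".
IP_WITH_EXCEPTIONS = re.compile(
    r"(?<!\*)\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b(?!>>)"
)


def find_ips(text: str) -> list[str]:
    """Return IP-like strings that do not look like the footnote markers above."""
    return [m.group(0) for m in IP_WITH_EXCEPTIONS.finditer(text)]


print(find_ips("<<See 1.2.3.4>> and *1.2.3.4. but send to 127.0.0.1"))
```

Each new kind of false positive needs its own rule of this sort, which is why the list of exceptions tends to keep growing.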

To sum up: "You cannot", but if you insist on trying:

1. Add extra 'logic' to the regex according to the false positives you find.
2. Check whether the tool lacks needed regex features (like positive/negative lookbehinds).
3. The logic may be very specific to the document in your example. If there are other documents with different formats, a generic solution for every kind of document may not be possible.
4. Even if you only have a single type of document to inspect, you may still get false positives/negatives, in which case go to step 1 and repeat.

See this regex https://regex101.com/r/5ATGKU/1, which may work for you as a starting point (in case you have decided to go this route and your tool supports negative lookbehinds). I'm putting this here because it goes against the 'spirit' of the answer ;) — Julio, Sep 13 '22 at 15:08
Hello, thank you for your detailed information. However, my DLP requires Google RE2 regex. How can I convert your regex to RE2, or can I? — bahmet, Sep 14 '22 at 13:07
Then you are facing the problem "Check if the tool lacks needed regex features". It does lack them, since your tool uses RE2, which does not support lookbehinds (or even lookaheads). If you don't mind capturing some extra data besides the IP, this may work for you as a starting point: https://regex101.com/r/Qk3pzl/1 You still have the IP in the first and second capturing groups. — Julio, Sep 14 '22 at 14:31
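For readers who cannot open the link, here is one possible lookaround-free sketch of the same idea. It is not the linked regex, only an illustration: the context before the address is consumed instead of asserted, and the address itself sits in capturing group 1. It is exercised here with Python's `re` but uses only constructs that RE2 documents (`(?m)`, `^`, `\b`, character classes, non-capturing groups). The names `RE2_STYLE_IP` and `find_ip_re2_style` are made up.

```python
import re

# Allowed context before the address (consumed, not asserted):
#   - the start of a line, or
#   - a single non-blank, non-word character other than "." (e.g. "(" or "="), or
#   - a non-blank character followed by 1-3 spaces or tabs.
# Anything else (4+ blanks, or 1-3 blanks at line start) cannot match.
RE2_STYLE_IP = re.compile(
    r"(?m)(?:^|[^\s\w.]|\S[ \t]{1,3})\b(\d{1,3}(?:\.\d{1,3}){3})\b"
)


def find_ip_re2_style(text: str) -> list[str]:
    """Return the contents of group 1 (the IP-like string) for each match."""
    return [m.group(1) for m in RE2_STYLE_IP.finditer(text)]


sample = (
    "                 1.1.1.1 blah blah blah blah\n"
    "The gateway is 192.168.0.1 today.\n"
)
print(find_ip_re2_style(sample))
```

Because the context character is consumed, two addresses separated by a single space can hide the second one (its required context was already used by the first match). For a rule that only needs one hit to flag a document this may be acceptable. Python's `\w` and `\s` are Unicode-aware while RE2's are ASCII by default, so results may differ slightly on non-ASCII text.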
