
I have a URL that contains multiple sequences of numbers, and I want to capture them all in groups. Suppose I have the following:

https://www.example.com//first/part/54323?key=value

or https://www.example.com/first/12345/second/part/part2/5432?key=value

I tried something like the following, but it only matches one sequence of numbers:

(.*\/)([0-9]{4,})(\/.*|$|)

I want multiple groups that represent the different sections whenever a number sequence is included:

  • 1st group will be "example.com/first"
  • 2nd group "12345"
  • 3rd group "second/part"
  • 4th group "5432"
  • 5th group "?key=value"
Mohab
  • You just want the sequence of digits or you want some other text as well? – AKSingh Jun 27 '21 at 08:49
  • I want both, actually, as multiple groups: 1st group will be "https://www.example.com/first", 2nd group "12345", 3rd group "second/part", 4th group "5432", 5th group "?key=value" – Mohab Jun 27 '21 at 08:53

2 Answers


The initial .* is greedy, meaning it tries to match as much as possible. It matched everything up to and including the last slash that is followed by digits: "https://www.example.com/first/12345/second/part/part2/". You can change this behavior by replacing the initial .* with .*?, which tries to stop after the first slash ("https:/"); because no digits follow the first few slashes, it extends to the first slash that is followed by digits and captures only "12345", which is also not what you want.
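
To make the greedy versus lazy difference concrete, here is a minimal sketch in Python (an assumption; the question does not say which regex engine the tool uses). It runs the pattern from the question, and a lazy variant of it, against the second example URL.

```python
import re

url = "https://www.example.com/first/12345/second/part/part2/5432?key=value"

# Pattern from the question: the greedy prefix swallows the first number.
greedy = re.match(r"(.*/)([0-9]{4,})(/.*|$|)", url)

# Same pattern with a lazy prefix: it stops at the first slash that is
# followed by four or more digits, so the second number ends up in group 3.
lazy = re.match(r"(.*?/)([0-9]{4,})(/.*|$|)", url)

print(greedy.groups())
print(lazy.groups())
```

Either way the pattern has exactly one digit group, so a single match can report only one of the two numbers.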

But really we need to back up and ask some questions about your pattern. Apparently, you have a preamble you are not interested in, then an indefinite number of sequences of the form 'character string, slash, number string', and finally everything that remains once there are no more slash-digit patterns.

The key question is whether the number of string/digits pairs is indefinite or limited to a definite number, like the two pairs in your example. To get the regex engine to return an unbounded number of string-number pairs, you will want to turn on the /g (global) switch so that it returns all matches. That becomes a problem for the parts at the beginning and end of your URL, which do not fit the repeating pattern.

I recommend first using a regular expression to divide your URL into three parts: preamble, path, and remaining data. Then you can pass the path string to a second regular expression to parse the pairs, which will be much simpler.

If you do it that way your first expression could be:

^[a-z+.-]+:\/\/(?:www\.)?([^?#]+)(.*)$

The first part skips over everything through the optional www. and does not capture it, because you are not interested in that part; note that the non-capturing group is written (?:www\.)?. The second part captures everything up to any query or fragment (delimited by ? and #, respectively) and places it in the first capture group. Its quantifier must be greedy, because a lazy +? followed by (.*) would capture only a single character. The last part captures the rest of the URL into the second capture group. In your example that is ?key=value.

Now take your first capture group, which contains the host and the path, and pass it to a second regex with the global flag set (so it processes all pairs repeatedly). This second regex will be:

(.*?)\/([0-9]{4,})\/?

For each match of this pattern, the text before the number and the number itself will be in capture groups 1 and 2.
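
The two-step approach can be sketched as follows. This is only an illustration in Python (an assumption, since the question names no language), and `parse_url`, `SPLIT_URL` and `PAIR` are hypothetical names. The first pattern is written with a non-capturing `(?:www\.)?` and a greedy `[^?#]+` so that the host and path land in the first group; `findall` plays the role of the global flag.

```python
import re

# Step 1: skip the scheme and optional "www.", then separate the
# host-and-path part from any query string or fragment.
SPLIT_URL = re.compile(r"^[a-z+.-]+://(?:www\.)?([^?#]+)(.*)$")

# Step 2: repeated "text, slash, four or more digits" pairs.
PAIR = re.compile(r"(.*?)/([0-9]{4,})/?")


def parse_url(url):
    """Return (list of (text, digits) pairs, remainder), or None if no match."""
    m = SPLIT_URL.match(url)
    if m is None:
        return None
    host_and_path, rest = m.groups()
    # findall returns every non-overlapping match, like the /g switch.
    return PAIR.findall(host_and_path), rest


print(parse_url("https://www.example.com/first/12345/second/part/part2/5432?key=value"))
print(parse_url("https://www.example.com//first/part/54323?key=value"))
```

Because the pairs come back as a list, the same code handles one, two, or more number sequences without changing either pattern.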

Chris Maurer
  • Honestly, I don't have the luxury of doing so. This regex is part of a pattern I use in a tool to filter out URLs; it is not part of a complete program in which I can take multiple steps to reach the end result – Mohab Jun 28 '21 at 07:25

It sounds very straight-forward:

https?:\/\/(?:www\.)?(.*?)\/(\d+)\/(.*?)\/(\d+)(?:\?(.*))?

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  http                     'http'
--------------------------------------------------------------------------------
  s?                       's' (optional (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  :                        ':'
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    www                      'www'
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
  )?                       end of grouping
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  (                        group and capture to \4:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \4
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    \?                       '?'
--------------------------------------------------------------------------------
    (                        group and capture to \5:
--------------------------------------------------------------------------------
      .*                       any character except \n (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \5
--------------------------------------------------------------------------------
  )?                       end of grouping
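
As a quick check of what this fixed-shape pattern captures, here is a minimal sketch in Python (an assumption; the answer does not name an engine, and `FIXED` is a hypothetical name). It applies the pattern to both URLs from the question.

```python
import re

# The pattern from this answer; "/" needs no escaping in Python.
FIXED = re.compile(r"https?://(?:www\.)?(.*?)/(\d+)/(.*?)/(\d+)(?:\?(.*))?")

two_numbers = FIXED.match(
    "https://www.example.com/first/12345/second/part/part2/5432?key=value"
)
one_number = FIXED.match("https://www.example.com//first/part/54323?key=value")

print(two_numbers.groups())
print(one_number)
```

The pattern expects exactly two slash-delimited digit runs, so the URL with a single number sequence does not match at all. Group 5 also holds the query string without its leading "?".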
Ryszard Czech
  • This will only match the exact pattern that I provided. If we pass something that contains more URL segments, it will fail to match. I just wanted to exclude the sequence of numbers, and everything else should be grouped together – Mohab Jun 28 '21 at 07:22
  • @Mohab Ok, see now. – Ryszard Czech Jun 28 '21 at 20:40