I am working on a big log file whose entries look as follows:

-- "GET <b>/fss-w3-mtpage.php</b> HTTP/1.1" 200 0.084 41 "-" "c110bc/1.0" 127.0.0.1:25001  0.084

-- "GET <b>/m/firstpage/Services/getAll</b>?ids=ABCVDFDS,ASDASBDB,ASDBSA&requestId=091fa2b4-643e-4473-b6d8-40210b775dcf HTTP/1.1" 200

-- POST <b>/lastpage/Services/getAll</b>?ids=ABCVDFDS,ASDASBDB,ASDBSA&requestId=091fa2b4-643e-4473-b6d8-40210b775dcf HTTP/1.1" 200

I want to extract the part shown in bold in the samples above. Here is the regex that I wrote for it:

.*(POST|GET)\s+(([^\?]+)|([^\s])) 

I want to get the part that is after GET or POST and until the first occurrence of a space ' ' or a question mark '?'.

Problem
The logical OR in the latter part of the regex is not working. If I use only

.*(POST|GET)\s+([^\?]+)    

I am getting the correct portion i.e. from GET or POST until the first question mark '?'. Similarly if I use

.*(POST|GET)\s+([^\s]+)    

I am getting the correct portion, i.e. from GET or POST until the first space ' '.

Please can anyone tell me where I am wrong?

Unihedron
Vikas Verma

3 Answers

With [^\?]+ I am getting the correct portion till first question mark,
With [^\s]+ I am getting the correct portion till first space

That is because those character classes mean "any character that is not a question mark" and "any character that is not whitespace", respectively.

To combine them, you want to say: All characters that are neither a question mark nor a space:

[^?\s]+

With the OR that you used, the engine simply tried the first alternative, [^\?]+ (which also matches spaces). That succeeded, so the second alternative was never needed. The engine would have backtracked and tried [^\s]+ (which also matches question marks) only if the first alternative had failed.
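
To see this behaviour concretely, here is a minimal sketch in Python. The question does not say which regex engine is in use, so Python's re module and the variable names are assumptions for illustration; the sample lines are the ones from the question with the bold markup removed.

```python
import re

# Sample entries from the question, without the <b> markup.
with_query = ('-- "GET /m/firstpage/Services/getAll?ids=ABCVDFDS,ASDASBDB,ASDBSA'
              '&requestId=091fa2b4-643e-4473-b6d8-40210b775dcf HTTP/1.1" 200')
no_query = ('-- "GET /fss-w3-mtpage.php HTTP/1.1" 200 0.084 41 "-" '
            '"c110bc/1.0" 127.0.0.1:25001  0.084')

# The regex from the question: the first alternative [^\?]+ also matches spaces.
alternation = re.compile(r'.*(POST|GET)\s+(([^\?]+)|([^\s]))')

# One negated class that excludes both '?' and whitespace.
combined = re.compile(r'.*(POST|GET)\s+([^?\s]+)')

# Stops at '?' as intended, because this line contains one.
print(alternation.match(with_query).group(2))

# No '?' on this line, so the first alternative runs to the end of the line.
print(alternation.match(no_query).group(2))

# The combined class stops at whichever of '?' or whitespace comes first.
print(combined.match(with_query).group(2))
print(combined.match(no_query).group(2))
```

The second print is the interesting one: the capture runs past the space after the path, which is the symptom described in the question.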

Bergi

Get the matched group from index 2

\b(POST|GET)\s+([^?\s]+)

Pattern explanation:

  \b                       the word boundary

  (                        group and capture to \1:
    POST                     'POST'
   |                        OR
    GET                      'GET'
  )                        end of \1

  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or more times)

  (                        group and capture to \2:

    [^?\s]+                  any character except: '?', whitespace
                             (\n, \r, \t, \f, and " ") (1 or more times)

  )                        end of \2
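
As a rough illustration of reading group 2 in practice, the pattern can be applied line by line. This is only a sketch: Python is an assumption (the question names no language), and extract_requests and sample are hypothetical names introduced here.

```python
import re

# Group 1 is the HTTP method, group 2 the path up to the first '?' or whitespace.
REQUEST = re.compile(r'\b(POST|GET)\s+([^?\s]+)')

def extract_requests(lines):
    """Yield (method, path) for every line that contains a GET or POST request."""
    for line in lines:
        m = REQUEST.search(line)
        if m:
            yield m.group(1), m.group(2)

# The three entries from the question, without the <b> markup.
sample = [
    '-- "GET /fss-w3-mtpage.php HTTP/1.1" 200 0.084 41 "-" "c110bc/1.0" 127.0.0.1:25001  0.084',
    '-- "GET /m/firstpage/Services/getAll?ids=ABCVDFDS,ASDASBDB,ASDBSA'
    '&requestId=091fa2b4-643e-4473-b6d8-40210b775dcf HTTP/1.1" 200',
    '-- POST /lastpage/Services/getAll?ids=ABCVDFDS,ASDASBDB,ASDBSA'
    '&requestId=091fa2b4-643e-4473-b6d8-40210b775dcf HTTP/1.1" 200',
]

for method, path in extract_requests(sample):
    print(method, path)
```

For a large log file, the same generator could be fed an open file object instead of a list, so the file is read one line at a time.
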
Braj

The regex below matches only the string that comes right after GET or POST and is followed by a space or a ? symbol.

(?<=GET |POST )\s*.*?(?= |\?)

You could use a capturing group () in order to capture the matched string.

(?<=GET |POST )\s*(.*?)(?= |\?)

Explanation:

(?<=                     look behind to see if there is:
  GET                      'GET '
 |                        OR
  POST                     'POST '
)                        end of look-behind
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times)
(                        group and capture to \1:
  .*?                      any character except \n (0 or more
                           times)
)                        end of \1
(?=                      look ahead to see if there is:
                           ' '
 |                        OR
  \?                       '?'
)                        end of look-ahead
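
One caveat if this is tried in Python (an assumption, since the question names no language): Python's built-in re module only accepts fixed-width look-behinds, and 'GET ' and 'POST ' differ in length, so the pattern above is rejected there as written. The sketch below is an adaptation, not the pattern from this answer: it splits the look-behind into two fixed-width ones. The names LOOKAROUND and path_after_method are hypothetical.

```python
import re

# Two fixed-width look-behinds joined by alternation replace (?<=GET |POST ),
# which Python's re module does not accept because the widths differ.
LOOKAROUND = re.compile(r'(?:(?<=GET )|(?<=POST ))\s*(.*?)(?= |\?)')

def path_after_method(line):
    """Return the text after 'GET ' or 'POST ' up to the next space or '?', or None."""
    m = LOOKAROUND.search(line)
    return m.group(1) if m else None

# Two entries from the question, without the <b> markup.
get_line = ('-- "GET /fss-w3-mtpage.php HTTP/1.1" 200 0.084 41 "-" '
            '"c110bc/1.0" 127.0.0.1:25001  0.084')
post_line = ('-- POST /lastpage/Services/getAll?ids=ABCVDFDS,ASDASBDB,ASDBSA'
             '&requestId=091fa2b4-643e-4473-b6d8-40210b775dcf HTTP/1.1" 200')

print(path_after_method(get_line))
print(path_after_method(post_line))
```

Engines such as PCRE and Java accept alternatives of different fixed lengths inside a look-behind, so the original form should work there unchanged.
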
Avinash Raj