I am using this regex:

private static final String HREF_PATTERN = 
    "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";

to extract the link from:

 <a href=www.example.com/1234 5678>

The URL is malformed: it contains whitespace. The problem is that I want to get the whole link, including "5678", but I only get "www.example.com/1234".

I am not very good with regular expressions. Can someone please provide a regex that captures the whole URL, "www.example.com/1234 5678"?

Thanks

Wiktor Stribiżew
gRds

1 Answer

The external program creates an HTML email with several <a href=www.example.com/1234 5678> tags.

Assuming you cannot fix this at the source, you can work around it with a regex.

If href is the only attribute in the tag, you do not need to stop at whitespace in the unquoted value. Remove \\s from the last character class, [^'\">\\s]+, and the pattern will match up to the closing >.

private static final String HREF_PATTERN = 
   "(?i)\\s*href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+))";
                                                     ^
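
As a quick check of the pattern above, here is a minimal sketch that runs it against the tag from the question. The class name HrefDemo and the helper extractHref are made up for illustration; only the pattern string comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefDemo {
    // Pattern from the answer: the unquoted alternative no longer excludes whitespace.
    private static final String HREF_PATTERN =
        "(?i)\\s*href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+))";

    // Hypothetical helper: returns the raw href value (group 1), or null if no href is found.
    static String extractHref(String html) {
        Matcher m = Pattern.compile(HREF_PATTERN).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints "www.example.com/1234 5678"
        System.out.println(extractHref("<a href=www.example.com/1234 5678>"));
    }
}
```

Group 1 holds the value for all three alternatives. For the quoted forms it still includes the quote characters, so strip them if you need the bare URL.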

If the tag has other attributes with values after href, you will have to use a look-ahead:

private static final String HREF_PATTERN = 
    "(?i)\\s*href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+(?=>|\\s+\\w+=)))";

See the regex demo

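However, this will not work when href is followed by a value-less attribute such as nofollow, because the look-ahead only recognizes a following name= form or the closing >.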
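
The look-ahead variant and its limitation can be checked with a small sketch like the one below. The class name HrefLookaheadDemo, the helper extractHref, and the target attribute in the sample input are assumptions added for illustration. The following attribute's value is quoted here so that the unquoted character class stops at the quote and has to backtrack to the look-ahead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefLookaheadDemo {
    // Look-ahead pattern from the answer: the unquoted value must be followed
    // by ">" or by whitespace plus "name=".
    private static final String HREF_PATTERN =
        "(?i)\\s*href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+(?=>|\\s+\\w+=)))";

    // Hypothetical helper: returns the raw href value (group 1), or null if no href is found.
    static String extractHref(String html) {
        Matcher m = Pattern.compile(HREF_PATTERN).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A following attribute with a (quoted) value is left out of the match.
        // prints "www.example.com/1234 5678"
        System.out.println(extractHref("<a href=www.example.com/1234 5678 target=\"_blank\">"));

        // A value-less attribute is swallowed into the URL.
        // prints "www.example.com/1234 5678 nofollow"
        System.out.println(extractHref("<a href=www.example.com/1234 5678 nofollow>"));
    }
}
```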

Wiktor Stribiżew