2

Sorry if this has been asked before, but I couldn't find any answers on the web. I'm having a hard time figuring out the inverse to this regex:

"\"[^>]*\">"

I want to use replaceAll to replace everything except the link. So if I had a tag similar to this:

<p><a href="http://www.google.com">Google</a></p>

I need a regex that would satisfy this:

s.replaceAll(regex, "");

to give me this output:

http://www.google.com

I know there are better ways to do this, but I have to use a regex. Any help is really appreciated, thanks!

Bert
user1070866

4 Answers

16

You do not have to use replaceAll. It is better to use a capturing group, like this:

Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(html);
String url = null;
if (m.find()) {
    url = m.group(1); // this variable should contain the link URL
}

If you have several links in your HTML, call m.find() in a loop.
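As a minimal sketch of that loop, the class below collects every href value it finds. The class name `HrefExtractor`, the method `extractAll`, and the second sample link are made up for illustration; the pattern is the one shown above and assumes double-quoted attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // Same pattern as above: capture whatever sits between href=" and the next quote.
    private static final Pattern HREF = Pattern.compile("href=\"(.*?)\"");

    // Hypothetical helper: returns the captured URL of every href in the input.
    public static List<String> extractAll(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {          // each call advances to the next match
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://www.google.com\">Google</a> "
                    + "<a href=\"http://example.org\">Example</a></p>";
        System.out.println(extractAll(html)); // [http://www.google.com, http://example.org]
    }
}
```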

AlexR
0

Use the method below to get a map of all the properties (attributes) of an HTML tag. First, find each <a> tag and pass its attribute text to the parser, like this:

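    Pattern linkPattern = Pattern.compile("<a (.*?)>");
    Matcher linkMatcher = linkPattern.matcher(in);
    while (linkMatcher.find()) {
        // group(1) is the raw attribute text inside the <a ...> tag
        Map<String, String> properties = parseProperties(linkMatcher.group(1));
        String url = properties.get("href"); // null if the tag has no href
    }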

Get properties:

private static final Pattern PARSE_PATTERN = Pattern.compile("\\s*?(\\w*?)\\s*?=\\s*?\"(.*?)\"");

public static Map<String, String> parseProperties (String in) {

  Map<String, String> out = new HashMap<>();

  // Create matcher based on parsing pattern
  Matcher matcher = PARSE_PATTERN.matcher(in);

  // Populate map
  while (matcher.find()) { out.put(matcher.group(1), matcher.group(2)); }

  return out; 
}
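For anyone who wants to try this, here is one way the two snippets might be wired into a single compilable class. The class name `LinkProperties` and the helper `hrefs` are hypothetical; the two patterns and `parseProperties` are taken from the code above, and the sketch assumes double-quoted attribute values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkProperties {
    private static final Pattern LINK_PATTERN = Pattern.compile("<a (.*?)>");
    private static final Pattern PARSE_PATTERN =
            Pattern.compile("\\s*?(\\w*?)\\s*?=\\s*?\"(.*?)\"");

    // Splits attribute text such as: id="x" href="..." into a name -> value map.
    public static Map<String, String> parseProperties(String in) {
        Map<String, String> out = new HashMap<>();
        Matcher matcher = PARSE_PATTERN.matcher(in);
        while (matcher.find()) {
            out.put(matcher.group(1), matcher.group(2));
        }
        return out;
    }

    // Hypothetical helper: the href of every <a ...> tag that has one.
    public static List<String> hrefs(String html) {
        List<String> out = new ArrayList<>();
        Matcher linkMatcher = LINK_PATTERN.matcher(html);
        while (linkMatcher.find()) {
            String href = parseProperties(linkMatcher.group(1)).get("href");
            if (href != null) {
                out.add(href);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p><a id=\"x\" href=\"http://www.google.com\">Google</a></p>";
        System.out.println(hrefs(html)); // [http://www.google.com]
    }
}
```

Because the attributes end up in a map, the order in which id and href appear in the tag does not matter.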
somid3
0

If you always have one such link in a string, try this:

"(^[^\"]*\")|(\"[^\"]*)$"
socha23
  • This worked, but it failed when the tag had an 'id=' attribute before the href. I should've added that to my question, sorry. – user1070866 Nov 30 '11 at 07:51
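To make the failure described in the comment concrete, the sketch below runs the answer's regex on both inputs and then tries a variant that anchors on href= instead of on the first quote. The variant, the class name `LinkOnly`, and the id value are assumptions for illustration, not part of the original answer; it presumes a single href per string, a double-quoted value, and no line breaks in the input.

```java
public class LinkOnly {
    // The regex from the answer above: strip up to the first quote and from the last quote on.
    static final String ORIGINAL = "(^[^\"]*\")|(\"[^\"]*)$";

    // Hypothetical variant: strip everything up to href=" and everything from the next quote on.
    static final String HREF_ANCHORED = "^.*?href=\"|\".*$";

    public static void main(String[] args) {
        String simple = "<p><a href=\"http://www.google.com\">Google</a></p>";
        String withId = "<p><a id=\"x\" href=\"http://www.google.com\">Google</a></p>";

        System.out.println(simple.replaceAll(ORIGINAL, ""));      // http://www.google.com
        System.out.println(withId.replaceAll(ORIGINAL, ""));      // x" href="http://www.google.com
        System.out.println(withId.replaceAll(HREF_ANCHORED, "")); // http://www.google.com
    }
}
```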
-1

You can check out http://regexlib.com/ for all the regex help you need. The one below is for a URL:

^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$
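A quick check of what this pattern accepts, as a hedged sketch (the class name `DomainPatternDemo` and the helper `matches` are hypothetical). The character class allows only letters, digits, hyphens, and dots, so the pattern matches a bare host name but not a string that includes a scheme such as http://.

```java
import java.util.regex.Pattern;

public class DomainPatternDemo {
    // The pattern from above, with backslashes doubled for a Java string literal.
    static final Pattern DOMAIN = Pattern.compile(
            "^[a-zA-Z0-9\\-\\.]+\\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$");

    public static boolean matches(String s) {
        return DOMAIN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("www.google.com"));        // true
        System.out.println(matches("http://www.google.com")); // false: ':' and '/' are not allowed
    }
}
```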
kommradHomer