Using regex to get information inside an HTML tag

Question

I'm wondering how I could extract '4151' from the following code:

</th><td><a class="external exitstitial" rel="nofollow" href="http://services.runescape.com/m=itemdb_rs/viewitem.ws?obj=4151">Look up price</a>

I would like to use regex, but if there is a better way I'm open to it!

Assuming that's just a fragment of a complete (X)HTML document, use XPath first to obtain the attribute value, _then_ a regular expression to extract the query parameter. — Alistair A. Israel, Aug 11 '11 at 08:58
I've already done all of that, I just need the regex to extract it. — , Aug 11 '11 at 09:02

score 4 · Accepted Answer · answered Aug 11 '11 at 09:11

The following works for me, assuming the href attribute value was already extracted:

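import java.util.regex.Matcher;
import java.util.regex.Pattern;

String href = "http://services.runescape.com/m=itemdb_rs/viewitem.ws?obj=4151";
// Capture the digits that follow "?obj=" in group 1
Pattern p = Pattern.compile("\\?obj=(\\d+)");
Matcher m = p.matcher(href);
if (m.find()) {
    System.out.println(m.group(1));
}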

Outputs "4151"
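
If you'd rather not use a regex once the href is in hand, the query string can also be split by hand. This is only a minimal sketch using java.net.URI from the JDK; the class and method names (QueryParamSketch, queryParam) are made up for illustration, and the value is returned without percent-decoding.

```java
import java.net.URI;
import java.util.Optional;

public class QueryParamSketch {

    // Returns the raw (undecoded) value of the named query parameter,
    // or an empty Optional if the URL has no such parameter.
    static Optional<String> queryParam(String url, String name) {
        String query = URI.create(url).getRawQuery(); // text after '?', or null
        if (query == null) {
            return Optional.empty();
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(name)) {
                return Optional.of(pair.substring(eq + 1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String href = "http://services.runescape.com/m=itemdb_rs/viewitem.ws?obj=4151";
        System.out.println(queryParam(href, "obj").orElse("not found")); // prints 4151
    }
}
```

Unlike the "\\?obj=" pattern, this also finds obj when it is not the first parameter in the query string.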

score 3 · Answer 2 · edited May 23 '17 at 12:11

Here are a few parser libraries: htmlparser, jsoup, and jtidy.

In your case, regex may be fine, but here's a classic post on why you should avoid regex for HTML parsing.
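
To illustrate the "parse first, regex second" idea without any third-party library, here is a rough sketch using the XPath support built into the JDK. It assumes the markup is well-formed XML (the td wrapper below is my own addition to make the fragment well-formed); real-world HTML usually isn't, which is where a lenient parser such as jsoup comes in. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class HrefXPathSketch {

    private static final Pattern OBJ_PARAM = Pattern.compile("[?&]obj=(\\d+)");

    // Returns the href of the first <a> element, or "" if there is none.
    static String firstHref(String xhtml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("//a/@href", new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    // Returns the numeric obj parameter from the first link, or null if absent.
    static String objId(String xhtml) {
        Matcher m = OBJ_PARAM.matcher(firstHref(xhtml));
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xhtml = "<td><a class=\"external exitstitial\" rel=\"nofollow\" "
                + "href=\"http://services.runescape.com/m=itemdb_rs/viewitem.ws?obj=4151\">"
                + "Look up price</a></td>";
        System.out.println(objId(xhtml)); // prints 4151
    }
}
```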

answered Aug 11 '11 at 09:02 · asgs · edited by Community

score 0 · Answer 3 · answered Aug 11 '11 at 09:01

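This regex would get you the number:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String resultString = null;
// Matches the first run of digits anywhere in the input
Pattern regex = Pattern.compile("\\d+");
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
    resultString = regexMatcher.group();
}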

This code is not tested and presumes your HTML string is assigned to the 'subjectString' variable.
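
One caveat worth seeing in action: a bare \d+ returns the first run of digits it meets, so it only works when nothing numeric comes before the ID. The sketch below compares it with a pattern anchored on obj=. The second URL is a made-up example (not from the question), and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstDigitsSketch {

    private static final Pattern ANY_DIGITS = Pattern.compile("\\d+");
    private static final Pattern OBJ_PARAM = Pattern.compile("[?&]obj=(\\d+)");

    // First run of digits anywhere in the input, or null if there is none.
    static String firstDigits(String s) {
        Matcher m = ANY_DIGITS.matcher(s);
        return m.find() ? m.group() : null;
    }

    // Digits that follow "?obj=" or "&obj=", or null if there is no such parameter.
    static String objParam(String s) {
        Matcher m = OBJ_PARAM.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String original = "http://services.runescape.com/m=itemdb_rs/viewitem.ws?obj=4151";
        String hypothetical = "http://example.com/v2/viewitem.ws?obj=4151"; // made-up URL

        System.out.println(firstDigits(original));     // prints 4151
        System.out.println(firstDigits(hypothetical)); // prints 2 (from "v2")
        System.out.println(objParam(hypothetical));    // prints 4151
    }
}
```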
