RegEx: Match AND Capture in .NET with one pattern

Question

When dealing with RegEx in .NET I have two options:

Check string for pattern match:

<a ([^>]*?)href=\"http://the.site.com/photo/[0-9]*\">

Capture a part of pattern:

<a ([^>]*?)href=\"http://the.site.com/photo/(?<photoname>.*?)\">

But what if I want to check for a pattern match AND capture a part of it, if it matches, with a single RegEx?

How would it capture it if it didn't match...? The lookaheads/lookbehinds are for capturing part of a match – Jonesopolis, Oct 29 '13 at 14:26
@Jonesy: `(?<photoname>.*?)` will capture any character sequence, but I need `[0-9]*` only. – Paul, Oct 29 '13 at 14:28
@BeemerGuy.net: Because I do not understand RegExes and therefore I am asking here. – Paul, Oct 29 '13 at 14:31
Oh. Can you give a sample input, and what you'd like to capture from it? – Jonesopolis, Oct 29 '13 at 14:32
@Paul -- ok, I thought you did that on purpose. You can keep the `[0-9]*` pattern within the capturing group, like `(?<photoname>[0-9]*?)`, so just like the answerers are telling you, you can capture and match with the same pattern string... if you're just matching, the grouping (which is the parenthesized portion) is irrelevant. – BeemerGuy, Oct 29 '13 at 14:36
I have an HTML document full of `<a>` tags and I want to capture the last (digit) part of every URL which matches the `http://the.site.com/photo/DIGITS` pattern. – Paul, Oct 29 '13 at 14:37

score 2 · Accepted Answer · answered Oct 29 '13 at 14:39

Just use this when capturing:

<a ([^>]*?)href=\"http://the.site.com/photo/(?<photoname>[0-9]+)\">
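
As a rough illustration of how one pattern can both test and capture, here is a minimal sketch that assumes C# 9+ top-level statements. The sample strings and the helper name `TryGetPhotoId` are hypothetical, and the dots in the host name are escaped as a small tightening that the pattern above does not include.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample inputs, not taken from the question.
string good = "<a class=\"thumb\" href=\"http://the.site.com/photo/12345\">";
string bad = "<a class=\"thumb\" href=\"http://the.site.com/photo/abc\">";

// Same shape as the pattern above, as a verbatim string ("" is a literal quote).
// Assumption: dots are escaped so they match only a literal '.'.
var photoLink = new Regex(
    @"<a ([^>]*?)href=""http://the\.site\.com/photo/(?<photoname>[0-9]+)"">");

// Hypothetical helper: one Regex.Match call both tests and captures.
bool TryGetPhotoId(string input, out string id)
{
    Match m = photoLink.Match(input);
    id = m.Success ? m.Groups["photoname"].Value : "";
    return m.Success;
}

Console.WriteLine(TryGetPhotoId(good, out var goodId) ? goodId : "(no match)");
Console.WriteLine(TryGetPhotoId(bad, out var badId) ? badId : "(no match)");
```

`Match.Success` answers the "does it match" question, and `Groups["photoname"]` holds the captured digits only when it does, so a separate `Regex.IsMatch` call is not needed.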

answered Oct 29 '13 at 14:39

Toto

What if there's a space, as in `href =`? What if href is not the only attribute? What if it's a self-closing tag? Please don't use regex; it will break your code. – Anirudha Oct 29 '13 at 14:41
@Anirudh what about the other 999 fail states you were going to find? – Gusdor Oct 29 '13 at 14:42
@Gusdor The 999+ are the cases where the literal text itself contains anchor tags (e.g. comments and answers on SO, which are not part of the HTML itself but are literals). How would you differentiate between them? Also, there could be an arbitrary number of spaces, and you won't add `\s` for each one, or will you? – Anirudha Oct 29 '13 at 14:45
Though `?` is redundant in `[^>]*?`, the regex should be `<a ([^>]*)href\s*=\s*["']http://the.site.com/photo/(?<photoname>[0-9]+)["'][^>]*` – Anirudha Oct 29 '13 at 14:50

score 1 · Answer 2 · answered Oct 29 '13 at 14:32

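Use HtmlAgilityPack:

// Requires: using System.Linq; using System.Text.RegularExpressions; using HtmlAgilityPack;
// HtmlDocument.Load expects a file path or stream, so a URL is fetched through HtmlWeb
HtmlDocument doc = new HtmlWeb().Load(htmlUrl);

// Anchor the whole href and capture only the digits; the original ^(?<=...) lookbehind can never match
var pattern = @"^https?://the\.site\.com/photo/(?<photoname>\d+)$";
var links = doc.DocumentNode.SelectNodes("//a[@href]"); // null when no <a href> exists
var hrefList = (links ?? Enumerable.Empty<HtmlNode>())
                 .Select(a => Regex.Match(a.Attributes["href"].Value, pattern)) // match each href once
                 .Where(m => m.Success)                                         // filter href
                 .Select(m => m.Groups["photoname"].Value);                     // select required digits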

answered Oct 29 '13 at 14:32

Anirudha

Agility Pack is overkill here. The OP simply wants to match URLs. – Gusdor Oct 29 '13 at 14:36
@Gusdor Come up with a regex and I will give you thousands of cases where it breaks ;) – Anirudha Oct 29 '13 at 14:37

score 0 · Answer 3 · answered Oct 29 '13 at 14:46

Good sir, you can match and capture into a named group with one pattern!

<a (?:[^>]*?)href\s*?=\s*\"http://the.site.com/photo/(?<photoname>[0-9]+)\"

The group named photoname will contain the capture you want.

This regex will work even if href is not the first attribute on the `<a>` element. It also tolerates arbitrary whitespace around the `=`.
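
To collect every photo name from a whole document, the same idea can be applied with `Regex.Matches`. This is only a sketch, assuming C# 9+ top-level statements; the `page` fragment is made up for illustration, and the dots in the host name are escaped, which the pattern above does not do.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical document fragment for illustration.
string page =
    "<a href=\"http://the.site.com/photo/1\">one</a>\n" +
    "<a class=\"x\" href = \"http://the.site.com/photo/22\" title=\"t\">two</a>\n" +
    "<a href=\"http://the.site.com/about\">about</a>";

// Pattern from this answer as a verbatim string ("" is a literal quote).
// Assumption: dots are escaped so they match only a literal '.'.
var photoLink = new Regex(
    @"<a (?:[^>]*?)href\s*=\s*""http://the\.site\.com/photo/(?<photoname>[0-9]+)""");

var ids = new List<string>();
foreach (Match m in photoLink.Matches(page)) // Matches yields successful matches only
{
    ids.Add(m.Groups["photoname"].Value);    // the named group holds just the digits
}

Console.WriteLine(string.Join(", ", ids));   // prints "1, 22"
```

The third link is skipped because its URL does not match, so no separate filtering step is required before reading the group.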
