-2

When dealing with RegEx in .NET I have two options:

  1. Check string for pattern match:

    <a ([^>]*?)href=\"http://the.site.com/photo/[0-9]*\">

  2. Capture a part of pattern:

    <a ([^>]*?)href=\"http://the.site.com/photo/(?<photoname>.*?)\">

But what if I want to check for a pattern match AND, if it matches, capture part of it, all with a single RegEx?

Paul

3 Answers

2

Just put the digit constraint inside the named group; a single match then both validates and captures:

<a ([^>]*?)href=\"http://the.site.com/photo/(?<photoname>[0-9]+)\">
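A minimal sketch of how the pattern above could be used from C#, so that one `Regex.Match` call serves as both the check (`Success`) and the capture (`Groups["photoname"]`). The class and method names (`PhotoLinkDemo`, `TryGetPhotoName`) are hypothetical, and escaping the dots in the host name is an added assumption about the intent:

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhotoLinkDemo
{
    // Pattern from the answer; the dots are escaped here (assumption) so they match literally.
    private static readonly Regex PhotoLink = new Regex(
        "<a ([^>]*?)href=\"http://the\\.site\\.com/photo/(?<photoname>[0-9]+)\">");

    // Returns true and sets photoName when the input contains a matching link;
    // otherwise returns false and sets photoName to null.
    public static bool TryGetPhotoName(string html, out string photoName)
    {
        Match m = PhotoLink.Match(html);
        photoName = m.Success ? m.Groups["photoname"].Value : null;
        return m.Success;
    }

    public static void Main()
    {
        string html = "<a class=\"thumb\" href=\"http://the.site.com/photo/12345\">";
        string name;
        if (TryGetPhotoName(html, out name))
        {
            Console.WriteLine(name); // prints 12345
        }
    }
}
```

There is no need to call `Regex.IsMatch` first: `Match.Success` already answers the "does it match" question.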
Toto
1

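Use HtmlAgilityPack to parse the HTML, then apply the regex to each href value (this assumes htmlUrl is a URL):

using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// HtmlWeb fetches a URL; HtmlDocument.Load expects a file path or stream
HtmlDocument doc = new HtmlWeb().Load(htmlUrl);

// One pattern both validates the href and captures the digits
var pattern = @"^https?://the\.site\.com/photo/(?<photoname>\d+)$";
var anchors = doc.DocumentNode.SelectNodes("//a[@href]"); // null when nothing matches
var hrefList = (anchors ?? Enumerable.Empty<HtmlNode>())
    .Select(a => Regex.Match(a.Attributes["href"].Value, pattern)) // match each href once
    .Where(m => m.Success)                                         // keep photo links only
    .Select(m => m.Groups["photoname"].Value);                     // extract the digits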
Anirudha
0

Good sir, you can match and capture into a named group with one pattern!

<a (?:[^>]*?)href\s*?=\s*\"http://the.site.com/photo/(?<photoname>[0-9]+)\"

The group named photoname will contain the capture you want.

This regex will work even if href is not the first attribute on the a element. It also tolerates arbitrary whitespace around the = sign.
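As a rough illustration of that tolerance, the sketch below runs the pattern over a string containing several links and collects every captured photo name with `Regex.Matches`. The names `PhotoLinkScanner` and `ExtractPhotoNames` and the sample markup are hypothetical, and the dots in the host name are escaped as an added assumption:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhotoLinkScanner
{
    // Pattern from the answer, written as a verbatim string ("" is a literal quote).
    private static readonly Regex PhotoLink = new Regex(
        @"<a (?:[^>]*?)href\s*?=\s*""http://the\.site\.com/photo/(?<photoname>[0-9]+)""");

    // Collects the photoname capture from every matching link in the input.
    public static List<string> ExtractPhotoNames(string html)
    {
        var names = new List<string>();
        foreach (Match m in PhotoLink.Matches(html))
        {
            names.Add(m.Groups["photoname"].Value);
        }
        return names;
    }

    public static void Main()
    {
        string html =
            "<a class=\"x\" href = \"http://the.site.com/photo/7\">one</a>" +      // href not first, spaces around =
            "<a href=\"http://the.site.com/photo/88\" title=\"t\">two</a>" +       // attribute after href
            "<a href=\"http://the.site.com/photo/abc\">three</a>";                 // not numeric, skipped
        Console.WriteLine(string.Join(",", ExtractPhotoNames(html))); // prints 7,88
    }
}
```

Because the pattern stops at the closing quote rather than at `>`, links with further attributes after href still match.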

Gusdor