C# RegEx MatchCollection string parser for HTML type strings

Question

I am working on a custom CMS parser application in C# and need to match tags, and the content between those tags, in content snippets that the client submits as string values. Both the tags and the content are dynamic. The project must use only native C#; third-party libraries such as HTML Agility Pack are not allowed.

I have been working with this as an example: https://regex101.com/r/e7twfZ/1

(?=(<picture>))(\w|\W)*(?<=<\/picture>)

...searching the string...

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Title</title>
</head>
<body>
    <picture>
        <source srcset="mobile.png" ></source>
        <source srcset="tablet.png" ></source>
        <source srcset="desktop.png" ></source>
        <img srcset="default.png">
    </picture>
</body>
</html>

However, I need to match pretty much any alphanumeric tag name between an opening and closing angle bracket. When I change the RegEx to:

(?=(<picture>))(\w|\W)*(?<=<\/picture>)

I lose my match.

My goal is to end up with:

new Regex(@"(?=(<picture>))(\w|\W)*(?<=<\/picture>)").Match(@"<!DOCTYPE html>
<html lang='en'>
<head>
    <title>Title</title>
</head>
<body>
    <picture>
        <source srcset='mobile.png' ></source>
        <source srcset='tablet.png' ></source>
        <source srcset='desktop.png' ></source>
        <img srcset='default.png'>
    </picture>
</body>
</html>");

However, I am still not entirely sure how to do a proper MatchCollection in C#.
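
For reference, a minimal sketch of how Regex.Matches returns a MatchCollection that can be iterated. It assumes a simplified pattern rather than the lookaround version above: the greedy (\w|\W)* would produce one match running from the first <picture> to the last </picture>, so the sketch uses a lazy .*? with RegexOptions.Singleline instead. The class name PictureMatches, the helper FindPictureBlocks, the group name "content", and the sample input are all made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PictureMatches
{
    // Hypothetical helper: returns one match per <picture>...</picture> block.
    public static MatchCollection FindPictureBlocks(string input)
    {
        // Singleline lets '.' match newlines; '*?' is lazy, so each match
        // stops at the first closing tag instead of the last one.
        return Regex.Matches(
            input,
            @"<picture>(?<content>.*?)</picture>",
            RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Illustrative input only, not the client's real data.
        string html = "<body><picture><img srcset='a.png'></picture>"
                    + "<picture><img srcset='b.png'></picture></body>";

        MatchCollection blocks = FindPictureBlocks(html);
        Console.WriteLine(blocks.Count);

        foreach (Match m in blocks)
        {
            // Groups["content"] holds the text between the tags.
            Console.WriteLine(m.Groups["content"].Value);
        }
    }
}
```

Each Match in the collection also exposes Index and Length, which may help if the position of a block in the original string matters.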

Also, this is my first time posting on StackOverflow.com. I have researched fairly thoroughly but decided to ask a question since each answer seemed a little different from what I am looking to accomplish. Thank you for your help. Feel free to offer any suggestions!

HTML is anything but regular and regular expressions simply can't handle the edge cases. That's why all the answers you found don't seem to apply. Use an HTML parsing library like AngleSharp or Html Agility Pack — Panagiotis Kanavos, Jul 01 '22 at 12:57
`(?=(<picture>))([\w|\W]*)(?<=<\/picture>)` may be what you are looking for. I just added brackets to your pattern — Ramazan, Jul 01 '22 at 13:09
When you mentioned "string values" from the client, is there going to be additional HTML inside the tags or embedded tags? — Chason Arthur, Jul 01 '22 at 14:40
No. In fact, Panagiotis sort of missed the point of my question. I used HTML as an example; the tags will be in the form of: <myCustomTag>some content</myCustomTag>. All I am looking to do is get KEY: myCustomTag and VALUE: some content — Gary Smith, Jul 01 '22 at 14:45
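
A rough sketch of the KEY/VALUE extraction described in the last comment, using only System.Text.RegularExpressions. It assumes each snippet looks like an opening tag with a plain alphanumeric name and no attributes, then the content, then a closing tag with the same name. That input shape, the class name TagParser, the method Parse, and the group names "key" and "value" are assumptions made for illustration, not something confirmed by the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagParser
{
    // <(?<key>\w+)> captures the tag name; \k<key> is a named backreference
    // that requires the closing tag to repeat that same name.
    private static readonly Regex TagPattern = new Regex(
        @"<(?<key>\w+)>(?<value>.*?)</\k<key>>",
        RegexOptions.Singleline);

    // Hypothetical helper: returns (tag name, inner content) pairs in document order.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in TagPattern.Matches(input))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["key"].Value,
                m.Groups["value"].Value.Trim()));
        }
        return pairs;
    }

    public static void Main()
    {
        // Illustrative input only; the second tag name is made up.
        string input = "<myCustomTag>some content</myCustomTag><other>more</other>";
        foreach (var pair in Parse(input))
        {
            Console.WriteLine($"KEY: {pair.Key} VALUE: {pair.Value}");
        }
    }
}
```

Matches do not overlap, so a tag nested inside another tag's content ends up inside that outer VALUE rather than as its own pair. Tags with attributes, or a tag nested inside another tag of the same name, would need a different approach.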