I've been trying to use regular expressions to extract information from HTML <a> tags, for three possible forms of the tag:

    <a id="Anchor_One" name="Anchor_One"> Anchor Details </a>
    <a href="#Anchor_Two" name="Anchor_Two" > Anchor Two Details </a>
    <a name="Anchor_Three" > Anchor Three Details </a>

So far I have a regular expression that extracts all the attributes from a given HTML tag, /(\\w+)\s*=\\s*("[^"]*"|\'[^\']*\'|[^"\'\\s>]*)/, and another that matches links that have an href attribute, /<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU. But I can't seem to create a pattern that matches the other combinations of attributes a link tag may have, such as these:

    <a id="Anchor_One" name="Anchor_One"> Anchor Details </a>
    <a name="Anchor_Three" > Anchor Three Details </a>

Links that do not have an href attribute are not picked up by my current pattern, so not all of the anchors can be retrieved.

    $regexp = '/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU';
    // Parse the page with the provided regular expression
    if (preg_match_all($regexp, $sessionBlock, $htmlMatches)) {
        // $htmlMatches[2] holds the href values, $htmlMatches[3] the link text
    }
classicjonesynz
  • possible duplicate of [How do you parse and process HTML/XML in PHP?](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) – hakre Aug 20 '13 at 22:52
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Aug 21 '13 at 02:24

2 Answers

Please, please, please don't use regex to parse HTML.

HTML isn't a regular language, so regular expressions can't parse it reliably: the patterns are extremely difficult to get right and quickly turn into a complete mess.

Have a look at these alternatives for parsing HTML in PHP.
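
As a rough illustration of one such alternative, here is a minimal sketch using PHP's built-in DOM extension on the three anchors from the question. The helper name `extractAnchors` and the shape of the returned array are my own assumptions for the example, not something from the question or a library.

```php
<?php
// Sample markup copied from the question; in practice this would be the page source.
$html = <<<'HTML'
<a id="Anchor_One" name="Anchor_One"> Anchor Details </a>
<a href="#Anchor_Two" name="Anchor_Two" > Anchor Two Details </a>
<a name="Anchor_Three" > Anchor Three Details </a>
HTML;

// Hypothetical helper: returns every <a> element with all of its attributes and its text.
function extractAnchors(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $anchors = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $attributes = [];
        // Each attribute node exposes its name and value, whether or not href is present.
        foreach ($a->attributes as $attr) {
            $attributes[$attr->name] = $attr->value;
        }
        $anchors[] = [
            'attributes' => $attributes,
            'text'       => trim($a->textContent),
        ];
    }

    return $anchors;
}

$anchors = extractAnchors($html);
```

Because the parser walks the elements rather than matching text, anchors with and without href are handled the same way, and attribute order or quoting style should not matter.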

James Williams
  • For your own sake, use regex as a last resort. I made the mistake of trying this and it was a disaster by the end. Try the DOM extension. – James Williams Aug 20 '13 at 22:39
  • I thought that the getAttribute method of DOMElement would be able to extract these: http://php.net/manual/en/domelement.getattribute.php – James Williams Aug 20 '13 at 22:48
  • @Killrawr: Dom is also great for HTML parsing. You can turn the warnings off (internal reporting), there is a recovery feature for broken HTML and what not. I don't see anything in your question that cannot be done with DOM with no problem. – hakre Aug 20 '13 at 22:48
  • Well, this has all been covered already in Q&A here on the site, so I see no use in repeating it in comments. I'm sure you'll manage; if not, ask a new question. – hakre Aug 20 '13 at 22:58
  • Which part of the link tags exactly do you want to match or capture? Just the anchor names? – hwnd Aug 20 '13 at 23:07
  • @JamesWilliams thanks! I was able to work something out with DOMDocument. – classicjonesynz Aug 21 '13 at 11:23
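
The comments above mention `getAttribute` and switching warnings off via internal error reporting. A minimal sketch of how those pieces might fit together for the anchors that have no href is shown here. The XPath expression and the variable names are assumptions chosen for this example; the asker's actual DOMDocument solution is not shown in the thread.

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<a id="Anchor_One" name="Anchor_One"> Anchor Details </a>
<a href="#Anchor_Two" name="Anchor_Two" > Anchor Two Details </a>
<a name="Anchor_Three" > Anchor Three Details </a>
HTML;

// Keep libxml warnings about imperfect HTML internal rather than printing them.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select only the <a> elements that carry no href attribute.
$xpath = new DOMXPath($doc);
$namesWithoutHref = [];
foreach ($xpath->query('//a[not(@href)]') as $a) {
    // getAttribute returns an empty string when the attribute is missing.
    $namesWithoutHref[] = $a->getAttribute('name');
}
```

Dropping the `[not(@href)]` predicate would select every anchor, and `hasAttribute('href')` could then be used to tell the two kinds apart inside the loop.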

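Try this pattern, which should capture the name attribute (when present) and the link text, whether or not href is set:

    "~<a(?=[^>]* name=[\"']([^'\"]*)|)(\s+[^>]*)?>(.*?)</a>~"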
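
For what it's worth, here is a small sketch of how that pattern could be tried against the three anchors from the question. The variable names and the use of `PREG_SET_ORDER` are my own choices for the example, and the pattern is only checked here against these simple single-line tags, not against arbitrary HTML.

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<a id="Anchor_One" name="Anchor_One"> Anchor Details </a>
<a href="#Anchor_Two" name="Anchor_Two" > Anchor Two Details </a>
<a name="Anchor_Three" > Anchor Three Details </a>
HTML;

// Pattern from this answer: group 1 is the name attribute (via the lookahead),
// group 2 is the raw attribute string, group 3 is the link text.
$pattern = "~<a(?=[^>]* name=[\"']([^'\"]*)|)(\s+[^>]*)?>(.*?)</a>~";

$found = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $found[] = [
            'name' => $match[1],
            'text' => trim($match[3]),
        ];
    }
}
```

Note that `.` does not match newlines without the `s` modifier, so link text spanning several lines would not be captured by the pattern as written.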

viki