0

I'm trying to get the href value from this HTML:

<a class="list-item clearfix" href="/en/rolex/submariner-date--id2334149.htm" id="watch-2334149" style="background-color: rgb(255, 255, 255);">

      <span onclick="_gaq.push(['first._trackEvent','Click','search','watch-image-click']);_gaq.push(['second._trackEvent','Click','search','watch-image-click']);" class="pic ">
        <span style="position:absolute">

          <img width="100" height="100" alt="Rolex Submariner Date" src="" class="photo">
        </span>
      </span>

  <span class="disc">
    <span onclick="_gaq.push(['first._trackEvent','Click','search','watch-headline-click']);_gaq.push(['second._trackEvent','Click','search','watch-headline-click']);" class="watch-headline"><span class="underline">Rolex Submariner Date</span></span>

        <span class="spec">


          <span onmouseover="$('#infobox-title').text('Germany');$('#infobox-text').text('This dealer is from Augsburg, Germany.')" style="width: 21px;" class="flag">

          <img width="16" height="16" alt="" src="http://cdn.chrono24.com/images/flags-icons/DE.png">&nbsp;
            </span>
            <span class="icon i-hasnostore"></span>
                    <span onmouseover="$('#infobox-title').text('Trusted Seller since 2004');$('#infobox-text').text('We have no knowledge about pending/unsolved disputes or complaints about this seller.')" class="icon i-trusted"></span>

                        <span onmouseover="$('#infobox-title').text('Retailer recommendations');$('#infobox-text').text('This watch retailer is recommended on Chrono24 by 1 other watch retailers.')" class="i-buddies">
                          <span class="icon buddie-count">1</span>
                          <span class="icon i-star-blue"></span>
                        </span>


              <span onmouseover="$('#infobox-title').text('Trusted Seller since 2004');$('#infobox-text').text('We have no knowledge about pending/unsolved disputes or complaints about this seller.')" class="trustedseller">
                    <script type="text/javascript">
                        // &lt;![CDATA[
                        document.write('Trusted Seller since 2004');
                        // ]]&gt;
                    </script>Trusted Seller since 2004
                  </span>    


                  <span style="width: 2px;" class="icon"></span>
                  <span onmouseover="$('#infobox-title').text('Premium Seller');$('#infobox-text').text('The Chrono24 Premium Seller Package is only available for Trusted Sellers who frequently use Chrono24.')" class="icon i-premium"></span>
                <span onmouseover="$('#infobox-title').text('Premium Seller');$('#infobox-text').text('The Chrono24 Premium Seller Package is only available for Trusted Sellers who frequently use Chrono24.')" class="premiumseller">Premium</span>

            </span>
            <span onclick="_gaq.push(['first._trackEvent','Click','search','watch-desc-click']);_gaq.push(['second._trackEvent','Click','search','watch-desc-click']);" class="description">
              Ref. No. 116610 LN; Steel; Automatic; Condition 0 (unworn); Year 2013; With Box; With Papers; Location: Germany, Augsburg; The current, the manufacturer's recommended retail price is 6800 Euro
            </span>


              <span class="availability">Availability: Available immediately</span>



  </span>
  <span class="pricebox">
    <span onclick="_gaq.push(['first._trackEvent','Click','search','watch-price-click']);_gaq.push(['second._trackEvent','Click','search','watch-price-click']);" class="amount price"><span class="large">$&nbsp;7,961</span>
    </span>

    <span class="buttonbox">
      <span onclick="_gaq.push(['first._trackEvent','Click','search','watch-button-click']);_gaq.push(['second._trackEvent','Click','search','watch-button-click']);" class="button-blue">
         <span>
          Watch details
         </span>
      </span>
    </span>


  </span>             

</a>
preg_match_all('#<a href="(.+)">#',$html,$urlarr);

This doesn't return the href value at all, and I don't know what is going wrong.

Ravi Soni
  • I don't understand why you think that should give you the href value. – Quentin Sep 12 '13 at 14:39
  • possible duplicate of [extract image src from text?](http://stackoverflow.com/questions/11440277/extract-image-src-from-text) – undone Sep 12 '13 at 14:41
  • @undone Using tidy just adds overhead. I have used regex before and I know about parsing libraries; I'm already using one of them, but I'm having an issue with one page, so I was only looking for a regex. – Ravi Soni Sep 12 '13 at 14:51
  • @Quentin now check the RegEx. – Ravi Soni Sep 12 '13 at 15:02
  • Your links have attributes in them other than href. Don't use regex for this. – Quentin Sep 12 '13 at 15:02

4 Answers

2

Don't use Regular Expressions on HTML; HTML is not regular!
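
To make that concrete, here is a small sketch of how the pattern from the question behaves. The first input is the opening tag from the question; the second input is a made-up one-line example, not taken from the real page. The pattern expects href to follow `<a ` immediately, and its greedy `.+` runs past the first closing quote.

```php
<?php
// Pattern copied from the question.
$pattern = '#<a href="(.+)">#';

// Opening tag from the question: class comes before href, so the
// literal text `<a href="` never appears and nothing matches.
$tag = '<a class="list-item clearfix" href="/en/rolex/submariner-date--id2334149.htm" id="watch-2334149" style="background-color: rgb(255, 255, 255);">';
$count = preg_match_all($pattern, $tag, $m);
var_dump($count); // int(0)

// Hypothetical input where href does come first: the greedy (.+)
// swallows everything up to the LAST `">` on the line.
$two = '<a href="/a">A</a> <a href="/b">B</a>';
$count2 = preg_match_all($pattern, $two, $m2);
var_dump($m2[1][0]); // string(22) "/a">A</a> <a href="/b"
```

A DOM parser sidesteps both problems because it does not depend on attribute order or on where quotes happen to fall.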

You should take a look at SimpleXML and XPath; they are the perfect tools for the job: http://php.net/manual/en/simplexmlelement.xpath.php

E.g.:

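// SimpleXML needs well-formed XML, and this HTML has unclosed <img> tags and
// &nbsp; entities, so parse it as HTML first and then convert.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xml = simplexml_import_dom($dom);

// Select all "a" tags with href attributes
$links = $xml->xpath("//a[@href]");
// You probably want the first one
$href = (string) $links[0]["href"];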
BenLanc
1

You should use DOMDocument instead of a regex:

$dom = new DOMDocument;
// Parser options only take effect if they are set before loading
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$link  = $dom->getElementsByTagName("a");
$links = array();
for ($i = 0; $i < $link->length; $i++) {
    $links[] = $link->item($i)->getAttribute("href");
}
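
The loop above can be wrapped in a small function. This is only a sketch: `extractHrefs` is a hypothetical name, the sample markup is a trimmed-down version of the HTML from the question, and the type declarations assume PHP 7 or later. It uses `libxml_use_internal_errors()` so that warnings about imperfect markup are collected rather than printed.

```php
<?php
// Hypothetical helper (not from the original answer): collect every href
// found on <a> elements in an HTML string.
function extractHrefs(string $html): array
{
    // Real-world HTML is rarely valid; buffer libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the buffered warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // getAttribute() returns '' for a missing attribute, so check first.
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}

// Trimmed-down version of the markup from the question.
$sample = '<a class="list-item clearfix" href="/en/rolex/submariner-date--id2334149.htm" id="watch-2334149">'
        . '<span class="underline">Rolex Submariner Date</span></a>';

print_r(extractHrefs($sample));
```

Skipping anchors without an href keeps empty strings out of the result, which the plain loop would otherwise include.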
undone
1

All of the DOM-based methods suggested here should work. If you want to use a regex, you can try this:

preg_match_all('~<a (?>[^>h]++|\Bh|h(?!ref\b))*href\s*=\s*["\']?\K[^"\'>\s]++~i', $html, $matches);

If you want to match only the href of `a` tags whose class attribute value is list-item clearfix, you can do this:

$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<class> \b class \s* = \s* (["']) list-item \s+ clearfix \g{-1} )
    (?<href_value> [^"'\s>]++ )
    (?<href_start> \b href \s*=\s* ["']? )
    (?<href_end> ['"\s] )
    (?<content> (?> [^>hc]++ | \B[hc] | h(?!ref\b) | c(?!lass\b) )* )

)
    <a \s+
    \g<content>
    (?J)
    (?>
        \g<class> \g<content> \g<href_start> (?<href> \g<href_value> )
      |
        \g<href_start> (?<href> \g<href_value> ) \g<href_end> \g<content> \g<class>
    )
~xi
LOD;

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER); 

foreach($matches as $match) {
    echo '<br>' . $match['href'];
}

Keep in mind that this is much easier to do with XPath:

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$hrefs = $xpath->query("//a[@class='list-item clearfix']/@href");
foreach($hrefs as $href) {
    print_r($href->nodeValue);
}
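
One caveat with the query above: `@class='list-item clearfix'` compares the whole attribute string, so it would miss an element whose classes appear in another order or alongside extra classes. A commonly used XPath 1.0 idiom tests each class name as a whitespace-delimited token instead. The sketch below assumes PHP 7 or later; `hrefsByClass` is a hypothetical name, and the class names are assumed to be trusted, quote-free strings.

```php
<?php
// Hypothetical helper: hrefs of <a> elements that carry ALL of the given classes,
// regardless of order or of any additional classes.
function hrefsByClass(string $html, array $classes): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Build one token test per class name, e.g.
    // contains(concat(' ', normalize-space(@class), ' '), ' clearfix ')
    $conditions = [];
    foreach ($classes as $class) {
        $conditions[] = "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
    }
    $query = '//a[' . implode(' and ', $conditions) . ']/@href';

    $xpath = new DOMXPath($doc);
    $hrefs = [];
    foreach ($xpath->query($query) as $attr) {
        $hrefs[] = $attr->nodeValue;
    }
    return $hrefs;
}
```

For example, `hrefsByClass($html, ['list-item', 'clearfix'])` would return the href of the anchor in the question and of any other anchor that has both classes.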
Casimir et Hippolyte
  • This worked, but what do I need to change to match only `class="list-item clearfix"`? Thanks – Ravi Soni Sep 12 '13 at 15:09
  • You should use XPath. It is going to be much more precise and much simpler than trying to do this with a regular expression. It will also minimize the lines of code needed to accomplish whatever you are trying to do, which means a faster load time on the page. – Malachi Sep 12 '13 at 15:33
0

It's a bad idea to use regular expressions to parse HTML (at least, in this case). Use a DOM parser such as Simple HTML DOM for this purpose:

It's as easy as:

$html = str_get_html('...');
foreach($html->find('a') as $element) 
    echo $element->href;

Alternatively, you can load it from a file as well:

$html = file_get_html('...');
foreach($html->find('a') as $element) 
    echo $element->href;

This is also possible with the built-in DOM:

$dom = new DOMDocument();
$dom->loadHTML($html);

// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a"); //all <a> tags
$urlArray = array();

for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $urlArray[] = $href->getAttribute('href');
}

Amal Murali