
I'm trying to retrieve all the URLs from within the anchor tags. I have used the query `response.selector.xpath('//li[@class="active"]//a/@href').extract()` to extract all the URLs, but I only get a few of them.

The web page is structured as:

    <ul class="data">
        <li id="all" class="active">
            <a class="fit" href="#1"></a>
                <div class="1">
                    <a target="_blank" href="www.yahoo.com">
                </div>
                <div class="2">
                    <a target="_blank" href="www.google.com">
                </div>  
            <a class="fit" xmlns:listval="com.indiatimes.cms.utilities.CMSDateUtility" xmlns:java="java" href="#2"></a> 
                <div class="1">
                    <a target="_blank" href="www.facebook.com">
                </div>
                <div class="2">
                    <a target="_blank" href="www.bing.com">
                </div>  
            <a class="fit"  xmlns:listval="com.indiatimes.cms.utilities.CMSDateUtility" xmlns:java="java" href="#3"></a> 
                <div class="1">
                    <a target="_blank" href="www.amazon.com">
                </div>
                <div class="2">
                    <a target="_blank" href="www.flipkart.com">
                </div>  
            <a class="fit"  xmlns:listval="com.indiatimes.cms.utilities.CMSDateUtility" xmlns:java="java" href="#4"></a> 
                <div class="1">
                    <a target="_blank" href="www.snapdeal.com">
                </div>
                <div class="2">
                    <a target="_blank" href="www.infibeam.com">
                </div>          
        </li>
    </ul>

The above query fetches only "www.yahoo.com" and "www.google.com". What tweak do I need to make to get all the hrefs?
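
As a quick sanity check of the query itself, here is a minimal sketch (not part of the original question) that runs the same XPath on a trimmed, static copy of the markup above using parsel, the selector library Scrapy is built on:

    from parsel import Selector  # parsel is the selector library Scrapy uses internally

    # Trimmed, static copy of the markup shown above, just for illustration
    page_html = """
    <ul class="data">
      <li id="all" class="active">
        <a class="fit" href="#1"></a>
        <div class="1"><a target="_blank" href="www.yahoo.com"></a></div>
        <div class="2"><a target="_blank" href="www.google.com"></a></div>
      </li>
    </ul>
    """

    sel = Selector(text=page_html)
    print(sel.xpath('//li[@class="active"]//a/@href').extract())
    # ['#1', 'www.yahoo.com', 'www.google.com'] -- every href in the static snippet

On static HTML the query picks up every href, which matches VMRuiz's observation in the comments below that the XPath itself works.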

Virat
  • I tried your code and I got all the @href fields. Could you show your results? – VMRuiz Jun 21 '17 at 07:24
  • Well, I'm applying this on another web page with a similar structure, but I'm getting only the hrefs up to a certain point and not beyond that! – Virat Jun 21 '17 at 07:36
  • Please disable JavaScript in your browser and check whether the content you are looking for is still there. – VMRuiz Jun 21 '17 at 07:39
  • @VMRuiz I cannot find the content after disabling JavaScript. No wonder I was not getting those hrefs! How do I go about it now? (because it's more like scraping JS, not HTML) – Virat Jun 21 '17 at 07:46
  • You need to render the page using a browser, e.g. [Splash](https://github.com/scrapinghub/splash). – Tomáš Linhart Jun 21 '17 at 07:58
  • To add to @TomášLinhart's comment, you can instead reverse-engineer how the page populates that content using JavaScript. Quite often it's as simple as a single AJAX request; see this related question: https://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax – Granitosaurus Jun 21 '17 at 08:33
  • @Virat As Granitosaurus said, search for an AJAX call with that info, or use Selenium or Splash (personally, I prefer Selenium; a minimal Selenium sketch follows these comments). – VMRuiz Jun 21 '17 at 09:26
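
To make the Selenium suggestion concrete, here is a minimal sketch (an illustration, not code from the thread); it assumes a chromedriver is installed and that `url` is a placeholder for the real page. The browser renders the JavaScript, and the rendered source is fed back into parsel so the original XPath can be reused unchanged:

    from parsel import Selector
    from selenium import webdriver

    url = "http://example.com/page-with-js-generated-links"  # placeholder for the real page

    driver = webdriver.Chrome()      # any WebDriver works (Firefox, PhantomJS, ...)
    driver.get(url)                  # the browser executes the page's JavaScript
    rendered = driver.page_source    # HTML after the scripts have run
    driver.quit()

    hrefs = Selector(text=rendered).xpath('//li[@class="active"]//a/@href').extract()
    print(hrefs)

If you go the Splash route instead, the rendered HTML comes back as a normal Scrapy response, so the original `response.xpath(...)` call can stay as it is.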

1 Answer


Try a CSS selector instead of XPath:

    for link in response.css("li.active a"):
        link_id = link.css("::attr(href)").extract_first()
Umair Ayub
  • Well, I tried your code, and I'm getting **ExpressionError: The pseudo-class :attr() is unknown**, which I'm trying to figure out. Is it possible for you to suggest an XPath query instead? (An XPath variant is sketched below.) – Virat Jun 21 '17 at 07:38
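
For completeness, here is an XPath-only variant of the answer's loop (an editorial sketch, not part of the original answer). It sidesteps the `::attr()` pseudo-element, which older Scrapy/parsel versions do not recognise, and assumes it runs inside a spider callback where `response` is available:

    # Same iteration as the answer above, but using XPath throughout,
    # so it does not rely on the ::attr() CSS extension.
    for link in response.xpath('//li[@class="active"]//a'):
        href = link.xpath('@href').extract_first()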