
I'm trying to retrieve all the URLs from within the anchor tags. I have used the query `response.selector.xpath('//li[@class="active"]//a/@href').extract()` to extract all the URLs, but I only get a few of them.

The web page is structured as:

    <ul class="data">
        <li id="all" class="active">
            <a class="fit" href="#1"></a>
                <div class="1">
                    <a target="_blank" href="www.yahoo.com">
                </div>
                <div class="2">
                    <a target="_blank" href="www.google.com">
                </div>  
            <a class="fit" xmlns:listval="com.indiatimes.cms.utilities.CMSDateUtility" xmlns:java="java" href="#2"></a> 
                <div class="1">
                    <a target="_blank" href="www.facebook.com">
                </div>
                <div class="2">
                    <a target="_blank" href="www.bing.com">
                </div>  
            <a class="fit"  xmlns:listval="com.indiatimes.cms.utilities.CMSDateUtility" xmlns:java="java" href="#3"></a> 
                <div class="1">
                    <a target="_blank" href="www.amazon.com">
                </div>
                <div class="2">
                    <a target="_blank" href="www.flipkart.com">
                </div>  
            <a class="fit"  xmlns:listval="com.indiatimes.cms.utilities.CMSDateUtility" xmlns:java="java" href="#4"></a> 
                <div class="1">
                    <a target="_blank" href="www.snapdeal.com">
                </div>
                <div class="2">
                    <a target="_blank" href="www.infibeam.com">
                </div>          
        </li>
    </ul>

The above query fetches only "www.yahoo.com" and "www.google.com". What tweak do I need to make to get all the hrefs?
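
As a quick sanity check of the query itself, here is a minimal sketch (not part of the original question) that runs the same XPath on a trimmed, static copy of the markup above using parsel, the selector library Scrapy is built on:

    from parsel import Selector  # parsel is the selector library Scrapy uses internally

    # Trimmed, static copy of the markup shown above, just for illustration
    page_html = """
    <ul class="data">
      <li id="all" class="active">
        <a class="fit" href="#1"></a>
        <div class="1"><a target="_blank" href="www.yahoo.com"></a></div>
        <div class="2"><a target="_blank" href="www.google.com"></a></div>
      </li>
    </ul>
    """

    sel = Selector(text=page_html)
    print(sel.xpath('//li[@class="active"]//a/@href').extract())
    # ['#1', 'www.yahoo.com', 'www.google.com'] -- every href in the static snippet

On static HTML the query picks up every href, which matches VMRuiz's observation in the comments below that the XPath itself works.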

Virat
  • I tried your code and I got all the @href fields. Could you show your results? – VMRuiz Jun 21 '17 at 07:24
  • Well, I'm applying this on another web page with a similar structure, but I'm getting only the hrefs up to a certain point and not beyond that! – Virat Jun 21 '17 at 07:36
  • Please disable JavaScript in your browser and check whether the content you are looking for is still there. – VMRuiz Jun 21 '17 at 07:39
  • @VMRuiz I cannot find the content after disabling JavaScript. No wonder I was not getting those hrefs! How do I go about it now? (because it's more like scraping JS, not HTML) – Virat Jun 21 '17 at 07:46
  • You need to render the page using a browser, e.g. [Splash](https://github.com/scrapinghub/splash). – Tomáš Linhart Jun 21 '17 at 07:58
  • To add to @TomášLinhart's comment, you can instead reverse-engineer how the page populates that content using JavaScript. Quite often it's as simple as a single AJAX request; see this related question: https://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax – Granitosaurus Jun 21 '17 at 08:33
  • @Virat As Granitosaurus said, search for an AJAX call with that info, or use Selenium or Splash (personally, I prefer Selenium; a minimal Selenium sketch follows these comments). – VMRuiz Jun 21 '17 at 09:26
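
To make the Selenium suggestion concrete, here is a minimal sketch (an illustration, not code from the thread); it assumes a chromedriver is installed and that `url` is a placeholder for the real page. The browser renders the JavaScript, and the rendered source is fed back into parsel so the original XPath can be reused unchanged:

    from parsel import Selector
    from selenium import webdriver

    url = "http://example.com/page-with-js-generated-links"  # placeholder for the real page

    driver = webdriver.Chrome()      # any WebDriver works (Firefox, PhantomJS, ...)
    driver.get(url)                  # the browser executes the page's JavaScript
    rendered = driver.page_source    # HTML after the scripts have run
    driver.quit()

    hrefs = Selector(text=rendered).xpath('//li[@class="active"]//a/@href').extract()
    print(hrefs)

If you go the Splash route instead, the rendered HTML comes back as a normal Scrapy response, so the original `response.xpath(...)` call can stay as it is.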

1 Answer


Try a CSS selector instead of XPath:

    for link in response.css("li.active a"):
        link_id = link.css("::attr(href)").extract_first()
Umair Ayub
  • Well, I tried your code, and I'm getting **ExpressionError: The pseudo-class :attr() is unknown**, which I'm trying to figure out. Is it possible for you to suggest an XPath query instead? (An XPath variant is sketched below.) – Virat Jun 21 '17 at 07:38
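
For completeness, here is an XPath-only variant of the answer's loop (an editorial sketch, not part of the original answer). It sidesteps the `::attr()` pseudo-element, which older Scrapy/parsel versions do not recognise, and assumes it runs inside a spider callback where `response` is available:

    # Same iteration as the answer above, but using XPath throughout,
    # so it does not rely on the ::attr() CSS extension.
    for link in response.xpath('//li[@class="active"]//a'):
        href = link.xpath('@href').extract_first()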