
I have a request that works normally in regular browsers but not in the Scrapy shell. An entire HTML block vanishes as soon as I use "scrapy shell" or "scrapy crawl". I am sure I am not banned.

Below is the GitHub issue I opened (with pictures) before being redirected here. It concerns the link below (a French property-auction website), which displays fine in a regular browser like Firefox:

https://github.com/scrapy/scrapy/issues/2109

To make it short: I am trying to scrape an auction website. In a regular browser all the data appears normally, but when I check with the Scrapy shell, an entire HTML block is missing from response.body:

scrapy shell http://www.licitor.com/ventes-judiciaires-immobilieres/tgi-fontainebleau/mercredi-15-juin-2016.html

Nothing changes even when I set the user agent by typing:

scrapy shell -s USER_AGENT='Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1'   'http...the rest of url'

I tried changing the user agent because I was told this could be a header issue, or possibly a JavaScript one.

Also, my terminal prints this error message:

[1:1:0710/114628:ERROR:PlatformKeyboardEvent.cpp(117)] Not implemented reached in static PlatformEvent::Modifiers blink::PlatformKeyboardEvent::getCurrentModifierState()

Just in case it matters: I had to add DOWNLOAD_HANDLERS: {'s3': None} to my settings in order to get rid of an error message.
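For reference, that setting looks like this in settings.py (it disables Scrapy's built-in S3 download handler, which is not needed here):

```python
# settings.py
# Disable the built-in S3 download handler; this silences the
# startup error mentioned above (no S3 downloads are needed).
DOWNLOAD_HANDLERS = {'s3': None}
```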

I am running Ubuntu 14, with Anaconda installed and Scrapy 1.0.3.

What am I missing, please?


EDIT: To rule out the header theory, I copy-pasted the headers from my browser (where the page works fine) into my Scrapy shell session. Here is my code:

from scrapy import Request

req = Request(
    'MY_URL',
    headers={
        'Accept': 'text/html, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36',
    })

fetch(req)

The HTML data is still missing.

Is it possible that JavaScript is preventing Scrapy from seeing this data?
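One quick way to check for this (a sketch; the helper name and marker string are made up for illustration) is to test whether some text the browser displays is actually present in the raw downloaded body:

```python
def block_present(body, marker):
    """Return True if the marker text appears in the raw HTML body.

    If the browser shows the marker but this returns False for the
    body Scrapy downloaded, the block is injected client-side by
    JavaScript rather than served in the initial HTML.
    """
    if isinstance(body, bytes):
        body = body.decode('utf-8', errors='replace')
    return marker in body

# e.g. in the scrapy shell, after fetch(...):
# block_present(response.body, 'www.dbcj-avocats.com')
```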


EDIT: I also installed scrapy-splash along with its Docker prerequisite, and then tried to work around the issue through the Splash server.

Still the same problem! Here is my code:

$ scrapy shell

from scrapy import Request
from scrapy_splash import SplashRequest
url='http://www.licitor.com/ventes-judiciaires-immobilieres/tgi-paris/jeudi-7-juillet-2016.html'
req = SplashRequest(
    url,
    args={'wait': 0.5},
    headers={
        'Accept': 'text/html, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36',
    })

fetch(req)
view(response)

So, in summary, this is what I did:

  • I set my headers to match my browser's (which works)
  • I installed Splash and tried to use it to handle the JavaScript
Evan Porter
M. Mayouf

2 Answers


This is a JavaScript issue.

The section of the page that doesn't get loaded is filled in dynamically by an AJAX request.

Since Scrapy doesn't execute any JavaScript by default, the AJAX request never fires and that block of the page stays empty in the response.

You can handle this in Scrapy using Splash.

Here's the code for a working spider that loads the page properly:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser
from scrapy_splash import SplashRequest


class LicitorSpider(scrapy.Spider):
    name = "licitor"
    allowed_domains = ["licitor.com"]
    start_urls = (
        'http://www.licitor.com/',
    )

    def parse(self, response):
        url = 'http://www.licitor.com/ventes-judiciaires-immobilieres/tgi-fontainebleau/mercredi-15-juin-2016.html'
        yield SplashRequest(url=url, callback=self.parse_item, args={'wait': 0.5})

    def parse_item(self, response):
        open_in_browser(response)
        assert b"www.dbcj-avocats.com" in response.body, "XHR request not loaded"
        inspect_response(response, self)

Make sure you have the Splash Docker instance running before you run the spider, and add the following settings to your project's settings.py file:

SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
BoreBoar

If you view the actual HTML source from that page (view-source:http://www.licitor.com/ventes-judiciaires-immobilieres/tgi-fontainebleau/mercredi-15-juin-2016.html) you won't see the data you circled in the GitHub issue.

If you inspect your browser's network tab when loading http://www.licitor.com/ventes-judiciaires-immobilieres/tgi-fontainebleau/mercredi-15-juin-2016.html, you'll notice an XHR request to http://www.licitor.com/annonce/06/20/24/vente-aux-encheres/un-appartement/avon/seine-et-marne/062024.html

If you fetch this page with scrapy, you'll get the data you want.

The links to the ads are in a <ul>:

<div class="Container">
        <ul class="AdResults">
        <li>
        <a class="Ad Archives First" href="/annonce/06/20/24/vente-aux-encheres/un-appartement/avon/seine-et-marne/062024.html"
            title="Un appartement, Avon, Seine-et-Marne, adjudication : 101 000 €">
...

See this scrapy shell session:

$ scrapy shell http://www.licitor.com/ventes-judiciaires-immobilieres/tgi-fontainebleau/mercredi-15-juin-2016.html
2016-07-10 20:08:35 [scrapy] INFO: Scrapy 1.0.6 started (bot: scrapybot)
(...)
2016-07-10 20:08:36 [scrapy] DEBUG: Crawled (200) <GET http://www.licitor.com/ventes-judiciaires-immobilieres/tgi-fontainebleau/mercredi-15-juin-2016.html> (referer: None)
(...)    
In [1]: for link in response.css('ul.AdResults > li > a'):
   ...:     print(link.xpath('@title').extract_first(), response.urljoin(link.xpath('@href').extract_first()))
   ...:
(u'Un appartement, Avon, Seine-et-Marne, adjudication : 101 000 \u20ac', u'http://www.licitor.com/annonce/06/20/24/vente-aux-encheres/un-appartement/avon/seine-et-marne/062024.html')
(u"Une maison d'habitation, Montereau-Fault-Yonne (Seine-et-Marne), Seine-et-Marne, adjudication : 95 500 \u20ac", u'http://www.licitor.com/annonce/06/22/90/vente-aux-encheres/une-maison-d-habitation/montereau-fault-yonne-seine-et-marne/seine-et-marne/062290.html')
(u"Une maison d'habitation, Chevry-en-Sereine (Seine-et-Marne), Seine-et-Marne, adjudication : 48 000 \u20ac", u'http://www.licitor.com/annonce/06/22/91/vente-aux-encheres/une-maison-d-habitation/chevry-en-sereine-seine-et-marne/seine-et-marne/062291.html')

Fetching the page and collecting what's in <div class="AdContent" id="ad-062024"> shows the data that a browser displays:

In [2]: fetch('http://www.licitor.com/annonce/06/20/24/vente-aux-encheres/un-appartement/avon/seine-et-marne/062024.html')
2016-07-10 20:11:25 [scrapy] DEBUG: Crawled (200) <GET http://www.licitor.com/annonce/06/20/24/vente-aux-encheres/un-appartement/avon/seine-et-marne/062024.html> (referer: None)
(...)
In [3]: print(response.css('div.AdContent').xpath('normalize-space()').extract_first())
Annonce publiée le 27 avril 2016 62024 Tribunal de Grande Instance de Fontainebleau (Seine et Marne) Vente aux enchères publiques sur licitation en un lot mercredi 15 juin 2016 à 14h Un appartement Une cave Deux boxes en sous-solCadastré section A n°142, 150, 1.016, 1.017 et 1.075, lots n°132, 214, 240 et 242Le bien est occupé Adjudication : 101 000 € (Mise à prix : 100 000 €) Avon Résidence Les Jardins de Changis29 - 35, rue des Yèbles (exactitude non garantie) SCP Dumont, Bortolotti, Combes, Junguenet, Avocats 149, rue Grande - 77300 FontainebleauTél.: 01 60 71 57 11 www.dbcj-avocats.com Ferrari & Cie - Réf. A16/0239
paul trmbrth