I have a request that works normally in a regular browser, but not in the Scrapy shell. An entire HTML block vanishes as soon as I use "scrapy shell" or "scrapy crawl". I am definitely not banned.
Here is the issue I filed on GitHub (with pictures) before I was redirected here; it concerns the link below (a French property-auction website), which renders fine in a regular browser like Mozilla Firefox:
https://github.com/scrapy/scrapy/issues/2109
To make it short: I am trying to scrape an auction website. In a regular browser, all the data appears normally. But when I checked with the scrapy shell, an entire HTML block was missing from response.body:
scrapy shell http://www.licitor.com/ventes-judiciaires-immobilieres/tgi-fontainebleau/mercredi-15-juin-2016.html
The block is still missing even when I change my user agent by typing:
scrapy shell -s USER_AGENT='Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1' 'http...the rest of url'
I tried changing the user agent because I was told this could be either a header issue or a JavaScript one.
On top of that, this error message appears in my terminal:
[1:1:0710/114628:ERROR:PlatformKeyboardEvent.cpp(117)] Not implemented reached in static PlatformEvent::Modifiers blink::PlatformKeyboardEvent::getCurrentModifierState()
Just in case it matters: I had to add DOWNLOAD_HANDLERS = {'s3': None} to my settings in order to get rid of an error message.
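For reference, this is the settings.py fragment I mean; as I understand it, it simply tells Scrapy not to load the S3 download handler, which silences the boto-related error at startup:

```python
# settings.py
# Disable the S3 download handler so Scrapy stops raising the
# boto-related error at startup; nothing in this project uses S3.
DOWNLOAD_HANDLERS = {'s3': None}
```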
I am running Ubuntu 14 with Anaconda installed, and Scrapy 1.0.3.
What am I missing, please?
EDIT: To test the header hypothesis, I copied the same headers from my Mozilla browser (where the page works) into my scrapy shell. Here is my code:
from scrapy import Request

req = Request(
    'MY_URL',
    headers={
        'Accept': 'text/html, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36',
    },
)
fetch(req)
The HTML data is still missing.
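To confirm the block really is absent from the downloaded source (rather than present but hidden by CSS), I search the raw body for a string that is visible in the browser. A toy illustration of that check; the HTML and the marker string are made up, not the actual licitor.com markup:

```python
# Toy illustration (not the real licitor.com markup): text that the browser
# renders only after JavaScript runs never shows up in the raw HTML body
# that Scrapy downloads, so a plain substring check fails.
raw_body = '<div id="auctions"></div><script src="/fill-auctions.js"></script>'
marker = 'Adjudication'  # assumed: a string visible inside the missing block

present = marker in raw_body
print(present)  # False: the block's content was never in the served HTML
```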
Is it possible that JavaScript is preventing Scrapy from getting this block?
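If JavaScript is the cause, the data usually arrives through an XHR that is visible in the browser's Network tab, and that request can be replayed directly. A sketch using only Python's standard library; the endpoint path and the X-Requested-With header are assumptions on my part, and the real URL would have to be copied from the Network tab:

```python
import urllib.request

# Hypothetical XHR replay: '/path/to/data.ajax' is a placeholder, NOT a
# real licitor.com endpoint. Copy the actual request URL from the
# browser's developer tools (Network tab, XHR filter).
req = urllib.request.Request(
    'http://www.licitor.com/path/to/data.ajax',
    headers={'X-Requested-With': 'XMLHttpRequest'},  # many sites check this
)
# urllib.request.urlopen(req).read() would then return the endpoint's payload
print(req.full_url)
```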
EDIT: I also installed scrapy-splash (with its Docker prerequisite) and then tried to handle the issue through the Splash server. Still the same problem! Here is my code:
$ scrapy shell

from scrapy_splash import SplashRequest

url = 'http://www.licitor.com/ventes-judiciaires-immobilieres/tgi-paris/jeudi-7-juillet-2016.html'
req = SplashRequest(
    url,
    args={'wait': 0.5},
    headers={
        'Accept': 'text/html, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36',
    },
)
fetch(req)
view(response)
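One thing I am not sure about: as far as I understand the scrapy-splash README, a SplashRequest is only actually routed through the Splash server when its middlewares are enabled in the project settings, so a bare scrapy shell may have sent it as an ordinary request. These are the settings the README asks for (SPLASH_URL assumes the default Docker port):

```python
# settings.py additions from the scrapy-splash README.
SPLASH_URL = 'http://localhost:8050'  # where the Splash Docker container listens

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```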
So in summary, this is what I did:
- I changed my headers to match those of my Mozilla browser (where the page works)
- I installed Splash and tried to use it to handle JavaScript